Greenplum Platform Extension Framework (PXF)
Version 5.11.2
With the explosion of data stores and cloud services, data now resides across many disparate
systems and in a variety of formats. Often, data is classified both by its location and the
operations performed on the data, as well as how often the data is accessed: real-time or
transactional (hot), less frequent (warm), or archival (cold).
The diagram below describes a data source that tracks monthly sales across many years. Real-
time operational data is stored in MySQL. Data subject to analytic and business intelligence
operations is stored in Greenplum Database. The rarely accessed, archival data resides in AWS
S3.
[Diagram: monthly sales data distributed across MySQL (hot, operational), Greenplum Database (warm, analytic), and AWS S3 (cold, archival).]
When multiple, related data sets exist in external systems, it is often more efficient to join data
sets remotely and return only the results, rather than negotiate the time and storage
requirements of performing a rather expensive full data load operation. The Greenplum Platform
Extension Framework (PXF), a Greenplum extension that provides parallel, high throughput data
access and federated query processing, provides this capability.
With PXF, you can use Greenplum Database and SQL to query data in these heterogeneous formats:
Avro, AvroSequenceFile
JSON
ORC
Parquet
RCFile
SequenceFile
Text (plain, delimited, embedded line feeds)
Basic Usage
You use PXF to map data from an external source to a Greenplum Database external table
definition. You can then use the PXF external table and SQL to:
Perform queries on the external data, leaving the referenced data in place on the remote
system.
Load a subset of the external data into Greenplum Database.
Check out the PXF introduction for a high-level overview of important PXF concepts.
See Accessing Hadoop with PXF when the data resides in Hadoop.
See Accessing Azure, Google Cloud Storage, Minio, and S3 Object Stores with PXF when
the data resides in an object store.
See Accessing an SQL Database with PXF when the data resides in an external SQL
database.
Get Started Configuring PXF
Get Started Using PXF
Introduction to PXF
You can query the external table via Greenplum Database, leaving the referenced data in place.
Or, you can use the external table to load the data into Greenplum Database for higher
performance.
Supported Platforms
PXF bundles all of the Hadoop JAR files on which it depends, and supports the following Hadoop
component versions:
| PXF Version | Hadoop Version | Hive Server Version | HBase Server Version |
|---|---|---|---|
| 5.10, 5.11 | 2.x, 3.1+ | 1.x, 2.x, 3.1+ | 1.3.2 |
| 5.9 | 2.x, 3.1+ | 1.x, 2.x, 3.1+ | 1.3.2 |
| 5.8 | 2.x | 1.x | 1.3.2 |
Architectural Overview
Your Greenplum Database deployment consists of a master node and multiple segment hosts. A
single PXF agent process on each Greenplum Database segment host allocates a worker thread
for each segment instance on a segment host that participates in a query against an external
table. The PXF agents on multiple segment hosts communicate with the external data store in
parallel.
About Connectors, Servers, and Profiles
A PXF Server is a named configuration for a connector. A server definition provides the
information required for PXF to access an external data source. This configuration information is
data-store-specific, and may include server location, access credentials, and other relevant
properties.
The Greenplum Database administrator will configure at least one server definition for each
external data store that they will allow Greenplum Database users to access, and will publish the
available server names as appropriate.
You specify a SERVER=<server_name> setting when you create the external table to identify the
server configuration from which to obtain the configuration and credentials to access the external
data store.
Finally, a PXF profile is a named mapping identifying a specific data format or protocol supported
by a specific external data store. PXF supports text, Avro, JSON, RCFile, Parquet, SequenceFile,
and ORC data formats, and the JDBC protocol, and provides several built-in profiles as discussed
in the following section.
Creating an External Table
The LOCATION clause in a CREATE EXTERNAL TABLE statement specifying the pxf protocol is a URI.
This URI identifies the path to, or other information describing, the location of the external data.
For example, if the external data store is HDFS, the <path-to-data> identifies the absolute path to
a specific HDFS file. If the external data store is Hive, <path-to-data> identifies a schema-qualified
Hive table name.
You use the query portion of the URI, introduced by the question mark (?), to identify the PXF
server and profile names.
PXF may require additional information to read or write certain data formats. You provide profile-
specific information using the optional <custom-option>=<value> component of the LOCATION
string and formatting information via the <formatting-properties> component of the string. The
custom options and formatting properties supported by a specific profile vary; they are identified in
usage documentation for the profile.
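Putting these pieces together, a pxf external table definition takes this general form (a sketch consistent with the examples later in this documentation; angle-bracketed items are placeholders):

CREATE [WRITABLE] EXTERNAL TABLE <table_name>
    ( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<path-to-data>?PROFILE=<profile_name>[&SERVER=<server_name>][&<custom-option>=<value>[...]]')
FORMAT '[TEXT|CSV|CUSTOM]' (<formatting-properties>);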
About PXF Filter Pushdown
PXF supports filter pushdown. When filter pushdown is enabled, the constraints from the WHERE
clause of a SELECT query can be extracted and passed to the external data source for filtering.
You enable or disable filter pushdown for all external table protocols, including pxf, by setting the
gp_external_enable_filter_pushdown server configuration parameter. The default value of this
configuration parameter is on; set it to off to disable filter pushdown. For example:
SHOW gp_external_enable_filter_pushdown;
SET gp_external_enable_filter_pushdown TO 'on';
Note: Some external data sources do not support filter pushdown. Also, filter pushdown may
not be supported with certain data types or operators. If a query accesses a data source that
does not support filter push-down for the query constraints, the query is instead executed without
filter pushdown (the data is filtered after it is transferred to Greenplum Database).
The data types and the comparison and logical operators with which you can use PXF filter
pushdown are connector- and profile-specific. PXF accesses data sources using profiles exposed
by different connectors, and filter pushdown support is determined by the specific connector
implementation. The following PXF profiles support some aspect of filter pushdown for the
operators shown:
| Profile | <, >, <=, >=, =, <> | LIKE | IS [NOT] NULL | IN | AND | OR | NOT |
|---|---|---|---|---|---|---|---|
| Jdbc | Y | Y | Y | Y | Y | Y | Y |
| *:parquet | Y1 | N | Y1 | N | Y1 | Y1 | Y1 |
| s3:parquet and s3:text with S3-Select | Y | N | Y | Y | Y | Y | Y |
| HBase | Y | N | Y | N | Y | Y | N |
| Hive | Y2 | N | N | N | Y2 | Y2 | N |
| HiveText | Y2 | N | N | N | Y2 | Y2 | N |
| HiveRC | Y2 | N | N | N | Y2 | Y2 | N |
PXF does not support filter pushdown for any profile not mentioned in the table above, including:
*:avro, *:AvroSequenceFile, *:SequenceFile, *:json, *:text, and *:text:multi.
To summarize, all of the following criteria must be met for filter pushdown to occur:
The gp_external_enable_filter_pushdown server configuration parameter is set to 'on'.
The external data source and the query constraints (data types and operators) support filter pushdown.
For queries on external tables that you create with the pxf protocol, the underlying PXF
connector must also support filter pushdown. For example, the PXF Hive, HBase, and
JDBC connectors support pushdown.
Refer to Hive Partition Filter Pushdown for more information about Hive support for
this feature.
About Column Projection in PXF
PXF supports column projection, and it is always enabled. With column projection, only the
columns required by a SELECT query on an external table are returned from the external data source.
Note: Some external data sources do not support column projection. If a query accesses a data
source that does not support column projection, the query is instead executed without it, and the
data is filtered after it is transferred to Greenplum Database.
Column projection is automatically enabled for the pxf external table protocol. PXF accesses
external data sources using different connectors, and column projection support is also
determined by the specific connector and profile implementation.
Note: PXF may disable column projection in cases where it cannot successfully serialize a
query filter; for example, when the WHERE clause resolves to a boolean type.
To summarize, all of the following criteria must be met for column projection to occur:
The external data source that you are accessing must support column projection. For
example, Hive supports column projection for ORC-format data, and certain SQL
databases support column projection.
The underlying PXF connector and profile implementation must also support column
projection. For example, the PXF Hive and JDBC connector profiles support column
projection, as do the PXF connectors that support reading Parquet data.
PXF must be able to serialize the query filter.
Configuring PXF
Your Greenplum Database deployment consists of a master node and multiple segment hosts.
PXF provides connectors to Hadoop, Hive, HBase, object stores, and external SQL data stores.
You must configure PXF to support the connectors that you plan to use.
When you initialize and configure the Greenplum Platform Extension Framework (PXF), you perform the following tasks:
1. Install Java on the master, standby master, and each segment host as described in Installing Java for PXF.
2. Initialize PXF as described in Initializing PXF.
3. If you plan to use the Hadoop, Hive, or HBase PXF connectors, you must perform the
configuration procedure described in Configuring PXF Hadoop Connectors.
4. If you plan to use the PXF connectors to access the Azure, Google Cloud Storage, Minio,
or S3 object store(s), you must perform the configuration procedure described in
Configuring Connectors to Azure, Google Cloud Storage, Minio, and S3 Object Stores.
5. If you plan to use the PXF JDBC Connector to access an external SQL database, perform
the configuration procedure described in Configuring the JDBC Connector.
6. Start PXF.
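For example, assuming the pxf cluster management commands referenced throughout this documentation, you might start PXF on all hosts from the master with:

gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster start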
About the PXF Installation and Configuration Directories
PXF is installed on your master and segment nodes when you install Greenplum Database.
The following PXF directories are installed under $GPHOME/pxf:

| Directory | Description |
|---|---|
| apache-tomcat/ | The PXF Tomcat directory. |
| bin/ | The PXF script and executable directory. |
| conf/ | The PXF internal configuration directory. This directory contains the pxf-env-default.sh and pxf-profiles-default.xml configuration files. After initializing PXF, this directory will also include the pxf-private.classpath file. |
| lib/ | The PXF library directory. |
| templates/ | Configuration templates for PXF. |

PXF creates these additional directories under $GPHOME/pxf when you initialize and start PXF:

| Directory | Description |
|---|---|
| pxf-service/ | After initializing PXF, the PXF service instance directory. |
| run/ | After starting PXF, the PXF run directory. Includes a PXF catalina process id file. |

The $PXF_CONF user configuration directory includes these subdirectories:

| Directory | Description |
|---|---|
| conf/ | The location of user-customizable PXF configuration files: pxf-env.sh, pxf-log4j.properties, and pxf-profiles.xml. |
| keytabs/ | The default location for the PXF service Kerberos principal keytab file. |
| lib/ | The default PXF user runtime library directory. |
| logs/ | The PXF runtime log file directory. Includes pxf-service.log and the Tomcat-related log catalina.out. The logs/ directory and log files are readable only by the gpadmin user. |
| servers/ | The server configuration directory; each subdirectory identifies the name of a server. The default server is named default. The Greenplum Database administrator may configure other servers. |
Installing Java for PXF
Prerequisites
Ensure that you have access to, or superuser permissions to install, Java 8 or Java 11 on each
Greenplum Database host.
Procedure
Perform the following procedure to install Java on the master, standby master, and on each
segment host in your Greenplum Database cluster. You will use the gpssh utility where possible
to run a command on multiple hosts.
$ ssh gpadmin@<gpmaster>
3. If the system does not include a Java version 8 or 11 installation, install one of these Java
versions on the master, standby master, and on each Greenplum Database segment host.
1. Create a text file that lists your Greenplum Database standby master host and
segment hosts, one host name per line. For example, a file named gphostfile may
include:
gpmaster
mstandby
seghost1
seghost2
seghost3
2. Install the Java package on each host. For example, to install Java version 8:
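One way to run the installation from the master host (a sketch; it assumes RHEL/CentOS hosts, sudo privileges for the gpadmin user, and the gphostfile created above; adjust the package name for your platform and Java version):

gpadmin@gpmaster$ gpssh -e -v -f gphostfile sudo yum -y install java-1.8.0-openjdk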
4. Identify the Java $JAVA_HOME setting for the version of Java that you installed. For example:
If you installed Java 8:
JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.x86_64/jre
If you installed Java 11:
JAVA_HOME=/usr/lib/jvm/java-11-openjdk-11.0.4.11-0.el7_6.x86_64
If the superuser configures the newly-installed Java alternative as the system default:
JAVA_HOME=/usr/lib/jvm/jre
5. Note the $JAVA_HOME setting; you provide this value when you initialize PXF.
Initializing PXF
The PXF server is a Java application. You must explicitly initialize the PXF Java service
instance. This one-time initialization creates the PXF service web application and generates
PXF configuration files and templates.
PXF provides two management commands that you can use for initialization:
pxf cluster init - initialize all PXF service instances in the Greenplum Database cluster
pxf init - initialize the PXF service instance on the current Greenplum Database host
PXF also provides similar reset commands that you can use to reset your PXF configuration.
Configuration Properties
PXF supports both internal and user-customizable configuration properties.
PXF internal configuration files are located in your Greenplum Database installation in the
$GPHOME/pxf/conf directory.
You identify the PXF user configuration directory at PXF initialization time via an environment
variable named $PXF_CONF. If you do not set $PXF_CONF prior to initializing PXF, PXF may prompt
you to accept or decline the default user configuration directory, $HOME/pxf, during the
initialization process.
Note: Choose a $PXF_CONF directory location that you can back up, and ensure that it resides
outside of your Greenplum Database installation directory.
Refer to PXF User Configuration Directories for a list of $PXF_CONF subdirectories and their
contents.
Initialization Overview
The PXF server runs on Java 8 or 11. You identify the PXF $JAVA_HOME and $PXF_CONF settings
at PXF initialization time.
Initializing PXF creates the PXF Java web application, and generates PXF internal configuration
files, setting default properties specific to your configuration.
Initializing PXF also creates the $PXF_CONF user configuration directory if it does not already
exist, and then populates conf and templates subdirectories with the following:
conf/ - user-customizable files for PXF runtime and logging configuration settings
templates/ - template configuration files
PXF remembers the JAVA_HOME setting that you specified during initialization by updating the
property of the same name in the $PXF_CONF/conf/pxf-env.sh user configuration file. PXF sources
this environment file on startup, allowing it to run with a Java installation that is different from the
system default Java installation.
If the $PXF_CONF directory that you specify during initialization already exists, PXF updates only
the templates subdirectory and the $PXF_CONF/conf/pxf-env.sh environment configuration file.
Prerequisites
Before initializing PXF in your Greenplum Database cluster, ensure that you have installed a supported Java version (8 or 11) on each Greenplum Database host and noted its $JAVA_HOME setting, as described in Installing Java for PXF.
Procedure
Perform the following procedure to initialize PXF on each segment host in your Greenplum
Database cluster.
$ ssh gpadmin@<gpmaster>
3. Run the pxf cluster init command to initialize the PXF service on the master, standby master,
and on each segment host. For example, the following command specifies
/usr/local/greenplum-pxf as the PXF user configuration directory for initialization:
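A typical invocation sets PXF_CONF inline when running the command (a sketch assuming PXF is installed under $GPHOME/pxf):

gpadmin@gpmaster$ PXF_CONF=/usr/local/greenplum-pxf $GPHOME/pxf/bin/pxf cluster init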
Note: The PXF service runs only on the segment hosts. However,pxf cluster init also sets up
the PXF user configuration directories on the Greenplum Database master and standby
master hosts.
Resetting PXF
Should you need to, you can reset PXF to its uninitialized state. You might choose to reset PXF
if you specified an incorrect PXF_CONF directory, or if you want to start the initialization procedure
from scratch.
When you reset PXF, PXF prompts you to confirm the operation. If you confirm, PXF removes
the following runtime files and directories (where PXF_HOME=$GPHOME/pxf):
$PXF_HOME/conf/pxf-private.classpath
$PXF_HOME/pxf-service
$PXF_HOME/run
You must stop the PXF service instance on a segment host before you can reset PXF on the
host.
Procedure
Perform the following procedure to reset PXF on each segment host in your Greenplum
Database cluster.
$ ssh gpadmin@<gpmaster>
2. Stop the PXF service instances on each segment host. For example:
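A likely form, assuming a pxf cluster stop subcommand analogous to the init and sync commands shown in this documentation:

gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster stop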
3. Reset the PXF service instances on all Greenplum hosts. For example:
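Similarly, assuming the cluster-wide reset command mentioned earlier in this topic:

gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster reset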
Note: After you reset PXF, you must initialize and start PXF to use the service again.
Configuring PXF Servers
This topic provides an overview of PXF server configuration. To configure a server, refer to the
topic specific to the connector that you want to configure.
You read data from, or write data to, an external data store via a PXF connector. To access an external
data store, you must provide the server location. You may also be required to provide client
access credentials and other external data store-specific properties. PXF simplifies configuring
access to external data stores through named server configurations: PXF provides a template
configuration file for each connector, and a server configuration can supply default and per-user
access settings.
A PXF Server definition is simply a named configuration that provides access to a specific
external data store. A PXF server name is the name of a directory residing in $PXF_CONF/servers/.
The information that you provide in a server configuration is connector-specific. For example, a
PXF JDBC Connector server definition may include settings for the JDBC driver class name,
URL, username, and password. You can also configure connection-specific and session-specific
properties in a JDBC server definition.
PXF provides a server template file for each connector; this template identifies the typical set of
properties that you must configure to use the connector.
You will configure a server definition for each external data store that Greenplum Database
users need to access. For example, if you require access to two Hadoop clusters, you will create
a PXF Hadoop server configuration for each cluster. If you require access to an Oracle and a
MySQL database, you will create one or more PXF JDBC server configurations for each
database.
A server configuration may include default settings for user access credentials and other
properties for the external data store. You can allow Greenplum Database users to access the
external data store using the default settings, or you can configure access and other properties
on a per-user basis. This allows you to configure different Greenplum Database users with
different external data store access credentials in a single PXF server definition.
PXF provides a template configuration file for each connector. These server template
configuration files are located in the $PXF_CONF/templates/ directory after you initialize PXF:
gpadmin@gpmaster$ ls $PXF_CONF/templates
adl-site.xml hbase-site.xml jdbc-site.xml pxf-site.xml yarn-site.xml
core-site.xml hdfs-site.xml mapred-site.xml s3-site.xml
gs-site.xml hive-site.xml minio-site.xml wasbs-site.xml
Note: The template files for the Hadoop connectors are not intended to be modified and used
for configuration, as they only provide an example of the information needed. Instead of
modifying the Hadoop templates, you will copy several Hadoop *-site.xml files from the Hadoop
cluster to your PXF Hadoop server configuration.
PXF automatically uses the default server configuration if you omit the SERVER=<server_name>
setting in the CREATE EXTERNAL TABLE command LOCATION clause.
Configuring a Server
When you configure a PXF connector to an external data store, you add a named PXF server
configuration for the connector. Among the tasks that you perform, you may:
1. Determine if you are configuring the default PXF server, or choose a new name for the
server configuration.
2. Create the directory $PXF_CONF/servers/<server_name>.
3. Copy template or other configuration files to the new server directory.
4. Fill in appropriate default values for the properties in the template file.
5. Add any additional configuration properties and values required for your environment.
6. Configure one or more users for the server configuration as described in About
Configuring a PXF User.
7. Synchronize the server and user configuration to the Greenplum Database cluster.
Note: You must re-sync the PXF configuration to the Greenplum Database cluster after you add
or update PXF server configuration.
To configure a PXF server for Hadoop, refer to Configuring PXF Hadoop Connectors .
To configure a PXF server for an object store, refer to Configuring Connectors to Minio
and S3 Object Stores and Configuring Connectors to Azure and Google Cloud Storage
Object Stores.
To configure a PXF JDBC server, refer to Configuring the JDBC Connector .
The settings in the pxf-site.xml server configuration file apply only to Hadoop and JDBC server
configurations; they do not apply to object store server configurations. You configure properties
in the pxf-site.xml file for a PXF server when you need to specify Kerberos, user impersonation,
or related service settings for that server configuration.
Refer to Configuring PXF Hadoop Connectors and Configuring the JDBC Connector for
information about relevant pxf-site.xml property settings for Hadoop and JDBC server
configurations, respectively.
About Configuring a PXF User
PXF per-server, per-user configuration provides the most benefit for JDBC servers.
You configure external data store user access credentials and properties for a specific
Greenplum Database user by providing a <greenplum_user_name>-user.xml user configuration file in
the PXF server configuration directory, $PXF_CONF/servers/<server_name>/. For example, you specify
the properties for the Greenplum Database user named bill in the file
$PXF_CONF/servers/<server_name>/bill-user.xml. You can configure zero, one, or more users in a PXF
server configuration.
The properties that you specify in a user configuration file are connector-specific. You can
specify any configuration property supported by the PXF connector server in a
<greenplum_user_name>-user.xml configuration file.
For example, suppose you have configured access to a PostgreSQL database in the PXF JDBC
server configuration named pgsrv1. To allow the Greenplum Database user named bill to access
this database as the PostgreSQL user named pguser1, password changeme, you create the user
configuration file $PXF_CONF/servers/pgsrv1/bill-user.xml with the following properties:
<configuration>
<property>
<name>jdbc.user</name>
<value>pguser1</value>
</property>
<property>
<name>jdbc.password</name>
<value>changeme</value>
</property>
</configuration>
If you want to configure a specific search path and a larger read fetch size for bill, you would also
add the following properties to the bill-user.xml user configuration file:
<property>
<name>jdbc.session.property.search_path</name>
<value>bill_schema</value>
</property>
<property>
<name>jdbc.statement.fetchSize</name>
<value>2000</value>
</property>
Procedure
For each PXF user that you want to configure, you will:
5. Add each property/value pair that you identified in Step 3 within the configuration block in
the <greenplum_user_name>-user.xml file.
6. If you are adding the PXF user configuration to previously configured PXF server definition,
synchronize the user configuration to the Greenplum Database cluster.
For a given Greenplum Database user, PXF uses the following precedence rules (highest to
lowest) to obtain configuration property settings for the user:
Property settings in the <greenplum_user_name>-user.xml file in the server configuration directory.
Property settings in the server configuration files themselves.
These precedence rules allow you to create a single external table that can be accessed by
multiple Greenplum Database users, each with their own unique external data store user
credentials.
For example, the following command accesses an S3 object store using the server configuration
defined in the $PXF_CONF/servers/s3srvcfg/s3-site.xml file:
CREATE EXTERNAL TABLE pxf_ext_tbl(name text, orders int)
LOCATION ('pxf://BUCKET/dir/file.txt?PROFILE=s3:text&SERVER=s3srvcfg')
FORMAT 'TEXT' (delimiter=E',');
PXF automatically uses the default server configuration when no SERVER=<server_name> setting is
provided.
For example, if the default server configuration identifies a Hadoop cluster, the following example
command references the HDFS file located at /path/to/file.txt:
CREATE EXTERNAL TABLE pxf_ext_hdfs(location text, miles int)
LOCATION ('pxf://path/to/file.txt?PROFILE=hdfs:text')
FORMAT 'TEXT' (delimiter=E',');
A Greenplum Database user who queries or writes to an external table accesses the external
data store with the credentials configured for the <server_name> user. If no user-specific
credentials are configured for <server_name>, the Greenplum user accesses the external data
store with the default credentials configured for <server_name>.
Configuring Hadoop Connectors (Optional)
PXF is compatible with Cloudera, Hortonworks Data Platform, MapR, and generic Apache Hadoop distributions. This
topic describes how to configure the PXF Hadoop, Hive, and HBase connectors.
If you do not want to use the Hadoop-related PXF connectors, then you do not need to perform this procedure.
Prerequisites
Configuring PXF Hadoop connectors involves copying configuration files from your Hadoop cluster to the Greenplum
Database master host. If you are using the MapR Hadoop distribution, you must also copy certain JAR files to the master
host. Before you configure the PXF Hadoop connectors, ensure that you can copy files from hosts in your Hadoop
cluster to the Greenplum Database master.
Procedure
Perform the following procedure to configure the desired PXF Hadoop-related connectors on the Greenplum Database
master host. After you configure the connectors, you will use the pxf cluster sync command to copy the PXF configuration to
the Greenplum Database cluster.
In this procedure, you use the default, or create a new, PXF server configuration. You copy Hadoop configuration files to
the server configuration directory on the Greenplum Database master host. You identify Kerberos and user
impersonation settings required for access, if applicable. You may also copy libraries to $PXF_CONF/lib for MapR support.
You then synchronize the PXF configuration on the master host to the standby master and segment hosts. (PXF creates
the $PXF_CONF/* directories when you run pxf cluster init.)
3. If you are not using the default PXF server, create the $PXF_CONF/servers/<server_name> directory. For example, use
the following command to create a Hadoop server configuration named hdp3:
gpadmin@gpmaster$ mkdir $PXF_CONF/servers/hdp3
4. Change to the server configuration directory. For example, if the server is named hdp3:
gpadmin@gpmaster$ cd $PXF_CONF/servers/hdp3
5. PXF requires information from core-site.xml and other Hadoop configuration files. Copy the core-site.xml, hdfs-site.xml,
mapred-site.xml, and yarn-site.xml Hadoop configuration files from your Hadoop cluster NameNode host to the current
host using your tool of choice. Your file paths may differ based on the Hadoop distribution in use. For example,
these commands use scp to copy the files:
gpadmin@gpmaster$ scp hdfsuser@namenode:/etc/hadoop/conf/core-site.xml .
gpadmin@gpmaster$ scp hdfsuser@namenode:/etc/hadoop/conf/hdfs-site.xml .
gpadmin@gpmaster$ scp hdfsuser@namenode:/etc/hadoop/conf/mapred-site.xml .
gpadmin@gpmaster$ scp hdfsuser@namenode:/etc/hadoop/conf/yarn-site.xml .
6. If you plan to use the PXF Hive connector to access Hive table data, similarly copy the Hive configuration to the
Greenplum Database master host. For example:
gpadmin@gpmaster$ scp hiveuser@hivehost:/etc/hive/conf/hive-site.xml .
7. If you plan to use the PXF HBase connector to access HBase table data, similarly copy the HBase configuration to
the Greenplum Database master host. For example:
gpadmin@gpmaster$ scp hbaseuser@hbasehost:/etc/hbase/conf/hbase-site.xml .
8. If you are using PXF with the MapR Hadoop distribution, you must copy certain JAR files from your MapR cluster to
the Greenplum Database master host. (Your file paths may differ based on the version of MapR in use.) For
example, these commands use scp to copy the files:
gpadmin@gpmaster$ cd $PXF_CONF/lib
gpadmin@gpmaster$ scp mapruser@maprhost:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/common/lib/maprfs-5.2.2-mapr.jar .
gpadmin@gpmaster$ scp mapruser@maprhost:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/common/lib/hadoop-auth-2.7.0-mapr-1707.jar .
gpadmin@gpmaster$ scp mapruser@maprhost:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/common/hadoop-common-2.7.0-mapr-1707.jar .
9. Synchronize the PXF configuration to the Greenplum Database cluster. For example:
gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster sync
10. PXF accesses Hadoop services on behalf of Greenplum Database end users. By default, PXF tries to access
HDFS, Hive, and HBase using the identity of the Greenplum Database user account that logs into Greenplum
Database. In order to support this functionality, you must configure proxy settings for Hadoop, as well as for Hive
and HBase if you intend to use those PXF connectors. Follow procedures in Configuring User Impersonation and
Proxying to configure user impersonation and proxying for Hadoop services, or to turn off PXF user impersonation.
11. Grant read permission to the HDFS files and directories that will be accessed as external tables in Greenplum
Database. If user impersonation is enabled (the default), you must grant this permission to each Greenplum
Database user/role name that will use external tables that reference the HDFS files. If user impersonation is not
enabled, you must grant this permission to the gpadmin user.
12. If your Hadoop cluster is secured with Kerberos, you must configure PXF and generate Kerberos principals and
keytabs for each segment host as described in Configuring PXF for Secure HDFS.
Configuring User Impersonation and Proxying
When user impersonation is enabled (the default), PXF accesses Hadoop services using the
identity of the Greenplum Database user account that logs in to Greenplum and performs an
operation that uses a PXF connector. Keep in mind that PXF uses only the login identity of the
user when accessing Hadoop services. For example, if a user logs in to Greenplum Database as
the user jane and then executes SET ROLE or SET SESSION AUTHORIZATION to assume a different
user identity, all PXF requests still use the identity jane to access Hadoop services. When user
impersonation is enabled, you must explicitly configure each Hadoop data source (HDFS, Hive,
HBase) to allow PXF to act as a proxy for impersonating specific Hadoop users or groups.
When user impersonation is disabled, PXF executes all Hadoop service requests as the PXF
process owner (usually gpadmin) or the Hadoop user identity that you specify. This behavior
provides no means to control access to Hadoop services for different Greenplum Database
users. It requires that this user have access to all files and directories in HDFS, and all tables in
Hive and HBase that are referenced in PXF external table definitions.
You configure the Hadoop user and PXF user impersonation setting for a server via the pxf-site.xml
server configuration file. Refer to About Kerberos and User Impersonation Configuration
(pxf-site.xml) for more information about the configuration properties in this file.
The following table describes some of the PXF configuration scenarios for Hadoop access:
| Scenario | pxf-site.xml Required | Impersonation Setting | Required Configuration |
|---|---|---|---|
| PXF accesses Hadoop using the identity of the Greenplum Database user. | yes | true | Enable user impersonation, identify the Hadoop proxy user in pxf.service.user.name, and configure Hadoop proxying for this Hadoop user identity. |
| PXF accesses Hadoop using the identity of the operating system user that started the PXF process. | yes | false | Disable user impersonation. |
| PXF accesses Hadoop using a user identity that you specify. | yes | false | Disable user impersonation and identify the Hadoop user identity in the pxf.service.user.name property setting. |
2. Identify the name of the PXF Hadoop server configuration that you want to update.
3. Navigate to the server configuration directory. For example, if the server is named hdp3:
gpadmin@gpmaster$ cd $PXF_CONF/servers/hdp3
4. If the server configuration does not yet include a pxf-site.xml file, copy the template file to the
directory. For example:
gpadmin@gpmaster$ cp $PXF_CONF/templates/pxf-site.xml .
5. Open the pxf-site.xml file in the editor of your choice, and configure the Hadoop user name.
When impersonation is disabled, this name identifies the Hadoop user identity that PXF will
use to access the Hadoop system. When user impersonation is enabled, this name
identifies the PXF proxy Hadoop user. For example, if you want to access Hadoop as the
user hdfsuser1:
<property>
<name>pxf.service.user.name</name>
<value>hdfsuser1</value>
</property>
7. Use the pxf cluster sync command to synchronize the PXF Hadoop server configuration to
your Greenplum Database cluster. For example:
gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster sync
Configure PXF User Impersonation
In previous versions of Greenplum Database, you configured user impersonation globally for
Hadoop clusters via the now deprecated PXF_USER_IMPERSONATION setting in the pxf-env.sh
configuration file.
1. Navigate to the server configuration directory. For example, if the server is named hdp3:
gpadmin@gpmaster$ cd $PXF_CONF/servers/hdp3
2. If the server configuration does not yet include a pxf-site.xml file, copy the template file to the
directory. For example:
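As in the previous procedure:

gpadmin@gpmaster$ cp $PXF_CONF/templates/pxf-site.xml .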
3. Open the pxf-site.xml file in the editor of your choice, and update the user impersonation
property setting. For example, if you do not require user impersonation for this server
configuration, set the pxf.service.user.impersonation property to false:
<property>
<name>pxf.service.user.impersonation</name>
<value>false</value>
</property>
4. If you enabled user impersonation, you must configure Hadoop proxying as described in
Configure Hadoop Proxying. You must also configure Hive User Impersonation and
HBase User Impersonation if you plan to use those services.
6. Use the pxf cluster sync command to synchronize the PXF Hadoop server configuration to
your Greenplum Database cluster. For example:
gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster sync
Configure Hadoop Proxying
1. Log in to your Hadoop cluster and open the core-site.xml configuration file using a text editor,
or use Ambari or another Hadoop cluster manager to add or edit the Hadoop property
values described in this procedure.
2. Set the property hadoop.proxyuser.<name>.hosts to specify the list of PXF host names from
which proxy requests are permitted. Substitute the PXF proxy Hadoop user for <name>.
The PXF proxy Hadoop user is the pxf.service.user.name that you configured in the procedure
above, or, if you are using Kerberos authentication to Hadoop, the proxy user identity is
the primary component of the Kerberos principal. If you have not configured
pxf.service.user.name, the proxy user is the operating system user that started PXF. Provide
multiple PXF host names in a comma-separated list. For example, if the PXF proxy user is
named hdfsuser2:
<property>
<name>hadoop.proxyuser.hdfsuser2.hosts</name>
<value>pxfhost1,pxfhost2,pxfhost3</value>
</property>
3. Set the property hadoop.proxyuser.<name>.groups to specify the list of HDFS groups that PXF, as the Hadoop user <name>, can impersonate. For example:
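For example, to allow PXF to impersonate users in any group (a sketch that parallels the hosts property above; narrow the group list for your environment):

<property>
    <name>hadoop.proxyuser.hdfsuser2.groups</name>
    <value>*</value>
</property>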
4. You must restart Hadoop for your core-site.xml changes to take effect.
5. Copy the updated core-site.xml file to the PXF Hadoop server configuration directory
$PXF_CONF/servers/<server_name> on the Greenplum Database master and synchronize the
configuration to the standby master and each Greenplum Database segment host.
Configuring PXF for Secure HDFS
When Kerberos is enabled for your HDFS filesystem, PXF, as an HDFS client, requires a principal and keytab file to authenticate access to
HDFS. To read or write files on a secure HDFS, you must create and deploy Kerberos principals and keytabs for PXF, and ensure that
Kerberos authentication is enabled and functioning.
In previous versions of Greenplum Database, you configured the PXF Kerberos principal and keytab for the default Hadoop server via the now
deprecated PXF_PRINCIPAL and PXF_KEYTAB settings in the pxf-env.sh configuration file.
When Kerberos is enabled, you access Hadoop with the PXF principal and keytab. You can also choose to access Hadoop using the identity
of the Greenplum Database user.
You configure the impersonation setting and the Kerberos principal and keytab for a Hadoop server via the pxf-site.xml server-specific
configuration file. Refer to About Kerberos and User Impersonation Configuration (pxf-site.xml) for more information about the configuration
properties in this file.
Configure the Kerberos principal and keytab using the pxf.service.kerberos.principal and pxf.service.kerberos.keytab pxf-site.xml properties.
Two scenarios are possible when Kerberos authentication to Hadoop is enabled: PXF accesses Hadoop using the identity of the Kerberos principal, or PXF accesses Hadoop using the identity of the Greenplum Database user (with user impersonation enabled).
Prerequisites
Before you configure PXF for access to a secure HDFS filesystem, ensure that you have:
Configured a PXF server for the Hadoop cluster, and can identify the server configuration name.
Verified that the HDFS configuration parameter dfs.block.access.token.enable is set to true. You can find this setting in the hdfs-site.xml configuration file on a host in your Hadoop cluster.
Noted the host name or IP address of each Greenplum Database segment host (<seghost>) and the Kerberos Key Distribution Center
(KDC) <kdc-server> host.
Noted the name of the Kerberos <realm> in which your cluster resides.
Installed the Kerberos client packages on each Greenplum Database segment host if they are not already installed. You must have
superuser permissions to install operating system packages. For example:
root@gphost$ rpm -qa | grep krb
root@gphost$ yum install krb5-libs krb5-workstation
Procedure
There are different procedures for configuring PXF for secure HDFS with a Microsoft Active Directory KDC Server vs. with an MIT Kerberos
KDC Server.
Configuring PXF with a Microsoft Active Directory Kerberos KDC Server

When you configure PXF for secure HDFS using an AD Kerberos KDC server, you will perform tasks on both the KDC server host and the
Greenplum Database master host.
7. Open Powershell or a command prompt and run the ktpass command to generate the keytab file. For example, for a principal named gpadmin in your realm:
powershell#> ktpass -out pxf.service.keytab -princ gpadmin@<REALM> -mapUser ServiceGreenplumPROD1 -pass ******* -crypto all -ptype KRB5_NT_PRINCIPAL
With Active Directory, the principal and the keytab file are shared by all Greenplum Database segment hosts.
2. Identify the name of the PXF Hadoop server configuration, and navigate to the server configuration directory. For example, if the server
is named hdp3:
gpadmin@gpmaster$ cd $PXF_CONF/servers/hdp3
3. If the server configuration does not yet include a pxf-site.xml file, copy the template file to the directory. For example:
gpadmin@gpmaster$ cp $PXF_CONF/templates/pxf-site.xml .
4. Open the pxf-site.xml file in the editor of your choice, and update the keytab and principal property settings, if required. Specify the
location of the keytab file and the Kerberos principal, substituting your realm. For example:
<property>
<name>pxf.service.kerberos.principal</name>
<value>gpadmin@<REALM></value>
</property>
<property>
<name>pxf.service.kerberos.keytab</name>
<value>${pxf.conf}/keytabs/pxf.service.keytab</value>
</property>
5. Enable user impersonation as described in Configure PXF User Impersonation, and configure or verify Hadoop proxying for the primary
component of the Kerberos principal as described in Configure Hadoop Proxying. For example, if your principal is
gpadmin@<REALM>, configure proxying for the Hadoop user gpadmin.
7. Synchronize the PXF configuration to your Greenplum Database cluster and restart PXF. For example:
gpadmin@master$ $GPHOME/pxf/bin/pxf cluster sync
8. Step 7 does not synchronize the keytabs in $PXF_CONF. You must distribute the keytab file to $PXF_CONF/keytabs/. Locate the keytab file,
copy the file to the $PXF_CONF user configuration directory, and set required permissions. For example:
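One possible approach, assuming the keytab file has been copied to the master host and that gphostfile lists the standby master and segment hosts (the gpscp =: token expands to each host in the file):

gpadmin@gpmaster$ cp pxf.service.keytab $PXF_CONF/keytabs/
gpadmin@gpmaster$ chmod 400 $PXF_CONF/keytabs/pxf.service.keytab
gpadmin@gpmaster$ gpscp -v -f gphostfile $PXF_CONF/keytabs/pxf.service.keytab =:$PXF_CONF/keytabs/pxf.service.keytab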
Configuring PXF with an MIT Kerberos KDC Server

Perform the following steps on the MIT Kerberos KDC server host:
$ ssh root@<kdc-server>
root@kdc-server$
2. Distribute the /etc/krb5.conf Kerberos configuration file on the KDC server host to each segment host in your Greenplum Database cluster
if not already present. For example:
root@kdc-server$ scp /etc/krb5.conf seghost:/etc/krb5.conf
3. Use the kadmin.local command to create a Kerberos PXF service principal for each Greenplum Database segment host. The service
principal should be of the form gpadmin/<seghost>@<realm> where <seghost> is the DNS resolvable, fully-qualified hostname of the
segment host system (output of the hostname -f command).
For example, these commands create PXF service principals for the hosts named host1.example.com, host2.example.com, and
host3.example.com in the Kerberos realm named EXAMPLE.COM:
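A sketch of the commands, run with kadmin.local on the KDC server host (substitute your host names and realm):

root@kdc-server$ kadmin.local -q "addprinc -randkey gpadmin/[email protected]"
root@kdc-server$ kadmin.local -q "addprinc -randkey gpadmin/[email protected]"
root@kdc-server$ kadmin.local -q "addprinc -randkey gpadmin/[email protected]"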
4. Generate a keytab file for each PXF service principal that you created in the previous step. Save the keytab files in any convenient
location (this example uses the directory /etc/security/keytabs). You will deploy the keytab files to their respective Greenplum Database
segment host machines in a later step. For example:
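For example, for the host1.example.com principal (a sketch; the per-host keytab file name is an assumption):

root@kdc-server$ kadmin.local -q "xst -norandkey -k /etc/security/keytabs/pxf-host1.service.keytab gpadmin/[email protected]"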
Repeat the xst command as necessary to generate a keytab for each PXF service principal that you created in the previous step.
6. Copy the keytab file for each PXF service principal to its respective segment host. For example, the following commands copy each
principal generated in step 4 to the PXF default keytab directory on the segment host when PXF_CONF=/usr/local/greenplum-pxf:
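A sketch of these copy commands, assuming the per-host keytab file names used above:

root@kdc-server$ scp /etc/security/keytabs/pxf-host1.service.keytab host1.example.com:/usr/local/greenplum-pxf/keytabs/pxf.service.keytab
root@kdc-server$ scp /etc/security/keytabs/pxf-host2.service.keytab host2.example.com:/usr/local/greenplum-pxf/keytabs/pxf.service.keytab
root@kdc-server$ scp /etc/security/keytabs/pxf-host3.service.keytab host3.example.com:/usr/local/greenplum-pxf/keytabs/pxf.service.keytab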
Note the file system location of the keytab file on each PXF host; you will need this information for a later configuration step.
7. Change the ownership and permissions on the pxf.service.keytab files. The files must be owned and readable by only the gpadmin user. For example:
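A sketch, run on each segment host (or centrally via gpssh), assuming PXF_CONF=/usr/local/greenplum-pxf:

root@seghost$ chown gpadmin:gpadmin /usr/local/greenplum-pxf/keytabs/pxf.service.keytab
root@seghost$ chmod 400 /usr/local/greenplum-pxf/keytabs/pxf.service.keytab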
$ ssh gpadmin@<gpmaster>
2. Identify the name of the PXF Hadoop server configuration that requires Kerberos access.
3. Navigate to the server configuration directory. For example, if the server is namedhdp3:
gpadmin@gpmaster$ cd $PXF_CONF/servers/hdp3
4. If the server configuration does not yet include a pxf-site.xml file, copy the template file to the directory. For example:
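As in the Active Directory procedure above:

gpadmin@gpmaster$ cp $PXF_CONF/templates/pxf-site.xml .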
5. Open the pxf-site.xml file in the editor of your choice, and update the keytab and principal property settings, if required. Specify the
location of the keytab file and the Kerberos principal, substituting your realm. The default values for these settings are identified below:
<property>
<name>pxf.service.kerberos.principal</name>
<value>gpadmin/_HOST@EXAMPLE.COM</value>
</property>
<property>
<name>pxf.service.kerberos.keytab</name>
<value>${pxf.conf}/keytabs/pxf.service.keytab</value>
</property>
PXF automatically replaces _HOST with the FQDN of the segment host.
7. If you want to access Hadoop using the identity of the Kerberos principal, disable user impersonation as described inConfigure PXF
User Impersonation.
8. PXF ignores the pxf.service.user.name property when it uses Kerberos authentication to Hadoop. You may choose to remove this property
from the pxf-site.xml file.
10. Synchronize the PXF configuration to your Greenplum Database cluster. For example:
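As in the preceding procedures:

gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster sync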
Configuring Connectors to Minio and S3 Object Stores (Optional)
You can use PXF to access S3-compatible object stores. This topic describes how to configure
the PXF connectors to these external data sources.
If you do not plan to use these PXF object store connectors, then you do not need to perform
this procedure.
PXF provides a template configuration file for each object store connector. These template files
are located in the $PXF_CONF/templates/ directory.
S3 Server Configuration
The template configuration file for S3 is $PXF_CONF/templates/s3-site.xml. When you configure an S3
server, you must provide the access key (fs.s3a.access.key) and secret key (fs.s3a.secret.key)
server configuration properties, replacing the template values with your credentials.
If required, fine-tune PXF S3 connectivity by specifying additional properties identified in the S3A
section of the Hadoop-AWS module documentation.
You can override the credentials for an S3 server configuration by directly specifying the S3
access ID and secret key via custom options in the CREATE EXTERNAL TABLE command
LOCATION clause. Refer to Overriding the S3 Server Configuration with DDL for additional
information.
PXF supports Amazon Web Service S3 Server-Side Encryption (SSE) for S3 files that you
access with readable and writable Greenplum Database external tables that specify the pxf
protocol and an s3:* profile. AWS S3 server-side encryption protects your data at rest; it encrypts
your object data as it writes to disk, and transparently decrypts the data for you when you access
it.
PXF supports the following AWS SSE encryption key management schemes:
SSE with S3-Managed Keys (SSE-S3) - Amazon manages the data and master encryption
keys.
SSE with Key Management Service Managed Keys (SSE-KMS) - Amazon manages the
data key, and you manage the encryption key in AWS KMS.
SSE with Customer-Provided Keys (SSE-C) - You set and manage the encryption key.
Your S3 access key and secret key govern your access to all S3 bucket objects, whether the
data is encrypted or not.
S3 transparently decrypts data during a read operation of an encrypted file that you access via a
readable external table that is created by specifying the pxf protocol and an s3:* profile. No
additional configuration is required.
To encrypt data that you write to S3 via this type of external table, you have two options:
Configure the default SSE encryption key management scheme on a per-S3-bucket basis
via the AWS console or command line tools (recommended).
Configure SSE encryption options in your PXF S3 servers3-site.xml configuration file.
You can create S3 Bucket Policy(s) that identify the objects that you want to encrypt, the
encryption key management scheme, and the write actions permitted on those objects. Refer to
Protecting Data Using Server-Side Encryption in the AWS S3 documentation for more
information about the SSE encryption key management schemes. How Do I Enable Default
Encryption for an S3 Bucket? describes how to set default encryption bucket policies.
You must include certain properties in s3-site.xml to configure server-side encryption in a PXF S3
server configuration. The properties and values that you add to the file are dependent upon the
SSE encryption key management scheme.
SSE-S3
To enable SSE-S3 on any file that you write to any S3 bucket, set the following encryption
algorithm property and value in the s3-site.xml file:
<property>
<name>fs.s3a.server-side-encryption-algorithm</name>
<value>AES256</value>
</property>
To enable SSE-S3 for a specific S3 bucket, use the property name variant that includes the
bucket name. For example:
<property>
<name>fs.s3a.bucket.YOUR_BUCKET1_NAME.server-side-encryption-algorithm</name>
<value>AES256</value>
</property>
SSE-KMS
To enable SSE-KMS on any file that you write to any S3 bucket, set both the encryption
algorithm and encryption key ID. To set these properties in the s3-site.xml file:
<property>
<name>fs.s3a.server-side-encryption-algorithm</name>
<value>SSE-KMS</value>
</property>
<property>
<name>fs.s3a.server-side-encryption.key</name>
<value>YOUR_AWS_SSE_KMS_KEY_ARN</value>
</property>
Substitute YOUR_AWS_SSE_KMS_KEY_ARN with your key resource name. If you do not specify an
encryption key, the default key defined in the Amazon KMS is used. Example KMS key:
arn:aws:kms:us-west-2:123456789012:key/1a23b456-7890-12cc-d345-6ef7890g12f3.
Note: Be sure to create the bucket and the key in the same Amazon Availability Zone.
To enable SSE-KMS for a specific S3 bucket, use property name variants that include the bucket
name. For example:
<property>
<name>fs.s3a.bucket.YOUR_BUCKET2_NAME.server-side-encryption-algorithm</name>
<value>SSE-KMS</value>
</property>
<property>
<name>fs.s3a.bucket.YOUR_BUCKET2_NAME.server-side-encryption.key</name>
<value>YOUR_AWS_SSE_KMS_KEY_ARN</value>
</property>
SSE-C
To enable SSE-C on any file that you write to any S3 bucket, set both the encryption algorithm and your encryption key in the s3-site.xml file.
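The s3-site.xml entries parallel the SSE-KMS example; a sketch (the SSE-C algorithm value and the placeholder key are assumptions to verify against the Hadoop S3A documentation):

<property>
    <name>fs.s3a.server-side-encryption-algorithm</name>
    <value>SSE-C</value>
</property>
<property>
    <name>fs.s3a.server-side-encryption.key</name>
    <value>YOUR_BASE64_ENCODED_ENCRYPTION_KEY</value>
</property>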
To enable SSE-C for a specific S3 bucket, use the property name variants that include the
bucket name as described in the SSE-KMS example.
In this procedure, you name and add a PXF server configuration in the$PXF_CONF/servers
directory on the Greenplum Database master host for the S3 Cloud Storage connector. You then
use the pxf cluster sync command to sync the server configuration(s) to the Greenplum Database
cluster.
$ ssh gpadmin@<gpmaster>
2. Choose a name for the server. You will provide the name to end users that need to
reference files in the object store.
4. Copy the PXF template file for S3 to the server configuration directory. For example:
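For example, assuming a server configuration directory named s3srvcfg under $PXF_CONF/servers/:

gpadmin@gpmaster$ cp $PXF_CONF/templates/s3-site.xml $PXF_CONF/servers/s3srvcfg/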
5. Open the template server configuration file in the editor of your choice, and provide
appropriate property values for your environment. For example:
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<property>
<name>fs.s3a.access.key</name>
<value>access_key_for_user1</value>
</property>
<property>
<name>fs.s3a.secret.key</name>
<value>secret_key_for_user1</value>
</property>
<property>
<name>fs.s3a.fast.upload</name>
<value>true</value>
</property>
</configuration>
7. Use the pxf cluster sync command to copy the new server configuration to the Greenplum
Database cluster. For example:
gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster sync
Configuring Connectors to Azure and Google Cloud Storage Object Stores (Optional)
You can use PXF to access Azure Data Lake, Azure Blob Storage, and Google Cloud Storage object stores. This topic
describes how to configure the PXF connectors to these external data sources.
If you do not plan to use these PXF object store connectors, then you do not need to perform this procedure.
PXF provides a template configuration file for each object store connector. These template files are located in the
$PXF_CONF/templates/ directory.
The template configuration file for Azure Data Lake is $PXF_CONF/templates/adl-site.xml. When you configure an Azure Data
Lake server, you must provide the following server configuration properties and replace the template values with your
credentials:
In this procedure, you name and add a PXF server configuration in the $PXF_CONF/servers directory on the Greenplum
Database master host for the Google Cloud Storage (GCS) connector. You then use the pxf cluster sync command to sync the
server configuration(s) to the Greenplum Database cluster.
$ ssh gpadmin@<gpmaster>
2. Choose a name for the server. You will provide the name to end users that need to reference files in the object store.
3. Create the $PXF_CONF/servers/<server_name> directory. For example, use the following command to create a server
configuration for a Google Cloud Storage server named gs_public:
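gpadmin@gpmaster$ mkdir $PXF_CONF/servers/gs_public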
4. Copy the PXF template file for GCS to the server configuration directory. For example:
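A sketch of the copy command, assuming the GCS template file is named gs-site.xml:

gpadmin@gpmaster$ cp $PXF_CONF/templates/gs-site.xml $PXF_CONF/servers/gs_public/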
5. Open the template server configuration file in the editor of your choice, and provide appropriate property values for
your environment. For example, if your Google Cloud Storage key file is located in /home/gpadmin/keys/gcs-account.key.json:
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<property>
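    <!-- A sketch of typical property values; the property names assume the
         standard Google Cloud Storage Hadoop connector used by the GCS template. -->
    <name>google.cloud.auth.service.account.enable</name>
    <value>true</value>
</property>
<property>
    <name>google.cloud.auth.service.account.json.keyfile</name>
    <value>/home/gpadmin/keys/gcs-account.key.json</value>
</property>
<property>
    <name>fs.AbstractFileSystem.gs.impl</name>
    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
</property>
</configuration>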
7. Use the pxf cluster sync command to copy the new server configurations to the Greenplum Database cluster. For
example:
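gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster sync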
You can use PXF to access an external SQL database including MySQL, ORACLE, Microsoft
SQL Server, DB2, PostgreSQL, Hive, and Apache Ignite. This topic describes how to configure
the PXF JDBC Connector to access these external data sources.
If you do not plan to use the PXF JDBC Connector, then you do not need to perform this
procedure.
In previous releases of Greenplum Database, you may have specified the JDBC driver class
name, database URL, and client credentials via options in the CREATE EXTERNAL TABLE
command. PXF now supports file-based server configuration for the JDBC Connector. This
configuration, described below, allows you to specify these options and credentials in a file.
Note: PXF external tables that you previously created that directly specified the JDBC
connection options will continue to work. If you want to move these tables to use JDBC file-
based server configuration, you must create a server configuration, drop the external tables, and
then recreate the tables specifying an appropriate SERVER=<server_name> clause.
PXF provides a template configuration file for the JDBC Connector. This server template
configuration file, located in $PXF_CONF/templates/jdbc-site.xml, identifies properties that you can
configure to establish a connection to the external SQL database. The template also includes
optional properties that you can set before executing query or insert commands in the external
database session.
Connection-Level Properties
Replace <CPROP_NAME> with the connection property name and specify its value:
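In general, the property block takes the following form (a sketch; <CPROP_VALUE> stands in for the property value):
<property>
    <name>jdbc.connection.property.<CPROP_NAME></name>
    <value><CPROP_VALUE></value>
</property>
For example, the following block sets the MySQL createDatabaseIfNotExist connection property: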
<property>
<name>jdbc.connection.property.createDatabaseIfNotExist</name>
<value>true</value>
</property>
Ensure that the JDBC driver for the external SQL database supports any connection-level
property that you specify.
The SQL standard defines four transaction isolation levels. The level that you specify for a given
connection to an external SQL database determines how and when the changes made by one
transaction executed on the connection are visible to another.
The PXF JDBC Connector exposes an optional server configuration property named
jdbc.connection.transactionIsolation that enables you to specify the transaction isolation level. PXF sets
the level (setTransactionIsolation()) just after establishing the connection to the external SQL
database.
For example, to set the transaction isolation level to Read uncommitted, add the following
property block to the jdbc-site.xml file:
<property>
<name>jdbc.connection.transactionIsolation</name>
<value>READ_UNCOMMITTED</value>
</property>
Different SQL databases support different transaction isolation levels. Ensure that the external
database supports the level that you specify.
Statement-Level Properties
The PXF JDBC Connector executes a query or insert command on an external SQL database
table in a statement. The Connector exposes properties that enable you to configure certain
aspects of the statement before the command is executed in the external database. The
Connector supports the following statement-level properties:
Example: To set the read fetch size to 5000, add the following property block to jdbc-site.xml:
<property>
<name>jdbc.statement.fetchSize</name>
<value>5000</value>
</property>
Ensure that the JDBC driver for the external SQL database supports any statement-level
property that you specify.
Session-Level Properties
Replace <SPROP_NAME> with the session property name and specify its value:
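In general, the property block takes the following form (a sketch; <SPROP_VALUE> stands in for the property value):
<property>
    <name>jdbc.session.property.<SPROP_NAME></name>
    <value><SPROP_VALUE></value>
</property>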
Note: The PXF JDBC Connector passes both the session property name and property value to
the external SQL database exactly as specified in the jdbc-site.xml server configuration file. To limit
the potential threat of SQL injection, the Connector rejects any property name or value that
contains the ;, \n, \b, or \0 characters.
The PXF JDBC Connector handles the session property SET syntax for all supported external
SQL databases.
Example: To set the search_path parameter before running a query in a PostgreSQL database,
add the following property block to jdbc-site.xml:
<property>
<name>jdbc.session.property.search_path</name>
<value>public</value>
</property>
Ensure that the JDBC driver for the external SQL database supports any property that you
specify.
The PXF JDBC Connector uses JDBC connection pooling implemented by HikariCP. When a
user queries or writes to an external table, the Connector establishes a connection pool for the
associated server configuration the first time that it encounters a unique combination of jdbc.url,
jdbc.user, jdbc.password, connection property, and pool property settings. The Connector reuses
connections in the pool subject to certain connection and timeout settings.
Note: If you have enabled JDBC user impersonation in a server configuration, the JDBC
Connector creates a separate connection pool for each Greenplum Database user that
accesses any external table specifying that server configuration.
The jdbc.pool.enabled property governs JDBC connection pooling for a server configuration.
Connection pooling is enabled by default. To disable JDBC connection pooling for a server
configuration, set the property to false:
<property>
<name>jdbc.pool.enabled</name>
<value>false</value>
</property>
If you disable JDBC connection pooling for a server configuration, PXF does not reuse JDBC
connections for that server. PXF creates a connection to the remote database for every partition
of a query, and closes the connection when the query for that partition completes.
PXF exposes connection pooling properties that you can configure in a JDBC server definition.
These properties are named with the jdbc.pool.property. prefix and apply to each PXF JVM. The
JDBC Connector automatically sets the following connection pool properties and default values:
Property                               Description                                                                                            Default Value
jdbc.pool.property.maximumPoolSize     The maximum number of connections to the database backend.                                            5
jdbc.pool.property.connectionTimeout   The maximum amount of time, in milliseconds, to wait for a connection from the pool.                  30000
jdbc.pool.property.idleTimeout         The maximum amount of time, in milliseconds, after which an inactive connection is considered idle.   30000
jdbc.pool.property.minimumIdle         The minimum number of idle connections maintained in the connection pool.                             0
You can set other HikariCP-specific connection pooling properties for a server configuration by
specifying jdbc.pool.property.<HIKARICP_PROP_NAME> and the desired value in the jdbc-site.xml
configuration file for the server. Also note that the JDBC Connector passes along any property
that you specify with a jdbc.connection.property. prefix when it requests a connection from the JDBC
DriverManager. Refer to Connection-Level Properties above.
To not exceed the maximum number of connections allowed by the target database, and at the
same time ensure that each PXF JVM services a fair share of the JDBC connections, determine
the maximum value of maxPoolSize based on the size of the Greenplum Database cluster as
follows:
max_conns_allowed_by_remote_db / #_greenplum_segment_hosts
In practice, you may choose to set maxPoolSize to a lower value, since the number of concurrent
connections per JDBC query depends on the number of partitions used in the query. When a
query uses no partitions, a single PXF JVM services the query. If a query uses 12 partitions, PXF
establishes 12 concurrent JDBC connections to the remote database. Ideally, these connections
are distributed equally among the PXF JVMs, but that is not guaranteed.
When you enable PXF JDBC user impersonation, the PXF JDBC Connector accesses the
external data store on behalf of a Greenplum Database end user. The Connector uses the name
of the Greenplum Database user that accesses the PXF external table to try to connect to the
external data store.
When you enable JDBC user impersonation for a PXF server, PXF overrides the value of a
jdbc.user property setting defined in either jdbc-site.xml or <greenplum_user_name>-user.xml, or specified
in the external table DDL, with the Greenplum Database user name. For user impersonation to
work effectively when the external data store requires passwords to authenticate connecting
users, you must specify the jdbc.password setting for each user that can be impersonated in that
user’s <greenplum_user_name>-user.xml property override file. Refer to Configuring a PXF User for
more information about per-server, per-Greenplum-user configuration.
The pxf.service.user.impersonation property in the jdbc-site.xml configuration file governs JDBC user
impersonation.
In previous versions of Greenplum Database, you configured JDBC user impersonation via the
now deprecated pxf.impersonation.jdbc property setting in the jdbc-site.xml configuration file.
By default, PXF JDBC user impersonation is disabled. Perform the following procedure to turn
PXF user impersonation on or off for a JDBC server configuration.
2. Identify the name of the PXF JDBC server configuration that you want to update.
gpadmin@gpmaster$ cd $PXF_CONF/servers/mysqldb
4. Open the jdbc-site.xml file in the editor of your choice, and add or uncomment the user
impersonation property and setting. For example, if you require user impersonation for this
server configuration, set the pxf.service.user.impersonation property to true:
<property>
<name>pxf.service.user.impersonation</name>
<value>true</value>
</property>
6. Use the pxf cluster sync command to synchronize the PXF JDBC server configuration to your
Greenplum Database cluster. For example:
gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster sync
In databases that support it, you can configure a session property to switch the effective user.
For example, in DB2, you use the SET SESSION_USER <username> command to switch the effective
DB2 user. If you configure the DB2 session_user variable via a PXF session-level property
(jdbc.session.property.<SPROP_NAME>) in your jdbc-site.xml file, PXF runs this command for you.
For example, to switch the effective DB2 user to the user named bill, you configure your
jdbc-site.xml as follows:
<property>
<name>jdbc.session.property.session_user</name>
<value>bill</value>
</property>
After establishing the database connection, PXF implicitly runs the following command to set the
session_user DB2 session variable to the value that you configured:
SET SESSION_USER bill
PXF recognizes a synthetic property value, ${pxf.session.user}, that identifies the Greenplum
Database user name. You may choose to use this value when you configure a property that
requires a value that changes based on the Greenplum user running the session.
A scenario where you might use ${pxf.session.user} is when you authenticate to the remote SQL
database with Kerberos, the primary component of the Kerberos principal identifies the
Greenplum Database user name, and you want to run queries in the remote database using this
effective user name. For example, if you are accessing DB2, you would configure your jdbc-site.xml as follows:
<property>
<name>jdbc.session.property.session_user</name>
<value>${pxf.session.user}</value>
</property>
With this configuration, PXF SETs the DB2 session_user variable to the current Greenplum
Database user name, and runs subsequent operations on the DB2 table as that user.
To make use of this feature, add or uncomment the following property block in jdbc-site.xml to
prompt PXF to include the Greenplum user name in connection pool creation/reuse criteria:
<property>
<name>jdbc.pool.qualifier</name>
<value>${pxf.session.user}</value>
</property>
PXF runs the query each time the user invokes a SELECT command on the Greenplum Database
external table.
You must place a query text file in the PXF JDBC server configuration directory from which it will
be accessed. If you want to make the query available to more than one JDBC server
configuration, you must copy the query text file to the configuration directory of each server.
The query text file must contain a single query that you want to run in the remote SQL database.
You must construct the query in accordance with the syntax supported by the database.
For example, if a MySQL database has a customers table and an orders table, you could include
the following SQL statement in a query text file:
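A sketch of such a query follows; the column names used here are illustrative, not prescribed by PXF:

SELECT c.name, c.city, sum(o.amount) AS total_sales
FROM customers c JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.name, c.city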
You may optionally provide the ending semicolon (;) for the SQL statement.
Query Naming
The Greenplum Database user references a named query by specifying the query file name
without the extension. For example, if you define a query in a file named report.sql, the name of
that query is report.
Named queries are associated with a specific JDBC server configuration. You will provide the
available query names to the Greenplum Database users that you allow to create external tables
using the server configuration.
The Greenplum Database user specifies query:<query_name> rather than the name of a remote
SQL database table when they create the external table. For example, if the query is defined in
the file $PXF_CONF/servers/mydb/report.sql, the CREATE EXTERNAL TABLE LOCATION clause would
include the following components:
LOCATION ('pxf://query:report?PROFILE=JDBC&SERVER=mydb ...')
Refer to About Using Named Queries for information about using PXF JDBC named queries.
In this procedure, you name and add a PXF JDBC server configuration for a PostgreSQL
database and synchronize the server configuration(s) to the Greenplum Database cluster.
2. Choose a name for the JDBC server. You will provide the name to Greenplum users that
you choose to allow to reference tables in the external SQL database as the configured
user.
4. Copy the PXF JDBC server template file to the server configuration directory. For example:
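A sketch of the copy command, assuming a server directory named $PXF_CONF/servers/<server_name>:

gpadmin@gpmaster$ cp $PXF_CONF/templates/jdbc-site.xml $PXF_CONF/servers/<server_name>/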
5. Open the template server configuration file in the editor of your choice, and provide
appropriate property values for your environment. For example, if you are configuring
access to a PostgreSQL database named testdb on a PostgreSQL instance running on the
host named pgserverhost for the user named user1:
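A sketch of the relevant jdbc-site.xml property settings for this scenario; the PostgreSQL port (5432) and the password value are placeholders to adjust for your environment:

<property>
    <name>jdbc.driver</name>
    <value>org.postgresql.Driver</value>
</property>
<property>
    <name>jdbc.url</name>
    <value>jdbc:postgresql://pgserverhost:5432/testdb</value>
</property>
<property>
    <name>jdbc.user</name>
    <value>user1</value>
</property>
<property>
    <name>jdbc.password</name>
    <value>changeme</value>
</property>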
7. Use the pxf cluster sync command to copy the new server configuration to the Greenplum
Database cluster. For example:
gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster sync
You can use the PXF JDBC Connector to retrieve data from Hive. You can also use a JDBC named query to submit
a custom SQL query to Hive and retrieve the results using the JDBC Connector.
This topic describes how to configure the PXF JDBC Connector to access Hive. When you configure Hive access
with JDBC, you must take into account the Hive user impersonation setting, as well as whether or not the Hadoop
cluster is secured with Kerberos.
If you do not plan to use the PXF JDBC Connector to access Hive, then you do not need to perform this procedure.
When you configure a PXF JDBC server for Hive access, you must specify the JDBC driver class name, database
URL, and client credentials just as you would when configuring a client connection to an SQL database.
To access Hive via JDBC, you must specify the following properties and values in the jdbc-site.xml server configuration file:
Property Value
jdbc.driver org.apache.hive.jdbc.HiveDriver
jdbc.url jdbc:hive2://<hiveserver2_host>:<hiveserver2_port>/<database>
The following table enumerates the Hive2 authentication and impersonation combinations supported by the PXF
JDBC Connector. It identifies the possible Hive user identities and the JDBC server configuration required for each.
Note: There are additional configuration steps required when Hive utilizes Kerberos authentication.
3. Create the $PXF_CONF/servers/<server_name> directory. For example, use the following command to create a
JDBC server configuration named hivejdbc1:
gpadmin@gpmaster$ mkdir $PXF_CONF/servers/hivejdbc1
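A typical next step copies the PXF JDBC template file into the new server directory before you edit it; a sketch:

gpadmin@gpmaster$ cp $PXF_CONF/templates/jdbc-site.xml $PXF_CONF/servers/hivejdbc1/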
6. When you access Hive secured with Kerberos, you also need to specify configuration properties in the
pxf-site.xml file. If this file does not yet exist in your server configuration, copy the pxf-site.xml template file to the
server config directory. For example:
gpadmin@gpmaster$ cp $PXF_CONF/templates/pxf-site.xml .
7. Open the jdbc-site.xml file in the editor of your choice and set thejdbc.driver and jdbc.url properties. Be sure to
specify your Hive host, port, and database name:
<property>
<name>jdbc.driver</name>
<value>org.apache.hive.jdbc.HiveDriver</value>
</property>
<property>
<name>jdbc.url</name>
<value>jdbc:hive2://<hiveserver2_host>:<hiveserver2_port>/<database></value>
</property>
8. Obtain the hive-site.xml file from your Hadoop cluster and examine the file.
10. If the hive.server2.authentication property in hive-site.xml is set to NONE, or the property is not specified, you must set
the jdbc.user property. The value to which you set the jdbc.user property is dependent upon the
hive.server2.enable.doAs impersonation setting in hive-site.xml:
1. If hive.server2.enable.doAs is set to TRUE (the default), Hive runs Hadoop operations on behalf of the user
connecting to Hive. Choose/perform one of the following options:
Set jdbc.user to specify the user that has read permission on all Hive data accessed by Greenplum
Database. For example, to connect to Hive and run all requests as user gpadmin:
<property>
<name>jdbc.user</name>
<value>gpadmin</value>
</property>
Or, turn on JDBC server-level user impersonation so that PXF automatically uses the Greenplum
Database user name to connect to Hive; uncomment the pxf.service.user.impersonation property in jdbc-site.xml
and set the value to true:
<property>
<name>pxf.service.user.impersonation</name>
<value>true</value>
</property>
If you enable JDBC impersonation in this manner, you must not specify a jdbc.user or include the setting
in the jdbc.url.
2. If required, create a PXF user configuration file as described in Configuring a PXF User to manage the
password setting.
3. If hive.server2.enable.doAs is set to FALSE, Hive runs Hadoop operations as the user who started the
HiveServer2 process, usually the user hive. PXF ignores the jdbc.user setting in this circumstance.
5. Add the saslQop property to jdbc.url, and set it to match the hive.server2.thrift.sasl.qop property setting in
hive-site.xml. For example, if the hive-site.xml file includes the following property setting:
<property>
<name>hive.server2.thrift.sasl.qop</name>
<value>auth-conf</value>
</property>
then you would add saslQop=auth-conf to the jdbc.url. For example:
jdbc:hive2://hs2server:10000/default;principal=hive/hs2server@REALM;saslQop=auth-conf
7. If hive.server2.enable.doAs is set to TRUE (the default), Hive runs Hadoop operations on behalf of the user
connecting to Hive. Choose/perform one of the following options:
Do not specify any additional properties. In this case, PXF initiates all Hadoop access with the identity
provided in the PXF Kerberos principal (usually gpadmin).
Or, set the hive.server2.proxy.user property in the jdbc.url to specify the user that has read permission on all
Hive data. For example, to connect to Hive and run all requests as the user named integration, use the
following jdbc.url:
jdbc:hive2://hs2server:10000/default;principal=hive/hs2server@REALM;saslQop=auth-conf;hive.server2.proxy.user=integration
Or, enable PXF JDBC impersonation in thepxf-site.xml file so that PXF automatically uses the
Greenplum Database user name to connect to Hive. Add or uncomment the pxf.service.user.impersonation
property and set the value to true. For example:
<property>
<name>pxf.service.user.impersonation</name>
<value>true</value>
</property>
If you enable JDBC impersonation, you must not explicitly specify a hive.server2.proxy.user in the jdbc.url.
8. If required, create a PXF user configuration file to manage the password setting.
9. If hive.server2.enable.doAs is set to FALSE, Hive runs Hadoop operations with the identity provided by the
PXF Kerberos principal (usually gpadmin).
13. Use the pxf cluster sync command to copy the new server configuration to the Greenplum Database cluster. For
example:
gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster sync
You configure the PXF agent host and port via the following environment variables:

Environment Variable   Description
PXF_HOST               The name of the host or IP address. The default host name is localhost.
PXF_PORT               The port number on which the PXF agent listens for requests on the host. The default port number is 5888.
Set the environment variables in the gpadmin user’s .bashrc shell login file on each segment host.
You must restart both Greenplum Database and PXF when you configure the agent host and/or
port in this manner. Consider performing this configuration during a scheduled down time.
Procedure
Perform the following procedure to configure the PXF agent host and/or port number on one or
more Greenplum Database segment hosts:
$ ssh gpadmin@<gpmaster>
5. Set the PXF_HOST and/or PXF_PORT environment variables. For example, to set the
PXF agent port number to 5998, add the following to the .bashrc file:
export PXF_PORT=5998
4. Restart PXF on each Greenplum Database segment host as described in Restarting PXF.
The PXF upgrade procedure describes how to upgrade PXF in your Greenplum Database
installation. This procedure uses PXF.from to refer to your currently-installed PXF version and
PXF.to to refer to the PXF version installed when you upgrade to the new version of Greenplum
Database.
The PXF upgrade procedure has two parts. You perform one procedure before, and one
procedure after, you upgrade to a new version of Greenplum Database:
$ ssh gpadmin@<gpmaster>
3. Upgrade to the new version of Greenplum Database and then continue your PXF upgrade
with Step 2: Upgrading PXF.
$ ssh gpadmin@<gpmaster>
2. Initialize PXF on each segment host as described in Initializing PXF. You may choose to
use your existing $PXF_CONF for the initialization.
3. If you are upgrading from Greenplum Database version 6.1.x or earlier and you have
configured any JDBC servers that access Kerberos-secured Hive, you must now set the
hadoop.security.authentication property in the jdbc-site.xml file to explicitly identify use of the
Kerberos authentication method. Perform the following for each of these server configs:
2. Open the jdbc-site.xml file in the editor of your choice and uncomment or add the
following property block to the file:
<property>
<name>hadoop.security.authentication</name>
<value>kerberos</value>
</property>
4. Synchronize the PXF configuration from the master host to the standby master and each
Greenplum Database segment host. For example:
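gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster sync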
pxf cluster - manage all PXF service instances in the Greenplum Database cluster
pxf - manage the PXF service instance on a specific Greenplum Database host
The pxf cluster command supports init, start, restart, status, stop, and sync subcommands. When you
run a pxf cluster subcommand on the Greenplum Database master host, you perform the
operation on all segment hosts in the Greenplum Database cluster. PXF also runs the init and
sync commands on the standby master host.
The pxf command supports init, start, stop, restart, and status operations. These operations run
locally. That is, if you want to start or stop the PXF agent on a specific Greenplum Database
segment host, you log in to the host and run the command.
Starting PXF
After initializing PXF, you must start PXF on each segment host in your Greenplum Database
cluster. The PXF service, once started, runs as the gpadmin user on default port 5888. Only the
gpadmin user can start and stop the PXF service.
If you want to change the default PXF configuration, you must update the configuration before
you start PXF.
The pxf-env.sh file exposes the following PXF runtime configuration parameters:
You must synchronize any changes that you make to pxf-env.sh, pxf-log4j.properties, or pxf-profiles.xml
to the Greenplum Database cluster, and (re)start PXF on each segment host.
Prerequisites
Before you start PXF in your Greenplum Database cluster, ensure that:
Procedure
Perform the following procedure to start PXF on each segment host in your Greenplum
Database cluster.
2. Run the pxf cluster start command to start PXF on each segment host. For example:
gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster start
Prerequisites
Before you stop PXF in your Greenplum Database cluster, ensure that your Greenplum
Database cluster is up and running.
Procedure
Perform the following procedure to stop PXF on each segment host in your Greenplum
Database cluster.
2. Run the pxf cluster stop command to stop PXF on each segment host. For example:
gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster stop
Restarting PXF
If you must restart PXF, for example if you updated PXF user configuration files in
$PXF_CONF/conf, you run pxf cluster restart to stop, and then start, PXF on all segment hosts in your
Greenplum Database cluster.
Prerequisites
Before you restart PXF in your Greenplum Database cluster, ensure that your Greenplum
Database cluster is up and running.
Procedure
Perform the following procedure to restart PXF in your Greenplum Database cluster.
$ ssh gpadmin@<gpmaster>
2. Restart PXF:
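gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster restart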
You must enable the PXF extension in each database in which you plan to use the framework to
access external data. You must also explicitly GRANT permission to the pxf protocol to those
users/roles who require access.
Perform the following procedure for each database in which you want to use PXF:
2. Create the PXF extension. You must have Greenplum Database administrator privileges to
create an extension. For example:
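postgres=# CREATE EXTENSION pxf;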
Creating the pxf extension registers the pxf protocol and the call handlers required for PXF
to access external data.
To remove PXF support from a database, drop the extension with the DROP EXTENSION pxf command.
The DROP command fails if there are any currently defined external tables using the pxf
protocol. Add the CASCADE option if you choose to forcibly remove these external tables.
To grant a specific role access to the pxf protocol, use the GRANT command. For example, to
grant the role named bill read access to data referenced by an external table created with the pxf
protocol:
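GRANT SELECT ON PROTOCOL pxf TO bill;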
To write data to an external data store with PXF, you create an external table with the CREATE
WRITABLE EXTERNAL TABLE command that specifies the pxf protocol. You must specifically grant
INSERT permission to the pxf protocol to all non-SUPERUSER Greenplum Database roles that
require such access. For example:
GRANT INSERT ON PROTOCOL pxf TO bill;
You use PXF to access data stored on external systems. Depending upon the external data
store, this access may require that you install and/or configure additional components or services.
PXF depends on JAR files and other configuration information provided by these additional
components. The $GPHOME/pxf/conf/pxf-private.classpath file identifies PXF internal JAR
dependencies. In most cases, PXF manages the pxf-private.classpath file, adding entries as
necessary based on the connectors that you use.
Should you need to add an additional JAR dependency for PXF, for example a JDBC driver JAR
file, you must log in to the Greenplum Database master host, copy the JAR file to the PXF user
configuration runtime library directory ($PXF_CONF/lib), sync the PXF configuration to the
Greenplum Database cluster, and then restart PXF on each segment host. For example:
$ ssh gpadmin@<gpmaster>
gpadmin@gpmaster$ cp new_dependent_jar.jar $PXF_CONF/lib/
gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster sync
gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster restart
Monitoring PXF
The pxf cluster status command displays the status of the PXF service instance on all segment
hosts in your Greenplum Database cluster. pxf status displays the status of the PXF service instance on the local Greenplum Database host.
Only the gpadmin user can request the status of the PXF service.
Perform the following procedure to request the PXF status of your Greenplum Database cluster.
$ ssh gpadmin@<gpmaster>
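Then run the pxf cluster status command. For example:

gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster status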
In previous versions of Greenplum Database, you may have used the gphdfs external table
protocol to access data stored in Hadoop. Greenplum Database version 6.0.0 removes the
gphdfs protocol. Use PXF and the pxf external table protocol to access Hadoop in Greenplum
Database version 6.x.
Architecture
HDFS is the primary distributed storage mechanism used by Apache Hadoop. When a user or
application performs a query on a PXF external table that references an HDFS file, the
Greenplum Database master node dispatches the query to all segment hosts. Each segment
instance contacts the PXF agent running on its host. When it receives the request from a
segment instance, the PXF agent:
A segment instance uses its Greenplum Database gp_segment_id and the file block information
described in the metadata to assign itself a specific portion of the query data. The segment
instance then sends a request to the PXF agent to read the assigned data. This data may reside
on one or more HDFS DataNodes.
The PXF agent invokes the HDFS Java API to read the data and delivers it to the segment
instance. The segment instance delivers its portion of the data to the Greenplum Database
master node. This communication occurs across segment hosts and segment instances in
parallel.
Prerequisites
Before working with Hadoop data using PXF, ensure that:
You have configured and initialized PXF, and PXF is running on each Greenplum
Database segment host. See Configuring PXF for additional information.
You have configured the PXF Hadoop Connectors that you plan to use. Refer to
Configuring PXF Hadoop Connectors for instructions. If you plan to access JSON-
formatted data stored in a Cloudera Hadoop cluster, PXF requires a Cloudera version 5.8
or later Hadoop distribution.
If user impersonation is enabled (the default), ensure that you have granted read (and write, as appropriate) permission on the HDFS files and directories that you will access as PXF external tables to each Greenplum Database user/role that will access them.
A Hadoop installation includes command-line tools that interact directly with your HDFS file
system. These tools support typical file system operations that include copying and listing files,
changing file permissions, and so forth. You run these tools on a system with a Hadoop client
installation. By default, Greenplum Database hosts do not include a Hadoop client installation.
The HDFS file system command syntax is hdfs dfs <options> [<file>]. Invoked with no options, hdfs dfs
lists the file system options supported by the tool.
The user invoking the hdfs dfs command must have read privileges on the HDFS data store to list
and view directory and file contents, and write permission to create directories and files.
The hdfs dfs options used in the PXF Hadoop topics are:
Option   Description
-cat     Display file contents.
-mkdir   Create a directory in HDFS.
-put     Copy a file from the local file system to HDFS.
Examples:
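A few illustrative invocations (the paths are placeholders):

$ hdfs dfs -mkdir -p /data/exampledir
$ hdfs dfs -put /tmp/example.txt /data/exampledir/
$ hdfs dfs -cat /data/exampledir/example.txt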
The PXF Hadoop connectors expose the following profiles to read, and in many cases write,
these supported data formats:
You provide the profile name when you specify the pxf protocol on a CREATE EXTERNAL TABLE
command to create a Greenplum Database external table that references a Hadoop file,
directory, or table. For example, the following command creates an external table that uses the
default server and specifies the profile named hdfs:text:
CREATE EXTERNAL TABLE pxf_hdfs_text(location text, month text, num_orders int, total_sales float8)
LOCATION ('pxf://data/pxf_examples/pxf_hdfs_simple.txt?PROFILE=hdfs:text')
FORMAT 'TEXT' (delimiter=E',');
The PXF HDFS Connector supports plain delimited and comma-separated value (CSV) text data. This section describes how to use PXF to
access HDFS text data, including how to create, query, and insert data into an external table that references files in the HDFS data store.
Prerequisites
Ensure that you have met the PXF Hadoop Prerequisites before you attempt to read data from or write data to HDFS.
The specific keywords and values used in the CREATE EXTERNAL TABLE command are described in the table below.
Keyword                 Value
<path‑to‑hdfs‑file>     The absolute path to the directory or file in the HDFS data store.
PROFILE                 The PROFILE keyword must specify hdfs:text.
SERVER=<server_name>    The named server configuration that PXF uses to access the data. Optional; PXF uses the default server if not specified.
FORMAT                  Use FORMAT 'TEXT' when <path-to-hdfs-file> references plain text delimited data. Use FORMAT 'CSV' when <path-to-hdfs-file> references comma-separated value data.
delimiter               The delimiter character in the data. For FORMAT 'CSV', the default <delim_value> is a comma (,). Preface the <delim_value> with an E when the value is an escape sequence. Examples: (delimiter=E'\t'), (delimiter ':').
Note: PXF does not support CSV files with a header row, nor does it support the (HEADER) formatter option in the CREATE EXTERNAL TABLE
command.
Perform the following procedure to create a sample text file, copy the file to HDFS, and use the hdfs:text profile and the default PXF server to
create two PXF external tables to query the data:
1. Create an HDFS directory for PXF example data files. For example:
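$ hdfs dfs -mkdir -p /data/pxf_examples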
$ echo 'Prague,Jan,101,4875.33
Rome,Mar,87,1557.39
Bangalore,May,317,8936.99
Beijing,Jul,411,11600.67' > /tmp/pxf_hdfs_simple.txt
Note the use of the comma (,) to separate the four data fields.
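Copy the text file to HDFS; a sketch of the command, using the example directory created above:

$ hdfs dfs -put /tmp/pxf_hdfs_simple.txt /data/pxf_examples/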
$ psql -d postgres
6. Use the PXF hdfs:text profile to create a Greenplum Database external table that references the pxf_hdfs_simple.txt file that you just created
and added to HDFS:
postgres=# CREATE EXTERNAL TABLE pxf_hdfs_textsimple(location text, month text, num_orders int, total_sales float8)
LOCATION ('pxf://data/pxf_examples/pxf_hdfs_simple.txt?PROFILE=hdfs:text')
FORMAT 'TEXT' (delimiter=E',');
8. Create a second external table that references pxf_hdfs_simple.txt, this time specifying the CSV FORMAT:
postgres=# CREATE EXTERNAL TABLE pxf_hdfs_textsimple_csv(location text, month text, num_orders int, total_sales float8)
LOCATION ('pxf://data/pxf_examples/pxf_hdfs_simple.txt?PROFILE=hdfs:text')
FORMAT 'CSV';
postgres=# SELECT * FROM pxf_hdfs_textsimple_csv;
When you specify FORMAT 'CSV' for comma-separated value data, no delimiter formatter option is required because comma is the default
delimiter value.
The specific keywords and values used in the CREATE EXTERNAL TABLE command are described in the table below.
Keyword                 Value
<path‑to‑hdfs‑file>     The absolute path to the directory or file in the HDFS data store.
PROFILE                 The PROFILE keyword must specify hdfs:text:multi.
SERVER=<server_name>    The named server configuration that PXF uses to access the data. Optional; PXF uses the default server if not specified.
FORMAT                  Use FORMAT 'TEXT' when <path-to-hdfs-file> references plain text delimited data. Use FORMAT 'CSV' when <path-to-hdfs-file> references comma-separated value data.
delimiter               The delimiter character in the data. For FORMAT 'CSV', the default <delim_value> is a comma (,). Preface the <delim_value> with an E when the value is an escape sequence. Examples: (delimiter=E'\t'), (delimiter ':').
Notice the use of the colon : to separate the three fields. Also notice the quotes around the first (address) field. This field includes an
embedded line feed separating the street address from the city and state.
4. Use the hdfs:text:multi profile to create an external table that references the pxf_hdfs_multi.txt HDFS file, making sure to identify the : (colon)
as the field separator:
postgres=# CREATE EXTERNAL TABLE pxf_hdfs_textmulti(address text, month text, year int)
LOCATION ('pxf://data/pxf_examples/pxf_hdfs_multi.txt?PROFILE=hdfs:text:multi')
FORMAT 'CSV' (delimiter ':');
Note: External tables that you create with a writable profile can only be used for INSERT operations. If you want to query the data that you
inserted, you must create a separate readable external table that references the HDFS directory.
Use the following syntax to create a Greenplum Database writable external table that references an HDFS directory:
CREATE WRITABLE EXTERNAL TABLE <table_name>
( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<path-to-hdfs-dir>
?PROFILE=hdfs:text[&SERVER=<server_name>][&<custom-option>=<value>[...]]')
FORMAT '[TEXT|CSV]' (delimiter[=|<space>][E]'<delim_value>')
[DISTRIBUTED BY (<column_name> [, ... ] ) | DISTRIBUTED RANDOMLY];
The specific keywords and values used in the CREATE EXTERNAL TABLE command are described in the table below.
Keyword                 Value
<path‑to‑hdfs‑dir>      The absolute path to the directory in the HDFS data store.
PROFILE                 The PROFILE keyword must specify hdfs:text.
SERVER=<server_name>    The named server configuration that PXF uses to access the data. Optional; PXF uses the default server if not specified.
<custom‑option>         <custom-option>s are described below.
FORMAT                  Use FORMAT 'TEXT' to write plain, delimited text to <path-to-hdfs-dir>. Use FORMAT 'CSV' to write comma-separated value text to <path-to-hdfs-dir>.
delimiter               The delimiter character in the data. For FORMAT 'CSV', the default <delim_value> is a comma (,). Preface the <delim_value> with an E when the value is an escape sequence. Examples: (delimiter=E'\t'), (delimiter ':').
DISTRIBUTED BY          If you want to load data from an existing Greenplum Database table into the writable external table, consider specifying the same distribution policy or <column_name> on both tables. Doing so will avoid extra motion of data between segments on the load operation.
Writable external tables that specify the hdfs:text profile can use one of the following compression codecs:
org.apache.hadoop.io.compress.DefaultCodec
org.apache.hadoop.io.compress.GzipCodec
org.apache.hadoop.io.compress.BZip2Codec
You specify the compression codec via custom options in the CREATE EXTERNAL TABLE LOCATION clause. The hdfs:text profile supports the
following custom write options:
Column Name        Data Type
location           text
month              text
number_of_orders   int
total_sales        float8
This example also optionally uses the Greenplum Database external table named pxf_hdfs_textsimple that you created in that exercise.
Procedure
Perform the following procedure to create Greenplum Database writable external tables utilizing the same data schema as described above,
one of which will employ compression. You will use the PXF hdfs:text profile and the default PXF server to write data to the underlying HDFS
directory. You will also create a separate, readable external table to read the data that you wrote to the HDFS directory.
1. Create a Greenplum Database writable external table utilizing the data schema described above. Write to the HDFS directory
/data/pxf_examples/pxfwritable_hdfs_textsimple1. Create the table specifying a comma , as the delimiter:
postgres=# CREATE WRITABLE EXTERNAL TABLE pxf_hdfs_writabletbl_1(location text, month text, num_orders int, total_sales float8)
LOCATION ('pxf://data/pxf_examples/pxfwritable_hdfs_textsimple1?PROFILE=hdfs:text')
FORMAT 'TEXT' (delimiter=',');
You specify the FORMAT subclause delimiter value as the single ASCII comma character (,).
2. Write a few individual records to the pxfwritable_hdfs_textsimple1 HDFS directory by invoking the SQL INSERT command on
pxf_hdfs_writabletbl_1:
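For example (these values match the HDFS output shown later in this procedure):

postgres=# INSERT INTO pxf_hdfs_writabletbl_1 VALUES ( 'Frankfurt', 'Mar', 777, 3956.98 );
postgres=# INSERT INTO pxf_hdfs_writabletbl_1 VALUES ( 'Cleveland', 'Oct', 3812, 96645.37 );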
3. (Optional) Insert the data from the pxf_hdfs_textsimple table that you created in Example: Reading Text Data on HDFS into
pxf_hdfs_writabletbl_1:
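postgres=# INSERT INTO pxf_hdfs_writabletbl_1 SELECT * FROM pxf_hdfs_textsimple;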
4. In another terminal window, display the data that you just added to HDFS:
$ hdfs dfs -cat /data/pxf_examples/pxfwritable_hdfs_textsimple1/*
Frankfurt,Mar,777,3956.98
Cleveland,Oct,3812,96645.37
Prague,Jan,101,4875.33
Rome,Mar,87,1557.39
Bangalore,May,317,8936.99
Beijing,Jul,411,11600.67
Because you specified comma , as the delimiter when you created the writable external table, this character is the field separator used in
each record of the HDFS data.
5. Greenplum Database does not support directly querying a writable external table. To query the data that you just added to HDFS, you
must create a readable external Greenplum Database table that references the HDFS directory:
LOCATION ('pxf://data/pxf_examples/pxfwritable_hdfs_textsimple1?PROFILE=hdfs:text')
FORMAT 'CSV';
You specify the 'CSV' FORMAT when you create the readable external table because you created the writable table with a comma (,) as the
delimiter character, the default delimiter for 'CSV' FORMAT.
The pxf_hdfs_textsimple_r1 table includes the records you individually inserted, as well as the full contents of the pxf_hdfs_textsimple table if
you performed the optional step.
7. Create a second Greenplum Database writable external table, this time using Gzip compression and employing a colon (:) as the
delimiter:
postgres=# CREATE WRITABLE EXTERNAL TABLE pxf_hdfs_writabletbl_2 (location text, month text, num_orders int, total_sales float8)
LOCATION ('pxf://data/pxf_examples/pxfwritable_hdfs_textsimple2?PROFILE=hdfs:text&COMPRESSION_CODEC=org.apache.hadoop.io.compress.GzipCodec')
FORMAT 'TEXT' (delimiter=':');
8. Write a few records to the pxfwritable_hdfs_textsimple2 HDFS directory by inserting directly into the pxf_hdfs_writabletbl_2 table:
gpadmin=# INSERT INTO pxf_hdfs_writabletbl_2 VALUES ( 'Frankfurt', 'Mar', 777, 3956.98 );
gpadmin=# INSERT INTO pxf_hdfs_writabletbl_2 VALUES ( 'Cleveland', 'Oct', 3812, 96645.37 );
9. In another terminal window, display the contents of the data that you added to HDFS; use the -text option to hdfs dfs to view the
compressed data as text:
$ hdfs dfs -text /data/pxf_examples/pxfwritable_hdfs_textsimple2/*
Frankfurt:Mar:777:3956.98
Cleveland:Oct:3812:96645.37
Notice that the colon : is the field separator in this HDFS data.
To query data from the newly-created HDFS directory named pxfwritable_hdfs_textsimple2, you can create a readable external Greenplum
Database table as described above that references this HDFS directory and specifies FORMAT 'CSV' (delimiter=':') .
Use the PXF HDFS Connector to read and write Avro-format data. This section describes how to use PXF to read and write Avro data in HDFS, including
how to create, query, and insert into an external table that references an Avro file in the HDFS data store.
Note: PXF does not support reading or writing compressed Avro files.
Prerequisites
Ensure that you have met the PXF Hadoop Prerequisites before you attempt to read data from HDFS.
To represent Avro primitive data types in Greenplum Database, map data values to Greenplum Database columns of the same type.
Avro supports complex data types including arrays, maps, records, enumerations, and fixed types. Map top-level fields of these complex data types to
the Greenplum Database TEXT type. While Greenplum Database does not natively support these types, you can create Greenplum Database functions
or application code to extract or further process subcomponents of these complex data types.
The following table summarizes external mapping rules for Avro data.
Avro schemas are defined using JSON, and composed of the same primitive and complex types identified in the data type mapping section above. Avro
schema files typically have a .avsc suffix.
An Avro data file contains the schema and a compact binary representation of the data. Avro data files typically have the .avro suffix.
You can specify an Avro schema on both read and write operations to HDFS. You can provide either a binary *.avro file or a JSON-format *.avsc file for the
schema file:
External Table Type   Schema Specified?   Description
readable              yes                 PXF uses the specified schema; this overrides the schema embedded in the Avro data file.
readable              no                  PXF uses the schema embedded in the Avro data file.
writable              yes                 PXF uses the specified schema.
writable              no                  PXF creates the Avro schema based on the external table definition.
When you provide the Avro schema file to PXF, the file must reside in the same location on each Greenplum Database segment host or the file may
reside on the Hadoop file system. PXF first searches for an absolute file path on the Greenplum segment hosts. If PXF does not find the schema file
there, it searches for the file relative to the PXF classpath. If PXF cannot find the schema file locally, it searches for the file on HDFS.
The $PXF_CONF/conf directory is in the PXF classpath. PXF can locate an Avro schema file that you add to this directory on every Greenplum Database
segment host.
See Writing Avro Data for additional schema considerations when writing Avro data to HDFS.
The specific keywords and values used in the CREATE EXTERNAL TABLE command are described in the table below.
Keyword                 Value
<path‑to‑hdfs‑file>     The absolute path to the directory or file in the HDFS data store.
PROFILE                 The PROFILE keyword must specify hdfs:avro.
SERVER=<server_name>    The named server configuration that PXF uses to access the data. Optional; PXF uses the default server if not specified.
<custom‑option>         <custom-option>s are discussed below.
FORMAT 'CUSTOM'         Use FORMAT 'CUSTOM' with (FORMATTER='pxfwritable_export') (write) or (FORMATTER='pxfwritable_import') (read).
DISTRIBUTED BY          If you want to load data from an existing Greenplum Database table into the writable external table, consider specifying the same distribution policy or <column_name> on both tables. Doing so will avoid extra motion of data between segments on the load operation.
For complex types, the PXF hdfs:avro profile inserts default delimiters between collection items and values before display. You can use non-default
delimiter characters by identifying values for specific hdfs:avro custom options in the CREATE EXTERNAL TABLE command.
id - long
username - string
followers - array of string
fmap - map of long
relationship - enumerated type
address - record comprised of street number (int), street name (string), and city (string)
Create Schema
Perform the following operations to create an Avro schema to represent the example schema described above.
$ vi /tmp/avro_schema.avsc
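A sketch of a schema consistent with the field list above; the record and enum names are illustrative:

{
  "type": "record",
  "name": "example_schema",
  "namespace": "com.example",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "username", "type": "string"},
    {"name": "followers", "type": {"type": "array", "items": "string"}},
    {"name": "fmap", "type": {"type": "map", "values": "long"}},
    {"name": "relationship",
     "type": {"type": "enum", "name": "relationshipEnum",
              "symbols": ["MARRIED", "LOVE", "FRIEND", "COLLEAGUE", "STRANGER", "ENEMY"]}},
    {"name": "address",
     "type": {"type": "record", "name": "addressRecord",
              "fields": [
                {"name": "number", "type": "int"},
                {"name": "street", "type": "string"},
                {"name": "city", "type": "string"}]}}
  ]
}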
{"id":2, "username":"jim","followers":["john", "pam"], "relationship": "COLLEAGUE", "fmap": {"john":3,"pam":3}, "address":{"number":9, "street":"deer creek", "city":"palo alto"}}
The sample data uses a comma (,) to separate top-level records and a colon (:) to separate map key/values and record field name/values.
3. Convert the text file to Avro format. There are various ways to perform the conversion, both programmatically and via the command line. In this
example, we use the Java Avro tools; the jar avro-tools-1.9.1.jar file resides in the current directory:
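A sketch of the conversion, assuming the JSON records were saved to a file named /tmp/pxf_avro.txt, followed by copying the result to HDFS:

$ java -jar ./avro-tools-1.9.1.jar fromjson --schema-file /tmp/avro_schema.avsc /tmp/pxf_avro.txt > /tmp/pxf_avro.avro
$ hdfs dfs -put /tmp/pxf_avro.avro /data/pxf_examples/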
1. Use the hdfs:avro profile to create a queryable external table from the pxf_avro.avro file:
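A sketch of the table definition and query, consistent with the column list and output shown below; the custom delimiter option settings (COLLECTION_DELIM, MAPKEY_DELIM, RECORDKEY_DELIM) are assumptions:

postgres=# CREATE EXTERNAL TABLE pxf_hdfs_avro(id bigint, username text, followers text, fmap text, relationship text, address text)
            LOCATION ('pxf://data/pxf_examples/pxf_avro.avro?PROFILE=hdfs:avro&COLLECTION_DELIM=,&MAPKEY_DELIM=:&RECORDKEY_DELIM=:')
            FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
postgres=# SELECT * FROM pxf_hdfs_avro;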
The simple query of the external table shows the components of the complex type data separated with the delimiters specified in the CREATE
EXTERNAL TABLE call.
3. Process the delimited components in the text columns as necessary for your application. For example, the following command uses the Greenplum
Database internal string_to_array function to convert entries in the followers field to a text array column in a new view.
postgres=# CREATE VIEW followers_view AS
SELECT username, address, string_to_array(substring(followers FROM 2 FOR (char_length(followers) - 2)), ',')::text[]
AS followers
FROM pxf_hdfs_avro;
4. Query the view to filter rows based on whether a particular follower appears in the view:
postgres=# SELECT username, address FROM followers_view WHERE followers @> '{john}';
username | address
----------+---------------------------------------------
jim | {number:9,street:deer creek,city:palo alto}
If you do not specify a SCHEMA file, PXF generates a schema for the Avro file based on the Greenplum Database external table definition. PXF assigns
the name of the external table column to the Avro field name. Because Avro has a null type and Greenplum external tables do not support the NOT NULL
column qualifier, PXF wraps each data type in an Avro union of the mapped type and null. For example, for a writable external table column that you define
with the Greenplum Database text data type, PXF generates the following schema element:
["string", "null"]
PXF returns an error if you provide a schema that does not include a union of the field data type with null, and PXF encounters a NULL data field.
PXF supports writing only Avro primitive data types. It does not support writing complex types to Avro:
When you specify a SCHEMA file in the LOCATION, the schema must include only primitive data types.
When PXF generates the schema, it writes any complex type that you specify in the writable external table column definition to the Avro file as a
single Avro string type. For example, if you write an array of integers, PXF converts the array to a string, and you must read this data with a
Greenplum text-type column.
The Avro file that you create and read in this example includes the following fields:
id: int
username: text
followers: text[]
Example procedure:
PXF uses the external table definition to generate the Avro schema.
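The first step of the procedure creates a writable external table and inserts a few rows; a sketch, consistent with the field list above and the query output below (the table name pxfwrite_avro is illustrative):

postgres=# CREATE WRITABLE EXTERNAL TABLE pxfwrite_avro(id int, username text, followers text[])
            LOCATION ('pxf://data/pxf_examples/pxfwrite.avro?PROFILE=hdfs:avro')
            FORMAT 'CUSTOM' (FORMATTER='pxfwritable_export');
postgres=# INSERT INTO pxfwrite_avro VALUES (77, 'lisa', '{tom,mary}');
postgres=# INSERT INTO pxfwrite_avro VALUES (33, 'oliver', '{alex,frank}');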
2. Create an external table to read the Avro data that you just inserted into the table:
postgres=# CREATE EXTERNAL TABLE read_pxfwrite(id int, username text, followers text)
LOCATION ('pxf://data/pxf_examples/pxfwrite.avro?PROFILE=hdfs:avro')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
id | username | followers
----+----------+--------------
77 | lisa | {tom,mary}
33 | oliver | {alex,frank}
(2 rows)
followers is a single string comprised of the text array elements that you inserted into the table.
Reading JSON Data
Use the PXF HDFS Connector to read JSON-format data. This section describes how to use
PXF to access JSON data in HDFS, including how to create and query an external table that
references a JSON file in the HDFS data store.
Prerequisites
Ensure that you have met the PXF Hadoop Prerequisites before you attempt to read data from
HDFS.
A .json file will contain a collection of objects. A JSON object is a collection of unordered
name/value pairs. A value can be a string, a number, true, false, null, or an object or an array.
You can define nested JSON objects and arrays.
{
"created_at":"MonSep3004:04:53+00002013",
"id_str":"384529256681725952",
"user": {
"id":31424214,
"location":"COLUMBUS"
},
"coordinates":{
"type":"Point",
"values":[
13,
99
]
}
}
In the sample above, user is an object composed of fields named id and location. To specify the
nested fields in the user object as Greenplum Database external table columns, use . projection:
user.id
user.location
coordinates is an object composed of a text field named type and an array of integers named values .
Use [] to identify specific elements of the values array as Greenplum Database external table
columns:
coordinates.values[0]
coordinates.values[1]
To represent JSON data in Greenplum Database, map data values that use a primitive data type to Greenplum Database columns of the same type. JSON supports complex data types including nested objects and arrays. Use N-level projection to map members of nested objects and arrays to primitive data types.
The following table summarizes external mapping rules for JSON data.
PXF supports two data read modes. The default mode expects one full JSON record per line.
PXF also supports a read mode operating on JSON records that span multiple lines.
In upcoming examples, you will use both read modes to operate on a sample data set. The
schema of the sample data set defines objects with the following member names and value data
types:
“created_at” - text
“id_str” - text
“user” - object
“id” - integer
“location” - text
“coordinates” - object (optional)
“type” - text
“values” - array
[0] - integer
[1] - integer
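The single-line JSON record data set is not reproduced above; it consists of the same two records as the multi-line set below, each written on its own line, for example:

{"created_at":"Mon Sep 30 04:04:53 +0000 2013","id_str":"384529256681725952","user":{"id":31424214,"location":"COLUMBUS"},"coordinates":null}
{"created_at":"Mon Sep 30 04:04:54 +0000 2013","id_str":"384529260872228864","user":{"id":67600981,"location":"KryberWorld"},"coordinates":{"type":"Point","values":[8,52]}}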
The multi-line JSON record data set follows:
{
"root":[
{
"record_obj":{
"created_at":"MonSep3004:04:53+00002013",
"id_str":"384529256681725952",
"user":{
"id":31424214,
"location":"COLUMBUS"
},
"coordinates":null
},
"record_obj":{
"created_at":"MonSep3004:04:54+00002013",
"id_str":"384529260872228864",
"user":{
"id":67600981,
"location":"KryberWorld"
},
"coordinates":{
"type":"Point",
"values":[
8,
52
]
}
}
}
]
}
You will create JSON files for the sample data sets and add them to HDFS in the next section.
Copy and paste the single line JSON record sample data set above to a file namedsingleline.json .
Similarly, copy and paste the multi-line JSON record data set to a file named multiline.json.
Note: Ensure that there are no blank lines in your JSON files.
Copy the JSON data files that you just created to your HDFS data store. Create the
/data/pxf_examples directory if you did not do so in a previous exercise. For example:
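A sketch, assuming the HDFS client is on your path and the JSON files reside in the current directory:

$ hdfs dfs -mkdir -p /data/pxf_examples
$ hdfs dfs -put singleline.json /data/pxf_examples
$ hdfs dfs -put multiline.json /data/pxf_examples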
Once the data is loaded to HDFS, you can use Greenplum Database and PXF to query and
analyze the JSON data.
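The general form of the CREATE EXTERNAL TABLE command for the hdfs:json profile, sketched here from the keyword table below, is:

CREATE EXTERNAL TABLE <table_name>
    ( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<path-to-hdfs-file>?PROFILE=hdfs:json[&SERVER=<server_name>][&<custom-option>=<value>[...]]')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');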
The specific keywords and values used in the CREATE EXTERNAL TABLE command are
described in the table below.
Keyword                  Value
<path-to-hdfs-file>      The absolute path to the directory or file in the HDFS data store.
PROFILE                  The PROFILE keyword must specify hdfs:json.
SERVER=<server_name>     The named server configuration that PXF uses to access the data. Optional; PXF uses the default server if not specified.
<custom-option>          <custom-option>s are discussed below.
FORMAT 'CUSTOM'          Use FORMAT 'CUSTOM' with the hdfs:json profile. The CUSTOM FORMAT requires that you specify (FORMATTER='pxfwritable_import').
PXF supports single- and multi-line JSON records. When you want to read multi-line JSON records, you must provide an IDENTIFIER <custom-option> and value. Use this <custom-option> to identify the member name of the first field in the JSON record object:
Option Keyword   Syntax, Example(s)        Description
IDENTIFIER       &IDENTIFIER=<value>       You must include the IDENTIFIER keyword and <value> in the LOCATION string only when you are accessing JSON data comprised of multi-line records. Use the <value> to identify the member name of the first field in the JSON record object.
                 &IDENTIFIER=created_at
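The example CREATE EXTERNAL TABLE statements that the following note refers to are not reproduced above; sketches for the single-line and multi-line cases follow (the table names are assumptions):

postgres=# CREATE EXTERNAL TABLE singleline_json_tbl(
             created_at text,
             id_str text,
             "user.id" integer,
             "user.location" text,
             "coordinates.values[0]" integer,
             "coordinates.values[1]" integer
           )
           LOCATION ('pxf://data/pxf_examples/singleline.json?PROFILE=hdfs:json')
           FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

postgres=# CREATE EXTERNAL TABLE multiline_json_tbl(
             created_at text,
             id_str text,
             "user.id" integer,
             "user.location" text,
             "coordinates.values[0]" integer,
             "coordinates.values[1]" integer
           )
           LOCATION ('pxf://data/pxf_examples/multiline.json?PROFILE=hdfs:json&IDENTIFIER=created_at')
           FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');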
Notice the use of . projection to access the nested fields in the user and coordinates objects. Also notice that created_at identifies the member name of the first field in the JSON record record_obj in the sample data schema.
Reading and Writing Parquet Data
Use the PXF HDFS connector to read and write Parquet-format data. This section describes how to read and write HDFS
files that are stored in Parquet format, including how to create, query, and insert into external tables that reference files in the
HDFS data store.
PXF currently supports reading and writing primitive Parquet data types only.
Prerequisites
Ensure that you have met the PXF Hadoop Prerequisites before you attempt to read data from or write data to HDFS.
Parquet supports a small set of primitive data types, and uses metadata annotations to extend the data types that it supports.
These annotations specify how to interpret the primitive type. For example, Parquet stores both INTEGER and DATE types as
the INT32 primitive type. An annotation identifies the original type as a DATE.
Read Mapping
PXF uses the following data type mapping when reading Parquet data:
Note: PXF supports filter predicate pushdown on all Parquet data types listed above, except the fixed_len_byte_array and int96 types.
Write Mapping
PXF uses the following data type mapping when writing Parquet data:
1 PXF localizes a Timestamp to the current system timezone and converts it to universal time (UTC) before finally converting to
int96.
2 PXF converts a Timestamptz to a UTC timestamp and then converts to int96. PXF loses the time zone information during this
conversion.
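Use syntax of the following form (a sketch mirroring the keyword table below) to create an external table that references Parquet-format data in HDFS:

CREATE [WRITABLE] EXTERNAL TABLE <table_name>
    ( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<path-to-hdfs-file>?PROFILE=hdfs:parquet[&SERVER=<server_name>][&<custom-option>=<value>[...]]')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_export'|'pxfwritable_import')
[DISTRIBUTED BY (<column_name> [, ... ] ) | DISTRIBUTED RANDOMLY];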
The specific keywords and values used in the CREATE EXTERNAL TABLE command are described in the table below.
Keyword                    Value
<path-to-hdfs-file>        The absolute path to the directory in the HDFS data store.
PROFILE                    The PROFILE keyword must specify hdfs:parquet.
SERVER=<server_name>       The named server configuration that PXF uses to access the data. Optional; PXF uses the default server if not specified.
<custom-option>=<value>    <custom-option>s are described below.
FORMAT 'CUSTOM'            Use FORMAT 'CUSTOM' with (FORMATTER='pxfwritable_export') (write) or (FORMATTER='pxfwritable_import') (read).
DISTRIBUTED BY             If you want to load data from an existing Greenplum Database table into the writable external table, consider specifying the same distribution policy or <column_name> on both tables. Doing so will avoid extra motion of data between segments on the load operation.
The PXF hdfs:parquet profile supports encoding- and compression-related write options. You specify these write options in the
CREATE WRITABLE EXTERNAL TABLE LOCATION clause. The hdfs:parquet profile supports the following custom options:
Note: You must explicitly specify uncompressed if you do not want PXF to compress the data.
Parquet files that you write to HDFS with PXF have the following naming format: <file>.<compress_extension>.parquet, for example 1547061635-0000004417_0.gz.parquet.
Example
This example utilizes the data schema introduced in Example: Reading Text Data on HDFS.
Column Name        Data Type
location           text
month              text
number_of_orders   int
total_sales        float8
In this example, you create a Parquet-format writable external table that uses the default PXF server to reference Parquet-
format data in HDFS, insert some data into the table, and then create a readable external table to read the data.
1. Use the hdfs:parquet profile to create a writable external table. For example:
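A sketch of such a table; the HDFS directory path is an assumption inferred from the read example below:

postgres=# CREATE WRITABLE EXTERNAL TABLE pxf_tbl_parquet(location text, month text, number_of_orders int, total_sales double precision)
            LOCATION ('pxf://data/pxf_examples/pxf_parquet?PROFILE=hdfs:parquet')
            FORMAT 'CUSTOM' (FORMATTER='pxfwritable_export');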
2. Write a few records to the pxf_parquet HDFS directory by inserting directly into the pxf_tbl_parquet table. For example:
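A sketch, reusing rows from the sample data set introduced earlier:

postgres=# INSERT INTO pxf_tbl_parquet VALUES ('Prague', 'Jan', 101, 4875.33);
postgres=# INSERT INTO pxf_tbl_parquet VALUES ('Rome', 'Mar', 87, 1557.39);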
3. Recall that Greenplum Database does not support directly querying a writable external table. To read the data in pxf_parquet, create a readable external Greenplum Database table referencing this HDFS directory:
postgres=# CREATE EXTERNAL TABLE read_pxf_parquet(location text, month text, number_of_orders int, total_sales double precision)
LOCATION ('pxf://data/pxf_examples/pxf_parquet?PROFILE=hdfs:parquet')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
Reading and Writing SequenceFile Data
The PXF HDFS connector supports SequenceFile format binary data. This section describes how to use PXF to read and write HDFS SequenceFile data,
including how to create, insert, and query data in external tables that reference files in the HDFS data store.
Prerequisites
Ensure that you have met the PXF Hadoop Prerequisites before you attempt to read data from or write data to HDFS.
Note: External tables that you create with a writable profile can only be used for INSERT operations. If you want to query the data that you inserted, you
must create a separate readable external table that references the HDFS directory.
Use the following syntax to create a Greenplum Database external table that references an HDFS directory:
CREATE [WRITABLE] EXTERNAL TABLE <table_name>
( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<path-to-hdfs-dir>
?PROFILE=hdfs:SequenceFile[&SERVER=<server_name>][&<custom-option>=<value>[...]]')
FORMAT 'CUSTOM' (<formatting-properties>)
[DISTRIBUTED BY (<column_name> [, ... ] ) | DISTRIBUTED RANDOMLY];
The specific keywords and values used in the CREATE EXTERNAL TABLE command are described in the table below.
Keyword                 Value
<path-to-hdfs-dir>      The absolute path to the directory in the HDFS data store.
PROFILE                 The PROFILE keyword must specify hdfs:SequenceFile.
SERVER=<server_name>    The named server configuration that PXF uses to access the data. Optional; PXF uses the default server if not specified.
<custom-option>         <custom-option>s are described below.
FORMAT                  Use FORMAT 'CUSTOM' with (FORMATTER='pxfwritable_export') (write) or (FORMATTER='pxfwritable_import') (read).
DISTRIBUTED BY          If you want to load data from an existing Greenplum Database table into the writable external table, consider specifying the same distribution policy or <column_name> on both tables. Doing so will avoid extra motion of data between segments on the load operation.
SequenceFile format data can optionally employ record or block compression. The PXF hdfs:SequenceFile profile supports the following compression codecs:
org.apache.hadoop.io.compress.DefaultCodec
org.apache.hadoop.io.compress.BZip2Codec
When you use the hdfs:SequenceFile profile to write SequenceFile format data, you must provide the name of the Java class to use for serializing/deserializing
the binary data. This class must provide read and write methods for each data type referenced in the data schema.
You specify the compression codec and Java serialization class via custom options in the CREATE EXTERNAL TABLE LOCATION clause. The hdfs:SequenceFile profile supports the following custom options:
In this example, you create a Java class named PxfExample_CustomWritable that will serialize/deserialize the fields in the sample schema used in previous examples. You will then use this class to access a writable external table that you create with the hdfs:SequenceFile profile and that uses the default PXF server.
Perform the following procedure to create the Java class and writable table.
1. Create the Java class file. For example:
$ mkdir -p pxfex/com/example/pxf/hdfs/writable/dataschema
$ cd pxfex/com/example/pxf/hdfs/writable/dataschema
$ vi PxfExample_CustomWritable.java
2. Copy and paste the following text into the PxfExample_CustomWritable.java file:
package com.example.pxf.hdfs.writable.dataschema;
import org.apache.hadoop.io.*;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.lang.reflect.Field;
/**
* PxfExample_CustomWritable class - used to serialize and deserialize data with
* text, int, and float data types
*/
public class PxfExample_CustomWritable implements Writable {

    // Fields corresponding to the sample schema:
    // location (text), month (text), number_of_orders (int), total_sales (float)
    public String st1, st2;
    public int int1;
    public float ft;

    public PxfExample_CustomWritable() {
        st1 = new String("");
        st2 = new String("");
        int1 = 0;
        ft = 0.f;
    }

    String GetSt1() {
        return st1;
    }

    String GetSt2() {
        return st2;
    }

    int GetInt1() {
        return int1;
    }

    float GetFt() {
        return ft;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        // Serialize the fields in schema order (a minimal implementation)
        Text.writeString(out, st1);
        Text.writeString(out, st2);
        out.writeInt(int1);
        out.writeFloat(ft);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // Deserialize the fields in the same order in which they were written
        st1 = Text.readString(in);
        st2 = Text.readString(in);
        int1 = in.readInt();
        ft = in.readFloat();
    }
}
3. Compile and create a Java class JAR file forPxfExample_CustomWritable. Provide a classpath that includes the hadoop-common.jar file for your Hadoop
distribution. For example, if you installed the Hortonworks Data Platform Hadoop client:
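A sketch of the compile and packaging commands; the hadoop-common JAR path shown is the typical HDP client location and is an assumption for your environment:

$ javac -classpath /usr/hdp/current/hadoop-client/hadoop-common.jar PxfExample_CustomWritable.java
$ cd ../../../../../..
$ jar cf pxfex-customwritable.jar com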
4. Copy the pxfex-customwritable.jar file to the Greenplum Database master node. For example:
6. Copy the pxfex-customwritable.jar JAR file to the user runtime library directory, and note the location. For example, if PXF_CONF=/usr/local/greenplum-pxf:
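A sketch, assuming $PXF_CONF/lib is your user runtime library directory and the JAR was copied to the gpadmin home directory on the master host:

gpadmin@gpmaster$ cp /home/gpadmin/pxfex-customwritable.jar /usr/local/greenplum-pxf/lib/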
7. Synchronize the PXF configuration to the Greenplum Database cluster. For example:
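A sketch, using the pxf cluster command from the Greenplum Database master host:

gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster sync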
8. Restart PXF on each Greenplum Database segment host as described in Restarting PXF.
9. Use the PXF hdfs:SequenceFile profile to create a Greenplum Database writable external table. Identify the serialization/deserialization Java class you
created above in the DATA-SCHEMA <custom-option>. Use BLOCK mode compression with BZip2 when you create the writable table.
postgres=# CREATE WRITABLE EXTERNAL TABLE pxf_tbl_seqfile (location text, month text, number_of_orders integer, total_sales real)
LOCATION ('pxf://data/pxf_examples/pxf_seqfile?PROFILE=hdfs:SequenceFile&DATA-SCHEMA=com.example.pxf.hdfs.writable.dataschema.PxfExample_CustomWritable&COMPRESSION
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_export');
Notice that the 'CUSTOM' FORMAT <formatting-properties> specifies the built-in pxfwritable_export formatter.
10. Write a few records to the pxf_seqfile HDFS directory by inserting directly into the pxf_tbl_seqfile table. For example:
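A sketch, reusing rows from the sample data set:

postgres=# INSERT INTO pxf_tbl_seqfile VALUES ('Prague', 'Jan', 101, 4875.33);
postgres=# INSERT INTO pxf_tbl_seqfile VALUES ('Rome', 'Mar', 87, 1557.39);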
11. Recall that Greenplum Database does not support directly querying a writable external table. To read the data in pxf_seqfile, create a readable external Greenplum Database table referencing this HDFS directory:
postgres=# CREATE EXTERNAL TABLE read_pxf_tbl_seqfile (location text, month text, number_of_orders integer, total_sales real)
LOCATION ('pxf://data/pxf_examples/pxf_seqfile?PROFILE=hdfs:SequenceFile&DATA-SCHEMA=com.example.pxf.hdfs.writable.dataschema.PxfExample_CustomWritable')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
You must specify the DATA-SCHEMA <custom-option> when you read HDFS data via the hdfs:SequenceFile profile. You need not provide compression-related options.
Reading the Record Key
When the data format stores rows in a key-value form, as SequenceFile format does, you can access the key in Greenplum Database queries by defining a field with the PXF reserved column name recordkey. The field type of recordkey must correspond to the key type, much as the other fields must match the HDFS data. recordkey can be any one of the following Hadoop types:
BooleanWritable
ByteWritable
DoubleWritable
FloatWritable
IntWritable
LongWritable
Text
If no record key is defined for a row, Greenplum Database returns the id of the segment that processed the row.
Create an external readable table to access the record keys from the writable table pxf_tbl_seqfile that you created in Example: Writing Binary Data to HDFS. Define the recordkey in this example to be of type int8.
postgres=# CREATE EXTERNAL TABLE read_pxf_tbl_seqfile_recordkey(recordkey int8, location text, month text, number_of_orders integer, total_sales real)
LOCATION ('pxf://data/pxf_examples/pxf_seqfile?PROFILE=hdfs:SequenceFile&DATA-SCHEMA=com.example.pxf.hdfs.writable.dataschema.PxfExample_CustomWritable')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
gpadmin=# SELECT * FROM read_pxf_tbl_seqfile_recordkey;
You did not define a record key when you inserted the rows into the writable table, so the recordkey identifies the segment on which the row data was processed.
Reading a Multi-Line Text File into a Single Table Row
You can use the PXF HDFS connector to read one or more multi-line text files in HDFS each as
a single table row. This may be useful when you want to read multiple files into the same
Greenplum Database external table, for example when individual JSON files each contain a
separate record.
PXF supports reading only text and JSON files in this manner.
Note: Refer to the Reading JSON Data from HDFS topic if you want to use PXF to read JSON
files that include more than one record.
Prerequisites
Ensure that you have met the PXF Hadoop Prerequisites before you attempt to read files from
HDFS.
PXF reads the complete file data into a single row and column. When you create the external
table to read multiple files, you must ensure that all of the files that you want to read are of the
same (text or JSON) type. You must also specify a single text or json column, depending upon the
file type.
The following syntax creates a Greenplum Database readable external table that references one
or more text or JSON files on HDFS:
CREATE EXTERNAL TABLE <table_name>
( <column_name> text|json | LIKE <other_table> )
LOCATION ('pxf://<path-to-files>?PROFILE=hdfs:text:multi[&SERVER=<server_name>]&FILE_AS_ROW=true')
FORMAT 'CSV';
The keywords and values used in this CREATE EXTERNAL TABLE command are described in
the table below.
Keyword                 Value
<path-to-files>         The absolute path to the directory or files in the HDFS data store.
PROFILE                 The PROFILE keyword must specify hdfs:text:multi.
SERVER=<server_name>    The named server configuration that PXF uses to access the data. Optional; PXF uses the default server if not specified.
FILE_AS_ROW=true        The required option that instructs PXF to read each file into a single table row.
FORMAT                  The FORMAT must specify 'CSV'.
For example, if /data/pxf_examples/jdir identifies an HDFS directory that contains a number of JSON
files, the following statement creates a Greenplum Database external table that references all of
the files in that directory:
CREATE EXTERNAL TABLE pxf_readjfiles(j1 json)
LOCATION ('pxf://data/pxf_examples/jdir?PROFILE=hdfs:text:multi&FILE_AS_ROW=true')
FORMAT 'CSV';
When you query the pxf_readjfiles table with a SELECT statement, PXF returns the contents of each
JSON file in jdir/ as a separate row in the external table.
When you read JSON files, you can use the JSON functions provided in Greenplum Database to
access individual data fields in the JSON record. For example, if the pxf_readjfiles external table
above reads a JSON file that contains this JSON record:
{
"root":[
{
"record_obj":{
"created_at":"MonSep3004:04:53+00002013",
"id_str":"384529256681725952",
"user":{
"id":31424214,
"location":"COLUMBUS"
},
"coordinates":null
}
}
]
}
You can use the json_array_elements() function to extract specific JSON fields from the table row.
For example, the following command displays the user->id field:
SELECT json_array_elements(j1->'root')->'record_obj'->'user'->'id'
AS userid FROM pxf_readjfiles;
userid
----------
31424214
(1 row)
Refer to Working with JSON Data for specific information on manipulating JSON data with
Greenplum Database.
Perform the following procedure to create 3 sample text files in an HDFS directory, and use the
PXF hdfs:text:multi profile and the default PXF server to read all of these text files in a single
external table query.
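The earlier steps of this procedure (creating the files and copying them to HDFS) are not reproduced above; a sketch covering the first two files follows (local file names are assumptions, and the third file containing quoted multi-line address data is created the same way):

$ echo 'Prague,Jan,101,4875.33
Rome,Mar,87,1557.39
Bangalore,May,317,8936.99
Beijing,Jul,411,11600.67' > /tmp/file1.txt
$ echo 'text file with only one line' > /tmp/file2.txt
$ hdfs dfs -mkdir -p /data/pxf_examples/tdir
$ hdfs dfs -put /tmp/file1.txt /tmp/file2.txt /data/pxf_examples/tdir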
8. Use the hdfs:text:multi profile to create an external table that references the tdir HDFS
directory. For example:
CREATE EXTERNAL TABLE pxf_readfileasrow(c1 text)
LOCATION ('pxf://data/pxf_examples/tdir?PROFILE=hdfs:text:multi&FILE_AS_ROW=true')
FORMAT 'CSV';
postgres=# \x on
postgres=# SELECT * FROM pxf_readfileasrow;
-[ RECORD 1 ]---------------------------
c1 | Prague,Jan,101,4875.33
| Rome,Mar,87,1557.39
| Bangalore,May,317,8936.99
| Beijing,Jul,411,11600.67
-[ RECORD 2 ]---------------------------
c1 | text file with only one line
-[ RECORD 3 ]---------------------------
c1 | "4627 Star Rd.
Reading Hive Table Data
Apache Hive is a distributed data warehousing infrastructure. Hive facilitates managing large data sets and supports multiple data formats, including comma-separated value (.csv) TextFile, RCFile, ORC, and Parquet.
The PXF Hive connector reads data stored in a Hive table. This section describes how to use the PXF Hive connector.
Prerequisites
Before working with Hive table data using PXF, ensure that you have met the PXF Hadoop Prerequisites.
If you plan to use PXF filter pushdown with Hive integral types, ensure that the configuration parameter
hive.metastore.integral.jdo.pushdown exists and is set to true in the hive-site.xml file in both your Hadoop cluster and
$PXF_CONF/servers/default/hive-site.xml. Refer to About Updating Hadoop Configuration for more information.
Note: The Hive profile supports all file storage formats. It will use the optimal Hive* profile for the underlying file format type.
The following table summarizes external mapping rules for Hive primitive types.
Note: The HiveVectorizedORC profile does not support the timestamp data type.
Hive supports complex data types including array, struct, map, and union. PXF maps each of these complex types totext. You
can create Greenplum Database functions or application code to extract subcomponents of these complex data types.
Examples using complex data types with the Hive and HiveORC profiles are provided later in this topic.
Sample Data Set
The examples in this section operate on a common data set with the following column names and data types:
Column Name        Data Type
location           text
month              text
number_of_orders   integer
total_sales        double
1. Create a text file that you will use as the Hive data source:
$ vi /tmp/pxf_hive_datafile.txt
2. Add the following data to pxf_hive_datafile.txt; notice the use of the comma (,) to separate the four field values:
Prague,Jan,101,4875.33
Rome,Mar,87,1557.39
Bangalore,May,317,8936.99
Beijing,Jul,411,11600.67
San Francisco,Sept,156,6846.34
Paris,Nov,159,7134.56
San Francisco,Jan,113,5397.89
Prague,Dec,333,9894.77
Bangalore,Jul,271,8320.55
Beijing,Dec,100,4248.41
Make note of the path to pxf_hive_datafile.txt; you will use it in later exercises.
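The Hive DDL that the following notes refer to is not reproduced above; a sketch based on the sample schema:

$ HADOOP_USER_NAME=hdfs hive
hive> CREATE TABLE sales_info (location string, month string, number_of_orders int, total_sales double)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        STORED AS textfile;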
Notice that:
The STORED AS textfile subclause instructs Hive to create the table in Textfile (the default) format. Hive Textfile
format supports comma-, tab-, and space-separated values, as well as data specified in JSON notation.
The DELIMITED FIELDS TERMINATED BY subclause identifies the field delimiter within a data record (line). Thesales_info
table field delimiter is a comma (,).
2. Load the pxf_hive_datafile.txt sample data file into the sales_info table that you just created:
hive> LOAD DATA LOCAL INPATH '/tmp/pxf_hive_datafile.txt'
INTO TABLE sales_info;
In examples later in this section, you will access the sales_info Hive table directly via PXF. You will also insert sales_info data into tables of other Hive file format types, and use PXF to access those directly as well.
3. Perform a query on sales_info to verify that you loaded the data successfully:
hive> SELECT * FROM sales_info;
Should you need to identify the HDFS file location of a Hive managed table, reference it using its HDFS file path. You can
determine a Hive table’s location in HDFS using the DESCRIBE command. For example:
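A sketch; the table location appears in the Detailed Table Information section of the command's output:

hive> DESCRIBE EXTENDED sales_info;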
The HiveText and HiveRC profiles are specifically optimized for text and RCFile formats, respectively. The HiveORC and
HiveVectorizedORC profiles are optimized for ORC file formats. The Hive profile is optimized for all file storage types; you can use
the Hive profile when the underlying Hive table is composed of multiple partitions with differing file formats.
Use the following syntax to create a Greenplum Database external table that references a Hive table:
CREATE EXTERNAL TABLE <table_name>
( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<hive-db-name>.<hive-table-name>
    ?PROFILE=Hive|HiveText|HiveRC|HiveORC|HiveVectorizedORC[&SERVER=<server_name>]')
FORMAT 'CUSTOM|TEXT' (FORMATTER='pxfwritable_import' | delimiter='<delim>')
Hive connector-specific keywords and values used in the CREATE EXTERNAL TABLE call are described below.
Keyword                                                Value
<hive-db-name>                                         The name of the Hive database. If omitted, defaults to the Hive database named default.
<hive-table-name>                                      The name of the Hive table.
PROFILE                                                The PROFILE keyword must specify one of the values Hive, HiveText, HiveRC, HiveORC, or HiveVectorizedORC.
SERVER=<server_name>                                   The named server configuration that PXF uses to access the data. Optional; PXF uses the default server if not specified.
FORMAT (Hive, HiveORC, HiveVectorizedORC profiles)     The FORMAT clause must specify 'CUSTOM'. The CUSTOM format requires the built-in pxfwritable_import formatter.
FORMAT (HiveText and HiveRC profiles)                  The FORMAT clause must specify TEXT. Specify the single ascii character field delimiter in the delimiter='<delim>' formatting option.
Use the PXF HiveText profile to create a readable Greenplum Database external table that references the Hive sales_info textfile format table that you created earlier. For example:
postgres=# CREATE EXTERNAL TABLE salesinfo_hivetextprofile(location text, month text, num_orders int, total_sales float8)
LOCATION ('pxf://default.sales_info?PROFILE=HiveText')
FORMAT 'TEXT' (delimiter=E',');
Notice that the FORMAT subclause delimiter value is specified as the single ascii comma character ','. The E escapes the character.
Accessing RCFile-Format Hive Tables
Use the HiveRC profile to access RCFile-format data in a Hive table.
1. Start the hive command line and create a Hive table stored in RCFile format:
$ HADOOP_USER_NAME=hdfs hive
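A sketch of the Hive DDL, followed by the insert that copies the sample data (the step the next sentence refers to):

hive> CREATE TABLE sales_info_rcfile (location string, month string, number_of_orders int, total_sales double)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        STORED AS rcfile;

2. Insert the data from the sales_info table into sales_info_rcfile:

hive> INSERT INTO TABLE sales_info_rcfile SELECT * FROM sales_info;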
A copy of the sample data set is now stored in RCFile format in the Hive sales_info_rcfile table.
3. Query the sales_info_rcfile Hive table to verify that the data was loaded correctly:
4. Use the PXF HiveRC profile to create a readable Greenplum Database external table that references the Hive
sales_info_rcfile table that you created in the previous steps. For example:
postgres=# CREATE EXTERNAL TABLE salesinfo_hivercprofile(location text, month text, num_orders int, total_sales float8)
LOCATION ('pxf://default.sales_info_rcfile?PROFILE=HiveRC')
FORMAT 'TEXT' (delimiter=E',');
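5. A query of the new external table (sketch) produces output like the following:

postgres=# SELECT location, total_sales FROM salesinfo_hivercprofile;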
location | total_sales
---------------+-------------
Prague | 4875.33
Rome | 1557.39
Bangalore | 8936.99
Beijing | 11600.67
...
Accessing ORC-Format Hive Tables
ORC is type-aware and specifically designed for Hadoop workloads. ORC files store both the type of and encoding
information for the data in the file. All columns within a single group of row data (also known as stripe) are stored together on
disk in ORC format files. The columnar nature of the ORC format type enables read projection, helping avoid accessing
unnecessary columns during a query.
ORC also supports predicate pushdown with built-in indexes at the file, stripe, and row levels, moving the filter operation to
the data loading phase.
$ HADOOP_USER_NAME=hdfs hive
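The table creation and data copy steps are not reproduced above; a sketch mirroring the sales_info schema:

hive> CREATE TABLE sales_info_ORC (location string, month string, number_of_orders int, total_sales double)
        STORED AS ORC;
hive> INSERT INTO TABLE sales_info_ORC SELECT * FROM sales_info;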
A copy of the sample data set is now stored in ORC format in sales_info_ORC.
3. Perform a Hive query on sales_info_ORC to verify that the data was loaded successfully:
$ psql -d postgres
postgres=> \timing
Timing is on.
5. Use the PXF HiveORC profile to create a Greenplum Database external table that references the Hive table named
sales_info_ORC you created in Step 1. The FORMAT clause must specify 'CUSTOM'. The HiveORC CUSTOM format supports
only the built-in 'pxfwritable_import' formatter.
postgres=> CREATE EXTERNAL TABLE salesinfo_hiveORCprofile(location text, month text, num_orders int, total_sales float8)
LOCATION ('pxf://default.sales_info_ORC?PROFILE=HiveORC')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
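6. A query of the new table (a sketch; the timing output below is representative of such a query):

postgres=> SELECT * FROM salesinfo_hiveORCprofile;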
Time: 425.416 ms
$ psql -d postgres
2. Use the PXF HiveVectorizedORC profile to create a readable Greenplum Database external table that references the Hive table named sales_info_ORC that you created in Step 1 of the previous example. The FORMAT clause must specify 'CUSTOM'. The HiveVectorizedORC CUSTOM format supports only the built-in 'pxfwritable_import' formatter.
postgres=> CREATE EXTERNAL TABLE salesinfo_hiveVectORC(location text, month text, num_orders int, total_sales float8)
LOCATION ('pxf://default.sales_info_ORC?PROFILE=HiveVectorizedORC')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
Time: 425.416 ms
Accessing Parquet-Format Hive Tables
The PXF Hive profile supports Hive tables that are stored in Parquet format; you simply map the table columns to equivalent Greenplum Database types. For example, for a Hive table named hive_parquet_table:
postgres=# CREATE EXTERNAL TABLE pxf_parquet_table (location text, month text, number_of_orders int, total_sales double precision)
LOCATION ('pxf://default.hive_parquet_table?profile=Hive')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
This example employs the Hive profile and the array and map complex types, specifically an array of integers and a string
key/value pair map.
The data schema for this example includes fields with the following names and data types:
index - int
name - string
intarray - array of integers
propmap - map of string key and value pairs
When you specify an array field in a Hive table, you must identify the terminator for each item in the collection. Similarly, you
must also specify the map key termination character.
1. Create a text file from which you will load the data set:
$ vi /tmp/pxf_hive_complex.txt
2. Add the following text to pxf_hive_complex.txt. This data uses a comma (,) to separate field values, the percent sign (%) to separate collection items, and a colon (:) to terminate map key values:
3,Prague,1%2%3,zone:euro%status:up
89,Rome,4%5%6,zone:euro
400,Bangalore,7%8%9,zone:apac%status:pending
183,Beijing,0%1%2,zone:apac
94,Sacramento,3%4%5,zone:noam%status:down
101,Paris,6%7%8,zone:euro%status:up
56,Frankfurt,9%0%1,zone:euro
202,Jakarta,2%3%4,zone:apac%status:up
313,Sydney,5%6%7,zone:apac%status:pending
76,Atlanta,8%9%0,zone:noam%status:down
3. Create a Hive table to represent this data:
hive> CREATE TABLE table_complextypes( index int, name string, intarray ARRAY<int>, propmap MAP<string, string>)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '%'
MAP KEYS TERMINATED BY ':'
STORED AS TEXTFILE;
Notice that:
The COLLECTION ITEMS TERMINATED BY subclause specifies the percent sign (%) as the terminator for the intarray collection items and for each propmap key/value pair.
The MAP KEYS TERMINATED BY subclause specifies the colon (:) as the map key terminator.
4. Load the pxf_hive_complex.txt sample data file into the table_complextypes table that you just created:
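A sketch, assuming the file was created under /tmp as above:

hive> LOAD DATA LOCAL INPATH '/tmp/pxf_hive_complex.txt' INTO TABLE table_complextypes;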
5. Perform a query on Hive table table_complextypes to verify that the data was loaded successfully:
6. Use the PXF Hive profile to create a readable Greenplum Database external table that references the Hive table named
table_complextypes:
postgres=# CREATE EXTERNAL TABLE complextypes_hiveprofile(index int, name text, intarray text, propmap text)
LOCATION ('pxf://table_complextypes?PROFILE=Hive')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
Notice that the integer array and map complex types are mapped to Greenplum Database data type text.
$ HADOOP_USER_NAME=hdfs hive
hive> CREATE TABLE table_complextypes_ORC( index int, name string, intarray ARRAY<int>, propmap MAP<string, string>)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '%'
MAP KEYS TERMINATED BY ':'
STORED AS ORC;
2. Insert the data from the table_complextypes table that you created in the previous example intotable_complextypes_ORC:
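A sketch of the insert:

hive> INSERT INTO TABLE table_complextypes_ORC SELECT * FROM table_complextypes;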
A copy of the sample data set is now stored in ORC format in table_complextypes_ORC.
3. Perform a Hive query on table_complextypes_ORC to verify that the data was loaded successfully:
OK
3 Prague [1,2,3] {"zone":"euro","status":"up"}
89 Rome [4,5,6] {"zone":"euro"}
400 Bangalore [7,8,9] {"zone":"apac","status":"pending"}
...
5. Use the PXF HiveORC profile to create a readable Greenplum Database external table from the Hive table named
table_complextypes_ORC you created in Step 1. The FORMAT clause must specify 'CUSTOM'. The HiveORC CUSTOM format
supports only the built-in 'pxfwritable_import' formatter.
postgres=> CREATE EXTERNAL TABLE complextypes_hiveorc(index int, name text, intarray text, propmap text)
LOCATION ('pxf://default.table_complextypes_ORC?PROFILE=HiveORC')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
Notice that the integer array and map complex types are again mapped to Greenplum Database data type text.
The PXF Hive Connector partition filtering support for Hive string and integral types is described below:
The relational operators =, <, <=, >, >=, and <> are supported on string types.
The relational operators = and <> are supported on integral types (To use partition filtering with Hive integral types, you
must update the Hive configuration as described in the Prerequisites).
The logical operators AND and OR are supported when used with the relational operators mentioned above.
The LIKE string operator is not supported.
To take advantage of PXF partition filtering pushdown, the Hive and PXF partition field names must be the same. Otherwise, PXF ignores partition filtering.
The PXF Hive connector filters only on partition columns, not on other table attributes. Additionally, filter pushdown is
supported only for those data types and operators identified above.
PXF filter pushdown is enabled by default. You configure PXF filter pushdown as described in About Filter Pushdown.
1. Create a Hive table named sales_part with two partition columns, delivery_state and delivery_city:
hive> CREATE TABLE sales_part (name string, type string, supplier_key int, price double)
PARTITIONED BY (delivery_state string, delivery_city string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
2. Load data into this Hive table and add some partitions:
A SELECT * statement on a Hive partitioned table shows the partition fields at the end of the record.
5. Create a PXF external table to read the partitioned sales_part Hive table. To take advantage of partition filter push-down,
define fields corresponding to the Hive partition fields at the end of the CREATE EXTERNAL TABLE attribute list.
$ psql -d postgres
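A sketch of such a table; the non-partition column names other than item_name are assumptions, while the partition columns delivery_state and delivery_city match the Hive DDL above:

postgres=# CREATE EXTERNAL TABLE pxf_sales_part(item_name text, item_type text, supplier_key integer, item_price double precision, delivery_state text, delivery_city text)
            LOCATION ('pxf://default.sales_part?PROFILE=Hive')
            FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');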
7. Perform another query (no pushdown) on pxf_sales_part to return records where the delivery_city is Sacramento and item_name
is cube:
postgres=# SELECT * FROM pxf_sales_part WHERE delivery_city = 'Sacramento' AND item_name = 'cube';
The query filters the delivery_city partition Sacramento. The filter on item_name is not pushed down, since it is not a partition
column. It is performed on the Greenplum Database side after all the data in the Sacramento partition is transferred for
processing.
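A query that filters only on the partition column (sketch):

postgres=# SELECT * FROM pxf_sales_part WHERE delivery_state = 'CALIFORNIA';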
This query reads all of the data in the CALIFORNIA delivery_state partition, regardless of the city.
In this example, you create a partitioned Hive external table. The table is composed of the HDFS data files associated with the
sales_info (text format) and sales_info_rcfile (RC format) Hive tables that you created in previous exercises. You will partition the
data by year, assigning the data from sales_info to the year 2013, and the data from sales_info_rcfile to the year 2016. (For the moment, ignore the fact that the tables contain the same data.) You will then use the PXF Hive profile to query this partitioned Hive external table.
1. Create a Hive external table named hive_multiformpart that is partitioned by a string field named year:
$ HADOOP_USER_NAME=hdfs hive
hive> CREATE EXTERNAL TABLE hive_multiformpart( location string, month string, number_of_orders int, total_sales double)
PARTITIONED BY( year string )
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
2. Describe the sales_info and sales_info_rcfile tables, noting the HDFS file location for each table:
hive> DESCRIBE EXTENDED sales_info;
hive> DESCRIBE EXTENDED sales_info_rcfile;
3. Create partitions in the hive_multiformpart table for the HDFS file locations associated with each of thesales_info and
sales_info_rcfile tables:
hive> ALTER TABLE hive_multiformpart ADD PARTITION (year = '2013') LOCATION 'hdfs://namenode:8020/apps/hive/warehouse/sales_info';
hive> ALTER TABLE hive_multiformpart ADD PARTITION (year = '2016') LOCATION 'hdfs://namenode:8020/apps/hive/warehouse/sales_info_rcfile';
4. Explicitly identify the file format of the partition associated with the sales_info_rcfile table:
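A sketch of that statement:

hive> ALTER TABLE hive_multiformpart PARTITION (year='2016') SET FILEFORMAT RCFILE;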
You need not specify the file format of the partition associated with the sales_info table, as TEXTFILE format is the default.
6. Show the partitions defined for the hive_multiformpart table and exit hive:
hive> SHOW PARTITIONS hive_multiformpart;
year=2013
year=2016
hive> quit;
$ psql -d postgres
8. Use the PXF Hive profile to create a readable Greenplum Database external table that references the Hive
hive_multiformpart external table that you created in the previous steps:
postgres=# CREATE EXTERNAL TABLE pxf_multiformpart(location text, month text, num_orders int, total_sales float8, year text)
LOCATION ('pxf://default.hive_multiformpart?PROFILE=Hive')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
10. Perform a second query to calculate the total number of orders for the year 2013:
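A sketch of such a query:

postgres=# SELECT sum(num_orders) FROM pxf_multiformpart WHERE year = '2013';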
Similar to Hive, PXF represents a table’s partitioning columns as columns that are appended to the end of the table. However,
PXF translates any column value in a default partition to a NULL value. This means that a Greenplum Database query that
includes an IS NULL filter on a partitioning column can return different results than the same Hive query.
For example, consider a Hive table that is partitioned on a date column named xdate and loaded with five rows that contain the following data:
1.0 1900-01-01
2.2 1994-04-14
3.3 2011-03-31
4.5 NULL
5.0 2013-12-06
Inserting row 4 creates a Hive default partition, because the partition column xdate contains a null value.
In Hive, any query that filters on the partition column omits data in the default partition. For example, the following query
returns no rows:
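A sketch; the Hive table name is a placeholder:

hive> SELECT * FROM <hive-table> WHERE xdate IS NULL;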
However, if you map this Hive table to a PXF external table in Greenplum Database, all default partition values are translated
into actual NULL values. In Greenplum Database, executing the same query against the PXF external table returns row 4 as
the result, because the filter matches the NULL value.
Keep this behavior in mind when you execute IS NULL queries on Hive partitioned tables.
Reading HBase Table Data
The PXF HBase connector reads data stored in an HBase table. The HBase connector supports
filter pushdown.
Prerequisites
Before working with HBase table data, ensure that you have:
Copied $GPHOME/pxf/lib/pxf-hbase-*.jar to each node in your HBase cluster, and that the
location of this PXF JAR file is in the $HBASE_CLASSPATH. This configuration is required for
the PXF HBase connector to support filter pushdown.
Met the PXF Hadoop Prerequisites.
HBase Primer
This topic assumes that you have a basic understanding of the following HBase concepts:
An HBase column includes two components: a column family and a column qualifier.
These components are delimited by a colon : character, <column-family>:<column-
qualifier>.
An HBase row consists of a row key and one or more column values. A row key is a unique
identifier for the table row.
An HBase table is a multi-dimensional map comprised of one or more columns and rows of
data. You specify the complete set of column families when you create an HBase table.
An HBase cell is comprised of a row (column family, column qualifier, column value) and a
timestamp. The column value and timestamp in a given cell represent a version of the
value.
For detailed information about HBase, refer to the Apache HBase Reference Guide.
HBase Shell
The HBase shell is a subsystem similar to that of psql. To start the HBase shell:
$ hbase shell
<hbase output>
hbase(main):001:0>
1. Create an HBase table named order_info in the default namespace. order_info has two column
families: product and shipping_info:
hbase(main):> create 'order_info', 'product', 'shipping_info'
2. The order_info product column family has qualifiers named name and location. The shipping_info column family has qualifiers named state and zipcode. Add some data to the order_info table:
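For example (a sketch; the row keys and cell values are illustrative):

hbase(main):> put 'order_info', '1', 'product:name', 'tennis racquet'
hbase(main):> put 'order_info', '1', 'product:location', 'out of stock'
hbase(main):> put 'order_info', '1', 'shipping_info:state', 'CA'
hbase(main):> put 'order_info', '1', 'shipping_info:zipcode', '12345'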
You will access the order_info HBase table directly via PXF in examples later in this topic.
Use the following syntax to create a Greenplum Database external table that references an
HBase table:
CREATE EXTERNAL TABLE <table_name>
( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<hbase-table-name>?PROFILE=HBase[&SERVER=<server_name>]')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
HBase connector-specific keywords and values used in the CREATE EXTERNAL TABLE call
are described below.
Keyword Value
<hbase‑table‑name> The name of the HBase table.
PROFILE The PROFILE keyword must specify HBase.
Column Mapping
You can create a Greenplum Database external table that references all, or a subset of, the
column qualifiers defined in an HBase table. PXF supports direct or indirect mapping between a
Greenplum Database table column and an HBase table column qualifier.
Direct Mapping
When you use direct mapping to map Greenplum Database external table column names to
HBase qualifiers, you specify column-family-qualified HBase qualifier names as quoted values.
The PXF HBase connector passes these column names as-is to HBase as it reads the table
data.
For example, to create a Greenplum Database external table accessing the following data:
from the order_info HBase table that you created in Example: Creating an HBase Table, use this
CREATE EXTERNAL TABLE syntax:
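A sketch of such a table; the Greenplum table name is an assumption, and the quoted column names are the column-family-qualified HBase qualifier names:

CREATE EXTERNAL TABLE orderinfo_hbase ("product:name" varchar, "shipping_info:zipcode" int)
    LOCATION ('pxf://order_info?PROFILE=HBase')
    FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');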
Indirect Mapping (via Lookup Table)
When you use indirect mapping to map Greenplum Database external table column names to
HBase qualifiers, you specify the mapping in a lookup table that you create in HBase. The
lookup table maps a <column-family>:<column-qualifier> to a column name alias that you specify
when you create the Greenplum Database external table.
You must name the HBase PXF lookup table pxflookup. And you must define this table with a
single column family named mapping. For example:
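A sketch of creating the lookup table in the HBase shell:

hbase(main):> create 'pxflookup', 'mapping'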
While the direct mapping method is fast and intuitive, using indirect mapping allows you to
create a shorter, character-based alias for the HBase <column-family>:<column-qualifier> name.
HBase qualifier names can be very long. Greenplum Database has a 63 character limit on
the size of the column name.
HBase qualifier names can include binary or non-printable characters. Greenplum
Database column names are character-based.
When populating the pxflookup HBase table, add rows to the table such that the:
row key specifies the name of the HBase table
mapping column family qualifier identifies the column name alias, and the cell value identifies the HBase <column-family>:<column-qualifier> to which the alias maps
For example, to use indirect mapping with the order_info table, add these entries to the pxflookup table:
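A sketch; the pname and zip aliases match the external table definition below:

hbase(main):> put 'pxflookup', 'order_info', 'mapping:pname', 'product:name'
hbase(main):> put 'pxflookup', 'order_info', 'mapping:zip', 'shipping_info:zipcode'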
Then create a Greenplum Database external table using the following CREATE EXTERNAL TABLE syntax:
CREATE EXTERNAL TABLE orderinfo_map (pname varchar, zip int)
LOCATION ('pxf://order_info?PROFILE=HBase')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
Row Key
The HBase table row key is a unique identifier for the table row. PXF handles the row key in a
special way.
To use the row key in the Greenplum Database external table query, define the external table
using the PXF reserved column named recordkey. The recordkey column name instructs PXF to
return the HBase table record key for each row.
For example:
CREATE EXTERNAL TABLE <table_name> (recordkey bytea, ... )
LOCATION ('pxf://<hbase_table_name>?PROFILE=HBase')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
After you have created the external table, you can use therecordkey in a WHERE clause to filter
the HBase table on a range of row key values.
Note: To enable filter pushdown on the recordkey, define the field as text.
Accessing Azure, Google Cloud Storage, Minio, and S3 Object Stores with PXF
PXF is installed with connectors to Azure Blob Storage, Azure Data Lake, Google Cloud Storage, Minio, and S3 object stores.
Prerequisites
Before working with object store data using PXF, ensure that:
You have configured and initialized PXF, and PXF is running on each Greenplum Database segment host. See Configuring PXF for additional
information.
You have configured the PXF Object Store Connectors that you plan to use. Refer to Configuring Connectors to Azure and Google Cloud Storage
Object Stores and Configuring Connectors to Minio and S3 Object Stores for instructions.
Time is synchronized between the Greenplum Database segment hosts and the external object store systems.
Text
Avro
JSON
Parquet
AvroSequenceFile
SequenceFile
The PXF connectors to Azure expose the following profiles to read, and in many cases write, these supported data formats:
Similarly, the PXF connectors to Google Cloud Storage, Minio, and S3 expose these profiles:
Data Format                            Google Cloud Storage    S3 or Minio
delimited single line plain text       gs:text                 s3:text
delimited text with quoted linefeeds   gs:text:multi           s3:text:multi
Avro                                   gs:avro                 s3:avro
JSON                                   gs:json                 s3:json
Parquet                                gs:parquet              s3:parquet
AvroSequenceFile                       gs:AvroSequenceFile     s3:AvroSequenceFile
SequenceFile                           gs:SequenceFile         s3:SequenceFile
You provide the profile name when you specify thepxf protocol on a CREATE EXTERNAL TABLE command to create a Greenplum Database external table
that references a file or directory in the specific object store.
The following command creates an external table that references a text file on S3. It specifies the profile named s3:text and the server configuration named s3srvcfg:
CREATE EXTERNAL TABLE pxf_s3_text(location text, month text, num_orders int, total_sales float8)
LOCATION ('pxf://S3_BUCKET/pxf_examples/pxf_s3_simple.txt?PROFILE=s3:text&SERVER=s3srvcfg')
FORMAT 'TEXT' (delimiter=E',');
The following command creates an external table that references a text file on Azure Blob Storage. It specifies the profile named wasbs:text and the server configuration named wasbssrvcfg. You would provide the Azure Blob Storage container identifier and your Azure Blob Storage account name.
CREATE EXTERNAL TABLE pxf_wasbs_text(location text, month text, num_orders int, total_sales float8)
LOCATION ('pxf://AZURE_CONTAINER@YOUR_AZURE_BLOB_STORAGE_ACCOUNT_NAME.blob.core.windows.net/path/to/blob/file?PROFILE=wasbs:text&SERVER=wasbssrvcfg')
FORMAT 'TEXT';
The following command creates an external table that references a text file on Azure Data Lake. It specifies the profile named adl:text and the server configuration named adlsrvcfg. You would provide your Azure Data Lake account name.
CREATE EXTERNAL TABLE pxf_adl_text(location text, month text, num_orders int, total_sales float8)
LOCATION ('pxf://YOUR_ADL_ACCOUNT_NAME.azuredatalakestore.net/path/to/file?PROFILE=adl:text&SERVER=adlsrvcfg')
FORMAT 'TEXT';
The following command creates an external table that references a JSON file on Google Cloud Storage. It specifies the profile named gs:json and the server configuration named gcssrvcfg:
CREATE EXTERNAL TABLE pxf_gsc_json(location text, month text, num_orders int, total_sales float8)
LOCATION ('pxf://dir/subdir/file.json?PROFILE=gs:json&SERVER=gcssrvcfg')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
About Accessing the S3 Object Store
PXF is installed with a connector to the S3 object store. PXF supports the following additional runtime features
with this connector:
Overriding the S3 credentials specified in the server configuration by providing them in theCREATE
EXTERNAL TABLE command DDL.
Using the Amazon S3 Select service to read certain CSV and Parquet data from S3.
For example:
CREATE EXTERNAL TABLE pxf_ext_tbl(name text, orders int)
LOCATION ('pxf://S3_BUCKET/dir/file.txt?PROFILE=s3:text&SERVER=s3srvcfg&accesskey=YOURKEY&secretkey=YOURSECRET')
FORMAT 'TEXT' (delimiter=E',');
Credentials that you provide in this manner are visible as part of the external table definition. Do not use this
method of passing credentials in a production environment.
PXF does not support overriding Azure, Google Cloud Storage, and Minio server credentials in this manner at
this time.
Refer to Configuration Property Precedence for detailed information about the precedence rules that PXF
uses to obtain configuration property settings for a Greenplum Database user.
Reading and Writing Text Data
The PXF object store connectors support plain delimited and comma-separated value format text data. This section describes how to use PXF to access
text data in an object store, including how to create, query, and insert data into an external table that references files in the object store.
Note: Accessing text data from an object store is very similar to accessing text data in HDFS.
Prerequisites
Ensure that you have met the PXF Object Store Prerequisites before you attempt to read data from or write data to an object store.
Object Store           Profile Prefix
Azure Blob Storage     wasbs
Azure Data Lake        adl
Google Cloud Storage   gs
Minio                  s3
S3                     s3
The following syntax creates a Greenplum Database readable external table that references a simple text file in an object store:
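A sketch of the syntax, assuming it follows the same pattern as the other object store profiles shown in this documentation (the <custom-option> settings are optional):

CREATE EXTERNAL TABLE <table_name>
    ( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<path-to-file>?PROFILE=<objstore>:text&SERVER=<server_name>[&<custom-option>=<value>[...]]')
FORMAT '[TEXT|CSV]' (delimiter[=|<space>][E]'<delim_value>');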
The specific keywords and values used in the CREATE EXTERNAL TABLE command are described in the table below.
Keyword                    Value
<path-to-file>             The absolute path to the directory or file in the S3 object store.
PROFILE=<objstore>:text    The PROFILE keyword must identify the specific object store. For example, s3:text.
SERVER=<server_name>       The named server configuration that PXF uses to access the data.
FORMAT                     Use FORMAT 'TEXT' when <path-to-file> references plain text delimited data. Use FORMAT 'CSV' when <path-to-file> references comma-separated value data.
delimiter                  The delimiter character in the data. For FORMAT 'CSV', the default <delim_value> is a comma (,). Preface the <delim_value> with an E when the value is an escape sequence. Examples: (delimiter=E'\t'), (delimiter ':').
You can provide S3 credentials via custom options in the CREATE EXTERNAL TABLE command as described in Overriding the S3 Server Configuration with DDL.
If you are reading CSV-format data from S3, you can direct PXF to use the Amazon S3 Select service to retrieve the data. Refer to Using the Amazon S3 Select Service for more information about the PXF custom option used for this purpose.
Perform the following procedure to create a sample text file, copy the file to S3, and use the s3:text profile to create two PXF external tables to query the data:
1. Create a directory in S3 for PXF example data files. For example, if you have write access to an S3 bucket named BUCKET:
$ aws s3 mb s3://BUCKET/pxf_examples
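The intermediate steps create a comma-delimited sample file and copy it to the bucket; a sketch, assuming a file named pxf_s3_simple.txt with hypothetical location, month, number-of-orders, and total-sales values:

$ echo 'Prague,Jan,101,4875.33
Rome,Mar,87,1557.39
Bangalore,May,317,8936.99
Beijing,Jul,411,11600.67' > /tmp/pxf_s3_simple.txt
$ aws s3 cp /tmp/pxf_s3_simple.txt s3://BUCKET/pxf_examples/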
Note the use of the comma (,) to separate the four data fields.
6. Use the PXF s3:text profile to create a Greenplum Database external table that references the pxf_s3_simple.txt file that you just created and added to S3.
For example, if your server name is s3srvcfg:
postgres=# CREATE EXTERNAL TABLE pxf_s3_textsimple(location text, month text, num_orders int, total_sales float8)
LOCATION ('pxf://BUCKET/pxf_examples/pxf_s3_simple.txt?PROFILE=s3:text&SERVER=s3srvcfg')
FORMAT 'TEXT' (delimiter=E',');
8. Create a second external table that references pxf_s3_simple.txt, this time specifying the CSV FORMAT:
postgres=# CREATE EXTERNAL TABLE pxf_s3_textsimple_csv(location text, month text, num_orders int, total_sales float8)
LOCATION ('pxf://BUCKET/pxf_examples/pxf_s3_simple.txt?PROFILE=s3:text&SERVER=s3srvcfg')
FORMAT 'CSV';
postgres=# SELECT * FROM pxf_s3_textsimple_csv;
When you specify FORMAT 'CSV' for comma-separated value data, no delimiter formatter option is required because comma is the default delimiter value.
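The next profile, <objstore>:text:multi, reads delimited text that contains quoted linefeeds. A sketch of the readable external table syntax, assuming it follows the same pattern as the text profile above:

CREATE EXTERNAL TABLE <table_name>
    ( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<path-to-file>?PROFILE=<objstore>:text:multi&SERVER=<server_name>[&<custom-option>=<value>[...]]')
FORMAT '[TEXT|CSV]' (delimiter[=|<space>][E]'<delim_value>');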
The specific keywords and values used in the CREATE EXTERNAL TABLE command are described in the table below.
Keyword                          Value
<path-to-file>                   The absolute path to the directory or file in the S3 data store.
PROFILE=<objstore>:text:multi    The PROFILE keyword must identify the specific object store. For example, s3:text:multi.
SERVER=<server_name>             The named server configuration that PXF uses to access the data.
FORMAT                           Use FORMAT 'TEXT' when <path-to-file> references plain text delimited data. Use FORMAT 'CSV' when <path-to-file> references comma-separated value data.
delimiter                        The delimiter character in the data. For FORMAT 'CSV', the default <delim_value> is a comma (,). Preface the <delim_value> with an E when the value is an escape sequence. Examples: (delimiter=E'\t'), (delimiter ':').
If you are accessing an S3 object store, you can provide S3 credentials via custom options in the CREATE EXTERNAL TABLE command as described in Overriding the S3 Server Configuration with DDL.
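The steps that create and copy the sample multi-line file precede the note below; a sketch, assuming a colon-delimited file named pxf_s3_multi.txt whose quoted first field contains an embedded line feed (the address values are hypothetical):

$ echo '"4627 Star Rd.
San Francisco, CA  94107":Sept:2017
"113 Moon St.
San Diego, CA  92093":Jan:2018' > /tmp/pxf_s3_multi.txt
$ aws s3 cp /tmp/pxf_s3_multi.txt s3://BUCKET/pxf_examples/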
Notice the use of the colon : to separate the three fields. Also notice the quotes around the first (address) field. This field includes an embedded line
feed separating the street address from the city and state.
4. Use the s3:text:multi profile to create an external table that references the pxf_s3_multi.txt S3 file, making sure to identify the : (colon) as the field separator.
For example, if your server name is s3srvcfg:
postgres=# CREATE EXTERNAL TABLE pxf_s3_textmulti(address text, month text, year int)
LOCATION ('pxf://BUCKET/pxf_examples/pxf_s3_multi.txt?PROFILE=s3:text:multi&SERVER=s3srvcfg')
FORMAT 'CSV' (delimiter ':');
Note: External tables that you create with a writable profile can only be used for INSERT operations. If you want to query the data that you inserted, you must create a separate readable external table that references the directory.
Use the following syntax to create a Greenplum Database writable external table that references an object store directory:
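A sketch of the writable syntax, assuming it follows the same pattern as the writable Parquet and SequenceFile profiles later in this document:

CREATE WRITABLE EXTERNAL TABLE <table_name>
    ( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<path-to-dir>?PROFILE=<objstore>:text&SERVER=<server_name>[&<custom-option>=<value>[...]]')
FORMAT '[TEXT|CSV]' (delimiter[=|<space>][E]'<delim_value>')
[DISTRIBUTED BY (<column_name> [, ... ] ) | DISTRIBUTED RANDOMLY];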
The specific keywords and values used in the CREATE EXTERNAL TABLE command are described in the table below.
Keyword                     Value
<path-to-dir>               The absolute path to the directory in the S3 data store.
PROFILE=<objstore>:text     The PROFILE keyword must identify the specific object store. For example, s3:text.
SERVER=<server_name>        The named server configuration that PXF uses to access the data.
<custom-option>=<value>     Custom write options for this profile; see the write options discussion below.
Writable external tables that you create using an <objstore>:text profile can optionally use record or block compression. The PXF <objstore>:text profiles support the following compression codecs:
org.apache.hadoop.io.compress.DefaultCodec
org.apache.hadoop.io.compress.GzipCodec
org.apache.hadoop.io.compress.BZip2Codec
You specify the compression codec via custom options in the CREATE EXTERNAL TABLE LOCATION clause; the <objstore>:text profiles support custom write options for this purpose.
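For example, the following LOCATION URI fragment requests Gzip block compression. COMPRESSION_CODEC is the option used in the example later in this topic; COMPRESSION_TYPE, with RECORD or BLOCK values, is an assumption included here for illustration:

&COMPRESSION_CODEC=org.apache.hadoop.io.compress.GzipCodec&COMPRESSION_TYPE=BLOCK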
If you are accessing an S3 object store, you can provide S3 credentials via custom options in the CREATE EXTERNAL TABLE command as described in Overriding the S3 Server Configuration with DDL.
Column Name         Data Type
location            text
month               text
number_of_orders    int
total_sales         float8
This example also optionally uses the Greenplum Database external table named pxf_s3_textsimple that you created in that exercise.
Procedure
Perform the following procedure to create Greenplum Database writable external tables utilizing the same data schema as described above, one of which
will employ compression. You will use the PXF s3:text profile to write data to S3. You will also create a separate, readable external table to read the data that
you wrote to S3.
1. Create a Greenplum Database writable external table utilizing the data schema described above. Write to the S3 directory BUCKET/pxf_examples/pxfwrite_s3_textsimple1. Create the table specifying a comma (,) as the delimiter. For example, if your server name is s3srvcfg:
postgres=# CREATE WRITABLE EXTERNAL TABLE pxf_s3_writetbl_1(location text, month text, num_orders int, total_sales float8)
LOCATION ('pxf://BUCKET/pxf_examples/pxfwrite_s3_textsimple1?PROFILE=s3:text&SERVER=s3srvcfg')
FORMAT 'TEXT' (delimiter=',');
You specify the FORMAT subclause delimiter value as the single ASCII comma character (,).
2. Write a few individual records to the pxfwrite_s3_textsimple1 S3 directory by invoking the SQL INSERT command on pxf_s3_writetbl_1:
postgres=# INSERT INTO pxf_s3_writetbl_1 VALUES ( 'Frankfurt', 'Mar', 777, 3956.98 );
postgres=# INSERT INTO pxf_s3_writetbl_1 VALUES ( 'Cleveland', 'Oct', 3812, 96645.37 );
3. (Optional) Insert the data from the pxf_s3_textsimple table that you created in Example: Reading Text Data from S3 into pxf_s3_writetbl_1:
postgres=# INSERT INTO pxf_s3_writetbl_1 SELECT * FROM pxf_s3_textsimple;
4. Greenplum Database does not support directly querying a writable external table. To query the data that you just added to S3, you must create a
readable external Greenplum Database table that references the S3 directory:
postgres=# CREATE EXTERNAL TABLE pxf_s3_textsimple_r1(location text, month text, num_orders int, total_sales float8)
LOCATION ('pxf://BUCKET/pxf_examples/pxfwrite_s3_textsimple1?PROFILE=s3:text&SERVER=s3srvcfg')
FORMAT 'CSV';
You specify the 'CSV' FORMAT when you create the readable external table because you created the writable table with a comma (,) as the delimiter character, and the comma is the default delimiter for the 'CSV' FORMAT.
The pxf_s3_textsimple_r1 table includes the records you individually inserted, as well as the full contents of the pxf_s3_textsimple table if you performed the optional step.
6. Create a second Greenplum Database writable external table, this time using Gzip compression and employing a colon (:) as the delimiter:
postgres=# CREATE WRITABLE EXTERNAL TABLE pxf_s3_writetbl_2 (location text, month text, num_orders int, total_sales float8)
LOCATION ('pxf://BUCKET/pxf_examples/pxfwrite_s3_textsimple2?PROFILE=s3:text&SERVER=s3srvcfg&COMPRESSION_CODEC=org.apache.hadoop.io.compress.GzipCodec')
FORMAT 'TEXT' (delimiter=':');
7. Write a few records to the pxfwrite_s3_textsimple2 S3 directory by inserting directly into the pxf_s3_writetbl_2 table:
gpadmin=# INSERT INTO pxf_s3_writetbl_2 VALUES ( 'Frankfurt', 'Mar', 777, 3956.98 );
gpadmin=# INSERT INTO pxf_s3_writetbl_2 VALUES ( 'Cleveland', 'Oct', 3812, 96645.37 );
8. To query data from the newly-created S3 directory named pxfwrite_s3_textsimple2, you can create a readable external Greenplum Database table as described above that references this S3 directory and specifies FORMAT 'CSV' (delimiter=':').
The PXF object store connectors support reading Avro-format data. This section describes how to use PXF to read and write Avro data in an
object store, including how to create, query, and insert into an external table that references an Avro file in the store.
Note: Accessing Avro-format data from an object store is very similar to accessing Avro-format data in HDFS. This topic identifies object store-
specific information required to read Avro data, and links to the PXF HDFS Avro documentation where appropriate for common information.
Prerequisites
Ensure that you have met the PXF Object Store Prerequisites before you attempt to read data from an object store.
You must include the bucket in the schema file path; this need not be the same bucket that contains the Avro data file.
The secrets that you specify in the SERVER configuration must provide access to both the data file and schema file buckets.
Object Store           Profile Prefix
Azure Blob Storage     wasbs
Azure Data Lake        adl
Google Cloud Storage   gs
Minio                  s3
S3                     s3
The following syntax creates a Greenplum Database external table that references an Avro-format file:
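A sketch of the syntax, assuming it follows the same pattern as the JSON profile shown later in this document and the s3:avro example below:

CREATE EXTERNAL TABLE <table_name>
    ( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<path-to-file>?PROFILE=<objstore>:avro&SERVER=<server_name>[&<custom-option>=<value>[...]]')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');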
The keywords and values used in the CREATE EXTERNAL TABLE command follow the same pattern as the other object store profiles described in this document: <path-to-file>, PROFILE=<objstore>:avro, SERVER=<server_name>, and FORMAT 'CUSTOM' with FORMATTER='pxfwritable_import'.
If you are accessing an S3 object store, you can provide S3 credentials via custom options in the CREATE EXTERNAL TABLE command as described in Overriding the S3 Server Configuration with DDL.
Example
Refer to Example: Reading Avro Data in the PXF HDFS Avro documentation for an Avro example. Modifications that you must make to run the
example with an object store include:
Copying the file to the object store instead of HDFS. For example, to copy the file to S3:
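A sketch of the copy command, assuming the Avro data file from the HDFS example resides at /tmp/pxf_avro.avro:

$ aws s3 cp /tmp/pxf_avro.avro s3://BUCKET/pxf_examples/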
Using the CREATE EXTERNAL TABLE syntax and LOCATION keywords and settings described above. For example, if your server name is
s3srvcfg:
CREATE EXTERNAL TABLE pxf_s3_avro(id bigint, username text, followers text, fmap text, relationship text, address text)
LOCATION ('pxf://BUCKET/pxf_examples/pxf_avro.avro?PROFILE=s3:avro&SERVER=s3srvcfg&COLLECTION_DELIM=,&MAPKEY_DELIM=:&RECORDKEY_DELIM=:')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
You make similar modifications to follow the steps in Example: Writing Avro Data.
Note: Accessing JSON-format data from an object store is very similar to accessing JSON-
format data in HDFS. This topic identifies object store-specific information required to read JSON
data, and links to the PXF HDFS JSON documentation where appropriate for common
information.
Prerequisites
Ensure that you have met the PXF Object Store Prerequisites before you attempt to read data
from an object store.
Object Store           Profile Prefix
Azure Blob Storage     wasbs
Azure Data Lake        adl
Google Cloud Storage   gs
Minio                  s3
S3                     s3
The following syntax creates a Greenplum Database readable external table that references a
JSON-format file:
CREATE EXTERNAL TABLE <table_name>
( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<path-to-file>?PROFILE=<objstore>:json&SERVER=<server_name>[&<custom-option>=<value>[...]]')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
The specific keywords and values used in the CREATE EXTERNAL TABLE command are
described in the table below.
Keyword                   Value
<path-to-file>            The absolute path to the directory or file in the object store.
PROFILE=<objstore>:json   The PROFILE keyword must identify the specific object store. For example, s3:json.
SERVER=<server_name>      The named server configuration that PXF uses to access the data.
If you are accessing an S3 object store, you can provide S3 credentials via custom options in the
CREATE EXTERNAL TABLE command as described in Overriding the S3 Server Configuration with
DDL.
Example
Refer to Loading the Sample JSON Data to HDFS and Example: Reading a JSON File with
Single Line Records in the PXF HDFS JSON documentation for a JSON example. Modifications
that you must make to run the example with an object store include:
Copying the file to the object store instead of HDFS. For example, to copy the file to S3:
$ aws s3 cp /tmp/singleline.json s3://BUCKET/pxf_examples/
$ aws s3 cp /tmp/multiline.json s3://BUCKET/pxf_examples/
Using the CREATE EXTERNAL TABLE syntax and LOCATION keywords and settings described
above. For example, if your server name is s3srvcfg:
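A sketch of the external table definition, assuming the column names used in the PXF HDFS JSON single-line-record example (the column list here is an assumption; match it to your own JSON data):

CREATE EXTERNAL TABLE singleline_json_s3(
  created_at TEXT, id_str TEXT, "user.id" INTEGER, "user.location" TEXT,
  "coordinates.values[0]" INTEGER, "coordinates.values[1]" INTEGER)
LOCATION ('pxf://BUCKET/pxf_examples/singleline.json?PROFILE=s3:json&SERVER=s3srvcfg')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');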
The PXF object store connectors support reading and writing Parquet-format data. This section describes how to use PXF to access Parquet-format data in an object store, including how to create and query external tables that reference a Parquet file in the store.
Note: Accessing Parquet-format data from an object store is very similar to accessing Parquet-format data in HDFS.
This topic identifies object store-specific information required to read and write Parquet data, and links to the PXF
HDFS Parquet documentation where appropriate for common information.
Prerequisites
Ensure that you have met the PXF Object Store Prerequisites before you attempt to read data from or write data to an
object store.
Object Store           Profile Prefix
Azure Blob Storage     wasbs
Azure Data Lake        adl
Google Cloud Storage   gs
Minio                  s3
S3                     s3
Use the following syntax to create a Greenplum Database external table that references an object store directory. When you insert records into a writable external table, the block(s) of data that you insert are written to one or more files in the directory that you specified.
CREATE [WRITABLE] EXTERNAL TABLE <table_name>
( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<path-to-dir>
?PROFILE=<objstore>:parquet&SERVER=<server_name>[&<custom-option>=<value>[...]]')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import'|'pxfwritable_export')
[DISTRIBUTED BY (<column_name> [, ... ] ) | DISTRIBUTED RANDOMLY];
The specific keywords and values used in the CREATE EXTERNAL TABLE command are described in the table
below.
Keyword                      Value
<path-to-dir>                The absolute path to the directory in the object store.
PROFILE=<objstore>:parquet   The PROFILE keyword must identify the specific object store. For example, s3:parquet.
SERVER=<server_name>         The named server configuration that PXF uses to access the data.
<custom-option>=<value>      Parquet-specific custom write options are described in the PXF HDFS Parquet documentation.
FORMAT 'CUSTOM'              Use FORMAT 'CUSTOM' with (FORMATTER='pxfwritable_export') (write) or (FORMATTER='pxfwritable_import') (read).
DISTRIBUTED BY               If you want to load data from an existing Greenplum Database table into the writable external table, consider specifying the same distribution policy or <column_name> on both tables. Doing so will avoid extra motion of data between segments on the load operation.
You can provide S3 credentials via custom options in the CREATE EXTERNAL TABLE command as described in Overriding the S3 Server Configuration with DDL.
If you are reading Parquet data from S3, you can direct PXF to use the Amazon S3 Select service to retrieve the data. Refer to Using the Amazon S3 Select Service for more information about the PXF custom option used for this purpose.
Example
Refer to the Example in the PXF HDFS Parquet documentation for a Parquet write/read example. Modifications that
you must make to run the example with an object store include:
Using the CREATE WRITABLE EXTERNAL TABLE syntax and LOCATION keywords and settings described above for the
writable external table. For example, if your server name is s3srvcfg:
CREATE WRITABLE EXTERNAL TABLE pxf_tbl_parquet_s3 (location text, month text, number_of_orders int, total_sales double precision)
LOCATION ('pxf://BUCKET/pxf_examples/pxf_parquet?PROFILE=s3:parquet&SERVER=s3srvcfg')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_export');
Using the CREATE EXTERNAL TABLE syntax and LOCATION keywords and settings described above for the readable external table. For example, if your server name is s3srvcfg:
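A sketch of the readable definition, mirroring the writable table above (the table name is illustrative):

CREATE EXTERNAL TABLE read_pxf_parquet_s3 (location text, month text, number_of_orders int, total_sales double precision)
  LOCATION ('pxf://BUCKET/pxf_examples/pxf_parquet?PROFILE=s3:parquet&SERVER=s3srvcfg')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');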
The PXF object store connectors support SequenceFile format binary data. This section describes how to use PXF to read and write SequenceFile data,
including how to create, insert, and query data in external tables that reference files in an object store.
Note: Accessing SequenceFile-format data from an object store is very similar to accessing SequenceFile-format data in HDFS. This topic identifies object
store-specific information required to read and write SequenceFile data, and links to the PXF HDFS SequenceFile documentation where appropriate for
common information.
Prerequisites
Ensure that you have met the PXF Object Store Prerequisites before you attempt to read data from or write data to an object store.
Object Store           Profile Prefix
Azure Blob Storage     wasbs
Azure Data Lake        adl
Google Cloud Storage   gs
Minio                  s3
S3                     s3
Use the following syntax to create a Greenplum Database external table that references an object store directory. When you insert records into a writable external table, the block(s) of data that you insert are written to one or more files in the directory that you specified.
The specific keywords and values used in the CREATE EXTERNAL TABLE command are described in the table below.
Keyword                           Value
<path-to-dir>                     The absolute path to the directory in the object store.
PROFILE=<objstore>:SequenceFile   The PROFILE keyword must identify the specific object store. For example, s3:SequenceFile.
SERVER=<server_name>              The named server configuration that PXF uses to access the data.
<custom-option>=<value>           SequenceFile-specific custom options are described in the PXF HDFS SequenceFile documentation.
FORMAT 'CUSTOM'                   Use FORMAT 'CUSTOM' with (FORMATTER='pxfwritable_export') (write) or (FORMATTER='pxfwritable_import') (read).
DISTRIBUTED BY                    If you want to load data from an existing Greenplum Database table into the writable external table, consider specifying the same distribution policy or <column_name> on both tables. Doing so will avoid extra motion of data between segments on the load operation.
If you are accessing an S3 object store, you can provide S3 credentials via custom options in the CREATE EXTERNAL TABLE command as described in Overriding the S3 Server Configuration with DDL.
Example
Refer to Example: Writing Binary Data to HDFS in the PXF HDFS SequenceFile documentation for a write/read example. Modifications that you must make
to run the example with an object store include:
Using the CREATE EXTERNAL TABLE syntax and LOCATION keywords and settings described above for the writable external table. For example, if your server name is s3srvcfg:
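A sketch of the writable definition, mirroring the readable table below (the table name is illustrative):

CREATE WRITABLE EXTERNAL TABLE write_pxf_tbl_seqfile_s3 (location text, month text, number_of_orders integer, total_sales real)
  LOCATION ('pxf://BUCKET/pxf_examples/pxf_seqfile?PROFILE=s3:SequenceFile&DATA-SCHEMA=com.example.pxf.hdfs.writable.dataschema.PxfExample_CustomWritable&SERVER=s3srvcfg')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_export');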
Using the CREATE EXTERNAL TABLE syntax and LOCATION keywords and settings described above for the readable external table. For example, if your
server name is s3srvcfg:
CREATE EXTERNAL TABLE read_pxf_tbl_seqfile_s3(location text, month text, number_of_orders integer, total_sales real)
LOCATION ('pxf://BUCKET/pxf_examples/pxf_seqfile?PROFILE=s3:SequenceFile&DATA-SCHEMA=com.example.pxf.hdfs.writable.dataschema.PxfExample_CustomWritable&SERVER=s3srvcfg')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
You can use PXF to read an entire file in an object store into a single table row; PXF supports reading only text and JSON files in this manner.
Note: Accessing multi-line files from an object store is very similar to accessing multi-line files in
HDFS. This topic identifies the object store-specific information required to read these files. Refer
to the PXF HDFS documentation for more information.
Prerequisites
Ensure that you have met the PXF Object Store Prerequisites before you attempt to read data
from multiple files residing in an object store.
Object Store           Profile Prefix
Azure Blob Storage     wasbs
Azure Data Lake        adl
Google Cloud Storage   gs
Minio                  s3
S3                     s3
The following syntax creates a Greenplum Database readable external table that references one
or more text files in an object store:
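A sketch of the syntax, assuming a single text or json column and following the example at the end of this topic:

CREATE EXTERNAL TABLE <table_name>
    ( <column_name> text | <column_name> json )
LOCATION ('pxf://<path-to-files>?PROFILE=<objstore>:text:multi&SERVER=<server_name>&FILE_AS_ROW=true')
FORMAT 'CSV';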
The specific keywords and values used in the CREATE EXTERNAL TABLE command are
described in the table below.
Keyword                         Value
<path-to-files>                 The absolute path to the directory or files in the object store.
PROFILE=<objstore>:text:multi   The PROFILE keyword must identify the specific object store. For example, s3:text:multi.
SERVER=<server_name>            The named server configuration that PXF uses to access the data.
FILE_AS_ROW=true                The required option that instructs PXF to read each file into a single table row.
If you are accessing an S3 object store, you can provide S3 credentials via custom options in the
CREATE EXTERNAL TABLE command as described in Overriding the S3 Server Configuration with
DDL.
Example
Refer to Example: Reading an HDFS Text File into a Single Table Row in the PXF HDFS
documentation for an example. Modifications that you must make to run the example with an
object store include:
Copying the file to the object store instead of HDFS. For example, to copy the file to S3:
$ aws s3 cp /tmp/file1.txt s3://BUCKET/pxf_examples/tdir
$ aws s3 cp /tmp/file2.txt s3://BUCKET/pxf_examples/tdir
$ aws s3 cp /tmp/file3.txt s3://BUCKET/pxf_examples/tdir
Using the CREATE EXTERNAL TABLE syntax and LOCATION keywords and settings described
above. For example, if your server name is s3srvcfg:
CREATE EXTERNAL TABLE pxf_readfileasrow_s3( c1 text )
LOCATION('pxf://BUCKET/pxf_examples/tdir?PROFILE=s3:text:multi&SERVER=s3srvcfg&FILE_AS_ROW=true')
FORMAT 'CSV';
The PXF S3 connector supports reading certain CSV- and Parquet-format data from S3 using the Amazon S3 Select service. S3 Select provides direct
query-in-place features on data stored in Amazon S3.
When you enable it, PXF uses S3 Select to filter the contents of S3 objects to retrieve the subset of data that you request. This typically reduces both
the amount of data transferred to Greenplum Database and the query time.
PXF supports column projection as well as predicate pushdown for AND, OR, and NOT operators when using S3 Select.
Using the Amazon S3 Select service may increase the cost of data access and retrieval. Be sure to consider the associated costs before you enable
PXF to use the S3 Select service.
By default, PXF does not use S3 Select (S3_SELECT=OFF). You can enable PXF to always use S3 Select, or to use S3 Select only when PXF determines
that it could be beneficial for performance. For example, when S3_SELECT=AUTO, PXF automatically uses S3 Select when a query on the external table
utilizes column projection or predicate pushdown, or when the referenced CSV file has a header row.
If the Parquet file is compressed, you must also identify the compression codec via a custom option in the LOCATION URI; for example:
&COMPRESSION_CODEC=snappy
You must specify FORMAT 'CSV' when you enable PXF to use S3 Select on an external table that accesses a Parquet file on S3.
For example, use the following command to have PXF use S3 Select to access a Parquet file on S3 when optimal:
CREATE EXTERNAL TABLE parquet_on_s3 ( LIKE table1 )
LOCATION ('pxf://bucket/file.parquet?PROFILE=s3:parquet&SERVER=s3srvcfg&S3_SELECT=AUTO')
FORMAT 'CSV';
CSV files may include a header line. When you enable PXF to use S3 Select to access a CSV-format file, you use the FILE_HEADER custom option in the LOCATION URI to identify whether or not the CSV file has a header row and, if so, how you want PXF to handle the header. PXF never returns the header row.
Note: You must specify S3_SELECT=ON or S3_SELECT=AUTO when the CSV file has a header row. Do not specify S3_SELECT=OFF in this case.
FILE_HEADER Value   Description
NONE                The file has no header row; the default.
IGNORE              The file has a header row; ignore the header. Use when the order of the columns in the external table and the CSV file are the same. (When the column order is the same, the column names and the CSV header names may be different.)
USE                 The file has a header row; read the header. Use when the external table column names and the CSV header names are the same, but are in a different order.
If both the order and the names of the external table columns and the CSV header are the same, you can specify eitherFILE_HEADER=IGNORE or
FILE_HEADER=USE.
PXF cannot match the CSV data with the external table definition when both the order and the names of the external table columns are different from
the CSV header columns. Any query on an external table with these conditions fails with the error Some headers in the query are missing from the file .
For example, if the order of the columns in the CSV file header and the external table are the same, add the following to the CREATE EXTERNAL TABLE LOCATION URI to have PXF ignore the CSV header:
&FILE_HEADER=IGNORE
If the CSV file is compressed, you must also identify the compression codec in the LOCATION URI; for example:
&COMPRESSION_CODEC=bzip2
Note: Do not use the (HEADER) formatter option in the CREATE EXTERNAL TABLE command.
For example, use the following command to have PXF always use S3 Select to access a gzip-compressed file on S3, where the field delimiter is a pipe ('|') character and the external table and CSV header columns are in the same order.
CREATE EXTERNAL TABLE gzippedcsv_on_s3 ( LIKE table2 )
  -- the object path below is illustrative; substitute your own bucket and file
  LOCATION ('pxf://BUCKET/dir/file.csv.gz?PROFILE=s3:text&SERVER=s3srvcfg&S3_SELECT=ON&FILE_HEADER=IGNORE&COMPRESSION_CODEC=gzip')
FORMAT 'CSV' (delimiter='|');
Some of your data may already reside in an external SQL database. PXF provides access to this data via the PXF JDBC connector. The JDBC connector is
a JDBC client. It can read data from and write data to SQL databases including MySQL, ORACLE, Microsoft SQL Server, DB2, PostgreSQL, Hive, and
Apache Ignite.
This section describes how to use the PXF JDBC connector to access data in an external SQL database, including how to create and query or insert data
into a PXF external table that references a table in an external database.
The JDBC connector does not guarantee consistency when writing to an external SQL database. Be aware that if an INSERT operation fails, some data may be written to the external database table. If you require consistency for writes, consider writing to a staging table in the external database, and loading to the target table only after verifying the write operation.
Prerequisites
Before you access an external SQL database using the PXF JDBC connector, ensure that:
You have configured and initialized PXF, and PXF is running on each Greenplum Database segment host. See Configuring PXF for additional
information.
You can identify the PXF user configuration directory ($PXF_CONF).
Connectivity exists between all Greenplum Database segment hosts and the external SQL database.
You have configured your external SQL database for user access from all Greenplum Database segment hosts.
You have registered any JDBC driver JAR dependencies.
(Recommended) You have created one or more named PXF JDBC connector server configurations as described inConfiguring the PXF JDBC
Connector.
Any data type not listed above is not supported by the PXF JDBC connector.
Note: The JDBC connector does not support reading or writing Hive data stored as a byte array (byte[]).
To access data in a remote SQL database, you create a readable or writable Greenplum Database external table that references the remote database
table. The Greenplum Database external table and the remote database table or query result tuple must have the same definition; the column names and
types must match.
Use the following syntax to create a Greenplum Database external table that references a remote SQL database table or a query result from the remote
database:
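A sketch of the syntax, consistent with the keyword table that follows and the Jdbc examples later in this topic:

CREATE [READABLE | WRITABLE] EXTERNAL TABLE <table_name>
    ( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<external-table-name>|query:<query_name>?PROFILE=Jdbc[&SERVER=<server_name>][&<custom-option>=<value>[...]]')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import'|'pxfwritable_export');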
The specific keywords and values used in the CREATE EXTERNAL TABLE command are described in the table below.
Keyword                   Value
<external-table-name>     The full name of the external table. Depends on the external SQL database, may include a schema name and a table name.
query:<query_name>        The name of the query to execute in the remote SQL database.
PROFILE                   The PROFILE keyword value must specify Jdbc.
SERVER=<server_name>      The named server configuration that PXF uses to access the data. Optional; PXF uses the default server if not specified.
<custom-option>=<value>   <custom-option> is profile-specific. Jdbc profile-specific options are discussed in the next section.
FORMAT 'CUSTOM'           The JDBC CUSTOM FORMAT supports the built-in 'pxfwritable_import' FORMATTER function for read operations and the built-in 'pxfwritable_export' function for write operations.
Note: You cannot use the HEADER option in your FORMAT specification when you create a PXF external table.
You include JDBC connector custom options in the LOCATION URI, prefacing each option with an ampersand (&). The CREATE EXTERNAL TABLE <custom-option>s supported by the Jdbc profile include the batching, thread pool, and partitioning options discussed below.
When the JDBC driver of the external SQL database supports it, batching of INSERT operations may significantly increase performance.
Write batching is enabled by default, and the default batch size is 100. To disable batching or to modify the default batch size value, create the PXF external
table with a BATCH_SIZE setting:
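For example, the following LOCATION URI fragment (the value is illustrative) changes the batch size for a writable Jdbc external table:

&BATCH_SIZE=200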
When the external database JDBC driver does not support batching, the behavior of the PXF JDBC connector depends on the BATCH_SIZE setting.
By default, the PXF JDBC connector automatically batches the rows it fetches from an external database table. The default row fetch size is 1000. To
modify the default fetch size value, specify a FETCH_SIZE when you create the PXF external table. For example:
FETCH_SIZE=5000
If the external database JDBC driver does not support batching on read, you must explicitly disable read row batching by settingFETCH_SIZE=0.
The PXF JDBC connector can further increase write performance by processingINSERT operations in multiple threads when threading is supported by the
JDBC driver of the external SQL database.
Consider using batching together with a thread pool. When used together, each thread receives and processes one complete batch of data. If you use a
thread pool without batching, each thread in the pool receives exactly one tuple.
The JDBC connector returns an error when any thread in the thread pool fails. Be aware that if an INSERT operation fails, some data may be written to the external database table.
To disable or enable a thread pool and set the pool size, create the PXF external table with a POOL_SIZE setting.
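For example, the following LOCATION URI fragment (the value is illustrative) requests a pool of four write threads:

&POOL_SIZE=4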
Partitioning (Read)
The PXF JDBC connector supports simultaneous read access from PXF instances running on multiple segment hosts to an external SQL table. This feature
is referred to as partitioning. Read partitioning is not enabled by default. To enable read partitioning, set the PARTITION_BY, RANGE, and INTERVAL custom
options when you create the PXF external table.
PXF uses the RANGE and INTERVAL values and the PARTITION_BY column that you specify to assign specific data rows in the external table to PXF instances running on the Greenplum Database segment hosts. This column selection is specific to PXF processing, and has no relationship to a partition column that you may have specified for the table in the external SQL database.
When you enable partitioning, the PXF JDBC connector splits aSELECT query into multiple subqueries that retrieve a subset of the data, each of which is
called a fragment. The JDBC connector automatically adds extra query constraints (WHERE expressions) to each fragment to guarantee that every tuple of
data is retrieved from the external database exactly once.
For example, when a user queries a PXF external table created with aLOCATION clause that specifies &PARTITION_BY=id:int&RANGE=1:5&INTERVAL=2, PXF
generates 5 fragments: two according to the partition settings and up to three implicitly generated fragments. The constraints associated with each fragment
are as follows:
Fragment 1: WHERE (id < 1) - implicitly-generated fragment for RANGE start-bounded interval
Fragment 2: WHERE (id >= 1) AND (id < 3) - fragment specified by partition settings
Fragment 3: WHERE (id >= 3) AND (id < 5) - fragment specified by partition settings
Fragment 4: WHERE (id >= 5) - implicitly-generated fragment for RANGE end-bounded interval
Fragment 5: WHERE (id IS NULL) - implicitly-generated fragment
PXF distributes the fragments among Greenplum Database segments. A PXF instance running on a segment host spawns a thread for each segment on
that host that services a fragment. If the number of fragments is less than or equal to the number of Greenplum segments configured on a segment host, a
single PXF instance may service all of the fragments. Each PXF instance sends its results back to Greenplum Database, where they are collected and
returned to the user.
When you specify the PARTITION_BY option, tune the INTERVAL value and unit based upon the optimal number of JDBC connections to the target database
and the optimal distribution of external data across Greenplum Database segments. The INTERVAL low boundary is driven by the number of Greenplum
Database segments while the high boundary is driven by the acceptable number of JDBC connections to the target database. The INTERVAL setting
influences the number of fragments, and should ideally not be set too high nor too low. Testing with multiple values may help you select the optimal
settings.
Create a PostgreSQL database and table, and insert data into the table
Create a PostgreSQL user and assign all privileges on the table to the user
Configure the PXF JDBC connector to access the PostgreSQL database
Create a PXF readable external table that references the PostgreSQL table
Read the data in the PostgreSQL table
Create a PXF writable external table that references the PostgreSQL table
Write data to the PostgreSQL table
Read the data in the PostgreSQL table again
Perform the following steps to create a PostgreSQL table named forpxf_table1 in the public schema of a database named pgtestdb, and grant a user named pxfuser1 all privileges on this table:
2. Connect to the default PostgreSQL database as the postgres user. For example, if your PostgreSQL server is running on the default port on the host
named pserver:
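A sketch of the connection command, assuming password authentication is configured for the postgres user:

$ psql -U postgres -h pserver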
4. Create a table named forpxf_table1 and insert some data into this table:
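A minimal sketch, assuming a single-column table; substitute the schema that you want to expose through PXF:

pgtestdb=# CREATE TABLE public.forpxf_table1(id int);
pgtestdb=# INSERT INTO forpxf_table1 VALUES (1), (2), (3), (4), (5);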
6. Assign user pxfuser1 all privileges on table forpxf_table1, and exit the psql subsystem:
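A sketch of the grant, assuming the pxfuser1 role already exists:

pgtestdb=# GRANT ALL ON forpxf_table1 TO pxfuser1;
pgtestdb=# \q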
With these privileges, pxfuser1 can read from and write to the forpxf_table1 table.
7. Update the PostgreSQL configuration to allow user pxfuser1 to access pgtestdb from each Greenplum Database segment host. This configuration is
specific to your PostgreSQL environment. You will update the /var/lib/pgsql/pg_hba.conf file and then restart the PostgreSQL server.
You must create a JDBC server configuration for PostgreSQL, download the PostgreSQL driver JAR file to your system, copy the JAR file to the PXF user
configuration directory, synchronize the PXF configuration, and then restart PXF.
2. Create a JDBC server configuration for PostgreSQL as described in Example Configuration Procedure, naming the server/directory pgsrvcfg. The jdbc-site.xml file contents should look similar to the following (substitute your PostgreSQL host system for pgserverhost):
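A sketch of the file, assuming the jdbc.driver, jdbc.url, jdbc.user, and jdbc.password property names used by the PXF JDBC server template and the credentials from this example:

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <property>
        <name>jdbc.driver</name>
        <value>org.postgresql.Driver</value>
    </property>
    <property>
        <name>jdbc.url</name>
        <value>jdbc:postgresql://pgserverhost:5432/pgtestdb</value>
    </property>
    <property>
        <name>jdbc.user</name>
        <value>pxfuser1</value>
    </property>
    <property>
        <name>jdbc.password</name>
        <value>changeme</value>
    </property>
</configuration>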
3. Synchronize the PXF server configuration to the Greenplum Database cluster. For example:
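The sync command is the same one used elsewhere in this document:

gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster sync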
Perform the following procedure to create a PXF external table that references theforpxf_table1 PostgreSQL table that you created in the previous section,
and read the data in the table:
1. Create the PXF external table specifying the Jdbc profile. For example:
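A sketch, assuming the single-column forpxf_table1 schema sketched earlier (the external table name pxf_tblfrompg is referenced later in this topic):

postgres=# CREATE EXTERNAL TABLE pxf_tblfrompg(id int)
             LOCATION ('pxf://public.forpxf_table1?PROFILE=Jdbc&SERVER=pgsrvcfg')
           FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');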
Perform the following procedure to insert some data into the forpxf_table1 Postgres table and then read from the table. You must create a new external table for the write operation.
1. Create a writable PXF external table specifying the Jdbc profile. For example:
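A sketch, again assuming the single-column schema (the table name pxf_writeto_pg is illustrative):

postgres=# CREATE WRITABLE EXTERNAL TABLE pxf_writeto_pg(id int)
             LOCATION ('pxf://public.forpxf_table1?PROFILE=Jdbc&SERVER=pgsrvcfg')
           FORMAT 'CUSTOM' (FORMATTER='pxfwritable_export');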
3. Use the pxf_tblfrompg readable external table that you created in the previous section to view the new data in the forpxf_table1 PostgreSQL table:
You need to join several tables that all reside in the same external database.
You want to perform complex aggregation closer to the data source.
You would use, but are not allowed to create, a VIEW in the external database.
You would rather consume computational resources in the external system to minimize utilization of Greenplum Database resources.
You want to run a HIVE query and control resource utilization via YARN.
The Greenplum Database administrator defines a query and provides you with the query name to use when you create the external table. Instead of a table
name, you specify query:<query_name> in the CREATE EXTERNAL TABLE LOCATION clause to instruct the PXF JDBC connector to run the static query named
<query_name> in the remote SQL database.
PXF supports named queries only with readable external tables. You must create a unique Greenplum Database readable external table for each query that
you want to run.
The names and types of the external table columns must exactly match the names, types, and order of the columns returned by the query result. If the query returns the results of an aggregation or other function, be sure to use the AS qualifier to specify a specific column name.
For example, suppose that you are working with PostgreSQL tables that have the following definitions:
CREATE TABLE customers(id int, name text, city text, state text);
CREATE TABLE orders(customer_id int, amount int, month int, year int);
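Suppose further that the administrator defined a named query, order_rpt, along these lines (a sketch consistent with the result tuple described below):

SELECT c.name, sum(o.amount) AS total, o.month
  FROM customers c JOIN orders o ON c.id = o.customer_id
  GROUP BY c.name, o.month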
This query returns tuples of type (name text, total int, month int). If the order_rpt query is defined for the PXF JDBC server named pgserver, you could create a Greenplum Database external table to read these query results as follows:
CREATE EXTERNAL TABLE orderrpt_frompg(name text, total int, month int)
LOCATION ('pxf://query:order_rpt?PROFILE=Jdbc&SERVER=pgserver&PARTITION_BY=month:int&RANGE=1:13&INTERVAL=3')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
This command references a query named order_rpt defined in the pgserver server configuration. It also specifies JDBC read partitioning options that provide
PXF with the information that it uses to split/partition the query result data across its servers/segments.
The PXF JDBC connector automatically applies column projection and filter pushdown to external tables that reference named queries.
Use the PostgreSQL database pgtestdb, user pxfuser1, and PXF JDBC connector server configuration pgsrvcfg that you created in Example: Reading
From and Writing to a PostgreSQL Database.
Create two PostgreSQL tables and insert data into the tables.
Assign all privileges on the tables to pxfuser1.
Define a named query that performs a complex SQL statement on the two PostgreSQL tables, and add the query to thepgsrvcfg JDBC server
configuration.
Create a PXF readable external table definition that matches the query result tuple and also specifies read partitioning options.
Read the query results, making use of PXF column projection and filter pushdown.
Perform the following procedure to create PostgreSQL tables namedcustomers and orders in the public schema of the database named pgtestdb, and grant the
user named pxfuser1 all privileges on these tables:
3. Create a table named customers and insert some data into this table:
CREATE TABLE customers(id int, name text, city text, state text);
INSERT INTO customers VALUES (111, 'Bill', 'Helena', 'MT');
INSERT INTO customers VALUES (222, 'Mary', 'Athens', 'OH');
INSERT INTO customers VALUES (333, 'Tom', 'Denver', 'CO');
INSERT INTO customers VALUES (444, 'Kate', 'Helena', 'MT');
INSERT INTO customers VALUES (555, 'Harry', 'Columbus', 'OH');
INSERT INTO customers VALUES (666, 'Kim', 'Denver', 'CO');
INSERT INTO customers VALUES (777, 'Erik', 'Missoula', 'MT');
INSERT INTO customers VALUES (888, 'Laura', 'Athens', 'OH');
INSERT INTO customers VALUES (999, 'Matt', 'Aurora', 'CO');
4. Create a table named orders and insert some data into this table:
CREATE TABLE orders(customer_id int, amount int, month int, year int);
INSERT INTO orders VALUES (111, 12, 12, 2018);
INSERT INTO orders VALUES (222, 234, 11, 2018);
INSERT INTO orders VALUES (333, 34, 7, 2018);
INSERT INTO orders VALUES (444, 456, 111, 2018);
INSERT INTO orders VALUES (555, 56, 11, 2018);
INSERT INTO orders VALUES (666, 678, 12, 2018);
INSERT INTO orders VALUES (777, 12, 9, 2018);
INSERT INTO orders VALUES (888, 120, 10, 2018);
INSERT INTO orders VALUES (999, 120, 11, 2018);
5. Assign user pxfuser1 all privileges on tables customers and orders, and then exit the psql subsystem:
In this procedure you create a named query text file, add it to the pgsrvcfg JDBC server configuration, and synchronize the PXF configuration to the Greenplum Database cluster.
gpadmin@gpmaster$ cd $PXF_CONF/servers/pgsrvcfg
3. Open a query text file named pg_order_report.sql in a text editor and copy/paste the following query into the file:
SELECT c.name, c.city, sum(o.amount) AS total, o.month
FROM customers c JOIN orders o ON c.id = o.customer_id
WHERE c.state = 'CO'
GROUP BY c.name, c.city, o.month
5. Synchronize the PXF configuration changes to the Greenplum Database cluster. For example:
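As in the earlier synchronization step:

gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster sync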
Perform the following procedure on your Greenplum Database cluster to create a PXF external table that references the query file that you created in the
previous section, and then reads the query result data:
1. Create the PXF external table specifying the Jdbc profile. For example:
CREATE EXTERNAL TABLE pxf_queryres_frompg(name text, city text, total int, month int)
LOCATION ('pxf://query:pg_order_report?PROFILE=Jdbc&SERVER=pgsrvcfg&PARTITION_BY=month:int&RANGE=1:13&INTERVAL=3')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
With this partitioning scheme, PXF will issue 4 queries to the remote SQL database, one query per quarter. Each query will return customer names
and the total amount of all of their orders in a given month, aggregated per customer, per month, for each month of the target quarter. Greenplum
Database will then combine the data into a single result set for you when you query the external table.
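A sketch of a query that would produce the per-city totals shown below (column projection applies because only city and total are requested):

SELECT city, sum(total) FROM pxf_queryres_frompg GROUP BY city;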
city | sum
--------+-----
Aurora | 120
Denver | 712
(2 rows)
When you execute this query, PXF requests and retrieves query results for only the city and total columns, reducing the amount of data sent back to Greenplum Database.
city | sum
--------+-----
Denver | 678
Aurora | 120
(2 rows)
In this example, PXF will add the WHERE filter to the subquery. This filter is pushed to and executed on the remote database system, reducing the
amount of data that PXF sends back to Greenplum Database. The GROUP BY aggregation, however, is not pushed to the remote and is performed by
Greenplum.
The following are example JDBC <custom-option> connection strings that you could specify directly in the CREATE EXTERNAL TABLE LOCATION clause:
&JDBC_DRIVER=org.postgresql.Driver&DB_URL=jdbc:postgresql://pgserverhost:5432/pgtestdb&USER=pguser1&PASS=changeme
&JDBC_DRIVER=com.mysql.jdbc.Driver&DB_URL=jdbc:mysql://mysqlhost:3306/testdb&USER=user1&PASS=changeme
For example:
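A sketch of an external table that supplies the PostgreSQL connection string above directly in the DDL (the table and column names are illustrative):

CREATE EXTERNAL TABLE pxf_pgtbl_ddl(id int)
  LOCATION ('pxf://public.forpxf_table1?PROFILE=Jdbc&JDBC_DRIVER=org.postgresql.Driver&DB_URL=jdbc:postgresql://pgserverhost:5432/pgtestdb&USER=pguser1&PASS=changeme')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');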
Credentials that you provide in this manner are visible as part of the external table definition. Do not use this method of passing credentials in a production
environment.
Refer to Configuration Property Precedence for detailed information about the precedence rules that PXF uses to obtain configuration property settings for
a Greenplum Database user.
Troubleshooting PXF
PXF Errors
The following table describes some errors you may encounter while using PXF:
PXF Logging
Enabling more verbose logging may aid PXF troubleshooting efforts. PXF provides two
categories of message logging: service-level and client-level.
Service-Level Logging
PXF utilizes log4j for service-level logging. PXF-service-related log messages are captured in a
log file specified by PXF’s log4j properties file, $PXF_CONF/conf/pxf-log4j.properties. The default PXF
logging configuration will write INFO and more severe level logs to $PXF_CONF/logs/pxf-service.log.
You can configure the logging level and the log file location.
PXF provides more detailed logging when the DEBUG level is enabled. To configure PXF DEBUG
logging and examine the output:
#log4j.logger.org.greenplum.pxf=DEBUG
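To enable DEBUG logging, uncomment this line in $PXF_CONF/conf/pxf-log4j.properties on the Greenplum Database master host so that it reads:

log4j.logger.org.greenplum.pxf=DEBUG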
3. Use the pxf cluster sync command to copy the updated pxf-log4j.properties file to the Greenplum
Database cluster. For example:
gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster sync
4. Restart PXF on each Greenplum Database segment host as described in Restarting PXF.
5. With DEBUG level logging now enabled, you can perform your PXF operations. Be sure to
make note of the time; this will direct you to the relevant log messages in
$PXF_CONF/logs/pxf-service.log.
$ date
Wed Oct 4 09:30:06 MDT 2017
$ psql -d <dbname>
Client-Level Logging
Database-level client logging may provide insight into internal PXF service operations.
Enable Greenplum Database and PXF debug message logging during operations on PXF
external tables by setting the client_min_messages server configuration parameter to DEBUG2 in your
psql session.
$ psql -d <dbname>
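Then, in the session, set the parameter; for example:

<dbname>=# SET client_min_messages=DEBUG2;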
Note: DEBUG2 database session logging has a performance impact. Remember to turn off
DEBUG2 logging after you have collected the desired information.
Note: The configuration changes described in this topic require modifying config files on each node in your Greenplum Database cluster. After you perform the updates on the master, be sure to synchronize the PXF configuration to the Greenplum Database cluster.
In an out of memory (OOM) situation, PXF returns the following error in response to a query:
java.lang.OutOfMemoryError: Java heap space
By default, PXF is configured such that when the PXF JVM detects an out of memory condition
on a segment host, it automatically runs a script that kills the PXF server running on the host.
The PXF_OOM_KILL configuration property governs this auto-kill behavior.
When auto-kill is enabled and the PXF JVM detects an OOM condition and kills the PXF server
on the segment host:
Any query that you run on a PXF external table will fail with the following error until you
restart the PXF server on the segment host:
... Failed to connect to <host> port 5888: Connection refused
When the PXF server on a segment host is shut down in this manner, you must explicitly
restart the PXF server on the host. See the pxf reference page for more information on the pxf
start command.
Refer to the configuration procedure below for the instructions to disable/enable this PXF
configuration property.
In an out of memory situation, it may be useful to capture the Java heap dump to help determine
what factors contributed to the resource exhaustion. You can use the PXF_OOM_DUMP_PATH
property to configure PXF to write the heap dump to a file when it detects an OOM condition. By
default, PXF does not dump the Java heap on OOM.
If you choose to enable the heap dump on OOM, you must set PXF_OOM_DUMP_PATH to the absolute path to a file or directory:
If you specify a directory, the PXF JVM writes the heap dump to the file
<directory>/java_pid<pid>.hprof , where <pid> identifies the process ID of the PXF server instance.
The PXF JVM writes a new file to the directory every time the JVM goes OOM.
If you specify a file and the file does not exist, the PXF JVM writes the heap dump to the
file when it detects an OOM. If the file already exists, the JVM will not dump the heap.
Ensure that the gpadmin user has write access to the dump file or directory.
Note: Heap dump files are often rather large. If you enable heap dump on OOM for PXF and specify a directory for PXF_OOM_DUMP_PATH, multiple OOMs will generate multiple files in the directory. Monitor the free space in the dump directory and remove heap dump files that you no longer need.
Refer to the configuration procedure below for the instructions to enable/disable this PXF
configuration property.
Procedure
Auto-kill of the PXF server on OOM is enabled by default. Heap dump generation on OOM is
disabled by default. To configure one or both of these properties, perform the following
procedure:
3. If you want to configure (i.e. turn off, or turn back on) auto-kill of the PXF server on OOM,
locate the PXF_OOM_KILL property in the pxf-env.sh file. If the setting is commented out,
uncomment it, and then update the value. For example, to turn off this behavior, set the
value to false:
export PXF_OOM_KILL=false
4. If you want to configure (i.e. turn on, or turn back off) automatic heap dumping when the
PXF server hits an OOM condition, locate the PXF_OOM_DUMP_PATH setting in the pxf-env.sh
file.
1. To turn this behavior on, set the PXF_OOM_DUMP_PATH property value to the file
system location to which you want the PXF JVM to dump the Java heap. For
example, to dump to a file named /home/gpadmin/pxfoom_segh1:
export PXF_OOM_DUMP_PATH=/home/gpadmin/pxfoom_segh1
2. To turn off heap dumping after you have turned it on, comment out the
PXF_OOM_DUMP_PATH property setting:
#export PXF_OOM_DUMP_PATH=/home/gpadmin/pxfoom_segh1
6. Use the pxf cluster sync command to copy the updated pxf-env.sh file to the Greenplum
Database cluster. For example:
gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster sync
7. Restart PXF on each Greenplum Database segment host as described in Restarting PXF.
Each PXF agent running on a segment host is configured with a default maximum Java heap
size of 2GB and an initial heap size of 1GB. If the segment hosts in your Greenplum Database
cluster have an ample amount of memory, try increasing the maximum heap size to a value
between 3GB and 4GB. Set the initial and maximum heap size to the same value if possible.
Perform the following procedure to increase the heap size for the PXF agent running on each
segment host in your Greenplum Database cluster.
$ ssh gpadmin@<gpmaster>
3. Locate the PXF_JVM_OPTS setting in the pxf-env.sh file, and update the -Xmx and/or -Xms
options to the desired value. For example:
PXF_JVM_OPTS="-Xmx3g -Xms3g"
5. Use the pxf cluster sync command to copy the updated pxf-env.sh file to the Greenplum
Database cluster. For example:
gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster sync
6. Restart PXF on each Greenplum Database segment host as described in Restarting PXF.
The default maximum number of Tomcat threads for PXF is 200. The PXF_MAX_THREADS
configuration property controls this setting.
PXF thread capacity is determined by the profile and whether or not the data is compressed. If
you plan to run large workloads on a large number of files in an external Hive data store, or you
are reading compressed ORC or Parquet data, consider specifying a lower PXF_MAX_THREADS
value.
Note: Keep in mind that an increase in the thread count also increases memory consumption,
particularly when all of the threads are in use.
Perform the following procedure to set the maximum number of Tomcat threads for the PXF agent running on each segment host in your Greenplum Database cluster.
3. Locate the PXF_MAX_THREADS setting in the pxf-env.sh file. Uncomment the setting and
update it to the desired value. For example, to set the maximum number of Tomcat
threads to 100:
export PXF_MAX_THREADS=100
5. Use the pxf cluster sync command to copy the updated pxf-env.sh file to the Greenplum
Database cluster. For example:
gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster sync
6. Restart PXF on each Greenplum Database segment host as described in Restarting PXF.
Some JDBC drivers return an error when the default time zone of the PXF JVM conflicts with that
of the external SQL database. For example, if you use the PXF JDBC connector to access an Oracle database with a
conflicting time zone, PXF logs an error similar to the following:
SEVERE: Servlet.service() for servlet [PXF REST Service] in context with path [/pxf] threw exception
java.io.IOException: ORA-00604: error occurred at recursive SQL level 1
ORA-01882: timezone region not found
Should you encounter this error, you can set default time zone option(s) for the PXF server in
the $PXF_CONF/conf/pxf-env.sh configuration file, PXF_JVM_OPTS property setting. For example, to set
the time zone:
export PXF_JVM_OPTS="<current_settings> -Duser.timezone=America/Chicago"
You can use the PXF_JVM_OPTS property to set other Java options as well.
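For example, a pxf-env.sh setting that combines the default heap options with the time zone option might look like the following (the heap values shown are the defaults described earlier; the time zone value is illustrative):
export PXF_JVM_OPTS="-Xmx2g -Xms1g -Duser.timezone=America/Chicago"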
As described in previous sections, you must synchronize the updated PXF configuration to the
Greenplum Database cluster and restart the PXF server on each segment host.
PXF fragment metadata caching is enabled by default. To turn off fragment metadata caching, or
to re-enable it after turning it off, perform the following procedure:
3. Locate the PXF_FRAGMENTER_CACHE setting in the pxf-env.sh file. If the setting is commented
out, uncomment it, and then update the value. For example, to turn off fragment metadata
caching, set the value to false:
export PXF_FRAGMENTER_CACHE=false
5. Use the pxf cluster sync command to copy the updated pxf-env.sh file to the Greenplum
Database cluster. For example:
gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster sync
6. Restart PXF on each Greenplum Database segment host as described in Restarting PXF.
You may be using the gphdfs external table protocol in a Greenplum Database version 4 or 5
cluster to access data stored in Hadoop. Greenplum Database version 6 removes support for the
gphdfs protocol. To maintain access to Hadoop data in Greenplum 6, you must migrate your gphdfs
external tables to use the Greenplum Platform Extension Framework (PXF). This involves setting
up PXF and creating new external tables that use the pxf external table protocol.
To migrate your gphdfs external tables to use the pxf external table protocol, you prepare for the
migration, initialize and configure PXF, create equivalent pxf external tables, verify access with
PXF, remove the gphdfs external tables, and revoke privileges on the gphdfs protocol.
If you are migrating gphdfs from a Greenplum Database 5 installation, you perform the migration in
the order above in your Greenplum 5 cluster before you migrate data to Greenplum 6.
If you are migrating gphdfs from a Greenplum Database 4 installation, you perform the migration in
a similar order. However, since PXF is not available in Greenplum Database 4, you must perform
certain actions in the Greenplum 6 installation before you migrate the data.
Limitations
Keep the following in mind as you plan for the migration:
PXF does not support reading or writing compressed Avro files on HDFS.
You can run a query against the Greenplum Database catalog to list the gphdfs external tables in a database.
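One possible form of such a query, assuming the Greenplum 5 catalog layout in which pg_exttable stores the external table LOCATION URIs in its location column (verify the column names against your catalog version):
SELECT n.nspname AS schemaname, c.relname AS tablename
  FROM pg_exttable e
       JOIN pg_class c ON c.oid = e.reloid
       JOIN pg_namespace n ON n.oid = c.relnamespace
 WHERE e.location[1] LIKE 'gphdfs://%';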
2. For each table that you choose to migrate, identify the format of the external data and the
column definitions. Also identify the options with which the gphdfs table was created. You can
use the \d+ SQL meta-command to obtain this information. For example:
\d+ public.gphdfs_writable_parquet
External table "public.gphdfs_writable_parquet"
Column | Type | Modifiers | Storage | Description
--------+---------+-----------+----------+-------------
id | integer | | plain |
msg | text | | extended |
Type: writable
Encoding: UTF8
Format type: parquet
Format options: formatter 'gphdfs_export'
External options: {}
External location: gphdfs://hdfshost:8020/data/dir1/gphdfs_writepq?codec=GZIP
Execute on: all segments
gphdfs Configuration Option | Description | pxf Consideration
HADOOP_HOME | Environment variable that identifies the Hadoop installation directory | Not applicable; PXF is bundled with the required dependent Hadoop libraries and JARs
CLASSPATH | Environment variable that identifies the locations of Hadoop JAR and configuration files | Not applicable; PXF automatically includes the Hadoop libraries, JARs, and configuration files that it bundles in the CLASSPATH. PXF also automatically includes user-registered dependencies found in the $PXF_CONF/lib directory in the CLASSPATH.
gp_hadoop_target_version | Server configuration parameter that identifies the Hadoop distribution | Not applicable; PXF works out-of-the-box with the different Hadoop distributions
Configuration properties required by PXF, and the gphdfs equivalent, if applicable, include:
Configuration Item | Description | gphdfs Config | pxf Config
JAVA_HOME | Environment variable that identifies the Java installation directory | Set JAVA_HOME on each segment host | Set JAVA_HOME on each segment host
JVM option settings | Options with which to start the JVM | Set options in the GP_JAVA_OPT environment variable in hadoop_env.sh | Set options in the PXF_JVM_OPTS environment variable in $PXF_CONF/conf/pxf-env.sh
PXF Server | PXF server configuration for Hadoop | Not applicable | Configure a PXF server for Hadoop
Privileges | The Greenplum Database privileges required to create external tables in the given protocol | Grant SELECT and INSERT privileges on the gphdfs protocol to appropriate users | Grant SELECT and INSERT privileges on the pxf protocol to appropriate users
After you determine the equivalent PXF configuration properties, you will:
1. Update the Java version installed on each Greenplum Database host, if necessary. PXF
supports Java version 8. If your Greenplum Database cluster hosts are running Java 7,
upgrade to Java version 8 as described in Installing Java for PXF.
3. Configure the PXF Hadoop Connectors. This procedure creates a PXF server configuration
that provides PXF the information that it requires to access Hadoop.
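As a rough sketch of what that server configuration typically involves, you copy the Hadoop client configuration files into the default PXF server directory and synchronize the change (the /etc/hadoop/conf source path is an assumption; follow the linked procedure for the authoritative steps):
gpadmin@gpmaster$ cp /etc/hadoop/conf/core-site.xml /etc/hadoop/conf/hdfs-site.xml $PXF_CONF/servers/default/
gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster sync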
The LOCATION clause of a gphdfs external table takes the following format:
LOCATION('gphdfs://<hdfs_host>:<hdfs_port>/<path-to-data>?[&<custom-option>=<value>[...]]')
PXF’s LOCATION clause takes the following format when you access data stored on Hadoop:
LOCATION('pxf://<path-to-data>?PROFILE=<profile_name>[&SERVER=<server_name>][&<custom-option>=<value>[...]]')
You are not required to specify the HDFS host and port number when you create a PXF external
table. PXF obtains this information from the default server configuration, or from the server
configuration name that you specify in <server_name>.
Refer to Creating an External Table in the PXF documentation for more information about the
PXF CREATE EXTERNAL TABLE syntax and keywords.
When you create an external table specifying the gphdfs protocol, you identify the format of the
external data in the FORMAT clause (discussed in the next section). PXF uses a PROFILE option in
the LOCATION clause to identify the source and type of the external data.
Data Format | pxf PROFILE
Avro | hdfs:avro
Parquet | hdfs:parquet
Text | hdfs:text
Refer to Connectors, Data Formats, and Profiles in the PXF documentation for more information
about the PXF profiles supported for Hadoop.
Both gphdfs and pxf utilize custom options in the LOCATION clause to identify data-format-, operation-,
or profile-specific options supported by the protocol. For example, both gphdfs and pxf support
parquet and compression options on INSERT operations.
Should you need to migrate a gphdfs writable external table that references an HDFS file to PXF,
map gphdfs to PXF writable external table compression options as follows:
gphdfs LOCATION Option | Description | pxf LOCATION Option
compress | Use of compression | Not applicable; depends on the profile - may be uncompressed by default or specified via COMPRESSION_CODEC
compression_type | Type of compression | COMPRESSION_TYPE
codec | Compression codec | COMPRESSION_CODEC
codec_level (Avro format deflate codec only) | Level of compression | Not applicable; PXF does not support writing Avro data
Map gphdfs Parquet options to pxf options as follows:
gphdfs LOCATION Option | Description | pxf LOCATION Option
schema | Parquet schema | SCHEMA
pagesize | Page size | PAGE_SIZE
rowgroupsize | Row group size | ROWGROUP_SIZE
parquetversion or pqversion | Parquet version | PARQUET_VERSION
dictionaryenable | Enable a dictionary | The dictionary is always enabled when writing Parquet data with PXF
dictionarypagesize | Dictionary page size | DICTIONARY_PAGE_SIZE
Map the gphdfs FORMAT clause to the equivalent pxf FORMAT clause as follows:
Data Format | gphdfs FORMAT Option | pxf FORMAT Option
Avro | FORMAT 'AVRO' | FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import')
Parquet | FORMAT 'PARQUET' | FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import') (read); FORMAT 'CUSTOM' (FORMATTER='pxfwritable_export') (write)
Text | FORMAT 'TEXT' (DELIMITER ',') | FORMAT 'TEXT' (DELIMITER ',')
For text data, the FORMAT clause may identify a delimiter or other formatting option as described
on the CREATE EXTERNAL TABLE command reference page.
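Putting the option and FORMAT mappings together, a hypothetical pxf counterpart to the gphdfs_writable_parquet table shown earlier might look like the following (the path under pxf://, the hdfs:parquet profile choice, and the gzip codec value are illustrative assumptions):
CREATE WRITABLE EXTERNAL TABLE pxf_writable_parquet (
    id integer,
    msg text )
LOCATION ('pxf://data/dir1/pxf_writepq?PROFILE=hdfs:parquet&COMPRESSION_CODEC=gzip')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_export');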
Example gphdfs to pxf External Table Mapping for an HDFS Text File
Example gphdfs CREATE EXTERNAL TABLE command to read a text file on HDFS:
CREATE EXTERNAL TABLE ext_expenses (
name text,
date date,
amount float4,
category text,
desc1 text )
LOCATION ('gphdfs://hdfshost-1:8081/dir/filename.txt')
FORMAT 'TEXT' (DELIMITER ',');
Equivalent pxf CREATE EXTERNAL TABLE command, provided that the default PXF server contains the
Hadoop configuration:
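The sketch below follows the mappings above; the table name ext_expenses_pxf and the unchanged HDFS path are illustrative assumptions:
CREATE EXTERNAL TABLE ext_expenses_pxf (
    name text,
    date date,
    amount float4,
    category text,
    desc1 text )
LOCATION ('pxf://dir/filename.txt?PROFILE=hdfs:text')
FORMAT 'TEXT' (DELIMITER ',');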
If you are using PXF in your Greenplum Database 5 installation, Pivotal recommends that you upgrade Greenplum Database to version 5.21.2 or later
before you migrate PXF to Greenplum Database 6. If you migrate from an earlier version of Greenplum 5, you will be required to perform additional
migration steps in your Greenplum 6 installation.
The PXF Greenplum Database 5 to 6 migration procedure has two parts: you first complete the PXF pre-migration actions in your Greenplum Database 5
installation (Step 1), and then you install and configure Greenplum 6, migrate your data, and migrate PXF to Greenplum 6 (Step 2).
Prerequisites
Before migrating PXF from Greenplum 5 to Greenplum 6, ensure that you can:
2. Identify the Greenplum Database version number of your 5 installation. For example:
SELECT version();
version
-----------------------------------------------------------------------------------------------------------------------
 PostgreSQL 8.3.23 (Greenplum Database 5.21.2 build commit:610b6d777436fe4a281a371cae85ac40f01f4f5e) on x86_64-pc-linux-gnu, compiled by GCC gcc (GCC) 6.2.0, 64-bit
(1 row)
If you are running a Greenplum Database version prior to 5.21.2, consider upgrading to version 5.21.2 as described in Upgrading PXF in the
Greenplum 5 documentation.
3. Greenplum 6 removes the gphdfs external table protocol. If you have gphdfs external tables defined in your Greenplum 5 installation, you must delete or
migrate them to pxf as described in Migrating gphdfs External Tables to PXF.
5. If you plan to install Greenplum Database 6 on a new set of hosts, be sure to save a copy of the $PXF_CONF directory in your Greenplum 5 installation.
6. Install and configure Greenplum Database 6, migrate Greenplum 5 table definitions and data to your Greenplum 6 installation, and then continue your
PXF migration with Step 2: Migrating PXF to Greenplum 6.
$ ssh gpadmin@<gp6master>
2. If you installed Greenplum Database 6 on a new set of hosts, copy the $PXF_CONF directory from your Greenplum 5 installation to the master node.
3. Initialize PXF on each segment host as described in Initializing PXF, specifying the PXF_CONF directory that you copied in the step above.
4. If you are migrating from Greenplum Database version 5.21.1 or earlier, perform the version-applicable steps identified in the Greenplum
Database 5.21 Upgrading PXF documentation in your Greenplum Database 6 installation. Start with step 4 in the procedure. (Note that this
procedure identifies the actions required to upgrade PXF between Greenplum Database 5.x releases. These steps are required to configure a
Greenplum version-6-compatible PXF.)
5. Synchronize the PXF configuration from the Greenplum Database 6 master host to the standby master and each segment host in the cluster. For
example:
gpadmin@gp6master$ $GPHOME/pxf/bin/pxf cluster sync
6. Start PXF on each Greenplum Database 6 segment host as described in Starting PXF.
7. Verify the migration by testing that each PXF external table can access the referenced data store.
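For example, a quick check against one of the migrated tables might look like the following (pxf_example is a hypothetical table name):
SELECT count(*) FROM pxf_example;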
The Greenplum Platform Extension Framework (PXF) includes the following utility reference
pages:
pxf cluster
pxf
pxf cluster
Manage the PXF configuration and the PXF service instance on all Greenplum Database hosts.
Synopsis
pxf cluster <command> [<option>]
help
init
reset
restart
start
status
stop
sync
Description
The pxf cluster utility command manages PXF on the master host, the standby master host, and on all
Greenplum Database segment hosts. You can use the utility to initialize, start, stop, and restart
PXF, to display the status of the PXF service instance, and to synchronize the PXF configuration
across the cluster.
pxf cluster requires a running Greenplum Database cluster. You must run the utility on the
Greenplum Database master host.
(If you want to manage the PXF service instance on a specific segment host, use the pxf utility.
See pxf.)
Commands
help
Display the pxf cluster help message and then exit.
init
Initialize the PXF service instance on the master, standby master, and on all segment
hosts. When you initialize PXF across your Greenplum Database cluster, you must identify
the PXF user configuration directory via an environment variable named $PXF_CONF. If you
do not set $PXF_CONF prior to initializing PXF, PXF returns an error.
restart
Stop, and then start, the PXF service instance on all segment hosts.
start
Start the PXF service instance on all segment hosts.
status
Display the status of the PXF service instance on all segment hosts.
stop
Stop the PXF service instance on all segment hosts.
sync
Synchronize the PXF configuration ($PXF_CONF) from the master to the standby master and
to all Greenplum Database segment hosts. By default, this command updates files on and
copies files to the remote hosts. You can instruct PXF to also delete files during the
synchronization; see Options.
If you have updated the PXF user configuration or added JAR files, you must also restart
PXF after you synchronize the PXF configuration.
Options
The pxf cluster sync command takes the following option:
-d | --delete
Delete any files in the PXF user configuration on the standby master and segment hosts
that are not also present on the master host.
Examples
Stop the PXF service instance on all segment hosts:
$ $GPHOME/pxf/bin/pxf cluster stop
Synchronize the PXF configuration to the standby and all segment hosts, deleting files that do
not exist on the master host:
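$ $GPHOME/pxf/bin/pxf cluster sync --delete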
See Also
pxf
Synopsis
Description
Commands
Options
Examples
See Also
Release Notes
Download
Ask for Help
Knowledge Base
PDF
v5.1.0
v5.0.0
v4.3.17
v4.3.16
v4.3.15
v4.3.14
v4.3.13
v4.3.12
v4.3.11
v4.3.10
v4.3.9
v4.3.8
v4.3.7
v4.3.6
v4.3.5
v4.3.4
v4.3.3
v4.3.2
v4.3.1
v4.3.0
Introduction to PXF
About PXF Filter Pushdown
About Column Projection in PXF
Administering PXF
Configuring PXF
About the PXF Installation and Configuration Directories
Installing Java for PXF
Initializing PXF
pxf
Synopsis
Description
Commands
Options
Manage the PXF configuration and the PXF service instance on the local Greenplum Database
host.
Synopsis
pxf <command> [<option>]
cluster
help
init
reset
restart
start
status
stop
sync
version
Description
The pxf utility manages the PXF configuration and the PXF service instance on the local
Greenplum Database host.
You can initialize or reset PXF on the master, the standby master, or a specific segment host. You
can also synchronize the PXF configuration from the master to these hosts.
You can start, stop, or restart the PXF service instance on a specific segment host, or display
the status of the PXF service instance running on a segment host.
(Use the pxf cluster command to initialize or reset PXF on all hosts, synchronize the PXF
configuration to the Greenplum Database cluster, or to start, stop, or display the status of the
PXF service instance on all segment hosts in the cluster.)
Commands
cluster
Manage the PXF configuration and the PXF service instance on all Greenplum Database
hosts. See pxf cluster.
help
Display the pxf management utility help message and then exit.
init
Initialize the PXF service instance on the host. When you initialize PXF, you must identify
the PXF user configuration directory via an environment variable named $PXF_CONF. If you
do not set $PXF_CONF prior to initializing PXF, PXF prompts you to accept or decline the
default user configuration directory location.
reset
Reset the PXF service instance running on the host. Resetting removes PXF runtime files
and directories, and returns PXF to an uninitialized state. You must stop the PXF service
instance running on a segment host before you reset PXF on the host.
restart
Restart the PXF service instance running on the segment host.
start
Start the PXF service instance on the segment host.
status
Display the status of the PXF service instance running on the segment host.
stop
Stop the PXF service instance running on the segment host.
sync
Synchronize the PXF configuration ($PXF_CONF) from the master to a specific Greenplum
Database standby master or segment host. You must run pxf sync on the master host. By
default, this command updates files on and copies files to the remote host. You can instruct
PXF to also delete files during the synchronization; see Options.
version
Display the PXF version and then exit.
Options
The pxf init command takes the following option:
-y
Do not prompt, use the default $PXF_CONF directory location if the environment variable is
not set.
The pxf reset command takes the following option:
-f | --force
Do not prompt before resetting the PXF service instance; reset without user interaction.
The pxf sync command, which you must run on the Greenplum Database master host, takes the
following option and argument:
-d | --delete
Delete any files in the PXF user configuration on <gphost> that are not also present on the
master host. If you specify this option, you must provide it on the command line before
<gphost>.
Examples
Start the PXF service instance on the local segment host:
$ $GPHOME/pxf/bin/pxf start
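Based on the sync command described above, synchronize the PXF configuration from the master to a single host, run on the master host (<gphost> is a placeholder for a standby master or segment host name):
gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf sync <gphost>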
See Also
pxf cluster