Hive: Integrating Hive and BI
https://docs.cloudera.com/
Legal Notice
© Cloudera Inc. 2024. All rights reserved.
The documentation is and contains Cloudera proprietary information protected by copyright and other intellectual property
rights. No license under copyright or any other intellectual property right is granted herein.
Unless otherwise noted, scripts and sample code are licensed under the Apache License, Version 2.0.
Copyright information for Cloudera software may be found within the documentation accompanying each component in a
particular release.
Cloudera software includes software from various open source or other third party projects, and may be released under the
Apache Software License 2.0 (“ASLv2”), the Affero General Public License version 3 (AGPLv3), or other license terms.
Other software included may be released under the terms of alternative open source licenses. Please review the license and
notice files accompanying the software for additional licensing information.
Please visit the Cloudera software product page for more information on Cloudera software. For more information on
Cloudera support services, please visit either the Support or Sales page. Feel free to contact us directly to discuss your
specific needs.
Cloudera reserves the right to change any products at any time, and without notice. Cloudera assumes no responsibility nor
liability arising from the use of products, except as expressly agreed to in writing by Cloudera.
Cloudera, Cloudera Altus, HUE, Impala, Cloudera Impala, and other Cloudera marks are registered or unregistered
trademarks in the United States and other countries. All other trademarks are the property of their respective owners.
Disclaimer: EXCEPT AS EXPRESSLY PROVIDED IN A WRITTEN AGREEMENT WITH CLOUDERA,
CLOUDERA DOES NOT MAKE NOR GIVE ANY REPRESENTATION, WARRANTY, NOR COVENANT OF
ANY KIND, WHETHER EXPRESS OR IMPLIED, IN CONNECTION WITH CLOUDERA TECHNOLOGY OR
RELATED SUPPORT PROVIDED IN CONNECTION THEREWITH. CLOUDERA DOES NOT WARRANT THAT
CLOUDERA PRODUCTS NOR SOFTWARE WILL OPERATE UNINTERRUPTED NOR THAT IT WILL BE
FREE FROM DEFECTS NOR ERRORS, THAT IT WILL PROTECT YOUR DATA FROM LOSS, CORRUPTION
NOR UNAVAILABILITY, NOR THAT IT WILL MEET ALL OF CUSTOMER’S BUSINESS REQUIREMENTS.
WITHOUT LIMITING THE FOREGOING, AND TO THE MAXIMUM EXTENT PERMITTED BY APPLICABLE
LAW, CLOUDERA EXPRESSLY DISCLAIMS ANY AND ALL IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO IMPLIED WARRANTIES OF MERCHANTABILITY, QUALITY, NON-INFRINGEMENT, TITLE, AND
FITNESS FOR A PARTICULAR PURPOSE AND ANY REPRESENTATION, WARRANTY, OR COVENANT BASED
ON COURSE OF DEALING OR USAGE IN TRADE.
Contents
Introduction to HWC
  Introduction to HWC execution modes
    Spark Direct Reader mode
    JDBC execution mode
  Automating mode selection
  Configuring Spark Direct Reader mode
  Configuring JDBC execution mode
  Kerberos configurations for HWC
  Configuring external file authorization
  Reading managed tables through HWC
  Writing managed tables through HWC
  API operations
    HWC supported types mapping
    Catalog operations
    Read and write operations
    Commit transaction in Spark Direct Reader mode
    Close HiveWarehouseSession operations
    Use HWC for streaming
    HWC API Examples
    Hive Warehouse Connector Interfaces
  Submit a Scala or Java application
  Submit a Python app
Introduction to HWC
You need to understand Hive Warehouse Connector (HWC) to query Apache Hive tables from Apache Spark.
Examples of supported APIs, such as Spark SQL, show some operations you can perform, including how to write to a
Hive ACID table or write a DataFrame from Spark.
HWC is software for securely accessing Hive tables from Spark. You need to use the HWC if you want to access
Hive managed tables from Spark. You explicitly use HWC by calling the HiveWarehouseConnector API to write to
managed tables. You might use HWC without even realizing it. HWC implicitly reads tables when you run a Spark
SQL query on a Hive managed table.
You do not need HWC to read or write Hive external tables. You can use native Spark SQL. You might want to use
HWC to purge external table files. From Spark, using HWC you can read Hive external tables in ORC or Parquet
formats. From Spark, using HWC you can write Hive external tables in ORC format only.
Creating an external table stores only the metadata in HMS. If you use HWC to create the external table, HMS tracks the table location, table names, and columns. Dropping an external table deletes the metadata from HMS. You can set an option to drop, or to keep, the actual data files in the file system.
If you do not use HWC, dropping an external table deletes only the metadata from HMS. If you do not have
permission to access the file system, and you want to purge table data in addition to metadata, you need to use HWC.
Supported APIs
• Spark SQL
Supports native Spark SQL query read (only) patterns. Output conforms to native spark.sql conventions.
• HWC
Supports HiveWarehouse Session API operations using the HWC sql API.
• DataFrames
Supports accessing a Hive ACID table from Scala or PySpark directly, using DataFrames. Use the short name
HiveAcid. Direct reads and writes from the file are not supported.
Spark SQL Example
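A minimal sketch of a Spark SQL read, assuming a Hive managed table named managedTable exists:
scala> spark.sql("select * from managedTable").show()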
DataFrames Example
Hive ACID tables are tables in the Hive metastore, and you access them using the DataFrames format as follows:
Syntax:
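A sketch of the expected form, using the HiveAcid short name mentioned above; the table option value is a placeholder:
format("HiveAcid").option("table", "<table name>")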
Example:
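A minimal sketch, assuming a Hive ACID table named default.acidtbl:
scala> val df = spark.read.format("HiveAcid").option("table", "default.acidtbl").load()
scala> df.collect()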
HWC Limitations
• You cannot write data using Spark Direct Reader.
• Transaction semantics of Spark RDDs are not ensured when using Spark Direct Reader to read ACID tables.
• HWC supports reading tables in any format, but currently supports writing tables in ORC format only.
• The Spark Thrift Server is not supported.
• Table stats (basic stats and column stats) are not generated when you write a DataFrame to Hive.
• The Hive Union types are not supported.
• When the HWC API save mode is overwrite, writes are limited.
You cannot read from and overwrite the same table. If your query accesses only one table and you try to overwrite
that table using an HWC API write method, a deadlock state might occur. Do not attempt this operation.
Example: Operation Not Supported
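A sketch of the pattern to avoid, reading a table through HWC and overwriting the same table with the HWC write API; the table name t1 is a placeholder:
scala> val df = hive.sql("select * from t1")
scala> df.write.format(HIVE_WAREHOUSE_CONNECTOR).mode("overwrite").option("table", "t1").save()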
Related Information
Orc vs Parquet
Blog: Enabling high-speed Spark direct reader for Apache Hive ACID tables
Union Types
You need to use HWC to read or write managed tables from Spark. Spark Direct Reader mode does not support
writing to managed tables. Managed table queries go through HiveServer, which is integrated with Ranger. External
table queries go through the HMS API, which is also integrated with Ranger.
In Spark Direct Reader mode, SparkSQL queries read managed table metadata directly from the HMS, but only if you
have permission to access files on the file system.
If you do not use HWC, the Hive metastore (HMS) API, integrated with Ranger, authorizes external table access.
HMS API-Ranger integration enforces the Ranger Hive ACL in this case. When you use HWC, queries such as
DROP TABLE affect file system data as well as metadata in HMS.
Managed tables
A Spark job impersonates the end user when attempting to access an Apache Hive managed table. As an end user, you do not have permission to access the secure, managed files in the Hive warehouse. Managed tables have default file system permissions that disallow end-user access, including Spark user access.
As Administrator, you set permissions in Ranger to access the managed tables in JDBC mode. You can fine-tune
Ranger to protect specific data. For example, you can mask data in certain columns, or set up tag-based access
control.
In Spark Direct Reader mode, you cannot use Ranger. You must set read access to the file system location for managed tables. You must have Read and Execute permissions on the Hive warehouse location (hive.metastore.warehouse.dir).
External tables
Ranger authorization of external table reads and writes is supported. You need to configure a few properties in
Cloudera Manager for authorization of external table writes. You must be granted file system permissions on external
table files to allow Spark direct access to the actual table data instead of just the table metadata. For example, to purge
actual data you need access to the file system.
Component interaction
The following diagram shows component interaction in HWC Spark Direct Reader mode.
spark-shell --jars \
/opt/cloudera/parcels/CDH/lib/hive_warehouse_connector/hive-warehouse-connector-assembly-<version>.jar \
--conf "spark.sql.extensions=com.qubole.spark.hiveacid.HiveAcidAutoConvertExtension" \
--conf "spark.kryo.registrator=com.qubole.spark.hiveacid.util.HiveAcidKyroRegistrator" \
--conf "spark.hadoop.hive.metastore.uris=<metastore_uri>"
Unsupported functionality
Spark Direct Reader does not support the following functionality:
• Writes
• Streaming inserts
• CTAS statements
Limitations
• Does not enforce authorization; hence, you must configure read access to the HDFS, or other, location for
managed tables. You must have Read and Execute permissions on the Hive warehouse location (hive.metastore.warehouse.dir).
• Supports only single-table transaction consistency. The direct reader does not guarantee that multiple tables
referenced in a query read the same snapshot of data.
• Does not auto-commit transactions submitted by RDD APIs. Explicitly close transactions to release locks.
• Requires read and execute access on the hive-managed table locations.
• Does not support Ranger column masking and fine-grained access control.
• Blocks compaction on open read transactions.
The way Spark handles null and empty strings can cause a discrepancy between metadata and actual data when
writing the data read by Spark Direct Reader to a CSV file.
Related Information
Configuring Spark Direct Reader mode
Configuring JDBC execution mode
Blog: Enabling high-speed Spark direct reader for Apache Hive ACID tables
Component Interaction
JDBC mode creates only one JDBC connection to HiveServer (HS2) or HiveServer Interactive (HSI), a potential
bottleneck in data transfer to Spark. The following diagram shows interaction in JDBC mode with Hive metastore
(HMS), TEZ, and HDFS.
HWC does not use JDBC mode during a write. HWC writes to an intermediate location from Spark, and then
executes a LOAD DATA query to write the data. Using HWC to write data is recommended for production.
Configuration
In JDBC mode, execution takes place in these locations:
• Driver: Using the Hive JDBC url, connects to Hive and executes the query on the driver side.
• Cluster: From Spark executors, connects to Hive through JDBC and executes the query.
Authorization occurs on the server.
JDBC mode runs in the client or cluster:
• Client (Driver)
In client mode, any failures to connect to HiveServer (HS2) will not be retried.
• Cluster (Executor), recommended
In cluster mode, any failures to connect to HS2 are retried automatically.
JDBC mode is not recommended for production reads due to slow performance when reading huge data sets. Where
your queries are executed affects the Kerberos configurations for HWC.
In the spark-defaults.conf configuration file, or using the --conf option in spark-submit/spark-shell, set the following
properties:
Name: spark.datasource.hive.warehouse.read.jdbc.mode
Value: client or cluster
Configures the driver location.
Name: spark.sql.hive.hiveserver2.jdbc.url
Value:
The JDBC endpoint for HiveServer. For more information, see the Apache Hive Wiki (link below).
For Knox, provide the HiveServer, not Knox, endpoint.
Name: spark.datasource.hive.warehouse.load.staging.dir
Value: Temporary staging location required by HWC. Set the value to a file system location where
the HWC user has write permission.
Name: spark.hadoop.hive.zookeeper.quorum
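For reference, a corresponding spark-defaults.conf fragment might look like the following sketch; the host names, port, and staging path are placeholders:
spark.datasource.hive.warehouse.read.jdbc.mode cluster
spark.sql.hive.hiveserver2.jdbc.url jdbc:hive2://<zookeeper_host>:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2
spark.datasource.hive.warehouse.load.staging.dir /tmp/hwc_staging
spark.hadoop.hive.zookeeper.quorum <zookeeper_host>:2181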
Procedure
1. Submit the Spark application, including spark.sql.extensions property to enable Auto Translate.
2. If you use the Kryo serializer, also include --conf spark.kryo.registrator=com.qubole.spark.hiveacid.util.HiveAcidKyroRegistrator
For example:
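A sketch of such a submission; the jar version, database, and table name are placeholders, and the query output resembles the listing below:
spark-shell --jars /opt/cloudera/parcels/CDH/lib/hive_warehouse_connector/hive-warehouse-connector-assembly-<version>.jar \
--conf spark.sql.extensions=com.qubole.spark.hiveacid.HiveAcidAutoConvertExtension \
--conf spark.kryo.registrator=com.qubole.spark.hiveacid.util.HiveAcidKyroRegistrator

scala> spark.sql("select * from employees.emp").show()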
+------+----------+--------------------+-------------+------------+-----+-----+-------+
|emp_id|first_name|              e_mail|date_of_birth|        city|state|  zip|dept_id|
+------+----------+--------------------+-------------+------------+-----+-----+-------+
|677509|      Lois| lois.walker@hotma… |    3/29/1981|      Denver|   CO|80224|      4|
|940761|    Brenda|brenda.robinson@g...|    7/31/1970|   Stonewall|   LA|71078|      5|
|428945|       Joe| joe.robinson@gmai… |    6/16/1963|Michigantown|   IN|46057|      3|
...
You do not need to specify an execution mode; you simply submit the query. Using the HWC API, you can use
hive.execute to execute a read. This command processes queries through HWC in either JDBC or Spark Direct
Reader mode.
Related Information
Configuring Spark Direct Reader mode
Configuring JDBC execution mode
Blog: Enabling high-speed Spark direct reader for Apache Hive ACID tables
Configuring Spark Direct Reader mode
Procedure
1. In Cloudera Manager, in Hosts > Roles, if Hive Metastore appears in the list of roles, copy the host name or IP
address.
You use the host name or IP address in the next step to set the host value.
2. Launch the Spark shell and include a configuration that sets the spark.hadoop.hive.metastore.uris property to
thrift://<host>:<port>.
For example:
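A sketch of the launch string; the jar version and metastore host are placeholders, and 9083 is the usual metastore port:
spark-shell --jars /opt/cloudera/parcels/CDH/lib/hive_warehouse_connector/hive-warehouse-connector-assembly-<version>.jar \
--conf spark.hadoop.hive.metastore.uris=thrift://<host>:9083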
Configuring JDBC execution mode
Procedure
1. Find the HiveServer (HS2) JDBC URL in /etc/hive/conf.cloudera.HIVE_ON_TEZ-1/beeline-site.xml
The value of beeline.hs2.jdbc.url.HIVE_ON_TEZ-1 is the HS2 JDBC URL in this sample file.
...
<configuration>
  <property>
    <name>beeline.hs2.jdbc.url.default</name>
    <value>HIVE_ON_TEZ-1</value>
  </property>
  <property>
    <name>beeline.hs2.jdbc.url.HIVE_ON_TEZ-1</name>
    <value>jdbc:hive2://nightly7x-unsecure-1.nightly7x-unsecure.root.hwx.site:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2;retries=5</value>
  </property>
</configuration>
2. Set the Spark property to the value of the HS2 JDBC URL.
For example, in /opt/cloudera/parcels/CDH-7.2.1-1.cdh7.2.1.p0.4847773/etc/spark/conf.dist/spark-defaults.conf,
add the JDBC URL:
...
spark.sql.hive.hiveserver2.jdbc.url jdbc:hive2://nightly7x-unsecure-1.nightly7x-unsecure.root.hwx.site:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2;retries=5
• Property: spark.security.credentials.hiveserver2.enabled
Value: Set this value to "true".
You do not need to explicitly provide other authentication configurations, such as auth type and principal. When
Spark opens a secure connection to Hive metastore, Spark automatically picks the authentication configurations from
the hive-site.xml that is present on the Spark app classpath. For example, to execute queries in direct reader mode
through HWC, Spark opens a secure connection to Hive metastore and this authentication occurs automatically.
You can set the properties using the spark-submit/spark-shell --conf option.
org.apache.hadoop.hive.ql.security.authorization.plugin.metastore.HiveMetaStoreAuthorizer
Procedure
1. In Cloudera Manager, to configure Hive Metastore properties, click Clusters > Hive-1 > Configuration.
2. Search for hive-site.
3. In Hive Metastore Server Advanced Configuration Snippet (Safety Valve) for hive-site.xml, click +.
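For illustration, assuming the authorizer shown earlier is registered as a metastore pre-event listener, the name-value pair might look like this sketch:
Name: hive.metastore.pre.event.listeners
Value: org.apache.hadoop.hive.ql.security.authorization.plugin.metastore.HiveMetaStoreAuthorizer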
Reading managed tables through HWC
Procedure
1. Choose a configuration based on your execution mode.
• Spark Direct Reader mode:
--conf spark.sql.extensions=com.qubole.spark.hiveacid.HiveAcidAutoConvertExtension
• JDBC mode:
--conf spark.sql.extensions=com.hortonworks.spark.sql.rule.Extensions
--conf spark.datasource.hive.warehouse.read.via.llap=false
Also set a location for running the application in JDBC mode. For example, set the recommended cluster
location:
spark.datasource.hive.warehouse.read.jdbc.mode=cluster
2. Start the Spark session using the execution mode you chose in the last step.
For example, start the Spark session using Spark Direct Reader mode and configure for Kryo serialization:
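A sketch of such a launch; the jar version and metastore host are placeholders:
spark-shell --jars /opt/cloudera/parcels/CDH/lib/hive_warehouse_connector/hive-warehouse-connector-assembly-<version>.jar \
--conf spark.sql.extensions=com.qubole.spark.hiveacid.HiveAcidAutoConvertExtension \
--conf spark.kryo.registrator=com.qubole.spark.hiveacid.util.HiveAcidKyroRegistrator \
--conf spark.hadoop.hive.metastore.uris=thrift://<metastore_host>:9083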
For example, start the Spark session using JDBC execution mode:
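A sketch of a JDBC mode launch; the jar version and HiveServer JDBC URL are placeholders:
spark-shell --jars /opt/cloudera/parcels/CDH/lib/hive_warehouse_connector/hive-warehouse-connector-assembly-<version>.jar \
--conf spark.sql.extensions=com.hortonworks.spark.sql.rule.Extensions \
--conf spark.datasource.hive.warehouse.read.via.llap=false \
--conf spark.datasource.hive.warehouse.read.jdbc.mode=cluster \
--conf spark.sql.hive.hiveserver2.jdbc.url=<hs2_jdbc_url>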
You must start the Spark session after setting Spark Direct Reader mode, so include the configurations in the
launch string.
3. Read Apache Hive managed tables.
For example:
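A sketch of a read; the database and table names are taken from examples elsewhere in this document. In Spark Direct Reader mode a plain Spark SQL query is translated automatically, and through the HWC API you can use hive.sql:
scala> spark.sql("select * from tpcds_bin_partitioned_orc_1000.web_sales").show(10)

scala> val hive = com.hortonworks.hwc.HiveWarehouseSession.session(spark).build()
scala> hive.setDatabase("tpcds_bin_partitioned_orc_1000")
scala> hive.sql("select * from web_sales").show(10)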
Related Information
Configuring Spark Direct Reader mode
Configuring JDBC execution mode
Blog: Enabling high-speed Spark direct reader for Apache Hive ACID tables
Writing managed tables through HWC
Procedure
1. Start the Apache Spark session and include the URL for HiveServer.
2. Include in the launch string a configuration of the intermediate location to use as a staging directory.
Example syntax:
...
--conf spark.sql.hive.hwc.execution.mode=spark \
--conf spark.datasource.hive.warehouse.read.via.llap=false \
--conf spark.datasource.hive.warehouse.load.staging.dir=<path to directory>
import com.hortonworks.hwc.HiveWarehouseSession
import com.hortonworks.hwc.HiveWarehouseSession._
val hive = HiveWarehouseSession.session(spark).build()
hive.createTable("newTable").ifNotExists()
.column("ws_sold_time_sk", "bigint")
.column("ws_ship_date_sk", "bigint")
.create();
4. Write to a statically partitioned, Hive managed table named t1 having two partitioned columns c1 and c2.
df.write.format(HIVE_WAREHOUSE_CONNECTOR).mode("append").option("partition", "c1='val1',c2='val2'").option("table", "t1").save();
HWC internally fires the following query to Hive through JDBC after writing data to a temporary location.
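The query is roughly of the following form (a sketch; the staging path placeholder stands for the configured spark.datasource.hive.warehouse.load.staging.dir):
LOAD DATA INPATH '<staging_dir>' INTO TABLE t1 PARTITION (c1='val1', c2='val2')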
5. Write to a dynamically partitioned table named t1 having two partition columns c1 and c2.
df.write.format(HIVE_WAREHOUSE_CONNECTOR).mode("append").option("partition", "c1='val1',c2").option("table", "t1").save();
HWC internally fires the following query to Hive through JDBC after writing data to a temporary location.
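The query is roughly of the following form (a sketch; <tempTable> stands for the temporary location written by HWC):
INSERT INTO TABLE t1 PARTITION (c1='val1', c2) SELECT <cols> FROM <tempTable>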
where <cols> is a comma-separated list of the table's columns, with the dynamic partition columns last in the list and
in the same order as the partition definition.
Related Information
Configuring Spark Direct Reader mode
Configuring JDBC execution mode
Blog: Enabling high-speed Spark direct reader for Apache Hive ACID tables
API operations
As an Apache Spark developer, you learn the code constructs for executing Apache Hive queries using the
HiveWarehouseSession API. In Spark source code, you see how to create an instance of HiveWarehouseSession.
• STREAM_TO_STREAM
Assuming spark is running in an existing SparkSession, use this code for imports:
• Scala
import com.hortonworks.hwc.HiveWarehouseSession
import com.hortonworks.hwc.HiveWarehouseSession._
val hive = HiveWarehouseSession.session(spark).build()
• Java
import com.hortonworks.hwc.HiveWarehouseSession;
import static com.hortonworks.hwc.HiveWarehouseSession.*;
HiveWarehouseSession hive = HiveWarehouseSession.session(spark).build();
• Python
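A sketch of the Python equivalent, assuming the pyspark_hwc package (pyspark_llap module) shipped with HWC is on the Python path:
from pyspark_llap import HiveWarehouseSession
hive = HiveWarehouseSession.session(spark).build()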
Executing queries
HWC supports three methods for executing queries:
• .sql()
• Executes queries in any HWC mode.
• Consistent with the Spark sql interface.
• Masks the internal implementation based on the cluster type you configured, either JDBC_CLIENT or
JDBC_CLUSTER.
• .execute()
• Required for executing queries if spark.datasource.hive.warehouse.read.mode=JDBC_CLUSTER.
• Uses a driver side JDBC connection.
• Provided for backward compatibility where the method defaults to reading in JDBC client mode irrespective of
the value of JDBC client or cluster mode configuration.
• Recommended for catalog queries.
• .executeQuery()
• Executes queries, except catalog queries, in LLAP mode (spark.datasource.hive.warehouse.read.via.llap=true)
• If LLAP is not enabled in the cluster, .executeQuery() does not work. CDP Data Center does not support
LLAP.
• Provided for backward compatibility.
Results are returned as a DataFrame to Spark.
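For instance, with a session built as shown above, the first two methods can be called as follows; the table name is taken from the examples later in this document:
hive.sql("select * from web_sales").show(10)
hive.execute("describe extended web_sales").show()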
Related Information
HMS storage
Orc vs Parquet
Blog: Enabling high-speed Spark direct reader for Apache Hive ACID tables
HWC supported types mapping
Spark Data Type         Hive Data Type
ByteType                TinyInt
ShortType               SmallInt
IntegerType             Integer
LongType                BigInt
FloatType               Float
DoubleType              Double
DecimalType             Decimal
StringType*             String, Varchar*
BinaryType              Binary
BooleanType             Boolean
TimestampType**         Timestamp**
DateType                Date
ArrayType               Array
StructType              Struct
CalendarIntervalType    Interval
N/A                     Char
MapType                 Map
N/A                     Union
NullType                N/A
Notes:
* StringType (Spark) and String, Varchar (Hive)
A Hive String or Varchar column is converted to a Spark StringType column. When a Spark StringType column has maxLength metadata, it is converted to a Hive Varchar column; otherwise, it is converted to a Hive String column.
** Timestamp (Hive)
The Hive Timestamp column loses submicrosecond precision when converted to a Spark TimestampType column because a Spark TimestampType column has microsecond precision, while a Hive Timestamp column has nanosecond precision.
Hive timestamps are interpreted as UTC. When reading data from Hive, timestamps are adjusted according to the local timezone of the Spark session. For example, if Spark is running in the America/New_York timezone, a Hive timestamp 2018-06-21 09:00:00 is imported into Spark as 2018-06-21 05:00:00 due to the 4-hour time difference between America/New_York and UTC.
Related Information
HMS storage
Blog: Enabling high-speed Spark direct reader for Apache Hive ACID tables
Catalog operations
Short descriptions and the syntax of catalog operations, which include creating, dropping, and describing an Apache
Hive database and table from Apache Spark, helps you write HWC API apps.
Catalog operations
Three methods of executing catalog operations are supported: .sql (recommended), .execute() (spark.datasource.hive.warehouse.read.jdbc.mode = client), or .executeQuery() for backward compatibility in LLAP mode.
• Set the current database for unqualified Hive table references
hive.setDatabase(<database>)
• Execute a catalog operation and return a DataFrame
hive.execute("describe extended web_sales").show()
• Show databases
hive.showDatabases().show(100)
• Show tables for the current database
hive.showTables().show(100)
• Describe a table
hive.describeTable(<table_name>).show(100)
• Create a database
hive.createDatabase(<database_name>,<ifNotExists>)
• Create an ORC table
hive.createTable("web_sales").ifNotExists().column("sold_time_sk", "bigi
nt").column("ws_ship_date_sk", "bigint").create()
See the CreateTableBuilder interface section below for additional table creation options. You can also create Hive
tables using hive.executeUpdate.
• Drop a database
hive.dropDatabase(<databaseName>, <ifExists>, <useCascade>)
• Drop a table
hive.dropTable(<tableName>, <ifExists>, <usePurge>)
Related Information
HMS storage
Blog: Enabling high-speed Spark direct reader for Apache Hive ACID tables
Read operations
Execute a Hive SELECT query and return a DataFrame.
hive.sql("select * from web_sales")
HWC supports push-downs of DataFrame filters and projections applied to .sql().
Alternatively, you can use .execute or .executeQuery as previously described.
Write operations
Write a DataFrame to Hive in batch:
df.write.format(HIVE_WAREHOUSE_CONNECTOR).option("table", <tableName>).save()
Python:
df.write.format(HiveWarehouseSession().HIVE_WAREHOUSE_CONNECTOR).option("table", <tableName>).save()
Write a DataFrame to Hive, specifying partitions:
df.write.format(HIVE_WAREHOUSE_CONNECTOR).option("table", <tableName>).option("partition", <partition_spec>).save()
Python:
df.write.format(HiveWarehouseSession().HIVE_WAREHOUSE_CONNECTOR).option("table", <tableName>).option("partition", <partition_spec>).save()
Write a DataFrame to Hive using HiveStreaming:
df.write.format(DATAFRAME_TO_STREAM).option("table", <tableName>).option("partition", <partition>).save()
Python:
df.write.format(HiveWarehouseSession().DATAFRAME_TO_STREAM).option("table", <tableName>).option("partition", <partition>).save()
Write a Spark Stream to Hive using HiveStreaming:
stream.writeStream.format(STREAM_TO_STREAM).option("table", "web_sales").start()
Python:
stream.writeStream.format(HiveWarehouseSession().STREAM_TO_STREAM).option("table", "web_sales").start()
Related Information
HMS storage
SPARK-20236
Blog: Enabling high-speed Spark direct reader for Apache Hive ACID tables
scala> com.qubole.spark.hiveacid.transaction.HiveAcidTxnManagerObject.commitTxn(spark)
Or, if you are using the Hive Warehouse Connector with Spark Direct Reader mode enabled, you can invoke the
following API to commit the transaction:
scala> hive.commitTxn
Calling close() invalidates the HiveWarehouseSession instance and you cannot perform any further operations on the
instance.
Procedure
Call close() when you finish running all other operations on the instance of HiveWarehouseSession.
import com.hortonworks.hwc.HiveWarehouseSession
import com.hortonworks.hwc.HiveWarehouseSession._
val hive = HiveWarehouseSession.session(spark).build()
hive.setDatabase("tpcds_bin_partitioned_orc_1000")
val df = hive.sql("select * from web_sales")
// ... any other operations
hive.close()
You can also call close() at the end of an iteration if the application is designed to run in a microbatch, or iterative,
manner that does not need to share previous states.
No more operations can occur on the DataFrame obtained by table() or sql() (or alternatively, .execute() or .executeQuery()).
Related Information
Blog: Enabling high-speed Spark direct reader for Apache Hive ACID tables
Procedure
Change the value of the default delimiter property escape.delim to a backslash that the Hive Warehouse Connector
uses to write streams to mytable.
ALTER TABLE mytable SET TBLPROPERTIES ('escape.delim' = '\\');
Related Information
HMS storage
Blog: Enabling high-speed Spark direct reader for Apache Hive ACID tables
df.write.format(HIVE_WAREHOUSE_CONNECTOR)
.mode("append")
.option("table", "my_Table")
.save()
import com.hortonworks.hwc.HiveWarehouseSession
import com.hortonworks.hwc.HiveWarehouseSession._
val hive = HiveWarehouseSession.session(spark).build()
hive.setDatabase("tpcds_bin_partitioned_orc_1000")
val df = hive.sql("select * from web_sales")
df.createOrReplaceTempView("web_sales")
hive.setDatabase("testDatabase")
hive.createTable("newTable")
.ifNotExists()
.column("ws_sold_time_sk", "bigint")
.column("ws_ship_date_sk", "bigint")
.create()
sql("SELECT ws_sold_time_sk, ws_ship_date_sk FROM web_sales WHERE ws_sold_
time_sk > 80000)
.write.format(HIVE_WAREHOUSE_CONNECTOR)
.mode("append")
.option("table", "newTable")
.save()
Related Information
HMS storage
Blog: Enabling high-speed Spark direct reader for Apache Hive ACID tables
HiveWarehouseSession interface
package com.hortonworks.hwc;
//Execute Hive SELECT query and return DataFrame in LLAP mode (not available in this release)
Dataset<Row> executeQuery(String sql);
/**
* Helpers: wrapper functions over execute or executeUpdate
*/
//Closes the HWC session. Session cannot be reused after being closed.
void close();
CreateTableBuilder interface
package com.hortonworks.hwc;
//Make table bucketed, with given number of buckets and bucket columns
CreateTableBuilder clusterBy(long numBuckets, String ... columns);
MergeBuilder interface
package com.hortonworks.hwc;
//Specify fields to update for rows affected by merge condition and matchExpr
MergeBuilder whenMatchedThenUpdate(String matchExpr, String... nameValuePairs);
//Insert rows into target table affected by merge condition and matchExpr
MergeBuilder whenNotMatchedInsert(String matchExpr, String... values);
Related Information
HMS storage
Blog: Enabling high-speed Spark direct reader for Apache Hive ACID tables
Submit a Scala or Java application
Procedure
1. Choose an execution mode, for example the HWC JDBC execution mode, for your application and check that you
meet the configuration requirements, described earlier.
2. Configure a Spark-HiveServer connection, described earlier, or include the appropriate --conf in your app
submission in step 4.
3. Locate the hive-warehouse-connector-assembly jar in the /hive_warehouse_connector/ directory.
For example, find hive-warehouse-connector-assembly-<version>.jar in the following location:
/opt/cloudera/parcels/CDH/jars
4. Add the connector jar and configurations to the app submission using the --jars option.
Example syntax:
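A sketch of the submission syntax; the placeholders stand for your jar path and HWC configuration properties:
spark-submit --jars /opt/cloudera/parcels/CDH/jars/hive-warehouse-connector-assembly-<version>.jar \
--conf <HWC configuration property>=<value> ...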
5. Add the path to the app you wrote based on the HiveWarehouseConnector API.
Example syntax:
<path to app>
For example:
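A complete submission in JDBC cluster mode might look like the following sketch; the application class, jar names, and JDBC URL are placeholders:
spark-submit --jars /opt/cloudera/parcels/CDH/jars/hive-warehouse-connector-assembly-<version>.jar \
--conf spark.sql.extensions=com.hortonworks.spark.sql.rule.Extensions \
--conf spark.datasource.hive.warehouse.read.jdbc.mode=cluster \
--conf spark.sql.hive.hiveserver2.jdbc.url=<hs2_jdbc_url> \
--class com.example.MyHWCApp my-hwc-app.jar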
Submit a Python app
Procedure
1. Choose an execution mode, for example the HWC JDBC execution mode, for your application and check that you
meet the configuration requirements, described earlier.
2. Configure a Spark-HiveServer connection, described earlier, or include the appropriate --conf in your app
submission in step 4.
3. Locate the hive-warehouse-connector-assembly jar in the /hive_warehouse_connector/ directory.
For example, find hive-warehouse-connector-assembly-<version>.jar in the following location:
/opt/cloudera/parcels/CDH/jars
4. Add the connector jar and configurations to the app submission using the --jars option.
Example syntax:
--py-files <path>/hive_warehouse_connector/pyspark_hwc-<version>.zip
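Putting the pieces together, a Python submission might look like the following sketch; the application file, paths, and JDBC URL are placeholders:
spark-submit --jars /opt/cloudera/parcels/CDH/jars/hive-warehouse-connector-assembly-<version>.jar \
--py-files /opt/cloudera/parcels/CDH/lib/hive_warehouse_connector/pyspark_hwc-<version>.zip \
--conf spark.sql.hive.hiveserver2.jdbc.url=<hs2_jdbc_url> \
my_hwc_app.py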
Related Information
Configuring Spark Direct Reader mode
Configuring JDBC execution mode
HMS storage
Blog: Enabling high-speed Spark direct reader for Apache Hive ACID tables
Apache Hive-Kafka integration
Procedure
1. Get the name of the Kafka topic you want to query to use as a table property.
For example: "kafka.topic" = "wiki-hive-topic"
2. Construct the Kafka broker connection string.
For example: "kafka.bootstrap.servers"="kafka.hostname.com:9092"
3. Create an external table named kafka_table by using 'org.apache.hadoop.hive.kafka.KafkaStorageHandler', as
shown in the following example:
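A sketch of such a table definition, using the topic and broker from the previous steps; the value columns are assumptions about the payload, and the default JSON serializer-deserializer is used:
CREATE EXTERNAL TABLE kafka_table
  (`timestamp` timestamp, `page` string, `user` string, `added` int, `deleted` int, `delta` bigint)
  STORED BY 'org.apache.hadoop.hive.kafka.KafkaStorageHandler'
  TBLPROPERTIES
  ("kafka.topic" = "wiki-hive-topic",
   "kafka.bootstrap.servers" = "kafka.hostname.com:9092");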
4. If the default JSON serializer-deserializer is incompatible with your data, choose another format in one of the
following ways:
• Alter the table to use another supported serializer-deserializer. For example, if your data is in Avro format, use
the Kafka serializer-deserializer for Avro:
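A sketch of such an ALTER statement, assuming the storage handler's kafka.serde.class table property selects the record serializer-deserializer:
ALTER TABLE kafka_table
  SET TBLPROPERTIES ("kafka.serde.class" = "org.apache.hadoop.hive.serde2.avro.AvroSerDe");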
Related Information
Apache Kafka Documentation
In the Hive representation of the Kafka record, the key byte array is called __key and is of type binary. You can cast
__key at query time. Hive appends __key after the last column derived from the value byte array, and appends the
partition, offset, and timestamp as columns that are named accordingly.
Related Information
Apache Kafka Documentation
Procedure
1. List the table properties and all the partition or offset information for the topic.
DESCRIBE EXTENDED kafka_table;
2. Count the number of Kafka records that have timestamps within the past 10 minutes.
Such a time-based seek requires Kafka 0.11 or later, which has a Kafka broker that supports time-based lookups;
otherwise, this query leads to a full stream scan.
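A sketch of such a count, using the kafka_table definition from step 1; the __timestamp metadata column is compared against milliseconds since the epoch:
SELECT COUNT(*)
FROM kafka_table
WHERE `__timestamp` > 1000 * TO_UNIX_TIMESTAMP(CURRENT_TIMESTAMP - INTERVAL '10' MINUTES);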
3. Define a view of data consumed within the past 15 minutes and mask specific columns.
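A sketch of such a view, projecting only non-sensitive columns from kafka_table; the chosen columns are assumptions:
CREATE VIEW last_15_minutes_of_kafka_table AS
SELECT `timestamp`, `user`, `added`, `deleted`, `delta`
FROM kafka_table
WHERE `__timestamp` > 1000 * TO_UNIX_TIMESTAMP(CURRENT_TIMESTAMP - INTERVAL '15' MINUTES);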
5. Join the view of the stream over the past 15 minutes to user_table, group by gender, and compute aggregates over
metrics from fact table and dimension tables.
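A sketch of such a join, assuming user_table contains `user`, age, and gender columns:
SELECT SUM(added) AS added, SUM(deleted) AS deleted, AVG(delta) AS delta, AVG(age) AS avg_age, gender
FROM last_15_minutes_of_kafka_table l
JOIN user_table u ON l.`user` = u.`user`
GROUP BY gender
LIMIT 10;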
6. Perform a classical user retention analysis over the Kafka stream consisting of a stream-to-stream join that runs
ad hoc queries on a view defined over the past 15 minutes.
-- Stream-to-stream join
-- Assuming wiki_kafka_hive is the entire stream.
SELECT floor_hour(activity.`timestamp`), COUNT(DISTINCT activity.`user`) AS active_users,
    COUNT(DISTINCT future_activity.`user`) AS retained_users
FROM wiki_kafka_hive AS activity
LEFT JOIN wiki_kafka_hive AS future_activity ON activity.`user` = future_activity.`user`
AND activity.`timestamp` = future_activity.`timestamp` - interval '1' hour
GROUP BY floor_hour(activity.`timestamp`);
Related Information
Apache Kafka Documentation
Procedure
1. Create a table to represent source Kafka record offsets.
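A sketch of such an offsets table; the column names are one reasonable choice:
DROP TABLE IF EXISTS kafka_table_offsets;
CREATE TABLE kafka_table_offsets (partition_id int, max_offset bigint, insert_time timestamp);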
6. Repeat step 4 periodically until all the data is loaded into Hive.
Write semantics
The Hive-Kafka connector supports the following write semantics:
• At least once (default)
• Exactly once
At least once (default)
The default semantic. At least once is the most common write semantic used by streaming engines.
The internal Kafka producer retries on errors. If a message is not delivered, the exception is raised
to the task level, which causes a restart, and more retries. The At least once semantic leads to one of
the following conclusions:
• If the job succeeds, each record is guaranteed to be delivered at least once.
• If the job fails, some of the records might be lost and some might not be sent.
In this case, you can retry the query, which eventually leads to the delivery of each record at
least once.
Exactly once
Following the exactly once semantic, the Hive job ensures that either every record is delivered
exactly once, or nothing is delivered. You can use only Kafka brokers supporting the Transaction
API (0.11.0.x or later). To use this semantic, you must set the table property "kafka.write.semantic"="EXACTLY_ONCE".
Metadata columns
In addition to the user row payload, the insert statement must include values for the following extra columns:
__key
Although you can set the value of this metadata column to null, using a meaningful key value to
avoid unbalanced partitions is recommended. Any binary value is valid.
__partition
Use null unless you want to route the record to a particular partition. Using a nonexistent partition
value results in an error.
__offset
You cannot set this value, which is fixed at -1.
__timestamp
You can set this value to a meaningful timestamp, represented as the number of milliseconds since
epoch. Optionally, you can set this value to null or -1, which means that the Kafka broker strategy
sets the timestamp column.
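For illustration, an insert that supplies the metadata columns after the user payload might look like the following sketch; hive_source_table is a placeholder for the table being copied into the Kafka-backed table:
INSERT INTO TABLE kafka_table
SELECT `timestamp`, `page`, `user`, `added`, `deleted`, `delta`,
       CAST(null AS binary) AS `__key`, CAST(null AS int) AS `__partition`, -1 AS `__offset`,
       1000 * TO_UNIX_TIMESTAMP(CURRENT_TIMESTAMP) AS `__timestamp`
FROM hive_source_table;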
Related Information
Apache Kafka Documentation
Procedure
1. Create an external table to represent the Hive data that you want to load into Kafka.
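A sketch of such a sink table; the table name, columns, and topic are placeholders:
CREATE EXTERNAL TABLE wiki_summary_kafka (`user` string, added_total bigint)
STORED BY 'org.apache.hadoop.hive.kafka.KafkaStorageHandler'
TBLPROPERTIES
  ("kafka.topic" = "wiki-summary-topic",
   "kafka.bootstrap.servers" = "kafka.hostname.com:9092");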
2. Insert data that you select from the Kafka topic back into the Kafka record.
The timestamps of the selected data are converted to milliseconds since epoch for clarity.
Related Information
Query live data from Kafka
Procedure
For example, if you want to inject 5000 poll records into the Kafka consumer, use the following syntax.
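A sketch, assuming consumer properties are passed through table properties carrying the kafka.consumer prefix:
ALTER TABLE kafka_table
  SET TBLPROPERTIES ("kafka.consumer.max.poll.records" = "5000");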
You set the following table properties for use with the Kafka storage handler:
kafka.topic
The Kafka topic to connect to
kafka.bootstrap.servers
The broker connection string
The Kafka consumer supports seeking on the stream based on an offset, which the storage handler leverages to push
down filters over metadata columns. The storage handler in the example above performs seeks based on the Kafka
record __timestamp to read only recently arrived data.
The following logical operators and predicate operators are supported in the WHERE clause:
Logical operators: OR, AND
Predicate operators: <, <=, >=, >, =
The storage handler reader optimizes seeks by performing partition pruning to go directly to a particular partition
offset used in the WHERE clause:
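For instance, a filter of the following shape (a sketch over the kafka_table example) confines the scan to one partition and a bounded offset range:
SELECT COUNT(*)
FROM kafka_table
WHERE (`__offset` < 10 AND `__offset` > 3 AND `__partition` = 0)
   OR (`__partition` = 0 AND `__offset` < 105 AND `__offset` > 99)
   OR (`__partition` = 0 AND `__offset` = 109);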
The storage handler scans partition 0 only, and then reads only records between offset 4 and 109.
Kafka metadata
In addition to the user-defined payload schema, the Kafka storage handler appends to the table some additional
columns, which you can use to query the Kafka metadata fields:
__key
Kafka record key (byte array)
__partition
Kafka record partition identifier (int 32)
__offset
Kafka record offset (int 64)
__timestamp
Kafka record timestamp (int 64)
The partition identifier, record offset, and record timestamp, plus a key-value pair, constitute a Kafka record. Because
the key and value are byte arrays, you must use SerDe classes to transform them into a set of columns.
Table Properties
You use certain properties in the TBLPROPERTIES clause of a Hive query that specifies the Kafka storage handler.
Connecting Hive to BI tools using a JDBC/ODBC driver
• Configure authenticated users for querying Hive through JDBC or ODBC driver. For example, set up a Ranger
policy.
Procedure
1. Obtain the Hive database driver in one of the following ways:
• For an ODBC connection: Get the Cloudera ODBC driver from the Cloudera Downloads page.
• For a JDBC connection in CDP Private Cloud Base: Download and extract the Cloudera Hive JDBC driver
from the Cloudera Downloads page.
• For a JDBC connection in CDP Public Cloud: Using the CDW service, in a Virtual Warehouse, select Hive, and
from the more options menu, click Download JDBC JAR to download the Apache Hive JDBC jar.
For a JDBC connection in Data Hub, download and extract the Cloudera JDBC driver from the Cloudera
Downloads page.
2. Depending on the type of driver you obtain, proceed as follows:
• ODBC driver: follow instructions on the ODBC driver download site, and skip the rest of the steps in this
procedure.
• JDBC driver: add the driver to the classpath of your JDBC client, such as Tableau. For example, check the
client documentation about where to put the driver.
3. Find the JDBC URL for HiveServer using one of a number of methods. For example:
• Using the CDW service in a Virtual Warehouse, from the options menu of your Virtual Warehouse, click Copy
JDBC URL.
• In Cloudera Manager (CM), click Clusters > Hive, click Actions, and select Download Client Configuration.
jdbc:hive2://my_hiveserver.com:2181/;serviceDiscoveryMode=zooKeeper; \
zooKeeperNamespace=hiveserver2
4. In the BI tool, such as Tableau, configure the JDBC connection using the JDBC URL and driver class name,
com.cloudera.hive.jdbc.HS2Driver.
Procedure
1. Create a minimal JDBC connection string for connecting Hive to a BI tool.
• Embedded mode: Create the JDBC connection string for connecting to Hive in embedded mode.
• Remote mode: Create a JDBC connection string for making an unauthenticated connection to the Hive default
database on the localhost port 10000.
Embedded mode: "jdbc:hive2://"
Remote mode: "jdbc:hive2://myserver:10000/default"
2. Modify the connection string to change the transport mode from TCP (the default) to HTTP using the transportMode and httpPath session configuration variables.
jdbc:hive2://myserver:10000/default;transportMode=http;httpPath=myendpoint.com;
You need to specify httpPath when using the HTTP transport mode; the <http_endpoint> value corresponds to the
HTTP endpoint configured in hive-site.xml.
3. Add parameters to the connection string for Kerberos authentication.
jdbc:hive2://myserver:10000/default;principal=hive/[email protected]
jdbc:hive2://<host>:<port>/<dbName>;transportMode=http;httpPath=<http_endpoint>; \
<otherSessionConfs>?<hiveConfs>#<hiveVars>
User Authentication
If configured in remote mode, HiveServer supports Kerberos, LDAP, Pluggable Authentication Modules (PAM), and
custom plugins for authenticating the JDBC user connecting to HiveServer. The format of the JDBC connection URL
for authentication with Kerberos differs from the format for other authentication models. The following table shows
the variables for Kerberos authentication.
User Authentication Variable Description
saslQop Quality of protection for the SASL framework. The level of quality is
negotiated between the client and server during authentication. Used by
Kerberos authentication with TCP transport.
jdbc:hive2://<host>:<port>/<dbName>;principal=<HiveServer2_kerberos_principal>;<otherSessionConfs>?<hiveConfs>#<hiveVars>
jdbc:hive2://<host>:<port>/<dbName>; \
ssl=true;sslTrustStore=<ssl_truststore_path>;trustStorePassword=<truststore_password>; \
<otherSessionConfs>?<hiveConfs>#<hiveVars>
When using TCP for transport and Kerberos for security, HiveServer2 uses Sasl QOP for encryption rather than SSL.
The JDBC connection string for Sasl QOP uses these variables:
jdbc:hive2://fqdn.example.com:10000/default;principal=hive/[email protected];saslQop=auth-conf
The _HOST is a wildcard placeholder that gets automatically replaced with the fully qualified domain name (FQDN)
of the server running the HiveServer daemon process.
Using JdbcStorageHandler to query RDBMS
Procedure
1. Load data into a supported SQL database, such as MySQL, on a node in your cluster, or familiarize yourself with
existing data in your database.
2. Create an external table using the JdbcStorageHandler and table properties that specify the minimum information:
database type, driver, database connection string, user name and password for querying Hive, table name, and
number of active connections to Hive.
"hive.sql.dbcp.maxActive" = "1"
);
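For reference, a complete statement of this form might look like the following sketch; the columns, MySQL connection values, and credentials are assumptions:
CREATE EXTERNAL TABLE mytable_jdbc (
  col1 string,
  col2 int,
  col3 double
)
STORED BY 'org.apache.hive.storage.jdbc.JdbcStorageHandler'
TBLPROPERTIES (
  "hive.sql.database.type" = "MYSQL",
  "hive.sql.jdbc.driver" = "com.mysql.jdbc.Driver",
  "hive.sql.jdbc.url" = "jdbc:mysql://localhost/sample",
  "hive.sql.dbcp.username" = "hive",
  "hive.sql.dbcp.password" = "hive",
  "hive.sql.table" = "MYTABLE",
  "hive.sql.dbcp.maxActive" = "1"
);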
Set up JDBCStorageHandler for Postgres
Procedure
1. In CDP Private Cloud Base, click Cloudera Manager > Clusters and select the Hive service, for example, HIVE.
2. Click Configuration and search for Hive Auxiliary JARs Directory.
3. Specify a directory value for the Hive Aux JARs property if necessary, or make a note of the path.
4. Upload the JAR to the specified directory on all HiveServer instances.