Spark CDS 3
https://docs.cloudera.com/
Legal Notice
© Cloudera Inc. 2024. All rights reserved.
The documentation is and contains Cloudera proprietary information protected by copyright and other intellectual property
rights. No license under copyright or any other intellectual property right is granted herein.
Unless otherwise noted, scripts and sample code are licensed under the Apache License, Version 2.0.
Copyright information for Cloudera software may be found within the documentation accompanying each component in a
particular release.
Cloudera software includes software from various open source or other third party projects, and may be released under the
Apache Software License 2.0 (“ASLv2”), the Affero General Public License version 3 (AGPLv3), or other license terms.
Other software included may be released under the terms of alternative open source licenses. Please review the license and
notice files accompanying the software for additional licensing information.
Please visit the Cloudera software product page for more information on Cloudera software. For more information on
Cloudera support services, please visit either the Support or Sales page. Feel free to contact us directly to discuss your
specific needs.
Cloudera reserves the right to change any products at any time, and without notice. Cloudera assumes no responsibility nor
liability arising from the use of products, except as expressly agreed to in writing by Cloudera.
Cloudera, Cloudera Altus, HUE, Impala, Cloudera Impala, and other Cloudera marks are registered or unregistered
trademarks in the United States and other countries. All other trademarks are the property of their respective owners.
Disclaimer: EXCEPT AS EXPRESSLY PROVIDED IN A WRITTEN AGREEMENT WITH CLOUDERA,
CLOUDERA DOES NOT MAKE NOR GIVE ANY REPRESENTATION, WARRANTY, NOR COVENANT OF
ANY KIND, WHETHER EXPRESS OR IMPLIED, IN CONNECTION WITH CLOUDERA TECHNOLOGY OR
RELATED SUPPORT PROVIDED IN CONNECTION THEREWITH. CLOUDERA DOES NOT WARRANT THAT
CLOUDERA PRODUCTS NOR SOFTWARE WILL OPERATE UNINTERRUPTED NOR THAT IT WILL BE
FREE FROM DEFECTS NOR ERRORS, THAT IT WILL PROTECT YOUR DATA FROM LOSS, CORRUPTION
NOR UNAVAILABILITY, NOR THAT IT WILL MEET ALL OF CUSTOMER’S BUSINESS REQUIREMENTS.
WITHOUT LIMITING THE FOREGOING, AND TO THE MAXIMUM EXTENT PERMITTED BY APPLICABLE
LAW, CLOUDERA EXPRESSLY DISCLAIMS ANY AND ALL IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO IMPLIED WARRANTIES OF MERCHANTABILITY, QUALITY, NON-INFRINGEMENT, TITLE, AND
FITNESS FOR A PARTICULAR PURPOSE AND ANY REPRESENTATION, WARRANTY, OR COVENANT BASED
ON COURSE OF DEALING OR USAGE IN TRADE.
Unsupported connectors
This release does not support the following connectors:
• SparkR
Unsupported features
This release does not support the following features:
• Hudi
• Push-based shuffle
CDP Versions
Important: CDS 3.3 Powered by Apache Spark is an add-on service for CDP Private Cloud Base, and is
only supported with Cloudera Runtime 7.1.9. Spark 2 is included in CDP, and does not require a separate
parcel.
Supported versions of CDP are described below.
A Spark 2 service (included in CDP) can co-exist on the same cluster as Spark 3 (installed as a separate parcel). The
two services are configured to not conflict, and both run on the same YARN service. Spark 3 installs and uses its own
external shuffle service.
Although Spark 2 and Spark 3 can coexist in the same CDP Private Cloud Base cluster, you cannot use multiple
Spark 3 versions simultaneously. All clusters managed by the same Cloudera Manager Server must use exactly the
same version of CDS Powered by Apache Spark.
Software requirements
For CDS 3.3
Each cluster host must have the following software installed:
Java
JDK 8, JDK 11 or JDK 17. Cloudera recommends using JDK 8, as most testing has been done with
JDK 8. Remove other JDK versions from all cluster and gateway hosts to ensure proper operation.
Python
Python 3.7 - 3.10
Hardware requirements
For CDS 3.3
CDS 3.3 Powered by Apache Spark has no specific hardware requirements on top of what is required for Cloudera
Runtime deployments.
CDS 3.3 with GPU Support requires cluster hosts with NVIDIA Pascal™ or better GPUs, with a compute capability
rating of 6.0 or higher.
For more information, see Getting Started at the RAPIDS website.
Cloudera and NVIDIA recommend using NVIDIA-certified systems. For more information, see NVIDIA-Certified Systems in the NVIDIA GPU Cloud documentation.
Enabling CDS 3.3 with GPU Support
1. In the Cloudera Manager Admin Console, add the CDS 3 parcel repository to the Remote Parcel Repository URLs in Parcel Settings as described in Parcel Configuration Settings.
Note: If your Cloudera Manager Server does not have Internet access, you can use the CDS 3.3 parcel files: put them into a new parcel repository, and then configure the Cloudera Manager Server to target this newly created repository.
2. Download the CDS 3.3 parcel, distribute the parcel to the hosts in your cluster, and activate the parcel. For
instructions, see Managing Parcels.
3. Add the Livy for Spark 3 service to your cluster.
a. Note that the Livy port is 28998 instead of the usual 8998.
b. Complete the remaining steps in the wizard.
4. Return to the Home page by clicking the Cloudera Manager logo in the upper left corner.
5. Click the stale configuration icon to launch the Stale Configuration wizard and restart the necessary services.
If you want to activate the CDS 3.3 with GPU Support feature, set up a YARN role group to enable GPU usage and, optionally, configure the NVIDIA RAPIDS Shuffle Manager, as described in the following sections.
Set up a YARN role group to enable GPU usage
Procedure
1. In Cloudera Manager, navigate to YARN > Instances.
2. Create a role group where you can add nodes with GPUs.
For more information, see Creating a Role Group.
3. Move role instances with GPUs to the group you created.
On the Configuration tab select the source role group with the hosts you want to move, then click Move Selected
Instances To Group and select the role group you created.
You may need to restart the cluster.
4. Enable GPU usage for the role group.
a) On the Configuration tab, select Categories > GPU Management.
b) Under GPU Usage click Edit Individual Values and select the role group you created.
c) Click Save Changes.
Configure NVIDIA RAPIDS Shuffle Manager
Procedure
1. Validate your UCX environment following the instructions provided in the NVIDIA spark-rapids documentation.
2. Before running applications with the RAPIDS Shuffle Manager, make the following configuration changes:
--conf "spark.shuffle.service.enabled=false" \
--conf "spark.dynamicAllocation.enabled=false"
spark.executorEnv.UCX_TLS=cuda_copy,cuda_ipc,rc,tcp
spark.executorEnv.UCX_RNDV_SCHEME=put_zcopy
spark.executorEnv.UCX_MAX_RNDV_RAILS=1
spark.executorEnv.UCX_IB_RX_QUEUE_LEN=1024
For more information on environment variables, see the NVIDIA spark-rapids documentation.
Note: Running a job with the --rapids-shuffle=true flag does not affect these optional settings. You need
to set them manually.
Running applications using CDS 3.3 with GPU Support
where

--conf spark.rapids.sql.enabled=true

enables GPU-accelerated SQL processing. For example:
$SPARK_HOME/bin/spark3-shell \
--conf spark.task.resource.gpu.amount=2 \
--conf spark.rapids.sql.concurrentGpuTasks=2 \
--conf spark.sql.files.maxPartitionBytes=256m \
--conf spark.locality.wait=0s \
--conf spark.sql.adaptive.enabled=true \
--conf spark.rapids.memory.pinnedPool.size=2G \
--conf spark.sql.adaptive.advisoryPartitionSizeInBytes=1g
--conf "spark.executor.memoryOverhead=5g"
sets the amount of additional memory to be allocated per executor process
Note: cuDF uses a Just-In-Time (JIT) compilation approach for some kernels, and the JIT process can add a few seconds to query wall-clock time. Cloudera recommends setting a JIT cache path to speed up subsequent invocations with: --conf spark.executorEnv.LIBCUDF_KERNEL_CACHE_PATH=[local path]. The path should be local to the executor (not HDFS) and not shared between different cluster users in a multi-tenant environment. For example, the path may be in /tmp (/tmp/cudf-[***USER***]). If the /tmp directory is not writable, consult your system administrator to find a path that is.
You can override these configuration settings both from the command line and from code. For more information
on environment variables, see the NVIDIA spark-rapids documentation and the Spark SQL Performance Tuning
Guide.
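For instance, a setting such as spark.rapids.sql.enabled can also be changed from application code after the session has started. The following is a minimal sketch, assuming the setting is honored at runtime by your version of the RAPIDS plugin:

// Toggle RAPIDS SQL acceleration from code (sketch; assumes the plugin
// honors runtime changes to this setting).
val spark = org.apache.spark.sql.SparkSession.builder().getOrCreate()

// Disable GPU acceleration for a CPU-only section of the job ...
spark.conf.set("spark.rapids.sql.enabled", "false")

// ... and re-enable it afterwards.
spark.conf.set("spark.rapids.sql.enabled", "true")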
3. Run a job in spark3-shell.
4. You can verify that the job run used GPUs by logging on to the YARN UI v2 to review the execution plan and the performance of your spark3-shell application:
Select the Applications tab, then select your spark3-shell application. Select ApplicationMaster > SQL > count at <console>:28 to see the execution plan.
Running a Spark job using CDS 3.3 with GPU Support with UCX enabled
1. Log on to the node where you want to run the job.
2. Run the following command to launch spark3-shell:
where
--rapids-shuffle=true
makes the following configuration changes for UCX:
spark.shuffle.manager=com.nvidia.spark.rapids.spark332cdh.RapidsShuffleManager
spark.executor.extraClassPath=/opt/cloudera/parcels/SPARK3/lib/spark3/rapids-plugin/*
spark.executorEnv.UCX_ERROR_SIGNALS=
spark.executorEnv.UCX_MEMTYPE_CACHE=n
For more information on environment variables, see the NVIDIA spark-rapids documentation.
3. Run a job in spark3-shell.
CDS 3.3 Powered by Apache Spark version and download information

CDS Powered by Apache Spark Version: 3.3.2.3.3.7191000.0-78
Parcel repository: https://archive.cloudera.com/p/spark3/3.3.7191000.0/parcels/
Using the CDS 3.3 Powered by Apache Spark Maven Repository
POM fragment
The following pom fragment shows how to access a CDS 3.3 artifact from a Maven POM.
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.12</artifactId>
<version>3.3.2.3.3.7191000.0-78</version>
<scope>provided</scope>
</dependency>
Apache Spark 3 integration with Schema Registry
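The snippets below reference a streaming DataFrame named messages read from Kafka and a manually defined schema for the truck events. Those definitions are not shown here; the following is a minimal sketch, with the Kafka options, topic name, and field types assumed for illustration:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Manually defined schema for the truck events (field types are assumed).
val schema = new StructType()
  .add("driverId", IntegerType)
  .add("truckId", IntegerType)
  .add("miles", DoubleType)

// Streaming DataFrame of raw Kafka records; bootstrap servers and topic
// name are placeholders.
val messages = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "[***KAFKA BROKERS***]")
  .option("subscribe", "[***IN TOPIC***]")
  .load()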
// parse the messages using the above schema and do further operations
val df = messages
.select(from_json($"value".cast("string"), schema).alias("value"))
...
// project (driverId, truckId, miles) for the events where miles > 300
val filtered = df.select($"value.driverId", $"value.truckId", $"value.miles")
.where("value.miles > 300")
However, this approach is not practical because the schema information is tightly coupled with the code. The code
needs to be changed when the schema changes, and there is no ability to share or reuse the schema between the
message producers and the applications that consume the messages.
Using Schema Registry is a better solution because it enables you to manage different versions of the schema and
define compatibility policies.
Configuration
The Schema Registry integration comes as a utility method which can be imported into the scope.
import com.hortonworks.spark.registry.util._
Before invoking the APIs, you need to define an implicit SchemaRegistryConfig which will be passed to the APIs.
The main configuration parameter is the schema registry URL.
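The upstream spark-schema-registry examples construct this configuration from a map of client properties. The following is a minimal sketch under that assumption, with the registry URL as a placeholder:

import com.hortonworks.spark.registry.util._

// Schema Registry client properties (the URL is a placeholder).
val config = Map[String, Object](
  "schema.registry.url" -> "[***SCHEMA REGISTRY URL***]"
)

// Implicit configuration picked up by the utility methods.
implicit val srConfig: SchemaRegistryConfig = SchemaRegistryConfig(config)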
SSL configuration
SchemaRegistryConfig also accepts the SSL configuration properties required to connect to the Schema Registry over TLS.
The output schema outSchemaName is automatically published to the Schema Registry if it does not exist.
<dependency>
<groupId>com.hortonworks</groupId>
<artifactId>spark-schema-registry-for-spark3_2.12</artifactId>
<version>version</version>
</dependency>
Once the application JAR file is built, deploy it by adding the dependency in spark3-submit using --packages:
--class YourApp \
your-application-jar \
args ...
RegistryClient {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
keyTab="./app.keytab"
storeKey=true
useTicketCache=false
principal="[***PRINCIPAL***]";
};
KafkaClient {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
keyTab="./app.keytab"
storeKey=true
useTicketCache=false
serviceName="kafka"
principal="[***PRINCIPAL***]";
};
3. Provide the required ACLs for the Kafka topics (in-topic, out-topic) for the principal.
4. Use spark3-submit to pass the JAAS configuration file with extraJavaOptions (and, in YARN cluster mode, also ship it as a local resource file).
Unsupported features
Apache Spark 3 integration with Schema Registry is not supported in PySpark.
Cumulative hotfixes for CDS

Parcel repository: https://[username]:[password]@archive.cloudera.com/p/spark3/3.3.7190.2/parcels/
CDS Powered by Apache Spark Version: 3.3.2.3.3.7190.2-1
Dependent Stack Version: Cloudera Runtime 7.1.9.2-10
Supported CDP Versions: CDP Private Cloud Base with Cloudera Runtime 7.1.9
Important: CDS 3.3 for 7.1.9 Powered by Apache Spark is an add-on service for CDP Private Cloud Base,
and is only supported with Cloudera Runtime 7.1.9. Spark 2 is included in CDP, and does not require a
separate parcel.
Contact Cloudera Support for questions related to any specific hotfixes.
Following is the list of fixes that were shipped for CDS 3.3.2.3.3.7190.3-1-1.p0.48047943:
• CDPD-64135: [7.1.7 SP2, CDS 3.x] Backport HBASE-27624
• CDPD-63799: [7.1.9, CDS 3.x] Livy - Upgrade snakeyaml to 1.33 due to high CVEs
• CDPD-61742: Test failure: org.apache.spark.sql.hive.execution.HiveTableScanSuite.Spark-4077: timestamp query for null value
Parcel repository: https://[username]:[password]@archive.cloudera.com/p/spark3/3.3.7190.3/parcels/
CDS Powered by Apache Spark Version: 3.3.2.3.3.7190.3-1
Dependent Stack Version: Cloudera Runtime 7.1.9.2 (7.1.9 CHF1)
Supported CDP Version: CDP Private Cloud Base with Cloudera Runtime 7.1.9
Parcel repository: https://[username]:[password]@archive.cloudera.com/p/spark3/3.3.7190.4/parcels/
CDS Powered by Apache Spark Version: 3.3.2.3.3.7190.4-1
Dependent Stack Version: Cloudera Runtime 7.1.9.4 (7.1.9 CHF3)
Supported CDP Version: CDP Private Cloud Base with Cloudera Runtime 7.1.9
Parcel repository: https://[***USERNAME***]:[***PASSWORD***]@archive.cloudera.com/p/spark3/3.3.7190.5/parcels/
CDS Powered by Apache Spark Version: 3.3.7190.5-2
Dependent Stack Version: Cloudera Runtime 7.1.9.14 (7.1.9 CHF7)
Supported CDP Version: CDP Private Cloud Base with Cloudera Runtime 7.1.9
Parcel repository: https://[***USERNAME***]:[***PASSWORD***]@archive.cloudera.com/p/spark3/3.3.7190.7/parcels/
CDS Powered by Apache Spark Version: 3.3.7190.7-2
Dependent Stack Version: Cloudera Runtime 7.1.9.14 (7.1.9 CHF7)
Supported CDP Version: CDP Private Cloud Base with Cloudera Runtime 7.1.9
Parcel repository: https://[***USERNAME***]:[***PASSWORD***]@archive.cloudera.com/p/spark3/3.3.7190.8/parcels/
CDS Powered by Apache Spark Version: 3.3.7190.8-2
Dependent Stack Version: Cloudera Runtime 7.1.9.14 (7.1.9 CHF7)
Supported CDP Version: CDP Private Cloud Base with Cloudera Runtime 7.1.9
Parcel repository: https://[***USERNAME***]:[***PASSWORD***]@archive.cloudera.com/p/spark3/3.3.7190.9/parcels/
CDS Powered by Apache Spark Version: 3.3.7190.9-1
Dependent Stack Version: Cloudera Runtime 7.1.9.14 (7.1.9 CHF7)
Supported CDP Version: CDP Private Cloud Base with Cloudera Runtime 7.1.9
Parcel repository: https://[***USERNAME***]:[***PASSWORD***]@archive.cloudera.com/p/spark3/3.3.7190.10/parcels/
CDS Powered by Apache Spark Version: 3.3.7190.10-1
Dependent Stack Version: Cloudera Runtime 7.1.9.14 (7.1.9 CHF7)
Supported CDP Version: CDP Private Cloud Base with Cloudera Runtime 7.1.9
Parcel repository: https://[***USERNAME***]:[***PASSWORD***]@archive.cloudera.com/p/spark3/3.3.7191000.3/parcels/
CDS Powered by Apache Spark Version: 3.3.7191000.3-1
Dependent Stack Version: Cloudera Runtime 7.1.9.1015-6 (7.1.9 SP1 CHF3)
Supported CDP Version: CDP Private Cloud Base with Cloudera Runtime 7.1.9 SP1
Using Apache Iceberg in CDS
Limitations
• Iceberg tables with equality deletes do not support partition evolution or schema evolution on Primary Key
columns.
Users should not do partition evolution on tables with Primary Keys or Identifier Fields available, or do Schema
Evolution on Primary Key columns, Partition Columns, or Identifier Fields from Spark.
• The use of Iceberg tables as Structured Streaming sources or sinks is not supported.
• PyIceberg is not supported. Using Spark SQL to query Iceberg tables in PySpark is supported.
You log into the Ranger Admin UI, and the Ranger Service Manager appears.
Prerequisites
• Obtain the RangerAdmin role.
• Get the user name and password your Administrator set up for logging into the Ranger Admin.
The default credentials for logging into the Ranger Admin Web UI are admin/admin123.
Procedure
1. Log into Ranger Admin Web UI.
The Ranger Service Manager appears:
3. In Service Manager, in Hadoop SQL, select Edit and edit the all storage-type, storage-url policy.
4. Below Policy Label, select storage-type, and enter iceberg.
5. In Storage URL, enter the value *, and enable Include.
For more information about these policy settings, see Ranger storage handler documentation.
6. In Allow Conditions, specify roles, users, or groups to whom you want to grant RW storage permissions.
You can specify PUBLIC to grant Iceberg table access permissions to all users. Alternatively, you can grant access to one user. For example, add the systest user to the list of users who can access Iceberg.
For more information about granting permissions, see Configure a resource-based policy: Hadoop-SQL.
7. Add the RW Storage permission to the policy.
8. Save your changes.
Procedure
1. Log into Ranger Admin Web UI.
The Ranger Service Manager appears.
2. Click Add New Policy.
4. Scroll down to Allow Conditions, and select the roles, groups, or users you want to access the table.
You can use Deny All Other Accesses to deny access to all other roles, groups, or users other than those specified
in the allow conditions for the policy.
5. Select permissions to grant.
For example, select Create, Select, and Alter. Alternatively, to provide the broadest permissions, select All.
Ignore RW Storage and other permissions not named after SQL queries. These are for future implementations.
6. Click Add.
An example Spark SQL creation command to create a new Iceberg table is as follows:
spark.sql("""CREATE EXTERNAL TABLE ice_t (idx int, name string, state string
)
USING iceberg
PARTITIONED BY (state)""")
CREATE TABLE logs (app string, lvl string, message string, event_ts timestam
p) USING iceberg TBLPROPERTIES ('format-version' = '2')
<delete-mode>, <update-mode>, and <merge-mode> can be specified during table creation to set the modes for the respective operations. If unspecified, they default to merge-on-read.
Here, <source> is an existing Iceberg table. This operation may appear to succeed, displaying only warnings and no errors, but the resulting table is not a usable table.
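The write mode properties named later in this chapter (write.delete.mode, write.update.mode, write.merge.mode) can be supplied in TBLPROPERTIES when a v2 table is created. A sketch, with a hypothetical table name logs_v2:

// Create a v2 Iceberg table with explicit write modes (sketch; the table
// and column names are illustrative).
spark.sql("""CREATE TABLE logs_v2 (app string, lvl string, message string, event_ts timestamp)
  USING iceberg
  TBLPROPERTIES ('format-version' = '2',
                 'write.delete.mode' = 'merge-on-read',
                 'write.update.mode' = 'merge-on-read',
                 'write.merge.mode' = 'merge-on-read')""")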
Procedure
1. In Cloudera Manager, select the service for the Hive Metastore.
2. Click the Configuration tab.
3. Search for safety valve and find the Hive Metastore Server Advanced Configuration Snippet (Safety Valve) for hive-site.xml.
4. Add the following property:
• Name: hive.metastore.disallow.incompatible.col.type.changes
• Value: false
5. Click Save Changes.
6. Restart the service to apply the configuration change.
Importing
Call the snapshot procedure to import a Hive table into Iceberg using a Spark 3 application.
spark.sql("CALL
<catalog>.system.snapshot('<src>', '<dest>')")
Definitions:
• <src> is the qualified name of the Hive table
• <dest> is the qualified name of the Iceberg table to be created
• <catalog> is the name of the catalog, which you pass in a configuration file. For more information, see
Configuring Catalog linked below.
For example:
spark.sql("CALL
spark_catalog.system.snapshot('hive_db.hive_tbl',
'iceberg_db.iceberg_tbl')")
For information on compiling a Spark 3 application with Iceberg libraries, see Iceberg library dependencies for Spark applications linked below.
Migrating
When you migrate a Hive table to Iceberg, a backup of the table, named <table_name>_backup_, is created.
Ensure that the TRANSLATED_TO_EXTERNAL property, which is located in TBLPROPERTIES, is set to false before migrating the table. This ensures that a table backup is created by renaming the table in Hive Metastore (HMS) instead of moving the physical location of the table. Moving the physical location of the table would entail copying files in Amazon S3.
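A minimal sketch of checking and, if needed, setting that property from Spark SQL before running the migration (the table name is illustrative):

// Inspect the current table properties of the Hive table.
spark.sql("SHOW TBLPROPERTIES hive_db.hive_tbl").show(false)

// Make sure the backup is created by renaming in HMS rather than by moving data.
spark.sql("ALTER TABLE hive_db.hive_tbl SET TBLPROPERTIES ('TRANSLATED_TO_EXTERNAL' = 'false')")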
We recommend that you refrain from dropping the backup table, as doing so will invalidate the newly migrated table.
If you want to delete the backup table, set the following:
'external.table.purge'='FALSE'
Note: For CDE 1.19 and above, the property is set automatically.
Deleting the backup table in this manner prevents the underlying data from being deleted; only the table is deleted from the metastore.
To undo the migration, drop the migrated table and restore the Hive table from the backup table by renaming it.
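A sketch of both operations, assuming the backup table follows the <table_name>_backup_ naming described above (table names are illustrative):

// Mark the backup so that dropping it does not purge the underlying data ...
spark.sql("ALTER TABLE hive_db.hive_tbl_backup_ SET TBLPROPERTIES ('external.table.purge' = 'FALSE')")
// ... and then drop only its metastore entry.
spark.sql("DROP TABLE hive_db.hive_tbl_backup_")

// Alternatively, to undo the migration: drop the migrated table and restore
// the original Hive table by renaming the backup.
spark.sql("DROP TABLE hive_db.hive_tbl")
spark.sql("ALTER TABLE hive_db.hive_tbl_backup_ RENAME TO hive_db.hive_tbl")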
Call the migrate procedure to migrate a Hive table to Iceberg.
spark.sql("CALL
<catalog>.system.migrate('<src>')")
Definitions:
• <src> is the qualified name of the Hive table
• <catalog> is the name of the catalog, which you pass in a configuration file. For more information, see
Configuring Catalog linked below.
For example:
spark.sql("CALL spark_catalog.system.migrate('hive_db.hive_tbl')")
Importing
Call the snapshot procedure to import a Hive table into Iceberg table format v2 using a Spark 3 application.
spark.sql("CALL
<catalog>.system.snapshot(source_table => '<src>',
table => '<dest>',
properties => map('format-version', '2', 'write.delete.mode', '<delete-mo
de>',
'write.update.mode', '<update-mode>',
'write.merge.mode', '<merge-mode>'))")
Definitions:
• <src> is the qualified name of the Hive table
• <dest> is the qualified name of the Iceberg table to be created
• <catalog> is the name of the catalog, which you pass in a configuration file. For more information, see
Configuring Catalog linked below.
• <delete-mode>, <update-mode>, and <merge-mode> are the modes used to perform the respective operations. If unspecified, they default to 'merge-on-read'.
For example:
spark.sql("CALL
spark_catalog.system.snapshot('hive_db.hive_tbl',
'iceberg_db.iceberg_tbl')")
For information on compiling a Spark 3 application with Iceberg libraries, see Iceberg library dependencies for Spark applications linked below.
Migrating
Call the migrate procedure to migrate a Hive table to Iceberg.
spark.sql("CALL
<catalog>.system.migrate('<src>',
map('format-version', '2',
'write.delete.mode', '<delete-mode>',
'write.update.mode', '<update-mode>',
'write.merge.mode', '<merge-mode>'))")
Definitions:
• <src> is the qualified name of the Hive table
• <catalog> is the name of the catalog, which you pass in a configuration file. For more information, see
Configuring Catalog linked below.
• <delete-mode>, <update-mode>, and <merge-mode> are the modes used to perform the respective operations. If unspecified, they default to 'merge-on-read'.
For example:
spark.sql("CALL
34
Cloudera Runtime Using Apache Iceberg in CDS
spark_catalog.system.migrate('hive_db.hive_tbl',
map('format-version', '2',
'write.delete.mode', 'merge-on-read',
'write.update.mode', 'merge-on-read',
'write.merge.mode', 'merge-on-read'))")
<delete-mode>, <update-mode>, and <merge-mode> specify the modes used to perform the respective operations. If unspecified, they default to 'merge-on-read'.
Configuring Catalog
When using Spark SQL to query an Iceberg table from Spark, you refer to a table using the following dot notation:
<catalog_name>.<database_name>.<table_name>
The default catalog used by Spark is named spark_catalog. When referring to a table in a database known to spark_catalog, you can omit <catalog_name>.
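For example, the following two queries are equivalent when iceberg_db is known to spark_catalog (the table name is taken from the earlier snapshot example):

// Fully qualified reference that names the catalog explicitly ...
spark.sql("SELECT * FROM spark_catalog.iceberg_db.iceberg_tbl").show()

// ... and the same table with the default catalog name omitted.
spark.sql("SELECT * FROM iceberg_db.iceberg_tbl").show()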
Iceberg provides a SparkCatalog property that understands Iceberg tables, and a SparkSessionCatalog property that
understands both Iceberg and non-Iceberg tables. The following are configured by default:
spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog
spark.sql.catalog.spark_catalog.type=hive
This replaces Spark's default catalog with Iceberg's SparkSessionCatalog and allows you to use both Iceberg and non-Iceberg tables out of the box.
There is one caveat when using SparkSessionCatalog. Iceberg supports CREATE TABLE … AS SELECT (CTAS) and REPLACE TABLE … AS SELECT (RTAS) as atomic operations when using SparkCatalog, whereas CTAS and RTAS are supported but not atomic when using SparkSessionCatalog. As a workaround, you can configure another catalog that uses SparkCatalog. For example, to create the catalog named iceberg_catalog, set the following:
spark.sql.catalog.iceberg_catalog=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.iceberg_catalog.type=hive
You can configure more than one catalog in the same Spark job. For more information, see the Iceberg
documentation.
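With such an additional catalog configured, CTAS and RTAS issued through it are atomic. A sketch, reusing the iceberg_catalog name from the configuration above (database and table names are illustrative):

// Atomic CREATE TABLE ... AS SELECT issued through the SparkCatalog-backed
// catalog.
spark.sql("""CREATE TABLE iceberg_catalog.iceberg_db.ice_t_copy
  USING iceberg
  AS SELECT * FROM iceberg_db.ice_t""")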
Related Tasks
Iceberg documentation
Important: When querying Iceberg tables in HDFS, CDS disables locality by default, because enabling locality generally leads to a significant increase in Spark planning time for queries on such tables. If you want to enable locality, set spark.cloudera.iceberg.locality.enabled to true. For example, you can do this by passing --conf spark.cloudera.iceberg.locality.enabled=true to your spark3-submit command.
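If you build the session in application code rather than through spark3-submit, the same setting can be supplied when the session is constructed; a sketch, assuming the property is honored when set this way:

import org.apache.spark.sql.SparkSession

// Enable locality for Iceberg tables on HDFS when the session is built.
val spark = SparkSession.builder()
  .config("spark.cloudera.iceberg.locality.enabled", "true")
  .getOrCreate()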
v1 format
Iceberg supports bulk updates through MERGE, by defaulting to copy-on-write deletes when using v1 table format.
v2 format
Iceberg table format v2 supports efficient row-level updates and delete operations leveraging merge-on-read.
For more details, refer to Position Delete Files linked below.
For updating data examples, see Spark Writes linked below.
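As an illustration of v2 row-level operations, statements such as DELETE and UPDATE can be issued against a v2 table like the logs table created earlier. This is only a sketch: predicates and values are illustrative, and the statements assume the Iceberg SQL support shipped with CDS.

// Row-level DELETE and UPDATE against the v2 "logs" table created earlier;
// the predicates and values are illustrative.
spark.sql("DELETE FROM logs WHERE lvl = 'DEBUG'")
spark.sql("UPDATE logs SET lvl = 'WARN' WHERE app = 'billing'")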
Cloudera publishes Iceberg artifacts to a Maven repository with versions matching the Iceberg version in CDS.
Note: Use Iceberg version 1.3.0.7.1.9.0-387 for compilation. The Iceberg dependencies below should only be used for compilation; avoid including Iceberg JARs in a Spark application fat JAR.
<dependency>
<groupId>org.apache.iceberg</groupId>
<artifactId>iceberg-core</artifactId>
<version>${iceberg.version}</version>
<scope>provided</scope>
</dependency>
<!-- for org.apache.iceberg.hive.HiveCatalog -->
<dependency>
<groupId>org.apache.iceberg</groupId>
<artifactId>iceberg-hive-metastore</artifactId>
<version>${iceberg.version}</version>
<scope>provided</scope>
</dependency>
<!-- for org.apache.iceberg.spark.* classes if used -->
<dependency>
<groupId>org.apache.iceberg</groupId>
<artifactId>iceberg-spark</artifactId>
<version>${iceberg.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.iceberg</groupId>
<artifactId>iceberg-spark-runtime-3.3_2.12</artifactId>
<version>${iceberg.version}</version>
<scope>provided</scope>
</dependency>
The iceberg-spark3-runtime JAR contains the necessary Iceberg classes for Spark runtime support, and includes the
classes from the dependencies above.