Iceberg How To
https://docs.cloudera.com/
Legal Notice
© Cloudera Inc. 2025. All rights reserved.
The documentation is and contains Cloudera proprietary information protected by copyright and other intellectual property
rights. No license under copyright or any other intellectual property right is granted herein.
Unless otherwise noted, scripts and sample code are licensed under the Apache License, Version 2.0.
Copyright information for Cloudera software may be found within the documentation accompanying each component in a
particular release.
Cloudera software includes software from various open source or other third party projects, and may be released under the
Apache Software License 2.0 (“ASLv2”), the Affero General Public License version 3 (AGPLv3), or other license terms.
Other software included may be released under the terms of alternative open source licenses. Please review the license and
notice files accompanying the software for additional licensing information.
Please visit the Cloudera software product page for more information on Cloudera software. For more information on
Cloudera support services, please visit either the Support or Sales page. Feel free to contact us directly to discuss your
specific needs.
Cloudera reserves the right to change any products at any time, and without notice. Cloudera assumes no responsibility nor
liability arising from the use of products, except as expressly agreed to in writing by Cloudera.
Cloudera, Cloudera Altus, HUE, Impala, Cloudera Impala, and other Cloudera marks are registered or unregistered
trademarks in the United States and other countries. All other trademarks are the property of their respective owners.
Disclaimer: EXCEPT AS EXPRESSLY PROVIDED IN A WRITTEN AGREEMENT WITH CLOUDERA,
CLOUDERA DOES NOT MAKE NOR GIVE ANY REPRESENTATION, WARRANTY, NOR COVENANT OF
ANY KIND, WHETHER EXPRESS OR IMPLIED, IN CONNECTION WITH CLOUDERA TECHNOLOGY OR
RELATED SUPPORT PROVIDED IN CONNECTION THEREWITH. CLOUDERA DOES NOT WARRANT THAT
CLOUDERA PRODUCTS NOR SOFTWARE WILL OPERATE UNINTERRUPTED NOR THAT IT WILL BE
FREE FROM DEFECTS NOR ERRORS, THAT IT WILL PROTECT YOUR DATA FROM LOSS, CORRUPTION
NOR UNAVAILABILITY, NOR THAT IT WILL MEET ALL OF CUSTOMER’S BUSINESS REQUIREMENTS.
WITHOUT LIMITING THE FOREGOING, AND TO THE MAXIMUM EXTENT PERMITTED BY APPLICABLE
LAW, CLOUDERA EXPRESSLY DISCLAIMS ANY AND ALL IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO IMPLIED WARRANTIES OF MERCHANTABILITY, QUALITY, NON-INFRINGEMENT, TITLE, AND
FITNESS FOR A PARTICULAR PURPOSE AND ANY REPRESENTATION, WARRANTY, OR COVENANT BASED
ON COURSE OF DEALING OR USAGE IN TRADE.
Contents
Apache Iceberg features
  Alter table feature
  Create table feature
  Create table as select feature
  Create partitioned table as select feature
  Create table … like feature
  Delete data feature
  Describe table metadata feature
  Drop partition feature
  Drop table feature
  Expire snapshots feature
  Insert table data feature
  Load data inpath feature
  Load or replace partition data feature
  Materialized view feature
  Materialized view rebuild feature
  Merge feature
  Migrate Hive table to Iceberg feature
    Changing the table metadata location
  Flexible partitioning
    Partition evolution feature
    Partition transform feature
  Query metadata tables feature
  Rollback table feature
  Select Iceberg data feature
  Schema evolution feature
  Schema inference feature
  Snapshot management
  Time travel feature
  Truncate table feature
Best practices for Iceberg in Cloudera
  Making row-level changes on V2 tables only
Performance tuning
  Caching manifest files
  Configuring manifest caching in Cloudera Manager
Unsupported features and limitations
Accessing Iceberg tables
  Opening Ranger in Cloudera Data Hub
  Editing a storage handler policy to access Iceberg files on the file system
  Creating a SQL policy to query an Iceberg table
Creating an Iceberg table
Creating an Iceberg partitioned table
Expiring snapshots
Inserting data into a table
Migrating a Hive table to Iceberg
Selecting an Iceberg table
Running time travel queries
Updating an Iceberg partition
Test driving Iceberg from Impala
Hive demo data
Test driving Iceberg from Hive
Iceberg data types
Iceberg table properties
Apache Iceberg features
Iceberg partitioning
The Iceberg partitioning technique has performance advantages over conventional partitioning, such as Apache Hive
partitioning. Iceberg hidden partitioning is easier to use. Iceberg supports in-place partition evolution; to change a
partition, you do not rewrite the entire table to add a new partition column, and queries do not need to be rewritten
for the updated table. Iceberg continuously gathers data statistics, which supports additional optimizations, such as
partition pruning.
Iceberg uses multiple layers of metadata files to find and prune data. Hive and Impala keep track of data at the folder
level and not at the file level, performing file list operations when working with data in a table. Performance problems
occur during the execution of multiple list operations. Iceberg keeps track of a complete list of files within a table
using a persistent tree structure. Changes to an Iceberg table use an atomic object/file level commit to update the path
to a new snapshot file. The snapshot points to the individual data files through manifest files.
The manifest files track several data files across many partitions. These files store partition information and column
metrics for each data file. A manifest list is an additional index for pruning entire manifests. File pruning increases
efficiency.
Iceberg relieves Hive metastore (HMS) pressure by storing partition information in metadata files on the file system/
object store instead of within the HMS. This architecture supports rapid scaling without performance hits.
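The hidden-partitioning behavior described above can be sketched as follows. This is a hedged illustration, not from the original document; the table and column names are hypothetical:

```sql
-- The table is partitioned by DAY(ts), but queries filter on ts directly;
-- Iceberg derives the partition values and prunes partitions automatically.
CREATE TABLE sales (id INT, amount DOUBLE, ts TIMESTAMP)
PARTITIONED BY SPEC (DAY(ts))
STORED BY ICEBERG;

-- No partition column appears in the predicate, yet only the matching
-- day partitions are scanned.
SELECT SUM(amount) FROM sales
WHERE ts BETWEEN '2023-04-01' AND '2023-04-07';
```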
• table_properties
A list of properties and values using the following syntax:
Impala syntax
Hive example
Impala examples
Hive syntax
Impala syntax
Hive examples
Impala examples
Related Information
Drop table feature
Partition transform feature
Hive examples
Impala examples
You see an example of using a partition transform with the PARTITIONED BY SPEC clause.
The newly created table does not inherit the partition spec and table properties from the source table in the SELECT. The Iceberg table and the corresponding Hive table are created at the beginning of query execution, and the data is inserted and committed when the query finishes. For a transient period, therefore, the table exists but contains no data.
Hive syntax
PARTITIONED BY (part)
TBLPROPERTIES ('key'='value')
AS SELECT ...
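A hedged sketch of a complete CTAS statement following the syntax fragment above; the table and column names are hypothetical:

```sql
-- Creates an Iceberg v2 table partitioned by the 'part' column and
-- populates it from an existing table in one statement.
CREATE EXTERNAL TABLE ctas_ice
PARTITIONED BY (part)
STORED BY ICEBERG
TBLPROPERTIES ('format-version'='2')
AS SELECT id, name, part FROM source_tbl;
```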
Hive examples
Impala examples
Hive example
Impala example
• If you are using Apache Flink or Apache NiFi to write equality deletes, then ensure that you provide a PRIMARY
KEY for the table. This is required for engines to know which columns to write into the equality delete files.
• If the table is partitioned then the partition columns have to be part of the PRIMARY KEY
• For Apache Flink, the table should be in 'upsert-mode' to write equality deletes
• Partition evolution is not allowed for Iceberg tables that have PRIMARY KEYs
Position delete files contain the following information:
• file_path, which is a full URI
• pos, the file position of the row
Delete files are sorted by file_path and pos. The following table shows an example of delete files in a partitioned
table:
Table 1:
Hive and Impala evaluate rows from one table against a WHERE clause and delete all the rows that match the WHERE conditions. If you want to delete all rows, use the Truncate feature instead. The WHERE expression is similar to the WHERE expression used in SELECT, and its conditions can refer to any columns.
Concurrent operations that include DELETE do not introduce inconsistent table states. Iceberg runs validation checks to detect concurrent modifications, such as DELETE+INSERT, and only one of the operations succeeds. On the other hand, concurrent DELETE+DELETE and INSERT+INSERT operations can both succeed; however, for a concurrent DELETE+UPDATE, UPDATE+UPDATE, DELETE+INSERT, or UPDATE+INSERT from Hive, only the first operation succeeds.
From joined tables, you can delete all matching rows from one of the tables. You can join tables of any kind, but
the table from which the rows are deleted must be an Iceberg table. The FROM keyword is required in this case, to
separate the name of the table whose rows are being deleted from the table names of the join clauses.
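A hedged sketch of a delete using a join, in Impala-style syntax; the table names are hypothetical, and the table from which rows are deleted must be an Iceberg v2 table:

```sql
-- Deletes from ice_target every row that has a matching id in staging.
-- The FROM keyword separates the target table from the join clauses.
DELETE ice_target
FROM ice_target JOIN staging ON ice_target.id = staging.id;
```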
create external table tbl_ice(a int, b string, c int) stored by iceberg tblproperties ('format-version'='2');

insert into tbl_ice values (1, 'one', 50), (2, 'two', 51), (3, 'three', 52), (4, 'four', 53), (5, 'five', 54), (111, 'one', 55), (333, 'two', 56);
The following example deletes 0, 1, or more rows of the table. If col1 is a primary key, 0 or 1 rows are deleted:
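The example itself appears to have been lost in extraction; a hedged sketch matching the description (column and value are hypothetical):

```sql
-- Deletes the rows, if any, whose col1 value is 100. If col1 is a
-- primary key, this removes at most one row.
DELETE FROM tbl_ice WHERE col1 = 100;
```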
SHOW CREATE TABLE table_name
    Reveals the schema that created the table. (Hive and Impala)
SHOW FILES IN table_name
    Lists the files related to the table. (Impala)
SHOW PARTITIONS table_name
    Returns the Iceberg partition spec: only the column information, not actual partitions or files. (Impala)
DESCRIBE [EXTENDED] table_name
    The optional EXTENDED shows all the metadata for the table in Thrift serialized form, which is useful for debugging. (Hive and Impala)
DESCRIBE HISTORY table_name [BETWEEN timestamp1 AND timestamp2]
    Optionally limits the output history to a period of time. (Impala)
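Hedged sketches of the DESCRIBE HISTORY statement; the table name and timestamps are illustrative:

```sql
-- Full snapshot history of the table.
DESCRIBE HISTORY tbl_ice;

-- History limited to a period of time.
DESCRIBE HISTORY tbl_ice BETWEEN '2022-01-04 10:00:00' AND '2022-01-05 10:00:00';
```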
Hive example
DESCRIBE t;
x int
y int
NULL NULL
y IDENTITY NULL
The output of DESCRIBE HISTORY includes the following columns about each snapshot. The first three are self-explanatory; the is_current_ancestor column value is TRUE if the snapshot is an ancestor of the table's current state:
• creation_time
• snapshot_id
• parent_id
• is_current_ancestor
Impala examples
Prerequisites
• The filter expression in the DROP PARTITION syntax below combines one or more of the following predicates:
• A binary predicate (at least one is required)
• An IN predicate
• An IS NULL predicate
• The argument of each predicate in the filter expression must be a partition transform, such as YEARS(col_name), or a plain column name.
• Any non-identity transform must appear in the filter as that transform. For example, if a column is partitioned by days, the filter must use DAYS on that column.
Impala syntax
The following operators are supported in the predicate: =, !=, <, >, <=, >=
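A hedged sketch of dropping partitions by a transform predicate; the table and column names are hypothetical:

```sql
-- Drops the partitions whose DAYS(d) transform value matches the
-- predicate; existing rows in other partitions are untouched.
ALTER TABLE ice_daily DROP PARTITION (DAYS(d) = '2024-01-01');
```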
Impala example
Related Information
Create table feature
Suppose you want to retain the snapshots of a table for the last 24 hours, and you configure history.expire.min-snapshots-to-keep as a safety mechanism to enforce this. If your table receives only one modification (insert, update, or merge) per hour, then setting history.expire.min-snapshots-to-keep = 24 is sufficient to meet your requirement. However, if your table consistently receives updates every minute, the last 24-hour period entails 1440 snapshots, and the history.expire.min-snapshots-to-keep setting must be configured accordingly.
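A hedged sketch of the two settings discussed above; the table name and timestamp are illustrative:

```sql
-- Safety floor: keep at least 24 snapshots regardless of their age.
ALTER TABLE tbl_ice SET TBLPROPERTIES ('history.expire.min-snapshots-to-keep'='24');

-- Expire snapshots older than the given timestamp, subject to the floor above.
ALTER TABLE tbl_ice EXECUTE EXPIRE_SNAPSHOTS('2024-06-01 00:00:00');
```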
Hive example
FROM customers
INSERT INTO target1 SELECT customer_id, first_name;
INSERT INTO target2 SELECT last_name, customer_id;
Impala syntax
Impala example
In this example, you create a table using the LIKE clause to point to a table stored as Parquet. This is required for
Iceberg to infer the schema. You also load data stored as ORC.
Impala example
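The example itself appears to have been lost in extraction; a hedged sketch of the two steps described above (paths and table names are hypothetical):

```sql
-- Infer the Iceberg table schema from an existing Parquet file.
CREATE TABLE ice_orders LIKE PARQUET '/warehouse/sample/orders.parquet'
STORED BY ICEBERG;

-- Load data files stored as ORC into the new table.
LOAD DATA INPATH '/tmp/orders_orc_dir' INTO TABLE ice_orders;
```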
Hive example
The following example creates a materialized view of an Iceberg table from Hive.
The following example creates a materialized view of two Iceberg tables. Joined tables must be in the same table
format, either Iceberg or Hive ACID.
The following example uses explain to examine a materialized view and then creates a materialized view of an
Iceberg V1 table from Hive.
create table tbl_ice(a int, b string, c int) stored by iceberg stored as orc tblproperties ('format-version'='1');

insert into tbl_ice values (1, 'one', 50), (2, 'two', 51), (3, 'three', 52), (4, 'four', 53), (5, 'five', 54);

explain create materialized view mat1 stored by iceberg stored as orc tblproperties ('format-version'='1') as
Related Information
Materialized view rebuild feature
Hive syntax
explain cbo
select tbl_ice.b, tbl_ice.c, tbl_ice_v2.e from tbl_ice join tbl_ice_v2 on tbl_ice.a=tbl_ice_v2.d where tbl_ice.c > 52;

explain cbo
alter materialized view mat1 rebuild;

insert into tbl_ice values (1, 'one', 50), (2, 'two', 51), (3, 'three', 52), (4, 'four', 53), (5, 'five', 54);

explain
create materialized view mat1 partitioned on spec (bucket(16, b), truncate(3, c)) stored by iceberg stored as orc tblproperties ('format-version'='1') as
select tbl_ice.b, tbl_ice.c from tbl_ice where tbl_ice.c > 52;
You use the DESCRIBE command to see the output query plan, which shows details about the view, including whether it can be used in automatic query rewrites.
With the query rewrite option enabled, you insert data into the source table, and incremental rebuild occurs
automatically. You do not need to rebuild the view manually before running queries.
Related Information
Materialized view feature
Merge feature
You can perform actions on an Iceberg table based on the results of a join with a v2 Iceberg table.
Hive syntax
Hive example
Use the MERGE INTO statement to update an Iceberg table based on a staging table:
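The example itself appears to have been lost in extraction; a hedged sketch of a MERGE INTO statement (all table and column names are hypothetical, and the target must be an Iceberg v2 table):

```sql
-- Update matching rows from the staging table and insert the rest.
MERGE INTO customer USING (SELECT * FROM new_customer_stage) AS sub
ON sub.id = customer.id
WHEN MATCHED THEN UPDATE SET name = sub.name, state = sub.state
WHEN NOT MATCHED THEN INSERT VALUES (sub.id, sub.name, sub.state);
```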
To convert a Hive table to an Iceberg V2 table, you must run two queries. Use the following syntax:
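The two queries referenced above appear to have been lost in extraction; a hedged sketch of the usual in-place migration steps (the table name is hypothetical, and exact property names may vary by release):

```sql
-- Step 1: convert the Hive table to Iceberg in place; only metadata is
-- regenerated, pointing at the existing data files.
ALTER TABLE hive_tbl
SET TBLPROPERTIES ('storage_handler'='org.apache.iceberg.mr.hive.HiveIcebergStorageHandler');

-- Step 2: upgrade the resulting Iceberg table to format version 2.
ALTER TABLE hive_tbl SET TBLPROPERTIES ('format-version'='2');
```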
In-place table migration saves time generating Iceberg tables. There is no need to regenerate data files. Only
metadata, which points to source data files, is regenerated.
Flexible partitioning
Iceberg partition evolution, a feature unique to Iceberg, and the partition transform feature greatly simplify partitioning tables and changing partitions.
Partitions based on transforms are stored in the Iceberg metadata layer, not in the directory structure. You can change the partitioning completely, or just refine existing partitioning, and write new data based on the new partition layout; there is no need to rewrite existing data files. For example, you can change a partition by month to a partition by day.
Use partition transforms, such as IDENTITY, TRUNCATE, BUCKET, YEAR, MONTH, DAY, HOUR. Iceberg
solves scalability problems caused by having too many partitions. Partitioning can also include columns with a large
number of distinct values. Partitioning is hidden in the Iceberg metadata layer, eliminating the need to explicitly write
partition columns (YEAR, MONTH for example) or to add extra predicates to queries for partition pruning.
Year, month, and day can be automatically extracted from '2023-04-21 20:56:08' if the table is partitioned by DAY(ts).
• spec
The specification for a transform listed in the next topic, "Partition transform feature".
ALTER TABLE t
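The syntax fragment above is truncated in the source; a hedged sketch of a complete partition evolution statement (the column name is hypothetical):

```sql
-- Switch the table from its current partitioning to daily partitioning;
-- existing data files are not rewritten, only new writes use the new spec.
ALTER TABLE t SET PARTITION SPEC (DAY(ts));
```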
Related Information
Partition transform feature
Partition by hashed value mod N buckets: bucket(N, col) (Hive and Impala)
Strings are truncated to length L. Integers and longs are truncated to bins. For example, truncate(10, i) yields partitions 0, 10, 20, 30 …
The idea behind the partition-by-hashed-value-mod-N-buckets transform is the same as hash bucketing for Hive tables. A hashing algorithm calculates the bucket from the column value (modulus). For example, with 10 buckets, data is stored in bucket (column value % 10), that is, in one of buckets 0 through 9 (0 to N-1).
You use the PARTITIONED BY SPEC clause to partition a table by an identity transform.
Hive syntax
• BUCKET(bucket_num,col_name)
• TRUNCATE(length, col_name)
Impala syntax
Hive example
The following example creates a top level partition based on column i, a second level partition based on the hour part
of the timestamp, and a third level partition based on the first 1000 characters in column j.
Impala examples
The following examples show how to use the PARTITION BY SPEC clause in a CREATE TABLE query from Impala. The same transforms are available in a CREATE EXTERNAL TABLE query from Hive.
Related Information
Creating an Iceberg partitioned table
Create table feature
Partition evolution feature
Iceberg metadata tables include information that is useful for efficient table maintenance (about snapshots, manifests,
data, delete files, etc.) as well as statistics that help query engines plan and execute queries more efficiently (value
count, min-max values, number of NULLs, etc.).
Note: Iceberg metadata tables are read-only. You cannot add, remove, or modify records in the tables. Also,
you cannot drop or create new metadata tables.
For more information about the Apache Iceberg metadata table types, see the Apache Iceberg MetadataTableType enumeration.
For more information about querying Iceberg metadata, see the Apache Iceberg Spark documentation.
The following sections describe how you can interact with and query Iceberg metadata tables:
Impala Syntax:
Impala Example:
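The example statement appears to have been lost in extraction. The output below lists the available metadata tables; a statement of roughly the following shape (assuming the SHOW METADATA TABLES syntax available in recent Impala versions, with a hypothetical table name) produces it:

```sql
-- Lists the metadata tables available for the given Iceberg table.
SHOW METADATA TABLES IN default.ice_table;
```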
Output:
+----------------------+
| name |
+----------------------+
| all_data_files |
| all_delete_files |
| all_entries |
| all_files |
| all_manifests |
| data_files |
| delete_files |
| entries |
| files |
| history |
| manifests |
| metadata_log_entries |
| partitions |
| position_deletes |
| refs |
| snapshots |
+----------------------+
You can select any subset of the columns, or all of them using '*'. In comparison to regular tables, running a SELECT * from Impala on metadata tables always includes complex-typed columns in the result; the Impala query option EXPAND_COMPLEX_TYPES applies only to regular tables. Hive, however, always includes complex columns, irrespective of whether SELECT queries are run on regular tables or metadata tables.
For Impala queries that have a mix of regular tables and metadata tables, a SELECT * expression where the sources
are metadata tables always includes complex types, whereas for SELECT * expressions where the sources are regular
tables, complex types are included only if the EXPAND_COMPLEX_TYPES query option is set to 'true'.
In the case of Hive, columns with complex types are always included.
You can also filter the result set using a WHERE clause, use aggregate functions such as MAX or SUM, JOIN
metadata tables with other metadata tables or regular tables, and so on.
Example:
SELECT
s.operation,
h.is_current_ancestor,
s.summary
FROM default.ice_table.history h
JOIN default.ice_table.snapshots s
ON h.snapshot_id = s.snapshot_id
WHERE s.operation = 'append'
ORDER BY made_current_at;
Limitations
• Impala does not support the DATE and BINARY data types. NULL is returned instead of their actual values.
• Impala does not support unnesting collections from metadata tables.
DESCRIBE database_name.table_name.metadata_table_name;
DESCRIBE default.ice_table.history;
Output:
+---------------------+-----------+---------+----------+
| name | type | comment | nullable |
+---------------------+-----------+---------+----------+
| made_current_at | timestamp | | true |
| snapshot_id | bigint | | true |
| parent_id | bigint | | true |
| is_current_ancestor | boolean | | true |
+---------------------+-----------+---------+----------+
Related Information
Apache Iceberg MetadataTableType
Apache Spark documentation
The following example rolls the table back to the latest snapshot having a creation timestamp earlier than '2022-08-08
00:00:00'.
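The example statement appears to have been lost in extraction; a hedged sketch of a rollback by timestamp (Hive-style EXECUTE syntax, table name hypothetical):

```sql
-- Rolls the table back to the latest snapshot created before the timestamp.
ALTER TABLE t EXECUTE ROLLBACK('2022-08-08 00:00:00');
```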
The Iceberg table schema is synchronized with the Hive/Impala table schema. A change to the schema of the Iceberg
table by an outside entity, such as Spark, changes the corresponding Hive/Impala table. You can change the Iceberg
table using ALTER TABLE to make the following changes:
From Hive:
• Add a column
• Replace a column
• Change a column type or its position in the table
From Impala:
• Add a column
• Rename a column
• Drop a column
• Change a column type
An unsafe change to a column type, for example one that would require updating each row of the table, is not allowed.
The following type changes are safe:
• int to long
• float to double
• decimal(P, S) to decimal(P', S) if precision is increased
From Hive, you can effectively drop a column by replacing the old set of columns with a new set that omits it.
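Hedged sketches of the schema changes listed above; the table and column names are hypothetical:

```sql
-- Add a column.
ALTER TABLE t ADD COLUMNS (note STRING);

-- Safe type change: widen an int column to bigint without rewriting rows.
ALTER TABLE t CHANGE COLUMN c c BIGINT;
```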
Hive syntax
Impala syntax
Hive examples
Impala examples
hive.parquet.infer.binary.as = <value>
Hive syntax
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name LIKE FILE PARQUET 'object_storage_path_of_parquet_file'
[PARTITIONED BY [SPEC]([col_name][, spec(value)][, spec(value)]...)]
[STORED AS file_format]
STORED BY ICEBERG
[TBLPROPERTIES (property_name=property_value, ...)]
Impala syntax
Hive example
Impala example
Snapshot management
In addition to time travel queries, expiring a snapshot, and using a snapshot to rollback to a version of a table, you can
also set any snapshot to be the current snapshot from Hive.
Hive syntax
Hive example
• time_stamp
The state of the Iceberg table at the time specified by the UTC timestamp.
• snapshot_id
The ID of the Iceberg table snapshot from the history output.
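Given the parameters described above, a hedged sketch of setting the current snapshot from Hive (the snapshot ID is illustrative; take a real one from DESCRIBE HISTORY output):

```sql
-- Make an earlier snapshot the table's current snapshot.
ALTER TABLE t EXECUTE SET_CURRENT_SNAPSHOT(3088747670581784990);
```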
Hive syntax
TRUNCATE table_name
Impala syntax
TRUNCATE t;

Best practices for Iceberg in Cloudera
Performance tuning
Impala uses its own C++ implementation to deal with Iceberg tables. This implementation provides significant
performance advantages over other engines.
To tune performance, try the following actions:
• Increase parallelism to handle large manifest list files in Spark.
By default, the number of processors determines the preset value of the iceberg.worker.num-threads system
property. Try increasing parallelism by setting the iceberg.worker.num-threads system property to a higher value
to speed up query compilation.
• Speed up drop table performance, preventing deletion of data files by using the following table properties:
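The table properties referenced above appear to have been lost in extraction; a hedged sketch based on the common external-table purge behavior (property name is an assumption, verify against your release):

```sql
-- With external.table.purge set to false, DROP TABLE removes table
-- metadata quickly without deleting the underlying data files.
ALTER TABLE t SET TBLPROPERTIES ('external.table.purge'='false');
```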
Configuring manifest caching in Cloudera Manager

Procedure
1. Navigate to Cloudera Manager.
2. Search for Iceberg.
By default, manifest caching is enabled, but if you have turned it off, check Impala-1 (Service-Wide) to re-enable it.

Unsupported features and limitations

Unsupported features
The following table presents feature limitations or unsupported features:
# means not yet tested
N/A means will never be tested, not a GA candidate
Iceberg Feature      Hive    Impala    Spark
Branching/Tagging    #       #         #
Bucketing            #       #         #
The table above shows that the following features are not supported in this release of Cloudera:
Limitations
The following features have limitations or are not supported in this release:
• Multiple insert overwrite queries that read data from a source table.
• When the underlying table is changed, you need to rebuild the materialized view manually, or use the Hive query
scheduling to rebuild the materialized view.
• You must be aware of the following considerations when using equality deletes:
• Equality updates and deletes are not supported.
• If you are using Apache Flink or Apache NiFi to write equality deletes, then ensure that you provide a
PRIMARY KEY for the table. This is required for engines to know which columns to write into the equality
delete files.
• If the table is partitioned then the partition columns have to be part of the PRIMARY KEY
• For Apache Flink, the table should be in 'upsert-mode' to write equality deletes
• Partition evolution is not allowed for Iceberg tables that have PRIMARY KEYs
• An equality delete file in the table is the likely cause of a problem with updates or deletes in the following
situations:
• In Change Data Capture (CDC) applications
• In upserts from Apache Flink
• From a third-party engine
• You must be aware of the following:
• An Iceberg table that points to another Iceberg table in the HiveCatalog is not supported.
For example:
Bucketing workaround
A query from Hive that defines buckets/folders in Iceberg does not create the same number of buckets/folders as the same query creates in Hive. In Hive, bucketing by multiple columns using the following clause creates a maximum of 64 buckets inside each partition.
CLUSTERED BY (
  id,
  partition_id)
INTO 64 BUCKETS
Accessing Iceberg tables
Defining bucketing from Hive on multiple columns of an Iceberg table using this query creates 64*64 buckets/folders; consequently, bucketing by group does not occur as expected. At scale, the operation creates many small files, which degrades performance.
Add multiple bucket transforms (partitions) to more than one column in the current version of Iceberg as follows:
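The example itself appears to have been lost in extraction; a hedged sketch of declaring one bucket transform per column (names and bucket counts are hypothetical):

```sql
-- Each column gets its own bucket transform in the partition spec;
-- Iceberg's bucket transform applies to a single source column.
CREATE TABLE ice_t (id INT, partition_id INT, s STRING)
PARTITIONED BY SPEC (BUCKET(64, id), BUCKET(64, partition_id))
STORED BY ICEBERG;
```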
You log into the Ranger Admin UI, and the Ranger Service Manager appears.
Prerequisites
• Obtain the RangerAdmin role.
• Get the user name and password your Administrator set up for logging into the Ranger Admin.
The default credentials for logging into the Ranger Admin Web UI are admin/admin123.
Procedure
1. Log into Cloudera, and in Cloudera Management Console, click Data Hub Clusters.
2. In Data Hubs, select the name of your Data Hub from the list.
3. In Environment Details, click the link to your Data Lake.
For example, click dlscale8-bc8bqz.
Editing a storage handler policy to access Iceberg files on the file system
You learn how to edit the existing default Hadoop SQL Storage Handler policy to access files. This policy is one of
the two Ranger policies required to use Iceberg.
Procedure
1. Log into Ranger Admin Web UI.
The Ranger Service Manager appears:
3. In Service Manager, in Hadoop SQL, select Edit and edit the all storage-type, storage-url policy.
For more information about these policy settings, see Ranger storage handler documentation.
6. In Allow Conditions, specify roles, users, or groups to whom you want to grant RW storage permissions.
You can specify PUBLIC to grant Iceberg table access permissions to all users. Alternatively, you can grant
access to one user. For example, add the systest user to the list of users who can access Iceberg:
For more information about granting permissions, see Configure a resource-based policy: Hadoop-SQL.
7. Add the RW Storage permission to the policy.
8. Save your changes.
Procedure
1. Log into Ranger Admin Web UI.
The Ranger Service Manager appears.
2. Click Add New Policy.
3. Fill in required fields.
For example, enter the following required settings:
• In Policy Name, enter the name of the policy, for example IcebergPolicy1.
• In database, enter the name of the database controlled by this policy, for example icedb.
• In table, enter the name of the table controlled by this policy, for example icetable.
• In columns, enter the name of the column controlled by this policy, for example enter the wildcard asterisk (*)
to allow access to all columns of icetable.
• Accept defaults for other settings.
4. Scroll down to Allow Conditions, and select the roles, groups, or users you want to access the table.
You can use Deny All Other Accesses to deny access to all other roles, groups, or users other than those specified
in the allow conditions for the policy.
Creating an Iceberg table
Ignore RW Storage and other permissions not named after SQL queries. These are for future implementations.
6. Click Add.
Procedure
1. Log into Cloudera, and click Data Hub Clusters.
2. Click Hue.
3. Select a database.
Creating an Iceberg partitioned table
4. Enter a query to create a simple Iceberg table in the default Parquet format.
Hive example:
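For example (the table and column names are illustrative):

CREATE EXTERNAL TABLE ice_t1 (id int, name string)
STORED BY ICEBERG;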
Impala example:
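For example (the table and column names are illustrative):

CREATE TABLE ice_t2 (id INT, name STRING)
STORED AS ICEBERG;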
In Cloudera, both CREATE EXTERNAL TABLE and plain CREATE TABLE are valid from Hive. From Hive, use the
EXTERNAL keyword to create an Iceberg table whose data is purged when you drop the table. In
Cloudera, from Impala, you must use CREATE TABLE to create the Iceberg table.
5. Click to run the query.
Procedure
1. Select, or use, a database.
2. Create an identity-partitioned table and run the query.
Hive:
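For example (the table and column names are illustrative):

CREATE EXTERNAL TABLE ice_part1 (id int, name string)
PARTITIONED BY (dept string)
STORED BY ICEBERG;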
Impala:
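For example (the table and column names are illustrative):

CREATE TABLE ice_part2 (id INT, name STRING)
PARTITIONED BY (dept STRING)
STORED AS ICEBERG;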
3. Create a table and specify a partition transform, such as bucket, truncate, or date, using the Iceberg V2
PARTITIONED BY SPEC clause.
Hive example (the table name, column names, and transform arguments are illustrative):
CREATE EXTERNAL TABLE ice_spec (id int, name string, ts timestamp)
PARTITIONED BY SPEC (bucket(16, id), truncate(5, name), month(ts))
STORED BY ICEBERG;
Impala:
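For example (the table name, column names, and transform arguments are illustrative):

CREATE TABLE ice_spec2 (id INT, name STRING, ts TIMESTAMP)
PARTITIONED BY SPEC (bucket(16, id), truncate(5, name), month(ts))
STORED AS ICEBERG;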
Related Information
Partition transform feature
Expiring snapshots
You can expire snapshots of an Iceberg table using an ALTER TABLE query. You should periodically expire
snapshots to delete data files that are no longer needed, and reduce the size of table metadata.
Procedure
1. Enter a query to expire snapshots older than the following timestamp: '2021-12-09 05:39:18.689000000'
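For example (the table name is illustrative):

ALTER TABLE ice_t1 EXECUTE expire_snapshots('2021-12-09 05:39:18.689000000');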
2. Enter a query to expire snapshots with timestamps between December 10, 2022 and November 8, 2023.
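For example, a sketch assuming Hive's EXPIRE_SNAPSHOTS BETWEEN syntax (the table name and exact timestamps are illustrative):

ALTER TABLE ice_t1 EXECUTE EXPIRE_SNAPSHOTS BETWEEN '2022-12-10 00:00:00' AND '2023-11-08 00:00:00';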
Migrating a Hive table to Iceberg
In this task, from a Cloudera Data Hub cluster, you open Hue, and use Hive or Impala to create a table. In Impala,
you can configure the NUM_THREADS_FOR_TABLE_MIGRATION query option to tune the performance of
the table migration. It sets the maximum number of threads to be used for the migration process, although the
effective parallelism can also be limited by the number of CPUs. If set to zero, the number of available CPUs on the
coordinator node is used as the maximum number of threads. Parallelism occurs at the level of data files within a
partition: one partition is processed at a time, with multiple threads processing the files inside the partition. If each
partition contains only one file, execution is sequential.
Procedure
1. Log into Cloudera, and click Data Hub Clusters.
2. Click Hue.
3. Select a database.
4. Enter a query to use a database.
For example:
USE mydb;
5. Enter ALTER TABLE queries to convert the Hive table to Iceberg and to upgrade the format version (the table name tbl is illustrative):
ALTER TABLE tbl SET TBLPROPERTIES ('storage_handler' = 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler');
ALTER TABLE tbl SET TBLPROPERTIES ('format-version' = '2');
The first ALTER command converts the Hive table to an Iceberg V1 table; the second upgrades it to V2.
6. Click to run the queries.
An Iceberg V2 table is created, replacing the Hive table.
Selecting an Iceberg table
Procedure
1. Use a database.
For example:
USE mydatabase;
2. View the table history and run time travel queries, using a timestamp or a snapshot ID from the history output:
SELECT * FROM t
FOR SYSTEM_TIME AS OF <TIMESTAMP>;
SELECT * FROM t
FOR SYSTEM_VERSION AS OF <SNAPSHOT_ID>;
Updating an Iceberg partition
Procedure
1. Create a table partitioned by year.
Hive:
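For example (the table and column names are illustrative):

CREATE EXTERNAL TABLE ice_year1 (id int, event string)
PARTITIONED BY (year int)
STORED BY ICEBERG;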
Impala:
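For example (the table and column names are illustrative):

CREATE TABLE ice_year2 (id INT, event STRING)
PARTITIONED BY (year INT)
STORED AS ICEBERG;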
Test driving Iceberg from Impala
Procedure
1. In Impala, use a database.
2. Create an Impala table to hold mock data for this task.
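For example (the table name and columns are illustrative):

CREATE TABLE customer_demo (first_name STRING, last_name STRING, state STRING, year_month STRING)
STORED AS PARQUET;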
Hive demo data
6. Insert into the customer_demo_iceberg table the results of selecting all data from the customer_demo2 table.
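For example:

INSERT INTO customer_demo_iceberg SELECT * FROM customer_demo2;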
7. Create an Iceberg table partitioned by the year_month column and based on the customer_demo_iceberg table.
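For example, a sketch assuming the columns of customer_demo_iceberg (the partitioned table name and column list are illustrative):

CREATE TABLE customer_demo_iceberg_part (first_name STRING, last_name STRING, state STRING, year_month STRING)
PARTITIONED BY SPEC (year_month)
STORED AS ICEBERG;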
9. Insert the results of reading the customer_demo_iceberg table into the partitioned table.
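For example (customer_demo_iceberg_part is an illustrative name for the partitioned table):

INSERT INTO customer_demo_iceberg_part SELECT * FROM customer_demo_iceberg;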
10. Run time travel queries on the Iceberg tables, using the history output to get the snapshot id, and substitute the id
in the second SELECT query.
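For example (<snapshot_id> is a placeholder you substitute from the history output):

DESCRIBE HISTORY customer_demo_iceberg;
SELECT count(*) FROM customer_demo_iceberg FOR SYSTEM_TIME AS OF now();
SELECT count(*) FROM customer_demo_iceberg FOR SYSTEM_VERSION AS OF <snapshot_id>;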
arrdelay int,
depdelay int,
origin string,
dest string,
distance int,
taxiin int,
taxiout int,
cancelled int,
cancellationcode string,
diverted string,
carrierdelay int,
weatherdelay int,
nasdelay int,
securitydelay int,
lateaircraftdelay int
)
partitioned by (year int)
stored as orc;
ALTER TABLE planes ADD CONSTRAINT planes_pk PRIMARY KEY (tailnum) DISABLE NOVALIDATE;
ALTER TABLE flights ADD CONSTRAINT planes_fk FOREIGN KEY (tailnum) REFERENCES planes(tailnum) DISABLE NOVALIDATE RELY;
ALTER TABLE airlines ADD CONSTRAINT airlines_pk PRIMARY KEY (code) DISABLE NOVALIDATE;
ALTER TABLE flights ADD CONSTRAINT airlines_fk FOREIGN KEY (uniquecarrier) REFERENCES airlines(code) DISABLE NOVALIDATE RELY;
ALTER TABLE airports ADD CONSTRAINT airports_pk PRIMARY KEY (iata) DISABLE NOVALIDATE;
ALTER TABLE flights ADD CONSTRAINT airports_orig_fk FOREIGN KEY (origin) REFERENCES airports(iata) DISABLE NOVALIDATE RELY;
ALTER TABLE flights ADD CONSTRAINT airports_dest_fk FOREIGN KEY (dest) REFERENCES airports(iata) DISABLE NOVALIDATE RELY;
Test driving Iceberg from Hive
Procedure
1. Connect to Hive running in a Cloudera Data Hub cluster.
2. Run the queries in the previous topic, "Hive demo data" to set up the following databases: airline_ontime_iceberg,
airline_ontime_orc, airline_ontime_parquet.
3. Use the airline_ontime_iceberg database.
4. Take a look at the tables in the airline_ontime_iceberg database.
USE airline_ontime_iceberg;
SHOW TABLES;
Flights is the fact table. It has 100M rows and three dimensions: airlines, airports, and planes. It records more than
10 years of flights in the US, and includes the following details:
• origin
• destination
• delay
• air time
5. Query the demo data from Hive.
For example, find the flights that departed each year, by IATA code, airport, city, state, and country. Find the
average departure delay.
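A sketch of such a query (the join on the origin airport, the grouping, and the ordering are illustrative):

SELECT f.month, a.iata, a.airport, a.city, a.state, a.country, AVG(f.depdelay)
FROM flights f JOIN airports a ON f.origin = a.iata
GROUP BY f.month, a.iata, a.airport, a.city, a.state, a.country
ORDER BY AVG(f.depdelay) DESC
LIMIT 10;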
+----------+---------+------------------------------------+---------------+----------+------------+
| f.month  | a.iata  | a.airport                          | a.city        | a.state  | a.country  |
+----------+---------+------------------------------------+---------------+----------+------------+
| 12       | ORD     | Chicago O'Hare International       | Chicago       | NULL     | USA        |
| 6        | EWR     | Newark Intl                        | Newark        | NULL     | USA        |
| 7        | JFK     | John F Kennedy Intl                | New York      | NULL     | USA        |
| 6        | IAD     | Washington Dulles International    | Chantilly     | NULL     | USA        |
| 7        | EWR     | Newark Intl                        | Newark        | NULL     | USA        |
| 6        | PHL     | Philadelphia Intl                  | Philadelphia  | NULL     | USA        |
| 1        | ORD     | Chicago O'Hare International       | Chicago       | NULL     | USA        |
| 6        | ORD     | Chicago O'Hare International       | Chicago       | NULL     | USA        |
| 7        | ATL     | William B Hartsfield-Atlanta Intl  | Atlanta       | NULL     | USA        |
| 12       | MDW     | Chicago Midway                     | Chicago       | NULL     | USA        |
+----------+---------+------------------------------------+---------------+----------+------------+
10 rows selected (103.812 seconds)
Iceberg data types
Table 2:
timestamptz: maps to TIMESTAMP WITH LOCAL TIME ZONE. Use TIMESTAMP WITH LOCAL TIMEZONE for
handling these in queries. Reading timestamptz returns TIMESTAMP values; writing is not supported.
Iceberg table properties
With Spark 3.4, Spark SQL supports a timestamp with local timezone (TIMESTAMP_LTZ) type and a timestamp
without timezone (TIMESTAMP_NTZ) type. TIMESTAMP defaults to TIMESTAMP_LTZ, but this can be
configured by setting spark.sql.timestampType. When creating an Iceberg table using Spark SQL, if
spark.sql.timestampType is set to TIMESTAMP_LTZ, TIMESTAMP is mapped to Iceberg's timestamptz type. If
spark.sql.timestampType is set to TIMESTAMP_NTZ, then TIMESTAMP is mapped to Iceberg's timestamp type.
Impala is unable to write to Iceberg tables with timestamptz columns. For interoperability, when creating Iceberg
tables from Spark, you can use the Spark configuration spark.sql.timestampType=TIMESTAMP_NTZ.
For consistent results across query engines, all the engines must be running in UTC.
• write.parquet.compression-codec
Valid values: GZIP, LZ4, NONE, SNAPPY (default value), ZSTD
• write.parquet.compression-level
Valid values: 1 - 22. Default: 3
• write.parquet.row-group-size-bytes
Valid values: 8388608 (8 MB) - 2146435072 (2047 MB). Overridden by PARQUET_FILE_SIZE.
• write.parquet.page-size-bytes
Valid values: 65536 (64 KB) - 1073741824 (1 GB)
• write.parquet.dict-size-bytes
Valid values: 65536 (64 KB) - 1073741824 (1 GB)