How To Work With Iceberg Format in AWS-Glue
As the official guide can sometimes be overwhelming, this post has been designed to cover all the main operations that one would want to…
By Cesar Cordoba
%idle_timeout 60
%glue_version 3.0
%worker_type G.1X
%number_of_workers 2
%%configure
{
    "--conf": "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    "--datalake-formats": "iceberg"
}
Alternatively, you can pass the full catalog configuration directly through --conf instead of building it in the Spark session later:

{
    "--conf": "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions --conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog --conf spark.sql.catalog.glue_catalog.warehouse=s3://<your-warehouse-dir>.....",
    "--datalake-formats": "iceberg"
}
Create table
Since in this post we will be using SparkSession.builder, let's create a function to initialize Spark. Keep in mind that this is necessary because we are using the simplified version of the --conf with only the sql.extensions setting.
def create_spark_iceberg(catalog_nm: str = "glue_catalog"):
    """
    Function to initialize a session with iceberg by default
    :param catalog_nm:
    :return spark:
    """
    from pyspark.sql import SparkSession

    # You can set this as a variable if required
    warehouse_path = "s3://<bucket>/"

    spark = SparkSession.builder \
        .config(f"spark.sql.catalog.{catalog_nm}",
                "org.apache.iceberg.spark.SparkCatalog") \
        .config(f"spark.sql.catalog.{catalog_nm}.warehouse",
                warehouse_path) \
        .config(f"spark.sql.catalog.{catalog_nm}.catalog-impl",
                "org.apache.iceberg.aws.glue.GlueCatalog") \
        .config(f"spark.sql.catalog.{catalog_nm}.io-impl",
                "org.apache.iceberg.aws.s3.S3FileIO") \
        .getOrCreate()
    return spark
catalog_nm = "catalog_for_medium"
spark =
create_spark_iceberg(catalog_nm)
sc = spark.sparkContext
glueContext = GlueContext(sc)
job = Job(glueContext)
# The table should follow the format:
# <your_catalog_name>.<your_database_name>.<your_table_name>
table = f"{catalog_nm}.database_nm.table_nm"
table_path = "s3://<target-bucket-nm>/table_nm"

spark.sql(
    f"""
    CREATE TABLE IF NOT EXISTS {table} (
        col_1 timestamp,
        col_2 integer,
        col_3 string,
        col_4 string,
        col_5 date
    )
    LOCATION '{table_path}'
    PARTITIONED BY (col_5)
    TBLPROPERTIES (
        'table_type' = 'ICEBERG',
        'write_target_data_file_size_bytes' = '536870912'
    )
    """
)
Read tables
Reading is straightforward.
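A minimal sketch of both options, reusing the hypothetical catalog, database and table names from the Create table step:

# Hypothetical names from the "Create table" section
table = "catalog_for_medium.database_nm.table_nm"

# SQL
df_sql = spark.sql(f"SELECT * FROM {table}")

# Dataframe Api
df_api = spark.read.format("iceberg").load(table)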
Write tables
Iceberg was designed to be used mainly with SQL, but the DataFrame API works too. Therefore, when appending data we have several options:
## SQL
df = "some custom logic"
df.createOrReplaceTempView("new_data")  # Source dataframe
table = "catalog_nm.database.table"     # Target table
spark.sql(f"INSERT INTO {table} SELECT * from new_data")

## Dataframe Api v1
# Not recommended if we plan to use the table we are writing to,
# as it will not automatically refresh the tables used by queries
(
    df
    .write.format("iceberg")
    .mode("append")
    .save(table)
)

## Dataframe Api v2
# It's recommended to use this option as it refreshes the table
df.writeTo(table).append()
Merge
To update a table with another table or dataframe, the syntax looks like this:

MERGE INTO prod.db.target t   -- a target table
USING (SELECT ...) s          -- the source updates
ON t.id = s.id                -- condition to find updates for target rows
WHEN ...                      -- updates
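As a concrete sketch, this would upsert the new_data view from the Write tables section into our target table; the join column (col_2) is illustrative, not from the original post:

# Sketch: upsert rows from the "new_data" view into the target table.
# The join column used here (col_2) is only an illustrative assumption.
spark.sql(f"""
    MERGE INTO {table} t
    USING new_data s
    ON t.col_2 = s.col_2
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")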
Time Travel
In Iceberg we are able to query different versions of a table. Normally you won't know the exact id of the version you want to query. Luckily, we have metadata tables like history and snapshots! (Pro tip: you can tag versions so they have a specific name instead of a number.)
table_nm = "catalog.database.table"
# Sql
spark.sql(f"SELECT * FROM
{table_nm}.snapshots;")
# Dataframe
spark.read.format("iceberg").load(f
"{table_nm}.snapshots")
table_nm = "catalog.database.table"
snapshot_by_id_df = (
spark
.read.format("iceberg")
.option("snapshot-id", 10963874102873)
.load(table_nm)
)
We can also query it as it was at a certain point in time.
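A minimal sketch using the Iceberg as-of-timestamp read option (milliseconds since the Unix epoch); the timestamp value here is only illustrative:

snapshot_by_ts_df = (
    spark
    .read.format("iceberg")
    .option("as-of-timestamp", "1688083200000")  # illustrative epoch millis
    .load(table_nm)
)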
Procedures
Procedures are special operations in Iceberg that always start with CALL. They accept either named arguments or positional arguments, but not both at the same time.

-- Named (recommended)
CALL catalog.system.procedure_name(arg_name_2 => arg_2, arg_name_1 => arg_1)

-- Positional
CALL catalog.system.procedure_name(arg_1, arg_2, ... arg_n)
CALL catalog_nm.system.rollback_to_snapshot('database.table', 5781947118336215154)
Important! Notice that for this operation we specify the catalog_nm first and then database.table, instead of what we have been doing so far (catalog.database.table).
CALL catalog_nm.system.rollback_to_timestamp('database.table', TIMESTAMP '2023-06-30 00:00:00.000')
Cleaning — expire_snapshots
Removing older versions that you might no longer need to query is as easy as follows:
table_nm = "database.table"
catalog_nm = "glue_catalog"
spark.sql(f"""
CALL {catalog_nm}.system.expire_snapshots(
table => '{table_nm}',
older_than => TIMESTAMP
'2023-06-30 00:00:00.000',
retain_last => 15)
"""
).show()
The expire_snapshots procedure removes older snapshots and their associated files. It will never remove files that are still used by a non-expired snapshot.
Cleaning — remove_orphan_files
Orphan files are files that are not referenced by any manifest. To clean these files up, we run:
table_nm = "database.table"
catalog_nm = "glue_catalog"
spark.sql(f"""
CALL {catalog_nm}.system.remove_orphan_files(
table => '{table_nm}',
dry_run=> false)
"""
).show()
Optimization — rewrite_data_files
This procedure will combine small files into bigger files, the size of which is defined in the table's properties.
CALL catalog_name.system.rewrite_data_files(
    table => 'db.sample',
    strategy => 'sort',
    sort_order => 'zorder(c1,c2)',
    where => 'id = 3 and name = "foo"',
    options => map('target-file-size-bytes', 536870912)
)
Optimization — rewrite_manifests
To finish, we have the last procedure, rewrite_manifests. This one rewrites the manifest files to make it more efficient to query the table.

CALL catalog_nm.system.rewrite_manifests('database.table')
References
If you want to learn more about Iceberg or table formats, please consider reading these too: