
How to work with Iceberg format in AWS-Glue

By Cesar Cordoba, Data Engineer | Cloud
Aug 07, 2024 · 8 min. read

As the official guide can be overwhelming at times, this post covers all the main operations you might want to perform in Iceberg, such as creating, reading, writing, updating or altering a table. It also explains time travel and optimization in a simpler way.

We will go through several PySpark and SQL examples, so this "guide" is extremely practical.

I hope you find it useful!

Side note: if you want to learn more about table formats, and Iceberg in particular, check this post. Also, if you want to discuss anything, you can connect with me at www.linkedin.com/in/cesar-antonio-restrepo-cordoba.

Setting up Iceberg in Spark to use the Glue catalog

Important! If you have worked with EMR or Glue and S3, you might be used to working with paths like "s3a://". With Iceberg we can forget about that (and you actually shouldn't be using it anymore). When we initialize the cluster we will use the new S3FileIO, which works with "s3://". Don't believe me? Check AWS (apache.org).

To work with Iceberg in Glue we need two things:

1. Set the --datalake-formats parameter to iceberg.
2. Initialize the Iceberg extensions:
2. Initialize the iceberg extension:
spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
--conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog
--conf spark.sql.catalog.glue_catalog.warehouse=s3://<your-warehouse-dir>/
--conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog
--conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO

Let's understand these settings first: spark.sql.extensions enables the Iceberg SQL extensions, spark.sql.catalog.glue_catalog registers a Spark catalog named glue_catalog, catalog-impl backs that catalog with the AWS Glue Data Catalog, warehouse points to the S3 location where table data will live, and io-impl tells Iceberg to use S3FileIO (hence the "s3://" paths).

If you are working in a notebook, it is as easy as running the code below, always in the first cell. Notice that we are only enabling the Iceberg extensions in the --conf option:

%idle_timeout 60
%glue_version 3.0
%worker_type G.1X
%number_of_workers 2
%%configure
{
  "--conf": "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
  "--datalake-formats": "iceberg"
}

And in a second cell, we will need to run this:

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.sql import SparkSession

catalog_nm = "glue_catalog"
s3_bucket = "s3://your-bucket-nm/"

spark = SparkSession.builder \
    .config(f"spark.sql.catalog.{catalog_nm}", "org.apache.iceberg.spark.SparkCatalog") \
    .config(f"spark.sql.catalog.{catalog_nm}.warehouse", s3_bucket) \
    .config(f"spark.sql.catalog.{catalog_nm}.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog") \
    .config(f"spark.sql.catalog.{catalog_nm}.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") \
    .getOrCreate()

sc = spark.sparkContext
glueContext = GlueContext(sc)
job = Job(glueContext)

Alternatively, you could have put everything in the --conf option, although it might be less readable.

{
  "--conf": "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions --conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog --conf spark.sql.catalog.glue_catalog.warehouse=s3://<your-warehouse-dir>.....",
  "--datalake-formats": "iceberg"
}

In a similar way, if you are working directly with jobs, everything can be set in the Job parameters: inside the job, go to Job details and scroll down until you find them.

Create table
Since in this post we will be using SparkSession.builder, let's create a function to initialize Spark. Keep in mind that this is necessary because we are using the simplified version of --conf with only the sql.extensions setting.

def create_spark_iceberg(catalog_nm: str = "glue_catalog"):
    """
    Function to initialize a session with iceberg by default
    :param catalog_nm:
    :return spark:
    """
    from pyspark.sql import SparkSession

    # You can set this as a variable if required
    warehouse_path = "s3://<bucket>/"

    spark = SparkSession.builder \
        .config(f"spark.sql.catalog.{catalog_nm}", "org.apache.iceberg.spark.SparkCatalog") \
        .config(f"spark.sql.catalog.{catalog_nm}.warehouse", warehouse_path) \
        .config(f"spark.sql.catalog.{catalog_nm}.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog") \
        .config(f"spark.sql.catalog.{catalog_nm}.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") \
        .getOrCreate()
    return spark

Inside the Glue job or notebook, after setting the configuration (--conf), just use this:

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

catalog_nm = "catalog_for_medium"

spark = create_spark_iceberg(catalog_nm)
sc = spark.sparkContext
glueContext = GlueContext(sc)
job = Job(glueContext)

# The table should follow the format:
# <your_catalog_name>.<your_database_name>.<your_table_name>
table = f"{catalog_nm}.database_nm.table_nm"
table_path = "s3://<target-bucket-nm>/table_nm"

spark.sql(
    f"""
    CREATE TABLE IF NOT EXISTS {table} (
        col_1 timestamp,
        col_2 integer,
        col_3 string,
        col_4 string,
        col_5 date
    )
    LOCATION '{table_path}'
    PARTITIONED BY (col_5)
    TBLPROPERTIES (
        'table_type' = 'ICEBERG',
        'write_target_data_file_size_bytes' = '536870912'
    )
    """
)

Another option to create the table is with the DataFrame API:

df = "your own logic"


table = "catalog_nm.database.table"
(
df.writeTo(table)
.using("iceberg")
.tableProperty("location",
"s3://path/to/location")
.tableProperty("write.format.default",
"parquet")
.partitionedBy("col_5")
.createOrReplace()
)

Important! The table properties you define here can have an impact when using procedures in Iceberg, specifically rewrite_data_files.
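For instance, here is a minimal sketch (not from the original post; the table name is assumed for illustration) of adjusting the target file size after creation. write.target-file-size-bytes is the standard Iceberg property that rewrite_data_files falls back to when you don't pass target-file-size-bytes explicitly:

ALTER TABLE glue_catalog.database_nm.table_nm SET TBLPROPERTIES (
    'write.target-file-size-bytes' = '268435456'  -- 256 MB target data file size
)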

Read tables
Reading is straightforward:

# Here we assume that the table has already been created
# and it has a path associated to it
# To read we have
table = "catalog_nm.database.table"
# SQL
df = spark.sql(f"SELECT * FROM {table}")
# DataFrame
df = spark.read.format("iceberg").load(table)

Write tables
Iceberg was designed to be used mainly with SQL, but the DataFrame API works too. Therefore, when appending data we have several options:

## SQL
df = "some custom logic"
df.createOrReplaceTempView("new_data")   # Source dataframe
table = "catalog_nm.database.table"      # Target table
spark.sql(f"INSERT INTO {table} SELECT * FROM new_data")

## DataFrame API v1
# Not recommended if we plan to use the table we are writing to,
# as it will not automatically refresh the tables used by queries
(
    df
    .write.format("iceberg")
    .mode("append")
    .save(table)
)

## DataFrame API v2
# It's recommended to use this option as it refreshes the table
df.writeTo(table).append()

Merge
To update a table with another table or dataframe, the syntax looks like this:

MERGE INTO prod.db.target t   -- a target table
USING (SELECT ...) s          -- the source updates
ON t.id = s.id                -- condition to find updates for target rows
WHEN ...                      -- updates
Some conditions you can use are:

WHEN MATCHED AND s.op = 'delete' THEN DELETE
WHEN MATCHED AND s.op = 'increment' THEN UPDATE SET t.count = t.count + 1
WHEN NOT MATCHED THEN INSERT *

And as an example, in this query we only insert new rows based on a condition:

# Define your dataframe
df = "custom logic, some joins and a filter"
df.createOrReplaceTempView("new_data")

## Update
target_table = "target_catalog.target_database.target_table"
condition = "old.col_1 = new.col_1"
spark.sql(f"""
    MERGE INTO {target_table} old
    USING new_data new
    ON {condition}
    WHEN NOT MATCHED THEN INSERT *
""")

Find more examples in Writes (apache.org).

Alter & delete


Here we will use some examples directly from
the oficial iceberg documentation as they are
easy and great to illustrate.

We can delete records in Iceberg based on some conditions:

DELETE FROM catalog.db.table
WHERE ts >= '2020-05-01 00:00:00' AND ts < '2020-06-01 00:00:00'

And we can modify tables too! For instance, we can change their properties, rename columns or simply add new ones. There are too many possibilities to list here, so I recommend you look at the documentation on this topic.

ALTER TABLE catalog.db.table SET TBLPROPERTIES (
    'read.split.target-size' = '268435456'
)

ALTER TABLE catalog.db.table
ADD COLUMNS (
    new_column string comment 'new_column docs'
)

Time Travel
In Iceberg we are able to query different versions of a table. Normally you won't know the exact id of the version you want to query. Lucky for us, we have a property called history! (Pro tip: you can tag versions so they have a specific name instead of a number.)
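As a sketch of that pro tip (not covered in the rest of this post; the tag name is made up and the snapshot id is reused from the rollback example later), recent Iceberg versions let you create a tag through Spark SQL DDL, provided your Iceberg/Glue version supports branches and tags:

ALTER TABLE glue_catalog.database_nm.table_nm
CREATE TAG before_backfill AS OF VERSION 5781947118336215154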

Here is the Official Doc on time travel if you want to learn more.

Query the snapshots history:

To check how a table has evolved over time we have:

# You can use SQL or PySpark, but keep in mind that to query
# metadata in Iceberg, the path is catalog.database.table.<property>
# Some of them are:
#   prod.db.table.history
#   prod.db.table.metadata_log_entries
#   prod.db.table.snapshots
#   prod.db.table.manifests
#   ...

table_nm = "catalog.database.table"
# SQL
spark.sql(f"SELECT * FROM {table_nm}.snapshots;")
# DataFrame
spark.read.format("iceberg").load(f"{table_nm}.snapshots")

Output of .show() when querying the snapshots of a table. Extracted from Queries (apache.org).

Query a specific snapshot:

If we know the snapshot_id, we can use SQL or PySpark to query that version of the table as if it were the current one.

-- get the table's partitions with snapshot id 10963874102873L
SELECT * FROM catalog.db.table.partitions VERSION AS OF 10963874102873;

table_nm = "catalog.database.table"
snapshot_by_id_df = (
    spark
    .read.format("iceberg")
    .option("snapshot-id", 10963874102873)
    .load(table_nm)
)
We can also query the table as it was at a certain point in time:

-- time travel to October 26, 1986 at 01:21:00
SELECT * FROM catalog.db.table TIMESTAMP AS OF '1986-10-26 01:21:00';

# time travel to October 26, 1986 at 01:21:00
# as-of-timestamp selects the current snapshot at a timestamp, in milliseconds
table_nm = "catalog.database.table"
snapshot_by_ts_df = (
    spark
    .read.format("iceberg")
    .option("as-of-timestamp", "499162860000")
    .load(table_nm)
)

Procedures
Procedures are special operations in Iceberg that always start with CALL. They accept either named arguments or positional arguments, but not both at the same time.

-- Named (recommended)
CALL catalog.system.procedure_name(arg_name_2 => arg_2, arg_name_1 => arg_1)
-- Positional
CALL catalog.system.procedure_name(arg_1, arg_2, ... arg_n)

Rollback to previous versions

If you have deleted or inserted something you shouldn't have and need to roll back your table, rest assured, as this is easily done with:

CALL catalog_nm.system.rollback_to_snapshot('database.table', 5781947118336215154)

Important! Notice that for this operation we pass the catalog_nm first and then 'database.table', instead of what we have been doing so far (catalog.database.table).

As you might expect, we can also use a timestamp to roll back:

CALL catalog_nm.system.rollback_to_timestamp('database.table', TIMESTAMP '2021-06-30 00:00:00.000')

Optimize and Cleaning

As Iceberg works with manifest files and parquet files, every time we delete, append or merge something, we create a new snapshot of the table that maps to some unmodified old files and some newly created ones.

This causes two problems: old snapshots (and the files only they reference) keep piling up in S3, and the data ends up scattered across many small files, which makes queries slower.

Cleaning — expire_snapshots
Removing older versions that you might not need to query anymore is as easy as this:

table_nm = "database.table"
catalog_nm = "glue_catalog"
spark.sql(f"""
CALL {catalog_nm}.system.expire_snapshots(
table => '{table_nm}',
older_than => TIMESTAMP
'2023-06-30 00:00:00.000',
retain_last => 15)
"""
).show()
Theexpire_snapshots procedure removes
older snapshots and the associated files to
them. It will never remove files that are being
used in a non expired snapshot.

The arguments are:

Cleaning — remove_orphan_files
Orphan files are files that are not referenced by any manifest. To clean these files we run:

table_nm = "database.table"
catalog_nm = "glue_catalog"
spark.sql(f"""
CALL {catalog_nm}.system.remove_orphan_files(
table => '{table_nm}',
dry_run=> false)
"""
).show()

Regarding the arguments, it is just worth mentioning that if we set dry_run to true, it won't delete the files; it will only list them.
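For example, a quick dry-run sketch (same catalog and table placeholders as above, written as plain SQL; nothing is deleted, the matching orphan files are only listed):

CALL glue_catalog.system.remove_orphan_files(
    table => 'database.table',
    dry_run => true)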

Optimization — rewrite_data_files
This procedure combines small files into bigger files, whose target size is defined in the table properties.

Therefore, if you want to override the behavior defined by the table properties, you need to explicitly declare the value inside the procedure arguments.

We have two options:

Use the bin pack strategy:

catalog_nm = "glue_catalog"
table_nm = "database.table"
sort_column = "col_5" # Suppose this column
contains dates
condition = "2023-08-20"
# target_mb. By default, when creating a table,
its properties
# are set to 500mm. Let's change it to 300mb
final_mb = str(300*1024*1024)
spark.sql(f"""
CALL
{catalog_nm}.system.rewrite_data_files(
table => '{table_nm}',
where => '{sort_column} > "
{condition}"',
options => map('target-file-
size-bytes', {final_mb})
)
"""
)

Use the Zorder strategy:

CALL catalog_name.system.rewrite_data_files(
    table => 'db.sample',
    strategy => 'sort',
    sort_order => 'zorder(c1,c2)',
    where => 'id = 3 and name = "foo"',
    options => map('target-file-size-bytes', 536870912)
)

Optimization — rewrite_manifests
To finish, we have the last procedure, which is rewrite_manifests. It rewrites the manifest files to make querying the table more efficient.

This is not mandatory to use, but I recommend running it once in a while.

table_nm = "database.table"
spark.sql(f"CALL {catalog_nm}.system.rewrite_manifests('{table_nm}')")

For more information regarding procedures, have a look at Procedures (apache.org).

BONUS: Do this on Athena

Another excellent way to work with Iceberg, without the need to start a cluster, is to use Athena (see Creating Iceberg tables — Amazon Athena for how to create an Iceberg table from an Athena query).

You can create tables similarly to how you do with Glue. Notice, however, that we are not using the catalog name here! In Athena you don't have to define it, as it uses the Glue catalog implicitly.
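As a rough sketch (reusing the column names from the Glue example earlier; the bucket path is a placeholder), an Athena CREATE TABLE for an Iceberg table looks like this:

CREATE TABLE database_nm.table_nm (
    col_1 timestamp,
    col_2 int,
    col_3 string,
    col_4 string,
    col_5 date)
PARTITIONED BY (col_5)
LOCATION 's3://<target-bucket-nm>/table_nm'
TBLPROPERTIES ('table_type' = 'ICEBERG')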

Another particularity is that if you don't specify the database when creating (or querying) the table, Athena will use the one you are launching the query on.
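For example (hypothetical table from the sketch above):

-- Fully qualified: works no matter which database is selected in the query editor
SELECT * FROM database_nm.table_nm LIMIT 10;
-- Unqualified: resolves against the database you are launching the query on
SELECT * FROM table_nm LIMIT 10;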

All the SQL code we have seen can be used in Athena by removing the catalog name from the equation. Similarly, the procedures we used in Glue are available in Athena, but with different names.

You can optimize, i.e., compact the files, with:

OPTIMIZE iceberg_table REWRITE DATA USING BIN_PACK
WHERE category = 'c1'

And VACUUM, which combines the two operations of expiring snapshots and removing orphan files:

VACUUM target_table

Both operations, VACUUM and OPTIMIZE, will use their default values together with the table properties you defined when creating the tables.
