
How to work with Iceberg format in AWS-Glue

By Cesar Cordoba, Data Engineer | Cloud
Aug 07, 2024 · 8 min. read

As the official guide can be overwhelming at times, this post covers all the main operations you might want to perform in Iceberg, such as creating, reading, writing, updating or altering a table. It also explains time travel and optimization in a simpler way.

We will go through several PySpark and SQL examples, so this "guide" is extremely practical.

I hope you find it useful!

Side note: if you want to learn more about table formats, and Iceberg in particular, check this post. Also, if you want to discuss anything, you can connect with me at www.linkedin.com/in/cesar-antonio-restrepo-cordoba.

Setting up Iceberg in Spark to use the Glue catalog

Important! If you have worked with EMR or Glue and S3, you might be used to working with paths like "s3a://". With Iceberg we can forget about that (and you actually shouldn't be using it anymore). When we initialize the cluster we will use the new S3FileIO, which works with "s3://". Don't believe me? Check AWS (apache.org).

To work with Iceberg in Glue we need two things:

1. Set the --datalake-formats parameter to iceberg.
2. Initialize the Iceberg extensions:
2. Initialize the iceberg extension:
spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
--conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog
--conf spark.sql.catalog.glue_catalog.warehouse=s3://<your-warehouse-dir>/
--conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog
--conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO

Let's understand these settings first: spark.sql.extensions enables the Iceberg SQL extensions, spark.sql.catalog.glue_catalog registers a Spark catalog named glue_catalog, catalog-impl backs that catalog with the AWS Glue Data Catalog, warehouse points to the S3 location where table data will live, and io-impl tells Iceberg to use S3FileIO (hence the "s3://" paths).

If you are working in a notebook, it is as easy as running the code below, always in the first cell. Notice that we are only enabling the Iceberg extensions in the --conf option:

%idle_timeout 60
%glue_version 3.0
%worker_type G.1X
%number_of_workers 2
%%configure
{
  "--conf": "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
  "--datalake-formats": "iceberg"
}

And in a second cell, we will need to run this:

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.sql import SparkSession

catalog_nm = "glue_catalog"
s3_bucket = "s3://your-bucket-nm/"

spark = SparkSession.builder \
    .config(f"spark.sql.catalog.{catalog_nm}", "org.apache.iceberg.spark.SparkCatalog") \
    .config(f"spark.sql.catalog.{catalog_nm}.warehouse", s3_bucket) \
    .config(f"spark.sql.catalog.{catalog_nm}.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog") \
    .config(f"spark.sql.catalog.{catalog_nm}.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") \
    .getOrCreate()

sc = spark.sparkContext
glueContext = GlueContext(sc)
job = Job(glueContext)

Alternatively, you could have put everything in the --conf option, although it might be less readable.

{
  "--conf": "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions --conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog --conf spark.sql.catalog.glue_catalog.warehouse=s3://<your-warehouse-dir>.....",
  "--datalake-formats": "iceberg"
}

In a similar way, if you are working directly with jobs, everything can be set in the Job parameters: inside the job, go to Job details and scroll down until you find them.

Create table
Since in this post we will be using SparkSession.builder, let's create a function to initialize Spark. Keep in mind that this is necessary because we are using the simplified version of --conf with only the sql.extensions setting.

def create_spark_iceberg(catalog_nm: str = "glue_catalog"):
    """
    Function to initialize a session with iceberg by default
    :param catalog_nm:
    :return spark:
    """
    from pyspark.sql import SparkSession

    # You can set this as a variable if required
    warehouse_path = "s3://<bucket>/"

    spark = SparkSession.builder \
        .config(f"spark.sql.catalog.{catalog_nm}", "org.apache.iceberg.spark.SparkCatalog") \
        .config(f"spark.sql.catalog.{catalog_nm}.warehouse", warehouse_path) \
        .config(f"spark.sql.catalog.{catalog_nm}.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog") \
        .config(f"spark.sql.catalog.{catalog_nm}.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") \
        .getOrCreate()
    return spark

Inside the Glue job or notebook, after setting the configuration (--conf), just use this:

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

catalog_nm = "catalog_for_medium"

spark = create_spark_iceberg(catalog_nm)
sc = spark.sparkContext
glueContext = GlueContext(sc)
job = Job(glueContext)

# The table should follow the format:
# <your_catalog_name>.<your_database_name>.<your_table_name>
table = f"{catalog_nm}.database_nm.table_nm"
table_path = "s3://<target-bucket-nm>/table_nm"

spark.sql(
    f"""
    CREATE TABLE IF NOT EXISTS {table} (
        col_1 timestamp,
        col_2 integer,
        col_3 string,
        col_4 string,
        col_5 date
    )
    LOCATION '{table_path}'
    PARTITIONED BY (col_5)
    TBLPROPERTIES (
        'table_type' = 'ICEBERG',
        'write_target_data_file_size_bytes' = '536870912'
    )
    """
)

Another option to create the table is with the DataFrame API:

df = "your own logic"


table = "catalog_nm.database.table"
(
df.writeTo(table)
.using("iceberg")
.tableProperty("location",
"s3://path/to/location")
.tableProperty("write.format.default",
"parquet")
.partitionedBy("col_5")
.createOrReplace()
)

Important! The table properties you define here can have an impact when using procedures in Iceberg, specifically rewrite_data_files.
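For instance, here is a minimal sketch (not from the original post; the table name is assumed for illustration) of adjusting the target file size after creation. write.target-file-size-bytes is the standard Iceberg property that rewrite_data_files falls back to when you don't pass target-file-size-bytes explicitly:

ALTER TABLE glue_catalog.database_nm.table_nm SET TBLPROPERTIES (
    'write.target-file-size-bytes' = '268435456'  -- 256 MB target data file size
)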

Read tables
Reading is straightforward:

# Here we assume that the table has already been created
# and it has a path associated to it
# To read we have
table = "catalog_nm.database.table"
# SQL
df = spark.sql(f"SELECT * FROM {table}")
# DataFrame
df = spark.read.format("iceberg").load(table)

Write tables
Iceberg was designed to be used mainly with SQL, but the DataFrame API works too. Therefore, when appending data we have several options:

## SQL
df = "some custom logic"
df.createOrReplaceTempView("new_data")   # Source dataframe
table = "catalog_nm.database.table"      # Target table
spark.sql(f"INSERT INTO {table} SELECT * FROM new_data")

## DataFrame API v1
# Not recommended if we plan to use the table we are writing to,
# as it will not automatically refresh the tables used by queries
(
    df
    .write.format("iceberg")
    .mode("append")
    .save(table)
)

## DataFrame API v2
# It's recommended to use this option as it refreshes the table
df.writeTo(table).append()

Merge
To update a table with another table or dataframe, the syntax looks like this:

MERGE INTO prod.db.target t   -- a target table
USING (SELECT ...) s          -- the source updates
ON t.id = s.id                -- condition to find updates for target rows
WHEN ...                      -- updates
Some conditions you can use are:

WHEN MATCHED AND s.op = 'delete' THEN DELETE
WHEN MATCHED AND s.op = 'increment' THEN UPDATE SET t.count = t.count + 1
WHEN NOT MATCHED THEN INSERT *

And as an example, in this query we only insert new rows based on a condition:

# Define your dataframe
df = "custom logic, some joins and a filter"
df.createOrReplaceTempView("new_data")

## Update
target_table = "target_catalog.target_database.target_table"
condition = "old.col_1 = new.col_1"
spark.sql(f"""
    MERGE INTO {target_table} old
    USING new_data new
    ON {condition}
    WHEN NOT MATCHED THEN INSERT *
""")

Find more examples in Writes (apache.org).

Alter & delete


Here we will use some examples directly from
the oficial iceberg documentation as they are
easy and great to illustrate.

We can delete records in Iceberg based on some conditions:

DELETE FROM catalog.db.table
WHERE ts >= '2020-05-01 00:00:00' AND ts < '2020-06-01 00:00:00'

And we can modify tables too! For instance, we can change their properties, rename columns or simply add new ones. There are too many possibilities to list here, so I recommend you look at the documentation on this topic.

ALTER TABLE catalog.db.table SET TBLPROPERTIES (
    'read.split.target-size' = '268435456'
)

ALTER TABLE catalog.db.table
ADD COLUMNS (
    new_column string comment 'new_column docs'
)

Time Travel
In Iceberg we are able to query different versions of a table. Normally you won't know the exact id of the version you want to query. Lucky for us, we have a property called history! (Pro tip: you can tag versions so they have a specific name instead of a number.)
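As a sketch of that pro tip (not covered in the rest of this post; the tag name is made up and the snapshot id is reused from the rollback example later), recent Iceberg versions let you create a tag through Spark SQL DDL, provided your Iceberg/Glue version supports branches and tags:

ALTER TABLE glue_catalog.database_nm.table_nm
CREATE TAG before_backfill AS OF VERSION 5781947118336215154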

Here is the Official Doc on time travel if you want to learn more.

Query the snapshots history:

To check how a table has evolved over time we have:

# You can use SQL or PySpark, but keep in mind that to query
# metadata in Iceberg, the path is catalog.database.table.<property>
# Some of them are:
#   prod.db.table.history
#   prod.db.table.metadata_log_entries
#   prod.db.table.snapshots
#   prod.db.table.manifests
#   ...

table_nm = "catalog.database.table"
# SQL
spark.sql(f"SELECT * FROM {table_nm}.snapshots;")
# DataFrame
spark.read.format("iceberg").load(f"{table_nm}.snapshots")

Output of .show() when querying the snapshots of a table. Extracted from Queries (apache.org).

Query a specific snapshot:

If we know the snapshot_id, we can use SQL or PySpark to query that version of the table as if it were the current one.

-- get the table's partitions with snapshot id 10963874102873L
SELECT * FROM catalog.db.table.partitions VERSION AS OF 10963874102873;

table_nm = "catalog.database.table"
snapshot_by_id_df = (
    spark
    .read.format("iceberg")
    .option("snapshot-id", 10963874102873)
    .load(table_nm)
)
We can also query the table as it was at a certain point in time:

-- time travel to October 26, 1986 at 01:21:00
SELECT * FROM catalog.db.table TIMESTAMP AS OF '1986-10-26 01:21:00';

# time travel to October 26, 1986 at 01:21:00
# as-of-timestamp selects the current snapshot at a timestamp, in milliseconds
table_nm = "catalog.database.table"
snapshot_by_ts_df = (
    spark
    .read.format("iceberg")
    .option("as-of-timestamp", "499162860000")
    .load(table_nm)
)

Procedures
Procedures are special operations in Iceberg that always start with CALL. They accept either named arguments or positional arguments, but not both at the same time.

-- Named (recommended)
CALL catalog.system.procedure_name(arg_name_2 => arg_2, arg_name_1 => arg_1)
-- Positional
CALL catalog.system.procedure_name(arg_1, arg_2, ... arg_n)

Rollback to previous versions

If you have deleted or inserted something you shouldn't have and need to roll back your table, rest assured, as this is easily done with:

CALL catalog_nm.system.rollback_to_snapshot('database.table', 5781947118336215154)

Important! Notice that for this operation we pass the catalog_nm first and then 'database.table', instead of what we have been doing so far (catalog.database.table).

As you might expect, we can also use a timestamp to roll back:

CALL catalog_nm.system.rollback_to_timestamp('database.table', TIMESTAMP '2021-06-30 00:00:00.000')

Optimize and Cleaning

As Iceberg works with manifest files and parquet files, every time we delete, append or merge something, we create a new snapshot of the table that maps to some unmodified old files and some newly created ones.

This causes two problems: old snapshots (and the files only they reference) keep piling up in S3, and the data ends up scattered across many small files, which makes queries slower.

Cleaning — expire_snapshots
Removing older versions that you might not need to query anymore is as easy as this:

table_nm = "database.table"
catalog_nm = "glue_catalog"
spark.sql(f"""
CALL {catalog_nm}.system.expire_snapshots(
table => '{table_nm}',
older_than => TIMESTAMP
'2023-06-30 00:00:00.000',
retain_last => 15)
"""
).show()
Theexpire_snapshots procedure removes
older snapshots and the associated files to
them. It will never remove files that are being
used in a non expired snapshot.

The arguments are:

Cleaning — remove_orphan_files
Orphan files are files that are not referenced by any manifest. To clean these files we run:

table_nm = "database.table"
catalog_nm = "glue_catalog"
spark.sql(f"""
CALL {catalog_nm}.system.remove_orphan_files(
table => '{table_nm}',
dry_run=> false)
"""
).show()

Regarding the arguments, it is just worth mentioning that if we set dry_run to true, it won't delete the files; it will only list them.
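For example, a quick dry-run sketch (same catalog and table placeholders as above, written as plain SQL; nothing is deleted, the matching orphan files are only listed):

CALL glue_catalog.system.remove_orphan_files(
    table => 'database.table',
    dry_run => true)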

Optimization — rewrite_data_files
This procedure combines small files into bigger files, whose target size is defined in the table properties.

Therefore, if you want to override the behavior defined by the table properties, you need to explicitly declare the value inside the procedure arguments.

We have two options:

Use the bin pack strategy:

catalog_nm = "glue_catalog"
table_nm = "database.table"
sort_column = "col_5" # Suppose this column
contains dates
condition = "2023-08-20"
# target_mb. By default, when creating a table,
its properties
# are set to 500mm. Let's change it to 300mb
final_mb = str(300*1024*1024)
spark.sql(f"""
CALL
{catalog_nm}.system.rewrite_data_files(
table => '{table_nm}',
where => '{sort_column} > "
{condition}"',
options => map('target-file-
size-bytes', {final_mb})
)
"""
)

Use the Zorder strategy:

CALL catalog_name.system.rewrite_data_files(
    table => 'db.sample',
    strategy => 'sort',
    sort_order => 'zorder(c1,c2)',
    where => 'id = 3 and name = "foo"',
    options => map('target-file-size-bytes', 536870912)
)

Optimization — rewrite_manifests
To finish, we have the last procedure, which is rewrite_manifests. It rewrites the manifest files to make querying the table more efficient.

This is not mandatory to use, but I recommend running it once in a while.

table_nm = "database.table"
spark.sql(f"CALL {catalog_nm}.system.rewrite_manifests('{table_nm}')")

For more information regarding procedures, have a look at Procedures (apache.org).

BONUS: Do this on Athena

Another excellent way to work with Iceberg, without the need to start a cluster, is to use Athena (see Creating Iceberg tables — Amazon Athena for how to create an Iceberg table from an Athena query).

You can create tables similarly to how you do with Glue. Notice, however, that we are not using the catalog name here! In Athena you don't have to define it, as it uses the Glue catalog implicitly.
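As a rough sketch (reusing the column names from the Glue example earlier; the bucket path is a placeholder), an Athena CREATE TABLE for an Iceberg table looks like this:

CREATE TABLE database_nm.table_nm (
    col_1 timestamp,
    col_2 int,
    col_3 string,
    col_4 string,
    col_5 date)
PARTITIONED BY (col_5)
LOCATION 's3://<target-bucket-nm>/table_nm'
TBLPROPERTIES ('table_type' = 'ICEBERG')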

Another particularity is that if you don't specify the database when creating (or querying) the table, Athena will use the one you are launching the query on.
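For example (hypothetical table from the sketch above):

-- Fully qualified: works no matter which database is selected in the query editor
SELECT * FROM database_nm.table_nm LIMIT 10;
-- Unqualified: resolves against the database you are launching the query on
SELECT * FROM table_nm LIMIT 10;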

All the SQL code we have seen can be used in Athena by removing the catalog name from the equation. Similarly, the procedures we used in Glue are available in Athena, but with different names.

You can optimize, i.e., compact the files, with:

OPTIMIZE iceberg_table REWRITE DATA USING BIN_PACK
WHERE category = 'c1'

And VACUUM, which combines the two operations of expiring snapshots and removing orphan files:

VACUUM target_table

Both operations, VACUUM and OPTIMIZE, will use their default values together with the table properties you defined when creating the tables.
