Conversation

pvary (Contributor) commented Nov 8, 2024

pvary (Contributor, Author) commented Nov 8, 2024

@aokolnychyi, @RussellSpitzer, @szehon-ho - I have touched the Spark compaction as well to remove duplicate code and move it to the core module. Could you please check the Spark and core changes?

@stevenzwu: Could you please review the PR?

RussellSpitzer (Member) commented:

Could we break this up into smaller PRs? I think just one for the refactoring of classes out of the Spark module would be a good first step?

pvary (Contributor, Author) commented Nov 9, 2024

> Could we break this up into smaller PRs? I think just one for the refactoring of classes out of the Spark module would be a good first step?

Sure thing. I think for these refactors it is important to see the whole picture as well, and it is always a tough call to decide how to make the review easier.

Will create the PR soon.

jbonofre self-requested a review on November 9, 2024 09:38
mxm (Contributor) commented Nov 11, 2024

IMHO just grouping into several commits would suffice, e.g. Core/Spark/Flink. Grouping into multiple PRs might make this harder to test and review.

pvary (Contributor, Author) commented Nov 11, 2024

Created #11513 for the refactoring part. I think it is a good start for the review, but drop me a message if you think we need to split things further.

ygrzjh commented Nov 18, 2024

When using io-impl: org.apache.iceberg.aws.s3.S3FileIO, a Kryo serialization exception occurs for the DataFileRewritePlanner$PlannedGroup class. The exception details are as follows:

```
com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException
Serialization trace:
s3 (org.apache.iceberg.aws.s3.S3FileIO)
io (org.apache.iceberg.SerializableTable)
table (org.apache.iceberg.flink.maintenance.operator.DataFileRewritePlanner$PlannedGroup)
    at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:82)
    at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:495)
    at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:523)
    at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:61)
    at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:495)
    at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:523)
    at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:61)
    at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:495)
    at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:599)
    at org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer.serialize(KryoSerializer.java:356)
......
```

pvary (Contributor, Author) commented Nov 18, 2024

@ygrzjh: Seems like an issue with the S3FileIO serialization. Do you have any more information on which field causes the NullPointerException?
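For reference, a minimal way to check this outside Flink would be something like the sketch below (illustrative only, not code from this PR): initialize an S3FileIO the way a catalog would and push it through the same Kryo path that Flink's KryoSerializer uses, to see which nested field is null at write time.

```java
import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.io.Output;
import java.io.ByteArrayOutputStream;
import java.util.Collections;
import org.apache.iceberg.aws.s3.S3FileIO;

public class S3FileIOKryoCheck {
  public static void main(String[] args) {
    // Hypothetical standalone check: create and initialize an S3FileIO
    // (real runs would pass the catalog/table properties instead of an empty map).
    S3FileIO fileIO = new S3FileIO();
    fileIO.initialize(Collections.emptyMap());

    Kryo kryo = new Kryo();
    kryo.setRegistrationRequired(false);
    try (Output output = new Output(new ByteArrayOutputStream())) {
      // Mirrors what KryoSerializer.serialize(...) does under the hood.
      kryo.writeClassAndObject(output, fileIO);
    }
  }
}
```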

Review thread on DataFileRewritePlanner.PlannedGroup:

```java
  private final RewriteFileGroup group;

  private PlannedGroup(
      SerializableTable table, int groupsPerCommit, long splitSize, RewriteFileGroup group) {
```
Contributor:
Wondering if SerializableTable can be passed to the DataFileRewriteExecutor just once during initialization (instead of per planned group)? How big is the serialized Table object?

pvary (Contributor, Author):
The DataFileRewriteExecutor input is rebalanced, so it is not trivial to make sure that the table is sent to every instance.

The size of the serialized table is mostly dependent on the size of the FileIO.
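For a rough number, one illustrative way to measure this (not part of this PR) is to Java-serialize a SerializableTable copy and count the bytes:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import org.apache.iceberg.SerializableTable;
import org.apache.iceberg.Table;

class SerializedTableSize {
  // Returns the Java-serialized size of a SerializableTable copy of the given table;
  // the result is dominated by the FileIO and the table metadata.
  static int measure(Table table) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
      out.writeObject(SerializableTable.copyOf(table));
    }
    return bytes.size();
  }
}
```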

stevenzwu (Contributor) commented Dec 3, 2024:

Pass SerializableTable to the constructor of DataFileRewriteExecutor so that it becomes part of the class state.

If the planner needs more than a read-only table, maybe we can pass a TableLoader to the planner class and load a table that way?
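Something along these lines; an illustrative sketch of the TableLoader idea only, not the code in this PR:

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.iceberg.Table;
import org.apache.iceberg.flink.TableLoader;

abstract class TableLoadingFunction<I, O> extends ProcessFunction<I, O> {
  private final TableLoader tableLoader; // serializable, shipped with the job graph
  private transient Table table;         // loaded per task, can be refreshed

  TableLoadingFunction(TableLoader tableLoader) {
    this.tableLoader = tableLoader;
  }

  @Override
  public void open(Configuration parameters) throws Exception {
    tableLoader.open();
    this.table = tableLoader.loadTable();
  }

  protected Table table() {
    // Refresh to pick up the current schema/spec instead of shipping SerializableTable per group.
    table.refresh();
    return table;
  }
}
```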

pvary (Contributor, Author):
We need the current schema/spec for the table. If we pass the data in the constructor, we will work from stale table metadata.

stevenzwu (Contributor) commented Dec 4, 2024:

Why don't we let the committer load the Table object via a TableLoader? A regular Table can be refreshed.

It is totally fine if the PlannedGroup contains the snapshotId long, but I am a little concerned about sending the whole SerializableTable over the Flink network stack for serialization. If unaligned checkpoints are enabled, we need to worry about serialization compatibility problems.

pvary (Contributor, Author):
We need the info on the DataFileRewriteExecutor side as well. That would mean a Catalog connection/request from every writer.

TBH, I'm not too concerned about state compatibility. We only have to be very careful about silent issues; in other cases we could always drop the state and start the compaction again. We will not have any data loss if a maintenance task does not finish.

pvary (Contributor, Author) commented Dec 4, 2024

#11131 changed how the delete files are removed. See: #11131 (comment)

Review thread on the planner.plan(...) call:

```java
    RewritePlanResult plan =
        planner.plan(
```
Contributor:
So we are doing a full table scan to figure out the rewrite candidates, which can be expensive for large tables. Wondering if an incremental scan is a good fit here.
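Roughly what the incremental variant could look like, as a sketch only; the bookkeeping of the last planned snapshot id is an assumption:

```java
import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.io.CloseableIterable;

class IncrementalCandidateScan {
  // Plans only the files appended after the snapshot handled by the previous trigger,
  // instead of scanning the whole table on every maintenance run.
  static CloseableIterable<FileScanTask> newFilesSince(Table table, long lastPlannedSnapshotId) {
    return table
        .newIncrementalAppendScan()
        .fromSnapshotExclusive(lastPlannedSnapshotId)
        .toSnapshot(table.currentSnapshot().snapshotId())
        .planFiles();
  }
}
```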

pvary (Contributor, Author):
That's a possibility for a later optimization. It is a non-trivial task to keep the table state in memory and keep it in sync with the actual table, especially when there are independent writers to the table.

pvary (Contributor, Author) commented Dec 5, 2024

To summarise the state, I see the following open questions:

  • State handling and restoration
  • How to handle table schema, specification, configuration changes
  • How to handle errors

State handling and restoration
What we know:

  • There will be no issues with aligned checkpoints, as the checkpoint barriers are aligned with the triggers, so every event is processed within a single checkpoint
  • With unaligned checkpoints:
    • The state of the compaction job is not super important, as the user could always trigger a new compaction and drop old pending results
    • There are good barriers against double committing the same compaction results. If a replaced file doesn't exist in the table anymore, then the commit will fail
    • If we want to use the core CommitManager strategy (which handles partial commits), then we can't guarantee consistency between the state and the Iceberg table, since the CommitManager uses its own thread to do the actual commits; the Flink folks might come up with a solution currently unknown to me
    • If we restore an old checkpoint or savepoint, we are almost guaranteed to fail to commit the inProgress records, as they have most probably already been committed by the previous iteration of the job

Currently, the following ideas have emerged to handle the situation:

  1. Use the core CommitManager, store the state, and do a best effort when committing the inProgress records
    • Helps if the job stopped ungracefully (without calling close on the DataFileRewriteCommitter)
    • Generates errors on start (gracefully handled)
    • Generates errors on receiving groups stored in the wire state (gracefully handled)
  2. Create our own strategy for committing
    • Needs an elaborate strategy to store the group ids in the commits
    • Before adding a new group, we need to scan the table to check whether the group is already committed, either with a cache or with continuous checks
    • We lose the features provided by the core CommitManager (group handling and parallel commits)
  3. Remove state from the compaction task
    • Simplest implementation
    • We might lose some in-flight work: nothing on a graceful stop, but there might be some loss for crashing jobs

How to handle table schema, specification, configuration changes
We need to decide whether we want to handle table schema/specification/configuration changes.

  1. We could opt for forgoing those changes and use the values available when the job started (like every current Flink job)
  2. We need to add guardrails to prevent compaction when there are changes in the table which could cause issues, like losing data from new columns when reading new files with the old schema (a sketch of such a guardrail follows after this list)
  3. We could opt for keeping the schema/specification/configuration up-to-date. For this we have the following options:
    a. Generate a SerializableTable object at planning time and send it over the wire to the DataFileRewriteExecutors
    - The size of the table could increase the network traffic
    - There is an issue with the S3FileIO serialization which needs to be fixed
    b. Send only the snapshotId to the DataFileRewriteExecutors and load the table on every executor
    - Adds extra load to the Catalog, as every executor needs to fetch the table data once for every trigger
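A guardrail along the lines of option 2 could look like the sketch below; the class and method names are made up for illustration and are not from this PR:

```java
import org.apache.iceberg.Table;

class CompactionGuardrail {
  private final int startingSchemaId;
  private final int startingSpecId;

  CompactionGuardrail(Table tableAtStart) {
    this.startingSchemaId = tableAtStart.schema().schemaId();
    this.startingSpecId = tableAtStart.spec().specId();
  }

  // Skip planning a rewrite when the schema or the partition spec changed since the
  // job was started, so we never read new files with an outdated schema.
  boolean safeToCompact(Table currentTable) {
    return currentTable.schema().schemaId() == startingSchemaId
        && currentTable.spec().specId() == startingSpecId;
  }
}
```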

How to handle errors
We have two options for propagating the errors from the specific operators:

  1. Use a side output to emit the errors and aggregate them in a final operator (see the sketch at the end of this section)
    • Better separation of concerns
  2. Send a Pair<Data, Exception> as the output and handle the exceptions in every operator
    • Simpler flow

Another orthogonal question wrt. the exceptions is what to propagate:

  1. Propagate the whole exception
    • We have the full stack trace available at the aggregation point
    • Might cause state issues if the Exception class is changed
  2. Propagate only the exception message and log the exception locally
    • We have less network traffic
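A minimal sketch of the side-output idea from option 1 of the first question; the operator and tag names are illustrative, not from this PR:

```java
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

abstract class ErrorForwardingFunction<I, O> extends ProcessFunction<I, O> {
  // Shared tag so a final operator can collect errors from every maintenance operator.
  static final OutputTag<Exception> ERRORS = new OutputTag<Exception>("maintenance-errors") {};

  @Override
  public void processElement(I value, Context ctx, Collector<O> out) {
    try {
      out.collect(doProcess(value)); // happy path stays on the main output
    } catch (Exception e) {
      ctx.output(ERRORS, e);         // failures go to the side output for aggregation
    }
  }

  protected abstract O doProcess(I value) throws Exception;
}
```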

My current preferences:

  • State handling and restoration: 3
  • How to handle table schema, specification, configuration changes: 2/a
  • How to handle errors: 1, 1

pvary (Contributor, Author) commented Apr 14, 2025

Rebased on top of the new API.
@stevenzwu: Could you please review?

Thanks,
Peter

pvary (Contributor, Author) commented Apr 29, 2025

@stevenzwu: Thanks for the review!
I was OOO for a while, but I'm back now. I have addressed all of your concerns. If you have time, could you please review again?

Thanks, Peter

Guosmilesmile (Contributor) left a comment:
Some small points of confusion.

pvary merged commit cbf2695 into apache:main on May 6, 2025 (20 checks passed).
github-project-automation bot moved this from In Progress to Done in Flink Table Maintenance on May 6, 2025.
pvary (Contributor, Author) commented May 6, 2025

Merged to main.
Thanks for the review @stevenzwu, @Guosmilesmile, @RussellSpitzer, @mxm!
