Conversation

pvary (Contributor) commented Nov 8, 2024

pvary (Contributor, Author) commented Nov 8, 2024

@aokolnychyi, @RussellSpitzer, @szehon-ho - I have touched the Spark compaction as well to remove duplicate code and move it to the core module. Could you please check the Spark and core changes?

@stevenzwu: Could you please review the PR?

RussellSpitzer (Member) commented:

Could we break this up into smaller PRs? I think just one for the refactoring of classes out of the Spark module would be a good first step?

pvary (Contributor, Author) commented Nov 9, 2024

> Could we break this up into smaller PRs? I think just one for the refactoring of classes out of the Spark module would be a good first step?

Sure thing. I think for these refactors it is important to see the whole picture as well, and it is always a tough call to decide how to make the review easier.

Will create the PR soon.

jbonofre self-requested a review on November 9, 2024 09:38
mxm (Contributor) commented Nov 11, 2024

IMHO just grouping into several commits would suffice, e.g. Core/Spark/Flink. Grouping into multiple PRs might make this harder to test and review.

pvary (Contributor, Author) commented Nov 11, 2024

Created #11513 for the refactoring part. I think it is a good start for the review, but drop me a message if you think we need to split things further.

ygrzjh commented Nov 18, 2024

When using io-impl: org.apache.iceberg.aws.s3.S3FileIO, a Kryo serialization exception occurs for the DataFileRewritePlanner$PlannedGroup class. The exception details are as follows:

```
com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException
Serialization trace:
s3 (org.apache.iceberg.aws.s3.S3FileIO)
io (org.apache.iceberg.SerializableTable)
table (org.apache.iceberg.flink.maintenance.operator.DataFileRewritePlanner$PlannedGroup)
    at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:82)
    at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:495)
    at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:523)
    at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:61)
    at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:495)
    at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:523)
    at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:61)
    at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:495)
    at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:599)
    at org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer.serialize(KryoSerializer.java:356)
......
```

pvary (Contributor, Author) commented Nov 18, 2024

@ygrzjh: Seems like an issue with the S3FileIO serialization. Do you have any more information on which field causes the NullPointerException?
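For reference, a minimal way to check this outside Flink would be something like the sketch below (illustrative only, not code from this PR): initialize an S3FileIO the way a catalog would and push it through the same Kryo path that Flink's KryoSerializer uses, to see which nested field is null at write time.

```java
import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.io.Output;
import java.io.ByteArrayOutputStream;
import java.util.Collections;
import org.apache.iceberg.aws.s3.S3FileIO;

public class S3FileIOKryoCheck {
  public static void main(String[] args) {
    // Hypothetical standalone check: create and initialize an S3FileIO
    // (real runs would pass the catalog/table properties instead of an empty map).
    S3FileIO fileIO = new S3FileIO();
    fileIO.initialize(Collections.emptyMap());

    Kryo kryo = new Kryo();
    kryo.setRegistrationRequired(false);
    try (Output output = new Output(new ByteArrayOutputStream())) {
      // Mirrors what KryoSerializer.serialize(...) does under the hood.
      kryo.writeClassAndObject(output, fileIO);
    }
  }
}
```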

Review thread on DataFileRewritePlanner.PlannedGroup:

```java
  private final RewriteFileGroup group;

  private PlannedGroup(
      SerializableTable table, int groupsPerCommit, long splitSize, RewriteFileGroup group) {
```
Contributor:
Wondering if SerializableTable can be passed to the DataFileRewriteExecutor just once during initialization (instead of per planned group)? How big is the serialized Table object?

pvary (Contributor, Author):
The DataFileRewriteExecutor input is rebalanced, so it is not trivial to make sure that the table is sent to every instance.

The size of the serialized table is mostly dependent on the size of the FileIO.
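For a rough number, one illustrative way to measure this (not part of this PR) is to Java-serialize a SerializableTable copy and count the bytes:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import org.apache.iceberg.SerializableTable;
import org.apache.iceberg.Table;

class SerializedTableSize {
  // Returns the Java-serialized size of a SerializableTable copy of the given table;
  // the result is dominated by the FileIO and the table metadata.
  static int measure(Table table) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
      out.writeObject(SerializableTable.copyOf(table));
    }
    return bytes.size();
  }
}
```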

stevenzwu (Contributor) commented Dec 3, 2024:

Pass SerializableTable to the constructor of DataFileRewriteExecutor so that it becomes part of the class state.

If the planner needs more than a read-only table, maybe we can pass a TableLoader to the planner class and load a table that way?
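Something along these lines; an illustrative sketch of the TableLoader idea only, not the code in this PR:

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.iceberg.Table;
import org.apache.iceberg.flink.TableLoader;

abstract class TableLoadingFunction<I, O> extends ProcessFunction<I, O> {
  private final TableLoader tableLoader; // serializable, shipped with the job graph
  private transient Table table;         // loaded per task, can be refreshed

  TableLoadingFunction(TableLoader tableLoader) {
    this.tableLoader = tableLoader;
  }

  @Override
  public void open(Configuration parameters) throws Exception {
    tableLoader.open();
    this.table = tableLoader.loadTable();
  }

  protected Table table() {
    // Refresh to pick up the current schema/spec instead of shipping SerializableTable per group.
    table.refresh();
    return table;
  }
}
```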

pvary (Contributor, Author):
We need the current schema/spec for the table. If we pass the data in the constructor, we will work from stale table metadata.

stevenzwu (Contributor) commented Dec 4, 2024:

Why don't we let the committer load the Table object via a TableLoader? A regular Table can be refreshed.

It is totally fine if the PlannedGroup contains the snapshotId long, but I am a little concerned about sending the whole SerializableTable over the Flink network stack for serialization. If unaligned checkpoints are enabled, we need to worry about serialization compatibility problems.

pvary (Contributor, Author):
We need the info on the DataFileRewriteExecutor side as well. That would mean a Catalog connection/request from every writer.

TBH, I'm not too concerned about state compatibility. We only have to be very careful about silent issues; in other cases we could always drop the state and start the compaction again. We will not have any data loss if a maintenance task does not finish.

pvary (Contributor, Author) commented Dec 4, 2024

#11131 changed how the delete files are removed. See: #11131 (comment)

Review thread on the planner.plan(...) call:

```java
    RewritePlanResult plan =
        planner.plan(
```
Contributor:
So we are doing a full table scan to figure out the rewrite candidates, which can be expensive for large tables. Wondering if an incremental scan is a good fit here.
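Roughly what the incremental variant could look like, as a sketch only; the bookkeeping of the last planned snapshot id is an assumption:

```java
import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.io.CloseableIterable;

class IncrementalCandidateScan {
  // Plans only the files appended after the snapshot handled by the previous trigger,
  // instead of scanning the whole table on every maintenance run.
  static CloseableIterable<FileScanTask> newFilesSince(Table table, long lastPlannedSnapshotId) {
    return table
        .newIncrementalAppendScan()
        .fromSnapshotExclusive(lastPlannedSnapshotId)
        .toSnapshot(table.currentSnapshot().snapshotId())
        .planFiles();
  }
}
```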

pvary (Contributor, Author):
That's a possibility for a later optimization. It is a non-trivial task to keep the table state in memory and keep it in sync with the actual table, especially when there are independent writers to the table.

pvary (Contributor, Author) commented Dec 5, 2024

To summarise the state, I see the following open questions:

  • State handling and restoration
  • How to handle table schema, specification, configuration changes
  • How to handle errors

State handling and restoration
What we know:

  • There will be no issues with aligned checkpoints, as the checkpoint barriers are aligned with the triggers, so every event is processed within a single checkpoint
  • With unaligned checkpoints:
    • The state of the compaction job is not super important, as the user could always trigger a new compaction and drop old pending results
    • There are good barriers against double committing the same compaction results. If a replaced file doesn't exist in the table anymore, then the commit will fail
    • If we want to use the core CommitManager strategy (which handles partial commits), then we can't guarantee consistency between the state and the Iceberg table, since the CommitManager uses its own thread to do the actual commits; the Flink folks might come up with a solution currently unknown to me
    • If we restore an old checkpoint or savepoint, we are almost guaranteed to fail to commit the inProgress records, as they have most probably already been committed by the previous iteration of the job

Currently, the following ideas have emerged to handle the situation:

  1. Use the core CommitManager, store the state, and do a best effort when committing the inProgress records
    • Helps if the job stopped ungracefully (without calling close on the DataFileRewriteCommitter)
    • Generates errors on start (gracefully handled)
    • Generates errors on receiving groups stored in the wire state (gracefully handled)
  2. Create our own strategy for committing
    • Needs an elaborate strategy to store the group ids in the commits
    • Before adding a new group, we need to scan the table to check whether the group is already committed, either with a cache or with continuous checks
    • We lose the features provided by the core CommitManager (group handling and parallel commits)
  3. Remove state from the compaction task
    • Simplest implementation
    • We might lose some in-flight work: nothing on a graceful stop, but there might be some loss for crashing jobs

How to handle table schema, specification, configuration changes
We need to decide whether we want to handle table schema/specification/configuration changes.

  1. We could opt for forgoing those changes and use the values available when the job started (like every current Flink job)
  2. We need to add guardrails to prevent compaction when there are changes in the table which could cause issues, like losing data from new columns when reading new files with the old schema (a sketch of such a guardrail follows after this list)
  3. We could opt for keeping the schema/specification/configuration up-to-date. For this we have the following options:
    a. Generate a SerializableTable object at planning time and send it over the wire to the DataFileRewriteExecutors
    - The size of the table could increase the network traffic
    - There is an issue with the S3FileIO serialization which needs to be fixed
    b. Send only the snapshotId to the DataFileRewriteExecutors and load the table on every executor
    - Adds extra load to the Catalog, as every executor needs to fetch the table data once for every trigger
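A guardrail along the lines of option 2 could look like the sketch below; the class and method names are made up for illustration and are not from this PR:

```java
import org.apache.iceberg.Table;

class CompactionGuardrail {
  private final int startingSchemaId;
  private final int startingSpecId;

  CompactionGuardrail(Table tableAtStart) {
    this.startingSchemaId = tableAtStart.schema().schemaId();
    this.startingSpecId = tableAtStart.spec().specId();
  }

  // Skip planning a rewrite when the schema or the partition spec changed since the
  // job was started, so we never read new files with an outdated schema.
  boolean safeToCompact(Table currentTable) {
    return currentTable.schema().schemaId() == startingSchemaId
        && currentTable.spec().specId() == startingSpecId;
  }
}
```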

How to handle errors
We have two options for propagating the errors from the specific operators:

  1. Use a side output to emit the errors and aggregate them in a final operator (see the sketch at the end of this section)
    • Better separation of concerns
  2. Send a Pair<Data, Exception> as the output and handle the exceptions in every operator
    • Simpler flow

Another orthogonal question wrt. the exceptions is what to propagate:

  1. Propagate the whole exception
    • We have the full stack trace available at the aggregation point
    • Might cause state issues if the Exception class is changed
  2. Propagate only the exception message and log the exception locally
    • We have less network traffic
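A minimal sketch of the side-output idea from option 1 of the first question; the operator and tag names are illustrative, not from this PR:

```java
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

abstract class ErrorForwardingFunction<I, O> extends ProcessFunction<I, O> {
  // Shared tag so a final operator can collect errors from every maintenance operator.
  static final OutputTag<Exception> ERRORS = new OutputTag<Exception>("maintenance-errors") {};

  @Override
  public void processElement(I value, Context ctx, Collector<O> out) {
    try {
      out.collect(doProcess(value)); // happy path stays on the main output
    } catch (Exception e) {
      ctx.output(ERRORS, e);         // failures go to the side output for aggregation
    }
  }

  protected abstract O doProcess(I value) throws Exception;
}
```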

My current preferences:

  • State handling and restoration: 3
  • How to handle table schema, specification, configuration changes: 2/a
  • How to handle errors: 1, 1

pvary (Contributor, Author) commented Apr 14, 2025

Rebased on top of the new API.
@stevenzwu: Could you please review?

Thanks,
Peter

pvary (Contributor, Author) commented Apr 29, 2025

@stevenzwu: Thanks for the review!
I was OOO for a while, but I'm back now. I have addressed all of your concerns. If you have time, could you please review again?

Thanks, Peter

Guosmilesmile (Contributor) left a comment:
Some small points of confusion.

pvary merged commit cbf2695 into apache:main on May 6, 2025 (20 checks passed).
github-project-automation bot moved this from In Progress to Done in Flink Table Maintenance on May 6, 2025.
pvary (Contributor, Author) commented May 6, 2025

Merged to main.
Thanks for the review @stevenzwu, @Guosmilesmile, @RussellSpitzer, @mxm!
