
Conversation

@pvary pvary commented Sep 26, 2023

Summary

The Flink community created a new Sink specification in FLIP-143 with the explicit goal of handling bounded and unbounded data streams uniformly. It was later extended in FLIP-191, which adds a well-defined place to execute small-file compaction. Based on the discussion on the dev mailing list, the deprecation of the old SinkFunction has been postponed until around Flink 2.0, so the migration is not extremely urgent; however, being able to use the PostCommitTopology to compact small files could provide immediate benefits to users of the Iceberg-Flink integration.
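For orientation, the FLIP-191 extension point works roughly as in the sketch below. This is an illustrative copy that mirrors (but is not) the Flink sink2 API of the 1.17 era, so the names should be double-checked against the Flink version the PR targets: a sink that opts into a post-commit topology receives the stream of committed committables and can append its own operators there, which is where the Iceberg sink could aggregate the committed data files and trigger small-file compaction.

import org.apache.flink.streaming.api.connector.sink2.CommittableMessage;
import org.apache.flink.streaming.api.datastream.DataStream;

// Illustrative sketch only; the real hook in Flink is the WithPostCommitTopology interface.
interface WithPostCommitTopologySketch<CommT> {

  // Called while the sink topology is assembled; the stream carries the committables
  // that the committer has already committed. An Iceberg implementation could collect
  // the committed data files here and schedule a rewrite of the small ones.
  void addPostCommitTopology(DataStream<CommittableMessage<CommT>> committables);
}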

Previous work

There is an existing Iceberg PR #4904 for the Sink migration by Liwei Li (https://github.com/hililiwei) and Kyle Bendickson (https://github.com/kbendick), along with related documentation authored by the same team. The discussion there has stalled, and the PR has been out of date for almost a year now. The current proposal builds heavily on their work, and they should remain co-authors of the proposed change.

To start the discussion, I have created the following document.
https://docs.google.com/document/d/1K1M4wb9r_Tr-SDsUvqLyBaI5F14eRcqe3-ud6io0Da0/edit?usp=sharing

I propose the following timeline:

  1. Review the design document
  2. Update the PR
  3. PR reviews
  4. Merge the PR to the Iceberg source
  5. Restart the discussion about the missing features in the Flink community by creating a FLIP
  6. Discuss/review/merge the relevant Flink changes
  7. Release the Flink changes
  8. Create a PR in the Iceberg repo to start using the new Flink features
  9. Merge the Iceberg PR
  10. Be happy 😀

pvary commented Sep 26, 2023

CC: @hililiwei, @stevenzwu, @chenjunjiedada, @gyfora - you might be interested in this

.defaultValue(false)
.withDescription("Use the FLIP-27 based Iceberg source implementation.");

public static final ConfigOption<Boolean> TABLE_EXEC_ICEBERG_USE_FLIP143_SINK =

I think this config name is not very good; we should not include "flip" in it.
How about calling it use-v2-sink?
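For illustration, the rename could follow the same pattern as the FLIP-27 source flag shown in the snippet above. The key string, default value, and description below are assumptions, not the final PR values:

import org.apache.flink.configuration.ConfigOption;
import org.apache.flink.configuration.ConfigOptions;

// Hypothetical sketch of the suggested rename; key, default, and description are placeholders.
class FlinkConfigOptionsSketch {
  public static final ConfigOption<Boolean> TABLE_EXEC_ICEBERG_USE_V2_SINK =
      ConfigOptions.key("table.exec.iceberg.use-v2-sink")
          .booleanType()
          .defaultValue(false)
          .withDescription("Use the SinkV2 based Iceberg sink implementation.");
}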

* #init(IcebergFilesCommitterMetrics)} is idempotent, so it could be called multiple times to
* accommodate the missing init feature in the {@link Committer}.
*/
public class CommonCommitter implements Serializable {

Similar to the Sink base class, maybe this should be called CommitterBase?
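As a side note on the snippet above: the idempotent #init contract described in the Javadoc can be as simple as a guard flag. A minimal sketch, with a hypothetical class and field name (this is not the PR's actual CommonCommitter code):

class IdempotentInitSketch {
  private transient boolean initialized = false;

  // May be called from several call sites because the SinkV2 Committer has no dedicated
  // init/open hook; every call after the first one is a no-op.
  void init() {
    if (initialized) {
      return;
    }
    // one-time setup (e.g. loading the table, registering committer metrics) would go here
    initialized = true;
  }
}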

Comment on lines +75 to +95
SinkBuilder builder() {
return builder;
}

FlinkWriteConf flinkWriteConf() {
return flinkWriteConf;
}

Map<String, String> writeProperties() {
return writeProperties;
}

SerializableSupplier<Table> tableSupplier() {
return tableSupplier;
}

RowType flinkRowType() {
return flinkRowType;
}

List<Integer> equalityFieldIds() {

Why not simply use protected / package-private fields and access them directly to reduce boilerplate?
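A sketch of that suggestion, with field names copied from the accessors above and a made-up base-class name; subclasses in the same package would then read the state directly instead of going through getters:

import java.util.List;
import java.util.Map;
import org.apache.flink.table.types.logical.RowType;

// Hypothetical base class illustrating package-private fields instead of accessor methods.
abstract class SinkBaseSketch {
  Map<String, String> writeProperties;
  RowType flinkRowType;
  List<Integer> equalityFieldIds;
  // a subclass in the same package can use e.g. flinkRowType.getFieldCount() directly
}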

Comment on lines +180 to +182
.setParallelism(1)
.setMaxParallelism(1)
.global();

@gyfora commented Oct 4, 2023


setParallelism(1) / setMaxParallelism(1) could be replaced by .forceNonParallel()
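A sketch of the suggested simplification; the helper and variable names are made up, but forceNonParallel() is an existing DataStream API call that pins both parallelism and max parallelism to 1 in one step (and also prevents the parallelism from being overridden later):

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;

final class NonParallelSketch {
  private NonParallelSketch() {}

  // Replaces the explicit .setParallelism(1).setMaxParallelism(1) pair on the aggregator operator.
  static <T> DataStream<T> pinToSingleSubtask(SingleOutputStreamOperator<T> aggregated) {
    return aggregated.forceNonParallel().global();
  }
}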

prefixIfNotNull(uidPrefix, initTable.name() + "-" + sinkId + "-pre-commit-topology"),
typeInformation,
new SinkV2Aggregator(commonCommitter))
.uid(prefixIfNotNull(uidPrefix, "pre-commit-topology"))

Maybe a better name/uid would be ...-write-result-aggregator, in case we add more things here later; it is also more descriptive.

TypeInformation<CommittableMessage<SinkV2Committable>> typeInformation =
CommittableMessageTypeInfo.of(this::getCommittableSerializer);
return writeResults
.global()

We don't really need global(), as the parallelism is forced to 1.

@github-actions

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Sep 20, 2024
@github-actions

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Sep 27, 2024