Conversation

@rodmeneses rodmeneses commented Jan 23, 2025

This PR ports the RANGE distribution mode on the FlinkSink to the new IcebergSink based on the Flink V2 sink interface.

cc: @stevenzwu @mxm @pvary @Guosmilesmile

@github-actions github-actions bot added the flink label Jan 23, 2025
mxm commented Jan 24, 2025

Hey @rodmeneses! Thanks for porting this feature over. I'll have a look shortly.


@mxm mxm left a comment


LGTM

@mxm
Copy link
Contributor

mxm commented Feb 7, 2025

@pvary Can you take a look as well?

@github-actions

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Mar 10, 2025
@rodmeneses
Contributor Author

keeping it alive

@rodmeneses rodmeneses marked this pull request as draft March 13, 2025 23:31
@github-actions github-actions bot removed the stale label Mar 14, 2025
@github-actions github-actions bot added the stale label Apr 13, 2025
@github-actions

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Apr 21, 2025
@Guosmilesmile
Contributor

Hi team, how is the progress on this PR? I really need this feature in SinkV2.

@stevenzwu stevenzwu reopened this May 15, 2025
@stevenzwu
Contributor

reopened the PR. @rodmeneses please update when it is ready for review. right now, it is marked as draft.

@github-actions github-actions bot removed the stale label May 16, 2025
mxm commented May 16, 2025

+1, it would be nice to follow up on this. If @rodmeneses is busy, maybe @Guosmilesmile could also take this one?

@rodmeneses
Contributor Author

Hi @stevenzwu, thanks for reopening this. I will try to finish it this coming week. I think it only needs to port some fixes recently made to the FlinkSink RANGE distribution mode, as well as to address some review comments.

@rodmeneses rodmeneses force-pushed the rangeDistributionIcebergSink branch from c44ba98 to 21c628a Compare May 20, 2025 16:52
}
}

private DataStream<RowData> distributeDataStreamByNoneDistributionMode(
Contributor Author


Breaking distributeDataStream into smaller functions, one for each distributionMode, due to:

 Cyclomatic Complexity is 13 (max allowed is 12). [CyclomaticComplexity]

This way, it is also clear what each distributionMode needs as function parameters, i.e.:

  1. distributeDataStreamByNoneDistributionMode -> (DataStream<RowData> input, Schema schema)
  2. distributeDataStreamByHashDistributionMode -> (DataStream<RowData> input, Schema schema, PartitionSpec spec)
  3. distributeDataStreamByRangeDistributionMode -> (DataStream<RowData> input, Schema schema, PartitionSpec spec, SortOrder sortOrderParam)

This also gives clear information about what each distributionMode needs for its internal calculation.
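The split described above can be sketched with a simplified, hypothetical dispatcher (plain Java, not the actual IcebergSink code; names and return types are illustrative only):

```java
// Hypothetical sketch: one helper per distribution mode keeps each method's
// parameter list minimal and the dispatching switch trivially simple, which
// keeps cyclomatic complexity low. Not the real IcebergSink code.
enum DistributionMode { NONE, HASH, RANGE }

class DistributionDispatcher {
  // Dispatch to a dedicated helper; each helper declares only the inputs it needs.
  static String distribute(DistributionMode mode) {
    switch (mode) {
      case NONE:
        return byNone();
      case HASH:
        return byHash();
      case RANGE:
        return byRange();
      default:
        throw new IllegalArgumentException("Unknown mode: " + mode);
    }
  }

  // Placeholders standing in for the real helpers and their parameter lists.
  static String byNone() { return "(input, schema)"; }

  static String byHash() { return "(input, schema, spec)"; }

  static String byRange() { return "(input, schema, spec, sortOrder)"; }
}
```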


@stevenzwu stevenzwu left a comment


@rodmeneses please mark the PR ready for review when ready. right now, it is still a draft.

One uber comment: it has been a long time since the initial creation of this PR. there were some changes in FlinkSink. please do another pass and make sure the drift has been ported here.

also, there are some old comments from Peter not addressed. but it looks like those problems might also exist in the v1 sink tests too.

+ "and table is unpartitioned");
return input;
} else {
if (BucketPartitionerUtil.hasOneBucketField(spec)) {
Contributor


can you check with FlinkSink code again? this code has been removed/reverted there.


return shuffleStream
.partitionCustom(new RangePartitioner(schema, sortOrder), r -> r)
.filter(StatisticsOrRecord::hasRecord)
Contributor


there have been changes since this PR was initially created. please re-sync with FlinkSink

@rodmeneses rodmeneses marked this pull request as ready for review May 20, 2025 22:48
@rodmeneses rodmeneses force-pushed the rangeDistributionIcebergSink branch from 82e7d90 to de8ea5e Compare May 20, 2025 23:50
}

@TestTemplate
void testJobNoneDistributeMode() throws Exception {
Contributor Author


All these methods were incorrectly copied here in a previous commit and are duplicates of the ones in TestFlinkIcebergSinkV2DistributionMode, so I'm removing them.

@rodmeneses rodmeneses force-pushed the rangeDistributionIcebergSink branch from de8ea5e to d2b4b78 Compare May 20, 2025 23:53
@stevenzwu stevenzwu changed the title Range distribution iceberg sink Flink: port range distribution to v2 iceberg sink May 22, 2025
PartitionSpec partitionSpec,
SortOrder sortOrderParam) {

int writerParallelism =

@stevenzwu stevenzwu May 22, 2025


this logic should also be applied to the writer parallelism for the v2 sink.

if write parallelism is not configured, the v1 sink defaults the writer parallelism to the input parallelism to promote chaining. Want to confirm whether that is the case for the v2 sink? from reading the code, I thought the v2 sink will default the writer parallelism to the default job parallelism?

      // Note that IcebergSink internally consists of multiple operators (like writer,
      // committer, aggregator). The following parallelism will be propagated to all of
      // the above operators.
      if (sink.flinkWriteConf.writeParallelism() != null) {
        rowDataDataStreamSink.setParallelism(sink.flinkWriteConf.writeParallelism());
      }

technically, if this is a behavior change problem for the v2 sink, it is not caused by this PR. but it is critical that the same writer parallelism is used by the shuffle operator to properly range partition the data to downstream writer tasks. That is why in the v1 FlinkSink, you can see writerParallelism is computed once and passed to two methods.

      int writerParallelism =
          flinkWriteConf.writeParallelism() == null
              ? rowDataInput.getParallelism()
              : flinkWriteConf.writeParallelism();

      // Distribute the records from input data stream based on the write.distribution-mode and
      // equality fields.
      DataStream<RowData> distributeStream =
          distributeDataStream(rowDataInput, equalityFieldIds, flinkRowType, writerParallelism);

      // Add parallel writers that append rows to files
      SingleOutputStreamOperator<FlinkWriteResult> writerStream =
          appendWriter(distributeStream, flinkRowType, equalityFieldIds, writerParallelism);
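Why the shuffle operator and the writers must agree on parallelism can be illustrated with a toy, self-contained range partitioner (hypothetical; this is not Iceberg's RangePartitioner, and the names are made up for illustration):

```java
// Toy sketch (not Iceberg's RangePartitioner): the target subtask is computed
// against the number of downstream partitions, so the shuffle operator must use
// the same parallelism as the writers, or records land on the wrong subtask.
class ToyRangePartitioner {
  private final int[] upperBounds; // upper bound of each key range, e.g. from statistics

  ToyRangePartitioner(int[] upperBounds) {
    this.upperBounds = upperBounds;
  }

  // Maps a sort key to a writer subtask index in [0, numPartitions).
  int partition(int sortKey, int numPartitions) {
    int idx = 0;
    while (idx < upperBounds.length && sortKey > upperBounds[idx]) {
      idx++;
    }
    // Only meaningful when numPartitions matches the writer parallelism the
    // bounds were computed for; the modulo shows how a mismatch scrambles ranges.
    return idx % numPartitions;
  }
}
```

With bounds {10, 20} and three writers, keys route to subtasks 0, 1, and 2 by range; if the writers actually run at parallelism 2, the last range wraps onto subtask 0 and the range partitioning is broken.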

Contributor Author


Hi @stevenzwu
Thanks for your comment.
We have the same logic at the beginning of the distributeDataStreamByRangeDistributionMode method:

int writerParallelism =
    flinkWriteConf.writeParallelism() == null
        ? input.getParallelism()
        : flinkWriteConf.writeParallelism();

So I think that for range partitioning the behavior should be the same. What do you think?
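The fallback discussed in this thread reduces to a one-line selection (hypothetical helper name; in the real code this expression is inlined in FlinkSink):

```java
// Hypothetical helper mirroring the v1 FlinkSink fallback: prefer an explicitly
// configured write parallelism; otherwise inherit the input stream's
// parallelism, which promotes operator chaining.
class WriterParallelism {
  static int select(Integer configuredWriteParallelism, int inputParallelism) {
    return configuredWriteParallelism == null ? inputParallelism : configuredWriteParallelism;
  }
}
```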


@stevenzwu stevenzwu May 22, 2025


sorry, I wasn't very clear earlier. The v2 sink writer parallelism selection is different from the v1 sink's. It doesn't use the input parallelism if write parallelism is not configured explicitly.

    @Override
    public Builder writeParallelism(int newWriteParallelism) {
      writeOptions.put(
          FlinkWriteOptions.WRITE_PARALLELISM.key(), Integer.toString(newWriteParallelism));
      return this;
    }


@rodmeneses rodmeneses May 28, 2025


Hi @stevenzwu. Thanks for the clarification. I think I know what you mean. In FlinkSink, even without considering RANGE distribution, the parallelism of the v1 sink by default will be the same as the input source parallelism.
This is a good approach, because it encourages chaining.

However, we don't have that logic in the v2 IcebergSink. I think we should have the same logic there. I could do that in another PR and then follow up with this one. What do you think? Thanks

Contributor


sure. you can follow up with a new PR for the writer parallelism fix.

can you resolve the conflict? then we can merge this.

Contributor Author


will do!


@rodmeneses rodmeneses Jun 2, 2025


Hi @stevenzwu, I have rebased and this is ready! Thanks
Tagging @mxm and @pvary as well

@rodmeneses rodmeneses force-pushed the rangeDistributionIcebergSink branch 3 times, most recently from ce63639 to 1dcf734 Compare June 2, 2025 17:15
@rodmeneses rodmeneses force-pushed the rangeDistributionIcebergSink branch from 1dcf734 to 2721fbe Compare June 2, 2025 17:44

@mxm mxm left a comment


LGTM. This should be ported to 2.0 and 1.19 once merged.

@stevenzwu stevenzwu merged commit 931865e into apache:main Jun 3, 2025
18 checks passed
@stevenzwu
Copy link
Contributor

thanks @rodmeneses for the contribution and @mxm @pvary for the review
