Flink: Implement data statistics operator to collect traffic distribution for guiding smart shuffling #6382

yegangy0718 · 2022-12-08T01:59:15Z

This PR is created as part of issue #6303 and project https://github.com/apache/iceberg/projects/27

In this PR, we focus on creating the ShuffleOperator for bin packing based on traffic distribution statistics (the second item in issue #6303).

Changes:

Implement ShuffleOperator, which will be added before the Iceberg Writer operator to collect data distribution by key (generated from the provided KeySelector)
Implement ShuffleRecordWrapper, which contains either the record or data distribution information
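
A minimal, Flink-free sketch of the counting idea described above. It assumes a plain `java.util.function.Function` standing in for Flink's `KeySelector`; the class `KeyCountSketch` and its methods are hypothetical names for illustration and are not part of this PR.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in for the operator's local statistics; not the PR's code.
class KeyCountSketch<T, K> {
  private final Function<T, K> keySelector; // plays the role of Flink's KeySelector
  private final Map<K, Long> localCounts = new HashMap<>();

  KeyCountSketch(Function<T, K> keySelector) {
    this.keySelector = keySelector;
  }

  // Count the record's key, then return the record unchanged so it can be forwarded.
  T process(T record) {
    K key = keySelector.apply(record);
    localCounts.merge(key, 1L, Long::sum);
    return record;
  }

  Map<K, Long> localCounts() {
    return localCounts;
  }

  public static void main(String[] args) {
    // Use the first character of each string as the key.
    KeyCountSketch<String, Character> sketch = new KeyCountSketch<>(s -> s.charAt(0));
    for (String record : List.of("apple", "avocado", "banana")) {
      sketch.process(record);
    }
    System.out.println(sketch.localCounts());
  }
}
```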

I will open follow-up PRs to implement ShuffleCoordinator, the logic for sending and receiving data distribution between the coordinator and the operator, etc.

pvary · 2022-12-08T10:48:45Z

flink/v1.16/flink/src/main/java/org/apache/iceberg/flink/sink/shuffle/ShuffleOperator.java

+  @Override
+  public void initializeState(StateInitializationContext context) throws Exception {
+    localDataStatisticsMap = Maps.newHashMap();
+    globalDataDistributionWeightState =


The global data distribution should be the same (or samish) for every instance of the operator.
Shall we use getBroadcastState instead?
Based on my understanding that would better approximate our requirements for redistributing the state when rescaling. Even if there are some small discrepancies in the values between the different operation instances, we should not be concerned too much about them.

WDYT?

@pvary discussed this with me offline a few days ago. The Flink docs say that broadcast state is usually used with a broadcast stream, but it may also work for a regular stream. This part has yet to be verified. @yegangy0718 can you experiment with broadcast state in a unit test (especially for rescaling)?

OK. Let me take some time to investigate if BroadcastState can be applied to this case.

My current understanding is that when using broadcast, there are two streams. One stream is sent to all tasks, and the other is not. They are then connected by calling connect. The task can use getBroadcastState to access the broadcast state.
Conceptually, yes, it is similar to what we are doing. Global data statistics would be the broadcast one.
But in our case, no stream or other place publishes the broadcast state, so where would the shuffle operator get BroadcastState from? Do you mean that the shuffle operator publishes the data statistics once it receives them from the coordinator, and then in initializeState the shuffle operator can get BroadcastState?

Let's merge this change, and we can play around with BroadcastState in a follow-up PR. WDYT?

SGTM. I can create a follow-up PR that adds a scale-up test case to check the behavior of BroadcastState.

pvary · 2022-12-08T11:22:07Z

flink/v1.16/flink/src/test/java/org/apache/iceberg/flink/sink/shuffle/TestShuffleOperator.java

+  }
+
+  @Test
+  public void testInitializeState() throws Exception {


Might as well add tests for the rebalance case if it is not too complicated

The logic to send data statistics from operator to coordinator and data statistics aggregation function in coordinator are not defined in the PR. Thus the global state is always empty when restoring.
I will add the rebalance case in the next PR once the above parts are there.

not sure if this test is necessary. seems too trivial to test.

pvary · 2022-12-08T11:24:19Z

flink/v1.16/flink/src/test/java/org/apache/iceberg/flink/sink/shuffle/TestShuffleOperator.java

+    MockEnvironment env = new MockEnvironmentBuilder().build();
+    AbstractStateBackend abstractStateBackend = new HashMapStateBackend();
+    CloseableRegistry cancelStreamRegistry = new CloseableRegistry();
+    return abstractStateBackend.createOperatorStateBackend(


Maybe put some data here, and check that the state is restored as expected?

@yegangy0718 maybe we should use OneInputStreamOperatorTestHarness. you can refer to TestIcebergStreamWriter.

We get globalDataStatistics from coordinator and then save the value during snapshot. The function handleOperatorEvent to receive globalDataStatistics from coordinator is not implemented in this PR yet. Thus we won't be able to test that part.

I will add a test using OneInputStreamOperatorTestHarness in latest commit. After implementing the globalDataStatistics setting logic in next PR, I will add the test to take snapshot and restore from new state to verify state is restored as expected.

zinking · 2022-12-12T06:46:26Z

flink/v1.16/flink/src/main/java/org/apache/iceberg/flink/sink/shuffle/ShuffleRecordWrapper.java

+
+  private static final long serialVersionUID = 1L;
+
+  private final Map<K, Long> globalDataDistributionWeight;


wasn't this simply channel Id in the design spec ?

We plan to create a StreamPartitioner, chained with the shuffle operator, that takes the global distribution via ShuffleRecordWrapper and generates the channel assignment there.

@zinking what do you mean by channel Id? The downstream channel/subtask id?
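
This PR does not define how the partitioner would turn the global distribution into a channel assignment. Purely as an illustration of bin packing by key weight, a greedy sketch could look like the following. The algorithm choice (heaviest key first, to the least-loaded subtask) and the names `GreedyAssignmentSketch` and `assign` are assumptions, not the design from issue #6303.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: greedy "heaviest key to least-loaded subtask" packing.
class GreedyAssignmentSketch {
  static <K> Map<K, Integer> assign(Map<K, Long> keyWeights, int numSubtasks) {
    // Sort keys by descending weight so that heavy keys are placed first.
    List<Map.Entry<K, Long>> entries = new ArrayList<>(keyWeights.entrySet());
    entries.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));

    long[] load = new long[numSubtasks];
    Map<K, Integer> assignment = new HashMap<>();
    for (Map.Entry<K, Long> entry : entries) {
      // Pick the subtask with the smallest accumulated weight (lowest index wins ties).
      int target = 0;
      for (int i = 1; i < numSubtasks; i++) {
        if (load[i] < load[target]) {
          target = i;
        }
      }
      load[target] += entry.getValue();
      assignment.put(entry.getKey(), target);
    }
    return assignment;
  }
}
```

With weights a=10, b=6, c=5 and two subtasks, the heaviest key gets a subtask to itself and the two lighter keys share the other one.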

stevenzwu · 2023-01-08T00:33:42Z

...k/v1.16/flink/src/main/java/org/apache/iceberg/flink/sink/shuffle/DataStatisticsFactory.java

+ */
+class DataStatisticsFactory<K> {
+
+  DataStatistics<K> createDataStatistics() {


I am not sure if there is much value of this factory class. even if we want to keep it, this method should be called createMapStatistics.

I would feel that we either need a clean method to the DataStatistics API, or need to abstract away the way to recreate a new one after sending it to the JobManager (at checkpoint). I do not feel that the Operator should know which type of statistics is used.

Alternatively we could use statistics which automatically keep historical data - but I would not bother with them in this phase

The factory will create different types of DataStatistics for different scenarios, such as MapStatistics for low cardinality and SketchStatistics/DigestStatistics for high cardinality. Thus I would prefer to keep it general.
Whether we need this factory depends. We can either use ShuffleOperatorFactory to read the config (low cardinality, high cardinality, or range mode) and then pass the right data statistics to the operator and coordinator, or pass this factory into the operator and coordinator and create DataStatistics by calling this function.
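
A rough sketch of the factory option, so that the operator never names a concrete statistics type. The interface and class shapes below only loosely follow the names used in this thread; the method set, `OperatorSketch`, and `rotate()` are assumptions for illustration, not the PR's actual code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical shapes, loosely following the names used in this thread.
interface DataStatistics<K> {
  void add(K key);

  boolean isEmpty();
}

interface DataStatisticsFactory<K> {
  DataStatistics<K> createDataStatistics();
}

class MapDataStatistics<K> implements DataStatistics<K> {
  private final Map<K, Long> statistics = new HashMap<>();

  @Override
  public void add(K key) {
    statistics.merge(key, 1L, Long::sum);
  }

  @Override
  public boolean isEmpty() {
    return statistics.isEmpty();
  }
}

class MapDataStatisticsFactory<K> implements DataStatisticsFactory<K> {
  @Override
  public DataStatistics<K> createDataStatistics() {
    return new MapDataStatistics<>();
  }
}

// The operator depends only on the factory, not on MapDataStatistics.
class OperatorSketch<K> {
  private final DataStatisticsFactory<K> factory;
  private DataStatistics<K> localStatistics;

  OperatorSketch(DataStatisticsFactory<K> factory) {
    this.factory = factory;
    this.localStatistics = factory.createDataStatistics();
  }

  void record(K key) {
    localStatistics.add(key);
  }

  // Hand off the current statistics and start a fresh instance, e.g. at a checkpoint.
  DataStatistics<K> rotate() {
    DataStatistics<K> previous = localStatistics;
    localStatistics = factory.createDataStatistics();
    return previous;
  }
}
```

Swapping in a sketch-based or digest-based implementation for high cardinality would then only require a different factory.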

stevenzwu · 2023-01-08T00:46:28Z

flink/v1.16/flink/src/main/java/org/apache/iceberg/flink/sink/shuffle/MapDataStatistics.java

+  public void merge(DataStatistics<K> other) {
+    Preconditions.checkArgument(
+        other instanceof MapDataStatistics, "Can not merge this type of statistics: " + other);
+    MapDataStatistics<K> mapDataStatistic = (MapDataStatistics<K>) other;


nit: I would call this otherStatistics

stevenzwu · 2023-01-08T04:29:21Z

flink/v1.16/flink/src/main/java/org/apache/iceberg/flink/sink/shuffle/ShuffleOperator.java

+  }
+
+  @VisibleForTesting
+  ListStateDescriptor<DataStatistics<K>> generateGlobalDataDistributionWeightDescriptor() {


not sure if this method is necessary for code reuse. not sure why it is necessary to check it in testing.

We use it in the unit test now to get the state and make sure it is not null, as SourceOperatorTest does: https://github.com/apache/flink/blob/68b37fb867374df5a201f0b170e35c21266e5d7b/flink-streaming-java/src/test/java/org/apache/flink/streaming/api/operators/SourceOperatorTest.java#L87
I was thinking we can use it to verify that globalStatisticsState was updated as expected after taking a snapshot.
But actually, we can also create a test-only visible function that returns globalStatisticsState directly if we need to check its value.

stevenzwu · 2023-01-08T04:30:08Z

flink/v1.16/flink/src/main/java/org/apache/iceberg/flink/sink/shuffle/ShuffleOperator.java

+
+  private final KeySelector<T, K> keySelector;
+  private final OperatorEventGateway operatorEventGateway;
+  // key is generated by applying KeySelector to record


nit: these comments seem not relevant anymore

stevenzwu · 2023-01-08T04:31:26Z

flink/v1.16/flink/src/main/java/org/apache/iceberg/flink/sink/shuffle/ShuffleOperator.java

+  // key is generated by applying KeySelector to record
+  // value is the times key occurs
+  // TODO: support to store statistics for high cardinality cases
+  private transient DataStatistics<K> localDataStatistics;


nit: simplify the names a little by removing Data

stevenzwu · 2023-01-08T04:33:25Z

flink/v1.16/flink/src/main/java/org/apache/iceberg/flink/sink/shuffle/ShuffleOperator.java

+  @Override
+  public void initializeState(StateInitializationContext context) throws Exception {
+    localDataStatistics = dataStatisticsFactory.createDataStatistics();
+    globalDataStatisticsState =


we need to check if context is restored

if (context.isRestored()) {

will add the check to line 81

if (context.isRestored() && globalStatisticsState.get() != null && globalStatisticsState.get().iterator().hasNext())

since we still need to initialize the globalDataStatisticsState variable.
If we set globalDataStatisticsState to null when context.isRestored() = false, then when taking snapshot, we cannot execute globalStatisticsState.add(globalStatistics)

yegangy0718 · 2023-02-22T07:39:02Z

Hi @stevenzwu if we decide to keep shuffling implementation in the Iceberg repo, could you help to take another look at the PR? Thanks!

stevenzwu · 2023-02-24T22:47:51Z

flink/v1.16/flink/src/main/java/org/apache/iceberg/flink/sink/shuffle/ShuffleOperator.java

+ * shuffle record to improve data clustering while maintaining relative balanced traffic
+ * distribution to downstream subtasks.
+ */
+class ShuffleOperator<T, K> extends AbstractStreamOperator<ShuffleRecordWrapper<T, K>>


I have been thinking about the name of this class. This operator technically only does statistics calculation. Hence ShuffleOperator sounds like a misleading name. But StatisticsOperator is too generic. Maybe DataStatisticsOperator?

DataStatisticsOperator is better than StatisticsOperator. I'm OK to update the class name since this operator indeed only collects data statistics. I will rename ShuffleRecordWrapper to DataStatisticsAndRecordWrapper as well.

hililiwei · 2023-03-03T09:15:53Z

.../v1.16/flink/src/main/java/org/apache/iceberg/flink/sink/shuffle/DataStatisticsOperator.java

+
+  // keySelector will be used to generate key from data for collecting data statistics
+  private final KeySelector<T, K> keySelector;
+  private final OperatorEventGateway operatorEventGateway;


it doesn't seem to be in use.

it will be used in the next PR with the shuffle coordinator to aggregate statistics globally

hililiwei · 2023-03-03T09:27:45Z

.../v1.16/flink/src/main/java/org/apache/iceberg/flink/sink/shuffle/DataStatisticsOperator.java

+ * shuffle record to improve data clustering while maintaining relative balanced traffic
+ * distribution to downstream subtasks.
+ */
+class DataStatisticsOperator<T, K> extends AbstractStreamOperator<DataStatisticsAndRecordWrapper<T, K>>


a quick question: Do we need to override finish() to cover the bounded case?

I will be fine either way. this streaming operator is mainly responsible for calculating and reporting statistics periodically. with finish(), it is probably not important to report the last statistics.

Note that for real batch jobs, the shuffling will be different. Statistics are probably sampled and calculated based on all the input data. it is a different shuffling compared to stream shuffling.

Ok, I see. Thank you for your explanation.

hililiwei · 2023-03-03T09:41:10Z

.../flink/src/main/java/org/apache/iceberg/flink/sink/shuffle/statistics/MapDataStatistics.java

+  @Override
+  public void add(K key) {
+    // increase count of occurrence by one in the dataStatistics map
+    statistics.put(key, statistics.getOrDefault(key, 0L) + 1L);


nit

Suggested change

statistics.put(key, statistics.getOrDefault(key, 0L) + 1L);

statistics.merge(key, 1L, Long::sum);

will update in latest commit
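
For reference, a small standalone comparison of the two forms. For a map that holds no null values, both leave the map in the same state; `CountIncrementSketch` is a hypothetical name used only for this illustration.

```java
import java.util.HashMap;
import java.util.Map;

class CountIncrementSketch {
  // Original form: read the current count (0 if absent), add one, write it back.
  static void incrementWithGetOrDefault(Map<String, Long> counts, String key) {
    counts.put(key, counts.getOrDefault(key, 0L) + 1L);
  }

  // Suggested form: Map.merge stores 1 if the key is absent,
  // otherwise it combines the old value and 1 with Long::sum.
  static void incrementWithMerge(Map<String, Long> counts, String key) {
    counts.merge(key, 1L, Long::sum);
  }

  public static void main(String[] args) {
    Map<String, Long> a = new HashMap<>();
    Map<String, Long> b = new HashMap<>();
    for (String key : new String[] {"x", "y", "x"}) {
      incrementWithGetOrDefault(a, key);
      incrementWithMerge(b, key);
    }
    System.out.println(a.equals(b));
  }
}
```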

hililiwei · 2023-03-03T10:02:27Z

.../flink/src/main/java/org/apache/iceberg/flink/sink/shuffle/statistics/MapDataStatistics.java

+    Preconditions.checkArgument(
+        otherStatistics instanceof MapDataStatistics,
+        "Can not merge this type of statistics: " + otherStatistics);
+    MapDataStatistics<K> mapDataStatistic = (MapDataStatistics<K>) otherStatistics;


nit: if the input is not of type MapDataStatistics, the cast on line 48 will throw an error on its own. Is this checkArgument still necessary?

I think if the input is not in the right type, it would be better to throw IllegalArgumentException.
I checked the other places as well, for example at

iceberg/api/src/main/java/org/apache/iceberg/SortOrder.java

Line 245 in 9cf9ca2

Preconditions.checkArgument(term instanceof UnboundTerm, "Term must be unbound");

and

iceberg/flink/v1.16/flink/src/main/java/org/apache/iceberg/flink/data/FlinkSchemaVisitor.java

Line 89 in 9cf9ca2

Preconditions.checkArgument(flinkType instanceof RowType, "%s is not a RowType.", flinkType);

It is common to check the type first and then cast to the specific type.

I agree with @yegangy0718 that it is a little better to check the type and throw IllegalArgumentException
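
To make the difference concrete, here is a dependency-free sketch. Guava's `Preconditions.checkArgument` throws `IllegalArgumentException`; the plain `if` below is used as a stand-in to avoid the dependency. The types `Stats`, `MapStats`, and `SketchStats` are hypothetical placeholders, not classes from this PR.

```java
// Hypothetical minimal types; only the check-then-cast pattern matters here.
interface Stats {}

class MapStats implements Stats {}

class SketchStats implements Stats {}

class CheckThenCastSketch {
  // Without the check, a wrong type surfaces as a ClassCastException from the cast itself.
  static MapStats castOnly(Stats other) {
    return (MapStats) other;
  }

  // With the check, the caller gets an IllegalArgumentException with a descriptive message.
  static MapStats checkThenCast(Stats other) {
    if (!(other instanceof MapStats)) {
      throw new IllegalArgumentException(
          "Map statistics can not merge with " + other.getClass().getSimpleName());
    }
    return (MapStats) other;
  }
}
```

Both variants reject the wrong type. The explicit check mainly changes the exception type and lets the author control the message.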

hililiwei · 2023-03-03T10:04:27Z

.../flink/src/main/java/org/apache/iceberg/flink/sink/shuffle/statistics/MapDataStatistics.java

+        otherStatistics instanceof MapDataStatistics,
+        "Can not merge this type of statistics: " + otherStatistics);
+    MapDataStatistics<K> mapDataStatistic = (MapDataStatistics<K>) otherStatistics;
+    mapDataStatistic.statistics.forEach(


also use merge?
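
A sketch of what using `Map.merge` inside the `forEach` could look like when folding one count map into another. `MergeCountsSketch` and `mergeInto` are hypothetical names; the PR's actual field and method names may differ.

```java
import java.util.HashMap;
import java.util.Map;

class MergeCountsSketch {
  // Fold every (key, count) pair of `other` into `target`, summing counts for shared keys.
  static <K> void mergeInto(Map<K, Long> target, Map<K, Long> other) {
    other.forEach((key, count) -> target.merge(key, count, Long::sum));
  }

  public static void main(String[] args) {
    Map<String, Long> target = new HashMap<>();
    target.put("a", 1L);
    target.put("b", 2L);
    Map<String, Long> other = new HashMap<>();
    other.put("b", 3L);
    other.put("c", 4L);
    mergeInto(target, other);
    System.out.println(target);
  }
}
```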

hililiwei · 2023-03-03T10:15:43Z

...16/flink/src/test/java/org/apache/iceberg/flink/sink/shuffle/TestDataStatisticsOperator.java

+import org.junit.Before;
+import org.junit.Test;
+
+public class TestDataStatisticsOperator {


It seems that we haven't tested the recovery case?

yes, for now in the current DataStatisticsOperator implementation, we set globalStatisticsState in function DataStatisticsOperator#snapshotState. But to get globalStatisticsState, it needs to implement the function DataStatisticsOperator#handleOperatorEvent which we plan to do in another PR.

hililiwei · 2023-03-03T10:25:22Z

Left some comments, and I will think about them more deeply in the next few days. Thank you.

stevenzwu · 2023-03-03T16:11:46Z

...link/src/main/java/org/apache/iceberg/flink/sink/shuffle/DataStatisticsAndRecordWrapper.java

+ * shuffling, a filter and mapper are required to filter out the data distribution weight, unwrap the
+ * object and extract the original record type T.
+ */
+public class DataStatisticsAndRecordWrapper<T, K> implements Serializable {


@yegangy0718 ShuffleRecord probably still makes sense for this wrapper class.

I saw some other classes in Iceberg with names like FooAndBar. But they typically contain both Foo and Bar. Here it is an or relationship. Hence I don't know if And is the most accurate name.

Regardless, I prefer that we remove the Wrapper suffix from the name. It can be described in the Javadoc.

The reason I rename it is, in DataStatisticsOperator, it generates the object which contains either global data statistics or record, while the place where shuffle happens is at the partitioner. The name DataStatisticsAndRecord is closer to the usage of the class(transmit global data statistics).

Indeed, like you said, the other class which uses And contains both. For example, RecordAndPosition class, it contains both record and position.
How about naming it DataStatisticsOrRecord even though there is no such kind of or class in the current repo :(
WDYT

DataStatisticsOrRecord sounds good to me
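
A minimal sketch of the "either statistics or record" shape that the Or in the name describes. Only `fromRecord` appears in this PR's diff excerpts; the class name `StatisticsOrRecordSketch`, the `fromStatistics` factory, and the accessor names are assumptions made for illustration.

```java
import java.util.Objects;

// Hypothetical "either" holder: exactly one of the two fields is non-null.
class StatisticsOrRecordSketch<S, T> {
  private final S statistics; // set only for a statistics message
  private final T record; // set only for a data record

  private StatisticsOrRecordSketch(S statistics, T record) {
    this.statistics = statistics;
    this.record = record;
  }

  static <S, T> StatisticsOrRecordSketch<S, T> fromStatistics(S statistics) {
    return new StatisticsOrRecordSketch<>(Objects.requireNonNull(statistics), null);
  }

  static <S, T> StatisticsOrRecordSketch<S, T> fromRecord(T record) {
    return new StatisticsOrRecordSketch<>(null, Objects.requireNonNull(record));
  }

  boolean hasStatistics() {
    return statistics != null;
  }

  boolean hasRecord() {
    return record != null;
  }

  S statistics() {
    return statistics;
  }

  T record() {
    return record;
  }
}
```

A downstream consumer would branch on `hasStatistics()` to update its view of the distribution and otherwise route the record.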

yegangy0718 · 2023-03-15T05:44:57Z

Hi @hililiwei, thanks for reviewing the code change. I have handled and replied to your comments. Let me know if you have more thoughts.

stevenzwu · 2023-03-21T20:31:02Z

.../v1.16/flink/src/main/java/org/apache/iceberg/flink/sink/shuffle/DataStatisticsOrRecord.java

+ * The wrapper class for data statistics and record. It is the only way for data statistics operator to send
+ * global data statistics to custom partitioner to distribute data based on statistics
+ *
+ * <p>DataStatisticsOrRecord is sent from {@link DataStatisticsOperator} to partitioner. It


it seems that we can remove the comments before It contains either ....

We can also change It contains either ... to This wrapper class contains either ...

I actually think it's good to keep DataStatisticsOrRecord is sent from {@link DataStatisticsOperator} to partitioner. since it tells the readers where the class/object is being used. Once readers get the context, then they understand the later part Once partitioner receives the data...

maybe add the additional context after the sentence of it contains either ....

stevenzwu · 2023-03-21T20:36:37Z

.../flink/src/main/java/org/apache/iceberg/flink/sink/shuffle/statistics/MapDataStatistics.java

+    Preconditions.checkArgument(
+        otherStatistics instanceof MapDataStatistics,
+        "Can not merge this type of statistics: " + otherStatistics);
+    MapDataStatistics<K> mapDataStatistic = (MapDataStatistics<K>) otherStatistics;


I agree with @yegangy0718 that it is a little better to check the type and throw IllegalArgumentException

stevenzwu · 2023-03-21T20:38:42Z

.../flink/src/main/java/org/apache/iceberg/flink/sink/shuffle/statistics/MapDataStatistics.java

+  public void merge(DataStatistics<K> otherStatistics) {
+    Preconditions.checkArgument(
+        otherStatistics instanceof MapDataStatistics,
+        "Can not merge this type of statistics: " + otherStatistics);


nit: the error msg can be a little clearer. note that maybe we can add a String type() to the DataStatistics interface. we shouldn't dump the whole statistics from toString() in the error msg.

"Map statistics can not merge with statistics type: " + otherStatistics.type()

how about getting the type for statistics from the class name
"Map statistics can not merge with " + otherStatistics.getClass()

yes, simple class name works well too.

stevenzwu · 2023-03-27T19:45:38Z

.../v1.16/flink/src/main/java/org/apache/iceberg/flink/sink/shuffle/DataStatisticsOperator.java

+
+  @Override
+  public void processElement(StreamRecord<T> streamRecord) throws Exception {
+    final K key = keySelector.getKey(streamRecord.getValue());


nit: Iceberg coding style doesn't use final for stack local variables.

OK. Will remove final

stevenzwu · 2023-03-27T19:46:44Z

.../v1.16/flink/src/main/java/org/apache/iceberg/flink/sink/shuffle/DataStatisticsOperator.java

+  public void processElement(StreamRecord<T> streamRecord) throws Exception {
+    final K key = keySelector.getKey(streamRecord.getValue());
+    localStatistics.add(key);
+    output.collect(new StreamRecord<>(DataStatisticsOrRecord.fromRecord(streamRecord.getValue())));


nit: maybe cache the value at the beginning of this method. T record = streamRecord.getValue(). I saw more of this style in the Iceberg code.

stevenzwu · 2023-03-27T19:48:53Z

.../v1.16/flink/src/main/java/org/apache/iceberg/flink/sink/shuffle/DataStatisticsOperator.java

+    // an exact copy of globalStatistics
+    if (!globalStatistics.isEmpty() && getRuntimeContext().getIndexOfThisSubtask() == 0) {
+      globalStatisticsState.clear();
+      LOG.debug("Saving global statistics {} to state", globalStatistics);


nit: I think this should actually be info level logging. maybe add subtask index to the log.

will update log level to info and add subtask id in latest commit
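
Some background on why only subtask 0 saves the global statistics: with Flink's union list state, every subtask receives the complete list of entries from all subtasks on restore. The Flink-free simulation below illustrates the effect. `UnionStateSketch` and its methods are hypothetical and do not use the Flink state API.

```java
import java.util.ArrayList;
import java.util.List;

// Flink-free simulation of union redistribution; not the Flink state API.
class UnionStateSketch {
  // Each inner list is what one subtask wrote at snapshot time.
  // On restore, union redistribution hands every subtask the concatenation of all lists.
  static <S> List<S> restoreUnion(List<List<S>> perSubtaskSnapshots) {
    List<S> union = new ArrayList<>();
    perSubtaskSnapshots.forEach(union::addAll);
    return union;
  }

  // Build the per-subtask snapshots, optionally letting only subtask 0 write its copy.
  static <S> List<List<S>> snapshot(S globalStatistics, int parallelism, boolean onlySubtaskZero) {
    List<List<S>> snapshots = new ArrayList<>();
    for (int subtask = 0; subtask < parallelism; subtask++) {
      List<S> state = new ArrayList<>();
      if (!onlySubtaskZero || subtask == 0) {
        state.add(globalStatistics);
      }
      snapshots.add(state);
    }
    return snapshots;
  }
}
```

In this simulation, letting all four subtasks write yields four identical copies on restore, while restricting the write to subtask 0 yields exactly one.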

stevenzwu · 2023-03-27T19:51:24Z

.../v1.16/flink/src/main/java/org/apache/iceberg/flink/sink/shuffle/DataStatisticsOperator.java

+  @Override
+  public void snapshotState(StateSnapshotContext context) throws Exception {
+    long checkpointId = context.getCheckpointId();
+    LOG.info("Taking data statistics operator snapshot for checkpoint {}", checkpointId);


nit: add subtask index to the log. E.g. Snapshotting data statistics operator for checkpoint {} in subtask {}

stevenzwu · 2023-03-28T04:10:16Z

thanks @yegangy0718 for the contribution and @pvary and @hililiwei for the review

github-actions bot added the flink label Dec 8, 2022

Implement ShuffleOperator to collect data statistics

aaecd1d

yegangy0718 force-pushed the 20221206-oss-shuffle-operator branch from 65973d0 to aaecd1d Compare December 8, 2022 06:05

pvary reviewed Dec 8, 2022

stevenzwu reviewed Dec 9, 2022

zinking reviewed Dec 12, 2022

gang_ye added 4 commits December 13, 2022 23:45

handle comments and add TestHarness

225bc04

remove map suffix for data statistics

c48bfd1

Add data statistics interface

99532ce

update comments for class

694c705

pvary reviewed Jan 4, 2023

add toString in MapDataStatistics class

a6b5e0b

pvary approved these changes Jan 6, 2023

stevenzwu reviewed Jan 8, 2023

convert DataStatisticsFactory into interface and define MapDataStatis…

4c19c1a

…ticsFactory

yegangy0718 force-pushed the 20221206-oss-shuffle-operator branch from 7330fce to 4c19c1a Compare January 18, 2023 02:03

gang_ye added 2 commits January 17, 2023 22:20

Add link in comments

c00e9d1

remove transient

62a0da4

stevenzwu reviewed Feb 24, 2023

rename operator from ShuffleOperator to DataStatisticsOperator

bdabda5

hililiwei reviewed Mar 3, 2023

stevenzwu reviewed Mar 3, 2023

gang_ye added 2 commits March 3, 2023 23:17

use merge to add key or merge statistics and rename DataStatisticsAnd…

c69e2c5

…RecordWrapper to DataStatisticsOrRecord

use unionListState in operator and fix unit test

371a7b6

yegangy0718 commented Mar 15, 2023

stevenzwu reviewed Mar 22, 2023

add transient volatile to operator statistics variables and only save…

0fd1a86

… global statistics for subtask0

stevenzwu reviewed Mar 27, 2023

add subtask id to snapshot log

6086285

stevenzwu approved these changes Mar 28, 2023

stevenzwu changed the title ~~Implement ShuffleOperator to collect data statistics~~ Flink: Implement data statistics operator to collect traffic distribution for guiding smart shuffling Mar 28, 2023

stevenzwu merged commit 47f42f5 into apache:master Mar 28, 2023

This was referenced Apr 19, 2023

Flink: add more sink shuffling support #6303

Closed

The serialization problem caused by Flink shuffling design #7393

Closed

yegangy0718 mentioned this pull request Apr 21, 2023

Flink: Backport #6382 and #7269 to 1.15 for shuffle operator #7400

Merged

