@hililiwei
Contributor

Closes #3816

FlinkInputSplit now extends LocatableInputSplit instead of InputSplit and collects the locations of all files in its CombinedScanTask. DefaultInputSplitAssigner is replaced with LocatableInputSplitAssigner.
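
To illustrate the idea in the description, here is a minimal, self-contained sketch of a split that carries the union of the hosts of all its files. It does not use the Flink or Iceberg classes; `SketchSplit` and the `hostsPerFile` parameter are hypothetical names made up for this example, and the real change may gather locations differently.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a locatable split: it stores a split number
// and the de-duplicated hosts of every file grouped into the split.
final class SketchSplit {
  private final int splitNumber;
  private final String[] hostnames;

  // hostsPerFile holds one String[] of hostnames per data file (an assumption
  // for this sketch; a real source would ask the file system for them).
  SketchSplit(int splitNumber, List<String[]> hostsPerFile) {
    this.splitNumber = splitNumber;
    Set<String> hosts = new LinkedHashSet<>(); // keeps first-seen order
    for (String[] fileHosts : hostsPerFile) {
      hosts.addAll(Arrays.asList(fileHosts));
    }
    this.hostnames = hosts.toArray(new String[0]);
  }

  int splitNumber() {
    return splitNumber;
  }

  String[] hostnames() {
    return hostnames.clone(); // defensive copy
  }
}
```

An assigner can then compare `hostnames()` with the host of the task that requests work.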

@github-actions github-actions bot added the flink label Dec 29, 2021
@hililiwei hililiwei force-pushed the #3816 branch 2 times, most recently from af027b1 to 9e02d1a Compare December 30, 2021 15:10
@hililiwei hililiwei requested a review from rdblue December 30, 2021 15:25
@rdblue
Contributor

rdblue commented Dec 30, 2021

This looks like it is getting close. I'd like @openinx, @stevenzwu, or @kbendick to comment on how this should be configured, though.

@github-actions github-actions bot added the spark label Jan 6, 2022
} else {
contextBuilder.project(FlinkSchemaUtil.convert(icebergSchema, projectedSchema));
}
contextBuilder.exposeLocality(localityEnabled());
Contributor

I think it should be possible to override exposeLocality in this builder so that you can set it differently for different sources. Keeping a boolean in this builder and passing that as an override for the environment property in localityEnabled() should work.

Contributor Author

Please review whether this meets expectations, thanks.

}

public Builder exposeLocality(boolean newExposeLocality) {
contextBuilder.exposeLocality(newExposeLocality);
Contributor

We already set exposeLocality on contextBuilder in the buildFormat method, so we probably don't need to do it here.

Contributor Author

Removed.

}

@Test
public void testExposeLocality() throws Exception {
Contributor

This only verifies that data can be read. It doesn't really verify locality-aware assignment. Ideally, we need 2 files stored on 2 hosts with HDFS and a cluster of TMs running on those two hosts. Then we can verify that the files/splits assigned to each TM come from the same host. But I am not sure if this can be done in a unit test setup.

Contributor Author

Yes, +1, that would be ideal, but I haven't found a way to achieve it. So this test only checks that reads work properly when table.exec.iceberg.expose-split-locality-info is set to false.

Contributor

@hililiwei How does Flink code base test this feature?

Contributor Author

It seems to be tested by manually specifying the hostname. For reference, see: https://github.com/apache/flink/blob/master/flink-core/src/test/java/org/apache/flink/core/io/LocatableSplitAssignerTest.java
Alternatively, we could test it by introducing miniDFS, but the project doesn't seem willing to introduce it.
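
As a rough illustration of testing with manually specified hostnames, the sketch below implements a tiny locality-preferring assigner in plain Java. It is not Flink's LocatableInputSplitAssigner; `SketchLocalityAssigner`, `Split`, and `next` are hypothetical names, and the policy (first local split, otherwise the first pending split) is a simplifying assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical, simplified assigner: prefers a split stored on the
// requesting host and falls back to any remaining split.
final class SketchLocalityAssigner {

  // A split identified by an id plus the hosts that store its data.
  static final class Split {
    final int id;
    final String[] hosts;

    Split(int id, String... hosts) {
      this.id = id;
      this.hosts = hosts;
    }
  }

  private final List<Split> pending;

  SketchLocalityAssigner(List<Split> splits) {
    this.pending = new ArrayList<>(splits); // mutable copy
  }

  // Returns a local split if one is pending, otherwise the first pending
  // split, or null when nothing is left.
  Split next(String requestingHost) {
    for (Iterator<Split> it = pending.iterator(); it.hasNext();) {
      Split split = it.next();
      if (Arrays.asList(split.hosts).contains(requestingHost)) {
        it.remove();
        return split;
      }
    }
    return pending.isEmpty() ? null : pending.remove(0);
  }
}
```

A test in this style creates splits with made-up hostnames, requests work on behalf of each host, and checks which split comes back. It exercises the assignment policy only, not the end-to-end read path.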

Contributor

@stevenzwu stevenzwu Jan 14, 2022

That is a unit test of the assigner, not an e2e test of the whole thing. Apart from the missing e2e test, the PR overall looks good to me. Have you tested this manually in a Hadoop cluster setup?

Contributor Author

@hililiwei hililiwei Jan 17, 2022

Yes, during the test phase, I printed some logs to see if it was working properly, such as this one:
Code: [screenshot: 2022-1-17-2]
Logs: [screenshot: 2022-1-17]

tableConf.setBoolean(FlinkConfigOptions.TABLE_EXEC_ICEBERG_EXPOSE_SPLIT_LOCALITY_INFO.key(), false);

List<Row> results = sql("select * from t");
org.apache.iceberg.flink.TestHelpers.assertRecords(results, expectedRecords, TestFixtures.SCHEMA);
Contributor Author

The fully qualified name is used here because the class name conflicts with org.apache.iceberg.TestHelpers.

@hililiwei hililiwei force-pushed the #3816 branch 2 times, most recently from a663f13 to 0c14e19 Compare January 11, 2022 01:12
@hililiwei hililiwei requested review from kbendick and rdblue January 11, 2022 01:26
} else {
contextBuilder.project(FlinkSchemaUtil.convert(icebergSchema, projectedSchema));
}
contextBuilder.exposeLocality(localityEnabled());
Contributor

@rdblue rdblue Jan 18, 2022

Where is the exposeLocality variable used? If an explicit value is passed to this builder, it should be passed into localityEnabled() so that method can use the setting, but only if the underlying file system is hdfs.

Contributor Author

boolean localityEnabled() {
  Boolean localityConfig =
      this.exposeLocality != null ? this.exposeLocality :
          readableConfig.get(FlinkConfigOptions.TABLE_EXEC_ICEBERG_EXPOSE_SPLIT_LOCALITY_INFO);

It is used here to decide whether to fall back to the Flink config.
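
The precedence discussed in this thread (an explicit builder value wins, otherwise the Flink config applies, and locality is only exposed on hdfs) could be sketched as a pure function. This is an illustrative sketch, not the merged code: `LocalitySetting`, its parameters, and the plain scheme-string check are assumptions made for the example.

```java
// Hypothetical helper that mirrors the precedence described above.
final class LocalitySetting {
  private LocalitySetting() {
  }

  // builderOverride: value set on the builder, or null when not set.
  // configValue: value read from the Flink configuration.
  // fsScheme: scheme of the underlying file system, for example "hdfs".
  static boolean localityEnabled(Boolean builderOverride, boolean configValue, String fsScheme) {
    // An explicit builder value takes priority over the configuration.
    boolean requested = builderOverride != null ? builderOverride : configValue;
    // Locality information is only meaningful when the data lives on hdfs.
    return requested && "hdfs".equalsIgnoreCase(fsScheme);
  }
}
```

Keeping the override as a nullable `Boolean` lets "not set" be told apart from an explicit `false`, which is why a primitive `boolean` field would not be enough here.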

@hililiwei hililiwei force-pushed the #3816 branch 2 times, most recently from be3c954 to 52808f5 Compare January 19, 2022 05:17
@hililiwei hililiwei requested a review from rdblue January 19, 2022 08:05
Contributor

@stevenzwu stevenzwu left a comment

LGTM

@hililiwei
Contributor Author

@rdblue @kbendick Please take a look. Are any further changes needed?

return parallelism;
}

boolean localityEnabled() {
Contributor

Should this be private?

Contributor Author

Done. The test case was also modified.

Contributor

@rdblue rdblue left a comment

Looks mostly good. I think we just need to fix a few style issues in the localityEnabled() method.

@rdblue rdblue merged commit d43cb4c into apache:master Feb 14, 2022
@rdblue
Contributor

rdblue commented Feb 14, 2022

Thanks, @hililiwei!
