Flink: FLIP-27 Iceberg source and builder that puts everything together #5109

stevenzwu · 2022-06-21T19:02:19Z

No description provided.

stevenzwu · 2022-06-21T19:08:22Z

flink/v1.15/flink/src/main/java/org/apache/iceberg/flink/data/FlinkParquetReaders.java

    protected ArrayData buildList(ReusableArrayData list) {
-      list.setNumElements(writePos);
-      return list;
+      // Since ReusableArrayData is not accepted by Flink, use GenericArrayData temporarily to walk around it.


The FLIP-27 source reader needs to do a deep copy when handing over a batch/array of records from the reader thread to the Flink operator thread. I reverted this change from PR #4712 so that the CI build for this PR can pass for now.

We should discuss how to fix this forward. @yittg, I would love to get your input.

We may need to update/fix the ArrayDataSerializer#copy method from Flink first.

    @Override
    public ArrayData copy(ArrayData from) {
      if (from instanceof GenericArrayData) {
        return copyGenericArray((GenericArrayData) from);
      } else if (from instanceof ColumnarArrayData) {
        return copyColumnarArray((ColumnarArrayData) from);
      } else if (from instanceof BinaryArrayData) {
        return ((BinaryArrayData) from).copy();
      } else {
        return toBinaryArray(from);
      }
    }

I just realized that ReusableArrayData actually comes from Iceberg's own code, in FlinkParquetReaders.

I don't really get it. I think the fix in Flink is included in 1.15.

The FLIP-27 source reader uses this RowDataUtil utility method to clone the RowData as it batches records for thread handover. It depends on ArrayDataSerializer to clone the field. With the change from PR #4712, the ReusableArrayData object is reused rather than cloned, which corrupts the batched RowData array.

    public static RowData clone(
        RowData from, RowData reuse, RowType rowType, TypeSerializer[] fieldSerializers) {
      GenericRowData ret;
      if (reuse instanceof GenericRowData) {
        ret = (GenericRowData) reuse;
      } else {
        ret = new GenericRowData(from.getArity());
      }
      ret.setRowKind(from.getRowKind());
      for (int i = 0; i < rowType.getFieldCount(); i++) {
        if (!from.isNullAt(i)) {
          RowData.FieldGetter getter = RowData.createFieldGetter(rowType.getTypeAt(i), i);
          ret.setField(i, fieldSerializers[i].copy(getter.getFieldOrNull(from)));
        } else {
          ret.setField(i, null);
        }
      }
      return ret;
    }
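The failure mode described in this thread can be reproduced without Flink. The sketch below is a simplified stand-in, not the Iceberg or Flink code: `ReusableIntArray` and `BatchingDemo` are hypothetical names, and the simulated reader reuses one mutable container for every record. Batching the reference leaves every slot pointing at the last record, while batching a deep copy keeps each record intact.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a reader-owned, reusable array container.
final class ReusableIntArray {
  int[] values = new int[0];

  // Replaces the contents; the container object itself stays the same.
  void set(int... newValues) {
    values = newValues;
  }

  // Returns an independent container with its own backing array.
  ReusableIntArray deepCopy() {
    ReusableIntArray copy = new ReusableIntArray();
    copy.values = Arrays.copyOf(values, values.length);
    return copy;
  }
}

final class BatchingDemo {
  // Simulates a reader filling a batch before handing it to another thread.
  static List<ReusableIntArray> readBatch(int recordCount, boolean deepCopy) {
    ReusableIntArray reused = new ReusableIntArray();
    List<ReusableIntArray> batch = new ArrayList<>();
    for (int i = 0; i < recordCount; i++) {
      reused.set(i, i * 10);
      batch.add(deepCopy ? reused.deepCopy() : reused);
    }
    return batch;
  }

  public static void main(String[] args) {
    // Without a copy, the first slot shows the second record's values.
    System.out.println(Arrays.toString(readBatch(2, false).get(0).values));
    // With a deep copy, the first slot keeps the first record's values.
    System.out.println(Arrays.toString(readBatch(2, true).get(0).values));
  }
}
```

This matches the symptom reported for the failing test, where the array field has the same value for all records.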

@stevenzwu Do you have a failing test case? Let me have a look or reproduce it locally. I'm still confused about why it does not work.

@yittg you can check out the dev branch for this PR and run TestIcebergSourceBounded. The testCustomizedFlinkDataTypes method should fail, because the array field has the same value for all records.

To make the diff easier to read, you can change the record count from 10 to 2.

List<Record> records = RandomGenericData.generate(schema, 10, 0L);

Thanks @stevenzwu,
I finally got the point: the ArrayDataSerializer in Flink has to be renewed each time because it reuses the BinaryArrayData internally. I think we can change the signature of RowDataUtil#clone to accept a supplier of serializers to work around it for now.

I looked into Flink to see why map data works but array data does not, given the similar implementation. I found that MapDataSerializer#toBinaryMap always has copy semantics implicitly, but ArrayDataSerializer#toBinaryArray does not.

See https://issues.apache.org/jira/browse/FLINK-28214
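One way to picture the proposed workaround is the sketch below. It is an illustration under assumptions, not the actual signature of RowDataUtil#clone: `BufferReusingCopier` is a hypothetical class that hands back its internal buffer from `copy`, modeling the reuse described in this comment, and `cloneAll` takes a `Supplier` so that every value is copied by a fresh copier.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical copier that reuses one internal buffer across copy() calls.
final class BufferReusingCopier {
  private final StringBuilder reusedBuffer = new StringBuilder();

  StringBuilder copy(String from) {
    reusedBuffer.setLength(0);
    reusedBuffer.append(from);
    return reusedBuffer; // the caller receives the shared buffer, not a copy
  }
}

final class SupplierWorkaround {
  // Asks the supplier for a copier per value; results alias only if the supplier reuses one.
  static List<StringBuilder> cloneAll(
      List<String> values, Supplier<BufferReusingCopier> copiers) {
    List<StringBuilder> result = new ArrayList<>();
    for (String value : values) {
      result.add(copiers.get().copy(value));
    }
    return result;
  }

  public static void main(String[] args) {
    List<String> input = Arrays.asList("a", "b");

    // One shared copier: both entries end up pointing at the same buffer.
    BufferReusingCopier shared = new BufferReusingCopier();
    System.out.println(cloneAll(input, () -> shared));

    // A fresh copier per value: each entry owns its buffer.
    System.out.println(cloneAll(input, BufferReusingCopier::new));
  }
}
```

Creating a copier per value costs extra allocations, which is presumably why this is framed as a stopgap until the copy semantics are fixed upstream.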

stevenzwu · 2022-06-21T19:09:19Z

flink/v1.15/flink/src/main/java/org/apache/iceberg/flink/source/ScanContext.java

  }

-  public StreamingStartingStrategy startingStrategy() {
+  public StreamingStartingStrategy streamingStartingStrategy() {


ScanContext is an internal class. This change shouldn't break any users.

rdblue · 2022-06-24T16:11:38Z

flink/v1.15/flink/src/main/java/org/apache/iceberg/flink/source/IcebergSource.java

+    try (TableLoader loader = tableLoader) {
+      return loader.loadTable();
+    } catch (IOException e) {
+      throw new RuntimeException("Failed to close table loader", e);


Should this be UncheckedIOException instead of generic RuntimeException?
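For reference, the suggested pattern looks roughly like the following. This is a minimal sketch with hypothetical names (`FailingLoader`, `UncheckedCloseDemo`) rather than the Iceberg classes; only `java.io.UncheckedIOException` and try-with-resources are standard. The wrapper keeps the original `IOException` as a typed cause, so callers can still tell I/O failures apart from other runtime errors.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical loader whose close() always fails, standing in for a table loader.
final class FailingLoader implements Closeable {
  String load() {
    return "table";
  }

  @Override
  public void close() throws IOException {
    throw new IOException("close failed");
  }
}

final class UncheckedCloseDemo {
  static String loadAndClose(FailingLoader source) {
    try (FailingLoader loader = source) {
      return loader.load();
    } catch (IOException e) {
      // The cause stays an IOException instead of being hidden in a generic RuntimeException.
      throw new UncheckedIOException("Failed to close table loader", e);
    }
  }

  public static void main(String[] args) {
    try {
      loadAndClose(new FailingLoader());
    } catch (UncheckedIOException e) {
      System.out.println(e.getMessage() + ": " + e.getCause().getMessage());
    }
  }
}
```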

rdblue · 2022-06-24T16:12:30Z

flink/v1.15/flink/src/main/java/org/apache/iceberg/flink/source/IcebergSource.java

+
+  @Override
+  public Boundedness getBoundedness() {
+    return scanContext.isStreaming() ? Boundedness.BOUNDED : Boundedness.CONTINUOUS_UNBOUNDED;


This is backwards. Good catch, @stevenzwu!
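The commit list for this PR includes "fix getBoundedness bug and use UncheckedIOException". A corrected mapping would presumably look like the sketch below. The enum is a local stand-in that mirrors the two constants in the diff, and `BoundednessDemo` is a hypothetical name used only for illustration.

```java
// Local stand-in mirroring the two constants used in the diff.
enum Boundedness {
  BOUNDED,
  CONTINUOUS_UNBOUNDED
}

final class BoundednessDemo {
  // A streaming scan never finishes, so it maps to CONTINUOUS_UNBOUNDED.
  static Boundedness boundednessFor(boolean isStreaming) {
    return isStreaming ? Boundedness.CONTINUOUS_UNBOUNDED : Boundedness.BOUNDED;
  }

  public static void main(String[] args) {
    System.out.println(boundednessFor(true));
    System.out.println(boundednessFor(false));
  }
}
```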

rdblue · 2022-06-24T16:17:41Z

flink/v1.15/flink/src/main/java/org/apache/iceberg/flink/source/IcebergSource.java

+      return this;
+    }
+
+    public IcebergSource<T> build() {


You could use public <T> IcebergSource<T> build() here to allow a bit easier type customization.

rdblue · 2022-06-24T16:21:08Z

flink/v1.15/flink/src/main/java/org/apache/iceberg/flink/source/IcebergSource.java

+    }
+  }
+
+  public static <T> Builder<T> builder() {


You could use builderForRowData here also.

If you do that, you may want to pass the ReaderFunction<T> in here.

When I tried to implement the RowData factory method, I found that it needs to duplicate 3 of the ScanContext configs for the RowDataReaderFunction.

    public static Builder<RowData> builderForRowData(
        Configuration readConfig, Table table, ScanContext context) {
      ReaderFunction<RowData> readerFunction =
          new RowDataReaderFunction(
              readConfig,
              table.schema(),
              context.project(),
              context.nameMapping(),
              context.caseSensitive(),
              table.io(),
              table.encryption());
      return new Builder<RowData>().readerFunction(readerFunction);
    }

Personally, I would like to make ScanContext public and expose it to users. Then it is OK to have a method like Builder<RowData> builderForRowData(Configuration readConfig, Table table, ScanContext context). We can also avoid duplicating more than a dozen methods from ScanContext in IcebergSource$Builder. Users can construct the ScanContext once.

We can probably do better here. Why not initialize readerFunction in the build method if it is null?
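The idea of initializing readerFunction inside build() when it is null could look roughly like this. All names here (`SketchSource`, `SketchBuilder`) are hypothetical and heavily simplified; a plain `Function<String, T>` stands in for the reader function. The unchecked cast is an assumption of this sketch and is only safe when the caller's `T` matches what the default reader produces.

```java
import java.util.Locale;
import java.util.function.Function;

// Hypothetical, simplified source that applies a reader function to raw input.
final class SketchSource<T> {
  private final Function<String, T> readerFunction;

  SketchSource(Function<String, T> readerFunction) {
    this.readerFunction = readerFunction;
  }

  T read(String raw) {
    return readerFunction.apply(raw);
  }
}

final class SketchBuilder<T> {
  private Function<String, T> readerFunction;
  private boolean caseSensitive = true;

  SketchBuilder<T> readerFunction(Function<String, T> newReaderFunction) {
    this.readerFunction = newReaderFunction;
    return this;
  }

  SketchBuilder<T> caseSensitive(boolean newCaseSensitive) {
    this.caseSensitive = newCaseSensitive;
    return this;
  }

  @SuppressWarnings("unchecked")
  SketchSource<T> build() {
    if (readerFunction == null) {
      // Derive the default from settings the builder already holds, so callers
      // do not have to repeat them when constructing a reader themselves.
      boolean sensitive = caseSensitive;
      Function<String, String> defaultReader =
          raw -> sensitive ? raw : raw.toLowerCase(Locale.ROOT);
      // Unchecked: only correct when T is String (an assumption of this sketch).
      readerFunction = (Function<String, T>) (Function<String, ?>) defaultReader;
    }
    return new SketchSource<>(readerFunction);
  }
}

final class BuilderDefaultDemo {
  public static void main(String[] args) {
    SketchSource<String> source = new SketchBuilder<String>().caseSensitive(false).build();
    System.out.println(source.read("ID"));
  }
}
```

The trade-off is that the type safety of the default path rests on convention, while the explicit `readerFunction(...)` path stays fully typed.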

stevenzwu · 2022-06-24T21:24:10Z

flink/v1.15/flink/src/main/java/org/apache/iceberg/flink/source/IcebergSource.java

+      return this;
+    }
+
+    public Builder caseSensitive(boolean newCaseSensitive) {


@rdblue I would like to revisit the discussion of exposing ScanContext directly to users (instead of replicating the ScanContext methods in the source builder here). Not only can we avoid code duplication, we can also avoid potential out-of-sync problems. In the past, I have seen cases where we added new methods to ScanContext but forgot to add them to the source builder.

@yittg @kbendick what's your take?

I think this is making the API more complicated for the convenience of developers, so I don't think that we should expose it.

rdblue · 2022-06-24T23:43:49Z

Thanks, @stevenzwu!

I know we still need to discuss the builder that makes reading as RowData easy, but I went ahead and merged this because we can add that later.

stevenzwu · 2022-06-25T04:05:46Z

@klam-shop @zoucao This is the PR for the source builder that puts everything together. You can try the MVP version out from the master branch. Your feedback on the source builder/construction is also welcome.

zoucao · 2022-06-25T04:21:03Z

Thanks for your great work, @stevenzwu, and we will try it as soon as possible.

klam-shop · 2022-06-27T13:14:22Z

Thank you @stevenzwu !

stevenzwu added 2 commits June 21, 2022 11:59

revert the change for ReusableArrayData in FlinkParquetReaders as FLI…

ec2f580

…P-27 Flink source does a deep copy to array pool RowData

FLIP-27 IcebergSource that put everything together

5f9e919

github-actions bot added the flink label Jun 21, 2022

stevenzwu commented Jun 21, 2022

rdblue reviewed Jun 24, 2022

fix getBoundedness bug and use UncheckedIOException

1e4f220

stevenzwu commented Jun 24, 2022

rdblue approved these changes Jun 24, 2022

rdblue merged commit 35b8558 into apache:master Jun 24, 2022

stevenzwu deleted the flip27IcebergSource branch June 25, 2022 03:25

zoucao added a commit to zoucao/iceberg that referenced this pull request Jul 4, 2022

Flink: port apache#5109, FLIP-27 Iceberg source and builder to 1.14 m…

5885146

…odule

zoucao mentioned this pull request Jul 4, 2022

Flink: port #5109, FLIP-27 Iceberg source and builder to 1.14 module #5191

Merged

rdblue pushed a commit that referenced this pull request Jul 10, 2022

Flink 1.14: FLIP-27 Iceberg source and builder, port #5109 (#5191)

0f13b2a

namrathamyske pushed a commit to namrathamyske/iceberg that referenced this pull request Jul 10, 2022

Flink: FLIP-27 Iceberg source and builder (apache#5109)

56375f8

namrathamyske pushed a commit to namrathamyske/iceberg that referenced this pull request Jul 10, 2022

Flink: FLIP-27 Iceberg source and builder (apache#5109)

840bf21
