
Conversation

Contributor

@Fokko Fokko commented Mar 14, 2023

For Flink, we apply partition pruning, filtering based on metrics, and row-group skipping, but no row-level filtering.
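For context, a minimal sketch of what per-record evaluation of an Iceberg filter expression against Flink RowData could look like; the class and method names here are illustrative assumptions, not the exact code added in this PR:

import org.apache.flink.table.data.RowData;
import org.apache.iceberg.Schema;
import org.apache.iceberg.expressions.Evaluator;
import org.apache.iceberg.expressions.Expression;
import org.apache.iceberg.flink.FlinkSchemaUtil;
import org.apache.iceberg.flink.RowDataWrapper;

// Illustrative helper: returns true when a row satisfies the (residual) filter expression.
public class RowLevelFilterSketch {

  private final Evaluator evaluator;     // binds the expression against the Iceberg schema
  private final RowDataWrapper wrapper;  // adapts Flink RowData to Iceberg's StructLike

  public RowLevelFilterSketch(Schema schema, Expression filter, boolean caseSensitive) {
    this.evaluator = new Evaluator(schema.asStruct(), filter, caseSensitive);
    this.wrapper = new RowDataWrapper(FlinkSchemaUtil.convert(schema), schema.asStruct());
  }

  public boolean keep(RowData row) {
    return evaluator.eval(wrapper.wrap(row));
  }
}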

See the issue for more details

Resolves #7022

@stevenzwu would you have time to take a peek at this one?

Contributor

@doki23 doki23 left a comment

LGTM, except for a potential NPE problem.

@stevenzwu stevenzwu self-requested a review March 18, 2023 15:13
@github-actions github-actions bot added the API label Mar 19, 2023
Contributor

@doki23 doki23 left a comment

Great!


public class FlinkSourceFilter implements FilterFunction<RowData> {

  private final RowDataWrapper wrapper;
Contributor

@stevenzwu stevenzwu Mar 21, 2023

We don't have to make RowDataWrapper serializable; it can be lazily initialized.
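A minimal sketch of the lazily-initialized variant (field names are assumed from the snippet above; illustrative only):

  // The wrapper is transient and built on first use, so the FilterFunction
  // itself stays serializable without making RowDataWrapper serializable.
  private transient RowDataWrapper wrapper;

  @Override
  public boolean filter(RowData value) {
    if (wrapper == null) {
      wrapper = new RowDataWrapper(rowType, struct);  // rowType/struct assumed serializable fields
    }
    return evaluator.eval(wrapper.wrap(value));
  }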

Contributor

Lazy initialization is OK, but making it serializable also makes sense -- that way we don't need to check the if statement for every record.

Contributor Author

I'm open to both. It is just one null check.

}
return env.createInput(format, typeInfo).setParallelism(parallelism);

DataStreamSource<RowData> source =
Contributor

@stevenzwu stevenzwu Mar 21, 2023

This covers one scenario; there are two other scenarios.

  1. Use FlinkInputFormat directly, e.g. in StreamingReaderOperator:
  private void processSplits() throws IOException {
    FlinkInputSplit split = splits.poll();
    if (split == null) {
      currentSplitState = SplitState.IDLE;
      return;
    }

    format.open(split);
    try {
      RowData nextElement = null;
      while (!format.reachedEnd()) {
        nextElement = format.nextRecord(nextElement);
        sourceContext.collect(nextElement);
      }
    } finally {
      currentSplitState = SplitState.IDLE;
      format.close();
    }

    // Re-schedule to process the next split.
    enqueueProcessSplits();
  }
  2. The new Flink FLIP-27 IcebergSource. Here is an example from IcebergTableSource that shows how users can construct the DataStream. We can fix it in IcebergTableSource, but we can't control users' code to add the filter to the DataStream. Note that the FLIP-27 source will be the future Flink source.
  private DataStreamSource<RowData> createFLIP27Stream(StreamExecutionEnvironment env) {
    SplitAssignerType assignerType =
        readableConfig.get(FlinkConfigOptions.TABLE_EXEC_SPLIT_ASSIGNER_TYPE);
    IcebergSource<RowData> source =
        IcebergSource.forRowData()
            .tableLoader(loader)
            .assignerFactory(assignerType.factory())
            .properties(properties)
            .project(getProjectedSchema())
            .limit(limit)
            .filters(filters)
            .flinkConfig(readableConfig)
            .build();
    DataStreamSource stream =
        env.fromSource(
            source,
            WatermarkStrategy.noWatermarks(),
            source.name(),
            TypeInformation.of(RowData.class));
    return stream;
  }

Contributor

Another possible common place to evaluate the residual filter is RowDataFileScanTaskReader. This approach also has the benefit of not changing the shape of the Flink DAG; e.g., users won't see an extra filter function/operator and wonder where it comes from.
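A minimal sketch of what that could look like inside RowDataFileScanTaskReader, assuming a nullable rowFilter field that holds the per-row filter:

    // Wrap the task's record iterable with the residual row filter;
    // no extra operator appears in the Flink DAG.
    if (rowFilter != null) {
      return CloseableIterable.filter(iter, rowFilter::filter);
    }
    return iter;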

@github-actions github-actions bot removed the API label Apr 10, 2023
@Fokko Fokko force-pushed the fd-flink branch 2 times, most recently from 982469e to b876007 on April 10, 2023 19:03
Contributor Author

Fokko commented Apr 10, 2023

@stevenzwu Sorry for the long wait. I've updated the PR according to your suggestion. Looking at the FLIP-27 tests, I think row filtering is already applied there. I also took the liberty of changing some public methods, since Flink 1.17 hasn't been released to the public yet.

Contributor

@stevenzwu stevenzwu left a comment

Thanks for making the change. It seems that performing the filtering at RowDataFileScanTaskReader is the right approach, as it is the common denominator.

}

@Test
public void testBasicRowDataFiltering() throws Exception {
Contributor

This method won't be necessary if we add the new test method above to TestFlinkScan. Currently, TestFlinkScan#testFilterExp only covers a filter on a partition column; we can add the above test method as testResidualFilter, where the filter is constructed on a non-partition column.

TestFlinkScan is the base class for both the old FlinkSource (covered by TestFlinkInputFormat and others) and the new FLIP-27 IcebergSource (covered by TestIcebergSourceBounded and others).
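A rough sketch of such a test, assuming TestFlinkScan's existing helpers and fields (GenericAppenderHelper, runWithFilter, table, fileFormat, SCHEMA, TEMPORARY_FOLDER); the names are illustrative and may differ from the actual harness:

  @Test
  public void testResidualFilter() throws Exception {
    // Filter on the non-partition column "data", so partition pruning alone
    // cannot satisfy the expression and row-level filtering must kick in.
    List<Record> records = RandomGenericData.generate(SCHEMA, 3, 0L);
    records.get(0).setField("data", "a");
    records.get(1).setField("data", "b");
    records.get(2).setField("data", "c");

    GenericAppenderHelper helper = new GenericAppenderHelper(table, fileFormat, TEMPORARY_FOLDER);
    helper.appendToTable(helper.writeFile(records));

    // Only rows with data >= "b" should survive row-level filtering.
    List<Row> actual =
        runWithFilter(Expressions.greaterThanOrEqual("data", "b"), "where data >= 'b'");
    TestHelpers.assertRecords(actual, records.subList(1, 3), SCHEMA);
  }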

Contributor Author

I've added a test in TestFlinkScan that filters on a non-partition column.

Contributor Author

Fokko commented Apr 12, 2023

@stevenzwu thanks again for the review, could you do another pass?

DataFile dataFile = helper.writeFile(expectedRecords);
helper.appendToTable(dataFile);
List<Row> actual =
    runWithFilter(Expressions.greaterThanOrEqual("data", "b"), "where data>='b'");
Contributor

Ideally, we also want to test the case-insensitive option together with a filter, but the current test structure makes that very hard to do; we need both the filter and the options.

Contributor Author

That's a good point. I got it working quite easily when using the expression. Do you know if Flink supports case-insensitive mode when it comes to Flink SQL? There seems to be an open issue about adding such an option: https://issues.apache.org/jira/browse/FLINK-16175
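At the expression level, case insensitivity is just a binding flag; a small sketch independent of the Flink SQL path (names assumed from the earlier sketch):

  // With caseSensitive = false, the reference "DATA" still binds to the schema column "data".
  Evaluator evaluator =
      new Evaluator(schema.asStruct(), Expressions.greaterThanOrEqual("DATA", "b"), false);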

Contributor

@stevenzwu stevenzwu left a comment

LGTM.

I added a comment about testing the case-insensitive scenario, but it seems difficult with the current test code structure.

}

@Test
public void testFilterExpCaseInsensitive() throws Exception {
Contributor

nit: for now, maybe extract a private method to avoid test duplication. With JUnit 5, this can be handled better.
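A sketch of the suggested extraction (illustrative signatures only; the shared helper would hold the setup, runWithFilter call, and assertions that both tests currently duplicate):

  @Test
  public void testFilterExp() throws Exception {
    testFilterExp(Expressions.greaterThanOrEqual("data", "b"), "where data >= 'b'", true);
  }

  @Test
  public void testFilterExpCaseInsensitive() throws Exception {
    // The upper-case column name only resolves when case sensitivity is disabled.
    testFilterExp(Expressions.greaterThanOrEqual("DATA", "b"), "where data >= 'b'", false);
  }

  private void testFilterExp(Expression filter, String sqlFilter, boolean caseSensitive)
      throws Exception {
    // shared setup, scan with the given filter and case sensitivity, and assertions go here
  }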

@Fokko Fokko merged commit c57952e into apache:master Apr 19, 2023
@Fokko Fokko deleted the fd-flink branch April 19, 2023 07:29
Contributor Author

Fokko commented Apr 19, 2023

Thanks @stevenzwu and @doki23 for the review!

@stevenzwu
Contributor

@Fokko thx for fixing this issue

@stevenzwu
Contributor

@Fokko can you also create the backport PR for 1.15 and 1.16?

Contributor Author

Fokko commented Apr 21, 2023

@stevenzwu sure thing! Here you go: #7397

manisin pushed a commit to Snowflake-Labs/iceberg that referenced this pull request May 9, 2023
* Flink: Apply row level filtering

* Fix the tests

* Add test for case-sensitive

* Reduce duplication using a private method
Comment on lines +141 to +143
    if (rowFilter != null) {
      return CloseableIterable.filter(iter, rowFilter::filter);
    }
Contributor

@Fokko @stevenzwu We have an internal request to add row filters dynamically; the current implementation requires the filter to be supplied at job startup time. After some investigation, I believe we may not have to pass the filters all the way down to this class. The task itself already has the row filter in task.residual(). We could simply convert it to a rowFilter here, such as:

//    if (rowFilter != null) {
//      return CloseableIterable.filter(iter, rowFilter::filter);
//    }
    if (task.residual() != null && !task.residual().isEquivalentTo(Expressions.alwaysTrue())) {
      FlinkSourceFilter dataFilter =
              new FlinkSourceFilter(this.projectedSchema, task.residual(), this.caseSensitive);
      return CloseableIterable.filter(iter, dataFilter::filter);
    }

WDYT?

Contributor

> I believe we may not have to pass the filters all the way down to this class. The task itself already has the row filter in task.residual().

If that is the case, it is probably simpler.

> add row filters dynamically

Can you explain what the dynamic part is, and how it is related to the residual filter?

Contributor

> Can you explain what the dynamic part is, and how it is related to the residual filter?

The feature we are developing needs to add some additional scan filters at the planIcebergSourceSplits phase. The additional scan filter is added to the scanContext, so it ends up as part of task.residual(). The additional filter itself is added dynamically.

> If that is the case, it is probably simpler.

Then, if we can just use task.residual(), do you think we should refactor this PR to use that? Is it possible to revert the class interface change in this PR?

Contributor Author

@advancedxy That sounds like a great approach. If we only have to apply the residual, then we're also more efficient. That would be an improvement over the current approach.
