Spark 3.2: Implement merge-on-read MERGE #4047
Conversation
sql("ALTER TABLE %s ADD PARTITION FIELD dep", tableName);
append(tableName,
I had to adapt this test and use a trick similar to the one in a few other places where we delete files we don't expect to query. Otherwise, I'd need different checks for copy-on-write and merge-on-read.
// merge mode is NOT SET -> rely on write distribution and ordering as a basis
// merge mode is NONE -> unspecified distribution +
//                       LOCALLY ORDERED BY _spec_id, _partition, _file, _pos, date, days(ts)
// merge mode is HASH -> CLUSTER BY _spec_id, _partition, date, days(ts) +
This means we may split records for a partition into a number of tasks if that partition is in the old spec and is not aligned with the current partitioning (i.e. date, days(ts) in this case). Not sure how to avoid this, though.
I just thought about this for the case above. I think it's okay.
// LOCALLY ORDERED BY _spec_id, _partition, _file, _pos, date, id
// merge mode is HASH -> CLUSTER BY _spec_id, _partition, date +
//                       LOCALLY ORDER BY _spec_id, _partition, _file, _pos, date, id
// merge mode is RANGE -> RANGE DISTRIBUTE BY _spec_id, _partition, _file, date, id
Note that we don't include _pos in the distribution. Just in the local ordering.
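The rule above can be sketched as follows. This is an illustrative helper, not the actual Iceberg API: `_pos` is part of the local ordering but not the distribution, so all deletes for one data file land in one task and are then sorted by position.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class DeleteDistributionSketch {
  static final List<String> META_CLUSTER = Arrays.asList("_spec_id", "_partition", "_file");

  // distribution omits _pos so one file's deletes stay in one task
  static List<String> distribution(List<String> tableOrdering) {
    List<String> result = new ArrayList<>(META_CLUSTER);
    result.addAll(tableOrdering);
    return result;
  }

  // local ordering adds _pos so deletes are written in position order
  static List<String> localOrdering(List<String> tableOrdering) {
    List<String> result = new ArrayList<>(META_CLUSTER);
    result.add("_pos");
    result.addAll(tableOrdering);
    return result;
  }
}
```

For a table ordered by `date, id`, this reproduces the quoted RANGE case: distribute by `_spec_id, _partition, _file, date, id` and locally order by `_spec_id, _partition, _file, _pos, date, id`.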
//
// UNPARTITIONED UNORDERED
// -------------------------------------------------------------------------
// merge mode is NOT SET -> rely on write distribution and ordering as a basis
I think by this you mean that we will check the write distribution mode and then use one of the following cases, right? So we can expect unspecified distribution + LOCALLY ORDER BY _spec_id, _partition, _file, _pos, because all three modes (whether for merge or write) have that behavior?
Correct, I just did not want to duplicate all possible values of the write property. If the merge mode is not set, use the write mode and adapt it to cover deletes.
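That fallback can be sketched like this. The property names are Iceberg's documented table properties, but the resolution helper itself is illustrative:

```java
import java.util.Map;

class ModeResolutionSketch {
  static final String WRITE_MODE = "write.distribution-mode";
  static final String MERGE_MODE = "write.merge.distribution-mode";

  // if no merge-specific mode is set, fall back to the general write mode
  static String resolveMergeMode(Map<String, String> props) {
    String mergeMode = props.get(MERGE_MODE);
    if (mergeMode != null) {
      return mergeMode;
    }
    return props.getOrDefault(WRITE_MODE, "none");
  }
}
```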
// merge mode is NOT SET -> rely on write distribution and ordering as a basis
// merge mode is NONE -> unspecified distribution +
//                       LOCALLY ORDER BY _spec_id, _partition, _file, _pos, id, data
// merge mode is HASH -> unspecified distribution +
Why is the distribution unspecified when mode is hash here? Shouldn't this hash distribute by original data file (or maybe original partition), then by the new partition? Or is the fear that this will create too many small tasks by dividing original data files by the new partitioning?
If that's the case, then it seems like this is optimizing for the case where you're running a MERGE with data from an old partition spec. I'd rather optimize for the case where the partition spec matches.
Oh, I think I see. I was thinking about the PARTITIONED BY, UNORDERED case that is actually below. I concluded what you did for that case, so that's good validation!
Here, it still seems bad to me not to distribute. That's going to result in a lot of small delete files, which is really expensive and possibly worse than having a single writer for all the inserted data. It would be nice to be able to round-robin the new data... what about using something like HASH DISTRIBUTE BY _spec, _partition, bucket(id, data, numShufflePartitions)?
Well, I am not sure. I like that our merge and write logic are consistent right now. My hope was that AQE would coalesce tasks as needed to avoid a huge number of small writing tasks (and hence a huge number of delete files). I think AQE should behave better than a round-robin distribution. This case is about unpartitioned tables, so we will most likely produce a single delete file per writing task (that shouldn't be too bad). As long as we don't have a huge number of writing tasks, we should be fine, I guess?
We chatted offline and decided that clustering by _spec_id, _partition, and _file is a good idea to avoid a large number of delete files.
Modified the behavior for HASH and RANGE modes.
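A small illustrative sketch of why that clustering matters (the helper is hypothetical): without clustering, every writing task emits its own delete file for each data file it deletes from, whereas clustering by `_spec_id`, `_partition`, `_file` colocates all deletes for a data file into one task.

```java
import java.util.List;
import java.util.stream.Collectors;

class DeleteFileCountSketch {
  // each task writes one delete file per distinct data file it touches
  static long deleteFileCount(List<List<String>> taskFileDeletes) {
    return taskFileDeletes.stream()
        .mapToLong(files -> files.stream().distinct().count())
        .sum();
  }

  // regroup the same deletes so each data file is handled by one task
  static List<List<String>> clusterByFile(List<List<String>> taskFileDeletes) {
    return taskFileDeletes.stream()
        .flatMap(List::stream)
        .distinct()
        .map(List::of)
        .collect(Collectors.toList());
  }
}
```

For example, two tasks deleting from files (f1, f2) and (f1, f3) would write four delete files unclustered, but only three after clustering by file.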
rdblue left a comment:
Everything looks really close!
One question that I think I've answered, but I want to make sure about: when MERGE INTO gets rewritten to INSERT new rows because there are no MATCHED actions, we don't create a RowLevelOperationTable. That's the only place where we check the write mode, possibly create a delta operation, and go down the delta sort code path. Right?
I was briefly concerned that we might use a broken sort order for the INSERT cases, assuming that we create the sort order for deltas but actually rewrite to insert. But it looks like the write is always specific to the actual write operation.
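The planning rule being confirmed here can be sketched as follows. The enum and helper are illustrative, not the actual rewrite rule: a MERGE with no MATCHED actions never modifies existing rows, so it can be planned as a plain append rather than a row-level (delta) operation.

```java
class MergePlanSketch {
  enum Plan { APPEND_ONLY, ROW_LEVEL_DELTA }

  // only MATCHED actions (UPDATE/DELETE) touch existing rows,
  // so without them the MERGE reduces to an INSERT of new rows
  static Plan planMerge(int matchedActions, int notMatchedActions) {
    return matchedActions == 0 ? Plan.APPEND_ONLY : Plan.ROW_LEVEL_DELTA;
  }
}
```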
@rdblue, you are right. There will be no
szehon-ho left a comment:
Looks good, just some very minor comments
// use inner join if there is no NOT MATCHED action, unmatched source rows can be discarded
// use right outer join in all other cases, unmatched source rows may be needed
// also disable broadcasts for the target table to perform the cardinality check later
Was also trying to compare the two code paths. Noticed this one has a 'later' in the comment and the other does not; the 'later' seems clearer. Actually, I'm wondering: did you consider putting some of this repeated code and these comments into shared methods/scaladoc?

val sourceTableProj = sourceTableProj(source)
val joinPlan = joinPlan(sourceTableProj, targetTableProj, cond, notMatchedActions)
Removed "later" for consistency in comments. The comments in copy-on-write and merge-on-read are slightly different as different join types are used. I'll take a look at what extra methods we can introduce to simplify this. I did that in a few places already.
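The join selection quoted above can be sketched with an illustrative helper (not the actual RewriteMergeIntoTable code): unmatched source rows matter only when a NOT MATCHED action exists, and broadcasting the target side is disabled separately so the cardinality check can run on the join output.

```java
class MergeJoinSketch {
  // inner join may discard unmatched source rows; right outer keeps them
  static String joinType(boolean hasNotMatchedActions) {
    return hasNotMatchedActions ? "right_outer" : "inner";
  }
}
```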
  return deleteOrdering;
} else {
  throw new IllegalArgumentException("Only position deletes and updates are currently supported");
// all metadata columns like _spec_id, _file, _pos will be null for new data records
Optional: just noticed we throw IllegalArgumentException in other places where we have an unknown command, so was just wondering would it be clearer/more consistent to put this in a MERGE case block and throw exception in unknown command case? Up to you
It looks like we usually do that in switch blocks as it's required by the syntax. I think it should be safe enough to assume we know a finite set of commands.
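To make the trade-off concrete, here is a sketch of the switch-style alternative being discussed; the command set and return strings are illustrative, not the actual Iceberg code:

```java
class CommandOrderingSketch {
  // a default branch throwing IllegalArgumentException guards against
  // unknown commands, as the syntax of switch blocks encourages
  static String orderingFor(String command) {
    switch (command) {
      case "DELETE":
        return "position ordering only";
      case "UPDATE":
      case "MERGE":
        return "position ordering + table ordering";
      default:
        throw new IllegalArgumentException("Unsupported command: " + command);
    }
  }
}
```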
Thanks for reviewing, @rdblue @szehon-ho!
Awesome work, @aokolnychyi!
This PR implements merge-on-read MERGE in Spark 3.2.
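As a rough illustration of what merge-on-read means here (the types are hypothetical, not the Iceberg implementation): instead of rewriting whole data files as copy-on-write does, MERGE writes position delete records, and readers filter out deleted (file, position) pairs at scan time.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

class MergeOnReadSketch {
  // a position delete identifies a row by its data file and row position
  static String key(String file, long pos) {
    return file + "#" + pos;
  }

  // read-time merge: drop rows whose (file, pos) has a position delete
  // each row is encoded as {file, position, value} for simplicity
  static List<String> applyDeletes(List<String[]> rows, Set<String> deletes) {
    return rows.stream()
        .filter(r -> !deletes.contains(key(r[0], Long.parseLong(r[1]))))
        .map(r -> r[2])
        .collect(Collectors.toList());
  }
}
```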