
Conversation

@aokolnychyi
Contributor

@aokolnychyi aokolnychyi commented May 18, 2023

This fixes #7633.

import org.apache.spark.sql.connector.expressions.SortOrder;

/** A set of requirements such as distribution and ordering reported to Spark during writes. */
public class SparkWriteRequirements {
Contributor Author

In the future, this class would also hold the advisory partition size.

*
* <p>Note it is an evolving internal API that is subject to change even in minor releases.
*/
public class SparkWriteUtil {
Contributor Author

I'd recommend checking the tests first to see what changed.

// write mode is NOT SET -> CLUSTER BY date, days(ts) + LOCALLY ORDER BY date, days(ts)
// write mode is NOT SET (fanout) -> CLUSTER BY date, days(ts) + empty ordering
// write mode is NONE -> unspecified distribution + LOCALLY ORDERED BY date, days(ts)
// write mode is NONE (fanout) -> unspecified distribution + empty ordering
Contributor Author
@aokolnychyi commented May 19, 2023

I disabled the local sort by partition columns in regular writes and in CoW operations if fanout writers are enabled. If the table sort order is undefined, there is no need to sort records by partition columns when we are not using clustered writers.
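The mode matrix quoted above can be sketched as a small decision function. This is my own illustrative Java (the class, enum, and summary strings are mine, not Iceberg's API): the write mode picks the distribution, and the fanout flag decides whether the local sort by partition columns is kept.

```java
// Illustrative sketch of the distribution/ordering matrix for a table
// partitioned by date, days(ts). Names are hypothetical, not Iceberg's.
public class WriteRequirementsSketch {

  enum Mode { NOT_SET, NONE }

  // Returns a human-readable summary of the chosen requirements.
  static String requirements(Mode mode, boolean fanout) {
    String clustering =
        (mode == Mode.NONE) ? "unspecified distribution" : "CLUSTER BY date, days(ts)";
    // With fanout writers, records need not arrive grouped by partition,
    // so the local sort by partition columns can be dropped.
    String ordering = fanout ? "empty ordering" : "LOCALLY ORDER BY date, days(ts)";
    return clustering + " + " + ordering;
  }

  public static void main(String[] args) {
    System.out.println(requirements(Mode.NOT_SET, false));
    System.out.println(requirements(Mode.NOT_SET, true));
    System.out.println(requirements(Mode.NONE, false));
    System.out.println(requirements(Mode.NONE, true));
  }
}
```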

Contributor

Curious, would it be useful to have all these distribution and ordering examples accessible in our public docs?

Contributor

@amogh-jahagirdar, yes. It would be good to have these documented.

Contributor Author

Will add. I believe @RussellSpitzer also had a doc update PR; I have yet to check it out.

// UNPARTITIONED UNORDERED
// -------------------------------------------------------------------------
// delete mode is NOT SET -> CLUSTER BY _file + LOCALLY ORDER BY _file, _pos
// delete mode is NOT SET -> CLUSTER BY _file + empty ordering
Contributor Author
@aokolnychyi commented May 19, 2023

I disabled the local sort by _file and _pos in DELETE operations as it does not help much. If we perform a DELETE operation and shuffle the records all over the place, we cluster them by _file before writing. In most cases, records from multiple files will end up in a single task. If we stitch together two sorted chunks into one output file, the order of that file will be broken. So what's the point of doing the sort and potentially spilling to disk? There is a very narrow use case where the old behavior could make sense: a task gets records from only a single file, and that file was properly sorted even though no sort order is defined on the table. I don't think it is a good idea to optimize for that use case. Keep in mind it only happens if the sort order is empty; in most cases, that really means there is no reasonable sort order to preserve.
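A minimal, self-contained illustration of the stitching argument above (the names and numbers are mine, not from the PR): two individually sorted chunks concatenated into one output file are no longer sorted, so the per-chunk local sort bought nothing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class StitchedChunks {

  // True if rows are in non-decreasing order.
  static boolean isSorted(List<Integer> rows) {
    for (int i = 1; i < rows.size(); i++) {
      if (rows.get(i - 1) > rows.get(i)) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    List<Integer> chunkA = Arrays.asList(1, 5, 9);   // sorted rows from file A
    List<Integer> chunkB = Arrays.asList(2, 6, 10);  // sorted rows from file B
    List<Integer> outputFile = new ArrayList<>(chunkA);
    outputFile.addAll(chunkB); // one task writes both chunks into one file
    System.out.println(isSorted(chunkA));      // true
    System.out.println(isSorted(chunkB));      // true
    System.out.println(isSorted(outputFile));  // false: 9 precedes 2
  }
}
```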

//
// PARTITIONED BY date ORDERED BY id
// -------------------------------------------------------------------------
// delete mode is NOT SET -> CLUSTER BY _file + LOCALLY ORDER BY date, id
Contributor Author

I am ditching clustering by _file in favor of clustering by partition columns for CoW operations to reduce the number of produced files. Right now, each output task may get records from various files and partitions; hence, we produce more files than needed. Clustering by _file was originally done to avoid OOM exceptions with too large partitions, which is no longer a problem with AQE writes in Spark 3.4.
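A back-of-the-envelope sketch of the file-count argument above (illustrative numbers and a hypothetical helper of my own, not Iceberg code): each distinct (task, partition) pair produces at least one output file, and clustering by partition columns caps the count at the number of partitions.

```java
import java.util.HashSet;
import java.util.Set;

public class FileCountSketch {

  // Counts output files as distinct (task, partition) pairs for a toy
  // workload where record r belongs to partition r % partitions.
  static int outputFiles(int records, int tasks, int partitions, boolean clusterByPartition) {
    Set<String> taskPartitionPairs = new HashSet<>();
    for (int r = 0; r < records; r++) {
      int partition = r % partitions;
      // Without clustering, records land on effectively arbitrary tasks;
      // with clustering, the task is a function of the partition alone.
      int task = clusterByPartition ? partition % tasks : r % tasks;
      taskPartitionPairs.add(task + ":" + partition);
    }
    return taskPartitionPairs.size();
  }

  public static void main(String[] args) {
    System.out.println(outputFiles(1000, 8, 5, false)); // tasks * partitions = 40 files
    System.out.println(outputFiles(1000, 8, 5, true));  // partitions = 5 files
  }
}
```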

private static final SortOrder[] POSITION_DELETE_ORDERING =
    orderBy(SPEC_ID, PARTITION, FILE_PATH, ROW_POSITION);

private SparkWriteUtil() {}
Contributor Author
@aokolnychyi commented May 19, 2023

This utility would also hold the logic to compute the advisory partition size. It is similar to the old distribution and ordering utility but uses a shorter name/syntax and builds SparkWriteRequirements.

}

public SparkWriteRequirements positionDeltaRequirements(Command command) {
return SparkWriteUtil.positionDeltaRequirements(
Contributor Author

Can't skip distribution and ordering because the spec requires position deletes to be sorted.

Member

Let's make sure to follow up on this in the spec. The sorting requirement is, I think, a little odd, but if we switch to another layout, like a delete bitmap per file, this won't really matter.

Contributor Author
@aokolnychyi commented May 22, 2023

I have a follow-up PR that adds a fanout position delete writer. It buffers deletes into a bitmap and then produces a sorted file when closed. In the future, we will either remove the requirement for position deletes to be sorted or add support for Puffin delete files that would persist bitmaps. I am already looking into that.
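A hedged sketch of the buffering idea described above, using my own stand-in code (this is not the actual Iceberg writer, which would use bitmaps such as roaring bitmaps rather than a `TreeSet`): deletes arrive in any order, are buffered per data file, and come out sorted by (file, pos) on close, as the spec requires.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

public class FanoutPositionDeleteSketch {

  // file path -> sorted set of deleted row positions (bitmap stand-in)
  private final TreeMap<String, TreeSet<Long>> buffer = new TreeMap<>();

  // Accepts deletes in any order, so no upstream sort is required.
  void delete(String file, long pos) {
    buffer.computeIfAbsent(file, f -> new TreeSet<>()).add(pos);
  }

  // On close, emits delete records sorted by (file, pos).
  List<String> close() {
    List<String> sortedDeletes = new ArrayList<>();
    for (Map.Entry<String, TreeSet<Long>> entry : buffer.entrySet()) {
      for (long pos : entry.getValue()) {
        sortedDeletes.add(entry.getKey() + ":" + pos);
      }
    }
    return sortedDeletes;
  }

  public static void main(String[] args) {
    FanoutPositionDeleteSketch writer = new FanoutPositionDeleteSketch();
    writer.delete("b.parquet", 7);
    writer.delete("a.parquet", 3);
    writer.delete("a.parquet", 1);
    System.out.println(writer.close()); // [a.parquet:1, a.parquet:3, b.parquet:7]
  }
}
```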

Contributor Author

That's beyond this release, though.

Context context) {
this.delegate =
new BasePositionDeltaWriter<>(
newInsertWriter(table, writerFactory, dataFileFactory, context),
Contributor Author

Using one data writer now.

@aokolnychyi aokolnychyi force-pushed the distribution-and-ordering-impr branch from 7ff85e6 to 15a08b3 Compare May 19, 2023 21:51
// _file, _pos
// UNPARTITIONED (ORDERED & UNORDERED)
// -------------------------------------------------------------------------
// delete mode is NOT SET -> CLUSTER BY _spec_id, _partition, _file +
Contributor Author

Adding _file to the clustering for unpartitioned tables: _spec_id and _partition will be the same for all rows, so deletes for the same data file may otherwise end up in multiple delete files.

// _file, _pos
// UNPARTITIONED UNORDERED
// -------------------------------------------------------------------------
// update mode is NOT SET -> CLUSTER BY _spec_id, _partition, _file +
Contributor Author

Same as DELETE.

// LOCALLY ORDERED BY _spec_id, _partition, _file, _pos, date, days(ts)
// merge mode is HASH -> CLUSTER BY _spec_id, _partition, date, days(ts) +
// LOCALLY ORDER BY _spec_id, _partition, _file, _pos, date, days(ts)
// merge mode is RANGE -> RANGE DISTRIBUTE BY _spec_id, _partition, _file, date, days(ts) +
Contributor Author

This may be debatable. We initially included _file to split the deletes if there were too many of them, but AQE should be a better option: we can be sure we won't split deletes per partition unless needed.

@aokolnychyi aokolnychyi added this to the Iceberg 1.3.0 milestone May 19, 2023
private final Distribution distribution;
private final SortOrder[] ordering;

public SparkWriteRequirements(Distribution distribution, SortOrder[] ordering) {
Contributor

Does this need to be public or can it be package-private?

Contributor Author
@aokolnychyi commented May 22, 2023

Good idea, I'll change. The class must stay public, but I can make the constructor package-private.

public void update(InternalRow meta, InternalRow id, InternalRow row) throws IOException {
delete(meta, id);
delegate.update(row, dataSpec, null);
throw new UnsupportedOperationException("Update must be represented as delete and insert");
Contributor

This makes me happy that we actually write updates as separate deletes and inserts.

Contributor Author

Makes clustering easier!

} else {
return new ClusteredDataWriter<>(writerFactory, fileFactory, table.io(), targetFileSize);
}
private PartitioningWriter<InternalRow, DataWriteResult> newDataWriter(
Contributor

Can you help me understand why this no longer supports the explicit fanout writer option?

I think it's because we are always adding a shuffle after the merge or update anyway so it never matters?

Contributor Author
@aokolnychyi commented May 22, 2023

We always add a local sort as it is required to write out position deletes. That's why we always add a sort by partition columns if the table is partitioned. Therefore, we should be able to always use a clustered writer.

@rdblue
Contributor

rdblue commented May 21, 2023

Overall this looks good to me. Thanks for taking the time to work through all the cases with AQE.

Contributor
@amogh-jahagirdar left a comment

Overall looks good to me, just a comment on documentation. Thanks @aokolnychyi !

.booleanConf()
.option(SparkWriteOptions.USE_TABLE_DISTRIBUTION_AND_ORDERING)
.defaultValue(SparkWriteOptions.USE_TABLE_DISTRIBUTION_AND_ORDERING_DEFAULT)
.negate()
Member

This makes me slightly apprehensive, since I don't like flipping the sign in the builder. I do think it's cleaner than just adding a ! to the return statement, though, so +1.

Contributor Author

I debated it but having negation in the main logic bothered me more :)
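The negate() pattern under discussion can be sketched with a hypothetical minimal conf builder (my own stand-in, not Iceberg's actual SparkConfParser API): negate() flips the parsed value, so a positively-named option can back a negatively-phrased flag without a `!` at the call site.

```java
public class BooleanConfSketch {

  // Minimal fluent boolean-conf builder, hypothetical names throughout.
  static class BooleanConf {
    private Boolean optionValue;   // simulated write-option lookup result
    private boolean defaultValue;
    private boolean negate = false;

    BooleanConf option(Boolean value) { this.optionValue = value; return this; }
    BooleanConf defaultValue(boolean value) { this.defaultValue = value; return this; }
    BooleanConf negate() { this.negate = true; return this; }

    // Resolves the option (falling back to the default), then applies negation.
    boolean parse() {
      boolean value = optionValue != null ? optionValue : defaultValue;
      return negate ? !value : value;
    }
  }

  public static void main(String[] args) {
    // "use table distribution and ordering" = true
    // -> "ignore table distribution and ordering" = false
    boolean ignoreTableOrdering =
        new BooleanConf().option(true).defaultValue(true).negate().parse();
    System.out.println(ignoreTableOrdering); // false
  }
}
```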

private static final NamedReference ROW_POSITION = ref(MetadataColumns.ROW_POSITION);

private static final Expression[] FILE_CLUSTERING = clusterBy(FILE_PATH);
private static final Expression[] SPEC_ID_PARTITION_CLUSTERING = clusterBy(SPEC_ID, PARTITION);
Member

Nit: could just call this PARTITION_CLUSTERING; at least in my head, a partition is unique only in combination with the spec ID.

Contributor Author

Makes sense, will change.

Contributor Author

Changed.

case HASH:
  if (table.spec().isUnpartitioned()) {
    Expression[] clustering = concat(SPEC_ID_PARTITION_FILE_CLUSTERING, clustering(table));
    return Distributions.clustered(clustering);
Member

Nit: can pull the return out of this branch statement; that way we wouldn't make two identical clustering variables in different scopes.

Member

I realize, looking at later code, that you did this to avoid exceeding the line length.

Contributor Author
@aokolnychyi commented May 22, 2023

After I renamed the variable above like you suggested, it fits on one line now. I've inlined this call too.

case RANGE:
  if (table.spec().isUnpartitioned()) {
    SortOrder[] ordering = concat(SPEC_ID_PARTITION_FILE_ORDERING, ordering(table));
    return Distributions.ordered(ordering);
Member

same comment about return statements

private static final NamedReference ROW_POSITION = ref(MetadataColumns.ROW_POSITION);

private static final Expression[] FILE_CLUSTERING = clusterBy(FILE_PATH);
private static final Expression[] SPEC_ID_PARTITION_CLUSTERING = clusterBy(SPEC_ID, PARTITION);
Member

Could you use your concat function here?

Contributor Author

I tried it locally, but it was not very obvious for the ordering below. It is a bit more explicit to include all columns one by one.

@aokolnychyi aokolnychyi force-pushed the distribution-and-ordering-impr branch from f39a850 to e5c133e Compare May 22, 2023 22:55
@aokolnychyi aokolnychyi merged commit 2502a23 into apache:master May 23, 2023
@aokolnychyi
Contributor Author

Thanks for reviewing, @amogh-jahagirdar @rdblue @RussellSpitzer!

Successfully merging this pull request may close these issues.

Improve distribution and ordering in Spark 3.4
