
Conversation

@szehon-ho (Member):

When running add_files on a partitioned source table, I noticed that a significant amount of time is spent in the Hive listPartitions call.

It might be good to push the filter down to Hive (a faster database query on just the partitions in question, and less traffic serialized/deserialized across the wire).

@szehon-ho szehon-ho changed the title Core : Add Files Perf improvement by push down partition filter to Spark/Hive catalog Spark : Add Files Perf improvement by push down partition filter to Spark/Hive catalog Jul 2, 2021
@szehon-ho (Member Author):

@RussellSpitzer could you take a look if you have some time?

CatalogTable catalogTable = catalog.getTableMetadata(tableIdent);

Seq<CatalogTablePartition> partitions = catalog.listPartitions(tableIdent, Option.empty());
Option<scala.collection.immutable.Map<String, String>> partSpec =
Member:

I am not smart enough to hold this whole block in my head. Can we break this down into multiple lines?

One line to make the Scala Map,
one line to wrap it in an Option?

@RussellSpitzer (Member), Jul 7, 2021:

Or could just do something like

Option<scala.collection.mutable.Map<String, String>> scalaFilter =
          Option.apply(scala.collection.JavaConverters.mapAsScalaMapConverter(pkfilter).asScala());

Member:

But you could static-import JavaConverters here too to make it shorter.

* @return all table's partitions
*/
public static List<SparkPartition> getPartitions(SparkSession spark, TableIdentifier tableIdent) {
public static List<SparkPartition> getPartitions(SparkSession spark, TableIdentifier tableIdent,
Member:

Do we benefit from having this as a Java Optional? Since we have to immediately convert it to Scala maybe we should just pass a normal Map?

Member:

I think the "filterPartitions" function would just use an empty map as "no filter"?
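A minimal sketch of that convention, in plain Java. The `filterPartitions` name comes from the comment above, but the partition representation (a spec map per partition) and everything else here is an illustrative assumption, not Iceberg's actual implementation:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class FilterPartitionsSketch {
  // A partition is modeled as its spec map, e.g. {dept=hr}.
  // An empty filter map means "no filter", so callers never need an Optional.
  static List<Map<String, String>> filterPartitions(
      List<Map<String, String>> partitions, Map<String, String> filter) {
    if (filter.isEmpty()) {
      return partitions; // empty map: match everything
    }
    // Keep only partitions whose spec contains every filter entry.
    return partitions.stream()
        .filter(p -> filter.entrySet().stream()
            .allMatch(e -> e.getValue().equals(p.get(e.getKey()))))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<Map<String, String>> parts = new ArrayList<>();
    Map<String, String> p1 = new HashMap<>();
    p1.put("dept", "hr");
    Map<String, String> p2 = new HashMap<>();
    p2.put("dept", "eng");
    parts.add(p1);
    parts.add(p2);

    Map<String, String> filter = new HashMap<>();
    filter.put("dept", "hr");
    System.out.println(filterPartitions(parts, filter).size());          // 1
    System.out.println(filterPartitions(parts, new HashMap<>()).size()); // 2
  }
}
```

This is why dropping the Optional is safe: the empty map already encodes the "no filter" case.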

Member Author:

Done

versions.props Outdated
com.github.stephenc.findbugs:findbugs-annotations = 1.3.9-1
software.amazon.awssdk:* = 2.15.7
org.scala-lang:scala-library = 2.12.10
org.scala-lang.modules:scala-java8-compat_2.12 = 0.8.0
Member:

I think we can just use Optional.apply() instead of using the option converter here

Contributor:

It would be nice not to have an additional explicit Scala 2.12 dependency. I believe Spark 2 can still be compiled for Scala 2.11. Would this prohibit that?

Member Author:

Yep you guys convinced me, got rid of it.

@RussellSpitzer (Member) left a comment:

I don't think we need the java-8-compat module; left some suggestions there. I think this is a great idea, but I would recommend dropping the Optional from the API and just passing the map through.

Other than that, looks good to me.


@Test
public void addFilteredPartitionsToPartitioned2() {
createCompositePartitionedTable("parquet");
Member:

I think we need a SparkTableUtil test for the new getPartitions code as well, unless that's a pain

@szehon-ho szehon-ho force-pushed the add_file_optim_master branch from 50ff1d8 to fb91eee on July 13, 2021 01:22
TableIdentifier source = spark.sessionState().sqlParser()
.parseTableIdentifier(tableName);

Map<String, String> partition1 = Stream.of(
Member Author:

I suppose we cannot use Java 9 factory methods in the Spark 2 project.

Member:

:( I think we are still running against Java 8

@RussellSpitzer (Member), Jul 13, 2021:

That said, if you are looking for a better method here, we do have all of Guava, so you can use
ImmutableMap.of(...)

Or

ImmutableMap.<String, String>builder()
    .put(...)
    .put(...)
    .build()

    Map<String, String> newProperties = ImmutableMap.of("hello", "world");
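For contrast, a Java 8, stdlib-only equivalent of that one-liner (no Guava) takes several lines, which is exactly why Guava's ImmutableMap is the nicer choice here. This sketch is illustrative; the `newProperties` name mirrors the snippet above:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class Java8MapDemo {
  // Java 9+ offers Map.of("hello", "world"); on Java 8 the stdlib-only
  // equivalent is a mutable map wrapped as unmodifiable.
  static Map<String, String> newProperties() {
    Map<String, String> props = new HashMap<>();
    props.put("hello", "world");
    return Collections.unmodifiableMap(props);
  }

  public static void main(String[] args) {
    System.out.println(newProperties().get("hello")); // prints "world"
  }
}
```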

Member Author:

Oh right, forgot we have Guava. Thanks, changed; it's much cleaner now.

Option<scala.collection.immutable.Map<String, String>> scalaPartitionFilter;
if (partitionFilter != null && !partitionFilter.isEmpty()) {
  scalaPartitionFilter = Option.apply(JavaConverters.mapAsScalaMapConverter(partitionFilter).asScala()
      .toMap(Predef.conforms()));
Member Author:

The Scala API requires an immutable map, hence this extra step.

@RussellSpitzer (Member) left a comment:

Looks good to me! I think you just want to clean up that test to use Guava so that the maps are prettier :) Other than that, I think this is good to go.

@szehon-ho szehon-ho force-pushed the add_file_optim_master branch from 27fe3e0 to 8b9d576 on July 13, 2021 04:09
@RussellSpitzer RussellSpitzer merged commit 1a903f6 into apache:master Jul 13, 2021
@RussellSpitzer (Member):

Thanks @szehon-ho !

minchowang pushed a commit to minchowang/iceberg that referenced this pull request Aug 2, 2021
…park/Hive catalog (apache#2777)

Pushes down partition filters in Spark/Hive Import to underlying catalog instead of retrieving all partitions and then filtering.
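The merged behavior can be sketched with a simplified, hypothetical catalog model (the class and method names below are illustrative stand-ins, not Iceberg's or Spark's actual API): the filter travels with the request, so only matching partition specs cross the wire, instead of all partitions being returned and filtered client-side.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class PushdownSketch {
  // Hypothetical stand-in for a Hive-backed catalog.
  static class MockCatalog {
    private final List<Map<String, String>> partitions = new ArrayList<>();

    void addPartition(Map<String, String> spec) {
      partitions.add(spec);
    }

    // Before the change: the caller listed everything and filtered locally.
    // After the change: the filter is evaluated "catalog side", so only
    // matching partition specs are serialized back to the caller.
    List<Map<String, String>> listPartitions(Map<String, String> filter) {
      return partitions.stream()
          .filter(p -> filter.entrySet().stream()
              .allMatch(e -> e.getValue().equals(p.get(e.getKey()))))
          .collect(Collectors.toList());
    }
  }

  public static void main(String[] args) {
    MockCatalog catalog = new MockCatalog();
    for (int day = 1; day <= 31; day++) {
      Map<String, String> spec = new HashMap<>();
      spec.put("day", String.valueOf(day));
      catalog.addPartition(spec);
    }
    Map<String, String> filter = new HashMap<>();
    filter.put("day", "15");
    // One partition crosses the "wire" instead of all 31.
    System.out.println(catalog.listPartitions(filter).size()); // 1
  }
}
```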
