
Conversation

@szehon-ho (Member):

When running add_files on a partitioned source table, I noticed that a significant amount of time is spent in the Hive listPartitions call.

It might be good to push the filter down to Hive (a faster database query on just the partitions in question, and less traffic serialized/deserialized across the wire).

@szehon-ho szehon-ho changed the title Core : Add Files Perf improvement by push down partition filter to Spark/Hive catalog Spark : Add Files Perf improvement by push down partition filter to Spark/Hive catalog Jul 2, 2021
@szehon-ho (Member Author):

@RussellSpitzer could you take a look if you have some time?

CatalogTable catalogTable = catalog.getTableMetadata(tableIdent);

Seq<CatalogTablePartition> partitions = catalog.listPartitions(tableIdent, Option.empty());
Option<scala.collection.immutable.Map<String, String>> partSpec =
Member:

I am not smart enough to hold this whole block in my head. Can we break this down into multiple lines?

One line to make the Scala Map,
one line to wrap it in an Option?

@RussellSpitzer (Member), Jul 7, 2021:

Or could just do something like

Option<scala.collection.mutable.Map<String, String>> scalaFilter =
          Option.apply(scala.collection.JavaConverters.mapAsScalaMapConverter(pkfilter).asScala());

Member:

But you could static-import JavaConverters here too to make it shorter.

* @return all table's partitions
*/
public static List<SparkPartition> getPartitions(SparkSession spark, TableIdentifier tableIdent) {
public static List<SparkPartition> getPartitions(SparkSession spark, TableIdentifier tableIdent,
Member:

Do we benefit from having this as a Java Optional? Since we have to immediately convert it to Scala maybe we should just pass a normal Map?

Member:

I think the "filterPartitions" function would just use an empty map as "no filter"?
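A minimal sketch of that convention, in plain Java. The `filterPartitions` name comes from the comment above, but the partition representation (a spec map per partition) and everything else here is an illustrative assumption, not Iceberg's actual implementation:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class FilterPartitionsSketch {
  // A partition is modeled as its spec map, e.g. {dept=hr}.
  // An empty filter map means "no filter", so callers never need an Optional.
  static List<Map<String, String>> filterPartitions(
      List<Map<String, String>> partitions, Map<String, String> filter) {
    if (filter.isEmpty()) {
      return partitions; // empty map: match everything
    }
    // Keep only partitions whose spec contains every filter entry.
    return partitions.stream()
        .filter(p -> filter.entrySet().stream()
            .allMatch(e -> e.getValue().equals(p.get(e.getKey()))))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<Map<String, String>> parts = new ArrayList<>();
    Map<String, String> p1 = new HashMap<>();
    p1.put("dept", "hr");
    Map<String, String> p2 = new HashMap<>();
    p2.put("dept", "eng");
    parts.add(p1);
    parts.add(p2);

    Map<String, String> filter = new HashMap<>();
    filter.put("dept", "hr");
    System.out.println(filterPartitions(parts, filter).size());          // 1
    System.out.println(filterPartitions(parts, new HashMap<>()).size()); // 2
  }
}
```

This is why dropping the Optional is safe: the empty map already encodes the "no filter" case.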

Member Author:

Done

versions.props Outdated
com.github.stephenc.findbugs:findbugs-annotations = 1.3.9-1
software.amazon.awssdk:* = 2.15.7
org.scala-lang:scala-library = 2.12.10
org.scala-lang.modules:scala-java8-compat_2.12 = 0.8.0
Member:

I think we can just use Optional.apply() instead of using the option converter here

Contributor:

It would be nice not to have an additional explicit Scala 2.12 dependency. I believe Spark 2 can still be compiled for Scala 2.11. Would this prohibit that?

Member Author:

Yep you guys convinced me, got rid of it.

@RussellSpitzer (Member) left a comment:

I don't think we need the java-8-compat module; left some suggestions there. I think this is a great idea, but I would recommend dropping the Optional from the API and just passing the map through.

Other than that, looks good to me.


@Test
public void addFilteredPartitionsToPartitioned2() {
createCompositePartitionedTable("parquet");
Member:

I think we need a SparkTableUtil test for the new getPartitions code as well, unless that's a pain

@szehon-ho szehon-ho force-pushed the add_file_optim_master branch from 50ff1d8 to fb91eee on July 13, 2021 01:22
TableIdentifier source = spark.sessionState().sqlParser()
.parseTableIdentifier(tableName);

Map<String, String> partition1 = Stream.of(
Member Author:

I suppose we cannot use Java 9 factory methods in the Spark 2 project.

Member:

:( I think we are still running against Java 8

@RussellSpitzer (Member), Jul 13, 2021:

That said, if you are looking for a better method here, we do have all of Guava, so you can use
ImmutableMap.of(...)

Or

ImmutableMap.<String, String>builder()
    .put(...)
    .put(...)
    .build()

    Map<String, String> newProperties = ImmutableMap.of("hello", "world");
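For contrast, a Java 8, stdlib-only equivalent of that one-liner (no Guava) takes several lines, which is exactly why Guava's ImmutableMap is the nicer choice here. This sketch is illustrative; the `newProperties` name mirrors the snippet above:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class Java8MapDemo {
  // Java 9+ offers Map.of("hello", "world"); on Java 8 the stdlib-only
  // equivalent is a mutable map wrapped as unmodifiable.
  static Map<String, String> newProperties() {
    Map<String, String> props = new HashMap<>();
    props.put("hello", "world");
    return Collections.unmodifiableMap(props);
  }

  public static void main(String[] args) {
    System.out.println(newProperties().get("hello")); // prints "world"
  }
}
```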

Member Author:

Oh right, forgot we have Guava. Thanks, changed; it's much cleaner now.

Option<scala.collection.immutable.Map<String, String>> scalaPartitionFilter;
if (partitionFilter != null && !partitionFilter.isEmpty()) {
  scalaPartitionFilter = Option.apply(JavaConverters.mapAsScalaMapConverter(partitionFilter).asScala()
      .toMap(Predef.conforms()));
Member Author:

The Scala API requires an immutable map, hence this extra step.

@RussellSpitzer (Member) left a comment:

Looks good to me! I think you just want to clean up that test to use Guava so that the maps are prettier :) Other than that, I think this is good to go.

@szehon-ho szehon-ho force-pushed the add_file_optim_master branch from 27fe3e0 to 8b9d576 on July 13, 2021 04:09
@RussellSpitzer RussellSpitzer merged commit 1a903f6 into apache:master Jul 13, 2021
@RussellSpitzer (Member):

Thanks @szehon-ho !

minchowang pushed a commit to minchowang/iceberg that referenced this pull request Aug 2, 2021
…park/Hive catalog (apache#2777)

Pushes down partition filters in Spark/Hive Import to underlying catalog instead of retrieving all partitions and then filtering.
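The merged behavior can be sketched with a simplified, hypothetical catalog model (the class and method names below are illustrative stand-ins, not Iceberg's or Spark's actual API): the filter travels with the request, so only matching partition specs cross the wire, instead of all partitions being returned and filtered client-side.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class PushdownSketch {
  // Hypothetical stand-in for a Hive-backed catalog.
  static class MockCatalog {
    private final List<Map<String, String>> partitions = new ArrayList<>();

    void addPartition(Map<String, String> spec) {
      partitions.add(spec);
    }

    // Before the change: the caller listed everything and filtered locally.
    // After the change: the filter is evaluated "catalog side", so only
    // matching partition specs are serialized back to the caller.
    List<Map<String, String>> listPartitions(Map<String, String> filter) {
      return partitions.stream()
          .filter(p -> filter.entrySet().stream()
              .allMatch(e -> e.getValue().equals(p.get(e.getKey()))))
          .collect(Collectors.toList());
    }
  }

  public static void main(String[] args) {
    MockCatalog catalog = new MockCatalog();
    for (int day = 1; day <= 31; day++) {
      Map<String, String> spec = new HashMap<>();
      spec.put("day", String.valueOf(day));
      catalog.addPartition(spec);
    }
    Map<String, String> filter = new HashMap<>();
    filter.put("day", "15");
    // One partition crosses the "wire" instead of all 31.
    System.out.println(catalog.listPartitions(filter).size()); // 1
  }
}
```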
