Spark : Add Files Perf improvement by push down partition filter to Spark/Hive catalog #2777
Conversation
@RussellSpitzer could you take a look if you have some time?
CatalogTable catalogTable = catalog.getTableMetadata(tableIdent);
Seq<CatalogTablePartition> partitions = catalog.listPartitions(tableIdent, Option.empty());
Option<scala.collection.immutable.Map<String, String>> partSpec =
I am not smart enough to hold this whole block in my head. Can we break this down into multiple lines?
One line to make the scala Map
One line to wrap it in optional?
Or could just do something like

Option<scala.collection.mutable.Map<String, String>> scalaFilter =
    Option.apply(scala.collection.JavaConverters.mapAsScalaMapConverter(pkfilter).asScala());
But you can static import JavaConverters here too to make it shorter
 * @return all table's partitions
 */
public static List<SparkPartition> getPartitions(SparkSession spark, TableIdentifier tableIdent) {
public static List<SparkPartition> getPartitions(SparkSession spark, TableIdentifier tableIdent,
Do we benefit from having this as a Java Optional? Since we have to immediately convert it to Scala maybe we should just pass a normal Map?
I think the "filterPartitions" function would just use an empty map as "no filter"?
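To illustrate the "empty map means no filter" convention discussed above, here is a minimal, self-contained Java sketch. The class and method names (`PartitionFilterSketch`, `matches`) are hypothetical, not from the PR; the idea is only that a partition matches when its spec contains every entry of the filter, so an empty filter trivially matches everything.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: an empty filter map naturally behaves as "no filter".
public class PartitionFilterSketch {

    // A partition matches when its spec contains every filter entry.
    // allMatch over an empty stream is true, so an empty filter matches all.
    public static boolean matches(Map<String, String> partitionSpec, Map<String, String> filter) {
        return filter.entrySet().stream()
                .allMatch(e -> e.getValue().equals(partitionSpec.get(e.getKey())));
    }

    public static void main(String[] args) {
        Map<String, String> spec = new HashMap<>();
        spec.put("dt", "2021-06-01");
        spec.put("hr", "09");

        System.out.println(matches(spec, Collections.emptyMap()));                      // true
        System.out.println(matches(spec, Collections.singletonMap("dt", "2021-06-01"))); // true
        System.out.println(matches(spec, Collections.singletonMap("dt", "2021-06-02"))); // false
    }
}
```

This is why passing a plain (possibly empty) map instead of an Optional keeps the API simpler: no null or empty-Optional special case is needed.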
Done
versions.props
Outdated
com.github.stephenc.findbugs:findbugs-annotations = 1.3.9-1
software.amazon.awssdk:* = 2.15.7
org.scala-lang:scala-library = 2.12.10
org.scala-lang.modules:scala-java8-compat_2.12 = 0.8.0
I think we can just use Option.apply() instead of using the option converter here
It would be nice to not have an additional explicit scala 2.12 dependency. I believe Spark2 can still be compiled for scala 2.11. Would this prohibit that?
Yep you guys convinced me, got rid of it.
RussellSpitzer
left a comment
I don't think we need the java-8-compat module; left some suggestions there. I think this is a great idea, but I would recommend dropping the Optional from the API and just passing the map through.
Other than that, looks good to me.
@Test
public void addFilteredPartitionsToPartitioned2() {
  createCompositePartitionedTable("parquet");
I think we need a SparkTableUtil test for the new getPartitions code as well, unless that's a pain
Force-pushed from 50ff1d8 to fb91eee
TableIdentifier source = spark.sessionState().sqlParser()
    .parseTableIdentifier(tableName);

Map<String, String> partition1 = Stream.of(
I suppose we cannot use Java 9 factory methods in the Spark2 project
:( I think we are still running against Java 8
That said, if you are looking for a better method here, we do have all of Guava, so you can use

ImmutableMap.of(...)

or

ImmutableMap.builder()
    .put(...)
    .put(...)
    .build()

Map<String, String> newProperties = ImmutableMap.of("hello", "world");
Oh right, forgot we have Guava. Thanks, changed; it's much cleaner now.
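For reference, the plain-Java-8 construction being replaced looks roughly like this (a sketch; the map keys and values are made up). With Guava on the classpath, the whole thing collapses to a single `ImmutableMap.of(...)` call, which is why the test reads much cleaner after the change.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class MapLiteralSketch {
    public static void main(String[] args) {
        // Java 8 has no Map.of(...), so a map "literal" needs a Stream of entries.
        Map<String, String> partition1 = Stream.of(
                new SimpleEntry<>("dt", "2021-06-01"),
                new SimpleEntry<>("hr", "09"))
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));

        System.out.println(partition1.get("dt")); // 2021-06-01

        // With Guava this whole block becomes one line:
        //   Map<String, String> partition1 = ImmutableMap.of("dt", "2021-06-01", "hr", "09");
    }
}
```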
Option<scala.collection.immutable.Map<String, String>> scalaPartitionFilter;
if (partitionFilter != null && !partitionFilter.isEmpty()) {
  scalaPartitionFilter = Option.apply(JavaConverters.mapAsScalaMapConverter(partitionFilter).asScala()
      .toMap(Predef.conforms()));
The Scala API requires an immutable Map, hence this extra step.
RussellSpitzer
left a comment
Looks good to me! I think you just want to clean up that test to use Guava so that the maps are prettier :) Other than that I think this is good to go
Force-pushed from 27fe3e0 to 8b9d576
Thanks @szehon-ho !
…park/Hive catalog (apache#2777)
Pushes down partition filters in Spark/Hive Import to the underlying catalog instead of retrieving all partitions and then filtering.
When running add_files on a partitioned source table, we noticed that a significant amount of time is spent in the Hive listPartitions call.
Pushing the filter down to Hive gives a faster database query on just the partitions in question, and less traffic getting serialized/deserialized across the wire.
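The difference the PR describes can be modeled with a small, self-contained Java sketch. Everything here is hypothetical (the real code calls Spark's session catalog, not this toy `listPartitions`); the point is only the shape of the change: the old path lists every partition and filters on the client, while the pushed-down path hands the partial partition spec to the catalog so only matching partitions are returned.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.stream.Collectors;

// Hypothetical mini-model of the pushdown: the "catalog" applies the
// partial spec itself instead of returning all partitions.
public class PushdownSketch {

    public static List<Map<String, String>> listPartitions(
            List<Map<String, String>> all, Optional<Map<String, String>> partialSpec) {
        if (!partialSpec.isPresent() || partialSpec.get().isEmpty()) {
            return all; // no filter: behaves like the old full listing
        }
        Map<String, String> spec = partialSpec.get();
        // Keep only partitions whose spec contains every filter entry.
        return all.stream()
                .filter(p -> spec.entrySet().stream()
                        .allMatch(e -> e.getValue().equals(p.get(e.getKey()))))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Map<String, String>> partitions = Arrays.asList(
                Collections.singletonMap("dt", "2021-06-01"),
                Collections.singletonMap("dt", "2021-06-02"));

        // Pushed-down filter: only the matching partition comes back.
        System.out.println(listPartitions(partitions,
                Optional.of(Collections.singletonMap("dt", "2021-06-01"))).size()); // 1

        // No filter: full listing, as before the change.
        System.out.println(listPartitions(partitions, Optional.empty()).size()); // 2
    }
}
```

With a real Hive metastore backing the catalog, this filtering becomes a database query over partition metadata, which is where the performance win comes from.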