Conversation

Contributor

@JonasJ-ap JonasJ-ap commented Feb 4, 2023

Problem Addressed

This PR fixes the problem described in issue #6715 by using reflection to instantiate the HTTP client configuration implementation class, so that neither url-connection-client nor apache-client is a hard runtime dependency.
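The core idea can be sketched with plain reflection (this is an illustrative sketch, not the PR's actual code; Iceberg wraps this pattern in its DynConstructors utility): referencing a class only by its String name defers class loading until the code path is actually taken, so a missing optional jar no longer causes a NoClassDefFoundError at class-initialization time.

```java
import java.lang.reflect.Constructor;

public class ReflectiveLoad {
  // Load an implementation class by its String name so the compiled code
  // carries no static reference to it. If the jar is absent, failure happens
  // only when this path is actually taken, not when the enclosing class loads.
  static Object instantiate(String implClassName) throws Exception {
    Class<?> clazz = Class.forName(implClassName);
    Constructor<?> ctor = clazz.getDeclaredConstructor();
    ctor.setAccessible(true); // allow non-public (hidden) implementations
    return ctor.newInstance();
  }

  public static void main(String[] args) throws Exception {
    // Demo with a JDK class standing in for an optional SDK class.
    Object obj = instantiate("java.util.ArrayList");
    System.out.println(obj.getClass().getName());
  }
}
```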

Test Environment

Spark 3.3, Scala 2.12.

Script used to launch the Spark shell:

BRANCH_NAME=bug_fix_aws_httpclient_conflict_prefix_map
DEPENDENCIES=""

# add AWS dependencies
AWS_SDK_VERSION=2.17.257
AWS_MAVEN_GROUP=software.amazon.awssdk
AWS_PACKAGES=(
    "apache-client"
    "s3"
    "glue"
    "kms"
    "iam"
    "sts"
    "dynamodb"
)
for pkg in "${AWS_PACKAGES[@]}"; do
    DEPENDENCIES+="$AWS_MAVEN_GROUP:$pkg:$AWS_SDK_VERSION,"
done

JARS="iceberg-spark-runtime-3.3_$BRANCH_NAME.jar"

# start Spark SQL client shell
spark-shell --packages=$DEPENDENCIES --jars=$JARS \
    --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
    --conf spark.sql.catalog.demo=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.demo.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \
    --conf spark.sql.catalog.demo.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
    --conf spark.sql.catalog.demo.warehouse=s3://gluetestjonas/warehouse \
    --conf spark.sql.catalog.demo.http-client.type=apache

Test commands:

scala> val data = spark.range(0, 10)
scala> data.writeTo("demo.default.test1").create()

Before the fix:

Table creation fails with a java.lang.NoClassDefFoundError:

java.lang.NoClassDefFoundError: software/amazon/awssdk/http/urlconnection/UrlConnectionHttpClient$Builder
        at java.base/java.lang.Class.getDeclaredMethods0(Native Method)
        at java.base/java.lang.Class.privateGetDeclaredMethods(Class.java:3458)
        at java.base/java.lang.Class.getDeclaredMethod(Class.java:2726)
        at java.base/java.io.ObjectStreamClass.getPrivateMethod(ObjectStreamClass.java:1525)
        at java.base/java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:413)
        at java.base/java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:384)
        at java.base/java.security.AccessController.doPrivileged(AccessController.java:318)
        at java.base/java.io.ObjectStreamClass.<init>(ObjectStreamClass.java:384)
        at java.base/java.io.ObjectStreamClass$Caches$1.computeValue(ObjectStreamClass.java:110)
        at java.base/java.io.ObjectStreamClass$Caches$1.computeValue(ObjectStreamClass.java:107)
        at java.base/java.io.ClassCache$1.computeValue(ClassCache.java:73)
        at java.base/java.io.ClassCache$1.computeValue(ClassCache.java:70)
        at java.base/java.lang.ClassValue.getFromHashMap(ClassValue.java:229)
        at java.base/java.lang.ClassValue.getFromBackup(ClassValue.java:211)
        at java.base/java.lang.ClassValue.get(ClassValue.java:117)
        at java.base/java.io.ClassCache.get(ClassCache.java:84)
        at java.base/java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:363)
        ...

After the fix:

The table is created successfully, indicating that url-connection-client is not required at runtime.
Screen Shot 2023-02-04 at 18 43 55
Screen Shot 2023-02-04 at 18 44 35

@github-actions github-actions bot added the AWS label Feb 4, 2023
DynConstructors.Ctor<HttpClientConfigurations> ctor;
try {
  ctor =
      DynConstructors.builder(HttpClientConfigurations.class).hiddenImpl(impl).buildChecked();
Contributor

@jackye1995 jackye1995 Feb 5, 2023

Trying to see if we can avoid having to create a new public interface like HttpClientConfigurations. We could have some static methods in a non-public class and dynamically load them using something like:

DynMethods.builder("initialize")
    .impl(impl, Map.class)
    .buildStaticChecked()
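For readers unfamiliar with Iceberg's DynMethods helper, a rough plain-reflection equivalent of looking up and invoking a static method on a class referenced only by name might look like this (illustrative sketch; class and method names here are stand-ins, not Iceberg's internals):

```java
import java.lang.reflect.Method;
import java.util.HashMap;
import java.util.Map;

public class DynStaticSketch {
  // Find a static method on a class referenced only by its String name,
  // so the caller has no compile-time dependency on that class.
  static Method findStatic(String className, String methodName, Class<?>... argTypes)
      throws Exception {
    Class<?> clazz = Class.forName(className);
    Method method = clazz.getDeclaredMethod(methodName, argTypes);
    method.setAccessible(true);
    return method;
  }

  public static void main(String[] args) throws Exception {
    // Demo: invoke the static java.util.Collections.unmodifiableMap reflectively.
    Method m = findStatic("java.util.Collections", "unmodifiableMap", Map.class);
    Map<String, String> props = new HashMap<>();
    props.put("http-client.type", "apache");
    Object result = m.invoke(null, props); // null receiver: static method
    System.out.println(result.getClass().getName());
  }
}
```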

Contributor

I see Jack's point about not exposing a new public interface. I am wondering if we can define HttpClientConfigurations as a package-private abstract class.

Contributor Author

Thank you for your suggestions. I have switched the implementation to define HttpClientConfigurations as a package-private abstract class.

With the DynMethods approach, it seems the final code would be something like:

case HTTP_CLIENT_TYPE_URLCONNECTION:
  Object httpClientConfigurations =
      loadHttpClientConfigurations(
          UrlConnectionHttpClientConfigurations.class.getName(),
          urlConnectionHttpClientProperties);
  ((UrlConnectionHttpClientConfigurations) httpClientConfigurations)
      .configureHttpClientBuilder(builder);

since we would no longer have an interface or superclass. I personally find it weird to return an Object and then cast it directly here, so I vote +1 for having a package-private abstract class.
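For contrast, with a package-private abstract supertype the reflective load site can stay typed, with a single cast to the shared supertype instead of casting a raw Object at each call site. A minimal sketch (illustrative names and signatures only; the real class configures an SDK HTTP client builder, and a StringBuilder stands in here):

```java
public class LoadDemo {
  // Package-private abstract supertype (sketch): real code would take an
  // SdkHttpClient.Builder; a StringBuilder stands in for illustration.
  abstract static class HttpClientConfigurations {
    abstract void configureHttpClientBuilder(StringBuilder builder);
  }

  static class UrlConnectionHttpClientConfigurations extends HttpClientConfigurations {
    @Override
    void configureHttpClientBuilder(StringBuilder builder) {
      builder.append("url-connection");
    }
  }

  // The load site stays typed: one cast to the supertype, and the caller has
  // no compile-time reference to the concrete impl (in real code the name
  // would come from a String constant rather than a .class literal).
  static HttpClientConfigurations load(String impl) throws Exception {
    return (HttpClientConfigurations)
        Class.forName(impl).getDeclaredConstructor().newInstance();
  }

  public static void main(String[] args) throws Exception {
    StringBuilder builder = new StringBuilder();
    load(UrlConnectionHttpClientConfigurations.class.getName())
        .configureHttpClientBuilder(builder);
    System.out.println(builder); // url-connection
  }
}
```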

Also, I am now curious why the previous implementation

interface HttpClientConfigurations {
...

does not work. A package-private interface seems like it would behave the same as a package-private abstract class in this case. Is there a design reason not to use a package-private interface, or am I misunderstanding something? Thank you in advance for your help.

Contributor

Good question. That was a miss on my end; I just assumed it was a public interface. Yes, I agree a package-private interface should be the same.

Contributor

The abstract class looks good to me. But why not just have a static method? In that case we completely avoid the need for the implementations to inherit from the same abstract class.

@github-actions github-actions bot added the core label Feb 6, 2023
* @param prefix prefix to choose keys from input map
* @return subset of input map with keys starting with provided prefix
*/
public static Map<String, String> propertiesWithPrefixNoTrim(
Contributor

"Trim" typically means removing whitespace. Is there a better way to name this method? Maybe the startsWith could be passed as a Predicate<String> and this could be filterProperties? Or maybe filterPropertiesByPrefix?
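The Predicate-based suggestion above could look roughly like this (a sketch with illustrative names; Iceberg's actual utility lives in PropertyUtil and may differ):

```java
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class PropertyFilter {
  // Return the subset of properties whose keys satisfy the predicate,
  // keeping keys unmodified (no prefix stripping / "trimming").
  static Map<String, String> filterProperties(
      Map<String, String> properties, Predicate<String> keyFilter) {
    return properties.entrySet().stream()
        .filter(e -> keyFilter.test(e.getKey()))
        .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
  }

  public static void main(String[] args) {
    Map<String, String> props = Map.of(
        "http-client.apache.max-connections", "50",
        "warehouse", "s3://bucket/path");
    // The prefix check is just one possible predicate the caller passes in.
    Map<String, String> apache =
        filterProperties(props, key -> key.startsWith("http-client."));
    System.out.println(apache); // one entry, key left unchanged
  }
}
```

Passing the predicate keeps the utility generic: a prefix match is no longer baked into the method name or its contract.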

Contributor

@stevenzwu stevenzwu left a comment

LGTM. I assume the latest change was tested in the environment that can reproduce the issue.


if (httpClientApacheUseIdleConnectionReaperEnabled != null) {
  builder.useIdleConnectionReaper(httpClientApacheUseIdleConnectionReaperEnabled);

private Object loadHttpClientConfigurations(String impl) {
Contributor

nit: maybe add a brief Javadoc to explain why we are using reflection here, and link to the GitHub issue for more context.

Contributor

+1

@JonasJ-ap
Contributor Author

Thanks everyone for reviewing this! I will add proof of running the latest update on EKS later today or tomorrow.

@JonasJ-ap
Contributor Author

JonasJ-ap commented Feb 9, 2023

Sorry for the late update. It took me some time to construct the EKS environment properly.

Test Environment

AWS EKS: 1.24, Spark 3.1.2

Test Spark job / Kubernetes job config:

import org.apache.spark.sql.SparkSession

object GlueApp {

  def main(sysArgs: Array[String]) {
    val spark: SparkSession = SparkSession.builder.
      config("spark.master", "local").
      config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions").
      config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog").
      config("spark.sql.catalog.demo.io-impl", "org.apache.iceberg.aws.s3.S3FileIO").
      config("spark.sql.catalog.demo.warehouse", "s3a://gluetestjonas/warehouse").
      config("spark.sql.catalog.demo.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog").
      config("spark.sql.catalog.demo.client.factory", "org.apache.iceberg.aws.AssumeRoleAwsClientFactory").
      config("spark.sql.catalog.demo.client.assume-role.arn", "arn:aws:iam::481640105715:role/jonasjiang_gluecatalog").
      config("spark.sql.catalog.demo.client.assume-role.region", "us-east-1").
      config("spark.sql.catalog.demo.client.assume-role.session-name", "test").
      config("spark.sql.catalog.demo.client.assume-role.external-id", "1234546").
      config("spark.sql.catalog.demo.http-client.type", "apache").
      getOrCreate()

    val data = spark.range(0, 100)
    data.writeTo("demo.default.table100").createOrReplace()
    spark.sql("SELECT * FROM demo.default.table100").show()

    // read using SQL
    // spark.sql("SELECT * FROM demo.reviews.book_reviews").show()
  }
}
name: spark
          image: jonasjiang/spark-eks:v3.1.2
          args: [
              "/bin/sh",
              "-c",
              "/opt/spark/bin/spark-submit \
            --master k8s://https://kubernetes.default.svc.cluster.local:443 \
            --deploy-mode cluster \
            --name spark-eks \
            --class GlueApp \
            --jars s3a://gluetestjonas/jars/iceberg-spark-runtime-3.1_bug_fix_aws_httpclient_conflict_prefix_map.jar \
            --packages software.amazon.awssdk:apache-client:2.17.257,software.amazon.awssdk:s3:2.17.257,software.amazon.awssdk:glue:2.17.257,software.amazon.awssdk:kms:2.17.257,software.amazon.awssdk:iam:2.17.257,software.amazon.awssdk:sts:2.17.257,software.amazon.awssdk:dynamodb:2.17.257 \
            ...
            --conf spark.kubernetes.container.image=jonasjiang/spark-eks:v3.1.2 \
            ...
            --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
            --conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.WebIdentityTokenCredentialsProvider \
            --conf spark.kubernetes.authenticate.submission.caCertFile=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt \
            --conf spark.kubernetes.authenticate.submission.oauthTokenFile=/var/run/secrets/kubernetes.io/serviceaccount/token \
            s3a://gluetestjonas/jars/scalar-glue_apache.jar

Test result:

The current master branch failed with the following error:
Screen Shot 2023-02-09 at 03 57 22
With this PR, the job succeeded:
Screen Shot 2023-02-09 at 04 01 00

This indicates that we no longer need url-connection-client when http-client.type is set to apache.

Contributor

@jackye1995 jackye1995 left a comment

Looks good to me!

@jackye1995
Contributor

I think all the comments are addressed and we have enough votes, so I will go ahead and merge this. Thanks for fixing this with such detailed verification! And thanks for the review @stevenzwu @rdblue

@jackye1995 jackye1995 merged commit 4d43c25 into apache:master Feb 9, 2023
krvikash pushed a commit to krvikash/iceberg that referenced this pull request Mar 16, 2023
