@akhilputhiry
Contributor

@akhilputhiry akhilputhiry commented Feb 27, 2025

This PR fixes the following issues:

#12059
#9174

# extra configs needed to connect via proxy
conf.set("spark.sql.catalog.demo.rest.client.proxy.hostname", "127.0.0.1")
conf.set("spark.sql.catalog.demo.rest.client.proxy.port", "8080")

# below configs are needed only if the proxy requires credentials
conf.set("spark.sql.catalog.demo.rest.client.proxy.username", "admin")
conf.set("spark.sql.catalog.demo.rest.client.proxy.password", "admin")
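
For illustration only: Spark hands a catalog implementation the options found under spark.sql.catalog.NAME. with that prefix removed, so the catalog would see keys such as rest.client.proxy.hostname. A minimal JDK-only sketch of that mapping follows; CatalogOptions and forCatalog are hypothetical names, not Spark or Iceberg APIs.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper; mimics how catalog-scoped Spark options lose their prefix. */
final class CatalogOptions {

  /** Returns the options of one catalog, keyed without the "spark.sql.catalog.NAME." prefix. */
  static Map<String, String> forCatalog(Map<String, String> sparkConf, String catalogName) {
    String prefix = "spark.sql.catalog." + catalogName + ".";
    Map<String, String> options = new LinkedHashMap<>();
    for (Map.Entry<String, String> entry : sparkConf.entrySet()) {
      if (entry.getKey().startsWith(prefix)) {
        options.put(entry.getKey().substring(prefix.length()), entry.getValue());
      }
    }
    return options;
  }

  public static void main(String[] args) {
    Map<String, String> conf = new LinkedHashMap<>();
    conf.put("spark.sql.catalog.demo.rest.client.proxy.hostname", "127.0.0.1");
    conf.put("spark.sql.catalog.demo.rest.client.proxy.port", "8080");
    conf.put("spark.app.name", "example"); // unrelated setting, ignored by the helper
    System.out.println(forCatalog(conf, "demo"));
  }
}
```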

Notes:

@github-actions github-actions bot added the core label Feb 27, 2025
@akhilputhiry
Contributor Author

@flyrain Could you please take a look at this PR?

@akhilputhiry
Contributor Author

@adutra Could you please take a look at this PR?

@adutra
Contributor

adutra commented Feb 27, 2025

Hi @akhilputhiry, while I understand the problem, I think there are a few concerns with this PR:

First off, there is already some proxy support in HTTPClient:

  1. org.apache.iceberg.rest.HTTPClient.Builder#withProxy
  2. org.apache.iceberg.rest.HTTPClient.Builder#withProxyCredentialsProvider

Can you confirm that adding ProxySupport on top of the existing code will work, e.g. if withProxy is used? Or are these two things mutually exclusive? I also think that you should pass the ProxySelector via the builder, just like the methods above.

Secondly, the introduction of proxy support in HTTPClient has had me wondering for a while how this is supposed to work when the IDP is external. Right now, my understanding is that if a proxy is defined, it will be used for all requests, both to the catalog server and to the IDP.

This may not be desirable in all cases. I would like to make it possible to select different proxy configurations depending on the request URL. Do you think that would be possible with ProxySupport?
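
As a sketch of what "different proxy configurations depending on the request URL" could look like with the JDK's java.net.ProxySelector: the selector below returns the proxy registered for the request's host and falls back to a direct connection. PerHostProxySelector and the example.com hostnames are hypothetical, and how such a selector would be wired into HTTPClient is left open here.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.ProxySelector;
import java.net.SocketAddress;
import java.net.URI;
import java.util.List;
import java.util.Map;

/** Hypothetical selector: routes each request through the proxy registered for its host. */
final class PerHostProxySelector extends ProxySelector {
  private final Map<String, Proxy> proxiesByHost;

  PerHostProxySelector(Map<String, Proxy> proxiesByHost) {
    this.proxiesByHost = Map.copyOf(proxiesByHost);
  }

  static Proxy httpProxy(String hostname, int port) {
    // createUnresolved avoids a DNS lookup at configuration time
    return new Proxy(Proxy.Type.HTTP, InetSocketAddress.createUnresolved(hostname, port));
  }

  @Override
  public List<Proxy> select(URI uri) {
    if (uri == null) {
      throw new IllegalArgumentException("uri must not be null");
    }
    Proxy proxy = uri.getHost() == null ? null : proxiesByHost.get(uri.getHost());
    // Hosts without a registered proxy are contacted directly
    return List.of(proxy == null ? Proxy.NO_PROXY : proxy);
  }

  @Override
  public void connectFailed(URI uri, SocketAddress address, IOException failure) {
    // No failover in this sketch
  }

  public static void main(String[] args) {
    ProxySelector selector =
        new PerHostProxySelector(
            Map.of(
                "catalog.example.com", httpProxy("127.0.0.1", 8080),
                "idp.example.com", httpProxy("127.0.0.1", 9090)));
    System.out.println(selector.select(URI.create("https://catalog.example.com/v1/config")));
    System.out.println(selector.select(URI.create("https://idp.example.com/oauth/token")));
    System.out.println(selector.select(URI.create("https://elsewhere.example.com/")));
  }
}
```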

@akhilputhiry
Contributor Author

akhilputhiry commented Feb 28, 2025

Thanks for the feedback @adutra, please find my thoughts below

Can you confirm that adding ProxySupport on top of the existing code will work, e.g. if withProxy is used?

Yes, it works; I tested it with org.apache.iceberg.rest.HTTPClient.Builder#withProxy.

Secondly, the introduction of proxy support in HTTPClient has me wondering for a while how this is supposed to work in the case where the IDP is external.

I am thinking of making this explicit in RESTCatalog by moving the proxy setup to org.apache.iceberg.rest.RESTCatalog.java and controlling it via properties.

This may not be desirable in all cases. I would like to make it possible to select different proxy configurations depending on the request URL. Do you think that would be possible with ProxySupport?

For the IDP scenario, I believe a similar approach (a dedicated proxy setting applied via the builder) should address the problem.

Updated the PR with new approach. Let me know your thoughts.

Thanks

Integer.parseInt(config.get(CatalogProperties.PROXY_PORT)));
}

return builder.build();
Contributor

So, you ended up giving up on ProxySupport eventually?

The current code looks OK to me, although it still doesn't solve the problem when two different proxies are required for contacting the catalog server and the authorization server.

It also doesn't address proxy credentials, but this could be done as a follow-up task.

Contributor Author

@akhilputhiry akhilputhiry Mar 7, 2025

@adutra Thanks for the feedback.

For point 3, I have added support for simple auth for now; we can add other mechanisms in follow-up PRs.

For point 2, I shall try the following:

We could use the ProxyRoutePlanner to use different proxies for different domains.

The confs would look something like the following; we can adjust the domains parameter so that the proxy is selected accordingly:

conf.set("spark.sql.catalog.demo.proxy.myproxy1.hostname", "127.0.0.1")
conf.set("spark.sql.catalog.demo.proxy.myproxy1.port", "8080")
conf.set("spark.sql.catalog.demo.proxy.myproxy1.requires-credentials", "true")
conf.set("spark.sql.catalog.demo.proxy.myproxy1.username", "ac")
conf.set("spark.sql.catalog.demo.proxy.myproxy1.password", "dc")
conf.set("spark.sql.catalog.demo.proxy.myproxy1.domains", "*")
conf.set("spark.sql.catalog.demo.proxy.myproxy1.priority", "1")

conf.set("spark.sql.catalog.demo.proxy.myproxy2.hostname", "127.0.0.1")
conf.set("spark.sql.catalog.demo.proxy.myproxy2.port", "9090")
conf.set("spark.sql.catalog.demo.proxy.myproxy2.requires-credentials", "true")
conf.set("spark.sql.catalog.demo.proxy.myproxy2.username", "ac")
conf.set("spark.sql.catalog.demo.proxy.myproxy2.password", "dc")
conf.set("spark.sql.catalog.demo.proxy.myproxy2.domains", "*")
conf.set("spark.sql.catalog.demo.proxy.myproxy2.priority", "2")

BTW, I am using mitmproxy for my testing.
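
As a rough sketch of how named-proxy properties like the ones above might be resolved: the Java below groups keys of the form proxy.NAME.FIELD (assuming the spark.sql.catalog.demo. prefix has already been stripped) and picks a proxy for a given host. The semantics are assumptions, not something agreed in this thread: domains is read as a comma-separated list of globs in which * matches anything, and the lowest priority number wins. NamedProxies and ProxyConfig are hypothetical names, and credentials are left out for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;
import java.util.regex.Pattern;

/** Hypothetical resolver for named proxy configs; not part of Iceberg. */
final class NamedProxies {

  static final class ProxyConfig {
    private final String name;
    private final String hostname;
    private final int port;
    private final List<String> domains;
    private final int priority;

    ProxyConfig(String name, String hostname, int port, List<String> domains, int priority) {
      this.name = name;
      this.hostname = hostname;
      this.port = port;
      this.domains = domains;
      this.priority = priority;
    }

    String name() { return name; }
    String hostname() { return hostname; }
    int port() { return port; }
    int priority() { return priority; }

    boolean matches(String host) {
      return domains.stream().anyMatch(glob -> toRegex(glob).matcher(host).matches());
    }
  }

  // Assumed semantics: '*' matches any run of characters; everything else is literal.
  private static Pattern toRegex(String glob) {
    return Pattern.compile(Pattern.quote(glob.trim()).replace("*", "\\E.*\\Q"));
  }

  /** Groups keys of the form proxy.NAME.FIELD into one ProxyConfig per name. */
  static List<ProxyConfig> parse(Map<String, String> properties) {
    Map<String, Map<String, String>> fieldsByName = new TreeMap<>();
    for (Map.Entry<String, String> entry : properties.entrySet()) {
      String[] parts = entry.getKey().split("\\.", 3);
      if (parts.length == 3 && parts[0].equals("proxy")) {
        fieldsByName.computeIfAbsent(parts[1], k -> new TreeMap<>()).put(parts[2], entry.getValue());
      }
    }
    List<ProxyConfig> configs = new ArrayList<>();
    for (Map.Entry<String, Map<String, String>> entry : fieldsByName.entrySet()) {
      Map<String, String> fields = entry.getValue();
      configs.add(
          new ProxyConfig(
              entry.getKey(),
              fields.get("hostname"),
              Integer.parseInt(fields.get("port")), // a missing port fails fast in this sketch
              Arrays.asList(fields.getOrDefault("domains", "*").split(",")),
              Integer.parseInt(fields.getOrDefault("priority", "0"))));
    }
    return configs;
  }

  /** Picks the matching config with the lowest priority number (assumed ordering). */
  static Optional<ProxyConfig> select(List<ProxyConfig> configs, String host) {
    return configs.stream()
        .filter(config -> config.matches(host))
        .min(Comparator.comparingInt(ProxyConfig::priority));
  }

  public static void main(String[] args) {
    Map<String, String> properties = new TreeMap<>();
    properties.put("proxy.myproxy1.hostname", "127.0.0.1");
    properties.put("proxy.myproxy1.port", "8080");
    properties.put("proxy.myproxy1.domains", "*");
    properties.put("proxy.myproxy1.priority", "2");
    properties.put("proxy.myproxy2.hostname", "127.0.0.1");
    properties.put("proxy.myproxy2.port", "9090");
    properties.put("proxy.myproxy2.domains", "*.idp.example.com");
    properties.put("proxy.myproxy2.priority", "1");
    List<ProxyConfig> configs = parse(properties);
    select(configs, "login.idp.example.com").ifPresent(c -> System.out.println(c.name()));
    select(configs, "catalog.example.com").ifPresent(c -> System.out.println(c.name()));
  }
}
```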

Contributor Author

Also, would it be okay to implement multiple-proxy support in a different PR?

Contributor

we could use the ProxyRoutePlanner to use different proxies for different domains

Sounds very promising! Your example is pretty much what I had in mind.

Also, would it be okay to implement multiple-proxy support in a different PR?

Sure, that's fine with me. Btw I can approve the PR, but I am not a committer, so you will need to obtain another review from somebody else.

Contributor Author

Thanks @adutra, I have made changes as per your suggestion.
Can you point me to some committers who will be able to take a look?

public static final String PROXY_HOSTNAME = "proxy.hostname";
public static final String PROXY_PORT = "proxy.port";

public static final String PROXY_REQUIRES_CREDENTIALS = "proxy.requires-credentials";
Contributor

This property could maybe be inferred from the presence (or absence) of PROXY_USERNAME and PROXY_PASSWORD.
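
A minimal sketch of that inference, using only the JDK: credentials count as required exactly when both a username and a password are present, so no separate flag is needed. The hostname and port keys are the ones quoted in the diff; the username and password key values, and the ProxySettings class itself, are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical value class; infers the need for credentials instead of reading a flag. */
final class ProxySettings {
  static final String PROXY_HOSTNAME = "proxy.hostname"; // as quoted in the diff
  static final String PROXY_PORT = "proxy.port"; // as quoted in the diff
  static final String PROXY_USERNAME = "proxy.username"; // assumed key
  static final String PROXY_PASSWORD = "proxy.password"; // assumed key

  private final String hostname;
  private final int port;
  private final String username;
  private final String password;

  private ProxySettings(String hostname, int port, String username, String password) {
    this.hostname = hostname;
    this.port = port;
    this.username = username;
    this.password = password;
  }

  String hostname() { return hostname; }
  int port() { return port; }

  /** True only when both username and password were supplied. */
  boolean requiresCredentials() {
    return username != null && password != null;
  }

  private static boolean isNullOrEmpty(String value) {
    return value == null || value.isEmpty();
  }

  /** Returns empty when no usable proxy (hostname plus port) is configured. */
  static Optional<ProxySettings> fromProperties(Map<String, String> properties) {
    String hostname = properties.get(PROXY_HOSTNAME);
    String port = properties.get(PROXY_PORT);
    if (isNullOrEmpty(hostname) || isNullOrEmpty(port)) {
      return Optional.empty();
    }
    String username = properties.get(PROXY_USERNAME);
    String password = properties.get(PROXY_PASSWORD);
    boolean hasCredentials = !isNullOrEmpty(username) && !isNullOrEmpty(password);
    return Optional.of(
        new ProxySettings(
            hostname,
            Integer.parseInt(port),
            hasCredentials ? username : null,
            hasCredentials ? password : null));
  }

  public static void main(String[] args) {
    Map<String, String> properties = new HashMap<>();
    properties.put(PROXY_HOSTNAME, "127.0.0.1");
    properties.put(PROXY_PORT, "8080");
    System.out.println(fromProperties(properties).get().requiresCredentials());
    properties.put(PROXY_USERNAME, "admin");
    properties.put(PROXY_PASSWORD, "admin");
    System.out.println(fromProperties(properties).get().requiresCredentials());
  }
}
```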

@akhilputhiry
Contributor Author

akhilputhiry commented Mar 12, 2025

@amogh-jahagirdar @rdblue @nastra

Could you folks please take a look at this?

Contributor

@flyrain flyrain left a comment

Thanks @akhilputhiry for working on it. LGTM with minor comments.


/** http proxy configuration for rest catalog */
public static final String PROXY_HOSTNAME = "proxy.hostname";

Contributor

Nit: remove the empty line?

Contributor Author

The line is added automatically when I run 'gradle spotlessJavaApply'

@amogh-jahagirdar amogh-jahagirdar self-requested a review March 12, 2025 19:03
SessionCatalog.SessionContext.createEmpty(),
config -> HTTPClient.builder(config).uri(config.get(CatalogProperties.URI)).build());
config -> {
HTTPClient.Builder builder =
Contributor

@nastra nastra Mar 13, 2025

One possible issue I can think of is that this approach will only work for the REST catalog itself. Things like refreshing vended credentials with S3/GCS (VendedCredentialsProvider/OAuth2RefreshCredentialsHandler) or S3 signing (S3V4RestSignerClient) won't work, since those places instantiate their own HTTP client, which wouldn't have the proxy configured.

@akhilputhiry
Contributor Author

Thanks for the feedback, @nastra.
I have made changes to address your comments.
Could you please take another look?

Integer proxyPort =
PropertyUtil.propertyAsNullableInt(properties, HTTPClient.REST_PROXY_PORT);

if (proxyHostname != null && !proxyHostname.isEmpty() && proxyPort != null) {
Contributor

You can use !Strings.isNullOrEmpty(proxyHostname). The same applies further below for username/password.

Contributor

@nastra nastra left a comment

Please add some tests to TestHTTPClient that pass different properties and verify that the proxy has been configured with and without auth.

@akhilputhiry akhilputhiry force-pushed the http-proxy branch 4 times, most recently from ec0c4d7 to 7f49607 Compare March 15, 2025 08:57
@akhilputhiry
Contributor Author

Thanks @nastra.
I have made the suggested changes.

@akhilputhiry akhilputhiry force-pushed the http-proxy branch 2 times, most recently from d363790 to 3cece07 Compare March 17, 2025 17:12
PropertyUtil.propertyAsString(properties, HTTPClient.REST_PROXY_PASSWORD, null);

if (!Strings.isNullOrEmpty(proxyUsername) && !Strings.isNullOrEmpty(proxyPassword)) {

Contributor

Suggested change

import org.mockserver.model.HttpResponse;
import org.mockserver.verify.VerificationTimes;

import static org.assertj.core.api.Assertions.assertThat;
Contributor

You might need to double-check your IDE settings, but static imports need to be at the top.

@nastra
Contributor

nastra commented Mar 18, 2025

@akhilputhiry can you please fix the ordering of the static imports and also update the PR title to reflect the latest changes?

@akhilputhiry
Contributor Author

@nastra I recreated the IDEA project files using the following:

gradle cleanIdea
gradle idea

The imports are good now

Thanks

@akhilputhiry akhilputhiry changed the title from "pass proxy configuration from environment vars to http client" to "Enable HTTP proxy support for the Client used by REST Catalog" Mar 19, 2025
@akhilputhiry akhilputhiry changed the title from "Enable HTTP proxy support for the Client used by REST Catalog" to "Enable HTTP proxy support for the client used by REST Catalog" Mar 19, 2025
Contributor

@nastra nastra left a comment

LGTM, thanks @akhilputhiry. @amogh-jahagirdar / @danielcweeks can you also please take a look?

@danielcweeks
Contributor

I have a few concerns here and it may overlap a little with what @adutra was getting at. This appears to tunnel a separate config/auth to the http client as opposed to extending and using the AuthManager. Since this is primarily setting host/port and auth, why wouldn't we configure this via basic auth manager or a proxy auth manager? I'm just concerned we're creating two alternate paths to configure auth.

@adutra and @nastra thoughts?

@adutra
Contributor

adutra commented Mar 21, 2025

I have a few concerns here and it may overlap a little with what @adutra was getting at. This appears to tunnel a separate config/auth to the http client as opposed to extending and using the AuthManager. Since this is primarily setting host/port and auth, why wouldn't we configure this via basic auth manager or a proxy auth manager? I'm just concerned we're creating two alternate paths to configure auth.

@danielcweeks I am fine with proxy authentication being handled by the HTTP client's own machinery rather than by an auth manager, for a few reasons:

  • An AuthManager is basically a layered authentication system (catalog/context/table), but I can hardly think of a valid use case where one layer would want to override the proxy authentication settings of another layer. Leveraging an AuthManager here seems like overkill.
  • AuthManager was designed for authenticating against the target host, not the proxy host: to leverage an AuthManager for proxy auth, we'd need to introduce a new method e.g. proxyAuthenticate, or a new constructor parameter boolean isProxyAuth – because the request headers to inject are not the same. Probably not worth the hassle.
  • Proxy authentication is imo an infrastructure concern and should stay transparent to the application logic, whereas the AuthManager integrates more tightly with application logic.
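
To make the second bullet concrete: credentials for the target host travel in the Authorization header (challenged with 401 and WWW-Authenticate), while credentials for a proxy travel in Proxy-Authorization (challenged with 407 and Proxy-Authenticate). A small JDK-only sketch using Basic auth follows; AuthHeaders is a hypothetical helper, not an Iceberg class.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.Map;

/** Hypothetical helper contrasting target-host and proxy authentication headers. */
final class AuthHeaders {

  /** Encodes "username:password" as an HTTP Basic credential. */
  static String basic(String username, String password) {
    String pair = username + ":" + password;
    return "Basic " + Base64.getEncoder().encodeToString(pair.getBytes(StandardCharsets.UTF_8));
  }

  /** Header a client sends to authenticate against the target (origin) server. */
  static Map<String, String> forTarget(String username, String password) {
    return Map.of("Authorization", basic(username, password));
  }

  /** Header a client sends to authenticate against the proxy. */
  static Map<String, String> forProxy(String username, String password) {
    return Map.of("Proxy-Authorization", basic(username, password));
  }

  public static void main(String[] args) {
    System.out.println(forTarget("admin", "admin"));
    System.out.println(forProxy("admin", "admin"));
  }
}
```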

I expressed however a different concern: we are moving towards a world where the REST client needs to talk to TWO servers instead of one: the catalog server and the authorization server. We should therefore make it possible for the client to use different proxy settings for each server. @akhilputhiry proposed a solution for that using "named" proxy configs:

#12406 (comment)

The proposal is interesting but I don't think it has been implemented in this PR, so we'd need to address that as a follow-up task.

To summarize my POV: I'm +1 on this PR, provided that we introduce multi-host proxy settings later on.

@danielcweeks
Contributor

To summarize my POV: I'm +1 on this PR, provided that we introduce multi-host proxy settings later on.

I'm not convinced we want to go down this path until there are real world examples where we would have both. We're adding a lot of complexity to the configuration and I don't want to do that speculatively.

As for the AuthManager vs. native client, I thought it would be possible (I think it might be for HTTP), but for https it's a little more complicated with how the client communicates with the proxy server.
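
The thread does not spell out why HTTPS is harder. One plausible reading, offered here as an assumption: for an HTTPS target the client first asks the proxy to open a tunnel with a CONNECT request, and any Proxy-Authorization header has to be on that CONNECT. Headers added to the tunneled request are encrypted end to end and never seen by the proxy, so a component that only decorates application requests cannot authenticate to the proxy. The sketch below just builds the text of such a CONNECT request; ConnectTunnel is a hypothetical name.

```java
/** Hypothetical helper that formats the CONNECT request sent to a proxy before TLS starts. */
final class ConnectTunnel {

  /**
   * Builds the request a client sends to the proxy to open a tunnel to targetHost:targetPort.
   * proxyAuthorization may be null when the proxy needs no credentials.
   */
  static String connectRequest(String targetHost, int targetPort, String proxyAuthorization) {
    String authority = targetHost + ":" + targetPort;
    StringBuilder request = new StringBuilder();
    request.append("CONNECT ").append(authority).append(" HTTP/1.1\r\n");
    request.append("Host: ").append(authority).append("\r\n");
    if (proxyAuthorization != null) {
      // Proxy credentials belong on the CONNECT itself, not on the tunneled request
      request.append("Proxy-Authorization: ").append(proxyAuthorization).append("\r\n");
    }
    request.append("\r\n"); // blank line ends the header section
    return request.toString();
  }

  public static void main(String[] args) {
    System.out.print(connectRequest("catalog.example.com", 443, "Basic YWRtaW46YWRtaW4="));
  }
}
```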

@akhilputhiry
Contributor Author

@danielcweeks @adutra
I believe the current discussion on multiple proxies can be continued in a follow-up PR or on Slack.
Do we have agreement on the current proxy enablement? Are we good to take this forward and close?

@sfc-gh-mbaron

@akhilputhiry @danielcweeks @adutra any plans to merge this soon? This is blocking some things I'm trying to do and trying to get a sense of timing.

@akhilputhiry
Contributor Author

@sfc-gh-mbaron I am also eagerly waiting for this to be merged

@nastra @adutra @danielcweeks @amogh-jahagirdar Could you folks please help move this forward?

Thanks

@akhilputhiry
Contributor Author

I wanted to follow up on this.

@amogh-jahagirdar @danielcweeks @nastra @adutra

@flyrain
Contributor

flyrain commented May 2, 2025

Thanks for the discussion. I'm leaning toward keeping proxy separate from authManager.

The small piece we’re adding (an HTTP proxy) is transport-level wiring for the HTTP client. Things like proxy settings, TLS, and timeouts normally belong to the HTTP transport layer and control how HTTP works. AuthManager, on the other hand, handles app-level auth policies such as user ID, token, and authorization logic. The line can blur in practice, but it’s important to keep a clean boundary in our code so we don’t end up with two overlapping ways to configure different layers, which usually leads to messiness and bugs. With that, +1 on the method proposed here.

@sfc-gh-mbaron

@flyrain @adutra @nastra @danielcweeks @amogh-jahagirdar any update here??

Contributor

@danielcweeks danielcweeks left a comment

I'm ok with exposing proxy support. +1

@flyrain flyrain merged commit a04c532 into apache:main May 6, 2025
42 checks passed
devendra-nr pushed a commit to devendra-nr/iceberg that referenced this pull request Dec 8, 2025
6 participants