
Regression when using ExternalName services #13081

Open
philpep opened this issue Mar 25, 2025 · 24 comments · May be fixed by #13154
Labels
kind/bug, needs-priority, needs-triage

Comments

@philpep

philpep commented Mar 25, 2025

After upgrading from chart 4.11.3 to 4.12.1, my ingresses using ExternalName services aren't working anymore (HTTP 503 Service Temporarily Unavailable).

The controller logs show:

2025/03/25 10:22:17 [error] 26#26: *33847 lua entry thread aborted: runtime error: /etc/nginx/lua/balancer.lua:78: bad argument #1 to 'ipairs' (table expected, got nil)
stack traceback:
coroutine 0:
        [C]: in function 'ipairs'
        /etc/nginx/lua/balancer.lua:78: in function 'resolve_external_names'
        /etc/nginx/lua/balancer.lua:114: in function 'sync_backend'
        /etc/nginx/lua/balancer.lua:148: in function </etc/nginx/lua/balancer.lua:146>, context: ngx.timer

Example:

apiVersion: v1
kind: Service
metadata:
  name: example
spec:
  type: ExternalName
  externalName: internal.example.com
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example.com
  annotations:
    kubernetes.io/ingress.class: "nginx"
    nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
    nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"
spec:
  rules:
    - host: example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: example
                port:
                  number: 443
  tls:
    - hosts:
      - example.com
philpep added the kind/bug label Mar 25, 2025
k8s-ci-robot added the needs-triage label Mar 25, 2025
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@philpep
Author

philpep commented Mar 25, 2025

It might be related to #13076 but it seems to be a different issue.

@philpep
Author

philpep commented Mar 25, 2025

It seems the bug was introduced in the 4.11.5 release (4.11.4 is OK).

@philpep
Author

philpep commented Mar 25, 2025

I think the issue comes from this commit: c6c5b48

@tolix1

tolix1 commented Mar 26, 2025

Same issue here, 4.11.5 and 4.12.1 are impacted.

@Gacko
Member

Gacko commented Mar 26, 2025

Please do not override the issue template and instead fill it as requested. This is important for reproducing your issue.

Also please add information about what internal.example.com is pointing at. Are these IP addresses? Is it a CNAME?

@philpep
Author

philpep commented Mar 26, 2025

Please do not override the issue template and instead fill it as requested. This is important for reproducing your issue.

Sorry will do better next time.

Also please add information about what internal.example.com is pointing at. Are these IP addresses? Is it a CNAME?

Yes it's a resolvable CNAME.

@sepich
Contributor

sepich commented Mar 26, 2025

Another possibly related issue is that nginx.ingress.kubernetes.io/default-backend stopped working for Ingresses backed by ExternalName services.

Here are example YAMLs:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: echo
spec:
  selector:
    matchLabels:
      app: echo
  template:
    metadata:
      labels:
        app: echo
    spec:
      containers:
        - image: inanimate/echo-server
          name: echo
      enableServiceLinks: false
---
apiVersion: v1
kind: Service
metadata:
  name: echo
spec:
  selector:
    app: echo
  ports:
    - port: 8080
      targetPort: 8080
      protocol: TCP
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    nginx.ingress.kubernetes.io/upstream-vhost: google.com
    nginx.ingress.kubernetes.io/custom-http-errors: "301"
    nginx.ingress.kubernetes.io/default-backend: echo
    prometheus.io/probe: "false"
  name: test
spec:
  ingressClassName: nginx
  defaultBackend:
    service:
      name: ext
      port:
        number: 80
---
apiVersion: v1
kind: Service
metadata:
  name: ext
spec:
  type: ExternalName
  externalName: google.com

So in this example we are trying to use custom errors:
https://kubernetes.github.io/ingress-nginx/examples/customization/custom-errors/
In this case we access google.com via HTTP/80 and get a redirect to HTTPS:

$ curl -si google.com | head -1
HTTP/1.1 301 Moved Permanently

Then we try to have this 301 served by our echo service.
It works in v1.11.4:

127.0.0.1 - - [26/Mar/2025:15:22:12 +0000] "GET / HTTP/1.1" 200 1774 "-" "curl/8.7.1" 67 0.033 [custom-default-backend-default-echo] [] 142.250.186.174:80 : 10.244.0.8:8080 0 : 1774 0.030 : 0.003 301 : 200 0cc21efb7ceba0b69745bc9afc359738

but broken in v1.11.5:

127.0.0.1 - - [26/Mar/2025:15:15:00 +0000] "GET / HTTP/1.1" 502 150 "-" "curl/8.7.1" 67 0.036 [custom-default-backend-default-echo] [] 216.58.206.78:80 : 0.0.0.1:80 0 : 0 0.035 : 0.001 301 : 502 6958aeaac151beea2b82922fd8da30fd
2025/03/26 15:15:00 [error] 38#38: *24724 connect() failed (113: Host is unreachable) while connecting to upstream, client: 127.0.0.1, server: _, request: "GET / HTTP/1.1", upstream: "http://0.0.0.1:80/", host: "test"
2025/03/26 15:15:00 [warn] 38#38: *24724 upstream server temporarily disabled while connecting to upstream, client: 127.0.0.1, server: _, request: "GET / HTTP/1.1", upstream: "http://0.0.0.1:80/", host: "test"

@Gacko
Member

Gacko commented Mar 26, 2025

Yes it's a resolvable CNAME.

Out of curiosity: Can you make it an A / AAAA record? Just wanna see if it's related to the record type.

@philpep
Author

philpep commented Mar 27, 2025

Yes it's a resolvable CNAME.

Out of curiosity: Can you make it an A / AAAA record? Just wanna see if it's related to the record type.

In my case it's an A record (without AAAA) resolving to an RFC1918 reserved IP address, outside of the k8s cluster (e.g. 192.168.42.12).

@vasili439

Yes it's a resolvable CNAME.

Out of curiosity: Can you make it an A / AAAA record? Just wanna see if it's related to the record type.

In my case it's an A record (without AAAA) resolving to an RFC1918 reserved IP address, outside of the k8s cluster (e.g. 192.168.42.12).

In my case it's the same: as a quick fix I've tried replacing the RFC1918 IP with a temporary domain name (A record), but no luck. HTTP 503 both with a plain IP address and with a DNS A record.

@Gacko
Member

Gacko commented Mar 27, 2025

Maybe @neerfri can shed some light on this, as they implemented the change.

@Confushion

Confushion commented Mar 28, 2025

Same issue here (v4.11.5) using an IP address as externalName.

@Confushion

Update: changing externalName from an IP address to a (valid) DNS name (A record) seemed to fix my issue...

@wilmardo
Contributor

@strongjz any reason why this is closed? This for sure is an undocumented regression. It was previously supported to have an IP address as your ExternalName address and now it isn't working anymore.
If it is an intended change, it at least needs to be documented that this isn't supported anymore.

@tgraskemper

tgraskemper commented Mar 31, 2025

@wilmardo You seem to be referring to the other, closed ticket #13076, where the response stated, as the documentation also does, that IP addresses as ExternalName are not supported.

Your statement about this being an undocumented regression remains true, however. The fact that CNAMEs now behave such that 503s randomly occur and are sent to the user is clearly an issue, as we are facing the same. If an A record is supposed to fix this, it misses the fact that some of us are proxying to an external service where we don't control the DNS entry, and so have to come up with a hacky workaround, like resolving the name, which can potentially change, and uploading the result as a new record to be used as the ExternalName (assuming you don't have other TLS issues).

An alternative workaround, potentially impacting your existing ingress configuration and also quite ugly, is to capture the 503 and use the proxy_pass directive, which is not impacted by this issue, to proxy back to the original service.

    nginx.ingress.kubernetes.io/configuration-snippet: |
      error_page 503 = @fallback_pass;
    nginx.ingress.kubernetes.io/server-snippet: |
      location @fallback_pass {
        set $proxy_host mysubdomain.mysite.com;
        proxy_set_header Host $proxy_host;
        proxy_pass https://$proxy_host/;
      }

Truthfully, this solution might seem quite dumb to someone who knows the NGINX configuration options better than I do, so I would love for someone else to chime in with a better workaround.

@philpep
Author

philpep commented Mar 31, 2025

@strongjz @Confushion @wilmardo please note this ticket is about using a valid hostname/CNAME (not an IP address) as externalName. This is not the same as #13076.

I think this ticket should be re-opened since the issue still exists.

I still have this traceback on 4.12.1:

2025/03/25 10:22:17 [error] 26#26: *33847 lua entry thread aborted: runtime error: /etc/nginx/lua/balancer.lua:78: bad argument #1 to 'ipairs' (table expected, got nil)
stack traceback:
coroutine 0:
        [C]: in function 'ipairs'
        /etc/nginx/lua/balancer.lua:78: in function 'resolve_external_names'
        /etc/nginx/lua/balancer.lua:114: in function 'sync_backend'
        /etc/nginx/lua/balancer.lua:148: in function </etc/nginx/lua/balancer.lua:146>, context: ngx.timer

@strongjz
Member

strongjz commented Mar 31, 2025

Apologies, folks, I read @Confushion's response and thought the issue was resolved.

/reopen

k8s-ci-robot reopened this Mar 31, 2025
@k8s-ci-robot
Contributor

@strongjz: Reopened this issue.

In response to this:

Apologies, folks, I read @Confushion's response and thought the issue was resolved.

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@neerfri
Contributor

neerfri commented Apr 1, 2025

Hi all!
I'm the original author of the commit that caused the issue.
I'd love to help solve this.

@philpep - Your comment stating the difference between this issue and #13076 was important. Thank you.

After reading the code again, with the stack trace you provided in mind, it's clear that backend.endpoints in your case is nil.
It's possible to fix the code such that it handles this case. Before I issue such a fix, I think we need to understand why this is happening. This will allow us to add a test to ensure we don't have regressions again.

Here's what I know:

  • resolve_external_names(original_backend) is called from sync_backend(backend).
  • In this case sync_backend(backend) is called from sync_backends_with_external_name(). The call from sync_backends skips external name backends.
  • sync_backends_with_external_name() is called by a timer and operates on the backends_with_external_name variable.
  • The backends_with_external_name variable is updated in sync_backends(), which is also called by a timer.
  • sync_backends() pulls the backend data using configuration.get_backends_data(), which is a JSON object describing the backends.
  • To the best of my understanding, that JSON is set from the Go function func configureBackends(rawBackends []*ingress.Backend).
  • From what I can tell by reading these lines, the endpoints for a backend are always set in the Backend struct; an empty array is always initialized.
  • At the struct's definition, the Endpoints attribute is declared as Endpoints []Endpoint `json:"endpoints,omitempty"`, which means that zero-length arrays are omitted from the JSON. Hence the nil on the Lua side (see the sketch just below this list).
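
For illustration, here is roughly how that plays out on the Lua side (a hypothetical, simplified reproduction, not the actual payload or balancer.lua code):

-- Hypothetical, simplified reproduction of the failure mode: with the Go tag
-- json:"endpoints,omitempty", a zero-length Endpoints slice is dropped from the
-- JSON, so the decoded backend table has no "endpoints" key at all (nil).
local backend_without_endpoints = {
  name = "default-example-443",
  -- "endpoints" is absent, i.e. nil on the Lua side
}

-- Iterating over nil reproduces the error from the stack trace above.
local ok, err = pcall(function()
  for _, endpoint in ipairs(backend_without_endpoints.endpoints) do
    print(endpoint.address)
  end
end)
print(ok, err)
-- false   ...: bad argument #1 to 'ipairs' (table expected, got nil)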

To sum this knowledge up into a plan, we can either:

  • fix the Lua code to account for a nil value in backend.endpoints, or
  • change the JSON coming in by removing omitempty from the struct tag.

Personally I feel more comfortable with changing the Lua code, to reduce possible effects on other parts of the system.
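
For what it's worth, a minimal sketch of what the first option could look like, assuming the function shape implied by the stack trace (the real balancer.lua code, DNS resolution included, will of course differ):

-- Minimal sketch of the nil-guard (first option), not the actual balancer.lua code:
-- treat a missing backend.endpoints as an empty list instead of crashing.
local function resolve_external_names(original_backend)
  local resolved = {}
  for _, endpoint in ipairs(original_backend.endpoints or {}) do
    -- here the real code resolves endpoint.address via DNS; details omitted
    table.insert(resolved, endpoint)
  end

  -- shallow copy of the backend with the resolved endpoints swapped in
  local backend = {}
  for key, value in pairs(original_backend) do
    backend[key] = value
  end
  backend.endpoints = resolved
  return backend
end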

@strongjz @wilmardo @Gacko - As members of the repo, please share your opinion and who can shepherd this change in. It took a year to get the original PR merged. I want to make sure that, if I put in the hours to issue a PR and solve this, there is a member who wants to help us get it merged in a timely manner.

Thank you ✌️

@Gacko
Member

Gacko commented Apr 1, 2025

Hello,

first, @wilmardo is not a maintainer of this project. Second, I think we can get a possible fix in quite fast now, because there's a proper discussion and investigation going on. This is why I linked you here: you have way more context and knowledge around this than I do, so I'm pretty sure we can get this fixed soon.

Thank you!

@philpep
Author

philpep commented Apr 1, 2025

Hi @neerfri, thanks for the detailed analysis!

While I'm not sure I understand all the details, it appears not everyone has this issue, since people who had an IP address and then switched to a CNAME fixed their issue.

So I think it might be related to my environment (DNS server or k8s setup). One important thing I omitted to mention is that I'm running an EOL Kubernetes 1.28.15 cluster. Given what you said about "Endpoint", maybe there was a change in kubernetes/client-go where "nil" values are expected to be returned as an empty map, or something like that.

I plan to upgrade my clusters soon, I'll see if it fixes my issue.

What k8s cluster version are other people who have the same issue using?

@jmiller-ca

jmiller-ca commented Apr 1, 2025

I was going to keep quiet, but maybe some additional info will help.

EKS 1.32
Using the helm chart for install

The manifest below works fine on v4.11.4, but updating to 4.12.1 breaks it.

apiVersion: v1
kind: Service
metadata:
  name: files-cache-proxy
spec:
  type: ExternalName
  externalName: media.mycompany.com
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-mycompany-com
    nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"
    nginx.ingress.kubernetes.io/proxy-ssl-name: media.mycompany.com
    nginx.ingress.kubernetes.io/upstream-vhost: "media.mycompany.com"
    # # Add caching annotations
    # nginx.ingress.kubernetes.io/proxy-cache: "cache-media-my-company.com"

    # Add header control
    nginx.ingress.kubernetes.io/configuration-snippet: |
      expires 1M;
      add_header Cache-Control "public, max-age=2592000, immutable";
      proxy_ssl_name media.mycompany.com;
      proxy_ssl_server_name on;
      proxy_cache_valid 200 302 24h;
      proxy_cache_valid 301      1h;
      proxy_cache_valid any      1m;
      # Hide cache-related headers
      proxy_hide_header X-Powered-By;
      proxy_hide_header Vary;
      proxy_hide_header Pragma;
      proxy_hide_header Last-Modified;
      proxy_hide_header Set-Cookie;
  name: files-cache-proxy
spec:
  ingressClassName: external
  rules:
    - host: my-01.qa.mycompany.com
      http:
        paths:
          - backend:
              service:
                name: files-cache-proxy
                port:
                  number: 443
            path: /files/cache
            pathType: Prefix
  tls:
    - hosts:
        - my-01.qa.mycompany.com
      secretName: my-01.qa.mycompany-com-tls

When looking at the ingress list, files-cache-proxy did not have an ADDRESS.

With a revert to v4.11.4, everything started working again as expected.

Sorry for the multiple updates.

neerfri added a commit to neerfri/ingress-nginx that referenced this issue Apr 3, 2025
neerfri linked a pull request Apr 3, 2025 that will close this issue
@neerfri
Contributor

neerfri commented Apr 3, 2025

Hi All,

TL;DR:

A pull request was opened to fix this: #13154
Maintainers, please help approve it for tests.
Others, please subscribe if you are affected.

Updating here as I'm making progress on tracking down the source of the problem, in order to decide the best course of action.

It seems that the behavior regarding the endpoints of a Service with type ExternalName has changed, which is why different Kubernetes versions produce different outcomes here.
For those who wish to learn more, you can read:
The issue reported: kubernetes/kubernetes#105986 (comment)
The PR that made the change: kubernetes/kubernetes#114814

After an hour of digging and thinking about this, I'm not sure how we can fix the implementation for Kubernetes versions that do not send the endpoints, because the Lua code was reading the DNS entry from the endpoints. I might need to learn more about the exact payload we get from Kubernetes for this.

@Gacko I've created a PR at #13154
Currently the PR only contains tests to cover the scenarios discussed here.
Please approve it for testing so I can make progress on the tests, to guide us toward a possible implementation.
As you can see, I'm giving this priority since we have users being impacted; your quick response will be much appreciated.

Thank you ✌️
