Controller-ID Proposal

Introduction

This proposal aims to support multiple cluster-level Prometheus operator instances running concurrently without conflicting over the same custom resources. The solution isn’t limited to Prometheus resources: it will also be available for Alertmanager and ThanosRuler resources, as well as for any pod-based resource that could be added in the future.

Such conflicts can significantly impact use cases where multiple Prometheus operator instances run at the same time in the same Kubernetes cluster.

Why

Currently, we encounter issues when different users deploy different instances of the Prometheus operator, which then try to reconcile the same resources.

In the worst-case scenario, these operators may not only compete for ownership of the CRD resources but also attempt to rewrite or redeploy different versions of the CRD, causing disruptions to all pods.

The remediation for this scenario, where users deploy their Prometheus operator instances in parallel, involves using one of the many CLI arguments such as --deny-namespaces, --namespaces, --prometheus-instance-selector or --prometheus-instance-namespaces. But this requires cooperation between the different parties, and there’s no way to ensure that a specific monitoring resource is managed only by a specific operator instance.

How

After some research from @machine424, we have identified a potential solution already implemented by the zalando/postgres-operator. When an operator is configured with a specific “controller ID” value, it will only reconcile resources that have a matching “controller ID” annotation.

Conversely, if the operator is not configured with a “controller ID,” it will skip all resources that have a “controller ID” annotation. More details can be found in the zalando/postgres-operator documentation.
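
The matching rule described above can be sketched in Go as follows. This is a minimal illustration, not the operator's implementation: the annotation key example.com/controller-id and the function name shouldReconcile are hypothetical, since the proposal does not fix either at this point.

```go
package main

import "fmt"

// controllerIDAnnotation is a hypothetical annotation key used only for
// illustration; the proposal text does not specify the actual key.
const controllerIDAnnotation = "example.com/controller-id"

// shouldReconcile reports whether an operator configured with operatorID
// (empty when no controller ID is set) should reconcile a resource that
// carries the given annotations.
func shouldReconcile(operatorID string, annotations map[string]string) bool {
	resourceID, annotated := annotations[controllerIDAnnotation]
	if operatorID == "" {
		// No controller ID configured: skip every annotated resource.
		return !annotated
	}
	// Controller ID configured: reconcile only on an exact match.
	return annotated && resourceID == operatorID
}

func main() {
	teamA := map[string]string{controllerIDAnnotation: "team-a"}
	plain := map[string]string{}

	fmt.Println(shouldReconcile("team-a", teamA)) // true: IDs match
	fmt.Println(shouldReconcile("team-b", teamA)) // false: IDs differ
	fmt.Println(shouldReconcile("", teamA))       // false: operator has no ID
	fmt.Println(shouldReconcile("", plain))       // true: neither side uses an ID
	fmt.Println(shouldReconcile("team-a", plain)) // false: resource not annotated
}
```

Under this rule, a resource is claimed by at most one controller ID, and an operator started without an ID keeps its current behavior for every resource that carries no annotation.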

Goals

  • Guarantee that a custom resource will be managed by a specific Prometheus operator instance.

Audience

This proposal is relevant to the following audience:

  • Users who provide Prometheus as a service and want to run multiple Prometheus operator instances in different namespaces.
  • Users seeking to mitigate the impact of rogue Prometheus instances.

Non-Goals

  • Provide a solution that works without user intervention. The solution requires work from the user deploying the operator and the resources (e.g. if the operator is started without any specific argument, it will attempt to reconcile all resources in all namespaces).

Alternatives

Although the initial discussion for this proposal considered adding an owner reference within its scope, the ControllerRef model does not directly address this problem. The ControllerRef model solves the problem of controllers that fight over controlled objects due to overlapping selectors (e.g. a ReplicaSet fighting with a ReplicationController over Pods because both controllers have label selectors that match those Pods).