Infrastructure

How LinkedIn moved its Kubernetes APIs to a different API group

LinkedIn has used Kubernetes for many years because of the capability, scalability, and customization it provides to our engineering teams. Customization is particularly valuable for a platform of our size and complexity, which is why it is helpful that Kubernetes offers API extensibility through custom resource definitions. As a Kubernetes user, you can define your own APIs in addition to the built-in APIs that come with Kubernetes, and each custom API belongs to a particular API group. One limitation we encountered is that Kubernetes doesn't offer any migration functionality to rename a custom API or move it to another API group; once you choose a name and API group, it's permanent.

We recently migrated one of LinkedIn's major internal custom Kubernetes APIs to a new API group, while also introducing major changes to the API. To achieve this, we built the migration machinery we needed ourselves. With several hundred microservices at LinkedIn already using the old API to run their applications, it was critical to complete the migration without creating downtime on our website, a goal we achieved.

This article will explain why we moved this API between Kubernetes API groups, the limitations of the API versioning machinery in Kubernetes, and how we created our own solution, a "mirror controller," to seamlessly migrate to a new API while the old API was actively being used.

Motivation for the API group migration

Our first internal custom Kubernetes API at LinkedIn, named the LiDeployment API, is our version of the Kubernetes built-in Deployment API. It runs stateless workloads and natively integrates with our internal service ecosystem around:

  • deployment orchestration and ramp control
  • canary analysis 
  • auto-rightsizing
  • service discovery

After several years of development, the shape of this custom API and controller implementation were in a suboptimal state due to organically accumulated tech debt and accidental complexity resulting from changing requirements. As a result, the LiDeployment API started to look less consistent with Kubernetes API design conventions and principles.

Before we took on this migration, the LiDeployment API ran several hundred internal microservice applications, and we planned to onboard thousands more. We wanted to use this opportunity to move the API to a new group called "apps.linkedin.com." The new API group name better fits its purpose and long-term trajectory, while also offering a more idiomatic API design to better satisfy newer use cases and requirements and make engineers at LinkedIn more productive.

Why we didn’t use Kubernetes API conversion

Kubernetes API machinery provides versioning for custom resources, where the object is stored in one version and converted to other versions by conversion webhooks when a request comes in. This means that Kubernetes can present the same object in different forms through different API versions, but there is still a single object, in a single version, stored in etcd.

For example, if you store a custom resource at v1beta2, you can still serve the object at its v1beta1 or v1alpha1 version thanks to conversion webhooks. However, this means any new field you add to v1beta2 has to be added to all older active apiVersions, since backwards compatibility requires lossless round-trip conversions from one apiVersion to another and back.
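For reference, this is what the built-in conversion machinery looks like in a CustomResourceDefinition. The sketch below uses a hypothetical "Widget" CRD and a hypothetical webhook service; note the single version with storage: true, which is the one persisted in etcd:

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: widgets.example.com        # hypothetical CRD, for illustration only
spec:
  group: example.com
  names:
    kind: Widget
    plural: widgets
  scope: Namespaced
  versions:
    - name: v1beta1
      served: true
      storage: false               # served, but converted on the fly
      schema:
        openAPIV3Schema:
          type: object
    - name: v1beta2
      served: true
      storage: true                # the single version stored in etcd
      schema:
        openAPIV3Schema:
          type: object
  conversion:
    strategy: Webhook              # conversion webhook translates between versions
    webhook:
      conversionReviewVersions: ["v1"]
      clientConfig:
        service:
          name: widget-conversion  # hypothetical webhook service
          namespace: default
          path: /convert
```

Note that `spec.group` is fixed for the CRD: the conversion stanza can only translate between versions within that one group.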

In practice, this means API versions in Kubernetes do not allow you to add or remove fields to your API. All versions of your object must be isomorphic. This severely limited our ability to introduce fundamental changes to the LiDeployment API. On top of this, Kubernetes version conversion does not allow moving an API from one group to another; and we wanted to do away with our old API group name as mentioned earlier.

We needed to do what the conversion webhook does between API versions across API groups. Therefore, we built our own API migration machinery to carry out a no-downtime migration where we continued to serve the LiDeployment API on both the old and the new API group.

Migrating APIs by mirroring

Once the design and development of the new LiDeployment API (under the apps.linkedin.com apiGroup) and its controller implementation were complete, the next step was to migrate the objects on the old API group to this new API.

Migration plan

Since we had several hundred applications deployed and serving production traffic through the old API, this migration needed to be non-disruptive to the application owners. This meant that application owners could continue to use the old LiDeployment API to submit their resource manifests, and it should continue to work. It wasn’t feasible to introduce a company-wide deployment freeze to accomplish this.

Similarly, we had internal systems that read/wrote the old LiDeployment API to do things like monitoring deployment status or autoscaling. Therefore, both the old and new LiDeployment objects needed to be in the cluster for a prolonged time to ensure these integrations worked during the migration to the new API.

We devised a controlled ramp plan to move each LiDeployment from one API group to another in incremental batches. Defining three distinct stages for a LiDeployment helped us reason about which version is the source of truth and which controller is actively operating on the object. These stages are:

  1. Not yet migrated: Old LiDeployment object is the source of truth, and the old controller actively reconciles this object. There is no “new” LiDeployment object for the app at this stage.
  2. Mirroring: Old LiDeployment object remains as the source of truth, but a corresponding “new” LiDeployment object derived from it exists. Old controller becomes no-op, the new controller takes over and reconciles the new object.
  3. Fully migrated: New LiDeployment is now the source of truth for configuration, and its spec is no longer mirrored from the old LiDeployment.
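The three stages above can be derived from the state of the two objects. The following sketch shows one way to classify a LiDeployment, using the "mirror=true" and "source-of-truth=true" annotations described later in this article (the exact annotation keys and object shapes are simplified assumptions, not LinkedIn's actual implementation):

```python
# Sketch: classify a LiDeployment's migration stage from the old and new
# objects. Annotation keys follow the article; shapes are simplified dicts.

def annotation(obj, key):
    """Read a metadata annotation from a Kubernetes-style object dict."""
    return (obj or {}).get("metadata", {}).get("annotations", {}).get(key)

def migration_stage(old_obj, new_obj):
    if new_obj is not None and annotation(new_obj, "source-of-truth") == "true":
        return "fully-migrated"    # new object is the source of truth
    if old_obj is not None and annotation(old_obj, "mirror") == "true":
        return "mirroring"         # old object is mirrored into the new group
    return "not-yet-migrated"      # old controller still reconciles the old object

old = {"metadata": {"annotations": {"mirror": "true"}}}
new = {"metadata": {"annotations": {"source-of-truth": "true"}}}
print(migration_stage(old, None))   # mirroring
print(migration_stage(old, new))    # fully-migrated
print(migration_stage({}, None))    # not-yet-migrated
```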

Mirror controller

To facilitate the coordination between the old controller and the new controller and synchronize the data (spec and status) between LiDeployments on the old and the new API group, we built a special type of controller that mirrors the old LiDeployment resources to the new API version that is in another apiGroup.

This controller (which only runs in the “Mirroring” state described above) has a few top-level working principles:

  • Objects are selectively mirrored: If an old LiDeployment object is in the “mirroring” stage (i.e. has mirror=true annotation), the mirror controller picks up the object and maintains its mirror until the LiDeployment is “fully migrated.” Otherwise, the object is ignored.
  • “Spec” is mirrored from the old API to the new API: Since the old API is still the source of truth for the configuration for any LiDeployment in Mirroring stage, we continuously apply the spec changes on old LiDeployment to its new mirror. This way, our engineers could continue using the old API to manage their apps.
  • “Status” is mirrored from the new API to the old API: Since the new LiDeployment API and its controller have the actual Pod status from the app, we reflect this status on the old LiDeployment object’s status. Any API client that relied on the old LiDeployment’s status continues to work.
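The two mirroring directions above can be sketched as follows, using plain dicts in place of Kubernetes objects (a simplification; in the real controller each direction would be a field-by-field translation rather than a wholesale copy):

```python
# Sketch of the two mirroring directions: spec flows old -> new because
# the old API is still the configuration source of truth, and status
# flows new -> old because the new controller owns the actual Pods.

def mirror_spec(old_obj, new_obj):
    """Copy the spec from the old object onto the new mirror object."""
    result = dict(new_obj)
    result["spec"] = dict(old_obj.get("spec", {}))
    return result

def mirror_status(new_obj, old_obj):
    """Reflect the new controller's status back onto the old object."""
    result = dict(old_obj)
    result["status"] = dict(new_obj.get("status", {}))
    return result

old = {"spec": {"replicas": 3}, "status": {}}
new = {"spec": {}, "status": {"readyReplicas": 3}}
print(mirror_spec(old, new))    # new object now carries the old spec
print(mirror_status(new, old))  # old object now reports the new status
```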
Figure 1: High-level architecture of the old and new LiDeployment APIs, and how the mirror controller facilitates handoff between the old and the new controllers.

Essentially, the mirror controller’s job is to create the illusion that the old API is still functional. It does this by watching both the old and new LiDeployment APIs as its event sources and reconciling toward the goal state: a new LiDeployment object with the same name/namespace as the old one, in the desired shape.

The controller comes up with the desired shape for the “new” API object (its metadata/spec) mostly by looking at the “old” object. Then, the controller either creates or updates the new object. Any status updates on the new object are reflected on the old object by the controller to give the impression that the old object is still functioning.

Mirroring logic

One of the first things the mirror controller does is mark the new object as owned by the old object (through ownerReferences). This helps us clean up the new LiDeployment (and its Pods) if a customer using the old LiDeployment chooses to delete it.
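On the new object, that ownership link looks roughly like the following. The names and UID are hypothetical, and the old API group's name (which the article doesn't give) is left as a placeholder:

```yaml
# New-group LiDeployment, owned by its old-group counterpart so that
# deleting the old object garbage-collects the new one and its Pods.
apiVersion: apps.linkedin.com/v1     # version assumed for illustration
kind: LiDeployment
metadata:
  name: my-service                   # hypothetical app name
  namespace: my-team
  ownerReferences:
    - apiVersion: <old-group>/v1     # the old API group (not named in this article)
      kind: LiDeployment
      name: my-service               # same name/namespace as the old object
      uid: 0b1f6f0e-aaaa-bbbb-cccc-000000000000
      controller: true
```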

Then, the mirror controller comes up with the values on the new LiDeployment object by copying fields from the old object to the new. Once the new LiDeployment object is created by the mirror controller, the controller continues to merge the changes from the old object to the new object as customers continue to use the old API to configure their apps.

This merging logic is aware of the intricacies of how each field works (similar to Kubernetes conversion webhooks). For example, any label/annotation we wanted to clean up is filtered out, and any bad default value we did not want to carry over is left out of the new configuration at this stage. The merging logic is also aware of how to merge maps/structs, since unsetting a field that the webhook defaults could trap the controller in an infinite reconciliation loop.

As a result, the merging logic is largely aware of the current values on both the new and the old LiDeployment, which looks like this in pseudocode:

Figure 2: Pseudocode example of merging logic between the new and old LiDeployment.
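A minimal sketch of merging logic along these lines is shown below. All field and annotation names here are hypothetical illustrations, not LinkedIn's actual fields:

```python
# Sketch of field-aware merging: copy the old spec onto the new object,
# filter out annotations slated for cleanup, and merge maps key-by-key so
# that fields defaulted by the new API's webhook are preserved (unsetting
# them could cause an infinite loop between mirroring and defaulting).
import copy

DROPPED_ANNOTATIONS = {"legacy.example.com/debug"}  # hypothetical cleanup list

def merge(old_obj, new_obj):
    merged = copy.deepcopy(new_obj)

    # Annotations: carry over everything except deliberately dropped keys.
    old_ann = old_obj.get("metadata", {}).get("annotations", {})
    ann = merged.setdefault("metadata", {}).setdefault("annotations", {})
    ann.update({k: v for k, v in old_ann.items() if k not in DROPPED_ANNOTATIONS})

    # Spec: merge key-by-key rather than replacing wholesale, so values
    # already defaulted on the new object are not unset.
    merged.setdefault("spec", {}).update(old_obj.get("spec", {}))
    return merged
```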

It is imperative to get the merging logic right, as there isn’t another chance to fix the migrated configuration after the old API is deleted. Rigorous input validation on the provided object helped us ensure the newly created objects were in ideal shape. We also performed defaulting on fields with missing values to ensure the objects were “complete.” These API validations relieved the new controller code from having to handle missing values on the objects, as recommended by the Kubernetes API conventions.
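The validate-then-default step can be sketched as follows (the fields and their defaults are hypothetical examples, not the actual LiDeployment schema):

```python
# Sketch: validate the mirrored spec, then fill in defaults for missing
# values, so the new controller never has to handle absent fields.

def validate_and_default(spec):
    if spec.get("replicas", 0) < 0:
        raise ValueError("spec.replicas must be non-negative")
    out = dict(spec)
    out.setdefault("replicas", 1)                 # default missing values
    out.setdefault("strategy", "RollingUpdate")   # hypothetical default
    return out

print(validate_and_default({}))              # fully defaulted spec
print(validate_and_default({"replicas": 3})) # user value preserved
```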

At this stage, any old LiDeployment that was mirrored now has a corresponding “new” LiDeployment object, even though all customers continue to use the old LiDeployment objects, and any existing tooling that integrates with the old LiDeployment API continues to work.

Switching over to the new API

After mirroring all of the several hundred LiDeployments to the new API (and validating that the apps were running happily on the new controller), we completed two tasks in parallel:

  1. Moving integrations (such as the continuous deployment and autoscaling systems) to understand the new API shape.
  2. A mass refactoring of YAML manifests checked into the application repositories to use the new API shape.

When the application owner starts deploying their app using the new LiDeployment API, it means that the controller should stop mirroring the spec from the old LiDeployment to the new one, as the new API object is now the source of truth.

To facilitate this, we mark the new LiDeployment with a “source-of-truth=true” annotation. If this annotation is present on a new LiDeployment object, the mirror controller disassociates the old and the new LiDeployments (by deleting the previously added ownerReference) and stops updating the new LiDeployment object based on the old one.

At this stage, the mirror controller has completed its job and the app is fully migrated. We clean up the old LiDeployment object as it no longer has any use.

An example reconciliation logic, including the handoff between “mirroring” to “fully migrated” stage, looks like the following in pseudocode:

Figure 3: Pseudocode example of the reconciliation logic between the new and old LiDeployment.
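A hedged sketch of reconciliation of this kind, including the handoff, is shown below. The annotation keys come from this article, while the `api` helper methods are hypothetical stand-ins for the Kubernetes API calls the real controller would make:

```python
# Sketch of the mirror controller's reconcile loop, covering both the
# "mirroring" stage and the handoff to "fully migrated".

def annotation(obj, key):
    return (obj or {}).get("metadata", {}).get("annotations", {}).get(key)

def reconcile(old_obj, new_obj, api):
    if annotation(old_obj, "mirror") != "true":
        return  # object not opted into mirroring; ignore it

    if new_obj is not None and annotation(new_obj, "source-of-truth") == "true":
        # Handoff: the new object is now the source of truth.
        api.remove_owner_reference(new_obj)  # disassociate old and new
        api.delete(old_obj)                  # old object has no further use
        return

    if new_obj is None:
        api.create_mirror(old_obj)           # create new object from the old one
    else:
        api.update_spec(new_obj, old_obj)    # spec: old -> new
        api.update_status(old_obj, new_obj)  # status: new -> old
```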

While mirroring an application from the old to the new API, we could not predict whether an app would run correctly on the new controller implementation (as its Pod spec had changed heavily). Therefore, Pods created via the old controller and the new controller ran at the same time; this way, application availability wasn't impacted if the new controller wasn't creating the Pods with the right configuration. Only once the Pods created by the new controller became ready did we clean up the underlying workload resource (and therefore the Pods) managed by the old controller.

Challenges

Observability

Throughout the migration, we benefited from per-object controlled ramp annotations, a rollback mechanism (in case an app did not work happily on the new controller), and status/conditions that reflect the mirroring status of a LiDeployment, so we could clearly understand which migration phase an app was in.

Kubectl API name conflicts

A benefit that comes with the Kubernetes built-in versioning machinery is that the API owner can specify a “preferred” version for clients like kubectl to work with. In our case, we had two CRDs in different API groups with the same name, and there wasn’t a way to specify a precedence order for which CRD was queried when a user ran kubectl get/edit/delete lideployment; the user might end up working on the wrong object depending on which API group kubectl picked.

Empirically, kubectl seems to pick a deterministic order when two identical custom APIs exist, though we did not want to rely on it since the behavior is not guaranteed, and we certainly did not want to implement an extension API server just to make this work. As a workaround, we created an API category (e.g. kubectl get apps) that contained both the old and new APIs and made our engineers use this category instead.
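The category workaround relies on the CRD's spec.names.categories field; when both CRDs list the same category, a single kubectl query returns objects from both API groups. A sketch for the new-group CRD (the version and schema are assumptions for illustration):

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: lideployments.apps.linkedin.com
spec:
  group: apps.linkedin.com
  names:
    kind: LiDeployment
    plural: lideployments
    categories:
      - apps                 # enables "kubectl get apps" to list these objects
  scope: Namespaced
  versions:
    - name: v1               # version assumed for illustration
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
```

The old-group CRD would list the same category, so one query covers both.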

Double-deployment of apps

As mentioned earlier, we ran Pods from old/new versions of the controller simultaneously for a brief period of time to validate the Pods created by the new controller with the new spec ran correctly.

This revealed some stateless applications that had taken a dependency on having at most one replica at a time, an assumption the system never guaranteed.

Conclusion

While the Kubernetes custom resource machinery gives a rather basic way to perform minor versioning changes on an object, it’s still possible to harness the power of controllers to build a custom controller that handles migrating custom resource APIs from one API group to another.

Acknowledgments

Special thanks to Scott Nichols for coming up with the mirroring controller approach and spearheading its implementation. Also, thanks to Kutta Srinivasan and Sudheer Vinukonda for reading drafts of this article.