The following references and outlines general guidelines for metric instrumentation in Kubernetes components. Components are instrumented using the Prometheus Go client library; for non-Go components, client libraries in other languages are available.
The metrics are exposed via HTTP in the Prometheus metric format, which is open and well understood by a wide range of third-party applications and vendors outside of the Prometheus ecosystem.
The general instrumentation advice from the Prometheus documentation applies. This document reiterates common pitfalls and some Kubernetes-specific considerations.
Prometheus metrics are cheap as they have minimal internal memory state. Set and increment operations are thread safe and take 10-25 nanoseconds (Go & Java). Thus, instrumentation can and should cover all operationally relevant aspects of an application, internal and external.
The following describes the basic steps required to add a new metric (in Go); a combined sketch follows the list of steps.
- Import "k8s.io/component-base/metrics" for metrics and "k8s.io/component-base/metrics/legacyregistry" to register your declared metrics.
- Create a top-level var to define the metric. For this, you have to:
  - Pick the type of metric. Use a Gauge for things you want to set to a particular value, a Counter for things you want to increment, or a Histogram or Summary for histograms/distributions of values (typically for latency). Histograms are better if you're going to aggregate the values across jobs, while summaries are better if you just want the job to give you a useful summary of the values.
  - Give the metric a name and description.
  - Pick whether you want to distinguish different categories of things using labels on the metric. If so, add "Vec" to the name of the type of metric you want and add a slice of the label names to the definition.

      requestCounter = compbasemetrics.NewCounterVec(
          &compbasemetrics.CounterOpts{
              Name:           "apiserver_request_total",
              Help:           "Counter of apiserver requests broken out for each verb, dry run value, group, version, resource, scope, component, and HTTP response code.",
              StabilityLevel: compbasemetrics.STABLE,
          },
          []string{"verb", "dry_run", "group", "version", "resource", "subresource", "scope", "component", "code"},
      )
- Register the metric so that Prometheus will know to export it. This can be done manually or through an init function.

      legacyregistry.MustRegister(metric)
- Use the metric by calling the appropriate method for your metric type (Set, Inc/Add, or Observe, respectively for Gauge, Counter, or Histogram/Summary), first calling WithLabelValues if your metric has any labels.

      requestCounter.WithLabelValues(*verb, *resource, client, strconv.Itoa(*httpCode)).Inc()
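Putting the steps together, a minimal sketch of declaring, registering, and using a histogram metric could look as follows. The component name, metric name, label, and buckets here are purely illustrative and not taken from an existing component:

```go
package example

import (
	"time"

	compbasemetrics "k8s.io/component-base/metrics"
	"k8s.io/component-base/metrics/legacyregistry"
)

// requestLatency tracks the distribution of request latencies, partitioned by verb.
var requestLatency = compbasemetrics.NewHistogramVec(
	&compbasemetrics.HistogramOpts{
		Name:           "examplecomponent_request_duration_seconds",
		Help:           "Request latency in seconds, broken out by verb.",
		Buckets:        []float64{0.001, 0.01, 0.1, 1, 10},
		StabilityLevel: compbasemetrics.ALPHA,
	},
	[]string{"verb"},
)

func init() {
	// Register the metric so that it is exposed on the component's /metrics endpoint.
	legacyregistry.MustRegister(requestLatency)
}

// observeRequest records the duration of a single request.
func observeRequest(verb string, start time.Time) {
	requestLatency.WithLabelValues(verb).Observe(time.Since(start).Seconds())
}
```

Explicit buckets are used here; choose buckets that cover the latency range you actually expect to observe.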
First, components have metrics capturing events and states that are inherent to their application logic. Examples are request and error counters, request latency histograms, or internal garbage collection cycles. These metrics are instrumented directly in the application code.
Second, there are business logic metrics. These are not about observed application behavior but about abstract system state, such as the desired number of replicas for a deployment. They are not instrumented directly but collected from otherwise exposed data.
In Kubernetes they are generally captured by the kube-state-metrics component, which reads them from the API server. For this type of metric exposition, the exporter guidelines apply in addition.
Please see our documentation on Kubernetes metrics stability.
General metric and label naming best practices apply. Beyond that, metrics added directly by application or package code should have a unique name. This avoids collisions with metrics added via dependencies and clearly distinguishes metrics collected with different semantics. This is solved through prefixes:
<component_name>_<metric>
For example, suppose the kubelet instruments its HTTP requests but also uses an HTTP router package that provides its own instrumentation. Both expose metrics on total HTTP requests. They should be distinguishable, as in:
kubelet_http_requests_total{path="/some/path",status="200"}
routerpkg_http_requests_total{path="/some/path",status="200",method="GET"}
As we can see, they expose different labels, so a naming collision could not have been resolved even if both metrics counted the exact same requests.
Resource objects that occur in names should inherit the spelling that is used in kubectl, i.e. daemon sets are daemonset rather than daemon_set.
One exception to the component prefix rule is for metrics derived from the state of Kubernetes objects. From the users' perspective, controllers are an implementation detail of object reconciliation. The collection of controllers which comprise a working Kubernetes cluster is viewed as a single system which drives objects towards their specified desired state. Metrics concerning a given object should be easily discoverable and comparable even when they are produced by different controllers. Metrics describing the state of a built-in Kubernetes object take the form:
kube_<kind>_<metric>
Metrics describing the state of a custom resource avoid collisions by additionally including the API group. They take the form:
kube_[<group>](https://kubernetes.io/docs/reference/using-api/#api-groups)_<kind>_<metric>
The Kube-State-Metrics project introduced the original kube_* prefixed metrics. For examples of kube_* prefixed metrics, refer to the list of Exposed Metrics in the Kube-State-Metrics documentation.
Metrics can often replace more expensive logging as they are time-aggregated over a sampling interval. The multidimensional data model enables deep insights, and all metrics should use label dimensions where appropriate.
A common error, which often causes performance issues in the ingesting metric system, is using label dimensions that are too specific and thereby inhibit or eliminate time aggregation. Typical examples are user IDs or error messages. More generally, one should know the comprehensive list of all possible values for a label at instrumentation time.
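To illustrate, the following is a sketch of one way to keep a label bounded: arbitrary errors are mapped to a small, fixed set of reasons before being used as a label value, so every possible value is known up front. The metric and function names are hypothetical:

```go
package example

import (
	"context"
	"errors"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	compbasemetrics "k8s.io/component-base/metrics"
	"k8s.io/component-base/metrics/legacyregistry"
)

// syncErrors counts failed syncs partitioned by a bounded "reason" label,
// never by the raw error message (which would have unbounded cardinality).
var syncErrors = compbasemetrics.NewCounterVec(
	&compbasemetrics.CounterOpts{
		Name:           "examplecontroller_sync_errors_total",
		Help:           "Number of failed syncs, partitioned by a bounded error reason.",
		StabilityLevel: compbasemetrics.ALPHA,
	},
	[]string{"reason"},
)

func init() {
	legacyregistry.MustRegister(syncErrors)
}

// errorReason maps an arbitrary error to one of a small, fixed set of values.
func errorReason(err error) string {
	switch {
	case errors.Is(err, context.DeadlineExceeded):
		return "timeout"
	case apierrors.IsNotFound(err):
		return "not_found"
	default:
		return "other"
	}
}

// recordSyncError is called wherever a sync fails.
func recordSyncError(err error) {
	syncErrors.WithLabelValues(errorReason(err)).Inc()
}
```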
Notable exceptions are exporters like kube-state-metrics, which expose per-pod or per-deployment metrics. These are theoretically unbounded over time, as one could constantly create new objects with new names. In practice, however, they have a reasonable upper bound for a given size of infrastructure and its typical frequency of changes.
In general, “external” labels like pod name, node name (any object name), and namespace do not belong in the instrumentation itself (the exception being kube-state-metrics). They should be attached to metrics by the collecting system, which has the external knowledge (blog post).
Metrics should be normalized with respect to their dimensions. They should expose the minimal set of labels, each of which provides additional information. Labels that are composed from values of different labels are not desirable. For example:
example_metric{pod="abc",container="proxy",container_long="abc/proxy"}
It is often tempting to add additional meta information about an object to all metrics about that object, e.g.:
kube_pod_container_restarts{namespace=...,pod=...,container=...}
A common use case is looking at such metrics with respect to the node the pod is scheduled on, so it seems convenient to add a “node” label:
kube_pod_container_restarts{namespace=...,pod=...,container=...,node=...}
This, however, only caters to one specific query use case. There are many more pieces of metadata that could be added, effectively blowing up the instrumentation. They are also not guaranteed to be stable over time: what if pods can at some point be live-migrated? Such information should instead be normalized into an info-level metric (blog post), which is always set to 1. For example:
kube_pod_info{pod=...,namespace=...,pod_ip=...,host_ip=...,node=..., ...} 1
The metric system can later denormalize those along the identifying “pod” and “namespace” labels.
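For example, in a metric system that supports joins, such as Prometheus, a query that attaches the node to the restart counts along those identifying labels could look roughly like the following sketch (using the metric names from above):
kube_pod_container_restarts * on(namespace, pod) group_left(node) kube_pod_info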
It is often desirable to correlate different metrics about a common object, such as a pod. Label dimensions can be used to match up different metrics. This is easiest if label names and values follow a common pattern. For metrics exposed by the same application, that often happens naturally.
For a system composed of several independent and pluggable components, it makes sense to set cross-component standards that allow easy querying in metric systems without extensive post-processing of data. In Kubernetes, those are the resource objects, such as deployments, pods, or services, and the namespace they belong to.
The following should be consistently used:
example_metric_ccc{pod="example-app-5378923", namespace="default"}
An object is referenced by its unique name in a label named after the resource itself (i.e. pod/deployment/... and not pod_name/deployment_name) and the namespace it belongs to in the namespace label.
Note: namespace/name combinations are only unique at a certain point in time. For time series this is given by the timestamp associated with any data point. UUIDs are truly unique but not convenient to use in user-facing time series queries. They can still be incorporated using an info-level metric as described above for kube_pod_info. A query to a metric system selecting by UUID via the info-level metric could look as follows:
kube_pod_restarts and on(namespace, pod) kube_pod_info{uuid="ABC"}
The process of metric deprecation is outlined in the official Kubernetes Deprecation Policy. When deprecating a metric, the deprecated version must be set to a future version, from which point on the metric is considered deprecated. If there is a replacement metric, note it in the help text of the deprecated metric as well as in the corresponding release note of the relevant pull request.
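As a sketch of what this looks like with the component-base metrics framework, the deprecation is declared on the metric's options via the DeprecatedVersion field. The metric name, version, and replacement shown here are purely illustrative:

```go
package example

import (
	compbasemetrics "k8s.io/component-base/metrics"
)

// widgetSyncs is declared deprecated as of the named (future, illustrative) release;
// the help text also points at the replacement metric, per the guidance above.
var widgetSyncs = compbasemetrics.NewCounter(
	&compbasemetrics.CounterOpts{
		Name:              "examplecontroller_widget_syncs_total",
		Help:              "Number of widget syncs; superseded by examplecontroller_sync_total.",
		StabilityLevel:    compbasemetrics.STABLE,
		DeprecatedVersion: "1.28.0",
	},
)
```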