Kubernetes Threat Model
June 28, 2019
Prepared For:
Kubernetes Security WG | Kubernetes
Prepared By:
Stefan Edwards | Trail of Bits
[email protected]
Introduction
The Cloud Native Computing Foundation (CNCF) tasked Trail of Bits to conduct a
component-focused threat model of the Kubernetes system. This threat model reviewed
Kubernetes’ components across six control families:
● Networking
● Cryptography
● Authentication
● Authorization
● Secrets Management
● Multi-tenancy
Kubernetes itself is a large system, spanning from API gateways to container orchestration
to networking and beyond. In order to contain scope creep and keep a reasonable timeline
for threat modeling, the CNCF selected eight components within the larger Kubernetes
ecosystem for evaluation:
● kube-apiserver
● etcd
● kube-scheduler
● kube-controller-manager
● cloud-controller-manager
● kubelet
● kube-proxy
● Container Runtime
In total, the assessment team found 17 issues across the various components, ranging in
severity from Medium to Informational.
Key Findings
Kubernetes allows users to define policies at multiple levels of the cluster. This can impact
items from network filtering through to how Pods interact with the underlying host.
However, many of these policies are not applied without the correct components installed
and enabled; this, unto itself, is fine. Kubernetes, however, does not warn users that these
policies are not applied due to missing components. This may lead to a situation wherein a
user assumes they have applied a security control when it is in fact missing. Warning users
when security controls are missing or unapplied would allow them to respond by either
enabling the correct components or mitigating the issue in another way. More generally,
Kubernetes does not readily provide auditing information to users in a unified fashion.
Components may or may not collect auditing information sufficient to track a user’s path
through the system. Providing more feedback to users, administrators, and incident
responders will not only help increase the general security of Kubernetes, but also give
users a more consistent and friendly user experience.
Kubernetes is also a highly networked system, with multiple communications protocols.
Many of these protocols traffic in sensitive information, such as cluster secrets or
credentials, and yet do not use the strongest TLS configurations possible, or in some cases
do not use TLS at all. While it is difficult for attackers to intercept communications between components, it is
possible in certain configurations. Therefore, always enforcing HTTPS, providing ingress
and egress filtering for clusters, and using the strongest cryptographic controls possible will
provide users with a secure baseline that must be modified to be insecure, rather than the
other way around.
Furthermore, Kubernetes must rely on traditional operating systems to do the heavy lifting of
running containers and cluster services. However, it generally assumes those operating
systems to be trustworthy, and that sensitive data may be exposed to the operating system
freely. Hosts (called “Nodes” in the Kubernetes ontology if running a kubelet)
should be treated as a separate security domain from the cluster itself: systems
administrators for Hosts may not be the same systems administrators for the cluster.
Sensitive data exposed to a Host could allow Internal Attackers or Malicious Internal Users
to parlay their access to a single Host into wider cluster access, simply by viewing logs,
processes, or environmental data. There are two general directions that Kubernetes can
take:
1. Hosts must be “closed,” and all systems administrators for Hosts should be treated
as privileged enough to view the data shared via IPC and environment variables.
2. Move away from exposing sensitive data to systems, and towards a model similar to
a Vault or Key Management System/Hardware Security Module, wherein the system
provides APIs that do not expose secrets to any other processes or persons under
normal circumstances.
Either choice would be a reasonable direction for Kubernetes; however, a choice must be
made. Choosing one direction or the other will ensure that controls and designs can be
made that satisfy the chosen direction, and that all components understand and can
adhere to this direction. A similar issue exists with Multi-tenancy: many components do
not have a notion of Multi-tenancy, which is relatively loosely defined at the time of this
assessment. Choosing a direction, and providing developers with guidance on how to
achieve it, will strengthen the core of Kubernetes and remove many of the current issues,
wherein components do not have an answer for certain aspects of the system because
those aspects are ill-defined.
Report Position
Kubernetes is a large, intricate system, with many security controls and design decisions
having arisen organically, in ways that made sense in situ to diverse and distributed
teams. This report attempts to catalog many of the discussions captured within the Rapid
Risk Assessment processes. However, the raw Rapid Risk Assessment documents will be
provided upon release of this report, so that the community as a whole may see the
discussions and notes made during meetings.
The remainder of this report analyzes components, trust zones, data flows, threat actors,
controls, and findings of the Kubernetes threat model. This was a point-in-time
assessment, and reflects the state of Kubernetes, specifically version 1.13.4, at the time of
the assessment, rather than any current or future state.
Contents
Introduction
Key Findings
Report Position
Methodology
Components
Planes
Dataflow
Overview
Recommendations
Methodology
This document is the result of several person-weeks worth of effort from members of the
community, the Security Working Group, and the assessment team, across diagrams,
documents, RRAs, Manning’s Kubernetes in Action book, and Kubernetes’ own
documentation. It is a control-focused threat model, with a review of each component
vis-a-vis the controls selected by the Security Working Group. The RRA template is
provided in Appendix A: RRA Template.
Performing a threat model and architecture review of a system as large as Kubernetes
proved challenging. First, we designed a dataflow for the selected components, and
modified Mozilla’s Rapid Risk Assessment (RRA) document to focus on the selected
controls. Next, we pre-filled sections of the documents for each component, based on our
understanding of the component and online documentation. Then, we polled the
community for feedback, and held remote meetings with members of the community to
correct any gaps in the RRA documents and to discuss the impact of each control within the
selected component. Once the RRA had been filled out by a group of community members,
a different group of community members was selected to peer review the document for
accuracy.
We would like to thank all of the members of the community who came together to donate
their time to us, in order to discuss and review areas of Kubernetes’ design, and provide
holistic information that can make Kubernetes as a whole better.
Components
The Kubernetes architecture is composed of multiple components, many of which are
stand-alone binaries written in Go. The following table describes the eight selected
Kubernetes components:
kube-controller-manager: A daemon that listens for specific updates within the API server, takes action, and stores its own updates within the API server itself. The purpose of the daemon is to run “controllers” within an infinite loop, with each controller attempting to keep the state of the cluster consistent. This works by way of a call-back listener loop, and comparison of the current cluster state with the desired state of the cluster as described by developers and administrators.
Container Runtime: A group of components that allow for the direct execution of containers within a cluster. This includes the necessary operating system integrations (such as control groups on Linux), configuration settings, and Kubernetes interfaces to a container system, e.g. Docker, cri-o, containerd, ...
Planes
Kubernetes itself is divided (roughly) into two “planes,” or groupings of components. The
following table describes each plane, and groups the aforementioned components:
Control Plane: The Control Plane (CP) controls the state of the cluster and ensures that the desired components are running as specified by the end user. Generally, these are grouped as Masters (technically, the API server, etcd, and related components) as well as the kubelet (which, whilst part of the CP, actually runs on every Node).
Components: kube-apiserver, kube-scheduler, kube-controller-manager, cloud-controller-manager, kubelet, etcd
Master Data: The data layer of the API server and master server(s) themselves. This boundary contains items such as Consul or etcd, and is tantamount to “root” or administrative access to the cluster when accessed in an uncontrolled fashion.
Components: etcd
Malicious Internal User: A user, such as an administrator or developer, who uses their privileged position maliciously against the system, or stolen credentials used for the same.
Internal Attacker: An attacker who has transited one or more trust boundaries, such as an attacker with container access.
Control Summary
Committee on National Security Systems (CNSS) Instruction No. 4009 (CNSSI 4009) defines a
“security control” as: “The management, operational, and technical controls (i.e.,
safeguards or countermeasures) prescribed for an information system to
protect the confidentiality, integrity, and availability of the system and
its information.” Controls are grouped by type or family, which collects controls along
logical groupings, such as Authentication or Cryptography. This assessment will focus on
six primary control families, per the request of the Security Working Group:
Secrets Management: Related to the handling of sensitive application secrets, such as passwords.
Cryptography (Not Applicable): etcd does not store data encrypted at rest, and instead relies on kube-apiserver to enforce cryptographic constraints. However, recommendations are made within the report to strengthen cryptographically-secure hashing operations within file system operations, such as the Write-Ahead Log (WAL).
Secrets Management (Not Applicable): While etcd stores secrets for the cluster, etcd only processes credentials sufficient to communicate with kube-apiserver. All other secrets are handled by kube-apiserver itself, and merely stored within etcd.
Authentication (Not Applicable): KCM and CCM do not handle authentication directly, but rather rely on kube-apiserver to be the arbiter of authentication.
Cryptography (Not Applicable): KCM and CCM do not handle cryptography directly, but rather rely on kube-apiserver to be the arbiter of cryptography.
Multi-tenancy (Not Applicable): KCM and CCM do not directly handle multi-tenant isolation. This could be problematic going forward, as KCM and CCM components could interact with namespaces that were not intended to have access to cloud or other provider boundaries.
kubelet
Cryptography (Not Applicable): kubelet does not handle cryptography directly, but rather relies on kube-apiserver to be the arbiter of cryptography.
Multi-tenancy (Not Applicable): kubelet does not handle multi-tenancy directly, but rather relies on kube-apiserver to be the arbiter of multi-tenant isolation.
kube-proxy
Secrets Management (Not Applicable): kube-proxy does not handle secrets directly, but rather relies on kube-apiserver to be the arbiter for their correct storage. The sole exception is authentication credentials, which are passed in via command-line arguments to the binary on initial execution. This is a general finding for Kubernetes as a whole, rather than specific to kube-proxy.
1. A client updates a Pod definition via kubectl, which is itself a POST request to the
kube-apiserver.
2. The scheduler watches for Pod updates via an HTTP request to retrieve new Pods.
3. The scheduler then updates the Pod list via a POST to the kube-apiserver.
4. The node’s kubelet retrieves a list of Pods assigned to it via an HTTP request (sketched below).
5. The node’s kubelet then updates the running Pod list on the kube-apiserver.
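The sketch below illustrates the polling pattern in these steps: a minimal Go client that lists Pods from the kube-apiserver over HTTPS with a bearer token. The API server address, token path, and relaxed certificate verification are assumptions for illustration, not the actual scheduler or kubelet implementation.

```go
// Minimal sketch of polling the kube-apiserver for Pods; the server address
// and token path below are assumptions, not production configuration.
package main

import (
	"crypto/tls"
	"fmt"
	"io"
	"net/http"
	"os"
)

func main() {
	// Assumed in-cluster service account token path.
	token, err := os.ReadFile("/var/run/secrets/kubernetes.io/serviceaccount/token")
	if err != nil {
		panic(err)
	}

	// InsecureSkipVerify is for illustration only; a real client must verify
	// the API server's certificate chain.
	client := &http.Client{Transport: &http.Transport{
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
	}}

	req, _ := http.NewRequest("GET", "https://kubernetes.default.svc/api/v1/pods", nil)
	req.Header.Set("Authorization", "Bearer "+string(token))

	resp, err := client.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status)
	fmt.Println(len(body), "bytes of Pod JSON")
}
```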
7. No non-repudiation or audit of user actions by default
Severity: Medium Difficulty: Low
Type: Audit and Logging Finding ID: TOB-K8S-TM07
Description
The kube-apiserver is the heart of the cluster: all transactions must pass through its
handlers and be served again to other cluster components. In this way, kube-apiserver
ensures consistent cluster state across all components: creation, modification, and deletion
are all coördinated via this central service. However, kube-apiserver does not keep a log of
users’ actions without debug mode being enabled, meaning that reconstructing an
attacker’s path through the cluster is extremely difficult.
Justification
The difficulty is low for the following reasons:
● Attackers do not require special tools or privileges to interact with the
kube-apiserver.
● Internal Attackers or Malicious Internal Users already have sufficient privileges to
interact with kube-apiserver to some degree.
The severity is medium for the following reasons:
● Attackers must have sufficient privileges to undertake a sensitive action, or have a
secondary exploit.
● In general, this is not a vulnerability unto itself, but rather represents a location where
incident responders would not have sufficient information to properly respond to
an attack.
Recommendation
Short term, document that secondary logging mechanisms must be used in cases that need
strong non-repudiation and audit controls. This will ensure that at least users who require
this functionality will not be surprised that it is missing.
Long term, add logging sufficient to track a user’s action across the cluster. This could be as
simple as tracking events solely within kube-apiserver, or could coördinate across the
cluster as a whole. We recommend that at least all authenticated events, including
delegated authentication from kubelet, should be logged and retrievable from a central
location within the cluster. This will allow incident responders to audit a user’s actions
within the cluster from a central location.
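To make the long-term recommendation concrete, the sketch below shows one possible shape for a per-request audit record emitted as a JSON line; the field names are assumptions for illustration, not Kubernetes’ actual audit event schema.

```go
// Hypothetical sketch of a per-request audit record; the fields are
// assumptions, not Kubernetes' audit event format.
package main

import (
	"encoding/json"
	"os"
	"time"
)

type AuditRecord struct {
	Timestamp time.Time `json:"timestamp"`
	User      string    `json:"user"`     // authenticated identity, including delegated kubelet auth
	Verb      string    `json:"verb"`     // e.g. get, create, delete
	Resource  string    `json:"resource"` // e.g. pods, secrets
	Namespace string    `json:"namespace"`
	SourceIP  string    `json:"sourceIP"`
	Allowed   bool      `json:"allowed"`
}

func main() {
	enc := json.NewEncoder(os.Stdout) // in practice, ship to a central store
	_ = enc.Encode(AuditRecord{
		Timestamp: time.Now().UTC(),
		User:      "system:serviceaccount:dev:builder",
		Verb:      "create",
		Resource:  "pods",
		Namespace: "dev",
		SourceIP:  "10.0.0.12",
		Allowed:   true,
	})
}
```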
8. Secrets not encrypted at rest by default
Severity: Low Difficulty: High
Type: Cryptography Finding ID: TOB-K8S-TM08
Description
Kubernetes allows users to define secrets, which can be anything from authentication
credentials to application configuration options. While secrets are only rarely exposed
outside of etcd (such as to kubelets or to the Container Runtime), they are not by default
encrypted at rest. An attacker with access to etcd data files, such as via a backup, will have
full access to secrets in an unencrypted state. Furthermore, the
--encryption-provider-config flag accepts an identity provider (the default), which does
not actually encrypt data, but rather simply returns the secret unencrypted. Users may
misconfigure the ordering of providers, and accidentally send unencrypted data to etcd or
other storage locations.
Justification
The difficulty is high for the following reasons:
● Attackers must have access to etcd data files sufficient to read secrets unencrypted.
● etcd is segmented from the rest of the cluster, with heavily restricted file system
permissions.
The severity is low for the following reasons:
● In and of itself, this does not increase the risk of exposure more than compromising
kubelet or other cluster infrastructure that handles secrets.
Recommendation
Short term, document ideal configurations for various levels of security, and provide
standard configurations for users.
Long term, move towards some reasonable default for users besides the identity provider,
and warn users when the identity provider is used either as a standalone or within a chain
of providers. This will ensure that users cannot accidentally include the identity provider.
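A minimal sketch of such a warning check follows; the configuration structure here is a simplified stand-in for the real EncryptionConfiguration resource, and the provider names are only examples.

```go
// Sketch of the warning suggested above: flag any provider chain that lists
// the "identity" (no-op) provider, especially as the first entry. The
// structure below is a simplified stand-in, not the real config type.
package main

import "fmt"

type providerChain struct {
	Resources []string
	Providers []string // e.g. ["aescbc", "identity"]
}

func warnOnIdentity(chains []providerChain) {
	for _, c := range chains {
		for i, p := range c.Providers {
			if p != "identity" {
				continue
			}
			if i == 0 {
				fmt.Printf("WARNING: %v are written UNENCRYPTED (identity provider is first)\n", c.Resources)
			} else {
				fmt.Printf("NOTE: identity provider present for %v; plaintext data remains readable\n", c.Resources)
			}
		}
	}
}

func main() {
	warnOnIdentity([]providerChain{
		{Resources: []string{"secrets"}, Providers: []string{"identity", "aescbc"}},
	})
}
```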
References
● Kubernetes cluster administration guide section on encrypting data at rest
etcd findings
etcd is the main storage engine of all cluster-related data. Everything that kube-apiserver
wishes to coördinate across hosts and components eventually makes its way into etcd.
Additionally, the documentation makes it clear: “Access to etcd is equivalent to root
permission in the cluster.”
etcd works as a REST server, accepting JavaScript Object Notation (JSON) objects from
clients, and storing these objects at a location requested by the client. Within Kubernetes,
these objects are generally stored within the /registry route, and etcd processes all objects
processed by kube-apiserver. In order to keep up with the demand for fast reading and
writing similar to a traditional database, etcd may be deployed in a separate cluster. It uses
the RAFT consensus algorithm to ensure that data is presented and available to all nodes
within the cluster eventually.
9. Write-Ahead Log does not use signatures for integrity checking
Severity: Very Low Difficulty: High
Type: Data Validation Finding ID: TOB-K8S-TM09
Description
etcd is a high-performance key-value store used to reify all state data within a Kubernetes
cluster. As part of this design, it uses a Write-Ahead Log (WAL) file, which is meant to serve
as an atomic commit file for changes; should etcd fail before the changes are written to the
main database, they can be reconstructed from the WAL file. However, the WAL file does
not use cryptographic signatures to ensure validity of data. An attacker with access to the
WAL file may tamper with the file without leaving a trace. Furthermore, copying over a
noisy or lossy connection could result in data corruption that cannot be detected until a
later point in time.
Justification
The difficulty is high for the following reasons:
● An attacker must transit multiple trust boundaries to impact a single etcd node.
● In a multi-master configuration, in order to truly impact the cluster, the attacker
must repeat this attack across multiple nodes.
The severity is very low for the following reasons:
● Attackers can only impact one etcd at a time.
● A consensus algorithm is used specifically to prevent attacks with corrupted,
incorrect, or outdated data.
● An attacker must attack a plurality of nodes within a cluster at the same time.
Recommendation
Short term, experiment with modes of adding cryptographically secure validation to the
WAL file generation. This could be as simple as hashing each entry prior to committing to
the WAL, or using something akin to Linked Timestamping, wherein each entry is hashed
with the contents of the current entry and the hash of the previous entry. Furthermore,
keyed hashes could be used to ensure that a specific etcd node has created and validated
data. Then, when the WAL file is committed to both the data and snapshot files, the sum
total of the entries contained within the WAL file may also be hashed. Balancing entry
hashing for the faster WAL files versus total file hashing for snapshots and beyond will be
key to maintaining relative performance whilst also ensuring valid data.
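A minimal sketch of the linked-hashing idea follows, using a keyed hash (HMAC-SHA-256) so that each entry’s tag covers both the entry and the previous tag. The key provisioning and entry format are assumptions; this is not etcd’s actual WAL format.

```go
// Minimal sketch of Linked Timestamping over WAL entries: each entry's tag is
// an HMAC over the entry bytes and the previous tag, using a per-node key.
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

func chainTag(key, prevTag, entry []byte) []byte {
	mac := hmac.New(sha256.New, key)
	mac.Write(prevTag) // link to the previous entry
	mac.Write(entry)   // cover the current entry's bytes
	return mac.Sum(nil)
}

func main() {
	key := []byte("per-node-secret-key") // assumed: provisioned out of band
	entries := [][]byte{
		[]byte(`put /registry/pods/default/web {"image":"nginx"}`),
		[]byte(`delete /registry/pods/default/web`),
	}

	prev := make([]byte, sha256.Size) // genesis tag of zeros
	for i, e := range entries {
		prev = chainTag(key, prev, e)
		fmt.Printf("entry %d tag: %s\n", i, hex.EncodeToString(prev))
	}
	// Verification replays the chain; any modified, reordered, or dropped
	// entry changes every subsequent tag.
}
```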
Long term, any added validation must be tested with larger datasets in normal clusters, to
ensure that etcd maintains performance, even with the added validation and security. We
recommend a gradual approach of adding validation to portions of larger clusters, so that
nodes with validation may be compared to nodes without it.
10. Mutual TLS is not the default
Severity: Very Low Difficulty: High
Type: Authentication Finding ID: TOB-K8S-TM10
Description
etcd is the holder of cluster state within Kubernetes: additions, changes, and updates are
eventually stored in its data repositories. As such, authenticating who is communicating
with etcd is an important task, and while etcd supports mutual (or client-side) TLS, it is not
the default. An attacker who had transited network boundaries could interact with etcd
without further impediment.
Justification
The difficulty is high for the following reasons:
● An attacker must transit multiple trust boundaries in order to gain sufficient
position for this attack.
● etcd is specifically segmented from the rest of the cluster in order to prevent such
accesses, further increasing the difficulty for attackers.
The severity is very low for the following reasons:
● An attacker with the ability to transit multiple trust boundaries could also likely steal
authentication credentials used to secure two-way TLS.
● In and of itself, this represents defense in depth for those situations where etcd is
not completely segmented from the rest of the network or fully from the
kube-apiserver host.
Recommendation
Short term, fully document how two-way TLS may be enabled within etcd. The current
documentation provides a simple example, but more automated or robust examples would
be helpful to users.
Long term, support mutual TLS by default, and do not allow communications with etcd that
are unauthenticated by client TLS certificates. Furthermore, do not use Basic or Digest
Authentication for this process, as these are outdated and insecure.
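The sketch below shows how mutual TLS enforcement looks in Go’s standard library: the server refuses any client that does not present a certificate signed by the configured CA. The listen address and file paths are assumptions for illustration, not etcd’s actual configuration.

```go
// Sketch of enforcing mutual TLS with Go's standard library; paths and port
// are assumptions.
package main

import (
	"crypto/tls"
	"crypto/x509"
	"log"
	"net/http"
	"os"
)

func main() {
	caPEM, err := os.ReadFile("/etc/etcd/pki/ca.crt") // assumed CA bundle path
	if err != nil {
		log.Fatal(err)
	}
	pool := x509.NewCertPool()
	pool.AppendCertsFromPEM(caPEM)

	server := &http.Server{
		Addr: ":2379",
		TLSConfig: &tls.Config{
			ClientCAs:  pool,
			ClientAuth: tls.RequireAndVerifyClientCert, // mutual TLS: no anonymous clients
			MinVersion: tls.VersionTLS12,
		},
	}
	// Server certificate and key paths are likewise assumptions.
	log.Fatal(server.ListenAndServeTLS("/etc/etcd/pki/server.crt", "/etc/etcd/pki/server.key"))
}
```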
kube-scheduler findings
kube-scheduler is tasked with matching Pods to hosts that can execute their workloads.
This is a surprisingly complex task, as the match is based upon a variety of criteria,
including the Pod’s own specification of what it needs to execute its work.
In order to do this, kube-scheduler operates like most other components within the
cluster: it polls kube-apiserver for new Pods and any host changes reported by the various
kubelets within the system, and attempts to match new Pods to free or less-loaded kubelets,
which then go about executing the Pod with help from the Container Runtime.
Furthermore, it should be noted that there may be multiple kube-schedulers configured for
a single cluster, each with a different name and set of parameters. kube-schedulers are
supposed to be coöperative; however, they needn’t be, as nothing in the system forces
kube-schedulers to act only upon Pod specs with a matching scheduler name.
11. Anti-affinity scheduling can be used to claim disproportionate resources
Severity: Low Difficulty: High
Type: Denial of Service Finding ID: TOB-K8S-TM11
Description
Kubernetes allows users to specify various scheduling mechanisms for Pods. This can be as
simple as assigning Pods to a specific node, or a complex dance of determining various aspects of
nodes and Pods. Users may also specify which Pods cannot be scheduled together,
allowing a Malicious Internal User to specify that no other Pods be scheduled on the same
host, effectively reserving a node for the attacker’s workload alone.
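The sketch below expresses such an abusive Pod spec using the Kubernetes Go API types: required anti-affinity against any Pod carrying an "app" label on the same host effectively reserves that node. The label key, Pod name, and image are hypothetical.

```go
// Sketch (using k8s.io/api types) of a Pod that refuses co-scheduling with
// any labeled Pod on its node; names and labels are hypothetical.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	pod := corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "greedy-pod"},
		Spec: corev1.PodSpec{
			Affinity: &corev1.Affinity{
				PodAntiAffinity: &corev1.PodAntiAffinity{
					RequiredDuringSchedulingIgnoredDuringExecution: []corev1.PodAffinityTerm{{
						// Refuse co-scheduling with any Pod that has an "app" label.
						LabelSelector: &metav1.LabelSelector{
							MatchExpressions: []metav1.LabelSelectorRequirement{{
								Key:      "app",
								Operator: metav1.LabelSelectorOpExists,
							}},
						},
						TopologyKey: "kubernetes.io/hostname",
					}},
				},
			},
			Containers: []corev1.Container{{Name: "idle", Image: "busybox"}},
		},
	}
	fmt.Println(pod.Name, "claims anti-affinity against all labeled Pods on its node")
}
```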
Justification
The difficulty is high for the following reasons:
● An attacker must have sufficient privileges to schedule Pods.
● An attacker must know or guess other Pod names against which to claim
anti-affinity.
The severity is low for the following reasons:
● Attackers with this level of access could also simply schedule a large number of
Pods.
● Attacks that consume entire hosts are noisy and will eventually be investigated.
● Denial-of-Service attacks that consume nodes will not impact currently running
Pods, but rather impact the scheduling of future Pods.
Recommendation
Short term, document that features such as anti-affinity may be used in ways that cause
host unavailability across the cluster.
Long term, add tooling and processes to aid administrators in reviewing the state of
clusters. This will support administrators as well as incident responders to discover and
respond to resource-exhaustion events such as this. Furthermore, consider preventing
users from selecting which scheduler they can use. Reserve that as an administrative
function. This will allow administrators to handle scheduling in a safe way, and prevent
attackers from specifying which scheduler should be used.
12. No back-off process for scheduling
Severity: Informational Difficulty: Undetermined
Type: Denial of Service Finding ID: TOB-K8S-TM12
Description
Scheduling a Pod within Kubernetes is an intricate dance of state coördination across
multiple components; kube-scheduler may interact with kubelet, the Replica Set Controller,
and other processes within the system. However, there is no back-off process when
kube-scheduler determines that a kubelet is the appropriate host for a Pod, but the kubelet
itself rejects scheduling the Pod. This may create a tight loop in which kube-scheduler, the
Replica Set Controller, and kubelet contend with one another, with kube-scheduler
continuously scheduling a Pod that the kubelet rejects.
Justification
This item is of Informational severity and as such represents a noteworthy comment within
the system, rather than an actual vulnerability.
Recommendation
Short term, document that this issue exists, and note that developers may be able to
accidentally or maliciously introduce a tight-loop within a cluster via this feedback failure
loop.
Long term, implement a back-off process for kube-scheduler and support graceful node
failure. This may go so far as to include a “decay list” of nodes which are continuously
failing to schedule Pods. Such a list can be used for further scheduling decisions within
kube-scheduler.
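A minimal sketch of such a decay list follows: nodes that repeatedly reject Pods are skipped for an exponentially growing cool-down period. The thresholds and durations are illustrative assumptions, not a proposed implementation for kube-scheduler.

```go
// Sketch of a "decay list": repeatedly failing nodes are backed off with an
// exponentially growing cool-down; values here are illustrative only.
package main

import (
	"fmt"
	"time"
)

type decayList struct {
	failures map[string]int
	until    map[string]time.Time
}

func newDecayList() *decayList {
	return &decayList{failures: map[string]int{}, until: map[string]time.Time{}}
}

// RecordRejection notes that a node refused a Pod and backs it off.
func (d *decayList) RecordRejection(node string) {
	d.failures[node]++
	backoff := time.Duration(1<<uint(d.failures[node])) * time.Second // 2s, 4s, 8s, ...
	if backoff > 5*time.Minute {
		backoff = 5 * time.Minute
	}
	d.until[node] = time.Now().Add(backoff)
}

// Eligible reports whether the scheduler should consider the node right now.
func (d *decayList) Eligible(node string) bool {
	return time.Now().After(d.until[node])
}

func main() {
	dl := newDecayList()
	dl.RecordRejection("node-a")
	dl.RecordRejection("node-a")
	fmt.Println("node-a eligible:", dl.Eligible("node-a"))
	fmt.Println("node-b eligible:", dl.Eligible("node-b"))
}
```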
KCM and CCM findings
kube-controller-manager (KCM) and cloud-controller-manager (CCM) are two core
components of Kubernetes’ interaction with the underlying platform. Like other controllers,
such as the Replica Set Controller, KCM and CCM attempt to move the state of the cluster
towards the desired state. CCM itself is a reference implementation, meant to separate out
cloud-specific controller code from other controller code. In this way, it will allow
administrators running on one cloud provider to exclude code meant for another cloud
provider.
13. Separate out controllers based on principle of least authority
Severity: Low Difficulty: High
Type: Access Control Finding ID: TOB-K8S-TM13
Description
KCM and CCM run several different controllers in a “control loop,” or an infinite loop of
feedback. KCM and CCM are packaged as single binaries, with multiple controllers
packaged as Go-level modules within the source code used to build the binary. These
controllers impact a wide range of items across the cluster, but are generally low-privileged
and unable to impact much outside of the narrow slices of policy for which they are
defined. However, some of these controllers are highly privileged (such as the Service
Account Controller) and can access their own permissions. If an attacker or malicious
controller were able to call these functions, they could escalate privileges across the
cluster, potentially to administrative-level access.
Justification
The difficulty is high for the following reasons:
● An attacker must know or discover a vulnerability allowing them to call privileged
functions.
● They must have position sufficient to use the escalated privileges, such as a
Malicious Internal User.
The severity is low for the following reasons:
● Attackers with this level of access could likely impact other items with a lower
Difficulty threshold.
● Attackers could escalate privileges across the cluster, or subtly modify resources on
the fly.
Recommendation
Short term, plan ways that privileged controllers may be separated from unprivileged ones,
and test if this is feasible within the context of both KCM and CCM.
Long term, separate out privileged controllers into their own binary or binaries. Controller
managers should not mix levels of privilege, as attackers or even just simple coding
mistakes can lead to privilege escalation.
kubelet findings
kubelet is the central orchestrator for Pods within the Kubernetes system. It runs Pods by
watching for podspecs that have been allocated to its host (by kube-scheduler), and passes
the podspec to the Container Runtime for execution. Aside from this, kubelet also handles
reporting the health status of Pods and containers to kube-apiserver, monitoring Pods
themselves for failure, working with the Container Runtime to deschedule Pods when so
requested, and reporting host status to kube-apiserver (for use by kube-scheduler). Like
kube-proxy, kubelet runs on the individual hosts, but with a different trust boundary than
Pods themselves, as it is central to the correct operation of the cluster as a whole.
14. kubelet hosts unauthenticated ports that leak pod spec information
Severity: Medium Difficulty: Medium
Type: Information Disclosure Finding ID: TOB-K8S-TM14
Description
kubelet, like most components within the Kubernetes system, uses HTTP ports for various
tasks such as reporting or task execution. Specific to kubelet, there are three main ports:
● 10250, an authenticated HTTPS server, with authentication provided by delegated
authentication from the kube-apiserver, used for task execution and kubelet
update.
● 10255, an unauthenticated HTTP server used for status and health information, which
also includes Pod spec information (see the sketch following this list).
● 10248, an unauthenticated HTTP server used for health information.
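As a minimal illustration of the exposure, the sketch below issues an unauthenticated GET against the read-only port; the node address is an assumption, and the /pods path reflects the read-only status endpoints described above.

```go
// Minimal sketch: an unauthenticated GET against the kubelet read-only port
// returns Pod specs. The node IP is an assumption.
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	resp, err := http.Get("http://10.0.0.5:10255/pods") // assumed node IP
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status)
	fmt.Println(len(body), "bytes of Pod spec JSON, with no credentials presented")
}
```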
Justification
The difficulty is medium for the following reasons:
● An attacker must have sufficient position to effect the attack, such as an Internal
Attacker or Malicious Internal User.
● Minimal tooling is needed to issue an HTTP request.
● An attacker must know, or guess, the location of kubelet host and ports.
The severity is medium for the following reasons:
● Pod specs do not by default contain secrets, other than potentially ConfigMaps and
Repository authentication credentials.
● An attacker armed with this information may gain a better understanding of the
layout of a cluster’s workload, but minimal other information about the inner
workings of the cluster.
Recommendation
Short term, document the leakage of Pod spec information, and plan ways to remove it. Per
the RRA discussions, the kubelet team is already planning on removing port 10255, which is
mainly in place for cAdvisor, which is deprecated. In more recent versions of Kubernetes
than the one this work focused on, port 10255 is disabled by default, but can be activated
either by installers/distributions or cluster administrators.
Long term, remove the deprecated ports, and minimize the attack surface available to an
Internal Attacker. This should also include changing port 10250 to a fully bootstrapped TLS
certificate by default. In this way, kubelet will present as strong a face as possible to
internal attackers.
15. Bootstrap certificate is long-lived and not removed by default
Severity: Low Difficulty: High
Type: Configuration Finding ID: TOB-K8S-TM15
Description
Kubernetes can bootstrap certain components, such as kubelet, from certificates by
default. These certificates provide a mechanism for components to request enough access
of kube-apiserver so as to generate a Certificate Signing Request (CSR) and produce a
signed certificate that the component may use for at least client authentication. However,
the certificate is long-lived, without a Time to Live (TTL), and is not removed by default.
Justification
The difficulty is high for the following reasons:
● An attacker must transit several trust boundaries, and have host-level access to a
Worker node.
● The attacker must then have the ability either to bring up other hosts within the
cluster or create their own kubelet under their control.
The severity is low for the following reasons:
● In and of itself, a long-lived bootstrap certificate does not provide an attacker with
sufficient direct access.
● An attacker can make CSR requests to the kube-apiserver, which may provide an
attacker with access to other credentials within the cluster.
Recommendation
Short term, document that the certificate is long-lived, and must be removed by manual
processes.
Long term, issue bootstrapping certificates with an explicit-but-reasonable TTL, such as one
week. This should provide administrators plenty of time to bootstrap a cluster, but remove
the risk of a stolen bootstrapping certificate from further impacting the cluster.
Additionally, if certificate revocation is added to the cluster, bootstrap certificates may be
revoked once the CSR has been received.
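The sketch below illustrates issuing a credential with an explicit one-week lifetime using Go’s x509 package; the subject name and key type are hypothetical, and the certificate is self-signed for brevity rather than signed by a cluster CA.

```go
// Sketch of issuing a bootstrap credential with an explicit one-week TTL;
// the identity and self-signing are illustrative assumptions.
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

func main() {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		panic(err)
	}

	template := x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "system:bootstrap:node-01"}, // hypothetical identity
		NotBefore:    time.Now(),
		NotAfter:     time.Now().Add(7 * 24 * time.Hour), // explicit one-week TTL
		KeyUsage:     x509.KeyUsageDigitalSignature,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth},
	}

	// Self-signed for brevity; a real bootstrap certificate is signed by the cluster CA.
	der, err := x509.CreateCertificate(rand.Reader, &template, &template, &key.PublicKey, key)
	if err != nil {
		panic(err)
	}
	fmt.Printf("issued %d-byte bootstrap certificate, expires %s\n", len(der), template.NotAfter.Format(time.RFC3339))
}
```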
References
● TLS Bootstrapping
kube-proxy findings
kube-proxy, much like kubelet, is a transitive component within the cluster: it straddles the
edge between two trust boundaries, namely the Worker and Container zones. kube-proxy
itself works by watching for service, endpoint, and similar network configurations on
kube-apiserver, and then implementing the networking request, in conjunction with the
Container Network Interface (CNI) in one of several modes:
● As a literal network proxy, handling networking between nodes
● As a bridge between Container Network Interface (CNI), which handles the actual
networking, and the host operating system
● iptables mode
● ipvsadm mode
● two Microsoft Windows-specific modes (not covered by the RRA)
kube-proxy itself is actually a collection of five programs, which work to create a consistent
networking experience across Pods and services. In this way, kube-proxy manages the raw
plumbing of networking, connecting the CNI’s transport layer to Linux’s routing layer (via
third-party tools such as iptables).
Userspace proxy
The original mode of operation for kube-proxy, wherein kube-proxy received and
forwarded packets for Kubernetes’ hosted services. While this mode is not often used
anymore, due to performance, the setup is the same for most other modes of kube-proxy.
Furthermore, it is a core mode that may be useful under certain circumstances.
Setup:
1. Connect to the kube-apiserver.
2. Watch the kube-apiserver for services/endpoints/&c definitions.
3. Build an in-memory caching map: for each service, and for every port that service maps, open a
port and write an iptables rule for the Virtual IP (VIP) & Virtual Port.
4. Continue with step No. 2, until the cluster is restarted or terminated.
When a consumer connects to the port:
1. The desired service is running on a VIP:VPort pair.
2. The Root NS lookup is routed by an iptables definition, which eventually
points to the kube-proxy port.
3. When a connection is received, look at the src/dst port, check the map, and pick a service
on that port at random (if that fails, try another until either success or a retry count
has been exceeded).
4. Shuffle bytes back and forth between the backend service and the client until termination or
failure, as sketched below.
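The sketch below illustrates the byte-shuffling step: a minimal TCP proxy that copies data between a client and a chosen backend. The addresses are assumptions; real kube-proxy also handles retries and backend selection from its service map.

```go
// Minimal sketch of a userspace-style proxy: accept a client, dial a backend,
// and shuffle bytes both ways. Addresses are assumptions.
package main

import (
	"io"
	"log"
	"net"
)

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:30080") // assumed local proxy port
	if err != nil {
		log.Fatal(err)
	}
	for {
		client, err := ln.Accept()
		if err != nil {
			continue
		}
		go func(client net.Conn) {
			defer client.Close()
			backend, err := net.Dial("tcp", "10.244.1.7:8080") // assumed backend Pod IP:port
			if err != nil {
				return
			}
			defer backend.Close()
			go io.Copy(backend, client) // client -> backend
			io.Copy(client, backend)    // backend -> client
		}(client)
	}
}
```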
iptables
iptables is a common mode of operation for kube-proxy; it interacts directly with iptables in
order to build routing rules for VIP:VPort pairs. However, this mode does not require
kube-proxy to actually intercept or communicate with client connections. Instead, kube-proxy
uses iptables to create rewriting rules for the intended host, and has no further interaction
with the system, until such time that iptables restore command sets must be updated.
1. Same initial setup as the userspace proxy, sans opening a port directly.
2. Build an iptables restore command set, which is simply a giant string of service rules (a
sketch follows the note below).
3. Map the user-facing VIP to a random backend, rewriting packets at the kernel level, so
kube-proxy never sees the data.
4. At the end of the sync loop, write rules in batches to avoid iptables contention.
5. Perform no more routing table updates until service updates arrive (from watching
kube-apiserver) or a timeout occurs.
NOTE: iptables updates are rate limited (bounded frequency):
● No later than every 10 minutes by default
● No sooner than every 15 seconds by default, if there are no service map updates
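The sketch below illustrates step 2: building an iptables-restore style rule set as one large string and applying it in a single batch. The chain names, addresses, and rule layout are simplified placeholders, not kube-proxy’s actual chain structure.

```go
// Sketch of building an iptables-restore style rule string; chains and
// addresses are simplified placeholders.
package main

import (
	"fmt"
	"strings"
)

type service struct {
	VIP, Backend string
	Port         int
}

func buildRestore(services []service) string {
	var b strings.Builder
	b.WriteString("*nat\n:PREROUTING ACCEPT [0:0]\n")
	for _, s := range services {
		// DNAT the virtual IP:port to a chosen backend Pod.
		fmt.Fprintf(&b, "-A PREROUTING -d %s/32 -p tcp --dport %d -j DNAT --to-destination %s:%d\n",
			s.VIP, s.Port, s.Backend, s.Port)
	}
	b.WriteString("COMMIT\n")
	return b.String()
}

func main() {
	rules := buildRestore([]service{{VIP: "10.96.0.10", Backend: "10.244.1.7", Port: 80}})
	fmt.Print(rules) // fed to iptables-restore in one batch to avoid contention
}
```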
ipvs
1. Similar setup to the iptables & userspace proxy modes.
2. Here, kube-proxy uses the ipvsadm and ipset commands instead of iptables.
3. This does have some potentially unintended consequences:
● The service IP address needs to be bound to a dummy network adapter
● NOTE: any service bound to 0.0.0.0 is also bound to all adapters
● This is somewhat expected because of the binding to 0.0.0.0, but can still
lead to interesting behavior
Networking Concerns
Low-level network attacks may still impact kube-proxy, such as ARP Poisoning.
Furthermore, endpoint selection is namespace and Pod-based, so an injection could
theoretically overwrite this mapping. Additionally, further work may be needed to use only
CAP_NET_BIND_SERVICE, which allows a process or container to bind to low ports without root
permissions, for containers/Pods, to alleviate concerns surrounding attacks such as ARP
Poisoning via CAP_NET_RAW.
16. Race condition in Pod IP reuse
Severity: Low Difficulty: High
Type: Timing Finding ID: TOB-K8S-TM16
Description
kube-proxy coördinates with Pods, kubelet, and other components to “string the wire,” so
to speak, of communications within a cluster. This includes Pod IPs, which generally have a
larger allocation than there are Pods within a cluster by a factor of two. However, if an
attacker were able to cause a churn in Pod IPs, they could potentially win a race condition,
and trick kube-proxy into forwarding traffic to a Pod controlled by the attacker, rather than
the Pod expected by the cluster.
Justification
The difficulty is high for the following reasons:
● An attacker must have sufficient position to cause a large volume of turnover in Pod
IPs.
● The attacker must also have sufficient privileges to launch malicious Pods or have
previously compromised a Pod with the Pod IP they wish to control.
The severity is low for the following reasons:
● Attackers with position sufficient to cause Pod IP reuse could likely use other
attacks, such as ARP Poisoning, to achieve a similar effect with less work.
● The attack itself is largely theoretical, concerning a possible method by which an
attacker could win a race condition against the Pod IP assignment algorithm.
Recommendation
Short term, document the issue, so that users may be aware of a possible race condition.
Long term, determine a method for a back-off process within kube-proxy, and ways of
ensuring that tight loops cannot allow attackers to win race conditions. It is possible that
the best arbiter of routing truth may be kube-apiserver, however, this would require larger
architectural changes to the system as a whole. An achievable goal would be to simply back
off assignments when tight-loop Pod IP churn is noticed, and allow the normal network
process to reach equilibrium prior to further assignments.
Container Runtime findings
The last-but-not-least component that the team reviewed was the Container Runtime.
Container Runtime is technically an interface, like Container Networking, meant to support
multiple Linux container runtime systems (e.g. Docker) with a single API. Container Runtime
itself does not execute a container until instructed to do so by kubelet, as shown in the
process below:
1. Container Runtimes expose an IPC endpoint such as a Unix Domain Socket
2. kubelet retrieves Pods to be executed from the kube-apiserver
3. kubelet issues a request to the Container Runtime web server
4. The web server returns a URL with a single-time-use token
5. kubelet issues a request to the URL via gRPC over Unix Domain Socket
6. The Container Runtime Interface then executes the necessary commands/requests
from the actual container system (e.g. Docker) to run the Pod
17. Search space for single-use token is too small
Severity: Low Difficulty: High
Type: Cryptography Finding ID: TOB-K8S-TM17
Description
Container Runtime coördinates with kubelet via two mechanisms: a TCP/IP web server, and
a gRPC server running via Unix Domain Sockets. The gRPC request is authenticated via a
single-use token issued to kubelet in response to a scheduling request. However, the token
is small, being only eight characters long, meaning an attacker could feasibly generate
many valid tokens in a short amount of time.
Justification
The difficulty is high for the following reasons:
● An attacker must have sufficient position to access the Unix Domain Socket.
● They must then generate many tokens (potentially up to 2^64).
● These tokens must be discovered before a one-minute timeout has elapsed.
The severity is low for the following reasons:
● An attacker with Host access could impact far more sensitive items than attempting
to brute force a scheduling token.
● The attack would merely stop a Pod from being scheduled, which would either
result in it being rescheduled or in another host scheduling the pod, minimizing
total impact.
Recommendation
Utilize a standard, cryptographically secure token such as a UUIDv4. This will ensure that
the search space is too large for practical searches, and utilizes standard, well-understood
token-generation practices.
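As a sketch of this recommendation, the example below generates a 128-bit random token formatted as a UUIDv4 from a cryptographically secure source, in place of an eight-character token.

```go
// Sketch: generate a UUIDv4-formatted token from crypto/rand.
package main

import (
	"crypto/rand"
	"fmt"
)

func newUUIDv4() (string, error) {
	var b [16]byte
	if _, err := rand.Read(b[:]); err != nil {
		return "", err
	}
	b[6] = (b[6] & 0x0f) | 0x40 // version 4
	b[8] = (b[8] & 0x3f) | 0x80 // RFC 4122 variant
	return fmt.Sprintf("%x-%x-%x-%x-%x", b[0:4], b[4:6], b[6:8], b[8:10], b[10:16]), nil
}

func main() {
	token, err := newUUIDv4()
	if err != nil {
		panic(err)
	}
	fmt.Println(token) // 122 bits of entropy: infeasible to brute force within a timeout
}
```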
A. RRA Template
Overview
● Component:
● Owner(s):
● SIG/WG(s) at meeting:
● Service Data Classification:
● Highest Risk Impact:
Service Notes
This portion should walk through the component and discuss connections, their relevant
controls, and generally lay out how the component serves its relevant function. For
example, a component that accepts an HTTP connection may have relevant questions about
channel security (TLS and Cryptography), authentication, authorization,
non-repudiation/auditing, and logging. The questions aren't the only drivers as to what may
be discussed -- the questions are meant to drive what we discuss and keep things on task
for the duration of a meeting/call.
Data Dictionary
Name Classification/Sensitivity Comments
Threat Scenarios
● An External Attacker without access to the client application
● An External Attacker with valid access to the client application
● An Internal Attacker with access to cluster
● A Malicious Internal User
Networking
Cryptography
Secrets Management
Authentication
Authorization
Multi-tenancy Isolation
Summary
Recommendations