OpenShift Container Platform 4.5 Scalability and Performance
The text of and illustrations in this document are licensed by Red Hat under a Creative Commons
Attribution–Share Alike 3.0 Unported license ("CC-BY-SA"). An explanation of CC-BY-SA is
available at
http://creativecommons.org/licenses/by-sa/3.0/
. In accordance with CC-BY-SA, if you distribute this document or an adaptation of it, you must
provide the URL for the original version.
Red Hat, as the licensor of this document, waives the right to enforce, and agrees not to assert,
Section 4d of CC-BY-SA to the fullest extent permitted by applicable law.
Red Hat, Red Hat Enterprise Linux, the Shadowman logo, the Red Hat logo, JBoss, OpenShift,
Fedora, the Infinity logo, and RHCE are trademarks of Red Hat, Inc., registered in the United States
and other countries.
Linux ® is the registered trademark of Linus Torvalds in the United States and other countries.
XFS ® is a trademark of Silicon Graphics International Corp. or its subsidiaries in the United States
and/or other countries.
MySQL ® is a registered trademark of MySQL AB in the United States, the European Union and
other countries.
Node.js ® is an official trademark of Joyent. Red Hat is not formally related to or endorsed by the
official Joyent Node.js open source or commercial project.
The OpenStack ® Word Mark and OpenStack logo are either registered trademarks/service marks
or trademarks/service marks of the OpenStack Foundation, in the United States and other
countries and are used with the OpenStack Foundation's permission. We are not affiliated with,
endorsed or sponsored by the OpenStack Foundation, or the OpenStack community.
Abstract
This document provides instructions for scaling your cluster and optimizing the performance of
your OpenShift Container Platform environment.
Table of Contents

CHAPTER 1. RECOMMENDED PRACTICES FOR INSTALLING LARGE CLUSTERS
    1.1. RECOMMENDED PRACTICES FOR INSTALLING LARGE SCALE CLUSTERS

CHAPTER 2. RECOMMENDED HOST PRACTICES
    2.1. RECOMMENDED NODE HOST PRACTICES
    2.2. CREATING A KUBELETCONFIG CRD TO EDIT KUBELET PARAMETERS
    2.3. CONTROL PLANE NODE SIZING
    2.4. RECOMMENDED ETCD PRACTICES
    2.5. DEFRAGMENTING ETCD DATA
    2.6. OPENSHIFT CONTAINER PLATFORM INFRASTRUCTURE COMPONENTS
    2.7. MOVING THE MONITORING SOLUTION
    2.8. MOVING THE DEFAULT REGISTRY
    2.9. MOVING THE ROUTER
    2.10. INFRASTRUCTURE NODE SIZING
    2.11. ADDITIONAL RESOURCES

CHAPTER 3. RECOMMENDED CLUSTER SCALING PRACTICES
    3.1. RECOMMENDED PRACTICES FOR SCALING THE CLUSTER
    3.2. MODIFYING A MACHINE SET
    3.3. ABOUT MACHINE HEALTH CHECKS
        3.3.1. MachineHealthChecks on Bare Metal
        3.3.2. Limitations when deploying machine health checks
    3.4. SAMPLE MACHINEHEALTHCHECK RESOURCE
        3.4.1. Short-circuiting machine health check remediation
            3.4.1.1. Setting maxUnhealthy by using an absolute value
            3.4.1.2. Setting maxUnhealthy by using percentages
    3.5. CREATING A MACHINEHEALTHCHECK RESOURCE

CHAPTER 4. USING THE NODE TUNING OPERATOR
    4.1. ABOUT THE NODE TUNING OPERATOR
    4.2. ACCESSING AN EXAMPLE NODE TUNING OPERATOR SPECIFICATION
    4.3. DEFAULT PROFILES SET ON A CLUSTER
    4.4. VERIFYING THAT THE TUNED PROFILES ARE APPLIED
    4.5. CUSTOM TUNING SPECIFICATION
    4.6. CUSTOM TUNING EXAMPLE
    4.7. SUPPORTED TUNED DAEMON PLUG-INS

CHAPTER 5. USING CLUSTER LOADER
    5.1. INSTALLING CLUSTER LOADER
    5.2. RUNNING CLUSTER LOADER
    5.3. CONFIGURING CLUSTER LOADER
        5.3.1. Example Cluster Loader configuration file
        5.3.2. Configuration fields
    5.4. KNOWN ISSUES

CHAPTER 6. USING CPU MANAGER
    6.1. SETTING UP CPU MANAGER

CHAPTER 7. USING TOPOLOGY MANAGER
    7.1. TOPOLOGY MANAGER POLICIES
    7.2. SETTING UP TOPOLOGY MANAGER
    7.3. POD INTERACTIONS WITH TOPOLOGY MANAGER POLICIES
CHAPTER 8. SCALING THE CLUSTER MONITORING OPERATOR
    8.1. PROMETHEUS DATABASE STORAGE REQUIREMENTS
    8.2. CONFIGURING CLUSTER MONITORING

CHAPTER 9. PLANNING YOUR ENVIRONMENT ACCORDING TO OBJECT MAXIMUMS
    9.1. OPENSHIFT CONTAINER PLATFORM TESTED CLUSTER MAXIMUMS FOR MAJOR RELEASES
    9.2. OPENSHIFT CONTAINER PLATFORM TESTED CLUSTER MAXIMUMS
    9.3. OPENSHIFT CONTAINER PLATFORM ENVIRONMENT AND CONFIGURATION ON WHICH THE CLUSTER MAXIMUMS ARE TESTED
    9.4. HOW TO PLAN YOUR ENVIRONMENT ACCORDING TO TESTED CLUSTER MAXIMUMS
    9.5. HOW TO PLAN YOUR ENVIRONMENT ACCORDING TO APPLICATION REQUIREMENTS

CHAPTER 10. OPTIMIZING STORAGE
    10.1. AVAILABLE PERSISTENT STORAGE OPTIONS
    10.2. RECOMMENDED CONFIGURABLE STORAGE TECHNOLOGY
        10.2.1. Specific application storage recommendations
            10.2.1.1. Registry
            10.2.1.2. Scaled registry
            10.2.1.3. Metrics
            10.2.1.4. Logging
            10.2.1.5. Applications
        10.2.2. Other specific application storage recommendations
    10.3. DATA STORAGE MANAGEMENT

CHAPTER 11. OPTIMIZING ROUTING
    11.1. BASELINE INGRESS CONTROLLER (ROUTER) PERFORMANCE
    11.2. INGRESS CONTROLLER (ROUTER) PERFORMANCE OPTIMIZATIONS

CHAPTER 12. OPTIMIZING NETWORKING
    12.1. OPTIMIZING THE MTU FOR YOUR NETWORK
    12.2. RECOMMENDED PRACTICES FOR INSTALLING LARGE SCALE CLUSTERS
    12.3. IMPACT OF IPSEC

CHAPTER 13. WHAT HUGE PAGES DO AND HOW THEY ARE CONSUMED BY APPLICATIONS
    13.1. WHAT HUGE PAGES DO
    13.2. HOW HUGE PAGES ARE CONSUMED BY APPS
    13.3. CONFIGURING HUGE PAGES
        13.3.1. At boot time
CHAPTER 1. RECOMMENDED PRACTICES FOR INSTALLING LARGE CLUSTERS
networking:
clusterNetwork:
- cidr: 10.128.0.0/14
hostPrefix: 23
machineCIDR: 10.0.0.0/16
networkType: OpenShiftSDN
serviceNetwork:
- 172.30.0.0/16
The default cluster network cidr, 10.128.0.0/14, cannot be used if the cluster has more than 500 nodes. It must be set to 10.128.0.0/12 or 10.128.0.0/10 to scale to node counts beyond 500 nodes.
CHAPTER 2. RECOMMENDED HOST PRACTICES
When both options are in use, the lower of the two values limits the number of pods on a node. Exceeding these values can result in degraded cluster performance, such as increased CPU utilization and slower pod scheduling.
IMPORTANT
In Kubernetes, a pod that holds a single container actually uses two containers. The second container is used to set up networking prior to the actual container starting. Therefore, a system running 10 pods actually has 20 containers running.
podsPerCore sets the number of pods the node can run based on the number of processor cores on
the node. For example, if podsPerCore is set to 10 on a node with 4 processor cores, the maximum
number of pods allowed on the node will be 40.
kubeletConfig:
podsPerCore: 10
Setting podsPerCore to 0 disables this limit. The default is 0. podsPerCore cannot exceed maxPods.
maxPods sets the number of pods the node can run to a fixed value, regardless of the properties of the
node.
kubeletConfig:
maxPods: 250
Procedure
1. Run:
$ oc get machineconfig
This provides a list of the available machine configuration objects you can select. By default, the
two kubelet-related configs are 01-master-kubelet and 01-worker-kubelet.
2. Check the current pod capacity of a node.
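For example, describe a node and review its Allocatable values (the node name is a placeholder):
$ oc describe node <node_name>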
Example output
Allocatable:
attachable-volumes-aws-ebs: 25
cpu: 3500m
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 15341844Ki
pods: 250
3. To set the max pods per node on the worker nodes, create a custom resource file that contains
the kubelet configuration. For example, change-maxPods-cr.yaml:
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
name: set-max-pods
spec:
machineConfigPoolSelector:
matchLabels:
custom-kubelet: large-pods
kubeletConfig:
maxPods: 500
The rate at which the kubelet talks to the API server depends on queries per second (QPS) and burst values. The default values, 50 for kubeAPIQPS and 100 for kubeAPIBurst, are sufficient if there is only a limited number of pods running on each node. Updating the kubelet QPS and burst rates is recommended if there are enough CPU and memory resources on the node:
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
name: set-max-pods
spec:
machineConfigPoolSelector:
matchLabels:
custom-kubelet: large-pods
kubeletConfig:
maxPods: <pod_count>
kubeAPIBurst: <burst_rate>
kubeAPIQPS: <QPS>
a. Run:
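The machineConfigPoolSelector in the custom resource matches pools labeled custom-kubelet: large-pods, so the target machine config pool needs that label; a command along these lines (the worker pool is assumed) applies it:
$ oc label machineconfigpool worker custom-kubelet=large-pods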
b. Run:
$ oc create -f change-maxPods-cr.yaml
c. Run:
$ oc get kubeletconfig
Depending on the number of worker nodes in the cluster, wait for the worker nodes to be rebooted one by one. For a cluster with 3 worker nodes, this could take about 10 to 15 minutes. You can then verify that the change was applied by describing a node:
$ oc describe node
Procedure
By default, only one machine is allowed to be unavailable when applying the kubelet-related
configuration to the available worker nodes. For a large cluster, it can take a long time for the
configuration change to be reflected. At any time, you can adjust the number of machines that are
updating to speed up the process.
1. Run:
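One way to do this is to edit the machine config pool that is being updated (the worker pool is assumed here) and set maxUnavailable in its spec, as shown in the snippet that follows:
$ oc edit machineconfigpool worker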
spec:
maxUnavailable: <node_count>
IMPORTANT
When setting the value, consider the number of worker nodes that can be
unavailable without affecting the applications running on the cluster.
12 image streams
3 build configurations
6 builds
Number of worker nodes    Cluster load (namespaces)    CPU cores    Memory (GB)
25                        500                          4            16
100                       1000                         8            32
250                       4000                         16           96
On a cluster with three masters or control plane nodes, the CPU and memory usage spikes when one of the nodes is stopped, rebooted, or fails, because the remaining two nodes must handle the load in order to be highly available. This is also expected during upgrades, because the masters are cordoned, drained, and rebooted serially to apply the operating system updates and the control plane Operator updates. To avoid cascading failures on large and dense clusters, keep the overall resource usage on the master nodes to at most half of all available capacity to handle the resource usage spikes. Increase the CPU and memory on the master nodes accordingly.
IMPORTANT
The node sizing varies depending on the number of nodes and object counts in the
cluster. It also depends on whether the objects are actively being created on the cluster.
During object creation, the control plane is more active in terms of resource usage
compared to when the objects are in the running phase.
IMPORTANT
The recommendations are based on the data points captured on OpenShift Container
Platform clusters with OpenShiftSDN as the network plug-in.
NOTE
In OpenShift Container Platform 4.5, half of a CPU core (500 millicore) is now reserved
by the system by default compared to OpenShift Container Platform 3.11 and previous
versions. The sizes are determined taking that into consideration.
Etcd writes data to disk and persists proposals on disk, so its performance strongly depends on disk performance. Slow disks and disk activity from other processes might cause long fsync latencies, causing etcd to miss heartbeats and fail to commit new proposals to the disk on time, which can result in request timeouts and temporary leader loss. It is highly recommended to run etcd on machines backed by SSD or NVMe disks with low latency and high throughput.
Some of the key metrics to monitor on a deployed OpenShift Container Platform cluster are p99 of
etcd disk write ahead log duration and the number of etcd leader changes. Use Prometheus to track
these metrics. etcd_disk_wal_fsync_duration_seconds_bucket reports the etcd disk fsync duration,
etcd_server_leader_changes_seen_total reports the leader changes. To rule out a slow disk and
confirm that the disk is reasonably fast, 99th percentile of the
etcd_disk_wal_fsync_duration_seconds_bucket should be less than 10ms.
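For example, a Prometheus query along these lines (the 5-minute rate window is an assumption) reports that percentile:
histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))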
Fio, an I/O benchmarking tool, can be used to validate the hardware for etcd before or after creating the OpenShift Container Platform cluster. Run fio and analyze the results, assuming a container runtime such as Podman or Docker is installed on the machine under test and the path where etcd writes its data, /var/lib/etcd, exists.
Procedure
Run the following if using podman:
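For example, using a fio-based container image such as quay.io/openshift-scale/etcd-perf (the image name is an assumption) and mounting the etcd data path:
$ sudo podman run --volume /var/lib/etcd:/var/lib/etcd:Z quay.io/openshift-scale/etcd-perf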
The output reports whether the disk is fast enough to host etcd by checking whether the 99th percentile of the fsync metric captured from the run is less than 10 ms.
Etcd replicates the requests among all the members, so its performance strongly depends on network
input/output (IO) latency. High network latencies result in etcd heartbeats taking longer than the
election timeout, which leads to leader elections that are disruptive to the cluster. A key metric to
monitor on a deployed OpenShift Container Platform cluster is the 99th percentile of etcd network peer
latency on each etcd cluster member. Use Prometheus to track the metric. histogram_quantile(0.99,
rate(etcd_network_peer_round_trip_time_seconds_bucket[2m])) reports the round trip time for
etcd to finish replicating the client requests between the members; it should be less than 50 ms.
History compaction is performed automatically every five minutes and leaves gaps in the back-end
database. This fragmented space is available for use by etcd, but is not available to the host file system.
You must defragment etcd to make this space available to the host file system.
Because etcd writes data to disk, its performance strongly depends on disk performance. Consider
defragmenting etcd every month, twice a month, or as needed for your cluster. You can also monitor the
etcd_db_total_size_in_bytes metric to determine whether defragmentation is necessary.
WARNING
Defragmenting etcd is a blocking action. The etcd member will not respond until
defragmentation is complete. For this reason, wait at least one minute between
defragmentation actions on each of the pods to allow the cluster to recover.
Prerequisites
You have access to the cluster as a user with the cluster-admin role.
Procedure
1. Determine which etcd member is the leader, because the leader should be defragmented last.
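a. List the etcd pods; a command such as the following (the openshift-etcd namespace is assumed) shows them:
$ oc get pods -n openshift-etcd -o wide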
Example output
b. Choose a pod and run the following command to determine which etcd member is the
leader:
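For example, etcdctl can be run through one of the pods; the member whose IS LEADER column reports true is the leader (the pod name is a placeholder):
$ oc rsh -n openshift-etcd <etcd_pod_name> etcdctl endpoint status --cluster -w table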
Example output
a. Connect to the running etcd container, passing in the name of a pod that is not the leader:
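For example (the pod name is a placeholder):
$ oc rsh -n openshift-etcd <non_leader_etcd_pod_name>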
Example output
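The defragmentation itself is run with etcdctl from inside that container; a sketch, assuming the local endpoint and a 30 second timeout, is:
# unset ETCDCTL_ENDPOINTS
# etcdctl --command-timeout=30s --endpoints=https://localhost:2379 defrag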
If a timeout error occurs, increase the value for --command-timeout until the command
succeeds.
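Checking the endpoint status again, for example with the following command, shows whether the database size was reduced:
# etcdctl endpoint status --cluster -w table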
Example output
+---------------------------+------------------+---------+---------+-----------+------------+-----------
+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER |
RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------------+------------------+---------+---------+-----------+------------+-----------
+------------+--------------------+--------+
| https://10.0.191.37:2379 | 251cd44483d811c3 | 3.4.9 | 104 MB | false | false |
7| 91624 | 91624 | |
| https://10.0.159.225:2379 | 264c7c58ecbdabee | 3.4.9 | 41 MB | false | false |
7| 91624 | 91624 | | 1
| https://10.0.199.170:2379 | 9ac311f93915cc79 | 3.4.9 | 104 MB | true | false |
7| 91624 | 91624 | |
+---------------------------+------------------+---------+---------+-----------+------------+-----------
+------------+--------------------+--------+
This example shows that the database size for this etcd member is now 41 MB as opposed
to the starting size of 104 MB.
e. Repeat these steps to connect to each of the other etcd members and defragment them.
Always defragment the leader last.
Wait at least one minute between defragmentation actions to allow the etcd pod to recover.
Until the etcd pod recovers, the etcd member will not respond.
3. If any NOSPACE alarms were triggered due to the space quota being exceeded, clear them.
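The alarms can be listed and then cleared with etcdctl, for example:
# etcdctl alarm list
# etcdctl alarm disarm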
Example output
memberID:12345678912345678912 alarm:NOSPACE
Kubernetes and OpenShift Container Platform control plane services that run on masters
The cluster metrics collection, or monitoring service, including components for monitoring user-
defined projects
Service brokers
Any node that runs any other container, pod, or component is a worker node that your subscription must
cover.
Procedure
apiVersion: v1
kind: ConfigMap
metadata:
name: cluster-monitoring-config
namespace: openshift-monitoring
data:
config.yaml: |+
alertmanagerMain:
nodeSelector:
node-role.kubernetes.io/infra: ""
prometheusK8s:
nodeSelector:
node-role.kubernetes.io/infra: ""
prometheusOperator:
nodeSelector:
node-role.kubernetes.io/infra: ""
grafana:
nodeSelector:
node-role.kubernetes.io/infra: ""
k8sPrometheusAdapter:
nodeSelector:
node-role.kubernetes.io/infra: ""
kubeStateMetrics:
nodeSelector:
node-role.kubernetes.io/infra: ""
telemeterClient:
nodeSelector:
node-role.kubernetes.io/infra: ""
openshiftStateMetrics:
nodeSelector:
node-role.kubernetes.io/infra: ""
thanosQuerier:
nodeSelector:
node-role.kubernetes.io/infra: ""
Applying this config map forces the components of the monitoring stack to redeploy to infrastructure nodes:
$ oc create -f cluster-monitoring-configmap.yaml
4. If a component has not moved to the infra node, delete the pod with this component:
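For example (the monitoring components run in the openshift-monitoring namespace):
$ oc delete pod <pod_name> -n openshift-monitoring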
The component from the deleted pod is re-created on the infra node.
Prerequisites
Procedure
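1. View the configs.imageregistry.operator.openshift.io/cluster object, for example:
$ oc get configs.imageregistry.operator.openshift.io/cluster -o yaml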
Example output
apiVersion: imageregistry.operator.openshift.io/v1
kind: Config
metadata:
creationTimestamp: 2019-02-05T13:52:05Z
finalizers:
- imageregistry.operator.openshift.io/finalizer
generation: 1
name: cluster
resourceVersion: "56174"
selfLink: /apis/imageregistry.operator.openshift.io/v1/configs/cluster
uid: 36fd3724-294d-11e9-a524-12ffeee2931b
spec:
httpSecret: d9a012ccd117b1e6616ceccb2c3bb66a5fed1b5e481623
logging: 2
managementState: Managed
proxy: {}
replicas: 1
requests:
read: {}
write: {}
storage:
s3:
bucket: image-registry-us-east-1-c92e88cad85b48ec8b312344dff03c82-392c
region: us-east-1
status:
...
2. Edit the configs.imageregistry.operator.openshift.io/cluster object:
$ oc edit configs.imageregistry.operator.openshift.io/cluster
3. Add the following lines of text to the spec section of the object:
nodeSelector:
node-role.kubernetes.io/infra: ""
4. Verify the registry pod has been moved to the infrastructure node.
a. Run the following command to identify the node where the registry pod is located:
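For example (the registry runs in the openshift-image-registry namespace):
$ oc get pods -o wide -n openshift-image-registry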
Prerequisites
Procedure
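1. View the IngressController custom resource for the router Operator, for example:
$ oc get ingresscontroller default -n openshift-ingress-operator -o yaml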
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
creationTimestamp: 2019-04-18T12:35:39Z
finalizers:
- ingresscontroller.operator.openshift.io/finalizer-ingresscontroller
generation: 1
name: default
namespace: openshift-ingress-operator
resourceVersion: "11341"
selfLink: /apis/operator.openshift.io/v1/namespaces/openshift-ingress-
operator/ingresscontrollers/default
uid: 79509e05-61d6-11e9-bc55-02ce4781844a
spec: {}
status:
availableReplicas: 2
conditions:
- lastTransitionTime: 2019-04-18T12:36:15Z
status: "True"
type: Available
domain: apps.<cluster>.example.com
endpointPublishingStrategy:
type: LoadBalancerService
selector: ingresscontroller.operator.openshift.io/deployment-ingresscontroller=default
2. Edit the ingresscontroller resource and change the nodeSelector to use the infra label:
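For example:
$ oc edit ingresscontroller default -n openshift-ingress-operator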
Add the nodeSelector stanza that references the infra label to the spec section, as shown:
spec:
nodePlacement:
nodeSelector:
matchLabels:
node-role.kubernetes.io/infra: ""
a. View the list of router pods and note the node name of the running pod:
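For example (the router pods run in the openshift-ingress namespace):
$ oc get pod -n openshift-ingress -o wide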
Example output
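b. View the roles of the node that the pod runs on, for example:
$ oc get node <node_name>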
Specify the <node_name> that you obtained from the pod list.
Example output
Because the role list includes infra, the pod is running on the correct node.
Number of worker nodes    CPU cores    Memory (GB)
25                        4            32
100                       8            64
250                       32           192
500                       32           192
IMPORTANT
These sizing recommendations are based on scale tests, which create a large number of objects across the cluster. These tests include reaching some of the cluster maximums. In the case of 250 and 500 node counts on an OpenShift Container Platform 4.5 cluster, these maximums are 10000 namespaces with 61000 pods, 10000 deployments, 181000 secrets, 400 config maps, and so on. Prometheus is a highly memory-intensive application; the resource usage depends on various factors including the number of nodes, objects, the Prometheus metrics scraping interval, metrics or time series, and the age of the cluster. The disk size also depends on the retention period. You must take these factors into consideration and size them accordingly.
The sizing recommendations are applicable only for the infrastructure components that get installed during the cluster install: Prometheus, Router, and Registry. Logging is a day-two operation, and the recommendations do not take it into account.
NOTE
In OpenShift Container Platform 4.5, half of a CPU core (500 millicores) is reserved by the system by default, compared to OpenShift Container Platform 3.11 and previous versions. This influences the stated sizing recommendations.
CHAPTER 3. RECOMMENDED CLUSTER SCALING PRACTICES
IMPORTANT
The guidance in this section is only relevant for installations with cloud provider
integration.
Apply the following best practices to scale the number of worker machines in your OpenShift Container
Platform cluster. You scale the worker machines by increasing or decreasing the number of replicas that
are defined in the worker machine set.
Spread nodes across all of the available zones for higher availability.
Consider creating new machine sets in each available zone with alternative instance types of
similar size to help mitigate any periodic provider capacity constraints. For example, on AWS,
use m5.large and m5d.large.
NOTE
Cloud providers might implement a quota for API services. Therefore, gradually scale the
cluster.
The controller might not be able to create the machines if the replica counts in the machine sets are set to higher numbers all at one time. The number of requests that the cloud platform on which OpenShift Container Platform is deployed can handle impacts the process: the controller issues more queries while trying to create, check, and update the machines with the status, and because that cloud platform has API request limits, excessive queries might lead to machine creation failures.
Enable machine health checks when scaling to large node counts. In case of failures, the health checks
monitor the condition and automatically repair unhealthy machines.
NOTE
Scaling large and dense clusters down to lower node counts can take a large amount of time because the process involves draining or evicting the objects running on the nodes being terminated in parallel. Also, the client might start to throttle the requests if there are too many objects to evict. The default client QPS and burst rates are currently set to 5 and 10 respectively, and they cannot be modified in OpenShift Container Platform.
If you need to scale a machine set without making other changes, you do not need to delete the
machines.
NOTE
By default, the OpenShift Container Platform router pods are deployed on workers.
Because the router is required to access some cluster resources, including the web
console, do not scale the worker machine set to 0 unless you first relocate the router
pods.
Prerequisites
Procedure
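The replica count can be changed directly on the machine set; as a hedged example (the machine set name is a placeholder, and the openshift-machine-api namespace is the usual location):
$ oc scale --replicas=2 machineset <machineset_name> -n openshift-machine-api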
Alternatively, edit the machine set directly and change the number of replicas defined in its spec.
Wait for the machines to start. The new machines contain changes you made to the machine
set.
To monitor machine health, create a MachineHealthCheck custom resource (CR) that includes a label
for the set of machines to monitor and a condition to check, such as staying in the NotReady status for
15 minutes or displaying a permanent condition in the node-problem-detector.
The controller that observes a MachineHealthCheck CR checks for the condition that you defined. If a
machine fails the health check, the machine is automatically deleted and a new one is created to take its
place. When a machine is deleted, you see a machine deleted event.
NOTE
For machines with the master role, the machine health check reports the number of
unhealthy nodes, but the machine is not deleted. For example:
Example output
To limit the disruptive impact of machine deletions, the controller drains and deletes only
one node at a time. If there are more unhealthy machines than the maxUnhealthy
threshold allows for in the targeted pool of machines, the controller stops deleting
machines and you must manually intervene.
After you set the annotation, unhealthy machines are power-cycled by using BMC credentials.
Only machines owned by a machine set are remediated by a machine health check.
Control plane machines are not currently supported and are not remediated if they are
unhealthy.
If the node for a machine is removed from the cluster, a machine health check considers the
machine to be unhealthy and remediates it immediately.
If the corresponding node for a machine does not join the cluster after the
nodeStartupTimeout, the machine is remediated.
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
name: example 1
namespace: openshift-machine-api
annotations:
machine.openshift.io/remediation-strategy: external-baremetal 2
spec:
selector:
matchLabels:
machine.openshift.io/cluster-api-machine-role: <role> 3
machine.openshift.io/cluster-api-machine-type: <role> 4
machine.openshift.io/cluster-api-machineset: <cluster_name>-<label>-<zone> 5
unhealthyConditions:
- type: "Ready"
timeout: "300s" 6
status: "False"
- type: "Ready"
timeout: "300s" 7
status: "Unknown"
maxUnhealthy: "40%" 8
nodeStartupTimeout: "10m" 9
3 4 Specify a label for the machine pool that you want to check.
5 Specify the machine set to track in <cluster_name>-<label>-<zone> format. For example, prod-
node-us-east-1a.
6 7 Specify the timeout duration for a node condition. If a condition is met for the duration of the
timeout, the machine will be remediated. Long timeouts can result in long periods of downtime for
a workload on an unhealthy machine.
8 Specify the amount of unhealthy machines allowed in the targeted pool. This can be set as a
percentage or an integer.
9 Specify the timeout duration that a machine health check must wait for a node to join the cluster
before a machine is determined to be unhealthy.
NOTE
The matchLabels are examples only; you must map your machine groups based on your
specific needs.
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
name: example 1
namespace: openshift-machine-api
spec:
selector:
matchLabels:
machine.openshift.io/cluster-api-machine-role: <role> 2
machine.openshift.io/cluster-api-machine-type: <role> 3
machine.openshift.io/cluster-api-machineset: <cluster_name>-<label>-<zone> 4
unhealthyConditions:
- type: "Ready"
timeout: "300s" 5
status: "False"
- type: "Ready"
timeout: "300s" 6
status: "Unknown"
maxUnhealthy: "40%" 7
nodeStartupTimeout: "10m" 8
2 3 Specify a label for the machine pool that you want to check.
4 Specify the machine set to track in <cluster_name>-<label>-<zone> format. For example, prod-
node-us-east-1a.
5 6 Specify the timeout duration for a node condition. If a condition is met for the duration of the
timeout, the machine will be remediated. Long timeouts can result in long periods of downtime for
a workload on an unhealthy machine.
7 Specify the amount of unhealthy machines allowed in the targeted pool. This can be set as a
percentage or an integer.
8 Specify the timeout duration that a machine health check must wait for a node to join the cluster
before a machine is determined to be unhealthy.
NOTE
The matchLabels are examples only; you must map your machine groups based on your
specific needs.
If the user defines a value for the maxUnhealthy field, before remediating any machines, the
MachineHealthCheck compares the value of maxUnhealthy with the number of machines within its
target pool that it has determined to be unhealthy. Remediation is not performed if the number of
unhealthy machines exceeds the maxUnhealthy limit.
IMPORTANT
If maxUnhealthy is not set, the value defaults to 100% and the machines are remediated
regardless of the state of the cluster.
The maxUnhealthy field can be set as either an integer or percentage. There are different remediation
implementations depending on the maxUnhealthy value.
If maxUnhealthy is set to 2:
Remediation will be performed if 2 or fewer nodes are unhealthy
Remediation will not be performed if 3 or more nodes are unhealthy
These values are independent of how many machines are being checked by the machine health check.
Prerequisites
Procedure
1. Create a healthcheck.yml file that contains the definition of your machine health check.
2. Apply the healthcheck.yml file to your cluster:
$ oc apply -f healthcheck.yml
CHAPTER 4. USING THE NODE TUNING OPERATOR
The Operator manages the containerized Tuned daemon for OpenShift Container Platform as a
Kubernetes daemon set. It ensures the custom tuning specification is passed to all containerized Tuned
daemons running in the cluster in the format that the daemons understand. The daemons run on all
nodes in the cluster, one per node.
Node-level settings applied by the containerized Tuned daemon are rolled back on an event that
triggers a profile change or when the containerized Tuned daemon is terminated gracefully by receiving
and handling a termination signal.
The Node Tuning Operator is part of a standard OpenShift Container Platform installation in version 4.1
and later.
Procedure
1. Run:
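For example, the default custom resource can be retrieved from the Operator's namespace:
$ oc get tuned/default -o yaml -n openshift-cluster-node-tuning-operator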
The default CR is meant for delivering standard node-level tuning for the OpenShift Container Platform
platform and it can only be modified to set the Operator Management state. Any other custom changes
to the default CR will be overwritten by the Operator. For custom tuning, create your own Tuned CRs.
Newly created CRs will be combined with the default CR and custom tuning applied to OpenShift
Container Platform nodes based on node or pod labels and profile priorities.
WARNING
While in certain situations the support for pod labels can be a convenient way of
automatically delivering required tuning, this practice is discouraged and strongly
advised against, especially in large-scale clusters. The default Tuned CR ships
without pod label matching. If a custom profile is created with pod label matching,
then the functionality will be enabled at that time. The pod label functionality might
be deprecated in future versions of the Node Tuning Operator.
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
name: default
namespace: openshift-cluster-node-tuning-operator
spec:
profile:
- name: "openshift"
data: |
[main]
summary=Optimize systems running OpenShift (parent profile)
include=${f:virt_check:virtual-guest:throughput-performance}
[selinux]
avc_cache_threshold=8192
[net]
nf_conntrack_hashsize=131072
[sysctl]
net.ipv4.ip_forward=1
kernel.pid_max=>4194304
net.netfilter.nf_conntrack_max=1048576
net.ipv4.conf.all.arp_announce=2
net.ipv4.neigh.default.gc_thresh1=8192
net.ipv4.neigh.default.gc_thresh2=32768
net.ipv4.neigh.default.gc_thresh3=65536
net.ipv6.neigh.default.gc_thresh1=8192
net.ipv6.neigh.default.gc_thresh2=32768
net.ipv6.neigh.default.gc_thresh3=65536
vm.max_map_count=262144
[sysfs]
/sys/module/nvme_core/parameters/io_timeout=4294967295
/sys/module/nvme_core/parameters/max_retries=10
- name: "openshift-control-plane"
data: |
[main]
summary=Optimize systems running OpenShift control plane
include=openshift
[sysctl]
# ktune sysctl settings, maximizing i/o throughput
#
# Minimal preemption granularity for CPU-bound tasks:
# (default: 1 msec# (1 + ilog(ncpus)), units: nanoseconds)
kernel.sched_min_granularity_ns=10000000
# The total time the scheduler will consider a migrated process
# "cache hot" and thus less likely to be re-migrated
# (system default is 500000, i.e. 0.5 ms)
kernel.sched_migration_cost_ns=5000000
- name: "openshift-node"
data: |
[main]
summary=Optimize systems running OpenShift nodes
include=openshift
[sysctl]
net.ipv4.tcp_fastopen=3
fs.inotify.max_user_watches=65536
fs.inotify.max_user_instances=8192
recommend:
- profile: "openshift-control-plane"
priority: 30
match:
- label: "node-role.kubernetes.io/master"
- label: "node-role.kubernetes.io/infra"
- profile: "openshift-node"
priority: 40
Procedure
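1. The Tuned pods run in the Operator's namespace and can be listed, for example, with:
$ oc get pods -n openshift-cluster-node-tuning-operator -o wide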
Example output
2. Extract the profile applied from each pod and match them against the previous list:
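One way to do this, assuming the applied profile is reported in each Tuned pod's log (the grep pattern is an assumption about the log format), is:
$ oc logs <tuned_pod_name> -n openshift-cluster-node-tuning-operator | grep applied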
Example output
Multiple custom tuning specifications can co-exist as multiple CRs in the Operator’s namespace. The
existence of new CRs or the deletion of old CRs is detected by the Operator. All existing custom tuning
specifications are merged and appropriate objects for the containerized Tuned daemons are updated.
Profile data
profile:
- name: tuned_profile_1
data: |
# Tuned profile specification
[main]
summary=Description of tuned_profile_1 profile
[sysctl]
net.ipv4.ip_forward=1
# ... other sysctl's or other Tuned daemon plug-ins supported by the containerized Tuned
# ...
- name: tuned_profile_n
data: |
# Tuned profile specification
[main]
summary=Description of tuned_profile_n profile
Recommended profiles
The profile: selection logic is defined by the recommend: section of the CR. The recommend: section
is a list of items to recommend the profiles based on a selection criteria.
recommend:
<recommend-item-1>
# ...
<recommend-item-n>
- machineConfigLabels: 1
<mcLabels> 2
match: 3
<match> 4
priority: <priority> 5
profile: <tuned_profile_name> 6
1 Optional.
3 If omitted, profile match is assumed unless a profile with a higher priority matches first or
machineConfigLabels is set.
4 An optional list.
5 Profile ordering priority. Lower numbers mean higher priority (0 is the highest priority).
- label: <label_name> 1
value: <label_value> 2
type: <label_type> 3
<match> 4
2 Optional node or pod label value. If omitted, the presence of <label_name> is enough to match.
If <match> is not omitted, all nested <match> sections must also evaluate to true. Otherwise, false is assumed and the profile with the respective <match> section will not be applied or recommended. Therefore, the nesting (child <match> sections) works as a logical AND operator. Conversely, if any item of the <match> list matches, the entire <match> list evaluates to true. Therefore, the list acts as a logical OR operator.
If machineConfigLabels is defined, machine config pool based matching is turned on for the given
recommend: list item. <mcLabels> specifies the labels for a machine config. The machine config is
created automatically to apply host settings, such as kernel boot parameters, for the profile
<tuned_profile_name>. This involves finding all machine config pools with machine config selector
matching <mcLabels> and setting the profile <tuned_profile_name> on all nodes that match the
machine config pools' node selectors.
The list items match and machineConfigLabels are connected by the logical OR operator. The match
item is evaluated first in a short-circuit manner. Therefore, if it evaluates to true, the
machineConfigLabels item is not considered.
IMPORTANT
When using machine config pool based matching, it is advised to group nodes with the
same hardware configuration into the same machine config pool. Not following this
practice might result in Tuned operands calculating conflicting kernel parameters for two
or more nodes sharing the same machine config pool.
- match:
- label: tuned.openshift.io/elasticsearch
match:
- label: node-role.kubernetes.io/master
- label: node-role.kubernetes.io/infra
type: pod
priority: 10
profile: openshift-control-plane-es
- match:
- label: node-role.kubernetes.io/master
- label: node-role.kubernetes.io/infra
priority: 20
profile: openshift-control-plane
- priority: 30
profile: openshift-node
The CR above is translated for the containerized Tuned daemon into its recommend.conf file based on
the profile priorities. The profile with the highest priority (10) is openshift-control-plane-es and,
therefore, it is considered first. The containerized Tuned daemon running on a given node looks to see if
there is a pod running on the same node with the tuned.openshift.io/elasticsearch label set. If not, the
entire <match> section evaluates as false. If there is such a pod with the label, in order for the <match>
section to evaluate to true, the node label also needs to be node-role.kubernetes.io/master or node-
role.kubernetes.io/infra.
If the labels for the profile with priority 10 matched, openshift-control-plane-es profile is applied and
no other profile is considered. If the node/pod label combination did not match, the second highest
priority profile (openshift-control-plane) is considered. This profile is applied if the containerized Tuned
pod runs on a node with labels node-role.kubernetes.io/master or node-role.kubernetes.io/infra.
Finally, the profile openshift-node has the lowest priority of 30. It lacks the <match> section and,
therefore, will always match. It acts as a profile catch-all to set openshift-node profile, if no other profile
with higher priority matches on a given node.
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
name: openshift-node-custom
namespace: openshift-cluster-node-tuning-operator
spec:
profile:
- data: |
[main]
summary=Custom OpenShift node profile with an additional kernel parameter
include=openshift-node
[bootloader]
cmdline_openshift_node_custom=+skew_tick=1
name: openshift-node-custom
recommend:
- machineConfigLabels:
machineconfiguration.openshift.io/role: "worker-custom"
priority: 20
profile: openshift-node-custom
To minimize node reboots, label the target nodes with a label the machine config pool’s node selector
will match, then create the Tuned CR above and finally create the custom machine config pool itself.
IMPORTANT
Custom profile writers are strongly encouraged to include the default Tuned daemon profiles shipped within the default Tuned CR. The example above uses the default openshift-node profile to accomplish this.
audio
cpu
disk
eeepc_she
modules
mounts
net
scheduler
scsi_host
selinux
sysctl
sysfs
usb
video
vm
There is some dynamic tuning functionality provided by some of these plug-ins that is not supported.
The following Tuned plug-ins are currently not supported:
bootloader
script
systemd
See Available Tuned Plug-ins and Getting Started with Tuned for more information.
CHAPTER 5. USING CLUSTER LOADER
Procedure
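Cluster Loader is shipped in the origin-tests container image; pulling it looks like the following (the image tag is an assumption):
$ podman pull quay.io/openshift/origin-tests:4.5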
Prerequisites
The repository will prompt you to authenticate. The registry credentials allow you to access the
image, which is not publicly available. Use your existing authentication credentials from
installation.
Procedure
1. Execute Cluster Loader using the built-in test configuration, which deploys five template builds
and waits for them to complete:
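A sketch of an invocation that mounts both the kubeconfig and a local configuration directory follows; the image tag, the in-container mount paths, the VIPERCONFIG variable, and the test name placeholder are assumptions:
$ podman run -v ${LOCAL_KUBECONFIG}:/root/.kube/config:z \
    -v ${LOCAL_CONFIG_FILE_PATH}:/root/configs/:z \
    -i quay.io/openshift/origin-tests:4.5 \
    /bin/bash -c 'KUBECONFIG=/root/.kube/config VIPERCONFIG=/root/configs/test.yaml openshift-tests run-test <cluster_loader_test_name>'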
In this example, ${LOCAL_KUBECONFIG} refers to the path to the kubeconfig on your local
file system. Also, there is a directory called ${LOCAL_CONFIG_FILE_PATH}, which is mounted
into the container that contains a configuration file called test.yaml. Additionally, if the
test.yaml references any external template files or podspec files, they should also be mounted
into the container.
provider: local 1
ClusterLoader:
cleanup: true
projects:
- num: 1
basename: clusterloader-cakephp-mysql
tuning: default
ifexists: reuse
templates:
- num: 1
file: cakephp-mysql.json
- num: 1
basename: clusterloader-dancer-mysql
tuning: default
ifexists: reuse
templates:
- num: 1
file: dancer-mysql.json
- num: 1
basename: clusterloader-django-postgresql
tuning: default
ifexists: reuse
templates:
- num: 1
file: django-postgresql.json
- num: 1
basename: clusterloader-nodejs-mongodb
tuning: default
ifexists: reuse
templates:
- num: 1
file: quickstarts/nodejs-mongodb.json
- num: 1
basename: clusterloader-rails-postgresql
tuning: default
templates:
- num: 1
file: rails-postgresql.json
tuningsets: 2
- name: default
pods:
stepping: 3
stepsize: 5
pause: 0 s
rate_limit: 4
delay: 0 ms
1 Optional setting for end-to-end tests. Set to local to avoid extra log messages.
2 The tuning sets allow rate limiting and stepping, the ability to create several batches of pods while
pausing in between sets. Cluster Loader monitors completion of the previous step before
continuing.
3 Stepping will pause for M seconds after each N objects are created.
4 Rate limiting will wait M milliseconds between the creation of objects.
This example assumes that references to any external template files or pod spec files are also mounted
into the container.
IMPORTANT
If you are running Cluster Loader on Microsoft Azure, then you must set the
AZURE_AUTH_LOCATION variable to a file that contains the output of
terraform.azure.auto.tfvars.json, which is present in the installer directory.
The name field of a tuning set is a string; it matches the name specified when defining a tuning in a project.
If the IDENTIFIER parameter is not defined in user templates, template creation fails with error:
unknown parameter name "IDENTIFIER". If you deploy templates, add this parameter to your
template to avoid this error:
{
"name": "IDENTIFIER",
"description": "Number to append to the name of resources",
"value": "1"
}
CHAPTER 6. USING CPU MANAGER
CPU Manager is useful for workloads that have some of these attributes:
Coordinate with other processes and benefit from sharing a single processor cache.
Procedure
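1. Label a node so that CPU Manager workloads can be directed to it; for example (the node name matches the example used later in this chapter):
# oc label node perf-node.example.com cpumanager=true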
2. Edit the MachineConfigPool of the nodes where CPU Manager should be enabled. In this
example, all workers have CPU Manager enabled:
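For example, edit the worker pool (assumed here) and add the custom-kubelet label shown in the following snippet:
# oc edit machineconfigpool worker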
metadata:
creationTimestamp: 2020-xx-xxx
generation: 3
labels:
custom-kubelet: cpumanager-enabled
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
name: cpumanager-enabled
spec:
machineConfigPoolSelector:
matchLabels:
custom-kubelet: cpumanager-enabled
kubeletConfig:
cpuManagerPolicy: static 1
cpuManagerReconcilePeriod: 5s 2
1 Specify a policy:
none. This policy explicitly enables the existing default CPU affinity scheme, providing no affinity beyond what the scheduler does automatically.
static. This policy allows pods with certain resource characteristics to be granted
increased CPU affinity and exclusivity on the node.
2 Optional. Specify the CPU Manager reconcile frequency. The default is 5s.
# oc create -f cpumanager-kubeletconfig.yaml
This adds the CPU Manager feature to the kubelet config and, if needed, the Machine Config
Operator (MCO) reboots the node. To enable CPU Manager, a reboot is not needed.
Example output
"ownerReferences": [
{
"apiVersion": "machineconfiguration.openshift.io/v1",
"kind": "KubeletConfig",
"name": "cpumanager-enabled",
"uid": "7ed5616d-6b72-11e9-aae1-021e1ce18878"
}
]
# oc debug node/perf-node.example.com
sh-4.2# cat /host/etc/kubernetes/kubelet.conf | grep cpuManager
Example output
cpuManagerPolicy: static 1
cpuManagerReconcilePeriod: 5s 2
1 2 These settings were defined when you created the KubeletConfig CR.
8. Create a pod that requests a core or multiple cores. Both limits and requests must have their
CPU value set to a whole integer. That is the number of cores that will be dedicated to this pod:
# cat cpumanager-pod.yaml
Example output
apiVersion: v1
kind: Pod
metadata:
generateName: cpumanager-
spec:
containers:
- name: cpumanager
image: gcr.io/google_containers/pause-amd64:3.0
resources:
requests:
cpu: 1
memory: "1G"
limits:
cpu: 1
memory: "1G"
nodeSelector:
cpumanager: "true"
# oc create -f cpumanager-pod.yaml
10. Verify that the pod is scheduled to the node that you labeled:
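For example (the pod name is a placeholder for the generated name):
# oc describe pod <cpumanager_pod_name>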
Example output
Name: cpumanager-6cqz7
Namespace: default
Priority: 0
PriorityClassName: <none>
Node: perf-node.example.com/xxx.xx.xx.xxx
...
Limits:
cpu: 1
memory: 1G
Requests:
cpu: 1
memory: 1G
...
QoS Class: Guaranteed
Node-Selectors: cpumanager=true
11. Verify that the cgroups are set up correctly. Get the process ID (PID) of the pause process:
# ├─init.scope
│ └─1 /usr/lib/systemd/systemd --switched-root --system --deserialize 17
└─kubepods.slice
├─kubepods-pod69c01f8e_6b74_11e9_ac0f_0a2b62178a22.slice
│ ├─crio-b5437308f1a574c542bdf08563b865c0345c8f8c0b0a655612c.scope
│ └─32706 /pause
Pods of quality of service (QoS) tier Guaranteed are placed within the kubepods.slice. Pods of
other QoS tiers end up in child cgroups of kubepods:
# cd /sys/fs/cgroup/cpuset/kubepods.slice/kubepods-
pod69c01f8e_6b74_11e9_ac0f_0a2b62178a22.slice/crio-
b5437308f1ad1a7db0574c542bdf08563b865c0345c86e9585f8c0b0a655612c.scope
# for i in `ls cpuset.cpus tasks` ; do echo -n "$i "; cat $i ; done
Example output
cpuset.cpus 1
tasks 32706
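The CPU list allowed for the task can be cross-checked in /proc, for example:
# grep ^Cpus_allowed_list /proc/32706/status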
Example output
Cpus_allowed_list: 1
13. Verify that another pod (in this case, the pod in the burstable QoS tier) on the system cannot
run on the core allocated for the Guaranteed pod:
# cat /sys/fs/cgroup/cpuset/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-
podc494a073_6b77_11e9_98c0_06bba5c387ea.slice/crio-
c56982f57b75a2420947f0afc6cafe7534c5734efc34157525fa9abbf99e3849.scope/cpuset.cpus
0
# oc describe node perf-node.example.com
Example output
...
Capacity:
attachable-volumes-aws-ebs: 39
cpu: 2
ephemeral-storage: 124768236Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 8162900Ki
pods: 250
Allocatable:
attachable-volumes-aws-ebs: 39
cpu: 1500m
ephemeral-storage: 124768236Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 7548500Ki
pods: 250
  Namespace                  Name                        CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                  ----                        ------------  ----------  ---------------  -------------  ---
  default                    cpumanager-6cqz7            1 (66%)       1 (66%)     1G (12%)         1G (12%)       29m
Allocated resources:
This VM has two CPU cores. The system-reserved setting reserves 500 millicores, meaning
that half of one core is subtracted from the total capacity of the node to arrive at the Node
Allocatable amount. You can see that Allocatable CPU is 1500 millicores. This means you can
run one of the CPU Manager pods since each will take one whole core. A whole core is
equivalent to 1000 millicores. If you try to schedule a second pod, the system will accept the
pod, but it will never be scheduled.
CHAPTER 7. USING TOPOLOGY MANAGER
Topology Manager uses topology information from collected hints to decide if a pod can be accepted or
rejected on a node, based on the configured Topology Manager policy and Pod resources requested.
Topology Manager is useful for workloads that use hardware accelerators to support latency-critical
execution and high throughput parallel computation.
NOTE
To use Topology Manager you must use the CPU Manager with the static policy. For
more information on CPU Manager, see Using CPU Manager.
NOTE
To align CPU resources with other requested resources in a Pod spec, the CPU Manager
must be enabled with the static CPU Manager policy.
Topology Manager supports four allocation policies, which you assign in the cpumanager-enabled
custom resource (CR):
none policy
This is the default policy and does not perform any topology alignment.
best-effort policy
For each container in a pod with the best-effort topology management policy, kubelet calls each Hint
Provider to discover their resource availability. Using this information, the Topology Manager stores
the preferred NUMA Node affinity for that container. If the affinity is not preferred, Topology
Manager stores this and admits the pod to the node.
restricted policy
For each container in a pod with the restricted topology management policy, kubelet calls each Hint
Provider to discover their resource availability. Using this information, the Topology Manager stores
the preferred NUMA Node affinity for that container. If the affinity is not preferred, Topology
Manager rejects this pod from the node, resulting in a pod in a Terminated state with a pod
admission failure.
single-numa-node policy
For each container in a pod with the single-numa-node topology management policy, kubelet calls
each Hint Provider to discover their resource availability. Using this information, the Topology
Manager determines if a single NUMA Node affinity is possible. If it is, the pod is admitted to the
node. If a single NUMA Node affinity is not possible, the Topology Manager rejects the pod from the
node. This results in a pod in a Terminated state with a pod admission failure.
Prerequisites
Configure the CPU Manager policy to be static. Refer to Using CPU Manager in the Scalability
and Performance section.
Procedure
To activate Topology Manager:
$ oc edit featuregate/cluster
apiVersion: config.openshift.io/v1
kind: FeatureGate
metadata:
annotations:
release.openshift.io/create-only: "true"
creationTimestamp: 2020-06-05T14:41:09Z
generation: 2
managedFields:
- apiVersion: config.openshift.io/v1
fieldsType: FieldsV1
fieldsV1:
f:metadata:
f:annotations:
.: {}
f:release.openshift.io/create-only: {}
f:spec: {}
manager: cluster-version-operator
operation: Update
time: 2020-06-05T14:41:09Z
- apiVersion: config.openshift.io/v1
fieldsType: FieldsV1
fieldsV1:
f:spec:
f:featureSet: {}
manager: oc
operation: Update
time: 2020-06-05T15:21:44Z
name: cluster
resourceVersion: "28457"
selfLink: /apis/config.openshift.io/v1/featuregates/cluster
uid: e802e840-89ee-4137-a7e5-ca15fd2806f8
spec:
featureSet: LatencySensitive 1
...
2. Configure the Topology Manager policy in the cpumanager-enabled custom resource (CR).
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
name: cpumanager-enabled
spec:
machineConfigPoolSelector:
matchLabels:
custom-kubelet: cpumanager-enabled
kubeletConfig:
cpuManagerPolicy: static 1
cpuManagerReconcilePeriod: 5s
topologyManagerPolicy: single-numa-node 2
1 This parameter must be static.
2 Specify your selected Topology Manager policy. Here, the policy is single-numa-node.
Acceptable values are: default, best-effort, restricted, single-numa-node.
Additional resources
For more information on CPU Manager, see Using CPU Manager.
The following pod runs in the BestEffort QoS class because no resource requests or limits are specified.
spec:
containers:
- name: nginx
image: nginx
The next pod runs in the Burstable QoS class because requests are less than limits.
spec:
containers:
- name: nginx
image: nginx
resources:
limits:
memory: "200Mi"
requests:
memory: "100Mi"
If the selected policy is anything other than none, Topology Manager would not consider either of these
Pod specifications.
The last example pod below runs in the Guaranteed QoS class because requests are equal to limits.
spec:
containers:
- name: nginx
image: nginx
resources:
limits:
memory: "200Mi"
cpu: "2"
example.com/device: "1"
requests:
memory: "200Mi"
cpu: "2"
example.com/device: "1"
Topology Manager would consider this pod. The Topology Manager consults the CPU Manager static
policy, which returns the topology of available CPUs. Topology Manager also consults Device Manager
to discover the topology of available devices for example.com/device.
Topology Manager will use this information to store the best Topology for this container. In the case of
this pod, CPU Manager and Device Manager will use this stored information at the resource allocation
stage.
CHAPTER 8. SCALING THE CLUSTER MONITORING OPERATOR
NOTE
The Prometheus storage requirements below are not prescriptive. Higher resource
consumption might be observed in your cluster depending on workload activity and
resource use.
Table 8.1. Prometheus database storage requirements based on number of nodes/pods in the cluster

Number of nodes  Number of pods  Prometheus storage growth per day  Prometheus storage growth per 15 days  RAM space (per scale size)  Network (per tsdb chunk)
50               1800            6.3 GB                             94 GB                                  6 GB                        16 MB
Approximately 20 percent of the expected size was added as overhead to ensure that the storage
requirements do not exceed the calculated value.
The above calculation is for the default OpenShift Container Platform Cluster Monitoring Operator.
NOTE
CPU utilization has minor impact. The ratio is approximately 1 core out of 40 per 50
nodes and 1800 pods.
apiVersion: v1
kind: ConfigMap
data:
config.yaml: |
prometheusOperator:
baseImage: quay.io/coreos/prometheus-operator
prometheusConfigReloaderBaseImage: quay.io/coreos/prometheus-config-reloader
configReloaderBaseImage: quay.io/coreos/configmap-reload
nodeSelector:
node-role.kubernetes.io/infra: ""
prometheusK8s:
retention: {{PROMETHEUS_RETENTION_PERIOD}} 1
baseImage: openshift/prometheus
nodeSelector:
node-role.kubernetes.io/infra: ""
volumeClaimTemplate:
spec:
storageClassName: gp2
resources:
requests:
storage: {{PROMETHEUS_STORAGE_SIZE}} 2
alertmanagerMain:
baseImage: openshift/prometheus-alertmanager
nodeSelector:
node-role.kubernetes.io/infra: ""
volumeClaimTemplate:
spec:
storageClassName: gp2
resources:
requests:
storage: {{ALERTMANAGER_STORAGE_SIZE}} 3
nodeExporter:
baseImage: openshift/prometheus-node-exporter
kubeRbacProxy:
baseImage: quay.io/coreos/kube-rbac-proxy
kubeStateMetrics:
baseImage: quay.io/coreos/kube-state-metrics
nodeSelector:
node-role.kubernetes.io/infra: ""
grafana:
baseImage: grafana/grafana
nodeSelector:
node-role.kubernetes.io/infra: ""
auth:
baseImage: openshift/oauth-proxy
k8sPrometheusAdapter:
nodeSelector:
node-role.kubernetes.io/infra: ""
metadata:
name: cluster-monitoring-config
namespace: openshift-monitoring
2. Set values such as the retention period and the storage sizes.
3. Apply the changes by running the following command:
$ oc create -f cluster-monitoring-config.yml
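To confirm that the configuration was stored, you can read the config map back. This is a minimal check; the name and namespace come from the example above:
$ oc -n openshift-monitoring get configmap cluster-monitoring-config -o yaml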
CHAPTER 9. PLANNING YOUR ENVIRONMENT ACCORDING TO OBJECT MAXIMUMS
These guidelines are based on the largest possible cluster. For smaller clusters, the maximums are lower.
There are many factors that influence the stated thresholds, including the etcd version or storage data
format.
In most cases, exceeding these numbers results in lower overall performance. It does not necessarily
mean that the cluster will fail.
Number of pods per core: no default value (in both the 3.x and 4.x tested maximums)
Number of builds: 10,000 (default pod RAM 512 Mi), tested with the Pipeline strategy (3.x) and with the Source-to-Image (S2I) build strategy (4.x)
1. The pod count displayed here is the number of test pods. The actual number of pods depends
on the application’s memory, CPU, and storage requirements.
2. This was tested on a cluster with 100 worker nodes running 500 pods per worker node. The default
maxPods value is still 250. To reach 500 pods per node, the cluster must be created with maxPods set
to 500 by using a custom kubelet config (a sketch of such a KubeletConfig follows this list). If you need
500 user pods, you need a hostPrefix of 22 because 10-15 system pods are already running on each
node. The maximum number of pods with attached persistent volume claims (PVCs) depends on the
storage back end from which the PVCs are allocated. In our tests, only OpenShift Container Storage v4
(OCS v4) was able to satisfy the number of pods per node discussed in this document.
3. When there are a large number of active projects, etcd might suffer from poor performance if
the keyspace grows excessively large and exceeds the space quota. Periodic maintenance of
etcd, including defragmentation, is highly recommended to free etcd storage.
4. There are a number of control loops in the system that must iterate over all objects in a given
namespace as a reaction to some changes in state. Having a large number of objects of a given
type in a single namespace can make those loops expensive and slow down processing given
state changes. The limit assumes that the system has enough CPU, memory, and disk to satisfy
the application requirements.
5. Each service port and each service back end has a corresponding entry in iptables. The number of
back ends of a given service impacts the size of the endpoints objects, which in turn impacts the size
of data that is sent throughout the system.
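The custom kubelet config mentioned in footnote 2 can be expressed as a KubeletConfig custom resource, following the same pattern as the other KubeletConfig examples in this document. This is a minimal sketch; the resource name set-max-pods and the custom-kubelet: large-pods label are illustrative assumptions, and the targeted machine config pool must carry the matching label:
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: set-max-pods
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: large-pods
  kubeletConfig:
    maxPods: 500
The hostPrefix of 22 is set at installation time in the networking.clusterNetwork section of install-config.yaml.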
Maximum type | 4.1 tested maximum | 4.2 tested maximum | 4.3 tested maximum | 4.4 tested maximum | 4.5 tested maximum
1. The pod count displayed here is the number of test pods. The actual number of pods depends
on the application’s memory, CPU, and storage requirements.
2. When there are a large number of active projects, etcd might suffer from poor performance if
the keyspace grows excessively large and exceeds the space quota. Periodic maintenance of
etcd, including defragmentation, is highly recommended to free etcd storage.
3. There are a number of control loops in the system that must iterate over all objects in a given
namespace as a reaction to some changes in state. Having a large number of objects of a given
type in a single namespace can make those loops expensive and slow down processing given
state changes. The limit assumes that the system has enough CPU, memory, and disk to satisfy
the application requirements.
4. Each service port and each service back end has a corresponding entry in iptables. The number of
back ends of a given service impacts the size of the endpoints objects, which in turn impacts the size
of data that is sent throughout the system.
1. io1 disks with 3000 IOPS are used for master/etcd nodes as etcd is I/O intensive and latency
sensitive.
2. Infra nodes are used to host Monitoring, Ingress, and Registry components to ensure they have
enough resources to run at large scale.
4. A larger disk size is used so that there is enough space to store the large amount of data that is
collected during the performance and scalability test run.
5. Cluster is scaled in iterations and performance and scalability tests are executed at the
specified node counts.
IMPORTANT
Some of the tested maximums are stretched only in a single dimension. They will vary
when many objects are running on the cluster.
The numbers noted in this documentation are based on Red Hat’s test methodology,
setup, configuration, and tunings. These numbers can vary based on your own individual
setup and environments.
While planning your environment, determine how many pods are expected to fit per node, and from that how many nodes you need:
required pods per cluster / pods per node = total number of nodes needed
The current maximum number of pods per node is 250. However, the number of pods that fit on a node
is dependent on the application itself. Consider the application’s memory, CPU, and storage
requirements, as described in How to plan your environment according to application requirements .
Example scenario
If you want to scope your cluster for 2200 pods per cluster, you would need at least five nodes,
assuming that there are 500 maximum pods per node:
2200 / 500 = 4.4
If you increase the number of nodes to 20, then the pod distribution changes to 110 pods per node:
2200 / 20 = 110
Where:
required pods per cluster / total number of nodes = expected pods per node
node.js: 200 pods, 1 GB maximum memory, 1 CPU core, and 1 GB persistent storage per pod
postgresql: 100 pods, 1 GB maximum memory, 2 CPU cores, and 10 GB persistent storage per pod
Extrapolated requirements: 550 CPU cores, 450 GB RAM, and 1.4 TB storage.
Instance size for nodes can be modulated up or down, depending on your preference. Nodes are often
resource overcommitted. In this deployment scenario, you can choose to run additional smaller nodes or
fewer larger nodes to provide the same amount of resources. Factors such as operational agility and
cost-per-instance should be considered.
Nodes (option 2): 50 nodes with 8 CPU cores and 32 GB RAM each
Nodes (option 3): 25 nodes with 16 CPU cores and 64 GB RAM each
Some applications lend themselves well to overcommitted environments, and some do not. Most Java
applications and applications that use huge pages are examples of applications that would not allow for
overcommitment. That memory cannot be used for other applications. In the example above, the
environment would be roughly 30 percent overcommitted, a common ratio.
Application pods can access a service either by using environment variables or DNS. If environment
variables are used, the kubelet injects variables for each active service when a pod is run on a node. A
cluster-aware DNS server watches the Kubernetes API for new services and creates a set of DNS
records for each one. If DNS is enabled throughout your cluster, all pods should automatically be able
to resolve services by their DNS name. If you must go beyond 5000 services, use DNS for service
discovery. When environment variables are used for service discovery, the argument list exceeds the
allowed length after 5000 services in a namespace, and pods and deployments start failing. Disable the
service links in the deployment's pod specification to overcome this:
---
kind: Template
apiVersion: v1
metadata:
name: deploymentConfigTemplate
creationTimestamp:
annotations:
description: This template will create a deploymentConfig with 1 replica, 4 env vars and a
service.
tags: ''
objects:
- kind: DeploymentConfig
apiVersion: v1
metadata:
name: deploymentconfig${IDENTIFIER}
spec:
template:
metadata:
labels:
name: replicationcontroller${IDENTIFIER}
spec:
enableServiceLinks: false
containers:
- name: pause${IDENTIFIER}
image: "${IMAGE}"
ports:
- containerPort: 8080
protocol: TCP
env:
- name: ENVVAR1_${IDENTIFIER}
value: "${ENV_VALUE}"
- name: ENVVAR2_${IDENTIFIER}
value: "${ENV_VALUE}"
- name: ENVVAR3_${IDENTIFIER}
value: "${ENV_VALUE}"
- name: ENVVAR4_${IDENTIFIER}
value: "${ENV_VALUE}"
resources: {}
imagePullPolicy: IfNotPresent
capabilities: {}
securityContext:
capabilities: {}
privileged: false
restartPolicy: Always
serviceAccount: ''
replicas: 1
selector:
name: replicationcontroller${IDENTIFIER}
triggers:
- type: ConfigChange
strategy:
type: Rolling
- kind: Service
apiVersion: v1
metadata:
name: service${IDENTIFIER}
spec:
selector:
name: replicationcontroller${IDENTIFIER}
ports:
- name: serviceport${IDENTIFIER}
protocol: TCP
port: 80
targetPort: 8080
portalIP: ''
type: ClusterIP
sessionAffinity: None
status:
loadBalancer: {}
parameters:
- name: IDENTIFIER
description: Number to append to the name of resources
value: '1'
required: true
- name: IMAGE
description: Image to use for deploymentConfig
value: gcr.io/google-containers/pause-amd64:3.0
required: false
- name: ENV_VALUE
description: Value to use for environment variables
generate: expression
from: "[A-Za-z0-9]{255}"
required: false
labels:
template: deploymentConfigTemplate
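As a usage sketch, assuming the template above is saved as deployment-config-template.yaml (the file name is an assumption), it can be processed and created with its IDENTIFIER parameter:
$ oc process -f deployment-config-template.yaml -p IDENTIFIER=1 | oc create -f -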
CHAPTER 10. OPTIMIZING STORAGE
Object storage: accessible through a REST API endpoint (for example, AWS S3)
1. NetApp NFS supports dynamic PV provisioning when using the Trident plug-in.
1. ReadOnlyMany
2. ReadWriteMany
3. Prometheus is the underlying technology used for metrics.
4. This does not apply to physical disk, VM physical disk, VMDK, loopback over NFS, AWS EBS, and Azure Disk.
5. For metrics, using file storage with the ReadWriteMany (RWX) access mode is unreliable. If you use file storage, do not configure the RWX access mode on any persistent volume claims (PVCs) that are configured for use with metrics.
6. For logging, using any shared storage would be an anti-pattern. One volume per Elasticsearch is required.
7. Object storage is not consumed through OpenShift Container Platform's PVs or PVCs. Apps must integrate with the object storage REST API.
NOTE
A scaled registry is an OpenShift Container Platform registry where two or more pod
replicas are running.
IMPORTANT
Testing shows issues with using the NFS server on Red Hat Enterprise Linux (RHEL) as a storage back
end for core services. This includes the OpenShift Container Registry and
Quay, Prometheus for monitoring storage, and Elasticsearch for logging storage.
Therefore, using RHEL NFS to back PVs used by core services is not recommended.
Other NFS implementations on the marketplace might not have these issues. Contact
the individual NFS implementation vendor for more information on any testing that was
possibly completed against these OpenShift Container Platform core components.
10.2.1.1. Registry
The storage technology does not have to support RWX access mode.
File storage is not recommended for OpenShift Container Platform registry cluster deployment
with production workloads.
10.2.1.2. Scaled registry
The storage technology must support RWX access mode and must ensure read-after-write
consistency.
Amazon Simple Storage Service (Amazon S3), Google Cloud Storage (GCS), Microsoft Azure
Blob Storage, and OpenStack Swift are supported.
File storage is not recommended for a scaled/HA OpenShift Container Platform registry cluster
deployment with production workloads.
For non-cloud platforms, such as vSphere and bare metal installations, the only configurable
technology is file storage.
10.2.1.3. Metrics
IMPORTANT
It is not recommended to use file storage for a hosted metrics cluster deployment with
production workloads.
10.2.1.4. Logging
File storage is not recommended for a scaled/HA OpenShift Container Platform hosted logging cluster
deployment with production workloads.
IMPORTANT
Testing shows issues with using the NFS server on RHEL as a storage back end for core
services. This includes Elasticsearch for logging storage. Therefore, using RHEL NFS to
back PVs used by core services is not recommended.
Other NFS implementations on the marketplace might not have these issues. Contact
the individual NFS implementation vendor for more information on any testing that was
possibly completed against these OpenShift Container Platform core components.
10.2.1.5. Applications
Application use cases vary from application to application, as described in the following examples:
Storage technologies that support dynamic PV provisioning have low mount time latencies, and
are not tied to nodes to support a healthy cluster.
Application developers are responsible for knowing and understanding the storage
requirements for their application, and how it works with the provided storage to ensure that
issues do not occur when an application scales or interacts with the storage layer.
It is highly recommended that you use etcd with storage that handles serial writes (fsync)
quickly, such as NVMe or SSD. Ceph, NFS, and spinning disks are not recommended.
Red Hat OpenStack Platform (RHOSP) Cinder: RHOSP Cinder tends to be adept in ROX
access mode use cases.
Databases: Databases (RDBMSs, NoSQL DBs, etc.) tend to perform best with dedicated block
storage.
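For example, a database that needs dedicated block storage typically claims a ReadWriteOnce persistent volume from a block-backed storage class. This is a minimal sketch; the claim name, the gp2 storage class, and the requested size are illustrative assumptions that depend on your platform:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: database-storage
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: gp2
  resources:
    requests:
      storage: 100Gi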
Table 10.3. Main directories for storing OpenShift Container Platform data
/var/lib/etcd: Used for etcd storage when storing the database. Sizing: less than 20 GB; the database can grow up to 8 GB. Expected growth: will grow slowly with the environment, storing only metadata; an additional 20-25 GB for every additional 8 GB of memory.
/var/lib/containers: The mount point for the CRI-O runtime. Storage used for active container runtimes, including pods, and storage of local images. Not used for registry storage. Sizing: 50 GB for a node with 16 GB memory; note that this sizing should not be used to determine minimum cluster requirements; an additional 20-25 GB for every additional 8 GB of memory. Expected growth: limited by the capacity for running containers.
/var/log: Log files for all components. Sizing: 10 to 30 GB. Expected growth: log files can grow quickly; size can be managed by growing disks or by using log rotation.
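To see how much space these directories currently consume on a node, you can run df through a debug pod. This is a minimal check, where <node_name> is a placeholder:
$ oc debug node/<node_name> -- chroot /host df -h /var/lib/etcd /var/lib/containers /var/log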
CHAPTER 11. OPTIMIZING ROUTING
When evaluating the performance of a single HAProxy router in terms of HTTP requests handled per
second, the performance varies depending on many factors. In particular:
Route type
While performance in your specific environment will vary, Red Hat lab tests were performed on a public
cloud instance of size 4 vCPU/16 GB RAM. A single HAProxy router handling 100 routes terminated by
back ends serving 1 kB static pages is able to handle the following number of transactions per second.
Default Ingress Controller configuration with ROUTER_THREADS=4 was used and two different
endpoint publishing strategies (LoadBalancerService/HostNetwork) were tested. TLS session
resumption was used for encrypted routes. With HTTP keep-alive, a single HAProxy router is capable of
saturating 1 Gbit NIC at page sizes as small as 8 kB.
When running on bare metal with modern processors, you can expect roughly twice the performance of
the public cloud instance above. The difference is due to the overhead introduced by the virtualization
layer on public clouds, and it holds mostly true for private cloud-based virtualization as well. The
following table is a guide to how many applications to use behind the router:
In general, HAProxy can support routes for 5 to 1000 applications, depending on the technology in use.
Ingress Controller performance might be limited by the capabilities and performance of the applications
behind it, such as language or static versus dynamic content.
Ingress, or router, sharding should be used to serve more routes towards applications and help
horizontally scale the routing tier.
For more information on Ingress sharding, see Configuring Ingress Controller sharding by using route
labels and Configuring Ingress Controller sharding by using namespace labels .
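As an illustration of sharding by route labels, an additional Ingress Controller can be created that serves only the routes carrying a chosen label. This is a minimal sketch; the controller name sharded, the domain, and the type: sharded label are illustrative assumptions:
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: sharded
  namespace: openshift-ingress-operator
spec:
  domain: apps-sharded.example.com
  routeSelector:
    matchLabels:
      type: sharded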
You can modify the Ingress Controller deployment, but if the Ingress Operator is enabled, the
configuration is overwritten.
CHAPTER 12. OPTIMIZING NETWORKING
OVN-Kubernetes uses Geneve (Generic Network Virtualization Encapsulation) instead of VXLAN as the
tunnel protocol.
VXLAN provides benefits over VLANs, such as an increase in networks from 4096 to over 16 million, and
layer 2 connectivity across physical networks. This allows for all pods behind a service to communicate
with each other, even if they are running on different systems.
VXLAN encapsulates all tunneled traffic in user datagram protocol (UDP) packets. However, this leads
to increased CPU utilization. Both the outer and inner packets are subject to normal checksumming
rules to guarantee data is not corrupted during transit. Depending on CPU performance, this additional
processing overhead can cause a reduction in throughput and increased latency when compared to
traditional, non-overlay networks.
Cloud, VM, and bare metal CPUs are capable of handling much more than 1 Gbps of
network throughput. When using higher bandwidth links such as 10 or 40 Gbps, reduced performance
can occur. This is a known issue in VXLAN-based environments and is not specific to containers or
OpenShift Container Platform. Any network that relies on VXLAN tunnels will perform similarly because
of the VXLAN implementation.
Evaluate network plug-ins that implement different routing techniques, such as border gateway
protocol (BGP).
Use VXLAN-offload capable network adapters. VXLAN-offload moves the packet checksum
calculation and associated CPU overhead off of the system CPU and onto dedicated hardware
on the network adapter. This frees up CPU cycles for use by pods and applications, and allows
users to utilize the full bandwidth of their network infrastructure.
VXLAN-offload does not reduce latency. However, CPU utilization is reduced even in latency tests.
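To check whether a NIC and its driver expose VXLAN offload, you can query ethtool on the node. This is a minimal check; the interface name eth0 is a placeholder:
$ ethtool -k eth0 | grep tx-udp_tnl
Offload is active when features such as tx-udp_tnl-segmentation report on.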
The NIC MTU is only configured at the time of OpenShift Container Platform installation. The MTU
must be less than or equal to the maximum supported value of the NIC of your network. If you are
optimizing for throughput, choose the largest possible value. If you are optimizing for lowest latency,
choose a lower value.
The SDN overlay’s MTU must be less than the NIC MTU by 50 bytes at a minimum. This accounts for
the SDN overlay header. So, on a normal ethernet network, set this to 1450. On a jumbo frame ethernet
network, set this to 8950.
For OVN and Geneve, the MTU must be less than the NIC MTU by 100 bytes at a minimum.
NOTE
This 50 byte overlay header is relevant to the OpenShift SDN. Other SDN solutions
might require the value to be more or less.
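The overlay MTU itself is set through the Cluster Network Operator configuration at installation time. The following is a minimal sketch of such a manifest, assuming OpenShift SDN over a standard 1500-byte Ethernet NIC; the manifest file is added to the directory generated by openshift-install create manifests:
apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  name: cluster
spec:
  defaultNetwork:
    type: OpenShiftSDN
    openshiftSDNConfig:
      mtu: 1450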
networking:
clusterNetwork:
- cidr: 10.128.0.0/14
hostPrefix: 23
machineCIDR: 10.0.0.0/16
networkType: OpenShiftSDN
serviceNetwork:
- 172.30.0.0/16
The default cluster network CIDR, 10.128.0.0/14, cannot be used if the cluster size is more than 500
nodes. It must be set to 10.128.0.0/12 or 10.128.0.0/10 to scale beyond 500 nodes.
IPSec encrypts traffic at the IP payload level, before it hits the NIC, protecting fields that would
otherwise be used for NIC offloading. This means that some NIC acceleration features might not be
usable when IPSec is enabled and will lead to decreased throughput and increased CPU usage.
Additional resources
Configuration parameters for the OpenShift SDN default CNI network provider
CHAPTER 13. WHAT HUGE PAGES DO AND HOW THEY ARE CONSUMED BY APPLICATIONS
A huge page is a memory page that is larger than 4Ki. On x86_64 architectures, there are two common
huge page sizes: 2Mi and 1Gi. Sizes vary on other architectures. In order to use huge pages, code must
be written so that applications are aware of them. Transparent Huge Pages (THP) attempt to automate
the management of huge pages without application knowledge, but they have limitations. In particular,
they are limited to 2Mi page sizes. THP can lead to performance degradation on nodes with high
memory utilization or fragmentation due to defragmenting efforts of THP, which can lock memory
pages. For this reason, some applications may be designed to (or recommend) usage of pre-allocated
huge pages instead of THP.
In OpenShift Container Platform, applications in a pod can allocate and consume pre-allocated huge
pages.
Huge pages can be consumed through container-level resource requirements using the resource name
hugepages-<size>, where size is the most compact binary notation using integer values supported on a
particular node. For example, if a node supports 2048KiB page sizes, it exposes a schedulable resource
hugepages-2Mi. Unlike CPU or memory, huge pages do not support over-commitment.
apiVersion: v1
kind: Pod
metadata:
generateName: hugepages-volume-
spec:
containers:
- securityContext:
privileged: true
image: rhel7:latest
command:
- sleep
- inf
name: example
volumeMounts:
- mountPath: /dev/hugepages
name: hugepage
resources:
limits:
hugepages-2Mi: 100Mi 1
memory: "1Gi"
cpu: "1"
volumes:
- name: hugepage
emptyDir:
medium: HugePages
1 Specify the amount of memory for hugepages as the exact amount to be allocated. Do not specify
this value as the amount of memory for hugepages multiplied by the size of the page. For
example, given a huge page size of 2MB, if you want to use 100MB of huge-page-backed RAM for
your application, then you would allocate 50 huge pages. OpenShift Container Platform handles
the math for you. As in the above example, you can specify 100MB directly.
Some platforms support multiple huge page sizes. To allocate huge pages of a specific size, precede the
huge pages boot command parameters with a huge page size selection parameter hugepagesz=<size>.
The <size> value must be specified in bytes with an optional scale suffix [ kKmMgG]. The default huge
page size can be defined with the default_hugepagesz=<size> boot parameter.
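For example, the following kernel arguments (the values are illustrative) reserve sixteen 1G huge pages and 256 2M huge pages, and make 1G the default huge page size:
default_hugepagesz=1G hugepagesz=1G hugepages=16 hugepagesz=2M hugepages=256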
Huge page requests must equal the limits. This is the default if limits are specified, but requests
are not.
Huge pages are isolated at a pod scope. Container isolation is planned in a future iteration.
EmptyDir volumes backed by huge pages must not consume more huge page memory than the
pod request.
Applications that consume huge pages via shmget() with SHM_HUGETLB must run with a
supplemental group that matches /proc/sys/vm/hugetlb_shm_group.
Additional resources
Procedure
To minimize node reboots, follow these steps in order:
1. Apply a label to all nodes that need the same huge pages setting.
2. Create a file named hugepages-tuned-boottime.yaml with the following content:
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
name: hugepages 1
namespace: openshift-cluster-node-tuning-operator
spec:
profile: 2
- data: |
[main]
summary=Boot time configuration for hugepages
include=openshift-node
[bootloader]
cmdline_openshift_node_hugepages=hugepagesz=2M hugepages=50 3
name: openshift-node-hugepages
recommend:
- machineConfigLabels: 4
machineconfiguration.openshift.io/role: "worker-hp"
priority: 30
profile: openshift-node-hugepages
3 Note the order of parameters is important as some platforms support huge pages of
various sizes.
$ oc create -f hugepages-tuned-boottime.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
name: worker-hp
labels:
worker-hp: ""
spec:
machineConfigSelector:
matchExpressions:
- {key: machineconfiguration.openshift.io/role, operator: In, values: [worker,worker-hp]}
nodeSelector:
matchLabels:
node-role.kubernetes.io/worker-hp: ""
$ oc create -f hugepages-mcp.yaml
Given enough non-fragmented memory, all the nodes in the worker-hp machine config pool should now
have 50 2Mi huge pages allocated.
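You can verify the allocation from the node's allocatable resources. This is a minimal check, where <node_name> is a placeholder for a node in the worker-hp pool:
$ oc get node <node_name> -o jsonpath='{.status.allocatable.hugepages-2Mi}'
The expected output for this example is 100Mi, that is, 50 pages of 2Mi.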
WARNING
This functionality is currently only supported on Red Hat Enterprise Linux CoreOS
(RHCOS) 8.x worker nodes. On Red Hat Enterprise Linux (RHEL) 7.x worker nodes
the Tuned [bootloader] plug-in is currently not supported.