Red Hat Enterprise Linux 9: Monitoring and Managing System Status and Performance
The text of and illustrations in this document are licensed by Red Hat under a Creative Commons
Attribution–Share Alike 3.0 Unported license ("CC-BY-SA"). An explanation of CC-BY-SA is
available at
http://creativecommons.org/licenses/by-sa/3.0/
. In accordance with CC-BY-SA, if you distribute this document or an adaptation of it, you must
provide the URL for the original version.
Red Hat, as the licensor of this document, waives the right to enforce, and agrees not to assert,
Section 4d of CC-BY-SA to the fullest extent permitted by applicable law.
Red Hat, Red Hat Enterprise Linux, the Shadowman logo, the Red Hat logo, JBoss, OpenShift,
Fedora, the Infinity logo, and RHCE are trademarks of Red Hat, Inc., registered in the United States
and other countries.
Linux ® is the registered trademark of Linus Torvalds in the United States and other countries.
XFS ® is a trademark of Silicon Graphics International Corp. or its subsidiaries in the United States
and/or other countries.
MySQL ® is a registered trademark of MySQL AB in the United States, the European Union and
other countries.
Node.js ® is an official trademark of Joyent. Red Hat is not formally related to or endorsed by the
official Joyent Node.js open source or commercial project.
The OpenStack ® Word Mark and OpenStack logo are either registered trademarks/service marks
or trademarks/service marks of the OpenStack Foundation, in the United States and other
countries and are used with the OpenStack Foundation's permission. We are not affiliated with,
endorsed or sponsored by the OpenStack Foundation, or the OpenStack community.
Abstract
This documentation collection provides instructions on how to monitor and optimize the
throughput, latency, and power consumption of Red Hat Enterprise Linux 9 in different scenarios.
Table of Contents

MAKING OPEN SOURCE MORE INCLUSIVE

PROVIDING FEEDBACK ON RED HAT DOCUMENTATION

CHAPTER 1. GETTING STARTED WITH TUNED
    1.1. THE PURPOSE OF TUNED
    1.2. TUNED PROFILES
        Syntax of profile configuration
    1.3. THE DEFAULT TUNED PROFILE
    1.4. MERGED TUNED PROFILES
    1.5. THE LOCATION OF TUNED PROFILES
    1.6. TUNED PROFILES DISTRIBUTED WITH RHEL
    1.7. TUNED CPU-PARTITIONING PROFILE
    1.8. USING THE TUNED CPU-PARTITIONING PROFILE FOR LOW-LATENCY TUNING
    1.9. CUSTOMIZING THE CPU-PARTITIONING TUNED PROFILE
    1.10. REAL-TIME TUNED PROFILES DISTRIBUTED WITH RHEL
    1.11. STATIC AND DYNAMIC TUNING IN TUNED
    1.12. TUNED NO-DAEMON MODE
    1.13. INSTALLING AND ENABLING TUNED
    1.14. LISTING AVAILABLE TUNED PROFILES
    1.15. SETTING A TUNED PROFILE
    1.16. DISABLING TUNED

CHAPTER 2. CUSTOMIZING TUNED PROFILES
    2.1. TUNED PROFILES
        Syntax of profile configuration
    2.2. THE DEFAULT TUNED PROFILE
    2.3. MERGED TUNED PROFILES
    2.4. THE LOCATION OF TUNED PROFILES
    2.5. INHERITANCE BETWEEN TUNED PROFILES
    2.6. STATIC AND DYNAMIC TUNING IN TUNED
    2.7. TUNED PLUG-INS
        Syntax for plug-ins in TuneD profiles
        Short plug-in syntax
        Conflicting plug-in definitions in a profile
    2.8. AVAILABLE TUNED PLUG-INS
        Monitoring plug-ins
        Tuning plug-ins
    2.9. VARIABLES IN TUNED PROFILES
    2.10. BUILT-IN FUNCTIONS IN TUNED PROFILES
    2.11. BUILT-IN FUNCTIONS AVAILABLE IN TUNED PROFILES
    2.12. CREATING NEW TUNED PROFILES
    2.13. MODIFYING EXISTING TUNED PROFILES
    2.14. SETTING THE DISK SCHEDULER USING TUNED

CHAPTER 3. REVIEWING A SYSTEM USING TUNA INTERFACE
    3.1. INSTALLING TUNA TOOL
    3.2. VIEWING THE SYSTEM STATUS USING TUNA TOOL
    3.3. TUNING CPUS USING TUNA TOOL
    3.4. TUNING IRQS USING TUNA TOOL

CHAPTER 4. MONITORING PERFORMANCE USING RHEL SYSTEM ROLES
    4.1. PREPARING A CONTROL NODE AND MANAGED NODES TO USE RHEL SYSTEM ROLES
        4.1.1. Introduction to RHEL System Roles
        4.1.2. RHEL System Roles terminology
        4.1.3. Preparing a control node
        4.1.4. Preparing a managed node
        4.1.5. Verifying access from the control node to managed nodes
    4.2. INTRODUCTION TO THE METRICS SYSTEM ROLE
    4.3. USING THE METRICS SYSTEM ROLE TO MONITOR YOUR LOCAL SYSTEM WITH VISUALIZATION
    4.4. USING THE METRICS SYSTEM ROLE TO SETUP A FLEET OF INDIVIDUAL SYSTEMS TO MONITOR THEMSELVES
    4.5. USING THE METRICS SYSTEM ROLE TO MONITOR A FLEET OF MACHINES CENTRALLY VIA YOUR LOCAL MACHINE
    4.6. SETTING UP AUTHENTICATION WHILE MONITORING A SYSTEM USING THE METRICS SYSTEM ROLE
    4.7. USING THE METRICS SYSTEM ROLE TO CONFIGURE AND ENABLE METRICS COLLECTION FOR SQL SERVER

CHAPTER 5. SETTING UP PCP
    5.1. OVERVIEW OF PCP
    5.2. INSTALLING AND ENABLING PCP
    5.3. DEPLOYING A MINIMAL PCP SETUP
    5.4. SYSTEM SERVICES DISTRIBUTED WITH PCP
    5.5. TOOLS DISTRIBUTED WITH PCP
    5.6. PCP DEPLOYMENT ARCHITECTURES
    5.7. RECOMMENDED DEPLOYMENT ARCHITECTURE
    5.8. SIZING FACTORS
    5.9. CONFIGURATION OPTIONS FOR PCP SCALING
    5.10. EXAMPLE: ANALYZING THE CENTRALIZED LOGGING DEPLOYMENT
    5.11. EXAMPLE: ANALYZING THE FEDERATED SETUP DEPLOYMENT
    5.12. TROUBLESHOOTING HIGH MEMORY USAGE

CHAPTER 6. LOGGING PERFORMANCE DATA WITH PMLOGGER
    6.1. MODIFYING THE PMLOGGER CONFIGURATION FILE WITH PMLOGCONF
    6.2. EDITING THE PMLOGGER CONFIGURATION FILE MANUALLY
    6.3. ENABLING THE PMLOGGER SERVICE
    6.4. SETTING UP A CLIENT SYSTEM FOR METRICS COLLECTION
    6.5. SETTING UP A CENTRAL SERVER TO COLLECT DATA
    6.6. REPLAYING THE PCP LOG ARCHIVES WITH PMREP

CHAPTER 7. MONITORING PERFORMANCE WITH PERFORMANCE CO-PILOT
    7.1. MONITORING POSTFIX WITH PMDA-POSTFIX
    7.2. VISUALLY TRACING PCP LOG ARCHIVES WITH THE PCP CHARTS APPLICATION
    7.3. COLLECTING DATA FROM SQL SERVER USING PCP
    7.4. GENERATING PCP ARCHIVES FROM SADC ARCHIVES

CHAPTER 8. PERFORMANCE ANALYSIS OF XFS WITH PCP
    8.1. INSTALLING XFS PMDA MANUALLY
    8.2. EXAMINING XFS PERFORMANCE METRICS WITH PMINFO
    8.3. RESETTING XFS PERFORMANCE METRICS WITH PMSTORE
    8.4. PCP METRIC GROUPS FOR XFS
    8.5. PER-DEVICE PCP METRIC GROUPS FOR XFS

CHAPTER 9. SETTING UP GRAPHICAL REPRESENTATION OF PCP METRICS
    9.1. SETTING UP PCP WITH PCP-ZEROCONF
    9.2. SETTING UP A GRAFANA-SERVER

CHAPTER 10. OPTIMIZING THE SYSTEM PERFORMANCE USING THE WEB CONSOLE
    10.1. PERFORMANCE TUNING OPTIONS IN THE WEB CONSOLE
    10.2. SETTING A PERFORMANCE PROFILE IN THE WEB CONSOLE
    10.3. MONITORING PERFORMANCE ON THE LOCAL SYSTEM USING THE WEB CONSOLE
    10.4. MONITORING PERFORMANCE ON SEVERAL SYSTEMS USING THE WEB CONSOLE AND GRAFANA

CHAPTER 11. SETTING THE DISK SCHEDULER
    11.1. AVAILABLE DISK SCHEDULERS
    11.2. DIFFERENT DISK SCHEDULERS FOR DIFFERENT USE CASES
    11.3. THE DEFAULT DISK SCHEDULER
    11.4. DETERMINING THE ACTIVE DISK SCHEDULER
    11.5. SETTING THE DISK SCHEDULER USING TUNED
    11.6. SETTING THE DISK SCHEDULER USING UDEV RULES
    11.7. TEMPORARILY SETTING A SCHEDULER FOR A SPECIFIC DISK

CHAPTER 12. TUNING THE PERFORMANCE OF A SAMBA SERVER
    12.1. SETTING THE SMB PROTOCOL VERSION
    12.2. TUNING SHARES WITH DIRECTORIES THAT CONTAIN A LARGE NUMBER OF FILES
    12.3. SETTINGS THAT CAN HAVE A NEGATIVE PERFORMANCE IMPACT

CHAPTER 13. OPTIMIZING VIRTUAL MACHINE PERFORMANCE
    13.1. WHAT INFLUENCES VIRTUAL MACHINE PERFORMANCE
        The impact of virtualization on system performance
        Reducing VM performance loss
    13.2. OPTIMIZING VIRTUAL MACHINE PERFORMANCE USING TUNED
    13.3. OPTIMIZING LIBVIRT DAEMONS
        13.3.1. Types of libvirt daemons
        13.3.2. Enabling modular libvirt daemons
    13.4. CONFIGURING VIRTUAL MACHINE MEMORY
        13.4.1. Adding and removing virtual machine memory using the web console
        13.4.2. Adding and removing virtual machine memory using the command-line interface
        13.4.3. Additional resources
    13.5. OPTIMIZING VIRTUAL MACHINE I/O PERFORMANCE
        13.5.1. Tuning block I/O in virtual machines
        13.5.2. Disk I/O throttling in virtual machines
        13.5.3. Enabling multi-queue virtio-scsi
    13.6. OPTIMIZING VIRTUAL MACHINE CPU PERFORMANCE
        13.6.1. Adding and removing virtual CPUs using the command-line interface
        13.6.2. Managing virtual CPUs using the web console
        13.6.3. Configuring NUMA in a virtual machine
        13.6.4. Sample vCPU performance tuning scenario
        13.6.5. Managing kernel same-page merging

CHAPTER 14. IMPORTANCE OF POWER MANAGEMENT
    14.1. POWER MANAGEMENT BASICS
    14.2. AUDIT AND ANALYSIS OVERVIEW
    14.3. TOOLS FOR AUDITING

CHAPTER 15. MANAGING POWER CONSUMPTION WITH POWERTOP
    15.1. THE PURPOSE OF POWERTOP
    15.2. USING POWERTOP
        15.2.1. Starting PowerTOP
        15.2.2. Calibrating PowerTOP
        15.2.3. Setting the measuring interval
        15.2.4. Additional resources
    15.3. POWERTOP STATISTICS
        15.3.1. The Overview tab
        15.3.2. The Idle stats tab
        15.3.3. The Device stats tab
        15.3.4. The Tunables tab
        15.3.5. The WakeUp tab
    15.4. WHY POWERTOP DOES NOT DISPLAY FREQUENCY STATS VALUES IN SOME INSTANCES
    15.5. GENERATING AN HTML OUTPUT
    15.6. OPTIMIZING POWER CONSUMPTION
        15.6.1. Optimizing power consumption using the powertop service
        15.6.2. The powertop2tuned utility
        15.6.3. Optimizing power consumption using the powertop2tuned utility
        15.6.4. Comparison of powertop.service and powertop2tuned

CHAPTER 16. GETTING STARTED WITH PERF
    16.1. INTRODUCTION TO PERF
    16.2. INSTALLING PERF
    16.3. COMMON PERF COMMANDS

CHAPTER 17. PROFILING CPU USAGE IN REAL TIME WITH PERF TOP
    17.1. THE PURPOSE OF PERF TOP
    17.2. PROFILING CPU USAGE WITH PERF TOP
    17.3. INTERPRETATION OF PERF TOP OUTPUT
    17.4. WHY PERF DISPLAYS SOME FUNCTION NAMES AS RAW FUNCTION ADDRESSES
    17.5. ENABLING DEBUG AND SOURCE REPOSITORIES
    17.6. GETTING DEBUGINFO PACKAGES FOR AN APPLICATION OR LIBRARY USING GDB

CHAPTER 18. COUNTING EVENTS DURING PROCESS EXECUTION WITH PERF STAT
    18.1. THE PURPOSE OF PERF STAT
    18.2. COUNTING EVENTS WITH PERF STAT
    18.3. INTERPRETATION OF PERF STAT OUTPUT
    18.4. ATTACHING PERF STAT TO A RUNNING PROCESS

CHAPTER 19. RECORDING AND ANALYZING PERFORMANCE PROFILES WITH PERF
    19.1. THE PURPOSE OF PERF RECORD
    19.2. RECORDING A PERFORMANCE PROFILE WITHOUT ROOT ACCESS
    19.3. RECORDING A PERFORMANCE PROFILE WITH ROOT ACCESS
    19.4. RECORDING A PERFORMANCE PROFILE IN PER-CPU MODE
    19.5. CAPTURING CALL GRAPH DATA WITH PERF RECORD

CHAPTER 20. INVESTIGATING BUSY CPUS WITH PERF
    20.1. DISPLAYING WHICH CPU EVENTS WERE COUNTED ON WITH PERF STAT
    20.2. DISPLAYING WHICH CPU SAMPLES WERE TAKEN ON WITH PERF REPORT
    20.3. DISPLAYING SPECIFIC CPUS DURING PROFILING WITH PERF TOP
    20.4. MONITORING SPECIFIC CPUS WITH PERF RECORD AND PERF REPORT

CHAPTER 21. MONITORING APPLICATION PERFORMANCE WITH PERF
    21.1. ATTACHING PERF RECORD TO A RUNNING PROCESS
    21.2. CAPTURING CALL GRAPH DATA WITH PERF RECORD
    21.3. ANALYZING PERF.DATA WITH PERF REPORT

CHAPTER 22. CREATING UPROBES WITH PERF
    22.1. CREATING UPROBES AT THE FUNCTION LEVEL WITH PERF
    22.2. CREATING UPROBES ON LINES WITHIN A FUNCTION WITH PERF
    22.3. PERF SCRIPT OUTPUT OF DATA RECORDED OVER UPROBES

CHAPTER 23. PROFILING MEMORY ACCESSES WITH PERF MEM
    23.1. THE PURPOSE OF PERF MEM
    23.2. SAMPLING MEMORY ACCESS WITH PERF MEM
    23.3. INTERPRETATION OF PERF MEM REPORT OUTPUT

CHAPTER 24. DETECTING FALSE SHARING
    24.1. THE PURPOSE OF PERF C2C
    24.2. DETECTING CACHE-LINE CONTENTION WITH PERF C2C
    24.3. VISUALIZING A PERF.DATA FILE RECORDED WITH PERF C2C RECORD
    24.4. INTERPRETATION OF PERF C2C REPORT OUTPUT
    24.5. DETECTING FALSE SHARING WITH PERF C2C

CHAPTER 25. GETTING STARTED WITH FLAMEGRAPHS
    25.1. INSTALLING FLAMEGRAPHS
    25.2. CREATING FLAMEGRAPHS OVER THE ENTIRE SYSTEM
    25.3. CREATING FLAMEGRAPHS OVER SPECIFIC PROCESSES
    25.4. INTERPRETING FLAMEGRAPHS

CHAPTER 26. MONITORING PROCESSES FOR PERFORMANCE BOTTLENECKS USING PERF CIRCULAR BUFFERS
    26.1. CIRCULAR BUFFERS AND EVENT-SPECIFIC SNAPSHOTS WITH PERF
    26.2. COLLECTING SPECIFIC DATA TO MONITOR FOR PERFORMANCE BOTTLENECKS USING PERF CIRCULAR BUFFERS

CHAPTER 27. ADDING AND REMOVING TRACEPOINTS FROM A RUNNING PERF COLLECTOR WITHOUT STOPPING OR RESTARTING PERF
    27.1. ADDING TRACEPOINTS TO A RUNNING PERF COLLECTOR WITHOUT STOPPING OR RESTARTING PERF
    27.2. REMOVING TRACEPOINTS FROM A RUNNING PERF COLLECTOR WITHOUT STOPPING OR RESTARTING PERF

CHAPTER 28. PROFILING MEMORY ALLOCATION WITH NUMASTAT

CHAPTER 29. CONFIGURING AN OPERATING SYSTEM TO OPTIMIZE CPU UTILIZATION
    29.1. TOOLS FOR MONITORING AND DIAGNOSING PROCESSOR ISSUES
    29.2. TYPES OF SYSTEM TOPOLOGY
        29.2.1. Displaying system topologies
    29.3. CONFIGURING KERNEL TICK TIME
    29.4. OVERVIEW OF AN INTERRUPT REQUEST
        29.4.1. Balancing interrupts manually
        29.4.2. Setting the smp_affinity mask

CHAPTER 30. TUNING SCHEDULING POLICY
    30.1. CATEGORIES OF SCHEDULING POLICIES
    30.2. STATIC PRIORITY SCHEDULING WITH SCHED_FIFO
    30.3. ROUND ROBIN PRIORITY SCHEDULING WITH SCHED_RR
    30.4. NORMAL SCHEDULING WITH SCHED_OTHER
    30.5. SETTING SCHEDULER POLICIES
    30.6. POLICY OPTIONS FOR THE CHRT COMMAND
    30.7. CHANGING THE PRIORITY OF SERVICES DURING THE BOOT PROCESS
    30.8. PRIORITY MAP
    30.9. TUNED CPU-PARTITIONING PROFILE
    30.10. USING THE TUNED CPU-PARTITIONING PROFILE FOR LOW-LATENCY TUNING
    30.11. CUSTOMIZING THE CPU-PARTITIONING TUNED PROFILE

CHAPTER 31. CONFIGURING AN OPERATING SYSTEM TO OPTIMIZE ACCESS TO NETWORK RESOURCES
    31.1. TOOLS FOR MONITORING AND DIAGNOSING PERFORMANCE ISSUES
    31.2. BOTTLENECKS IN A PACKET RECEPTION
    31.3. BUSY POLLING
        31.3.1. Enabling busy polling
    31.4. RECEIVE-SIDE SCALING
        31.4.1. Viewing the interrupt request queues
    31.5. RECEIVE PACKET STEERING
    31.6. RECEIVE FLOW STEERING
        31.6.1. Enabling Receive Flow Steering
    31.7. ACCELERATED RFS
        31.7.1. Enabling the ntuple filters

CHAPTER 32. FACTORS AFFECTING I/O AND FILE SYSTEM PERFORMANCE
    32.1. TOOLS FOR MONITORING AND DIAGNOSING I/O AND FILE SYSTEM ISSUES
    32.2. AVAILABLE TUNING OPTIONS FOR FORMATTING A FILE SYSTEM
    32.3. AVAILABLE TUNING OPTIONS FOR MOUNTING A FILE SYSTEM
    32.4. TYPES OF DISCARDING UNUSED BLOCKS
    32.5. SOLID-STATE DISKS TUNING CONSIDERATIONS
    32.6. GENERIC BLOCK DEVICE TUNING PARAMETERS

CHAPTER 33. USING SYSTEMD TO MANAGE RESOURCES USED BY APPLICATIONS
    33.1. ALLOCATING SYSTEM RESOURCES USING SYSTEMD
    33.2. ROLE OF SYSTEMD IN RESOURCE MANAGEMENT
    33.3. OVERVIEW OF SYSTEMD HIERARCHY FOR CGROUPS
    33.4. LISTING SYSTEMD UNITS
    33.5. VIEWING SYSTEMD CONTROL GROUP HIERARCHY
    33.6. VIEWING CGROUPS OF PROCESSES

CHAPTER 34. UNDERSTANDING CGROUPS
    34.1. UNDERSTANDING CONTROL GROUPS
    34.2. WHAT ARE KERNEL RESOURCE CONTROLLERS
    34.3. WHAT ARE NAMESPACES

CHAPTER 35. IMPROVING SYSTEM PERFORMANCE WITH ZSWAP
    35.1. WHAT IS ZSWAP
    35.2. ENABLING ZSWAP AT RUNTIME
    35.3. ENABLING ZSWAP PERMANENTLY

CHAPTER 36. USING CGROUPFS TO MANUALLY MANAGE CGROUPS
    36.1. CREATING CGROUPS AND ENABLING CONTROLLERS IN CGROUPS-V2 FILE SYSTEM
    36.2. CONTROLLING DISTRIBUTION OF CPU TIME FOR APPLICATIONS BY ADJUSTING CPU WEIGHT
    36.3. MOUNTING CGROUPS-V1
    36.4. SETTING CPU LIMITS TO APPLICATIONS USING CGROUPS-V1

CHAPTER 37. ANALYZING SYSTEM PERFORMANCE WITH BPF COMPILER COLLECTION
    37.1. AN INTRODUCTION TO BCC
    37.2. INSTALLING THE BCC-TOOLS PACKAGE
    37.3. USING SELECTED BCC-TOOLS FOR PERFORMANCE ANALYSES
        Using execsnoop to examine the system processes
        Using opensnoop to track what files a command opens
        Using biotop to examine the I/O operations on the disk
        Using xfsslower to expose unexpectedly slow file system operations

CHAPTER 38. CONFIGURING AN OPERATING SYSTEM TO OPTIMIZE MEMORY ACCESS
    38.1. TOOLS FOR MONITORING AND DIAGNOSING SYSTEM MEMORY ISSUES
    38.2. OVERVIEW OF A SYSTEM’S MEMORY
    38.3. VIRTUAL MEMORY PARAMETERS
    38.4. FILE SYSTEM PARAMETERS
    38.5. KERNEL PARAMETERS
    38.6. SETTING MEMORY-RELATED KERNEL PARAMETERS

CHAPTER 39. CONFIGURING HUGE PAGES
    39.1. AVAILABLE HUGE PAGE FEATURES
    39.2. PARAMETERS FOR RESERVING HUGETLB PAGES AT BOOT TIME
    39.3. CONFIGURING HUGETLB AT BOOT TIME
    39.4. PARAMETERS FOR RESERVING HUGETLB PAGES AT RUN TIME
    39.5. CONFIGURING HUGETLB AT RUN TIME
    39.6. ENABLING TRANSPARENT HUGEPAGES
    39.7. DISABLING TRANSPARENT HUGEPAGES
    39.8. IMPACT OF PAGE SIZE ON TRANSLATION LOOKASIDE BUFFER SIZE

CHAPTER 40. GETTING STARTED WITH SYSTEMTAP
    40.1. THE PURPOSE OF SYSTEMTAP
    40.2. INSTALLING SYSTEMTAP

CHAPTER 41. CROSS-INSTRUMENTATION OF SYSTEMTAP
    41.1. SYSTEMTAP CROSS-INSTRUMENTATION
    41.2. INITIALIZING CROSS-INSTRUMENTATION OF SYSTEMTAP

CHAPTER 42. MONITORING NETWORK ACTIVITY WITH SYSTEMTAP
    42.1. PROFILING NETWORK ACTIVITY WITH SYSTEMTAP
    42.2. TRACING FUNCTIONS CALLED IN NETWORK SOCKET CODE WITH SYSTEMTAP
    42.3. MONITORING NETWORK PACKET DROPS WITH SYSTEMTAP

CHAPTER 43. PROFILING KERNEL ACTIVITY WITH SYSTEMTAP
    43.1. COUNTING FUNCTION CALLS WITH SYSTEMTAP
    43.2. TRACING FUNCTION CALLS WITH SYSTEMTAP
    43.3. DETERMINING TIME SPENT IN KERNEL AND USER SPACE WITH SYSTEMTAP
    43.4. MONITORING POLLING APPLICATIONS WITH SYSTEMTAP
    43.5. TRACKING MOST FREQUENTLY USED SYSTEM CALLS WITH SYSTEMTAP
    43.6. TRACKING SYSTEM CALL VOLUME PER PROCESS WITH SYSTEMTAP

CHAPTER 44. MONITORING DISK AND I/O ACTIVITY WITH SYSTEMTAP
    44.1. SUMMARIZING DISK READ/WRITE TRAFFIC WITH SYSTEMTAP
    44.2. TRACKING I/O TIME FOR EACH FILE READ OR WRITE WITH SYSTEMTAP
    44.3. TRACKING CUMULATIVE I/O WITH SYSTEMTAP
    44.4. MONITORING I/O ACTIVITY ON A SPECIFIC DEVICE WITH SYSTEMTAP
    44.5. MONITORING READS AND WRITES TO A FILE WITH SYSTEMTAP
PROVIDING FEEDBACK ON RED HAT DOCUMENTATION
1. View the documentation in the Multi-page HTML format and ensure that you see the
Feedback button in the upper right corner after the page fully loads.
2. Use your cursor to highlight the part of the text that you want to comment on.
3. Click the Add Feedback button that appears near the highlighted text.
4. Enter your suggestion for improvement in the Description field. Include links to the relevant
parts of the documentation.
TuneD is distributed with a number of predefined profiles for use cases such as:
High throughput
Low latency
Saving power
It is possible to modify the rules defined for each profile and customize how to tune a particular device.
When you switch to another profile or deactivate TuneD, all changes made to the system settings by the previous profile revert to their original state.
You can also configure TuneD to react to changes in device usage and adjust settings to improve the performance of active devices and reduce the power consumption of inactive devices.
The profiles provided with TuneD are divided into the following categories:
Power-saving profiles
Performance-boosting profiles
The performance-boosting profiles include profiles that focus on the following aspects:
If there are conflicts, the settings from the last specified profile take precedence.
The following example optimizes the system to run in a virtual machine for the best performance and concurrently tunes it for low power consumption, with low power consumption as the priority:
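The command itself is not reproduced in this excerpt; assuming the standard tuned-adm syntax for merging profiles, it would look like:
# tuned-adm profile virtual-guest powersave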
/usr/lib/tuned/
Distribution-specific profiles are stored in this directory. Each profile has its own directory. The profile consists of the main configuration file called tuned.conf, and optionally other files, for example helper scripts.
/etc/tuned/
If you need to customize a profile, copy the profile directory into this directory, which is used for custom profiles. If there are two profiles of the same name, the custom profile located in /etc/tuned/ is used.
balanced
The default power-saving profile. It is intended to be a compromise between performance and power
consumption. It uses auto-scaling and auto-tuning whenever possible. The only drawback is the
increased latency. In the current TuneD release, it enables the CPU, disk, audio, and video plugins,
and activates the conservative CPU governor. The radeon_powersave option uses the dpm-
balanced value if it is supported, otherwise it is set to auto.
It changes the energy_performance_preference attribute to the normal energy setting. It also
changes the scaling_governor policy attribute to either the conservative or powersave CPU
governor.
powersave
A profile for maximum power saving. It can throttle the performance in order to minimize the actual power consumption. In the current TuneD release it enables USB autosuspend, WiFi power saving, and Aggressive Link Power Management (ALPM) power savings for SATA host adapters. It also schedules multi-core power savings for systems with a low wakeup rate and activates the ondemand governor. It enables AC97 audio power saving or, depending on your system, HDA-Intel power savings with a 10-second timeout. If your system contains a supported Radeon graphics card with KMS enabled, the profile configures it to automatic power saving. On ASUS Eee PCs, a dynamic Super Hybrid Engine is enabled.
It changes the energy_performance_preference attribute to the powersave or power energy
setting. It also changes the scaling_governor policy attribute to either the ondemand or
powersave CPU governor.
NOTE
In certain cases, the balanced profile is more efficient compared to the powersave
profile.
Consider there is a defined amount of work that needs to be done, for example a video file that needs to be transcoded. Your machine might consume less energy if the transcoding is done at full power, because the task is finished quickly, the machine starts to idle, and it can automatically step down to very efficient power save modes. On the other hand, if you transcode the file with a throttled machine, the machine consumes less power during the transcoding, but the process takes longer and the overall consumed energy can be higher.
throughput-performance
A server profile optimized for high throughput. It disables power saving mechanisms and enables sysctl settings that improve the throughput performance of disk and network I/O. The CPU governor is set to performance.
It changes the energy_performance_preference and scaling_governor attributes to the performance profile.
accelerator-performance
The accelerator-performance profile contains the same tuning as the throughput-performance
profile. Additionally, it locks the CPU to low C states so that the latency is less than 100us. This
improves the performance of certain accelerators, such as GPUs.
latency-performance
A server profile optimized for low latency. It disables power saving mechanisms and enables sysctl settings that improve latency. The CPU governor is set to performance and the CPU is locked to the low C states (by PM QoS).
It changes the energy_performance_preference and scaling_governor attributes to the performance profile.
network-latency
A profile for low latency network tuning. It is based on the latency-performance profile. It
additionally disables transparent huge pages and NUMA balancing, and tunes several other network-
related sysctl parameters.
It inherits the latency-performance profile, which changes the energy_performance_preference and scaling_governor attributes to the performance profile.
hpc-compute
A profile optimized for high-performance computing. It is based on the latency-performance
profile.
network-throughput
A profile for throughput network tuning. It is based on the throughput-performance profile. It
additionally increases kernel network buffers.
It inherits either the latency-performance or throughput-performance profile, and changes the energy_performance_preference and scaling_governor attributes to the performance profile.
virtual-guest
A profile designed for Red Hat Enterprise Linux 9 virtual machines and VMware guests based on the throughput-performance profile that, among other tasks, decreases virtual memory swappiness and increases disk readahead values. It does not disable disk barriers.
It inherits the throughput-performance profile and changes the energy_performance_preference and scaling_governor attributes to the performance profile.
virtual-host
A profile designed for virtual hosts based on the throughput-performance profile that, among other tasks, decreases virtual memory swappiness, increases disk readahead values, and sets a more aggressive value for dirty page writeback.
It inherits the throughput-performance profile and changes the energy_performance_preference and scaling_governor attributes to the performance profile.
oracle
A profile optimized for Oracle database loads, based on the throughput-performance profile. It additionally disables transparent huge pages and modifies other performance-related kernel parameters. This profile is provided by the tuned-profiles-oracle package.
desktop
A profile optimized for desktops, based on the balanced profile. It additionally enables scheduler
autogroups for better response of interactive applications.
optimize-serial-console
A profile that tunes down I/O activity to the serial console by reducing the printk value. This should
make the serial console more responsive. This profile is intended to be used as an overlay on other
profiles. For example:
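The example command itself is not included in this excerpt; assuming the usual syntax for applying an overlay profile on top of another one, it could look like:
# tuned-adm profile throughput-performance optimize-serial-console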
mssql
A profile provided for Microsoft SQL Server. It is based on the throughput-performance profile.
intel-sst
A profile optimized for systems with user-defined Intel Speed Select Technology configurations. This
profile is intended to be used as an overlay on other profiles. For example:
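Again, the example is not reproduced here; a hypothetical overlay invocation, assuming the same syntax, might be:
# tuned-adm profile cpu-partitioning intel-sst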
Prior to Red Hat Enterprise Linux 9, the low-latency Red Hat documentation described the numerous
low-level steps needed to achieve low-latency tuning. In Red Hat Enterprise Linux 9, you can perform
low-latency tuning more efficiently by using the cpu-partitioning TuneD profile. This profile is easily
customizable according to the requirements for individual low-latency applications.
The following figure is an example that demonstrates how to use the cpu-partitioning profile. This example uses the CPU and node layout.
The list of isolated CPUs is comma-separated or you can specify a range using a dash, such as 3-5.
This option is mandatory. Any CPU missing from this list is automatically considered a housekeeping
CPU.
Specifying the no_balance_cores option is optional; however, any CPUs in this list must be a subset of the CPUs listed in isolated_cores.
Application threads using these CPUs need to be pinned individually to each CPU.
Housekeeping CPUs
Any CPU not isolated in the cpu-partitioning-variables.conf file is automatically considered a
housekeeping CPU. On the housekeeping CPUs, all services, daemons, user processes, movable
kernel threads, interrupt handlers, and kernel timers are permitted to execute.
This procedure describes how to tune a system for low latency using the TuneD cpu-partitioning profile. It uses the example of a low-latency application that can use cpu-partitioning and the CPU layout as mentioned in the cpu-partitioning figure.
One dedicated reader thread that reads data from the network will be pinned to CPU 2.
A large number of threads that process this network data will be pinned to CPUs 4-23.
A dedicated writer thread that writes the processed data to the network will be pinned to CPU
3.
Prerequisites
You have installed the cpu-partitioning TuneD profile by using the dnf install tuned-profiles-
cpu-partitioning command as root.
Procedure
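Steps 1 and 2 of this procedure are not reproduced in this excerpt. A minimal sketch of what they typically configure, assuming the standard cpu-partitioning variables file, is:
1. In /etc/tuned/cpu-partitioning-variables.conf, list the CPUs to isolate, for example:
isolated_cores=2-23
2. Apply the profile:
# tuned-adm profile cpu-partitioning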
3. Reboot
After rebooting, the system is tuned for low-latency, according to the isolation in the cpu-
partitioning figure. The application can use taskset to pin the reader and writer threads to CPUs
2 and 3, and the remaining application threads on CPUs 4-23.
For example, the cpu-partitioning profile sets the CPUs to use cstate=1. To use the cpu-partitioning profile but additionally change the CPU cstate from cstate1 to cstate0, the following procedure describes a new TuneD profile named my_profile, which inherits the cpu-partitioning profile and then sets C state 0.
Procedure
1. Create the my_profile directory:
# mkdir /etc/tuned/my_profile
2. Create a tuned.conf file in this directory, and add the following content:
# vi /etc/tuned/my_profile/tuned.conf
[main]
summary=Customized tuning on top of cpu-partitioning
include=cpu-partitioning
[cpu]
force_latency=cstate.id:0|1
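The step that activates the new profile is not shown in this excerpt; assuming the standard tuned-adm workflow, it would be:
# tuned-adm profile my_profile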
NOTE
In the shared example, a reboot is not required. However, if the changes in the my_profile
profile require a reboot to take effect, then reboot your machine.
realtime
Use on bare-metal real-time systems.
Provided by the tuned-profiles-realtime package, which is available from the RT or NFV repositories.
realtime-virtual-host
Use in a virtualization host configured for real-time.
Provided by the tuned-profiles-nfv-host package, which is available from the NFV repository.
realtime-virtual-guest
Use in a virtualization guest configured for real-time.
Provided by the tuned-profiles-nfv-guest package, which is available from the NFV repository.
Static tuning
Mainly consists of the application of predefined sysctl and sysfs settings and one-shot activation of
several configuration tools such as ethtool.
Dynamic tuning
Watches how various system components are used throughout the uptime of your system. TuneD
adjusts system settings dynamically based on that monitoring information.
For example, the hard drive is used heavily during startup and login, but is barely used later when the
user might mainly work with applications such as web browsers or email clients. Similarly, the CPU
and network devices are used differently at different times. TuneD monitors the activity of these
components and reacts to the changes in their use.
By default, dynamic tuning is disabled. To enable it, edit the /etc/tuned/tuned-main.conf file and
change the dynamic_tuning option to 1. TuneD then periodically analyzes system statistics and
uses them to update your system tuning settings. To configure the time interval in seconds between
these updates, use the update_interval option.
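For illustration, a minimal sketch of the relevant lines in /etc/tuned/tuned-main.conf (the surrounding options in your file may differ):
# Enable dynamic tuning and re-evaluate system statistics every 10 seconds.
dynamic_tuning = 1
update_interval = 10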
Currently implemented dynamic tuning algorithms try to balance the performance and powersave,
and are therefore disabled in the performance profiles. Dynamic tuning for individual plug-ins can be
enabled or disabled in the TuneD profiles.
On a typical office workstation, the Ethernet network interface is inactive most of the time. Only a
few emails go in and out or some web pages might be loaded.
For those kinds of loads, the network interface does not have to run at full speed all the time, as it
does by default. TuneD has a monitoring and tuning plug-in for network devices that can detect this
low activity and then automatically lower the speed of that interface, typically resulting in a lower
power usage.
If the activity on the interface increases for a longer period of time, for example because a DVD
image is being downloaded or an email with a large attachment is opened, TuneD detects this and
sets the interface speed to maximum to offer the best performance while the activity level is high.
This principle is used for other plug-ins for CPU and disks as well.
By default, no-daemon mode is disabled because a lot of TuneD functionality is missing in this mode,
including:
D-Bus support
Hot-plug support
To enable no-daemon mode, include the following line in the /etc/tuned/tuned-main.conf file:
daemon = 0
This procedure installs and enables the TuneD application, installs TuneD profiles, and presets a default
TuneD profile for your system.
Procedure
Install the tuned package:
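The installation and enablement commands are not included in this excerpt; on RHEL 9 they are typically:
# dnf install tuned
# systemctl enable --now tuned
Then verify that the TuneD service is running and which profile is active: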
$ tuned-adm active
NOTE
The active profile TuneD automatically presets differs based on your machine
type and system settings.
Verify that the profile settings were applied to your system:
$ tuned-adm verify
Procedure
$ tuned-adm list
Available profiles:
- accelerator-performance - Throughput performance based tuning with disabled higher
latency STOP states
$ tuned-adm active
Prerequisites
The TuneD service is running. See Installing and Enabling TuneD for details.
Procedure
1. Optionally, you can let TuneD recommend the most suitable profile for your system:
# tuned-adm recommend
throughput-performance
2. Activate a profile:
The following example optimizes the system to run in a virtual machine with the best performance and concurrently tunes it for low power consumption, with low power consumption as the priority:
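The command itself is missing from this excerpt; assuming the same merged-profile syntax as earlier, it would be:
# tuned-adm profile virtual-guest powersave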
# tuned-adm active
# reboot
Verification steps
$ tuned-adm verify
Procedure
To disable all tunings temporarily:
# tuned-adm off
The tunings are applied again after the TuneD service restarts.
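To keep the tunings disabled across service restarts, you would typically also stop and disable the service itself (this step is not part of the excerpt):
# systemctl disable --now tuned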
Prerequisites
Install and enable TuneD as described in Installing and enabling TuneD.
The profiles provided with TuneD are divided into the following categories:
Power-saving profiles
Performance-boosting profiles
The performance-boosting profiles include profiles that focus on the following aspects:
If there are conflicts, the settings from the last specified profile take precedence.
The following example optimizes the system to run in a virtual machine for the best performance and concurrently tunes it for low power consumption, with low power consumption as the priority:
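The command is not reproduced in this excerpt; assuming the standard tuned-adm syntax for merging profiles, it would be:
# tuned-adm profile virtual-guest powersave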
/usr/lib/tuned/
Distribution-specific profiles are stored in this directory. Each profile has its own directory. The profile consists of the main configuration file called tuned.conf, and optionally other files, for example helper scripts.
/etc/tuned/
If you need to customize a profile, copy the profile directory into this directory, which is used for custom profiles. If there are two profiles of the same name, the custom profile located in /etc/tuned/ is used.
[main]
include=parent
All settings from the parent profile are loaded in this child profile. In the following sections, the child
profile can override certain settings inherited from the parent profile or add new settings not present in
the parent profile.
You can create your own child profile in the /etc/tuned/ directory based on a pre-installed profile in
/usr/lib/tuned/ with only some parameters adjusted.
If the parent profile is updated, such as after a TuneD upgrade, the changes are reflected in the child
profile.
The following is an example of a custom profile that extends the balanced profile and sets Aggressive Link Power Management (ALPM) for all devices to the maximum power saving.
[main]
include=balanced
[scsi_host]
alpm=min_power
Static tuning
Mainly consists of the application of predefined sysctl and sysfs settings and one-shot activation of
several configuration tools such as ethtool.
Dynamic tuning
Watches how various system components are used throughout the uptime of your system. TuneD
adjusts system settings dynamically based on that monitoring information.
For example, the hard drive is used heavily during startup and login, but is barely used later when the
user might mainly work with applications such as web browsers or email clients. Similarly, the CPU
and network devices are used differently at different times. TuneD monitors the activity of these
components and reacts to the changes in their use.
By default, dynamic tuning is disabled. To enable it, edit the /etc/tuned/tuned-main.conf file and
change the dynamic_tuning option to 1. TuneD then periodically analyzes system statistics and
uses them to update your system tuning settings. To configure the time interval in seconds between
these updates, use the update_interval option.
Currently implemented dynamic tuning algorithms try to balance the performance and powersave,
and are therefore disabled in the performance profiles. Dynamic tuning for individual plug-ins can be
enabled or disabled in the TuneD profiles.
On a typical office workstation, the Ethernet network interface is inactive most of the time. Only a
few emails go in and out or some web pages might be loaded.
For those kinds of loads, the network interface does not have to run at full speed all the time, as it
does by default. TuneD has a monitoring and tuning plug-in for network devices that can detect this
low activity and then automatically lower the speed of that interface, typically resulting in a lower
power usage.
If the activity on the interface increases for a longer period of time, for example because a DVD
image is being downloaded or an email with a large attachment is opened, TuneD detects this and
sets the interface speed to maximum to offer the best performance while the activity level is high.
This principle is used for other plug-ins for CPU and disks as well.
Monitoring plug-ins
Monitoring plug-ins are used to get information from a running system. The output of the monitoring
plug-ins can be used by tuning plug-ins for dynamic tuning.
Monitoring plug-ins are automatically instantiated whenever their metrics are needed by any of the
enabled tuning plug-ins. If two tuning plug-ins require the same data, only one instance of the
monitoring plug-in is created and the data is shared.
Tuning plug-ins
Each tuning plug-in tunes an individual subsystem and takes several parameters that are populated
from the TuneD profiles. Each subsystem can have multiple devices, such as multiple CPUs or
network cards, that are handled by individual instances of the tuning plug-ins. Specific settings for
individual devices are also supported.
[NAME]
type=TYPE
devices=DEVICES
NAME
is the name of the plug-in instance as it is used in the logs. It can be an arbitrary string.
TYPE
is the type of the tuning plug-in.
DEVICES
is the list of devices that this plug-in instance handles.
The devices line can contain a list, a wildcard (*), and negation (!). If there is no devices line, all devices of the TYPE that are present or later attached to the system are handled by the plug-in instance. This is the same as using the devices=* option.
The following example matches all block devices starting with sd, such as sda or sdb, and does
not disable barriers on them:
[data_disk]
type=disk
devices=sd*
disable_barriers=false
The following example matches all block devices except sda1 and sda2:
[data_disk]
type=disk
devices=!sda1, !sda2
disable_barriers=false
If the plug-in supports more options, they can be also specified in the plug-in section. If the option is not
specified and it was not previously specified in the included plug-in, the default value is used.
[TYPE]
devices=DEVICES
In this case, it is possible to omit the type line. The instance is then referred to by a name that is the same as the type. The previous example could then be rewritten as:
[disk]
devices=sdb*
disable_barriers=false
You can also disable the plug-in by specifying the enabled=false option. This has the same effect as if
the instance was never defined. Disabling the plug-in is useful if you are redefining the previous
definition from the include option and do not want the plug-in to be active in your custom profile.
NOTE
TuneD includes the ability to run any shell command as part of enabling or disabling a tuning profile.
This enables you to extend TuneD profiles with functionality that has not been integrated into TuneD
yet.
You can specify arbitrary shell commands using the script plug-in.
Monitoring plug-ins
Currently, the following monitoring plug-ins are implemented:
disk
Gets disk load (number of IO operations) per device and measurement interval.
net
Gets network load (number of transferred packets) per network card and measurement interval.
load
Gets CPU load per CPU and measurement interval.
Tuning plug-ins
Currently, the following tuning plug-ins are implemented. Only some of these plug-ins implement
dynamic tuning. Options supported by plug-ins are also listed:
cpu
Sets the CPU governor to the value specified by the governor option and dynamically changes the
Power Management Quality of Service (PM QoS) CPU Direct Memory Access (DMA) latency
according to the CPU load.
If the CPU load is lower than the value specified by the load_threshold option, the latency is set to
the value specified by the latency_high option, otherwise it is set to the value specified by
latency_low.
You can also force the latency to a specific value and prevent it from dynamically changing further.
To do so, set the force_latency option to the required latency value.
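For illustration, a hypothetical profile section using these options (the values are examples only, not recommendations):
[cpu]
governor=performance
force_latency=1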
eeepc_she
Dynamically sets the front-side bus (FSB) speed according to the CPU load.
This feature can be found on some netbooks and is also known as the ASUS Super Hybrid Engine
(SHE).
If the CPU load is lower or equal to the value specified by the load_threshold_powersave option,
the plug-in sets the FSB speed to the value specified by the she_powersave option. If the CPU load
is higher or equal to the value specified by the load_threshold_normal option, it sets the FSB speed
to the value specified by the she_normal option.
Static tuning is not supported and the plug-in is transparently disabled if TuneD does not detect the
hardware support for this feature.
net
Configures the Wake-on-LAN functionality to the values specified by the wake_on_lan option. It
uses the same syntax as the ethtool utility. It also dynamically changes the interface speed according
to the interface utilization.
sysctl
Sets various sysctl settings specified by the plug-in options.
The syntax is name=value, where name is the same as the name provided by the sysctl utility.
Use the sysctl plug-in if you need to change system settings that are not covered by other plug-ins
available in TuneD. If the settings are covered by some specific plug-ins, prefer these plug-ins.
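For example, a hypothetical profile section using the sysctl plug-in (the parameters and values are illustrative only):
[sysctl]
net.ipv4.tcp_fin_timeout=30
vm.swappiness=10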
usb
Sets autosuspend timeout of USB devices to the value specified by the autosuspend parameter.
The value 0 means that autosuspend is disabled.
vm
Enables or disables transparent huge pages depending on the value of the transparent_hugepages
option.
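A sketch of how this could appear in a profile (illustrative only; valid values are listed below):
[vm]
transparent_hugepages=never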
Valid values of the transparent_hugepages option are:
"always"
"never"
"madvise"
audio
Sets the autosuspend timeout for audio codecs to the value specified by the timeout option.
Currently, the snd_hda_intel and snd_ac97_codec codecs are supported. The value 0 means that
the autosuspend is disabled. You can also enforce the controller reset by setting the Boolean option
reset_controller to true.
disk
Sets the disk elevator to the value specified by the elevator option.
It also sets:
The current disk readahead to a value multiplied by the constant specified by the
readahead_multiply option
In addition, this plug-in dynamically changes the advanced power management and spindown
timeout setting for the drive according to the current drive utilization. The dynamic tuning can be
controlled by the Boolean option dynamic and is enabled by default.
scsi_host
Tunes options for SCSI hosts.
It sets Aggressive Link Power Management (ALPM) to the value specified by the alpm option.
mounts
Enables or disables barriers for mounts according to the Boolean value of the disable_barriers
option.
script
Executes an external script or binary when the profile is loaded or unloaded. You can choose an
arbitrary executable.
IMPORTANT
The script plug-in is provided mainly for compatibility with earlier releases. Prefer
other TuneD plug-ins if they cover the required functionality.
You need to correctly implement the stop action in your executable and revert all settings that you
changed during the start action. Otherwise, the roll-back step after changing your TuneD profile will
not work.
Bash scripts can import the /usr/lib/tuned/functions Bash library and use the functions defined
there. Use these functions only for functionality that is not natively provided by TuneD. If a function
name starts with an underscore, such as _wifi_set_power_level, consider the function private and do
not use it in your scripts, because it might change in the future.
Specify the path to the executable using the script parameter in the plug-in configuration.
To run a Bash script named script.sh that is located in the profile directory, use:
[script]
script=${i:PROFILE_DIR}/script.sh
sysfs
Sets various sysfs settings specified by the plug-in options.
The syntax is name=value, where name is the sysfs path to use.
Use this plug-in if you need to change settings that are not covered by other plug-ins. Prefer
specific plug-ins if they cover the required settings.
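An illustrative section showing the name=value syntax; the path and value are examples only:
[sysfs]
/sys/kernel/mm/ksm/run=0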
video
Sets various powersave levels on video cards. Currently, only the Radeon cards are supported.
The powersave level can be specified by using the radeon_powersave option. Supported values are:
default
auto
low
mid
high
dynpm
dpm-battery
dpm-balanced
dpm-performance
For details, see www.x.org. Note that this plug-in is experimental and the option might change in
future releases.
bootloader
Adds options to the kernel command line. This plug-in supports only the GRUB 2 boot loader.
Customized non-standard location of the GRUB 2 configuration file can be specified by the
grub2_cfg_file option.
The kernel options are added to the current GRUB configuration and its templates. The system
needs to be rebooted for the kernel options to take effect.
Switching to another profile or manually stopping the TuneD service removes the additional options.
If you shut down or reboot the system, the kernel options persist in the grub.cfg file.
For example, to add the quiet kernel option to a TuneD profile, include the following lines in the
tuned.conf file:
[bootloader]
cmdline=quiet
The following is an example of a custom profile that adds the isolcpus=2 option to the kernel
command line:
[bootloader]
cmdline=isolcpus=2
Using TuneD variables reduces the amount of necessary typing in TuneD profiles.
There are no predefined variables in TuneD profiles. You can define your own variables by creating the
[variables] section in a profile and using the following syntax:
[variables]
variable_name=value
${variable_name}
In the following example, the ${isolated_cores} variable expands to 1,2; hence the kernel boots with
the isolcpus=1,2 option:
[variables]
isolated_cores=1,2
[bootloader]
cmdline=isolcpus=${isolated_cores}
The variables can be specified in a separate file. For example, you can add the following lines to
tuned.conf:
[variables]
include=/etc/tuned/my-variables.conf
[bootloader]
cmdline=isolcpus=${isolated_cores}
If you add the isolated_cores=1,2 option to the /etc/tuned/my-variables.conf file, the kernel boots
with the isolcpus=1,2 option.
Additional resources
You can:
Create custom functions in Python and add them to TuneD in the form of plug-ins
${f:function_name:argument_1:argument_2}
To expand the directory path where the profile and the tuned.conf file are located, use the
PROFILE_DIR function, which requires special syntax:
${i:PROFILE_DIR}
Example 2.9. Isolating CPU cores using variables and built-in functions
In the following example, the ${non_isolated_cores} variable expands to 0,3-5, and the
cpulist_invert built-in function is called with the 0,3-5 argument:
[variables]
non_isolated_cores=0,3-5
[bootloader]
cmdline=isolcpus=${f:cpulist_invert:${non_isolated_cores}}
The cpulist_invert function inverts the list of CPUs. For a 6-CPU machine, the inversion is 1,2, and
the kernel boots with the isolcpus=1,2 command-line option.
Additional resources
PROFILE_DIR
Returns the directory path where the profile and the tuned.conf file are located.
exec
Executes a process and returns its output.
assertion
Compares two arguments. If they do not match, the function logs text from the first argument and
aborts profile loading.
assertion_non_equal
Compares two arguments. If they match, the function logs text from the first argument and aborts
profile loading.
kb2s
Converts kilobytes to disk sectors.
s2kb
Converts disk sectors to kilobytes.
strip
Creates a string from all passed arguments and deletes both leading and trailing white space.
virt_check
Checks whether TuneD is running inside a virtual machine (VM) or on bare metal:
In a virtual machine, the function returns the first argument.
On bare metal, the function returns the second argument, even in case of an error.
cpulist_invert
Inverts a list of CPUs to make its complement. For example, on a system with 4 CPUs, numbered
from 0 to 3, the inversion of the list 0,2,3 is 1.
cpulist2hex
Converts a CPU list to a hexadecimal CPU mask.
cpulist2hex_invert
Converts a CPU list to a hexadecimal CPU mask and inverts it.
hex2cpulist
Converts a hexadecimal CPU mask to a CPU list.
cpulist_online
Checks whether the CPUs from the list are online. Returns the list containing only online CPUs.
cpulist_present
Checks whether the CPUs from the list are present. Returns the list containing only present CPUs.
cpulist_unpack
Unpacks a CPU list in the form of 1-3,4 to 1,2,3,4.
cpulist_pack
Packs a CPU list in the form of 1,2,3,5 to 1-3,5.
Prerequisites
The TuneD service is running. See Installing and Enabling TuneD for details.
Procedure
1. In the /etc/tuned/ directory, create a new directory named the same as the profile that you want
to create:
# mkdir /etc/tuned/my-profile
2. In the new directory, create a file named tuned.conf. Add a [main] section and plug-in
definitions in it, according to your requirements.
For example, see the configuration of the balanced profile:
[main]
summary=General non-specialized TuneD profile
[cpu]
governor=conservative
energy_perf_bias=normal
[audio]
timeout=10
[video]
radeon_powersave=dpm-balanced, auto
[scsi_host]
alpm=medium_power
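A new profile is typically activated with the tuned-adm utility before the verification step below, for
example:
# tuned-adm profile my-profile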
4. Verify that the TuneD profile is active and the system settings are applied:
$ tuned-adm active
$ tuned-adm verify
Additional resources
Prerequisites
The TuneD service is running. See Installing and Enabling TuneD for details.
Procedure
1. In the /etc/tuned/ directory, create a new directory named the same as the profile that you want
to create:
# mkdir /etc/tuned/modified-profile
2. In the new directory, create a file named tuned.conf, and set the [main] section as follows:
[main]
include=parent-profile
Replace parent-profile with the name of the profile you are modifying.
To use the settings from the throughput-performance profile and change the value of
vm.swappiness to 5, instead of the default 10, use:
[main]
include=throughput-performance
[sysctl]
vm.swappiness=5
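As with a new profile, the modified profile is typically activated with tuned-adm before verification, for
example:
# tuned-adm profile modified-profile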
5. Verify that the TuneD profile is active and the system settings are applied:
$ tuned-adm active
$ tuned-adm verify
Additional resources
device with the name of the block device, for example sdf
selected-scheduler with the disk scheduler that you want to set for the device, for example bfq
Prerequisites
The TuneD service is installed and enabled. For details, see Installing and enabling TuneD .
Procedure
1. Optional: Select an existing TuneD profile on which your profile will be based. For a list of
available profiles, see TuneD profiles distributed with RHEL .
To see which profile is currently active, use:
$ tuned-adm active
# mkdir /etc/tuned/my-profile
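As an illustration, the udev properties of the example device sdf can be queried and filtered for WWN
and serial values with a command similar to the following sketch:
# udevadm info --query=property --name=/dev/sdf | grep -E 'WWN|SERIAL'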
ID_WWN=0x5002538d00000000_
ID_SERIAL=Generic-_SD_MMC_20120501030900000-0:0
ID_SERIAL_SHORT=20120501030900000
NOTE
The command in this example returns all values identified as a World Wide
Name (WWN) or serial number associated with the specified block device.
Although it is preferred to use a WWN, the WWN is not always available for a
given device, and any values returned by the example command are acceptable
to use as the device system unique ID.
4. Create the /etc/tuned/my-profile/tuned.conf configuration file. In the file, set the following
options:
[main]
include=existing-profile
b. Set the selected disk scheduler for the device that matches the WWN identifier:
[disk]
devices_udev_regex=IDNAME=device system unique id
elevator=selected-scheduler
Here:
Replace IDNAME with the name of the identifier being used (for example, ID_WWN).
Replace device system unique id with the value of the chosen identifier (for example,
0x5002538d00000000).
To match multiple devices in the devices_udev_regex option, enclose the identifiers in
parentheses and separate them with vertical bars:
devices_udev_regex=(ID_WWN=0x5002538d00000000)|
(ID_WWN=0x1234567800000000)
Verification steps
$ tuned-adm active
$ tuned-adm verify
# cat /sys/block/device/queue/scheduler
In the file name, replace device with the block device name, for example sdc.
Additional resources
Procedure
Verification steps
# tuna -h
Additional resources
Prerequisites
The tuna tool is installed. For more information, see Installing tuna tool.
Procedure
# tuna --show_threads
thread
To tune CPUs using the tuna CLI, see Tuning CPUs using tuna tool .
To tune the IRQs using the tuna tool, see Tuning IRQs using tuna tool .
# tuna --save=filename
This command saves only currently running kernel threads. Processes that are not running are
not saved.
Additional resources
Isolate CPUs
All tasks running on the specified CPU move to the next available CPU. Isolating a CPU makes it
unavailable by removing it from the affinity mask of all threads.
Include CPUs
Allows tasks to run on the specified CPU
Restore CPUs
Restores the specified CPU to its previous configuration.
This procedure describes how to tune CPUs using the tuna CLI.
Prerequisites
The tuna tool is installed. For more information, see Installing tuna tool.
Procedure
The cpu_list argument is a list of comma-separated CPU numbers. For example, --cpus=0,2.
CPU lists can also be specified as a range, for example --cpus="1-3", which selects CPUs 1, 2, and 3.
To add a specific CPU to the current cpu_list, for example, use --cpus=+0.
To isolate a CPU:
To include a CPU:
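Assuming the tuna --isolate and --include options, the corresponding commands would look similar to
the following sketch:
# tuna --cpus=cpu_list --isolate
# tuna --cpus=cpu_list --include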
On a system with four or more processors, make all ssh threads run on CPUs 0 and 1, and all http
threads on CPUs 2 and 3:
3. Moves the selected threads to the selected CPUs. Tuna sets the affinity mask of threads
starting with ssh to the appropriate CPUs. The CPUs can be expressed numerically as 0
and 1, in hex mask as 0x3, or in binary as 11.
6. Moves the selected threads to the specified CPUs. Tuna sets the affinity mask of threads
starting with http to the specified CPUs. The CPUs can be expressed numerically as 2 and
3, in hex mask as 0xC, or in binary as 1100.
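A sketch of a single command that performs the described steps, assuming the tuna --threads and
--move options:
# tuna --cpus=0,1 --threads=ssh\* --move --cpus=2,3 --threads=http\* --move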
Verification steps
Display the current configuration and verify that the changes were performed as expected:
thread ctxt_switches
pid SCHED_ rtpri affinity voluntary nonvoluntary cmd
3861 OTHER 0 0,1 33997 58 gnome-screensav
thread ctxt_switches
pid SCHED_ rtpri affinity voluntary nonvoluntary cmd
3861 OTHER 0 0 33997 58 gnome-screensav
thread ctxt_switches
2. Displays the selected threads to enable the user to verify their affinity mask and RT priority.
3. Selects CPU 0.
10. Moves the gnome-sc threads to the specified CPUs, CPUs 0 and 1.
Additional resources
/proc/cpuinfo file
This procedure describes how to tune the IRQs using the tuna tool.
Prerequisites
The tuna tool is installed. For more information, see Installing tuna tool.
Procedure
# tuna --show_irqs
# users affinity
0 timer 0
1 i8042 0
7 parport0 0
Replace 128 with the irq_list argument and 3 with the cpu_list argument.
The cpu_list argument is a list of comma-separated CPU numbers, for example, --cpus=0,2. For
more information, see Tuning CPUs using tuna tool .
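A sketch of the command being described, assuming the tuna --irqs, --cpus, and --move options:
# tuna --irqs=128 --cpus=3 --move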
Verification steps
Compare the state of the selected IRQs before and after moving any interrupt to a specified
CPU:
Additional resources
/proc/interrupts file
CHAPTER 4. MONITORING PERFORMANCE USING RHEL SYSTEM ROLES
On Red Hat Enterprise Linux 9, the interface currently consists of the following roles:
Metrics (metrics)
Network Bound Disk Encryption client and Network Bound Disk Encryption server (nbde_client
and nbde_server)
Networking (network)
Postfix (postfix)
All these roles are provided by the rhel-system-roles package available in the AppStream repository.
Additional resources
Ansible playbook
Playbooks are Ansible’s configuration, deployment, and orchestration language. They can describe a
policy you want your remote systems to enforce, or a set of steps in a general IT process.
Control node
Any machine with Ansible installed. You can run commands and playbooks, invoking /usr/bin/ansible
or /usr/bin/ansible-playbook, from any control node. You can use any computer that has Python
installed on it as a control node - laptops, shared desktops, and servers can all run Ansible. However,
you cannot use a Windows machine as a control node. You can have multiple control nodes.
Inventory
A list of managed nodes. An inventory file is also sometimes called a “hostfile”. Your inventory can
specify information like IP address for each managed node. An inventory can also organize managed
nodes, creating and nesting groups for easier scaling. To learn more about inventory, see the
Working with Inventory section.
Managed nodes
The network devices, servers, or both that you manage with Ansible. Managed nodes are also
sometimes called “hosts”. Ansible is not installed on managed nodes.
Prerequisites
You attached a Red Hat Enterprise Linux Server subscription to the system.
If available in your Customer Portal account, you attached an Ansible Automation Platform
subscription to the system.
Procedure
2. Create a user that you later use to manage and execute playbooks:
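Assuming the user is named ansible, as the prompts below indicate, the account is typically created
with:
# useradd ansible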
[root@control-node]# su - ansible
[ansible@control-node]$ ssh-keygen
5. Optional: Configure an SSH agent to prevent Ansible from prompting you for the SSH key
password each time you establish a connection.
[defaults]
inventory = /home/ansible/inventory
remote_user = ansible
[privilege_escalation]
become = True
become_method = sudo
become_user = root
become_ask_pass = True
Ansible uses the account set in the remote_user parameter when it establishes SSH
connections to managed nodes.
Ansible uses the sudo utility to execute tasks on managed nodes as the root user.
For security reasons, configure sudo on managed nodes to require entering the password
of the remote user to become root. By specifying the become_ask_pass=True setting in
~/.ansible.cfg, Ansible prompts for this password when you execute a playbook.
Settings in the ~/.ansible.cfg file have a higher priority and override settings from the global
/etc/ansible/ansible.cfg file.
7. Create the ~/inventory file. For example, the following is an inventory file in the INI format with
three hosts and one host group named US:
managed-node-01.example.com
[US]
managed-node-02.example.com ansible_host=192.0.2.100
managed-node-03.example.com
Note that the control node must be able to resolve the hostnames. If the DNS server cannot
resolve certain hostnames, add the ansible_host parameter next to the host entry to specify its
IP address.
Verification
Additional resources
Scope of support for the Ansible Core package included in the RHEL 9 and RHEL 8.6 and later
AppStream repositories
How to register and subscribe a system to the Red Hat Customer Portal using subscription-
manager
However, direct SSH access as the root user can be a security risk. Therefore, when you prepare a
managed node, you create a local user on this node and configure a sudo policy. Ansible on the control
node can then use this account to log in to the managed node and execute playbooks as different users,
such as root.
Prerequisites
Procedure
1. Create a user:
The control node later uses this user to establish an SSH connection to this host.
You must enter this password when Ansible uses sudo to perform tasks as the root user.
3. Install the ansible user’s SSH public key on the managed node:
a. Log into the control node as the ansible user, and copy the SSH public key to the managed
node:
b. Remotely execute a command on the control node to verify the SSH connection:
a. Use the visudo command to create and edit the /etc/sudoers.d/ansible file:
The benefit of using visudo over a normal editor is that this utility provides basic sanity
checks and checks for parse errors before installing the file.
These settings grant permissions to the ansible user to run all commands as any user and
group on this host without entering the password of the ansible user.
Additional resources
Prerequisites
You prepared at least one managed node as described in Preparing a managed node .
If you want to run playbooks on host groups, the managed node is listed in the inventory file on
the control node.
Procedure
1. Use the Ansible ping module to verify that you can execute commands on all managed hosts:
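With the inventory and ~/.ansible.cfg configured as described above, such a check typically looks like:
# ansible all -m ping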
The hard-coded all host group dynamically contains all hosts listed in the inventory file.
2. Use the Ansible command module to run the whoami utility on a managed host:
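For example, assuming the managed-node-01.example.com host from the inventory above:
# ansible managed-node-01.example.com -m command -a whoami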
If the command returns root, you configured sudo on the managed nodes correctly, and
privilege escalation works.
NOTE
Prerequisites
You have the rhel-system-roles package installed on the machine you want to monitor.
Procedure
1. Configure localhost in the /etc/ansible/hosts Ansible inventory by adding the following content
to the inventory:
localhost ansible_connection=local
---
- hosts: localhost
vars:
metrics_graph_service: yes
roles:
- rhel-system-roles.metrics
# ansible-playbook name_of_your_playbook.yml
NOTE
4. To view visualization of the metrics being collected on your machine, access the grafana web
interface as described in Accessing the Grafana web UI .
Prerequisites
You have the rhel-system-roles package installed on the machine you want to use to run the
playbook.
Procedure
1. Add the name or IP of the machines you wish to monitor via the playbook to the
/etc/ansible/hosts Ansible inventory file under an identifying group name enclosed in brackets:
[remotes]
webserver.example.com
database.example.com
---
- hosts: remotes
vars:
metrics_retention_days: 0
roles:
- rhel-system-roles.metrics
# ansible-playbook name_of_your_playbook.yml -k
Prerequisites
You have the rhel-system-roles package installed on the machine you want to use to run the
playbook.
Procedure
---
- hosts: localhost
vars:
metrics_graph_service: yes
metrics_query_service: yes
metrics_retention_days: 10
metrics_monitored_hosts: ["database.example.com", "webserver.example.com"]
roles:
- rhel-system-roles.metrics
# ansible-playbook name_of_your_playbook.yml
NOTE
3. To view graphical representation of the metrics being collected centrally by your machine and to
query the data, access the grafana web interface as described in Accessing the Grafana web UI .
authentication using the scram-sha-256 authentication mechanism. This procedure describes how to
set up authentication using the metrics RHEL System Role.
Prerequisites
You have the rhel-system-roles package installed on the machine you want to use to run the
playbook.
Procedure
1. Include the following variables in the Ansible playbook for which you want to set up authentication:
---
vars:
metrics_username: your_username
metrics_password: your_password
# ansible-playbook name_of_your_playbook.yml
Verification steps
Prerequisites
You have the rhel-system-roles package installed on the machine you want to monitor.
You have installed Microsoft SQL Server for Red Hat Enterprise Linux and established a
'trusted' connection to an SQL server. See Install SQL Server and create a database on Red Hat .
You have installed the Microsoft ODBC driver for SQL Server for Red Hat Enterprise Linux. See
Red Hat Enterprise Server and Oracle Linux .
Procedure
1. Configure localhost in the /etc/ansible/hosts Ansible inventory by adding the following content
to the inventory:
localhost ansible_connection=local
---
- hosts: localhost
roles:
- role: rhel-system-roles.metrics
vars:
metrics_from_mssql: yes
# ansible-playbook name_of_your_playbook.yml
Verification steps
Use the pcp command to verify that SQL Server PMDA agent (mssql) is loaded and running:
# pcp
platform: Linux rhel82-2.local 4.18.0-167.el8.x86_64 #1 SMP Sun Dec 15 01:24:23 UTC
2019 x86_64
hardware: 2 cpus, 1 disk, 1 node, 2770MB RAM
timezone: PDT+7
services: pmcd pmproxy
pmcd: Version 5.0.2-1, 12 agents, 4 clients
pmda: root pmcd proc pmproxy xfs linux nfsclient mmv kvm mssql
jbd2 dm
pmlogger: primary logger: /var/log/pcp/pmlogger/rhel82-2.local/20200326.16.31
pmie: primary engine: /var/log/pcp/pmie/rhel82-2.local/pmie.log
Additional resources
For more information about using Performance Co-Pilot for Microsoft SQL Server, see this Red
Hat Developers Blog post.
This section describes how to install and enable PCP on your system.
You can analyze data patterns by comparing live results with archived data.
Features of PCP:
Light-weight distributed architecture, which is useful during the centralized analysis of complex
systems.
The Performance Metric Collector Daemon (pmcd) collects performance data from the
installed Performance Metric Domain Agents (pmda). PMDAs can be individually loaded or
unloaded on the system and are controlled by the PMCD on the same host.
Various client tools, such as pminfo or pmstat, can retrieve, display, archive, and process this
data on the same host or over the network.
The pcp package provides the command-line tools and underlying functionality.
The pcp-gui package provides the graphical application. Install the pcp-gui package by
executing the dnf install pcp-gui command. For more information, see Visually tracing PCP log
archives with the PCP Charts application.
Additional resources
/usr/share/doc/pcp-doc/ directory
Index of Performance Co-Pilot (PCP) articles, solutions, tutorials, and white papers on the
Red Hat Customer Portal
Side-by-side comparison of PCP tools with legacy tools Red Hat Knowledgebase article
To begin using PCP, install all the required packages and enable the PCP monitoring services.
This procedure describes how to install PCP using the pcp package. If you want to automate the PCP
installation, install it using the pcp-zeroconf package. For more information on installing PCP by using
pcp-zeroconf, see Setting up PCP with pcp-zeroconf.
Procedure
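A minimal sketch of installing the pcp package and enabling the collector service:
# dnf install pcp
# systemctl enable --now pmcd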
Verification steps
# pcp
platform: Linux workstation 4.18.0-80.el8.x86_64 #1 SMP Wed Mar 13 12:02:46 UTC 2019
x86_64
hardware: 12 cpus, 2 disks, 1 node, 36023MB RAM
timezone: CEST-2
services: pmcd
pmcd: Version 4.3.0-1, 8 agents
pmda: root pmcd proc xfs linux mmv kvm jbd2
Additional resources
You can analyze the resulting tar.gz file and the archive of the pmlogger output using various PCP
tools and compare them with other sources of performance information.
Prerequisites
PCP is installed. For more information, see Installing and enabling PCP .
Procedure
# pmlogconf -r /var/lib/pcp/config/pmlogger/config.default
5. Save the output to a tar.gz file whose name is based on the host name and the current date
and time:
# cd /var/log/pcp/pmlogger/
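A sketch of such a step, with a purely illustrative archive name built from the host name and the
current date and time:
# tar -czf $(hostname).$(date +%F-%Hh%M).pcp.tar.gz $(hostname)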
Extract this file and analyze the data using PCP tools.
Additional resources
Name    Description
pcp-uptime    Displays how long the system has been running, how many users are currently logged on, and the system load averages for the past 1, 5, and 15 minutes.
PCP supports multiple deployment architectures, based on the scale of the PCP deployment.
Localhost
Each service runs locally on the monitored machine. When you start a service without any
configuration changes, this is the default deployment. Scaling beyond the individual node is not
possible in this case.
Decentralized
The only difference between the localhost and decentralized setups is the centralized Redis service.
In this model, the pmlogger service runs on each monitored host and retrieves metrics from a
local pmcd instance. A local pmproxy service then exports the performance metrics to a central
Redis instance.
NOTE
By default, the deployment setup for Redis is standalone, localhost. However, Redis can
optionally perform in a highly-available and highly scalable clustered fashion, where data
is shared across multiple hosts. Another viable option is to deploy a Redis cluster in the
cloud, or to utilize a managed Redis cluster from a cloud vendor.
Additional resources
After every PCP upgrade, the pmlogrewrite tool is executed and rewrites old archives if there were
changes in the metric metadata between the previous and the new version of PCP. The duration of
this process scales linearly with the number of archives stored.
Additional resources
stream.expire specifies the duration after which stale metrics are removed, that is, metrics
that were not updated for the specified amount of time, in seconds.
stream.maxlen specifies the maximum number of metric values for one metric per host. This
setting should be the retention time divided by the logging interval, for example 20160 for 14
days of retention and 60s logging interval (60*60*24*14/60)
Additional resources
The following results were gathered on a centralized logging setup, also known as pmlogger farm
deployment, with a default pcp-zeroconf 5.3.0 installation, where each remote host is an identical
container instance running pmcd on a server with 64 CPU cores, 376 GB RAM, and one disk attached.
The logging interval is 10s, proc metrics of remote nodes are not included, and the memory values refer
to the Resident Set Size (RSS) value.
Number of Hosts 10 50
Table 5.5. Used resources depending on monitored hosts for 60s logging interval
NOTE
The pmproxy queues Redis requests and employs Redis pipelining to speed up Redis
queries. This can result in high memory usage. For troubleshooting this issue, see
Troubleshooting high memory usage.
This setup of the pmlogger farms is identical to the configuration mentioned in the Example: Analyzing
the centralized logging deployment for 60s logging interval, except that the Redis servers were
operating in cluster mode.
Table 5.6. Used resources depending on federated hosts for 60s logging interval
Columns: PCP Archives Storage per Day, pmlogger Memory, Network per Day (In/Out), pmproxy Memory, Redis Memory per Day
Here, all values are per host. The network bandwidth is higher due to the inter-node communication of
the Redis cluster.
The pmproxy process is busy processing new PCP archives and does not have spare CPU
cycles to process Redis requests and responses.
The Redis node or cluster is overloaded and cannot process incoming requests on time.
The pmproxy service daemon uses Redis streams and supports configuration parameters, which are
PCP tuning parameters that affect Redis memory usage and key retention. The
/etc/pcp/pmproxy/pmproxy.conf file lists the available configuration options for pmproxy and the
associated APIs.
Prerequisites
Procedure
To troubleshoot high memory usage, execute the following command and observe the inflight
column:
$ pmrep :pmproxy
backlog inflight reqs/s resp/s wait req err resp err changed throttled
byte count count/s count/s s/s count/s count/s count/s count/s
14:59:08 0 0 N/A N/A N/A N/A N/A N/A N/A
14:59:09 0 0 2268.9 2268.9 28 0 0 2.0 4.0
14:59:10 0 0 0.0 0.0 0 0 0 0.0 0.0
14:59:11 0 0 0.0 0.0 0 0 0 0.0 0.0
This column shows how many Redis requests are in-flight, which means they are queued or sent,
and no reply was received so far.
The pmproxy process is busy processing new PCP archives and does not have spare CPU
cycles to process Redis requests and responses.
The Redis node or cluster is overloaded and cannot process incoming requests on time.
To troubleshoot the high memory usage issue, reduce the number of pmlogger processes for
this farm, and add another pmlogger farm. Use the federated - multiple pmlogger farms setup.
If the Redis node is using 100% CPU for an extended amount of time, move it to a host with
better performance or use a clustered Redis setup instead.
To view how many Redis requests are inflight, see the pmproxy.redis.requests.inflight.total
metric and pmproxy.redis.requests.inflight.bytes metric to view how many bytes are
occupied by all current inflight Redis requests.
In general, the Redis request queue is zero, but it can build up with large pmlogger farms, which
limits scalability and can cause high latency for pmproxy clients.
Use the pminfo command to view information about performance metrics. For example, to view
the redis.* metrics, use the following command:
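For example, assuming the metrics are rooted under the redis namespace:
# pminfo redis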
Additional resources
CHAPTER 6. LOGGING PERFORMANCE DATA WITH PMLOGGER
Specify which metrics are recorded on the system and how often
Use the pmlogconf utility to check the default configuration. If the pmlogger configuration file does
not exist, pmlogconf creates it with default metric values.
Prerequisites
PCP is installed. For more information, see Installing and enabling PCP .
Procedure
# pmlogconf -r /var/lib/pcp/config/pmlogger/config.default
2. Follow pmlogconf prompts to enable or disable groups of related performance metrics and to
control the logging interval for each enabled group.
Additional resources
Prerequisites
PCP is installed. For more information, see Installing and enabling PCP .
Procedure
[access]
disallow * : all;
allow localhost : enquire;
Additional resources
Prerequisites
PCP is installed. For more information, see Installing and enabling PCP .
Procedure
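Enabling and starting the service typically looks like:
# systemctl enable --now pmlogger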
Verification steps
# pcp
platform: Linux workstation 4.18.0-80.el8.x86_64 #1 SMP Wed Mar 13 12:02:46 UTC 2019
x86_64
hardware: 12 cpus, 2 disks, 1 node, 36023MB RAM
timezone: CEST-2
services: pmcd
pmcd: Version 4.3.0-1, 8 agents, 1 client
pmda: root pmcd proc xfs linux mmv kvm jbd2
pmlogger: primary logger: /var/log/pcp/pmlogger/workstation/20190827.15.54
Additional resources
/var/lib/pcp/config/pmlogger/config.default file
Prerequisites
PCP is installed. For more information, see Installing and enabling PCP .
Procedure
Replace 192.168.4.62 with the IP address the client should listen on.
# firewall-cmd --reload
success
# setsebool -P pcp_bind_all_unreserved_ports on
Verification steps
Additional resources
/var/lib/pcp/config/pmlogger/config.default file
Prerequisites
PCP is installed. For more information, see Installing and enabling PCP .
Client is configured for metrics collection. For more information, see Setting up a client system
for metrics collection.
Procedure
Replace 192.168.4.13, 192.168.4.14, 192.168.4.62 and 192.168.4.69 with the client IP addresses.
Verification steps
Ensure that you can access the latest archive file from each directory:
The archive files from the /var/log/pcp/pmlogger/ directory can be used for further analysis and
graphing.
Additional resources
/var/lib/pcp/config/pmlogger/config.default file
Parse the selected PCP log archive and export the values into an ASCII table
Extract the entire archive log or only select metric values from the log by specifying individual
metrics on the command line
Prerequisites
PCP is installed. For more information, see Installing and enabling PCP .
The pmlogger service is enabled. For more information, see Enabling the pmlogger service.
Procedure
$ pmrep --start @3:00am --archive 20211128 --interval 5seconds --samples 10 --output csv
disk.dev.write
Time,"disk.dev.write-sda","disk.dev.write-sdb"
2021-11-28 03:00:00,,
2021-11-28 03:00:05,4.000,5.200
2021-11-28 03:00:10,1.600,7.600
2021-11-28 03:00:15,0.800,7.100
2021-11-28 03:00:20,16.600,8.400
2021-11-28 03:00:25,21.400,7.200
2021-11-28 03:00:30,21.200,6.800
2021-11-28 03:00:35,21.000,27.600
2021-11-28 03:00:40,12.400,33.800
2021-11-28 03:00:45,9.800,20.600
The mentioned example displays the data on the disk.dev.write metric collected in an archive
at a 5 second interval in comma-separated-value format.
NOTE
Additional resources
CHAPTER 7. MONITORING PERFORMANCE WITH PERFORMANCE CO-PILOT
As a system administrator, you can monitor the system’s performance using the PCP application in
Red Hat Enterprise Linux 9.
Prerequisites
PCP is installed. For more information, see Installing and enabling PCP .
The pmlogger service is enabled. For more information, see Enabling the pmlogger service.
Procedure
3. Enable the SELinux boolean, so that pmda-postfix can access the required log files:
# setsebool -P pcp_read_generic_logs=on
# cd /var/lib/pcp/pmdas/postfix/
# ./Install
Verification steps
# pminfo postfix
postfix.received
postfix.sent
postfix.queues.incoming
postfix.queues.maildrop
postfix.queues.hold
postfix.queues.deferred
postfix.queues.active
Additional resources
/var/lib/pcp/config/pmlogger/config.default file
7.2. VISUALLY TRACING PCP LOG ARCHIVES WITH THE PCP CHARTS
APPLICATION
After recording metric data, you can replay the PCP log archives as graphs. The metrics are sourced
from one or more live hosts with alternative options to use metric data from PCP log archives as a
source of historical data. To customize the PCP Charts application interface to display the data from
the performance metrics, you can use line plots, bar graphs, or utilization graphs.
Replay the data in the PCP Charts application and use graphs to visualize the retrospective data
alongside live data of the system.
Prerequisites
PCP is installed. For more information, see Installing and enabling PCP .
Logged performance data with the pmlogger. For more information, see Logging performance
data with pmlogger.
Procedure
# pmchart
The pmtime server settings are located at the bottom. The start and pause button allows you
to control:
2. Click File and then New Chart to select metric from both the local machine and remote
machines by specifying their host name or address. Advanced configuration options include the
ability to manually set the axis values for the chart, and to manually choose the color of the
plots.
Click File and then Export to save an image of the current view.
Click Record and then Start to start a recording. Click Record and then Stop to stop the
recording. After stopping the recording, the recorded metrics are archived to be viewed
later.
4. Optional: In the PCP Charts application, the main configuration file, known as the view, allows
the metadata associated with one or more charts to be saved. This metadata describes all chart
aspects, including the metrics used and the chart columns. Save the custom view configuration
by clicking File and then Save View, and load the view configuration later.
The following example of the PCP Charts application view configuration file describes a
stacking chart graph showing the total number of bytes read and written to the given XFS file
system loop1:
#kmchart
version 1
Additional resources
This procedure describes how to collect data for Microsoft SQL Server via pcp on your system.
Prerequisites
You have installed Microsoft SQL Server for Red Hat Enterprise Linux and established a
'trusted' connection to an SQL server.
You have installed the Microsoft ODBC driver for SQL Server for Red Hat Enterprise Linux.
Procedure
1. Install PCP:
b. Edit the /etc/pcp/mssql/mssql.conf file to configure the SQL server account’s username
and password for the mssql agent. Ensure that the account you configure has access rights
to performance data.
username: user_name
password: user_password
Replace user_name with the SQL Server account and user_password with the SQL Server
user password for this account.
# cd /var/lib/pcp/pmdas/mssql
# ./Install
Updating the Performance Metrics Name Space (PMNS) ...
Terminate PMDA if already installed ...
Updating the PMCD control file, and notifying PMCD ...
Check mssql metrics have appeared ... 168 metrics and 598 values
[...]
Verification steps
Using the pcp command, verify if the SQL Server PMDA ( mssql) is loaded and running:
$ pcp
Performance Co-Pilot configuration on rhel.local:
platform: Linux rhel.local 4.18.0-167.el8.x86_64 #1 SMP Sun Dec 15 01:24:23 UTC 2019
x86_64
hardware: 2 cpus, 1 disk, 1 node, 2770MB RAM
timezone: PDT+7
services: pmcd pmproxy
pmcd: Version 5.0.2-1, 12 agents, 4 clients
pmda: root pmcd proc pmproxy xfs linux nfsclient mmv kvm mssql
jbd2 dm
pmlogger: primary logger: /var/log/pcp/pmlogger/rhel.local/20200326.16.31
pmie: primary engine: /var/log/pcp/pmie/rhel.local/pmie.log
View the complete list of metrics that PCP can collect from the SQL Server:
# pminfo mssql
After viewing the list of metrics, you can report the rate of transactions. For example, to report
on the overall transaction count per second, over a five second time window:
# pmval -t 1 -T 5 mssql.databases.transactions
View the graphical chart of these metrics on your system by using the pmchart command. For
more information, see Visually tracing PCP log archives with the PCP Charts application .
Additional resources
Performance Co-Pilot for Microsoft SQL Server with RHEL 8.2 Red Hat Developers Blog post
Prerequisites
# /usr/lib64/sa/sadc 1 5 -
In this example, sadc samples system data 5 times at a 1-second interval. The output file is
specified as -, which makes sadc write the data to the standard system activity daily data
file. This file is named saDD and is located in the /var/log/sa directory by default.
Procedure
# sadf -l -O pcparchive=/tmp/recording -2
In this example, using the -2 option results in sadf generating a PCP archive from a sadc archive
recorded 2 days ago.
Verification steps
You can use PCP commands to inspect and analyze the PCP archive generated from a sadc archive as
you would a native PCP archive. For example:
To show a list of metrics in the PCP archive generated from the sadc archive, run:
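For example, assuming the /tmp/recording archive created above:
$ pminfo --archive /tmp/recording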
To show the time span of the archive and the hostname of the PCP archive, run:
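A sketch using the pmdumplog label output, again assuming the /tmp/recording archive:
$ pmdumplog -l /tmp/recording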
CHAPTER 8. PERFORMANCE ANALYSIS OF XFS WITH PCP
This section describes how to analyze XFS file system’s performance using PCP.
Prerequisites
PCP is installed. For more information, see Installing and enabling PCP .
Procedure
# cd /var/lib/pcp/pmdas/xfs/
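As with the other PMDAs in this document, the installation step that follows the directory change is
typically:
# ./Install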
Verification steps
Verify that the pmcd process is running on the host and the XFS PMDA is listed as enabled in
the configuration:
# pcp
platform: Linux workstation 4.18.0-80.el8.x86_64 #1 SMP Wed Mar 13 12:02:46 UTC 2019
x86_64
hardware: 12 cpus, 2 disks, 1 node, 36023MB RAM
timezone: CEST-2
services: pmcd
pmcd: Version 4.3.0-1, 8 agents
pmda: root pmcd proc xfs linux mmv kvm jbd2
Additional resources
The pminfo command provides per-device XFS metrics for each mounted XFS file system.
This procedure displays a list of all available metrics provided by the XFS PMDA.
Prerequisites
PCP is installed. For more information, see Installing and enabling PCP .
Procedure
Display the list of all available metrics provided by the XFS PMDA:
# pminfo xfs
Display information for the individual metrics. The following examples examine specific XFS
read and write metrics using the pminfo tool:
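A sketch that prints the help text and then fetches the current value, assuming the pminfo -T (help
text) and -f (fetch) options:
# pminfo -T xfs.read_bytes
# pminfo -f xfs.read_bytes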
xfs.read_bytes
Help:
This is the number of bytes read via read(2) system calls to files in
XFS file systems. It can be used in conjunction with the read_calls
count to calculate the average size of the read operations to file in
XFS file systems.
xfs.read_bytes
value 4891346238
Additional resources
This procedure describes how to reset XFS metrics using the pmstore tool.
Prerequisites
PCP is installed. For more information, see Installing and enabling PCP .
Procedure
$ pminfo -f xfs.write
xfs.write
value 325262
# pmstore xfs.control.reset 1
Verification steps
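The query is presumably the same as the one used earlier:
$ pminfo -f xfs.write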
xfs.write
value 0
Additional resources
CHAPTER 9. SETTING UP GRAPHICAL REPRESENTATION OF PCP METRICS
This section describes how to set up and access the graphical representation of PCP metrics.
Procedure
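Setting up PCP with pcp-zeroconf typically starts with installing the package:
# dnf install pcp-zeroconf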
Verification steps
Ensure that the pmlogger service is active, and starts archiving the metrics:
Additional resources
Prerequisites
PCP is configured. For more information, see Setting up PCP with pcp-zeroconf.
Procedure
3. Open the server’s firewall for network traffic to the Grafana service.
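A sketch that opens the default Grafana port 3000 (an assumption based on the URL used later in this
chapter):
# firewall-cmd --permanent --add-port=3000/tcp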
# firewall-cmd --reload
success
Verification steps
performancecopilot-pcp-app @ 3.1.0
Additional resources
add PCP Redis, PCP bpftrace, and PCP Vector data sources
create dashboard
Prerequisites
1. PCP is configured. For more information, see Setting up PCP with pcp-zeroconf.
Procedure
1. On the client system, open a browser and access the grafana-server on port 3000, using
http://192.0.2.0:3000 link.
Replace 192.0.2.0 with your machine IP.
2. For the first login, enter admin in both the Email or username and Password field.
Grafana prompts to set a New password to create a secured account. If you want to set it later,
click Skip.
3. From the menu, hover over the Configuration icon and then click Plugins.
4. In the Plugins tab, type performance co-pilot in the Search by name or type text box and then
click Performance Co-Pilot (PCP) plugin.
NOTE
The top corner of the screen has a similar icon, but it controls the
general Dashboard settings.
7. In the Grafana Home page, click Add your first data sourceto add PCP Redis, PCP bpftrace,
and PCP Vector data sources. For more information on adding data source, see:
To add pcp redis data source, view default dashboard, create a panel, and an alert rule, see
Creating panels and alert in PCP Redis data source .
To add pcp bpftrace data source and view the default dashboard, see Viewing the PCP
bpftrace System Analysis dashboard.
To add pcp vector data source, view the default dashboard, and to view the vector checklist,
see Viewing the PCP Vector Checklist.
8. Optional: From the menu, hover over the admin profile icon to change the
Preferences including Edit Profile, Change Password, or to Sign out.
Additional resources
Prerequisites
1. PCP is configured. For more information, see Setting up PCP with pcp-zeroconf.
3. A mail transfer agent, for example, sendmail or postfix, is installed and configured.
Procedure
# vi /etc/grafana/grafana.ini
allow_loading_unsigned_plugins = pcp-redis-datasource
Verification steps
# pmseries disk.dev.read
2eb3e58d8f1e231361fb15cf1aa26fe534b4d9df
This command does not return any data if the redis package is not installed.
Additional resources
Prerequisites
1. The PCP Redis is configured. For more information, see Configuring PCP Redis.
2. The grafana-server is accessible. For more information, see Accessing the Grafana web UI .
Procedure
2. In the Grafana Home page, click Add your first data source.
3. In the Add data source pane, type redis in the Filter by name or type text box and then click
PCP Redis.
a. Add http://localhost:44322 in the URL field and then click Save & Test.
b. Click Dashboards tab → Import → PCP Redis: Host Overview to see a dashboard with an
overview of any useful metrics.
a. From the menu, hover over the Create icon → Dashboard → Add new panel
icon to add a panel.
b. In the Query tab, select the PCP Redis from the query list instead of the selected default
option and in the text field of A, enter metric, for example, kernel.all.load to visualize the
kernel load graph.
c. Optional: Add Panel title and Description, and update other options from the Settings.
d. Click Save to apply changes and save the dashboard. Add Dashboard name.
a. In the PCP Redis query panel, click Alert and then click Create
Alert.
b. Edit the Name, Evaluate query, and For fields from the Rule, and specify the Conditions
for your alert.
c. Click Save to apply changes and save the dashboard. Click Apply to apply changes and go
back to the dashboard.
d. Optional: In the same panel, scroll down and click Delete icon to delete the created rule.
e. Optional: From the menu, click Alerting icon to view the created alert rules with
different alert statuses, to edit the alert rule, or to pause the existing rule from the Alert
Rules tab.
To add a notification channel for the created alert rule to receive an alert notification from
Grafana, see Adding notification channels for alerts .
You can receive these alerts after selecting any one type from the supported list of notifiers, which
includes DingDing, Discord, Email, Google Hangouts Chat, HipChat, Kafka REST Proxy, LINE,
Microsoft Teams, OpsGenie, PagerDuty, Prometheus Alertmanager, Pushover, Sensu, Slack,
Telegram, Threema Gateway, VictorOps, and webhook.
Prerequisites
1. The grafana-server is accessible. For more information, see Accessing the Grafana web UI .
2. An alert rule is created. For more information, see Creating panels and alert in PCP Redis data
source.
3. Configure SMTP and add a valid sender’s email address in the grafana/grafana.ini file:
# vi /etc/grafana/grafana.ini
[smtp]
enabled = true
from_address = [email protected]
Procedure
1. From the menu, hover over the Alerting icon → click Notification channels → Add
channel.
b. Select the communication Type, for example, Email and enter the email address. You can
add multiple email addresses using the ; separator.
3. Click Save.
a. From the menu, hover over the Alerting icon and then click Alert rules.
b. From the Alert Rules tab, click the created alert rule.
c. On the Notifications tab, select your notification channel name from the Send to option,
and then add an alert message.
d. Click Apply.
Additional resources
Procedure
2. Specify the supported authentication mechanism and the user database path in the pmcd.conf
file:
# vi /etc/sasl2/pmcd.conf
mech_list: scram-sha-256
sasldb_path: /etc/pcp/passwd.db
# useradd -r metrics
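The following password prompt typically comes from the Cyrus SASL saslpasswd2 utility; the -a pmcd
application name used here is an assumption:
# saslpasswd2 -a pmcd metrics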
Password:
Again (for verification):
To add the created user, you are required to enter the metrics account password.
Verification steps
Additional resources
How can I setup authentication between PCP components, like PMDAs and pmcd in RHEL 8.2?
The bpftrace agent uses bpftrace scripts to gather the metrics. The bpftrace scripts use the enhanced
Berkeley Packet Filter (eBPF).
Prerequisites
1. PCP is configured. For more information, see Setting up PCP with pcp-zeroconf.
3. The scram-sha-256 authentication mechanism is configured. For more information, see Setting
up authentication between PCP components.
Procedure
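The bpftrace PMDA package is presumably installed first, for example:
# dnf install pcp-pmda-bpftrace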
2. Edit the bpftrace.conf file and add the user that you created in Setting up authentication
between PCP components:
# vi /var/lib/pcp/pmdas/bpftrace/bpftrace.conf
[dynamic_scripts]
enabled = true
auth_enabled = true
allowed_users = root,metrics
# cd /var/lib/pcp/pmdas/bpftrace/
# ./Install
Updating the Performance Metrics Name Space (PMNS) ...
Terminate PMDA if already installed ...
Updating the PMCD control file, and notifying PMCD ...
Check bpftrace metrics have appeared ... 7 metrics and 6 values
The pmda-bpftrace is now installed, and can only be used after authenticating your user. For
more information, see Viewing the PCP bpftrace System Analysis dashboard.
Additional resources
In the PCP bpftrace data source, you can view the dashboard with an overview of useful metrics.
Prerequisites
1. The PCP bpftrace is installed. For more information, see Installing PCP bpftrace.
2. The grafana-server is accessible. For more information, see Accessing the Grafana web UI .
Procedure
2. In the Grafana Home page, click Add your first data source.
3. In the Add data source pane, type bpftrace in the Filter by name or type text box and then
click PCP bpftrace.
b. Toggle the Basic Auth option and add the created user credentials in the User and
Password field.
d. Click Dashboards tab → Import → PCP bpftrace: System Analysis to see a dashboard
with an overview of any useful metrics.
Prerequisites
1. PCP is configured. For more information, see Setting up PCP with pcp-zeroconf.
Procedure
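The bcc PMDA package is presumably installed first, for example:
# dnf install pcp-pmda-bcc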
# cd /var/lib/pcp/pmdas/bcc
# ./Install
[Wed Apr 1 00:27:48] pmdabcc(22341) Info: Initializing, currently in 'notready' state.
[Wed Apr 1 00:27:48] pmdabcc(22341) Info: Enabled modules:
[Wed Apr 1 00:27:48] pmdabcc(22341) Info: ['biolatency', 'sysfork',
[...]
Updating the Performance Metrics Name Space (PMNS) ...
Terminate PMDA if already installed ...
Updating the PMCD control file, and notifying PMCD ...
Check bcc metrics have appeared ... 1 warnings, 1 metrics and 0 values
Additional resources
After adding the PCP Vector data source, you can view the dashboard with an overview of useful metrics
and view the related troubleshooting or reference links in the checklist.
Prerequisites
1. The PCP Vector is installed. For more information, see Installing PCP Vector.
2. The grafana-server is accessible. For more information, see Accessing the Grafana web UI .
Procedure
2. In the Grafana Home page, click Add your first data source.
3. In the Add data source pane, type vector in the Filter by name or type text box and then click
PCP Vector.
a. Add http://localhost:44322 in the URL field and then click Save & Test.
b. Click Dashboards tab → Import → PCP Vector: Host Overview to see a dashboard with an
overview of any useful metrics.
5. From the menu, hover over the Performance Co-Pilot plugin and then click PCP
Vector Checklist.
In the PCP checklist, click help or warning icon to view the related
troubleshooting or reference links.
Procedure
Verify that the pmlogger service is up and running by executing the following command:
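A typical check, shown as an illustration:
$ systemctl status pmlogger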
Verify if files were created or modified to the disk by executing the following command:
$ ls /var/log/pcp/pmlogger/$(hostname)/ -rlt
total 4024
-rw-r--r--. 1 pcp pcp 45996 Oct 13 2019 20191013.20.07.meta.xz
-rw-r--r--. 1 pcp pcp 412 Oct 13 2019 20191013.20.07.index
-rw-r--r--. 1 pcp pcp 32188 Oct 13 2019 20191013.20.07.0.xz
-rw-r--r--. 1 pcp pcp 44756 Oct 13 2019 20191013.20.30-00.meta.xz
[..]
Verify that the pmproxy service is running by executing the following command:
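Similarly, a typical check is:
$ systemctl status pmproxy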
Verify that pmproxy is running, time series support is enabled, and a connection to Redis is
established by viewing the /var/log/pcp/pmproxy/pmproxy.log file and ensure that it contains
the following text:
Here, 1716 is the PID of pmproxy, which will be different for every invocation of pmproxy.
Verify if the Redis database contains any keys by executing the following command:
$ redis-cli dbsize
(integer) 34837
Verify if any PCP metrics are in the Redis database and pmproxy is able to access them by
executing the following commands:
$ pmseries disk.dev.read
2eb3e58d8f1e231361fb15cf1aa26fe534b4d9df
$ pmseries "disk.dev.read[count:10]"
2eb3e58d8f1e231361fb15cf1aa26fe534b4d9df
[Mon Jul 26 12:21:10.085468000 2021] 117971
70e83e88d4e1857a3a31605c6d1333755f2dd17c
[Mon Jul 26 12:21:00.087401000 2021] 117758
70e83e88d4e1857a3a31605c6d1333755f2dd17c
[Mon Jul 26 12:20:50.085738000 2021] 116688
70e83e88d4e1857a3a31605c6d1333755f2dd17c
[...]
pcp:metric.name:series:2eb3e58d8f1e231361fb15cf1aa26fe534b4d9df
pcp:values:series:2eb3e58d8f1e231361fb15cf1aa26fe534b4d9df
pcp:desc:series:2eb3e58d8f1e231361fb15cf1aa26fe534b4d9df
pcp:labelvalue:series:2eb3e58d8f1e231361fb15cf1aa26fe534b4d9df
pcp:instances:series:2eb3e58d8f1e231361fb15cf1aa26fe534b4d9df
pcp:labelflags:series:2eb3e58d8f1e231361fb15cf1aa26fe534b4d9df
Verify if there are any errors in the Grafana logs by executing the following command:
$ journalctl -e -u grafana-server
-- Logs begin at Mon 2021-07-26 11:55:10 IST, end at Mon 2021-07-26 12:30:15 IST. --
Jul 26 11:55:17 localhost.localdomain systemd[1]: Starting Grafana instance...
Jul 26 11:55:17 localhost.localdomain grafana-server[1171]: t=2021-07-26T11:55:17+0530
lvl=info msg="Starting Grafana" logger=server version=7.3.6 c>
Jul 26 11:55:17 localhost.localdomain grafana-server[1171]: t=2021-07-26T11:55:17+0530
lvl=info msg="Config loaded from" logger=settings file=/usr/s>
Jul 26 11:55:17 localhost.localdomain grafana-server[1171]: t=2021-07-26T11:55:17+0530
lvl=info msg="Config loaded from" logger=settings file=/etc/g>
[...]
Throughput performance
Latency performance
Network performance
Virtual machines
The TuneD service optimizes system options to match the selected profile.
In the web console, you can set which performance profile your system uses.
Additional resources
Prerequisites
Make sure the web console is installed and accessible. For details, see Installing the web
console.
Procedure
1. Log into the RHEL web console. For details, see Logging in to the web console .
2. Click Overview.
4. In the Change Performance Profile dialog box, change the profile if necessary.
Verification steps
Here, you can view the events, errors, and a graphical representation of resource utilization and
saturation.
Prerequisites
Make sure the web console is installed and accessible. For details, see Installing the web
console.
Install the cockpit-pcp package, which enables collecting the performance metrics:
i. Log in to the web console with administrative privileges. For details, see Logging in to
the web console.
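If you prefer the command line, the package can also be installed from a terminal; a typical invocation, assuming the cockpit-pcp package name given above, is:
# dnf install cockpit-pcp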
Procedure
1. Log in to the RHEL 9 web console. In the Overview page, click View details and history to view
the Performance Metrics.
This procedure shows you how to enable performance metrics export with PCP from your RHEL 9 web
console interface.
Prerequisites
The web console must be installed and accessible. For details, see Installing the web console .
a. Log in to the web console with administrative privileges. For details, see Logging in to
the web console.
Alternatively, you can install the package from the web console interface later in the procedure.
Procedure
1. In the Overview page, click View details and history in the Usage table.
If you do not have the redis service installed, you will be prompted to install it.
4. To open the pmproxy service, select a zone from the drop-down list and click the Add pmproxy
button.
5. Click Save.
Verification
1. Click Networking.
2. In the Firewall table, click the n active zones link (where n is the number of active zones) or the Edit rules and zones button.
CHAPTER 11. SETTING THE DISK SCHEDULER
IMPORTANT
In Red Hat Enterprise Linux 9, the supported ways to change the disk scheduler are the following:
Set the scheduler using TuneD, as described in Setting the disk scheduler using TuneD
Set the scheduler using udev, as described in Setting the disk scheduler using udev rules
NOTE
In Red Hat Enterprise Linux 9, block devices support only multi-queue scheduling. This
enables the block layer performance to scale well with fast solid-state drives (SSDs) and
multi-core systems.
none
Implements a first-in first-out (FIFO) scheduling algorithm. It merges requests at the generic block
layer through a simple last-hit cache.
mq-deadline
Attempts to provide a guaranteed latency for requests from the point at which requests reach the
scheduler.
The mq-deadline scheduler sorts queued I/O requests into a read or write batch and then schedules
them for execution in increasing logical block addressing (LBA) order. By default, read batches take
precedence over write batches, because applications are more likely to block on read I/O operations.
After mq-deadline processes a batch, it checks how long write operations have been starved of
processor time and schedules the next read or write batch as appropriate.
This scheduler is suitable for most use cases, particularly those in which the write operations are
mostly asynchronous.
bfq
Targets desktop systems and interactive tasks.
The bfq scheduler ensures that a single application never uses all of the bandwidth. In effect, the
storage device is always as responsive as if it were idle. In its default configuration, bfq focuses on
delivering the lowest latency rather than achieving the maximum throughput.
bfq is based on cfq code. It does not grant the disk to each process for a fixed time slice but assigns a
budget measured in number of sectors to the process.
This scheduler is suitable, for example, when copying large files, because the system does not become
unresponsive in this case.
kyber
The scheduler tunes itself to achieve a latency goal by calculating the latencies of every I/O request
submitted to the block I/O layer. You can configure the target latencies for read, in the case of
cache-misses, and synchronous write requests.
This scheduler is suitable for fast devices, for example NVMe, SSD, or other low latency devices.
For a high-performance SSD or a CPU-bound system with fast storage, use none, especially when
running enterprise applications. Alternatively, use kyber.
NOTE
For Non-Volatile Memory Express (NVMe) block devices specifically, the default
scheduler is none and Red Hat recommends not changing this.
The kernel selects a default disk scheduler based on the type of device. The automatically selected
scheduler is typically the optimal setting. If you require a different scheduler, Red Hat recommends
using udev rules or the TuneD application to configure it. Match the selected devices and switch the
scheduler only for those devices.
Procedure
# cat /sys/block/device/queue/scheduler
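The active scheduler is shown in square brackets; a sample of typical output (the available schedulers vary by device and kernel configuration):
[mq-deadline] kyber bfq none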
In the file name, replace device with the block device name, for example sdc.
In the following procedure, replace:
device with the name of the block device, for example sdf
selected-scheduler with the disk scheduler that you want to set for the device, for example bfq
Prerequisites
The TuneD service is installed and enabled. For details, see Installing and enabling TuneD .
Procedure
1. Optional: Select an existing TuneD profile on which your profile will be based. For a list of
available profiles, see TuneD profiles distributed with RHEL .
To see which profile is currently active, use:
$ tuned-adm active
2. Create a new TuneD profile directory:
# mkdir /etc/tuned/my-profile
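The identifiers shown below are typically obtained by querying udev properties of the block device; a sketch, assuming the device is /dev/sdb:
$ udevadm info --query=property --name=/dev/sdb | grep -E '(WWN|SERIAL)'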
ID_WWN=0x5002538d00000000_
ID_SERIAL=Generic-_SD_MMC_20120501030900000-0:0
ID_SERIAL_SHORT=20120501030900000
NOTE
The command in this example returns all values identified as a World Wide
Name (WWN) or serial number associated with the specified block device.
Although it is preferred to use a WWN, the WWN is not always available for a
given device, and any values returned by the example command are acceptable to
use as the device system unique ID.
4. Create the /etc/tuned/my-profile/tuned.conf configuration file. In the file, set the following
options:
a. Specify the TuneD profile on which your profile is based:
[main]
include=existing-profile
b. Set the selected disk scheduler for the device that matches the WWN identifier:
[disk]
devices_udev_regex=IDNAME=device system unique id
elevator=selected-scheduler
Here:
Replace IDNAME with the name of the identifier being used (for example, ID_WWN).
Replace device system unique id with the value of the chosen identifier (for example,
0x5002538d00000000).
To match multiple devices in the devices_udev_regex option, enclose the identifiers in
parentheses and separate them with vertical bars:
devices_udev_regex=(ID_WWN=0x5002538d00000000)|
(ID_WWN=0x1234567800000000)
Verification steps
$ tuned-adm active
$ tuned-adm verify
# cat /sys/block/device/queue/scheduler
In the file name, replace device with the block device name, for example sdc.
Additional resources
In the following procedure, replace:
device with the name of the block device, for example sdf
selected-scheduler with the disk scheduler that you want to set for the device, for example bfq
Procedure
NOTE
The command in this example returns all values identified as a World Wide
Name (WWN) or serial number associated with the specified block device.
Although it is preferred to use a WWN, the WWN is not always available for a
given device, and any values returned by the example command are acceptable to
use as the device system unique ID.
2. Configure the udev rule. Create the /etc/udev/rules.d/99-scheduler.rules file with the
following content:
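A typical rule that matches on the device identifier and sets the scheduler, using the placeholders explained below, looks like the following:
ACTION=="add|change", SUBSYSTEM=="block", ENV{IDNAME}=="device system unique id", ATTR{queue/scheduler}="selected-scheduler"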
Here:
Replace IDNAME with the name of the identifier being used (for example, ID_WWN).
Replace device system unique id with the value of the chosen identifier (for example,
0x5002538d00000000).
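To make the rule take effect, it typically needs to be reloaded and applied; a sketch using standard udev tooling:
# udevadm control --reload-rules
# udevadm trigger --type=devices --action=change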
Verification steps
# cat /sys/block/device/queue/scheduler
Procedure
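A typical way to temporarily switch the scheduler, using the placeholders described below, is to write the scheduler name to the sysfs attribute:
# echo selected-scheduler > /sys/block/device/queue/scheduler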
In the file name, replace device with the block device name, for example sdc.
Verification steps
# cat /sys/block/device/queue/scheduler
CHAPTER 12. TUNING THE PERFORMANCE OF A SAMBA SERVER
Parts of this section were adapted from the Performance Tuning documentation published in the Samba
Wiki. License: CC BY 4.0. Authors and contributors: See the history tab on the Wiki page.
Prerequisites
NOTE
To always have the latest stable SMB protocol version enabled, do not set the server
max protocol parameter. If you set the parameter manually, you will need to modify the
setting with each new version of the SMB protocol, to have the latest protocol version
enabled.
The following procedure explains how to use the default value in the server max protocol parameter.
Procedure
1. Remove the server max protocol parameter from the [global] section in the
/etc/samba/smb.conf file.
Prerequisites
Procedure
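The case-handling parameters this procedure refers to typically look like the following in the share's section of /etc/samba/smb.conf (values assumed from the Samba performance tuning guidance):
case sensitive = true
default case = lower
preserve case = no
short preserve case = no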
NOTE
Using the settings in this procedure, files with names that are not in lowercase will
no longer be displayed.
For details about the parameters, see their descriptions in the smb.conf(5) man page.
# testparm
After you apply these settings, the names of all newly created files on this share use lowercase.
Because of these settings, Samba no longer needs to scan the directory for uppercase and lowercase
variants of file names, which improves performance.
To use the optimized settings from the kernel, remove the socket options parameter from the [global]
section in the /etc/samba/smb.conf file.
CHAPTER 13. OPTIMIZING VIRTUAL MACHINE PERFORMANCE
The severity of the virtualization impact on the VM performance is influenced by a variety of factors,
which include:
Virtual CPUs (vCPUs) are implemented as threads on the host, handled by the Linux scheduler.
VMs do not automatically inherit optimization features, such as NUMA or huge pages, from the
host kernel.
Disk and network I/O settings of the host might have a significant performance impact on the
VM.
Depending on the host devices and their models, there might be significant overhead due to
emulation of particular hardware.
The TuneD service can automatically optimize the resource distribution and performance of
your VMs.
Block I/O tuning can improve the performance of the VM’s block devices, such as disks.
IMPORTANT
Tuning VM performance can have adverse effects on other virtualization functions. For
example, it can make migrating the modified VM more difficult.
For RHEL 9 virtual machines, use the virtual-guest profile. It is based on the generally
applicable throughput-performance profile, but also decreases the swappiness of virtual
memory.
For RHEL 9 virtualization hosts, use the virtual-host profile. This enables more aggressive
writeback of dirty memory pages, which benefits the host performance.
Prerequisites
Procedure
To enable a specific TuneD profile:
# tuned-adm list
Available profiles:
- balanced - General non-specialized TuneD profile
- desktop - Optimize for the desktop use-case
[...]
- virtual-guest - Optimize for running inside a virtual guest
- virtual-host - Optimize for running KVM guests
Current active profile: balanced
Additional resources
Monolithic libvirt
The traditional libvirt daemon, libvirtd, controls a wide variety of virtualization drivers, using a single
configuration file - /etc/libvirt/libvirtd.conf.
As such, libvirtd allows for centralized hypervisor configuration, but may use system resources
inefficiently. Therefore, libvirtd will become unsupported in a future major release of RHEL.
However, if you updated to RHEL 9 from RHEL 8, your host still uses libvirtd by default.
Modular libvirt
Newly introduced in RHEL 9, modular libvirt provides a specific daemon for each virtualization driver.
These include the following: virtqemud, virtinterfaced, virtnetworkd, virtnodedevd, virtnwfilterd, virtsecretd, and virtstoraged.
Each of the daemons has a separate configuration file - for example /etc/libvirt/virtqemud.conf. As
such, modular libvirt daemons provide better options for fine-tuning libvirt resource management.
Next steps
If your RHEL 9 system uses libvirtd, Red Hat recommends switching to modular daemons. For
instructions, see Enabling modular libvirt daemons.
If you performed a fresh install of a RHEL 9 host, your hypervisor uses modular libvirt daemons by
default. However, if you upgraded your host from RHEL 8 to RHEL 9, your hypervisor uses the
monolithic libvirtd daemon, which is the default in RHEL 8.
If that is the case, Red Hat recommends enabling the modular libvirt daemons instead, because they
provide better options for fine-tuning libvirt resource management. In addition, libvirtd will become
unsupported in a future major release of RHEL.
Prerequisites
Your hypervisor is using the monolithic libvirtd service. To learn whether this is the case:
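A typical check, assuming the standard libvirtd unit name, is the following; if the command prints active, the monolithic daemon is in use:
# systemctl is-active libvirtd.service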
Procedure
Enable the modular libvirt daemons and their sockets:
# for drv in qemu interface network nodedev nwfilter secret storage; do systemctl
unmask virt${drv}d.service; systemctl unmask virt${drv}d{,-ro,-admin}.socket;
systemctl enable virt${drv}d.service; systemctl enable virt${drv}d{,-ro,-admin}.socket;
done
Start the sockets for the modular daemons:
# for drv in qemu network nodedev nwfilter secret storage; do systemctl start
virt${drv}d{,-ro,-admin}.socket; done
5. Optional: If you require connecting to your host from remote hosts, enable and start the
virtualization proxy daemon.
Verification
# virsh uri
qemu:///system
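A typical follow-up check, assuming the virtqemud socket used by modular libvirt, is:
# systemctl is-active virtqemud.socket
active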
If this command displays active, you have successfully enabled modular libvirt daemons.
To perform these actions, you can use the web console or the command-line interface.
13.4.1. Adding and removing virtual machine memory using the web console
To improve the performance of a virtual machine (VM) or to free up the host resources it is using, you
can use the web console to adjust the amount of memory allocated to the VM.
Prerequisites
The guest OS is running the memory balloon drivers. To verify this is the case:
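A typical check, assuming a VM named testguest, is:
# virsh dumpxml testguest | grep memballoon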
If this command displays any output and the model is not set to none, the memballoon
device is present.
In Windows guests, the drivers are installed as a part of the virtio-win driver package.
For instructions, see Installing paravirtualized KVM drivers for Windows virtual
machines.
In Linux guests, the drivers are generally included by default and activate when the
memballoon device is present.
Procedure
1. Optional: Obtain the information about the maximum memory and currently used memory for a
VM. This will serve as a baseline for your changes, and also for verification.
2. In the Virtual Machines interface, click the VM whose information you want to see.
A new page opens with an Overview section with basic information about the selected VM and a
Console section to access the VM’s graphical interface.
Maximum allocation - Sets the maximum amount of host memory that the VM can use for
its processes. You can specify the maximum memory when creating the VM or increase it
later. You can specify memory as multiples of MiB or GiB.
Adjusting maximum memory allocation is only possible on a shut-off VM.
Current allocation - Sets the actual amount of memory allocated to the VM. This value can
be less than the Maximum allocation but cannot exceed it. You can adjust the value to
regulate the memory available to the VM for its processes. You can specify memory as
multiples of MiB or GiB.
If you do not specify this value, the default allocation is the Maximum allocation value.
5. Click Save.
The memory allocation of the VM is adjusted.
Additional resources
Adding and removing virtual machine memory using the command-line interface
13.4.2. Adding and removing virtual machine memory using the command-line
interface
To improve the performance of a virtual machine (VM) or to free up the host resources it is using, you
can use the CLI to adjust the amount of memory allocated to the VM.
Prerequisites
The guest OS is running the memory balloon drivers. To verify this is the case:
If this command displays any output and the model is not set to none, the memballoon
device is present.
In Windows guests, the drivers are installed as a part of the virtio-win driver package.
For instructions, see Installing paravirtualized KVM drivers for Windows virtual
machines.
In Linux guests, the drivers are generally included by default and activate when the
memballoon device is present.
Procedure
1. Optional: Obtain the information about the maximum memory and currently used memory for a
VM. This will serve as a baseline for your changes, and also for verification.
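A typical way to obtain these values, assuming a VM named testguest, is:
# virsh dominfo testguest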
2. Adjust the maximum memory allocated to a VM. Increasing this value improves the performance
potential of the VM, and reducing the value lowers the performance footprint the VM has on
your host. Note that this change can only be performed on a shut-off VM, so adjusting a running
VM requires a reboot to take effect.
For example, to change the maximum memory that the testguest VM can use to 4096 MiB:
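One way to do this with the virt-xml utility, assuming the 4096 MiB value from the example, is:
# virt-xml testguest --edit --memory memory=4096,currentMemory=4096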
To increase the maximum memory of a running VM, you can attach a memory device to the VM.
This is also referred to as memory hot plug. For details, see Attaching devices to virtual
machines.
WARNING
3. Optional: You can also adjust the memory currently used by the VM, up to the maximum
allocation. This regulates the memory load that the VM has on the host until the next reboot,
without changing the maximum VM allocation.
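A typical way to change the current allocation, assuming a target of 2048 MiB for the testguest VM, is:
# virsh setmem testguest --current 2048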
Verification
2. Optional: If you adjusted the current VM memory, you can obtain the memory balloon statistics
of the VM to evaluate how effectively it regulates its memory use.
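The balloon statistics can typically be obtained with, assuming the testguest VM:
# virsh domstats --balloon testguest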
Additional resources
Adding and removing virtual machine memory using the web console
Increasing the I/O weight of a device increases its priority for I/O bandwidth, and therefore provides it
with more host resources. Similarly, reducing a device’s weight makes it consume fewer host resources.
NOTE
Each device’s weight value must be within the 100 to 1000 range. Alternatively, the value
can be 0, which removes that device from per-device listings.
Procedure
To display and set a VM’s block I/O parameters:
<domain>
[...]
<blkiotune>
<weight>800</weight>
<device>
<path>/dev/sda</path>
<weight>1000</weight>
</device>
<device>
<path>/dev/sdb</path>
<weight>500</weight>
</device>
</blkiotune>
[...]
</domain>
For example, the following changes the weight of the /dev/sda device in the liftrul VM to 500.
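A sketch using virsh blkiotune with the values from the example (the VM name liftrul is taken from the surrounding text):
# virsh blkiotune liftrul --device-weights /dev/sda,500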
To enable disk I/O throttling, set a limit on disk I/O requests sent from each block device attached to
VMs to the host machine.
Procedure
1. Use the virsh domblklist command to list the names of all the disk devices on a specified VM.
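The output shown below would be produced by an invocation like the following, with the VM name taken from the example:
# virsh domblklist rollin-coal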
Target Source
------------------------------------------------
vda /var/lib/libvirt/images/rollin-coal.qcow2
sda -
sdb /home/horridly-demanding-processes.iso
2. Find the host block device where the virtual disk that you want to throttle is mounted.
For example, if you want to throttle the sdb virtual disk from the previous step, the following
output shows that the disk is mounted on the /dev/nvme0n1p3 partition.
$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
zram0 252:0 0 4G 0 disk [SWAP]
nvme0n1 259:0 0 238.5G 0 disk
├─nvme0n1p1 259:1 0 600M 0 part /boot/efi
├─nvme0n1p2 259:2 0 1G 0 part /boot
└─nvme0n1p3 259:3 0 236.9G 0 part
└─luks-a1123911-6f37-463c-b4eb-fxzy1ac12fea 253:0 0 236.9G 0 crypt /home
3. Set I/O limits for the block device using the virsh blkiotune command.
The following example throttles the sdb disk on the rollin-coal VM to 1000 read and write I/O
operations per second and to 50 MB per second read and write throughput.
Additional information
Disk I/O throttling can be useful in various situations, for example when VMs belonging to
different customers are running on the same host, or when quality of service guarantees are
given for different VMs. Disk I/O throttling can also be used to simulate slower disks.
I/O throttling can be applied independently to each block device attached to a VM and
supports limits on throughput and I/O operations.
Red Hat does not support using the virsh blkdeviotune command to configure I/O throttling in
VMs. For more information on unsupported features when using RHEL 9 as a VM host, see
Unsupported features in RHEL 9 virtualization .
Procedure
To enable multi-queue virtio-scsi support for a specific VM, add the following to the VM’s XML
configuration, where N is the total number of vCPU queues:
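A typical multi-queue virtio-scsi controller definition, using the N placeholder described above, looks like this:
<controller type='scsi' model='virtio-scsi'>
  <driver queues='N'/>
</controller>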
1. Adjust how many host CPUs are assigned to the VM. You can do this using the CLI or the web
console.
2. Ensure that the vCPU model is aligned with the CPU model of the host. For example, to set the
testguest1 VM to use the CPU model of the host:
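A sketch using the virt-xml utility for this example:
# virt-xml testguest1 --edit --cpu host-model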
4. If your host machine uses Non-Uniform Memory Access (NUMA), you can also configure NUMA
for its VMs. This maps the host’s CPU and memory processes onto the CPU and memory
processes of the VM as closely as possible. In effect, NUMA tuning provides the vCPU with a
more streamlined access to the system memory allocated to the VM, which can improve the
vCPU processing effectiveness.
For details, see Configuring NUMA in a virtual machine and Sample vCPU performance tuning
scenario.
13.6.1. Adding and removing virtual CPUs using the command-line interface
To increase or optimize the CPU performance of a virtual machine (VM), you can add or remove virtual
CPUs (vCPUs) assigned to the VM.
When performed on a running VM, this is also referred to as vCPU hot plugging and hot unplugging.
However, note that vCPU hot unplug is not supported in RHEL 9, and Red Hat highly discourages its use.
Prerequisites
Optional: View the current state of the vCPUs in the targeted VM. For example, to display the
number of vCPUs on the testguest VM:
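The vCPU counts can typically be displayed with:
# virsh vcpucount testguest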
This output indicates that testguest is currently using 1 vCPU, and 1 more vCPU can be hot
plugged to it to increase the VM’s performance. However, after reboot, the number of vCPUs
testguest uses will change to 2, and it will be possible to hot plug 2 more vCPUs.
Procedure
1. Adjust the maximum number of vCPUs that can be attached to a VM, which takes effect on the
VM’s next boot.
For example, to increase the maximum vCPU count for the testguest VM to 8:
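A sketch using virsh setvcpus (the maximum must be changed together with --config):
# virsh setvcpus testguest 8 --maximum --config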
Note that the maximum may be limited by the CPU topology, host hardware, the hypervisor,
and other factors.
2. Adjust the current number of vCPUs attached to a VM, up to the maximum configured in the
previous step. For example:
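Increasing the live vCPU count, assuming a target of 4 vCPUs for the testguest VM, typically looks like this:
# virsh setvcpus testguest 4 --live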
This increases the VM’s performance and host load footprint of testguest until the VM’s
next boot.
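Persistently lowering the vCPU count, assuming a target of 1 vCPU for the testguest VM, typically looks like this:
# virsh setvcpus testguest 1 --config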
This decreases the VM’s performance and host load footprint of testguest after the VM’s
next boot. However, if needed, additional vCPUs can be hot plugged to the VM to
temporarily increase its performance.
Verification
Confirm that the current state of vCPU for the VM reflects your changes.
Additional resources
Prerequisites
Procedure
1. In the Virtual Machines interface, click the VM whose information you want to see.
A new page opens with an Overview section with basic information about the selected VM and a
Console section to access the VM’s graphical interface.
NOTE
vCPU Maximum - The maximum number of virtual CPUs that can be configured for the
VM. If this value is higher than the vCPU Count, additional vCPUs can be attached to the
VM.
Cores per socket - The number of cores for each socket to expose to the VM.
Threads per core - The number of threads for each core to expose to the VM.
Note that the Sockets, Cores per socket, and Threads per core options adjust the CPU
topology of the VM. This may be beneficial for vCPU performance and may impact the
functionality of certain software in the guest OS. If a different setting is not required by your
deployment, keep the default values.
2. Click Apply.
The virtual CPUs for the VM are configured.
NOTE
Changes to virtual CPU settings only take effect after the VM is restarted.
Additional resources
The following methods can be used to configure Non-Uniform Memory Access (NUMA) settings of a
virtual machine (VM) on a RHEL 9 host.
Prerequisites
The host is a NUMA-compatible machine. To detect whether this is the case, use the virsh
nodeinfo command and see the NUMA cell(s) line:
# virsh nodeinfo
CPU model: x86_64
CPU(s): 48
CPU frequency: 1200 MHz
CPU socket(s): 1
Core(s) per socket: 12
Thread(s) per core: 2
NUMA cell(s): 2
Memory size: 67012964 KiB
Procedure
For ease of use, you can set up a VM’s NUMA configuration using automated utilities and services.
However, manual NUMA setup is more likely to yield a significant performance improvement.
Automatic methods
Set the VM’s NUMA policy to Preferred. For example, to do so for the testguest5 VM:
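A sketch using the virt-xml utility (the placement and mode sub-option names are assumptions):
# virt-xml testguest5 --edit --vcpus placement=auto
# virt-xml testguest5 --edit --numatune mode=preferred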
Use the numad command to automatically align the VM CPU with memory resources.
# numad
Manual methods
1. Pin specific vCPU threads to a specific host CPU or range of CPUs. This is also possible on non-
NUMA hosts and VMs, and is recommended as a safe method of vCPU performance
improvement.
For example, the following commands pin vCPU threads 0 to 5 of the testguest6 VM to host
CPUs 1, 3, 5, 7, 9, and 11, respectively:
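Pinning is typically done with virsh vcpupin, one vCPU at a time; a sketch for this example:
# virsh vcpupin testguest6 0 1
# virsh vcpupin testguest6 1 3
# virsh vcpupin testguest6 2 5
# virsh vcpupin testguest6 3 7
# virsh vcpupin testguest6 4 9
# virsh vcpupin testguest6 5 11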
2. After pinning vCPU threads, you can also pin QEMU process threads associated with a specified
VM to a specific host CPU or range of CPUs. For example, the following commands pin the
QEMU process thread of testguest6 to CPUs 13 and 15, and verify this was successful:
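A sketch using virsh emulatorpin; running the command without a CPU list displays the current pinning:
# virsh emulatorpin testguest6 13,15
# virsh emulatorpin testguest6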
3. Finally, you can also specify which host NUMA nodes will be assigned specifically to a certain
VM. This can improve the host memory usage by the VM’s vCPU. For example, the following
commands set testguest6 to use host NUMA nodes 3 to 5, and verify this was successful:
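A sketch using virsh numatune; running the command without options displays the current setting:
# virsh numatune testguest6 --nodeset 3-5
# virsh numatune testguest6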
NOTE
For best performance results, it is recommended to use all of the manual tuning methods
listed above.
Known issues
Additional resources
View the current NUMA configuration of your system using the numastat utility
Starting scenario
2 NUMA nodes
The output of virsh nodeinfo of such a machine would look similar to:
# virsh nodeinfo
CPU model: x86_64
CPU(s): 12
CPU frequency: 3661 MHz
CPU socket(s): 2
Core(s) per socket: 3
Thread(s) per core: 2
NUMA cell(s): 2
Memory size: 31248692 KiB
You intend to modify an existing VM to have 8 vCPUs, which means that it will not fit in a single
NUMA node.
Therefore, you should distribute 4 vCPUs on each NUMA node and make the vCPU topology
resemble the host topology as closely as possible. This means that vCPUs that run as sibling
threads of a given physical CPU should be pinned to host threads on the same core. For details,
see the Solution below:
Solution
# virsh capabilities
The output should include a section that looks similar to the following:
<topology>
<cells num="2">
<cell id="0">
<memory unit="KiB">15624346</memory>
<pages unit="KiB" size="4">3906086</pages>
<pages unit="KiB" size="2048">0</pages>
<pages unit="KiB" size="1048576">0</pages>
<distances>
<sibling id="0" value="10" />
<sibling id="1" value="21" />
</distances>
<cpus num="6">
<cpu id="0" socket_id="0" core_id="0" siblings="0,3" />
<cpu id="1" socket_id="0" core_id="1" siblings="1,4" />
<cpu id="2" socket_id="0" core_id="2" siblings="2,5" />
<cpu id="3" socket_id="0" core_id="0" siblings="0,3" />
<cpu id="4" socket_id="0" core_id="1" siblings="1,4" />
<cpu id="5" socket_id="0" core_id="2" siblings="2,5" />
</cpus>
</cell>
<cell id="1">
<memory unit="KiB">15624346</memory>
<pages unit="KiB" size="4">3906086</pages>
<pages unit="KiB" size="2048">0</pages>
[...]
2. Optional: Test the performance of the VM using the applicable tools and utilities.
3. Configure 1 GiB huge pages on the host by adding the following arguments to the kernel command line:
default_hugepagesz=1G hugepagesz=1G
Create the /etc/systemd/system/hugetlb-gigantic-pages.service file on the host with the following content:
[Unit]
Description=HugeTLB Gigantic Pages Reservation
DefaultDependencies=no
Before=dev-hugepages.mount
ConditionPathExists=/sys/devices/system/node
ConditionKernelCommandLine=hugepagesz=1G
[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/etc/systemd/hugetlb-reserve-pages.sh
[Install]
WantedBy=sysinit.target
Create the /etc/systemd/hugetlb-reserve-pages.sh file on the host with the following content:
#!/bin/sh
nodes_path=/sys/devices/system/node/
if [ ! -d $nodes_path ]; then
echo "ERROR: $nodes_path does not exist"
exit 1
fi
reserve_pages()
{
    # write the requested number of 1 GiB huge pages for the given NUMA node
    echo $1 > $nodes_path/$2/hugepages/hugepages-1048576kB/nr_hugepages
}
reserve_pages 4 node1
reserve_pages 4 node2
This reserves four 1GiB huge pages from node1 and four 1GiB huge pages from node2.
# chmod +x /etc/systemd/hugetlb-reserve-pages.sh
4. Use the virsh edit command to edit the XML configuration of the VM you wish to optimize, in
this example super-VM:
a. Set the VM to use 8 static vCPUs. Use the <vcpu/> element to do this.
b. Pin each of the vCPU threads to the corresponding host CPU threads that it mirrors in the
topology. To do so, use the <vcpupin/> elements in the <cputune> section.
Note that, as shown by the virsh capabilities utility above, host CPU threads are not
ordered sequentially in their respective cores. In addition, the vCPU threads should be
pinned to the highest available set of host cores on the same NUMA node. For a table
illustration, see the Sample topology section below.
The XML configuration for steps a. and b. can look similar to:
<vcpu placement='static'>8</vcpu>
<cputune>
<vcpupin vcpu='0' cpuset='1'/>
<vcpupin vcpu='1' cpuset='4'/>
<vcpupin vcpu='2' cpuset='2'/>
<vcpupin vcpu='3' cpuset='5'/>
<vcpupin vcpu='4' cpuset='7'/>
<vcpupin vcpu='5' cpuset='10'/>
<vcpupin vcpu='6' cpuset='8'/>
<vcpupin vcpu='7' cpuset='11'/>
<emulatorpin cpuset='6,9'/>
</cputune>
c. Set the VM to use 1 GiB huge pages:
<memoryBacking>
<hugepages>
<page size='1' unit='GiB'/>
</hugepages>
</memoryBacking>
d. Configure the VM’s NUMA nodes to use memory from the corresponding NUMA nodes on
the host. To do so, use the <memnode/> elements in the <numatune/> section:
<numatune>
<memory mode="preferred" nodeset="1"/>
<memnode cellid="0" mode="strict" nodeset="0"/>
<memnode cellid="1" mode="strict" nodeset="1"/>
</numatune>
e. Ensure the CPU mode is set to host-passthrough, and that the CPU uses cache in
passthrough mode:
<cpu mode="host-passthrough">
<topology sockets="2" cores="2" threads="2"/>
<cache mode="passthrough"/>
Verification
1. Confirm that the resulting XML configuration of the VM includes a section similar to the
following:
[...]
<memoryBacking>
<hugepages>
<page size='1' unit='GiB'/>
</hugepages>
</memoryBacking>
<vcpu placement='static'>8</vcpu>
<cputune>
<vcpupin vcpu='0' cpuset='1'/>
<vcpupin vcpu='1' cpuset='4'/>
<vcpupin vcpu='2' cpuset='2'/>
<vcpupin vcpu='3' cpuset='5'/>
<vcpupin vcpu='4' cpuset='7'/>
<vcpupin vcpu='5' cpuset='10'/>
<vcpupin vcpu='6' cpuset='8'/>
<vcpupin vcpu='7' cpuset='11'/>
<emulatorpin cpuset='6,9'/>
</cputune>
<numatune>
<memory mode="preferred" nodeset="1"/>
<memnode cellid="0" mode="strict" nodeset="0"/>
<memnode cellid="1" mode="strict" nodeset="1"/>
</numatune>
<cpu mode="host-passthrough">
<topology sockets="2" cores="2" threads="2"/>
<cache mode="passthrough"/>
<numa>
<cell id="0" cpus="0-3" memory="2" unit="GiB">
<distances>
<sibling id="0" value="10"/>
<sibling id="1" value="21"/>
</distances>
</cell>
<cell id="1" cpus="4-7" memory="2" unit="GiB">
<distances>
<sibling id="0" value="21"/>
<sibling id="1" value="10"/>
</distances>
</cell>
</numa>
</cpu>
</domain>
2. Optional: Test the performance of the VM using the applicable tools and utilities to evaluate
the impact of the VM’s optimization.
Sample topology
The following tables illustrate the connections between the vCPUs and the host CPUs they
should be pinned to:
Host topology:
CPU threads    0 3 | 1 4 | 2 5 | 6 9 | 7 10 | 8 11
Cores            0 |   1 |   2 |   3 |    4 |    5
Sockets                0        |           1
NUMA nodes             0        |           1
VM topology:
vCPU threads   0 1 | 2 3 | 4 5 | 6 7
Cores            0 |   1 |   2 |   3
Sockets              0    |      1
NUMA nodes           0    |      1
Combined host and VM topology:
vCPU threads             0 1 | 2 3 |      | 4 5  | 6 7
Host CPU threads   0 3 | 1 4 | 2 5 | 6 9  | 7 10 | 8 11
Cores                0 |   1 |   2 |   3  |    4 |    5
Sockets                      0        |           1
NUMA nodes                   0        |           1
In this scenario, there are 2 NUMA nodes and 8 vCPUs. Therefore, 4 vCPU threads should be
pinned to each node.
In addition, Red Hat recommends leaving at least a single CPU thread available on each node
for host system operations.
Because in this example, each NUMA node houses 3 cores, each with 2 host CPU threads, the
set for node 0 translates as follows:
Depending on your requirements, you can either enable or disable KSM for a single session or
persistently.
NOTE
Prerequisites
Procedure
Disable KSM:
To deactivate KSM for a single session, use the systemctl utility to stop ksm and
ksmtuned services.
To deactivate KSM persistently, use the systemctl utility to disable ksm and ksmtuned
services.
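Typical invocations are stopping the services for a single session and disabling them persistently:
# systemctl stop ksm ksmtuned
# systemctl disable ksm ksmtuned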
NOTE
Memory pages shared between VMs before deactivating KSM will remain shared. To stop
sharing, delete all the PageKSM pages in the system using the following command:
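Merged KSM pages are typically removed by writing 2 to the KSM run control file:
# echo 2 > /sys/kernel/mm/ksm/run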
After anonymous pages replace the KSM pages, the khugepaged kernel service will
rebuild transparent hugepages on the VM’s physical memory.
Enable KSM:
WARNING
Enabling KSM increases CPU utilization and affects overall CPU performance.
To enable KSM for a single session, use the systemctl utility to start the ksm and
ksmtuned services.
To enable KSM persistently, use the systemctl utility to enable the ksm and ksmtuned
services.
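Typical invocations are starting the services for a single session and enabling them persistently:
# systemctl start ksm ksmtuned
# systemctl enable ksm ksmtuned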
Procedure
Use any of the following methods and observe whether it has a beneficial effect on your VM network
performance:
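A typical check of whether the vhost_net kernel module is loaded looks like this:
# lsmod | grep vhost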
If the output of this command is blank, enable the vhost_net kernel module:
# modprobe vhost_net
Set up multi-queue virtio-net by adding the following to the VM’s XML configuration, where N is the number of vCPU queues:
<interface type='network'>
<source network='default'/>
<model type='virtio'/>
<driver name='vhost' queues='N'/>
</interface>
SR-IOV
If your host NIC supports SR-IOV, use SR-IOV device assignment for your vNICs. For more
information, see Managing SR-IOV devices.
Additional resources
On your RHEL 9 host, as root, use the top utility or the system monitor application, and look for
qemu and virt in the output. This shows how much host system resources your VMs are
consuming.
If the monitoring tool displays that any of the qemu or virt processes consume a large
portion of the host CPU or memory capacity, use the perf utility to investigate. For details,
see below.
On the guest operating system, use performance utilities and applications available on the
system to evaluate which processes consume the most system resources.
perf kvm
You can use the perf utility to collect and analyze virtualization-specific statistics about the
performance of your RHEL 9 host. To do so:
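To use perf kvm, the perf package must be installed; a typical installation command is:
# dnf install perf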
2. Use one of the perf kvm stat commands to display perf statistics for your virtualization host:
For real-time monitoring of your hypervisor, use the perf kvm stat live command.
To log the perf data of your hypervisor over a period of time, activate the logging using the
perf kvm stat record command. After the command is canceled or interrupted, the data is
saved in the perf.data.guest file, which can be analyzed using the perf kvm stat report
command.
3. Analyze the perf output for types of VM-EXIT events and their distribution. For example, the
PAUSE_INSTRUCTION events should be infrequent, but in the following output, the high
occurrence of this event suggests that the host CPUs are not handling the running vCPUs well.
In such a scenario, consider shutting down some of your active VMs, removing vCPUs from
these VMs, or tuning the performance of the vCPUs.
VM-EXIT         Samples  Samples%   Time%   Min Time     Max Time      Avg time
[...]
HLT               20440     1.77%  69.83%     0.62us   79319.41us   14134.56us ( +- 0.79% )
VMCALL            12426     1.07%   0.03%     1.02us    5416.25us       8.77us ( +- 7.36% )
EXCEPTION_NMI        27     0.00%   0.00%     0.69us       1.34us       0.98us ( +- 3.50% )
EPT_MISCONFIG         5     0.00%   0.00%     5.15us      10.85us       7.88us ( +- 11.67% )
Other event types that can signal problems in the output of perf kvm stat include:
For more information on using perf to monitor virtualization performance, see the perf-kvm man page.
numastat
To see the current NUMA configuration of your system, you can use the numastat utility, which is
provided by installing the numactl package.
The following shows a host with 4 running VMs, each obtaining memory from multiple NUMA nodes. This
is not optimal for vCPU performance, and warrants adjusting:
# numastat -c qemu-kvm
In contrast, the following shows memory being provided to each VM by a single node, which is
significantly more efficient.
# numastat -c qemu-kvm
CHAPTER 14. IMPORTANCE OF POWER MANAGEMENT
reduced secondary costs, including cooling, space, cables, generators, and uninterruptible
power supplies (UPS)
meeting government regulations or legal requirements regarding Green IT, for example, Energy
Star
This section provides information about power management of your Red Hat Enterprise Linux
systems.
SpeedStep
PowerNow!
Cool’n’Quiet
ACPI (C-state)
Smart
If your hardware has support for these features and they are enabled in the BIOS, Red Hat
Enterprise Linux uses them by default.
Sleep (C-states)
However, performing these tasks once for a large number of nearly identical systems where you can
reuse the same settings for all systems can be very useful. For example, consider the deployment of
thousands of desktop systems, or an HPC cluster where the machines are nearly identical. Another
reason to do auditing and analysis is to provide a basis for comparison against which you can identify
regressions or changes in system behavior in the future. The results of this analysis can be very helpful in
cases where hardware, BIOS, or software updates happen regularly and you want to avoid any surprises
with regard to power consumption. Generally, a thorough audit and analysis gives you a much better idea
of what is really happening on a particular system.
Auditing and analyzing a system with regard to power consumption is relatively hard, even with the most
modern systems available. Most systems do not provide the necessary means to measure power use via
software. Exceptions exist though:
iLO management console of Hewlett Packard server systems has a power management module
that you can access through the web.
On some Dell systems, the IT Assistant offers power monitoring capabilities as well.
Other vendors are likely to offer similar capabilities for their server platforms, but as can be seen, there is
no single solution that is supported by all vendors. Direct measurements of power consumption
are often necessary only to maximize savings as far as possible.
Many of these tools are also used for performance tuning. They include:
PowerTOP
It identifies specific components of kernel and user-space applications that frequently wake up the
CPU. Use the powertop command as root to start the PowerTOP tool and powertop --calibrate to
calibrate the power estimation engine. For more information about PowerTOP, see Managing power
consumption with PowerTOP.
Diskdevstat and netdevstat
They are SystemTap tools that collect detailed information about the disk activity and network
activity of all applications running on a system. Using the statistics collected by these tools, you can
identify applications that waste power with many small I/O operations rather than fewer, larger
operations. Using the dnf install tuned-utils-systemtap kernel-debuginfo command as root, install
the diskdevstat and netdevstat tools.
To view the detailed information about the disk and network activity, use:
# diskdevstat
# netdevstat
With these commands, you can specify three parameters: update_interval, total_duration, and
display_histogram.
TuneD
It is a profile-based system tuning tool that uses the udev device manager to monitor connected
devices, and enables both static and dynamic tuning of system settings. You can use the tuned-adm
recommend command to determine which profile Red Hat recommends as the most suitable for a
particular product. For more information on TuneD, see Getting started with TuneD and Customizing
TuneD profiles. Using the powertop2tuned utility, you can create custom TuneD profiles from
PowerTOP suggestions. For information on the powertop2tuned utility, see Optimizing power
consumption.
Virtual memory statistics (vmstat)
It is provided by the procps-ng package. Using this tool, you can view the detailed information about
processes, memory, paging, block I/O, traps, and CPU activity.
To view this information, use:
$ vmstat
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
1 0 0 5805576 380856 4852848 0 0 119 73 814 640 2 2 96 0 0
Using the vmstat -a command, you can display active and inactive memory. For more information on
other vmstat options, see the vmstat man page.
iostat
It is provided by the sysstat package. This tool is similar to vmstat, but only for monitoring I/O on
block devices. It also provides more verbose output and statistics.
To monitor the system I/O, use:
$ iostat
avg-cpu: %user %nice %system %iowait %steal %idle
2.05 0.46 1.55 0.26 0.00 95.67
blktrace
It provides detailed information about how time is spent in the I/O subsystem.
To view this information in human readable format, use:
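A typical invocation, assuming the /dev/dm-0 device that matches the 253,0 tuple discussed below, pipes blktrace into blkparse:
# blktrace -d /dev/dm-0 -o - | blkparse -i -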
Here, the first column, 253,0, is the device major and minor tuple. The second column, 1, gives
information about the CPU, followed by columns for the timestamp and the PID of the process issuing
the I/O.
The sixth column, Q, shows the event type, the seventh column, W, indicates a write operation, the
eighth column, 76423384, is the block number, and + 8 is the number of requested blocks.
By default, the blktrace command runs forever until the process is explicitly killed. Use the -w option
to specify the run-time duration.
turbostat
It is provided by the kernel-tools package. It reports on processor topology, frequency, idle power-
state statistics, temperature, and power usage on x86-64 processors.
To view this summary, use:
# turbostat
By default, turbostat prints a summary of counter results for the entire system, followed by counter
results every 5 seconds. Specify a different period between counter results with the -i option, for
example, execute turbostat -i 10 to print results every 10 seconds instead.
Turbostat is also useful for identifying servers that are inefficient in terms of power usage or idle
time. It also helps to identify the rate of system management interrupts (SMIs) occurring on the
system. It can also be used to verify the effects of power management tuning.
cpupower
It is a collection of tools to examine and tune power saving related features of processors. Use the
cpupower command with the frequency-info, frequency-set, idle-info, idle-set, set, info, and
monitor options to display and set processor related values.
For example, to view available cpufreq governors, use:
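A sketch listing the available governors:
$ cpupower frequency-info --governors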
For more information about cpupower, see Viewing CPU related information.
Additional resources
CHAPTER 15. MANAGING POWER CONSUMPTION WITH POWERTOP
The PowerTOP tool can provide an estimate of the total power usage of the system and also individual
power usage for each process, device, kernel worker, timer, and interrupt handler. The tool can also
identify specific components of kernel and user-space applications that frequently wake up the CPU.
Prerequisites
To be able to use PowerTOP, make sure that the powertop package has been installed on your
system:
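A typical installation command is:
# dnf install powertop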
Procedure
# powertop
IMPORTANT
Laptops should run on battery power when running the powertop command.
Procedure
1. On a laptop, you can calibrate the power estimation engine by running the following command:
# powertop --calibrate
2. Let the calibration finish without interacting with the machine during the process.
Calibration takes time because the process performs various tests, cycles through brightness
levels and switches devices on and off.
3. When the calibration process is completed, PowerTOP starts as normal. Let it run for
approximately an hour to collect data.
When enough data is collected, power estimation figures will be displayed in the first column of
the output table.
NOTE
If you want to change this measuring frequency, use the following procedure:
Procedure
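The measuring interval can typically be changed with the --time option, where time is the desired interval in seconds:
# powertop --time=time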
Overview
Idle stats
Frequency stats
Device stats
Tunables
WakeUp
You can use the Tab and Shift+Tab keys to cycle through these tabs.
The adjacent columns within the Overview tab provide the following pieces of information:
Usage
If properly calibrated, a power consumption estimation for every listed item in the first column is shown
as well.
Apart from this, the Overview tab includes the line with summary statistics such as:
Summary of total wakeups per second, GPU operations per second, and virtual file system
operations per second
Use the up and down keys to move through suggestions, and the enter key to toggle the suggestion on
or off.
Use the up and down keys to move through the available settings, and the enter key to enable or
disable a setting.
Additional resources
In total, there are three possible modes of the Intel P-State driver:
Passive mode
Switching to the ACPI CPUfreq driver results in complete information being displayed by PowerTOP.
However, it is recommended to keep your system on the default settings.
To check which driver is currently loaded, use:
# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver
intel_pstate is returned if the Intel P-State driver is loaded and in active mode.
intel_cpufreq is returned if the Intel P-State driver is loaded and in passive mode.
While using the Intel P-State driver, add the following argument to the kernel boot command line to
force the driver to run in passive mode:
intel_pstate=passive
To disable the Intel P-State driver and use, instead, the ACPI CPUfreq driver, add the following
argument to the kernel boot command line:
intel_pstate=disable
Procedure
# powertop --html=htmlfile.html
Replace the htmlfile.html parameter with the required name for the output file.
Procedure
By default, powertop2tuned creates profiles in the /etc/tuned/ directory, and bases the custom profile
on the currently selected TuneD profile. For safety reasons, all PowerTOP tunings are initially disabled
in the new profile.
Use the --enable or -e option to generate a new profile that enables most of the tunings
suggested by PowerTOP.
Certain potentially problematic tunings, such as the USB autosuspend, are disabled by default
and need to be uncommented manually.
Prerequisites
Procedure
# powertop2tuned new_profile_name
Additional information
$ powertop2tuned --help
The powertop2tuned utility represents the integration of PowerTOP into TuneD, which enables you to
benefit from the advantages of both tools.
CHAPTER 16. GETTING STARTED WITH PERF
Procedure
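perf is typically installed with the following command:
# dnf install perf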
perf stat
This command provides overall statistics for common performance events, including instructions
executed and clock cycles consumed. Options allow for selection of events other than the default
measurement events.
perf record
This command records performance data into a file, perf.data, which can be later analyzed using the
perf report command.
perf report
This command reads and displays the performance data from the perf.data file created by perf
record.
perf list
This command lists the events available on a particular machine. These events will vary based on
performance monitoring hardware and software configuration of the system.
perf top
This command performs a similar function to the top utility. It generates and displays a performance
counter profile in realtime.
perf trace
This command performs a similar function to the strace tool. It monitors the system calls used by a
specified thread or process and all signals received by that application.
perf help
Additional resources
CHAPTER 17. PROFILING CPU USAGE IN REAL TIME WITH PERF TOP
Prerequisites
You have the perf user space tool installed as described in Installing perf.
Prerequisites
You have the perf user space tool installed as described in Installing perf.
Procedure
# perf top
Samples: 8K of event 'cycles', 2000 Hz, Event count (approx.): 4579432780 lost: 0/0 drop: 0/0
Overhead Shared Object Symbol
2.20% [kernel] [k] do_syscall_64
2.17% [kernel] [k] module_get_kallsym
1.49% [kernel] [k] copy_user_enhanced_fast_string
1.37% libpthread-2.29.so [.] pthread_mutex_lock
1.31% [unknown] [.] 0000000000000000
1.07% [kernel] [k] psi_task_change
1.04% [kernel] [k] switch_mm_irqs_off
0.94% [kernel] [k] fget
0.74% [kernel] [k] entry_SYSCALL_64
0.69% [kernel] [k] syscall_return_via_sysret
0.69% libxul.so [.] 0x000000000113f9b0
0.67% [kernel] [k] kallsyms_expand_symbol.constprop.0
0.65% firefox [.] moz_xmalloc
0.65% libpthread-2.29.so [.] __pthread_mutex_unlock_usercnt
0.60% firefox [.] free
0.60% libxul.so [.] 0x000000000241d1cd
0.60% [kernel] [k] do_sys_poll
In this example, the kernel function do_syscall_64 is using the most CPU time.
Additional resources
The debuginfo package of the executable must be installed or, if the executable is a locally developed
application, the application must be compiled with debugging information turned on (the -g option in
GCC) to display the function names or symbols in such a situation.
NOTE
It is not necessary to re-run the perf record command after installing the debuginfo
associated with an executable. Simply re-run the perf report command.
Additional Resources
Procedure
Enable the source and debug information package channels: The $(uname -i) part is
automatically replaced with a matching value for architecture of your system:
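A typical invocation on RHEL 9, assuming the standard BaseOS and AppStream debug and source repositories, is:
# subscription-manager repos --enable rhel-9-for-$(uname -i)-baseos-debug-rpms \
--enable rhel-9-for-$(uname -i)-baseos-source-rpms \
--enable rhel-9-for-$(uname -i)-appstream-debug-rpms \
--enable rhel-9-for-$(uname -i)-appstream-source-rpms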
Prerequisites
The application or library you want to debug must be installed on the system.
GDB and the debuginfo-install tool must be installed on the system. For details, see Setting up
to debug applications.
Channels providing debuginfo and debugsource packages must be configured and enabled on
the system.
Procedure
1. Start GDB attached to the application or library you want to debug. GDB automatically
recognizes missing debugging information and suggests a command to run.
$ gdb -q /bin/ls
Reading symbols from /bin/ls...Reading symbols from .gnu_debugdata for /usr/bin/ls...(no
debugging symbols found)...done.
(no debugging symbols found)...done.
Missing separate debuginfos, use: dnf debuginfo-install coreutils-8.30-6.el8.x86_64
(gdb)
(gdb) q
3. Run the command suggested by GDB to install the required debuginfo packages:
The dnf package management tool provides a summary of the changes, asks for confirmation
and once you confirm, downloads and installs all the necessary files.
4. In case GDB is not able to suggest the debuginfo package, follow the procedure described in
Getting debuginfo packages for an application or library manually .
Additional resources
Red Hat Developer Toolset User Guide, section Installing Debugging Information
How can I download or install debuginfo packages for RHEL systems? — Red Hat
Knowledgebase solution
CHAPTER 18. COUNTING EVENTS DURING PROCESS EXECUTION WITH PERF STAT
Prerequisites
You have the perf user space tool installed as described in Installing perf.
Prerequisites
You have the perf user space tool installed as described in Installing perf.
Procedure
Running the perf stat command without root access will only count events occurring in the
user space:
$ perf stat ls
As you can see in the previous example, when perf stat runs without root access the event
names are followed by :u, indicating that these events were counted only in the user-space.
To count both user-space and kernel-space events, you must have root access when
running perf stat:
# perf stat ls
# perf stat -a ls
Additional resources
3. When related metrics are available, a ratio or percentage is displayed after the hash sign (#) in
the right-most column.
For example, when running in default mode, perf stat counts both cycles and instructions and,
therefore, calculates and displays instructions per cycle in the right-most column. You can see
similar behavior with regard to branch-misses as a percent of all branches since both events are
counted by default.
Prerequisites
You have the perf user space tool installed as described in Installing perf.
Procedure
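A typical invocation matching the description below is:
# perf stat -p ID1,ID2 sleep seconds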
The previous example counts events in the processes with the IDs of ID1 and ID2 for a time
period of seconds seconds as dictated by using the sleep command.
Additional resources
CHAPTER 19. RECORDING AND ANALYZING PERFORMANCE PROFILES WITH PERF
Prerequisites
You have the perf user space tool installed as described in Installing perf.
If you do not specify a command for perf record to record during, it will record until you manually stop
the process by pressing Ctrl+C. You can attach perf record to specific processes by passing the -p
option followed by one or more process IDs. You can run perf record without root access, however,
doing so will only sample performance data in the user space. In the default mode, perf record uses
CPU cycles as the sampling event and operates in per-thread mode with inherit mode enabled.
Prerequisites
You have the perf user space tool installed as described in Installing perf.
Procedure
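A typical invocation is:
# perf record command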
Replace command with the command you want to sample data during. If you do not specify a
command, then perf record will sample data until you manually stop it by pressing Ctrl+C.
Additional resources
Prerequisites
You have the perf user space tool installed as described in Installing perf.
Procedure
Replace command with the command you want to sample data during. If you do not specify a
command, then perf record will sample data until you manually stop it by pressing Ctrl+C.
Additional resources
Prerequisites
You have the perf user space tool installed as described in Installing perf.
Procedure
Replace command with the command you want to sample data during. If you do not specify a
command, then perf record will sample data until you manually stop it by pressing Ctrl+C.
Additional resources
Prerequisites
You have the perf user space tool installed as described in Installing perf.
Procedure
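Capturing call graph data typically uses the --call-graph option, where method is one of the unwinding methods described below:
# perf record --call-graph method command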
Replace command with the command you want to sample data during. If you do not specify
a command, then perf record will sample data until you manually stop it by pressing Ctrl+C.
fp
Uses the frame pointer method. Depending on compiler optimization, such as with
binaries built with the GCC option --fomit-frame-pointer, this may not be able to unwind
the stack.
dwarf
Uses DWARF Call Frame Information to unwind the stack.
lbr
Uses the last branch record hardware on Intel processors.
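Putting this together, one possible form of the call-graph recording command, shown as a sketch using the frame pointer method:
# perf record --call-graph fp command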
Additional resources
Prerequisites
You have the perf user space tool installed as described in Installing perf.
If the perf.data file was created with root access, you need to run perf report with root access
too.
Procedure
# perf report
Additional resources
In default mode, the functions are sorted in descending order with those with the highest overhead
displayed first.
Prerequisites
You have the perf user space tool installed as described in Installing perf.
The kernel debuginfo package is installed. For more information, see Getting debuginfo
packages for an application or library using GDB.
Procedure
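A sketch of the recording step, assuming a system-wide sample with frame-pointer call graphs (-g is the short form of --call-graph fp):
# perf record -a -g sleep seconds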
This example would generate a perf.data over the entire system for a period of seconds
seconds as dictated by the use of the sleep command. It would also capture call graph data
using the frame pointer method.
# perf archive
Verification steps
Verify that the archive file has been generated in your current active directory:
# ls perf.data*
The output will display every file in your current directory that begins with perf.data. The archive
file will be named either:
perf.data.tar.gz
or
perf.data.tar.bz2
Additional resources
Prerequisites
You have the perf user space tool installed as described in Installing perf.
A perf.data file and associated archive file generated on a different device are present on the
current device being used.
Procedure
1. Copy both the perf.data file and the archive file into your current active directory.
# mkdir -p ~/.debug
# tar xf perf.data.tar.bz2 -C ~/.debug
NOTE
# perf report
To display the function names or symbols in such a situation, the debuginfo package of the executable must be installed or, if the executable is a locally developed application, the application must be compiled with debugging information turned on (the -g option in GCC).
NOTE
It is not necessary to re-run the perf record command after installing the debuginfo
associated with an executable. Simply re-run the perf report command.
Additional Resources
Procedure
Enable the source and debug information package channels. The $(uname -i) part is automatically replaced with the value matching your system architecture:
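The repository IDs below follow the standard RHEL 9 naming and are shown as an assumption rather than the original example:
# subscription-manager repos --enable rhel-9-for-$(uname -i)-baseos-debug-rpms \
--enable rhel-9-for-$(uname -i)-baseos-source-rpms \
--enable rhel-9-for-$(uname -i)-appstream-debug-rpms \
--enable rhel-9-for-$(uname -i)-appstream-source-rpms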
Prerequisites
The application or library you want to debug must be installed on the system.
GDB and the debuginfo-install tool must be installed on the system. For details, see Setting up
to debug applications.
Channels providing debuginfo and debugsource packages must be configured and enabled on
the system.
Procedure
1. Start GDB attached to the application or library you want to debug. GDB automatically
recognizes missing debugging information and suggests a command to run.
$ gdb -q /bin/ls
Reading symbols from /bin/ls...Reading symbols from .gnu_debugdata for /usr/bin/ls...(no
debugging symbols found)...done.
(no debugging symbols found)...done.
Missing separate debuginfos, use: dnf debuginfo-install coreutils-8.30-6.el8.x86_64
(gdb)
(gdb) q
3. Run the command suggested by GDB to install the required debuginfo packages:
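In this example, that is the command shown in the GDB message above:
# dnf debuginfo-install coreutils-8.30-6.el8.x86_64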
The dnf package management tool provides a summary of the changes, asks for confirmation
and once you confirm, downloads and installs all the necessary files.
4. In case GDB is not able to suggest the debuginfo package, follow the procedure described in
Getting debuginfo packages for an application or library manually .
Additional resources
Red Hat Developer Toolset User Guide, section Installing Debugging Information
How can I download or install debuginfo packages for RHEL systems? — Red Hat
Knowledgebase solution
CHAPTER 20. INVESTIGATING BUSY CPUS WITH PERF
Prerequisites
You have the perf user space tool installed as described in Installing perf.
Procedure
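A sketch of such a counting command; the -A option disables aggregation so counts are reported per CPU:
# perf stat -a -A sleep seconds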
The previous example displays counts of a default set of common hardware and software
events recorded over a time period of seconds seconds, as dictated by using the sleep
command, over each individual CPU in ascending order, starting with CPU0. As such, it may be
useful to specify an event such as cycles:
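For example, a sketch that restricts the counts to the cycles event:
# perf stat -a -A -e cycles sleep seconds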
Prerequisites
You have the perf user space tool installed as described in Installing perf.
There is a perf.data file created with perf record in the current directory. If the perf.data file
was created with root access, you need to run perf report with root access too.
Procedure
Display the contents of the perf.data file for further analysis while sorting by CPU:
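One way to do this, shown as a sketch using the cpu sort key:
# perf report --sort cpu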
You can sort by CPU and command to display more detailed information about where CPU
time is being spent:
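For example, using the cpu and comm sort keys (a sketch):
# perf report --sort cpu,comm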
This example will list commands from all monitored CPUs by total overhead in descending
order of overhead usage and identify the CPU the command was executed on.
Additional resources
Prerequisites
You have the perf user space tool installed as described in Installing perf.
Procedure
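A sketch of a real-time view sorted by CPU:
# perf top --sort cpu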
This example will list CPUs and their respective overhead in descending order of overhead
usage in real time.
You can sort by CPU and command for more detailed information of where CPU time is
being spent:
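For example, using the cpu and comm sort keys (a sketch):
# perf top --sort cpu,comm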
This example will list commands by total overhead in descending order of overhead usage
and identify the CPU the command was executed on in real time.
Prerequisites
You have the perf user space tool installed as described in Installing perf.
Procedure
1. Sample and record the performance data in the specific CPUs, generating a perf.data file:
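A sketch of the recording command for CPUs 0 and 1 (the -C option selects the CPUs to sample):
# perf record -C 0,1 sleep seconds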
The previous example samples and records data in CPUs 0 and 1 for a period of seconds
seconds as dictated by the use of the sleep command.
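A range of CPUs can be given instead, as in this sketch:
# perf record -C 0-2 sleep seconds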
The previous example samples and records data in all CPUs from CPU 0 to 2 for a period of
seconds seconds as dictated by the use of the sleep command.
# perf report
This example will display the contents of perf.data. If you are monitoring several CPUs and want
to know which CPU data was sampled on, see Displaying which CPU samples were taken on with
perf report.
CHAPTER 21. MONITORING APPLICATION PERFORMANCE WITH PERF
Prerequisites
You have the perf user space tool installed as described in Installing perf.
Procedure
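A sketch of such a command, with ID1 and ID2 as placeholder process IDs:
# perf record -p ID1,ID2 sleep seconds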
The previous example samples and records performance data of the processes with the process IDs ID1 and ID2 for a time period of seconds seconds as dictated by using the sleep command. You can also configure perf to record events in specific threads:
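For example, a sketch using placeholder thread IDs TID1 and TID2:
# perf record -t TID1,TID2 sleep seconds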
NOTE
When using the -t flag and specifying thread IDs, perf disables inheritance by default. You can enable inheritance by adding the --inherit option.
Prerequisites
You have the perf user space tool installed as described in Installing perf.
Procedure
Replace command with the command you want to sample data during. If you do not specify
a command, then perf record will sample data until you manually stop it by pressing Ctrl+C.
fp
Uses the frame pointer method. Depending on compiler optimization, such as with
binaries built with the GCC option --fomit-frame-pointer, this may not be able to unwind
the stack.
dwarf
Uses DWARF Call Frame Information to unwind the stack.
lbr
Uses the last branch record hardware on Intel processors.
Additional resources
Prerequisites
You have the perf user space tool installed as described in Installing perf.
If the perf.data file was created with root access, you need to run perf report with root access
too.
Procedure
# perf report
Additional resources
CHAPTER 22. CREATING UPROBES WITH PERF
Prerequisites
You have the perf user space tool installed as described in Installing perf.
Procedure
1. Create the uprobe in the process or application you are interested in monitoring at a location of
interest within the process or application:
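A sketch of this step; the executable path and function name are placeholders:
# perf probe -x /path/to/executable function_name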
Additional resources
Prerequisites
You have the perf user space tool installed as described in Installing perf.
NOTE
To do this, the debuginfo package of the executable must be installed or, if the
executable is a locally developed application, the application must be compiled
with debugging information, the -g option in GCC.
Procedure
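Output like the listing below is produced by a line-probing command of roughly this form (a sketch; the binary path is taken from the listing itself):
# perf probe -x /home/user/my_executable -L main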
<main@/home/user/my_executable:0>
0 int main(int argc, const char **argv)
1 {
int err;
const char *cmd;
char sbuf[STRERR_BUFSIZE];
/* libsubcmd init */
7 exec_cmd_init("perf", PREFIX, PERF_EXEC_PATH,
EXEC_PATH_ENVIRONMENT);
8 pager_init(PERF_PAGER_ENVIRONMENT);
In the perf script example output:
A uprobe is added to the function isprime() in a program called my_prog.
a is a function argument added to the uprobe. Alternatively, a could be an arbitrary variable visible in the code scope of where you add your uprobe:
# perf script
my_prog 1367 [007] 10802159.906593: probe_my_prog:isprime: (400551) a=2
my_prog 1367 [007] 10802159.906623: probe_my_prog:isprime: (400551) a=3
my_prog 1367 [007] 10802159.906625: probe_my_prog:isprime: (400551) a=4
my_prog 1367 [007] 10802159.906627: probe_my_prog:isprime: (400551) a=5
my_prog 1367 [007] 10802159.906629: probe_my_prog:isprime: (400551) a=6
my_prog 1367 [007] 10802159.906631: probe_my_prog:isprime: (400551) a=7
my_prog 1367 [007] 10802159.906633: probe_my_prog:isprime: (400551) a=13
my_prog 1367 [007] 10802159.906635: probe_my_prog:isprime: (400551) a=17
my_prog 1367 [007] 10802159.906637: probe_my_prog:isprime: (400551) a=19
CHAPTER 23. PROFILING MEMORY ACCESSES WITH PERF MEM
Prerequisites
You have the perf user space tool installed as described in Installing perf.
Procedure
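A sketch of the sampling command:
# perf mem record -a sleep seconds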
This example samples memory accesses across all CPUs for a period of seconds seconds as
dictated by the sleep command. You can replace the sleep command for any command during
which you want to sample memory access data. By default, perf mem samples both memory
loads and stores. You can select only one memory operation by using the -t option and
specifying either "load" or "store" between perf mem and record. For loads, information over
the memory hierarchy level, TLB memory accesses, bus snoops, and memory locks is captured.
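To browse the recorded samples, open the report interface; a sketch of that step:
# perf mem report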
Available samples
35k cpu/mem-loads,ldlat=30/P
54k cpu/mem-stores/P
The cpu/mem-loads,ldlat=30/P line denotes data collected over memory loads and the
cpu/mem-stores/P line denotes data collected over memory stores. Highlight the category of
interest and press Enter to view the data:
Alternatively, you can sort your results to investigate different aspects of interest when
displaying the data. For example, to sort data over memory loads by type of memory accesses
occurring during the sampling period in descending order of overhead they account for:
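One possible form of such a command, shown as a sketch (mem is the sort key for the memory access type):
# perf mem -t load report --sort=mem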
Additional resources
IMPORTANT
Oftentimes, due to dynamic allocation of memory or stack memory being accessed, the
'Data Symbol' column will display a raw address.
In default mode, the functions are sorted in descending order with those with the highest overhead
displayed first.
CHAPTER 24. DETECTING FALSE SHARING
This initial modification requires that the other processors using the cache line invalidate their copy and
request an updated one despite the processors not needing, or even necessarily having access to, an
updated version of the modified data item.
You can use the perf c2c command to detect false sharing.
Cache-line contention occurs when a processor core on a Symmetric Multi Processing (SMP) system
modifies data items on the same cache line that is in use by other processors. All other processors using
this cache-line must then invalidate their copy and request an updated one. This can lead to degraded
performance.
The perf c2c command supports the same options as perf record as well as some options exclusive to
the c2c subcommand. The recorded data is stored in a perf.data file in the current directory for later
analysis.
Prerequisites
The perf user space tool is installed. For more information, see installing perf.
Procedure
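A sketch of the recording command:
# perf c2c record -a sleep seconds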
This example samples and records cache-line contention data across all CPUs for a period of
seconds as dictated by the sleep command. You can replace the sleep command with any
command you want to collect cache-line contention data over.
Additional resources
Prerequisites
The perf user space tool is installed. For more information, see Installing perf.
A perf.data file recorded using the perf c2c command is available in the current directory. For
more information, see Detecting cache-line contention with perf c2c.
Procedure
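The report is generated with the perf c2c report --stdio command:
# perf c2c report --stdio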
This command visualizes the perf.data file into several graphs within the terminal:
=================================================
Trace Event Information
=================================================
Total records : 329219
Locked Load/Store Operations : 14654
Load Operations : 69679
Loads - uncacheable : 0
Loads - IO : 0
Loads - Miss : 3972
Loads - no mapping : 0
Load Fill Buffer Hit : 11958
Load L1D hit : 17235
Load L2D hit : 21
Load LLC hit : 14219
Load Local HITM : 3402
Load Remote HITM : 12757
Load Remote HIT : 5295
Load Local DRAM : 976
Load Remote DRAM : 3246
Load MESI State Exclusive : 4222
Load MESI State Shared : 0
Load LLC Misses : 22274
LLC Misses to Local DRAM : 4.4%
LLC Misses to Remote DRAM : 14.6%
LLC Misses to Remote cache (HIT) : 23.8%
LLC Misses to Remote cache (HITM) : 57.3%
Store Operations : 259539
Store - uncacheable : 0
Store - no mapping : 11
=================================================
Global Shared Cache Line Event Information
=================================================
Total Shared Cache Lines : 55
Load HITs on shared lines : 55454
Fill Buffer Hits on shared lines : 10635
L1D hits on shared lines : 16415
L2D hits on shared lines : 0
LLC hits on shared lines : 8501
Locked Access on shared lines : 14351
Store HITs on shared lines : 109953
Store L1D hits on shared lines : 109449
Total Merged records : 126112
=================================================
c2c details
=================================================
Events : cpu/mem-loads,ldlat=30/P
: cpu/mem-stores/P
Cachelines sort on : Remote HITMs
Cacheline data groupping : offset,pid,iaddr
=================================================
Shared Data Cache Line Table
=================================================
#
# Total Rmt ----- LLC Load Hitm ----- ---- Store Reference ---- --- Load
Dram ---- LLC Total ----- Core Load Hit ----- -- LLC Load Hit --
# Index Cacheline records Hitm Total Lcl Rmt Total L1Hit L1Miss
Lcl Rmt Ld Miss Loads FB L1 L2 Llc Rmt
# ..... .................. ....... ....... ....... ....... ....... ....... ....... ....... ........ ........ ....... ....... .......
....... ....... ........ ........
#
0 0x602180 149904 77.09% 12103 2269 9834 109504 109036 468
727 2657 13747 40400 5355 16154 0 2875 529
1 0x602100 12128 22.20% 3951 1119 2832 0 0 0 65
200 3749 12128 5096 108 0 2056 652
2 0xffff883ffb6a7e80 260 0.09% 15 3 12 161 161 0 1
1 15 99 25 50 0 6 1
3 0xffffffff81aec000 157 0.07% 9 0 9 1 0 1 0 7
20 156 50 59 0 27 4
4 0xffffffff81e3f540 179 0.06% 9 1 8 117 97 20 0 10
25 62 11 1 0 24 7
=================================================
Shared Cache Line Distribution Pareto
=================================================
#
# ----- HITM ----- -- Store Refs -- Data address ---------- cycles --
-------- cpu Shared
# Num Rmt Lcl L1 Hit L1 Miss Offset Pid Code address rmt hitm lcl
-------------------------------------------------------------
1 2832 1119 0 0 0x602100
-------------------------------------------------------------
29.13% 36.19% 0.00% 0.00% 0x20 14604 0x400bb3 1964
1230 1788 2 [.] read_write_func no_false_sharing.exe
false_sharing_example.c:155 1{122} 2{144}
43.68% 34.41% 0.00% 0.00% 0x28 14604 0x400bcd 2274
1566 1793 2 [.] read_write_func no_false_sharing.exe
false_sharing_example.c:159 2{53} 3{170}
27.19% 29.40% 0.00% 0.00% 0x30 14604 0x400be7 2045
1247 2011 2 [.] read_write_func no_false_sharing.exe
false_sharing_example.c:163 0{96} 3{171}
The visualization displayed by running the perf c2c report --stdio command sorts the data into several
tables:
The virtual address of each cache line is contained in the Data address Offset column and
followed subsequently by the offset into the cache line where different accesses occurred.
The Code Address column contains the instruction pointer code address.
The columns under the cycles label show average load latencies.
The cpu cnt column displays how many different CPUs samples came from (essentially, how
many different CPUs were waiting for the data indexed at that given location).
The Shared Object column displays the name of the ELF image where the samples come
from (the name [kernel.kallsyms] is used when the samples come from the kernel).
The Source:Line column displays the source file and line number.
The Node{cpu list} column displays which specific CPUs samples came from for each node.
Prerequisites
The perf user space tool is installed. For more information, see installing perf.
A perf.data file recorded using the perf c2c command is available in the current directory. For
more information, see Detecting cache-line contention with perf c2c.
Procedure
2. In the "Trace Event Information" table, locate the row containing the values for LLC Misses to
Remote Cache (HITM):
The percentage in the value column of the LLC Misses to Remote Cache (HITM) row represents the percentage of LLC misses that occurred across NUMA nodes in modified cache lines, and it is a key indicator that false sharing has occurred.
=================================================
Trace Event Information
=================================================
Total records : 329219
Locked Load/Store Operations : 14654
Load Operations : 69679
Loads - uncacheable : 0
Loads - IO : 0
Loads - Miss : 3972
Loads - no mapping : 0
Load Fill Buffer Hit : 11958
Load L1D hit : 17235
Load L2D hit : 21
Load LLC hit : 14219
Load Local HITM : 3402
Load Remote HITM : 12757
Load Remote HIT : 5295
Load Local DRAM : 976
Load Remote DRAM : 3246
Load MESI State Exclusive : 4222
Load MESI State Shared : 0
Load LLC Misses : 22274
LLC Misses to Local DRAM : 4.4%
LLC Misses to Remote DRAM : 14.6%
LLC Misses to Remote cache (HIT) : 23.8%
LLC Misses to Remote cache (HITM) : 57.3%
Store Operations : 259539
Store - uncacheable : 0
Store - no mapping : 11
Store L1D Hit : 256696
Store L1D Miss : 2832
No Page Map Rejects : 2376
Unable to parse data source : 1
3. Inspect the Rmt column of the LLC Load Hitm field of the Shared Data Cache Line Table:
=================================================
Shared Data Cache Line Table
=================================================
#
# Total Rmt ----- LLC Load Hitm ----- ---- Store Reference ---- ---
Load Dram ---- LLC Total ----- Core Load Hit ----- -- LLC Load Hit --
# Index Cacheline records Hitm Total Lcl Rmt Total L1Hit L1Miss
Lcl Rmt Ld Miss Loads FB L1 L2 Llc Rmt
# ..... .................. ....... ....... ....... ....... ....... ....... ....... ....... ........ ........ ....... ....... .......
....... ....... ........ ........
#
0 0x602180 149904 77.09% 12103 2269 9834 109504 109036
468 727 2657 13747 40400 5355 16154 0 2875 529
1 0x602100 12128 22.20% 3951 1119 2832 0 0 0 65
200 3749 12128 5096 108 0 2056 652
2 0xffff883ffb6a7e80 260 0.09% 15 3 12 161 161 0 1
1 15 99 25 50 0 6 1
3 0xffffffff81aec000 157 0.07% 9 0 9 1 0 1 0 7
20 156 50 59 0 27 4
4 0xffffffff81e3f540 179 0.06% 9 1 8 117 97 20 0
10 25 62 11 1 0 24 7
This table is sorted in descending order by the amount of remote Hitm detected per cache line.
A high number in the Rmt column of the LLC Load Hitm section indicates false sharing and
requires further inspection of the cache line on which it occurred to debug the false sharing
activity.
CHAPTER 25. GETTING STARTED WITH FLAMEGRAPHS
Sampling stack traces is a common technique for profiling CPU performance with the perf tool.
Unfortunately, the results of profiling stack traces with perf can be extremely verbose and labor-
intensive to analyze. Flamegraphs are visualizations created from data recorded with perf to make
identifying hot code-paths faster and easier.
Procedure
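A sketch of the installation step; the js-d3-flame-graph package name is an assumption about which package provides the flamegraph script:
# dnf install js-d3-flame-graph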
Prerequisites
Procedure
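A sketch of the recording command; the sampling frequency of 99 is an assumption, while the other options mirror the description below:
# perf script flamegraph -a -F 99 sleep 60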
This command samples and records performance data over the entire system for 60 seconds,
as stipulated by use of the sleep command, and then constructs the visualization which will be
stored in the current active directory as flamegraph.html. The command samples call-graph
data by default and takes the same arguments as the perf tool, in this particular case:
-a
Records data over the entire system.
-F
Sets the sampling frequency in samples per second.
Verification steps
# xdg-open flamegraph.html
Prerequisites
Procedure
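A sketch of the recording command; the sampling frequency of 99 is an assumption, and ID1 and ID2 are placeholder process IDs:
# perf script flamegraph -a -F 99 -p ID1,ID2 sleep 60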
This command samples and records performance data of the processes with the process IDs
ID1 and ID2 for 60 seconds, as stipulated by use of the sleep command, and then constructs
the visualization which will be stored in the current active directory as flamegraph.html. The
command samples call-graph data by default and takes the same arguments as the perf tool, in
this particular case:
-a
Records data over the entire system.
-F
Sets the sampling frequency in samples per second.
-p
Specifies the process IDs to sample and record data over.
Verification steps
# xdg-open flamegraph.html
The children of a stack in a given row are displayed based on the number of samples taken of each
respective function in descending order along the x-axis; the x-axis does not represent the passing of
time. The wider an individual box is, the more frequent it was on-CPU or part of an on-CPU ancestry at
the time the data was being sampled.
Procedure
To reveal the names of functions that may not have been displayed previously and to investigate the data further, click a box within the flamegraph to zoom into the stack at that given location:
Additional resources
CHAPTER 26. MONITORING PROCESSES FOR PERFORMANCE BOTTLENECKS USING PERF CIRCULAR BUFFERS
The --overwrite option makes perf record store all data in an overwritable circular buffer. When the
buffer gets full, perf record automatically overwrites the oldest records which, therefore, never get
written to a perf.data file.
Using the --overwrite and --switch-output-event options together configures a circular buffer that
records and dumps data continuously until it detects the --switch-output-event trigger event. The
trigger event signals to perf record that something of interest to the user has occurred and to write the
data in the circular buffer to a perf.data file. This collects specific data you are interested in while
simultaneously reducing the overhead of the running perf process by not writing data you do not want
to a perf.data file.
Prerequisites
You have the perf user space tool installed as described in Installing perf.
You have placed a uprobe in the process or application you are interested in monitoring at a
location of interest within the process or application:
Procedure
Create the circular buffer with the uprobe as the trigger event:
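A sketch of this step; the executable name my_prog and the uprobe probe_my_prog:isprime are borrowed from the earlier uprobe example and are assumptions here:
# perf record --overwrite -e cycles --switch-output-event probe_my_prog:isprime ./my_prog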
This example initiates the executable and collects cpu cycles, specified after the -e option, until
perf detects the uprobe, the trigger event specified after the --switch-output-event option. At
that point, perf takes a snapshot of all the data in the circular buffer and stores it in a unique
perf.data file identified by timestamp. This example produced a total of two snapshots; the last perf.data file was forced by pressing Ctrl+C.
CHAPTER 27. ADDING AND REMOVING TRACEPOINTS FROM A RUNNING PERF COLLECTOR WITHOUT STOPPING OR RESTARTING PERF
Prerequisites
You have the perf user space tool installed as described in Installing perf.
Procedure
2. Run perf record with the control file setup and events you are interested in enabling:
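A minimal sketch of this step; the FIFO names control and ack are assumptions and must exist before perf record starts:
# mkfifo control ack
# perf record --control=fifo:control,ack -D -1 -e 'sched:*'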
In this example, declaring 'sched:*' after the -e option starts perf record with scheduler events.
Starting the read side of the control pipe triggers the following message in the first terminal:
Events disabled
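A sketch of the enabling request, assuming the control FIFO is named control:
# echo 'enable sched:sched_process_fork' > control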
This command triggers perf to scan the current event list in the control file for the declared
event. If the event is present, the tracepoint is enabled and the following message appears in
the first terminal:
Once the tracepoint is enabled, the second terminal displays the output from perf detecting the
tracepoint:
Prerequisites
You have the perf user space tool installed as described in Installing perf.
You have added tracepoints to a running perf collector via the control pipe interface. For more
information, see Adding tracepoints to a running perf collector without stopping or restarting
perf.
Procedure
NOTE
This example assumes you have previously loaded scheduler events into the
control file and enabled the tracepoint sched:sched_process_fork.
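A sketch of the disabling request, assuming the control FIFO is named control:
# echo 'disable sched:sched_process_fork' > control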
This command triggers perf to scan the current event list in the control file for the declared
event. If the event is present, the tracepoint is disabled and the following message appears in
the terminal used to configure the control pipe:
CHAPTER 28. PROFILING MEMORY ALLOCATION WITH NUMASTAT
The numastat tool displays data for each NUMA node separately. You can use this information to
investigate memory performance of your system or the effectiveness of different memory policies on
your system.
numa_hit
The number of pages that were successfully allocated to this node.
numa_miss
The number of pages that were allocated on this node because of low memory on the intended node.
Each numa_miss event has a corresponding numa_foreign event on another node.
numa_foreign
The number of pages initially intended for this node that were allocated to another node instead.
Each numa_foreign event has a corresponding numa_miss event on another node.
interleave_hit
The number of interleave policy pages successfully allocated to this node.
local_node
The number of pages successfully allocated on this node by a process on this node.
other_node
The number of pages allocated on this node by a process on another node.
NOTE
High numa_hit values and low numa_miss values (relative to each other) indicate
optimal performance.
Prerequisites
Procedure
$ numastat
node0 node1
Additional resources
CHAPTER 29. CONFIGURING AN OPERATING SYSTEM TO OPTIMIZE CPU UTILIZATION
turbostat tool prints counter results at specified intervals to help administrators identify
unexpected behavior in servers, such as excessive power usage, failure to enter deep sleep
states, or system management interrupts (SMIs) being created unnecessarily.
numactl utility provides a number of options to manage processor and memory affinity. The
numactl package includes the libnuma library which offers a simple programming interface to
the NUMA policy supported by the kernel, and can be used for more fine-grained tuning than
the numactl application.
numastat tool displays per-NUMA node memory statistics for the operating system and its
processes, and shows administrators whether the process memory is spread throughout a
system or is centralized on specific nodes. This tool is provided by the numactl package.
numad is an automatic NUMA affinity management daemon. It monitors NUMA topology and
resource usage within a system in order to dynamically improve NUMA resource allocation and
management.
/proc/interrupts file displays the interrupt request (IRQ) number, the number of similar
interrupt requests handled by each processor in the system, the type of interrupt sent, and a
comma-separated list of devices that respond to the listed interrupt request.
pqos utility is available in the intel-cmt-cat package. It monitors CPU cache and memory
bandwidth on recent Intel processors. It monitors:
The size in kilobytes that the program executing in a given CPU occupies in the LLC.
taskset tool is provided by the util-linux package. It allows administrators to retrieve and set
the processor affinity of a running process, or launch a process with a specified processor
affinity.
Additional resources
The following are the two primary types of topology used in modern computing:
Symmetric Multi-Processor (SMP) topology
SMP topology allows all processors to access memory in the same amount of time.
Non-Uniform Memory Access (NUMA) topology
NUMA topology groups processors into nodes; a processor accesses memory on its own node faster than memory on a remote node.
Multi-threaded applications that are sensitive to performance may benefit from being configured to
execute on a specific NUMA node rather than a specific processor. Whether this is suitable depends
on your system and the requirements of your application. If multiple application threads access the
same cached data, then configuring those threads to execute on the same processor may be
suitable. However, if multiple threads that access and cache different data execute on the same
processor, each thread may evict cached data accessed by a previous thread. This means that each
thread 'misses' the cache and wastes execution time fetching data from memory and replacing it in
the cache. Use the perf tool to check for an excessive number of cache misses.
Procedure
$ numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 4 8 12 16 20 24 28 32 36
node 0 size: 65415 MB
node 0 free: 43971 MB
[...]
To gather the information about the CPU architecture, such as the number of CPUs, threads,
cores, sockets, and NUMA nodes:
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 40
On-line CPU(s) list: 0-39
Thread(s) per core: 1
Core(s) per socket: 10
Socket(s): 4
NUMA node(s): 4
Vendor ID: GenuineIntel
CPU family: 6
Model: 47
Model name: Intel(R) Xeon(R) CPU E7- 4870 @ 2.40GHz
Stepping: 2
CPU MHz: 2394.204
BogoMIPS: 4787.85
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 30720K
NUMA node0 CPU(s): 0,4,8,12,16,20,24,28,32,36
NUMA node1 CPU(s): 2,6,10,14,18,22,26,30,34,38
NUMA node2 CPU(s): 1,5,9,13,17,21,25,29,33,37
NUMA node3 CPU(s): 3,7,11,15,19,23,27,31,35,39
Additional resources
By default, Red Hat Enterprise Linux 9 uses a tickless kernel, which does not interrupt idle CPUs in order
to reduce power usage and allow new processors to take advantage of deep sleep states.
Red Hat Enterprise Linux 9 also offers a dynamic tickless option, which is useful for latency-sensitive
workloads, such as high performance computing or realtime computing. By default, the dynamic tickless
option is disabled. Red Hat recommends using the cpu-partitioning TuneD profile to enable the
dynamic tickless option for cores specified as isolated_cores.
This procedure describes how to manually enable dynamic tickless behavior persistently.
Procedure
1. To enable dynamic tickless behavior in certain cores, specify those cores on the kernel
command line with the nohz_full parameter. On a 16 core system, enable the nohz_full=1-15
kernel option:
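One way to do this persistently on RHEL 9, shown as a sketch:
# grubby --update-kernel=ALL --args="nohz_full=1-15"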
This enables dynamic tickless behavior on cores 1 through 15, moving all timekeeping to the
only unspecified core (core 0).
2. When the system boots, manually move the rcu threads to the non-latency-sensitive core, in
this case core 0:
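One possible way to do this, shown as a sketch:
# for i in $(pgrep "rcu[^c]"); do taskset -pc 0 "$i"; done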
3. Optional: Use the isolcpus parameter on the kernel command line to isolate certain cores from
user-space tasks.
4. Optional: Set the CPU affinity for the kernel’s write-back bdi-flush threads to the
housekeeping core:
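A sketch of this step:
# echo 1 > /sys/bus/workqueue/devices/writeback/cpumask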
Verification steps
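One way to perform this check, shown as a sketch; irq_vectors:local_timer_entry is the tracepoint counted in the output below:
# perf stat -C 1 -e irq_vectors:local_timer_entry taskset -c 1 sleep 3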
This command measures ticks on CPU 1 while telling CPU 1 to sleep for 3 seconds.
The default kernel timer configuration shows around 3100 ticks on a regular CPU:
3,107 irq_vectors:local_timer_entry
With the dynamic tickless kernel configured, you should see around 4 ticks instead:
4 irq_vectors:local_timer_entry
Additional resources
How to verify the list of "isolated" and "nohz_full" CPU information from sysfs? Red Hat
Knowledgebase article
Because an interrupt halts normal operation, high interrupt rates can severely degrade system
performance. It is possible to reduce the amount of time taken by interrupts by configuring interrupt
affinity or by sending a number of lower priority interrupts in a batch (coalescing a number of interrupts).
Interrupt requests have an associated affinity property, smp_affinity, which defines the processors that
handle the interrupt request. To improve application performance, assign interrupt affinity and process
affinity to the same processor, or processors on the same core. This allows the specified interrupt and
application threads to share cache lines.
On systems that support interrupt steering, modifying the smp_affinity property of an interrupt request
sets up the hardware so that the decision to service an interrupt with a particular processor is made at
the hardware level with no intervention from the kernel.
Procedure
1. Check which devices correspond to the interrupt requests that you want to configure.
2. Find the hardware specification for your platform. Check if the chipset on your system supports
distributing interrupts.
a. If it does, you can configure interrupt delivery as described in the following steps.
Additionally, check which algorithm your chipset uses to balance interrupts. Some BIOSes
have options to configure interrupt delivery.
b. If it does not, your chipset always routes all interrupts to a single, static CPU. You cannot
configure which CPU is used.
3. Check which Advanced Programmable Interrupt Controller (APIC) mode is in use on your
system:
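One way to check this, shown as a sketch:
# journalctl --dmesg | grep APIC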
In the output:
If your system uses a mode other than flat, you can see a line similar to Setting APIC
routing to physical flat.
If you can see no such message, your system uses flat mode.
If your system uses x2apic mode, you can disable it by adding the nox2apic option to the
kernel command line in the bootloader configuration.
Only non-physical flat mode (flat) supports distributing interrupts to multiple CPUs. This
mode is available only for systems that have up to 8 CPUs.
4. Calculate the smp_affinity mask. For more information on how to calculate the smp_affinity
mask, see Setting the smp_affinity mask.
Additional resources
The default value of the mask is f, which means that an interrupt request can be handled on any
processor in the system. Setting this value to 1 means that only processor 0 can handle the interrupt.
Procedure
1. In binary, use the value 1 for CPUs that handle the interrupts. For example, to set CPU 0 and
CPU 7 to handle interrupts, use 0000000010000001 as the binary code:
CPU     15 14 13 12 11 10  9  8  7  6  5  4  3  2  1  0
Binary   0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  1
2. Convert the binary code to hexadecimal. In this example, the binary value 0000000010000001 is 0x81 in hexadecimal.
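If you want to verify the conversion from the shell, a quick sketch using Bash arithmetic with a base-2 literal:
$ printf '%x\n' "$((2#0000000010000001))"
81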
On systems with more than 32 processors, you must delimit the smp_affinity values for
discrete 32 bit groups. For example, if you want only the first 32 processors of a 64 processor
system to service an interrupt request, use 0xffffffff,00000000.
3. The interrupt affinity value for a particular interrupt request is stored in the associated
/proc/irq/irq_number/smp_affinity file. Set the smp_affinity mask in this file:
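For example, using the mask calculated above and irq_number as a placeholder (the value is written in hexadecimal without a 0x prefix):
# echo 81 > /proc/irq/irq_number/smp_affinity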
Additional resources
CHAPTER 30. TUNING SCHEDULING POLICY
For example, say an application on a NUMA system is running on Node A when a processor on Node B
becomes available. To keep the processor on Node B busy, the scheduler moves one of the
application’s threads to Node B. However, the application thread still requires access to memory on
Node A. But, this memory will take longer to access because the thread is now running on Node B and
Node A memory is no longer local to the thread. Thus, it may take longer for the thread to finish running
on Node B than it would have taken to wait for a processor on Node A to become available, and then to
execute the thread on the original node with local memory access.
Normal policies
Normal threads are used for tasks of normal priority.
Realtime policies
Realtime policies are used for time-sensitive tasks that must complete without interruptions.
Realtime threads are not subject to time slicing. This means that they run until they block, exit, voluntarily yield, or are preempted by a higher priority thread.
The lowest priority realtime thread is scheduled before any thread with a normal policy. For more
information, see Static priority scheduling with SCHED_FIFO and Round robin priority scheduling
with SCHED_RR.
Additional resources
When SCHED_FIFO is in use, the scheduler scans the list of all the SCHED_FIFO threads in order of
priority and schedules the highest priority thread that is ready to run. The priority level of a
SCHED_FIFO thread can be any integer from 1 to 99, where 99 is treated as the highest priority.
Red Hat recommends starting with a lower number and increasing priority only when you identify latency
issues.
WARNING
Because realtime threads are not subject to time slicing, Red Hat does not recommend setting a priority of 99. This keeps your process at the same priority
level as migration and watchdog threads; if your thread goes into a computational
loop and these threads are blocked, they will not be able to run. Systems with a
single processor will eventually hang in this situation.
Administrators can limit SCHED_FIFO bandwidth to prevent realtime application programmers from
initiating realtime tasks that monopolize the processor.
/proc/sys/kernel/sched_rt_period_us
This parameter defines the time period, in microseconds, that is considered to be one hundred
percent of the processor bandwidth. The default value is 1000000 µs, or 1 second.
/proc/sys/kernel/sched_rt_runtime_us
This parameter defines the time period, in microseconds, that is devoted to running real-time
threads. The default value is 950000 µs, or 0.95 seconds.
Like SCHED_FIFO, SCHED_RR is a realtime policy that defines a fixed priority for each thread. The
scheduler scans the list of all SCHED_RR threads in order of priority and schedules the highest priority
thread that is ready to run. However, unlike SCHED_FIFO, threads that have the same priority are
scheduled in a round-robin style within a certain time slice.
You can set the value of this time slice in milliseconds with the sched_rr_timeslice_ms kernel
parameter in the /proc/sys/kernel/sched_rr_timeslice_ms file. The lowest value is 1 millisecond.
When this policy is in use, the scheduler creates a dynamic priority list based partly on the niceness value
of each process thread. Administrators can change the niceness value of a process, but cannot change
the scheduler’s dynamic priority list directly.
Procedure
# ps
Use the --pid or -p option with the ps command to view the details of the particular PID.
# chrt -p 468
pid 468's current scheduling policy: SCHED_FIFO
pid 468's current scheduling priority: 85
# chrt -p 476
pid 476's current scheduling policy: SCHED_OTHER
pid 476's current scheduling priority: 0
a. For example, to set the process with PID 1000 to SCHED_FIFO, with a priority of 50:
# chrt -f -p 50 1000
b. For example, to set the process with PID 1000 to SCHED_OTHER, with a priority of 0:
# chrt -o -p 0 1000
c. For example, to set the process with PID 1000 to SCHED_RR, with a priority of 10:
# chrt -r -p 10 1000
d. To start a new application with a particular policy and priority, specify the name of the
application:
# chrt -f 36 /bin/my-app
Additional resources
The following table describes the appropriate policy options, which can be used to set the scheduling
policy of a process.
The boot process priority change is done by using the following directives in the service section:
CPUSchedulingPolicy=
Sets the CPU scheduling policy for executed processes. It is used to set other, fifo, and rr policies.
CPUSchedulingPriority=
Sets the CPU scheduling priority for executed processes. The available priority range depends on
the selected CPU scheduling policy. For real-time scheduling policies, an integer between 1 (lowest
priority) and 99 (highest priority) can be used.
The following procedure describes how to change the priority of a service, during the boot process,
using the mcelog service.
Prerequisites
Procedure
# tuna --show_threads
thread ctxt_switches
pid SCHED_ rtpri affinity voluntary nonvoluntary cmd
1 OTHER 0 0xff 3181 292 systemd
2 OTHER 0 0xff 254 0 kthreadd
3 OTHER 0 0xff 2 0 rcu_gp
4 OTHER 0 0xff 2 0 rcu_par_gp
6 OTHER 0 0 9 0 kworker/0:0H-kblockd
7 OTHER 0 0xff 1301 1 kworker/u16:0-events_unbound
8 OTHER 0 0xff 2 0 mm_percpu_wq
9 OTHER 0 0 266 0 ksoftirqd/0
[...]
2. Create a supplementary mcelog service configuration directory file and insert the policy name
and priority in this file:
# cat << EOF > /etc/systemd/system/mcelog.service.d/priority.conf
[Service]
CPUSchedulingPolicy=fifo
CPUSchedulingPriority=20
EOF
# systemctl daemon-reload
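For the new policy to take effect, the service also needs to be restarted; a sketch of that step:
# systemctl restart mcelog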
Verification steps
# tuna -t mcelog -P
thread ctxt_switches
pid SCHED_ rtpri affinity voluntary nonvoluntary cmd
826 FIFO 20 0,1,2,3 13 0 mcelog
Additional resources
The following table describes the priority range, which can be used while setting the scheduling policy of
a process.
Prior to Red Hat Enterprise Linux 9, the low-latency Red Hat documentation described the numerous
low-level steps needed to achieve low-latency tuning. In Red Hat Enterprise Linux 9, you can perform
low-latency tuning more efficiently by using the cpu-partitioning TuneD profile. This profile is easily
customizable according to the requirements for individual low-latency applications.
The following figure is an example to demonstrate how to use the cpu-partitioning profile. This
example uses the CPU and node layout.
In the cpu-partitioning figure, the blocks numbered from 4 to 23, are the default isolated CPUs. The
kernel scheduler’s process load balancing is enabled on these CPUs. It is designed for low-latency
processes with multiple threads that need the kernel scheduler load balancing.
You can configure the cpu-partitioning profile in the /etc/tuned/cpu-partitioning-variables.conf file
using the isolated_cores=cpu-list option, which lists CPUs to isolate that will use the kernel
scheduler load balancing.
The list of isolated CPUs is comma-separated or you can specify a range using a dash, such as 3-5.
This option is mandatory. Any CPU missing from this list is automatically considered a housekeeping
CPU.
Specifying the no_balance_cores option is optional, however any CPUs in this list must be a subset
of the CPUs listed in the isolated_cores list.
Application threads using these CPUs need to be pinned individually to each CPU.
Housekeeping CPUs
Any CPU not isolated in the cpu-partitioning-variables.conf file is automatically considered a
housekeeping CPU. On the housekeeping CPUs, all services, daemons, user processes, movable
kernel threads, interrupt handlers, and kernel timers are permitted to execute.
Additional resources
One dedicated reader thread that reads data from the network will be pinned to CPU 2.
A large number of threads that process this network data will be pinned to CPUs 4-23.
A dedicated writer thread that writes the processed data to the network will be pinned to CPU
3.
Prerequisites
You have installed the cpu-partitioning TuneD profile by using the dnf install tuned-profiles-
cpu-partitioning command as root.
Procedure
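A minimal sketch of the configuration that precedes the reboot; the CPU lists shown are assumptions based on the example layout:
# echo 'isolated_cores=2-23' >> /etc/tuned/cpu-partitioning-variables.conf
# echo 'no_balance_cores=2,3' >> /etc/tuned/cpu-partitioning-variables.conf
# tuned-adm profile cpu-partitioning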
3. Reboot
After rebooting, the system is tuned for low-latency, according to the isolation in the cpu-
partitioning figure. The application can use taskset to pin the reader and writer threads to CPUs
2 and 3, and the remaining application threads on CPUs 4-23.
Additional resources
For example, the cpu-partitioning profile sets the CPUs to use cstate=1. In order to use the cpu-
partitioning profile but to additionally change the CPU cstate from cstate1 to cstate0, the following
procedure describes a new TuneD profile named my_profile, which inherits the cpu-partitioning profile
and then sets C state-0.
Procedure
# mkdir /etc/tuned/my_profile
2. Create a tuned.conf file in this directory, and add the following content:
# vi /etc/tuned/my_profile/tuned.conf
[main]
summary=Customized tuning on top of cpu-partitioning
include=cpu-partitioning
[cpu]
force_latency=cstate.id:0|1
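Then activate the new profile; a sketch of the remaining step:
# tuned-adm profile my_profile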
NOTE
In the shared example, a reboot is not required. However, if the changes in the my_profile
profile require a reboot to take effect, then reboot your machine.
Additional resources
CHAPTER 31. CONFIGURING AN OPERATING SYSTEM TO OPTIMIZE ACCESS TO NETWORK RESOURCES
The TuneD service provides a number of different profiles to improve performance in a number of
specific use cases:
latency-performance
network-latency
network-throughput
ss utility prints statistical information about sockets, enabling administrators to assess device performance over time. By default, ss displays open non-listening sockets that have established connections. Using command-line options, administrators can filter out statistics about specific sockets. Red Hat recommends ss over the deprecated netstat utility in Red Hat Enterprise Linux.
ip utility lets administrators manage and monitor routes, devices, routing policies, and tunnels.
The ip monitor command can continuously monitor the state of devices, addresses, and routes.
Use the -j option to display the output in JSON format, which can be further provided to other
utilities to automate information processing.
dropwatch is an interactive tool, provided by the dropwatch package. It monitors and records
packets that are dropped by the kernel.
ethtool utility enables administrators to view and edit network interface card settings. Use this
tool to observe the statistics, such as the number of packets dropped by that device, of certain
devices. Using the ethtool -S device name command, view the status of a specified device’s
counters of the device you want to monitor.
/proc/net/snmp file displays data that the snmp agent uses for IP, ICMP, TCP and UDP
monitoring and management. Examining this file on a regular basis helps administrators to
identify unusual values and thereby identify potential performance problems. For example, an
increase in UDP input errors (InErrors) in the /proc/net/snmp file can indicate a bottleneck in a
socket receive queue.
nstat tool monitors kernel SNMP and network interface statistics. This tool reads data from the
/proc/net/snmp file and prints the information in a human readable format.
By default, the SystemTap scripts, provided by the systemtap-client package are installed in
the /usr/share/systemtap/examples/network directory:
nettop.stp: Every 5 seconds, the script displays a list of processes (process identifier and
command) with the number of packets sent and received and the amount of data sent and
received by the process during that interval.
socket-trace.stp: Instruments each of the functions in the Linux kernel’s net/socket.c file,
and displays trace data.
dropwatch.stp: Every 5 seconds, the script displays the number of socket buffers freed at
locations in the kernel. Use the --all-modules option to see symbolic names.
latencytap.stp: This script records the effect that different types of latency have on one or
more processes. It prints a list of latency types every 30 seconds, sorted in descending order
by the total time the process or processes spent waiting. This can be useful for identifying
the cause of both storage and network latency.
Red Hat recommends using the --all-modules option with this script to better enable the
mapping of latency events. By default, this script is installed in the
/usr/share/systemtap/examples/profiling directory.
BPF Compiler Collection (BCC) is a library, which facilitates the creation of the extended
Berkeley Packet Filter (eBPF) programs. The main utility of the eBPF programs is analyzing OS
performance and network performance without experiencing overhead or security issues.
Additional resources
/usr/share/systemtap/examples/network directory
/usr/share/doc/bcc/README.md file
How to write a NetworkManager dispatcher script to apply ethtool commands? Red Hat
Knowlegebase solution
If the hardware buffer drops a large number of packets, the following are the few potential solutions:
Filter the incoming traffic, reduce the number of joined multicast groups, or reduce the amount of
broadcast traffic to decrease the rate at which the queue fills.
Resize the hardware buffer queue
Resize the hardware buffer queue: Reduce the number of packets being dropped by increasing the
size of the queue so that it does not overflow as easily. You can modify the rx/tx parameters of the
network device with the ethtool command:
ethtool --set-ring device-name rx value tx value
Decrease the rate at which the queue fills by filtering or dropping packets before they reach
the queue, or by lowering the weight of the device. Filter incoming traffic or lower the
network interface card’s device weight to slow incoming traffic.
The device weight refers to the number of packets a device can receive at one time in a
single scheduled processor access. You can increase the rate at which a queue is drained by
increasing its device weight that is controlled by the dev_weight kernel setting. To
temporarily alter this parameter, change the contents of the /proc/sys/net/core/dev_weight
file, or to permanently alter, use the sysctl command, which is provided by the procps-ng
package.
Increase the length of the application’s socket queue: This is typically the easiest way to
improve the drain rate of a socket queue, but it is unlikely to be a long-term solution. If a
socket queue receives a limited amount of traffic in bursts, increasing the depth of the
socket queue to match the size of the bursts of traffic may prevent packets from being
dropped. To increase the depth of a queue, increase the size of the socket receive buffer by
making either of the following changes:
Use the setsockopt system call to configure a larger SO_RCVBUF value: This parameter controls
the maximum size in bytes of a socket’s receive buffer. Use the getsockopt system call
to determine the current value of the buffer.
Altering the drain rate of a queue is usually the simplest way to mitigate poor network performance.
However, increasing the number of packets that a device can receive at one time uses additional
processor time, during which no other processes can be scheduled, so this can cause other
performance problems.
Additional resources
/proc/net/snmp file
Busy polling helps to reduce latency in the network receive path by allowing socket layer code to poll the
receive queue of a network device, and disables network interrupts. This removes delays caused by the
interrupt and the resultant context switch. However, it also increases CPU utilization. Busy polling also
prevents the CPU from sleeping, which can incur additional power consumption. Busy polling behavior is
supported by all the device drivers.
Additional resources
Procedure
a. To enable busy polling on specific sockets, set the sysctl.net.core.busy_poll kernel value
to a value other than 0:
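For example, using the recommended value of 50 (a sketch):
# sysctl -w net.core.busy_poll=50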
This parameter controls the number of microseconds to wait for packets on the socket poll
and select syscalls. Red Hat recommends a value of 50.
c. To enable busy polling globally, set the sysctl.net.core.busy_read to a value other than 0:
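For example, a sketch using the same value of 50 (an assumption for the global setting):
# sysctl -w net.core.busy_read=50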
Additional resources
The number of queues or the CPUs that should process network activity for RSS are configured in the
appropriate network device driver:
The irqbalance daemon can be used in conjunction with RSS to reduce the likelihood of cross-node
memory transfers and cache line bouncing. This lowers the latency of processing network packets.
When enabled, RSS distributes network processing equally between available CPUs based on the amount of processing each CPU has queued. You can use the --show-rxfh-indir and --set-rxfh-indir parameters of the ethtool utility to modify how RHEL distributes network activity and to weigh certain types of network activity as more important than others.
Procedure
To determine whether your network interface card supports RSS, check whether multiple
interrupt request queues are associated with the interface in /proc/interrupts:
The output shows that the NIC driver created 6 receive queues for the p1p1 interface (p1p1-0
through p1p1-5). It also shows how many interrupts were processed by each queue, and which
CPU serviced the interrupt. In this case, there are 6 queues because by default, this particular
NIC driver creates one queue per CPU, and this system has 6 CPUs. This is a fairly common
pattern among NIC drivers.
To list the interrupt request queue for a PCI device with the address 0000:01:00.0:
# ls -1 /sys/devices/*/*/0000:01:00.0/msi_irqs
101
102
103
104
105
106
107
108
109
RPS does not increase the hardware interrupt rate of the network device. However, it does
introduce inter-processor interrupts.
RPS is configured per network device and receive queue, in the /sys/class/net/device/queues/rx-
queue/rps_cpus file, where device is the name of the network device, such as enp1s0 and rx-queue is
the name of the appropriate receive queue, such as rx-0.
The default value of the rps_cpus file is 0. This disables RPS, and the CPU handles the network
interrupt and also processes the packet. To enable RPS, configure the appropriate rps_cpus file with
the CPUs that should process packets from the specified network device and receive queue.
The rps_cpus files use comma-delimited CPU bitmaps. Therefore, to allow a CPU to handle interrupts
for the receive queue on an interface, set the value of their positions in the bitmap to 1. For example, to
handle interrupts with CPUs 0, 1, 2, and 3, set the value of the rps_cpus to f, which is the hexadecimal
value for 15. In binary representation, 15 is 00001111 (1+2+4+8).
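For example, a sketch using the enp1s0 device and the rx-0 queue mentioned above:
# echo f > /sys/class/net/enp1s0/queues/rx-0/rps_cpus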
For network devices with single transmit queues, best performance can be achieved by configuring RPS
to use CPUs in the same memory domain. On non-NUMA systems, this means that all available CPUs
can be used. If the network interrupt rate is extremely high, excluding the CPU that handles network
interrupts may also improve performance.
For network devices with multiple queues, there is typically no benefit to configure both RPS and RSS,
as RSS is configured to map a CPU to each receive queue by default. However, RPS can still be
beneficial if there are fewer hardware queues than CPUs, and RPS is configured to use CPUs in the
same memory domain.
Data received from a single sender is not sent to more than one CPU. If the amount of data received
from a single sender is greater than a single CPU can handle, configure a larger frame size to reduce the
number of interrupts and therefore the amount of processing work for the CPU. Alternatively, consider
NIC offload options or faster CPUs.
Consider using numactl or taskset in conjunction with RFS to pin applications to specific cores, sockets,
or NUMA nodes. This can help prevent packets from being processed out of order.
Procedure
1. Set the value of the net.core.rps_sock_flow_entries kernel value to the maximum expected
number of concurrently active connections:
NOTE
Red Hat recommends a value of 32768 for moderate server loads. All values
entered are rounded up to the nearest power of 2 in practice.
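A minimal sketch that persists the value in the drop-in file loaded by the next command:
# echo "net.core.rps_sock_flow_entries=32768" > /etc/sysctl.d/95-enable-rps.conf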
# sysctl -p /etc/sysctl.d/95-enable-rps.conf
Replace device with the name of the network device you wish to configure (for example,
enp1s0), and rx-queue with the receive queue you wish to configure (for example, rx-0).
Replace N with the number of configured receive queues. For example, if rps_sock_flow_entries
is set to 32768 and there are 16 configured receive queues, then rps_flow_cnt = 32768/16 =
2048 (that is, rps_flow_cnt = rps_sock_flow_entries/N).
For single-queue devices, the value of rps_flow_cnt is the same as the value of
rps_sock_flow_entries.
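For example, assuming the illustrative device enp1s0, queue rx-0, and the value 2048 calculated above:
# echo 2048 > /sys/class/net/enp1s0/queues/rx-0/rps_flow_cnt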
Verification steps
# cat /proc/sys/net/core/rps_sock_flow_entries
32768
# cat /sys/class/net/device/queues/rx-queue/rps_flow_cnt
2048
Additional resources
Unlike traditional RFS, however, packets are sent directly to a CPU that is local to the thread consuming
the data:
The NIC must support accelerated RFS. Accelerated RFS is supported by cards that export the
ndo_rx_flow_steer() net_device function. Check the NIC’s data sheet to verify that this feature
is supported.
ntuple filtering must be enabled. For information on how to enable these filters, see Enabling
the ntuple filters.
Once these conditions are met, CPU to queue mapping is deduced automatically based on traditional
RFS configuration. That is, CPU to queue mapping is deduced based on the IRQ affinities configured by
the driver for each receive queue. For more information on enabling the traditional RFS, see Enabling
Receive Flow Steering.
Procedure
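As a sketch (the device name enp1s0 is an assumption), the current state of ntuple filtering can typically be checked with ethtool:
# ethtool -k enp1s0 | grep ntuple-filters
If the feature is configurable, it can usually be switched on with # ethtool -K enp1s0 ntuple on.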
ntuple-filters: off
NOTE
If the output is ntuple-filters: off [fixed], then the ntuple filtering is disabled and you
cannot configure it:
Verification steps
Additional resources
CHAPTER 32. FACTORS AFFECTING I/O AND FILE SYSTEM PERFORMANCE
I/O and file system performance can be affected by any of the following factors:
Sequential or random
Buffered or Direct IO
Block size
Pre-fetching data
File fragmentation
Resource contention
The vmstat tool reports on processes, memory, paging, block I/O, interrupts, and CPU activity
across the entire system. It can help administrators determine whether the I/O subsystem is
responsible for any performance issues. If analysis with vmstat shows that the I/O subsystem is
responsible for reduced performance, administrators can use the iostat tool to determine the
responsible I/O device.
iostat reports on I/O device load in your system. It is provided by the sysstat package.
blktrace provides detailed information about how time is spent in the I/O subsystem. The
companion utility blkparse reads the raw output from blktrace and produces a human readable
summary of input and output operations recorded by blktrace.
btt analyzes blktrace output and displays the amount of time that data spends in each area of
the I/O stack, making it easier to spot bottlenecks in the I/O subsystem. This utility is provided
as part of the blktrace package. Some of the important events tracked by the blktrace
mechanism and analyzed by btt are:
iowatcher can use the blktrace output to graph I/O over time. It focuses on the Logical Block
Address (LBA) of disk I/O, throughput in megabytes per second, the number of seeks per
second, and I/O operations per second. This can help to identify when you are hitting the
operations-per-second limit of a device.
BPF Compiler Collection (BCC) is a library, which facilitates the creation of the extended
Berkeley Packet Filter (eBPF) programs. The eBPF programs are triggered on events, such as
disk I/O, TCP connections, and process creations. The BCC tools are installed in the
/usr/share/bcc/tools/ directory. The following bcc-tools help to analyze performance:
biolatency summarizes the latency in block device I/O (disk I/O) as a histogram. This allows
the distribution to be studied, including two modes for device cache hits and cache
misses, and latency outliers.
biosnoop is a basic block I/O tracing tool for displaying each I/O event along with the
issuing process ID, and the I/O latency. Using this tool, you can investigate disk I/O
performance issues.
ext4slower, nfsslower, and xfsslower are tools that show file system operations slower
than a certain threshold, which defaults to 10ms.
For more information, see the Analyzing system performance with BPF Compiler Collection .
bpftrace is a tracing language for eBPF used for analyzing performance issues. It also provides
trace utilities like BCC for system observation, which is useful for investigating I/O performance
issues.
The following SystemTap scripts may be useful in diagnosing storage or file system
performance problems:
disktop.stp: Checks the status of reading or writing disk every 5 seconds and outputs the
top ten entries during that period.
iotime.stp: Prints the amount of time spent on read and write operations, and the number
of bytes read and written.
traceio.stp: Prints the top ten executables based on cumulative I/O traffic observed, every
second.
traceio2.stp: Prints the executable name and process identifier as reads and writes to the
specified device occur.
inodewatch.stp: Prints the executable name and process identifier each time a read or
write occurs to the specified inode on the specified major or minor device.
inodewatch2.stp: Prints the executable name, process identifier, and attributes each time
the attributes are changed on the specified inode on the specified major or minor device.
Additional resources
vmstat(8), iostat(1), blktrace(8), blkparse(1), btt(1), bpftrace, and iowatcher(1) man pages
The following are the options available before formatting a storage device:
Size
Create an appropriately-sized file system for your workload. Smaller file systems require less time
and memory for file system checks. However, if a file system is too small, its performance suffers
from high fragmentation.
Block size
The block is the unit of work for the file system. The block size determines how much data can be
stored in a single block, and therefore the smallest amount of data that is written or read at one time.
The default block size is appropriate for most use cases. However, your file system performs better
and stores data more efficiently if the block size or the size of multiple blocks is the same as or
slightly larger than the amount of data that is typically read or written at one time. A small file still
uses an entire block. Files can be spread across multiple blocks, but this can create additional runtime
overhead.
Additionally, some file systems are limited to a certain number of blocks, which in turn limits the
maximum size of the file system. Block size is specified as part of the file system options when
formatting a device with the mkfs command. The parameter that specifies the block size varies with
the file system.
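For example, a minimal sketch for XFS (the device name and block size are illustrative):
# mkfs.xfs -b size=4096 /dev/sdb1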
Geometry
File system geometry is concerned with the distribution of data across a file system. If your system
uses striped storage, like RAID, you can improve performance by aligning data and metadata with the
underlying storage geometry when you format the device.
Many devices export recommended geometry, which is then set automatically when the devices are
formatted with a particular file system. If your device does not export these recommendations, or you
want to change the recommended settings, you must specify geometry manually when you format
the device with the mkfs command.
The parameters that specify file system geometry vary with the file system.
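For example, a minimal sketch for ext4 (the device name and the stride and stripe-width values are illustrative, not recommendations):
# mkfs.ext4 -E stride=16,stripe-width=64 /dev/sdb1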
External journals
Journaling file systems document the changes that will be made during a write operation in a journal
file prior to the operation being executed. This reduces the likelihood that a storage device will
become corrupted in the event of a system crash or power failure, and speeds up the recovery
process.
NOTE
Red Hat does not recommend using the external journals option.
Metadata-intensive workloads involve very frequent updates to the journal. A larger journal uses more
memory, but reduces the frequency of write operations. Additionally, you can improve the seek time of a
device with a metadata-intensive workload by placing its journal on dedicated storage that is as fast as,
or faster than, the primary storage.
WARNING
Ensure that external journals are reliable. Losing an external journal device causes
file system corruption. External journals must be created at format time, with
journal devices being specified at mount time.
Additional resources
Access Time
Every time a file is read, its metadata is updated with the time at which access occurred (atime). This
involves additional write I/O. The relatime is the default atime setting for most file systems.
However, if updating this metadata is time consuming, and if accurate access time data is not
required, you can mount the file system with the noatime mount option. This disables updates to
metadata when a file is read. It also enables nodiratime behavior, which disables updates to metadata
when a directory is read.
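For example (the device and mount point are illustrative):
# mount -o noatime /dev/sdb1 /mnt/data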
NOTE
Disabling atime updates by using the noatime mount option can break applications that
rely on them, for example, backup programs.
Read-ahead
Read-ahead behavior speeds up file access by pre-fetching data that is likely to be needed soon
and loading it into the page cache, where it can be retrieved more quickly than if it were on disk. The
higher the read-ahead value, the further ahead the system pre-fetches data.
Red Hat Enterprise Linux attempts to set an appropriate read-ahead value based on what it detects
about your file system. However, accurate detection is not always possible. For example, if a storage
array presents itself to the system as a single LUN, the system detects the single LUN, and does not
set the appropriate read-ahead value for an array.
Workloads that involve heavy streaming of sequential I/O often benefit from high read-ahead values.
The storage-related tuned profiles provided with Red Hat Enterprise Linux raise the read-ahead
value, as does using LVM striping, but these adjustments are not always sufficient for all workloads.
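For example, the read-ahead value of a block device can be inspected and changed with the blockdev utility (the device name and the value, expressed in 512-byte sectors, are illustrative):
# blockdev --getra /dev/sda
# blockdev --setra 4096 /dev/sda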
Additional resources
Batch discard
This type of discard is part of the fstrim command. It discards all unused blocks in a file system that
match criteria specified by the administrator. Red Hat Enterprise Linux 9 supports batch discard on
XFS and ext4 formatted devices that support physical discard operations.
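For example (the mount point is illustrative):
# fstrim -v /mnt/data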
Online discard
This type of discard operation is configured at mount time with the discard option, and runs in real
time without user intervention. However, it only discards blocks that are transitioning from used to
free. Red Hat Enterprise Linux 9 supports online discard on XFS and ext4 formatted devices.
Red Hat recommends batch discard, except where online discard is required to maintain
performance, or where batch discard is not feasible for the system’s workload.
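For example, online discard is enabled with the discard mount option (the device and mount point are illustrative):
# mount -o discard /dev/sdb1 /mnt/data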
Pre-allocation marks disk space as being allocated to a file without writing any data into that space. This
can be useful in limiting data fragmentation and poor read performance. Red Hat Enterprise Linux 9
supports pre-allocating space on XFS, ext4, and GFS2 file systems. Applications can also benefit from
pre-allocating space by using the fallocate(2) glibc call.
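From the command line, the fallocate utility provides the same capability; for example (the path and size are illustrative):
# fallocate -l 1G /mnt/data/preallocated.img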
Additional resources
Performance generally degrades as the used blocks on an SSD approach the capacity of the disk. The
degree of degradation varies by vendor, but all devices experience degradation in this circumstance.
Enabling discard behavior can help to alleviate this degradation. For more information, see Types of
discarding unused blocks.
The default I/O scheduler and virtual memory options are suitable for use with SSDs. Consider the
following factors when configuring settings that can affect SSD performance:
I/O Scheduler
Any I/O scheduler is expected to perform well with most SSDs. However, as with any other storage
type, Red Hat recommends benchmarking to determine the optimal configuration for a given
workload. When using SSDs, Red Hat advises changing the I/O scheduler only for benchmarking
particular workloads. For instructions on how to switch between I/O schedulers, see the
/usr/share/doc/kernel-version/Documentation/block/switching-sched.txt file.
For single queue HBA, the default I/O scheduler is deadline. For multiple queue HBA, the default
I/O scheduler is none. For information on how to set the I/O scheduler, see Setting the disk
scheduler.
Virtual Memory
Like the I/O scheduler, the virtual memory (VM) subsystem requires no special tuning. Given the fast
nature of I/O on SSDs, you can try turning down the vm.dirty_background_ratio and vm.dirty_ratio
settings, as increased write-out activity does not usually have a negative impact on the latency of
other operations on the disk. However, this tuning can generate more overall I/O, and is therefore
not generally recommended without workload-specific testing.
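A minimal sketch of lowering these settings at runtime (the values are illustrative, not recommendations):
# sysctl -w vm.dirty_background_ratio=5
# sysctl -w vm.dirty_ratio=20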
Swap
An SSD can also be used as a swap device, and is likely to produce good page-out and page-in
performance.
The following listed tuning parameters are separate from I/O scheduler tuning, and are applicable to all
I/O schedulers:
add_random
Some I/O events contribute to the entropy pool for the /dev/random. This parameter can be set to 0
if the overhead of these contributions becomes measurable.
iostats
By default, iostats is enabled and the default value is 1. Setting iostats value to 0 disables the
gathering of I/O statistics for the device, which removes a small amount of overhead with the I/O
path. Setting iostats to 0 might slightly improve performance for very high performance devices,
such as certain NVMe solid-state storage devices. It is recommended to leave iostats enabled unless
otherwise specified for the given storage model by the vendor.
If you disable iostats, the I/O statistics for the device are no longer present within the
/proc/diskstats file. The /proc/diskstats file is the source of I/O information for
I/O monitoring tools, such as sar or iostat. Therefore, if you disable the iostats parameter for a
device, the device is no longer present in the output of I/O monitoring tools.
max_sectors_kb
Specifies the maximum size of an I/O request in kilobytes. The default value is 512 KB. The minimum
value for this parameter is determined by the logical block size of the storage device. The maximum
value for this parameter is determined by the value of the max_hw_sectors_kb.
Red Hat recommends that max_sectors_kb always be a multiple of the optimal I/O size and the
internal erase block size. Use the logical_block_size value for either parameter if it is zero or
not specified by the storage device.
nomerges
Most workloads benefit from request merging. However, disabling merges can be useful for
debugging purposes. By default, the nomerges parameter is set to 0, which enables merging. To
disable simple one-hit merging, set nomerges to 1. To disable all types of merging, set nomerges
to 2.
nr_requests
Specifies the maximum allowed number of queued I/O requests. If the current I/O scheduler is none, this
number can only be reduced; otherwise, it can be increased or reduced.
optimal_io_size
Some storage devices report an optimal I/O size through this parameter. If this value is reported,
Red Hat recommends that applications issue I/O aligned to and in multiples of the optimal I/O size
wherever possible.
read_ahead_kb
Defines the maximum number of kilobytes that the operating system may read ahead during a
sequential read operation. As a result, the necessary information is already present within the kernel
page cache for the next sequential read, which improves read I/O performance.
Device mappers often benefit from a high read_ahead_kb value. 128 KB for each device to be
mapped is a good starting point, but increasing the read_ahead_kb value up to request queue’s
max_sectors_kb of the disk might improve performance in application environments where
sequential reading of large files takes place.
rotational
Some solid-state disks do not correctly advertise their solid-state status, and are mounted as
traditional rotational disks. Manually set the rotational value to 0 to disable unnecessary seek-
reducing logic in the scheduler.
rq_affinity
The default value of rq_affinity is 1. With this value, I/O completions are processed on a CPU core that is in
the same CPU group as the core that issued the I/O request. To perform completions only on the processor that
issued the I/O request, set rq_affinity to 2. To disable both of these behaviors, set it to 0.
scheduler
To set the scheduler or scheduler preference order for a particular storage device, edit the
/sys/block/devname/queue/scheduler file, where devname is the name of the device you want to
configure.
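For example (the device name sda and the bfq scheduler are illustrative):
# cat /sys/block/sda/queue/scheduler
# echo bfq > /sys/block/sda/queue/scheduler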
CHAPTER 33. USING SYSTEMD TO MANAGE RESOURCES USED BY APPLICATIONS
To achieve this, systemd takes various configuration options from the unit files or directly via the
systemctl command. Then systemd applies those options to specific process groups by utilizing the
Linux kernel system calls and features like cgroups and namespaces.
NOTE
You can review the full set of configuration options for systemd in the following manual
pages:
systemd.resource-control(5)
systemd.exec(5)
Weights
The resource is distributed by adding up the weights of all sub-groups and giving each sub-group a
fraction of the resource that matches its ratio against the sum.
For example, if you have 10 cgroups, each with a weight of 100, the sum is 1000, and each cgroup
receives one tenth of the resource.
Weight is usually used to distribute stateless resources. For example the CPUWeight= option is an
implementation of this resource distribution model.
Limits
A cgroup can consume up to the configured amount of the resource. The sum of sub-group limits
can exceed the limit of the parent cgroup. Therefore it is possible to overcommit resources in this
model.
For example the MemoryMax= option is an implementation of this resource distribution model.
Protections
You can set up a protected amount of a resource for a cgroup. If the resource usage is below the
protection boundary, the kernel will try not to penalize this cgroup in favor of other cgroups that
compete for the same resource. An overcommit is also possible.
For example the MemoryLow= option is an implementation of this resource distribution model.
Allocations
Exclusive allocations of an absolute amount of a finite resource. An overcommit is not possible. An
example of this resource type in Linux is the real-time budget.
unit file option
A setting for resource control configuration.
For example, you can configure CPU resource with options like CPUAccounting=, or CPUQuota=.
Similarly, you can configure memory or I/O resources with options like AllowedMemoryNodes= and
IOAccounting=.
Procedure
To change the required value of the unit file option of your service, you can adjust the value in the unit
file, or use systemctl command:
2. Set the required value of the CPU time allocation policy option:
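For example, assuming the illustrative example.service unit and the CPUWeight= option:
# systemctl set-property example.service CPUWeight=200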
Verification steps
Check the newly assigned values for the service of your choice.
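A minimal check, again assuming the example.service unit and the CPUWeight= option:
# systemctl show --property CPUWeight example.service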
Additional resources
IMPORTANT
In general, Red Hat recommends you use systemd for controlling the usage of system
resources. You should manually configure the cgroups virtual file system only in special
cases. For example, when you need to use cgroup-v1 controllers that have no
equivalents in cgroup-v2 hierarchy.
<name>.service
Scope - A group of externally created processes. Scopes encapsulate processes that are
started and stopped by arbitrary processes through the fork() function and then registered
by systemd at runtime. For example, user sessions, containers, and virtual machines are treated
as scopes. Scopes are named as follows:
<name>.scope
Slice - A group of hierarchically organized units. Slices organize a hierarchy in which scopes and
services are placed. The actual processes are contained in scopes or in services. Every name of
a slice unit corresponds to the path to a location in the hierarchy. The dash ("-") character acts
as a separator of the path components to a slice from the -.slice root slice. In the following
example:
<parent-name>.slice
The service, the scope, and the slice units directly map to objects in the control group hierarchy. When
these units are activated, they map directly to control group paths built from the unit names.
Control group /:
-.slice
├─user.slice
│ ├─user-42.slice
│ │ ├─session-c1.scope
│ │ │ ├─ 967 gdm-session-worker [pam/gdm-launch-environment]
│ │ │ ├─1035 /usr/libexec/gdm-x-session gnome-session --autostart
/usr/share/gdm/greeter/autostart
│ │ │ ├─1054 /usr/libexec/Xorg vt1 -displayfd 3 -auth /run/user/42/gdm/Xauthority -background none
-noreset -keeptty -verbose 3
│ │ │ ├─1212 /usr/libexec/gnome-session-binary --autostart /usr/share/gdm/greeter/autostart
│ │ │ ├─1369 /usr/bin/gnome-shell
│ │ │ ├─1732 ibus-daemon --xim --panel disable
│ │ │ ├─1752 /usr/libexec/ibus-dconf
│ │ │ ├─1762 /usr/libexec/ibus-x11 --kill-daemon
│ │ │ ├─1912 /usr/libexec/gsd-xsettings
│ │ │ ├─1917 /usr/libexec/gsd-a11y-settings
│ │ │ ├─1920 /usr/libexec/gsd-clipboard
…
├─init.scope
│ └─1 /usr/lib/systemd/systemd --switched-root --system --deserialize 18
└─system.slice
├─rngd.service
│ └─800 /sbin/rngd -f
├─systemd-udevd.service
│ └─659 /usr/lib/systemd/systemd-udevd
├─chronyd.service
│ └─823 /usr/sbin/chronyd
├─auditd.service
│ ├─761 /sbin/auditd
│ └─763 /usr/sbin/sedispatch
├─accounts-daemon.service
│ └─876 /usr/libexec/accounts-daemon
├─example.service
│ ├─ 929 /bin/bash /home/jdoe/example.sh
│ └─4902 sleep 1
…
The example above shows that services and scopes contain processes and are placed in slices that do
not contain processes of their own.
Additional resources
Understanding cgroups
Procedure
To list all active units on the system, run the systemctl command. The terminal returns an output
similar to the following example:
# systemctl
UNIT LOAD ACTIVE SUB DESCRIPTION
…
init.scope loaded active running System and Service Manager
session-2.scope loaded active running Session 2 of user jdoe
abrt-ccpp.service loaded active exited Install ABRT coredump hook
abrt-oops.service loaded active running ABRT kernel log watcher
abrt-vmcore.service loaded active exited Harvest vmcores for ABRT
abrt-xorg.service loaded active running ABRT Xorg log watcher
…
-.slice loaded active active Root Slice
machine.slice loaded active active Virtual Machine and Container Slice
system-getty.slice loaded active active system-getty.slice
system-lvm2\x2dpvscan.slice loaded active active system-
lvm2\x2dpvscan.slice
system-sshd\x2dkeygen.slice loaded active active system-
sshd\x2dkeygen.slice
system-systemd\x2dhibernate\x2dresume.slice loaded active active system-
systemd\x2dhibernate\x2dresume>
system-user\x2druntime\x2ddir.slice loaded active active system-
user\x2druntime\x2ddir.slice
system.slice loaded active active System Slice
user-1000.slice loaded active active User Slice of UID 1000
user-42.slice loaded active active User Slice of UID 42
user.slice loaded active active User and Session Slice
…
UNIT - a name of a unit that also reflects the unit position in a control group hierarchy. The
units relevant for resource control are a slice, a scope, and a service.
LOAD - indicates whether the unit configuration file was properly loaded. If the unit file
failed to load, the field contains the state error instead of loaded. Other unit load states are:
stub, merged, and masked.
SUB - the low-level unit activation state. The range of possible values depends on the unit
type.
To list all units on the system, including inactive ones, add the --all option:
# systemctl --all
To limit the output, you can use the --type option. The --type option requires a comma-separated list of
unit types, such as service and slice, or unit load states, such as loaded and masked.
Additional resources
Procedure
# systemd-cgls
Control group /:
-.slice
├─user.slice
│ ├─user-42.slice
│ │ ├─session-c1.scope
│ │ │ ├─ 965 gdm-session-worker [pam/gdm-launch-environment]
│ │ │ ├─1040 /usr/libexec/gdm-x-session gnome-session --autostart
/usr/share/gdm/greeter/autostart
…
├─init.scope
│ └─1 /usr/lib/systemd/systemd --switched-root --system --deserialize 18
└─system.slice
…
├─example.service
│ ├─6882 /bin/bash /home/jdoe/example.sh
│ └─6902 sleep 1
├─systemd-journald.service
└─629 /usr/lib/systemd/systemd-journald
…
The example output returns the entire cgroups hierarchy, where the highest level is formed by
slices.
# systemd-cgls memory
Controller memory; Control group /:
├─1 /usr/lib/systemd/systemd --switched-root --system --deserialize 18
├─user.slice
│ ├─user-42.slice
│ │ ├─session-c1.scope
│ │ │ ├─ 965 gdm-session-worker [pam/gdm-launch-environment]
…
└─system.slice
|
…
├─chronyd.service
│ └─844 /usr/sbin/chronyd
├─example.service
│ ├─8914 /bin/bash /home/jdoe/example.sh
│ └─8916 sleep 1
…
The example output of the above command lists the services that interact with the selected
controller.
To display detailed information about a certain unit and its place in the cgroups hierarchy,
execute # systemctl status <system_unit>:
Additional resources
Procedure
1. To view which cgroup a process belongs to, run the # cat /proc/<PID>/cgroup command:
# cat /proc/2467/cgroup
0::/system.slice/example.service
The example output relates to a process of interest. In this case, it is a process identified by PID
2467, which belongs to the example.service unit. You can determine whether the process was
placed in a correct control group as defined by the systemd unit file specifications.
2. To display what controllers the cgroup utilizes and the respective configuration files, check the
cgroup directory:
# cat /sys/fs/cgroup/system.slice/example.service/cgroup.controllers
memory pids
# ls /sys/fs/cgroup/system.slice/example.service/
cgroup.controllers
cgroup.events
…
cpu.pressure
cpu.stat
io.pressure
memory.current
memory.events
…
pids.current
pids.events
pids.max
NOTE
The version 1 hierarchy of cgroups uses a per-controller model. Therefore, the output of the
/proc/PID/cgroup file shows which cgroups under each controller the PID belongs to. You can find
the respective cgroups under the controller directories at /sys/fs/cgroup/<controller_name>/.
Additional resources
Procedure
# systemd-cgtop
Control Group Tasks %CPU Memory Input/s Output/s
/ 607 29.8 1.5G - -
/system.slice 125 - 428.7M - -
/system.slice/ModemManager.service 3 - 8.6M - -
/system.slice/NetworkManager.service 3 - 12.8M - -
/system.slice/accounts-daemon.service 3 - 1.8M - -
/system.slice/boot.mount - - 48.0K - -
/system.slice/chronyd.service 1 - 2.0M - -
/system.slice/cockpit.socket - - 1.3M - -
/system.slice/colord.service 3 - 3.5M - -
/system.slice/crond.service 1 - 1.8M - -
/system.slice/cups.service 1 - 3.1M - -
/system.slice/dev-hugepages.mount - - 244.0K - -
/system.slice/dev-mapper-rhel\x2dswap.swap - - 912.0K - -
/system.slice/dev-mqueue.mount - - 48.0K - -
/system.slice/example.service 2 - 2.0M - -
/system.slice/firewalld.service 2 - 28.8M - -
...
The example output displays currently running cgroups ordered by their resource usage (CPU,
memory, disk I/O load). The list refreshes every 1 second by default. Therefore, it offers a
dynamic insight into the actual resource usage of each control group.
Additional resources
Prerequisites
Procedure
…
[Service]
MemoryMax=1500K
…
The configuration above places a maximum memory limit, which the processes in a control
group cannot exceed. The example.service service is part of such a control group, which has the
imposed limitations. You can use the suffixes K, M, G, or T to identify kilobyte, megabyte, gigabyte,
or terabyte as a unit of measurement.
# systemctl daemon-reload
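Alternatively (a sketch, assuming the example.service unit), the same limit can be set at runtime with systemctl set-property:
# systemctl set-property example.service MemoryMax=1500K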
NOTE
You can review the full set of configuration options for systemd in the following manual
pages:
systemd.resource-control(5)
systemd.exec(5)
Verification
# cat /sys/fs/cgroup/system.slice/example.service/memory.max
1536000
The example output shows that the memory consumption was limited at around 1,500 KB.
Additional resources
Understanding cgroups
CPU affinity settings help you restrict the access of a particular process to some CPUs. Effectively, the
CPU scheduler never schedules the process to run on the CPU that is not in the affinity mask of the
process.
The default CPU affinity mask applies to all services managed by systemd.
To configure CPU affinity mask for a particular systemd service, systemd provides CPUAffinity= both
as a unit file option and a manager configuration option in the /etc/systemd/system.conf file.
The CPUAffinity= unit file option sets a list of CPUs or CPU ranges that are merged and used as the
affinity mask.
After configuring CPU affinity mask for a particular systemd service, you must restart the service to
apply the changes.
Procedure
To set CPU affinity mask for a particular systemd service using the CPUAffinity unit file option:
1. Check the values of the CPUAffinity unit file option in the service of your choice:
2. As a root, set the required value of the CPUAffinity unit file option for the CPU ranges used as
the affinity mask:
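A minimal sketch using a drop-in file created with systemctl edit (the unit name example.service and the CPU list are illustrative):
# systemctl edit example.service
[Service]
CPUAffinity=0 1 2 3
# systemctl restart example.service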
NOTE
You can review the full set of configuration options for systemd in the following manual
pages:
systemd.resource-control(5)
systemd.exec(5)
To set default CPU affinity mask for all systemd services using the manager configuration option:
1. Set the CPU numbers for the CPUAffinity= option in the /etc/systemd/system.conf file.
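For example, the line in /etc/systemd/system.conf might look like this (the CPU list is illustrative):
CPUAffinity=0 1 2 3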
# systemctl daemon-reload
NOTE
You can review the full set of configuration options for systemd in the following manual
pages:
systemd.resource-control(5)
systemd.exec(5)
Memory close to the CPU has lower latency (local memory) than memory that is local for a different
CPU (foreign memory) or is shared between a set of CPUs.
In terms of the Linux kernel, NUMA policy governs where (for example, on which NUMA nodes) the
kernel allocates physical memory pages for the process.
systemd provides unit file options NUMAPolicy and NUMAMask to control memory allocation policies
for services.
Procedure
To set the NUMA memory policy through the NUMAPolicy unit file option:
1. Check the values of the NUMAPolicy unit file option in the service of your choice:
2. As a root, set the required policy type of the NUMAPolicy unit file option:
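A minimal sketch using a drop-in file (the unit name, policy, and node mask are illustrative):
# systemctl edit example.service
[Service]
NUMAPolicy=bind
NUMAMask=0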
# systemctl daemon-reload
IMPORTANT
When you configure a strict NUMA policy, for example bind, make sure that you also
appropriately set the CPUAffinity= unit file option.
Additional resources
NUMAPolicy
Controls the NUMA memory policy of the executed processes. The following policy types are
possible:
default
preferred
bind
interleave
local
NUMAMask
Controls the NUMA node list which is associated with the selected NUMA policy.
Note that the NUMAMask option is not required to be specified for the following policies:
default
local
For the preferred policy, the list specifies only a single NUMA node.
Additional resources
Procedure
To create a transient control group, use the systemd-run command in the following format:
This command creates and starts a transient service or a scope unit and runs a custom command
in such a unit.
The --unit=<name> option gives a name to the unit. If --unit is not specified, the name is
generated automatically.
The --slice=<name>.slice option makes your service or scope unit a member of a specified
slice. Replace <name>.slice with the name of an existing slice (as shown in the output of
systemctl -t slice), or create a new slice by passing a unique name. By default, services and
scopes are created as members of the system.slice.
Replace <command> with the command you wish to execute in the service or the scope
unit.
The following message is displayed to confirm that you created and started the service or
the scope successfully:
Optionally, keep the unit running after its processes finished to collect run-time information:
The command creates and starts a transient service unit and runs a custom command in such a
unit. The --remain-after-exit option ensures that the service keeps running after its processes
have finished.
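Putting these options together, a minimal sketch (the unit name toptest, the slice name test.slice, and the commands are illustrative):
# systemd-run --unit=toptest --slice=test.slice top -b
# systemd-run --unit=datetest --remain-after-exit --slice=test.slice date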
Additional resources
Transient cgroups are automatically released once all the processes that a service or a scope unit
contains finish.
Procedure
The command above uses the --kill-who option to select process(es) from the control group
you wish to terminate. To kill multiple processes at the same time, pass a comma-separated list
of PIDs. The --signal option determines the type of POSIX signal to be sent to the specified
processes. The default signal is SIGTERM.
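For example (example.service is a placeholder unit name):
# systemctl kill --signal=SIGTERM example.service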
Additional resources
CHAPTER 34. UNDERSTANDING CGROUPS
The resource controllers (a kernel component) then modify the behavior of processes in cgroups by
limiting, prioritizing, or allocating system resources (such as CPU time, memory, network bandwidth, or
various combinations of these) for those processes.
The added value of cgroups is process aggregation, which enables the division of hardware resources
among applications and users. This can increase the overall efficiency, stability, and security of the
users' environment.
The control file behavior and naming is consistent among different controllers.
IMPORTANT
Additional resources
cgroups-v1
cgroups-v2
A resource controller, also called a control group subsystem, is a kernel subsystem that represents a
single resource, such as CPU time, memory, network bandwidth or disk I/O. The Linux kernel provides a
range of resource controllers that are mounted automatically by the systemd system and service
manager. Find a list of currently mounted resource controllers in the /proc/cgroups file.
blkio - can set limits on input/output access to and from block devices.
cpu - can adjust the parameters of the Completely Fair Scheduler (CFS) scheduler for control
group’s tasks. It is mounted together with the cpuacct controller on the same mount.
cpuacct - creates automatic reports on CPU resources used by tasks in a control group. It is
mounted together with the cpu controller on the same mount.
cpuset - can be used to restrict control group tasks to run only on a specified subset of CPUs
and to direct the tasks to use memory only on specified memory nodes.
memory - can be used to set limits on memory use by tasks in a control group and generates
automatic reports on memory resources used by those tasks.
net_cls - tags network packets with a class identifier ( classid) that enables the Linux traffic
controller (the tc command) to identify packets that originate from a particular control group
task. A subsystem of net_cls, the net_filter (iptables), can also use this tag to perform actions
on such packets. The net_filter tags network sockets with a firewall identifier ( fwid) that allows
the Linux firewall (through iptables command) to identify packets originating from a particular
control group task.
pids - can set limits for a number of processes and their children in a control group.
perf_event - can group tasks for monitoring by the perf performance monitoring and reporting
utility.
rdma - can set limits on Remote Direct Memory Access/InfiniBand specific resources in a
control group.
hugetlb - can be used to limit the usage of large size virtual memory pages by tasks in a control
group.
cpuset - Supports only the core functionality ( cpus{,.effective}, mems{,.effective}) with a new
partition feature.
perf_event - Support is inherent, no explicit control file. You can specify a v2 cgroup as a
parameter to the perf command that will profile all the tasks within that cgroup.
IMPORTANT
Additional resources
Documentation in /usr/share/doc/kernel-doc-<kernel_version>/Documentation/cgroups-v1/
directory (after installing the kernel-doc package).
A namespace wraps a global system resource (for example a mount point, a network device, or a
hostname) in an abstraction that makes it appear to processes within the namespace that they have
their own isolated instance of the global resource. One of the most common technologies that utilize
namespaces are containers.
Changes to a particular global resource are visible only to processes in that namespace and do not
affect the rest of the system or other namespaces.
To inspect which namespaces a process is a member of, you can check the symbolic links in the
/proc/<PID>/ns/ directory.
The following table shows supported namespaces and resources which they isolate:
Namespace Isolates
Additional resources
[1] Linux Control Group v2 - An Introduction, Devconf.cz 2019 presentation by Waiman Long
CHAPTER 35. IMPROVING SYSTEM PERFORMANCE WITH ZSWAP
zswap is a kernel feature that provides a compressed RAM cache for swap pages. The mechanism
works as follows: zswap takes pages that are in the process of being swapped out and attempts to
compress them into a dynamically allocated RAM-based memory pool. When the pool becomes full or
the RAM becomes exhausted, zswap evicts pages from compressed cache on an LRU basis (least
recently used) to the backing swap device. After the page has been decompressed into the swap cache,
zswap frees the compressed version in the pool.
Additional resources
What is Zswap?
Prerequisites
Procedure
Enable zswap:
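A minimal sketch of enabling zswap at runtime, assuming the zswap parameters are exposed under /sys/module/zswap/parameters/:
# echo 1 > /sys/module/zswap/parameters/enabled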
Verification step
# grep -r . /sys/kernel/debug/zswap
duplicate_entry:0
pool_limit_hit:13422200
pool_total_size:6184960 (pool size in total in pages)
reject_alloc_fail:5
reject_compress_poor:0
reject_kmemcache_fail:0
reject_reclaim_fail:13422200
stored_pages:4251 (pool size after compression)
written_back_pages:0
Additional resources
Prerequisites
Procedure
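One possible approach (a sketch; the verification below checks the resulting kernel command line) is to add the zswap.enabled=1 parameter with grubby:
# grubby --update-kernel=ALL --args="zswap.enabled=1"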
Verification steps
# cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/vmlinuz-5.14.0-70.5.1.el9_0.x86_64
root=/dev/mapper/rhel-root ro crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M
resume=/dev/mapper/rhel-swap rd.lvm.lv=rhel/root
rd.lvm.lv=rhel/swap rhgb quiet
zswap.enabled=1
Additional resources
CHAPTER 36. USING CGROUPFS TO MANUALLY MANAGE CGROUPS
IMPORTANT
In general, Red Hat recommends you use systemd for controlling the usage of system
resources. You should manually configure the cgroups virtual file system only in special
cases. For example, when you need to use cgroup-v1 controllers that have no
equivalents in cgroup-v2 hierarchy.
Prerequisites
Procedure
# mkdir /sys/fs/cgroup/Example/
The /sys/fs/cgroup/Example/ directory defines a child group. When you create the
/sys/fs/cgroup/Example/ directory, some cgroups-v2 interface files are automatically created
in the directory. The /sys/fs/cgroup/Example/ directory also contains controller-specific files
for the memory and pids controllers.
# ll /sys/fs/cgroup/Example/
-r--r--r--. 1 root root 0 Jun 1 10:33 cgroup.controllers
-r--r--r--. 1 root root 0 Jun 1 10:33 cgroup.events
-rw-r--r--. 1 root root 0 Jun 1 10:33 cgroup.freeze
-rw-r--r--. 1 root root 0 Jun 1 10:33 cgroup.procs
…
-rw-r--r--. 1 root root 0 Jun 1 10:33 cgroup.subtree_control
-r--r--r--. 1 root root 0 Jun 1 10:33 memory.events.local
-rw-r--r--. 1 root root 0 Jun 1 10:33 memory.high
-rw-r--r--. 1 root root 0 Jun 1 10:33 memory.low
…
The example output shows general cgroup control interface files such as cgroup.procs or
cgroup.controllers. These files are common to all control groups, regardless of enabled
controllers.
The files such as memory.high and pids.max relate to the memory and pids controllers, which
are in the root control group (/sys/fs/cgroup/), and are enabled by default by systemd.
By default, the newly created child group inherits all settings from the parent cgroup. In this
case, there are no limits from the root cgroup.
3. Verify that the desired controllers are available in the /sys/fs/cgroup/cgroup.controllers file:
# cat /sys/fs/cgroup/cgroup.controllers
cpuset cpu io memory hugetlb pids rdma
4. Enable the desired controllers. In this example, these are the cpu and cpuset controllers:
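A minimal sketch of this step, writing to the root group's cgroup.subtree_control file:
# echo "+cpu" >> /sys/fs/cgroup/cgroup.subtree_control
# echo "+cpuset" >> /sys/fs/cgroup/cgroup.subtree_control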
These commands enable the cpu and cpuset controllers for the immediate child groups of the
/sys/fs/cgroup/ root control group, including the newly created Example control group. A child
group is where you can specify processes and apply control checks to each of the processes
based on your criteria.
Users can read the contents of the cgroup.subtree_control file at any level to get an idea of
what controllers are going to be available for enablement in the immediate child group.
NOTE
5. Enable the desired controllers for child cgroups of the Example control group:
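A minimal sketch of this step, writing to the Example group's cgroup.subtree_control file:
# echo "+cpu +cpuset" > /sys/fs/cgroup/Example/cgroup.subtree_control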
This command ensures that the immediate child control groups will only have controllers relevant
to regulating CPU time distribution, not the memory or pids controllers.
# mkdir /sys/fs/cgroup/Example/tasks/
The /sys/fs/cgroup/Example/tasks/ directory defines a child group with files that relate purely
to cpu and cpuset controllers. You can now assign processes to this control group and utilize
cpu and cpuset controller options for your processes.
# ll /sys/fs/cgroup/Example/tasks
-r--r--r--. 1 root root 0 Jun 1 11:45 cgroup.controllers
-r--r--r--. 1 root root 0 Jun 1 11:45 cgroup.events
-rw-r--r--. 1 root root 0 Jun 1 11:45 cgroup.freeze
-rw-r--r--. 1 root root 0 Jun 1 11:45 cgroup.max.depth
-rw-r--r--. 1 root root 0 Jun 1 11:45 cgroup.max.descendants
-rw-r--r--. 1 root root 0 Jun 1 11:45 cgroup.procs
-r--r--r--. 1 root root 0 Jun 1 11:45 cgroup.stat
-rw-r--r--. 1 root root 0 Jun 1 11:45 cgroup.subtree_control
-rw-r--r--. 1 root root 0 Jun 1 11:45 cgroup.threads
-rw-r--r--. 1 root root 0 Jun 1 11:45 cgroup.type
-rw-r--r--. 1 root root 0 Jun 1 11:45 cpu.max
-rw-r--r--. 1 root root 0 Jun 1 11:45 cpu.pressure
-rw-r--r--. 1 root root 0 Jun 1 11:45 cpuset.cpus
-r--r--r--. 1 root root 0 Jun 1 11:45 cpuset.cpus.effective
-rw-r--r--. 1 root root 0 Jun 1 11:45 cpuset.cpus.partition
-rw-r--r--. 1 root root 0 Jun 1 11:45 cpuset.mems
-r--r--r--. 1 root root 0 Jun 1 11:45 cpuset.mems.effective
-r--r--r--. 1 root root 0 Jun 1 11:45 cpu.stat
-rw-r--r--. 1 root root 0 Jun 1 11:45 cpu.weight
-rw-r--r--. 1 root root 0 Jun 1 11:45 cpu.weight.nice
-rw-r--r--. 1 root root 0 Jun 1 11:45 io.pressure
-rw-r--r--. 1 root root 0 Jun 1 11:45 memory.pressure
IMPORTANT
The cpu controller is only activated if the relevant child control group has at least 2
processes which compete for time on a single CPU.
Verification steps
Optional: confirm that you have created a new cgroup with only the desired controllers active:
# cat /sys/fs/cgroup/Example/tasks/cgroup.controllers
cpuset cpu
Additional resources
Mounting cgroups-v1
Prerequisites
You have applications for which you want to control distribution of CPU time.
You created a two level hierarchy of child control groups inside the /sys/fs/cgroup/ root control
group as in the following example:
…
├── Example
│ ├── g1
│ ├── g2
│ └── g3
…
You enabled the cpu controller in the parent control group and in child control groups similarly
as described in Creating cgroups and enabling controllers in cgroups-v2 file system .
Procedure
1. Configure desired CPU weights to achieve resource restrictions within the control groups:
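A minimal sketch, using the weights 150, 100, and 50 that the verification later in this section refers to:
# echo "150" > /sys/fs/cgroup/Example/g1/cpu.weight
# echo "100" > /sys/fs/cgroup/Example/g2/cpu.weight
# echo "50" > /sys/fs/cgroup/Example/g3/cpu.weight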
2. Add the applications' PIDs to the g1, g2, and g3 child groups:
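A minimal sketch, using the PIDs 33373, 33374, and 33377 that appear in the verification below:
# echo "33373" > /sys/fs/cgroup/Example/g1/cgroup.procs
# echo "33374" > /sys/fs/cgroup/Example/g2/cgroup.procs
# echo "33377" > /sys/fs/cgroup/Example/g3/cgroup.procs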
The example commands ensure that desired applications become members of the Example/g*/
child cgroups and will get their CPU time distributed as per the configuration of those cgroups.
The weights of the children cgroups (g1, g2, g3) that have running processes are summed up at
the level of the parent cgroup (Example). The CPU resource is then distributed proportionally
based on the respective weights.
As a result, when all processes run at the same time, the kernel allocates to each of them the
proportionate CPU time based on their respective cgroup’s cpu.weight file:
g1 150 ~50% (150/300)
g2 100 ~33% (100/300)
g3 50 ~16% (50/300)
If one process stopped running, leaving cgroup g2 with no running processes, the calculation
would omit the cgroup g2 and only account for the weights of cgroups g1 and g3:
g1 150 ~75% (150/200)
g3 50 ~25% (50/200)
IMPORTANT
If a child cgroup had multiple running processes, the CPU time allocated to the
respective cgroup would be distributed equally to the member processes of that
cgroup.
Verification
The command output shows the processes of the specified applications that run in the
Example/g*/ child cgroups.
# top
top - 05:17:18 up 1 day, 18:25, 1 user, load average: 3.03, 3.03, 3.00
Tasks: 95 total, 4 running, 91 sleeping, 0 stopped, 0 zombie
%Cpu(s): 18.1 us, 81.6 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.3 hi, 0.0 si, 0.0 st
MiB Mem : 3737.0 total, 3233.7 free, 132.8 used, 370.5 buff/cache
MiB Swap: 4060.0 total, 4060.0 free, 0.0 used. 3373.1 avail Mem
NOTE
We forced all the example processes to run on a single CPU for clearer
illustration. The CPU weight applies the same principles also when used on
multiple CPUs.
Notice that the CPU resource for the PID 33373, PID 33374, and PID 33377 was allocated
based on the weights, 150, 100, 50, you assigned to the respective child cgroups. The weights
correspond to around 50%, 33%, and 16% allocation of CPU time for each application.
Additional resources
NOTE
Both cgroup-v1 and cgroup-v2 are fully enabled in the kernel. There is no default control
group version from the kernel point of view; systemd decides which version to mount at
startup.
Prerequisites
Procedure
1. Configure the system to mount cgroups-v1 by default during system boot by the systemd
system and service manager:
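One possible approach (a sketch; the exact parameters are an assumption) is to add the relevant kernel command-line options with grubby:
# grubby --update-kernel=ALL --args="systemd.unified_cgroup_hierarchy=0 systemd.legacy_systemd_cgroup_controller"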
This adds the necessary kernel command-line parameters to the current boot entry.
Verification
# ll /sys/fs/cgroup/
dr-xr-xr-x. 10 root root 0 Mar 16 09:34 blkio
lrwxrwxrwx. 1 root root 11 Mar 16 09:34 cpu -> cpu,cpuacct
lrwxrwxrwx. 1 root root 11 Mar 16 09:34 cpuacct -> cpu,cpuacct
dr-xr-xr-x. 10 root root 0 Mar 16 09:34 cpu,cpuacct
dr-xr-xr-x. 2 root root 0 Mar 16 09:34 cpuset
dr-xr-xr-x. 10 root root 0 Mar 16 09:34 devices
dr-xr-xr-x. 2 root root 0 Mar 16 09:34 freezer
dr-xr-xr-x. 2 root root 0 Mar 16 09:34 hugetlb
dr-xr-xr-x. 10 root root 0 Mar 16 09:34 memory
dr-xr-xr-x. 2 root root 0 Mar 16 09:34 misc
lrwxrwxrwx. 1 root root 16 Mar 16 09:34 net_cls -> net_cls,net_prio
dr-xr-xr-x. 2 root root 0 Mar 16 09:34 net_cls,net_prio
lrwxrwxrwx. 1 root root 16 Mar 16 09:34 net_prio -> net_cls,net_prio
dr-xr-xr-x. 2 root root 0 Mar 16 09:34 perf_event
dr-xr-xr-x. 10 root root 0 Mar 16 09:34 pids
dr-xr-xr-x. 2 root root 0 Mar 16 09:34 rdma
dr-xr-xr-x. 11 root root 0 Mar 16 09:34 systemd
The /sys/fs/cgroup/ directory, also called the root control group, by default, contains controller-
specific directories such as cpuset. In addition, there are some directories related to systemd.
Additional resources
Prerequisites
You configured the system to mount cgroups-v1 by default during system boot by the
systemd system and service manager:
This adds the necessary kernel command-line parameters to the current boot entry.
Procedure
1. Identify the process ID (PID) of the application you want to restrict in CPU consumption:
# top
top - 11:34:09 up 11 min, 1 user, load average: 0.51, 0.27, 0.22
Tasks: 267 total, 3 running, 264 sleeping, 0 stopped, 0 zombie
%Cpu(s): 49.0 us, 3.3 sy, 0.0 ni, 47.5 id, 0.0 wa, 0.2 hi, 0.0 si, 0.0 st
MiB Mem : 1826.8 total, 303.4 free, 1046.8 used, 476.5 buff/cache
MiB Swap: 1536.0 total, 1396.0 free, 140.0 used. 616.4 avail Mem
The example output of the top program reveals that PID 6955 (illustrative application
sha1sum) consumes a lot of CPU resources.
# mkdir /sys/fs/cgroup/cpu/Example/
The directory above represents a control group, where you can place specific processes and
apply certain CPU limits to the processes. At the same time, some cgroups-v1 interface files
and cpu controller-specific files will be created in the directory.
# ll /sys/fs/cgroup/cpu/Example/
-rw-r--r--. 1 root root 0 Mar 11 11:42 cgroup.clone_children
-rw-r--r--. 1 root root 0 Mar 11 11:42 cgroup.procs
-r--r--r--. 1 root root 0 Mar 11 11:42 cpuacct.stat
-rw-r--r--. 1 root root 0 Mar 11 11:42 cpuacct.usage
-r--r--r--. 1 root root 0 Mar 11 11:42 cpuacct.usage_all
-r--r--r--. 1 root root 0 Mar 11 11:42 cpuacct.usage_percpu
-r--r--r--. 1 root root 0 Mar 11 11:42 cpuacct.usage_percpu_sys
-r--r--r--. 1 root root 0 Mar 11 11:42 cpuacct.usage_percpu_user
-r--r--r--. 1 root root 0 Mar 11 11:42 cpuacct.usage_sys
-r--r--r--. 1 root root 0 Mar 11 11:42 cpuacct.usage_user
-rw-r--r--. 1 root root 0 Mar 11 11:42 cpu.cfs_period_us
-rw-r--r--. 1 root root 0 Mar 11 11:42 cpu.cfs_quota_us
-rw-r--r--. 1 root root 0 Mar 11 11:42 cpu.rt_period_us
-rw-r--r--. 1 root root 0 Mar 11 11:42 cpu.rt_runtime_us
-rw-r--r--. 1 root root 0 Mar 11 11:42 cpu.shares
-r--r--r--. 1 root root 0 Mar 11 11:42 cpu.stat
-rw-r--r--. 1 root root 0 Mar 11 11:42 notify_on_release
-rw-r--r--. 1 root root 0 Mar 11 11:42 tasks
The example output shows files, such as cpuacct.usage and cpu.cfs_period_us, that represent
specific configurations and/or limits, which can be set for processes in the Example control
group. Notice that the respective file names are prefixed with the name of the control group
controller to which they belong.
By default, the newly created control group inherits access to the system’s entire CPU
resources without a limit.
The cpu.cfs_period_us file represents a period of time in microseconds (µs, represented here
as "us") for how frequently a control group’s access to CPU resources should be reallocated.
The upper limit is 1 second and the lower limit is 1000 microseconds.
The cpu.cfs_quota_us file represents the total amount of time in microseconds for which all
processes collectively in a control group can run during one period (as defined by
cpu.cfs_period_us). As soon as processes in a control group, during a single period, use up all
the time specified by the quota, they are throttled for the remainder of the period and not
allowed to run until the next period. The lower limit is 1000 microseconds.
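For example, the limits referred to below can be set as follows (these values match the verification output later in this procedure):
# echo "1000000" > /sys/fs/cgroup/cpu/Example/cpu.cfs_period_us
# echo "200000" > /sys/fs/cgroup/cpu/Example/cpu.cfs_quota_us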
The example commands above set the CPU time limits so that all processes collectively in the
Example control group will be able to run only for 0.2 seconds (defined by cpu.cfs_quota_us)
out of every 1 second (defined by cpu.cfs_period_us).
# cat /sys/fs/cgroup/cpu/Example/cpu.cfs_period_us /sys/fs/cgroup/cpu/Example/cpu.cfs_quota_us
1000000
200000
Next, add the PID of the application to the Example control group, using either the cgroup.procs file or the legacy tasks file:
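A minimal sketch of both variants, using PID 6955 from the earlier top output:
# echo "6955" > /sys/fs/cgroup/cpu/Example/cgroup.procs
# echo "6955" > /sys/fs/cgroup/cpu/Example/tasks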
The previous command ensures that a desired application becomes a member of the Example
control group and hence does not exceed the CPU limits configured for the Example control
group. The PID should represent an existing process in the system. The PID 6955 here was
assigned to process sha1sum /dev/zero &, used to illustrate the use-case of the cpu controller.
# cat /proc/6955/cgroup
12:cpuset:/
11:hugetlb:/
10:net_cls,net_prio:/
9:memory:/user.slice/user-1000.slice/user@1000.service
8:devices:/user.slice
7:blkio:/
6:freezer:/
5:rdma:/
4:pids:/user.slice/user-1000.slice/user@1000.service
3:perf_event:/
2:cpu,cpuacct:/Example
1:name=systemd:/user.slice/user-1000.slice/user@1000.service/gnome-terminal-
server.service
The example output above shows that the process of the desired application runs in the
Example control group, which applies CPU limits to the application’s process.
# top
top - 12:28:42 up 1:06, 1 user, load average: 1.02, 1.02, 1.00
Tasks: 266 total, 6 running, 260 sleeping, 0 stopped, 0 zombie
%Cpu(s): 11.0 us, 1.2 sy, 0.0 ni, 87.5 id, 0.0 wa, 0.2 hi, 0.0 si, 0.2 st
MiB Mem : 1826.8 total, 287.1 free, 1054.4 used, 485.3 buff/cache
MiB Swap: 1536.0 total, 1396.7 free, 139.2 used. 608.3 avail Mem
Notice that the CPU consumption of the PID 6955 has decreased from 99% to 20%.
IMPORTANT
Additional resources
CHAPTER 37. ANALYZING SYSTEM PERFORMANCE WITH BPF COMPILER COLLECTION
BCC removes the need for users to know deep technical details of eBPF, and provides many out-of-
the-box starting points, such as the bcc-tools package with pre-created eBPF programs.
NOTE
The eBPF programs are triggered on events, such as disk I/O, TCP connections, and
process creations. The programs are unlikely to cause the kernel to crash, loop, or
become unresponsive because they run in a safe virtual machine in the kernel.
Procedure
1. Install bcc-tools:
# ll /usr/share/bcc/tools/
...
-rwxr-xr-x. 1 root root 4198 Dec 14 17:53 dcsnoop
-rwxr-xr-x. 1 root root 3931 Dec 14 17:53 dcstat
-rwxr-xr-x. 1 root root 20040 Dec 14 17:53 deadlock_detector
-rw-r--r--. 1 root root 7105 Dec 14 17:53 deadlock_detector.c
drwxr-xr-x. 3 root root 8192 Mar 11 10:28 doc
-rwxr-xr-x. 1 root root 7588 Dec 14 17:53 execsnoop
-rwxr-xr-x. 1 root root 6373 Dec 14 17:53 ext4dist
-rwxr-xr-x. 1 root root 10401 Dec 14 17:53 ext4slower
...
The doc directory in the listing above contains documentation for each tool.
This section describes how to use certain pre-created programs from the BPF Compiler Collection
(BCC) library to efficiently and securely analyze the system performance on the per-event basis. The
set of pre-created programs in the BCC library can serve as examples for creation of additional
programs.
Prerequisites
Root permissions
# /usr/share/bcc/tools/execsnoop
$ ls /usr/share/bcc/tools/doc/
3. The terminal running execsnoop shows the output similar to the following:
The execsnoop program prints a line of output for each new process that consumes system
resources. It even detects processes of programs that run very briefly, such as ls, which most
monitoring tools would not register.
RET - The return value of the exec() system call (0), which loads program code into new
processes.
To see more details, examples, and options for execsnoop, refer to the
/usr/share/bcc/tools/doc/execsnoop_example.txt file.
1. Run the opensnoop program in one terminal:
# /usr/share/bcc/tools/opensnoop -n uname
The command above prints output only for files opened by the process of the uname command.
2. In another terminal, run:
$ uname
The command above opens certain files, which are captured in the next step.
3. The terminal running opensnoop shows the output similar to the following:
The opensnoop program watches the open() system call across the whole system, and prints a
line of output for each file that uname tried to open along the way.
FD - The file descriptor - a value that open() returns to refer to the open file. ( 3)
To see more details, examples, and options for opensnoop, refer to the
/usr/share/bcc/tools/doc/opensnoop_example.txt file.
1. Run the biotop program in one terminal:
# /usr/share/bcc/tools/biotop 30
The command monitors the top processes performing I/O operations on the disk. The 30 argument produces a 30-second summary.
2. In another terminal, generate disk I/O traffic, for example:
# dd if=/dev/vda of=/dev/zero
The command above reads the content from the local hard disk device and writes the output to the /dev/zero file. This step generates certain I/O traffic to illustrate biotop.
3. The terminal running biotop shows the output similar to the following:
To see more details, examples, and options for biotop, refer to the
/usr/share/bcc/tools/doc/biotop_example.txt file.
1. Run the xfsslower program in one terminal:
# /usr/share/bcc/tools/xfsslower 1
The command above measures the time the XFS file system spends performing read, write, open, or sync (fsync) operations. The 1 argument ensures that the program shows only operations slower than 1 ms.
2. In another terminal, run, for example:
$ vim text
The command above creates a text file in the vim editor to initiate certain interaction with the XFS file system.
3. The terminal running xfsslower shows something similar upon saving the file from the previous
step:
Each line above represents an operation in the file system that took more time than the configured threshold. xfsslower is good at exposing possible file system problems, which can take the form of unexpectedly slow operations.
The operation types shown are Read, Write, and Sync.
To see more details, examples, and options for xfsslower, refer to the
/usr/share/bcc/tools/doc/xfsslower_example.txt file.
CHAPTER 38. CONFIGURING AN OPERATING SYSTEM TO OPTIMIZE MEMORY ACCESS
The vmstat tool, provided by the procps-ng package, displays reports of a system's processes, memory, paging, block I/O, traps, disks, and CPU activity. It provides an instantaneous report of the average of these events since the machine was last turned on, or since the previous report.
The valgrind framework provides instrumentation to user-space binaries. Install this tool using the dnf install valgrind command. valgrind includes a number of tools that you can use to profile and analyze program performance, such as:
The memcheck option is the default valgrind tool. It detects and reports on a number of memory errors that can be difficult to detect and diagnose, such as:
Pointer overlap
Memory leaks
NOTE
Memcheck can only report these errors, it cannot prevent them from
occurring. However, memcheck logs an error message immediately
before the error occurs.
The cachegrind option simulates application interaction with a system's cache hierarchy and branch predictor. It gathers statistics for the duration of the application's execution and outputs a summary to the console.
The massif option measures the heap space used by a specified application. It measures both useful space and any additional space allocated for bookkeeping and alignment purposes.
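As a brief illustration, each tool is selected with the valgrind --tool option; the ./myapp binary here is only a hypothetical example:
$ valgrind --tool=memcheck ./myapp
$ valgrind --tool=cachegrind ./myapp
$ valgrind --tool=massif ./myapp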
Additional resources
/usr/share/doc/valgrind-version/valgrind_manual.pdf file
The Linux Kernel is designed to maximize the utilization of a system’s memory resources (RAM). Due to
these design characteristics, and depending on the memory requirements of the workload, part of the
system’s memory is in use within the kernel on behalf of the workload, while a small part of the memory is
free. This free memory is reserved for special system allocations, and for other low or high priority
system services.
The rest of the system’s memory is dedicated to the workload itself, and divided into the following two
categories:
File memory
Pages added in this category represent parts of files in permanent storage. These pages, from the page cache, can be mapped or unmapped in an application's address space. Applications can map files into their address space using the mmap system call, or operate on files through the buffered I/O read or write system calls.
Buffered I/O system calls, as well as applications that map pages directly, can re-utilize unmapped
pages. As a result, these pages are stored in the cache by the kernel, especially when the system is
not running any memory intensive tasks, to avoid re-issuing costly I/O operations over the same set
of pages.
Anonymous memory
Pages in this category are in use by a dynamically allocated process, or are not related to files in permanent storage. This set of pages backs up the in-memory control structures of each task, such as the application stack and heap areas.
vm.dirty_ratio
A percentage value. When this percentage of the total system memory is modified, the system
begins writing the modifications to the disk with the pdflush operation. The default value is 20
percent.
vm.dirty_background_ratio
A percentage value. When this percentage of total system memory is modified, the system begins
writing the modifications to the disk in the background. The default value is 10 percent.
vm.overcommit_memory
Defines the conditions that determine whether a large memory request is accepted or denied. The default value is 0.
By default, the kernel checks whether a virtual memory allocation request fits into the present amount of memory (total + swap) and rejects only obviously large requests. Otherwise, virtual memory allocations are granted, which means memory can be overcommitted.
When this parameter is set to 1, the kernel performs no memory overcommit handling. This
increases the possibility of memory overload, but improves performance for memory-
intensive tasks.
When this parameter is set to 2, the kernel denies requests for memory equal to or larger
than the sum of the total available swap space and the percentage of physical RAM specified
in the overcommit_ratio. This reduces the risk of overcommitting memory, but is
recommended only for systems with swap areas larger than their physical memory.
vm.overcommit_ratio
Specifies the percentage of physical RAM considered when overcommit_memory is set to 2. The
default value is 50.
vm.max_map_count
Defines the maximum number of memory map areas that a process can use. The default value is
65530. Increase this value if your application needs more memory map areas.
vm.min_free_kbytes
Sets the size of the reserved free pages pool. It is also responsible for setting the min_page,
low_page, and high_page thresholds that govern the behavior of the Linux kernel’s page reclaim
algorithms. It also specifies the minimum number of kilobytes to keep free across the system. This
calculates a specific value for each low memory zone, each of which is assigned a number of reserved
free pages in proportion to their size.
Setting the vm.min_free_kbytes parameter’s value:
Increasing the parameter value effectively reduces the memory usable by the application working set. Therefore, you might want to increase it only for kernel-driven workloads, where driver buffers need to be allocated in atomic contexts.
Decreasing the parameter value might render the kernel unable to service system requests,
if memory becomes heavily contended in the system.
WARNING
The vm.min_free_kbytes parameter also sets a page reclaim watermark, called min_pages.
This watermark is used as a factor when determining the two other memory watermarks,
low_pages, and high_pages, that govern page reclaim algorithms.
/proc/PID/oom_adj
In the event that a system runs out of memory, and the panic_on_oom parameter is set to 0, the
oom_killer function kills processes, starting with the process that has the highest oom_score, until
the system recovers.
The oom_adj parameter determines the oom_score of a process. This parameter is set per process
identifier. A value of -17 disables the oom_killer for that process. Other valid values range from -16
to 15.
vm.swappiness
The swappiness value, ranging from 0 to 200, controls the degree to which the system favors
reclaiming memory from the anonymous memory pool, or the page cache memory pool.
Setting the swappiness parameter’s value:
Higher values favor file-mapped workloads by swapping out the less actively accessed anonymous memory of processes. This is useful for file servers or streaming applications that depend on data from files in storage residing in memory to reduce I/O latency for service requests.
Low values favor anonymous-mapped workloads by reclaiming the page cache (file-mapped memory) instead. This setting is useful for applications that do not depend heavily on file system information and heavily utilize dynamically allocated and private memory, such as mathematical and number-crunching applications, and some hardware virtualization hypervisors like QEMU.
The default value of the vm.swappiness parameter is 60.
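As a quick sketch, you can inspect the current values of the parameters described above with the sysctl tool before changing any of them:
# sysctl vm.dirty_ratio vm.dirty_background_ratio vm.overcommit_memory vm.overcommit_ratio vm.max_map_count vm.min_free_kbytes vm.swappiness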
aio-max-nr
Defines the maximum allowed number of events in all active asynchronous input/output contexts.
The default value is 65536, and modifying this value does not pre-allocate or resize any kernel data
structures.
file-max
Determines the maximum number of file handles for the entire system. The default value on Red Hat
Enterprise Linux 9 is either 8192 or one tenth of the free memory pages available at the time the
kernel starts, whichever is higher.
Raising this value can resolve errors caused by a lack of available file handles.
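For example (a minimal sketch; the value shown is only illustrative), you can check both limits and temporarily raise file-max with sysctl:
# sysctl fs.aio-max-nr fs.file-max
# sysctl -w fs.file-max=500000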
The following are the available kernel parameters used to set up limits for the msg* and shm* System V
IPC (sysvipc) system calls:
msgmax
Defines the maximum allowed size in bytes of any single message in a message queue. This value
must not exceed the size of the queue (msgmnb). Use the sysctl msgmax command to determine
the current msgmax value on your system.
msgmnb
Defines the maximum size in bytes of a single message queue. Use the sysctl msgmnb command to
determine the current msgmnb value on your system.
msgmni
Defines the maximum number of message queue identifiers, and therefore the maximum number of
queues. Use the sysctl msgmni command to determine the current msgmni value on your system.
shmall
Defines the total amount of shared memory pages that can be used on the system at one time. For
example, a page is 4096 bytes on the AMD64 and Intel 64 architecture. Use the sysctl shmall
command to determine the current shmall value on your system.
shmmax
Defines the maximum size in bytes of a single shared memory segment allowed by the kernel. Shared
memory segments up to 1Gb are now supported in the kernel. Use the sysctl shmmax command to
determine the current shmmax value on your system.
shmmni
Defines the system-wide maximum number of shared memory segments. The default value is 4096
on all systems.
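For example, you can display all of these limits at once with sysctl (a minimal sketch using the kernel. prefix under which they are exposed):
# sysctl kernel.msgmax kernel.msgmnb kernel.msgmni kernel.shmall kernel.shmmax kernel.shmmni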
This procedure describes how to set a memory-related kernel parameter temporarily and persistently.
Procedure
To temporarily set the memory-related kernel parameters, edit the respective files in the /proc file system or use the sysctl tool.
For example, to temporarily set the vm.overcommit_memory parameter to 1:
# sysctl -w vm.overcommit_memory=1
To persistently set the memory-related kernel parameter, edit the /etc/sysctl.conf file and
reload the settings.
For example, to persistently set the vm.overcommit_memory parameter to 1:
vm.overcommit_memory=1
# sysctl -p
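You can then verify the resulting value, for example:
# sysctl vm.overcommit_memory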
CHAPTER 39. CONFIGURING HUGE PAGES
However, specific applications can benefit from using larger page sizes in certain cases. For example, an
application that works with a large and relatively fixed data set of hundreds of megabytes or even
dozens of gigabytes can have performance issues when using 4 KB pages. Such data sets can require a
huge amount of 4 KB pages, which can lead to overhead in the operating system and the CPU.
This section provides information about huge pages available in RHEL 9 and how you can configure
them.
The following are the huge page methods, which are supported in RHEL 9:
HugeTLB pages
HugeTLB pages are also called static huge pages. There are two ways of reserving HugeTLB pages:
At boot time: It increases the possibility of success because the memory has not yet been
significantly fragmented. However, on NUMA machines, the number of pages is automatically
split among the NUMA nodes. For more information on parameters that influence HugeTLB page behavior at boot time, see Parameters for reserving HugeTLB pages at boot time. For information on how to use these parameters to configure HugeTLB pages at boot time, see Configuring HugeTLB at boot time.
At run time: It allows you to reserve the huge pages per NUMA node. If the run-time
reservation is done as early as possible in the boot process, the probability of memory
fragmentation is lower. For more information on parameters that influence HugeTLB page behavior at run time, see Parameters for reserving HugeTLB pages at run time. For information on how to use these parameters to configure HugeTLB pages at run time, see Configuring HugeTLB at run time.
Transparent HugeTLB pages
The kernel manages transparent huge pages automatically. They can be enabled in the following two modes:
system-wide: Here, the kernel tries to assign huge pages to a process whenever it is possible
to allocate the huge pages and the process is using a large contiguous virtual memory area.
per-process: Here, the kernel only assigns huge pages to the memory areas of individual
processes which you can specify using the madvise() system call.
NOTE
For more information on enabling and disabling transparent hugepages, see Enabling transparent hugepages and Disabling transparent hugepages.
For more information on how to use these parameters to configure HugeTLB pages at boot time, see Configuring HugeTLB at boot time.
Procedure
2. Create a systemd service unit file in the /usr/lib/systemd/system/ directory and add the following content:
[Unit]
Description=HugeTLB Gigantic Pages Reservation
DefaultDependencies=no
Before=dev-hugepages.mount
ConditionPathExists=/sys/devices/system/node
ConditionKernelCommandLine=hugepagesz=1G
[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/lib/systemd/hugetlb-reserve-pages.sh
[Install]
WantedBy=sysinit.target
3. Create a new file called hugetlb-reserve-pages.sh in the /usr/lib/systemd/ directory and add
the following content:
While adding the following content, replace number_of_pages with the number of 1GB pages
you want to reserve, and node with the name of the node on which to reserve these pages.
#!/bin/sh
nodes_path=/sys/devices/system/node/
if [ ! -d $nodes_path ]; then
echo "ERROR: $nodes_path does not exist"
exit 1
fi
reserve_pages()
{
echo $1 > $nodes_path/$2/hugepages/hugepages-1048576kB/nr_hugepages
}
reserve_pages number_of_pages node
For example, to reserve two 1 GB pages on node0 and one 1GB page on node1, replace the
number_of_pages with 2 for node0 and 1 for node1:
reserve_pages 2 node0
reserve_pages 1 node1
4. Make the script executable with the following command:
# chmod +x /usr/lib/systemd/hugetlb-reserve-pages.sh
NOTE
You can try reserving more 1GB pages at runtime by writing to nr_hugepages at
any time. However, such reservations can fail due to memory fragmentation. The
most reliable way to reserve 1 GB pages is by using this hugetlb-reserve-
pages.sh script, which runs early during boot.
Reserving static huge pages can effectively reduce the amount of memory available to the system, and prevent it from properly utilizing its full memory
capacity. Although a properly sized pool of reserved huge pages can be
beneficial to applications that utilize it, an oversized or unused pool of reserved
huge pages will eventually be detrimental to overall system performance. When
setting a reserved huge page pool, ensure that the system can properly utilize its
full memory capacity.
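To check the reserved pool after boot, you can, for example, inspect the per-node counter used by the script above or the system-wide summary (a minimal sketch; node0 follows the example above):
# cat /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
# grep -i huge /proc/meminfo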
Additional resources
/usr/share/doc/kernel-doc-kernel_version/Documentation/vm/hugetlbpage.txt file
For more information on how to use these parameters to configure HugeTLB pages at run time, see
Configuring HugeTLB at run time .
Replace node2 with the node on which you wish to reserve the pages.
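The command being referred to uses the per-node sysfs interface directly; as a sketch (the page count of 20 and the 2048kB page size are illustrative assumptions to adapt to your system):
# echo 20 > /sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages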
Procedure
1. Check the current status of THP:
# cat /sys/kernel/mm/transparent_hugepage/enabled
2. Enable THP:
# echo always > /sys/kernel/mm/transparent_hugepage/enabled
3. To prevent applications from allocating more memory resources than necessary, disable the system-wide transparent huge pages and only enable them for the applications that explicitly request them through madvise:
# echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
NOTE
Sometimes, providing low latency to short-lived allocations has higher priority than
immediately achieving the best performance with long-lived allocations. In such cases,
you can disable direct compaction while leaving THP enabled.
Direct compaction is synchronous memory compaction during the huge page allocation. Disabling direct compaction provides no guarantee of saving memory, but can decrease the risk of higher latencies during frequent page faults. Note that if the workload benefits significantly from THP, disabling direct compaction can decrease performance. Disable direct compaction:
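One way to do this, as a sketch, is to write never to the THP defrag control, which prevents synchronous compaction at page-fault time:
# echo never > /sys/kernel/mm/transparent_hugepage/defrag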
Procedure
1. Check the current status of THP:
# cat /sys/kernel/mm/transparent_hugepage/enabled
2. Disable THP:
# echo never > /sys/kernel/mm/transparent_hugepage/enabled
If a requested address mapping is not in the TLB, called a TLB miss, the system still needs to read the page table to determine the virtual-to-physical address mapping. Because of the relationship between
application memory requirements and the size of pages used to cache address mappings, applications
with large memory requirements are more likely to suffer performance degradation from TLB misses
than applications with minimal memory requirements. It is therefore important to avoid TLB misses
wherever possible.
Both HugeTLB and Transparent Huge Page features allow applications to use pages larger than 4 KB.
This allows addresses stored in the TLB to reference more memory, which reduces TLB misses and
improves application performance.
CHAPTER 40. GETTING STARTED WITH SYSTEMTAP
As an application developer, you can use SystemTap to monitor in fine detail how your application
behaves within the Linux system.
SystemTap aims to supplement the existing suite of Linux monitoring tools by providing users with the
infrastructure to track kernel activity and combining this capability with two attributes:
Flexibility
the SystemTap framework enables you to develop simple scripts for investigating and monitoring a
wide variety of kernel functions, system calls, and other events that occur in kernel space. With this,
SystemTap is not so much a tool as it is a system that allows you to develop your own kernel-specific
forensic and monitoring tools.
Ease-of-Use
SystemTap enables you to monitor kernel activity without having to recompile the kernel or reboot
the system.
Prerequisites
You have enabled debug repositories as described in Enabling debug and source repositories.
Procedure
a. Using stap-prep:
# stap-prep
b. If stap-prep does not work, install the required kernel packages manually:
$(uname -i) is automatically replaced with the hardware platform of your system and
$(uname -r) is automatically replaced with the version of your running kernel.
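A sketch of what the manual installation typically looks like, assuming the standard RHEL kernel package naming and enabled debug repositories:
# dnf install kernel-devel-$(uname -r) kernel-debuginfo-$(uname -r) kernel-debuginfo-common-$(uname -i)-$(uname -r)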
Verification steps
If the kernel to be probed with SystemTap is currently in use, test if your installation was
successful:
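A common smoke test, as a sketch, is a one-line probe that fires on a VFS read, prints a message, and exits:
# stap -v -e 'probe vfs.read {printf("read performed\n"); exit()}'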
The last three lines of output (beginning with Pass 5) indicate that:
1 SystemTap successfully created the instrumentation to probe the kernel and ran the
instrumentation.
2 SystemTap detected the specified event (in this case, A VFS read).
3 SystemTap executed a valid handler (printed text and then closed it with no errors).
To allow users to run SystemTap without root access, add users to both of these user groups:
stapdev
Members of this group can use stap to run SystemTap scripts, or staprun to run SystemTap
instrumentation modules.
Running stap involves compiling SystemTap scripts into kernel modules and loading them into the
kernel. This requires elevated privileges to the system, which are granted to stapdev members.
Unfortunately, such privileges also grant effective root access to stapdev members. As such, only
grant stapdev group membership to users who can be trusted with root access.
stapusr
Members of this group can only use staprun to run SystemTap instrumentation modules. In addition,
they can only run those modules from the /lib/modules/kernel_version/systemtap/ directory. This
directory must be owned only by the root user, and must only be writable by the root user.
Sample scripts that are distributed with the installation of SystemTap can be found in the
/usr/share/systemtap/examples directory.
Prerequisites
1. SystemTap and the associated required kernel packages are installed as described in Installing
Systemtap.
2. To run SystemTap scripts as a normal user, add the user to the SystemTap groups:
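For example (a minimal sketch; replace user-name with the actual account):
# usermod --append --groups stapdev,stapusr user-name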
Procedure
From standard input:
This form instructs stap to run a script passed by echo to standard input. To add stap options, insert them before the - character. For example, to make the results more verbose, add the -v option.
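A sketch of both forms; the probe body here is only an illustrative placeholder:
# echo "probe timer.s(1) {exit()}" | stap -
# echo "probe timer.s(1) {exit()}" | stap -v -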
From a file:
# stap file_name
CHAPTER 41. CROSS-INSTRUMENTATION OF SYSTEMTAP
Normally, SystemTap scripts can run only on systems where SystemTap is deployed. To run SystemTap
on ten systems, SystemTap needs to be deployed on all those systems. In some cases, this might be
neither feasible nor desired. For example, corporate policy might prohibit you from installing packages
that provide compilers or debug information on specific machines, which will prevent the deployment of
SystemTap.
The kernel information packages for various machines can be installed on a single host machine.
IMPORTANT
Kernel packaging bugs may prevent the installation. In such cases, the kernel-
debuginfo and kernel-devel packages for the host system and target system
must match. If a bug occurs, report the bug at https://bugzilla.redhat.com/.
Each target machine needs only one package to be installed to use the generated SystemTap
instrumentation module: systemtap-runtime.
IMPORTANT
The host system must be the same architecture and running the same
distribution of Linux as the target system in order for the built instrumentation
module to work.
TERMINOLOGY
instrumentation module
The kernel module built from a SystemTap script; the SystemTap module is built on
the host system, and will be loaded on the target kernel of the target system.
host system
The system on which the instrumentation modules (from SystemTap scripts) are
compiled, to be loaded on target systems.
target system
The system on which the instrumentation module (built from SystemTap scripts) is loaded and run.
target kernel
The kernel of the target system. This is the kernel that loads and runs the
instrumentation module.
Prerequisites
Both the host system and target system are the same architecture.
Both the host system and target system are running the same major version of Red Hat
Enterprise Linux (such as Red Hat Enterprise Linux 9).
Procedure
1. Determine the kernel running on each target system:
$ uname -r
2. On the host system, install the target kernel and related packages for each target system by the
method described in Installing Systemtap.
3. Build an instrumentation module on the host system, then copy this module to and run it on the target system in either of the following ways:
a. Remotely:
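A sketch of the remote form using stap's --remote option; target_host and example.stp are placeholders:
# stap --remote target_host example.stp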
This command remotely implements the specified script on the target system. You must
ensure an SSH connection can be made to the target system from the host system for this
to be successful.
b. Manually:
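i. On the host system, build the module; a sketch using the placeholders explained below:
# stap -r kernel_version -m module_name -p4 script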
Here, kernel_version refers to the version of the target kernel determined in step 1,
script refers to the script to be converted into an instrumentation module, and
module_name is the desired name of the instrumentation module. The -p4 option tells
SystemTap to not load and run the compiled module.
ii. Once the instrumentation module is compiled, copy it to the target system and load it
using the following command:
# staprun module_name.ko
CHAPTER 42. MONITORING NETWORK ACTIVITY WITH SYSTEMTAP
PID
The ID of the listed process.
UID
User ID. A user ID of 0 refers to the root user.
DEV
Which ethernet device the process used to send or receive data (for example, eth0, eth1).
XMIT_PK
The number of packets transmitted by the process.
RECV_PK
The number of packets received by the process.
XMIT_KB
The amount of data sent by the process, in kilobytes.
RECV_KB
The amount of data received by the process, in kilobytes.
Prerequisites
Procedure
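A sketch of running the script that produces the fields described above (the nettop.stp name and the network/ subdirectory are assumptions about the examples layout):
# stap /usr/share/systemtap/examples/network/nettop.stp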
[...]
PID UID DEV XMIT_PK RECV_PK XMIT_KB RECV_KB COMMAND
0 0 eth0 0 5 0 0 swapper
11178 0 eth0 2 0 0 0 synergyc
PID UID DEV XMIT_PK RECV_PK XMIT_KB RECV_KB COMMAND
2886 4 eth0 79 0 5 0 cups-polld
11362 0 eth0 0 61 0 5 firefox
0 0 eth0 3 32 0 3 swapper
2886 4 lo 4 4 0 0 cups-polld
11178 0 eth0 3 0 0 0 synergyc
PID UID DEV XMIT_PK RECV_PK XMIT_KB RECV_KB COMMAND
0 0 eth0 0 6 0 0 swapper
2886 4 lo 2 2 0 0 cups-polld
11178 0 eth0 3 0 0 0 synergyc
3611 0 eth0 0 1 0 0 Xorg
PID UID DEV XMIT_PK RECV_PK XMIT_KB RECV_KB COMMAND
0 0 eth0 3 42 0 2 swapper
11178 0 eth0 43 1 3 0 synergyc
11362 0 eth0 0 7 0 0 firefox
3897 0 eth0 0 1 0 0 multiload-apple
Prerequisites
Procedure
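A sketch of running the script (the network/ subdirectory is an assumption about the examples layout):
# stap /usr/share/systemtap/examples/network/socket-trace.stp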
A 3-second excerpt of the output of the socket-trace.stp script looks similar to the following:
[...]
0 Xorg(3611): -> sock_poll
3 Xorg(3611): <- sock_poll
0 Xorg(3611): -> sock_poll
3 Xorg(3611): <- sock_poll
0 gnome-terminal(11106): -> sock_poll
5 gnome-terminal(11106): <- sock_poll
0 scim-bridge(3883): -> sock_poll
3 scim-bridge(3883): <- sock_poll
0 scim-bridge(3883): -> sys_socketcall
4 scim-bridge(3883): -> sys_recv
8 scim-bridge(3883): -> sys_recvfrom
12 scim-bridge(3883):-> sock_from_file
16 scim-bridge(3883):<- sock_from_file
20 scim-bridge(3883):-> sock_recvmsg
24 scim-bridge(3883):<- sock_recvmsg
28 scim-bridge(3883): <- sys_recvfrom
31 scim-bridge(3883): <- sys_recv
35 scim-bridge(3883): <- sys_socketcall
[...]
The dropwatch.stp SystemTap script uses kernel.trace("kfree_skb") to trace packet discards; the script summarizes which locations discard packets, in 5-second intervals.
Prerequisites
Procedure
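A sketch of running the script (the network/ subdirectory is an assumption about the examples layout):
# stap /usr/share/systemtap/examples/network/dropwatch.stp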
Running the dropwatch.stp script for 15 seconds results in output similar to the following:
NOTE
If the output shows only memory addresses for the locations that discard packets, you can map them to function names using the /boot/System.map-$(uname -r) file, which lists kernel symbols and their addresses, for example:
[...]
ffffffff8024c5cd T unlock_new_inode
ffffffff8024c5da t unix_stream_sendmsg
ffffffff8024c920 t unix_stream_recvmsg
ffffffff8024cea1 t udp_v4_lookup_longway
[...]
ffffffff8044addc t arp_process
ffffffff8044b360 t arp_rcv
ffffffff8044b487 t parp_redo
ffffffff8044b48c t arp_solicit
[...]
CHAPTER 43. PROFILING KERNEL ACTIVITY WITH SYSTEMTAP
Prerequisites
Procedure
This script takes the targeted kernel function as an argument. You can use wildcards in the argument to target multiple kernel functions to a certain extent.
The output of the script, in alphabetical order, contains the names of the functions called and how many times each was called during the sample time.
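A sketch of an invocation matching the options explained below; the functioncallcount.stp script name, the examples path, and the wildcard pattern are illustrative assumptions:
# stap -w -c /bin/true /usr/share/systemtap/examples/profiling/functioncallcount.stp "*@mm/*.c"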
where:
-w : Suppresses warnings.
-c command : Tells SystemTap to count function calls during the execution of a command, in
this example being /bin/true.
The output should look similar to the following:
[...]
__vma_link 97
__vma_link_file 66
__vma_link_list 97
__vma_link_rb 97
__xchg 103
add_page_to_active_list 102
add_page_to_inactive_list 19
add_to_page_cache 19
add_to_page_cache_lru 7
all_vm_events 6
alloc_pages_node 4630
alloc_slabmgmt 67
anon_vma_alloc 62
anon_vma_free 62
anon_vma_lock 66
anon_vma_prepare 98
anon_vma_unlink 97
anon_vma_unlock 66
arch_get_unmapped_area_topdown 94
arch_get_unmapped_exec_area 3
arch_unmap_area_topdown 97
atomic_add 2
atomic_add_negative 97
atomic_dec_and_test 5153
atomic_inc 470
atomic_inc_and_test 1
[...]
Prerequisites
Procedure
2. An optional trigger function, which enables or disables tracing on a per-thread basis. Tracing in
each thread will continue as long as the trigger function has not exited yet.
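A sketch of such an invocation; the para-callgraph.stp script name, the examples path, and both probe arguments are illustrative assumptions:
# stap -w -c /bin/true /usr/share/systemtap/examples/general/para-callgraph.stp 'kernel.function("*@fs/proc.c")' 'kernel.function("vfs_read")'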
where:
-w : Suppresses warnings.
-c command : Tells SystemTap to count function calls during the execution of a command, in
this example being /bin/true.
[...]
Prerequisites
Procedure
This script will display the top 20 processes taking up CPU time during a 5-second period, along
with the total number of CPU ticks made during the sample. The output of this script also notes
the percentage of CPU time each process used, as well as whether that time was spent in kernel
space or user space.
Prerequisites
Procedure
This script will track how many times each application uses the following system calls over time:
poll
select
epoll
itimer
futex
nanosleep
signal
In this example output you can see which process used which system call and how many times.
Prerequisites
Procedure
--------------------------------------------------------------
SYSCALL COUNT
gettimeofday 1857
read 1821
ioctl 1568
poll 1033
close 638
open 503
select 455
write 391
writev 335
futex 303
recvmsg 251
socket 137
clock_gettime 124
rt_sigprocmask 121
sendto 120
setitimer 106
stat 90
time 81
sigreturn 72
fstat 66
--------------------------------------------------------------
Prerequisites
Procedure
CHAPTER 44. MONITORING DISK AND I/O ACTIVITY WITH SYSTEMTAP
Prerequisites
Procedure
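A sketch of running the script (the disktop.stp name and the io/ subdirectory are assumptions about the examples layout):
# stap /usr/share/systemtap/examples/io/disktop.stp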
The script displays the top ten processes responsible for the heaviest reads or writes to a disk.
UID
User ID. A user ID of 0 refers to the root user.
PID
The ID of the listed process.
PPID
The process ID of the listed process’s parent process.
CMD
The name of the listed process.
DEVICE
Which storage device the listed process is reading from or writing to.
T
The type of action performed by the listed process, where W refers to write, and R refers to
read.
BYTES
The amount of data read from or written to the disk.
[...]
Mon Sep 29 03:38:28 2008 , Average: 19Kb/sec, Read: 7Kb, Write: 89Kb
UID PID PPID CMD DEVICE T BYTES
0 26319 26294 firefox sda5 W 90229
0 2758 2757 pam_timestamp_c sda5 R 8064
0 2885 1 cupsd sda5 W 1678
Mon Sep 29 03:38:38 2008 , Average: 1Kb/sec, Read: 7Kb, Write: 1Kb
44.2. TRACKING I/O TIME FOR EACH FILE READ OR WRITE WITH
SYSTEMTAP
You can use the iotime.stp SystemTap script to monitor the amount of time it takes for each process to
read from or write to any file. This helps you to determine what files are slow to load on a system.
Prerequisites
Procedure
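A sketch of running the script (the io/ subdirectory is an assumption about the examples layout):
# stap /usr/share/systemtap/examples/io/iotime.stp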
The script tracks each time a system call opens, closes, reads from, and writes to a file. For each file any system call accesses, it counts the number of microseconds it takes for any reads or writes to finish and tracks the amount of data, in bytes, read from or written to the file.
A timestamp, in microseconds
[...]
825946 3364 (NetworkManager) access /sys/class/net/eth0/carrier read: 8190 write: 0
825955 3364 (NetworkManager) iotime /sys/class/net/eth0/carrier time: 9
[...]
117061 2460 (pcscd) access /dev/bus/usb/003/001 read: 43 write: 0
117065 2460 (pcscd) iotime /dev/bus/usb/003/001 time: 7
[...]
3973737 2886 (sendmail) access /proc/loadavg read: 4096 write: 0
3973744 2886 (sendmail) iotime /proc/loadavg time: 11
[...]
You can use the traceio.stp SystemTap script to track the cumulative amount of I/O to the system.
Prerequisites
Procedure
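A sketch of running the script (the io/ subdirectory is an assumption about the examples layout):
# stap /usr/share/systemtap/examples/io/traceio.stp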
The script prints the top ten executables generating I/O traffic over time. It also tracks the
cumulative amount of I/O reads and writes done by those executables. This information is
tracked and printed out in 1-second intervals, and in descending order.
[...]
Xorg r: 583401 KiB w: 0 KiB
floaters r: 96 KiB w: 7130 KiB
multiload-apple r: 538 KiB w: 537 KiB
sshd r: 71 KiB w: 72 KiB
pam_timestamp_c r: 138 KiB w: 0 KiB
staprun r: 51 KiB w: 51 KiB
snmpd r: 46 KiB w: 0 KiB
pcscd r: 28 KiB w: 0 KiB
irqbalance r: 27 KiB w: 4 KiB
cupsd r: 4 KiB w: 18 KiB
Xorg r: 588140 KiB w: 0 KiB
floaters r: 97 KiB w: 7143 KiB
multiload-apple r: 543 KiB w: 542 KiB
sshd r: 72 KiB w: 72 KiB
pam_timestamp_c r: 138 KiB w: 0 KiB
staprun r: 51 KiB w: 51 KiB
snmpd r: 46 KiB w: 0 KiB
pcscd r: 28 KiB w: 0 KiB
irqbalance r: 27 KiB w: 4 KiB
cupsd r: 4 KiB w: 18 KiB
Prerequisites
Procedure
This script takes the whole device number as an argument. To find this number you can use:
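One way to obtain the whole device number, as a sketch (the path can be any file or directory on the device you want to monitor):
# stat -c "0x%D" /home
An excerpt of the script's output then looks similar to the following: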
[...]
synergyc(3722) vfs_read 0x800005
synergyc(3722) vfs_read 0x800005
cupsd(2889) vfs_write 0x800005
cupsd(2889) vfs_write 0x800005
cupsd(2889) vfs_write 0x800005
[...]
Prerequisites
Procedure
805 1078319
where:
805 is the base-16 (hexadecimal) device number. The last two digits are the minor device
number, and the remaining digits are the major number.
In the first two arguments you must use 0x prefixes for base-16 numbers.
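Putting this together, as a sketch: the stat command below prints the device number and inode of a file, and the monitoring script is then started with the major number, minor number, and inode as its three arguments. The file path, the inodewatch.stp script name, the examples path, and the split of 805 into major 0x8 and minor 0x05 are illustrative assumptions:
# stat -c '%D %i' /etc/crontab
# stap /usr/share/systemtap/examples/io/inodewatch.stp 0x8 0x05 1078319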