AIX Performance Tuning VUG May2418
AIX Performance Tuning
Virtual User Group – May 24, 2018
Jaqui Lynch
[email protected]
1 Copyright Jaqui Lynch 2018
Agenda
• Architecting
• CPU
• Memory tuning
• Starter Set of Tunables
• I/O (Time Permitting)
• Volume Groups and File Systems
• AIO and CIO
• Flash Cache
2 Copyright Jaqui Lynch 2018
Architecting
3 Copyright Jaqui Lynch 2018
Have a Plan – why are you doing this?
1. Describe the problem.
2. Measure where you’re at (baseline).
1. Use your own scripts, nmon, perfpmr – be consistent
3. Recreate the problem while getting diagnostic data (perfpmr, your own scripts, etc.).
4. Analyze the data.
5. Document potential changes and their expected impact, then group and prioritize them.
a) Remember that one small change that only you know about can cause significant problems so document ALL
changes
6. Make the changes.
a) Group changes that go together if it makes sense to do so but don’t go crazy
7. Measure the results and analyze if they had the expected impact; if not, then why not?
8. Is the problem still the same? If not, return to step 1.
9. If it’s the same, return to step 3.
This may look like common sense but in an emergency that is the first thing to go out the window
Remember tuning is an iterative process
You may not get it right the first time but over time and with a plan you can make things better
4 Copyright Jaqui Lynch 2018
Things to Consider
• Cores less than socket size
• Try to keep profile of LPAR within a socket but definitely within a node
• Too many cores and threads
• Puts more stress on hypervisor managing them
• Bring up order
• Largest LPARs first
• Not all memory DIMMs full
• Memory less than socket size
• Try to keep profile of LPAR within a socket
• Understand difference in latency between dedicated and shared processor
cores
• Context matters – there are no hard and fast rules as workload and other
factors matter
5 Copyright Jaqui Lynch 2018
Help the Hypervisor!
Help the hypervisor cleanly place partitions when they are first defined and activated.
• Define dedicated partitions first.
• Define large partitions first.
• Within shared pool, define large partitions first.
• At system (not partition) IPL, PowerVM will allocate resources cleanly.
• Do not set maximum memory setting too high as you will waste memory
• Fill your memory DIMMs to get maximum bandwidth
• Don’t mix memory DIMMs of different speeds
• Don’t assign more cores to an LPAR than there are physical cores in the server or the pool
6 Copyright Jaqui Lynch 2018
As an example, on a 48 core E880 node with 512GB there are 4 sockets, each with 128GB and 12 cores
If you keep LPARs at <=12 cores and <=128GB then you are less likely to cross a socket boundary
At 49 cores you are crossing a node boundary
[Diagram courtesy of IBM showing the DIMM and core layout of the node, with some DIMM slots empty]
7 Copyright Jaqui Lynch 2018
8 Copyright Jaqui Lynch 2018
Low hanging fruit
• Check change control – was there anything changed?
• Do I have any hardware errors in errpt?
• Does lsps –a or lsps –s show you have a lot of page space used?
• Is my system approaching 100%?
• If in a shared pool, am I constantly over entitlement or constantly folding/unfolding VPs?
• Is the ratio of SYS% more than USR%?
• Does my batch window extend into my online?
• Is there unexplained I/O wait?
• Are my CPUs and threads being used fairly evenly?
• Is the I/O fairly well spread between disks and adapters?
• Any full filesystems – especially /var or / or /usr?
• Error messages
• /etc/syslog.conf will tell you where they are
• Look at errpt – a lot of problems are made clear there
• Check at Fix Central in case it is a known bug
• http://www‐933.ibm.com/support/fixcentral/
• Do the same at the firmware history site in case it is fixed at the next firmware update
• Know how to use PerfPMR – before you need to…
• http://publib.boulder.ibm.com/infocenter/pseries/v5r3/index.jsp?topic=/com.ibm.aix.prftungd/doc/prftungd/reporting_perf_prob.htm
9 Copyright Jaqui Lynch 2018
What makes it go slow?
• Obvious:‐
• Not enough CPU
• Not enough memory
• Not enough disk I/O
• Not enough network I/O
• Not so obvious:‐
• AIX tuning
• Oracle/DB2 parameters – log placement, SGA, buffers
• Read vs write characteristics
• Adapter placement, overloading bus speeds
• Throttling effects – e.g., single‐thread dependency
• Application errors
• Background processes (backups, batch processing) running during peak online times?
• Concurrent access to the same files
• Changes in shared resources
• Hardware errors
10 Copyright Jaqui Lynch 2018
How to measure
• What is important – response time or throughput?
• Response time is the elapsed time between when a request is submitted and when the response
from that request is returned.
• Amount of time for a database query
• Amount of time it takes to echo characters to the terminal
• Amount of time it takes to access a Web page
• How much time does my user wait?
• Throughput is a measure of the amount of work that can be accomplished over some unit of time.
• Database transactions per minute
• File transfer speed in KBs per second
• File Read or Write KBs per second
• Web server hits per minute
11 Copyright Jaqui Lynch 2018
Performance Support Flowchart
Courtesy of XKCD
12 Copyright Jaqui Lynch 2018
Performance Support Flowchart – the real one
[Flowchart summary] Normal operations: monitor system performance and check against requirements.
Does performance meet stated goals? If yes, keep monitoring. If no, is there a performance problem?
Work through the checks in order – Sharing resources? CPU bound? Memory bound? I/O bound? Network bound?
Take the relevant actions for each "yes"; if every answer is "no", run additional tests.
13 Copyright Jaqui Lynch 2018
CPU
14 Copyright Jaqui Lynch 2018
Dispatching in shared pool
• VP gets dispatched to a core
• First time this becomes the home node
• All SMT threads for the VP go with the VP
• VP runs to the end of its entitlement
• If it has more work to do and no one else
wants the core it gets more
• If it has more work to do but other VPs want
the core then it gets context switched and put
on the home node runQ
• If it can’t get serviced in a timely manner it
goes to the global runQ and ends up running
somewhere else but its data may still be in
the memory on the home node core
Copyright Jaqui Lynch 2018
15
Understand SMT
• SMT
• Threads dispatch via a Virtual Processor (VP)
• Overall more work gets done (throughput)
• Individual threads run a little slower
• SMT1: Largest unit of execution work
• SMT2: Smaller unit of work, but provides a greater amount of execution work per cycle
• SMT4: Smallest unit of work, but provides the maximum amount of execution work per cycle
• On POWER7, a single thread cannot exceed 65% utilization
• On POWER6 or POWER5, a single thread can consume 100%
• Understand thread dispatch order
Diagram courtesy of IBM – thread 0 is the primary SMT thread, thread 1 the secondary, and threads 2 and 3 the tertiaries
Copyright Jaqui Lynch 2018
16
POWER5/6 vs POWER7/8 ‐ SMT Utilization
[Chart summary] With only the primary thread (Htc0) busy, reported utilization ("busy" = user% + system%) differs by processor and SMT mode:
POWER6 SMT2 reports 100% busy, POWER7 SMT2 ~70%, POWER7 SMT4 ~63%, POWER8 SMT4 ~77%
As the remaining threads become busy, reported utilization climbs toward 100%
POWER7 SMT=2 70% and SMT=4 63% try to show potential spare capacity
• Escaped most people’s attention
• VM goes 100% busy at entitlement and 100% from there on up to 10x more CPU
• SMT4: a 100% busy 1st CPU is now reported as 63% busy
• 2nd, 3rd and 4th LCPUs each report approximately 12% idle time
POWER8 Notes
Uplift from SMT2 to SMT4 is about 30%
Uplift from SMT4 to SMT8 is about 7%
POWER9 is a bigger uplift
Check published rPerf numbers

rPerf by SMT mode:
Server  cores  smt1   smt2    smt4    smt8
S822    8      60.9   88.4    114.8   122.9
S922    8      68.4   116.3   160.5   202.3
Ratios (uplift over the previous SMT mode):
S822    1.45   1.30   1.07
S922    1.70   1.38   1.26
17
POWER5/6 vs POWER7/8 Virtual Processor Unfolding
Virtual Processors are activated (unfolded) at different utilization thresholds on P5/P6
than on P7
P5/P6 loads the 1st and 2nd SMT threads to about 80% utilization and
then unfolds a VP
P7 loads first thread on the VP to about 50% then unfolds a VP
Once all VPs unfolded then 2nd SMT threads are used
Once 2nd threads are loaded then tertiaries are used
This is called raw throughput mode
Why?
Raw Throughput provides the highest per-thread throughput and best
response times at the expense of activating more physical cores
18
Scaled Throughput
• P7 and higher with AIX v6.1 TL08 and AIX v7.1 TL02
• Dispatches more SMT threads to a VP core before unfolding additional VPs
• Tries to make it behave a bit more like P6
• Raw provides the highest per-thread throughput and best response times at the
expense of activating more physical cores
• schedo –p –o vpm_throughput_mode= (see the sketch after this list)
0 Legacy Raw mode (default)
1 “Enhanced Raw” mode with a higher threshold than legacy
2 Scaled mode, use primary and secondary SMT threads
4 Scaled mode, use all four SMT threads
8 Scaled mode, use eight SMT threads (POWER8, AIX v7.1 required)
Dynamic Tunable
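A minimal check-and-set sketch (the tunable name comes from this slide; the value 2 is only an example, so test before changing production systems):
schedo -L vpm_throughput_mode               # show current, default and allowed values
schedo -p -o vpm_throughput_mode=2          # scaled mode using primary and secondary threads; -p persists across reboots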
Copyright Jaqui Lynch 2018
19
More on Dispatching
How dispatching works
Example ‐ 1 core with 6 VMs assigned to it
VPs for the VMs on the core get dispatched (consecutively) and their threads run
As each VM runs the cache is cleared for the new VM
When entitlement reached or run out of work CPU is yielded to the next VM
Once all VMs are done then system determines if there is time left
Assume our 6 VMs take 6MS so 4MS is left
Remaining time is assigned to still running VMs according to weights
VMs run again and so on
Problem ‐ if entitlement too low then dispatch window for the VM can be too low
If VM runs multiple times in a 10ms window then it does not run full speed as cache has to be warmed
up
If entitlement higher then dispatch window is longer and cache stays warm longer ‐ fewer cache misses
Entitlement and VPs
• Utilization calculation for CPU is different between POWER5, 6 and POWER7
• VPs are also unfolded sooner (at lower utilization levels than on P6 and P5)
• May also see high VCSW in lparstat
• This means that in POWER7 & higher you need to pay more attention to VPs
• You may see more cores activated at lower utilization levels
• But you will see higher idle
• If only primary SMT threads in use then you have excess VPs
• Try to avoid this issue by:
• Reducing VP counts
• Use realistic entitlement to VP ratios
• 10x or 20x is not a good idea
• Try setting entitlement to .6 or .7 of VPs
• Ensure workloads never run consistently above 100% entitlement
• Too little entitlement means too many VPs will be contending for the cores
• NOTE – VIO server entitlement is critical – SEAs scale by entitlement not VPs
• All VPs have to be dispatched before one can be redispatched
• Performance may (in most cases, will) degrade when the number of Virtual Processors in an
LPAR exceeds the number of physical processors
• The same applies with VPs in a shared pool LPAR – these should not exceed the cores in the pool
21 Copyright Jaqui Lynch 2018
Avoiding Problems
• Stay current
• Known memory issues with 6.1 tl9 sp1 and 7.1 tl3 sp1
• Java 7.1 SR1 is the preferred Java for POWER7 and POWER8
• Java 6 SR7 is minimal on POWER7 but you should go to Java 7
• WAS 8.5.2.2
• Refer to Section 8.3 of the Performance Optimization and Tuning
Techniques Redbook SG24‐8171
• HMC v8 required for POWER8 – does not support servers prior
to POWER6
• Remember not all workloads run well in the shared processor
pool – some are better dedicated
• Apps with polling behavior, CPU intensive apps (SAS, HPC), latency
sensitive apps (think trading systems)
22 Copyright Jaqui Lynch 2018
lparstat 30 2 SPP
lparstat 30 2 output
System configuration: type=Shared mode=Uncapped smt=4 lcpu=72 mem=319488MB psize=17 ent=12.00
lcpu=72 and smt=4 means I have 72/4=18 VPs but pool is only 17 cores ‐ BAD
psize = processors in shared pool
lbusy = %occupation of the LCPUs at the system and user level
app = Available physical processors in the pool
vcsw = Virtual context switches (virtual processor preemptions)
phint = phantom interrupts received by the LPAR
interrupts targeted to another partition that shares the same physical processor
i.e. LPAR does an I/O so cedes the core, when I/O completes the interrupt is sent to the
core but different LPAR running so it says “not for me”
NOTE – Must set “Allow performance information collection” on the LPARs to see good values for app, etc
Required for shared pool monitoring
Copyright Jaqui Lynch 2018
23
lparstat 30 2 Dedicated
lparstat 30 2 output
System configuration: type=Dedicated mode=Capped smt=4
lcpu=80 mem=524288MB System configuration: type=Dedicated mode=Capped smt=8
lcpu=32 mem=80640MB
%user %sys %wait %idle
%user %sys %wait %idle
‐‐‐‐‐ ‐‐‐‐‐ ‐‐‐‐‐‐ ‐‐‐‐‐‐
‐‐‐‐‐ ‐‐‐‐‐ ‐‐‐‐‐‐ ‐‐‐‐‐‐
16.8 28.7 6.4 48.1 55.8 31.1 0.0 13.1
17.0 29.3 5.8 48.0 56.1 31.2 0.0 12.7
lcpu=80 and smt=4 means I have 80/4=20 cores; lcpu=32 and smt=8 means 4 cores
lbusy = %occupation of the LCPUs at the system and user level
lparstat ‐h 30 2 output
System configuration: type=Dedicated mode=Capped smt=4 lcpu=80 mem=524288MB
%hypv
Indicates the percentage of physical processor consumption spent making hypervisor calls.
Context matters – high %hypv means very little if CPU utilization is very low
hcalls
Indicates the average number of hypervisor calls that were started.
Copyright Jaqui Lynch 2018
24
lparstat ‐E 30 2 Dedicated
lparstat ‐E 30 2 output
Lets you check frequency server is running at
System configuration: type=Dedicated mode=Capped smt=8 lcpu=32 mem=80640MB Power=Disabled
Physical Processor Utilisation:
‐‐‐‐‐‐‐‐Actual‐‐‐‐‐‐‐‐ ‐‐‐‐‐‐Normalised‐‐‐‐‐‐
user sys wait idle freq user sys wait idle
‐‐‐‐ ‐‐‐‐ ‐‐‐‐ ‐‐‐‐ ‐‐‐‐‐‐‐‐‐ ‐‐‐‐ ‐‐‐‐ ‐‐‐‐ ‐‐‐‐
2.226 1.245 0.000 0.528 4.1GHz[101%] 2.258 1.263 0.000 0.479
2.223 1.244 0.000 0.533 4.1GHz[101%] 2.254 1.262 0.000 0.484
This POWER8 is set up to run the same frequency all the time
Copyright Jaqui Lynch 2018
25
Using sar –mu ‐P ALL (Power7 & SMT4)
AIX (ent=10 and 16 VPs) so per VP physc entitled is about .63
System configuration: lcpu=64 ent=10.00 mode=Uncapped
14:24:31 cpu %usr %sys %wio %idle physc %entc
Average 0 77 22 0 1 0.52 5.2
1 37 14 1 48 0.18 1.8
2 0 1 0 99 0.10 1.0
3 0 1 0 99 0.10 1.0 .9 physc
4 84 14 0 1 0.49 4.9
5 42 7 1 50 0.17 1.7
6 0 1 0 99 0.10 1.0
7 0 1 0 99 0.10 1.0 .86 physc
8 88 11 0 1 0.51 5.1
9 40 11 1 48 0.18 1.8
............. Lines for 10‐62 were here
63 0 1 0 99 0.11 1.1
‐ 55 11 0 33 12.71 127.1 Above entitlement on average – increase entitlement?
So we see we are using 12.71 cores which is 127.1% of our entitlement
This is the sum of all the physc lines – cpu0‐3 = proc0 = VP0
May see a U line if in SPP and is unused LPAR capacity (compared against entitlement)
26 Copyright Jaqui Lynch 2018
mpstat ‐s
SMT4 example
mpstat –s 1 1
System configuration: lcpu=64 ent=10.0 mode=Uncapped
Proc0 Proc4 Proc8
89.06% 84.01% 81.42%
cpu0 cpu1 cpu2 cpu3 cpu4 cpu5 cpu6 cpu7 cpu8 cpu9 cpu10 cpu11
41.51% 31.69% 7.93% 7.93% 42.41% 24.97% 8.31% 8.32% 39.47% 25.97% 7.99% 7.99%
………………………………
Proc60
99.11%
cpu60 cpu61 cpu62 cpu63 shows breakdown across the VPs (proc*) and smt threads (cpu*)
62.63% 13.22% 11.63% 11.63%
Proc* are the virtual CPUs
CPU* are the logical CPUs (SMT threads)
Copyright Jaqui Lynch 2018
27
28 Copyright Jaqui Lynch 2018
29 Copyright Jaqui Lynch 2018
Entitled Capacity 10
min Memory MB 131072
max Memory MB 327680
online Memory 303104
Pool CPU 16
Weight 150
pool id 2
Copyright Jaqui Lynch 2018
30
LPAR always above entitlement – increase entitlement
31
32
CPU by thread from cpu_summ tab in nmon
SMT4
SMT8 – all threads busy
Note mostly primary thread used and
some secondary – we should possibly
reduce cores/VPs
All threads busy – maybe add cores
33
vmstat –IW Shared
vmstat –IW 60 2
System configuration: lcpu=12 mem=24832MB ent=2.00
kthr memory page faults cpu
‐‐‐‐‐‐‐‐‐‐‐ ‐‐‐‐‐‐‐‐‐‐‐ ‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐ ‐‐‐‐‐‐‐‐‐‐‐‐ ‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐
r b p w avm fre fi fo pi po fr sr in sy cs us sy id wa pc ec
3 1 0 2 2708633 2554878 0 46 0 0 0 0 3920 143515 10131 26 44 30 0 2.24 112.2
6 1 0 4 2831669 2414985 348 28 0 0 0 0 2983 188837 8316 38 39 22 0 2.42 120.9
Note pc=2.42 is 120.9% of entitlement (ent=2.00)
When looking at system time to user time ratios – remember on a VIO server that high system time is most likely normal as the VIO handles all
the I/O and network and really has little normal user type work
‐I shows I/O oriented view and adds in the p column
p column is number of threads waiting for I/O messages to raw devices.
‐W adds the w column (only valid with –I as well)
w column is the number of threads waiting for filesystem direct I/O (DIO) and concurrent I/O (CIO)
r column is average number of runnable threads (ready but waiting to run + those running)
This is the global run queue – use mpstat ‐w and look at the rq field to get the run queue for each logical CPU
b column is average number of threads placed in the VMM wait queue (awaiting resources or I/O)
Copyright Jaqui Lynch 2018
34
vmstat –IW Dedicated
Busysvr on 5/1/2018 Power8
Has 4 memory pools and computational was 67.1%
minfree=1024, maxfree=2048
Copyright Jaqui Lynch 2018
35
Shared Processor Pool Monitoring
Turn on “Allow performance information collection” on the LPAR properties
This is a dynamic change
topas –C
Most important value is app – available pool processors
This represents the current number of free physical cores in the
pool
nmon option p for pool monitoring
To the right of PoolCPUs there is an unused column which is the
number of free pool cores
lparstat
Shows the app column and poolsize
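A quick sketch tying these together (the interval and count values are just examples):
lparstat 30 2        # app = free physical cores in the pool, psize = pool size
topas -C             # cross-partition view; check the pool's app value
nmon                 # then press p for the pool panel; unused = free pool cores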
36 Copyright Jaqui Lynch 2018
topas ‐C
Shows pool size of 16 with all 16 available
Monitor VCSW as potential sign of insufficient entitlement
37 Copyright Jaqui Lynch 2018
38 Copyright Jaqui Lynch 2018
MEMORY
39 Copyright Jaqui Lynch 2018
Memory Types
• Persistent
• Backed by filesystems
• Working storage
• Dynamic
• Includes executables and their work areas
• Backed by page space
• Shows as avm in a vmstat –I (multiply by 4096 to get bytes instead of
pages) or as %comp in nmon analyser or as a percentage of memory used
for computational pages in vmstat –v
• ALSO NOTE – if %comp is near or >97% then you will be paging and need
more memory
• Prefer to steal from persistent as it is cheap
• minperm, maxperm, maxclient, lru_file_repage and
page_steal_method all impact these decisions
40 Copyright Jaqui Lynch 2018
Memory with lru_file_repage=0
• minperm=3
• Always try to steal from filesystems if filesystems are using more than 3% of memory
• maxperm=90
• Soft cap on the amount of memory that filesystems or network can use
• Superset so includes things covered in maxclient as well
• maxclient=90
• Hard cap on amount of memory that JFS2 or NFS can use – SUBSET of maxperm
• lru_file_repage goes away in v7 later TLs
• It is still there but you can no longer change it
All AIX systems post AIX v5.3 (tl04 I think) should have these 3 set
On v6.1 and v7 they are set by default
Check /etc/tunables/nextboot to make sure they are not overridden
from defaults on v6.1 and v7
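A sketch for verifying the current values and spotting overrides (-F includes restricted tunables such as lru_file_repage):
vmo -a -F | grep -E "minperm%|maxperm%|maxclient%|lru_file_repage|page_steal_method"
cat /etc/tunables/nextboot                  # look for a vmo: stanza overriding the defaults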
41 Copyright Jaqui Lynch 2018
/etc/tunables/nextboot
VALUES YOU MIGHT SEE
If you see other parameters check they are not restricted and that they are still valid
vmo:
maxfree = "2048"
minfree = "1024"
no:
tcp_recvspace = "262144"
udp_recvspace = "655360"
udp_sendspace = "65536"
tcp_sendspace = "262144"
rfc1323 = "1"
ioo:
j2_minPageReadAhead = "2"
j2_maxPageReadAhead = "1024"
j2_dynamicBufferPreallocation = "256"
42
Copyright Jaqui Lynch 2018
page_steal_method
• Default in 5.3 is 0, in 6 and 7 it is 1
• What does 1 mean?
• lru_file_repage=0 tells LRUD to try and steal from filesystems
• Memory split across mempools
• LRUD manages a mempool and scans to free pages
• 0 – scan all pages
• 1 – scan only filesystem pages
43 Copyright Jaqui Lynch 2018
page_steal_method Example
• 500GB memory
• 50% used by file systems (250GB)
• 50% used by working storage (250GB)
• mempools = 5
• So we have at least 5 LRUDs each controlling about 100GB
memory
• Set to 0
• Scans all 100GB of memory in each pool
• Set to 1
• Scans only the 50GB in each pool used by filesystems
• Reduces cpu used by scanning
• When combined with CIO this can make a significant difference
44 Copyright Jaqui Lynch 2018
Correcting Paging
From vmstat ‐v
11173706 paging space I/Os blocked with no psbuf
lsps output on above system that was paging before changes were made to tunables
lsps ‐a
Page Space Physical Volume Volume Group Size %Used Active Auto Type
paging01 hdisk3 pagingvg 16384MB 25 yes yes lv
paging00 hdisk2 pagingvg 16384MB 25 yes yes lv
hd6 hdisk0 rootvg 16384MB 25 yes yes lv
lsps ‐s
Total Paging Space Percent Used Can also use vmstat –I and vmstat ‐s
49152MB 1%
Should be balanced – NOTE VIO Server comes with 2 different sized page datasets on one hdisk. As of 2.2.6.20
there are 2 x 1024MB files.
Best Practice
More than one page volume
All the same size including hd6
Page spaces must be on different disks from each other
Do not put on hot disks
Mirror all page spaces that are on internal or non‐raided disk
If you can’t make hd6 as big as the others then swap it off after boot
All real paging is bad
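A sketch of building the layout above (the volume group, disk name and LP counts are assumptions; adjust to your PP size):
mkps -a -n -s 128 pagingvg hdisk4    # new 16GB page space (128 LPs x 128MB PPs), active now and at reboot
chps -d 64 paging00                  # shrink an oversized page space by 64 LPs
swapoff /dev/hd6                     # deactivate hd6 after boot if it cannot match the others
lsps -a                              # confirm sizes and %Used are balanced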
45 Copyright Jaqui Lynch 2018
Memory Breakdown
UNIT is MB – svmon -G
size inuse free pin virtual available mmode
memory 512.00 263.72 248.28 56.1 69.2 425.97 Ded
pg space 40.0 0.22
46 Copyright Jaqui Lynch 2018
Looking for Problems
• lssrad –av
• mpstat –d
• topas –M
• svmon
• Try –G –O unit=auto,timestamp=on,pgsz=on,affinity=detail options
• Look at Domain affinity section of the report
• tprof/gprof
• trace
• Etc etc
47 Copyright Jaqui Lynch 2018
Memory Problems
• Look at computational memory use
• Shows as avm in a vmstat –I (multiply by 4096 to get bytes instead of pages)
• System configuration: lcpu=48 mem=32768MB ent=0.50
• r b p w avm fre fi fo pi po fr sr in sy cs us sy id wa pc ec
• 0 0 0 0 807668 7546118 0 0 0 0 0 0 1 159 161 0 0 99 0 0.01 1.3
AVM above is about 3.08GB which is about 9% of the 32GB in the LPAR
• or as %comp in nmon analyser
• or as a percentage of memory used for computational pages in vmstat –v
• NOTE – if %comp is near or >97% then you will be paging and need more memory
• Try svmon –P –Osortseg=pgsp –Ounit=MB | more
• This shows processes using the most pagespace in MB
• You can also try the following:
• svmon –P –Ofiltercat=exclusive –Ofiltertype=working –Ounit=MB| more
48 Copyright Jaqui Lynch 2018
Copyright Jaqui Lynch 2018
49
Copyright Jaqui Lynch 2018
50
Affinity
• LOCAL SRAD, within the same chip, shows as s3
• NEAR SRAD, within the same node – intra‐node, shows as s4
• FAR SRAD, on another node – inter‐node, shows as s5
51 Copyright Jaqui Lynch 2018
mpstat –d Example from POWER8 S814
b814aix1: mpstat ‐d
System configuration: lcpu=48 ent=0.5 mode=Uncapped
local near far
cpu cs ics bound rq push S3pull S3grd S0rd S1rd S2rd S3rd S4rd S5rd ilcs vlcs S3hrd S4hrd S5hrd
0 82340 11449 1 2 0 0 0 98.9 0.0 0.0 1.1 0.0 0.0 23694 120742 100.0 0.0 0.0
1 81 81 0 0 0 0 0 0.0 100.0 0.0 0.0 0.0 0.0 9488 9541 100.0 0.0 0.0
2 81 81 0 0 0 0 0 0.0 100.0 0.0 0.0 0.0 0.0 9501 9533 100.0 0.0 0.0
3 82 82 0 0 0 0 0 1.2 98.8 0.0 0.0 0.0 0.0 9515 9876 100.0 0.0 0.0
4 81 81 0 0 0 0 0 0.0 100.0 0.0 0.0 0.0 0.0 9515 9525 100.0 0.0 0.0
5 81 81 0 0 0 0 0 0.0 100.0 0.0 0.0 0.0 0.0 9522 9527 100.0 0.0 0.0
6 81 81 0 0 0 0 0 0.0 100.0 0.0 0.0 0.0 0.0 9522 9518 100.0 0.0 0.0
7 82 81 0 0 0 0 0 0.0 100.0 0.0 0.0 0.0 0.0 9526 9511 100.0 0.0 0.0
The above is for a single socket system (S814) so I would expect to see everything local (s3hrd)
On a multi socket or multinode pay attention to the numbers under near and far
52 Copyright Jaqui Lynch 2018
E850 4 socket mpstat ‐d
53 Copyright Jaqui Lynch 2018
Starter set of tunables Memory
For AIX v5.3
No need to set memory_affinity=0 after 5.3 tl05
MEMORY
vmo ‐p ‐o minperm%=3
vmo ‐p ‐o maxperm%=90
vmo ‐p ‐o maxclient%=90
vmo ‐p ‐o minfree=960 We will calculate these
vmo ‐p ‐o maxfree=1088 We will calculate these
vmo ‐p ‐o lru_file_repage=0
vmo ‐p ‐o lru_poll_interval=10
vmo ‐p ‐o page_steal_method=1
For AIX v6 or v7 including VIOS
Memory defaults are already set correctly except minfree and maxfree
If you upgrade from a previous version of AIX using migration then you need to check the
settings after
54 Copyright Jaqui Lynch 2018
vmstat –v Output
uptime
02:03PM up 39 days, 3:06, 2 users, load average: 17.02, 15.35, 14.27
9 memory pools
3.0 minperm percentage
90.0 maxperm percentage
14.9 numperm percentage
14.9 numclient percentage
90.0 maxclient percentage
numclient=numperm so most likely the I/O being done is JFS2 or NFS or VxFS
Based on the blocked I/Os it is clearly a system using JFS2
It is also having paging problems
pbufs also need reviewing
Check uptime to see how long the system has been up as stats are from boot
Context matters – if this system has been up only a few days or weeks this is a serious problem
If it has been up for a long time then you have no idea when these occurred
So run vmstat –v several times maybe 8 hours apart and compare them to get growth rates
56 Copyright Jaqui Lynch 2018
lvmo –a Output
2725270 pending disk I/Os blocked with no pbuf
Sometimes the above line from vmstat –v only includes rootvg so use lvmo –a to double‐check
vgname = rootvg
pv_pbuf_count = 512
total_vg_pbufs = 1024
max_vg_pbuf_count = 16384
pervg_blocked_io_count = 0 this is rootvg
pv_min_pbuf = 512
Max_vg_pbuf_count = 0
global_blocked_io_count = 2725270 this is the others
Use lvmo –v xxxxvg ‐a
For other VGs we see the following in pervg_blocked_io_count
blocked total_vg_bufs
nimvg 29 512
sasvg 2719199 1024
backupvg 6042 4608
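A sketch for relieving the blocked I/Os on the busiest VG above (the value is only an example; confirm with your storage and AIX support before changing):
lvmo -v sasvg -o pv_pbuf_count=2048  # raise the per-PV pbufs added for each disk in sasvg
lvmo -v sasvg -a                     # re-check total_vg_pbufs and pervg_blocked_io_count afterwards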
57 Copyright Jaqui Lynch 2018
Memory Pools and fre column
• fre column in vmstat is a count of all the free pages across all the memory pools
• When you look at fre you need to divide by memory pools
• Then compare it to maxfree and minfree
• This will help you determine if you are happy, page stealing or thrashing
• You can see high values in fre but still be paging
• You have to divide the fre column by mempools
• In below if maxfree=2000 and we have 10 memory pools then we only have 990
pages free in each pool on average. With minfree=960 we are page stealing and
close to thrashing.
Assuming 10 memory pools (you get this from vmstat –v)
9902/10 = 990.2 so we have 990 pages free per memory pool
If maxfree is 2000 and minfree is 960 then we are page stealing and very close to thrashing
58 Copyright Jaqui Lynch 2018
Calculating minfree and maxfree
vmstat –v | grep memory
3 memory pools
Calculation is:
minfree = (max (960,(120 * lcpus) / memory pools))
maxfree = minfree + (Max(maxpgahead,j2_maxPageReadahead) * lcpus) / memory pools
I would probably bump this to 1536 rather than using 1472 (nicer round number)
The difference between minfree and maxfree should be no more than 1K per IBM
If you over allocate these values it is possible that you will see high values in the “fre” column of a vmstat and yet you will be paging.
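A worked example, assuming the numbers that appear to be behind the 1472 above (lcpu=12, 3 memory pools, j2_maxPageReadAhead=128):
minfree = max(960, (120 x 12) / 3) = max(960, 480) = 960
maxfree = 960 + (128 x 12) / 3 = 960 + 512 = 1472, which I would round up to 1536
vmo -p -o minfree=960 -o maxfree=1536       # set both together so maxfree stays above minfree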
59 Copyright Jaqui Lynch 2018
/etc/tunables/nextboot Example
vmo:
minfree = "1024"
maxfree = "2048"
no:
tcp_recvspace = "262144"
udp_recvspace = "655360"
udp_sendspace = "65536"
tcp_sendspace = "262144"
rfc1323 = "1"
ioo:
j2_maxPageReadAhead = "256"
j2_dynamicBufferPreallocation = "256"
60 Copyright Jaqui Lynch 2018
I/O
Time Permitting
61 Copyright Jaqui Lynch 2018
Rough Anatomy of an I/O
• LVM requests a PBUF
• Pinned memory buffer to hold I/O request in LVM layer
• Then placed into an FSBUF
• 3 types
• These are also pinned
• Filesystem JFS
• Client NFS and VxFS
• External Pager JFS2
• If paging then need PSBUFs (also pinned)
• Used for I/O requests to and from page space
• Then queue I/O to an hdisk (queue_depth)
• Then queue it to an adapter (num_cmd_elems)
• Adapter queues it to the disk subsystem
• Additionally, every 60 seconds the sync daemon (syncd) runs to flush dirty I/O out to filesystems or
page space
62 Copyright Jaqui Lynch 2018
IO Wait and why it is not necessarily useful
SMT2 example for simplicity
System has 7 threads with work, the 8th has nothing so is not
shown
System has 3 threads blocked (red threads)
SMT is turned on
There are 4 threads ready to run so they get dispatched and
each is using 80% user and 20% system
Metrics would show:
%user = .8 * 4 / 4 = 80%
%sys = .2 * 4 / 4 = 20%
Idle will be 0% as no core is waiting to run threads
IO Wait will be 0% as no core is idle waiting for IO to complete
as something else got dispatched to that core
SO we have IO wait
BUT we don’t see it
Also if all threads were blocked but nothing else to run then
we would see IO wait that is very high
63 Copyright Jaqui Lynch 2018
What is iowait? Lessons to learn
• iowait is a form of idle time
• It is simply the percentage of time the CPU is idle AND there is at least one I/O still
in progress (started from that CPU)
• The iowait value seen in the output of commands like vmstat, iostat, and topas is
the iowait percentages across all CPUs averaged together
• This can be very misleading!
• High I/O wait does not mean that there is definitely an I/O bottleneck
• Zero I/O wait does not mean that there is not an I/O bottleneck
• A CPU in I/O wait state can still execute threads if there are any runnable threads
64 Copyright Jaqui Lynch 2018
Basics
•Data layout will have more impact than most tunables
•Plan in advance
•Large hdisks are evil
•I/O performance is about bandwidth and reduced queuing, not size
•10 x 50GB or 5 x 100GB hdisks are better than 1 x 500GB
•Also larger LUN sizes may mean larger PP sizes which is not great for lots of little filesystems
•Need to separate different kinds of data i.e. logs versus data
•The issue is queue_depth
•In process and wait queues for hdisks
•In process queue contains up to queue_depth I/Os
•hdisk driver submits I/Os to the adapter driver
•Adapter driver also has in process and wait queues
•SDD and some other multi‐path drivers will not submit more than queue_depth IOs to an
hdisk which can affect performance
•Adapter driver submits I/Os to disk subsystem
•Default client qdepth for vSCSI is 3
•chdev –l hdisk? –a queue_depth=20 (or some good value)
•Default client qdepth for NPIV is set by the Multipath driver in the client
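A sketch for checking and raising queue_depth on one hdisk (hdisk10 and the value 20 are examples; confirm the value with your disk vendor):
lsattr -El hdisk10 -a queue_depth           # current value
lsattr -Rl hdisk10 -a queue_depth           # range the driver allows
chdev -l hdisk10 -a queue_depth=20 -P       # -P defers the change to the next reboot if the disk is busy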
65 Copyright Jaqui Lynch 2018
More on queue depth
•Disk and adapter drivers each have a queue to handle I/O
•Queues are split into in‐service (aka in‐flight) and wait queues
•IO requests in in‐service queue are sent to storage and slot is freed when the IO is complete
•IO requests in the wait queue stay there till an in‐service slot is free
•queue depth is the size of the in‐service queue for the hdisk
•Default for vSCSI hdisk is 3
•Default for NPIV or direct attach depends on the HAK (host attach kit) or MPIO drivers used
•num_cmd_elems is the size of the in‐service queue for the HBA
•Maximum in‐flight IOs submitted to the SAN is the smallest of:
•Sum of hdisk queue depths
•Sum of the HBA num_cmd_elems
•Maximum in‐flight IOs submitted by the application
•For HBAs
•num_cmd_elems defaults to 200 typically
•Max range is 2048 to 4096 depending on storage vendor
•As of AIX v7.1 tl2 (or 6.1 tl8) num_cmd_elems is limited to 256 for VFCs
•See http://www‐01.ibm.com/support/docview.wss?uid=isg1IV63282
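As a worked example with assumed numbers: 20 hdisks at queue_depth=20 allow 400 in-flight IOs while 2 HBAs at num_cmd_elems=1024 allow 2048, so the hdisk queues cap the in-flight IO to the SAN at 400 (unless the application submits even fewer).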
66 Copyright Jaqui Lynch 2018
iostat ‐Dl
System configuration: lcpu=32 drives=67 paths=216 vdisks=0
%tm bps tps bread bwrtn rps avg min max wps avg min max avg min max avg avg serv
act serv serv serv serv serv serv time time time wqsz sqsz qfull
hdisk0 13.7 255.3K 33.5 682.7 254.6K 0.1 3 1.6 4 33.4 6.6 0.7 119.2 2.4 0 81.3 0 0 2.1
hdisk5 14.1 254.6K 33.4 0 254.6K 0 0 0 0 33.4 6.7 0.8 122.9 2.4 0 82.1 0 0 2.1
hdisk16 2.7 1.7M 3.9 1.7M 0 3.9 12.6 1.2 71.3 0 0 0 0 0 0 0 0 0 0
hdisk17 0.1 1.8K 0.3 1.8K 0 0.3 4.2 2.4 6.1 0 0 0 0 0 0 0 0 0 0
hdisk15 4.4 2.2M 4.9 2.2M 273.1 4.8 19.5 2.9 97.5 0.1 7.8 1.1 14.4 0 0 0 0 0 0
hdisk18 0.1 2.2K 0.5 2.2K 0 0.5 1.5 0.2 5.1 0 0 0 0 0 0 0 0 0 0
hdisk19 0.1 2.6K 0.6 2.6K 0 0.6 2.7 0.2 15.5 0 0 0 0 0 0 0 0 0 0
hdisk20 3.4 872.4K 2.4 872.4K 0 2.4 27.7 0.2 163.2 0 0 0 0 0 0 0 0 0 0
hdisk22 5 2.4M 29.8 2.4M 0 29.8 3.7 0.2 50.1 0 0 0 0 0 0 0.1 0 0 0
hdisk25 10.3 2.3M 12.2 2.3M 0 12.2 16.4 0.2 248.5 0 0 0 0 0 0 0 0 0 0
hdisk24 9.2 2.2M 5 2.2M 0 5 34.6 0.2 221.9 0 0 0 0 0 0 0 0 0 0
hdisk26 7.9 2.2M 4.5 2.2M 0 4.5 32 3.1 201 0 0 0 0 0 0 0 0 0 0
hdisk27 6.2 2.2M 4.4 2.2M 0 4.4 25.4 0.6 219.5 0 0 0 0 0 0 0.1 0 0 0
hdisk28 3 2.2M 4.5 2.2M 0 4.5 10.3 3 101.6 0 0 0 0 0 0 0 0 0 0
hdisk29 6.8 2.2M 4.5 2.2M 0 4.5 26.6 3.1 219.3 0 0 0 0 0 0 0 0 0 0
hdisk9 0.1 136.5 0 0 136.5 0 0 0 0 0 21.2 21.2 21.2 0 0 0 0 0 0
67 Copyright Jaqui Lynch 2018
Adapter Queue Problems
• Look at BBBF Tab in NMON Analyzer or run fcstat command
• fcstat –D provides better information including high water marks that can be used in
calculations
• Adapter device drivers use DMA for IO
• From fcstat on each fcs
• NOTE these are since boot
FC SCSI Adapter Driver Information
No DMA Resource Count: 0
No Adapter Elements Count: 2567
No Command Resource Count: 34114051
Number of times since boot that IO was temporarily blocked waiting for resources such as
num_cmd_elems too low
• No DMA resource – adjust max_xfer_size
• No adapter elements – adjust num_cmd_elems
• No command resource – adjust num_cmd_elems
• If using NPIV make changes to VIO and client, not just VIO
• Reboot VIO prior to changing client settings
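A sketch for checking the current adapter settings and the since-boot counters (fcs0 is an example):
lsattr -El fcs0 -a num_cmd_elems -a max_xfer_size   # current adapter settings
fcstat -D fcs0                                      # includes the No DMA/Adapter Elements/Command Resource counts and high water marks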
68 Copyright Jaqui Lynch 2018
Adapter Tuning
fcs0
bus_intr_lvl 115 Bus interrupt level False
bus_io_addr 0xdfc00 Bus I/O address False
bus_mem_addr 0xe8040000 Bus memory address False
init_link al INIT Link flags True
intr_priority 3 Interrupt priority False
lg_term_dma 0x800000 Long term DMA True
max_xfer_size 0x100000 Maximum Transfer Size True (16MB DMA)
num_cmd_elems 200 Maximum number of COMMANDS to queue to the adapter True
pref_alpa 0x1 Preferred AL_PA True
sw_fc_class 2 FC Class for Fabric True
Changes I often make (test first)
max_xfer_size 0x200000 Maximum Transfer Size True 128MB DMA area for data I/O
num_cmd_elems 1024 Maximum number of COMMANDS to queue to the adapter True
Often I raise this to 2048 – check with your disk vendor
lg_term_dma is the DMA area for control I/O
Check these are ok with your disk vendor!!!
chdev ‐l fcs0 ‐a max_xfer_size=0x200000 ‐a num_cmd_elems=1024 ‐P
chdev ‐l fcs1 ‐a max_xfer_size=0x200000 ‐a num_cmd_elems=1024 ‐P
At AIX 6.1 TL2 VFCs will always use a 128MB DMA memory area even with default max_xfer_size – I change it anyway for consistency
As of AIX v7.1 tl2 (or 6.1 tl8) num_cmd_elems there is an effective limit of 256 for VFCs
See http://www‐01.ibm.com/support/docview.wss?uid=isg1IV63282
This limitation got lifted for NPIV and the maximum is now 2048 provided you are at 6.1 tl9 (IV76258), 7.1 tl3 (IV76968) or 7.1 tl4 (IV76270).
Remember make changes to both VIO servers and client LPARs if using NPIV
VIO server setting must be at least as large as the client setting
69 Copyright Jaqui Lynch 2018
HBA max_xfer_size
The default is
0x100000 /* Default io_dma of 16MB */
After that, 0x200000, 0x400000, 0x800000 gets you 128MB
After that 0x1000000 checks for bus type, and you may get 256MB, or 128MB
There are also some adapters that support very large max_xfer sizes which can possibly allocate 512MB
VFC adapters inherit this from the physical adapter (generally)
Unless you are driving really large IO's, then max_xfer_size is rarely changed
70
fcstat ‐D fcs8
FIBRE CHANNEL STATISTICS REPORT: fcs8
........
FC SCSI Adapter Driver Queue Statistics
High water mark of active commands: 512
High water mark of pending commands: 104
FC SCSI Adapter Driver Information
No DMA Resource Count: 0
No Adapter Elements Count: 13300
No Command Resource Count: 0
Adapter Effective max transfer value: 0x200000
Some lines removed to save space
Per Dan Braden:
Set num_cmd_elems to at least high active + high pending or 512+104=626
71 Copyright Jaqui Lynch 2018
Tunables
72 Copyright Jaqui Lynch 2018
Starter set of I/O tunables
The parameters below should be reviewed and changed as needed
PBUFS
Use the new way – lvmo command
JFS2
ioo ‐p ‐o j2_maxPageReadAhead=128
(default above may need to be changed for sequential) – dynamic
Difference between minfree and maxfree should be > that this value
j2_dynamicBufferPreallocation=16
Max is 256. 16 means 16 x 16k slabs or 256k
Default that may need tuning but is dynamic
Replaces tuning j2_nBufferPerPagerDevice until at max.
73 Copyright Jaqui Lynch 2018
Other Interesting Tunables
• These are set as options in /etc/filesystems for the filesystem
• noatime
• Why write a record every time you read or touch a file?
• mount command option
• Use for redo and archive logs
• Release behind (or throw data out of file system cache)
• rbr – release behind on read
• rbw – release behind on write
• rbrw – both
• log=null
• Read the various AIX Difference Guides:
• http://www.redbooks.ibm.com/cgi‐bin/searchsite.cgi?query=aix+AND+differences+AND+guide
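A sketch of applying these options (the mount point is hypothetical; test before using in production):
chfs -a options=noatime,rbrw /archlogs      # persists the options in /etc/filesystems
mount -o noatime,rbrw /archlogs             # or try them once at mount time without persisting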
74
Copyright Jaqui Lynch 2018
filemon
Uses trace so don’t forget to STOP the trace
Can provide the following information
CPU Utilization during the trace
Most active Files
Most active Segments
Most active Logical Volumes
Most active Physical Volumes
Most active Files Process‐Wise
Most active Files Thread‐Wise
Sample script to run it:
filemon ‐v ‐o abc.filemon.txt ‐O all ‐T 210000000
sleep 60
trcstop
OR
filemon ‐v ‐o abc.filemon2.txt ‐O pv,lv ‐T 210000000
sleep 60
trcstop
75 Copyright Jaqui Lynch 2018
filemon –v –O pv,lv
Most Active Logical Volumes
‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐
util #rblk #wblk KB/s volume description
‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐
0.66 4647264 834573 45668.9 /dev/gandalfp_ga71_lv /ga71
0.36 960 834565 6960.7 /dev/gandalfp_ga73_lv /ga73
0.13 2430816 13448 20363.1 /dev/misc_gm10_lv /gm10
0.11 53808 14800 571.6 /dev/gandalfp_ga15_lv /ga15
0.08 94416 7616 850.0 /dev/gandalfp_ga10_lv /ga10
0.07 787632 6296 6614.2 /dev/misc_gm15_lv /gm15
0.05 8256 24259 270.9 /dev/misc_gm73_lv /gm73
0.05 15936 67568 695.7 /dev/gandalfp_ga20_lv /ga20
0.05 8256 25521 281.4 /dev/misc_gm72_lv /gm72
0.04 58176 22088 668.7 /dev/misc_gm71_lv /gm71
76
Copyright Jaqui Lynch 2018
filemon –v –O pv,lv
Most Active Physical Volumes
‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐
util #rblk #wblk KB/s volume description
‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐
0.38 4538432 46126 8193.7 /dev/hdisk20 MPIO FC 2145
0.27 12224 671683 5697.6 /dev/hdisk21 MPIO FC 2145
0.19 15696 1099234 9288.4 /dev/hdisk22 MPIO FC 2145
0.08 608 374402 3124.2 /dev/hdisk97 MPIO FC 2145
0.08 304 369260 3078.8 /dev/hdisk99 MPIO FC 2145
0.06 537136 22927 4665.9 /dev/hdisk12 MPIO FC 2145
0.06 6912 631857 5321.6 /dev/hdisk102 MPIO FC 2145
77
Copyright Jaqui Lynch 2018
Asynchronous I/O and
Concurrent I/O
78 Copyright Jaqui Lynch 2018
79 Copyright Jaqui Lynch 2018
PROCAIO tab in nmon
Maximum seen was 192 but average was much less
80
Copyright Jaqui Lynch 2018
DIO and CIO
• DIO
• Direct I/O
• Around since AIX v5.1, also in Linux
• Used with JFS
• CIO is built on it
• Effectively bypasses filesystem caching to bring data directly into
application buffers
• Does not like compressed JFS or BF (lfe) filesystems
• Performance will suffer due to requirement for 128kb I/O (after 4MB)
• Reduces CPU and eliminates overhead copying data twice
• Reads are asynchronous
• No filesystem readahead
• No lrud or syncd overhead
• No double buffering of data
• Inode locks still used
• Benefits heavily random access workloads
81
Copyright Jaqui Lynch 2018
DIO and CIO
• CIO
• Concurrent I/O – AIX only, not in Linux
• Only available in JFS2
• Allows performance close to raw devices
• Designed for apps (such as RDBs) that enforce write serialization at the
app
• Allows non-use of inode locks
• Implies DIO as well
• Benefits heavy update workloads
• Speeds up writes significantly
• Saves memory and CPU for double copies
• No filesystem readahead
• No lrud or syncd overhead
• No double buffering of data
• Not all apps benefit from CIO and DIO – some are better with filesystem
caching and some are safer that way
• When to use it
• Database DBF files, redo logs and control files and flashback log files.
• Not for Oracle binaries or archive log files
• Can get stats using vmstat –IW flags
82 Copyright Jaqui Lynch 2018
Demoted I/O
• Check w column in vmstat ‐IW
• CIO write fails because IO is not aligned to FS blocksize
• e.g. app writing 512 byte blocks but FS has 4096
• Ends up getting redone
• Demoted I/O consumes more kernel CPU
• And more physical I/O
• To find demoted I/O (if JFS2)
trace –aj 59B,59C ; sleep 2 ; trcstop ; trcrpt –o directio.trcrpt
grep –i demoted directio.trcrpt
Look in the report for the entries flagged as demoted
83 Copyright Jaqui Lynch 2018
Flash Cache
• Read only cache using SSDs. Reads will be processed from SSDs. Writes go direct to original storage
device.
• Current limitations
• Only 1 cache pool and 1 cache partition
• It takes time for the cache to warm up
• Blog
• Nigel Griffiths Blog
• http://tinyurl.com/k7g5dr7
• Manoj Kumar Article
• http://tinyurl.com/mee4n3f
• Prereqs
• Server attached SSDs, flash that is attached using SAS or from the SAN
• AIX 7.1 tl4 sp2 or 7.2 tl0 sp0 minimum
• Minimum 4GB memory extra for every LPAR that has cache enabled
• Cache devices are owned either by an LPAR or a VIO server
• https://www.ibm.com/support/knowledgecenter/ssw_aix_72/com.ibm.aix.osdevice/caching_limitations.htm
• Do not use this if your SAN disk is front‐ended by flash already
• Ensure following filesets are installed
• lslpp ‐l | grep Cache (on a 7.2 system)
• bos.pfcdd.rte 7.2.1.0 COMMITTED Power Flash Cache
• cache.mgt.rte 7.2.1.0 COMMITTED AIX SSD Cache Device
• bos.pfcdd.rte 7.2.1.0 COMMITTED Power Flash Cache
• cache.mgt.rte 7.2.1.0 COMMITTED AIX SSD Cache Device
84 Copyright Jaqui Lynch 2018
Flash Cache Diagram
Taken from Nigel Griffiths' Blog at:
http://tinyurl.com/k7g5dr7
Copyright Jaqui Lynch 2018
85
Flash Cache Monitoring
cache_mgt monitor start
cache_mgt monitor stop
cache_mgt get ‐h –s
Above gets stats since caching started. But no average – it lists stats for every source disk so if you have 88 of
them it is a very long report
pfcras –a dump_stats
Undocumented but provides same statistics as above averaged for last 60 and 3600 seconds
Provides an overall average then the stats for each disk
Meaning of Statistics fields can be found at: http://tinyurl.com/kesmvft
Known Problems
cache_mgt list gets core dump if >70 disks in source – IV93772
http://www‐01.ibm.com/support/docview.wss?crawler=1&uid=isg1IV93772
PFC_CAC_NOMEM error – IV91971
http://www‐01.ibm.com/support/docview.wss?uid=isg1IV91971
D47E07BC 0413165117 P U ETCACHE NOT ENOUGH MEMORY TO ALLOCATE
Saw this when I had 88 source disks and went from 4 to 8 target SSDs
System had 512GB memory and only 256GB was in use
Waiting on an ifix when upgraded to AIX 7.2 tl01 sp2 – IJ00115s2a is new ifix
PFC_CAC_DASTOOSLOW
Claims DAS DEVICE IS SLOWER THAN SAN DEVICE
This is on hold till we install the apar for the PFC_CAC_NOMEM error
86 Copyright Jaqui Lynch 2018
Flash Cache pfcras output
87 Copyright Jaqui Lynch 2018
OTHER
88
Copyright Jaqui Lynch 2018
Parameter Settings ‐ Summary
DEFAULTS NEW
PARAMETER AIXv5.3 AIXv6 AIXv7 SET ALL TO
NETWORK (no)
rfc1323 0 0 0 1
tcp_sendspace 16384 16384 16384 262144 (1Gb)
tcp_recvspace 16384 16384 16384 262144 (1Gb)
udp_sendspace 9216 9216 9216 65536
udp_recvspace 42080 42080 42080 655360
MEMORY (vmo)
minperm% 20 3 3 3
maxperm% 80 90 90 90 JFS, NFS, VxFS, JFS2
maxclient% 80 90 90 90 JFS2, NFS
lru_file_repage 1 0 0 0
lru_poll_interval ? 10 10 10
Minfree 960 960 960 calculation
Maxfree 1088 1088 1088 calculation
page_steal_method 0 0/1 (TL dependent) 1 1
JFS2 (ioo)
j2_maxPageReadAhead 128 128 128 as needed
j2_dynamicBufferPreallocation 16 16 16 as needed
For AIX v6 and v7 memory settings should be left at the defaults
minfree and maxfree are exceptions
Do not modify restricted tunables without opening a PMR with IBM first
89 Copyright Jaqui Lynch 2018
Tips to keep out of trouble
• Monitor errpt
• Check the performance apars have all been installed
• Yes this means you need to stay current
• Keep firmware up to date
• In particular, look at the firmware history for your server to see if there are performance problems fixed
• Information on the firmware updates can be found at:
• http://www‐933.ibm.com/support/fixcentral/
• Firmware history including release dates can be found at:
• Power8 Midrange
• http://download.boulder.ibm.com/ibmdl/pub/software/server/firmware/SV‐Firmware‐Hist.html
• Power8 High end
• http://download.boulder.ibm.com/ibmdl/pub/software/server/firmware/SC‐Firmware‐Hist.html
• Power9 Midrange
• http://download.boulder.ibm.com/ibmdl/pub/software/server/firmware/VL‐Firmware‐Hist.html
• Ensure software stack is current
• Ensure compilers are current and that compiled code turns on optimization
• To get true MPIO run the correct multipath software
• Ensure system is properly architected (VPs, memory, entitlement, etc)
• Take a baseline before and after any changes
• DOCUMENTATION
• Note – you can’t mix 512 and 4k disks in a VG
90 Copyright Jaqui Lynch 2018
nmon Monitoring
• nmon ‐ft –AOPV^dMLW ‐s 15 ‐c 120
• Grabs a 30 minute nmon snapshot
• A is async IO
• M is mempages
• t is top processes
• L is large pages
• O is SEA on the VIO
• P is paging space
• V is disk volume group
• d is disk service times
• ^ is fibre adapter stats
• W is workload manager statistics if you have WLM enabled
If you want a 24 hour nmon use:
nmon ‐ft –AOPV^dMLW ‐s 150 ‐c 576
May need to enable accounting on the SEA first – this is done on the VIO
chdev –dev ent* ‐attr accounting=enabled
Can use entstat/seastat or topas/nmon to monitor – this is done on the vios
topas –E
nmon ‐O
VIOS performance advisor also reports on the SEAs
91 Copyright Jaqui Lynch 2018
Thank you for your time
If you have questions please email me at:
[email protected]
Thanks to Joe Armstrong for his awesome work organizing these
webinars
92 Copyright Jaqui Lynch 2018
Useful Links
• Jaqui Lynch Articles
• http://www.circle4.com/jaqui/eserver.html
• http://ibmsystemsmag.com/authors/jaqui‐lynch/
• Jay Kruemke Twitter – chromeaix
• https://twitter.com/chromeaix
• Nigel Griffiths Twitter – mr_nmon
• https://twitter.com/mr_nmon
• https://www.ibm.com/developerworks/community/blogs/aixpert
• https://www.youtube.com/user/nigelargriffiths/
• Gareth Coates Twitter – power_gaz
• https://twitter.com/power_gaz
• Jaqui’s Youtube Channel
• https://www.youtube.com/channel/UCYH6OdgB6rV1rPxYt6FWHpw/
• Movie replays
• http://www.circle4.com/movies
• IBM US Virtual User Group
• http://www.tinyurl.com/ibmaixvug
• Power Systems UK User Group
93 • http://tinyurl.com/PowerSystemsTechnicalWebinars
Copyright Jaqui Lynch 2018
References
• Processor Utilization in AIX by Saravanan Devendran
• https://www.ibm.com/developerworks/mydeveloperworks/wikis/home?lang=en#/wiki/Power
%20Systems/page/Understanding%20CPU%20utilization%20on%20AIX
• SG24‐7940 ‐ PowerVM Virtualization ‐ Introduction and Configuration
• http://www.redbooks.ibm.com/redbooks/pdfs/sg247940.pdf
• SG24‐7590 – PowerVM Virtualization – Managing and Monitoring
• http://www.redbooks.ibm.com/redbooks/pdfs/sg247590.pdf
• SG24‐8171 – Power Systems Performance Optimization
• http://www.redbooks.ibm.com/redbooks/pdfs/sg248171.pdf
• Redbook Tip on Maximizing the Value of P7 and P7+ through Tuning and Optimization
• http://www.redbooks.ibm.com/technotes/tips0956.pdf
• Dan Braden Queue Depth Articles
• http://www‐03.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/TD105745
• http://www‐03.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/TD106122
94 Copyright Jaqui Lynch 2018
Backup
Slides
95 Copyright Jaqui Lynch 2018
Logical Processors
Logical Processors represent SMT threads
[Diagram: logical processors (SMT threads) map to virtual processors, which the hypervisor dispatches onto physical cores.
The example shows two dedicated LPARs with 2 cores each (their VPs exist under the covers) plus two shared LPARs:
one with PU=1.2 (two VPs at 0.6 each, weight=128) and one with PU=0.8 (two VPs at 0.4 each, weight=192).]
96 Copyright Jaqui Lynch 2018
From: AIX/VIOS Disk and Adapter IO Queue Tuning v1.2 – Dan Braden, July 2014
97 Copyright Jaqui Lynch 2018
Terms to understand
• Process
• A process is an activity within the system that is started with a command, a shell script, or another
process.
• Run Queue
• Each CPU has a dedicated run queue. A run queue is a list of runnable threads, sorted by thread priority
value. There are 256 thread priorities (zero to 255). There is also an additional global run queue where
new threads are placed.
• Time Slice
• The CPUs on the system are shared among all of the threads by giving each thread a certain slice of time
to run. The default time slice of one clock tick is 10 ms
98 Copyright Jaqui Lynch 2018
99 Copyright Jaqui Lynch 2018