Fibre-channel I/O Performance Tuning on AIX using fcstat
A How-to and Usage Guide
Abstract: This document provides tips and working examples of how to properly adjust and tune the
AIX Fibre channel adapter tunable attributes using fcstat and iostat output as a guide.
Version 2.1.0
June 26, 2013
Authors
Steve F. Lang [email protected]
Anil Kalavakolanu [email protected]
Janez Vovk [email protected]
Post Editing
Minh Pham [email protected]
Contents
I. Introduction
II. AIX I/O Stack flow
III. Events affecting adapter tunable settings and how to identify them
IV. fcstat collects and reports statistics in 4 areas
V. Adapter performance tuning
VI. Additional Data
I. Introduction
The AIX fcstat command is an AIX-specific tool designed to interface with, and report statistics
directly from, the Fibre Channel adapter firmware and the AIX Fibre Channel driver. The tool
provides a unique view into critical events that could impact the operational characteristics of the
physical layer. It is capable of providing data points in 2 general areas: general Fibre Channel
physical layer health, similar to the switch port and link error statistics commonly found in a
Fibre Channel switch, and AIX Fibre Channel driver I/O and I/O-affecting statistics. As a general
rule, when assessing Fibre Channel I/O performance the physical layer must be clean and free of
all I/O-affecting error conditions. The reason for containing error conditions in a Fibre Channel
fabric is the severity of even a single error. Protocols such as TCP/IP are designed to
tolerate packet loss and out-of-order packets with minimal disruption to the overall data transfer.
Fibre Channel, built for speed and relatively short distances, is a low-overhead, low-latency
protocol. Unlike TCP/IP, it is intolerant of missing, damaged, or out-of-order frames, and it is
incapable of re-transmitting a single missing frame to complete a sequence, or rather a complete
read or write operation. Some of these error conditions may produce no fabric-related errors that
could be acted on and cancelled at the Fibre Channel layer. This moves error recovery into the
SCSI layer and results in waiting for commands to time out. In some cases a damaged frame is not
detected by either the target or the initiator, leaving the read or write waiting for completion
at the upper SCSI layer until a 30 or 60 second timeout. These events inevitably trigger a
command timeout timer, which eventually cancels the I/O. The impact to I/O processing in these
cases is significant and must be avoided. These types of events are often the result of a
physical layer problem such as a damaged Fibre Channel cable; a faulty or degraded laser in an
SFP in a storage controller, switch, or host; or perhaps a failing ASIC in a switch. In other,
less frequent cases, the cause may be oversubscription of an ISL or a slow-draining device causing
frames to be discarded by targets, initiators, or switches. Regardless of the cause, identifying,
differentiating, and resolving Fibre Channel transport-related problems is often a time-consuming
and complex process, but it is absolutely necessary before any I/O performance tuning is attempted.
Given the streamlined nature of the protocol, combined with the reality that SAN hardware does
become marginal and eventually fail, fabric-related error conditions must be treated as fully
recoverable events. Storage Area Network events can add significant latency to service times.
Applications, the host I/O stack, storage arrays, and interconnected switches must be able to
tolerate these undesirable events. Moreover, the host must be able to tolerate significant
latencies in I/O processing and be able to hold off I/O until it can be resumed, without resorting
to the final and unrecoverable I/O failure, or rather the LVM I/O failure in the case of AIX.
II. AIX I/O Stack flow
Consistent with the AIX design architecture, I/O can flow through different entry points of a
driver, generally based on the type of I/O initiated and/or the interface used. A programmer, for
example, may choose to implement a read/write operation through a raw (/dev/rhdisk) device
interface, where block sizes are not confined to 4 KB. For simplicity, the following discussion
centers on a basic read/write I/O operation initiated by filesystem I/O activity.
I/O from filesystem activity, such as a file copy from one directory to another, eventually results
in the creation of a buf structure that describes a SCSI operation: the opcode (for example,
WRITE), the LBA, and the block count. The buf is placed into an LVM work queue and then sent to
the disk driver's incoming queue. When the disk driver completes command validation and the basic
resource checks for initiating a SCSI I/O request, including checking for any SCSI error states,
the driver checks whether the queue depth has been exceeded. If it has, the qfull counter is
incremented and the I/O waits for an active command to return. Otherwise, the disk driver attempts
to coalesce any contiguous blocks into a larger I/O request, up to the max_transfer size specified
for that hdisk. If these operations are successful, the I/O moves from the SCSI layer to the
protocol driver. Once the I/O is in the protocol driver, the driver works closely with the adapter
driver to forward received I/O requests to the Fibre Channel adapter hardware for final transport
over the SAN fabric, provided the target port is accessible and SAN fabric conditions are
acceptable. Before any active I/O counters are incremented, the I/O is tracked with a tag, which is
passed to the adapter hardware for retrieval when the I/O request returns. When receiving responses
to I/O requests, the process is very much the reverse of the above.
The protocol driver is represented by an (fscsi) device instance and is primarily responsible for
interacting with resources on the SAN at the Fibre Channel layer. It creates and initiates ELS
commands, such as a PLOGI to a storage port, or initiates re-login requests, in order to make the
transport of SCSI commands over Fibre Channel possible. It is also responsible for forming the
commands used to communicate with the fabric name server, which is ultimately used for storage and
LUN discovery. The protocol driver works closely with the adapter driver, represented by an (fcs)
instance, and keeps track of fabric state changes.
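As a quick illustration of this layering, the adapter (fcs) and protocol (fscsi) instances on a system can be listed with lsdev; the device names used below are examples and will differ from system to system.

# List the physical Fibre Channel adapter instances (the fcs adapter driver layer)
lsdev -C -c adapter | grep fcs
# List the protocol driver instance(s) configured on top of a given adapter
lsdev -p fcs0
# List the disks and other devices configured on top of a given protocol driver instance
lsdev -p fscsi0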
III. Events affecting adapter tunable settings and how to
identify them
Before performing performance-based analysis associated with the adapter tunable settings, the
physical link layer should be free of error conditions. It is equally important to ensure the SCSI
layer is not inundating the target ports or LUNs with excessive I/O requests. Increasing
num_cmd_elems may result in driving more I/O to a storage device, resulting in even worse I/O
service times. A quick review of the AIX errpt and iostat output can help uncover the problems
noted below.
1) AIX errpt
FSCSI, FCP, or SCSI errors may be the result of transport errors. The exact nature of these
errors should be determined, for example:
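The grep pattern below is only illustrative; the exact error labels vary by adapter and storage type.

# Summary view of recent errors logged against the FC protocol or adapter drivers
errpt | grep -Ei 'fscsi|fcs'
# Detailed view, including error labels and any SCSI sense data
errpt -a | more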
2) Average FC I/O service times in excess of ~15 ms excluding any queuing in the adapter /
protocol driver and disk
What constitutes an acceptable I/O service time is somewhat subjective, in that customers have
different requirements. For example, some shops demand less than 2 ms service times where others may
tolerate 11 ms. Moreover, some shops may be using Synchronous PPRC or EMC Synchronous
SRDF over a constrained T3 where I/O service times may be even higher. The disk technology
affects expected I/O service time, as does the availability of write and/or read cache.
When reviewing general I/O response times across the I/O load spectrum, it has been noted that on a
new array, unaffected by I/O load from any other host, a single sequential I/O stream produced by a
dd read results in roughly 0.2 to 0.5 ms service times for 1 LUN through 1 target port. Increasing
the I/O load to 16 concurrent dd's results in roughly 8 ms response times, and 32 dd's results in
even higher service times.
The 15 ms figure was chosen because it appears to be the average response time at which assistance
is requested to determine why application response times have degraded over time.
In most cases where error conditions and queuing in the disk driver have been ruled out, higher
than expected I/O service times are the result of a resource constraint at the storage array. This
could be due to improper internal LUN allocation / usage of LUNs on the host side, backend
physical disk utilization, port contention or exceeding the overall capabilities of the array.
NOTE: The method used to calculate average I/O service times from the AIX disk driver includes the
time the I/O was queued in the protocol / adapter driver. From the application's point of view, the
average I/O service time is that I/O service time plus the time the I/O was queued in the AIX disk
driver.
Average disk I/O service times can be measured using "iostat -RDTl"
In the iostat example below, the number of hdisks driving I/O is very low and would not be able
to generate a load sufficient to cause queuing in the adapter / protocol driver. hdisk28 is reporting
an average time spent in the queue of 8.1 ms. The average I/O service time is 6.5 ms. The higher
layers in the I/O stack therefore see an average I/O service time of 14.6 ms (6.5 ms + 8.1 ms).
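The same measurement can be taken at regular intervals; a minimal sketch follows, where the interval and count are arbitrary and the exact column names may vary slightly by AIX level.

# Extended per-disk statistics, 5-second interval, 12 samples, with timestamps
iostat -RDTl 5 12
# Columns of interest (grouped under the read, write, and queue sections):
#   read/write avgserv - average I/O service time as seen by the disk driver
#   queue avgtime      - average time an I/O waited in the disk driver queue
#   queue sqfull       - number of times the hdisk service queue filled (queuing)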
NOTE: If read/write max service times are 30 or 60 seconds, or close to whatever the read/write
timeout value is set to, this likely indicates a command timeout or some other type of error
recovery experienced by the disk driver.
In the example below, adjusting the default adapter settings would have no effect on I/O
performance. Although the average service time is good, around ~6 ms, the I/O workload is being
queued in the disk driver, as noted by the qfull column.
IV. fcstat collects and reports statistics in 4 areas
The tool fcstat is provided in the AIX base operating system and provides a wealth of
information about statistics reported by the adapter firmware and the driver. The information can
be broken into 4 different areas.
NOTE: Older versions of fcstat may not include a -D flag, used for displaying detailed output.
Newer versions, starting at AIX 5.3 TL9 and 6.1 TL2, include this option. In the event of an adapter
run-time reset or a reboot, all adapter statistics are cleared. An optional -z flag can be used to
clear the statistics, although not all statistics are cleared. The tool fcstat displays adapter
statistics in 4 general areas:
1) General adapter configuration information
NOTE: A Fibre Channel adapter can inadvertently be set to a lower speed by forcing the switch port
to a lower speed. A lower link speed can also result from an older switch that does not support the
higher speed of the adapter. Both the highest link speed the adapter is capable of and the current
running link speed are reported in the fcstat output.
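A quick way to check this from the host is shown below; fcs0 is an example device name and the exact field labels vary slightly by adapter type.

# Compare the supported link speed with the speed the link is currently running at
fcstat fcs0 | grep -i speed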
2) Adapter Physical Link Health Statistics
Most of these attributes are collected by the adapter firmware and reported by the driver. The
statistics are similar to what a Fibre Channel switch might provide and are very useful in reporting
attributes associated with the health of the Fibre Channel physical layer.
The highlighted attributes, Error Frames, Dumped Frames, and Invalid CRC Count, may be the
result of a physical transport layer problem that damages Fibre Channel frames as they arrive at
the adapter. These values would generally not increment for frames being transmitted, but rather
for frames received.
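A simple way to watch these counters is to clear them, wait through a period of normal I/O load, and check them again; a sketch follows, with fcs0 as an example device name.

# Clear the adapter statistics (note that not all counters are cleared)
fcstat -z fcs0
# ...after a period of normal load, check the physical layer error counters
fcstat fcs0 | grep -Ei "error frames|dumped frames|invalid crc"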
3) Fibre Channel Device Driver Statistics
Most attributes in this section are gathered by the AIX protocol / adapter device drivers, as
opposed to the adapter firmware, and are derived from counters in both of these drivers. These
counters generally display statistics on queueing and holding off I/O in these 2 layers.
The output below provides useful information about the overall I/O workload. The I/O workload in
this case is generally represented by the counter "number of active commands". Active
commands represented by this counter are commands that have left the adapter driver and have
been handed off to the adapter hardware for transport to the end device. These commands have
not received a completion status and are considered active. The "high water mark of active and
pending commands" represents the peak, or highest, number of active commands. If I/O service
times are nominal and the high water mark of active commands is at or near the
num_cmd_elems value, then increasing num_cmd_elems may improve I/O performance. In
certain error recovery scenarios the "high water mark of active commands" can increase up to
the num_cmd_elems limit. When tuning, it is advisable to clear these counters and observe them
over a controlled period, perhaps 24 hours, in which no errors are observed from the perspective of
both the switch and the adapter.
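One way to compare the observed workload against the configured limit is sketched below; fcs0 is an example device name, and the grep simply matches the counter names quoted above.

# Peak and current workload counters from the device driver statistics
fcstat -D fcs0 | grep -i active
# Configured command element limit, for comparison against the high water mark
lsattr -El fcs0 -a num_cmd_elems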
NOTE: values preceded by a 0x are in hex. All values below are reported in hex, not decimal.
Currently, because tunable attributes are applied only when the driver is unloaded and
reconfigured, the values in the ODM may not represent the actual values in the kernel when chdev is
used with the -P option. The expectation when using -P is that the system will be rebooted, or the
driver deconfigured and reconfigured, to apply those tunables, but in some cases this may not have
been done for various reasons. In those cases the ODM attribute may differ from what the kernel is
actually using.
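A minimal example of staging a change with -P follows; fcs0 and the value shown are placeholders, and the change does not reach the kernel until the adapter is reconfigured or the system is rebooted.

# Stage the change in the ODM only; the running driver keeps its current value
chdev -l fcs0 -a num_cmd_elems=1024 -P
# Show the ODM value, which may now differ from what the kernel is actually using
lsattr -El fcs0 -a num_cmd_elems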
NOTE: For virtual NPIV clients using VFC adapters, the I/O DMA pool size is not an adjustable attribute.
4) Adapter Driver Performance Statistics
These counters report shortages in:
1) I/O DMA
2) Command Element Resources.
When these counters increment, I/O is held off waiting for returning I/O to release the needed
resources. Generally, the amount of time I/O is held off due to "waiting for resources" is very
short when compared to the queuing that takes place in the disk driver when the queue depth is
exceeded.
The I/O-limiting effect imposed by num_cmd_elems is similar to the restriction imposed at the
upper SCSI layer by the hdisk attribute queue_depth. The difference is that num_cmd_elems
governs I/O on an adapter-wide basis, as opposed to an individual hdisk, which may have MPIO
paths extending over several adapters.
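Both limits can be displayed with lsattr; the device names below are examples.

# Adapter-wide limit on outstanding commands
lsattr -El fcs0 -a num_cmd_elems
# Per-hdisk limit, which applies across all of that hdisk's MPIO paths
lsattr -El hdisk28 -a queue_depth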
The primary intent of this setting is to limit the total amount of concurrent I/O initiated by an
adapter. The ability to control the I/O rate is important in a shared storage / open systems
environment because the I/O processing capabilities of both a storage port and its LUNs are finite.
For example, each storage port is rated to handle X number of I/Os per second, and its ability to
process I/O in a timely manner is dependent on a CPU which may be servicing I/Os originating
from several storage ports. It is generally recommended that once a storage port CPU reaches 80%
utilization, the I/O load be distributed to additional storage ports in an effort to reduce the CPU
utilization on that port, because beyond that point I/O service times increase.
I/O service times for LUNs are also negatively affected by increased I/O workload but are
generally dependent on storage resources such as RAID level, cache size, number of physical
disks, etc.
Utilizing information from both the No DMA Resource / No Command Resource counters noted below,
used for determining whether the I/O workload has exceeded the current adapter I/O DMA settings,
and the device driver statistics, useful for determining the overall I/O workload, we can properly
determine the state of the adapter tunables. Moreover, we can determine whether any changes made
will result in an increase in I/O performance. For additional information regarding these settings
and how they are used, please review the example cases below.
NOTE: For NPIV clients, the num_cmd_elems attribute should not exceed the VIOS adapter's
num_cmd_elems.
A feature associated with AIX is its ability to limit the global I/O initiated through a single hdisk.
MPIO limits I/O based on an attribute called queue_depth, which governs all configured paths.
Each storage manufacturer supplies ODM stanzas describing the range and maximum value for
this attribute, as well as many other attributes describing how the LUN will behave with AIX, with
the intent of limiting I/O to a single LUN. Most storage vendors supply default values in the range
of 8 to 32 and a maximum value of 16 to 32. A hidden benefit of setting a limit is that it prevents
flooding the adapter and the protocol driver with I/O requests during error recovery, which could
lengthen the time it takes to manage these events.
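The vendor-supplied range for a given LUN can be read directly from the ODM; hdisk28 is an example device name.

# Show the allowed range of queue_depth values defined by the disk's ODM stanza
lsattr -Rl hdisk28 -a queue_depth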
Traffic statistics and adapter performance tunables
Above is an example of the max_transfer setting in the disk driver.
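The current value and the allowed range of the hdisk max_transfer attribute can be queried with lsattr; hdisk28 is an example device name.

# Current max_transfer value for the disk
lsattr -El hdisk28 -a max_transfer
# Allowed range of max_transfer values from the disk's ODM definition
lsattr -Rl hdisk28 -a max_transfer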
V. Adapter performance tuning
The goal of adapter tuning is to increase I/O performance. After ensuring there are no Fibre
Channel physical layer problems, that average I/O response times are not exceeding a nominal value
of 15 ms, and that there is no perceivable queuing in the AIX disk driver, noted by the qfull
column in iostat -Dl, we can move on to tuning the adapter settings.
NOTE: If queuing in the disk driver is occurring, noted by a non-zero value in the qfull column
of iostat -Dl output, this should be reviewed to determine the cause. There are a few possible
scenarios when evaluating queuing.
1) The queue_depth set on the hdisk is too low for the apparent workload
ACTION: increase queue_depth
2) I/O service times are too high for the work load
ACTION: add additional storage resources; adjusting queue_depth alone may have
unexpected results
3) Queuing / command tag queuing has been disabled by AIX, forcing the queue depth to 1
AIX requires the NACA bit to be set in order to enable command tag queuing. This bit is
checked via an inquiry during initial device open. The bit is set by the storage array and is
used to advertise that NACA is supported. It is generally set when AIX is selected as the
host type on the array. On EMC arrays this bit is set on the director and is called the SC3
bit. Some midrange arrays auto-configure this setting and no change is necessary.
Warning: Some PCI slots, depending on the hardware, may have a limit on the allocatable PCI bus
address space used for DMA mapping. Setting max_xfer_size to a value other than the default may
therefore fail. When adjusting max_xfer_size using chdev with the -P option on adapters that serve
rootvg, a subsequent reboot may fail if the requested max_xfer_size allocation fails. If you have
paths through 2 different adapters, the suggestion is to apply the setting to one adapter at a time
by placing that adapter and its paths in a Defined state, applying the setting, and then bringing
the adapter back into an Available state.
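A sketch of that one-adapter-at-a-time sequence follows, assuming fcs0 is the adapter being changed and that I/O can continue over the remaining paths while it is offline.

# Unconfigure the adapter and all of its child devices (paths go to the Defined state)
rmdev -l fcs0 -R
# Apply the new setting while the adapter is in the Defined state
chdev -l fcs0 -a max_xfer_size=0x200000
# Reconfigure the adapter and its children, bringing the paths back to Available
cfgmgr -l fcs0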
Scenario 1
Customer X has been managing a rapidly growing data warehouse infrastructure. The database
administrator has been adding table space at a rate of 500 GB per month to a high-transaction
DB2 database. Nightly processing of transaction records has been slowing at a rate of 10% per
month, and DB2 performance analysis indicates the database is I/O bound during the evening
data processing and during backups. It has been noted that the overall workload per LUN has not
significantly increased, but the number of LUNs / hdisks has increased significantly.
The adapter is configured with the default settings noted below for a 4 Gb FC adapter, and the
system is balancing the I/O workload over 4 dual-port FC adapters. Each FC adapter is
positioned in a high-bandwidth PCI slot using the IBM Adapter Placement Reference as a guide.
The Adapter Placement Reference is publicly available and is used to determine which slots are
best for I/O performance.
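The current adapter settings can be reviewed with lsattr; fcs0 is an example device name, and the attributes of interest here are num_cmd_elems and max_xfer_size.

# Display all attributes of the adapter, including num_cmd_elems and max_xfer_size
lsattr -El fcs0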
Symptom
During nightly batch operations the following has been observed:
During the past week, "iostat -Dl 1" and "fcstat -D fcsX" data was collected each night. Some
individual I/Os have higher service times, but those are under 20 ms. iostat has been running
continuously. Average read and write I/O service times under heavy load do not exceed 8 ms
when running iostat -Dl 1, which runs iostat with an interval sampling time of 1 second. Each
night, after the fcstat data is collected, "fcstat -z fcsX" is run to clear the driver error statistics.
Analysis
After 24 hours the following values incremented:
1) No DMA Resource Count
2) No Command Resource Count
The above indicates a shortage of both I/O DMA resources and command resources. In some cases
you may also see an incrementing "No Adapter Elements Count" in addition to the No Command
Resource Count. The wait time in the driver is minimal when commands are held off due to the
resource shortages noted above: on the next returning in-flight command, resources are freed and
allocated to the waiting command within a few microseconds.
You can list the predefined attributes, acceptable values, and ranges for max_xfer_size of the
adapter using lsattr.
Note: The I/O DMA memory size can be increased by selecting a non-default max_xfer_size.
Additional I/O DMA memory is needed to initiate larger I/Os from the adapter.
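For example, with fcs0 as a placeholder device name:

# List the allowed values for the adapter's max_xfer_size attribute
lsattr -Rl fcs0 -a max_xfer_size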
Solution
The resolution for the above was to increase:
max_xfer_size = 0x200000
num_cmd_elems = 300
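A sketch of applying these values follows; fcs0 is an example device name, and because these are adapter tunables the change is staged with -P and takes effect only after the adapter is reconfigured or the system is rebooted.

# Stage both tunables in the ODM; they take effect at the next reconfiguration or reboot
chdev -l fcs0 -a max_xfer_size=0x200000 -a num_cmd_elems=300 -P

As noted in the warning earlier in this section, confirm the slot can support the larger DMA mapping before relying on a reboot to apply the change.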
When the adapter driver is unable to initiate an I/O request because there are no free command
elements, the No Command Resource Count is incremented and the I/O request waits, checking for
free command element resources. Resources are made available by a returning I/O request. The
same is true for No DMA Resource conditions.
In the above scenario, if the adapter tunable changes continued to show no improvement, where
the "No Command Resource Count" and/or the "No DMA Resource Count" continued to
increment even though max_xfer_size and num_cmd_elems were set to their maximum values, then
the I/O workload capability of the adapter has been exceeded.
The goal in this situation is to reduce the overall I/O load generated through this adapter by
moving that load to additional resources. There are other cases where this cannot be done and a
workaround is needed until a permanent solution can be implemented. The workaround does not
address the I/O performance implications associated with over-utilization of resources, but it does
prevent holding on to valuable I/O resources, which could otherwise result in slightly longer
recovery during certain fabric events.
Additional resources:
1) Add additional 4 or 8 Gb FC adapters and balance the I/O work load over all the adapters
by adding additional MPIO paths. This will reduce the overall I/O load on each adapter.
Workaround:
1) Reduce the I/O work load by reducing the num_cmd_elems.
The maximum value of num_cmd_elems for a 4 Gb FC adapter is 2048, and it can be adjusted
within a range from 20 to 2048 in increments of 1, as noted below. The selectable range is adapter-
model dependent and can quickly be determined with the following command.
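A command of the following form (fcs0 is an example device name) lists that range:

# List the allowed range of num_cmd_elems values for this adapter model
lsattr -Rl fcs0 -a num_cmd_elems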
Scenario 2
Customer 2 has set up a high-end P7 AIX server which is equally divided between 2 database
instances. The expected I/O workload during the day is transactional, random 4 KB and 8 KB
I/Os, but during the evening partial database backups are performed, and on the weekends a full
database backup is written for offsite duplication to tape, which is primarily large sequential I/O.
The I/O workload is spread over 4 x 8 Gb FC adapters, using 1 port from each dual-port
adapter, and each hdisk has 4 paths. Each MPIO path uses a unique storage port presented from a
midrange-class storage array with 128 physical 10K SATA drives in a RAID 10 configuration.
The total number of LUNs presented to this host is 840, each 120 GB in size.
The following I/O configuration has been modified from the default:
hdisk queue_depth = 256 (default = 16)
hdisk max_transfer = 0x1000000 (default = 0x100000)
adapter num_cmd_elems = 4096 (default = 500)
adapter max_xfer_size = 0x1000000 (default = 0x100000)
NOTE: Large sequential workloads, typically generated by large file copies or backups, can
possibly be made more efficient by increasing the max_transfer attribute on the hdisk and the
max_xfer_size attribute on the adapter. To prevent sending I/O requests larger than the adapter is
configured to initiate, the configuration and driver code has been modified to check these
attributes prior to making the adapter available for use. This check is present to ensure the hdisk
max_transfer value, or rather the maximum size to which an I/O can be coalesced, does not exceed
the adapter's max_xfer_size. Increasing both the hdisk and adapter attributes outlined above will
increase the maximum allowable size of an adapter-initiated I/O request. Most applications, as
well as many databases, initiate random, non-sequential 4 KB or 8 KB I/O requests, resulting in
little coalescing in the AIX disk driver. Backup applications, on the other hand, generally do
produce large sequential I/O requests, which results in significant coalescing of smaller
(generally 4 KB) adjacent I/O requests into much larger I/Os.
A noteworthy point to consider regarding the above, assuming the backend array physical devices
are able to handle the I/O load in either case, is that a storage port's ability to process I/O in a
timely manner is directly tied to the storage port's CPU utilization: increasing the number of
I/Os per second the port is processing raises the CPU utilization of that port. Increasing the
attributes noted above for a larger I/O transfer size, combined with the large sequential I/O
workload produced by backups, should result in sending fewer but larger I/Os, which should
reduce the number of I/Os per second the storage port is processing and in turn reduce the
port's CPU utilization. Decreased port CPU utilization has the effect of reducing I/O service times,
which increases overall I/O throughput.
Symptom
The complaint is that database backups are taking longer and longer to complete, especially on the
weekends. As the database grows throughout the year, nightly backups are now extending into
production hours, but daily database query performance is nominal. Database performance
analysis indicates the database is I/O bound and has reported that some I/O service times are
exceeding 900 ms, and in some cases service times have reached 2 seconds.
iostat -Dl, fcstat -D, and errpt data was collected for 1 week, and the fcstat counters were cleared
daily. The following was noted:
errpt data indicated no error conditions were logged
fcstat physical layer data indicated no increasing values in CRC errors, dumped frames, etc.
iostat -Dl data indicated a sharp increase in average read and write service times, ~700 ms
iostat -Dl indicated no qfull conditions; this is expected since the queue depth is set very high
Analysis
The very high average read and write service times during the evening backups, along with the
high number of active commands and peak active commands, in the absence of any physical layer
problems and with no sign of queuing in the adapter or disk driver, suggest the storage array is
unable to service these I/O requests in a timely manner; in other words, the I/O load generated by
the host application is greater than what the LUN / storage controller can handle within a
~15 ms window.
Solution
Add additional storage resources:
a. distribute the I/O work load to additional LUNs and/or storage controllers
b. use LUNs created from different RAID groups
c. use faster hardware on the array, for example replacing the SATA drives with SCSI or SAS drives
In some environments where the goal is to reduce costs, slow backup times can be tolerated and
are expected. In that situation the appropriate tuning is to throttle the I/O workload using the
queue_depth on the hdisk and/or the num_cmd_elems on the adapter. In some cases it may be
necessary to segregate the backup-only I/O traffic onto an isolated set of Fibre Channel adapters
in order to gain a finer granularity of control in limiting backup-related I/O.
VI. Additional Data
The script below will create read I/O using dd. It is suited for testing I/O service times in a pre-
production environment and is capable of creating a load of up to CMD_OUT concurrent commands.
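A minimal sketch of such a script follows; the device name, block size, and counts are placeholders to be adjusted for the target environment, and the raw hdisk should be one whose data can safely be read for test purposes.

#!/bin/ksh
# Generate concurrent sequential read load with dd against a raw hdisk device.
# CMD_OUT controls how many dd readers run at once.
CMD_OUT=16
DISK=/dev/rhdisk28   # placeholder raw device
BS=256k              # read size per dd I/O
COUNT=4000           # number of reads per dd process

i=1
while [ $i -le $CMD_OUT ]
do
    # Each dd starts at a different offset (skip) so the readers do not overlap
    dd if=$DISK of=/dev/null bs=$BS count=$COUNT skip=$(( i * COUNT )) &
    i=$(( i + 1 ))
done
wait
echo "All $CMD_OUT dd readers have completed."

While the script runs, iostat -Dl and fcstat -D can be used to observe the resulting service times and adapter counters.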
Trademarks
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business
Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be
trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at
"Copyright and trademark information" at www.ibm.com/legal/copytrade.shtml.
End of Document