Issue and Resolutions
Issue:
Cause:
When a Director fails, each code emulation within HYPERMAX OS performs a cache cleanup routine to unlock cache slots
in Global Memory that were in the middle of tasks owned by the failing Director and its emulations. This unlock routine on
the cache (known as Dead Director Cache Cleanup) allows the emulations to either re-purpose the cache slot or re-drive
the task to completion on another emulation CPU on a live, running Director. An issue has been found where cache slots
in certain conditions, holding tasks that were locked by an Enterprise Delivery Services (EDS) emulation CPU on a failed
Director, are not unlocked as designed and intended.
Change:
Resolution:
Dell EMC engineering is currently investigating this problem, and a permanent fix is still in progress. Contact Dell EMC
VMAX Support.
This post is just an overview of storage performance metrics and isn't meant to dive into
every possible scenario from every angle. Dell EMC has some excellent guides for
performance best practices that you can read here:
http://www.emc.com/collateral/hardware/white-papers/h5773-clariion-best-practices-
performance-availability-wp.pdf (Older version for clariion arrays)
https://community.emc.com/message/796647 (Newer version for VNX, see 2nd post in the
topic)
I've used a variety of software tools in my tenure as a storage administrator: EMC's
Performance Manager, Windows PerfMon, NetApp OnCommand Insight, Solar Winds SRM,
ViPR SRM, and of course the ubiquitous Navisphere Analyzer. All of them basically use the
same metrics, so the following information will be useful regardless of which method you
use.
The first thing I do when reviewing a potential storage array performance problem is a
quick look at the Storage Processors. This will give you a good indication of the overall
health of the array before you dive into the specific LUN (or LUNs) used by the application.
SP Cache Dirty Pages (%). These are pages in write cache that have received new data from
hosts but have not yet been flushed to disk. You should have a high percentage of dirty pages as
it increases the chance of a read coming from cache or additional writes to the same block of
data being absorbed by the cache. If an IO is served from cache the performance is better than if
the data had to be retrieved from disk. That’s why the default watermarks are usually around
60/80% or 70/90%. You don't want dirty pages to reach 100%; in a healthy cache they fluctuate between
the low and high watermarks. Periodic spikes or drops outside the watermarks are OK, but consistently
hitting 100% indicates that the write cache is overstressed.
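To make the watermark logic concrete, here is a minimal Python sketch (the sample values, the 10% "pegged" cutoff, and the function name are my own assumptions, not from any EMC tool) that classifies a run of dirty-page samples against a 60/80 watermark pair:

    # Sketch: classify SP dirty-page samples against low/high watermarks.
    # Sample data, cutoffs, and names are illustrative, not from Navisphere/Unisphere.
    LOW_WATERMARK = 60   # percent
    HIGH_WATERMARK = 80  # percent

    def check_dirty_pages(samples, low=LOW_WATERMARK, high=HIGH_WATERMARK):
        """Return a rough health verdict for a list of dirty-page percentages."""
        pegged = sum(1 for s in samples if s >= 100)
        if pegged / len(samples) > 0.1:      # arbitrary cutoff: "consistently" at 100%
            return "write cache overstressed"
        if all(low <= s <= high for s in samples):
            return "healthy: fluctuating between watermarks"
        return "occasional excursions outside watermarks: usually OK, keep watching"

    samples = [68, 74, 79, 81, 77, 72, 65, 70]   # percent dirty pages per polling interval
    print(check_dirty_pages(samples))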
SP Utilization (%). Check and see if either SP is running higher than about 75%. If either is
running that high, application response times will increase. Also, both will need to be under
50% for non-disruptive upgrades; we had to do a large-scale migration of data from one SAN to
another at one point in order to get an NDU accomplished. You'll also want to check for proper
balance. If one is much higher than the other, you should consider migrating LUNs from one SP
owner to another. I check SP balance on all of our arrays on a daily basis.
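As a rough illustration of that daily balance check (the 75% and 50% thresholds come from the guidelines above; the sample utilization figures, the 20% imbalance cutoff, and the function name are my own assumptions):

    # Sketch: flag SP utilization problems using the guidelines above.
    # Utilization figures are made-up sample data.
    def review_sp_utilization(spa_pct, spb_pct, imbalance_pct=20):
        findings = []
        for name, util in (("SPA", spa_pct), ("SPB", spb_pct)):
            if util > 75:
                findings.append(f"{name} at {util}%: expect elevated response times")
            if util > 50:
                findings.append(f"{name} at {util}%: too busy for a non-disruptive upgrade")
        if abs(spa_pct - spb_pct) > imbalance_pct:
            findings.append("SPs unbalanced: consider migrating LUNs to the quieter SP")
        return findings or ["SPs look balanced and healthy"]

    for line in review_sp_utilization(spa_pct=82, spb_pct=41):
        print(line)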
SP Response time (ms). Make sure again that both SPs are even and that response time
is acceptable. I like to see response times under 10ms. If you see that one SP has high utilization
and response time but the other SP doesn’t, look for LUNs owned by the busier SP that are using
more array resources. Looking at total IO on a per-LUN basis can help confirm whether both SPs have
relatively similar throughput while one SP has much higher bandwidth; that could mean that
there is some large-block IO occurring.
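One quick way to confirm the large-block suspicion is to derive the average IO size per SP from the bandwidth and throughput counters (average IO size = bandwidth / throughput). A minimal sketch with invented counter values:

    # Sketch: derive average IO size per SP from total bandwidth and throughput.
    # Counter values are illustrative only.
    def avg_io_size_kb(bandwidth_mb_s, throughput_io_s):
        """Average IO size in KB = (MB/s * 1024) / IO/s."""
        if throughput_io_s == 0:
            return 0.0
        return (bandwidth_mb_s * 1024) / throughput_io_s

    spa = {"throughput": 9500, "bandwidth": 75}    # IO/s, MB/s
    spb = {"throughput": 9200, "bandwidth": 460}   # similar IOPS, much higher bandwidth

    for name, sp in (("SPA", spa), ("SPB", spb)):
        print(f"{name}: ~{avg_io_size_kb(sp['bandwidth'], sp['throughput']):.0f} KB per IO")
    # SPB's much larger average IO size points to a large-block (e.g. sequential/backup) workload.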
SP Port Queue Full Count. This represents the number of times that a front-end port issued a
QFULL response back to the hosts. If you are seeing QFULLs, it could mean that the queue depth
on the HBA is too large for the LUNs being accessed. A CLARiiON/VNX front-end port has a queue
depth of 1600, which is the maximum number of simultaneous IOs that port can process. Each
LUN on the array has a maximum queue depth that is calculated using a formula based on the
number of data disks in the RAID Group. For example, a port with 512 queues and a typical LUN
queue depth of 32 can support up to: 512 / 32 = 16 LUNs on 1 Initiator (HBA) or 16 Initiators
(HBAs) with 1 LUN each or any combination not to exceed this number. Configurations that
exceed this number are in danger of returning QFULL conditions. A QFULL condition signals that
the target/storage port is unable to process more IO requests and thus the initiator will need to
throttle IO to the storage port. As a result of this, application response times will increase and IO
activity will decrease.
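The 512 / 32 = 16 arithmetic above generalizes to a simple capacity check. A minimal sketch (the queue-depth numbers are parameters you would substitute from your own configuration; the function names are my own):

    # Sketch: estimate how many initiator/LUN combinations a front-end port can carry
    # before risking QFULL, given the port queue depth and a per-LUN HBA queue depth.
    def max_initiator_lun_pairs(port_queue_depth, lun_queue_depth):
        """E.g. 512 port queues / 32 per-LUN queue depth = 16 initiator-LUN pairs."""
        return port_queue_depth // lun_queue_depth

    def qfull_risk(port_queue_depth, lun_queue_depth, initiator_lun_pairs):
        capacity = max_initiator_lun_pairs(port_queue_depth, lun_queue_depth)
        return initiator_lun_pairs > capacity

    print(max_initiator_lun_pairs(512, 32))             # 16, matching the example above
    print(qfull_risk(512, 32, initiator_lun_pairs=24))  # True: port likely to return QFULL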
The next thing I do is look at the specific LUNs that the application owner is asking about.
The list below includes the basic performance metrics that I most often look at when
investigating a performance problem.
Utilization (%) represents the fraction of an observation period during which a LUN has any
outstanding requests. When the LUN becomes the bottleneck, the utilization will be at or close to
100%. However, since I/Os can get serviced by multiple disks an increase in workload might still
result in a higher throughput. Utilization by itself is not a very good indicator of the overall
performance of the LUN; it needs to be considered alongside several other metrics. For example, if you
are writing to a LUN (100% writes) and the data lands in a small physical region of the LUN, it
may be possible to reach 100% utilization with write cache re-hits. This means that all writes
are being serviced by the write cache, and since you are writing data to the same locations over
and over, none of the data is flushed to the disks. This can drive the LUN utilization to
100% even though there is actually no IO to the disks. Utilization is heavily affected by caching, both
read and write: the LUN can be very busy yet not have a problem. Use utilization to help
identify busy LUNs, then look at queuing and response times to see if there really is an issue.
Queue Length is the average number of requests within a polling interval that are outstanding
to this LUN. A queue length of zero indicates an idle LUN. If three requests arrive at an idle LUN
at the same time, only one of them can be served immediately; the other two must wait in the
queue. That scenario would result in a queue length of 3. My general guideline for “bad
performance” on a LUN is a queue length greater than 2 for a single disk drive.
Average Busy Queue Length is the average number of outstanding requests when the LUN was
busy. This does not include any idle time. This value should not exceed 2 times the number of
spindles on a LUN. For example, if a LUN has 25 spindles, a value of 50 is acceptable. Since this
queue length is counted only when the LUN is not idle, the value indicates the frequency
variation (burst frequency) of incoming requests. The higher the value, the bigger the burst and
the longer the average response time at this component. In contrast, the average queue length also
includes idle periods when no requests are pending. If the LUN has exactly one outstanding request
for 50% of the time and is idle for the other 50%, the average busy queue length will be 1, while the
average queue length will be ½.
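That 50%-idle example can be reproduced directly from per-interval samples. A minimal sketch (the sample data is invented to match the example above):

    # Sketch: average queue length vs. average busy queue length.
    # Each sample is the number of outstanding requests observed in one polling interval.
    def average_queue_length(samples):
        return sum(samples) / len(samples)          # includes idle intervals

    def average_busy_queue_length(samples):
        busy = [s for s in samples if s > 0]        # only intervals where the LUN had work
        return sum(busy) / len(busy) if busy else 0.0

    # Half the intervals idle, half with exactly one outstanding request:
    samples = [1, 0, 1, 0, 1, 0, 1, 0]
    print(average_queue_length(samples))       # 0.5 -> the "½" from the text
    print(average_busy_queue_length(samples))  # 1.0 -> counts only the busy intervals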
Response Time (ms) is the average time, in milliseconds, that a request to this LUN is
outstanding, including its waiting time. The higher the queue length for a LUN, the more
requests are waiting in its queue, thus increasing the average response time of a single request.
For a given workload, queue length and response time are directly proportional. Keep in mind
that cache re-hits bring down the average response time (and service times), whether they are
reads or writes. LUN Response time is a good starting point for troubleshooting. It gives a good
indicator of what the host system is experiencing. Usually, if your LUN response time (response
time = queue length * service time) is good, then the host performance is good. High response
times don't always mean that the CLARiiON is busy; they can also indicate that you're having issues
with your host or fabric. We use the Brocade Health report on a regular basis to identify hosts
that have an excessive amount of traffic, as well as running the EMC HEAT report on hosts that
have reported issues (which can identify incorrect HBA drivers, a bad HBA, etc.). These are my
general guidelines for response time (a short sketch of the arithmetic follows the list):
Less than 10 ms: very good
Between 10 – 20 ms: okay
Between 20 – 50 ms: slow, needs attention
Greater than 50 ms: I/O bottleneck
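Here is a minimal sketch of the response-time arithmetic and the guideline buckets above (the thresholds mirror the list; the function names and sample inputs are my own):

    # Sketch: approximate LUN response time from queue length and service time,
    # then classify it against the guideline buckets above.
    def estimate_response_time_ms(queue_length, service_time_ms):
        """Rough model from the text: response time = queue length * service time."""
        return queue_length * service_time_ms

    def classify_response_time(rt_ms):
        if rt_ms < 10:
            return "very good"
        if rt_ms <= 20:
            return "okay"
        if rt_ms <= 50:
            return "slow, needs attention"
        return "I/O bottleneck"

    rt = estimate_response_time_ms(queue_length=4, service_time_ms=6)
    print(rt, classify_response_time(rt))   # 24 ms -> "slow, needs attention"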
Service Time (ms) represents the time, in milliseconds, a request spent being serviced by a
component. It does not include time waiting in a queue. Service time is mainly a characteristic of
the system component. However, larger I/Os take longer and therefore usually result in lower
throughput (IO/s) but better bandwidth (Mbytes/s). Put simply, service time is the time it
takes to actually send the I/O request to the storage and get an answer back. In general, I like to
see service times below 20ms.
Total Throughput (IO/sec) is the average number of host requests that is passed through the
LUN per second. This includes both read and write requests. Smaller requests usually result in a
higher total throughput than larger requests. Examining total throughput (along with
%Utilization) is a good way to identify the busiest LUNs on the array. In general, the guideline
IOPS limit per drive depends on its RPM and drive type (rough rule-of-thumb figures are sketched
below).
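As a hedged illustration only, the figures below are commonly cited rules of thumb for per-drive IOPS, not vendor specifications, and the estimator function is my own; substitute the values from your own array documentation:

    # Sketch: rough per-drive IOPS rules of thumb (commonly cited approximations,
    # NOT vendor specifications) and a raw-IOPS estimate for a group of drives.
    RULE_OF_THUMB_IOPS = {
        "7.2K NL-SAS/SATA": 80,
        "10K SAS/FC": 140,
        "15K SAS/FC": 180,
        "EFD/SSD": 2500,   # varies widely by model; treat as a placeholder
    }

    def raw_group_iops(drive_type, data_disks):
        """Back-of-the-envelope: per-drive IOPS x data disks (ignores RAID write penalty and cache)."""
        return RULE_OF_THUMB_IOPS[drive_type] * data_disks

    print(raw_group_iops("15K SAS/FC", data_disks=8))   # ~1440 raw IOPS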
Write Throughput (IO/sec) The average number of host write requests that is passed
through the LUN per second. Smaller requests usually result in a higher write throughput than
larger requests. When troubleshooting specific LUNs, check the write IO size and see if the size
is what you would expect for the application you are investigating. Extremely large IO sizes
coupled with high IOPS may cause write cache contention.
Read Throughput (IO/sec) The average number of host read requests that is passed through
the LUN per second. Smaller requests usually result in a higher read throughput than larger
requests.
Total Bandwidth (MB/s) The average amount of host data in Mbytes that is passed through the
LUN per second. This includes both read and write requests. Larger requests usually result in a
higher total bandwidth than smaller requests.
Read Bandwidth (MB/s) The average amount of host read data in Mbytes that is passed
through the LUN per second. Larger requests usually result in a higher bandwidth than smaller
requests.
Write Bandwidth (MB/s) The average amount of host write data in Mbytes that is passed
through the LUN per second. Larger requests usually result in a higher bandwidth than smaller
requests. Keep in mind that writes consume many more array resources than reads.
Read Size (KB) The average read request size in Kbytes seen by the LUN. This number indicates
whether the overall read workload is oriented more toward throughput (I/Os per second) or
bandwidth (Mbytes/second). For a finer distinction of I/O sizes, use an IO Size Distribution chart
for this LUN.
Write Size (KB) The average write request size in Kbytes seen by the LUN. This number
indicates whether the overall write workload is oriented more toward throughput (I/Os per
second) or bandwidth (Mbytes/second). For a finer distinction of I/O sizes, use an IO Size
Distribution chart for the LUNs.
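The IO Size Distribution chart mentioned above can be approximated from raw request sizes. A minimal sketch (the bucket boundaries and sample data are my own choices):

    # Sketch: bucket observed request sizes (KB) into a simple IO size distribution,
    # approximating the "IO Size Distribution" chart mentioned above.
    from collections import Counter

    BUCKETS = [4, 8, 16, 32, 64, 128]   # KB upper bounds; anything larger falls in "> 128 KB"

    def bucket_label(size_kb):
        for upper in BUCKETS:
            if size_kb <= upper:
                return f"<= {upper} KB"
        return "> 128 KB"

    def io_size_distribution(sizes_kb):
        counts = Counter(bucket_label(s) for s in sizes_kb)
        total = len(sizes_kb)
        return {label: round(100 * n / total, 1) for label, n in counts.items()}

    # Invented sample of per-request sizes for one LUN:
    sample_sizes = [8, 8, 4, 64, 8, 16, 256, 8, 32, 8]
    print(io_size_distribution(sample_sizes))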
Below is an explanation of additional performance metrics that I don’t use as frequently,
but I’m including them for completeness.
Forced Flushes/s Number of times per second the cache had to flush pages to disk to free up
space for incoming write requests. Forced flushes are a measure of how often write requests will
have to wait for disk I/O rather than be satisfied by an empty slot in the write cache. In most
well-performing systems this should be zero most of the time.
Full Stripe Writes/s Average number of write requests per second that spanned a whole stripe
(all disks in a LUN). This metric is applicable only to LUNs that are part of a RAID5 or RAID3
group.
Used Prefetches (%) The percentage of prefetched data in the read cache that was read during
the last polling interval.
Disk Crossing (%) Percentage of host requests that require I/O to at least two disks compared
to the total number of host requests. A single disk crossing can involve more than two disk
drives.
Disk Crossings/s Number of times per second that a request requires access to at least two disk
drives. A single disk crossing can involve more than two disks.
Read Cache Hits/s Average number of read requests per second that were satisfied by either
read or write cache without requiring any disk access. A read cache hit occurs when recently
accessed data is re-referenced while it is still in the cache.
Read Cache Misses/s Average number of read requests per second that did require one or
more disk accesses.
Reads From Write Cache/s Average number of read requests per second that were satisfied by
write cache only. Reads from write cache occur when recently written data is read again while it
is still in the write cache. This is a subset of read cache hits which includes requests satisfied by
either the write or the read cache.
Reads From Read Cache/s Average number of read requests per second that were satisfied by
the read cache only. Reads from read cache occur when data that has been recently read or
prefetched is re-read while it is still in the read cache. This is a subset of read cache hits which
includes requests satisfied by either the write or the read cache.
Read Cache Hit Ratio The fraction of read requests served from both read and write caches vs.
the total number of read requests. A higher ratio indicates better read performance.
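Tying the read-cache counters above together, here is a minimal sketch of how the hit ratio falls out of them (the counter values are invented for illustration):

    # Sketch: read cache hit ratio from the per-second counters described above.
    # Hits = reads served from read cache + reads served from write cache.
    def read_cache_hit_ratio(reads_from_read_cache, reads_from_write_cache, read_cache_misses):
        hits = reads_from_read_cache + reads_from_write_cache
        total_reads = hits + read_cache_misses
        return hits / total_reads if total_reads else 0.0

    ratio = read_cache_hit_ratio(reads_from_read_cache=420,
                                 reads_from_write_cache=35,
                                 read_cache_misses=145)
    print(f"Read cache hit ratio: {ratio:.2f}")   # ~0.76 -> roughly 3 of 4 reads avoid disk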
Write Cache Hits/s Average number of write requests per second that were satisfied by the
write cache without requiring any disk access. Write requests that are not write cache hits are
referred to as write cache misses.
Write Cache Misses/s Average number of write requests per second that did require one or
multiple disk accesses. Write requests that cause forced flushes or that bypass the write cache
due to their size are examples of write cache misses.
Write Cache Rehits/s Average number of write requests per second that were satisfied by the
write cache since they had been referenced before and not yet flushed to the disks. Write cache
rehits occur when recently accessed data is referenced again while it is still in the write cache.
This is a subset of Write Cache Hits.
Write Cache Hit Ratio The ratio of write requests that the write cache satisfied without
requiring any disk access vs. the total number of write requests to this LUN. A higher ratio
indicates better write performance.
Write Cache Rehit Ratio The ratio of write requests that the write cache satisfied since they
have been referenced before and not yet flushed to the disks vs. the total number of write
requests to this LUN. This is a measure of how often the write cache succeeded in eliminating a
write operation to disk. While improving the rehit ratio is useful, it is more beneficial to reduce
the number of forced flushes.
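And the corresponding write-side ratios, again as a sketch with invented counter values:

    # Sketch: write cache hit ratio and rehit ratio from the counters described above.
    def write_cache_ratios(write_hits_s, write_misses_s, write_rehits_s):
        total_writes = write_hits_s + write_misses_s
        hit_ratio = write_hits_s / total_writes if total_writes else 0.0
        rehit_ratio = write_rehits_s / total_writes if total_writes else 0.0
        return hit_ratio, rehit_ratio

    # Invented per-second counters for one LUN:
    hit, rehit = write_cache_ratios(write_hits_s=900, write_misses_s=100, write_rehits_s=300)
    print(f"Write cache hit ratio: {hit:.2f}, rehit ratio: {rehit:.2f}")
    # A high rehit ratio means the cache repeatedly absorbs writes to the same blocks
    # before flushing them, eliminating disk writes.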
===============================================================