IBM DS8000 Performance Monitoring and Tuning
Axel Westphal
Bert Dufrasne
Wilhelm Gardt
Jana Jamsek
Peter Kimmel
Flavio Morais
Paulus Usong
Alexander Warmuth
Kenta Yuge
Redbooks
International Technical Support Organization
April 2016
SG24-8318-00
Note: Before using this information and the product it supports, read the information in “Notices” on
page xiii.
This edition applies to Version 8.0 of the IBM DS8884 and DS8886 models (product numbers 2831–2834) and
Version 7.5 of the IBM DS8870 (product numbers 2421–2424).
Notices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
Trademarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiv
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii
Authors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii
Now you can become a published author, too! . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xix
Comments welcome. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xix
Stay connected to IBM Redbooks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xx
system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
6.3.2 Migrating the DS8870 storage system to the DS8886 storage system. . . . . . . . 199
6.4 Disk Magic Easy Tier modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
6.4.1 Predefined skew levels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
6.4.2 Current workload existing skew level . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
6.4.3 Heat map . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
6.5 Storage Tier Advisor Tool . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
6.5.1 Storage Tier Advisor Tool output samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
6.5.2 Storage Tier Advisor Tool for Disk Magic skew . . . . . . . . . . . . . . . . . . . . . . . . . 218
Chapter 11. Performance considerations for VMware . . . . . . . . . . . . . . . . . . . . . . . . . 363
11.1 Disk I/O architecture overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364
11.2 vStorage APIs for Array Integration support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 368
11.3 Host type for the DS8000 storage system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
11.4 Multipathing considerations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
11.5 Performance monitoring tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 372
11.5.1 Virtual Center performance statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 372
11.5.2 Performance monitoring with esxtop. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
11.5.3 Guest-based performance monitoring. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 375
11.5.4 VMware specific tuning for maximum performance . . . . . . . . . . . . . . . . . . . . . 375
11.5.5 Workload spreading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 375
11.5.6 Virtual machines sharing the LUN. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 376
11.5.7 ESXi file system considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 377
11.5.8 Aligning partitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 380
Chapter 15. IBM System Storage SAN Volume Controller attachment . . . . . . . . . . . 491
15.1 IBM System Storage SAN Volume Controller . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 492
15.1.1 SAN Volume Controller concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 492
15.1.2 SAN Volume Controller multipathing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 494
15.1.3 SAN Volume Controller Advanced Copy Services . . . . . . . . . . . . . . . . . . . . . . 495
15.2 SAN Volume Controller performance considerations . . . . . . . . . . . . . . . . . . . . . . . . 496
15.3 DS8000 performance considerations with SAN Volume Controller . . . . . . . . . . . . . 498
15.3.1 DS8000 array . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 498
15.3.2 DS8000 rank format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 498
15.3.3 DS8000 extent pool implications. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 499
15.3.4 DS8000 volume considerations with SAN Volume Controller. . . . . . . . . . . . . . 500
15.3.5 Volume assignment to SAN Volume Controller . . . . . . . . . . . . . . . . . . . . . . . . 501
15.4 Performance monitoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 501
15.5 Sharing the DS8000 storage system between various server types and the SAN Volume
Controller . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 502
15.5.1 Sharing the DS8000 storage system between Open Systems servers and the SAN
Volume Controller . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 502
15.5.2 Sharing the DS8000 storage system between z Systems servers and the SAN
Volume Controller . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 503
15.6 Configuration guidelines for optimizing performance . . . . . . . . . . . . . . . . . . . . . . . . 503
15.7 Where to place flash . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 504
15.8 Where to place Easy Tier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 506
Notices
This information was developed for products and services offered in the US. This material might be available
from IBM in other languages. However, you may be required to own a copy of the product or product version in
that language in order to access it.
IBM may not offer the products, services, or features discussed in this document in other countries. Consult
your local IBM representative for information on the products and services currently available in your area. Any
reference to an IBM product, program, or service is not intended to state or imply that only that IBM product,
program, or service may be used. Any functionally equivalent product, program, or service that does not
infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to
evaluate and verify the operation of any non-IBM product, program, or service.
IBM may have patents or pending patent applications covering subject matter described in this document. The
furnishing of this document does not grant you any license to these patents. You can send license inquiries, in
writing, to:
IBM Director of Licensing, IBM Corporation, North Castle Drive, MD-NC119, Armonk, NY 10504-1785, US
This information could include technical inaccuracies or typographical errors. Changes are periodically made
to the information herein; these changes will be incorporated in new editions of the publication. IBM may make
improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time
without notice.
Any references in this information to non-IBM websites are provided for convenience only and do not in any
manner serve as an endorsement of those websites. The materials at those websites are not part of the
materials for this IBM product and use of those websites is at your own risk.
IBM may use or distribute any of the information you provide in any way it believes appropriate without
incurring any obligation to you.
The performance data and client examples cited are presented for illustrative purposes only. Actual
performance results may vary depending on specific configurations and operating conditions.
Information concerning non-IBM products was obtained from the suppliers of those products, their published
announcements or other publicly available sources. IBM has not tested those products and cannot confirm the
accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the
capabilities of non-IBM products should be addressed to the suppliers of those products.
Statements regarding IBM's future direction or intent are subject to change or withdrawal without notice, and
represent goals and objectives only.
This information contains examples of data and reports used in daily business operations. To illustrate them
as completely as possible, the examples include the names of individuals, companies, brands, and products.
All of these names are fictitious and any similarity to actual people or business enterprises is entirely
coincidental.
COPYRIGHT LICENSE:
This information contains sample application programs in source language, which illustrate programming
techniques on various operating platforms. You may copy, modify, and distribute these sample programs in
any form without payment to IBM, for the purposes of developing, using, marketing or distributing application
programs conforming to the application programming interface for the operating platform for which the sample
programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore,
cannot guarantee or imply reliability, serviceability, or function of these programs. The sample programs are
provided “AS IS”, without warranty of any kind. IBM shall not be liable for any damages arising out of your use
of the sample programs.
The following terms are trademarks or registered trademarks of International Business Machines Corporation,
and might also be trademarks or registered trademarks in other countries.
AIX®, CICS®, Cognos®, DB2®, DB2 Universal Database™, developerWorks®, DS4000®, DS8000®, Easy Tier®, ECKD™, Enterprise Storage Server®, FICON®, FlashCopy®, FlashSystem™, GDPS®, GPFS™, HACMP™, HyperSwap®, IBM®, IBM FlashSystem®, IBM Spectrum™, IBM Spectrum Control™, IBM Spectrum Protect™, IBM Spectrum Scale™, IBM Spectrum Virtualize™, IBM z Systems™, IBM z13™, IBM zHyperWrite™, IMS™, OMEGAMON®, Parallel Sysplex®, POWER®, Power Systems™, POWER7®, POWER7+™, POWER8®, PowerHA®, PowerPC®, PR/SM™, ProtecTIER®, Real-time Compression™, Redbooks®, Redbooks (logo)®, Redpapers™, Resource Measurement Facility™, RMF™, Storwize®, System i®, System Storage®, Tivoli®, Tivoli Enterprise Console®, z Systems™, z/Architecture®, z/OS®, z/VM®, z/VSE®, z13™
Linux is a trademark of Linus Torvalds in the United States, other countries, or both.
Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the
United States, other countries, or both.
Java, and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its
affiliates.
UNIX is a registered trademark of The Open Group in the United States and other countries.
Other company, product, or service names may be trademarks or service marks of others.
This IBM® Redbooks® publication provides guidance about how to configure, monitor, and
manage your IBM DS8880 storage systems to achieve optimum performance, and it also
covers the IBM DS8870 storage system. It describes the DS8880 performance features and
characteristics, including hardware-related performance features, synergy items for certain
operating systems, and other functions, such as IBM Easy Tier® and the DS8000® I/O
Priority Manager.
The book also describes specific performance considerations that apply to particular host
environments, including database applications.
This book also outlines the various tools that are available for monitoring and measuring I/O
performance for different server environments, and it describes how to monitor the
performance of the entire DS8000 storage system.
This book is intended for individuals who want to maximize the performance of their DS8880
and DS8870 storage systems and investigate the planning and monitoring tools that are
available. The IBM DS8880 storage system features, as described in this book, are available
for the DS8880 model family with R8.0 release bundles (Licensed Machine Code (LMC) level
7.8.0).
For more information about optimizing performance with the previous DS8000 models such
as the DS8800 or DS8700 models, see DS8800 Performance Monitoring and Tuning,
SG24-8013.
Performance: Any sample performance measurement data that is provided in this book is
for comparative purposes only. The data was collected in controlled laboratory
environments at a specific point by using the configurations, hardware, and firmware levels
that were available then. The performance in real-world environments can vary. Actual
throughput or performance that any user experiences also varies depending on
considerations, such as the I/O access methods in the user’s job, the I/O configuration, the
storage configuration, and the workload processed. The data is intended only to help
illustrate how different hardware technologies behave in relation to each other. Contact
your IBM representative or IBM Business Partner if you have questions about the expected
performance capability of IBM products in your environment.
Authors
This book was produced by a team of specialists from around the world working for the
International Technical Support Organization, at the EMEA Storage Competence Center
(ESCC) in Mainz, Germany.
Axel Westphal works as a certified IT Specialist at the IBM EMEA Storage Competence
Center (ESCC) in Mainz, Germany. He joined IBM in 1996, working for IBM Global Services
as a Systems Engineer. His areas of expertise include setup and demonstration of
IBM System Storage® products and solutions in various environments. He wrote several
storage white papers and co-authored several IBM Redbooks publications.
Jana Jamsek is an IT specialist who works in Storage Advanced Technical Skills for Europe
as a specialist for IBM Storage Systems and IBM i systems. Jana has eight years of
experience in the IBM System i® and AS/400 areas, and 15 years of experience in Storage.
She has a master’s degree in computer science and a degree in mathematics from the
University of Ljubljana, Slovenia. Jana works on complex customer cases that involve IBM i
and Storage systems, in different European, Middle Eastern, and African countries. She
presents at IBM Storage and Power universities and runs workshops for IBM employees and
customers. Jana is the author or co-author of several IBM publications, including IBM
Redbooks publications, IBM Redpapers™ publications, and white papers.
Peter Kimmel is an IT Specialist and ATS team lead of the Enterprise Disk Solutions team at
the ESCC in Mainz, Germany. He joined IBM Storage in 1999 and has since worked with all
the various IBM Enterprise Storage Server® and DS8000 generations, with a focus on
architecture and performance. He was involved in the Early Shipment Programs (ESPs) of
these early installations, and co-authored several IBM Redbooks publications. Peter holds a
diploma (MSc) degree in physics from the University of Kaiserslautern.
Flavio Morais is a GTS Storage Specialist in Brazil. He has six years of experience in the
SAN/storage field. He holds a degree in computer engineering from Instituto de
Ensino Superior de Brasilia. His areas of expertise include DS8000 planning, copy services,
IBM Tivoli® Storage Productivity Center for Replication, and performance troubleshooting. He
has extensive experience solving performance problems with Open Systems.
Paulus Usong joined IBM in Indonesia as a Systems Engineer. His next position brought him
to New York as a Systems Programmer at a bank. From New York, he came back to IBM in
San Jose at the Santa Teresa Lab, which is now known as the Silicon Valley Lab. During his
IBM employment in San Jose, Paulus moved through several different departments. His last
position at IBM was as a Consulting IT Specialist with the IBM ATS group, working as a disk
performance expert covering customers in the Americas. After his retirement from IBM, he
joined IntelliMagic as a Mainframe Consultant.
Alexander Warmuth is a Senior IT Specialist for IBM at the ESCC in Mainz, Germany.
Working in technical sales support, he designs and promotes new and complex storage
solutions, drives the introduction of new products, and provides advice to clients, Business
Partners, and sales. His main areas of expertise are high-end storage solutions and business
resiliency. He joined IBM in 1993 and has been working in technical sales support since 2001.
Alexander holds a diploma in electrical engineering from the University of Erlangen,
Germany.
Thanks to the following people for their contributions to this project:
Nick Clayton, Peter Flämig, Dieter Flaut, Marc Gerbrecht, Marion Hejny, Karl Hohenauer, Lee
La Frese, Frank Krüger, Uwe Heinrich Müller, Henry Sautter, Louise Schillig, Dietmar
Schniering, Uwe Schweikhard, Christopher Seiwert, Paul Spagnolo, Mark Wells, Harry
Yudenfriend
ESCC Rhine-Main Lab Operations
Find out more about the residency program, browse the residency index, and apply online at:
ibm.com/redbooks/residencies.html
Comments welcome
Your comments are important to us!
We want our books to be as helpful as possible. Send us your comments about this book or
other IBM Redbooks publications in one of the following ways:
Use the online Contact us review Redbooks form found at:
ibm.com/redbooks
Send your comments in an email to:
[email protected]
Mail your comments to:
IBM Corporation, International Technical Support Organization
Dept. HYTD Mail Station P099
2455 South Road
Poughkeepsie, NY 12601-5400
Stay connected to IBM Redbooks
Find us on Facebook:
http://www.facebook.com/IBMRedbooks
Follow us on Twitter:
http://twitter.com/redbooks
Look for us on LinkedIn:
http://www.linkedin.com/groups?home=&gid=2130806
Explore new Redbooks publications, residencies, and workshops with the IBM Redbooks
weekly newsletter:
https://www.redbooks.ibm.com/Redbooks.nsf/subscribe?OpenForm
Stay current on recent Redbooks publications with RSS Feeds:
http://www.redbooks.ibm.com/rss.html
Data continually moves from one component to another within a storage server. The objective
of server design is hardware that has sufficient throughput to keep data flowing smoothly
without waiting because a component is busy. When data stops flowing because a
component is busy, a bottleneck forms. Obviously, it is preferable to minimize the frequency
and severity of bottlenecks.
The ideal storage server is one in which all components are used and bottlenecks are few.
This scenario is the case if the following conditions are met:
The machine is designed well, with all hardware components in balance. To provide this
balance over a range of workloads, a storage server must allow a range of hardware
component options.
The machine is sized well for the client workload. Where options exist, the correct
quantities of each option are chosen.
The machine is set up well. Where options exist in hardware installation and logical
configuration, these options are chosen correctly.
Automatic rebalancing and tiering options can help achieve optimum performance even in an
environment of ever-changing workload patterns, but they cannot replace a correct sizing of
the machine.
Throughput numbers are achieved in controlled tests that push as much data as possible
through the storage server as a whole, or perhaps a single component. At the point of
maximum throughput, the system is so overloaded that response times are greatly extended.
Trying to achieve such throughput numbers in a normal business environment brings protests
from the users of the system because response times are poor.
To assure yourself that the DS8880 family offers the fastest technology, consider the
performance numbers for the individual disks, adapters, and other components of the
DS8886 and DS8884 models, and for the total device. The DS8880 family uses the most
current technology available. But, use a more rigorous approach when planning the DS8000
hardware configuration to meet the requirements of a specific environment.
For more information about this tool, see 6.1, “Disk Magic” on page 160.
Spreading the workload maximizes the usage and performance of the storage server as a
whole. Isolating a workload is a way to maximize the workload performance, making the
workload run as fast as possible. Automatic I/O prioritization can help avoid a situation in
which less-important workloads dominate the mission-critical workloads in shared
environments, and allow more shared environments.
If you expect growing loads, for example, when replacing one system with a new one that also
has a much bigger capacity, add some contingency for this amount of foreseeable I/O growth.
For more information, see 4.2, “Configuration principles for optimal performance” on page 87.
With the capabilities of the DS8000 series, many different workloads can easily be
consolidated into a single storage system.
Compared to the previous DS8870 models, the DS8880 family comes with largely increased
processing power because of IBM POWER8® technology, more scalability for processors and
cache, a new I/O bay interconnect, and improved I/O enclosure bays.
The storage systems can scale up to 1536 disk drives or 2.5-inch solid-state drives (SSDs),
plus another 240 high-performance flash cards in High-Performance Flash Enclosures
(HPFEs), the latter on a dedicated 1.8-inch flash architecture.
Table 1-1 provides an overview of the DS8880 and DS8870 models, including processor,
memory, HA, and disk specifications for each model. The number of processor complexes is
two for each of the models.
(Table 1-1 is not fully reproduced here. The recoverable rows show that the DS8880 models use POWER8 processors with 4.2 billion transistors per processor and a 12 Gbps SAS disk drive interface, whereas the DS8870 models use POWER7+ (second generation, 2.1 billion transistors) or POWER7 (first generation, 1.2 billion transistors) processors and a 6 Gbps SAS disk drive interface. The listed processor speeds are 3.02 - 3.89 GHz and 3.89/3.72/3.53 GHz (depending on the 8-, 16-, or 24-core configuration) for the DS8880 models, and 4.23 GHz (second generation) or 3.55 GHz (first generation) for the DS8870 models.)
The following sections provide a short description of the main hardware components.
The IBM POWER® processor architecture offers superior performance and availability
features, compared to other conventional processor architectures on the market. POWER7+
allowed up to four intelligent simultaneous threads per core, which is twice as many as typical
x86 processors offer. The POWER8 architecture even allows up to eight simultaneous threads
per core.
Disk drives
The DS8880 family offers a selection of industry-standard Serial Attached SCSI (SAS 3.0)
disk drives. Most drive types (15,000 and 10,000 RPM) are 6.35 cm (2.5-inch) small form
factor (SFF) sizes, with drive capacities of 300 GB - 1200 GB. SSDs are also available in
2.5-inch, with capacities of 200 - 1600 GB. The Nearline drives (7,200 RPM) are 8.9 cm
(3.5-inch) size drives with a SAS interface. With the maximum number and type of drives, the
storage system can scale to over 3 PB of total raw capacity.
With all these new components, the DS8880 family is positioned at the top of the
high-performance category. The following hardware components contribute to the high
performance of the DS8000 family:
Array across loops (AAL) when building Redundant Array of Independent Disks (RAID)
POWER symmetrical multiprocessor system (SMP) processor architecture
Multi-threaded design, simultaneous multithreading (SMT)
Switched PCIe architecture
Powerful processing components at the host adapter (HA) and device adapter (DA) level,
each adapter with its own IBM PowerPC® CPUs on board, to which major CPU-intensive
functions can be offloaded.
Adaptive Multi-stream Prefetching (AMP) provides provably optimal sequential read
performance and maximizes the sequential read throughput of all RAID arrays where it is
used, and therefore of the system.
By carefully selecting the data that we destage and, at the same time, reordering the
destaged data, we minimize the disk head movements that are involved in the destage
processes. Therefore, we achieve a large performance boost on random-write workloads.
The IBM Subsystem Device Driver (SDD) multipathing software is provided with the DS8000 series at no additional charge. Fibre Channel (SCSI-FCP)
attachment configurations are supported in the IBM AIX®, Hewlett-Packard UNIX (HP-UX),
Linux, Microsoft Windows, and Oracle Solaris environments.
In addition, the DS8000 series supports the built-in multipath options of many distributed
operating systems.
A rank stands here for a formatted RAID array. A drive tier is a group of drives with similar
performance characteristics, for example, flash. Easy Tier determines the appropriate tier of
storage based on data access requirements. It then automatically and nondisruptively moves
data, at the subvolume or sublogical unit number (LUN) level, to the appropriate tier on the
DS8000 storage system.
Easy Tier automatic mode provides automatic cross-tier storage performance and storage
economics management for up to three tiers. It also provides automatic intra-tier performance
management (auto-rebalance) in multitier (hybrid) or single-tier (homogeneous) extent pools.
Easy Tier also provides a performance monitoring capability, even when the Easy Tier
auto-mode (auto-migration) is not turned on. Easy Tier uses the monitoring process to
determine what data to move and when to move it when using automatic mode. The usage of
thin provisioning (extent space-efficient (ESE) volumes) is also possible with Easy Tier.
You can enable monitoring independently from the auto-migration for information about the
behavior and benefits that can be expected if automatic mode is enabled. Data from the
monitoring process is included in a summary report that you can download to your Microsoft
Windows system. Use the DS8000 Storage Tier Advisor Tool (STAT) application to process
the data, and then view the resulting report by pointing your browser to the file that STAT
generates.
To download the STAT tool, check the respective DS8000 download section of IBM Fix
Central, found at:
http://www.ibm.com/support/fixcentral/
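The Easy Tier summary data that STAT processes is offloaded from the storage system first. The following line is a minimal sketch, assuming the DSCLI offloadfile command with the -etdata parameter; the target directory is an arbitrary example:
dscli> offloadfile -etdata /tmp/etdata
The offloaded files can then be fed to the STAT tool, which generates an HTML report that you open in a browser.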
Prerequisites
To enable Easy Tier automatic mode, you must meet the following requirements:
Easy Tier automatic monitoring (monitor mode) is set to either All or Auto Mode.
For Easy Tier to manage pools, the Easy Tier Auto Mode setting must be set to either
Tiered or All (in the DS GUI, click Settings → System → Easy Tier).
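These settings can also be changed through the DSCLI. The following lines are a sketch only, assuming that the chsi command accepts the -etmonitor and -etautomode parameters and reusing the storage image ID that appears in the examples later in this chapter:
dscli> chsi -etmonitor all -etautomode all IBM.2107-75TV181
dscli> showsi IBM.2107-75TV181
The showsi output can then be checked to verify the resulting Easy Tier monitor and automatic mode settings.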
Use the automatic mode of Easy Tier to relocate your extents to their most appropriate
storage tier in a hybrid pool, based on usage. Because workloads typically concentrate I/O
operations (data access) on only a subset of the extents within a volume or LUN, automatic
mode identifies the subset of your frequently accessed extents. It then relocates them to the
higher-performance storage tier.
Using automatic mode, you can use high-performance storage tiers at a smaller cost. You
invest only a small portion of the storage capacity in the high-performance storage tier. You can use
automatic mode for relocation and tuning without intervention. Automatic mode can help
generate cost-savings while optimizing your storage performance.
Easy Tier automatically relocates that data to an appropriate storage device in an extent pool
that is managed by Easy Tier. It uses an algorithm to assign heat values to each extent in a
storage device. These heat values determine which tier is best for the data, and migration
takes place automatically. Data movement is dynamic and transparent to the host server and
to applications that use the data.
By default, automatic mode is enabled (through the DSCLI and DS Storage Manager) for
heterogeneous pools. You can temporarily disable automatic mode or, as a special option,
reset or pause the Easy Tier “learning” logic.
In any tier, placing highly active (hot) data on the same physical rank can cause the hot rank
or the associated DA to become a performance bottleneck. Likewise, over time, skew can
appear within a single tier that cannot be addressed by migrating data to a faster tier alone. It
requires some degree of workload rebalancing within the same tier. Auto-rebalance
addresses these issues within a tier in both hybrid (multitier) and homogeneous (single-tier)
pools. It also helps the system to respond in a more timely and appropriate manner to
overload situations, skew, and any under-utilization. These conditions can occur for the
following reasons:
Addition or removal of hardware
Migration of extents between tiers
Merger of extent pools
Changes in the underlying volume configuration
Variations in the workload
If you set the Easy Tier automatic mode control to manage All Pools, Easy Tier also manages
homogeneous extent pools with only a single tier and performs intra-tier performance
balancing. If Easy Tier is turned off, no volumes are managed. If Easy Tier is turned on, it
manages all supported volumes, either standard (thick) or ESE (thin) volumes. For DS8870
and earlier models, the Track space-efficient (TSE) volumes that were offered by these
models are not supported by Easy Tier.
Warm demotion
To avoid overloading higher-performance tiers in hybrid extent pools, and thus potentially
degrading overall pool performance, Easy Tier automatic mode monitors performance of the
ranks. It can trigger the move of selected extents from the higher-performance tier to the
lower-performance tier based on either predefined bandwidth or IOPS overload thresholds.
The Nearline tier drives perform almost as well as SSDs and Enterprise hard disk drives
(HDDs) for sequential (high-bandwidth) operations.
This automatic operation is rank-based, and the target rank is randomly selected from the
lower tier. Warm demotion has the highest priority because it quickly relieves overloaded ranks.
So, Easy Tier continuously ensures that the higher-performance tier does not suffer from
saturation or overload conditions that might affect the overall performance in the extent pool.
Auto-rebalancing movement takes place within the same tier. Warm demotion takes place
across more than one tier. Auto-rebalance can be initiated when the rank configuration
changes or when the workload is not balanced across the ranks of the same tier. Warm demotion is
initiated when an overloaded rank is detected.
Cold demotion occurs when Easy Tier detects any of the following scenarios:
Segments in a storage pool become inactive over time, while other data remains active.
This scenario is the most typical use for cold demotion, where inactive data is demoted to
the Nearline tier. This action frees up segments on the Enterprise tier before the segments
on the Nearline tier become hot, which helps the system to be more responsive to new,
hot data.
In addition to cold demote, which uses the capacity in the lowest tier, segments with
moderate bandwidth but low random IOPS requirements are selected for demotion to the
lower tier in an active storage pool. This demotion better uses the bandwidth in the
Nearline tier (expanded cold demote).
If all the segments in a storage pool become inactive simultaneously because of either a
planned or an unplanned outage, cold demotion is disabled. Disabling cold demotion assists
the user in scheduling extended outages or when experiencing outages without changing the
data placement.
Figure 1-1 illustrates all of the migration types that are supported by the Easy Tier
enhancements in a three-tier configuration. The auto-rebalance might also include additional
swap operations.
Figure 1-1 Easy Tier migration types in a three-tier configuration: promote and swap operations move extents toward the highest-performance (SSD) tier, warm demote moves extents from a higher-performance tier to a lower-performance tier, cold demote and expanded cold demote move extents to the lower-performance (Nearline HDD) tier, and auto-rebalance redistributes extents among the ranks (1 - n, or 1 - m) within each tier (SSD, Enterprise HDD, and Nearline HDD).
In Easy Tier manual mode, you can dynamically relocate a logical volume between extent
pools or within an extent pool to change the extent allocation method of the volume or to
redistribute the volume across new ranks. This capability is referred to as dynamic volume
relocation. You can also merge two existing pools into one pool without affecting the data on
the logical volumes that are associated with the extent pools. In an older installation with
many pools, you can introduce the automatic mode of Easy Tier with automatic inter-tier and
intra-tier storage performance and storage economics management in multi-rank extent pools
with one or more tiers. Easy Tier manual mode also provides a rank depopulation option to
remove a rank from an extent pool with all the allocated extents on this rank automatically
moved to the other ranks in the pool.
The enhanced functions of Easy Tier manual mode provide additional capabilities. You can
use manual mode to relocate entire volumes from one pool to another pool. Upgrading to a
new disk drive technology, rearranging the storage space, or changing storage distribution for
a workload are typical operations that you can perform with volume relocations.
You can more easily manage configurations that deploy separate extent pools with different
storage tiers or performance characteristics. The storage administrator can easily and
dynamically move volumes to the appropriate extent pool. Therefore, the storage
administrator can meet storage performance or storage economics requirements for these
volumes transparently to the host and the application. Use manual mode to achieve these
operations and increase the options to manage your storage.
Volume migration
You can select which logical volumes to migrate, based on performance considerations or
storage management concerns:
Migrate volumes from one extent pool to another. You might want to migrate volumes to a
different extent pool with more suitable performance characteristics, such as different disk
drives or RAID ranks. Also, as different RAID configurations or drive technologies become
available, you might want to move a logical volume to a different extent pool with different
characteristics. You might also want to redistribute the available disk capacity between
extent pools.
Change the extent allocation method that is assigned to a volume, like restriping it again
across the ranks. (This is meaningful only for those few installations and pools that are not
managed by an Easy Tier auto-rebalancing.)
The impact that is associated with volume migration is comparable to an IBM FlashCopy®
operation that runs a background copy.
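As a hedged DSCLI sketch of such a volume relocation (assuming the managefbvol command with the migstart action; the volume IDs and target extent pool are hypothetical), a migration might be started and checked like this:
dscli> managefbvol -action migstart -extpool P1 0100-0103
dscli> showfbvol 0100
The showfbvol output can be used to check the state of the volume, including whether a migration is still in progress; CKD volumes are handled with the corresponding manageckdvol and showckdvol commands.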
For more information about the STAT features, see 6.5, “Storage Tier Advisor Tool” on
page 213.
The DS8000 storage system prioritizes access to system resources to achieve the wanted
QoS for the volume based on defined performance goals of high, standard, or low. I/O Priority
Manager constantly monitors and balances system resources to help applications meet their
performance targets automatically, without operator intervention.
It is increasingly common to use one storage system, and often fairly large pools, to serve
many categories of workloads with different characteristics and requirements. The
widespread use of virtualization and the advent of cloud computing that facilitates
consolidating applications into a shared storage infrastructure are common practices.
However, business-critical applications can suffer performance degradation because of
resource contention with less important applications. Workloads are forced to compete for
resources, such as disk storage capacity, bandwidth, DAs, and ranks.
The I/O Priority Manager maintains statistics for the set of logical volumes in each
performance group that can be queried. If management is performed for the performance
policy, the I/O Priority Manager controls the I/O operations of all managed performance
groups to achieve the goals of the associated performance policies. The performance group
of a volume defaults to PG0. Table 1-2 lists the performance groups that are predefined, with
their associated performance policies.
Table 1-2 DS8000 I/O Priority Manager: performance group to performance policy mapping (the table rows are not reproduced here; its columns are performance group, performance policy, priority, QoS target, ceiling (maximum delay factor, in %), and performance policy description).
Each performance group comes with a predefined priority and QoS target. Because
mainframe volumes and Open Systems volumes are on separate extent pools with different
rank sets, they do not interfere with each other, except in rare cases of overloaded DAs.
Open Systems can use several performance groups (such as PG1 - PG5) that share priority
and QoS characteristics. You can put applications into different groups for monitoring
purposes, without assigning different QoS priorities.
If the I/O Priority Manager detects resource overload conditions, such as resource contention
that leads to insufficient response times for higher-priority volumes, it throttles I/O for volumes
in lower-priority performance groups. This method allows the higher-performance group
applications to run faster and meet their QoS targets.
Important: Lower-priority I/O operations are delayed by I/O Priority Manager only if
contention exists on a resource that causes a deviation from normal I/O operation
performance. The I/O operations that are delayed are limited to operations that involve the
RAID arrays or DAs that experience contention.
Performance groups are assigned to a volume at the time of the volume creation. You can
assign performance groups to existing volumes by using the DS8000 command-line interface
(DSCLI) chfbvol and chckdvol commands. At any time, you can reassign volumes online to
other performance groups.
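A minimal sketch of such a reassignment follows, assuming that chfbvol and chckdvol accept a -perfgrp parameter; the volume ID ranges are hypothetical:
dscli> chfbvol -perfgrp pg1 1000-100F
dscli> chckdvol -perfgrp pg19 0800-080F
The first command moves a range of fixed-block volumes into performance group PG1, and the second moves a range of CKD volumes into PG19.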
Modes of operation
I/O Priority Manager can operate in the following modes:
Disabled: I/O Priority Manager does not monitor any resources and does not alter any I/O
response times.
Monitor: I/O Priority Manager monitors resources and updates statistics that are available
in performance data. This performance data can be accessed from the DSCLI. No I/O
response times are altered.
Manage: I/O Priority Manager monitors resources and updates statistics that are available
in performance data. I/O response times are altered on volumes that are in lower-QoS
performance groups if resource contention occurs.
In both monitor and manage modes, I/O Priority Manager can send Simple Network
Management Protocol (SNMP) traps to alert the user when certain resources detect a
saturation event.
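The operating mode is set at the storage image level. The following lines are a hedged sketch only, assuming that the chsi command accepts an -iopmmode parameter whose values match the modes that are described above:
dscli> chsi -iopmmode monitor IBM.2107-75TV181
dscli> chsi -iopmmode manage IBM.2107-75TV181
A reasonable approach is to start in monitor mode to collect statistics without altering any I/O response times, and to switch to manage mode only after reviewing the resulting reports.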
Example 1-1 Monitoring default performance group PG0 for one entire month in one-day intervals
dscli> lsperfgrprpt -start 32d -stop 1d -interval 1d pg0
time grp resrc avIO avMB avresp pri avQ tgtQ %hlpT %dlyT %impt
==============================================================================================================
2015-10-01/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 8617 489.376 0.554 0 43 0 0 0 0
2015-10-02/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 11204 564.409 2.627 0 37 0 0 0 0
2015-10-03/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 21737 871.813 5.562 0 27 0 0 0 0
2015-10-04/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 21469 865.803 5.633 0 32 0 0 0 0
2015-10-05/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 23189 1027.818 5.413 0 54 0 0 0 0
2015-10-06/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 20915 915.315 5.799 0 52 0 0 0 0
2015-10-07/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 18481 788.450 6.690 0 41 0 0 0 0
2015-10-08/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 19185 799.205 6.310 0 43 0 0 0 0
2015-10-09/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 19943 817.699 6.069 0 41 0 0 0 0
2015-10-10/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 20752 793.538 5.822 0 49 0 0 0 0
2015-10-11/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 23634 654.019 4.934 0 97 0 0 0 0
2015-10-12/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 23136 545.550 4.961 0 145 0 0 0 0
2015-10-13/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 19981 505.037 5.772 0 92 0 0 0 0
2015-10-14/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 5962 176.957 5.302 0 93 0 0 0 0
2015-10-15/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 2286 131.120 0.169 0 135 0 0 0 0
2015-10-16/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 2287 131.130 0.169 0 135 0 0 0 0
2015-10-17/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 2287 131.137 0.169 0 135 0 0 0 0
2015-10-18/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 10219 585.908 0.265 0 207 0 0 0 0
2015-10-19/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 22347 1281.260 0.162 0 490 0 0 0 0
2015-10-20/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 13327 764.097 0.146 0 507 0 0 0 0
2015-10-21/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 9353 536.250 0.151 0 458 0 0 0 0
2015-10-22/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 9944 570.158 0.127 0 495 0 0 0 0
2015-10-23/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 11753 673.871 0.147 0 421 0 0 0 0
2015-10-24/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 11525 660.757 0.140 0 363 0 0 0 0
2015-10-25/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 5022 288.004 0.213 0 136 0 0 0 0
2015-10-26/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 5550 318.230 0.092 0 155 0 0 0 0
2015-10-27/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 8732 461.987 0.313 0 148 0 0 0 0
2015-10-28/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 13404 613.771 1.434 0 64 0 0 0 0
2015-10-29/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 25797 689.529 0.926 0 51 0 0 0 0
2015-10-30/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 25560 725.174 1.039 0 49 0 0 0 0
2015-10-31/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 25786 725.305 1.013 0 49 0 0 0 0
A client can create regular performance reports and store them easily, for example, on a
weekly or monthly basis. Example 1-1 is a report created on the first day of a month for the
previous month. This report shows a long-term overview of the IOPS, the throughput in MBps,
and the average response times.
If all volumes are in their default performance group, which is PG0, the report that is shown in
Example 1-1 is possible. However, when you want to start the I/O Priority Manager QoS
management and throttling of the lower-priority volumes, you must put the volumes into
non-default performance groups. For this example, we use PG1 - PG15 for Open Systems
and PG19 - PG31 for CKD volumes.
Monitoring is then possible on the performance-group level, as shown in Example 1-2, on the
RAID-rank level, as in Example 1-3 on page 21, or the DA pair level, as in Example 1-4 on
page 21.
Example 1-2 Showing reports for a certain performance group PG28 for a certain time frame
dscli> lsperfgrprpt -start 3h -stop 2h pg28
time grp resrc avIO avMB avresp pri avQ tgtQ %hlpT %dlyT %impt
============================================================================================================
2015-11-01/14:10:00 IBM.2107-75TV181/PG28 IBM.2107-75TV181 204 11.719 1.375 9 17 5 0 0 0
The lsperfrescrpt command displays performance statistics for individual ranks, as shown
in Example 1-3:
The first three columns show the average number of IOPS, throughput, and average
response time, in milliseconds, for all I/Os on that rank.
The %Hutl column shows the percentage of time that the rank utilization was high enough
(over 33%) to warrant workload control.
The %hlpT column shows the average percentage of time that I/Os were helped on this
rank for all performance groups. This column shows the percentage of time where lower
priority I/Os were delayed to help higher priority I/Os.
The %dlyT column specifies the average percentage of time that I/Os were delayed for all
performance groups on this rank.
The %impt column specifies, on average, the length of the delay.
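The lsperfrescrpt command accepts the same time-window options that are shown for lsperfgrprpt in Example 1-2. The following line is a minimal sketch with a hypothetical rank ID:
dscli> lsperfrescrpt -start 3h -stop 2h R22
This invocation reports the rank-level statistics (IOPS, throughput, response time, %Hutl, %hlpT, %dlyT, and %impt) for rank R22 over the selected one-hour window.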
The DS8000 storage system prioritizes access to system resources to achieve the QoS target
of a volume, based on its defined performance goal (high, medium, or low). I/O
Priority Manager constantly monitors and balances system resources to help applications
meet their performance targets automatically without operator intervention based on input
from the zWLM software.
You can run the I/O Priority Manager for z/OS the same way as for Open Systems. You can
control the performance results of the CKD volumes and assign them online into
higher-priority or lower-priority performance groups, if needed. The zWLM integration offers
an additional level of automation. The zWLM can control and fully manage the performance
behavior of all volumes.
With z/OS and zWLM software support, the user assigns application priorities through the
Workload Manager. z/OS then assigns an “importance” value to each I/O, based on the
zWLM inputs. Then, based on the prior history of I/O response times for I/Os with the same
importance value, and based on the zWLM expectations for this response time, z/OS assigns
an “achievement” value to each I/O.
The importance and achievement values for each I/O are compared. The I/O becomes
associated with a performance policy, independently of the performance group or policy of the
volume. When saturation or resource contention occurs, I/O is then managed according to the
preassigned zWLM performance policy.
Together, these features can help consolidate multiple workloads on a single DS8000 storage
system. They optimize overall performance through automated tiering in a simple and
cost-effective manner while sharing resources. The DS8000 storage system can help
address storage consolidation requirements, which in turn helps manage increasing amounts
of data with less effort and lower infrastructure costs.
Understanding the hardware components, the functions that are performed by each
component, and the technology can help you select the correct components to order and the
quantities of each component. However, do not focus too much on any one hardware
component. Instead, ensure that you balance components to work together effectively. The
ultimate criteria for storage server performance are response times and the total throughput.
Storage unit
A storage unit consists of a single DS8000 storage system, including all its expansion frames.
A storage unit can consist of several frames: one base frame and up to four expansion frames
for a DS8886 storage system, up to three expansion frames for a DS8870 storage system, or
up to two expansion frames for a DS8884 storage system. The storage unit ID is the DS8000
base frame serial number, ending in 0 (for example, 75-06570).
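As a hedged DSCLI sketch (the lssu and lssi commands exist in the DSCLI, although their output is not reproduced here), the storage unit and its storage image can be listed as follows:
dscli> lssu
dscli> lssi
The lssu command lists the storage unit with its serial number, and lssi lists the corresponding storage image, whose ID is derived from the same serial number.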
Processor complex
A DS8880 processor complex is one POWER8 symmetric multiprocessor (SMP) system unit.
A DS8886 storage system uses a pair of IBM Power System S824 servers as processor
complexes, and a DS8884 storage system uses a pair of IBM Power System S822 servers as
processor complexes. In comparison, a DS8870 storage system uses an IBM Power 740
server pair (initially running on POWER7, later on POWER7+).
On all DS8000 models, the two processor complexes (servers) are housed in the base frame.
They form a redundant pair so that if either processor complex fails, the surviving one
continues to run the workload.
In a DS8000 storage system, a server is effectively the software that uses a processor logical
partition (a processor LPAR) and that has access to the memory and processor resources
that are available on a processor complex. The DS8880 models 980 and 981, and the
DS8870 model 961, are single-Storage Facility Image (SFI) models, so this one storage image
uses 100% of the resources.
For the internal naming for this example, we work with server 0 and server 1.
Figure 2-1 on page 27 shows an overview of the architecture of a DS8000 storage system.
You see the RAID adapters in the back end, which are the DAs that are used for the disk
drives and 2.5-inch solid-state drives (SSDs). The High-Performance Flash Enclosure
(HPFE) device carries out the RAID calculation internally on its own dedicated PowerPC chips,
so it does not need to connect through conventional DAs.
On all DS8000 models, each processor complex has its own system memory. Within each
processor complex, the system memory is divided into these parts:
Memory that is used for the DS8000 control program
Cache
Persistent cache
Cache processing improves the performance of the I/O operations that are done by the host
systems that attach to the DS8000 storage system. Cache size, the efficient internal
structure, and algorithms of the DS8000 storage system are factors that improve I/O
performance. The significance of this benefit is determined by the type of workload that is run.
Read operations
These operations occur when a host sends a read request to the DS8000 storage system:
A cache hit occurs if the requested data is in the cache. In this case, the I/O operation
does not disconnect from the channel/bus until the read is complete. A read hit provides
the highest performance.
A cache miss occurs if the data is not in the cache. The I/O operation logically disconnects
from the host. Other I/O operations occur over the same interface. A stage operation from
the disk back end occurs.
The data remains in the cache and persistent memory until it is destaged, at which point it is
flushed from cache. Destage operations of sequential write operations to RAID 5 arrays are
done in parallel mode, writing a stripe to all disks in the RAID set as a single operation. An
entire stripe of data is written across all the disks in the RAID array. The parity is generated
one time for all the data simultaneously and written to the parity disk. This approach reduces
the parity generation penalty that is associated with write operations to RAID 5 arrays. For
RAID 6, data is striped on a block level across a set of drives, similar to RAID 5
configurations. A second set of parity is calculated and written across all the drives. This
technique does not apply for the RAID 10 arrays because there is no parity generation that is
required. Therefore, no penalty is involved other than a double write when writing to RAID 10
arrays.
It is possible that the DS8000 storage system cannot copy write data to the persistent cache
because it is full, which can occur if all data in the persistent cache waits for destage to disk.
In this case, instead of a fast write hit, the DS8000 storage system sends a command to the
host to retry the write operation. Having full persistent cache is not a good situation because it
delays all write operations. On the DS8000 storage system, the amount of persistent cache is
sized according to the total amount of system memory. The amount of persistent cache is
designed so that the probability of full persistent cache occurring in normal processing is low.
IBM Storage Development in partnership with IBM Research developed these algorithms.
The DS8000 cache is organized in 4 KB pages called cache pages or slots. This unit of
allocation ensures that small I/Os do not waste cache memory.
The decision to copy an amount of data into the DS8000 cache can be triggered from two
policies: demand paging and prefetching.
Demand paging means that eight disk blocks (a 4 K cache page) are brought in only on a
cache miss. Demand paging is always active for all volumes and ensures that I/O patterns
with locality find at least some recently used data in the cache.
Prefetching means that data is copied into the cache even before it is requested. To
prefetch, a prediction of likely future data access is required. Because effective,
sophisticated prediction schemes need extensive history of the page accesses, the
algorithm uses prefetching only for sequential workloads. Sequential access patterns are
commonly found in video-on-demand, database scans, copy, backup, and recovery. The
goal of sequential prefetching is to detect sequential access and effectively preload the
cache with data to minimize cache misses.
For prefetching, the cache management uses tracks. A track is a set of 128 disk blocks
(16 cache pages). To detect a sequential access pattern, counters are maintained with every
track to record if a track is accessed together with its predecessor. Sequential prefetching
becomes active only when these counters suggest a sequential access pattern. In this
manner, the DS8000 storage system monitors application read patterns and dynamically
determines whether it is optimal to stage into cache:
Only the page requested
The page requested, plus remaining data on the disk track
An entire disk track or multiple disk tracks not yet requested
The decision of when and what to prefetch is made on a per-application basis (rather than a
system-wide basis) to be responsive to the data reference patterns of various applications
that can run concurrently.
With the z Systems integration of newer DS8000 codes, a host application, such as DB2, can
send cache hints to the storage system and manage the DS8000 prefetching, reducing the
number of I/O requests.
Figure 2-2 Cache list structure: a RANDOM list and a SEQ (sequential) list, each running from a Most Recently Used (MRU) head down to a Least Recently Used (LRU) bottom, with a desired size boundary maintained for the SEQ list.
In Figure 2-2, a page that is brought into the cache by simple demand paging is added to the
Most Recently Used (MRU) head of the RANDOM list. With no further references to that
page, it moves down to the Least Recently Used (LRU) bottom of the list. A page that is
brought into the cache by a sequential access or by sequential prefetching is added to the
MRU head of the sequential (SEQ) list. It moves down that list as more sequential reads are
done. Additional rules control the management of pages between the lists so that the same
pages are not kept in memory twice.
To follow workload changes, the algorithm trades cache space between the RANDOM and
SEQ lists dynamically. Trading cache space allows the algorithm to prevent one-time
sequential requests from filling the entire cache with blocks of data with a low probability of
being read again. The algorithm maintains a wanted size parameter for the SEQ list. The size
is continually adapted in response to the workload. Specifically, if the bottom portion of the
SEQ list is more valuable than the bottom portion of the RANDOM list, the wanted size of the
SEQ list is increased. Otherwise, the wanted size is decreased. The constant adaptation
strives to make optimal use of limited cache space and delivers greater throughput and faster
response times for a specific cache size.
Figure 2-3 on page 31 shows the improvement in response time that is provided by
Sequential Prefetching in Adaptive Replacement Cache (SARC). Because this algorithm is
permanently enabled, the comparison was measured on an older DS8000 storage system
(without flash).
The SEQ list is managed by the AMP technology, which was developed by IBM. AMP
introduces an autonomic, workload-responsive, and self-optimizing prefetching technology
that adapts both the amount of prefetch and the timing of prefetch on a per-application basis
to maximize the performance of the system. The AMP algorithm solves two problems that
plague most other prefetching algorithms:
Prefetch wastage occurs when prefetched data is evicted from the cache before it can be
used.
Cache pollution occurs when less useful data is prefetched instead of more useful data.
By choosing the prefetching parameters, AMP provides optimal sequential read performance
and maximizes the aggregate sequential read throughput of the system. The amount that is
prefetched for each stream is dynamically adapted according to the needs of the application
and the space that is available in the SEQ list. The timing of the prefetches is also
continuously adapted for each stream to avoid misses and, concurrently, to avoid any cache
pollution.
SARC and AMP play complementary roles. SARC carefully divides the cache between the
RANDOM and the SEQ lists to maximize the overall hit ratio. AMP manages the contents of
the SEQ list to maximize the throughput that is obtained for the sequential workloads. SARC
affects cases that involve both random and sequential workloads. AMP helps any workload
that has a sequential read component, including pure sequential read workloads.
The CLOCK algorithm uses temporal ordering. It keeps a circular list of pages in memory, with the “hand” pointing to the oldest page in the list. When a page must be inserted into the cache, the R (recency) bit at the “hand” location is inspected. If R is zero, the new page is put in place of the page to which the “hand” points, and its R bit is set to 1. Otherwise, the R bit is cleared to zero, the clock hand moves one step clockwise, and the process is repeated until a page is replaced.
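The following minimal Python sketch illustrates this replacement step as a generic textbook technique; it is not DS8000 code, and the data structures are assumptions made only for the illustration.

def clock_replace(pages, hand, new_page):
    """pages: circular list of [page_id, recency_bit]; hand: current index.
    Returns the new hand position after inserting new_page."""
    while True:
        if pages[hand][1] == 0:
            pages[hand] = [new_page, 1]      # replace the victim, mark it recently used
            return (hand + 1) % len(pages)
        pages[hand][1] = 0                   # clear the recency bit (second chance)
        hand = (hand + 1) % len(pages)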
The CSCAN algorithm uses spatial ordering. The CSCAN algorithm is the circular variation of
the SCAN algorithm. The SCAN algorithm tries to minimize the disk head movement when
servicing read and write requests. It maintains a sorted list of pending requests along with the
position on the drive of the request. Requests are processed in the current direction of the
disk head until it reaches the edge of the disk. At that point, the direction changes. In the
CSCAN algorithm, the requests are always served in the same direction. When the head
arrives at the outer edge of the disk, it returns to the beginning of the disk and services the
new requests in this direction only. This algorithm results in more equal performance for all
head positions.
The idea of IWC is to maintain a sorted list of write groups, as in the CSCAN algorithm. The smallest and the highest write groups are joined, forming a circular queue. The addition is to maintain a recency bit for each write group, as in the CLOCK algorithm. A write group is always inserted in its correct sorted position, and its recency bit is initially set to 0. When a write hit occurs, the recency bit is set to 1. A destage pointer scans the circular list looking for destage victims, and only write groups whose recency bit is 0 are destaged. Write groups with a recency bit of 1 are skipped, and their recency bit is reset to 0. This method gives an “extra life” to those write groups that were hit since the last time the destage pointer visited them.
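As an illustration only (not the DS8000 implementation), the following Python sketch combines the two ideas: a sorted circular list of write groups, as in CSCAN, plus a recency bit and a destage pointer that skips recently hit groups, as in CLOCK. The class and method names are assumptions.

import bisect

class IWCList:
    """Sorted circular list of write groups with recency bits (illustration only)."""

    def __init__(self):
        self.positions = []   # write-group positions, kept sorted (CSCAN ordering)
        self.recency = {}     # position -> recency bit (CLOCK idea)

    def write(self, pos):
        if pos in self.recency:
            self.recency[pos] = 1                 # write hit: grant an "extra life"
        else:
            bisect.insort(self.positions, pos)    # insert at the correct sorted place
            self.recency[pos] = 0

    def next_destage(self, pointer):
        """Scan the circular list; destage only groups whose recency bit is 0."""
        while self.positions:
            pointer %= len(self.positions)
            pos = self.positions[pointer]
            if self.recency[pos] == 0:
                self.positions.pop(pointer)
                del self.recency[pos]
                return pos, pointer               # destage this write group
            self.recency[pos] = 0                 # skip it once and reset the bit
            pointer += 1
        return None, 0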
In the DS8000 implementation, an IWC list is maintained for each rank. The dynamically
adapted size of each IWC list is based on the workload intensity on each rank. The rate of
destage is proportional to the portion of nonvolatile storage (NVS) occupied by an IWC list.
The NVS is shared across all ranks of a processor complex. Furthermore, destages are
smoothed out so that write bursts are not converted into destage bursts.
IWC has better or comparable peak throughput to the best of CSCAN and CLOCK across a
wide gamut of write-cache sizes and workload configurations. In addition, even at lower
throughputs, IWC has lower average response times than CSCAN and CLOCK. The
random-write parts of workload profiles benefit from the IWC algorithm. The costs for the
destages are minimized, and the number of possible write-miss IOPS greatly improves
compared to a system not using IWC.
The IWC algorithm can be applied to storage systems, servers, and their operating systems.
The DS8000 implementation is the first for a storage system. Because of IBM patents on this
algorithm and the other advanced cache algorithms, it is unlikely that a competitive system
uses them.
For SAN Volume Controller attachments, consider the SAN Volume Controller node cache in
this calculation, which might lead to a slightly smaller DS8000 cache. However, most
installations come with a minimum of 128 GB of DS8000 cache. Using flash in the DS8000
storage system does not typically change the prior values. HPFE cards or SSDs are
beneficial with cache-unfriendly workload profiles because they reduce the cost of cache
misses.
Most storage servers support a mix of workloads. These general rules can work well, but
many times they do not. Use them like any general rule, but only if you have no other
information on which to base your selection.
When coming from an existing disk storage server environment and you intend to consolidate
this environment into DS8000 storage systems, follow these recommendations:
Choose a cache size for the DS8000 series that has a similar ratio between cache size
and disk storage to that of the configuration that you use.
When you consolidate multiple disk storage servers, configure the sum of all cache from
the source disk storage servers for the target DS8000 processor memory or cache size.
For example, consider replacing four DS8800 storage systems, each with 65 TB and 128 GB
cache, with a single DS8886 storage system. The ratio between cache size and disk storage
for each DS8800 storage system is 0.2% (128 GB/65 TB). The new DS8886 storage system
is configured with 300 TB to consolidate the four 65 TB DS8800 storage systems, plus
provide some capacity for growth. This DS8886 storage system can be configured with 512 GB of cache to roughly preserve the original cache-to-disk storage ratio. If the requirements fall between available memory sizes, or if in doubt, round up to the next available memory size. When a SAN Volume Controller is used in front of the DS8880, rounding down the cache can be appropriate in some cases.
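The arithmetic behind this example can be expressed as a short calculation. The values below repeat the figures from the example, and the result is only a rule-of-thumb starting point, not a sizing recommendation.

source_cache_gb = 4 * 128        # four DS8800 systems with 128 GB cache each
source_capacity_tb = 4 * 65      # 65 TB per source system
ratio = source_cache_gb / (source_capacity_tb * 1000)   # roughly 0.2%

target_capacity_tb = 300         # consolidated DS8886 capacity
suggested_cache_gb = ratio * target_capacity_tb * 1000  # roughly 590 GB

print(f"ratio = {ratio:.2%}, suggested cache = {suggested_cache_gb:.0f} GB")
# 512 GB is the standard memory size that is chosen in this example.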
The cache size is not an isolated factor when estimating the overall DS8000 performance.
Consider it with the DS8000 model, the capacity and speed of the disk drives, and the
number and type of HAs. Larger cache sizes mean that more reads are satisfied from the
cache, which reduces the load on DAs and the disk drive back end that is associated with
reading data from disk. To see the effects of different amounts of cache on the performance of
the DS8000 storage system, run a Disk Magic model, which is described in 6.1, “Disk Magic”
on page 160.
I/O enclosures
The DS8880 base frame and first expansion frame (if installed) both contain I/O enclosures.
I/O enclosures are installed in pairs. There can be one or two I/O enclosure pairs installed in
the base frame, depending on the model. Further I/O enclosures are installed in the first
expansion frame. Each I/O enclosure has six slots for adapters: DAs and HAs are installed in
the I/O enclosures. The I/O enclosures provide connectivity between the processor
complexes and the HAs or DAs. The DS8880 can have up to two DAs and four HAs installed
in an I/O enclosure.
Depending on the number of installed disks and the number of required host connections,
some of the I/O enclosures might not contain any adapters.
With the DAs, you work with DA pairs because DAs are always installed in quantities of two
(one DA is attached to each processor complex). The members of a DA pair are split across
two I/O enclosures for redundancy. The number of installed disk devices determines the
number of required DAs. In any I/O enclosure, the number of individual DAs installed can be
zero, one, or two.
PCIe infrastructure
The DS8880 processor complexes use a x8 PCI Express (PCIe) Gen-3 infrastructure to
access the I/O enclosures. This infrastructure greatly improves performance over previous
DS8000 models. There is no longer any GX++ or GX+ bus in the DS8880 storage system, as
there was in earlier models.
DAs are installed in pairs because each processor complex requires its own DA to connect to
each disk enclosure for redundancy. DAs in a pair are installed in separate I/O enclosures to
eliminate the I/O enclosure as a single point of failure.
Each DA performs the RAID logic and frees the processors from this task. The throughput
and performance of a DA is determined by the port speed and hardware that are used, and
also by the firmware efficiency.
Gigapack Enclosures
Each Gigapack enclosure contains redundant AC/DC power supplies, 24 SAS drive slots, and two interface cards. Each interface card provides external SFP-connected 8 Gbps FC ports, an ASIC, a processor, SRAM, and flash, and connects to the drives through 6 Gbps SAS. The DAs attach to the enclosures through the external SFP ports.
The ASICs provide the FC-to-SAS bridging function from the external SFP connectors to
each of the ports on the SAS disk drives. The processor is the controlling element in the
system.
These switches use the FC-AL protocol and attach to the SAS drives (bridging to SAS
protocol) through a point-to-point connection. The arbitration message of a drive is captured
in the switch, processed, and propagated back to the drive, without routing it through all the
other drives in the loop.
Performance is enhanced because both DAs connect to the switched FC subsystem back
end, as shown in Figure 2-6 on page 37. Each DA port can concurrently send and receive
data.
Figure 2-6 High availability and increased bandwidth connect both DAs to two logical loops
The DS8880 storage system supports two types of high-density storage enclosure: the
2.5-inch SFF enclosure and the 3.5-inch large form factor (LFF) enclosure. The high-density
and lower-cost LFF storage enclosure accepts 3.5-inch drives, offering 12 drive slots. The
SFF enclosure offers twenty-four 2.5-inch drive slots. The front of the LFF enclosure differs
from the front of the SFF enclosure, with its 12 drives slotting horizontally rather than
vertically.
All drives within a disk enclosure pair must be the same capacity and rotation speed. A disk
enclosure pair that contains fewer than 48 DDMs must also contain dummy carriers called
fillers. These fillers are used to maintain airflow.
When ordering SSDs in increments of 16 (Driveset), you can create a balanced configuration
across the two processor complexes, especially with the high-performance capabilities of
SSDs. In general, an uneven number of ranks of similar type drives (especially SSDs) can
cause an imbalance in resources, such as cache or processor usage. Use a balanced
configuration with an even number of ranks and extent pools, for example, one even and one
odd extent pool, and each with one flash rank. This balanced configuration enables Easy Tier
automatic cross-tier performance optimization on both processor complexes. It distributes the
overall workload evenly across all resources.
For Nearline HDDs (especially with only three array sites per 3.5-inch disk enclosure pair, in contrast to six array sites in a 2.5-inch disk enclosure pair), an uneven number of ranks is less critical for load balancing than it is with flash, because Nearline HDDs have lower performance characteristics.
By putting half of each array on one loop and half of each array on another loop, there are
more data paths into each array. This design provides a performance benefit, particularly in
situations where a large amount of I/O goes to one array, such as sequential processing and
array rebuilds.
2.4.4 DDMs
At the time of writing this book, the DS8880 provides a choice of the following DDM types:
300 and 600 GB, 15 K RPM SAS disk, 2.5-inch SFF
600 and 1200 GB, 10K RPM SAS disk, 2.5-inch SFF
4 TB, 7,200 RPM Nearline-SAS disk, 3.5-inch LFF
200/400/800/1600 GB e-MLC SAS SSDs (enterprise-grade Multi-Level Cell Solid-State
Drives), 2.5-inch SFF
400 GB HPFE cards, 1.8-inch
All drives are encryption capable (Full Disk Encryption, FDE). Whether encryption is enabled
has no influence on their performance. Additional drive types are constantly in evaluation and
added to this list when available.
These disks provide a range of options to meet the capacity and performance requirements of
various workloads, and to introduce automated multitiering.
Another difference between these drive types is the RAID rebuild time after a drive failure.
This rebuild time grows with larger capacity drives. Therefore, RAID 6 is used for the
large-capacity Nearline drives to prevent a second disk failure during rebuild from causing a
loss of data, as described in 3.1.2, “RAID 6 overview” on page 49. RAID 6 delivers far fewer IOPS per array for OLTP workloads than RAID 5. The lower RPM speed, the reduced spindle count (because of their large capacity), and the RAID type together make Nearline drives comparatively slower. So, they are typically used as the slow tier in hybrid pools.
A flash rank can potentially deliver tens of thousands of IOPS if the underlying RAID architecture supports that many IOPS, at sub-millisecond response times even for cache misses. Flash is a mature technology and can be used in critical production environments. Its high performance for small-block/random workloads makes flash financially viable as part of a hybrid HDD/flash pool mix. Over-provisioning techniques are used for failing cells, and data in
worn-out cells is copied proactively. There are several algorithms for wear-leveling across the
cells. The algorithms include allocating a rarely used block for the next block to use or moving
data internally to less-used cells to enhance lifetime. Error detection and correction
mechanisms are used, and bad blocks are flagged.
Figure 2-7 DS8884 model - DA-to-disk mapping with all three possible frames shown
The DS8886 981 model can hold up to 1536 SFF drives with all expansion frames. Figure 2-8
on page 41 shows a DS8886 model that is fully equipped with 1536 SFF SAS drives and 240
HPFE cards.
Figure 2-8 DS8886 model - DA-to-disk mapping with all five possible frames shown
Up to eight DA pairs (16 individual DAs) are possible in a maximum DS8886 configuration.
With more than 384 disks, each DA pair handles more than 48 DDMs (six ranks). For usual
workload profiles, this number is not a problem. It is rare to overload DAs. The performance of
many storage systems is determined by the number of disk drives if there are no HA
bottlenecks.
The card itself is PCIe Gen 2, but it operates in a PCIe Gen3 slot. The card is driven by a new
high-function, high-performance application-specific integrated circuit (ASIC). To ensure
maximum data integrity, it supports metadata creation and checking. Each FC port supports a
maximum of 509 host login IDs and 1280 paths so that you can create large storage area
networks (SANs).
The front end with the 16 Gbps ports scales up to 128 ports for a DS8886, which results in a
theoretical aggregated host I/O bandwidth of 128 × 16 Gbps. The HBAs use PowerPC chips
(quad-core Freescale for 16 Gbps HAs, dual-core Freescale for 8 Gbps HAs).
The 16 Gbps adapter ports can negotiate to 16, 8, or 4 Gbps. The 8 Gbps adapter ports can
each negotiate to 8, 4, or 2 Gbps speeds. For slower attachments, use a switch in between.
Only the 8 Gbps adapter ports can be used in FC-AL protocol, that is, without a switch.
The 8-port (8 Gbps) HA offers essentially the same total maximum throughput when taking
loads of all its ports together as the 4-port 8 Gbps HA. Therefore, the 8-port HAs are meant
for more attachment options, but not for more performance.
Automatic Port Queues is a mechanism that the DS8000 storage system uses to self-adjust
the queue based on the workload. The Automatic Port Queues mechanism allows higher port
queue oversubscription while maintaining a fair share for the servers and the accessed LUNs.
When its queue fills, a port goes into SCSI Queue Full mode, where it accepts no additional commands, to slow down the I/Os. By avoiding error recovery and the 30-second blocking
SCSI Queue Full recovery interval, the overall performance is better with Automatic Port
Queues.
Host connections frequently go through various external connections between the server and
the DS8000 storage system. Therefore, you need enough host connections for each server
so that if half of the connections fail, processing can continue at the level before the failure.
This availability-oriented approach requires that each connection carry only half the data
traffic that it otherwise might carry. These multiple lightly loaded connections also help to
minimize the instances when spikes in activity might cause bottlenecks at the HA or port. A
multiple-path environment requires at least two connections. Four connections are typical,
and eight connections are not unusual. Typically, these connections are spread across as many I/O enclosures in the DS8880 storage system as are equipped with HAs.
Usually, SAN directors or switches are used. Use two separate switches to avoid a single
point of failure.
In a z Systems environment, you must select a SAN switch or director that also supports
FICON. An availability-oriented approach applies to the z Systems environments similar to
the Open Systems approach. Plan enough host connections for each server so that if half of
the connections fail, processing can continue at the level before the failure.
For more information, see 4.10.1, “I/O port planning considerations” on page 131.
Your IBM representative or IBM Business Partner has access to these and additional white
papers and can provide them to you.
For a list of drive combinations and RAID configurations, see 8.5.2, “Disk capacity,” in IBM
DS8880 Architecture and Implementation (Release 8), SG24-8323.
Performance of the RAID 5 array returns to normal when the data reconstruction onto the
spare device completes. The time that is taken for sparing can vary, depending on the size of
the failed DDM and the workload on the array, the switched network, and the DA. The use of
arrays across loops (AAL) both speeds up rebuild time and decreases the impact of a rebuild.
Smart Rebuild, an IBM technique that improves on classical RAID 5, further reduces the risk of a second drive failure for RAID 5 ranks during rebuild by detecting a failing drive early and copying the drive data to the spare drive in advance. If a drive in a RAID 5 array is predicted to fail, a “rebuild” is initiated by copying the data off that failing drive to the spare drive before it fails, decreasing the overall rebuild time. If the drive fails during the copy operation, the rebuild continues from the parity information like a regular rebuild.
RAID 6 is preferably used in combination with large-capacity disk drives, for example, 4 TB Nearline drives, because of their longer rebuild times and the increased risk of an additional medium error occurring during the rebuild of the failed drive. In many environments, RAID 6 is also considered for Enterprise drives with capacities above 1 TB in cases where reliability is favored over performance and the trade-off in performance versus reliability is acceptable.
However, with the Smart Rebuild capability, the risk of a second drive failure for RAID 5 ranks
is also further reduced.
Comparing RAID 6 to RAID 5 performance provides about the same results on reads. For
random writes, the throughput of a RAID 6 array is only two thirds of a RAID 5 array because
of the additional parity handling. Workload planning is important before implementing RAID 6,
specifically for write-intensive applications, including Copy Services targets, for which RAID 6 is not recommended. Yet, when properly sized for the I/O demand, RAID 6 is a
considerable reliability enhancement.
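The two-thirds figure follows from the small-write penalty of each RAID level. Assuming the common penalties of four back-end I/Os per random write for RAID 5 and six for RAID 6 (an assumption for illustration, not a measured DS8000 value), a quick calculation looks as follows.

def random_write_iops(backend_iops, write_penalty):
    """Host random-write IOPS that the back end can absorb for a given penalty."""
    return backend_iops / write_penalty

raid5 = random_write_iops(10000, 4)   # read data, read parity, write data, write parity
raid6 = random_write_iops(10000, 6)   # two parity reads and two parity writes plus the data I/Os
print(f"RAID 6 / RAID 5 random-write ratio: {raid6 / raid5:.2f}")   # about 0.67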
During the rebuild of the data on the new drive, the DA can still handle I/O requests of the
connected hosts to the affected array. A performance degradation can occur during the
reconstruction because DAs and back-end resources are involved in the rebuild. Additionally,
any read requests for data on the failed drive require data to be read from the other drives in
the array, and then the DA reconstructs the data. Any later failure during the reconstruction
within the same array (second drive failure, second coincident medium errors, or a drive
failure and a medium error) can be recovered without loss of data.
Performance of the RAID 6 array returns to normal when the data reconstruction, on the
spare device, completes. The rebuild time varies, depending on the size of the failed DDM
and the workload on the array and the DA. The completion time is comparable to a RAID 5
rebuild, but slower than rebuilding a RAID 10 array in a single drive failure.
RAID 10 is not as commonly used as RAID 5 mainly because more raw disk capacity is
required for every gigabyte of effective capacity. Typically, RAID 10 is used for workloads with
a high random-write ratio.
Write operations are not affected. Performance of the RAID 10 array returns to normal when
the data reconstruction, onto the spare device, completes. The time that is taken for sparing
can vary, depending on the size of the failed DDM and the workload on the array and the DA.
In relation to RAID 5, RAID 10 sparing completion time is a little faster. Rebuilding a RAID 5
6+P configuration requires six reads plus one parity operation for each write. A RAID 10 3+3
configuration requires one read and one write (a direct copy).
Each enclosure places two Fibre Channel (FC) switches onto each loop. SAS DDMs are
purchased in groups of 16 (drive set). Half of the new DDMs go into one disk enclosure, and
half of the new DDMs go into the other disk enclosure of the pair. The same setup applies to SSD and NL-SAS drives; for the latter, a half drive set purchase option with only eight DDMs is also available.
An array site consists of eight DDMs. Four DDMs are taken from one enclosure in the disk
enclosure pair, and four are taken from the other enclosure in the pair. Therefore, when a
RAID array is created on the array site, half of the array is on each disk enclosure.
One disk enclosure of the pair is on one FC switched loop, and the other disk enclosure of the
pair is on a second switched loop. The array is split across two loops, so the term array across loops (AAL) is used.
AAL is used to increase performance. When the DA writes a stripe of data to a RAID 5 array,
it sends half of the write to each switched loop. By splitting the workload in this manner, each
loop is worked evenly. This setup aggregates the bandwidth of the two loops and improves
performance. If RAID 10 is used, two RAID 0 arrays are created. Each loop hosts one RAID 0
array. When servicing read I/O, half of the reads can be sent to each loop, again improving
performance by balancing workload across loops.
A minimum of one spare is created for each array site that is assigned to an array until the
following conditions are met:
Minimum of four spares per DA pair
Minimum of four spares for the largest capacity array site on the DA pair
Minimum of two spares of capacity and speed (RPM) greater than or equal to the fastest
array site of any capacity on the DA pair.
For HPFE cards, because they are not using DA pairs, the first two arrays on each HPFE
come with spares (6+P+S), and the remaining two arrays come without extra spares (6+P)
because they share the spares with the first array pair.
Floating spares
The DS8000 storage system implements a smart floating technique for spare DDMs. A floating spare works as follows: when a DDM fails and the data that it contains is rebuilt onto a spare, the replacement disk becomes the new spare when the failed disk is replaced. The data is not migrated back to another DDM, such as a DDM in the original position that the failed DDM occupied.
The DS8000 Licensed Internal Code takes this idea one step further. It might allow the hot spare to remain where it was moved, or it can instead migrate the spare to a more optimal position. This migration can better balance the spares across the DA pairs, the
loops, and the disk enclosures. It might be preferable that a DDM that is in use as an array
member is converted to a spare. In this case, the data on that DDM is migrated in the
background on to an existing spare. This process does not fail the disk that is being migrated,
although it reduces the number of available spares in the DS8000 storage system until the
migration process is complete.
The DS8000 storage system uses this smart floating technique so that the larger or faster
(higher RPM) DDMs are allocated as spares. Allocating the larger or faster DDMs as spares
ensures that a spare can provide at least the same capacity and performance as the replaced
drive. If you rebuild the contents of a 600 GB DDM onto a 1200 GB DDM, half of the 1200 GB
DDM is wasted because that space is not needed. When the failed 600 GB DDM is replaced
with a new 600 GB DDM, the DS8000 Licensed Internal Code most likely migrates the data
back onto the recently replaced 600 GB DDM. When this process completes, the 600 GB
DDM rejoins the array and the 1200 GB DDM becomes the spare again.
Another example is if you fail a 300 GB 15 K RPM DDM onto a 600 GB 10 K RPM DDM. The
data is now moved to a slower DDM and wastes significant space. The array has a mix of
RPMs, which is not desirable. When the failed disk is replaced, the replacement is the same
type as the failed 15 K RPM disk. Again, a smart migration of the data is performed after
suitable spares are available.
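The following Python sketch only illustrates the preference described above for keeping the largest and fastest DDMs as spares; it is not the Licensed Internal Code logic, and the data layout and selection count are assumptions made for the example.

def pick_preferred_spares(ddms, count):
    """ddms: list of dicts such as {"id": "d1", "cap_gb": 600, "rpm": 15000}.
    Prefer the largest capacity, then the highest RPM, as spares."""
    ranked = sorted(ddms, key=lambda d: (d["cap_gb"], d["rpm"]), reverse=True)
    return [d["id"] for d in ranked[:count]]

ddms = [{"id": "d1", "cap_gb": 600, "rpm": 10000},
        {"id": "d2", "cap_gb": 1200, "rpm": 10000},
        {"id": "d3", "cap_gb": 300, "rpm": 15000}]
print(pick_preferred_spares(ddms, count=2))   # ['d2', 'd1']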
Hot-pluggable DDMs
Replacement of a failed drive does not affect the operation of the DS8000 storage system
because the drives are fully hot-pluggable. Each disk plugs into a switch, so there is no loop
break associated with the removal or replacement of a disk. In addition, there is no potentially
disruptive loop initialization process.
Important: In general, when a drive fails either in RAID 10, 5, or 6, it might cause a
minimal degradation of performance on DA and switched network resources during the
sparing process. The DS8880 architecture and features minimize the effect of this
behavior.
By using the DSCLI, you can check for any failed drives by running the lsddm -state not_normal command. See Example 3-1.
Virtualization, in this context, is the abstraction from the physical disk drives to a logical volume that is presented to hosts and servers in such a way that they see it as though it were a physical disk. It is the process of preparing physical disk drives (DDMs) to become entities that can be used by an operating system, which means the creation of logical unit numbers (LUNs).
The DDMs are mounted in disk enclosures and connected in a switched FC topology by using
a Fibre Channel Arbitrated Loop (FC-AL) protocol. The DS8880 small form factor disks are
mounted in 24 DDM enclosures (mounted vertically), and the Nearline drives come in
12-DDM slot LFF enclosures (mounted horizontally).
The disk drives can be accessed by a pair of DAs. Each DA has four paths to the disk drives.
One device interface from each DA connects to a set of FC-AL devices so that either DA can
access any disk drive through two independent switched fabrics (the DAs and switches are
redundant).
Each DA has four ports, and because DAs operate in pairs, there are eight ports or paths to
the disk drives. All eight paths can operate concurrently and can access all disk drives on the
attached fabric. However, in normal operation disk drives are typically accessed by one DA.
Which DA owns the disk is defined during the logical configuration process to avoid any
contention between the two DAs for access to the disks.
Figure 3-1 Array site
As you can see in Figure 3-1, array sites span loops. Four DDMs are taken from loop 1 and
another four DDMs from loop 2. Array sites are the building blocks that are used to define
arrays.
3.2.2 Arrays
An array, also called managed array, is created from one array site. Forming an array means
defining it as a specific RAID type. The supported RAID types are RAID 5, RAID 6, and
RAID 10 (see 3.1, “RAID levels and spares” on page 48). For each array site, you can select a
RAID type. The process of selecting the RAID type for an array is also called defining an
array.
Important: In the DS8000 implementation, one managed array is defined by using one
array site.
Figure 3-2 on page 55 shows the creation of a RAID 5 array with one spare, which is also
called a 6+P+S array (capacity of six DDMs for data, capacity of one DDM for parity, and a
spare drive). According to the RAID 5 rules, parity is distributed across all seven drives in this
example.
Figure 3-2 Creation of an array: an array site formed into a RAID 5 (6+P+S) array with data drives (D1 - D18), distributed parity (P), and a spare drive
So, an array is formed by using one array site, and although the array can be accessed by
each adapter of the DA pair, it is managed by one DA. Later in the configuration process, you
define the adapter and the server that manage this array.
3.2.3 Ranks
In the DS8000 virtualization hierarchy, there is another logical construct, which is called a
rank. When you define a new rank, its name is chosen by the DS Storage Manager, for
example, R1, R2, and R3. You must add an array to a rank.
Important: In the DS8000 implementation, a rank is built by using only one array.
The available space on each rank is divided into extents. The extents are the building blocks
of the logical volumes. An extent is striped across all disks of an array, as shown in Figure 3-3
on page 56, and indicated by the small squares in Figure 3-4 on page 58.
z Systems users or administrators typically do not deal with gigabytes or gibibytes, and
instead they think of storage in the original 3390 volume sizes. A 3390 Model 3 is three times
the size of a Model 1. A Model 1 has 1113 cylinders (about 0.94 GB). The extent size of a
CKD rank is one 3390 Model 1 or 1113 cylinders.
Figure 3-3 shows an example of an array that is formatted for FB data with 1 GiB extents (the
squares in the rank indicate that the extent is composed of several blocks from DDMs).
Figure 3-3 Creation of a rank: the capacity of the RAID array is formed into an FB rank composed of 1 GiB extents
It is still possible to define a CKD volume with a capacity that is an integral multiple of one
cylinder or a fixed block LUN with a capacity that is an integral multiple of 128 logical blocks
(64 KB). However, if the defined capacity is not an integral multiple of the capacity of one
extent, the unused capacity in the last extent is wasted. For example, you can define a one
cylinder CKD volume, but 1113 cylinders (one extent) are allocated and 1112 cylinders are
wasted.
Encryption group
A DS8880 storage system comes with encryption-capable disk drives. If you plan to use
encryption before creating a rank, you must define an encryption group. For more
information, see IBM DS8870 Disk Encryption, REDP-4500. Currently, the DS8000 series
supports only one encryption group. All ranks must be in this encryption group. The
encryption group is an attribute of a rank. So, your choice is to encrypt everything or nothing.
You can turn on encryption later (by creating an encryption group), but then all ranks must be deleted and re-created, which means that your data is also deleted.
With Easy Tier, it is possible to mix ranks with different characteristics and features in
managed extent pools to achieve the preferred performance results. You can mix all three
storage classes or storage tiers within the same extent pool that have solid-state drive (SSD),
Enterprise, and Nearline-class disks.
Important: In general, do not mix ranks with different RAID levels or disk types (size and
RPMs) in the same extent pool if you are not implementing Easy Tier automatic
management of these pools. Easy Tier has algorithms for automatically managing
performance and data relocation across storage tiers and even rebalancing data within a
storage tier across ranks in multitier or single-tier extent pools, providing automatic storage
performance and storage economics management with the preferred price, performance,
and energy savings costs.
There is no predefined affinity of ranks or arrays to a storage server. The affinity of the rank
(and its associated array) to a certain server is determined at the point that the rank is
assigned to an extent pool.
One or more ranks with the same extent type (FB or CKD) can be assigned to an extent pool.
One rank can be assigned to only one extent pool. There can be as many extent pools as
there are ranks.
With storage-pool striping (the default extent allocation method (EAM) rotate extents), you
can create logical volumes striped across multiple ranks. This approach typically enhances
performance. To benefit from storage pool striping (see “Rotate extents (storage pool striping)
extent allocation method” on page 63), multiple ranks in an extent pool are required.
Storage-pool striping enhances performance, but it also increases the failure boundary. When
one rank is lost, for example, in the unlikely event that a whole RAID array failed because of
multiple failures at the same time, the data of this single rank is lost. Also, all volumes in the
pool that are allocated with the rotate extents option are exposed to data loss. To avoid
exposure to data loss for this event, consider mirroring your data to a remote DS8000 storage
system.
When an extent pool is defined, it must be assigned with the following attributes:
Server affinity (or rank group)
Storage type (either FB or CKD)
Encryption group
Just like ranks, extent pools also belong to an encryption group. When defining an extent
pool, you must specify an encryption group. Encryption group 0 means no encryption.
Encryption group 1 means encryption.
The minimum reasonable number of extent pools on a DS8000 storage system is two. One
extent pool is assigned to storage server 0 (rank group 0), and the other extent pool is
assigned to storage server 1 (rank group 1) so that both DS8000 storage systems are active.
In an environment where FB storage and CKD storage share the DS8000 storage system,
four extent pools are required to assign each pair of FB pools and CKD pools to both storage
systems, balancing capacity and workload across both DS8000 processor complexes.
Extent pools are expanded by adding more ranks to the pool. Ranks are organized in to two
rank groups: Rank group 0 is controlled by storage server 0 (processor complex 0), and rank
group 1 is controlled by storage server 1 (processor complex 1).
Important: For the preferred performance, balance capacity between the two servers and
create at least two extent pools, with one extent pool per DS8000 storage system.
Figure 3-4 Extent pools: for example, extent pools FBprod and FBtest, each composed of 1 GiB FB extents and assigned to server 0 or server 1
Dynamic extent pool merge allows one extent pool to be merged into another extent pool
while the logical volumes in both extent pools remain accessible to the host servers.
Important: You can apply dynamic extent pool merge only among extent pools that are
associated with the same DS8000 storage system affinity (storage server 0 or storage
server 1) or rank group. All even-numbered extent pools (P0, P2, P4, and so on) belong to
rank group 0 and are serviced by storage server 0. All odd-numbered extent pools (P1, P3,
P5, and so on) belong to rank group 1 and are serviced by storage server 1 (unless one
DS8000 storage system failed or is quiesced with a failover to the alternative storage
system).
Additionally, the dynamic extent pool merge is not supported in these situations:
If source and target pools have different storage types (FB and CKD).
If you select an extent pool that contains volumes that are being migrated.
For more information about dynamic extent pool merge, see IBM DS8000 Easy Tier,
REDP-4667.
With the Easy Tier feature, you can easily add capacity and even single ranks to existing
extent pools without concern about performance.
Auto-rebalance
With Easy Tier automatic mode enabled for single-tier or multitier extent pools, you can
benefit from Easy Tier automated intratier performance management (auto-rebalance), which
relocates extents based on rank utilization, and reduces skew and avoids rank hot spots.
Easy Tier relocates subvolume data on extent level based on actual workload pattern and
rank utilization (workload rebalance) rather than balance the capacity of a volume across all
ranks in the pool (capacity rebalance, as achieved with manual volume rebalance).
When adding capacity to managed pools, Easy Tier automatic mode performance
management, auto-rebalance, takes advantage of the new ranks and automatically populates
the new ranks that are added to the pool when rebalancing the workload within a storage tier
and relocating subvolume data. Auto-rebalance can be enabled for hybrid and homogeneous
extent pools.
Tip: For brand new DS8000 storage systems, the Easy Tier automatic mode switch is set
to Tiered, which means that Easy Tier is working only in hybrid pools. Have Easy Tier
automatic mode working in all pools, including single-tier pools. To do so, set the Easy Tier
automode switch to All.
For more information about auto-rebalance, see IBM DS8000 Easy Tier, REDP-4667.
LUNs can be allocated in binary GiB (2^30 bytes), decimal GB (10^9 bytes), or 512 or 520-byte blocks. However, the physical capacity that is allocated for a LUN is always a multiple of 1 GiB (binary), so it is a good idea to have LUN sizes that are a multiple of a gibibyte. If you define a LUN with a size that is not a multiple of 1 GiB, for example, 25.5 GiB, the LUN size is 25.5 GiB, but a capacity of 26 GiB is physically allocated, wasting 0.5 GiB of physical storage capacity.
CKD volumes
A z Systems CKD volume is composed of one or more extents from one CKD extent pool.
CKD extents are the size of 3390 Model 1, which has 1113 cylinders. However, when you
define a z Systems CKD volume, you do not specify the number of 3390 Model 1 extents, but
the number of cylinders that you want for the volume.
The maximum size for a CKD volume was originally 65,520 cylinders. The DS8000 code introduced the Extended Address Volume (EAV), which gives you CKD volume sizes of up to around 1 TB.
If the number of specified cylinders is not an exact multiple of 1113 cylinders, part of the
space in the last allocated extent is wasted. For example, if you define 1114 or 3340
cylinders, 1112 cylinders are wasted. For maximum storage efficiency, consider allocating
volumes that are exact multiples of 1113 cylinders. In fact, consider multiples of 3339
cylinders for future compatibility.
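The extent rounding in the preceding examples can be checked with a short calculation; the extent sizes of 1 GiB (FB) and 1113 cylinders (CKD) are taken from the text.

import math

def allocated_and_wasted(requested, extent_size):
    """Return (allocated, wasted) given that capacity comes in whole extents."""
    extents = math.ceil(requested / extent_size)
    allocated = extents * extent_size
    return allocated, allocated - requested

print(allocated_and_wasted(25.5, 1))      # FB: (26, 0.5) GiB
print(allocated_and_wasted(1114, 1113))   # CKD: (2226, 1112) cylinders
print(allocated_and_wasted(3340, 1113))   # CKD: (4452, 1112) cylinders
print(allocated_and_wasted(3339, 1113))   # CKD: (3339, 0) cylinders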
A CKD volume cannot span multiple extent pools, but a volume can have extents from
different ranks in the same extent pool. You can stripe a volume across all ranks in an extent
pool by using the rotate extents EAM to distribute the capacity of the volume. This EAM
distributes the workload evenly across all ranks in the pool, minimizes skew, and reduces the
risk of single extent pools that become a hot spot. For more information, see “Rotate extents
(storage pool striping) extent allocation method” on page 63.
For the DS8880 R8.0 code version, extent space-efficient (ESE) volumes are supported for
FB (Open System + IBM i) format. The ESE concept is described in detail in DS8000 Thin
Provisioning, REDP-4554.
A space-efficient volume does not occupy all of its physical capacity at the time it is created.
The space becomes allocated when data is written to the volume. The sum of all the virtual
capacity of all space-efficient volumes can be larger than the available physical capacity,
which is known as over-provisioning or thin provisioning.
The idea behind space-efficient volumes is to allocate physical storage at the time that it is
needed to satisfy temporary peak storage needs.
Important: No Copy Services support exists for logical volumes larger than 2 TiB (2 × 2^40 bytes). Thin-provisioned (ESE) volumes, as released with the DS8000 LMC R8.0, are not yet supported for CKD volumes. Thin-provisioned volumes are supported by most, but not all, DS8000 Copy Services and advanced functions. These restrictions might change with future DS8000 LMC releases, so check the related documentation for your DS8000 LMC release.
Space allocation
Space for a space-efficient volume is allocated when a write occurs. More precisely, it is
allocated when a destage from the cache occurs and a new track or extent must be allocated.
Virtual space is created as part of the extent pool definition. This virtual space is mapped to
ESE volumes in the extent pool as needed. The virtual capacity equals the total logical
capacity of all ESE volumes. No physical storage capacity (other than for the metadata) is
allocated until write activity occurs.
The initfbvol command can release the space for space-efficient volumes.
When you create a volume in a managed extent pool, that is, an extent pool that is managed
by Easy Tier automatic mode, the EAM of the volume always becomes managed.
Tip: Rotate extents and rotate volume EAMs determine the distribution of a volume
capacity and the volume workload distribution across the ranks in an extent pool. Rotate
extents (the default EAM) evenly distributes the capacity of a volume at a granular 1 GiB
extent level across the ranks in a homogeneous extent pool. It is the preferred method to
reduce skew and minimize hot spots, improving overall performance.
Although all volumes in the extent pool that use rotate extents are evenly distributed across all
ranks in the pool, the initial start of each volume is additionally rotated throughout the ranks to
improve a balanced rank utilization and workload distribution. If the first volume is created
starting on rank R(n), the allocation for the later volume starts on the later rank R(n+1) in the
pool. The DS8000 storage system maintains a sequence of ranks. While the extents of each
volume are rotated across all available ranks in the pool, the DS8000 storage system
additionally tracks the rank in which the last volume allocation started. The allocation of the
first extent for the next volume then starts on the next rank in that sequence.
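The following minimal Python sketch illustrates this rotate-extents behavior: extents are striped round-robin across the ranks, and the starting rank advances with every new volume. It is a simplified model under stated assumptions (free-space handling and storage tiers are ignored), not DS8000 code.

def allocate_volumes(num_ranks, volume_sizes_in_extents):
    """Stripe each volume's extents round-robin across the ranks and rotate the
    starting rank for every new volume."""
    layout = {rank: [] for rank in range(num_ranks)}
    next_start = 0
    for vol, size in enumerate(volume_sizes_in_extents):
        for ext in range(size):
            rank = (next_start + ext) % num_ranks
            layout[rank].append((vol, ext))
        next_start = (next_start + 1) % num_ranks   # rotate the starting rank
    return layout

print(allocate_volumes(4, [6, 6]))   # volume 0 starts on rank 0, volume 1 on rank 1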
In hybrid or multitier extent pools (whether managed or non-managed by Easy Tier), initial
volume creation always starts on the ranks of the Enterprise tier first. The Enterprise tier is
also called the home tier. The extents of a new volume are distributed in a rotate extents or
storage pool striping fashion across all available ranks in this home tier in the extent pool if
sufficient capacity is available. Only when all capacity on the home tier in an extent pool is
consumed does volume creation continue on the ranks of the Nearline tier. When all capacity
on the Enterprise tier and Nearline tier is exhausted, then volume creation continues
allocating extents on the SSD tier. The initial extent allocation in non-managed hybrid pools
differs from the extent allocation in single-tier extent pools with rotate extents (the extents of a
volume are not evenly distributed across all ranks in the pool because of the different
treatment of the different storage tiers). However, the attribute for the EAM of the volume is
shown as rotate extents if the pool is not under Easy Tier automatic mode control. After the
pool is managed by Easy Tier automatic mode, the EAM becomes managed.
In managed homogeneous extent pools with only a single storage tier, the initial extent
allocation for a new volume is the same as with rotate extents or storage-pool striping. For a
volume, the appropriate DSCLI command, showfbvol or showckdvol, which is used with the
-rank option, allows the user to list the number of allocated extents of a volume on each
associated rank in the extent pool.
However, if you use, for example, Physical Partition striping in AIX already, double striping
probably does not improve performance any further.
Tip: Double-striping a volume, for example, by using rotate extents in storage and striping
a volume on the AIX Logical Volume Manager (LVM) level can lead to unexpected
performance results. Consider striping on the storage system level or on the operating
system level only. For more information, see 3.3.2, “Extent pool considerations” on
page 72.
If you decide to use storage-pool striping, it is preferable to use this allocation method for all
volumes in the extent pool to keep the ranks equally allocated and used.
Tip: When configuring a new DS8000 storage system, do not generally mix volumes that
use the rotate extents EAM (storage pool striping) and volumes that use the rotate volumes
EAM in the same extent pool.
Striping a volume across multiple ranks also increases the failure boundary. If you have extent
pools with many ranks and all volumes striped across these ranks, you lose the data of all volumes in the pool if one rank fails after suddenly losing multiple drives, for example, when two disk drives in the same RAID 5 rank fail at the same time.
If multiple EAM types are used in the same (non-managed) extent pool, use Easy Tier manual mode to change the EAM from rotate volumes to rotate extents and vice versa. Use volume migration (dynamic volume relocation (DVR)) in non-managed, homogeneous extent pools.
However, before switching any EAM of a volume, consider that you might need to change
other volumes on the same extent pool before distributing your volume across ranks. For
example, you create various volumes with the rotate volumes EAM and only a few with rotate
extents. Now, you want to switch only one volume to rotate extents. The ranks might not have
enough free extents available for Easy Tier to balance all the extents evenly over all ranks. In
this case, you might have to apply multiple steps and switch every volume to the new EAM
type before changing only one volume. Depending on your case, you might also consider
moving volumes to another extent pool before reorganizing all volumes and extents in the
extent pool.
In certain cases, for example when you merge extent pools, you must also plan to reorganize the extents in the new extent pool by using, for example, manual volume rebalance, so that the extents of the volumes are properly redistributed across the ranks in the pool.
This construction method of using fixed extents to form a logical volume in the DS8000 series
allows flexibility in the management of the logical volumes. You can delete LUNs/CKD
volumes, resize LUNs/volumes, and reuse the extents of those LUNs to create other
LUNs/volumes, maybe of different sizes. One logical volume can be removed without
affecting the other logical volumes defined on the same extent pool. Extents are cleaned after
you delete a LUN or CKD volume.
After the logical volume is created and available for host access, it is placed in the normal
configuration state. If a volume deletion request is received, the logical volume is placed in the
deconfiguring configuration state until all capacity that is associated with the logical volume is
deallocated and the logical volume object is deleted.
The reconfiguring configuration state is associated with a volume expansion request. The
transposing configuration state is associated with an extent pool merge. The migrating,
migration paused, migration error, and migration canceled configuration states are
associated with a volume relocation request.
Important: Before you can expand a volume, you must remove any Copy Services
relationships that involve that volume.
If the same extent pool is specified and rotate extents is used as the EAM, the volume
migration is carried out as manual volume rebalance, as described in “Manual volume
rebalance” on page 59. Manual volume rebalance is designed to redistribute the extents of
volumes within a non-managed, single-tier (homogeneous) pool so that workload skew and
hot spots are less likely to occur on the ranks. During extent relocation, only one extent at a
time is allocated rather than preallocating the full volume and only a minimum amount of free
capacity is required in the extent pool.
Important: A volume migration with DVR back into the same extent pool (for example,
manual volume rebalance for restriping purposes) is not supported in managed or hybrid
extent pools. Hybrid pools are always supposed to be prepared for Easy Tier automatic
management. In pools under control of Easy Tier automatic mode, the volume placement
is managed automatically by Easy Tier. It relocates extents across ranks and storage tiers
to optimize storage performance and storage efficiency. However, it is always possible to
migrate volumes across extent pools, no matter if those pools are managed,
non-managed, or hybrid pools.
For more information about this topic, see IBM DS8000 Easy Tier, REDP-4667.
On the DS8000 series, there is no fixed binding between a rank and an LSS. The capacity of
one or more ranks can be aggregated into an extent pool. The logical volumes that are
configured in that extent pool are not necessarily bound to a specific rank. Different logical
volumes on the same LSS can even be configured in separate extent pools. The available
capacity of the storage facility can be flexibly allocated across LSSs and logical volumes. You
can define up to 255 LSSs on a DS8000 storage system.
For each LUN or CKD volume, you must select an LSS when creating the volume. The LSS is part of the volume ID ‘abcd’ and must be specified upon volume creation.
You can have up to 256 volumes in one LSS. However, there is one restriction: volumes are created from extents of an extent pool, and an extent pool is associated with one DS8000 storage system (also called a central processor complex (CPC)), either server 0 or server 1. The LSS number also reflects this affinity to one of these DS8000 storage systems.
All even-numbered LSSs (X'00', X'02', X'04', up to X'FE') are serviced by storage server 0
(rank group 0). All odd-numbered LSSs (X'01', X'03', X'05', up to X'FD') are serviced by
storage server 1 (rank group 1). LSS X’FF’ is reserved.
z Systems users are familiar with a logical control unit (LCU). z Systems operating systems
configure LCUs to create device addresses. There is a one-to-one relationship between an
LCU and a CKD LSS (LSS X'ab' maps to LCU X'ab'). Logical volumes have a logical volume
number X'abcd' in hexadecimal notation where X'ab' identifies the LSS and X'cd' is one of the
256 logical volumes on the LSS. This logical volume number is assigned to a logical volume
when a logical volume is created and determines with which LSS the logical volume is
associated. The 256 possible logical volumes that are associated with an LSS are mapped to
the 256 possible device addresses on an LCU (logical volume X'abcd' maps to device
address X'cd' on LCU X'ab'). When creating CKD logical volumes and assigning their logical
volume numbers, consider whether Parallel Access Volumes (PAVs) are required on the LCU
and reserve addresses on the LCU for alias addresses.
For Open Systems, LSSs do not play an important role other than associating a volume with a
specific rank group and server affinity (storage server 0 or storage server 1) or grouping hosts
and applications together under selected LSSs for the DS8000 Copy Services relationships
and management.
Tip: Certain management actions in Metro Mirror, Global Mirror, or Global Copy operate at
the LSS level. For example, the freezing of pairs to preserve data consistency across all
pairs is at the LSS level. The option to put all or a set of volumes of a certain application in
one LSS can make the management of remote copy operations easier under certain
circumstances.
Important: LSSs for FB volumes are created automatically when the first FB logical
volume on the LSS is created, and deleted automatically when the last FB logical volume
on the LSS is deleted. CKD LSSs require user parameters to be specified and must be
created before the first CKD logical volume can be created on the LSS. They must be
deleted manually after the last CKD logical volume on the LSS is deleted.
All devices in an address group must be either CKD or FB. LSSs are grouped into address
groups of 16 LSSs. LSSs are numbered X'ab', where a is the address group. So, all LSSs
within one address group must be of the same type, CKD or FB. The first LSS defined in an
address group sets the type of that address group. For example, LSS X'10' to LSS X'1F' are
all in the same address group and therefore can all be used only for the same storage type,
either FB or CKD.
Figure 3-5 Volume IDs, logical subsystems, and address groups on the DS8000 storage systems
The volume ID X'gabb' in hexadecimal notation is composed of the address group X'g', the
LSS ID X'ga', and the volume number X'bb' within the LSS. For example, LUN X'2101'
denotes the second (X'01') LUN in LSS X'21' of address group 2.
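The following Python sketch decodes a volume ID according to these rules (address group, LSS, volume number, and even/odd server affinity). It is only an illustration of the numbering scheme, not a DS8000 tool.

def decode_volume_id(volume_id):
    """Decode a four-digit hexadecimal volume ID such as '2101'."""
    address_group = int(volume_id[0], 16)
    lss = int(volume_id[0:2], 16)            # for CKD, the LSS maps one-to-one to the LCU
    volume_number = int(volume_id[2:4], 16)  # for CKD, this is the device address on the LCU
    server = lss % 2                         # even LSS -> server 0, odd LSS -> server 1
    return {"address_group": address_group,
            "lss": f"{lss:02X}",
            "volume": f"{volume_number:02X}",
            "server": server}

print(decode_volume_id("2101"))
# {'address_group': 2, 'lss': '21', 'volume': '01', 'server': 1}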
Host attachment
Host bus adapters (HBAs) are identified to the DS8000 storage system in a host attachment
or host connection construct that specifies the HBA worldwide port names (WWPNs). A set of
host ports (host connections) can be associated through a port group attribute in the DSCLI
that allows a set of HBAs to be managed collectively. This group is called a host attachment
within the GUI.
Each host attachment can be associated with a volume group to define which LUNs that HBA
is allowed to access. Multiple host attachments can share the volume group. The host
attachment can also specify a port mask that controls which DS8000 I/O ports the HBA is
allowed to log in to. Whichever ports the HBA logs in to, it sees the same volume group that is
defined on the host attachment that is associated with this HBA.
When used with Open Systems hosts, a host attachment object that identifies the HBA is
linked to a specific volume group. You must define the volume group by indicating which FB
logical volumes are to be placed in the volume group. Logical volumes can be added to or
removed from any volume group dynamically.
One host connection can be assigned to one volume group only. However, the same volume
group can be assigned to multiple host connections. An FB logical volume can be assigned to
one or more volume groups. Assigning a logical volume to different volume groups allows a
LUN to be shared by hosts, each configured with its own dedicated volume group and set of
volumes (in case a set of volumes that is not identical is shared between the hosts).
The maximum number of volume groups is 8,320 for the DS8000 storage system.
Next, this section described the creation of logical volumes within the extent pools (optionally striping the volumes) and the assignment of a logical volume number, which determines the LSS with which they are associated and indicates which server manages them. Space-efficient volumes can be created immediately or within a repository of the extent pool. Then, the LUNs can be assigned to one or more volume groups. Finally, the HBAs are configured into a host attachment that is associated with a volume group.
These concepts are visible when working with the DSCLI. When working with the DS Storage Manager GUI, some of this complexity is hidden: instead of array sites, arrays, and ranks, the GUI speaks only of “Arrays” or “Managed Arrays”, and these are put into “Pools”. Also, the concept of a volume group is not directly visible when working with the GUI.
This virtualization concept provides for greater flexibility. Logical volumes can dynamically be
created, deleted, migrated, and resized. They can be grouped logically to simplify storage
management. Large LUNs and CKD volumes reduce the required total number of volumes,
which also contributes to a reduction of management efforts.
It can also help you fine-tune system performance from an extent pool perspective, for
example, sharing the resources of an extent pool evenly between application workloads or
isolating application workloads to dedicated extent pools. Data placement can help you when
planning for dedicated extent pools with different performance characteristics and storage
tiers without using Easy Tier automatic management. Plan your configuration carefully to
meet your performance goals by minimizing potential performance limitations that might be
introduced by single resources that become a bottleneck because of workload skew. For
example, use rotate extents as the default EAM to help reduce the risk of single ranks that
become a hot spot and limit the overall system performance because of workload skew.
If workload isolation is required in your environment, you can isolate workloads and I/O on the
rank and DA levels on the DS8000 storage systems, if required.
In a RAID 5 6+P or 7+P array, the amount of capacity equal to one disk is used for parity
information. However, the parity information is not bound to a single disk. Instead, it is striped
across all the disks in the array, so all disks of the array are involved to service I/O requests
equally.
In a RAID 6 5+P+Q or 6+P+Q array, the amount of capacity equal to two disks is used for
parity information. As with RAID 5, the parity information is not bound to single disks, but
instead is striped across all the disks in the array. So, all disks of the array service I/O
requests equally. However, a RAID 6 array might have one less drive available than a RAID 5
array configuration. Nearline drives, for example, by default allow RAID 6 configurations only.
In a RAID 10 3+3 array, the available usable space is the capacity of only three disks. Two
disks of the array site are used as spare disks. When a LUN is created from the extents of this
array, the data is always mirrored across two disks in the array. Each write to the array must
be performed twice to two disks. There is no additional parity information in a RAID 10 array
configuration.
Important: The spares in the mirrored RAID 10 configuration act independently; they are
not mirrored spares.
In a RAID 10 4+4 array, the available usable space is the capacity of four disks. No disks of
the array site are used as spare disks. When a LUN is created from the extents of this array,
the data is always mirrored across two disks in the array. Each write to the array must be
performed twice to two disks.
Important: The stripe width for the RAID arrays differs in size with the number of active
disks that hold the data. Because of the different stripe widths that make up the extent from
each type of RAID array, it is not a preferred practice to intermix RAID array types within
the same extent pool, especially in homogeneous extent pool configurations that do not
use Easy Tier automatic management. With Easy Tier enabled, the benefit of automatic
storage performance and storage efficiency management combined with Easy Tier
micro-tiering capabilities typically outperforms the disadvantage of different RAID arrays
within the same pool.
Even in single-tier homogeneous extent pools, you can benefit from Easy Tier automatic
mode (by running the DSCLI chsi ETautomode=all command). It manages the subvolume
data placement within the managed pool based on rank utilization and thus reduces workload
skew and hot spots (auto-rebalance).
In multitier hybrid extent pools, you can fully benefit from Easy Tier automatic mode (by
running the DSCLI chsi ETautomode=all|tiered command). It provides full automatic
storage performance and storage economics management by optimizing subvolume data
placement in a managed extent pool across different storage tiers and even across ranks
within each storage tier (auto-rebalance). Easy Tier automatic mode and hybrid extent pool
configurations offer the most efficient way to use different storage tiers. It optimizes storage
performance and storage economics across three drive tiers to manage more applications
effectively and efficiently with a single DS8000 storage system at an optimum price versus the
performance and footprint ratio.
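As a sketch, and assuming the -etautomode parameter of the DSCLI chsi command at your code level, Easy Tier automatic mode might be enabled as follows. The storage image ID is a hypothetical example, and showsi can be used to verify the resulting settings:
dscli> chsi -etautomode all IBM.2107-75ACA91
dscli> showsi IBM.2107-75ACA91
Specifying tiered instead of all restricts Easy Tier automatic management to hybrid (multitier) pools only.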
The data placement and extent distribution of a volume across the ranks in an extent pool can
be displayed by running the DSCLI showfbvol -rank or showckdvol -rank command, as
shown in Example 3-2 on page 78.
Before configuring extent pools and volumes, be aware of the basic configuration principles
about workload sharing, isolation, and spreading, as described in 4.2, “Configuration
principles for optimal performance” on page 87.
The first example, which is shown in Figure 3-7 on page 73, illustrates an extent pool with
only one rank, which is also referred to as a single-rank extent pool. This approach is common if
you plan to use the SAN Volume Controller, for example, or if you plan a configuration that
uses the maximum isolation that you can achieve on the rank/extpool level. In this type of a
single-rank extent pool configuration, all volumes that are created are bound to a single rank.
This type of configuration requires careful logical configuration and performance planning
because single ranks are likely to become a hot spot and might limit overall system
performance. It also requires the highest administration and management effort because
workload skew typically varies over time. You might need to constantly monitor your system performance and react to hot spots. It also considerably limits the benefits that a
DS8000 storage system can offer regarding its virtualization and Easy Tier automatic
management capabilities.
Also, in this example, one host can easily degrade the whole rank, depending on its I/O
workload, and affect multiple hosts that share volumes on the same rank if you have more
than one LUN allocated in this extent pool.
Figure 3-7 Single-rank extent pool: extent pool P2 with rank R6 (single tier), holding LUN 6 for host B and LUN 7 for host C
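If this kind of rank-level isolation is required, a single-rank extent pool such as P2 in Figure 3-7 might be built with a DSCLI sequence similar to the following sketch. The array site, array, rank, and pool IDs are hypothetical (and assume that the new pool receives ID P2); whether the rank is assigned at creation time or with a later chrank command depends on your code level and preferred workflow:
dscli> mkarray -raidtype 5 -arsite S6
dscli> mkrank -array A6 -stgtype fb
dscli> mkextpool -rankgrp 0 -stgtype fb iso_pool
dscli> chrank -extpool P2 R6
The result is a pool P2 that contains only rank R6, so any volume created in P2 is bound to that single rank.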
The second example, which is shown in Figure 3-8, illustrates an extent pool with multiple
ranks of the same storage class or storage tier, which is referred to as a homogeneous or
single-tier extent pool. In general, an extent pool with multiple ranks also is called a
multi-rank extent pool.
Figure 3-8 Multi-rank single-tier extent pool: extent pool P1 with ranks R1 - R4, holding LUNs 1 - 5 for hosts A, B, and C
The use of the EAM rotate volumes still can isolate volumes to separate ranks, even in
multi-rank extent pools, wherever such configurations are required. This EAM minimizes the
configuration effort because a set of volumes that is distributed across different ranks can be
created with a single command. Plan your configuration and performance needs to implement
host-level-based methods to balance the workload evenly across all volumes and ranks.
However, it is not a preferred practice to use both EAMs in the same extent pool without an
efficient host-level-based striping method for the non-striped volumes. This approach easily
forfeits the benefits of storage pool striping and likely leads to imbalanced workloads across
ranks and potential single-rank performance bottlenecks.
Figure 3-8 on page 73 is an example of storage-pool striping for LUNs 1 - 4. It shows more
than one host and more than one LUN distributed across the ranks. In contrast to the
preferred practice, it also shows an example of LUN 5 being created with the rotate volumes
EAM in the same pool. The storage system tries to allocate contiguous space for this volume on a single rank (R1) until insufficient capacity is left on this rank, and then it spills over to the next available rank (R2). All workload on this LUN is limited to these
two ranks. This approach considerably increases the workload skew across all ranks in the
pool and the likelihood that these two ranks might become a bottleneck for all volumes in the
pool, which reduces overall pool performance.
Multiple hosts with multiple LUNs, as shown in Figure 3-8 on page 73, share the resources
(resource sharing) in the extent pool, that is, ranks, DAs, and physical spindles. If one host or
LUN has a high workload, I/O contention can result and easily affect the other application
workloads in the pool, especially if all applications have their workload peaks at the same
time. Alternatively, applications can benefit from a much larger amount of disk spindles and
thus larger performance capabilities in a shared environment in contrast to workload isolation
and only dedicated resources. With resource sharing, expect that not all applications peak at
the same time, so that each application typically benefits from the larger amount of disk
resources that it can use. The resource sharing and storage-pool striping in non-managed
extent pools method is a good approach for most cases if no other requirements, such as
workload isolation or a specific quality of service (QoS) requirements, dictate another
approach.
Enabling Easy Tier automatic mode for homogeneous, single-tier extent pools always is an
additional option, and is preferred, to let the DS8000 storage system manage system
performance in the pools based on rank utilization (auto-rebalance). The EAM of all volumes
in the pool becomes managed in this case. With Easy Tier and its advanced micro-tiering
capabilities that take different RAID levels and drive characteristics into account for
determining the rank utilization in managed pools, even a mix of different drive characteristics
and RAID levels of the same storage tier might be an option for certain environments.
With Easy Tier and I/O Priority Manager, the DS8880 family offers advanced features when
taking advantage of resource sharing to minimize administration efforts and reduce workload
skew and hot spots while benefitting from automatic storage performance, storage
economics, and workload priority management. The use of these features in the DS8000
environments is highly encouraged. These features generally help provide excellent overall
system performance while ensuring (QoS) levels by prioritizing workloads in shared
environments at a minimum of administration effort and at an optimum price-performance
ratio.
When you create a volume in a managed extent pool, that is, an extent pool that is managed
by Easy Tier automatic mode, the EAM of the volume always becomes managed. This
situation is true no matter which EAM is specified at volume creation. The volume is under
control of Easy Tier. Easy Tier moves extents to the most appropriate storage tier and rank in
the pool based on performance aspects. Any specified EAM, such as rotate extents or rotate
volumes, is ignored.
In managed or hybrid extent pools, an initial EAM that is similar to rotate extents is used for new volumes. The same situation applies if an existing volume is manually moved to a
managed or hybrid extent pool by using volume migration or DVR. In hybrid or multitier extent
pools (whether managed or non-managed by Easy Tier), initial volume creation always starts
on the ranks of the Enterprise tier first. The Enterprise tier is also called the home tier. The
extents of a new volume are distributed in a rotate extents or storage pool striping fashion
across all available ranks in this home tier in the extent pool if sufficient capacity is available.
Only when all capacity on the home tier in an extent pool is consumed does volume creation
continue on the ranks of the flash or Nearline tier. The initial extent allocation in non-managed
hybrid pools differs from the extent allocation in single-tier extent pools with rotate extents (the
extents of a volume are not evenly distributed across all ranks in the pool because of the
different treatment of the different storage tiers). However, the attribute for the EAM of the
volume is shown as rotate extents if the pool is not under Easy Tier automatic mode control.
After Easy Tier automatic mode is enabled for a hybrid pool, the EAM of all volumes in that
pool becomes managed.
Mixing different storage tiers combined with Easy Tier automatic performance and economics
management on a subvolume level can considerably increase the performance versus price
ratio, increase energy savings, and reduce the overall footprint. The use of Easy Tier
automated subvolume data relocation and the addition of a flash tier are good for mixed
environments with applications that demand both IOPS and bandwidth at the same time. For
example, database systems might have different I/O demands according to their architecture.
Costs might be too high to allocate a whole database on flash. Mixing different drive
technologies, for example, flash with Enterprise or Nearline disks, and efficiently allocating
the data capacity on the subvolume level across the tiers with Easy Tier can highly optimize
price, performance, the footprint, and the energy usage. Only the hot part of the data
allocates flash or SSD capacity instead of provisioning flash capacity for full volumes.
Therefore, you can achieve considerable system performance at a reduced cost and footprint
with only a few SSDs.
The ratio of flash capacity to hard disk drive (HDD) disk capacity in a hybrid pool depends on
the workload characteristics and skew. Ideally, there is enough flash capacity to hold the active (hot) extents in the pool, but not much more, so that the more expensive flash capacity is not wasted. For new DS8000 orders, 3 - 5% of flash capacity might be a reasonable percentage to plan for with hybrid pools in typical environments. This configuration can already result in the movement of 50% or more of the small and random I/O workload from Enterprise drives to
flash. This configuration provides a reasonable initial estimate if measurement data is not
available to support configuration planning.
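As a simple worked example based on this rule of thumb: for a hybrid pool with 200 TB of usable capacity, plan for roughly 6 - 10 TB of flash capacity (3 - 5%), with the remaining capacity provided by Enterprise or Nearline ranks. Treat this number only as a starting point and refine it with measured skew data when it becomes available.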
Figure 3-9 shows a configuration of a managed two-tier extent pool with an SSD and
Enterprise storage tier. All LUNs are managed by Easy Tier. Easy Tier automatically and
dynamically relocates subvolume data to the appropriate storage tier and rank based on their
workload patterns. Figure 3-9 shows multiple LUNs from different hosts allocated in the
two-tier pool with hot data already migrated to SSDs. Initial volume creation in this pool
always allocates extents on the Enterprise tier first, if capacity is available, before Easy Tier
automatically starts promoting extents to the SSD tier.
(Figure content: extent pool P5 with ranks R14 and R15 forming the SSD and Enterprise tiers, holding LUNs 1 - 6 for hosts A, B, and C.)
Figure 3-9 Multitier extent pool with SSD and Enterprise ranks
Figure 3-10 on page 77 shows a configuration of a managed two-tier extent pool with an
Enterprise and Nearline storage tier. All LUNs are managed by Easy Tier. Easy Tier
automatically and dynamically relocates subvolume data to the appropriate storage tier and
rank based on their workload patterns. With more than one rank in the Enterprise storage tier,
Easy Tier also balances the workload and resource usage across the ranks within this
storage tier (auto-rebalance). Figure 3-10 on page 77 shows multiple LUNs from different
hosts allocated in the two-tier pool with cold data already demoted to the Nearline tier. Initial
volume creation in this pool always allocates extents on the Enterprise tier first, if capacity is
available, before Easy Tier automatically starts demoting extents to the Nearline tier.
Figure 3-10 Multitier extent pool with Enterprise and Nearline ranks
Figure 3-11 shows a configuration of a managed three-tier extent pool with an SSD,
Enterprise, and Nearline storage tiers. All LUNs are managed by Easy Tier. Easy Tier
automatically and dynamically relocates subvolume data to the appropriate storage tier and
rank based on their workload patterns. With more than one rank in the Enterprise storage tier,
Easy Tier also balances the workload and resource usage across the ranks within this
storage tier (auto-rebalance). Figure 3-11 shows multiple LUNs from different hosts allocated
in the three-tier pool with hot data promoted to the SSD/flash tier and cold data demoted to
the Nearline tier. Initial volume creation in this pool always allocates extents on the Enterprise
tier first, if capacity is available, before Easy Tier automatically starts promoting extents to the
SSD/flash tier or demoting extents to the Nearline tier.
Use hybrid extent pool configurations under automated Easy Tier management. This approach provides ease of use with minimal administration and performance management effort while optimizing system performance, price, footprint, and energy costs.
(Figure content: extent pool P3 with ranks R7, R8, and R9 forming the SSD/flash, Enterprise, and Nearline tiers, holding LUNs 1 - 6 for hosts A, B, and C.)
Figure 3-11 Multitier extent pool with SSD/flash, Enterprise, and Nearline ranks
The initial allocation of extents for a volume in a managed single-tier pool is similar to the
rotate extents EAM or storage pool striping. So, the extents are evenly distributed across all
ranks in the pool right after the volume creation. The initial allocation of volumes in hybrid
extent pools differs slightly. The extent allocation always begins in a rotate extents-like fashion
on the ranks of the Enterprise tier first, and then continues on SSD/flash and Nearline ranks.
After the initial extent allocation of a volume in the pool, the extents and their placement on
the different storage tiers and ranks are managed by Easy Tier. Easy Tier collects workload
statistics for each extent in the pool and creates migration plans to relocate the extents to the
appropriate storage tiers and ranks. The extents are promoted to higher tiers or demoted to
lower tiers based on their actual workload patterns. The data placement of a volume in a
managed pool is no longer static or determined by its initial extent allocation. The data
placement of the volume across the ranks in a managed extent pool is subject to change over
time to constantly optimize storage performance and storage economics in the pool. This
process is ongoing and always adapting to changing workload conditions. After Easy Tier
data collection and automatic mode are enabled, it might take a few hours before the first
migration plan is created and applied. For more information about Easy Tier migration plan
creation and timings, see IBM DS8000 Easy Tier, REDP-4667.
The DSCLI showfbvol -rank or showckdvol -rank commands, and the showfbvol -tier or
showckdvol -tier commands, can help show the current extent distribution of a volume
across the ranks and tiers, as shown in Example 3-2. In this example, volume 0110 is
managed by Easy Tier and distributed across ranks R13 (HPFE flash or SSD tier), R10 (ENT
tier), and R11 (ENT). You can use the lsarray -l -rank Rxy command to show the storage
class and DA pair of a specific rank Rxy.
Example 3-2 showfbvol -tier and showfbvol -rank commands to show the volume-to-rank relationship in
a multitier pool
dscli> showfbvol -tier 0110
Name aix7_esp
ID 0110
accstate Online
datastate Normal
configstate Normal
deviceMTM 2107-900
datatype FB 512T
addrgrp 0
extpool P3
exts 100
captype DS
cap (2^30B) 100.0
cap (10^9B) -
cap (blocks) 209715200
volgrp V3
ranks 3
The volume heat distribution (volume heat map), which is provided by the STAT, helps you
identify hot, warm, and cold extents for each volume and its distribution across the storage
tiers in the pool. For more information about the STAT, see 6.5, “Storage Tier Advisor Tool” on
page 213.
Figure 3-12 gives an example of a three-tier hybrid pool managed by Easy Tier. It shows the
change of the volume data placement across the ranks over time when Easy Tier relocates
extents based on their workload pattern and adapts to changing workload conditions. At time
T1, you see that all volumes are initially allocated on the Enterprise tier, evenly balanced
across ranks R8 and R9. After becoming cold, some extents from LUN 1 are demoted to
Nearline drives at T3. LUN 2 and LUN 3, with almost no activity, are demoted to Nearline
drives at T2. After increased activity, some data of LUN 2 is promoted to Enterprise drives
again at T3. The hot extent of LUN 4 is promoted to SSD/flash, and the cold extent is
demoted to Nearline drives at T2. After changing workload conditions on LUN 4 and the
increased activity on the demoted extents, these extents are promoted again to Enterprise
drives at T3. Cold extents from LUN 5 are demoted to Nearline drives with a constant low
access pattern over time. LUN 6 shows some extents promoted to SSD/flash at T2 and
further extent allocation changes across the ranks of the Enterprise tier at T3. The changes
are because of potential extent relocations from the SSD/flash tier (warm demote or swap
operations) or across the ranks within the Enterprise tiers (auto-rebalance), balancing the
workload based on workload patterns and rank utilization.
Figure 3-12 Volume data placement in a managed multitier extent pool over time: LUNs created at T1, with Easy Tier relocations at T2 and T3
Important: Before reading this chapter, familiarize yourself with the material that is
covered in Chapter 3, “Logical configuration concepts and terminology” on page 47.
This chapter introduces a step-by-step approach to configuring the DS8000 storage system with workload and performance considerations in mind:
Reviewing the tiered storage concepts and Easy Tier
Understanding the configuration principles for optimal performance:
– Workload isolation
– Workload resource-sharing
– Workload spreading
Analyzing workload characteristics to determine isolation or resource-sharing
Planning allocation of the DS8000 disk and host connection capacity to identified
workloads
Planning spreading volumes and host connections for the identified workloads
Planning array sites
Planning RAID arrays and ranks with RAID-level performance considerations
Planning extent pools with single-tier and multitier extent pool considerations
Planning address groups, logical subsystems (LSSs), volume IDs, and count key data
(CKD) Parallel Access Volumes (PAVs)
Planning I/O port IDs, host attachments, and volume groups
Implementing and Documenting the DS8000 logical configuration
Typically, an optimal design keeps the active operational data in Tier 0 and Tier 1 and uses
Tiers 2 and 3 for less active data, as shown in Figure 4-2 on page 85.
The benefits that are associated with a tiered storage approach mostly relate to cost. By
introducing flash storage as tier 0, you might more efficiently address the highest
performance needs while reducing the Enterprise class storage, system footprint, and energy
costs. A tiered storage approach can provide the performance that you need and save
significant costs that are associated with storage because lower-tier storage is less
expensive. Environmental savings, such as energy, footprint, and cooling reductions, are
possible.
With dramatically high I/O rates, low response times, and IOPS-energy-efficient
characteristics, flash addresses the highest performance needs and also potentially can
achieve significant savings in operational costs. However, the acquisition cost per GB is higher than for HDDs. To satisfy most workload characteristics, flash must be used efficiently
with HDDs in a well-balanced tiered storage architecture. It is critical to choose the correct
mix of storage tiers and the correct data placement to achieve optimal storage performance
and economics across all tiers at a low cost.
With the DS8000 storage system, you can easily implement tiered storage environments that
use flash, Enterprise, and Nearline class storage tiers. Still, different storage tiers can be
isolated to separate extent pools and volume placement can be managed manually across
extent pools where required. Or, better and highly encouraged, volume placement can be
managed automatically on a subvolume level in hybrid extent pools by Easy Tier automatic
mode with minimum management effort for the storage administrator. Easy Tier is a no-cost
feature on DS8000 storage systems. For more information about Easy Tier, see 1.3.4, “Easy
Tier” on page 11.
Consider Easy Tier automatic mode and hybrid extent pools for managing tiered storage on
the DS8000 storage system. The overall management and performance monitoring effort increases considerably when you manually manage storage capacity and performance needs across multiple storage classes, and this manual approach does not achieve the efficiency that Easy Tier automatic mode provides with data relocation at the subvolume (extent) level. With Easy Tier,
client configurations show less potential to waste flash capacity than with volume-based
tiering methods.
In environments with homogeneous system configurations or isolated storage tiers that are
bound to different homogeneous extent pools, you can benefit from Easy Tier automatic
mode. Easy Tier provides automatic intra-tier performance management by rebalancing the
workload across ranks (auto-rebalance) in homogeneous single-tier pools based on rank
utilization. Easy Tier automatically minimizes skew and rank hot spots and helps to reduce
the overall management effort for the storage administrator.
Depending on the particular storage requirements in your environment, with the DS8000
architecture, you can address a vast range of storage needs combined with ease of
management. On a single DS8000 storage system, you can perform these tasks:
Isolate workloads to selected extent pools (or down to selected ranks and DAs).
Share resources of other extent pools with different workloads.
Use Easy Tier to automatically manage multitier extent pools with different storage tiers
(or homogeneous extent pools).
Adapt your logical configuration easily and dynamically at any time to changing
performance or capacity needs by migrating volumes across extent pools, merging extent
pools, or removing ranks from one extent pool (rank depopulation) and moving them to
another pool.
Easy Tier helps you consolidate more workloads onto a single DS8000 storage system by
automating storage performance and storage economics management across up to three
drive tiers. In addition, I/O Priority Manager, as described in 1.3.5, “I/O Priority Manager” on
page 17, can help you align workloads to quality of service (QoS) levels to prioritize separate
system workloads that compete for the same shared and possibly constrained storage
resources to meet their performance goals.
For many initial installations, an approach with two extent pools (with or without different
storage tiers) and enabled Easy Tier automatic management might be the simplest way to
start if you have FB or CKD storage only; otherwise, four extent pools are required. You can
plan for more extent pools based on your specific environment and storage needs, for
example, workload isolation for some pools, different resource sharing pools for different
departments or clients, or specific Copy Services considerations.
You can take advantage of features such as Easy Tier and I/O Priority Manager. Both features
pursue different goals and can be combined.
Easy Tier provides a significant benefit for mixed workloads, so consider it for
resource-sharing workloads and isolated workloads dedicated to a specific set of resources.
Furthermore, Easy Tier automatically supports the goal of workload spreading by distributing
the workload in an optimum way across all the dedicated resources in an extent pool. It
provides automated storage performance and storage economics optimization through
dynamic data relocation on extent level across multiple storage tiers and ranks based on their
access patterns. With auto-rebalance, it rebalances the workload across the ranks within a
storage tier based on utilization to reduce skew and avoid hot spots. Auto-rebalance applies
to managed multitier pools and single-tier pools and helps to rebalance the workloads evenly
across ranks to provide an overall balanced rank utilization within a storage tier or managed
single-tier extent pool. Figure 4-3 shows the effect of auto-rebalance in a single-tier extent
pool that starts with a highly imbalanced workload across the ranks at T1. Auto-rebalance
rebalances the workload and optimizes the rank utilization over time.
The DS8000 I/O Priority Manager provides a significant benefit for resource-sharing
workloads. It aligns QoS levels to separate workloads that compete for the same shared and
possibly constrained storage resources. I/O Priority Manager can prioritize access to these
system resources to achieve the needed QoS for the volume based on predefined
performance goals (high, medium, or low). I/O Priority Manager constantly monitors and
balances system resources to help applications meet their performance targets automatically,
without operator intervention. I/O Priority Manager acts only if resource contention is
detected.
Isolation provides ensured availability of the hardware resources that are dedicated to the
isolated workload. It removes contention with other applications for those resources.
However, isolation limits the isolated workload to a subset of the total DS8000 hardware so
that its maximum potential performance might be reduced. Unless an application has an
entire DS8000 storage system that is dedicated to its use, there is potential for contention
with other applications for any hardware (such as cache and processor resources) that is not
dedicated. Typically, isolation is implemented to improve the performance of certain
workloads by separating different workload types.
One traditional approach to isolation is to identify lower-priority workloads with heavy I/O
demands and to separate them from all of the more important workloads. You might be able
to isolate multiple lower priority workloads with heavy I/O demands to a single set of hardware
resources and still meet their lower service-level requirements, particularly if their peak I/O
demands are at different times. In addition, I/O Priority Manager can help to prioritize different
workloads, if required.
You can partition the DS8000 disk capacity for isolation at several levels:
Rank level: Certain ranks are dedicated to a workload, that is, volumes for one workload
are allocated on these ranks. The ranks can have a different disk type (capacity or speed),
a different RAID array type (RAID 5, RAID 6, or RAID 10, arrays with spares or arrays
without spares), or a different storage type (CKD or FB) than the disk types, RAID array
types, or storage types that are used by other workloads. Workloads that require different disk, RAID, or storage types can dictate rank, extent pool, and address group isolation. For example, consider rank isolation for workloads with heavy random activity.
Extent pool level: Extent pools are logical constructs that represent a group of ranks that
are serviced by storage server 0 or storage server 1. You can isolate different workloads to
different extent pools, but you always must be aware of the rank and DA pair associations.
Although physical isolation on rank and DA level involves building appropriate extent pools
with a selected set of ranks or ranks from a specific DA pair, different extent pools with a
subset of ranks from different DA pairs still can share DAs. Isolated workloads to different
extent pools might share a DA adapter as a physical resource, which can be a potential
limiting physical resource under certain extreme conditions. However, given the
capabilities of one DA, mutual interference at this level is rare, and isolation (if needed) at the extent pool level is effective if the workloads are disk-bound.
The DS8000 host connection subsetting for isolation can also be done at several levels:
I/O port level: Certain DS8000 I/O ports are dedicated to a workload, which is a common
case. Workloads that require Fibre Channel connection (FICON) and Fibre Channel
Protocol (FCP) must be isolated at the I/O port level anyway because each I/O port on a
FCP/FICON-capable HA card can be configured to support only one of these protocols.
Although Open Systems host servers and remote mirroring links use the same protocol
(FCP), they are typically isolated to different I/O ports. You must also consider workloads
with heavy large-block sequential activity for HA isolation because they tend to consume
all of the I/O port resources that are available to them.
HA level: Certain HAs are dedicated to a workload. FICON and FCP workloads do not
necessarily require HA isolation because separate I/O ports on the same
FCP/FICON-capable HA card can be configured to support each protocol (FICON or
FCP). However, it is a preferred practice to separate FCP and FICON to different HBAs.
Furthermore, host connection requirements might dictate a unique type of HA card
(longwave (LW) or shortwave (SW)) for a workload. Workloads with heavy large-block
sequential activity must be considered for HA isolation because they tend to consume all
of the I/O port resources that are available to them.
I/O enclosure level: Certain I/O enclosures are dedicated to a workload. This approach is
not necessary.
Multiple resource-sharing workloads can have logical volumes on the same ranks and can
access the same DS8000 HAs or I/O ports. Resource-sharing allows a workload to access
more DS8000 hardware than can be dedicated to the workload, providing greater potential
performance, but this hardware sharing can result in resource contention between
applications that impacts overall performance at times. It is important to allow
resource-sharing only for workloads that do not consume all of the DS8000 hardware
resources that are available to them. As an alternative, use I/O Priority Manager to align QoS
levels to the volumes that are sharing resources and prioritize different workloads, if required.
Temporarily pinning volumes to a certain tier can also be considered; these volumes can be released again later.
Easy Tier extent pools typically are shared by multiple workloads because Easy Tier with its
automatic data relocation and performance optimization across multiple storage tiers
provides the most benefit for mixed workloads.
To better understand the resource-sharing principle for workloads on disk arrays, see 3.3.2,
“Extent pool considerations” on page 72.
You must allocate the DS8000 hardware resources to either an isolated workload or multiple
resource-sharing workloads in a balanced manner, that is, you must allocate either an
isolated workload or resource-sharing workloads to the DS8000 ranks that are assigned to
DAs and both processor complexes in a balanced manner. You must allocate either type of
workload to I/O ports that are spread across HAs and I/O enclosures in a balanced manner.
You must distribute volumes and host connections for either an isolated workload or a
resource-sharing workload in a balanced manner across all DS8000 hardware resources that
are allocated to that workload.
You should create volumes as evenly distributed as possible across all ranks and DAs
allocated to those workloads.
One exception to the recommendation of spreading volumes might be when specific files or
data sets are never accessed simultaneously, such as multiple log files for the same
application where only one log file is in use at a time. In that case, you can place the volumes
required by these data sets or files on the same resources.
Additionally, you might identify any workload that is so critical that its performance can never
be allowed to be negatively impacted by other workloads.
Then, identify the remaining workloads that are considered appropriate for resource-sharing.
Next, define a balanced set of hardware resources that can be dedicated to any isolated
workloads, if required. Then, allocate the remaining DS8000 hardware for sharing among the
resource-sharing workloads. Carefully consider the appropriate resources and storage tiers
for Easy Tier and multitier extent pools in a balanced manner. Also, plan ahead for
appropriate I/O Priority Manager alignments of QoS levels to resource-sharing workloads
where needed.
The next step is planning extent pools and assigning volumes and host connections to all
workloads in a way that is balanced and spread. By default, the standard allocation method when creating volumes stripes them with one-extent granularity across all arrays in a pool, so at the rank level, this distribution is done automatically.
Without an explicit need for workload isolation or any other requirement for multiple extent pools, starting with two extent pools (with or without different storage tiers) and a balanced distribution of the ranks and DAs might be the simplest configuration if you have either FB or CKD storage only, using resource-sharing throughout the whole DS8000 storage system and Easy Tier automatic management. Otherwise, four extent pools are required for a reasonable minimum configuration, two for FB storage and two for CKD storage, with each pair distributed across both DS8000 storage servers. In addition, you can plan to align
your workloads to expected QoS levels with I/O Priority Manager.
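As a sketch of such a minimum balanced configuration with both FB and CKD storage, the following hypothetical DSCLI commands create one FB pool and one CKD pool per rank group (storage server); the pool names are examples only:
dscli> mkextpool -rankgrp 0 -stgtype fb fb_pool_0
dscli> mkextpool -rankgrp 1 -stgtype fb fb_pool_1
dscli> mkextpool -rankgrp 0 -stgtype ckd ckd_pool_0
dscli> mkextpool -rankgrp 1 -stgtype ckd ckd_pool_1
The ranks are then assigned to these pools so that each pool receives a balanced share of the ranks and DA pairs.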
The final step is the implementation of host-level striping (when appropriate) and multipathing
software, if needed. If you planned for Easy Tier, do not consider host-level striping because it
dilutes the workload skew and is counterproductive to the Easy Tier optimization.
For example, the ratio of flash capacity to HDD capacity in a hybrid pool depends on the
workload characteristics and skew. Ideally, there is enough flash capacity to hold the active (hot) extents in the pool, but not much more, so that expensive flash capacity is not wasted. For new DS8000 orders, 3 - 5% of flash might be a reasonable percentage to plan for with hybrid pools in typical environments. This configuration can result in the movement of over 50% of the small and random I/O workload from Enterprise drives to flash. It provides a reasonable
initial estimate if measurement data is not available to support configuration planning.
The Storage Tier Advisor Tool (STAT) also can provide guidance for capacity planning of the
available storage tiers based on the existing workloads on a DS8000 storage system with
Easy Tier monitoring enabled. For more information, see 6.5, “Storage Tier Advisor Tool” on
page 213.
You must also consider organizational and business considerations in determining which
workloads to isolate. Workload priority (the importance of a workload to the business) is a key
consideration. Application administrators typically request dedicated resources for high
priority workloads. For example, certain database online transaction processing (OLTP)
workloads might require dedicated resources to ensure service levels.
The most important consideration is preventing lower-priority workloads with heavy I/O
requirements from impacting higher priority workloads. Lower-priority workloads with heavy
random activity must be evaluated for rank isolation. Lower-priority workloads with heavy,
large blocksize, and sequential activity must be evaluated for DA and I/O port isolation.
Workloads that require different disk drive types (capacity and speed), different RAID types
(RAID 5, RAID 6, or RAID 10), or different storage types (CKD or FB) dictate isolation to
different DS8000 arrays, ranks, and extent pools, unless this situation can be solved by
pinning volumes to one certain tier. For more information about the performance implications
of various RAID types, see “RAID-level performance considerations” on page 98.
Workloads that use different I/O protocols (FCP or FICON) dictate isolation to different I/O
ports. However, workloads that use the same disk drive types, RAID type, storage type, and
I/O protocol can be evaluated for separation or isolation requirements.
Isolation of only a few workloads that are known to have high I/O demands can allow all the
remaining workloads (including the high-priority workloads) to share hardware resources and
achieve acceptable levels of performance. More than one workload with high I/O demands
might be able to share the isolated DS8000 resources, depending on the service level
requirements and the times of peak activity.
The following examples are I/O workloads, files, or data sets that might have heavy and
continuous I/O access patterns:
Sequential workloads (especially those workloads with large-blocksize transfers)
Log files or data sets
Sort or work data sets or files
Business Intelligence and Data Mining
Disk copies (including Point-in-Time Copy background copies, remote mirroring target
volumes, and tape simulation on disk)
Video and imaging applications
Engineering and scientific applications
Certain batch workloads
You must consider workloads for all applications for which DS8000 storage is allocated,
including current workloads to be migrated from other installed storage systems and new
workloads that are planned for the DS8000 storage system. Also, consider projected growth
for both current and new workloads.
For existing applications, consider historical experience first. For example, is there an
application where certain data sets or files are known to have heavy, continuous I/O access
patterns? Is there a combination of multiple workloads that might result in unacceptable
performance if their peak I/O times occur simultaneously? Consider workload importance
(workloads of critical importance and workloads of lesser importance).
For existing applications, you can also use performance monitoring tools that are available for
the existing storage systems and server platforms to understand current application workload
characteristics:
Read/write ratio
Random/sequential ratio
Average transfer size (blocksize)
Peak workload (IOPS for random access and MB per second for sequential access)
Peak workload periods (time of day and time of month)
Copy Services requirements (Point-in-Time Copy and Remote Mirroring)
Host connection utilization and throughput (FCP host connections and FICON)
Remote mirroring link utilization and throughput
Estimate the requirements for new application workloads and for current application workload
growth. You can obtain information about general workload characteristics in Chapter 5,
“Understanding your workload” on page 141.
You can use the Disk Magic modeling tool to model the current or projected workload and
estimate the required DS8000 hardware resources. Disk Magic is described in 6.1, “Disk
Magic” on page 160.
The STAT can also provide workload information and capacity planning recommendations
that are associated with a specific workload to reconsider the need for isolation and evaluate
the potential benefit when using a multitier configuration and Easy Tier.
Choose the DS8000 resources to dedicate in a balanced manner. If ranks are planned for
workloads in multiples of two, half of the ranks can later be assigned to extent pools managed
by processor complex 0, and the other ranks can be assigned to extent pools managed by
processor complex 1. You may also note the DAs to be used. If I/O ports are allocated in
multiples of four, they can later be spread evenly across all I/O enclosures in a DS8000 frame
if four or more HA cards are installed. If I/O ports are allocated in multiples of two, they can
later be spread evenly across left and right I/O enclosures.
Easy Tier later provides automatic intra-tier management in single-tier and multitier pools
(auto-rebalance) and cross-tier management in multitier pools for the resource-sharing
workloads. In addition, different QoS levels can be aligned to different workloads to meet
performance goals.
Host connection: In this chapter, we use host connection in a general sense to represent
a connection between a host server (either z/OS or Open Systems) and the DS8000
storage system.
After the spreading plan is complete, use the DS8000 hardware resources that are identified
in the plan as input to order the DS8000 hardware.
However, there are host server performance considerations related to the number and size of
volumes. For example, for z Systems servers, the number of Parallel Access Volumes (PAVs)
that are needed can vary with volume size. For more information about PAVs, see 14.2,
“DS8000 and z Systems planning and configuration” on page 476. For IBM i servers, use a
volume size that is on the order of half the size of the disk drives that are used. There also
can be Open Systems host server or multipathing software considerations that are related to
the number or the size of volumes, so you must consider these factors in addition to workload
requirements.
To spread volumes across allocated hardware for each isolated workload, and then for each
workload in a group of resource-sharing workloads, complete the following steps:
1. Review the required number and the size of the logical volumes that are identified during
the workload analysis.
2. Review the number of ranks that are allocated to the workload (or group of
resource-sharing workloads) and the associated DA pairs.
3. Evaluate the use of multi-rank or multitier extent pools. Evaluate the use of Easy Tier in
automatic mode to automatically manage data placement and performance.
4. Assign the volumes, preferably with the default allocation method rotate extents (DSCLI
term: rotateexts, GUI: rotate capacity).
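For example, assuming the two balanced FB pools from the earlier sketch are P0 (rank group 0) and P1 (rank group 1), volumes for one workload might be created in both pools so that both storage servers share the load; the IDs, capacity, and names are hypothetical:
dscli> mkfbvol -extpool P0 -cap 200 -type ds -name app1_#h 1000-1003
dscli> mkfbvol -extpool P1 -cap 200 -type ds -name app1_#h 1100-1103
Volume IDs 10xx and 11xx place the volumes in an even and an odd LSS, which matches the server 0 and server 1 affinity of the two pools.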
There are significant performance implications from the assignment of host connections to
I/O ports, HAs, and I/O enclosures. The goal of the entire logical configuration planning
process is to ensure that host connections for each workload access I/O ports and HAs that
allow all workloads to meet the performance objectives.
To spread host connections across allocated hardware for each isolated workload, and then
for each workload in a group of resource-sharing workloads, complete the following steps:
1. Review the required number and type (SW, LW, FCP, or FICON) of host connections that
are identified in the workload analysis. You must use a minimum of two host connections
to different DS8000 HA cards to ensure availability. Some Open Systems hosts might
impose limits on the number of paths and volumes. In such cases, you might consider not
exceeding four paths per volume, which in general is a good approach for performance
and availability. The DS8880 front-end host ports are 16 Gbps capable, and if the expected workload does not saturate the adapter and port bandwidth with high sequential loads, you can share ports among many hosts.
2. Review the HAs that are allocated to the workload (or group of resource-sharing
workloads) and the associated I/O enclosures.
3. Review requirements that need I/O port isolation, for example, remote replication Copy
Services, IBM ProtecTIER®, and SAN Volume Controller. If possible, try to split them as
you split hosts among hardware resources. Do not mix them with other Open Systems
because they can have different workload characteristics.
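To illustrate the port-level part of such a plan, the following hedged DSCLI sketch lists the installed I/O ports and sets the protocol of two ports. The port IDs are hypothetical, and the available -topology values should be confirmed with the DSCLI help for your code level:
dscli> lsioport -l
dscli> setioport -topology ficon I0230
dscli> setioport -topology scsi-fcp I0231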
When using the DS Storage Manager GUI to create managed arrays and pools, the GUI
automatically chooses a good distribution of the arrays across all DAs, and initial formatting
with the GUI gives optimal results for many cases. Only for specific requirements (for
example, isolation by DA pairs) is the command-line interface (DSCLI) advantageous
because it gives more options for a certain specific configuration.
After the DS8000 hardware is installed, you can use the output of the DS8000 DSCLI
lsarraysite command to display and document array site information, including disk drive
type and DA pair. Check the disk drive type and DA pair for each array site to ensure that
arrays, ranks, and ultimately volumes that are created from the array site are created on the
DS8000 hardware resources required for the isolated or resource-sharing workloads.
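A hedged example of such a check, with no output shown because the array site IDs and attributes depend on the installed hardware:
dscli> lsarraysite -l
The long output includes, for each array site, attributes such as the disk capacity, disk class, DA pair, and state, which you can copy into your configuration plan.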
The result of this step is the addition of specific array site IDs to the plan of workload
assignment to ranks.
Storage servers: Array sites, arrays, and ranks do not have a fixed or predetermined
relationship to any DS8000 processor complex (storage server) before they are finally
assigned to an extent pool and a rank group (rank group 0/1 is managed by processor
complex 0/1).
RAID 5 is one of the most commonly used levels of RAID protection because it optimizes
cost-effective performance while emphasizing usable capacity through data striping. It
provides fault tolerance if one disk drive fails by using XOR parity for redundancy. Hot spots
within an array are avoided by distributing data and parity information across all of the drives
in the array. The capacity of one drive in the RAID array is lost because it holds the parity
information. RAID 5 provides a good balance of performance and usable storage capacity.
RAID 6 provides a higher level of fault tolerance than RAID 5 in disk failures, but also provides
less usable capacity than RAID 5 because the capacity of two drives in the array is set aside
to hold the parity information. As with RAID 5, hot spots within an array are avoided by
distributing data and parity information across all of the drives in the array. Still, RAID 6 offers
more usable capacity than RAID 10 by providing an efficient method of data protection in
double disk errors, such as two drive failures, two coincident medium errors, or a drive failure
and a medium error during a rebuild. Because the likelihood of media errors increases with
the capacity of the physical disk drives, consider the use of RAID 6 with large capacity disk
drives and higher data availability requirements. For example, consider RAID 6 where
rebuilding the array in a drive failure takes a long time. RAID 6 can also be used with smaller
SAS (Serial-attached Small Computer System Interface) drives, when the primary concern is
a higher level of data protection than is provided by RAID 5.
RAID 10 optimizes high performance while maintaining fault tolerance for disk drive failures.
The data is striped across several disks, and the first set of disk drives is mirrored to an
identical set. RAID 10 can tolerate at least one, and in most cases, multiple disk failures if the
primary copy and the secondary copy of a mirrored disk pair do not fail at the same time.
Regarding read I/O operations, either random or sequential, there is generally no difference
between RAID 5, RAID 6, and RAID 10. When a DS8000 storage system receives a read
request from a host system, it first checks whether the requested data is already in cache. If
the data is in cache (that is, a read cache hit), there is no need to read the data from disk, and
the RAID level on the arrays does not matter. For reads that must be satisfied from disk (that
is, the array or the back end), the performance of RAID 5, RAID 6, and RAID 10 is roughly
equal because the requests are spread evenly across all disks in the array. In RAID 5 and
RAID 6 arrays, data is striped across all disks, so I/Os are spread across all disks. In
RAID 10, data is striped and mirrored across two sets of disks, so half of the reads are
processed by one set of disks, and half of the reads are processed by the other set, reducing
the utilization of individual disks.
Regarding random-write I/O operations, the different RAID levels vary considerably in their
performance characteristics. With RAID 10, each write operation at the disk back end initiates
two disk operations to the rank. With RAID 5, an individual random small-block write
operation to the disk back end typically causes a RAID 5 write penalty, which initiates four I/O
operations to the rank by reading the old data and the old parity block before finally writing the
new data and the new parity block. For RAID 6 with two parity blocks, the write penalty
increases to six required I/O operations at the back end for a single random small-block write
operation. This assumption is a worst-case scenario that is helpful for understanding the
back-end impact of random workloads with a certain read/write ratio for the various RAID
levels. It permits a rough estimate of the expected back-end I/O workload and helps you to
plan for the correct number of arrays. On a heavily loaded system, it might take fewer I/O
operations than expected on average for RAID 5 and RAID 6 arrays. The optimization of the
queue of write I/Os waiting in cache for the next destage operation can lead to a high number
of partial or full stripe writes to the arrays with fewer required back-end disk operations for the
parity calculation.
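As a simplified worst-case estimate that ignores cache hits and full-stripe writes, the back-end I/O rate can be approximated as the read rate plus the write rate multiplied by the RAID write penalty. For example, 1,000 host IOPS with a 70:30 read/write ratio translate into roughly 700 + (300 x 4) = 1,900 back-end operations for RAID 5, 700 + (300 x 6) = 2,500 for RAID 6, and 700 + (300 x 2) = 1,300 for RAID 10.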
On modern disk systems, such as the DS8000 storage system, write operations are cached
by the storage subsystem and thus handled asynchronously with short write response times
for the attached host systems. So, any RAID 5 or RAID 6 write penalties are shielded from the
attached host systems in disk response time. Typically, a write request that is sent to the
DS8000 subsystem is written into storage server cache and persistent cache, and the I/O
operation is then acknowledged immediately to the host system as complete. If there is free
space in these cache areas, the response time that is seen by the application is only the time
to get data into the cache, and it does not matter whether RAID 5, RAID 6, or RAID 10 is
used.
There is also the concept of rewrites. If you update a cache segment that is still in write cache and not yet destaged, the storage server updates the segment in the cache and eliminates the RAID penalty for the previous write step. However, if the host systems send data to the cache areas faster than the
storage server can destage the data to the arrays (that is, move it from cache to the physical
disks), the cache can occasionally fill up with no space for the next write request. Therefore,
the storage server signals the host system to retry the I/O write operation. In the time that it
takes the host system to retry the I/O write operation, the storage server likely can destage
part of the data, which provides free space in the cache and allows the I/O operation to
complete on the retry attempt.
Although RAID 10 clearly outperforms RAID 5 and RAID 6 in small-block random write
operations, RAID 5 and RAID 6 show excellent performance in sequential write I/O
operations. With sequential write requests, all of the blocks that are required for the RAID 5
parity calculation can be accumulated in cache, and thus the destage operation with parity calculation can be done as a full stripe write without the need for additional disk operations to the array. So, with only one additional parity block for a full stripe write (for example, seven
data blocks plus one parity block for a 7+P RAID 5 array), RAID 5 requires fewer disk
operations at the back end than a RAID 10, which always requires twice the write operations
because of data mirroring. RAID 6 also benefits from sequential write patterns with most of
the data blocks, which are required for the double parity calculation, staying in cache and thus
reducing the number of additional disk operations to the back end considerably. For
sequential writes, a RAID 5 destage completes faster and reduces the busy time of the disk
subsystem.
Comparing RAID 5 to RAID 6, the performance of small-block random read and the
performance of a sequential read are roughly equal. Because of the higher write penalty, the RAID 6 small-block random write performance is noticeably lower than with RAID 5. Also, the
maximum sequential write throughput is slightly less with RAID 6 than with RAID 5 because
of the additional second parity calculation. However, RAID 6 rebuild times are close to RAID 5
rebuild times (for the same size disk drive modules (DDMs)) because rebuild times are
primarily limited by the achievable write throughput to the spare disk during data
reconstruction. So, RAID 6 mainly is a significant reliability enhancement with a trade-off in
random-write performance. It is most effective for large capacity disks that hold
mission-critical data and that are correctly sized for the expected write I/O demand. Workload
planning is especially important before implementing RAID 6 for write-intensive applications,
including Copy Services targets.
RAID 10 is not as commonly used as RAID 5 for two key reasons. First, RAID 10 requires
more raw disk capacity for every TB of effective capacity. Second, when you consider a
standard workload with a typically high number of read operations and only a few write
operations, RAID 5 generally offers the best trade-off between overall performance and
usable capacity. In many cases, RAID 5 write performance is adequate because disk systems
tend to operate at I/O rates below their maximum throughputs, and differences between
RAID 5 and RAID 10 are primarily observed at maximum throughput levels. Consider
RAID 10 for critical workloads with a high percentage of steady random-write requests, which
can easily become rank limited. For such workloads, RAID 10 provides almost twice the throughput of RAID 5 (because of the RAID 5 write penalty). The trade-off for better performance with RAID 10 is about 40% less usable disk capacity. Larger drives can be used with RAID 10 to get the
random-write performance benefit while maintaining about the same usable capacity as a
RAID 5 array with the same number of disks.
Table 4-1 shows a short overview of the advantages and disadvantages of the RAID levels regarding reliability, space efficiency, and random-write performance.
Table 4-1 RAID-level comparison of reliability, space efficiency, and write penalty
RAID level    Reliability (number of erasures)    Space efficiency    Performance: write penalty (number of disk operations)
In general, workloads that effectively use storage system cache for reads and writes see little
difference between RAID 5 and RAID 10 configurations. For workloads that perform better
with RAID 5, the difference in RAID 5 performance over RAID 10 is typically small. However,
for workloads that perform better with RAID 10, the difference in RAID 10 performance over
RAID 5 performance or RAID 6 performance can be significant.
Because RAID 5, RAID 6, and RAID 10 perform equally well for both random and sequential
read operations, RAID 5 or RAID 6 might be a good choice for space efficiency and
performance for standard workloads with many read requests. RAID 6 offers a higher level of
data protection than RAID 5, especially for large capacity drives, but the random-write
performance of RAID 6 is lower because of the second parity calculation. Therefore, size the
configuration for performance, especially when you use RAID 6.
RAID 5 tends to have a slight performance advantage for sequential writes. RAID 10
performs better for random writes. RAID 10 is considered to be the RAID type of choice for
business-critical workloads with many random write requests (typically more than 35% writes)
and low response time requirements.
For array rebuilds, RAID 5, RAID 6, and RAID 10 require approximately the same elapsed
time, although RAID 5 and RAID 6 require more disk operations and therefore are more likely
to affect other disk activity on the same disk array.
You can select RAID types for each array site. So, you can select the RAID type based on the
specific performance requirements of the data for that site. The preferred way to compare the
performance of a specific workload that uses RAID 5, RAID 6, or RAID 10 is to run a Disk
Magic model. For additional information about the capabilities of this tool, see 6.1, “Disk
Magic” on page 160.
Besides the RAID level and the actual workload pattern (read:write ratio, sequential
access, or random access), the limits of the maximum I/O rate per rank also depend on the
type of disk drives that are used. As a mechanical device, each disk drive can process a
limited number of random IOPS, depending on the drive characteristics. So, the number of
disk drives that are used for a specific amount of storage capacity determines the achievable
random IOPS performance. The 15 K drives offer approximately 30% more random IOPS
performance than 10 K drives. Generally, for random IOPS planning calculations, you can
use up to 160 IOPS per 15 K FC drive and 120 IOPS per 10 K FC drive. However, at such
levels of IOPS and disk utilization, you might already see elevated response times. So, if
excellent response times are expected, consider lower IOPS limits. Slower-spinning,
large-capacity Nearline disk drives offer a considerably lower maximum random access I/O
rate per drive (approximately half of a 15 K drive). Therefore, they are only intended for
environments with fixed content, data archival, reference data, or for near-line applications
that require large amounts of data at low cost, or, in normal production, as a slower
Tier 2 in hybrid pools, holding only a fraction of the total capacity.
Today, disk drives are mostly used as Tier 1 and lower tiers in a hybrid pool, where most
of the IOPS are handled by Tier 0 SSDs. Yet, even if the flash tier handles, for example,
70% or more of the load, the HDDs still handle a considerable amount of the workload because of
their large bulk capacity. So, it can make a difference whether you use, for example,
600 GB/15 K drives or 1.2 TB/10 K drives for Tier 1. The drive selection for such lower drive
tiers must also be done as a sizing exercise.
Finally, the lsarray -l and lsrank -l commands can give you an idea of which DA pair is used
by each array and rank, as shown in Example 4-1. You can do further planning from here.
Example 4-1 lsarray -l and lsrank -l commands showing Array ID sequence and DA pair
dscli> lsarray -l
Array State Data RAIDtype arsite Rank DA Pair DDMcap (10^9B) diskclass encrypt
======================================================================================
A0 Assigned Normal 5 (7+P) S12 R0 2 600.0 ENT supported
A1 Assigned Normal 5 (6+P) S16 R1 18 400.0 Flash supported
A2 Assigned Normal 5 (7+P) S11 R2 2 600.0 ENT supported
A3 Assigned Normal 5 (6+P) S15 R3 18 400.0 Flash supported
A4 Assigned Normal 5 (7+P) S6 R4 0 600.0 ENT supported
A5 Assigned Normal 5 (6+P+S) S10 R5 2 600.0 ENT supported
A6 Assigned Normal 5 (6+P+S) S8 R6 2 600.0 ENT supported
A7 Assigned Normal 5 (6+P+S) S4 R7 0 600.0 ENT supported
A8 Assigned Normal 5 (6+P+S) S14 R8 18 400.0 Flash supported
A9 Assigned Normal 5 (7+P) S5 R9 0 600.0 ENT supported
A10 Assigned Normal 5 (6+P+S) S9 R10 2 600.0 ENT supported
dscli> lsrank -l
ID Group State datastate Array RAIDtype extpoolID extpoolnam stgtype exts usedexts encryptgrp marray
======================================================================================================
R0 0 Normal Normal A0 5 P0 ITSO_CKD ckd 4113 0 - MA12
R1 0 Normal Normal A1 5 P0 ITSO_CKD ckd 2392 0 - MA16
R2 1 Normal Normal A2 5 P1 ITSO_CKD ckd 4113 0 - MA11
R3 1 Normal Normal A3 5 P1 ITSO_CKD ckd 2392 0 - MA15
R4 0 Normal Normal A4 5 P2 ITSO_FB fb 3672 0 - MA6
R5 0 Normal Normal A5 5 P2 ITSO_FB fb 3142 0 - MA10
R6 0 Normal Normal A6 5 P2 ITSO_FB fb 3142 0 - MA8
R7 0 Normal Normal A7 5 P2 ITSO_FB fb 3142 0 - MA4
R8 0 Normal Normal A8 5 P2 ITSO_FB fb 2132 0 - MA14
R9 1 Normal Normal A9 5 P3 ITSO_FB fb 3672 0 - MA5
R10 1 Normal Normal A10 5 P3 ITSO_FB fb 3142 0 - MA9
R11 1 Normal Normal A11 5 P3 ITSO_FB fb 3142 0 - MA7
R12 1 Normal Normal A12 5 P3 ITSO_FB fb 3142 0 - MA3
R13 1 Normal Normal A13 5 P3 ITSO_FB fb 2132 0 - MA13
R14 0 Normal Normal A14 5 P4 Outcenter fb 3142 0 - MA2
R15 1 Normal Normal A15 5 P5 Outcenter fb 3142 0 - MA1
Extent pools are automatically numbered with system-generated IDs starting with P0, P1, and
P2 in the sequence in which they are created. Extent pools that are created for rank group 0
are managed by processor complex 0 and have even-numbered IDs (P0, P2, and P4, for
example). Extent pools that are created for rank group 1 are managed by processor complex
1 and have odd-numbered IDs (P1, P3, and P5, for example). Only in a failure condition or
during a concurrent code load is the ownership of a certain rank group temporarily moved to
the alternative processor complex.
To achieve a uniform storage system I/O performance and avoid single resources that
become bottlenecks (called hot spots), it is preferable to distribute volumes and workloads
evenly across all of the ranks (disk spindles) and DA pairs that are dedicated to a workload by
creating appropriate extent pool configurations.
The assignment of the ranks to extent pools together with an appropriate concept for the
logical configuration and volume layout is the most essential step to optimize overall storage
system performance. A rank can be assigned to any extent pool or rank group. Each rank
provides a particular number of storage extents of a certain storage type (either FB or CKD)
to an extent pool. Finally, an extent pool aggregates the extents from the assigned ranks and
provides the logical storage capacity for the creation of logical volumes for the attached host
systems.
In addition, the various extent pool configurations (homogeneous or hybrid pools, managed or
not managed by Easy Tier) can further be combined with the DS8000 I/O Priority Manager to
prioritize workloads that are sharing resources to meet QoS goals in cases when resource
contention might occur.
The following sections present concepts for the configuration of single-tier and multitier extent
pools to spread the workloads evenly across the available hardware resources. Also, the
benefits of Easy Tier with different extent pool configurations are outlined. Unless otherwise
noted, assume that enabling Easy Tier automatic mode refers to enabling the automatic
management capabilities of Easy Tier and Easy Tier monitoring.
Figure 4-4 Ease of storage management versus automatic storage optimization by using Easy Tier
However, data placement across tiers must be managed manually and occurs on the volume
level only, which might, for example, waste costly flash capacity. Typically, only a part of the
capacity of a specific volume is hot and suited for flash or SSD placement. The workload can
become imbalanced across the ranks within an extent pool and limit the overall performance
even with storage pool striping because of natural workload skew. Workload spreading within
a pool is based on spreading only the volume capacity evenly across the ranks, not taking any
data access patterns or performance statistics into account. After adding capacity to an
existing extent pool, you must restripe the volume data within an extent pool manually, for
example, by using manual volume rebalance, to maintain a balanced workload distribution
across all ranks in a specific pool.
As shown in the second configuration (orange color), you can easily optimize the first
configuration and reduce management efforts considerably by enabling Easy Tier automatic
mode. Thus, you automate intra-tier performance management (auto-rebalance) in these
homogeneous single-tier extent pools. Easy Tier controls the workload spreading within each
extent pool and automatically relocates data across the ranks based on rank utilization to
minimize skew and avoid rank hot spots. Performance management is shifted from rank to
extent pool level with correct data placement across tiers and extent pools at the volume level.
Furthermore, when adding capacity to an existing extent pool, Easy Tier automatic mode
automatically takes advantage of the capacity and performance capabilities of the new ranks
without the need for manual interaction.
You can further reduce management effort by merging extent pools and building different
combinations of two-tier hybrid extent pools, as shown in the third configuration (blue color).
You can introduce an automatically managed tiered storage architecture but still isolate, for
example, the high-performance production workload from the development/test environment.
You can introduce an ENT/SSD pool for the high-performance and high-priority production
workload, efficiently boosting ENT performance with flash cards or SSDs and automating
storage tiering from enterprise-class drives to SSDs by using Easy Tier automated cross-tier
data relocation and storage performance optimization at the subvolume level. You can create
an ENT/NL pool for the development/test environment or other enterprise-class applications
to maintain enterprise-class performance while shrinking the footprint and reducing costs by
combining enterprise-class drives with large-capacity Nearline drives that use Easy Tier
automated cross-tier data relocation and storage economics management. In addition, you
benefit from a balanced workload distribution across all ranks within each drive tier because
of the Easy Tier intra-tier optimization and automatically take advantage of the capacity and
performance capabilities of new ranks when capacity is added to an existing pool without the
need for manual interaction.
All of these configurations can be combined with I/O Priority Manager to prioritize workloads
when sharing resources and provide QoS levels in case resource contention occurs.
You can also take full advantage of the Easy Tier manual mode features, such as dynamic
volume relocation (volume migration), dynamic extent pool merge, and rank depopulation, to
dynamically modify your logical configuration. When merging extent pools with different
storage tiers, you can gradually introduce more automatic storage management with Easy
Tier at any time. With rank depopulation, you can reduce multitier pools and automated
cross-tier management according to your needs.
These examples are generic. A single DS8000 storage system with its tremendous scalability
can manage many applications effectively and efficiently, so typically multiple extent pool
configurations exist on a large system to serve the various needs of isolated and
resource-sharing workloads, Copy Services considerations, or other specific requirements. Easy Tier and
potentially I/O Priority Manager simplify management with single-tier and multitier extent
pools and help to spread workloads easily across shared hardware resources under optimum
conditions and best performance, automatically adapting to changing workload conditions.
You can choose from various extent pool configurations for your resource isolation and
resource sharing workloads, which are combined with Easy Tier and I/O Priority Manager.
Single-tier extent pools consist of one or more ranks and can be referred to as single-rank or
multi-rank extent pools.
With single-rank extent pools, you benefit less from the advanced DS8000 virtualization
features, such as dynamic volume expansion (DVE), storage pool striping, Easy Tier automatic
performance management, and workload spreading, which use the capabilities of multiple
ranks within a single extent pool.
Single-rank extent pools are selected for environments where isolation or management of
volumes on the rank level is needed, such as in some z/OS environments. Single-rank extent
pools are selected for configurations by using storage appliances, such as the SAN Volume
Controller, where the selected RAID arrays are provided to the appliance as simple back-end
storage capacity and where the advanced virtualization features on the DS8000 storage
system are not required or not wanted to avoid multiple layers of data striping. However, the
use of homogeneous multi-rank extent pools and storage pool striping is popular because it
minimizes the storage administrative effort by shifting performance management from the
rank to the extent pool level and lets the DS8000 storage system maintain a balanced data
distribution across the ranks within a specific pool. It provides excellent performance in
relation to the reduced management effort.
Also, you do not need to strictly use only single-rank extent pools or only multi-rank extent
pools on a storage system. You can base your decision on individual considerations for each
workload group that is assigned to a set of ranks and thus extent pools. The decision to use
single-rank and multi-rank extent pools depends on the logical configuration concept that is
chosen for the distribution of the identified workloads or workload groups for isolation and
resource-sharing.
In general, single-rank extent pools might not be a good choice in today's complex and mixed
environments unless you know that this level of isolation and micro-performance
management is required for your specific environment. If not managed correctly, workload
skew and rank hot spots that limit overall system performance are likely to occur.
With a homogeneous multi-rank extent pool, you take advantage of the advanced DS8000
virtualization features to spread the workload evenly across the ranks in an extent pool to
achieve a well-balanced data distribution with considerably less management effort.
Performance management is shifted from the rank level to the extent pool level. An extent
pool represents a set of merged ranks (a larger set of disk spindles) with a uniform workload
distribution. So, the level of complexity for standard performance and configuration
management is reduced from managing many individual ranks (micro-performance
management) to a few multi-rank extent pools (macro-performance management).
The size of the volumes must fit the available capacity on each rank. The number of volumes
that are created for this workload in a specific extent pool must match the number of ranks (or
be at least a multiple of this number). Otherwise, the result is an imbalanced volume and
workload distribution across the ranks, and rank bottlenecks might emerge. In addition, efficient
host-based striping must be ensured in this case to spread the workload evenly across all
ranks, possibly from two or more extent pools. For more information about the EAMs and
how the volume data is spread across the ranks in an extent pool, see 3.2.7, “Extent
allocation methods” on page 62.
Even multi-rank extent pools that are not managed by Easy Tier provide some level of control
over the volume placement across the ranks in cases where it is necessary to manually
enforce a special volume allocation scheme. You can use the DSCLI command chrank
-reserve to reserve all of the extents from a rank in an extent pool from being used for the
next creation of volumes. Alternatively, you can use the DSCLI command chrank -release to
release a rank and make the extents available again.
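The following DSCLI sketch (in script format, where lines that start with # are comments) illustrates this manual placement technique. The rank, pool, capacity, volume name, and volume IDs are hypothetical and must be adapted to your configuration:

# Temporarily exclude ranks R4 and R5 from extent allocation in pool P2
chrank -reserve R4
chrank -reserve R5
# New volumes are now allocated only on the non-reserved ranks of pool P2
mkfbvol -extpool P2 -cap 100 -name appB_#h 2000-2003
# Make the extents of ranks R4 and R5 available again
chrank -release R4
chrank -release R5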
Multi-rank extent pools that use storage pool striping are the general configuration approach
today on modern DS8000 storage systems to spread the data evenly across the ranks in a
homogeneous multi-rank extent pool and thus reduce skew and the likelihood of single-rank
hot spots. Without Easy Tier automatic mode management, such non-managed,
homogeneous multi-rank extent pools consist only of ranks of the same drive type and RAID
level. Although not required (and probably not realizable for smaller or heterogeneous
configurations), you can take the effective rank capacity into account, grouping ranks with and
without spares into different extent pools when using storage pool striping to ensure a strictly
balanced workload distribution across all ranks up to the last extent. Otherwise, give
additional consideration to the volumes that are created from the last extents in a mixed
homogeneous extent pool that contains ranks with and without spares, because these
volumes are probably allocated only on the subset of ranks with the larger capacity and
without spares.
In combination with Easy Tier, a more efficient and automated way of spreading the
workloads evenly across all ranks in a homogeneous multi-rank extent pool is available. The
automated intra-tier performance management (auto-rebalance) of Easy Tier efficiently
spreads the workload evenly across all ranks. It automatically relocates the data across the
ranks of the same storage class in an extent pool based on rank utilization to achieve and
maintain a balanced distribution of the workload, minimizing skew and avoiding rank hot
spots. You can enable auto-rebalance for homogeneous extent pools by setting the Easy Tier
management scope to all extent pools (ETautomode=all).
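As a sketch, assuming the Easy Tier parameters of the chsi command at your DSCLI and microcode level (the storage image ID is an example only), you can enable monitoring and automatic management for all pools and then verify the setting:

# Enable Easy Tier monitoring and automatic management for all pools
chsi -etmonitor all -etautomode all IBM.2107-75ZA571
# Verify the current Easy Tier settings of the storage image
showsi IBM.2107-75ZA571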
In addition, Easy Tier automatic mode can also handle storage device variations within a tier
by using a micro-tiering capability. An example of storage device variations within a tier is an
intermix of ranks with different disk types (RPM or RAID level) within the same storage class
of an extent pool, or when mixing classical SSDs with the more powerful HPFE cards.
Note: Easy Tier does not differentiate 10 K and 15 K Enterprise disk drives as separate
storage tiers in a managed extent pool, or between SSDs and HPFE cards. Both drive
types are considered as the same storage tier and no automated cross-tier promotion or
demotion algorithms are applied between these two storage classes. Easy Tier automated
data relocation across tiers to optimize performance and storage economics based on the
hotness of the particular extent takes place only between different storage tiers. If these
drives are mixed in the same managed extent pool, the Easy Tier auto-rebalance algorithm
balances the workload only across all ranks of this Enterprise-class tier based on overall
rank utilization, taking the performance capabilities of each rank (micro-tiering capability)
into account.
Figure 4-5 provides two configuration examples for using dedicated homogeneous extent
pools with storage classes in combination with and without Easy Tier automatic mode
management.
For the configuration without Easy Tier automatic mode management (ET automode=none,
dedicated single-tier SSD, ENT, and NL extent pools P0 - P7), the following characteristics
apply:
- Manual cross-tier workload management on the volume level that uses the Easy Tier
  manual mode features for volume migrations (dynamic volume relocation) across pools
  and tiers.
- Manual intra-tier workload management and workload spreading within extent pools that
  use DS8000 extent allocation methods, such as storage pool striping, based on a
  balanced volume capacity distribution.
- Strictly homogeneous pools with ranks of the same drive characteristics and RAID level.
- Isolation of workloads across different extent pools and storage tiers on the volume level,
  limiting the most efficient use of the available storage capacity and tiers.
- Highest administration and performance management effort with constant resource
  utilization monitoring, workload balancing, and manual placement of volumes across
  ranks, extent pools, and storage tiers.
- Efficiently taking advantage of new capacity that is added to existing pools typically
  requires manual restriping of volumes by using manual volume rebalance.
Figure 4-5 Single-tier extent pool configuration examples and Easy Tier benefits
With multi-rank extent pools, you can fully use the features of the DS8000 virtualization
architecture and Easy Tier that provide ease of use when you manage more applications
effectively and efficiently with a single DS8000 storage system. Consider multi-rank extent
pools and the use of Easy Tier automatic management especially for mixed workloads that
will be spread across multiple ranks. Multi-rank extent pools help simplify management and
volume creation. They also allow the creation of single volumes that can span multiple ranks
and thus exceed the capacity and performance limits of a single rank.
For more information about data placement in extent pool configurations, see 3.3.2, “Extent
pool considerations” on page 72.
Important: Multi-rank extent pools offer numerous advantages with respect to ease of use,
space efficiency, and the DS8000 virtualization features. Multi-rank extent pools, in
combination with Easy Tier automatic mode, provide both ease of use and excellent
performance for standard environments with workload groups that share a set of
homogeneous resources.
A multitier extent pool can consist of one of the following storage class combinations with up
to three storage tiers:
HPFE cards/SSD + Enterprise disk
HPFE cards/SSD + Nearline disk
Enterprise disk + Nearline disk
HPFE cards/SSD + Enterprise disk + Nearline disk
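As an illustration only (the array, rank, and pool IDs are hypothetical, and the exact parameter set depends on your DSCLI level), such a multitier pool is composed simply by assigning ranks of different storage classes to the same extent pool:

# Create ranks from a flash, an Enterprise, and a Nearline array
mkrank -array A1 -stgtype fb
mkrank -array A4 -stgtype fb
mkrank -array A20 -stgtype fb
# Assign all three ranks to the same FB extent pool to form a multitier pool
chrank -extpool P0 R1
chrank -extpool P0 R4
chrank -extpool P0 R20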
Multitier extent pools are especially suited for mixed, resource-sharing workloads. Tiered
storage, as described in 4.1, “Reviewing the tiered storage concepts and Easy Tier” on
page 84, is an approach of using different types of storage throughout the storage
infrastructure: a mix of higher-performing/higher-cost storage with lower-performing/lower-cost
storage, with data placed based on its specific I/O access characteristics. Although flash can
help efficiently boost enterprise-class performance, you can additionally shrink the footprint and reduce
costs by adding large-capacity Nearline drives while maintaining enterprise class
performance. Correctly balancing all the tiers eventually leads to the lowest cost and best
performance solution.
Always create hybrid extent pools for Easy Tier automatic mode management. The extent
allocation for volumes in hybrid extent pools differs from the extent allocation in homogeneous
pools. Any specified EAM, such as rotate extents or rotate volumes, is ignored when a new
volume is created in, or migrated into, a hybrid pool. The EAM is changed to managed when
the Easy Tier automatic mode is enabled for the pool, and the volume is under the control of
Easy Tier. Easy Tier then automatically moves extents to the most appropriate storage tier
and rank in the pool based on performance aspects.
Easy Tier automatically spreads workload across the resources (ranks and DAs) in a
managed hybrid pool. Easy Tier automatic mode adapts to changing workload conditions and
automatically promotes hot extents from the lower tier to the next upper tier. It demotes colder
extents from the higher tier to the next lower tier (swap extents from flash with hotter extents
from Enterprise tier, or demote cold extents from Enterprise tier to Nearline tier). Easy Tier
automatic mode optimizes the Nearline tier by demoting some of the sequential workload to
the Nearline tier to better balance sequential workloads. The auto-rebalance feature
rebalances extents across the ranks of the same tier based on rank utilization to minimize
skew and avoid hot spots. Auto-rebalance takes different device characteristics into account
when different devices or RAID levels are mixed within the same storage tier (micro-tiering).
Depending on the requirements of your workloads, you can create one or multiple pairs of extent
pools with different two-tier or three-tier combinations, based on your needs and
available hardware resources. You can, for example, create separate two-tier SSD/ENT and
ENT/NL extent pools to isolate your production environment from your development
environment. You can boost the performance of your production application with flash cards
or SSDs and optimize storage economics for your development applications with NL drives.
You can create three-tier extent pools for mixed, large resource-sharing workload groups and
benefit from fully automated storage performance and economics management at a minimum
management effort. You can boost the performance of your high-demand workloads with
flash and reduce the footprint and costs with NL drives for the lower-demand data.
You can use the DSCLI showfbvol/showckdvol -rank or -tier commands to display the
current extent distribution of a volume across the ranks and tiers, as shown in Example 3-2 on
page 78. Additionally, the volume heat distribution (volume heat map), provided by the STAT,
can help identify the amount of hot, warm, and cold extents for each volume and its
distribution across the storage tiers in the pool. For more information about STAT, see 6.5,
“Storage Tier Advisor Tool” on page 213.
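For example, the following commands (with a hypothetical volume ID) display the rank and tier distribution of a single FB volume:

# Show how the extents of volume 1000 are distributed across the ranks
showfbvol -rank 1000
# Show how the extents of volume 1000 are distributed across the storage tiers
showfbvol -tier 1000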
The ratio of SSD, ENT, and NL disk capacity in a hybrid pool depends on the workload
characteristics and skew and must be planned when ordering the drive hardware for the
identified workloads.
With the Easy Tier manual mode features, such as dynamic extent pool merge, dynamic
volume relocation, and rank depopulation, you can modify existing configurations easily,
depending on your needs. You can grow from a manually managed single-tier configuration
into a partially or fully automatically managed tiered storage configuration. You add tiers or
merge appropriate extent pools and enable Easy Tier at any time. For more information about
Easy Tier, see IBM DS8000 Easy Tier, REDP-4667.
Important: Multitier extent pools and Easy Tier help you implement a tiered storage
architecture on a single DS8000 storage system with all its benefits at a minimum
management effort and ease of use. Easy Tier and its automatic data placement within
and across tiers spread the workload efficiently across the available resources in an extent
pool. Easy Tier constantly optimizes storage performance and storage economics and
adapts to changing workload conditions. Easy Tier can reduce overall performance
management efforts and help consolidate more workloads efficiently and effectively on a
single DS8000 storage system. It optimizes performance and reduces energy costs and
the footprint.
Figure 4-6 Multitier extent pool configuration examples and Easy Tier benefits
With thin-provisioned volumes, only used capacity is allocated in the pool, and Easy Tier does
not move unused extents around or move hot extents on a large scale up from the Nearline
tier to the Enterprise tier and to the flash tier. However, thin-provisioned volumes are not fully supported
by all DS8000 Copy Services or advanced functions and platforms yet, so it might not be a
valid approach for all environments at this time. For more information about the initial volume
allocation in hybrid extent pools, see “Extent allocation in hybrid and managed extent pools”
on page 63.
When bringing a new DS8880 storage system into production to replace an older one (often a
system that does not use Easy Tier), consider the timeline of the implementation stages by
which you migrate all servers from the older to the new storage system.
One good option is to consider a staged approach when migrating servers to a new multitier
DS8880 storage system:
Assign the resources for the high-performing and response-time-sensitive workloads first,
and then add the less demanding workloads. The reverse order might lead to situations
where all initial resources, such as the Enterprise tier in hybrid extent pools, are already
allocated by the secondary workloads. This situation does not leave enough space on the
Enterprise tier for the primary workloads, which then must initially be placed on the Nearline tier.
Split your servers into several subgroups, where you migrate each subgroup one by one,
and not all at once. Then, allow Easy Tier several days to learn and optimize. Some
extents are moved to flash and some extents are moved to Nearline. You regain space on
the Enterprise HDDs. After Easy Tier learning for a server subgroup reaches a steady state,
the next server subgroup can be migrated. You gradually allocate the capacity in the hybrid extent
pool by optimizing the extent distribution of each application one by one while regaining
space in the Enterprise tier (home tier) for the next applications.
Another option that can help in some cases of new deployments is to reset the Easy Tier
learning heat map for a certain subset of volumes or for some pools. This action discards all
the previous days of Easy Tier learning, and the next internal auto-migration plan is
based on new workload patterns only.
With the default rotate extents (rotateexts in DSCLI, Rotate capacity in the GUI) algorithm,
the extents (1 GiB for FB volumes and 1113 cylinders or approximately 0.94 GiB for CKD
volumes) of each single volume are spread across all ranks within an extent pool and thus
across more disks. This approach reduces the occurrences of I/O hot spots at the rank level
within the storage system. Storage pool striping helps to balance the overall workload evenly
across the back-end resources. It reduces the risk of single ranks that become performance
bottlenecks while providing ease of use with less administrative effort.
When using the optional rotate volumes (rotatevols in DSCLI) EAM, each volume is placed
on a single separate rank with a successive distribution across all ranks in a round-robin
fashion.
The rotate extents and rotate volumes EAMs determine the initial data distribution of a
volume and thus the spreading of workloads in non-managed, single-tier extent pools. With
Easy Tier automatic mode enabled for single-tier (homogeneous) or multitier (hybrid) extent
pools, this selection becomes unimportant. The data placement and thus the workload
spreading is managed by Easy Tier. The use of Easy Tier automatic mode for single-tier
extent pools is highly encouraged for an optimal spreading of the workloads across the
resources. In single-tier extent pools, you can benefit from the Easy Tier automatic mode
feature auto-rebalance. Auto-rebalance constantly and automatically balances the workload
across ranks of the same storage tier based on rank utilization, minimizing skew and avoiding
the occurrence of single-rank hot spots.
Many, if not most, application environments can benefit from the use of storage pool
striping (rotate extents):
Operating systems that do not directly support host-level striping.
VMware datastores.
Microsoft Exchange.
Windows clustering environments.
Older Solaris environments.
Environments that need to suballocate storage from a large pool.
Applications with multiple volumes and volume access patterns that differ from day to day.
Resource sharing workload groups that are dedicated to many ranks with host operating
systems that do not all use or support host-level striping techniques or application-level
striping techniques.
Consider the following points for selected applications or environments to use storage-pool
striping in homogeneous configurations:
DB2: Excellent opportunity to simplify storage management by using storage-pool striping.
You might prefer to use the traditional DB2 striping recommendations for
performance-sensitive environments.
DB2 and similar data warehouse applications, where the database manages storage and
parallel access to data. Consider independent volumes on individual ranks with a careful
volume layout strategy that does not use storage-pool striping. Containers or database
partitions are configured according to suggestions from the database vendor.
Oracle: Excellent opportunity to simplify storage management for Oracle. You might prefer
to use Oracle traditional suggestions that involve ASM and Oracle striping capabilities for
performance-sensitive environments.
Small, highly active logs or files: Small highly active files or storage areas smaller than
1 GiB with a high access density might require spreading across multiple ranks for
performance reasons. However, storage-pool striping offers a striping granularity on extent
levels only around 1 GiB, which is too large in this case. Continue to use host-level striping
techniques or application-level striping techniques that support smaller stripe sizes. For
example, assume a 0.8 GiB log file exists with extreme write content, and you want to
spread this log file across several RAID arrays. Assume that you intend to spread its
activity across four ranks. At least four 1 GiB extents must be allocated, one extent on
each rank (which is the smallest possible allocation). Creating four separate volumes,
each with a 1 GiB extent from each rank, and then using Logical Volume Manager (LVM)
striping with a relatively small stripe size (for example, 16 MiB) effectively distributes the
workload across all four ranks. Creating a single LUN of four extents, which is also
distributed across the four ranks by using DS8000 storage-pool striping, cannot effectively
spread the file workload evenly across all four ranks because the stripe size of one extent is
larger than the actual size of the file (see the host-level striping sketch after this list).
IBM Spectrum™ Protect/Tivoli Storage Manager storage pools: Tivoli Storage Manager
storage pools work well in striped pools. But, Tivoli Storage Manager suggests that the
Tivoli Storage Manager databases be allocated in a separate pool or pools.
AIX volume groups (VGs): LVM and physical partition (PP) striping continue to be powerful
tools for managing performance. In combination with storage-pool striping, now
considerably fewer stripes are required for common environments. Instead of striping
across a large set of volumes from many ranks (for example, 32 volumes from 32 ranks),
striping is required only across a few volumes from a small set of different multi-rank
extent pools from both DS8000 rank groups that use storage-pool striping. For example,
use four volumes from four extent pools, each with eight ranks. For specific workloads,
using the advanced AIX LVM striping capabilities with a smaller granularity at the KiB or
MiB level, instead of storage-pool striping with 1 GiB extents (FB), might be preferable to
achieve the highest performance.
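As referenced in the small, highly active files item above, the following minimal AIX sketch shows the host-level striping alternative. The disk, volume group, logical volume, and mount point names are hypothetical, and the exact options can vary by AIX level:

# Four DS8000 volumes (one 1 GiB extent from each of four ranks) appear
# on AIX as hdisk4 - hdisk7; build a volume group with a 16 MB PP size
mkvg -y logvg -s 16 hdisk4 hdisk5 hdisk6 hdisk7
# Create a striped logical volume (64 LPs = 1 GiB) with a 16 MiB strip
# size across all four disks, then create and mount a file system on it
mklv -y loglv -t jfs2 -S 16M logvg 64 hdisk4 hdisk5 hdisk6 hdisk7
crfs -v jfs2 -d loglv -m /db2logs -A yes
mount /db2logs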
In general, storage-pool striping helps improve overall performance and reduces the effort of
performance management by evenly distributing data and workloads across a larger set of
ranks, which reduces skew and hot spots. Certain application workloads can also benefit from
the higher number of disk spindles behind one volume. But, there are cases where host-level
striping or application-level striping might achieve a higher performance, at the cost of higher
overall administrative effort. Storage-pool striping might deliver good performance in these
cases with less management effort, but manual striping with careful configuration planning
can achieve the highest levels of performance. So, for overall performance and
ease of use, storage-pool striping might offer an excellent compromise for many
environments, especially for larger workload groups where host-level striping techniques or
application-level striping techniques are not widely used or available.
You must distribute the I/O workloads evenly across the available front-end resources:
I/O ports
HA cards
I/O enclosures
You must distribute the I/O workloads evenly across both DS8000 processor complexes
(called storage server 0/CEC#0 and storage server 1/CEC#1) as well.
Configuring the extent pools determines the balance of the workloads across the available
back-end resources, ranks, DA pairs, and both processor complexes.
Each extent pool is associated with an extent pool ID (P0, P1, and P2, for example). Each
rank has a relationship to a specific DA pair and can be assigned to only one extent pool. You
can have as many (non-empty) extent pools as you have ranks. Extent pools can be
expanded by adding more ranks to the pool. However, when assigning a rank to a specific
extent pool, the affinity of this rank to a specific DS8000 processor complex is determined.
There is no predefined hardware affinity of ranks to a processor complex. All ranks that
are assigned to even-numbered extent pools (P0, P2, and P4, for example) form rank group 0
and are serviced by DS8000 processor complex 0. All ranks that are assigned to
odd-numbered extent pools (P1, P3, and P5, for example) form rank group 1 and are
serviced by DS8000 processor complex 1.
To spread the overall workload across both DS8000 processor complexes, a minimum of two
extent pools is required: one assigned to processor complex 0 (for example, P0) and one
assigned to processor complex 1 (for example, P1).
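A minimal DSCLI sketch of such a pool pair (the pool names are examples only) looks like this:

# One FB extent pool per rank group: P0 is managed by processor
# complex 0 and P1 by processor complex 1
mkextpool -rankgrp 0 -stgtype fb ITSO_FB_0
mkextpool -rankgrp 1 -stgtype fb ITSO_FB_1
# Verify the pool-to-rank-group assignment
lsextpool -l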
For a balanced distribution of the overall workload across both processor complexes and both
DA cards of each DA pair, apply the following rules for each type of rank with its RAID level,
storage type (FB or CKD), and disk drive characteristics (disk type, RPM speed, and
capacity):
Assign half of the ranks to even-numbered extent pools (rank group 0) and assign half of
them to odd-numbered extent pools (rank group 1).
Spread ranks with and without spares evenly across both rank groups.
Distribute ranks from each DA pair evenly across both rank groups.
It is important to understand that you might seriously limit the available back-end bandwidth
and thus the system overall throughput if, for example, all ranks of a DA pair are assigned to
only one rank group and thus a single processor complex. In this case, only one DA card of
the DA pair is used to service all the ranks of this DA pair and thus only half of the available
DA pair bandwidth is available.
Use the GUI for creating a new configuration. Although it offers fewer controls, the GUI takes
care of rank and DA pair distribution and balancing when it creates pools.
Figure 4-7 Example of a homogeneously configured DS8000 storage system (single-tier) with two
extent pools
Another example for a homogeneously configured DS8000 storage system with four extent
pools and one workload group, which shares all resources across four extent pools, or two
isolated workload groups that each share half of the resources, is shown in Figure 4-8.
Figure 4-8 Example of a homogeneously configured DS8000 storage system with four extent pools
Another configuration with four extent pools is shown on the right in Figure 4-8 on page 119.
Evenly distribute the 6+P+S and 7+P ranks from all DA pairs across all four extent pools to
obtain the same overall capacity in each extent pool. However, the last capacity in these pools
is only allocated on the 7+P ranks. Use Easy Tier automatic mode management
(auto-rebalance) for these pools. Using four extent pools and storage-pool striping instead of
two can also reduce the failure boundary from one extent pool with 12 ranks (that is, if one
rank fails, all data in the pool is lost) to two distinct extent pools with only six ranks per
processor complex (for example, when physically separating table spaces from logs).
Also, consider separating workloads by using different extent pools with the principles of
workload isolation, as shown in Figure 4-9. The isolated workload can use either the
storage-pool striping EAM or the rotate volumes EAM combined with host-level or application-level
striping. The workload isolation in this example is on the DA pair level (DA2). In addition, there
is one pair of extent pools for resource-sharing workload groups. Instead of manually
spreading the workloads across the ranks in each pool, consider using Easy Tier automatic
mode management (auto-rebalance) for all pools.
Figure 4-9 Example of DS8000 extent pool configuration with workload isolation on DA pair level
Another consideration for the number of extent pools to create is the usage of Copy Services,
such as FlashCopy. If you use FlashCopy, you also might consider a minimum of four extent
pools with two extent pools per rank group or processor complex. If you do so, you can
perform your FlashCopy copies from the P0 volumes (sources) to the P2 volumes (targets),
and vice versa from P2 source volumes to target volumes in pool P0. Likewise, you can
create FlashCopy pairs between the extent pools P1 and P3, and vice versa. This approach
follows the guidelines for FlashCopy performance (staying in the same processor complex
source–target, but having the target volumes on other ranks/different spindles, and preferably
on different DAs), and is also a preferred way when considering the failure boundaries.
Figure 4-10 Example DS8000 configuration with two hybrid extent pools that use Easy Tier
Using dedicated extent pools with an appropriate number of ranks and DA pairs for selected
workloads is a suitable approach for isolating workloads.
The minimum number of required extent pools depends on the following considerations:
The number of isolated and resource-sharing workload groups
The number of different storage types, either FB for Open Systems or IBM i, or CKD for
z Systems
Definition of failure boundaries (for example, separating logs and table spaces to different
extent pools)
In some cases, Copy Services considerations.
Although you are not restricted from assigning all ranks to only one extent pool, the minimum
number of extent pools, even with only one workload on a homogeneously configured
DS8000 storage system, must be two (for example, P0 and P1). You need one extent pool for
each rank group (or storage server) so that the overall workload is balanced across both
processor complexes.
To optimize performance, the ranks for each workload group (either isolated or
resource-sharing workload groups) must be split across at least two extent pools with an
equal number of ranks from each rank group. So, at the workload level, each workload is
balanced across both processor complexes. Typically, you assign an equal number of ranks
from each DA pair to extent pools assigned to processor complex 0 (rank group 0: P0, P2,
and P4, for example) and to extent pools assigned to processor complex 1 (rank group 1: P1,
P3, and P5, for example). In environments with FB and CKD storage (Open Systems and z
Systems), you additionally need separate extent pools for CKD and FB volumes. It is often
useful to have a minimum of four extent pools to balance the capacity and I/O workload
between the two DS8000 processor complexes. Additional extent pools might be needed to
meet individual needs, such as ease of use, implementing tiered storage concepts, or
separating ranks for different DDM types, RAID types, clients, applications, performance, or
Copy Services requirements.
However, the maximum number of extent pools is given by the number of available ranks (that
is, creating one extent pool for each rank).
In most cases, accepting the configurations that the GUI offers when doing initial setup and
formatting already gives excellent results. For specific situations, creating dedicated extent
pools on the DS8000 storage system with dedicated back-end resources for separate
workloads allows individual performance management for business and performance-critical
applications. Compared to “share and spread everything” storage systems that cannot
implement workload isolation concepts, this capability is an outstanding feature of the
DS8000 storage system as an enterprise-class storage system.
With this feature, you can consolidate and manage various application demands with different
performance profiles, which are typical in enterprise environments, on a single storage
system.
Plan an initial assignment of ranks to your planned workload groups, either isolated or
resource-sharing, and extent pools for your capacity requirements. After this initial
assignment of ranks to extent pools and appropriate workload groups, you can create
additional spreadsheets to hold more details about the logical configuration and finally the
volume layout of the array site IDs, array IDs, rank IDs, DA pair association, extent pools IDs,
and volume IDs, and their assignments to volume groups and host connections.
4.9 Planning address groups, LSSs, volume IDs, and CKD PAVs
After creating the extent pools and evenly distributing the back-end resources (DA pairs and
ranks) across both DS8000 processor complexes, you can create host volumes from these
pools. When creating the host volumes, it is important to follow a volume layout scheme that
evenly spreads the volumes of each application workload across all ranks and extent pools
that are dedicated to this workload to achieve a balanced I/O workload distribution across
ranks, DA pairs, and the DS8000 processor complexes.
So, the next step is to plan the volume layout and thus the mapping of address groups and
LSSs to volumes created from the various extent pools for the identified workloads and
workload groups. For performance management and analysis reasons, it can be useful to
easily relate the volumes that belong to a specific I/O workload to the ranks that provide the
physical disk spindles, which service the workload I/O requests and determine the I/O
processing capabilities. Therefore, an overall logical configuration concept that easily
relates volumes to workloads, extent pools, and ranks is wanted.
Each volume is associated with a hexadecimal four-digit volume ID that must be specified
when creating the volume. An example for volume ID 1101 is shown in Table 4-2.
Table 4-2 Understand the volume ID relationship to address groups and LSSs/LCUs

Volume ID   Digits   Description
1101        1        Address group (first digit, 0 - F)
1101        11       LSS ID (FB) or LCU ID (CKD) (first and second digits)
1101        01       Volume number within the LSS/LCU (third and fourth digits, 00 - FF)
The first digit of the hexadecimal volume ID specifies the address group, 0 - F, of that volume.
Each address group can be used only by a single storage type, either FB or CKD. The first
and second digit together specify the logical subsystem ID (LSS ID) for Open Systems
volumes (FB) or the logical control unit ID (LCU ID) for z Systems volumes (CKD). There are
16 LSS/LCU IDs per address group. The third and fourth digits specify the volume number
within the LSS/LCU, 00 - FF. There are 256 volumes per LSS/LCU. The volume with volume
ID 1101 is the volume with volume number 01 of LSS 11, and it belongs to address group 1
(first digit).
In the past, for performance analysis reasons, it was useful to easily identify the association of
specific volumes with ranks or extent pools when investigating resource contention. But, since
the introduction of storage-pool striping, the use of multi-rank extent pools is the preferred
configuration approach for most environments. Multitier extent pools are managed by Easy
Tier automatic mode anyway, constantly providing automatic storage intra-tier and cross-tier
performance and storage economics optimization. For single-tier pools, turn on Easy Tier
management. In managed pools, Easy Tier automatically relocates the data to the
appropriate ranks and storage tiers based on the access pattern, so the extent allocation
across the ranks for a specific volume is likely to change over time. With storage-pool striping
or extent pools that are managed by Easy Tier, you no longer have a fixed relationship
between the performance of a specific volume and a single rank. Therefore, planning for a
hardware-based LSS/LCU scheme and relating LSS/LCU IDs to hardware resources, such as
ranks, is no longer reasonable. Performance management focus is shifted from ranks to
extent pools. However, a numbering scheme that relates only to the extent pool might still be
viable, but it is less common and less practical.
The common approach that is still valid today with Easy Tier and storage pool striping is to
relate an LSS/LCU to a specific application workload with a meaningful numbering scheme
for the volume IDs for the distribution across the extent pools. Each LSS can have 256
volumes, with volume numbers 00 - FF. So, relating the LSS/LCU to a certain application
workload and additionally reserving a specific range of volume numbers for different extent
pools is a reasonable choice, especially in Open Systems environments. Because volume IDs
are transparent to the attached host systems, this approach helps the administrator of the
host system to easily determine the relationship of volumes to extent pools by the volume ID.
Therefore, this approach helps you easily identify physically independent volumes from
different extent pools when setting up host-level striping across pools. This approach helps
you when separating, for example, database table spaces from database logs onto volumes
from physically different drives in different pools.
This approach results in a logical configuration concept that provides ease of use for storage
management operations and reduces management effort when using the DS8000 related
Copy Services because basic Copy Services management steps (such as establishing
Peer-to-Peer Remote Copy (PPRC) paths and consistency groups) are related to LSSs. Even
if Copy Services are not planned, plan the volume layout in this way because overall
management is easier if you must introduce Copy Services in the future (for example, when
migrating to a new DS8000 storage system that uses Copy Services).
However, the strategy for the assignment of LSS/LCU IDs to resources and workloads can
still vary depending on the particular requirements in an environment.
The following section introduces suggestions for LSS/LCU and volume ID numbering
schemes to help to relate volume IDs to application workloads and extent pools.
Typically, when using LSS/LCU IDs that relate to application workloads, the simplest
approach is to reserve a suitable number of LSS/LCU IDs according to the total number of
volumes requested by the application. Then, populate the LSS/LCUs in sequence, creating
the volumes from offset 00. Ideally, all volumes that belong to a certain application workload
or a group of related host systems are within the same LSS. However, because the volumes
must be spread evenly across both DS8000 processor complexes, at least two logical
subsystems are typically required per application workload. One even LSS is for the volumes
that are managed by processor complex 0, and one odd LSS is for volumes managed by
processor complex 1 (for example, LSS 10 and LSS 11). Moreover, consider the future
capacity demand of the application when planning the number of LSSs to be reserved for an
application. So, for those applications that are likely to increase the number of volumes
beyond the range of one LSS pair (256 volumes per LSS), reserve a suitable number of LSS
pair IDs for them from the beginning.
In this example, extent pools P0 and P1 are shared by hosts A1 and A2 (application A, with
LSSs 10 and 11), host B (application B, with LSSs 12 and 13), and host C (application C, with
LSSs 28, 2A, 29, and 2B). The even LSSs are placed in pool P0 and the odd LSSs in pool P1.
Figure 4-11 Application-related volume layout example for two shared extent pools
In Figure 4-12, the workloads are spread across four extent pools. Again, assign two
LSS/LCU IDs (one even, one odd) to each workload to spread the I/O activity evenly across
both processor complexes (both rank groups). Additionally, reserve a certain volume ID range
for each extent pool based on the third digit of the volume ID. With this approach, you can
quickly create volumes with successive volume IDs for a specific workload per extent pool
with a single DSCLI mkfbvol or mkckdvol command.
Hosts A1 and A2 belong to the same application A and are assigned to LSS 10 and LSS 11.
For this workload, use volume IDs 1000 - 100f in extent pool P0 and 1010 - 101f in extent pool
P2 on processor complex 0. Use volume IDs 1100 - 110f in extent pool P1 and 1110 - 111f in
extent pool P3 on processor complex 1. In this case, the administrator of the host system can
easily relate volumes to different extent pools and thus different physical resources on the
same processor complex by looking at the third digit of the volume ID. This numbering
scheme can be helpful when separating, for example, DB table spaces from DB logs on to
volumes from physically different pools.
Figure 4-12 Application and extent pool-related volume layout example for four shared extent pools
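Assuming the volume ID ranges of this example (the capacity and volume names are placeholders), the volumes for application A can be created with one command per extent pool:

# Application A, even LSS 10 on processor complex 0
mkfbvol -extpool P0 -cap 64 -name hostA_#h 1000-100F
mkfbvol -extpool P2 -cap 64 -name hostA_#h 1010-101F
# Application A, odd LSS 11 on processor complex 1
mkfbvol -extpool P1 -cap 64 -name hostA_#h 1100-110F
mkfbvol -extpool P3 -cap 64 -name hostA_#h 1110-111F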
The example that is depicted in Figure 4-13 on page 127 provides a numbering scheme that
can be used in a FlashCopy scenario. Two different pairs of LSS are used for source and
target volumes. The address group identifies the role in the FlashCopy relationship: address
group 1 is assigned to source volumes, and address group 2 is used for target volumes. This
numbering scheme allows a symmetrical distribution of the FlashCopy relationships across
source and target LSSs. For example, source volume 1007 in P0 uses the volume 2007 in P2
as the FlashCopy target. In this example, use the third digit of the volume ID within an LSS as
a marker to indicate that source volumes 1007 and 1017 are from different extent pools. The
same approach applies to the target volumes, for example, volumes 2007 and 2017 are from
different pools.
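With this numbering scheme, the FlashCopy relationships can be established in a symmetrical way, as in the following sketch (the volume IDs follow this example, and command options such as persistent or incremental FlashCopy are omitted):

# Source volumes in address group 1, target volumes in address group 2,
# same volume number, with source and target in different extent pools
# on the same processor complex
mkflash 1007:2007
mkflash 1017:2017
mkflash 1107:2107
mkflash 1117:2117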
Figure 4-14 Create volumes in Custom mode and specify an LSS range
For a DS8880 storage system, four-port and eight-port HA card options are available.
16 Gbps four-port HAs satisfy the highest throughput requirements. For the 8 Gbps HAs,
because the maximum available bandwidth is the same for four-port and eight-port HA cards,
the eight-port HA card provides additional connectivity but no additional performance.
Furthermore, the HA card maximum available bandwidth is less than the nominal aggregate
bandwidth and depends on the workload profile. These specifications must be considered
when planning the HA card port allocation and especially for workloads with high sequential
throughputs. Be sure to contact your IBM representative or IBM Business Partner for an
appropriate sizing, depending on your actual workload requirements. With typical
transaction-driven workloads that show high numbers of random, small-block I/O
operations, all ports in an HA card can be used equally. For the preferred performance of
workloads with different I/O characteristics, consider the isolation of large-block sequential
and small-block random workloads at the I/O port level or the HA card level.
The preferred practice is to use dedicated I/O ports for Copy Services paths and host
connections. For more information about performance aspects that are related to Copy
Services, see the performance-related chapters in IBM DS8870 Copy Services for IBM z
Systems, SG24-6787 (for z Systems) and IBM DS8870 Copy Services for Open Systems,
SG24-6788 (for Open Systems).
When using the DSCLI, FB volumes must be grouped into DS8000 volume groups so that
they can be assigned to the attached Open Systems hosts by using LUN masking. A volume
group can be assigned to multiple host connections, and each host connection is specified by
the worldwide port name (WWPN) of the host FC port. A set of host connections from the
same host system is called a host attachment. The same volume group can be assigned to
multiple host connections; however, a host connection can be associated only with one
volume group. To share volumes between multiple host systems, the most convenient way is
to create a separate volume group for each host system and assign the shared volumes to
each of the individual volume groups as required. A single volume can be assigned to multiple
volume groups. Only if a group of host systems shares a set of volumes, and there is no need
to assign additional non-shared volumes independently to particular hosts of this group, can
you consider using a single shared volume group for all host systems to simplify
management. Typically, there are no significant DS8000 performance implications because of
the number of DS8000 volume groups or the assignment of host attachments and volumes to
the DS8000 volume groups.
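As a hedged DSCLI sketch (the volume IDs, WWPNs, host type, and the volume group ID V11 are hypothetical, and the exact parameters depend on your environment), creating one volume group per host and attaching two host ports to it might look like the following commands:
dscli> mkvolgrp -type scsimask -volume 1000-1003 host1_vg
dscli> mkhostconnect -wwname 10000000C9AABB01 -hosttype pSeries -volgrp V11 host1_fcs0
dscli> mkhostconnect -wwname 10000000C9AABB02 -hosttype pSeries -volgrp V11 host1_fcs1
dscli> chvolgrp -action add -volume 1004 V11
The last command shows how an additional (for example, shared) volume can be added to the existing volume group later.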
Do not omit additional host attachment and host system considerations, such as SAN zoning,
multipathing software, and host-level striping. For more information, see Chapter 8, “Host
attachment” on page 267, Chapter 9, “Performance considerations for UNIX servers” on
page 285, Chapter 12, “Performance considerations for Linux” on page 385, and Chapter 14,
“Performance considerations for IBM z Systems servers” on page 459.
After the DS8000 storage system is installed, you can use the DSCLI lsioport command to
display and document I/O port information, including the I/O ports, HA type, I/O enclosure
location, and WWPN. Use this information to add specific I/O port IDs, the required protocol
(FICON or FCP), and the DS8000 I/O port WWPNs to the plan of host and remote mirroring
connections that are identified in 4.4, “Planning allocation of disk and host connection
capacity” on page 94.
The DS8000 I/O ports use predetermined, fixed DS8000 logical port IDs in the form I0xyz,
where:
x: I/O enclosure
y: Slot number within the I/O enclosure
z: Port within the adapter
Slot numbers: The slot numbers for logical I/O port IDs are one less than the physical
location numbers for HA cards, as shown on the physical labels and in IBM Spectrum
Control/Tivoli Storage Productivity Center for Disk, for example, I0101 is R1-XI2-C1-T2.
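As a quick, hedged illustration (the output columns vary by code level), you can list the I/O ports and decode their logical IDs:
dscli> lsioport -l
For example, the logical port ID I0101 decodes to I/O enclosure 1, slot 0, port 1, which corresponds to the physical location R1-XI2-C1-T2 that is mentioned in the note above.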
A simplified example of spreading the DS8000 I/O ports evenly to two redundant SAN fabrics
is shown in Figure 4-15 on page 131. The SAN implementations can vary, depending on
individual requirements, workload considerations for isolation and resource-sharing, and
available hardware resources.
Figure 4-15 Example of spreading DS8000 I/O ports evenly across two redundant SAN fabrics
Look at Example 4-2, which shows a DS8886 storage system with a selection of different
HAs:
You see four-port LW (longwave / single-mode) HAs, for example, I000x or I013x, which
are already configured for a FICON topology here.
You see four-port 16 Gbps SW (shortwave / multi-mode) HAs, for example, I003x.
All SW (shortwave) HAs are configured with the FCP protocol.
You see some eight-port HAs (8 Gbps), for example, I030x.
You see that corresponding HA types are balanced between I/O enclosures when the machine comes from the manufacturing site. For example:
– For the 16 Gbps HAs, of which there are four, each is in a different I/O enclosure (0, 1, 2, and 3).
– For the 8 Gbps eight-port, you have one in I/O enclosure 2 and one in I/O enclosure 3.
– For the 8 Gbps LW, you have one in I/O enclosure 0 and one in I/O enclosure 1.
When planning the paths for the host systems, ensure that each host system uses a
multipathing device driver and a minimum of two host connections to two different HA cards in
different I/O enclosures on the DS8000. Preferably, they are evenly distributed between left
side (even-numbered) I/O enclosures and the right side (odd-numbered) I/O enclosures for
highest availability. Multipathing additionally optimizes workload spreading across the
available I/O ports, HA cards, and I/O enclosures.
You must tune the SAN zoning scheme to balance both the oversubscription and the
estimated total throughput for each I/O port to avoid congestion and performance bottlenecks.
After the logical configuration is planned, you can use either the DS Storage Manager GUI or
the DSCLI to implement it on the DS8000 by completing the following steps:
1. Change the password for the default user (admin) for DS Storage Manager and DSCLI.
2. Create additional user IDs for DS Storage Manager and DSCLI.
3. Apply the DS8000 authorization keys.
4. Create arrays.
5. Create ranks.
6. Create extent pools.
7. Assign ranks to extent pools.
8. Create CKD Logical Control Units (LCUs).
9. Create CKD volumes.
10.Create CKD PAVs.
11.Create FB LUNs.
12.Create Open Systems host definitions.
13.Create Open Systems DS8000 volume groups.
14.Assign Open Systems hosts and volumes to the DS8000 volume groups.
15.Configure I/O ports.
16.Implement SAN zoning, multipathing software, and host-level striping, as needed.
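The following minimal DSCLI sketch illustrates several of these steps for a simple FB-only configuration. All object IDs, names, capacities, and the WWPN are hypothetical, and the exact parameters (RAID type, storage type, volume type, and port topology values) depend on your hardware, code level, and requirements, so treat this as an outline rather than a definitive procedure:
mkarray -raidtype 6 -arsite S1                      # step 4: create an array
mkrank -array A0 -stgtype fb                        # step 5: create a rank
mkextpool -rankgrp 0 -stgtype fb fb_pool_0          # step 6: create an extent pool
chrank -extpool P0 R0                               # step 7: assign the rank to the pool
mkfbvol -extpool P0 -cap 100 -type ds -name host1_#h 1000-100F   # step 11: create FB LUNs
mkvolgrp -type scsimask -volume 1000-100F host1_vg  # step 13: create a volume group
mkhostconnect -wwname 10000000C9AABB01 -hosttype pSeries -volgrp V11 host1_fcs0   # steps 12 and 14
setioport -topology scsi-fcp I0030                  # step 15: configure an I/O port for FCP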
You can use this information with a planning spreadsheet to document the logical
configuration.
Figure 4-16 DS Storage Manager GUI - exporting System Summary (output truncated)
Example 4-3 Example of a minimum DSCLI script get_config.dscli to gather the logical configuration
> dscli -cfg profile/DEVICE.profile -script get_config.dscli > DEVICE_SN_config.out
lsarraysite -l
lsarray -l
lsrank -l
lsextpool -l
lsaddressgrp
lslss # Use only if FB volumes have been configured
#lslcu # Use only if CKD volumes and LCUs have been configured
# otherwise the command returns an error and the script terminates.
lsioport -l
lshostconnect
lsvolgrp
To help automate the process of gathering such data, VBScript and Excel macro programs were written and combined to provide a quick and easy-to-use interface to DS8000 storage servers through the DSCLI. The DSCLI output is passed to an Excel macro that generates a summary workbook with detailed configuration information.
The tool is started from a desktop icon on a Microsoft Windows system. A VBScript program
is included to create the icon with a link to the first Excel macro that displays a DS8000
Hardware Management Console (HMC) selection window to start the query process.
The DS8QTOOL uses non-intrusive list and show commands to query and report on system
configurations. The design point of the programs is to automate a repeatable process of
creating configuration documentation for a specific DS8000 storage system.
You can obtain this tool, along with other tools, such as DS8CAPGEN, from this IBM website:
http://congsa.ibm.com/~dlutz/
The information in this chapter is not specific to the IBM System Storage DS8000. You can apply it generally to other disk storage systems.
In general, you describe a workload in terms of its access pattern (random or sequential), read/write ratio, transfer size, and cache behavior. The following sections cover the details and describe the different workload types.
A 100% random access workload is rare, which you must remember when you size the disk
system.
Important: Because of large block access and high response times, physically separate
sequential workloads from random small-block workloads. Do not mix random and
sequential workloads on the same physical disk. If you do, large amounts of cache are
required on the disk system. Typically, high response times for small-block random access indicate the presence of sequential write activity (foreground or background) on the same disks.
Plan when to run batch jobs: Plan all batch workload activity for the end of the day or at
a slower time of day. Normal activity can be negatively affected by the batch activity.
For more information about z/OS DFSORT, see the following websites:
http://www.ibm.com/support/docview.wss?rs=114&uid=isg3T7000077
http://www.ibm.com/support/knowledgecenter/api/redirect/zos/v1r11/index.jsp?topic
=/com.ibm.zos.r11.iceg200/abstract.htm
It is not easy to divide known workload types into cache-friendly and cache-unfriendly categories. An application can change its behavior several times during the day. When users work with data interactively, the workload is cache friendly. When batch processing or reporting starts, it is not. A high proportion of random access usually indicates a cache-unfriendly workload. However, if the amount of data that is accessed randomly is small, 10% of the total for example, it can be placed entirely into the disk system cache and becomes cache friendly.
Sequential workloads are always cache friendly because of the prefetch algorithms in the disk system. A sequential workload is easy to prefetch: you know that the next 10 or 100 blocks will be accessed, so you can read them in advance. Random workloads are different. However, real applications rarely produce purely random workloads, so some access patterns can still be predicted. The DS8000 storage systems use the following powerful read-caching algorithms to deal with cache-unfriendly workloads:
Sequential Prefetching in Adaptive Replacement Cache (SARC)
Adaptive Multi-stream Prefetching (AMP)
A write workload is always cache friendly because every write request goes to the cache first, and the application receives the reply as soon as the request is placed into cache. On the back end, write requests take at least twice as long to service as read requests, and the application must always wait for the write acknowledgment, which is why cache is used for every write request. However, further improvement is possible. The DS8800 and DS8700 storage systems use the Intelligent Write Caching (IWC) algorithm, which makes the handling of write requests more effective.
To learn more about the DS8000 caching algorithms, see the following website:
http://www.redbooks.ibm.com/abstracts/sg248323.html
Table 5-1 on page 145 provides a summary of the characteristics of the various types of
workloads.
The database environment is often difficult to typify because I/O characteristics differ greatly.
A database query has a high read content and is of a sequential nature. It also can be
random, depending on the query type and data structure. Transaction environments are more
random in behavior and are sometimes cache-unfriendly. At other times, they have good hit
ratios. You can implement several enhancements in databases, such as sequential prefetch
and the exploitation of I/O priority queuing, that affect the I/O characteristics. Users must
understand the unique characteristics of their database capabilities before generalizing the
performance.
The workload pattern for logging is mostly sequential writes with a blocksize of about 64 KB. Reads are rare and usually can be disregarded. The write capability and location of the online transaction logs are most important because the overall performance of the database depends on the writes to the online transaction logs. If you expect high write rates to the database, plan to place the online transaction logs on a RAID 10 array. Also, as a preferred practice, physically separate the log files from the disks that hold the data and index files. For more information, see Chapter 17, “Databases for open performance” on page 513.
A database can benefit from using a large amount of server memory for the large buffer pool.
For example, the database large buffer pool, when managed correctly, can avoid a large
percentage of the accesses to disk. Depending on the application and the size of the buffer
pool, this large buffer pool can convert poor cache hit ratios into synchronous reads in DB2.
You can spread data across several RAID arrays to increase the throughput even if all
accesses are read misses. DB2 administrators often require that table spaces and their
indexes are placed on separate volumes. This configuration improves both availability and
performance.
Digital video editing (Table 5-1 entry): read/write ratio of 100/0, 0/100, or 50/50; transfer sizes of 128 KB and 256 KB - 1024 KB; sequential access with good caching.
An example of a data warehouse is a design around a financial institution and its functions,
such as loans, savings, bank cards, and trusts for a financial institution. In this application,
there are three kinds of operations: initial loading of the data, access to the data, and
updating of the data. However, because of the fundamental characteristics of a warehouse,
these operations can occur simultaneously. At times, this application can perform 100% reads
when accessing the warehouse, 70% reads and 30% writes when accessing data while
record updating occurs simultaneously, or even 50% reads and 50% writes when the user
load is heavy. The data within the warehouse is a series of snapshots and after the snapshot
of data is made, the data in the warehouse does not change. Therefore, there is typically a
higher read ratio when using the data warehouse.
Object-Relational DBMSs (ORDBMSs) are being developed, and they offer traditional
relational DBMS features and support complex data types. Objects can be stored and
manipulated, and complex queries at the database level can be run. Object data is data about
real objects, including information about their location, geometry, and topology. Location
describes their position, geometry relates to their shape, and topology includes their
relationship to other objects. These applications essentially have an identical profile to that of
the data warehouse application.
Depending on the host and operating system that are used to perform this application,
transfers are typically medium to large and access is always sequential. Image processing
consists of moving huge image files for editing. In these applications, the user regularly
moves huge high-resolution images between the storage device and the host system. These
applications service many desktop publishing and workstation applications. Editing sessions
can include loading large files of up to 16 MB into host memory, where users edit, render,
modify, and store data onto the storage system. High interface transfer rates are needed for
these applications, or the users waste huge amounts of time by waiting to see results. If the
interface can move data to and from the storage device at over 32 MBps, an entire 16 MB
image can be stored and retrieved in less than 1 second. The need for throughput is all
important to these applications and along with the additional load of many users, I/O
operations per second (IOPS) are also a major requirement.
For general rules for application types, see Table 5-1 on page 145.
Transaction distribution
Table 5-4 breaks down the number of times that key application transactions are run by
the average user and how much I/O is generated per transaction. Detailed application and
database knowledge is required to identify the number of I/Os and the type of I/Os per
transaction. The following information is a sample.
Table 5-5 Logical I/O profile from user population and transaction profiles
Transfer money to checking: 0.5 iterations per user; 4 reads and 4 writes per transaction (random read (RR) and random write (RW) I/Os); average user I/Os 1000, 1000; peak I/Os 3000 R/W
Transfer money to checking 1000, 1000 100 RR, 1000 RW 300 RR, 3000 RW
Configure new bill payee 1000, 1000 100 RR, 1000 RW 300 RR, 3000 RW
As you can see in Table 5-6, to meet the peak workloads, you must design an I/O
subsystem to support 6000 random reads/sec and 6000 random writes/sec:
Physical I/Os The number of physical I/Os per second from the host perspective
RR Random Read I/Os
RW Random Write I/Os
To determine the appropriate configuration to support your unique workload, see Appendix A,
“Performance management process” on page 551.
It is also possible to get the performance data for the DA pair or the rank. See
Example 5-2.
Example 5-2 Output of the performance data for 20 hours for rank 17 (output truncated)
dscli> lsperfrescrpt -start 20h r17
time resrc avIO avMB avresp %Hutl %hlpT %dlyT %impt
===================================================================
2015-11-11/09:15:00 R17 0 0.000 0.000 0 0 0 0
2015-11-11/09:20:00 R17 0 0.000 0.000 0 0 0 0
2015-11-11/09:25:00 R17 0 0.000 0.000 0 0 0 0
2015-11-11/09:30:00 R17 0 0.000 0.000 0 0 0 0
2015-11-11/09:35:00 R17 0 0.000 0.000 0 0 0 0
2015-11-11/09:40:00 R17 0 0.000 0.000 0 0 0 0
2015-11-11/09:45:00 R17 0 0.000 0.000 0 0 0 0
2015-11-11/09:50:00 R17 0 0.000 0.000 0 0 0 0
2015-11-11/09:55:00 R17 0 0.000 0.000 0 0 0 0
2015-11-11/10:00:00 R17 0 0.000 0.000 0 0 0 0
By default, statistics are shown for 1 hour. You can use settings that are specified in days,
hours, and minutes.
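As a hedged example (verify the exact duration syntax for your DSCLI level), you can request a longer reporting window by changing the -start value, for example, to show the last day of data for the same rank:
dscli> lsperfrescrpt -start 1d r17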
IBM Spectrum Control™
IBM Spectrum Control is the tool to monitor the workload on your DS8000 storage for a
long period and collect historical data. This tool can also create reports and provide alerts.
For more information, see 7.2.1, “IBM Spectrum Control overview” on page 223.
These commands are standard tools that are available with most UNIX and UNIX-like (Linux)
systems. Use iostat to obtain the data that you need to evaluate your host I/O levels. Specific
monitoring tools are also available for AIX, Linux, Hewlett-Packard UNIX (HP-UX), and Oracle
Solaris.
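As a hedged example on a Linux host (iostat is provided by the sysstat package; the flags differ slightly on AIX, HP-UX, and Solaris), the following command prints extended per-device statistics in kilobytes every 5 seconds, three times:
iostat -xk 5 3
Compare the device-level response times and throughput that are reported here with the values that are observed on the DS8000 storage system.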
For more information, see Chapter 9, “Performance considerations for UNIX servers” on
page 285 and Chapter 12, “Performance considerations for Linux” on page 385.
For more information, see Chapter 10, “Performance considerations for Microsoft Windows
servers” on page 335.
IBM i environment
IBM i provides a vast selection of performance tools that can be used in performance-related
cases with external storage. Several of the tools, such as Collection services, are integrated
in the IBM i system. Other tools are part of an IBM i licensed product. The management of many IBM i performance tools is integrated into the IBM i web graphical user interface, IBM Systems Director Navigator for i, or into the iDoctor product.
The IBM i tools, such as Performance Explorer and iDoctor, are used to analyze the hot data
in IBM i and to size solid-state drives (SSDs) for this environment. Other tools, such as Job
Watcher, are used mostly in solving performance problems, together with the tools for
monitoring the DS8000 storage system.
For more information about the IBM i tools and their usage, see 13.6.1, “IBM i performance
tools” on page 443.
The RMF Spreadsheet Reporter is an easy way to create Microsoft Excel charts based on RMF postprocessor reports. It is used to convert your RMF data to spreadsheet format and generate representative charts for all performance-relevant areas.
For more information, see Chapter 14, “Performance considerations for IBM z Systems
servers” on page 459.
IBM specialists and IBM Business Partner specialists use the IBM Disk Magic tool for
modeling the workload on the systems. Disk Magic can be used to help to plan the DS8000
hardware configuration. With Disk Magic, you model the DS8000 performance when
migrating from another disk system or when making changes to an existing DS8000
configuration and the I/O workload. Disk Magic is for use with both z Systems and Open
Systems server workloads.
When running the DS8000 modeling, you start from one of these scenarios:
An existing non-DS8000 disk system that you want to migrate to a DS8000 storage system
An existing DS8000 workload
A planned new workload, even if you do not have the workload running on any disk system yet
You can model the following major DS8000 components by using Disk Magic:
DS8000 model: DS8300, DS8300 Turbo, DS8700, DS8886, and DS8884 models
Cache size for the DS8000 storage system
Number, capacity, and speed of disk drive modules (DDMs)
Number of arrays and RAID type
Type and number of DS8000 host adapters (HAs)
Type and number of channels
Remote Copy option
When working with Disk Magic, always ensure that you input accurate and representative
workload information because Disk Magic results depend on the input data that you provide.
Also, carefully estimate the future demand growth that you input to Disk Magic for modeling
projections. The hardware configuration decisions are based on these estimates.
For more information about using Disk Magic, see 6.1, “Disk Magic” on page 160.
Workload testing
There are various reasons for conducting I/O load tests. They all start with a hypothesis and
have defined performance requirements. The objective of the test is to determine whether the
hypothesis is true or false. For example, a hypothesis might be that you think that a DS8884
storage system with 18 disk arrays and 128 GB of cache can support 10,000 IOPS with a
70/30/50 workload and the following response time requirements:
Read response times: 95th percentile < 10 ms
Write response times: 95th percentile < 5 ms
With these configuration settings, you can simulate and test most types of workloads. Specify
the workload characteristics to reflect the workload in your environment.
To test the sequential read speed of a rank, run the following command:
time dd if=/dev/rvpath0 of=/dev/null bs=128k count=781
The rvpath0 is the character or raw device file for the LUN that is presented to the operating
system by SDD. This command reads 100 MB off rvpath0 and reports how long it takes in
seconds. Take 100 MB and divide by the number of seconds that is reported to determine the
MBps read speed.
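For example (illustrative numbers only), if the dd command completes in 4 seconds, the sequential read speed is approximately 100 MB / 4 s = 25 MBps.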
Linux: For Linux systems, use the appropriate /dev/sdX device or /dev/mpath/mpathn
device if you use Device-Mapper multipath.
Run the following command and start the nmon monitor or iostat -k 1 command in Linux:
dd if=/dev/rvpath0 of=/dev/null bs=128k
Your nmon monitor (the e option) reports that this previous command imposed a sustained 100
MBps bandwidth with a blocksize=128 K on vpath0. Notice the xfers/sec column; xfers/sec is
IOPS. Now, if your dd command did not error out because it reached the end of the disk,
press Ctrl+c to stop the process. Now, nmon reports an idle status. Next, run the following dd
command with a 4 KB blocksize and put it in the background:
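dd if=/dev/rvpath0 of=/dev/null bs=4k &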
For this command, nmon reports a lower MBps but a higher IOPS, which is the nature of I/O as
a function of blocksize. Run your dd sequential read command with a larger blocksize (for example, bs=1024k) and you see a high MBps but a reduced IOPS.
Try different blocksizes, different raw vpath devices, and combinations of reads and writes.
Run the commands against the block device (/dev/vpath0) and notice that blocksize does not
affect performance.
Because the dd command generates a sequential workload, you still must generate the
random workload. You can use a no-charge open source tool, such as Vdbench.
Vdbench is a disk and tape I/O workload generator for verifying data integrity and measuring
the performance of direct-attached and network-connected storage on Windows, AIX, Linux,
Solaris, OS X, and HP-UX. It uses workload profiles as the inputs for the workload modeling
and has its own reporting system. All output is presented in HTML files as reports and can be
analyzed later. For more information, see the following website:
http://www.oracle.com/technetwork/server-storage/vdbench-downloads-1901681.html
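As a hedged sketch (the device path and all parameter values are placeholders to adapt to your environment), a minimal Vdbench parameter file for a 70% read, 100% random, 4 KB workload might look like the following example, which is run with ./vdbench -f random_test.cfg:
sd=sd1,lun=/dev/rvpath0
wd=wd1,sd=sd1,xfersize=4k,rdpct=70,seekpct=100
rd=run1,wd=wd1,iorate=max,elapsed=60,interval=5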
The examples in this chapter demonstrate the steps that are required to model a storage
system for certain workload requirements. The examples show how to model a DS8880
storage system and provide guidance about the steps that are involved in this process.
Disk Magic: Disk Magic is available for use by IBM representatives, IBM Business
Partners, and users. Clients must contact their IBM representative to run Disk Magic studies when planning their DS8000 hardware configurations.
Disk Magic for Windows is a product of IntelliMagic B.V. Although this book refers to this
product as Disk Magic, the product that is available from IntelliMagic for clients was
renamed to IntelliMagic Direction and contains more features.
This chapter provides basic examples only for the use of Disk Magic in this book. The
version of the product that is used in these examples is current as of the date of the writing
of this book. For more information, see the product documentation and guides. For more
information about the current client version of this product, go to the IntelliMagic website:
http://www.intellimagic.com
Performing an extensive and elaborate lab benchmark by using the correct hardware and
software provides a more accurate result because it is real-life testing. Unfortunately, this
approach requires much planning, time, and preparation, plus a significant amount of
resources, such as technical expertise and hardware/software in an equipped lab.
Doing a Disk Magic study requires much less effort and resources. Retrieving the
performance data of the workload and getting the configuration data from the servers and the
disk subsystems (DSSs) is all that is required from the client. With this data, the IBM
representative or IBM Business Partner can use Disk Magic to do the modeling.
Disk Magic is calibrated to match the results of lab runs that are documented in sales
materials and white papers. You can view it as an encoding of the data that is obtained in
benchmarks and reported in white papers.
When the Disk Magic model is run, it is important to size each component of the storage
server for its peak usage period, usually a 15- or 30-minute interval. Using a longer period
tends to average out the peaks and non-peaks, which does not give a true reading of the
maximum demand.
Different components can peak at different times. For example, a processor-intensive online
application might drive processor utilization to a peak while users are actively using the
system. However, disk utilization might be at a peak when the files are backed up during
off-hours. So, you might need to model multiple intervals to get a complete picture of your
processing environment.
Some of this information can be obtained from the reports created by RMF Magic.
In a z/OS environment, running a Disk Magic model requires the System Management
Facilities (SMF) record types 70 - 78. There are two different ways to send this SMF data:
If you have RMF Magic available
If you do not have access to RMF Magic
To pack the SMF data set into a ZRF file, complete the following steps:
1. Install RMFPACK on your z/OS system.
2. Prepare the collection of SMF data.
3. Run the $1SORT job to sort the SMF records.
4. Run the $2PACK job to compress the SMF records and to create the ZRF file.
When uploading the data to the IBM FTP website, use the following information:
The FTP site is testcase.boulder.ibm.com.
The user ID is anonymous.
The password is your email user ID, for example, [email protected].
The directory to put the data into is eserver/toibm/zseries.
Notify IBM or the Business Partner about the file name that you use to create your FTP
file.
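A hedged example of such an upload from a command prompt follows (log in with the user ID anonymous and your email address as the password; the ZRF file name is hypothetical, and your site might require passive mode or a different FTP client):
ftp testcase.boulder.ibm.com
cd eserver/toibm/zseries
binary
put MYSYS_peak.zrf
quit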
Data collection
The preferred data collection method for a Disk Magic study is using Spectrum Control. For
each control unit to be modeled, collect performance data, create a report for each control
unit, and export each report as a comma-separated values (CSV) file. You can obtain the
detailed instructions for this data collection from your IBM representative.
When working with Disk Magic, always ensure that you feed in accurate and representative
workload information because Disk Magic results depend on the input data that is provided.
Also, carefully estimate future demand growth, which is fed into Disk Magic for modeling
projections on which the hardware configuration decisions are made.
After the valid base model is created, you proceed with your modeling. You change the
hardware configuration options of the base model to determine the preferred DS8880
configuration for a certain workload, or you can modify the workload values that you initially
entered, so that, for example, you can see what happens when your workload grows or its
characteristics change.
When doing Disk Magic modeling, you must model two or three different measurement
periods:
Peak I/O rate
Peak read + write throughput in MBps
Peak write throughput in MBps if you are also modeling a DSS running one of the remote
copy options
When you create a model for a DS8000 storage system, select New SAN Project. The
window that is shown in Figure 6-2 on page 165 opens.
The preferred option is to use an automated input process to load the RMF data into Disk
Magic by using a DMC file. To read the DMC file as automated input into Disk Magic,
select zSeries or WLE Automated input (*.DMC) in the New SAN Project window, as
shown in Figure 6-2 on page 165. By using automated input, you can have Disk Magic process the model at the DSS, logical control unit (LCU), or device level.
Considerations for using one or the other option are provided in the Disk Magic help text
under “How to Perform Device Level Modeling”. The simplest way to model a DSS is using
the DSS option, where the model contains the workload statistics by DSS. These DMC
files must be created first by using RMF Magic.
For example, JL292059 means that the DMC file was created for the RMF period of July 29 at
20:59.
2. In this particular example, select the JL292059.dmc file, which opens the window that is
shown in Figure 6-5. This particular DMC file was chosen because it represents the peak
I/O rate period.
9. Click Base to create the base model for this DS8870 storage system. It is possible that a
base model cannot be created from the input workload statistics, for example, if there is
excessive CONN time that Disk Magic cannot calibrate against the input workload
statistics. In this case, you must identify another DMC from a different period, and try to
create the base model from that DMC file. For example, if this model is for the peak I/O
rate, you should try the period with the second highest I/O rate.
After creating this base model for IBM-EFGH1, you must also create the base model for
IBM-ABCD1 by following this same procedure. The DS8870 IBM-ABCD1 storage system
has the same physical configuration as IBM-EFGH1, but with different workload
characteristics.
You can select the cache size. In this case, select 1024 GB because each of the two
DS8870 storage systems has 512 GB cache.
Disk Magic computes the number of HAs on the DS8880 storage system based on the
specification on the Interfaces page, but you can, to a certain extent, override these
numbers. The Fibre ports are used for Peer-to-Peer Remote Copy (PPRC) links. Enter 12
into the FICON HAs field and 8 into the Fibre HAs field.
3. Click the Interfaces tab to open the From Servers dialog box (Figure 6-17 on page 179).
Because the DS8870 FICON ports are running at 8 Gbps, you must update this option on
all the LPARs and also on the From Disk Subsystem to 16 Gbps. If the Host CPC uses
different FICON channels than the FICON channels that are specified, it also must be
updated.
Select and determine the Remote Copy Interfaces. Select the Remote Copy type and the
connections that are used for the Remote Copy links.
4. To select the DDM capacity and RPM used, click the zSeries Disk tab, as shown in
Figure 6-18. Do not make any changes here, and let Disk Magic select the DDM/SSD
configuration based on the source DSSs.
6. Perform the merge procedure. From the Merge Target window (Figure 6-20), click Start
Merge.
Figure 6-21 z Systems - DS8886 disk subsystem created as the merge result
11.Figure 6-25 on page 185 shows that you upgraded the host server FICON channels to
FEx16S. This upgrade also requires that you upgrade, by using the From Disk
Subsystem tab, to make the Server Side also use the FEx16S channels.
12.For this configuration, update the SSD drives to flash drives (Figure 6-26). Click Solve to
create the new Disk Magic configuration model.
Figure 6-26 z Systems - replace the SSD drives with flash drives
As you can see from the examples in this section, you can modify many of the resources of
the target DSS, for example, replacing some of the 600 GB/15 K RPM DDM with flash drives,
and let Disk Magic model the new configuration. This way, if the DS8886 model shows
resource bottlenecks, you can make the modifications to try to eliminate those bottlenecks.
2. With Figure 6-29 open, press and hold the Ctrl key and select IBM-EFGH1, IBM-ABCD1,
and MergeResult2. Right-click any of them and a small window opens. Select Graph from
this window. In the window that opens (Figure 6-30 on page 189), click Clear to clear any
prior graph option settings.
3. Click Plot to produce the response time components graph of the three DSSs that you
selected in a Microsoft Excel spreadsheet. Figure 6-31 is the graph that is created based
on the numbers from the Excel spreadsheet.
The response time that is shown here is 0.56 msec, which is slightly different compared to the
0.55 msec shown in Figure 6-27 on page 186. This difference is caused by rounding in Disk Magic.
2. Click Range Type, and choose I/O Rate, which completes the From field with the I/O rate of the current workload, which is 186,093.7 IOPS. You can change the to field to any value; in this example, we change it to 321,000 IOPS and the by field to 10,000 IOPS. These changes create a plot that starts at 186,093.7 IOPS and increments each point by 10,000 IOPS until the I/O rate reaches the maximum rate that is equal to or less than 321,000 IOPS.
3. Click New Sheet to create the next plot on a new sheet in the Excel file and then click
Plot. An error message displays and informs you that the DA is saturated. The projected
response time report is shown in Figure 6-33 on page 191.
4. Click Utilization Overview in the Graph Data choices, then click New Sheet and Plot to
produce the chart that is shown in Figure 6-34.
This utilization growth projection chart shows how much these DS8886 resources increase as
the I/O rate increases. Here, you can see that the first bottleneck that is reached is the DA.
At the bottom of the chart, you can see that Disk Magic projected that the DS8886 storage
system can sustain a workload growth of up to 43%, as shown in Figure 6-33. Additional
ranks and additional DAs should be planned, which might include an additional DSS.
The next step is to repeat the same modeling steps starting with 6.2.1, “Processing the DMC
file” on page 167, for two other peak periods, which are:
Peak read + write MBps to check how the overall DS8886 storage system performs under
this stress load
Peak write MBps to check whether the SMP, Bus, and Fibre links can handle this peak
PPRC activity.
In this example, we use the comma-separated values (CSV) file created by the Open and
iSeries Automated Input.
Typically, when doing a Disk Magic study, model the following periods:
Peak I/O period
Peak Read + Write throughput in MBps
Peak Write throughput in MBps, if you are doing a Remote Copy configuration
6.3.1 Processing the CSV output file to create the base model for the DS8870
storage system
To process the CSV files, complete the following steps:
1. From the Welcome to Disk Magic window, which is shown in Figure 6-35 on page 193,
select New SAN Project and click OK. The result is shown in Figure 6-36 on page 193.
Select Open and iSeries Automated Input and then click OK.
3. Now, you see Figure 6-39 on page 195, and from here select the row with the date of Nov.
3 at 03:10:00 AM. This is the period where the DS8870 storage system reaches the peak
I/O Rate and also the peak Write MBps. Click Add Model and then click Finish. You now
see the Disk Magic configuration window (Figure 6-40 on page 195). In this window,
delete all the other DSSs and show only the DS8870 storage system (16 core) with 512
GB of cache (IBM-75PQR21) that is migrated to the DS8886 storage system. Click Easy
Tier Settings.
Figure 6-43 Open Systems - server to disk subsystem and PPRC interfaces
7. In Figure 6-45, click Base to create the base of this DS8870 IBM-PQR21 storage system.
Figure 6-48 Open Systems - server to disk subsystem and PPRC interfaces
2. For the Graph Data drop-down menu, select Utilization Overview, and for the Range
Type drop-down menu, select I/O Rate.
3. Click Plot, which opens the dialog box that is shown in Figure 6-51. This dialog box states
that at a certain point when projecting the I/O Rate growth, Disk Magic will stop the
modeling because the DA is saturated.
Figure 6-51 Open Systems - growth projection limited by device adapter saturation
The utilization growth projection in Figure 6-52 on page 203 shows the projected growth of
each resource of the DS8886 storage system. At greater than 125% I/O rate growth, the DA
would have reached 100% utilization. Realistically, though, you should not run this configuration at greater than the 39% growth rate because at that point the DA utilization starts to reach the amber/warning stage.
The next step is to try to update the configuration to relieve this bottleneck.
Figure 6-53 Open Systems - upgrade the SSD drives with flash drives
Figure 6-54 Open Systems - Disk Magic solve of DS8886 with flash drives
Now, run the utilization graph with I/O rate growth again. This time, the DA is not the
bottleneck, but you get the Host Adapter Utilization > 100% message (Figure 6-55).
Figure 6-55 Open Systems - growth projection limited by host adapter saturation
The utilization growth projection in Figure 6-56 shows the projected growth of each resource
after the flash upgrade on the DS8886 storage system. At greater than 183% I/O rate growth,
the Fibre HA reaches 100% utilization. Do not run this configuration at greater than the 67%
growth rate because at that point the Fibre HA utilization starts to reach the amber/warning
stage.
Figure 6-56 Open Systems - growth projection bottleneck at the Fibre host adapter
The service time improves from 2.53 msec on the DS8870 storage system to 1.98 msec on
the DS8886 storage system. The service time is the same for both DS8886 options.
The difference between option 2 and option 3 is that option 3 can handle a much higher I/O
rate growth projection because the flash drives do not use the DAs.
Figure 6-58 Open Systems - graph option for service time growth projection
Figure 6-59 Open Systems - message when doing service time growth projection
Figure 6-60 Open Systems - service time chart with workload growth
There are three different approaches to modeling Easy Tier on a DS8880 DSS. Here are the three options:
Use one of the predefined skew levels.
Use an existing skew level based on the current workload on the current DSS.
Use heatmap data from a DSS that supports this function.
Disk Magic uses this setting to predict the number of I/Os that are serviced by the higher
performance tier. In Figure 6-61, the five curves represent the predefined skew levels with respect to capacity and I/Os.
A skew level value of 1 means that the workload does not have any skew at all, meaning that
the I/Os are distributed evenly across all ranks.
So, the top curve represents the very high skew level. The lowest curve represents the very
low skew level. In this chart, the intermediate skew curve (the middle one) indicates that, for a fast tier capacity of 20%, Easy Tier moves 79% of the workload (I/Os) to the fast tier. Disk
Magic assumes that if there is an extent pool where 20% of the capacity is on SSD or flash
ranks, Easy Tier manages to fill this 20% of the capacity with data that handles 79% of all the
I/Os.
These skew curves are developed by IntelliMagic. The class of curves and the five predefined
levels were chosen after researching workload data from medium and large sites.
The skew level settings affect the Disk Magic predictions. A heavy skew level selection results
in a more aggressive sizing of the higher performance tier. A low skew level selection provides
a conservative prediction. It is important to understand which skew level best matches the
actual workload before you start the modeling.
Tip: For Open Systems and z Systems workload, the default skew level is High (14). For a
System i workload, the default skew level is Very Low (2).
For z/OS, RMF Magic also estimates the skew for Easy Tier modeling even if the DSS is not
running Easy Tier. It estimates the skew based on the volume skew.
This skew level is integrated into the model of the new DSS that is migrated from the current
one.
Figure 6-64 Disk Magic data for an existing Easy Tier disk subsystem
In Figure 6-65 on page 213, select Enable Easy Tier and then click Read Heatmap. These
actions open the file containing the heat map. The heat map name is XXXX_skew_curve.csv,
where XXXX is the DSS name. From here, select the appropriate heat map. Based on the
heat map that Disk Magic reads, the predicted skew level is now 12.11.
The advisor tool provides a graphical representation of performance data that is collected by
the Easy Tier monitor over a 24-hour operational cycle. You can view the information that is
displayed by the advisor tool to analyze workload statistics and evaluate which logical
volumes might be candidates for Easy Tier management. If the Easy Tier feature is not
installed and enabled, you can use the performance statistics that are gathered by the
monitoring process to help you determine whether to use Easy Tier to enable potential
performance improvements in your storage environment and to determine optimal flash, SSD,
or HDD configurations and benefits.
To extract the summary performance data that is generated by Easy Tier, you can use either
the DSCLI or DS Storage Manager. When you extract summary data, two files are provided,
one for each processor complex in the Storage Facility Image (SFI) server. The download
operation initiates a long running task to collect performance data from both selected SFIs.
This information can be provided to IBM if performance analysis or problem determination is
required.
The version of the STAT that is available with DS8000 Licensed Internal Code R8.0 is more
granular and it provides a broader range of recommendations and benefit estimations. The
recommendations are available for all the supported multitier configurations and they are on a
per-extent pool basis. The tool estimates the performance improvement for each pool by
using the existing SSD ranks, and it provides guidelines on the type and number of SSD
ranks to configure. The STAT also provides recommendations for sizing a Nearline tier so you
can evaluate the advantage of demoting extents to either existing or additional Nearline
ranks, and the cold data capacity that results from this cold demotion.
Another improvement in the STAT is a better way to calculate the performance estimation.
Previously, the STAT took the heat values of each bucket as linear, which resulted in an
inaccurate estimation when the numbers of extents in each bucket were disproportional. The
STAT now uses the average heat value that is provided at the sub-LUN level to provide a
more accurate estimate of the performance improvement.
The STAT describes the Easy Tier statistical data that is collected by the DS8000 storage
system in detail and it produces reports in HTML format that can be viewed by using a
standard browser. These reports provide information at the levels of a DS8000 storage
system, the extent pools, and the volumes. Sizing recommendations and estimated benefits
are also included in the STAT reports.
Figure 6-66 on page 215 shows the System Summary report from STAT. There are two
storage pools monitored (P2 & P3) with a total of 404 volumes and 23,487 GiB capacity.
Three percent of the data is hot. It also shows that P2 and P3 contain two tiers, SSD and
Enterprise disk.
The dark purple portion of the Data Management Status bar displays the data that is assigned (pinned) to a certain tier. The green portion of the bar
represents data that is managed by Easy Tier. On storage pool P2, you see that there are
1320 GiB of data assigned/pinned to one tier.
Figure 6-67 Systemwide Recommendation report that shows possible improvements for all pools
The next series of views is by storage pool. When you click a certain storage pool, you can
see additional and detailed recommendations for improvements at the level of each extent
pool, as shown in Figure 6-68 on page 217.
Figure 6-68 shows a storage pool view for a pool that consists of two Enterprise (15 K/10 K)
HDD ranks, with both hot and cold extents. This pool can benefit from adding one SSD rank
and one Nearline rank. You can select the types of drives that you want to add for a certain
tier through the drop-down menus on the left. These menus contain all the drive and RAID
types for a certain type of tier. For example, when adding more Enterprise drives is
suggested, the STAT can calculate the benefit of adding drives in RAID 10 instead of RAID 5,
or the STAT can calculate the benefit of using additional 900 GB/10 K HDDs instead of the
300 GB/15 K drives.
If adding multiple ranks of a certain tier is beneficial for a certain pool, the STAT modeling
offers improvement predictions for the expected performance gains when adding two, three,
or more ranks up to the recommended number.
Figure 6-69 Storage pool statistics and recommendations - Volume Heat Distribution view
In this view, three heat classes are visible externally. However, internally, DS8000 Easy Tier
monitoring uses a more granular extent temperature in heat buckets. This detailed Easy Tier
data can be retrieved by IBM Support for extended studies of a client workload situation.
To get the enhanced data, process the binary heat map data with STAT by using the
information that is shown in Example 6-1.
Example 6-1 Process the binary heat map data by using STAT
STAT.exe SF75FAW80ESS01_heat_20151112222136.data
SF75FAW80ESS11_heat_20151112222416.data
Processing the DS8000 heat map with STAT produces three CSV files. One of them,
SF75FAW80_skew_curve.csv, is used by Disk Magic to do the Easy Tier modeling.
Figure 6-70 on page 219 shows how to import the heatmap data that is created by STAT.
From the DSS main window, click Easy Tier Settings. In the next window, which is shown in
Figure 6-71 on page 219, click Read Heatmap. A dialog box opens, where you can select the
CSV file that contains the skew curve that you want to use.
Figure 6-72 The heatmap that is selected shows the computed skew
Important: IBM Spectrum Control has different editions that have specific features. This
chapter assumes the use of IBM Spectrum Control Standard Edition, which includes
performance and monitoring functions that are relevant to this topic.
Table 7-1 IBM Spectrum Control supported activities for performance processes
Process: Tactical. Activities: Performance analysis and tuning. Feature: The tool facilitates thorough data collection and reporting.
Additional performance management processes that complement IBM Spectrum Control are
shown in Table 7-2 on page 223.
For a full list of the features that are provided in each of the IBM Spectrum components, go to
the following IBM website:
http://www.ibm.com/systems/storage/spectrum/
For more information about the configuration and deployment of storage by using IBM
Spectrum Control, see these publications:
IBM Spectrum Family: IBM Spectrum Control Standard Edition, SG24-8321
IBM Tivoli Storage Productivity Center V5.2 Release Guide, SG24-8204
IBM Tivoli Storage Productivity Center V5.1 Technical Guide, SG24-8053
IBM Tivoli Storage Productivity Center Beyond the Basics, SG24-8236
You can customize the dashboard by using the arrows. In the left navigation section, you can
see the aggregated status of internal or related resources.
Since IBM Spectrum Control V5.2.8, the stand-alone GUI is no longer available. All Alerting
functions and comprehensive performance management capabilities are available in the
WebUI. From here, it is possible to drill down to a more detailed level of information regarding
performance metrics. This WebUI is used to display some of these metrics later in the
chapter.
Metrics: A metric is a numerical value that is derived from the information that is provided by a device. It can be either raw data or a calculated value. For example, the raw data is the number of transferred bytes, but the metric uses this value and the collection interval to show bytes per second.
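For example (illustrative numbers only), if a port reports 1,500,000,000 transferred bytes during a 5-minute (300-second) collection interval, the derived metric is 1,500,000,000 / 300 = 5,000,000 bytes per second, or about 5 MBps.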
For the DS8000 storage system, the native application programming interface (NAPI) is used
to collect performance data, in contrast to the SMI-S Standard that is used for third-party
devices.
The DS8000 storage system interacts with the NAPI in the following ways:
Access method used: Enterprise Storage Server Network Interface (ESSNI)
Failover:
– For the communication with a DS8000 storage system, IBM Spectrum Control uses the
ESSNI client. This library is basically the same library that is included in any DS8000
command-line interface (DSCLI). Because this component has built-in capabilities to
fail over from one Hardware Management Console (HMC) to another HMC, a good
approach is to specify the secondary HMC IP address if your DS8000 storage system
has one.
– The failover might still cause errors in an IBM Spectrum Control job, but the next
command that is sent to the device uses the redundant connection.
Subsystem
On the subsystem level, metrics are aggregated from multiple records to a single value per
metric to give the performance of a storage subsystem from a high-level view, based on the
metrics of other components. This aggregation is done by adding values, or calculating
average values, depending on the metric.
Cache
The cache in Figure 7-2 on page 225 plays a crucial role in the performance of any storage
subsystem.
Metrics such as disk-to-cache operations show the number of data transfer operations from disks to cache, which is called staging, for a specific volume. Disk-to-cache operations are directly linked to read activity from hosts.
When data is not found in the DS8000 cache, the data is first staged from back-end disks into
the cache of the DS8000 storage system and then transferred to the host.
Read hits occur when all the data that is requested for a read data access is in cache. The
DS8000 storage system improves the performance of read caching by using Sequential
Prefetching in Adaptive Replacement Cache (SARC) staging algorithms. For more
information about the SARC algorithm, see 1.3.1, “Advanced caching techniques” on page 8.
The SARC algorithm seeks to store those data tracks that have the greatest probability of
being accessed by a read operation in cache.
The cache-to-disk operations metric shows the number of data transfer operations from cache to disks, which is called destaging, for a specific volume. Cache-to-disk operations are directly
linked to write activity from hosts to this volume. Data that is written is first stored in the
persistent memory (also known as nonvolatile storage (NVS)) at the DS8000 storage system
and then destaged to the back-end disk. The DS8000 destaging is enhanced automatically by
striping the volume across all the disk drive modules (DDMs) in one or several ranks
(depending on your configuration). This striping, or volume management that is done by Easy
Tier, provides automatic load balancing across DDMs in ranks and an elimination of the hot
spots.
The Write-cache Delay I/O Rate or Write-cache Delay Percentage because of persistent
memory allocation gives you information about the cache usage for write activities. The
DS8000 storage system stores data in the persistent memory before sending an
acknowledgment to the host. If the persistent memory is full of data (no space available), the
host receives a retry for its write request. In parallel, the subsystem must destage the data
that is stored in its persistent memory to the back-end disk before accepting new write
operations from any host.
If a volume experiences write operations that are delayed because of persistent memory constraints, consider moving the volume to a less busy rank or spreading it across multiple ranks (increasing the number of DDMs that are used). If this solution does not fix the persistent memory
constraint problem, consider adding cache capacity to your DS8000 storage system.
As shown in Figure 7-3 on page 227, you can use IBM Spectrum Control to monitor the cache
metrics easily.
Controller/Nodes
IBM Spectrum Control refers to the DS8000 processor complexes as Nodes (formerly called
controllers). A DS8000 storage system has two processor complexes, and each processor
complex independently provides major functions for the disk storage system. Examples
include directing host adapters (HAs) for data transfer to and from host processors, managing
cache resources, and directing lower device interfaces for data transfer to and from physical
disks. To analyze performance data, you must know that most volumes can be
“assigned/used” by only one controller at a time.
When you right-click one of the nodes in the WebUI, you can use IBM Spectrum Control to
drill down to the volume performance chart for the volumes that are assigned to the selected
node, as shown in Figure 7-5 on page 229.
You can enlarge the performance chart with the “Open in a new window” icon. The URL of the
new window can be bookmarked or attached to an email.
Ports
The port information reflects the performance metrics for the front-end DS8000 ports that
connect the DS8000 storage system to the SAN switches or hosts. Additionally, port error rate
metrics, such as Error Frame Rate, are also available. The DS8000 HA card has four or eight
ports. The WebUI does not reflect this aggregation, but if necessary, custom reports can be
created with IBM Cognos Report Studio or with native SQL statements to show port
performance data that is grouped by the HA to which the ports belong. Monitoring and analyzing the ports that belong to the same card is beneficial because the aggregated throughput is
less than the sum of the stated bandwidth of the individual ports.
Note: Cognos BI is an optional part of IBM Spectrum Control and can be installed at any
time. More information about installation and usage of Cognos BI is provided in IBM Tivoli
Storage Productivity Center V5.1 Technical Guide, SG24-8053 and IBM Tivoli Storage
Productivity Center V5.2 Release Guide, SG24-8204.
For more information about the DS8000 port cards, see 2.5.1, “Fibre Channel and FICON
host adapters” on page 41.
Port metrics: IBM Spectrum Control reports on many port metrics because the ports on the DS8000 storage system are the front-end part of the storage device.
Array
The array name that is shown in the WebUI, as shown in Figure 7-8 on page 231, directly
refers to the array on the DS8000 storage system as listed in the DS GUI or DSCLI.
When you click the Performance tab, the top five performing arrays are displayed with their
corresponding graphs, as shown in Figure 7-8.
Figure 7-8 IBM Spectrum Control WebUI V5.2.8 DS8000 Array Performance chart
A DS8000 array is defined on an array site with a specific RAID type. A rank is a logical
construct to which an array is assigned. A rank provides a number of extents that are used to
create one or several volumes. A volume can use the DS8000 extents from one or several
ranks. For more information, see 3.2.1, “Array sites” on page 54, 3.2.2, “Arrays” on page 54,
and 3.2.3, “Ranks” on page 55.
In most common logical configurations, array sites, arrays, and ranks are numbered in corresponding order, for example, array site S1 corresponds to array A0 and rank R0. If they are not in order, you must understand on which array the analysis is performed.
Important: In IBM Spectrum Control, the relationship between array site, array, and rank
can be configured to be displayed in the WebUI as shown in Figure 7-7. The array statistics
are used to measure the rank’s usage because they have a 1:1 relationship.
Example 7-1 on page 232 shows the relationships among a DS8000 rank, an array, and an
array site with a typical divergent numbering scheme by using DSCLI commands. Use the
showrank command to show which volumes have extents on the specified rank.
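For example, the following command lists the attributes of a rank, including the volumes that have extents on it (the rank ID R17 is illustrative):
dscli> showrank R17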
In the Array performance chart, you can include both front-end and back-end metrics. The
back-end metrics can be selected on the Disk Metrics Tab. They provide metrics from the
perspective of the controller to the back-end array sites. The front-end metrics relate to the
activity between the server and the controller.
There is a relationship between array operations, cache hit ratio, and percentage of read
requests:
When the cache hit ratio is low, the DS8000 storage system has frequent transfers from
DDMs to cache (staging).
When the percentage of read requests is high and the cache hit ratio is also high, most of
the I/O requests can be satisfied without accessing the DDMs because of the cache
management prefetching algorithm.
When there is heavy write activity, it leads to frequent transfers from cache to DDMs
(destaging).
Comparing the performance of different arrays shows whether the global workload is equally
spread on the DDMs of your DS8000 storage system. Spreading data across multiple arrays
increases the number of DDMs that is used and optimizes the overall performance.
Important: Back-end write metrics do not include the RAID impact. In reality, the RAID 5
write penalty adds additional unreported I/O operations.
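As a rough worked example (assuming small random writes that are not coalesced in cache), each RAID 5 random write requires four back-end operations: read old data, read old parity, write new data, and write new parity. A volume that drives 1,000 random write IOPS at the front end can therefore generate approximately 4,000 physical disk operations on the back end, even though the reported back-end write rate does not reflect this multiplication.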
Analysis of volume data facilitates the understanding of the I/O workload distribution among
volumes, and workload characteristics (random or sequential and cache hit ratios). A DS8000
volume can belong to one or several ranks, as shown in Figure 7-9 (for more information, see
3.2.7, “Extent allocation methods” on page 62). Especially in managed multi-rank extent pools
with Easy Tier automatic data relocation enabled, the distribution of a certain volume across
the ranks in the extent pool can change over time. The STAT with its volume heat distribution
report provides additional information about the heat of the data and the data distribution
across the tiers within a pool for each volume. For more information about the STAT and its
report, see 6.5, “Storage Tier Advisor Tool” on page 213.
With IBM Spectrum Control, you can see the Easy Tier Distribution, Easy Tier Status, and the
capacity values for pools and volumes, as shown in Figure 7-10 on page 235 and Figure 7-11
on page 235.
The analysis of volume metrics shows the activity of the volumes on your DS8000 storage
system and can help you perform these tasks:
Determine where the most accessed data is and what performance you get from the
volume.
Understand the type of workload that your application generates (sequential or random
and the read or write operation ratio).
Determine the cache benefits for the read operation (cache management prefetching
algorithm SARC).
Determine cache bottlenecks for write operations.
Compare the I/O response observed on the DS8000 storage system with the I/O response
time observed on the host.
The relationship of certain RAID arrays and ranks to the DS8000 pools can be derived from
the IBM Spectrum Control RAID array list window, which is shown in Figure 7-7 on page 231.
From there, you can easily see the volumes that belong to a certain pool by right-clicking a
pool, selecting View Properties, and clicking the Volumes tab.
Figure 7-12 Relationship of the RAID array, pool, and volumes for a RAID array
In addition, to associate quickly the DS8000 arrays to array sites and ranks, you might use the
output of the DSCLI commands lsrank -l and lsarray -l, as shown in Example 7-1 on
page 232.
Random read       Attempt to find the data in cache. If it is not present in cache, read it from the back end.
Sequential write  Write the data to the NVS of the processor complex that owns the volume and send a copy of the data to the cache in the other processor complex. Upon back-end destaging, prefetch read data and parity into cache to reduce the number of disk operations on the back end.
Random write      Write the data to the NVS of the processor complex that owns the volume and send a copy of the data to the cache in the other processor complex. Destage modified data from NVS to disk as determined by the Licensed Internal Code.
The read hit ratio depends on the characteristics of data on your DS8000 storage system and
applications that use the data. If you have a database and it has a high locality of reference, it
shows a high cache hit ratio because most of the data that is referenced can remain in the
cache. If your database has a low locality of reference, but it has the appropriate sets of
indexes, it might also have a high cache hit ratio because the entire index can remain in the
cache.
For a logical volume that has sequential files, you must understand the application types that
access those sequential files. Normally, these sequential files are used for either read only or
write only at the time of their use. The DS8000 cache management prefetching algorithm
(SARC) determines whether the data access pattern is sequential. If the access is sequential,
contiguous data is prefetched into cache in anticipation of the next read request.
IBM Spectrum Control reports the reads and writes through various metrics. For a description
of these metrics in greater detail, see 7.3, “IBM Spectrum Control data collection
considerations” on page 238.
Although the time configuration of the device is written to the database, reports are always
based on the time of the IBM Spectrum Control server. It receives the time zone information
from the devices (or the NAPIs) and uses this information to adjust the time in the reports to
the local time. Certain devices might convert the time into Coordinated Universal Time (UTC) time
stamps and not provide any time zone information.
This complexity is necessary to compare the information from two subsystems in different
time zones from a single administration point. This administration point is the GUI, not the
IBM Spectrum Control server. If you open the GUI in different time zones, a performance
diagram might show a distinct peak at different times, depending on its local time zone.
When using IBM Spectrum Control to compare data from a server (for example, iostat data)
with the data of the storage subsystem, it is important to know the time stamp of the storage
subsystem. The time zone of the device is shown in the DS8000 Properties window.
To ensure that the time stamps on the DS8000 storage system are synchronized with the
other infrastructure components, the DS8000 storage system provides features for
configuring a Network Time Protocol (NTP) server. To modify the time and configure the HMC
to use an NTP server, see the following publications:
IBM DS8870 Architecture and Implementation (Release 7.5), SG24-8085
IBM DS8880 Architecture and Implementation (Release 8), SG24-8323
As IBM Spectrum Control can synchronize multiple performance charts that are opened in
the WebUI to display the metrics at the same time, use an NTP server for all components in
the SAN environment.
7.3.2 Duration
IBM Spectrum Control collects data continuously. From a performance management
perspective, collecting data continuously means that performance data exists to facilitate
reactive, proactive, and even predictive processes, as described in Chapter 7, “Practical
performance management” on page 221.
7.3.3 Intervals
In IBM Spectrum Control, the data collection interval is referred to as the sample interval. The
sample interval for the DS8000 performance data collection tasks is 1 - 60 minutes. A shorter
sample interval results in a more granular view of performance data at the expense of
requiring additional database space. The appropriate sample interval depends on the
objective of the data collection. Table 7-4 on page 239 displays example data collection
objectives and reasonable values for a sample interval.
Attention: Although the 1-minute interval collection is the default for some devices, the
5-minute interval is considered a good average for reviewing data. However, issues that occur
within a 5-minute interval can have their peaks averaged out, so they might not be as apparent
as they are with a 1-minute interval collection. In such situations, the 1-minute interval is most
appropriate because it offers a more granular view and therefore a more effective analysis.
However, a 1-minute interval results in a copious amount of data being collected and a
fast-growing database. Therefore, use it only for troubleshooting purposes.
For a list of available performance metrics for the DS8000 storage system, see the IBM
Spectrum Control IBM Knowledge Center:
http://ibm.co/1J59pAq
Note: IBM Spectrum Control also has other metrics that can be adjusted and configured
for your specific environment to suit customer demands.
Component  Metric group    Metric                        Threshold  Indication
Node       Volume Metrics  Cache Holding Time            < 200      Indicates high cache track turnover and possibly cache constraint.
Node       Volume Metrics  Write Cache Delay Percentage  > 1%       Indicates writes delayed because of insufficient memory resources.
Array      Disk Metrics    Disk Utilization Percentage   > 70%      Indicates disk saturation. For IBM Spectrum Control, the default value of this threshold is 50%.
Array      Disk Metrics    Overall Response Time         > 35       Indicates busy disks.
Array      Disk Metrics    Write Response Time           > 35       Indicates busy disks.
Array      Disk Metrics    Read Response Time            > 35       Indicates busy disks.
Port       Port Metrics    Total Port I/O Rate           Depends    Indicates transaction-intensive load. The threshold depends on the HBA, switch, and other components.
Port       Port Metrics    Total Port Data Rate          Depends    If the port data rate is close to the bandwidth, this rate indicates saturation. The threshold depends on the HBA, switch, and other components.
Port       Port Metrics    Port Send Response Time       > 2        Indicates contention on the I/O path from the DS8000 storage system to the host.
Port       Port Metrics    Port Receive Response Time    > 2        Indicates a potential issue on the I/O path or the DS8000 storage system back end.
Port       Port Metrics    Total Port Response Time      > 2        Indicates a potential issue on the I/O path or the DS8000 storage system back end.
Because the intervals are usually 1 - 15 minutes, IBM Spectrum Control is not an online or
real-time monitor.
You can use IBM Spectrum Control to define performance-related alerts that can trigger an
event when the defined thresholds are reached. Even though it works in a similar manner to a
monitor without user intervention, the actions are still performed at the intervals specified
during the definition of the performance monitor job.
2. Select the component for which you want to set the threshold. In this example, click
Controllers 7/7.
3. Click the Performance tab.
4. Click Add Metric.
5. Select the check box for the metric for which you want to set the Threshold.
6. Click OK.
7. Enable the alert (1).
8. Specify the Threshold (2) and the severity (3).
9. Click the envelope to specify the notification (4).
10. Click the struck-through exclamation mark to specify the suppression settings (5).
11. Click Save.
Reference: For more information about setting Thresholds and Alert suppressions in IBM
Spectrum Control, see IBM Spectrum Family: IBM Spectrum Control Standard Edition,
SG24-8321.
The alerts for a DS8000 storage system can be seen, filtered, removed, acknowledged, or
exported in the storage system Alert window, as shown in Figure 7-14.
False positive alerts: Configuring thresholds too conservatively can lead to an excessive
number of false positive alerts.
All of the reports use the metrics that are available for the DS8000 storage system. The
remainder of this section describes each of the report types from a general perspective.
To export the summary table underneath the chart, click Action → More → Export, and
select the correct format.
IBM Spectrum Control provides over 70 predefined reports that show capacity and
performance information that is collected by IBM Spectrum Control, as shown in Figure 7-16.
Charts are automatically generated for most of the predefined reports. Depending on the type
of resource, the charts show statistics for space usage, workload activity, bandwidth
percentage, and other statistics, as shown in Figure 7-17. You can schedule reports and
specify the report output as HTML, PDF, and other formats. You can also configure reports to
save the report output to your local file system, and to send reports as mail attachments.
If the wanted report is not available as a predefined report, you can use either Query Studio
or Report Studio to create your own custom reports.
Cognos Report Studio is a professional report authoring tool and offers many advanced
functions:
Creating multiple report pages
Creating multiple queries that can be joined, unioned, and so on
Rendering native SQL queries
Generating rollup reports
Performing complex calculations
Providing additional chart types with baseline and trending functions
Creating active reports, which are interactive reports that can be used on mobile devices
Reference: For more information about the usage of Cognos and its functions, see IBM
Tivoli Storage Productivity Center V5.1 Technical Guide, SG24-8053 and IBM Spectrum
Family: IBM Spectrum Control Standard Edition, SG24-8321.
7.5.3 TPCTOOL
You can use the TPCTOOL command-line interface (CLI) to extract data from the IBM
Spectrum Control database. It requires no knowledge of the IBM Spectrum Control schema
or SQL query skills, but you must understand how to use the tool.
You also can make connections by using the ODBC interface, for example, with Microsoft
Excel.
Note: Always specify the WITH UR (uncommitted read) clause for read-only SQL queries.
Otherwise, your tables might get locked during the read operation, which might slow down the
performance of the TPCDB.
An example of querying the IBM Spectrum Control database by using native SQL with
Microsoft Excel is in IBM Spectrum Family: IBM Spectrum Control Standard Edition,
SG24-8321.
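As a minimal hedged sketch of such a query (the schema and view name are assumptions for illustration; check the Exposed Views documentation that is referenced below for the actual names in your release), the WITH UR clause is appended to the statement, here run through the DB2 command line on the IBM Spectrum Control database server:

db2 connect to TPCDB
db2 "SELECT * FROM TPCREPORT.<exposed_view_name> FETCH FIRST 100 ROWS ONLY FOR READ ONLY WITH UR"
db2 connect reset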
Reference: Documentation about the IBM Spectrum Control Exposed Views is available at
the following website:
http://www.ibm.com/support/docview.wss?uid=swg27023813
These functions are not included in IBM Spectrum Control Standard Edition.
The storage optimization function of IBM Spectrum Control uses the VDisk mirroring
capabilities of the SAN Volume Controller, so it can be used only for devices that are
configured as back-end storage for storage virtualizers.
For more information about the storage optimization function of IBM Spectrum Control, see
IBM SmartCloud Virtual Storage Center, SG24-8239.
The self-service provisioning capability of IBM Spectrum Control can also be used for
DS8000 pools that are not behind a storage virtualizer.
To take advantage of the self-service provisioning capabilities that are available in IBM
Spectrum Control Advanced Edition, some configuration is required. This configuration is
called cloud configuration, and it specifies storage tiers, service classes, and capacity pools
(see Figure 7-18).
After the cloud configuration is done, you can do storage provisioning by specifying the
storage capacity and quality of a given service class. Then, volumes are created with the
required characteristics that are defined within that selected service class.
For more information about monitoring performance through a SAN switch or director point
product, see the following websites:
http://www.brocade.com
http://www.cisco.com
Most SAN management software includes options to create SNMP alerts based on
performance criteria, and to create historical reports for trend analysis. Certain SAN vendors
offer advanced performance monitoring capabilities, such as measuring I/O traffic between
specific pairs of source and destination ports, and measuring I/O traffic for specific LUNs.
In addition to the vendor point products, IBM Spectrum Control can be used as a central data
repository and reporting tool for switch environments. It lacks real-time capabilities, but when it is
integrated with Network Advisor, it collects and reports on data at 1 - 60-minute intervals for
performance analysis over a selected time frame.
IBM Spectrum Control provides facilities to report on fabric topology, configurations and
switches, and port performance and errors. In addition, you can use Spectrum Control to
configure alerts or thresholds for Total Port Data Rate and Total Port Packet Rate.
Configuration options allow the creation of events to be triggered if thresholds are exceeded.
Although it does not provide real-time monitoring, it offers several advantages over traditional
vendor point products:
Ability to store performance data from multiple switch vendors in a common database
Advanced reporting and correlation between host data and switch data through custom
reports
Centralized management and reporting
Aggregation of port performance data for the entire switch
Perceived or actual I/O bottlenecks can result from hardware failures on the I/O path,
contention on the server, contention on the SAN Fabric, contention on the DS8000 front-end
ports, or contention on the back-end disk adapters or disk arrays. This section provides a
process for diagnosing these scenarios by using IBM Spectrum Control and external data.
This process was developed for identifying specific types of problems and is not a substitute
for common sense, knowledge of the environment, and experience. Figure 7-19 shows the
high-level process flow.
To troubleshoot performance problems, IBM Spectrum Control data must be augmented with
host performance and configuration data. Figure 7-20 shows a logical end-to-end view from a
measurement perspective.
Although IBM Spectrum Control does not provide host performance, configuration, or error
data, IBM Spectrum Control provides performance data from host connections, SAN
switches, and the DS8000 storage system, and configuration information and error logs from
SAN switches and the DS8000 storage system.
Tip: Performance analysis and troubleshooting must always start top-down: first the
application (for example, database design and layout), then the operating system,
server hardware, SAN, and finally storage. The tuning potential is greater at the “higher”
levels. The best I/O tuning is the tuning that never must be carried out because server caching
or a better database design eliminated the need for it.
Process flow
The order in which you conduct the analysis is important. Use the following process:
1. Define the problem. A sample questionnaire is provided in “Sample questions for an AIX
host” on page 559. The goal is to assist you in determining the problem background and
understand how the performance requirements are not being met.
2. Consider checking the application level first. Has all potential tuning on the database level
been performed? Does the layout adhere to the vendor recommendations, and is the
server adequately sized (RAM, processor, and buses) and configured?
3. Correctly classify the problem by identifying hardware or configuration issues. Hardware
failures often manifest themselves as performance issues because I/O is degraded on one
or more paths. If a hardware issue is identified, all problem determination efforts must
focus on identifying the root cause of the hardware errors:
a. Gather any errors on any of the host paths.
Physical component: If you notice significant errors in the output of the datapath query
device or pcmpath query device commands and the errors increase, there is most likely a
problem with a physical component on the I/O path.
b. Gather the host error report and look for Small Computer System Interface (SCSI) or
Fibre errors.
Hardware: Often a hardware error that relates to a component on the I/O path
shows as a TEMP error. A TEMP error does not exclude a hardware failure. You
must perform diagnostic tests on all hardware components in the I/O path, including
the host bus adapter (HBA), SAN switch ports, and the DS8000 HBA ports.
c. Gather the SAN switch configuration and errors. Every switch vendor provides different
management software. All of the SAN switch software provides error monitoring and a
way to identify whether there is a hardware failure with a port or application-specific
integrated circuit (ASIC). For more information about identifying hardware failures, see
your vendor-specific manuals or contact vendor support.
d. If errors exist on one or more of the host paths, determine whether there are any
DS8000 hardware errors. Log on to the HMC as customer/cust0mer and look to ensure
that there are no hardware alerts. Figure 7-21 provides a sample of a healthy DS8000
storage system. If there are any errors, you might need to open a problem ticket (PMH)
with DS8000 hardware support (2107 engineering).
4. After validating that no hardware failures exist, analyze server performance data and
identify any disk bottlenecks. The fundamental premise of this methodology is that I/O
performance degradation that relates to SAN component contention can be observed at
the server through analysis of the key server-based I/O metrics.
Degraded end-to-end I/O response time is the strongest indication of I/O path contention.
Typically, server physical disk response times measure the time that a physical I/O request
takes from the moment that the request was initiated by the device driver until the device
driver receives an interrupt from the controller that the I/O completed. The measurements
are displayed as either service time or response time. They are averaged over the
measurement interval. Typically, server wait or queue metrics refer to time spent waiting at
the HBA, which is usually an indication of HBA saturation. In general, you need to interpret
the service times as response times because they include potential queuing at various
storage subsystem components, for example:
– Switch
– Storage HBA
– Storage cache
– Storage back-end disk controller
– Storage back-end paths
– Disk drives
I/O-intensive disks: The number of total I/Os per second indicates the relative
activity of the device. This relative activity provides a metric to prioritize the analysis.
Those devices with high response times and high activity are more important to
understand than devices with high response time and infrequent access. If
analyzing the data in a spreadsheet, consider creating a combined metric of
Average I/Os × Average Response Time to provide a method for identifying the most
I/O-intensive disks. You can obtain additional detail about OS-specific server
analysis in the OS-specific chapters.
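The following awk one-liner is a hedged sketch of that combined metric; the file name and column layout (hdisk, reads per second, average read time in ms, writes per second, average write time in ms) are assumptions, so adjust the field numbers to match your own formatted output:

awk -F, 'NR > 1 {
    io = $2 + $4
    rt = (io > 0) ? ($2 * $3 + $4 * $5) / io : 0
    printf "%s %.1f\n", $1, io * rt
}' hdisk_stats.csv | sort -k2,2 -rn | head -10

The output lists the hdisks ranked by average I/Os multiplied by average response time, which puts the busiest slow devices at the top.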
Multipathing: Ensure that multipathing works as designed. For example, if there are
two paths that are zoned per HBA to the DS8000 storage system, there must be four
active paths per LUN. Both SDD and SDDPCM use an active/active configuration of
multipathing, which means that traffic flows across all the traffic fairly evenly. For
native DS8000 connections, the absence of activity on one or more paths indicates
a problem with the SDD behavior.
c. Format the data and correlate the host LUNs with their associated DS8000 resources.
Formatting the data is not required for analysis, but it is easier to analyze formatted
data in a spreadsheet.
The following steps represent the logical steps that are required to format the data and
do not represent literal steps. You can codify these steps in scripts, as shown in the sketch after this list:
i. Read the configuration file.
ii. Build an hdisk hash with key = hdisk and value = LUN SN.
iii. Read I/O response time data.
iv. Create hashes for each of the following values with hdisk as the key: Date, Start
time, Physical Volume, Reads, Avg Read Time, Avg Read Size, Writes, Avg Write
Time, and Avg Write Size.
v. Print the data to a file with headers and commas to separate the fields.
vi. Iterate through the hdisk hash and use the common hdisk key to index into the other
hashes and print those hashes that have values.
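The following fragment is one possible sketch of such a script (the input file names and column layouts in the comments are assumptions, not a literal format produced by any tool):

# hdisk_to_lun.txt: one line per hdisk, "<hdisk> <LUN serial number>"
# iostat_data.txt : "<hdisk> <date> <time> <reads> <avg read ms> <writes> <avg write ms>"
echo "hdisk,lun_sn,date,time,reads,avg_read_ms,writes,avg_write_ms" > correlated.csv
awk '
    NR == FNR { lun[$1] = $2; next }      # first file: build the hdisk-to-LUN hash
    $1 in lun {                           # second file: keep only hdisks with a known LUN
        printf "%s,%s,%s,%s,%s,%s,%s,%s\n", $1, lun[$1], $2, $3, $4, $5, $6, $7
    }
' hdisk_to_lun.txt iostat_data.txt >> correlated.csv

The resulting comma-separated file can be sorted or filtered in a spreadsheet to match each host LUN with its DS8000 volume serial number.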
d. Analyze the host performance data:
i. Determine whether I/O bottlenecks exist by summarizing the data and analyzing
key performance metrics for values in excess of the thresholds that are described in
“Rules” on page 349. Identify those vpaths/LUNs with poor response time. We
show an example in “Analyzing performance data” on page 357. Hardware errors
and multipathing configuration issues must already be excluded. The hot LUNs
must already be identified. Proceed to step 5 on page 254 to determine the root
cause of the performance issue.
ii. If no degraded disk response times exist, the issue is likely not internal to the
server.
Analyze the DS8000 performance data first: Check for Alerts (2) and errors (3) in
the left navigation. Then, look at the performance data of the internal resources (4).
Analysis of the SAN fabric and the DS8000 performance data can be completed in
either order. However, SAN bottlenecks occur less frequently than disk bottlenecks,
so it can be more efficient to analyze the DS8000 performance data first.
b. Use IBM Spectrum Control to gather the DS8000 performance data for subsystem
ports, pools, arrays, and volumes, nodes, and host connections. Compare the key
performance indicators from Table 7-5 on page 240 with the performance data. To
analyze the performance, complete the following steps:
i. For those server LUNs that show poor response time, analyze the associated
volumes during the same period. If the problem is on the DS8000 storage system, a
correlation exists between the high response times observed on the host and the
volume response times observed on the DS8000 storage system.
ii. Correlate the hot LUNs with their associated disk arrays. When using the IBM
Spectrum Control WebUI, the relationships are provided automatically in the
drill-down feature, as shown in Figure 7-23 (2).
Figure 7-23 Drill-down function of IBM Spectrum Control
If you use Cognos exports and want to correlate the volume data to the rank data,
you can do so manually or by using a script. If extent pools with multiple ranks and
storage pool striping, or Easy Tier managed pools, are used, one volume can exist
on multiple ranks.
Analyze storage subsystem ports for the ports associated with the server in
question.
6. Continue the identification of the root cause by collecting and analyzing SAN fabric
configuration and performance data:
a. Gather the connectivity information and establish a visual diagram of the environment.
Visualize the environment: Sophisticated tools are not necessary for creating this
type of view; however, the configuration, zoning, and connectivity information must
be available to create a logical visual representation of the environment.
Figure 7-24 Windows Server perfmon - Average Physical Disk Read Response Time
At approximately 18:39 hours, the average read response time jumps from approximately
15 ms to 25 ms. Further investigation of the host reveals that the increase in response time
correlates with an increase in load, as shown in Figure 7-25.
Figure 7-25 Windows Server perfmon - Physical Disk Reads per second (1-minute intervals)
Because the most probable reason for the elevated response times is the disk utilization on
the array, gather and analyze this metric first. Figure 7-26 shows the disk utilization on the
DS8000 storage system.
Problem definition
The online transactions for a Windows Server SQL server appear to take longer than normal
and time out in certain cases.
Disabling a path: In cases where there is a path with significant errors, you can disable
the path with the multipathing software, which allows the non-working paths to be disabled
without causing performance degradation to the working paths. With SDD PCM, disable
the path by running pcmpath set device # path # offline.
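For example (the device and path numbers are illustrative), if pcmpath query device 12 shows that path 1 accumulates errors, you can take only that path offline and bring it back after the repair:

pcmpath query device 12
pcmpath set device 12 path 1 offline
pcmpath set device 12 path 1 online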
Figure: Host disk throughput (KBps) for the Dev and Production disks at 1-minute intervals
The DS8000 port data reveals a peak throughput of around 300 MBps per 4-Gbps port.
Note: The HA port speed differs among DS8000 models, so the achievable throughput depends
on the hardware configuration of the system and the SAN environment.
Figure: DS8000 port throughput (MBps) for ports R1-I3-C4-T0 and R1-I3-C1-T0 at 5-minute intervals
Before beginning the diagnostic process, you must understand your workload and your
physical configuration. You must know how your system resources are allocated, and
understand your path and channel configuration for all attached servers.
Assume that you have an environment with a DS8000 storage system attached to a z/OS
host, an AIX on IBM Power Systems™ host, and several Windows Server hosts. You noticed
that your z/OS online users experience a performance degradation 07:30 - 08:00 hours each
morning.
You might notice 3390 volumes that indicate high disconnect times, or a high device busy
delay time for several volumes, in the RMF device activity reports. Unlike UNIX or
Windows Server, z/OS reports the response time and its breakdown into connect,
disconnect, pending, and IOS queuing times.
Device busy delay is an indication that another system locks up a volume, and an extent
conflict occurs among z/OS hosts or applications in the same host when using Parallel
Access Volumes (PAVs). The DS8000 multiple allegiance or PAVs capability allows it to
process multiple I/Os against the same volume at the same time. However, if a read or write
request against an extent is pending while another I/O is writing to the extent, or if a write
request against an extent is pending while another I/O is reading or writing data from the
extent, the DS8000 storage system delays the I/O by queuing. This condition is referred to as
extent conflict. Queuing time because of an extent conflict is accumulated in the device busy (DB)
delay time. An extent is a sphere of access; the unit of increment is a track. Usually, I/O
drivers or system routines decide and declare the sphere.
To get more information about cache usage, you can check the cache statistics of the FB
volumes that belong to the same server. You might be able to identify the FB volumes that
have a low read hit ratio and short cache holding time. Moving the workload of the Open
Systems logical disks, or the z Systems CKD volumes, that you are concerned about to the
other side of the cluster improves the situation by concentrating cache-friendly I/O workload
across both clusters. If you cannot or if the condition does not improve after this move,
consider balancing the I/O distribution on more ranks, or solid-state drives (SSDs). Balancing
the I/O distribution on more ranks optimizes the staging and destaging operation.
The scenarios that use IBM Spectrum Control as described in this chapter might not cover all
the possible situations that can be encountered. You might need to include more information,
such as application and host operating system-based performance statistics, the STAT
reports, or other data collections to analyze and solve a specific performance problem.
Part 3 Performance
considerations for host
systems and databases
This part provides performance considerations for various host systems or appliances that
are attached to the IBM System Storage DS8000 storage system, and for databases.
This chapter provides general host attachment considerations. Detailed performance tuning
considerations for specific operating systems are provided in later chapters of this book.
The DS8886 model supports a maximum of 16 ports per I/O bay and can have four I/O bays
in the base frame and four I/O bays in first expansion frame, so a maximum of 128
FCP/FICON ports is supported. The DS8884 model supports a maximum of 16 FCP/FICON
HAs and can have two I/O bays in the base frame and two I/O bays in first expansion frame,
so a maximum of 64 FCP/FICON ports is supported. Both models support 8 Gbps or 16 Gbps
HAs. All ports can be intermixed and independently configured. The 8 Gbps HAs support 2, 4,
or 8 Gbps link speeds, and the 16 Gbps HAs support 4, 8, or 16 Gbps. Thus, 1 Gbps is no
longer supported on the DS8880 storage system. Enterprise Systems Connection (ESCON)
adapters are not supported on the DS8880 storage system.
The DS8000 storage system can support host and remote mirroring links by using
Peer-to-Peer Remote Copy (PPRC) on the same I/O port. However, it is preferable to use
dedicated I/O ports for remote mirroring links.
Planning and sizing the HAs for performance are not easy tasks, so use modeling tools, such
as Disk Magic (see 6.1, “Disk Magic” on page 160). The factors that might affect the
performance at the HA level are typically the aggregate throughput and the workload mix that
the adapter can handle. All connections on an HA share bandwidth in a balanced manner.
Therefore, host attachments that require maximum I/O port performance must be connected
to HAs that are not fully populated. You must allocate host connections across I/O ports, HAs,
and I/O enclosures in a balanced manner (workload spreading).
No SCSI: There is no direct Small Computer System Interface (SCSI) attachment support
for the DS8000 storage system.
The next section describes preferred practices for implementing a switched fabric.
If an HA fails and starts logging in and out of the switched fabric, or a server must be restarted
several times, you do not want it to disturb the I/O to other hosts. Figure 8-1 on page 271
shows zones that include only a single HA and multiple DS8000 ports (single initiator zone).
This approach is the preferred way to create zones to prevent interaction between server
HAs.
Tip: Each zone contains a single host system adapter with the wanted number of ports
attached to the DS8000 storage system.
By establishing zones, you reduce the possibility of interactions between system adapters in
switched configurations. You can establish the zones by using either of two zoning methods:
Port number
Worldwide port name (WWPN)
You can configure switch ports that are attached to the DS8000 storage system in more than
one zone, which enables multiple host system adapters to share access to the DS8000 HA
ports. Shared access to a DS8000 HA port might be from host platforms that support a
combination of bus adapter types and operating systems.
LUN masking
In FC attachment, logical unit number (LUN) affinity is based on the WWPN of the adapter on
the host, which is independent of the DS8000 HA port to which the host is attached. This
LUN masking function on the DS8000 storage system is provided through the definition of
DS8000 volume groups. A volume group is defined by using the DS Storage Manager or
DS8000 command-line interface (DSCLI), and host WWPNs are connected to the volume
group. The LUNs to be accessed by the hosts that are connected to the volume group are
defined to be in that volume group.
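As a hedged DSCLI sketch of this definition (the volume range, WWPN, names, and the volume group ID that the system assigns are placeholders), a volume group is created, volumes are assigned to it, and the host WWPN is connected to it:

dscli> mkvolgrp -type scsimask -volume 1000-100F AIX_prod_vg
dscli> mkhostconnect -wwname 10000000C9A1B2C3 -hosttype pSeries -volgrp V11 AIX_prod_fcs0
dscli> lshostconnect -l

The volume group ID (V11 in this sketch) is returned by the mkvolgrp command and must be used when the host connection is created.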
Although it is possible to limit through which DS8000 HA ports a certain WWPN connects to
volume groups, it is preferable to define the WWPNs to have access to all available DS8000
HA ports. Then, by using the preferred process of creating FC zones, as described in
“Importance of establishing zones” on page 269, you can limit the wanted HA ports through
the FC zones. In a switched fabric with multiple connections to the DS8000 storage system,
this concept of LUN affinity enables the host to see the same LUNs on different paths.
The number of times that a DS8000 logical disk is presented as a disk device to an open host
depends on the number of paths from each HA to the DS8000 storage system. The number
of paths from an open server to the DS8000 storage system is determined by these factors:
The number of HAs installed in the server
The number of connections between the SAN switches and the DS8000 storage system
The zone definitions created by the SAN switch software
Physical paths: Each physical path to a logical disk on the DS8000 storage system is
presented to the host operating system as a disk device.
Figure 8-1 Single-initiator zoning: host adapter FC 0 in Zone A with DS8000 ports I0000 and I0130 (SAN switch A), and host adapter FC 1 in Zone B with DS8000 ports I0230 and I0300 (SAN switch B)
You can see how the number of logical devices that are presented to a host can increase
rapidly in a SAN environment if you are not careful about selecting the size of logical disks
and the number of paths from the host to the DS8000 storage system.
Typically, it is preferable to cable the switches and create zones in the SAN switch software for
dual-attached hosts so that each server HA has 2 - 4 paths from the switch to the DS8000
storage system. With hosts configured this way, you can allow the multipathing module to
balance the load across the four HAs in the DS8000 storage system.
Zoning more paths, such as eight connections from the host to the DS8000 storage system,
does not improve SAN performance and causes twice as many devices to be presented to
the operating system.
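As a short worked example under these guidelines, a host with two HBAs that are each zoned to two DS8000 HA ports has four paths per LUN. Twenty LUNs are then presented to the operating system as 80 disk devices (one per path), which the multipathing module groups back into 20 logical devices. Zoning each HBA to four DS8000 ports instead doubles that to 160 presented devices without improving SAN performance.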
Figure 8-2 SAN single-path connection (the host adapter and the SAN switch are each a single point of failure)
Adding additional paths requires you to use multipathing software (Figure 8-3 on page 273).
Otherwise, the same LUN behind each path is handled as a separate disk from the operating
system side, which does not allow failover support.
Multipathing provides the DS8000 attached Open Systems hosts that run Windows, AIX,
HP-UX, Oracle Solaris, or Linux with these capabilities:
Support for several paths per LUN.
Load balancing between multiple paths when there is more than one path from a host
server to the DS8000 storage system. This approach might eliminate I/O bottlenecks that
occur when many I/O operations are directed to common devices through the same I/O
path, thus improving the I/O performance.
Figure 8-3 SAN multipath connection: a host with a multipathing module connected through DS8000 host ports I0001 and I0131, both accessing the same LUN
Important: Do not intermix several multipathing solutions within one host system. Usually,
the multipathing software solutions cannot coexist.
Persistent Reserve: Do not share LUNs among multiple hosts without the protection of
Persistent Reserve (PR). If you share LUNs among hosts without PR, you are exposed to
data corruption situations. You must also use PR when using FlashCopy.
The SDD does not support booting from or placing a system primary paging device on an
SDD pseudo-device.
For certain servers that run AIX, booting off the DS8000 storage system is supported. In that
case, LUNs used for booting are manually excluded from the SDD configuration by using the
querysn command to create an exclude file.
For more information about installing and using SDD, see IBM System Storage Multipath
Subsystem Device Driver User’s Guide, GC52-1309. This publication and other information
are available at the following website:
http://www.ibm.com/support/docview.wss?uid=ssg1S7000303
The policy that is specified for the device determines the path that is selected to use for an I/O
operation. The following policies are available:
Load balancing (default): The path to use for an I/O operation is chosen by estimating the
load on the adapter to which each path is attached. The load is a function of the number of
I/O operations currently in process. If multiple paths have the same load, a path is chosen
at random from those paths.
Round-robin: The path to use for each I/O operation is chosen at random from those paths
that are not used for the last I/O operation. If a device has only two paths, SDD alternates
between the two paths.
Failover only: All I/O operations for the device are sent to the same (preferred) path until
the path fails because of I/O errors. Then, an alternative path is chosen for later I/O
operations.
Normally, path selection is performed on a global rotating basis; however, the same path is
used when two sequential write operations are detected.
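As a brief illustration (the device number is a placeholder), the SDD datapath command displays and changes the policy for a specific device:

datapath query device 3
datapath set device 3 policy rr
datapath set device 3 policy lb

The first command shows the current policy and the paths for device 3, the second switches the device to the round-robin policy, and the third returns it to the default load-balancing policy.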
Important: With a single-path connection, which is not preferable, the SDD cannot provide
failure protection and load balancing.
From an availability point of view, this configuration is not preferred because of the single fiber
cable from the host to the SAN switch. However, this configuration is better than a single path
from the host to the DS8000 storage system, and this configuration can be useful for
preparing for maintenance on the DS8000 storage system.
When a path failure occurs, the SDD automatically reroutes the I/O operations from the failed
path to the other remaining paths. This action eliminates the possibility of a data path
becoming a single point of failure.
Multipath I/O
Multipath I/O (MPIO) summarizes native multipathing technologies that are available in
several operating systems, such as AIX, Linux, and Windows. Although the implementation
differs for each of the operating systems, the basic concept is almost the same:
The multipathing module is delivered with the operating system.
The multipathing module supports failover and load balancing for standard SCSI devices,
such as simple SCSI disks or SCSI arrays.
To add device-specific support and functions for a specific storage device, each storage
vendor might provide a device-specific module that implements advanced functions for
managing the specific storage device.
For example, Symantec provides an alternative to the IBM provided multipathing software.
The Veritas Volume Manager (VxVM) relies on the Microsoft implementation of MPIO and
Device Specific Modules (DSMs) that rely on the Storport driver. The Storport driver is not
available for all versions of Windows. The Veritas Dynamic MultiPathing (DMP) software is
also available for UNIX versions, such as Oracle Solaris.
FCP: z/VM, z/VSE, and Linux for z Systems can also be attached to the DS8000 storage
system with FCP. Then, the same considerations as for Open Systems hosts apply.
8.3.1 FICON
FICON is the Fibre Connection protocol that is used with z Systems servers. Each storage unit HA has either
four or eight ports, and each port has a unique WWPN. You can configure the port to operate
with the FICON upper layer protocol. When configured for FICON, the storage unit provides
the following configurations:
Either fabric or point-to-point topology.
A maximum of 128 host ports for a DS8886 model and a maximum of 64 host ports for a
DS8884 model.
A maximum of 1280 logical paths per DS8000 HA port.
Access to all 255 control unit images (65280 count key data (CKD) devices) over each
FICON port. FICON HAs support 2, 4, 8, or 16 Gbps link speed in DS8880 storage
systems.
Operating at 16 Gbps speeds, FICON Express16S channels achieve up to 620 MBps for a
mix of large sequential read and write I/O operations, as shown in the following charts.
Figure 8-4 shows a comparison of the overall throughput capabilities of various generations of
channel technology.
Figure 8-4 Measurements of throughput (MBps) over several generations of FICON channels, with and without zHPF
The FICON Express16S channel on IBM z13 and the FICON Express8S channel on IBM
zEnterprise EC12 and BC12 represent a significant improvement in maximum
bandwidth capability compared to FICON Express4 channels and previous FICON offerings.
The response time improvements are expected to be noticeable for large data transfers.
Figure 8-4 also shows the maximum throughput of FICON Express 16S and 8S with High
Performance FICON (zHPF) that delivers significant improvement compared to the FICON
protocol.
Figure 8-5 Measurements of IOPS over several generations of channels
z13 is the last z Systems server to support FICON Express 8 channels. FICON Express 8
will not be supported on future high-end z Systems servers as carry forward on an
upgrade.
z13 does not support FICON Express4. zEC12 and zBC12 are the last systems that
support FICON Express4.
Withdrawal: At the time of writing, all FICON Express4, FICON Express2, and FICON
features are withdrawn from marketing.
Note: The FICON Express4 was the last feature to support 1 Gbps link data rates.
For any generation of FICON channels, you can attach directly to a DS8000 storage system
or you can attach through a FICON capable FC switch.
When you use a FC/FICON HA to attach to FICON channels, either directly or through a
switch, the port is dedicated to FICON attachment and cannot be simultaneously attached to
FCP hosts. When you attach a DS8000 storage system to FICON channels through one or
more switches, the maximum number of FICON logical paths is 1280 per DS8000 HA port.
The directors provide high availability with redundant components and no single points of
failure. A single director between servers and a DS8000 storage system is not preferable
because it can be a single point of failure. More than two directors are preferable for
redundancy.
Figure 8-6 FICON topologies between z Systems and a DS8000 storage system
FICON connectivity
Usually in z Systems environments, a one-to-one connection between FICON channels and
storage HAs is preferred because the FICON channels are shared among multiple logical
partitions (LPARs) and heavily used. Carefully plan the oversubscription of HA ports to avoid
any bottlenecks.
Sizing FICON connectivity is not an easy task. You must consider many factors. As a
preferred practice, create a detailed analysis of the specific environment. Use these
guidelines before you begin sizing the attachment environment:
For FICON Express CHPID utilization, the preferred maximum utilization level is 50%.
For the FICON Bus busy utilization, the preferred maximum utilization level is 40%.
For the FICON Express Link utilization with an estimated link throughput of 2 Gbps,
4 Gbps, 8 Gbps, or 16 Gbps, the preferred maximum utilization threshold level is 70%.
For more information about DS8000 FICON support, see IBM System Storage DS8000 Host
Systems Attachment Guide, SC26-7917, and FICON Native Implementation and Reference
Guide, SG24-6266.
You can monitor the FICON channel utilization for each CHPID in the RMF Channel Path
Activity report. For more information about the Channel Path Activity report, see Chapter 14,
“Performance considerations for IBM z Systems servers” on page 459.
The following statements are some considerations and preferred practices for paths in z/OS
systems to optimize performance and redundancy:
Do not mix paths to one LCU with different link speeds in a path group on one z/OS. It
does not matter in the following cases, even if those paths are on the same CPC:
– The paths with different speeds, from one z/OS to different multiple LCUs
– The paths with different speeds, from each z/OS to one LCU
Place each path in a path group on different I/O bays.
The FICON features provide support of FC devices to z/VM, z/VSE, and Linux on z Systems,
which means that these features can access industry-standard SCSI devices. For disk
applications, these FCP storage devices use FB 512-byte sectors rather than Extended
Count Key Data (IBM ECKD™) format. All available FICON features can be defined in FCP
mode.
Before planning for performance, validate the configuration of your environment. See the IBM
System Storage Interoperation Center (SSIC), which shows information about supported
system models, operating system versions, host bus adapters (HBA), and so on:
http://www.ibm.com/systems/support/storage/config/ssic/index.jsp
To download fix packs for your AIX version and the current firmware version for the Fibre
Channel (FC) adapters, go to the IBM Support website:
http://www.ibm.com/support/fixcentral/
For more information about how to attach and configure a host system to a DS8000 storage
system, see the IBM System Storage DS8000 Host System Attachment Guide, GC27-4210,
found at:
http://www.ibm.com/support/docview.wss?rs=1114&context=HW2B2&uid=ssg1S7001161
Note: The AIX command prtconf displays system information, such as system model,
processor type, and firmware levels, which facilitates Fix Central website navigation.
Normally in each of these layers, there are performance indicators that help you to assess
how that particular layer affects performance.
Modern DS8000 storage systems have many improvements in data management that can
change LVM usage. Easy Tier V3 functions move the method of data isolation from the rank
level to the extent pool, volume, and application levels. Today, a disk system must be planned
from the application point of view, not from the hardware resource allocation point of view.
Plan the logical volumes (LVs) on an extent pool that is dedicated to one type
of application or workload. This method of disk system layout eliminates the necessity of LVM
usage.
Moreover, by using LVM striping on Easy Tier managed volumes, you eliminate most of the
technology benefits because striping masks the real skew factor and changes the real
picture of the hot extent allocation. This method might lead to the generation of an improper
extent migration plan, which leads to continuous massive extent migration. Performance analysis
becomes complicated, and locating I/O bottlenecks becomes a complicated task. In general, the
basic approach for most applications is to use one or two hybrid extent pools with three
tiers and Easy Tier in automated mode for a group of applications of the same kind or the same
workload type. To prevent bandwidth consumption by one or several applications, use the I/O
Priority Manager function.
Do not use the extended RAID functions of the LVM (RAID 5, 6, and 10) at all, except for
the RAID 1 (mirroring) function, which might be required in high availability (HA) and disaster
recovery (DR) solutions.
Consider these points when you read the LVM description in the following sections. Also, see
Chapter 4, “Logical configuration performance considerations” on page 83.
All these parameters have a common meaning: the length of the queue of SCSI
commands that a device can keep outstanding (accepted but not yet confirmed as complete)
while it processes I/O requests. Because a device (disk or FC adapter) can accept a new
command before the previous ones are completed, the operating system or a driver can send
another command or I/O request without waiting. One I/O request can consist of two or more
SCSI commands.
There are two methods for the implementation of queuing: untagged and tagged. For more
information about this topic, see the following website:
https://www.ibm.com/support/knowledgecenter/ssw_aix_71/com.ibm.aix.kernextc/scsi_c
md_tag.htm
SCSI command tagged queuing refers to queuing multiple commands to a SCSI device.
Queuing to the SCSI device can improve performance because the device itself determines
the most efficient way to order and process commands. Untagged queuing allows a target to
accept one I/O process from each initiator for each logical unit or target routine.
The qdepth parameter might affect disk I/O performance. Values that are too small can
prevent the device from being used effectively. Values that are too high might lead to a QUEUE FULL
status on the device, rejection of the next I/O, and data corruption or a system crash.
Another important reason why queue depth parameters must be set correctly is the queue
limits of the host ports of the disk system. The host port might be flooded with the SCSI
commands if there is no correct limit set in the operating system. When this situation
happens, a host port refuses to accept any I/Os, then resets, and then starts the loop
initialization primitive (LIP) procedure. This situation leads to the inactivity of the port for up to
several minutes and might initiate path failover or an I/O interruption. Moreover, in highly
loaded environments, this situation leads to the overload of the other paths and might lead to
the complete I/O interruption for the application or buffer overflow in the operating system,
which causes paging activity.
In addition, certain HBAs or multipathing drivers have their own queue depth settings to
manage FC targets and paths. These settings are needed to limit the number of commands
that the driver sends to the FC target and FCP LUN to prevent buffer overflow. For more
information, see the operating system-specific chapters of this book.
On AIX, the size of the disk (hdisk) driver queue is specified by the queue_depth attribute, and
the size of the adapter driver queue is specified by the num_cmd_elems attribute. In addition,
there are queue depth limits to the FC adapter and FC path. For more information, see 9.2.6,
“IBM Subsystem Device Driver for AIX” on page 306 and 9.2.7, “Multipath I/O with IBM
Subsystem Device Driver Path Control Module” on page 306.
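As a hedged example (the device names and values are illustrative and must be sized for your workload), these attributes are displayed and changed with lsattr and chdev; the -P flag records the change in the ODM so that it takes effect the next time the device is configured or the system is rebooted:

lsattr -El hdisk4 -a queue_depth
chdev -l hdisk4 -a queue_depth=32 -P
lsattr -El fcs0 -a num_cmd_elems
chdev -l fcs0 -a num_cmd_elems=1024 -P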
For each VSCSI adapter in a VIOS, which is known as a vhost device, there is a matching
VSCSI adapter in a Virtual I/O Client (VIOC). These adapters have a fixed queue depth that
varies depending on how many VSCSI LUNs are configured for the adapter. There are 512
command elements, of which two are used by the adapter, three are reserved for each VSCSI
LUN for error recovery, and the rest are used for I/O requests. Thus, with the default
queue_depth of 3 for VSCSI LUNs, there are up to 85 LUNs to use at an adapter: (512 - 2)/(3
+ 3) = 85 (rounding down). So, if you need higher queue depths for the devices, the number of
LUNs per adapter is reduced. For example, if you want to use a queue_depth of 25, you can
have 510/28 = 18 LUNs. You can configure multiple VSCSI adapters to handle many LUNs
with high queue depths and each one requires additional memory. You can have more than
one VSCSI adapter on a VIOC that is connected to the same VIOS if you need more
bandwidth.
Important: To change the queue_depth on an hdisk device at the VIOS, you must unmap
the disk from the VIOC and remap it after the change.
If you use NPIV, if you increase num_cmd_elems on the virtual FC (vFC) adapter, you must also
increase the setting on the real FC adapter.
For more information about the queue depth settings for VIO Server, see IBM System
Storage DS8000 Host Attachment and Interoperability, SG24-8887.
It is preferable to tune based on the application I/O requirements, especially when the disk
system is shared with other servers.
Regarding the qdepth_enable parameter, the default is yes, which essentially has the SDD
handling the I/Os beyond queue_depth for the underlying hdisks. Setting it to no results in the
hdisk device driver handling them in its wait queue. With qdepth_enable=yes, SDD handles
the wait queue; otherwise, the hdisk device driver handles the wait queue. There are
error-handling benefits that allow the SDD to handle these I/Os, for example, by using LVM
mirroring across two DS8000 storage systems. With heavy I/O loads and much queuing in
SDD (when qdepth_enable=yes), it is more efficient to allow the hdisk device drivers to handle
relatively shorter wait queues rather than SDD handling a long wait queue by setting
qdepth_enable=no. SDD queue handling is single threaded and there is a thread for handling
each hdisk queue. So, if error handling is of primary importance (for example, when LVM
mirrors across disk subsystems), leave qdepth_enable=yes. Otherwise, setting
qdepth_enable=no more efficiently handles the wait queues when they are long. Set the
qdepth_enable parameter by using the datapath command because it is a dynamic change
that way (chdev is not dynamic for this parameter).
For the adapters, look at the adaptstats column. Set num_cmd_elems=Maximum or 200,
whichever is greater. Unlike devstats with qdepth_enable=yes, Maximum for adaptstats can
exceed num_cmd_elems.
The iostat -D command shows statistics since system boot, and it assumes that the system
is configured to continuously maintain disk I/O history. Run lsattr -El sys0 to see whether
the iostat attribute is set to true, and use smitty chgsys to change the attribute setting.
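For example, the attribute can be checked and enabled dynamically with the following commands (smitty chgsys leads to the same setting):

lsattr -El sys0 -a iostat
chdev -l sys0 -a iostat=true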
# sar -d 1 2
System configuration: lcpu=2 drives=1 ent=0.30
The avwait is the average time spent in the wait queue. The avserv is the average time spent
in the service queue. avserv corresponds to avgserv in the iostat output. The avque value
represents the average number of I/Os in the wait queue.
IBM Subsystem Device Driver Path Control Module (SDDPCM) provides the pcmpath query
devstats and pcmpath query adaptstats commands to show hdisk and adapter queue
statistics. You can refer to the SDDPCM manual for syntax, options, and explanations of all
the fields. Example 9-2 shows devstats output for a single LUN and adaptstats output for a
Fibre Channel adapter.
Transfer Size: <= 512 <= 4k <= 16K <= 64K > 64K
118 20702 80403 12173 5245
Look at the Maximum field, which indicates the maximum number of I/Os submitted to the
device since system boot.
You can monitor adapter queues and IOPS. For adapter IOPS, run iostat -at <interval>
<# of intervals> and for adapter queue information, run iostat -aD, optionally with an
interval and number of intervals.
The downside of setting queue depths too high is that the disk subsystem cannot handle the
I/O requests in a timely fashion and might even reject the I/O or ignore it. This situation can
result in an I/O timeout, and I/O error recovery code is invoked. This situation is bad
because the processor ends up performing more work than necessary to handle the I/Os. If the
I/O eventually fails, this situation can lead to an application crash or worse.
Lower the queue depth per LUN when using multipathing. With multipathing, this default value
is magnified because it equals the default queue depth of the adapter multiplied by the
number of active paths to the storage device. For example, because QLogic uses a default
queue depth of 32, the preferable queue depth value to use is 16 when using two active paths
and 8 when using four active paths. Directions for adjusting the queue depth are specific to
each HBA driver and are available in the documentation for the HBA.
For more information about AIX, see AIX disk queue depth tuning for performance, found at:
http://www.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/TD105745
With AIX 7 (with the DS8000 storage system), you must use the default parameters and install
the file sets for multipathing and host attachment that already provide basic performance defaults
for queue length and SCSI timeout. For more information about setting up the volume layout,
see 9.2.4, “IBM Logical Volume Manager” on page 301.
JFS2 was created for 64-bit kernels. Its file organization method is a B+ tree algorithm. It
supports all the features that are described for JFS, with the exception of “delayed write
operations.” It also supports concurrent I/O (CIO).
Read-ahead algorithms
JFS and JFS2 have read-ahead algorithms that can be configured to buffer data for
sequential reads into the file system cache before the application requests it. Ideally, this
feature reduces the percent of I/O wait (%iowait) and increases I/O throughput as seen from
the operating system. Configuring the read-ahead algorithms too aggressively results in
unnecessary I/O. The following VMM tunable parameters control read-ahead behavior:
For JFS:
– minpgahead = max(2, <application’s blocksize> / <filesystem’s blocksize>)
– maxpgahead = max(256, (<application’s blocksize> / <filesystem’s blocksize> *
<application’s read ahead block count>))
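As an illustrative sketch (assuming that these read-ahead parameters are exposed through the
ioo command on your AIX level; the values are examples only), you can list and adjust them as
follows:
# ioo -a | grep -E 'pgahead|ReadAhead'
# ioo -o minpgahead=2 -o maxpgahead=256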
I/O pacing
I/O pacing manages the concurrency to files and segments by limiting the processor
resources for processes that exceed a specified number of pending write I/Os to a discrete
file or segment. When a process exceeds the maxpout limit (high-water mark), it is put to sleep
until the number of pending write I/Os to the file or segment is less than minpout (low-water
mark). This pacing allows another process to access the file or segment.
Disabling I/O pacing improves backup times and sequential throughput. Enabling I/O pacing
ensures that no single process dominates the access to a file or segment. AIX V6.1 and
higher enables I/O pacing by default. In AIX V5.3, you need to enable this feature explicitly.
The feature is enabled by setting the sys0 attributes minpout and maxpout to 4096 and 8193
(check the current values with lsattr -El sys0). To disable I/O pacing, set them both to zero.
You can also limit the effect of the global parameters by mounting file systems with an
explicit 0 for minpout and maxpout: mount -o minpout=0,maxpout=0 /u. Tuning the maxpout and
minpout parameters might prevent any thread that performs sequential writes to a file from
dominating system resources.
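For example (a minimal sketch; the values shown are the AIX V6.1 defaults cited above), you
can set or clear the global high-water and low-water marks with chdev:
# chdev -l sys0 -a maxpout=8193 -a minpout=4096
# chdev -l sys0 -a maxpout=0 -a minpout=0
The first command enables I/O pacing with the default marks, and the second command disables it.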
Enabling I/O pacing improves user response time at the expense of throughput. For more
information about I/O pacing, see the following website:
http://www.ibm.com/support/knowledgecenter/ssw_aix_71/com.ibm.aix.performance/disk
_io_pacing.htm
Write behind
The sync daemon (syncd) writes to disk the dirty file pages that remain in memory and are not
reused. In some situations, this behavior can result in abnormal temporary disk utilization.
The write behind parameter causes pages to be written to disk before the sync daemon
runs. Writes are triggered when a specified number of sequential 16 KB clusters (for JFS) and
128 KB clusters (for JFS2) are updated.
Sequential write behind:
– numclust for JFS
– j2_nPagesPerWriteBehindCluster and j2_nRandomCluster for JFS2
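A hedged illustration, again assuming that these tunables are managed with the ioo command
on your AIX level (the value shown is an example, not a recommendation):
# ioo -a | grep -E 'numclust|WriteBehind'
# ioo -o j2_nPagesPerWriteBehindCluster=64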
Mount options
Use the release behind, direct I/O, and CIO mount options when appropriate; a combined mount example follows this list:
The release behind mount option can reduce syncd and lrud impact. This option modifies
the file system behavior so that it does not maintain data in JFS2 cache. You use these
options if you know that data that goes into or out of certain file systems is not requested
again by the application before the data is likely to be paged out. Therefore, the lrud
daemon has less work to do to free cache and eliminates any syncd impact for this file
system. One example of a situation where you can use these options is if you have a Tivoli
Storage Manager Server with disk storage pools in file systems and you configured the
read ahead mechanism to increase the throughput of data, especially when a migration
takes place from disk storage pools to tape storage pools:
– -rbr for release behind after a read
– -rbw for release behind after a write
– -rbrw for release behind after a read or a write
Direct I/O (DIO):
– Bypass JFS/JFS2 cache.
– No read ahead.
– An option of the mount command.
– Useful for databases that use file systems rather than raw LVs. If an application has its
own cache, it does not make sense to also keep data in file system cache.
– Direct I/O is not supported on compressed file systems.
For more information about DIO, see the following website:
http://www.ibm.com/support/knowledgecenter/ssw_aix_71/com.ibm.aix.genprogc/dire
ct_io_normal.htm
CIO:
– Same as DIO but without inode locking, so the application must ensure data integrity
for multiple simultaneous I/Os to a file.
– An option of the mount command.
– Not available for JFS.
– If possible, consider the use of a no cache option at the database level when it is
available rather than at the AIX level. An example for DB2 is the no_filesystem_cache
option. DB2 can control it at the table space level. With a no file system caching policy
enabled in DB2, a specific OS call is made that uses CIO regardless of the mount. By
setting CIO as a mount option, all files in the file system are CIO-enabled, which might
not benefit certain files.
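As the combined example referenced before this list (a sketch only; the logical volume names
and mount points are hypothetical), release behind and CIO are applied as mount options:
# mount -o rbrw /dev/stgpool_lv /tsmstgpool
# mount -o cio /dev/db_lv /db2data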
Asynchronous I/O
AIO is the AIX facility that allows an application to issue an I/O request and continue
processing without waiting for the I/O to finish. Many applications, such as databases and file
servers, take advantage of the ability to overlap processing and I/O.
For more information about changing tunable values for AIX AIO, see the following website:
http://www.ibm.com/support/knowledgecenter/ssw_aix_71/com.ibm.aix.performance/asyn
c_io.htm
For more information, see the Veritas File System for AIX Administrator’s Guide, found at:
https://www.veritas.com/support/en_US/article.DOC5203
In addition, Veritas published Advice for tuning Veritas File System on AIX, found at:
https://www.veritas.com/support/en_US/article.000067287
At a high level, a larger file system block size provides higher throughput for medium to large
files by increasing I/O size. Smaller file system block sizes handle tiny files more efficiently. To
select the preferred file system block size, it is essential to understand the system workload,
especially the average and minimum file sizes.
The following sections describe some of the configuration parameters that are available in
Spectrum Scale. They include some notes about how the parameter values might affect
performance.
Pagepool
Spectrum Scale uses pinned memory (also called pagepool) and unpinned memory for
storing file data and metadata in support of I/O operations. Pinned memory regions cannot be
swapped out to disk.
The pagepool sets the size of buffer cache on each node. For new Spectrum Scale V4.1.1
installations, the default value is either one-third of the node’s physical memory or 1 GB,
whichever is smaller. For a database system, 100 MB might be enough. For an application
with many small files, you might need to increase this setting to 2 GB - 4 GB.
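For example, assuming that the pagepool is set with the standard Spectrum Scale mmchconfig
command (the value is illustrative), you can run:
# mmchconfig pagepool=4G
Depending on the release and options that are used, the change might not take effect until the
Spectrum Scale daemon is restarted on the affected nodes.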
In Figure 9-5, the DS8000 LUNs that are under the control of the LVM are called physical
volumes (PVs). The LVM splits the disk space into smaller pieces, which are called physical
partitions (PPs). A logical volume (LV) consists of several logical partitions (LPARs). A file
system can be mounted over an LV, or it can be used as a raw device. Each LPAR can point to
up to three corresponding PPs. The ability of the LV to point a single LPAR to multiple PPs is
the way that LVM implements mirroring (RAID 1).
To set up the volume layout with the DS8000 LUNs, you can adopt one of the following
strategies:
Storage pool striping: In this case, you are spreading the workload at the storage level. At
the operating system level, you must create the LVs with the inter-policy attribute set to
minimum, which is the default option when creating an LV.
PP striping: A set of LUNs is created in different ranks inside of the DS8000 storage
system. When the LUNs are recognized in AIX, a volume group (VG) is created. The LVs
are spread evenly over the LUNs by setting the inter-policy to maximum, which is the most
common method that is used to distribute the workload. The advantage of this method
compared to storage pool striping is the granularity of data spread over the LUNs. With
storage pool striping, the data is spread in chunks of 1 GB. In a VG, you can create PP
sizes 8 - 16 MB. The advantage of this method compared to LVM striping is that you have
more flexibility to manage the LVs, such as adding more disks and redistributing the LVs
evenly across all disks by reorganizing the VG.
LVM striping: As with PP striping, a set of LUNs is created in different ranks inside of the
DS8000 storage system. After the LUNs are recognized in AIX, a VG is created with larger
PP sizes, such as 128 MB or 256 MB. The LVs are spread evenly over the LUNs by setting
the stripe size of LV 8 - 16 MB. From a performance standpoint, LVM striping and PP
striping provide the same performance. You might see an advantage in a scenario of
PowerHA with LVM Cross-Site and VGs of 1 TB or more when you perform cluster
verification, or you see that operations related to creating, modifying, or deleting LVs are
faster.
PP striping
Figure 9-6 shows an example of PP striping. The VG contains four LUNs, with 16 MB PPs
created on each LUN. The LV in this example consists of a group of 16 MB PPs from four logical
disks: hdisk4, hdisk5, hdisk6, and hdisk7.
Figure 9-6 PP striping: /dev/inter-disk_lv is built from 16 MB physical partitions taken in turn
from four hardware-striped LUNs on different DS8000 extent pools. Each 8 GB LUN divided into
16 MB partitions yields about 500 physical partitions (pp1 - pp500), and /dev/inter-disk_lv
consists of eight logical partitions (lp1 - lp8), or 8 x 16 MB = 128 MB.
The first step is to create a VG. Create a VG with a set of DS8000 LUNs where each LUN is
in a separate extent pool. If you plan to add a set of LUNs to a host, define another VG. To
create a VG, run the following command to create the data01vg and a PP size of 16 MB:
mkvg -S -s 16 -y data01vg hdisk4 hdisk5 hdisk6 hdisk7
After you create the VG, the next step is to create the LVs. For a VG with four disks
(LUNs), create the LVs as a multiple of the number of disks in the VG times the PP size. In
this example, we create the LVs in multiples of 64 MB. You can implement the PP striping by
using the -e x option. By adding the -a e option, the Intra-Physical Volume Allocation Policy
changes the allocation policy from middle (default) to edge so that the LV PPs are allocated
beginning at the outer edge and continuing to the inner edge. This method ensures that all
PPs are sequentially allocated across the physical disks. To create an LV of 1 GB, run the
following command:
mklv -e x -a e -t jfs2 -y inter-disk_lv data01vg 64 hdisk4 hdisk5 hdisk6 hdisk7
Preferably, use inline logs for JFS2 LVs, which result in one log for every file system that is
automatically sized. Having one log per file system improves performance because it avoids
the serialization of access when multiple file systems make metadata changes. The
disadvantage of inline logs is that they cannot be monitored for I/O rates, which can provide
an indication of the rate of metadata changes for a file system.
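For example, a hedged sketch of creating a JFS2 file system with an inline log on the LV that
was created above (this sketch assumes that the logname=INLINE attribute is available on your
AIX level):
# crfs -v jfs2 -d inter-disk_lv -m /interdiskfs -A yes -a logname=INLINE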
LVM striping
Figure 9-7 shows an example of a striped LV. The LV called /dev/striped_lv uses the same
capacity as /dev/inter-disk_lv (shown in Figure 9-6 on page 303), but it is created
differently.
Figure 9-7 LVM striping: /dev/striped_lv is built from 256 MB physical partitions on hdisk4,
hdisk5, hdisk6, and hdisk7, which are hardware-striped LUNs on different DS8000 extent pools.
Each 8 GB LUN divided into 256 MB partitions yields about 32 physical partitions per LUN
(pp1 - pp32), and each logical partition is divided into strips that are spread across the four LUNs.
Again, the first step is to create a VG. To create a VG for LVM striping, run the following
command:
mkvg -S -s 256 -y data01vg hdisk4 hdisk5 hdisk6 hdisk7
To create a striped LV with LVM striping, you must combine the following options; a combined
command example follows this list:
Stripe width (-C): This option sets the maximum number of disks to spread the data. The
default value is used from the upperbound option (-u).
Copies (-c): This option is required only when you create mirrors. You can set 1 - 3 copies.
The default value is 1.
Strict allocation policy (-s): This option is required only when you create mirrors and it is
necessary to use the value s (superstrict).
Stripe size (-S): This option sets the size of a chunk of a sliced PP. Since AIX V5.3, the
valid values include 4K, 8K, 16K, 32K, 64K, 128K, 256K, 512K, 1M, 2M, 4M, 8M, 16M,
32M, 64M, and 128M.
Upperbound (-u): This option sets the maximum number of disks for a new allocation. If
you set the allocation policy to superstrict, the upperbound value must be the result of the
stripe width times the number of copies that you want to create.
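As the combined example referenced above (a sketch only; the stripe size, LV size in logical
partitions, and disk names are illustrative), the following command creates a striped LV of four
256 MB logical partitions (1 GB) across the four LUNs with a 16 MB strip size:
mklv -t jfs2 -y striped_lv -S 16M -u 4 data01vg 4 hdisk4 hdisk5 hdisk6 hdisk7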
Since AIX V5.3, the striped column feature is available. With this feature, you can extend an
LV onto a new set of disks after the current disks where the LV is spread are full.
Memory buffers
Adjust the memory buffers (pv_min_pbuf) of LVM to increase the performance. Set the
parameter to 2048 for AIX 7.1.
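For example, assuming that pv_min_pbuf is tuned through the ioo command on your AIX level,
the following command sets the value and makes it persistent across restarts:
# ioo -p -o pv_min_pbuf=2048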
Scheduling policy
If you have a dual-site cluster solution that uses PowerHA with LVM Cross-Site, you can
reduce the link requirements among the sites by changing the scheduling policy of each LV to
parallel write/sequential read (ps). You must remember that the first copy of the mirror must
point to the local storage.
With SDD V1.7 and prior versions, the datapath command, instead of the chdev command,
was used to change the qdepth_enable setting because the change is then dynamic. For
example, datapath set qdepth disable sets it to no. Certain releases of SDD include SDD
queuing and others do not, and certain releases do not show the qdepth_enable attribute.
Either check the manual for your version of SDD or try the datapath command to see whether it
supports turning off this feature.
If qdepth_enable=yes (the default), I/Os that exceed the queue_depth queue in the SDD. If
qdepth_enable=no, I/Os that exceed the queue_depth queue in the hdisk wait queue. SDD
with qdepth_enable=no and SDDPCM do not queue I/Os and instead merely pass them to the
hdisk drivers.
Note: SDD is not compatible with the AIX MPIO framework. It cannot be installed on the
IBM Power Systems VIOS and is not supported on AIX 7.1 (or later). For these reasons,
SDD should no longer be used for AIX support of the DS8000 storage system.
9.2.7 Multipath I/O with IBM Subsystem Device Driver Path Control Module
MPIO is another multipathing device driver. It was introduced in AIX V5.2. The rationale for a
native multipathing framework is that, in a SAN environment, you might want to connect to
several storage subsystems from a single host. Each storage vendor has its own multipathing
solution that is not interoperable with the multipathing solutions of other storage vendors. This
restriction increases the complexity of managing the compatibility of operating system fix levels,
HBA firmware levels, and multipathing software versions.
AIX provides the base MPIO device driver; however, it is still necessary to install the MPIO
device driver that is provided by the storage vendor to take advantage of all of the features of
a multipathing solution. For the DS8000 storage system and other storage systems, IBM
provides the SDDPCM multipathing driver. SDDPCM is compatible with the AIX MPIO
framework and replaced SDD.
Use MPIO with SDDPCM rather than SDD with AIX whenever possible.
Comparing SDD and SDDPCM: with SDD, each LUN has a corresponding vpath and an hdisk
for each path to the LUN; with SDDPCM, you have only one hdisk per LUN. Thus, with SDD,
you can submit queue_depth x (number of paths) I/Os to a LUN, and with SDDPCM, you can
submit only queue_depth I/Os to the LUN. If you switch from SDD that uses four paths to
SDDPCM, you must set the queue_depth of the SDDPCM hdisks to four times that of the SDD
hdisks for an equivalent effective queue depth.
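For example (a sketch only; the hdisk name and value are illustrative and assume an SDD hdisk
queue_depth of 20 with four paths), you can set the SDDPCM hdisk queue depth and defer the
change until the device is reconfigured or the system is restarted:
# chdev -l hdisk4 -a queue_depth=80 -P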
Both the hdisk and adapter drivers have “in process” and “wait” queues. After the queue limit
is reached, the I/Os wait until an I/O completes and opens a slot in the service queue. The
in-process queue is also sometimes called the “service” queue. Many applications do not
generate many in-flight I/Os, especially single-threaded applications that do not use AIO.
Applications that use AIO are likely to generate more in-flight I/Os.
Changing the max_xfer_size: Changing max_xfer_size uses memory in the PCI Host
Bridge chips attached to the PCI slots. The sales manual, regarding the dual-port 4
Gbps PCI-X FC adapter, states that “If placed in a PCI-X slot rated as SDR compatible
and/or has the slot speed of 133 MHz, the AIX value of the max_xfer_size must be kept
at the default setting of 0x100000 (1 megabyte) when both ports are in use. The
architecture of the DMA buffer for these slots does not accommodate larger
max_xfer_size settings.” Issues occur when configuring the LUNs if there are too many
FC adapters and too many LUNs attached to the adapter. Errors, such as DMA_ERR
might appear in the error report. If you get these errors, you must change the
max_xfer_size back to the default value. Also, if you boot from SAN and you encounter
this error, you cannot boot, so be sure to have a back-out plan if you plan to change the
max_xfer_size and boot from SAN.
dyntrk: AIX supports dynamic tracking of FC devices. Previous releases of AIX required a
user to unconfigure FC storage device and adapter device instances before changing the
storage area network (SAN), which can result in an N_Port ID (SCSI ID) change of any
remote storage ports. If dynamic tracking of FC devices is enabled, the FC adapter driver
detects when the Fibre Channel N_Port ID of a device changes. The FC adapter driver
then reroutes traffic that is destined for that device to the new address while the devices
are still online. Events that can cause an N_Port ID to change include moving a cable
between a switch and storage device from one switch port to another, connecting two
separate switches that use an inter-switch link (ISL), and possibly restarting a switch.
Dynamic tracking of FC devices is controlled by a new fscsi device attribute, dyntrk. The
default setting for this attribute is no. To enable dynamic tracking of FC devices, set this
attribute to dyntrk=yes, as shown in the following example:
chdev -l fscsi0 -a dyntrk=yes
fc_err_recov: AIX supports Fast I/O Failure for FC devices after link events in a switched
environment. If the FC adapter driver detects a link event, such as a lost link between a
storage device and a switch, the FC adapter driver waits a short time, approximately 15
seconds, so that the fabric can stabilize. At that point, if the FC adapter driver detects that
the device is not on the fabric, it begins failing all I/Os at the adapter driver. Any new I/O or
future retries of the failed I/Os are failed immediately by the adapter until the adapter driver
detects that the device rejoined the fabric. Fast Failure of I/O is controlled by a fscsi device
attribute, fc_err_recov. The default setting for this attribute is delayed_fail, which is the
I/O failure behavior seen in previous versions of AIX. To enable Fast I/O Failure, set this
attribute to fast_fail, as shown in the following example:
chdev -l fscsi0 -a fc_err_recov=fast_fail
Important: Change fc_err_recov to fast_fail and dyntrk to yes only if you use a
multipathing solution with more than one path.
Example 9-3 on page 309 shows the output of the attributes of a fcs device.
For more information about the Fast I/O Failure (fc_err_recov) and Dynamic Tracking
(dyntrk) options, see the following links:
http://www.ibm.com/support/knowledgecenter/ssw_aix_71/com.ibm.aix.kernextc/dynami
ctracking.htm
http://www.ibm.com/support/knowledgecenter/ssw_aix_71/com.ibm.aix.performance/fas
t_fail_dynamic_interaction.htm
For more information about num_cmd_elems and max_xfer_size, see AIX disk queue depth
tuning for performance, found at:
http://www.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/TD105745
For more information, see the SDD and SDDPCM User’s Guides at the following website:
http://www.ibm.com/support/docview.wss?rs=540&context=ST52G7&uid=ssg1S7000303
The VIOS allows a physical adapter with attached disks at the VIOS partition level to be shared
by one or more client partitions, which enables clients to consolidate and potentially minimize
the number of required physical adapters.
There are two ways of connecting disk storage to a client LPAR with VIOS:
VSCSI
N-port ID Virtualization (NPIV)
Performance suggestions
Use these performance settings when configuring VSCSI for performance:
Processor:
– Typical entitlement is 0.25.
– Virtual processor of 2.
– Always run uncapped.
– Run at higher priority (weight factor >128).
– Allow for more processor power with high network loads.
Important: Change the reserve_policy parameter to no_reserve only if you are going to
map the LUNs of the DS8000 storage system directly to the client LPAR.
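For example, a hedged sketch of changing this attribute on the VIOS from the padmin restricted
shell (the hdisk name is illustrative; from a root shell, the equivalent command is
chdev -l hdisk4 -a reserve_policy=no_reserve):
$ chdev -dev hdisk4 -attr reserve_policy=no_reserve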
For more information, see the Planning for the Virtual I/O Server and Planning for virtual SCSI
sections in the POWER8 IBM Knowledge Center at the following website:
http://www.ibm.com/support/knowledgecenter/POWER8/p8hb1/p8hb1_vios_planning.htm
http://www.ibm.com/support/knowledgecenter/8247-21L/p8hb1/p8hb1_vios_planning_vsc
si.htm
Example 9-5 on page 313 shows how vmstat can help you monitor file system activity by
running the vmstat -I command.
Example 9-6 shows you another option that you can use, vmstat -v, from which you can
understand whether the blocked I/Os are because of a shortage of buffers.
Example 9-6 The vmstat -v utility output for file system buffer activity analysis
[root@p520-tic-3]# vmstat -v | tail -7
0 pending disk I/Os blocked with no pbuf
0 paging space I/Os blocked with no psbuf
2484 filesystem I/Os blocked with no fsbuf
0 client filesystem I/Os blocked with no fsbuf
0 external pager filesystem I/Os blocked with no fsbuf
0 Virtualized Partition Memory Page Faults
0.00 Time resolving virtualized partition memory page faults
For the preferred practice values, see the application papers listed under “AIX file system
caching” on page 296.
As you can see in Example 9-7, there are two incremental counters: pervg_blocked_io_count
and global_blocked_io_count. The first counter indicates how many times an I/O block
happened because of a lack of LVM pinned memory buffer (pbufs) on that VG. The second
incremental counter counts how many times an I/O block happened because of the lack of
LVM pinned memory buffer (pbufs) in the whole OS. Other indicators that the system is I/O
bound can be seen in the disk xfer part of the vmstat output when vmstat is run against the
physical disk, as shown in Example 9-8.
The disk xfer part provides the number of transfers per second to the specified PVs that
occurred in the sample interval. This count does not imply an amount of data that was read or
written.
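For the pervg_blocked_io_count and global_blocked_io_count counters, a minimal check
might look like the following sketch, assuming that Example 9-7 refers to the output of the lvmo
command (the VG name and values are illustrative):
# lvmo -v data01vg -a | grep blocked_io
pervg_blocked_io_count = 0
global_blocked_io_count = 0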
9.3.2 pstat
The pstat command counts how many legacy AIO servers are used in the server. There are
two AIO subsystems:
Legacy AIO
Posix AIO
You can run the pstat -a | grep aioserver | wc -l command to get the number of legacy
AIO servers that are running. You can run the pstat -a | grep posix_aioserver | wc -l
command to see the number of POSIX AIO servers.
Important: If you use raw devices, you must use ps -k instead of pstat -a to measure the
legacy AIO activity.
In AIX V6 and Version 7, both AIO subsystems are loaded by default but are activated only
when an AIO request is initiated by the application. Run pstat -a | grep aio to see the AIO
subsystems that are loaded, as shown in Example 9-10.
Example 9-10 pstat -a output to show the AIO subsystem defined in AIX 7.1
[p8-e870-01v1:root:/:] pstat -a|grep aio
50 a 3200c8 1 3200c8 0 0 1 aioPpool
1049 a 1901f6 1 1901f6 0 0 1 aioLpool
In AIX Version 6 and Version 7, you can use the ioo tunables to show whether the AIO is
used. An illustration is given in Example 9-11.
Example 9-11 ioo -a output to show the AIO subsystem activity in AIX 7.1
[p8-e870-01v1:root:/:] ioo -a|grep aio
aio_active = 0
aio_maxreqs = 131072
aio_maxservers = 30
aio_minservers = 3
aio_server_inactivity = 300
posix_aio_active = 0
posix_aio_maxreqs = 131072
posix_aio_maxservers = 30
posix_aio_minservers = 3
posix_aio_server_inactivity = 300
In Example 9-11, aio_active and posix_aio_active show whether the AIO is used. The
parameters aio_server_inactivity and posix_aio_server_inactivity show how long an
AIO server sleeps without servicing an I/O request.
To check the AIO configuration in AIX V5.3, run the commands that are shown in
Example 9-12.
Example 9-12 lsattr -El aio0 output to list the configuration of legacy AIO
[root@p520-tic-3]# lsattr -El aio0
autoconfig defined STATE to be configured at system restart True
fastpath enable State of fast path True
kprocprio 39 Server PRIORITY True
maxreqs 4096 Maximum number of REQUESTS True
maxservers 10 MAXIMUM number of servers per cpu True
minservers 1 MINIMUM number of servers True
If your AIX V5.3 is between TL05 and TL08, you can also use the aioo command to list and
increase the values of maxservers, minservers, and maxreqs.
The rule is to monitor the I/O wait by using the vmstat command. If the I/O wait is more than
25%, consider enabling AIO, which reduces the I/O wait but does not help disks that are busy.
You can monitor busy disks by using iostat.
The lsattr -E -l sys0 -a iostat command indicates whether the iostat statistic collection
is enabled. To enable the collection of iostat data, run chdev -l sys0 -a iostat=true.
The disk and adapter-level system throughput can be observed by running the iostat -aDR
command.
The a option retrieves the adapter-level details, and the D option retrieves the disk-level
details. The R option resets the min* and max* values at each interval, as shown in
Example 9-13.
Vadapter:
vscsi0 xfer: Kbps tps bkread bkwrtn partition-id
29.7 3.6 2.8 0.8 0
read: rps avgserv minserv maxserv
0.0 48.2S 1.6 25.1
write: wps avgserv minserv maxserv
30402.8 0.0 2.1 52.8
queue: avgtime mintime maxtime avgwqsz avgsqsz sqfull
0.0 0.0 0.0 0.0 0.0 0.0
Paths/Disks:
hdisk0 xfer: %tm_act bps tps bread bwrtn
1.4 30.4K 3.6 23.7K 6.7K
read: rps avgserv minserv maxserv timeouts fails
2.8 5.7 1.6 25.1 0 0
write: wps avgserv minserv maxserv timeouts fails
0.8 9.0 2.1 52.8 0 0
queue: avgtime mintime maxtime avgwqsz avgsqsz sqfull
11.5 0.0 34.4 0.0 0.0 0.9
The iostat command displays the I/O statistics for the hdisk1 device, as shown in
Example 9-14. One way to establish relationships between an hdisk device, the corresponding
I/O paths, and the DS8000 LVs is to use the pcmpath query device command that is installed
with SDDPCM. In Example 9-14, the logical disk on the DS8000 storage system has LUN
serial number 75ZA5710019.
The option that is shown in Example 9-15 on page 319 provides details in a record format,
which can be used to sum up the disk activity.
It is not unusual to see a device reported by iostat as 90% - 100% busy because a DS8000
volume that is spread across an array of multiple disks can sustain a much higher I/O rate
than a single physical disk. A device that is 100% busy is a problem for a single physical disk,
but it is not necessarily a problem for a DS8000 volume that is backed by a RAID array.
Further AIO can be monitored by running iostat -A for legacy AIO and iostat -P for POSIX
AIO.
Because the AIO queues are assigned by file system, it is more accurate to measure the
queues per file system. If you have several instances of the same application where each
application uses a set of file systems, you can see which instances consume more resources.
To see the legacy AIO, which is shown in Example 9-16, run the iostat -AQ command.
Similarly, for POSIX-compliant AIO statistics, run iostat -PQ.
Example 9-16 iostat -AQ output to measure legacy AIO activity by file system
[root@p520-tic-3]# iostat -AQ 1 2
aio: avgc avfc maxg maif maxr avg-cpu: % user % sys % idle % iowait
0 0 0 0 16384 0.0 0.1 99.9 0.0
aio: avgc avfc maxg maif maxr avg-cpu: % user % sys % idle % iowait
0 0 0 0 16384 0.0 0.1 99.9 0.0
If your AIX system is in a SAN environment, you might have so many hdisks that iostat does
not provide much information. Instead, use the nmon tool, as described in “Interactive nmon
options for DS8000 performance monitoring” on page 322.
For more information about the enhancements of the iostat tool in AIX 7, see IBM AIX
Version 7.1 Differences Guide, SG24-7910, or see the iostat man pages at the following
website:
http://www.ibm.com/support/knowledgecenter/ssw_aix_71/com.ibm.aix.cmds3/iostat.htm
9.3.4 lvmstat
The lvmstat command reports input and output statistics for LPARs, LVs, and volume groups.
This command is useful in determining the I/O rates to LVM volume groups, LVs, and LPARs.
This command is useful for dealing with unbalanced I/O situations where the data layout was
not considered initially.
To enable statistics collection for all LVs in a VG (in this case, the rootvg VG), use the -e
option together with the -v <volume group> flag, as shown in the following example:
#lvmstat -v rootvg -e
When you do not need to continue to collect statistics with lvmstat, disable it because it
affects the performance of the system. To disable the statistics collection for all LVs in a VG (in
this case, the rootvg VG), use the -d option together with the -v <volume group> flag, as
shown in the following example:
#lvmstat -v rootvg -d
This command disables the collection of statistics on all LVs in the VG.
The first report section that is generated by lvmstat provides statistics that are about the time
since the statistical collection was enabled. Each later report section covers the time since the
previous report. All statistics are reported each time that lvmstat runs. The report consists of
a header row, followed by a line of statistics for each LPAR or LV, depending on the flags
specified.
You can see that fslv00 is busy performing writes and hd2 is performing read and some write
I/O.
The lvmstat tool has powerful options, such as reporting on a specific LV or reporting busy
LVs in a VG only. For more information about usage, see the following resources:
IBM AIX Version 7.1 Differences Guide, SG24-7910
A description of the lvmstat command, found at:
http://www.ibm.com/support/knowledgecenter/ssw_aix_71/com.ibm.aix.performance/l
vm_perf_mon_lvmstat.htm
9.3.5 topas
The interactive AIX tool topas is convenient if you want to get a quick overall view of the
activity of the system. A fast snapshot of memory usage or user activity can be a helpful
starting point for further investigation. Example 9-18 shows a sample topas output.
Since AIX V6.1 the topas monitor offers enhanced monitoring capabilities and file system I/O
statistics:
To expand the file system I/O statistics, enter ff (first f turns it off, the next f expands it).
To get an exclusive and even more detailed view of the file system I/O statistics, enter F.
Expanded disk I/O statistics can be obtained by typing dd or D in the topas initial window.
For more information, see the topas manual page, found at:
http://www.ibm.com/support/knowledgecenter/ssw_aix_71/com.ibm.aix.cmds5/topas.htm
9.3.6 nmon
The nmon tool and analyzer for AIX and Linux is a great storage performance analysis
resource. It was written by Nigel Griffiths, who works for IBM UK. Since AIX V5.3 TL09 and
AIX V6.1 TL02, it is integrated with the topas command. It is installed by default. For more
information, see the AIX 7.1 nmon command description, found at:
https://www.ibm.com/support/knowledgecenter/ssw_aix_71/com.ibm.aix.cmds4/nmon.htm
You can start either of the tools by running nmon or topas and then toggle between them by
typing the ~ character. You can use this tool, among others, when you perform client
benchmarks.
The interactive nmon tool is similar to monitor or topas, which you might have used to monitor
AIX, but it offers more features that are useful for monitoring the DS8000 performance.
Unlike topas, the nmon tool can also record data that can be used to establish a baseline of
performance for comparison later. Recorded data can be saved in a file and imported into the
nmon analyzer (a spreadsheet format) for easy analysis and graphing.
The different options that you can select when you run nmon are shown in Example 9-19 on
page 323.
To enable nmon to report I/O statistics based on DS8000 ranks, you can make a disk-group
map file that lists each rank with its associated hdisk members. Then, run nmon with the -g flag
to point to the map file:
nmon -g /tmp/vg-maps
When nmon starts, press the g key to view statistics for your disk groups. An example of the
output is shown in Example 9-22.
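A hedged sketch of such a map file (the rank names and hdisk members are illustrative; each
line names a group followed by its member hdisks):
# cat /tmp/vg-maps
R17 hdisk4 hdisk5
R18 hdisk6 hdisk7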
Recording nmon information for import into the nmon analyzer tool
A great benefit that the nmon tool provides is the ability to collect data over time to a file and
then to import the file into the nmon analyzer tool, found at:
https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/Power+Syste
ms/page/nmon_analyser
To collect nmon data in comma-separated value (CSV) file format for easy spreadsheet import,
complete the following steps:
1. Run nmon with the -f flag. See nmon -h for the details, but as an example, to run nmon for
an hour to capture data snapshots every 30 seconds, run this command:
nmon -f -s 30 -c 120
2. This command creates the output file <hostname>_date_time.nmon in the current directory.
Note: When you capture data to a file, the nmon tool disconnects from the shell to ensure
that it continues running even if you log out, which means that nmon can appear to fail, but it
is still running in the background until the end of the analysis period.
The nmon analyzer is a macro-customized Microsoft Excel spreadsheet. After you transfer the
output file to the machine that runs the nmon analyzer, start the nmon analyzer, enable the
macros, and click Analyze nmon data. You are prompted to select your spreadsheet and
then to save the results.
Many spreadsheets have fixed numbers of columns and rows. Collect up to a maximum of
300 snapshots to avoid experiencing these issues.
Hint: The use of the CHARTS setting instead of PICTURES for graph output simplifies the
analysis of the data, which makes it more flexible.
9.3.7 fcstat
The fcstat command displays statistics from a specific FC adapter. Example 9-23 shows the
output of the fcstat command.
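For example, the statistics for a single adapter can be displayed as follows (the adapter name
is illustrative):
# fcstat fcs0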
9.3.8 filemon
The filemon command monitors a trace of file system and I/O system events, and reports
performance statistics for files, virtual memory segments, LVs, and PVs. The filemon
command is useful to individuals whose applications are thought to be disk-bound, and who
want to know where and why.
The filemon command provides a quick test to determine whether there is an I/O problem by
measuring the I/O service times for reads and writes at the disk and LV level.
The filemon command is in /usr/bin and is part of the bos.perf.tools file set, which can be
installed from the AIX base installation media.
filemon measurements
To provide an understanding of file system performance for an application, the filemon
command monitors file and I/O activity at four levels:
Logical file system
The filemon command monitors logical I/O operations on logical files. The monitored
operations include all read, write, open, and seek system calls, which might result in
physical I/O, depending on whether the files are already buffered in memory. I/O statistics
are kept on a per-file basis.
Virtual memory system
The filemon command monitors physical I/O operations (that is, paging) between
segments and their images on disk. I/O statistics are kept on a per segment basis.
LVs
The filemon command monitors I/O operations on LVs. I/O statistics are kept on a per-LV
basis.
PVs
The filemon command monitors I/O operations on PVs. At this level, physical resource
utilizations are obtained. I/O statistics are kept on a per-PV basis.
filemon examples
A simple way to use filemon is to run the command that is shown in Example 9-24, which
performs these actions:
Run filemon for 2 minutes and stop the trace.
Store the output in /tmp/fmon.out.
Collect only LV and PV output.
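A hedged reconstruction of such an invocation (the exact flags in Example 9-24 might differ):
start the trace with output limited to the LV and PV levels, let it run for 2 minutes, and stop it:
# filemon -o /tmp/fmon.out -O lv,pv
# sleep 120; trcstop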
Tip: To set the size of the buffer of option -T, start with 2 MB per logical CPU.
To produce a sample output for filemon, we ran a sequential write test in the background and
started a filemon trace, as shown in Example 9-25. We used the lmktemp command to create
a 2 GB file full of nulls while filemon gathered I/O statistics.
skipping...........
------------------------------------------------------------------------
Detailed Logical Volume Stats (512 byte blocks)
------------------------------------------------------------------------
skipping...........
------------------------------------------------------------------------
Detailed Physical Volume Stats (512 byte blocks)
------------------------------------------------------------------------
skipping to end.....................
The following fields are in the filemon report of the filemon command:
util Utilization of the volume (fraction of time busy). The rows are sorted by
this field, in decreasing order. The first number, 1.00, means 100
percent.
#rblk Number of 512-byte blocks read from the volume.
#wblk Number of 512-byte blocks written to the volume.
KB/sec Total transfer throughput in kilobytes per second.
volume Name of volume.
description Contents of the volume: Either a file system name or an LV type (jfs2,
paging, jfslog, jfs2log, boot, or sysdump). Also indicates whether the
file system is fragmented or compressed.
In the filemon output in Example 9-26 on page 327, notice these characteristics:
The most active LV is /dev/305glv (/interdiskfs); it is the busiest LV with an average
data rate of 87 MBps.
The Detailed Logical Volume Status field shows an average write time of 1.816 ms for
/dev/305glv.
The Detailed Physical Volume Stats show an average write time of 1.934 ms for the
busiest disk, /dev/hdisk39, and 1.473 ms for /dev/hdisk55, which is the next busiest disk.
3. Next, run sequential reads and writes (by using the dd command, for example) to all of the
hdisk devices (raw or block) for about an hour. Then, look at your SAN infrastructure to see
how it performs.
Look at the UNIX error report. Problems show up as storage errors, disk errors, or adapter
errors. If there are problems, they are not hard to identify in the error report because
there are many errors. The source of the problem can be hardware problems on the
storage side of the SAN, Fibre Channel cables or connections, early device drivers, or
device (HBA) Licensed Internal Code. If you see errors similar to the errors shown in
Example 9-28, stop and fix them.
4. Next, run the following command to see whether MPIO/SDDPCM correctly balances the
load across paths to the LUNs:
pcmpath query device
The output from this command looks like Example 9-29 on page 331.
Check to ensure that for every LUN the counters under the Select column are the same
and that there are no errors.
5. Next, randomly check the sequential read speed of the raw disk device. The following
command is an example of the command that is run against a device called hdisk4. For
the LUNs that you test, ensure that they each yield the same results:
time dd if=/dev/rhdisk4 of=/dev/null bs=128k count=20000
Hint: For this dd command, for the first time that it is run against rhdisk4, the I/O must
be read from disk and staged to the DS8000 cache. The second time that this dd
command is run, the I/O is already in cache. Notice the shorter read time when we get
an I/O cache hit.
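As a rough throughput calculation for this test: 20000 blocks x 128 KB = 2,560,000 KB, or
about 2500 MB. Dividing that amount by the elapsed real time that is reported by the time
command gives the sequential read rate in MBps.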
If everything looks good, continue with the configuration of volume groups and LVs.
The lmktemp command, which is used next, creates a file, and you control the size of the file.
It does not appear to be supported by any AIX documentation and therefore might disappear
in future releases of AIX. Here are examples of the tests:
A simple sequential write test:
# cd /singleLUN
# time lmktemp 2GBtestfile 2000M
2GBtestfile
real 0m11.42s
user 0m0.02s
sys 0m1.01s
Divide 2000 MB by 11.42 seconds, which gives about 175 MBps for this sequential write.
Now that the DS8000 cache is primed, run the test again. When we ran the test again, we
got 0.28 seconds. Priming the cache is a good idea for isolated application read testing. If
you have an application, such as a database, and you perform several isolated fixed
reads, ignore the first run and measure the second run to take advantage of read hits from
the DS8000 cache because these results are a more realistic measurement of how the
application performs.
Hint: The lmktemp command for AIX has a 2 GB size limitation. This command cannot
create a file greater than 2 GB. If you want a file larger than 2 GB for a sequential read test,
concatenate a couple of 2 GB files.
Disk throughput and I/O response time for any server that is connected to a DS8000 storage
system are affected by the workload and configuration of the server and the DS8000 storage
system, data layout and volume placement, connectivity characteristics, and the performance
characteristics of the DS8000 storage system. Although the health and tuning of all of the
system components affect the overall performance management and tuning of a Windows
server, this chapter limits its descriptions to the following topics:
General Windows performance tuning
I/O architecture overview
File systems
Volume management
Multipathing and the port layer
Host bus adapter (HBA) settings
Windows server I/O enhancements
I/O performance measurement
Problem determination
Load testing
For more information about these tuning suggestions, see the following resources:
Tuning IBM System x Servers for Performance, SG24-5287
Tuning Windows Server 2003 on IBM System x Servers, REDP-3943
https://msdn.microsoft.com/en-us/library/windows/hardware/dn529133
To initiate an I/O request, an application issues an I/O request by using one of the supported
I/O request calls. The I/O manager receives the application I/O request and passes the I/O
request packet (IRP) from the application to each of the lower layers that route the IRP to the
appropriate device driver, port driver, and adapter-specific driver.
Windows server file systems can be configured as file allocation table (FAT), FAT32, or NTFS.
The file structure is specified for a particular partition or logical volume. A logical volume can
contain one or more physical disks. All Windows volumes are managed by the Windows
Logical Disk Management utility.
For more information about Windows Server 2003 and Windows Server 2008 I/O stacks and
performance, see the following documents:
http://download.microsoft.com/download/5/6/6/5664b85a-ad06-45ec-979e-ec4887d715eb
/Storport.doc
http://download.microsoft.com/download/5/b/9/5b97017b-e28a-4bae-ba48-174cf47d23cd
/STO089_WH06.ppt
I/O priorities
The Windows Server 2008 I/O subsystem provides a mechanism to specify I/O processing
priorities. Windows primarily uses this mechanism to prioritize critical I/O requests over
background I/O requests. API extensions exist to provide application vendors file-level I/O
priority control. The prioritization code has some processing impact and can be disabled for
disks that are targeted for similar I/O activities, such as databases.
Important: NTFS file system compression seems to be the easiest way to increase the
amount of available capacity. However, do not use it in enterprise environments because
file system compression consumes excessive disk and processor resources and increases
read and write response times. For better capacity utilization, consider the DS8000
Thin Provisioning technology and IBM data deduplication technologies.
Start sector offset: The start sector offset must be 256 KB because of the stripe size on
the DS8000 storage system. Workloads with small, random I/Os (less than 16 KB) are
unlikely to experience any significant performance improvement from sector alignment on
the DS8000 logical unit numbers (LUNs).
For more information about the paging file, see the following websites:
http://support.microsoft.com/kb/889654
http://technet.microsoft.com/en-us/magazine/ff382717.aspx
For more information about VxVM, see Veritas Storage Foundation High Availability for
Windows, found at:
http://www.symantec.com/business/storage-foundation-for-windows
Important: Despite the logical volume manager (LVM) functional benefits, it is preferable
to not use any LVM striping in DS8000 Easy Tier and I/O Priority Manager environments.
DS8800 and DS8700 storage systems offer improved algorithms and methods of
managing the data and performance at a lower level and do not require any additional
volume management. Combined usage of these technologies might lead to unexpected
results and performance degradation that makes the search for bottlenecks much more difficult.
Note: The example applications that are listed in Table 10-1 are examples only and not
specific rules.
The prior approach of workload isolation on the rank level might work for low skew factor
workloads or some specific workloads. Also, you can use this approach if you are confident in
planning the workload and volume layout.
It also provides I/O load balancing. For each I/O request, SDD dynamically selects one of the
available paths to balance the load across all possible paths.
To receive the benefits of path balancing, ensure that the disk drive subsystem is configured
so that there are multiple paths to each LUN. By using multiple paths to each LUN, you can
benefit from the performance improvements from SDD path balancing. This approach also
prevents the loss of access to data if there is a path failure.
Section “Subsystem Device Driver” on page 273 describes the SDD in further detail.
You can obtain more information about SDDDSM in the SDD User’s Guide, found at:
http://www.ibm.com/support/docview.wss?rs=540&context=ST52G7&uid=ssg1S7000303
SDDDSM: For non-clustered environments, use SDDDSM for its performance and
scalability improvements.
To configure the HBA, see the IBM System Storage DS8700 and DS8800 Introduction and
Planning Guide, GC27-2297-07. This guide contains detailed procedures and settings. You
also must read the readme file and manuals for the driver, BIOS, and HBA.
Obtain a list of supported HBAs, firmware, and device driver information at this website:
http://www.ibm.com/systems/support/storage/config/ssic/displayesssearchwithoutjs.w
ss?start_over=yes
Newer versions: When configuring the HBA, install the newest version of the driver and the
BIOS. Newer versions include functional enhancements and problem fixes that can improve
performance and reliability, availability, and serviceability (RAS).
In Figure 10-2:
The application layer is monitored and tuned with the application-specific tools and metrics
available for monitoring and analyzing application performance on Windows servers.
Application-specific objects and counters are outside the scope of this book.
The I/O Manager and file system levels are controlled with the built-in Windows tool, which
is available in Windows Performance Console (perfmon).
The volume manager level can be also monitored with perfmon. However, it is preferable to
not use any logical volume management in Windows servers.
Fibre Channel port level multipathing is monitored with the tools provided by the
multipathing software: SDD, SDDPCM, or Veritas DMP drivers.
The Fibre Channel adapter level can be monitored with the adapter-specific original
software that is provided by each vendor. For the support software of the adapters that are
compatible with DS8870 and DS8880 storage systems, see the following websites:
– http://www.brocade.com/services-support/index.page
– http://www.emulex.com/support.html
The SAN fabric level and the DS8000 level are monitored with the IBM Tivoli Storage
Productivity Center and the DS8000 built-in tools. Because Tivoli Storage Productivity
Center provides more functions to monitor the DS8000 storage systems, use it for
monitoring and analysis.
Table 10-2 describes the key I/O-related metrics that are reported by perfmon.
Table 10-2 Performance monitoring counters for PhysicalDisk and other objects
Counter | Normal values | Critical values | Description
%Disk Time | ~70 - 90% | Depends on the situation | Percentage of elapsed time that the disk was busy servicing read or write requests.
Disk Transfers/sec | According to the workload | Close to the limits of the volume, rank, and extent pool | The momentary number of disk transfers per second during the collection interval.
Disk Reads/sec | According to the workload | Close to the limits of the volume, rank, and extent pool | The momentary number of disk reads per second during the collection interval.
Disk Bytes/sec | According to the workload | Close to the limits of the volume, rank, and extent pool | The momentary number of bytes per second during the collection interval.
Disk Read Bytes/sec | According to the workload | Close to the limits of the volume, rank, and extent pool | The momentary number of bytes read per second during the collection interval.
Paging file object, %Usage counter | 0 - 1 | 40% and more | The amount of the paging file instance in use, as a percentage.
Rules
We provide the following rules based on our field experience. Before using these rules for
anything specific, such as a contractual service-level agreement (SLA), you must carefully
analyze and consider these technical requirements: disk speeds, RAID format, workload
variance, workload growth, measurement intervals, and acceptance of response time and
throughput variance. We suggest these rules:
Write and read response times in general must be as specified in the Table 10-2 on
page 348.
There must be a definite correlation between the counter values; therefore, the increase of
one counter value must lead to the increase of the others connected to it. For example, the
increase of the Transfers/sec counter leads to the increase of the Average sec/Transfer
counter because the increased number of IOPS leads to an increase in the response time
of each I/O.
If one counter has a high value and the related parameter value is low, pay attention to this
area. It can be a hardware or software problem or a bottleneck.
A Disk busy counter close to 100% does not mean that the system is out of its disk
performance capability. The disk is busy with I/Os. Problems occur when there are 100%
disk busy counters with close to zero counters of the I/Os at the same time.
Shared storage environments are more likely to have a variance in disk response time. If
your application is highly sensitive to variance in response time, you need to isolate the
application at either the processor complex, DA, or rank level.
With the perfmon tool, you monitor only the front-end activity of the disk system. To see the
complete picture, monitor the back-end activity, also. Use the Tivoli Storage Productivity
Center console and Storage Tier Advisor Tool (STAT).
The Performance console is a snap-in tool for the Microsoft Management Console (MMC).
You use the Performance console to configure the System Monitor and Performance Logs
and Alerts tools.
With Windows Server 2008, you can open the Performance console by clicking Start →
Programs → Administrative Tools → Performance or by typing perfmon on the command
line.
Figure 10-5 shows several disks. To identify them, right-click the name and click the
Properties option on each of them. Disks from the DS8000 storage system show IBM 2107
in the properties, which is the definition of the DS8000 machine-type. So, in this example, our
disk from the DS8000 storage system is Disk 2, as shown in Figure 10-6 on page 353.
Multi-Path Disk Device means that you are running SDD. You can also check for SDD from
the Device Manager option of the Computer Management snap-in, as shown in Figure 10-7.
In Figure 10-7, you see several devices and one SDD that is running. Use the datapath query
device command to show the disk information in the SDDDSM console (Example 10-1).
List the worldwide port names (WWPNs) of the ports by running the datapath query wwpn
command (Example 10-2). In Example 10-2, there are two WWPNs, for Port 2 and Port 4.
Identify the disk in the DS8000 storage system by using the DSCLI console (Example 10-3).
Example 10-3 List the volumes in the DS8800 DSCLI console (output truncated)
dscli> lsfbvol
Name ID accstate datastate configstate deviceMTM datatype extpool cap (2^30B) cap (10^9B) cap (blocks)
===========================================================================================================
- 8600 Online Normal Normal 2107-900 FB 512 P4 100.0 - 209715200
- 8601 Online Normal Normal 2107-900 FB 512 P4 100.0 - 209715200
- 8603 Online Normal Normal 2107-900 FB 512 P4 100.0 - 209715200
- 8604 Online Normal Normal 2107-900 FB 512 P4 100.0 - 209715200
- 8605 Online Normal Normal 2107-900 FB 512 P4 100.0 - 209715200
- 8606 Online Normal Normal 2107-900 FB 512 P4 100.0 - 209715200
- 8607 Online Normal Normal 2107-900 FB 512 P4 100.0 - 209715200
Example 10-3 shows the output of the DSCLI command lsfbvol, which lists all the fixed block
(FB) volumes in the system. Our example volume has ID 8601. It is shown in bold. It is created
on extent pool P4.
Next, list the ranks that are allocated with this volume by running showfbvol -rank Vol_ID
(Example 10-4).
Example 10-4 List the ranks that are allocated with this volume
dscli> showfbvol -rank 8601
Name -
ID 8601
accstate Online
datastate Normal
configstate Normal
deviceMTM 2107-900
datatype FB 512
addrgrp 8
extpool P4
exts 100
captype DS
cap (2^30B) 100.0
cap (10^9B) -
cap (blocks) 209715200
Example 10-4 shows the output of the command. Volume 8601 is the extent
space-efficient (ESE) volume with a virtual capacity of 100 GB, which occupies four extents
(two from each) from ranks R17 and R18. There is 4 GB of occupied space. Extent pool P4 is
under the Easy Tier automatic management.
Check for the arrays, RAID type, and the DA pair allocation (Example 10-5).
dscli> lsarray
Array State Data RAIDtype arsite Rank DA Pair DDMcap (10^9B)
======================================================================
A0 Assigned Normal 5 (6+P+S) S1 R0 0 300.0
A1 Assigned Normal 5 (6+P+S) S2 R1 0 300.0
A2 Assigned Normal 5 (6+P+S) S3 R17 2 600.0
A3 Assigned Normal 5 (6+P+S) S4 R18 2 600.0
A4 Unassigned Normal 5 (6+P+S) S5 - 2 600.0
Example 10-5 shows the array and DA pair allocation from the lsarray and
lsrank commands. Ranks R17 and R18 relate to arrays A2 and A3, which use the 600 GB
Enterprise drives on DA pair 2.
By running lshostconnect -volgrp VolumeGroup_ID, you can list the ports to which this
volume group is connected, as shown in Example 10-6. This volume group uses host
connections with IDs 0008 and 0009 and the WWPNs that are shown in bold. These WWPNs
are the same as the WWNs in Example 10-2 on page 354.
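Example 10-6 is not reproduced here, but the command sequence is straightforward. The following lines are a sketch only: the volume group ID V11 is a hypothetical value, whereas the host connection ID 0008 is one of the IDs from the description above:

dscli> lshostconnect -volgrp V11
dscli> showhostconnect 0008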
List the ports that are used in the disk system for the host connections (Example 10-7).

Example 10-7 List the I/O ports
dscli> lsioport
ID WWPN State Type topo portgrp
===============================================================
I0000 500507630A00029F Online Fibre Channel-SW SCSI-FCP 0
I0001 500507630A00429F Online Fibre Channel-SW FC-AL 0
I0002 500507630A00829F Online Fibre Channel-SW SCSI-FCP 0
I0003 500507630A00C29F Online Fibre Channel-SW SCSI-FCP 0
I0004 500507630A40029F Online Fibre Channel-SW FICON 0
I0005 500507630A40429F Online Fibre Channel-SW SCSI-FCP 0
Example 10-7 on page 356 shows how to obtain the WWPNs and port IDs by running the
showhostconnect and lsioport commands. All of the information that you need is in bold.
After these steps, you have all the configuration information for a single disk in the system.
After the performance data is correlated to the DS8000 LUNs and reformatted, open the
performance data file in Microsoft Excel. It looks similar to Figure 10-8.
DATE TIME Subsystem LUN Serial Disk Disk Reads/sec Avg Read RT(ms) Avg Total Time Avg Read Queue Length Read KB/sec
11/3/2008 13:44:48 75GB192 75GB1924 Disk6 1,035.77 0.612 633.59 0.63 66,289.14
11/3/2008 13:44:48 75GB192 75GB1924 Disk2 1,035.75 0.613 634.49 0.63 66,288.07
11/3/2008 13:44:48 75GB192 75GB1924 Disk3 1,035.77 0.612 633.87 0.63 66,289.14
11/3/2008 13:44:48 75GB192 75GB1924 Disk5 1,035.77 0.615 637.11 0.64 66,289.14
11/3/2008 13:44:48 75GB192 75GB1924 Disk4 1,035.75 0.612 634.38 0.63 66,288.07
11/3/2008 13:44:48 75GB192 75GB1924 Disk1 1,035.77 0.612 633.88 0.63 66,289.14
11/3/2008 14:29:48 75GB192 75GB1924 Disk6 1,047.24 5.076 5,315.42 5.32 67,023.08
11/3/2008 14:29:48 75GB192 75GB1924 Disk2 1,047.27 5.058 5,296.86 5.30 67,025.21
11/3/2008 14:29:48 75GB192 75GB1924 Disk3 1,047.29 5.036 5,274.30 5.27 67,026.28
11/3/2008 14:29:48 75GB192 75GB1924 Disk5 1,047.25 5.052 5,291.01 5.29 67,024.14
11/3/2008 14:29:48 75GB192 75GB1924 Disk4 1,047.29 5.064 5,303.36 5.30 67,026.28
11/3/2008 14:29:48 75GB192 75GB1924 Disk1 1,047.29 5.052 5,290.89 5.29 67,026.28
11/3/2008 13:43:48 75GB192 75GB1924 Disk6 1,035.61 0.612 634.16 0.63 66,279.00
11/3/2008 13:43:48 75GB192 75GB1924 Disk2 1,035.61 0.612 633.88 0.63 66,279.00
11/3/2008 13:43:48 75GB192 75GB1924 Disk3 1,035.61 0.615 636.72 0.64 66,279.00
Figure 10-8 The perfmon-essmap.pl script output
A quick look at the compiled data in Figure 10-8 shows a large increase in the response
time without a corresponding increase in the number of IOPS (Disk Reads/sec). This counter
shows that a problem occurred, which you confirm with the increased Queue Length value.
You must look at the drives that show the response time increase and collect additional data
for those drives.
Because you have a possible reason for the read response time increase, you can define
further steps to confirm it:
Gather additional performance data for the volumes from the Windows server, including
write activity.
Gather performance data from the back end of the disk system on those volumes for any
background activity or secondary operations.
Examine the balancing policy on the disk path.
Examine the periodic processes initiated in the application. There might be activity on the
log files.
For database applications, separate the log files from the main data and indexes.
Check for any other activity on that drive that can cause an increase of the write I/Os.
You can see how even a small and incomplete amount of collected performance data can
help you detect performance problems early and quickly identify further steps.
At the disk subsystem level, there can be bottlenecks on the rank level, extent pool level, DA
pair level, and cache level that lead to problems on the volume level. Table 10-3 on page 359
describes the reasons for the problems on different levels.
Rank level
Possible reasons:
1. Rank IOPS capability or rank bandwidth capability exceeded.
2. RAID type does not fit the workload type.
3. Disk type is wrong.
4. Physical problems with the disks in the rank.
Possible actions:
1. Split the workload between several ranks that are organized into one extent pool with the rotate extents feature. If already organized, manually rebalance the ranks.
2. Change the RAID level to a better performing RAID level (RAID 5 to RAID 10, for example) or migrate extents to another extent pool with better conditions.
3. Migrate extents to better performing disks.
4. Fix the problems with the disks.

Extent pool level
Possible reasons:
1. Extent pool capability reached its maximum.
2. Conflicting workloads mixed in the same extent pool.
3. No Easy Tier management for this pool.
4. One of the ranks in the pool is overloaded.
5. Physical problems with the disk in the rank.
Possible actions:
1. Add more ranks to the pool, examine the STAT reports for the recommendations, add more tiers in the pool, or use hot promotion and cold demotion.
2. Split workloads into separate pools, split one pool into two dedicated pools on both processor complexes, examine the STAT data and add the required tier, or set priorities for the workloads and enable IOPM.
3. Start Easy Tier for this pool by following the recommendations from the STAT tool.
4. Perform the extent redistribution for the pool, start Easy Tier and follow the recommendations from the STAT tool, or use the rotate extents method.
5. Fix the problems with the disks or remove the rank from the pool.

Cache level
Possible reasons:
1. Cache-memory limits are reached.
2. Workload is not “cache-friendly”.
3. Large number of write requests to a single volume (rank or extent pool).
Possible actions:
1. Upgrade the cache memory, add more ranks to the extent pool, enable Easy Tier, or split extent pools evenly between CPCs.
2. Add more disks, use the Micro-tiering function to unload the 15-K RPM drives, or tune the application, if possible.
3. Split the pools and ranks evenly between CPCs to use all the cache memory.

CPC level
Possible reasons:
1. CPCs are overloaded.
2. There is uneven volume, extent pool, or rank assignment on the CPCs.
3. There are CPC hardware problems.
Possible actions:
1. Split the workload between the two CPCs evenly, stop the unnecessary Copy Services, or upgrade the system to another model.
2. Split the pools and ranks evenly between CPCs to use all the cache memory and processor power.
3. Fix the problems.

Host adapter (HA) level
Possible reasons:
1. Ports are overloaded.
2. Host I/Os are mixed with Copy Services I/Os.
3. There are faulty SFPs.
4. There is incorrect cabling.
5. There are other hardware problems.
Possible actions:
1. Add more HAs, change HAs to better performing ones (4 - 8 Gbps), use another multipathing balancing approach, or use the recommended number of logical volumes per port.
2. Use dedicated adapters for the Copy Services, split host ports for the different operating systems, or separate backup workloads and host I/O workloads.
3. Replace the SFPs.
4. Change the cabling or check the cables with tools.
5. Fix the problems with the SAN hardware.
At the application level, there can be bottlenecks in the application, multipathing drivers,
device drivers, zoning misconfiguration, or adapter settings. However, a Microsoft
environment is self-tuning and many problems might be fixed without any indication. Windows
can cache many I/Os and serve them from cache. It is important to maintain a large amount
of free Windows Server memory for peak usage. Also, the paging file should be set up based
on the guidelines that are given in 10.4.3, “Paging file” on page 339.
For the other Microsoft applications, follow the guidelines in Table 10-1 on page 342:
To avoid bottlenecks on the SDDDSM side, maintain a balanced use of all the paths and
keep them always active, as shown in Example 10-8. You can see that the numbers for reads
and writes on each adapter are nearly the same.
SAN zoning, cabling, and FC-adapter settings must be done according to the IBM System
Storage DS8700 and DS8800 Introduction and Planning Guide, GC27-2297-07, but do
not configure more than four paths per logical volume.
After you detect a disk bottleneck, you might perform several of these actions:
If the disk bottleneck is a result of another application in the shared environment that
causes disk contention, request a LUN on a less used rank and migrate the data from the
current rank to the new rank. Start by using Priority Groups.
If the disk bottleneck is caused by too much load that is generated from the Windows
Server to a single DS8000 LUN, spread the I/O activity across more DS8000 ranks, which
might require the allocation of additional LUNs. Start Easy Tier for the volumes and
migrate to hybrid pools.
For more information about Windows Server disk subsystem tuning, see the following
website:
http://www.microsoft.com/whdc/archive/subsys_perf.mspx
VMware ESXi Server supports the use of external storage that can be on a DS8000 storage
system. The DS8000 storage system is typically connected by Fibre Channel (FC) and
accessed over a SAN. Each logical volume that is accessed by a VMware ESXi Server is
configured in a specific way, and this storage can be presented to the virtual machines (VMs)
as virtual disks (VMDKs).
To understand how storage is configured in VMware ESXi Server, you must understand the
layers of abstraction that are shown in Figure 11-1.
Figure 11-1 Layers of storage abstraction: virtual disks in the ESX Server, the VMFS volume, and the external storage
For VMware to use external storage, VMware must be configured with logical volumes that
are defined in accordance with the expectations of the users, which might include the use of
RAID or striping at a storage hardware level. Striping at a storage hardware level is preferred
because the DS8000 storage system can combine the Easy Tier and IOPM mechanisms. These
logical volumes must be presented to VMware ESXi Server. For the DS8000 storage system,
host access to the volumes includes the correct configuration of logical volumes, host
mapping, correct logical unit number (LUN) masking, and zoning of the involved SAN fabric.
Two options exist to use these logical drives within vSphere Server:
Formatting these disks with the VMFS: This option is the most common option because a
number of features require that the VMDKs are stored on VMFS volumes.
Passing the disk through to the guest OS as a raw disk. No further virtualization occurs.
Instead, the OS writes its own file system onto that disk directly as though it is in a
stand-alone environment without an underlying VMFS structure.
The VMFS volumes house the VMDKs that the guest OS sees as its real disks. These
VMDKs are in the form of a file with the extension .vmdk. The guest OS either reads and
writes to the VMDK file (.vmdk) or writes through the VMware ESXi Server abstraction layer
to a raw disk. In either case, the guest OS considers the disk to be real.
Figure 11-2 compares VMware VMFS volumes to logical volumes. The logical volumes of the
DS8800 and DS8700 storage systems are shown as references to volume IDs, for example,
1000, 1001, and 1002.
On the VM layer, you can configure one or several VMDKs out of a single VMFS volume.
These VMDKs can be configured for use by several VMs.
The VM disks are stored as files within a VMFS. When a guest operating system issues a
Small Computer System Interface (SCSI) command to its VMDKs, the VMware virtualization
layer converts this command to VMFS file operations. From the standpoint of the VM
operating system, each VMDK is recognized as a direct-attached SCSI drive connected to a
SCSI adapter. Device drivers in the VM operating system communicate with the VMware
virtual SCSI controllers.
Figure: an ESXi Server with three virtual machines whose disk files (disk1.vmdk, disk2.vmdk, and disk3.vmdk) are stored in one VMFS volume on LUN0
VMFS is optimized to run multiple VMs as one workload to minimize disk I/O impact. A VMFS
volume can be spanned across several logical volumes, but there is no striping available to
improve disk throughput in these configurations. Each VMFS volume can be extended by
adding additional logical volumes while the VMs use this volume.
Important: A VMFS volume can be spanned across several logical volumes, but there is
no striping available to improve disk throughput in these configurations. With Easy Tier,
hot/cold extents can be promoted or demoted, and you can achieve superior performance
versus economics on VMware ESXi hosts as well.
An RDM is implemented as a special file in a VMFS volume that acts as a proxy for a raw
device. An RDM combines the advantages of direct access to physical devices with the
advantages of VMDKs in the VMFS. In special configurations, you must use RDM raw
devices, such as in Microsoft Cluster Services (MSCS) clustering, or when you install IBM
Spectrum Protect™ Snapshot in a VM that is running on Linux.
Figure: an ESXi Server with two virtual machines, one using a .vmdk file on a VMFS volume and one using an RDM through the host bus adapters HBA1 and HBA2
Figure 11-5 vStorage APIs for Array Integration in the VMware storage stack
VAAI support relies on the storage system implementing several fundamental operations that
are named primitives. These operations are defined in terms of standard SCSI commands,
which are defined by the T10 SCSI specification.
For more information about VAAI integration and usage with the DS8000 storage system, see
IBM DS8870 and VMware Synergy, REDP-4915.
Example 11-1 Create volume groups and host connections for VMware hosts
dscli> mkvolgrp -type scsimap256 VMware_Host_1_volgrp_1
CMUC00030I mkvolgrp: Volume group V19 successfully created.
dscli> mkhostconnect -wwpn 21000024FF2D0F8D -hosttype VMware -volgrp V19 -desc "Vmware host1 hba1"
Vmware_host_1_hba_1
CMUC00012I mkhostconnect: Host connection 0036 successfully created.
For example: vmhba1:C3:T2:L0 represents LUN 0 on target 2 accessed through the storage
adapter vmhba1 and channel 3.
VMware ESXi Server provides built-in multipathing support, which means that it is not
necessary to install an additional failover driver. External failover drivers, such as the
Subsystem Device Driver (SDD), are not supported for VMware ESXi. Since ESX 4.0,
VMware supports path failover and the round-robin algorithm.
VMware ESXi Server 6.0 provides three major multipathing policies for use in production
environments: Most Recently Used (MRU), Fixed, and Round Robin (RR):
MRU policy is designed for usage by active/passive storage devices, such as IBM System
Storage DS4000® storage systems, with only one active controller available per LUN.
The Fixed policy ensures that the designated preferred path to the storage is used
whenever available. During a path failure, an alternative path is used, and when the
preferred path is available again, the multipathing module switches back to it as the active
path.
The Round Robin policy with a DS8000 storage system rotates I/O across all available
paths. It is possible to switch from MRU and Fixed to RR without interruptions. With RR, you
can change the number of bytes and the number of I/O operations that are sent along one
path before switching to the next path. RR is a good approach for various systems; however,
it is not supported for use with VMs that are part of MSCS.
The default multipath policy for ALUA devices since ESXi 5 is Round Robin.
The multipathing policy and the preferred path can be configured from the vSphere Client or
by using the command-line tool esxcli. For command differences among the ESXi versions,
see Table 11-1 on page 371.
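As a hedged illustration of the esxcli syntax (the device identifier is an example only, and the exact options can differ between ESXi versions), the following commands list the current policy for one DS8000 LUN, switch it to Round Robin, and set the path-switching trigger to one I/O per path:

# Show the path selection policy that is currently assigned to the device
esxcli storage nmp device list --device naa.6005076303ffd5aa000000000000002c
# Assign the Round Robin policy and switch paths after every I/O
esxcli storage nmp device set --device naa.6005076303ffd5aa000000000000002c --psp VMW_PSP_RR
esxcli storage nmp psp roundrobin deviceconfig set --device naa.6005076303ffd5aa000000000000002c --type iops --iops 1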
Figure 11-7 shows how the preferred path policy can be checked from the vSphere Client.
By using the Fixed multipathing policy, you can implement static load balancing if several
LUNs are attached to the VMware ESXi Server. The multipathing policy is set on a per LUN
basis, and the preferred path is chosen for each LUN. If VMware ESXi Server is connected
over four paths to its DS8000 storage system, spread the preferred paths over all four
available physical paths.
Important: Before zoning your VMware host to a DS8000 storage system, ensure that each
LUN has a minimum of two available paths (for redundancy). Because of VMware limitations,
the maximum number of paths on a host is 1024, the number of paths to a LUN is limited to
32, and the maximum number of LUNs per host is 256. The maximum size of a LUN is 64 TB.
So, plan the number of paths that are available to each VMware host carefully to avoid future
problems with provisioning to your VMware hosts.
For example, when you want to configure four LUNs, assign the preferred path of LUN0
through the first path, the one for LUN1 through the second path, the preferred path for LUN2
through the third path, and the one for LUN3 through the fourth path. With this method, you
can spread the throughput over all physical paths in the SAN fabric. Thus, this method results
in optimized performance for the physical connections between the VMware ESXi Server and
the DS8000 storage system.
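For the Fixed policy, the preferred path per LUN can also be set with esxcli. The following line is a sketch only; the device identifier and runtime path name are hypothetical and must be replaced with the values that are reported for your host:

# Set the preferred path for one LUN (repeat with a different path for each LUN)
esxcli storage nmp psp fixed deviceconfig set --device naa.6005076303ffd5aa000000000000002d --path vmhba2:C0:T0:L1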
When the active path fails, for example, because of a physical path failure, I/O might pause for
about 30 - 60 seconds until the FC driver determines that the link is down and fails over to one
of the remaining paths. This behavior can cause the VMDKs that are used by the operating
systems of the VMs to appear unresponsive. After failover is complete, I/O resumes normally.
The timeout value for detecting a failed link can be adjusted; it is set in the HBA BIOS or
driver and the way to set this option depends on the HBA hardware and vendor. The typical
failover timeout value is 30 seconds. With VMware ESXi, you can adjust this value by editing
the device driver options for the installed HBAs in /etc/vmware/esx.conf.
Additionally, you can increase the standard disk timeout value in the VM operating system to
ensure that the operating system is not disrupted and does not log permanent errors during
the failover phase. Adjusting this timeout depends on the operating system that is used and
the amount of queued I/O that is expected on one path after it fails; see the appropriate
technical documentation for details.
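For a Linux guest, for example, the SCSI command timeout of each virtual disk is exposed in sysfs. The following lines are a minimal sketch (the device name and the timeout value are examples); for Windows guests, the equivalent setting is commonly the TimeOutValue registry value of the Disk service:

# Display the current SCSI command timeout (in seconds) for one virtual disk
cat /sys/block/sda/device/timeout
# Raise it so that the guest tolerates a path failover without I/O errors
echo 180 > /sys/block/sda/device/timeout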
VC includes real-time performance counters that display the past hour (which is not archived),
and archived statistics that are stored in a database. The real-time statistics are collected
every 20 seconds and presented in the vSphere Client for the past 60 minutes (Figure 11-8
on page 373).
These real-time counters are also the basis for the archived statistics, but to avoid too much
performance database expansion, the granularity is recalculated according to the age of the
performance counters. VC collects those real-time counters, aggregates them for a data point
every 5 minutes, and stores them as past-day statistics in the database. After one day, these
counters are aggregated once more to a 30-minute interval for the past week statistics. For
the past month, a data point is available every 2 hours, and for the last year, one data point is
stored per day.
In general, the VC statistics are a good basis for getting an overview of the performance
statistics and for analyzing performance counters over a longer period, for example,
several days or weeks. If a granularity of 20-second intervals is sufficient for your individual
performance monitoring perspective, VC can be a good data source after configuration. You
can obtain more information about how to use the VC Performance Statistics at this website:
http://communities.vmware.com/docs/DOC-5230
After this initial configuration, the performance counters are displayed as shown in
Example 11-3.
ADAPTR CID TID LID NCHNS NTGTS NLUNS NVMS AQLEN LQLEN WQLEN ACTV %USD LOAD CMDS/s READS/s WRITES/s MBREAD/s
vmhba1 - - - 2 1 1 32 238 0 0 - - - 4.11 0.20 3.91 0.00
vmhba2 0 0 0 1 1 1 10 4096 32 0 8 25 0.25 25369.69 25369.30 0.39 198.19
vmhba2 0 0 1 1 1 1 10 4096 32 0 0 0 0.00 0.00 0.00 0.00 0.00
vmhba2 0 0 2 1 1 1 10 4096 32 0 0 0 0.00 0.39 0.00 0.39 0.00
vmhba2 0 0 3 1 1 1 9 4096 32 0 0 0 0.00 0.39 0.00 0.39 0.00
vmhba2 0 0 4 1 1 1 4 4096 32 0 0 0 0.00 0.00 0.00 0.00 0.00
vmhba2 0 0 5 1 1 1 17 4096 32 0 0 0 0.00 0.00 0.00 0.00 0.00
vmhba2 0 0 6 1 1 1 4 4096 32 0 0 0 0.00 0.00 0.00 0.00 0.00
vmhba2 0 0 7 1 1 1 4 4096 32 0 0 0 0.00 0.00 0.00 0.00 0.00
vmhba2 0 0 8 1 1 1 4 4096 32 0 0 0 0.00 0.00 0.00 0.00 0.00
vmhba2 0 0 9 1 1 1 4 4096 32 0 0 0 0.00 0.00 0.00 0.00 0.00
vmhba2 0 1 - 1 1 10 76 4096 0 0 - - - 0.00 0.00 0.00 0.00
vmhba3 - - - 1 2 4 16 4096 0 0 - - - 0.00 0.00 0.00 0.00
vmhba4 - - - 1 2 4 16 4096 0 0 - - - 0.00 0.00 0.00 0.00
vmhba5 - - - 1 2 20 152 4096 0 0 - - - 0.78 0.00 0.78 0.00
Additionally, you can change the field order and select or clear various performance counters
in the view. The minimum refresh rate is 2 seconds, and the default setting is 5 seconds.
When you use esxtop in Batch Mode, always include all of the counters by using the -a
option. To collect the performance counters every 10 seconds for 100 iterations and save
them to a file, run esxtop this way:
esxtop -b -a -d 10 -n 100 > perf_counters.csv
For more information about how to use esxtop and other tools, see vSphere Resource
Management Guide, found at:
http://www.vmware.com/pdf/vsphere4/r40/vsp_40_resource_mgmt.pdf
Each VMware version has its own particularities and some might have performance analysis
tools that are not part of the older versions.
The guest operating system is unaware of the underlying VMware ESXi virtualization layer, so
any performance data captured inside the VMs can be misleading and must be analyzed and
interpreted only with the actual configuration and performance data gathered in VMware ESXi
Server or on a disk or SAN layer.
There is one additional benefit of the Windows Performance Monitor perfmon (see 10.8.2,
“Windows Performance console (perfmon)” on page 350). When you use esxtop in Batch
Mode with the -a option, it collects all available performance counters, so the collected
comma-separated values (CSV) data becomes large and cannot be easily parsed. Perfmon
can help you quickly analyze results or reduce the amount of CSV data to a subset of
counters that can be analyzed more easily by using other utilities. You can obtain more
information about importing the esxtop CSV output into perfmon by going to the following
website:
http://communities.vmware.com/docs/DOC-5100
It is also important to identify and separate specific workloads because they can negatively
influence other workloads that might be more business critical.
Within VMware ESXi Server, it is not possible to configure striping over several LUNs for one
datastore. It is possible to add more than one LUN to a datastore, but adding more than one
LUN to a datastore only extends the available amount of storage by concatenating one or
more additional LUNs without balancing the data over the available logical volumes.
The easiest way to implement striping over several hardware resources is to use storage pool
striping in extent pools (see 4.8, “Planning extent pools” on page 103) of the attached
DS8000 storage system.
The only other possibility to achieve striping at the VM level is to configure several VMDKs for
a VM that are on different hardware resources, such as different HBAs, DAs, or servers, and
then configure striping of those VMDKs within the host operating system layer.
For performance monitoring purposes, be careful with spanned volumes or even avoid these
configurations. When configuring more than one LUN to a VMFS datastore, the volume space
is spanned across multiple LUNs, which can cause an imbalance in the utilization of those
LUNs. If several VMDKs are initially configured within a datastore and the disks are mapped
to different VMs, it is no longer possible to identify in which area of the configured LUNs the
data of each VM is allocated. Thus, it is no longer possible to pinpoint which host workload
causes a possible performance problem.
In summary, avoid using spanned volumes and configure your systems with only one LUN per
datastore.
For more information, see the following VMware knowledge base article:
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC
&externalId=1267
If a VM generates more commands to a LUN than the LUN queue depth, these additional
commands are queued in the ESXi kernel, which increases the latency. The queue depth is
defined on a per LUN basis, not per initiator. An HBA (SCSI initiator) supports many more
outstanding commands.
For VMware ESXi Server, if two VMs access their VMDKs on two different LUNs, each VM
can generate as many active commands as the LUN queue depth. But if those two VMs have
their VMDKs on the same LUN (within the same VMFS volume), the total number of active
commands that the two VMs combined can generate without queuing I/Os in the ESXi kernel
is equal to the LUN queue depth. Therefore, when several VMs share a LUN, the maximum
number of outstanding commands to that LUN from all those VMs together must not exceed
the LUN queue depth.
To reduce latency, it is important to ensure that the sum of active commands from all VMs of
a VMware ESXi Server does not frequently exceed the LUN queue depth. If the LUN queue
depth is exceeded regularly, either increase the queue depth or move the VMDKs of a few
VMs to different VMFS volumes, which lowers the number of VMs that access a single LUN.
The maximum LUN queue depth per VMware ESXi Server must not exceed 64. It can be up
to 128 only when a server has exclusive access to a LUN.
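To check the queue depth that is currently in effect for a DS8000 LUN, the esxcli device listing reports a maximum queue depth per device. The following line is a sketch with a hypothetical device identifier:

# The output includes a "Device Max Queue Depth" line for the LUN
esxcli storage core device list --device naa.6005076303ffd5aa000000000000002c | grep -i "Queue Depth"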
To avoid SCSI reservation conflicts in a production environment with several VMware ESXi
Servers that access shared LUNs, it might be helpful to perform those administrative tasks at
off-peak hours. If this approach is not possible, perform the administrative tasks from the
VMware ESXi Server that hosts the I/O-intensive VMs. Those VMs are then less affected
because the SCSI reservation is set at the SCSI initiator level, which means for the complete
VMware ESXi Server.
The maximum number of VMs that can share the LUN depends on several conditions. VMs
with heavy I/O activity result in a smaller number of possible VMs per LUN. Additionally, you
must consider the already described LUN queue depth limits per VMware ESXi Server and
the storage system-specific limits.
RDM offers two configuration modes: virtual compatibility mode and physical compatibility
mode. When you use physical compatibility mode, all SCSI commands toward the VMDK are
passed directly to the device, which means that all physical characteristics of the underlying
hardware become apparent. Within virtual compatibility mode, the VMDK is mapped as a file
within a VMFS volume, which allows advanced file locking support and the use of snapshots.
Figure 11-9 Comparison of RDM virtual and physical modes with VMFS
The implementations of VMFS and RDM imply a possible impact on the performance of the
VMDKs; therefore, all three possible implementations were tested together with the DS8000
storage system. This section summarizes the outcome of those performance tests.
In general, the file system selection affects the performance in a limited manner:
For random workloads, the measured throughput is almost equal between VMFS, RDM
physical, and RDM virtual. Only for read requests of 32 KB, 64 KB, and 128 KB transfer
sizes, both RDM implementations show a slight performance advantage (Figure 11-10 on
page 379).
For sequential workloads at all transfer sizes, there is a verified slight performance
advantage for both RDM implementations over VMFS. For all sequential write and
certain read requests, the measured throughput for RDM virtual was slightly higher than
for RDM physical mode. This difference might be caused by an additional caching of data
within the virtualization layer, which is not used in RDM physical mode (Figure 11-11 on
page 379).
Performance data varies: The performance data in Figure 11-10 and Figure 11-11 was
obtained in a controlled, isolated environment at a specific point by using the
configurations, hardware, and software levels available at that time. Actual results that
might be obtained in other operating environments can vary. There is no guarantee that the
same or similar results can be obtained elsewhere. The data is intended to help illustrate
only how different technologies behave in relation to each other.
Figure 11-11 Result of sequential workload test for VMFS, RDM physical, and RDM virtual
When using VMware ESXi, each VMFS datastore segments the allocated LUN into blocks,
which can be 1 - 8 MB. The file system that is used by the VM operating system optimizes I/O
by grouping several sectors into one cluster. The cluster size usually is in the range of several
KB.
If the VM operating system reads a single cluster from its VMDK, at least one block (within
VMFS) and all the corresponding stripes on physical disk must be read. Depending on the
sizes and the starting sector of the clusters, blocks, and stripes, reading one cluster might
require reading two blocks and all of the corresponding stripes. Figure 11-12 illustrates that in
an unaligned structure, a single I/O request can cause additional I/O operations. Thus, an
unaligned partition setup results in additional I/O that incurs a penalty on throughput and
latency and leads to lower performance for the host data traffic.
Figure 11-12 Processing of a data request in an unaligned structure
An aligned partition setup ensures that a single I/O request results in a minimum number of
physical disk I/Os, eliminating the additional disk operations and resulting in an overall
performance improvement.
Figure 11-13 Processing a data request in an aligned structure
Partition alignment is a known issue in file systems, but its effect on performance is somewhat
controversial. In performance lab tests, it turned out that in general all workloads show a slight
increase in throughput when the partitions are aligned. A significant effect can be verified only
on sequential workloads. Starting with transfer sizes of 32 KB and larger in this example, we
recognized performance improvements of up to 15%.
In general, aligning partitions can improve the overall performance. For random workloads in
this example, we identified only a slight effect. For sequential workloads, a possible
performance gain of about 10% seems to be realistic. So, as a preferred practice, align
partitions especially for sequential workload characteristics.
Aligning partitions within a VMware ESXi Server environment requires two steps. First, the
VMFS partition must be aligned. Then, the partitions within the VMware guest system file
systems must be aligned for maximum effectiveness.
Example 11-4 shows how to create an aligned partition with an offset of 512 by using fdisk.
Then, you must create a VMFS file system within the aligned partition by using the
vmkfstools command, as shown in Example 11-5.
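Example 11-5 is not reproduced here. As a sketch of such a command (the device identifier and the datastore label are examples), a VMFS5 file system can be created in the first partition of the LUN as follows:

# Create a VMFS5 datastore, labeled DS8000_DS1, in partition 1 of the aligned LUN
vmkfstools -C vmfs5 -S DS8000_DS1 /vmfs/devices/disks/naa.6005076303ffd5aa000000000000002c:1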
As a last step, all the partitions at the VM level must be aligned as well. This task must be
performed from the operating system of each VM by using the available tools. For example,
for Windows, use the diskpart utility, as shown in Example 11-6. You can use Windows to
align basic partitions only, and the offset size is set in KB (not in sectors).
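Example 11-6 is not reproduced here, but a minimal diskpart sequence for an aligned basic partition might look like the following sketch (the disk number, offset, and drive letter are examples):

DISKPART> select disk 2
DISKPART> create partition primary align=1024
DISKPART> assign letter=E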
For more information about aligning VMFS partitions and the performance effects, see
Performance Best Practices for VMware vSphere 6.0, found at:
http://www.vmware.com/files/pdf/techpaper/VMware-PerfBest-Practices-vSphere6-0.pdf
This chapter also describes the supported distributions of Linux when you use the DS8000
storage system, and the tools that can be helpful for the monitoring and tuning activities:
Linux disk I/O architecture
Host bus adapter (HBA) considerations
Multipathing
Logical Volume Manager (LVM)
Disk I/O schedulers
File system considerations
For further clarification and the most current information about supported Linux distributions
and hardware prerequisites, see the IBM System Storage Interoperation Center (SSIC)
website:
http://www.ibm.com/systems/support/storage/ssic/interoperability.wss
This chapter introduces the relevant logical configuration concepts that are needed to attach
Linux operating systems to a DS8000 storage system and focuses on performance relevant
configuration and measuring options. For more information about hardware-specific Linux
implementation and general performance considerations about the hardware setup, see the
following documentation:
For a general Linux implementation overview:
– The IBM developerWorks website for Linux, including a technical library:
http://www.ibm.com/developerworks/linux
– Anatomy of the Linux kernel:
http://www.ibm.com/developerworks/library/l-linux-kernel/
– SUSE Linux Enterprise Server documentation
https://www.suse.com/documentation/sles-12/
– RHEL documentation
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/ind
ex.html
For x86-based architectures:
Tuning IBM System x Servers for Performance, SG24-5287
For System p hardware:
– Performance Optimization and Tuning Techniques for IBM Power Systems Processors
Including IBM POWER8, SG24-8171
– POWER7 and POWER7+ Optimization and Tuning Guide, SG24-8079
For z Systems hardware:
– Set up Linux on IBM System z for Production, SG24-8137
– Linux on IBM System z: Performance Measurement and Tuning, SG24-6926
The architecture that is described applies to Open Systems servers attached to the DS8000
storage system by using the Fibre Channel Protocol (FCP). For Linux on z Systems with
extended count key data (ECKD) (Fibre Channel connection (FICON) attached) volumes, a
different disk I/O setup applies. For more information about disk I/O setup and configuration
for z Systems, see Linux for IBM System z9 and IBM zSeries, SG24-6694.
For a quick overview of overall I/O subsystem operations, we use an example of writing data
to a disk. The following sequence outlines the fundamental operations that occur when a
disk-write operation is performed, assuming that the file data is on sectors on disk platters,
that it was read, and is on the page cache:
1. A process requests to write a file through the write() system call.
2. The kernel updates the page cache mapped to the file.
For more information, see the IBM developerWorks article Anatomy of the Linux kernel, found
at:
http://www.ibm.com/developerworks/library/l-linux-kernel/
Figure: locality of reference, with data flowing between the disk, the cache, and the CPU registers
Linux uses this principle in many components, such as page cache, file object cache (i-node
cache and directory entry cache), and read ahead buffer.
The synchronization process for a dirty buffer is called a flush. The flush occurs on a regular
basis and when the proportion of dirty buffers in memory exceeds a certain threshold. The
threshold is configurable in the /proc/sys/vm/dirty_background_ratio file.
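For example, the current threshold can be inspected and lowered with standard commands (the value of 5 percent is only an illustration):

# Show the percentage of dirty memory at which background writeback starts
cat /proc/sys/vm/dirty_background_ratio
# Start background writeback earlier by lowering the threshold
sysctl -w vm.dirty_background_ratio=5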
The operating system synchronizes the data regularly, but with large amounts of system
memory, it might keep updated data for several days. Such a delay is unsafe for the data in a
failure situation. To avoid this situation, use the sync command. When you run the sync
command, all changes and records are updated on the disks and all buffers are cleared.
Periodic usage of sync is necessary in transaction processing environments that frequently
update the same data set file, which is intended to stay in memory. Data synchronization can
be set to frequent updates automatically, but the sync command is useful in situations when
large data movements are required and copy functions are involved.
Important: It is a preferred practice to run the sync command after data synchronization in
an application before issuing a FlashCopy operation.
When a write is performed, the file system layer first writes to the page cache, which is made
up of block buffers. It creates a bio structure by putting the contiguous blocks together and
then sends the bio to the block layer (see Figure 12-1 on page 387).
The block layer handles the bio request and links these requests into a queue called the I/O
request queue. This linking operation is called I/O elevator or I/O scheduler. The Linux
kernel 2.4 used a single, general-purpose I/O elevator, but since Linux kernel 2.6, four types
of I/O elevator algorithms are available. Because the Linux operating system can be used for
a wide range of tasks, both I/O devices and workload characteristics change significantly. A
notebook computer probably has different I/O requirements than a 10,000 user database
system. To accommodate these differences, four I/O elevators are available. For more
information about I/O elevator implementation and tuning, see 12.3.4, “Tuning the disk I/O
scheduler” on page 397.
SCSI
SCSI is the most commonly used I/O interface, especially in the enterprise server
environment. The FCP transports SCSI commands over Fibre Channel networks. In Linux
kernel implementations, SCSI devices are controlled by device driver modules. They consist
of the following types of modules (Figure 12-3 on page 391):
The upper layer consists of specific device type drivers that are closest to user-space,
such as the disk and tape drivers: st (SCSI Tape) and sg (SCSI generic device).
Middle level driver: scsi_mod
It implements SCSI protocol and common SCSI functions.
The lower layer consists of drivers, such as the Fibre Channel HBA drivers, which are
closest to the hardware. They provide lower-level access to each device. A low-level driver
is specific to a hardware device and is provided for each device, for example, ips for the
IBM ServeRAID controller, qla2300 for the QLogic HBA, and mptscsih for the LSI Logic
SCSI controller.
Pseudo driver: ide-scsi
It is used for IDE-SCSI emulation.
Figure 12-3 Structure of SCSI drivers
If specific functions are implemented for a device, they must be implemented in the device
firmware and the low-level device driver. The supported functions depend on the hardware
and the version of the device driver that you use. The device must also support the
wanted functions. Specific functions are tuned by device driver parameters.
For more information about performance and tuning recommendations, see Linux
Performance and Tuning Guidelines, REDP-4285.
In the SSIC web application, describe your target configuration in as much detail as possible.
Click Submit and Show Details to display the supported HBA BIOS and driver versions
(see Example 12-2 on page 394).
To configure the HBA correctly, see the IBM DS8000: Host Systems Attachment Guide,
GC27-4210.
IBM System Storage DS8000: Host Attachment and Interoperability, SG24-8887, has detailed
procedures and suggested settings.
Also, check the readme files and manuals of the BIOS, HBA, and driver. The Emulex and
QLogic Fibre Channel device driver documentation is available at the following websites:
http://www.emulex.com/downloads
http://driverdownloads.qlogic.com/QLogicDriverDownloads_UI
In current Linux distributions, the HBA driver is loaded automatically. However, you can
configure several driver parameters. The list of available parameters depends on the specific
HBA type and driver implementation. If these settings are not configured correctly, it might
affect performance or the system might not work correctly.
The queue depth parameter specifies the number of SCSI commands that a device can keep
outstanding (not yet confirmed as complete) while it services I/O requests. Because a device
(disk or FC adapter) accepts a new command before the previous ones are completed, the
driver can send another command or I/O request without waiting.
By changing the queue depth, you can queue more outstanding I/Os on the adapter or disk
level, which can have, in certain configurations, a positive effect on throughput. However,
increasing the queue depth is not generally advisable because it can also slow performance
or cause delays, depending on the actual configuration. Thus, the complete setup must be checked
carefully before adjusting the queue depth. Increasing the queue depth can be beneficial for
the sequential large block write workloads and for some sequential read workloads. Random
workloads do not benefit much from the increased queue depth values. Indeed, you might
notice only a slight improvement in performance after the queue depth is increased. However,
the improvement might be greater if other methods of optimization are used.
You can configure each parameter as either temporary or persistent. For temporary
configurations, you can use the modprobe command. Persistent configuration is performed by
editing the appropriate configuration file (based on distribution):
/etc/modprobe.d/lpfc.conf for RHEL 6.x or higher
/etc/modprobe.d/99-qlogichba.conf for SUSE Linux Enterprise Server 11 SPx or higher
/etc/modprobe.conf for RHEL 5.x
/etc/modprobe.conf.local for older SUSE Linux Enterprise Server releases
To set the queue depth parameter value for an Emulex adapter, add the following line to the
configuration file to set the queue depth of an Emulex HBA to 20:
options lpfc lpfc_lun_queue_depth=20
To set the queue depth parameter value for a QLogic adapter for the Linux kernel Version 3.0
(For example, SUSE Linux Enterprise Server 11 or later), create a file that is named
/etc/modprobe.d/99-qlogichba.conf that contains the following line:
options qla2xxx ql2xmaxqdepth=48 qlport_down_retry=1
If you are running on a SUSE Linux Enterprise Server 11 or later operating system, run the
mkinitrd command and then restart.
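To verify that the new value is active after the restart, the module parameter and the per-device queue depths can be read back from sysfs. This is a sketch only; the attribute paths assume the qla2xxx driver that is shown above:

# Module-wide maximum queue depth for the QLogic driver
cat /sys/module/qla2xxx/parameters/ql2xmaxqdepth
# Effective queue depth of each attached SCSI device
for dev in /sys/bus/scsi/devices/*/queue_depth; do
    echo "$dev: $(cat $dev)"
done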
Example 12-1 shows the output of the iostat -kx command. You can see that the average
queue size value might look high enough to suggest that the queue depth parameter should
be increased. If you look closer at the statistics, you can see that the service time is low and
the counter for merged write requests is high. Many write requests are merged into fewer
write requests before they are sent to the adapter. Service times of less than 1 ms for write
requests indicate that the writes are cached. Taking all these observations into consideration,
the queue depth setting in this example is fine.
High queue depth parameter values might lead to adapter overload situations, which can
cause adapter resets and loss of paths. In turn, this situation might cause adapter failover and
overload the rest of the paths. It might lead to situations where I/O is stuck for a period.
When you use DM-MP, consider decreasing the queue depth parameter values so that the
multipathing module can react faster to path or adapter problems and avoid potential failures.
For older Linux distributions (up to Red Hat Linux 4 and SUSE Linux Enterprise Server 9),
IBM supported the IBM Multipath Subsystem Device Driver (SDD). SDD is not available for
current Linux distributions.
The Multipath I/O support that is included in Linux kernel Version 2.6 or higher is based on
Device Mapper (DM), a layer for block device virtualization that also supports logical volume
management, multipathing, and software RAID.
With DM, a virtual block device is presented where blocks can be mapped to any existing
physical block device. By using the multipath module, the virtual block device can be mapped
to several paths toward the same physical target block device. DM balances the workload of
I/O operations across all available paths, detects defective links, and fails over to the
remaining links.
For more information about supported distribution releases, kernel versions, and multipathing
software, see the IBM Subsystem Device Driver for Linux website:
https://www.ibm.com/support/docview.wss?uid=ssg1S4000107
IBM provides a device-specific configuration file for the DS8000 storage system for the
supported levels of RHEL and SUSE Linux Enterprise Server. Append the device-specific
section of this file to the /etc/multipath.conf configuration file to set default parameters for
the attached DS8000 volumes (LUNs) and create names for the multipath devices that are
managed by DM-MP (see Example 12-2). Further configuration, adding aliases for certain
LUNs, or blacklisting specific devices can be manually configured by editing this file.
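As a hedged illustration only (the exact defaults come from the IBM-provided file and can change between releases), the device-specific section and an optional alias definition in /etc/multipath.conf might look similar to the following sketch. The WWID is the one that is discussed later with Example 12-6; the alias name is arbitrary:

devices {
    device {
        vendor               "IBM"
        product              "2107900"
        path_grouping_policy multibus
    }
}

multipaths {
    multipath {
        wwid  36005076303ffd5aa000000000000002c
        alias ds8000_data01
    }
}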
Using DM-MP, you can configure various path failover policies, path priorities, and failover
priorities. This type of configuration can be done individually for each device or for devices of
a certain type in the /etc/multipath.conf setup.
The multipath -ll command displays the available multipath information for disk devices, as
shown in Example 12-3.
For more configuration and setup information, see the following publications:
For SUSE Linux Enterprise Server:
– http://www.suse.com/documentation/sles11/pdfdoc/stor_admin/stor_admin.pdf
– https://www.suse.com/documentation/sles-12/pdfdoc/stor_admin/stor_admin.pdf
This section always refers to LVM Version 2 and uses the term LVM.
With LVM2, you can influence the way that LEs (for an LV) are mapped to the available PEs.
With LVM linear mapping, the extents of several PVs are concatenated to build a larger LV.
physical
extent
logical
extent
stripe
Logical Volume
Figure 12-4 LVM striped mapping of three LUNs to a single logical volume
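A striped mapping as shown in Figure 12-4 can be created with the standard LVM commands. The following lines are a sketch; the multipath device names, sizes, and stripe size are examples, and the volume group and LV names match the ones that are used later in this chapter:

# Make three DS8000 multipath devices available to LVM
pvcreate /dev/mapper/mpathb /dev/mapper/mpathc /dev/mapper/mpathd
vgcreate ds8kvg /dev/mapper/mpathb /dev/mapper/mpathc /dev/mapper/mpathd
# Create a logical volume that is striped across all three devices (64 KB stripe size)
lvcreate -i 3 -I 64 -L 30G -n lvol0 ds8kvg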
For more information about how to use the command-line RAID tools in Linux, see this
website:
https://raid.wiki.kernel.org/index.php/Linux_Raid
The preferred way to use LVM, Easy Tier, and IOPM is to use LVM concatenated LVs. This
method might be useful when it is not possible to use volumes larger than 2 TB in the DS8700
and DS8800 storage systems (there are still some copy function limitations) or when implementing
disaster recovery solutions that require LVM involvement. In other cases, follow the preferred
practices that are described in Chapter 4, “Logical configuration performance considerations”
on page 83 and Chapter 3, “Logical configuration concepts and terminology” on page 47.
The I/O scheduler forms the interface between the generic block layer and the low-level
device drivers. Functions that are provided by the block layer can be used by the file systems
and the virtual memory manager to submit I/O requests to the block devices. These requests
are transformed by the I/O scheduler and sent to the low-level device drivers. Red Hat Enterprise Linux
AS 4 and SUSE Linux Enterprise Server 11 support four types of I/O schedulers.
You can obtain additional details about configuring and setting up I/O schedulers in Tuning
Linux OS on System p The POWER Of Innovation, SG24-7338.
With the capability to have different I/O elevators per disk system, the administrator now can
isolate a specific I/O pattern on a disk system (such as write-intensive workloads) and select
the appropriate elevator algorithm:
Synchronous file system access
Certain types of applications must perform file system operations synchronously, which
can be true for databases that might use a raw file system or for large disk systems where
caching asynchronous disk accesses is not an option. In those cases, the performance of
the anticipatory elevator usually has the least throughput and the highest latency. The
three other schedulers perform equally up to an I/O size of roughly 16 KB where the CFQ
and the Noop elevators outperform the deadline elevator (unless disk access is
seek-intense).
Complex disk systems
Benchmarks show that the Noop elevator is an interesting alternative in high-end server
environments. When using configurations with enterprise-class disk systems, such as the
DS8000 storage system, the lack of ordering capability of the Noop elevator becomes its
strength. It becomes difficult for an I/O elevator to anticipate the I/O characteristics of such
complex systems correctly, so you might often observe at least equal performance at less
impact when using the Noop I/O elevator. Most large-scale benchmarks that use hundreds
of disks most likely use the Noop elevator.
Database systems
Because of the seek-oriented nature of most database workloads, some performance gain
can be achieved when selecting the deadline elevator for these workloads.
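The elevator can be selected per block device at run time through sysfs. The following sketch shows how to display and change the elevator for one device (the device name is an example); on most distributions, the change can be made persistent with a kernel boot parameter or a udev rule:

# Show the available elevators; the active one is displayed in brackets
cat /sys/block/sdb/queue/scheduler
# Switch this device to the noop elevator
echo noop > /sys/block/sdb/queue/scheduler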
With the DS8000 IOPM feature, you can choose where to use I/O priority management. IOPM
has the following advantages:
Provides the flexibility of many levels of priorities to be set
Does not consume the resources on the server
Sets real priority at the disk level
Manages internal bandwidth access contention between several servers
We suggest that you use DS8000 IOPM for priority management in most cases. The
operating system priority management can be combined with IOPM, which provides the
highest level of flexibility.
Do not confuse the JFS versions for Linux and AIX operating systems. AIX differentiates the
older JFS (JFS generation 1) and JFS2. On Linux, only JFS generation 2 exists, but is simply
called JFS. Today, JFS is rarely used on Linux because ext4 typically offers better
performance.
As of November 2015, the DS8000 storage system supports RHEL and SUSE Linux
Enterprise Server distributions. Thus, this section focuses on the file systems supported by
these Linux distributions.
SUSE Linux Enterprise Server 12 ships with a number of file systems, including Ext Versions
2 - 4, Btrfs, XFS, and ReiserFS. Btrfs is the default for the root partition while XFS is the
default for other use cases. For more information, see the following website:
https://www.suse.com/documentation/sles-12/stor_admin/data/cha_filesystems.html
RHEL 7 file system support includes Ext versions 3 and 4, XFS, and Btrfs. XFS is the default
file system. For more information, see the following website:
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Stor
age_Administration_Guide/part-file-systems.html
The JFS and XFS workload patterns are suited for high-end data warehouses, scientific
workloads, large symmetric multiprocessor (SMP) servers, or streaming media servers.
ReiserFS and Ext3 are typically used for file, web, or mail serving.
ReiserFS is more suited to accommodate small I/O requests. XFS and JFS are tailored
toward large file systems and large I/O sizes. Ext3 fits the gap between ReiserFS and JFS
and XFS because it can accommodate small I/O requests while offering good multiprocessor
scalability.
Ext4 is compatible with Ext3 and Ext2 file systems. Mounting the older file systems as Ext4
can improve performance because new file system features, such as the newer block
allocation algorithm, are used.
Example 12-4 Update /etc/fstab file with noatime option set on mounted file systems
/dev/mapper/ds8kvg-lvol0 /ds8k ext4 defaults,noatime 1 2
Figure: throughput (kB/sec) versus I/O size (kB per operation) for the data=ordered and data=writeback journaling modes
The benefit of using writeback journaling declines as I/O sizes grow. Also, the journaling
mode of your file system affects only write performance. Therefore, a workload that performs
mostly reads (for example, a web server) does not benefit from changing the journaling mode.
There are three ways to change the journaling mode on a file system:
Run the mount command:
mount -o data=writeback /dev/mapper/ds8kvg-lvol0 /ds8k
Include the mode in the options section of the /etc/fstab file:
/dev/mapper/ds8kvg-lvol0 /ds8k ext4 defaults,data=writeback 1 2
If you want to modify the default data=ordered option on the root partition, change the
/etc/fstab file. Then, run the mkinitrd command to scan the changes in the /etc/fstab
file and create an initial RAM disk image. Update grub or lilo to point to the new image.
Blocksizes
The blocksize, the smallest amount of data that can be read or written to a drive, can have a
direct impact on server performance. As a guideline, if your server handles many small files, a
smaller blocksize is more efficient. If your server is dedicated to handling large files, a larger
blocksize might improve performance. Blocksizes cannot be changed dynamically on existing
file systems, and only a reformatting modifies the current blocksize.
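For example, a file system with an explicit 4 KB block size can be created on the logical volume that is used in this chapter (this command destroys any existing data on the volume):

# Format the logical volume with a 4 KB block size
mkfs.ext4 -b 4096 /dev/mapper/ds8kvg-lvol0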
1
The performance data that is contained in this figure was obtained in a controlled, isolated environment at a specific
point in time by using the configurations, hardware, and software levels available at that time. Actual results that might be
obtained in other operating environments can vary. There is no guarantee that the same or similar results can be
obtained elsewhere. The data is intended to help illustrate how different technologies behave in relation to each
other.
You can identify the DS8000 disks with the multipath -ll or multipathd -k commands, as
shown in Example 12-6. In current Linux versions, the multipathd -k interactive prompt is
used to communicate with DM-MP. multipath -ll is deprecated. For more information, see
IBM System Storage DS8000: Host Attachment and Interoperability, SG24-8887.
Example 12-6 on page 403 shows that the device is a DS8000 volume with an active-active
configuration. The LUNs have the names mpathc and mpathd and device names of dm-2 and
dm-3, which appear in the performance statistics. The LUN IDs in the parentheses,
36005076303ffd5aa000000000000002c and 36005076303ffd5aa000000000000002d, contain the
ID of the LV in the DS8000 storage system in the last four digits of the whole LUN ID: 002c
and 002d. The output also indicates that the size of each LUN is 10 GB. There is no hardware
handler that is assigned to this device, and I/O is supposed to be queued forever if no paths
are available. All path groups are in the active state, which means that all paths in this group
carry all the I/Os to the storage. All paths to the device (LUN) are in active ready mode.
There are two paths per LUN presented in the system as sdX devices, where X is the index
number of the disk.
Example 12-7 shows the contents of the /sys/class/fc_host folder. This system has four FC
ports. You might use a script similar to Example 12-8 to display the FC port information.
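Example 12-8 is not reproduced here. A minimal sketch of such a script, assuming the standard sysfs attributes, loops over the Fibre Channel hosts and prints the port name, state, type, current speed, and supported speeds; output similar to the lines that follow is produced:

#!/bin/bash
# Print WWPN, link state, port type, current speed, and supported speeds per FC port
for host in /sys/class/fc_host/host*; do
    cat "$host/port_name" "$host/port_state" "$host/port_type" \
        "$host/speed" "$host/supported_speeds"
done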
0x10008c7cff82b000
Linkdown
Unknown
unknown
2 Gbit, 4 Gbit, 8 Gbit, 16 Gbit
0x2100000e1e30c2ff
Online
NPort (fabric via point-to-point)
16 Gbit
4 Gbit, 8 Gbit, 16 Gbit
0x10008c7cff82b001
Linkdown
Unknown
unknown
2 Gbit, 4 Gbit, 8 Gbit, 16 Gbit
Another way to discover the connection configuration is to use the systool -av -c fc_host
command, as shown in Example 12-10. This command displays extended output and
information about the FC ports. However, this command might not be available in all Linux
distributions.
Example 12-10 Output of the port information with systool -av -c fc_host (only one port shown)
Class Device = "host5"
Class Device path =
"/sys/devices/pci0000:40/0000:40:03.0/0000:51:00.1/host5/fc_host/host5"
dev_loss_tmo = "30"
fabric_name = "0x10000005339ff896"
issue_lip = <store method only>
max_npiv_vports = "254"
node_name = "0x2000000e1e30c2ff"
npiv_vports_inuse = "0"
port_id = "0x011c00"
port_name = "0x2100000e1e30c2ff"
port_state = "Online"
port_type = "NPort (fabric via point-to-point)"
speed = "16 Gbit"
supported_classes = "Class 3"
supported_speeds = "4 Gbit, 8 Gbit, 16 Gbit"
symbolic_name = "QLE2662 FW:v6.03.00 DVR:v8.07.00.08.12.0-k"
system_hostname = ""
tgtid_bind_type = "wwpn (World Wide Port Name)"
Device = "host5"
Device path = "/sys/devices/pci0000:40/0000:40:03.0/0000:51:00.1/host5"
fw_dump =
nvram = "ISP "
optrom_ctl = <store method only>
optrom =
reset = <store method only>
sfp = ""
uevent = "DEVTYPE=scsi_host"
vpd = ")"
Example 12-10 on page 405 shows the output for one FC port:
The device file for this port is host5.
This port has a worldwide port name (WWPN) of 0x2100000e1e30c2ff, which appears in
the fabric.
This port is connected at 16 Gbps.
It is a QLogic card with firmware version 6.03.00.
The rank configuration for the disk can be shown by running the showfbvol -rank VOL_ID
command, where VOL_ID is 002c in this example (Example 12-11).
Example 12-11 on page 406 shows the properties of the LV. The following information can be
discovered for the volume:
Occupies one rank (R4)
Belongs to VG V40
Is 10 GB
Uses an extent allocation method (EAM) that is managed by Easy Tier
Uses a standard storage allocation method (SAM)
Is a regular, non-thin provisioned volume
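For illustration, the kind of showfbvol -rank query that Example 12-11 uses returns fields like the following sketch (the field set and layout vary between DSCLI versions; the values simply restate the properties that are listed above):
dscli> showfbvol -rank 002c
ID             002C
volgrp         V40
cap (2^30B)    10.0
sam            Standard
eam            managed
ranks          1
==============Rank extents==============
rank extents
============
R4   10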
Example 12-12 shows how to reveal the physical disk and array information. The properties of
the rank provide the array number, and the array properties provide the disk information and
the RAID type. In this case, rank R4 is on array A4, which consists of 3 TB SAS drives in a
RAID 6 configuration.
dscli> showarray A4
Date/Time: November 3, 2015 4:49:25 PM CET IBM DSCLI Version: 7.8.0.372 DS: IBM.2107-75ZA571
Array A4
SN BZA57150A3E368S
State Assigned
datastate Normal
RAIDtype 6 (5+P+Q+S)
arsite S15
Rank R4
DA Pair 2
DDMcap (10^9B) 3000.0
DDMRPM 7200
Interface Type SAS
interrate 6.0 Gb/sec
diskclass NL
encrypt supported
Example 12-13 on page 408 shows that VG V40 participates in two host connections for two
WWPNs: 2100000E1E30C2FE and 2100000E1E30C2FF. These WWPNs are the same as in
Example 12-9 on page 405. At this point, you have checked all the information for a specific volume.
The symptoms that show that the server might be suffering from a disk bottleneck (or a
hidden memory problem) are shown in Table 12-1.
Disk I/O numbers and wait time: Analyze the number of I/Os to the LUN. This data can be
used to discover whether reads or writes are the cause of the problem. Run iostat to get the
disk I/Os. Run stap ioblock.stp to get the read/write blocks. Also, run scsi.stp to get the
SCSI wait times and the requests that were submitted and completed. Long wait times might
also mean that the I/O goes to specific disks and is not spread out.
Disk I/O size: The memory buffer that is available for the block I/O request might not be
sufficient, and the page cache size can be smaller than the maximum disk I/O size. Run
stap ioblock.stp to get the request sizes. Run iostat to get the block sizes.
Disk I/O to physical device: If all the disk I/Os are directed to the same physical disk, a disk
I/O bottleneck can result. Directing the disk I/O to different physical disks increases the
performance.
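The iostat and SystemTap commands that are named in the table can be combined into a simple collection pass, as in the following sketch (the ioblock.stp and scsi.stp scripts are assumed to be available on the system, for example, from the SystemTap examples):
# Start the SystemTap scripts in the background
stap ioblock.stp > ioblock.out &
stap scsi.stp > scsi.out &
# Collect 10 extended iostat samples at 30-second intervals in the foreground
iostat -kx 30 10 > iostat.out
# Stop the SystemTap scripts when the iostat collection is finished
kill %1 %2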
Look at the statistics from the iostat tool to help you understand the situation. You can use
the following example situations as guidance.
Good situations
Good situations have the following characteristics:
High tps value, high %user value, low %iowait, and low svctm: A good condition, as
expected.
High tps value, high %user value, medium %iowait, and medium svctm: The situation is still
good, but requires attention. Write activity is probably a little higher than expected. Check
the write block size and queue size.
Low tps value, low %user value, medium-high %iowait value, low %idle value, high MB/sec
value, and high avgrq-sz value: The system performs well with large block write or read
activity.
Bad situations
Bad situations have the following characteristics:
Low tps value, low %user value, low %iowait, and low svctm: The system is not handling
disk I/O. If the application still suffers from the disk I/O, look at the application first, not at
the disk system.
Low tps value, low %user value, high %system value, high or low svctm, and 0 %idle
value: The system is stuck with disk I/O. This situation can happen because of a failed path, a
device adapter (DA) problem, or application errors.
High tps value, medium %user value, medium %system value, high %iowait value, and
high svctm: The system consumed the disk resources, and you must consider an upgrade.
Increase the number of disks and I/O paths first. In this case, review the physical layout.
Low tps value, low %user value, high %iowait value, high service time, high read queue,
and high write queue: Large block write activity exists on the same disks that are also
intended for read activity. If applicable, split the data for writes and reads onto separate
disks.
These situations are examples for your understanding. Many similar situations can occur.
Do not analyze only one or two values of the collected data; try to obtain a full picture by
combining all the available data.
Example 12-15 Monitoring workload distribution with the iostat -kxt command
avg-cpu: %user %nice %system %iowait %steal %idle
0,00 0,00 0,62 25,78 0,09 73,51
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00
sdb 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00
sdc 0,00 0,00 978,22 11,88 124198,02 6083,17 263,17 1,75 1,76 0,51 50,30
sdd 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00
sde 0,00 0,00 20,79 0,00 2538,61 0,00 244,19 0,03 0,95 0,76 1,58
dm-0 59369,31 1309,90 1967,33 15,84 244356,44 8110,89 254,61 154,74 44,90 0,50 99,41
Example 12-15 on page 412 shows that disk dm-0 is running the workload. It has four paths:
sdc, sde, sdg, and sdi. The workload is currently distributed to three paths for reading (sdc,
sde, and sdi) and to two paths for writing (sdc and sdi).
Example 12-16 shows the output of the iostat command on an LPAR that is configured with
1.2 processors and running RHEL while the server issues writes to the disks sda and dm-2.
The disk transfers per second are 130 for sda and 692 for dm-2. The %iowait is 6.37%, which
might seem high for this workload, but it is not; it is normal for a mixed read and write workload.
However, it might grow rapidly in the future, so pay attention to it.
Example 12-17 shows the output of the iostat -k command on an LPAR that is configured with
1.2 processors and running RHEL while the server issues writes to the sda and dm-2 disks.
The disk transfers per second are 428 for sda and 4024 for dm-2. The %iowait increased to
12.42%. The prediction from the previous example has come true. The workload became
higher and the %iowait value grew, but the %user value remained the same. The disk system
can now hardly manage the workload and requires some tuning or an upgrade. Although the
workload grew, the performance of the user processes did not improve. The application might
issue more requests, but they must wait in the queue instead of being serviced. Gather the
extended iostat statistics.
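For example, a simple way to gather the extended statistics over time is the following (interval and count are examples only):
# 60 samples of extended, timestamped statistics at 60-second intervals
iostat -kxt 60 60 > /tmp/iostat_$(hostname)_$(date +%Y%m%d_%H%M).log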
sar command
The sar command, which is included in the sysstat installation package, uses the standard
system activity data file to generate a report.
The system must be configured to collect the information and log it; therefore, a cron job
must be set up. Add the lines that are shown in Example 12-18 to /etc/crontab for
automatic log reporting with cron.
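For illustration, a typical sysstat crontab configuration looks like the following sketch; the sa1 and sa2 paths and intervals differ between distributions, so verify them on your system:
# Collect system activity data every 10 minutes
*/10 * * * * root /usr/lib64/sa/sa1 1 1
# Summarize the daily activity file shortly before midnight
53 23 * * * root /usr/lib64/sa/sa2 -A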
You get a detailed overview of your processor utilization (%user, %nice, %system, and %idle),
memory paging, network I/O and transfer statistics, process creation activity, activity for block
devices, and interrupts/second over time.
The sar -A command (the -A is equivalent to -bBcdqrRuvwWy -I SUM -I PROC -n FULL -U ALL,
which selects the most relevant counters of the system) is the most effective way to gather all
relevant performance counters. Use the sar command to analyze whether a system is disk
I/O-bound and I/O waits are high, which results in filled-up memory buffers and low processor
usage. Furthermore, this method is useful to monitor the overall system performance over a
longer period, for example, days or weeks, to understand at which times a claimed
performance bottleneck occurs.
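For example, the following commands report block device activity for a specific time window and the full counter set from a collected daily file (the file path follows the common sysstat layout and might differ on your system):
# Block device activity between 08:00 and 12:00 from the data file for the 15th
sar -d -f /var/log/sa/sa15 -s 08:00:00 -e 12:00:00
# All counters from the same file
sar -A -f /var/log/sa/sa15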
A variety of additional performance data collection utilities are available for Linux. Most of
them are transferred from UNIX systems. You can obtain more details about those additional
tools in 9.3, “AIX performance monitoring tools” on page 312.
The following IBM i specific features are important for the performance of external storage:
Single-level storage
Object-based architecture
Storage management
Types of disk pools
This section describes these features and explains how they relate to the performance of a
connected DS8800 storage system.
The IBM i system takes responsibility for managing the information in disk pools. When you
create an object, for example, a file, the system places the file in the best location that
ensures the best performance. It normally spreads the data in the file across multiple disk
units. Advantages of such a design are ease of use, self-management, automatic use of
added disk units, and so on. The IBM i object-based architecture is shown in Figure 13-1 on
page 417.
Figure 13-1 labels: IBM i LPAR; IBM i; TIMI – Technology Independent Machine Interface; IO
When the application performs an I/O operation, the portion of the program that contains read
or write instructions is first brought into main memory where the instructions are then run.
With the read request, the virtual addresses of the needed record are resolved, and for each
needed page, storage management first looks to see whether it is in main memory. If the
page is there, it is used to resolve the read request. However, if the corresponding page is not
in main memory, a page fault is encountered and the page must be retrieved from disk. When a page
is retrieved, it replaces another page in memory that was not recently used; the replaced
page is paged out to disk.
When resolving virtual addresses for I/O operations, storage management directories map
the disk and sector to a virtual address. For a read operation, a directory lookup is performed
to get the needed information for mapping. For a write operation, the information is retrieved
from the page tables.
System ASP
The system ASP is the basic disk pool for the IBM i system. This ASP contains the IBM i
system boot disk (load source), system libraries, indexes, user profiles, and other system
objects. The system ASP is always present in the IBM i system and is needed for the IBM i
system. The IBM i system does not start if the system ASP is inaccessible.
User ASP
A user ASP separates the storage for different objects for easier management. For example,
the libraries and database objects that belong to one application are in one user ASP, and the
objects of another application are in a different user ASP. If user ASPs are defined in the IBM
i system, they are needed for the IBM i system to start.
The DS8800 storage system can connect to the IBM i system in one of the following ways:
Native: FC adapters in the IBM i system are connected through a storage area network
(SAN) to the host bus adapters (HBAs) in the DS8800 storage system.
With Virtual I/O Server Node Port ID Virtualization (VIOS NPIV): FC adapters in the VIOS
are connected through a SAN to the HBAs in the DS8800 storage system. The IBM i
system is a client of the VIOS and uses virtual FC adapters; each virtual FC adapter is
mapped to a port in an FC adapter in the VIOS.
For more information about connecting the DS8800 storage system to the IBM i system
with VIOS NPIV, see DS8000 Copy Services for IBM i with VIOS, REDP-4584, and IBM
System Storage DS8000: Host Attachment and Interoperability, SG24-8887.
With VIOS: FC adapters in the VIOS are connected through a SAN to the HBAs in the
DS8800 storage system. The IBM i system is a client of the VIOS, and virtual SCSI
adapters in VIOS are connected to the virtual SCSI adapters in the IBM i system.
For more information about connecting storage systems to the IBM i system with the
VIOS, see IBM i and Midrange External Storage, SG24-7668, and DS8000 Copy Services
for IBM i with VIOS, REDP-4584.
Most installations use the native connection of the DS8800 storage system to the IBM i
system or the connection with VIOS NPIV.
IBM i I/O processors: The information that is provided in this section refers to connection
with IBM i I/O processor (IOP)-less adapters. For similar information about older
IOP-based adapters, see IBM i and IBM System Storage: A Guide to Implementing
External Disks on IBM i, SG24-7120.
Note: The supported adapters depend on the type of POWER server and the level of
the IBM i system. For detailed specifications, see the IBM System Storage Interoperation
Center (SSIC) at the following website:
http://www.ibm.com/systems/support/storage/ssic/interoperability.wss
All listed adapters are IOP-less adapters. They do not require an I/O processor card to offload
the data management. Instead, the processor manages the I/O and communicates directly
with the FC adapter. Thus, the IOP-less FC technology takes full advantage of the
performance potential in the IBM i system.
Before the availability of IOP-less adapters, the DS8800 storage system connected to
IOP-based FC adapters that require the I/O processor card.
IOP-less FC architecture enables two technology functions that are important for the
performance of the DS8800 storage system with the IBM i system: Tag Command Queuing
and Header Strip Merge.
13.2.3 Multipath
The IBM i system allows multiple connections from different ports on a single IBM i partition to
the same LVs in the DS8800 storage system. This multiple-connection support provides an
extra level of availability and error recovery between the IBM i system and the DS8800
storage system. If one IBM i adapter fails, or one connection to the DS8800 storage system is
lost, you can continue using the other connections and continue communicating with the disk
unit. The IBM i system supports up to eight active connections (paths) to a single LUN in the
DS8800 storage system.
In addition to high availability, multiple paths to the same LUN provide load balancing. A
Round-Robin algorithm is used to select the path for sending the I/O requests. This algorithm
enhances the performance of the IBM i system with the DS8800 LUNs connected in
Multipath.
Multipath is part of the IBM i operating system. This Multipath differs from other platforms that
have a specific software component to support multipathing, such as the Subsystem Device
Driver (SDD).
When the DS8800 storage system connects to the IBM i system through the VIOS, Multipath
in the IBM i system is implemented so that each path to a LUN uses a different VIOS.
Therefore, at least two VIOSs are required to implement Multipath for an IBM i client. This way
of multipathing provides additional resiliency if one VIOS fails. In addition to IBM i Multipath
with two or more VIOS, the FC adapters in each VIOS can multipath to the connected
DS8800 storage system to provide additional resiliency and enhance performance.
Use RAID 10 for IBM i systems, especially for the following types of workloads:
Workloads with large I/O rates
Workloads with many write operations (low read/write ratio)
When an IBM i page or a block of data is written to disk space, storage management spreads
it over multiple disks. By spreading data over multiple disks, multiple disk arms work in
parallel for any request to this piece of data, so writes and reads are faster.
When using external storage with the IBM i system, storage management sees an LV (LUN)
in the DS8800 storage system as a “physical” disk unit. If a LUN is created with the rotate
volumes extent allocation method (EAM), it occupies multiple stripes of a rank. If a LUN is
created with the rotate extents EAM, it is composed of multiple stripes of different ranks.
Figure 13-3 shows the use of the DS8800 disk with IBM i LUNs created with the rotate
extents EAM.
Figure 13-3 Use of disk arms with LUNs created in the rotate extents method (figure labels: 6+P+S arrays, 7+P array, LUN 1, LUN 2, disk units)
Therefore, a LUN uses multiple DS8800 disk arms in parallel. The same DS8800 disk arms
are used by multiple LUNs that belong to the same IBM i workload, or even to different IBM i
workloads. To efficiently support this structure of I/O and data spreading across LUNs and
disk drives, it is important to provide enough disk arms to an IBM i workload.
Use the Disk Magic tool when you plan the number of ranks in the DS8800 storage system for
an IBM i workload. To provide a good starting point for Disk Magic modeling, consider the
number of ranks that is needed to keep disk utilization under 60% for your IBM i workload.
Table 13-1 on page 423 shows the maximum number of IBM i I/O operations per second for one
rank that keeps the disk utilization under 60%, for workloads with read/write ratios of 70/30 and 50/50.
Use the following steps to calculate the necessary number of ranks for your workload by using
Table 13-1:
1. Decide which read/write ratio (70/30 or 50/50) is appropriate for your workload.
2. Decide which RAID level to use for the workload.
3. Look for the corresponding number in Table 13-1.
4. Divide the I/O/sec of your workload by the number from the table to get the number of
ranks.
For example, consider a medium IBM i workload with a read/write ratio of 50/50 that
experiences 8500 I/O operations per second. The 15 K RPM disk drives in RAID 10 are used
for the workload. Dividing the workload I/O rate by the corresponding per-rank value from
Table 13-1 gives the number of needed ranks. Therefore, use eight ranks of 15 K RPM disk
drives in RAID 10 for the workload.
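The same calculation can be scripted. In the following sketch, the per-rank value is only a placeholder and must be replaced with the actual value from Table 13-1 for the chosen RAID level and read/write ratio:
# Number of ranks = workload IOPS divided by the per-rank IOPS from Table 13-1,
# rounded up to the next whole rank
workload_iops=8500
per_rank_iops=1100        # placeholder only; use the value from Table 13-1
echo $(( (workload_iops + per_rank_iops - 1) / per_rank_iops ))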
With IBM i internal disks, a disk unit is a physical disk drive. With a connected DS8800
storage system, a disk unit is a LUN. Therefore, it is important to provide many LUNs to an
IBM i system. For a given disk capacity, define more, smaller LUNs.
Number of disk drives in the DS8800 storage system: In addition to the suggestion for
many LUNs, use a sufficient number of disk drives in the DS8800 storage system to
achieve good IBM i performance, as described in 13.3.2, “Number of ranks” on page 422.
1 The calculations for the values in Table 13-1 are based on measurements of how many I/O operations one rank
can handle in a certain RAID level, assuming a 20% read cache hit ratio and 30% write cache efficiency for the
IBM i workload, and assuming that half of the used ranks have a spare and half are without a spare.
Also, by considering the manageability and limitations of external storage and an IBM i
system, define LUN sizes of about 70 - 140 GB.
You might think that the rotate volumes EAM for creating IBM i LUNs provides sufficient disk
arms for I/O operations and that the use of the rotate extents EAM is “overvirtualizing”.
However, based on performance measurements and preferred practices, the rotate
extents EAM for defining LUNs for an IBM i system still provides the preferred performance, so
use it.
Sharing the ranks among the IBM i systems enables the efficient use of the DS8800
resources. However, the performance of each LPAR is influenced by the workloads in the
other LPARs.
For example, two extent pools are shared among IBM i LPARs A, B, and C. LPAR A
experiences a long peak with large blocksizes that causes a high I/O load on the DS8800
ranks. During that time, the performance of B and the performance of C decrease. But, when
the workload in A is low, B and C experience good response times because they can use
most of the disk arms in the shared extent pool. In these periods, the response times in B and
C are possibly better than if they use dedicated ranks.
You cannot predict when the peaks in each LPAR happen, so you cannot predict how the
performance in the other LPARs is influenced.
Many IBM i data centers successfully share the ranks with little unpredictability in performance
because the disk arms and cache in the DS8800 storage system are used more efficiently
this way.
Many IBM i installations have one or two LPARs with important workloads and several smaller,
less important LPARs. These data centers dedicate ranks to the large systems and share the
ranks among the smaller ones.
For more information about the use of Disk Magic with an IBM i system, see 6.1, “Disk Magic”
on page 160 and IBM i and IBM System Storage: A Guide to Implementing External Disks on
IBM i, SG24-7120.
Note:
We used the same capacity of about 1840 GiB in each test.
For two paths to an IBM i system, we used two ports in different 16 Gb adapters; for four
paths to an IBM i system, we used two ports in different 16 Gb adapters and two ports
in different 8 Gb adapters. Two IBM i ports are in a SAN switch zone with one DS8870
port.
Note: Disk response time in an IBM i system consists of service time (the time for I/O
processing) and wait time (the time of potential I/O queuing in the IBM i host).
To provide a clearer comparison, these figures show a graph comparing the performance of two
and four paths with the same LUN size, and a graph comparing the performance of different
LUN sizes with the same number of paths, for either service time or duration.
Comparing the disk response times of different LUN sizes with a given number of paths shows
similar performance.
Elapsed times are similar when using two or four paths, and when using different LUN sizes.
A slightly higher duration is experienced with 262.9 GiB LUNs and with four paths, although the
service time in this scenario is lower; this result can be because of waits other than I/O waits
during the jobs of the workload.
Note: The wait time is 0 in many of the tests, so we do not show it in the graphs. The only
wait time greater than 0 is experienced in Readfile with seven 262.9 GiB LUNs; we show it
in the graph in Figure 13-6.
Figure 13-6 DS8870 HPFE - compare Readfile service time and wait time
You see slightly shorter disk response times when using four paths compared to using two
paths, which is most probably because of the large transfer sizes that typically require more
bandwidth for good performance. The workload experiences a longer response time when
running on seven 262.9 GiB LUNs, compared to the response time with smaller LUNs.
When using four paths, durations increase with increasing LUN size. With two paths, you do
not see such an increase, which might be because of the high utilization of the IBM i
and DS8870 ports that masks the differences.
There is no difference in service time comparing two and four paths, and there is no
difference comparing different LUN sizes.
Using four paths provides shorter elapsed times than using two paths. With two paths, the
262.9 GiB LUNs enable the shortest duration, and differences in duration with four paths are
small.
Figure 13-10 Service time in an IBM i system and response time in a DS8870 storage system
Therefore, use LUN sizes of 50 - 150 GiB on flash storage in HPFE. The more LUNs that
are implemented, the better the I/O performance can potentially become, so make sure that you
create at least eight LUNs for an IBM i system.
Therefore, zone one port in an IBM i system with one port in the DS8870 storage system
when running the workload on flash cards.
When sizing disk capacity per port, consider that the access density of a workload increases
when the workload is implemented on flash storage.
Each path is implemented with a port in a 16 Gb adapter in an IBM i system and a port in a
16 Gb adapter in the DS8886 storage system.
Figure 13-11 Writefile for a DS8886 storage system - compare service time
Figure 13-12 Writefile for a DS8886 storage system - compare elapsed times
Durations show the performance benefit of implementing more paths to an IBM i system.
Although the service times are the same when using more paths, the durations are shorter. The
explanation is that the more paths that are used, the more IOPS occur during a given
workload, which results in a shorter elapsed time.
When implementing four paths, a larger number of smaller LUNs shows better performance
than a smaller number of bigger LUNs. The difference in these tests is small, but in environments
with larger capacities, the difference might be significant.
Service times show a drastic improvement when using three or four paths. With the Readfile
workload, this difference in performance is significant. The reason for this is the large
blocksizes, which make the workload sensitive to the available bandwidth.
The performance benefit of a larger number of LUNs is shown by the duration times when using
four paths. When using fewer paths, this benefit might be masked by constraints in the available
bandwidth.
Clearly, the bandwidth that is provided by two paths limits performance compared to
bandwidth with three and four paths, so there is higher service time and elapsed time when
using two paths. With higher available bandwidth, there is almost no difference in service
times and durations when using different numbers of LUNs.
Both service time and elapsed time show performance benefits when implementing more
paths. However, the number of LUNs has minimal influence on performance.
As a preferred practice, use 50 - 150 GiB LUN sizes for a given capacity. Potentially, better
performance is achieved with LUN sizes smaller than 100 GiB. Make sure that you create at
least eight LUNs for an IBM i system.
To provide sufficient port bandwidth in a DS8886 storage system, zone one port in an IBM i
system with one port in the DS8886 storage system when running the workload on flash
storage.
As a preferred practice, use at least as many ports as are recommended in the IBM
publications that are listed in 13.4.5, “Conclusions and recommendations” on page 434,
taking into account a maximum of 70% utilization of a port in the peaks. The workloads with
an I/O rate of 20000 - 60000 IOPS should experience potentially better performance by using
four paths to the LUNs on flash cards in the DS8886 storage system.
When implementing smaller and less important IBM i workloads with flash storage, it might be
a good idea to share an extent pool among the hosts.
To help you better understand the tool functions, they are divided into two groups:
performance data collectors (the tools that collect performance data) and performance data
investigators (the tools to analyze the collected data).
Collectors can be managed by IBM System Director Navigator for i, IBM System i Navigator,
or IBM i commands.
The following tools are or contain the IBM i performance data investigators:
IBM Performance Tools for i
IBM System Director Navigator for i
iDoctor
Most of these comprehensive planning tools address the entire spectrum of workload
performance on System i, including processor, system memory, disks, and adapters. To plan
or analyze performance for the DS8800 storage system with an IBM i system, use the parts of
the tools or their reports that show the disk performance.
Collection Services
The major IBM i performance data collector is called Collection Services. It is designed to run
all the time to provide data for performance health checks, for analysis of a sudden
performance problem, or for planning new hardware and software upgrades. The tool is
documented in detail in the IBM i IBM Knowledge Center:
http://www.ibm.com/support/knowledgecenter/ssw_ibm_i/welcome
The following tools can be used to manage the data collection and report creation of
Collection Services:
IBM System i Navigator
IBM Systems Director Navigator for i
IBM Performance Tools for i
iDoctor Collection Service Investigator can be used to create graphs and reports based on
Collection Services data. For more information about iDoctor, see the IBM i iDoctor online
documentation at the following website:
https://www.ibm.com/i_dir/idoctor.nsf/documentation.html
With IBM i level V7R1, the Collection Services tool offers additional data collection categories,
including a category for external storage. This category supports the collection of
nonstandard data that is associated with certain external storage subsystems that are
attached to an IBM i partition. This data can be viewed within iDoctor, which is described in
“iDoctor” on page 446.
Another resource is “Web Power - New browser-based Job Watcher tasks help manage your
IBM i performance” in the IBM Systems Magazine (on the IBM Systems Magazine page,
search for the title of the article):
http://www.ibmsystemsmag.com/ibmi
Disk Watcher
Disk Watcher is a function of an IBM i system that provides disk data to help identify the
source of disk-related performance problems on the IBM i platform. It can either collect
information about every I/O in trace mode or collect information in buckets in statistics mode.
In statistics mode, it can run more often than Collection Services to see more granular
statistics. The command strings and file layouts are documented in the IBM i IBM Knowledge
Center:
http://www.ibm.com/support/knowledgecenter/ssw_ibm_i/welcome
For more information about the use of Disk Watcher, see “A New Way to Look at Disk
Performance” and “Analyzing Disk Watcher Data” in the IBM Systems Magazine (on the IBM
Systems Magazine page, search for the title of the article):
http://www.ibmsystemsmag.com/ibmi
Disk Watcher gathers detailed information that is associated with I/O operations to disk units,
and provides data beyond the data that is available in other IBM i integrated tools, such as
Work with Disk Status (WRKDSKSTS), Work with System Status (WRKSYSSTS), and Work with
System Activity (WRKSYSACT).
Performance Explorer
PEX is a data collection tool in the IBM i system that collects information about a specific
system process or resource to provide detailed insight. PEX complements IBM i Collection
Services.
An example of PEX, the IBM i system, and connection with external storage is identifying the
IBM i objects that are most suitable to relocate to SSDs. You can use PEX to collect IBM i disk
events, such as synchronous and asynchronous reads, synchronous and asynchronous
writes, page faults, and page-outs. The collected data is then analyzed by the iDoctor tool
PEX-Analyzer to observe the I/O rates and disk service times of different objects. The objects
that experience the highest accumulated read service time, the highest read rate, and a
modest write rate at the same time are good candidates to relocate to SSDs.
For a better understanding of IBM i architecture and I/O rates, see 13.1, “IBM i storage
architecture” on page 416. For more information about using SSDs with the IBM i system, see
13.7, “Easy Tier with the IBM i system” on page 448.
The Job Watcher part of Performance Tools analyzes the Job Watcher data through the IBM
Systems Director Navigator for i Performance Data Visualizer.
Collection Services reports about disk utilization and activity, which are created with
IBM Performance Tools for i, are used for sizing and Disk Magic modeling of the DS8800
storage system for the IBM i system:
The Disk Utilization section of the System report
The Disk Utilization section of the Resource report
The Disk Activity section of the Component report
The Performance section of IBM Systems Director Navigator for i provides tasks to manage
the collection of performance data and view the collections to investigate potential
performance issues. Figure 13-17 shows the menu of performance functions in the IBM
Systems Director Navigator for i.
iDoctor
iDoctor is a suite of tools that is used to manage the collection of data, investigate
performance data, and analyze performance data on the IBM i system. The goals of iDoctor
are to broaden the user base for performance investigation, simplify and automate processes
of collecting and investigating the performance data, provide immediate access to collected
data, and offer more analysis options.
The iDoctor tools are used to monitor the overall system health at a high level or to drill down
to the performance details within jobs, disk units, and programs. Use iDoctor to analyze data
that is collected during performance situations. iDoctor is frequently used by IBM, clients, and
consultants to help solve complex performance issues quickly.
2. Select the PEX collection on which you want to work and select the type of graph that you
want to create, as shown in Figure 13-19.
IBM i Storage Manager spreads the IBM i data across the available disk units (LUNs) so that
each disk drive is about equally occupied. The data is spread in extents that are 4 KB - 1 MB
or even 16 MB. The extents of each object usually span as many LUNs as possible to provide
many volumes to serve the particular object. Therefore, if an object experiences a high I/O
rate, this rate is evenly split among the LUNs. The extents that belong to the particular object
on each LUN are I/O-intense.
Also, the Easy Tier tool monitors and relocates data on the 1 GB extent level. IBM i ASP
balancing, which is used to relocate data to SSDs, works on the 1 MB extent level. Monitoring
extents and relocating extents do not depend on the object to which the extents belong; they
occur on the subobject level.
In certain cases, queries must be created to run on the PEX collection to provide specific
information, for example, the query that provides information about which jobs and threads
use the objects with the highest read service time. You might also need to run a query to
provide the blocksizes of the read operations because you expect that the reads with
smaller blocksizes profit the most from SSDs. If these queries are needed, contact IBM
Lab Services to create them.
3. Based on the PEX analysis, decide which database objects to relocate to the SSDs in the
DS8800 storage system. Then, use IBM i commands such as Change Physical File
(CHGPF) with the UNIT(*SSD) parameter, or use the SQL statement ALTER TABLE with the
UNIT SSD clause, which sets a preferred media attribute on the file that starts dynamic data
movement (see the command sketch after this step). The preferred media attribute can be set
on database tables and indexes, and on User-Defined File Systems (UDFS).
For more information about the UDFS, see the IBM i IBM Knowledge Center:
http://www.ibm.com/support/knowledgecenter/ssw_ibm_i/welcome
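For illustration, such commands can look like the following sketch; MYLIB and MYTABLE are hypothetical object names, and the exact SQL clause placement can vary by IBM i release:
CHGPF FILE(MYLIB/MYTABLE) UNIT(*SSD)
ALTER TABLE MYLIB.MYTABLE UNIT SSD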
ASP balancing
This IBM i method is similar to DS8800 Easy Tier because it is based on the data movement
within an ASP by IBM i ASP balancing. The ASP balancing function is designed to improve
IBM i system performance by balancing disk utilization across all of the disk units (or LUNs) in
an ASP. It provides four ways to balance an ASP. Two of these ways relate to data relocation
to SSDs:
Hierarchical Storage Management (HSM) balancing
Media preference balancing
The Media preference balancer function is the ASP balancing function that helps to correct
any issues with Media preference-flagged database objects or UDFS files not on their
preferred media type, which is either SSD or HDD, based on the specified subtype parameter.
The function is started by the STRASPBAL TYPE(*MP) command with the SUBTYPE parameter
equal to either *CALC (for data migration to both SSD and HDD), *SSD, or *HDD.
ASP balancer migration priority is an option in the ASP balancer so that you can specify the
migration priority for certain balancing operations, including *HSM or *MP in levels of either
*LOW, *MEDIUM, or *HIGH, thus influencing the speed of data migration.
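For example, a media preference balance run that moves data to both SSD and HDD can be started as shown in the following minimal sketch; check the command prompt on your IBM i level for the full parameter set, including the migration priority:
STRASPBAL TYPE(*MP) SUBTYPE(*CALC) TIMLMT(*NOMAX)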
Location: For data relocation with Media preference or ASP balancing, the LUNs defined
on SSD and on HDD must be in the same IBM i ASP. It is not necessary that they are in the
same extent pool in the DS8800 storage system.
The method requires that you create a separate ASP that contains LUNs that are on the
DS8800 SSD and then save the relevant IBM i libraries and restore them to the ASP with
SSD. All the files in the libraries then are on SSDs, and the performance of the applications
that use these files improves.
Additional information
For more information about the IBM i methods for SSD hot-spot management, including the
information about IBM i prerequisites, see the following documents:
IBM i 7.1 Technical Overview with Technology Refresh Updates, SG24-7858
Performance Value of Solid-State Drives using IBM i, found at:
http://www.ibm.com/systems/resources/ssd_ibmi.pdf
Before deciding on a mixed SSD and HDD environment or deciding to obtain additional
SSDs, consider these questions:
How many SSDs do you need to install to get the optimal balance between the
performance improvement and the cost?
What is the estimated performance improvement after you install the SSDs?
The clients that use IBM i Media preference get at least a partial answer to these questions
from the collected PEX data by using queries and calculations. The clients that decide on
DS8800 Easy Tier or even IBM i ASP balancing get the key information to answer these
questions by the skew level of their workloads. The skew level describes how the I/O activity is
distributed across the capacity for a specific workload. The workloads with the highest skew
level (heavily skewed workloads) benefit most from the Easy Tier capabilities because even
when moving a small amount of data, the overall performance improves. For more information
about the skew level, see 6.4, “Disk Magic Easy Tier modeling” on page 208. You can obtain
the skew level only for the workloads that run on the DS8800 storage system with Easy Tier.
To provide an example of the skew level of a typical IBM i installation, we use an IBM i
benchmark workload that is based on the TPC-E workload. TPC-E is an online
transaction processing (OLTP) workload that was developed by the Transaction Processing Performance Council.
It uses a database to model a brokerage firm with customers who generate transactions that
are related to trades, account inquiries, and market research. The brokerage firm in turn
interacts with financial markets to run orders on behalf of the customers and updates relevant
account information. The benchmark workload is scalable, which means that the number of
customers who are defined for the brokerage firm can be varied to represent the workloads of
different-sized businesses. The workload runs with a configurable number of job sets. Each
job set runs independently, generating its own brokerage firm transactions. By increasing
the number of job sets, you increase the throughput and processor utilization of the run.
In our example, we used the following configuration for Easy Tier monitoring for which we
obtained the skew level:
IBM i LPAR with eight processing units and 60 GB memory in POWER7 model 770.
Disk space for the IBM i provided from an extent pool with four ranks of HDD in DS8800
code level 6.2.
Forty-eight LUNs of the size 70 GB used for the IBM i system (the LUNs are defined in the
rotate extents EAM from the extent pool).
The LUNs are connected to the IBM i system in Multipath through two ports in separate
4 Gb FC adapters.
In the IBM i LPAR, we ran the following two workloads in turn:
– The workload with six database instances and six job sets
– The workload with six database instances and three job sets
The workload with six database instances was used to achieve 35% occupation of the disk
space. During the run with six job sets, the access density was about 2.7 IO/sec/GB. During
the run with three job sets, the access density was about 0.3 IO/sec/GB.
To use the STAT for an IBM i workload, complete the following steps:
1. Enable the collection of the heat data I/O statistics by changing the Easy Tier monitor
parameter to all or automode. Use the DS8800 command-line interface (DSCLI)
command chsi -etmonitor all or chsi -etmonitor automode. The parameter -etmonitor
all enables monitoring on all LUNs in the DS8800 storage system. The parameter
-etmonitor automode monitors the volumes that are managed by Easy Tier automatic
mode only.
2. Offload the collected data from the DS8800 clusters to the user workstation. Use either
the DS8800 Storage Manager GUI or the DSCLI command offloadfile -etdata
<directory>, where <directory> is the directory where you want to store the files with the
data on your workstation (see the command sketch after these steps).
3. The offloaded data is stored in the following files in the specified directory:
– SF75NT100ESS01_heat.data
– SF75NT100ESS11_heat.data
The variable 75NT100 denotes the serial number of the DS8800 storage facility. The
variables ESS01 and ESS11 are the processor complexes 0 and 1.
4. Use the STAT on your workstation to create the heat distribution report that can be
presented in a web browser.
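The DSCLI commands that are named in steps 1 and 2 look like the following sketch; the storage image ID and the target directory are examples only:
dscli> chsi -etmonitor all IBM.2107-75ZA571
dscli> offloadfile -etdata /tmp/etdata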
Figure 13-24 on page 455 shows an example of the STAT heat distribution on IBM i LUNs
after running the IBM i workload described in 13.7.3, “Skew level of an IBM i workload” on
page 452. The hot and warm data is evenly spread across the volumes, which is typical for an
IBM i workload distribution.
An IBM i client can also use the IBM i Media preference or ASP balancing method for hot-spot
management. It is not the goal to compare the performance for the three relocation methods.
However, do not expect much difference in performance by using one or another method.
Factors such as ease of use, consolidation of the management method, or control over which
data to move, are more important for an IBM i client to decide which method to use.
Many IBM i clients run multiple IBM i workloads in different POWER partitions that share the
disk space in the DS8800 storage system. The installations run important production systems
and less important workloads for testing and developing. The other partitions can be used as
disaster recovery targets of production systems in another location. Assume that IBM i
centers with various workloads that share the DS8800 disk space use I/O Priority Manager to
achieve a more efficient spread of storage resources.
Here is a simple example of using the I/O Priority Manager for two IBM i workloads.
The POWER partition ITSO_1 is configured with four processor units, 56 GB memory, and
forty-eight 70 GB LUNs in a DS8800 extent pool with Enterprise drives.
The partition ITSO_2 is configured with one processor unit, 16 GB memory, and thirty-two
70 GB LUNs in a shared hybrid extent pool with SSDs, Enterprise drives, and Nearline
disk drives.
Performance Group 1 is defined for the 64 LUNs, but only 48 of these LUNs are
added to the ASP and used by the system ITSO_1; the other LUNs are in non-configured
status in the IBM i system.
We set up the I/O Priority Manager Performance Group 11 (PG11) for the volumes of ITSO_2
by using the following DSCLI command:
chfbvol -perfgrp pg11 2200-221f
After we defined the performance groups for the IBM i LUNs, we ran the IBM i benchmark
workload that is described in 13.7.3, “Skew level of an IBM i workload” on page 452, with 40 job sets
in each of the ITSO_1 and ITSO_2 partitions.
After the workload finished, we obtained the monitoring reports of each performance group,
PG1 with the LUNs of ITSO_1 and PG11 with the LUNs of ITSO_2, during the 5-hour
workload run with 15-minute monitoring intervals.
Figure 13-25 and Figure 13-26 on page 457 show the DSCLI commands that we used to
obtain the reports and the displayed performance values for each performance group. The
workload in Performance Group PG11 shows different I/O characteristics than the workload in
Performance Group 1. Performance Group PG11 also experiences relatively high response
times compared to Performance Group 1. In our example, the workload characteristics and
response times are influenced by the different priority groups, types of disk drives used, and
Easy Tier management.
For more information about IBM i performance with the DS8000 I/O Priority Manager, see
IBM i Shared Storage Performance using IBM System Storage DS8000 I/O Priority Manager,
WP101935, which is available in the IBM Techdoc library.
A note about DS8000 sizing: z Systems I/O workload is complex. Use RMF data and
Disk Magic models for sizing. For more information about Disk Magic and how to get a
model, see 6.1, “Disk Magic” on page 160.
IBM z Systems and the IBM System Storage DS8000 storage systems family have a long
common history. Numerous features were added to the whole stack of server and storage
hardware, firmware, operating systems, and applications to improve I/O performance. This
level of synergy is unique to the market and is possible only because IBM is the owner of the
complete stack. This chapter does not describe these features because they are explained in
other places:
For an overview of z Systems synergy features, see “Performance characteristics for z
Systems” on page 10.
For a detailed description of these features, see IBM DS8870 and IBM z Systems
Synergy, REDP-5186.
The current zSynergy features are also explained in detail in Get More Out of Your IT
Infrastructure with IBM z13 I/O Enhancements, REDP-5134.
Contact your IBM representative or IBM Business Partner if you have questions about the
expected performance capability of IBM products in your environment.
The data is then stored and processed in several ways. The ones that are described and used
in this chapter are:
Data that is collected by the three monitors can be stored as SMF records (SMF types
70 - 79) for later reporting.
RMF Monitor III can write VSAM records to an in-storage buffer or into VSAM data sets.
The RMF postprocessor is the function to extract historical reports for Monitor I data.
Other methods of working with RMF data, which are not described in this chapter, are:
The RMF Spread Sheet Reporter provides a graphical presentation of long-term
Postprocessor data. It helps you view and analyze performance at a glance or perform a
system health check.
The RMF Distributed Data Server (DDS) supports HTTP requests to retrieve RMF
Postprocessor data from a selection of reports since z/OS 1.12. The data is returned as an
XML document, so a web browser can act as Data Portal to RMF data.
With z/OS 1.12, z/OS Management Facility provides the presentation for DDS data.
RMF for z/OS 1.13 enhances the DDS layer and provides a new solution, RMF XP, that
enables cross-platform performance monitoring.
The following sections describe how to gather and store RMF Monitor I data and then extract
it as reports by using the RMF postprocessor.
RMF Monitor I gatherer session options and write SMF record types
To specify which types of data RMF is collecting, you specify Monitor I session gatherer
options in the ERBRMFxx parmlib member.
Table 14-1 shows the Monitor I session options and associated SMF record types that are
related to monitoring I/O performance. The defaults are emphasized.
Table 14-1 Monitor I gatherer session options and write SMF record types (columns: Activities, Session options in the ERBRMFxx parmlib member, and SMF record types for long-term Monitor I data)
Note: The Enterprise Storage Server activity is not collected by default. Change this and
turn on the collection if you have DS8000 storage systems installed. It provides valuable
information about DS8000 internal resources.
Note: FCD performance data is collected from the FICON directors. You must have the
FICON Control Unit Port (CUP) feature licensed and installed on your directors.
Certain measurements are performed by the storage hardware. The associated RMF record
types are 74.5, 74.7, and 74.8. They are marked with an (H) in Table 14-1 on page 461. They
do not have to be collected by each attached z/OS system separately; it is sufficient to get
them from one, or for redundancy reasons, two systems.
Note: Many clients who have several z/OS systems sharing disk systems typically collect
these records from two production systems that are always up and are never down at the
same time.
In the ERBRMFxx parmlib member, you also find a TIMING section, where you can set the
RMF sampling cycle. It defaults to 1 second, which should be good for most cases. The RMF
cycle does not determine the amount of data that is stored in SMF records.
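A fragment of an ERBRMFxx member that covers the I/O-related gatherer options that are described in this section might look like the following sketch (verify the option names and defaults against your z/OS level; the inline comments are explanatory only):
CYCLE(1000)      /* 1-second sampling cycle (the default)                */
CHAN             /* channel path activity                                */
DEVICE(DASD)     /* DASD device activity                                 */
IOQ(DASD)        /* I/O queuing activity                                 */
CACHE            /* cache subsystem activity                             */
ESS              /* DS8000 (Enterprise Storage Server) statistics        */
FCD              /* FICON Director statistics (requires the CUP feature) */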
To store the collected RMF data, you must make sure that the associated SMF record types
(70 - 78) are included in the SYS statement in the SMFPRMxx parmlib member.
You also must specify the interval at which RMF data is stored. You can either do this explicitly
for RMF in the ERBRMFxx parmlib member or use the system wide SMF interval. Depending
on the type of data, RMF samples are added up or averaged for each interval. The number of
SMF record types that you store and the interval determine the amount of data that is stored.
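In SMFPRMxx, the record-type selection part of the SYS statement then needs to include at least the following fragment (other SYS parameters and record types are omitted from this sketch):
SYS(TYPE(70:78))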
For more information about setting up RMF and the ERBRMFxx and SMFPRMxx parmlib
members, see z/OS Resource Measurement Facility User's Guide, SC34-2664.
Important: The shorter your interval, the more accurate your data is. However, there
always is a trade-off between shorter interval and the size of the SMF data sets.
IFASMFDP can also be used to extract and concatenate certain record types or time ranges
from existing sequential SMF dump data sets. For more information about the invocation and
control of IFASMFDP, see z/OS MVS System Management Facilities (SMF), SA38-0667.
To create meaningful RMF reports or analysis, the records in the dump data set must be
chronological. This is important if you plan to analyze RMF data from several LPARs. Use a
SORT program to combine the individual data sets and sort them by date and time.
Example 14-2 shows a sample job snippet with the required sort parameters by using the
DFSORT program.
RMF postprocessor
The RMF postprocessor analyzes and summarizes RMF data into human-readable reports.
Example 14-3 shows a sample job to run the ERBRMFPP postprocessing program.
In the control statements, you specify the reports that you want to get by using the REPORTS
keyword. Other control statements define the time frame, intervals, and summary points for
the reports to create. For more information about the available control statements, see z/OS
Resource Measurement Facility User's Guide, SC34-2664.
Note: You can also generate and start the postprocessor batch job from the ISPF menu of
RMF.
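For illustration, a minimal postprocessor job can look like the following sketch (job card omitted; the data set name, dates, times, and report selection are placeholders to adapt):
//RMFPP    EXEC PGM=ERBRMFPP
//MFPINPUT DD   DISP=SHR,DSN=RMF.SORTED.SMFDATA
//MFPMSGDS DD   SYSOUT=*
//SYSIN    DD   *
  DATE(11032015,11032015)
  RTOD(0800,1200)
  REPORTS(DEVICE(DASD),CHAN,IOQ,ESS,FCD)
  SUMMARY(INT)
/*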
To get a first impression, you can rank volumes by their I/O intensity, which is the I/O rate
multiplied by Service Time (PEND + DISC + CONN component). Also, look for the largest
component of the response time. Try to identify the bottleneck that causes this problem. Do
not pay too much attention to volumes that have low or no Device Activity Rate, even if they
show high I/O response time. The following sections provide more detailed explanations.
The device activity report accounts for all activity to a base and all of its associated alias
addresses. Activity on alias addresses is not reported separately, but accumulated into the
base address.
The Parallel Access Volume (PAV) value is the number of addresses assigned to a unit
control block (UCB), including the base address and the number of aliases assigned to that
base address.
RMF reports the number of PAV addresses (or in RMF terms, exposures) that are used by a
device. In a HyperPAV environment, the number of PAVs is shown in this format: n.nH. The H
indicates that this volume is supported by HyperPAV. The n.n is a one decimal number that
shows the average number of PAVs assigned to the address during the RMF report period.
Example 14-4 shows that address 7010 has an average of 1.5 PAVs assigned to it during this
RMF period. When a volume has no I/O activity, the PAV is always 1, which means that there
is no alias that is assigned to this base address because in HyperPAV an alias is used or
assigned to a base address only during the period that is required to run an I/O. The alias is
then released and put back into the alias pool after the I/O is completed.
Important: The number of PAVs includes the base address plus the number of aliases
assigned to it. Thus, a PAV=1 means that the base address has no aliases assigned to it.
DEVICE AVG AVG AVG AVG AVG AVG AVG AVG % % % AVG %
STORAGE DEV DEVICE NUMBER VOLUME PAV LCU ACTIVITY RESP IOSQ CMR DB INT PEND DISC CONN DEV DEV DEV NUMBER ANY
GROUP NUM TYPE OF CYL SERIAL RATE TIME TIME DLY DLY DLY TIME TIME TIME CONN UTIL RESV ALLOC ALLOC
7010 33909 10017 ST7010 1.5H 0114 689.000 1.43 .048 .046 .000 .163 .474 .741 33.19 54.41 0.0 2.0 100.0
7011 33909 10017 ST7011 1.5H 0114 728.400 1.40 .092 .046 .000 .163 .521 .628 29.72 54.37 0.0 2.0 100.0
YGTST00 7100 33909 60102 YG7100 1.0H 003B 1.591 12.6 .000 .077 .000 .067 .163 12.0 .413 0.07 1.96 0.0 26.9 100.0
YGTST00 7101 33909 60102 YG7101 1.0H 003B 2.120 6.64 .000 .042 .000 .051 .135 6.27 .232 0.05 1.37 0.0 21.9 100.0
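As a worked example of the I/O intensity ranking that is described earlier, device 7010 has a service time of 0.163 + 0.474 + 0.741 = 1.378 ms (PEND + DISC + CONN), so its I/O intensity is approximately 689.0 × 1.378 ≈ 949. Device 7011 is similar at approximately 728.4 × 1.312 ≈ 956. The devices in the YGTST00 group, despite their much higher response times, have I/O intensities of only approximately 20 and 14, so they are far less interesting candidates for tuning.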
Figure 14-2 illustrates how these components relate to each other and to the common
response and service time definitions.
Before learning about the individual response time components, you should know about the
different service time definitions in simple terms:
I/O service time is the time that is required to fulfill an I/O request after it is dispatched to
the storage hardware. It includes locating and transferring the data and the required
handshaking.
I/O response time is the I/O service time plus the time the I/O request spends in the I/O
queue of the host.
System service time is the I/O response time plus the time it takes to notify the requesting
application of the completion.
Only the I/O service time is directly related to the capabilities of the storage hardware. The
additional components that make up I/O response or system service time are related to host
system capabilities or configuration.
The following sections describe all response time components in more detail. They also
provide possible causes for unusual values.
PEND time
PEND time represents the time that an I/O request waits in the hardware. It can be
increased by the following conditions:
High DS8000 HA utilization:
– An HA can be saturated even if the individual ports have not yet reached their limits.
HA utilization is not directly reported by RMF.
– The Command response (CMR) delay, which is part of PEND, can be an indicator for
high HA utilization. It represents the time that a Start- or Resume Subchannel function
needs until the first command is accepted by the device. It should not exceed a few
hundred microseconds.
– The Enterprise Storage Server report can help you further to find the reason for
increased PEND caused by a DS8000 HA. For more information, see “Enterprise
Storage Server” on page 488.
High FICON Director port utilization:
– Sometimes, high FICON Director port or DS8000 HA port utilization is caused by
overcommitment: multiple FICON channels from different CPCs connect to the same
outbound switch port.
In this case, the FICON channel utilization as seen from the host might be low, but the
combination or sum of the utilization of these channels that share the outbound port
can be significant.
– The FICON Director report can help you isolate the ports that cause increased PEND.
For more information, see 14.1.6, “FICON Director Activity report” on page 470.
Device Busy (DB) delay: Time that an I/O request waits because the device is busy. Today,
mainly because of the Multiple Allegiance feature, DB delay is rare. If it occurs, it is most
likely because the device is RESERVED by another system. Use GRS to avoid hardware
RESERVES, as already indicated in “IOSQ time” on page 466.
SAP impact: Indicates the time the I/O Processor (IOP/SAP) needs to handle the I/O
request. For more information, see 14.1.4, “I/O Queuing Activity report” on page 468.
CONN time
For each I/O operation, the channel subsystem measures the time that storage system,
channel, and CPC are connected for data transmission. CONN depends primarily on the
amount of data that is transferred per I/O. Large I/Os naturally have a higher CONN
component than small ones.
If there is a high level of utilization of resources, time can be spent in contention, rather than
transferring data. Several reasons exist for higher than expected CONN time:
FICON channel saturation. CONN time increases if the channel or BUS utilization is high.
FICON data is transmitted in frames. When multiple I/Os share a channel, frames from an
I/O are interleaved with those from other I/Os, thus elongating the time that it takes to
transfer all of the frames of that I/O. The total of this time, including the transmission time
of the interleaved frames, is counted as CONN time. For details and thresholds, see
“FICON channels” on page 478.
Contention in the FICON Director or at a DS8000 HA port can also affect CONN time,
although these resources primarily affect PEND time.
Note: This measurement is fairly new. It is available since zEC12 with z/OS V1.12 and V1.13
with APAR OA39993, or z/OS 2.1 and later.
In Example 14-4 on page 464, the AVG INT DLY is not displayed for devices 7010 and 7011.
The reason is that these volume records were collected on a z196 host system.
Note: I/O interrupt delay time is not counted as part of the I/O response time.
If the utilization (% IOP BUSY) is unbalanced and certain IOPs are saturated, it can help to
redistribute the channels assigned to the storage systems. An IOP is assigned to handle a
certain set of channel paths. Assigning all of the channels from one IOP to access a busy disk
system can cause a saturation on that particular IOP. For more information, see the hardware
manual of the CPC that you use.
- INITIATIVE QUEUE - ------- IOP UTILIZATION ------- -- % I/O REQUESTS RETRIED -- -------- RETRIES / SSCH ---------
IOP ACTIVITY AVG Q % IOP I/O START INTERRUPT CP DP CU DV CP DP CU DV
RATE LNGTH BUSY RATE RATE ALL BUSY BUSY BUSY BUSY ALL BUSY BUSY BUSY BUSY
00 259.349 0.12 0.84 259.339 300.523 31.1 31.1 0.0 0.0 0.0 0.45 0.45 0.00 0.00 0.00
01 127.068 0.14 100.0 126.618 130.871 50.1 50.1 0.0 0.0 0.0 1.01 1.01 0.00 0.00 0.00
02 45.967 0.10 98.33 45.967 54.555 52.0 52.0 0.0 0.0 0.0 1.08 1.08 0.00 0.00 0.00
03 262.093 1.72 0.62 262.093 279.294 32.9 32.9 0.0 0.0 0.0 0.49 0.49 0.00 0.00 0.00
SYS 694.477 0.73 49.95 694.017 765.243 37.8 37.8 0.0 0.0 0.0 0.61 0.61 0.00 0.00 0.00
In a HyperPAV environment, you can also check the usage of HyperPAV alias addresses.
Example 14-6 shows the LCU section of the I/O Queueing Activity report. It reports on
HyperPAV alias usage in the HPAV MAX column. Here, a maximum of 32 PAV alias
addresses were used for that LCU during the reporting interval. You can compare this value
to the number of aliases that are defined for that LCU. If all are used, you might experience
delays because of a lack of aliases.
This condition is also indicated by the HPAV WAIT value. It is calculated as the ratio of the
number of I/O requests that cannot start because no HyperPAV aliases are available, to the
total number of I/O requests for that LCU. If it is nonzero in a significant number of intervals,
you might consider defining more aliases for this LCU.
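As an informal illustration (not RMF output), the following Python sketch shows how the HPAV WAIT ratio is derived from interval counters and how you might flag an LCU that needs more aliases. The variable names and values are hypothetical.

# Hypothetical interval counters for one LCU (not actual RMF field names)
ios_delayed_no_alias = 1250      # I/O requests that could not start: no HyperPAV alias free
ios_total = 98000                # total I/O requests to the LCU in the interval
hpav_max = 32                    # maximum number of aliases in use during the interval
aliases_defined = 32             # aliases defined for this LCU

hpav_wait = ios_delayed_no_alias / ios_total   # ratio reported as HPAV WAIT

if hpav_wait > 0 and hpav_max >= aliases_defined:
    print(f"HPAV WAIT = {hpav_wait:.3%}: consider defining more aliases for this LCU")
else:
    print(f"HPAV WAIT = {hpav_wait:.3%}: alias count appears sufficient")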
Note: If your HPAV MAX value is constantly below the number of defined alias addresses,
you can consider unassigning some aliases and use these addresses for additional base
devices. Do this only if you are short of device addresses. Monitor HPAV MAX over an
extended period to make sure that you do not miss periods of higher demand for PAV.
Note: We explain this estimation with CHPID 30 in the example in Figure 14-3. The SPEED value is 16 Gbps, which roughly converts to 1600 MBps. TOTAL READ is 50.76 MBps, which is higher than TOTAL WRITE. Therefore, the link utilization is approximately 50.76 divided by 1600, which results in 0.032, or 3.2%.
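The same rule of thumb can be written as a small calculation. This is only a sketch of the estimation in the note; the read value comes from the example, and the write value is assumed.

# Rough FICON link utilization estimate (rule of thumb from the note above)
speed_gbps = 16                    # SPEED value reported for the CHPID
link_mbps = speed_gbps * 100       # 16 Gbps roughly converts to 1600 MBps
total_read_mbps = 50.76            # TOTAL READ, the larger direction in this example
total_write_mbps = 31.20           # hypothetical TOTAL WRITE value, smaller than read

utilization = max(total_read_mbps, total_write_mbps) / link_mbps
print(f"Approximate link utilization: {utilization:.1%}")   # about 3.2%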
Exceeding the thresholds significantly causes frame pacing, which eventually leads to higher than necessary CONNECT times. If this happens only for a few intervals, it is most likely not a problem.
For small block transfers, the BUS utilization is less than the FICON channel utilization. For
large block transfers, the BUS utilization is greater than the FICON channel utilization.
The Generation (G) field in the channel report shows the combination of the FICON channel
generation that is installed and the speed of the FICON channel link for this CHPID at the time
of the machine start. The G field does not include any information about the link between the
director and the DS8000 storage system.
The link between the director and the DS8000 storage system can run at 1, 2, 4, 8, or
16 Gbps.
If the channel is point-to-point connected to the DS8000 HA port, the G field indicates the
speed that was negotiated between the FICON channel and the DS8000 port. With
z/OS V2.1 and later, a SPEED column was added that indicates the actual channel path
speed at the end of the interval.
The RATE field in the FICON OPERATIONS or zHPF OPERATIONS columns indicates the number of FICON or zHPF I/Os per second that are initiated at the physical channel level. It is not broken down by LPAR.
Note: To get the data that is related to the FICON Director Activity report, the CUP device
must be online on the gathering z/OS system.
The measurements that are provided are on a director port level. They represent the total I/O passing through this port and are not broken down by LPAR or device.
The important performance metric is AVG FRAME PACING. This metric shows the average
time (in microseconds) that a frame waited before it was transmitted. The higher the
contention on the director port, the higher the average frame pacing metric. High frame
pacing negatively influences the CONNECT time.
PORT ---------CONNECTION-------- AVG FRAME AVG FRAME SIZE PORT BANDWIDTH (MB/SEC) ERROR
ADDR UNIT ID SERIAL NUMBER PACING READ WRITE -- READ -- -- WRITE -- COUNT
05 CHP FA 0000000ABC11 0 808 285 50.04 10.50 0
07 CHP 4A 0000000ABC11 0 149 964 20.55 5.01 0
09 CHP FC 0000000ABC11 0 558 1424 50.07 10.53 0
0B CHP-H F4 0000000ABC12 0 872 896 50.00 10.56 0
12 CHP D5 0000000ABC11 0 73 574 20.51 5.07 0
13 CHP C8 0000000ABC11 0 868 1134 70.52 2.08 1
14 SWITCH ---- 0ABCDEFGHIJK 0 962 287 50.03 10.59 0
15 CU C800 0000000XYG11 0 1188 731 20.54 5.00 0
Note: Cache reports by LCU calculate the total activities of volumes that are online.
------------------------------------------------------------------------------------------------------------------------------------
CACHE SUBSYSTEM OVERVIEW
------------------------------------------------------------------------------------------------------------------------------------
TOTAL I/O 19976 CACHE I/O 19976 CACHE OFFLINE 0
TOTAL H/R 0.804 CACHE H/R 0.804
CACHE I/O -------------READ I/O REQUESTS------------- ----------------------WRITE I/O REQUESTS---------------------- %
REQUESTS COUNT RATE HITS RATE H/R COUNT RATE FAST RATE HITS RATE H/R READ
NORMAL 14903 252.6 10984 186.2 0.737 5021 85.1 5021 85.1 5021 85.1 1.000 74.8
SEQUENTIAL 0 0.0 0 0.0 N/A 52 0.9 52 0.9 52 0.9 1.000 0.0
CFW DATA 0 0.0 0 0.0 N/A 0 0.0 0 0.0 0 0.0 N/A N/A
TOTAL 14903 252.6 10984 186.2 0.737 5073 86.0 5073 86.0 5073 86.0 1.000 74.6
-----------------------CACHE MISSES----------------------- ------------MISC------------ ------NON-CACHE I/O-----
REQUESTS READ RATE WRITE RATE TRACKS RATE COUNT RATE COUNT RATE
DFW BYPASS 0 0.0 ICL 0 0.0
NORMAL 3919 66.4 0 0.0 3921 66.5 CFW BYPASS 0 0.0 BYPASS 0 0.0
SEQUENTIAL 0 0.0 0 0.0 0 0.0 DFW INHIBIT 0 0.0 TOTAL 0 0.0
CFW DATA 0 0.0 0 0.0 ASYNC (TRKS) 3947 66.9
TOTAL 3919 RATE 66.4
---CKD STATISTICS--- ---RECORD CACHING--- ----HOST ADAPTER ACTIVITY--- --------DISK ACTIVITY-------
BYTES BYTES RESP BYTES BYTES
WRITE 0 READ MISSES 0 /REQ /SEC TIME /REQ /SEC
WRITE HITS 0 WRITE PROM 3456 READ 6.1K 1.5M READ 6.772 53.8K 3.6M
WRITE 5.7K 491.0K WRITE 12.990 6.8K 455.4K
The report shows the I/O requests by read and by write. It shows the rate, the hit rate, and the
hit ratio of the read and the write activities. The read-to-write ratio is also calculated.
In this report, you can check the value of the read hit ratio. Low read hit ratios contribute to higher DISC time. For a cache-friendly workload, you see a read hit ratio of better than 90%. The write hit ratio is usually 100%.
High DASD Fast Write (DFW) Bypass is an indication that persistent memory or NVS is overcommitted. DFW BYPASS means that write I/Os cannot be completed because persistent memory is full, and they must be retried. If the DFW Bypass rate is higher than 1%, the write retry operations can affect the DISC time. It is an indication of insufficient back-end resources because write cache destaging operations are not fast enough.
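A quick way to check the DFW Bypass rate against the 1% guideline is to relate the DFW BYPASS count to the total write requests of the same interval. The following sketch illustrates this check with hypothetical values.

# Hypothetical values taken from one CACHE SUBSYSTEM ACTIVITY interval
dfw_bypass_count = 120        # DFW BYPASS count (writes retried because NVS was full)
total_write_requests = 5073   # TOTAL write I/O requests in the same interval

dfw_bypass_pct = 100.0 * dfw_bypass_count / total_write_requests
if dfw_bypass_pct > 1.0:
    print(f"DFW Bypass {dfw_bypass_pct:.2f}%: NVS overcommitted, check rank utilization")
else:
    print(f"DFW Bypass {dfw_bypass_pct:.2f}%: within the usual guideline")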
Note: In cases with high DFW Bypass Rate, you usually see high rank utilization in the
DS8000 rank statistics, which are described in 14.1.8, “Enterprise Disk Systems report” on
page 473.
The DISK ACTIVITY part of the report can give you a rough indication of the back-end
performance. The read response time can be in the order of 10 - 20 ms if you have mostly
HDDs in the back end, and lower if SSDs and HPFE resources are used. The write response
time can be higher by a factor of two. Do not overrate this information, and check the ESS
report (see 14.1.8, “Enterprise Disk Systems report” on page 473), which provides much
more detail.
The report also shows the number of sequential I/Os in the SEQUENTIAL row and random I/Os in the NORMAL row, for both read and write operations. These metrics can also help you analyze and pinpoint I/O bottlenecks.
Example 14-9 is the second part of the CACHE SUBSYSTEM ACTIVITY report, providing
measurements for each volume in the LCU. You can also see to which extent pool each
volume belongs.
------------------------------------------------------------------------------------------------------------------------------------
CACHE SUBSYSTEM DEVICE OVERVIEW
------------------------------------------------------------------------------------------------------------------------------------
VOLUME DEV XTNT % I/O ---CACHE HIT RATE-- ----------DASD I/O RATE---------- ASYNC TOTAL READ WRITE %
SERIAL NUM POOL I/O RATE READ DFW CFW STAGE DFWBP ICL BYP OTHER RATE H/R H/R H/R READ
*ALL 100.0 338.6 186.2 86.0 0.0 66.4 0.0 0.0 0.0 0.0 66.9 0.804 0.737 1.000 74.6
*CACHE-OFF 0.0 0.0
*CACHE 100.0 338.6 186.2 86.0 0.0 66.4 0.0 0.0 0.0 0.0 66.9 0.804 0.737 1.000 74.6
PR7000 7000 0000 22.3 75.5 42.8 19.2 0.0 13.5 0.0 0.0 0.0 0.0 14.4 0.821 0.760 1.000 74.6
PR7001 7001 0000 11.5 38.8 20.9 10.5 0.0 7.5 0.0 0.0 0.0 0.0 7.6 0.807 0.736 1.000 73.1
PR7002 7002 0000 11.1 37.5 20.4 9.5 0.0 7.6 0.0 0.0 0.0 0.0 7.0 0.797 0.729 1.000 74.7
PR7003 7003 0000 11.3 38.3 22.0 8.9 0.0 7.4 0.0 0.0 0.0 0.0 6.8 0.806 0.747 1.000 76.8
PR7004 7004 0000 3.6 12.0 6.8 3.0 0.0 2.3 0.0 0.0 0.0 0.0 2.6 0.810 0.747 1.000 75.2
PR7005 7005 0000 3.7 12.4 6.8 3.2 0.0 2.4 0.0 0.0 0.0 0.0 2.7 0.808 0.741 1.000 74.1
PR7006 7006 0000 3.8 12.8 6.5 3.6 0.0 2.6 0.0 0.0 0.0 0.0 3.1 0.796 0.714 1.000 71.5
PR7007 7007 0000 3.6 12.3 6.9 3.1 0.0 2.4 0.0 0.0 0.0 0.0 2.5 0.806 0.742 1.000 75.2
PR7008 7008 0000 3.6 12.2 6.7 3.4 0.0 2.2 0.0 0.0 0.0 0.0 2.7 0.821 0.753 1.000 72.5
PR7009 7009 0000 3.6 12.2 6.8 2.9 0.0 2.5 0.0 0.0 0.0 0.0 2.3 0.796 0.732 1.000 76.4
If you specify REPORTS(CACHE(DEVICE)) when running the cache report, you get the complete
report for each volume, as shown in Example 14-10 on page 473. You get detailed cache
statistics of each volume. By specifying REPORTS(CACHE(SSID(nnnn))), you can limit this
report to only certain LCUs.
Important: Enterprise Storage Server data is not gathered by default. Make sure that Enterprise Storage Server data is collected and processed, as described in 14.1.1, “RMF Overview” on page 460.
DS8000 rank
ESS RANK STATISTICS provides measurements of back-end activity on the extent pool and
RAID array (rank) levels, such as OPS/SEC, BYTES/OP, BYTES/SEC, and RTIME/OP, for
read and write operations.
Example 14-11 shows rank statistics for a system with multi-rank extent pools, which contain
HDD resources and SSD or HPFE arrays (hybrid pools).
0000 CKD 1Gb 0000 0000 0.0 0.0 0.0 16.0 0.0 0.0 0.0 96.0 1 6 15 1800G RAID 5
0004 0000 0.0 65.5K 72.8 0.0 0.0 1.3M 2.8K 100.0 1 7 15 2100G RAID 5
0010 000A 190.0 57.2K 10.9M 2.2 8.0 1.1M 8.9M 9.3 Y 1 6 N/A 2400G RAID 5
0012 000A 180.6 57.3K 10.3M 2.3 8.3 1.1M 9.0M 9.5 Y 1 6 N/A 2400G RAID 5
POOL 370.6 57.2K 21.2M 2.2 16.3 1.1M 17.9M 9.4 Y 4 25 0 8700G RAID 5
POOL 165.8 57.3K 9.5M 2.4 9.5 1.1M 10.1M 7.7 Y 4 25 0 8700G RAID 5
Note: Starting with z/OS V2.2 and DS8000 LMC R7.5, this report also shows the
relationship of ranks to DA pairs (ADAPT ID). The MIN RPM value for SSD ranks was also
changed. It used to be 65 and is now N/A.
If I/O response elongation and rank saturation are suspected, it is important to check IOPS (OPS/SEC) and throughput (BYTES/SEC) for both read and write rank activities. Also, determine what kinds of workloads were running and in what proportions. If your workload is mostly random, IOPS is the significant figure. If it is more sequential, throughput is more significant.
The rank response times (RTIME) can be indicators for saturation. In a balanced and sized
system with growth potential, the read response times are in the order of 1 - 2 ms for SSD or
HPFE ranks and 10 - 15 ms for enterprise class SAS ranks. Ranks with response times
reaching the range of 3 - 5 ms for SSD and HPFE or 20 - 30 ms for enterprise SAS are
approaching saturation.
The write response times for SSD and HPFE should be in the same order as reads. For HDDs, they can be about twice as high as for reads. Ranks based on NL-SAS drives have higher response times, especially for write operations. They should not be used for performance-sensitive workloads.
Note: IBM Spectrum Control performance monitoring also provides a calculated rank
utilization value. For more information, see 7.2.1, “IBM Spectrum Control overview” on
page 223.
For a definition of the HA port ID (SAID), see IBM DS8880 Architecture and Implementation
(Release 8), SG24-8323.
The I/O INTENSITY is the product of the operations per second and the response time per operation. For FICON ports, it is calculated for both the read and write operations, and for PPRC ports, it is calculated for both the send and receive operations. The total I/O intensity is the sum of those two numbers on each port.
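The following sketch shows how the total I/O intensity of a FICON port can be derived from the per-operation values just described. The input numbers are hypothetical.

# I/O intensity = operations per second x response time per operation,
# calculated for read and write (or send and receive for PPRC ports)
read_ops_per_sec = 850.0
read_rtime_ms = 0.4        # response time per read operation in milliseconds
write_ops_per_sec = 310.0
write_rtime_ms = 0.9       # response time per write operation in milliseconds

io_intensity = (read_ops_per_sec * read_rtime_ms) + (write_ops_per_sec * write_rtime_ms)
print(f"Total I/O intensity for this port: {io_intensity:.0f}")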
Note: IBM Spectrum Control performance monitoring also provides a calculated port
utilization value. For more information, see 7.2.1, “IBM Spectrum Control overview” on
page 223.
If you must monitor or analyze statistics for certain ports over time, you can generate an overview report and filter for certain port IDs. Provide postprocessor control statements like the ones in Example 14-13, and you get an overview report, as shown in Example 14-14.
A z Systems workload is too complex to be estimated and described with only a few numbers
and general guidelines. Furthermore, the capabilities of the storage systems are not just the
sum of the capabilities of their components. Advanced technologies, such as Easy Tier, can
influence the throughput of the complete solution positively, and other functions, such as
point-in-time copies or remote replication, add additional workload.
Most of these factors can be accounted for by modeling the new solution with Disk Magic. For
z Systems, the modeling is based on RMF data and real workload characteristics. You can
compare the current to potential new configurations and consider growth (capacity and
workload), and the influence of Easy Tier and Copy Services.
For more information about Disk Magic, see 6.1, “Disk Magic” on page 160, which describes the following items:
How to get access to the tool or someone who can use it
The capabilities and limitations
The data that is required for a proper model
Note: It is important, particularly for mainframe workloads, which often are I/O response
time sensitive, to observe the thresholds and limitations provided by the Disk Magic model.
Easy Tier
Easy Tier is a performance enhancement to the DS8000 family of storage systems that helps
to avoid issues that might occur if the back-end workload is not balanced across all the
available resources. For a description of Easy Tier, its capabilities, and how it works, see
1.3.4, “Easy Tier” on page 11.
There are some IBM Redbooks publications that you can refer to if you need more information
about Easy Tier:
IBM DS8000 Easy Tier, REDP-4667
DS8870 Easy Tier Application, REDP-5014
IBM DS8870 Easy Tier Heat Map Transfer, REDP-5015
The predominant feature of Easy Tier is to manage data distribution in hybrid pools. It is
important to know how much of each resource type is required to optimize performance while
keeping the cost as low as possible. To determine a reasonable ratio between fast, but
expensive, and slower, but more cost-effective, resources, you can use the skew value of your
workload. Workload skew is a number that indicates how your active and frequently used data
is distributed across the total space:
Workloads with high skew access data mainly from a small portion of the available storage
space.
Workloads with low skew access data evenly from all or a large part of the available
storage space.
With a high skew value, you gain the most from a small amount of fast resources. You can determine the skew value of your workload by using the Storage Tier Advisor Tool (STAT) that is provided with the DS8000 storage systems. For more information, see 6.5, “Storage Tier Advisor Tool” on page 213.
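As a simplified illustration of what a skew value expresses, the following sketch estimates which fraction of the capacity carries most of the workload. This is not the STAT algorithm, only a way to visualize high versus low skew under assumed per-extent access counts.

# Hypothetical per-extent I/O counts, not STAT output
io_per_extent = sorted([1200, 950, 800, 40, 30, 25, 20, 15, 10, 10], reverse=True)

total_io = sum(io_per_extent)
cumulative, extents_needed = 0, 0
for io in io_per_extent:
    cumulative += io
    extents_needed += 1
    if cumulative >= 0.8 * total_io:        # capacity fraction serving 80% of the I/O
        break

fraction = extents_needed / len(io_per_extent)
print(f"{fraction:.0%} of the capacity serves 80% of the I/O (the smaller, the higher the skew)")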
Easy Tier is designed to provide a balanced and stable workload distribution. It aims at
minimizing the amount of data that must be moved as part of the optimization. To achieve this
task, it monitors data access permanently and establishes migration plans based on current
and past patterns. It will not immediately start moving data as soon as it is accessed more
frequently than before. There is a certain learning phase. You can use Easy Tier Application
to overrule learning if necessary.
FICON channels
The Disk Magic model that was established for your configuration indicates the number of
host channels, DS8000 HAs, and HA FICON ports that are necessary to meet the
performance requirements. The modeling results are valid only if the workload is evenly
distributed across all available resources. It is your responsibility to make sure this really is
the case.
In addition, there are certain conditions and constraints that you must consider:
IBM mainframe systems support a maximum of eight channel paths to each logical control
unit (LCU). A channel path is a connection from a host channel to a DS8000 HA port.
A FICON channel port can be shared between several z/OS images and can access
several DS8000 HA ports.
The following sections provide some examples that can help to select the best connection
topology.
The simplest case is that you must connect one host to one storage system. If the Disk Magic model indicates that you need eight or fewer host channels and DS8000 host ports, you can use a one-to-one connection scheme, as shown in Figure 14-4.
Figure 14-4 One-to-one connections from z Systems host FICON channel ports through the FICON fabric to DS8000 FICON HA ports
To simplify the figure, it shows only one fabric “cloud”, where you normally have at least two for redundancy reasons. The lines that directly connect one port to another in the figure represent any route through the fabric. It can range from a direct connection without any switch components to a cascaded FICON configuration.
Figure 14-5 Split LCUs into groups and assign them to connections
With such a split, you can have up to eight connections, host ports, and storage ports in each group. It is your responsibility to define the LCU split in a way that each group gets the amount of workload that the assigned number of connections can sustain.
Note: Disk Magic models can be created on several levels. To determine the best LCU
split, you might need LCU level modeling.
In many environments, more than one host system shares the data and accesses the same storage system. If eight or fewer storage ports are sufficient according to your Disk Magic model, you can implement a configuration as shown in Figure 14-6, where the storage ports are shared between the host ports.
Figure 14-6 Several hosts accessing the same storage system - sharing all ports
Figure 14-7 Several hosts accessing the same storage system - distributed ports
As you split storage system resources between the host systems, you must make sure that
you assign a sufficient number of ports to each host to sustain its workload.
Figure 14-8 shows another variation, where more than one storage system is connected to a
host. In this example, all storage system ports share the host ports. This works well if the Disk
Magic model does not indicate that more than eight host ports are required for the workload.
Figure 14-8 More than one storage system connected to a host, with all DS8000 FICON HA ports sharing the host FICON channel ports
Another way of splitting HA ports is by LCU, in a similar fashion as shown in Figure 14-5 on
page 480. As in the earlier examples with split resources, you must make sure that you
provide enough host ports for each storage system to sustain the workload to which it is
subject.
Mixed workload: A DS8000 HA has several ports. You can set each individual port to run
FICON or FCP topology. There is nothing wrong with using an HA for FICON, FCP, and
remote replication connectivity concurrently. However, if you need the highest possible
throughput or lowest possible response time for a given topology, consider isolating this
topology on separate HAs.
Remote replication: To optimize the HA throughput if you have remote replication active, consider not sharing HA ports between the following workloads:
Synchronous and asynchronous remote replication
FCP host I/O and remote replication
At the time of writing, zHPF is used by most access methods and all DB2 workloads use it.
For a more detailed description and how to enable or disable the feature, see IBM DS8870
and IBM z Systems Synergy, REDP-5186.
zHPF Extended Distance II: With DS8000 R7.5 and IBM z Systems z13, zHPF was
further improved to deliver better response times for long write operations, as used, for
example, by DB2 utilities, at greater distances. It reduces the required round trips on the
FICON link. This benefits environments that are IBM HyperSwap® enabled and where the
auxiliary storage system is further away.
For more information about the performance implications of host connectivity, see 8.3,
“Attaching z Systems hosts” on page 276 and 4.10.1, “I/O port planning considerations” on
page 131.
Note: The DS8000 architecture is symmetrical, based on the two CPCs. Many resources,
like cache, device adapters (DAs), and RAID arrays, become associated to a CPC. When
defining a logical configuration, you assign each array to one of the CPCs. Make sure to
spread them evenly, not only by count, but also by their capabilities. The I/O workload must
also be distributed across the CPCs as evenly as possible.
The preferred way to achieve this situation is to create a symmetrical logical configuration.
A fundamental question that you must consider is whether there is any special workload that
must be isolated, either because it has high performance needs or because it is of low
importance and should never influence other workloads.
The major groups of resources in a DS8000 storage system are the storage resources, such
as RAID arrays and DAs, and the connectivity resources, such as HAs. The traditional
approach in mainframe environments to assign storage resources is to divide them into the smallest possible entities (single-rank extent pools) and distribute the workload, either manually or managed by the Workload Manager (WLM) and System Managed Storage (SMS), as evenly as possible. This approach still has its advantages:
RMF data provides granular results, which can be linked directly to a resource.
If you detect a resource contention, you can use host system tools to fix it, for example, by
moving a data set to a different volume in a different pool.
It is easy to detect applications or workloads that cause contention on a resource.
Isolation of critical applications is easy.
However, this approach comes with some significant disadvantages, especially with modern storage systems that support automated tiering and autonomous balancing of resources:
Statically assigned resources might be over- or underutilized for various reasons:
– Monitoring is infrequent and only based on events or issues.
– Too many or too few resources are allocated to certain workloads.
– Workloads change without resources being adapted.
All rebalancing actions can be performed only on a volume level.
Modern automatic workload balancing methods cannot be used:
– Storage pool striping.
– Easy Tier automatic tiering.
– Easy Tier intra-tier rebalancing.
Note: If you plan to share your storage resources to a large extent but still want to make
sure that certain applications have priority over others, consider using the DS8000 I/O
Priority Manager (IOPM) feature, which is described in “I/O Priority Manager” on page 485.
For the host connectivity resources (HAs and FICON links), similar considerations apply. You
can share FICON connections by defining them equally for all accessed LCUs in the
z Systems I/O definitions. That way, the z Systems I/O subsystem takes care of balancing the
load over all available connections. If there is a need to isolate a certain workload, you can
define specific paths for their LCUs and volumes.
Volume sizes
The DS8000 storage system now supports CKD logical volumes of any size from 1 to 1,182,006 cylinders, which is 1062 times the capacity of a 3390-1 (1113 cylinders).
Note: The DS8000 storage system allocates storage with a granularity of one extent, which is the equivalent of 1113 cylinders. Therefore, selecting capacities that are multiples of this value is most effective.
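Because allocation is done in whole extents, a requested capacity that is not a multiple of 1113 cylinders leaves part of the last extent unused. The following sketch, assuming the 1113-cylinder CKD extent granularity stated in the note, shows the effect.

CKD_EXTENT_CYLINDERS = 1113   # CKD extent granularity stated in the note above

def extents_needed(requested_cylinders: int) -> int:
    # Round up to whole extents, because the DS8000 allocates full extents only
    return -(-requested_cylinders // CKD_EXTENT_CYLINDERS)

for cyls in (30000, 30051):   # 30051 = 27 x 1113, an exact multiple
    used = extents_needed(cyls) * CKD_EXTENT_CYLINDERS
    print(f"Requested {cyls} cyl -> {extents_needed(cyls)} extents, "
          f"{used - cyls} cylinders unused")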
A key factor to consider when planning the CKD volume configuration and sizes is the limited number of devices that a z/OS system can address within one Subchannel Set (65,535). You must define volumes with enough capacity to satisfy your storage requirements within this supported address range, including room for PAV aliases and future growth.
Apart from saving device addresses, using large volumes brings additional benefits:
Simplified storage administration
Reduced number of X37 abends and allocation failures because of larger pools of free
space
Reduced number of multivolume data sets to manage
One large volume performs the same as though you allocated the same capacity in several
smaller ones, if you use the DS8000 built-in features to distribute and balance the workload
across resources. There are two factors to consider to avoid potential I/O bottlenecks when
using large volumes:
Use HyperPAV to reduce IOSQ.
With equal I/O density (I/Os per GB), the larger a volume, the more I/Os it receives (see the sketch after this list). To avoid excessive queuing, the use of PAV is of key importance. With HyperPAV, you can reduce the total number of alias addresses because they are assigned automatically as needed. For more information about the performance implications of PAV, see “Parallel Access Volumes” on page 485.
Eliminate unnecessary reserves.
As the volume sizes grow larger, more data on a single CKD device address will be
accessed in parallel. There is a danger of performance bottlenecks when there are
frequent activities that reserve an entire volume or its VTOC/VVDS.
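As a rough illustration of the I/O density argument in the first item of this list, the following sketch compares how many I/Os per second a single large volume receives versus several small volumes holding the same data. The numbers are assumed.

# Assumed workload: constant I/O density across the data
io_density = 0.5            # I/Os per second per GB
total_capacity_gb = 4000    # same data, either as one volume or as several

one_large_volume_iops = io_density * total_capacity_gb
small_volume_iops = io_density * (total_capacity_gb / 8)   # split across 8 small volumes

print(f"One large volume: {one_large_volume_iops:.0f} IOPS on a single base address")
print(f"Each of 8 small volumes: {small_volume_iops:.0f} IOPS")
# The large volume therefore needs PAV aliases to avoid IOSQ queuing on its single UCB.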
PAV is implemented by defining alias addresses to the conventional base address. The alias
address provides the mechanism for z/OS to initiate parallel I/O to a volume. An alias is
another address/UCB that can be used to access the volume that is defined on the base
address. An alias can be associated with a base address that is defined in the same LCU
only. The maximum number of addresses that you can define in an LCU is 256. Theoretically,
you can define one base address, plus 255 aliases in an LCU.
Aliases are initially defined to be associated to a certain base address. In a traditional static
PAV environment, the alias is always associated to the same base address, which requires
many aliases and manual tuning.
In dynamic PAV or HyperPAV environments, an alias can be reassigned to any base address as your needs dictate. Therefore, you need fewer aliases and no manual tuning.
With dynamic PAV, the z/OS WLM takes care of the alias assignment. It determines the need for additional aliases at fixed time intervals, and therefore adapts to workload changes rather slowly.
The more modern approach of HyperPAV assigns aliases in real time, based on outstanding I/O requests to a volume. The function is performed by the I/O subsystem together with the storage system. HyperPAV reacts immediately to changes. With HyperPAV, you achieve better
average response times and higher total throughput. Today, there is no technical reason
anymore to use either static or dynamic PAV.
You can check the usage of alias addresses by using RMF data. Example 14-6 on page 468
shows the I/O queuing report for an LCU, which includes the maximum number of aliases that
are used in the sample period. You can use such reports to determine whether you assigned
enough alias addresses for an LCU.
Number of aliases: With HyperPAV, you need fewer aliases than with the older PAV
algorithms. Assigning 32 aliases per LCU is a good starting point for most workloads. It is a
preferred practice to leave a certain number of device addresses in an LCU initially
unassigned in case it turns out that a higher number of aliases is required.
IBM zHyperWrite
IBM zHyperWrite™ is a technology that is provided by the DS8870 storage system, and used
by z/OS (DFSMS) and DB2 to accelerate DB2 log writes in HyperSwap enabled Metro Mirror
environments.
When an application sends a write I/O request to a volume that is in synchronous data replication, the response time is increased by the latency that is caused by the distance and by the replication itself. Although the DS8000 PPRC algorithm is the most effective synchronous replication available, there still is some replication impact because the write to the primary and the transmission of the data to the secondary must happen one after the other.
An application that uses zHyperWrite can avoid the replication impact for certain write operations. An I/O that is flagged accordingly is not replicated by PPRC, but written to the primary and secondary simultaneously by the host itself. The application, DFSMS, the I/O subsystem, and the DS8000 storage system closely coordinate the process to maintain data consistency. The feature is most effective for the following situations:
Small writes, where the data transfer time is short.
Short distances, where the effect of the latency is not significant.
Note: At the time of writing, only DB2 uses zHyperWrite for log writes.
This is only an introduction. It cannot replace a thorough analysis by IBM Technical Support in
more complex situations or if there are product issues.
Next, gather performance data from the system. For I/O related investigations, use RMF data. For a description about how to collect, process, and interpret RMF data, see 14.1, “DS8000 performance monitoring with RMF” on page 460. There might be a large amount of data to analyze. The faster you can isolate the issue up front, both from a time and a device point of view, the more selective your RMF analysis can be.
There are other tools or methods that you can use to gather performance data for a DS8000
storage system. They most likely are of limited value in a mainframe environment because
they do not take the host system into account. However, they can be useful in situations
where RMF data does not cover the complete configuration, for example:
Mixed environments (mainframe, open, and IBM i)
Special copy services configurations, such as Global Mirror secondary
For more information about these other tools, see 14.1.9, “Alternatives and supplements to
RMF” on page 475.
To match physical resources to logical devices, you also need the exact logical configuration
of the affected DS8000 storage systems, and the I/O definition of the host systems you are
analyzing.
The following sections point you to some key metrics in the RMF reports, which might help
you isolate the cause of a performance issue.
Attention: Because the summary report provides a high-level overview, there might be
issues with individual components that are not directly visible here.
After you isolate a certain time and a set of volumes that are conspicuous, you can analyze
further. Discover which of the response time components are higher than usual. A description
of these components and why they can be increased is provided in 14.1.3, “I/O response time
components” on page 465.
If you also need this information on a Sysplex scope, create the Shared Direct Access Device
Activity report. It provides a similar set of measurements for each volume by LPAR and also
summarized for the complete Sysplex.
Important: Devices with no or almost no activity should not be considered. Their response
time values are not relevant and might be inaccurate.
This rule can also be applied to all other reports and measurements.
You specifically might want to check the volumes you identified in the previous section and
see whether they have the following characteristics:
They have a high write ratio.
They show a DASD Fast Write Bypass rate greater than 0, which indicates that you are
running into an NVS Full condition.
Use the Enterprise Storage Server Link Statistics to analyze the throughput of the DS8000
HA ports. Pay particular attention to those that have higher response time than others. Also,
use the I/O intensity value to determine whether a link might be close to its limitations. All HA
ports are listed here. You can also analyze remote replication and Open Systems workload.
Attention: Different types of ranks have different capabilities. HPFE or SSD ranks can
sustain much higher IOPS with much lower response time than any HDD rank. Within the
different HDD types, 15 K RPM enterprise (ENT) SAS drives perform better than 10 K
RPM ENT SAS drives. Nearline (NL) SAS drives have the worst performance. The RAID
type (RAID 5, 6, or 10) also affects the capabilities of a rank. Keep this in mind when
comparing back-end performance values. The drive type and RAID level are indicated in
the report.
The second part of the report provides queuing details on an LCU and channel path level. You can see whether I/Os are delayed on their way to the device. Check for LCUs and paths that have the following features:
Higher average control unit busy delay (AVG CUB DLY), which can mean that devices are
in use or reserved by another system.
Higher average command response delay (AVG CMR DLY), which can indicate a saturation
of certain DS8000 resources, such as HA, internal bus, or processor.
Nonzero HPAV wait times and HPAV max values in the order of the number of defined
alias addresses, which can indicate that the number of alias addresses is not sufficient.
In many cases, your analysis shows that one or more resources either on the host system or
DS8000 storage system are saturated or overloaded. The first thing to check is whether your
storage system is configured to use all available features that improve performance or
automate resource balancing.
If these features do not solve the issue, consider the following actions:
Distribute the workload further over additional existing resources with less utilization.
Add more resources of the same type, if there is room.
Exchange the existing, saturated resources for different ones (other or newer technology)
with higher capabilities.
If you isolated applications (for example, by using their own set of SSD ranks) but still
experience poor response times, check the following items:
Are the dedicated resources saturated? If yes, you can add more resources, or consider
switching to a shared resource model.
Is the application doing something that the dedicated resources are not suited for (for
example, mostly sequential read and write operations on SSD ranks)? If yes, consider
changing the resource type, or again, switching to a shared model.
Does the contention come from other resources that are not dedicated, such as HA ports
in our example with dedicated SSD ranks? Here, you can consider increasing the isolation
by dedicating host ports to the application as well.
If you are running in a resource sharing model and find that your overall I/O performance is
good, but there is one critical application that suffers from poor response times, you can
consider moving to an isolation model, and dedicate certain resources to this application. If
the issue is limited to the back end, another solution might be to use advanced functions:
IOPM to prioritize the critical application
Easy Tier Application to manually assign certain data to a specific storage tier
If you cannot identify a saturated resource, but still have an application that experiences
insufficient throughput, it might not use the I/O stack optimally. For example, modern storage
systems can process many I/Os in parallel. If an application does not use this capability and
serializes all I/Os, it might not get the required throughput, although the response times of
individual I/Os are good.
For more information about SAN Volume Controller, see Implementing the IBM System
Storage SAN Volume Controller V7.4, SG24-7933.
The Spectrum Virtualize solution is designed to reduce both the complexity and costs of
managing your SAN-based storage. With the SAN Volume Controller, you can perform these
tasks:
Simplify management and increase administrator productivity by consolidating storage
management intelligence from disparate storage controllers into a single view, including
non-IBM storage.
Improve application availability by enabling data migration between disparate disk storage
devices non-disruptively.
Improve disaster recovery and business continuance needs by applying and managing
Copy Services across disparate disk storage devices within the SAN.
Provide advanced features and functions to the entire SAN:
– Large scalable cache
– Advanced Copy Services
– Space management
– Mapping based on wanted performance characteristics
– Quality of service (QoS) metering and reporting
The SAN Volume Controller provides the DS8000 storage system with the following additional options:
iSCSI or FCoE attachment
IBM Real-time Compression™ (RtC) of volumes by using SAN Volume Controller built-in compression accelerator cards and software
For more information about the SAN Volume Controller, see the IBM Knowledge Center:
http://www.ibm.com/support/knowledgecenter/STPVGU_7.6.0/com.ibm.storage.svc.console.760.doc/svc_ichome_760.html
The SAN must be zoned in such a way that the application servers cannot see the back-end
storage, preventing the SAN Volume Controller and the application servers from both trying to
manage the back-end storage.
The SAN Volume Controller I/O Groups are connected to the SAN in such a way that all
back-end storage and all application servers are visible to all of the I/O Groups. The SAN
Volume Controller I/O Groups see the storage presented to the SAN by the back-end
controllers as a number of disks, which are known as MDisks. MDisks are collected into one
or several groups, which are known as Managed Disk Groups (MDGs), or storage pools.
When an MDisk is assigned to a storage pool, the MDisk is divided into a number of extents
(the extent minimum size is 16 MiB and the extent maximum size is 8 GiB). The extents are
numbered sequentially from the start to the end of each MDisk.
The storage pool provides the capacity in the form of extents, which are used to create
volumes, also known as virtual disks (VDisks).
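As an illustration of the extent mechanism described above, the following sketch computes how many extents an MDisk contributes to a storage pool for a given extent size. The MDisk size is assumed.

# Assumed example: a 2 TiB MDisk in pools with different extent sizes
MIB_PER_TIB = 1024 * 1024
mdisk_size_mib = 2 * MIB_PER_TIB          # 2 TiB MDisk

for extent_size_mib in (16, 256, 1024, 8192):   # allowed range is 16 MiB to 8 GiB
    extents = mdisk_size_mib // extent_size_mib
    print(f"Extent size {extent_size_mib:>5} MiB -> {extents:>7} extents per MDisk")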
When creating SAN Volume Controller volumes or VDisks, the default option of striped
allocation is normally the preferred choice. This option helps balance I/Os across all the
MDisks in a storage pool, which optimizes overall performance and helps reduce hot spots.
Conceptually, this method is represented in Figure 15-1.
Figure 15-1 Extents being used to create virtual disks
The virtualization function in the SAN Volume Controller maps the volumes that are seen by
the application servers onto the MDisks provided by the back-end controllers. I/O traffic for a
particular volume is, at any one time, handled exclusively by the nodes in a single I/O Group.
Thus, although a cluster can have many nodes within it, the nodes handle I/O in independent
pairs, which means that the I/O capability of the SAN Volume Controller scales well (almost
linearly) because additional throughput can be obtained by adding additional I/O Groups.
Figure 15-2 summarizes the various relationships that bridge the physical disks through to the
VDisks within the SAN Volume Controller architecture.
Figure 15-2 Relationships from physical disks and RAID arrays through SCSI LUNs and MDisks (grouped into Managed Disk Groups, or storage pools) to the virtual disks that are managed by the SAN Volume Controller (2145) virtualization engine
Because most operating systems have only basic detection or management of multiple paths to a single physical device, IBM provides a multipathing device driver. The multipathing driver that is supported by the SAN Volume Controller is the IBM Subsystem Device Driver (SDD). SDD groups all available paths to a VDisk and presents them as a single device to the operating system. SDD performs all the path handling and selects the active I/O paths.
SDD supports the concurrent attachment of various DS8000 and IBM FlashSystem™
models, IBM Storwize V7000, V5000, and V3000, and SAN Volume Controller storage
systems to the same host system. Where one or more alternative storage systems are to be
attached, you can identify the required version of SDD at this website:
http://www.ibm.com/support/docview.wss?uid=ssg1S7001350
FlashCopy makes an instant, point-in-time copy from a source VDisk volume to a target
volume. A FlashCopy can be made only to a volume within the same SAN Volume Controller.
Metro Mirror is a synchronous remote copy, which provides a consistent copy of a source
volume to a target volume. Metro Mirror can copy between volumes (VDisks) on separate
SAN Volume Controller clusters or between volumes within the same I/O Group on the same
SAN Volume Controller.
Global Mirror is an asynchronous remote copy, which provides a remote copy over extended
distances. Global Mirror can copy between volumes (VDisks) on separate SAN Volume
Controller clusters or between volumes within the same I/O Group on the same SAN Volume
Controller.
Important: SAN Volume Controller Copy Services functions are incompatible with the
DS8000 Copy Services.
The physical implementation of Global Mirror between DS8000 storage systems versus
Global Mirror between SAN Volume Controllers is internally different. The DS8000 storage
system is one of the products with the most sophisticated Global Mirror implementations
available, especially, for example, when it comes to working with small bandwidths, weak or
unstable links, or intercontinental distances. When changing architectures from one Global
Mirror implementation to the other, a resizing of the minimum required Global Mirror
bandwidth is likely needed.
For more information about the configuration and management of SAN Volume Controller
Copy Services, see the Advanced Copy Services chapters of Implementing the IBM System
Storage SAN Volume Controller V7.4, SG24-7933 or IBM System Storage SAN Volume
Controller and Storwize V7000 Replication Family Services, SG24-7574.
A FlashCopy mapping can be created between any two VDisk volumes in a SAN Volume
Controller cluster. It is not necessary for the volumes to be in the same I/O Group or storage
pool. This function can optimize your storage allocation by using an auxiliary storage system
(with, for example, lower performance) as the target of the FlashCopy. In this case, the
resources of your high-performance storage system are dedicated for production. Your
low-cost (lower performance) storage system is used for a secondary application (for
example, backup or development).
An advantage of SAN Volume Controller remote copy is that you can implement these
relationships between two SAN Volume Controller clusters with different back-end disk
subsystems. In this case, you can reduce the overall cost of the disaster recovery
infrastructure. The production site can use high-performance back-end disk systems, and the
recovery site can use low-cost back-end disk systems, even where the back-end disk
subsystem Copy Services functions are not compatible (for example, different models or
different manufacturers). This relationship is established at the volume level and does not
depend on the back-end disk storage system Copy Services.
Important: For Metro Mirror copies, the recovery site VDisk volumes must have
performance characteristics similar to the production site volumes when a high write I/O
rate is present to maintain the I/O response level for the host system.
The following section presents the SAN Volume Controller concepts and describes the
performance of the SAN Volume Controller. This section assumes that there are no
bottlenecks in the SAN or on the disk system.
To determine the number of I/O Groups and to monitor the processor performance of each
node, you can use IBM Spectrum Control, the IBM Virtual Storage Center (VSC), or the IBM
Tivoli Storage Productivity Center. The processor performance is related to I/O performance,
and when the processors become consistently 70% busy, you must consider one of these
actions:
Adding more nodes to the cluster and moving part of the workload onto the new nodes
Moving VDisk volumes to another I/O Group, if the other I/O Group is not busy
To see how busy your processors are, you can use the Tivoli Storage Productivity Center
performance report, by selecting the CPU Utilization option.
With the newly added I/O Group, the SAN Volume Controller cluster can potentially double
the I/O rate per second (IOPS) that it can sustain. A SAN Volume Controller cluster can be
scaled up to an eight-node cluster with which you quadruple the total I/O rate.
You must carefully plan the SAN Volume Controller port bandwidth.
For the DS8000 storage system, there is no controller affinity for the LUNs. So, a single zone
for all SAN Volume Controller ports and up to eight DS8000 host adapter (HA) ports must be
defined on each fabric. The DS8000 HA ports must be distributed over as many HA cards as
available and dedicated to SAN Volume Controller use if possible.
Configure a minimum of eight controller ports to the SAN Volume Controller per controller
regardless of the number of nodes in the cluster. Configure 16 controller ports for large
controller configurations where more than 48 DS8000 ranks are being presented to the SAN
Volume Controller cluster.
For the DS8000 storage system, all LUNs in the same storage pool tier ideally have these
characteristics:
Use disk drive modules (DDMs) of the same capacity and speed.
Arrays must be the same RAID type.
Use LUNs that are the same size.
For the extent size, to maintain maximum flexibility, use a SAN Volume Controller extent size of 1 GiB (1024 MiB).
For more information, see IBM System Storage SAN Volume Controller and Storwize V7000
Best Practices and Performance Guidelines, SG24-7521.
There are many workload attributes that influence the relative performance of RAID 5
compared to RAID 10, including the use of cache, the relative mix of read as opposed to write
operations, and whether data is referenced randomly or sequentially.
SAN Volume Controller does not need to influence your choice of the RAID type that is used.
For more information about the RAID 5 and RAID 10 differences, see 4.7, “Planning RAID
arrays and ranks” on page 97.
The DS8000 processor complex (or server group) affinity is determined when the rank is
assigned. Assign the same number of ranks in a DS8000 storage system to each of the
processor complexes. Additionally, if you do not need to use all of the arrays for your SAN
Volume Controller storage pool, select the arrays so that you use arrays from as many device
adapters (DAs) as possible to balance the load across the DAs.
Often, clients worked with at least two storage pools: one (or two) containing MDisks of all the
6+P RAID 5 ranks of the DS8000 storage system, and the other one (or more) containing the
slightly larger 7+P RAID 5 ranks. This approach maintains equal load balancing across all ranks when the SAN Volume Controller striping occurs because each MDisk in a storage pool then has the same size.
The SAN Volume Controller extent size is the stripe size that is used to stripe across all these
single-rank MDisks.
This approach delivered good performance and has its justifications. However, it also has a
few drawbacks:
There can be natural skew, for example, a small file of a few hundred KiB that is heavily
accessed. Even with a smaller SAN Volume Controller extent size, such as 256 MiB, this
classical setup led in a few cases to ranks that are more loaded than other ranks.
When you have more than two MDisks on one rank, and not as many SAN Volume
Controller storage pools, the SAN Volume Controller might start striping across many
entities that are effectively in the same rank, depending on the MDG layout. Such striping
should be avoided.
In DS8000 installations, clients tend to go to larger (multi-rank) extent pools to use modern features, such as auto-rebalancing or advanced tiering.
An advantage of this classical approach is that it delivers more options for fault isolation and
control over where a certain volume and extent are located.
1 The SAN Volume Controller 7.6 Restrictions document, which is available at http://www.ibm.com/support/docview.wss?uid=ssg1S1005424#_Extents, provides a table with the maximum size of an MDisk that depends on the extent size of the storage pool.
You have two options:
Use large multitier hybrid pools, with just one pair of DS8000 extent pools, where the DS8000 internal Easy Tier logic also performs the cross-tier optimization.
Create as many extent pool pairs in the DS8000 storage system as there are tiers, present each such DS8000 pool separately to the SAN Volume Controller, and let the SAN Volume Controller internal Easy Tier logic perform the cross-tier optimization.
In the latter case, the DS8000 internal Easy Tier logic can still do intra-tier auto-rebalancing.
For more information about this topic, see 15.8, “Where to place Easy Tier” on page 506.
You need only one MDisk volume size with this multi-rank approach because plenty of space
is available in each large DS8000 extent pool. Often, clients choose 2 TiB (2048 GiB) MDisks
for this approach. Create many 2-TiB volumes in each extent pool until the DS8000 extent
pool is full, and provide these MDisks to the SAN Volume Controller to build the storage pools.
At least two extent pools are needed so that each DS8000 processor complex (even/odd) is loaded about equally; a larger number of pools can also be used.
If you use DS8000 Easy Tier, even only for intra-tier auto-rebalancing, do not use 100% of
your extent pools. You must leave some small space of a few extents per rank free for Easy
Tier so that it can work.
To maintain the highest flexibility and for easier management, large DS8000 extent pools are
beneficial. However, if the SAN Volume Controller DS8000 installation is dedicated to
shared-nothing environments, such as Oracle ASM, DB2 warehouses, or General Parallel
File System (GPFS), use the single-rank extent pools.
With the DS8000 supporting volume sizes up to 16 TiB, a classical approach is still possible
when using large disks, such as placing only two volumes onto an array of the 4 TB Nearline
disks (RAID 6). The volume size in this case is determined by the rank capacity.
With the modern approach of using large multi-rank extent pools, more clients use a standard
and not-too-large volume MDisk size, such as 2 TiB for all MDisks, with good results.
As a preferred practice, assign DS8000 LUNs of the same size to each SAN Volume Controller storage pool. In this configuration, the workload that is applied to a VDisk is equally balanced across the MDisks within the storage pool.
Volumes can be added dynamically to the SAN Volume Controller. When the volume is added to the volume group, run the svctask detectmdisk command on the SAN Volume Controller to add it as an MDisk.
Before you delete or unmap a volume that is allocated to the SAN Volume Controller, remove
the MDisk from the SAN Volume Controller storage pool, which automatically migrates any
extents for defined volumes to other MDisks in the storage pool, if there is space available.
When it is unmapped on the DS8000 storage system, run the svctask detectmdisk
command and then run the maintenance procedure on the SAN Volume Controller to confirm
its removal.
To configure Spectrum Control or Tivoli Storage Productivity Center to monitor IBM SAN
Volume Controller, see IBM System Storage SAN Volume Controller and Storwize V7000
Best Practices and Performance Guidelines, SG24-7521.
IBM Spectrum Control offers many disk performance reporting options that support the SAN
Volume Controller environment and also the storage controller back end for various storage
controller types. The following storage components are the most relevant for collecting
performance metrics when monitoring storage controller performance:
Subsystem
Controller
Array
MDisk
MDG, or storage pool
Port
With the SAN Volume Controller, you can monitor on the levels of the I/O Group and the SAN
Volume Controller node.
The default status for these properties is Disabled with the Warning and Error options set to
None. Enable a particular threshold only after the minimum values for warning and error
levels are defined.
Tip: In Tivoli Storage Productivity Center for Disk, default threshold warning or error values of -1.0 indicate that there is no suggested minimum value for the threshold and that the values are entirely user-defined. You can choose to provide any reasonable value for these thresholds based on the workload in your environment.
For the current list of hardware that is supported for attachment to the SAN Volume Controller (at the time of writing, Version 7.6), see this website:
http://www.ibm.com/support/docview.wss?uid=ssg1S1005419
15.5.1 Sharing the DS8000 storage system between Open Systems servers
and the SAN Volume Controller
If you have a mixed environment that includes IBM SAN Volume Controller and Open Systems servers, share as many of the DS8000 resources as possible between both environments.
Most clients choose a DS8000 extent pool pair (or pairs) for their SAN Volume Controller
volumes only, and other extent pool pairs for their directly attached servers. This approach is
a preferred practice, but you can fully share on the drive level if preferred. Easy Tier
auto-rebalancing, as done by the DS8000 storage system, can be enabled for all pools.
IBM supports sharing a DS8000 storage system between a SAN Volume Controller and an
Open Systems server. However, if a DS8000 port is in the same zone as a SAN Volume
Controller port, that same DS8000 port must not be in the same zone as another server.
15.5.2 Sharing the DS8000 storage system between z Systems servers and the
SAN Volume Controller
SAN Volume Controller does not support FICON/count key data (CKD) based z Systems
server attachment. If you have a mixed server environment that includes IBM SAN Volume
Controller and hosts that use CKD, you must share your DS8000 storage system to provide
direct access to z Systems volumes and access to Open Systems server volumes through
the SAN Volume Controller.
In this case, you must split your DS8000 resources between two environments. You must
create a part of the ranks by using the CKD format (used for z Systems access) and the other
ranks in FB format (used for SAN Volume Controller access). In this case, both environments
get performance that is related to the allocated DS8000 resources.
A DS8000 port does not support a shared attachment between z Systems and SAN Volume
Controller. z Systems servers use the Fibre Channel connection (FICON), and SAN Volume
Controller supports Fibre Channel Protocol (FCP) connection only. Both environments should
each use their dedicated DS8000 HAs.
Guidelines: Many of these guidelines are not unique to configuring the DS8000 storage
system for SAN Volume Controller attachment. In general, any server can benefit from a
balanced configuration that uses the maximum available bandwidth of the DS8000 storage
system.
Follow the guidelines and procedures that are outlined in this section to make the most of the
performance that is available from your DS8000 storage systems and to avoid potential I/O
problems:
Use multiple HAs on the DS8000 storage system. If there are many spare DS8000 ports available, use no more than two ports on each card. Use many ports on the DS8000 storage system, usually up to the maximum number of ports that the SAN Volume Controller supports.
Unless you have special requirements, or if in doubt, build your MDisk volumes from large
extent pools on the DS8000 storage system.
If using a 1:1 mapping of ranks to DS8000 extent pools, use one, or a maximum of two
volumes on this rank, and adjust the MDisk volume size for this 1:1 mapping.
Create fewer and larger SAN Volume Controller storage pools and have multiple MDisks in
each pool.
Keep many DS8000 arrays active.
Ensure that you have an even number of extent pools, balanced across both processor complexes, and, as far as possible, spread the volumes equally across the DAs and the two processor complexes of the DS8000 storage system.
In a storage pool, ensure that, for a certain tier, all MDisks ideally have the same capacity and RAID/RPM characteristics.
For Metro Mirror configurations, always use DS8000 MDisks with similar characteristics for
both the master VDisk volume and the auxiliary volume.
Spread the VDisk volumes across all SAN Volume Controller nodes, and check for
balanced preferred node assignments.
In the SAN, use a dual fabric.
Use multipathing software in the servers.
Consider DS8000 Easy Tier auto-rebalancing for DS8000 homogeneous capacities.
When using Easy Tier in the DS8000 storage system, consider a SAN Volume Controller extent size of at least 1 GiB (1024 MiB) so that the workload skew seen by the DS8000 extents is not diluted.
When using DS8000 Easy Tier, leave some small movement space empty in the extent
pools to help it start working. Ten free extents per rank are sufficient.
Consider the correct amount of cache, as explained in 2.2.2, “Determining the correct
amount of cache storage” on page 33. Usually, SAN Volume Controller installations have a
DS8000 cache of not less than 128 GB.
However, solutions exist where you use locally attached flash as a flash cache (read cache), for example, as part of a newer operating system such as AIX 7.2 or when using
Each SSD MDisk goes into one pool, which determines how many storage pools can benefit
from SAN Volume Controller Easy Tier. The SSD size determines the granularity of the
offered SSD capacity. Scalability is limited compared to flash in a DS8000 storage system.
Another argument against this concept is that the ports that handle the traffic to the flash-only
storage system experience exceptionally high workloads.
15.8 Where to place Easy Tier
IBM Easy Tier is an algorithm that is developed by IBM Almaden Research and made
available to storage systems, such as the DS8880 and DS8870 families, and to Spectrum
Virtualize and the SAN Volume Controller. When using Easy Tier in the SAN Volume
Controller with a mixed-tier storage pool, the MDisks can be flagged as ssd, enterprise, or
Nearline when you introduce them to the SAN Volume Controller storage pool.
When using the internal SSDs in the SAN Volume Controller nodes, only Easy Tier performed by the SAN Volume Controller is possible for the inter-tier movements between the SSD and HDD tiers. DS8000 intra-tier auto-rebalancing can still be used: it monitors the usage of all the HDD ranks, and the DS8000 storage system moves loads within the tier if some ranks are more loaded than others.
When you have flash in the DS8000 storage system, together with Enterprise HDDs and also Nearline HDDs, on which level do you perform the overall inter-tier Easy Tiering? It can be done by the SAN Volume Controller, by setting the ssd attribute for all the DS8000 HPFE and SSD flash MDisks (which also means that SAN Volume Controller Easy Tier treats HPFE volumes and DS8000 SSD volumes alike). Alternatively, you can leave the enterprise (generic_hdd) attribute on all MDisks and allow DS8000 Easy Tier to manage them, with two-tier or three-tier MDisks offered to the SAN Volume Controller, where these MDisks contain some flash (which is invisible to the SAN Volume Controller). Well-running installations exist for both options.
There are differences between the Easy Tier algorithms in the DS8000 storage system and in
SAN Volume Controller. The DS8000 storage system is in the eighth generation of Easy Tier,
with additional functions available, such as Extended-Cold-Demote or Warm-Demote. The
warm demote checking is reactive if certain flash ranks or SSD-serving DAs suddenly
become overloaded. The SAN Volume Controller must work with different vendors and
varieties of flash space that is offered, and uses a more generic algorithm, which cannot learn
easily whether the SSD array of a certain vendor’s disk system is approaching its limits.
As a rule, when you use flash in the DS8000 storage system, when you use many or even heterogeneous storage systems, and also for most new installations, consider implementing cross-tier Easy Tier on the highest level, that is, managed by the SAN Volume Controller. The SAN Volume Controller can use larger block sizes for its back end, such as 60 KB and over, which do not work well for DS8000 Easy Tier, so you have another reason to use SAN Volume Controller Easy Tier for inter-tiering. However, observe the system by using the STAT for SAN Volume Controller. If the flash space gets overloaded, consider either adding more SSDs as suggested by the STAT, or removing and reserving some of the flash capacity so that it is not fully used by the SAN Volume Controller, by creating smaller SSD MDisks and leaving empty space there.
Tip: For most new installations, the following ideas are preferred practices:
Use SAN Volume Controller Easy Tier for the cross-tier (inter-tier) tiering.
Have several larger extent pools (multi-rank, single-tier) on the DS8000 storage system,
with each pool pair containing the ranks of just one certain tier.
Turn on the DS8000 based Easy Tier auto-rebalancing so that DS8000 Easy Tier is
used for the intra-tier rebalancing within each DS8000 pool.
Use the current level of SAN Volume Controller software and the current SAN Volume Controller node hardware before you start SAN Volume Controller managed Easy Tier.
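As a minimal illustration of these preferred practices, the following DS CLI sketch creates a single-tier FB extent pool pair for SAN Volume Controller MDisks and turns on Easy Tier monitoring and automatic management (which provides intra-tier auto-rebalance in single-tier pools). The storage image ID, pool names and IDs, rank IDs, volume IDs, and capacities are placeholders, and the exact parameters can vary by DS8000 release, so verify them against the DS CLI reference for your system:

   # Create one single-tier FB extent pool on each processor complex (rank group 0 and 1)
   mkextpool -dev IBM.2107-75ABCD1 -rankgrp 0 -stgtype fb SVC_ENT_P0
   mkextpool -dev IBM.2107-75ABCD1 -rankgrp 1 -stgtype fb SVC_ENT_P1
   # Assign the Enterprise ranks of this tier to the pool pair (assuming pool IDs P0 and P1)
   chrank -dev IBM.2107-75ABCD1 -extpool P0 R0 R2 R4
   chrank -dev IBM.2107-75ABCD1 -extpool P1 R1 R3 R5
   # Create equally sized FB volumes to present to the SAN Volume Controller as MDisks
   mkfbvol -dev IBM.2107-75ABCD1 -extpool P0 -cap 2000 -name svc_md_#h 1000-1003
   mkfbvol -dev IBM.2107-75ABCD1 -extpool P1 -cap 2000 -name svc_md_#h 1100-1103
   # Enable Easy Tier monitoring and automatic management on the storage image
   chsi -etmonitor all -etautomode all IBM.2107-75ABCD1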
Figure 16-1 illustrates the basic components of data deduplication for the IBM System
Storage TS7650G ProtecTIER Gateway.
With data deduplication, data is read by the data deduplication product while it looks for
duplicate data. Different data deduplication products use different methods of breaking up the
data into elements, but each product uses a technique to create a signature or identifier for
each data element. After the duplicate data is identified, one copy of each element is
retained, pointers are created for the duplicate items, and the duplicate items are not stored.
The effectiveness of data deduplication depends on many variables, including the rate of data change, the number of backups, and the data retention period. For example, if you back
up the same incompressible data once a week for six months, you save the first copy and do
not save the next 24. This example provides a 25:1 data deduplication ratio. If you back up an
incompressible file on week one, back up the same file again on week two, and never back it
up again, you have a 2:1 data deduplication ratio.
The IBM System Storage TS7650G is a preconfigured virtualization solution that is built on IBM systems.
The IBM ProtecTIER data deduplication software improves backup and recovery operations.
The solution is available in single-node or two-node cluster configurations to meet the
disk-based data protection needs of a wide variety of IT environments and data centers. The
TS7650G ProtecTIER Deduplication Gateway can scale to repositories in the petabyte (PB)
range, and all DS8000 models are supported behind it. Your DS8000 storage system can
become a Virtual Tape Library (VTL). The multi-node configurations help achieve higher throughput and availability for backups, and replication options are also available.
ProtecTIER access patterns either have a high random-read content (90%) for the UserData areas, or a high random-write ratio for the MetaData areas. For the UserData, RAID 5 or RAID 6 can be selected, with RAID 6 preferred when using Nearline drives. However, for the MetaData, RAID 10 must be used, and when using Nearline drives, follow the RPQ process to allow the Nearline drives to be formatted as RAID 10 for such random-write data.
You can obtain more information about the IBM DB2 and Oracle databases at these websites:
http://www.ibm.com/software/data/db2/linux-unix-windows/
http://www.ibm.com/support/knowledgecenter/SSEPGG/welcome
http://www.oracle.com/technetwork/database/enterprise-edition/documentation/index.html
Figure 17-1 DB2 for Linux, UNIX, and Windows logical structure: instances contain databases, which contain SMS or DMS table spaces (an SMS container is a directory in the file space of the operating system; a DMS container is a fixed, pre-allocated file or a physical device such as a disk) holding tables, indexes, and long data
Instances
An instance is a logical database manager environment where databases are cataloged and
configuration parameters are set. An instance is similar to an image of the actual database
manager environment. You can have several instances of the database manager product on
the same database server. You can use these instances to separate the development
environment from the production environment, tune the database manager to a particular
environment, and protect sensitive information from a particular group of users.
For the DB2 Database Partitioning Feature (DPF) of the DB2 Enterprise Server Edition
(ESE), all data partitions are within a single instance.
DB2 for Linux, UNIX, and Windows allows multiple databases to be defined within a single
database instance. Configuration parameters can also be set at the database level, so that
you can tune, for example, memory usage and logging.
Database partitions
A partition number in DB2 terminology is equivalent to a data partition. Databases with multiple data partitions that are on a symmetric multiprocessor (SMP) system are also called multiple logical node (MLN) databases.
Partitions are identified by the physical system where they are located and by a logical port number within the physical system. The partition number, which can be 0 - 999, uniquely defines a
partition. Partition numbers must be in ascending sequence (gaps in the sequence are
allowed).
The configuration information of the database is stored in the catalog partition. The catalog
partition is the partition from which you create the database.
Partitiongroups
A partitiongroup is a set of one or more database partitions. For non-partitioned
implementations (all editions except for DPF), the partitiongroup is always made up of a
single partition.
Partitioning map
When a partitiongroup is created, a partitioning map is associated to it. The partitioning map,
with the partitioning key and hashing algorithm, is used by the database manager to
determine which database partition in the partitiongroup stores a specific row of data.
Partitioning maps do not apply to non-partitioned databases.
Containers
A container is the way of defining where on the storage device the database objects are
stored. Containers can be assigned from file systems by specifying a directory. These
containers are identified as PATH containers. Containers can also reference files that are
within a directory. These containers are identified as FILE containers, and a specific size must
be identified. Containers can also reference raw devices. These containers are identified as
DEVICE containers, and the device must exist on the system before the container can be
used.
All containers must be unique across all databases; a container can belong to only one table
space.
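To make the container types concrete, the following sketch (the paths, sizes, and the table space names TS_SMS and TS_DMS are hypothetical) shows an SMS table space that uses directory (PATH) containers and a DMS table space that uses fixed-size FILE containers, with each container ideally placed on a different DS8000 logical disk:

   # SMS table space: each container is a directory in an existing file system
   db2 "CREATE TABLESPACE TS_SMS
        MANAGED BY SYSTEM
        USING ('/db2/fs0/ts_sms', '/db2/fs1/ts_sms')"

   # DMS table space: each container is a pre-allocated file of a fixed size
   db2 "CREATE TABLESPACE TS_DMS
        MANAGED BY DATABASE
        USING (FILE '/db2/fs0/ts_dms_c0' 50 G,
               FILE '/db2/fs1/ts_dms_c1' 50 G)"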
Table spaces
A database is logically organized in table spaces. A table space is a place to store tables. To
spread a table space over one or more disk devices, you specify multiple containers.
For partitioned databases, the table spaces are in partitiongroups. When the CREATE TABLESPACE command runs, the containers are assigned to a specific partition in the partitiongroup, thus maintaining the shared-nothing character of DB2 DPF.
There are three major types of user table spaces: regular (index and data), temporary, and
long. In addition to these user-defined table spaces, DB2 requires that you define a system
table space, which is called the catalog table space. For partitioned database systems, this
catalog table space is on the catalog partition.
When creating a table, you can choose to have certain objects, such as indexes and large object (LOB) data, stored separately from the rest of the table data, but to do so you must define the table in a DMS table space.
Indexes are defined for a specific table and help with the efficient retrieval of data to satisfy
queries. They also can be used to help with the clustering of data.
LOBs can be stored in columns of the table. These objects, although logically referenced as
part of the table, can be stored in their own table space when the base table is defined to a
DMS table space. This approach allows for more efficient access of both the LOB data and
the related table data.
Pages
Data is transferred to and from devices in discrete blocks that are buffered in memory. These
discrete blocks are called pages, and the memory that is reserved to buffer a page transfer is
called an I/O buffer. DB2 supports various page sizes, including 4 K, 8 K, 16 K, and 32 K.
When an application accesses data randomly, the page size determines the amount of data
transferred. This size corresponds to the size of the data transfer request to the DS8000
storage system, which is sometimes referred to as the physical record.
Sequential read patterns can also influence the page size that is selected. Larger page sizes
for workloads with sequential read patterns can enhance performance by reducing the
number of I/Os.
Extents
An extent is a unit of space allocation within a container of a table space for a single table
space object. This allocation consists of multiple pages. The extent size (number of pages) for
an object is set when the table space is created:
An extent is a group of consecutive pages defined to the database.
The data in the table spaces is striped by extent across all the containers in the system.
Sequential prefetch reads consecutive pages into the buffer pool before they are needed by
DB2. List prefetches are more complex. In this case, the DB2 optimizer optimizes the retrieval
of randomly located data.
The amount of data that is prefetched determines the amount of parallel I/O activity.
Ordinarily, the database administrator defines a prefetch value large enough to allow parallel
use of all of the available containers.
Page cleaners
Page cleaners are present to make room in the buffer pool before prefetchers read pages on
disk storage and move them into the buffer pool. For example, if a large amount of data is
updated in a table, many data pages in the buffer pool might be updated but not written into
disk storage (these pages are called dirty pages). Because prefetchers cannot place fetched data pages onto the dirty pages in the buffer pool, these dirty pages must be flushed to disk storage and become clean pages so that prefetchers can place the data pages that they fetch from disk storage into the buffer pool.
To optimize performance, the updated data pages in the buffer pool and the log records in the
log buffer are not written to disk immediately. The updated data pages in the buffer pool are
written to disk by page cleaners and the log records in the log buffer are written to disk by the
logger.
The logger and the buffer pool manager cooperate and ensure that the updated data page is
not written to disk storage before its associated log record is written to the log. This behavior
ensures that the database manager can obtain enough information from the log to recover
and protect a database from being left in an inconsistent state when the database crashes as
a result of an event, such as a power failure.
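The number of page cleaners and prefetchers (I/O servers) is set in the database configuration. A minimal sketch, assuming a database named SAMPLE and a DB2 level that supports automatic tuning of these parameters:

   # Let DB2 size the page cleaners and prefetchers automatically
   db2 "UPDATE DB CFG FOR SAMPLE USING NUM_IOCLEANERS AUTOMATIC NUM_IOSERVERS AUTOMATIC"
   # Display the current values
   db2 "GET DB CFG FOR SAMPLE" | grep -i -e NUM_IOCLEANERS -e NUM_IOSERVERS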
Parallel operations
DB2 for Linux, UNIX, and Windows extensively uses parallelism to optimize performance
when accessing a database. DB2 supports several types of parallelism, including query and
I/O parallelism.
Query parallelism
There are two dimensions of query parallelism: inter-query parallelism and intra-query
parallelism. Inter-query parallelism refers to the ability of multiple applications to query a
database concurrently. Each query runs independently of the other queries, but they are all
run concurrently. Intra-query parallelism refers to the simultaneous processing of parts of a
single query by using intra-partition parallelism, inter-partition parallelism, or both:
Intra-partition parallelism subdivides what is considered a single database operation, such
as index creation, database loading, or SQL queries, into multiple parts, many or all of
which can be run in parallel within a single database partition.
Inter-partition parallelism subdivides what is considered a single database operation, such
as index creation, database loading, or SQL queries, into multiple parts, many or all of
which can be run in parallel across multiple partitions of a partitioned database on one
machine or on multiple machines. Inter-partition parallelism applies to DPF only.
I/O parallelism
When there are multiple containers for a table space, the database manager can use parallel
I/O. Parallel I/O refers to the process of writing to, or reading from, two or more I/O devices
simultaneously. Parallel I/O can result in significant improvements in throughput.
DB2 implements a form of data striping by spreading the data in a table space across multiple
containers. In storage terminology, the part of a stripe that is on a single device is a strip. The
DB2 term for strip is extent. If your table space has three containers, DB2 writes one extent to
container 0, the next extent to container 1, the next extent to container 2, and then back to
container 0. The stripe width (a generic term that is not often used in DB2 literature) is equal
to the number of containers, or three in this case.
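Because each container in such a layout is backed by a RAID array rather than a single disk, DB2 can be told to issue more parallel I/O per container through the DB2_PARALLEL_IO registry variable. The following sketch assumes 6+P RAID 5 arrays (six data disks behind every container); adjust the value to your array width:

   # Treat every table space container as being backed by six disks
   db2set DB2_PARALLEL_IO=*:6
   # Restart the instance so that the registry change takes effect
   db2stop
   db2start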
Storage layout recommendations depend on the technology that is used. Although the use of traditional hard disk drives (HDDs) requires a manual workload balance across the DS8000 resources, with hybrid storage pools and Easy Tier the storage controller chooses what is stored on SSDs and what is stored on HDDs.
If you want optimal performance from the DS8000 storage system, do not treat it like a black
box. Establish a storage allocation policy that allocates data by using several DS8000 ranks.
Understand how DB2 tables map to underlying logical disks, and how the logical disks are
allocated across the DS8000 ranks. One way of making this process easier to manage is to
maintain a modest number of DS8000 logical disks.
If the containers of a table space are on separate DS8000 logical disks that are on different DS8000 ranks, the data is striped across DS8000 arrays, device adapters, and processor complexes. This striping eliminates the need for underlying operating system or LVM striping.
Page size
Page sizes are defined for each table space. There are four supported page sizes: 4 K, 8 K,
16 K, and 32 K.
For DMS, temporary DMS, and nontemporary automatic storage table spaces (see “Table
spaces” on page 515), the page size you choose for your database determines the upper limit
for the table space size. For tables in SMS and temporary automatic storage table spaces,
page size constrains the size of the tables themselves.
For more information, see the Page, table and table space size topic on the DB2 10.5 for
Linux, UNIX, and Windows IBM Knowledge Center website:
http://www.ibm.com/support/knowledgecenter/SSEPGG
Select a page size that can accommodate the total expected growth requirements of the
objects in the table space.
Tip: A 4- or 8-KB page size is suitable for an OLTP environment, and a 16- or 32-KB page
size is appropriate for analytics. A 32-KB page size is recommended for column-organized
tables.
Extent size
The extent size for a table space is the amount of data that the database manager writes to a
container before writing to the next container. Ideally, the extent size should be a multiple of
the underlying segment size of the disks, where the segment size is the amount of data that
the disk controller writes to one physical disk before writing to the next physical disk.
If you stripe across multiple arrays in your DS8000 storage system, assign a LUN from each rank to be used as a DB2 container. During writes, DB2 writes one extent to the first container and the next extent to the second container, and so on, until all containers (eight in this example) are addressed before cycling back to the first container. DB2 stripes across containers at the table space level.
Because the DS8000 storage system stripes at a fairly fine granularity (256 KB), selecting
multiples of 256 KB for the extent size ensures that multiple DS8000 disks are used within a
rank when a DB2 prefetch occurs. However, keep your extent size below 1 MB.
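The extent size is specified in pages, so the page size determines which EXTENTSIZE value yields a multiple of 256 KB. A short sketch of the arithmetic, using a hypothetical table space TS_HIST and buffer pool BP16K:

   # Extent size in bytes = EXTENTSIZE (pages) x page size
   #   4 KB pages:  EXTENTSIZE 64 -> 256 KB
   #   8 KB pages:  EXTENTSIZE 32 -> 256 KB
   #  16 KB pages:  EXTENTSIZE 16 -> 256 KB (EXTENTSIZE 32 -> 512 KB, still below 1 MB)
   db2 "CREATE BUFFERPOOL BP16K SIZE AUTOMATIC PAGESIZE 16 K"
   db2 "CREATE TABLESPACE TS_HIST
        PAGESIZE 16 K
        MANAGED BY AUTOMATIC STORAGE
        EXTENTSIZE 16
        BUFFERPOOL BP16K"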
I/O performance is fairly insensitive to the selection of extent sizes, mostly because the
DS8000 storage system employs sequential detection and prefetch. For example, even if you
select an extent size, such as 128 KB, which is smaller than the full array width (it accesses
only four disks in the array), the DS8000 sequential prefetch keeps the other disks in the array
busy.
Prefetch size
The table space prefetch size determines the degree to which separate containers can
operate in parallel.
Prefetching pages means that one or more pages are retrieved from disk in the expectation
that they are required by an application. Prefetching index and data pages into the buffer pool
can help improve performance by reducing I/O wait times. In addition, parallel I/O enhances
prefetching efficiency.
Prefetch size is tunable, that is, the prefetch size can be altered after the table space is defined and data is loaded. This is not true for the extent size and page size, which are set at table space creation time and cannot be altered without redefining the table space and reloading the data.
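For example, the prefetch size of an existing table space (here the hypothetical TS_DMS from earlier) can be raised after more containers are added, or handed over to DB2 entirely:

   # Set an explicit prefetch size (in pages) to drive parallel I/O across all containers
   db2 "ALTER TABLESPACE TS_DMS PREFETCHSIZE 256"
   # Or let DB2 derive it from the number of containers and the extent size
   db2 "ALTER TABLESPACE TS_DMS PREFETCHSIZE AUTOMATIC"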
For more information, see the Prefetching data into the buffer pool, Parallel I/O management, and Optimizing table space performance when data is on RAID devices topics on the DB2 10.5 for Linux, UNIX, and Windows IBM Knowledge Center website:
http://www.ibm.com/support/knowledgecenter/SSEPGG
The DS8000 storage system supports a high degree of parallelism and concurrency on a
single logical disk. As a result, a single logical disk the size of an entire array achieves the
same performance as many smaller logical disks. However, you must consider how logical
disk size affects both the host I/O operations and the complexity of your systems
administration.
Smaller logical disks provide more granularity, with their associated benefits. But, smaller
logical disks also increase the number of logical disks seen by the operating system. Select a
DS8000 logical disk size that allows for granularity and growth without proliferating the
number of logical disks.
Account for your container size and how the containers map to AIX logical volumes (LVs) and
DS8000 logical disks. In the simplest situation, the container, the AIX LV, and the DS8000
logical disk are the same size.
Tip: Try to strike a reasonable balance between flexibility and manageability for your needs. As a preferred practice, create no fewer than two logical disks in an array, with a minimum logical disk size of 16 GB. Unless you have a compelling reason to do otherwise, standardize on a single logical disk size throughout the DS8000 storage system.
Smaller logical disk sizes have the following advantages and disadvantages:
Advantages of smaller size logical disks:
– Easier to allocate storage for different applications and hosts.
– Greater flexibility in performance reporting.
Examples
Assume a 6+P array with 146 GB disk drives. You want to allocate disk space on your
16-array DS8000 storage system as flexibly as possible. You can carve each of the 16 arrays
into 32 GB logical disks or logical unit numbers (LUNs), resulting in 27 logical disks per array
(with a little left over). This design yields a total of 16 x 27 = 432 LUNs. Then, you can
implement four-way multipathing, which in turn makes 4 x 432 = 1728 hdisks visible to the
operating system.
This approach creates an administratively complex situation, and, at every restart, the
operating system queries each of those 1728 disks. Restarts might take a long time.
Alternatively, you create 16 large logical disks. With multipathing and attachment of four Fibre
Channel ports, you have 4 x 16 = 128 hdisks visible to the operating system. Although this
number is large, it is more manageable, and restarts are much faster. After overcoming that
problem, you can then use the operating system LVM to carve this space into smaller pieces
for use.
However, there are problems with this large logical disk approach. If the DS8000 storage
system is connected to multiple hosts or it is on a SAN, disk allocation options are limited
when you have so few logical disks. You must allocate entire arrays to a specific host, and if
you want to add additional space, you must add it in array-size increments.
17.2.7 Multipathing
Use the DS8000 multipathing along with DB2 striping to ensure the balanced use of Fibre
Channel paths.
Multipathing is the hardware and software support that provides multiple avenues of access
to your data from the host computer. You must provide at least two Fibre Channel paths from
the host computer to the DS8000 storage system. Paths are defined by the number of host
adapters (HAs) on the DS8000 storage system that service the LUNs of a certain host
system, the number of Fibre Channel host bus adapters (HBAs) on the host system, and the
SAN zoning configuration. The total number of paths must account for the throughput requirements of the host system. If the host system requires more than 400 MBps (2 x 200 MBps) of throughput, two HBAs are not adequate.
DS8000 multipathing requires the installation of multipathing software. For example, the IBM
Subsystem Device Driver Path Control Module (SDDPCM) on AIX and Device Mapper -
Multipath I/O (DM-MP) on Linux. These products are described in Chapter 9, “Performance
considerations for UNIX servers” on page 285 and Chapter 8, “Host attachment” on
page 267.
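How you verify the paths depends on the multipathing driver. The following commands are common examples (the device name is a placeholder); consult the documentation for your driver levels:

   # AIX with SDDPCM: show the MPIO devices and the state of each path
   pcmpath query device
   lspath -l hdisk4
   # Linux with DM-MP: show the multipath maps, path groups, and path states
   multipath -ll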
This section focuses on Oracle I/O characteristics. Some memory or processor considerations are also needed, but they must be made at the appropriate level according to your system specifications and planning.
Reviewing the following considerations can help you understand the Oracle I/O demand and plan your DS8800/DS8700 storage system for its use.
Although the instance is an important part of the Oracle components, this section focuses on
the data files. OLTP workloads can benefit from SSDs combined with Easy Tier automatic
mode management to optimize performance. Furthermore, you must consider segregation
and resource-sharing aspects when performing separate levels of isolation on the storage for
different components. Typically, in an Oracle database, you separate redo logs and archive
logs from data and indexes. The redo logs and archives are known for performing intensive
read/write workloads.
In a database environment, the disk subsystem is considered the slowest component in the whole infrastructure. Plan ahead to avoid reconfigurations and time-consuming performance problem investigations when future problems, such as bottlenecks, occur. However, as with all
I/O subsystems, good planning and data layout can make the difference between having
excellent I/O throughput and application performance, and having poor I/O throughput, high
I/O response times, and correspondingly poor application performance.
In many cases, I/O performance problems can be traced directly to “hot” files that cause a
bottleneck on some critical component, for example, a single physical disk. This problem can
occur even when the overall I/O subsystem is fairly lightly loaded. When bottlenecks occur,
storage or database administrators might need to identify and manually relocate the high
activity data files that contributed to the bottleneck condition. This problem solving tends to be
a resource-intensive and often frustrating task. As the workload content changes with the
daily operations of normal business cycles, for example, hour by hour through the business
day or day by day through the accounting period, bottlenecks can mysteriously appear and
disappear or migrate over time from one data file or device to another.
In addition, the prioritization of important database workloads to meet their quality of service
(QoS) requirements when they share storage resources with less important workloads can be
managed easily by using the DS8000 I/O Priority Manager.
Section “RAID-level performance considerations” on page 98 reviewed the RAID levels and
their performance aspects. It is important to describe the RAID levels because some data
files can benefit from certain RAID levels, depending on their workload profile, as shown in
Figure 4-6 on page 112. However, advanced storage architectures, for example, cache and
advanced cache algorithms, or even multitier configurations with Easy Tier automatic
management, can make RAID level considerations less important.
For example, with 15 K RPM Enterprise disks and a significant amount of cache available on the storage system, some environments might see similar performance with RAID 10 and RAID 5, although workloads with a high percentage of random write activity and high I/O access densities mostly benefit from RAID 10. RAID 10 benefits clients in single-tier pools, and it takes advantage of Easy Tier automatic intra-tier performance management (auto-rebalance), which constantly optimizes data placement across ranks based on rank utilization in the extent pool.
However, by using hybrid pools with flash/SSDs and Easy Tier automode cross-tier
performance management that promotes the hot extents to flash on a subvolume level, you
can boost database performance and automatically adapt to changing workload conditions.
You might consider striping on one level only (storage system or host/application-level),
depending on your needs. The use of host-level or application-level striping might be
counterproductive when using Easy Tier in multitier extent pools because striping dilutes the
workload skew and can reduce the effectiveness of Easy Tier.
On previous DS8300/DS8100 storage systems, you benefited from using storage pool
striping (rotate extents) and striping on the storage level. You can create your redo logs and
spread them across as many extent pools and ranks as possible to avoid contention. On a
DS8800/DS8700 storage system with Easy Tier, data placement and workload spreading in
extent pools is automatic, even across different storage tiers.
You still can divide your workload across your planned extent pools (hybrid or homogeneous)
and consider segregation on the storage level by using different storage classes or RAID
levels or by separating table spaces from logs across different extent pools with regard to
failure boundary considerations.
However, if you consider striping at the AIX LVM level or the database level, for example, with Oracle Automatic Storage Management (ASM), you must consider the best possible approaches if you use it with Easy Tier and multitier configurations. Keep your physical partition (PP) size or stripe size at a high value so that enough skew remains for Easy Tier to promote hot extents efficiently.
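A minimal AIX LVM sketch under these assumptions (the volume group, logical volume, and hdisk names are hypothetical) uses a large physical partition size and spreads the logical volume across the DS8000 LUNs without fine-grained striping, so that Easy Tier still sees the workload skew:

   # Scalable volume group with a 1024 MB (1 GiB) physical partition size
   mkvg -S -s 1024 -y oradatavg hdisk4 hdisk5 hdisk6 hdisk7
   # Spread the logical volume across all disks with the maximum inter-disk allocation policy
   mklv -y lv_oradata -t jfs2 -e x oradatavg 400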
AIX features different file system mount options. The following mount options are preferred practices with Oracle databases:
Direct I/O (DIO):
– Data is transferred directly from the disk to the application buffer. It bypasses the file
buffer cache and avoids double caching (file system cache + Oracle System Global
Area (SGA)).
– Emulates a raw device implementation.
Concurrent I/O (CIO):
– Implicit use of DIO.
– No inode locking: Multiple threads can perform reads and writes on the same file
concurrently.
– Performance that is achieved by using CIO is comparable to raw devices.
– Avoid double caching: Some data is already cached in the application layer (SGA).
– Provides faster access to the back-end disk and reduces the processor utilization.
– Disables the inode-lock to allow several threads to read and write the same file (CIO
only).
– Because data transfer is bypassing the AIX buffer cache, Journaled File System 2
(JFS2) prefetching and write-behind cannot be used. These functions can be handled
by Oracle.
When using DIO or CIO, I/O requests made by Oracle must be aligned with the JFS2
blocksize to avoid a demoted I/O (returns to normal I/O after a DIO failure).
Additionally, when using JFS2, consider using the INLINE log for file systems so that it can
have the log striped and not be just placed in a single AIX PP.
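As an illustration (the logical volume name, mount point, and block sizes are assumptions to adapt to your Oracle block sizes), a JFS2 file system with an INLINE log can be created and mounted with CIO as follows:

   # JFS2 file system for Oracle data files with an INLINE log and a 4 KB agblksize
   crfs -v jfs2 -d lv_oradata -m /oracle/oradata -A yes -a logname=INLINE -a agblksize=4096
   # Mount with concurrent I/O so that reads and writes bypass the file system cache
   mount -o cio /oracle/oradata
   # For redo log file systems, a smaller agblksize (for example, 512) is commonly used
   # so that Oracle redo writes stay aligned and I/O demotion is avoided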
For more information about AIX and Linux file systems, see 9.2.1, “AIX Journaled File System and Journaled File System 2” on page 296 and 12.3.6, “File systems” on page 400.
Other options that are supported by Oracle include the Asynchronous I/O (AIO), IBM
Spectrum Scale (formerly IBM General Parallel File System (GPFS)), and Oracle ASM:
AIO:
– Allows multiple requests to be sent without having to wait until the disk storage system completes the physical I/O.
– Use of AIO is advised no matter what type of file system and mount option you implement (JFS, JFS2, CIO, or DIO).
Note: Since Oracle Database 11g Release 2, Oracle no longer supports raw devices. You
can create databases on an Oracle ASM or a filesystem infrastructure. For more
information, see the Oracle Technology Network article:
http://www.oracle.com/technetwork/articles/sql/11g-asm-083478.html
For more information about the topics in this section, see the following resources:
The Oracle Architecture and Tuning on AIX paper:
http://www.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/WP100883
The IBM Power Systems, AIX and Oracle Database Performance Considerations paper
includes parameter setting recommendations for AIX 6.1 and AIX 7.1 environments:
http://www.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/WP102171
Oracle has recommended the Stripe and Mirror Everything (S.A.M.E.) methodology for many years. Oracle database installations used Volume Manager-based or ASM-based mirroring and striping to implement the S.A.M.E. methodology.
However, with storage technologies such as RAID and Easy Tier, alternative solutions are
available.
Use the Easy Tier function on hybrid storage pools (flash/SSD and HDD) to improve I/O
performance. In most cases, 5 -10% of flash (compared to the overall storage pool capacity)
should be sufficient.
For HDD-only setups, consider RAID 10 for workloads with a high percentage of random write
activity (> 40%) and high I/O access densities (peak > 50%).
As a result, the LUNs are striped across all disks in the storage pool, as shown in Figure 17-2
on page 529.
Separate data files and transaction logs on different physical disks, not just because of
performance improvements, but because of data safety in case of a RAID array failure.
Separating data files and transaction logs does not need to be considered for an LVM mirror
across two DS8000 storage systems.
You can obtain additional information about IBM DB2 and IMS at these websites:
http://www.ibm.com/software/data/db2/zos/family/
http://www.ibm.com/software/data/ims/
OLTP systems process the day-to-day operation of businesses, so they have strict user
response and availability requirements. They also have high throughput requirements and are
characterized by many database inserts and updates. They typically serve hundreds, or even
thousands, of concurrent users.
Decision support systems typically deal with substantially larger volumes of data than OLTP
systems because of their role in supplying users with large amounts of historical data.
Although 100 GB of data is considered large for an OLTP environment, a large decision
support system might be 1 TB of data or more. The increased storage requirements of
decision support systems can also be attributed to the fact that they often contain multiple,
aggregated views of the same data.
Data table spaces can be divided into two groups: system table spaces and user table spaces.
Both of these table spaces have identical data attributes. The difference is that system table
spaces are used to control and manage the DB2 subsystem and user data. System table
spaces require the highest availability and special considerations. User data cannot be
accessed if the system data is not available.
In addition to data table spaces, DB2 requires a group of traditional data sets that are not
associated to table spaces that are used by DB2 to provide data availability: The backup and
recovery data sets.
In summary, there are three major data set types in a DB2 subsystem:
DB2 system table spaces
DB2 user table spaces
DB2 backup and recovery data sets
The following sections describe the objects and data sets that DB2 uses.
TABLE
All data that is managed by DB2 is associated to a table. The table is the main object used by
DB2 applications.
TABLESPACE
A table space is used to store one or more tables. A table space is physically implemented
with one or more data sets. Table spaces are VSAM linear data sets (LDS). Because table
spaces can be larger than the largest possible VSAM data set, a DB2 table space can require
more than one VSAM data set.
INDEXSPACE
An index space is used to store an index. An index space is physically represented by one or
more VSAM LDSs.
DATABASE
The database is a DB2 representation of a group of related objects. Each of the previously
named objects must belong to a database. DB2 databases are used to organize and manage
these objects.
STOGROUP
A DB2 storage group is a list of storage volumes. STOGROUPs are assigned to databases,
table spaces, or index spaces when using DB2 managed objects. DB2 uses STOGROUPs for
disk allocation of the table and index spaces.
Application table spaces and index spaces are VSAM LDSs with the same attributes as DB2
system table spaces and index spaces. System and application data differ only because they
have different performance and availability requirements.
If you want optimal performance from the DS8000 storage system, do not treat the DS8000
storage system as a “black box.” Understand how DB2 tables map to underlying volumes and
how the volumes map to RAID arrays.
Using Easy Tier ensures that data is spread across ranks in a hybrid pool and placed on the
appropriate tier, depending on how active the data is.
You can intermix tables and indexes and also system, application, and recovery data sets on
the DS8000 ranks. The overall I/O activity is then more evenly spread, and I/O skews are
avoided.
The results in Figure 18-1 on page 537 were measured by a DB2 I/O benchmark. They show
random 4-KB read throughput and response times. The SSD response times are low across
the curve. They are lower than the minimum HDD response time for all data points.
If you plan to create an extent pool with only SSD or flash ranks, the FLASHDA tool can help
you identify which data benefits the most if placed on those extent pools.
FLASHDA is a tool that is based on SAS software to manage the transition to SSDs. The tool provides volume and data set usage reports that use SAS code to analyze SMF 42 subtype 6 and SMF 74 subtype 5 records to help identify volumes and data sets that are good candidates to be on SSDs.
Based on the output report of this tool, you can select which hot volumes might benefit most when migrated to SSDs. The tool output also provides the hot data at the data set level.
Based on this data, the migration to the SSD ranks can be done by data set by using the
appropriate z/OS tools.
The DS8884 storage system provides up to four HPFEs, and the DS8886 storage system
provides up to eight HPFEs:
DS8884 storage system: Two flash enclosures in a base rack with two additional
enclosures in the first expansion rack
DS8886 storage system: Four flash enclosures in a base rack with four additional
enclosures in the first expansion rack
Figure 18-2 shows the comparison for a 4-KB random read workload between HPFE and
SSD (Single Array RAID 5 6+p) in a DS8870 storage system. As shown in Figure 18-2, for a
latency-sensitive application, HPFE can sustain low response time at more demanding I/O
rates than its SSD equivalent.
Figure 18-2 4-KB random read comparison between HPFE and SSD
For more information about DB2 performance with HPFE, see Making Smart Storage
Decisions for DB2 in a Flash and SSD World, REDP-5141.
Important: Today, all DB2 I/Os, including format write and list prefetch, are supported by High Performance FICON (zHPF).
High Performance FICON, also known as zHPF, is not new; it was introduced in 2009. Initially, zHPF support for DB2 I/O was limited to sync I/Os and write I/Os of individual records. That support was enhanced gradually, and zHPF has supported all types of DB2 I/O since 2011.
Conversion of DB2 I/O to zHPF improves resource utilization, bandwidth, and response time:
4-KB page format write throughput increases up to 52%.
Preformatting throughput increases up to 100%.
Sequential prefetch throughput increases up to 19%.
Dynamic prefetch throughput increases up to 23% (40% with SSD).
DB2 10 throughput increases up to 111% (more with 8-KB pages) for disorganized index scan. DB2 10 with zHPF is up to 11 times faster than DB2 9 without zHPF.
Sync I/O cache hit response time decreases by up to 30%.
FICON Express 16S on z13 with a DS8870 storage system improves DB2 log write latency
and throughput by up to 32% with multiple I/O streams, compared with FICON Express 8S in
zEC12, resulting in improved DB2 transactional latency. Thus, you can expect up to 32%
reduction in elapsed time for I/O bound batch jobs.
zHPF with FICON Express 16S provides greater throughput and better response time compared with FICON Express 8 or 8S. For more information about that comparison, see 8.3.1, “FICON” on page 276.
Today, all DB2 I/O is supported by zHPF. The following environments are required to obtain
list prefetch with zHPF benefits:
DS8700 or DS8800 storage system with LMC R6.2 or above
DB2 10 or DB2 11
z13, zEC12, zBC12, z196, or z114
FICON Express 8S or 16S
z/OS V1.11 or above with required PTFs
List Prefetch Optimizer optimizes the fetching of data from the storage system when DB2 list
prefetch runs. DB2 list prefetch I/O is used for disorganized data and indexes or for
skip-sequential processing. Where list prefetch with zHPF improves the connect time of the I/O response, List Prefetch Optimizer contributes to reducing the disconnect time.
Today, IBM introduces High Performance Flash Enclosure (HPFE) and SSDs, which achieve
higher performance for DB2 applications than before. The synergy between List Prefetch
Optimizer and HPFE or SSDs provides significant improvement for DB2 list prefetch I/O.
For more information about List Prefetch Optimizer, see DB2 for z/OS and List Prefetch
Optimizer, REDP-4862.
VSAM data striping addresses this problem with two modifications to the traditional data
organization:
The records are not placed in key ranges along the volumes; instead, they are organized
in stripes.
Parallel I/O operations are scheduled to sequential stripes in different volumes.
By striping data, the VSAM control intervals (CIs) are spread across multiple devices. This
format allows a single application request for records in multiple tracks and CIs to be satisfied
by concurrent I/O requests to multiple volumes.
The result is improved data transfer to the application. The scheduling of I/O to multiple volumes to satisfy a single application request is referred to as an I/O path packet.
You can stripe across ranks, DAs, servers, and the DS8000 storage systems.
In a DS8000 storage system with storage pool striping, the implementation of VSAM striping
still provides a performance benefit. Because DB2 uses two engines for the list prefetch
operation, VSAM striping increases the parallelism of DB2 list prefetch I/Os. This parallelism
exists with respect to the channel operations and the disk access.
If you plan to enable VSAM I/O striping, see DB2 9 for z/OS Performance Topics, SG24-7473.
Measurements that are oriented to determine how large volumes can affect DB2 performance
show that similar response times can be obtained by using larger volumes compared to using
the smaller 3390-3 standard-size volumes. For more information, see “Volume sizes” on
page 484.
Examples of DB2 applications that benefit from MIDAWs are DB2 prefetch and DB2 utilities.
As more data is prefetched, more disks are employed in parallel. Therefore, high throughput
is achieved by employing parallelism at the disk level. In addition to enabling one sequential
stream to be faster, AMP also reduces disk thrashing when there is disk contention.
This architecture allows DB2 to communicate performance requirements for optimal data set
placement by communicating application performance information (hints) to the Easy Tier
Application API by also using DFSMS. The application hint sets the intent through the API,
and Easy Tier moves the data set to the correct tier.
The following description provides an example of the DB2 reorganization (REORG) process
with Easy Tier Application.
Without integration of Easy Tier Application, a DB2 REORG places the extents of the REORG
result in new extents. These new extents can be extents on a lower tier, and it takes a while
for Easy Tier to detect that these extents are hot and must be moved to a higher-level tier.
The integration of Easy Tier Application allows DB2 to proactively instruct Easy Tier about the application-intended use of the data. The application hint sets the intent, and Easy Tier moves the data to the correct tier. So, extents that are hot before the REORG are moved to higher-tier extents.
Starting with DS8870 R7.2, Metro Mirror now recognizes the bypass extent blocking option,
which reduces the device busy delay time and improves the throughput in a Metro Mirror
environment, in some cases by up to 100%.
Especially in a data-sharing environment, DB2 burst writes tend to range across many tracks. Without Metro Mirror, the write operation to a specific track is serialized only for that specific time, but with Metro Mirror, the entire range of tracks is serialized for the entire time of the write I/O operation. The DB2 burst write is processed asynchronously, but it might cause a serious problem if the Group Buffer Pool (GBP) fills up.
This function accelerates throughput for some Metro Mirror environments by up to 100%, which reflects the reduction of device busy delay in the I/O response time.
18.3.14 zHyperWrite
DS8870 LMC R7.4 released zHyperWrite, which helps accelerate DB2 log writes in Metro
Mirror synchronous data replication environments. zHyperWrite combines concurrent
DS8870 Metro Mirror (PPRC) synchronous replication and software mirroring through media
manager (DFSMS) to provide substantial improvements in DB2 log write latency. This
function also coexists with HyperSwap.
Without zHyperWrite in the Metro Mirror environment, the I/O response time of DB2 log write
is impacted by latency that is caused by synchronous replication.
zHyperWrite enables DB2 to perform parallel log writes to the primary and secondary volumes. When DB2 writes to the primary log volume, DFSMS updates the secondary log volume concurrently. The write I/O to the DB2 log with zHyperWrite ends when both the primary and secondary volumes are updated by DFSMS.
Thus, with zHyperWrite, you can avoid the latency of storage-based synchronous mirroring, which improves log write throughput. zHyperWrite achieved a response time reduction of up to 40% and a throughput improvement of up to 179%. These benefits depend on the distance.
IMS Database Manager provides functions for preserving the integrity of databases and
maintaining the databases. It allows multiple tasks to access and update the data, while
ensuring the integrity of the data. It also provides functions for reorganizing and restructuring
the databases.
The IMS databases are organized internally by using a number of IMS internal database
organization access methods. The database data is stored on disk storage by using the
normal operating system access methods.
During IMS execution, all information that is necessary to restart the system if there is a
failure is recorded on a system log data set. The IMS logs are made up of the following
information.
The online log data sets (OLDS) are made up of multiple data sets that are used in a wraparound manner. At least
three data sets must be allocated for the OLDS to allow IMS to start, and an upper limit of 100
data sets is supported.
Only complete log buffers are written to OLDS to enhance performance. If any incomplete
buffers must be written out, they are written to the write ahead data sets (WADS).
When IMS processing requires writing a partially filled OLDS buffer, a portion of the buffer is
written to the WADS. If IMS or the system fails, the log data in the WADS is used to terminate
the OLDS, which can be done as part of an emergency restart, or as an option on the IMS
Log Recovery Utility.
The WADS space is continually reused after the appropriate log data is written to the OLDS.
This data set is required for all IMS systems, and must be pre-allocated and formatted at IMS
startup when first used.
When using a DS8000 storage system with storage pool striping, define the WADS volumes
as 3390-Mod.1 and allocate them consecutively so that they are allocated to different ranks.
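A hedged DS CLI sketch of such an allocation (the extent pool, volume IDs, and storage image ID are placeholders, and parameter details can differ by DS CLI release) creates consecutive 3390 Model 1 size volumes for the WADS:

   # Four 1113-cylinder (3390 Model 1) CKD volumes, allocated consecutively
   mkckdvol -dev IBM.2107-75ABCD1 -extpool P4 -cap 1113 -name IMS_WADS_#h 0A00-0A03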
WADS has a fixed 1-byte key of ’00’x. WADS records include this CKD key field, which needs to be in cache before the data can be updated. This requirement can cause cache contention: before a WADS record can be updated, the appropriate WADS key field must be staged into cache first. This action slows the WADS write response time and shows up as an increase in disconnect time, until recently, when IMS changed the way the write I/Os are run.
In IMS V.11, an enhancement was made that allows the host software to provide an I/O
channel program indication that this is WADS, so the disk subsystem (that supports the
indication) can predict what the disk format key field is and avoid a write miss. This
enhancement requires two IMS APARs, which are PM44110 and PM19513.
Figure 18-3 shows the comparison of the performance of the IMS WADS volume before and
after the two APARs are put on and the disk subsystem has the appropriate Licensed Internal
Code to support this function. A significant reduction in the disconnect time can be observed
from the RMF report.
Volume SRE004 has the WADS data set that does not support this enhancement and has a
much higher disconnect time compared to volume SRE104, which does have the above
enhancement.
Another enhancement in IMS V.12 made the WADS channel program conform to the ECKD architecture, providing greater efficiency and reducing channel program operations.
Table 18-1 shows the response time improvement on the WADS volume for IMS V.11 with the enhancements as compared to the current IMS V.12. The response time improves from 0.384 ms to 0.344 ms, which is a 10% improvement. The volume S24$0D on address 240D is allocated on a DDM rank on the DS8700 storage system and not on an SSD rank.
Table 18-1   WADS volume response time, IMS V.11 compared to IMS V.12 (RMF device activity excerpt, times in milliseconds)
IMS V.11: 240D S24$0D 1.0H 0023 1143.49 0.384 0.000 0.025 0.000 0.179 0.001 0.204
IMS V.12: 240D S24$0D 1.0H 0023 1163.02 0.344 0.000 0.028 0.000 0.184 0.001 0.159
Part 4 Appendixes
This part includes the following topics:
Performance management process
Benchmarking
This power is the potential of the DS8000 storage system, but careful planning and
management are essential to realize that potential in a complex IT environment. Even a
well-configured system is subject to the following changes over time that affect performance:
Additional host systems
Increasing workload
Additional users
Additional DS8000 capacity
A typical case
To demonstrate the performance management process, here is a typical situation where
DS8000 performance is an issue.
Users open incident tickets to the IT Help Desk claiming that the system is slow and is
delaying the processing of orders from their clients and the submission of invoices. IT Support
investigates and detects that there is contention in I/O to the host systems. The Performance
and Capacity team is involved and analyzes performance reports together with the IT Support
teams. Each IT Support team (operating system (OS), storage, database, and application)
issues its report defining the actions necessary to resolve the problem. Certain actions might
have a marginal effect but are faster to implement; other actions might be more effective but
need more time and resources to put in place. Among the actions, the Storage Team and
Performance and Capacity Team report that additional storage capacity is required to support
the I/O workload of the application and ultimately to resolve the problem. IT Support presents
its findings and recommendations to the company’s Business Unit, requesting application
downtime to implement the changes that can be made immediately. The Business Unit
accepts the report but says that it has no money for the purchase of new storage. They ask
the IT department how they can ensure that the additional storage can resolve the
performance issue. Additionally, the Business Unit asks the IT department why the need for
additional storage capacity was not submitted as a draft proposal three months ago when the
budget was finalized for next year, knowing that the system is one of the most critical systems
of the company.
Incidents, such as this one, make you realize the distance that can exist between the IT
department and the company’s business strategy. In many cases, the IT department plays a
key role in determining the company’s strategy. Therefore, consider these questions:
How can you avoid situations like those described?
How can you make performance management become more proactive and less reactive?
What are the preferred practices for performance management?
What are the key performance indicators of the IT infrastructure and what do they mean
from the business perspective?
Are the defined performance thresholds adequate?
How can you identify the risks in managing the performance of assets (servers, storage
systems, and applications) and mitigate them?
To better align the understanding between the business and the technology, use the
Information Technology Infrastructure Library (ITIL) as a guide to develop a process for
performance management as applied to DS8000 performance and tuning.
Purpose
The purpose of performance management is to ensure that the performance of the IT
infrastructure matches the demands of the business. The following activities are involved:
Define and review performance baselines and thresholds.
Collect performance data from the DS8000 storage system.
Check whether the performance of the resources is within the defined thresholds.
Analyze performance by using collected DS8000 performance data and tuning
suggestions.
Define and review standards and IT architecture that are related to performance.
Analyze performance trends.
Size new storage capacity requirements.
Certain activities relate to the operational activities, such as the analysis of performance of
DS8000 components, and other activities relate to tactical activities, such as the performance
analysis and tuning. Other activities relate to strategic activities, such as storage capacity
sizing. You can split the process into three subprocesses:
Operational performance subprocess
Analyze the performance of DS8000 components (processor complexes, device adapters
(DAs), host adapters (HAs), and ranks) and ensure that they are within the defined
thresholds and service-level objectives (SLOs) and service-level agreements (SLAs).
Tactical performance subprocess
Analyze performance data and generate reports for tuning recommendations and the
review of baselines and performance trends.
Strategic performance subprocess
Analyze performance data and generate reports for storage sizing and the review of
standards and architectures that relate to performance.
When assigning the tasks, you can use a Responsible, Accountable, Consulted, and
Informed (RACI) matrix to list the actors and the roles that are necessary to define a process
or subprocess. A RACI diagram, or RACI matrix, is used to describe the roles and
responsibilities of various teams or people to deliver a project or perform an operation. It is
useful in clarifying roles and responsibilities in cross-functional and cross-departmental
projects and processes.
With Tivoli Storage Productivity Center, you can set performance thresholds for two major
categories:
Status change alerts
Configuration change alerts
Important: Tivoli Storage Productivity Center for Disk is not designed to monitor hardware
or to report hardware failures. You can configure the DS8000 Hardware Management
Console (HMC) to send alerts through Simple Network Management Protocol (SNMP) or
email when a hardware failure occurs.
You might also need to compare the DS8000 performance with the users’ performance
requirements. Often, these requirements are explicitly defined in formal agreements between
IT management and user management. These agreements are referred to as SLAs or SLOs.
These agreements provide a framework for measuring IT resource performance requirements
against IT resource fulfillment.
Performance SLA
A performance SLA is a formal agreement between IT Management and User
representatives concerning the performance of the IT resources. Often, these SLAs provide
goals for end-to-end transaction response times. For storage, these types of goals typically
relate to average disk response times for different types of storage. Missing the technical
goals described in the SLA results in financial penalties to the IT Service Management
providers.
Performance SLO
Performance SLOs are similar to SLAs with the exception that misses do not carry financial
penalties. Although SLO misses do not carry financial penalties, misses are a breach of
contract in many cases and can lead to serious consequences if not remedied.
Reports that show how many performance alerts and how many SLO/SLA misses occurred
over time are important. They tell you how effective your storage strategy (standards,
architectures, and allocation policies) is in the steady state. The numbers in those reports
are inversely proportional to the effectiveness of your storage strategy: the more effective
your storage strategy, the fewer performance threshold alerts are registered, and the fewer
SLO/SLA targets are missed. Such a report can be produced from exported alert records, as
sketched in the following example.
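The following Python sketch is an illustration only of how such a trend report could be built
from exported alert records. The CSV layout (columns timestamp, resource, metric, value,
and threshold) and the file name are assumptions made for this example; adapt them to
whatever your monitoring tool actually exports.

   # Minimal sketch: count threshold misses per month from a CSV export of
   # alert records. Column names and the file name are assumptions.
   import csv
   from collections import Counter
   from datetime import datetime

   def monthly_miss_counts(csv_path="alerts_export.csv"):
       counts = Counter()
       with open(csv_path, newline="") as f:
           for row in csv.DictReader(f):
               # A row counts as a miss when the measured value exceeds its threshold.
               if float(row["value"]) > float(row["threshold"]):
                   month = datetime.fromisoformat(row["timestamp"]).strftime("%Y-%m")
                   counts[month] += 1
       return counts

   if __name__ == "__main__":
       for month, misses in sorted(monthly_miss_counts().items()):
           print(f"{month}: {misses} threshold misses")

A falling month-over-month count suggests that the storage strategy and tuning actions are
working; a rising count is an early warning long before a formal SLA penalty is triggered.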
It is not necessary to implement SLOs or SLAs to discover the effectiveness of your current
storage strategy. However, the definition of an SLO/SLA requires a deep and clear
understanding of your storage strategy and of how well your DS8000 storage system is
running. That is why, before implementing this process, you should start with the tactical
performance subprocess:
Generate the performance reports.
Define tuning suggestions.
Review the baseline after implementing tuning recommendations.
Generate performance trends reports.
Inputs
The following inputs are necessary to make this process effective:
Performance trends reports of DS8000 components: Many people ask for IBM
recommended thresholds, but the best thresholds are the ones that fit your environment.
They depend on the configuration of your DS8000 storage system and the types of
workloads. For example, define thresholds for I/O operations per second (IOPS) if your
application is a transactional system; if the application is a data warehouse, define
thresholds for throughput. Also, do not expect the same performance from different ranks
where one set of ranks has 300 GB, 15-K revolutions per minute (RPM) disk drive modules
(DDMs) and another set of ranks has 4 TB, 7200 RPM Nearline DDMs. For more
information, check the outputs that are generated from the tactical performance
subprocess.
Performance SLO and performance SLA: You can define the SLO/SLA requirements in
two ways:
– By hardware (IOPS by rank or MBps by port): This type of report is the easiest way to
implement an SLO or SLA, but the most difficult one to get client agreement on. The
client normally does not understand the technical aspects of a DS8000 storage system.
– By host or application (IOPS by system or MBps by host): This type of report is most
probably the only way that you are going to get an agreement from the client, although
even that is not certain because the client sometimes does not understand the technical
aspects of the IT infrastructure. The typical way to define a performance SLA is by the
average execution time or response time of a transaction in the application. So, the
performance SLA/SLO for the DS8000 storage system is normally an internal agreement
among the support teams, which creates additional work for you to generate those
reports, and there is no predefined solution; it depends on your environment’s
configuration and the conditions that define those SLOs/SLAs. When configuring the
DS8000 storage system with SLO/SLA requirements, separate the applications or hosts
by logical subsystem (LSS): reserve two LSSs, one even and one odd, for each host,
system, or instance. The benefit of generating performance reports this way is that they
are more meaningful to the other support teams and to the client, so the level of
communication increases and the chances for misunderstandings decrease. A minimal
sketch of such a per-host roll-up follows this list.
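To illustrate the per-host reporting approach, the following Python sketch rolls volume-level
statistics up to per-host totals, assuming that each host was given its own pair of LSSs as
suggested above. The LSS is taken from the first two hexadecimal digits of the four-digit
DS8000 volume ID; the sample statistics and host names are invented for the example and
would come from your collected performance data in practice.

   # Minimal sketch: roll up volume-level IOPS and MBps to per-host totals,
   # assuming each host owns a dedicated even/odd LSS pair.
   # The sample data and host names are invented for illustration.

   LSS_TO_HOST = {"10": "hostA", "11": "hostA", "20": "hostB", "21": "hostB"}

   # volume_id -> (IOPS, MBps); DS8000 volume IDs are 4 hex digits, LSS first.
   volume_stats = {"1000": (1200, 35.0), "1101": (900, 28.5), "2000": (400, 150.0)}

   def per_host_report(stats):
       report = {}
       for vol_id, (iops, mbps) in stats.items():
           host = LSS_TO_HOST.get(vol_id[:2], "unassigned")
           totals = report.setdefault(host, {"IOPS": 0, "MBps": 0.0})
           totals["IOPS"] += iops
           totals["MBps"] += mbps
       return report

   for host, totals in per_host_report(volume_stats).items():
       print(f"{host}: {totals['IOPS']} IOPS, {totals['MBps']:.1f} MBps")

Reports in this form can be reviewed directly with the application or server teams because
the numbers map to hosts that they recognize rather than to internal DS8000 components.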
Important: When defining a DS8000-related SLA or SLO, ensure that the goals are based
on empirical evidence of performance within the environment. Application architects with
applications that are highly sensitive to changes in I/O throughput or response time must
consider measuring percentiles or standard deviations rather than average values over an
extended period. IT management must ensure that the technical requirements are
appropriate for the technology.
Although there might not be any immediate financial penalties associated with missed user
expectations, prolonged negative experiences with underperforming IT resources result in low
user satisfaction.
Outputs
The following outputs are generated by this process:
Documentation of the defined DS8000 performance thresholds. It is important to document
the agreed-to thresholds, not just for you, but also for other members of your team and for
other teams that need to know them.
DS8000 alerts for performance utilization. These alerts are generated when a DS8000
component reaches a defined level of utilization. With Tivoli Storage Productivity Center
for Disk, you can automate the performance data collection and also configure Tivoli
Storage Productivity Center to send an alert when this type of an event occurs.
Performance reports comparing the performance utilization of the DS8000 storage system
with the performance SLO and SLA.
Implement performance monitoring and alerting: After you define the DS8000 components
to monitor, set their corresponding threshold values; a minimal script-based check is
sketched after this list. For more information about how to configure Tivoli Storage
Productivity Center, see the IBM Spectrum Control or Tivoli Storage Productivity Center
documentation, found at:
http://www.ibm.com/support/knowledgecenter/SS5R93_5.2.8/com.ibm.spectrum.sc.doc/fqz0_c_wg_managing_resources.html
Publish the documentation to the IT Management team: After you implement the
monitoring, send the respective documentation to those people who need to know.
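If you also want a lightweight, script-based safety net in addition to the product tooling, a
check such as the following Python sketch can compare collected utilization samples against
the documented thresholds and report any breaches. The threshold values, metric names,
and sample data are illustrative only, not IBM recommendations; use the thresholds that you
documented for your own configuration.

   # Minimal sketch: compare collected component utilization samples against
   # documented thresholds and report breaches. All values are illustrative.

   THRESHOLDS = {"rank_util_pct": 60, "da_util_pct": 60, "ha_util_pct": 70}

   samples = [
       {"component": "R12", "metric": "rank_util_pct", "value": 72},
       {"component": "HA_0230", "metric": "ha_util_pct", "value": 41},
   ]

   def check(samples, thresholds):
       breaches = []
       for s in samples:
           limit = thresholds.get(s["metric"])
           if limit is not None and s["value"] > limit:
               breaches.append(f"{s['component']}: {s['metric']} = "
                               f"{s['value']} (threshold {limit})")
       return breaches

   for line in check(samples, THRESHOLDS):
       print("ALERT:", line)

The output of such a check can be appended to the threshold documentation or fed into
whatever ticketing or notification mechanism your team already uses.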
Performance troubleshooting
If an incident ticket is open for performance issues, you might be asked to investigate. The
following tips can help during your problem determination.
In addition to the answers to these questions, the client must provide server performance and
configuration data. For more information, see the relevant host chapters in this book.
Tip: Start with the tactical performance subprocess for the implementation of a
performance management process.
Inputs
The following inputs are necessary to make this process effective:
Product specifications: Documents that describe the characteristics and features of the
DS8000 storage system, such as data sheets, Announcement Letters, and planning
manuals
Product documentation: Documents that provide information about the installation and use
of the DS8000 storage system, such as user manuals, white papers, and IBM Redbooks
publications
Performance SLOs/performance SLAs: The documentation of performance SLO/SLAs to
which the client agreed for the DS8000 storage system.
Outputs
Performance reports with tuning recommendations and performance trends reports are the
outputs that are generated by this process.
It might be an obvious observation, but it is important to remember that IT resources are
finite and will eventually run out. In the same way, the money to invest in the IT infrastructure
is limited, which is why this process is important. In each company, there is normally a time
when the budget for the next year is decided, so even if you present a list of requirements
with performance reports to justify the investments, you might not be successful. The timing
of the request and the benefit of the investment to the business are also important
considerations.
Just keeping the IT systems running is not enough. The IT Manager and Chief Information
Officer (CIO) must show the business benefit for the company. Usually, this benefit means
providing the service at the lowest cost, but also showing a financial advantage that the
services provide. This benefit is how the IT industry grew over the years while it increased
productivity, reduced costs, and enabled new opportunities.
Check with your IT Manager or IT Architect to learn when the budget is set, and start 3 to 4
months before that date. You can then define the priorities for the IT infrastructure for the
coming year to meet the business requirements.
Inputs
The following inputs are required to make this process effective:
Performance reports with tuning recommendations
Performance trends reports
Define priorities of new investments: In defining the priorities of where to invest, you must
consider objectives such as the following ones:
– Reduce cost: The simplest example is storage consolidation. There might be several
storage systems in your data center that are nearing the ends of their useful lives. The
costs of maintenance are increasing, and the storage systems use more energy than
new models. The IT Architect can create a case for storage consolidation, but needs
your help to specify and size the new storage.
– Increase availability: There are production systems that need to be available 24x7. The
IT Architect must submit a new solution for this case to provide data mirroring. The IT
Architect requires your help to specify the new storage for the secondary site and to
provide figures for the necessary performance.
Tip: For environments with multiple applications on the same physical servers or on
logical partitions (LPARs) that use the same Virtual I/O Servers (VIOSs), defining new
requirements can be challenging. Build profiles at the DS8000 level first and eventually
move into more in-depth study and understanding of the other shared resources in the
environment.
Plan configuration of a new DS8000 storage system: Configuring the DS8000 storage
system to meet the specific I/O performance requirements of an application reduces the
probability of production performance issues. To produce a design to meet these
requirements, Storage Management needs to know the following items:
– IOPS
– Read/write ratios
– I/O transfer size
– Access type: Sequential or random
For help in converting application profiles to I/O workload, see Chapter 5, “Understanding
your workload” on page 141.
After the I/O requirements are identified, documented, and agreed upon, the DS8000
layout and logical planning can begin. For more information and considerations for
planning for performance, see Chapter 4, “Logical configuration performance
considerations” on page 83.
For existing applications, you can use Disk Magic to analyze an application I/O profile.
Details about Disk Magic are in Chapter 6, “Performance planning tools” on page 159. A
rough back-of-the-envelope estimate of the back-end load is sketched below.
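As a first-order illustration only of how such an I/O profile translates into back-end load (Disk
Magic remains the appropriate modeling tool), the following Python sketch applies the
common RAID write-penalty rule of thumb. The input numbers are examples, and cache hits,
sequential optimizations, and Easy Tier behavior are deliberately ignored, so treat the result
as a sanity check rather than a sizing answer.

   # Minimal sketch: rough back-end IOPS estimate from a front-end I/O profile
   # using the common RAID write-penalty rule of thumb (RAID 10 = 2,
   # RAID 5 = 4, RAID 6 = 6). Cache hits and sequential optimizations are
   # ignored, so this is a sanity check, not a sizing result.

   WRITE_PENALTY = {"RAID10": 2, "RAID5": 4, "RAID6": 6}

   def backend_iops(front_end_iops, read_ratio, raid_level):
       reads = front_end_iops * read_ratio
       writes = front_end_iops * (1 - read_ratio)
       return reads + writes * WRITE_PENALTY[raid_level]

   # Example: 20,000 front-end IOPS, 70% reads, RAID 6 ranks.
   print(f"Estimated back-end IOPS: {backend_iops(20000, 0.70, 'RAID6'):.0f}")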
Appendix B. Benchmarking
Benchmarking storage systems is complex because of all of the hardware and software that
are used for storage systems. This appendix describes the goals and the ways to conduct an
effective storage benchmark.
To conduct a benchmark, you need a solid understanding of all of the parts of your
environment. This understanding includes the storage system requirements and also the
storage area network (SAN) infrastructure, the server environments, and the applications.
Emulating the actual environment, including actual applications and data, along with user
simulation, provides efficient and accurate analysis of the performance of the storage system
tested. A key characteristic of a performance benchmark test is that its results must be
reproducible to validate the integrity of the test.
What to benchmark
Benchmarking can be simple, such as when you want to see the performance impact of
upgrading from an 8 Gb host adapter (HA) to a 16 Gb HA. The simplest scenario is when the
new HA card is already installed on your test system: you run your normal test workload by
using the old HA card and then rerun the same workload by using the new HA card.
Analyzing the performance metrics of the two runs gives you a comparison of the
performance, and ideally an improvement, on the 16 Gb HA. The comparison can be done for
the response time, port intensity, HA utilization, and, in the case of z/OS, the connect time. A
simple comparison script is sketched below.
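Once the metrics from the two runs are exported, a comparison can be scripted as in the
following Python sketch. The metric names and values are placeholders; substitute the
response time, port, and utilization figures that you actually collected.

   # Minimal sketch: percentage change between a baseline run (8 Gb HA)
   # and a comparison run (16 Gb HA). Metric names and values are placeholders.

   baseline = {"avg_response_ms": 4.2, "ha_util_pct": 78.0, "port_mbps": 620.0}
   upgraded = {"avg_response_ms": 3.1, "ha_util_pct": 55.0, "port_mbps": 910.0}

   for metric, old in baseline.items():
       new = upgraded[metric]
       delta_pct = (new - old) / old * 100
       print(f"{metric}: {old} -> {new} ({delta_pct:+.1f}%)")

Keeping the output of each run in this simple form also makes it easy to verify that repeated
runs are reproducible.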
Benchmarking can be a complex and laborious project. The hardware that is required to do
the benchmark can be substantial. An example is if you want to benchmark your production
workload on the new DS8880 storage system in a Metro/Global Mirror environment at a
1000 km distance. Fortunately, there is equipment that can simulate the distance from your
primary to secondary site, so you do not need to have a physical storage system at a remote
location 1000 km away to perform this benchmark.
Performance is not the only factor to consider in benchmark results; reliability and
cost-effectiveness must also be considered. Balancing benchmark performance results with
the reliability, functions, and total cost of ownership (TCO) of the storage system provides a
global view of the value of the storage product.
Selecting one of these Storage Performance Council (SPC) workloads depends on how
representative that workload is of your current production workload or of a new workload that
you plan to implement. If none of them fits your needs, you must either build your own
workload or ask your IBM account team or IBM Business Partner for assistance in creating
one. This way, the benchmark result reflects what you expected to evaluate in the first place.
The OLTP category typically has many users, who all access the same storage system and a
common set of files. The requests are typically random access and spread across many files
with a small transfer size (typically 4-K records).
To identify the specification of your production workload, you can use monitoring tools that are
available at the operating system level.
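For example, on a Linux server, the output of iostat can be turned into a rough I/O profile.
The following Python sketch derives the I/O rate, read ratio, and average transfer size for one
device; it is a sketch only, and the column names (r/s, w/s, rkB/s, and wkB/s) vary between
operating systems and sysstat versions, so adjust them to match your environment.

   # Minimal sketch: derive I/O rate, read ratio, and average transfer size
   # from Linux "iostat -x" output for one device. Column names vary between
   # operating systems and sysstat versions, so adjust them as needed.
   import subprocess

   def profile_device(device="sda"):
       out = subprocess.run(["iostat", "-x", "1", "2"],
                            capture_output=True, text=True, check=True).stdout
       header, row = None, None
       for line in out.splitlines():
           fields = line.split()
           if fields and fields[0].startswith("Device"):
               header = fields
           elif header and fields and fields[0] == device:
               row = dict(zip(header, fields))   # keep the last (interval) sample
       if not row:
           raise ValueError(f"device {device} not found in iostat output")
       reads, writes = float(row["r/s"]), float(row["w/s"])
       rkb, wkb = float(row["rkB/s"]), float(row["wkB/s"])
       total = reads + writes
       return {"iops": total,
               "read_ratio": reads / total if total else 0.0,
               "avg_kb_per_io": (rkb + wkb) / total if total else 0.0}

   print(profile_device("sda"))

Similar figures can be obtained with the native monitoring tools of other platforms, for
example, iostat on AIX or RMF reports on z/OS.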
To set up a benchmark environment, there are two ways to generate the workload.
The first way to generate the workload, which is the most complex, is to create a copy of
the production environment, including the applications software and the application data.
In this case, you must ensure that the application is well-configured and optimized on the
server operating system. The data volume also must be representative of the production
environment. Depending on your application, a workload can be generated by using
application scripts or an external transaction simulation tool. These tools provide a
simulation of users accessing your application. With an external simulation tool, you
first record several typical requests from several users and then replay these requests
multiple times. This process can emulate hundreds or thousands of
concurrent users to put the application through the rigors of real-life user loads and
measure the response times of key performance metrics.
The other way to generate the workload is to use a standard workload generator or an I/O
driver. These tools, which are specific to each operating system, produce different kinds of
I/O loads on the storage systems. You can configure and tune these tools to match your
application workload. Here are the main performance metrics that must be tuned to closely
simulate your current workload (a minimal driver sketch follows this list):
– I/O Rate
– Read to Write Ratio
– Read Hit Ratio
– Read and Write Transfer Size
– % Read and Write Sequential I/O
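As a hedged illustration of how such a driver can be parameterized, the following Python
sketch issues reads and writes against a test file according to these five metrics. It is not a
replacement for a proper I/O driver: it runs through the file system cache, uses crude
sleep-based rate control, and does not model cache hit ratios; the file name and parameter
values are placeholders.

   # Minimal sketch of a tunable I/O driver: issues reads and writes against a
   # test file at a target I/O rate, read/write mix, transfer size, and
   # sequential percentage. It goes through the file system cache, so it only
   # approximates raw device behavior.
   import os, random, time

   FILE_PATH   = "testfile.bin"   # placeholder test file
   FILE_SIZE   = 1 * 1024**3      # 1 GiB
   IO_RATE     = 500              # target I/Os per second
   READ_RATIO  = 0.70             # 70% reads, 30% writes
   XFER_SIZE   = 4096             # 4 KB transfer size
   SEQ_PCT     = 0.20             # 20% sequential, 80% random
   RUNTIME_SEC = 30

   def run():
       if not os.path.exists(FILE_PATH):
           with open(FILE_PATH, "wb") as f:   # create a sparse test file
               f.truncate(FILE_SIZE)
       buf = os.urandom(XFER_SIZE)
       fd = os.open(FILE_PATH, os.O_RDWR)
       offset, end = 0, time.time() + RUNTIME_SEC
       interval = 1.0 / IO_RATE
       try:
           while time.time() < end:
               if random.random() >= SEQ_PCT:                    # random I/O
                   offset = random.randrange(FILE_SIZE // XFER_SIZE) * XFER_SIZE
               os.lseek(fd, offset, os.SEEK_SET)
               if random.random() < READ_RATIO:
                   os.read(fd, XFER_SIZE)                        # read
               else:
                   os.write(fd, buf)                             # write
               offset = (offset + XFER_SIZE) % FILE_SIZE         # next sequential block
               time.sleep(interval)                              # crude rate control
       finally:
           os.close(fd)

   if __name__ == "__main__":
       run()

Dedicated I/O drivers and workload generators provide far more control (direct I/O, queue
depth, multiple threads, and hit-ratio shaping), but a sketch such as this one shows how the
five metrics map onto concrete driver parameters.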
After the benchmark is completed, the performance measurement data that is collected can
then be analyzed. The benchmark can be repeated and compared to the previous results.
This action ensures that there is no anomaly during the workload run.
Considering all the efforts and resources that are required to set up the benchmark, it is
prudent to plan other benchmark scenarios that you might want to run. As an example, you
might run other types of workloads, such as OLTP and batch. Another scenario might be running
at different cache sizes.
Based on the monitoring reports, bottlenecks can be identified. Then, either the workload
can be modified or, if possible, additional hardware can be added, such as more flash
ranks.
During a benchmark, each scenario must be run several times to understand how the
different components perform by using monitoring tools, to identify bottlenecks, and then, to
test different ways to get an overall performance improvement by tuning each component.
Related publications
The publications that are listed in this section are considered suitable for a more detailed
discussion of the topics covered in this book.
Other publications
These publications are also relevant as further information sources:
AIX Disk Queue Depth Tuning for Performance, TD105745
Application Programming Interface Reference, GC27-4211
Command-Line Interface User's Guide, SC27-8526
Driving Business Value on Power Systems with Solid-State Drives, POW03025USEN
DS8870 Introduction and Planning Guide, GC27-4209
Host Systems Attachment Guide, SC27-8527
IBM DS8000 Performance Configuration Guidelines for Implementing Oracle Databases
with ASM, WP101375
IBM DS8880 Introduction and Planning Guide, GC27-8525
IBM i Shared Storage Performance using IBM System Storage DS8000 I/O Priority
Manager, WP101935
IBM System Storage DS8700 and DS8800 Performance with Easy Tier 2nd Generation,
WP101961
IBM System Storage DS8800 and DS8700 Performance with Easy Tier 3rd Generation,
WP102024
IBM System Storage DS8800 Performance Whitepaper, WP102025
DS8800 and DS8700 Introduction and Planning Guide, GC27-2297
Multipath Subsystem Device Driver User’s Guide, GC52-1309
Tuning SAP on DB2 for z/OS on z Systems, WP100287
Tuning SAP with DB2 on IBM AIX, WP101601
Tuning SAP with Oracle on IBM AIX, WP100377
“Web Power – New browser-based Job Watcher tasks help manage your IBM i
performance” in the IBM Systems Magazine:
http://www.ibmsystemsmag.com/ibmi