Front cover

IBM System Storage DS8000
Performance Monitoring and Tuning

Axel Westphal
Bert Dufrasne
Wilhelm Gardt
Jana Jamsek
Peter Kimmel
Flavio Morais
Paulus Usong
Alexander Warmuth
Kenta Yuge

Redbooks
International Technical Support Organization

IBM System Storage DS8000 Performance Monitoring and Tuning

April 2016

SG24-8318-00
Note: Before using this information and the product it supports, read the information in “Notices” on
page xiii.

First Edition (April 2016)

This edition applies to Version 8.0 of the IBM DS8884 and DS8886 models (product numbers 2831–2834) and
Version 7.5 of the IBM DS8870 (product numbers 2421–2424).

© Copyright International Business Machines Corporation 2016. All rights reserved.


Note to U.S. Government Users Restricted Rights -- Use, duplication or disclosure restricted by GSA ADP Schedule
Contract with IBM Corp.
Contents

Notices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
Trademarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiv

IBM Redbooks promotions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv

Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii
Authors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii
Now you can become a published author, too! . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xix
Comments welcome. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xix
Stay connected to IBM Redbooks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xx

Part 1. IBM System Storage DS8000 performance considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

Chapter 1. IBM System Storage DS8880 family characteristics . . . . . . . . . . . . . . . . . . . 3


1.1 The storage server challenge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1.1 Performance numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1.2 Recommendations and rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1.3 Modeling your workload . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1.4 Allocating hardware components to workloads. . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Meeting the challenge: The DS8000 product family . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.1 DS8880 family models and characteristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3 DS8000 performance characteristics overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3.1 Advanced caching techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3.2 IBM Subsystem Device Driver . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.3.3 Performance characteristics for z Systems. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.3.4 Easy Tier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3.5 I/O Priority Manager . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

Chapter 2. Hardware configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25


2.1 Storage system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.2 Processor memory and cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.2.1 Cache and I/O operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.2.2 Determining the correct amount of cache storage . . . . . . . . . . . . . . . . . . . . . . . . 33
2.3 I/O enclosures and the PCIe infrastructure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.4 Disk subsystem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.4.1 Device adapters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.4.2 DS8000 Fibre Channel switched interconnection at the back end . . . . . . . . . . . . 36
2.4.3 Disk enclosures. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.4.4 DDMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.4.5 Enterprise drives compared to Nearline drives and flash . . . . . . . . . . . . . . . . . . . 39
2.4.6 Installation order . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.5 Host adapters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.5.1 Fibre Channel and FICON host adapters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.5.2 Multiple paths to Open Systems servers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.5.3 Multiple paths to z Systems servers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.5.4 Spreading host attachments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.6 Tools to aid in hardware planning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.6.1 White papers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.6.2 Disk Magic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

2.6.3 Capacity Magic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

Chapter 3. Logical configuration concepts and terminology . . . . . . . . . . . . . . . . . . . . 47


3.1 RAID levels and spares. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.1.1 RAID 5 overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.1.2 RAID 6 overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.1.3 RAID 10 overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.1.4 Array across loops . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.1.5 Spare creation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.2 The abstraction layers for logical configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.2.1 Array sites . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.2.2 Arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.2.3 Ranks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.2.4 Extent pools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.2.5 Logical volumes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.2.6 Space-efficient volumes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.2.7 Extent allocation methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.2.8 Allocating, deleting, and modifying LUNs and CKD volumes . . . . . . . . . . . . . . . . 65
3.2.9 Logical subsystem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.2.10 Address groups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.2.11 Volume access . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.2.12 Summary of the DS8000 logical configuration hierarchy . . . . . . . . . . . . . . . . . . 69
3.3 Data placement on ranks and extent pools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.3.1 Rank and array considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.3.2 Extent pool considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.3.3 Easy Tier considerations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

Chapter 4. Logical configuration performance considerations . . . . . . . . . . . . . . . . . . 83


4.1 Reviewing the tiered storage concepts and Easy Tier . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.1.1 Tiered storage concept . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.1.2 IBM System Storage Easy Tier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.2 Configuration principles for optimal performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.2.1 Workload isolation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.2.2 Workload resource-sharing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.2.3 Workload spreading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.2.4 Using workload isolation, resource-sharing, and spreading . . . . . . . . . . . . . . . . . 91
4.3 Analyzing application workload characteristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.3.1 Determining skew and storage requirements for Easy Tier . . . . . . . . . . . . . . . . . 92
4.3.2 Determining isolation requirements. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.3.3 Reviewing remaining workloads for feasibility of resource-sharing. . . . . . . . . . . . 94
4.4 Planning allocation of disk and host connection capacity . . . . . . . . . . . . . . . . . . . . . . . 94
4.5 Planning volume and host connection spreading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.5.1 Spreading volumes for isolated and resource-sharing workloads. . . . . . . . . . . . . 95
4.5.2 Spreading host connections for isolated and resource-sharing workloads . . . . . . 96
4.6 Planning array sites. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.7 Planning RAID arrays and ranks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.8 Planning extent pools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
4.8.1 Overview of DS8000 extent pool configurations with Easy Tier . . . . . . . . . . . . . 104
4.8.2 Single-tier extent pools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
4.8.3 Multitier extent pools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
4.8.4 Additional implementation considerations for multitier extent pools . . . . . . . . . . 113
4.8.5 Extent allocation in homogeneous multi-rank extent pools . . . . . . . . . . . . . . . . . 114
4.8.6 Balancing workload across available resources . . . . . . . . . . . . . . . . . . . . . . . . . 117

4.8.7 Extent pool configuration examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
4.8.8 Assigning workloads to extent pools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
4.9 Planning address groups, LSSs, volume IDs, and CKD PAVs . . . . . . . . . . . . . . . . . . 123
4.9.1 Volume configuration scheme by using application-related LSS/LCU IDs . . . . . 125
4.10 I/O port IDs, host attachments, and volume groups . . . . . . . . . . . . . . . . . . . . . . . . . 128
4.10.1 I/O port planning considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
4.11 Implementing and documenting a DS8000 logical configuration . . . . . . . . . . . . . . . 133

Part 2. IBM System Storage DS8000 performance management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139

Chapter 5. Understanding your workload . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141


5.1 General workload types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
5.1.1 Typical online transaction processing workload . . . . . . . . . . . . . . . . . . . . . . . . . 142
5.1.2 Microsoft Exchange Server workload . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
5.1.3 Sequential workload . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
5.1.4 Batch jobs workload . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
5.1.5 Sort jobs workload . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
5.1.6 Read-intensive cache friendly and unfriendly workloads . . . . . . . . . . . . . . . . . . 144
5.2 Database workload . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
5.2.1 Database query workload . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
5.2.2 Database logging workload. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
5.2.3 Database transaction environment workload . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
5.2.4 Database utilities workload . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
5.3 Application workload . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
5.3.1 General file serving . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
5.3.2 Online transaction processing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
5.3.3 Data mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
5.3.4 Video on demand . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
5.3.5 Data warehousing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
5.3.6 Engineering and scientific applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
5.3.7 Digital video editing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
5.4 Profiling workloads in the design phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
5.5 Understanding your workload type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
5.5.1 Monitoring the DS8000 workload . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
5.5.2 Monitoring the host workload . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
5.5.3 Modeling the workload and sizing the system. . . . . . . . . . . . . . . . . . . . . . . . . . . 155

Chapter 6. Performance planning tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159


6.1 Disk Magic. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
6.1.1 The need for performance planning and modeling tools. . . . . . . . . . . . . . . . . . . 160
6.1.2 Overview and characteristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
6.1.3 Disk Magic data collection for a z/OS environment. . . . . . . . . . . . . . . . . . . . . . . 161
6.1.4 Disk Magic data collection for Open Systems environment . . . . . . . . . . . . . . . . 162
6.1.5 Configuration options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
6.1.6 Workload growth projection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
6.1.7 Disk Magic modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
6.2 Disk Magic for z Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
6.2.1 Processing the DMC file . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
6.2.2 z Systems model to merge the two DS8870 storage systems to a DS8886 storage system . . . . . . . . . . . . . . . . . 176
6.2.3 Disk Magic performance projection for the z Systems model . . . . . . . . . . . . . . . 188
6.2.4 Workload growth projection for a z Systems model . . . . . . . . . . . . . . . . . . . . . . 190
6.3 Disk Magic for Open Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
6.3.1 Processing the CSV output file to create the base model for the DS8870 storage system . . . . . . . . . . . . . . . . . 192
6.3.2 Migrating the DS8870 storage system to the DS8886 storage system. . . . . . . . 199
6.4 Disk Magic Easy Tier modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
6.4.1 Predefined skew levels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
6.4.2 Current workload existing skew level . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
6.4.3 Heat map . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
6.5 Storage Tier Advisor Tool . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
6.5.1 Storage Tier Advisor Tool output samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
6.5.2 Storage Tier Advisor Tool for Disk Magic skew . . . . . . . . . . . . . . . . . . . . . . . . . 218

Chapter 7. Practical performance management. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221


7.1 Introduction to practical performance management . . . . . . . . . . . . . . . . . . . . . . . . . . 222
7.2 Performance management tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
7.2.1 IBM Spectrum Control overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
7.2.2 IBM Spectrum Control measurement of DS8000 components . . . . . . . . . . . . . . 224
7.2.3 General IBM Spectrum Control measurement considerations . . . . . . . . . . . . . . 236
7.3 IBM Spectrum Control data collection considerations. . . . . . . . . . . . . . . . . . . . . . . . . 238
7.3.1 Time stamps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
7.3.2 Duration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
7.3.3 Intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
7.4 IBM Spectrum Control performance metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
7.4.1 DS8000 key performance indicator thresholds . . . . . . . . . . . . . . . . . . . . . . . . . . 239
7.4.2 Alerts and thresholds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
7.5 IBM Spectrum Control reporting options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
7.5.1 WebUI reports . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
7.5.2 Cognos reports . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
7.5.3 TPCTOOL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
7.5.4 Native SQL Reports . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
7.6 Advanced analytics and self-service provisioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
7.6.1 Advanced analytics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
7.6.2 Self-service provisioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
7.7 Using IBM Spectrum Control network functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
7.8 End-to-end analysis of I/O performance problems . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
7.9 Performance analysis examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
7.9.1 Example 1: Disk array bottleneck . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
7.9.2 Example 2: Hardware connectivity part 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
7.9.3 Example 3: Hardware connectivity part 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
7.9.4 Example 4: Port bottleneck . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
7.9.5 Example 5: Server HBA bottleneck. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
7.10 IBM Spectrum Control in mixed environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262

Part 3. Performance considerations for host systems and databases . . . . . . . . . . . . . . . . . . . . . . . . 265

Chapter 8. Host attachment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267


8.1 DS8000 host attachment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
8.2 Attaching Open Systems hosts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
8.2.1 Fibre Channel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
8.2.2 Storage area network implementations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
8.2.3 Multipathing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
8.3 Attaching z Systems hosts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
8.3.1 FICON . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
8.3.2 FICON configuration and sizing considerations . . . . . . . . . . . . . . . . . . . . . . . . . 279
8.3.3 z/VM, z/VSE, and Linux on z Systems attachment . . . . . . . . . . . . . . . . . . . . . . . 283

Chapter 9. Performance considerations for UNIX servers . . . . . . . . . . . . . . . . . . . . . 285
9.1 Planning and preparing UNIX servers for performance . . . . . . . . . . . . . . . . . . . . . . . 286
9.1.1 UNIX disk I/O architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 286
9.1.2 Modern approach to the logical volume management in UNIX systems. . . . . . . 289
9.1.3 Queue depth parameter (qdepth) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
9.2 AIX disk I/O components. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
9.2.1 AIX Journaled File System and Journaled File System 2 . . . . . . . . . . . . . . . . . . 296
9.2.2 Symantec Veritas File System for AIX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299
9.2.3 IBM Spectrum Scale . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299
9.2.4 IBM Logical Volume Manager . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
9.2.5 Symantec Veritas Volume Manager . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305
9.2.6 IBM Subsystem Device Driver for AIX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 306
9.2.7 Multipath I/O with IBM Subsystem Device Driver Path Control Module . . . . . . . 306
9.2.8 Symantec Veritas Dynamic MultiPathing for AIX . . . . . . . . . . . . . . . . . . . . . . . . 307
9.2.9 Fibre Channel adapters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
9.2.10 Virtual I/O Server . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
9.3 AIX performance monitoring tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312
9.3.1 AIX vmstat. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312
9.3.2 pstat . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314
9.3.3 AIX iostat. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316
9.3.4 lvmstat. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320
9.3.5 topas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
9.3.6 nmon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
9.3.7 fcstat . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
9.3.8 filemon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326
9.4 Verifying your system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329

Chapter 10. Performance considerations for Microsoft Windows servers . . . . . . . . 335


10.1 General Windows performance tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336
10.2 I/O architecture overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336
10.3 Windows Server 2008 I/O Manager enhancements . . . . . . . . . . . . . . . . . . . . . . . . . 337
10.4 File system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 338
10.4.1 Windows file system overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 338
10.4.2 NTFS guidelines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 338
10.4.3 Paging file . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339
10.5 Volume management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 340
10.5.1 Microsoft Logical Disk Manager . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 340
10.5.2 Veritas Volume Manager . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 340
10.5.3 Determining volume layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341
10.6 Multipathing and the port layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344
10.6.1 SCSIport scalability issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344
10.6.2 Storport scalability features. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344
10.6.3 Subsystem Device Driver . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
10.6.4 Subsystem Device Driver Device Specific Module . . . . . . . . . . . . . . . . . . . . . . 345
10.6.5 Veritas Dynamic MultiPathing for Windows . . . . . . . . . . . . . . . . . . . . . . . . . . . 346
10.7 Host bus adapter settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 346
10.8 I/O performance measurement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 346
10.8.1 Key I/O performance metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 348
10.8.2 Windows Performance console (perfmon) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 350
10.8.3 Performance log configuration and data export . . . . . . . . . . . . . . . . . . . . . . . . 351
10.8.4 Collecting configuration data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 352
10.8.5 Correlating performance and configuration data. . . . . . . . . . . . . . . . . . . . . . . . 357

Chapter 11. Performance considerations for VMware . . . . . . . . . . . . . . . . . . . . . . . . . 363
11.1 Disk I/O architecture overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364
11.2 vStorage APIs for Array Integration support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 368
11.3 Host type for the DS8000 storage system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
11.4 Multipathing considerations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
11.5 Performance monitoring tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 372
11.5.1 Virtual Center performance statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 372
11.5.2 Performance monitoring with esxtop. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
11.5.3 Guest-based performance monitoring. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 375
11.5.4 VMware specific tuning for maximum performance . . . . . . . . . . . . . . . . . . . . . 375
11.5.5 Workload spreading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 375
11.5.6 Virtual machines sharing the LUN. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 376
11.5.7 ESXi file system considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 377
11.5.8 Aligning partitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 380

Chapter 12. Performance considerations for Linux . . . . . . . . . . . . . . . . . . . . . . . . . . . 385


12.1 Supported platforms and distributions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 386
12.2 Linux disk I/O architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387
12.2.1 I/O subsystem architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387
12.2.2 Cache and locality of reference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 388
12.2.3 Block layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 390
12.2.4 I/O device driver . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 390
12.3 Specific configuration for storage performance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 391
12.3.1 Host bus adapter for Linux . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 391
12.3.2 Multipathing in Linux . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 393
12.3.3 Logical Volume Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 395
12.3.4 Tuning the disk I/O scheduler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 397
12.3.5 Using ionice to assign I/O priority . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 399
12.3.6 File systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 400
12.4 Linux performance monitoring tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403
12.4.1 Gathering configuration data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403
12.4.2 Disk I/O performance indicators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 409
12.4.3 Identifying disk bottlenecks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 410

Chapter 13. Performance considerations for the IBM i system . . . . . . . . . . . . . . . . . 415


13.1 IBM i storage architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 416
13.1.1 Single-level storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 416
13.1.2 Object-based architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 416
13.1.3 Storage management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417
13.1.4 Disk pools in the IBM i system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 418
13.2 Fibre Channel adapters and Multipath . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 419
13.2.1 FC adapters for native connection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 419
13.2.2 FC adapters in VIOS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 420
13.2.3 Multipath . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 421
13.3 Performance guidelines for hard disk drives in a DS8800 storage system with the IBM i system . . . . . . . . . . . . 421
13.3.1 RAID level . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 421
13.3.2 Number of ranks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 422
13.3.3 Number and size of LUNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 423
13.3.4 DS8800 extent pools for IBM i workloads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 424
13.3.5 Disk Magic modeling for an IBM i system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 425
13.4 Preferred practices for implementing IBM i workloads on flash cards in a DS8870 storage system . . . . . . . . . . . 425
13.4.1 Testing environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 425

13.4.2 Workloads for testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 426
13.4.3 Testing scenarios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 427
13.4.4 Test results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 427
13.4.5 Conclusions and recommendations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 434
13.5 Preferred practices for implementing IBM i workloads on flash cards in a DS8886 storage system . . . . . . . . . . . 435
13.5.1 Testing environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 435
13.5.2 Workloads for testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 435
13.5.3 Testing scenarios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 436
13.5.4 Test results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 436
13.5.5 Conclusions and preferred practices. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 442
13.6 Analyzing performance data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 443
13.6.1 IBM i performance tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 443
13.6.2 DS8800 performance tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 448
13.6.3 Periods and intervals of collecting data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 448
13.7 Easy Tier with the IBM i system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 448
13.7.1 Hot data in an IBM i workload . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 448
13.7.2 IBM i methods for hot-spot management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 449
13.7.3 Skew level of an IBM i workload . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 452
13.7.4 Using Easy Tier with the IBM i system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 454
13.8 I/O Priority Manager with the IBM i system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 455

Chapter 14. Performance considerations for IBM z Systems servers . . . . . . . . . . . . 459


14.1 DS8000 performance monitoring with RMF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 460
14.1.1 RMF Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 460
14.1.2 Direct Access Device Activity report . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 464
14.1.3 I/O response time components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 465
14.1.4 I/O Queuing Activity report . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 468
14.1.5 FICON host channel report . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 469
14.1.6 FICON Director Activity report. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 470
14.1.7 Cache and NVS report . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 471
14.1.8 Enterprise Disk Systems report. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 473
14.1.9 Alternatives and supplements to RMF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 475
14.2 DS8000 and z Systems planning and configuration . . . . . . . . . . . . . . . . . . . . . . . . . 476
14.2.1 Sizing considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 477
14.2.2 Optimizing performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 477
14.3 Problem determination and resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 486
14.3.1 Sources of information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 486
14.3.2 Identifying critical and restrained resources . . . . . . . . . . . . . . . . . . . . . . . . . . . 487
14.3.3 Corrective actions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 489

Chapter 15. IBM System Storage SAN Volume Controller attachment . . . . . . . . . . . 491
15.1 IBM System Storage SAN Volume Controller . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 492
15.1.1 SAN Volume Controller concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 492
15.1.2 SAN Volume Controller multipathing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 494
15.1.3 SAN Volume Controller Advanced Copy Services . . . . . . . . . . . . . . . . . . . . . . 495
15.2 SAN Volume Controller performance considerations . . . . . . . . . . . . . . . . . . . . . . . . 496
15.3 DS8000 performance considerations with SAN Volume Controller . . . . . . . . . . . . . 498
15.3.1 DS8000 array . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 498
15.3.2 DS8000 rank format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 498
15.3.3 DS8000 extent pool implications. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 499
15.3.4 DS8000 volume considerations with SAN Volume Controller. . . . . . . . . . . . . . 500
15.3.5 Volume assignment to SAN Volume Controller . . . . . . . . . . . . . . . . . . . . . . . . 501

15.4 Performance monitoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 501
15.5 Sharing the DS8000 storage system between various server types and the SAN Volume Controller . . . . . . . . . . . . 502
15.5.1 Sharing the DS8000 storage system between Open Systems servers and the SAN Volume Controller . . . . . . . . . . . 502
15.5.2 Sharing the DS8000 storage system between z Systems servers and the SAN Volume Controller . . . . . . . . . . . . . 503
15.6 Configuration guidelines for optimizing performance . . . . . . . . . . . . . . . . . . . . . . . . 503
15.7 Where to place flash . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 504
15.8 Where to place Easy Tier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 506

Chapter 16. IBM ProtecTIER data deduplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 509


16.1 IBM System Storage TS7600 ProtecTIER data deduplication . . . . . . . . . . . . . . . 510
16.2 DS8000 attachment considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 511

Chapter 17. Databases for open performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 513


17.1 DS8000 with DB2 for Linux, UNIX, and Windows . . . . . . . . . . . . . . . . . . . . . . . . . . . 514
17.1.1 DB2 for Linux, UNIX, and Windows storage concepts . . . . . . . . . . . . . . . . . . . 514
17.2 DB2 for Linux, UNIX, and Windows with DS8000 performance recommendations . 519
17.2.1 DS8000 volume layout for databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 519
17.2.2 Know where your data is. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 519
17.2.3 Balance workload across DS8000 resources . . . . . . . . . . . . . . . . . . . . . . . . . . 519
17.2.4 Use DB2 to stripe across containers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 520
17.2.5 Selecting DB2 logical sizes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 520
17.2.6 Selecting the DS8000 logical disk sizes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 522
17.2.7 Multipathing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 523
17.3 Oracle with DS8000 performance considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . 524
17.3.1 Architecture overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 524
17.3.2 DS8000 performance considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 525
17.3.3 Oracle for AIX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 526
17.4 Database setup with a DS8000 storage system: Preferred practices . . . . . . . . . . . . 528
17.4.1 The Oracle stripe and mirror everything approach . . . . . . . . . . . . . . . . . . . . . . 528
17.4.2 DS8000 RAID policy and striping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 528
17.4.3 LVM striping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 529

Chapter 18. Database for IBM z/OS performance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 531


18.1 DB2 in a z/OS environment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 532
18.1.1 Understanding your database workload . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 532
18.1.2 DB2 overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 533
18.1.3 DB2 storage objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 533
18.1.4 DB2 data set types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 534
18.2 DS8000 considerations for DB2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 535
18.3 DB2 with DS8000 performance recommendations . . . . . . . . . . . . . . . . . . . . . . . . . . 536
18.3.1 Knowing where your data is . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 536
18.3.2 Balancing workload across DS8000 resources. . . . . . . . . . . . . . . . . . . . . . . . . 536
18.3.3 Solid-state drives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 536
18.3.4 High-Performance Flash Enclosure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 538
18.3.5 High Performance FICON and FICON Express 16S . . . . . . . . . . . . . . . . . . . . 538
18.3.6 DB2 Adaptive List Prefetch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 539
18.3.7 Taking advantage of VSAM data striping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 540
18.3.8 Large volumes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 541
18.3.9 Modified Indirect Data Address Words . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 541
18.3.10 Adaptive Multi-stream Prefetching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 541
18.3.11 DB2 burst write . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 541

18.3.12 DB2 / Easy Tier integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 542
18.3.13 Bypass extent serialization in Metro Mirror . . . . . . . . . . . . . . . . . . . . . . . . . . . 542
18.3.14 zHyperWrite . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 543
18.3.15 Monitoring the DS8000 performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 543
18.4 IMS in a z/OS environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 543
18.4.1 IMS overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 543
18.4.2 IMS logging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 544
18.5 DS8000 storage system considerations for IMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . 545
18.6 IMS with DS8000 performance recommendations . . . . . . . . . . . . . . . . . . . . . . . . . . 545
18.6.1 Balancing workload across DS8000 resources. . . . . . . . . . . . . . . . . . . . . . . . . 545
18.6.2 Write ahead data set volumes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 546
18.6.3 Monitoring DS8000 performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 547

Part 4. Appendixes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 549

Appendix A. Performance management process. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 551


Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 552
Purpose . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 553
Operational performance subprocess . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 555
Inputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 556
Outputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 557
Tasks, actors, and roles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 557
Performance troubleshooting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 558
Tactical performance subprocess . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 560
Inputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 560
Outputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 560
Tasks, actors, and roles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 561
Strategic performance subprocess . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 562
Inputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 562
Outputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 563
Tasks, actors, and roles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 563

Appendix B. Benchmarking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 565


Goals of benchmarking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 566
Requirements for a benchmark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 567
Defining the benchmark architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 567
Defining the benchmark workload. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 567
Running the benchmark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 568
Monitoring the performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 569
Defining the benchmark time frame . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 569
Using the benchmark results to configure the storage system . . . . . . . . . . . . . . . . . . . 570

Related publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 571


IBM Redbooks publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 571
Other publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 572
Online resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 573
How to get IBM Redbooks publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 573
Help from IBM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 573

Notices

This information was developed for products and services offered in the US. This material might be available
from IBM in other languages. However, you may be required to own a copy of the product or product version in
that language in order to access it.

IBM may not offer the products, services, or features discussed in this document in other countries. Consult
your local IBM representative for information on the products and services currently available in your area. Any
reference to an IBM product, program, or service is not intended to state or imply that only that IBM product,
program, or service may be used. Any functionally equivalent product, program, or service that does not
infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to
evaluate and verify the operation of any non-IBM product, program, or service.

IBM may have patents or pending patent applications covering subject matter described in this document. The
furnishing of this document does not grant you any license to these patents. You can send license inquiries, in
writing, to:
IBM Director of Licensing, IBM Corporation, North Castle Drive, MD-NC119, Armonk, NY 10504-1785, US

INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION “AS IS” WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some jurisdictions do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you.

This information could include technical inaccuracies or typographical errors. Changes are periodically made
to the information herein; these changes will be incorporated in new editions of the publication. IBM may make
improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time
without notice.

Any references in this information to non-IBM websites are provided for convenience only and do not in any
manner serve as an endorsement of those websites. The materials at those websites are not part of the
materials for this IBM product and use of those websites is at your own risk.

IBM may use or distribute any of the information you provide in any way it believes appropriate without
incurring any obligation to you.

The performance data and client examples cited are presented for illustrative purposes only. Actual
performance results may vary depending on specific configurations and operating conditions.

Information concerning non-IBM products was obtained from the suppliers of those products, their published
announcements or other publicly available sources. IBM has not tested those products and cannot confirm the
accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the
capabilities of non-IBM products should be addressed to the suppliers of those products.

Statements regarding IBM's future direction or intent are subject to change or withdrawal without notice, and
represent goals and objectives only.

This information contains examples of data and reports used in daily business operations. To illustrate them
as completely as possible, the examples include the names of individuals, companies, brands, and products.
All of these names are fictitious and any similarity to actual people or business enterprises is entirely
coincidental.

COPYRIGHT LICENSE:

This information contains sample application programs in source language, which illustrate programming
techniques on various operating platforms. You may copy, modify, and distribute these sample programs in
any form without payment to IBM, for the purposes of developing, using, marketing or distributing application
programs conforming to the application programming interface for the operating platform for which the sample
programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore,
cannot guarantee or imply reliability, serviceability, or function of these programs. The sample programs are
provided “AS IS”, without warranty of any kind. IBM shall not be liable for any damages arising out of your use
of the sample programs.



Trademarks
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines
Corporation, registered in many jurisdictions worldwide. Other product and service names might be
trademarks of IBM or other companies. A current list of IBM trademarks is available on the web at “Copyright
and trademark information” at http://www.ibm.com/legal/copytrade.shtml

The following terms are trademarks or registered trademarks of International Business Machines Corporation,
and might also be trademarks or registered trademarks in other countries.
AIX® IBM FlashSystem® PR/SM™
CICS® IBM Spectrum™ ProtecTIER®
Cognos® IBM Spectrum Control™ Real-time Compression™
DB2® IBM Spectrum Protect™ Redbooks®
DB2 Universal Database™ IBM Spectrum Scale™ Redpapers™
developerWorks® IBM Spectrum Virtualize™ Redbooks (logo) ®
DS4000® IBM z Systems™ Resource Measurement Facility™
DS8000® IBM z13™ RMF™
Easy Tier® IBM zHyperWrite™ Storwize®
ECKD™ IMS™ System i®
Enterprise Storage Server® OMEGAMON® System Storage®
FICON® Parallel Sysplex® Tivoli®
FlashCopy® POWER® Tivoli Enterprise Console®
FlashSystem™ Power Systems™ z Systems™
GDPS® POWER7® z/Architecture®
GPFS™ POWER7+™ z/OS®
HACMP™ POWER8® z/VM®
HyperSwap® PowerHA® z/VSE®
IBM® PowerPC® z13™

The following terms are trademarks of other companies:

Ustream is a trademark or registered trademark of Ustream, Inc., an IBM Company.

ITIL is a Registered Trade Mark of AXELOS Limited.

Linux is a trademark of Linus Torvalds in the United States, other countries, or both.

Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the
United States, other countries, or both.

Java, and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its
affiliates.

UNIX is a registered trademark of The Open Group in the United States and other countries.

Other company, product, or service names may be trademarks or service marks of others.

Preface

This IBM® Redbooks® publication provides guidance about how to configure, monitor, and
manage your IBM DS8880 storage systems to achieve optimum performance, and it also
covers the IBM DS8870 storage system. It describes the DS8880 performance features and
characteristics, including hardware-related performance features, synergy items for certain
operating systems, and other functions, such as IBM Easy Tier® and the DS8000® I/O
Priority Manager.

The book also describes specific performance considerations that apply to particular host
environments, including database applications.

This book also outlines the various tools that are available for monitoring and measuring I/O
performance for different server environments, and it describes how to monitor the
performance of the entire DS8000 storage system.

This book is intended for individuals who want to maximize the performance of their DS8880
and DS8870 storage systems and investigate the planning and monitoring tools that are
available. The IBM DS8880 storage system features, as described in this book, are available
for the DS8880 model family with R8.0 release bundles (Licensed Machine Code (LMC) level
7.8.0).

For more information about optimizing performance with the previous DS8000 models such
as the DS8800 or DS8700 models, see DS8800 Performance Monitoring and Tuning,
SG24-8013.

Performance: Any sample performance measurement data that is provided in this book is
for comparative purposes only. The data was collected in controlled laboratory
environments at a specific point by using the configurations, hardware, and firmware levels
that were available then. The performance in real-world environments can vary. Actual
throughput or performance that any user experiences also varies depending on
considerations, such as the I/O access methods in the user’s job, the I/O configuration, the
storage configuration, and the workload processed. The data is intended only to help
illustrate how different hardware technologies behave in relation to each other. Contact
your IBM representative or IBM Business Partner if you have questions about the expected
performance capability of IBM products in your environment.

Authors
This book was produced by a team of specialists from around the world working for the
International Technical Support Organization, at the EMEA Storage Competence Center
(ESCC) in Mainz, Germany.

Axel Westphal works as a certified IT Specialist at the IBM EMEA Storage Competence
Center (ESCC) in Mainz, Germany. He joined IBM in 1996, working for IBM Global Services
as a Systems Engineer. His areas of expertise include setup and demonstration of
IBM System Storage® products and solutions in various environments. He wrote several
storage white papers and co-authored several IBM Redbooks publications.



Bert Dufrasne is an IBM Certified IT Specialist and Project Leader for System Storage disk
products at the ITSO, San Jose Center. He has worked at IBM in various IT areas. He has
written many IBM Redbooks publications and has developed and taught technical workshops.
Before joining the ITSO, he worked for IBM Global Services as an Application Architect. He
holds a master’s degree in electrical engineering.

Wilhelm Gardt is a certified IT Specialist in Germany. He has more than 25 years of experience in Open Systems, database, and storage environments. He worked as a software
developer and a UNIX and database administrator, designing and implementing
heterogeneous IT environments (including SAP, Oracle, UNIX, and SAN). Wilhelm joined IBM
in 2001. He is a member of the Technical Presales Support team for IBM Storage (ATS).
Wilhelm is a frequent speaker at IBM storage and other conferences. He holds a degree in
computer sciences from the University of Kaiserslautern, Germany.

Jana Jamsek is an IT specialist that works in Storage Advanced Technical Skills for Europe
as a specialist for IBM Storage Systems and IBM i systems. Jana has eight years of
experience in the IBM System i® and AS/400 areas, and 15 years of experience in Storage.
She has a master’s degree in computer science and a degree in mathematics from the
University of Ljubljana, Slovenia. Jana works on complex customer cases that involve IBM i
and Storage systems, in different European, Middle Eastern, and African countries. She
presents at IBM Storage and Power universities and runs workshops for IBM employers and
customers. Jana is the author or co-author of several IBM publications, including IBM
Redbooks publications, IBM Redpapers™ publications, and white papers.

Peter Kimmel is an IT Specialist and ATS team lead of the Enterprise Disk Solutions team at
the ESCC in Mainz, Germany. He joined IBM Storage in 1999 and worked since then with all
the various IBM Enterprise Storage Server® and DS8000 generations, with a focus on
architecture and performance. He was involved in the Early Shipment Programs (ESPs) of
these early installations, and co-authored several IBM Redbooks publications. Peter holds a
diploma (MSc) degree in physics from the University of Kaiserslautern.

Flavio Morais is a GTS Storage Specialist in Brazil. He has six years of experience in the
SAN/storage field. He holds a degree in computer engineering from Instituto de
Ensino Superior de Brasilia. His areas of expertise include DS8000 planning, copy services,
IBM Tivoli® Storage Productivity Center for Replication, and performance troubleshooting. He
has extensive experience solving performance problems with Open Systems.

Paulus Usong joined IBM in Indonesia as a Systems Engineer. His next position brought him
to New York as a Systems Programmer at a bank. From New York, he came back to IBM in
San Jose at the Santa Teresa Lab, which is now known as the Silicon Valley Lab. During his
IBM employment in San Jose, Paulus moved to several different departments, all in San Jose.
His last position at IBM was as a Consulting IT Specialist with the IBM ATS group, working as a disk performance expert covering customers in the Americas. After his retirement from IBM, he
joined IntelliMagic as a Mainframe Consultant.

Alexander Warmuth is a Senior IT Specialist for IBM at the ESCC in Mainz, Germany.
Working in technical sales support, he designs and promotes new and complex storage
solutions, drives the introduction of new products, and provides advice to clients, Business
Partners, and sales. His main areas of expertise are high-end storage solutions and business
resiliency. He joined IBM in 1993 and has been working in technical sales support since 2001.
Alexander holds a diploma in electrical engineering from the University of Erlangen,
Germany.



Kenta Yuge is an IT Specialist at IBM Japan. He has two years of experience in technical
support for DS8000 as Advanced Technical Support after two years of experience supporting
IBM z Systems™. His areas of expertise include DS8000 planning and implementation, copy
services, and performance analysis, especially in mainframe environments.

Thanks to the following people for their contributions to this project:

Nick Clayton, Peter Flämig, Dieter Flaut, Marc Gerbrecht, Marion Hejny, Karl Hohenauer, Lee
La Frese, Frank Krüger, Uwe Heinrich Müller, Henry Sautter, Louise Schillig, Dietmar
Schniering, Uwe Schweikhard, Christopher Seiwert, Paul Spagnolo, Mark Wells, Harry
Yudenfriend
ESCC Rhine-Main Lab Operations

Now you can become a published author, too!


Here’s an opportunity to spotlight your skills, grow your career, and become a published
author—all at the same time! Join an ITSO residency project and help write a book in your
area of expertise, while honing your experience using leading-edge technologies. Your efforts
will help to increase product acceptance and customer satisfaction, as you expand your
network of technical contacts and relationships. Residencies run from two to six weeks in
length, and you can participate either in person or as a remote resident working from your
home base.

Find out more about the residency program, browse the residency index, and apply online at:
ibm.com/redbooks/residencies.html

Comments welcome
Your comments are important to us!

We want our books to be as helpful as possible. Send us your comments about this book or
other IBM Redbooks publications in one of the following ways:
򐂰 Use the online Contact us review Redbooks form found at:
ibm.com/redbooks
򐂰 Send your comments in an email to:
[email protected]
򐂰 Mail your comments to:
IBM Corporation, International Technical Support Organization
Dept. HYTD Mail Station P099
2455 South Road
Poughkeepsie, NY 12601-5400

Stay connected to IBM Redbooks
򐂰 Find us on Facebook:
http://www.facebook.com/IBMRedbooks
򐂰 Follow us on Twitter:
http://twitter.com/redbooks
򐂰 Look for us on LinkedIn:
http://www.linkedin.com/groups?home=&gid=2130806
򐂰 Explore new Redbooks publications, residencies, and workshops with the IBM Redbooks
weekly newsletter:
https://www.redbooks.ibm.com/Redbooks.nsf/subscribe?OpenForm
򐂰 Stay current on recent Redbooks publications with RSS Feeds:
http://www.redbooks.ibm.com/rss.html



Part 1. IBM System Storage DS8000 performance considerations
This part provides an overview of the DS8000 characteristics and logical configuration
concepts to achieve optimum performance.

This part includes the following topics:


򐂰 IBM System Storage DS8880 family characteristics
򐂰 Hardware configuration
򐂰 Logical configuration concepts and terminology
򐂰 Logical configuration performance considerations

Chapter 1. IBM System Storage DS8880 family characteristics
This chapter contains a high-level description and introduction to the storage server
performance capabilities. It provides an overview of the model characteristics of the DS8880
family that allow the DS8000 series to meet this performance challenge. When doing
comparisons between the models, for some of the charts we also include the previous
DS8870 models for reference.

This chapter includes the following topics:


򐂰 The storage server challenge
򐂰 Meeting the challenge: The DS8000 product family
򐂰 DS8000 performance characteristics overview



1.1 The storage server challenge
One of the primary criteria in judging a storage server is performance, that is, how fast it
responds to a read or write request from an application server. How well a storage server
accomplishes this task depends on the design of its hardware and firmware.

Data continually moves from one component to another within a storage server. The objective
of server design is hardware that has sufficient throughput to keep data flowing smoothly
without waiting because a component is busy. When data stops flowing because a
component is busy, a bottleneck forms. Obviously, it is preferable to minimize the frequency
and severity of bottlenecks.

The ideal storage server is one in which all components are used and bottlenecks are few.
This scenario is the case if the following conditions are met:
򐂰 The machine is designed well, with all hardware components in balance. To provide this
balance over a range of workloads, a storage server must allow a range of hardware
component options.
򐂰 The machine is sized well for the client workload. Where options exist, the correct
quantities of each option are chosen.
򐂰 The machine is set up well. Where options exist in hardware installation and logical
configuration, these options are chosen correctly.

Automatic rebalancing and tiering options can help achieve optimum performance even in an
environment of ever-changing workload patterns, but they cannot replace a correct sizing of
the machine.

1.1.1 Performance numbers


Performance numbers alone provide value only when accompanied by an explanation that makes clear what exactly they mean. Raw performance numbers provide evidence that a particular storage server is better than the previous generation model, or better than a competitive product. However, isolated performance numbers are often out of line with a production environment. It is
important to understand how raw performance numbers relate to the performance of the
storage server in processing a particular production workload.

Throughput numbers are achieved in controlled tests that push as much data as possible
through the storage server as a whole, or perhaps a single component. At the point of
maximum throughput, the system is so overloaded that response times are greatly extended.
Trying to achieve such throughput numbers in a normal business environment brings protests
from the users of the system because response times are poor.

To assure yourself that the DS8880 family offers the fastest technology, consider the
performance numbers for the individual disks, adapters, and other components of the
DS8886 and DS8884 models, and for the total device. The DS8880 family uses the most
current technology available. But, use a more rigorous approach when planning the DS8000
hardware configuration to meet the requirements of a specific environment.



1.1.2 Recommendations and rules
Hardware selection is sometimes based on general recommendations and rules. A general
rule is a simple guideline for making a selection based on limited information. You can make a
quick decision, with little effort, that provides a solution that works most of the time. The
disadvantage is that it does not work all the time; sometimes, the solution is not at all what the
client needs. You can increase the chances that the solution works by making it more
conservative. However, a conservative solution generally involves more hardware, which
means a more expensive solution. This chapter provides preferred practices and general
rules for different hardware components. Use general rules only when no information is
available to make a more informed decision. In general, when options are available for
automated tiering and rebalancing, in the most typical cases, use these options.

1.1.3 Modeling your workload


A much better way to determine the hardware requirements for your workload is to run a Disk
Magic model. Disk Magic is a modeling tool, which calculates the throughput and response
time of a storage server based on workload characteristics and the hardware resources of the
storage server. By converting the results of performance runs into mathematical formulas,
Disk Magic allows the results to be applied to a wide range of workloads. Disk Magic allows
many variables of hardware to be brought together. This way, the effect of each variable is
integrated, producing a result that shows the overall performance of the storage server.

For more information about this tool, see 6.1, “Disk Magic” on page 160.
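
To give a feel for the type of calculation that such modeling performs, the following Python sketch estimates device response time from utilization with a simple M/M/1 queueing approximation. This is not the Disk Magic algorithm; it is only an illustration of how response time grows nonlinearly as a component approaches saturation, and the service time and IOPS figures are invented for the example:

# Illustrative only: a simple M/M/1 queueing approximation, not the Disk Magic model.
def estimated_response_time_ms(iops, max_iops, service_time_ms):
    """Estimate the average response time of a component under increasing load."""
    utilization = iops / max_iops
    if utilization >= 1.0:
        raise ValueError("Workload exceeds the modeled capability of the component")
    # Queueing delay grows sharply as utilization approaches 100%.
    return service_time_ms / (1.0 - utilization)

# Hypothetical component: a rank rated at 2000 IOPS with a 5 ms service time.
for load in (500, 1000, 1500, 1900):
    print(load, "IOPS ->", round(estimated_response_time_ms(load, 2000, 5.0), 1), "ms")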

1.1.4 Allocating hardware components to workloads


Two contrasting methods are available to allocate the use of hardware components to
workloads:
򐂰 The first method is spreading the workloads across components, which means that you try
to share the use of hardware components across all, or at least many, workloads. The
more hardware components are shared, the more effectively they are used, which reduces
total cost of ownership (TCO). For example, to attach multiple UNIX servers, you can use
the same host adapters (HAs) for all hosts instead of acquiring a separate set of HAs for
each host. However, the more that components are shared, the more potential exists that
one workload dominates the use of the component.
򐂰 The second method is isolating workloads to specific hardware components, which
means that specific hardware components are used for one workload, others for different
workloads. The downside of isolating workloads is that certain components are unused
when their workload is not demanding service. On the upside, when that workload
demands service, the component is available immediately. The workload does not
contend with other workloads for that resource.

Spreading the workload maximizes the usage and performance of the storage server as a
whole. Isolating a workload is a way to maximize the workload performance, making the
workload run as fast as possible. Automatic I/O prioritization can help avoid a situation in
which less-important workloads dominate the mission-critical workloads in shared
environments, and allow more shared environments.

If you expect growing loads, for example, when replacing one system with a new one that also
has a much bigger capacity, add some contingency for this amount of foreseeable I/O growth.

For more information, see 4.2, “Configuration principles for optimal performance” on page 87.



1.2 Meeting the challenge: The DS8000 product family
IBM System Storage DS8886, IBM System Storage DS8884, and their predecessor models
are members of the DS8000 product family. They offer disk storage servers with a wide range
of hardware component options to fit many workload requirements, in both type and size. A
DS8886 can scale well to the highest disk storage capacities. The scalability is supported by
design functions that allow the installation of additional components without disruption.

With the capabilities of the DS8000 series, a large number of different workloads can be easily consolidated into a single storage system.

1.2.1 DS8880 family models and characteristics


The DS8880, which is the seventh generation of the DS8000, comes in two models, with a
third all-flash model pre-announced.

Compared to the previous DS8870 models, the DS8880 family comes with largely increased
processing power because of IBM POWER8® technology, more scalability for processors and
cache, a new I/O bay interconnect, and improved I/O enclosure bays.

The storage systems can scale up to 1536 2.5-inch disk drives or solid-state drives (SSDs), plus another 240 High-Performance Flash Enclosure (HPFE) cards, the latter on a dedicated 1.8-inch flash architecture.

Table 1-1 provides an overview of the DS8880 and DS8870 models, including processor,
memory, HA, and disk specifications for each model. Each model has two processor complexes.

Table 1-1 DS8880 and DS8870 model overview
(Values are listed in the order: All-flash model | DS8886 | DS8884 | DS8870 Enterprise Class | DS8870 Business Class)

Base frame model: 982 | 981 | 980 | 961 | 961
Expansion frame model: 98F | 98E | 98B | 96E | 96E
Processor technology: POWER8 | POWER8 | POWER8 | IBM POWER7+ (Second gen.) or IBM POWER7 (First gen.) | POWER7+ (Second gen.) or POWER7 (First gen.)
Number of processor cores per complex: 48 | 8 - 24 | 6 | 2 - 16 | 2 - 16
Number of transistors per processor: 4.2 billion | 4.2 billion | 4.2 billion | 2.1 billion (Second gen.) or 1.2 billion (First gen.) | 2.1 billion (Second gen.) or 1.2 billion (First gen.)
Processor speed: 3.02 - 3.72 GHz | 3.89 GHz (8/16 core) or 3.53 GHz (24 core) | 3.89 GHz | 4.23 GHz (Second gen.) or 3.55 GHz (First gen.) | 4.23 GHz (Second gen.) or 3.55 GHz (First gen.)
Processor memory options (cache): Up to 2 TB | 128 GB - 2 TB | 64 GB - 256 GB | 16 GB - 1 TB | 16 GB - 1 TB
HA ports (maximum): Up to 128 (16 Gb) | Up to 128 (16 Gb) | Up to 64 (16 Gb) | Up to 128 (if 8 Gb) | Up to 128 (if 8 Gb)
Disk drive modules (DDMs/2.5-inch) (maximum): - | 1536 | 768 | 1536 | 1056
HPFE cards (1.8-inch) (maximum): 480 | 240 | 120 | 240 | 240
Disk drive interface technology: - | SAS 12 Gbps | SAS 12 Gbps | SAS 6 Gbps | SAS 6 Gbps

The following sections provide a short description of the main hardware components.

POWER8 processor technology


The DS8000 series uses the IBM POWER8 technology, which is the foundation of the storage system logical partitions (LPARs). The DS8880 family (98x) models use dual processor complexes that scale from a dual 6-way up to a dual 48-way configuration of the 64-bit microprocessors. Within the POWER8 servers, the DS8880 family offers up to 2 TB of cache.

The IBM POWER® processor architecture offers superior performance and availability features compared to other conventional processor architectures on the market. POWER7+ allowed up to four intelligent simultaneous threads per core, which is twice what larger x86 processors offer. The POWER8 architecture allows up to eight simultaneous threads per core.

Switched PCIe disk connection


The I/O bay interconnect for the members of the DS8880 family uses the Peripheral Component Interconnect Express Gen3 (PCI Express Gen3, or PCIe Gen3) standard. This is a major change from the DS8870, which used the GX++ I/O bus with PCIe Gen2 interconnects. Because the supplemental I/O bridge component of the GX++ bus is gone in the DS8880 storage system, the back end also gains some additional latency improvements. In addition, the High-Performance Flash Enclosures (HPFE) connect directly through PCIe in all DS8880 models, and the number of lanes for the HPFE attachment is doubled, potentially allowing even higher throughputs.

Disk drives
The DS8880 family offers a selection of industry-standard Serial Attached SCSI (SAS 3.0)
disk drives. Most drive types (15,000 and 10,000 RPM) are 6.35 cm (2.5-inch) small form
factor (SFF) sizes, with drive capacities of 300 GB - 1200 GB. SSDs are also available in
2.5-inch, with capacities of 200 - 1600 GB. The Nearline drives (7,200 RPM) are 8.9 cm
(3.5-inch) size drives with a SAS interface. With the maximum number and type of drives, the
storage system can scale to over 3 PB of total raw capacity.



Host adapters
The DS8880 models offer enhanced connectivity with the availability of either eight-port or
four-port FC/IBM FICON® HAs (8-port is only available with 8 Gbps). Each adapter can
auto-negotiate up to two speed factors down, meaning that the 16 Gbps adapter also can run
on 8 Gbps or 4 Gbps speeds, but not less. With this flexibility, you can benefit from the higher
performance, 16 Gbps SAN-based solutions and also maintain compatibility with existing
infrastructures. In addition, you can configure the ports on the adapter with an intermix of
Fibre Channel Protocol (FCP) and FICON, which can help protect your investment in Fibre
adapters and increase your ability to migrate to new servers. A DS8886 can support a
maximum of 32 HAs (16 Gbps), which provides up to 128 FC/FICON ports. Adapters can be
shortwave or longwave.

1.3 DS8000 performance characteristics overview


The DS8000 series offers optimally balanced performance. For example, the initial version of
the DS8886 (using SMT4 mode, four threads per core) offers a maximum sequential
throughput of around twice that of the DS8870 generation 16-way model, the latter with all
acceleration features included. The smaller 6-core DS8884 model offers performance
capabilities in the range of the fully equipped 16-way DS8870 model.

The DS8000 series incorporates many performance enhancements:


򐂰 Dual-clustered POWER8 servers
򐂰 PCIe 16 Gbps FC/FICON HAs
򐂰 New SAS disk drives
򐂰 High-bandwidth, fault-tolerant internal interconnections with a switched PCIe Gen3
architecture.

With all these new components, the DS8880 family is positioned at the top of the
high-performance category. The following hardware components contribute to the high
performance of the DS8000 family:
򐂰 Array across loops (AAL) when building Redundant Array of Independent Disks (RAID)
򐂰 POWER symmetrical multiprocessor system (SMP) processor architecture
򐂰 Multi-threaded design, simultaneous multithreading (SMT)
򐂰 Switched PCIe architecture
򐂰 Powerful processing components at the HA and device adapter (DA) level, each adapter with its own IBM PowerPC® CPUs on board, to which major CPU-intensive functions can be offloaded.

In addition to these hardware-based enhancements, the following sections describe additional firmware-based contributions to performance.

1.3.1 Advanced caching techniques


The DS8000 series benefits from advanced caching techniques, most of which are unique to IBM storage systems. In addition, the DS8000 series is one of the few products on the market that uses small cache segments of just 4 KB, which provides highly efficient cache usage, especially for smaller-block OLTP workloads. Data Warehouse (DWH) workloads benefit from caching exactly what is needed. The superior CPU and SMT capability of the POWER architecture makes it possible to handle and manage cache space in such small and efficient segments.



Sequential Prefetching in Adaptive Replacement Cache
Another performance enhancer is the use of the self-learning cache algorithms. The DS8000
caching technology improves cache efficiency and enhances cache-hit ratios, especially in
environments that change dynamically. One of the patent-pending algorithms that is used in
the DS8000 series is called Sequential Prefetching in Adaptive Replacement Cache (SARC).

SARC provides these advantages:


򐂰 Sophisticated, patented algorithms to determine what data to store in cache based on the
recent access and frequency needs of the hosts
򐂰 Prefetching, which anticipates data before a host request and loads it into cache
򐂰 Self-learning algorithms to adapt and dynamically learn what data to store in cache based
on the frequency needs of the hosts.
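
The SARC algorithms themselves are IBM intellectual property and considerably more sophisticated, but the following Python sketch illustrates the underlying idea of keeping separately managed lists for randomly and sequentially accessed data and prefetching ahead of a detected sequential stream. All names, sizes, and the eviction rule are simplifications invented for this example:

from collections import OrderedDict

# Illustrative sketch only; the real SARC algorithm is far more sophisticated.
class TinySarcLikeCache:
    def __init__(self, capacity=16, prefetch_depth=4):
        self.capacity = capacity
        self.prefetch_depth = prefetch_depth
        self.random_list = OrderedDict()   # tracks staged by random hits
        self.seq_list = OrderedDict()      # tracks staged by sequential prefetch
        self.last_track = None

    def _evict_if_needed(self):
        while len(self.random_list) + len(self.seq_list) > self.capacity:
            # Evict from whichever list is currently larger (crude balancing).
            victim = self.seq_list if len(self.seq_list) >= len(self.random_list) else self.random_list
            victim.popitem(last=False)     # drop the least recently used track

    def read(self, track):
        hit = track in self.random_list or track in self.seq_list
        sequential = self.last_track is not None and track == self.last_track + 1
        self.last_track = track
        target = self.seq_list if sequential else self.random_list
        target[track] = True
        target.move_to_end(track)
        if sequential:
            # Prefetch the next few tracks of the detected sequential stream.
            for t in range(track + 1, track + 1 + self.prefetch_depth):
                self.seq_list[t] = True
        self._evict_if_needed()
        return "hit" if hit else "miss"

cache = TinySarcLikeCache()
print([cache.read(t) for t in (10, 11, 12, 13, 14, 50, 13)])
# The sequential reads after track 11 become cache hits thanks to prefetching.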

Adaptive Multi-Stream Prefetching


Adaptive Multi-Stream Prefetching (AMP) introduces an autonomic, workload-responsive,
self-optimizing prefetching technology. This technology adapts both the amount and the
timing of prefetch on a per-application basis to maximize the performance of the system.

AMP provides a provable, optimal sequential read performance and maximizes the sequential
read throughputs of all RAID arrays where it is used, and therefore of the system.

Intelligent Write Caching


Another additional cache algorithm, Intelligent Write Caching (IWC), is implemented in the
DS8000 series. IWC improves performance through better write-cache management and a
better destaging order of writes.

By carefully selecting the data that we destage and, at the same time, reordering the
destaged data, we minimize the disk head movements that are involved in the destage
processes. Therefore, we achieve a large performance boost on random-write workloads.
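
The following Python sketch is not the IWC implementation, but it illustrates the ordering idea: dirty tracks are destaged in a single ascending sweep, similar to a CSCAN elevator, so that the disk heads do not jump back and forth. The track numbers are hypothetical:

# Illustrative sketch of ordered destaging (CSCAN-like), not the actual IWC code.
def destage_order(dirty_tracks, current_head_position):
    """Return dirty tracks in one ascending sweep starting at the head position."""
    ahead = sorted(t for t in dirty_tracks if t >= current_head_position)
    behind = sorted(t for t in dirty_tracks if t < current_head_position)
    return ahead + behind   # sweep forward first, then wrap around

dirty = [930, 12, 480, 205, 731, 44]
print(destage_order(dirty, current_head_position=300))
# [480, 731, 930, 12, 44, 205] -- one sweep instead of random head movement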

1.3.2 IBM Subsystem Device Driver


IBM Subsystem Device Driver (SDD) is a pseudo device driver on the host system that
supports the multipath configuration environments in IBM products. It provides load balancing
and enhanced data availability capability. By distributing the I/O workload over multiple active
paths, SDD provides dynamic load balancing and eliminates data flow bottlenecks. The same
applies to the respective Device Specific Modules and Path Control Modules that are
available for certain operating systems and that sit on top of their internal Multipath I/O
(MPIO) multipathing. SDD also helps eliminate a potential single point of failure by
automatically rerouting I/O operations when a path failure occurs.
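
As a conceptual illustration of the load balancing and path failover that is described above (not SDD code), the following Python sketch sends each new I/O down the online path with the fewest outstanding requests. The path names are hypothetical:

# Conceptual illustration of least-busy path selection; not actual SDD code.
class Path:
    def __init__(self, name):
        self.name = name
        self.online = True
        self.outstanding_ios = 0

def select_path(paths):
    candidates = [p for p in paths if p.online]
    if not candidates:
        raise RuntimeError("No paths available to the logical volume")
    # Pick the active path with the fewest requests in flight.
    return min(candidates, key=lambda p: p.outstanding_ios)

paths = [Path("fscsi0"), Path("fscsi1"), Path("fscsi2"), Path("fscsi3")]
paths[1].online = False                  # simulate a path failure
paths[0].outstanding_ios = 7
chosen = select_path(paths)
print("Next I/O goes to", chosen.name)   # an idle path, not the busy or failed ones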

SDD is provided with the DS8000 series at no additional charge. Fibre Channel (SCSI-FCP)
attachment configurations are supported in the IBM AIX®, Hewlett-Packard UNIX (HP-UX),
Linux, Microsoft Windows, and Oracle Solaris environments.

In addition, the DS8000 series supports the built-in multipath options of many distributed
operating systems.



1.3.3 Performance characteristics for z Systems
The DS8000 supports numerous IBM performance innovations for z Systems environments.
Here are the most significant ones:
򐂰 FICON extends the ability of the DS8000 storage system to deliver high-bandwidth
potential to the logical volumes that need it when they need it. Working together with other
DS8000 functions, FICON provides a high-speed pipe that supports a multiplexed
operation.
򐂰 High Performance FICON for z (zHPF) takes advantage of the hardware that is available
today, with enhancements that are designed to reduce the impact that is associated with
supported commands. These enhancements can improve FICON I/O throughput on a
single DS8000 port by 100%. Enhancements to the IBM z/Architecture® and the FICON
interface architecture deliver improvements for online transaction processing (OLTP)
workloads. When used by the FICON channel, the IBM z/OS® operating system, and the
control unit, zHPF is designed to help reduce impact and improve performance. Since the
first implementation of zHPF, several generational enhancements were added.
򐂰 Parallel Access Volume (PAV) enables a single z Systems server to process
simultaneously multiple I/O operations to the same logical volume, which can help reduce
device queue delays. This function is achieved by defining multiple addresses per volume.
With Dynamic PAV, the assignment of addresses to volumes can be managed
automatically to help the workload meet its performance objectives and reduce overall
queuing.
򐂰 HyperPAV allows an alias address to be used to access any base on the same control unit
image per I/O base. Multiple HyperPAV hosts can use one alias to access separate bases.
This capability reduces the number of alias addresses that are required to support a set of
bases in a z Systems environment with no latency in targeting an alias to a base. This
function is also designed to enable applications to achieve equal or better performance
than possible with the original PAV feature alone while also using the same or fewer z/OS
resources. Like PAV and zHPF, the HyperPAV feature comes, for the DS8880 storage system, in one combined license that includes all z Systems related features.
򐂰 Multiple Allegiance expands the simultaneous logical volume access capability across
multiple z Systems servers. This function, along with PAV, enables the DS8000 series to
process more I/Os in parallel, helping to improve performance and enabling greater use of
large volumes.
򐂰 Application and host integration. With the conversion of IBM DB2® I/Os to zHPF, massive
performance gains for certain database tasks can be achieved. For example, the DB2
application can give cache prefetching hints into the storage system.
򐂰 zHyperWrite allows the I/O driver to schedule up to three parallel write I/Os and direct one
or two of these writes to Metro Mirror (MM) target volumes to avoid the potential
synchronous replication impact of Metro Mirror. Especially when considering DB2 logs,
this integration of host-based mirroring into the DS8000 Metro Mirror with the zHyperWrite
function can lead to significant response time reductions for these logs.
򐂰 I/O priority queuing allows the DS8000 series to use I/O priority information that is
provided by the z/OS Workload Manager to manage the processing sequence of I/O
operations.
򐂰 I/O Priority Manager integration with zWLM. Section 1.3.5, “I/O Priority Manager” on
page 17 explains how the DS8000 I/O Priority Manager feature offers additional add-ons
for z Systems. With these add-ons, the z/OS host can manage and reassign a count key
data (CKD) volume automatically to another DS8000 performance class. This capability is
useful when the priorities of the CKD volume do not match the priorities preset in the z
Systems Workload Manager.



򐂰 SAN Fabric I/O Priority Management. Starting with newer z Systems servers such as the IBM z13™, the quality of service (QoS) functions that the I/O Priority Manager offers at the level of drive pools and DAs are also extended into the SAN fabric. For the SAN QoS as well, the z Systems Workload Manager (WLM) is integrated into that function. The WLM can send the priority information of each I/O to the SAN fabric and to the DS8000 storage system, completing the management of the entire end-to-end flow of an I/O.
򐂰 FICON Dynamic Routing (FIDR). Another performance-relevant function that is available with newer FICON cards, such as those that the z13 offers, is FIDR. With many paths in a SAN, for example, with a larger number of inter-switch links (ISLs), traditional static routing often led to imbalanced ISLs where not all of the available bandwidth was used. FIDR, which uses dynamically assigned Exchange IDs for each I/O, leads to an optimally balanced SAN, which translates into a larger effective gain of SAN ISL bandwidth.

1.3.4 Easy Tier


Easy Tier is an optimizing and balancing feature at no charge that is available on DS8000
storage systems since the DS8700 model. It offers enhanced capabilities through the
following functions:
򐂰 Volume rebalancing
򐂰 Auto-rebalancing
򐂰 Hot spot management
򐂰 Rank depopulation
򐂰 Manual volume migration
򐂰 Heat-map transfer support when replicating between hybrid-pool DS8000 storage systems
򐂰 Pinning volumes temporarily to drive tiers
򐂰 Thin provisioning support
򐂰 Optional additional controls on pool and volume level

In this context, a rank is a formatted RAID array. A drive tier is a group of drives with similar
performance characteristics, for example, flash. Easy Tier determines the appropriate tier of
storage based on data access requirements. It then automatically and nondisruptively moves
data, at the subvolume or sublogical unit number (LUN) level, to the appropriate tier on the
DS8000 storage system.

Easy Tier automatic mode provides automatic cross-tier storage performance and storage
economics management for up to three tiers. It also provides automatic intra-tier performance
management (auto-rebalance) in multitier (hybrid) or single-tier (homogeneous) extent pools.

Easy Tier is designed to balance system resources to address application performance objectives. It automates data placement across Enterprise-class (Tier 1), Nearline (Tier 2), and flash/SSD (Tier 0) tiers, and among the ranks of the same tier (auto-rebalance). The system can automatically and nondisruptively relocate data at the extent level across any two or three tiers of storage, and full volumes can be relocated manually. The potential benefit is to align
the performance of the system with the appropriate application workloads. This enhancement
can help clients to improve storage usage and address performance requirements for multitier
systems. For those clients that use some percentage of flash, Easy Tier analyzes the system
and migrates only those parts of the volumes whose workload patterns benefit most from the
more valuable flash space to the higher tier.



Use the capabilities of Easy Tier to support these functions:
򐂰 Three tiers: Use three tiers and enhanced algorithms to improve system performance and
cost-effectiveness.
򐂰 Cold demotion: Cold data stored on a higher-performance tier is demoted to a more
appropriate tier. Expanded cold demotion demotes part of the sequential workload to the
Nearline tier to better use the total available bandwidth.
򐂰 Warm demotion: To avoid higher-performance tiers from becoming overloaded in hybrid
extent pools, the Easy Tier automatic mode monitors the performance of the ranks. It
triggers the movement of selected active extents from the higher-performance tier to the
lower-performance tier. This warm demotion helps to ensure that the higher-performance
tier does not suffer from saturation or overload conditions that might affect the overall
performance in the extent pool.
򐂰 Manual volume rebalance: Volume rebalancing relocates the smallest number of extents
of a volume and restripes those extents on all available ranks of the extent pool.
򐂰 Auto-rebalance: This capability automatically balances the workload across ranks of the
same storage tier within homogeneous and hybrid pools based on usage to improve
system performance and resource utilization. It enables the enhanced auto-rebalancing
functions of Easy Tier to manage a combination of homogeneous and hybrid pools,
including relocating hot spots on ranks. Even with homogeneous pools, systems with only
one tier can use Easy Tier technology to optimize their RAID array utilization.
򐂰 Rank depopulation: This capability allows ranks with allocated data to be unassigned from
an extent pool. It uses extent migration to move extents from the specified ranks to other
ranks within the pool.

Easy Tier also provides a performance monitoring capability, even when the Easy Tier
auto-mode (auto-migration) is not turned on. Easy Tier uses the monitoring process to
determine what data to move and when to move it when using automatic mode. The usage of
thin provisioning (extent space-efficient (ESE) volumes) is also possible with Easy Tier.

You can enable monitoring independently from the auto-migration for information about the
behavior and benefits that can be expected if automatic mode is enabled. Data from the
monitoring process is included in a summary report that you can download to your Microsoft
Windows system. Use the DS8000 Storage Tier Advisor Tool (STAT) application to view the
data when you point your browser to that file.

To download the STAT tool, check the respective DS8000 download section of IBM Fix
Central, found at:
http://www.ibm.com/support/fixcentral/

Easy Tier is described in detail in IBM DS8000 Easy Tier, REDP-4667.

Prerequisites
To enable Easy Tier automatic mode, you must meet the following requirements:
򐂰 Easy Tier automatic monitoring (monitor mode) is set to either All or Auto Mode.
򐂰 For Easy Tier to manage pools, the Easy Tier Auto Mode setting must be set to either
Tiered or All (in the DS GUI, click Settings → System → Easy Tier).

Easy Tier: Automatic mode


In the DS8880 family, Easy Tier is part of the standard Base license. With DS8870 and earlier
models, Easy Tier was a no charge license feature, but separately configured.



In Easy Tier, both I/O per second (IOPS) and bandwidth algorithms determine when to
migrate your data. This process can help you improve performance.

Use the automatic mode of Easy Tier to relocate your extents to their most appropriate
storage tier in a hybrid pool, based on usage. Because workloads typically concentrate I/O
operations (data access) on only a subset of the extents within a volume or LUN, automatic
mode identifies the subset of your frequently accessed extents. It then relocates them to the
higher-performance storage tier.

Subvolume or sub-LUN data movement is an important option to consider in volume movement because not all data at the volume or LUN level becomes hot data. For most
workloads, there is a distribution of data considered either hot or cold. A significant impact
can be associated with moving entire volumes between tiers. For example, if a volume is
1 TB, you do not want to move the entire 1 TB volume if the heat map that is generated
indicates that only 10 GB are considered hot. This capability uses your higher performance
tiers while reducing the number of drives that you need to optimize performance.

Using automatic mode, you can use high-performance storage tiers with a smaller cost. You
invest a small portion of storage capacity in the high-performance storage tier. You can use
automatic mode for relocation and tuning without intervention. Automatic mode can help
generate cost-savings while optimizing your storage performance.

Three-tier automatic mode is supported by the following Easy Tier functions:


򐂰 Support for ESE volumes with the thin provisioning of your FB volumes
򐂰 Support for a matrix of device or disk drive modules (DDMs) and adapter types
򐂰 Enhanced monitoring of both bandwidth and IOPS limitations
򐂰 Enhanced data demotion between tiers
򐂰 Automatic mode auto-performance rebalancing (auto-rebalance), which applies to the
following situations:
– Redistribution within a tier after a new rank is added into a managed pool (adding new
capacity)
– Redistribution within a tier after extent pools are merged
– Redistribution within a tier after a rank is removed from a managed pool
– Redistribution when the workload is imbalanced on the ranks within a tier of a
managed pool (natural skew)

Easy Tier automatically relocates that data to an appropriate storage device in an extent pool
that is managed by Easy Tier. It uses an algorithm to assign heat values to each extent in a
storage device. These heat values determine which tier is best for the data, and migration
takes place automatically. Data movement is dynamic and transparent to the host server and
to applications that use the data.
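
The real Easy Tier placement algorithms are far more involved, but the following Python sketch conveys the basic idea of ranking extents by a heat metric and assigning the hottest extents to the limited capacity of the higher tiers. The heat values and tier capacities are invented for illustration:

# Simplified illustration of heat-based extent placement; not the Easy Tier algorithm.
def plan_placement(extent_heat, tier_capacities):
    """extent_heat: {extent_id: accesses}; tier_capacities: [(tier, n_extents)] hottest first."""
    placement = {}
    ranked = sorted(extent_heat, key=extent_heat.get, reverse=True)
    cursor = 0
    for tier, capacity in tier_capacities:
        for extent in ranked[cursor:cursor + capacity]:
            placement[extent] = tier
        cursor += capacity
    return placement

heat = {"e1": 900, "e2": 15, "e3": 430, "e4": 2, "e5": 610, "e6": 55}
tiers = [("flash", 2), ("enterprise", 2), ("nearline", 10)]
print(plan_placement(heat, tiers))
# The two hottest extents land on flash, the next two on Enterprise, the cold ones on Nearline.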

By default, automatic mode is enabled (through the DSCLI and DS Storage Manager) for
heterogeneous pools. You can temporarily disable automatic mode or, as a special option, reset or pause the Easy Tier “learning” logic.

Functions and features of Easy Tier: Automatic mode


This section describes the functions and features of Easy Tier in automatic mode.



Auto-rebalance
Auto-rebalance is a function of Easy Tier automatic mode to balance the utilization within a
tier by relocating extents across ranks based on usage. Auto-rebalance supports single-tier
managed pools, and multitier hybrid pools. You can use controls, for example, on full system
level, to enable or disable the auto-rebalance function on all pools. When you enable
auto-rebalance, every standard and ESE volume is placed under Easy Tier control and
auto-rebalancing management. Using auto-rebalance gives you the advantage of these
automatic functions:
򐂰 Easy Tier auto-rebalance operates within a tier, inside a managed single-tier or multitier
storage pool, relocating extents across ranks of the same tier based on utilization and
workload.
򐂰 Easy Tier auto-rebalance automatically detects performance skew and rebalances
utilization across ranks within the same tier.

In any tier, placing highly active (hot) data on the same physical rank can cause the hot rank
or the associated DA to become a performance bottleneck. Likewise, over time, skew can
appear within a single tier that cannot be addressed by migrating data to a faster tier alone. It
requires some degree of workload rebalancing within the same tier. Auto-rebalance
addresses these issues within a tier in both hybrid (multitier) and homogeneous (single-tier)
pools. It also helps the system to respond in a more timely and appropriate manner to
overload situations, skew, and any under-utilization. These conditions can occur for the
following reasons:
򐂰 Addition or removal of hardware
򐂰 Migration of extents between tiers
򐂰 Merger of extent pools
򐂰 Changes in the underlying volume configuration
򐂰 Variations in the workload

Auto-rebalance adjusts the system to provide continuously optimal performance by balancing the load on the ranks and on DA pairs.
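
As a simplified illustration of intra-tier rebalancing (not the Easy Tier implementation), the following Python sketch moves one extent from the busiest rank of a tier to the least busy rank whenever the utilization skew exceeds an assumed threshold. Rank names, utilization values, and the threshold are invented:

# Simplified illustration of intra-tier rebalancing; not the Easy Tier implementation.
def rebalance_step(rank_utilization, extents_by_rank):
    """Move one extent from the busiest rank to the least busy rank of the same tier."""
    busiest = max(rank_utilization, key=rank_utilization.get)
    coolest = min(rank_utilization, key=rank_utilization.get)
    if rank_utilization[busiest] - rank_utilization[coolest] < 10:
        return None                      # skew is small; leave the tier alone
    extent = extents_by_rank[busiest].pop()
    extents_by_rank[coolest].append(extent)
    return (extent, busiest, coolest)

utilization = {"R1": 85, "R2": 40, "R3": 55}      # hypothetical rank utilization in percent
extents = {"R1": ["e1", "e2", "e3"], "R2": ["e4"], "R3": ["e5", "e6"]}
print(rebalance_step(utilization, extents))       # ('e3', 'R1', 'R2')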

If you set the Easy Tier automatic mode control to manage All Pools, Easy Tier also manages
homogeneous extent pools with only a single tier and performs intra-tier performance
balancing. If Easy Tier is turned off, no volumes are managed. If Easy Tier is turned on, it
manages all supported volumes, either standard (thick) or ESE (thin) volumes. For DS8870
and earlier models, the Track space-efficient (TSE) volumes that were offered by these
models are not supported by Easy Tier.

Warm demotion
To avoid overloading higher-performance tiers in hybrid extent pools, and thus potentially
degrading overall pool performance, Easy Tier automatic mode monitors performance of the
ranks. It can trigger the move of selected extents from the higher-performance tier to the
lower-performance tier based on either predefined bandwidth or IOPS overload thresholds.
The Nearline tier drives perform almost as well as SSDs and Enterprise hard disk drives
(HDDs) for sequential (high-bandwidth) operations.

This automatic operation is rank-based, and the target rank is randomly selected from the
lower tier. Warm demote has the highest priority so that overloaded ranks are relieved quickly.
So, Easy Tier continuously ensures that the higher-performance tier does not suffer from
saturation or overload conditions that might affect the overall performance in the extent pool.
Auto-rebalancing movement takes place within the same tier. Warm demotion takes place
across more than one tier. Auto-rebalance can be initiated when the rank configuration
changes the workload that is not balanced across ranks of the same tier. Warm demotion is
initiated when an overloaded rank is detected.



Cold demotion
Cold demotion recognizes and demotes cold or semi-cold extents to the lower tier. Cold
demotion occurs between HDD tiers, that is, between Enterprise and Nearline drives. Cold
extents are demoted in a storage pool to a lower tier if that storage pool is not idle.

Cold demotion occurs when Easy Tier detects any of the following scenarios:
򐂰 Segments in a storage pool become inactive over time, while other data remains active.
This scenario is the most typical use for cold demotion, where inactive data is demoted to
the Nearline tier. This action frees up segments on the Enterprise tier before the segments
on the Nearline tier become hot, which helps the system to be more responsive to new,
hot data.
򐂰 In addition to cold demote, which uses the capacity in the lowest tier, segments with
moderate bandwidth but low random IOPS requirements are selected for demotion to the
lower tier in an active storage pool. This demotion better uses the bandwidth in the
Nearline tier (expanded cold demote).

If all the segments in a storage pool become inactive simultaneously because of either a
planned or an unplanned outage, cold demotion is disabled. Disabling cold demotion assists
the user in scheduling extended outages or when experiencing outages without changing the
data placement.

Figure 1-1 illustrates all of the migration types that are supported by the Easy Tier
enhancements in a three-tier configuration. The auto-rebalance might also include additional
swap operations.

Figure 1-1 Features of Easy Tier automatic mode (the figure shows three tiers: a highest performance flash/SSD tier, a higher performance Enterprise HDD tier, and a lower performance Nearline HDD tier; auto-rebalance operates within each tier, and promote, swap, warm demote, cold demote, and expanded cold demote operations move extents between tiers)



Easy Tier: Manual mode
Easy Tier in manual mode provides the capability to migrate volumes and merge extent pools,
under the same DS8000 storage system, concurrently with I/O operations.

In Easy Tier manual mode, you can dynamically relocate a logical volume between extent
pools or within an extent pool to change the extent allocation method of the volume or to
redistribute the volume across new ranks. This capability is referred to as dynamic volume
relocation. You can also merge two existing pools into one pool without affecting the data on
the logical volumes that are associated with the extent pools. In an older installation with
many pools, you can introduce the automatic mode of Easy Tier with automatic inter-tier and
intra-tier storage performance and storage economics management in multi-rank extent pools
with one or more tiers. Easy Tier manual mode also provides a rank depopulation option to
remove a rank from an extent pool with all the allocated extents on this rank automatically
moved to the other ranks in the pool.

The enhanced functions of Easy Tier manual mode provide additional capabilities. You can
use manual mode to relocate entire volumes from one pool to another pool. Upgrading to a
new disk drive technology, rearranging the storage space, or changing storage distribution for
a workload are typical operations that you can perform with volume relocations.

You can more easily manage configurations that deploy separate extent pools with different
storage tiers or performance characteristics. The storage administrator can easily and
dynamically move volumes to the appropriate extent pool. Therefore, the storage
administrator can meet storage performance or storage economics requirements for these
volumes transparently to the host and the application. Use manual mode to achieve these
operations and increase the options to manage your storage.

Functions and features of Easy Tier: Manual mode


This section describes the functions and features of Easy Tier manual mode. Hybrid pools are
always assumed to be created for Easy Tier automatic mode management.

Volume migration
You can select which logical volumes to migrate, based on performance considerations or
storage management concerns:
򐂰 Migrate volumes from one extent pool to another. You might want to migrate volumes to a
different extent pool with more suitable performance characteristics, such as different disk
drives or RAID ranks. Also, as different RAID configurations or drive technologies become
available, you might want to move a logical volume to a different extent pool with different
characteristics. You might also want to redistribute the available disk capacity between
extent pools.
򐂰 Change the extent allocation method that is assigned to a volume, like restriping it again
across the ranks. (This is meaningful only for those few installations and pools that are not
managed by an Easy Tier auto-rebalancing.)

The impact that is associated with volume migration is comparable to that of an IBM FlashCopy® operation that runs with background copy.



Dynamic extent pool merge
The dynamic extent pool merge capability allows a user to initiate a merging process of one
extent pool (source extent pool) into another (target extent pool). During this process, all the
volumes in both source and target extent pools remain accessible to the hosts. For example,
you can manually combine two existing extent pools with a homogeneous or a hybrid disk
configuration into a single extent pool with SSDs to use Easy Tier automatic mode. When the
merged hybrid pools contain SSD-class, Enterprise-class, and Nearline-class tiers, they all
can be managed by Easy Tier automatic mode as a three-tier storage hierarchy.

Easy Tier monitoring


STAT helps report volume performance monitoring data. The monitoring capability of the DS8000 storage system enables it to monitor the usage of storage at the pool and volume/extent level. Monitoring statistics are gathered and analyzed every few hours. In an Easy Tier managed extent pool, the analysis is used to form an extent relocation plan for the extent pool. This plan provides a recommendation, based on the collected statistics, for relocating extents on a volume to the most appropriate storage device. The results of this data are summarized in a report that is available for download.

DS8000 Storage Tier Advisor Tool


IBM offers STAT for the DS8000 storage system and additional storage systems such as SAN
Volume Controller. STAT is a Microsoft Windows application that provides a graphical
representation of performance data that is collected by Easy Tier over its operational cycle.
This application generates a report that can be displayed with any internet browser. The STAT
provides configuration recommendations about high-performance flash (HPF)/SSD-class,
Enterprise-class, and Nearline-class tiers, and information about rank utilization and volume
heat distribution.

For more information about the STAT features, see 6.5, “Storage Tier Advisor Tool” on
page 213.

1.3.5 I/O Priority Manager


I/O Priority Manager enables more effective storage performance management that is
combined with the ability to align QoS levels to separate workloads in the system. For
DS8870 and earlier models, the I/O Priority Manager feature had to be purchased and
licensed separately. With the DS8880 family, this license is now included in the standard Base
license feature, so every DS8880 user is entitled to use this QoS feature without additional
cost.

The DS8000 storage system prioritizes access to system resources to achieve the wanted
QoS for the volume based on defined performance goals of high, standard, or low. I/O Priority
Manager constantly monitors and balances system resources to help applications meet their
performance targets automatically, without operator intervention.

It is increasingly common to use one storage system, and often fairly large pools, to serve
many categories of workloads with different characteristics and requirements. The
widespread use of virtualization and the advent of cloud computing make it common practice
to consolidate applications onto a shared storage infrastructure.
However, business-critical applications can suffer performance degradation because of
resource contention with less important applications. Workloads are forced to compete for
resources, such as disk storage capacity, bandwidth, DAs, and ranks.



Clients increasingly want to preserve the QoS of selected applications, even when contention
occurs. This environment is challenging because of the dynamic and diverse characteristics of
storage workloads, the different types of storage system components, and real-time
requirements.

I/O Priority Manager at work


Each logical volume has a performance group (PG) attribute that assigns it to a performance group.
Each performance group is associated with a performance policy that determines how the
I/O Priority Manager processes I/O operations for the logical volume.

The I/O Priority Manager maintains statistics for the set of logical volumes in each
performance group that can be queried. If management is performed for the performance
policy, the I/O Priority Manager controls the I/O operations of all managed performance
groups to achieve the goals of the associated performance policies. The performance group
of a volume defaults to PG0. Table 1-2 lists the performance groups that are predefined, with
their associated performance policies.

Table 1-2 DS8000 I/O Priority Manager - performance group to performance policy mapping
Performance  Performance  Priority  QoS target  Ceiling (max. delay  Performance policy description
group        policy                             factor [%])
====================================================================================================
0            1            0         0 (N/A)     0 (N/A)              Fixed block or CKD, No management
1 - 5        2            1         70          0 (N/A)              Fixed block high priority
6 - 10       3            5         40          500                  Fixed block medium priority
11 - 15      4            15        0 (N/A)     2000                 Fixed block low priority
16 - 18      16 - 18      0         0 (N/A)     0 (N/A)              CKD, No management
19           19           1         80          2000                 CKD high priority 1
20           20           2         80          2000                 CKD high priority 2
21           21           3         70          2000                 CKD high priority 3
22           22           4         45          2000                 CKD medium priority 1
23           23           4         5           2000                 CKD medium priority 2
24           24           5         45          2000                 CKD medium priority 3
25           25           6         5           2000                 CKD medium priority 4
26           26           7         5           2000                 CKD low priority 1
27           27           8         5           2000                 CKD low priority 2
28           28           9         5           2000                 CKD low priority 3
29           29           10        5           2000                 CKD low priority 4
30           30           11        5           2000                 CKD low priority 5
31           31           12        5           2000                 CKD low priority 6

Each performance group comes with a predefined priority and QoS target. Because
mainframe volumes and Open Systems volumes are on separate extent pools with different
rank sets, they do not interfere with each other, except in rare cases of overloaded DAs.



The user must associate the DS8000 logical volume with the performance policy that
characterizes the expected type of workload for that volume. This policy assignment allows
the I/O Priority Manager to determine the relative priority of I/O operations.

Open Systems can use several performance groups (such as PG1 - PG5) that share priority
and QoS characteristics. You can put applications into different groups for monitoring
purposes, without assigning different QoS priorities.

If the I/O Priority Manager detects resource overload conditions, such as resource contention
that leads to insufficient response times for higher-priority volumes, it throttles I/O for volumes
in lower-priority performance groups. This method allows the higher-performance group
applications to run faster and meet their QoS targets.

Important: Lower-priority I/O operations are delayed by I/O Priority Manager only if
contention exists on a resource that causes a deviation from normal I/O operation
performance. The I/O operations that are delayed are limited to operations that involve the
RAID arrays or DAs that experience contention.

A performance group is assigned to a volume at volume creation time. You can assign
performance groups to existing volumes by using the DS8000 command-line interface
(DSCLI) chfbvol and chckdvol commands, as shown in the sketch that follows. At any time,
you can reassign volumes online to other performance groups.
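
For example, the following commands (with placeholder volume ID ranges) assign a range of
fixed-block volumes to performance group PG8 and a range of CKD volumes to PG22:

dscli> chfbvol -perfgrp pg8 1100-110F
dscli> chckdvol -perfgrp pg22 0800-080F

The new performance group applies to subsequent I/O to those volumes.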

Modes of operation
I/O Priority Manager can operate in the following modes:
򐂰 Disabled: I/O Priority Manager does not monitor any resources and does not alter any I/O
response times.
򐂰 Monitor: I/O Priority Manager monitors resources and updates statistics that are available
in performance data. This performance data can be accessed from the DSCLI. No I/O
response times are altered.
򐂰 Manage: I/O Priority Manager monitors resources and updates statistics that are available
in performance data. I/O response times are altered on volumes that are in lower-QoS
performance groups if resource contention occurs.

In both monitor and manage modes, I/O Priority Manager can send Simple Network
Management Protocol (SNMP) traps to alert the user when certain resources detect a
saturation event.
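
The operating mode can be set at the storage image level with the DSCLI chsi command. The
following sketch enables the manage mode; check the valid -iopmmode keywords for your DSCLI
level:

dscli> chsi -iopmmode manage IBM.2107-75TV181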

Monitoring and performance reports


I/O Priority Manager generates performance statistics every 60 seconds for DAs, ranks, and
performance groups. These performance statistics samples are kept for a specified period.
I/O Priority Manager maintains the most recent sets of statistics at the following granularities:
򐂰 Sixty 1-minute intervals
򐂰 Sixty 5-minute intervals
򐂰 Sixty 15-minute intervals
򐂰 Sixty 1-hour intervals
򐂰 Sixty 4-hour intervals
򐂰 Sixty 1-day intervals



A useful option of I/O Priority Manager is that it can monitor the performance of the
entire storage system. For example, on a storage system where all the volumes are still in
their default performance group PG0 (which is not managed by I/O Priority Manager), a
regular performance report of the whole system can be obtained, as shown in Example 1-1.

Example 1-1 Monitoring default performance group PG0 for one entire month in one-day intervals
dscli> lsperfgrprpt -start 32d -stop 1d -interval 1d pg0
time grp resrc avIO avMB avresp pri avQ tgtQ %hlpT %dlyT %impt
==============================================================================================================
2015-10-01/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 8617 489.376 0.554 0 43 0 0 0 0
2015-10-02/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 11204 564.409 2.627 0 37 0 0 0 0
2015-10-03/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 21737 871.813 5.562 0 27 0 0 0 0
2015-10-04/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 21469 865.803 5.633 0 32 0 0 0 0
2015-10-05/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 23189 1027.818 5.413 0 54 0 0 0 0
2015-10-06/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 20915 915.315 5.799 0 52 0 0 0 0
2015-10-07/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 18481 788.450 6.690 0 41 0 0 0 0
2015-10-08/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 19185 799.205 6.310 0 43 0 0 0 0
2015-10-09/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 19943 817.699 6.069 0 41 0 0 0 0
2015-10-10/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 20752 793.538 5.822 0 49 0 0 0 0
2015-10-11/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 23634 654.019 4.934 0 97 0 0 0 0
2015-10-12/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 23136 545.550 4.961 0 145 0 0 0 0
2015-10-13/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 19981 505.037 5.772 0 92 0 0 0 0
2015-10-14/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 5962 176.957 5.302 0 93 0 0 0 0
2015-10-15/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 2286 131.120 0.169 0 135 0 0 0 0
2015-10-16/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 2287 131.130 0.169 0 135 0 0 0 0
2015-10-17/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 2287 131.137 0.169 0 135 0 0 0 0
2015-10-18/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 10219 585.908 0.265 0 207 0 0 0 0
2015-10-19/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 22347 1281.260 0.162 0 490 0 0 0 0
2015-10-20/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 13327 764.097 0.146 0 507 0 0 0 0
2015-10-21/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 9353 536.250 0.151 0 458 0 0 0 0
2015-10-22/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 9944 570.158 0.127 0 495 0 0 0 0
2015-10-23/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 11753 673.871 0.147 0 421 0 0 0 0
2015-10-24/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 11525 660.757 0.140 0 363 0 0 0 0
2015-10-25/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 5022 288.004 0.213 0 136 0 0 0 0
2015-10-26/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 5550 318.230 0.092 0 155 0 0 0 0
2015-10-27/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 8732 461.987 0.313 0 148 0 0 0 0
2015-10-28/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 13404 613.771 1.434 0 64 0 0 0 0
2015-10-29/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 25797 689.529 0.926 0 51 0 0 0 0
2015-10-30/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 25560 725.174 1.039 0 49 0 0 0 0
2015-10-31/17:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 25786 725.305 1.013 0 49 0 0 0 0

A client can create regular performance reports and store them easily, for example, on a
weekly or monthly basis. Example 1-1 is a report created on the first day of a month for the
previous month. This report shows a long-term overview of the amount of IOPS, MBps, and
their average response times.

If all volumes are in their default performance group, which is PG0, the report that is shown in
Example 1-1 is possible. However, when you want to start the I/O Priority Manager QoS
management and throttling of the lower-priority volumes, you must put the volumes into
non-default performance groups. For this example, we use PG1 - PG15 for Open Systems
and PG19 - PG31 for CKD volumes.

Monitoring is then possible on the performance-group level, as shown in Example 1-2, on the
RAID-rank level, as in Example 1-3 on page 21, or the DA pair level, as in Example 1-4 on
page 21.

Example 1-2 Showing reports for a certain performance group PG28 for a certain time frame
dscli> lsperfgrprpt -start 3h -stop 2h pg28
time grp resrc avIO avMB avresp pri avQ tgtQ %hlpT %dlyT %impt
============================================================================================================
2015-11-01/14:10:00 IBM.2107-75TV181/PG28 IBM.2107-75TV181 204 11.719 1.375 9 17 5 0 0 0



2015-11-01/14:15:00 IBM.2107-75TV181/PG28 IBM.2107-75TV181 401 23.020 1.273 9 18 5 0 0 0
2015-11-01/14:20:00 IBM.2107-75TV181/PG28 IBM.2107-75TV181 1287 73.847 1.156 9 28 5 0 0 0
2015-11-01/14:25:00 IBM.2107-75TV181/PG28 IBM.2107-75TV181 315 18.102 1.200 9 19 5 0 0 0
2015-11-01/14:30:00 IBM.2107-75TV181/PG28 IBM.2107-75TV181 261 15.013 1.241 9 22 5 0 0 0
2015-11-01/14:35:00 IBM.2107-75TV181/PG28 IBM.2107-75TV181 330 18.863 1.195 9 24 5 0 0 0
2015-11-01/14:40:00 IBM.2107-75TV181/PG28 IBM.2107-75TV181 535 30.670 1.210 9 19 5 0 0 0
2015-11-01/14:45:00 IBM.2107-75TV181/PG28 IBM.2107-75TV181 514 29.492 1.209 9 20 5 0 0 0
2015-11-01/14:50:00 IBM.2107-75TV181/PG28 IBM.2107-75TV181 172 9.895 1.191 9 21 5 0 0 0
2015-11-01/14:55:00 IBM.2107-75TV181/PG28 IBM.2107-75TV181 91 5.262 0.138 9 299 5 0 0 0
2015-11-01/15:00:00 IBM.2107-75TV181/PG28 IBM.2107-75TV181 0 0.000 0.000 9 0 5 0 0 0
2015-11-01/15:05:00 IBM.2107-75TV181/PG28 IBM.2107-75TV181 0 0.000 0.000 9 0 5 0 0 0

Example 1-3 Monitoring on the rank level for rank R19


dscli> lsperfrescrpt R19
time resrc avIO avMB avresp %Hutl %hlpT %dlyT %impt
===================================================================
2015-11-25/12:30:00 R19 1552 6.357 7.828 100 24 7 202
2015-11-25/12:35:00 R19 1611 6.601 8.55 100 4 6 213
2015-11-25/12:40:00 R19 1610 6.596 7.888 50 7 4 198
2015-11-25/12:45:00 R19 1595 6.535 6.401 100 12 7 167
2015-11-25/12:50:00 R19 1586 6.496 8.769 50 5 3 219

The lsperfrescrpt command displays performance statistics for individual ranks, as shown
in Example 1-3:
򐂰 The first three columns show the average number of IOPS, throughput, and average
response time, in milliseconds, for all I/Os on that rank.
򐂰 The %Hutl column shows the percentage of time that the rank utilization was high enough
(over 33%) to warrant workload control.
򐂰 The %hlpT column shows the average percentage of time that I/Os were helped on this
rank for all performance groups. This column shows the percentage of time where lower
priority I/Os were delayed to help higher priority I/Os.
򐂰 The %dlyT column specifies the average percentage of time that I/Os were delayed for all
performance groups on this rank.
򐂰 The %impt column specifies, on average, the length of the delay.

Example 1-4 Monitoring on the DA pair level for DP1


dscli> lsperfgrprpt -dapair dp1 pg0
time grp resrc avIO avMB avresp pri avQ tgtQ %hlpT %dlyT %impt
===================================================================================================
2015-11-01/16:55:00 IBM.2107-75TV181/PG0 DP1 55234 1196.234 0.586 0 110 0 0 0 0
2015-11-01/17:00:00 IBM.2107-75TV181/PG0 DP1 51656 1190.322 0.627 0 110 0 0 0 0
2015-11-01/17:05:00 IBM.2107-75TV181/PG0 DP1 52990 1181.204 0.609 0 108 0 0 0 0
2015-11-01/17:10:00 IBM.2107-75TV181/PG0 DP1 55108 1198.362 0.589 0 109 0 0 0 0
2015-11-01/17:15:00 IBM.2107-75TV181/PG0 DP1 54946 1197.284 0.589 0 110 0 0 0 0
2015-11-01/17:20:00 IBM.2107-75TV181/PG0 DP1 49080 1174.828 0.657 0 113 0 0 0 0
2015-11-01/17:25:00 IBM.2107-75TV181/PG0 DP1 53034 1198.398 0.607 0 110 0 0 0 0
2015-11-01/17:30:00 IBM.2107-75TV181/PG0 DP1 55278 1193.400 0.583 0 109 0 0 0 0
2015-11-01/17:35:00 IBM.2107-75TV181/PG0 DP1 55112 1198.100 0.584 0 110 0 0 0 0
2015-11-01/17:40:00 IBM.2107-75TV181/PG0 DP1 53204 1197.236 0.610 0 112 0 0 0 0
2015-11-01/17:45:00 IBM.2107-75TV181/PG0 DP1 54316 1200.102 0.594 0 110 0 0 0 0
2015-11-01/17:50:00 IBM.2107-75TV181/PG0 DP1 55222 1190.582 0.583 0 110 0 0 0 0



I/O Priority Manager with z Systems
The DS8000 storage systems also feature I/O Priority Manager for CKD volumes. Together
with z/OS Workload Manager (zWLM), this capability enables effective storage consolidation
and performance management that is combined with the ability to align QoS levels to
separate workloads in the system. This capability is exclusive to the DS8000 storage system
with z Systems.

The DS8000 storage systems prioritize access to system resources to achieve the QoS of the
volume based on the defined performance goals (high, medium, or low) of any volume. I/O
Priority Manager constantly monitors and balances system resources to help applications
meet their performance targets automatically without operator intervention based on input
from the zWLM software.

You can run the I/O Priority Manager for z/OS the same way as for Open Systems. You can
control the performance results of the CKD volumes and assign them online to
higher-priority or lower-priority performance groups, if needed. The zWLM integration offers
an additional level of automation. The zWLM can control and fully manage the performance
behavior of all volumes.

With z/OS and zWLM software support, the user assigns application priorities through the
Workload Manager. z/OS, then assigns an “importance” value to each I/O, based on the
zWLM inputs. Then, based on the prior history of I/O response times for I/Os with the same
importance value, and based on the zWLM expectations for this response time, z/OS assigns
an “achievement” value to each I/O.

The importance and achievement values for each I/O are compared. The I/O becomes
associated with a performance policy, independently of the performance group or policy of the
volume. When saturation or resource contention occurs, I/O is then managed according to the
preassigned zWLM performance policy.

I/O Priority Manager and Easy Tier


I/O Priority Manager and Easy Tier work independently. They can coexist in the same pool
and, if both are enabled, provide independent benefits:
򐂰 I/O Priority Manager attempts to ensure that the most important I/O operations are
serviced in resource contention situations by delaying less important I/Os. In contrast to
Easy Tier, it does not relocate any data or extents. I/O Priority Manager is, in that sense, a
reactive algorithm that works on a short-term time scale and only when resource
contention is present.
򐂰 Easy Tier (automatic mode) attempts to relocate extents across ranks and storage tiers to
identify the most appropriate storage tier for the subvolume data based on the workload
and access pattern. Easy Tier helps maximize storage performance and storage
economics throughout the system and maintain an optimal resource utilization.
Easy Tier, although it is based on past performance monitoring, moves the extents
proactively (except with warm-demote). Its goal is to minimize any potentially upcoming
bottleneck conditions so that the system performs at its optimal point. Also, it works on a
much longer time scale than I/O Priority Manager.

Together, these features can help consolidate multiple workloads on a single DS8000 storage
system. They optimize overall performance through automated tiering in a simple and
cost-effective manner while sharing resources. The DS8000 storage system can help
address storage consolidation requirements, which in turn helps manage increasing amounts
of data with less effort and lower infrastructure costs.



DS8000 I/O Priority Manager, REDP-4760 describes the I/O Priority Manager in detail. Also,
the following IBM patent provides insight into this feature:
http://depatisnet.dpma.de/DepatisNet/depatisnet?action=pdf&docid=US000007228354B2&
switchToLang=en




Chapter 2. Hardware configuration


This chapter describes the IBM System Storage DS8880 hardware configuration and its
relationship to the performance of the device.

Understanding the hardware components, the functions that are performed by each
component, and the technology can help you select the correct components to order and the
quantities of each component. However, do not focus too much on any one hardware
component. Instead, ensure that you balance components to work together effectively. The
ultimate criteria for storage server performance are response times and the total throughput.

This chapter includes the following topics:


򐂰 Storage unit, processor complex, and Storage Facility Image (SFI)
򐂰 Cache
򐂰 Disk enclosures and the PCIe infrastructure
򐂰 Device adapters (DAs) and disk back end
򐂰 Host adapters (HAs)



2.1 Storage system
It is important to understand the naming conventions that are used to describe the DS8000
components and constructs.

Storage unit
A storage unit consists of a single DS8000 storage system, including all its expansion frames.
A storage unit can consist of several frames: one base frame and up to four expansion frames
for a DS8886 storage system, up to three expansion frames for a DS8870 storage system, or
up to two expansion frames for a DS8884 storage system. The storage unit ID is the DS8000
base frame serial number, ending in 0 (for example, 75-06570).

Processor complex
A DS8880 processor complex is one POWER8 symmetric multiprocessor (SMP) system unit.
A DS8886 storage system uses a pair of IBM Power System S824 servers as processor
complexes, and a DS8884 storage system uses a pair of IBM Power System S822 servers as
processor complexes. In comparison, a DS8870 storage system uses an IBM Power 740
server pair (initially running on POWER7, later on POWER7+).

On all DS8000 models, the two processor complexes (servers) are housed in the base frame.
They form a redundant pair so that if either processor complex fails, the surviving one
continues to run the workload.

Storage Facility Image


In a DS8000, an SFI is a union of two logical partitions (processor LPARs), one from each
processor complex. Each LPAR hosts one server. The SFI controls one or more DA pairs and
two or more disk enclosures. Sometimes, the SFI is also called a storage image.
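
The DSCLI lssi command displays the SFI with its ID, the associated storage unit, and its state.
The following output is only illustrative (the WWNN is a placeholder); note that the SFI ID ends
in 1, whereas the storage unit ID ends in 0:

dscli> lssi
Name ID               Storage Unit     Model WWNN             State  ESSNet
============================================================================
-    IBM.2107-75TV181 IBM.2107-75TV180 981   500507630AFFC000 Online Enabled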

In a DS8000 storage system, a server is effectively the software that uses a processor logical
partition (a processor LPAR) and that has access to the memory and processor resources
that are available on a processor complex. The DS8880 models 980 and 981, and the
DS8870 model 961, are single-SFI models, so this one storage image uses 100% of the
resources.

An SFI has the following DS8000 resources dedicated to its use:


򐂰 Processors
򐂰 Cache and persistent memory
򐂰 I/O enclosures
򐂰 Disk enclosures

For internal naming in this example, we work with server 0 and server 1.

Figure 2-1 on page 27 shows an overview of the architecture of a DS8000 storage system.
You see the RAID adapters in the back end, which are the DAs that are used for the disk
drives and 2.5-inch solid-state drives (SSDs). The High-Performance Flash Enclosure
(HPFE) carries out the RAID calculation internally on its own dedicated PowerPC chips,
so it does not need to connect through conventional DAs.



The architecture of the previous storage system, the DS8870, is similar to Figure 2-1, with the
following exceptions:
򐂰 Instead of POWER8, the DS8870 storage system runs on POWER7 or POWER7+ processors
  (with fewer transistors than the newer POWER8 architecture).
򐂰 The internal PCIe fabric was Gen2 instead of Gen3.
򐂰 The I/O enclosures are new with the DS8880 storage system, and are now attached through
  x8 PCIe Gen3 cables.
򐂰 Because the DS8870 still used the internal GX++ bus as the I/O bridge, there was one
  additional component in the path, and the HPFEs needed an extra connection (a
  pass-through adapter pair) to carry the signals into the processor complexes. In the
  DS8880 storage system, the HPFEs connect directly within the PCIe fabric.

Figure 2-1 DS8000 general architecture

2.2 Processor memory and cache


The DS8886 model comes in cache sizes 128 GB - 2 TB. For the DS8884 storage system,
possible cache sizes are 64 GB - 256 GB. For the DS8870 storage system, the cache range
is 16 GB - 1 TB.

On all DS8000 models, each processor complex has its own system memory. Within each
processor complex, the system memory is divided into these parts:
򐂰 Memory that is used for the DS8000 control program
򐂰 Cache
򐂰 Persistent cache



The amount that is allocated as persistent memory scales according to the processor
memory that is selected.

2.2.1 Cache and I/O operations


Caching is a fundamental technique for avoiding (or reducing) I/O latency. The cache keeps
both the data that is read and the data that is written by the host servers. The host does not
need to wait for the hard disk drive (HDD) to obtain or store the data because the cache acts
as an intermediate repository. The DS8000 storage system prefetches data from disk for read
operations and destages write data to disk asynchronously from the host I/O processing.

Cache processing improves the performance of the I/O operations that are done by the host
systems that attach to the DS8000 storage system. Cache size, the efficient internal
structure, and algorithms of the DS8000 storage system are factors that improve I/O
performance. The significance of this benefit is determined by the type of workload that is run.

Read operations
These operations occur when a host sends a read request to the DS8000 storage system:
򐂰 A cache hit occurs if the requested data is in the cache. In this case, the I/O operation
does not disconnect from the channel/bus until the read is complete. A read hit provides
the highest performance.
򐂰 A cache miss occurs if the data is not in the cache. The I/O operation logically disconnects
from the host. Other I/O operations occur over the same interface. A stage operation from
the disk back end occurs.

Write operations: Fast writes


A fast write hit occurs when the write I/O operation completes as soon as the data that is
received from the host is transferred to the cache and a copy is made in the persistent memory.
Almost 100% of the data that is written to a DS8000 storage system results in fast-write hits.
The host is notified that the I/O operation is complete when the data is stored in the two
locations.

The data remains in the cache and persistent memory until it is destaged, at which point it is
flushed from cache. Destage operations of sequential write operations to RAID 5 arrays are
done in parallel mode, writing a stripe to all disks in the RAID set as a single operation. An
entire stripe of data is written across all the disks in the RAID array. The parity is generated
one time for all the data simultaneously and written to the parity disk. This approach reduces
the parity generation penalty that is associated with write operations to RAID 5 arrays. For
RAID 6, data is striped on a block level across a set of drives, similar to RAID 5
configurations. A second set of parity is calculated and written across all the drives. This
technique does not apply for the RAID 10 arrays because there is no parity generation that is
required. Therefore, no penalty is involved other than a double write when writing to RAID 10
arrays.

It is possible that the DS8000 storage system cannot copy write data to the persistent cache
because it is full, which can occur if all data in the persistent cache waits for destage to disk.
In this case, instead of a fast write hit, the DS8000 storage system sends a command to the
host to retry the write operation. Having full persistent cache is not a good situation because it
delays all write operations. On the DS8000 storage system, the amount of persistent cache is
sized according to the total amount of system memory. The amount of persistent cache is
designed so that the probability of full persistent cache occurring in normal processing is low.



Cache management
The DS8000 storage system offers superior caching algorithms:
򐂰 Sequential Prefetching in Adaptive Replacement Cache (SARC)
򐂰 Adaptive Multi-stream Prefetching (AMP)
򐂰 Intelligent Write Caching (IWC)

IBM Storage Development in partnership with IBM Research developed these algorithms.

Sequential Prefetching in Adaptive Replacement Cache


SARC is a self-tuning, self-optimizing solution for a wide range of workloads with a varying
mix of sequential and random I/O streams. This cache algorithm attempts to determine four
things:
򐂰 When data is copied into the cache
򐂰 Which data is copied into the cache
򐂰 Which data is evicted when the cache becomes full
򐂰 How the algorithm dynamically adapts to different workloads

The DS8000 cache is organized in 4 KB pages called cache pages or slots. This unit of
allocation ensures that small I/Os do not waste cache memory.

The decision to copy an amount of data into the DS8000 cache can be triggered from two
policies: demand paging and prefetching.
򐂰 Demand paging means that eight disk blocks (a 4 K cache page) are brought in only on a
cache miss. Demand paging is always active for all volumes and ensures that I/O patterns
with locality find at least some recently used data in the cache.
򐂰 Prefetching means that data is copied into the cache even before it is requested. To
prefetch, a prediction of likely future data access is required. Because effective,
sophisticated prediction schemes need extensive history of the page accesses, the
algorithm uses prefetching only for sequential workloads. Sequential access patterns are
commonly found in video-on-demand, database scans, copy, backup, and recovery. The
goal of sequential prefetching is to detect sequential access and effectively preload the
cache with data to minimize cache misses.

For prefetching, the cache management uses tracks. A track is a set of 128 disk blocks
(16 cache pages). To detect a sequential access pattern, counters are maintained with every
track to record if a track is accessed together with its predecessor. Sequential prefetching
becomes active only when these counters suggest a sequential access pattern. In this
manner, the DS8000 storage system monitors application read patterns and dynamically
determines whether it is optimal to stage into cache:
򐂰 Only the page requested
򐂰 The page requested, plus remaining data on the disk track
򐂰 An entire disk track or multiple disk tracks not yet requested

The decision of when and what to prefetch is made on a per-application basis (rather than a
system-wide basis) to be responsive to the data reference patterns of various applications
that can run concurrently.

With the z Systems integration of newer DS8000 codes, a host application, such as DB2, can
send cache hints to the storage system and manage the DS8000 prefetching, reducing the
number of I/O requests.



To decide which pages are flushed when the cache is full, sequential data and random
(non-sequential) data are separated into different lists, as illustrated in Figure 2-2.

Figure 2-2 Sequential Prefetching in Adaptive Replacement Cache

In Figure 2-2, a page that is brought into the cache by simple demand paging is added to the
Most Recently Used (MRU) head of the RANDOM list. With no further references to that
page, it moves down to the Least Recently Used (LRU) bottom of the list. A page that is
brought into the cache by a sequential access or by sequential prefetching is added to the
MRU head of the sequential (SEQ) list. It moves down that list as more sequential reads are
done. Additional rules control the management of pages between the lists so that the same
pages are not kept in memory twice.

To follow workload changes, the algorithm trades cache space between the RANDOM and
SEQ lists dynamically. Trading cache space allows the algorithm to prevent one-time
sequential requests from filling the entire cache with blocks of data with a low probability of
being read again. The algorithm maintains a wanted size parameter for the SEQ list. The size
is continually adapted in response to the workload. Specifically, if the bottom portion of the
SEQ list is more valuable than the bottom portion of the RANDOM list, the wanted size of the
SEQ list is increased. Otherwise, the wanted size is decreased. The constant adaptation
strives to make optimal use of limited cache space and delivers greater throughput and faster
response times for a specific cache size.

Sequential Prefetching in Adaptive Replacement Cache performance


IBM simulated a comparison of cache management with and without the SARC algorithm.
The enabled algorithm, with no change in hardware, provided these results:
򐂰 Effective cache space: 33% greater
򐂰 Cache miss rate: 11% reduced
򐂰 Peak throughput: 12.5% increased
򐂰 Response time: 50% reduced

Figure 2-3 on page 31 shows the improvement in response time because of SARC. Because
this algorithm is permanently enabled, the comparison was measured on an older DS8000
storage system (without flash).



Figure 2-3 Response time improvement with Sequential Prefetching in Adaptive Replacement Cache

Adaptive Multi-stream Prefetching


SARC dynamically divides the cache between the RANDOM and SEQ lists. The SEQ list
maintains pages that are brought into the cache by sequential access or sequential
prefetching.

The SEQ list is managed by the AMP technology, which was developed by IBM. AMP
introduces an autonomic, workload-responsive, and self-optimizing prefetching technology
that adapts both the amount of prefetch and the timing of prefetch on a per-application basis
to maximize the performance of the system. The AMP algorithm solves two problems that
plague most other prefetching algorithms:
򐂰 Prefetch wastage occurs when prefetched data is evicted from the cache before it can be
used.
򐂰 Cache pollution occurs when less useful data is prefetched instead of more useful data.

By choosing the prefetching parameters, AMP provides optimal sequential read performance
and maximizes the aggregate sequential read throughput of the system. The amount that is
prefetched for each stream is dynamically adapted according to the needs of the application
and the space that is available in the SEQ list. The timing of the prefetches is also
continuously adapted for each stream to avoid misses and, concurrently, to avoid any cache
pollution.



AMP dramatically improves performance for common sequential and batch processing
workloads. It also provides performance synergy with DB2 by preventing table scans from
being I/O bound. It improves the performance of index scans and DB2 utilities, such as Copy
and Recover. Furthermore, AMP reduces the potential for array hot spots that result from
extreme sequential workload demands.

SARC and AMP play complementary roles. SARC carefully divides the cache between the
RANDOM and the SEQ lists to maximize the overall hit ratio. AMP manages the contents of
the SEQ list to maximize the throughput that is obtained for the sequential workloads. SARC
affects cases that involve both random and sequential workloads. AMP helps any workload
that has a sequential read component, including pure sequential read workloads.

Intelligent Write Caching


Another advanced cache algorithm, IWC, is implemented in the DS8000 series. IWC
improves performance through better write-cache management and a better destaging order
of writes. This algorithm is a combination of CLOCK, a predominantly read-cache algorithm,
and CSCAN, an efficient write-cache algorithm. Out of this combination, IBM produced a
powerful and widely applicable write-cache algorithm.

The CLOCK algorithm uses temporal ordering. It keeps a circular list of pages in memory,
with the “hand” pointing to the oldest page in the list. When a page must be inserted into the
cache, the R (recency) bit is inspected at the “hand” location. If R is zero, the new page is
put in place of the page to which the “hand” points, and its R bit is set to 1. Otherwise, the
R bit is cleared (set to zero), the clock hand moves one step clockwise forward, and the
process is repeated until a page can be replaced.

The CSCAN algorithm uses spatial ordering. The CSCAN algorithm is the circular variation of
the SCAN algorithm. The SCAN algorithm tries to minimize the disk head movement when
servicing read and write requests. It maintains a sorted list of pending requests along with the
position on the drive of the request. Requests are processed in the current direction of the
disk head until it reaches the edge of the disk. At that point, the direction changes. In the
CSCAN algorithm, the requests are always served in the same direction. When the head
arrives at the outer edge of the disk, it returns to the beginning of the disk and services the
new requests in this direction only. This algorithm results in more equal performance for all
head positions.

The idea of IWC is to maintain a sorted list of write groups, as in the CSCAN algorithm. The
smallest and the highest write groups are joined, forming a circular queue. The addition is to
maintain a recency bit for each write group, as in the CLOCK algorithm. A write group is
always inserted in its correct sorted position, and its recency bit is initially set to 0.
When a write hit occurs, the recency bit is set to 1. For the destage operation, a destage
pointer is maintained that scans the circular list looking for destage victims. The algorithm
destages only write groups whose recency bit is 0. Write groups with a recency bit of 1 are
skipped, and their recency bit is reset to 0. This method gives an “extra life” to those write
groups that were hit since the last time the destage pointer visited them.

Figure 2-4 on page 33 shows how this mechanism works.



Figure 2-4 Intelligent Write Caching

In the DS8000 implementation, an IWC list is maintained for each rank. The dynamically
adapted size of each IWC list is based on the workload intensity on each rank. The rate of
destage is proportional to the portion of nonvolatile storage (NVS) occupied by an IWC list.
The NVS is shared across all ranks of a processor complex. Furthermore, destages are
smoothed out so that write bursts are not converted into destage bursts.

IWC has better or comparable peak throughput to the best of CSCAN and CLOCK across a
wide gamut of write-cache sizes and workload configurations. In addition, even at lower
throughputs, IWC has lower average response times than CSCAN and CLOCK. The
random-write parts of workload profiles benefit from the IWC algorithm. The costs for the
destages are minimized, and the number of possible write-miss IOPS greatly improves
compared to a system not using IWC.

The IWC algorithm can be applied to storage systems, servers, and their operating systems.
The DS8000 implementation is the first for a storage system. Because of IBM patents on this
algorithm and the other advanced cache algorithms, it is unlikely that a competitive system
uses them.

2.2.2 Determining the correct amount of cache storage


A common question is “How much cache do I need in my DS8000 storage system?”
Unfortunately, there is no quick and easy answer.

There are a number of factors that influence cache requirements:


򐂰 Is the workload sequential or random?
򐂰 Are the attached host servers z Systems or Open Systems?
򐂰 What is the mix of reads to writes?
򐂰 What is the probability that data is needed again after its initial access?
򐂰 Is the workload cache friendly (a cache-friendly workload performs better with relatively
large amounts of cache)?



It is a common approach to base the amount of cache on the amount of disk capacity, as
shown in these common general rules:
򐂰 For Open Systems, each TB of drive capacity needs 1 GB - 2 GB of cache.
򐂰 For z Systems, each TB of drive capacity needs 2 GB - 4 GB of cache.

For SAN Volume Controller attachments, consider the SAN Volume Controller node cache in
this calculation, which might lead to a slightly smaller DS8000 cache. However, most
installations come with a minimum of 128 GB of DS8000 cache. Using flash in the DS8000
storage system does not typically change the prior values. HPFE cards or SSDs are
beneficial with cache-unfriendly workload profiles because they reduce the cost of cache
misses.

Most storage servers support a mix of workloads. These general rules can work well, but
many times they do not. Use them like any general rule, but only if you have no other
information on which to base your selection.

When coming from an existing disk storage server environment and you intend to consolidate
this environment into DS8000 storage systems, follow these recommendations:
򐂰 Choose a cache size for the DS8000 series that has a similar ratio between cache size
and disk storage to that of the configuration that you use.
򐂰 When you consolidate multiple disk storage servers, configure the sum of all cache from
the source disk storage servers for the target DS8000 processor memory or cache size.

For example, consider replacing four DS8800 storage systems, each with 65 TB and 128 GB
cache, with a single DS8886 storage system. The ratio between cache size and disk storage
for each DS8800 storage system is 0.2% (128 GB/65 TB). The new DS8886 storage system
is configured with 300 TB to consolidate the four 65 TB DS8800 storage systems, plus
provide some capacity for growth. This DS8886 storage system should be fine with 512 GB of
cache, which roughly maintains the original cache-to-disk storage ratio (300 TB × 0.2% =
600 GB). If the calculated value falls between available sizes or you are in doubt, round up to
the next available memory size. When a SAN Volume Controller is used in front of the DS8880
storage system, its node cache might justify rounding down.

The cache size is not an isolated factor when estimating the overall DS8000 performance.
Consider it with the DS8000 model, the capacity and speed of the disk drives, and the
number and type of HAs. Larger cache sizes mean that more reads are satisfied from the
cache, which reduces the load on DAs and the disk drive back end that is associated with
reading data from disk. To see the effects of different amounts of cache on the performance of
the DS8000 storage system, run a Disk Magic model, which is described in 6.1, “Disk Magic”
on page 160.

2.3 I/O enclosures and the PCIe infrastructure


This section describes I/O enclosures and the PCIe infrastructure.

I/O enclosures
The DS8880 base frame and first expansion frame (if installed) both contain I/O enclosures.
I/O enclosures are installed in pairs. There can be one or two I/O enclosure pairs installed in
the base frame, depending on the model. Further I/O enclosures are installed in the first
expansion frame. Each I/O enclosure has six slots for adapters: DAs and HAs are installed in
the I/O enclosures. The I/O enclosures provide connectivity between the processor
complexes and the HAs or DAs. The DS8880 can have up to two DAs and four HAs installed
in an I/O enclosure.



In addition, each I/O enclosure has two PCIe slots to connect HPFEs.

Depending on the number of installed disks and the number of required host connections,
some of the I/O enclosures might not contain any adapters.

With the DAs, you work with DA pairs because DAs are always installed in quantities of two
(one DA is attached to each processor complex). The members of a DA pair are split across
two I/O enclosures for redundancy. The number of installed disk devices determines the
number of required DAs. In any I/O enclosure, the number of individual DAs installed can be
zero, one, or two.

PCIe infrastructure
The DS8880 processor complexes use a x8 PCI Express (PCIe) Gen-3 infrastructure to
access the I/O enclosures. This infrastructure greatly improves performance over previous
DS8000 models. There is no longer any GX++ or GX+ bus in the DS8880 storage system, as
there was in earlier models.

High-Performance Flash Enclosure


In a DS8870 storage system, each flash RAID adapter that is connected to an I/O enclosure
uses a 2 GBps PCIe Gen-2 four-lane cable. In a DS8880 storage system, these cables are 4
GBps and eight-lane. The PCIe connectivity has none of the protocol overhead that is
associated with the Fibre Channel (FC) architecture. Each I/O enclosure pair supports up to
two HPFEs.

2.4 Disk subsystem


The DS8880 models use a selection of Serial-Attached SCSI (SAS) 3.0 interface disk drives.
SAS allows multiple I/O streams to each device. The disk subsystem consists of three
components:
򐂰 The DAs are in the I/O enclosures. These DAs are RAID controllers that are used by the
storage images to access the RAID arrays.
򐂰 The DAs connect to switched controller cards in the disk enclosures, and they create a
switched FC disk network.
򐂰 The disks are called disk drive modules (DDMs).

2.4.1 Device adapters


Each DS8000 DA, which is installed in an I/O enclosure, provides four FC ports. These ports
are used to connect the processor complexes to the disk enclosures. The DA is responsible
for managing, monitoring, and rebuilding the disk RAID arrays. The PCIe RAID DA is built on
PowerPC technology with four FC ports and high-function, high-performance
application-specific integrated circuits (ASICs). To ensure maximum data integrity, the RAID
DA supports metadata creation and checking. The RAID DA is based on PCIe Gen. 2 and
operates at 8 Gbps.

DAs are installed in pairs because each processor complex requires its own DA to connect to
each disk enclosure for redundancy. DAs in a pair are installed in separate I/O enclosures to
eliminate the I/O enclosure as a single point of failure.

Each DA performs the RAID logic and frees the processors from this task. The throughput
and performance of a DA is determined by the port speed and hardware that are used, and
also by the firmware efficiency.
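
To verify which DAs and disk enclosures are installed and operational in a configuration, the
DSCLI provides inventory commands such as the following ones (output not shown; the storage
image ID is the one that is used in the examples of this book):

dscli> lsda -l IBM.2107-75TV181
dscli> lsstgencl IBM.2107-75TV181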



Figure 2-5 shows the detailed cabling between the DAs and the 24-drive enclosures,
so-called Gigapacks.

Figure 2-5 Detailed DA disk back-end diagram

The ASICs provide the FC-to-SAS bridging function from the external SFP connectors to
each of the ports on the SAS disk drives. The processor is the controlling element in the
system.

2.4.2 DS8000 Fibre Channel switched interconnection at the back end


The DS8000 models work with SAS disks. However, FC switching is used in the DS8000 back
end up to the point where the FC-to-SAS conversion is made in the disk enclosures.

The FC technology is commonly used to connect a group of disks in a daisy-chained fashion


in a Fibre Channel Arbitrated Loop (FC-AL). For commands and data to get to a particular
disk, they must traverse all disks ahead of it in the loop. Conventional FC-AL has these
shortcomings:
򐂰 In arbitration, each disk within an FC-AL loop competes with the other disks to get on the
loop because the loop supports only one operation at a time.
򐂰 If a failure of a disk or connection in the loop occurs, it can be difficult to identify the failing
component, which leads to lengthy problem determination. Problem determination is more
difficult when the problem is intermittent.
򐂰 A third issue with conventional FC-AL is the increasing amount of time that it takes to
complete a loop operation as the number of devices increases in the loop.

Switched disk architecture


To overcome the arbitration issue within FC-AL, the DS8880 architecture is enhanced by
adding a switch-based approach and creating FC-AL switched loops. It is called an FC
switched disk system.

These switches use the FC-AL protocol and attach to the SAS drives (bridging to SAS
protocol) through a point-to-point connection. The arbitration message of a drive is captured
in the switch, processed, and propagated back to the drive, without routing it through all the
other drives in the loop.

Performance is enhanced because both DAs connect to the switched FC subsystem back
end, as shown in Figure 2-6 on page 37. Each DA port can concurrently send and receive
data.



The two switched point-to-point connections to each drive, which also connect both DAs to
each switch, have these characteristics:
򐂰 There is no arbitration competition and interference between one drive and all the other
drives because there is no hardware in common for all the drives in the FC-AL loop. This
approach leads to an increased bandwidth with the full 8 Gbps FC speed to the back end,
where the FC-to-SAS conversion is made.
򐂰 This architecture doubles the bandwidth over conventional FC-AL implementations
because of two simultaneous operations from each DA to allow for two concurrent read
operations and two concurrent write operations at the same time.
򐂰 In addition to superior performance, this setup offers improved reliability, availability, and
serviceability (RAS) over conventional FC-AL. The failure of a drive is detected and
reported by the switch. The switch ports distinguish between intermittent failures and
permanent failures. The ports understand intermittent failures, which are recoverable, and
collect data for predictive failure statistics. If one of the switches fails, a disk enclosure
service processor detects the failing switch and reports the failure by using the other loop.
All drives can still connect through the remaining switch.

Figure 2-6 High availability and increased bandwidth connect both DAs to two logical loops

2.4.3 Disk enclosures


DS8880 disks are mounted in a disk enclosure, or Gigapack. Each enclosure holds up to
24 disks when you use the small form factor (SFF) 2.5-inch disks. We speak of disk enclosure
pairs or expansion enclosure pairs because you order and install them in groups of two.

The DS8880 storage system supports two types of high-density storage enclosure: the
2.5-inch SFF enclosure and the 3.5-inch large form factor (LFF) enclosure. The high-density
and lower-cost LFF storage enclosure accepts 3.5-inch drives, offering 12 drive slots. The
SFF enclosure offers twenty-four 2.5-inch drive slots. The front of the LFF enclosure differs
from the front of the SFF enclosure, with its 12 drives slotting horizontally rather than
vertically.



Drives are added in increments of 16 (except for Nearline disks, which can be ordered with a
granularity of eight). For each group of 16 disks, eight are installed in the first enclosure and
eight are installed in the next adjacent enclosure. These 16 disks form two array sites of eight
DDMs each, from which RAID arrays are built during the logical configuration process. For
each array site, four disks are from the first (or third, fifth, and so on), and four disks are from
the second (or fourth, sixth, and so on) enclosure.
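
The resulting array sites and their DA pair association can be listed with the DSCLI
lsarraysite command, as in the following sketch (output not shown; the exact columns depend
on the DSCLI level). Array sites that are not yet used by an array are listed as unassigned:

dscli> lsarraysite -l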

All drives within a disk enclosure pair must be the same capacity and rotation speed. A disk
enclosure pair that contains fewer than 48 DDMs must also contain dummy carriers called
fillers. These fillers are used to maintain airflow.

When ordering SSDs in increments of 16 (Driveset), you can create a balanced configuration
across the two processor complexes, especially with the high-performance capabilities of
SSDs. In general, an uneven number of ranks of similar type drives (especially SSDs) can
cause an imbalance in resources, such as cache or processor usage. Use a balanced
configuration with an even number of ranks and extent pools, for example, one even and one
odd extent pool, and each with one flash rank. This balanced configuration enables Easy Tier
automatic cross-tier performance optimization on both processor complexes. It distributes the
overall workload evenly across all resources.

For Nearline HDDs (especially with only three array sites per 3.5-inch disk enclosure pair, in
contrast to six array sites in a 2.5-inch disk enclosure pair), load balancing is not as critical
as it is with flash, so an uneven number of ranks is acceptable. Nearline HDDs show lower
performance characteristics.

Arrays across loops


Each array site consists of eight DDMs. Four DDMs are taken from the first enclosure, and
four DDMs are taken from the second enclosure, of an enclosure pair. When a RAID array is
created on the array site, half of the array is in each enclosure. Because the first enclosure is
on one switched loop, and the adjacent enclosure is on another, the array is split across two
loops, which is called array across loops (AAL).

By putting half of each array on one loop and half of each array on another loop, there are
more data paths into each array. This design provides a performance benefit, particularly in
situations where a large amount of I/O goes to one array, such as sequential processing and
array rebuilds.

2.4.4 DDMs
At the time of writing this book, the DS8880 provides a choice of the following DDM types:
򐂰 300 and 600 GB, 15 K RPM SAS disk, 2.5-inch SFF
򐂰 600 and 1200 GB, 10K RPM SAS disk, 2.5-inch SFF
򐂰 4 TB, 7,200 RPM Nearline-SAS disk, 3.5-inch LFF
򐂰 200/400/800/1600 GB e-MLC SAS SSDs (enterprise-grade Multi-Level Cell Solid-State
Drives), 2.5-inch SFF
򐂰 400 GB HPFE cards, 1.8-inch

All drives are encryption capable (Full Disk Encryption, FDE). Whether encryption is enabled
has no influence on their performance. Additional drive types are constantly being evaluated
and are added to this list when they become available.

These disks provide a range of options to meet the capacity and performance requirements of
various workloads, and to introduce automated multitiering.



2.4.5 Enterprise drives compared to Nearline drives and flash
The Enterprise-SAS drives for the DS8880 storage system run at 10,000 RPM or 15,000
RPM. The Nearline drives run at 7,200 RPM. The rotational speed of the disk drives is one
important aspect of performance. For the preferred performance with random workloads,
such as online transaction processing (OLTP) applications, choose an Enterprise HDD and
SSD combination. For pure sequential streaming loads, Nearline drives deliver almost the
same throughput as Enterprise-class drives or SSD-class drives at a lower cost. For random
workloads, the IOPS performance between Nearline class drives and Enterprise class drives
differs considerably and directly converts into longer access times for Nearline drives.
However, Nearline drives offer large capacities and a good price-per-TB ratio. Because of
their slower random performance, use Nearline drives only for small I/O access densities.
Access density is the number of I/Os per second per gigabyte of usable storage (IOPS/GB).

Another difference between these drive types is the RAID rebuild time after a drive failure.
This rebuild time grows with larger capacity drives. Therefore, RAID 6 is used for the
large-capacity Nearline drives to prevent a second disk failure during rebuild from causing a
loss of data, as described in 3.1.2, “RAID 6 overview” on page 49. RAID 6 delivers far fewer
IOPS per array for OLTP workloads than RAID 5. So, the lower RPM speed, the reduced spindle
count (because of their large capacity), and the RAID type make Nearline drives comparatively
slower. Therefore, they are typically used as the slow tier in hybrid pools.

A flash rank can potentially deliver tens of thousands of IOPS if the underlying RAID
architecture supports that many IOPS, at sub-millisecond response times even for cache misses.
Flash is a mature technology and can be used in critical production environments. The high
performance of flash drives for small-block/random workloads makes them financially viable as
part of a hybrid HDD/flash pool mix.
worn-out cells is copied proactively. There are several algorithms for wear-leveling across the
cells. The algorithms include allocating a rarely used block for the next block to use or moving
data internally to less-used cells to enhance lifetime. Error detection and correction
mechanisms are used, and bad blocks are flagged.

2.4.6 Installation order


A disk enclosure (Gigapack) pair of the DS8880 models holds 24 drives when using the
2.5-inch SFF drives, or 12 drives for the LFF HDDs. Each disk enclosure is installed in a
specific physical location within a DS8000 frame and in a specific sequence. Disk enclosure
pairs are installed in most cases bottom-up. The fastest drives of a configuration (usually an
SSD set) are installed first. Then, the Enterprise HDDs are installed, and then the Nearline
HDDs are installed.



Figure 2-7 shows how each disk enclosure is associated with a certain DA pair and position in
the DA loop. As more drives are ordered, more DA pairs are added to the DS8880 storage system.


Figure 2-7 DS8884 model - DA-to-disk mapping with all three possible frames shown

The DS8886 (model 981) can hold up to 1536 SFF drives with all expansion frames. Figure 2-8 on page 41 shows a DS8886 model that is fully equipped with 1536 SFF SAS drives and 240 flash cards in High-Performance Flash Enclosures (HPFEs).



Figure 2-8 DS8886 model - DA-to-disk mapping with all five possible frames shown (base frame and first through fourth expansion frames with their DA pairs, disk enclosures, HPFEs, I/O enclosures, DC-UPSs, and HMCs)

Up to eight DA pairs (16 individual DAs) are possible in a maximum DS8886 configuration.
With more than 384 disks, each DA pair handles more than 48 DDMs (six ranks). For usual
workload profiles, this number is not a problem. It is rare to overload DAs. The performance of
many storage systems is determined by the number of disk drives if there are no HA
bottlenecks.
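
If you want to verify how the installed DAs and disk enclosures are distributed in an existing configuration, the DSCLI provides listing commands. The following commands are a minimal sketch; the storage image ID is only an example, and the output columns vary by DSCLI release:

dscli> lsda -l IBM.2107-75FAW81
dscli> lsstgencl IBM.2107-75FAW81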

2.5 Host adapters


The DS8880 models support three types of HAs: FC/FICON 4-port HA cards with a nominal
speed of 16 Gbps, and 4-port and 8-port HA cards with a nominal port speed of 8 Gbps. Each
of these HAs is then equipped with either shortwave (SW) or longwave (LW) ports. HAs are
installed in the I/O enclosures. There is no affinity between the HA and the processor
complex. Either processor complex can access any HA.

2.5.1 Fibre Channel and FICON host adapters


FC is a technology standard that allows data to be transferred from one node to another node
at high speeds and great distances. Using the longwave fiber port HAs, connections are
possible up to a 10 km (6.2 miles) distance. That distance is not user-configurable; it depends
on the type of ordered HA and switch. All ports on each HA are the same type, either
longwave or shortwave. The DS8000 uses Fibre Channel Protocol (FCP) to transmit SCSI
traffic inside FC frames. It also uses FC to transmit Fibre Channel connection (FICON) traffic,
which uses FC frames to carry z Systems I/Os.



The type of the port can be changed through the DS Storage Manager GUI or by using the
DS8000 command-line interface (DSCLI) commands. A port cannot be both FICON and FCP
simultaneously, but it can be changed as required.
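
For example, you can list the current I/O port topologies and change a port between FICON and FCP with the DSCLI. This is a minimal sketch; the port IDs I0000 and I0001 are only illustrations:

dscli> lsioport -l
dscli> setioport -topology ficon I0000
dscli> setioport -topology scsifcp I0001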

The card itself is PCIe Gen 2, but working in a PCIe Gen3 slot. The card is driven by a new
high-function, high-performance application-specific integrated circuit (ASIC). To ensure
maximum data integrity, it supports metadata creation and checking. Each FC port supports a
maximum of 509 host login IDs and 1280 paths so that you can create large storage area
networks (SANs).

HAs and DAs essentially have a similar architecture.

The front end with the 16 Gbps ports scales up to 128 ports for a DS8886, which results in a
theoretical aggregated host I/O bandwidth of 128 × 16 Gbps. The HBAs use PowerPC chips
(quad-core Freescale for 16 Gbps HAs, dual-core Freescale for 8 Gbps HAs).

The 16 Gbps adapter ports can negotiate to 16, 8, or 4 Gbps. The 8 Gbps adapter ports can
each negotiate to 8, 4, or 2 Gbps speeds. For slower attachments, use a switch in between.
Only the 8 Gbps adapter ports can be used in FC-AL protocol, that is, without a switch.

The 8-port (8 Gbps) HA offers essentially the same total maximum throughput when taking
loads of all its ports together as the 4-port 8 Gbps HA. Therefore, the 8-port HAs are meant
for more attachment options, but not for more performance.

Random small-block performance is usually no issue when considering HA performance because this type of port can deliver up to 100 K IOPS for a 4 K (small-block) cache-hit workload.

Automatic port queues


For I/O between a server and a DS8880 FC port, both the server HBA and the DS8880 HA support queuing I/Os. The length of this queue is the queue depth. Because several servers can communicate with a few DS8880 ports, the queue depth of a storage HA is larger than the queue depth on the server side. This also holds for the DS8880 storage system, which supports 2048 FC commands queued on a port. However, sometimes the port
queue in the DS8880 HA can be flooded. When the number of commands that are sent to the
DS8000 port exceeds the maximum number of commands that the port can queue, the port
must discard these additional commands. This operation is a normal error recovery operation
in the FCP to prevent more damage. The normal recovery is a 30-second timeout for the
server, after which time the command is resent. The server has a command retry count
before it fails the command. Command Timeout entries show in the server logs.

Automatic Port Queues is a mechanism that the DS8000 storage system uses to self-adjust
the queue based on the workload. The Automatic Port Queues mechanism allows higher port
queue oversubscription while maintaining a fair share for the servers and the accessed LUNs.
A port whose queue fills up goes into SCSI Queue Full mode, where it accepts no additional commands, to slow down the I/Os. By avoiding error recovery and the 30-second blocking
SCSI Queue Full recovery interval, the overall performance is better with Automatic Port
Queues.

2.5.2 Multiple paths to Open Systems servers


For high availability, each host system must use a multipathing device driver, such as
Subsystem Device Driver (SDD), and each host system must have a minimum of two host
connections to HA cards in different I/O enclosures on the DS8880 storage system.



After you determine your throughput workload requirements, you must choose the
appropriate number of connections to put between your Open Systems hosts and the
DS8000 storage system to sustain this throughput. Use an appropriate number of HA cards
to satisfy high throughput demands. The number of host connections per host system is
primarily determined by the required bandwidth.

Host connections frequently go through various external connections between the server and
the DS8000 storage system. Therefore, you need enough host connections for each server
so that if half of the connections fail, processing can continue at the level before the failure.
This availability-oriented approach requires that each connection carry only half the data
traffic that it otherwise might carry. These multiple lightly loaded connections also help to
minimize the instances when spikes in activity might cause bottlenecks at the HA or port. A
multiple-path environment requires at least two connections. Four connections are typical,
and eight connections are not unusual. Typically, these connections are spread across as many I/O enclosures in the DS8880 storage system as you have equipped with HAs.

Usually, SAN directors or switches are used. Use two separate switches to avoid a single
point of failure.

2.5.3 Multiple paths to z Systems servers


In the z Systems environment, the preferred practice is to provide multiple paths from each
host to a storage system. Typically, four paths are installed. The channels in each host that
can access each Logical Control Unit (LCU) in the DS8000 storage system are defined in the
hardware configuration definition (HCD) or I/O configuration data set (IOCDS) for that host.
Dynamic Path Selection (DPS) allows the channel subsystem to select any available
(non-busy) path to initiate an operation to the storage system. Dynamic Path Reconnect
(DPR) allows the DS8000 storage system to select any available path to a host to reconnect
and resume a disconnected operation, for example, to transfer data after disconnection
because of a cache miss. These functions are part of the z Systems architecture and are
managed by the channel subsystem on the host and the DS8000 storage system.

In a z Systems environment, you must select a SAN switch or director that also supports
FICON. An availability-oriented approach applies to the z Systems environments similar to
the Open Systems approach. Plan enough host connections for each server so that if half of
the connections fail, processing can continue at the level before the failure.

2.5.4 Spreading host attachments


A common question is how to distribute the server connections. Take the example of four host connections from each of four servers, all running a similar type of workload, and four HAs.
Spread the host connections with each host attached to one port on each of four adapters.
Now, consider the scenario in which the workloads differ. You probably want to isolate
mission-critical workloads (for example, customer order processing) from workloads that are
lower priority but I/O intensive (for example, data mining). Prevent the I/O-intensive workload
from dominating the HA. If one of the four servers is running an I/O-intensive workload,
acquire two additional HAs and attach the four connections of the I/O-intensive server to
these adapters. Connect two host connections on each adapter. The other three servers
remain attached to the original four adapters.
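
On the DS8000 side, the host connection definitions determine which I/O ports a server WWPN is allowed to use, so you can reflect this kind of isolation in the configuration. The following DSCLI commands are a sketch only; the WWPNs, host type value, volume group, I/O port IDs, and connection names are assumptions for this example:

dscli> mkhostconnect -wwname 10000000C9A1B2C3 -hosttype LinuxRHEL -volgrp V10 -ioport I0030,I0130 dmserver_hba0
dscli> mkhostconnect -wwname 10000000C9A1B2C4 -hosttype LinuxRHEL -volgrp V10 -ioport I0230,I0330 dmserver_hba1
dscli> lshostconnect -l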



Here are some general guidelines:
򐂰 Spread the paths from all host systems across the available I/O ports, HA cards, and I/O
enclosures to optimize workload distribution across the available resources, depending on
your workload sharing and isolation considerations.
򐂰 Spread the host paths that access the same set of volumes as evenly as possible across
the available HA cards and I/O enclosures. This approach balances workload across
hardware resources and helps ensure that a hardware failure does not result in a loss of
access.
򐂰 Ensure that each host system uses a multipathing device driver, such as an SDD, and a
minimum of two host connections to different HA cards in different I/O enclosures on the
DS8880 storage system. Preferably, evenly distribute them between left-side
(even-numbered) I/O enclosures and right-side (odd-numbered) I/O enclosures for the
highest availability and a balanced workload across I/O enclosures and HA cards.
򐂰 Do not use the same DS8000 I/O port for host attachment and Copy Services remote
replication (such as Metro Mirror or Global Mirror). Ideally, if enough adapters are available, dedicate separate HAs (with all their ports) to Copy Services traffic only.
򐂰 Consider using separate HA cards for the FICON protocol and FCP. Even though I/O ports
on the same HA can be configured independently for the FCP and the FICON protocol, it
might be preferable to isolate your z/OS environment (FICON) from your Open Systems
environment (FCP).
򐂰 Do not use the 8-port (8 Gbps) HAs with all their ports when you have high throughputs
(GBps) to handle.

For more information, see 4.10.1, “I/O port planning considerations” on page 131.

2.6 Tools to aid in hardware planning


This chapter described the hardware components of the DS8000 storage system. Each
component provides high performance through its specific function and meshes well with the
other hardware components to provide a well-balanced, high-performance storage system.
There are many tools that can assist you in planning your specific hardware configuration.

2.6.1 White papers


IBM publishes white papers that document the performance of specific DS8000
configurations. Typically, workloads are run on multiple configurations, and performance
results are compiled so that you can compare the configurations. For example, workloads can
be run by using different DS8000 models, different numbers of HAs, or different types of
DDMs. By reviewing these white papers, you can make inferences about the relative
performance benefits of different components to help you to choose the type and quantities of
components to fit your particular workload requirements. A selection of these papers can be
found at the following website:
http://www.ibm.com/support/techdocs/atsmastr.nsf/Web/WhitePapers

Your IBM representative or IBM Business Partner has access to these and additional white
papers and can provide them to you.



2.6.2 Disk Magic

Knowledge of the DS8000 hardware components helps you understand the device and its
potential performance. However, consider using Disk Magic to model your planned DS8000
hardware configuration to ensure that it can handle the required workload. Your IBM
representative or IBM Business Partner has access to this tool and can run a Disk Magic
study to configure a DS8000 storage system that is based on your specific workloads. For
more information about the capabilities of this tool, see 6.1, “Disk Magic” on page 160. The
tool can also, under the name IntelliMagic Direction, be acquired by clients directly from
IntelliMagic B.V. at this website:
https://www.intellimagic.com/solutions/product-overview/disk-magic-rmf-magic-batch-magic

2.6.3 Capacity Magic


Determining the usable capacity of a disk configuration is a complex task, which depends on
DDM types, the RAID technique, and the logical volume types that are created. Use the
Capacity Magic tool to determine effective utilization. Your IBM representative or IBM
Business Partner has access to this tool and can use it to validate that the planned physical
disk configuration can provide enough effective capacity to meet your storage requirements.

Chapter 3. Logical configuration concepts and terminology

This chapter summarizes the important concepts that must be understood for the preparation
of the DS8000 logical configuration and about performance tuning. You can obtain more
information about the DS8000 logical configuration concepts in IBM DS8880 Architecture and
Implementation (Release 8), SG24-8323.

This chapter includes the following topics:


򐂰 RAID levels and spares
򐂰 The abstraction layers for logical configuration
򐂰 Data placement on ranks and extent pools



3.1 RAID levels and spares
The arrays on the DS8000 storage system are configured in either a RAID 5, RAID 6, or
RAID 10 configuration. The following RAID configurations are supported:
򐂰 6+P RAID 5 configuration: The data and parity information in this array is spread across
seven drives. From a capacity perspective, this configuration represents the logical
capacity of six data drives and one parity drive. The remaining drive on the array site is
used as a spare.
򐂰 7+P RAID 5 configuration: The data and parity information in this array is spread across
eight drives. From a capacity perspective, this configuration represents the logical capacity
of seven data drives and one parity drive.
򐂰 5+P+Q RAID 6 configuration: The data and parity information in this array is spread
across seven drives. From a capacity perspective, this configuration represents the logical
capacity of five data drives and two parity drives. The remaining drive on the array site is
used as a spare.
򐂰 6+P+Q RAID 6 configuration: The data and parity information in this array is spread
across eight drives. From a capacity perspective, this configuration represents the logical
capacity of six data drives and two parity drives.
򐂰 3+3 RAID 10 configuration: The array consists of three data drives that are mirrored to
three copy drives. The two remaining drives on the array site are used as spares.
򐂰 4+4 RAID 10 configuration: The array consists of four data drives that are mirrored to four
copy drives.

For a list of drive combinations and RAID configurations, see 8.5.2, “Disk capacity,” in IBM
DS8880 Architecture and Implementation (Release 8), SG24-8323.

3.1.1 RAID 5 overview


The DS8000 series supports RAID 5 arrays. RAID 5 is a method of spreading volume data
plus parity data across multiple disk drives. RAID 5 provides faster performance by striping
data across a defined set of disk drive modules (DDMs). Data protection is provided by the
generation of parity information for every stripe of data. If an array member fails, its contents
can be regenerated by using the parity data.

RAID 5 implementation in the DS8000 storage system


In a DS8000 storage system, a RAID 5 array that is built on one array site contains either
seven or eight disks, depending on whether the array site is supplying a spare. A seven-disk
array effectively uses one disk for parity, so it is called a 6+P array (where P stands for parity).
The reason only seven disks are available to a 6+P array is that the eighth disk in the array
site that is used to build the array is used as a spare. This array is called a 6+P+S array site
(where S stands for spare). An eight-disk array also effectively uses one disk for parity, so it is
called a 7+P array.

Drive failure with RAID 5


When a DDM fails in a RAID 5 array, the device adapter (DA) starts an operation to
reconstruct the data that was on the failed drive onto one of the spare drives. The spare that
is used is chosen based on a smart algorithm that looks at the location of the spares and the
size, speed, and location of the failed DDM. The DA rebuilds the data by reading the
corresponding data and parity in each stripe from the remaining drives in the array,
performing an exclusive-OR operation to re-create the data, and then writing this data to the
spare drive.



While this data reconstruction occurs, the DA can still service read and write requests to the
array from the hosts. There might be performance degradation while the sparing operation is
in progress because the DA and switched network resources are used to reconstruct the
data. Because of the switch-based architecture, this effect is minimal. Additionally, any read
requests for data on the failed drive require data to be read from the other drives in the array.
Then, the DA reconstructs the data.

Performance of the RAID 5 array returns to normal when the data reconstruction onto the
spare device completes. The time that is taken for sparing can vary, depending on the size of
the failed DDM and the workload on the array, the switched network, and the DA. The use of arrays across loops (AAL) both shortens the rebuild time and decreases the impact of a rebuild.

Smart Rebuild, an IBM technique that improves on classical RAID 5, further reduces the risk of a second drive failure for RAID 5 ranks during rebuild by detecting a failing drive early and copying its data to the spare drive in advance. If a drive in a RAID 5 array is predicted to fail, a “rebuild” is initiated by copying the data of that failing drive to the spare drive before it fails, decreasing the overall rebuild time. If the drive fails during the copy operation, the rebuild continues from the parity information like a regular rebuild.

3.1.2 RAID 6 overview


The DS8000 supports RAID 6 protection. RAID 6 presents an efficient method of data protection against double disk errors, such as two drive failures, two coincident medium errors, or a drive failure and a medium error. RAID 6 provides more fault tolerance than RAID 5 against disk failures and uses less raw disk capacity than RAID 10. RAID 6 allows for additional
fault tolerance by using a second independent distributed parity scheme (dual parity). Data is
striped on a block level across a set of drives, similar to RAID 5 configurations, and a second
set of parity is calculated and written across all the drives.

RAID 6 is preferably used in combination with large capacity disk drives, such as 4 TB Nearline drives, because of longer rebuild times and the increased risk of an additional
medium error in addition to the failed drive during the rebuild. In many environments, RAID 6
is also considered for Enterprise drives of capacities above 1 TB in cases where reliability is
favored over performance and the trade-off in performance versus reliability can be accepted.
However, with the Smart Rebuild capability, the risk of a second drive failure for RAID 5 ranks
is also further reduced.

Comparing RAID 6 to RAID 5 performance provides about the same results on reads. For
random writes, the throughput of a RAID 6 array is only two thirds of a RAID 5 array because
of the additional parity handling. Workload planning is important before implementing RAID 6, specifically for write-intensive applications, including Copy Services targets, for which RAID 6 is not recommended. Yet, when properly sized for the I/O demand, RAID 6 is a
considerable reliability enhancement.

RAID 6 implementation in the DS8000 storage system


A RAID 6 array from one array site of a DS8000 storage system can be built on either seven
or eight disks:
򐂰 In a seven-disk array, two disks are always used for parity, and the eighth disk of the array site is needed as a spare. This RAID 6 array is called a 5+P+Q+S array, where P and Q stand for parity and S stands for spare.
򐂰 A RAID 6 array that consists of eight disks is built when all necessary spare drives are available. An eight-disk RAID 6 array also always uses two disks for parity, so it is called a 6+P+Q array.



Drive failure with RAID 6
When a DDM fails in a RAID 6 array, the DA reconstructs the data of the failing drive on to one
of the available spare drives. A smart algorithm determines the location of the spare drive to
be used, depending on the size, speed, and location of the failed DDM. After the spare drive
replaces a failed one in a redundant array, the entire contents of the new drive are
recalculated by reading the corresponding data and parity in each stripe from the remaining
drives in the array and then writing this data to the spare drive.

During the rebuild of the data on the new drive, the DA can still handle I/O requests of the
connected hosts to the affected array. A performance degradation can occur during the
reconstruction because DAs and back-end resources are involved in the rebuild. Additionally,
any read requests for data on the failed drive require data to be read from the other drives in
the array, and then the DA reconstructs the data. Any later failure during the reconstruction
within the same array (second drive failure, second coincident medium errors, or a drive
failure and a medium error) can be recovered without loss of data.

Performance of the RAID 6 array returns to normal when the data reconstruction, on the
spare device, completes. The rebuild time varies, depending on the size of the failed DDM
and the workload on the array and the DA. The completion time is comparable to a RAID 5
rebuild, but slower than rebuilding a RAID 10 array in a single drive failure.

3.1.3 RAID 10 overview


RAID 10 provides high availability by combining features of RAID 0 and RAID 1. RAID 0
optimizes performance by striping volume data across multiple disk drives at a time. RAID 1
provides disk mirroring, which duplicates data between two disk drives. By combining the
features of RAID 0 and RAID 1, RAID 10 provides a second optimization for fault tolerance.
Data is striped across half of the disk drives in the RAID 1 array. The same data is also striped
across the other half of the array, creating a mirror. Access to data is preserved if one disk in
each mirrored pair remains available. RAID 10 offers faster random write operations than
RAID 5 because it does not need to manage parity. However, with half of the DDMs in the
group used for data and the other half to mirror that data, RAID 10 disk groups have less
capacity than RAID 5 disk groups.

RAID 10 is not as commonly used as RAID 5 mainly because more raw disk capacity is
required for every gigabyte of effective capacity. Typically, RAID 10 is used for workloads with
a high random-write ratio.

RAID 10 implementation in the DS8000 storage system


In the DS8000 storage system, the RAID 10 implementation is achieved by using either six or
eight DDMs. If spares exist on the array site, six DDMs are used to make a three-disk RAID 0
array, which is then mirrored. If spares do not exist on the array site, eight DDMs are used to
make a four-disk RAID 0 array, which is then mirrored.

Drive failure with RAID 10


When a DDM fails in a RAID 10 array, the DA starts an operation to reconstruct the data from
the failed drive onto one of the hot spare drives. The spare that is used is chosen based on a
smart algorithm that looks at the location of the spares and the size and location of the failed
DDM. A RAID 10 array is effectively a RAID 0 array that is mirrored. Thus, when a drive fails
in one of the RAID 0 arrays, you can rebuild the failed drive by reading the data from the
equivalent drive in the other RAID 0 array.



While this data reconstruction occurs, the DA can still service read and write requests to the
array from the hosts. There might be degradation in performance while the sparing operation
is in progress because DA and switched network resources are used to reconstruct the data.
Because of the switch-based architecture of the DS8000 storage system, this effect is
minimal. Read requests for data on the failed drive typically are not affected because they can
all be directed to the good RAID 1 array.

Write operations are not affected. Performance of the RAID 10 array returns to normal when
the data reconstruction, onto the spare device, completes. The time that is taken for sparing
can vary, depending on the size of the failed DDM and the workload on the array and the DA.

In relation to RAID 5, RAID 10 sparing completion time is a little faster. Rebuilding a RAID 5
6+P configuration requires six reads plus one parity operation for each write. A RAID 10 3+3
configuration requires one read and one write (a direct copy).

3.1.4 Array across loops


One DA pair creates two switched loops: The upper enclosure populates one loop, and the
lower enclosure populates the other loop, in a disk enclosure pair. Each enclosure can hold
up to 24 DDMs.

Each enclosure places two Fibre Channel (FC) switches onto each loop. SAS DDMs are
purchased in groups of 16 (drive set). Half of the new DDMs go into one disk enclosure, and
half of the new DDMs go into the other disk enclosure of the pair. The same setup applies to
SSD and NL-SAS drives where for the latter you also have a half drive set purchase option
with only eight DDMs.

An array site consists of eight DDMs. Four DDMs are taken from one enclosure in the disk
enclosure pair, and four are taken from the other enclosure in the pair. Therefore, when a
RAID array is created on the array site, half of the array is on each disk enclosure.

One disk enclosure of the pair is on one FC switched loop, and the other disk enclosure of the
pair is on a second switched loop. The array is split across two loops, so the term array across loops (AAL) is used.

AAL is used to increase performance. When the DA writes a stripe of data to a RAID 5 array,
it sends half of the write to each switched loop. By splitting the workload in this manner, each
loop is worked evenly. This setup aggregates the bandwidth of the two loops and improves
performance. If RAID 10 is used, two RAID 0 arrays are created. Each loop hosts one RAID 0
array. When servicing read I/O, half of the reads can be sent to each loop, again improving
performance by balancing workload across loops.

Array across loops and RAID 10


The DS8000 storage system implements the concept of arrays across loops (AAL). With AAL,
an array site is split into two halves. Half of the site is on the first disk loop of a DA pair, and
the other half is on the second disk loop of that DA pair. AAL is implemented primarily to
maximize performance, and it is used for all the RAID types in the DS8000 storage system.
However, in RAID 10, you can take advantage of AAL to provide a higher level of redundancy.
The DS8000 RAS code ensures that one RAID 0 array is maintained on each of the two loops
that are created by a DA pair. Therefore, in the unlikely event of a complete loop outage, the
DS8000 storage system does not lose access to the RAID 10 array. While one RAID 0 array
is offline, the other RAID 0 array remains available to service disk I/O.



3.1.5 Spare creation
When the arrays are created on a DS8000 storage system, the Licensed Internal Code
determines which array sites contain spares. The first array sites on each DA pair that are
assigned to arrays contribute one or two spares (depending on the RAID option) until the DA
pair has access to at least four spares, with two spares placed on each loop.

A minimum of one spare is created for each array site that is assigned to an array until the
following conditions are met:
򐂰 Minimum of four spares per DA pair
򐂰 Minimum of four spares for the largest capacity array site on the DA pair
򐂰 Minimum of two spares of capacity and speed (RPM) greater than or equal to the fastest
array site of any capacity on the DA pair.

For HPFE cards, because they are not using DA pairs, the first two arrays on each HPFE
come with spares (6+P+S), and the remaining two arrays come without extra spares (6+P)
because they share the spares with the first array pair.

Floating spares
The DS8000 storage system implements a smart floating technique for spare DDMs. A floating spare works as follows: when a DDM fails and the data that it contains is rebuilt onto a spare, the replacement disk becomes the new spare after the failed disk is replaced. The data is not migrated back to another DDM, such as the DDM in the original position that the failed DDM occupied.

The DS8000 Licensed Internal Code takes this idea one step further. It might allow the hot spare to remain where it is, but it can instead choose to migrate the spare to a more optimal position. This migration can better balance the spares across the DA pairs, the
loops, and the disk enclosures. It might be preferable that a DDM that is in use as an array
member is converted to a spare. In this case, the data on that DDM is migrated in the
background on to an existing spare. This process does not fail the disk that is being migrated,
although it reduces the number of available spares in the DS8000 storage system until the
migration process is complete.

The DS8000 storage system uses this smart floating technique so that the larger or faster
(higher RPM) DDMs are allocated as spares. Allocating the larger or faster DDMs as spares
ensures that a spare can provide at least the same capacity and performance as the replaced
drive. If you rebuild the contents of a 600 GB DDM onto a 1200 GB DDM, half of the 1200 GB
DDM is wasted because that space is not needed. When the failed 600 GB DDM is replaced
with a new 600 GB DDM, the DS8000 Licensed Internal Code most likely migrates the data
back onto the recently replaced 600 GB DDM. When this process completes, the 600 GB
DDM rejoins the array and the 1200 GB DDM becomes the spare again.

Another example is a failed 300 GB 15 K RPM DDM whose data is rebuilt onto a 600 GB 10 K RPM spare. The data is now on a slower DDM and wastes significant space, and the array has a mix of RPMs, which is not desirable. When the failed disk is replaced, the replacement is the same
type as the failed 15 K RPM disk. Again, a smart migration of the data is performed after
suitable spares are available.

Hot-pluggable DDMs
Replacement of a failed drive does not affect the operation of the DS8000 storage system
because the drives are fully hot-pluggable. Each disk plugs into a switch, so there is no loop
break associated with the removal or replacement of a disk. In addition, there is no potentially
disruptive loop initialization process.



Overconfiguration of spares
The DDM sparing policies support the overconfiguration of spares. This possibility might be of
interest to certain installations because it allows the deferral of the repair of certain DDM
failures until a later repair action is required.

Important: In general, when a drive fails either in RAID 10, 5, or 6, it might cause a
minimal degradation of performance on DA and switched network resources during the
sparing process. The DS8880 architecture and features minimize the effect of this
behavior.

You can check for any failed drives by running the DSCLI lsddm -state not_normal command. See Example 3-1.

Example 3-1 Output of DSCLI lsddm command


dscli> lssi
Name ID Storage Unit Model WWNN State ESSNet
======================================================================================
DS8K-R8-ESP-01 IBM.2107-75FAW81 IBM.2107-75FAW80 981 5005076306FFD693 Online Enabled

dscli> lsddm -state not_normal
CMUC00234I lsddm: No DDM FRU found.

3.2 The abstraction layers for logical configuration


This section describes the terminology and necessary steps to configure the logical volumes
that can be accessed from attached hosts.

The definition of virtualization is the abstraction process from the physical disk drives to a
logical volume that is presented to hosts and servers in a way that they see it as though it
were a physical disk.

In this context, virtualization is the process of preparing physical disk drives (DDMs) to become entities that can be used by an operating system, which means the creation of logical unit numbers (LUNs).

The DDMs are mounted in disk enclosures and connected in a switched FC topology by using
a Fibre Channel Arbitrated Loop (FC-AL) protocol. The DS8880 small form factor disks are
mounted in 24 DDM enclosures (mounted vertically), and the Nearline drives come in
12-DDM slot LFF enclosures (mounted horizontally).

The disk drives can be accessed by a pair of DAs. Each DA has four paths to the disk drives.
One device interface from each DA connects to a set of FC-AL devices so that either DA can
access any disk drive through two independent switched fabrics (the DAs and switches are
redundant).

Each DA has four ports, and because DAs operate in pairs, there are eight ports or paths to
the disk drives. All eight paths can operate concurrently and can access all disk drives on the
attached fabric. However, in normal operation disk drives are typically accessed by one DA.
Which DA owns the disk is defined during the logical configuration process to avoid any
contention between the two DAs for access to the disks.



3.2.1 Array sites
An array site is a group of eight DDMs. The DDMs that make up an array site are
predetermined by the DS8000 storage system, but no predetermined processor complex
affinity exists for array sites. The DDMs that are selected for an array site are chosen from two
disk enclosures on different loops, as shown in Figure 3-1. The DDMs in the array site are of
the same DDM type, which means that they are the same capacity and speed (RPM).

Figure 3-1 Array site (eight DDMs: four from a disk enclosure on loop 1 and four from a disk enclosure on loop 2, connected through FC switches)

As you can see in Figure 3-1, array sites span loops. Four DDMs are taken from loop 1 and
another four DDMs from loop 2. Array sites are the building blocks that are used to define
arrays.

3.2.2 Arrays
An array, also called managed array, is created from one array site. Forming an array means
defining it as a specific RAID type. The supported RAID types are RAID 5, RAID 6, and
RAID 10 (see 3.1, “RAID levels and spares” on page 48). For each array site, you can select a
RAID type. The process of selecting the RAID type for an array is also called defining an
array.

Important: In the DS8000 implementation, one managed array is defined by using one
array site.

Figure 3-2 on page 55 shows the creation of a RAID 5 array with one spare, which is also
called a 6+P+S array (capacity of six DDMs for data, capacity of one DDM for parity, and a
spare drive). According to the RAID 5 rules, parity is distributed across all seven drives in this
example.



On the right side in Figure 3-2, the terms, D1, D2, D3, and so on, represent the set of data
contained on one disk within a stripe on the array. If, for example, 1 GB of data is written, it is
distributed across all disks of the array.

Figure 3-2 Creation of an array (a 6+P+S RAID 5 array built from one array site, with data disks D1 - D6, distributed parity, and a spare)

So, an array is formed by using one array site, and although the array can be accessed by
each adapter of the DA pair, it is managed by one DA. Later in the configuration process, you
define the adapter and the server that manage this array.
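
In DSCLI terms, defining an array from an array site with a specific RAID type is a single command. The following sketch assumes that array site S1 is unassigned; the RAID type and site ID are only examples:

dscli> lsarraysite
dscli> mkarray -raidtype 5 -arsite S1
dscli> lsarray -l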

3.2.3 Ranks
In the DS8000 virtualization hierarchy, there is another logical construct, which is called a
rank. When you define a new rank, its name is chosen by the DS Storage Manager, for
example, R1, R2, and R3. You must add an array to a rank.

Important: In the DS8000 implementation, a rank is built by using only one array.

The available space on each rank is divided into extents. The extents are the building blocks
of the logical volumes. An extent is striped across all disks of an array, as shown in Figure 3-3
on page 56, and indicated by the small squares in Figure 3-4 on page 58.

The process of forming a rank includes two jobs:


򐂰 The array is formatted for either fixed block (FB) type data (Open Systems) or count key
data (CKD) (z Systems). This formatting determines the size of the set of data contained
on one disk within a stripe on the array.
򐂰 The capacity of the array is subdivided into equal-sized partitions, which are called extents.
The extent size depends on the extent type: FB or CKD.



An FB rank has an extent size of 1 GiB (where 1 GiB equals 2^30 bytes).

z Systems users or administrators typically do not deal with gigabytes or gibibytes, and
instead they think of storage in the original 3390 volume sizes. A 3390 Model 3 is three times
the size of a Model 1. A Model 1 has 1113 cylinders (about 0.94 GB). The extent size of a
CKD rank is one 3390 Model 1 or 1113 cylinders.
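
For example, a rank is formed from an array and fixed as either FB or CKD with the DSCLI mkrank command. The following lines are a sketch; the array IDs are illustrative:

dscli> mkrank -array A0 -stgtype fb
dscli> mkrank -array A1 -stgtype ckd
dscli> lsrank -l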

Figure 3-3 shows an example of an array that is formatted for FB data with 1 GiB extents (the
squares in the rank indicate that the extent is composed of several blocks from DDMs).

Figure 3-3 Form an FB rank with 1 GiB extents (the capacity of the RAID array is subdivided into 1 GiB extents that are striped across all disks of the array)

It is still possible to define a CKD volume with a capacity that is an integral multiple of one
cylinder or a fixed block LUN with a capacity that is an integral multiple of 128 logical blocks
(64 KB). However, if the defined capacity is not an integral multiple of the capacity of one
extent, the unused capacity in the last extent is wasted. For example, you can define a one
cylinder CKD volume, but 1113 cylinders (one extent) are allocated and 1112 cylinders are
wasted.

Encryption group
A DS8880 storage system comes with encryption-capable disk drives. If you plan to use encryption, you must define an encryption group before creating a rank. For more information, see IBM DS8870 Disk Encryption, REDP-4500. Currently, the DS8000 series
supports only one encryption group. All ranks must be in this encryption group. The
encryption group is an attribute of a rank. So, your choice is to encrypt everything or nothing.
You can turn on (create an encryption group) encryption later, but then all ranks must be
deleted and re-created, which means that your data is also deleted.



3.2.4 Extent pools
An extent pool, also called a pool, is a logical construct to aggregate the extents from a set of
ranks or managed arrays to form a domain for extent allocation to a logical volume.

With Easy Tier, it is possible to mix ranks with different characteristics and features in
managed extent pools to achieve the preferred performance results. You can mix all three storage classes or storage tiers, that is, solid-state drive (SSD), Enterprise, and Nearline-class disks, within the same extent pool.

Important: In general, do not mix ranks with different RAID levels or disk types (size and
RPMs) in the same extent pool if you are not implementing Easy Tier automatic
management of these pools. Easy Tier has algorithms for automatically managing
performance and data relocation across storage tiers and even rebalancing data within a
storage tier across ranks in multitier or single-tier extent pools, providing automatic storage
performance and storage economics management with the preferred price, performance,
and energy savings costs.

There is no predefined affinity of ranks or arrays to a storage server. The affinity of the rank
(and its associated array) to a certain server is determined at the point that the rank is
assigned to an extent pool.

One or more ranks with the same extent type (FB or CKD) can be assigned to an extent pool.
One rank can be assigned to only one extent pool. There can be as many extent pools as
there are ranks.

With storage-pool striping (the default extent allocation method (EAM) rotate extents), you
can create logical volumes striped across multiple ranks. This approach typically enhances
performance. To benefit from storage pool striping (see “Rotate extents (storage pool striping)
extent allocation method” on page 63), multiple ranks in an extent pool are required.

Storage-pool striping enhances performance, but it also increases the failure boundary. When
one rank is lost, for example, in the unlikely event that a whole RAID array failed because of
multiple failures at the same time, the data of this single rank is lost. Also, all volumes in the
pool that are allocated with the rotate extents option are exposed to data loss. To avoid
exposure to data loss for this event, consider mirroring your data to a remote DS8000 storage
system.

When an extent pool is defined, it must be assigned with the following attributes:
򐂰 Server affinity (or rank group)
򐂰 Storage type (either FB or CKD)
򐂰 Encryption group

Just like ranks, extent pools also belong to an encryption group. When defining an extent
pool, you must specify an encryption group. Encryption group 0 means no encryption.
Encryption group 1 means encryption.

The minimum reasonable number of extent pools on a DS8000 storage system is two. One
extent pool is assigned to storage server 0 (rank group 0), and the other extent pool is
assigned to storage server 1 (rank group 1) so that both DS8000 storage servers are active. In an environment where FB storage and CKD storage share the DS8000 storage system, four extent pools are required to assign each pair of FB pools and CKD pools to both storage servers, balancing capacity and workload across both DS8000 processor complexes.
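
A minimal DSCLI sketch of such a balanced layout creates one FB and one CKD extent pool per rank group and then assigns ranks to the resulting pools; the pool names, pool IDs, and rank IDs are examples only:

dscli> mkextpool -rankgrp 0 -stgtype fb FB_pool_0
dscli> mkextpool -rankgrp 1 -stgtype fb FB_pool_1
dscli> mkextpool -rankgrp 0 -stgtype ckd CKD_pool_0
dscli> mkextpool -rankgrp 1 -stgtype ckd CKD_pool_1
dscli> chrank -extpool P0 R0
dscli> chrank -extpool P1 R1
dscli> lsextpool -l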



Figure 3-4 is an example of a mixed environment with CKD and FB extent pools. Additional
extent pools might be wanted to segregate workloads.

Extent pools are expanded by adding more ranks to the pool. Ranks are organized in to two
rank groups: Rank group 0 is controlled by storage server 0 (processor complex 0), and rank
group 1 is controlled by storage server 1 (processor complex 1).

Important: For the preferred performance, balance capacity between the two servers and create at least two extent pools, with one extent pool per DS8000 storage server.

Figure 3-4 Extent pools (example of CKD and FB extent pools, such as CKD0, CKD1, FBprod, and FBtest, balanced between server 0 and server 1)

Dynamic extent pool merge


Dynamic extent pool merge is a capability that is provided by the Easy Tier manual mode
feature.

Dynamic extent pool merge allows one extent pool to be merged into another extent pool
while the logical volumes in both extent pools remain accessible to the host servers.



Dynamic extent pool merge can be used for the following scenarios:
򐂰 Use dynamic extent pool merge for the consolidation of smaller extent pools of the same
storage type into a larger homogeneous extent pool that uses storage pool striping.
Creating a larger extent pool allows logical volumes to be distributed evenly over a larger
number of ranks, which improves overall performance by minimizing skew and reducing
the risk of a single rank that becomes a hot spot. In this case, a manual volume rebalance
must be initiated to restripe all existing volumes evenly across all available ranks in the
new pool. Newly created volumes in the merged extent pool allocate capacity
automatically across all available ranks by using the rotate extents EAM (storage pool
striping), which is the default.
򐂰 Use dynamic extent pool merge for consolidating extent pools with different storage tiers
to create a merged multitier extent pool with a mix of storage classes (SSD, Enterprise
disk, and Nearline disk) for automated management by Easy Tier automatic mode.
򐂰 Under certain circumstances, use dynamic extent pool merge for consolidating extent
pools of the same storage tier but different drive types or RAID levels that can eventually
benefit from storage pool striping and Easy Tier automatic mode intra-tier management
(auto-rebalance) by using the Easy Tier micro-tiering capabilities.

Important: You can apply dynamic extent pool merge only among extent pools that are
associated with the same DS8000 storage system affinity (storage server 0 or storage
server 1) or rank group. All even-numbered extent pools (P0, P2, P4, and so on) belong to
rank group 0 and are serviced by storage server 0. All odd-numbered extent pools (P1, P3,
P5, and so on) belong to rank group 1 and are serviced by storage server 1 (unless one
DS8000 storage system failed or is quiesced with a failover to the alternative storage
system).

Additionally, the dynamic extent pool merge is not supported in these situations:
򐂰 If source and target pools have different storage types (FB and CKD).
򐂰 If you select an extent pool that contains volumes that are being migrated.

For more information about dynamic extent pool merge, see IBM DS8000 Easy Tier,
REDP-4667.

General considerations when adding capacity to an extent pool


If you must add capacity to an extent pool that is almost fully allocated on a DS8000 storage
system without the Easy Tier feature, add several ranks to this pool at the same time, not just
one. With this approach, the new volumes can be striped across the newly added ranks and
reduce skew and hot spots.

With the Easy Tier feature, you can easily add capacity and even single ranks to existing
extent pools without concern about performance.

Manual volume rebalance


Manual volume rebalance is a feature of the Easy Tier manual mode and is available only in
non-managed, single-tier (homogeneous) extent pools. It allows a balanced redistribution of
the extents of a volume across all ranks in the pool. This feature is not available in managed
or hybrid pools where Easy Tier is supposed to manage the placement of the extents
automatically on the ranks in the pool based on their actual workload pattern, available
storage tiers, and rank utilizations.



Manual volume rebalance is designed to redistribute the extents of volumes within a
non-managed, single-tier (homogeneous) pool so that workload skew and hot spots are less
likely to occur on the ranks. This feature is useful for redistributing extents after adding new
ranks to a non-managed extent pool. Also, this feature is useful after merging homogeneous
extent pools to balance the capacity and the workload of the volumes across all available
ranks in a pool.

Manual volume rebalance provides manual performance optimization by rebalancing the capacity of a volume within a non-managed homogeneous extent pool. It is also called
capacity rebalance because it balances the extents of the volume across the ranks in an
extent pool without considering any workload statistics or device usage. A balanced
distribution of the extents of a volume across all available ranks in a pool is supposed to
provide a balanced workload distribution and minimize skew and hot spots. It is also called
volume restriping, and it is supported for standard and ESE thin-provisioned volumes.
For further information about manual volume rebalance, see IBM DS8000 Easy Tier,
REDP-4667.
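
A volume restripe is typically initiated per volume from the DSCLI. The following single command is a sketch only and assumes the managefbvol migration action of recent DSCLI releases, with volume 1000 rebalanced within its current pool P0; verify the exact action and parameters for your code level:

dscli> managefbvol -action migstart -extpool P0 1000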

Auto-rebalance
With Easy Tier automatic mode enabled for single-tier or multitier extent pools, you can
benefit from Easy Tier automated intratier performance management (auto-rebalance), which
relocates extents based on rank utilization, and reduces skew and avoids rank hot spots.
Easy Tier relocates subvolume data on extent level based on actual workload pattern and
rank utilization (workload rebalance) rather than balance the capacity of a volume across all
ranks in the pool (capacity rebalance, as achieved with manual volume rebalance).

When adding capacity to managed pools, Easy Tier automatic mode performance
management, auto-rebalance, takes advantage of the new ranks and automatically populates
the new ranks that are added to the pool when rebalancing the workload within a storage tier
and relocating subvolume data. Auto-rebalance can be enabled for hybrid and homogeneous
extent pools.

Tip: For brand new DS8000 storage systems, the Easy Tier automatic mode switch is set
to Tiered, which means that Easy Tier is working only in hybrid pools. Have Easy Tier
automatic mode working in all pools, including single-tier pools. To do so, set the Easy Tier
automode switch to All.

For more information about auto-rebalance, see IBM DS8000 Easy Tier, REDP-4667.
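
The Easy Tier monitoring and automatic mode settings are system-wide and can be changed with the DSCLI chsi command. The parameter names below are assumptions based on recent DSCLI releases and should be verified for your code level; the storage image ID is an example:

dscli> chsi -etmonitor all -etautomode all IBM.2107-75FAW81
dscli> showsi IBM.2107-75FAW81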

3.2.5 Logical volumes


A logical volume is composed of a set of extents from one extent pool. On a DS8000 storage
system, up to 65,280 (we use the abbreviation 64 K in this chapter, even though it is 65,536 −
256, which is not quite 64 K in binary) volumes can be created (either 64 K CKD, or 64 K FB
volumes, or a mixture of both types with a maximum of 64 K volumes in total).

Fixed block LUNs


A logical volume that is composed of fixed block extents is called a logical unit number
(LUN). A fixed block LUN is composed of one or more extents (1 GiB, 2^30 bytes) from one FB extent pool. A LUN cannot span multiple extent pools, but a LUN can have extents from different ranks within the same extent pool. You can construct LUNs up to a size of 16 TiB (16 × 2^40 bytes, or 2^44 bytes).



Important: Copy Services support is usually for logical volumes up to 2 TiB (2 × 2^40 bytes) (up to 4 TiB possible through a special RPQ process). Do not create LUNs larger than 2
TiB if you want to use the DS8000 Copy Services for those LUNs, unless you want to use
them as managed disks in an IBM SAN Volume Controller and use SAN Volume Controller
Copy Services instead.

LUNs can be allocated in binary GiB (2^30 bytes), decimal GB (10^9 bytes), or 512 or 520-byte blocks. However, the physical capacity that is allocated for a LUN always is a multiple of 1 GiB
(binary), so it is a good idea to have LUN sizes that are a multiple of a gibibyte. If you define a
LUN with a LUN size that is not a multiple of 1 GiB, for example, 25.5 GiB, the LUN size is
25.5 GiB, but a capacity of 26 GiB is physically allocated, wasting 0.5 GiB of physical storage
capacity.
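
For example, the following DSCLI command creates four LUNs of 100 GiB each so that no allocated capacity is wasted. The extent pool, volume IDs, and naming pattern are illustrative only:

dscli> mkfbvol -extpool P0 -cap 100 -name open_#h 1000-1003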

CKD volumes
A z Systems CKD volume is composed of one or more extents from one CKD extent pool.
CKD extents are the size of 3390 Model 1, which has 1113 cylinders. However, when you
define a z Systems CKD volume, you do not specify the number of 3390 Model 1 extents, but
the number of cylinders that you want for the volume.

The maximum size for a CKD volume was originally 65,520 cylinders. Later DS8000 code releases introduced the Extended Address Volume (EAV), which gives you CKD volume sizes of up to around 1 TB.

If the number of specified cylinders is not an exact multiple of 1113 cylinders, part of the
space in the last allocated extent is wasted. For example, if you define 1114 or 3340
cylinders, 1112 cylinders are wasted. For maximum storage efficiency, consider allocating
volumes that are exact multiples of 1113 cylinders. In fact, consider multiples of 3339
cylinders for future compatibility.

A CKD volume cannot span multiple extent pools, but a volume can have extents from
different ranks in the same extent pool. You can stripe a volume across all ranks in an extent
pool by using the rotate extents EAM to distribute the capacity of the volume. This EAM
distributes the workload evenly across all ranks in the pool, minimizes skew, and reduces the
risk of single extent pools that become a hot spot. For more information, see “Rotate extents
(storage pool striping) extent allocation method” on page 63.
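
A corresponding DSCLI sketch for CKD allocates volumes in multiples of 3339 cylinders with the rotate extents EAM. The LCU definition, subsystem ID, extent pool (assumed to be a CKD pool), and volume IDs are assumptions for this example:

dscli> mklcu -qty 1 -id 00 -ss FF00
dscli> mkckdvol -extpool P3 -cap 3339 -datatype 3390 -eam rotateexts -name zprod_#h 0000-0003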

3.2.6 Space-efficient volumes


When a standard (non-thin-provisioned) FB LUN or CKD volume is created on the DS8000
storage system, it occupies as much physical capacity as is specified by its capacity.

For the DS8880 R8.0 code version, extent space-efficient (ESE) volumes are supported for
FB (Open System + IBM i) format. The ESE concept is described in detail in DS8000 Thin
Provisioning, REDP-4554.

A space-efficient volume does not occupy all of its physical capacity at the time it is created.
The space becomes allocated when data is written to the volume. The sum of all the virtual
capacity of all space-efficient volumes can be larger than the available physical capacity,
which is known as over-provisioning or thin provisioning.



The ESE volumes are thin-provisioned volumes that are designated for standard host access.
The virtual capacity pool for ESE volumes is created per extent pool. The available physical
capacity pool for allocation is the unused physical capacity in the extent pool. When an ESE
logical volume is created, the volume is not allocated physical data capacity. However, the
DS8000 storage system allocates capacity for metadata that it uses to manage space
allocation. Additional physical capacity is allocated when writes to an unallocated address
occur. Before the write, a new extent is dynamically allocated, initialized, and eventually
assigned to the volume (QuickInit).

The idea behind space-efficient volumes is to allocate physical storage at the time that it is
needed to satisfy temporary peak storage needs.

Important: No Copy Services support exists for logical volumes larger than 2 TiB (2 × 2^40
bytes). Thin-provisioned volumes (ESE) as released with the DS8000 LMC R8.0 are not
yet supported by CKD volumes. Thin-provisioned volumes are supported by most but not
all DS8000 Copy Services or advanced functions. These restrictions might change with
future DS8000 LMC releases, so check the related documentation for your DS8000 LMC
release.

Space allocation
Space for a space-efficient volume is allocated when a write occurs. More precisely, it is
allocated when a destage from the cache occurs and a new track or extent must be allocated.

Virtual space is created as part of the extent pool definition. This virtual space is mapped to
ESE volumes in the extent pool as needed. The virtual capacity equals the total logical
capacity of all ESE volumes. No physical storage capacity (other than for the metadata) is
allocated until write activity occurs.

The initfbvol command can release the space for space-efficient volumes.
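The following sketch shows how a thin-provisioned (ESE) volume might be created and how its allocated space can be released again. The pool ID, capacity, and volume ID are hypothetical:

dscli> mkfbvol -extpool P1 -cap 500 -sam ese -name thin_#h 1100
dscli> initfbvol -action releasespace 1100

The -sam ese parameter selects the extent space-efficient storage allocation method, and initfbvol -action releasespace frees the extents that were allocated by previous writes to the volume.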

3.2.7 Extent allocation methods


There are two extent allocation methods (EAMs) available on the DS8000 storage system for
non-managed extent pools: rotate volumes and rotate extents (also called storage pool
striping).

When you create a volume in a managed extent pool, that is, an extent pool that is managed
by Easy Tier automatic mode, the EAM of the volume always becomes managed.

Tip: Rotate extents and rotate volume EAMs determine the distribution of a volume
capacity and the volume workload distribution across the ranks in an extent pool. Rotate
extents (the default EAM) evenly distributes the capacity of a volume at a granular 1 GiB
extent level across the ranks in a homogeneous extent pool. It is the preferred method to
reduce skew and minimize hot spots, improving overall performance.

Rotate volumes extent allocation method


Extents can be allocated sequentially. In this case, all extents are taken from the same rank
until there are enough extents for the requested volume size or the rank is full, in which case
the allocation continues with the next rank in the extent pool. If multiple volumes are created
in one operation, the allocation of each volume starts on another rank, that is, the starting
rank is rotated from volume to volume through the ranks of the pool.



You might want to consider this allocation method when you manage performance manually.
Because the workload of one volume is directed to one rank, this method helps you identify
performance bottlenecks more easily. However, by placing all volume data and workload on
one rank, you increase skew and the likelihood that single ranks become a bottleneck and
limit overall system performance.

Rotate extents (storage pool striping) extent allocation method


The preferred storage allocation method is rotate extents (storage pool striping). It is the
default EAM when a volume is created in a non-managed extent pool. The extents of a
volume can be striped across several ranks at a granular 1 GiB extent level. It is the preferred
method to reduce skew and minimize hot spots, improving overall performance in a
homogeneous multi-rank extent pool. An extent pool with multiple ranks is needed to benefit
from this storage allocation method.
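As a sketch with hypothetical IDs, the EAM can also be specified explicitly at volume creation, although rotate extents is already the default in non-managed pools:

dscli> mkfbvol -extpool P1 -eam rotateexts -cap 200 -name strp_#h 1200
dscli> mkfbvol -extpool P1 -eam rotatevols -cap 200 -name sngl_#h 1201

The first volume is striped at the extent level across all ranks in pool P1, whereas the second volume is placed on a single rank if enough free capacity is available on that rank.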

Although all volumes in the extent pool that use rotate extents are evenly distributed across all
ranks in the pool, the initial start of each volume is additionally rotated throughout the ranks to
improve a balanced rank utilization and workload distribution. If the first volume is created
starting on rank R(n), the allocation for the later volume starts on the later rank R(n+1) in the
pool. The DS8000 storage system maintains a sequence of ranks. While the extents of each
volume are rotated across all available ranks in the pool, the DS8000 storage system
additionally tracks the rank in which the last volume allocation started. The allocation of the
first extent for the next volume then starts on the next rank in that sequence.

Extent allocation in hybrid and managed extent pools


When you create a volume in a managed extent pool, that is, an extent pool that is managed
by Easy Tier automatic mode, the EAM of the volume always becomes managed. This
situation is true no matter which extent allocation method is specified at volume creation. The
volume is under control of Easy Tier. Easy Tier moves extents to the most appropriate
storage tier and rank in the pool based on performance aspects. Any specified EAM, such as
rotate extents or rotate volumes, is ignored. In managed extent pools, an initial EAM similar to
rotate extents for new volumes is used.

In hybrid or multitier extent pools (whether managed or non-managed by Easy Tier), initial
volume creation always starts on the ranks of the Enterprise tier first. The Enterprise tier is
also called the home tier. The extents of a new volume are distributed in a rotate extents or
storage pool striping fashion across all available ranks in this home tier in the extent pool if
sufficient capacity is available. Only when all capacity on the home tier in an extent pool is
consumed does volume creation continue on the ranks of the Nearline tier. When all capacity
on the Enterprise tier and Nearline tier is exhausted, then volume creation continues
allocating extents on the SSD tier. The initial extent allocation in non-managed hybrid pools
differs from the extent allocation in single-tier extent pools with rotate extents (the extents of a
volume are not evenly distributed across all ranks in the pool because of the different
treatment of the different storage tiers). However, the attribute for the EAM of the volume is
shown as rotate extents if the pool is not under Easy Tier automatic mode control. After the
pool is managed by Easy Tier automatic mode, the EAM becomes managed.

In managed homogeneous extent pools with only a single storage tier, the initial extent
allocation for a new volume is the same as with rotate extents or storage-pool striping. For a
volume, the appropriate DSCLI command, showfbvol or showckdvol, which is used with the
-rank option, allows the user to list the number of allocated extents of a volume on each
associated rank in the extent pool.



The EAM attribute of any volume that is created or already in a managed extent pool is
changed to managed after Easy Tier automatic mode is enabled for the pool. When enabling
Easy Tier automatic mode for all extent pools, that is, hybrid and homogeneous extent pools,
all volumes immediately become managed by Easy Tier. Once set to managed, the EAM
attribute setting for the volume is permanent. All previous volume EAM attribute information,
such as rotate extents or rotate volumes, is lost.

Additional considerations for volumes and the extent allocation method


By using striped volumes, you distribute the I/O load of a LUN/CKD volume to multiple sets of
eight disk drives. The ability to distribute a workload to many physical drives can greatly
enhance performance for a logical volume. In particular, operating systems that do not have a
volume manager that can stripe can benefit most from this allocation method.

However, if you use, for example, Physical Partition striping in AIX already, double striping
probably does not improve performance any further.

Tip: Double-striping a volume, for example, by using rotate extents in storage and striping
a volume on the AIX Logical Volume Manager (LVM) level can lead to unexpected
performance results. Consider striping on the storage system level or on the operating
system level only. For more information, see 3.3.2, “Extent pool considerations” on
page 72.

If you decide to use storage-pool striping, it is preferable to use this allocation method for all
volumes in the extent pool to keep the ranks equally allocated and used.

Tip: When configuring a new DS8000 storage system, do not generally mix volumes that
use the rotate extents EAM (storage pool striping) and volumes that use the rotate volumes
EAM in the same extent pool.

Striping a volume across multiple ranks also increases the failure boundary. If you have extent
pools with many ranks and all volumes striped across these ranks, you lose the data of all
volumes in the pool if one rank fails after suddenly losing multiple drives, for example, if two
disk drives in the same RAID 5 rank fail at the same time.

If multiple EAM types are used in the same (non-managed) extent pool, you can use Easy Tier
manual mode to change the EAM from rotate volumes to rotate extents, and vice versa, by
using volume migration (dynamic volume relocation (DVR)) in non-managed, homogeneous
extent pools.

However, before switching the EAM of a volume, consider that you might need to change
other volumes in the same extent pool before your volume can be distributed across the
ranks. For example, assume that you created many volumes with the rotate volumes EAM and
only a few with rotate extents, and you now want to switch a single volume to rotate extents.
The ranks might not have enough free extents available to spread the extents of that volume
evenly over all ranks. In this case, you might have to proceed in multiple steps and switch
every volume to the new EAM type rather than changing only one volume. Depending on your
situation, you might also consider moving volumes to another extent pool before reorganizing
all volumes and extents in the extent pool.

In certain cases, for example, when you merge extent pools, you must also plan to reorganize
the extents in the resulting extent pool by using, for example, manual volume rebalance, so
that the extents of the volumes are properly redistributed across the ranks in the pool.



3.2.8 Allocating, deleting, and modifying LUNs and CKD volumes
All extents of the ranks that are assigned to an extent pool are independently available for
allocation to logical volumes. The extents for a LUN/volume are logically ordered, but they do
not have to come from one rank and the extents do not have to be contiguous on a rank.

This construction method of using fixed extents to form a logical volume in the DS8000 series
allows flexibility in the management of the logical volumes. You can delete LUNs/CKD
volumes, resize LUNs/volumes, and reuse the extents of those LUNs to create other
LUNs/volumes, maybe of different sizes. One logical volume can be removed without
affecting the other logical volumes defined on the same extent pool. Extents are cleaned after
you delete a LUN or CKD volume.

Logical volume configuration states


Each logical volume has a configuration state attribute. When a logical volume creation
request is received, a logical volume object is created and the logical volume configuration
state attribute is placed in the configuring configuration state.

After the logical volume is created and available for host access, it is placed in the normal
configuration state. If a volume deletion request is received, the logical volume is placed in the
deconfiguring configuration state until all capacity that is associated with the logical volume is
deallocated and the logical volume object is deleted.

The reconfiguring configuration state is associated with a volume expansion request. The
transposing configuration state is associated with an extent pool merge. The migrating,
migration paused, migration error, and migration canceled configuration states are
associated with a volume relocation request.

Dynamic volume expansion


The size of a LUN or CKD volume can be expanded without losing the data. On the DS8000
storage system, you add extents to the volume. The operating system must support this
resizing capability. If a logical volume is dynamically expanded, the extent allocation follows
the volume EAM. Dynamically reducing the size of a logical volume is not supported because
most operating systems do not support this feature.

Important: Before you can expand a volume, you must remove any Copy Services
relationships that involve that volume.
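A hypothetical expansion sequence might look as follows, where the volume IDs and the new capacities (in GiB for the FB volume and in cylinders for the CKD volume) are examples only, and the host file system must be expanded afterward:

dscli> chfbvol -cap 200 1000
dscli> chckdvol -cap 65520 0100

The added extents are allocated according to the EAM of the volume.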

Dynamic volume relocation


DVR or volume migration is a capability that is provided as part of the Easy Tier manual mode
feature. It allows a logical volume to be migrated from its current extent pool to another extent
pool or even back into the same extent pool (to relocate the extents of a volume) while the
logical volume and its data remain accessible to the attached hosts. The user can request
DVR by using the migrate volume function that is available through the DS8000 Storage
Manager GUI or DSCLI (the managefbvol or manageckdvol command). DVR allows the user to
specify a target extent pool and an EAM. For example, you might want to move volumes that
are no longer needed for daily operation from a hybrid extent pool to a Nearline-only pool.
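As a sketch (the target pool and volume ID are hypothetical), such a migration might be requested and monitored with the following commands:

dscli> managefbvol -action migstart -extpool P4 -eam rotateexts 1000
dscli> showfbvol 1000

While the relocation is in progress, the showfbvol output reports the number of migrating extents and the source extent pool in the migratingfrom field.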



Important: DVR can be applied among extent pools that are associated with the same
DS8000 storage system affinity (storage server 0 or storage server 1) or rank group only.
All volumes in even-numbered extent pools (P0, P2, P4, and so on) belong to rank group 0
and are serviced by storage server 0. All volumes in odd-numbered extent pools (P1, P3,
P5, and so on) belong to rank group 1 and are serviced by storage server 1. Additionally,
the DVR is not supported if source and target pools are different storage types (FB and
CKD).

If the same extent pool is specified and rotate extents is used as the EAM, the volume
migration is carried out as manual volume rebalance, as described in “Manual volume
rebalance” on page 59. Manual volume rebalance is designed to redistribute the extents of
volumes within a non-managed, single-tier (homogeneous) pool so that workload skew and
hot spots are less likely to occur on the ranks. During extent relocation, only one extent at a
time is allocated rather than preallocating the full volume and only a minimum amount of free
capacity is required in the extent pool.

Important: A volume migration with DVR back into the same extent pool (for example,
manual volume rebalance for restriping purposes) is not supported in managed or hybrid
extent pools. Hybrid pools are always supposed to be prepared for Easy Tier automatic
management. In pools under control of Easy Tier automatic mode, the volume placement
is managed automatically by Easy Tier. It relocates extents across ranks and storage tiers
to optimize storage performance and storage efficiency. However, it is always possible to
migrate volumes across extent pools, no matter if those pools are managed,
non-managed, or hybrid pools.

For more information about this topic, see IBM DS8000 Easy Tier, REDP-4667.

3.2.9 Logical subsystem


A logical subsystem (LSS) is another logical construct. It groups logical volumes and LUNs in
groups of up to 256 logical volumes.

On the DS8000 series, there is no fixed binding between a rank and an LSS. The capacity of
one or more ranks can be aggregated into an extent pool. The logical volumes that are
configured in that extent pool are not necessarily bound to a specific rank. Different logical
volumes on the same LSS can even be configured in separate extent pools. The available
capacity of the storage facility can be flexibly allocated across LSSs and logical volumes. You
can define up to 255 LSSs on a DS8000 storage system.

For each LUN or CKD volume, you must select an LSS when creating the volume. The LSS is
part of the volume ID X'abcd' and must be specified upon volume creation.

Hexadecimal notation: To emphasize the hexadecimal notation of the DS8000 volume
IDs, LSS IDs, and address groups, an ‘X’ is used in front of the ID in part of the book.

You can have up to 256 volumes in one LSS. However, there is one restriction. Volumes are
created from extents of an extent pool, and an extent pool is associated with one of the two
DS8000 internal servers (each also called a central processor complex (CPC)): server 0 or
server 1. The LSS number also reflects this affinity to one of these servers.
All even-numbered LSSs (X'00', X'02', X'04', up to X'FE') are serviced by storage server 0
(rank group 0). All odd-numbered LSSs (X'01', X'03', X'05', up to X'FD') are serviced by
storage server 1 (rank group 1). LSS X’FF’ is reserved.



All logical volumes in an LSS must be either CKD or FB. Furthermore, LSSs are grouped into
address groups of 16 LSSs. All LSSs within one address group must be of the same storage type,
either CKD or FB. The first digit of the LSS ID or volume ID specifies the address group. For
more information, see 3.2.10, “Address groups” on page 67.

z Systems users are familiar with a logical control unit (LCU). z Systems operating systems
configure LCUs to create device addresses. There is a one-to-one relationship between an
LCU and a CKD LSS (LSS X'ab' maps to LCU X'ab'). Logical volumes have a logical volume
number X'abcd' in hexadecimal notation where X'ab' identifies the LSS and X'cd' is one of the
256 logical volumes on the LSS. This logical volume number is assigned to a logical volume
when a logical volume is created and determines with which LSS the logical volume is
associated. The 256 possible logical volumes that are associated with an LSS are mapped to
the 256 possible device addresses on an LCU (logical volume X'abcd' maps to device
address X'cd' on LCU X'ab'). When creating CKD logical volumes and assigning their logical
volume numbers, consider whether Parallel Access Volumes (PAVs) are required on the LCU
and reserve addresses on the LCU for alias addresses.

For Open Systems, LSSs do not play an important role other than associating a volume with a
specific rank group and server affinity (storage server 0 or storage server 1) or grouping hosts
and applications together under selected LSSs for the DS8000 Copy Services relationships
and management.

Tip: Certain management actions in Metro Mirror, Global Mirror, or Global Copy operate at
the LSS level. For example, the freezing of pairs to preserve data consistency across all
pairs is at the LSS level. The option to put all or a set of volumes of a certain application in
one LSS can make the management of remote copy operations easier under certain
circumstances.

Important: LSSs for FB volumes are created automatically when the first FB logical
volume on the LSS is created, and deleted automatically when the last FB logical volume
on the LSS is deleted. CKD LSSs require user parameters to be specified and must be
created before the first CKD logical volume can be created on the LSS. They must be
deleted manually after the last CKD logical volume on the LSS is deleted.
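For example, a CKD LSS (LCU) might be created with the DSCLI mklcu command before the first volume is defined in it. The quantity, LCU ID, and subsystem ID (SSID) in the following sketch are hypothetical values:

dscli> mklcu -qty 1 -id 10 -ss 2300

After the LCU exists, base volumes can be created with mkckdvol, and alias devices for PAV can be added with the mkaliasvol command.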

3.2.10 Address groups


Address groups are created automatically when the first LSS that is associated with the
address group is created, and deleted automatically when the last LSS in the address group
is deleted. The DSCLI lsaddressgrp command displays a list of address groups in use for the
storage image and the storage type that is associated with it, either FB or CKD.

All devices in an address group must be either CKD or FB. LSSs are grouped into address
groups of 16 LSSs. LSSs are numbered X'ab', where a is the address group. So, all LSSs
within one address group must be of the same type, CKD or FB. The first LSS defined in an
address group sets the type of that address group. For example, LSS X'10' to LSS X'1F' are
all in the same address group and therefore can all be used only for the same storage type,
either FB or CKD.



Figure 3-5 shows the concept of volume IDs, LSSs, and address groups.

(The figure shows address group X'1x' used for CKD and address group X'2x' used for FB, with LSSs and their volumes mapped to CKD and FB extent pools that are served by storage server 0 and storage server 1.)
Figure 3-5 Volume IDs, logical subsystems, and address groups on the DS8000 storage systems

The volume ID X'gabb' in hexadecimal notation is composed of the address group X'g', the
LSS ID X'ga', and the volume number X'bb' within the LSS. For example, LUN X'2101'
denotes the second (X'01') LUN in LSS X'21' of address group 2.

3.2.11 Volume access


A DS8000 storage system provides mechanisms to control host access to LUNs. In most
cases, a server or host has two or more host bus adapters (HBAs) and needs access to a
group of LUNs. For easy management of server access to logical volumes, the DS8000
storage system uses the concept of host attachments and volume groups.

Host attachment
Host bus adapters (HBAs) are identified to the DS8000 storage system in a host attachment
or host connection construct that specifies the HBA worldwide port names (WWPNs). A set of
host ports (host connections) can be associated through a port group attribute in the DSCLI
that allows a set of HBAs to be managed collectively. This group is called a host attachment
within the GUI.

Each host attachment can be associated with a volume group to define which LUNs that HBA
is allowed to access. Multiple host attachments can share the volume group. The host
attachment can also specify a port mask that controls which DS8000 I/O ports the HBA is
allowed to log in to. Whichever ports the HBA logs in to, it sees the same volume group that is
defined on the host attachment that is associated with this HBA.

The maximum number of host attachments on a DS8000 is 8,192.



Volume group
A volume group is a named construct that defines a set of logical volumes. When used with
CKD hosts, there is a default volume group that contains all CKD volumes. Any CKD host that
logs in to a FICON I/O port has access to the volumes in this volume group. CKD logical
volumes are automatically added to this volume group when they are created and
automatically removed from this volume group when they are deleted.

When used with Open Systems hosts, a host attachment object that identifies the HBA is
linked to a specific volume group. You must define the volume group by indicating which FB
logical volumes are to be placed in the volume group. Logical volumes can be added to or
removed from any volume group dynamically.

One host connection can be assigned to one volume group only. However, the same volume
group can be assigned to multiple host connections. An FB logical volume can be assigned to
one or more volume groups. Assigning a logical volume to different volume groups allows a
LUN to be shared by hosts that are each configured with their own dedicated volume group
and set of volumes (which is useful when the hosts do not share an identical set of volumes).

The maximum number of volume groups is 8,320 for the DS8000 storage system.
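A minimal, hypothetical DSCLI sequence for an Open Systems host might look like the following example, where the WWPN, volume IDs, host type, and the volume group ID that is returned by mkvolgrp (V11 in this sketch) are examples only:

dscli> mkvolgrp -type scsimask -volume 1000-1003 AIX_host1_vg
dscli> mkhostconnect -wwname 10000000C9123456 -hosttype pSeries -volgrp V11 AIX_host1_fcs0

The host connection is linked to the volume group, so the HBA with the specified WWPN sees only the LUNs in that volume group, regardless of which DS8000 I/O ports it logs in to.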

3.2.12 Summary of the DS8000 logical configuration hierarchy


Describing the virtualization hierarchy, this section started with multiple disks that were
grouped in array sites. An array site is transformed into an array, eventually with spare disks.
The array was further transformed into a rank with extents formatted for FB data or CKD.
Next, the extents were added to an extent pool that determined which storage server served
the ranks and aggregated the extents of all ranks in the extent pool for later allocation to one
or more logical volumes. Within the extent pool, you can optionally reserve storage for
space-efficient volumes.

Next, this section described the creation of logical volumes within the extent pools (optionally
striping the volumes), assigning them a logical volume number that determined to which LSS
they are associated and indicated which server manages them. Space-efficient volumes can
be created immediately or within a repository of the extent pool. Then, the LUNs can be
assigned to one or more volume groups. Finally, the HBAs were configured into a host
attachment that is associated with a volume group.

This concept is what you see when working with the DSCLI. When working with the DS Storage
Manager GUI, some of this complexity is hidden: instead of array sites, arrays, and ranks, the
GUI simply presents “Arrays” or “Managed Arrays”, which are placed into “Pools”. Also, the
concept of a volume group is not directly visible when working with the GUI.

This virtualization concept provides for greater flexibility. Logical volumes can dynamically be
created, deleted, migrated, and resized. They can be grouped logically to simplify storage
management. Large LUNs and CKD volumes reduce the required total number of volumes,
which also contributes to a reduction of management efforts.



Figure 3-6 summarizes the virtualization hierarchy.

(The figure traces the hierarchy from array site to array, rank, extent pool, and logical volume, and from LSS to address group, volume group, and host attachment.)
Figure 3-6 DS8000 virtualization hierarchy

3.3 Data placement on ranks and extent pools


It is important to understand how the volume data is placed on ranks and extent pools. This
understanding helps you decide how to create extent pools and choose the required number
of ranks within an extent pool. It also helps you understand and detect performance problems
or optimally tweak overall system performance.

It can also help you fine-tune system performance from an extent pool perspective, for
example, sharing the resources of an extent pool evenly between application workloads or
isolating application workloads to dedicated extent pools. Data placement can help you when
planning for dedicated extent pools with different performance characteristics and storage
tiers without using Easy Tier automatic management. Plan your configuration carefully to
meet your performance goals by minimizing potential performance limitations that might be
introduced by single resources that become a bottleneck because of workload skew. For
example, use rotate extents as the default EAM to help reduce the risk of single ranks that
become a hot spot and limit the overall system performance because of workload skew.

If workload isolation is required in your environment, you can isolate workloads and I/O on the
rank and DA levels on the DS8000 storage systems, if required.



You can manually manage storage tiers that are related to different homogeneous extent
pools of the same storage class and plan for appropriate extent pools for your specific
performance needs. Easy Tier provides optimum performance and a balanced resource
utilization at a minimum configuration and management effort.

3.3.1 Rank and array considerations


For fine-tuning performance-intensive workloads that use homogeneous pool configurations,
a 7+P RAID 5 array performs better than a 6+P array, and a 4+4 RAID 10 performs better
than a 3+3 array. For random I/O, you might see up to 15% greater throughput on a 7+P or a
4+4 array than on a 6+P or a 3+3 array. Yet, with the larger net capacity, you also get more
IOPS landing on such an array, so there is usually no problem with mixing such larger and
smaller ranks in an Easy Tier managed pool. For sequential applications, the differences
are minimal. Generally, try to balance workload activity evenly across RAID arrays, regardless
of the size. It is typically not worth the additional management effort to do it otherwise.
However, if you use multiple pools, evaluate whether it makes sense to ensure that each pool
contains a similar proportion of 6+P and 7+P (or 3+3/4+4 or 5+P+Q/6+P+Q) ranks.

In a RAID 5 6+P or 7+P array, the amount of capacity equal to one disk is used for parity
information. However, the parity information is not bound to a single disk. Instead, it is striped
across all the disks in the array, so all disks of the array are involved to service I/O requests
equally.

In a RAID 6 5+P+Q or 6+P+Q array, the amount of capacity equal to two disks is used for
parity information. As with RAID 5, the parity information is not bound to single disks, but
instead is striped across all the disks in the array. So, all disks of the array service I/O
requests equally. However, a RAID 6 array might have one less drive available than a RAID 5
array configuration. Nearline drives, for example, by default allow RAID 6 configurations only.

In a RAID 10 3+3 array, the available usable space is the capacity of only three disks. Two
disks of the array site are used as spare disks. When a LUN is created from the extents of this
array, the data is always mirrored across two disks in the array. Each write to the array must
be performed twice to two disks. There is no additional parity information in a RAID 10 array
configuration.

Important: The spares in the mirrored RAID 10 configuration act independently; they are
not mirrored spares.

In a RAID 10 4+4 array, the available usable space is the capacity of four disks. No disks of
the array site are used as spare disks. When a LUN is created from the extents of this array,
the data is always mirrored across two disks in the array. Each write to the array must be
performed twice to two disks.

Important: The stripe width for the RAID arrays differs in size with the number of active
disks that hold the data. Because of the different stripe widths that make up the extent from
each type of RAID array, it is not a preferred practice to intermix RAID array types within
the same extent pool, especially in homogeneous extent pool configurations that do not
use Easy Tier automatic management. With Easy Tier enabled, the benefit of automatic
storage performance and storage efficiency management combined with Easy Tier
micro-tiering capabilities typically outperforms the disadvantage of different RAID arrays
within the same pool.



3.3.2 Extent pool considerations
The data placement, and thus the workload distribution, of your volumes on the ranks in a
non-managed homogeneous extent pool is determined by the selected EAM (as described in
3.2.7, “Extent allocation methods” on page 62) and by the number of ranks in the pool.

Even in single-tier homogeneous extent pools, you can benefit from Easy Tier automatic
mode (by running the DSCLI chsi ETautomode=all command). It manages the subvolume
data placement within the managed pool based on rank utilization and thus reduces workload
skew and hot spots (auto-rebalance).

In multitier hybrid extent pools, you can fully benefit from Easy Tier automatic mode (by
running the DSCLI chsi ETautomode=all|tiered command). It provides full automatic
storage performance and storage economics management by optimizing subvolume data
placement in a managed extent pool across different storage tiers and even across ranks
within each storage tier (auto-rebalance). Easy Tier automatic mode and hybrid extent pool
configurations offer the most efficient way to use different storage tiers. It optimizes storage
performance and storage economics across three drive tiers to manage more applications
effectively and efficiently with a single DS8000 storage system at an optimum price versus the
performance and footprint ratio.
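The corresponding DSCLI parameter syntax is sketched in the following example. The storage image ID is hypothetical, and the exact parameters can vary by code level, so verify them against the documentation for your DS8000 release:

dscli> chsi -etmonitor all IBM.2107-75ABCD1
dscli> chsi -etautomode all IBM.2107-75ABCD1

The first command enables Easy Tier monitoring for all volumes, and the second command enables Easy Tier automatic mode for all (hybrid and homogeneous) extent pools; a value of tiered restricts automatic management to multitier pools.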

The data placement and extent distribution of a volume across the ranks in an extent pool can
be displayed by running the DSCLI showfbvol -rank or showckdvol -rank command, as
shown in Example 3-2 on page 78.

Before configuring extent pools and volumes, be aware of the basic configuration principles
about workload sharing, isolation, and spreading, as described in 4.2, “Configuration
principles for optimal performance” on page 87.

Single-tier extent pools


A single-tier or homogeneous extent pool is formed by using disks of the same storage tier
with the same drive characteristics, which means not combining Enterprise class disks with
Nearline-class disks or SSDs in the same extent pool.

The first example, which is shown in Figure 3-7 on page 73, illustrates an extent pool with
only one rank, which is also referred to a single-rank extent pool. This approach is common if
you plan to use the SAN Volume Controller, for example, or if you plan a configuration that
uses the maximum isolation that you can achieve on the rank/extpool level. In this type of a
single-rank extent pool configuration, all volumes that are created are bound to a single rank.
This type of configuration requires careful logical configuration and performance planning
because single ranks are likely to become a hot spot and might limit overall system
performance. It also requires the highest administration and management effort because
workload skew typically varies over time, so you might have to constantly monitor your system
performance and react to hot spots. It also considerably limits the benefits that a
DS8000 storage system can offer regarding its virtualization and Easy Tier automatic
management capabilities.

In these configurations, use host-based striping methods to achieve a balanced distribution of
the data and the I/O workload across the ranks and back-end disk resources. For example,
you can use IBM AIX LVM or SAN Volume Controller to stripe the volume data across ranks,
or use SVC-based Easy Tier.



Figure 3-7 shows the data placement of two volumes created in an extent pool with a single
rank. Volumes that are created in this extent pool always use extents from rank R6, and are
limited to the capacity and performance capability of this single rank. Without the use of any
host-based data and workload striping methods across multiple volumes from different extent
pools and ranks, this rank is likely to experience rank hot spots and performance bottlenecks.

Also, in this example, one host can easily degrade the whole rank, depending on its I/O
workload, and affect multiple hosts that share volumes on the same rank if you have more
than one LUN allocated in this extent pool.

Figure 3-7 Extent pool with a single rank

The second example, which is shown in Figure 3-8, illustrates an extent pool with multiple
ranks of the same storage class or storage tier, which is referred to as a homogeneous or
single-tier extent pool. In general, an extent pool with multiple ranks also is called a
multi-rank extent pool.

Figure 3-8 Single-tier extent pool with multiple ranks



Although in principle both EAMs (rotate extents and rotate volumes) are available for
non-managed homogeneous extent pools, it is preferable to use the default allocation method
of rotate extents (storage pool striping). Use this EAM to distribute the data and thus the
workload evenly across all ranks in the extent pool and minimize the risk of workload skew
and a single rank that becomes a hot spot.

The use of the EAM rotate volumes still can isolate volumes to separate ranks, even in
multi-rank extent pools, wherever such configurations are required. This EAM minimizes the
configuration effort because a set of volumes that is distributed across different ranks can be
created with a single command. Plan your configuration and performance needs to implement
host-level-based methods to balance the workload evenly across all volumes and ranks.
However, it is not a preferred practice to use both EAMs in the same extent pool without an
efficient host-level-based striping method for the non-striped volumes. This approach easily
forfeits the benefits of storage pool striping and likely leads to imbalanced workloads across
ranks and potential single-rank performance bottlenecks.

Figure 3-8 on page 73 is an example of storage-pool striping for LUNs 1 - 4. It shows more
than one host and more than one LUN distributed across the ranks. In contrast to the
preferred practice, it also shows an example of LUN 5 being created with the rotate volumes
EAM in the same pool. The storage system tries to allocate the continuous space available for
this volume on a single rank (R1) until there is insufficient capacity that is left on this rank and
then it spills over to the next available rank (R2). All workload on this LUN is limited to these
two ranks. This approach considerably increases the workload skew across all ranks in the
pool and the likelihood that these two ranks might become a bottleneck for all volumes in the
pool, which reduces overall pool performance.

Multiple hosts with multiple LUNs, as shown in Figure 3-8 on page 73, share the resources
(resource sharing) in the extent pool, that is, ranks, DAs, and physical spindles. If one host or
LUN has a high workload, I/O contention can result and easily affect the other application
workloads in the pool, especially if all applications have their workload peaks at the same
time. Alternatively, applications can benefit from a much larger amount of disk spindles and
thus larger performance capabilities in a shared environment in contrast to workload isolation
and only dedicated resources. With resource sharing, expect that not all applications peak at
the same time, so that each application typically benefits from the larger amount of disk
resources that it can use. The resource sharing and storage-pool striping in non-managed
extent pools method is a good approach for most cases if no other requirements, such as
workload isolation or a specific quality of service (QoS) requirements, dictate another
approach.

Enabling Easy Tier automatic mode for homogeneous, single-tier extent pools always is an
additional option, and is preferred, to let the DS8000 storage system manage system
performance in the pools based on rank utilization (auto-rebalance). The EAM of all volumes
in the pool becomes managed in this case. With Easy Tier and its advanced micro-tiering
capabilities that take different RAID levels and drive characteristics into account for
determining the rank utilization in managed pools, even a mix of different drive characteristics
and RAID levels of the same storage tier might be an option for certain environments.

With Easy Tier and I/O Priority Manager, the DS8880 family offers advanced features when
taking advantage of resource sharing to minimize administration efforts and reduce workload
skew and hot spots while benefitting from automatic storage performance, storage
economics, and workload priority management. The use of these features in the DS8000
environments is highly encouraged. These features generally help provide excellent overall
system performance while ensuring QoS levels by prioritizing workloads in shared
environments at a minimum of administration effort and at an optimum price-performance
ratio.



Multitier extent pools
A multitier or hybrid extent pool is formed by combining ranks of different storage classes or
different storage tiers within the same extent pool. Hybrid pools are always supposed to be
managed by Easy Tier automatic mode, which provides automated storage performance and
storage economics management. It provides full automated cross-tier and intra-tier
management by dynamically relocating subvolume data (at the extent level) to the appropriate
rank and storage tier within an extent pool based on its access pattern. Easy Tier supports
automatic management of up to three storage tiers that use Enterprise, Nearline, and SSD
class drives. Using Easy Tier to let the system manage data placement and performance is
highly encouraged with DS8880 storage systems.

When you create a volume in a managed extent pool, that is, an extent pool that is managed
by Easy Tier automatic mode, the EAM of the volume always becomes managed. This
situation is true no matter which EAM is specified at volume creation. The volume is under
control of Easy Tier. Easy Tier moves extents to the most appropriate storage tier and rank in
the pool based on performance aspects. Any specified EAM, such as rotate extents or rotate
volumes, is ignored.

In managed or hybrid extent pools, an initial EAM that is similar to rotate extents for new
volumes is used. The same situation applies if an existing volume is manually moved to a
managed or hybrid extent pool by using volume migration or DVR. In hybrid or multitier extent
pools (whether managed or non-managed by Easy Tier), initial volume creation always starts
on the ranks of the Enterprise tier first. The Enterprise tier is also called the home tier. The
extents of a new volume are distributed in a rotate extents or storage pool striping fashion
across all available ranks in this home tier in the extent pool if sufficient capacity is available.
Only when all capacity on the home tier in an extent pool is consumed does volume creation
continue on the ranks of the flash or Nearline tier. The initial extent allocation in non-managed
hybrid pools differs from the extent allocation in single-tier extent pools with rotate extents (the
extents of a volume are not evenly distributed across all ranks in the pool because of the
different treatment of the different storage tiers). However, the attribute for the EAM of the
volume is shown as rotate extents if the pool is not under Easy Tier automatic mode control.
After Easy Tier automatic mode is enabled for a hybrid pool, the EAM of all volumes in that
pool becomes managed.

Mixing different storage tiers combined with Easy Tier automatic performance and economics
management on a subvolume level can considerably increase the performance versus price
ratio, increase energy savings, and reduce the overall footprint. The use of Easy Tier
automated subvolume data relocation and the addition of a flash tier are good for mixed
environments with applications that demand both IOPS and bandwidth at the same time. For
example, database systems might have different I/O demands according to their architecture.
Costs might be too high to allocate a whole database on flash. Mixing different drive
technologies, for example, flash with Enterprise or Nearline disks, and efficiently allocating
the data capacity on the subvolume level across the tiers with Easy Tier can highly optimize
price, performance, the footprint, and the energy usage. Only the hot part of the data
allocates flash or SSD capacity instead of provisioning flash capacity for full volumes.
Therefore, you can achieve considerable system performance at a reduced cost and footprint
with only a few SSDs.

The ratio of flash capacity to hard disk drive (HDD) disk capacity in a hybrid pool depends on
the workload characteristics and skew. Ideally, there must be enough flash or SSD capacity to
hold the active (hot) extents in the pool, but not more, to not waste the more expensive flash
capacity. For new DS8000 orders, 3 - 5% of flash capacity might be a reasonable percentage
to plan with hybrid pools in typical environments. This configuration can already result in the
movement of 50% and more of the small and random I/O workload from Enterprise drives to
flash. This configuration provides a reasonable initial estimate if measurement data is not
available to support configuration planning.



The Storage Tier Advisor Tool (STAT) also provides guidance for SSD capacity planning
based on the existing workloads on a DS8000 storage system with Easy Tier monitoring
capabilities. For more information about the STAT, see 6.5, “Storage Tier Advisor Tool” on
page 213.

Figure 3-9 shows a configuration of a managed two-tier extent pool with an SSD and
Enterprise storage tier. All LUNs are managed by Easy Tier. Easy Tier automatically and
dynamically relocates subvolume data to the appropriate storage tier and rank based on their
workload patterns. Figure 3-9 shows multiple LUNs from different hosts allocated in the
two-tier pool with hot data already migrated to SSDs. Initial volume creation in this pool
always allocates extents on the Enterprise tier first, if capacity is available, before Easy Tier
automatically starts promoting extents to the SSD tier.

Figure 3-9 Multitier extent pool with SSD and Enterprise ranks

Figure 3-10 on page 77 shows a configuration of a managed two-tier extent pool with an
Enterprise and Nearline storage tier. All LUNs are managed by Easy Tier. Easy Tier
automatically and dynamically relocates subvolume data to the appropriate storage tier and
rank based on their workload patterns. With more than one rank in the Enterprise storage tier,
Easy Tier also balances the workload and resource usage across the ranks within this
storage tier (auto-rebalance). Figure 3-10 on page 77 shows multiple LUNs from different
hosts allocated in the two-tier pool with cold data already demoted to the Nearline tier. Initial
volume creation in this pool always allocates extents on the Enterprise tier first, if capacity is
available, before Easy Tier automatically starts demoting extents to the Nearline tier.



Figure 3-10 Multitier extent pool with Enterprise and Nearline ranks

Figure 3-11 shows a configuration of a managed three-tier extent pool with an SSD,
Enterprise, and Nearline storage tiers. All LUNs are managed by Easy Tier. Easy Tier
automatically and dynamically relocates subvolume data to the appropriate storage tier and
rank based on their workload patterns. With more than one rank in the Enterprise storage tier,
Easy Tier also balances the workload and resource usage across the ranks within this
storage tier (auto-rebalance). Figure 3-11 shows multiple LUNs from different hosts allocated
in the three-tier pool with hot data promoted to the SSD/flash tier and cold data demoted to
the Nearline tier. Initial volume creation in this pool always allocates extents on the Enterprise
tier first, if capacity is available, before Easy Tier automatically starts promoting extents to the
SSD/flash tier or demoting extents to the Nearline tier.

Use the hybrid extent pool configurations under automated Easy Tier management. It
provides ease of use with minimum administration and performance management efforts
while optimizing the system performance, price, footprint, and energy costs.

Figure 3-11 Multitier extent pool with SSD/flash, Enterprise, and Nearline ranks



3.3.3 Easy Tier considerations
When an extent pool is under control of Easy Tier automatic mode, the EAM of all volumes in
this pool becomes managed. Easy Tier automatically manages subvolume data placement on
the extent level and relocates the extents based on their access pattern to the appropriate
storage tier in the pool (cross-tier or inter-tier management). Easy Tier also rebalances the
extents across the ranks of the same storage tier based on resource usage (intra-tier
management or auto-rebalance) to minimize skew and avoid hot spots in the extent pool.
Easy Tier can be enabled for multitier hybrid extent pools with up to three storage tiers or
single-tier homogeneous extent pools and automatically manage the data placement of the
volumes in these pools.

The initial allocation of extents for a volume in a managed single-tier pool is similar to the
rotate extents EAM or storage pool striping. So, the extents are evenly distributed across all
ranks in the pool right after the volume creation. The initial allocation of volumes in hybrid
extent pools differs slightly. The extent allocation always begins in a rotate extents-like fashion
on the ranks of the Enterprise tier first, and then continues on SSD/flash and Nearline ranks.

After the initial extent allocation of a volume in the pool, the extents and their placement on
the different storage tiers and ranks are managed by Easy Tier. Easy Tier collects workload
statistics for each extent in the pool and creates migration plans to relocate the extents to the
appropriate storage tiers and ranks. The extents are promoted to higher tiers or demoted to
lower tiers based on their actual workload patterns. The data placement of a volume in a
managed pool is no longer static or determined by its initial extent allocation. The data
placement of the volume across the ranks in a managed extent pool is subject to change over
time to constantly optimize storage performance and storage economics in the pool. This
process is ongoing and always adapting to changing workload conditions. After Easy Tier
data collection and automatic mode are enabled, it might take a few hours before the first
migration plan is created and applied. For more information about Easy Tier migration plan
creation and timings, see IBM DS8000 Easy Tier, REDP-4667.

The DSCLI showfbvol -rank or showckdvol -rank commands, and the showfbvol -tier or
showckdvol -tier commands, can help show the current extent distribution of a volume
across the ranks and tiers, as shown in Example 3-2. In this example, volume 0110 is
managed by Easy Tier and distributed across ranks R13 (HPFE flash or SSD tier), R10 (ENT
tier), and R11 (ENT). You can use the lsarray -l -rank Rxy command to show the storage
class and DA pair of a specific rank Rxy.

Example 3-2 showfbvol -tier and showfbvol -rank commands to show the volume-to-rank relationship in
a multitier pool
dscli> showfbvol -tier 0110
Name aix7_esp
ID 0110
accstate Online
datastate Normal
configstate Normal
deviceMTM 2107-900
datatype FB 512T
addrgrp 0
extpool P3
exts 100
captype DS
cap (2^30B) 100.0
cap (10^9B) -
cap (blocks) 209715200
volgrp V3
ranks 3



dbexts 0
sam ESE
repcapalloc -
eam managed
reqcap (blocks) 209715200
realextents 52
virtualextents 11
migrating 0
perfgrp PG0
migratingfrom -
resgrp RG0
tierassignstatus -
tierassignerror -
tierassignorder -
tierassigntarget -
%tierassigned 0
etmonpauseremain -
etmonitorreset unknown
===========Tier Distribution============
Tier %allocated
===============
SSD 49
ENT 3

dscli> showfbvol -rank 0110


Name aix7_esp
ID 0110
accstate Online
datastate Normal
configstate Normal
deviceMTM 2107-900
datatype FB 512T
addrgrp 0
extpool P3
exts 100
captype DS
cap (2^30B) 100.0
cap (10^9B) -
cap (blocks) 209715200
volgrp V3
ranks 3
dbexts 0
sam ESE
repcapalloc -
eam managed
reqcap (blocks) 209715200
realextents 52
virtualextents 11
migrating 0
perfgrp PG0
migratingfrom -
resgrp RG0
tierassignstatus -
tierassignerror -
tierassignorder -
tierassigntarget -
%tierassigned 0
etmonpauseremain -
etmonitorreset unknown
==============Rank extents==============



rank extents
============
R10 1
R11 2
R13 49

dscli> lsarray -l -rank r13


Array State Data RAIDtype arsite Rank DA Pair DDMcap (10^9B) diskclass encrypt
======================================================================================
A13 Assigned Normal 5 (6+P+S) S13 R13 18 400.0 Flash supported

dscli> lsarray -l -rank r11


Array State Data RAIDtype arsite Rank DA Pair DDMcap (10^9B) diskclass encrypt
======================================================================================
A11 Assigned Normal 5 (6+P+S) S7 R11 2 600.0 ENT supported

The volume heat distribution (volume heat map), which is provided by the STAT, helps you
identify hot, warm, and cold extents for each volume and its distribution across the storage
tiers in the pool. For more information about the STAT, see 6.5, “Storage Tier Advisor Tool” on
page 213.

Figure 3-12 gives an example of a three-tier hybrid pool managed by Easy Tier. It shows the
change of the volume data placement across the ranks over time when Easy Tier relocates
extents based on their workload pattern and adapts changing workload conditions. At time
T1, you see that all volumes are initially allocated on the Enterprise tier, evenly balanced
across ranks R8 and R9. After becoming cold, some extents from LUN 1 are demoted to
Nearline drives at T3. LUN 2 and LUN 3, with almost no activity, are demoted to Nearline
drives at T2. After increased activity, some data of LUN 2 is promoted to Enterprise drives
again at T3. The hot extent of LUN 4 is promoted to SSD/flash, and the cold extent is
demoted to Nearline drives at T2. After changing workload conditions on LUN 4 and the
increased activity on the demoted extents, these extents are promoted again to Enterprise
drives at T3. Cold extents from LUN 5 are demoted to Nearline drives with a constant low
access pattern over time. LUN 6 shows some extents promoted to SSD/flash at T2 and
further extent allocation changes across the ranks of the Enterprise tier at T3. The changes
are because of potential extent relocations from the SSD/flash tier (warm demote or swap
operations) or across the ranks within the Enterprise tiers (auto-rebalance), balancing the
workload based on workload patterns and rank utilization.

Figure 3-12 Data placement in a managed 3-tier extent pool with Easy Tier over time



Easy Tier constantly adapts to changing workload conditions and relocates extents, so the
extent locations of a volume are subject to constant change over time, depending on its
workload pattern. However, the number of relocations decreases and eventually becomes
marginal if the workload pattern remains steady across the Easy Tier decision windows.




Chapter 4. Logical configuration performance considerations

Important: Before reading this chapter, familiarize yourself with the material that is
covered in Chapter 3, “Logical configuration concepts and terminology” on page 47.

This chapter introduces a step-by-step approach to configuring the IBM System Storage
DS8000 with workload and performance considerations in mind:
- Reviewing the tiered storage concepts and Easy Tier
- Understanding the configuration principles for optimal performance:
  – Workload isolation
  – Workload resource-sharing
  – Workload spreading
- Analyzing workload characteristics to determine isolation or resource-sharing
- Planning allocation of the DS8000 disk and host connection capacity to identified workloads
- Planning spreading volumes and host connections for the identified workloads
- Planning array sites
- Planning RAID arrays and ranks with RAID-level performance considerations
- Planning extent pools with single-tier and multitier extent pool considerations
- Planning address groups, logical subsystems (LSSs), volume IDs, and count key data (CKD) Parallel Access Volumes (PAVs)
- Planning I/O port IDs, host attachments, and volume groups
- Implementing and documenting the DS8000 logical configuration



4.1 Reviewing the tiered storage concepts and Easy Tier
Many storage environments support a diversity of needs and use disparate technologies that
cause storage sprawl. In a large-scale storage infrastructure, this environment yields a
suboptimal storage design that can be improved only with a focus on data access
characteristics analysis and management to provide optimum performance.

4.1.1 Tiered storage concept


Tiered storage is an approach of using different types of storage throughout the storage
infrastructure. It is a mix of higher performing/higher-cost storage with lower
performing/lower-cost storage and placing data based on specific characteristics, such as the
performance needs, age, and importance of data availability. Correctly balancing these tiers
leads to the minimal cost and best performance solution. The concept and a cost versus
performance relationship of a tiered storage environment is shown in Figure 4-1.

Figure 4-1 Cost versus performance in a tiered storage environment

Typically, an optimal design keeps the active operational data in Tier 0 and Tier 1 and uses
Tiers 2 and 3 for less active data, as shown in Figure 4-2 on page 85.

The benefits that are associated with a tiered storage approach mostly relate to cost. By
introducing flash storage as tier 0, you might more efficiently address the highest
performance needs while reducing the Enterprise class storage, system footprint, and energy
costs. A tiered storage approach can provide the performance that you need and save
significant costs that are associated with storage because lower-tier storage is less
expensive. Environmental savings, such as energy, footprint, and cooling reductions, are
possible.



Figure 4-2 Storage pyramid with average relative amount of data in each tier

4.1.2 IBM System Storage Easy Tier


With flash cards and solid-state drives (SSDs) emerging as an attractive alternative to hard
disk drives (HDDs) in the enterprise storage market, tiered storage concepts that use flash,
Enterprise (ENT), and Nearline (NL) storage tiers are a promising approach to satisfy storage
performance and storage economics needs in mixed environments at an optimum
price-performance ratio.

With dramatically high I/O rates, low response times, and IOPS-energy-efficient
characteristics, flash addresses the highest performance needs and also potentially can
achieve significant savings in operational costs. However, the acquisition cost per GB is
higher than HDDs. To satisfy most workload characteristics, flash must be used efficiently
with HDDs in a well-balanced tiered storage architecture. It is critical to choose the correct
mix of storage tiers and the correct data placement to achieve optimal storage performance
and economics across all tiers at a low cost.

With the DS8000 storage system, you can easily implement tiered storage environments that
use flash, Enterprise, and Nearline class storage tiers. Still, different storage tiers can be
isolated to separate extent pools and volume placement can be managed manually across
extent pools where required. Or, better and highly encouraged, volume placement can be
managed automatically on a subvolume level in hybrid extent pools by Easy Tier automatic
mode with minimum management effort for the storage administrator. Easy Tier is a no-cost
feature on DS8000 storage systems. For more information about Easy Tier, see 1.3.4, “Easy
Tier” on page 11.

Consider Easy Tier automatic mode and hybrid extent pools for managing tiered storage on
the DS8000 storage system. Manually managing storage capacity and performance needs
across multiple storage classes increases the overall management and performance
monitoring effort considerably, and it does not achieve the efficiency of Easy Tier
automatic mode data relocation at the subvolume (extent) level. With Easy Tier,
client configurations show less potential to waste flash capacity than with volume-based
tiering methods.



With Easy Tier, you can configure hybrid extent pools (mixed flash/HDD storage pools) and
turn Easy Tier on. It then provides automated data relocation across the storage tiers and
ranks in the extent pool to optimize storage performance and storage economics. It also
rebalances the workload across the ranks within each storage tier (auto-rebalance) based on
rank utilization to minimize skew and hot spots. Furthermore, it constantly adapts to changing
workload conditions. There is no need anymore to bother with tiering policies that must be
manually applied to accommodate changing workload dynamics.

In environments with homogeneous system configurations or isolated storage tiers that are
bound to different homogeneous extent pools, you can benefit from Easy Tier automatic
mode. Easy Tier provides automatic intra-tier performance management by rebalancing the
workload across ranks (auto-rebalance) in homogeneous single-tier pools based on rank
utilization. Easy Tier automatically minimizes skew and rank hot spots and helps to reduce
the overall management effort for the storage administrator.

Depending on the particular storage requirements in your environment, with the DS8000
architecture, you can address a vast range of storage needs combined with ease of
management. On a single DS8000 storage system, you can perform these tasks:
򐂰 Isolate workloads to selected extent pools (or down to selected ranks and DAs).
򐂰 Share resources of other extent pools with different workloads.
򐂰 Use Easy Tier to automatically manage multitier extent pools with different storage tiers
(or homogeneous extent pools).
򐂰 Adapt your logical configuration easily and dynamically at any time to changing
performance or capacity needs by migrating volumes across extent pools, merging extent
pools, or removing ranks from one extent pool (rank depopulation) and moving them to
another pool.

Easy Tier helps you consolidate more workloads onto a single DS8000 storage system by
automating storage performance and storage economics management across up to three
drive tiers. In addition, I/O Priority Manager, as described in 1.3.5, “I/O Priority Manager” on
page 17, can help you align workloads to quality of service (QoS) levels to prioritize separate
system workloads that compete for the same shared and possibly constrained storage
resources to meet their performance goals.

For many initial installations, an approach with two extent pools (with or without different
storage tiers) and enabled Easy Tier automatic management might be the simplest way to
start if you have FB or CKD storage only; otherwise, four extent pools are required. You can
plan for more extent pools based on your specific environment and storage needs, for
example, workload isolation for some pools, different resource sharing pools for different
departments or clients, or specific Copy Services considerations.

Considerations, as described in 4.2, “Configuration principles for optimal performance” on


page 87, apply for planning advanced logical configurations in complex environments,
depending on your specific requirements. Also, take advantage of features such as Easy Tier,
I/O Priority Manager, or, for z Systems, SAN Fabric I/O Priority and FICON Dynamic Routing.



4.2 Configuration principles for optimal performance
There are three major principles for achieving a logical configuration on a DS8000 storage
system for optimal performance when planning extent pools:
򐂰 Workload isolation
򐂰 Workload resource-sharing
򐂰 Workload spreading

You can take advantage of features such as Easy Tier and I/O Priority Manager. Both features
pursue different goals and can be combined.

Easy Tier provides a significant benefit for mixed workloads, so consider it for
resource-sharing workloads and isolated workloads dedicated to a specific set of resources.
Furthermore, Easy Tier automatically supports the goal of workload spreading by distributing
the workload in an optimum way across all the dedicated resources in an extent pool. It
provides automated storage performance and storage economics optimization through
dynamic data relocation on extent level across multiple storage tiers and ranks based on their
access patterns. With auto-rebalance, it rebalances the workload across the ranks within a
storage tier based on utilization to reduce skew and avoid hot spots. Auto-rebalance applies
to managed multitier pools and single-tier pools and helps to rebalance the workloads evenly
across ranks to provide an overall balanced rank utilization within a storage tier or managed
single-tier extent pool. Figure 4-3 shows the effect of auto-rebalance in a single-tier extent
pool that starts with a highly imbalanced workload across the ranks at T1. Auto-rebalance
rebalances the workload and optimizes the rank utilization over time.

Figure 4-3 Effect of auto-rebalance on individual rank utilization in the system

The DS8000 I/O Priority Manager provides a significant benefit for resource-sharing
workloads. It aligns QoS levels to separate workloads that compete for the same shared and
possibly constrained storage resources. I/O Priority Manager can prioritize access to these
system resources to achieve the needed QoS for the volume based on predefined
performance goals (high, medium, or low). I/O Priority Manager constantly monitors and
balances system resources to help applications meet their performance targets automatically,
without operator intervention. I/O Priority Manager acts only if resource contention is
detected.



4.2.1 Workload isolation
With workload isolation, a high-priority workload uses dedicated DS8000 hardware
resources to reduce the impact of less important workloads. Workload isolation can also
mean limiting a lower-priority workload to a subset of the DS8000 hardware resources so that
it does not affect more important workloads by fully using all hardware resources.

Isolation provides ensured availability of the hardware resources that are dedicated to the
isolated workload. It removes contention with other applications for those resources.

However, isolation limits the isolated workload to a subset of the total DS8000 hardware so
that its maximum potential performance might be reduced. Unless an application has an
entire DS8000 storage system that is dedicated to its use, there is potential for contention
with other applications for any hardware (such as cache and processor resources) that is not
dedicated. Typically, isolation is implemented to improve the performance of certain
workloads by separating different workload types.

One traditional approach to isolation is to identify lower-priority workloads with heavy I/O
demands and to separate them from all of the more important workloads. You might be able
to isolate multiple lower priority workloads with heavy I/O demands to a single set of hardware
resources and still meet their lower service-level requirements, particularly if their peak I/O
demands are at different times. In addition, I/O Priority Manager can help to prioritize different
workloads, if required.

Important: For convenience, this chapter sometimes describes isolation as a single isolated workload in contrast to multiple resource-sharing workloads, but the approach also applies to multiple isolated workloads.

DS8000 disk capacity isolation


The level of disk capacity isolation that is required for a workload depends on the scale of its
I/O demands as compared to the DS8000 array and DA capabilities, and organizational
considerations, such as the importance of the workload and application administrator
requests for workload isolation.

You can partition the DS8000 disk capacity for isolation at several levels:
򐂰 Rank level: Certain ranks are dedicated to a workload, that is, volumes for one workload
are allocated on these ranks. The ranks can have a different disk type (capacity or speed),
a different RAID array type (RAID 5, RAID 6, or RAID 10, arrays with spares or arrays
without spares), or a different storage type (CKD or FB) than the disk types, RAID array
types, or storage types that are used by other workloads. Workloads that differ in any of
these respects can dictate rank, extent pool, and address group isolation. You might
consider workloads with heavy random activity for rank isolation, for example.
򐂰 Extent pool level: Extent pools are logical constructs that represent a group of ranks that
are serviced by storage server 0 or storage server 1. You can isolate different workloads to
different extent pools, but you always must be aware of the rank and DA pair associations.
Although physical isolation on rank and DA level involves building appropriate extent pools
with a selected set of ranks or ranks from a specific DA pair, different extent pools with a
subset of ranks from different DA pairs can still share DAs. Workloads that are isolated to
different extent pools might therefore share a DA as a physical resource, which can become
a limiting resource under certain extreme conditions. However, given the capabilities of
one DA, mutual interference at this level is rare, and isolation (if needed) at the
extent pool level is effective if the workloads are disk-bound.



򐂰 DA level: All ranks on one or more DA pairs are dedicated to a workload, that is, only
volumes for this workload are allocated on the ranks that are associated with one or more
DAs. These ranks can be a different disk type (capacity or speed), RAID array type
(RAID 5, RAID 6, or RAID 10, arrays with spares or arrays without spares), or storage type
(CKD or FB) than the disk types, RAID types, or storage types that are used by other
workloads. Consider only huge (multiple GBps) workloads with heavy, large blocksize, and
sequential activity for DA-level isolation because these workloads tend to consume all of
the available DA resources.
򐂰 Processor complex level: All ranks that are assigned to extent pools managed by
processor complex 0 or all ranks that are assigned to extent pools managed by processor
complex 1 are dedicated to a workload. This approach is not preferable because it can
reduce the processor and cache resources and the back-end bandwidth that is available
to the workload by 50%.
򐂰 Storage unit level: All ranks in a physical DS8000 storage system are dedicated to a
workload, that is, the physical DS8000 storage system runs one workload.
򐂰 Pinning volumes to a tier in hybrid pools: This level is not complete isolation, but it can be
a preferred way to go in many cases. Even if you have just one pool pair with two large
three-tier pools, you may use the flash tier only for some volumes and the Nearline tier
only for others. Certain volumes may float between the two upper (SSD and ENT) tiers
only, but NL is excluded for them. Such a setup allows a simple extent pool structure, and
at the same time less-prioritized volumes stay on lower tiers only and highly prioritized
volumes are treated preferentially.

DS8000 host connection isolation


The level of host connection isolation that is required for a workload depends on the scale of
its I/O demands as compared to the DS8000 I/O port and host adapter (HA) capabilities. It
also depends on organizational considerations, such as the importance of the workload and
administrator requests for workload isolation.

The DS8000 host connection subsetting for isolation can also be done at several levels:
򐂰 I/O port level: Certain DS8000 I/O ports are dedicated to a workload, which is a common
case. Workloads that require Fibre Channel connection (FICON) and Fibre Channel
Protocol (FCP) must be isolated at the I/O port level anyway because each I/O port on a
FCP/FICON-capable HA card can be configured to support only one of these protocols.
Although Open Systems host servers and remote mirroring links use the same protocol
(FCP), they are typically isolated to different I/O ports. You must also consider workloads
with heavy large-block sequential activity for HA isolation because they tend to consume
all of the I/O port resources that are available to them.
򐂰 HA level: Certain HAs are dedicated to a workload. FICON and FCP workloads do not
necessarily require HA isolation because separate I/O ports on the same
FCP/FICON-capable HA card can be configured to support each protocol (FICON or
FCP). However, it is a preferred practice to separate FCP and FICON to different HBAs.
Furthermore, host connection requirements might dictate a unique type of HA card
(longwave (LW) or shortwave (SW)) for a workload. Workloads with heavy large-block
sequential activity must be considered for HA isolation because they tend to consume all
of the I/O port resources that are available to them.
򐂰 I/O enclosure level: Certain I/O enclosures are dedicated to a workload. This approach is
not necessary.



4.2.2 Workload resource-sharing
Workload resource-sharing means multiple workloads use a common set of the DS8000
hardware resources:
򐂰 Ranks
򐂰 DAs
򐂰 I/O ports
򐂰 HAs

Multiple resource-sharing workloads can have logical volumes on the same ranks and can
access the same DS8000 HAs or I/O ports. Resource-sharing allows a workload to access
more DS8000 hardware than can be dedicated to the workload, providing greater potential
performance, but this hardware sharing can result in resource contention between
applications that impacts overall performance at times. It is important to allow
resource-sharing only for workloads that do not consume all of the DS8000 hardware
resources that are available to them. As an alternative, use I/O Priority Manager to align QoS
levels to the volumes that are sharing resources and prioritize different workloads, if required.
Pinning volumes to one certain tier can also be considered temporarily, and then you can
release these volumes again.

Easy Tier extent pools typically are shared by multiple workloads because Easy Tier with its
automatic data relocation and performance optimization across multiple storage tiers
provides the most benefit for mixed workloads.

To better understand the resource-sharing principle for workloads on disk arrays, see 3.3.2,
“Extent pool considerations” on page 72.

4.2.3 Workload spreading


Workload spreading means balancing and distributing overall workload evenly across all of
the DS8000 hardware resources that are available:
򐂰 Processor complex 0 and processor complex 1
򐂰 DAs
򐂰 Ranks
򐂰 I/O enclosures
򐂰 HAs

Spreading applies to both isolated workloads and resource-sharing workloads.

You must allocate the DS8000 hardware resources to either an isolated workload or multiple
resource-sharing workloads in a balanced manner, that is, you must allocate either an
isolated workload or resource-sharing workloads to the DS8000 ranks that are assigned to
DAs and both processor complexes in a balanced manner. You must allocate either type of
workload to I/O ports that are spread across HAs and I/O enclosures in a balanced manner.

You must distribute volumes and host connections for either an isolated workload or a
resource-sharing workload in a balanced manner across all DS8000 hardware resources that
are allocated to that workload.

You should create volumes as evenly distributed as possible across all ranks and DAs
allocated to those workloads.

One exception to the recommendation of spreading volumes might be when specific files or
data sets are never accessed simultaneously, such as multiple log files for the same
application where only one log file is in use at a time. In that case, you can place the volumes
required by these data sets or files on the same resources.



You must also configure host connections as evenly distributed as possible across the I/O
ports, HAs, and I/O enclosures that are available to either an isolated or a resource-sharing
workload. Then, you can use host server multipathing software to optimize performance over
multiple host connections. For more information about multipathing software, see Chapter 8,
“Host attachment” on page 267.

4.2.4 Using workload isolation, resource-sharing, and spreading


When you perform DS8000 performance optimization, you must first identify any workload
that has the potential to negatively impact the performance of other workloads by fully using
all of the DS8000 I/O ports and the DS8000 ranks available to it.

Additionally, you might identify any workload that is so critical that its performance can never
be allowed to be negatively impacted by other workloads.

Then, identify the remaining workloads that are considered appropriate for resource-sharing.

Next, define a balanced set of hardware resources that can be dedicated to any isolated
workloads, if required. Then, allocate the remaining DS8000 hardware for sharing among the
resource-sharing workloads. Carefully consider the appropriate resources and storage tiers
for Easy Tier and multitier extent pools in a balanced manner. Also, plan ahead for
appropriate I/O Priority Manager alignments of QoS levels to resource-sharing workloads
where needed.

The next step is planning extent pools and assigning volumes and host connections to all
workloads in a way that is balanced and spread. By default, the standard allocation method
when creating volumes stripes them with one-extent granularity across all arrays in a pool,
so at the rank level this distribution happens automatically.

Without an explicit need for workload isolation or any other requirement for multiple extent
pools, the simplest configuration to start with might be two extent pools (with or without
different storage tiers), a balanced distribution of the ranks and DAs, resource-sharing
throughout the whole DS8000 storage system, and Easy Tier automatic management, if you
have either FB or CKD storage only. Otherwise, four extent pools are required
for a reasonable minimum configuration, two for FB storage and two for CKD storage, and
each pair is distributed across both DS8000 storage servers. In addition, you can plan to align
your workloads to expected QoS levels with I/O Priority Manager.

The final step is the implementation of host-level striping (when appropriate) and multipathing
software, if needed. If you planned for Easy Tier, do not consider host-level striping because it
dilutes the workload skew and is counterproductive to the Easy Tier optimization.

4.3 Analyzing application workload characteristics


The first and most important step in creating a successful logical configuration for the DS8000
storage system is analyzing the workload characteristics for the applications that access the
DS8000 storage system. The DS8000 hardware resources, such as RAID arrays and I/O
ports, must be correctly allocated to workloads for isolation and resource-sharing
considerations. If planning for shared multitier configurations and Easy Tier, it is important to
determine the skew of the workload and plan for the amount of required storage capacity on
the appropriate storage tiers. You must perform this workload analysis during the DS8000
capacity planning process, and you must complete it before ordering the DS8000 hardware.



4.3.1 Determining skew and storage requirements for Easy Tier
For Easy Tier configurations, it is important to determine the skew of the workload and plan
for the amount of required storage capacity on each of the storage tiers. Plan the optimum
initial hardware configuration for managed multitier environments so that you determine the
overall distribution of the I/O workload against the amount of data (data heat distribution) to
understand how much of the data is doing how much (or most) of the I/O workload. The
workload pattern, small block random or large block sequential read/write operations, also is
important. A good understanding of the workload heat distribution and skew helps to evaluate
the benefit of an Easy Tier configuration.

For example, the ratio of flash capacity to HDD capacity in a hybrid pool depends on the
workload characteristics and skew. Ideally, there should be enough flash capacity to hold the
active (hot) extents in the pool, but not much more, so that expensive flash capacity is not
wasted. For new DS8000 orders, 3–5% of flash might be a reasonable percentage to plan for
with hybrid pools in typical environments. Such a configuration can already result in the
movement of over 50% of the small-block random I/O workload from Enterprise drives to
flash. It provides a reasonable
initial estimate if measurement data is not available to support configuration planning.
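
To make this rule of thumb concrete, the following Python sketch estimates the flash tier capacity of a hybrid pool under an assumed skew curve. The skew values below are purely illustrative placeholders, not measured data; in practice, use STAT or Disk Magic output instead of guessing.

ASSUMED_SKEW = [          # (fraction of pool capacity, fraction of random I/O it attracts)
    (0.03, 0.50),         # hottest 3% of extents -> roughly half of the random I/O (assumption)
    (0.05, 0.60),
    (0.10, 0.75),
]

def flash_plan(pool_capacity_tib):
    """Print the flash capacity implied by each point of the assumed skew curve."""
    for cap_fraction, io_fraction in ASSUMED_SKEW:
        flash_tib = pool_capacity_tib * cap_fraction
        print(f"{cap_fraction:.0%} flash = {flash_tib:6.1f} TiB "
              f"-> expected to absorb about {io_fraction:.0%} of the random I/O")

flash_plan(pool_capacity_tib=500.0)   # example: a 500 TiB hybrid pool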

The Storage Tier Advisor Tool (STAT) also can provide guidance for capacity planning of the
available storage tiers based on the existing workloads on a DS8000 storage system with
Easy Tier monitoring enabled. For more information, see 6.5, “Storage Tier Advisor Tool” on
page 213.

4.3.2 Determining isolation requirements


The objective of this analysis is to identify workloads that require isolated (dedicated) DS8000
hardware resources because this determination ultimately affects the total amount of disk
capacity, the number and type of disk drives, and the number and type of HAs that are
required. The result of this first analysis indicates which
workloads require isolation and the level of isolation that is required. However, in recent years,
with new options for QoS tools or for pinning volumes, and stronger storage hardware, the
isolation of loads to better manage performance has become less common, and should not
be thought of as a rule.

You must also consider organizational and business considerations in determining which
workloads to isolate. Workload priority (the importance of a workload to the business) is a key
consideration. Application administrators typically request dedicated resources for high
priority workloads. For example, certain database online transaction processing (OLTP)
workloads might require dedicated resources to ensure service levels.

The most important consideration is preventing lower-priority workloads with heavy I/O
requirements from impacting higher priority workloads. Lower-priority workloads with heavy
random activity must be evaluated for rank isolation. Lower-priority workloads with heavy,
large blocksize, and sequential activity must be evaluated for DA and I/O port isolation.

Workloads that require different disk drive types (capacity and speed), different RAID types
(RAID 5, RAID 6, or RAID 10), or different storage types (CKD or FB) dictate isolation to
different DS8000 arrays, ranks, and extent pools, unless this situation can be solved by
pinning volumes to one certain tier. For more information about the performance implications
of various RAID types, see “RAID-level performance considerations” on page 98.

Workloads that use different I/O protocols (FCP or FICON) dictate isolation to different I/O
ports. However, workloads that use the same disk drive types, RAID type, storage type, and
I/O protocol can be evaluated for separation or isolation requirements.



Workloads with heavy, continuous I/O access patterns must be considered for isolation to
prevent them from consuming all available DS8000 hardware resources and impacting the
performance of other types of workloads. Workloads with large blocksize and sequential
activity must be considered for separation from those workloads with small blocksize and
random activity.

Isolation of only a few workloads that are known to have high I/O demands can allow all the
remaining workloads (including the high-priority workloads) to share hardware resources and
achieve acceptable levels of performance. More than one workload with high I/O demands
might be able to share the isolated DS8000 resources, depending on the service level
requirements and the times of peak activity.

The following examples are I/O workloads, files, or data sets that might have heavy and
continuous I/O access patterns:
򐂰 Sequential workloads (especially those workloads with large-blocksize transfers)
򐂰 Log files or data sets
򐂰 Sort or work data sets or files
򐂰 Business Intelligence and Data Mining
򐂰 Disk copies (including Point-in-Time Copy background copies, remote mirroring target
volumes, and tape simulation on disk)
򐂰 Video and imaging applications
򐂰 Engineering and scientific applications
򐂰 Certain batch workloads

You must consider workloads for all applications for which DS8000 storage is allocated,
including current workloads to be migrated from other installed storage systems and new
workloads that are planned for the DS8000 storage system. Also, consider projected growth
for both current and new workloads.

For existing applications, consider historical experience first. For example, is there an
application where certain data sets or files are known to have heavy, continuous I/O access
patterns? Is there a combination of multiple workloads that might result in unacceptable
performance if their peak I/O times occur simultaneously? Consider workload importance
(workloads of critical importance and workloads of lesser importance).

For existing applications, you can also use performance monitoring tools that are available for
the existing storage systems and server platforms to understand current application workload
characteristics:
򐂰 Read/write ratio
򐂰 Random/sequential ratio
򐂰 Average transfer size (blocksize)
򐂰 Peak workload (IOPS for random access and MB per second for sequential access)
򐂰 Peak workload periods (time of day and time of month)
򐂰 Copy Services requirements (Point-in-Time Copy and Remote Mirroring)
򐂰 Host connection utilization and throughput (FCP host connections and FICON)
򐂰 Remote mirroring link utilization and throughput

Estimate the requirements for new application workloads and for current application workload
growth. You can obtain information about general workload characteristics in Chapter 5,
“Understanding your workload” on page 141.



As new applications are rolled out and current applications grow, you must monitor
performance and adjust projections and allocations. You can obtain more information about
this topic in Appendix A, “Performance management process” on page 551 and in Chapter 7,
“Practical performance management” on page 221.

You can use the Disk Magic modeling tool to model the current or projected workload and
estimate the required DS8000 hardware resources. Disk Magic is described in 6.1, “Disk
Magic” on page 160.

The STAT can also provide workload information and capacity planning recommendations
that are associated with a specific workload to reconsider the need for isolation and evaluate
the potential benefit when using a multitier configuration and Easy Tier.

4.3.3 Reviewing remaining workloads for feasibility of resource-sharing


After workloads with the highest priority or the highest I/O demands are identified for isolation,
the I/O characteristics of the remaining workloads must be reviewed to determine whether a
single group of resource-sharing workloads is appropriate, or whether it makes sense to split
the remaining applications into multiple resource-sharing groups. The result of this step is the
addition of one or more groups of resource-sharing workloads to the DS8000 configuration
plan.

4.4 Planning allocation of disk and host connection capacity


You must plan the allocation of specific DS8000 hardware first for any isolated workload, and
then for the resource-sharing workloads, including Easy Tier hybrid pools. Use the workload
analysis in 4.3.2, “Determining isolation requirements” on page 92 to define the disk capacity
and host connection capacity that is required for the workloads. For any workload, the
required disk capacity is determined by both the amount of space that is needed for data and
the number of arrays (of a specific speed) that are needed to provide the needed level of
performance. The result of this step is a plan that indicates the number of ranks (including
disk drive type) and associated DAs and the number of I/O adapters and associated I/O
enclosures that are required for any isolated workload and for any group of resource-sharing
workloads.

Planning DS8000 hardware resources for isolated workloads


For the DS8000 disk allocation, isolation requirements might dictate the allocation of certain
individual ranks or all of the ranks on certain DAs to one workload. For the DS8000 I/O port
allocation, isolation requirements might dictate the allocation of certain I/O ports or all of the
I/O ports on certain HAs to one workload.

Choose the DS8000 resources to dedicate in a balanced manner. If ranks are planned for
workloads in multiples of two, half of the ranks can later be assigned to extent pools managed
by processor complex 0, and the other ranks can be assigned to extent pools managed by
processor complex 1. You may also note the DAs to be used. If I/O ports are allocated in
multiples of four, they can later be spread evenly across all I/O enclosures in a DS8000 frame
if four or more HA cards are installed. If I/O ports are allocated in multiples of two, they can
later be spread evenly across left and right I/O enclosures.
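
If you keep the plan in a spreadsheet or script, a small helper such as the following Python sketch can verify that the rank split stays balanced. The rank IDs and DA pair numbers are hypothetical placeholders; the alternating assignment simply mirrors the multiples-of-two guidance above.

from collections import defaultdict

# Hypothetical inventory: rank ID -> DA pair (placeholders for illustration only).
ranks = {
    "R0": 0, "R1": 0, "R2": 1, "R3": 1,
    "R4": 2, "R5": 2, "R6": 3, "R7": 3,
}

def split_ranks(rank_to_da):
    """Alternate the ranks of each DA pair between rank group 0 and rank group 1
    so that both processor complexes manage the same number of ranks per DA pair."""
    groups = {0: [], 1: []}
    per_da = defaultdict(list)
    for rank, da in sorted(rank_to_da.items()):
        per_da[da].append(rank)
    for da in sorted(per_da):
        for i, rank in enumerate(per_da[da]):
            groups[i % 2].append(rank)   # even position -> complex 0, odd -> complex 1
    return groups

print(split_ranks(ranks))
# {0: ['R0', 'R2', 'R4', 'R6'], 1: ['R1', 'R3', 'R5', 'R7']}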



Planning DS8000 hardware resources for resource-sharing workloads
Review the DS8000 resources to share for balance. If ranks are planned for resource-sharing
workloads in multiples of two, half of the ranks can later be assigned to processor complex 0
extent pools, and the other ranks can be assigned to processor complex 1 extent pools. If
I/O ports are allocated for resource-sharing workloads in multiples of four, they can later be
spread evenly across all I/O enclosures in a DS8000 frame if four or more HA cards are
installed. If I/O ports are allocated in multiples of two, they can later be spread evenly across
left and right I/O enclosures.

Easy Tier later provides automatic intra-tier management in single-tier and multitier pools
(auto-rebalance) and cross-tier management in multitier pools for the resource-sharing
workloads. In addition, different QoS levels can be aligned to different workloads to meet
performance goals.

4.5 Planning volume and host connection spreading


After hardware resources are allocated for both isolated and resource-sharing workloads,
plan the volume and host connection spreading for all of the workloads.

Host connection: In this chapter, we use host connection in a general sense to represent
a connection between a host server (either z/OS or Open Systems) and the DS8000
storage system.

The result of this step is a plan that includes this information:


򐂰 The specific number and size of volumes for each isolated workload or group of
resource-sharing workloads and how they are allocated to ranks and DAs
򐂰 The specific number of I/O ports for each workload or group of resource-sharing
workloads and how they are allocated to HAs and I/O enclosures

After the spreading plan is complete, use the DS8000 hardware resources that are identified
in the plan as input to order the DS8000 hardware.

4.5.1 Spreading volumes for isolated and resource-sharing workloads


Now, consider the requirements of each workload for the number and size of logical volumes.
For a specific amount of required disk capacity from the perspective of the DS8000 storage
system, there are typically no significant DS8000 performance implications of using more
small volumes as compared to fewer large volumes. However, using one or a few standard
volume sizes can simplify management.

However, there are host server performance considerations related to the number and size of
volumes. For example, for z Systems servers, the number of Parallel Access Volumes (PAVs)
that are needed can vary with volume size. For more information about PAVs, see 14.2,
“DS8000 and z Systems planning and configuration” on page 476. For IBM i servers, use a
volume size that is on the order of half the size of the disk drives that are used. There also
can be Open Systems host server or multipathing software considerations that are related to
the number or the size of volumes, so you must consider these factors in addition to workload
requirements.



There are significant performance implications with the assignment of logical volumes to
ranks and DAs. The goal of the entire logical configuration planning process is to ensure that
volumes for each workload are on ranks and DAs that allow all workloads to meet
performance objectives.

To spread volumes across allocated hardware for each isolated workload, and then for each
workload in a group of resource-sharing workloads, complete the following steps:
1. Review the required number and the size of the logical volumes that are identified during
the workload analysis.
2. Review the number of ranks that are allocated to the workload (or group of
resource-sharing workloads) and the associated DA pairs.
3. Evaluate the use of multi-rank or multitier extent pools. Evaluate the use of Easy Tier in
automatic mode to automatically manage data placement and performance.
4. Assign the volumes, preferably with the default allocation method rotate extents (DSCLI
term: rotateexts, GUI: rotate capacity).

4.5.2 Spreading host connections for isolated and resource-sharing workloads
Next, consider the requirements of each workload for the number and type of host
connections. In addition to workload requirements, you also might need to consider the host
server or multipathing software in relation to the number of host connections. For more
information about multipathing software, see Chapter 8, “Host attachment” on page 267.

There are significant performance implications from the assignment of host connections to
I/O ports, HAs, and I/O enclosures. The goal of the entire logical configuration planning
process is to ensure that host connections for each workload access I/O ports and HAs that
allow all workloads to meet the performance objectives.

To spread host connections across allocated hardware for each isolated workload, and then
for each workload in a group of resource-sharing workloads, complete the following steps:
1. Review the required number and type (SW, LW, FCP, or FICON) of host connections that
are identified in the workload analysis. You must use a minimum of two host connections
to different DS8000 HA cards to ensure availability. Some Open Systems hosts might
impose limits on the number of paths and volumes. In such cases, you might consider not
exceeding four paths per volume, which in general is a good approach for performance
and availability. The DS8880 front-end host ports are 16 Gbps capable and if the expected
workload is not explicitly saturating the adapter and port bandwidth with high sequential
loads, you might share ports with many hosts.
2. Review the HAs that are allocated to the workload (or group of resource-sharing
workloads) and the associated I/O enclosures.
3. Review requirements that need I/O port isolation, for example, remote replication Copy
Services, IBM ProtecTIER®, and SAN Volume Controller. If possible, try to split them as
you split hosts among hardware resources. Do not mix them with other Open Systems
because they can have different workload characteristics.



4. Assign each required host connection to a different HA in a different I/O enclosure if
possible, balancing them across the left and right I/O enclosures (see the sketch after
this list):
– If the required number of host connections is less than or equal to the available number of
I/O enclosures (which can be typical for certain Open Systems servers), assign an
equal number of host connections to left I/O enclosures (0, 2, 4, and 6) as to right I/O
enclosures (1, 3, 5, and 7).
– Within an I/O enclosure, assign each required host connection to the HA of the
required type (SW FCP/FICON-capable or LW FCP/FICON-capable) with the greatest
number of unused ports. When HAs have an equal number of unused ports, assign the
host connection to the adapter that has the fewest connections for this workload.
– If the number of required host connections is greater than the number of I/O
enclosures, assign the additional connections to different HAs with the most unused
ports within the I/O enclosures. When HAs have an equal number of unused ports,
assign the host connection to the adapter that has the fewest connections for this
workload.
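
The following Python sketch illustrates this balancing heuristic in a simplified form. The enclosure numbers, adapter names, and free-port counts are hypothetical, and the sketch ignores adapter type (SW or LW) and protocol constraints, so treat it as an illustration of the assignment logic rather than a configuration tool.

def assign_connections(num_connections, free_ports):
    """free_ports: dict {(enclosure, adapter): unused port count}.
    Returns one (enclosure, adapter) choice per host connection, alternating
    between left (even) and right (odd) enclosures and preferring the adapter
    with the most unused ports, then the fewest connections already placed."""
    assignments = []
    used_per_adapter = {ha: 0 for ha in free_ports}
    for i in range(num_connections):
        side = i % 2                      # alternate left and right I/O enclosures
        candidates = [ha for ha in free_ports
                      if ha[0] % 2 == side and free_ports[ha] > 0]
        if not candidates:                # fall back to any adapter with a free port
            candidates = [ha for ha in free_ports if free_ports[ha] > 0]
        if not candidates:
            raise ValueError("no free ports left")
        best = max(candidates, key=lambda ha: (free_ports[ha], -used_per_adapter[ha]))
        assignments.append(best)
        free_ports[best] -= 1
        used_per_adapter[best] += 1
    return assignments

ports = {(0, "HA0"): 4, (1, "HA1"): 4, (2, "HA2"): 4, (3, "HA3"): 4}
print(assign_connections(4, ports))
# [(0, 'HA0'), (1, 'HA1'), (2, 'HA2'), (3, 'HA3')]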

4.6 Planning array sites


During the DS8000 installation, array sites are dynamically created and assigned to DA pairs.
Array site IDs (Sx) do not have any fixed or predetermined relationship to disk drive physical
locations or to the disk enclosure installation order. The relationship between array site IDs
and physical disk locations or DA assignments can differ between the DS8000 storage
systems, even on the DS8000 storage systems with the same number and type of disk drives.

When using the DS Storage Manager GUI to create managed arrays and pools, the GUI
automatically chooses a good distribution of the arrays across all DAs, and initial formatting
with the GUI gives optimal results for many cases. Only for specific requirements (for
example, isolation by DA pairs) is the command-line interface (DSCLI) advantageous
because it gives more options for a certain specific configuration.

After the DS8000 hardware is installed, you can use the output of the DS8000 DSCLI
lsarraysite command to display and document array site information, including disk drive
type and DA pair. Check the disk drive type and DA pair for each array site to ensure that
arrays, ranks, and ultimately volumes that are created from the array site are created on the
DS8000 hardware resources required for the isolated or resource-sharing workloads.

The result of this step is the addition of specific array site IDs to the plan of workload
assignment to ranks.

4.7 Planning RAID arrays and ranks


The next step is planning the RAID arrays and ranks. When using DSCLI, take the specific
array sites that are planned for isolated or resource-sharing workloads and define their
assignments to RAID arrays and CKD or FB ranks, and thus define array IDs and rank IDs.
Because there is a one-to-one correspondence between an array and a rank on the DS8000
storage system, you can plan arrays and ranks in a single step. However, array creation and
rank creation require separate steps.



The sequence of steps when creating the arrays and ranks with the DSCLI finally determines
the numbering scheme of array IDs and rank IDs because these IDs are chosen automatically
by the system during creation. The logical configuration does not depend on a specific ID
numbering scheme, but a specific ID numbering scheme might help you plan the
configuration and manage performance more easily.

Storage servers: Array sites, arrays, and ranks do not have a fixed or predetermined
relationship to any DS8000 processor complex (storage server) before they are finally
assigned to an extent pool and a rank group (rank group 0/1 is managed by processor
complex 0/1).

RAID-level performance considerations


When configuring arrays from array sites, you must specify the RAID level, either RAID 5,
RAID 6, or RAID 10. These RAID levels meet different requirements for performance, usable
storage capacity, and data protection. However, you must determine the correct RAID types
and the physical disk drives (speed and capacity) that are related to initial workload
performance objectives, capacity requirements, and availability considerations before you
order the DS8000 hardware. For more information about implementing the various RAID
levels on the DS8000, see 3.1, “RAID levels and spares” on page 48.

The following RAID types are available:


򐂰 Enterprise disks are RAID 5, RAID 6, and RAID 10 capable.
򐂰 Nearline disks are RAID 6 capable (RAID 10 only with a client RPQ).
򐂰 SSDs are RAID 5 capable, and RAID 10/RAID 6 only with a client RPQ.
򐂰 HPFE flash card arrays are RAID 5 capable.

RAID 5 is one of the most commonly used levels of RAID protection because it optimizes
cost-effective performance while emphasizing usable capacity through data striping. It
provides fault tolerance if one disk drive fails by using XOR parity for redundancy. Hot spots
within an array are avoided by distributing data and parity information across all of the drives
in the array. The capacity of one drive in the RAID array is lost because it holds the parity
information. RAID 5 provides a good balance of performance and usable storage capacity.

RAID 6 provides a higher level of fault tolerance than RAID 5 in disk failures, but also provides
less usable capacity than RAID 5 because the capacity of two drives in the array is set aside
to hold the parity information. As with RAID 5, hot spots within an array are avoided by
distributing data and parity information across all of the drives in the array. Still, RAID 6 offers
more usable capacity than RAID 10 by providing an efficient method of data protection in
double disk errors, such as two drive failures, two coincident medium errors, or a drive failure
and a medium error during a rebuild. Because the likelihood of media errors increases with
the capacity of the physical disk drives, consider the use of RAID 6 with large capacity disk
drives and higher data availability requirements. For example, consider RAID 6 where
rebuilding the array in a drive failure takes a long time. RAID 6 can also be used with smaller
SAS (Serial Attached SCSI) drives, when the primary concern is
a higher level of data protection than is provided by RAID 5.

RAID 10 optimizes high performance while maintaining fault tolerance for disk drive failures.
The data is striped across several disks, and the first set of disk drives is mirrored to an
identical set. RAID 10 can tolerate at least one, and in most cases, multiple disk failures if the
primary copy and the secondary copy of a mirrored disk pair do not fail at the same time.



In addition to the considerations for data protection and capacity requirements, the question
typically arises about which RAID level performs better, RAID 5, RAID 6, or RAID 10. As with
most complex issues, the answer is that it depends. There are a number of workload
attributes that influence the relative performance of RAID 5, RAID 6, or RAID 10, including
the use of cache, the relative mix of read as compared to write operations, and whether data
is referenced randomly or sequentially.

Regarding read I/O operations, either random or sequential, there is generally no difference
between RAID 5, RAID 6, and RAID 10. When a DS8000 storage system receives a read
request from a host system, it first checks whether the requested data is already in cache. If
the data is in cache (that is, a read cache hit), there is no need to read the data from disk, and
the RAID level on the arrays does not matter. For reads that must be satisfied from disk (that
is, the array or the back end), the performance of RAID 5, RAID 6, and RAID 10 is roughly
equal because the requests are spread evenly across all disks in the array. In RAID 5 and
RAID 6 arrays, data is striped across all disks, so I/Os are spread across all disks. In
RAID 10, data is striped and mirrored across two sets of disks, so half of the reads are
processed by one set of disks, and half of the reads are processed by the other set, reducing
the utilization of individual disks.

Regarding random-write I/O operations, the different RAID levels vary considerably in their
performance characteristics. With RAID 10, each write operation at the disk back end initiates
two disk operations to the rank. With RAID 5, an individual random small-block write
operation to the disk back end typically causes a RAID 5 write penalty, which initiates four I/O
operations to the rank by reading the old data and the old parity block before finally writing the
new data and the new parity block. For RAID 6 with two parity blocks, the write penalty
increases to six required I/O operations at the back end for a single random small-block write
operation. This assumption is a worst-case scenario that is helpful for understanding the
back-end impact of random workloads with a certain read/write ratio for the various RAID
levels. It permits a rough estimate of the expected back-end I/O workload and helps you to
plan for the correct number of arrays. On a heavily loaded system, it might take fewer I/O
operations than expected on average for RAID 5 and RAID 6 arrays. The optimization of the
queue of write I/Os waiting in cache for the next destage operation can lead to a high number
of partial or full stripe writes to the arrays with fewer required back-end disk operations for the
parity calculation.
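
To make this estimate concrete, the following Python sketch applies these worst-case write penalties to a random small-block workload. It deliberately ignores destage optimizations (and read cache hits, unless you lower the miss ratio), so treat the result as an upper bound for planning rather than a prediction.

WRITE_PENALTY = {"RAID 5": 4, "RAID 6": 6, "RAID 10": 2}   # disk operations per destaged write

def backend_iops(host_iops, read_ratio, read_miss_ratio=1.0):
    """Return the estimated back-end disk operations per second for each RAID level."""
    reads = host_iops * read_ratio * read_miss_ratio        # each read miss -> one disk read
    writes = host_iops * (1.0 - read_ratio)
    return {raid: reads + writes * penalty for raid, penalty in WRITE_PENALTY.items()}

# Example: 4000 host IOPS, 70% reads (all cache misses), 30% writes
for raid, ops in backend_iops(4000, read_ratio=0.70).items():
    print(f"{raid:>7}: about {ops:,.0f} back-end IOPS")
# RAID 5: ~7,600   RAID 6: ~10,000   RAID 10: ~5,200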

On modern disk systems, such as the DS8000 storage system, write operations are cached
by the storage subsystem and thus handled asynchronously with short write response times
for the attached host systems. So, any RAID 5 or RAID 6 write penalties are shielded from the
attached host systems in disk response time. Typically, a write request that is sent to the
DS8000 subsystem is written into storage server cache and persistent cache, and the I/O
operation is then acknowledged immediately to the host system as complete. If there is free
space in these cache areas, the response time that is seen by the application is only the time
to get data into the cache, and it does not matter whether RAID 5, RAID 6, or RAID 10 is
used.

There is also the concept of rewrites. If you update a cache segment that is still in write cache
and not yet destaged, the update takes place in the cache segment, which eliminates the RAID
penalty for the previous write. However, if the host systems send data to the cache areas faster than the
storage server can destage the data to the arrays (that is, move it from cache to the physical
disks), the cache can occasionally fill up with no space for the next write request. Therefore,
the storage server signals the host system to retry the I/O write operation. In the time that it
takes the host system to retry the I/O write operation, the storage server likely can destage
part of the data, which provides free space in the cache and allows the I/O operation to
complete on the retry attempt.



When random small-block write data is destaged from cache to disk, RAID 5 and RAID 6
arrays can experience a severe write penalty with four or six required back-end disk
operations. RAID 10 always requires only two disk operations per small block write request,
and these operations can be done in parallel. Because RAID 10 performs only half the disk
operations of RAID 5, for random writes, a RAID 10 destage completes faster and reduces
the busy time of the disk system. So, with steady and heavy random-write workloads and just
a few spinning drives present, the back-end write operations to the ranks (the physical disk
drives) can become a limiting factor, so that only a RAID 10 configuration (instead of
additional RAID 5 or RAID 6 arrays) provides enough back-end disk performance at the rank
level to meet the workload performance requirements.

Although RAID 10 clearly outperforms RAID 5 and RAID 6 in small-block random write
operations, RAID 5 and RAID 6 show excellent performance in sequential write I/O
operations. With sequential write requests, all of the blocks that are required for the RAID 5
parity calculation can be accumulated in cache, and thus the destage operation with parity
calculation can be done as a full stripe write without the need for additional disk operations
to the array. So, with only one additional parity block for a full stripe write (for example, seven
data blocks plus one parity block for a 7+P RAID 5 array), RAID 5 requires fewer disk
operations at the back end than a RAID 10, which always requires twice the write operations
because of data mirroring. RAID 6 also benefits from sequential write patterns with most of
the data blocks, which are required for the double parity calculation, staying in cache and thus
reducing the number of additional disk operations to the back end considerably. For
sequential writes, a RAID 5 destage completes faster and reduces the busy time of the disk
subsystem.
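
A similar back-of-the-envelope view applies to sequential writes under the full-stripe-write assumption described above. The array widths in the following Python sketch (7+P, 6+P+Q, and 4x2) are examples; other widths change the factors accordingly.

# Back-end write amplification per sequential host MBps, assuming full stripe
# writes for RAID 5 and RAID 6 and mirroring for RAID 10.
STRIPE_OVERHEAD = {"RAID 5 (7+P)": 8 / 7, "RAID 6 (6+P+Q)": 8 / 6, "RAID 10 (4x2)": 2.0}

def backend_mbps(frontend_mbps):
    """Back-end MBps written to the drives for a given sequential host write rate."""
    return {raid: round(frontend_mbps * factor) for raid, factor in STRIPE_OVERHEAD.items()}

print(backend_mbps(700))
# {'RAID 5 (7+P)': 800, 'RAID 6 (6+P+Q)': 933, 'RAID 10 (4x2)': 1400}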

Comparing RAID 5 to RAID 6, the performance of small-block random read and the
performance of a sequential read are roughly equal. Because of the higher write penalty, the
RAID 6 small-block random write performance is explicitly less than with RAID 5. Also, the
maximum sequential write throughput is slightly less with RAID 6 than with RAID 5 because
of the additional second parity calculation. However, RAID 6 rebuild times are close to RAID 5
rebuild times (for the same size disk drive modules (DDMs)) because rebuild times are
primarily limited by the achievable write throughput to the spare disk during data
reconstruction. So, RAID 6 mainly is a significant reliability enhancement with a trade-off in
random-write performance. It is most effective for large capacity disks that hold
mission-critical data and that are correctly sized for the expected write I/O demand. Workload
planning is especially important before implementing RAID 6 for write-intensive applications,
including Copy Services targets.

RAID 10 is not as commonly used as RAID 5 for two key reasons. First, RAID 10 requires
more raw disk capacity for every TB of effective capacity. Second, when you consider a
standard workload with a typically high number of read operations and only a few write
operations, RAID 5 generally offers the best trade-off between overall performance and
usable capacity. In many cases, RAID 5 write performance is adequate because disk systems
tend to operate at I/O rates below their maximum throughputs, and differences between
RAID 5 and RAID 10 are primarily observed at maximum throughput levels. Consider
RAID 10 for critical workloads with a high percentage of steady random-write requests, which
can easily become rank limited. RAID 10 provides almost twice the throughput of RAID 5
(because of the “write penalty”). The trade-off for better performance with RAID 10 is about
40% less usable disk capacity. Larger drives can be used with RAID 10 to get the
random-write performance benefit while maintaining about the same usable capacity as a
RAID 5 array with the same number of disks.



Here is a summary of the individual performance characteristics of the RAID arrays:
򐂰 For read operations from disk, either random or sequential, there is no significant
difference in RAID 5, RAID 6, or RAID 10 performance.
򐂰 For random writes to disk, RAID 10 outperforms RAID 5 and RAID 6.
򐂰 For random writes to disk, RAID 5 performs better than RAID 6.
򐂰 For sequential writes to disk, RAID 5 tends to perform better.

Table 4-1 shows a short overview of the advantages and disadvantages for the RAID level
reliability, space efficiency, and random write performance.

Table 4-1 RAID-level comparison of reliability, space efficiency, and write penalty
RAID level        Reliability              Space efficiency (a)   Performance write penalty
                  (number of erasures)                            (number of disk operations)
RAID 5 (7+P)      1                        87.5%                  4
RAID 6 (6+P+Q)    2                        75%                    6
RAID 10 (4x2)     At least 1               50%                    2


a. The space efficiency in this table is based on the number of disks that remain available for
data storage. The actual usable decimal capacities are up to 5% less.

In general, workloads that effectively use storage system cache for reads and writes see little
difference between RAID 5 and RAID 10 configurations. For workloads that perform better
with RAID 5, the difference in RAID 5 performance over RAID 10 is typically small. However,
for workloads that perform better with RAID 10, the difference in RAID 10 performance over
RAID 5 performance or RAID 6 performance can be significant.

Because RAID 5, RAID 6, and RAID 10 perform equally well for both random and sequential
read operations, RAID 5 or RAID 6 might be a good choice for space efficiency and
performance for standard workloads with many read requests. RAID 6 offers a higher level of
data protection than RAID 5, especially for large capacity drives, but the random-write
performance of RAID 6 is less because of the second parity calculation. Therefore, size for
performance, especially for RAID 6.

RAID 5 tends to have a slight performance advantage for sequential writes. RAID 10
performs better for random writes. RAID 10 is considered to be the RAID type of choice for
business-critical workloads with many random write requests (typically more than 35% writes)
and low response time requirements.

For array rebuilds, RAID 5, RAID 6, and RAID 10 require approximately the same elapsed
time, although RAID 5 and RAID 6 require more disk operations and therefore are more likely
to affect other disk activity on the same disk array.

You can select RAID types for each array site. So, you can select the RAID type based on the
specific performance requirements of the data for that site. The preferred way to compare the
performance of a specific workload that uses RAID 5, RAID 6, or RAID 10 is to run a Disk
Magic model. For additional information about the capabilities of this tool, see 6.1, “Disk
Magic” on page 160.

For workload planning purposes, it might be convenient to have a general idea of the I/O
performance that a single RAID array can provide. Such numbers are not DS8000 specific
because they represent the limits that you can expect from a simple set of eight physical disks
that form a RAID array. Again, the Disk Magic tool can help to do models for all of these
various RAID models.

For small-block random-read workloads, there is no significant performance difference
between RAID 5, RAID 6, and RAID 10. However, for a typical 70:30 random small-block
workload with 70% reads (no read cache hits) and 30% writes, a performance difference is
already noticeable, and with an increasing number of random writes, RAID 10 clearly
outperforms RAID 5 and RAID 6. To give you an idea of suggested peak array loads for a
cache-unfriendly load profile: for a standard random small-block 70:30 workload on
Enterprise disk drives, if 1100 IOPS marks the upper limit of a RAID 5 array, then
approximately 1500 IOPS is the corresponding limit for a RAID 10 array of the same drives,
and approximately 900 IOPS for a RAID 6 array.

Despite the different RAID levels and the actual workload pattern (read:write ratio, sequential
access, or random access), the limits of the maximum I/O rate per rank also depend on the
type of disk drives that are used. As a mechanical device, each disk drive can process a
limited number of random IOPS, depending on the drive characteristics. So, the number of
disk drives that are used for a specific amount of storage capacity determines the achievable
random IOPS performance. The 15 K drives offer approximately 30% more random IOPS
performance than 10 K drives. Generally, for random IOPS planning calculations, you may
use up to 160 IOPS per 15 K FC drive and 120 IOPS per 10 K FC drive. However, at such
levels of IOPS and disk utilization, you might already see elevated response times. So, if you
expect excellent response times, consider lower IOPS limits. Slow-spinning,
large-capacity Nearline disk drives offer a considerably lower maximum random access I/O
rate per drive (approximately half of a 15 K drive). Therefore, they are only intended for
environments with fixed content, data archival, reference data, or for near-line applications
that require large amounts of data at low cost, or in case of normal production, as a slower
Tier 2 in hybrid pools and with a fraction of the total capacity.
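
As a purely illustrative sizing sketch that uses these suggested peak numbers, assume a new
database workload that requires about 4800 random host IOPS at a 70:30 read:write ratio on
15 K Enterprise drives. With the suggested peak of about 1100 IOPS per RAID 5 (7+P) array,
you need at least 4800 / 1100 ≈ 5 arrays (ranks) to carry the workload, and more if you also
target low response times or future growth. The workload figure here is hypothetical; use
Disk Magic for an accurate model of your own workload.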

Today, spinning disk drives are mostly used as Tier 1 and lower tiers in a hybrid pool, where
most of the IOPS are handled by the Tier 0 flash drives. Yet, even if the flash tier handles, for
example, 70% or more of the load, the HDDs still handle a considerable amount of workload
because of their large bulk capacity. So, it can make a difference whether you use, for
example, 600 GB/15 K drives or 1.2 TB/10 K drives for Tier 1. The drive selection for such
lower drive tiers must therefore also be done as a sizing exercise.

Finally, the lsarray -l and lsrank -l commands can give you an idea of which DA pair is
used by each array and rank, as shown in Example 4-1. You can do further planning from
here.

Example 4-1 lsarray -l and lsrank -l commands showing Array ID sequence and DA pair
dscli> lsarray -l
Array State Data RAIDtype arsite Rank DA Pair DDMcap (10^9B) diskclass encrypt
======================================================================================
A0 Assigned Normal 5 (7+P) S12 R0 2 600.0 ENT supported
A1 Assigned Normal 5 (6+P) S16 R1 18 400.0 Flash supported
A2 Assigned Normal 5 (7+P) S11 R2 2 600.0 ENT supported
A3 Assigned Normal 5 (6+P) S15 R3 18 400.0 Flash supported
A4 Assigned Normal 5 (7+P) S6 R4 0 600.0 ENT supported
A5 Assigned Normal 5 (6+P+S) S10 R5 2 600.0 ENT supported
A6 Assigned Normal 5 (6+P+S) S8 R6 2 600.0 ENT supported
A7 Assigned Normal 5 (6+P+S) S4 R7 0 600.0 ENT supported
A8 Assigned Normal 5 (6+P+S) S14 R8 18 400.0 Flash supported
A9 Assigned Normal 5 (7+P) S5 R9 0 600.0 ENT supported
A10 Assigned Normal 5 (6+P+S) S9 R10 2 600.0 ENT supported

A11 Assigned Normal 5 (6+P+S) S7 R11 2 600.0 ENT supported
A12 Assigned Normal 5 (6+P+S) S3 R12 0 600.0 ENT supported
A13 Assigned Normal 5 (6+P+S) S13 R13 18 400.0 Flash supported
A14 Assigned Normal 5 (6+P+S) S2 R14 0 600.0 ENT supported
A15 Assigned Normal 5 (6+P+S) S1 R15 0 600.0 ENT supported

dscli> lsrank -l
ID Group State datastate Array RAIDtype extpoolID extpoolnam stgtype exts usedexts encryptgrp marray
======================================================================================================
R0 0 Normal Normal A0 5 P0 ITSO_CKD ckd 4113 0 - MA12
R1 0 Normal Normal A1 5 P0 ITSO_CKD ckd 2392 0 - MA16
R2 1 Normal Normal A2 5 P1 ITSO_CKD ckd 4113 0 - MA11
R3 1 Normal Normal A3 5 P1 ITSO_CKD ckd 2392 0 - MA15
R4 0 Normal Normal A4 5 P2 ITSO_FB fb 3672 0 - MA6
R5 0 Normal Normal A5 5 P2 ITSO_FB fb 3142 0 - MA10
R6 0 Normal Normal A6 5 P2 ITSO_FB fb 3142 0 - MA8
R7 0 Normal Normal A7 5 P2 ITSO_FB fb 3142 0 - MA4
R8 0 Normal Normal A8 5 P2 ITSO_FB fb 2132 0 - MA14
R9 1 Normal Normal A9 5 P3 ITSO_FB fb 3672 0 - MA5
R10 1 Normal Normal A10 5 P3 ITSO_FB fb 3142 0 - MA9
R11 1 Normal Normal A11 5 P3 ITSO_FB fb 3142 0 - MA7
R12 1 Normal Normal A12 5 P3 ITSO_FB fb 3142 0 - MA3
R13 1 Normal Normal A13 5 P3 ITSO_FB fb 2132 0 - MA13
R14 0 Normal Normal A14 5 P4 Outcenter fb 3142 0 - MA2
R15 1 Normal Normal A15 5 P5 Outcenter fb 3142 0 - MA1

4.8 Planning extent pools


After planning the arrays and the ranks, the next step is to plan the extent pools, which means
taking the planned ranks and defining their assignment to extent pools and rank groups,
including planning the extent pool IDs.

Extent pools are automatically numbered with system-generated IDs starting with P0, P1, and
P2 in the sequence in which they are created. Extent pools that are created for rank group 0
are managed by processor complex 0 and have even-numbered IDs (P0, P2, and P4, for
example). Extent pools that are created for rank group 1 are managed by processor complex
1 and have odd-numbered IDs (P1, P3, and P5, for example). Only in a failure condition or
during a concurrent code load is the ownership of a certain rank group temporarily moved to
the alternative processor complex.
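
As a minimal sketch of how the extent pool IDs follow the rank group, the following DSCLI
commands create one fixed block (FB) extent pool per rank group. The pool names are
illustrative only:

dscli> mkextpool -rankgrp 0 -stgtype fb ITSO_FB_0
dscli> mkextpool -rankgrp 1 -stgtype fb ITSO_FB_1

Created in this sequence on an otherwise empty system, the first pool typically receives an
even ID (for example, P0) and the second pool an odd ID (for example, P1).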

To achieve a uniform storage system I/O performance and avoid single resources that
become bottlenecks (called hot spots), it is preferable to distribute volumes and workloads
evenly across all of the ranks (disk spindles) and DA pairs that are dedicated to a workload by
creating appropriate extent pool configurations.

The assignment of the ranks to extent pools together with an appropriate concept for the
logical configuration and volume layout is the most essential step to optimize overall storage
system performance. A rank can be assigned to any extent pool or rank group. Each rank
provides a particular number of storage extents of a certain storage type (either FB or CKD)
to an extent pool. Finally, an extent pool aggregates the extents from the assigned ranks and
provides the logical storage capacity for the creation of logical volumes for the attached host
systems.

On the DS8000 storage system, you can configure homogeneous single-tier extent pools with
ranks of the same storage class, and hybrid multitier extent pools with ranks from different
storage classes. The Extent Allocation Methods (EAM), such as rotate extents or
storage-pool striping, provide easy-to-use capacity-based methods of spreading the workload
data across the ranks in an extent pool. Furthermore, the use of Easy Tier automatic mode to
automatically manage and maintain an optimal workload distribution across these resources
over time provides excellent workload spreading with the best performance at a minimum
administrative effort.

In addition, the various extent pool configurations (homogeneous or hybrid pools, managed or
not managed by Easy Tier) can further be combined with the DS8000 I/O Priority Manager to
prioritize workloads that are sharing resources to meet QoS goals in cases when resource
contention might occur.

The following sections present concepts for the configuration of single-tier and multitier extent
pools to spread the workloads evenly across the available hardware resources. Also, the
benefits of Easy Tier with different extent pool configurations are outlined. Unless otherwise
noted, assume that enabling Easy Tier automatic mode refers to enabling the automatic
management capabilities of Easy Tier and Easy Tier monitoring.

4.8.1 Overview of DS8000 extent pool configurations with Easy Tier


Figure 4-4 provides a brief overview of different extent pool configurations on DS8000 storage
systems by comparing the management effort and the benefits of Easy Tier. It presents four
generic extent pool configurations based on the same hardware base. You can take
step-by-step advantage of automatic storage performance and storage economics
management by using Easy Tier.

From left to right (increasing automatic storage optimization by Easy Tier, increasing ease of
storage management), Figure 4-4 shows four generic configurations:
򐂰 1-tier (homogeneous) pools, ET automode=off: manual performance management of all
  volumes within and across single-tier extent pools that use only Easy Tier manual mode
  features; requires exact planning and knowledge of workloads and the highest
  management effort, with ongoing performance monitoring and workload balancing across
  extent pools, tiers, and ranks.
򐂰 1-tier (homogeneous) pools, ET automode=all: manual cross-tier management of volumes
  across single-tier extent pools, combined with automatic intra-tier (auto-rebalance)
  performance management at the subvolume level, minimizing skew and avoiding rank hot
  spots; requires good planning and knowledge of workloads and still a high management
  effort.
򐂰 2-tier (hybrid) pools, ET automode=all|tiered: automatic intra-tier and cross-tier storage
  performance or storage economics management for selected workloads at the subvolume
  level, for example, boosting ENT performance with an SSD tier or optimizing ENT storage
  economics with an NL tier; moderate management effort.
򐂰 3-tier (hybrid) pools, ET automode=all|tiered: fully automated intra-tier and cross-tier
  storage performance and storage economics management at the subvolume level;
  minimum management effort.
The figure also notes that you can dynamically move from one extent pool configuration to
another, for example, by using rank depopulation.
Figure 4-4 Ease of storage management versus automatic storage optimization by using Easy Tier

Starting on the left, there is a DS8000 configuration (red color) with multiple homogeneous
extent pools of different storage tiers with the Easy Tier automatic mode not enabled (or
without the Easy Tier feature). With dedicated storage tiers bound to individual extent pools,
the suitable extent pool for each volume must be chosen manually based on workload
requirements. Furthermore, the performance and workload in each extent pool must be
closely monitored and managed down to the rank level to adapt to workload changes over
time. This monitoring increases overall management effort considerably. Depending on
workload changes or application needs, workloads and volumes must be migrated from one
highly used extent pool to another less used extent pool or from a lower storage tier to a
higher storage tier. Easy Tier manual mode can help to easily and dynamically migrate a
volume from one extent pool to another.

However, data placement across tiers must be managed manually and occurs on the volume
level only, which might, for example, waste costly flash capacity. Typically, only a part of the
capacity of a specific volume is hot and suited for flash or SSD placement. The workload can
become imbalanced across the ranks within an extent pool and limit the overall performance
even with storage pool striping because of natural workload skew. Workload spreading within
a pool is based on spreading only the volume capacity evenly across the ranks, not taking any
data access patterns or performance statistics into account. After adding capacity to an
existing extent pool, you must restripe the volume data within an extent pool manually, for
example, by using manual volume rebalance, to maintain a balanced workload distribution
across all ranks in a specific pool.

As shown in the second configuration (orange color), you can easily optimize the first
configuration and reduce management efforts considerably by enabling Easy Tier automatic
mode. Thus, you automate intra-tier performance management (auto-rebalance) in these
homogeneous single-tier extent pools. Easy Tier controls the workload spreading within each
extent pool and automatically relocates data across the ranks based on rank utilization to
minimize skew and avoid rank hot spots. Performance management is shifted from rank to
extent pool level with correct data placement across tiers and extent pools at the volume level.
Furthermore, when adding capacity to an existing extent pool, Easy Tier automatic mode
automatically takes advantage of the capacity and performance capabilities of the new ranks
without the need for manual interaction.

You can further reduce management effort by merging extent pools and building different
combinations of two-tier hybrid extent pools, as shown in the third configuration (blue color).
You can introduce an automatically managed tiered storage architecture but still isolate, for
example, the high-performance production workload from the development/test environment.
You can introduce an ENT/SSD pool for the high-performance and high-priority production
workload, efficiently boosting ENT performance with flash cards or SSDs and automate
storage tiering from enterprise-class drives to SSDs by using Easy Tier automated cross-tier
data relocation and storage performance optimization at the subvolume level. You can create
an ENT/NL pool for the development/test environment or other enterprise-class applications
to maintain enterprise-class performance while shrinking the footprint and reducing costs by
combining enterprise-class drives with large-capacity Nearline drives that use Easy Tier
automated cross-tier data relocation and storage economics management. In addition, you
benefit from a balanced workload distribution across all ranks within each drive tier because
of the Easy Tier intra-tier optimization and automatically take advantage of the capacity and
performance capabilities of new ranks when capacity is added to an existing pool without the
need for manual interaction.

The minimum management effort combined with the highest amount of automated storage
optimization can be achieved by creating three-tier hybrid extent pools and by using Easy Tier
automatic mode across all three tiers, as shown in the fourth configuration (green color). You
can use the most efficient way of automated data relocation to the appropriate storage tier
with automatic storage performance and storage economics optimization on the subvolume
level at a minimum administrative effort. You can use automated cross-tier management
across three storage tiers and automated intra-tier management within each storage tier in
each extent pool.

All of these configurations can be combined with I/O Priority Manager to prioritize workloads
when sharing resources and provide QoS levels in case resource contention occurs.

You can also take full advantage of the Easy Tier manual mode features, such as dynamic
volume relocation (volume migration), dynamic extent pool merge, and rank depopulation to
modify dynamically your logical configuration. When merging extent pools with different
storage tiers, you can gradually introduce more automatic storage management with Easy
Tier at any time. With rank depopulation, you can reduce multitier pools and automated
cross-tier management according to your needs.

These examples are generic. A single DS8000 storage system with its tremendous scalability
can manage many applications effectively and efficiently, so typically multiple extent pool
configurations exist on a large system to address the various needs of isolated and resource-sharing
workloads, Copy Services considerations, or other specific requirements. Easy Tier and
potentially I/O Priority Manager simplify management with single-tier and multitier extent
pools and help to spread workloads easily across shared hardware resources under optimum
conditions and best performance, automatically adapting to changing workload conditions.
You can choose from various extent pool configurations for your resource isolation and
resource sharing workloads, which are combined with Easy Tier and I/O Priority Manager.

4.8.2 Single-tier extent pools


Single-tier or homogeneous extent pools are pools that contain ranks from only one storage
class or storage tier:
򐂰 HPFE flash and SSDs
򐂰 Enterprise disks (10 K and 15 K RPM)
򐂰 Nearline disks

Single-tier extent pools consist of one or more ranks and are accordingly referred to as
single-rank or multi-rank extent pools.

Single-rank extent pools


With single-rank extent pools, there is a direct relationship between volumes and ranks based
on the extent pool of the volume. This relationship helps you manage and analyze
performance down to rank level more easily, especially with host-based tools, such as IBM
Resource Measurement Facility™ (IBM RMF™) on z Systems in combination with a
hardware-related assignment of LSS/LCU IDs. However, the administrative effort increases
considerably because you must create the volumes for a specific workload in multiple steps
from each extent pool separately when distributing the workload across multiple ranks.

With single-rank extent pools, you choose a configuration design that limits the capabilities of
a created volume to the capabilities of a single rank for capacity and performance. A single
volume cannot exceed the capacity or the I/O performance provided by a single rank. So, for
demanding workloads, you must create multiple volumes from enough ranks from different
extent pools and use host-level-based striping techniques, such as volume manager striping,
to spread the workload evenly across the ranks dedicated to a specific workload. You are also
likely to waste storage capacity if unused extents remain on ranks in different extent pools
because a single volume can be created only from extents within a single extent pool, not
across extent pools.

Furthermore, you benefit less from the advanced DS8000 virtualization features, such as
dynamic volume expansion (DVE), storage pool striping, Easy Tier automatic performance
management, and workload spreading, which use the capabilities of multiple ranks within a
single extent pool.

Single-rank extent pools are selected for environments where isolation or management of
volumes on the rank level is needed, such as in some z/OS environments. Single-rank extent
pools are also selected for configurations that use storage appliances, such as the SAN Volume
Controller, where the selected RAID arrays are provided to the appliance as simple back-end
storage capacity and where the advanced virtualization features of the DS8000 storage
system are not required or not wanted, to avoid multiple layers of data striping. However,
homogeneous multi-rank extent pools with storage pool striping are a popular choice: they
minimize the storage administrative effort by shifting performance management from the rank
level to the extent pool level and by letting the DS8000 storage system maintain a balanced
data distribution across the ranks within a specific pool. This approach provides excellent
performance in relation to the reduced management effort.

Also, you do not need to strictly use only single-rank extent pools or only multi-rank extent
pools on a storage system. You can base your decision on individual considerations for each
workload group that is assigned to a set of ranks and thus extent pools. The decision to use
single-rank and multi-rank extent pools depends on the logical configuration concept that is
chosen for the distribution of the identified workloads or workload groups for isolation and
resource-sharing.

In general, single-rank extent pools are not a good choice in today's complex and mixed
environments unless you know that this level of isolation and micro-performance
management is required for your specific environment. If not managed correctly, workload
skew and rank hot spots that limit overall system performance are likely to occur.

Multi-rank homogeneous extent pools (with only one storage tier)


If the logical configuration concept aims to balance certain workloads or workload groups
(especially large resource-sharing workload groups) across multiple ranks with the allocation
of volumes or extents on successive ranks, use multi-rank extent pools for these workloads.

With a homogeneous multi-rank extent pool, you take advantage of the advanced DS8000
virtualization features to spread the workload evenly across the ranks in an extent pool to
achieve a well-balanced data distribution with considerably less management effort.
Performance management is shifted from the rank level to the extent pool level. An extent
pool represents a set of merged ranks (a larger set of disk spindles) with a uniform workload
distribution. So, the level of complexity for standard performance and configuration
management is reduced from managing many individual ranks (micro-performance
management) to a few multi-rank extent pools (macro-performance management).

The DS8000 capacity allocation methods take care of spreading the volumes and thus the
individual workloads evenly across the ranks within homogeneous multi-rank extent pools.
Rotate extents is the default and preferred EAM to distribute the extents of each volume
successively across all ranks in a pool to achieve a well-balanced capacity-based distribution
of the workload. Rotate volumes is rarely used today, but it can help to implement a strict
volume-to-rank relationship. It reduces the configuration effort compared to single-rank extent
pools by easily distributing a set of volumes to different ranks in a specific extent pool for
workloads where the use of host-based striping methods is still preferred.

The size of the volumes must fit the available capacity on each rank. The number of volumes
that are created for this workload in a specific extent pool must match the number of ranks (or
be at least a multiple of this number). Otherwise, the result is an imbalanced volume and
workload distribution across the ranks and rank bottlenecks might emerge. However, efficient
host-based striping must be ensured in this case to spread the workload evenly across all
ranks, eventually from two or more extent pools. For more information about the EAMs and
how the volume data is spread across the ranks in an extent pool, see 3.2.7, “Extent
allocation methods” on page 62.

Even multi-rank extent pools that are not managed by Easy Tier provide some level of control
over the volume placement across the ranks in cases where it is necessary to manually
enforce a special volume allocation scheme: You can use the DSCLI command chrank
-reserve to reserve all of the extents from a rank in an extent pool from being used for the
next creation of volumes. Alternatively, you can use the DSCLI command chrank -release to
release a rank and make the extents available again.
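
The following DSCLI sequence is a minimal sketch of this technique; the rank, extent pool,
and volume IDs are illustrative only. The extents of rank R6 are reserved so that the new
volumes are allocated on the remaining ranks of pool P2, and the rank is released afterward:

dscli> chrank -reserve R6
dscli> mkfbvol -extpool P2 -cap 100 -name itso_#h 1100-1103
dscli> chrank -release R6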

Multi-rank extent pools that use storage pool striping are the general configuration approach
today on modern DS8000 storage systems to spread the data evenly across the ranks in a
homogeneous multi-rank extent pool and thus reduce skew and the likelihood of single-rank
hot spots. Without Easy Tier automatic mode management, such non-managed,
homogeneous multi-rank extent pools consist only of ranks of the same drive type and RAID
level. Although not required (and probably not realizable for smaller or heterogeneous
configurations), you can take the effective rank capacity into account, grouping ranks with and
without spares into different extent pools when you use storage pool striping to ensure a
strictly balanced workload distribution across all ranks up to the last extent. Otherwise, give
additional consideration to the volumes that are created from the last extents in a mixed
homogeneous extent pool that contains ranks with and without spares, because these
volumes are probably allocated only on the subset of ranks with the larger capacity and
without spares.

In combination with Easy Tier, a more efficient and automated way of spreading the
workloads evenly across all ranks in a homogeneous multi-rank extent pool is available. The
automated intra-tier performance management (auto-rebalance) of Easy Tier efficiently
spreads the workload evenly across all ranks. It automatically relocates the data across the
ranks of the same storage class in an extent pool based on rank utilization to achieve and
maintain a balanced distribution of the workload, minimizing skew and avoiding rank hot
spots. You can enable auto-rebalance for homogeneous extent pools by setting the Easy Tier
management scope to all extent pools (ETautomode=all).
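
As a sketch of how this setting is typically applied with the DSCLI, assuming a code level
that supports these parameters (the storage image ID is illustrative, and the exact option
names can vary by release, so verify them with the chsi command help):

dscli> chsi -etmonitor all -etautomode all IBM.2107-75XY123
dscli> showsi IBM.2107-75XY123

The showsi command can be used to verify the current Easy Tier monitor and automatic
mode settings.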

In addition, Easy Tier automatic mode can also handle storage device variations within a tier
by using a micro-tiering capability. An example of storage device variations within a tier is an
intermix of ranks with different disk types (RPM or RAID level) within the same storage class
of an extent pool, or when mixing classical SSDs with the more powerful HPFE cards.

A typical micro-tiering scenario is, for example, when after a hardware upgrade, new 15 K
RPM Enterprise disk drives intermix with existing 10 K RPM Enterprise disk drives. In these
configurations, the Easy Tier automatic mode micro-tiering capability accounts for the
different performance profiles of each micro-tier and performs intra-tier (auto-rebalance)
optimizations. Easy Tier does not handle a micro-tier like an additional tier; it is still part of a
specific tier. For this reason, the heat of an extent does not trigger any promotion or demotion
across micro-tiers of the same tie because the extent relocation across micro-tiers can occur
only as part of the auto-rebalance feature and is based on rank utilization.

Note: Easy Tier does not differentiate 10 K and 15 K Enterprise disk drives as separate
storage tiers in a managed extent pool, or between SSDs and HPFE cards. Both drive
types are considered as the same storage tier and no automated cross-tier promotion or
demotion algorithms are applied between these two storage classes. Easy Tier automated
data relocation across tiers to optimize performance and storage economics based on the
hotness of the particular extent takes place only between different storage tiers. If these
drives are mixed in the same managed extent pool, the Easy Tier auto-rebalance algorithm
balances the workload only across all ranks of this Enterprise-class tier based on overall
rank utilization, taking the performance capabilities of each rank (micro-tiering capability)
into account.

Figure 4-5 provides two configuration examples for using dedicated homogeneous extent
pools with storage classes in combination with and without Easy Tier automatic mode
management.

Figure 4-5 contrasts the extent pool features of two single-tier configurations, each with
dedicated SSD, ENT, and NL pools:
򐂰 1-tier extent pools with ET automode=all: manual cross-tier workload management at the
  volume level by using Easy Tier manual mode migrations (dynamic volume relocation)
  across pools and tiers, combined with automatic intra-tier workload management and
  workload spreading within each extent pool at the subvolume level (auto-rebalance),
  based on workload characteristics and rank utilization. This configuration constantly
  adapts to changing workload conditions, minimizes skew, avoids rank hot spots, spreads
  workload efficiently even across ranks of the same storage class but different drive
  characteristics or RAID levels (micro-tiering), and automatically takes advantage of new
  capacity that is added to existing pools. Workload isolation across pools and tiers remains
  at the volume level, and the administration and performance management effort is still
  high.
򐂰 1-tier extent pools with ET automode=none: manual cross-tier workload management at
  the volume level plus manual intra-tier workload spreading within strictly homogeneous
  pools by using the DS8000 extent allocation methods, such as storage pool striping,
  based on a balanced volume capacity distribution. This configuration has the highest
  administration and performance management effort, with constant resource utilization
  monitoring, workload balancing, and manual placement of volumes across ranks, extent
  pools, and storage tiers. Efficiently using new capacity that is added to existing pools
  typically requires manual restriping of volumes by using manual volume rebalance.
Figure 4-5 Single-tier extent pool configuration examples and Easy Tier benefits

With multi-rank extent pools, you can fully use the features of the DS8000 virtualization
architecture and Easy Tier that provide ease of use when you manage more applications
effectively and efficiently with a single DS8000 storage system. Consider multi-rank extent
pools and the use of Easy Tier automatic management especially for mixed workloads that
will be spread across multiple ranks. Multi-rank extent pools help simplify management and
volume creation. They also allow the creation of single volumes that can span multiple ranks
and thus exceed the capacity and performance limits of a single rank.

Easy Tier manual mode features, such as dynamic extent pool merge, dynamic volume
relocation (volume migration), and rank depopulation also help you easily manage complex
configurations with different extent pools. You can migrate volumes from one highly used
extent pool to another less used one, or from an extent pool with a lower storage class to
another one associated with a higher storage class, and merge smaller extent pools to larger
ones. You can also redistribute the data of a volume within a pool by using the manual volume
rebalance feature, for example, after capacity is added to a pool or two pools are merged, to
optimize manually the data distribution and workload spreading within a pool. However,
manual extent pool optimization and performance management, such as manual volume
rebalance, is not required (and not supported) if the pools are managed by Easy Tier
automatic mode. Easy Tier automatically places the data in these pools even if the pools are
merged or capacity is added to a pool.

For more information about data placement in extent pool configurations, see 3.3.2, “Extent
pool considerations” on page 72.

Important: Multi-rank extent pools offer numerous advantages with respect to ease of use,
space efficiency, and the DS8000 virtualization features. Multi-rank extent pools, in
combination with Easy Tier automatic mode, provide both ease of use and excellent
performance for standard environments with workload groups that share a set of
homogeneous resources.

4.8.3 Multitier extent pools


Multitier or hybrid extent pools are multi-rank pools that consist of ranks from different storage
classes or storage tiers. Data placement within and across these
tiers can automatically be managed by Easy Tier providing automated storage performance
and storage economics optimization.

A multitier extent pool can consist of one of the following storage class combinations with up
to three storage tiers:
򐂰 HPFE cards/SSD + Enterprise disk
򐂰 HPFE cards/SSD + Nearline disk
򐂰 Enterprise disk + Nearline disk
򐂰 HPFE cards/SSD + Enterprise disk + Nearline disk

Multitier extent pools are especially suited for mixed, resource-sharing workloads. Tiered
storage, as described in 4.1, “Reviewing the tiered storage concepts and Easy Tier” on
page 84, is an approach of using different types of storage throughout the storage
infrastructure: a mix of higher-performing/higher-cost storage with
lower-performing/lower-cost storage, placing data based on its specific I/O access
characteristics. Although flash can efficiently boost enterprise-class performance, you can
additionally shrink the footprint and reduce costs by adding large-capacity Nearline drives
while maintaining enterprise-class
performance. Correctly balancing all the tiers eventually leads to the lowest cost and best
performance solution.

Always create hybrid extent pools for Easy Tier automatic mode management. The extent
allocation for volumes in hybrid extent pools differs from the extent allocation in homogeneous
pools. Any specified EAM, such as rotate extents or rotate volumes, is ignored when a new
volume is created in, or migrated into, a hybrid pool. The EAM is changed to managed when
the Easy Tier automatic mode is enabled for the pool, and the volume is under the control of
Easy Tier. Easy Tier then automatically moves extents to the most appropriate storage tier
and rank in the pool based on performance aspects.

In hybrid extent pools (even if not managed by Easy Tier), an initial EAM, similar to rotate
extents, is used for new volumes. The initial volume creation always starts on the ranks of the
Enterprise tier first in a rotate extents-like fashion. It continues on the ranks of the flash and
Nearline tiers if insufficient capacity is available on the initial tier.

Easy Tier automatically spreads workload across the resources (ranks and DAs) in a
managed hybrid pool. Easy Tier automatic mode adapts to changing workload conditions and
automatically promotes hot extents from the lower tier to the next upper tier. It demotes colder
extents from the higher tier to the next lower tier (swap extents from flash with hotter extents
from Enterprise tier, or demote cold extents from Enterprise tier to Nearline tier). Easy Tier
automatic mode optimizes the Nearline tier by demoting some of the sequential workload to
the Nearline tier to better balance sequential workloads. The auto-rebalance feature
rebalances extents across the ranks of the same tier based on rank utilization to minimize
skew and avoid hot spots. Auto-rebalance takes different device characteristics into account
when different devices or RAID levels are mixed within the same storage tier (micro-tiering).

Regarding the requirements of your workloads, you can create one or multiple pairs of extent
pools with different two-tier or three-tier combinations that depend on your needs and
available hardware resources. You can, for example, create separate two-tier SSD/ENT and
ENT/NL extent pools to isolate your production environment from your development
environment. You can boost the performance of your production application with flash cards
or SSDs and optimize storage economics for your development applications with NL drives.

You can create three-tier extent pools for mixed, large resource-sharing workload groups and
benefit from fully automated storage performance and economics management at a minimum
management effort. You can boost the performance of your high-demand workloads with
flash and reduce the footprint and costs with NL drives for the lower-demand data.

You can use the DSCLI showfbvol/showckdvol -rank or -tier commands to display the
current extent distribution of a volume across the ranks and tiers, as shown in Example 3-2 on
page 78. Additionally, the volume heat distribution (volume heat map), provided by the STAT,
can help identify the amount of hot, warm, and cold extents for each volume and its
distribution across the storage tiers in the pool. For more information about STAT, see 6.5,
“Storage Tier Advisor Tool” on page 213.
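
The following sketch shows the command form; the volume IDs are illustrative only. The
output lists the extent distribution of the specified volume per tier or per rank:

dscli> showfbvol -tier 1000
dscli> showckdvol -rank 0010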

The ratio of SSD, ENT, and NL disk capacity in a hybrid pool depends on the workload
characteristics and skew and must be planned when ordering the drive hardware for the
identified workloads.

With the Easy Tier manual mode features, such as dynamic extent pool merge, dynamic
volume relocation, and rank depopulation, you can modify existing configurations easily,
depending on your needs. You can grow from a manually managed single-tier configuration
into a partially or fully automatically managed tiered storage configuration. You add tiers or
merge appropriate extent pools and enable Easy Tier at any time. For more information about
Easy Tier, see IBM DS8000 Easy Tier, REDP-4667.

Important: Multitier extent pools and Easy Tier help you implement a tiered storage
architecture on a single DS8000 storage system with all its benefits at a minimum
management effort and ease of use. Easy Tier and its automatic data placement within
and across tiers spread the workload efficiently across the available resources in an extent
pool. Easy Tier constantly optimizes storage performance and storage economics and
adapts to changing workload conditions. Easy Tier can reduce overall performance
management efforts and help consolidate more workloads efficiently and effectively on a
single DS8000 storage system. It optimizes performance and reduces energy costs and
the footprint.

Figure 4-6 provides two configuration examples of using hybrid extent pools with Easy Tier.
One example shares all resources for all workloads with automatic management across three
tiers. The other example isolates workload groups with different optimization goals to two-tier
extent pool combinations.

Figure 4-6 contrasts the extent pool features of two hybrid configurations:
򐂰 3-tier (hybrid) extent pools with ET automode=all|tiered: fully automated storage tiering at
  the subvolume level across three tiers, with automated and optimized workload spreading
  within extent pools, automated intra-tier and cross-tier data relocation at the extent level,
  and combined storage performance and storage economics optimization at a minimum
  administration and performance management effort. Performance of high-demand
  workloads is automatically boosted with the SSD tier, enterprise performance is
  maintained while costs and footprint are reduced with the NL tier, and new capacity that is
  added to existing pools is used automatically. Easy Tier manual mode features provide
  additional benefits when managing volumes or workloads across multiple extent pools.
򐂰 2-tier (hybrid) extent pools with ET automode=all|tiered: fully automated storage tiering at
  the subvolume level across any two tiers in different extent pools (for example, SSD/ENT
  and ENT/NL) with different storage optimization goals, which isolates workloads with
  different goals. Selected high-priority workloads are boosted with SSDs in SSD/ENT (or
  SSD/NL) pools, and enterprise performance is maintained while costs and footprint are
  reduced with NL drives in ENT/NL pools. The administration effort is reduced, but volumes
  still must be placed manually in the appropriate extent pool and two-tier combination.
Figure 4-6 Multitier extent pool configuration examples and Easy Tier benefits

Pinning volumes to a tier


By using the DSCLI commands manageckdvol -action tierassign or managefbvol -action
tierassign, volumes in multitier pools can be assigned to a flash tier (-tier ssd), to an
Enterprise HDD tier (-tier ent), to a Nearline HDD tier (-tier nl), or you can tell the system
that you want them to float freely between flash and Enterprise, but not use Nearline (-tier
nlexclude). Such pinning can be permanent or temporary. For example, shortly before an
important end-of-year run of a top-priority database, you might move the volumes of this
database to the SSD tier. After the yearly run is finished, you release these volumes again
(-action tierunassign) and let them migrate freely between all available tiers. Such setups
flexibly use a multitier extent pool structure and still give a short-term boost, or throttle, to
certain volumes when needed for a certain period.
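
As a brief sketch of this procedure with the DSCLI (the volume IDs are illustrative), you can
pin a set of FB volumes to the flash tier before the critical run and release them afterward:

dscli> managefbvol -action tierassign -tier ssd 1000-1003
dscli> managefbvol -action tierunassign 1000-1003

For CKD volumes, use manageckdvol with the same actions.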

Copy Services replication scenarios


When doing replication between two DS8000 storage systems, each using a hybrid multitier
extent pool, all recent DS8000 models offer a heatmap transfer from the primary DS8000
storage system to the secondary DS8000 storage system, so that in case of a sudden failure
of the primary system, the performance on the secondary system is immediately the same as
before the DR failover. This Easy Tier heatmap transfer can be carried out by Copy Services
Manager, by IBM GDPS®, or by a small tool that regularly connects to both DS8000 storage
systems. In that case, for each Copy Services pair, the local Easy Tier learning at the PPRC
target volume is disabled, and the heat map of the primary volume is used instead.
For more information about this topic, see IBM DS8870 Easy Tier Heat Map Transfer,
REDP-5015.

4.8.4 Additional implementation considerations for multitier extent pools
For multitier extent pools managed by Easy Tier automatic mode, consider how to allocate
the initial volume capacity and how to introduce gradually new workloads to managed pools.

Using thin provisioning with Easy Tier in multitier extent pools


In environments that use FlashCopy, which supports the usage of thin-provisioned volumes
(ESE), you might start using thin-provisioned volumes in managed hybrid extent pools,
especially with three tiers. This configuration starts the initial volume allocation for as many
volumes as possible on the Enterprise tier (home tier). It also avoids the situation where fully
provisioned volumes, with their unused allocated capacity, exhaust the storage capacity of
the Enterprise tier and force the initial creation of further volumes on the Nearline drive tier.

In this case, only used capacity is allocated in the pool and Easy Tier does not move unused
extents around or move hot extents on a large scale up from the Nearline tier to the
Enterprise tier and to the flash tier. However, thin-provisioned volumes are not fully supported
by all DS8000 Copy Services or advanced functions and platforms yet, so it might not be a
valid approach for all environments at this time. For more information about the initial volume
allocation in hybrid extent pools, see “Extent allocation in hybrid and managed extent pools”
on page 63.
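
As a minimal sketch, and assuming that extent space efficient (ESE) volumes are supported
for your Copy Services configuration, a thin-provisioned FB volume is created with the
-sam ese storage allocation method (the pool, capacity, and volume IDs are illustrative):

dscli> mkfbvol -extpool P1 -cap 500 -sam ese -name thin_#h 1200-1207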

Staged implementation approach for multitier extent pools


In a new three-tier environment, volumes are first created on the Enterprise tier, and Easy
Tier cannot “learn” and optimize before production starts. In a migration scenario, if you are
migrating all the servers at once, some server volumes, for reasons of space, might be placed
completely on Nearline drives first, although these servers also might have higher
performance requirements. For more information about the initial volume allocation in hybrid
extent pools, see “Extent allocation in hybrid and managed extent pools” on page 63.

When bringing a new DS8880 storage system into production to replace an older one, with
the older storage system often not using Easy Tier, consider the timeline of the
implementation stages by which you migrate all servers from the older to the new storage
system.

One good option is to consider a staged approach when migrating servers to a new multitier
DS8880 storage system:
򐂰 Assign the resources for the high-performance and response-time-sensitive workloads
first, then add the less demanding workloads. The reverse order might lead to situations
where all initial resources, such as the Enterprise tier in hybrid extent pools, are already
allocated by the secondary workloads. This situation does not leave enough space on the
Enterprise tier for the primary workloads, which then must initially be placed on the
Nearline tier.
򐂰 Split your servers into several subgroups, where you migrate each subgroup one by one,
and not all at once. Then, allow Easy Tier several days to learn and optimize. Some
extents are moved to flash and some extents are moved to Nearline. You regain space on
the Enterprise HDDs. After a server subgroup learns and reaches a steady state, the next
server subgroup can be migrated. You gradually allocate the capacity in the hybrid extent
pool by optimizing the extent distribution of each application one by one while regaining
space in the Enterprise tier (home tier) for the next applications.

Another option that can help in some cases of new deployments is to reset the Easy Tier
learning heatmap for a certain subset of volumes, or for some pools. This action cuts off all
the previous days of Easy Tier learning, and the next upcoming internal auto-migration plan is
based on brand new workload patterns only.

4.8.5 Extent allocation in homogeneous multi-rank extent pools
When creating a new volume, the extent allocation method (EAM) determines how a volume
is created within a multi-rank extent pool for the allocation of the extents on the available
ranks. The rotate extents algorithm spreads the extents of a single volume and the I/O activity
of each volume across all the ranks in an extent pool. Furthermore, it ensures that the
allocation of the first extent of successive volumes starts on different ranks rotating through all
available ranks in the extent pool to optimize workload spreading across all ranks.

With the default rotate extents (rotateexts in DSCLI, Rotate capacity in the GUI) algorithm,
the extents (1 GiB for FB volumes and 1113 cylinders or approximately 0.94 GiB for CKD
volumes) of each single volume are spread across all ranks within an extent pool and thus
across more disks. This approach reduces the occurrences of I/O hot spots at the rank level
within the storage system. Storage pool striping helps to balance the overall workload evenly
across the back-end resources. It reduces the risk of single ranks that become performance
bottlenecks while providing ease of use with less administrative effort.

When using the optional rotate volumes (rotatevols in DSCLI) EAM, each volume is placed
on a single separate rank with a successive distribution across all ranks in a round-robin
fashion.
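
The following DSCLI sketch contrasts the two allocation methods (the pool, capacity, and
volume IDs are illustrative). The first command stripes each volume across all ranks of pool
P2, and the second places each volume on a single rank in a round-robin fashion:

dscli> mkfbvol -extpool P2 -cap 200 -eam rotateexts -name strp_#h 1300-1303
dscli> mkfbvol -extpool P2 -cap 200 -eam rotatevols -name rvol_#h 1400-1403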

The rotate extents and rotate volumes EAMs determine the initial data distribution of a
volume and thus the spreading of workloads in non-managed, single-tier extent pools. With
Easy Tier automatic mode enabled for single-tier (homogeneous) or multitier (hybrid) extent
pools, this selection becomes unimportant. The data placement and thus the workload
spreading is managed by Easy Tier. The use of Easy Tier automatic mode for single-tier
extent pools is highly encouraged for an optimal spreading of the workloads across the
resources. In single-tier extent pools, you can benefit from the Easy Tier automatic mode
feature auto-rebalance. Auto-rebalance constantly and automatically balances the workload
across ranks of the same storage tier based on rank utilization, minimizing skew and avoiding
the occurrence of single-rank hot spots.

Additional considerations for specific applications and environments


This section provides considerations when manually spreading the workload in single-tier
extent pools and selecting the correct EAM without using Easy Tier. The use of Easy Tier
automatic mode also for single-tier extent pools is highly encouraged. Shared environments
with mixed workloads that benefit from storage-pool striping and multitier configurations also
benefit from Easy Tier automatic management. However, Easy Tier automatic management
might not be an option in highly optimized environments where a fixed volume-to-rank
relationship and well-planned volume configuration is required.

Many, if not most, application environments can benefit from the use of storage pool striping
(rotate extents):
򐂰 Operating systems that do not directly support host-level striping.
򐂰 VMware datastores.
򐂰 Microsoft Exchange.
򐂰 Windows clustering environments.
򐂰 Older Solaris environments.
򐂰 Environments that need to suballocate storage from a large pool.
򐂰 Applications with multiple volumes and volume access patterns that differ from day to day.
򐂰 Resource sharing workload groups that are dedicated to many ranks with host operating
systems that do not all use or support host-level striping techniques or application-level
striping techniques.

However, there might also be valid reasons for not using storage-pool striping, for example,
to avoid unnecessary layers of striping and reorganization of I/O requests, which might
increase latency without achieving a more evenly balanced workload distribution. Multiple
independent striping layers might be counterproductive. For example, creating a number of
volumes from a single multi-rank extent pool that uses storage pool striping and then,
additionally, using host-level striping or application-based striping on the same set of volumes
might compromise performance. In this case, two layers of striping are combined with no
overall performance benefit. In contrast, it is reasonable to create four volumes from four
different extent pools from both rank groups that use storage pool striping and then to use
host-based striping or application-based striping on these four volumes to aggregate the
performance of the ranks in all four extent pools and both processor complexes.

Consider the following products for specific environments:


򐂰 SAN Volume Controller: SAN Volume Controller is a storage appliance with its own
methods of striping and its own implementation of IBM Easy Tier. So, you can use striping
or Easy Tier on the SAN Volume Controller or the DS8000 storage system. For more
information, see Chapter 15, “IBM System Storage SAN Volume Controller attachment” on
page 491.
򐂰 ProtecTIER: ProtecTIER is a backup solution and storage appliance. ProtecTIER storage
configuration guidelines are partially similar to the ones for SAN Volume Controller
attachment, so that certain SAN Volume Controller guidelines can apply. For more
information, see Chapter 16, “IBM ProtecTIER data deduplication” on page 509.
򐂰 IBM i: IBM i has also its own methods of spreading workloads and data sets across
volumes. For more information about IBM i storage configuration suggestions and Easy
Tier benefits, see Chapter 13, “Performance considerations for the IBM i system” on
page 415.
򐂰 z Systems: z/OS typically implements data striping with storage management subsystem
(SMS) facilities. To make the SMS striping effective, storage administrators must plan to
ensure a correct data distribution across the physical resources. For this reason,
single-rank extent pool configuration was preferred in the past. Using storage-pool striping
and multi-rank extent pools can considerably reduce storage configuration and
management efforts because a uniform striping within an extent pool can be provided by
the DS8000 storage system. Multi-rank extent pools, storage-pool striping, and Easy Tier
are reasonable options today for z Systems configurations. For more information, see
Chapter 14, “Performance considerations for IBM z Systems servers” on page 459.
򐂰 Database volumes: If these volumes are used by databases or applications that explicitly
manage the workload distribution by themselves, these applications might achieve
maximum performance by using their native techniques for spreading their workload
across independent LUNs from different ranks. Especially with IBM DB2 or Oracle, where
the vendor suggests specific volume configurations, for example, with DB2 Database
Partitioning Feature (DPF) and DB2 Balanced Warehouses (BCUs), or with Oracle
Automatic Storage Management (ASM), it is preferable to follow those suggestions. For
more information about this topic, see Chapter 17, “Databases for open performance” on
page 513 and Chapter 18, “Database for IBM z/OS performance” on page 531.
򐂰 Applications that evolved particular storage strategies over a long time, with proven
benefits, and where it is unclear whether they can additionally benefit from using storage
pool striping. When in doubt, follow the vendor recommendations.

DS8000 storage-pool striping is based on spreading extents across different ranks. So, with
extents of 1 GiB (FB) or 0.94 GiB (1113 cylinders/CKD), the size of a data chunk is rather
large. For distributing random I/O requests, which are evenly spread across the capacity of
each volume, this chunk size is appropriate. However, depending on the individual access
pattern of a specific application and the distribution of the I/O activity across the volume
capacity, certain applications perform better with more granular stripe sizes. In these cases,
optimize the distribution of the application I/O requests across different RAID arrays by using
host-level striping techniques, or have the application manage the workload distribution
across independent volumes from different ranks.

Consider the following points for selected applications or environments to use storage-pool
striping in homogeneous configurations:
򐂰 DB2: Excellent opportunity to simplify storage management by using storage-pool striping.
You might prefer to use DB2 traditional recommendations for DB2 striping for
performance-sensitive environments.
򐂰 DB2 and similar data warehouse applications, where the database manages storage and
parallel access to data. Consider independent volumes on individual ranks with a careful
volume layout strategy that does not use storage-pool striping. Containers or database
partitions are configured according to suggestions from the database vendor.
򐂰 Oracle: Excellent opportunity to simplify storage management for Oracle. You might prefer
to use Oracle traditional suggestions that involve ASM and Oracle striping capabilities for
performance-sensitive environments.
򐂰 Small, highly active logs or files: Small, highly active files or storage areas smaller than
1 GiB with a high access density might require spreading across multiple ranks for
performance reasons. However, storage-pool striping offers a striping granularity only at the
extent level, that is, about 1 GiB, which is too large in this case. Continue to use host-level
or application-level striping techniques that support smaller stripe sizes. For example,
assume that a 0.8 GiB log file with heavy write activity exists and that you want to spread
its activity across four ranks. At least four 1 GiB extents must be allocated, one extent on
each rank (which is the smallest possible allocation). Creating four separate volumes, each
with a 1 GiB extent from a different rank, and then using Logical Volume Manager (LVM)
striping with a relatively small stripe size (for example, 16 MiB) effectively distributes the
workload across all four ranks (see the command sketch after this list). Creating a single
LUN of four extents, which is also distributed across the four ranks by using DS8000
storage-pool striping, cannot effectively spread the file workload evenly across all four
ranks because the stripe size of one extent is larger than the actual size of the file.
򐂰 IBM Spectrum™ Protect/Tivoli Storage Manager storage pools: Tivoli Storage Manager
storage pools work well in striped pools. But, Tivoli Storage Manager suggests that the
Tivoli Storage Manager databases be allocated in a separate pool or pools.
򐂰 AIX volume groups (VGs): LVM and physical partition (PP) striping continue to be powerful
tools for managing performance. In combination with storage-pool striping, now
considerably fewer stripes are required for common environments. Instead of striping
across a large set of volumes from many ranks (for example, 32 volumes from 32 ranks),
striping is required only across a few volumes from a small set of different multi-rank
extent pools from both DS8000 rank groups that use storage-pool striping. For example,
use four volumes from four extent pools, each with eight ranks. For specific workloads,
using the advanced AIX LVM striping capabilities with a smaller granularity at the KiB or
MiB level, instead of storage-pool striping with 1 GiB extents (FB), might be preferable to
achieve the highest performance.



򐂰 Microsoft Windows volumes: Typically, only a few large LUNs per host system are
preferred, and host-level striping is not commonly used. So, storage-pool striping is an
ideal option for Windows environments. You can create single, large-capacity volumes that
offer the performance capabilities of multiple ranks. A volume is not limited by the
performance limits of a single rank, and the DS8000 storage system spreads the I/O load
across multiple ranks.
򐂰 Microsoft Exchange: Storage-pool striping makes it easier for the DS8000 storage system
to conform to Microsoft sizing suggestions for Microsoft Exchange databases and logs.
򐂰 Microsoft SQL Server: Storage-pool striping makes it easier for the DS8000 storage
system to conform to Microsoft sizing suggestions for Microsoft SQL Server databases
and logs.
򐂰 VMware datastores for virtual machine storage (VMware ESX Server Filesystem (VMFS)
or virtual raw device mapping (RDM) access): Because datastores concatenate LUNs
rather than striping them, allocate the LUNs inside a striped storage pool. Estimating the
number of disks (or ranks) to support a specific I/O load is then straightforward, based on
the specific requirements.
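
As an illustration of the log file scenario mentioned in this list, the following hedged sketch
shows one possible implementation with DSCLI and AIX LVM commands. It assumes a multi-rank FB
extent pool P0 with at least four ranks; the volume IDs, capacities, names, and hdisk numbers
are placeholders, and the exact options depend on your DSCLI and AIX levels.

# DSCLI: create four 1 GiB volumes; with -eam rotatevols, each volume starts on a
# different rank of the multi-rank extent pool P0
mkfbvol -extpool P0 -cap 1 -type ds -eam rotatevols -name loglun_#h 1000-1003

# AIX: build a volume group over the four corresponding hdisks and a striped logical
# volume with a 16 MiB strip size (the logical partition count depends on the PP size)
mkvg -y logvg hdisk4 hdisk5 hdisk6 hdisk7
mklv -y loglv -t jfs2 -S 16M logvg 4 hdisk4 hdisk5 hdisk6 hdisk7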

In general, storage-pool striping helps improve overall performance and reduces the effort of
performance management by evenly distributing data and workloads across a larger set of
ranks, which reduces skew and hot spots. Certain application workloads can also benefit from
the higher number of disk spindles behind one volume. But, there are cases where host-level
striping or application-level striping might achieve a higher performance, at the cost of higher
overall administrative effort. Storage-pool striping might deliver good performance in these
cases with less management effort, but manual striping with careful configuration planning
can achieve the ultimate preferred levels of performance. So, for overall performance and
ease of use, storage-pool striping might offer an excellent compromise for many
environments, especially for larger workload groups where host-level striping techniques or
application-level striping techniques are not widely used or available.

Note: Business- and performance-critical applications always require careful configuration
planning and individual decisions on a case-by-case basis. Determine whether to use
storage-pool striping or LUNs from dedicated ranks together with host-level striping
techniques or application-level striping techniques for the preferred performance.

Storage-pool striping is well-suited for new extent pools.

4.8.6 Balancing workload across available resources


To achieve a balanced utilization of all available resources of the DS8000 storage system, you
must distribute the I/O workloads evenly across the available back-end resources:
򐂰 Ranks (disk drive modules)
򐂰 DA pairs

You must distribute the I/O workloads evenly across the available front-end resources:
򐂰 I/O ports
򐂰 HA cards
򐂰 I/O enclosures

You must distribute the I/O workloads evenly across both DS8000 processor complexes
(called storage server 0/CEC#0 and storage server 1/CEC#1) as well.

Configuring the extent pools determines the balance of the workloads across the available
back-end resources, ranks, DA pairs, and both processor complexes.



Tip: If you use the GUI for an initial configuration, most of this balancing is done
automatically for you.

Each extent pool is associated with an extent pool ID (P0, P1, and P2, for example). Each
rank has a relationship to a specific DA pair and can be assigned to only one extent pool. You
can have as many (non-empty) extent pools as you have ranks. Extent pools can be
expanded by adding more ranks to the pool. However, when assigning a rank to a specific
extent pool, the affinity of this rank to a specific DS8000 processor complex is determined.
There is no predefined hardware affinity of ranks to a processor complex. All ranks that
are assigned to even-numbered extent pools (P0, P2, and P4, for example) form rank group 0
and are serviced by DS8000 processor complex 0. All ranks that are assigned to
odd-numbered extent pools (P1, P3, and P5, for example) form rank group 1 and are
serviced by DS8000 processor complex 1.

To spread the overall workload across both DS8000 processor complexes, a minimum of two
extent pools is required: one assigned to processor complex 0 (for example, P0) and one
assigned to processor complex 1 (for example, P1).
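
As a minimal DSCLI sketch, the following commands create one FB extent pool per rank group
(processor complex) and assign ranks to them. The pool names and rank IDs are placeholders; in
practice, assign half of the ranks of each DA pair to each rank group according to your
hardware configuration.

mkextpool -rankgrp 0 -stgtype fb fb_pool_0   # receives an even pool ID, for example P0
mkextpool -rankgrp 1 -stgtype fb fb_pool_1   # receives an odd pool ID, for example P1
chrank -extpool P0 R0   # assign half of the ranks of each DA pair to rank group 0 ...
chrank -extpool P1 R1   # ... and the other half to rank group 1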

For a balanced distribution of the overall workload across both processor complexes and both
DA cards of each DA pair, apply the following rules for each type of rank, considering its
RAID level, storage type (FB or CKD), and disk drive characteristics (disk type, RPM speed,
and capacity):
򐂰 Assign half of the ranks to even-numbered extent pools (rank group 0) and assign half of
them to odd-numbered extent pools (rank group 1).
򐂰 Spread ranks with and without spares evenly across both rank groups.
򐂰 Distribute ranks from each DA pair evenly across both rank groups.

It is important to understand that you might seriously limit the available back-end bandwidth
and thus the system overall throughput if, for example, all ranks of a DA pair are assigned to
only one rank group and thus a single processor complex. In this case, only one DA card of
the DA pair is used to service all the ranks of this DA pair and thus only half of the available
DA pair bandwidth is available.

Use the GUI for creating a new configuration. Even though it offers fewer controls, the GUI
takes care of rank and DA distribution when creating pools, so the balancing is handled for you.

4.8.7 Extent pool configuration examples


The next sections provide examples to demonstrate how to balance the arrays across extent
pools and processor complexes to provide optimum overall DS8000 system performance.
The configuration on your DS8000 storage system typically differs considerably from these
examples, depending on your hardware configuration: number of ranks, DA pairs, storage
classes, storage types (FB or CKD), RAID levels, and spares.

DS8000 extent pool configuration example


This example represents a homogeneously configured DS8000 storage system with four DA
pairs and 24 ranks of the same small form factor (SFF) HDD type that are all configured as
RAID 5. Based on the specific DS8000 hardware and rank configuration, the scheme typically
becomes more complex for the number of DA pairs, ranks, SFF or large form factor (LFF)
drive enclosures, different RAID levels or drive classes, spare distribution, and storage types.
Each DA pair is typically populated with up to six ranks before the next DA pair is used.
Expect additional ranks on all DA pairs on a fully populated system with expansion frames.



A simple two-extent pool example, evenly distributing all ranks from each DA pair across both
processor complexes for a homogeneously configured DS8000 storage system with one
workload group sharing all resources and using storage pool striping, is shown in Figure 4-7.
The volumes that are created from the last extents in the pool are distributed only across the
large 7+P ranks because the capacity on the 6+P+S ranks is exhausted. Be sure to use this
remaining capacity only for workloads with lower performance requirements in manually
managed environments or consider using Easy Tier automatic mode management
(auto-rebalance) instead.

[Figure 4-7 is a diagram with two panels: “Array and DA pair association” on the left and
“Configuration with 2 extent pools” on the right, showing the 6+P+S and 7+P ranks of DA pairs
DA0 - DA3 evenly split between extent pool P0 on processor complex 0 and extent pool P1 on
processor complex 1.]

Figure 4-7 Example of a homogeneously configured DS8000 storage system (single-tier) with two
extent pools

Another example for a homogeneously configured DS8000 storage system with four extent
pools and one workload group, which shares all resources across four extent pools, or two
isolated workload groups that each share half of the resources, is shown in Figure 4-8.

[Figure 4-8 is a diagram with two panels: “Configuration with 4 extent pools” and “Alternate
configuration with 4 extent pools”, showing the 6+P+S and 7+P ranks of DA pairs DA0 - DA3
distributed across extent pools P0 - P3 on both processor complexes.]

Figure 4-8 Example of a homogeneously configured DS8000 storage system with four extent pools



On the left in Figure 4-8 on page 119, there are four strictly homogeneous extent pools, and
each one contains only ranks of the same RAID level and capacity for the spare drives.
Storage-pool striping efficiently distributes the extents in each extent pool across all ranks up
to the last volume created in these pools, providing more ease of use. However, the extent
pools with 6+P+S ranks and 7+P ranks differ considerably in overall capacity and
performance, which might be appropriate only for two workload groups with different overall
capacity and performance requirements.

Another configuration with four extent pools is shown on the right in Figure 4-8 on page 119.
Evenly distribute the 6+P+S and 7+P ranks from all DA pairs across all four extent pools to
obtain the same overall capacity in each extent pool. However, the last capacity in these pools
is only allocated on the 7+P ranks. Use Easy Tier automatic mode management
(auto-rebalance) for these pools. Using four extent pools and storage-pool striping instead of
two can also reduce the failure boundary from one extent pool with 12 ranks (that is, if one
rank fails, all data in the pool is lost) to two distinct extent pools with only six ranks per
processor complex (for example, when physically separating table spaces from logs).

Also, consider separating workloads by using different extent pools with the principles of
workload isolation, as shown in Figure 4-9. The isolated workload can either use storage-pool
striping with the EAM or rotate volumes that are combined with host-level or application-level
striping. The workload isolation in this example is on the DA pair level (DA2). In addition, there
is one pair of extent pools for resource-sharing workload groups. Instead of manually
spreading the workloads across the ranks in each pool, consider using Easy Tier automatic
mode management (auto-rebalance) for all pools.

[Figure 4-9 is a diagram showing extent pools P0 and P1 with the DA2 ranks marked ISOLATION,
and extent pools P2 and P3 with the DA0, DA1, and DA3 ranks marked SHARED, spread across both
processor complexes.]

Figure 4-9 Example of DS8000 extent pool configuration with workload isolation on DA pair level

Another consideration for the number of extent pools to create is the usage of Copy Services,
such as FlashCopy. If you use FlashCopy, you also might consider a minimum of four extent
pools with two extent pools per rank group or processor complex. If you do so, you can
perform your FlashCopy copies from the P0 volumes (sources) to the P2 volumes (targets),
and vice versa from P2 source volumes to target volumes in pool P0. Likewise, you can
create FlashCopy pairs between the extent pools P1 and P3, and vice versa. This approach
follows the guidelines for FlashCopy performance (staying in the same processor complex
source–target, but having the target volumes on other ranks/different spindles, and preferably
on different DAs), and is also a preferred way when considering the failure boundaries.



Figure 4-10 gives an example of a balanced extent pool configuration with two extent pools on
a DS8000 storage system with three storage classes that use Easy Tier. All arrays from each
DA pair (especially the SSD arrays) are evenly distributed across both processor complexes
to use fully the back-end DA pair bandwidth and I/O processing capabilities. There is a
difference in DA pair population with ranks from LFF disk enclosures for the 3.5-inch NL
drives. A pair of LFF disk enclosures contains only 24 disk drives (three ranks) compared to a
pair of SFF disk enclosures with 48 disk drives (six ranks). All cross-tier and intra-tier data
relocation in these pools is done automatically by Easy Tier. Easy Tier constantly optimizes
storage performance and storage economics.

[Figure 4-10 is a diagram with two panels: “Array and DA pair association” and “Easy Tier
configuration with 2 extent pools”, showing SSD, Enterprise (ENT), and nearline (NL) ranks from
DA pairs DA0 - DA7 evenly distributed across the hybrid extent pools P0 and P1 on both
processor complexes, with some array sites reserved.]

Figure 4-10 Example DS8000 configuration with two hybrid extent pools that use Easy Tier

4.8.8 Assigning workloads to extent pools


Extent pools can contain only ranks of the same storage type, either FB for Open Systems or
IBM i, or CKD for FICON-based z Systems. You can have multiple extent pools in various
configurations on a single DS8000 storage system, for example, managed or non-managed
single-tier (homogeneous) extent pools with ranks of only one storage class. You can have
multitier (hybrid) extent pools with any two storage tiers, or up to three storage tiers managed
by Easy Tier.



Multiple homogeneous extent pools, each with different storage classes, easily allow tiered
storage concepts with dedicated extent pools and manual cross-tier management. For
example, you can have extent pools with slow, large-capacity drives for backup purposes and
other extent pools with high-speed, small capacity drives or flash for performance-critical
transaction applications. Also, you can use hybrid pools with Easy Tier and introduce fully
automated cross-tier storage performance and economics management.

Using dedicated extent pools with an appropriate number of ranks and DA pairs for selected
workloads is a suitable approach for isolating workloads.

The minimum number of required extent pools depends on the following considerations:
򐂰 The number of isolated and resource-sharing workload groups
򐂰 The number of different storage types, either FB for Open Systems or IBM i, or CKD for
z Systems
򐂰 Definition of failure boundaries (for example, separating logs and table spaces to different
extent pools)
򐂰 In some cases, Copy Services considerations

Although you are not restricted from assigning all ranks to only one extent pool, the minimum
number of extent pools, even with only one workload on a homogeneously configured
DS8000 storage system, must be two (for example, P0 and P1). You need one extent pool for
each rank group (or storage server) so that the overall workload is balanced across both
processor complexes.

To optimize performance, the ranks for each workload group (either isolated or
resource-sharing workload groups) must be split across at least two extent pools with an
equal number of ranks from each rank group. So, at the workload level, each workload is
balanced across both processor complexes. Typically, you assign an equal number of ranks
from each DA pair to extent pools assigned to processor complex 0 (rank group 0: P0, P2,
and P4, for example) and to extent pools assigned to processor complex 1 (rank group 1: P1,
P3, and P5, for example). In environments with FB and CKD storage (Open Systems and z
Systems), you additionally need separate extent pools for CKD and FB volumes. It is often
useful to have a minimum of four extent pools to balance the capacity and I/O workload
between the two DS8000 processor complexes. Additional extent pools might be needed to
meet individual needs, such as ease of use, implementing tiered storage concepts, or
separating ranks for different DDM types, RAID types, clients, applications, performance, or
Copy Services requirements.

However, the maximum number of extent pools is given by the number of available ranks (that
is, creating one extent pool for each rank).

In most cases, accepting the configurations that the GUI offers when doing the initial setup
and formatting already gives excellent results. For specific situations, creating dedicated
extent pools with dedicated back-end resources for separate workloads allows individual
performance management for business-critical and performance-critical applications. Compared
to share-and-spread-everything storage systems, which offer no possibility to implement
workload isolation concepts, this capability is an outstanding feature of the DS8000 storage
system as an enterprise-class storage system. With it, you can consolidate and manage various
application demands with different performance profiles, which are typical in enterprise
environments, on a single storage system.



Before configuring the extent pools, collect all the hardware-related information of each rank
for the associated DA pair, disk type, available storage capacity, RAID level, and storage type
(CKD or FB) in a spreadsheet. Then, plan the distribution of the workloads across the ranks
and their assignments to extent pools.

Plan an initial assignment of ranks to your planned workload groups, either isolated or
resource-sharing, and extent pools for your capacity requirements. After this initial
assignment of ranks to extent pools and appropriate workload groups, you can create
additional spreadsheets to hold more details about the logical configuration and finally the
volume layout of the array site IDs, array IDs, rank IDs, DA pair association, extent pools IDs,
and volume IDs, and their assignments to volume groups and host connections.

4.9 Planning address groups, LSSs, volume IDs, and CKD PAVs
After creating the extent pools and evenly distributing the back-end resources (DA pairs and
ranks) across both DS8000 processor complexes, you can create host volumes from these
pools. When creating the host volumes, it is important to follow a volume layout scheme that
evenly spreads the volumes of each application workload across all ranks and extent pools
that are dedicated to this workload to achieve a balanced I/O workload distribution across
ranks, DA pairs, and the DS8000 processor complexes.

So, the next step is to plan the volume layout and thus the mapping of address groups and
LSSs to volumes created from the various extent pools for the identified workloads and
workload groups. For performance management and analysis reasons, it can be useful to
easily relate the volumes that belong to a specific I/O workload to the ranks that provide the
physical disk spindles for servicing the workload I/O requests and that determine the I/O
processing capabilities. Therefore, an overall logical configuration concept that easily
relates volumes to workloads, extent pools, and ranks is desirable.

Each volume is associated with a hexadecimal four-digit volume ID that must be specified
when creating the volume. An example for volume ID 1101 is shown in Table 4-2.

Table 4-2 Understand the volume ID relationship to address groups and LSSs/LCUs

For the example volume ID 1101:
򐂰 First digit (1xxx): Address group (0 - F); there are 16 address groups on a DS8000
storage system.
򐂰 First and second digits (11xx): Logical subsystem (LSS) ID for FB or logical control unit
(LCU) ID for CKD; x0 - xF, that is, 16 LSSs or LCUs per address group.
򐂰 Third and fourth digits (xx01): Volume number within an LSS or LCU; 00 - FF, that is,
256 volumes per LSS or LCU.

The first digit of the hexadecimal volume ID specifies the address group, 0 - F, of that volume.
Each address group can be used only by a single storage type, either FB or CKD. The first
and second digit together specify the logical subsystem ID (LSS ID) for Open Systems
volumes (FB) or the logical control unit ID (LCU ID) for z Systems volumes (CKD). There are
16 LSS/LCU IDs per address group. The third and fourth digits specify the volume number
within the LSS/LCU, 00 - FF. There are 256 volumes per LSS/LCU. The volume with volume
ID 1101 is the volume with volume number 01 of LSS 11, and it belongs to address group 1
(first digit).



The LSS/LCU ID is related to a rank group. Even LSS/LCU IDs are restricted to volumes that
are created from rank group 0 and serviced by processor complex 0. Odd LSS/LCU IDs are
restricted to volumes that are created from rank group 1 and serviced by processor complex
1. So, the volume ID also reflects the affinity of that volume to a DS8000 processor complex.
All volumes, which are created from even-numbered extent pools (P0, P2, and P4, for
example) have even LSS IDs and are managed by DS8000 processor complex 0. All volumes
that are created from odd-numbered extent pools (P1, P3, and P5, for example) have odd
LSS IDs and are managed by DS8000 processor complex 1.

There is no direct DS8000 performance implication as a result of the number of defined
LSSs/LCUs. For the z/OS CKD environment, a DS8000 volume ID is also required for each
PAV. The maximum of 256 addresses per LCU includes both CKD base volumes and PAVs,
so the number of volumes and PAVs determines the number of required LCUs.

In the past, for performance analysis reasons, it was useful to easily identify the association
of specific volumes to ranks or extent pools when investigating resource contention. But, since
the introduction of storage-pool striping, the use of multi-rank extent pools is the preferred
configuration approach for most environments. Multitier extent pools are managed by Easy
Tier automatic mode anyway, constantly providing automatic storage intra-tier and cross-tier
performance and storage economics optimization. For single-tier pools, turn on Easy Tier
management. In managed pools, Easy Tier automatically relocates the data to the
appropriate ranks and storage tiers based on the access pattern, so the extent allocation
across the ranks for a specific volume is likely to change over time. With storage-pool striping
or extent pools that are managed by Easy Tier, you no longer have a fixed relationship
between the performance of a specific volume and a single rank. Therefore, planning for a
hardware-based LSS/LCU scheme and relating LSS/LCU IDs to hardware resources, such as
ranks, is no longer reasonable. Performance management focus is shifted from ranks to
extent pools. However, a numbering scheme that relates only to the extent pool might still be
viable, but it is less common and less practical.

The common approach that is still valid today with Easy Tier and storage pool striping is to
relate an LSS/LCU to a specific application workload with a meaningful numbering scheme
for the volume IDs for the distribution across the extent pools. Each LSS can have 256
volumes, with volume numbers 00 - FF. So, relating the LSS/LCU to a certain application
workload and additionally reserving a specific range of volume numbers for different extent
pools is a reasonable choice, especially in Open Systems environments. Because volume IDs
are transparent to the attached host systems, this approach helps the administrator of the
host system to determine easily the relationship of volumes to extent pools by the volume ID.
Therefore, this approach helps you easily identify physically independent volumes from
different extent pools when setting up host-level striping across pools. This approach helps
you when separating, for example, database table spaces from database logs on to volumes
from physically different drives in different pools.

This approach provides a logical configuration concept that provides ease of use for storage
management operations and reduces management efforts when using the DS8000 related
Copy Services because basic Copy Services management steps (such as establishing
Peer-to-Peer Remote Copy (PPRC) paths and consistency groups) are related to LSSs. If
Copy Services are not planned, plan the volume layout because overall management is
easier if you must introduce Copy Services in the future (for example, when migrating to a
new DS8000 storage system that uses Copy Services).

However, the strategy for the assignment of LSS/LCU IDs to resources and workloads can
still vary depending on the particular requirements in an environment.

The following section introduces suggestions for LSS/LCU and volume ID numbering
schemes to help to relate volume IDs to application workloads and extent pools.



4.9.1 Volume configuration scheme by using application-related LSS/LCU IDs
This section describes several suggestions for a volume ID numbering scheme in a multi-rank
extent pool configuration where the volumes are evenly spread across a set of ranks. These
examples refer to a workload group where multiple workloads share the set of resources. The
following suggestions apply mainly to Open Systems. For CKD environments, the LSS/LCU
layout is defined at the operating system level with the Input/Output Configuration Program
(IOCP) facility. The LSS/LCU definitions on the storage server must match exactly the IOCP
statements. For this reason, any consideration on the LSS/LCU layout must be made first
during the IOCP planning phase and afterward mapped to the storage server.

Typically, when using LSS/LCU IDs that relate to application workloads, the simplest
approach is to reserve a suitable number of LSS/LCU IDs according to the total number of
volumes requested by the application. Then, populate the LSS/LCUs in sequence, creating
the volumes from offset 00. Ideally, all volumes that belong to a certain application workload
or a group of related host systems are within the same LSS. However, because the volumes
must be spread evenly across both DS8000 processor complexes, at least two logical
subsystems are typically required per application workload. One even LSS is for the volumes
that are managed by processor complex 0, and one odd LSS is for volumes managed by
processor complex 1 (for example, LSS 10 and LSS 11). Moreover, consider the future
capacity demand of the application when planning the number of LSSs to be reserved for an
application. So, for those applications that are likely to increase the number of volumes
beyond the range of one LSS pair (256 volumes per LSS), reserve a suitable number of LSS
pair IDs for them from the beginning.

Figure 4-11 shows an example of an application-based LSS numbering scheme. This
example shows three applications, application A, B, and C, that share two large extent pools.
Hosts A1 and A2 both belong to application A and are assigned to LSS 10 and LSS 11, each
using a different volume ID range from the same LSS range. LSS 12 and LSS 13 are
assigned to application B, which runs on host B. Application C is likely to require more than
512 volumes, so use LSS pairs 28/29 and 2a/2b for this application.

[Figure 4-11 is a diagram with two panels: “Volume ID Configuration (Rank Group 0)” for extent
pool P0 and “Volume ID Configuration (Rank Group 1)” for extent pool P1, showing the volume ID
ranges of LSS pairs 10/11 (hosts A1 and A2, application A), 12/13 (host B, application B), and
28/29 and 2a/2b (host C, application C).]

Figure 4-11 Application-related volume layout example for two shared extent pools



If an application workload is distributed across multiple extent pools on each processor
complex, for example, in a four extent pool configuration, as shown in Figure 4-12, you can
expand this approach. Define a different volume range for each extent pool so that the system
administrator can easily identify the extent pool behind a volume from the volume ID, which is
transparent to the host systems.

In Figure 4-12, the workloads are spread across four extent pools. Again, assign two
LSS/LCU IDs (one even, one odd) to each workload to spread the I/O activity evenly across
both processor complexes (both rank groups). Additionally, reserve a certain volume ID range
for each extent pool based on the third digit of the volume ID. With this approach, you can
quickly create volumes with successive volume IDs for a specific workload per extent pool
with a single DSCLI mkfbvol or mkckdvol command.

Hosts A1 and A2 belong to the same application A and are assigned to LSS 10 and LSS 11.
For this workload, use volume IDs 1000 - 100f in extent pool P0 and 1010 - 101f in extent pool
P2 on processor complex 0. Use volume IDs 1100 - 110f in extent pool P1 and 1110 - 111f in
extent pool P3 on processor complex 1. In this case, the administrator of the host system can
easily relate volumes to different extent pools and thus different physical resources on the
same processor complex by looking at the third digit of the volume ID. This numbering
scheme can be helpful when separating, for example, DB table spaces from DB logs on to
volumes from physically different pools.

[Figure 4-12 is a diagram with two panels: “Volume ID Configuration (Rank Group 0)” for extent
pools P0 and P2 and “Volume ID Configuration (Rank Group 1)” for extent pools P1 and P3,
showing the volume ID ranges of application A (hosts A1 and A2) and application B (host B) with
a separate volume ID range per extent pool.]

Figure 4-12 Application and extent pool-related volume layout example for four shared extent pools
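
The following hedged DSCLI sketch shows how the volume ID ranges of application A in this
example might be created. The capacities and volume names are placeholders, and the #h wildcard
(which inserts the hexadecimal volume ID into the volume name) is assumed to be available at
your DSCLI level.

mkfbvol -extpool P0 -cap 100 -name appA_#h 1000-100F   # application A, LSS 10, pool P0
mkfbvol -extpool P2 -cap 100 -name appA_#h 1010-101F   # application A, LSS 10, pool P2
mkfbvol -extpool P1 -cap 100 -name appA_#h 1100-110F   # application A, LSS 11, pool P1
mkfbvol -extpool P3 -cap 100 -name appA_#h 1110-111F   # application A, LSS 11, pool P3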

The example that is depicted in Figure 4-13 on page 127 provides a numbering scheme that
can be used in a FlashCopy scenario. Two different pairs of LSS are used for source and
target volumes. The address group identifies the role in the FlashCopy relationship: address
group 1 is assigned to source volumes, and address group 2 is used for target volumes. This
numbering scheme allows a symmetrical distribution of the FlashCopy relationships across
source and target LSSs. For example, source volume 1007 in P0 uses the volume 2007 in P2
as the FlashCopy target. In this example, use the third digit of the volume ID within an LSS as
a marker to indicate that source volumes 1007 and 1017 are from different extent pools. The
same approach applies to the target volumes, for example, volumes 2007 and 2017 are from
different pools.



However, for the simplicity of the Copy Services management, you can choose a different
extent pool numbering scheme for source and target volumes (so 1007 and 2007 are not from
the same pool) to implement the recommended extent pool selection of source and target
volumes in accordance with the FlashCopy guidelines. Source and target volumes must stay
on the same rank group but different ranks or extent pools. For more information about this
topic, see the FlashCopy performance chapters in IBM DS8870 Copy Services for IBM z
Systems, SG24-6787 (for z Systems) or IBM DS8870 Copy Services for Open Systems,
SG24-6788 (for Open Systems).

[Figure 4-13 is a diagram with two panels: “Volume ID Configuration (Rank Group 0)” for extent
pools P0 and P2 and “Volume ID Configuration (Rank Group 1)” for extent pools P1 and P3,
showing the FlashCopy source LSSs (address group 1) and target LSSs (address group 2) for
application A on hosts A1 and A2.]


Figure 4-13 Application and extent pool-related volume layout example in a FlashCopy scenario
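
With such a symmetrical numbering scheme, the FlashCopy relationships can be established with
simple source:target range pairs. The following DSCLI sketch uses the volume IDs of this
example and is meant as an illustration only; it assumes that the source and target volumes
already exist:

mkflash 1000-1007:2000-2007   # P0 source volumes to P2 target volumes (processor complex 0)
mkflash 1100-1107:2100-2107   # P1 source volumes to P3 target volumes (processor complex 1)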



Tip: Use the GUI advanced Custom mode to select a specific LSS range when creating
volumes. Choose this Custom mode and also the appropriate Volume definition mode, as
shown in Figure 4-14.

Figure 4-14 Create volumes in Custom mode and specify an LSS range

4.10 I/O port IDs, host attachments, and volume groups


Finally, when planning the attachment of the host system to the storage system HA I/O ports,
you must achieve a balanced workload distribution across the available front-end resources
for each workload with the appropriate isolation and resource-sharing considerations.
Therefore, distribute the FC connections from the host systems evenly across the DS8880 HA
ports, HA cards, and I/O enclosures.



For high availability, each host system must use a multipathing device driver, such as
subsystem device driver (SDD) or one that is native to the respective operating system. Each
host system must have a minimum of two host connections to HA cards in different I/O
enclosures on the DS8880 storage system. Preferably, they are evenly distributed between
left side (even-numbered) I/O enclosures and right side (odd-numbered) I/O enclosures. The
number of host connections per host system is primarily determined by the required
bandwidth. Use an appropriate number of HA cards to satisfy high throughput demands.

For a DS8880 storage system, four-port and eight-port HA card options are available.
16 Gbps four-port HAs satisfy the highest throughput requirements. For the 8 Gbps HAs,
because the maximum available bandwidth is the same for four-port and eight-port HA cards,
the eight-port HA card provides additional connectivity but no additional performance.
Furthermore, the HA card maximum available bandwidth is less than the nominal aggregate
bandwidth and depends on the workload profile. These specifications must be considered
when planning the HA card port allocation and especially for workloads with high sequential
throughputs. Be sure to contact your IBM representative or IBM Business Partner for an
appropriate sizing, depending on your actual workload requirements. With typical
transaction-driven workloads that show high numbers of random, small-blocksize I/O
operations, all ports in a HA card can be used likewise. For the preferred performance of
workloads with different I/O characteristics, consider the isolation of large-block sequential
and small-block random workloads at the I/O port level or the HA card level.

The preferred practice is to use dedicated I/O ports for Copy Services paths and host
connections. For more information about performance aspects that are related to Copy
Services, see the performance-related chapters in IBM DS8870 Copy Services for IBM z
Systems, SG24-6787 (for z Systems) and IBM DS8870 Copy Services for Open Systems,
SG24-6788 (for Open Systems).

To assign FB volumes to the attached Open Systems hosts by using LUN masking with the
DSCLI, these volumes must be grouped in DS8000 volume groups. A volume
group can be assigned to multiple host connections, and each host connection is specified by
the worldwide port name (WWPN) of the host FC port. A set of host connections from the
same host system is called a host attachment. The same volume group can be assigned to
multiple host connections; however, a host connection can be associated only with one
volume group. To share volumes between multiple host systems, the most convenient way is
to create a separate volume group for each host system and assign the shared volumes to
each of the individual volume groups as required. A single volume can be assigned to multiple
volume groups. Only if a group of host systems shares a set of volumes, and there is no need
to assign additional non-shared volumes independently to particular hosts of this group, can
you consider using a single shared volume group for all host systems to simplify
management. Typically, there are no significant DS8000 performance implications because of
the number of DS8000 volume groups or the assignment of host attachments and volumes to
the DS8000 volume groups.
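
The following DSCLI sketch illustrates one possible way to define a volume group per host
system and to assign it to that host's connections. The WWPNs, host type, volume IDs, and names
are placeholders, and the volume group ID (V10 in this sketch) is assigned by the system when
the volume group is created.

mkvolgrp -type scsimask -volume 1000-100F,1100-110F hostA1_vg   # volume group for host A1
mkhostconnect -wwname 10000000C9123456 -hosttype pSeries -volgrp V10 hostA1_fcs0
mkhostconnect -wwname 10000000C9123457 -hosttype pSeries -volgrp V10 hostA1_fcs1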

Do not omit additional host attachment and host system considerations, such as SAN zoning,
multipathing software, and host-level striping. For more information, see Chapter 8, “Host
attachment” on page 267, Chapter 9, “Performance considerations for UNIX servers” on
page 285, “Chapter 12, “Performance considerations for Linux” on page 385, and Chapter 14,
“Performance considerations for IBM z Systems servers” on page 459.

After the DS8000 storage system is installed, you can use the DSCLI lsioport command to
display and document I/O port information, including the I/O ports, HA type, I/O enclosure
location, and WWPN. Use this information to add specific I/O port IDs, the required protocol
(FICON or FCP), and the DS8000 I/O port WWPNs to the plan of host and remote mirroring
connections that are identified in 4.4, “Planning allocation of disk and host connection
capacity” on page 94.



Additionally, the I/O port IDs might be required as input to the DS8000 host definitions if host
connections must be restricted to specific DS8000 I/O ports by using the -ioport option of
the mkhostconnect DSCLI command. If host connections are configured to allow access to all
DS8000 I/O ports, which is the default, typically the paths must be restricted by SAN zoning.
The I/O port WWPNs are required as input for SAN zoning. The lshostconnect -login
DSCLI command might help verify the final allocation of host attachments to the DS8000 I/O
ports because it lists host port WWPNs that are logged in, sorted by the DS8000 I/O port IDs
for known connections. The lshostconnect -unknown DSCLI command might further help
identify host port WWPNs, which are not yet configured to host connections, when creating
host attachments by using the mkhostconnect DSCLI command.

The DSCLI lsioport output identifies this information:


򐂰 The number of I/O ports on each installed HA
򐂰 The type of installed HAs (SW FCP/FICON-capable or LW FCP/FICON-capable)
򐂰 The distribution of HAs across I/O enclosures
򐂰 The WWPN of each I/O port

The DS8000 I/O ports use predetermined, fixed DS8000 logical port IDs in the form I0xyz,
where:
򐂰 x: I/O enclosure
򐂰 y: Slot number within the I/O enclosure
򐂰 z: Port within the adapter

For example, I0101 is the I/O port ID for these devices:


򐂰 I/O enclosure 1
򐂰 Slot 0
򐂰 Second port

Slot numbers: The slot numbers for logical I/O port IDs are one less than the physical
location numbers for HA cards, as shown on the physical labels and in IBM Spectrum
Control/Tivoli Storage Productivity Center for Disk, for example, I0101 is R1-XI2-C1-T2.

A simplified example of spreading the DS8000 I/O ports evenly to two redundant SAN fabrics
is shown in Figure 4-15 on page 131. The SAN implementations can vary, depending on
individual requirements, workload considerations for isolation and resource-sharing, and
available hardware resources.



[Figure 4-15 is a diagram showing the HA cards and I/O ports in the left I/O enclosures
(bays 0, 2, 4, and 6) and the right I/O enclosures (bays 1, 3, 5, and 7), with the ports
distributed between SAN Fabric #0 and SAN Fabric #1.]

Figure 4-15 Example of spreading DS8000 I/O ports evenly across two redundant SAN fabrics

4.10.1 I/O port planning considerations


Here are several considerations for the DS8000 I/O port planning and assignment:
򐂰 The ports on an HA card can be connected in any order or position, and they deliver the
same performance.
򐂰 The performance of an 8 Gbps HA card does not scale with more than four ports. For
Fibre Channel and FICON paths with high utilization, do not use more than four ports on
an 8 Gbps HA; the eight-port HA type of this speed is meant to offer more connectivity
options rather than more performance.
򐂰 Spread the paths from all host systems across the available I/O ports, HA cards, and I/O
enclosures to optimize workload distribution across the available resources depending on
your workload sharing and isolation considerations.
򐂰 Spread the host paths that access the same set of volumes as evenly as possible across
the available HA cards and I/O enclosures. This approach balances workload across
hardware resources, and it ensures that a hardware failure does not result in a loss of
access.
򐂰 Plan the paths for the attached host systems with a minimum of two host connections to
different HA cards in different I/O enclosures on the DS8880 storage system. Preferably,
evenly distribute them between left (even-numbered) I/O enclosures and right
(odd-numbered) I/O enclosures for the highest availability and a balanced workload
distribution across I/O enclosures and HA cards.
򐂰 Use separate DS8000 I/O ports for host attachment and Copy Services remote replication
connections (such as Metro Mirror, Global Mirror, and zGM data mover). If additional HAs
are available, consider using separate HAs for Copy Services remote replication
connections and host attachments to avoid any possible interference between remote
replication and host workloads.



򐂰 Spread Copy Services remote replication connections at least across two HA cards in
different I/O enclosures.
򐂰 Consider using separate HA cards for FICON protocol and FCP. Although I/O ports on the
same HA can be configured independently for the FCP protocol and the FICON protocol, it
might be preferable to isolate your z/OS environment (FICON) from your Open Systems
environment (FCP).

Look at Example 4-2, which shows a DS8886 storage system with a selection of different
HAs:
򐂰 You see four-port LW (longwave / single-mode) HAs, for example, I000x or I013x, which
are already configured for a FICON topology here.
򐂰 You see four-port 16 Gbps SW (shortwave / multi-mode) HAs, for example I003x.
򐂰 All SW (shortwave) HAs are configured with the FCP protocol.
򐂰 You see some eight-port HAs (8 Gbps), for example, I030x.
򐂰 You see that corresponding HA types are balanced between I/O enclosures when the
machine is coming from the manufacturing site. For example:
– For the 16 Gbps HAs, of which there are four, each of the four is in another I/O
enclosure (0, 1, 2, and 3).
– For the 8 Gbps eight-port, you have one in I/O enclosure 2 and one in I/O enclosure 3.
– For the 8 Gbps LW, you have one in I/O enclosure 0 and one in I/O enclosure 1.

Example 4-2 DS8880 HBA example - DSCLI lsioport command output


dscli> lsioport -l
ID WWPN State Type topo portgrp Speed
=======================================================================
I0000 5005076306001693 Online Fibre Channel-LW FICON 0 8 Gb/s
I0001 5005076306005693 Online Fibre Channel-LW FICON 0 8 Gb/s
I0002 5005076306009693 Online Fibre Channel-LW FICON 0 8 Gb/s
I0003 500507630600D693 Online Fibre Channel-LW FICON 0 8 Gb/s
I0030 5005076306031693 Online Fibre Channel-SW SCSI-FCP 0 16 Gb/s
I0031 5005076306035693 Online Fibre Channel-SW SCSI-FCP 0 16 Gb/s
I0032 5005076306039693 Online Fibre Channel-SW SCSI-FCP 0 16 Gb/s
I0033 500507630603D693 Online Fibre Channel-SW SCSI-FCP 0 16 Gb/s
I0100 5005076306081693 Online Fibre Channel-SW SCSI-FCP 0 16 Gb/s
I0101 5005076306085693 Online Fibre Channel-SW SCSI-FCP 0 16 Gb/s
I0102 5005076306089693 Online Fibre Channel-SW SCSI-FCP 0 16 Gb/s
I0103 500507630608D693 Online Fibre Channel-SW SCSI-FCP 0 16 Gb/s
I0130 50050763060B1693 Online Fibre Channel-LW FICON 0 8 Gb/s
I0131 50050763060B5693 Online Fibre Channel-LW FICON 0 8 Gb/s
I0132 50050763060B9693 Online Fibre Channel-LW FICON 0 8 Gb/s
I0133 50050763060BD693 Online Fibre Channel-LW FICON 0 8 Gb/s
I0200 5005076306101693 Online Fibre Channel-SW SCSI-FCP 0 16 Gb/s
I0201 5005076306105693 Online Fibre Channel-SW SCSI-FCP 0 16 Gb/s
I0202 5005076306109693 Online Fibre Channel-SW SCSI-FCP 0 16 Gb/s
I0203 500507630610D693 Online Fibre Channel-SW SCSI-FCP 0 16 Gb/s
I0230 5005076306131693 Online Fibre Channel-SW SCSI-FCP 0 8 Gb/s
I0231 5005076306135693 Online Fibre Channel-SW SCSI-FCP 0 8 Gb/s
I0232 5005076306139693 Online Fibre Channel-SW SCSI-FCP 0 8 Gb/s
I0233 500507630613D693 Online Fibre Channel-SW SCSI-FCP 0 8 Gb/s
I0234 5005076306531693 Online Fibre Channel-SW SCSI-FCP 0 8 Gb/s
I0235 5005076306535693 Online Fibre Channel-SW SCSI-FCP 0 8 Gb/s



I0236 5005076306539693 Online Fibre Channel-SW SCSI-FCP 0 8 Gb/s
I0237 500507630653D693 Online Fibre Channel-SW SCSI-FCP 0 8 Gb/s
I0300 5005076306181693 Online Fibre Channel-SW SCSI-FCP 0 8 Gb/s
I0301 5005076306185693 Online Fibre Channel-SW SCSI-FCP 0 8 Gb/s
I0302 5005076306189693 Online Fibre Channel-SW SCSI-FCP 0 8 Gb/s
I0303 500507630618D693 Online Fibre Channel-SW SCSI-FCP 0 8 Gb/s
I0304 5005076306581693 Online Fibre Channel-SW SCSI-FCP 0 8 Gb/s
I0305 5005076306585693 Online Fibre Channel-SW SCSI-FCP 0 8 Gb/s
I0306 5005076306589693 Online Fibre Channel-SW SCSI-FCP 0 8 Gb/s
I0307 500507630658D693 Online Fibre Channel-SW SCSI-FCP 0 8 Gb/s
I0330 50050763061B1693 Online Fibre Channel-SW SCSI-FCP 0 16 Gb/s
I0331 50050763061B5693 Online Fibre Channel-SW SCSI-FCP 0 16 Gb/s
I0332 50050763061B9693 Online Fibre Channel-SW SCSI-FCP 0 16 Gb/s
I0333 50050763061BD693 Online Fibre Channel-SW SCSI-FCP 0 16 Gb/s

When planning the paths for the host systems, ensure that each host system uses a
multipathing device driver and a minimum of two host connections to two different HA cards in
different I/O enclosures on the DS8000. Preferably, they are evenly distributed between left
side (even-numbered) I/O enclosures and the right side (odd-numbered) I/O enclosures for
highest availability. Multipathing additionally optimizes workload spreading across the
available I/O ports, HA cards, and I/O enclosures.

You must tune the SAN zoning scheme to balance both the oversubscription and the
estimated total throughput for each I/O port to avoid congestion and performance bottlenecks.

4.11 Implementing and documenting a DS8000 logical configuration

For performance management and analysis reasons, it is crucial to easily relate the volumes
that belong to a specific I/O workload to the pools or ranks that provide the physical disk
spindles for servicing the workload I/O requests and that determine the I/O processing
capabilities. An overall logical configuration concept that easily relates volumes to
workloads, extent pools, and ranks is needed.

After the logical configuration is planned, you can use either the DS Storage Manager GUI or
the DSCLI to implement it on the DS8000 by completing the following steps:
1. Change the password for the default user (admin) for DS Storage Manager and DSCLI.
2. Create additional user IDs for DS Storage Manager and DSCLI.
3. Apply the DS8000 authorization keys.
4. Create arrays.
5. Create ranks.
6. Create extent pools.
7. Assign ranks to extent pools.
8. Create CKD Logical Control Units (LCUs).
9. Create CKD volumes.
10.Create CKD PAVs.
11.Create FB LUNs.
12.Create Open Systems host definitions.
13.Create Open Systems DS8000 volume groups.
14.Assign Open Systems hosts and volumes to the DS8000 volume groups.
15.Configure I/O ports.
16.Implement SAN zoning, multipathing software, and host-level striping, as needed.
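
As a rough sketch, several of these steps map to DSCLI commands as shown below. The array site,
array, rank, and I/O port IDs are placeholders; extent pool, volume, volume group, and host
connection examples are shown earlier in this chapter.

mkarray -raidtype 5 -arsite S1        # step 4: create an array from array site S1
mkrank -array A0 -stgtype fb          # step 5: create an FB rank on array A0
setioport -topology ficon I0000       # step 15: configure an I/O port for FICON
setioport -topology scsi-fcp I0030    # step 15: configure an I/O port for FCP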



After the logical configuration is created on the DS8000 storage system, it is important to
document it.

Documentation by using the GUI


You can use the DS Storage Manager GUI to export information in a spreadsheet format (that
is, save it as a comma-separated values (CSV) file). Figure 4-16 shows the export of the GUI
System Summary, and a resulting spreadsheet file that contains much detailed information,
such as cache and disks that are installed, HAs installed, licenses installed, arrays created
and extent pools (including their usage), LSSs/LCUs, all volumes created, hosts and users,
and code levels. You can also use this function for regular usage.

You can use this information with a planning spreadsheet to document the logical
configuration.

Figure 4-16 DS Storage Manager GUI - exporting System Summary (output truncated)

Documentation through self-made DSCLI scripts


The DSCLI provides a set of list (ls) and show commands, whose outputs can be
redirected and appended into a plain text or CSV file. A list of selected DSCLI commands, as
shown in Example 4-3 on page 135, can be started as a DSCLI script (by using the DSCLI
command dscli -script) to collect the logical configuration of a DS8000 Storage Image.
This output can be used as a text file or imported into a spreadsheet to document the logical
configuration.



The DSCLI script in Example 4-3 collects a small set of the DS8000 logical configuration
information, but it illustrates a simple script implementation and runs quickly within a single
DSCLI command session. Depending on the environment, you can modify this script to
include more commands to provide more information, for example, about Copy Services
configurations and source/target relationships. The DSCLI script terminates with the first
command that returns an error, which, for example, can be a simple lslcu command if no
LCUs are defined. You can adjust the output of the ls commands in a DSCLI script to meet
special formatting and delimiter requirements by using appropriate options for format, delim,
or header in the specified DS8000 profile file or selected ls commands.

Example 4-3 Example of a minimum DSCLI script get_config.dscli to gather the logical configuration
> dscli -cfg profile/DEVICE.profile -script get_config.dscli > DEVICE_SN_config.out
CMMCI9029E showrank: rank R48 does not exist.

> cat get_config.dscli


ver -l
lssu -l
lssi -l

lsarraysite -l
lsarray -l
lsrank -l
lsextpool -l

lsaddressgrp
lslss # Use only if FB volumes have been configured
#lslcu # Use only if CKD volumes and LCUs have been configured
# otherwise the command returns an error and the script terminates.
lsioport -l
lshostconnect
lsvolgrp

lsfbvol -l # Use only if FB volumes have been configured


#lsckdvol -l # Use only if CKD volumes have been configured
# otherwise the command returns an error and the script terminates.
showrank R0 # Modify this list of showrank commands so that
showrank R1 # the showrank command is run on all available ranks!
showrank R2 # Note that an error is returned if the specified rank is not
showrank R3 # present. The script terminates on the first non-existing rank.
... # Check for gaps in the rank ID sequence.
showrank R192



The script in Example 4-3 on page 135 is focused on providing the relationship between
volumes, ranks, and hosts, and can easily be used on different DS8000 storage systems
without modification or the need to consider the particular storage image ID of the DS8000
storage system. However, you can further enhance the script by adding the commands that
are shown in Example 4-4 to include hardware-specific information about the DS8000 storage
system, which is helpful when performing a deeper performance analysis.

Example 4-4 Additional DSCLI commands to include DS8000 machine-specific information


showsu
showsi
lskey
lsddm
lsda
lsframe
lshba
lsstgencl
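
The showrank section at the end of Example 4-3 on page 135 must be maintained by hand, and the script stops at the first rank ID that does not exist. One way to avoid this, sketched below, is to generate the showrank lines from the ranks that actually exist before the script is run. The sketch assumes a UNIX-like shell with grep and awk available and the same DSCLI profile name that is used in Example 4-3; adapt the names to your environment.

# Build the showrank section of get_config.dscli from the ranks that exist.
# The grep keeps only the rank IDs (R0, R1, ...) in case headers are enabled in the profile.
dscli -cfg profile/DEVICE.profile lsrank -s | grep '^R' > ranks.txt
awk '{print "showrank " $1}' ranks.txt >> get_config.dscli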

DS8000 Query and Reporting tool (DS8QTOOL)


This section introduces the DS8000 Query and Reporting tool (DS8QTOOL) for storage system configuration reporting. Maintaining up-to-date configuration documentation is a time-consuming task. The function of the scripts in “Documentation through self-made DSCLI scripts” on page 134, plus several other commands that document the machine state (for example, with regard to volumes in replication relationships), is readily available and integrated into DS8QTOOL.

To help automate the process of gathering this data, VBScript and Excel macro programs were written and combined to provide a quick and easy-to-use interface to DS8000 storage servers. The configuration data is collected through DSCLI and passed to an Excel macro that generates a summary workbook with detailed configuration information.

The tool is started from a desktop icon on a Microsoft Windows system. A VBScript program
is included to create the icon with a link to the first Excel macro that displays a DS8000
Hardware Management Console (HMC) selection window to start the query process.

The DS8QTOOL uses non-intrusive list and show commands to query and report on system
configurations. The design point of the programs is to automate a repeatable process of
creating configuration documentation for a specific DS8000 storage system.

You can obtain this tool, along with other tools, such as DS8CAPGEN, from this IBM website:
http://congsa.ibm.com/~dlutz/



Figure 4-17 and Figure 4-18 show two DS8QTOOL samples. Look at the tabs at the bottom.
This tool collects a significant amount of DS8000 configuration information to help you track
the changes in your machine over time.

Figure 4-17 DS8QTOOL Summary tab

Figure 4-18 DS8QTOOL array tab



Part 2. IBM System Storage DS8000 performance management
This part starts with a brief introduction of different workload types and introduces various
tools for effective performance planning, monitoring, and management on the DS8000
storage system.

This part includes the following topics:


򐂰 Understanding your workload
򐂰 Performance planning tools
򐂰 Practical performance management




Chapter 5. Understanding your workload


This chapter presents and describes the various workload types that an application can
generate. This characterization can be useful for understanding performance documents and
reports, and categorizing the various workloads in your installation.

The information in this chapter is not specific to the IBM System Storage DS8000. You can apply it generally to other disk storage systems.

This chapter includes the following topics:


򐂰 General workload types
򐂰 Database workload
򐂰 Application workload
򐂰 Profiling workloads in the design phase
򐂰 Understanding your workload type



5.1 General workload types
A correct understanding of the existing or planned workload is the key element of the entire planning and sizing process. Understanding the workload means being able to describe the workload pattern:
򐂰 Expected or existing number of IOPS
򐂰 Size of the I/O requests
򐂰 Read and write ratio
򐂰 Random and sequential access ratio
򐂰 General purpose of the application and the workload

You might also collect the following information:


򐂰 Expected cache hit ratio
The percentage of requests that are serviced from cache. This number is important for read requests because write requests always go to cache first. If 100 out of 1000 requests are serviced from cache, you have a 10% cache hit ratio. The higher this value, the lower the overall response time.
򐂰 Expected seek ratio
The percentage of I/O requests for which the disk arm must move from its current location. Moving the disk arm takes much more time than waiting for the disk rotation, which happens anyway and is fast enough. If the arm does not have to move, a whole track or cylinder can be read, which generally means a large amount of data. This parameter reflects how disk systems worked a long time ago and now serves mainly as an indicator of how random the workload is: a purely random workload shows a value close to 100%. It is not applicable to flash drives, which have no disk arms by design.
򐂰 Expected write efficiency
The write efficiency represents the number of times a block is written to before it is destaged to disk. Real applications, especially databases, update data repeatedly: they write it, read it again, change it, and write it back. So, the data for a single disk block can be served several times from cache before it is written to disk. A value of 0% means that a destage is assumed for every write operation, which is characteristic of a pure random small-block write workload pattern and is unlikely. A value of 50% means that a destage occurs after the track is written to twice. A value of 100% is also unlikely because it means that writes go to one track only and are never destaged to disk.

In general, you describe the workload in these terms. The following sections cover the details
and describe the different workload types.

5.1.1 Typical online transaction processing workload


This workload is characterized by mostly the random access of small-sized I/O records (less
than or equal to 16 KB) with a mix of 70% reads and 30% writes. This workload is also
characterized by low read-hit ratios in the disk system cache (less than 20%). This workload
might be representative of various online applications, for example, the SAP NetWeaver
application or many database applications. This type of workload is typical because it is the
basis for most of the benchmark and performance tests. However, in actuality, the following
online transaction processing (OLTP) patterns are spread:
򐂰 90% read, 8 - 16 - 32 - 64 KB blocks, 30% sequential, 30 - 40% cache hit
򐂰 80% read, 16 - 32 - 64 KB blocks, 40 - 50% sequential, 50 - 60% cache hit

A 100% random access workload is rare, which you must remember when you size the disk
system.



5.1.2 Microsoft Exchange Server workload
This type of workload can be similar to an OLTP workload, but it has a different read/write balance. It is characterized by many write operations, up to 60%, with a high degree of randomness. Also, the I/O size can be large, with blocks up to 128 KB, which is explained by the nature of the application: it acts as both a database and a data warehouse. Size this type of workload by using the Microsoft tools and guidance. To better understand the workload and storage configurations, read the documents that are found at the following website:
http://technet.microsoft.com/en-us/exchange/dd203064

5.1.3 Sequential workload


Sequential access is one of the original workload types for data storage. Tape is the best example: a tape device reads several large blocks in one operation and relies on buffering. The principle does not change for disks. A sequential workload lends itself to prefetching data into cache because blocks are accessed one after another, so the disk system can read many blocks ahead and unload the disks. Sequential write requests also work well because the disk system can optimize access to the disks and write several tracks or cylinders at a time. Blocksizes are typically 256 KB or more and response times are high, but that does not matter: a sequential workload is about bandwidth, not response time. The following environments and applications are likely sequential workloads:
򐂰 Backup/restore applications
򐂰 Database log files
򐂰 Batch processing
򐂰 File servers
򐂰 Web servers
򐂰 Media streaming applications
򐂰 Graphical software

Important: Because of large block access and high response times, physically separate
sequential workloads from random small-block workloads. Do not mix random and
sequential workloads on the same physical disk. If you do, large amounts of cache are
required on the disk system. Typically, high response times with small-block random access indicate the presence of sequential write activity (foreground or background) on the same disks.

5.1.4 Batch jobs workload


Batch workloads have several common characteristics:
򐂰 A mixture of random database access, skip-sequential access, pure sequential access, and sorting
򐂰 Large blocksizes, up to 128 - 256 KB
򐂰 A high volume of both write and read activity
򐂰 The same volume extents might be accessed for read and write at the same time
򐂰 Large data transfers and high path utilizations
򐂰 Batch workloads are often constrained to a particular window of time when online operation is restricted or shut down. Poor or improved performance is often not recognized unless it affects this window.

Plan when to run batch jobs: Plan all batch workload activity for the end of the day or a slower time of day. Normal activity can be negatively affected by the batch activity.



5.1.5 Sort jobs workload
Most sorting applications, such as z/OS DFSORT, are characterized by large transfers for
input, output, and work data sets.

For more information about z/OS DFSORT, see the following websites:
򐂰 http://www.ibm.com/support/docview.wss?rs=114&uid=isg3T7000077
򐂰 http://www.ibm.com/support/knowledgecenter/api/redirect/zos/v1r11/index.jsp?topic=/com.ibm.zos.r11.iceg200/abstract.htm

5.1.6 Read-intensive cache friendly and unfriendly workloads


Use the cache hit ratio to determine whether a read workload is cache friendly. If the ratio is more than 50%, the workload is cache friendly. For example, if one of two I/Os to the same data is serviced from cache, that is a 50% cache hit ratio.

It is not easy to divide known workload types into cache friendly and cache unfriendly. An application can change its behavior several times during the day. When users work with data interactively, the workload is cache friendly. When batch processing or reporting starts, it is not. A high degree of random access usually means a cache-unfriendly workload. However, if the amount of data that is accessed randomly is not large, 10% for example, it can be placed entirely into the disk system cache and becomes cache friendly.

Sequential workloads are always cache friendly because of the prefetch algorithms in the disk system. A sequential workload is easy to prefetch: you know that the next 10 or 100 blocks are definitely accessed, so you can read them in advance. Random workloads are different. However, there are no purely random workloads in real applications, so some access patterns can still be predicted. The DS8000 storage systems use the following powerful read-caching algorithms to deal with cache-unfriendly workloads:
򐂰 Sequential Prefetching in Adaptive Replacement Cache (SARC)
򐂰 Adaptive Multi-stream Prefetching (AMP)

Write workloads are always cache friendly because every write request goes to cache first and the application gets its reply when the request is placed into cache. On the back end, write requests take at least twice as long to service as read requests, and the write acknowledgment must always be waited for, which is why cache is used for every write request. However, further improvement is possible: the DS8000 storage systems use the Intelligent Write Caching (IWC) algorithm, which makes the handling of write requests more efficient.

To learn more about the DS8000 caching algorithms, see the following website:
http://www.redbooks.ibm.com/abstracts/sg248323.html

Table 5-1 on page 145 provides a summary of the characteristics of the various types of
workloads.



Table 5-1 Workload types

Sequential read
  Characteristics: Sequential 128 - 1024 KB blocks; read hit ratio 100%
  Representative of: Database backups, batch processing, media streaming

Sequential write
  Characteristics: Sequential 128 - 1024 KB blocks; write ratio 100%
  Representative of: Database restores and loads, batch processing, file access

z/OS cache uniform
  Characteristics: Random 4 KB record; R/W ratio 3.4; read hit ratio 84%
  Representative of: Average database, CICS/VSAM, IBM IMS™

z/OS cache standard
  Characteristics: Random 4 KB record; R/W ratio 3.0; read hit ratio 78%
  Representative of: Typical database conditions

z/OS cache friendly
  Characteristics: Random 4 KB record; R/W ratio 5.0; read hit ratio 92%
  Representative of: Interactive workloads, existing software

z/OS cache hostile
  Characteristics: Random 4 KB record; R/W ratio 2.0; read hit ratio 40%
  Representative of: DB2 logging

Open read-intensive
  Characteristics: Random 4 - 8 - 16 - 32 KB record; read 70 - 90%; hit ratio 28%, 1%
  Representative of: Databases (Oracle and DB2), large DB inquiry, decision support, warehousing

Open standard
  Characteristics: Random 4 - 8 - 16 - 32 KB record; read 70%; hit ratio 50%
  Representative of: OLTP, General Parallel File System (IBM GPFS™)

5.2 Database workload


A database workload is not determined by the database itself. It depends on the application that is written for the database and the type of work that this application performs. The same database can carry an OLTP workload or a data warehousing workload. References to a database in this section mostly mean DB2 and Oracle databases, but the discussion also applies to other databases. For more information, see Chapter 17, “Databases for open performance” on page 513 and Chapter 18, “Database for IBM z/OS performance” on page 531.

The database environment is often difficult to typify because I/O characteristics differ greatly.
A database query has a high read content and is of a sequential nature. It also can be
random, depending on the query type and data structure. Transaction environments are more
random in behavior and are sometimes cache-unfriendly. At other times, they have good hit
ratios. You can implement several enhancements in databases, such as sequential prefetch
and the exploitation of I/O priority queuing, that affect the I/O characteristics. Users must
understand the unique characteristics of their database capabilities before generalizing the
performance.



5.2.1 Database query workload
Database queries are a common type of database workload. This term includes transaction processing, which is typically random, and sequential data reading, writing, and updating. A query can have the following properties:
򐂰 High read content
򐂰 Mix of write and read content
򐂰 Random access, and sequential access
򐂰 Small or large transfer size

A well-tuned database keeps the characteristics of its queries closer to a sequential read workload. A database can use all available caching algorithms, both its own and those of the disk system. This caching of the data that is most likely to be accessed provides performance improvements for most database queries.

5.2.2 Database logging workload


The logging system is an important part of a database. It is the main component that preserves data integrity and provides the transaction mechanism. There are several types of log files in a database:
򐂰 Online transaction logs: This type of log is used to restore the last condition of the
database when the latest transaction failed. A transaction can be complex and require
several steps to complete. Each step means changes in the data. Because data in the
database must be in a consistent state, an incomplete transaction must be rolled back to
the initial state of the data. Online transaction logs have a rotation mechanism that creates
and uses several small files for about an hour each.
򐂰 Archive transaction logs: This type is used to restore a database state up to the specified
date. Typically, it is used with incremental or differential backups. For example, if you
identify a data error that occurred a couple of days ago, you can restore the data back to
its prior condition with only the archive logs. This type of log uses a rotation mechanism
also.

The workload pattern for logging is mostly sequential writes. The blocksize is about 64 KB. Reads are rare and usually do not need to be considered. The write performance and placement of the online transaction logs are most important: the overall performance of the database depends on the writes to the online transaction logs. If you expect high write rates to the database, plan to place the online transaction logs on RAID 10 arrays. Also, as a preferred practice, physically separate the log files from the disks that hold the data and index files. For more information, see Chapter 17, “Databases for open performance” on page 513.

5.2.3 Database transaction environment workload


Database transaction workloads have these characteristics:
򐂰 Low to moderate read hits, depending on the size of the database buffers
򐂰 Cache unfriendly for certain applications
򐂰 Deferred writes that cause low write-hit ratios, which means that cached write data is
rarely required for reading
򐂰 Deferred write chains with multiple locate-record commands in chain
򐂰 Low read/write ratio because of reads that are satisfied in a large database buffer pool
򐂰 High random-read access values, which are cache unfriendly



The enhanced prefetch cache algorithms, together with the high storage back-end bandwidth,
provide high system throughput and high transaction rates for database transaction-based
workloads.

A database can benefit from using a large amount of server memory for the large buffer pool.
For example, the database large buffer pool, when managed correctly, can avoid a large
percentage of the accesses to disk. Depending on the application and the size of the buffer
pool, this large buffer pool can convert poor cache hit ratios into synchronous reads in DB2.
You can spread data across several RAID arrays to increase the throughput even if all
accesses are read misses. DB2 administrators often require that table spaces and their
indexes are placed on separate volumes. This configuration improves both availability and
performance.

5.2.4 Database utilities workload


Database utilities, such as loads, reorganizations, copies, and recovers, generate high read
and write sequential and sometimes random operations. This type of workload takes
advantage of the sequential bandwidth performance of the back-end storage connection,
such as the PCI-Express bus for the device adapter (DA) pairs, and the use of higher RPM
(15 K) drives with flash drives and Easy Tier automatic mode enabled.

5.3 Application workload


This section categorizes various types of common applications according to their I/O
behavior. There are four typical categories:
򐂰 Need for high throughput. These applications need more bandwidth (the more, the better).
Transfers are large, read-only I/Os that are sequential access. These applications use
database management systems (DBMSs); however, random DBMS access might also
exist.
򐂰 Need for high throughput and a mix of read/write (R/W), similar to the first category (large
transfer sizes). In addition to 100% read operations, this category mixes reads and writes
in 70/30 and 50/50 ratios. The DBMS is typically sequential, but random and 100% write
operations also exist.
򐂰 Need for high I/O rate and throughput. This category requires both performance
characteristics of IOPS and megabytes per second (MBps). Depending on the application,
the profile is typically sequential access, medium to large transfer sizes (16 KB, 32 KB,
and 64 KB), and 100/0, 0/100, and 50/50 R/W ratios.
򐂰 Need for high I/O rate. With many users and applications that run simultaneously, this
category can consist of a combination of small to medium-sized transfers (4 KB, 8 KB,
16 KB, and 32 KB), 50/50 and 70/30 R/W ratios, and a random DBMS.

Synchronous activities: Certain applications have synchronous activities, such as


locking database tables during an online backup, or logging activities. These types of
applications are highly sensitive to any increase in disk response time and must be
handled with care.



Table 5-2 summarizes these workload categories and common applications.

Table 5-2 Application workload types

General file serving (category 4)
  Read/write ratio: Expect 50/50, mostly because of good file system caching
  I/O size: 64 - 256 KB
  Access type: Sequential

Online transaction processing (category 4)
  Read/write ratio: 50/50, 70/30
  I/O size: 4 KB, 8 KB, or 16 KB
  Access type: Mostly random for writes and reads; bad cache hits

Batch update (category 4)
  Read/write ratio: Expect 50/50
  I/O size: 16 KB, 32 KB, 64 KB, or 128 KB
  Access type: Almost 50/50 mix of sequential and random; moderate cache hits

Data mining (category 1)
  Read/write ratio: 90/10
  I/O size: 32 KB, 64 KB, or larger
  Access type: Mainly sequential, some random

Video on demand (category 1)
  Read/write ratio: 100/0
  I/O size: 256 KB and larger
  Access type: Sequential, good caching

Data warehousing (category 2)
  Read/write ratio: 90/10, 70/30, or 50/50
  I/O size: 64 KB or larger
  Access type: Mainly sequential, rarely random, good caching

Engineering and scientific (category 2)
  Read/write ratio: 100/0, 0/100, 70/30, or 50/50
  I/O size: 64 KB or larger
  Access type: Mostly sequential, good caching

Digital video editing (category 3)
  Read/write ratio: 100/0, 0/100, or 50/50
  I/O size: 128 KB, 256 - 1024 KB
  Access type: Sequential, good caching

Image processing (category 3)
  Read/write ratio: 100/0, 0/100, or 50/50
  I/O size: 64 KB, 128 KB
  Access type: Sequential, good caching

Backup, restore (category 1)
  Read/write ratio: 100/0, 0/100
  I/O size: 256 - 1024 KB
  Access type: Sequential, good caching

5.3.1 General file serving


This application type consists of many users who run many different applications, all with
varying file access sizes and mixtures of read/write ratios, all occurring simultaneously.
Applications can include file server, LAN storage, disk arrays, and even internet/intranet
servers. There is no standard profile other than the “chaos” principle of file access. General
file serving fits this application type because this profile covers almost all transfer sizes and
R/W ratios.



5.3.2 Online transaction processing
This application category typically has many users, all accessing the same disk storage
system and a common set of files. The file access typically is under the control of a DBMS,
and each user might work on the same or unrelated activities. The I/O requests are typically
spread across many files; therefore, the file sizes are typically small and randomly accessed.
A typical application consists of a network file server or a disk system that is accessed by a
sales department that enters order information.

5.3.3 Data mining


Databases are the repository of most data, and every time that information is needed, a
database is accessed. Data mining is the process of extracting valid, previously unknown,
and ultimately comprehensive information from large databases to make crucial business
decisions. This application category consists of a number of operations, each of which is
supported by various techniques, such as rule induction, neural networks, conceptual
clustering, and association discovery. In these applications, the DBMS extracts only large
sequential or possibly random files, depending on the DBMS access algorithms.

5.3.4 Video on demand


Video on demand consists of video playback that can be used to broadcast quality video for
either satellite transmission or a commercial application, such as in-room movies. Fortunately
for the storage industry, the data rates that are needed for this type of transfer are reduced
dramatically because of data compression developments. A broadcast quality video stream,
for example, full HD video, now needs about 4 - 5 Mbps bandwidth to serve a single user.
These advancements reduce the need for higher speed interfaces and can be serviced with
the current interface. However, these applications demand numerous concurrent users that
interactively access multiple files within the same storage system. This requirement changed the environment of video applications because a storage system is now specified by the number of video streams that it can service simultaneously. In this application, the DBMS extracts only large sequential files.

5.3.5 Data warehousing


A data warehouse supports information processing by providing a solid platform of integrated,
historical data from which to perform analysis. A data warehouse organizes and stores the
data that is needed for informational and analytical processing over a long historical period. A
data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of
data that is used to support the management decision-making process. A data warehouse is
always a physically separate store of data that spans a spectrum of time, and many
relationships exist in the data warehouse.

An example of a data warehouse is a design around a financial institution and its functions,
such as loans, savings, bank cards, and trusts for a financial institution. In this application,
there are three kinds of operations: initial loading of the data, access to the data, and
updating of the data. However, because of the fundamental characteristics of a warehouse,
these operations can occur simultaneously. At times, this application can perform 100% reads
when accessing the warehouse, 70% reads and 30% writes when accessing data while
record updating occurs simultaneously, or even 50% reads and 50% writes when the user
load is heavy. The data within the warehouse is a series of snapshots and after the snapshot
of data is made, the data in the warehouse does not change. Therefore, there is typically a
higher read ratio when using the data warehouse.



5.3.6 Engineering and scientific applications
The engineering and scientific arena includes hundreds of applications. Typical applications
are computer-assisted design (CAD), Finite Element Analysis, simulations and modeling, and
large-scale physics applications. Transfers can consist of 1 GB of data for 16 users. Other
transfers might require 20 GB of data and hundreds of users. The engineering and scientific
areas of business are more concerned with the manipulation of spatial data and series data.
This application typically goes beyond standard relational DBMSs, which manipulate only flat
(two-dimensional) data. Spatial or multi-dimensional issues and the ability to handle complex
data types are commonplace in engineering and scientific applications.

Object-Relational DBMSs (ORDBMSs) are being developed, and they offer traditional
relational DBMS features and support complex data types. Objects can be stored and
manipulated, and complex queries at the database level can be run. Object data is data about
real objects, including information about their location, geometry, and topology. Location
describes their position, geometry relates to their shape, and topology includes their
relationship to other objects. These applications essentially have an identical profile to that of
the data warehouse application.

5.3.7 Digital video editing


Digital video editing is popular in the movie industry. The idea that a film editor can load entire
feature films onto disk storage and interactively edit and immediately replay the edited clips
has become a reality. This application combines the ability to store huge volumes of digital
audio and video data onto relatively affordable storage devices to process a feature film.

Depending on the host and operating system that are used to perform this application,
transfers are typically medium to large and access is always sequential. Image processing
consists of moving huge image files for editing. In these applications, the user regularly
moves huge high-resolution images between the storage device and the host system. These
applications service many desktop publishing and workstation applications. Editing sessions
can include loading large files of up to 16 MB into host memory, where users edit, render,
modify, and store data onto the storage system. High interface transfer rates are needed for
these applications, or the users waste huge amounts of time by waiting to see results. If the
interface can move data to and from the storage device at over 32 MBps, an entire 16 MB
image can be stored and retrieved in less than 1 second. The need for throughput is all
important to these applications and along with the additional load of many users, I/O
operations per second (IOPS) are also a major requirement.

5.4 Profiling workloads in the design phase


Assessing the I/O profile before you build and deploy the application requires methods of
evaluating the workload profile without measurement data. In these cases, as a preferred
practice, use a combination of general rules based on application type and the development
of an application I/O profile by the application architect or the performance architect. The
following examples are basic examples that are designed to provide an idea of how to
approach workload profiling in the design phase.

For general rules for application types, see Table 5-1 on page 145.



The following requirements apply to developing an application I/O profile:
򐂰 User population
Determining the user population requires understanding the total number of potential
users, which for an online banking application might represent the total number of
customers. From this total population, you must derive the active population that
represents the average number of persons that use the application at any specific time,
which is derived from experiences with other similar applications.
In Table 5-3, you use 1% of the total population. From the average population, you can
estimate the peak. The peak workload is some multiplier of the average and is typically
derived based on experience with similar applications. In this example, we use a multiple
of 3.
Table 5-3 User population
Total potential users Average active users Peak active users

50000 500 1500

򐂰 Transaction distribution
Table 5-4 breaks down the number of times that key application transactions are run by
the average user and how much I/O is generated per transaction. Detailed application and
database knowledge is required to identify the number of I/Os and the type of I/Os per
transaction. The following information is a sample.

Table 5-4 Transaction distribution


Transaction Iterations per user I/Os I/O type

Look up savings account 1 4 Random read

Look up checking account 1 4 Random read

Transfer money to checking 0.5 4 reads/4 writes Random read/write

Configure new bill payee 0.5 4 reads/4 writes Random read/write

Submit payment 1 4 writes Random write

Look up payment history 1 24 reads Random read

򐂰 Logical I/O profile


An I/O profile is created by combining the user population and the transaction distribution.
Table 5-5 provides an example of a logical I/O profile.

Table 5-5 Logical I/O profile from user population and transaction profiles
(RR = random read, RW = random write)

Transaction                  Iterations per user  I/Os              I/O type  Average user I/Os  Peak user I/Os
Look up savings account      1                    4                 RR        2000               6000
Look up checking account     1                    4                 RR        2000               6000
Transfer money to checking   0.5                  4 reads/4 writes  RR, RW    1000 R, 1000 W     3000 R/W
Configure new bill payee     0.5                  4 reads/4 writes  RR, RW    1000 R, 1000 W     3000 R/W
Submit payment               1                    4 writes          RW        2000               6000 R/W
Look up payment history      1                    24 reads          RR        12000              36000

򐂰 Physical I/O profile


The physical I/O profile is based on the logical I/O with the assumption that the database
provides cache hits to 90% of the read I/Os. All write I/Os are assumed to require a
physical I/O. This physical I/O profile results in a read miss ratio of (1 - 0.9) = 0.1 or 10%.
Table 5-6 is an example, and every application has different characteristics.
Table 5-6 Physical I/O profile

Transaction                  Average user logical I/Os  Average active users physical I/Os  Peak active users physical I/Os
Look up savings account      2000                       200 RR                              600 RR
Look up checking account     2000                       200 RR                              600 RR
Transfer money to checking   1000 R, 1000 W             100 RR, 1000 RW                     300 RR, 3000 RW
Configure new bill payee     1000 R, 1000 W             100 RR, 1000 RW                     300 RR, 3000 RW
Submit payment               2000                       200 RR                              600 RR
Look up payment history      12000                      1200 SR                             3600 RR
Totals                       20000 R, 2000 W            2000 RR, 2000 RW                    6000 RR, 6000 RW

As you can see in Table 5-6, to meet the peak workloads, you must design an I/O
subsystem to support 6000 random reads/sec and 6000 random writes/sec:
Physical I/Os The number of physical I/Os per second from the host perspective
RR Random Read I/Os
RW Random Write I/Os
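
The arithmetic behind these totals is simple enough to script. The following sketch is illustrative only; it assumes the figures that are used in the tables above (a peak factor of 3 and a 90% database read-hit ratio) and runs in any POSIX shell.

# Illustrative only: derive the physical I/O requirements from the logical profile.
READ_HIT=90              # percentage of logical reads that the database buffer pool satisfies
PEAK_FACTOR=3            # peak workload = 3 x average
LOGICAL_READS=20000      # total logical reads for the average active user population
LOGICAL_WRITES=2000      # total logical writes for the average active user population

PHYS_READS=$(( LOGICAL_READS * (100 - READ_HIT) / 100 ))   # only read misses reach the disk system
PHYS_WRITES=$LOGICAL_WRITES                                # every write is eventually destaged
echo "Average: $PHYS_READS random reads/sec, $PHYS_WRITES random writes/sec"
echo "Peak:    $(( PHYS_READS * PEAK_FACTOR )) random reads/sec, $(( PHYS_WRITES * PEAK_FACTOR )) random writes/sec"

Run as shown, the sketch prints 2000/2000 for the average case and 6000/6000 for the peak case, which matches the totals in Table 5-6.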

To determine the appropriate configuration to support your unique workload, see Appendix A,
“Performance management process” on page 551.

5.5 Understanding your workload type


To understand the workload, you need performance data from both the operating system and the disk system. A combined analysis of these two sets of performance data gives you the entire picture and helps you understand your workload. Analyzing them separately might not be accurate. This section describes various performance monitoring tools.



5.5.1 Monitoring the DS8000 workload
The following performance monitoring tools are available for the IBM System Storage DS8880 and DS8870 storage systems:
򐂰 DS8880 and DS8870 storage systems performance reporting capability
With the DS8000 command-line interface (DSCLI), you can retrieve performance data from a DS8880 or DS8870 storage system and display it in a window or write it to a file. Use the lsperfgrprpt and lsperfrescrpt commands, which provide reports for a performance group or a specified resource. See Example 5-1.

Example 5-1 Output of the performance reporting command


dscli> lsperfgrprpt pg0
time grp resrc avIO avMB avresp pri avQ tgtQ %hlpT %dlyT %impt
===========================================================================================================
2015-11-11/14:00:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 1421 51.746 6.250 0 88 0 0 0 0
2015-11-11/14:05:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 2851 103.322 15.695 0 21 0 0 0 0
2015-11-11/14:10:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 3957 108.900 4.649 0 38 0 0 0 0
2015-11-11/14:15:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 708 20.079 0.206 0 69 0 0 0 0
2015-11-11/14:20:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 372 12.604 10.739 0 32 0 0 0 0
2015-11-11/14:25:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 490 17.082 9.948 0 61 0 0 0 0
2015-11-11/14:30:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 57 0.294 0.177 0 112 0 0 0 0
2015-11-11/14:35:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 257 9.832 31.029 0 27 0 0 0 0
2015-11-11/14:40:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 229 7.875 0.711 0 89 0 0 0 0
2015-11-11/14:45:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 187 7.668 35.725 0 19 0 0 0 0
2015-11-11/14:50:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 796 25.098 0.521 0 87 0 0 0 0
2015-11-11/14:55:00 IBM.2107-75TV181/PG0 IBM.2107-75TV181 8313 421.517 2.684 0 15 0 0 0 0

It is also possible to get the performance data for the DA pair or the rank. See
Example 5-2.

Example 5-2 Output of the performance data for 20 hours for rank 17 (output truncated)
dscli> lsperfrescrpt -start 20h r17
time resrc avIO avMB avresp %Hutl %hlpT %dlyT %impt
===================================================================
2015-11-11/09:15:00 R17 0 0.000 0.000 0 0 0 0
2015-11-11/09:20:00 R17 0 0.000 0.000 0 0 0 0
2015-11-11/09:25:00 R17 0 0.000 0.000 0 0 0 0
2015-11-11/09:30:00 R17 0 0.000 0.000 0 0 0 0
2015-11-11/09:35:00 R17 0 0.000 0.000 0 0 0 0
2015-11-11/09:40:00 R17 0 0.000 0.000 0 0 0 0
2015-11-11/09:45:00 R17 0 0.000 0.000 0 0 0 0
2015-11-11/09:50:00 R17 0 0.000 0.000 0 0 0 0
2015-11-11/09:55:00 R17 0 0.000 0.000 0 0 0 0
2015-11-11/10:00:00 R17 0 0.000 0.000 0 0 0 0

By default, statistics are shown for the last hour. You can specify a different reporting period in days, hours, or minutes.
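
Because each report covers a limited time range, you might want to collect these reports periodically and append them to a file for later analysis. The following sketch shows one possible approach; it assumes a UNIX-like shell and the DSCLI profile name that is used earlier in this book, and it uses only the -start option that is shown in Example 5-2.

# Append the last 24 hours of rank statistics for every rank to a daily file.
# Note: each dscli call opens its own session; for many ranks, a DSCLI script is faster.
OUT=rank_perf_$(date +%Y%m%d).txt
for rank in $(dscli -cfg profile/DEVICE.profile lsrank -s | grep '^R'); do
    dscli -cfg profile/DEVICE.profile lsperfrescrpt -start 24h $rank >> $OUT
done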
򐂰 IBM Spectrum Control™
IBM Spectrum Control is the tool to use to monitor the workload on your DS8000 storage system over a long period and to collect historical data. This tool can also create reports and provide alerts.
For more information, see 7.2.1, “IBM Spectrum Control overview” on page 223.



5.5.2 Monitoring the host workload
The following sections list the host-based performance measurement and reporting tools
under the UNIX, Linux, Windows, IBM i, and z/OS environments.

Open Systems servers


This section lists the most common tools that are available on Open Systems servers to
monitor the workload.

UNIX and Linux Open Systems servers


To get host information about I/O subsystems, processor activities, virtual memory, and the
use of the physical memory, use the following common UNIX and Linux commands:
򐂰 iostat
򐂰 vmstat
򐂰 sar
򐂰 nmon
򐂰 topas
򐂰 filemon

These commands are standard tools that are available with most UNIX and UNIX-like (Linux)
systems. Use iostat to obtain the data that you need to evaluate your host I/O levels. Specific
monitoring tools are also available for AIX, Linux, Hewlett-Packard UNIX (HP-UX), and Oracle
Solaris.
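
For example, the following iostat invocations are common starting points. The exact flags differ by platform, so treat them as a sketch and check the man page on your system.

iostat -x 60 5     # Linux or Solaris: extended per-device statistics, 60-second intervals, 5 samples
iostat -D 60 5     # AIX: extended disk statistics for the same interval and count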

For more information, see Chapter 9, “Performance considerations for UNIX servers” on
page 285 and Chapter 12, “Performance considerations for Linux” on page 385.

Microsoft Windows servers


Common Microsoft Windows Server monitoring tools include the Windows Performance
Monitor (perfmon). Performance Monitor has the flexibility to customize the monitoring to
capture various categories of Windows server system resources, including processor and
memory. You can also monitor disk I/O by using perfmon.
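
If you prefer the command line, the typeperf utility can log the same counters that perfmon displays. The following invocation is only a sketch; the PhysicalDisk counter names shown are the standard English ones and might differ on localized systems.

typeperf "\PhysicalDisk(_Total)\Disk Transfers/sec" "\PhysicalDisk(_Total)\Avg. Disk sec/Read" "\PhysicalDisk(_Total)\Avg. Disk sec/Write" -si 30 -sc 120 -o diskperf.csv

This command writes 120 samples at 30-second intervals to a comma-separated file that can be opened in a spreadsheet.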

For more information, see Chapter 10, “Performance considerations for Microsoft Windows
servers” on page 335.

IBM i environment
IBM i provides a vast selection of performance tools that can be used in performance-related cases with external storage. Several of the tools, such as Collection Services, are integrated in the IBM i system. Other tools are part of an IBM i licensed product. The management of many IBM i performance tools is integrated into the IBM i web graphical user interface, IBM System Director Navigator for i, or into the iDoctor product.

The IBM i tools, such as Performance Explorer and iDoctor, are used to analyze the hot data
in IBM i and to size solid-state drives (SSDs) for this environment. Other tools, such as Job
Watcher, are used mostly in solving performance problems, together with the tools for
monitoring the DS8000 storage system.

For more information about the IBM i tools and their usage, see 13.6.1, “IBM i performance
tools” on page 443.



z Systems environment
The z/OS systems have proven performance monitoring and management tools that are
available to use for performance analysis. Resource Measurement Facility (RMF), a z/OS
performance tool, collects performance data and reports it for the wanted interval. It also
provides cache reports. The cache reports are similar to the disk-to-cache and cache-to-disk
reports that are available in the Tivoli Storage Productivity Center for Disk, except that the
RMF cache reports are in text format. RMF collects the performance statistics of the DS8000
storage system that are related to the link or port and also to the rank and extent pool. The
REPORTS(ESS) parameter in the RMF report generator produces the reports that are related to
those resources.

The RMF Spreadsheet Reporter is an easy way to create Microsoft Excel charts based on RMF postprocessor reports. It is used to convert your RMF data to spreadsheet format and to generate representative charts for all performance-relevant areas.

For more information, see Chapter 14, “Performance considerations for IBM z Systems
servers” on page 459.

5.5.3 Modeling the workload and sizing the system


Workload modeling is used to predict the behavior of the system under the workload to
identify the limits and potential bottlenecks, and to model the growth of the system and plan
for the future.

IBM specialists and IBM Business Partner specialists use the IBM Disk Magic tool for
modeling the workload on the systems. Disk Magic can be used to help to plan the DS8000
hardware configuration. With Disk Magic, you model the DS8000 performance when
migrating from another disk system or when making changes to an existing DS8000
configuration and the I/O workload. Disk Magic is for use with both z Systems and Open
Systems server workloads.

When running the DS8000 modeling, you start from one of these scenarios:
򐂰 An existing, non DS8000 model, which you want to migrate to a DS8000 storage system
򐂰 An existing DS8000 workload
򐂰 Modeling a planned new workload, even if you do not have the workload running on any
disk system

You can model the following major DS8000 components by using Disk Magic:
򐂰 DS8000 model: DS8300, DS8300 Turbo, DS8700, DS8886, and DS8884 models
򐂰 Cache size for the DS8000 storage system
򐂰 Number, capacity, and speed of disk drive modules (DDMs)
򐂰 Number of arrays and RAID type
򐂰 Type and number of DS8000 host adapters (HAs)
򐂰 Type and number of channels
򐂰 Remote Copy option

When working with Disk Magic, always ensure that you input accurate and representative
workload information because Disk Magic results depend on the input data that you provide.
Also, carefully estimate the future demand growth that you input to Disk Magic for modeling
projections. The hardware configuration decisions are based on these estimates.

For more information about using Disk Magic, see 6.1, “Disk Magic” on page 160.



Sizing the system for the workload
With the performance data collected or estimated and the workload model created, you can size the planned system. Systems are sized with general storage sizing tools and with application-specific tools that are provided by the application vendors. There are several tools:
򐂰 General storage sizing: Disk Magic and Capacity Magic.
򐂰 MS Exchange sizing: MS Exchange sizing tool. You can find more information about this
tool at the following websites:
– http://technet.microsoft.com/en-us/library/bb124558.aspx
– http://technet.microsoft.com/en-us/library/ff367907.aspx
򐂰 Oracle, DB2, SAP NetWeaver: IBM Techline and specific tools.

Workload testing
There are various reasons for conducting I/O load tests. They all start with a hypothesis and
have defined performance requirements. The objective of the test is to determine whether the
hypothesis is true or false. For example, a hypothesis might be that you think that a DS8884
storage system with 18 disk arrays and 128 GB of cache can support 10,000 IOPS with a
70/30/50 workload and the following response time requirements:
򐂰 Read response times: 95th percentile < 10 ms
򐂰 Write response times: 95th percentile < 5 ms

To test, complete the following generic steps:


1. Define the hypothesis.
2. Simulate the workload by using an artificial or actual workload.
3. Measure the workload.
4. Compare workload measurements with objectives.
5. If the results support your hypothesis, publish the results and make recommendations. If
the results do not support your hypothesis, determine why and make adjustments.

Microsoft Windows environment


The following example tests might be appropriate for a Windows environment:
򐂰 Pre-deployment hardware validation. Ensure that the operating system, multipathing, and
host bus adapter (HBA) drivers are at the current levels and supported. Before you deploy
any solution and especially a complex solution, such as Microsoft cluster servers, ensure
that the configuration is supported. For more information, go to the interoperability website
found at:
http://www.ibm.com/systems/support/storage/ssic/interoperability.wss
򐂰 Application-specific requirements. Often, you receive inquiries about the DS8000 storage
system and Microsoft Exchange. Use MS Exchange Jetstress and MS Exchange
Workload Generator to test and simulate the workload of MS Exchange. For more
information, go to the following websites:
– http://technet.microsoft.com/en-us/library/ff706601.aspx
– https://www.microsoft.com/en-us/download/details.aspx?id=36849

A universal workload generator and benchmark tool is Iometer (http://www.iometer.org). Iometer is both a workload generator (it performs I/O operations
to stress the system) and a measurement tool (it examines and records the performance of its
I/O operations and their effect on the system). It can be configured to emulate the disk or
network I/O load of any program or benchmark, or it can be used to generate entirely
synthetic I/O loads. It can generate and measure loads on single or multiple (networked)
systems.



Iometer can be used for the following measurements and characterizations:
򐂰 Performance of disk and network controllers
򐂰 Bandwidth and latency capabilities of buses
򐂰 Network throughput to attached drives
򐂰 Shared bus performance
򐂰 System-level hard disk drive performance
򐂰 System-level network performance

You can use Iometer to configure these settings:


򐂰 Read/write ratios
򐂰 Sequential/random
򐂰 Arrival rate and queue depth
򐂰 Blocksize
򐂰 Number of concurrent streams

With these configuration settings, you can simulate and test most types of workloads. Specify
the workload characteristics to reflect the workload in your environment.

UNIX and Linux environment


The UNIX and Linux dd command is a great tool to drive sequential read workloads or
sequential write workloads against the DS8000 storage system.

This section describes how to perform these tasks:


򐂰 Determine the sequential read speed that an individual vpath (logical unit number (LUN))
can provide in your environment.
򐂰 Measure sequential read and write speeds for file systems.

To test the sequential read speed of a rank, run the following command:
time dd if=/dev/rvpath0 of=/dev/null bs=128k count=781

The rvpath0 device is the character (raw) device file for the LUN that is presented to the operating
system by SDD. This command reads 100 MB off rvpath0 and reports how long it takes in
seconds. Take 100 MB and divide by the number of seconds that is reported to determine the
MBps read speed.

Linux: For Linux systems, use the appropriate /dev/sdX device or /dev/mpath/mpathn
device if you use Device-Mapper multipath.

Run the following command and start the nmon monitor or iostat -k 1 command in Linux:
dd if=/dev/rvpath0 of=/dev/null bs=128k

Your nmon monitor (the e option) reports that this previous command imposed a sustained 100
MBps bandwidth with a blocksize=128 K on vpath0. Notice the xfers/sec column; xfers/sec is
IOPS. Now, if your dd command did not error out because it reached the end of the disk,
press Ctrl+c to stop the process. Now, nmon reports an idle status. Next, run the following dd
command with a 4 KB blocksize and put it in the background:

dd if=/dev/rvpath0 of=/dev/null bs=4k &

For this command, nmon reports a lower MBps but a higher IOPS, which is the nature of I/O as
a function of blocksize. Run your dd sequential read command with a bs=1024 and you see a
high MBps but a reduced IOPS.



The following commands perform sequential writes to your LUNs:
򐂰 dd if=/dev/zero of=/dev/rvpath0 bs=128k
򐂰 dd if=/dev/zero of=/dev/rvpath0 bs=1024k
򐂰 time dd if=/dev/zero of=/dev/rvpath0 bs=128k count=781

Try different blocksizes, different raw vpath devices, and combinations of reads and writes.
Run the commands against the block device (/dev/vpath0) and notice that blocksize does not
affect performance.
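
To step through several blocksizes in one run, a small shell loop saves typing. This is only a sketch; /dev/rvpath0 is the same example device that is used above, and the amount of data read differs with each blocksize because the block count is fixed.

for bs in 4k 128k 1024k; do
    echo "== blocksize $bs =="
    time dd if=/dev/rvpath0 of=/dev/null bs=$bs count=1000    # 1000 blocks of each size
done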

Because the dd command generates a sequential workload, you still must generate the
random workload. You can use a no-charge open source tool, such as Vdbench.

Vdbench is a disk and tape I/O workload generator for verifying data integrity and measuring
the performance of direct-attached and network-connected storage on Windows, AIX, Linux,
Solaris, OS X, and HP-UX. It uses workload profiles as the inputs for the workload modeling
and has its own reporting system. All output is presented in HTML files as reports and can be
analyzed later. For more information, see the following website:
http://www.oracle.com/technetwork/server-storage/vdbench-downloads-1901681.html
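
As an illustration only, a minimal Vdbench parameter file for an 8 KB, 70% read, 100% random workload against the example device might look like the following sketch. The keyword names (sd, wd, rd, xfersize, rdpct, seekpct, iorate, elapsed, and interval) are the common ones; check the Vdbench documentation for your version before you use them.

sd=sd1,lun=/dev/rvpath0
wd=wd1,sd=sd1,xfersize=8k,rdpct=70,seekpct=100
rd=run1,wd=wd1,iorate=max,elapsed=600,interval=5

The parameter file is then passed to Vdbench with its -f option, for example, ./vdbench -f oltp_like.txt.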




Chapter 6. Performance planning tools


This chapter describes Disk Magic and the Storage Tier Advisor Tool (STAT), which you can
use for DS8000 capacity and performance planning. Disk Magic is a disk performance and
capacity modeling tool that is provided by IntelliMagic. It is available for both z Systems and
Open Systems. The STAT is a powerful reporting tool that provides a graphical representation
of performance data that is collected by Easy Tier.

This chapter includes the following topics:


򐂰 Disk Magic
򐂰 Disk Magic for z Systems
򐂰 Disk Magic for Open Systems
򐂰 Disk Magic Easy Tier modeling
򐂰 Storage Tier Advisor Tool



6.1 Disk Magic
This chapter describes Disk Magic and its basic usage. It includes examples that show the
required input data, how the data is fed into the tool, and also shows the output reports and
information that Disk Magic provides.

The examples in this chapter demonstrate the steps that are required to model a storage
system for certain workload requirements. The examples show how to model a DS8880
storage system and provide guidance about the steps that are involved in this process.

Disk Magic: Disk Magic is available for use by IBM representatives, IBM Business
Partners, and users. Clients must contact their IBM representative to run Disk Magic tool
studies when planning their DS8000 hardware configurations.

Disk Magic for Windows is a product of IntelliMagic B.V. Although this book refers to this
product as Disk Magic, the product that is available from IntelliMagic for clients was
renamed to IntelliMagic Direction and contains more features.

This chapter provides only basic examples of the use of Disk Magic. The
version of the product that is used in these examples is current as of the date of the writing
of this book. For more information, see the product documentation and guides. For more
information about the current client version of this product, go to the IntelliMagic website:
http://www.intellimagic.com

6.1.1 The need for performance planning and modeling tools


There are two ways to approach your planning and modeling:
򐂰 Perform a complex workload benchmark that is based on your workload.
򐂰 Perform a Disk Magic modeling study by using performance data that is retrieved from your system.

Performing an extensive and elaborate lab benchmark by using the correct hardware and
software provides a more accurate result because it is real-life testing. Unfortunately, this
approach requires much planning, time, and preparation, plus a significant amount of
resources, such as technical expertise and hardware/software in an equipped lab.

Doing a Disk Magic study requires much less effort and resources. Retrieving the
performance data of the workload and getting the configuration data from the servers and the
disk subsystems (DSSs) is all that is required from the client. With this data, the IBM
representative or IBM Business Partner can use Disk Magic to do the modeling.

Disk Magic is calibrated to match the results of lab runs that are documented in sales
materials and white papers. You can view it as an encoding of the data that is obtained in
benchmarks and reported in white papers.

When the Disk Magic model is run, it is important to size each component of the storage
server for its peak usage period, usually a 15- or 30-minute interval. Using a longer period
tends to average out the peaks and non-peaks, which does not give a true reading of the
maximum demand.

Different components can peak at different times. For example, a processor-intensive online
application might drive processor utilization to a peak while users are actively using the
system. However, disk utilization might be at a peak when the files are backed up during
off-hours. So, you might need to model multiple intervals to get a complete picture of your
processing environment.



6.1.2 Overview and characteristics
Disk Magic is a Microsoft Windows based DSS performance modeling tool. Disk Magic can
be used to plan the DS8000 hardware configuration. With Disk Magic, you model the DS8000
performance when migrating from another DSS or changing an existing DS8000 configuration
and the I/O workload. You can use Disk Magic with both z Systems and Open Systems server
workloads.
When running the DS8000 modeling, you start from one of these scenarios:
򐂰 An existing model of one or more DSSs (IBM or non-IBM) that you want to migrate to a
DS8880 storage system. Because a DS8880 storage system can have much greater
storage and throughput capacity than other DSSs, with Disk Magic you can merge the
workload from several existing DSSs into a single DS8880 storage system.
򐂰 Modeling a planned new workload, even if you do not have the workload running on any
DSS. You need an estimate of the workload characteristics, such as disk capacity, I/O rate,
and cache statistics, which provide an estimate of the DS8880 performance results. Use
an estimate for rough planning purposes only.

6.1.3 Disk Magic data collection for a z/OS environment


To perform the Disk Magic study, get information about the characteristics of the workload.
For each control unit to be modeled (current and proposed), you need the following
information:
򐂰 Control unit type and model.
򐂰 Cache size.
򐂰 Nonvolatile storage (NVS) size.
򐂰 Disk drive module (DDM) size and speed.
򐂰 Solid-state drive (SSD) and flash sizes.
򐂰 Number, type, and speed of channels.
򐂰 Number and speed of host adapter (HA) and FICON ports.
򐂰 Parallel access volume (PAV): Which PAV option is used.

For Remote Copy, you need the following information:


򐂰 Remote Copy type
򐂰 Distance
򐂰 Number of links and speed

Some of this information can be obtained from the reports created by RMF Magic.

In a z/OS environment, running a Disk Magic model requires the System Management
Facilities (SMF) record types 70 - 78. There are two different ways to send this SMF data:
򐂰 If you have RMF Magic available
򐂰 If you do not have access to RMF Magic

If RMF Magic is available


With this first option, you must terse the SMF data set. To avoid huge data set sizes, separate
the SMF data by SYSID or by date. The SMF data set must be tersed before putting it on the
FTP site. To terse the data set, complete the following steps:
1. Prepare the collection of SMF data.
2. Run TRSMAIN with the PACK option to terse the SMF data set.



Note: Do not run TRSMAIN against an SMF data set that is on tape because this action
causes problems with the terse process of the data set. The SMF record length on tape
can be greater than 32760 bytes, which TRSMAIN cannot handle.

If RMF Magic is not available


In this case, you must pack the SMF data by using RMFPACK, which is available for download
from the same website as Disk Magic.

To pack the SMF data set into a ZRF file, complete the following steps:
1. Install RMFPACK on your z/OS system.
2. Prepare the collection of SMF data.
3. Run the $1SORT job to sort the SMF records.
4. Run the $2PACK job to compress the SMF records and to create the ZRF file.

Sending the data by using FTP

The easiest way to send the SMF data is through FTP. To avoid huge data set sizes, separate the SMF data by SYSID or by date. Send either the tersed or packed SMF data. An example FTP session is shown after the following list.

When uploading the data to the IBM FTP website, use the following information:
򐂰 The FTP site is testcase.boulder.ibm.com.
򐂰 The user ID is anonymous.
򐂰 The password is your email user ID, for example, [email protected].
򐂰 The directory to put the date into is eserver/toibm/zseries.
򐂰 Notify IBM or the IBM Business Partner about the file name that you used for the upload.
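
The following sketch shows what such a session might look like with a standard command-line FTP client (the file name is only an example):

ftp testcase.boulder.ibm.com
(log in with user anonymous and your email address as the password)
cd eserver/toibm/zseries
binary
put MYSMF.DATA.TRS
quit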

6.1.4 Disk Magic data collection for Open Systems environment


In an Open Systems environment, you need the following information for each control unit to
include in this study:
򐂰 Storage controller make, machine type, model, and serial number
򐂰 The number, size, and speed of the disk drives installed on each controller
򐂰 The number, speed, and type of channels
򐂰 The cache size
򐂰 Whether the control unit is direct-attached or SAN-attached
򐂰 How many servers are allocated and sharing these disks

For Remote Copy, you need the following information:


򐂰 Remote Copy type
򐂰 Distance
򐂰 Number of links

Data collection
The preferred data collection method for a Disk Magic study is using Spectrum Control. For
each control unit to be modeled, collect performance data, create a report for each control
unit, and export each report as a comma-separated values (CSV) file. You can obtain the
detailed instructions for this data collection from your IBM representative.

Other data collection techniques


If Tivoli Storage Productivity Center is not available or cannot be used for the existing disk
systems, other data collection techniques are available. Contact your IBM representative.



The Help function in Disk Magic documents shows how to gather various Open Systems
types of performance data by using commands, such as iostat in Linux/UNIX and perfmon in
Windows. Disk Magic also can process PT reports from IBM i systems.
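As a simple illustration, the following Python sketch captures interval iostat output on a
Linux host into a text file that can later be fed to the Open and iSeries Automated Input
option. The iostat flags, interval, and sample count are examples only; take the exact
collection requirements from the Disk Magic Help:

import subprocess

# Example only: extended statistics (-x) with timestamps (-t),
# 900-second intervals, eight samples. Adjust to the Disk Magic Help guidance.
cmd = ["iostat", "-x", "-t", "900", "8"]

with open("myhost.iostat", "w") as out:
    subprocess.run(cmd, stdout=out, check=True)  # blocks until all samples are written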

6.1.5 Configuration options


Disk Magic models the DS8000 performance based on the I/O workload and the DS8000
hardware configuration. Thus, it helps in the DS8000 capacity planning and sizing decisions.
The following major DS8880 components can be modeled by using Disk Magic:
򐂰 DS8880 model: DS8884 and DS8886 storage systems with 8, 16 or 24 cores
򐂰 Enabling Easy Tier
򐂰 Cache size for the DS8880 storage system
򐂰 Number, capacity, and speed of DDM, or SSD or flash
򐂰 Number of arrays and RAID type
򐂰 DS8880 HAs
– Number and speed (16 Gb)
– Number of ports
򐂰 Type and number of channels
򐂰 Remote Copy option

When working with Disk Magic, always ensure that you feed in accurate and representative
workload information because Disk Magic results depend on the input data that is provided.
Also, carefully estimate future demand growth, which is fed into Disk Magic for modeling
projections on which the hardware configuration decisions are made.

6.1.6 Workload growth projection


The workload that runs on any DSS always grows over time, which is why it is important to
project how the new DSS performs as the workload grows. There are three growth projection
options:
򐂰 I/O rate
The I/O rate growth projection projects the DSS performance when the I/O rate grows.
Use this option when the I/O rate is expected to grow without a simultaneous growth of
cache or the backstore capacity of the subsystem. You can expect the Access Density
(number of I/Os per second per gigabyte) to increase (see the short worked example after
this list). In this case, Disk Magic keeps the hit rate the same for each step.
򐂰 I/O rate with capacity growth
The I/O rate with capacity growth projection projects the DSS performance when both the
I/O rate and the capacity grow. With this selection, Disk Magic grows the workload and the
backstore capacity at the same rate (the Access Density remains constant) while the
cache size remains the same. Automatic Cache modeling is used to compute the negative
effect on the cache hit ratio.
򐂰 Throughput rate (MBps)
This is similar to the I/O rate option. Disk Magic models the DS8880 storage system based
on throughput growth but without any growth of the cache and backstore capacity of the
DSS.
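As a short worked example of the difference between the first two options, the following
Python lines compute the Access Density for a hypothetical workload; the numbers are
illustrative only:

# Hypothetical starting point: 50,000 IOPS against 100,000 GB of usable capacity
iops, capacity_gb = 50_000.0, 100_000.0
base_density = iops / capacity_gb                        # 0.5 I/Os per sec per GB

# I/O rate growth only: I/O rate grows 30%, capacity unchanged -> density increases
density_io_only = (iops * 1.3) / capacity_gb             # 0.65 I/Os per sec per GB

# I/O rate with capacity growth: both grow 30% -> density stays constant
density_io_and_cap = (iops * 1.3) / (capacity_gb * 1.3)  # 0.5 I/Os per sec per GB

print(base_density, density_io_only, density_io_and_cap)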



6.1.7 Disk Magic modeling
The process of modeling with Disk Magic starts with creating a base model for the existing
DSSs. Initially, you load the input data that describes the hardware configuration and
workload information of those DSSs. When you create the base model, Disk Magic validates
the hardware and workload information that you entered, and if everything is acceptable, a
valid base model is created. If not, Disk Magic provides messages that explain why the base
model cannot be created, and it shows the errors on the log.

After the valid base model is created, you proceed with your modeling. You change the
hardware configuration options of the base model to determine the preferred DS8880
configuration for a certain workload, or you can modify the workload values that you initially
entered, so that, for example, you can see what happens when your workload grows or its
characteristics change.

When doing Disk Magic modeling, you must model two or three different measurement
periods:
򐂰 Peak I/O rate
򐂰 Peak read + write throughput in MBps
򐂰 Peak write throughput in MBps if you are also modeling a DSS running one of the remote
copy options

Welcome to Disk Magic


When you start the Disk Magic program, the Welcome to Disk Magic window opens, as
shown in Figure 6-1. You can create a SAN project or open an existing project that is saved to
a Disk Magic file with the extension DM2.

Figure 6-1 Welcome to Disk Magic

When you create a model for a DS8000 storage system, select New SAN Project. The
window that is shown in Figure 6-2 on page 165 opens.



Figure 6-2 Create SAN project in Disk Magic

This window shows the following options:


򐂰 Create New Project Using Automated Input:
– zSeries or WLE Automated input (*.DMC) by using DMC input files:
Disk Magic Control (DMC) files are created by using RMF Magic. The files contain a
description of both the configuration and workload of a z/OS environment for a specific
time interval. This description is used as a starting point for a detailed and accurate
DSS modeling study with Disk Magic that is based on data collected with the z/OS
Resource Management Facility.
– Open and iSeries Automated Input (*.IOSTAT, *.TXT, *.CSV):
With this option, you can make Disk Magic process multiple UNIX/Linux iostat,
Windows perfmon, or iSeries PT reports, which are generated by multiple servers. Disk
Magic then consolidates these statistics across the servers so that you can identify the
interval with the highest I/O rate, MB transferred rate, and so on.
– Spectrum Control reports from DSS and SAN Volume Controller configurations (*.CSV)
򐂰 Create New Project Using Manual Input:
– General Project: This option can be selected to create a project that initially consists of
a single Project, Model, System, and Disk Subsystem:
• Number of zSeries Servers
• Number of Open Servers
• Number of iSeries Servers
– A Transaction Processing Facility (TPF) project (TPF storage subsystem modeler)



– Storage Virtualization Wizard Project: Select this option when you intend to build a
model that consists of a virtualization cluster (for example, SAN Volume Controller) and
attached DSSs.
򐂰 Create DMC Files with RMF Loader
This option can be used to read z/OS RMF/SMF data collected at the mainframe of the
client and packed with RMFPACK. RMF Loader then creates the command files to be later
used in Disk Magic. This option does not require the use of RMF Magic.

6.2 Disk Magic for z Systems


This section explains how to use Disk Magic as a modeling tool for z Systems (still designated
as zSeries in Disk Magic). In the example that is presented, we merge two DS8870 storage
systems into one DS8886 storage system. In a z/OS environment, DSS measurement data is
collected by the z/OS Resource Management Facility. For more information, see 14.1,
“DS8000 performance monitoring with RMF” on page 460. The DSS measurement data is
stored in an SMF data set.

There are two options to accomplish this task:


򐂰 Use the RMF Loader option in Disk Magic, as shown in Figure 6-2 on page 165, to
process raw RMF data and create a DMC file. DMC files contain a description of both the
configuration and workload of a z/OS environment for a specific time interval. This DMC
file is used as a starting point for a detailed and accurate DSS modeling study with Disk
Magic.
To process the z/OS SMF data set on a Windows system with Disk Magic, it must be
sorted and then packed with the RMFPACK utility. RMFPACK is part of the Disk Magic
installation. The Disk Magic installation provides two XMIT files for installation of RMFPACK
on the z/OS system. The Disk Magic installation provides a PDF file that contains a
detailed description of how to install and use RMFPACK on z/OS. Use RMFPACK to create the
input data for the RMF Loader option. RMFPACK creates an SMF file in ZRF format on z/OS
to be downloaded in binary to your Windows system. You can then create your DMC files
by processing the data with the RMF Loader option in Disk Magic, as shown in Figure 6-3
on page 167, and determine the intervals to use for modeling.



Figure 6-3 Using RMF Loader in Disk Magic to create z/OS DMC input files

򐂰 The preferred option is to use an automated input process to load the RMF data into Disk
Magic by using a DMC file. To read the DMC file as automated input into Disk Magic,
select zSeries or WLE Automated input (*.DMC) in the New SAN Project window, as
shown in Figure 6-2 on page 165. By using automated input, you can have Disk
Magic process the model at the DSS, logical control unit (LCU), or device level.
Considerations for using one or the other option are provided in the Disk Magic help text
under “How to Perform Device Level Modeling”. The simplest way to model a DSS is using
the DSS option, where the model contains the workload statistics by DSS. These DMC
files must be created first by using RMF Magic.

6.2.1 Processing the DMC file


DMC files typically use a .dmc suffix for the file name and show the date and time of the
corresponding RMF period by using the following naming convention:
򐂰 The first two characters represent an abbreviation of the month.
򐂰 The next two digits represent the date.
򐂰 The following four digits show the time period.

For example, JL292059 means that the DMC file was created for the RMF period of July 29 at
20:59.



To process the DMC file, complete the following steps:
1. From the window that is shown in Figure 6-2 on page 165, select zSeries or WLE
Automated Input (*.DMC). Figure 6-4 shows the window opens and where you can select
the DMC file that you want to use.

Figure 6-4 z Systems - select DMC file

2. In this particular example, select the JL292059.dmc file, which opens the window that is
shown in Figure 6-5. This particular DMC file was chosen because it represents the peak
I/O rate period.

Figure 6-5 z Systems - DMC file opened



3. There are 12 LPARs and two DSSs (IBM-ABCD1 and IBM-EFGH1).
Clicking the IBM-EFGH1 icon opens the general information that relates to this DSS
(Figure 6-6). It shows that this DSS is a DS8870 storage system with 512 GB of cache that
was created by using the DSS level.

Figure 6-6 z Systems - general information of the disk subsystem



4. Click Hardware Details in Figure 6-6 on page 169 to open the window that is shown in
Figure 6-7. You can change the following features, based on the actual hardware
configuration of the DS8870 storage system:
– SMP (Processor Complex) type
– Number of HAs
– Number of device adapters (DAs)
– Number of High Performance Flash Enclosure (HPFE)
– Cache size

Figure 6-7 z Systems - configuration details



5. Click the Interfaces tab, as shown in Figure 6-6 on page 169. You see in Figure 6-8 that
each LPAR connects to the DSS through eight FEx8S channels. If this information is
incorrect, you can change it by clicking Edit.

Figure 6-8 z Systems - LPAR to disk subsystem interfaces



6. Select From Disk Subsystem in Figure 6-8 on page 171 to open the interface that is
used by the DSS. Figure 6-9 indicates that Enterprise Storage Server IBM-EFGH1 uses
24 FICON 8 Gb ports.
In this window, you also indicate whether there is a Remote Copy relationship between
this DS8870 storage system and a remote DSS. There are eight 8 Gb Fibre ports that are
used for the PPRC connection at 0 km distance.

Figure 6-9 z Systems - disk subsystem to LPAR interfaces



7. Look at the DDM by clicking the zSeries Disk tab, which is shown in Figure 6-10. The
3390 entries are models 9, 27 and 54, on both CPC 0 and CPC 1. The back-end drives
are 64 ranks of 300 GB/15 K RPM DDM and 4 ranks of 400 GB SSD drives per CPC.

Figure 6-10 z Systems - DDM option



8. To see the last option, click the zSeries Workload tab. Because this DMC file is created
by using the DSS option, you see the I/O statistics for each LPAR (Figure 6-11). This view
shows that the I/O rate from LPAR ABC2 is 9208.2 IOPS. Click the Average tab to the
right of ABC2. You see the total I/O rate from all 12 LPARs to this DS8870 storage system,
which is 98,449.6 IOPS (Figure 6-12 on page 175) and the average response time is
0.78 msec.

Figure 6-11 z Systems - I/O statistics from LPAR ABC2



Figure 6-12 z Systems - I/O statistics from all LPARS to this DS8870 storage system

9. Click Base to create the base model for this DS8870 storage system. It is possible that a
base model cannot be created from the input workload statistics, for example, if there is
excessive CONN time that Disk Magic cannot calibrate against the input workload
statistics. In this case, you must identify another DMC from a different period, and try to
create the base model from that DMC file. For example, if this model is for the peak I/O
rate, you should try the period with the second highest I/O rate.
After creating this base model for IBM-EFGH1, you must also create the base model for
IBM-ABCD1 by following this same procedure. The DS8870 IBM-ABCD1 storage system
has the same physical configuration as IBM-EFGH1, but with different workload
characteristics.



6.2.2 z Systems model to merge the two DS8870 storage systems to a DS8886
storage system
If you merge the two DS8870 storage systems with their configurations as is, you have a
problem because each DS8870 storage system requires a total of 128 ranks, so the total
number of ranks that are required is 256 ranks. The DS8886 storage system has a maximum
capacity for 192 ranks. So, you must model each DS8870 storage system by using fewer
ranks, which can be done by consolidating the 300 GB/15 K RPM ranks to 600 GB/15 K RPM
ranks. This task can be done by updating the RAID Rank definitions and then clicking Solve.
Now, you need only a total of 144 ranks on the DS8886 storage system, as shown in
Figure 6-13.

Figure 6-13 z Systems - update ranks from 300 GB to 600 GB



Start the merge procedure to merge the two DS8870 storage systems into a DS8886 storage
system by completing the following steps:
1. In Figure 6-14, right-click IBM-EFGH1 and click Merge. In the window that opens, select
Add to Merge Source Collection and create New Target. This option creates
Merge Target1, which is the new DSS that you use as the merge target (Figure 6-15).

Figure 6-14 z Systems - merge and create a target disk subsystem

Figure 6-15 z Systems - merge target disk subsystem



2. Select Hardware Details to open the window that is shown in Figure 6-16. With the
Failover Mode option, you can model the performance of the DS8880 storage system
when one server is down.

Figure 6-16 z Systems - target hardware details option

You can select the cache size. In this case, select 1024 GB because each of the two
DS8870 storage systems has 512 GB cache.
Disk Magic computes the number of HAs on the DS8880 storage system based on the
specification on the Interfaces page, but you can, to a certain extent, override these
numbers. The Fibre ports are used for Peer-to-Peer Remote Copy (PPRC) links. Enter 12
into the FICON HAs field and 8 into the Fibre HAs field.
3. Click the Interfaces tab to open the From Servers dialog box (Figure 6-17 on page 179).
Because the DS8870 FICON ports are running at 8 Gbps, you must update this option to
16 Gbps on all the LPARs and also on the From Disk Subsystem tab. If the Host CPC uses
different FICON channels than the ones that are specified, it also must be updated.
Select and determine the Remote Copy Interfaces. Select the Remote Copy type and the
connections that are used for the Remote Copy links.



Figure 6-17 z Systems - target interfaces from LPARs

4. To select the DDM capacity and RPM used, click the zSeries Disk tab, as shown in
Figure 6-18. Do not make any changes here, and let Disk Magic select the DDM/SSD
configuration based on the source DSSs.

Figure 6-18 z Systems - target DDM option



5. Merge the second DS8870 storage system onto the target subsystem. In Figure 6-19,
right-click IBM-ABCD1, select Merge, and then select Add to Merge Source Collection.

Figure 6-19 z Systems - merge second DS8870 storage system

6. Perform the merge procedure. From the Merge Target window (Figure 6-20), click Start
Merge.

Figure 6-20 z Systems - start the merge



7. This selection initiates Disk Magic to merge the two DS8870 storage systems onto the
new DS8886 storage system and creates Merge Result1 (Figure 6-21).

Figure 6-21 z Systems - DS8886 disk subsystem created as the merge result



8. To see the DDM that is configured for the DS8886 storage system, select zSeries Disk for
MergeResult1. You see that all the volumes from the DS8870 storage system are now
allocated on this DS8886 storage system. Also, the rank configuration is equal to the
combined ranks from the two DS8870 storage systems. There are 128 ranks of
600 GB/15 K RPM DDM and 16 ranks of SSD 400 GB (Figure 6-22).

Figure 6-22 z Systems - DDM of the new DS8886 storage system



9. Select zSeries Workload to show the Disk Magic predicted performance of the DS8886
storage system. The estimated response time is 0.70 msec. Disk Magic assumes that the
workload is spread evenly among the ranks within the extent pool that is configured for the
workload (Figure 6-23).

Figure 6-23 z Systems - performance statistics of the DS8886 storage system



10.Click Utilization to display the utilization statistics of the various DS8886 components. In
Figure 6-24, you can see that the Average FICON HA Utilization is 49%, which is close to
the 50% amber point. The amber point is an indication that if the resource utilization
reaches this number, an increase in the workload soon causes this resource to reach its
limit.
Therefore, try to improve the performance of the DS8886 storage system by updating
some of the resources. Because the FICON HA utilization is the bottleneck, you can
perform either of the following actions:
– Add 12 more FICON HAs.
– Use the FEx16S FICON channels on the host, if they are available.
In this example, we assume that the host has the FEx16S FICON channels.

Figure 6-24 z Systems - DS8886 utilization statistics

11.Figure 6-25 on page 185 shows that you upgraded the host server FICON channels to
FEx16S. This upgrade also requires that you update the Server Side, by using the From
Disk Subsystem tab, so that it also uses the FEx16S channels.



Figure 6-25 z Systems - upgrade the host server channels to FEx16S

12.For this configuration, update the SSD drives to flash drives (Figure 6-26). Click Solve to
create the new Disk Magic configuration model.

Figure 6-26 z Systems - replace the SSD drives with flash drives



13.Figure 6-27 shows that the new DS8886 model improved the response time to 0.55 msec.
Click Utilization to display the utilization statistics of the various upgraded DS8886
components. In Figure 6-28 on page 187, you can see that the Average FICON HA
Utilization dropped to 27.4% from the previous 49.0%.

Figure 6-27 z Systems - performance statistics after upgrade



Figure 6-28 z Systems - upgraded DS8886 utilization

As you can see from the examples in this section, you can modify many of the resources of
the target DSS, for example, replacing some of the 600 GB/15 K RPM DDM with flash drives,
and let Disk Magic model the new configuration. This way, if the DS8886 model shows
resource bottlenecks, you can make the modifications to try to eliminate those bottlenecks.



6.2.3 Disk Magic performance projection for the z Systems model
Based on the previous modeling results, you can create a chart that compares the
performance of the original DS8870 storage system with the performance of the new DS8886
storage system:
1. Click 2008/07/29 20:59 to put the LPAR and DSSs in the right column (Figure 6-29).

Figure 6-29 z Systems - window for performance comparison

2. With Figure 6-29 open, press and hold the Ctrl key and select IBM-EFGH1, IBM-ABCD1,
and MergeResult2. Right-click any of them and a small window opens. Select Graph from
this window. In the window that opens (Figure 6-30 on page 189), click Clear to clear any
prior graph option settings.



Figure 6-30 z Systems - graph option window

3. Click Plot to produce, in a Microsoft Excel spreadsheet, the response time components
graph of the three DSSs that you selected. Figure 6-31 is the graph that is created based
on the numbers from the Excel spreadsheet.

Figure 6-31 z Systems - response time comparison



The DS8886 response time that is shown on the chart is 0.56 msec, which is less compared
to the two DS8870 storage systems. The response time improvement on the DS8886 storage
system is about 24% compared to the DS8870 storage system.

The response time that is shown here is 0.56 msec, which is slightly different compared to the
0.55 msec shown in Figure 6-27 on page 186. This difference is caused by rounding factors
by Disk Magic.

6.2.4 Workload growth projection for a z Systems model


A useful feature of Disk Magic is its capability to create a workload growth projection and
observe the impact of this growth on the various DS8880 resources and also on the response
time. To run this workload growth projection, complete the following steps:
1. Click Graph in Figure 6-27 on page 186 to open the window that is shown in Figure 6-32.

Figure 6-32 z Systems - workload growth projection dialog box

2. Click Range Type, and choose I/O Rate, which fills the From field with the I/O rate of the
current workload, which is 186,093.7 IOPS. You can change the to field to any
value; in this example, we change it to 321,000 IOPS and the by field to 10,000 IOPS.
These changes create a plot that starts at 186,093.7 IOPS and increments each point by
10,000 IOPS until the I/O rate reaches the maximum rate that is equal to or less than
321,000 IOPS.
3. Click New Sheet to create the next plot on a new sheet in the Excel file and then click
Plot. An error message displays and informs you that the DA is saturated. The projected
response time report is shown in Figure 6-33 on page 191.



Figure 6-33 z Systems - workload growth impact on response time components

4. Click Utilization Overview in the Graph Data choices, then click New Sheet and Plot to
produce the chart that is shown in Figure 6-34.

Figure 6-34 z Systems - utilization growth overview

This utilization growth projection chart shows how much the utilization of these DS8886
resources increases as the I/O rate increases. Here, you can see that the first bottleneck that
is reached is the DA.

At the bottom of the chart, you can see that Disk Magic projected that the DS8886 storage
system can sustain a workload growth of up to 43%, as shown in Figure 6-33. Additional
ranks and additional DAs should be planned, which might include an additional DSS.



When you create the growth chart, there are new options that you can use for the Range
Type:
򐂰 Throughput (MBps)
This option shows the workload growth based on the increase in throughput in MB,
where 1 MB is 1,000,000 bytes.
򐂰 Throughput (MiB/s)
This option shows the workload growth based on the increase in throughput in MiB,
where 1 MiB is 1,048,576 bytes (see the short conversion note after this list).
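The only difference between these two range types is the unit. A quick conversion with an
illustrative value:

mbps = 1000.0                           # example throughput in MBps
mibps = mbps * 1_000_000 / 1_048_576    # about 953.7 MiB/s for the same data rate
print(round(mibps, 1))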

The next step is to repeat the same modeling steps starting with 6.2.1, “Processing the DMC
file” on page 167, for two other peak periods, which are:
򐂰 Peak read + write MBps to check how the overall DS8886 storage system performs under
this stress load
򐂰 Peak write MBps to check whether the SMP, Bus, and Fibre links can handle this peak
PPRC activity.

6.3 Disk Magic for Open Systems


This section shows how to use Disk Magic as a modeling tool for Open Systems. It illustrates
the example of migrating a DS8870 storage system to a DS8886 storage system. This
exercise shows the steps to do the migration and also to check which resources will reach
their saturation point when the workload grows. When the bottleneck is identified, action is
taken on the model to improve the saturated resource and a new model is created.

In this example, we use the comma-separated values (CSV) file that is processed by the
Open and iSeries Automated Input option.

Typically, when doing a Disk Magic study, model the following periods:
򐂰 Peak I/O period
򐂰 Peak Read + Write throughput in MBps
򐂰 Peak Write throughput in MBps, if you are doing a Remote Copy configuration

6.3.1 Processing the CSV output file to create the base model for the DS8870
storage system
To process the CSV files, complete the following steps:
1. From the Welcome to Disk Magic window, which is shown in Figure 6-35 on page 193,
select New SAN Project and click OK. The result is shown in Figure 6-36 on page 193.
Select Open and iSeries Automated Input and then click OK.



Figure 6-35 Open Systems - New SAN project

Figure 6-36 Open Systems - Automated input



2. Select the folder that contains the automated input data, which contains the CSV file, as
shown in Figure 6-37. Select both the config text file and the dss CSV file and click Open.
The window that is shown in Figure 6-38 opens. Click Process.

Figure 6-37 Open Systems - automated input file selection

Figure 6-38 Open Systems - process the data

3. Now, you see Figure 6-39 on page 195, and from here select the row with the date of Nov.
3 at 03:10:00 AM. This is the period where the DS8870 storage system reaches the peak
I/O Rate and also the peak Write MBps. Click Add Model and then click Finish. You now
see the Disk Magic configuration window (Figure 6-40 on page 195). In this window,
delete all the other DSSs and show only the DS8870 storage system (16 core) with 512
GB of cache (IBM-75PQR21) that is migrated to the DS8886 storage system. Click Easy
Tier Settings.



Figure 6-39 Open Systems - select the appropriate data based on date and time

Figure 6-40 Open Systems - model created



4. In this window (Figure 6-41), select Enable Easy Tier, and for Open Servers, select
Manually select a preset and choose the Intermediate skew of 7.00. In the Hardware
Details window (Figure 6-42), define the number of Fibre Host Adapters as 8.

Figure 6-41 Open Systems - enable Easy Tier

Figure 6-42 Open Systems - update number of Fibre HAs



5. In the Interface window (Figure 6-43), define 16 Fibre 8 Gb links on both the Server side
and the DSS side. Also, define eight 8 Gb links for PPRC at 0 km distance.

Figure 6-43 Open Systems - server to disk subsystem and PPRC interfaces



6. In the Open Disk window (Figure 6-44), define 164 ranks of 300 GB/15 K disk drives and
four ranks of 400 GB SSD.

Figure 6-44 Open Systems - disk drive module options

7. In Figure 6-45, click Base to create the base model of this DS8870 IBM-PQR21 storage system.

Figure 6-45 Open Systems - base model created



6.3.2 Migrating the DS8870 storage system to the DS8886 storage system
To migrate the DS8870 storage system to the DS8886 storage system, complete the following
steps:
1. In Figure 6-46, right-click IBM-PQR21 and select Merge → Add to Merge Source
Collection and create New Target. The Merge Target window opens (Figure 6-47).
Define the target DSS as a DS8886 storage system (24 core) with 2048 GB of cache.

Figure 6-46 Open Systems - create disk subsystem target

Figure 6-47 Open Systems - DS8886 target disk subsystem



2. In Figure 6-48, define the Interfaces for the Server, DSS, and PPRC as 16 Gb. Click Start
Merge. The Merge Result for the DS8886 storage system is shown in Figure 6-49 on
page 201.
The service time on the DS8886 storage system improves compared to the DS8870
storage system, from 2.53 msec to 1.98 msec.

Figure 6-48 Open Systems - server to disk subsystem and PPRC interfaces



Figure 6-49 Open Systems - DS8886 target disk subsystem created

DS8886 growth projection


Now, do a growth projection of the DS8886 storage system based on I/O load increase to
discover which resource is the first bottleneck when the I/O rate increases.



Complete the following steps:
1. Click Graph in Figure 6-49 on page 201 to open the window that is shown in Figure 6-50.

Figure 6-50 Open Systems - utilization projection

2. For the Graph Data drop-down menu, select Utilization Overview, and for the Range
Type drop-down menu, select I/O Rate.
3. Click Plot, which opens the dialog box that is shown in Figure 6-51. This dialog box states
that at a certain point when projecting the I/O Rate growth, Disk Magic will stop the
modeling because the DA is saturated.

Figure 6-51 Open Systems - growth projection limited by device adapter saturation

The utilization growth projection in Figure 6-52 on page 203 shows the projected growth of
each resource of the DS8886 storage system. At greater than 125% I/O rate growth, the DA
reaches 100% utilization. Realistically, you should not run this configuration at greater than
the 39% growth rate because at that point the DA utilization starts to reach the
amber/warning stage.

The next step is to try to update the configuration to relieve this bottleneck.



Figure 6-52 Open Systems - growth projection bottleneck at the device adapter

DS8886 storage system: Upgrading SSD drives to flash drives


Now, try to update the configuration to relieve the DA bottleneck. One way to do this task is to
upgrade the SSD drives with flash drives because the flash drives use HPFEs, which do not
use the DA.



In the Open Disk window, which is shown in Figure 6-53, update the 400 GB SSD to a
400 GB HPF and click Solve. The result of this solve is shown in Figure 6-54.

Figure 6-53 Open Systems - upgrade the SSD drives with flash drives

Figure 6-54 Open Systems - Disk Magic solve of DS8886 with flash drives



You see that the service time did not change; it is still 1.98 msec. However, this upgrade
improves the DA utilization, and you see its effect when doing the I/O rate growth
projection.

Now, run the utilization graph with I/O rate growth again. This time, the DA is not the
bottleneck, but you get the Host Adapter Utilization > 100% message (Figure 6-55).

Figure 6-55 Open Systems - growth projection limited by host adapter saturation

The utilization growth projection in Figure 6-56 shows the projected growth of each resource
after the flash upgrade on the DS8886 storage system. At greater than 183% I/O rate growth,
the Fibre HA reaches a 100% utilization. Do not run this configuration at greater than the 67%
growth rate because at that point the Fibre HA utilization starts to reach the amber/warning
stage.

Figure 6-56 Open Systems - growth projection bottleneck at the Fibre host adapter



Service Time projection
Figure 6-57 shows the service time comparison between these three configurations:
򐂰 The original DS8870 storage system with a 512 GB cache
򐂰 The DS8886 storage system with 2048 GB cache, 16 Gb Fibre ports, and the same disk
and SSD configuration
򐂰 The DS8886 storage system with SSD drives upgraded to flash drives

The service time improves from 2.53 msec on the DS8870 storage system to 1.98 msec on
the DS8886 storage system. The service time is the same for both DS8886 options.

The difference between option 2 and option 3 is that option 3 can handle a much higher I/O
rate growth projection because the flash drives do not use the DAs.

Figure 6-57 Open Systems - service time projections



Next, you see the service time projection when the I/O rate grows for the DS8886 storage
system with flash drives. In Figure 6-58, select Service Time in ms for the Graph data and
I/O Rate for the Range Type. Click Plot, and the message that is shown in Figure 6-59
displays. This message indicates that the service time projection will be terminated when the
Fibre HA utilization reaches 61.3%, which exceeds the amber threshold. Figure 6-55 on
page 205 shows that this point is reached at the I/O rate of 91,875 IOPS, which is at the I/O
rate growth of 77%.

Figure 6-58 Open Systems - graph option for service time growth projection

Figure 6-59 Open Systems - message when doing service time growth projection



The result of the plot is shown in Figure 6-60. It shows that the service time increases from
1.98 msec to 2.13 msec at the I/O rate of 86,875 IOPS, which is an I/O rate growth of 67%.

Figure 6-60 Open Systems - service time chart with workload growth

6.4 Disk Magic Easy Tier modeling


Disk Magic supports DS8000 Easy Tier modeling for DS8800 and DS8700 multitier
configurations. The performance modeling that is provided by Disk Magic is based on the
workload skew level concept. The skew level describes how the I/O activity is distributed
across the capacity for a specific workload. A workload with a low skew level has the I/O
activity distributed evenly across the available capacity. A heavily skewed workload has many
I/Os to only a small portion of the data. The workload skewing affects the Easy Tier
effectiveness, especially if there is a limited number of high performance ranks. The heavily
skewed workloads benefit most from the Easy Tier capabilities because even when moving a
small amount of data, the overall performance improves. With lightly skewed workloads, Easy
Tier is less effective because the I/O activity is distributed in such a large amount of data that
it cannot be moved to a higher performance tier.

There are three different approaches to modeling Easy Tier on a DS8880 DSS:
򐂰 Use one of the predefined skew levels.
򐂰 Use an existing skew level based on the current workload on the current DSS.
򐂰 Use heatmap data from a DSS that supports this function.

6.4.1 Predefined skew levels


Disk Magic supports five predefined skew levels for use in Easy Tier predictions:
򐂰 Skew level 2: Very low skew
򐂰 Skew level 3.5: Low Skew
򐂰 Skew level 7: Intermediate skew



򐂰 Skew level 14: High skew
򐂰 Skew level 24: Very high skew

Disk Magic uses this setting to predict the number of I/Os that are serviced by the higher
performance tier. In Figure 6-61, the five curves represent the predefined skew levels in
respect to the capacity and I/Os.

A skew level value of 1 means that the workload does not have any skew at all, meaning that
the I/Os are distributed evenly across all ranks.

Figure 6-61 Disk Magic skew level curves

The top curve represents the very high skew level, and the lowest curve represents the very
low skew level. In this chart, the intermediate skew curve (the middle one) indicates that, for a
fast tier capacity of 20%, Easy Tier moves 79% of the workload (I/Os) to the fast tier. Disk
Magic assumes that if there is an extent pool where 20% of the capacity is on SSD or flash
ranks, Easy Tier manages to fill this 20% of the capacity with data that handles 79% of all the
I/Os.

These skew curves are developed by IntelliMagic. The class of curves and the five predefined
levels were chosen after researching workload data from medium and large sites.
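Disk Magic does not publish the exact equation behind these curves, but their behavior is
consistent with a simple model of the form io_fraction = 1 - (1 - capacity_fraction) ** skew.
The following Python sketch is an approximation for orientation only, not the IntelliMagic
implementation. It reproduces the intermediate-skew example above (20% fast-tier capacity
handling about 79% of the I/Os), and a skew of 1 yields an I/O fraction equal to the capacity
fraction, that is, no skew at all:

def io_fraction(capacity_fraction: float, skew: float) -> float:
    """Approximate fraction of I/Os captured by the fastest tier
    when it holds the given fraction of the pool capacity."""
    return 1.0 - (1.0 - capacity_fraction) ** skew

# Intermediate skew (7) with 20% of the capacity on the fast tier: about 0.79
print(round(io_fraction(0.20, 7.0), 2))

# Compare the five predefined skew levels at the same 20% capacity point
for skew in (2, 3.5, 7, 14, 24):
    print(skew, round(io_fraction(0.20, skew), 2))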

The skew level settings affect the Disk Magic predictions. A heavy skew level selection results
in a more aggressive sizing of the higher performance tier. A low skew level selection provides
a conservative prediction. It is important to understand which skew level best matches the
actual workload before you start the modeling.



When doing Disk Magic modeling, you can decide which skew level you want to use by
clicking Easy Tier Settings in Figure 6-62, which opens the dialog box that is shown in
Figure 6-63 on page 211. Here, select Enable Easy Tier and Manually select a preset,
where you can select one of the five available options.

Figure 6-62 Easy Tier Settings option



Figure 6-63 Skew level selection

Tip: For Open Systems and z Systems workload, the default skew level is High (14). For a
System i workload, the default skew level is Very Low (2).



6.4.2 Current workload existing skew level
If the current DSS you want to model is already using Easy Tier, the Disk Magic model
detects it and calculates the current skew level. Figure 6-64 shows a sample of a DSS that is
already running Easy Tier, and it shows that the estimated skew level for this DSS is 5.69.

For z/OS, RMF Magic also estimates the skew for Easy Tier modeling even if the DSS is not
running Easy Tier. It estimates the skew based on the volume skew.

This skew level is integrated into the model of the new DSS that is migrated from the current
one.

Figure 6-64 Disk Magic data for an existing Easy Tier disk subsystem

6.4.3 Heat map


The third option is to use the heat map. Only the DS8000 storage system, IBM Storwize®
V7000, and IBM SAN Volume Controller provide this heat map. The heat map is a CSV file
containing the activity of groups of extents within every pool. Using this heat map, Disk Magic
can model the skew curve to be used by Easy Tier.

In Figure 6-65 on page 213, select Enable Easy Tier and then click Read Heatmap. These
actions open a dialog box for selecting the file that contains the heat map. The heat map name
is XXXX_skew_curve.csv, where XXXX is the DSS name. From here, select the appropriate
heat map. Based on the
heat map that Disk Magic reads, the predicted skew level is now 12.11.



Figure 6-65 Read heatmap

6.5 Storage Tier Advisor Tool


The STAT is a Microsoft Windows application that can be used to analyze the characteristics
of the workload that runs on the storage facility. The STAT provides capacity planning
information that is associated with the current or future use of the Easy Tier facility.

Storage Tier Advisor Tool overview


The advisor tool processes data that is collected by the Easy Tier monitor. The DS8000
monitoring capabilities are available regardless of whether you install and activate the Easy
Tier license feature on your DS8000 storage system. The monitoring capability of the DS8000
storage system can monitor the usage of storage at the volume extent level. Monitoring
statistics are gathered and analyzed every 24 hours. The results are summarized in
summary monitor data that can be downloaded from a DS8000 storage system for
reporting with the advisor tool.

The advisor tool provides a graphical representation of performance data that is collected by
the Easy Tier monitor over a 24-hour operational cycle. You can view the information that is
displayed by the advisor tool to analyze workload statistics and evaluate which logical
volumes might be candidates for Easy Tier management. If the Easy Tier feature is not
installed and enabled, you can use the performance statistics that are gathered by the
monitoring process to help you determine whether to use Easy Tier to enable potential
performance improvements in your storage environment and to determine optimal flash, SSD,
or HDD configurations and benefits.



After you know your Licensed Internal Code version (which you can learn by running the
DSCLI command ver -l), you can download the suitable STAT version from the IBM Fix
Central website, where you select the correct DS8000 model and code:
http://www.ibm.com/support/fixcentral/

To extract the summary performance data that is generated by Easy Tier, you can use either
the DSCLI or DS Storage Manager. When you extract summary data, two files are provided,
one for each processor complex in the Storage Facility Image (SFI) server. The download
operation initiates a long running task to collect performance data from both selected SFIs.
This information can be provided to IBM if performance analysis or problem determination is
required.

Storage Tier Advisor Tool improvement


In its initial version, the STAT provided the estimated performance improvement and SSD
configuration guidelines only for the storage system as a whole.

The version of the STAT that is available with DS8000 Licensed Internal Code R8.0 is more
granular and it provides a broader range of recommendations and benefit estimations. The
recommendations are available for all the supported multitier configurations and they are on a
per-extent pool basis. The tool estimates the performance improvement for each pool by
using the existing SSD ranks, and it provides guidelines on the type and number of SSD
ranks to configure. The STAT also provides recommendations for sizing a Nearline tier so you
can evaluate the advantage of demoting extents to either existing or additional Nearline
ranks, and the cold data capacity that results from this cold demotion.

Another improvement in the STAT is a better way to calculate the performance estimation.
Previously, the STAT took the heat values of each bucket as linear, which resulted in an
inaccurate estimation when the numbers of extents in each bucket were disproportional. The
STAT now uses the average heat value that is provided at the sub-LUN level to provide a
more accurate estimate of the performance improvement.

6.5.1 Storage Tier Advisor Tool output samples


This section provides several STAT output samples with short descriptions. For a full
description of the STAT capabilities and their usage, see IBM DS8000 Easy Tier, REDP-4667.

The STAT describes the Easy Tier statistical data that is collected by the DS8000 storage
system in detail and it produces reports in HTML format that can be viewed by using a
standard browser. These reports provide information at the levels of a DS8000 storage
system, the extent pools, and the volumes. Sizing recommendations and estimated benefits
are also included in the STAT reports.

Figure 6-66 on page 215 shows the System Summary report from STAT. There are two
storage pools monitored (P2 & P3) with a total of 404 volumes and 23,487 GiB capacity.
Three percent of the data is hot. It also shows that P2 and P3 contain two tiers, SSD and
Enterprise disk.



Figure 6-66 System summary for a DS8886 storage system

The dark purple portion of the Data Management Status bar displays the data that is
assigned to a certain tier. The green portion of the bar represents data that is managed by
Easy Tier. On storage pool P2, you see that there are 1320 GiB of data assigned/pinned to
one tier.



Based on the collected data, STAT shows how performance can be improved by adding a
flash rank, as shown in Figure 6-67. It predicts that this addition provides a performance
improvement of up to 10%. Also, adding Nearline drives to the existing pools provides more
capacity for the cold demotion of cold extents.

Figure 6-67 Systemwide Recommendation report that shows possible improvements for all pools

The next series of views is by storage pool. When you click a certain storage pool, you can
see additional and detailed recommendations for improvements at the level of each extent
pool, as shown in Figure 6-68 on page 217.



Figure 6-68 Report that shows that storage pool 0000 needs SSDs and skew

Figure 6-68 shows a storage pool view for a pool that consists of two Enterprise (15 K/10 K)
HDD ranks, with both hot and cold extents. This pool can benefit from adding one SSD rank
and one Nearline rank. You can select the types of drives that you want to add for a certain
tier through the drop-down menus on the left. These menus contain all the drive and RAID
types for a certain type of tier. For example, when adding more Enterprise drives is
suggested, the STAT can calculate the benefit of adding drives in RAID 10 instead of RAID 5,
or the STAT can calculate the benefit of using additional 900 GB/10 K HDDs instead of the
300 GB/15 K drives.

If adding multiple ranks of a certain tier is beneficial for a certain pool, the STAT modeling
offers improvement predictions for the expected performance gains when adding two, three,
or more ranks up to the recommended number.



Another view within each pool is the Volume Heat Distribution report. Figure 6-69 shows all
the volumes of a certain pool. For each volume, the view shows the amount of capacity that is
allocated to each tier and the distribution within each tier among hot, warm, and cold data.

Figure 6-69 Storage pool statistics and recommendations - Volume Heat Distribution view

In this view, three heat classes are visible externally. However, internally, DS8000 Easy Tier
monitoring uses a more granular extent temperature in heat buckets. This detailed Easy Tier
data can be retrieved by IBM Support for extended studies of a client workload situation.

6.5.2 Storage Tier Advisor Tool for Disk Magic skew


STAT also produces enhanced, detailed CSV file reports on the Easy Tier activity that can be
used for additional planning and more accurate sizing. The skew curve is available in a format
that helps you differentiate the skew by small or large blocks, reads or writes, and IOPS or
MBps. The skew curve can also be read into Disk Magic to override the suggested skew curve
estimate there and arrive at better sizing for mixed-tier pools.

To get the enhanced data, process the binary heat map data with STAT by using the
information that is shown in Example 6-1.

Example 6-1 Process the binary heat map data by using STAT
STAT.exe SF75FAW80ESS01_heat_20151112222136.data
SF75FAW80ESS11_heat_20151112222416.data

Processing the DS8000 heat map with STAT produces three CSV files. One of them,
SF75FAW80_skew_curve.csv, is used by Disk Magic to do the Easy Tier modeling.

Figure 6-70 on page 219 shows how to import the heatmap data that is created by STAT.
From the DSS main window, click Easy Tier Settings. In the next window, which is shown in
Figure 6-71 on page 219, click Read Heatmap. A dialog box opens, where you can select the
CSV file that contains the skew curve that you want to use.



Figure 6-70 Select Easy Tier Settings

Figure 6-71 Select the heatmap to be used for this modeling



After you select the skew curve CSV file, the Easy Tier Setting window that is shown in
Figure 6-72 shows the measured skew, which is 11.56 in this case. Now, you can continue
your Disk Magic model by using this skew as measured by STAT.

Figure 6-72 The heatmap that is selected shows the computed skew



Chapter 7. Practical performance management
This chapter describes the tools, data, and activities that are available for supporting the
DS8000 performance management processes by using IBM Spectrum Control V5.2.8,
formerly known as IBM Tivoli Storage Productivity Center.

This chapter includes the following topics:


򐂰 Introduction to practical performance management
򐂰 Performance management tools
򐂰 IBM Spectrum Control data collection considerations
򐂰 IBM Spectrum Control performance metrics
򐂰 IBM Spectrum Control reporting options
򐂰 Using IBM Spectrum Control network functions
򐂰 End-to-end analysis of I/O performance problems
򐂰 Performance analysis examples
򐂰 IBM Spectrum Control in mixed environments



7.1 Introduction to practical performance management
Appendix A, “Performance management process” on page 551 describes performance
management processes and inputs, actors, and roles. Performance management processes
include operational processes, such as data collection and alerting, tactical processes, such
as performance problem determination and analysis, and strategic processes, such as
long-term trending. This chapter defines the tools, metrics, and processes that are required to
support the operational, tactical, and strategic performance management processes by using
IBM Spectrum Control Standard Edition.

Important: IBM Spectrum Control has different editions that have specific features. This
chapter assumes the use of IBM Spectrum Control Standard Edition, which includes
performance and monitoring functions that are relevant to this topic.

7.2 Performance management tools


Tools for collecting, monitoring, and reporting on DS8000 performance are critical to the
performance management processes. While this book was being written, the storage
resource management tool with the most DS8000 performance management capabilities was
IBM Spectrum Control V5.2.8. It provides support for the DS8000 performance management
processes with the following features, which are briefly described in Table 7-1. Furthermore,
this chapter describes some of the most important Tactical/Strategic items.

Table 7-1 IBM Spectrum Control supported activities for performance processes

Process | Activities | Feature
Operational | Performance data collection for ports, arrays, volumes, pools, nodes (formerly called controllers), and host connections. Switch performance metrics can also be collected. | Performance monitor jobs.
Operational | Alerting. | Alerts and threshold violations.
Tactical/Strategic | Performance reporting of ports, pools, array, volumes, nodes, host connections, and switch performance data. | Web-based GUI, IBM Cognos® (including predefined reports), and TPCTOOL.
Tactical | Performance analysis and tuning. | Tool facilitates thorough data collection and reporting.
Tactical | Short-term reporting. | GUI charting with the option to export data to analytical tools.
Tactical | Advanced Analytics. | Tiering and Balancing Analysis.
Strategic | Long-term reporting. | GUI charting with the option to export data to analytical tools.

Additional performance management processes that complement IBM Spectrum Control are
shown in Table 7-2 on page 223.



Table 7-2 Additional tools

Process | Activity | Alternative
Strategic | Sizing | Disk Magic and general rules (see Chapter 6, “Performance planning tools” on page 159)
Strategic | Planning | Logical configuration performance considerations (see Chapter 4, “Logical configuration performance considerations” on page 83).
Operational | Host performance data collection and alerting | Native host tools.
Tactical | Host performance analysis and tuning | Native host tools.

7.2.1 IBM Spectrum Control overview


IBM Spectrum Control reduces the complexity of managing SAN storage devices by allowing
administrators to configure, manage, and monitor storage devices and switches from a single
console.

For a full list of the features that are provided in each of the IBM Spectrum components, go to
the following IBM website:
http://www.ibm.com/systems/storage/spectrum/

For more information about the configuration and deployment of storage by using IBM
Spectrum Control, see these publications:
򐂰 IBM Spectrum Family: IBM Spectrum Control Standard Edition, SG24-8321
򐂰 IBM Tivoli Storage Productivity Center V5.2 Release Guide, SG24-8204
򐂰 IBM Tivoli Storage Productivity Center V5.1 Technical Guide, SG24-8053
򐂰 IBM Tivoli Storage Productivity Center Beyond the Basics, SG24-8236



On the IBM Spectrum Control Overview window of a selected DS8000 storage system,
performance statistics for that device are displayed, as shown in Figure 7-1.

Figure 7-1 Overview window of a DS8000 storage system

You can customize the dashboard by using the arrows. In the left navigation section, you can
see the aggregated status of internal or related resources.

7.2.2 IBM Spectrum Control measurement of DS8000 components


IBM Spectrum Control can gather information about the component levels, as shown in
Figure 7-2 on page 225, for the DS8000 storage system. Displaying a metric within IBM
Spectrum Control depends on the ability of the storage system and its mechanisms to provide
the performance data and related information, such as the usage of its components.
Figure 7-2 on page 225 drills down from the top-level subsystem view to give you a better
understanding of how and what data is displayed.



Figure 7-2 Physical view compared to IBM Spectrum Control Performance Report Items

Since IBM Spectrum Control V5.2.8, the stand-alone GUI is no longer available. All Alerting
functions and comprehensive performance management capabilities are available in the
WebUI. From here, it is possible to drill down to a more detailed level of information regarding
performance metrics. This WebUI is used to display some of these metrics later in the
chapter.

Metrics: A metric is a numerical value that is derived from the information that is provided
by a device. It can be the raw data or a value that is calculated from it. For example, the raw data is the
transferred bytes, but the metric uses this value and the interval to show the bytes/second.
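A small sketch of that derivation, with illustrative values and names that are not actual
IBM Spectrum Control fields:

# Raw counter delta over one collection interval (illustrative values)
bytes_transferred = 3_600_000_000     # bytes moved during the interval
interval_seconds = 300                # 5-minute performance monitor interval (example)

data_rate_mbps = bytes_transferred / interval_seconds / 1_000_000
print(data_rate_mbps)                 # 12.0 MBps derived from the raw counter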

For the DS8000 storage system, the native application programming interface (NAPI) is used
to collect performance data, in contrast to the SMI-S Standard that is used for third-party
devices.

The DS8000 storage system interacts with the NAPI in the following ways:
򐂰 Access method used: Enterprise Storage Server Network Interface (ESSNI)
򐂰 Failover:
– For the communication with a DS8000 storage system, IBM Spectrum Control uses the
ESSNI client. This library is basically the same library that is included in any DS8000
command-line interface (DSCLI). Because this component has built-in capabilities to
fail over from one Hardware Management Console (HMC) to another HMC, a good
approach is to specify the secondary HMC IP address if your DS8000 storage system
has one.
– The failover might still cause errors in a IBM Spectrum Control job, but the next
command that is sent to the device uses the redundant connection.



򐂰 Network: No special network considerations exist. IBM Spectrum Control needs to be able
to connect to the HMC.
򐂰 IBM Spectrum Control shows some ESSNI error codes in its logs, but for ESSNI-specific
error codes, detailed information is provided in the DS8000 IBM Knowledge Center.

Subsystem
On the subsystem level, metrics are aggregated from multiple records to a single value per
metric to give the performance of a storage subsystem from a high-level view, based on the
metrics of other components. This aggregation is done by adding values, or calculating
average values, depending on the metric.

Cache
The cache in Figure 7-2 on page 225 plays a crucial role in the performance of any storage
subsystem.

Metrics such as disk-to-cache operations show the number of data transfer operations from
disks to cache for a specific volume, which is called staging. Disk-to-cache operations are
directly linked to read activity from hosts. When data is not found in the DS8000 cache, the
data is first staged from the back-end disks into the cache of the DS8000 storage system and
then transferred to the host.

Read hits occur when all the data that is requested for a read data access is in cache. The
DS8000 storage system improves the performance of read caching by using Sequential
Prefetching in Adaptive Replacement Cache (SARC) staging algorithms. For more
information about the SARC algorithm, see 1.3.1, “Advanced caching techniques” on page 8.
The SARC algorithm seeks to store those data tracks that have the greatest probability of
being accessed by a read operation in cache.

The cache-to-disk operation metric shows the number of data transfer operations from cache to
disks for a specific volume, which is called destaging. Cache-to-disk operations are directly
linked to write activity from hosts to this volume. Data that is written is first stored in the
persistent memory (also known as nonvolatile storage (NVS)) of the DS8000 storage system
and then destaged to the back-end disk. The DS8000 destaging is enhanced automatically by
striping the volume across all the disk drive modules (DDMs) in one or several ranks
(depending on your configuration). This striping, or the volume management that is done by
Easy Tier, provides automatic load balancing across the DDMs in the ranks and the elimination
of hot spots.

The Write-cache Delay I/O Rate and Write-cache Delay Percentage metrics (delays because of
persistent memory allocation) give you information about the cache usage for write activity.
The DS8000 storage system stores data in the persistent memory before sending an
acknowledgment to the host. If the persistent memory is full of data (no space available), the
host receives a retry for its write request. In parallel, the subsystem must destage the data
that is stored in its persistent memory to the back-end disk before accepting new write
operations from any host.

If a volume experiences write operations that are delayed because of persistent memory
constraints, consider moving the volume to a less busy rank or spreading this volume across
multiple ranks (increasing the number of DDMs that are used). If this approach does not fix
the persistent memory constraint problem, consider adding cache capacity to your DS8000
storage system.

As shown in Figure 7-3 on page 227, you can use IBM Spectrum Control to monitor the cache
metrics easily.



Figure 7-3 Available cache metrics in IBM Spectrum Control

Controller/Nodes
IBM Spectrum Control refers to the DS8000 processor complexes as Nodes (formerly called
controllers). A DS8000 storage system has two processor complexes, and each processor
complex independently provides major functions for the disk storage system. Examples
include directing host adapters (HAs) for data transfer to and from host processors, managing
cache resources, and directing lower device interfaces for data transfer to and from physical
disks. To analyze performance data, you must know that a volume is assigned to and used by
only one node (controller) at a time.



Front-end (Volume Tab) and back-end (Disk Tab (2)) node performance data is available in
IBM Spectrum Control, as shown in Figure 7-4.

Figure 7-4 Option to select front-end or back-end Node Performance Data.

When you right-click one of the nodes in the WebUI, you can use IBM Spectrum Control to
drill down to the volume performance chart for the volumes that are assigned to the selected
node, as shown in Figure 7-5 on page 229.



Figure 7-5 IBM Spectrum Control drill-down function

You can enlarge the performance chart with the “Open in a new window” icon. The URL of the
new window can be bookmarked or attached to an email.



You also can see which volumes are assigned to which node in the volumes window, as
shown in Figure 7-6.

Figure 7-6 Volume to node assignment

Ports
The port information reflects the performance metrics for the front-end DS8000 ports that
connect the DS8000 storage system to the SAN switches or hosts. Port error rate metrics,
such as Error Frame Rate, are also available. The DS8000 HA card has four or eight ports.
The WebUI does not aggregate port data per HA card, but if necessary, custom reports can be
created with IBM Cognos Report Studio or with native SQL statements to show port
performance data that is grouped by the HA to which the ports belong. Monitoring and analyzing
the ports that belong to the same card is beneficial because the aggregated throughput of the
card is less than the sum of the stated bandwidth of the individual ports.
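Because the WebUI does not aggregate the port statistics per HA card, a simple post-processing
step on exported port data can approximate that view. The following Python sketch assumes a
hypothetical CSV export with columns port, ha_card, and total_data_rate_mibs; the file name and
column names are assumptions and must be mapped to your actual export:

import csv
from collections import defaultdict

# Hypothetical CSV export of port performance data; the file name and the
# column names (port, ha_card, total_data_rate_mibs) are assumptions.
ha_totals = defaultdict(float)

with open("ds8000_port_export.csv", newline="") as f:
    for row in csv.DictReader(f):
        ha_totals[row["ha_card"]] += float(row["total_data_rate_mibs"])

# Compare the aggregated throughput per HA card with the effective bandwidth
# of the card, not with the sum of the stated speeds of its ports.
for ha_card, total in sorted(ha_totals.items()):
    print(f"HA card {ha_card}: {total:.1f} MiB/s aggregated across its ports")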

Note: Cognos BI is an optional part of IBM Spectrum Control and can be installed at any
time. More information about installation and usage of Cognos BI is provided in IBM Tivoli
Storage Productivity Center V5.1 Technical Guide, SG24-8053 and IBM Tivoli Storage
Productivity Center V5.2 Release Guide, SG24-8204.

For more information about the DS8000 port cards, see 2.5.1, “Fibre Channel and FICON
host adapters” on page 41.

Port metrics: IBM Spectrum Control reports on many port metrics because the ports on the
DS8000 storage system are the front-end part of the storage device.

Array
The array name that is shown in the WebUI, as shown in Figure 7-8 on page 231, directly
refers to the array on the DS8000 storage system as listed in the DS GUI or DSCLI.



Figure 7-7 IBM Spectrum Control WebUI V5.2.8 shows information for DS8000 arrays

When you click the Performance tab, the top five performing arrays are displayed with their
corresponding graphs, as shown in Figure 7-8.

Figure 7-8 IBM Spectrum Control WebUI V5.2.8 DS8000 Array Performance chart



Volumes on the DS8000 storage systems are primarily associated with an extent pool, and an
extent pool relates to a set of ranks. To quickly associate all arrays with their related ranks and
extent pools, use the output of the DSCLI lsarray -l and lsrank -l commands, as shown in
Example 7-1.

Example 7-1 DS8000 array site, array, and rank association


dscli> showrank r23
ID R23
SN -
Group 1
State Normal
datastate Normal
Array A15
RAIDtype 6
extpoolID P5
extpoolnam rd6_fb_1
volumes 1100,1101,1102,1103,1104,1105
stgtype fb
exts 12922
usedexts 174
widearrays 0
nararrays 1
trksize 128
strpsize 384
strpesize 0
extsize 16384
encryptgrp -
migrating(in) 0
migrating(out) 0

dscli> lsrank -l R23


ID Group State datastate Array RAIDtype extpoolID extpoolnam stgtype exts usedexts
=====================================================================================
R23 1 Normal Normal A15 6 P5 rd6_fb_1 fb 12922 174

dscli> lsarray -l A15


Array State Data RAIDtype arsite Rank DA Pair DDMcap (10^9B) diskclass encrypt
==========================================================================================
A15 Assigned Normal 6 (5+P+Q+S) S16 R23 2 3000.0 NL unsupported

dscli> lsarraysite S16


arsite DA Pair dkcap (10^9B) State Array
===========================================
S16 2 3000.0 Assigned A15

A DS8000 array is defined on an array site with a specific RAID type. A rank is a logical
construct to which an array is assigned. A rank provides a number of extents that are used to
create one or several volumes. A volume can use the DS8000 extents from one or several
ranks. For more information, see 3.2.1, “Array sites” on page 54, 3.2.2, “Arrays” on page 54,
and 3.2.3, “Ranks” on page 55.



Associations: On a configured DS8000 storage system, there is a 1:1 relationship
between an array site, an array, and a rank. However, the numbering sequence can differ
for arrays, ranks, and array sites, for example, array site S16 = array A15 = rank R23.

In most common logical configurations, the numbering is typically in sequence, for
example, array site S1 = array A0 = rank R0. If they are not in order, you must
understand which array the analysis is performed on.

Important: In IBM Spectrum Control, the relationship between array site, array, and rank
can be configured to be displayed in the WebUI as shown in Figure 7-7. The array statistics
are used to measure the rank’s usage because they have a 1:1 relationship.

Example 7-1 on page 232 shows the relationships among a DS8000 rank, an array, and an
array site with a typical divergent numbering scheme by using DSCLI commands. Use the
showrank command to show which volumes have extents on the specified rank.
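The association can also be scripted. The following Python sketch parses saved output of the
lsarray -l and lsrank -l commands (redirected to the text files named in the sketch, which are
assumptions) and prints the array site, array, rank, and extent pool for each rank. The field
positions follow the sample layout in Example 7-1 and might need adjustment for a different
DSCLI level:

# Minimal sketch: correlate DS8000 array sites, arrays, ranks, and extent
# pools from saved DSCLI output.  The input files are assumed to contain the
# plain output of "lsarray -l" and "lsrank -l" as shown in Example 7-1.

def data_lines(path):
    """Yield the whitespace-split data rows, skipping prompts and headers."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith(("dscli>", "ID ", "Array ", "=")):
                yield line.split()

# Rank -> (array, extent pool): lsrank -l columns are
# ID Group State datastate Array RAIDtype extpoolID extpoolnam ...
rank_map = {f[0]: (f[4], f[6]) for f in data_lines("lsrank_l.txt")}

# Array -> array site: in the lsarray -l sample output, the array site and
# rank follow the RAID type description (assumed field positions).
array_site = {f[0]: f[5] for f in data_lines("lsarray_l.txt")}

for rank, (array, pool) in sorted(rank_map.items()):
    print(f"array site {array_site.get(array, '?')} -> array {array} "
          f"-> rank {rank} -> extent pool {pool}")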

In the Array performance chart, you can include both front-end and back-end metrics. The
back-end metrics can be selected on the Disk Metrics Tab. They provide metrics from the
perspective of the controller to the back-end array sites. The front-end metrics relate to the
activity between the server and the controller.

There is a relationship between array operations, cache hit ratio, and percentage of read
requests:
򐂰 When the cache hit ratio is low, the DS8000 storage system has frequent transfers from
DDMs to cache (staging).
򐂰 When the percentage of read requests is high and the cache hit ratio is also high, most of
the I/O requests can be satisfied without accessing the DDMs because of the cache
management prefetching algorithm.
򐂰 When there is heavy write activity, it leads to frequent transfers from cache to DDMs
(destaging).

Comparing the performance of different arrays shows whether the global workload is equally
spread on the DDMs of your DS8000 storage system. Spreading data across multiple arrays
increases the number of DDMs that is used and optimizes the overall performance.

Important: Back-end write metrics do not include the RAID impact. In reality, the RAID 5
write penalty adds additional unreported I/O operations.



Volumes
The volumes, which are also called logical unit numbers (LUNs), are shown in Figure 7-9. The
host server sees the volumes as physical disk drives and treats them as physical disk drives.

Figure 7-9 DS8000 volume

Analysis of volume data facilitates the understanding of the I/O workload distribution among
volumes, and workload characteristics (random or sequential and cache hit ratios). A DS8000
volume can belong to one or several ranks, as shown in Figure 7-9 (for more information, see
3.2.7, “Extent allocation methods” on page 62). Especially in managed multi-rank extent pools
with Easy Tier automatic data relocation enabled, the distribution of a certain volume across
the ranks in the extent pool can change over time. The STAT with its volume heat distribution
report provides additional information about the heat of the data and the data distribution
across the tiers within a pool for each volume. For more information about the STAT and its
report, see 6.5, “Storage Tier Advisor Tool” on page 213.

With IBM Spectrum Control, you can see the Easy Tier Distribution, Easy Tier Status, and the
capacity values for pools and volumes, as shown in Figure 7-10 on page 235 and Figure 7-11
on page 235.



Figure 7-10 Easy Tier Distribution of the DS8000 pools

Figure 7-11 Easy Tier Distribution of the DS8000 volumes

The analysis of volume metrics shows the activity of the volumes on your DS8000 storage
system and can help you perform these tasks:
򐂰 Determine where the most accessed data is and what performance you get from the
volume.
򐂰 Understand the type of workload that your application generates (sequential or random
and the read or write operation ratio).
򐂰 Determine the cache benefits for the read operation (cache management prefetching
algorithm SARC).
򐂰 Determine cache bottlenecks for write operations.
򐂰 Compare the I/O response observed on the DS8000 storage system with the I/O response
time observed on the host.

The relationship of certain RAID arrays and ranks to the DS8000 pools can be derived from
the IBM Spectrum Control RAID array list window, which is shown in Figure 7-7 on page 231.

From there, you can easily see the volumes that belong to a certain pool by right-clicking a
pool, selecting View Properties, and clicking the Volumes tab.



Figure 7-12 shows the relationship of the RAID array, pools, and volumes for a RAID array.

Figure 7-12 Relationship of the RAID array, pool, and volumes for a RAID array

In addition, to quickly associate the DS8000 arrays with array sites and ranks, you might use the
output of the DSCLI commands lsrank -l and lsarray -l, as shown in Example 7-1 on
page 232.

7.2.3 General IBM Spectrum Control measurement considerations


To understand the IBM Spectrum Control measurements of the DS8000 components, it is
helpful to understand the context for the measurement. The measurement facilitates insight
into the behavior of the DS8000 storage system and its ability to service I/O requests. The
DS8000 storage system handles various types of I/O requests differently. Table 7-3 shows the
behavior of the DS8000 storage system for various I/O types.

Table 7-3 DS8000 I/O types and behavior


I/O type DS8000 high-level behavior

Sequential read Pre-stage reads in cache to increase cache hit ratio.

Random read Attempt to find data in cache. If not present in cache, read from back end.

Sequential write Write data to the NVS of the processor complex owning volume and send a
copy of the data to cache in the other processor complex. Upon back-end
destaging, perform prefetching of read data and parity into cache to reduce
the number of disk operations on the back end.

Random write Write data to NVS of the processor complex owning volume and send a copy
of the data to cache in the other processor complex. Destage modified data
from NVS to disk as determined by Licensed Internal Code.

Understanding writes to a DS8000 storage system


When the DS8000 storage system accepts a write request, it processes it without physically
writing to the DDMs. The data is written into both the processor complex to which the volume
belongs and the persistent memory of the second processor complex in the DS8000 storage
system. Later, the DS8000 storage system asynchronously destages the modified data out to
the DDMs. In cases where back-end resources are constrained, NVS delays might occur. IBM
Spectrum Control reports on these conditions with the following front-end metrics: Write
Cache Delay I/O Rate and Write Cache Delay I/O Percentage.



The DS8000 lower interfaces use switched Fibre Channel (FC) connections, which provide a
high data transfer bandwidth. In addition, the destage operation is designed to avoid the write
penalty of RAID 5, if possible. For example, there is no write penalty when modified data to be
destaged is contiguous enough to fill the unit of a RAID 5 stride. A stride is a full RAID 5
stripe. However, when all of the write operations are random across a RAID 5 array, the
DS8000 storage system cannot avoid the write penalty.

Understanding reads on a DS8000 storage system


If the DS8000 storage system cannot satisfy a read I/O request from the cache, it
transfers the data from the DDMs. The DS8000 storage system suspends the I/O request until it
reads the data. This situation is called a cache miss. If an I/O request is a cache miss, the
response time includes the data transfer time between host and cache, and also the time that
it takes to read the data from the DDMs into cache before sending it to the host. The various
read hit ratio metrics show how efficiently the cache works on the DS8000 storage system.
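As a simple illustration, the read hit ratio for an interval can be derived from the raw read
counters; the values below are hypothetical, and IBM Spectrum Control performs the equivalent
calculation when it reports its cache hit metrics:

# Hypothetical raw counters for one sample interval.
read_io_count = 48_500      # total front-end read operations
read_cache_hits = 41_700    # reads satisfied entirely from cache

read_hit_ratio = 100.0 * read_cache_hits / read_io_count
cache_misses = read_io_count - read_cache_hits   # reads that required staging

print(f"Read hit ratio: {read_hit_ratio:.1f}%")
print(f"Reads that required back-end staging: {cache_misses}")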

The read hit ratio depends on the characteristics of data on your DS8000 storage system and
applications that use the data. If you have a database and it has a high locality of reference, it
shows a high cache hit ratio because most of the data that is referenced can remain in the
cache. If your database has a low locality of reference, but it has the appropriate sets of
indexes, it might also have a high cache hit ratio because the entire index can remain in the
cache.

A database can be cache-unfriendly by nature. An example of a cache-unfriendly workload is
a workload that consists of large sequential reads to a highly fragmented file system. If an
application reads this file, the cache hit ratio is low because the application never reads the
same data because of the nature of sequential access. In this case, defragmentation of the
file system improves the performance. You cannot determine whether increasing the size of
cache improves the I/O performance without knowing the characteristics of the data on your
DS8000 storage system.

Monitor the read hit ratio over an extended period:


򐂰 If the cache hit ratio is historically low, it is most likely because of the nature of the data
access patterns. Defragmenting the file system and making indexes if none exist might
help more than adding cache.
򐂰 If you have a high cache hit ratio initially and it decreases as the workload increases,
adding cache or moving part of the data to volumes that are associated with the other
processor complex might help.

Interpreting the read-to-write ratio


The read-to-write ratio depends on how the application programs issue I/O requests. In
general, the overall average read-to-write ratio is in the range of 75% - 80% reads.

For a logical volume that has sequential files, you must understand the application types that
access those sequential files. Normally, these sequential files are used for either read only or
write only at the time of their use. The DS8000 cache management prefetching algorithm
(SARC) determines whether the data access pattern is sequential. If the access is sequential,
contiguous data is prefetched into cache in anticipation of the next read request.

IBM Spectrum Control reports the reads and writes through various metrics. For a description
of these metrics in greater detail, see 7.3, “IBM Spectrum Control data collection
considerations” on page 238.



7.3 IBM Spectrum Control data collection considerations
This section describes the performance data collection considerations, such as time stamps,
durations, and intervals.

7.3.1 Time stamps


The IBM Spectrum Control server uses the time stamp of the source devices when it inserts data
into the database; it does not apply any additional offset if the server clock is not
synchronized with the rest of your environment. Keep the clocks synchronized because you might
need to compare the performance data of the DS8000 storage system with the data that is
gathered on a server or other connected SAN devices, such as SAN Volume Controller or switch
metrics.

Although the time configuration of the device is written to the database, reports are always
based on the time of the IBM Spectrum Control server. It receives the time zone information
from the devices (or the NAPIs) and uses this information to adjust the time in the reports to
the local time. Certain devices might convert the time into Coordinated Universal Time (UTC)
time stamps and not provide any time zone information.

This complexity is necessary to compare the information from two subsystems in different
time zones from a single administration point. This administration point is the GUI, not the
IBM Spectrum Control server. If you open the GUI in different time zones, a performance
diagram might show a distinct peak at different times, depending on its local time zone.

When using IBM Spectrum Control to compare data from a server (for example, iostat data)
with the data of the storage subsystem, it is important to know the time stamp of the storage
subsystem. The time zone of the device is shown in the DS8000 Properties window.

To ensure that the time stamps on the DS8000 storage system are synchronized with the
other infrastructure components, the DS8000 storage system provides features for
configuring a Network Time Protocol (NTP) server. To modify the time and configure the HMC
to use an NTP server, see the following publications:
򐂰 IBM DS8870 Architecture and Implementation (Release 7.5), SG24-8085
򐂰 IBM DS8880 Architecture and Implementation (Release 8), SG24-8323

As IBM Spectrum Control can synchronize multiple performance charts that are opened in
the WebUI to display the metrics at the same time, use an NTP server for all components in
the SAN environment.

7.3.2 Duration
IBM Spectrum Control collects data continuously. From a performance management
perspective, collecting data continuously means that performance data exists to facilitate
reactive, proactive, and even predictive processes, as described in Chapter 7, “Practical
performance management” on page 221.

7.3.3 Intervals
In IBM Spectrum Control, the data collection interval is referred to as the sample interval. The
sample interval for the DS8000 performance data collection tasks is 1 - 60 minutes. A shorter
sample interval results in a more granular view of performance data at the expense of
requiring additional database space. The appropriate sample interval depends on the
objective of the data collection. Table 7-4 on page 239 displays example data collection
objectives and reasonable values for a sample interval.



Table 7-4 Sample interval examples

Objective                                              Sample interval (minutes)
Problem determination/service-level agreement (SLA)    1
Ongoing performance management                         5
Baseline or capacity planning                          20

In support of ongoing performance management, a reasonable sample interval is 5 minutes.


An interval of 5 minutes provides enough granularity to facilitate reactive performance
management. In certain cases, the level of granularity that is required to identify the
performance issue is less than 5 minutes. In these cases, you can reduce the sample interval
to a 1-minute interval. IBM Spectrum Control also provides reporting at higher intervals,
including hourly and daily. It provides these views automatically.
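The trade-off between granularity and repository growth can be estimated with a quick
calculation. The following Python sketch is a rough sizing illustration only; the number of
volumes and the bytes-per-record figure are placeholder assumptions, not documented IBM
Spectrum Control values:

# Rough estimate of how many volume performance samples are collected per day
# for a given sample interval.  The volume count and bytes-per-record figure
# are placeholder assumptions for illustration only.
volumes = 2000
bytes_per_record = 200

for interval_minutes in (1, 5, 20):
    samples_per_day = (24 * 60 // interval_minutes) * volumes
    approx_mib_per_day = samples_per_day * bytes_per_record / (1024 * 1024)
    print(f"{interval_minutes:>2}-minute interval: {samples_per_day:,} samples/day "
          f"(~{approx_mib_per_day:.0f} MiB/day at the assumed record size)")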

Attention: Although the 1-minute interval collection is the default for some devices, the
5-minute interval is considered a good average for reviewing data. However, some issues
can occur within this time interval and the peaks are averaged out, so they might not be as
apparent as they are with a 1-minute interval collection. In such situations, the 1-minute
interval is more appropriate because it offers a more granular view that results in a more
effective analysis. However, a 1-minute interval results in a copious amount of data being
collected and a fast-growing database, so it should be used only for troubleshooting
purposes.

7.4 IBM Spectrum Control performance metrics


IBM Spectrum Control has many metrics available for reporting the health and performance of
the DS8000 storage system.

For a list of available performance metrics for the DS8000 storage system, see the IBM
Spectrum Control IBM Knowledge Center:
http://ibm.co/1J59pAq

7.4.1 DS8000 key performance indicator thresholds


This section provides some additional information about a subset of critical metrics. It
provides suggested threshold values as general recommendations for alerting for various
DS8000 components, as described in 7.4.2, “Alerts and thresholds” on page 240. As with any
recommendation, you must adjust them for the performance requirements of your
environment.

Note: IBM Spectrum Control also has other metrics that can be adjusted and configured
for your specific environment to suit customer demands.



The components, metrics, and suggested thresholds are shown in Table 7-5.

Table 7-5 DS8000 key performance indicator thresholds

Component  Tab             Metric                        Threshold  Comment
Node       Volume Metrics  Cache Holding Time            < 200      Indicates high cache track turnover and
                                                                    possibly cache constraint.
Node       Volume Metrics  Write Cache Delay Percentage  > 1%       Indicates writes delayed because of
                                                                    insufficient memory resources.
Array      Disk Metrics    Disk Utilization Percentage   > 70%      Indicates disk saturation. For IBM Spectrum
                                                                    Control, the default value on this threshold
                                                                    is 50%.
Array      Disk Metrics    Overall Response Time         > 35       Indicates busy disks.
Array      Disk Metrics    Write Response Time           > 35       Indicates busy disks.
Array      Disk Metrics    Read Response Time            > 35       Indicates busy disks.
Port       Port Metrics    Total Port I/O Rate           Depends    Indicates transaction-intensive load. The
                                                                    configuration depends on the HBA, switch,
                                                                    and other components.
Port       Port Metrics    Total Port Data Rate          Depends    If the port data rate is close to the
                                                                    bandwidth, this rate indicates saturation.
                                                                    The configuration depends on the HBA,
                                                                    switch, and other components.
Port       Port Metrics    Port Send Response Time       > 2        Indicates contention on the I/O path from
                                                                    the DS8000 storage system to the host.
Port       Port Metrics    Port Receive Response Time    > 2        Indicates a potential issue on the I/O path
                                                                    or the DS8000 storage system back end.
Port       Port Metrics    Total Port Response Time      > 2        Indicates a potential issue on the I/O path
                                                                    or the DS8000 storage system back end.
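When performance data is exported, for example, with the WebUI chart export that is described in
7.5.1, “WebUI reports” on page 243, the indicator thresholds in Table 7-5 can also be checked
programmatically. The following Python sketch assumes a hypothetical CSV export with columns
timestamp, component, disk_utilization_pct, and write_cache_delay_pct; map them to the metric
names in your actual export:

import csv

# Suggested thresholds from Table 7-5 (adjust them to your environment).
DISK_UTILIZATION_PCT = 70.0
WRITE_CACHE_DELAY_PCT = 1.0

# Hypothetical CSV export; the file name and column names are assumptions.
with open("ds8000_kpi_export.csv", newline="") as f:
    for row in csv.DictReader(f):
        if float(row["disk_utilization_pct"]) > DISK_UTILIZATION_PCT:
            print(f"{row['timestamp']} {row['component']}: disk utilization "
                  f"{row['disk_utilization_pct']}% exceeds {DISK_UTILIZATION_PCT}%")
        if float(row["write_cache_delay_pct"]) > WRITE_CACHE_DELAY_PCT:
            print(f"{row['timestamp']} {row['component']}: write-cache delay "
                  f"{row['write_cache_delay_pct']}% exceeds {WRITE_CACHE_DELAY_PCT}%")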

7.4.2 Alerts and thresholds


IBM Spectrum Control uses the term performance monitor for the process that is set up to
gather data from a subsystem. This monitor is shown in the device’s Data Collection window.
The performance monitor collects information at certain intervals and stores the data in its
database. After inserting the data, the data is available for analysis by using several methods
that are described in 7.5, “IBM Spectrum Control reporting options” on page 242.

Because the intervals are usually 1 - 15 minutes, IBM Spectrum Control is not an online or
real-time monitor.

You can use IBM Spectrum Control to define performance-related alerts that can trigger an
event when the defined thresholds are reached. Even though it works in a similar manner to a
monitor without user intervention, the actions are still performed at the intervals specified
during the definition of the performance monitor job.



IBM Spectrum Control offers advanced options to set thresholds to monitor DS8000
performance. For each of the available performance metrics, you can set thresholds with
different severities, suppressions, and notifications. To do so, complete the following steps:
1. Right-click the DS8000 storage system from the Block Storage Subsystems list to set
alerts or thresholds on the Block Storage Subsystems list and select Edit Alert
Definitions. The window that is shown in Figure 7-13 opens.


Figure 7-13 Set a threshold

2. Select the component for which you want to set the threshold. In this example, click
Controllers 7/7.
3. Click the Performance tab.
4. Click Add Metric.
5. Select the check box for the metric for which you want to set the Threshold.
6. Click OK.
7. Enable the alert (1).
8. Specify the Threshold (2) and the severity (3).
9. Click the envelope to specify the notification (4).
10.Click the struck through exclamation mark to specify the suppression settings (5).
11.Click Save.

Reference: For more information about setting Thresholds and Alert suppressions in IBM
Spectrum Control, see IBM Spectrum Family: IBM Spectrum Control Standard Edition,
SG24-8321.



Configure the thresholds that are most important and most relevant to the needs of your
environment to assist with good planning.

The alerts for a DS8000 storage system can be seen, filtered, removed, acknowledged, or
exported in the storage system Alert window, as shown in Figure 7-14.

Figure 7-14 Alert window

Limitations to alert definitions


There are a few limitations to alert definitions:
򐂰 Thresholds are always active. They cannot be set to exclude specific periods. This
limitation can be mitigated by using alert suppression settings.
򐂰 Detailed knowledge of the workload is required to use thresholds effectively.

False positive alerts: Configuring thresholds too conservatively can lead to an excessive
number of false positive alerts.

7.5 IBM Spectrum Control reporting options


IBM Spectrum Control provides numerous ways of reporting the DS8000 performance data.
This section provides an overview of the various options and their appropriate usage in
ongoing performance management of a DS8000 storage system (Table 7-6). The categories
of usage are based on definitions in “Tactical performance subprocess” on page 560.

Table 7-6 Report category, usage, and considerations


Report type Performance process

WebUI reports Tactical

Predefined and Custom Cognos reports Tactical and Strategic

TPCTOOL Tactical and Strategic

Custom Reports (SQL and ODBC) Tactical and Strategic

All of the reports use the metrics that are available for the DS8000 storage system. The
remainder of this section describes each of the report types from a general aspect.



The STAT can also provide additional performance information for your DS8000 storage
system based on the DS8000 Easy Tier performance statistics collection. It does this by
revealing data heat information at the system and volume level in addition to configuration
recommendations. For more information about the STAT, see 6.5, “Storage Tier Advisor Tool”
on page 213.

7.5.1 WebUI reports


In IBM Spectrum Control WebUI, export options are available. To export the data that is used
in the performance chart, use the export function, which is found in the upper left of the chart
in Figure 7-15.

Figure 7-15 Performance export functions

To export the summary table underneath the chart, click Action → More → Export, and
select the correct format.



7.5.2 Cognos reports
It is possible to run and view predefined reports and create custom reports with Cognos in
IBM Spectrum Control. The Cognos reporting engine is accessible by clicking Cognos at the
top navigation bar.

IBM Spectrum Control provides over 70 predefined reports that show capacity and
performance information that is collected by IBM Spectrum Control, as shown in Figure 7-16.

Figure 7-16 IBM Spectrum Control - Cognos Reports

Charts are automatically generated for most of the predefined reports. Depending on the type
of resource, the charts show statistics for space usage, workload activity, bandwidth
percentage, and other statistics, as shown in Figure 7-17. You can schedule reports and
specify the report output as HTML, PDF, and other formats. You can also configure reports to
save the report output to your local file system, and to send reports as mail attachments.

Figure 7-17 IBM Spectrum Control - Cognos Performance Report example

If the wanted report is not available as a predefined report, you can use either Query Studio
or Report Studio to create your own custom reports.



With Cognos Query Studio, you can easily create simple reports by dragging and dropping, and
by using functions such as sorting, filtering, conditional styles, automatic calculations, and
charting. You do not need any skills in writing SQL or any knowledge about the database tables
or views.

Cognos Report Studio is a professional report authoring tool and offers many advanced
functions:
򐂰 Creating multiple report pages
򐂰 Creating multiple queries that can be joined, unioned, and so on
򐂰 Rendering native SQL queries
򐂰 Generating rollup reports
򐂰 Performing complex calculations
򐂰 Using additional chart types with baseline and trending functions
򐂰 Creating active reports, which are interactive reports that can be used on your mobile devices

Reference: For more information about the usage of Cognos and its functions, see IBM
Tivoli Storage Productivity Center V5.1 Technical Guide, SG24-8053 and IBM Spectrum
Family: IBM Spectrum Control Standard Edition, SG24-8321.

7.5.3 TPCTOOL
You can use the TPCTOOL command-line interface (CLI) to extract data from the IBM
Spectrum Control database. It requires no knowledge of the IBM Spectrum Control schema
or SQL query skills, but you must understand how to use the tool.

For more information about TPCTOOL, see the following resources:


򐂰 IBM Tivoli Storage Productivity Center V5.2 Release Guide, SG24-8204
򐂰 Reporting with TPCTOOL, REDP-4230

7.5.4 Native SQL Reports


IBM Spectrum Control stores all data in a DB2 database that is called TPCDB, which you can
query by using native SQL.

You also can make connections by using the ODBC interface, for example, with Microsoft
Excel.

Note: Always specify with ur for read only in your SQL queries. Otherwise, your tables
might get locked during the read operation, which might slow down the performance of the
TPCDB.

For more information, see the following websites:


򐂰 http://ibm.co/1QGS3uC
򐂰 http://ibm.co/1PKnYcM

An example of querying the IBM Spectrum Control database by using native SQL with
Microsoft Excel is in IBM Spectrum Family: IBM Spectrum Control Standard Edition,
SG24-8321.
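As an illustration of the read-only recommendation, the following Python sketch uses the
ibm_db module to run a query with the WITH UR clause appended. The connection details, schema,
and view name are placeholders; take the actual view names from the exposed-views documentation
that is referenced below:

import ibm_db  # IBM Db2 driver for Python

# Connection details are placeholders; adjust them to your TPCDB instance.
conn = ibm_db.connect(
    "DATABASE=TPCDB;HOSTNAME=spectrum-control-host;PORT=50000;"
    "PROTOCOL=TCPIP;UID=db2user;PWD=secret;", "", "")

# The schema and view name are placeholders for one of the documented
# exposed views.  The trailing WITH UR clause keeps the query read-only and
# avoids locking the tables during the read operation.
sql = ("SELECT * FROM TPCREPORT.EXPOSED_VIEW_NAME "
       "FETCH FIRST 10 ROWS ONLY WITH UR")

stmt = ibm_db.exec_immediate(conn, sql)
row = ibm_db.fetch_assoc(stmt)
while row:
    print(row)
    row = ibm_db.fetch_assoc(stmt)

ibm_db.close(conn)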

Reference: Documentation about the IBM Spectrum Control Exposed Views is available at
the following website:
http://www.ibm.com/support/docview.wss?uid=swg27023813



7.6 Advanced analytics and self-service provisioning
IBM Spectrum Control Advanced Edition provides additional functions, such as advanced
storage analytics for storage optimization and self-provisioning capabilities.

These functions are not included in IBM Spectrum Control Standard Edition.

7.6.1 Advanced analytics


IBM Spectrum Control uses real performance metrics and advanced analytics to make
recommendations to optimize storage pools and volumes by redistributing workloads across
the storage environment. By using real performance metrics, it enables optimization decisions
to be made based on actual usage patterns rather than on predictions.

The storage optimization function of IBM Spectrum Control uses the VDisk mirroring
capabilities of the SAN Volume Controller, so it can be used only for devices that are
configured as back-end storage for storage virtualizers.

For more information about the storage optimization function of IBM Spectrum Control, see
IBM SmartCloud Virtual Storage Center, SG24-8239.

7.6.2 Self-service provisioning


With the self-service provisioning capability of IBM Spectrum Control, you can offer a service
catalog that provides policy-based provisioning that is restricted to the user login.

The self-service provisioning capability of IBM Spectrum Control can also be used for
DS8000 pools that are not behind a storage virtualizer.

To take advantage of the self-service provisioning capabilities that are available in IBM
Spectrum Control Advanced Edition, some configuration is required. This configuration is
called cloud configuration, and it specifies storage tiers, service classes, and capacity pools
(see Figure 7-18).

Figure 7-18 IBM Spectrum Control cloud configuration

After the cloud configuration is done, you can do storage provisioning by specifying the
storage capacity and quality of a given service class. Then, volumes are created with the
required characteristics that are defined within that selected service class.



For more information about the advanced analytics and self-provisioning functions, see the
following resources:
򐂰 IBM SmartCloud Virtual Storage Center, SG24-8239
򐂰 IBM Tivoli Storage Productivity Center V5.2 Release Guide, SG24-8204
򐂰 IBM Tivoli Storage Productivity Center Beyond the Basics, SG24-8236

7.7 Using IBM Spectrum Control network functions


All SAN switch and director vendors provide management software that includes performance
monitoring capabilities. The real-time SAN statistics, such as port utilization and throughput
information available from SAN management software, can be used to complement the
performance information that is provided by host servers or storage subsystems.

For more information about monitoring performance through a SAN switch or director point
product, see the following websites:
򐂰 http://www.brocade.com
򐂰 http://www.cisco.com

Most SAN management software includes options to create SNMP alerts based on
performance criteria, and to create historical reports for trend analysis. Certain SAN vendors
offer advanced performance monitoring capabilities, such as measuring I/O traffic between
specific pairs of source and destination ports, and measuring I/O traffic for specific LUNs.

In addition to the vendor point products, IBM Spectrum Control can be used as a central data
repository and reporting tool for switch environments. It lacks real-time capabilities, but
when it is integrated with Network Advisor, it collects and reports on data at 1 - 60-minute
intervals for performance analysis over a selected time frame.

IBM Spectrum Control provides facilities to report on fabric topology, configurations and
switches, and port performance and errors. In addition, you can use Spectrum Control to
configure alerts or thresholds for Total Port Data Rate and Total Port Packet Rate.
Configuration options allow the creation of events to be triggered if thresholds are exceeded.
Although it does not provide real-time monitoring, it offers several advantages over traditional
vendor point products:
򐂰 Ability to store performance data from multiple switch vendors in a common database
򐂰 Advanced reporting and correlation between host data and switch data through custom
reports
򐂰 Centralized management and reporting
򐂰 Aggregation of port performance data for the entire switch

In general, you need to analyze SAN statistics for these reasons:


򐂰 Ensure that there are no SAN bottlenecks that limit the DS8000 I/O traffic, for example,
analyze any link utilization over 80%.
򐂰 Confirm that multipathing/load balancing software operates as expected.
򐂰 Isolate the I/O activity contributed by adapters on different host servers that share storage
subsystem I/O ports.
򐂰 Isolate the I/O activity contributed by different storage subsystems accessed by the same
host server.



For more information about IBM Spectrum Control Network functions and how to work with
them, see the following resources:
򐂰 IBM Tivoli Storage Productivity Center V5.2 Release Guide, SG24-8204
򐂰 IBM Tivoli Storage Productivity Center Beyond the Basics, SG24-8236

7.8 End-to-end analysis of I/O performance problems


To support tactical performance management processes, problem determination skills and
processes must exist. This section explains the logical steps that are required to perform
successful problem determination for I/O performance issues. The process of I/O
performance problem determination consists of the following steps:
򐂰 Define the problem.
򐂰 Classify the problem.
򐂰 Identify the I/O bottleneck.
򐂰 Implement changes to remove the I/O bottleneck.
򐂰 Validate that the changes that were made resolved the issue.

Perceived or actual I/O bottlenecks can result from hardware failures on the I/O path,
contention on the server, contention on the SAN Fabric, contention on the DS8000 front-end
ports, or contention on the back-end disk adapters or disk arrays. This section provides a
process for diagnosing these scenarios by using IBM Spectrum Control and external data.
This process was developed for identifying specific types of problems and is not a substitute
for common sense, knowledge of the environment, and experience. Figure 7-19 shows the
high-level process flow.

Figure 7-19 I/O performance analysis process



I/O bottlenecks that are referenced in this section relate to one or more components on the
I/O path that reached a saturation point and can no longer achieve the I/O performance
requirements. I/O performance requirements are typically throughput-oriented or
transaction-oriented. Heavy sequential workloads, such as tape backups or data warehouse
environments, might require maximum bandwidth and use large sequential transfers.
However, they might not have stringent response time requirements. Transaction-oriented
workloads, such as online banking systems, might have stringent response time
requirements, but have no requirements for throughput.

If a server processor or memory resource shortage is identified, it is important to take the
necessary remedial actions. These actions might include but are not limited to adding
additional processors, optimizing processes or applications, or adding additional memory. If
there are not any resources that are constrained on the server but the end-to-end I/O
response time is higher than expected for the DS8000 storage system (see “Rules” on
page 349), a resource constraint likely exists in one or more of the SAN components.

To troubleshoot performance problems, IBM Spectrum Control data must be augmented with
host performance and configuration data. Figure 7-20 shows a logical end-to-end view from a
measurement perspective.

Figure 7-20 End-to-end measurement

Although IBM Spectrum Control does not provide host performance, configuration, or error
data, IBM Spectrum Control provides performance data from host connections, SAN
switches, and the DS8000 storage system, and configuration information and error logs from
SAN switches and the DS8000 storage system.

Tip: Performance analysis and troubleshooting must always start top-down, starting with
the application (for example, database design and layout), then the operating system,
server hardware, SAN, and then storage. The tuning potential is greater at the “higher”
levels. The best I/O tuning is never carried out because server caching or a better
database design eliminated the need for it.



Process assumptions
This process assumes that the following conditions exist:
򐂰 The server is connected to the DS8000 storage system natively.
򐂰 Tools exist to collect the necessary performance and configuration data for each
component along the I/O path (server disk, SAN fabric, and the DS8000 arrays, ports, and
volumes).
򐂰 Skills exist to use the tools, extract data, and analyze data.
򐂰 Data is collected in a continuous fashion to facilitate performance management.

Process flow
The order in which you conduct the analysis is important. Use the following process:
1. Define the problem. A sample questionnaire is provided in “Sample questions for an AIX
host” on page 559. The goal is to assist you in determining the problem background and
understand how the performance requirements are not being met.

Changes: Before proceeding any further, ensure that an adequate investigation is
carried out to identify any changes that were made in the environment. Experience has
proven that a correlation exists between changes made to the environment and sudden
“unexpected” performance issues.

2. Consider checking the application level first. Has all potential tuning on the database level
been performed? Does the layout adhere to the vendor recommendations, and is the
server adequately sized (RAM, processor, and buses) and configured?
3. Correctly classify the problem by identifying hardware or configuration issues. Hardware
failures often manifest themselves as performance issues because I/O is degraded on one
or more paths. If a hardware issue is identified, all problem determination efforts must
focus on identifying the root cause of the hardware errors:
a. Gather any errors on any of the host paths.

Physical component: If you notice significant errors in the “datapath query device”
or the “pcmpath query device” and the errors increase, there is most likely a problem
with a physical component on the I/O path.

b. Gather the host error report and look for Small Computer System Interface (SCSI) or
Fibre errors.

Hardware: Often a hardware error that relates to a component on the I/O path
shows as a TEMP error. A TEMP error does not exclude a hardware failure. You
must perform diagnostic tests on all hardware components in the I/O path, including
the host bus adapter (HBA), SAN switch ports, and the DS8000 HBA ports.

c. Gather the SAN switch configuration and errors. Every switch vendor provides different
management software. All of the SAN switch software provides error monitoring and a
way to identify whether there is a hardware failure with a port or application-specific
integrated circuit (ASIC). For more information about identifying hardware failures, see
your vendor-specific manuals or contact vendor support.



Patterns: As you move from the host to external resources, remember any patterns.
A common error pattern that you see involves errors that affect only those paths on
the same HBA. If both paths on the same HBA experience errors, the errors are a
result of a common component. The common component is likely to be the host
HBA, the cable from the host HBA to the SAN switch, or the SAN switch port.
Ensure that all of these components are thoroughly reviewed before proceeding.

d. If errors exist on one or more of the host paths, determine whether there are any
DS8000 hardware errors. Log on to the HMC as customer/cust0mer and look to ensure
that there are no hardware alerts. Figure 7-21 provides a sample of a healthy DS8000
storage system. If there are any errors, you might need to open a problem ticket (PMH)
with DS8000 hardware support (2107 engineering).

Figure 7-21 DS8000 storage system - healthy HMC

4. After validating that no hardware failures exist, analyze server performance data and
identify any disk bottlenecks. The fundamental premise of this methodology is that I/O
performance degradation that relates to SAN component contention can be observed at
the server through analysis of the key server-based I/O metrics.
Degraded end-to-end I/O response time is the strongest indication of I/O path contention.
Typically, server physical disk response times measure the time that a physical I/O request
takes from the moment that the request was initiated by the device driver until the device
driver receives an interrupt from the controller that the I/O completed. The measurements
are displayed as either service time or response time. They are averaged over the
measurement interval. Typically, server wait or queue metrics refer to time spent waiting at
the HBA, which is usually an indication of HBA saturation. In general, you need to interpret
the service times as response times because they include potential queuing at various
storage subsystem components, for example:
– Switch
– Storage HBA
– Storage cache
– Storage back-end disk controller
– Storage back-end paths
– Disk drives

Important: Subsystem-specific load-balancing software usually does not add any
performance impact and can be viewed as a pass-through layer.



In addition to the disk response time and disk queuing data, gather the disk activity rates,
including read I/Os, write I/Os, and total I/Os because they show which disks are active:
a. Gather performance data, as shown in Table 7-7.

Table 7-7 Native tools and key metrics

OS                Native tool       Command/Object                     Metric/Counter
AIX               iostat, filemon   iostat -D; filemon                 read time (ms), write time (ms),
                                    -o /tmp/fmon.log -O all            reads, writes, queue length
HP-UX             sar               sar -d                             avserv (ms), avque, blks/s
Linux             iostat            iostat -d                          svctm (ms), avgqu-sz, tps
Solaris           iostat            iostat -xn                         svc_t (ms), avque, blks/s
Microsoft         perfmon           Physical disk                      Avg Disk Sec/Read, Avg Disk Sec/Write,
Windows Server                                                         Read Disk Queue Length, Write Disk
                                                                       Queue Length, Disk Reads/sec,
                                                                       Disk Writes/sec
z Systems         Resource          See Chapter 14, “Performance       N/A
                  Measurement       considerations for IBM z Systems
                  Facility (RMF)/   servers” on page 459.
                  System Management
                  Facilities (SMF)

I/O-intensive disks: The number of total I/Os per second indicates the relative
activity of the device. This relative activity provides a metric to prioritize the analysis.
Those devices with high response times and high activity are more important to
understand than devices with high response time and infrequent access. If
analyzing the data in a spreadsheet, consider creating a combined metric of
Average I/Os × Average Response Time to provide a method for identifying the most
I/O-intensive disks. You can obtain additional detail about OS-specific server
analysis in the OS-specific chapters.

b. Gather configuration data (Subsystem Device Driver (SDD)/Subsystem Device Driver
Path Control Module (SDDPCM)), as shown in Table 7-8 on page 253. In addition to
the multipathing configuration data, you must collect configuration information for the
host and DS8000 HBAs, which includes the bandwidth of each adapter.



Table 7-8 Path configuration data

OS         Tool                           Command                  Key information   Other
All UNIX   SDD/SDDPCM                     datapath query essmap,   LUN serial        Rank (a), logical subsystem
                                          pcmpath query essmap                       (LSS), storage subsystem
Windows    SDD/Subsystem Device Driver    datapath query essmap    LUN serial        Rank (a), LSS, storage
           Device Specific Module                                                    subsystem
           (SDDDSM)

a. The rank column is not meaningful for multi-rank extent pools on the DS8000 storage system.

Multipathing: Ensure that multipathing works as designed. For example, if there are
two paths that are zoned per HBA to the DS8000 storage system, there must be four
active paths per LUN. Both SDD and SDDPCM use an active/active configuration of
multipathing, which means that traffic flows across all the paths fairly evenly. For
native DS8000 connections, the absence of activity on one or more paths indicates
a problem with the SDD behavior.

c. Format the data and correlate the host LUNs with their associated DS8000 resources.
Formatting the data is not required for analysis, but it is easier to analyze formatted
data in a spreadsheet.
The following steps represent the logic that is required to format the data and do not
represent literal steps. You can codify these steps in scripts (a hedged example sketch
is shown at the end of step 4):
i. Read the configuration file.
ii. Build a hdisk hash with key = hdisk and value = LUN SN.
iii. Read I/O response time data.
iv. Create hashes for each of the following values with hdisk as the key: Date, Start
time, Physical Volume, Reads, Avg Read Time, Avg Read Size, Writes, Avg Write
Time, and Avg Write Size.
v. Print the data to a file with headers and commas to separate the fields.
vi. Iterate through the hdisk hash and use the common hdisk key to index into the other
hashes and print those hashes that have values.
d. Analyze the host performance data:
i. Determine whether I/O bottlenecks exist by summarizing the data and analyzing
key performance metrics for values in excess of the thresholds that are described in
“Rules” on page 349. Identify those vpaths/LUNs with poor response time. We
show an example in “Analyzing performance data” on page 357. Hardware errors
and multipathing configuration issues must already be excluded. The hot LUNs
must already be identified. Proceed to step 5 on page 254 to determine the root
cause of the performance issue.
ii. If no degraded disk response times exist, the issue is likely not internal to the
server.
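The following Python sketch is one way to codify the formatting steps in step 4c. The input
file names and column layouts are assumptions for illustration: a multipath configuration
export that maps each hdisk to its LUN serial number, and a per-hdisk response time extract.
Adapt them to the actual output of your tools:

import csv

# Hypothetical inputs (layouts are assumptions; adapt to your environment):
#   hdisk_to_lun.csv : hdisk,lun_serial   (derived from "pcmpath query essmap")
#   hdisk_perf.csv   : date,time,hdisk,reads,avg_read_ms,writes,avg_write_ms

# Steps i/ii: read the configuration data and build the hdisk -> LUN map.
with open("hdisk_to_lun.csv", newline="") as f:
    lun_of = {row["hdisk"]: row["lun_serial"] for row in csv.DictReader(f)}

# Steps iii/iv: read the response time data keyed by hdisk.
with open("hdisk_perf.csv", newline="") as f:
    samples = list(csv.DictReader(f))

# Steps v/vi: print one combined record per sample, adding the LUN serial so
# that the host view can be correlated with the DS8000 volume statistics.
with open("correlated.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["date", "time", "hdisk", "lun_serial",
                     "reads", "avg_read_ms", "writes", "avg_write_ms"])
    for s in samples:
        writer.writerow([s["date"], s["time"], s["hdisk"],
                         lun_of.get(s["hdisk"], "unknown"),
                         s["reads"], s["avg_read_ms"],
                         s["writes"], s["avg_write_ms"]])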



5. If there are disk constraints that are identified, continue the identification of the root cause
by collecting and analyzing the DS8000 configuration and performance data:
a. Gather the configuration information. IBM Spectrum Control WebUI can also be used
to gather configuration data through the Properties window, as shown in Figure 7-22.


Figure 7-22 DS8000 Properties window in IBM Spectrum Control

Analyze the DS8000 performance data first: Check for Alerts (2) and errors (3) in
the left navigation. Then, look at the performance data of the internal resources (4).
Analysis of the SAN fabric and the DS8000 performance data can be completed in
either order. However, SAN bottlenecks occur less frequently than disk bottlenecks,
so it can be more efficient to analyze the DS8000 performance data first.

b. Use IBM Spectrum Control to gather the DS8000 performance data for subsystem
ports, pools, arrays, and volumes, nodes, and host connections. Compare the key
performance indicators from Table 7-5 on page 240 with the performance data. To
analyze the performance, complete the following steps:
i. For those server LUNs that show poor response time, analyze the associated
volumes during the same period. If the problem is on the DS8000 storage system, a
correlation exists between the high response times observed on the host and the
volume response times observed on the DS8000 storage system.



Compare the same period: Meaningful correlation with the host performance
measurement and the previously identified hot LUNs requires analysis of the
DS8000 performance data for the same period that the host data was collected.
The synchronize time function of IBM Spectrum Control can help you with this
task (see Figure 7-23 (1)). For more information, see IBM Tivoli Storage
Productivity Center V5.2 Release Guide, SG24-8204. For more information
about time stamps, see 7.3.1, “Time stamps” on page 238.

ii. Correlate the hot LUNs with their associated disk arrays. When using the IBM
Spectrum Control WebUI, the relationships are provided automatically in the
drill-down feature, as shown in Figure 7-23 (2).

Figure 7-23 Drill-down function of IBM Spectrum Control

If you use Cognos exports and want to correlate the volume data with the rank data, you
can do so manually or by using a script. If multi-rank extent pools with storage pool
striping or Easy Tier managed pools are used, one volume can exist on multiple ranks.
Analyze storage subsystem ports for the ports associated with the server in
question.
6. Continue the identification of the root cause by collecting and analyzing SAN fabric
configuration and performance data:
a. Gather the connectivity information and establish a visual diagram of the environment.

Visualize the environment: Sophisticated tools are not necessary for creating this
type of view; however, the configuration, zoning, and connectivity information must
be available to create a logical visual representation of the environment.



b. Gather the SAN performance data. Each vendor provides SAN management
applications that provide the alerting capability and some level of performance
management. Often, the performance management software is limited to real-time
monitoring, and historical data collection features require additional licenses. In
addition to the vendor-provided solutions, IBM Spectrum Control can collect further
metrics, which are shown in Table 7-8 on page 253.
c. Consider graphing the Overall Port Response Time, Port Bandwidth Percentage, and
Total Port Data Rate metrics to determine whether any of the ports along the I/O path
are saturated during the time when the response time is degraded. If the Total Port
Data Rate is close to the maximum expected throughput for the link or the bandwidth
percentages that exceed their thresholds, this situation is likely a contention point. You
can add additional bandwidth to mitigate this type of issue either by adding additional
links or by adding faster links. Adding links might require upgrades of the server HBAs
and the DS8000 HAs to take advantage of the additional switch link capacity.
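A quick way to apply this guidance is to compare the measured data rate of each port with the
nominal bandwidth of its link. The following Python sketch uses a hypothetical dictionary of
measured rates and assumes roughly 100 MBps of usable bandwidth per Gbps of link speed, which
is an approximation only:

# Hypothetical measured port data rates (MBps) and link speeds (Gbps).
ports = {
    "I0030": {"data_rate_mbps": 720.0, "link_gbps": 8},
    "I0031": {"data_rate_mbps": 150.0, "link_gbps": 8},
    "I0100": {"data_rate_mbps": 1450.0, "link_gbps": 16},
}

# Approximation: about 100 MBps of usable bandwidth per Gbps of link speed.
for port, stats in sorted(ports.items()):
    usable_mbps = stats["link_gbps"] * 100.0
    pct = 100.0 * stats["data_rate_mbps"] / usable_mbps
    flag = "  <-- approaching saturation" if pct > 80.0 else ""
    print(f"Port {port}: {stats['data_rate_mbps']:.0f} MBps "
          f"({pct:.0f}% of ~{usable_mbps:.0f} MBps usable){flag}")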

7.9 Performance analysis examples


This section provides sample performance data, analysis, and recommendations for the
following performance scenarios by using the process that is described in 7.8, “End-to-end
analysis of I/O performance problems” on page 248. The examples highlight the key
performance data that is appropriate for each problem type. Host configuration details or
errors are provided only in the cases where that information is critical to determining the outcome.

7.9.1 Example 1: Disk array bottleneck


The most common type of performance problem is a disk array bottleneck. Similar to other
types of I/O performance problems, a disk array bottleneck usually manifests itself in high disk
response time on the host. In many cases, the write response times are excellent because of
cache hits, but reads often require immediate disk access.

Defining the problem


The application owner complains of poor response time for transactions during certain times
of the day.

Classifying the problem


There are no hardware errors, configuration issues, or host performance constraints.

Identifying the root cause


Figure 7-24 on page 257 shows the average read response time for a Windows Server that
performs a random workload in which the response time increases steadily over time.



Figure 7-24 Windows Server perfmon - Average Physical Disk Read Response Time (average disk read response time in ms for disks 1 - 6, 1-minute intervals)

At approximately 18:39 hours, the average read response time jumps from approximately
15 ms to 25 ms. Further investigation of the host reveals that the increase in response time
correlates with an increase in load, as shown in Figure 7-25.

Figure 7-25 Windows Server perfmon - Average Disk Reads/sec (average disk reads per second for disks 1 - 6, 1-minute intervals)



As described in 7.8, “End-to-end analysis of I/O performance problems” on page 248, there
are several possibilities for high average disk read response time:
򐂰 DS8000 array contention
򐂰 DS8000 port contention
򐂰 SAN fabric contention
򐂰 Host HBA saturation

Because the most probable reason for the elevated response times is the disk utilization on
the array, gather and analyze this metric first. Figure 7-26 shows the disk utilization on the
DS8000 storage system.

Figure 7-26 IBM Spectrum Control array disk utilization

Implementing changes to resolve the problem


Add volumes on additional disks. For environments where host striping is configured, you
might need to re-create the host volumes to spread the I/O from an existing workload across
the new volumes.

Validating the problem resolution


Gather performance data to determine whether the issue is resolved.

7.9.2 Example 2: Hardware connectivity part 1


Infrequent connectivity issues occur as a result of broken or damaged components in the I/O
path. The following example illustrates the required steps to identify and resolve these types
of issues.

Problem definition
The online transactions for a Windows Server SQL server appear to take longer than normal
and time out in certain cases.



Problem classification
After reviewing the hardware configuration and the error reports for all hardware components,
we determined that there are errors on the paths associated with one of the host HBAs, as
shown in Figure 7-27. This output shows the errors on path 0 and path 1, which are both on
the same HBA (SCSI port 1). For a Windows Server that runs SDDDSM, additional
information about the HAs is available by running the gethba.exe command. The command
that you use to identify errors depends on the multipathing software installation.

Figure 7-27 Example of datapath query device
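
The following commands are a sketch of how these path errors can be checked with SDDDSM on a Windows host. The device and adapter numbers in your environment differ, and the exact output depends on the SDDDSM level:

   datapath query device    (lists every device with its paths, their state, and the error count per path)
   datapath query adapter   (summarizes the select and error counts per host adapter)
   gethba.exe               (displays details about the installed HBAs)

If the errors are concentrated on the paths of one adapter, as in this example, that adapter, its cable, or the attached switch port is the first suspect.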

Identifying the root cause


A further review of the switch software revealed significant errors on the switch port
associated with the paths in question. A visual inspection of the environment revealed a kink
in the cable from the host to the switch.

Implementing changes to resolve the problem


Replace the cable.

Validating the problem resolution


After you implement the change, the error counts do not increase and the nightly backups
complete within the backup window.

7.9.3 Example 3: Hardware connectivity part 2


Infrequent connectivity issues occur as a result of broken or damaged components in the I/O
path. The following example illustrates the required steps to identify and resolve these types
of issues.

Defining the problem


Users report that the data warehouse application on an AIX server does not complete jobs in
a reasonable amount of time. Online transactions also time out.



Classifying the problem
A review of the host error log shows a significant number of hardware errors. An example of
the errors is shown in Figure 7-28.

Figure 7-28 AIX error log

Identifying the root cause


The IBM Service Support Representative (SSR) ran IBM diagnostic tests on the host HBA,
and the card did not pass the diagnostic tests.

Disabling a path: In cases where there is a path with significant errors, you can disable
the path with the multipathing software, which allows the non-working paths to be disabled
without causing performance degradation to the working paths. With SDD PCM, disable
the path by running pcmpath set device # path # offline.
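
For example, with SDDPCM on an AIX host, you might first identify the failing path and then vary it offline until the repair is complete. The device and path numbers that are shown here are placeholders only:

   pcmpath query device                  (lists each MPIO hdisk with its paths, their state, and error counts)
   pcmpath set device 2 path 1 offline   (takes path 1 of device 2 out of service)
   pcmpath set device 2 path 1 online    (returns the path to service after the repair)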

Implementing changes to resolve the problem


Replace the card.

Validating the problem resolution


The errors did not persist after the card was replaced and the paths were brought online.

7.9.4 Example 4: Port bottleneck


DS8000 port bottlenecks do not occur often, but the host ports are a component that is typically
oversubscribed.

Defining the problem


The production server batch runs exceed their batch window.

Classifying the problem


There are no hardware errors, configuration issues, or host performance constraints.



Identifying the root cause
The production server throughput diminishes at approximately 18:30 hours daily.
Concurrently, development workloads that run on the same DS8000 ports increase.
Figure 7-29 demonstrates the overall workload from both the production server and the
development server.

Figure 7-29 Production throughput compared to development throughput (total disk throughput in KBps for the production and development volumes, 1-minute intervals)

The DS8000 port data reveals a peak throughput of around 300 MBps per 4-Gbps port.

Note: The HA port speed might differ between DS8000 models, so the achievable throughput per port depends on the hardware configuration of the system and the SAN environment.

Figure 7-30 Total port data rate (total port data rate in MBps for ports R1-I3-C4-T0 and R1-I3-C1-T0, 5-minute intervals)



Implementing changes to resolve the problem
Rezone ports for production servers and development servers so that they do not use the
same DS8000 ports. Add additional ports so that each server HBA is zoned to two DS8000
ports.

Validating the problem resolution


After implementing the new zoning that separates the production server and the development
server, the storage ports are no longer the bottleneck.

7.9.5 Example 5: Server HBA bottleneck


Although rare, server HBA bottlenecks occur, usually as a result of a highly sequential workload
with under-configured HBAs. For an example of the type of workload and configuration that
lead to this type of problem, see “Analyzing performance data” on page 357.

7.10 IBM Spectrum Control in mixed environments


A benefit of IBM Spectrum Control is its capability to analyze both Open Systems fixed block
(FB) and z Systems Count Key Data (CKD) workloads. When the DS8000 storage system is
attached to multiple hosts that run on different platforms, Open Systems hosts might affect
your z Systems workload, and the z Systems workload might affect the Open Systems
workloads. If you use a mixed environment, looking at the RMF reports is insufficient. You
also need the information about the Open Systems hosts. IBM Spectrum Control informs you
about the cache and I/O activity.

Before beginning the diagnostic process, you must understand your workload and your
physical configuration. You must know how your system resources are allocated, and
understand your path and channel configuration for all attached servers.

Assume that you have an environment with a DS8000 storage system attached to a z/OS
host, an AIX on IBM Power Systems™ host, and several Windows Server hosts. You noticed
that your z/OS online users experience a performance degradation 07:30 - 08:00 hours each
morning.

You might notice that there are 3390 volumes that indicate high disconnect times, or high
device busy delay time for several volumes in the RMF device activity reports. Unlike UNIX or
Windows Server, z/OS reports the response time and its breakdown into connect, disconnect,
pending, and IOS queuing times.

Disconnect time is an indication of cache-miss activity or destage wait (because of high
persistent memory utilization) for logical disks behind the DS8000 storage systems.

Device busy delay is an indication that another system has reserved the volume, or that an extent
conflict occurs among z/OS hosts or among applications in the same host when Parallel
Access Volumes (PAVs) are used. The DS8000 multiple allegiance or PAVs capability allows it to
process multiple I/Os against the same volume at the same time. However, if a read or write
request against an extent is pending while another I/O is writing to the extent, or if a write
request against an extent is pending while another I/O is reading or writing data from the
extent, the DS8000 storage system delays the I/O by queuing. This condition is referred to as
extent conflict. Queuing time because of extent conflict is accumulated to device busy (DB)
delay time. An extent is a sphere of access; the unit of increment is a track. Usually, I/O
drivers or system routines decide and declare the sphere.



To determine the possible cause of high disconnect times, check the read cache hit ratios,
read-to-write ratios, and bypass I/Os for those volumes. If you see that the cache hit ratio is
lower than usual and you did not add other workloads to your z Systems environment, I/Os
against Open Systems FB volumes might be the cause of the problem. Possibly, FB volumes
that are defined on the same server have a cache-unfriendly workload, thus affecting your
z Systems volumes hit ratio.

To get more information about cache usage, you can check the cache statistics of the FB
volumes that belong to the same server. You might be able to identify the FB volumes that
have a low read hit ratio and short cache holding time. Moving the workload of the Open
Systems logical disks, or the z Systems CKD volumes, that you are concerned about to the
other side of the cluster improves the situation by concentrating cache-friendly I/O workload
across both clusters. If you cannot or if the condition does not improve after this move,
consider balancing the I/O distribution on more ranks, or solid-state drives (SSDs). Balancing
the I/O distribution on more ranks optimizes the staging and destaging operation.

The scenarios that use IBM Spectrum Control as described in this chapter might not cover all
the possible situations that can be encountered. You might need to include more information,
such as application and host operating system-based performance statistics, the STAT
reports, or other data collections to analyze and solve a specific performance problem.

Part 3. Performance considerations for host systems and databases
This part provides performance considerations for various host systems or appliances that
are attached to the IBM System Storage DS8000 storage system, and for databases.

This part includes the following topics:


򐂰 Host attachment
򐂰 Performance considerations for UNIX servers
򐂰 Performance considerations for Microsoft Windows servers
򐂰 Performance considerations for VMware
򐂰 Performance considerations for Linux
򐂰 Performance considerations for the IBM i system
򐂰 Performance considerations for IBM z Systems servers
򐂰 IBM System Storage SAN Volume Controller attachment
򐂰 IBM ProtecTIER data deduplication
򐂰 Databases for open performance
򐂰 Database for IBM z/OS performance


Chapter 8. Host attachment


This chapter describes the following attachment topics and considerations between host
systems and the DS8000 series for availability and performance:
򐂰 DS8000 attachment types
򐂰 Attaching Open Systems hosts
򐂰 Attaching IBM z Systems™ hosts

Detailed performance tuning considerations for specific operating systems are provided in
later chapters of this book.



8.1 DS8000 host attachment
The DS8000 enterprise storage solution provides various host attachments that allow
exceptional performance and superior data throughput. At a minimum, have two connections
to any host, and the connections must be on different host adapters (HAs) in different I/O
enclosures. You can consolidate storage capacity and workloads for Open Systems hosts and
z Systems hosts by using the following adapter types and protocols:
򐂰 Fibre Channel Protocol (FCP)-attached Open Systems hosts
򐂰 FCP/Fibre Connection (FICON)-attached z Systems hosts

The DS8886 model supports a maximum of 16 ports per I/O bay and can have four I/O bays
in the base frame and four I/O bays in the first expansion frame, so a maximum of 128
FCP/FICON ports is supported. The DS8884 model supports a maximum of 16 FCP/FICON
HAs and can have two I/O bays in the base frame and two I/O bays in the first expansion frame,
so a maximum of 64 FCP/FICON ports is supported. Both models support 8 Gbps or 16 Gbps
HAs. All ports can be intermixed and independently configured. The 8 Gbps HAs support 2, 4,
or 8 Gbps link speeds, and the 16 Gbps HAs support 4, 8, or 16 Gbps. Thus, 1 Gbps is no
longer supported on the DS8880 storage system. Enterprise Systems Connection (ESCON)
adapters are not supported on the DS8880 storage system.

The DS8000 storage system can support host and remote mirroring links by using
Peer-to-Peer Remote Copy (PPRC) on the same I/O port. However, it is preferable to use
dedicated I/O ports for remote mirroring links.

Planning and sizing the HAs for performance are not easy tasks, so use modeling tools, such
as Disk Magic (see 6.1, “Disk Magic” on page 160). The factors that might affect the
performance at the HA level are typically the aggregate throughput and the workload mix that
the adapter can handle. All connections on a HA share bandwidth in a balanced manner.
Therefore, host attachments that require maximum I/O port performance must be connected
to HAs that are not fully populated. You must allocate host connections across I/O ports, HAs,
and I/O enclosures in a balanced manner (workload spreading).
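
To review how the I/O ports are spread across HAs and I/O enclosures, and to set the protocol of each port, you can use the DS8000 command-line interface (DSCLI). The following lines are only a sketch, and the port IDs are examples:

   lsioport -l                          (lists all I/O ports with their location, topology, and speed)
   setioport -topology scsi-fcp I0000   (configures port I0000 for FCP-attached Open Systems hosts or PPRC links)
   setioport -topology ficon I0030      (configures port I0030 for FICON-attached z Systems hosts)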

8.2 Attaching Open Systems hosts


This section describes the host system requirements and attachment considerations for Open
Systems hosts that run AIX, Linux, Hewlett-Packard UNIX (HP-UX), Novell, Oracle Solaris,
and Microsoft Windows to the DS8000 series with Fibre Channel (FC) adapters.

No SCSI: There is no direct Small Computer System Interface (SCSI) attachment support
for the DS8000 storage system.

8.2.1 Fibre Channel


FC is a 100, 200, 400, 800, and 1600 MBps, full-duplex, serial communications technology to
interconnect I/O devices and host systems that might be separated by tens of kilometers. The
DS8880 storage system supports 16, 8, 4, and 2 Gbps connections and it negotiates the link
speed automatically.

Supported Fibre Channel-attached hosts


For specific considerations that apply to each server platform, and for the current information
about supported servers (the list is updated periodically), see the following website:
http://www.ibm.com/systems/support/storage/config/ssic



Fibre Channel topologies
The DS8000 architecture supports all three FC interconnection topologies:
򐂰 Direct connect
򐂰 Arbitrated loop
򐂰 Switched fabric

For maximum flexibility and performance, use a switched fabric topology.

The next section describes preferred practices for implementing a switched fabric.

8.2.2 Storage area network implementations


This section describes a basic storage area network (SAN) network and how to implement it
for maximum performance and availability. It shows examples of a correctly connected SAN
network to maximize the throughput of disk I/O.

Description and characteristics of a storage area network


With a SAN, you can connect heterogeneous Open Systems servers to a high-speed network
and share storage devices, such as disk storage and tape libraries. Instead of each server
having its own locally attached storage and tape drives, a SAN shares centralized storage
components, and you can easily allocate storage to hosts.

Storage area network cabling for availability and performance


For availability and performance, you must connect to different adapters in different I/O
enclosures whenever possible. You must use multiple FC switches or directors to avoid a
potential single point of failure. You can use inter-switch links (ISLs) for connectivity.

Importance of establishing zones


For FC attachments in a SAN, it is important to establish zones to prevent interaction between
HAs. Every time that a HA joins the fabric, it issues a Registered State Change Notification
(RSCN). An RSCN does not cross zone boundaries, but it affects every device or HA in the
same zone.

If a HA fails and starts logging in and out of the switched fabric, or a server must be restarted
several times, you do not want it to disturb the I/O to other hosts. Figure 8-1 on page 271
shows zones that include only a single HA and multiple DS8000 ports (single initiator zone).
This approach is the preferred way to create zones to prevent interaction between server
HAs.

Tip: Each zone contains a single host system adapter with the wanted number of ports
attached to the DS8000 storage system.

By establishing zones, you reduce the possibility of interactions between system adapters in
switched configurations. You can establish the zones by using either of two zoning methods:
򐂰 Port number
򐂰 Worldwide port name (WWPN)

You can configure switch ports that are attached to the DS8000 storage system in more than
one zone, which enables multiple host system adapters to share access to the DS8000 HA
ports. Shared access to a DS8000 HA port might be from host platforms that support a
combination of bus adapter types and operating systems.
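
As an illustration, the following commands define a single initiator zone on a Brocade Fabric OS switch. This example is only a sketch: the alias and zone names are arbitrary, the WWPNs are placeholders for the values of your host HBA and DS8000 I/O ports, and the configuration prod_cfg is assumed to exist:

   alicreate "host1_fc0", "10:00:00:00:c9:aa:bb:01"               (alias for the host HBA port)
   alicreate "ds8k_i0000", "50:05:07:63:0a:00:00:01"              (alias for a DS8000 HA port)
   alicreate "ds8k_i0130", "50:05:07:63:0a:00:00:02"              (alias for a second DS8000 HA port)
   zonecreate "z_host1_fc0", "host1_fc0; ds8k_i0000; ds8k_i0130"  (one host HBA, two DS8000 ports)
   cfgadd "prod_cfg", "z_host1_fc0"                               (add the zone to the existing configuration)
   cfgenable "prod_cfg"                                           (activate the configuration)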



Important: A DS8000 HA port that is configured to run with the FICON topology cannot be
shared in a zone with non-z/OS hosts. Ports with non-FICON topology cannot be shared in
a zone with z/OS hosts.

LUN masking
In FC attachment, logical unit number (LUN) affinity is based on the WWPN of the adapter on
the host, which is independent of the DS8000 HA port to which the host is attached. This
LUN masking function on the DS8000 storage system is provided through the definition of
DS8000 volume groups. A volume group is defined by using the DS Storage Manager or
DS8000 command-line interface (DSCLI), and host WWPNs are connected to the volume
group. The LUNs to be accessed by the hosts that are connected to the volume group are
defined to be in that volume group.

Although it is possible to limit through which DS8000 HA ports a certain WWPN connects to
volume groups, it is preferable to define the WWPNs to have access to all available DS8000
HA ports. Then, by using the preferred process of creating FC zones, as described in
“Importance of establishing zones” on page 269, you can limit the wanted HA ports through
the FC zones. In a switched fabric with multiple connections to the DS8000 storage system,
this concept of LUN affinity enables the host to see the same LUNs on different paths.
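
The following DSCLI sequence is a minimal sketch of this approach. The volume IDs, the WWPN, and the nicknames are hypothetical, and the volume group ID (V11 in this example) is the ID that is returned by the mkvolgrp command in your environment:

   mkvolgrp -type scsimask -volume 1000-100F aix_host1_vg
   mkhostconnect -wwname 10000000C9AABB01 -profile "IBM pSeries - AIX" -volgrp V11 aix_host1_fc0
   showvolgrp V11

The first command creates the volume group with the LUNs for this host, the second command connects one host HBA WWPN to that volume group, and the third command verifies the volumes in the group. You repeat the mkhostconnect command for each WWPN of the host and control the usable paths with the FC zones, as described previously.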

Configuring logical disks in a storage area network


In a SAN, carefully plan the configuration to prevent many disk device images from being
presented to the attached hosts. Presenting many disk devices to a host can cause longer
failover times in cluster environments. Also, boot times can take longer because the device
discovery steps take longer.

The number of times that a DS8000 logical disk is presented as a disk device to an open host
depends on the number of paths from each HA to the DS8000 storage system. The number
of paths from an open server to the DS8000 storage system is determined by these factors:
򐂰 The number of HAs installed in the server
򐂰 The number of connections between the SAN switches and the DS8000 storage system
򐂰 The zone definitions created by the SAN switch software

Physical paths: Each physical path to a logical disk on the DS8000 storage system is
presented to the host operating system as a disk device.

Consider a SAN configuration, as shown in Figure 8-1 on page 271:


򐂰 The host has two connections to the SAN switches, and each SAN switch in turn has four
connections to the DS8000 storage system.
򐂰 Zone A includes one FC card (FC0) and two paths from SAN switch A to the DS8000
storage system.
򐂰 Zone B includes one FC card (FC1) and two paths from SAN switch B to the DS8000
storage system.
򐂰 This host uses only four of the eight possible paths to the DS8000 storage system in this
zoning configuration.



By cabling the SAN components and creating zones, as shown in Figure 8-1, each logical
disk on the DS8000 storage system is presented to the host server four times because there
are four unique physical paths from the host to the DS8000 storage system. As you can see
in Figure 8-1, Zone A shows that FC0 has access through DS8000 host ports I0000 and
I0130. Zone B shows that FC1 has access through DS8000 host ports I0230 and I0300. So,
in combination, this configuration provides four paths to each logical disk presented by the
DS8000 storage system. If Zone A and Zone B are modified to include four paths each to the
DS8000 storage system, the host has a total of eight paths to the DS8000 storage system. In
that case, each logical disk that is assigned to the host is presented as eight physical disks to
the host operating system. Additional DS8000 paths are shown as connected to Switch A and
Switch B, but they are not in use for this example.

Figure 8-1 Zoning in a SAN environment (Zone A: FC0 with DS8000 ports I0000 and I0130 through SAN Switch A; Zone B: FC1 with DS8000 ports I0230 and I0300 through SAN Switch B)

You can see how the number of logical devices that are presented to a host can increase
rapidly in a SAN environment if you are not careful about selecting the size of logical disks
and the number of paths from the host to the DS8000 storage system.

Typically, it is preferable to cable the switches and create zones in the SAN switch software for
dual-attached hosts so that each server HA has 2 - 4 paths from the switch to the DS8000
storage system. With hosts configured this way, you can allow the multipathing module to
balance the load across the four HAs in the DS8000 storage system.

Zoning more paths, such as eight connections from the host to the DS8000 storage system,
does not improve SAN performance and causes twice as many devices to be presented to
the operating system.



8.2.3 Multipathing
Multipathing describes a technique to attach one host to an external storage device through
more than one path. Multipathing can improve fault-tolerance and the performance of the
overall system because the fault of a single component in the environment can be tolerated
without an impact to the host. Also, you can increase the overall system bandwidth, which
positively influences the performance of the system.

As illustrated in Figure 8-2, attaching a host system by using a single-path connection
implements a solution that depends on several single points of failure. In this example, a failure
of a single link (either between the host system and the switch or between the switch and the
storage system), a failure of the HA on the host system or on the DS8000 storage system, or
even a failure of the switch leads to a loss of access for the host system. Additionally, the path
performance of the whole system is limited by the slowest component in the link.

Figure 8-2 SAN single-path connection (one host adapter, one SAN switch, and one DS8000 host port, each a single point of failure)

Adding additional paths requires you to use multipathing software (Figure 8-3 on page 273).
Otherwise, the same LUN behind each path is handled as a separate disk from the operating
system side, which does not allow failover support.

Multipathing provides the DS8000 attached Open Systems hosts that run Windows, AIX,
HP-UX, Oracle Solaris, or Linux with these capabilities:
򐂰 Support for several paths per LUN.
򐂰 Load balancing between multiple paths when there is more than one path from a host
server to the DS8000 storage system. This approach might eliminate I/O bottlenecks that
occur when many I/O operations are directed to common devices through the same I/O
path, thus improving the I/O performance.



򐂰 Automatic path management, failover protection, and enhanced data availability for users
that have more than one path from a host server to the DS8000 storage system. It
eliminates a potential single point of failure by automatically rerouting I/O operations to the
remaining active paths from a failed data path.
򐂰 Dynamic reconfiguration after changing the configuration environment, including zoning,
LUN masking, and adding or removing physical paths.

Figure 8-3 DS8000 multipathing implementation that uses two paths

The DS8000 storage system supports several multipathing implementations. Depending on
the environment, host type, and operating system, only a subset of the multipathing
implementations is available. This section introduces the multipathing concepts and provides
general information about implementation, usage, and specific benefits.

Important: Do not intermix several multipathing solutions within one host system. Usually,
the multipathing software solutions cannot coexist.

Subsystem Device Driver


The IBM Multipath Subsystem Device Driver (SDD) software is a generic host-resident
pseudo-device driver that is designed to support the multipath configuration environments in
the DS8000 storage system. The SDD is on the host system with the native disk device driver
and manages redundant connections between the host server and the DS8000 storage
system. The SDD is provided by and maintained by IBM for the AIX, Linux, HP-UX, Oracle
Solaris, and Windows host operating systems.



For the correct multipathing driver, see the IBM System Storage Interoperation Center (SSIC)
website, found at:
http://www.ibm.com/systems/support/storage/config/ssic

The SDD can operate under different modes or configurations:


򐂰 Concurrent data access mode: A system configuration where simultaneous access to data
on common LUNs by more than one host is controlled by system application software.
Examples are Oracle Parallel Server or file access software that can handle address
conflicts. The LUN is not involved in access resolution.
򐂰 Non-concurrent data access mode: A system configuration where there is no inherent
system software control of simultaneous accesses to the data on a common LUN by more
than one host. Therefore, access conflicts must be controlled at the LUN level by a
hardware-locking facility, such as SCSI Reserve/Release.

Persistent Reserve: Do not share LUNs among multiple hosts without the protection of
Persistent Reserve (PR). If you share LUNs among hosts without PR, you are exposed to
data corruption situations. You must also use PR when using FlashCopy.

The SDD does not support booting from or placing a system primary paging device on an
SDD pseudo-device.

For certain servers that run AIX, booting off the DS8000 storage system is supported. In that
case, LUNs used for booting are manually excluded from the SDD configuration by using the
querysn command to create an exclude file.

For more information about installing and using SDD, see IBM System Storage Multipath
Subsystem Device Driver User’s Guide, GC52-1309. This publication and other information
are available at the following website:
http://www.ibm.com/support/docview.wss?uid=ssg1S7000303

SDD load balancing


SDD automatically adjusts data routing for optimum performance. Multipath load balancing of
data flow prevents a single path from becoming overloaded. If a single path is overloaded, it
can cause the I/O congestion that occurs when many I/O operations are directed to common
devices along the same I/O path.

The policy that is specified for the device determines the path that is selected to use for an I/O
operation. The following policies are available:
򐂰 Load balancing (default): The path to use for an I/O operation is chosen by estimating the
load on the adapter to which each path is attached. The load is a function of the number of
I/O operations currently in process. If multiple paths have the same load, a path is chosen
at random from those paths.
򐂰 Round-robin: The path to use for each I/O operation is chosen at random from those paths
that are not used for the last I/O operation. If a device has only two paths, SDD alternates
between the two paths.
򐂰 Failover only: All I/O operations for the device are sent to the same (preferred) path until
the path fails because of I/O errors. Then, an alternative path is chosen for later I/O
operations.

Normally, path selection is performed on a global rotating basis; however, the same path is
used when two sequential write operations are detected.
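
For example, you can display the current policy of a device and change it with the datapath command. The device number 0 is only a placeholder:

   datapath query device 0           (shows the device, its current policy, and the state of its paths)
   datapath set device 0 policy lb   (sets the load balancing policy)
   datapath set device 0 policy rr   (sets the round-robin policy)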



Single-path mode
SDD does not support concurrent download and installation of the Licensed Machine Code
(LMC) to the DS8000 storage system if hosts use a single-path mode. However, SDD
supports a single-path FC connection from your host system to a DS8000 storage system. It
is possible to create a volume group or a vpath device with only a single path.

Important: With a single-path connection, which is not preferable, the SDD cannot provide
failure protection and load balancing.

Single Fibre Channel adapter with multiple paths


A host system with a single FC adapter that connects through a switch to multiple DS8000
ports is considered to have multiple FC paths.

From an availability point of view, this configuration is not preferred because of the single fiber
cable from the host to the SAN switch. However, this configuration is better than a single path
from the host to the DS8000 storage system, and this configuration can be useful for
preparing for maintenance on the DS8000 storage system.

Path failover and online recovery


SDD automatically and non-disruptively can redirect data to an alternative data path.

When a path failure occurs, the SDD automatically reroutes the I/O operations from the failed
path to the other remaining paths. This action eliminates the possibility of a data path
becoming a single point of failure.

SDD datapath command


The SDD provides commands that you can use to display the status of adapters that are used
to manage disk devices or to display the status of the disk devices. You can also set individual
paths online or offline, and you can also set all paths connected to an adapter online or offline
at one time.

Multipath I/O
Multipath I/O (MPIO) summarizes native multipathing technologies that are available in
several operating systems, such as AIX, Linux, and Windows. Although the implementation
differs for each of the operating systems, the basic concept is almost the same:
򐂰 The multipathing module is delivered with the operating system.
򐂰 The multipathing module supports failover and load balancing for standard SCSI devices,
such as simple SCSI disks or SCSI arrays.
򐂰 To add device-specific support and functions for a specific storage device, each storage
vendor might provide a device-specific module that implements advanced functions for
managing the specific storage device.



The term SDD represents both the established multipath SDD and the MPIO path control
module, depending upon the operating system. Table 8-1 provides examples.

Table 8-1 Examples of DS8000 MPIO path control modules


Operating system    Multipathing solution    Device-specific module          Acronym
AIX                 MPIO                     SDD Path Control Module         SDDPCM
Windows             MPIO                     SDD Device Specific Module      SDDDSM
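
With AIX MPIO and SDDPCM, for example, you can verify the paths and the path selection algorithm of a DS8000 LUN with commands similar to the following ones. The hdisk number is only a placeholder, and the available algorithm values depend on the installed path control module:

   lspath -l hdisk4                                                  (lists all paths of the disk and their state)
   lsattr -El hdisk4 -a algorithm -a queue_depth -a reserve_policy   (shows the path selection algorithm and related attributes)
   chdev -l hdisk4 -a algorithm=load_balance -P                      (changes the algorithm; -P defers the change until the device is reconfigured)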

External multipathing software


In addition to the SDD and MPIO solutions, third-party multipathing software is available for
specific host operating systems and configurations.

For example, Symantec provides an alternative to the IBM provided multipathing software.
The Veritas Volume Manager (VxVM) relies on the Microsoft implementation of MPIO and
Device Specific Modules (DSMs) that rely on the Storport driver. The Storport driver is not
available for all versions of Windows. The Veritas Dynamic MultiPathing (DMP) software is
also available for UNIX versions, such as Oracle Solaris.

For your specific hardware configuration, see the SSIC website:


http://www.ibm.com/systems/support/storage/config/ssic/

8.3 Attaching z Systems hosts


This section describes the host system requirements and attachment considerations for
attaching the z Systems hosts (z/OS, IBM z/VM®, IBM z/VSE®, Linux on z Systems, and
Transaction Processing Facility (TPF)) to the DS8000 series. Starting with the DS8800
storage system, the ESCON attachment was discontinued. The following sections describe
attachment through FICON adapters only.

FCP: z/VM, z/VSE, and Linux for z Systems can also be attached to the DS8000 storage
system with FCP. Then, the same considerations as for Open Systems hosts apply.

8.3.1 FICON
FICON is a Fibre Connection used with z Systems servers. Each storage unit HA has either
four or eight ports, and each port has a unique WWPN. You can configure the port to operate
with the FICON upper layer protocol. When configured for FICON, the storage unit provides
the following configurations:
򐂰 Either fabric or point-to-point topology.
򐂰 A maximum of 128 host ports for a DS8886 model and a maximum of 64 host ports for a
DS8884 model.
򐂰 A maximum of 1280 logical paths per DS8000 HA port.
򐂰 Access to all 255 control unit images (65280 count key data (CKD) devices) over each
FICON port.

FICON HAs support 2, 4, 8, or 16 Gbps link speeds in DS8880 storage systems.



Improvement of FICON generations
IBM introduced FICON channels in the IBM 9672 G5 and G6 servers with the capability to run
at 1 Gbps. Since that time, IBM introduced several generations of FICON channels. The
FICON Express16S channels make up the current generation. They support 16 Gbps link
speeds and can auto-negotiate to 4 or 8 Gbps. The speed also depends on the capability of
the director or control unit port at the other end of the link.

Operating at 16 Gbps speeds, FICON Express16S channels achieve up to 620 MBps for a
mix of large sequential read and write I/O operations, as shown in the following charts.
Figure 8-4 shows a comparison of the overall throughput capabilities of various generations of
channel technology.

Figure 8-4 Measurements of channel throughput over several generations of channels (FICON throughput, full duplex, large sequential read/write mix: approximately 2600 MBps for FICON Express16S with zHPF, 1600 MBps for FICON Express8S with zHPF, and 620 MBps for FICON Express8, 8S, and 16S with the FICON protocol)

The FICON Express16S channel on IBM z13 and the FICON Express8S channel on IBM
zEnterprise EC12 and BC12 represent a significant improvement in maximum
bandwidth capability compared to FICON Express4 channels and previous FICON offerings.
The response time improvements are expected to be noticeable for large data transfers.

Figure 8-4 also shows the maximum throughput of FICON Express 16S and 8S with High
Performance FICON (zHPF) that delivers significant improvement compared to the FICON
protocol.



As shown in Figure 8-5, the maximum number of I/Os per second (IOPS) measured on
FICON Express16S and FICON Express8S channels that run an I/O driver benchmark with a
4 KB per I/O workload is approximately 23,000 IOPS. This maximum is more than 10%
greater than the maximum number of I/Os measured with a FICON Express8 channel. The
greater performance capabilities of the FICON Express 16S and FICON Express8S channel
make it a good match with the performance characteristics of the new DS8000 HAs.

The measurement of 4 K I/O performance on FICON Express16S achieves up to 98,000
IOPS, which is approximately four times as high as the FICON protocol capability.

Figure 8-5 Measurements of IOPS over several generations of channels (4 KB block size at 100% channel utilization: approximately 98,000 IOPS for FICON Express16S with zHPF, 92,000 IOPS for FICON Express8S with zHPF, and 23,000 IOPS for FICON Express8S and 16S with the FICON protocol)

Support of FICON generations


The z13 server offers FICON Express16S SX and LX features that have two
independent channels. Each feature occupies a single I/O slot and uses one CHPID per
channel. Each channel supports 4 Gbps, 8 Gbps, and 16 Gbps link data rates with
auto-negotiation to support existing switches, directors, and storage devices.



Support: The FICON Express16S SX and LX are supported on the z13 server only. The
FICON Express8S SX and LX are available on z13, zEC12, zBC12, z196, and z114
servers.

z13 is the last z Systems server to support FICON Express 8 channels. FICON Express 8
will not be supported on future high-end z Systems servers as carry forward on an
upgrade.

z13 does not support FICON Express4. zEC12 and zBC12 are the last systems that
support FICON Express4.

Withdrawal: At the time of writing, all FICON Express4, FICON Express2, and FICON
features are withdrawn from marketing.

Note: The FICON Express4 was the last feature to support 1 Gbps link data rates.

For any generation of FICON channels, you can attach directly to a DS8000 storage system
or you can attach through a FICON capable FC switch.

When you use a FC/FICON HA to attach to FICON channels, either directly or through a
switch, the port is dedicated to FICON attachment and cannot be simultaneously attached to
FCP hosts. When you attach a DS8000 storage system to FICON channels through one or
more switches, the maximum number of FICON logical paths is 1280 per DS8000 HA port.
The directors provide high availability with redundant components and no single points of
failure. A single director between servers and a DS8000 storage system is not preferable
because it can be a single point of failure. Two or more directors are preferable for
redundancy.

8.3.2 FICON configuration and sizing considerations


This section describes FICON connectivity between z Systems and a DS8000 storage
system, and provides some recommendations and considerations.



FICON topologies
As shown in Figure 8-6, FICON channels in FICON native mode, which means CHPID type
FC in Input/Output Configuration Program (IOCP), can access the DS8000 storage system
through the following topologies:
򐂰 Point-to-Point (direct connection)
򐂰 Switched Point-to-Point (through a single FC switch)
򐂰 Cascaded FICON Directors (through two FC switches)

Figure 8-6 FICON topologies between z Systems and a DS8000 storage system (Point-to-Point: direct FC link; Switched Point-to-Point: one FICON switch in the path; Cascaded FICON Directors: two FICON switches in the path)

FICON connectivity
Usually in z Systems environments, a one-to-one connection between FICON channels and
storage HAs is preferred because the FICON channels are shared among multiple logical
partitions (LPARs) and heavily used. Carefully plan the oversubscription of HA ports to avoid
any bottlenecks.



Figure 8-7 shows an example of FICON attachment that connects a z Systems server
through FICON switches. This example uses 16 FICON channel paths to eight HA ports on
the DS8000 storage system, and addresses eight logical control units (LCUs). This channel
consolidation might be possible when your aggregate host workload does not exceed the
performance capabilities of the DS8000 HA.

Figure 8-7 Many-to-one FICON attachment (16 FICON channels from two z Systems servers connect through four FICON directors to eight DS8000 HA ports, addressing LCUs 20 - 27)



A one-to-many configuration is also possible, as shown in Figure 8-8, but, again, careful
planning is needed to avoid performance issues.

Figure 8-8 One-to-many FICON attachment (eight FICON channels from one z Systems server connect through four FICON directors to two DS8000 storage systems, addressing LCUs 20 - 27 and 30 - 37)

Sizing FICON connectivity is not an easy task. You must consider many factors. As a
preferred practice, create a detailed analysis of the specific environment. Use these
guidelines before you begin sizing the attachment environment:
򐂰 For FICON Express CHPID utilization, the preferred maximum utilization level is 50%.
򐂰 For the FICON Bus busy utilization, the preferred maximum utilization level is 40%.
򐂰 For the FICON Express Link utilization with an estimated link throughput of 2 Gbps,
4 Gbps, 8 Gbps, or 16 Gbps, the preferred maximum utilization threshold level is 70%.

For more information about DS8000 FICON support, see IBM System Storage DS8000 Host
Systems Attachment Guide, SC26-7917, and FICON Native Implementation and Reference
Guide, SG24-6266.

You can monitor the FICON channel utilization for each CHPID in the RMF Channel Path
Activity report. For more information about the Channel Path Activity report, see Chapter 14,
“Performance considerations for IBM z Systems servers” on page 459.

The following statements are some considerations and preferred practices for paths in z/OS
systems to optimize performance and redundancy:
򐂰 Do not mix paths to one LCU with different link speeds in a path group on one z/OS. It
does not matter in the following cases, even if those paths are on the same CPC:
– The paths with different speeds, from one z/OS to different multiple LCUs
– The paths with different speeds, from each z/OS to one LCU
򐂰 Place each path in a path group on different I/O bays.



򐂰 Do not have two paths from the same path group sharing a card.

8.3.3 z/VM, z/VSE, and Linux on z Systems attachment


z Systems FICON features in FCP mode provide full fabric and point-to-point attachments of
Fixed Block (FB) devices to the operating system images. With this attachment, z/VM, z/VSE,
and Linux on z Systems can access industry-standard FCP storage controllers and devices.
This capability can facilitate the consolidation of UNIX server farms onto z Systems servers
and help protect investments in SCSI-based storage.

The FICON features provide support of FC devices to z/VM, z/VSE, and Linux on z Systems,
which means that these features can access industry-standard SCSI devices. For disk
applications, these FCP storage devices use FB 512-byte sectors rather than Extended
Count Key Data (IBM ECKD™) format. All available FICON features can be defined in FCP
mode.

Linux FCP connectivity


You can use either direct or switched attachment to attach a storage unit to a z Systems host
system that runs SUSE Linux Enterprise Server 9, 10, 11, or 12, or Red Hat Enterprise Linux
4, 5, 6, or 7, and later with current maintenance updates for FICON. For more information,
see the following websites:
򐂰 http://www.ibm.com/systems/z/os/linux/resources/testedplatforms.html
򐂰 http://www.ibm.com/developerworks/linux/linux390/development_documentation.html


Chapter 9. Performance considerations for UNIX servers
This chapter describes performance considerations for attaching a DS8000 storage system to
UNIX systems in general, and then explores performance monitoring and tuning for the
IBM AIX operating system.

This chapter includes the following topics:


򐂰 Planning and preparing UNIX servers for performance
򐂰 UNIX disk I/O architecture
򐂰 AIX disk I/O components
򐂰 AIX performance monitoring tools
򐂰 Testing and verifying the DS8000 storage system



9.1 Planning and preparing UNIX servers for performance
Planning and configuring a UNIX system for performance is never a simple task. There are
numerous factors to consider before tuning parameters and deciding on the ideal settings. To
help you make these decisions, consider the following factors:
򐂰 Type of application: There are thousands of applications available on the market. But, it is
possible to group them into a few types based on their I/O profile. The I/O profile helps you
decide the preferred configuration at the operating system level. For more information
about identifying and classifying applications, see Chapter 4, “Logical configuration
performance considerations” on page 83.
򐂰 Platform and version of operating system: Although UNIX operating systems share the
same basic performance concepts, there can be differences in how each operating system
implements these functions. In addition, newer versions or releases of an operating system
might provide performance parameters with optimized default values for certain workload types.
򐂰 Type of environment: Another significant factor is whether the environment is for
production, testing, or development. Normally, production environments are hosted in
servers with many processors, hundreds of gigabytes of memory, and terabytes of disk
space. They demand high levels of availability and are difficult to schedule for downtime,
so there is a need to make a more detailed plan. Quality assurance (or testing)
environments normally are smaller in size, but they must sustain the performance tests.
Development environments are typically much smaller than their respective production
environments, and normally performance is not a concern. This chapter focuses on the
performance tuning of the production environments.

Before planning for performance, validate the configuration of your environment. See the IBM
System Storage Interoperation Center (SSIC), which shows information about supported
system models, operating system versions, host bus adapters (HBA), and so on:
http://www.ibm.com/systems/support/storage/config/ssic/index.jsp

To download fix packs for your AIX version and the current firmware version for the Fibre
Channel (FC) adapters, go to the IBM Support website:
http://www.ibm.com/support/fixcentral/

For more information about how to attach and configure a host system to a DS8000 storage
system, see the IBM System Storage DS8000 Host System Attachment Guide, GC27-4210,
found at:
http://www.ibm.com/support/docview.wss?rs=1114&context=HW2B2&uid=ssg1S7001161

Note: The AIX command prtconf displays system information, such as system model,
processor type, and firmware levels, which facilitates Fix Central website navigation.

9.1.1 UNIX disk I/O architecture


You must understand the UNIX I/O subsystem to tune adequately your system. The I/O
subsystem can be represented by a set of layers. Figure 9-1 on page 287 provides an
overview of those layers.



Figure 9-1 UNIX disk I/O architecture

I/O requests normally go through these layers:


򐂰 Application/DB layer: This layer is the top-level layer where many of the I/O requests start.
Each application generates several I/Os that follow a pattern or a profile. An application
I/O profile has these characteristics:
– IOPS: IOPS is the number of I/Os (reads and writes) per second.
– Throughput: How much data is transferred in a certain sample time? Typically, the
throughput is measured in MBps or KBps.
– I/O size: The average I/O size is the result of the throughput (MBps) divided by the IOPS (see the worked example after this list).
– Read ratio: The read ratio is the percentage of I/O reads compared to the total I/Os.
– Disk space: This amount is the total amount of disk space that is needed by the
application.
򐂰 I/O system calls layer: Through the system calls provided by the operating system, the
application issues I/O requests to the storage. By default, all I/O operations are
synchronous. Many operating systems also provide asynchronous I/O (AIO), which is a
facility that allows an application to overlap processing time while it issues I/Os to the
storage. Typically, the databases take the advantage of this feature.
򐂰 File system layer: The file system is the operating system’s way to organize and manage
the data in the form of files. Many file systems support buffered and unbuffered I/Os. If
your application has its own caching mechanism and supports a type of direct I/O, enable
it because doing so avoids double-buffering and reduces the processor utilization.
Otherwise, your application can take advantage of features, such as file caching,
read-ahead, and write-behind.
򐂰 Volume manager layer: A volume manager was a key component to distribute the I/O
workload over the logical unit numbers (LUNs) of the DS8000 storage system. The
situation changes with the implementation of IBM Easy Tier and I/O Priority Manager in
the DS8800 storage system. You must use a logical volume manager (LVM) to provide
LUN concatenation only if it is a cluster software requirement. For any other case, it is
better to use DS8800 or DS8700 capabilities of managing the workload because
additional striping at the LVM level might deform the skew factor and lead to an incorrect
heat calculation. Also, because I/O Priority Manager works at the storage system
back-end level, it improves the usage of the internal resources. Managing priority at the
operating system level might be less effective.



򐂰 Multipathing/Disk layer: Today, there are several multipathing solutions available: hardware
multipathing, software multipathing, and operating system multipathing. It is better to
adopt the operating system multipathing solutions. However, depending on the
environment, you might face limitations and prefer to use a hardware or software
multipathing solution. Always try not to exceed a maximum of four paths for each LUN
unless required.
򐂰 FC adapter layer: The need to make configuration changes in the FC adapters depends
on the operating system and vendor model. For specific instructions about how to set up
the FC adapters, see IBM System Storage DS8000 Host Systems Attachment Guide,
GC27-4210. Also, check the compatibility matrix for dependencies among the firmware
level, operating system patch levels, and adapter models.
򐂰 Fabric layer: The storage area network (SAN) is used to interconnect storage devices and
servers.
򐂰 Array layer: The array layer is the DS8000 storage system in our case.
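
As a simple illustration of the application I/O profile metrics that are listed above (the numbers are invented for this example), assume that an application drives 2,000 IOPS at a throughput of 32 MBps with a read ratio of 70%:

   Average I/O size = 32 MBps / 2,000 IOPS = 16 KB per I/O
   Reads            = 2,000 IOPS x 0.70    = 1,400 reads per second

A small average I/O size with a high IOPS rate points to a transactional (random) profile, and a large average I/O size with a high MBps rate points to a sequential profile.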

Normally in each of these layers, there are performance indicators that help you to assess
how that particular layer affects performance.

Typical performance indicators


The following indicators are the typical performance indicators:
򐂰 The first performance indicator that is used to assess whether there is an I/O bottleneck is
the wait I/O time (wio). It is essential to realize that the wio is calculated differently
depending on the operating system:
– Current versions of IBM AIX increment wio if the processor is not busy in user or
system mode and there is outstanding I/O started by that processor. In this way,
systems with several processors have a less inflated wio. Moreover, wio from file
systems that are mounted through Network File System (NFS) is also recorded.
– Other UNIX operating systems (for example, Oracle Solaris) calculate wio with CPU
idle time included. The wio counter is incremented by disk I/O. It also is incremented by
I/Os from file systems and Small Computer System Interface (SCSI) tape devices.
For more information about the wait I/O time in AIX, see the following website:
http://www.ibm.com/support/knowledgecenter/ssw_aix_71/com.ibm.aix.performance/wait_io_time_reporting.htm
For more information about the wait I/O time in Solaris, see the following link:
http://sunsite.uakom.sk/sunworldonline/swol-08-1997/swol-08-insidesolaris.html
A high wait I/O value might be an indication of a disk I/O bottleneck, but the wio alone is not
enough to conclude that disk I/O is constrained. You must also observe other counters, such
as the blocked processes in the kernel threads column and the statistics generated by iostat
or an equivalent tool (see the example after this list).
򐂰 The technology for disk is improved. In the past, disks were capable of only 120 I/Os per
second (IOPS) and had no cache memory. Therefore, utilization levels of 10 - 30% were
considered high. Today, with arrays of the DS8000 class (supporting tens or hundreds of
gigabytes of cache memory and hundreds of physical disks at the back end), even
utilization levels higher than 80 or 90% still might not indicate an I/O performance problem.
򐂰 It is fundamental to check the queue length, the service time, and the I/O size averages
that are reported in the disk statistics:
– If the queue length and the service time are low, there is no performance problem.
– If the queue length is low and the service time and the I/O size are high, it is also not
evidence of a performance problem.



򐂰 Performance thresholds: They might indicate a change in the system. However, they
cannot explain why or how it changed. Only a good interpretation of data can answer
these types of questions.
Here is an actual case: A user was complaining that the transactional system showed a
performance degradation of 12% in the average transaction response time. The database
administrator (DBA) argued that the database spent much of the time in disk I/O
operations. At the operating system and the storage level, all performance indicators were
excellent. The only remarkable fact was that the system showed a cache hit ratio of 80%,
which did not make much sense because the access pattern of a transactional system is
random. Such a high cache hit ratio indicated that somehow the system was reading
sequentially. By analyzing the application, we discovered that about 30% of the database
workload related to 11 queries. These 11 queries accessed the tables without the help of
an index. The fix was to create those indexes on specific fields to optimize the
access to disk.
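
Returning to the wait I/O and disk statistics indicators above, a quick, hedged first check on
AIX is to watch the blocked-thread and wait I/O columns over a short interval and compare them
with the per-disk statistics before drawing any conclusion about a disk bottleneck:

# Watch the kthr "b" (blocked threads) and cpu "wa" (wait I/O) columns
vmstat 5 6
# Compare with per-disk queue lengths and service times
iostat -D 5 6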

9.1.2 Modern approach to the logical volume management in UNIX systems


LVMs are a core part of UNIX system management and performance tuning. Administrators
use LVM functions to adapt servers and disk systems precisely to the particularities of their
applications and workloads. This important role of the LVM also explains the way that DS8000
storage systems were configured in the past: workloads were isolated at the rank and array
level, and the administrator decided how and when each hardware component in the DS8000
storage system was used. LVMs played a key role in the disk management process;
administrators used them to add a flexible level of management on top of the hardware
resources. There were two main disk management techniques in LVMs: volume striping with
different levels of granularity, and implementation of the RAID technology. Administrators used
these techniques to overcome the limitations of earlier disk systems.

Modern DS8000 storage systems have many improvements in data management that can
change how LVM is used. Easy Tier V3 functions move data isolation from the rank level to the
extent pool, volume, and application levels. A disk system of today should be planned from the
application point of view, not from the hardware resource allocation point of view. Plan the
logical volumes (LVs) on an extent pool that is dedicated to one type of application or
workload. This method of disk system layout eliminates the necessity of LVM usage.

Moreover, by using LVM striping on Easy Tier managed volumes, you lose most of the
technology benefits because striping masks the real skew factor and distorts the real
picture of the hot extent allocation. This method might lead to an improper extent migration
plan, which leads to continuous massive extent migration. Performance analysis becomes
complicated, and locating I/O bottlenecks becomes a complicated task. In general, the
basic approach for most applications is to use one or two hybrid extent pools with three
tiers and Easy Tier in automated mode for a group of applications of the same kind or the same
workload type. To prevent bandwidth consumption by one or several applications, use the I/O
Priority Manager function.

Do not use the extended RAID functions of the LVM (RAID 5, 6, and 10). The exception is the
RAID 1 (mirroring) function, which might be required in high availability (HA) and disaster
recovery (DR) solutions.



However, there are still several situations in which LVM is used:
򐂰 A definite and complete understanding of the workload nature that requires isolation at
the rank level and disk management at the LVM level. This situation is rare today, but it
might be required.
򐂰 Usage of volumes in the operating system that are larger than the volumes that should be
created in the disk system (typically volumes larger than 2 TB). Use the concatenation
function of the LVM in this case. Concatenation itself does not change the skew factor.
򐂰 Any cluster software requirement for LVM usage, where LVM might be a key component of
the HA or DR solution.

Consider these points when you read the LVM description in the following sections. Also, see
Chapter 4, “Logical configuration performance considerations” on page 83.

9.1.3 Queue depth parameter (qdepth)


The operating system can enforce limits on the number of I/O requests that can be
outstanding from the SCSI adapter to a specific SCSI bus or disk drive. These limits are
intended to use the hardware ability to handle multiple requests while ensuring that the
seek-optimization algorithms in the device drivers can operate effectively. There are also
queue depth limits to the FC adapter and FC path.

All these parameters have a common meaning: the length of the queue of SCSI commands
that a device can keep outstanding (accepted but not yet confirmed as complete) while it
services I/O requests. A device (disk or FC adapter) acknowledges acceptance of a command
to the operating system or a driver before the command completes, which allows the operating
system or the driver to send another command or I/O request. One I/O request can consist of
two or more SCSI commands.

There are two methods for the implementation of queuing: untagged and tagged. For more
information about this topic, see the following website:
https://www.ibm.com/support/knowledgecenter/ssw_aix_71/com.ibm.aix.kernextc/scsi_c
md_tag.htm

SCSI command tagged queuing refers to queuing multiple commands to a SCSI device.
Queuing to the SCSI device can improve performance because the device itself determines
the most efficient way to order and process commands. Untagged queuing allows a target to
accept one I/O process from each initiator for each logical unit or target routine.

The qdepth parameter might affect the disk I/O performance. Values that are too small can
prevent the device from being used effectively. Values that are too high might lead to a QUEUE
FULL status at the device, cause subsequent I/Os to be rejected, and can lead to data
corruption or a system crash.

Another important reason why queue depth parameters must be set correctly is the queue
limits of the host ports of the disk system. The host port might be flooded with the SCSI
commands if there is no correct limit set in the operating system. When this situation
happens, a host port refuses to accept any I/Os, then resets, and then starts the loop
initialization primitive (LIP) procedure. This situation leads to the inactivity of the port for up to
several minutes and might initiate path failover or an I/O interruption. Moreover, in highly
loaded environments, this situation leads to the overload of the other paths and might lead to
the complete I/O interruption for the application or buffer overflow in the operating system,
which causes paging activity.



Queue depth parameter settings
Each operating system has its own settings for the disk I/O because many factors affect the
formation of the disk I/O in the operating system: kernel I/O threads, device driver-specific
factors, HBA driver-specific factors, and file system specifics. Pay close attention to the
parameters of the queue depth for the disks and FC HBAs that are specified in IBM System
Storage DS8000 Host System Attachment Guide, SC26-7917-02 for every supported
operating system. Not observing these suggestions can lead to the DS8000 host port buffer
overflow and data flow interrupts.

In addition, certain HBAs or multipathing drivers have their own queue depth settings to
manage FC targets and paths. These settings are needed to limit the number of commands
that the driver sends to the FC target and FCP LUN to prevent buffer overflow. For more
information, see the operating system-specific chapters of this book.

On AIX, the size of the disk (hdisk) driver queue is specified by the queue_depth attribute, and
the size of the adapter driver queue is specified by the num_cmd_elems attribute. In addition,
there are queue depth limits to the FC adapter and FC path. For more information, see 9.2.6,
“IBM Subsystem Device Driver for AIX” on page 306 and 9.2.7, “Multipath I/O with IBM
Subsystem Device Driver Path Control Module” on page 306.

Queue depths with Virtual I/O Server


A Virtual I/O Server (VIOS) supports Virtual SCSI (VSCSI) and N_Port ID Virtualization
(NPIV) adapter types.

For each VSCSI adapter in a VIOS, which is known as a vhost device, there is a matching
VSCSI adapter in a Virtual I/O Client (VIOC). These adapters have a fixed queue depth that
varies depending on how many VSCSI LUNs are configured for the adapter. There are 512
command elements, of which two are used by the adapter, three are reserved for each VSCSI
LUN for error recovery, and the rest are used for I/O requests. Thus, with the default
queue_depth of 3 for VSCSI LUNs, there are up to 85 LUNs to use at an adapter: (512 - 2)/(3
+ 3) = 85 (rounding down). So, if you need higher queue depths for the devices, the number of
LUNs per adapter is reduced. For example, if you want to use a queue_depth of 25, you can
have 510/28 = 18 LUNs. You can configure multiple VSCSI adapters to handle many LUNs
with high queue depths and each one requires additional memory. You can have more than
one VSCSI adapter on a VIOC that is connected to the same VIOS if you need more
bandwidth.



As shown in Figure 9-2, you must set the queue_depth attribute on the VIOC hdisk to match
the queue_depth of the mapped hdisk on the VIOS. As a formula, the maximum number of
LUNs per VSCSI adapter (vhost on the VIOS or VSCSI adapter on the VIOC) is INT(510/(Q+3)),
where Q is the queue_depth of all the LUNs (assuming that they are all the same).

Figure 9-2 VIO server queue depth

Important: To change the queue_depth on a hdisk device at the VIOS, you must unmap
the disk from the VIOC and remap it back after the change.
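
The following is a minimal sketch of this procedure; hdisk10 on the VIOS and hdisk4 on the
client are hypothetical names, and 20 is only an example value:

# On the VIOS (from the root shell via oem_setup_env), after removing the
# virtual target device that maps the disk to the client:
chdev -l hdisk10 -a queue_depth=20
# On the client LPAR, set the same value; -P defers the change to the next
# restart if the disk is in use:
chdev -l hdisk4 -a queue_depth=20 -P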

If you use NPIV, if you increase num_cmd_elems on the virtual FC (vFC) adapter, you must also
increase the setting on the real FC adapter.

For more information about the queue depth settings for VIO Server, see IBM System
Storage DS8000 Host Attachment and Interoperability, SG24-8887.

Queue depth and performance


Do not simply increase these values. Doing so can overload the disk subsystem or cause
problems with device configuration at boot. Therefore, the approach of adding up the hdisk
queue_depths to determine num_cmd_elems is not preferable. Instead, it is better to tune based
on the maximum number of I/Os for each device. When you increase queue_depths and the
number of in-flight I/Os that are sent to the disk subsystem, the I/O service times are likely to
increase, but throughput also increases. If the I/O service times start approaching the disk
timeout value, you are submitting more I/Os than the disk subsystem can handle. If you start
seeing I/O timeouts and errors in the error log that indicate problems completing I/Os, look for
hardware problems or make the pipe smaller.



A good general rule for tuning queue_depths is to increase queue_depths until I/O service
times start exceeding 15 ms for small random reads or writes, or until the queues are no longer
being filled.
After I/O service times start increasing, you push the bottleneck from the AIX disk and
adapter queues to the disk subsystem. There are two approaches to tuning queue depth:
򐂰 Use your application and tune the queues from it.
򐂰 Use a test tool to see what the disk subsystem can handle and tune the queues from that
information based on what the disk subsystem can handle.

It is preferable to tune based on the application I/O requirements, especially when the disk
system is shared with other servers.

Queue depth and performance on AIX


When you examine the devstats, if you see that the Maximum field = queue_depth x # paths
and qdepth_enable=yes for IBM System Storage Multipath Subsystem Device Driver (SDD),
then increasing the queue_depth for the hdisks might help performance. At least the I/Os
queue on the disk subsystem rather than in AIX. It is reasonable to increase queue depths
about 50% at a time.

Regarding the qdepth_enable parameter, the default is yes, which essentially has the SDD
handling the I/Os beyond queue_depth for the underlying hdisks. Setting it to no results in the
hdisk device driver handling them in its wait queue. With qdepth_enable=yes, SDD handles
the wait queue; otherwise, the hdisk device driver handles the wait queue. There are
error-handling benefits that allow the SDD to handle these I/Os, for example, by using LVM
mirroring across two DS8000 storage systems. With heavy I/O loads and much queuing in
SDD (when qdepth_enable=yes), it is more efficient to allow the hdisk device drivers to handle
relatively shorter wait queues rather than SDD handling a long wait queue by setting
qdepth_enable=no. SDD queue handling is single threaded and there is a thread for handling
each hdisk queue. So, if error handling is of primary importance (for example, when LVM
mirrors across disk subsystems), leave qdepth_enable=yes. Otherwise, setting
qdepth_enable=no more efficiently handles the wait queues when they are long. Set the
qdepth_enable parameter by using the datapath command because it is a dynamic change
that way (chdev is not dynamic for this parameter).

For the adapters, look at the adaptstats column. Set num_cmd_elems=Maximum or 200,
whichever is greater. Unlike devstats with qdepth_enable=yes, Maximum for adaptstats can
exceed num_cmd_elems.

It is reasonable to use the iostat -D command or sar -d to provide an indication if the
queue_depths need to be increased (Example 9-1).

Example 9-1 Iostat -D output (output truncated)


Paths/Disks:
hdisk0 xfer:%tm_act bps tps bread bwrtn
1.4 30.4K 3.6 23.7K 6.7K
read: rps avgserv minserv maxserv timeouts fails
2.8 5.7 1.6 25.1 0 0
write: wps avgserv minserv maxserv timeouts fails
0.8 9.0 2.1 52.8 0 0
queue: avgtime mintime maxtime avgwqsz avgsqsz sqfull
11.5 0.0 34.4 0.0 0.0 0.9

The iostat -D command shows statistics since system boot, and it assumes that the system
is configured to continuously maintain disk IO history. Run # lsattr -El sys0 to see whether
the iostat attribute is set to true, and smitty chgsys to change the attribute setting.



The avgwqsz is the average wait queue size, and avgsqsz is the average service queue size.
The average time that is spent in the wait queue is avgtime. The meaning of the sqfull value
changed from a count of the times that an I/O was submitted to a full queue to a rate of I/Os
submitted to a full queue. The example report shows the prior case (a count of I/Os submitted
to a full queue). Newer releases typically show decimal fractions that indicate a rate. It is good
that iostat -D separates reads and writes because you can expect the I/O
service times to differ when you have a disk subsystem with cache.

The sar -d command generates an output, as shown in Figure 9-3.

# sar -d 1 2

System configuration: lcpu=2 drives=1 ent=0.30

10:01:37    device    %busy    avque    r+w/s    Kbs/s    avwait

10:01:38    hdisk0    100      36.1     363      46153    51.1
10:01:39    hdisk0    99       38.1     350      44105    58.0
Average     hdisk0    99       37.1     356      45129    54.6

Figure 9-3 Output of sar -d command

The avwait is the average time spent in the wait queue. The avserv is the average time spent
in the service queue. avserv corresponds to avgserv in the iostat output. The avque value
represents the average number of I/Os in the wait queue.

IBM Subsystem Device Driver Path Control Module (SDDPCM) provides the pcmpath query
devstats and pcmpath query adaptstats commands to show hdisk and adapter queue
statistics. You can refer to the SDDPCM manual for syntax, options, and explanations of all
the fields. Example 9-2 shows devstats output for a single LUN and adaptstats output for a
Fibre Channel adapter.

Example 9-2 Pcmpath query output (output truncated)


# pcmpath query devstats

DEV#: 1 DEVICE NAME: hdisk1


===============================
Total Read Total Write Active Read Active Write Maximum
I/O: 69227 49414 0 0 40
SECTOR: 1294593 1300373 0 0 1737

Transfer Size: <= 512 <= 4k <= 16K <= 64K > 64K
118 20702 80403 12173 5245

# pcmpath query adaptstats


Adapter #: 0
=============
Total Read Total Write Active Read Active Write Maximum
I/O: 88366 135054 0 0 258
SECTOR: 2495579 4184324 0 0 3897

Look at the Maximum field, which indicates the maximum number of I/Os submitted to the
device since system boot.



For IBM SDDPCM, if the Maximum value equals the hdisk queue_depth, the hdisk driver
queue filled during the interval, so increasing queue_depth is appropriate.

You can monitor adapter queues and IOPS. For adapter IOPS, run iostat -at <interval>
<# of intervals> and for adapter queue information, run iostat -aD, optionally with an
interval and number of intervals.

The downside of setting queue depths too high is that the disk subsystem cannot handle the
I/O requests in a timely fashion and might even reject the I/O or ignore it. This situation can
result in an I/O timeout, and an I/O error recovery code is called. This situation is bad
because the processor ends up performing more work to handle I/Os than necessary. If the
I/O eventually fails, this situation can lead to an application crash or worse.

Lower the queue depth per LUN when using multipathing. With multipathing, this default value
is magnified because it equals the default queue depth of the adapter multiplied by the
number of active paths to the storage device. For example, because QLogic uses a default
queue depth of 32, the preferable queue depth value to use is 16 when using two active paths
and 8 when using four active paths. Directions for adjusting the queue depth are specific to
each HBA driver and are available in the documentation for the HBA.

For more information about AIX, see AIX disk queue depth tuning for performance, found at:
http://www.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/TD105745

9.2 AIX disk I/O components


Since AIX Version 6, a set of tunable parameters from six tuning commands (vmo, ioo,
schedo, raso, no, and nfso) is preset with default values that are optimized for most types of
workloads, and the tunable parameters are classified as “restricted use” tunables. Change
them only if instructed to do so by IBM Support. For more information, see IBM AIX Version
7.1 Differences Guide, SG24-7910.

With AIX 7 (with the DS8000 storage system), you must use the default parameters and install
the file sets for multipathing and host attachment, which already provide basic performance
defaults for queue length and SCSI timeout. For more information about setting up the volume
layout, see 9.2.4, “IBM Logical Volume Manager” on page 301.

For a description of AIX tuning, see the following resources:


򐂰 AIX 7.1 Performance tools guide and reference, found at:
http://public.dhe.ibm.com/systems/power/docs/aix/71/prftools_pdf.pdf
򐂰 AIX 7.1 Performance Management Manual describes the relationships between Virtual
Memory Manager (VMM) and the buffers that are used by the file systems and LVM:
http://public.dhe.ibm.com/systems/power/docs/aix/71/prftungd_pdf.pdf



9.2.1 AIX Journaled File System and Journaled File System 2
Journaled File System (JFS) and Journaled File System 2 (JFS2) are AIX standard file
systems. JFS was created for the 32-bit kernels and implements the concept of a
transactional file system where all of the I/O operations of the metadata information are kept
in a log. The practical impact is that in the recovery of a file system, the fsck command looks
at that log to see what I/O operations completed and rolls back only those operations that
were not completed. From a performance point of view, there is an impact. However, it is an
acceptable compromise to ensure the recovery of a corrupted file system. Its file organization
method is a linear algorithm. You can mount the file systems with the Direct I/O option. You
can adjust the mechanisms of sequential read ahead, sequential and random write behind,
delayed write operations, and others. You can tune its buffers to increase the performance. It
also supports AIO.

JFS2 was created for 64-bit kernels. Its file organization method is a B+ tree algorithm. It
supports all the features that are described for JFS, with the exception of “delayed write
operations.” It also supports concurrent I/O (CIO).

AIX file system caching


In AIX, you can limit the amount of memory used for file system cache and the behavior of the
page replacement algorithm. Configure the following parameters and use these values:
򐂰 minperm% = 3
򐂰 maxperm% = 90
򐂰 maxclient% = 90
򐂰 strict_maxperm = 0
򐂰 strict_maxclient = 1
򐂰 lru_file_repage = 0
򐂰 lru_poll_interval = 10
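
As a hedged sketch, these values can be checked and set persistently with the vmo command.
On recent AIX levels, several of them are already the defaults or are restricted tunables, so
review the current values first and change restricted tunables only when IBM Support advises it:

# Review the current values
vmo -a | grep -E "perm%|client%|lru"
# Apply the non-restricted values persistently (-p)
vmo -p -o minperm%=3 -o maxperm%=90 -o maxclient%=90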

File system I/O buffers


AIX tracks I/O to disk by using pinned memory buffers (pbufs). AIX 7.1 provides the ioo and
vmstat commands to control the following system-wide tuning parameters:
򐂰 numfsbufs
򐂰 lvm_bufcnt
򐂰 pd_npages
򐂰 v_pinshm
򐂰 j2_nBufferPerPagerDevice

For more information, see the following website:


http://www.ibm.com/support/knowledgecenter/ssw_aix_71/com.ibm.aix.performance/fs_b
uff_tuning.htm
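
Before increasing any of these buffers, a hedged way to check whether they are actually a
constraint is to look at the blocked I/O counters that vmstat reports; steadily growing values
indicate that the corresponding buffer pool is too small:

# Counters of I/Os that were blocked because no buffer was available
vmstat -v | grep -i blocked
# Current values of the related ioo tunables (-F includes restricted ones)
ioo -F -a | grep -E "numfsbufs|lvm_bufcnt|pd_npages|j2_nBufferPerPagerDevice"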

Read-ahead algorithms
JFS and JFS2 have read-ahead algorithms that can be configured to buffer data for
sequential reads into the file system cache before the application requests it. Ideally, this
feature reduces the percent of I/O wait (%iowait) and increases I/O throughput as seen from
the operating system. Configuring the read-ahead algorithms too aggressively results in
unnecessary I/O. The following VMM tunable parameters control read-ahead behavior:
򐂰 For JFS:
– minpgahead = max(2, <application’s blocksize> / <filesystem’s blocksize>)
– maxpgahead = max(256, (<application’s blocksize> / <filesystem’s blocksize> *
<application’s read ahead block count>))



򐂰 For JFS2:
– j2_minPgReadAhead = max(2, <application’s blocksize> / <filesystem’s
blocksize>)
– j2_maxPgReadAhead = max(256, (<application’s blocksize> / <filesystem’s
blocksize> * <application’s read ahead block count>))
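
For JFS2, these read-ahead values are ioo tunables. The following is a minimal sketch; the
512-page value (2 MB with 4 KB pages) is only an example, and the exact tunable names on your
AIX level can be confirmed with the first command:

# Show the current JFS2 read-ahead settings
ioo -F -a | grep -i readahead
# Example: allow sequential read-ahead to grow to 512 pages and persist the change
ioo -p -o j2_maxPageReadAhead=512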

I/O pacing
I/O pacing manages the concurrency to files and segments by limiting the processor
resources for processes that exceed a specified number of pending write I/Os to a discrete
file or segment. When a process exceeds the maxpout limit (high-water mark), it is put to sleep
until the number of pending write I/Os to the file or segment is less than minpout (low-water
mark). This pacing allows another process to access the file or segment.

Disabling I/O pacing improves backup times and sequential throughput. Enabling I/O pacing
ensures that no single process dominates the access to a file or segment. AIX V6.1 and
higher enables I/O pacing by default. In AIX V5.3, you need to enable explicitly this feature.
The feature is enabled by setting the sys0 settings of the minpout and maxpout parameters to
4096 and 8193 (lsattr -El sys0). To disable I/O pacing, simply set them both to zero. You
can also limit the effect of setting global parameters by mounting file systems by using an
explicit 0 for minput and maxpout: mount -o minpout=0,maxpout=0 /u. Tuning the maxpout and
minpout parameters might prevent any thread that performs sequential writes to a file from
dominating system resources.
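
For example, the system-wide high-water and low-water marks can be displayed and changed as
follows (the values are the ones suggested above):

# Display the current system-wide I/O pacing settings
lsattr -El sys0 -a minpout -a maxpout
# Set the suggested system-wide values (use 0 for both to disable pacing globally)
chdev -l sys0 -a minpout=4096 -a maxpout=8193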

In certain circumstances, it is appropriate to enable I/O pacing:


򐂰 For IBM PowerHA® (IBM HACMP™), enable I/O pacing to ensure that heartbeat activities
complete. If you enable I/O pacing, start with settings of maxpout=321 and minpout=240 by
increasing the default values of maxpout=33 and minpout=24.
򐂰 Starting with AIX V 5.3, I/O pacing can be enabled at the file system level by using mount
command options.
򐂰 In AIX V6, I/O pacing is technically enabled but with such high settings that the I/O pacing
is not active except under extreme situations.
򐂰 In AIX 7, I/O pacing parameters can be changed on mounted file systems by using the -o
remount option.

Enabling I/O pacing improves user response time at the expense of throughput. For more
information about I/O pacing, see the following website:
http://www.ibm.com/support/knowledgecenter/ssw_aix_71/com.ibm.aix.performance/disk
_io_pacing.htm

Write behind
The sync daemon (syncd) writes to disk the dirty file pages that remain in memory and are not
being reused. In some situations, this behavior might result in abnormally high temporary disk
utilization.

The write behind parameter causes pages to be written to disk before the sync daemon
runs. Writes are triggered when a specified number of sequential 16 KB clusters (for JFS) and
128 KB clusters (for JFS2) are updated.
򐂰 Sequential write behind:
– numclust for JFS
– j2_nPagesPerWriteBehindCluster and j2_nRandomCluster for JFS2



򐂰 Random write behind:
– maxrandwrt for JFS
– j2_maxRandomWrite

Setting j2_nPagesPerWriteBehindCluster to 0 disables JFS2 sequential write behind, and
setting j2_maxRandomWrite to 0 also disables JFS2 random write behind.

Mount options
Use the release behind, direct I/O, and CIO mount options when appropriate:
򐂰 The release behind mount option can reduce syncd and lrud impact. This option modifies
the file system behavior so that it does not maintain data in JFS2 cache. You use these
options if you know that data that goes into or out of certain file systems is not requested
again by the application before the data is likely to be paged out. Therefore, the lrud
daemon has less work to do to free cache and eliminates any syncd impact for this file
system. One example of a situation where you can use these options is if you have a Tivoli
Storage Manager Server with disk storage pools in file systems and you configured the
read ahead mechanism to increase the throughput of data, especially when a migration
takes place from disk storage pools to tape storage pools:
– -rbr for release behind after a read
– -rbw for release behind after a write
– -rbrw for release behind after a read or a write
򐂰 Direct I/O (DIO):
– Bypass JFS/JFS2 cache.
– No read ahead.
– An option of the mount command.
– Useful for databases that use file systems rather than raw LVs. If an application has its
own cache, it does not make sense to also keep data in file system cache.
– Direct I/O is not supported on compressed file systems.
For more information about DIO, see the following website:
http://www.ibm.com/support/knowledgecenter/ssw_aix_71/com.ibm.aix.genprogc/dire
ct_io_normal.htm
򐂰 CIO:
– Same as DIO but without inode locking, so the application must ensure data integrity
for multiple simultaneous I/Os to a file.
– An option of the mount command.
– Not available for JFS.
– If possible, consider the use of a no cache option at the database level when it is
available rather than at the AIX level. An example for DB2 is the no_filesystem_cache
option. DB2 can control it at the table space level. With a no file system caching policy
enabled in DB2, a specific OS call is made that uses CIO regardless of the mount. By
setting CIO as a mount option, all files in the file system are CIO-enabled, which might
not benefit certain files.
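
The following is a brief sketch of how these options are specified at mount time; the mount
points are examples only:

# Release-behind for a file system whose data is streamed through only once
mount -o rbrw /backup/stage
# Direct I/O for a database that manages its own cache in files
mount -o dio /oradata
# Concurrent I/O for JFS2 database file systems (no inode locking)
mount -o cio /db2data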

Asynchronous I/O
AIO is the AIX facility that allows an application to issue an I/O request and continue
processing without waiting for the I/O to finish. Many applications, such as databases and file
servers, take advantage of the ability to overlap processing and I/O.



򐂰 With AIX V6, the tunables fastpath and fsfastpath are classified as restricted tunables
and now are set to a value of 1 by default. Therefore, all AIO requests to a raw LV are
passed directly to the disk layer by using the corresponding strategy routine (legacy AIO or
POSIX-compliant AIO), or all AIO requests for files that are opened with CIO are passed
directly to LVM or disk by using the corresponding strategy routine.
򐂰 On AIX 7.1 a kernel process, called an AIO server, is in charge of each request from the
time it is taken off the queue until it completes. The performance depends on how many
AIO server processes are running. AIX 7.1 controls the number of AIO server processes
by using the minservers and maxservers tunable parameters. The ioo command shows
and changes these tunable parameters. You can change the current value of a tunable
anytime and make it persistent for the restart of the operating system.
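
As a hedged sketch, the AIO tunables can be inspected and adjusted with ioo; the tunable
names and suitable values depend on the AIX level and on the application, so treat the number
as a placeholder:

# Show the current AIO-related tunables
ioo -F -a | grep -i aio
# Example: raise the maximum number of AIO servers and persist the change
ioo -p -o aio_maxservers=64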

For more information about AIX AIO, see this website:


http://www.ibm.com/support/knowledgecenter/ssw_aix_71/com.ibm.aix.kernextc/async_i
o_subsys.htm

For more information about changing tunable values for AIX AIO, see the following website:
http://www.ibm.com/support/knowledgecenter/ssw_aix_71/com.ibm.aix.performance/asyn
c_io.htm

9.2.2 Symantec Veritas File System for AIX


Veritas developed the Veritas File System (VxFS), which was part of the Veritas Storage
Foundation (VSF). Implementation and administration are similar to the same versions on
other UNIX and Linux operating systems. With Symantec and Veritas splitting into two
companies, the nomenclature has been revised. Veritas is now referring to InfoScale Storage
Foundation and High Availability. For more information, see the Veritas Enterprise Product
Matrix at the following website:
https://sort.veritas.com/productmatrix/show/SF

For more information, see the Veritas File System for AIX Administrator’s Guide, found at:
https://www.veritas.com/support/en_US/article.DOC5203

In addition, Veritas published Advice for tuning Veritas File System on AIX, found at:
https://www.veritas.com/support/en_US/article.000067287

9.2.3 IBM Spectrum Scale


IBM Spectrum Scale™ (formerly known as IBM General Parallel File System (GPFS)) is a
clustered file system that can be shared among a set of nodes. The file system takes care of
file consistency throughout the cluster through mechanisms such as file locking and
management of shared LUNs.



IBM Spectrum Scale supports different storage architectures, as shown in Figure 9-4. When
configured with a DS8000 storage system, a shared SAN infrastructure is the architecture of
choice, but it can be combined with the Network Shared Disk (NSD) model. Spectrum Scale
allows storage-attached nodes to provision LUNs as NSDs to other Spectrum Scale nodes
that do not have disk access to those disks. Spectrum Scale nodes without direct storage
access (such as NSD Clients) use TCP/IP communication to NSD Servers. Typically, NSDs
are shared by 2 - 8 nodes in parallel because of failover capabilities. IBM Spectrum Scale
Software handles all logic in regard to sharing LUNs across multiple nodes and TCP/IP
communication.

Figure 9-4 IBM Spectrum Scale topologies

Spectrum Scale file system block size


You can use Spectrum Scale to define a file system block size during file system creation.
The parameter cannot be changed after creation.

At a high level, a larger file system block size provides higher throughput for medium to large
files by increasing I/O size. Smaller file system block sizes handle tiny files more efficiently. To
select the preferred file system block size, it is essential to understand the system workload,
especially the average and minimum file sizes.

The following sections describe some of the configuration parameters that are available in
Spectrum Scale. They include some notes about how the parameter values might affect
performance.

Pagepool
Spectrum Scale uses pinned memory (also called pagepool) and unpinned memory for
storing file data and metadata in support of I/O operations. Pinned memory regions cannot be
swapped out to disk.

The pagepool sets the size of buffer cache on each node. For new Spectrum Scale V4.1.1
installations, the default value is either one-third of the node’s physical memory or 1 GB,
whichever is smaller. For a database system, 100 MB might be enough. For an application
with many small files, you might need to increase this setting to 2 GB - 4 GB.



Additional parameter settings
Spectrum Scale implements Direct I/O and file system caching. For I/O tuning, check at least
the following parameter settings:
maxFilesToCache The total number of different files that can be cached at one time.
Every entry in the file cache requires some pageable memory to hold
the content of the file's inode plus control data structures. In Spectrum
Scale V4.1.1, the default value is 4000. If the application has many
small files, consider increasing this value.
maxStatCache This parameter sets aside additional pageable memory to cache
attributes of files that are not in the regular file cache. If the user
specifies a value for maxFilesToCache but does not specify a value for
maxStatCache, the default value of maxStatCache changes to
4*maxFilesToCache.
Number of threads The worker1Threads parameter controls the maximum number of
concurrent file operations at any instant. This parameter is primarily
used for random read or write requests that cannot be pre-fetched.
The prefetchThreads parameter controls the maximum number of
threads that are dedicated to prefetching data for files that are read
sequentially or to handle sequential write behind.
maxMBpS The maxMBpS parameter specifies an estimate of how many megabytes
of data can be transferred per second into or out of a single node.
Increase the maxMBpS parameter value to 80% of the total bandwidth
for all HBAs in a single host. In Spectrum Scale V4.1.1, the default
value is 2048 MBps.
maxblocksize Configure the IBM Spectrum Scale block size (maxblocksize) to match
the application I/O size, the RAID stripe size, or a multiple of the RAID
stripe size. For example, if you are running an Oracle database, it is
better to adjust a value that matches the product of the value of the
DB_BLOCK_SIZE and DB_FILE_MULTIBLOCK_READ_COUNT parameters. If
the application performs many sequential I/Os, it is better to configure
a block size of 8 - 16 MB to take advantage of the sequential
prefetching algorithm on the DS8000 storage system.
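
As an illustrative, hedged sketch, these parameters are changed with the mmchconfig command
(the values and names below are placeholders), and the file system block size itself can be set
only at creation time with the -B option of mmcrfs:

# Increase the pinned pagepool and the file cache limits (placeholder values)
mmchconfig pagepool=4G,maxFilesToCache=10000
# Raise the per-node bandwidth estimate
mmchconfig maxMBpS=4000
# The block size is fixed when the file system is created, for example 1 MiB
mmcrfs gpfs1 -F nsd_stanza.txt -B 1M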

For more information, see the following resources:


򐂰 IBM Spectrum Scale tuning parameters in IBM developerWorks®, found at:
https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%
20Parallel%20File%20System%20%28GPFS%29/page/Tuning%20Parameters
򐂰 IBM Spectrum Scale / GPFS manuals, found at:
http://www.ibm.com/support/knowledgecenter/STXKQY/ibmspectrumscale_welcome.html
򐂰 Tuning considerations in the IBM Spectrum Scale 4.1.1 Concepts, Planning, and
Installation Guide, GA76-0441
򐂰 Configuration and Tuning GPFS for Digital Media Environments, SG24-6700

9.2.4 IBM Logical Volume Manager


IBM Logical Volume Manager (LVM) is the standard volume manager that comes with AIX. It
is an abstraction layer that allows storage virtualization at the operating system level. It is
possible to implement RAID 0 or RAID 1, or a combination of both RAID types. It is also
possible to spread the data over the LUNs in a round-robin manner. You can configure the
buffer sizes to optimize performance.



Figure 9-5 shows an overview.

Figure 9-5 IBM LVM overview

In Figure 9-5, the DS8000 LUNs that are under the control of the LVM are called physical
volumes (PVs). The LVM splits the disk space into smaller pieces, which are called physical
partitions (PPs). A logical volume (LV) consists of several logical partitions (LPs). A file
system can be mounted over an LV, or the LV can be used as a raw device. Each LP can point to
up to three corresponding PPs. The ability of the LV to point a single LP to multiple PPs is
the way that LVM implements mirroring (RAID 1).

To set up the volume layout with the DS8000 LUNs, you can adopt one of the following
strategies:
򐂰 Storage pool striping: In this case, you are spreading the workload at the storage level. At
the operating system level, you must create the LVs with the inter-policy attribute set to
minimum, which is the default option when creating an LV.
򐂰 PP striping: A set of LUNs is created in different ranks inside of the DS8000 storage
system. When the LUNs are recognized in AIX, a volume group (VG) is created. The LVs
are spread evenly over the LUNs by setting the inter-policy to maximum, which is the most
common method that is used to distribute the workload. The advantage of this method
compared to storage pool striping is the granularity of data spread over the LUNs. With
storage pool striping, the data is spread in chunks of 1 GB. In a VG, you can create PP
sizes 8 - 16 MB. The advantage of this method compared to LVM striping is that you have
more flexibility to manage the LVs, such as adding more disks and redistributing evenly the
LVs across all disks by reorganizing the VG.
򐂰 LVM striping: As with PP striping, a set of LUNs is created in different ranks inside of the
DS8000 storage system. After the LUNs are recognized in AIX, a VG is created with larger
PP sizes, such as 128 MB or 256 MB. The LVs are spread evenly over the LUNs by setting
the stripe size of LV 8 - 16 MB. From a performance standpoint, LVM striping and PP
striping provide the same performance. You might see an advantage in a scenario of
PowerHA with LVM Cross-Site and VGs of 1 TB or more when you perform cluster
verification, or you see that operations related to creating, modifying, or deleting LVs are
faster.



Volume group limits
When creating the VG, there are LVM limits to consider along with the potential expansion of
the VG. The key LVM limits for a VG are shown in Table 9-1.

Table 9-1 Volume group characteristics


Limit Standard VG Big VG Scalable VG

Maximum PVs/VG 32 128 1024

Maximum LVs/VG 255 511 4095

Maximum PPs/VG 32512a 130048a 2097152


a. 1016 * VG Factor, which is 32 for a standard VG and 128 for a big VG.

Suggestion: Use AIX scalable volume groups whenever possible.

PP striping
Figure 9-6 shows an example of PP striping. The VG contains four LUNs and created 16 MB
PPs on the LUNs. The LV in this example consists of a group of 16 MB PPs from four logical
disks: hdisk4, hdisk5, hdisk6, and hdisk7.

In the figure, hdisk4, hdisk5, hdisk6, and hdisk7 are LUNs on different DS8000 extent pools.
Each 8 GB LUN is divided into approximately 500 physical partitions of 16 MB (pp1 - pp500).
The LV /dev/inter-disk_lv is made up of eight logical partitions (lp1 - lp8) that are allocated
round-robin across the four LUNs: 8 x 16 MB = 128 MB.

Figure 9-6 Inter-disk policy logical volume

The first step is to create a VG. Create a VG with a set of DS8000 LUNs where each LUN is
in a separate extent pool. If you plan to add a set of LUNs to a host, define another VG. To
create a VG, run the following command to create the data01vg and a PP size of 16 MB:
mkvg -S -s 16 -y data01vg hdisk4 hdisk5 hdisk6 hdisk7



Commands: To create the VG, if you use SDD, run the mkvg4vp command. If you use
SDDPCM, run the mkvg command. All the flags for the mkvg command apply to the mkvg4vp
command.

After you create the VG, the next step is to create the LVs. For a VG with four disks
(LUNs), create the LVs as a multiple of the number of disks in the VG times the PP size. In
this example, we create the LVs in multiples of 64 MB. You can implement the PP striping by
using the -e x option. By adding the -a e option, the Intra-Physical Volume Allocation Policy
changes the allocation policy from middle (default) to edge so that the LV PPs are allocated
beginning at the outer edge and continuing to the inner edge. This method ensures that all
PPs are sequentially allocated across the physical disks. To create an LV of 1 GB, run the
following command:
mklv -e x -a e -t jfs2 -y inter-disk_lv data01vg 64 hdisk4 hdisk5 hdisk6 hdisk7

Preferably, use inline logs for JFS2 LVs, which result in one log for every file system that is
automatically sized. Having one log per file system improves performance because it avoids
the serialization of access when multiple file systems make metadata changes. The
disadvantage of inline logs is that they cannot be monitored for I/O rates, which can provide
an indication of the rate of metadata changes for a file system.

LVM striping
Figure 9-7 shows an example of a striped LV. The LV called /dev/striped_lv uses the same
capacity as /dev/inter-disk_lv (shown in Figure 9-6 on page 303), but it is created
differently.

In the figure, the same LUNs (hdisk4, hdisk5, hdisk6, and hdisk7) on different DS8000 extent
pools are divided into approximately 32 physical partitions of 256 MB each (pp1 - pp32). The LV
/dev/striped_lv is made up of eight 256 MB logical partitions, and each logical partition is
subdivided into equal chunks of the stripe size that are written round-robin across the LUNs:
/dev/striped_lv = lp1.1 + lp2.1 + lp3.1 + lp4.1 + lp1.2 + lp2.2 + lp3.2 + lp4.2 + lp5.1 ...

Figure 9-7 Striped logical volume



/dev/striped_lv is also made up of eight 256 MB PPs, but each partition is then subdivided into
smaller chunks of the LV stripe size, which are distributed round-robin across all of the LUNs.

Again, the first step is to create a VG. To create a VG for LVM striping, run the following
command:
mkvg -S -s 256 -y data01vg hdisk4 hdisk5 hdisk6 hdisk7

For you to create a striped LV, you must combine the following options when you use LVM
striping:
򐂰 Stripe width (-C): This option sets the maximum number of disks to spread the data. The
default value is used from the upperbound option (-u).
򐂰 Copies (-c): This option is required only when you create mirrors. You can set 1 - 3 copies.
The default value is 1.
򐂰 Strict allocation policy (-s): This option is required only when you create mirrors and it is
necessary to use the value s (superstrict).
򐂰 Stripe size (-S): This option sets the size of a chunk of a sliced PP. Since AIX V5.3, the
valid values include 4K, 8K, 16K, 32K, 64K, 128K, 256K, 512K, 1M, 2M, 4M, 8M, 16M,
32M, 64M, and 128M.
򐂰 Upperbound (-u): This option sets the maximum number of disks for a new allocation. If
you set the allocation policy to superstrict, the upperbound value must be the result of the
stripe width times the number of copies that you want to create.

Important: Do not set the -e option with LVM striping.

To create a striped LV, run the following command:


mklv -C 4 -c 1 -S 8K -t jfs2 -u 4 -y striped_lv data01vg 4 hdisk4 hdisk5 hdisk6
hdisk7

Since AIX V5.3, the striped column feature is available. With this feature, you can extend an
LV onto a new set of disks after the current disks on which the LV is spread are full.

Memory buffers
Adjust the memory buffers (pv_min_pbuf) of LVM to increase the performance. Set the
parameter to 2048 for AIX 7.1.
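
A brief sketch of checking and setting this value persistently (the pbuf counter from vmstat -v
indicates whether I/Os are being blocked for lack of pbufs):

# Current value and the count of disk I/Os blocked with no pbuf
ioo -o pv_min_pbuf
vmstat -v | grep pbuf
# Set the suggested value persistently
ioo -p -o pv_min_pbuf=2048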

Scheduling policy
If you have a dual-site cluster solution that uses PowerHA with LVM Cross-Site, you can
reduce the link requirements among the sites by changing the scheduling policy of each LV to
parallel write/sequential read (ps). You must remember that the first copy of the mirror must
point to the local storage.

9.2.5 Symantec Veritas Volume Manager


Symantec Veritas Volume Manager (VxVM) is another volume manager. With AIX, VxVM can
replace the IBM LVM for rootvg. Implementation and administration are similar to the same
versions on other UNIX and Linux operating systems. For more information, see the following
websites:
򐂰 http://www.ibm.com/developerworks/aix/library/au-aixveritas/
򐂰 https://www.veritas.com/support/en_US/article.DOC5203



9.2.6 IBM Subsystem Device Driver for AIX
SDD is an IBM proprietary multipathing software that works with the DS8000 storage system
and other IBM storage devices only. It once was popular, but with SDDPCM and IBM AIX
Multipath I/O (MPIO), a more modern alternative is available now. For more information, see
the SDD support matrix at the following website:
http://www.ibm.com/support/docview.wss?rs=540&uid=ssg1S7001350#AIXSDD

With SDD V1.7 and prior versions, the datapath command, instead of the chdev command,
was used to change the qdepth_enable setting because then it is a dynamic change. For
example, datapath set qdepth disable sets it to no. Certain releases of SDD do not include
SDD queuing, and other releases include SDD queuing. Certain releases do not show the
qdepth_enable attribute. Either check the manual for your version of SDD or try the datapath
command to see whether it supports turning off this feature.

If qdepth_enable=yes (the default), I/Os that exceed the queue_depth queue in the SDD. If
qdepth_enable=no, I/Os that exceed the queue_depth queue in the hdisk wait queue. SDD
with qdepth_enable=no and SDDPCM do not queue I/Os and instead merely pass them to the
hdisk drivers.

Note: SDD is not compatible with the AIX MPIO framework. It cannot be installed on the
IBM Power Systems VIOS and is not supported on AIX 7.1 (or later). For these reasons,
SDD should no longer be used for AIX support of the DS8000 storage system.

9.2.7 Multipath I/O with IBM Subsystem Device Driver Path Control Module
MPIO is another multipathing device driver. It was introduced in AIX V5.2. The reason why AIX
provides its own multipathing framework is that, in a SAN environment, you might want to
connect to several storage subsystems from a single host. Each storage vendor has its own
multipathing solution that is not interoperable with the multipathing solutions of other storage
vendors. This restriction increases the complexity of managing the compatibility of operating
system fix levels, HBA firmware levels, and multipathing software versions.

AIX provides the base MPIO device driver; however, it is still necessary to install the MPIO
device driver that is provided by the storage vendor to take advantage of all of the features of
a multipathing solution. For the DS8000 storage system and other storage systems, IBM
provides the SDDPCM multipathing driver. SDDPCM is compatible with the AIX MPIO
framework and replaced SDD.

Use MPIO with SDDPCM rather than SDD with AIX whenever possible.

Comparing SDD and SDDPCM: with SDD, each LUN has a corresponding vpath and an
hdisk for each path to the LUN. With SDDPCM, you have only one hdisk per LUN. Thus, with
SDD, you can submit queue_depth x # paths I/Os to a LUN, but with SDDPCM, you can submit
only queue_depth I/Os to the LUN. If you switch from SDD that uses four paths to SDDPCM,
you must set the SDDPCM hdisk queue_depth to four times that of the SDD hdisks for an
equivalent effective queue depth.

Both the hdisk and adapter drivers have “in process” and “wait” queues. After the queue limit
is reached, the I/Os wait until an I/O completes and opens a slot in the service queue. The
in-process queue is also sometimes called the “service” queue. Many applications do not
generate many in-flight I/Os, especially single-threaded applications that do not use AIO.
Applications that use AIO are likely to generate more in-flight I/Os.



For more information about MPIO, see the following resources:
򐂰 Multipath I/O description in AIX 7.1, found at:
http://www.ibm.com/support/knowledgecenter/ssw_aix_71/com.ibm.aix.osdevice/devm
pio.htm
򐂰 Chapter 2 of the Oracle 12c Automatic Storage Management Administrator's Guide
explores storage preparation for Oracle ASM:
http://docs.oracle.com/database/121/OSTMG/toc.htm
򐂰 Check the interoperability matrix of SDDPCM (MPIO) to see which version is supported:
http://www.ibm.com/support/docview.wss?rs=540&uid=ssg1S7001350#AIXSDDPCM
򐂰 MPIO is supported only with PowerHA (HACMP) if you configure the VGs in Enhanced
Concurrent Mode. For more information about PowerHA with MPIO, see this website:
http://www.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/FLASH10504
򐂰 If you use a multipathing solution with VIOS, use MPIO. There are several limitations when
you use SDD with VIOS. For more information, see 9.2.10, “Virtual I/O Server” on
page 309 and the VIOS support site:
http://www14.software.ibm.com/webapp/set2/sas/f/vios/documentation/datasheet.ht
ml#multipath

9.2.8 Symantec Veritas Dynamic MultiPathing for AIX


Symantec Veritas Dynamic MultiPathing (DMP) is a device driver that is provided by
Symantec to work with VxVM. If you use VxVM, use DMP instead of SDD. Also, there is an
option for you to use VxVM with SDD. For more information, see the following website:
https://www.veritas.com/support/en_US/article.DOC2728

9.2.9 Fibre Channel adapters


FC adapters or HBAs provide the connection between the host and the storage devices.
Configure the following four important parameters:
򐂰 num_cmd_elems: This parameter sets the maximum number of commands to queue to the
adapter. When many supported storage devices are configured, you can increase this
attribute to improve performance. The range of supported values depends on the FC
adapter, which you can check by running lsattr -Rl fcs0 -a num_cmd_elems.
򐂰 max_xfer_size: This attribute for the fcs device, which controls the maximum I/O size that
the adapter device driver can handle, and also controls a memory area used by the
adapter for data transfers. For example, for the 8 Gb dual port adapter, with the default
max_xfer_size of 0x40000=256 KB, the DMA memory area size is 16 MB. Using any other
allowable value for max_xfer_size increases the memory area to 128 MB. The range of
supported values depends on the FC adapter, which you can check by running lsattr -Rl
fcs0 -a max_xfer_size.
For typical DS8000 environments, this setting must remain unchanged and use its default
value. Any other change might imply risks and not lead to performance improvements.
The fcstat command can be used to examine whether increasing num_cmd_elems or
max_xfer_size can increase performance. In selected environments with heavy I/O and
especially large I/Os (such as for backups), carefully consider increasing this setting and
verify whether you improve performance (see the following note and take appropriate
precautions when testing this setting).



At AIX V6.1 TL2 or later, a change was made for virtual FC adapters so that the DMA
memory area is always 128 MB even with the default max_xfer_size. This memory area is
a DMA memory area, but it is different from the DMA memory area that is controlled by the
lg_term_dma attribute (which is used for I/O control). The default value for lg_term_dma of
0x800000 is adequate.

Changing the max_xfer_size: Changing max_xfer_size uses memory in the PCI Host
Bridge chips attached to the PCI slots. The sales manual, regarding the dual-port 4
Gbps PCI-X FC adapter, states that “If placed in a PCI-X slot rated as SDR compatible
and/or has the slot speed of 133 MHz, the AIX value of the max_xfer_size must be kept
at the default setting of 0x100000 (1 megabyte) when both ports are in use. The
architecture of the DMA buffer for these slots does not accommodate larger
max_xfer_size settings.” Issues occur when configuring the LUNs if there are too many
FC adapters and too many LUNs attached to the adapter. Errors, such as DMA_ERR
might appear in the error report. If you get these errors, you must change the
max_xfer_size back to the default value. Also, if you boot from SAN and you encounter
this error, you cannot boot, so be sure to have a back-out plan if you plan to change the
max_xfer_size and boot from SAN.

򐂰 dyntrk: AIX supports dynamic tracking of FC devices. Previous releases of AIX required a
user to unconfigure FC storage device and adapter device instances before changing the
system area network (SAN), which can result in an N_Port ID (SCSI ID) change of any
remote storage ports. If dynamic tracking of FC devices is enabled, the FC adapter driver
detects when the Fibre Channel N_Port ID of a device changes. The FC adapter driver
then reroutes traffic that is destined for that device to the new address while the devices
are still online. Events that can cause an N_Port ID to change include moving a cable
between a switch and storage device from one switch port to another, connecting two
separate switches that use an inter-switch link (ISL), and possibly restarting a switch.
Dynamic tracking of FC devices is controlled by a new fscsi device attribute, dyntrk. The
default setting for this attribute is no. To enable dynamic tracking of FC devices, set this
attribute to dyntrk=yes, as shown in the following example:
chdev -l fscsi0 -a dyntrk=yes

Important: The disconnected cable must be reconnected within 15 seconds.

򐂰 fc_err_recov: AIX supports Fast I/O Failure for FC devices after link events in a switched
environment. If the FC adapter driver detects a link event, such as a lost link between a
storage device and a switch, the FC adapter driver waits a short time, approximately 15
seconds, so that the fabric can stabilize. At that point, if the FC adapter driver detects that
the device is not on the fabric, it begins failing all I/Os at the adapter driver. Any new I/O or
future retries of the failed I/Os are failed immediately by the adapter until the adapter driver
detects that the device rejoined the fabric. Fast Failure of I/O is controlled by a fscsi device
attribute, fc_err_recov. The default setting for this attribute is delayed_fail, which is the
I/O failure behavior seen in previous versions of AIX. To enable Fast I/O Failure, set this
attribute to fast_fail, as shown in the following example:
chdev -l fscsi0 -a fc_err_recov=fast_fail

Important: Change fc_err_recov to fast_fail and dyntrk to yes only if you use a
multipathing solution with more than one path.

Example 9-3 on page 309 shows the output of the attributes of a fcs device.



Example 9-3 Output of a fcs device
# lsattr -El fcs0
DIF_enabled no DIF (T10 protection) enabled True
bus_mem_addr 0xa0108000 Bus memory address False
init_link auto INIT Link flags False
intr_msi_1 30752 Bus interrupt level False
intr_priority 3 Interrupt priority False
io_dma 64 IO_DMA True
lg_term_dma 0x800000 Long term DMA True
max_xfer_size 0x100000 Maximum Transfer Size True
msi_type msix MSI Interrupt type False
num_cmd_elems 500 Maximum number of COMMANDS to queue to the adapter True

Example 9-4 is the output of the attributes of a fscsi device.

Example 9-4 Output of a fscsi device


# lsattr -El fscsi0
attach switch How this adapter is CONNECTED False
dyntrk yes Dynamic Tracking of FC Devices True+
fc_err_recov fast_fail FC Fabric Event Error RECOVERY Policy True+
scsi_id 0x10200 Adapter SCSI ID False
sw_fc_class 3 FC Class for Fabric True

For more information about the Fast I/O Failure (fc_err_recov) and Dynamic Tracking
(dyntrk) options, see the following links:
򐂰 http://www.ibm.com/support/knowledgecenter/ssw_aix_71/com.ibm.aix.kernextc/dynami
ctracking.htm
򐂰 http://www.ibm.com/support/knowledgecenter/ssw_aix_71/com.ibm.aix.performance/fas
t_fail_dynamic_interaction.htm

For more information about num_cmd_elems and max_xfer_size, see AIX disk queue depth
tuning for performance, found at:
http://www.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/TD105745
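
As a hedged example, the following commands show one way to inspect and then raise
num_cmd_elems and max_xfer_size on an adapter. The values are placeholders, not
recommendations; validate them for your adapter type and remember the DMA buffer
considerations that are described at the beginning of this section:

lsattr -El fcs0 -a num_cmd_elems -a max_xfer_size
# Example values only; -P defers the change until the next reboot
chdev -l fcs0 -a num_cmd_elems=1024 -a max_xfer_size=0x200000 -P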

For more information, see the SDD and SDDPCM User’s Guides at the following website:
http://www.ibm.com/support/docview.wss?rs=540&context=ST52G7&uid=ssg1S7000303

9.2.10 Virtual I/O Server


The VIOS is an appliance that provides I/O virtualization and sharing of disk storage,
Ethernet adapters, optical devices, and tape storage to client LPARs on IBM Power Systems.
It is built on top of AIX and uses a default AIX user, padmin, that runs in a restricted shell
and provides access only to the commands of the ioscli command-line interface (CLI).

The VIOS allows a physical adapter with disk attached at the VIOS partition level to be shared
by one or more partitions and enables clients to consolidate and potentially minimize the
number of required physical adapters.



Figure 9-8 shows an overview of multipathing with VIOS.

Figure 9-8 Multipathing with VIOS overview

There are two ways of connecting disk storage to a client LPAR with VIOS:
򐂰 VSCSI
򐂰 N-port ID Virtualization (NPIV)

VIOS Virtual SCSI mode


The VIOS in VSCSI mode works in the following manner:
򐂰 On the DS8000 side, the same set of LUNs for the VIOSs is assigned to the
corresponding volume groups for LUN masking.
򐂰 On the Power System, at least two LPARs are defined as VIO servers. They act as SCSI
servers that provide access to the DS8000 LUNs for the other LPARs. Have two or more
HBAs installed on each VIOS.
For every LPAR VSCSI client, a VSCSI server (VSCSI target) is defined in both VIOSs
through the Hardware Management Console (HMC). Similarly, for every VIOS, a VSCSI
client device (VSCSI initiator) is defined for the corresponding LPAR VSCSI client. The
SDDPCM host attachment package and SDDPCM for AIX are installed in the VIOSs
through the oem_setup_env command, as with an ordinary AIX server. For all DS8000
LUNs that you map directly to the corresponding LPAR client vscsi device, you also must
disable the SCSI reservation first (a sketch of the VIOS commands follows this list).
򐂰 The LPAR that is the VSCSI client needs only the basic MPIO device driver (MPIO works
in failover mode only). You can use the VSCSI disk devices like any other ordinary disk in
AIX.
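
The following sketch illustrates, with hypothetical device names (hdisk2, vhost0, and
client1_disk0), how a LUN might be prepared and mapped from the VIOS restricted shell. It
is an illustration under those assumptions, not a complete procedure:

# Disable the SCSI reservation on the VIOS before mapping the LUN
$ chdev -dev hdisk2 -attr reserve_policy=no_reserve
# Map the LUN to the client through the virtual SCSI server adapter
$ mkvdev -vdev hdisk2 -vadapter vhost0 -dev client1_disk0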



When you assign many LUNs from the DS8000 storage system to the VIOS and then map
those LUNs to the LPAR clients, over time even trivial activities, such as upgrading the
SDDPCM device driver, can become challenging.

VIOS NPIV mode


A VIOS NPIV connection requires 16 Gb or 8 Gb FC adapters in the VIOS, and it requires
NPIV-enabled switches to connect to the storage system.

The VIOS NPIV works as follows:


򐂰 On the DS8000 side, the same set of LUNs for the VIOSs is assigned to the
corresponding volume groups.
򐂰 In the Power System, you create server virtual FC adapters in the VIOS LPAR. You create
client virtual FC adapters in the client LPAR, and assign each of them to a server virtual
FC adapter in the VIOS.
򐂰 In the VIOS, you map each server virtual FC adapter to a port in the physical adapter (a
sketch of the VIOS commands follows this list). After this task is done, the WWPN of the
client virtual FC adapter logs in to the switch to which the port is physically connected.
򐂰 In the switch, you zone the WWPNs of the client virtual FC adapter with the ports of the
storage system.
򐂰 In the DS8000 storage system, you assign the volume group to the WWPN of the client
adapter. After this task is done, the LUNs in the VG are available to the host.
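
As a sketch of the VIOS-side mapping step, with vfchost0 and fcs0 as hypothetical adapter
names, the following ioscli commands map a server virtual FC adapter to a physical port and
verify the NPIV mappings:

# Map the server virtual FC adapter to a physical FC port
$ vfcmap -vadapter vfchost0 -fcp fcs0
# Verify the NPIV mappings and the client WWPN logins
$ lsmap -all -npiv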

For more information about VIOS, see the following resources:


򐂰 IBM PowerVM Virtualization Introduction and Configuration, SG24-7940, found at:
http://www.redbooks.ibm.com/redbooks/pdfs/sg247940.pdf
򐂰 IBM PowerVM Virtualization Managing and Monitoring, SG24-7590
򐂰 Virtual I/O Server and Integrated Virtualization Manager command descriptions, found at:
http://www.ibm.com/support/knowledgecenter/8247-21L/p8hcg/p8hcg_kickoff_alphabe
tical.htm
򐂰 Also, check the VIOS frequently asked questions (FAQs), which explain in more detail
several restrictions and limitations, such as the lack of the load balancing feature for AIX
VSCSI MPIO devices:
http://www14.software.ibm.com/webapp/set2/sas/f/vios/documentation/faq.html

Performance suggestions
Use the following settings when configuring VSCSI for performance:
򐂰 Processor:
– A typical entitlement is 0.25.
– Use two virtual processors.
– Always run uncapped.
– Run at a higher priority (weight factor >128).
– Assign more processor power with high network loads.



򐂰 Memory:
– Typically use at least 10 GB (at least 1 GB of memory is required; plan a minimum of
2 GB + 20 MB per hdisk). This type of configuration avoids paging activity in the VIOS,
which might lead to performance degradation in all LPARs.
– Add more memory if there are high device (vscsi and hdisk) counts.
– Small LUNs drive up the memory requirements.
򐂰 For multipathing with VIOS, check the configuration of the following parameters (example
commands follow the note after this list):
– fscsi devices on VIOS:
• The attribute fc_err_recov is set to fast_fail.
• The attribute dyntrk is set to yes by running chdev -l fscsiX -a dyntrk=yes.
– Hdisk devices on VIOS:
• The attribute algorithm is set to load_balance.
• The attribute reserve_policy is set to no_reserve.
• The attribute hcheck_mode is set to nonactive.
• The attribute hcheck_interval is set to 20.
– vscsi devices in client LPARs: The attribute vscsi_path_to is set to 30.
– Hdisk devices in client:
• The attribute algorithm is set to failover.
• The attribute reserve_policy is set to no_reserve.
• The attribute hcheck_mode is set to nonactive.
• The attribute hcheck_interval is set to 20.

Important: Change the reserve_policy parameter to no_reserve only if you are going to
map the LUNs of the DS8000 storage system directly to the client LPAR.
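
The following commands are a hedged illustration of how these attributes might be set; the
device names are placeholders, the exact attribute values depend on the installed path
control module, and the -P flag defers the change until the device is reconfigured or the
system is restarted:

# On the VIOS (SDDPCM-managed hdisk), through oem_setup_env:
chdev -l hdisk2 -a algorithm=load_balance -a reserve_policy=no_reserve -a hcheck_mode=nonactive -a hcheck_interval=20 -P
# In the client LPAR (default AIX MPIO):
chdev -l vscsi0 -a vscsi_path_to=30 -P
chdev -l hdisk0 -a algorithm=fail_over -a reserve_policy=no_reserve -a hcheck_mode=nonactive -a hcheck_interval=20 -P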

For more information, see the Planning for the Virtual I/O Server and Planning for virtual SCSI
sections in the POWER8 IBM Knowledge Center at the following website:
򐂰 http://www.ibm.com/support/knowledgecenter/POWER8/p8hb1/p8hb1_vios_planning.htm
򐂰 http://www.ibm.com/support/knowledgecenter/8247-21L/p8hb1/p8hb1_vios_planning_vsc
si.htm

9.3 AIX performance monitoring tools


This section briefly reviews the various AIX commands and utilities that are useful for
performance monitoring.

9.3.1 AIX vmstat


The vmstat utility is a useful tool for taking a quick snapshot of the system performance. It is
a good first step toward understanding a performance issue, specifically for determining
whether the system is I/O bound.

Example 9-5 on page 313 shows how vmstat can help you monitor file system activity by
running the vmstat -I command.



Example 9-5 The vmstat -I utility output for file system activity analysis
[root@p520-tic-3]# vmstat -I 1 5

System Configuration: lcpu=30 mem=3760MB

kthr memory page faults cpu


-------- ----------- ------------------------ ------------ -----------
r b p avm fre fi fo pi po fr sr in sy cs us sy id wa
10 1 42 5701952 10391285 338 715 0 0 268 541 2027 46974 13671 18 5 60 16
10 0 88 5703217 10383443 0 6800 0 0 0 0 9355 72217 26819 14 10 15 62
9 0 88 5697049 10381654 0 7757 0 0 0 0 10107 77807 28646 18 11 11 60
8 1 70 5692366 10378376 0 8171 0 0 0 0 9743 80831 26634 20 12 13 56
5 0 74 5697938 10365625 0 6867 0 0 0 0 11986 63476 28737 13 10 13 64
10 0 82 5698586 10357280 0 7745 0 0 0 0 12178 66806 29431 14 11 12 63
12 0 80 5704760 10343915 0 7272 0 0 0 0 10730 74279 29453 16 11 11 62
6 0 84 5702459 10337248 0 9193 0 0 0 0 12071 72015 30684 15 12 11 62
6 0 80 5706050 10324435 0 9183 0 0 0 0 11653 72781 31888 16 10 12 62
8 0 76 5700390 10321102 0 9227 0 0 0 0 11822 82110 31088 18 14 12 56

In an I/O-bound system, look for these indicators:


򐂰 A high I/O wait percentage, as shown in the “cpu” column under the “wa” subcolumn.
Example 9-5 shows that a majority of CPU cycles are waiting for I/O operations to
complete.
򐂰 A high number of blocked processes, as shown in the “kthr” column under the subcolumns
“b” and “p”, which are the wait queue (b) and the wait queue for raw devices (p). A high
number of blocked processes normally indicates I/O contention among processes.
򐂰 Paging activity, as seen under the “page” column. High fi (file page-in) and fo (file
page-out) values indicate intensive file caching activity.

Example 9-6 shows you another option that you can use, vmstat -v, from which you can
understand whether the blocked I/Os are because of a shortage of buffers.

Example 9-6 The vmstat -v utility output for file system buffer activity analysis
[root@p520-tic-3]# vmstat -v | tail -7
0 pending disk I/Os blocked with no pbuf
0 paging space I/Os blocked with no psbuf
2484 filesystem I/Os blocked with no fsbuf
0 client filesystem I/Os blocked with no fsbuf
0 external pager filesystem I/Os blocked with no fsbuf
0 Virtualized Partition Memory Page Faults
0.00 Time resolving virtualized partition memory page faults

In Example 9-6, notice these characteristics:


򐂰 File system buffer (fsbuf) and LVM buffer (pbuf) space are used to hold the I/O requests at
the file system and LVM levels.
򐂰 If a substantial number of I/Os is blocked because of insufficient buffer space, both buffers
can be increased by using the ioo command. However, setting the values too high can
degrade overall system performance, so increase the buffers incrementally and monitor
the system performance with each increase (a sketch follows).

For the preferred practice values, see the application papers listed under “AIX file system
caching” on page 296.
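
As a hedged sketch only, the following commands show how the buffers might be raised in
small steps and then rechecked; the tunable names and values are examples that depend on
the AIX level and file system type, so confirm them with ioo -a and lvmo -a first:

lvmo -v datavg -o pv_pbuf_count=1024           # example: raise the pbufs for one VG
ioo -p -o j2_dynamicBufferPreallocation=32     # example: raise the JFS2 buffer preallocation
vmstat -v | tail -7                            # recheck the blocked I/O counters after each change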



By using lvmo, you can also check whether contention is happening because of a lack of LVM
memory buffers (pbufs), as shown in Example 9-7.

Example 9-7 Output of lvmo -a

[root@p520-tic-3]# lvmo -a -v rootvg


vgname = rootvg
pv_pbuf_count = 512
total_vg_pbufs = 1024
max_vg_pbuf_count = 16384
pervg_blocked_io_count = 0
pv_min_pbuf = 512
global_blocked_io_count = 0

As you can see in Example 9-7, there are two incremental counters: pervg_blocked_io_count
and global_blocked_io_count. The first counter indicates how many times an I/O was blocked
because of a lack of LVM pinned memory buffers (pbufs) on that VG. The second counter
indicates how many times an I/O was blocked because of a lack of pbufs in the whole
operating system. Other indicators of an I/O-bound system can be seen in the disk xfer part
of the vmstat output when vmstat is run against physical disks, as shown in Example 9-8.

Example 9-8 Output of vmstat for disk xfer


# vmstat hdisk0 hdisk1 1 8
kthr memory page faults cpu disk xfer
---- ---------- ----------------------- ------------ ----------- ------
r b avm fre re pi po fr sr cy in sy cs us sy id wa 1 2 3 4
0 0 3456 27743 0 0 0 0 0 0 131 149 28 0 1 99 0 0 0
0 0 3456 27743 0 0 0 0 0 0 131 77 30 0 1 99 0 0 0
1 0 3498 27152 0 0 0 0 0 0 153 1088 35 1 10 87 2 0 11
0 1 3499 26543 0 0 0 0 0 0 199 1530 38 1 19 0 80 0 59
0 1 3499 25406 0 0 0 0 0 0 187 2472 38 2 26 0 72 0 53
0 0 3456 24329 0 0 0 0 0 0 178 1301 37 2 12 20 66 0 42
0 0 3456 24329 0 0 0 0 0 0 124 58 19 0 0 99 0 0 0
0 0 3456 24329 0 0 0 0 0 0 123 58 23 0 0 99 0 0 0

The disk xfer part provides the number of transfers per second to the specified PVs that
occurred in the sample interval. This count reflects I/O requests only; it does not indicate the
amount of data that was read or written.

For more information, see the following website:


http://www.ibm.com/support/knowledgecenter/ssw_aix_71/com.ibm.aix.cmds6/vmstat.htm

9.3.2 pstat
You can use the pstat command to count how many legacy AIO servers are running on the server. There are
two AIO subsystems:
򐂰 Legacy AIO
򐂰 Posix AIO

You can run the pstat -a | grep ‘aioserver’ | wc -l command to get the number of legacy
AIO servers that are running. You can run the pstat -a | grep posix_aioserver | wc -l
command to see the number of POSIX AIO servers.



Example 9-9 shows that the host does not have any AIO servers that are running. This
function is not enabled by default. You can enable this function by running mkdev -l aio0 or
by using SMIT. For POSIX AIO, substitute posix_aio for aio0.

Example 9-9 pstat -a output to measure the legacy AIO activity


[root@p520-tic-3]# pstat -a | grep ‘ aioserver’ | wc -l
0
[root@p520-tic-3]#

Important: If you use raw devices, you must use ps -k instead of pstat -a to measure the
legacy AIO activity.

In AIX V6 and Version 7, both AIO subsystems are loaded by default but are activated only
when an AIO request is initiated by the application. Run pstat -a | grep aio to see the AIO
subsystems that are loaded, as shown in Example 9-10.

Example 9-10 pstat -a output to show the AIO subsystem defined in AIX 7.1
[p8-e870-01v1:root:/:] pstat -a|grep aio
50 a 3200c8 1 3200c8 0 0 1 aioPpool
1049 a 1901f6 1 1901f6 0 0 1 aioLpool

In AIX Version 6 and Version 7, you can use the ioo tunables to show whether the AIO is
used. An illustration is given in Example 9-11.

Example 9-11 ioo -a output to show the AIO subsystem activity in AIX 7.1
[p8-e870-01v1:root:/:] ioo -a|grep aio
aio_active = 0
aio_maxreqs = 131072
aio_maxservers = 30
aio_minservers = 3
aio_server_inactivity = 300
posix_aio_active = 0
posix_aio_maxreqs = 131072
posix_aio_maxservers = 30
posix_aio_minservers = 3
posix_aio_server_inactivity = 300

In Example 9-11, aio_active and posix_aio_active show whether the AIO is used. The
parameters aio_server_inactivity and posix_aio_server_inactivity show how long an
AIO server sleeps without servicing an I/O request.

To check the AIO configuration in AIX V5.3, run the commands that are shown in
Example 9-12.

Example 9-12 lsattr -El aio0 output to list the configuration of legacy AIO
[root@p520-tic-3]# lsattr -El aio0
autoconfig defined STATE to be configured at system restart True
fastpath enable State of fast path True
kprocprio 39 Server PRIORITY True
maxreqs 4096 Maximum number of REQUESTS True
maxservers 10 MAXIMUM number of servers per cpu True
minservers 1 MINIMUM number of servers True



Notes: If you use AIX Version 6 or Version 7, there are no more AIO devices in the Object
Data Manager (ODM), and the aioo command is removed. You must use the ioo command
to change them.

If your AIX V5.3 is between TL05 and TL08, you can also use the aioo command to list and
increase the values of maxservers, minservers, and maxreqs.

The rule is to monitor the I/O wait by using the vmstat command. If the I/O wait is more than
25%, consider enabling AIO, which reduces the I/O wait but does not help disks that are busy.
You can monitor busy disks by using iostat.
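
In AIX Version 6 and Version 7, where AIO is activated on demand, the tunables that are
shown in Example 9-11 can be raised with ioo if monitoring indicates that the maximums are
reached. The following values are examples only, not recommendations:

ioo -a | grep aio                                  # review the current AIO tunables
ioo -p -o aio_maxservers=60 -o aio_maxreqs=262144  # example values; -p makes the change persistent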

9.3.3 AIX iostat


The iostat command is used for monitoring system I/O device load. It can be used to
determine and balance the I/O load between physical disks and adapters.

The lsattr -E -l sys0 -a iostat command indicates whether the iostat statistic collection
is enabled. To enable the collection of iostat data, run chdev -l sys0 -a iostat=true.

The disk and adapter-level system throughput can be observed by running the iostat -aDR
command.

The a option retrieves the adapter-level details, and the D option retrieves the disk-level
details. The R option resets the min* and max* values at each interval, as shown in
Example 9-13.

Example 9-13 Disk-level and adapter-level details by using iostat -aDR


[root@p520-tic-3]# iostat -aDR 1 1
System configuration: lcpu=2 drives=1 paths=1 vdisks=1 tapes=0

Vadapter:
vscsi0 xfer: Kbps tps bkread bkwrtn partition-id
29.7 3.6 2.8 0.8 0
read: rps avgserv minserv maxserv
0.0 48.2S 1.6 25.1
write: wps avgserv minserv maxserv
30402.8 0.0 2.1 52.8
queue: avgtime mintime maxtime avgwqsz avgsqsz sqfull
0.0 0.0 0.0 0.0 0.0 0.0

Paths/Disks:
hdisk0 xfer: %tm_act bps tps bread bwrtn
1.4 30.4K 3.6 23.7K 6.7K
read: rps avgserv minserv maxserv timeouts fails
2.8 5.7 1.6 25.1 0 0
write: wps avgserv minserv maxserv timeouts fails
0.8 9.0 2.1 52.8 0 0
queue: avgtime mintime maxtime avgwqsz avgsqsz sqfull
11.5 0.0 34.4 0.0 0.0 0.9



Check for the following situations when analyzing the output of iostat:
򐂰 Check whether the number of I/Os is balanced among the disks. If not, it might indicate
that you have problems in the distribution of PPs over the LUNs. With the information
provided by lvmstat or filemon, select the most active LV, and with the lslv -m command,
check whether the PPs are distributed evenly among the disks of the VG. If not, check the
inter-policy attribute on the LVs to see whether they are set to maximum. If the PPs are not
distributed evenly and the LV inter-policy attribute is set to minimum, you must change the
attribute to maximum and reorganize the VG.
򐂰 Check in the read section whether the avgserv is larger than 15 ms. If so, this might
indicate that your bottleneck is in a lower layer, which can be the HBA, SAN, or even in the
storage. Also, check whether the same problem occurs with other disks of the same VG. If
yes, you must add up the number of IOPS, add up the throughput by vpath (if you still use
SDD), rank, and host, and compare the number with the performance numbers from your
storage system’s monitoring tool.
򐂰 Check in the write section whether the avgserv is larger than 3 ms. Writes that average
significantly and consistently higher service times indicate that write cache is full, and
there is a bottleneck in the disk.
򐂰 Check in the queue section whether avgwqsz is larger than avgsqsz. Compare with other
disks in the storage. Check whether the PPs are distributed evenly in all disks in the VG. If
avgwqsz is smaller than avgsqsz, compare with other disks in the storage. If there are
differences and the PPs are distributed evenly in the VG, it might indicate that the
unbalanced situation is at the rank level.
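
To gather the data for these checks over a peak period, a simple approach is to record the
extended iostat output to a file for later review, as in the following sketch (the interval and
count are examples):

# 30-second samples for one hour, one line per disk, with timestamps
nohup iostat -DRTl 30 120 > /tmp/iostat_$(hostname)_$(date +%Y%m%d).txt &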



The following example shows how multipath must be considered to interpret the iostat
output. Figure 9-9 shows a server with two FC adapters. The SAN zoning provides four I/O
paths to the DS8000 storage system.

Figure 9-9 Devices that are presented to iostat

The iostat command displays the I/O statistics for the hdisk1 device. One way to establish
relationships between an hdisk device, the corresponding I/O paths, and the DS8000 LVs is
to use the pcmpath query device command that is installed with SDDPCM, as shown in
Example 9-14. In this example, the logical disk on the DS8000 storage system has LUN
serial number 75ZA5710019.

Example 9-14 The pcmpath query device command


[p8-e870-01v1:root:/:] pcmpath query device 1

DEV#: 1 DEVICE NAME: hdisk1 TYPE: 2107900 ALGORITHM: Load Balance


SERIAL: 75ZA5710019
==========================================================================
Path# Adapter/Path Name State Mode Select Errors
0 fscsi0/path0 OPEN NORMAL 58184 0
1 fscsi1/path1 OPEN NORMAL 58226 0
2 fscsi2/path2 OPEN NORMAL 58299 0
3 fscsi3/path3 OPEN NORMAL 58118 0

The option that is shown in Example 9-15 on page 319 provides details in a record format,
which can be used to sum up the disk activity.



Example 9-15 Output of the iostat command (output truncated)
# iostat -alDRT 1 2

System configuration: lcpu=16 drives=5 paths=17 vdisks=22 tapes=0


Adapter: xfers time
-------------------- --------------------------- ---------
bps tps bread bwrtn
fcs0 40.6M 8815.9 40.6M 0.0 15:10:03

Disks: xfers read write


queue time
---------- -------------------------------- ------------------------------------ ------------------------------------ ------------------
%tm bps tps bread bwrtn rps avg min max time fail wps avg min max time fail avg
min max avg avg serv
act serv serv serv outs serv serv serv outs time
time time wqsz sqsz qfull
hdisk2 9.9 11.1M 2406.4 11.1M 0.0 4643.8 0.1 0.1 0.4 0 0 0.0 0.0 0.0 0.0 0 0 0.0
0.0 0.0 0.0 0.0 0.0 15:10:03
hdisk1 5.9 8.2M 1778.5 8.2M 0.0 3665.3 0.1 0.1 8.5 0 0 0.0 0.0 0.0 0.0 0 0 0.0
0.0 0.0 0.0 0.0 0.0 15:10:03
hdisk3 12.9 11.0M 2387.3 11.0M 0.0 4889.2 0.1 0.1 10.6 0 0 0.0 0.0 0.0 0.0 0 0 0.0
0.0 0.0 0.0 0.0 0.0 15:10:03
hdisk4 5.9 10.3M 2243.8 10.3M 0.0 4519.5 0.1 0.1 9.6 0 0 0.0 0.0 0.0 0.0 0 0 0.0
0.0 0.0 0.0 0.0 0.0 15:10:03

Adapter: xfers time


-------------------- --------------------------- ---------
bps tps bread bwrtn
sissas1 561.4K 137.1 0.0 561.4K 15:10:03

It is not unusual to see a device reported by iostat as 90% - 100% busy because a DS8000
volume that is spread across an array of multiple disks can sustain a much higher I/O rate
than a single physical disk. A device that is 100% busy is a problem for a single physical
device, but it is not necessarily a problem for a volume on a RAID 5 array.

Further AIO can be monitored by running iostat -A for legacy AIO and iostat -P for POSIX
AIO.

Because the AIO queues are assigned by file system, it is more accurate to measure the
queues per file system. If you have several instances of the same application where each
application uses a set of file systems, you can see which instances consume more resources.
To see the legacy AIO, which is shown in Example 9-16, run the iostat -AQ command.
Similarly, for POSIX-compliant AIO statistics, run iostat -PQ.

Example 9-16 iostat -AQ output to measure legacy AIO activity by file system
[root@p520-tic-3]# iostat -AQ 1 2

System configuration: lcpu=4

aio: avgc avfc maxg maif maxr avg-cpu: % user % sys % idle % iowait
0 0 0 0 16384 0.0 0.1 99.9 0.0

Queue# Count Filesystems


129 0 /
130 0 /usr
132 0 /var
133 0 /tmp
136 0 /home
137 0 /proc
138 0 /opt

aio: avgc avfc maxg maif maxr avg-cpu: % user % sys % idle % iowait
0 0 0 0 16384 0.0 0.1 99.9 0.0



Queue# Count Filesystems
129 0 /
130 0 /usr
132 0 /var
133 0 /tmp
136 0 /home
137 0 /proc
138 0 /opt

If your AIX system is in a SAN environment, you might have so many hdisks that iostat does
not provide much information. Instead, use the nmon tool, as described in “Interactive nmon
options for DS8000 performance monitoring” on page 322.

For more information about the enhancements of the iostat tool in AIX 7, see IBM AIX
Version 7.1 Differences Guide, SG24-7910, or see the iostat man pages at the following
website:
http://www.ibm.com/support/knowledgecenter/ssw_aix_71/com.ibm.aix.cmds3/iostat.htm

9.3.4 lvmstat
The lvmstat command reports input and output statistics for logical partitions (LPs), LVs, and
volume groups. This command is useful in determining the I/O rates to LVM volume groups,
LVs, and LPs. This command is also useful for dealing with unbalanced I/O situations where
the data layout was not considered initially.

Enabling volume group I/O by using lvmstat


By default, the statistics collection is not enabled. If the statistics collection is not enabled for
the VG or LV that you want to monitor, lvmstat reports this error:
#lvmstat -v rootvg
0516-1309 lvmstat:Statistics collection is not enabled for this logical device.
Use -e option to enable.

To enable statistics collection for all LVs in a VG (in this case, the rootvg VG), use the -e
option together with the -v <volume group> flag, as shown in the following example:
#lvmstat -v rootvg -e

When you do not need to continue to collect statistics with lvmstat, disable it because it
affects the performance of the system. To disable the statistics collection for all LVs in a VG (in
this case, the rootvg VG), use the -d option together with the -v <volume group> flag, as
shown in the following example:
#lvmstat -v rootvg -d

This command disables the collection of statistics on all LVs in the VG.

The first report section that is generated by lvmstat provides statistics that are about the time
since the statistical collection was enabled. Each later report section covers the time since the
previous report. All statistics are reported each time that lvmstat runs. The report consists of
a header row, followed by a line of statistics for each LP or LV, depending on the flags
specified.



Monitoring volume group I/O by using lvmstat
After a VG is enabled for lvmstat monitoring, such as rootvg in this example, you must run
lvmstat -v rootvg only to monitor all activity to rootvg. An example of the lvmstat output is
shown in Example 9-17.

Example 9-17 The lvmstat command example


#lvmstat -v rootvg

Logical Volume iocnt Kb_read Kb_wrtn Kbps


fslv00 486126 0 15049436 3.45
hd2 73425 2326964 14276 0.54
hd8 1781 0 7124 0.00
hd9var 98 0 408 0.00
hd4 45 0 180 0.00
hd3 15 0 60 0.00
hd1 10 0 40 0.00
livedump 0 0 0 0.00
lg_dumplv 0 0 0 0.00
hd11admin 0 0 0 0.00
hd10opt 0 0 0 0.00
hd6 0 0 0 0.00
hd5 0 0 0 0.00

You can see that fslv00 is busy performing writes and hd2 is performing reads and some
write I/O.

The lvmstat tool has powerful options, such as reporting on a specific LV or reporting busy
LVs in a VG only. For more information about usage, see the following resources:
򐂰 IBM AIX Version 7.1 Differences Guide, SG24-7910
򐂰 A description of the lvmstat command, found at:
http://www.ibm.com/support/knowledgecenter/ssw_aix_71/com.ibm.aix.performance/l
vm_perf_mon_lvmstat.htm

9.3.5 topas
The interactive AIX tool topas is convenient if you want to get a quick overall view of the
activity of the system. A fast snapshot of memory usage or user activity can be a helpful
starting point for further investigation. Example 9-18 shows a sample topas output.

Example 9-18 AIX 7.1 topas output


Topas Monitor for host:p8-e870-01v1 EVENTS/QUEUES FILE/TTY
Tue Oct 27 15:47:03 2015 Interval:2 Cswitch 6233 Readch 242
Syscall 144.5K Writech 71.4M
CPU User% Kern% Wait% Idle% Physc Entc% Reads 1 Rawin 0
Total 3.4 6.5 5.7 84.3 1.05 17.43 Writes 142.8K Ttyout 444
Forks 0 Igets 0
Network BPS I-Pkts O-Pkts B-In B-Out Execs 0 Namei 372
Total 3.83K 56.00 5.00 3.10K 750.5 Runqueue 3.00 Dirblk 0
Waitqueue 0.5
Disk Busy% BPS TPS B-Read B-Writ MEMORY
Total 42.6 93.8M 4.95K 12.0K 93.8M PAGING Real,MB 92160
Faults 0 % Comp 7
FileSystem BPS TPS B-Read B-Writ Steals 0 % Noncomp 27



Total 235.0 0.50 235.0 0 PgspIn 0 % Client 27
PgspOut 0
Name PID CPU% PgSp Owner PageIn 0 PAGING SPACE
blast.AI 2097758 8.2 1.99M root PageOut 22752 Size,MB 512
syncd 3211264 0.3 624K root Sios 22752 % Used 4
vtiol 786464 0.3 2.56M root % Free 96
j2pg 1704392 0.2 18.6M root NFS (calls/sec)
reaffin 589850 0.1 640K root SerV2 0 WPAR Activ 0
topas 3801452 0.0 4.21M root CliV2 0 WPAR Total 0
topas 5832872 0.0 5.47M root SerV3 0 Press: "h"-help
java 3801152 0.0 85.7M root CliV3 0 "q"-quit

Since AIX V6.1 the topas monitor offers enhanced monitoring capabilities and file system I/O
statistics:
򐂰 To expand the file system I/O statistics, enter ff (first f turns it off, the next f expands it).
򐂰 To get an exclusive and even more detailed view of the file system I/O statistics, enter F.
򐂰 Expanded disk I/O statistics can be obtained by typing dd or D in the topas initial window.

For more information, see the topas manual page, found at:
http://www.ibm.com/support/knowledgecenter/ssw_aix_71/com.ibm.aix.cmds5/topas.htm

9.3.6 nmon
The nmon tool and analyzer for AIX and Linux is a great storage performance analysis
resource. It was written by Nigel Griffiths, who works for IBM UK. Since AIX V5.3 TL09 and
AIX V6.1 TL02, it is integrated with the topas command. It is installed by default. For more
information, see the AIX 7.1 nmon command description, found at:
https://www.ibm.com/support/knowledgecenter/ssw_aix_71/com.ibm.aix.cmds4/nmon.htm

You can start either of the tools by running nmon or topas and then toggle between them by
typing the ~ character. You can use this tool, among others, when you perform client
benchmarks.

The interactive nmon tool is similar to monitor or topas, which you might have used to monitor
AIX, but it offers more features that are useful for monitoring the DS8000 performance.

Unlike topas, the nmon tool can also record data that can be used to establish a baseline of
performance for comparison later. Recorded data can be saved in a file and imported into the
nmon analyzer (a spreadsheet format) for easy analysis and graphing.

Interactive nmon options for DS8000 performance monitoring


The interactive nmon tool is an excellent way to show comprehensive AIX system monitoring
information, of your choice, on one display. When used interactively, nmon updates statistics
every 2 seconds. You can change the refresh rate. To run the tool, type nmon and press Enter.
Then, press the keys corresponding to the areas of interest. For this book, we are interested
in monitoring storage. The options that relate to storage are a (disk adapter), d (disk I/O
graphs), D (disk I/O and service times), and e (DS8000 SDD vpath statistics). It is also helpful
to view only the busiest disk devices, so type a period (.) to turn on this viewing feature. We
also want to look at the CPU utilization by typing c or C. The nmon tool has its own way of ordering
the topics that you choose on the display.

The different options that you can select when you run nmon are shown in Example 9-19 on
page 323.



Example 9-19 The nmon tool options
••HELP•••••••••most-keys-toggle-on/off••••••••••••••••••••••••••••••••••••••••••
•h = Help information q = Quit nmon 0 = reset peak counts •
•+ = double refresh time - = half refresh r = ResourcesCPU/HW/MHz/AIX•
•c = CPU by processor C=upto 1024 CPUs p = LPAR Stats (if LPAR) •
•l = CPU avg longer term k = Kernel Internal # = PhysicalCPU if SPLPAR •
•m = Memory & Paging M = Multiple Page Sizes P = Paging Space •
•d = DiskI/O Graphs D = DiskIO +Service times o = Disks %Busy Map •
•a = Disk Adapter e = ESS vpath stats V = Volume Group stats •
•^ = FC Adapter (fcstat) O = VIOS SEA (entstat) v = Verbose=OK/Warn/Danger •
•n = Network stats N=NFS stats (NN for v4) j = JFS Usage stats •
•A = Async I/O Servers w = see AIX wait procs "="= Net/Disk KB<-->MB •
•b = black&white mode g = User-Defined-Disk-Groups (see cmdline -g) •
•t = Top-Process ---> 1=basic 2=CPU-Use 3=CPU(default) 4=Size 5=Disk-I/O •
•u = Top+cmd arguments U = Top+WLM Classes . = only busy disks & procs•
•W = WLM Section S = WLM SubClasses @=Workload Partition(WPAR) •
•[ = Start ODR ] = Stop ODR i = Top-Thread •
•~ = Switch to topas screen •
•Need more details? Then stop nmon and use: nmon -?

The nmon adapter performance


Example 9-20 displays the ability of nmon to show I/O performance based on system adapters
and FC statistics. The output shows two VSCSI controllers and four FC adapters.

Example 9-20 The nmon tool adapter statistics


••topas_nmon••W=WLM••••••••••••••Host=p8-e870-01v1•••Refresh=2 secs•••17:35.59•••••
• ESS-I/O •••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••
• •
•Name Size(GB) AvgBusy read-KB/s write-KB/s TotalMB/s xfers/s vpaths=0 •
•No ESS vpath found or not found by odm commands •
• •
• Disk-Adapter-I/O ••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••
•Name %busy read write xfers Disks Adapter-Type •
•vscsi1 0.0 0.0 0.0 KB/s 0.0 1 Virtual SCSI Client A •
•vscsi0 0.0 0.0 35.9 KB/s 9.0 1 Virtual SCSI Client A •
•fcs3 20.5 0.0 2459.0 KB/s 367.7 4 PCIe2 2-Port 16Gb FC •
•fcs2 16.0 0.0 2439.0 KB/s 373.2 4 PCIe2 2-Port 16Gb FC •
•fcs1 19.5 0.0 2478.9 KB/s 373.2 4 PCIe2 2-Port 16Gb FC •
•fcs0 15.5 0.0 2427.0 KB/s 360.3 4 PCIe2 2-Port 16Gb FC •
•TOTALS 6 adapters 0.0 9839.9 KB/s 1483.5 5 TOTAL(MB/s)=9.6 1 •
• Disk-KBytes/second-(K=1024,M=1024*1024) •••••••••••••••••••••••••••••••••••••••••
•Disk Busy Read Write 0----------25-----------50------------75--------100 •
• Name KB/s KB/s | | | | | •
•hdisk0 0% 0 36| | •
•hdisk1 18% 0 4427|WWWWWWWWW > | •
•hdisk4 4% 0 2064|WW > | •
•hdisk3 4% 0 2076|WWW > | •
•hdisk2 4% 0 1235|WW > | •
•Totals 0 9838+-----------|------------|-------------|----------+ •
••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••e



The nmon tool disk performance
The d option of nmon monitors disk I/O activity, as shown in Figure 9-10.

Figure 9-10 The nmon tool disk I/O

The nmon tool disk group performance


The nmon tool has a feature called disk grouping. For example, you can create a disk group
based on your AIX volume groups. First, you must create a file that maps hdisks to
nicknames. For example, you can create a map file, as shown in Example 9-21.

Example 9-21 The nmon tool disk group mapping file


# cat /tmp/vg-maps
rootvg hdisk0
data01vg hdisk1 hdisk2 hdisk3 hdisk4

Then, run nmon with the -g flag to point to the map file:
nmon -g /tmp/vg-maps

When nmon starts, press the g key to view statistics for your disk groups. An example of the
output is shown in Example 9-22.

Example 9-22 Statistics for disk groups


••topas_nmon••C=many-CPUs••••••••Host=p8-e870-01v1•••Refresh=2 secs•••10:07.26••
• Disk-Group-I/O •••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••
•Name Disks AvgBusy Read|Write-KB/s TotalMB/s xfers/s BlockSizeKB •
•rootvg 1 3.5% 0.0|1717.9 1.7 155.0 11.1 •
•data01vg 4 61.4% 0.0|123600.2 120.7 7046.6 17.5 •
•Groups= 2 TOTALS 5 13.0% 0.0|125318.1 122.4 7201.5 •
••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••

The nmon tool has these characteristics:


򐂰 The nmon tool reports real-time iostats for different disk groups. In this case, the disk
groups that we created are for volume groups.
򐂰 You can create logical groupings of hdisks for any group that you want.
򐂰 You can make multiple disk-group map files and run nmon -g <map-file> to report on
different groups.

To enable nmon to report iostats based on ranks, you can make a disk-group map file that
lists ranks with the associated hdisk members.



Use the SDDPCM command pcmpath query device to provide a view of your host system
logical configuration on the DS8000 storage system. You can, for example, create a nmon disk
group of storage type (DS8000 storage system), logical subsystem (LSS), rank, and port to
show unique views of your storage performance.
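
As a hedged sketch, the following command builds a rank-based map file from the pcmpath
query essmap output (see Example 9-27) and then starts nmon with it. The awk field number
($11, the Rank column) is an assumption that depends on the SDDPCM release, so verify it
against your own essmap header first:

pcmpath query essmap | awk '/^hdisk/ && !seen[$1]++ {r[$11] = r[$11] " " $1} END {for (x in r) print "rank_" x r[x]}' > /tmp/rank-maps
nmon -g /tmp/rank-maps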

Recording nmon information for import into the nmon analyzer tool
A great benefit that the nmon tool provides is the ability to collect data over time to a file and
then to import the file into the nmon analyzer tool, found at:
https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/Power+Syste
ms/page/nmon_analyser

To collect nmon data in comma-separated value (CSV) file format for easy spreadsheet import,
complete the following steps:
1. Run nmon with the -f flag. See nmon -h for the details, but as an example, to run nmon for
an hour to capture data snapshots every 30 seconds, run this command:
nmon -f -s 30 -c 120
2. This command creates the output file <hostname>_date_time.nmon in the current directory.

Note: When you capture data to a file, the nmon tool disconnects from the shell to ensure
that it continues running even if you log out, which means that nmon can appear to fail, but it
is still running in the background until the end of the analysis period.

The nmon analyzer is a macro-customized Microsoft Excel spreadsheet. After transferring the
output file to the machine that runs the nmon analyzer, simply start the nmon analyzer, enabling
the macros, and click Analyze nmon data. You are prompted to select your spreadsheet and
then to save the results.

Spreadsheets have fixed maximum numbers of columns and rows. Collect a maximum of
300 snapshots to avoid exceeding these limits.

Hint: The use of the CHARTS setting instead of PICTURES for graph output simplifies the
analysis of the data, which makes it more flexible.

9.3.7 fcstat
The fcstat command displays statistics from a specific FC adapter. Example 9-23 shows the
output of the fcstat command.

Example 9-23 The fcstat command output


# fcstat fcs0
FIBRE CHANNEL STATISTICS REPORT: fcs0
skipping.........
FC SCSI Adapter Driver Information
No DMA Resource Count: 0
No Adapter Elements Count: 0
No Command Resource Count: 99023
skipping.........



The No Command Resource Count indicates how many times the num_cmd_elems value was
exceeded since AIX was booted. You can continue to take snapshots every 3 - 5 minutes
during a peak period to evaluate whether you must increase the value of num_cmd_elems. For
more information, see the man pages of the fcstat command, found at:
http://www.ibm.com/support/knowledgecenter/ssw_aix_71/com.ibm.aix.cmds2/fcstat.htm
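
The following loop is one simple way (a sketch, assuming SAN-attached fcs adapters) to take
those snapshots for every FC adapter during a peak period and append them to a file:

while true; do
   date
   for a in $(lsdev | awk '/^fcs/ {print $1}'); do
      echo "$a: $(fcstat $a | grep 'No Command Resource Count')"
   done
   sleep 300        # one snapshot every 5 minutes
done >> /tmp/fcstat_counts.txt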

9.3.8 filemon
The filemon command uses a trace of file system and I/O system events to report
performance statistics for files, virtual memory segments, LVs, and PVs. The filemon
command is useful when an application is thought to be disk-bound and you want to know
where and why.

The filemon command provides a quick test to determine whether there is an I/O problem by
measuring the I/O service times for reads and writes at the disk and LV level.

The filemon command is in /usr/bin and is part of the bos.perf.tools file set, which can be
installed from the AIX base installation media.

filemon measurements
To provide an understanding of file system performance for an application, the filemon
command monitors file and I/O activity at four levels:
򐂰 Logical file system
The filemon command monitors logical I/O operations on logical files. The monitored
operations include all read, write, open, and seek system calls, which might result in
physical I/O, depending on whether the files are already buffered in memory. I/O statistics
are kept on a per-file basis.
򐂰 Virtual memory system
The filemon command monitors physical I/O operations (that is, paging) between
segments and their images on disk. I/O statistics are kept on a per segment basis.
򐂰 LVs
The filemon command monitors I/O operations on LVs. I/O statistics are kept on a per-LV
basis.
򐂰 PVs
The filemon command monitors I/O operations on PVs. At this level, physical resource
utilizations are obtained. I/O statistics are kept on a per-PV basis.

filemon examples
A simple way to use filemon is to run the command that is shown in Example 9-24, which
performs these actions:
򐂰 Run filemon for 2 minutes and stop the trace.
򐂰 Store the output in /tmp/fmon.out.
򐂰 Collect only LV and PV output.

Example 9-24 The filemon command


#filemon -o /tmp/fmon.out -T 500000 -PuvO lv,pv; sleep 120; trcstop

Tip: To set the size of the buffer of option -T, start with 2 MB per logical CPU.



For more information about filemon, check the man pages of filemon, found at:
򐂰 http://www.ibm.com/support/knowledgecenter/ssw_aix_71/com.ibm.aix.cmds2/filemon.h
tm
򐂰 http://www.ibm.com/support/knowledgecenter/ssw_aix_71/com.ibm.aix.performance/det
ailed_io_analysis_fielmon.htm

To produce a sample output for filemon, we ran a sequential write test in the background and
started a filemon trace, as shown in Example 9-25. We used the lmktemp command to create
a 2 GB file full of nulls while filemon gathered I/O statistics.

Example 9-25 The filemon command with a sequential write test


cd /interdiskfs
time lmktemp 2GBtest 2000M &
filemon -o /tmp/fmon.out -T 500000 -PuvO detailed,lv,pv; sleep 120; trcstop

Example 9-26 shows the parts of the /tmp/fmon.out file.

Example 9-26 The filemon most active logical volumes report


Thu Oct 6 21:59:52 2005
System: AIX CCF-part2 Node: 5 Machine: 00E033C44C00

Cpu utilization: 73.5%

Most Active Logical Volumes


------------------------------------------------------------------------
util #rblk #wblk KB/s volume description
------------------------------------------------------------------------
0.73 0 20902656 86706.2 /dev/305glv /interdiskfs
0.00 0 472 2.0 /dev/hd8 jfs2log
0.00 0 32 0.1 /dev/hd9var /var
0.00 0 16 0.1 /dev/hd4 /
0.00 0 104 0.4 /dev/jfs2log01 jfs2log

Most Active Physical Volumes


------------------------------------------------------------------------
util #rblk #wblk KB/s volume description
------------------------------------------------------------------------
0.99 0 605952 2513.5 /dev/hdisk39 IBM FC 2107
0.99 0 704512 2922.4 /dev/hdisk55 IBM FC 2107
0.99 0 614144 2547.5 /dev/hdisk47 IBM FC 2107
0.99 0 684032 2837.4 /dev/hdisk63 IBM FC 2107
0.99 0 624640 2591.1 /dev/hdisk46 IBM FC 2107
0.99 0 728064 3020.1 /dev/hdisk54 IBM FC 2107
0.98 0 612608 2541.2 /dev/hdisk38 IBM FC 2107

skipping...........

------------------------------------------------------------------------
Detailed Logical Volume Stats (512 byte blocks)
------------------------------------------------------------------------

VOLUME: /dev/305glv description: /interdiskfs


writes: 81651 (0 errs)
write sizes (blks): avg 256.0 min 256 max 256 sdev 0.0



write times (msec): avg 1.816 min 1.501 max 2.409 sdev 0.276
write sequences: 6
write seq. lengths: avg 3483776.0 min 423936 max 4095744 sdev 1368402.0
seeks: 6 (0.0%)
seek dist (blks): init 78592,
avg 4095744.0 min 4095744 max 4095744 sdev 0.0
time to next req(msec): avg 1.476 min 0.843 max 13398.588 sdev 56.493
throughput: 86706.2 KB/sec
utilization: 0.73

skipping...........
------------------------------------------------------------------------
Detailed Physical Volume Stats (512 byte blocks)
------------------------------------------------------------------------

VOLUME: /dev/hdisk39 description: IBM FC 2107


writes: 2367 (0 errs)
write sizes (blks): avg 256.0 min 256 max 256 sdev 0.0
write times (msec): avg 1.934 min 0.002 max 2.374 sdev 0.524
write sequences: 2361
write seq. lengths: avg 256.7 min 256 max 512 sdev 12.9
seeks: 2361 (99.7%)
seek dist (blks): init 14251264,
avg 1928.4 min 256 max 511232 sdev 23445.5
seek dist (%tot blks):init 10.61802,
avg 0.00144 min 0.00019 max 0.38090 sdev 0.01747
time to next req(msec): avg 50.666 min 1.843 max 14010.230 sdev 393.436
throughput: 2513.5 KB/sec
utilization: 0.99

VOLUME: /dev/hdisk55 description: IBM FC 2107


writes: 2752 (0 errs)
write sizes (blks): avg 256.0 min 256 max 256 sdev 0.0
write times (msec): avg 1.473 min 0.507 max 1.753 sdev 0.227
write sequences: 2575
write seq. lengths: avg 273.6 min 256 max 512 sdev 64.8
seeks: 2575 (93.6%)
seek dist (blks): init 14252544,
avg 1725.9 min 256 max 511232 sdev 22428.8
seek dist (%tot blks):init 10.61897,
avg 0.00129 min 0.00019 max 0.38090 sdev 0.01671
time to next req(msec): avg 43.573 min 0.844 max 14016.443 sdev 365.314
throughput: 2922.4 KB/sec
utilization: 0.99

skipping to end.....................

When analyzing the output from filemon, focus on these areas:


򐂰 Most active PV:
– Look for balanced I/O across disks.
– A lack of balance might be a data layout problem.
򐂰 Look at I/O service times at the PV layer:



– Writes to cache that average less than 3 ms are good. Writes averaging significantly
and consistently longer times indicate that write cache is full, and there is a bottleneck
in the disk.
– Reads that average less than 10 ms - 20 ms are good. The disk subsystem read cache
hit rate affects this value considerably. Higher read cache hit rates result in lower I/O
service times, often near 5 ms or less. If reads average greater than 15 ms, something
between the host and the disk might be a bottleneck, although most often it indicates a
bottleneck in the disk subsystem itself.
– Look for consistent I/O service times across PVs. Inconsistent I/O service times can
indicate unbalanced I/O or a data layout problem.
– Longer I/O service times can be expected for I/Os that average greater than 64 KB.
– Look at the difference between the I/O service times between the LV and the PV layers.
A significant difference indicates queuing or serialization in the AIX I/O stack.

The following fields are in the filemon report of the filemon command:
util Utilization of the volume (fraction of time busy). The rows are sorted by
this field, in decreasing order. The first number, 1.00, means 100
percent.
#rblk Number of 512-byte blocks read from the volume.
#wblk Number of 512-byte blocks written to the volume.
KB/sec Total transfer throughput in kilobytes per second.
volume Name of volume.
description Contents of the volume: Either a file system name or an LV type (jfs2,
paging, jfslog, jfs2log, boot, or sysdump). Also indicates whether the
file system is fragmented or compressed.

In the filemon output in Example 9-26 on page 327, notice these characteristics:
򐂰 The most active LV is /dev/305glv (/interdiskfs); it is the busiest LV with an average
data rate of 87 MBps.
򐂰 The Detailed Logical Volume Status field shows an average write time of 1.816 ms for
/dev/305glv.
򐂰 The Detailed Physical Volume Stats show an average write time of 1.934 ms for the
busiest disk, /dev/hdisk39, and 1.473 ms for /dev/hdisk55, which is the next busiest disk.

9.4 Verifying your system


At each step along the way, there are tests that you must run to test the storage infrastructure.
This section describes these tests to provide you with testing techniques as you configure
your storage for the best performance.

Verifying the storage subsystem


After the LUNs are assigned to the host system, and multipathing software, such as SDD,
discovers the LUNs, it is important to test the storage subsystem. The storage subsystem
includes the SAN infrastructure, the host system HBAs, and the DS8000 storage system.



To test the storage subsystem, complete the following steps:
1. Run the pcmpath query essmap command to determine whether your storage allocation
from the DS8000 storage system works well with SDDPCM.
2. Make sure that the LUNs are set up in the manner that you expected. Is the number of
paths to the LUNs correct? Are all of the LUNs from different ranks? Are the LUN sizes
correct? The output from the command looks like the output in Example 9-27.

Example 9-27 The pcmpath query essmap command output


# pcmpath query essmap
Disk Path P Location adapter LUN SN Type Size LSS Vol Rank C/A S Connection port RaidMode
------- ----- - ------------ ------- ----------- ------------ ------- --- --- ---- --- - ----------- ---- --------
hdisk1 path0 00-00-01[FC] fscsi0 75ZA5710019 IBM 2107-900 10.2GB 0 25 0011 0e Y R1-B3-H1-ZA 200 R5R6
hdisk1 path1 00-01-01[FC] fscsi1 75ZA5710019 IBM 2107-900 10.2GB 0 25 0011 0e Y R1-B3-H1-ZA 200 R5R6
hdisk1 path2 01-00-01[FC] fscsi2 75ZA5710019 IBM 2107-900 10.2GB 0 25 0011 0e Y R1-B3-H1-ZH 207 R5R6
hdisk1 path3 01-01-01[FC] fscsi3 75ZA5710019 IBM 2107-900 10.2GB 0 25 0011 0e Y R1-B3-H1-ZH 207 R5R6
hdisk2 path0 00-00-01[FC] fscsi0 75ZA571001A IBM 2107-900 10.2GB 0 26 0000 02 Y R1-B3-H1-ZA 200 R5R6
hdisk2 path1 00-01-01[FC] fscsi1 75ZA571001A IBM 2107-900 10.2GB 0 26 0000 02 Y R1-B3-H1-ZA 200 R5R6
hdisk2 path2 01-00-01[FC] fscsi2 75ZA571001A IBM 2107-900 10.2GB 0 26 0000 02 Y R1-B3-H1-ZH 207 R5R6
hdisk2 path3 01-01-01[FC] fscsi3 75ZA571001A IBM 2107-900 10.2GB 0 26 0000 02 Y R1-B3-H1-ZH 207 R5R6

3. Next, run sequential reads and writes (by using the dd command, for example) to all of the
hdisk devices (raw or block) for about an hour. Then, look at your SAN infrastructure to see
how it performs.
Look at the UNIX error report. Problems show up as storage errors, disk errors, or adapter
errors. If there are problems, they are not hard to identify in the error report because
there are many errors. The source of the problem can be hardware problems on the
storage side of the SAN, Fibre Channel cables or connections, or down-level device
drivers or device (HBA) Licensed Internal Code.
Example 9-28, stop and fix them.

Example 9-28 SAN problems reported in the UNIX error report


IDENTIFIER TIMESTAMP T C RESOURCE_NAME DESCRIPTION
7BFEEA1F 1019165415 T H fcs3 LINK ERROR
7BFEEA1F 1019165415 T H fcs2 LINK ERROR
7BFEEA1F 1019165415 T H fcs1 LINK ERROR
7BFEEA1F 1019165415 T H fcs0 LINK ERROR
7BFEEA1F 1019165315 T H fcs3 LINK ERROR
7BFEEA1F 1019165315 T H fcs2 LINK ERROR
7BFEEA1F 1019165315 T H fcs1 LINK ERROR
7BFEEA1F 1019165315 T H fcs0 LINK ERROR
...
3074FEB7 0915100805 T H fscsi0 ADAPTER ERROR
3074FEB7 0915100805 T H fscsi3 ADAPTER ERROR
3074FEB7 0915100805 T H fscsi3 ADAPTER ERROR
825849BF 0915100705 T H fcs0 ADAPTER ERROR
3074FEB7 0915100705 T H fscsi3 ADAPTER ERROR
3074FEB7 0915100705 T H fscsi0 ADAPTER ERROR
3074FEB7 0914175405 T H fscsi0 ADAPTER ERROR

4. Next, run the following command to see whether MPIO/SDDPCM correctly balances the
load across paths to the LUNs:
pcmpath query device
The output from this command looks like Example 9-29 on page 331.



Example 9-29 The pcmpath query device command output
# pcmpath query device

Total Dual Active and Active/Asymmetric Devices : 4

DEV#: 1 DEVICE NAME: hdisk1 TYPE: 2107900 ALGORITHM: Load Balance


SERIAL: 75ZA5710019
==========================================================================
Path# Adapter/Path Name State Mode Select Errors
0 fscsi0/path0 OPEN NORMAL 3260780 0
1 fscsi1/path1 OPEN NORMAL 3259749 0
2 fscsi2/path2 OPEN NORMAL 3254353 0
3 fscsi3/path3 OPEN NORMAL 3251982 0

DEV#: 2 DEVICE NAME: hdisk2 TYPE: 2107900 ALGORITHM: Load Balance


SERIAL: 75ZA571001A
==========================================================================
Path# Adapter/Path Name State Mode Select Errors
0 fscsi0/path0 OPEN NORMAL 1527936 0
1 fscsi1/path1 OPEN NORMAL 1529938 0
2 fscsi2/path2 OPEN NORMAL 1526752 0
3 fscsi3/path3 OPEN NORMAL 1524776 0

DEV#: 3 DEVICE NAME: hdisk3 TYPE: 2107900 ALGORITHM: Load Balance


SERIAL: 75ZA571001B
==========================================================================
Path# Adapter/Path Name State Mode Select Errors
0 fscsi0/path0 OPEN NORMAL 1719801 0
1 fscsi1/path1 OPEN NORMAL 1720265 0
2 fscsi2/path2 OPEN NORMAL 1719512 0
3 fscsi3/path3 OPEN NORMAL 1721139 0

DEV#: 4 DEVICE NAME: hdisk4 TYPE: 2107900 ALGORITHM: Load Balance


SERIAL: 75ZA571001C
==========================================================================
Path# Adapter/Path Name State Mode Select Errors
0 fscsi0/path0 OPEN NORMAL 1795417 0
1 fscsi1/path1 OPEN NORMAL 1792827 0
2 fscsi2/path2 OPEN NORMAL 1795002 0
3 fscsi3/path3 OPEN NORMAL 1796556 0

Check to ensure that for every LUN the counters under the Select column are the same
and that there are no errors.
5. Next, randomly check the sequential read speed of the raw disk device. The following
command is an example of the command that is run against a device called hdisk4. For
the LUNs that you test, ensure that they each yield the same results:
time dd if=/dev/rhdisk4 of=/dev/null bs=128k count=20000

Hint: The first time that this dd command is run against rhdisk4, the data must be read
from disk and staged into the DS8000 cache. The second time that the command is run,
the data is already in the cache. Notice the shorter read time when you get a cache hit.



If any of these LUNs are on ranks that are also used by another application, you see a
variation in the throughput. If there is a large variation in the throughput, give that LUN
back to the storage administrator and trade it for another LUN. You want all your LUNs
to have the same performance.

If everything looks good, continue with the configuration of volume groups and LVs.

Verifying the logical volumes


The next time that you stop and look at how your DS8000 storage system performs is after
the LVs are created. Complete the following steps:
1. Put the nmon monitor up for a quick check on I/O throughput performance and path
balance.
2. Test the sequential read speed on every raw LV device, if practical, or at least a decent
sampling if you have too many to test. The following command is an example of the
command that is run against an LV called striped_lv. Perform this test against all your LVs to
ensure that they each yield the same results.
time dd if=/dev/rstriped_lv of=/dev/null bs=128k count=10000
3. Use the dd command without the time or count options to perform sequential reads and
writes against all your LVs, raw devices, or block devices. Watch nmon for the Mbps and
IOPS of each LUN. Monitor the adapter. Look at the following characteristics:
– Performance is the same for all the LVs.
– Raw LV devices (/dev/rlvname) are faster than the counterpart block LV devices
(/dev/lvname) if the blocksize specified is more than 4 KB.
– Larger blocksizes result in higher MBps but reduced IOPS for raw LVs.
– The blocksize does not affect the throughput of a block (not raw) LV because in AIX the
LVM imposes an I/O blocksize of 4 KB. Verify this size by running the dd command
against a raw LV with a blocksize of 4 KB. This performance is the same as running the
dd command against the non-raw LV.
– Reads are faster than writes.
– With inter-disk LVs, nmon does not report that all the LUNs have input at the same time,
as with a striped LV. This result is normal and is related to the nmon refresh rate and
the characteristics of inter-disk LVs.
4. Ensure that the UNIX errorlog is clear of storage-related errors.

Verifying the file systems and characterizing performance


After the file systems are created, it is a good idea to take time to characterize and document
the file system performance. A simple way to test sequential write/read speeds for file
systems is to time how long it takes to create a large sequential file and then how long it takes
to copy the same file to /dev/null. After creating the file for the write test, be careful that the
file is not still cached in host memory, which invalidates the read test because the data comes
from memory instead of disk.

The lmktemp command, which is used next, creates a file, and you control the size of the file.
It does not appear to be supported by any AIX documentation and therefore might disappear
in future releases of AIX. Here are examples of the tests:
򐂰 A simple sequential write test:
# cd /singleLUN
# time lmktemp 2GBtestfile 2000M
2GBtestfile



real 0m5.88s
user 0m0.06s
sys 0m2.31s
Divide 2000/5.88 seconds = 340 MBps sequential write speed.
򐂰 Sequential read speed:
# cd /
# umount /singleLUN (this command flushes the file from the operating system (jfs, jfs2)
memory)
# mount /singleLUN
# cd - (change working directory back to the previous directory, /singleLUN)
# time dd if=/singleLUN/2GBtestfile of=/dev/null bs=128k
16000+0 records in
16000+0 records out

real 0m11.42s
user 0m0.02s
sys 0m1.01s
Divide 2000/11.42 seconds = 175 MBps read speed.
Now that the DS8000 cache is primed, run the test again. When we ran the test again, we
got 0.28 seconds. Priming the cache is a good idea for isolated application read testing. If
you have an application, such as a database, and you perform several isolated fixed
reads, ignore the first run and measure the second run to take advantage of read hits from
the DS8000 cache because these results are a more realistic measurement of how the
application performs.

Hint: The lmktemp command for AIX has a 2 GB size limitation. This command cannot
create a file greater than 2 GB. If you want a file larger than 2 GB for a sequential read test,
concatenate a couple of 2 GB files.



Chapter 10. Performance considerations for Microsoft Windows servers
This chapter describes performance considerations for supported Microsoft Windows servers
that are attached to the IBM System Storage DS8000 storage system. In the context of this
chapter, the term Windows servers refers to native servers as opposed to Windows servers
that run as guests on VMware. You can obtain the current list of supported Windows servers
(at the time of writing) from the IBM System Storage Interoperation Center (SSIC):
http://www.ibm.com/systems/support/storage/ssic/interoperability.wss

Disk throughput and I/O response time for any server that is connected to a DS8000 storage
system are affected by the workload and configuration of the server and the DS8000 storage
system, data layout and volume placement, connectivity characteristics, and the performance
characteristics of the DS8000 storage system. Although the health and tuning of all of the
system components affect the overall performance management and tuning of a Windows
server, this chapter limits its descriptions to the following topics:
򐂰 General Windows performance tuning
򐂰 I/O architecture overview
򐂰 File systems
򐂰 Volume management
򐂰 Multipathing and the port layer
򐂰 Host bus adapter (HBA) settings
򐂰 Windows server I/O enhancements
򐂰 I/O performance measurement
򐂰 Problem determination
򐂰 Load testing



10.1 General Windows performance tuning
Microsoft Windows servers are largely self-tuning. Typically, using the system defaults is
reasonable from a performance perspective. This section describes the general
considerations for improving the disk throughput or response time for either a file server or a
database server:
򐂰 The Windows environment is memory-intensive and benefits from having as much
memory as possible. Provide enough memory for database, application, and file system
cache. As a rule, database buffer hit ratios must be greater than 90%. Increasing the
cache hit ratios is the most important tuning consideration for the I/O performance of
databases because it reduces the amount of physical I/O that is required.
򐂰 Monitor file fragmentation regularly, which is important for database files and Microsoft
Exchange database files. Even when it starts with a fixed block allocation, a database file
can become fragmented later. This fragmentation increases read and write response times
dramatically because fragmented files are cached poorly by the file system cache.
򐂰 Schedule processes that are processor-intensive, memory-intensive, or disk-intensive
during after-hours operations. Examples of these processes are virus scanners, backups,
and disk fragment utilities. These types of processes must be scheduled to run when the
server is least active.
򐂰 Optimize the amount and priority of the services running. It is better to have most of the
secondary services started manually than running all the time. Critical application-related
services must be planned with the highest priority and resource allocation, that is, you
must start main services first and then add secondary services one-by-one if needed.
򐂰 Follow the Microsoft recommendation that large dedicated file servers or database servers
are not configured as domain controllers because of the impact that is associated with the
netlogon service.
򐂰 Optimize the paging file configuration. The paging file fragments if there is not enough
contiguous hard disk drive (HDD) space to hold the entire paging file. By default, the
operating system configures the paging file to allocate space for itself dynamically. During
the dynamic resizing, the file can end up fragmented because of a lack of contiguous disk
space. You can obtain more information about the paging file in 10.4.3, “Paging file” on
page 339.

For more information about these tuning suggestions, see the following resources:
򐂰 Tuning IBM System x Servers for Performance, SG24-5287
򐂰 Tuning Windows Server 2003 on IBM System x Servers, REDP-3943
򐂰 https://msdn.microsoft.com/en-us/library/windows/hardware/dn529133

10.2 I/O architecture overview


At a high level, the Windows I/O architecture is similar to the I/O architecture of most Open
Systems. Figure 10-1 on page 337 shows a generic view of the I/O layers and examples of
how they are implemented.



Figure 10-1 Windows I/O stack

To initiate an I/O request, an application issues an I/O request by using one of the supported
I/O request calls. The I/O manager receives the application I/O request and passes the I/O
request packet (IRP) from the application to each of the lower layers that route the IRP to the
appropriate device driver, port driver, and adapter-specific driver.

Windows server file systems can be configured as file allocation table (FAT), FAT32, or NTFS.
The file structure is specified for a particular partition or logical volume. A logical volume can
contain one or more physical disks. All Windows volumes are managed by the Windows
Logical Disk Management utility.

For more information about Windows Server 2003 and Windows Server 2008 I/O stacks and
performance, see the following documents:
򐂰 http://download.microsoft.com/download/5/6/6/5664b85a-ad06-45ec-979e-ec4887d715eb
/Storport.doc
򐂰 http://download.microsoft.com/download/5/b/9/5b97017b-e28a-4bae-ba48-174cf47d23cd
/STO089_WH06.ppt

10.3 Windows Server 2008 I/O Manager enhancements


There are several I/O performance enhancements to the Windows Server 2008 I/O
subsystem.

I/O priorities
The Windows Server 2008 I/O subsystem provides a mechanism to specify I/O processing
priorities. Windows primarily uses this mechanism to prioritize critical I/O requests over
background I/O requests. API extensions exist to provide application vendors file-level I/O
priority control. The prioritization code has some processing impact and can be disabled for
disks that are targeted for similar I/O activities, such as databases.

I/O completion and cancellation


The Windows Server 2008 I/O subsystem provides a more efficient way to manage the
initiation and completion of I/O requests, resulting in a reduced number of context switches,
lower CPU utilization, and reduced overall I/O response time.



I/O request size
The maximum I/O size was increased from 64 KB per I/O request in Windows Server 2003 to
1024 KB in Windows Server 2008. For large sequential workloads, such as file shares or
backups, this increase can improve the disk throughput.

You can obtain additional information at the following websites:


򐂰 http://support.microsoft.com/kb/2160852
򐂰 http://blogs.technet.com/b/askperf/archive/2008/02/07/ws2008-memory-management-dy
namic-kernel-addressing-memory-priorities-and-i-o-handling.aspx

10.4 File system


A file system is a part of the operating system that determines how files are named, stored,
and organized on a volume. A file system manages files, folders, and the information that is
needed to locate and access these files and folders for local or remote users.

10.4.1 Windows file system overview


Microsoft Windows 2000 Server, Windows Server 2003, and Windows Server 2008 all
support the FAT/FAT32 file system and NTFS. However, use NTFS for the following reasons:
򐂰 NTFS provides considerable performance benefits by using a B-tree structure as the
underlying data structure for the file system. This type of structure improves performance
for large file systems by minimizing the number of times that the disk is accessed, which
makes it faster than FAT/FAT32.
򐂰 NTFS provides scalability over FAT/FAT32 in the maximum volume size. In theory, the
maximum file size is 2^64 bytes. However, on a Windows 32-bit system that uses 64 KB
clusters, the maximum volume size is 256 TB, and the maximum file size is 16 TB.
򐂰 NTFS provides recoverability through a journaled file system function.
򐂰 NTFS fully supports the Windows NT security model and supports multiple data streams.
No longer is a data file a single stream of data. Additionally, under NTFS, a user can add
user-defined attributes to a file.

10.4.2 NTFS guidelines


Follow these guidelines:
򐂰 Allocation: The block allocation size must be selected based on the application
recommendations and preferred practices. A 64 KB allocation unit works in most cases
and improves the efficiency of the NTFS file system. This allocation reduces the
fragmentation of the file system and reduces the number of allocation units that are
required for large file allocations (see the example at the end of this section).
򐂰 Defragment disks: Over time, files become fragmented in noncontiguous clusters across
disks, and disk response time suffers as the disk head jumps between tracks to seek and
reassemble the files when they are required. Regularly defragment volumes.
򐂰 Block alignment: Use diskpar.exe for Windows 2000 Server servers and use
diskpart.exe for Windows Server 2003 servers to force sector alignment. Windows
Server 2008 automatically enforces a 1 MB offset for the first sector in the partition, which
negates the need for diskpart.exe.



For more information, see the following documents:
– http://download.microsoft.com/download/9/c/5/9c5b2167-8017-4bae-9fde-d599bac8
184a/Perf-tun-srv.docx
– http://support.microsoft.com/kb/929491

Important: NTFS file system compression seems to be the easiest way to increase the
amount of available capacity. However, do not use it in enterprise environments because
file system compression consumes too many disk and processor resources and increases
read and write response times. For better capacity utilization, consider the DS8000
Thin Provisioning technology and IBM data deduplication technologies.

Start sector offset: The start sector offset must be 256 KB because of the stripe size on
the DS8000 storage system. Workloads with small, random I/Os (less than 16 KB) are
unlikely to experience any significant performance improvement from sector alignment on
the DS8000 logical unit numbers (LUNs).
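
The following sketch shows how to check the partition offset and the current allocation unit
size, and how to format a new volume with a 64 KB allocation unit. The drive letter E: and the
volume label are assumptions, and the format command erases the volume:
C:\> wmic partition get Name, StartingOffset
C:\> fsutil fsinfo ntfsinfo E:
C:\> format E: /FS:NTFS /A:64K /V:DS8000DATA /Q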

10.4.3 Paging file


Windows servers use the paging file as an extension of physical memory. The term paging
file is used because the operating system writes memory pages to it when main memory is
full and keeps them there for future reading. Because the paging file is on the HDDs, it might
degrade the performance of the entire system if the paging file is maintained improperly.
Following these basic rules might help you to avoid this situation:
򐂰 The paging file must be on physically separate disk drives that have no other activity,
which also makes its I/O easier to monitor.
򐂰 The paging file must have a static size, which eliminates the need for it to grow and avoids
fragmentation of the paging file. The paging file fragments if there is not enough
contiguous HDD space to hold the entire paging file. So, plan paging file volumes with
enough capacity (see the example after this list).
򐂰 The activity on the paging file depends on the amount of memory in the system. Consider
more memory in the system if the activity and usage of the paging file are high.
򐂰 Monitor the activity on the paging file volumes regularly. The monitored values must be
included in the alert settings and generate alerts if the values are exceeded.
򐂰 The access pattern to the paging file is highly unpredictable. It can be random with the
memory page size (typically 4 KB) or sequential with blocks up to 1 MB. Read/write
balance is close to 50%. So, if you are expecting high activity on the paging file, it is better
to put it on a separate RAID 10 rank.
򐂰 For systems with a large amount of memory, you might be tempted to disable the paging
file, which is not advised. The paging file is used by all services and device drivers in the
operating system and disabling it can lead to a system crash. If the system has enough
memory, it is better to have a paging file of a minimum required static size.
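
On Windows Server 2008, a static paging file can be configured from the command line with
wmic, as in the following sketch. The D: drive and the 16384 MB size are assumptions, and an
existing paging file on that drive is assumed:
C:\> wmic computersystem where name="%computername%" set AutomaticManagedPagefile=False
C:\> wmic pagefileset where name="D:\\pagefile.sys" set InitialSize=16384,MaximumSize=16384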

For more information about the paging file, see the following websites:
򐂰 http://support.microsoft.com/kb/889654
򐂰 http://technet.microsoft.com/en-us/magazine/ff382717.aspx



10.5 Volume management
Volume managers provide an additional abstraction layer between the physical resources and
the file system and allow administrators to group multiple physical resources into a single
volume. There are two volume managers that are supported for the Microsoft Windows server
environment:
򐂰 Microsoft Logical Disk Manager (LDM)
򐂰 Veritas Volume Manager (VxVM)

10.5.1 Microsoft Logical Disk Manager


The LDM provides an abstraction layer between the NTFS file system layer and the physical
storage layer with the following functions:
򐂰 Support of dynamic and basic volumes.
򐂰 Support of concatenated, striped (RAID 0), mirrored (RAID 1), and RAID 5 volumes.
򐂰 Dynamic expansion of the dynamic volumes. Volumes are expanded with disk system
logical disk capacity increase.
򐂰 Support for Microsoft Cluster Service (might require additional hardware and software).

10.5.2 Veritas Volume Manager


Veritas Storage Foundation for Windows provides the VxVM, a comprehensive solution for
managing Windows server volumes.

VxVM includes the following features:


򐂰 Support of concatenated, striped (RAID 0), mirrored (RAID 1), mirrored striped (RAID
1+0), and RAID 5 volumes
򐂰 Dynamic expansion of all volume types
򐂰 Dynamic MultiPathing (DMP) as an optional component
򐂰 Support for Microsoft Cluster Service (might require additional hardware and software)
򐂰 Support for up to 256 physical disks in a dynamic volume

For more information about VxVM, see Veritas Storage Foundation High Availability for
Windows, found at:
http://www.symantec.com/business/storage-foundation-for-windows

Important: Despite the logical volume manager (LVM) functional benefits, it is preferable
to not use any LVM striping in DS8000 Easy Tier and I/O Priority Manager environments.
DS8800 and DS8700 storage systems offer improved algorithms and methods of
managing the data and performance at a lower level and do not require any additional
volume management. Combined usage of these technologies might lead to unexpected
results and performance degradation that makes the search for bottlenecks impossible.



10.5.3 Determining volume layout
From a performance perspective, there are two approaches to volume layout. The first
approach is to spread everything everywhere. The second approach is to isolate volumes
based on the nature of the workload or application. Chapter 4, “Logical configuration
performance considerations” on page 83 described these approaches in great detail. It is
preferable to use a consolidation of these approaches with the benefits of Easy Tier V3 and
I/O Priority Manager in DS8800 or DS8700 storage systems. From the isolation on the rank
and extent level, you switch to isolation on the application level and go deeper with priority
management and micro-tiering. There are several areas to consider when planning the
volume layout:
򐂰 Workloads that are highly sensitive to increases in I/O response time are better isolated.
It is not preferable to use isolation at the rank level. Use isolation at the extent pool level
and the volume level.
򐂰 Isolation can be done also on the functional group of the application level. For example,
you might have several mail server instances that use their own databases and locate
them on one or several extent pools that maintain different levels of I/O priority for each of
them. This functional group of mail servers is isolated from the other applications but
shares the cache memory and I/O ports.
򐂰 “Home tier” for all hybrid pool volumes is the Enterprise tier.1 The goal of the tiering is to
keep it performing well by moving hot extents to the solid-state drive (SSD) tier and free
enough for new volumes by moving cold extents to the Nearline tier.
򐂰 Easy Tier technology and I/O Priority Manager technologies are unaware of the work of
each other. It is not possible to make extents “hotter” or “colder” by setting different priority
levels. Easy Tier looks for I/O density, and priority management looks for the quality of
service (QoS).
򐂰 The skew factor is the main determinant of whether to use SSDs. Workloads with a low
skew factor might benefit more from the Micro-tier option than from SSDs. The Micro-tier
option is more attractive from the cost point of view.
򐂰 Micro-tier can be useful for high performance applications but not for capacity-oriented
applications. By using 15-K RPM and 10-K RPM drives for these applications, you might
improve the cost of 1 GB of storage.
򐂰 For applications that consistently demand high bandwidth for I/O throughput, consider
Enterprise 10-K RPM drives and Nearline drives to keep the level of performance high and
the Enterprise drive space optimized.

1 Hybrid pools in this context means pools with Enterprise disks.



Table 10-1 demonstrates examples of typical Windows server workloads, categorizes them
as potential candidates for hybrid pool usage, and describes the goals of priority
management.

Table 10-1 Windows application tier requirements and priorities

Application type | Tier levels | Priority level | Goals of tiering
Database server (online transaction processing (OLTP)) | SSD + ENT + NL, or SSD + ENT (15,10) + NL | Highest | Maintain a low response time and the highest priority for business-critical applications. Keep the ENT tier free enough for growth and benefits from Micro-tiering.
Database server (data mining, test and development, and data warehousing) | SSD + ENT + NL, or ENT (15,10) + NL | High or Medium | Micro-tiering is favorable for these applications. Non-business-critical applications might still require low response times as the backup option for the main applications, but they might also require a decreased I/O priority level.
Web servers | ENT + NL | Medium, High, or Low | Keep the ENT tier free and high performing for the database part of the application. The NL tier is used for storing the data. If it is not a business-critical application, it can have a low priority level on the requests.
Mail servers | ENT (15,10) + NL, or ENT + NL | High or Medium | Keep a high priority for the DB part requests and maintain a low read and write request response time. Keep the ENT tier free enough to expand.
File servers | NL mostly, or ENT (10) + NL | Low | Try to avoid the use of the ENT tier and keep the I/O requests at a low priority.

Note: The example applications that are listed in Table 10-1 are examples only and not
specific rules.

Volume planning for Windows applications


You must determine the number of drives of each type in each hybrid pool for each application
group. This number is not easily defined or predicted. To be accurate, you must consider many
factors, such as the skew factor, application behavior, access pattern, and active/passive user
density. Also, the workload must be evenly spread across all the DS8000 resources to
eliminate or reduce contention.



Follow these rules:
򐂰 Collect the workload statistics when possible to be able to predict the behavior of the
systems and analyze the skew factor.
򐂰 Database applications often have a high skew factor and benefit from the SSDs in the
pool. Plan at least 10% of pool capacity for SSDs, which might be increased up to 30%.
Nearline drives can help to preserve free extents on the Enterprise tier. Plan to have at
least 30% of the capacity in Nearline drives.
򐂰 If you do not expect cold extents to appear in the database application and plan to have a
heavy workload pattern, use the Micro-tiering option and have 15-K RPM drives and 10-K
RPM drives in the Enterprise tier. This plan frees the 15-K RPM drives from the large block
sequential workload so that the 15-K RPM drives provide lower response times. The ratio
is about two thirds for the 15-K RPM drives.
򐂰 Email applications have a database part for indexing data for rapid random access and the
data is stored for later access. This application can benefit from SSDs, but the number of
SSDs must not be more than 2% - 5% because of the high write ratio, which can be up to
50% - 60%. You can benefit from the Micro-tiering option with 15-K and 10-K Enterprise
drives. Mail servers can have many large block sequential writes and 10-K drives work
well. Plan for at least 90% of capacity on Enterprise drives with 50% of 10-K RPM drives.
The Nearline tier provides a plan to keep the archives in the same instances of the mail
servers.
򐂰 Web servers are likely to keep the data for a long time and cache the data often, so the
size of the Enterprise tier must be about 5% to provide random access if it happens. Most
of the storage must be Nearline drives. Use Thin Provisioning technology for the initial
data allocation, which helps to keep the Enterprise extents free at the initial data
allocation.
򐂰 File servers handle the cold data and have a sequential access pattern. The performance
of the file servers depends more on the performance of the network adapters and not the
disk subsystem. Plan to have 99% of the capacity on Nearline drives and thin-provisioned.
򐂰 Run the I/O Priority Manager on both the inter-application and intra-application sides. You
need the QoS of the overall system and the scope of the resources that are dedicated to
each application.
򐂰 Remember the device adapter (DA) pair limits when planning the extent pools.
򐂰 Include ranks from both odd and even central processor complexes (CPCs) in one extent
pool.
򐂰 Monitor the resource load regularly to avoid high peaks and unexpected delays.

The prior approach of workload isolation on the rank level might work for low skew factor
workloads or some specific workloads. Also, you can use this approach if you are confident in
planning the workload and volume layout.



10.6 Multipathing and the port layer
The multipathing, storage port, and adapter drivers exist in three separate logical layers;
however, they function together to provide access and multipathing facilities to the DS8000
storage system. Multipathing provides redundancy and scalability. Redundancy is facilitated
through multiple physical paths from the server to a DS8000 storage system. Scalability is
implemented by allowing the server to have multiple paths to the storage and to balance the
traffic across the paths. There are several methods available for configuring multipathing for
Windows servers attached to a DS8800 or DS8700 storage system:
򐂰 Windows Server 2003
IBM Subsystem Device Driver (SDD), IBM Subsystem Device Driver Device Specific
Module (SDDDSM), and Veritas DMP
򐂰 Windows Server 2008
SDDDSM and Veritas DMP

On Windows servers, the implementations of multipathing rely on either native multipathing


(Microsoft Multipath I/O (MPIO) + Storport driver) or non-native multipathing and the Small
Computer System Interface (SCSI) port driver or SCSIport. The following sections describe
the performance considerations for each of these implementations.

10.6.1 SCSIport scalability issues


Microsoft originally designed the SCSIport storage driver for parallel SCSI interfaces. SDD
and older versions of Veritas DMP still rely on it. HBA miniport device drivers that are
compliant with the SCSIport driver have many performance and scalability limitations. The
following section summarizes the key scalability issues with this architecture:
򐂰 Adapter limits
SCSIport is limited to 254 outstanding I/O requests per adapter regardless of the number
of physical disks that is associated with the adapter. SCSIport does not provide a means
for managing the queues in high loads. One of the possible results of this architecture is
that one highly active device can dominate the adapter queue, resulting in latency for other
non-busy disks.
򐂰 Serialized I/O requests processing
The SCSIport driver cannot fully take advantage of the parallel processing capabilities
available on modern enterprise class servers and the DS8000 storage system.
򐂰 Elevated Interrupt Request Levels (IRQLs)
There is a high probability that other higher priority processes might run on the same
processors as the device interrupts, which on I/O-intensive systems can cause a
significant queuing of interrupts, resulting in slower I/O throughput.
򐂰 Data buffer processing impact
The SCSIport exchanges physical address information with the miniport driver one
element at a time instead of in a batch, which is inefficient, especially with large data
transfers, and results in the slow processing of large requests.

10.6.2 Storport scalability features


In response to significant performance and scalability advances in storage technology, such
as hardware RAID and high performing storage arrays, Microsoft developed a new storage
driver called Storport. The architecture and capabilities of this driver address most of the
scalability limitations that exist in the SCSIport driver.



The Storport driver offers these key features:
򐂰 Adapter limits removed
There are no adapter limits. There is a limit of 254 requests queued per device.
򐂰 Improvement in I/O request processing
Storport decouples the StartIo and Interrupt processing, enabling parallel processing of
start and completion requests.
򐂰 Improved IRQL processing
Storport provides a mechanism to perform part of the I/O request preparation work at a
low priority level, reducing the number of requests queued at the same elevated priority
level.
򐂰 Improvement in data buffer processing
Lists of information are exchanged between the Storport driver and the miniport driver as
opposed to single element exchanges.
򐂰 Improved queue management
Granular queue management functions provide HBA vendors and device driver
developers the ability to improve management of queued I/O requests.
For more information about the Storport driver, see the following document:
http://download.microsoft.com/download/5/6/6/5664b85a-ad06-45ec-979e-ec4887d715
eb/Storport.doc

10.6.3 Subsystem Device Driver


The SDD provides path failover/failback processing for the Windows server attached to the
DS8000 storage system. SDD relies on the existing Microsoft SCSIport system-supplied port
driver and HBA vendor-provided miniport driver.

It also provides I/O load balancing. For each I/O request, SDD dynamically selects one of the
available paths to balance the load across all possible paths.

To receive the benefits of path balancing, ensure that the disk drive subsystem is configured
so that there are multiple paths to each LUN. By using multiple paths to each LUN, you can
benefit from the performance improvements from SDD path balancing. This approach also
prevents the loss of access to data if there is a path failure.

Section “Subsystem Device Driver” on page 273 describes the SDD in further detail.
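
As a quick check that each DS8000 LUN has multiple paths and, if needed, to change the
path-selection policy, the datapath commands can be used. The following is only a sketch:
the device number 0 and the lb (load balancing) policy are examples, and the available policy
names depend on your SDD or SDDDSM level, so verify them in the SDD User's Guide:
C:\> datapath query device
C:\> datapath set device 0 policy lb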

10.6.4 Subsystem Device Driver Device Specific Module


SDDDSM provides MPIO support based on Microsoft MPIO technology for
Windows Server 2003 and Windows Server 2008 servers. A Storport-based driver is required
for the Fibre Channel adapter. SDDDSM uses a device-specific module that provides support
of specific storage arrays. The DS8000 storage system supports most versions of
Windows Server 2003 and Windows Server 2008 servers as specified at the SSIC:
http://www.ibm.com/systems/support/storage/config/ssic/displayesssearchwithoutjs.w
ss?start_over=yes

You can obtain more information about SDDDSM in the SDD User’s Guide, found at:
http://www.ibm.com/support/docview.wss?rs=540&context=ST52G7&uid=ssg1S7000303



In Windows Server 2003, the MPIO drivers are provided as part of the SDDDSM package. On
Windows Server 2008, they ship with the OS.

SDDDSM: For non-clustered environments, use SDDDSM for its performance and
scalability improvements.

10.6.5 Veritas Dynamic MultiPathing for Windows


For enterprises with significant investment in Veritas software and skills, Veritas provides an
alternative to the multipathing software provided by IBM. Veritas relies on the Microsoft
implementation of MPIO and Device Specific Modules (DSMs), which rely on the Storport
driver. This implementation is not available for all versions of Windows. For your specific
hardware configuration, see the SSIC:
http://www.ibm.com/systems/support/storage/config/ssic/displayesssearchwithoutjs.w
ss?start_over=yes

10.7 Host bus adapter settings


For each HBA, there are BIOS and driver settings that are suitable for connecting to your
DS8000 storage system. Configuring these settings incorrectly can affect performance or
cause the HBA to not work correctly.

To configure the HBA, see the IBM System Storage DS8700 and DS8800 Introduction and
Planning Guide, GC27-2297-07. This guide contains detailed procedures and settings. You
also must read the readme file and manuals for the driver, BIOS, and HBA.

Obtain a list of supported HBAs, firmware, and device driver information at this website:
http://www.ibm.com/systems/support/storage/config/ssic/displayesssearchwithoutjs.w
ss?start_over=yes

Newer versions: When configuring the HBA, install the newest version of driver and the
BIOS. The newer version includes more effective function and problem fixes so that the
performance or reliability, availability, and service (RAS) can improve.

10.8 I/O performance measurement


Regular performance monitoring and measurement are critical for the normal daily work of
enterprise environments. This section covers methods, tools, and approaches to performance
management. Figure 10-2 on page 347 demonstrates the I/O layer on the left side and an
example of its corresponding method of control on the right side.



Figure 10-2 I/O layer and example implementations

In Figure 10-2:
򐂰 The application layer is monitored and tuned with the application-specific tools and metrics
available for monitoring and analyzing application performance on Windows servers.
Application-specific objects and counters are outside the scope of this book.
򐂰 The I/O Manager and file system levels are controlled with the built-in Windows tool, which
is available in Windows Performance Console (perfmon).
򐂰 The volume manager level can be also monitored with perfmon. However, it is preferable to
not use any logical volume management in Windows servers.
򐂰 Fibre Channel port level multipathing is monitored with the tools that are provided by the
multipathing software: the SDD, SDDDSM, or Veritas DMP drivers.
򐂰 The Fibre Channel adapter level can be monitored with the adapter-specific original
software that is provided by each vendor. For the support software of the adapters that are
compatible with DS8870 and DS8880 storage systems, see the following websites:
– http://www.brocade.com/services-support/index.page
– http://www.emulex.com/support.html
򐂰 The SAN fabric level and the DS8000 level are monitored with the IBM Tivoli Storage
Productivity Center and the DS8000 built-in tools. Because Tivoli Storage Productivity
Center provides more functions to monitor the DS8000 storage systems, use it for
monitoring and analysis.

The following sections describe several more complex approaches to performance


monitoring:
򐂰 Overview of I/O metrics
򐂰 Overview of perfmon
򐂰 Overview of logs
򐂰 Mechanics of logging
򐂰 Mechanics of exporting data
򐂰 Collecting multipath data
򐂰 Correlating the configuration and performance data
򐂰 Analyzing the performance data



10.8.1 Key I/O performance metrics
In Windows servers, there are two kinds of disk counters: PhysicalDisk object counters and
LogicalDisk object counters. PhysicalDisk object counters are used to monitor single physical
disks and are enabled by default. LogicalDisk object counters are used to monitor logical
disks or software RAID arrays created with LDM. LogicalDisk object counters are enabled by
default on Windows Server 2003 and Windows Server 2008 servers.

PhysicalDisk object counters: When attempting to analyze disk performance


bottlenecks, always use PhysicalDisk counters to identify performance issues with
individual DS8000 LUNs. We do not use LogicalDisk counters in this book.

Table 10-2 describes the key I/O-related metrics that are reported by perfmon.

Table 10-2 Performance monitoring counters for PhysicalDisk and other objects

Counter | Normal values | Critical values | Description
%Disk Time | ~70 - 90% | Depends on the situation | Percentage of elapsed time that the disk was busy servicing read or write requests.
Average Disk sec/Read | 5 - 15 ms | 16 - 20 ms | The average amount of time in seconds to complete an I/O read request. These results are end-to-end disk response times.
Average Disk sec/Transfer | 1 - 15 ms | 16 - 20 ms | The average amount of time in seconds to complete an I/O request. These results are end-to-end disk response times.
Average Disk sec/Write | 0 - 3 ms | 5 - 10 ms | The average amount of time in seconds to complete an I/O write request. These results are end-to-end disk response times. Write requests must be serviced from cache. If not, there is a problem.
Disk Transfers/sec | According to the workload | Close to the limits of volume, rank, and extent pool | The momentary number of disk transfers per second during the collection interval.
Disk Reads/sec | According to the workload | Close to the limits of volume, rank, and extent pool | The momentary number of disk reads per second during the collection interval.
Disk Bytes/sec | According to the workload | Close to the limits of volume, rank, and extent pool | The momentary number of bytes per second during the collection interval.
Disk Read Bytes/sec | According to the workload | Close to the limits of volume, rank, and extent pool | The momentary number of bytes read per second during the collection interval.
Current Disk Queue Length | 5 - 100, depends on the activity | > 1000 | Indicates the momentary number of read and write I/O requests waiting to be serviced. When you have I/O activity, Queue Length is not zero.
Paging File object, %Usage counter | 0 - 1% | 40% and more | The amount of the paging file instance in use, as a percentage.
Processor object, % User Time counter | 80 - 90% | 1 - 10% | The amount of time the processor spends servicing user requests, that is, user mode. Applications are considered user requests too.

Rules
We provide the following rules based on our field experience. Before using these rules for
anything specific, such as a contractual service-level agreement (SLA), you must carefully
analyze and consider these technical requirements: disk speeds, RAID format, workload
variance, workload growth, measurement intervals, and acceptance of response time and
throughput variance. We suggest these rules:
򐂰 Write and read response times in general must be as specified in Table 10-2 on
page 348.
򐂰 There must be a definite correlation between the counter values; therefore, the increase of
one counter value must lead to the increase of the others connected to it. For example, the
increase of the Transfers/sec counter leads to the increase of the Average sec/Transfer
counter because the increased number of IOPS leads to an increase in the response time
of each I/O.
򐂰 If one counter has a high value and the related parameter value is low, pay attention to this
area. It can be a hardware or software problem or a bottleneck.
򐂰 A Disk busy counter close to 100% does not mean that the system is out of its disk
performance capability; the disk is simply busy with I/Os. Problems occur when the disk
busy counter is at 100% while the I/O counters are close to zero at the same time.
򐂰 Shared storage environments are more likely to have a variance in disk response time. If
your application is highly sensitive to variance in response time, you need to isolate the
application at either the processor complex, DA, or rank level.
򐂰 With the perfmon tool, you monitor only the front-end activity of the disk system. To see the
complete picture, monitor the back-end activity, also. Use the Tivoli Storage Productivity
Center console and Storage Tier Advisor Tool (STAT).

Conversion to milliseconds: By default, Windows provides the response time in
seconds. To convert to milliseconds, you must multiply by 1000.



10.8.2 Windows Performance console (perfmon)
The Windows Performance console (perfmon) is one of the most valuable monitoring tools
available to Windows server administrators. It is commonly used to monitor server
performance and to isolate disk bottlenecks. The tool provides real-time information about
server subsystem performance. It also can log performance data to a file for later analysis.
The data collection interval can be adjusted based on your requirements. The perfmon tool
offers two ways to monitor performance:
򐂰 Real-time monitoring
򐂰 Monitoring with automatic data collection for a period

Monitoring disk performance in real time


Monitoring disk activity in real time permits you to view disk activity on local or remote disk
drives. The current, and not the historical, level of disk activity is shown in the chart. You
cannot analyze the data for any performance problems because the window size for real-time
monitoring is about 2 minutes, as shown in Figure 10-3. The data is displayed for
1 minute and 40 seconds. It is indicated in the Duration field of the window.
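
If you prefer the command line, the built-in typeperf tool can display the same PhysicalDisk
counters in real time. The following is a minimal sketch; the counter selection and the
5-second interval are only examples:
C:\> typeperf "\PhysicalDisk(*)\Disk Reads/sec" "\PhysicalDisk(*)\Avg. Disk sec/Read" -si 5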

Monitoring disk performance with data collection


If you want to determine whether excessive disk activity on your system is slowing
performance, log the disk activity of the wanted disks to a file over a period that represents
the typical use of your system. View the logged data in a chart and export it to a
spreadsheet-readable file to see whether disk activity affects system performance. Collected
data can be exported to electronic spreadsheet software for future analysis.

Figure 10-3 Main Performance console window in Windows Server 2003

The Performance console is a snap-in tool for the Microsoft Management Console (MMC).
You use the Performance console to configure the System Monitor and Performance Logs
and Alerts tools.



You can open the Performance console by clicking Start → Programs → Administrative
Tools → Performance or by typing perfmon on the command line.

Windows Server 2008 perfmon


The Windows Server 2008 performance console (perfmon) has additional features, including
these key new features:
Data Collector Sets: Use to configure data collection templates for the collection of system
and trace counters.
Resource Overview: Shows a high-level view of key system resources, including CPU%
total usage, disk aggregate throughput, network bandwidth utilization, and memory hard
faults/sec (see Figure 10-4).
Reports: Integration of the SPA function. This feature can quickly report on collected
counter and trace data in a way that provides substantial detail.

Figure 10-4 Windows Server 2008 perfmon console

With Windows Server 2008, you can open the Performance console by clicking Start →
Programs → Administrative Tools → Performance or by typing perfmon on the command
line.

10.8.3 Performance log configuration and data export


With many physical disks and long collection periods that are required to identify certain disk
bottlenecks, it is impractical to use real-time monitoring. In these cases, disk performance
data can be logged for analysis over extended periods. The remaining sections assume that
you collected performance data.
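
A minimal sketch of such a collection with the built-in logman and relog tools follows. The
collection name, counter selection, 30-second interval, and output paths are assumptions;
adjust them to your environment, and note that the binary log file name can include a
sequence number, depending on the logman options in effect:
C:\> logman create counter DS8000disk -c "\PhysicalDisk(*)\*" -si 00:00:30 -o C:\PerfLogs\DS8000disk
C:\> logman start DS8000disk
C:\> logman stop DS8000disk
C:\> relog C:\PerfLogs\DS8000disk.blg -f CSV -o C:\PerfLogs\DS8000disk.csv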



10.8.4 Collecting configuration data
You need the correct configuration data so that you can easily and clearly understand which
disk in the system correlates to which logical volume in the DS8000 storage system. This
section guides you step-by-step through the process of getting configuration data for one
disk. You need the following tools:
򐂰 MMC with the Computer Management snap-in
Click Start → Run and type mmc. An MMC window opens. Open
%SystemRoot%\System32\compmgmt.msc and you see the Computer Management snap-in.
򐂰 SDDDSM Console
Click Start → Program → Subsystem Device Driver → Subsystem Device Driver
Management. A command-line window opens.
򐂰 DS8000 command-line interface (DSCLI) console
Open %SystemDrive%\Program Files\IBM\dscli\dscli.exe.
򐂰 Identify the disks that you have in the Disk Management of the Computer Management
snap-in, as shown in Figure 10-5.

Figure 10-5 List of the volumes in MMC

Figure 10-5 shows several disks. To identify them, right-click the name and click the
Properties option on each of them. Disks from the DS8000 storage system show IBM 2107
in the properties, which is the definition of the DS8000 machine-type. So, in this example, our
disk from the DS8000 storage system is Disk 2, as shown in Figure 10-6 on page 353.



Figure 10-6 Properties of the DS8800 logical volume

Multi-Path Disk Device means that you are running SDD. You can also check for SDD from
the Device Manager option of the Computer Management snap-in, as shown in Figure 10-7.

Figure 10-7 SDD multipathing is running

In Figure 10-7, you see several devices and one SDD that is running. Use the datapath query
device command to show the disk information in the SDDDSM console (Example 10-1).

Example 10-1 The datapath query device command


c:\Program Files\IBM\SDDDSM>datapath query device
Total Devices : 4
DEV#: 0 DEVICE NAME: Disk2 Part0 TYPE: 2107900 POLICY: OPTIMIZED
SERIAL: 75V1818601
============================================================================
Path# Adapter/Hard Disk State Mode Select Errors
0 Scsi Port4 Bus0/Disk2 Part0 OPEN NORMAL 1520 0
1 Scsi Port4 Bus0/Disk2 Part0 OPEN NORMAL 1507 0
2 Scsi Port2 Bus0/Disk2 Part0 OPEN NORMAL 52 24
3 Scsi Port2 Bus0/Disk2 Part0 OPEN NORMAL 165 78



Example 10-1 shows the disk information. Disk 2 has serial number 75V1818601.
The last four digits are the volume ID, which is 8601 in this example. This disk is connected
through two FC ports of the host FC adapters (Scsi Port2 and Scsi Port4).

List the worldwide port names (WWPNs) of the ports by running the datapath query wwpn
command (Example 10-2).

Example 10-2 List the FC port WWPNs


c:\Program Files\IBM\SDDDSM>datapath query wwpn
Adapter Name PortWWN
Scsi Port2: 2100001B32937DAE
Scsi Port3: 2101001B32B37DAE
Scsi Port4: 2100001B3293D5AD
Scsi Port5: 2101001B32B3D5AD

In Example 10-2, the two WWPNs of interest are those for Scsi Port2 and Scsi Port4.

Identify the disk in the DS8000 storage system by using the DSCLI console (Example 10-3).

Example 10-3 List the volumes in the DS8800 DSCLI console (output truncated)
dscli> lsfbvol
Name ID accstate datastate configstate deviceMTM datatype extpool cap (2^30B) cap (10^9B) cap (blocks)
===========================================================================================================
- 8600 Online Normal Normal 2107-900 FB 512 P4 100.0 - 209715200
- 8601 Online Normal Normal 2107-900 FB 512 P4 100.0 - 209715200
- 8603 Online Normal Normal 2107-900 FB 512 P4 100.0 - 209715200
- 8604 Online Normal Normal 2107-900 FB 512 P4 100.0 - 209715200
- 8605 Online Normal Normal 2107-900 FB 512 P4 100.0 - 209715200
- 8606 Online Normal Normal 2107-900 FB 512 P4 100.0 - 209715200
- 8607 Online Normal Normal 2107-900 FB 512 P4 100.0 - 209715200

Example 10-3 shows the output of the DSCLI command lsfbvol, which lists all the fixed
block (FB) volumes in the system. Our example volume has ID 8601 (shown in bold) and is
created in extent pool P4.

Next, list the ranks that are allocated with this volume by running showfbvol -rank Vol_ID
(Example 10-4).

Example 10-4 List the ranks that are allocated with this volume
dscli> showfbvol -rank 8601
Name -
ID 8601
accstate Online
datastate Normal
configstate Normal
deviceMTM 2107-900
datatype FB 512
addrgrp 8
extpool P4
exts 100
captype DS
cap (2^30B) 100.0
cap (10^9B) -
cap (blocks) 209715200



volgrp V8
ranks 2
dbexts 0
sam ESE
repcapalloc -
eam managed
reqcap (blocks) 209715200
realextents 4
virtualextents 100
migrating 0
perfgrp PG0
migratingfrom -
resgrp RG0
==============Rank extents==============
rank extents
============
R17 2
R18 2

Example 10-4 shows the output of the command. Volume 8601 is an extent
space-efficient (ESE) volume with a virtual capacity of 100 GB that occupies four real extents
(two from each of ranks R17 and R18), which means that 4 GB of space is occupied. Extent
pool P4 is under Easy Tier automatic management.

Check for the arrays, RAID type, and the DA pair allocation (Example 10-5).

Example 10-5 List the array and DA pair


dscli> lsrank
ID Group State datastate Array RAIDtype extpoolID stgtype
=============================================================
R16 1 Normal Normal A13 5 P1 ckd
R17 0 Normal Normal A2 5 P4 fb
R18 0 Normal Normal A3 5 P4 fb
R20 1 Reserved Normal A5 5 P5 fb

dscli> lsarray
Array State Data RAIDtype arsite Rank DA Pair DDMcap (10^9B)
======================================================================
A0 Assigned Normal 5 (6+P+S) S1 R0 0 300.0
A1 Assigned Normal 5 (6+P+S) S2 R1 0 300.0
A2 Assigned Normal 5 (6+P+S) S3 R17 2 600.0
A3 Assigned Normal 5 (6+P+S) S4 R18 2 600.0
A4 Unassigned Normal 5 (6+P+S) S5 - 2 600.0

Example 10-5 shows the array and DA pair allocation from running the lsarray and
lsrank commands. Ranks R17 and R18 relate to arrays A2 and A3 and to the Enterprise
drives of 600 GB on DA pair 2.



From Example 10-4 on page 354, you know that Volume 8601 is in Volume Group V8. You
can list the host connection properties (Example 10-6).

Example 10-6 Listing the host connection properties


dscli> lshostconnect -volgrp v8
Name ID WWPN HostType Profile portgrp volgrpID ESSIOport
===========================================================================================
x3850_lab_5_s12p20 0008 2100001B32937DAE - Intel - Windows 2008 0 V8 all
x3850_lab_5_s12p22 0009 2100001B3293D5AD - Intel - Windows 2008 0 V8 all

By running lshostconnect -volgrp VolumeGroup_ID, you can list the ports to which this
volume group is connected, as shown in Example 10-6. This volume group uses host
connections with IDs 0008 and 0009 and the WWPNs that are shown in bold. These WWPNs
are the same as the WWPNs that are listed in Example 10-2 on page 354.

List the ports that are used in the disk system for the host connections (Example 10-7).

Example 10-7 List the host port connections (outputs truncated)


dscli> showhostconnect 0008
Name x3850_lab_5_s12p20
ID 0008
WWPN 2100001B32937DAE
HostType -
LBS 512
addrDiscovery LUNPolling
Profile Intel - Windows 2008
portgrp 0
volgrpID V8
atchtopo -
ESSIOport I0002

dscli> showhostconnect 0009


Name x3850_lab_5_s12p22
ID 0009
WWPN 2100001B3293D5AD
HostType -
LBS 512
addrDiscovery LUNPolling
Profile Intel - Windows 2008
portgrp 0
volgrpID V8
atchtopo -
ESSIOport I0003

dscli> lsioport
ID WWPN State Type topo portgrp
===============================================================
I0000 500507630A00029F Online Fibre Channel-SW SCSI-FCP 0
I0001 500507630A00429F Online Fibre Channel-SW FC-AL 0
I0002 500507630A00829F Online Fibre Channel-SW SCSI-FCP 0
I0003 500507630A00C29F Online Fibre Channel-SW SCSI-FCP 0
I0004 500507630A40029F Online Fibre Channel-SW FICON 0
I0005 500507630A40429F Online Fibre Channel-SW SCSI-FCP 0



I0006 500507630A40829F Online Fibre Channel-SW SCSI-FCP 0
I0007 500507630A40C29F Online Fibre Channel-SW SCSI-FCP 0

Example 10-7 on page 356 shows how to obtain the WWPNs and port IDs by running the
showhostconnect and lsioport commands. All of the information that you need is in bold.

After these steps, you have all the configuration information for a single disk in the system.

10.8.5 Correlating performance and configuration data


Section 10.8.4, “Collecting configuration data” on page 352 shows how to get configuration
information for one disk manually. Because most of the information was obtained with the CLI,
the process can be organized into scripts, which makes it easier when you have many disks. It
is also possible to correlate the collected performance data and the configuration data, as
sketched below.
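
For example, a small batch wrapper such as the following sketch can capture the host-side
and DS8000-side output into one snapshot file for later correlation. The DSCLI installation
path, the profile name, and the output file name are assumptions; the profile must contain
the HMC address and credentials:
@echo off
rem Capture SDD/SDDDSM path information and DS8000 configuration in one file.
datapath query device  >  cfg_snapshot.txt
datapath query wwpn    >> cfg_snapshot.txt
dscli -cfg "C:\Program Files\IBM\dscli\profile\dscli.profile" lsfbvol       >> cfg_snapshot.txt
dscli -cfg "C:\Program Files\IBM\dscli\profile\dscli.profile" lsrank        >> cfg_snapshot.txt
dscli -cfg "C:\Program Files\IBM\dscli\profile\dscli.profile" lshostconnect >> cfg_snapshot.txt
dscli -cfg "C:\Program Files\IBM\dscli\profile\dscli.profile" lsioport      >> cfg_snapshot.txt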

Analyzing performance data


Performance analysis can be complicated. Because there are many methods to analyze
performance, this work cannot be done randomly. The methodology for analyzing disk I/O
issues is provided in 7.8, “End-to-end analysis of I/O performance problems” on page 248.
This section goes deeper into detail and provides suggestions that are specific to the
Windows environment and applications.

After the performance data is correlated to the DS8000 LUNs and reformatted, open the
performance data file in Microsoft Excel. It looks similar to Figure 10-8.

DATE  TIME  Subsystem  LUN Serial  Disk  Disk Reads/sec  Avg Read RT(ms)  Avg Total Time  Avg Read Queue Length  Read KB/sec

11/3/2008 13:44:48 75GB192 75GB1924 Disk6 1,035.77 0.612 633.59 0.63 66,289.14
11/3/2008 13:44:48 75GB192 75GB1924 Disk2 1,035.75 0.613 634.49 0.63 66,288.07
11/3/2008 13:44:48 75GB192 75GB1924 Disk3 1,035.77 0.612 633.87 0.63 66,289.14
11/3/2008 13:44:48 75GB192 75GB1924 Disk5 1,035.77 0.615 637.11 0.64 66,289.14
11/3/2008 13:44:48 75GB192 75GB1924 Disk4 1,035.75 0.612 634.38 0.63 66,288.07
11/3/2008 13:44:48 75GB192 75GB1924 Disk1 1,035.77 0.612 633.88 0.63 66,289.14
11/3/2008 14:29:48 75GB192 75GB1924 Disk6 1,047.24 5.076 5,315.42 5.32 67,023.08
11/3/2008 14:29:48 75GB192 75GB1924 Disk2 1,047.27 5.058 5,296.86 5.30 67,025.21
11/3/2008 14:29:48 75GB192 75GB1924 Disk3 1,047.29 5.036 5,274.30 5.27 67,026.28
11/3/2008 14:29:48 75GB192 75GB1924 Disk5 1,047.25 5.052 5,291.01 5.29 67,024.14
11/3/2008 14:29:48 75GB192 75GB1924 Disk4 1,047.29 5.064 5,303.36 5.30 67,026.28
11/3/2008 14:29:48 75GB192 75GB1924 Disk1 1,047.29 5.052 5,290.89 5.29 67,026.28
11/3/2008 13:43:48 75GB192 75GB1924 Disk6 1,035.61 0.612 634.16 0.63 66,279.00
11/3/2008 13:43:48 75GB192 75GB1924 Disk2 1,035.61 0.612 633.88 0.63 66,279.00
11/3/2008 13:43:48 75GB192 75GB1924 Disk3 1,035.61 0.615 636.72 0.64 66,279.00
Figure 10-8 The perfmon-essmap.pl script output

A quick look at the compiled data in Figure 10-8 shows a large increase in response time
without a corresponding increase in the number of IOPS (Disk Reads/sec). This counter
shows that a problem occurred, which you can confirm with the increased Queue Length
value. You must look at the drives that show the response time increase and collect
additional data for those drives.


However, even with this data, you can assume possible reasons for what happened and
propose further steps:
򐂰 Read response time was near zero before it increased, so the reads were mostly serviced
from the cache. The even distribution across all the drives indicates that the workload is
mostly random and balanced across all the paths, but this is one of the points to check
further.
򐂰 You might assume that the increase in response time is caused by an increase in the
number of read operations, but the data does not confirm this assumption: the response
time increased by about 10 times, but the number of read operations increased by less
than 1%. So, this cannot be the reason for the response time increase.
򐂰 Another assumption might be that the blocksize of the I/O increased and caused a longer
response time. There is no confirmation for that situation either: the number of megabytes
per second does not show any dramatic increase, and the blocksize remains about 64 KB.
So, this assumption is also not the reason for the increase in response time.
򐂰 The most likely reason for the behavior of the system is write activity on the front end or at
the back end of the system, typically on the same drives on which the read activity occurs.
That write activity is not as cache friendly as expected, which caused the increase of the
response time and the increase in queue length. Probably, it was a batch activity or a
large block write activity.

Because you have a possible reason for the read response time increase, you can specify the
further steps to confirm it:
򐂰 Gather additional performance data for the volumes from the Windows server, including
write activity.
򐂰 Gather performance data from the back end of the disk system on those volumes for any
background activity or secondary operations.
򐂰 Examine the balancing policy on the disk path.
򐂰 Examine the periodic processes initiated in the application. There might be activity on the
log files.
򐂰 For database applications, separate the log files from the main data and indexes.
򐂰 Check for any other activity on that drive that can cause an increase of the write I/Os.

You can see how even a small and uncertain amount of the collected performance data can
help you in the early detection of performance problems and help you quickly identify further
steps.

Removing disk bottlenecks


Disk bottlenecks occur in two potential places:
򐂰 Applications and operating system
򐂰 Disk system

At the disk subsystem level, there can be bottlenecks on the rank level, extent pool level, DA
pair level, and cache level that lead to problems on the volume level. Table 10-3 on page 359
describes the reasons for the problems on different levels.



Table 10-3 Bottlenecks on the disk system levels (the reason and action numbers correspond)

Rank level
Possible reasons:
1. Rank IOPS capability or rank bandwidth capability exceeded.
2. RAID type does not fit the workload type.
3. Disk type is wrong.
4. Physical problems with the disks in the rank.
Actions:
1. Split the workload between several ranks organized into one extent pool with the rotate extents feature. If already organized, manually rebalance the ranks.
2. Change the RAID level to a better performing RAID level (RAID 5 to RAID 10, for example) or migrate extents to another extent pool with better conditions.
3. Migrate extents to better performing disks.
4. Fix the problems with the disks.

Extent pool level
Possible reasons:
1. Extent pool capability reached its maximum.
2. Conflicting workloads mixed in the same extent pool.
3. No Easy Tier management for this pool.
4. One of the ranks in the pool is overloaded.
5. Physical problems with the disks in the rank.
Actions:
1. Add more ranks to the pool, examine the STAT reports for recommendations, add more tiers in the pool, or use hot promotion and cold demotion.
2. Split workloads into separate pools, split one pool into two pools dedicated to both processor complexes, examine the STAT data and add the required tier, or set priorities for the workloads and enable IOPM.
3. Start Easy Tier for this pool by following the recommendations from the STAT tool.
4. Perform the extent redistribution for the pool, start Easy Tier and follow the recommendations from the STAT tool, or use the rotate extents method.
5. Fix the problems with the disks or remove the rank from the pool.

DA pair level
Possible reasons:
1. DA pair is overloaded.
2. Hardware malfunction on the DA pair.
3. Hardware problem with one or more ranks on the DA pair.
Actions:
1. Mind the limits of the DA pair, decrease the number of SSDs on the DA pair, or use ranks from as many different DA pairs as possible.
2. Fix the problem with the hardware.
3. Fix the problem with the rank or exclude it from the extent pools.

Cache level
Possible reasons:
1. Cache-memory limits are reached.
2. Workload is not “cache-friendly”.
3. Large number of write requests to a single volume (rank or extent pool).
Actions:
1. Upgrade the cache memory, add more ranks to the extent pool, enable Easy Tier, or split extent pools evenly between CPCs.
2. Add more disks, use the Micro-tiering function to unload the 15-K RPM drives, or tune the application, if possible.
3. Split the pools and ranks evenly between CPCs to use all the cache memory.

CPC level
Possible reasons:
1. CPCs are overloaded.
2. There is uneven volume, extent pool, or rank assignment on the CPCs.
3. There are CPC hardware problems.
Actions:
1. Split the workload between the two CPCs evenly, stop the unnecessary Copy Services, or upgrade the system to another model.
2. Split the pools and ranks evenly between CPCs to use all the cache memory and processor power.
3. Fix the problems.

Host adapter (HA) level
Possible reasons:
1. Ports are overloaded.
2. Host I/Os are mixed with Copy Services I/Os.
3. There are faulty SFPs.
4. There is incorrect cabling.
5. There are other hardware problems.
Actions:
1. Add more HAs, change HAs to better performing ones (4 - 8 Gbps), use another multipathing balancing approach, or use the recommended number of logical volumes per port.
2. Use dedicated adapters for the Copy Services, split host ports for the different operating systems, or separate backup workloads and host I/O workloads.
3. Replace the SFPs.
4. Change the cabling or check the cables with tools.
5. Fix the problems with the SAN hardware.

At the application level, there can be bottlenecks in the application, multipathing drivers,
device drivers, zoning misconfiguration, or adapter settings. However, a Microsoft
environment is self-tuning and many problems might be fixed without any indication. Windows
can cache many I/Os and serve them from cache. It is important to maintain a large amount
of free Windows Server memory for peak usage. Also, the paging file should be set up based
on the guidelines that are given in 10.4.3, “Paging file” on page 339.



To avoid bottlenecks in the application, follow these guidelines:
򐂰 MS SQL Server caches many transactions in the operating system and in SQL Server itself, but it is still a database, so place log files and data files separately. Never put log files with the paging files. Never place log files on the main system disk (that is, %SystemRoot%). Use three-tier technology whenever possible because SQL Server varies in its I/O and access pattern. Use IOPM to keep SQL Server volumes at the highest priority all the time. For more MS SQL Server tuning suggestions, see this website:
http://technet.microsoft.com/en-us/library/bb545450.aspx
򐂰 MS Exchange Server is a unique application that behaves both as an OLTP database and as a data warehousing database. Serial Advanced Technology Attachment (SATA) or Nearline drives alone are not advised for MS Exchange Server, but with Easy Tier and Enterprise+Nearline tiers in one extent pool, you can use most of the capacity on the Nearline drives. This approach fits the MS Exchange workload pattern, which is up to 50 - 60% writes. The Micro-tiering option also works well, so you might consider putting 90% of the capacity on 10-K RPM drives. Do not use RAID 6 or RAID 5 for high-performing MS Exchange Servers because of the high random write ratio. The MS Exchange database and indexes must be physically separated from the transaction logs. At least two logical disks on separate extent pools are advised for the MS Exchange setup. For more MS Exchange Server tuning suggestions, see this website:
https://technet.microsoft.com/en-us/library/dn879084%28v=exchg.150%29.aspx

For the other Microsoft applications, follow the guidelines in Table 10-1 on page 342:
򐂰 To avoid bottlenecks on the SDDDSM side, maintain a balanced use of all the paths and
keep them active always, as shown in Example 10-8. You can see the numbers for reads
and writes on each adapter, which are nearly the same.

Example 10-8 Obtain the FC adapter statistics


c:\Program Files\IBM\SDDDSM>datapath query adaptstats
Adapter #: 0
=============
Total Read Total Write Active Read Active Write Maximum
I/O: 3003 2522 0 0 2
SECTOR: 306812 367257 0 0 2048
Adapter #: 1
=============
Total Read Total Write Active Read Active Write Maximum
I/O: 3103 2422 0 0 2
SECTOR: 306595 306094 0 0 2049

򐂰 SAN zoning, cabling, and FC-adapter settings must be done according to the IBM System Storage DS8700 and DS8800 Introduction and Planning Guide, GC27-2297-07, but do not configure more than four paths per logical volume.

After you detect a disk bottleneck, you might perform several of these actions:
򐂰 If the disk bottleneck is a result of another application in the shared environment that
causes disk contention, request a LUN on a less used rank and migrate the data from the
current rank to the new rank. Start by using Priority Groups.
򐂰 If the disk bottleneck is caused by too much load that is generated from the Windows Server to a single DS8000 LUN, spread the I/O activity across more DS8000 ranks, which might require the allocation of additional LUNs. Start Easy Tier for the volumes and migrate to hybrid pools (a DSCLI sketch follows this list).



򐂰 On Windows Server 2008 for sequential workloads with large transfer sizes (256 KB),
consider the Micro-tiering option or Enterprise+Nearline drive hybrid pools.
򐂰 Move processing to another system in the network (either users, applications, or services).
򐂰 Add more memory. Adding memory increases system memory disk cache, which might
reduce the number of required physical I/Os and indirectly reduce disk response times.
򐂰 If the problem is a result of a lack of bandwidth on the HBAs, install additional HBAs to
provide more bandwidth to the DS8000 storage system.
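
If you decide to relocate a busy volume to a less used or hybrid extent pool, the DS8000 dynamic volume relocation function can be driven from the DSCLI. The following sketch uses a hypothetical volume ID 1000 and target extent pool P4; verify the options against the DSCLI reference for your code level:

dscli> lsextpool -l                                    #check capacity and configuration of the extent pools
dscli> managefbvol -action migstart -extpool P4 1000   #start dynamic relocation of volume 1000 to pool P4
dscli> showfbvol 1000                                  #display the volume details, including the migration status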

For more information about Windows Server disk subsystem tuning, see the following
website:
http://www.microsoft.com/whdc/archive/subsys_perf.mspx



Chapter 11. Performance considerations for VMware

This chapter describes the monitoring and tuning tools and techniques that can be used with
VMware ESXi Server to optimize throughput and performance when attaching to the DS8000
storage system.

This chapter includes the following topics:


򐂰 I/O architecture from a VMware perspective
򐂰 Initial planning considerations for optimum performance of VMware host systems that use
the DS8000 storage system in a storage area network (SAN)
򐂰 Specific VMware performance measuring tools and tuning options
򐂰 SAN multipathing considerations
򐂰 Testing and verifying the DS8000 storage system attached to VMware host systems
򐂰 Configuring VMware logical storage for optimum performance
򐂰 VMware operating system tuning considerations for maximum storage performance



11.1 Disk I/O architecture overview
This chapter introduces the relevant logical configuration concepts that are needed to attach
VMware ESXi Server to a DS8000 storage system. It focuses on performance-relevant
configuration and measuring options. For more information about how to set up VMware ESXi
Server with the DS8000 storage system, see IBM System Storage DS8000: Host Attachment
and Interoperability, SG24-8887.

VMware ESXi Server supports the use of external storage that can be on a DS8000 storage
system. The DS8000 storage system is typically connected by Fibre Channel (FC) and
accessed over a SAN. Each logical volume that is accessed by a VMware ESXi Server is
configured in a specific way, and this storage can be presented to the virtual machines (VMs)
as virtual disks (VMDKs).

To understand how storage is configured in VMware ESXi Server, you must understand the
layers of abstraction that are shown in Figure 11-1.

Figure 11-1 Storage stack for VMware ESXi Server (virtual disks in the virtual machine, the VMFS volume in the ESXi Server, and the DS8000 logical volume as external storage)

For VMware to use external storage, VMware must be configured with logical volumes that
are defined in accordance with the expectations of the users, which might include the use of
RAID or striping at a storage hardware level. Striping at a storage hardware level is preferred
because the DS8000 storage system can combine Easy Tier and IOPM mechanisms. These
logical volumes must be presented to VMware ESXi Server. For the DS8000 storage system,
host access to the volumes includes the correct configuration of logical volumes, host
mapping, correct logical unit number (LUN) masking, and zoning of the involved SAN fabric.



At the VMware ESXi Server layer, these logical volumes can be addressed as a VMware
ESXi Server File System (VMFS) volume or as a raw disk that uses Raw Device Mapping
(RDM). A VMFS volume is a storage resource that can serve several VMs and several
VMware ESXi Servers as consolidated storage. In the VMware GUI, you see datastores,
which are logical containers for VMFS volumes. However, an RDM volume is intended for
usage as isolated storage by a single VM.

Two options exist to use these logical drives within vSphere Server:
򐂰 Formatting these disks with the VMFS: This option is the most common option because a
number of features require that the VMDKs are stored on VMFS volumes.
򐂰 Passing the disk through to the guest OS as a raw disk. No further virtualization occurs.
Instead, the OS writes its own file system onto that disk directly as though it is in a
stand-alone environment without an underlying VMFS structure.

The VMFS volumes house the VMDKs that the guest OS sees as its real disks. These
VMDKs are in the form of a file with the extension .vmdk. The guest OS either reads and writes to
the VMDK file (.vmdk) or writes through the VMware ESXi Server abstraction layer to a raw
disk. In either case, the guest OS considers the disk to be real.

Figure 11-2 compares VMware VMFS volumes to DS8000 logical volumes, which are referenced by their volume IDs, for example, 1000, 1001, and 1002.

Figure 11-2 Logical drives compared to VMware VMFS volumes

On the VM layer, you can configure one or several VMDKs out of a single VMFS volume.
These VMDKs can be configured for use by several VMs.

VMware datastore concept


VMware ESXi Server uses specially formatted logical containers called datastores. These
datastores can be on various types of physical storage devices, local disks inside VMware
ESXi Server, FC-attached disks, and iSCSI disks and Network File System (NFS) disks.

The VM disks are stored as files within a VMFS. When a guest operating system issues a
Small Computer System Interface (SCSI) command to its VMDKs, the VMware virtualization
layer converts this command to VMFS file operations. From the standpoint of the VM
operating system, each VMDK is recognized as a direct-attached SCSI drive connected to a
SCSI adapter. Device drivers in the VM operating system communicate with the VMware
virtual SCSI controllers.



Figure 11-3 illustrates the VMDK mapping within VMFS.

Figure 11-3 Map virtual disks to LUNs within VMFS

VMFS is optimized to run multiple VMs as one workload to minimize disk I/O impact. A VMFS
volume can be spanned across several logical volumes, but there is no striping available to
improve disk throughput in these configurations. Each VMFS volume can be extended by
adding additional logical volumes while the VMs use this volume.

The VMFS volumes store this information:


򐂰 VM .vmdk files
򐂰 The memory images from VMs that are suspended
򐂰 Snapshot files for the .vmdk files that are set to a disk mode of non-persistent, undoable, or
append

Important: A VMFS volume can be spanned across several logical volumes, but there is
no striping available to improve disk throughput in these configurations. With Easy Tier,
hot/cold extents can be promoted or demoted, and you can achieve superior performance
versus economics on VMware ESXi hosts as well.

An RDM is implemented as a special file in a VMFS volume that acts as a proxy for a raw
device. An RDM combines the advantages of direct access to physical devices with the
advantages of VMDKs in the VMFS. In special configurations, you must use RDM raw
devices, such as in Microsoft Cluster Services (MSCS) clustering, or when you install IBM
Spectrum Protect™ Snapshot in a VM that is running on Linux.



With RDM volumes, VMware ESXi Server supports the use of N_Port ID Virtualization
(NPIV). This host bus adapter (HBA) virtualization technology allows a single physical HBA
port to function as multiple logical ports, each with its own worldwide port name (WWPN).
This function can be helpful when you migrate VMs between VMware ESXi Servers by using
VMotion, and to separate workloads of multiple VMs configured to the same paths on the
HBA level for performance measurement purposes.

The VMware ESXi virtualization of datastores is shown in Figure 11-4.

Figure 11-4 VMware virtualization of datastores



11.2 vStorage APIs for Array Integration support
VMware vSphere provides an application programming interface (API) which extends the
functions of vSphere. The vStorage APIs for Array Integration (VAAI) enables certain storage
tasks to be offloaded from the server to the storage array. With VAAI, VMware vSphere can
perform key operations faster and use less host processor resources, memory, and storage
bandwidth. Figure 11-5 shows the VAAI module in the VMware stack. For example, with a
single command, the host can tell the DS8000 storage system to copy chunks of data from
one volume to another. This action saves the host from having to read the data from the
volume and write it back to another volume.

Figure 11-5 vStorage APIs for Array Integration in the VMware storage stack

VAAI support relies on the storage implementing several fundamental operations that are
named as primitives. These operations are defined in terms of standard SCSI commands,
which are defined by the T10 SCSI specification.

The VAAI primitives include the following features:


򐂰 Atomic Test and Set
This feature is used to implement hardware accelerated locking of files. It is now possible
to run more VMs on a single volume and to create more VMware-based Snapshots
without creating performance problems.
򐂰 Clone Blocks (XCOPY)
This primitive enables the DS8000 storage system to copy a range of logical blocks from
one location to another, either within a single LUN or from one LUN to another. Without
this primitive, the host must read all the data from the source blocks and write it to the
target blocks.
򐂰 Write Same
This primitive is used to initialize new VMDKs, which allows the host to send only a single
512-byte block of zeros, which the DS8000 storage system writes to many blocks of the
volume.
򐂰 Thin provisioning
This primitive encompasses the following primitives:
– Out-of-space condition: To pause a running VM, when capacity is exhausted.
– Thin provisioning LUN reporting: To enable vSphere to determine the LUN thin
provisioning status.



– Quota Exceeded Behavior: Allows vSphere to react before out-of-space conditions
occur.
– Unmap: Allows the clearing of unused VMFS space.
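
To confirm that these primitives are enabled for a DS8000 LUN, you can query the VAAI status of the device from the ESXi shell. The naa identifier shown here is only a placeholder for one of your DS8000 device IDs:

esxcli storage core device list   #list all devices with their naa identifiers
esxcli storage core device vaai status get -d naa.6005076303ffc08c0000000000001000   #report the ATS, Clone, Zero, and Delete status of the device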

For more information about VAAI integration and usage with the DS8000 storage system, see
IBM DS8870 and VMware Synergy, REDP-4915.

11.3 Host type for the DS8000 storage system


To support VMware ESXi hosts, create your volume groups on the DS8000 storage system with type scsimap256. For host connections, reference each host by using the host type VMware (with LUNPolling as the SCSI address discovery
method), as shown in Example 11-1. A list of available SCSI host types on DS8800 and
DS8700 storage systems can be obtained by running the following command:
lshosttype -type scsiall

Example 11-1 Create volume groups and host connections for VMware hosts
dscli> mkvolgrp -type scsimap256 VMware_Host_1_volgrp_1
CMUC00030I mkvolgrp: Volume group V19 successfully created.

dscli> mkhostconnect -wwpn 21000024FF2D0F8D -hosttype VMware -volgrp V19 -desc "Vmware host1 hba1"
Vmware_host_1_hba_1
CMUC00012I mkhostconnect: Host connection 0036 successfully created.

dscli> lshostconnect -l -volgrp V19


Name ID WWPN HostType LBS addrDiscovery Profile portgrp volgrpID atchtopo ESSIOport speed desc
==========================================================================================================================================
Vmware_host_1_hba_1 0036 21000024FF2D0F8D VMWare 512 LUNPolling VMWare 0 V19 - all Unknown Vmware host1 hba1

11.4 Multipathing considerations


In VMware ESXi Server, the name of a storage device is displayed as a sequence of three or four identifiers separated by colons, for example, vmhba2:C0:T1:L1. This naming has the following format:
<vmhba#>:<C#>:<T#>:<L#>
vmhba# The physical HBA on the host. Either an FC HBA, a SCSI adapter, or
even an iSCSI initiator.
C# The storage channel number. It is used to show multiple paths to the
same target.
T# The target number. The target numbering is decided by the host and
might change if there is a LUN mapping change.
L# The LUN number. It is provided by the storage system.

For example: vmhba1:C3:T2:L0 represents LUN 0 on target 2 accessed through the storage
adapter vmhba1 and channel 3.

In multipathing environments, which are the standard configuration in correctly implemented


SAN environments, the same LUN can be accessed through several paths. The same LUN
has two or more storage device names. After a rescan or restart, the path information that is
displayed by VMware ESXi Server might change; however, the name still refers to the same
physical device.



In Figure 11-6, four LUNs are connected to the FC adapter vmhba0. If multiple HBAs are
used to connect to the SAN-attached storage for redundancy reasons, this LUN can also be
addressed through a different HBA.

Figure 11-6 Storage Adapters properties view in the vSphere Client

VMware ESXi Server provides built-in multipathing support, which means that it is not
necessary to install an additional failover driver. Any external failover drivers, such as
subsystem device drivers (SDDs), are not supported for VMware ESXi. Since ESX 4.0, it
supports path failover and the round-robin algorithm.

VMware ESXi Server 6.0 provides three major multipathing policies for use in production
environments: Most Recently Used (MRU), Fixed, and Round Robin (RR):
򐂰 MRU policy is designed for usage by active/passive storage devices, such as IBM System
Storage DS4000® storage systems, with only one active controller available per LUN.
򐂰 The Fixed policy ensures that the designated preferred path to the storage is used
whenever available. During a path failure, an alternative path is used, and when the
preferred path is available again, the multipathing module switches back to it as the active
path.
򐂰 The Round Robin policy with a DS8000 storage system uses all available paths to rotate
through the available paths. It is possible to switch from MRU and Fixed to RR without
interruptions. With RR, you can change the number of bytes and number of I/O operations
sent along one path before you switch to the other path. RR is a good approach for various
systems; however, it is not supported for use with VMs that are part of MSCS.

The default multipath policy for ALUA devices since ESXi 5 is Round Robin.

The multipathing policy and the preferred path can be configured from the vSphere Client or
by using the command-line tool esxcli. For command differences among the ESXi versions,
see Table 11-1 on page 371.



Table 11-1 Commands for changing the multipath policy on ESXi hosts

ESXi 5.x/6.0
Command: esxcli storage nmp device set --device <naa_id> --psp <path_policy>
Available path policies: VMW_PSP_FIXED, VMW_PSP_MRU, and VMW_PSP_RR

ESXi 4
Command: esxcli nmp -d <naa_id> <path_policy>
Available path policies: VMW_PSP_FIXED, VMW_PSP_MRU, and VMW_PSP_RR
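
For example, on ESXi 5.x or 6.0, you can check and change the policy of a single DS8000 LUN from the ESXi shell; the naa identifier shown here is only a placeholder:

esxcli storage nmp device list -d naa.6005076303ffc08c0000000000001000   #show the current SATP and path selection policy of the device
esxcli storage nmp device set -d naa.6005076303ffc08c0000000000001000 --psp VMW_PSP_RR   #switch the device to Round Robin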

Figure 11-7 shows how the preferred path policy can be checked from the vSphere Client.

Figure 11-7 Manage Paths window in the vSphere Client

By using the Fixed multipathing policy, you can implement static load balancing if several
LUNs are attached to the VMware ESXi Server. The multipathing policy is set on a per LUN
basis, and the preferred path is chosen for each LUN. If VMware ESXi Server is connected
over four paths to its DS8000 storage system, spread the preferred paths over all four
available physical paths.

Important: Before zoning your VMware host to a DS8000 storage system, ensure that each LUN has at least two paths (for redundancy). Because of VMware limitations, the total number of paths on a host is 1024, the number of paths to a single LUN is limited to 32, and the maximum number of LUNs per host is 256. The maximum size of a LUN is 64 TB. So, plan the number of paths available to each VMware host carefully to avoid future problems with provisioning to your VMware hosts.

For example, when you want to configure four LUNs, assign the preferred path of LUN0
through the first path, the one for LUN1 through the second path, the preferred path for LUN2
through the third path, and the one for LUN3 through the fourth path. With this method, you
can spread the throughput over all physical paths in the SAN fabric. Thus, this method results
in optimized performance for the physical connections between the VMware ESXi Server and
the DS8000 storage system.



If the workload varies greatly between the accessed LUNs, it might be a good approach to
monitor the performance on the paths and adjust the configuration according to the workload.
It might be necessary to assign one path as preferred to only one LUN with a high workload
but to share another path as preferred between five separate LUNs that show moderate
workloads. This static load balancing works only if all paths are available. When one path
fails, all LUNs that selected this failing path as preferred fail over to another path and put
additional workload onto those paths. Furthermore, there is no capability to influence the
failover algorithm to which path the failover occurs.

When the active path fails, for example, because of a physical path failure, I/O might pause for
about 30 - 60 seconds until the FC driver determines that the link is down and fails over to one
of the remaining paths. This behavior can cause the VMDKs that are used by the operating
systems of the VMs to appear unresponsive. After failover is complete, I/O resumes normally.
The timeout value for detecting a failed link can be adjusted; it is set in the HBA BIOS or
driver and the way to set this option depends on the HBA hardware and vendor. The typical
failover timeout value is 30 seconds. With VMware ESXi, you can adjust this value by editing
the device driver options for the installed HBAs in /etc/vmware/esx.conf.

Additionally, you can increase the standard disk timeout value in the VM operating system to ensure that the operating system is not disrupted and does not log permanent errors during the failover phase. The appropriate timeout value depends on the operating system that is used and on the amount of queuing that is expected on the remaining paths after a failure; see the appropriate technical documentation for details.
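
As an illustration only, a Windows guest keeps its disk timeout in the registry, and a Linux guest exposes it per SCSI disk in sysfs. The value of 60 seconds and the device name sdb are examples; VMware Tools typically adjusts the Windows value automatically.

In a Windows guest (run from an elevated command prompt):
reg add HKLM\SYSTEM\CurrentControlSet\Services\Disk /v TimeoutValue /t REG_DWORD /d 60 /f

In a Linux guest:
echo 60 > /sys/block/sdb/device/timeout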

11.5 Performance monitoring tools


This section reviews the performance monitoring tools that are available with VMware ESXi.

11.5.1 Virtual Center performance statistics


Virtual Center (VC) is the entry point for virtual platform management in VMware ESXi. It also
includes a module to view and analyze performance statistics and counters. The VC
performance counters collection is reduced by default to a minimum level, but you can modify
the settings to allow a detailed analysis.

VC includes real-time performance counters that display the past hour (which is not archived),
and archived statistics that are stored in a database. The real-time statistics are collected
every 20 seconds and presented in the vSphere Client for the past 60 minutes (Figure 11-8
on page 373).



Figure 11-8 Real-time performance statistics in the vSphere Client

These real-time counters are also the basis for the archived statistics, but to avoid too much
performance database expansion, the granularity is recalculated according to the age of the
performance counters. VC collects those real-time counters, aggregates them for a data point
every 5 minutes, and stores them as past-day statistics in the database. After one day, these
counters are aggregated once more to a 30-minute interval for the past week statistics. For
the past month, a data point is available every 2 hours, and for the last year, one data point is stored per day.

In general, the VC statistics are a good basis to get an overview about the performance
statistics and to further analyze performance counters over a longer period, for example,
several days or weeks. If a granularity of 20-second intervals is sufficient for your individual
performance monitoring perspective, VC can be a good data source after configuration. You
can obtain more information about how to use the VC Performance Statistics at this website:
http://communities.vmware.com/docs/DOC-5230
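
If you prefer to pull these counters programmatically, the VMware PowerCLI Get-Stat cmdlet can export them for offline analysis. This is only a sketch, which assumes that PowerCLI is installed; the vCenter name, host name, and output path are placeholders:

Connect-VIServer vcenter.example.com        # log on to the vCenter server
$esx = Get-VMHost esx01.example.com         # select the ESXi host to query
# collect the most recent real-time disk counters and export them to a CSV file
Get-Stat -Entity $esx -Disk -Realtime -MaxSamples 60 | Export-Csv C:\temp\esx01_disk_stats.csv -NoTypeInformation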

11.5.2 Performance monitoring with esxtop


The esxtop command-line tool provides the finest granularity among the performance
counters available within VMware ESXi Server. The tool is available on the VMware ESXi
Server service console, and you must have root privileges to use the tool. The esxtop
command is available for use either in Interactive Mode or in Batch Mode. When using
Interactive Mode, the performance statistics are displayed inside the command-line console.
With Batch Mode, you collect and save the performance counters in a file. The esxtop utility
reads its default configuration from a file called ~/.esxtop3rc. The preferred way to configure
this default configuration to fit your needs is to change and adjust it for a running esxtop
process and then save this configuration by using the W interactive command.



Example 11-2 illustrates the basic adjustments that are required to monitor disk performance
on a SAN-attached storage device.

Example 11-2 Esxtop basic adjustment for monitoring disk performance


esxtop #starts esxtop in Interactive Mode
PCPU(%): 3.27, 3.13, 2.71, 2.66 ; used total: 2.94
7 7 drivers 16 0.01 0.01 0.00 1571.25 0.00 0.00 0.00 0.00 0.00
8 8 vmotion 1 0.00 0.00 0.00 98.20 0.00 0.00 0.00 0.00 0.00
9 9 console 1 0.85 0.84 0.01 97.04 0.32 97.03 0.06 0.00 0.00
21 21 vmware-vmkauthd 1 0.00 0.00 0.00 98.20 0.00 0.00 0.00 0.00 0.00
22 22 Virtual Center 12 6.79 6.81 0.00 1170.05 1.60 383.57 0.20 0.00 0.00
31 31 Windows1 6 0.86 0.86 0.00 588.24 0.14 97.37 0.13 0.00 0.00
32 32 Windows2 6 0.58 0.58 0.00 588.54 0.09 97.84 0.05 0.00 0.00
33 33 Windows3 6 0.60 0.61 0.00 588.52 0.09 97.95 0.08 0.00 0.00
34 34 Linux 1 5 1.40 1.41 0.00 489.17 0.46 96.56 0.22 0.00 0.00

d #changes to disk storage utilization panels


e vmhba2 #selects expanded display of vmhba2
a vmhba2:0 #selects expanded display of SCSI channel 0
t vmhba2:0:0 #selects expanded display mode of SCSI target 0
W #writes the current configuration into ~/.esctop3rc file

After this initial configuration, the performance counters are displayed as shown in
Example 11-3.

Example 11-3 Disk performance metrics in esxtop


1:25:39pm up 12 days 23:37, 86 worlds; CPU load average: 0.36, 0.14, 0.17

ADAPTR CID TID LID NCHNS NTGTS NLUNS NVMS AQLEN LQLEN WQLEN ACTV %USD LOAD CMDS/s READS/s WRITES/s MBREAD/s
vmhba1 - - - 2 1 1 32 238 0 0 - - - 4.11 0.20 3.91 0.00
vmhba2 0 0 0 1 1 1 10 4096 32 0 8 25 0.25 25369.69 25369.30 0.39 198.19
vmhba2 0 0 1 1 1 1 10 4096 32 0 0 0 0.00 0.00 0.00 0.00 0.00
vmhba2 0 0 2 1 1 1 10 4096 32 0 0 0 0.00 0.39 0.00 0.39 0.00
vmhba2 0 0 3 1 1 1 9 4096 32 0 0 0 0.00 0.39 0.00 0.39 0.00
vmhba2 0 0 4 1 1 1 4 4096 32 0 0 0 0.00 0.00 0.00 0.00 0.00
vmhba2 0 0 5 1 1 1 17 4096 32 0 0 0 0.00 0.00 0.00 0.00 0.00
vmhba2 0 0 6 1 1 1 4 4096 32 0 0 0 0.00 0.00 0.00 0.00 0.00
vmhba2 0 0 7 1 1 1 4 4096 32 0 0 0 0.00 0.00 0.00 0.00 0.00
vmhba2 0 0 8 1 1 1 4 4096 32 0 0 0 0.00 0.00 0.00 0.00 0.00
vmhba2 0 0 9 1 1 1 4 4096 32 0 0 0 0.00 0.00 0.00 0.00 0.00
vmhba2 0 1 - 1 1 10 76 4096 0 0 - - - 0.00 0.00 0.00 0.00
vmhba3 - - - 1 2 4 16 4096 0 0 - - - 0.00 0.00 0.00 0.00
vmhba4 - - - 1 2 4 16 4096 0 0 - - - 0.00 0.00 0.00 0.00
vmhba5 - - - 1 2 20 152 4096 0 0 - - - 0.78 0.00 0.78 0.00

Additionally, you can change the field order and select or clear various performance counters
in the view. The minimum refresh rate is 2 seconds, and the default setting is 5 seconds.

When you use esxtop in Batch Mode, always include all of the counters by using the -a
option. To collect the performance counters every 10 seconds for 100 iterations and save
them to a file, run esxtop this way:
esxtop -b -a -d 10 -n 100 > perf_counters.csv

For more information about how to use esxtop and other tools, see vSphere Resource
Management Guide, found at:
http://www.vmware.com/pdf/vsphere4/r40/vsp_40_resource_mgmt.pdf

Each VMware version has its own particularities and some might have performance analysis
tools that are not part of the older versions.



11.5.3 Guest-based performance monitoring
Because the operating systems that run in the VMs host the applications that perform the
host workload, it makes sense to use performance monitoring in these operating systems as
well. The tools that you use are described in Chapter 10, “Performance considerations for
Microsoft Windows servers” on page 335 and Chapter 12, “Performance considerations for
Linux” on page 385.

The guest operating system is unaware of the underlying VMware ESXi virtualization layer, so
any performance data captured inside the VMs can be misleading and must be analyzed and
interpreted only with the actual configuration and performance data gathered in VMware ESXi
Server or on a disk or SAN layer.

There is one additional benefit of the Windows Performance Monitor perfmon (see 10.8.2,
“Windows Performance console (perfmon)” on page 350). When you use esxtop in Batch
Mode with the -a option, it collects all available performance counters and thus the collected
comma-separated values (CSV) data gets large and cannot be easily parsed. Perfmon can
help you to analyze quickly results or to reduce the amount of CSV data to a subset of
counters that can be analyzed more easily by using other utilities. You can obtain more
information about importing the esxtop CSV output into perfmon by going to the following
website:
http://communities.vmware.com/docs/DOC-5100
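
One way to reduce the batch output before importing it into perfmon is the Windows relog utility, which can filter a collected CSV file down to a subset of counters. The file names are examples; counters.txt contains one counter path per line:

relog esxtop_output.csv -cf counters.txt -f csv -o esxtop_reduced.csv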

11.5.4 VMware specific tuning for maximum performance


Because of the special VMware ESXi Server setup and the additional virtualization layer that
is implemented in VMware ESXi Server, it is necessary to focus on additional topics and
configuration options when you tune VMware ESXi. This section focuses on important points
about tuning the VMware ESXi Server with attached DS8000 storage to achieve maximum
performance.

11.5.5 Workload spreading


Spread the I/O workload across the available hardware. This method is the most effective way
to avoid any hardware limitations of either the HBA, processor complex, device adapter (DA),
or disk drives that negatively affect the potential performance of your VMware ESXi Server.

It is also important to identify and separate specific workloads because they can negatively
influence other workloads that might be more business critical.

Within VMware ESXi Server, it is not possible to configure striping over several LUNs for one
datastore. It is possible to add more than one LUN to a datastore, but adding more than one
LUN to a datastore only extends the available amount of storage by concatenating one or
more additional LUNs without balancing the data over the available logical volumes.

The easiest way to implement striping over several hardware resources is to use storage pool
striping in extent pools (see 4.8, “Planning extent pools” on page 103) of the attached
DS8000 storage system.

The only other possibility to achieve striping at the VM level is to configure several VMDKs for
a VM that are on different hardware resources, such as different HBAs, DAs, or servers, and
then configure striping of those VMDKs within the host operating system layer.



Use storage pool striping because the striping must be implemented only one time when
configuring the DS8000 storage system. Implementing the striping on the host operating
system level requires you to configure it for each of the VMs separately. Furthermore,
according to the VMware documentation, host-based striping is supported only by using
striping within Windows dynamic disks.

For performance monitoring purposes, be careful with spanned volumes or even avoid these
configurations. When configuring more than one LUN to a VMFS datastore, the volume space
is spanned across multiple LUNs, which can cause an imbalance in the utilization of those
LUNs. If several VMDKs are initially configured within a datastore and the disks are mapped
to different VMs, it is no longer possible to identify in which area of the configured LUNs the
data of each VM is allocated. Thus, it is no longer possible to pinpoint which host workload
causes a possible performance problem.

In summary, avoid using spanned volumes and configure your systems with only one LUN per
datastore.

11.5.6 Virtual machines sharing the LUN


The SCSI protocol allows multiple commands to be active for the same LUN at one time. A
configurable parameter that is called LUN queue depth determines how many commands can
be active at one time for a certain LUN. This queue depth parameter is handled by the SCSI
driver for a specific HBA. Depending on the HBA type, you can configure up to 255
outstanding commands for a QLogic HBA, and Emulex supports up to 128. The default value
for both vendors is 32.

For more information, see the following VMware knowledge base article:
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC
&externalId=1267

If a VM generates more commands to a LUN than the LUN queue depth, these additional
commands are queued in the ESXi kernel, which increases the latency. The queue depth is
defined on a per LUN basis, not per initiator. An HBA (SCSI initiator) supports many more
outstanding commands.

For VMware ESXi Server, if two VMs access their VMDKs on two different LUNs, each VM
can generate as many active commands as the LUN queue depth. But if those two VMs have
their VMDKs on the same LUN (within the same VMFS volume), the total number of active
commands that the two VMs combined can generate without queuing I/Os in the ESXi kernel
is equal to the LUN queue depth. Therefore, when several VMs share a LUN, the maximum
number of outstanding commands to that LUN from all those VMs together must not exceed
the LUN queue depth.

Within VMware ESXi Server, there is a configuration parameter


Disk.SchedNumReqOutstanding, which can be configured from the VC. If the total number of
outstanding commands from all VMs for a specific LUN exceeds this parameter, the
remaining commands are queued to the ESXi kernel. This parameter must always be set at
the same value as the queue depth for the HBA.
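
On ESXi 5.5 and later, this setting is applied per device rather than globally. The following sketch, with a placeholder naa identifier, shows how to display and adjust it from the ESXi shell; keep the value equal to the LUN queue depth of the HBA driver:

esxcli storage core device list -d naa.6005076303ffc08c0000000000001000   #show the device maximum queue depth and the current outstanding I/O setting
esxcli storage core device set -d naa.6005076303ffc08c0000000000001000 -O 32   #set the number of outstanding I/Os with competing worlds to 32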

To reduce latency, it is important to ensure that the sum of active commands from all VMs of
an VMware ESXi Server does not frequently exceed the LUN queue depth. If the LUN queue
depth is exceeded regularly, you might either increase the queue depth or move the VMDKs
of a few VMs to different VMFS volumes. Therefore, you lower the number of VMs that access
a single LUN. The maximum LUN queue depth per VMware ESXi Server must not exceed 64.
The maximum LUN queue depth per VMware ESXi Server can be up to 128 only when a
server has exclusive access to a LUN.



VMFS is a file system for clustered environments, and it uses SCSI reservations during
administrative operations, such as creating or deleting VMDKs or extending VMFS volumes.
A reservation ensures that at a specific time a LUN is available only to one VMware ESXi
Server. These SCSI reservations are used for administrative tasks that require only a
metadata update.

To avoid SCSI reservation conflicts in a production environment with several VMware ESXi
Servers that access shared LUNs, it might be helpful to perform those administrative tasks at
off-peak hours. If this approach is not possible, perform the administrative tasks from an
VMware ESXi Server that also hosts I/O-intensive VMs, which are less affected because the
SCSI reservation is set on the SCSI initiator level, which means for the complete VMware
ESXi Server.

The maximum number of VMs that can share the LUN depends on several conditions. VMs
with heavy I/O activity result in a smaller number of possible VMs per LUN. Additionally, you
must consider the already described LUN queue depth limits per VMware ESXi Server and
the storage system-specific limits.

11.5.7 ESXi file system considerations


VMware ESXi Server offers two possibilities to manage VMDKs: VMFS and RDM. VMFS is a
clustered file system that allows concurrent access by multiple hosts. RDM is implemented as
a proxy for a raw physical device. It uses a mapping file that contains metadata, and all disk
traffic is redirected to the physical device. RDM can be accessed only by one VM.

RDM offers two configuration modes: virtual compatibility mode and physical compatibility
mode. When you use physical compatibility mode, all SCSI commands toward the VMDK are
passed directly to the device, which means that all physical characteristics of the underlying
hardware become apparent. Within virtual compatibility mode, the VMDK is mapped as a file
within a VMFS volume, which allows advanced file locking support and the use of snapshots.



Figure 11-9 compares both possible RDM configuration modes and VMFS.

Figure 11-9 Comparison of RDM virtual and physical modes with VMFS

The implementations of VMFS and RDM imply a possible impact on the performance of the
VMDKs; therefore, all three possible implementations are tested together with the DS8000
storage system. This section summarizes the outcome of those performance tests.

In general, the file system selection affects the performance in a limited manner:
򐂰 For random workloads, the measured throughput is almost equal between VMFS, RDM
physical, and RDM virtual. Only for read requests of 32 KB, 64 KB, and 128 KB transfer
sizes, both RDM implementations show a slight performance advantage (Figure 11-10 on
page 379).
򐂰 For sequential workloads, a slight performance advantage was verified at all transfer sizes for both RDM implementations over VMFS. For all sequential write and
certain read requests, the measured throughput for RDM virtual was slightly higher than
for RDM physical mode. This difference might be caused by an additional caching of data
within the virtualization layer, which is not used in RDM physical mode (Figure 11-11 on
page 379).



Figure 11-10 Result of random workload test for VMFS, RDM physical, and RDM virtual

Performance data varies: The performance data in Figure 11-10 and Figure 11-11 was
obtained in a controlled, isolated environment at a specific point by using the
configurations, hardware, and software levels available at that time. Actual results that
might be obtained in other operating environments can vary. There is no guarantee that the
same or similar results can be obtained elsewhere. The data is intended to help illustrate
only how different technologies behave in relation to each other.

Figure 11-11 Result of sequential workload test for VMFS, RDM physical, and RDM virtual (read and write throughput in MBps over transfer sizes of 4 KB - 128 KB)



The choice between the available file systems, VMFS and RDM, has a limited influence on
the data performance of the VMs. These tests verified a possible performance increase of
about 2 - 3%.

11.5.8 Aligning partitions


In a RAID array, the smallest hardware unit that is used to build a logical volume or LUN is
called a stripe. These stripes are distributed onto several physical drives in the array
according to the RAID algorithm that is used. Usually, stripe sizes are much larger than
sectors. For the DS8000 storage system in this example, we use a 256 KB stripe size for
RAID 5 and RAID 10 and 192 KB for RAID 6 in an Open Systems attachment. Thus, a SCSI
request that intends to read a single sector in reality reads one stripe from disk.

When using VMware ESXi, each VMFS datastore segments the allocated LUN into blocks,
which can be 1 - 8 MB. The file system that is used by the VM operating system optimizes I/O
by grouping several sectors into one cluster. The cluster size usually is in the range of several
KB.

If the VM operating system reads a single cluster from its VMDK, at least one block (within
VMFS) and all the corresponding stripes on physical disk must be read. Depending on the
sizes and the starting sector of the clusters, blocks, and stripes, reading one cluster might
require reading two blocks and all of the corresponding stripes. Figure 11-12 illustrates that in
an unaligned structure, a single I/O request can cause additional I/O operations. Thus, an
unaligned partition setup results in additional I/O that incurs a penalty on throughput and
latency and leads to lower performance for the host data traffic.

Figure 11-12 Processing of a data request in an unaligned structure

An aligned partition setup ensures that a single I/O request results in a minimum number of
physical disk I/Os, eliminating the additional disk operations and resulting in an overall
performance improvement.



Operating systems that use the x86 architecture create partitions with a master boot record
(MBR) of 63 sectors. This design is a relic of older BIOS code from personal computers
that used cylinder, head, and sector addressing instead of Logical Block Addressing (LBA).
The first track is always reserved for the MBR, and the first partition starts at the second track
(cylinder 0, head 1, and sector 1), which is sector 63 in LBA. Also, in current operating
systems, the first 63 sectors cannot be used for data partitions. The first possible start sector
for a partition is 63.

In a VMware ESXi environment, because of the additional virtualization layer that is


implemented by ESXi, this partition alignment must be performed for both layers: VMFS and
the host file systems. Because of that additional layer, the use of correctly aligned partitions is
considered to have even a higher performance effect than in the usual host setups without an
additional virtualization layer. Figure 11-13 shows how a single I/O request is fulfilled within
an aligned setup without causing additional physical disk I/O.

Figure 11-13 Processing a data request in an aligned structure

Partition alignment is a known issue in file systems, but its effect on performance is somewhat controversial. In performance lab tests, it turned out that in general all workloads show a slight
increase in throughput when the partitions are aligned. A significant effect can be verified only
on sequential workloads. Starting with transfer sizes of 32 KB and larger in this example, we
recognized performance improvements of up to 15%.

In general, aligning partitions can improve the overall performance. For random workloads in
this example, we identified only a slight effect. For sequential workloads, a possible
performance gain of about 10% seems to be realistic. So, as a preferred practice, align
partitions especially for sequential workload characteristics.

Aligning partitions within a VMware ESXi Server environment requires two steps. First, the
VMFS partition must be aligned. Then, the partitions within the VMware guest system file
systems must be aligned for maximum effectiveness.



You can align the VMFS partition only when configuring a new datastore. When using the
vSphere Client, the new partition is automatically configured to an offset of 128 sectors = 64
KB. But, this configuration is not ideal when using the DS8000 storage system. Because the
DS8000 storage system uses larger stripe sizes, the offset must be configured to at least the
stripe size. For RAID 5 and RAID 10 in Open Systems attachments, the stripe size is 256 KB,
and it is a good approach to set the offset to 256 KB (or 512 sectors). You can configure an
individual offset only from the VMware ESXi Server command line.

Example 11-4 shows how to create an aligned partition with an offset of 512 by using fdisk.

Example 11-4 Create an aligned VMFS partition by using fdisk


fdisk /dev/sdf #invoke fdisk for /dev/sdf
Device contains neither a valid DOS partition table, nor Sun, SGI or OSF disklabel
Building a new DOS disklabel. Changes will remain in memory only,
until you decide to write them. After that, of course, the previous
content won't be recoverable.

The number of cylinders for this disk is set to 61440.


There is nothing wrong with that, but this is larger than 1024,
and could in certain setups cause problems with:
1) software that runs at boot time (e.g., old versions of LILO)
2) booting and partitioning software from other OSs
(e.g., DOS FDISK, OS/2 FDISK)
Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite)

Command (m for help):


n #create a new partition
Command action
e extended
p primary partition (1-4)
p
Partition number (1-4): 1
First cylinder (1-61440, default 1):
Using default value 1
Last cylinder or +size or +sizeM or +sizeK (1-61440, default 61440):
Using default value 61440

Command (m for help): t #set partitions system id


Selected partition 1
Hex code (type L to list codes): fb #fb = VWware VMFS volume
Changed system type of partition 1 to fb (Unknown)

Command (m for help): x #enter expert mode

Expert command (m for help): b #set starting block number


Partition number (1-4): 1
New beginning of data (32-125829119, default 32): 512 #partition offset set to 512

Expert command (m for help): w #save changes


The partition table has been altered!

Calling ioctl() to re-read partition table.


Syncing disks.



fdisk -lu /dev/sdf #check the partition config

Disk /dev/sdf: 64.4 GB, 64424509440 bytes


64 heads, 32 sectors/track, 61440 cylinders, total 125829120 sectors
Units = sectors of 1 * 512 = 512 bytes

Device Boot Start End Blocks Id System


/dev/sdf1 512 125829119 62914304 fb Unknown

Then, you must create a VMFS file system within the aligned partition by using the
vmkfstools command, as shown in Example 11-5.

Example 11-5 Create a VMFS volume by using vmkfstools


vmkfstools -C vmfs3 -b 1m -S LUN0 vmhba2:0:0:1
Creating vmfs3 file system on "vmhba2:0:0:1" with blockSize 1048576 and volume
label "LUN0".
Successfully created new volume: 490a0a3b-cabf436e-bf22-001a646677d8

As a last step, all the partitions at the VM level must be aligned as well. This task must be
performed from the operating system of each VM by using the available tools. For example,
for Windows, use the diskpart utility, as shown in Example 11-6. You can use Windows to
align basic partitions only, and the offset size is set in KB (not in sectors).

Example 11-6 Create an aligned NTFS partition by using diskpart


DISKPART> create partition primary align=256

DiskPart succeeded in creating the specified partition.

DISKPART> list partition

Partition ### Type Size Offset


------------- ---------------- ------- -------
* Partition 1 Primary 59 GB 256 KB

DISKPART>

For more information about aligning VMFS partitions and the performance effects, see
Performance Best Practices for VMware vSphere 6.0, found at:
http://www.vmware.com/files/pdf/techpaper/VMware-PerfBest-Practices-vSphere6-0.pdf



Chapter 12. Performance considerations for Linux

This chapter describes the monitoring and tuning tools and techniques that can be used with
Linux systems to optimize throughput and performance when attaching the DS8000 storage
system.

This chapter also describes the supported distributions of Linux when you use the DS8000
storage system, and the tools that can be helpful for the monitoring and tuning activities:
򐂰 Linux disk I/O architecture
򐂰 Host bus adapter (HBA) considerations
򐂰 Multipathing
򐂰 Logical Volume Manager (LVM)
򐂰 Disk I/O schedulers
򐂰 File system considerations



12.1 Supported platforms and distributions
Linux is the only operating system that is available for almost all hardware platforms. For
DS8000 attachment, IBM supports Linux on following platforms:
򐂰 On x86-based servers in 32-bit and 64-bit mode
򐂰 On IBM Power Systems in 32-bit and 64-bit mode
򐂰 In an IBM z Systems Logical Partition (LPAR) in 64-bit mode
򐂰 In a virtual machine (VM) under z/VM on IBM z Systems in 64-bit mode
򐂰 In a VM under VMware ESX

IBM supports the following major Linux distributions:


򐂰 Red Hat Enterprise Linux (RHEL)
򐂰 SUSE Linux Enterprise Server

For further clarification and the most current information about supported Linux distributions
and hardware prerequisites, see the IBM System Storage Interoperation Center (SSIC)
website:
http://www.ibm.com/systems/support/storage/ssic/interoperability.wss

This chapter introduces the relevant logical configuration concepts that are needed to attach
Linux operating systems to a DS8000 storage system and focuses on performance relevant
configuration and measuring options. For more information about hardware-specific Linux
implementation and general performance considerations about the hardware setup, see the
following documentation:
򐂰 For a general Linux implementation overview:
– The IBM developerWorks website for Linux, including a technical library:
http://www.ibm.com/developerworks/linux
– Anatomy of the Linux kernel:
http://www.ibm.com/developerworks/library/l-linux-kernel/
– SUSE Linux Enterprise Server documentation
https://www.suse.com/documentation/sles-12/
– RHEL documentation
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/ind
ex.html
򐂰 For x86-based architectures:
Tuning IBM System x Servers for Performance, SG24-5287
򐂰 For System p hardware:
– Performance Optimization and Tuning Techniques for IBM Power Systems Processors
Including IBM POWER8, SG24-8171
– POWER7 and POWER7+ Optimization and Tuning Guide, SG24-8079
򐂰 For z Systems hardware:
– Set up Linux on IBM System z for Production, SG24-8137
– Linux on IBM System z: Performance Measurement and Tuning, SG24-6926



12.2 Linux disk I/O architecture
Before describing relevant disk I/O-related performance topics, this section briefly introduces
the Linux disk I/O architecture. It describes the Linux disk I/O subsystem so that you can
better understand the components that have a major effect on system performance.

The architecture that is described applies to Open Systems servers attached to the DS8000
storage system by using the Fibre Channel Protocol (FCP). For Linux on z Systems with
extended count key data (ECKD) (Fibre Channel connection (FICON) attached) volumes, a
different disk I/O setup applies. For more information about disk I/O setup and configuration
for z Systems, see Linux for IBM System z9 and IBM zSeries, SG24-6694.

12.2.1 I/O subsystem architecture


Figure 12-1 illustrates the I/O subsystem architecture. This sequence is simplified: Storage
configurations that use additional virtualization layers and SAN attachment require additional
operations and layers, such as in the DS8000 storage system.

Figure 12-1 I/O subsystem architecture

For a quick overview of overall I/O subsystem operations, we use an example of writing data
to a disk. The following sequence outlines the fundamental operations that occur when a
disk-write operation is performed, assuming that the file data is on sectors on disk platters,
that it was read, and is on the page cache:
1. A process requests to write a file through the write() system call.
2. The kernel updates the page cache mapped to the file.



3. The kernel initiates flushing the page cache to disk.
4. The file system layer puts each block buffer together in a block I/O operation (bio)
structure (see 12.2.3, “Block layer” on page 390) and submits a write request to the block
device layer.
5. The block device layer gets requests from upper layers and performs an I/O elevator
operation and puts the requests into the I/O request queue.
6. A device driver, such as Small Computer System Interface (SCSI) or other device-specific
drivers, takes care of the write operation.
7. A disk device firmware performs hardware operations, such as seek head, rotation, and
data transfer to the sector on the platter.

For more information, see the IBM developerWorks article Anatomy of the Linux kernel, found
at:
http://www.ibm.com/developerworks/library/l-linux-kernel/

12.2.2 Cache and locality of reference


Achieving a high cache hit rate is the key for performance improvement. In Linux, the
technique that is called locality of reference is used. This technique is based on the following
principles:
򐂰 The most recently used data has a high probability of being used again in the near future
(temporal locality).
򐂰 The data that is close to the data, which was used, has a high probability of being used
(spatial locality).

Figure 12-2 on page 389 illustrates this principle.



Figure 12-2 Locality of reference (temporal locality: repeated access to the same data; spatial locality: access to adjacent data)

Linux uses this principle in many components, such as page cache, file object cache (i-node
cache and directory entry cache), and read ahead buffer.

Flushing a dirty buffer


When a process reads data from disk, the data is copied to memory. The requesting process
and other processes can retrieve the same data from the copy of the data cached in memory.
When a process starts changing the data, it is changed in memory first. At this time, the data
on disk and the data in memory are not identical, and the data in memory is referred to as a
dirty buffer. The dirty buffer must be synchronized to the data on disk as soon as possible;
otherwise, the data in memory might be lost if a sudden crash occurs.

The synchronization process for a dirty buffer is called a flush. The flush occurs on a regular
basis and when the proportion of dirty buffers in memory exceeds a certain threshold. The
threshold is configurable in the /proc/sys/vm/dirty_background_ratio file.

The operating system synchronizes the data regularly, but with large amounts of system
memory, it might keep updated data for several days. Such a delay is unsafe for the data in a
failure situation. To avoid this situation, use the sync command. When you run the sync
command, all changes and records are updated on the disks and all buffers are cleared.
Periodic usage of sync is necessary in transaction processing environments that frequently
update the same data set file, which is intended to stay in memory. Data synchronization can
be set to frequent updates automatically, but the sync command is useful in situations when
large data movements are required and copy functions are involved.

Important: It is a preferred practice to run the sync command after data synchronization in
an application before issuing a FlashCopy operation.
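
As a minimal sketch (the commands and the /proc path are standard on current RHEL and
SUSE Linux Enterprise Server systems; the threshold value shown is only an example), you
can inspect and temporarily adjust the background flush threshold, and force a flush, as
follows:

cat /proc/sys/vm/dirty_background_ratio    # show the current background flush threshold (percent)
sysctl -w vm.dirty_background_ratio=5      # temporarily start background flushing earlier
sync                                       # write all dirty buffers to disk now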



12.2.3 Block layer
The block layer handles all the activity that is related to block device operation (see
Figure 12-1 on page 387). The key data structure in the block layer is the bio structure. The
bio structure is an interface between the file system layer and the block layer. When the
kernel, in the form of a file system, the virtual memory subsystem, or a system call, decides
that a set of blocks must be transferred to or from a block I/O device, it puts together a bio
structure to describe that operation. That structure is then handed to the block I/O code,
which merges it into an existing request structure or, if needed, creates one. The bio structure
contains everything that a block driver needs to perform the request without reference to the
user-space process that caused that request to be initiated.

When a write is performed, the file system layer first writes to the page cache, which is made
up of block buffers. It creates a bio structure by putting the contiguous blocks together and
then sends the bio to the block layer (see Figure 12-1 on page 387).

The block layer handles the bio request and links these requests into a queue called the I/O
request queue. This linking operation is called I/O elevator or I/O scheduler. The Linux
kernel 2.4 used a single, general-purpose I/O elevator, but since Linux kernel 2.6, four types
of I/O elevator algorithms are available. Because the Linux operating system can be used for
a wide range of tasks, both I/O devices and workload characteristics change significantly. A
notebook computer probably has different I/O requirements than a 10,000 user database
system. To accommodate these differences, four I/O elevators are available. For more
information about I/O elevator implementation and tuning, see 12.3.4, “Tuning the disk I/O
scheduler” on page 397.

12.2.4 I/O device driver


The Linux kernel takes control of devices by using a device driver. The device driver is a
separate kernel module and is provided for each device (or group of devices) to make the
device available for the Linux operating system. After the device driver is loaded, it runs as a
part of the Linux kernel and takes full control of the device. This section describes SCSI
device drivers.

SCSI
SCSI is the most commonly used I/O interface, especially in the enterprise server
environment. The FCP transports SCSI commands over Fibre Channel networks. In Linux
kernel implementations, SCSI devices are controlled by device driver modules. They consist
of the following types of modules (Figure 12-3 on page 391):
򐂰 The upper layer consists of specific device type drivers that are closest to user-space,
such as the disk and tape drivers: st (SCSI Tape) and sg (SCSI generic device).
򐂰 Middle level driver: scsi_mod
It implements SCSI protocol and common SCSI functions.
򐂰 The lower layer consists of drivers, such as the Fibre Channel HBA drivers, which are
closest to the hardware. They provide lower-level access to each device. A low-level driver
is specific to a hardware device and is provided for each device, for example, ips for the
IBM ServeRAID controller, qla2300 for the QLogic HBA, and mptscsih for the LSI Logic
SCSI controller.
򐂰 Pseudo driver: ide-scsi
It is used for IDE-SCSI emulation.



Figure 12-3 Structure of SCSI drivers (layered from the user-space process down to the device: upper level drivers such as sg, st, sd_mod, and sr_mod; the mid level driver scsi_mod; and low level drivers such as mptscsih, ips, and qla2300)

Device-specific functions must be implemented both in the device firmware and in the
low-level device driver. Which functions are supported depends on the hardware and on the
device driver version that you use. Specific functions are tuned through device driver
parameters.

12.3 Specific configuration for storage performance


Many specific parameters influence the whole system performance and the performance for a
specific application, which also applies to Linux systems. The focus of this chapter is disk I/O
performance only, so the influence of processor usage, memory usage, and paging, and
specific performance tuning possibilities for these areas are not described.

For more information about performance and tuning recommendations, see Linux
Performance and Tuning Guidelines, REDP-4285.

12.3.1 Host bus adapter for Linux


IBM supports several HBAs in many possible configurations. To confirm that a specific HBA is
supported by IBM, check the current information in the SSIC:
http://www.ibm.com/systems/support/storage/ssic/interoperability.wss

In the SSIC web application, describe your target configuration in as much detail as possible.
Click Submit and Show Details to display the supported HBA BIOS and driver versions.

To configure the HBA correctly, see the IBM DS8000: Host Systems Attachment Guide,
GC27-4210.

IBM System Storage DS8000: Host Attachment and Interoperability, SG24-8887, has detailed
procedures and suggested settings.

Also, check the readme files and manuals of the BIOS, HBA, and driver. The Emulex and
QLogic Fibre Channel device driver documentation is available at the following websites:
򐂰 http://www.emulex.com/downloads
򐂰 http://driverdownloads.qlogic.com/QLogicDriverDownloads_UI

In current Linux distributions, the HBA driver is loaded automatically. However, you can
configure several driver parameters. The list of available parameters depends on the specific
HBA type and driver implementation. If these settings are not configured correctly,
performance might suffer or the system might not work correctly.



Queue depth parameter settings
Several HBA parameters can impact I/O performance: the queue depth parameters and
timeout and retry parameters regarding I/O errors.

The queue depth parameter specifies how many SCSI commands can be outstanding (sent
to a device but not yet confirmed as complete) at the same time. As long as the number of
outstanding commands for a device (disk or FC adapter) is below the queue depth, the driver
can send further commands or I/O requests without waiting for earlier ones to complete.

By changing the queue depth, you can queue more outstanding I/Os at the adapter or disk
level, which can have a positive effect on throughput in certain configurations. However,
increasing the queue depth is not always advisable because it can also slow performance or
cause delays, depending on the actual configuration. Therefore, check the complete setup
carefully before adjusting the queue depth. Increasing the queue depth can be beneficial for
sequential large-block write workloads and for some sequential read workloads. Random
workloads do not benefit much from increased queue depth values. You might notice only a
slight improvement in performance after the queue depth is increased; the improvement
might be greater if other methods of optimization are used as well.

You can configure each parameter as either temporary or persistent. For temporary
configurations, you can use the modprobe command. Persistent configuration is performed by
editing the appropriate configuration file (based on distribution):
򐂰 /etc/modprobe.d/lpfc.conf for RHEL 6.x or higher
򐂰 /etc/modprobe.d/99-qlogichba.conf for SUSE Linux Enterprise Server 11 SPx or higher
򐂰 /etc/modprobe.conf for RHEL 5.x
򐂰 /etc/modprobe.conf.local for older SUSE Linux Enterprise Server releases

To set the queue depth parameter value for an Emulex adapter, add the following line to the
configuration file to set the queue depth of an Emulex HBA to 20:
options lpfc lpfc_lun_queue_depth=20

To set the queue depth parameter value for a QLogic adapter for the Linux kernel Version 3.0
(For example, SUSE Linux Enterprise Server 11 or later), create a file that is named
/etc/modprobe.d/99-qlogichba.conf that contains the following line:
options qla2xxx ql2xmaxqdepth=48 qlport_down_retry=1

If you are running on a SUSE Linux Enterprise Server 11 or later operating system, run the
mkinitrd command and then restart.
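
To verify that such a setting is effective, you can read the values back from sysfs. The
following commands are a sketch only; the device name (sdd, taken from the examples later
in this chapter) and the module parameter name depend on your adapter and device
configuration:

cat /sys/block/sdd/device/queue_depth              # effective queue depth of the SCSI device sdd
cat /sys/module/qla2xxx/parameters/ql2xmaxqdepth   # current value of the QLogic module parameter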

Queue depth monitoring


You can monitor the necessity of the queue depth increase by running the iostat -kx
command (Example 12-1 on page 393). You can find more information about the iostat tool
in “The iostat command” on page 411.



Example 12-1 Output of the iostat -kx command (output truncated)
avg-cpu: %user %nice %system %iowait %steal %idle
0,00 0,00 3,12 4,75 0,00 92,13
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
dm-0 0,00 201428,00 0,00 1582,00 0,00 809984,00 1024,00 137,88 87,54 0,63 100,00
dm-1 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00

Example 12-1 shows the output of the iostat -kx command. At first glance, the average
queue size (avgqu-sz) might look high enough to suggest increasing the queue depth
parameter. However, a closer look at the statistics shows that the service time is low and
that the counter for merged write requests is high: many write requests are merged into
fewer requests before they are sent to the adapter. Service times for write requests of less
than 1 ms indicate that writes are cached. Taking all these observations into consideration,
the queue depth setting in this example is fine.

For queue depth settings, follow these guidelines:


򐂰 The average queue size cannot be zero if any workload is present.
򐂰 Large block workloads always mean high average queue size values. So, queue size is not
the only consideration for overall performance. Also, consider other values, such as
response time.
򐂰 Read workloads typically show lower average queue size values than write workloads.
򐂰 Random small block workloads, read or write, do not benefit much from increased queue
depth parameters. It is better to keep the queue depth parameter at the suggested default
described in the manual for the HBA or multipathing driver.

High queue depth parameter values might lead to adapter overload situations, which can
cause adapter resets and loss of paths. In turn, this situation might cause adapter failover,
overload the remaining paths, and leave I/O stuck for a period. When you use DM-MP,
consider decreasing the queue depth values so that the multipathing module can react faster
to path or adapter problems and potential failures are avoided.

12.3.2 Multipathing in Linux


In Linux environments, IBM supports the Linux Device Mapper multipathing solution
(Multipath I/O (DM-MP)) for the DS8000.

For older Linux distributions (up to Red Hat Linux 4 and SUSE Linux Enterprise Server 9),
IBM supported the IBM Multipath Subsystem Device Driver (SDD). SDD is not available for
current Linux distributions.

The Multipath I/O support that is included in Linux kernel Version 2.6 or higher is based on
Device Mapper (DM), a layer for block device virtualization that also supports logical volume
management, multipathing, and software RAID.

With DM, a virtual block device is presented where blocks can be mapped to any existing
physical block device. By using the multipath module, the virtual block device can be mapped
to several paths toward the same physical target block device. DM balances the workload of
I/O operations across all available paths, detects defective links, and fails over to the
remaining links.



DM-MP provides several load-balancing algorithms for multiple paths per LUN. It is
responsible for automated path discovery and grouping, and path handling and retesting of
previously failed paths. The framework is extensible for hardware-specific functions and
additional load-balancing or failover algorithms. For more information about DM-MP, see IBM
System Storage DS8000: Host Attachment and Interoperability, SG24-8887.

For more information about supported distribution releases, kernel versions, and multipathing
software, see the IBM Subsystem Device Driver for Linux website:

https://www.ibm.com/support/docview.wss?uid=ssg1S4000107

IBM provides a device-specific configuration file for the DS8000 storage system for the
supported levels of RHEL and SUSE Linux Enterprise Server. Append the device-specific
section of this file to the /etc/multipath.conf configuration file to set default parameters for
the attached DS8000 volumes (LUNs) and create names for the multipath devices that are
managed by DM-MP (see Example 12-2). Further configuration, such as adding aliases for
certain LUNs or blacklisting specific devices, can be done manually by editing this file.

Example 12-2 Contents of the multipath.conf file (truncated)


# These are the default settings for 2107 (IBM DS8000)
# Uncomment them if needed on this system
device {
vendor "IBM"
product "2107900"
path_grouping_policy group_by_serial
}

To get the device-specific configuration file, see the following websites:


򐂰 https://www.ibm.com/support/docview.wss?uid=ssg1S4000107
򐂰 ftp://ftp.software.ibm.com/storage/subsystem/linux/dm-multipath

Using DM-MP, you can configure various path failover policies, path priorities, and failover
priorities. This type of configuration can be done individually for each device or for devices of
a certain type in the /etc/multipath.conf setup.

The multipath -ll command displays the available multipath information for disk devices, as
shown in Example 12-3.

Example 12-3 multipath -ll command output (truncated)


mpathc (36005076303ffd5aa000000000000002c) dm-2 IBM,2107900
size=10G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
|- 1:0:1:0 sdd 8:48 active ready running
`- 5:0:1:0 sdh 8:112 active ready running
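
As a sketch of such a per-device customization, you can give the first DS8000 LUN that is
shown in Example 12-3 a persistent, human-readable name by adding a multipaths section
to the /etc/multipath.conf file (the alias name is arbitrary and only used for illustration):

multipaths {
    multipath {
        wwid  36005076303ffd5aa000000000000002c
        alias ds8k_itso_002c
    }
}

After you edit the file, reload the configuration, for example, by running the reconfigure
command at the multipathd -k prompt.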

For more configuration and setup information, see the following publications:
򐂰 For SUSE Linux Enterprise Server:
– http://www.suse.com/documentation/sles11/pdfdoc/stor_admin/stor_admin.pdf
– https://www.suse.com/documentation/sles-12/pdfdoc/stor_admin/stor_admin.pdf



– https://www.suse.com/documentation/sled11/singlehtml/book_sle_tuning/book_sle_tuning.html
– https://www.suse.com/documentation/sles-12/singlehtml/book_sle_tuning/book_sle_tuning.html
򐂰 For RHEL:
– https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/DM_Multipath/
– https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/DM_Multipath/
򐂰 Considerations and comparisons between IBM SDD for Linux and DM-MP, found at:
http://www.ibm.com/support/docview.wss?uid=ssg1S7001664&rs=555

12.3.3 Logical Volume Management


LVM is the standard volume manager with SUSE Linux Enterprise Server and RHEL, the
Linux distributions that are supported by the DS8000 storage system. Starting with kernel
Version 2.6, Logical Volume Manager Version 2 (LVM 2) is available and compatible with LVM
Version 1.

This section always refers to LVM Version 2 and uses the term LVM.

Logical Volume Manager Version 2


With the use of LVM, you can configure logical extents (LEs) on multiple physical drives or
LUNs. Each LUN mapped from the DS8000 storage system is divided into one or more
physical volumes (PVs). Several of those PVs can be added to a logical volume group (VG),
and later on, logical volumes (LVs) are configured out of a VG. Each PV consists of a number
of fixed-size physical extents (PEs). Similarly, each LV consists of a number of fixed-size LEs.
An LV is created by mapping LEs to PEs within a VG. An LV can be created with a size from
one extent to all available extents in a VG.

With LVM2, you can influence the way that LEs (for an LV) are mapped to the available PEs.
With LVM linear mapping, the extents of several PVs are concatenated to build a larger LV.



Figure 12-4 illustrates an LV across several PVs. With striped mapping, groups of contiguous
PEs, which are called stripes, are mapped to a single PV, with consecutive stripes rotating
across the PVs. With this function, it is possible to configure striping across several LUNs
within LVM.

Figure 12-4 LVM striped mapping of three LUNs to a single logical volume (the physical extents of Physical Volumes 1, 2, and 3, each backed by one LUN, are mapped stripe by stripe to the logical extents of one logical volume)
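
The following command sequence is a sketch of how such a layout can be created with the
standard LVM tools (the device names, sizes, and the volume group name are examples
only). The -i and -I options of lvcreate select the number of stripes and the stripe size;
omitting them creates a linear (concatenated) LV:

pvcreate /dev/mapper/mpathc /dev/mapper/mpathd /dev/mapper/mpathe        # initialize the LUNs as PVs
vgcreate ds8kvg /dev/mapper/mpathc /dev/mapper/mpathd /dev/mapper/mpathe # create the volume group
lvcreate -i 3 -I 64 -L 12G -n lvol0 ds8kvg   # LV striped across the three PVs, 64 KB stripe size
lvcreate -L 12G -n lvol1 ds8kvg              # alternative: linear (concatenated) LV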

Furthermore, LVM2 offers additional functions and flexibility:


򐂰 LVs can be resized during operation.
򐂰 Data from one PV can be relocated during operations, for example, in data migration
scenarios.
򐂰 LVs can be mirrored between several PVs for redundancy.
򐂰 LV snapshots can be created for backup purposes.

For more information about LVM2, see these websites:


򐂰 https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Logical_Volume_Manager_Administration/LVM_overview.html
򐂰 https://www.suse.com/documentation/sles-12/stor_admin/data/stor_admin.html
򐂰 http://sources.redhat.com/lvm2/

Software RAID functions


Software RAID in the Linux V2.6 and V3.x kernel distributions is implemented through the
multiple device driver (md). This driver implementation is device-independent and therefore
is flexible and allows many types of disk storage to be configured as a RAID array. Supported
software RAID levels are RAID 0 (striping), RAID 1 (mirroring), RAID 5 (striping with parity),
and RAID 6 (striping with double parity).

For more information about how to use the command-line RAID tools in Linux, see this
website:
https://raid.wiki.kernel.org/index.php/Linux_Raid
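
A minimal mdadm sketch (the device names and the RAID level are examples only; DS8000
LUNs are already RAID protected, so software RAID is typically used on them only for
striping or for mirroring across storage systems):

mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/mapper/mpathe /dev/mapper/mpathf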



Modern practice
With the introduction of the IBM Easy Tier and I/O Priority Manager (IOPM) features, some
performance management tasks that were handled with LVM (for example, by configuration
LVM striping) can now be handled on the storage system. In addition, earlier approaches with
rank-based volume separation are not necessarily reasonable configurations anymore. With
the DS8000 features, such as Easy Tier and IOPM, the use of hybrid or multitier pools with
automated cross-tier and intra-tier management, and Micro-tiering capabilities for optimum
data relocation, offer excellent performance in most cases. These methods require less
storage management effort than the use of many single-tier volumes striped with LVM. The
use of LVM in these configurations might even result in decreasing real skew factors and
inefficient Easy Tier optimization because of diluted heat distributions.

The preferred way to combine LVM with Easy Tier and IOPM is to use LVM concatenated LVs.
This method might be useful when it is not possible to use volumes larger than 2 TB on
DS8700 and DS8800 storage systems (there are still some copy function limitations) or when
implementing disaster recovery solutions that require LVM involvement.
practices that are described in Chapter 4, “Logical configuration performance considerations”
on page 83 and Chapter 3, “Logical configuration concepts and terminology” on page 47.

12.3.4 Tuning the disk I/O scheduler


The Linux kernel V2.6 introduced a new I/O elevator model. The Linux kernel V2.4 used a
single, general-purpose I/O scheduler. The Linux kernel V2.6 offers the choice of four
schedulers or elevators.

The I/O scheduler forms the interface between the generic block layer and the low-level
device drivers. Functions that are provided by the block layer can be used by the file systems
and the virtual memory manager to submit I/O requests to the block devices. These requests
are transformed by the I/O scheduler to the low-level device driver. Red Hat Enterprise Linux
AS 4 and SUSE Linux Enterprise Server 11 support four types of I/O schedulers.

You can obtain additional details about configuring and setting up I/O schedulers in Tuning
Linux OS on System p The POWER Of Innovation, SG24-7338.

Descriptions of the available I/O schedulers


The following list shows the four available I/O schedulers:
򐂰 Deadline I/O scheduler
The Deadline I/O scheduler incorporates a per-request expiration-based approach. The
I/O scheduler assigns an expiration time to each request and puts it into multiple queues:
a queue that is sorted by physical location on disk (sorted queue) and read and write I/O
first in, first out (FIFO) queues. The Deadline I/O scheduler pulls I/O requests from the
sorted queue to the disk I/O queue, but prioritizes the read and write I/Os in the FIFO
queues if they expire. The idea behind the Deadline I/O scheduler is that all read requests
are satisfied within a specified period. Web servers perform better when configured with
deadline I/O scheduler and an ext3 file system.
򐂰 Noop I/O scheduler
The Noop I/O scheduler manages all I/O requests in a FIFO queue. In addition, it performs
basic merging and sorting functions to optimize I/O request handling and to reduce seek
times. In large I/O systems that incorporate RAID controllers and many physical disk
drives that support the tagged command queuing (TCQ) feature on the hardware level, the
Noop I/O scheduler potentially can outperform the other three I/O schedulers as the
workload increases.



򐂰 Completely Fair Queuing I/O scheduler
The Completely Fair Queuing (CFQ) I/O scheduler is implemented on the concept of fair
allocation of I/O bandwidth among all the initiators of I/O requests. The CFQ I/O scheduler
strives to manage per-process I/O bandwidth and provide fairness at the level of process
granularity. Sequential writes performance can improve if the CFQ I/O scheduler is
configured with eXtended File System (XFS). The number of requests fetched from each
queue is controlled by the cfq_quantum tunable parameter.
򐂰 Anticipatory I/O scheduler
The anticipatory I/O scheduler attempts to reduce the per-thread read response time. It
introduces a controlled delay component into the dispatching equation. The delay is
started on any new read request to the device driver. File server performance can improve
if the anticipatory I/O scheduler is configured with ext3. Sequential reads performance can
improve if the anticipatory I/O scheduler is configured with XFS or ext3 file systems.

Selecting the correct I/O elevator for a selected type of workload


For most server workloads, either the CFQ elevator or the Deadline elevator is an adequate
choice because both of them are optimized for the multi-user and multi-process environment
in which a typical server operates. Enterprise distributions typically default to the CFQ
elevator. However, on Linux for z Systems, the Deadline scheduler is favored as the default
elevator. Certain environments can benefit from selecting a different I/O elevator. Since Red
Hat Enterprise Linux 5.0 and SUSE Linux Enterprise Server 10, the I/O schedulers can be
selected on a per disk system or a disk device basis as opposed to the global setting in Red
Hat Enterprise Linux 4.0 and SUSE Linux Enterprise Server 9.
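
As a sketch (the device name sdc and the scheduler name are examples; the set of
schedulers that is offered depends on your kernel), the active elevator of a block device can
be displayed and changed at run time through sysfs:

cat /sys/block/sdc/queue/scheduler           # the active scheduler is shown in square brackets
echo noop > /sys/block/sdc/queue/scheduler   # switch this device to the noop elevator until reboot

Alternatively, a global default can be set with the elevator= kernel boot parameter.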

With the capability to have different I/O elevators per disk system, the administrator now can
isolate a specific I/O pattern on a disk system (such as write-intensive workloads) and select
the appropriate elevator algorithm:
򐂰 Synchronous file system access
Certain types of applications must perform file system operations synchronously, which
can be true for databases that might use a raw file system or for large disk systems where
caching asynchronous disk accesses is not an option. In those cases, the performance of
the anticipatory elevator usually has the least throughput and the highest latency. The
three other schedulers perform equally up to an I/O size of roughly 16 KB where the CFQ
and the Noop elevators outperform the deadline elevator (unless disk access is
seek-intense).
򐂰 Complex disk systems
Benchmarks show that the Noop elevator is an interesting alternative in high-end server
environments. When using configurations with enterprise-class disk systems, such as the
DS8000 storage system, the lack of ordering capability of the Noop elevator becomes its
strength. It becomes difficult for an I/O elevator to anticipate the I/O characteristics of such
complex systems correctly, so you might often observe at least equal performance at less
impact when using the Noop I/O elevator. Most large-scale benchmarks that use hundreds
of disks most likely use the Noop elevator.
򐂰 Database systems
Because of the seek-oriented nature of most database workloads, some performance gain
can be achieved when selecting the deadline elevator for these workloads.



򐂰 VMs
VMs, regardless of whether in VMware or VM for z Systems, usually communicate through
a virtualization layer with the underlying hardware. So, a VM is not aware of whether the
assigned disk device consists of a single SCSI device or an array of Fibre Channel disks
on a DS8000 storage system. The virtualization layer takes care of necessary I/O
reordering and the communication with the physical block devices.
򐂰 Processor-bound applications
Although certain I/O schedulers can offer superior throughput, they can at the same time
create more system impact. For example, the impact that the CFQ or deadline elevators
cause comes from aggressively merging and reordering the I/O queue. Sometimes, the
workload is not so much limited by the performance of the disk system as by the
performance of the processor, for example, with a scientific workload or a data warehouse
that processes complex queries. In these scenarios, the Noop elevator offers an
advantage over the other elevators because it causes less processor impact, as shown on
Figure 12-5 on page 402. However, when you compare processor impact to throughput,
the deadline and CFQ elevators are still the preferred choices for most access patterns to
asynchronous file systems.
򐂰 Individual disks: Single Advanced Technology Attachment (ATA), Serial Advanced
Technology Attachment (SATA) disk systems, or SCSI
If you choose to use a single physical ATA or SATA disk, for example, for the boot partition
of your Linux system, consider using the anticipatory I/O elevator, which reorders disk
writes to accommodate the single disk head in these devices.

12.3.5 Using ionice to assign I/O priority


A feature of the CFQ I/O elevator is the option to assign priorities at the process level. By using
the ionice utility, you can restrict the disk system utilization of a specific process by assigning
one of the following I/O classes (a short usage sketch follows the list):
򐂰 Idle: A process with the assigned I/O priority idle is granted access to the disk systems if
no other processes with a priority of best-effort or higher request access to the data. This
setting is useful for tasks that run when the system has free resources, such as the
updatedb task.
򐂰 Best-effort: By default, all processes that do not request a specific I/O priority are
assigned to this class. The best-effort class provides eight priority levels, which are
derived from the processor nice level of the process.
򐂰 Real time: The highest available I/O priority is real time, which means that the respective
process is always given priority access to the disk system. The real time priority setting
can also accept eight priority levels. Use caution when assigning a thread a priority level of
real time because this process can cause the other tasks to be unable to access the disk
system.
The ionice utility has an effect only when several processes compete for I/O. If you use
ionice to favor certain processes, the I/O of other, possibly even essential, processes can suffer.
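
A minimal usage sketch (the command, file names, and process ID are placeholders only):

ionice -c 3 tar -czf /backup/data.tar.gz /data   # run a backup job in the idle class
ionice -c 2 -n 7 -p 1234                         # lower PID 1234 to best-effort priority level 7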

With the DS8000 IOPM feature, you can choose where to use I/O priority management. IOPM
has following advantages:
򐂰 Provides the flexibility of many levels of priorities to be set
򐂰 Does not consume the resources on the server
򐂰 Sets real priority at the disk level
򐂰 Manages internal bandwidth access contention between several servers



With ionice, you can control the priorities for the use of server resources, that is, to manage
access to the HBA, which is not possible with the DS8000 IOPM. However, this capability is
limited to a single server only, and you cannot manage priorities at the disk system back end.

We suggest that you use the DS8000 IOPM in most cases for priority management. Operating
system priority management can be combined with IOPM, which provides the highest level of
flexibility.

12.3.6 File systems


The file systems that are available for Linux are designed with different workload and
availability characteristics. If your Linux distribution and the application allow the selection of a
different file system, it might be worthwhile to investigate if Ext, Journal File System (JFS),
ReiserFS, or eXtended File System (XFS) is the optimal choice for the planned workload.

Do not confuse the JFS versions for Linux and AIX operating systems. AIX differentiates the
older JFS (JFS generation 1) and JFS2. On Linux, only JFS generation 2 exists, but is simply
called JFS. Today, JFS is rarely used on Linux because ext4 typically offers better
performance.

As of November 2015, the DS8000 storage system supports RHEL and SUSE Linux
Enterprise Server distributions. Thus, this section focuses on the file systems supported by
these Linux distributions.

SUSE Linux Enterprise Server 12 ships with a number of file systems, including Ext Versions
2 - 4, Btrfs, XFS, and ReiserFS. Btrfs is the default for the root partition while XFS is the
default for other use cases. For more information, see the following website:
https://www.suse.com/documentation/sles-12/stor_admin/data/cha_filesystems.html

RHEL 7 file system support includes Ext versions 3 and 4, XFS, and Btrfs. XFS is the default
file system. For more information, see the following website:
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Storage_Administration_Guide/part-file-systems.html

The JFS and XFS workload patterns are suited for high-end data warehouses, scientific
workloads, large symmetric multiprocessor (SMP) servers, or streaming media servers.
ReiserFS and Ext3 are typically used for file, web, or mail serving.

ReiserFS is more suited to accommodate small I/O requests. XFS and JFS are tailored
toward large file systems and large I/O sizes. Ext3 fits the gap between ReiserFS and JFS
and XFS because it can accommodate small I/O requests while offering good multiprocessor
scalability.

Ext4 is compatible with Ext3 and Ext2 file systems. Mounting the older file systems as Ext4
can improve performance because new file system features, such as the newer block
allocation algorithm, are used.

Access time updates


The Linux file system records when files are created, updated, and accessed. Default
operations include updating the last-time-read attribute for files during reads and writes to
files. Because writing is an expensive operation, eliminating unnecessary I/O can lead to
overall improved performance. However, under most conditions, disabling file access time
updates yields only a small performance improvement.



Mounting file systems with the noatime option prevents inode access times from being
updated. If file and directory update times are not critical to your implementation, as in a web
serving environment, an administrator might choose to mount file systems with the noatime
flag in the /etc/fstab file, as shown in Example 12-4. The performance benefit of not writing
access time updates to the file system is 0 - 10%, with an average of 3% for file server
workloads.

Example 12-4 Update /etc/fstab file with noatime option set on mounted file systems
/dev/mapper/ds8kvg-lvol0 /ds8k ext4 defaults,noatime 1 2
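
If you want to test the effect without remounting at boot time, the option can also be applied
to a file system that is already mounted (the mount point is the one that is used in
Example 12-4):

mount -o remount,noatime /ds8k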

Selecting the journaling mode of the file system


Ext3 and ext4 file systems support three journaling options that can be set with the data
option of the mount command. However, the journaling mode has the greatest effect on Ext3
file system performance, so use the following tuning options:
򐂰 data=journal
This journaling option provides the highest form of data consistency by causing both file
data and metadata to be journaled. It also has the highest performance impact.
򐂰 data=ordered (default)
In this mode, only metadata is journaled. However, file data is guaranteed to be written to
disk before the related metadata is committed to the journal. This setting is the default.
򐂰 data=writeback
This journaling option provides the fastest access to the data at the expense of data
consistency. The file system structure remains consistent because the metadata is still
being logged. However, no special handling of file data is done, which can lead to old data
appearing in files after a system crash.



The type of metadata journaling that is implemented when using the writeback mode is
comparable to the defaults of ReiserFS, JFS, or XFS. The writeback journaling mode
improves Ext3 performance especially for small I/O sizes, as shown in Figure 12-5.

Figure 12-5 Random write performance impact1 of data=writeback (throughput in kB/sec versus I/O size in kB/op from 4 - 2048 kB, comparing data=ordered with data=writeback)

The benefit of using writeback journaling declines as I/O sizes grow. Also, the journaling
mode of your file system affects only write performance. Therefore, a workload that performs
mostly reads (for example, a web server) does not benefit from changing the journaling mode.

There are three ways to change the journaling mode on a file system:
򐂰 Run the mount command:
mount -o data=writeback /dev/mapper/ds8kvg-lvol0 /ds8k
򐂰 Include the mode in the options section of the /etc/fstab file:
/dev/mapper/ds8kvg-lvol0 /ds8k ext4 defaults,data=writeback 1 2
򐂰 If you want to modify the default data=ordered option on the root partition, change the
/etc/fstab file. Then, run the mkinitrd command to scan the changes in the /etc/fstab
file and create an initial RAM disk image. Update grub or lilo to point to the new image.

Blocksizes
The blocksize, the smallest amount of data that can be read or written to a drive, can have a
direct impact on server performance. As a guideline, if your server handles many small files, a
smaller blocksize is more efficient. If your server is dedicated to handling large files, a larger
blocksize might improve performance. Blocksizes cannot be changed dynamically on existing
file systems, and only a reformatting modifies the current blocksize.

1 The performance data that is contained in this figure was obtained in a controlled, isolated environment at a specific
point in time by using the configurations, hardware, and software levels available at that time. Actual results that might
be obtained in other operating environments can vary. There is no guarantee that the same or similar results can be
obtained elsewhere. The data is intended to help illustrate how different technologies behave in relation to each other.



The allowed blocksize values depend on the file system type. For example, Ext4 allows 1-K,
2-K, and 4-K blocks, and XFS file system blocks can be 512 bytes to 64 KB. As benchmarks
demonstrate, there is hardly any performance improvement to gain from changing the
blocksize of a file system, so the default value should be the first choice. In addition, follow
the suggestions of the application vendor.
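
If a non-default blocksize is required, select it when the file system is created. The following
commands are a sketch only and destroy any existing data on the target devices; the device
names are examples:

mkfs.ext4 -b 4096 /dev/mapper/ds8kvg-lvol0       # Ext4 with a 4 KB blocksize
mkfs.xfs -b size=4096 /dev/mapper/ds8kvg-lvol1   # XFS with a 4 KB blocksize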

12.4 Linux performance monitoring tools


This section introduces the commonly used performance measurement tools. Linux offers
several other performance measurement tools as well; most of them are similar to the tools
that are available on UNIX operating systems.

12.4.1 Gathering configuration data


Before collecting the performance statistics, you must have the disk and path configuration
data to identify the disks that are shown in the statistics and find problems easily. The
following preferred practices are based on the use of DM-MP multipathing. For IBM SDD
multipathing, see Example 10-1 on page 353 because the procedures are the same.

Discovering the disks in the system


First, find all of the disks of the system. There are many ways to find all the disks, but the
simplest way is to use the dmesg|grep disk command. This command lists all the disks
discovered during the boot process or during the disk discovery by an HBA driver. You can get
the following output, as shown in Example 12-5.

Example 12-5 Command dmesg|grep disk output (output truncated)


[ 22.015946] sd 0:2:0:0: [sda] Attached SCSI disk
[ 22.016192] sd 0:2:1:0: [sdb] Attached SCSI disk
[ 24.093634] sd 2:0:0:0: [sdc] Attached SCSI removable disk
[2437135.910509] sd 1:0:1:0: [sdd] Attached SCSI disk
[2437135.911698] sd 1:0:1:1: [sde] Attached SCSI disk
[2437135.916264] sd 1:0:1:2: [sdf] Attached SCSI disk
[2437135.918118] sd 1:0:1:3: [sdg] Attached SCSI disk
[2437136.197808] sd 5:0:1:0: [sdh] Attached SCSI disk
[2437136.198339] sd 5:0:1:1: [sdi] Attached SCSI disk
[2437136.201402] sd 5:0:1:2: [sdj] Attached SCSI disk
[2437136.201934] sd 5:0:1:3: [sdk] Attached SCSI disk
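
On current distributions, you can also get a structured overview of the block devices,
including the multipath devices on top of them, with the lsblk command (part of the
util-linux package) and, if it is installed, the lsscsi command. The output columns depend
on the tool version:

lsblk     # tree of block devices, with multipath maps shown as children of their paths
lsscsi    # SCSI devices with their [host:channel:target:lun] addresses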

You can identify the DS8000 disks with the multipath -ll or multipathd -k commands, as
shown in Example 12-6. In current Linux versions, the multipathd -k interactive prompt is
used to communicate with DM-MP. multipath -ll is deprecated. For more information, see
IBM System Storage DS8000: Host Attachment and Interoperability, SG24-8887.

Example 12-6 multipathd -k command output


# multipathd -k
multipathd> show multipaths topology
mpatha (3600605b00902c1601b77aee7332ca07a) dm-0 IBM,ServeRAID M5210
size=278G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
`- 0:2:0:0 sda 8:0 active ready running
mpathc (36005076303ffd5aa000000000000002c) dm-2 IBM,2107900
size=10G features='1 queue_if_no_path' hwhandler='0' wp=rw



`-+- policy='service-time 0' prio=1 status=active
|- 1:0:1:0 sdd 8:48 active ready running
`- 5:0:1:0 sdh 8:112 active ready running
mpathd (36005076303ffd5aa000000000000002d) dm-3 IBM,2107900
size=10G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
|- 1:0:1:1 sde 8:64 active ready running
`- 5:0:1:1 sdi 8:128 active ready running
mpathe (36005076303ffd5aa000000000000002e) dm-6 IBM,2107900
size=10G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
|- 1:0:1:2 sdf 8:80 active ready running
`- 5:0:1:2 sdj 8:144 active ready running
mpathf (36005076303ffd5aa000000000000002f) dm-8 IBM,2107900
size=10G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
|- 1:0:1:3 sdg 8:96 active ready running
`- 5:0:1:3 sdk 8:160 active ready running

Example 12-6 on page 403 shows that the device is a DS8000 volume with an active-active
configuration. The LUNs have the names mpathc and mpathd and device names of dm-2 and
dm-3, which appear in the performance statistics. The LUN IDs in the parentheses,
36005076303ffd5aa000000000000002c and 36005076303ffd5aa000000000000002d, contain the
ID of the LV in the DS8000 storage system in the last four digits of the whole LUN ID: 002c
and 002d. The output also indicates that the size of each LUN is 10 GB. There is no hardware
handler that is assigned to this device, and I/O is supposed to be queued forever if no paths
are available. All path groups are in the active state, which means that all paths in this group
carry all the I/Os to the storage. All paths to the device (LUN) are in active ready mode.
There are two paths per LUN presented in the system as sdX devices, where X is the index
number of the disk.

Discovering the HBA information


All information about the devices is under /sys/class. Information about the HBAs and the
ports is in the /sys/class/fc_host folder (see Example 12-7).

Example 12-7 Contents of the /sys/class/fc_host folder


# ls /sys/class/fc_host
host1 host4 host5 host6

Example 12-7 shows the contents of the /sys/class/fc_host folder. This system has four FC
ports. You might use a script similar to Example 12-8 to display the FC port information.

Example 12-8 Get HBA port information with the script


for i in 1 4 5 6;
do echo #####Host $i ######;
cat /sys/class/fc_host/host$i/port_name;
cat /sys/class/fc_host/host$i/port_state;
cat /sys/class/fc_host/host$i/port_type;
cat /sys/class/fc_host/host$i/speed;
cat /sys/class/fc_host/host$i/supported_speeds;
done



This script simplifies information gathering. You might have to change the loop indexing.
Example 12-9 shows the script output: All the information for the FC ports that are connected
to the SAN and their WWPNs are shown in bold.

Example 12-9 Script output


0x2100000e1e30c2fe
Online
NPort (fabric via point-to-point)
8 Gbit
4 Gbit, 8 Gbit, 16 Gbit

0x10008c7cff82b000
Linkdown
Unknown
unknown
2 Gbit, 4 Gbit, 8 Gbit, 16 Gbit

0x2100000e1e30c2ff
Online
NPort (fabric via point-to-point)
16 Gbit
4 Gbit, 8 Gbit, 16 Gbit

0x10008c7cff82b001
Linkdown
Unknown
unknown
2 Gbit, 4 Gbit, 8 Gbit, 16 Gbit

Another way to discover the connection configuration is to use the systool -av -c fc_host
command, as shown in Example 12-10. This command displays extended output and
information about the FC ports. However, this command might not be available in all Linux
distributions.

Example 12-10 Output of the port information with systool -av -c fc_host (only one port shown)
Class Device = "host5"
Class Device path =
"/sys/devices/pci0000:40/0000:40:03.0/0000:51:00.1/host5/fc_host/host5"
dev_loss_tmo = "30"
fabric_name = "0x10000005339ff896"
issue_lip = <store method only>
max_npiv_vports = "254"
node_name = "0x2000000e1e30c2ff"
npiv_vports_inuse = "0"
port_id = "0x011c00"
port_name = "0x2100000e1e30c2ff"
port_state = "Online"
port_type = "NPort (fabric via point-to-point)"
speed = "16 Gbit"
supported_classes = "Class 3"
supported_speeds = "4 Gbit, 8 Gbit, 16 Gbit"
symbolic_name = "QLE2662 FW:v6.03.00 DVR:v8.07.00.08.12.0-k"
system_hostname = ""
tgtid_bind_type = "wwpn (World Wide Port Name)"



uevent =
vport_create = <store method only>
vport_delete = <store method only>

Device = "host5"
Device path = "/sys/devices/pci0000:40/0000:40:03.0/0000:51:00.1/host5"
fw_dump =
nvram = "ISP "
optrom_ctl = <store method only>
optrom =
reset = <store method only>
sfp = ""
uevent = "DEVTYPE=scsi_host"
vpd = ")"

Example 12-10 on page 405 shows the output for one FC port:
򐂰 The device file for this port is host5.
򐂰 This port has a worldwide port name (WWPN) of 0x2100000e1e30c2ff, which appears in
the fabric.
򐂰 This port is connected at 16 Gbps.
򐂰 It is a QLogic card with firmware version 6.03.00.

Discovering the configuration from the disk system


After you collect all the necessary information on the server side, collect the disk layout on the
disk system side:
򐂰 Rank and extent pool allocation
򐂰 Physical disk and array information
򐂰 VG and host connection information

Locating volume ranks


Our example shows how to obtain the configuration data for an LV with ID 002c.

The rank configuration for the disk can be shown by running the showfbvol -rank VOL_ID
command, where VOL_ID is 002c in this example (Example 12-11).

Example 12-11 Locate volume ranks


dscli> showfbvol -rank 002c
Date/Time: November 3, 2015 4:42:34 PM CET IBM DSCLI Version: 7.8.0.372 DS: IBM.
2107-75ZA571
Name ITSO_SLES12
ID 002C
accstate Online
datastate Normal
configstate Normal
deviceMTM 2107-900
datatype FB 512
addrgrp 0
extpool P4
exts 10
captype DS
cap (2^30B) 10.0



cap (10^9B) -
cap (blocks) 20971520
volgrp V40
ranks 1
dbexts 0
sam Standard
repcapalloc -
eam managed
reqcap (blocks) 20971520
realextents 10
virtualextents 0
migrating 0
perfgrp PG0
migratingfrom -
resgrp RG0
tierassignstatus -
tierassignerror -
tierassignorder -
tierassigntarget -
%tierassigned 0
etmonpauseremain -
etmonitorreset unknown
==============Rank extents==============
rank extents
============
R4 10

Example 12-11 on page 406 shows the properties of the LV. The following information can be
discovered for the volume:
򐂰 Occupies one rank (R4)
򐂰 Belongs to VG V40
򐂰 Is 10 GB
򐂰 Uses an extent allocation method (EAM) that is managed by Easy Tier
򐂰 Uses a standard storage allocation method (SAM)
򐂰 Is a regular, non-thin provisioned volume

Collecting physical disk information


With the information about the ranks that the volume uses collected, you must collect the
information about the underlying physical disk arrays and RAID types. To do so, run the
showrank and showarray commands.

Example 12-12 shows how to reveal the physical disk and array information. The properties of
the rank provide the array number, and the array properties provide the disk information and
the RAID type. In this case, rank R4 is on array A4, which consists of 3 TB nearline SAS
drives in a RAID 6 configuration.

Example 12-12 Gather disk information


dscli> showrank R4
Date/Time: November 3, 2015 4:48:54 PM CET IBM DSCLI Version: 7.8.0.372 DS: IBM.
2107-75ZA571
No Property - extunit
ID R4
SN -
Group 0



State Normal
datastate Normal
Array A4
RAIDtype 6
extpoolID P4
extpoolnam ITSO_EasyTier
volumes 0019,001B,001C,001D,002C,002D,002E,002F,1000,1001,1004,1005,4010
stgtype fb
exts 12890
usedexts 5941
widearrays 0
nararrays 1
trksize 128
strpsize 384
strpesize 0
extsize 16384
encryptgrp -
migrating(in) 0
migrating(out) 0
marray MA15

dscli> showarray A4
Date/Time: November 3, 2015 4:49:25 PM CET IBM DSCLI Version: 7.8.0.372 DS: IBM.
2107-75ZA571
Array A4
SN BZA57150A3E368S
State Assigned
datastate Normal
RAIDtype 6 (5+P+Q+S)
arsite S15
Rank R4
DA Pair 2
DDMcap (10^9B) 3000.0
DDMRPM 7200
Interface Type SAS
interrate 6.0 Gb/sec
diskclass NL
encrypt supported

Collecting volume group and host connection information


From Example 12-11 on page 406, you know that volume 002C participates in VG V40.
Example 12-13 shows how to get the VG properties and host connection information.

Example 12-13 Gather host connection information


dscli> lshostconnect -volgrp v40
Date/Time: November 3, 2015 4:58:15 PM CET IBM DSCLI Version: 7.8.0.372 DS: IBM.
2107-75ZA571
Name ID WWPN HostType Profile portgrp volgrpID ESSIOport
==================================================================================
- 000A 2100000E1E30C2FE LinuxSuse Intel - Linux Suse 11 V40 all
- 0010 2100000E1E30C2FF LinuxSuse Intel - Linux Suse 11 V40 all

dscli> showvolgrp V40


Date/Time: November 3, 2015 4:58:28 PM CET IBM DSCLI Version: 7.8.0.372 DS: IBM.



2107-75ZA571
Name ITSO_SLES12_1
ID V40
Type SCSI Map 256
Vols 002C 002D 002E 002F

dscli> showhostconnect 000A


Date/Time: November 3, 2015 4:58:50 PM CET IBM DSCLI Version: 7.8.0.372 DS: IBM.
2107-75ZA571
Name -
ID 000A
WWPN 2100000E1E30C2FE
HostType LinuxSuse
LBS 512
addrDiscovery LUNPolling
Profile Intel - Linux Suse
portgrp 11
volgrpID V40
atchtopo -
ESSIOport all
speed Unknown
desc -
host ITSO_SLES12_1

Example 12-13 on page 408 shows that VG V40 participates in two host connections for two
WWPNs: 2100000E1E30C2FE and 2100000E1E30C2FF. These WWPNs are the same ones
that are shown in Example 12-9 on page 405. Now you have collected all the information for
this specific volume.

12.4.2 Disk I/O performance indicators


Disk I/O performance is an important aspect of server performance and can be a bottleneck.
Applications are considered to be I/O-bound when processor cycles are wasted simply
waiting for I/O tasks to finish. However, problems can be hidden by other factors, such as lack
of memory.

The symptoms that show that the server might be suffering from a disk bottleneck (or a
hidden memory problem) are shown in Table 12-1.

Table 12-1 Disk I/O performance indicators

򐂰 Disk I/O numbers and wait time: Analyze the number of I/Os to the LUN. This data can be used to discover whether reads or writes are the cause of the problem. Run iostat to get the disk I/Os. Run stap ioblock.stp to get read/write blocks, and scsi.stp to get the SCSI wait times and the requests that were submitted and completed. Long wait times might also mean that the I/O goes to specific disks and is not spread out.
򐂰 Disk I/O size: The memory buffer that is available for the block I/O request might not be sufficient, and the page cache size can be smaller than the maximum disk I/O size. Run stap ioblock.stp to get request sizes. Run iostat to get the blocksizes.
򐂰 Disk I/O scheduler: An inappropriate I/O scheduler might cause performance bottlenecks. A certain I/O scheduler performs better if it is configured with the appropriate file system.
򐂰 Disk file system: An inappropriate file system might cause performance bottlenecks. Choose the appropriate file system based on the requirements.
򐂰 Disk I/O to physical device: If all the disk I/Os are directed to the same physical disk, it might cause a disk I/O bottleneck. Directing the disk I/O to different physical disks increases the performance.
򐂰 File system blocksize: If the file system is created with small blocks, creating files that are larger than the blocksize might cause a performance bottleneck. Creating a file system with a proper blocksize improves the performance.
򐂰 Swap device/area: If a single swap device or area is used, it might cause performance problems. To improve the performance, create multiple swap devices or areas.

12.4.3 Identifying disk bottlenecks


A server that exhibits the following symptoms might suffer from a disk bottleneck (or a hidden
memory problem):
򐂰 Slow disks result in these issues:
– Memory buffers that fill with write data (or wait for read data), which delays all requests
because free memory buffers are unavailable for write requests (or the response is
waiting for read data in the disk queue).
– Insufficient memory, as in the case of not enough memory buffers for network requests,
causes synchronous disk I/O.
򐂰 Disk utilization, controller utilization, or both types are typically high.
򐂰 Most local area network (LAN) transfers happen only after disk I/O completes, causing
long response times and low network utilization.
򐂰 Disk I/O can take a relatively long time and disk queues become full, so the processors
are idle or have low utilization because they wait long periods before processing the next
request.

Linux offers command-line tools to monitor performance-relevant information. Several of
these tools are helpful to get performance metrics for disk I/O-relevant areas.

The vmstat command


One way to track disk usage on a Linux system is by using the vmstat tool (Example 12-14
on page 411). The important columns in vmstat with respect to I/O are the bi and bo fields.
These fields monitor the movement of blocks in and out of the disk system. Having a baseline
is key to identifying any changes over time.



Example 12-14 The vmstat tool output
[root@x232 root]# vmstat 2
r b swpd free buff cache si so bi bo in cs us sy id wa
2 1 0 9004 47196 1141672 0 0 0 950 149 74 87 13 0 0
0 2 0 9672 47224 1140924 0 0 12 42392 189 65 88 10 0 1
0 2 0 9276 47224 1141308 0 0 448 0 144 28 0 0 0 100
0 2 0 9160 47224 1141424 0 0 448 1764 149 66 0 1 0 99
0 2 0 9272 47224 1141280 0 0 448 60 155 46 0 1 0 99
0 2 0 9180 47228 1141360 0 0 6208 10730 425 413 0 3 0 97
1 0 0 9200 47228 1141340 0 0 11200 6 631 737 0 6 0 94
1 0 0 9756 47228 1140784 0 0 12224 3632 684 763 0 11 0 89
0 2 0 9448 47228 1141092 0 0 5824 25328 403 373 0 3 0 97
0 2 0 9740 47228 1140832 0 0 640 0 159 31 0 0 0 100

The iostat command


Performance problems can be encountered when too many files are opened, read, and
written to, and then closed repeatedly. By using the iostat tool, which is part of the sysstat
package, you can monitor the I/O device loading in real time. You can use various options to
drill down even deeper to gather the necessary data.

In general, pay attention to the following metrics of the iostat data:


򐂰 %user is the percentage of processor utilization that occurred while running at the user
level (application).
򐂰 %nice is the percentage of processor utilization that occurred while running at the user
level with nice priority.
򐂰 %System is the percentage of processor utilization that occurred while running at the
system level (kernel).
򐂰 %iowait is the percentage of time that the processors were idle during which the system
had an outstanding disk I/O request.
򐂰 %idle is the percentage of time that the processors were idle and the system did not have
an outstanding disk I/O request.
򐂰 tps is the number of transfers per second that were issued to the device. A transfer is an
I/O request to the device. Multiple logical requests can be combined into a single I/O
request to the device. A transfer is of indeterminate size.
򐂰 kB_read/s is the amount of data read from the device expressed in kilobytes per second.
򐂰 kB_wrtn/s is the amount of data written to the device expressed in kilobytes per second.
򐂰 r/s is the number of read requests that were issued to the device per second.
򐂰 w/s is the number of write requests that were issued to the device per second.
򐂰 avgrq-sz is the average size (in sectors) of the requests that were issued to the device.
򐂰 avgqu-sz is the average queue length of the requests that were issued to the device.
򐂰 await is the average time (in milliseconds) for I/O requests issued to the device to be
served. This metric includes the time that is spent by the requests in queue and the time
spent servicing them.
򐂰 svctm is the average service time (in milliseconds) for I/O requests that were issued to the
device.



Remember to gather statistics in extended mode with a time stamp and with kilobyte or
megabyte values. This information is easier to understand, and you capture all the necessary
information at one time. To do so, run the iostat -kxt or iostat -mxt command.

Look at the statistics from the iostat tool to help you understand the situation. You can use
the following example situations as guidance.

Good situations
Good situations have the following characteristics:
򐂰 High tps value, high %user value, low %iowait, and low svctm: A good condition, as
expected.
򐂰 High tps value, high %user value, medium %iowait, and medium svctm: Still good, but
requires attention. Write activity is probably a little higher than expected. Check the write
block size and queue size.
򐂰 Low tps value, low %user value, medium-high %iowait value, low %idle value, high MB/sec
value, and high avgrq-sz value: The system performs well with large-block write or read
activity.

Bad situations
Bad situations have the following characteristics:
򐂰 Low tps value, low %user value, low %iowait, and low svctm: The system is not driving
disk I/O. If the application still suffers from disk I/O, look at the application first, not at
the disk system.
򐂰 Low tps value, low %user value, high %system value, high or low svctm, and 0 %idle
value: The system is stuck in disk I/O. This situation can happen because of a failed path,
a device adapter (DA) problem, or application errors.
򐂰 High tps value, medium %user value, medium %system value, high %iowait value, and
high svctm: The system exhausted the disk resources, and you must consider an upgrade.
Increase the number of disks and I/O paths first, and review the physical layout.
򐂰 Low tps value, low %user value, high %iowait value, high service time, high read queue,
and high write queue: Large-block write activity exists on the same disks that are also
intended for read activity. If applicable, split the data for writes and reads onto separate
disks.

These situations are examples for your understanding; many similar combinations can occur.
Do not analyze only one or two values of the collected data; instead, try to obtain a full
picture by combining all of the available data.
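
As a quick first pass over live extended statistics, you can filter for devices that show
sustained high utilization. The following one-liner is only a rough sketch: the 80% threshold
is an arbitrary example, the column layout differs between sysstat versions, and LC_ALL=C is
set so that the numbers use a decimal point:

# Print only the iostat -dxk device lines whose %util (last column) exceeds 80%.
LC_ALL=C iostat -dxk 30 | awk '/^(sd|dm-)/ && $NF+0 > 80'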

Monitoring workload distribution on the paths


The disks dm-0 and dm-1, shown in Example 12-15, have four paths each. It is important to
monitor whether the I/O workload is spread across the paths. All paths should be active while
servicing the I/O workload.

Example 12-15 Monitoring workload distribution with the iostat -kxt command
avg-cpu: %user %nice %system %iowait %steal %idle
0,00 0,00 0,62 25,78 0,09 73,51

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00
sdb 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00
sdc 0,00 0,00 978,22 11,88 124198,02 6083,17 263,17 1,75 1,76 0,51 50,30
sdd 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00
sde 0,00 0,00 20,79 0,00 2538,61 0,00 244,19 0,03 0,95 0,76 1,58
dm-0 59369,31 1309,90 1967,33 15,84 244356,44 8110,89 254,61 154,74 44,90 0,50 99,41

dm-1 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00
sdf 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00
sdg 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00
sdh 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00
sdi 0,00 0,00 968,32 3,96 117619,80 2027,72 246,12 1,67 1,71 0,49 47,52

Example 12-15 on page 412 shows that disk dm-0 is running the workload. It has four paths:
sdc, sde, sdg, and sdi. The workload is currently distributed to three paths for reading (sdc,
sde, and sdi) and to two paths for writing (sdc and sdi); path sdg is idle.
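
Before interpreting such output, confirm which SCSI devices belong to each multipath device.
A simple way to do so, assuming the device-mapper-multipath tools are installed, is shown
below; the device names are the ones from Example 12-15 and are examples only:

# List each multipath device and its underlying paths (sdX devices).
multipath -ll

# Watch the multipath device and its paths, keeping the header line.
iostat -kxt 5 | grep -E 'Device|dm-0|sdc|sde|sdg|sdi'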

Example 12-16 shows the output of the iostat command on an LPAR configured with
1.2 processors running RHEL while the server issues writes to the disks sda and dm-2. The
disk transfers per second are 130 for sda and 692 for dm-2. The %iowait is 6.37%, which might
seem high for this workload, but it is not; it is normal for a mix of write and read workloads.
However, it might grow rapidly in the future, so pay attention to it.

Example 12-16 Outputs from iostat command


#iostat -k
avg-cpu: %user %nice %sys %iowait %idle
2.70 0.11 6.50 6.37 84.32
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 130.69 1732.56 5827.90 265688 893708
sda1 1.24 2.53 0.00 388 0
sda2 4.80 5.32 0.03 816 4
sda3 790.73 1717.40 5827.87 263364 893704
dm-0 96.19 1704.71 292.40 261418 44840
dm-1 0.29 2.35 0.00 360 0
dm-2 692.66 6.38 5535.47 978 848864

Example 12-17 shows the output of the iostat -k command on an LPAR configured with
1.2 processors running RHEL while the server issues writes to the sda and dm-2 disks. The disk
transfers per second are 428 for sda and 4024 for dm-2. The %iowait increased to 12.42%.
The prediction from the previous example came true: the workload became higher and the
%iowait value grew, but the %user value remained the same. The disk system now can hardly
manage the workload and requires tuning or an upgrade. Although the workload grew, the
performance of the user processes did not improve. The application might issue more
requests, but they must wait in the queue instead of being serviced. In this situation, gather
the extended iostat statistics.

Example 12-17 Output of iostat to illustrate disk I/O bottleneck


# iostat -k
avg-cpu: %user %nice %sys %iowait %idle
2.37 0.20 27.22 12.42 57.80
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 428.14 235.64 32248.23 269840 36928420
sda1 0.17 0.34 0.00 388 0
sda2 0.64 0.71 0.00 816 4
sda3 4039.46 233.61 32248.17 267516 36928352
dm-0 14.63 231.80 52.47 265442 60080
dm-1 0.04 0.31 0.00 360 0
dm-2 4024.58 0.97 32195.76 1106 36868336

Changes made to the elevator algorithm, as described in 12.3.4, “Tuning the disk I/O
scheduler” on page 397, are reflected in avgrq-sz (average size of request) and avgqu-sz
(average queue length), as illustrated in Example 12-15 on page 412. As the latencies are
lowered by manipulating the elevator settings, avgrq-sz decreases. You can also monitor
rrqm/s and wrqm/s to see the effect on the number of merged reads and writes that the disk
can manage.

The sar command
The sar command, which is included in the sysstat installation package, uses the standard
system activity data file to generate a report.

The system must be configured to collect and log the information; therefore, a cron job
must be set up. Add the lines shown in Example 12-18 to /etc/crontab for automatic log
reporting with cron.

Example 12-18 Example of automatic log reporting with cron


....
#8am-7pm activity reports every 10 minutes during weekdays.
0 8-18 * * 1-5 /usr/lib/sa/sa1 600 6 &
#7pm-8am activity reports every hour during weekdays.
0 19-7 * * 1-5 /usr/lib/sa/sa1 &
#Activity reports every hour on Saturday and Sunday.
0 * * * 0,6 /usr/lib/sa/sa1 &
#Daily summary prepared at 19:05
5 19 * * * /usr/lib/sa/sa2 -A &
....

You get a detailed overview of your processor utilization (%user, %nice, %system, and %idle),
memory paging, network I/O and transfer statistics, process creation activity, activity for block
devices, and interrupts/second over time.

The sar -A command (-A is equivalent to -bBcdqrRuvwWy -I SUM -I PROC -n FULL -U ALL,
which selects the most relevant counters of the system) is the most effective way to gather all
relevant performance counters at once. Use the sar command to analyze whether a system is
disk I/O-bound: high I/O waits result in filled-up memory buffers and low processor usage.
Furthermore, this method is useful for monitoring the overall system performance over a
longer period, for example, days or weeks, to understand at which times a reported
performance bottleneck occurs.
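
After the sa1 cron jobs collect data, you can report on it directly from the daily files in
/var/log/sa, provided that the collector was configured to include the corresponding
activities. The commands below are only a sketch; the file name saDD depends on the day of
the month, so sa15 is an example:

# Block device activity (-d) and processor utilization (-u) from the data
# file that was collected on the 15th day of the month.
sar -d -f /var/log/sa/sa15
sar -u -f /var/log/sa/sa15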

A variety of additional performance data collection utilities are available for Linux. Most of
them were ported from UNIX systems. You can obtain more details about those additional
tools in 9.3, “AIX performance monitoring tools” on page 312.

Chapter 13. Performance considerations for the IBM i system
This chapter describes the topics that are related to the DS8800 performance with an IBM i
host. The performance of IBM i database applications and batch jobs is sensitive to the disk
response time. Therefore, it is important to understand how to plan, implement, and analyze
the DS8800 performance with the IBM i system.

This chapter includes the following topics:


򐂰 IBM i storage architecture
򐂰 Fibre Channel adapters and Multipath
򐂰 Performance guidelines for hard disk drives in a DS8800 storage system with the IBM i
system
򐂰 Preferred practices for implementing IBM i workloads on flash cards in a DS8870 storage
system
򐂰 Preferred practices for implementing IBM i workloads on flash cards in a DS8886 storage
system
򐂰 Analyzing performance data
򐂰 Easy Tier with the IBM i system
򐂰 I/O Priority Manager with the IBM i system

13.1 IBM i storage architecture
To understand the performance of the DS8800 storage system with the IBM i system, you
need insight into IBM i storage architecture. This section explains this part of IBM i
architecture and how it works with the DS8800 storage system.

The following IBM i specific features are important for the performance of external storage:
򐂰 Single-level storage
򐂰 Object-based architecture
򐂰 Storage management
򐂰 Types of disk pools

This section describes these features and explains how they relate to the performance of a
connected DS8800 storage system.

13.1.1 Single-level storage


The IBM i system uses the same architectural component that is used by the iSeries and
AS/400 platform: single-level storage. It treats the main memory and the disk space as one
storage area. It uses the same set of 64-bit virtual addresses to cover both main memory and
disk space. Paging in this virtual address space is performed in 4 KB memory pages.

13.1.2 Object-based architecture


One of the differences between the IBM i system and other operating systems is the concept
of objects. For example, data files, programs, libraries, queues, user profiles, and device
descriptions are all types of objects in the IBM i system. Every object on the IBM i system is
packaged with the set of rules for how it can be used, enhancing integrity, security, and
virus-resistance.

The IBM i system takes responsibility for managing the information in disk pools. When you
create an object, for example, a file, the system places the file in the location that ensures
the best performance, and it normally spreads the data in the file across multiple disk units.
Advantages of this design include ease of use, self-management, and the automatic use of
newly added disk units. The IBM i object-based architecture is shown in Figure 13-1 on
page 417.

Figure 13-1 IBM i object-based architecture

13.1.3 Storage management


Storage management is a part of the IBM i Licensed Internal Code that manages the I/O
operations to store and place data on storage. Storage management handles the I/O
operations in the following way.

When the application performs an I/O operation, the portion of the program that contains read
or write instructions is first brought into main memory where the instructions are then run.

With the read request, the virtual addresses of the needed record are resolved, and for each
needed page, storage management first looks to see whether it is in main memory. If the
page is there, it is used to resolve the read request. However, if the corresponding page is not
in main memory, a page fault is encountered and it must be retrieved from disk. When a page
is retrieved, it replaces another page in memory that recently was not used; the replaced
page is paged out to disk.

Similarly, writing a new record or updating an existing record is done in main memory, and the
affected pages are marked as changed. A changed page normally remains in main memory
until it is written to disk as a result of a page fault. Pages are also written to disk when a file is
closed or when write-to-disk is forced by a user through commands and parameters. The
handling of I/O operations is shown in Figure 13-2.

Figure 13-2 Storage management handling I/O operations

When resolving virtual addresses for I/O operations, storage management directories map
the disk and sector to a virtual address. For a read operation, a directory lookup is performed
to get the needed information for mapping. For a write operation, the information is retrieved
from the page tables.

13.1.4 Disk pools in the IBM i system


The disk pools in the IBM i system are referred to as auxiliary storage pools (ASPs). The
following types of disk pools exist in the IBM i system:
򐂰 System ASP
򐂰 User ASP
򐂰 Independent ASP (IASP)

System ASP
The system ASP is the basic disk pool for the IBM i system. This ASP contains the IBM i
system boot disk (load source), system libraries, indexes, user profiles, and other system
objects. The system ASP is always present in the IBM i system and is needed for the IBM i
system. The IBM i system does not start if the system ASP is inaccessible.

User ASP
A user ASP separates the storage for different objects for easier management. For example,
the libraries and database objects that belong to one application are in one user ASP, and the
objects of another application are in a different user ASP. If user ASPs are defined in the IBM
i system, they are needed for the IBM i system to start.

Independent ASP
The IASP is a disk pool that can switch among two or more IBM i systems that are in a cluster.
The IBM i system can start without accessing the IASP. Typically, the objects that belong to a
particular application are in this disk pool. If the IBM i system with IASP fails, the independent
disk pool can be switched to another system in a cluster. If the IASP is on the DS8800 storage
system, the copy of IASP (FlashCopy, Metro Mirror, or Global Mirror copy) is made available
to another IBM i system and the cluster, and the application continues to work from another
IBM i system.

13.2 Fibre Channel adapters and Multipath


This section explains the use of Fibre Channel (FC) adapters to connect the DS8800
storage system to the IBM i system and the different ways to connect them. It also describes
the performance capabilities of the adapters and the performance enhancement that
Multipath provides.

The DS8800 storage system can connect to the IBM i system in one of the following ways:
򐂰 Native: FC adapters in the IBM i system are connected through a storage area network
(SAN) to the host bus adapters (HBAs) in the DS8800 storage system.
򐂰 With Virtual I/O Server Node Port ID Virtualization (VIOS NPIV): FC adapters in the VIOS
are connected through a SAN to the HBAs in the DS8800 storage system. The IBM i
system is a client of the VIOS and uses virtual FC adapters; each virtual FC adapter is
mapped to a port in an FC adapter in the VIOS.
For more information about connecting the DS8800 storage system to the IBM i system
with VIOS_NPIV, see DS8000 Copy Services for IBM i with VIOS, REDP-4584, and IBM
System Storage DS8000: Host Attachment and Interoperability, SG24-8887.
򐂰 With VIOS: FC adapters in the VIOS are connected through a SAN to the HBAs in the
DS8800 storage system. The IBM i system is a client of the VIOS, and virtual SCSI
adapters in VIOS are connected to the virtual SCSI adapters in the IBM i system.
For more information about connecting storage systems to the IBM i system with the
VIOS, see IBM i and Midrange External Storage, SG24-7668, and DS8000 Copy Services
for IBM i with VIOS, REDP-4584.

Most installations use the native connection of the DS8800 storage system to the IBM i
system or the connection with VIOS_NPIV.

IBM i I/O processors: The information that is provided in this section refers to connection
with IBM i I/O processor (IOP)-less adapters. For similar information about older
IOP-based adapters, see IBM i and IBM System Storage: A Guide to Implementing
External Disks on IBM i, SG24-7120.

13.2.1 FC adapters for native connection


The following FC adapters are used to connect the DS8800 storage system natively to an
IBM i partition in a POWER server:
򐂰 2-Port 16 Gb PCIe2 Fibre Channel Adapter, Feature Code EN0A (EN0B for Low Profile
(LP) adapter)
򐂰 4-Port 8 Gb PCIe Generation-2 Fibre Channel Adapter, Feature Code 5729 (EN0Y for LP
adapter)
򐂰 2-Port 8 Gb PCIe Fibre Channel Adapter, Feature Code 5735 (5273 for LP adapter)

򐂰 2-Port 4 Gb PCIe Fibre Channel Adapter, Feature Code 5774 (5276 for LP adapter)
򐂰 2-Port 4 Gb PCI-X Fibre Channel Adapter, Feature Code 5749

Note: The supported adapters depend on the type of POWER server and the level of
the IBM i system. For detailed specifications, see the IBM System Storage Interoperation
Center (SSIC) at the following website:
http://www.ibm.com/systems/support/storage/ssic/interoperability.wss

All listed adapters are IOP-less adapters. They do not require an I/O processor card to offload
the data management. Instead, the processor manages the I/O and communicates directly
with the FC adapter. Thus, the IOP-less FC technology takes full advantage of the
performance potential in the IBM i system.

Before the availability of IOP-less adapters, the DS8800 storage system connected to
IOP-based FC adapters that require the I/O processor card.

IOP-less FC architecture enables two technology functions that are important for the
performance of the DS8800 storage system with the IBM i system: Tagged Command
Queuing and Header Strip Merge.

Tagged Command Queuing


Tagged Command Queuing allows the IBM i system to issue multiple commands to the
DS8800 storage system on the same path to a logical volume (LV). In the past, the IBM i
system sent only one command per LUN path. With Tagged Command Queuing, up to six
concurrent I/O operations to the same LUN through one path are possible in a natively
connected DS8800 storage system; that is, the queue depth on a LUN is 6.

Header Strip Merge


Header Strip Merge allows the IBM i system to bundle data into 4 KB chunks. By merging the
data together, it reduces the amount of storage that is required for the management of smaller
data chunks.

13.2.2 FC adapters in VIOS


The following FC adapters are used to connect the DS8800 storage system to VIOS in a
POWER server to implement VIOS_NPIV connection for the IBM i system:
򐂰 2-Port 16 Gb PCIe2 Fibre Channel Adapter, Feature Code EN0A
򐂰 4-Port 8 Gb PCIe Generation-2 Fibre Channel Adapter, Feature Code 5729
򐂰 2-Port 8 Gb PCIe Fibre Channel Adapter, Feature Code 5735

Queue depth and the number of command elements in the VIOS


When you connect the DS8800 storage system to the IBM i system through the VIOS,
consider the following types of queue depths:
򐂰 The queue depth per LUN: SCSI Command Tag Queuing in the IBM i operating system
enables up to 32 I/O operations to one LUN at the same time. This queue depth is valid for
either connection with VIOS VSCSI or connection with VIOS_NPIV.
򐂰 If the DS8800 storage system is connected in VIOS VSCSI, consider the queue depth 32
per physical disk (hdisk) in the VIOS. This queue depth indicates the maximum number of
I/O requests that can be outstanding on a physical disk in the VIOS at one time.

򐂰 The number of command elements per port in a physical adapter in the VIOS is 500 by
default; you can increase it to 2048. This queue depth indicates the maximum number of
I/O requests that can be outstanding on a port in a physical adapter in the VIOS at one
time, either in a VIOS VSCSI connection or in VIOS_NPIV. The IBM i operating system
has a fixed queue depth of 32, which is not changeable. However, the queue depth for a
physical disk in the VIOS can be set by the user. If needed, set the queue depth per
physical disk in the VIOS to 32 by using the chdev -dev hdiskxx -attr queue_depth=32
command, as shown in the example after this list.
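
A minimal sketch of how this check and change can look on the VIOS command line (as the
padmin user) is shown below. The hdisk name is only an example; the device must not be in
use while its attributes are changed (alternatively, use the -perm option and apply the
change at the next restart of the VIOS):

# Display the current queue depth of the backing physical disk.
lsdev -dev hdisk3 -attr queue_depth

# Raise the queue depth to 32.
chdev -dev hdisk3 -attr queue_depth=32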

13.2.3 Multipath
The IBM i system allows multiple connections from different ports on a single IBM i partition to
the same LVs in the DS8800 storage system. This multiple-connection support provides an
extra level of availability and error recovery between the IBM i system and the DS8800
storage system. If one IBM i adapter fails, or one connection to the DS8800 storage system is
lost, you can continue using the other connections and keep communicating with the disk
unit. The IBM i system supports up to eight active connections (paths) to a single LUN in the
DS8800 storage system.

In addition to high availability, multiple paths to the same LUN provide load balancing. A
Round-Robin algorithm is used to select the path for sending the I/O requests. This algorithm
enhances the performance of the IBM i system with the DS8800 LUNs connected in
Multipath.

Multipath is part of the IBM i operating system. This Multipath differs from other platforms that
have a specific software component to support multipathing, such as the Subsystem Device
Driver (SDD).

When the DS8800 storage system connects to the IBM i system through the VIOS, Multipath
in the IBM i system is implemented so that each path to a LUN uses a different VIOS.
Therefore, at least two VIOSs are required to implement Multipath for an IBM i client. This way
of multipathing provides additional resiliency if one VIOS fails. In addition to IBM i Multipath
with two or more VIOS, the FC adapters in each VIOS can multipath to the connected
DS8800 storage system to provide additional resiliency and enhance performance.

13.3 Performance guidelines for hard disk drives in a DS8800 storage system with the IBM i system
This section describes the guidelines to use when planning and implementing hard disk
drives (HDDs) in a DS8800 storage system for an IBM i system to achieve the desired
performance.

13.3.1 RAID level


Most IBM i clients use RAID 10 or RAID 5 for their workloads. RAID 10 provides better
resiliency than RAID 5, and in many cases, it also enables better performance because of the
lower RAID penalty of RAID 10 compared to RAID 5. Workloads with a low read/write ratio
and with many random writes benefit the most from RAID 10 in terms of performance.

Use RAID 10 for IBM i systems, especially for the following types of workloads:
򐂰 Workloads with large I/O rates
򐂰 Workloads with many write operations (low read/write ratio)

򐂰 Workloads with many random writes
򐂰 Workloads with low write-cache efficiency

13.3.2 Number of ranks


To better understand why the number of disk drives or the number of ranks is important for an
IBM i workload, here is a short explanation of how an IBM i system spreads the I/O over disk
drives.

When an IBM i page or a block of data is written to disk space, storage management spreads
it over multiple disks. By spreading data over multiple disks, multiple disk arms work in
parallel for any request to this piece of data, so writes and reads are faster.

When using external storage with the IBM i system, storage management sees an LV (LUN)
in the DS8800 storage system as a “physical” disk unit. If a LUN is created with the rotate
volumes extent allocation method (EAM), it occupies multiple stripes of a rank. If a LUN is
created with the rotate extents EAM, it is composed of multiple stripes of different ranks.

Figure 13-3 shows the use of the DS8800 disk with IBM i LUNs created with the rotate
extents EAM.

Figure 13-3 Use of disk arms with LUNs created in the rotate extents method

Therefore, a LUN uses multiple DS8800 disk arms in parallel. The same DS8800 disk arms
are used by multiple LUNs that belong to the same IBM i workload, or even to different IBM i
workloads. To support efficiently this structure of I/O and data spreading across LUNs and
disk drives, it is important to provide enough disk arms to an IBM i workload.

Use the Disk Magic tool when you plan the number of ranks in the DS8800 storage system for
an IBM i workload. To provide a good starting point for Disk Magic modeling, consider the
number of ranks that is needed to keep disk utilization under 60% for your IBM i workload.

Table 13-1 on page 423 shows the maximum number of IBM i I/O operations per second for
one rank that keeps the disk utilization under 60%, for workloads with read/write ratios of
70/30 and 50/50.

Table 13-1 Host I/O per second for an array (1)

RAID array, disk drive    Host I/O per second at 70% reads    Host I/O per second at 50% reads
RAID 5, 15 K RPM          940                                 731
RAID 10, 15 K RPM         1253                                1116
RAID 6, 15 K RPM          723                                 526
RAID 5, 10 K RPM          749                                 583
RAID 10, 10 K RPM         999                                 890
RAID 6, 10 K RPM          576                                 420

Use the following steps to calculate the necessary number of ranks for your workload by using
Table 13-1:
1. Decide which read/write ratio (70/30 or 50/50) is appropriate for your workload.
2. Decide which RAID level to use for the workload.
3. Look for the corresponding number in Table 13-1.
4. Divide the I/O/sec of your workload by the number from the table to get the number of
ranks.

For example, consider a medium IBM i workload that experiences 8500 I/O operations per
second at a read/write ratio of 50/50 and that uses 15 K RPM disk drives in RAID 10. The
number of needed ranks is:

8500 / 1116 = 7.6, or approximately 8

Therefore, use eight ranks of 15 K RPM disk drives in RAID 10 for this workload.
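
The same calculation can be scripted so that it is easy to repeat for other workloads and
RAID levels. The following lines are only a small sketch with the numbers from the example
above; substitute your own workload I/O rate and the per-rank value from Table 13-1:

# Peak host I/O per second of the IBM i workload and the per-rank value
# from Table 13-1 (RAID 10, 15 K RPM, 50% reads).
workload_iops=8500
iops_per_rank=1116

# Integer ceiling division, because a fraction of a rank cannot be configured.
ranks=$(( (workload_iops + iops_per_rank - 1) / iops_per_rank ))
echo "Ranks needed: $ranks"    # prints 8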

13.3.3 Number and size of LUNs


For the performance of an IBM i system, ensure that the IBM i system uses many disk units.
The more disk units that are available to an IBM i system, the more server tasks are available
for the IBM i storage management to use for managing the I/O operations to the disk space.
The result is improved I/O operation performance.

With IBM i internal disks, a disk unit is a physical disk drive. With a connected DS8800
storage system, a disk unit is a LUN. Therefore, it is important to provide many LUNs to an
IBM i system: for a given disk capacity, define a larger number of smaller LUNs.

Number of disk drives in the DS8800 storage system: In addition to the suggestion for
many LUNs, use a sufficient number of disk drives in the DS8800 storage system to
achieve good IBM i performance, as described in 13.3.2, “Number of ranks” on page 422.

(1) The calculations for the values in Table 13-1 are based on measurements of how many I/O operations one rank
can handle in a certain RAID level, assuming a 20% read cache hit ratio and 30% write cache efficiency for the
IBM i workload. Assume that half of the used ranks have a spare and half are without a spare.

Another reason why you should define smaller LUNs for an IBM i system is the queue depth
in Tagged Command Queuing. With a natively connected DS8800 storage system, an IBM i
system manages the queue depth of six concurrent I/O operations to a LUN. With the
DS8800 storage system connected through VIOS, the queue depth for a LUN is 32
concurrent I/O operations. Both of these queue depths are modest numbers compared to
other operating systems. Therefore, you must define sufficiently small LUNs for an IBM i
system to not exceed the queue depth with I/O operations.

Also, by considering the manageability and limitations of external storage and an IBM i
system, define LUN sizes of about 70 - 140 GB.

13.3.4 DS8800 extent pools for IBM i workloads


This section describes how to create extent pools and LUNs for an IBM i system and dedicate
or share ranks for IBM i workloads.

Number of extent pools


Create two extent pools for an IBM i workload with each pool in one rank group.

Rotate volumes or rotate extents EAMs for defining IBM i LUNs


IBM i storage management spreads each block of data across multiple LUNs. Therefore, even
if the LUNs are created with the rotate volumes EAM, performing an IBM i I/O operation uses
the disk arms of multiple ranks.

You might think that the rotate volumes EAM for creating IBM i LUNs provides sufficient disk
arms for I/O operations and that the use of the rotate extents EAM is “overvirtualizing”.
However, based on the performance measurements and preferred practices, the rotate
extents EAM of defining LUNs for an IBM i system still provides the preferred performance, so
use it.

Dedicating or sharing the ranks for IBM i workloads


When multiple IBM i logical partitions (LPARs) use disk space on the DS8800 storage
system, there is always a question whether to dedicate ranks (extent pools) to each of them
or to share ranks among the IBM i systems.

Sharing the ranks among the IBM i systems enables the efficient use of the DS8800
resources. However, the performance of each LPAR is influenced by the workloads in the
other LPARs.

For example, two extent pools are shared among IBM i LPARs A, B, and C. LPAR A
experiences a long peak with large blocksizes that causes a high I/O load on the DS8800
ranks. During that time, the performance of B and the performance of C decrease. But, when
the workload in A is low, B and C experience good response times because they can use
most of the disk arms in the shared extent pool. In these periods, the response times in B and
C are possibly better than if they use dedicated ranks.

You cannot predict when the peaks in each LPAR happen, so you cannot predict how the
performance in the other LPARs is influenced.

Many IBM i data centers successfully share the ranks with little unpredictable performance
because the disk arms and cache in the DS8800 storage system are used more efficiently
this way.

Other IBM i data centers prefer the stable and predictable performance of each system even
at the cost of more DS8800 resources. These data centers dedicate extent pools to each of
the IBM i LPARs.

Many IBM i installations have one or two LPARs with important workloads and several smaller,
less important LPARs. These data centers dedicate ranks to the large systems and share the
ranks among the smaller ones.

13.3.5 Disk Magic modeling for an IBM i system


Use Disk Magic to model the IBM i workload on the DS8800 storage system before deciding
which DS8800 configuration to use. The modeling provides the expected IBM i disk response
time, several utilization values on the DS8800 storage system, and the growth of response
time and utilization with IBM i I/O growth.

For more information about the use of Disk Magic with an IBM i system, see 6.1, “Disk Magic”
on page 160 and IBM i and IBM System Storage: A Guide to Implementing External Disks on
IBM i, SG24-7120.

13.4 Preferred practices for implementing IBM i workloads on flash cards in a DS8870 storage system
This section provides some preferred practices about how to implement IBM i workloads on
flash cards in the DS8870 storage system. They include the number and size of LUNs, and
number of adapters in an IBM i system and a DS8870 storage system. The described
preferred practices are based on tests performed at the European Storage Competence
Center during the residency for this book.

13.4.1 Testing environment


The tests for the preferred practices were performed with the following equipment:
򐂰 IBM Power System E870 system:
– IBM i LPAR with eight processors and 24 GB of memory.
– Two ports, each in a separate 16 Gb adapter.
– Two ports, each in a separate 8 Gb adapter.
– IBM i Release 7.2 with Group PTF level 15135.
򐂰 DS8870 storage system
– An extent pool with seven 400 GB flash cards.
– An extent pool with eight 400 GB flash cards.
– One 16 Gb port and one 8 Gb port.
– Different sizes of LUNs, defined from both extent pools.
– The total capacity of LUNs for an IBM i system is about 1.8 TiB.

13.4.2 Workloads for testing
For the testing, we used programs that are written in Report Program Generator (RPG) and
Control Language (CL) making transactions to journaled and non-journaled database files.
The following programs were used:
򐂰 Writefile
– Writes to 24 journaled files simultaneously.
– Sequential / normal ratio of writes is about 50% / 50%.
– 70 million records are written to each file.
– A record is in packed decimal format with a length of 75 bytes specified in definition.
– On each file, we used a command override database file with option Force ratio with
a value of 1000. This value specifies the number of transactions that can occur on
records before those records are forced into disk storage, and thus influences the
transfer size or blocksize of write operations.
– The program results in about 43000 writes/sec with a transfer size 10 KB.
– Writefile lasts about 20 minutes when running on DS8870 flash cards.
򐂰 Readfile
– Reads up to 24 files simultaneously.
– Sequential / normal ratio is about 20% / 80%.
– 70 million records are read from each file.
– A record is in packed decimal format with a length of 75 bytes specified in the
definition.
– On each file, we used a command override database file with option Sequential only
with a value of 30000. This value specifies the number of records transferred as a
group to or from the database. This way, we influence the transfer size of read
operations.
– The program results in about 6700 reads/sec with transfer size of 118 KB.
– Readfile lasts about 24 minutes when running on DS8870 flash cards.
򐂰 Update
– Update consists of the following two programs that run simultaneously:
• In 50 iterations, 1 million times in each iteration generates a random number,
retrieves the record with that key number from a journaled file, and updates the
record.
• In 20 iterations, generates 5 million random numbers in each iteration and writes
them to a non-journaled file.
– A record key is in packed decimal format with a length of 20 bytes specified in the
definition, and the record field is 15 characters in length.
– On each file, we use the command override database file with option Number of
records with a value of 30000. This value specifies the number of records read from or
written to the main storage as a group. This value influences the transfer size of I/O
operations.
– This workload results in about 200 reads/sec and 40000 writes/sec with a blocksize of
about 4 KB.
– Update lasts about 22 minutes when running on DS8870 flash cards.

13.4.3 Testing scenarios
To experience the performance of the described IBM i workloads with different numbers and
sizes of LUNs, and different numbers of paths to an IBM i system, we ran the workloads
Writefile, Readfile, and Update in the following environments:
򐂰 Fifty-six 32.8 GiB LUNs
– Two paths to an IBM i system
– Four paths to an IBM i system
򐂰 Twenty-eight 65.7 GiB LUNs
– Two paths to an IBM i system
– Four paths to an IBM i system
򐂰 Fourteen 131.4 GiB LUNs
– Two paths to an IBM i system
– Four paths to an IBM i system
򐂰 Seven 262.9 GiB LUNs
– Two paths to an IBM i system
– Four paths to an IBM i system

Note:
򐂰 We used the same capacity of about 1840 GiB in each test.
򐂰 For two paths to an IBM i system, we used two ports in different 16 Gb adapters; for four
paths to an IBM i system, we used two ports in different 16 Gb adapters and two ports
in different 8 Gb adapters. Two IBM i ports are in SAN switch zone with one DS8870
port.

13.4.4 Test results


This section describes the results of our tests, including IBM i disk response time and elapsed
times of the workloads. We also provide the comments to the results.

Note: Disk response time in an IBM i system consists of service time (the time for I/O
processing) and wait time (the time of potential I/O queuing in the IBM i host).

Writefile
The graphs in Figure 13-4 show IBM i service time, which is measured in IBM i collection
services data, during the Writefile workload. The graphs in Figure 13-5 on page 429 present
durations of the workload at different scenarios.

To provide a clearer comparison, each figure shows a graph comparing the performance of
two and four paths with the same size of LUNs, and a graph comparing the performance of
different sizes of LUNs with the same number of paths, for service time (Figure 13-4) and for
durations (Figure 13-5).

Figure 13-4 DS8870 HPFE - compare Writefile service time

Figure 13-5 Compare durations of Writefile

Comparing the disk response time of different sizes of LUNs with a given number of paths
shows similar performance.

Elapsed times are similar when using two or four paths, and when using different sizes of
LUNs. A slightly higher duration is experienced with 262.9 GiB LUNs and four paths, although
the service time in this scenario is lower; the cause can be waits other than I/O waits during
the jobs of the workload.

Readfile
The graphs in Figure 13-6 show IBM i disk response time of the Readfile workload, and the
graphs in Figure 13-7 on page 431 show the elapsed times of the workload. The graphs
compare performance with two and four paths at a given LUN size, and performance of
different LUN sizes at a given number of paths.

Note: The wait time is 0 in many of the tests, so we do not show it in the graphs. The only
wait time bigger than 0 is experienced in Readfile with seven 262.9 GiB LUNs; we show it
in the graph in Figure 13-6.

Figure 13-6 DS8870 HPFE - compare Readfile service time and wait time

Figure 13-7 Compare durations of Readfile

You see slightly shorter disk response times when using four paths compared to two paths,
most probably because of the large transfer sizes, which typically require more bandwidth for
good performance. The workload experiences a longer response time when running on
seven 262.9 GiB LUNs, compared to the response time with smaller LUNs.

When using four paths, durations increase as the LUN size increases. With two paths, you
do not see such an increase, which might be because the high utilization of the IBM i and
DS8870 ports masks the differences.

Update
Disk response times and elapsed times of the workload Update are shown in the graphs in
Figure 13-8 and Figure 13-9 on page 433.

Figure 13-8 DS8870 HPFE - compare Update service time

Figure 13-9 Update - compare durations

There is no difference in service time comparing two and four paths, and there is no
difference comparing different LUN sizes.

Using four paths provides shorter elapsed times than using two paths. With two paths, the
262.9 GiB LUNs enable the shortest duration, and differences in duration with four paths are
small.

Disk response times measured in an IBM i system and in a DS8870 storage system
As expected, the response time that is measured in the DS8870 storage system by
Spectrum Control is slightly shorter than the service time that is measured in the IBM i
system, because the host connection accounts for a small fraction of the I/O time. An
example comparing both times for the Update workload is shown in Figure 13-10.

Figure 13-10 Service time in an IBM i system and response time in a DS8870 storage system

13.4.5 Conclusions and recommendations


This section provides conclusions of the tests and recommendations that derive from them.

Number and capacity of LUNs


Based on the elapsed times when using four paths, a larger number of smaller LUNs provides
better performance than a smaller number of larger LUNs. In our tests, the performance
differences are small, but in installations with a much larger total capacity, you can expect
larger differences.

Therefore, use LUN sizes of 50 - 150 GiB on flash storage in the High-Performance Flash
Enclosure (HPFE). The more LUNs that are implemented, the better the potential I/O
performance, so create at least eight LUNs for an IBM i system.

Number of ports in an IBM i system


In most of our tests, using four paths results in shorter durations than using two paths. We
used two ports in the DS8870 storage system for both the two-path and the four-path tests,
so the ports in the DS8870 storage system were highly used, especially during workloads
with large transfer sizes.

Therefore, zone one port in an IBM i system with one port in the DS8870 storage system
when running the workload on flash cards.

Use at least as many ports as recommended in the following IBM publication, taking into
account a maximum of 70% utilization of a port at peak times:
򐂰 Planning quantity of Power Systems Fibre Channel Adapters, found at:
http://www.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/PRS5166

When sizing disk capacity per port, consider that the access density of a workload increases
when the workload is implemented on flash storage.

13.5 Preferred practices for implementing IBM i workloads on flash cards in a DS8886 storage system
This section provides some preferred practices to implement flash storage in a DS8886
storage system for an IBM i workload. They include the number and size of LUNs, number of
adapters in an IBM i system and DS8886 storage system, and sharing or dedicating the
extent pools among multiple hosts. The described preferred practices are based on the tests
that we performed during the residency for this book at the European Storage Competence
Center.

13.5.1 Testing environment


The tests for preferred practices were performed with the following equipment:
򐂰 IBM Power System E870:
– IBM i LPAR with 16 processors and 24 GB of memory.
– Four 16 Gb adapters. One port from each adapter is used.
– IBM i Release 7.2 with Group PTF level 15135.
򐂰 DS8886 storage system:
– Two extent pools, each with eight 400 GB flash drives in RAID 5 and thirty-two 600 GB
15 K RPM disk drives in eight arrays in RAID 5.
– Four 16 Gb ports.
– IBM i LUNs are defined from both extent pools. They are assigned to Tier SSD - flash
storage.
– Different sizes of IBM i LUNs are used.
– The total capacity of the IBM i host is about 2.56 TiB.

13.5.2 Workloads for testing


For the tests on the DS8886 storage system we used the workloads that are described in
13.4.2, “Workloads for testing” on page 426, with the following changes:
򐂰 Writefile writes 100 million records to each of the 24 files.
򐂰 With Readfile, we use override database file option Sequential only with a value of 5000.

The workloads result in the following I/O rates:
򐂰 Writefile: About 63000 IOPS with a blocksize of 10 KB. The sequential / normal ratio is
about 50% / 50%.
򐂰 Readfile: About 27000 IOPS with a blocksize of 95 KB. The sequential / normal ratio is
about 40% / 60%.
򐂰 Update: About 33000 IOPS with 0.6% of reads and a blocksize of 4.2 KB. Sequential /
normal ratio is about 20% / 80%.

13.5.3 Testing scenarios


To test the performance of the described IBM i workloads with different numbers and sizes of
LUNs and different numbers of paths to an IBM i system, we ran the Writefile, Readfile, and
Update workloads in the following environments:
򐂰 Eighty 32.8 GiB LUNs:
– Two paths to an IBM i system
– Three paths to an IBM i system
– Four paths to an IBM i system
򐂰 Twenty 131.4 GiB LUNs
– Two paths to an IBM i system
– Three paths to an IBM i system
– Four paths to an IBM i system
򐂰 Ten 262.9 GiB LUNs
– Two paths to an IBM i system
– Three paths to an IBM i system
– Four paths to an IBM i system

Each path is implemented with a port in 16 Gb adapter in an IBM i system and a port in a
16 Gb adapter in the DS8886 storage system.

13.5.4 Test results


This section provides IBM i disk response time and elapsed times of the workloads in different
environments; it also provides the comments about the results.

Writefile
The graphs in Figure 13-11 and Figure 13-12 on page 438 show IBM i service times and
elapsed times of the workload with different sizes of LUNs and different numbers of paths to
an IBM i system. To provide clearer comparison, we present both a graph comparing
performance of 2, 3, and 4 paths with the same size of LUNs, and a graph comparing
performance of different sizes of LUNs with the same number of paths.

Figure 13-11 Writefile for a DS8886 storage system - compare service time

Figure 13-12 Writefile for a DS8886 storage system - compare elapsed times

There is no difference in service times among the different environments.

Durations show the performance benefit of implementing more paths to an IBM i system.
Although the service times stay the same when more paths are used, the durations are
shorter. The explanation is that the more paths that are used, the more IOPS occur during a
given workload, which results in a shorter elapsed time.

When implementing four paths, a larger number of smaller LUNs shows better performance
than a smaller number of larger LUNs. The difference in the tests is small, but in environments
with larger capacities, the difference might be significant.

Readfile
The graphs in Figure 13-13 and Figure 13-14 on page 440 show IBM i service times and
elapsed times of the workload with different sizes of LUNs and different numbers of paths to
an IBM i system.

Figure 13-13 Readfile of a DS8886 storage system - compare service times

Figure 13-14 Readfile of a DS8886 storage system - compare durations

Service times show a drastic improvement when using three or four paths. With the Readfile
workload, this difference in performance is significant. The reason is the large blocksizes,
which make the workload sensitive to the available bandwidth.

The performance benefit of a larger number of LUNs shows up in the duration times when
using four paths. When using fewer paths, this benefit might be masked by the constraint in
available bandwidth.

Update
The graphs in Figure 13-15 and Figure 13-16 on page 442 show IBM i service times and
elapsed times of the workload with different sizes of LUNs and different numbers of paths to
an IBM i system.

Figure 13-15 Update of a DS8886 storage system - compare service time

Figure 13-16 Update of a DS8886 storage system - compare elapsed times

Clearly, the bandwidth that is provided by two paths limits performance compared to
bandwidth with three and four paths, so there is higher service time and elapsed time when
using two paths. With higher available bandwidth, there is almost no difference in service
times and durations when using different numbers of LUNs.

Both service time and elapsed time show performance benefits when implementing more
paths. However, the number of LUNs has minimal influence on performance.

13.5.5 Conclusions and preferred practices


This section provides conclusions of the tests and preferred practices that derive from them.

Number and capacity of LUNs
For workloads with approximately equal sequential and random I/O rates, a larger available
bandwidth and a larger number of LUNs provide performance benefits compared to a smaller
number of LUNs. With the random-oriented workload, the difference is less visible in the
testing environment, but it might be present in larger environments.

As a preferred practice, use 50 - 150 GiB LUN sizes for a given capacity. Potentially, better
performance is achieved with LUN sizes smaller than 100 GiB. Make sure that you create at
least eight LUNs for an IBM i system.

Number of ports in an IBM i system and a DS8886 storage system


Most of the tests show that the more paths that are used, the better the performance. The test
workloads performed best when using four paths. An environment with four paths also reveals
performance differences in the number of LUNs being used.

To provide sufficient port bandwidth in a DS8886 storage system, zone one port in an IBM i
system with one port in the DS8886 storage system when running the workload on flash
storage.

As a preferred practice, use at least as many ports as are recommended in the IBM
publications that are listed in 13.4.5, “Conclusions and recommendations” on page 434,
taking into account a maximum of 70% utilization of a port in the peaks. The workloads with
an I/O rate of 20000 - 60000 IOPS should experience potentially better performance by using
four paths to the LUNs on flash cards in the DS8886 storage system.

Sharing an extent pool among different IBM i hosts


Based on our experiences with IBM i workload in extent pools of HDDs, and mixed extent
pools of HDDs and solid-state drives (SSDs), isolate an important production workload in a
separate extent pool of flash storage to ensure that all of the resources are available to this
workload all the time.

When implementing smaller and less important IBM i workloads with flash storage, it might be
a good idea to share an extent pool among the hosts.

13.6 Analyzing performance data


For performance issues with IBM i workloads that run on the DS8800 storage system, you
must determine and use the most appropriate performance tools in an IBM i system and in
the DS8800 storage system.

13.6.1 IBM i performance tools


This section presents the performance tools that are used for an IBM i system. It also
indicates which of the tools we employ with the planning and implementation of the DS8800
storage system for an IBM i system in this example.

To help you better understand the tool functions, they are divided into two groups:
performance data collectors (the tools that collect performance data) and performance data
investigators (the tools to analyze the collected data).

The following tools are the IBM i performance data collectors:
򐂰 Collection Services
򐂰 IBM i Job Watcher
򐂰 IBM i Disk Watcher
򐂰 Performance Explorer (PEX)

Collectors can be managed by IBM System Director Navigator for i, IBM System i Navigator,
or IBM i commands.

The following tools are or contain the IBM i performance data investigators:
򐂰 IBM Performance Tools for i
򐂰 IBM System Director Navigator for i
򐂰 iDoctor

Most of these comprehensive planning tools address the entire spectrum of workload
performance on System i, including processor, system memory, disks, and adapters. To plan
or analyze performance for the DS8800 storage system with an IBM i system, use the parts of
the tools or their reports that show the disk performance.

Collection Services
The major IBM i performance data collector is called Collection Services. It is designed to run
all the time to provide data for performance health checks, for analysis of a sudden
performance problem, or for planning new hardware and software upgrades. The tool is
documented in detail in the IBM i IBM Knowledge Center:
http://www.ibm.com/support/knowledgecenter/ssw_ibm_i/welcome

Collection Services is a sample-based engine (usually 5 - 15-minute intervals) that looks at
jobs, threads, processor, disk, and communications. It also has a set of specific statistics for
the DS8800 storage system. For example, it shows which disk units are the DS8800 LUNs,
whether they are connected in a single path or multipath, the disk service time, and wait time.

The following tools can be used to manage the data collection and report creation of
Collection Services:
򐂰 IBM System i Navigator
򐂰 IBM System Director navigator
򐂰 IBM Performance Tools for i

iDoctor Collection Service Investigator can be used to create graphs and reports based on
Collection Services data. For more information about iDoctor, see the IBM i iDoctor online
documentation at the following website:
https://www.ibm.com/i_dir/idoctor.nsf/documentation.html

With IBM i level V7R1, the Collection Services tool offers additional data collection categories,
including a category for external storage. This category supports the collection of
nonstandard data that is associated with certain external storage subsystems that are
attached to an IBM i partition. This data can be viewed within iDoctor, which is described in
“iDoctor” on page 446.

Job Watcher
Job Watcher is an advanced tool for collecting and analyzing performance information to help
you effectively monitor your system or to analyze a performance issue. It is job-centric and
thread-centric and can collect data at intervals of seconds. The collection contains vital
information, such as job processor and wait statistics, call stacks, SQL statements, objects
waited on, sockets, and TCP. For more information about Job Watcher, see the IBM i IBM
Knowledge Center:
http://www.ibm.com/support/knowledgecenter/ssw_ibm_i/welcome

Another resource is “Web Power - New browser-based Job Watcher tasks help manage your
IBM i performance” in the IBM Systems Magazine (on the IBM Systems Magazine page,
search for the title of the article):
http://www.ibmsystemsmag.com/ibmi

Disk Watcher
Disk Watcher is a function of an IBM i system that provides disk data to help identify the
source of disk-related performance problems on the IBM i platform. It can either collect
information about every I/O in trace mode or collect information in buckets in statistics mode.
In statistics mode, it can run more often than Collection Services to see more granular
statistics. The command strings and file layouts are documented in the IBM i IBM Knowledge
Center:
http://www.ibm.com/support/knowledgecenter/ssw_ibm_i/welcome

For more information about the use of Disk Watcher, see “A New Way to Look at Disk
Performance” and “Analyzing Disk Watcher Data” in the IBM Systems Magazine (on the IBM
Systems Magazine page, search for the title of the article):
http://www.ibmsystemsmag.com/ibmi

Disk Watcher gathers detailed information that is associated with I/O operations to disk units,
and provides data beyond the data that is available in other IBM i integrated tools, such as
Work with Disk Status (WRKDSKSTS), Work with System Status (WRKSYSSTS), and Work with
System Activity (WRKSYSACT).
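
As a rough sketch only, a Disk Watcher collection is driven by its CL interface. The command
names that are shown here (ADDDWDFN, STRDW, and ENDDW) and the elided parameters are
assumptions to be verified against the Knowledge Center documentation referenced above;
MYDWDFN is a placeholder name:

/* Define what to watch (statistics or trace mode)                      */
ADDDWDFN DFN(MYDWDFN) ...

/* Start the collection that uses this definition, and end it later     */
STRDW DFN(MYDWDFN) ...
ENDDW ...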

Performance Explorer
PEX is a data collection tool in the IBM i system that collects information about a specific
system process or resource to provide detailed insight. PEX complements IBM i Collection
Services.

An example of PEX, the IBM i system, and connection with external storage is identifying the
IBM i objects that are most suitable to relocate to SSDs. You can use PEX to collect IBM i disk
events, such as synchronous and asynchronous reads, synchronous and asynchronous
writes, page faults, and page-outs. The collected data is then analyzed by the iDoctor tool
PEX-Analyzer to observe the I/O rates and disk service times of different objects. The objects
that experience the highest accumulated read service time, the highest read rate, and a
modest write rate at the same time are good candidates to relocate to SSDs.

For a better understanding of IBM i architecture and I/O rates, see 13.1, “IBM i storage
architecture” on page 416. For more information about using SSDs with the IBM i system, see
13.7, “Easy Tier with the IBM i system” on page 448.



IBM Performance Tools for i
IBM Performance Tools for i is a tool to manage data collection, analyze data, and print
reports to help you identify and correct performance problems. Performance Tools helps you
gain insight into IBM i performance features, such as dynamic tuning, expert cache, job
priorities, activity levels, and pool sizes. You can also identify ways to use these services
better. The tool also provides analysis of collected performance data and produces
conclusions and recommendations to improve system performance.

The Job Watcher part of Performance Tools analyzes the Job Watcher data through the IBM
Systems Director Navigator for i Performance Data Visualizer.

Collection Services reports about disk utilization and activity, which are created with
IBM Performance Tools for i, are used for sizing and Disk Magic modeling of the DS8800
storage system for the IBM i system:
򐂰 The Disk Utilization section of the System report
򐂰 The Disk Utilization section of the Resource report
򐂰 The Disk Activity section of the Component report

IBM Systems Director Navigator for i


The IBM Systems Director Navigator for i is a web-based console that provides a single,
easy-to-use view of the IBM i system. IBM Systems Director Navigator provides a strategic
tool for managing a specific IBM i partition.

The Performance section of IBM Systems Director Navigator for i provides tasks to manage
the collection of performance data and view the collections to investigate potential
performance issues. Figure 13-17 shows the menu of performance functions in the IBM
Systems Director Navigator for i.

Figure 13-17 Performance tools of Systems Director Navigator for i

iDoctor
iDoctor is a suite of tools that is used to manage the collection of data, investigate
performance data, and analyze performance data on the IBM i system. The goals of iDoctor
are to broaden the user base for performance investigation, simplify and automate processes
of collecting and investigating the performance data, provide immediate access to collected
data, and offer more analysis options.

The iDoctor tools are used to monitor the overall system health at a high level or to drill down
to the performance details within jobs, disk units, and programs. Use iDoctor to analyze data
that is collected during performance situations. iDoctor is frequently used by IBM, clients, and
consultants to help solve complex performance issues quickly.



One example of using iDoctor PEX-Analyzer is to determine the IBM i objects that are
candidates to relocate to SSD. To do so, complete the following steps:
1. In iDoctor, you start the tool PEX-Analyzer as shown in Figure 13-18.

Figure 13-18 Start the iDoctor PEX Analyzer

2. Select the PEX collection on which you want to work and select the type of graph that you
want to create, as shown in Figure 13-19.

Figure 13-19 iDoctor - select the query



Figure 13-20 illustrates an example of the graph that shows the accumulated read disk
service time on IBM i objects. The objects with the highest accumulated read service time
are good candidates to relocate to SSD. For more information about moving IBM i data to
SSD, see 13.7.2, “IBM i methods for hot-spot management” on page 449.

Figure 13-20 iDoctor - query of I/O read times by object

13.6.2 DS8800 performance tools


As a preferred practice, use Tivoli Storage Productivity Center for Disk for analyzing the
DS8800 performance data for an IBM i workload. For more information about this product,
see Chapter 7, “Practical performance management” on page 221.

13.6.3 Periods and intervals of collecting data


For a successful performance analysis, ensure that the IBM i data and DS8800 data are
collected during the same periods, and if possible, with the same collection intervals.

13.7 Easy Tier with the IBM i system


This section describes how to use Easy Tier with the IBM i system.

13.7.1 Hot data in an IBM i workload


An important feature of the IBM i system is object-based architecture. Everything on the
system that can be worked with is considered an object. An IBM i library is an object. A
database table, an index file, a temporary space, a job queue, and a user profile are objects.
The intensity of I/Os is split by objects. The I/O rates are high on busy objects, such as
application database files and index files. The I/Os rates are lower on user profiles.

IBM i Storage Manager spreads the IBM i data across the available disk units (LUNs) so that
each disk drive is about equally occupied. The data is spread in extents that are 4 KB - 1 MB
or even 16 MB. The extents of each object usually span as many LUNs as possible to provide
many volumes to serve the particular object. Therefore, if an object experiences a high I/O
rate, this rate is evenly split among the LUNs. The extents that belong to the particular object
on each LUN are I/O-intense.



Many of the IBM i performance tools work on the object level; they show different types of
read and write rates on each object and disk service times on the objects. For more
information about the IBM i performance tools, see 13.6.1, “IBM i performance tools” on
page 443. You can relocate hot data by objects by using the Media preference method, which
is described in “IBM i Media preference” on page 449.

Also, the Easy Tier tool monitors and relocates data on the 1 GB extent level. IBM i ASP
balancing, which is used to relocate data to SSDs, works on the 1 MB extent level. Monitoring
extents and relocating extents do not depend on the object to which the extents belong; they
occur on the subobject level.

13.7.2 IBM i methods for hot-spot management


You can choose your tools for monitoring and moving data to faster drives in the DS8800
storage system. Because the IBM i system recognizes the LUNs on SSDs in a natively
connected DS8800 storage system, you can use the IBM i tools for monitoring and relocating
hot data. The following IBM i methods are available:
򐂰 IBM i Media preference
򐂰 ASP balancing
򐂰 A process where you create a separate ASP with LUNs on SSDs and restore the
application to run in the ASP

IBM i Media preference


This method provides monitoring capability and the relocation of the data on the object level.
You are in control. You decide the criteria for which objects are hot and you control which
objects to relocate to the SSDs. Some clients prefer Media preference to Easy Tier or ASP
balancing. IBM i Media preference involves the following steps:
1. Use PEX to collect disk events
Carefully decide which disk events to trace. In certain occasions, you might collect read
disk operations only, which might benefit the most from improved performance, or you
might collect both disk read and write information for future decisions. Carefully select the
peak period in which the PEX data is collected.
2. Examine the collected PEX data by using the PEX-Analyzer tool of iDoctor or by using the
user-written queries to the PEX collection.
It is a good idea to examine the accumulated read service time and the read I/O rate on
specific objects. The objects with the highest accumulated read service time and the
highest read I/O rate can be selected to relocate to SSDs. It is also helpful to analyze the
write operations on particular objects. You might decide to relocate the objects with high
read I/O and rather modest write I/O to SSD to benefit from lower disk service times and
wait times. Sometimes, you must distinguish among the types of read and write operations
on the objects. iDoctor queries provide the rates of asynchronous and synchronous
database reads and page faults.



Figure 13-21 shows an example of a graph that shows the accumulated read and write
service times on IBM i objects. This graph was created by the iDoctor PEX-Analyzer query
I/O Times by Object on PEX data collected during the IBM i workload, as described in
13.7.3, “Skew level of an IBM i workload” on page 452. The objects are ordered by the
sum of accumulated read and write service times. The read service time on the objects is
much higher than the write service time, so the objects might be good candidates to
relocate to SSD.

Figure 13-21 Accumulated read and write service time on objects

In certain cases, queries must be created to run on the PEX collection to provide specific
information, for example, the query that provides information about which jobs and threads
use the objects with the highest read service time. You might also need to run a query to
provide the blocksizes of the read operations because you expect that the reads with
smaller blocksizes profit the most from SSDs. If these queries are needed, contact IBM
Lab Services to create them.
3. Based on the PEX analysis, decide which database objects to relocate to the SSD in the
DS8800 storage system. Then, use IBM i commands such as Change Physical File
(CHGPF) with the UNIT(*SSD) parameter, or use the SQL statement ALTER TABLE UNIT SSD.
Both set a preferred media attribute on the object, which starts dynamic data movement
(see the command sketch after this list). The preferred media attribute can be set on
database tables and indexes, and on User-Defined File Systems (UDFS).
For more information about the UDFS, see the IBM i IBM Knowledge Center:
http://www.ibm.com/support/knowledgecenter/ssw_ibm_i/welcome
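
For illustration, the two variants from step 3 look like the following sketch, where MYLIB,
MYFILE, and MYTABLE are placeholder names:

/* CL: flag a physical file for placement on SSD                        */
CHGPF FILE(MYLIB/MYFILE) UNIT(*SSD)

-- SQL: set the same preferred media attribute on a table
ALTER TABLE MYLIB.MYTABLE UNIT SSD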

ASP balancing
This IBM i method is similar to DS8800 Easy Tier because it is based on the data movement
within an ASP by IBM i ASP balancing. The ASP balancing function is designed to improve
IBM i system performance by balancing disk utilization across all of the disk units (or LUNs) in
an ASP. It provides four ways to balance an ASP. Two of these ways relate to data relocation
to SSDs:
򐂰 Hierarchical Storage Management (HSM) balancing
򐂰 Media preference balancing



The HSM balancer function, which traditionally supports data migration between
high-performance and low-performance internal disk drives, is extended for the support of
data migration between SSDs and HDDs. The disk drives can be internal or on the DS8800
storage system. The data movement is based on the weighted read I/O count statistics for
each 1 MB extent of an ASP. Data monitoring and relocation is achieved by the following two
steps:
1. Run the ASP balancer tracing function during the important period by using the TRCASPBAL
command. This function collects the relevant data statistics.
2. By using the STRASPBAL TYPE(*HSM) command, you move the data to SSD and HDD based
on the statistics that you collected in the previous step.

The Media preference balancer function is the ASP balancing function that helps to correct
any issues with Media preference-flagged database objects or UDFS files not on their
preferred media type, which is either SSD or HDD, based on the specified subtype parameter.

The function is started by the STRASPBAL TYPE(*MP) command with the SUBTYPE parameter
equal to either *CALC (for data migration to both SSD and HDD), *SSD, or *HDD.

ASP balancer migration priority is an option in the ASP balancer so that you can specify the
migration priority for certain balancing operations, including *HSM or *MP in levels of either
*LOW, *MEDIUM, or *HIGH, thus influencing the speed of data migration.
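
The following sketch combines the trace and balance steps for a hypothetical ASP 2. The
parameter names and values are illustrative and should be checked against your IBM i release:

/* Collect balancer statistics during the important period              */
TRCASPBAL SET(*ON) ASP(2) TIMLMT(*NOMAX)
/* ... peak workload runs ...                                            */
TRCASPBAL SET(*OFF) ASP(2)

/* Relocate data between SSD and HDD based on the collected statistics  */
STRASPBAL TYPE(*HSM) ASP(2) TIMLMT(*NOMAX)

/* Or correct the placement of objects with a preferred media attribute */
STRASPBAL TYPE(*MP) SUBTYPE(*CALC) ASP(2) TIMLMT(*NOMAX)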

Location: For data relocation with Media preference or ASP balancing, the LUNs defined
on SSD and on HDD must be in the same IBM i ASP. It is not necessary that they are in the
same extent pool in the DS8800 storage system.

A dedicated ASP for SSD LUNs


This method is suitable for the installations that use a library (or multiple libraries) with files
that are all heavily used and critical for the workload performance. It is simple to implement
and is suitable also for storage systems other than the DS8800 storage system with which
IBM i Media preference and ASP balancing cannot be used.

The method requires that you create a separate ASP that contains LUNs that are on the
DS8800 SSD and then save the relevant IBM i libraries and restore them to the ASP with
SSD. All the files in the libraries then are on SSDs, and the performance of the applications
that use these files improves.

Additional information
For more information about the IBM i methods for SSD hot-spot management, including the
information about IBM i prerequisites, see the following documents:
򐂰 IBM i 7.1 Technical Overview with Technology Refresh Updates, SG24-7858
򐂰 Performance Value of Solid-State Drives using IBM i, found at:
http://www.ibm.com/systems/resources/ssd_ibmi.pdf



13.7.3 Skew level of an IBM i workload
IBM i clients that are new to the DS8800 storage system might consider the configuration of
mixed SSDs and Enterprise drives of 15 K RPM or 10 K RPM, or they might select the
configuration of mixed SSDs, Enterprise drives, and Nearline drives. Clients that already run
their workloads on the DS8800 storage system with HDDs might purchase additional SSDs to
improve performance. Other clients might decide to add both SSDs and Nearline disks to get
the most efficient configuration for both performance and capacity.

Before deciding on a mixed SSD and HDD environment or deciding to obtain additional
SSDs, consider these questions:
򐂰 How many SSDs do you need to install to get the optimal balance between the
performance improvement and the cost?
򐂰 What is the estimated performance improvement after you install the SSDs?

The clients that use IBM i Media preference get at least a partial answer to these questions
from the collected PEX data by using queries and calculations. The clients that decide on
DS8800 Easy Tier or even IBM i ASP balancing get the key information to answer these
questions by the skew level of their workloads. The skew level describes how the I/O activity is
distributed across the capacity for a specific workload. The workloads with the highest skew
level (heavily skewed workloads) benefit most from the Easy Tier capabilities because even
when moving a small amount of data, the overall performance improves. For more information
about the skew level, see 6.4, “Disk Magic Easy Tier modeling” on page 208. You can obtain
the skew level only for the workloads that run on the DS8800 storage system with Easy Tier.

To provide an example of the skew level of a typical IBM i installation, we use the IBM i
benchmark workload, which is based on the workload TPC-E. TPC-E is an online
transaction processing (OLTP) workload developed by the Transaction Processing Performance Council (TPC).
It uses a database to model a brokerage firm with customers who generate transactions that
are related to trades, account inquiries, and market research. The brokerage firm in turn
interacts with financial markets to run orders on behalf of the customers and updates relevant
account information. The benchmark workload is scalable, which means that the number of
customers who are defined for the brokerage firm can be varied to represent the workloads of
different-sized businesses. The workload runs with a configurable number of job sets. Each
job set runs independently, generating its own brokerage firm next transaction. By increasing
the number of job sets, you increase the throughput and processor utilization of the run.

In our example, we used the following configuration for Easy Tier monitoring for which we
obtained the skew level:
򐂰 IBM i LPAR with eight processing units and 60 GB memory in POWER7 model 770.
򐂰 Disk space for the IBM i provided from an extent pool with four ranks of HDD in DS8800
code level 6.2.
򐂰 Forty-eight LUNs of the size 70 GB used for the IBM i system (the LUNs are defined in the
rotate extents EAM from the extent pool).
򐂰 The LUNs are connected to the IBM i system in Multipath through two ports in separate
4 Gb FC adapters.
򐂰 In the IBM i LPAR, we ran the following two workloads in turn:
– The workload with six database instances and six job sets
– The workload with six database instances and three job sets

The workload with six database instances was used to achieve 35% occupation of the disk
space. During the run with six job sets, the access density was about 2.7 IO/sec/GB. During
the run with three job sets, the access density was about 0.3 IO/sec/GB.



Figure 13-22 and Figure 13-23 show the skew level of the IBM i workload from the Easy Tier
data collected during 24 hours. On Figure 13-22, you can see the percentage of reads with
small blocksizes and the percentage of transferred MB by the percentage of occupied disk
space. The degree of skew is small because of the efficient spreading of data across
available LUNs by IBM i storage management, which is described in 13.7.1, “Hot data in an
IBM i workload” on page 448.

Figure 13-22 Skew level of the IBM i workload on active data

Figure 13-23 Skew level of the IBM i workload on allocated extents



As shown in Figure 13-23 on page 453, the degree of skew for the same workload on all
allocated extents is higher because only 35% of the available disk space is occupied by IBM i
data.

13.7.4 Using Easy Tier with the IBM i system


IBM i installations that run their workloads in hybrid pools in the DS8800 storage system
benefit from Easy Tier in both planning for performance and improving the performance of the
workload. In the planning phase, you can use Storage Tier Advisory Tool (STAT) to help you
to determine the configuration and predict the performance improvement of the IBM i
workload that runs the hybrid pool. Use this information with the skew level information
described in 13.7.3, “Skew level of an IBM i workload” on page 452. In the production phase,
Easy Tier relocates the IBM i data to SSDs and Nearline disks to achieve the best
performance and the most efficient spread of capacity.

Storage Tier Advisory Tool


The STAT used on the IBM i data that is monitored by Easy Tier provides the heat map of the
workload. The heat map shows the amount and distribution of the hot data across LUNs in
the hybrid pool. For more information about the STAT, see 6.5, “Storage Tier Advisor Tool” on
page 213.

To use the STAT for an IBM i workload, complete the following steps:
1. Enable the collection of the heat data I/O statistics by changing the Easy Tier monitor
parameter to all or automode. Use the DS8800 command-line interface (DSCLI)
command chsi -etmonitor all or chsi -etmonitor automode. The parameter -etmonitor
all enables monitoring on all LUNs in the DS8800 storage system. The parameter
-etmonitor automode monitors the volumes that are managed by Easy Tier automatic
mode only.
2. Offload the collected data from the DS8800 clusters to the user workstation. Use either
the DS8800 Storage Manager GUI or the DSCLI command offloadfile -etdata
<directory>, where <directory> is the directory where you want to store the files with the
data on your workstation.
3. The offloaded data is stored in the following files in the specified directory:
– SF75NT100ESS01_heat.data
– SF75NT100ESS11_heat.data
The variable 75NT100 denotes the serial number of the DS8800 storage facility. The
variables ESS01 and ESS11 are the processor complexes 0 and 1.
4. Use the STAT on your workstation to create the heat distribution report that can be
presented in a web browser.
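
Steps 1 and 2 map to DSCLI commands like the ones in the following sketch. The storage image
ID follows the example serial number that is used in this section, and the target directory
is a placeholder:

dscli> chsi -etmonitor all IBM.2107-75NT100
dscli> offloadfile -etdata C:\temp\etdata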

Figure 13-24 on page 455 shows an example of the STAT heat distribution on IBM i LUNs
after running the IBM i workload described in 13.7.3, “Skew level of an IBM i workload” on
page 452. The hot and warm data is evenly spread across the volumes, which is typical for an
IBM i workload distribution.



Figure 13-24 Heat distribution for IBM i workload

Relocating the IBM i data with Easy Tier


The Easy Tier relocation of IBM i data starts after a 24-hour learning period. The data is
moved to SSDs or Nearline disks in 1 GB extents.

An IBM i client can also use the IBM i Media preference or ASP balancing method for hot-spot
management. It is not the goal to compare the performance for the three relocation methods.
However, do not expect much difference in performance by using one or another method.
Factors such as ease of use, consolidation of the management method, or control over which
data to move, are more important for an IBM i client to decide which method to use.

13.8 I/O Priority Manager with the IBM i system


It is common to use one storage system to serve many categories of workloads with different
characteristics and requirements. I/O Priority Manager, which is a feature in DS8800 and
DS8700 storage systems, enables more effective storage performance management by
prioritizing access to storage system resources for separate workloads. For more information
about I/O Priority Manager, see 1.3.5, “I/O Priority Manager” on page 17.

Many IBM i clients run multiple IBM i workloads in different POWER partitions that share the
disk space in the DS8800 storage system. The installations run important production systems
and less important workloads for testing and developing. The other partitions can be used as
disaster recovery targets of production systems in another location. Assume that IBM i
centers with various workloads that share the DS8800 disk space use I/O Priority Manager to
achieve a more efficient spread of storage resources.

Here is a simple example of using the I/O Priority Manager for two IBM i workloads.

The POWER partition ITSO_1 is configured with four processor units, 56 GB memory, and
forty-eight 70 GB LUNs in a DS8800 extent pool with Enterprise drives.

The partition ITSO-2 is configured with one processor unit, 16 GB memory, and thirty-two
70 GB LUNs in a shared hybrid extent pool with all SSDs, Enterprise drives, and Nearline
disk drives.

Both extent pools are managed by DS8800 Easy Tier.



In this example, we set up the I/O Priority Manager Performance Group 1 (PG1) for the
volumes of ITSO_1 by using the following DSCLI command:
chfbvol -perfgrp pg1 2000-203f

Performance Group 1 is defined for the 64 LUNs, but only 48 LUNs from these 64 LUNs are
added to the ASP and used by the system ITSO_1; the other LUNs remain in non-configured
status on the IBM i system.

We set up the I/O Priority Manager Performance Group 11 (PG11) for the volumes of ITSO_2
by using the following DSCLI command:
chfbvol -perfgrp pg11 2200-221f
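
To verify an assignment before the workload is started, the volume attributes can be
displayed, for example, with the showfbvol command. We assume here that the performance
group is included in its output; the exact fields depend on the DSCLI level:

dscli> showfbvol 2200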

After we define the performance groups for the IBM i LUNs, we ran the IBM i benchmark
workload described in 13.7.3, “Skew level of an IBM i workload” on page 452, with 40 job sets
in each of the ITSO_1 and ITSO_2 partitions.

After the workload finished, we obtained the monitoring reports of each performance group,
PG1 with the LUNs of ITSO_1 and PG11 with the LUNs of ITSO_2, during the 5-hour
workload run with 15-minute monitoring intervals.

Figure 13-25 and Figure 13-26 on page 457 show the DSCLI commands that we used to
obtain the reports and the displayed performance values for each performance group. The
workload in Performance Group PG11 shows different I/O characteristics than the workload in
Performance Group 1. Performance Group PG11 also experiences relatively high response
times compared to Performance Group 1. In our example, the workload characteristics and
response times are influenced by the different priority groups, types of disk drives used, and
Easy Tier management.

dscli> lsperfgrprpt -start 7h -stop 1h -interval 15m PG1


Date/Time: December 3, 2011 10:39:30 CET PM IBM DSCLI Version: 7.6.10.511 DS: IBM.2107-75TV181
time grp resrc avIO avMB avresp pri avQ tgtQ %hlpT %dlyT %impt
===========================================================================================================
2011-12-03/15:45:00 IBM.2107-75TV181/PG1 IBM.2107-75TV181 0 0.000 0.000 1 0 70 0 0 0
2011-12-03/16:00:00 IBM.2107-75TV181/PG1 IBM.2107-75TV181 5641 76.222 1.310 1 85 70 2 0 0
2011-12-03/16:15:00 IBM.2107-75TV181/PG1 IBM.2107-75TV181 3254 35.148 0.709 1 108 70 0 0 0
2011-12-03/16:30:00 IBM.2107-75TV181/PG1 IBM.2107-75TV181 2167 19.292 1.026 1 102 70 0 0 0
2011-12-03/16:45:00 IBM.2107-75TV181/PG1 IBM.2107-75TV181 1879 15.958 0.885 1 107 70 0 0 0
2011-12-03/17:00:00 IBM.2107-75TV181/PG1 IBM.2107-75TV181 1913 16.794 0.896 1 104 70 0 0 0
2011-12-03/17:15:00 IBM.2107-75TV181/PG1 IBM.2107-75TV181 1925 17.063 0.884 1 108 70 0 0 0
2011-12-03/17:30:00 IBM.2107-75TV181/PG1 IBM.2107-75TV181 1998 17.684 1.084 1 106 70 0 0 0
2011-12-03/17:45:00 IBM.2107-75TV181/PG1 IBM.2107-75TV181 1997 17.648 1.101 1 106 70 0 0 0
2011-12-03/18:00:00 IBM.2107-75TV181/PG1 IBM.2107-75TV181 1991 17.639 1.110 1 106 70 0 0 0
2011-12-03/18:15:00 IBM.2107-75TV181/PG1 IBM.2107-75TV181 2037 18.127 1.308 1 104 70 0 0 0
2011-12-03/18:30:00 IBM.2107-75TV181/PG1 IBM.2107-75TV181 2069 18.473 1.345 1 104 70 0 0 0
2011-12-03/18:45:00 IBM.2107-75TV181/PG1 IBM.2107-75TV181 2092 18.947 1.265 1 104 70 1 0 0
2011-12-03/19:00:00 IBM.2107-75TV181/PG1 IBM.2107-75TV181 2109 19.286 1.253 1 105 70 0 0 0
2011-12-03/19:15:00 IBM.2107-75TV181/PG1 IBM.2107-75TV181 2132 19.498 1.373 1 104 70 0 0 0
2011-12-03/19:30:00 IBM.2107-75TV181/PG1 IBM.2107-75TV181 2150 19.734 1.391 1 103 70 0 0 0
2011-12-03/19:45:00 IBM.2107-75TV181/PG1 IBM.2107-75TV181 2166 19.849 1.514 1 104 70 0 0 0
2011-12-03/20:00:00 IBM.2107-75TV181/PG1 IBM.2107-75TV181 2202 20.330 1.326 1 104 70 0 0 0
2011-12-03/20:15:00 IBM.2107-75TV181/PG1 IBM.2107-75TV181 2245 20.976 1.419 1 102 70 0 0 0
2011-12-03/20:30:00 IBM.2107-75TV181/PG1 IBM.2107-75TV181 2699 25.327 1.386 1 97 70 0 0 0
2011-12-03/20:45:00 IBM.2107-75TV181/PG1 IBM.2107-75TV181 1552 17.402 5.481 1 34 70 4 0 0
2011-12-03/21:00:00 IBM.2107-75TV181/PG1 IBM.2107-75TV181 1 0.024 0.250 1 100 70 0 0 0
2011-12-03/21:15:00 IBM.2107-75TV181/PG1 IBM.2107-75TV181 1 0.025 0.280 1 100 70 0 0 0
2011-12-03/21:30:00 IBM.2107-75TV181/PG1 IBM.2107-75TV181 1 0.025 0.271 1 101 70 0 0 0

Figure 13-25 I/O Priority Management performance values for ITSO_1



dscli> lsperfgrprpt -start 7h -stop 1h -interval 15m PG11
Date/Time: December 3, 2011 10:40:10 CET PM IBM DSCLI Version: 7.6.10.511 DS: IBM.2107-75TV181
time grp resrc avIO avMB avresp pri avQ tgtQ %hlpT %dlyT %imp
=============================================================================================================
2011-12-03/15:45:00 IBM.2107-75TV181/PG11 IBM.2107-75TV181 74 0.879 1.286 15 114 0 0 0 0
2011-12-03/16:00:00 IBM.2107-75TV181/PG11 IBM.2107-75TV181 9040 101.327 5.499 15 25 0 1 0 101
2011-12-03/16:15:00 IBM.2107-75TV181/PG11 IBM.2107-75TV181 14933 158.168 2.153 15 30 0 0 0 100
2011-12-03/16:30:00 IBM.2107-75TV181/PG11 IBM.2107-75TV181 15288 154.254 1.431 15 37 0 0 0 0
2011-12-03/16:45:00 IBM.2107-75TV181/PG11 IBM.2107-75TV181 16016 159.503 1.364 15 40 0 0 0 100
2011-12-03/17:00:00 IBM.2107-75TV181/PG11 IBM.2107-75TV181 16236 158.533 1.193 15 43 0 0 0 100
2011-12-03/17:15:00 IBM.2107-75TV181/PG11 IBM.2107-75TV181 17980 169.486 1.302 15 34 0 0 0 0
2011-12-03/17:30:00 IBM.2107-75TV181/PG11 IBM.2107-75TV181 18711 170.260 1.061 15 43 0 0 0 0
2011-12-03/17:45:00 IBM.2107-75TV181/PG11 IBM.2107-75TV181 19190 171.624 1.061 15 45 0 0 0 0
2011-12-03/18:00:00 IBM.2107-75TV181/PG11 IBM.2107-75TV181 19852 173.986 1.051 15 44 0 0 0 0
2011-12-03/18:15:00 IBM.2107-75TV181/PG11 IBM.2107-75TV181 20576 177.721 0.952 15 39 0 0 0 0
2011-12-03/18:30:00 IBM.2107-75TV181/PG11 IBM.2107-75TV181 21310 180.309 0.827 15 50 0 0 0 0
2011-12-03/18:45:00 IBM.2107-75TV181/PG11 IBM.2107-75TV181 21712 184.416 1.008 15 43 0 0 0 100
2011-12-03/19:00:00 IBM.2107-75TV181/PG11 IBM.2107-75TV181 22406 188.075 0.972 15 38 0 0 0 0
2011-12-03/19:15:00 IBM.2107-75TV181/PG11 IBM.2107-75TV181 23120 190.645 0.717 15 44 0 0 0 0
2011-12-03/19:30:00 IBM.2107-75TV181/PG11 IBM.2107-75TV181 23723 193.182 0.718 15 46 0 0 0 0
2011-12-03/19:45:00 IBM.2107-75TV181/PG11 IBM.2107-75TV181 24607 198.093 0.734 15 35 0 0 0 0
2011-12-03/20:00:00 IBM.2107-75TV181/PG11 IBM.2107-75TV181 24714 198.473 0.754 15 42 0 0 0 0
2011-12-03/20:15:00 IBM.2107-75TV181/PG11 IBM.2107-75TV181 25256 201.734 0.778 15 49 0 0 0 0
2011-12-03/20:30:00 IBM.2107-75TV181/PG11 IBM.2107-75TV181 25661 202.733 0.727 15 50 0 0 0 0
2011-12-03/20:45:00 IBM.2107-75TV181/PG11 IBM.2107-75TV181 1729 15.483 6.485 15 41 0 2 3 164
2011-12-03/21:00:00 IBM.2107-75TV181/PG11 IBM.2107-75TV181 4 0.019 0.707 15 94 0 0 0 0
2011-12-03/21:15:00 IBM.2107-75TV181/PG11 IBM.2107-75TV181 11 0.049 0.431 15 99 0 0 0 0
2011-12-03/21:30:00 IBM.2107-75TV181/PG11 IBM.2107-75TV181 7 0.040 0.702 15 96 0 0 0 0

Figure 13-26 I/O Priority Management performance values for ITSO_2

For more information about IBM i performance with the DS8000 I/O Priority Manager, see
IBM i Shared Storage Performance using IBM System Storage DS8000 I/O Priority Manager,
WP101935, which is available in the IBM Techdoc library.



Chapter 14. Performance considerations for IBM z Systems servers

This chapter covers some z Systems specific performance topics. First, it explains how you
can collect and interpret z Systems I/O performance data from the Resource Measurement
Facility (RMF). Next, it provides a section with hints, tips, and preferred practices to avoid
potential bottlenecks up front. The last part describes a strategy to isolate bottlenecks and
develop strategies to avoid or eliminate them.

A note about DS8000 sizing: z Systems I/O workload is complex. Use RMF data and
Disk Magic models for sizing. For more information about Disk Magic and how to get a
model, see 6.1, “Disk Magic” on page 160.

IBM z Systems and the IBM System Storage DS8000 storage systems family have a long
common history. Numerous features were added to the whole stack of server and storage
hardware, firmware, operating systems, and applications to improve I/O performance. This
level of synergy is unique to the market and is possible only because IBM is the owner of the
complete stack. This chapter does not describe these features because they are explained in
other places:
򐂰 For an overview of z Systems synergy features, see “Performance characteristics for z
Systems” on page 10.
򐂰 For a detailed description of these features, see IBM DS8870 and IBM z Systems
Synergy, REDP-5186.
򐂰 The current zSynergy features are also explained in detail in Get More Out of Your IT
Infrastructure with IBM z13 I/O Enhancements, REDP-5134.

This chapter includes the following topics:


򐂰 DS8000 performance monitoring with RMF
򐂰 DS8000 and z Systems planning and configuration
򐂰 Problem determination and resolution



14.1 DS8000 performance monitoring with RMF
This section describes how to gather and interpret disk I/O related performance data from
Resource Measurement Facility (RMF), which is part of the z/OS operating system. It provides
performance information for the DS8000 storage system and other storage systems. RMF can
help with monitoring the following performance components:
򐂰 I/O response time
򐂰 IOP/SAP
򐂰 FICON host channel
򐂰 FICON director
򐂰 Symmetric multiprocessing (SMP)
򐂰 Cache and nonvolatile storage (NVS)
򐂰 Enterprise Storage Servers
– FICON/Fibre port and host adapter (HA)
– Extent pool and rank/array

Sample performance measurement data: Any sample performance measurement data
that is provided in this chapter is for explanatory purposes only. It does not represent the
capabilities of a real system. The data was collected in controlled laboratory environment
at a specific point by using a specific configuration with hardware and firmware levels
available then. Performance in real-world environments will be different.

Contact your IBM representative or IBM Business Partner if you have questions about the
expected performance capability of IBM products in your environment.

14.1.1 RMF Overview


Figure 14-1 shows an overview of the RMF infrastructure.

Figure 14-1 RMF overview



RMF gathers data by using three monitors:
򐂰 Monitor I: Long-term data collector for all types of resources and workloads. The SMF data
that is collected by Monitor I is mostly used for capacity planning and performance
analysis.
򐂰 Monitor II: Snapshot data collector for address space states and resource usage. A subset
of Monitor II data is also displayed by the IBM z/OS System Display and Search Facility
(SDSF) product.
򐂰 Monitor III: Short-term data collector for problem determination, workflow delay
monitoring, and goal attainment supervision. This data is also used by the RMF PM Java
Client and the RMF Monitor III Data Portal.

The data is then stored and processed in several ways. The ones that are described and used
in this chapter are:
򐂰 Data that is collected by the three monitors can be stored as SMF records (SMF types
70 - 79) for later reporting.
򐂰 RMF Monitor III can write VSAM records to in-storage buffer or into VSAM data sets.
򐂰 The RMF postprocessor is the function to extract historical reports for Monitor I data.

Other methods of working with RMF data, which is not described in this chapter, are:
򐂰 The RMF Spreadsheet Reporter provides graphical presentation of long-term
Postprocessor data. It helps you view and analyze performance at a glance or perform a
system health check.
򐂰 The RMF Distributed Data Server (DDS) supports HTTP requests to retrieve RMF
Postprocessor data from a selection of reports since z/OS 1.12. The data is returned as an
XML document, so a web browser can act as Data Portal to RMF data.
򐂰 With z/OS 1.12, z/OS Management Facility provides the presentation for DDS data.
򐂰 RMF for z/OS 1.13 enhances the DDS layer and provides a new solution, RMF XP, which
enables cross-platform performance monitoring.

The following sections describe how to gather and store RMF Monitor I data and then extract
it as reports by using the RMF postprocessor.

RMF Monitor I gatherer session options and write SMF record types
To specify which types of data RMF is collecting, you specify Monitor I session gatherer
options in the ERBRMFxx parmlib member.

Table 14-1 shows the Monitor I session options and associated SMF record types that are
related to monitoring I/O performance. The defaults are emphasized.

Table 14-1   Monitor I gatherer session options and write SMF record types
Activities                                                  Session options in          SMF record types
                                                            ERBRMFxx parmlib member     (Long-term Monitor I)
Direct Access Device Activity                               DEVICE / NODEVICE           74.1
I/O Queuing Activity                                        IOQ / NOIOQ                 78.3
Channel Path Activity                                       CHAN / NOCHAN               73
FICON Director Activity (H)                                 FCD / NOFCD                 74.7
Cache Subsystem Activity (H)                                CACHE / NOCACHE             74.5
Enterprise Storage Server (link and rank statistics) (H)    ESS / NOESS                 74.8

Note: The Enterprise Storage Server activity is not collected by default. Change this and
turn on the collection if you have DS8000 storage systems installed. It provides valuable
information about DS8000 internal resources.

Note: FCD performance data is collected from the FICON directors. You must have the
FICON Control Unit Port (CUP) feature licensed and installed on your directors.

Certain measurements are performed by the storage hardware. The associated RMF record
types are 74.5, 74.7, and 74.8. They are marked with an (H) in Table 14-1 on page 461. They
do not have to be collected by each attached z/OS system separately; it is sufficient to get
them from one, or for redundancy reasons, two systems.

Note: Many clients, who have several z/OS systems sharing disk systems, typically collect
these records from two production systems that are always up and not running at the same
time.

In the ERBRMFxx parmlib member, you also find a TIMING section, where you can set the
RMF sampling cycle. It defaults to 1 second, which should be good for most cases. The RMF
cycle does not determine the amount of data that is stored in SMF records.

To store the collected RMF data, you must make sure that the associated SMF record types
(70 - 78) are included in the SYS statement in the SMFPRMxx parmlib member.

You also must specify the interval at which RMF data is stored. You can either do this explicitly
for RMF in the ERBRMFxx parmlib member or use the system wide SMF interval. Depending
on the type of data, RMF samples are added up or averaged for each interval. The number of
SMF record types you store and the interval make up the amount of data that is stored.
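
As an illustration, a Monitor I gatherer section in an ERBRMFxx member that enables all of
the storage-related collectors from Table 14-1 might contain entries like the following
sketch. The CYCLE and INTERVAL values, the device-class suboptions, and the inline comments
are illustrative only; verify the exact syntax in the RMF documentation:

CYCLE(1000)      /* sample every 1000 milliseconds              */
INTERVAL(15M)    /* write SMF records every 15 minutes          */
DEVICE(DASD)     /* SMF 74.1 direct access device activity      */
IOQ(DASD)        /* SMF 78.3 I/O queuing activity                */
CHAN             /* SMF 73   channel path activity               */
CACHE            /* SMF 74.5 cache subsystem activity            */
ESS              /* SMF 74.8 link and rank statistics            */
FCD              /* SMF 74.7 FICON director activity             */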

For more information about setting up RMF and the ERBRMFxx and SMFPRMxx parmlib
members, see z/OS Resource Measurement Facility User's Guide, SC34-2664.

Important: The shorter your interval, the more accurate your data is. However, there is
always a trade-off between a shorter interval and the size of the SMF data sets.

Preparing SMF records for post processing


The SMF writer routine stores SMF records to VSAM data sets that are named SYS1.MANx
by default. Before you can run the postprocessor to create RMF reports, you must dump the
SMF records to sequential data sets by using the IFASMFDP program. Example 14-1 on
page 463 shows a sample IFASMFDP job.



Example 14-1 Sample IFASMFDP job
//SMFDUMP EXEC PGM=IFASMFDP
//DUMPIN1 DD DSN=SYS1.MAN1,DISP=SHR
//DUMPOUT DD DSN=hlq.SMFDUMP.D151027,DISP=(,CATLG),
// SPACE=(CYL,(10,100),RLSE),VOL=SER=ABC123,UNIT=SYSALLDA
//SYSPRINT DD SYSOUT=*
//SYSIN DD *
INDD(DUMPIN1,OPTIONS(DUMP))
OUTDD(DUMPOUT,TYPE(70:79))
DATE(2015301,2015301)
START(1200)
END(1230)
/*

IFASMFDP can also be used to extract and concatenate certain record types or time ranges
from existing sequential SMF dump data sets. For more information about the invocation and
control of IFASMFDP, see z/OS MVS System Management Facilities (SMF), SA38-0667.

To create meaningful RMF reports or analysis, the records in the dump data set must be
chronological. This is important if you plan to analyze RMF data from several LPARs. Use a
SORT program to combine the individual data sets and sort them by date and time.
Example 14-2 shows a sample job snippet with the required sort parameters by using the
DFSORT program.

Example 14-2 Sample combine and sort job


//RMFSORT EXEC PGM=SORT
//SORTIN DD DISP=SHR,DSN=<input_smfdata_system1>
// DD DISP=SHR,DSN=<input_smfdata_system2>
//SORTOUT DD DSN=<output_cobined_sorted_smfdata>
//SYSIN DD *
SORT FIELDS=(11,4,CH,A,7,4,CH,A),EQUALS
MODS E15=(ERBPPE15,36000,,N),E35=(ERBPPE35,3000,,N)

RMF postprocessor
The RMF postprocessor analyses and summarizes RMF data into human readable reports.
Example 14-3 shows a sample job to run the ERBRMFPP post processing program.

Example 14-3 Sample ERBRMFPP job


//RMFPP EXEC PGM=ERBRMFPP
//MFPINPUT DD DISP=SHR,DSN=hlq.SMFDUMP.D151027
//MFPMSGDS DD SYSOUT=*
//SYSIN DD *
NOSUMMARY
DATE(10272015,10272015)
REPORTS(CACHE,CHAN,DEVICE,ESS,IOQ,FCD)
RTOD(1200,1230)
SYSOUT(H)
/*

In the control statements, you specify the reports that you want to get by using the REPORTS
keyword. Other control statements define the time frame, intervals, and summary points for
the reports to create. For more information about the available control statements, see z/OS
Resource Measurement Facility User's Guide, SC34-2664.



The following sections explain the reports that might be useful in analyzing I/O performance
issues, and provide some basic rules to identify potential bottlenecks. For more information
about all available RMF reports, see z/OS Resource Measurement Facility Report Analysis,
SC34-2665.

Note: You can also generate and start the postprocessor batch job from the ISPF menu of
RMF.

14.1.2 Direct Access Device Activity report


The RMF Direct Access Device Activity report (see Example 14-4) is a good starting point in
analyzing the storage systems performance. It lists the activity of individual devices, and their
average response time for the reporting period. The response time is also split up into several
components.

To get a first impression, you can rank volumes by their I/O intensity, which is the I/O rate
multiplied by Service Time (PEND + DISC + CONN component). Also, look for the largest
component of the response time. Try to identify the bottleneck that causes this problem. Do
not pay too much attention to volumes that have low or no Device Activity Rate, even if they
show high I/O response time. The following sections provide more detailed explanations.

The device activity report accounts for all activity to a base and all of its associated alias
addresses. Activity on alias addresses is not reported separately, but accumulated into the
base address.

The Parallel Access Volume (PAV) value is the number of addresses assigned to a unit
control block (UCB), including the base address and the number of aliases assigned to that
base address.

RMF reports the number of PAV addresses (or in RMF terms, exposures) that are used by a
device. In a HyperPAV environment, the number of PAVs is shown in this format: n.nH. The H
indicates that this volume is supported by HyperPAV. The n.n is a one decimal number that
shows the average number of PAVs assigned to the address during the RMF report period.
Example 14-4 shows that address 7010 has an average of 1.5 PAVs assigned to it during this
RMF period. When a volume has no I/O activity, the PAV is always 1, which means that there
is no alias that is assigned to this base address because in HyperPAV an alias is used or
assigned to a base address only during the period that is required to run an I/O. The alias is
then released and put back into the alias pool after the I/O is completed.

Important: The number of PAVs includes the base address plus the number of aliases
assigned to it. Thus, a PAV=1 means that the base address has no aliases assigned to it.

Example 14-4 Direct Access Device Activity report


D I R E C T A C C E S S D E V I C E A C T I V I T Y

DEVICE AVG AVG AVG AVG AVG AVG AVG AVG % % % AVG %
STORAGE DEV DEVICE NUMBER VOLUME PAV LCU ACTIVITY RESP IOSQ CMR DB INT PEND DISC CONN DEV DEV DEV NUMBER ANY
GROUP NUM TYPE OF CYL SERIAL RATE TIME TIME DLY DLY DLY TIME TIME TIME CONN UTIL RESV ALLOC ALLOC
7010 33909 10017 ST7010 1.5H 0114 689.000 1.43 .048 .046 .000 .163 .474 .741 33.19 54.41 0.0 2.0 100.0
7011 33909 10017 ST7011 1.5H 0114 728.400 1.40 .092 .046 .000 .163 .521 .628 29.72 54.37 0.0 2.0 100.0
YGTST00 7100 33909 60102 YG7100 1.0H 003B 1.591 12.6 .000 .077 .000 .067 .163 12.0 .413 0.07 1.96 0.0 26.9 100.0
YGTST00 7101 33909 60102 YG7101 1.0H 003B 2.120 6.64 .000 .042 .000 .051 .135 6.27 .232 0.05 1.37 0.0 21.9 100.0
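
To illustrate the ranking by I/O intensity with the data in Example 14-4, compare device 7010
with device 7100 (service time = PEND + DISC + CONN, in milliseconds):

Device 7010: (0.163 + 0.474 + 0.741) ms x 689.000 I/Os per second = about 949 ms of service time per second
Device 7100: (0.163 + 12.0 + 0.413) ms x 1.591 I/Os per second = about 20 ms of service time per second

Although device 7100 shows the much higher response time, device 7010 is by far the busier
device and is the one to investigate first.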



14.1.3 I/O response time components
The device activity report provides various components of the overall response time for an I/O
operation:
򐂰 IOS Queuing (IOSQ) time: Queuing at the host level
򐂰 Pending (PEND) time: Response time of the storage hardware (fabric and HA)
򐂰 Disconnect (DISC) time: Stage or de-stage data to or from cache, and remote replication
impact.
򐂰 Connect (CONN) time: Data transfer time.
򐂰 I/O Interrupt delay (INT): Time between finishing an I/O operation and interrupt.

Figure 14-2 illustrates how these components relate to each other and to the common
response and service time definitions.

System Service Time

I/O Response Time

I/O Service Time

IOSQ PEND DISC CONNECT I/O Interrupt


DB delay Delay
CMR delay
Channel busy

Figure 14-2 I/O interrupt delay time

Before learning about the individual response time components, you should know about the
different service time definitions in simple terms:
򐂰 I/O service time is the time that is required to fulfill an I/O request after it is dispatched to
the storage hardware. It includes locating and transferring the data and the required
handshaking.
򐂰 I/O response time is the I/O service time plus the time the I/O request spends in the I/O
queue of the host.
򐂰 System service time is the I/O response time plus the time it takes to notify the requesting
application of the completion.

Only the I/O service time is directly related to the capabilities of the storage hardware. The
additional components that make up I/O response or system service time are related to host
system capabilities or configuration.

The following sections describe all response time components in more detail. They also
provide possible causes for unusual values.



IOSQ time
IOSQ time is the time an I/O request spends on the host I/O queue after being issued by the
operating system. During normal operation, IOSQ time should be zero, or at least only a
fraction of the total response time. The following situations can cause high IOSQ time:
򐂰 In many cases, high IOSQ time is because of the unavailability of aliases to initiate an I/O
request. Implementing HyperPAV and adding alias addresses for the affected LCUs
improves the situation.
򐂰 RESERVES can be a cause of a shortage of aliases. When active, they can hold off much
of the other I/O. After the reserve is released, there is a large burst of activity that uses up
all the available aliases and results in increased IOSQ. Avoid RESERVES altogether and
change the I/O serialization to Global Resource Serialization (GRS).
򐂰 There is also a slight possibility that the IOSQ is caused by a long busy condition during
device error recovery.
򐂰 If you see a high IOSQ time, also look at whether other response time components are
higher than expected.

PEND time
PEND time represents the time that an I/O request waits in the hardware. It can become
increased by the following conditions:
򐂰 High DS8000 HA utilization:
– An HA can be saturated even if the individual ports have not yet reached their limits.
HA utilization is not directly reported by RMF.
– The Command response (CMR) delay, which is part of PEND, can be an indicator for
high HA utilization. It represents the time that a Start- or Resume Subchannel function
needs until the first command is accepted by the device. It should not exceed a few
hundred microseconds.
– The Enterprise Storage Server report can help you further to find the reason for
increased PEND caused by a DS8000 HA. For more information, see “Enterprise
Storage Server” on page 488.
򐂰 High FICON Director port utilization:
– Sometimes, high FICON Director port or DS8000 HA port utilization is because of over
commission. Multiple FICON channels from different CPCs connect to the same
outbound switch port.
In this case, the FICON channel utilization as seen from the host might be low, but the
combination or sum of the utilization of these channels that share the outbound port
can be significant.
– The FICON Director report can help you isolate the ports that cause increased PEND.
For more information, see 14.1.6, “FICON Director Activity report” on page 470.
򐂰 Device Busy (DB) delay: Time that an I/O request waits because the device is busy. Today,
mainly because of the Multiple Allegiance feature, DB delay is rare. If it occurs, it is most
likely because the device is RESERVED by another system. Use GRS to avoid hardware
RESERVES, as already indicated in “IOSQ time” on page 466.
򐂰 SAP impact: Indicates the time the I/O Processor (IOP/SAP) needs to handle the I/O
request. For more information, see 14.1.4, “I/O Queuing Activity report” on page 468.



DISC time
DISC is the time that the storage system needs to process data internally. During this time, it
disconnects from the channel to free it for other operations:
򐂰 The most frequent cause of high DISC time is waiting for data to be staged from the
storage back end into cache because of a read cache miss. This time can be elongated
because of the following conditions:
– Low read hit ratio. For more information, see 14.1.7, “Cache and NVS report” on
page 471. The lower the read hit ratio, the more read operations must wait for the data
to be staged from the DDMs to the cache. Adding cache to the DS8000 storage system
can increase the read hit ratio.
– High rank utilization. You can verify this condition with the Enterprise Storage Server
Rank Statistics report, as described in “Enterprise Storage Server” on page 488.
򐂰 Heavy write workloads sometimes cause an NVS full condition. Persistent memory full
condition or NVS full condition can also elongate the DISC time. For more information,
see 14.1.7, “Cache and NVS report” on page 471.
򐂰 In a Metro Mirror environment, the time that is needed for sending and storing the data to
the secondary volume also adds to the DISCONNECT component.

CONN time
For each I/O operation, the channel subsystem measures the time that storage system,
channel, and CPC are connected for data transmission. CONN depends primarily on the
amount of data that is transferred per I/O. Large I/Os naturally have a higher CONN
component than small ones.

If there is a high level of utilization of resources, time can be spent in contention, rather than
transferring data. Several reasons exist for higher than expected CONN time:
򐂰 FICON channel saturation. CONN time increases if the channel or BUS utilization is high.
FICON data is transmitted in frames. When multiple I/Os share a channel, frames from an
I/O are interleaved with those from other I/Os, thus elongating the time that it takes to
transfer all of the frames of that I/O. The total of this time, including the transmission time
of the interleaved frames, is counted as CONN time. For details and thresholds, see
“FICON channels” on page 478.
򐂰 Contention in the FICON Director or at a DS8000 HA port can also affect CONN time,
although these resources primarily affect PEND time.

I/O interrupt delay time


The interrupt delay time represents the time from when the I/O completes until the operating
system issues TSCH to bring in the status and process the interrupt. It includes the time that
it takes for the Hypervisor (IBM PR/SM™) to dispatch the LPAR on an available processor.
Possible examples of why the time might be high include the CPENABLE setting, processor
weights for LPAR dispatching, and so on. It usually is lower than 0.1 ms.

Note: This measurement is fairly new. It is available since z12 with z/OS V1.12 and V1.13
with APAR OA39993, or z/OS 2.1 and later.

In Example 14-4 on page 464, the AVG INT DLY is not displayed for devices 7010 and 7011.
The reason is that these volume records were collected on a z196 host system.

Note: I/O interrupt delay time is not counted as part of the I/O response time.



14.1.4 I/O Queuing Activity report
The I/O Queuing Activity report shows how busy the I/O processors (IOPs) in a z Systems
server are. IOPs are special purpose processors that handle the I/O operations.

If the utilization (% IOP BUSY) is unbalanced and certain IOPs are saturated, it can help to
redistribute the channels assigned to the storage systems. An IOP is assigned to handle a
certain set of channel paths. Assigning all of the channels from one IOP to access a busy disk
system can cause a saturation on that particular IOP. For more information, see the hardware
manual of the CPC that you use.

Example 14-5 shows an I/O Queuing Activity report.

Example 14-5 I/O Queuing Activity report


I/O Q U E U I N G A C T I V I T Y

- INITIATIVE QUEUE - ------- IOP UTILIZATION ------- -- % I/O REQUESTS RETRIED -- -------- RETRIES / SSCH ---------
IOP ACTIVITY AVG Q % IOP I/O START INTERRUPT CP DP CU DV CP DP CU DV
RATE LNGTH BUSY RATE RATE ALL BUSY BUSY BUSY BUSY ALL BUSY BUSY BUSY BUSY
00 259.349 0.12 0.84 259.339 300.523 31.1 31.1 0.0 0.0 0.0 0.45 0.45 0.00 0.00 0.00
01 127.068 0.14 100.0 126.618 130.871 50.1 50.1 0.0 0.0 0.0 1.01 1.01 0.00 0.00 0.00
02 45.967 0.10 98.33 45.967 54.555 52.0 52.0 0.0 0.0 0.0 1.08 1.08 0.00 0.00 0.00
03 262.093 1.72 0.62 262.093 279.294 32.9 32.9 0.0 0.0 0.0 0.49 0.49 0.00 0.00 0.00
SYS 694.477 0.73 49.95 694.017 765.243 37.8 37.8 0.0 0.0 0.0 0.61 0.61 0.00 0.00 0.00

In a HyperPAV environment, you can also check the usage of HyperPAV alias addresses.
Example 14-6 shows the LCU section of the I/O Queueing Activity report. It reports on
HyperPAV alias usage in the HPAV MAX column. Here, a maximum of 32 PAV alias
addresses were used for that LCU during the reporting interval. You can compare this value
to the number of aliases that are defined for that LCU. If all are used, you might experience
delays because of a lack of aliases.

This condition is also indicated by the HPAV WAIT value. It is calculated as the ratio of the
number of I/O requests that cannot start because no HyperPAV aliases are available, to the
total number of I/O requests for that LCU. If it is nonzero in a significant number of intervals,
you might consider defining more aliases for this LCU.

Example 14-6 I/O Queuing Activity report that uses HyperPAV


I/O Q U E U I N G A C T I V I T Y

AVG AVG DELAY AVG AVG DATA
LCU CU DCM GROUP CHAN CHPID % DP % CU CUB CMR CONTENTION Q CSS HPAV OPEN XFER
MIN MAX DEF PATHS TAKEN BUSY BUSY DLY DLY RATE LNGTH DLY WAIT MAX EXCH CONC
0114 7000 D0 2149.3 0.00 0.00 0.0 0.1
D1 2149.7 0.00 0.00 0.0 0.1
D2 2149.4 0.00 0.00 0.0 0.1
D3 2149.8 0.00 0.00 0.0 0.1
* 8598.3 0.00 0.00 0.0 0.1 0.000 0.00 0.1 0.025 32 9.02 4.86
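To illustrate the calculation, the following Python sketch derives the HPAV WAIT value from the
two counters that are described above. It is an illustration only; the variable names are
assumptions, not RMF field names.

def hpav_wait_ratio(requests_delayed_no_alias, total_requests):
    # Fraction of I/O requests to the LCU that could not start in the
    # interval because no HyperPAV alias was free.
    if total_requests == 0:
        return 0.0
    return requests_delayed_no_alias / total_requests

# Example: 25 delayed requests out of 100,000 total requests for the LCU
print(hpav_wait_ratio(25, 100000))   # 0.00025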

Note: If your HPAV MAX value is constantly below the number of defined alias addresses,
you can consider unassigning some aliases and using these addresses for additional base
devices. Do this only if you are short of device addresses. Monitor HPAV MAX over an
extended period to make sure that you do not miss periods of higher demand for PAV.



14.1.5 FICON host channel report
The CHANNEL PATH ACTIVITY report, which is shown in Figure 14-3, shows the FICON
channel statistics:
򐂰 The PART Utilization is the utilization of the channel hardware related to the I/O activity on
this LPAR.
򐂰 The Total Utilization is the sum from all LPARs that share the channel. It should not
exceed 50% during normal operation.
򐂰 The BUS utilization indicates the usage of the internal bus that connects the channel to
the CPC. The suggested maximum value is 40%.
򐂰 The FICON link utilization is not explicitly reported. For a rough approach, take the higher
of the TOTAL READ or TOTAL WRITE values and divide it by the maximum achievable
data rate, as indicated by the SPEED field. Consider that SPEED is in Gbps, and READ
and WRITE values are in MBps. The link utilization should not exceed 70%.

Note: We explain this estimation with CHPID 30 in the example in Figure 14-3. The
SPEED value is 16 Gbps, which roughly converts to 1600 MBps. TOTAL READ is 50.76
MBps, which is higher than TOTAL WRITE. Therefore, the link utilization is
approximately 50.76 divided by 1600, which results in 0.032, or 3.2%.
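The same rough estimate can be expressed as a short Python sketch. This is only an
illustration of the rule of thumb above; the write value in the example call is an assumed
figure, not taken from the report.

def ficon_link_utilization(speed_gbps, total_read_mbps, total_write_mbps):
    # Rough estimate: higher of read or write MBps divided by the nominal
    # link data rate, with 1 Gbps taken as approximately 100 MBps.
    max_mbps = speed_gbps * 100.0
    return max(total_read_mbps, total_write_mbps) / max_mbps

util = ficon_link_utilization(16, 50.76, 12.30)   # write value assumed
print("link utilization: {:.1%}".format(util))    # approximately 3.2%
if util > 0.70:
    print("link utilization exceeds the suggested 70% maximum")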

Significantly exceeding these thresholds causes frame pacing, which eventually leads to
higher than necessary CONNECT times. If this happens only for a few intervals, it is most
likely not a problem.

Figure 14-3 Channel Path Activity report

For small block transfers, the BUS utilization is less than the FICON channel utilization. For
large block transfers, the BUS utilization is greater than the FICON channel utilization.

The Generation (G) field in the channel report shows the combination of the FICON channel
generation that is installed and the speed of the FICON channel link for this CHPID at the time
of the machine start. The G field does not include any information about the link between the
director and the DS8000 storage system.



Table 14-2 lists the valid values and their definitions.

Table 14-2 G field in Channel Activity Report


G-field number   FICON channel type                  Operating at
1                FICON Express                       1 Gbps
2                FICON Express                       2 Gbps
3                FICON Express2 or FICON Express4    1 Gbps
4                FICON Express2 or FICON Express4    2 Gbps
5                FICON Express4                      4 Gbps
7                FICON Express8                      2 Gbps
8                FICON Express8                      4 Gbps
9                FICON Express8                      8 Gbps
11               FICON Express8S                     2 Gbps
12               FICON Express8S                     4 Gbps
13               FICON Express8S                     8 Gbps
15               FICON Express16S                    4 Gbps
16               FICON Express16S                    8 Gbps
17               FICON Express16S                    16 Gbps

The link between the director and the DS8000 storage system can run at 1, 2, 4, 8, or
16 Gbps.

If the channel is point-to-point connected to the DS8000 HA port, the G field indicates the
speed that was negotiated between the FICON channel and the DS8000 port. With
z/OS V2.1 and later, a SPEED column was added that indicates the actual channel path
speed at the end of the interval.

The RATE field in the FICON OPERATIONS or zHPF OPERATIONS columns shows the
number of FICON or zHPF I/Os per second that are initiated at the physical channel level. It is
not broken down by LPAR.

14.1.6 FICON Director Activity report


The FICON director is the switch that connects the host FICON channel to the DS8000 HA
port. FICON director performance statistics are collected in the SMF record type 74 subtype
7. The FICON Director Activity report (Example 14-7 on page 471) provides information
about director and port activities.

Note: To get the data that is related to the FICON Director Activity report, the CUP device
must be online on the gathering z/OS system.

The measurements that are provided are on a director port level. They represent the total I/O
passing through each port and are not broken down by LPAR or device.



The CONNECTION field indicates how a port is connected:
򐂰 CHP: The port is connected to a FICON channel on the host.
򐂰 CHP-H: The port is connected to a FICON channel on the host that requested this report.
򐂰 CU: This port is connected to a port on a disk subsystem.
򐂰 SWITCH: This port is connected to another FICON director.

The important performance metric is AVG FRAME PACING. This metric shows the average
time (in microseconds) that a frame waited before it was transmitted. The higher the
contention on the director port, the higher the average frame pacing metric. High frame
pacing negatively influences the CONNECT time.

Example 14-7 FICON director activity report


F I C O N D I R E C T O R A C T I V I T Y

PORT ---------CONNECTION-------- AVG FRAME AVG FRAME SIZE PORT BANDWIDTH (MB/SEC) ERROR
ADDR UNIT ID SERIAL NUMBER PACING READ WRITE -- READ -- -- WRITE -- COUNT
05 CHP FA 0000000ABC11 0 808 285 50.04 10.50 0
07 CHP 4A 0000000ABC11 0 149 964 20.55 5.01 0
09 CHP FC 0000000ABC11 0 558 1424 50.07 10.53 0
0B CHP-H F4 0000000ABC12 0 872 896 50.00 10.56 0
12 CHP D5 0000000ABC11 0 73 574 20.51 5.07 0
13 CHP C8 0000000ABC11 0 868 1134 70.52 2.08 1
14 SWITCH ---- 0ABCDEFGHIJK 0 962 287 50.03 10.59 0
15 CU C800 0000000XYG11 0 1188 731 20.54 5.00 0

14.1.7 Cache and NVS report


The CACHE SUBSYSTEM ACTIVITY report provides useful information for analyzing the
reason for high DISC time. It is provided on an LCU level. Example 14-8 shows a sample
cache overview report. Example 14-9 on page 472 is the continuation of this report with
cache statistics by volume.

Note: Cache reports by LCU summarize only the activity of volumes that are online.

Example 14-8 Cache Subsystem Activity summary


C A C H E S U B S Y S T E M A C T I V I T Y

------------------------------------------------------------------------------------------------------------------------------------
CACHE SUBSYSTEM OVERVIEW
------------------------------------------------------------------------------------------------------------------------------------
TOTAL I/O 19976 CACHE I/O 19976 CACHE OFFLINE 0
TOTAL H/R 0.804 CACHE H/R 0.804
CACHE I/O -------------READ I/O REQUESTS------------- ----------------------WRITE I/O REQUESTS---------------------- %
REQUESTS COUNT RATE HITS RATE H/R COUNT RATE FAST RATE HITS RATE H/R READ
NORMAL 14903 252.6 10984 186.2 0.737 5021 85.1 5021 85.1 5021 85.1 1.000 74.8
SEQUENTIAL 0 0.0 0 0.0 N/A 52 0.9 52 0.9 52 0.9 1.000 0.0
CFW DATA 0 0.0 0 0.0 N/A 0 0.0 0 0.0 0 0.0 N/A N/A
TOTAL 14903 252.6 10984 186.2 0.737 5073 86.0 5073 86.0 5073 86.0 1.000 74.6
-----------------------CACHE MISSES----------------------- ------------MISC------------ ------NON-CACHE I/O-----
REQUESTS READ RATE WRITE RATE TRACKS RATE COUNT RATE COUNT RATE
DFW BYPASS 0 0.0 ICL 0 0.0
NORMAL 3919 66.4 0 0.0 3921 66.5 CFW BYPASS 0 0.0 BYPASS 0 0.0
SEQUENTIAL 0 0.0 0 0.0 0 0.0 DFW INHIBIT 0 0.0 TOTAL 0 0.0
CFW DATA 0 0.0 0 0.0 ASYNC (TRKS) 3947 66.9
TOTAL 3919 RATE 66.4
---CKD STATISTICS--- ---RECORD CACHING--- ----HOST ADAPTER ACTIVITY--- --------DISK ACTIVITY-------
BYTES BYTES RESP BYTES BYTES
WRITE 0 READ MISSES 0 /REQ /SEC TIME /REQ /SEC
WRITE HITS 0 WRITE PROM 3456 READ 6.1K 1.5M READ 6.772 53.8K 3.6M
WRITE 5.7K 491.0K WRITE 12.990 6.8K 455.4K

The report shows the I/O requests by read and by write. It shows the rate, the hit rate, and the
hit ratio of the read and the write activities. The read-to-write ratio is also calculated.



Note: The total I/O requests here can be higher than the I/O rate that is shown in the
DASD report. In the DASD report, one channel program is counted as one I/O. However, in
the cache report, if there are multiple Locate Record commands in a channel program,
each Locate Record command is counted as one I/O request.

In this report, you can check the value of the read hit ratio. Low read hit ratios contribute to
higher DISC time. For a cache-friendly workload, you see a read hit ratio of better than 90%.
The write hit ratio is usually 100%.

High DASD Fast Write (DFW) Bypass is an indication that persistent memory or NVS is
overcommitted. DFW BYPASS means that a write I/O cannot be completed because
persistent memory is full, so the write must be retried. If the DFW Bypass Rate is higher than
1%, the write retry operations can affect the DISC time. It is an indication of insufficient
back-end resources because write cache destaging operations are not fast enough.

Note: In cases with high DFW Bypass Rate, you usually see high rank utilization in the
DS8000 rank statistics, which are described in 14.1.8, “Enterprise Disk Systems report” on
page 473.
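As an illustration of these checks, the following Python sketch evaluates the read hit ratio and
the DFW bypass condition from the counters of the Cache Subsystem Activity report. The
function and field names are assumptions, and interpreting the 1% guideline as DFW
bypasses relative to write requests is also an assumption.

def check_cache_health(read_hits, read_requests, dfw_bypass, write_requests):
    findings = []
    read_hit_ratio = read_hits / read_requests if read_requests else 1.0
    if read_hit_ratio < 0.90:
        findings.append("read hit ratio {:.1%} is below 90%, which can "
                        "increase DISC time".format(read_hit_ratio))
    # Assumption: the 1% guideline is interpreted as bypasses per write request
    bypass_ratio = dfw_bypass / write_requests if write_requests else 0.0
    if bypass_ratio > 0.01:
        findings.append("DFW bypass ratio {:.1%} indicates an overcommitted "
                        "NVS; check rank utilization".format(bypass_ratio))
    return findings

# Counts from Example 14-8: 10984 read hits, 14903 reads, 0 bypasses, 5073 writes
print(check_cache_health(10984, 14903, 0, 5073))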

The DISK ACTIVITY part of the report can give you a rough indication of the back-end
performance. The read response time can be in the order of 10 - 20 ms if you have mostly
HDDs in the back end, and lower if SSDs and HPFE resources are used. The write response
time can be higher by a factor of two. Do not overrate this information, and check the ESS
report (see 14.1.8, “Enterprise Disk Systems report” on page 473), which provides much
more detail.

The report also shows sequential I/O in the SEQUENTIAL row and random I/O in the
NORMAL row, for both read and write operations. These metrics can also help to analyze and
pinpoint I/O bottlenecks.

Example 14-9 is the second part of the CACHE SUBSYSTEM ACTIVITY report, providing
measurements for each volume in the LCU. You can also see to which extent pool each
volume belongs.

Example 14-9 Cache Subsystem Activity by volume serial number


C A C H E S U B S Y S T E M A C T I V I T Y

------------------------------------------------------------------------------------------------------------------------------------
CACHE SUBSYSTEM DEVICE OVERVIEW
------------------------------------------------------------------------------------------------------------------------------------
VOLUME DEV XTNT % I/O ---CACHE HIT RATE-- ----------DASD I/O RATE---------- ASYNC TOTAL READ WRITE %
SERIAL NUM POOL I/O RATE READ DFW CFW STAGE DFWBP ICL BYP OTHER RATE H/R H/R H/R READ
*ALL 100.0 338.6 186.2 86.0 0.0 66.4 0.0 0.0 0.0 0.0 66.9 0.804 0.737 1.000 74.6
*CACHE-OFF 0.0 0.0
*CACHE 100.0 338.6 186.2 86.0 0.0 66.4 0.0 0.0 0.0 0.0 66.9 0.804 0.737 1.000 74.6
PR7000 7000 0000 22.3 75.5 42.8 19.2 0.0 13.5 0.0 0.0 0.0 0.0 14.4 0.821 0.760 1.000 74.6
PR7001 7001 0000 11.5 38.8 20.9 10.5 0.0 7.5 0.0 0.0 0.0 0.0 7.6 0.807 0.736 1.000 73.1
PR7002 7002 0000 11.1 37.5 20.4 9.5 0.0 7.6 0.0 0.0 0.0 0.0 7.0 0.797 0.729 1.000 74.7
PR7003 7003 0000 11.3 38.3 22.0 8.9 0.0 7.4 0.0 0.0 0.0 0.0 6.8 0.806 0.747 1.000 76.8
PR7004 7004 0000 3.6 12.0 6.8 3.0 0.0 2.3 0.0 0.0 0.0 0.0 2.6 0.810 0.747 1.000 75.2
PR7005 7005 0000 3.7 12.4 6.8 3.2 0.0 2.4 0.0 0.0 0.0 0.0 2.7 0.808 0.741 1.000 74.1
PR7006 7006 0000 3.8 12.8 6.5 3.6 0.0 2.6 0.0 0.0 0.0 0.0 3.1 0.796 0.714 1.000 71.5
PR7007 7007 0000 3.6 12.3 6.9 3.1 0.0 2.4 0.0 0.0 0.0 0.0 2.5 0.806 0.742 1.000 75.2
PR7008 7008 0000 3.6 12.2 6.7 3.4 0.0 2.2 0.0 0.0 0.0 0.0 2.7 0.821 0.753 1.000 72.5
PR7009 7009 0000 3.6 12.2 6.8 2.9 0.0 2.5 0.0 0.0 0.0 0.0 2.3 0.796 0.732 1.000 76.4

If you specify REPORTS(CACHE(DEVICE)) when running the cache report, you get the complete
report with detailed cache statistics for each volume, as shown in Example 14-10 on
page 473. By specifying REPORTS(CACHE(SSID(nnnn))), you can limit this report to only certain
LCUs.



Example 14-10 Cache Device Activity report detail by volume
C A C H E D E V I C E A C T I V I T Y

VOLSER @9C02F NUM C02F EXTENT POOL 0000


--------------------------------------------------------------------------------------------------------------------------
CACHE DEVICE STATUS
--------------------------------------------------------------------------------------------------------------------------
CACHE STATUS DUPLEX PAIR STATUS
CACHING - ACTIVE DUPLEX PAIR - NOT ESTABLISHED
DASD FAST WRITE - ACTIVE STATUS - N/A
PINNED DATA - NONE DUAL COPY VOLUME - N/A
--------------------------------------------------------------------------------------------------------------------------
CACHE DEVICE ACTIVITY
--------------------------------------------------------------------------------------------------------------------------
TOTAL I/O 3115 CACHE I/O 3115 CACHE OFFLINE N/A
TOTAL H/R 0.901 CACHE H/R 0.901
CACHE I/O -------------READ I/O REQUESTS------------- ----------------------WRITE I/O REQUESTS---------------------- %
REQUESTS COUNT RATE HITS RATE H/R COUNT RATE FAST RATE HITS RATE H/R READ
NORMAL 2786 46.4 2477 41.3 0.889 329 5.5 329 5.5 329 5.5 1.000 89.4
SEQUENTIAL 0 0.0 0 0.0 N/A 0 0.0 0 0.0 0 0.0 N/A N/A
CFW DATA 0 0.0 0 0.0 N/A 0 0.0 0 0.0 0 0.0 N/A N/A
TOTAL 2786 46.4 2477 41.3 0.889 329 5.5 329 5.5 329 5.5 1.000 89.4
-----------------------CACHE MISSES----------------------- ------------MISC------------ ------NON-CACHE I/O-----
REQUESTS READ RATE WRITE RATE TRACKS RATE COUNT RATE COUNT RATE
DFW BYPASS 0 0.0 ICL 0 0.0
NORMAL 309 5.1 0 0.0 311 5.2 CFW BYPASS 0 0.0 BYPASS 0 0.0
SEQUENTIAL 0 0.0 0 0.0 0 0.0 DFW INHIBIT 0 0.0 TOTAL 0 0.0
CFW DATA 0 0.0 0 0.0 ASYNC (TRKS) 173 2.9
TOTAL 309 RATE 5.1
---CKD STATISTICS--- ---RECORD CACHING--- ----HOST ADAPTER ACTIVITY--- --------DISK ACTIVITY-------
BYTES BYTES RESP BYTES BYTES
WRITE 0 READ MISSES 0 /REQ /SEC TIME /REQ /SEC
WRITE HITS 0 WRITE PROM 111 READ 4.1K 190.1K READ 14.302 55.6K 288.4K
WRITE 4.0K 21.8K WRITE 43.472 18.1K 48.1K

14.1.8 Enterprise Disk Systems report


The Enterprise Disk Systems (ESS) report provides the following measurements:
򐂰 DS8000 rank activity as ESS RANK STATISTICS
򐂰 DS8000 FICON/Fibre port and HA activity as ESS LINK STATISTICS
򐂰 DS8000 extent pool information as ESS EXTENT POOL STATISTICS

Important: Enterprise Storage Server data is not gathered by default. Make sure that
Enterprise Storage Server data is collected and processed, as described in 14.1.1, “RMF
Overview” on page 460.

DS8000 rank
ESS RANK STATISTICS provides measurements of back-end activity on the extent pool and
RAID array (rank) levels, such as OPS/SEC, BYTES/OP, BYTES/SEC, and RTIME/OP, for
read and write operations.

Example 14-11 shows rank statistics for a system with multi-rank extent pools, which contain
HDD resources and SSD or HPFE arrays (hybrid pools).

Example 14-11 Rank statistics example


E S S R A N K S T A T I S T I C S

----- READ OPERATIONS ---- ----- WRITE OPERATIONS ----


--EXTENT POOL- ADAPT OPS BYTES BYTES RTIME OPS BYTES BYTES RTIME -----ARRAY---- MIN RANK RAID
ID TYPE RRID ID /SEC /OP /SEC /OP /SEC /OP /SEC /OP SSD NUM WDTH RPM CAP TYPE

0000 CKD 1Gb 0000 0000 0.0 0.0 0.0 16.0 0.0 0.0 0.0 96.0 1 6 15 1800G RAID 5
0004 0000 0.0 65.5K 72.8 0.0 0.0 1.3M 2.8K 100.0 1 7 15 2100G RAID 5
0010 000A 190.0 57.2K 10.9M 2.2 8.0 1.1M 8.9M 9.3 Y 1 6 N/A 2400G RAID 5
0012 000A 180.6 57.3K 10.3M 2.3 8.3 1.1M 9.0M 9.5 Y 1 6 N/A 2400G RAID 5

POOL 370.6 57.2K 21.2M 2.2 16.3 1.1M 17.9M 9.4 Y 4 25 0 8700G RAID 5



0001 CKD 1Gb 0001 0000 0.0 0.0 0.0 0.0 0.0 1.4M 22.9K 22.9 1 6 15 1800G RAID 5
0005 0000 0.0 0.0 0.0 0.0 0.0 1.6M 22.9K 39.4 1 7 15 2100G RAID 5
000F 000A 82.9 57.3K 4.7M 2.4 4.3 1.0M 4.5M 7.8 Y 1 6 N/A 2400G RAID 5
0011 000A 82.9 57.3K 4.7M 2.4 5.2 1.1M 5.6M 7.5 Y 1 6 N/A 2400G RAID 5

POOL 165.8 57.3K 9.5M 2.4 9.5 1.1M 10.1M 7.7 Y 4 25 0 8700G RAID 5

Note: Starting with z/OS V2.2 and DS8000 LMC R7.5, this report also shows the
relationship of ranks to DA pairs (ADAPT ID). The MIN RPM value for SSD ranks was also
changed. It used to be 65 and is now N/A.

If I/O response time elongation and rank saturation are suspected, it is important to check
IOPS (OPS/SEC) and throughput (BYTES/SEC) for both read and write rank activities. You
must also determine what kind of workloads were running and the ratio between them. If
your workload is mostly random, the IOPS value is the significant figure. If it is more
sequential, the throughput is more significant.

The rank response times (RTIME) can be indicators of saturation. In a balanced and properly
sized system with growth potential, the read response times are in the order of 1 - 2 ms for
SSD or HPFE ranks and 10 - 15 ms for enterprise class SAS ranks. Ranks with response
times reaching the range of 3 - 5 ms for SSD and HPFE or 20 - 30 ms for enterprise SAS are
approaching saturation.

The write response times for SSD and HPFE should be in the same order as reads. For
HDDs, they can be about twice as high as for reads. Ranks based on NL-SAS drives have
higher response times, especially for write operations. They should not be used for
performance-sensitive workloads.
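These guidelines can be turned into a simple screening step, as in the following illustrative
Python sketch. The field and tier names are assumptions; the threshold values are the read
response time guidelines quoted above.

READ_RTIME_SATURATION_MS = {"FLASH": 3.0, "ENT_SAS": 20.0}   # from the guidelines above

def flag_saturated_ranks(ranks):
    # ranks: list of dicts with 'id', 'tier', and 'read_rtime_ms' (read RTIME/OP)
    flagged = []
    for rank in ranks:
        limit = READ_RTIME_SATURATION_MS.get(rank["tier"])
        if limit is not None and rank["read_rtime_ms"] >= limit:
            flagged.append(rank["id"])
    return flagged

sample = [
    {"id": "0010", "tier": "FLASH", "read_rtime_ms": 2.2},   # read RTIME as in Example 14-11
    {"id": "0012", "tier": "FLASH", "read_rtime_ms": 2.3},
]
print(flag_saturated_ranks(sample))   # [] - neither rank is near saturation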

Note: IBM Spectrum Control performance monitoring also provides a calculated rank
utilization value. For more information, see 7.2.1, “IBM Spectrum Control overview” on
page 223.

DS8000 FICON/Fibre port and host adapter


The Enterprise Storage Server Link Statistics report includes data for all DS8000 HA ports,
regardless of their configuration (FICON, FCP, or PPRC). Example 14-12 shows a sample
that contains FICON ports (link type ECKD READ and WRITE) and PPRC ports (PPRC
SEND and PPRC RECEIVE). It also shows that the ports are running at 8 and 4 Gbps.
Example 14-12 DS8000 link statistics
E S S L I N K S T A T I S T I C S

------ADAPTER------ --LINK TYPE-- BYTES BYTES OPERATIONS RESP TIME I/O


SAID TYPE /SEC /OPERATION /SEC /OPERATION INTENSITY
0100 FIBRE 8Gb ECKD READ 661.9K 2.9K 230.3 0.0 8.5
0100 FIBRE 8Gb ECKD WRITE 96.1K 935.0 102.8 0.1 5.9
------
14.4
0312 FIBRE 4Gb PPRC SEND 5.4M 30.6K 176.5 0.5 84.9
0312 FIBRE 4Gb PPRC RECEIVE 0.0 0.0 0.0 0.0 0.0
------
84.9

For a definition of the HA port ID (SAID), see IBM DS8880 Architecture and Implementation
(Release 8), SG24-8323.

The I/O INTENSITY is the product of the operations per second and the response time per
operation. For FICON ports, it is calculated for both the read and write operations, and for
PPRC ports, it is calculated for both the send and receive operations. The total I/O intensity is
the sum of those two numbers on each port.



For FICON ports, the I/O intensity should be below 2000 most of the time. With higher
values, the interface might start becoming ineffective. Much higher I/O intensities can affect
the response time, especially the PEND and CONN components. At a value of 4000, the port
is saturated. For more information, see the description of PEND and CONN times in “PEND
time” on page 466 and “CONN time” on page 467. This rule does not apply to PPRC ports,
especially if the distance between the primary site and the secondary site is significant.
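The following Python sketch illustrates the calculation. The response time values in the
example call are assumed, unrounded figures; the 2000 and 4000 thresholds are the FICON
guidelines above and do not apply to PPRC ports.

def port_io_intensity(rows):
    # rows: (operations per second, response time in ms per operation) tuples
    # for the read and write (or send and receive) rows of one SAID
    return sum(ops * resp for ops, resp in rows)

intensity = port_io_intensity([(230.3, 0.037), (102.8, 0.057)])  # assumed values
print("I/O intensity: {:.1f}".format(intensity))                 # approximately 14.4
if intensity >= 4000:
    print("port is saturated")
elif intensity >= 2000:
    print("port may become ineffective; watch PEND and CONN times")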

Note: IBM Spectrum Control performance monitoring also provides a calculated port
utilization value. For more information, see 7.2.1, “IBM Spectrum Control overview” on
page 223.

If you must monitor or analyze statistics for certain ports over time, you can generate an
overview report and filter for certain port IDs. Provide post processor control statements like
the ones in Example 14-13 and you get an overview report, as shown in Example 14-14.

Example 14-13 Control statements for DS8000 link overview report


OVW(TSEND003(ESTRPSD(SERN(0000ABC11),SAID(0003))))
OVW(RSEND003(ESRTPSD(SERN(0000ABC11),SAID(0003))))
OVW(TSEND133(ESTRPSD(SERN(0000ABC11),SAID(0133))))
OVW(RSEND133(ESRTPSD(SERN(0000ABC11),SAID(0133))))

Example 14-14 RMF overview report


R M F O V E R V I E W R E P O R T

DATE TIME INT TSEND033 RSEND033 TSEND233 RSEND233


MM/DD HH.MM.SS HH.MM.SS
06/20 20.53.00 00.01.00 133.7M 39.0 136.9M 39.6
06/20 20.54.00 00.00.59 123.4M 42.4 123.4M 42.7
06/20 20.55.00 00.00.59 121.5M 41.8 114.6M 40.4
06/20 20.56.00 00.01.00 121.3M 43.4 124.1M 42.1
06/20 20.57.00 00.00.59 118.2M 35.3 117.4M 36.4
06/20 20.58.00 00.01.00 103.8M 34.1 105.3M 35.0
06/20 20.59.00 00.01.00 93.9M 28.5 88.9M 27.1
06/20 21.00.00 00.00.59 86.5M 28.3 88.5M 29.0

14.1.9 Alternatives and supplements to RMF


There are several other tools available that can help with monitoring the performance of the
DS8000 storage system:
򐂰 IBM Tivoli OMEGAMON® XE
The IBM Tivoli OMEGAMON for z/OS Management Suite provides platform management
and monitoring capabilities for the operating system, networks, and storage subsystem.
For more information, see the following website:
http://www.ibm.com/software/products/en/tivoomegforzosmanasuit



򐂰 IBM Spectrum Control
IBM Spectrum Control does not provide detailed host related performance data like RMF.
However, it might be useful as follows:
– In a Remote Mirror and Copy environment, you can use IBM Spectrum Control to
monitor the performance of the remote disk system if there is no z/OS system that
accesses it to gather RMF data.
– In mixed (mainframe and distributed) environments.
For more information about IBM Spectrum Control, see 7.2.1, “IBM Spectrum Control
overview” on page 223.
򐂰 DS8000 GUI Performance report:
– With DS8000 LMC R7.5, the DS8000 GUI provides a performance monitoring
function. With LMC R8.0, it also provides a function to export performance data in
CSV format.
– This data is presented online in graphical format. It is easy to access but provides less
detail than RMF.
򐂰 Global Mirror Monitor (GMMON)
– GMMON gathers and reports information about the state of DS8000 Global Mirror
asynchronous replication, such as consistency group formation periods or errors. The
data is useful to monitor Global Mirror or analyze issues.
– GMMON is implemented in the IBM disaster recovery management products Globally
Dispersed Parallel Sysplex® (GDPS) and IBM Copy Services Manager (CSM, also
known as IBM Tivoli Productivity Center for Replication), or available to clients as an
unsupported as-is application that can run on z/OS or various distributed platforms.
Contact your IBM representative or IBM Business Partner if you need more
information.
– GDPS stores GMMON data in SMF record type 105.
򐂰 There are several products that are provided by other vendors that use RMF data to
provide performance analysis capabilities.

Note: In a DS8000 remote replication configuration, it is important to also collect the
performance data of the remote storage systems, even if they are not online to production
systems. As mentioned, you can use IBM Spectrum Control for this purpose. In a GDPS
controlled environment, the GDPS controlling system on the remote site can collect this
data by way of RMF as well.

14.2 DS8000 and z Systems planning and configuration


This section describes general aspects and suggestions for planning the DS8000
configuration. For a less generic and more detailed analysis that accounts for your particular
environment, the Disk Magic modeling tools are available to IBM representatives and IBM
Business Partners who can help you in the planning activities. Disk Magic can be used to help
understand the performance effects of various configuration options, such as the number of
ports and HAs, disk drive capacity, and number of disks. For more information, see 6.1, “Disk
Magic” on page 160.



14.2.1 Sizing considerations
When you plan to configure a new DS8000 storage system, you have many decisions to
make. Each of these decisions might affect cost, scalability, and the performance capabilities
of the new system. For the current model of the DS8000 family, the DS8880 storage system,
the choices include, but are not limited to, the following items:
򐂰 DS8000 model: DS8884 Business Class or DS8886 Enterprise Class
򐂰 Cache size: 64 GB - 2048 GB
򐂰 Disk resources (number, type, capacity, speed, and RAID type)
򐂰 Flash resources (number, capacity, and type)
򐂰 HAs (connectivity and throughput requirements)

A z Systems workload is too complex to be estimated and described with only a few numbers
and general guidelines. Furthermore, the capabilities of the storage systems are not just the
sum of the capabilities of their components. Advanced technologies, such as Easy Tier, can
influence the throughput of the complete solution positively, and other functions, such as
point-in-time copies or remote replication, add additional workload.

Most of these factors can be accounted for by modeling the new solution with Disk Magic. For
z Systems, the modeling is based on RMF data and real workload characteristics. You can
compare the current to potential new configurations and consider growth (capacity and
workload), and the influence of Easy Tier and Copy Services.

For more information about Disk Magic, see 6.1, “Disk Magic” on page 160, which describes
the following items:
򐂰 How to get access to the tool or someone who can use it
򐂰 The capabilities and limitations
򐂰 The data that is required for a proper model

Note: It is important, particularly for mainframe workloads, which often are I/O response
time sensitive, to observe the thresholds and limitations provided by the Disk Magic model.

14.2.2 Optimizing performance


This section describes how to optimize the performance of a z Systems server with a
DS8000 storage system.

Easy Tier
Easy Tier is a performance enhancement to the DS8000 family of storage systems that helps
to avoid issues that might occur if the back-end workload is not balanced across all the
available resources. For a description of Easy Tier, its capabilities, and how it works, see
1.3.4, “Easy Tier” on page 11.

There are some IBM Redbooks publications that you can refer to if you need more information
about Easy Tier:
򐂰 IBM DS8000 Easy Tier, REDP-4667
򐂰 DS8870 Easy Tier Application, REDP-5014
򐂰 IBM DS8870 Easy Tier Heat Map Transfer, REDP-5015



Easy Tier provides several capabilities that you can use to solve performance issues. Here
are some examples:
򐂰 Manage skewed workload in multitier (hybrid) extent pools: If part of your back-end disks
is overutilized (hot), consider adding some flash arrays to the affected pools and let Easy
Tier move the hot data to the faster resources.
򐂰 Distribute workload across all resources within a storage tier: In both uniform or hybrid
extent pools, Easy Tier automatically and transparently moves data from heavily used
resources to less used ones within the same tier.
򐂰 Give new workloads a good start: If you deploy a new application, you can assign its
storage space (volumes) to a specific tier manually by using the Easy Tier Application
feature. After the application starts and Easy Tier learns about its requirements, you can
switch it to automatic mode.
򐂰 Add or remove resources: Using Easy Tier manual mode, you can transparently add new
resources (arrays) to an extent pool or remove them from it. Easy Tier automatically
redistributes the data for best performance so that you can optimize the use of the
back-end resources you have available.

The predominant feature of Easy Tier is to manage data distribution in hybrid pools. It is
important to know how much of each resource type is required to optimize performance while
keeping the cost as low as possible. To determine a reasonable ratio between fast, but
expensive, and slower, but more cost-effective, resources, you can use the skew value of your
workload. Workload skew is a number that indicates how your active and frequently used data
is distributed across the total space:
򐂰 Workloads with high skew access data mainly from a small portion of the available storage
space.
򐂰 Workloads with low skew access data evenly from all or a large part of the available
storage space.

You gain the most from a small amount of fast resources when the skew value is high. You
can determine the skew value of your workload by using the Storage Tier Advisor Tool (STAT)
that is provided with the DS8000 storage systems. For more information, see 6.5, “Storage
Tier Advisor Tool” on page 213.
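The skew value that is reported by STAT is derived from its own internal heat statistics. The
following Python sketch is not the STAT algorithm; it only illustrates the underlying idea by
computing how small a fraction of the capacity serves a given share of the I/O, based on
assumed per-extent access counts.

def capacity_share_for_io_share(extent_io_counts, io_share=0.80):
    # Sort extents from hottest to coldest and count how many are needed
    # to cover the requested share of the total I/O.
    counts = sorted(extent_io_counts, reverse=True)
    total_io = sum(counts)
    if total_io == 0:
        return 0.0
    cumulative = 0
    for extents_needed, count in enumerate(counts, start=1):
        cumulative += count
        if cumulative >= io_share * total_io:
            return extents_needed / len(counts)
    return 1.0

# A highly skewed workload: 50 hot extents out of 1000 carry most of the I/O
hot_workload = [1000] * 50 + [1] * 950
print("{:.1%} of capacity serves 80% of the I/O"
      .format(capacity_share_for_io_share(hot_workload)))   # about 4.1%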

Easy Tier is designed to provide a balanced and stable workload distribution. It aims at
minimizing the amount of data that must be moved as part of the optimization. To achieve
this, it monitors data access permanently and establishes migration plans based on current
and past patterns. It does not immediately start moving data as soon as it is accessed more
frequently than before; there is a certain learning phase. You can use Easy Tier Application
to override the learning if necessary.

FICON channels
The Disk Magic model that was established for your configuration indicates the number of
host channels, DS8000 HAs, and HA FICON ports that are necessary to meet the
performance requirements. The modeling results are valid only if the workload is evenly
distributed across all available resources. It is your responsibility to make sure this really is
the case.

In addition, there are certain conditions and constraints that you must consider:
򐂰 IBM mainframe systems support a maximum of eight channel paths to each logical control
unit (LCU). A channel path is a connection from a host channel to a DS8000 HA port.
򐂰 A FICON channel port can be shared between several z/OS images and can access
several DS8000 HA ports.



򐂰 A DS8000 HA port can be accessed by several z/OS images or even z Systems servers.
򐂰 A FICON host channel or DS8000 HA port has certain throughput capabilities that are not
necessarily equivalent to the link speed they support.
򐂰 A z Systems FICON feature or DS8000 HA card has certain throughput capabilities that
might be less than the sum of all individual ports on this card. This is especially true for the
DS8000 8-port HA cards.

The following sections provide some examples that can help to select the best connection
topology.

The simplest case is that you must connect one host to one storage system. If the Disk Magic
model indicates that you need eight or less host channels and DS8000 host ports, you can
use a one-to-one connection scheme, as shown in Figure 14-4.

Figure 14-4 One to one host connection

To simplify the figure, it shows only one fabric “cloud”, where you normally have at least two
for redundancy reasons. The lines that directly connect one port to another stand for any
route through the fabric, which can range from a direct connection without any switch
components to a cascaded FICON configuration.



If the model indicates that more than eight connections are required, you must distribute the
connections between resources. One example is to split the DS8000 LCUs into groups and
assign a certain number of connections to each of them. This concept is shown in
Figure 14-5.

Figure 14-5 Split LCUs into groups and assign them to connections

Therefore, you can have up to eight connections, host ports, and storage ports in each group.
It is your responsibility to define the LCU split in a way that each group gets the amount of
workload that the assigned number of connections can sustain.

Note: Disk Magic models can be created on several levels. To determine the best LCU
split, you might need LCU level modeling.

In many environments, more than one host system shares the data and accesses the same
storage system. If eight or less storage ports are sufficient according to your Disk Magic
model, you can implement a configuration as shown in Figure 14-6, where the storage ports
are shared between the host ports.

Figure 14-6 Several hosts accessing the same storage system - sharing all ports



If the model indicates that more than eight storage ports are required, the configuration might
look like the one that is shown in Figure 14-7.

Figure 14-7 Several hosts accessing the same storage system - distributed ports

As you split storage system resources between the host systems, you must make sure that
you assign a sufficient number of ports to each host to sustain its workload.

Figure 14-8 shows another variation, where more than one storage system is connected to a
host. In this example, all storage system ports share the host ports. This works well if the Disk
Magic model does not indicate that more than eight host ports are required for the workload.

Figure 14-8 More than one storage system connected to a host



If more than eight host ports are required, you must split the host ports between storage
resources again. One way of doing this is to provide a separate set of host ports for each
storage system, as shown in Figure 14-9.

Figure 14-9 Separate host ports for each storage system

Another way of splitting HA ports is by LCU, in a similar fashion as shown in Figure 14-5 on
page 480. As in the earlier examples with split resources, you must make sure that you
provide enough host ports for each storage system to sustain the workload to which it is
subject.

Mixed workload: A DS8000 HA has several ports. You can set each individual port to run
FICON or FCP topology. There is nothing wrong with using an HA for FICON, FCP, and
remote replication connectivity concurrently. However, if you need the highest possible
throughput or lowest possible response time for a given topology, consider isolating this
topology on separate HAs.

Remote replication: To optimize the HA throughput if you have remote replication active,
consider not sharing HA ports between the following types of traffic:
򐂰 Synchronous and asynchronous remote replication
򐂰 FCP host I/O and remote replication

High Performance FICON (zHPF): zHPF is an enhancement to FICON, which improves
the channel throughput. It was implemented in several steps, both in hardware and
software, on the host and the storage side, over a couple of years.

At the time of writing, zHPF is used by most access methods and all DB2 workloads use it.
For a more detailed description and how to enable or disable the feature, see IBM DS8870
and IBM z Systems Synergy, REDP-5186.

zHPF Extended Distance II: With DS8000 R7.5 and IBM z Systems z13, zHPF was
further improved to deliver better response times for long write operations, as used, for
example, by DB2 utilities, at greater distances. It reduces the required round trips on the
FICON link. This benefits environments that are IBM HyperSwap® enabled and where the
auxiliary storage system is further away.

For more information about the performance implications of host connectivity, see 8.3,
“Attaching z Systems hosts” on page 276 and 4.10.1, “I/O port planning considerations” on
page 131.



Logical configuration
This section provides z Systems specific considerations for the logical configuration of a
DS8000 storage system. For a detailed and host system independent description of this topic,
see Chapter 4, “Logical configuration performance considerations” on page 83.

Note: The DS8000 architecture is symmetrical, based on the two CPCs. Many resources,
like cache, device adapters (DAs), and RAID arrays, become associated to a CPC. When
defining a logical configuration, you assign each array to one of the CPCs. Make sure to
spread them evenly, not only by count, but also by their capabilities. The I/O workload must
also be distributed across the CPCs as evenly as possible.

The preferred way to achieve this situation is to create a symmetrical logical configuration.

A fundamental question that you must consider is whether there is any special workload that
must be isolated, either because it has high performance needs or because it is of low
importance and should never influence other workloads.

Sharing or isolating resources


One of the most important aspects during logical configuration is to spread the I/O workload
across all available DS8000 resources. Regardless of whether you assign isolated resources
to your workloads, share everything, or configure a mix of both methods, the resources you
assigned must be used as evenly as possible. Otherwise, your sizing will not fit and the overall
performance will not be adequate.

The major groups of resources in a DS8000 storage system are the storage resources, such
as RAID arrays and DAs, and the connectivity resources, such as HAs. The traditional
approach in mainframe environments to assign storage resources is to divide them into the
smallest possible entity (single rank extent pool) and distribute the workload either manually
or managed by the Workload Manager (WLM) and System Managed Storage (SMS) as
evenly as possible. This approach still has its advantages:
򐂰 RMF data provides granular results, which can be linked directly to a resource.
򐂰 If you detect a resource contention, you can use host system tools to fix it, for example, by
moving a data set to a different volume in a different pool.
򐂰 It is easy to detect applications or workloads that cause contention on a resource.
򐂰 Isolation of critical applications is easy.

On the contrary, this approach comes with some significant disadvantages, especially with
modern storage systems that support automated tiering and autonomous balancing of
resources:
򐂰 Statically assigned resources might be over- or underutilized for various reasons:
– Monitoring is infrequent and only based on events or issues.
– Too many or too few resources are allocated to certain workloads.
– Workloads change without resources being adapted.
򐂰 All rebalancing actions can be performed only on a volume level.
򐂰 Modern automatic workload balancing methods cannot be used:
– Storage pool striping.
– Easy Tier automatic tiering.
– Easy Tier intra-tier rebalancing.



It is preferred practice to share as many storage resources as possible and let the storage
system take care of the balancing and tiering. There might be situations where resource or
application isolation is beneficial. In such cases, you must take great care that you assign
sufficient resources to the isolated application to fulfill its requirements. Isolation alone does
not solve a resource constraint issue.

Note: If you plan to share your storage resources to a large extent but still want to make
sure that certain applications have priority over others, consider using the DS8000 I/O
Priority Manager (IOPM) feature, which is described in “I/O Priority Manager” on page 485.

For the host connectivity resources (HAs and FICON links), similar considerations apply. You
can share FICON connections by defining them equally for all accessed LCUs in the
z Systems I/O definitions. That way, the z Systems I/O subsystem takes care of balancing the
load over all available connections. If there is a need to isolate a certain workload, you can
define specific paths for their LCUs and volumes.

Volume sizes
The DS8000 storage system now supports CKD logical volumes of any size from 1 to
1,182,006 cylinders, which is 1062 times the capacity of a 3390-1 (1113 cylinders).

Note: The DS8000 storage system allocates storage with a granularity of one extent,
which is the equivalent of 1113 cylinders. Therefore, selecting capacities that are multiples
of this value is most effective.
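As a minimal sketch of this sizing rule, the following Python function rounds a requested
capacity up to the next multiple of one extent. The function name is arbitrary; the constants
are the values quoted in this section.

import math

EXTENT_CYLS = 1113        # one extent, per the note above
MAX_VOLUME_CYLS = 1182006 # current maximum CKD volume size in cylinders

def extent_aligned_cylinders(requested_cyls):
    # Round up to a whole number of extents so that no allocated space is wasted.
    aligned = math.ceil(requested_cyls / EXTENT_CYLS) * EXTENT_CYLS
    return min(aligned, MAX_VOLUME_CYLS)

print(extent_aligned_cylinders(65520))   # 65667 cylinders, which is 59 extents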

A key factor to consider when planning the CKD volume configuration and sizes is the limited
number of devices a z/OS system can address within one Subchannel Set (65,535). You must
define volumes with enough capacity so that you satisfy your storage requirements within this
supported address range, including room for PAV aliases and future growth.

Apart from saving device addresses, using large volumes brings additional benefits:
򐂰 Simplified storage administration
򐂰 Reduced number of X37 abends and allocation failures because of larger pools of free
space
򐂰 Reduced number of multivolume data sets to manage

One large volume performs the same as though you allocated the same capacity in several
smaller ones, if you use the DS8000 built-in features to distribute and balance the workload
across resources. There are two factors to consider to avoid potential I/O bottlenecks when
using large volumes:
򐂰 Use HyperPAV to reduce IOSQ.
With equal I/O density (I/Os per GB), the larger a volume, the more I/Os it gets. To avoid
excessive queuing, the use of PAV is of key importance. With HyperPAV, you can reduce
the total number of alias addresses because they are assigned automatically as needed.
For more information about the performance implications of PAV, see “Parallel Access
Volumes” on page 485.
򐂰 Eliminate unnecessary reserves.
As the volume sizes grow larger, more data on a single CKD device address will be
accessed in parallel. There is a danger of performance bottlenecks when there are
frequent activities that reserve an entire volume or its VTOC/VVDS.



Parallel Access Volumes
PAVs allow multiple concurrent I/Os to the same volume from applications that run on the
same z/OS system image. This concurrency helps applications better share the logical
volumes with reduced contention. The ability to send multiple concurrent I/O requests to the
same volume nearly eliminates I/O queuing in the operating system, thus reducing I/O
response times.

PAV is implemented by defining alias addresses to the conventional base address. The alias
address provides the mechanism for z/OS to initiate parallel I/O to a volume. An alias is
another address/UCB that can be used to access the volume that is defined on the base
address. An alias can be associated with a base address that is defined in the same LCU
only. The maximum number of addresses that you can define in an LCU is 256. Theoretically,
you can define one base address, plus 255 aliases in an LCU.

Aliases are initially defined to be associated to a certain base address. In a traditional static
PAV environment, the alias is always associated to the same base address, which requires
many aliases and manual tuning.

In dynamic PAV or HyperPAV environments, an alias can be reassigned to any base address
as your needs dictate. Therefore, you need less aliases and no manual tuning.

With dynamic PAV, the z/OS WLM takes care of the alias assignment. It determines the need
for additional aliases at fixed time intervals and therefore adapts to workload changes rather
slowly.

The more modern approach of HyperPAV assigns aliases in real time, based on outstanding
I/O requests to a volume. The function is performed by the I/O subsystem together with the
storage system. HyperPAV reacts immediately to changes. With HyperPAV, you achieve
better average response times and higher total throughput. Today, there is no longer any
technical reason to use static or dynamic PAV.

You can check the usage of alias addresses by using RMF data. Example 14-6 on page 468
shows the I/O queuing report for an LCU, which includes the maximum number of aliases that
are used in the sample period. You can use such reports to determine whether you assigned
enough alias addresses for an LCU.

Number of aliases: With HyperPAV, you need fewer aliases than with the older PAV
algorithms. Assigning 32 aliases per LCU is a good starting point for most workloads. It is a
preferred practice to leave a certain number of device addresses in an LCU initially
unassigned in case it turns out that a higher number of aliases is required.
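The address budget of an LCU can be sketched as follows. The 256 addresses per LCU and
the 32-alias starting point come from this section; the number of base devices in the example
call is purely illustrative.

LCU_ADDRESSES = 256   # maximum device addresses per LCU

def spare_addresses(base_volumes, hyperpav_aliases=32):
    # Addresses left unassigned for future base devices or additional aliases.
    spare = LCU_ADDRESSES - base_volumes - hyperpav_aliases
    if spare < 0:
        raise ValueError("LCU over-committed: reduce base devices or aliases")
    return spare

print(spare_addresses(200))   # 200 bases + 32 aliases leave 24 spare addresses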

Special IBM zSynergy features


This section describes how certain features can help you avoid or overcome I/O performance
issues.

I/O Priority Manager


The IOPM helps you manage quality of service (QoS) levels for each application that runs on
the system. This feature aligns distinct service levels to separate workloads in the system to
help maintain efficient performance of the DS8000 volumes. The IOPM, together with WLM,
detects when a higher priority application is hindered by a lower priority application that
competes for the same system resources. This contention can occur when multiple
applications request data from the same drives. When IOPM encounters this situation, it
delays lower priority I/O to assist the more critical I/O in meeting its performance targets.



For a detailed explanation of IOPM and its integration into WLM, see DS8000 I/O Priority
Manager, REDP-4760.

IBM zHyperWrite
IBM zHyperWrite™ is a technology that is provided by the DS8870 storage system, and used
by z/OS (DFSMS) and DB2 to accelerate DB2 log writes in HyperSwap enabled Metro Mirror
environments.

When an application sends a write I/O request to a volume that is in synchronous data
replication, the response time is increased by the latency because of the distance and by the
replication itself. Although the DS8000 PPRC algorithm is the most effective synchronous
replication available, there still is some replication impact because the write to the primary
and the transfer of the data to the secondary must happen one after the other.

An application that uses zHyperWrite can avoid the replication impact for certain write
operations. An I/O that is flagged accordingly is not replicated by PPRC, but written to the
primary and secondary simultaneously by the host itself. The application, DFSMS, the I/O
subsystem, and the DS8000 storage system closely coordinate the process to maintain
data consistency. The feature is most effective for the following situations:
򐂰 Small writes, where the data transfer time is short.
򐂰 Short distances, where the effect of the latency is not significant.

Note: At the time of writing, only DB2 uses zHyperWrite for log writes.

Copy Services considerations


The performance aspects of all DS8000 Copy Services features are described in IBM
DS8870 Copy Services for IBM z Systems, SG24-6787 and in IBM DS8870 and IBM z
Systems Synergy, REDP-5186. Here are some general aspects to consider from a
performance point of view:
򐂰 Remote data replication workload, such as Metro Mirror, can be accounted for in Disk
Magic. However, not all parameters and values are directly derived from the RMF data.
Make sure all values that are used in the modeling are correct.
򐂰 FlashCopy workload is not reflected in the RMF data that is used as Disk Magic input. If
there is significant FlashCopy activity, you might have to provide extra headroom in the
utilization of the back-end resources.

14.3 Problem determination and resolution


This section provides some basic tips and methods for problem determination and
troubleshooting when you experience insufficient performance that appears to be storage
system-related.

This is only an introduction. It cannot replace a thorough analysis by IBM Technical Support in
more complex situations or if there are product issues.

14.3.1 Sources of information


If your client or the business you provide IT services for approaches you with a performance
issue, you most likely get statements like the following ones:
򐂰 “We cannot complete our jobs in the batch window.”
򐂰 “The backup takes too long.”
򐂰 “Users complain about the response time of application XY.”



Being responsible for the I/O part of your infrastructure, you must convert such general
statements to I/O terms and discover whether the problem is I/O-related. You need more
information from several sources. First, get as much detail as possible about how the issue
appears. Ask the client or user who reports the problem:
򐂰 How is the issue perceived?
򐂰 At which times does it occur?
򐂰 Is it reproducible?
򐂰 Does it show up under specific circumstances?
򐂰 What kind of workload is running at the time?
򐂰 Was there already any analysis that links the issue to I/O? If yes, get the details.
򐂰 Was anything changed, either before or after the issue started to appear?

Next, gather performance data from the system. For I/O related investigations, use RMF data.
For a description of how to collect, process, and interpret RMF data, see 14.1, “DS8000
performance monitoring with RMF” on page 460. There might be a large amount of data that
you must analyze. The faster you can isolate the issue up front, both from a time and a device
point of view, the more selective your RMF analysis can be.

There are other tools or methods that you can use to gather performance data for a DS8000
storage system. They most likely are of limited value in a mainframe environment because
they do not take the host system into account. However, they can be useful in situations
where RMF data does not cover the complete configuration, for example:
򐂰 Mixed environments (mainframe, open, and IBM i)
򐂰 Special copy services configurations, such as Global Mirror secondary

For more information about these other tools, see 14.1.9, “Alternatives and supplements to
RMF” on page 475.

To match physical resources to logical devices, you also need the exact logical configuration
of the affected DS8000 storage systems, and the I/O definition of the host systems you are
analyzing.

14.3.2 Identifying critical and restrained resources


With the information you collected, you can try to discover whether there is an I/O-related
issue, and if so, isolate the responsible resources. In the RMF data you collected, look for
periods and resources with elevated, out-of-normal-range values:
򐂰 High response time
򐂰 Exceptionally high throughput or utilization
򐂰 High elapsed time for jobs

The following sections point you to some key metrics in the RMF reports, which might help
you isolate the cause of a performance issue.

Attention: RMF provides many different measurements. It can be complex to associate
them with conditions that lead to performance issues. This section covers only the most
common symptoms. If the matter is too complex or you need a more general analysis of
your performance situation, consider a performance study. IBM offers this study as a
charged service. Contact your IBM representative or IBM Business Partner if you are
interested. Similar services might also be available from other companies.



RMF Summary Report
The RMF Summary Report gives an overview over the system load. From an I/O perspective,
it provides only LPAR-wide I/O rate and average DASD response time. Use it to get an overall
impression and discover periods of peak I/O load or response time. See whether these
periods match the times for which the performance problems are reported.

Attention: Because the summary report provides a high-level overview, there might be
issues with individual components that are not directly visible here.

Direct Access Device Activity


This report shows the activity of all DASD devices within the LPAR scope. Search this report
for devices with high response times. The term high in this context means either of the
following items:
򐂰 Higher than other devices
򐂰 Higher than usual for certain times

After you isolate a certain time and a set of volumes that are conspicuous, you can analyze
further. Discover which of the response time components are higher than usual. A description
of these components and why they can be increased is provided in 14.1.3, “I/O response time
components” on page 465.

If you also need this information on a Sysplex scope, create the Shared Direct Access Device
Activity report. It provides a similar set of measurements for each volume by LPAR and also
summarized for the complete Sysplex.

Important: Devices with no or almost no activity should not be considered. Their response
time values are not relevant and might be inaccurate.

This rule can also be applied to all other reports and measurements.

Cache Subsystem Summary


This report provides cache statistics at a volume or LCU level. RMF gathers this information
directly from the storage system. You can use it to analyze important values like read/write
ratio, read cache hit ratio, and the DASD fast write bypass rate.

You specifically might want to check the volumes you identified in the previous section and
see whether they have the following characteristics:
򐂰 They have a high write ratio.
򐂰 They show a DASD Fast Write Bypass rate greater than 0, which indicates that you are
running into an NVS Full condition.

Enterprise Storage Server


The data for the Enterprise Storage Server report is also gathered directly from the storage
system. It provides measurements for the DS8000 HAs and the disk back end (RAID arrays /
ranks).

Use the Enterprise Storage Server Link Statistics to analyze the throughput of the DS8000
HA ports. Pay particular attention to those that have higher response time than others. Also,
use the I/O intensity value to determine whether a link might be close to its limitations. All HA
ports are listed here. You can also analyze remote replication and Open Systems workload.



The Enterprise Storage Server Rank Statistics show the back-end I/O load for each of the
installed RAID arrays. Again, look for those that stand out, either with a high load, or much
higher response time.

Attention: Different types of ranks have different capabilities. HPFE or SSD ranks can
sustain much higher IOPS with much lower response time than any HDD rank. Within the
different HDD types, 15 K RPM enterprise (ENT) SAS drives perform better than 10 K
RPM ENT SAS drives. Nearline (NL) SAS drives have the worst performance. The RAID
type (RAID 5, 6, or 10) also affects the capabilities of a rank. Keep this in mind when
comparing back-end performance values. The drive type and RAID level are indicated in
the report.

Channel Path activity


This report contains measurements of all defined z/OS channel paths. You see their utilization,
throughput, and activity, and the report distinguishes between classic FICON and zHPF. The
activity (I/O rate) is the number of FICON and zHPF operations per second, which is different
from the device I/O rate. A device I/O can require several FICON operations.

Look for channel paths that have the following features:


򐂰 Have much higher throughput, utilization, or activity than others, which indicates an
unbalanced configuration.
򐂰 Have high classic FICON activity, which indicates that systems or applications do not use
the more efficient zHPF protocol.
򐂰 Have significantly high defer values, which can be an indicator of frame pacing.

I/O Queueing activity


This report consists of two parts. The first part shows the queuing activity and utilization of the
z Systems I/O processors. Check whether some IOPs are busier than others.

The second part of the report provides queuing details on an LCU and channel path level.
You can see whether I/Os are delayed on their way to the device. Check for LCUs / paths that
have the following features:
򐂰 Higher average control unit busy delay (AVG CUB DLY), which can mean that devices are
in use or reserved by another system.
򐂰 Higher average command response delay (AVG CMR DLY), which can indicate a saturation
of certain DS8000 resources, such as HA, internal bus, or processor.
򐂰 Nonzero HPAV wait times and HPAV max values in the order of the number of defined
alias addresses, which can indicate that the number of alias addresses is not sufficient.

14.3.3 Corrective actions


If you can identify where the performance issue is, you can devise and perform measures to
correct it or avoid it in the future. Depending on what the cause is, there are several
approaches that you can follow.

In many cases, your analysis shows that one or more resources either on the host system or
DS8000 storage system are saturated or overloaded. The first thing to check is whether your
storage system is configured to use all available features that improve performance or
automate resource balancing.



The most important ones are the following:
򐂰 Easy Tier automated mode to avoid or eliminate back-end hot spots
򐂰 zHPF to improve the efficiency of the FICON links
򐂰 HyperPAV to improve parallelism
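
As a quick check before deeper tuning, you can verify from the z/OS console whether zHPF and
HyperPAV are active. The following operator commands are a minimal sketch; the exact output and
options depend on the z/OS release, and both functions also require the corresponding DS8000
licensed features:

   D IOS,ZHPF            Display whether zHPF is enabled
   SETIOS ZHPF=YES       Enable zHPF dynamically (make it persistent in the IECIOSxx member)
   D IOS,HYPERPAV        Display the HyperPAV setting
   SETIOS HYPERPAV=YES   Enable HyperPAV dynamically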

If these features do not solve the issue, consider the following actions:
򐂰 Distribute the workload further over additional existing resources with less utilization.
򐂰 Add more resources of the same type, if there is room.
򐂰 Exchange the existing, saturated resources for different ones (other or newer technology)
with higher capabilities.

If you isolated applications (for example, by using their own set of SSD ranks) but still
experience poor response times, check the following items:
򐂰 Are the dedicated resources saturated? If yes, you can add more resources, or consider
switching to a shared resource model.
򐂰 Is the application doing something that the dedicated resources are not suited for (for
example, mostly sequential read and write operations on SSD ranks)? If yes, consider
changing the resource type, or again, switching to a shared model.
򐂰 Does the contention come from other resources that are not dedicated, such as HA ports
in our example with dedicated SSD ranks? Here, you can consider increasing the isolation
by dedicating host ports to the application as well.

If you are running in a resource sharing model and find that your overall I/O performance is
good, but there is one critical application that suffers from poor response times, you can
consider moving to an isolation model, and dedicate certain resources to this application. If
the issue is limited to the back end, another solution might be to use advanced functions:
򐂰 IOPM to prioritize the critical application
򐂰 Easy Tier Application to manually assign certain data to a specific storage tier

If you cannot identify a saturated resource, but still have an application that experiences
insufficient throughput, it might not use the I/O stack optimally. For example, modern storage
systems can process many I/Os in parallel. If an application does not use this capability and
serializes all I/Os, it might not get the required throughput, although the response times of
individual I/Os are good.



Chapter 15. IBM System Storage SAN Volume Controller attachment

This chapter describes the guidelines and procedures to make the most of the performance
available from your DS8000 storage system when attached to IBM Spectrum Virtualize™
software and the IBM SAN Volume Controller.

This chapter includes the following topics:


򐂰 IBM System Storage SAN Volume Controller
򐂰 SAN Volume Controller performance considerations
򐂰 DS8000 performance considerations with SAN Volume Controller
򐂰 Performance monitoring
򐂰 Configuration guidelines for optimizing performance
򐂰 Where to place flash
򐂰 Where to place Easy Tier



15.1 IBM System Storage SAN Volume Controller
The IBM System Storage SAN Volume Controller is designed to increase the flexibility of your
storage infrastructure by introducing an in-band virtualization layer between the servers and
the storage systems. The SAN Volume Controller can enable a tiered storage environment to
increase flexibility in storage management. The Spectrum Virtualize software combines the
capacity from multiple disk or flash storage systems into a single storage pool that can be
managed from a central point, which simplifies management and helps increase disk capacity
utilization. With the SAN Volume Controller, you can apply Spectrum Virtualize Advanced
Copy Services across storage systems from many vendors to help you simplify operations.

For more information about SAN Volume Controller, see Implementing the IBM System
Storage SAN Volume Controller V7.4, SG24-7933.

15.1.1 SAN Volume Controller concepts


The SAN Volume Controller is a storage area network (SAN) appliance that attaches storage
devices to supported Open Systems servers. The SAN Volume Controller provides symmetric
virtualization by creating a pool of managed disks (MDisks) from the attached storage
subsystems, which are then mapped to a set of virtual disks (VDisks) for use by various
attached host computer systems. System administrators can view and access a common
pool of storage on the SAN, which enables them to use storage resources more efficiently
and provides a common base for advanced functions.

The Spectrum Virtualize solution is designed to reduce both the complexity and costs of
managing your SAN-based storage. With the SAN Volume Controller, you can perform these
tasks:
򐂰 Simplify management and increase administrator productivity by consolidating storage
management intelligence from disparate storage controllers into a single view, including
non-IBM storage.
򐂰 Improve application availability by enabling data migration between disparate disk storage
devices non-disruptively.
򐂰 Improve disaster recovery and business continuance needs by applying and managing
Copy Services across disparate disk storage devices within the SAN.
򐂰 Provide advanced features and functions to the entire SAN:
– Large scalable cache
– Advanced Copy Services
– Space management
– Mapping based on wanted performance characteristics
– Quality of service (QoS) metering and reporting

The SAN Volume Controller adds the following options to a DS8000 storage system:
򐂰 iSCSI or FCoE attachment
򐂰 IBM Real-time Compression™ (RtC) of volumes by using SAN Volume Controller built-in
compression accelerator cards and software.

For more information about the SAN Volume Controller, see the IBM Knowledge Center:
http://www.ibm.com/support/knowledgecenter/STPVGU_7.6.0/com.ibm.storage.svc.console.760.doc/svc_ichome_760.html



Spectrum Virtualize: SAN Volume Controller virtualization
The SAN Volume Controller provides block aggregation into volumes and volume
management for disk storage within the SAN. In simpler terms, the Spectrum Virtualize
software manages a number of back-end storage controllers and maps the physical storage
within those controllers to logical disk images that can be seen by application servers and
workstations in the SAN.

The SAN must be zoned in such a way that the application servers cannot see the back-end
storage, preventing the SAN Volume Controller and the application servers from both trying to
manage the back-end storage.

The SAN Volume Controller I/O Groups are connected to the SAN in such a way that all
back-end storage and all application servers are visible to all of the I/O Groups. The SAN
Volume Controller I/O Groups see the storage presented to the SAN by the back-end
controllers as a number of disks, which are known as MDisks. MDisks are collected into one
or several groups, which are known as Managed Disk Groups (MDGs), or storage pools.
When an MDisk is assigned to a storage pool, the MDisk is divided into a number of extents
(the extent minimum size is 16 MiB and the extent maximum size is 8 GiB). The extents are
numbered sequentially from the start to the end of each MDisk.

The storage pool provides the capacity in the form of extents, which are used to create
volumes, also known as virtual disks (VDisks).

When creating SAN Volume Controller volumes or VDisks, the default option of striped
allocation is normally the preferred choice. This option helps balance I/Os across all the
MDisks in a storage pool, which optimizes overall performance and helps reduce hot spots.
Conceptually, this method is represented in Figure 15-1.

Figure 15-1 Extents being used to create virtual disks (a VDisk is a collection of extents, each 16 MiB to 8 GiB, taken from a storage pool)

The virtualization function in the SAN Volume Controller maps the volumes that are seen by
the application servers onto the MDisks provided by the back-end controllers. I/O traffic for a
particular volume is, at any one time, handled exclusively by the nodes in a single I/O Group.
Thus, although a cluster can have many nodes within it, the nodes handle I/O in independent
pairs, which means that the I/O capability of the SAN Volume Controller scales well (almost
linearly) because additional throughput can be obtained by adding additional I/O Groups.

Figure 15-2 summarizes the various relationships that bridge the physical disks through to the
VDisks within the SAN Volume Controller architecture.

Figure 15-2 Relationship between physical and virtual disks (the storage subsystem SCSI LUNs from the back-end RAID arrays are mapped directly to the SAN Volume Controller cluster as Managed Disks, the Managed Disks are grouped into Managed Disk Groups, or storage pools, according to their characteristics, Virtual Disks are created within a Managed Disk Group and mapped to the hosts, and SDD is the only multipathing device driver that is needed on the server side; the SAN Volume Controller, type 2145, isolates the hosts from any storage modifications and manages the relation between Virtual Disks and Managed Disks)

15.1.2 SAN Volume Controller multipathing


Each SAN Volume Controller node presents a VDisk to the SAN through multiple paths. A
VDisk can be seen in the SAN by four paths. In normal operation, two nodes provide
redundant paths to the same storage, which means that, depending on zoning and SAN
architecture, a single server might see eight paths to each LUN presented by the SAN
Volume Controller. Each server host bus adapter (HBA) port must be zoned to a single port
on each SAN Volume Controller node.

Because most operating systems have only some basic detection or management of multiple
paths to a single physical device, IBM provides a multipathing device driver. The multipathing
driver that is supported by the SAN Volume Controller is the IBM Subsystem Device Driver
(SDD). SDD groups all available paths to a VDisk device and presents it to the operating
system. SDD performs all the path handling and selects the active I/O paths.

SDD supports the concurrent attachment of various DS8000 and IBM FlashSystem™
models, IBM Storwize V7000, V5000, and V3000, and SAN Volume Controller storage
systems to the same host system. Where one or more alternative storage systems are to be
attached, you can identify the required version of SDD at this website:
http://www.ibm.com/support/docview.wss?uid=ssg1S7001350



You can use SDD with the native Multipath I/O (MPIO) device driver on AIX and on Microsoft
Windows Server. For AIX MPIO, a Subsystem Device Driver Path Control Module (SDDPCM)
is provided to deliver I/O load balancing. The Subsystem Device Driver Device Specific
Module (SDDDSM) provides MPIO support based on the MPIO technology of Microsoft. For
Linux versions, a Device Mapper Multipath configuration file is available.
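
To verify that all expected paths to the SAN Volume Controller volumes are available and that the
load is spread across them, you can query the multipathing driver on the host. The following
commands are a sketch; which of them applies depends on the operating system and on whether SDD,
SDDPCM, SDDDSM, or Device Mapper Multipath is installed:

   datapath query device      SDD or SDDDSM: list devices and the state of each path
   pcmpath query device       AIX with SDDPCM: list MPIO devices and their path states
   pcmpath query adapter      AIX with SDDPCM: per-adapter path counts and I/O statistics
   multipath -ll              Linux Device Mapper Multipath: list the paths of each LUN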

15.1.3 SAN Volume Controller Advanced Copy Services


The Spectrum Virtualize software provides Advanced Copy Services so that you can copy
volumes (VDisks) by using FlashCopy and Remote Copy functions. These Copy Services are
available for all supported servers that connect to the SAN Volume Controller cluster.

FlashCopy makes an instant, point-in-time copy from a source VDisk volume to a target
volume. A FlashCopy can be made only to a volume within the same SAN Volume Controller.

Remote Copy includes these functions:


򐂰 Metro Mirror
򐂰 Global Mirror

Metro Mirror is a synchronous remote copy, which provides a consistent copy of a source
volume to a target volume. Metro Mirror can copy between volumes (VDisks) on separate
SAN Volume Controller clusters or between volumes within the same I/O Group on the same
SAN Volume Controller.

Global Mirror is an asynchronous remote copy, which provides a remote copy over extended
distances. Global Mirror can copy between volumes (VDisks) on separate SAN Volume
Controller clusters or between volumes within the same I/O Group on the same SAN Volume
Controller.

Important: SAN Volume Controller Copy Services functions are incompatible with the
DS8000 Copy Services.

The physical implementation of Global Mirror between DS8000 storage systems versus
Global Mirror between SAN Volume Controllers is internally different. The DS8000 storage
system is one of the products with the most sophisticated Global Mirror implementations
available, especially, for example, when it comes to working with small bandwidths, weak or
unstable links, or intercontinental distances. When changing architectures from one Global
Mirror implementation to the other, a resizing of the minimum required Global Mirror
bandwidth is likely needed.

For more information about the configuration and management of SAN Volume Controller
Copy Services, see the Advanced Copy Services chapters of Implementing the IBM System
Storage SAN Volume Controller V7.4, SG24-7933 or IBM System Storage SAN Volume
Controller and Storwize V7000 Replication Family Services, SG24-7574.

A FlashCopy mapping can be created between any two VDisk volumes in a SAN Volume
Controller cluster. It is not necessary for the volumes to be in the same I/O Group or storage
pool. This function can optimize your storage allocation by using an auxiliary storage system
(with, for example, lower performance) as the target of the FlashCopy. In this case, the
resources of your high-performance storage system are dedicated for production. Your
low-cost (lower performance) storage system is used for a secondary application (for
example, backup or development).

An advantage of SAN Volume Controller remote copy is that you can implement these
relationships between two SAN Volume Controller clusters with different back-end disk
subsystems. In this case, you can reduce the overall cost of the disaster recovery
infrastructure. The production site can use high-performance back-end disk systems, and the
recovery site can use low-cost back-end disk systems, even where the back-end disk
subsystem Copy Services functions are not compatible (for example, different models or
different manufacturers). This relationship is established at the volume level and does not
depend on the back-end disk storage system Copy Services.

Important: For Metro Mirror copies, the recovery site VDisk volumes must have
performance characteristics similar to the production site volumes when a high write I/O
rate is present to maintain the I/O response level for the host system.

15.2 SAN Volume Controller performance considerations


The SAN Volume Controller cluster is scalable up to eight nodes. The performance is almost
linear when adding more I/O Groups to a SAN Volume Controller cluster until it becomes
limited by other components in the storage infrastructure. Although virtualization with the SAN
Volume Controller provides a great deal of flexibility, it does not diminish the necessity to have
a SAN and disk systems that can deliver the wanted performance.

The following section presents the SAN Volume Controller concepts and describes the
performance of the SAN Volume Controller. This section assumes that there are no
bottlenecks in the SAN or on the disk system.

Determining the number of I/O Groups


Growing or adding I/O Groups to a SAN Volume Controller cluster is a decision that you must
make when either a configuration limit is reached or when the I/O load reaches a point where
a new I/O Group is needed.

To determine the number of I/O Groups and to monitor the processor performance of each
node, you can use IBM Spectrum Control, the IBM Virtual Storage Center (VSC), or the IBM
Tivoli Storage Productivity Center. The processor performance is related to I/O performance,
and when the processors become consistently 70% busy, you must consider one of these
actions:
򐂰 Adding more nodes to the cluster and moving part of the workload onto the new nodes
򐂰 Moving VDisk volumes to another I/O Group, if the other I/O Group is not busy

To see how busy your processors are, you can use the Tivoli Storage Productivity Center
performance report, by selecting the CPU Utilization option.
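
If Spectrum Control or Tivoli Storage Productivity Center is not available, recent SAN Volume
Controller code levels also report node statistics directly on the CLI. The following commands are
a sketch that assumes a code level that provides the lsnodestats output; the node name node1 is a
placeholder, and the exact statistic names can differ between releases:

   svcinfo lsnode        List the nodes and their I/O Groups
   lsnodestats node1     Show recent per-node statistics, including processor utilization (cpu_pc)
   lssystemstats         Show recent cluster-wide statistics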

The following activities affect processor utilization:


򐂰 Volume activity
򐂰 Cache management
򐂰 FlashCopy activity
򐂰 Mirror Copy activity
򐂰 Real-time compression

With the newly added I/O Group, the SAN Volume Controller cluster can potentially double
the I/O rate per second (IOPS) that it can sustain. A SAN Volume Controller cluster can be
scaled up to an eight-node cluster with which you quadruple the total I/O rate.



Most common bottlenecks
When you analyze existing client environments, the most common bottleneck for a
non-performing SAN Volume Controller installation is caused by insufficiencies in the storage
system back end, specifically, too few disk drives. The second most common issue is high
SAN Volume Controller port utilization, which underlines the need for adequate bandwidth
sizing. SAN Volume Controller processor utilization issues are the third most common
bottleneck.

Number of ports in the SAN used by SAN Volume Controller


The SAN Volume Controller ports are more heavily loaded than the ports of a “native” storage
system because the SAN Volume Controller nodes must handle the following I/O traffic:
򐂰 All the host I/O
򐂰 The read-cache miss I/O (the SAN Volume Controller cache-hit rate is lower than that of a
DS8000 storage system)
򐂰 All write destage I/Os (doubled if VDisk mirroring is used)
򐂰 All writes for cache mirroring
򐂰 Traffic for remote mirroring

You must carefully plan the SAN Volume Controller port bandwidth.

Number of paths from SAN Volume Controller to a disk system


All SAN Volume Controller nodes in a cluster must be able to see the same set of storage
system ports on each device. Operation in a mode in which two nodes do not see the
same set of ports on the same device is degraded, and the system logs errors that request a
resolution.

For the DS8000 storage system, there is no controller affinity for the LUNs. So, a single zone
for all SAN Volume Controller ports and up to eight DS8000 host adapter (HA) ports must be
defined on each fabric. The DS8000 HA ports must be distributed over as many HA cards as
available and dedicated to SAN Volume Controller use if possible.

Configure a minimum of eight DS8000 ports to the SAN Volume Controller, regardless of the
number of nodes in the cluster. Configure 16 ports for large configurations where more than
48 DS8000 ranks are presented to the SAN Volume Controller cluster.
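
The DS8000 I/O ports that are dedicated to the SAN Volume Controller must run the FCP (SCSI-FCP)
topology. The following DS CLI commands are a sketch of how you might verify and set this; the
port IDs I0030 and I0131 are placeholders and depend on your configuration:

   dscli> lsioport -l                          List all I/O ports with their topology and state
   dscli> setioport -topology scsi-fcp I0030   Set a port to FCP for SAN Volume Controller use
   dscli> setioport -topology scsi-fcp I0131   Use ports on different HA cards to spread the load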

Optimal storage pool configurations


A storage pool, or MDG, provides the pool of storage from which VDisks are created.
Therefore, it is necessary to ensure that each entire tier in a storage pool provides the same
performance and reliability characteristics.

For the DS8000 storage system, all LUNs in the same storage pool tier ideally have these
characteristics:
򐂰 Use disk drive modules (DDMs) of the same capacity and speed.
򐂰 Arrays must be the same RAID type.
򐂰 Use LUNs that are the same size.

For the extent size, to maintain maximum flexibility, use a SAN Volume Controller extent size
of 1 GiB (1024 MiB).
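
A minimal sketch of creating such a storage pool with a 1 GiB extent size on the SAN Volume
Controller CLI follows; the pool name and the MDisk names are placeholders:

   svctask detectmdisk
   svctask mkmdiskgrp -name DS8K_ENT_P0 -ext 1024 -mdisk mdisk8:mdisk9

The detectmdisk command discovers the newly mapped DS8000 volumes, and mkmdiskgrp creates the
storage pool with the listed MDisks and a 1024 MiB extent size.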

For more information, see IBM System Storage SAN Volume Controller and Storwize V7000
Best Practices and Performance Guidelines, SG24-7521.

15.3 DS8000 performance considerations with SAN Volume Controller

Use the DS8000 configuration to optimize the performance of your virtualized environment.

15.3.1 DS8000 array


The DS8000 storage system provides protection against the failure of individual DDMs by
using RAID arrays. This protection is important because the SAN Volume Controller provides
no protection for the MDisks within a storage pool.

Array RAID configuration


A DS8000 array is a RAID 5, RAID 6, or RAID 10 array that is made up of eight DDMs. A
DS8000 array is created from one array site, which is formatted in one of these RAID types.

There are many workload attributes that influence the relative performance of RAID 5
compared to RAID 10, including the use of cache, the relative mix of read as opposed to write
operations, and whether data is referenced randomly or sequentially.

Consider these RAID characteristics:


򐂰 For either sequential or random reads from disk, there is no difference in RAID 5 and
RAID 10 performance, except at high I/O rates. RAID 6 can also provide acceptable
results.
򐂰 For random writes to disk, RAID 10 performs better.
򐂰 For sequential writes to disk, RAID 5 performs better.
򐂰 RAID 6 performance is slightly inferior to RAID 5 but provides protection against two DDM
failures. Because of their large capacities, Nearline drives come in RAID 6.

SAN Volume Controller does not need to influence your choice of the RAID type that is used.
For more information about the RAID 5 and RAID 10 differences, see 4.7, “Planning RAID
arrays and ranks” on page 97.

15.3.2 DS8000 rank format


Using the DS8000 DSCLI, a rank is created for each array, and there is a one-to-one
relationship between arrays and ranks (the DS8000 GUI does not distinguish between them
and simply presents managed arrays). The SAN Volume Controller requires that such ranks are
created in Fixed Block (FB) format, which divides each array into 1 GiB extents (where
1 GiB = 2^30 bytes). A rank must be assigned to an extent pool to be available for LUN creation.

The DS8000 processor complex (or server group) affinity is determined when the rank is
assigned. Assign the same number of ranks in a DS8000 storage system to each of the
processor complexes. Additionally, if you do not need to use all of the arrays for your SAN
Volume Controller storage pool, select the arrays so that you use arrays from as many device
adapters (DAs) as possible to balance the load across the DAs.



15.3.3 DS8000 extent pool implications
In the DS8000 architecture, extent pools are used to manage one or more ranks. An extent
pool is visible to both processor complexes in the DS8000 storage system, but it is directly
managed by only one of them. You must define a minimum of two extent pools, with one extent
pool created for each processor complex, to fully use the resources of both.
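
A minimal DS CLI sketch of creating such an extent pool pair and assigning ranks to it follows;
the pool names and rank IDs are placeholders:

   dscli> mkextpool -rankgrp 0 -stgtype fb SVC_P0   Extent pool managed by processor complex 0
   dscli> mkextpool -rankgrp 1 -stgtype fb SVC_P1   Extent pool managed by processor complex 1
   dscli> chrank -extpool P0 R0                     Assign rank R0 to extent pool P0
   dscli> chrank -extpool P1 R1                     Assign rank R1 to extent pool P1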

Classical approach: One array per extent pool configuration


For SAN Volume Controller attachments, many clients formatted the DS8000 arrays in 1:1
assignments between arrays and extent pools, which disabled any DS8000 storage pool
striping or auto-rebalancing activity. Then, they located one or two volumes (MDisks) in each
extent pool exclusively on one rank only, and put all of those volumes into one SAN Volume
Controller storage pool. The SAN Volume Controller controlled striping across all these
volumes and balanced the load across the RAID ranks by that method. No more than two
MDisks (DS8000 volumes) per rank are needed with this approach.1 So, the rank size
determines the MDisk size. For example, if the rank is 3682 GiB, make two volumes of
1841 GiB each, and possibly put them in different storage pools to avoid double striping
across one rank.

Often, clients worked with at least two storage pools: one (or two) containing MDisks of all the
6+P RAID 5 ranks of the DS8000 storage system, and the other one (or more) containing the
slightly larger 7+P RAID 5 ranks. This approach maintains equal load balancing across all
ranks when the SAN Volume Controller striping occurs because each MDisk in a storage pool
is the same size then.

The SAN Volume Controller extent size is the stripe size that is used to stripe across all these
single-rank MDisks.

This approach delivered good performance and has its justifications. However, it also has a
few drawbacks:
򐂰 There can be natural skew, for example, a small file of a few hundred KiB that is heavily
accessed. Even with a smaller SAN Volume Controller extent size, such as 256 MiB, this
classical setup led in a few cases to ranks that are more loaded than other ranks.
򐂰 When you have more than two MDisks on one rank, and not as many SAN Volume
Controller storage pools, the SAN Volume Controller might start striping across many
entities that are effectively in the same rank, depending on the MDG layout. Such striping
should be avoided.
򐂰 In DS8000 installations, clients tend to go to larger (multi-rank) extent pools to use modern
features, such as auto-rebalancing or advanced tiering.

An advantage of this classical approach is that it delivers more options for fault isolation and
control over where a certain volume and extent are located.

Modern approach: Multi-rank extent pool configuration


A more modern approach is to create a few DS8000 extent pools, for example, two DS8000
extent pools. Use either DS8000 storage pool striping or automated Easy Tier rebalancing to
help prevent overloading individual ranks.

1 The SAN Volume Controller 7.6 Restrictions document, which is available at
http://www.ibm.com/support/docview.wss?uid=ssg1S1005424#_Extents, provides a table with the maximum size of
an MDisk that depends on the extent size of the storage pool.

You have two options:
򐂰 Go for huge multitier hybrid pools, having just one pair of DS8000 pools where the
DS8000 internal Easy Tier logic is also doing the cross-tier internal optimization.
򐂰 Create as many extent pool pairs in the DS8000 storage system as you have tiers,
report each such DS8000 pool separately to the SAN Volume Controller, and let the
SAN Volume Controller-internal Easy Tier logic make the cross-tier optimization.

In the latter case, the DS8000 internal Easy Tier logic can still do intra-tier auto-rebalancing.
For more information about this topic, see 15.8, “Where to place Easy Tier” on page 506.

You need only one MDisk volume size with this multi-rank approach because plenty of space
is available in each large DS8000 extent pool. Often, clients choose 2 TiB (2048 GiB) MDisks
for this approach. Create many 2-TiB volumes in each extent pool until the DS8000 extent
pool is full, and provide these MDisks to the SAN Volume Controller to build the storage pools.
At least two extent pools are needed so that each DS8000 processor complex (even/odd) is
loaded about equally; a larger number of pools can also be used.
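
A minimal DS CLI sketch of carving equal 2 TiB volumes (the future MDisks) out of such an extent
pool pair follows; the volume IDs, names, and pool IDs are placeholders:

   dscli> mkfbvol -extpool P0 -cap 2048 -name svc_p0_#h 1000-1007   Eight 2 TiB volumes in pool P0
   dscli> mkfbvol -extpool P1 -cap 2048 -name svc_p1_#h 1100-1107   Eight 2 TiB volumes in pool P1

Repeat until the extent pools are filled to the wanted level, and then map the volumes to the SAN
Volume Controller, as described in 15.3.5, “Volume assignment to SAN Volume Controller”.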

If you use DS8000 Easy Tier, even if only for intra-tier auto-rebalancing, do not fill your extent
pools to 100%. You must leave a small amount of space, a few extents per rank, free so that
Easy Tier can work.

To maintain the highest flexibility and for easier management, large DS8000 extent pools are
beneficial. However, if the SAN Volume Controller DS8000 installation is dedicated to
shared-nothing environments, such as Oracle ASM, DB2 warehouses, or General Parallel
File System (GPFS), use the single-rank extent pools.

15.3.4 DS8000 volume considerations with SAN Volume Controller


This section describes preferred practices about volume creation on the DS8000 storage
system when it is assigned to the SAN Volume Controller.

Number of volumes per extent pool and volume size considerations


In a classical (one rank = one extent pool) SAN Volume Controller environment, it is a
preferred practice to define one or two volumes per rank. Tests show a small response time
advantage to the two LUNs per array configuration and a small IOPS advantage to the one
LUN per array configuration for sequential workloads. Overall, the performance differences
between these configurations are minimal.

With the DS8000 supporting volume sizes up to 16 TiB, a classical approach is still possible
when using large disks, such as placing only two volumes onto an array of the 4 TB Nearline
disks (RAID 6). The volume size in this case is determined by the rank capacity.

With the modern approach of using large multi-rank extent pools, more clients use a standard
and not-too-large volume MDisk size, such as 2 TiB for all MDisks, with good results.

As a preferred practice, assign SAN Volume Controller DS8000 LUNs of all the same size for
each storage pool. In this configuration, the workload that is applied to a VDisk is equally
balanced on the MDisks within the storage pool.



15.3.5 Volume assignment to SAN Volume Controller
On the DS8000 storage system, create one volume group in which to include all the volumes
that are defined to be managed by SAN Volume Controller and all the host connections of the
SAN Volume Controller node ports. The DS8000 storage system offers the host type SVC for
this function.

Volumes can be added dynamically to the SAN Volume Controller. When the volume is added
to the volume group, run the svctask detectmdisk command on the SAN Volume Controller
to add it as an MDisk.

Before you delete or unmap a volume that is allocated to the SAN Volume Controller, remove
the MDisk from the SAN Volume Controller storage pool, which automatically migrates any
extents for defined volumes to other MDisks in the storage pool, if there is space available.
When it is unmapped on the DS8000 storage system, run the svctask detectmdisk
command and then run the maintenance procedure on the SAN Volume Controller to confirm
its removal.
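
A minimal sketch of this assignment follows, with DS CLI commands on the DS8000 storage system
and the discovery command on the SAN Volume Controller; the volume IDs, volume group ID,
connection names, and worldwide port names (WWPNs) are placeholders:

   dscli> mkvolgrp -type scsimask -volume 1000-1007 SVC_VG    Create the volume group with the volumes
   dscli> mkhostconnect -wwname 5005076801000001 -hosttype SVC -volgrp V0 svc_n1_p1
   dscli> chvolgrp -action add -volume 1100-1107 V0           Add more volumes to the group later

   svctask detectmdisk                                        Discover the new MDisks on the SAN Volume Controller

Repeat the mkhostconnect command for each SAN Volume Controller node port that is zoned to the
DS8000 storage system.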

15.4 Performance monitoring


You can use IBM Spectrum Control and Tivoli Storage Productivity Center to manage the IBM
SAN Volume Controller and monitor its performance. They are described in the following
resources:
򐂰 IBM Tivoli Storage Productivity Center V5.2 Release Guide, SG24-8204
򐂰 http://www.ibm.com/support/knowledgecenter/SS5R93_5.2.8/com.ibm.spectrum.sc.doc/tpc_kc_homepage.html

To configure Spectrum Control or Tivoli Storage Productivity Center to monitor IBM SAN
Volume Controller, see IBM System Storage SAN Volume Controller and Storwize V7000
Best Practices and Performance Guidelines, SG24-7521.

Data that is collected from SAN Volume Controller


The two most important metrics when measuring I/O subsystem performance are response
time in milliseconds and throughput in I/Os per second (IOPS):
򐂰 Response time in non-SAN Volume Controller environments is measured from when the
server issues a command to when the storage controller reports the command as
completed. With the SAN Volume Controller, you must consider response time from the
server to the SAN Volume Controller nodes and also from the SAN Volume Controller
nodes to the storage controllers. The VDisk volume response time is what the client sees,
but if this number is high and there is no SAN Volume Controller bottleneck, it is
determined by the MDisk response time.
򐂰 However, throughput can be measured at various points along the data path, and the SAN
Volume Controller adds additional points where throughput is of interest and
measurements can be obtained.

IBM Spectrum Control offers many disk performance reporting options that support the SAN
Volume Controller environment and also the storage controller back end for various storage
controller types. The following storage components are the most relevant for collecting
performance metrics when monitoring storage controller performance:
򐂰 Subsystem
򐂰 Controller
򐂰 Array

򐂰 MDisk
򐂰 MDG, or storage pool
򐂰 Port

With the SAN Volume Controller, you can monitor at the I/O Group and the SAN Volume
Controller node levels.

SAN Volume Controller thresholds


Thresholds are used to determine watermarks for warning and error indicators for an
assortment of storage metrics. SAN Volume Controller has the following thresholds within its
default properties:
򐂰 Volume (VDisk) I/O rate: Total number of VDisk I/Os for each I/O Group
򐂰 Volume (VDisk) bytes per second: VDisk bytes per second for each I/O Group
򐂰 MDisk I/O rate: Total number of MDisk I/Os for each MDG
򐂰 MDisk bytes per second: MDisk bytes per second for each MDG

The default status for these properties is Disabled with the Warning and Error options set to
None. Enable a particular threshold only after the minimum values for warning and error
levels are defined.

Tip: In Tivoli Storage Productivity Center for Disk, default threshold warning or error values
of -1.0 are indicators that there is no suggested minimum value for the threshold and are
entirely user-defined. You can choose to provide any reasonable value for these thresholds
based on the workload in your environment.

15.5 Sharing the DS8000 storage system between various server types and the SAN Volume Controller

The DS8000 storage system can be shared between servers and a SAN Volume Controller.
This sharing can be useful if you want direct attachment for specific Open Systems servers or
if you must share your DS8000 storage system between the SAN Volume Controller and a
SAN Volume Controller-unsupported operating environment, such as z Systems/FICON.
Also, this option might be appropriate for IBM i.

For the current (currently Version 7.6) list of hardware that is supported for attachment to the
SAN Volume Controller, see this website:
http://www.ibm.com/support/docview.wss?uid=ssg1S1005419

15.5.1 Sharing the DS8000 storage system between Open Systems servers and the SAN Volume Controller

If you have a mixed environment that includes IBM SAN Volume Controller and Open
Systems servers, share as much of the DS8000 resources as possible between both environments.

Most clients choose a DS8000 extent pool pair (or pairs) for their SAN Volume Controller
volumes only, and other extent pool pairs for their directly attached servers. This approach is
a preferred practice, but you can fully share on the drive level if preferred. Easy Tier
auto-rebalancing, as done by the DS8000 storage system, can be enabled for all pools.



If you are sharing pools, I/O Priority Manager works on the level of full DS8000 volumes. So,
when you have large MDisks, which are the DS8000 volumes, I/O Priority Manager cannot
prioritize between various smaller VDisk volumes that are cut out of these MDisks. I/O Priority
Manager enables SAN Volume Controller volumes as a whole to be assigned different
priorities compared to other direct server volumes. For example, if IBM i mission-critical
applications with directly attached volumes share extent pools with SAN Volume Controller
MDisk volumes (very uncommon!), the I/O Priority Manager can throttle the complete SAN
Volume Controller volumes in I/O contention to protect the IBM i application performance.

IBM supports sharing a DS8000 storage system between a SAN Volume Controller and an
Open Systems server. However, if a DS8000 port is in the same zone as a SAN Volume
Controller port, that same DS8000 port must not be in the same zone as another server.

15.5.2 Sharing the DS8000 storage system between z Systems servers and the SAN Volume Controller

SAN Volume Controller does not support FICON/count key data (CKD) based z Systems
server attachment. If you have a mixed server environment that includes IBM SAN Volume
Controller and hosts that use CKD, you must share your DS8000 storage system to provide
direct access to z Systems volumes and access to Open Systems server volumes through
the SAN Volume Controller.

In this case, you must split your DS8000 resources between two environments. You must
create a part of the ranks by using the CKD format (used for z Systems access) and the other
ranks in FB format (used for SAN Volume Controller access). In this case, both environments
get performance that is related to the allocated DS8000 resources.

A DS8000 port does not support a shared attachment between z Systems and SAN Volume
Controller. z Systems servers use the Fibre Channel connection (FICON), and SAN Volume
Controller supports Fibre Channel Protocol (FCP) connection only. Both environments should
each use their dedicated DS8000 HAs.

15.6 Configuration guidelines for optimizing performance

Guidelines: Many of these guidelines are not unique to configuring the DS8000 storage
system for SAN Volume Controller attachment. In general, any server can benefit from a
balanced configuration that uses the maximum available bandwidth of the DS8000 storage
system.

Follow the guidelines and procedures that are outlined in this section to make the most of the
performance that is available from your DS8000 storage systems and to avoid potential I/O
problems:
򐂰 Use multiple HAs on the DS8000 storage system. If many spare DS8000 ports are
available, use no more than two ports on each card. Use many ports on the DS8000
storage system, usually up to the maximum number of ports that the SAN Volume Controller supports.
򐂰 Unless you have special requirements, or if in doubt, build your MDisk volumes from large
extent pools on the DS8000 storage system.
򐂰 If using a 1:1 mapping of ranks to DS8000 extent pools, use one, or a maximum of two
volumes on this rank, and adjust the MDisk volume size for this 1:1 mapping.
򐂰 Create fewer and larger SAN Volume Controller storage pools and have multiple MDisks in
each pool.

򐂰 Keep many DS8000 arrays active.
򐂰 Ensure that you have an equal number of extent pools and, as far as possible, spread the
volumes equally across the DAs and the two processor complexes of the DS8000 storage
system.
򐂰 In a storage pool, ensure that, for a certain tier, all MDisks ideally have the same
capacity and RAID/RPM characteristics.
򐂰 For Metro Mirror configurations, always use DS8000 MDisks with similar characteristics for
both the master VDisk volume and the auxiliary volume.
򐂰 Spread the VDisk volumes across all SAN Volume Controller nodes, and check for
balanced preferred node assignments.
򐂰 In the SAN, use a dual fabric.
򐂰 Use multipathing software in the servers.
򐂰 Consider DS8000 Easy Tier auto-rebalancing for DS8000 homogeneous capacities.
򐂰 When using Easy Tier in the DS8000 storage system, consider a SAN Volume Controller
extent size of at least 1 GiB (1024 MiB) so that the workload skew at the DS8000 extent level is preserved.
򐂰 When using DS8000 Easy Tier, leave some small movement space empty in the extent
pools to help it start working. Ten free extents per rank are sufficient.
򐂰 Consider the correct amount of cache, as explained in 2.2.2, “Determining the correct
amount of cache storage” on page 33. Usually, SAN Volume Controller installations have a
DS8000 cache of not less than 128 GB.

15.7 Where to place flash


Today, storage installations are deployed as tiered environments that contain some flash or
solid-state drives (SSDs). When implementing storage virtualization with the SAN Volume
Controller, you can place flash in several places:
򐂰 SSDs in a particular server
򐂰 SSDs in the SAN Volume Controller nodes
򐂰 Flash in a separate storage system, such as IBM FlashSystem (with the DS8000 storage
system handling the Enterprise drives)
򐂰 High-Performance Flash Enclosures (HPFE) in the DS8000 storage system (together with
Enterprise and Nearline disks)
򐂰 Traditional SSDs in the DS8000 storage system (together with hard disk drives (HDDs)).

Each method has advantages and drawbacks.

Solid-state drives in the server


Only one server benefits from the flash; all other servers are slower. Using Easy Tier is not
possible, so when treating this flash as its own capacity, the server or database administrator
must manually select the parts of the database to put on SSDs, or continually monitor and
adjust data placement. RAID protection is still necessary. You need a server RAID adapter
that can handle the amount of IOPS. If you have a separate SSD tier here, in cases of backup
automation or 2-site/3-site replication setups, these volumes are not part of the overall
replication concept and offer less general protection. This concept is mostly discouraged
because this flash would be its own tier within the whole environment.

However, there exist solutions where you use locally attached flash as flash cache (read
cache), for example, as part of a newer operating system such as AIX 7.2 or when using



additional software. In this case, because the flash is used only as a read cache, its data protection
requirements are lower. Write I/Os still go through the SAN, but many of the reads are
fulfilled from this local cache, so you get good response times and move load away from the
SAN and the main storage system.

Solid-state drives in the SAN Volume Controller nodes


Up to 48 internal flash drives can be put into one SAN Volume Controller node pair. This
design enables a low entry point to start with flash. Easy Tier works at the SAN Volume
Controller level. Because many IOPS are written to and through these SSDs, RAID
protection, which is configured by the storage administrator, is necessary. Never leave any of
the drives without RAID protection because in availability, the weakest point determines the
overall availability. Ensure that you have an equal load on each node.

Each SSD MDisk goes into one pool, which determines how many storage pools can benefit
from SAN Volume Controller Easy Tier. The SSD size determines the granularity of the
offered SSD capacity. Scalability is limited compared to flash in a DS8000 storage system.

Flash in a separate storage system


With a SAN Volume Controller, you can separate the flash storage system from the HDD
storage system. SAN Volume Controller Easy Tier automatically handles the optimal tiered
configuration. For example, one system can be a DS8000 storage system, and the other
system a FlashSystem, or vice versa. However, availability can be an issue: If you consider
your data or most of your data valuable enough that it must go on a DS8000 storage system
with the highest possible availability, then consider a comparable level of data availability for
the other storage system as well. SAN Volume Controller Easy Tier always creates mixed
VDisk volumes that consist of non-DS8000 and DS8000 parts. Also, you cannot enable
DS8000 Easy Tier across all tiers. In practice, you do not mix DS8000-based and
non-DS8000-based MDisks in one pool because of failure boundaries: the pool goes offline if one
system fails.

Another argument against this concept is that the ports that handle the traffic to the flash-only
storage system experience exceptionally high workloads.

High-Performance Flash Enclosures or solid-state drives in the DS8000 storage system

This concept can be more expensive because it requires a minimum of 16 flash cards (two
flash ranks) to balance the load across the two DS8000 processor complexes. But, this
approach offers many advantages. In this approach, you can enable DS8000
Easy Tier if you want, which also monitors the internal DA usage, has more sophisticated
algorithms for checking overloaded flash (or HDD) ranks, or SSD-serving DAs, and performs
warm demotions if needed. Efficient RAID protection is performed, and scalability is higher, with
up to 240 flash cards and potentially up to 1536 additional SSDs. Also, the DS8000
cache and advanced cache algorithms are optimally used.

15.8 Where to place Easy Tier
IBM Easy Tier is an algorithm that is developed by IBM Almaden Research and made
available to storage systems, such as the DS8880 and DS8870 families, and to Spectrum
Virtualize and the SAN Volume Controller. When using Easy Tier in the SAN Volume
Controller with a mixed-tier storage pool, the MDisks can be flagged as ssd, enterprise, or
Nearline when you introduce them to the SAN Volume Controller storage pool.

When using the internal SSDs in the SAN Volume Controller nodes, only Easy Tier performed
by the SAN Volume Controller is possible for the inter-tier movements between SSD and HDD
tiers. DS8000 intra-tier auto-rebalancing can still be used; it monitors the usage of all the
HDD ranks and moves loads intra-tier within the DS8000 storage system if some ranks are more
loaded than others.

When you have the flash in the DS8000 storage system, together with Enterprise HDDs and
also Nearline HDDs, at which level do you perform the overall inter-tier Easy Tiering? It can
be done by the SAN Volume Controller, by setting the ssd attribute for all the DS8000
HPFE and SSD flash MDisks (which also means that SAN Volume Controller Easy Tier treats
HPFE volumes and DS8000 SSD volumes alike). Alternatively, you can leave the enterprise
(generic_hdd) attribute for all MDisks and allow DS8000 Easy Tier to manage them, with
two-tier or three-tier MDisks offered to the SAN Volume Controller, where these MDisks
contain some flash (which is invisible to the SAN Volume Controller). Well-running
installations exist for both options.

There are differences between the Easy Tier algorithms in the DS8000 storage system and in
SAN Volume Controller. The DS8000 storage system is in the eighth generation of Easy Tier,
with additional functions available, such as Extended-Cold-Demote or Warm-Demote. The
warm demote checking is reactive if certain flash ranks or SSD-serving DAs suddenly
become overloaded. The SAN Volume Controller must work with different vendors and
varieties of flash space that is offered, and uses a more generic algorithm, which cannot learn
easily whether the SSD array of a certain vendor’s disk system is approaching its limits.

As a rule, when you use flash in the DS8000 storage system and use many or even
heterogeneous storage systems, and also for most new installations, consider implementing
cross-tier Easy Tier on the highest level, that is, managed by the SAN Volume Controller. SAN
Volume Controller can use larger blocksizes for its back end, such as 60 K and over, which do
not work well for DS8000 Easy Tier, so you have another reason to use SAN Volume
Controller Easy Tier inter-tiering. However, observe the system by using the STAT for SAN
Volume Controller. If the flash space gets overloaded, consider either adding more SSDs as
suggested by the STAT, or removing and reserving some of the flash capacity so that it is not
fully used by SAN Volume Controller by creating smaller SSD MDisks and leaving empty
space there.

Tip: For most new installations, the following ideas are preferred practices:
򐂰 Use SAN Volume Controller Easy Tier for the cross-tier (inter-tier) tiering.
򐂰 Have several larger extent pools (multi-rank, single-tier) on the DS8000 storage system,
with each pool pair containing the ranks of just one certain tier.
򐂰 Turn on the DS8000 based Easy Tier auto-rebalancing so that DS8000 Easy Tier is
used for the intra-tier rebalancing within each DS8000 pool.



If you have just one main storage system behind the SAN Volume Controller, you can leave the Easy
Tier inter-tiering to the DS8000 logic. Use the more sophisticated DS8000 Easy Tier
algorithms that consider the sudden overload conditions of SSD ranks or SSD-serving
adapters. DS8000 Easy Tier algorithms have more insight into the DS8000 thresholds and
into what workload limits each component of the storage system can sustain. Choose a SAN
Volume Controller extent size of 1 GiB so that the skew for the DS8000 tiering, which works
at the 1 GiB extent level, is not eliminated.

Use the current level of SAN Volume Controller software and the current SAN Volume
Controller node hardware before you start SAN Volume Controller managed Easy Tier.

Chapter 16. IBM ProtecTIER data deduplication

This chapter introduces you to attaching the IBM System Storage DS8000 storage system to
an IBM ProtecTIER data deduplication gateway.

This chapter includes the following topics:


򐂰 IBM System Storage TS7600 ProtecTIER data deduplication
򐂰 DS8000 attachment considerations



16.1 IBM System Storage TS7600 ProtecTIER data deduplication

Data deduplication is a technology that is used to reduce the amount of space that is
required to store data on disk. Data deduplication is achieved by storing a single copy of data
that is backed up repetitively. Data deduplication can provide greater data reduction than
alternative technologies, such as compression and differencing. Differencing is used for
differential backups.

Figure 16-1 illustrates the basic components of data deduplication for the IBM System
Storage TS7650G ProtecTIER Gateway.

Figure 16-1 Basic concept of ProtecTIER data deduplication

With data deduplication, data is read by the data deduplication product while it looks for
duplicate data. Different data deduplication products use different methods of breaking up the
data into elements, but each product uses a technique to create a signature or identifier for
each data element. After the duplicate data is identified, one copy of each element is
retained, pointers are created for the duplicate items, and the duplicate items are not stored.

The effectiveness of data deduplication depends on many variables, including the rate of
data change, the number of backups, and the data retention period. For example, if you back
up the same incompressible data once a week for six months, you save the first copy and do
not save the next 24. This example provides a 25:1 data deduplication ratio. If you back up an
incompressible file on week one, back up the same file again on week two, and never back it
up again, you have a 2:1 data deduplication ratio.

The IBM System Storage TS7650G is a preconfigured virtualization solution of IBM systems.
The IBM ProtecTIER data deduplication software improves backup and recovery operations.
The solution is available in single-node or two-node cluster configurations to meet the
disk-based data protection needs of a wide variety of IT environments and data centers. The
TS7650G ProtecTIER Deduplication Gateway can scale to repositories in the petabyte (PB)
range, and all DS8000 models are supported behind it. Your DS8000 storage system can
become a Virtual Tape Library (VTL). The multi-node concepts help achieve higher
throughput and availability for the backup, and replication concepts are available.



For a detailed assessment of ProtecTIER and data deduplication, see the following
publications:
򐂰 IBM System Storage TS7600 with ProtecTIER Version 3.3, SG24-7968
򐂰 IBM ProtecTIER Implementation and Best Practices Guide, SG24-8025
򐂰 Implementing IBM Storage Data Deduplication Solutions, SG24-7888
򐂰 IBM System Storage TS7650, TS7650G, and TS7610, SG24-7652

16.2 DS8000 attachment considerations


A ProtecTIER gateway attachment with a DS8000 storage system resembles an IBM SAN
Volume Controller attachment. Some of the SAN Volume Controller guidelines might also
apply. For example, you can choose to dedicate your full DS8000 storage system to the SAN
Volume Controller or ProtecTIER device, or you can share a DS8000 storage system, also
with ProtecTIER, with direct server attachment of other volumes on the same DS8000
storage system, reducing the data center footprint. Backups and active data of a specific
server must be physically separate for availability reasons. Therefore, in a multi-site concept,
a DS8000 storage system can incorporate the production data of some servers and backup
data of other servers that are at another data center, if this type of sharing is used.

Extent pool isolation


For ProtecTIER, if a client chooses to share the DS8000 storage system between active
server volumes and ProtecTIER Gateway backup volumes, use a separation within the
DS8000 storage system at the extent pool level. This way, you dedicate a certain extent pool
pair (or pairs) fully to the ProtecTIER Gateway nodes only, and others for the production
servers. The production server attachments go onto higher RPM hard disk drives (HDDs) and
flash. The overall mix on the DS8000 storage system level uses the investment in the DS8000
storage system better.

Host bus adapters


Use dedicated ports and application-specific integrated circuits (ASICs) for the I/O traffic to
the ProtecTIER appliance, as opposed to server I/O attachment traffic on other DS8000
ports. It is a preferred practice to keep disk and tape I/O traffic separate.

Drive types used


Similar to planning backups, you might consider the least expensive drives for this
environment, which are the Nearline drives, such as the 4 TB 7,200 RPM drives of a DS8880 storage
system. Nearline drives come with RAID 6 data protection, which also protects data in a
double-disk failure in the array because rebuild times can be long at high capacities. But, they
also can be formatted in RAID 10. Enterprise drives with their higher RPM speeds outperform
Nearline drives also for ProtecTIER attachments, but Nearline can be used here as follows.

ProtecTIER access patterns have a high random-read content (about 90%) for the UserData areas
and a considerable random-write component for the MetaData areas. For the UserData, a selection of RAID 5
or RAID 6 is possible, with RAID 6 being preferred when using Nearline. However, for the
MetaData, RAID 10 must be used, and when using Nearline, follow the RPQ process to allow
the Nearline drives to be formatted as RAID 10 for such random-write data.



Chapter 17. Databases for open performance

This chapter reviews the major IBM database systems and the performance considerations
when they are used with the DS8000 storage system. The following databases are described:
򐂰 IBM DB2 for Linux, UNIX, and Windows
򐂰 Oracle Database server in an open environment

You can obtain more information about the IBM DB2 and Oracle databases at these websites:
򐂰 http://www.ibm.com/software/data/db2/linux-unix-windows/
򐂰 http://www.ibm.com/support/knowledgecenter/SSEPGG/welcome
򐂰 http://www.oracle.com/technetwork/database/enterprise-edition/documentation/index.html

This chapter includes the following topics:


򐂰 DS8000 with DB2 for Linux, UNIX, and Windows
򐂰 DB2 for Linux, UNIX, and Windows with DS8000 performance recommendations
򐂰 Oracle with DS8000 performance considerations
򐂰 Database setup with a DS8000 storage system: Preferred practices



17.1 DS8000 with DB2 for Linux, UNIX, and Windows
This section describes some basic DB2 for Linux, UNIX, and Windows concepts that are
relevant to storage performance.

17.1.1 DB2 for Linux, UNIX, and Windows storage concepts


The database object that maps the physical storage is the table space. Figure 17-1 illustrates
how DB2 for Linux, UNIX, and Windows is logically structured and how the table space maps
the physical object.

Figure 17-1 DB2 for Linux, UNIX, and Windows logical structure (the hierarchy of logical database objects, from system, instances, and databases down to table spaces, tables, indexes, and long data, and their equivalent physical objects: in an SMS table space, each container is a directory in the file space of the operating system; in a DMS table space, each container is a fixed, pre-allocated file or a physical device such as a disk)

Instances
An instance is a logical database manager environment where databases are cataloged and
configuration parameters are set. An instance is similar to an image of the actual database
manager environment. You can have several instances of the database manager product on
the same database server. You can use these instances to separate the development
environment from the production environment, tune the database manager to a particular
environment, and protect sensitive information from a particular group of users.

For the DB2 Database Partitioning Feature (DPF) of the DB2 Enterprise Server Edition
(ESE), all data partitions are within a single instance.



Databases
A relational database structures data as a collection of database objects. The primary
database object is the table (a defined number of columns and any number of rows). Each
database includes a set of system catalog tables that describe the logical and physical
structure of the data, configuration files that contain the parameter values allocated for the
database, and recovery logs.

DB2 for Linux, UNIX, and Windows allows multiple databases to be defined within a single
database instance. Configuration parameters can also be set at the database level, so that
you can tune, for example, memory usage and logging.

Database partitions
In DB2 terminology, a database partition is equivalent to a data partition. Databases with
multiple data partitions that are on a symmetric multiprocessor (SMP) system are also called
multiple logical node (MLN) databases.

Partitions are identified by the physical system where they are located and by a logical port
number within the physical system. The partition number, which can be 0 - 999, uniquely
defines a partition. Partition numbers must be in ascending sequence (gaps in the sequence
are allowed).

The configuration information of the database is stored in the catalog partition. The catalog
partition is the partition from which you create the database.

Partitiongroups
A partitiongroup is a set of one or more database partitions. For non-partitioned
implementations (all editions except for DPF), the partitiongroup is always made up of a
single partition.

Partitioning map
When a partitiongroup is created, a partitioning map is associated to it. The partitioning map,
with the partitioning key and hashing algorithm, is used by the database manager to
determine which database partition in the partitiongroup stores a specific row of data.
Partitioning maps do not apply to non-partitioned databases.

Containers
A container is the way of defining where on the storage device the database objects are
stored. Containers can be assigned from file systems by specifying a directory. These
containers are identified as PATH containers. Containers can also reference files that are
within a directory. These containers are identified as FILE containers, and a specific size must
be identified. Containers can also reference raw devices. These containers are identified as
DEVICE containers, and the device must exist on the system before the container can be
used.

All containers must be unique across all databases; a container can belong to only one table
space.

Table spaces
A database is logically organized in table spaces. A table space is a place to store tables. To
spread a table space over one or more disk devices, you specify multiple containers.

For partitioned databases, the table spaces are in partitiongroups. In the create table space
command execution, the containers themselves are assigned to a specific partition in the
partitiongroup, thus maintaining the shared nothing character of DB2 DPF.



Table spaces can be either system-managed space (SMS) or database-managed space (DMS).
For an SMS table space, each container is a directory in the file system, and the operating
system file manager controls the storage space (Logical Volume Manager (LVM) for AIX). For
a DMS table space, each container is either a fixed-size pre-allocated file, or a physical
device, such as a disk (or in the case of the DS8000 storage system, a vpath), and the
database manager controls the storage space.

There are three major types of user table spaces: regular (index and data), temporary, and
long. In addition to these user-defined table spaces, DB2 requires that you define a system
table space, which is called the catalog table space. For partitioned database systems, this
catalog table space is on the catalog partition.

Tables, indexes, and large objects


A table is a named data object that consists of a specific number of columns and unordered
rows. Tables are uniquely identified units of storage that are maintained within a DB2 table
space. They consist of a series of logically linked blocks of storage that share the same
name, and each table has a unique structure for storing information that can relate to the
information in other tables.

When creating a table, you can choose to have certain objects, such as indexes and large
object (LOB) data, stored separately from the rest of the table data, but you must define this
table to a DMS table space.

Indexes are defined for a specific table and help with the efficient retrieval of data to satisfy
queries. They also can be used to help with the clustering of data.

LOBs can be stored in columns of the table. These objects, although logically referenced as
part of the table, can be stored in their own table space when the base table is defined to a
DMS table space. This approach allows for more efficient access of both the LOB data and
the related table data.

Pages
Data is transferred to and from devices in discrete blocks that are buffered in memory. These
discrete blocks are called pages, and the memory that is reserved to buffer a page transfer is
called an I/O buffer. DB2 supports various page sizes, including 4 K, 8 K, 16 K, and 32 K.

When an application accesses data randomly, the page size determines the amount of data
transferred. This size corresponds to the size of the data transfer request to the DS8000
storage system, which is sometimes referred to as the physical record.

Sequential read patterns can also influence the page size that is selected. Larger page sizes
for workloads with sequential read patterns can enhance performance by reducing the
number of I/Os.

Extents
An extent is a unit of space allocation within a container of a table space for a single table
space object. This allocation consists of multiple pages. The extent size (number of pages) for
an object is set when the table space is created:
򐂰 An extent is a group of consecutive pages defined to the database.
򐂰 The data in the table spaces is striped by extent across all the containers in the system.



Buffer pools
A buffer pool is main memory that is allocated on the host processor to cache table and index
data pages as they are read from disk or modified. The purpose of the buffer pool is to
improve system performance. Data can be accessed much faster from memory than from
disk; therefore, the fewer times that the database manager needs to read from or write to disk
(I/O), the better the performance. Multiple buffer pools can be created.

DB2 prefetch (reads)


Prefetching is a technique for anticipating data needs and reading ahead from storage in
large blocks. By transferring data in larger blocks, fewer system resources are used and less
time is required.

Sequential prefetch reads consecutive pages into the buffer pool before they are needed by
DB2. List prefetches are more complex. In this case, the DB2 optimizer optimizes the retrieval
of randomly located data.

The amount of data that is prefetched determines the amount of parallel I/O activity.
Ordinarily, the database administrator defines a prefetch value large enough to allow parallel
use of all of the available containers.

Consider the following example:


򐂰 A table space is defined with a page size of 16 KB that uses raw DMS.
򐂰 The table space is defined across four containers. Each container is on a separate logical
device, and the logical devices are on different DS8000 ranks.
򐂰 The extent size is defined as 16 pages (or 256 KB).
򐂰 The prefetch value is specified as 64 pages (number of containers x extent size).
򐂰 A user submits a query that results in a table space scan, which then results in DB2
performing a prefetch operation.

The following actions happen:


򐂰 DB2, recognizing that this prefetch request for 64 pages (1 MB) evenly spans four
containers, makes four parallel I/O requests, one against each of those containers. The
request size to each container is 16 pages (or 256 KB).
򐂰 After receiving several of these requests, the DS8000 storage system recognizes that
these DB2 prefetch requests are arriving as sequential accesses, which triggers the
DS8000 sequential prefetch. The sequential prefetch causes all of the disks in all four
DS8000 ranks to operate concurrently, staging data to the DS8000 cache to satisfy the
DB2 prefetch operations.
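
A table space with these characteristics might be defined as shown in the following sketch.
The statements are illustrative only: the buffer pool and table space names, the container
device names, and the container sizes are hypothetical placeholders that you must adapt to
your environment.

CREATE BUFFERPOOL bp16k SIZE AUTOMATIC PAGESIZE 16 K;

CREATE TABLESPACE ts_data16k
  PAGESIZE 16 K
  MANAGED BY DATABASE
  USING (DEVICE '/dev/rvpath0' 20 G,
         DEVICE '/dev/rvpath1' 20 G,
         DEVICE '/dev/rvpath2' 20 G,
         DEVICE '/dev/rvpath3' 20 G)
  EXTENTSIZE 16       -- 16 pages x 16 KB = 256 KB written to one container at a time
  PREFETCHSIZE 64     -- 4 containers x 16 pages = 64 pages (1 MB) per prefetch request
  BUFFERPOOL bp16k;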

Page cleaners
Page cleaners make room in the buffer pool before prefetchers read pages from disk storage
and move them into the buffer pool. For example, if a large amount of data is updated in a
table, many data pages in the buffer pool might be updated but not yet written to disk
storage (these pages are called dirty pages). Because prefetchers cannot place fetched data
pages onto dirty pages in the buffer pool, these dirty pages must first be flushed to disk
storage and become clean pages so that the prefetchers can reuse them for the data pages
that are fetched from disk storage.



Logs
Changes to data pages in the buffer pool are logged. Agent processes, which are updating a
data record in the database, update the associated page in the buffer pool and write a log
record to a log buffer. The written log records in the log buffer are flushed into the log files
asynchronously by the logger.

To optimize performance, the updated data pages in the buffer pool and the log records in the
log buffer are not written to disk immediately. The updated data pages in the buffer pool are
written to disk by page cleaners and the log records in the log buffer are written to disk by the
logger.

The logger and the buffer pool manager cooperate and ensure that the updated data page is
not written to disk storage before its associated log record is written to the log. This behavior
ensures that the database manager can obtain enough information from the log to recover
and protect a database from being left in an inconsistent state when the database crashes as
a result of an event, such as a power failure.

Parallel operations
DB2 for Linux, UNIX, and Windows extensively uses parallelism to optimize performance
when accessing a database. DB2 supports several types of parallelism, including query and
I/O parallelism.

Query parallelism
There are two dimensions of query parallelism: inter-query parallelism and intra-query
parallelism. Inter-query parallelism refers to the ability of multiple applications to query a
database concurrently. Each query runs independently of the other queries, but they are all
run concurrently. Intra-query parallelism refers to the simultaneous processing of parts of a
single query by using intra-partition parallelism, inter-partition parallelism, or both:
򐂰 Intra-partition parallelism subdivides what is considered a single database operation, such
as index creation, database loading, or SQL queries, into multiple parts, many or all of
which can be run in parallel within a single database partition.
򐂰 Inter-partition parallelism subdivides what is considered a single database operation, such
as index creation, database loading, or SQL queries, into multiple parts, many or all of
which can be run in parallel across multiple partitions of a partitioned database on one
machine or on multiple machines. Inter-partition parallelism applies to DPF only.

I/O parallelism
When there are multiple containers for a table space, the database manager can use parallel
I/O. Parallel I/O refers to the process of writing to, or reading from, two or more I/O devices
simultaneously. Parallel I/O can result in significant improvements in throughput.

DB2 implements a form of data striping by spreading the data in a table space across multiple
containers. In storage terminology, the part of a stripe that is on a single device is a strip. The
DB2 term for strip is extent. If your table space has three containers, DB2 writes one extent to
container 0, the next extent to container 1, the next extent to container 2, and then back to
container 0. The stripe width (a generic term that is not often used in DB2 literature) is equal
to the number of containers, or three in this case.

Extent sizes are normally measured in numbers of DB2 pages.



Containers for a table space are ordinarily placed on separate physical disks, allowing work to
be spread across those disks, and allowing disks to operate in parallel. Because the DS8000
logical disks are striped across the rank, the database administrator can allocate DB2
containers on separate logical disks that are on separate DS8000 arrays. This approach
takes advantage of the parallelism both in DB2 and in the DS8000 storage system. For
example, four DB2 containers that are on four DS8000 logical disks on four different 7+P
ranks have data that is spread across 32 physical disks.

17.2 DB2 for Linux, UNIX, and Windows with DS8000 performance recommendations
When using a DS8000 storage system, the following preferred practices are useful when
planning for good DB2 for Linux, UNIX, and Windows performance.

17.2.1 DS8000 volume layout for databases


Within the last few years, much has changed in the world of storage. IBM introduced
solid-state drives (SSD), High-Performance Flash Enclosures (HPFE), and the Easy Tier
feature.

The storage layout recommendations depend on the technology that is used. Although the use
of traditional hard disk drives (HDDs) requires a manual workload balance across the DS8000
resources, with hybrid storage pools and Easy Tier, the storage controller chooses what
should be stored on SSDs and what should be stored on HDDs.

17.2.2 Know where your data is


Know where your data is. Understand how DB2 containers map to the DS8000 logical disks,
and how those logical disks are distributed across the DS8000 ranks. Spread DB2 data
across as many DS8000 ranks as possible.

If you want optimal performance from the DS8000 storage system, do not treat it like a black
box. Establish a storage allocation policy that allocates data by using several DS8000 ranks.
Understand how DB2 tables map to underlying logical disks, and how the logical disks are
allocated across the DS8000 ranks. One way of making this process easier to manage is to
maintain a modest number of DS8000 logical disks.

17.2.3 Balance workload across DS8000 resources


Balance the workload across the DS8000 resources. Establish a storage allocation policy that
allows balanced workload activity across RAID arrays. You can take advantage of the
inherent balanced activity and parallelism within DB2, spreading the work for DB2 partitions
and containers across the DS8000 arrays. If you spread the work and plan sufficient
resource, many of the other decisions are secondary.

Consider the following general preferred practices:


򐂰 DB2 query parallelism allows workload to be balanced across processors and, if DB2 DPF
is installed, across data partitions.
򐂰 DB2 I/O parallelism allows workload to be balanced across containers.



As a result, you can balance activity across the DS8000 resources by following these rules:
򐂰 Span the DS8000 storage units.
򐂰 Span ranks (RAID arrays) within a storage unit.
򐂰 Engage as many arrays as possible.

Also, remember these considerations:


򐂰 You can intermix data, indexes, and temp spaces on the DS8000 ranks. Your I/O activity is
more evenly spread, so you avoid the skew effect that you otherwise see when these
components are isolated.
򐂰 For DPF systems, establish a policy that allows partitions and containers within partitions
to be spread evenly across the DS8000 resources. Choose a vertical mapping in which
DB2 partitions are isolated to specific arrays, with containers spread evenly across those
arrays.

17.2.4 Use DB2 to stripe across containers


In a System Managed Space (SMS) table space, the operating system's file system manager
allocates and manages the space where the table is stored. The first table data file is created
in one of the table space containers. The database manager determines which one, based on
an algorithm that accounts for the total number of containers together with the table identifier.
This file is allowed to grow to the extent size. After it reaches this size, the database manager
writes data to the next container, and so on.

If the containers of a table space are on separate DS8000 logical disks that are on different
DS8000 ranks, DB2 stripes across DS8000 arrays, device adapters, and servers. This striping
eliminates the need for underlying operating system or LVM striping.
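
The following sketch shows this approach for an SMS table space. The directory paths are
hypothetical; the assumption is that each directory is a file system on a separate DS8000
logical disk (and therefore on a separate rank), so DB2 stripes extents across four ranks:

CREATE TABLESPACE ts_sms
  MANAGED BY SYSTEM
  USING ('/db2fs1/ts_sms', '/db2fs2/ts_sms',
         '/db2fs3/ts_sms', '/db2fs4/ts_sms')
  EXTENTSIZE 16;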

17.2.5 Selecting DB2 logical sizes


Three settings in a DB2 system work together and primarily affect the movement of data to
and from the disk subsystem:
򐂰 Page size
򐂰 Extent size
򐂰 Prefetch size

Page size
Page sizes are defined for each table space. There are four supported page sizes: 4 K, 8 K,
16 K, and 32 K.

For DMS, temporary DMS, and nontemporary automatic storage table spaces (see “Table
spaces” on page 515), the page size you choose for your database determines the upper limit
for the table space size. For tables in SMS and temporary automatic storage table spaces,
page size constrains the size of the tables themselves.

For more information, see the Page, table and table space size topic on the DB2 10.5 for
Linux, UNIX, and Windows IBM Knowledge Center website:

http://www.ibm.com/support/knowledgecenter/SSEPGG

Select a page size that can accommodate the total expected growth requirements of the
objects in the table space.



For online transaction processing (OLTP) applications that perform random row read and
write operations, a smaller page size is preferable because it wastes less buffer pool space
with unwanted rows. For decision support system (DSS) applications that access large
numbers of consecutive rows at a time, a larger page size is better because it reduces the
number of I/O requests that are required to read a specific number of rows.

Tip: A 4- or 8-KB page size is suitable for an OLTP environment, and a 16- or 32-KB page
size is appropriate for analytics. A 32-KB page size is recommended for column-organized
tables.
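
Because a table space can use only a buffer pool that has the same page size, selecting a
larger page size also means creating a matching buffer pool. The following minimal sketch
(the object names are hypothetical) pairs a 32-KB buffer pool with an analytics table space
in an automatic storage database:

CREATE BUFFERPOOL bp32k SIZE AUTOMATIC PAGESIZE 32 K;

CREATE TABLESPACE ts_dss
  PAGESIZE 32 K
  BUFFERPOOL bp32k;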

Extent size
The extent size for a table space is the amount of data that the database manager writes to a
container before writing to the next container. Ideally, the extent size should be a multiple of
the underlying segment size of the disks, where the segment size is the amount of data that
the disk controller writes to one physical disk before writing to the next physical disk.

If you stripe across multiple arrays in your DS8000 storage system, assign a LUN from each
rank to be used as a DB2 container. During writes, DB2 writes one extent to the first
container and the next extent to the second container, and so on, until all of the containers
are addressed before cycling back to the first container. DB2 stripes across containers at
the table space level.

Because the DS8000 storage system stripes at a fairly fine granularity (256 KB), selecting
multiples of 256 KB for the extent size ensures that multiple DS8000 disks are used within a
rank when a DB2 prefetch occurs. However, keep your extent size below 1 MB.

I/O performance is fairly insensitive to the selection of extent sizes, mostly because the
DS8000 storage system employs sequential detection and prefetch. For example, even if you
select an extent size, such as 128 KB, which is smaller than the full array width (it accesses
only four disks in the array), the DS8000 sequential prefetch keeps the other disks in the array
busy.

Prefetch size
The table space prefetch size determines the degree to which separate containers can
operate in parallel.

Prefetching pages means that one or more pages are retrieved from disk in the expectation
that they are required by an application. Prefetching index and data pages into the buffer pool
can help improve performance by reducing I/O wait times. In addition, parallel I/O enhances
prefetching efficiency.

There are three categories of prefetching:


򐂰 Sequential prefetching reads consecutive pages into the buffer pool before the pages are
required by the application.
򐂰 Readahead prefetching looks ahead in the index to determine the exact pages that scan
operations access, and prefetches them.
򐂰 List prefetching (sometimes called list sequential prefetching) prefetches a set of
nonconsecutive data pages efficiently.



For optimal parallel I/O performance, ensure that the following items are true:
򐂰 There are enough I/O servers. Specify slightly more I/O servers than the number of
containers that are used for all table spaces within the database.
򐂰 The extent size and the prefetch size are appropriate for the table space. To prevent
overuse of the buffer pool, the prefetch size should not be too large. An ideal size is a
multiple of the extent size, the number of physical disks under each container (if a RAID
device is used), and the number of table space containers. The extent size should be fairly
small, with a good value being 8 - 32 pages.
򐂰 The containers are on separate physical drives.
򐂰 All containers are the same size to ensure a consistent degree of parallelism.

Prefetch size is tunable: the prefetch size can be altered after the table space is defined
and data is loaded. This is not true for the extent and page sizes, which are set at table
space creation time and cannot be altered without redefining the table space and reloading
the data.
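
For example, the prefetch size can be changed at any time with an ALTER TABLESPACE
statement, or it can be left to the database manager to calculate from the number of
containers and the extent size. The table space name in this sketch is a hypothetical
placeholder:

ALTER TABLESPACE ts_data16k PREFETCHSIZE 128;

-- Alternatively, let DB2 calculate the prefetch size automatically:
ALTER TABLESPACE ts_data16k PREFETCHSIZE AUTOMATIC;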

For more information, see the Prefetching data into the buffer pool, Parallel I/O
management, and Optimizing table space performance when data is on RAID devices topics
on the DB2 10.5 for Linux, UNIX, and Windows IBM Knowledge Center website:

http://www.ibm.com/support/knowledgecenter/SSEPGG

17.2.6 Selecting the DS8000 logical disk sizes


The DS8000 storage system gives you great flexibility when it comes to disk allocation. This
flexibility is helpful, for example, when you need to attach multiple hosts. However, this
flexibility can present a challenge as you plan for future requirements.

The DS8000 storage system supports a high degree of parallelism and concurrency on a
single logical disk. As a result, a single logical disk the size of an entire array achieves the
same performance as many smaller logical disks. However, you must consider how logical
disk size affects both the host I/O operations and the complexity of your systems
administration.

Smaller logical disks provide more granularity, with their associated benefits. But, smaller
logical disks also increase the number of logical disks seen by the operating system. Select a
DS8000 logical disk size that allows for granularity and growth without proliferating the
number of logical disks.

Account for your container size and how the containers map to AIX logical volumes (LVs) and
DS8000 logical disks. In the simplest situation, the container, the AIX LV, and the DS8000
logical disk are the same size.

Tip: Try to strike a reasonable balance between flexibility and manageability for your
needs. As a preferred practice, create no fewer than two logical disks in an array, and make
the minimum logical disk size 16 GB. Unless you have a compelling reason, standardize on a
single logical disk size throughout the DS8000 storage system.

Smaller logical disk sizes have the following advantages and disadvantages:
򐂰 Advantages of smaller size logical disks:
– Easier to allocate storage for different applications and hosts.
– Greater flexibility in performance reporting.



򐂰 Disadvantages of smaller size logical disks:
– Small logical disk sizes can contribute to the proliferation of logical disks, particularly in
SAN environments and large configurations.
– Administration gets complex and confusing.

Larger logical disk sizes have the following advantages:


򐂰 Advantages of larger size logical disks:
– Simplifies understanding of how data maps to arrays.
– Reduces the number of resources used by the operating system.
– Storage administration is simpler, more efficient, and provides fewer chances for
mistakes.

Examples
Assume a 6+P array with 146 GB disk drives. You want to allocate disk space on your
16-array DS8000 storage system as flexibly as possible. You can carve each of the 16 arrays
into 32 GB logical disks or logical unit numbers (LUNs), resulting in 27 logical disks per array
(with a little left over). This design yields a total of 16 x 27 = 432 LUNs. Then, you can
implement four-way multipathing, which in turn makes 4 x 432 = 1728 hdisks visible to the
operating system.

This approach creates an administratively complex situation, and, at every restart, the
operating system queries each of those 1728 disks. Restarts might take a long time.

Alternatively, you create 16 large logical disks. With multipathing and attachment of four Fibre
Channel ports, you have 4 x 16 = 128 hdisks visible to the operating system. Although this
number is large, it is more manageable, and restarts are much faster. After overcoming that
problem, you can then use the operating system LVM to carve this space into smaller pieces
for use.

However, there are problems with this large logical disk approach. If the DS8000 storage
system is connected to multiple hosts or it is on a SAN, disk allocation options are limited
when you have so few logical disks. You must allocate entire arrays to a specific host, and if
you want to add additional space, you must add it in array-size increments.

17.2.7 Multipathing
Use the DS8000 multipathing along with DB2 striping to ensure the balanced use of Fibre
Channel paths.

Multipathing is the hardware and software support that provides multiple avenues of access
to your data from the host computer. You must provide at least two Fibre Channel paths from
the host computer to the DS8000 storage system. Paths are defined by the number of host
adapters (HAs) on the DS8000 storage system that service the LUNs of a certain host
system, the number of Fibre Channel host bus adapters (HBAs) on the host system, and the
SAN zoning configuration. The total number of paths includes consideration for the
throughput requirements of the host system. If the host system requires more than 400 MBps
(2 x 200 MBps) of throughput, two HBAs are not adequate.

DS8000 multipathing requires the installation of multipathing software. For example, the IBM
Subsystem Device Driver Path Control Module (SDDPCM) on AIX and Device Mapper -
Multipath I/O (DM-MP) on Linux. These products are described in Chapter 9, “Performance
considerations for UNIX servers” on page 285 and Chapter 8, “Host attachment” on
page 267.



There are several benefits that you receive from using multipathing: higher availability, higher
bandwidth, and easier management. A high availability implementation is one in which your
application can still access data by using an alternative resource if a component fails. Easier
performance management means that the multipathing software automatically balances the
workload across the paths.

17.3 Oracle with DS8000 performance considerations


This section describes Oracle databases and some preferred practices with the DS8000
storage system to achieve better performance results.

This section focuses on the Oracle I/O characteristics. Some memory or processor
considerations are also needed, but they must be addressed according to your system
specifications and planning.

Reviewing the following considerations can help you understand the Oracle I/O demand and
plan your DS8000 storage system for its use.

17.3.1 Architecture overview


First, review the components of an Oracle database for I/O considerations. The basic
components of an Oracle server are a database and an instance. A database is composed of
a set of data files that consists of data, redo logs, control files, and archive log files.

Although the instance is an important part of the Oracle components, this section focuses on
the data files. OLTP workloads can benefit from SSDs combined with Easy Tier automatic
mode management to optimize performance. Furthermore, you must consider segregation
and resource-sharing aspects when performing separate levels of isolation on the storage for
different components. Typically, in an Oracle database, you separate redo logs and archive
logs from data and indexes. The redo logs and archives are known for performing intensive
read/write workloads.

In a database environment, the disk subsystem is considered the slowest component in the
whole infrastructure. Plan carefully to avoid reconfigurations and time-consuming
performance problem investigations when problems, such as bottlenecks, occur later. As with
all I/O subsystems, good planning and data layout can make the difference between having
excellent I/O throughput and application performance, and having poor I/O throughput, high
I/O response times, and correspondingly poor application performance.

In many cases, I/O performance problems can be traced directly to “hot” files that cause a
bottleneck on some critical component, for example, a single physical disk. This problem can
occur even when the overall I/O subsystem is fairly lightly loaded. When bottlenecks occur,
storage or database administrators might need to identify and manually relocate the high
activity data files that contributed to the bottleneck condition. This problem solving tends to be
a resource-intensive and often frustrating task. As the workload content changes with the
daily operations of normal business cycles, for example, hour by hour through the business
day or day by day through the accounting period, bottlenecks can mysteriously appear and
disappear or migrate over time from one data file or device to another.



17.3.2 DS8000 performance considerations
Generally, I/O (and application) performance is best when the I/O activity is evenly spread
across the entire I/O subsystem and, if available, the appropriate storage classes. Easy Tier
automatic mode management, providing automatic intra-tier and cross-tier performance
optimization, is an option and can make the management of data files easier. Even in
homogeneous extent pools, you can benefit from Easy Tier automatic mode intra-tier
rebalancing (auto-rebalance) to optimize the workload distribution across ranks, reducing
workload skew and avoiding hot spots. The goal for balancing I/O activity across ranks,
adapters, and different storage tiers can be easily achieved by using Easy Tier automatic
mode even in shared environments.

In addition, the prioritization of important database workloads to meet their quality of service
(QoS) requirements when they share storage resources with less important workloads can be
managed easily by using the DS8000 I/O Priority Manager.

Section “RAID-level performance considerations” on page 98 reviewed the RAID levels and
their performance aspects. It is important to describe the RAID levels because some data
files can benefit from certain RAID levels, depending on their workload profile, as shown in
Figure 4-6 on page 112. However, advanced storage architectures, for example, cache and
advanced cache algorithms, or even multitier configurations with Easy Tier automatic
management, can make RAID level considerations less important.

For example, with 15 K RPM Enterprise disks and a significant amount of cache available on
the storage system, some environments might have similar performances on RAID 10 and
RAID 5, although mostly workloads with a high percentage of random write activity and high
I/O access densities benefit from RAID 10. Even in single-tier pools, RAID 10 takes
advantage of Easy Tier automatic intra-tier performance management (auto-rebalance), which
constantly optimizes data placement across ranks based on rank utilization in the extent
pool.

However, by using hybrid pools with flash/SSDs and Easy Tier automode cross-tier
performance management that promotes the hot extents to flash on a subvolume level, you
can boost database performance and automatically adapt to changing workload conditions.

You might consider striping on one level only (storage system or host/application-level),
depending on your needs. The use of host-level or application-level striping might be
counterproductive when using Easy Tier in multitier extent pools because striping dilutes the
workload skew and can reduce the effectiveness of Easy Tier.

On previous DS8300/DS8100 storage systems, you benefited from using storage pool striping
(rotate extents) and striping on the storage level, for example, by creating your redo logs
and spreading them across as many extent pools and ranks as possible to avoid contention.
On a DS8000 storage system with Easy Tier, data placement and workload spreading in extent
pools is automatic, even across different storage tiers.

You still can divide your workload across your planned extent pools (hybrid or homogeneous)
and consider segregation on the storage level by using different storage classes or RAID
levels or by separating table spaces from logs across different extent pools with regard to
failure boundary considerations.



17.3.3 Oracle for AIX
With Easy Tier, the storage system optimizes workload spreading and data placement across
and within storage tiers, based on workload access patterns and by constantly adapting to
workload changes over time. The use of host-level or application-level striping can dilute the
workload skew and reduce the effectiveness of Easy Tier in multitier configurations.

However, if you consider striping on the AIX LVM level or the database level, for example,
Oracle Automatic Storage Management (ASM), you must consider the best possible
approaches if you use it with Easy Tier and multitier configurations. Keep your physical
partition (PP) size or stripe size at a high value so that enough skew remains for Easy Tier
to promote hot extents efficiently.

AIX offers different file system mount options. The following mount options for file systems
are preferred practices with Oracle databases:
򐂰 Direct I/O (DIO):
– Data is transferred directly from the disk to the application buffer. It bypasses the file
buffer cache and avoids double caching (file system cache + Oracle System Global
Area (SGA)).
– Emulates a raw device implementation.
򐂰 Concurrent I/O (CIO):
– Implicit use of DIO.
– No inode locking: Multiple threads can perform reads and writes on the same file
concurrently.
– Performance that is achieved by using CIO is comparable to raw devices.
– Avoid double caching: Some data is already cached in the application layer (SGA).
– Provides faster access to the back-end disk and reduces the processor utilization.
– Disables the inode-lock to allow several threads to read and write the same file (CIO
only).
– Because data transfer is bypassing the AIX buffer cache, Journaled File System 2
(JFS2) prefetching and write-behind cannot be used. These functions can be handled
by Oracle.

When using DIO or CIO, I/O requests made by Oracle must be aligned with the JFS2
blocksize to avoid a demoted I/O (returns to normal I/O after a DIO failure).

Additionally, when using JFS2, consider using the INLINE log for file systems so that it can
have the log striped and not be just placed in a single AIX PP.

For more information about AIX and Linux file systems see 9.2.1, “AIX Journaled File System
and Journaled File System 2” on page 296 and 12.3.6, “File systems” on page 400.

Other options that are supported by Oracle include the Asynchronous I/O (AIO), IBM
Spectrum Scale (formerly IBM General Parallel File System (GPFS)), and Oracle ASM:
򐂰 AIO:
– Allows multiple requests to be sent without having to wait until the disk subsystem
completes the physical I/O.
– Use of AIO is advised no matter what type of file system and mount option you implement
(JFS, JFS2, CIO, or DIO).



򐂰 IBM Spectrum Scale (formerly GPFS):
– IBM Spectrum Scale is a concurrent file system that can be shared among the nodes
that compose a cluster through a SAN or through a high-speed Internet Protocol
network.
– When implementing Oracle Real Application Clusters (RAC) environments, many
clients prefer to use a clustered file system. IBM Spectrum Scale is the IBM clustered
file system offering for Oracle RAC on AIX. Other Oracle files, such as the ORACLE_HOME
executable libraries and archive log directories, do not need to be shared between
instances. These files can either be placed on local disk, for example, by using JFS2
file systems, or a single copy can be shared across the RAC cluster by using
IBM Spectrum Scale. IBM Spectrum Scale can provide administrative advantages,
such as maintaining only one physical ORACLE_HOME and ensuring that archive logs
are always available (even when nodes are down) when recoveries are required.
– When used with Oracle, IBM Spectrum Scale automatically stripes files across all of
the available disks within the cluster by using a 1 MiB stripe size. Therefore,
IBM Spectrum Scale provides data and I/O distribution characteristics that are similar
to PP spreading, LVM (large granularity) striping, and ASM coarse-grained striping
techniques.
򐂰 ASM:
– ASM is a database file system that provides cluster file system and volume manager
capabilities. ASM is an alternative to conventional filesystem and LVM functions.
– Integrated into the Oracle database at no additional cost for single or RAC databases.
– With ASM, the management of Oracle data files is identical on all platforms (Linux,
UNIX, or Windows).
– Oracle ASM uses disk groups to store data files. ASM disk groups are comparable to
volume groups of an LVM. ASM disk groups can be mirrored by specifying a disk group
type: Normal for two-way mirroring, High for three-way mirroring, and External to not
use ASM mirroring, such as when you configure hardware RAID for redundancy.
– Data files are striped across all disks of a disk group, and I/O is spread evenly to
prevent hot spots and maximize performance.
– Online add/drop of disk devices with automatic online redistribution of data.

Note: Since Oracle Database 11g Release 2, Oracle no longer supports raw devices. You
can create databases on an Oracle ASM or a filesystem infrastructure. For more
information, see the Oracle Technology Network article:
http://www.oracle.com/technetwork/articles/sql/11g-asm-083478.html
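
As an illustration only (the disk device names and the allocation unit size are assumptions
that you must adapt to your environment), an ASM disk group that relies on the DS8000 RAID
protection is created with external redundancy, and a larger allocation unit size can help
preserve workload skew for Easy Tier:

CREATE DISKGROUP data EXTERNAL REDUNDANCY
  DISK '/dev/rhdisk10', '/dev/rhdisk11',
       '/dev/rhdisk12', '/dev/rhdisk13'
  ATTRIBUTE 'au_size' = '4M';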

For more information about the topics in this section, see the following resources:
򐂰 The Oracle Architecture and Tuning on AIX paper:
http://www.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/WP100883
򐂰 The IBM Power Systems, AIX and Oracle Database Performance Considerations paper
includes parameter setting recommendations for AIX 6.1 and AIX 7.1 environments:
http://www.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/WP102171



17.4 Database setup with a DS8000 storage system: Preferred practices
This section provides a quick-start guide to creating a database environment with a DS8000
storage system. You can use this section as a starting point for your setup and adapt it as
needed.

17.4.1 The Oracle stripe and mirror everything approach


The stripe and mirror everything (S.A.M.E.) methodology stripes data across all available
physical devices to achieve maximum I/O performance. The goal is to balance I/O across all
disks, adapters, and I/O paths and to avoid I/O hotspots. A 1 MB stripe size is a good starting
point for OLTP and data warehouse workloads. In addition, Oracle recommends putting
transaction logs and data files on separate disks. The S.A.M.E. methodology can be applied
to non-Oracle databases in a similar way.

Oracle has recommended S.A.M.E. for many years. Oracle database installations used
Volume Manager or ASM-based mirroring and striping to implement the S.A.M.E.
methodology.

However, with storage technologies such as RAID and Easy Tier, alternative solutions are
available.

17.4.2 DS8000 RAID policy and striping


For database environments on a DS8000 storage system, create logical volumes (LUNs) on
RAID 5, RAID 6, or RAID 10 arrays.

On Enterprise-class storage with a huge cache, RAID 5 performance can be comparable to
RAID 10 and sufficient for many customer workloads.

Use the Easy Tier function on hybrid storage pools (flash/SSD and HDD) to improve I/O
performance. In most cases, 5 -10% of flash (compared to the overall storage pool capacity)
should be sufficient.

For HDD-only setups, consider RAID 10 for workloads with a high percentage of random write
activity (> 40%) and high I/O access densities (peak > 50%).

On a DS8000 storage system, the share everything approach is a preferred practice:


򐂰 Create one storage pool for each controller (Server 0 and Server 1).
򐂰 Group multiple RAID arrays into a storage pool and then carve one or more logical
volumes (LUNs) out of the storage pool.
򐂰 Create striped LVs (by using the rotate extents mechanism) within the storage pools.
Choose the same size for all LVs within a storage pool.
򐂰 Database transaction logs and data files can share storage.

As a result, the LUNs are striped across all disks in the storage pool, as shown in Figure 17-2
on page 529.



Figure 17-2 Logical volume definition on a DS8000 storage system

Separate data files and transaction logs on different physical disks, not just because of
performance improvements, but because of data safety in case of a RAID array failure.
Separating data files and transaction logs does not need to be considered for an LVM mirror
across two DS8000 storage systems.

Note: Always create a homogeneous DS8000 configuration for your database environment.
Use identical resource types with reference to the following items:
򐂰 Disk type (Use only one type: flash, FC, SAS, and so on)
򐂰 Disk capacity (600 GB, 900 GB, 1 or 2 TB, and so on)
򐂰 Disk speed (15,000/10,000 RPM)
򐂰 RAID level (6+1+1, 7+1, RAID 10, and so on)
򐂰 LUN size (storage LV)

17.4.3 LVM striping


In addition to striping on the DS8000 storage system, use PP striping or LV striping on the
server to create a balanced environment. On AIX, create the LVs with the maximum range of
physical volumes option to configure PP striping. LV striping with a stripe size of 2 - 128 MB
(values must be a power of two) can be an alternative. For more information about LVM
striping, see 9.2.4, “IBM Logical Volume Manager” on page 301.


Chapter 18. Database for IBM z/OS performance
This chapter reviews the major IBM database systems and the performance characteristics
and considerations when they are used with the DS8000 storage system. This chapter
describes the following databases:
򐂰 IBM DB2 Universal Database™ (DB2 UDB) in a z/OS environment
򐂰 IMS in a z/OS environment

You can obtain additional information about IBM DB2 and IMS at these websites:
򐂰 http://www.ibm.com/software/data/db2/zos/family/
򐂰 http://www.ibm.com/software/data/ims/

This chapter includes the following topics:


򐂰 DB2 in a z/OS environment
򐂰 DS8000 considerations for DB2
򐂰 DB2 with DS8000 performance recommendations
򐂰 IMS in a z/OS environment
򐂰 DS8000 storage system considerations for IMS
򐂰 IMS with DS8000 performance recommendations



18.1 DB2 in a z/OS environment
This section describes the characteristics of the various database workloads and the types
of data-related objects that are used by DB2 in a z/OS environment. It also describes the
performance considerations and general guidelines for using DB2 with the DS8000 storage
system, and the tools and reports that can be used for monitoring DB2.

18.1.1 Understanding your database workload


To better understand and position the performance of your particular database system, it is
helpful to learn about the following common database profiles and their unique workload
characteristics.

DB2 online transaction processing


Online transaction processing (OLTP) databases are among the most mission-critical and
widely deployed of all databases. The primary defining characteristic of OLTP systems is that
the transactions are processed in real time or online and often require immediate response
back to the user. The following examples are OLTP systems:
򐂰 A point of sale terminal in a retail business
򐂰 An automated teller machine (ATM), which is used for bank transactions
򐂰 A telemarketing site that processes sales orders and checks the inventories

From a workload perspective, OLTP databases typically have these characteristics:


򐂰 Process many concurrent user sessions
򐂰 Process many transactions by using simple SQL statements
򐂰 Process a single database row at a time
򐂰 Are expected to complete transactions in seconds, not minutes or hours

OLTP systems process the day-to-day operation of businesses, so they have strict user
response and availability requirements. They also have high throughput requirements and are
characterized by many database inserts and updates. They typically serve hundreds, or even
thousands, of concurrent users.

Decision support systems


Decision support systems differ from the typical transaction-oriented systems in that they
often use data that is extracted from multiple sources to support user decision making.
Decision support systems use these types of processing:
򐂰 Data analysis applications that use predefined queries
򐂰 Application-generated queries
򐂰 Ad hoc user queries
򐂰 Reporting requirements

Decision support systems typically deal with substantially larger volumes of data than OLTP
systems because of their role in supplying users with large amounts of historical data.
Although 100 GB of data is considered large for an OLTP environment, a large decision
support system might be 1 TB of data or more. The increased storage requirements of
decision support systems can also be attributed to the fact that they often contain multiple,
aggregated views of the same data.



Although OLTP queries are mostly related to one specific business function, decision support
system queries are often substantially more complex. The need to process large amounts of
data results in many processor-intensive database sort and join operations. The complexity
and variability of these types of queries must be given special consideration when estimating
the performance of a decision support system.

18.1.2 DB2 overview


DB2 is a database management system based on the relational data model. Most users
choose DB2 for applications that require good performance and high availability for large
amounts of data. This data is stored in data sets that are mapped to DB2 table spaces and
distributed across DB2 databases. Data in table spaces is often accessed by using indexes
that are stored in index spaces.

Data table spaces can be divided into two groups: system table spaces and user table spaces.
Both of these table spaces have identical data attributes. The difference is that system table
spaces are used to control and manage the DB2 subsystem and user data. System table
spaces require the highest availability and special considerations. User data cannot be
accessed if the system data is not available.

In addition to data table spaces, DB2 requires a group of traditional data sets that are not
associated to table spaces that are used by DB2 to provide data availability: The backup and
recovery data sets.

In summary, there are three major data set types in a DB2 subsystem:
򐂰 DB2 system table spaces
򐂰 DB2 user table spaces
򐂰 DB2 backup and recovery data sets

The following sections describe the objects and data sets that DB2 uses.

18.1.3 DB2 storage objects


DB2 manages data by associating it to a set of DB2 objects. These objects are logical
entities, and several of them are kept in storage. The following list shows the DB2 data
objects:
򐂰 TABLE
򐂰 TABLESPACE
򐂰 INDEX
򐂰 INDEXSPACE
򐂰 DATABASE
򐂰 STOGROUP

These following sections briefly describe each of them.

TABLE
All data that is managed by DB2 is associated to a table. The table is the main object used by
DB2 applications.

TABLESPACE
A table space is used to store one or more tables. A table space is physically implemented
with one or more data sets. Table spaces are VSAM linear data sets (LDS). Because table
spaces can be larger than the largest possible VSAM data set, a DB2 table space can require
more than one VSAM data set.



INDEX
A table can have one or more indexes (or can have no index). An index contains keys. Each
key points to one or more data rows. The purpose of an index is to get direct and faster
access to the data in a table.

INDEXSPACE
An index space is used to store an index. An index space is physically represented by one or
more VSAM LDSs.

DATABASE
The database is a DB2 representation of a group of related objects. Each of the previously
named objects must belong to a database. DB2 databases are used to organize and manage
these objects.

STOGROUP
A DB2 storage group is a list of storage volumes. STOGROUPs are assigned to databases,
table spaces, or index spaces when using DB2 managed objects. DB2 uses STOGROUPs for
disk allocation of the table and index spaces.

Installations that are storage management subsystem (SMS)-managed can define a
STOGROUP with VOLUMES('*'). This specification implies that SMS assigns a volume to the
table and index spaces in that STOGROUP. To assign a volume to the table and index spaces
in the STOGROUP, SMS uses automatic class selection (ACS) routines to assign a storage
class, a management class, and a storage group to the table or index space.
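
A minimal sketch of such an SMS-managed storage group definition follows (the storage
group name and the VCAT catalog alias are hypothetical):

CREATE STOGROUP SGDB01
  VOLUMES ('*')
  VCAT DB2CAT;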

18.1.4 DB2 data set types


DB2 uses system and user table spaces for the data, and a group of data sets that are not
associated with table spaces that are used by DB2 to provide data availability; these data sets
are backup and recovery data sets.

DB2 system table spaces


DB2 uses databases to control and manage its own operation and the application data:
򐂰 The catalog and directory databases
Both databases contain DB2 system tables. DB2 system tables hold data definitions,
security information, data statistics, and recovery information for the DB2 system. The
DB2 system tables are in DB2 system table spaces.
The DB2 system table spaces are allocated when a DB2 system is first created. DB2
provides the IDCAMS statements that are required to allocate these data sets as VSAM
LDSs.
򐂰 The work database
The work database is used by DB2 to resolve SQL queries that require temporary
workspace. Multiple table spaces can be created for the work database.

DB2 application table spaces


All application data in DB2 is organized in the objects described in 18.1.3, “DB2 storage
objects” on page 533.

Application table spaces and index spaces are VSAM LDSs with the same attributes as DB2
system table spaces and index spaces. System and application data differ only because they
have different performance and availability requirements.



DB2 recovery data sets
To provide data integrity, DB2 uses data sets for recovery purposes. This section describes
these DB2 recovery data sets. These data sets are described in further detail in DB2 11 for
z/OS Administration Guide, SC19-4050.

DB2 uses these recovery data sets:


򐂰 Bootstrap data set
DB2 uses the bootstrap data set (BSDS) to manage recovery and other DB2 subsystem
information. The BSDS contains information that is needed to restart and to recover DB2
from any abnormal circumstance. For example, all log data sets are automatically
recorded with the BSDS. While DB2 is active, the BSDS is open and updated.
DB2 always requires two copies of the BSDS because they are critical for data integrity.
For availability reasons, the two BSDS data sets must be put on separate servers on the
DS8000 storage system or in separate logical control units (LCUs).
򐂰 Active logs
The active log data sets are used for data recovery and to ensure data integrity in software
or hardware errors.
DB2 uses active log data sets to record all updates to user and system data. The active
log data sets are open while DB2 is active. Active log data sets are reused when the total
active log space is used up, but only after the active log (to be overlaid) is copied to an
archive log.
DB2 supports dual active logs. Use dual active logs for all DB2 production environments.
For availability reasons, the log data sets must be put on separate servers on the DS8000
storage system or separate LCUs.
򐂰 Archive logs
Archive log data sets are DB2 managed backups of the active log data sets. Archive log
data sets are automatically created by DB2 whenever an active log is filled. DB2 supports
dual archive logs, and you should use dual archive log data sets for all production
environments.
Archive log data sets are sequential data sets that can be defined on disk or on tape and
migrated and deleted with standard procedures.

18.2 DS8000 considerations for DB2


By using a DS8000 storage system in a DB2 environment, you can help realize the following
benefits:
򐂰 DB2 takes advantage of the parallel access volume (PAV) function, which allows multiple
concurrent I/Os to the same volume from applications that run on a z/OS system image.
PAV is especially recommended for an environment with large volumes.
򐂰 Less disk contention when accessing the same volumes from different systems in a DB2
data sharing group that uses the Multiple Allegiance function.
򐂰 Higher bandwidth on the DS8000 storage system allows higher I/O rates to be handled by
the disk subsystem, thus allowing for higher application transaction rates.



18.3 DB2 with DS8000 performance recommendations
When using the DS8000 storage system, the following generic recommendations are useful
when planning for good DB2 performance.

18.3.1 Knowing where your data is


DB2 storage administration can be done by using SMS to simplify disk use and control, or
also without using SMS. In both cases, it is important that you know where your data is.

If you want optimal performance from the DS8000 storage system, do not treat the DS8000
storage system as a “black box.” Understand how DB2 tables map to underlying volumes and
how the volumes map to RAID arrays.

18.3.2 Balancing workload across DS8000 resources


You can balance workload activity across the DS8000 resources by using the following
methods:
򐂰 Spreading DB2 data across DS8000 storage systems if practical
򐂰 Spreading DB2 data across servers in each DS8000 storage system
򐂰 Spreading DB2 data across the DS8000 device adapters (DAs)
򐂰 Spreading DB2 data across as many extent pools/ranks as practical

Using Easy Tier ensures that data is spread across ranks in a hybrid pool and placed on the
appropriate tier, depending on how active the data is.

You can intermix tables and indexes and also system, application, and recovery data sets on
the DS8000 ranks. The overall I/O activity is then more evenly spread, and I/O skews are
avoided.

18.3.3 Solid-state drives


The access to data on a solid-state drive (SSD) is faster than access to a hard disk drive
(HDD) because there is no read/write head to move and no magnetic platters need to spin (no
latency). Data sets with high I/O rates and poor cache hit ratios are ideal candidates for
SSDs. Data sets with good cache hit ratios or low I/O rates must remain on HDDs. SSDs are
ideal for small-block/random workloads. Sequential workloads can stay on HDDs.

The results in Figure 18-1 on page 537 were measured by a DB2 I/O benchmark. They show
random 4-KB read throughput and response times. The SSD response times are low across
the curve. They are lower than the minimum HDD response time for all data points.



Figure 18-1 DB2 on count key data random read throughput/response time curve

Identifying volumes or data sets to move to solid-state drives


The DS8000 storage system can obtain cache statistics for every volume in the storage
system. These measurements include the number of operations from DASD cache to the
back-end storage, the number of random operations, the number of sequential
reads and sequential writes, the time to run those operations, and the number of bytes
transferred. These statistics are placed in the SMF 74 subtype 5 record.

If you plan to create an extent pool with only SSD or flash ranks, the FLASHDA tool can help
you identify which data benefits the most if placed on those extent pools.

FLASHDA is a tool that is based on SAS software and that helps manage the transition to SSDs.
The tool provides volume and data set usage reports that use SAS code to analyze SMF 42 subtype
6 and SMF 74 subtype 5 records to help identify volumes and data sets that are good
candidates to be on SSDs.

This tool is available for download from the following website:


http://www.ibm.com/systems/z/os/zos/downloads/flashda.html

The FLASHDA user guide is available at this website:


http://publibz.boulder.ibm.com/zoslib/pdf/flashda.pdf

You can obtain reports to perform these tasks:


򐂰 Identify page sets with high amounts of READ_ONLY_DISC time.
򐂰 Analyze DASD cache statistics to identify volumes with high I/O rates.
򐂰 Identify the data sets with the highest amount of write activity (number of write requests).

Based on the output of this tool, you can select which hot volumes might benefit most from a
migration to SSDs. The tool output also identifies the hot data at the data set level, so the
migration to the SSD ranks can be done data set by data set by using the appropriate
z/OS tools.
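
As a simplified illustration of the same selection logic (this sketch is not FLASHDA, which is SAS
based), the following Python code ranks volumes as SSD candidates by combining a high back-end
I/O rate with a poor read cache hit ratio. The input file name, column names, and thresholds are
assumptions for a CSV that you would create yourself from your SMF 74 subtype 5 reporting.

   # Hypothetical stand-in for a FLASHDA-style volume analysis (not the IBM tool).
   # Assumes a CSV that you export yourself with columns: volser, io_rate, read_hit_pct
   import csv

   candidates = []
   with open("volume_cache_stats.csv") as f:          # assumed file name
       for row in csv.DictReader(f):
           io_rate = float(row["io_rate"])
           read_hit = float(row["read_hit_pct"])
           # A high I/O rate plus a poor cache hit ratio suggests an SSD/flash candidate.
           if io_rate > 500 and read_hit < 70:
               candidates.append((io_rate, row["volser"], read_hit))

   for io_rate, volser, read_hit in sorted(candidates, reverse=True):
       print(f"{volser}: {io_rate:8.1f} IOPS, {read_hit:4.1f}% read hits")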



18.3.4 High-Performance Flash Enclosure
Since 2014, the DS8870 storage system has supported the High-Performance Flash Enclosure
(HPFE) with LMC R7.3. HPFE provides a significantly greater I/O performance improvement than
SSDs or HDDs.

DS8880 R8.0 supports the following DDM types:


򐂰 300 and 600 GB, 15-K RPM SAS disk, 2.5-inch SFF
򐂰 600 and 1200 GB, 10-K RPM SAS disk, 2.5-inch SFF
򐂰 4 TB, 7,200 RPM Nearline-SAS disk, 3.5-inch LFF
򐂰 200/400/800/1600 GB e-MLC SAS SSDs (enterprise-grade Multi-Level Cell SSDs),
2.5-inch SFF
򐂰 400 GB high-performance flash cards, 1.8-inch

The DS8884 storage system provides up to four HPFEs, and the DS8886 storage system
provides up to eight HPFEs:
򐂰 DS8884 storage system: Two flash enclosures in a base rack with two additional
enclosures in the first expansion rack
򐂰 DS8886 storage system: Four flash enclosures in a base rack with four additional
enclosures in the first expansion rack

Figure 18-2 shows the comparison for a 4-KB random read workload between HPFE and
SSD (Single Array RAID 5 6+p) in a DS8870 storage system. As shown in Figure 18-2, for a
latency-sensitive application, HPFE can sustain low response time at more demanding I/O
rates than its SSD equivalent.

Figure 18-2 4-KB random read comparison between HPFE and SSD

For more information about DB2 performance with HPFE, see Making Smart Storage
Decisions for DB2 in a Flash and SSD World, REDP-5141.

18.3.5 High Performance FICON and FICON Express 16S


This section describes High Performance FICON and FICON Express 16S.



High Performance FICON

Important: Today, all DB2 I/O, including format writes and list prefetch, is supported by
High Performance FICON (zHPF).

High Performance FICON, also known as zHPF, is not new; it was introduced in 2009. Initially,
zHPF support for DB2 I/O was limited to sync I/Os and write I/Os of individual records. That
support was enhanced gradually, and since 2011 zHPF supports all types of DB2 I/O.

As mentioned in 18.3.3, “Solid-state drives” on page 536 and 18.3.4, “High-Performance
Flash Enclosure” on page 538, HPFE and SSDs deliver greater performance, but that
performance puts more stress on the channel subsystem on z Systems. Using those flash
cards and SSDs together with zHPF yields better overall performance because zHPF is an
effective way to reduce the channel impact.

Conversion of DB2 I/O to zHPF delivers better resource utilization, bandwidth, and
response time:
򐂰 4-KB page format write throughput increases up to 52%.
򐂰 Preformatting throughput increases up to 100%.
򐂰 Sequential prefetch throughput increases up to 19%.
򐂰 Dynamic prefetch throughput increases up to 23% (40% with SSD).
򐂰 DB2 10 throughput increases up to 111% (more with 8-K pages) for disorganized index
scan. DB2 10 with zHPF is up to 11 times faster than DB2 9 without zHPF.
򐂰 Sync I/O cache hit response time decreases by up to 30%.

FICON Express 16S


IBM z13 has the FICON Express 16S feature that supports 16 Gbps link speed.

FICON Express 16S on z13 with a DS8870 storage system improves DB2 log write latency
and throughput by up to 32% with multiple I/O streams, compared with FICON Express 8S in
zEC12, resulting in improved DB2 transactional latency. Thus, you can expect up to 32%
reduction in elapsed time for I/O bound batch jobs.

zHPF with FICON Express 16S provides greater throughput and lower response times
compared with FICON Express 8 or 8S. For more information about that comparison, see
8.3.1, “FICON” on page 276.

18.3.6 DB2 Adaptive List Prefetch


Prefetch is a mechanism for reading a set of pages, usually thirty-two 4-K pages, into the
buffer pool with only one asynchronous I/O operation. zHPF provides improvements for DB2
list prefetch processing. The FICON Express 8S features allow DB2 for z/OS to read
thirty-two 4-K pages in a single zHPF channel program, which results in fewer I/Os and less
processor usage. The previous limit was 22 pages, which caused z/OS to split the request into
two I/Os.

Today, all DB2 I/O is supported by zHPF. The following environments are required to obtain
list prefetch with zHPF benefits:
򐂰 DS8700 or DS8800 storage system with LMC R6.2 or above
򐂰 DB2 10 or DB2 11
򐂰 z13, zEC12, zBC12, z196, or z114
򐂰 FICON Express 8S or 16S
򐂰 z/OS V1.11 or above with required PTFs



List Prefetch Optimizer
The combination of FICON Express 8S and zHPF improved the performance of DB2 list prefetch
processing. In addition, DS8000 LMC R6.2 also introduced a new caching algorithm called
List Prefetch Optimizer.

Note: List Prefetch Optimizer requires zHPF.

List Prefetch Optimizer optimizes the fetching of data from the storage system when DB2 list
prefetch runs. DB2 list prefetch I/O is used for disorganized data and indexes or for
skip-sequential processing. Where list prefetch with zHPF improves the connect time of the
I/O response, List Prefetch Optimizer contributes to reducing the disconnect time.

Today, IBM offers the High-Performance Flash Enclosure (HPFE) and SSDs, which achieve
higher performance for DB2 applications than before. The synergy between List Prefetch
Optimizer and HPFE or SSDs provides a significant improvement for DB2 list prefetch I/O.

For more information about List Prefetch Optimizer, see DB2 for z/OS and List Prefetch
Optimizer, REDP-4862.

18.3.7 Taking advantage of VSAM data striping


Before VSAM data striping was available, in a multi-extent, multi-volume VSAM data set,
sequential processing did not present any type of parallelism. When an I/O operation was run
for an extent in a volume, no other activity from the same task was scheduled to the other
volumes.

VSAM data striping addresses this problem with two modifications to the traditional data
organization:
򐂰 The records are not placed in key ranges along the volumes; instead, they are organized
in stripes.
򐂰 Parallel I/O operations are scheduled to sequential stripes in different volumes.

By striping data, the VSAM control intervals (CIs) are spread across multiple devices. This
format allows a single application request for records in multiple tracks and CIs to be satisfied
by concurrent I/O requests to multiple volumes.

The result is improved data transfer to the application. The scheduling of I/O to multiple
volumes to satisfy a single application request is referred to as an I/O path packet.
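
To make the effect of striping concrete, the following Python sketch is a conceptual illustration
only (the stripe count and CI range are invented): it shows how a single sequential request for a
range of CIs maps onto the stripe volumes so that the request can be satisfied by parallel I/Os.

   # Conceptual sketch of VSAM data striping: CIs are laid out round-robin
   # across the stripe volumes, so one sequential request engages all of them.
   stripes = ["VOL001", "VOL002", "VOL003", "VOL004"]   # assumed 4-stripe data set

   def volume_for_ci(ci_number, stripes):
       """Return the stripe volume that holds a given control interval."""
       return stripes[ci_number % len(stripes)]

   # A single application request for 16 consecutive CIs ...
   per_volume = {}
   for ci in range(16):
       per_volume.setdefault(volume_for_ci(ci, stripes), []).append(ci)

   # ... becomes one parallel I/O per volume instead of 16 serial accesses
   # against a single volume.
   for vol, cis in per_volume.items():
       print(f"{vol}: CIs {cis}")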

You can stripe across ranks, DAs, servers, and the DS8000 storage systems.

In a DS8000 storage system with storage pool striping, the implementation of VSAM striping
still provides a performance benefit. Because DB2 uses two engines for the list prefetch
operation, VSAM striping increases the parallelism of DB2 list prefetch I/Os. This parallelism
exists with respect to the channel operations and the disk access.

If you plan to enable VSAM I/O striping, see DB2 9 for z/OS Performance Topics, SG24-7473.



18.3.8 Large volumes
With Extended Address Volume (EAV), which supports up to 1,182,006 cylinders per
volume, z Systems users can allocate larger capacity volumes in the DS8000 storage
system. From the DS8000 perspective, the capacity of a volume does not determine its
performance. From the z/OS perspective, PAVs reduce or eliminate any additional enqueues
that might originate from the increased I/O on the larger volumes. From the storage
administration perspective, configurations with larger volumes are simpler to manage.

Measurements that are oriented to determine how large volumes can affect DB2 performance
show that similar response times can be obtained by using larger volumes compared to using
the smaller 3390-3 standard-size volumes. For more information, see “Volume sizes” on
page 484.

18.3.9 Modified Indirect Data Address Words


Modified Indirect Data Address Words (MIDAWs) help improve performance when you
access large chains of small blocks of data. To get this benefit, the data set must be accessed
by Media Manager. MIDAWs cut FICON channel utilization for DB2 sequential I/O streams by
half or more and improve the sequential throughput of Extended Format (EF) data sets by
about 30%. A larger benefit is realized for the following data sets:
򐂰 EF data sets
򐂰 Data sets that have small blocksizes (4 K)

Examples of DB2 applications that benefit from MIDAWs are DB2 prefetch and DB2 utilities.

18.3.10 Adaptive Multi-stream Prefetching


As described in 2.2, “Processor memory and cache” on page 27, Adaptive Multi-stream
Prefetching (AMP) works with DB2 sequential and dynamic prefetch. It works even better for
dynamic prefetch than it does for most other sequential applications because dynamic
prefetch uses two prefetch engines. DB2 can explicitly request that the DS8000 storage
system prefetches from the disks. AMP adjusts the prefetch quantity that the DS8000 storage
system uses to prefetch tracks from the disks into the cache. AMP achieves this improvement
by increasing the prefetch quantity sufficiently to meet the needs of the application.

As more data is prefetched, more disks are employed in parallel. Therefore, high throughput
is achieved by employing parallelism at the disk level. In addition to enabling one sequential
stream to be faster, AMP also reduces disk thrashing when there is disk contention.

18.3.11 DB2 burst write


When DB2 updates a record, it first updates the record that is in the buffer pool. If the
percentage of changed records in the buffer pool reaches the threshold that is defined in the
vertical deferred write threshold (VDWQT), DB2 flushes and writes these updated records.
These write activities to the disk subsystem are a huge burst of write I/Os, especially if the
buffer pool is large and the VDWQT is high. This burst can cause a nonvolatile storage (NVS)
saturation because it is being flooded with too many writes. It shows up in the Resource
Measurement Facility (RMF) cache report as DASD Fast Write Bypass (DFWBP or DFW
Bypass). The term bypass is misleading. In the 3990/3390 era, when the NVS was full, the
write I/O bypasses the NVS and the data was written directly to the disk drive module (DDM).
In the DS8000 storage system, when the NVS is full, the write I/O is retried from the host until
NVS space becomes available. So, DFW Bypass must be interpreted as DFW Retry for the
DS8000 storage system.



If RMF shows that the DFW Bypass divided by the total I/O Rate is greater than 1%, that is an
indication of NVS saturation. If this NVS saturation happens, set the VDWQT to 0 or 1. Setting
the VDWQT to 0 does not mean that every record update causes a write I/O to be triggered
because despite the 0% threshold, the DB2 buffer pool still has 40 buffers set aside. The 32
least recently updated buffers are scheduled for writing, which allows multiple successive
updates to the same record without the record being written out every time that it is updated.
This method prevents multiple write I/Os to the same record on the disk subsystem. Lowering the VDWQT
has a cost. In this case, it increases the processor utilization, which shows up as a higher
DBM1 SRM processor time.
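
As a quick way to apply that rule of thumb, the following Python sketch flags possible NVS
saturation when the DFW Bypass (retry) rate exceeds 1% of the total I/O rate. The values are
placeholders that you would replace with the figures from your own RMF cache report, and the
buffer pool name in the comment is only an example.

   # Minimal sketch: flag NVS saturation from RMF cache report figures.
   # Replace the placeholder values with the numbers from your own report.
   total_io_rate = 4200.0      # total I/O rate for the interval (I/Os per second)
   dfw_bypass_rate = 55.0      # DFW Bypass (retry) rate for the same interval

   bypass_pct = 100.0 * dfw_bypass_rate / total_io_rate
   print(f"DFW Bypass (retry) is {bypass_pct:.2f}% of the total I/O rate")

   if bypass_pct > 1.0:
       # NVS saturation is suspected: consider lowering VDWQT for the affected
       # buffer pools, for example with -ALTER BUFFERPOOL(BP1) VDWQT(0), and
       # weigh the extra DBM1 SRM processor time against the I/O relief.
       print("Consider setting VDWQT to 0 or 1 for the affected buffer pools")
   else:
       print("DFW Bypass ratio is within the 1% guideline")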

18.3.12 DB2 / Easy Tier integration


The Easy Tier Application software-defined storage data placement API allows DB2 to
proactively inform Easy Tier of the intended use of data sets. DS8870 LMC R7.4 enables this
Easy Tier integration between DB2 and the storage system.

This architecture allows DB2 to communicate performance requirements for optimal data set
placement by passing application performance information (hints) to the Easy Tier
Application API through DFSMS. The application hint sets the intent through the API,
and Easy Tier moves the data set to the correct tier.

The following environments are required:


򐂰 DS8870 LMC R7.4 or above or DS8880 storage systems
򐂰 DB2 10 or DB2 11 with APAR PI35321
򐂰 z/OS V1.13 or above with appropriate PTFs (OA45242/OA45241/OA46482/OA45562)

The following description provides an example of the DB2 reorganization (REORG) process
with Easy Tier Application.

Without the Easy Tier Application integration, a DB2 REORG places the reorganized data in
new extents. These new extents can be on a lower tier, and it takes a while for Easy Tier to
detect that these extents are hot and must be moved to a higher tier.

The Easy Tier Application integration allows DB2 to proactively instruct Easy Tier about the
application-intended use of the data. The application hint sets the intent, and Easy Tier moves
the data to the correct tier. So, data that was hot before the REORG is placed on higher-tier
extents after the REORG.

18.3.13 Bypass extent serialization in Metro Mirror


z/OS uses the define extent command to control serialization of access to a data set.
Certain applications, such as JES and DB2, use the bypass extent blocking feature because
they have their own serialization. Unfortunately, bypass extent blocking was not originally
recognized by Metro Mirror.

Starting with DS8870 R7.2, Metro Mirror now recognizes the bypass extent blocking option,
which reduces the device busy delay time and improves the throughput in a Metro Mirror
environment, in some cases by up to 100%.

Especially in a data sharing environment, DB2 burst writes tend to span many tracks. In a non
Metro Mirror environment, only the specific track that is being written is serialized, and only
while it is being written. With Metro Mirror, the entire range of tracks is serialized for the entire
duration of the write I/O operation. DB2 burst writes are processed asynchronously, but the
added delay can cause a serious problem if the Group Buffer Pool (GBP) fills up.



Because of deadlock concerns, Metro Mirror could not bypass serialization or extent range
checking when the write I/O started. Now, DS8870 R7.2 and later releases provide the
bypass extent blocking option in Metro Mirror environments.

This function accelerates throughput for some Metro Mirror environments by up to 100%,
which reflects the reduction of the device busy delay in the I/O response time.

18.3.14 zHyperWrite
DS8870 LMC R7.4 introduced zHyperWrite, which helps accelerate DB2 log writes in Metro
Mirror synchronous data replication environments. zHyperWrite combines concurrent
DS8870 Metro Mirror (PPRC) synchronous replication and software mirroring through media
manager (DFSMS) to provide substantial improvements in DB2 log write latency. This
function also coexists with HyperSwap.

Without zHyperWrite in the Metro Mirror environment, the I/O response time of DB2 log write
is impacted by latency that is caused by synchronous replication.

zHyperWrite enables DB2 to perform parallel log writes to the primary and secondary volumes.
When DB2 writes to the primary log volume, DFSMS updates the secondary log volume
concurrently. The write I/O to the DB2 log with zHyperWrite completes when both the primary
and secondary volumes are updated by DFSMS.

Thus, with zHyperWrite, you can avoid the latency of storage-based synchronous mirroring,
which improves log write throughput. zHyperWrite has demonstrated a response time
reduction of up to 40% and a throughput improvement of up to 179%. These benefits depend
on the distance.

In addition to the DS8000 Licensed Internal Code prerequisite, appropriate PTFs should be
applied to z/OS, DFSMS, DB2, and RMF. For more information about these prerequisites and
how to set up zHyperWrite, see IBM DS8870 and IBM z Systems Synergy, REDP-5186.

18.3.15 Monitoring the DS8000 performance


You can use RMF to monitor the performance of the DS8000 storage system. For a detailed
description, see 14.1, “DS8000 performance monitoring with RMF” on page 460.

18.4 IMS in a z/OS environment


This section describes IMS, its logging, and the performance considerations when IMS data
sets are placed on the DS8000 storage system.

18.4.1 IMS overview


IMS consists of three components: the Transaction Manager (TM) component, the Database
Manager (DB) component, and a set of system services that provides common services to
the other two components.

IMS Transaction Manager


IMS TM provides users of a network with access to applications that run under IMS and to
data in IMS or other databases, such as DB2. The users can be people at terminals or workstations, or other
application programs.



IMS Database Manager
IMS Database Manager provides a central point of control and access to the data that is
processed by IMS applications. The Database Manager component of IMS supports
databases that use the hierarchical database model of IMS. It provides access to the
databases from the applications that run under the IMS TM, the IBM CICS® transaction
monitor, and z/OS batch jobs.

IMS Database Manager provides functions for preserving the integrity of databases and
maintaining the databases. It allows multiple tasks to access and update the data, while
ensuring the integrity of the data. It also provides functions for reorganizing and restructuring
the databases.

The IMS databases are organized internally by using a number of IMS internal database
organization access methods. The database data is stored on disk storage by using the
normal operating system access methods.

IMS system services


There are many functions that are common to the Database Manager and TM:
򐂰 Restart and recovery of the IMS subsystem failures
򐂰 Security: Controlling access to IMS resources
򐂰 Managing the application programs: Dispatching work, loading application programs, and
providing locking services
򐂰 Providing diagnostic and performance information
򐂰 Providing facilities for the operation of the IMS subsystem
򐂰 Providing an interface to other z/OS subsystems that interface with the IMS applications

18.4.2 IMS logging


IMS logging is one of the most write-intensive operations in a database environment.

During IMS execution, all information that is necessary to restart the system if there is a
failure is recorded on a system log data set. The IMS logs are made up of the following
information.

IMS log buffers


The log buffers are used to write the information that needs to be logged.

Online log data sets


The online log data sets (OLDS) are data sets that contain all the log records that are
required for restart and recovery. These data sets must be pre-allocated on DASD and hold
the log records until they are archived.

The OLDS are made of multiple data sets that are used in a wraparound manner. At least
three data sets must be allocated for the OLDS to allow IMS to start, and an upper limit of 100
data sets is supported.

Only complete log buffers are written to OLDS to enhance performance. If any incomplete
buffers must be written out, they are written to the write ahead data sets (WADS).



Write ahead data sets
The WADS is a small direct-access data set that contains a copy of committed log records
that are in OLDS buffers, but they are not written to OLDS yet.

When IMS processing requires writing a partially filled OLDS buffer, a portion of the buffer is
written to the WADS. If IMS or the system fails, the log data in the WADS is used to terminate
the OLDS, which can be done as part of an emergency restart, or as an option on the IMS
Log Recovery Utility.

The WADS space is continually reused after the appropriate log data is written to the OLDS.
This data set is required for all IMS systems, and must be pre-allocated and formatted at IMS
startup when first used.

When using a DS8000 storage system with storage pool striping, define the WADS volumes
as 3390-Mod.1 and allocate them consecutively so that they are allocated to different ranks.

System log data sets


The system log data set (SLDS) is created by the IMS log archive utility, preferably after every
OLDS switch. It is placed on tape, but it can be on disk. The SLDS can contain the data from
one or more OLDS data sets.

Recovery log data sets


When the IMS log archive utility is run, the user can request creation of an output data set
that contains all of the log records that are needed for database recovery. This data set is the
recovery log data set (RLDS), and is also known to DBRC. The RLDS is optional.

18.5 DS8000 storage system considerations for IMS


By using the DS8000 storage system in an IMS environment, the following benefits are
possible:
򐂰 IMS takes advantage of the PAV function, which allows multiple concurrent I/Os to the same
volume from applications that run on a z/OS system image.
򐂰 Less disk contention occurs when accessing the same volumes from different systems in
an IMS data sharing group and using the Multiple Allegiance function.
򐂰 Higher bandwidth on the DS8000 storage system allows higher I/O rates to be handled by
the disk subsystem, thus allowing for higher application transaction rates.

18.6 IMS with DS8000 performance recommendations


When using the DS8000 storage system, the following generic recommendations are useful
when planning for good IMS performance.

18.6.1 Balancing workload across DS8000 resources


You can balance workload activity across the DS8000 resources by performing these tasks:
򐂰 Spreading IMS data across all DS8000 storage systems if practical
򐂰 Spreading IMS data across servers in each DS8000 storage system
򐂰 Using hybrid pools
򐂰 Defining storage groups for the IMS data sets
򐂰 Allocating the IMS storage groups across all LCUs



You can intermix IMS databases and log data sets on the DS8000 ranks. The overall I/O
activity is more evenly spread, and I/O skews are avoided.

18.6.2 Write ahead data set volumes


The IMS WADS is the most active data set in an IMS environment as far as write activity is
concerned. Because of its nature, the volumes that contain the IMS WADS present the biggest
challenge to the disk subsystem. For large IMS TM users, the only way to handle these I/Os is
to put these volumes on an SSD or flash rank, or in a hybrid pool where, because of its activity,
the WADS stays on the SSD or flash tier.

The WADS has a fixed 1-byte key of ’00’x. WADS records include this count key data (CKD)
key field, which needs to be in cache before the data can be updated. If there is cache
contention, the appropriate WADS key field must be staged into cache before the record can
be updated. This staging slows the WADS write response time and shows up as an increase
in disconnect time. This behavior remained until recently, when IMS changed the way these
write I/Os are run.

In IMS V.11, an enhancement was made that allows the host software to provide an I/O
channel program indication that this is WADS, so the disk subsystem (that supports the
indication) can predict what the disk format key field is and avoid a write miss. This
enhancement requires two IMS APARs, which are PM44110 and PM19513.

Figure 18-3 compares the performance of the IMS WADS volume before and after the two
APARs are applied and the disk subsystem has the appropriate Licensed Internal
Code to support this function. A significant reduction in the disconnect time can be observed
from the RMF report.

Figure 18-3 IMS WADS volume response time comparisons

In Figure 18-3, volume SRE004 contains a WADS data set without this enhancement and
shows a much higher disconnect time than volume SRE104, which has the enhancement.

Another enhancement in IMS V.12 made the WADS channel program conform to the Extended
Count Key Data (ECKD) architecture, providing greater efficiency and reducing channel
program overhead.

Table 18-1 shows the response time improvement on the WADS volume between IMS V.11
with the enhancements as compared to the current IMS V.12. The response time improves
from 0.384 ms to 0.344 ms, which is a 10% improvement. The volume S24$0D on address
240D is allocated on a DDM rank on the DS8700 storage system and not on an SSD rank.



Table 18-1   IMS WADS volume response time improvement in IMS V12

Dev    VOLSER   PAV    LCU    Device      Avg     Avg     Avg    Avg    Avg     Avg     Avg
num                           Activity    Resp    IOSQ    CMR    DB     Pend    Disc    Conn
                              Rate        Time    Time    Dly    Dly    Time    Time    Time

IMS V.11
240D   S24$0D   1.0H   0023   1143.49     0.384   0.000   0.025  0.000  0.179   0.001   0.204

IMS V.12
240D   S24$0D   1.0H   0023   1163.02     0.344   0.000   0.028  0.000  0.184   0.001   0.159

18.6.3 Monitoring DS8000 performance


You can use RMF to monitor the performance of the DS8000 storage system. For a detailed
description, see 14.1, “DS8000 performance monitoring with RMF” on page 460.



Part 4 Appendixes
This part includes the following topics:
򐂰 Performance management process
򐂰 Benchmarking



Appendix A. Performance management process
This appendix describes the need for performance management and the processes and
approaches that are available for managing performance on the DS8000 storage system:
򐂰 Introduction
򐂰 Purpose
򐂰 Operational performance subprocess
򐂰 Tactical performance subprocess
򐂰 Strategic performance subprocess



Introduction
The IBM System Storage DS8000 series is designed to support the most demanding
business applications with its exceptional performance and superior data throughput. This
strength, which is combined with its world-class resiliency features, makes it an ideal storage
platform for supporting today’s 24x7 global business environment. Moreover, with its
tremendous scalability, broad server support, and flexible virtualization capabilities, a DS8000
storage system can help simplify the storage environment and consolidate multiple storage
systems onto a single DS8000 storage system.

This power is the potential of the DS8000 storage system, but careful planning and
management are essential to realize that potential in a complex IT environment. Even a
well-configured system is subject to the following changes over time that affect performance:
򐂰 Additional host systems
򐂰 Increasing workload
򐂰 Additional users
򐂰 Additional DS8000 capacity

A typical case
To demonstrate the performance management process, here is a typical situation where
DS8000 performance is an issue.

Users open incident tickets to the IT Help Desk claiming that the system is slow and is
delaying the processing of orders from their clients and the submission of invoices. IT Support
investigates and detects that there is contention in I/O to the host systems. The Performance
and Capacity team is involved and analyzes performance reports together with the IT Support
teams. Each IT Support team (operating system (OS), storage, database, and application)
issues its report defining the actions necessary to resolve the problem. Certain actions might
have a marginal effect but are faster to implement; other actions might be more effective but
need more time and resources to put in place. Among the actions, the Storage Team and
Performance and Capacity Team report that additional storage capacity is required to support
the I/O workload of the application and ultimately to resolve the problem. IT Support presents
its findings and recommendations to the company’s Business Unit, requesting application
downtime to implement the changes that can be made immediately. The Business Unit
accepts the report but says that it has no money for the purchase of new storage. They ask
the IT department how they can ensure that the additional storage can resolve the
performance issue. Additionally, the Business Unit asks the IT department why the need for
additional storage capacity was not submitted as a draft proposal three months ago when the
budget was finalized for next year, knowing that the system is one of the most critical systems
of the company.

Incidents, such as this one, make you realize the distance that can exist between the IT
department and the company’s business strategy. In many cases, the IT department plays a
key role in determining the company’s strategy. Therefore, consider these questions:
򐂰 How can you avoid situations like those described?
򐂰 How can you make performance management become more proactive and less reactive?
򐂰 What are the preferred practices for performance management?
򐂰 What are the key performance indicators of the IT infrastructure and what do they mean
from the business perspective?
򐂰 Are the defined performance thresholds adequate?
򐂰 How can you identify the risks in managing the performance of assets (servers, storage
systems, and applications) and mitigate them?



This appendix presents a method to implement a performance management process. The goal
is to give you ideas and insights with particular reference to the DS8000 storage system.
Assume in this instance that data from IBM Spectrum Control or Tivoli Storage Productivity
Center is available.

To better align the understanding between the business and the technology, use the
Information Technology Infrastructure Library (ITIL) as a guide to develop a process for
performance management as applied to DS8000 performance and tuning.

Purpose
The purpose of performance management is to ensure that the performance of the IT
infrastructure matches the demands of the business. The following activities are involved:
򐂰 Define and review performance baselines and thresholds.
򐂰 Collect performance data from the DS8000 storage system.
򐂰 Check whether the performance of the resources is within the defined thresholds.
򐂰 Analyze performance by using collected DS8000 performance data and tuning
suggestions.
򐂰 Define and review standards and IT architecture that are related to performance.
򐂰 Analyze performance trends.
򐂰 Size new storage capacity requirements.

Certain activities relate to the operational activities, such as the analysis of performance of
DS8000 components, and other activities relate to tactical activities, such as the performance
analysis and tuning. Other activities relate to strategic activities, such as storage capacity
sizing. You can split the process into three subprocesses:
򐂰 Operational performance subprocess
Analyze the performance of DS8000 components (processor complexes, device adapters
(DAs), host adapters (HAs), and ranks) and ensure that they are within the defined
thresholds and service-level objectives (SLOs) and service-level agreements (SLAs).
򐂰 Tactical performance subprocess
Analyze performance data and generate reports for tuning recommendations and the
review of baselines and performance trends.
򐂰 Strategic performance subprocess
Analyze performance data and generate reports for storage sizing and the review of
standards and architectures that relate to performance.

Every process consists of the following elements:


򐂰 Inputs: Data and information required for analysis. Here are the possible inputs:
– Performance data that is collected by IBM Spectrum Control or Tivoli Storage
Productivity Center
– Historical performance reports
– Product specifications (benchmark results, performance thresholds, and performance
baselines)
– User specifications (SLOs and SLAs)



򐂰 Outputs: The deliverables or results from the process. Here are possible types of output:
– Performance reports and tuning recommendations
– Performance trends
– Performance alerts
򐂰 Tasks: The activities that are the smallest unit of work of a process. These tasks can be
the following ones:
– Performance data collection
– Performance report generation
– Analysis and tuning recommendations
򐂰 Actors: A department or person in the organization that is specialized to perform a certain
type of work. Actors can vary from organization to organization. In smaller organizations, a
single person can own multiple actor responsibilities. Here are some examples:
– Capacity and performance team
– Storage team
– Server teams
– Database team
– Application team
– Operations team
– IT Architect
– IT Manager
– Clients
򐂰 Roles: The tasks that need to be run by an actor, but another actor might own the activity,
and other actors might be consulted. Knowing the roles is helpful when you define the
steps of the process and who is going to do what. Here are the roles:
– Responsible: The person that runs that task but is not necessarily the owner of that
task. Suppose that the capacity team is the owner for the generation of the
performance report with the tuning recommendations, but the specialist with the skill to
suggest tuning in the DS8000 storage system is the Storage Administrator.
– Accountable: The owner of that activity. There can be only one owner.
– Consulted: The people that are consulted and whose opinions are considered.
Suppose that the IT Architect proposes a new architecture for the storage. Normally,
the opinion of the Storage Administrator is requested.
– Informed: The people who are kept up-to-date on progress. The IT Manager normally
wants to know the evolution of activities.

When assigning the tasks, you can use a Responsible, Accountable, Consulted, and
Informed (RACI) matrix to list the actors and the roles that are necessary to define a process
or subprocess. A RACI diagram, or RACI matrix, is used to describe the roles and
responsibilities of various teams or people to deliver a project or perform an operation. It is
useful in clarifying roles and responsibilities in cross-functional and cross-departmental
projects and processes.



Operational performance subprocess
The operational performance subprocess relates to daily activities that can be run every few
minutes, hourly, or daily. For example, you can monitor the DS8000 storage system, and if
there is utilization above a defined threshold, automatically send an alert to the designated
person.

With Tivoli Storage Productivity Center, you can set performance thresholds for two major
categories:
򐂰 Status change alerts
򐂰 Configuration change alerts

Important: Tivoli Storage Productivity Center for Disk is not designed to monitor hardware
or to report hardware failures. You can configure the DS8000 Hardware Management
Console (HMC) to send alerts through Simple Network Management Protocol (SNMP) or
email when a hardware failure occurs.

You might also need to compare the DS8000 performance with the users’ performance
requirements. Often, these requirements are explicitly defined in formal agreements between
IT management and user management. These agreements are referred to as SLAs or SLOs.
These agreements provide a framework for measuring IT resource performance requirements
against IT resource fulfillment.

Performance SLA
A performance SLA is a formal agreement between IT Management and User
representatives concerning the performance of the IT resources. Often, these SLAs provide
goals for end-to-end transaction response times. For storage, these types of goals typically
relate to average disk response times for different types of storage. Missing the technical
goals described in the SLA results in financial penalties to the IT Service Management
providers.

Performance SLO
Performance SLOs are similar to SLAs with the exception that misses do not carry financial
penalties. Although SLO misses do not carry financial penalties, misses are a breach of
contract in many cases and can lead to serious consequences if not remedied.

Having reports that show you how many alerts and how many misses in SLOs/SLAs occurred
over time is important. The reports tell how effective your storage strategy is (standards,
architectures, and policy allocation) in the steady state. In fact, the numbers in those reports
are inversely proportional to the effectiveness of your storage strategy. The more effective
your storage strategy, the fewer performance threshold alerts are registered, and the fewer
SLO/SLA targets are missed.

It is not necessary to implement SLOs or SLAs for you to discover the effectiveness of your
current storage strategy. The definition of SLO/SLA requires a deep and clear understanding
of your storage strategy and how well your DS8000 storage system is running. That is why,
before implementing this process, you should start with the tactical performance subprocess:
򐂰 Generate the performance reports.
򐂰 Define tuning suggestions.
򐂰 Review the baseline after implementing tuning recommendations.
򐂰 Generate performance trends reports.



Then, redefine the thresholds with fresh performance numbers. If you do not, you spend time
dealing with performance incident tickets for false-positive alerts instead of analyzing the
performance and suggesting tuning for your DS8000 storage system. Now, look at the
characteristics of this process.

Inputs
The following inputs are necessary to make this process effective:
򐂰 Performance trends reports of DS8000 components: Many people ask for the IBM
recommended thresholds. The best thresholds are those that fit your environment. They
depend on the configuration of your DS8000
storage system and the types of workloads. For example, you must define thresholds for
I/O per second (IOPS) if your application is a transactional system. If the application is a
data warehouse, you must define thresholds for throughput. Also, you must not expect the
same performance from different ranks where one set of ranks has 300 GB, 15-K
revolutions per minute (RPM) disk drive modules (DDMs) and another set of ranks has
4 TB, 7200 RPM Nearline DDMs. For more information, check the outputs that are
generated from the tactical performance subprocess.
򐂰 Performance SLO and performance SLA: You can define the SLO/SLA requirements in
two ways:
– By hardware (IOPS by rank or MBps by port): This performance report is the easiest
way to implement an SLO or SLA, but the most difficult method for which to get client
agreement. The client normally does not understand the technical aspects of a
DS8000 storage system.
– By host or application (IOPS by system or MBps by host): Most probably, this
performance report is the only way that you are going to get an agreement from the
client, but this agreement is not certain. The client sometimes does not understand the
technical aspects of IT infrastructure. The typical way to define a performance SLA is
by the average execution time or response time of a transaction in the application. So,
the performance SLA/SLO for the DS8000 storage system is normally an internal
agreement among the support teams, which creates additional work for you to
generate those reports, and there is no predefined solution. It depends on your
environment’s configuration and the conditions that define those SLOs/SLAs. When
configuring the DS8000 storage system with SLO/SLA requirements, separate the
applications or hosts by logical subsystem (LSS) (reserve two LSSs, one even and one
odd, for each host, system, or instance). The benefit of generating performance reports
by using this method is that they are more meaningful to the other support teams and
to the client. So, the level of communication increases and the chances for misunderstandings
are reduced.

Important: When defining a DS8000 related SLA or SLO, ensure that the goals are based
on empirical evidence of performance within the environment. Application architects with
applications that are highly sensitive to changes in I/O throughput or response time must
consider the measurement of percentiles or standard deviations as opposed to average
values over an extended period. IT management must ensure that the technical
requirements are appropriate for the technology.
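
To illustrate why percentiles are usually more meaningful than averages for such goals, the
following Python sketch compares the average with the 95th percentile of a set of response time
samples. The sample values are invented; in practice they would come from your own RMF or
IBM Spectrum Control exports.

   # Minimal sketch: average versus 95th percentile of I/O response times (ms).
   samples = [0.4, 0.5, 0.4, 0.6, 0.5, 0.4, 3.8, 0.5, 0.6, 4.2]   # invented values

   average = sum(samples) / len(samples)

   ordered = sorted(samples)
   index = max(0, int(round(0.95 * len(ordered))) - 1)   # simple nearest-rank percentile
   p95 = ordered[index]

   print(f"Average response time: {average:.2f} ms")
   print(f"95th percentile:       {p95:.2f} ms")
   # The average hides the occasional slow I/Os that users actually notice,
   # which is why an SLO that is based on a percentile is often more meaningful.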



In cases where contractual penalties are associated with production performance SLA or
SLO misses, be careful in the management and implementation of the DS8000 storage system.
Even in the cases where no SLA or SLO exists, users have performance expectations that
are not formally communicated. In these cases, they let IT management know when the
performance of the IT resources is not meeting their expectations. Unfortunately, by the time
they communicate their missed expectations, they are often frustrated, and their ability to
manage their business is severely affected by performance issues.

Although there might not be any immediate financial penalties associated with missed user
expectations, prolonged negative experiences with underperforming IT resources result in low
user satisfaction.

Outputs
The following outputs are generated by this process:
򐂰 Documentation of defined DS8000 performance thresholds. It is important to document
the agreed-to thresholds, not just for you, but also for other members of your team or other
teams that need to know.
򐂰 DS8000 alerts for performance utilization. These alerts are generated when a DS8000
component reaches a defined level of utilization. With Tivoli Storage Productivity Center
for Disk, you can automate the performance data collection and also configure Tivoli
Storage Productivity Center to send an alert when this type of an event occurs.
򐂰 Performance reports comparing the performance utilization of the DS8000 storage system
with the performance SLO and SLA.

Tasks, actors, and roles


It is easier to visualize and understand the tasks, actors, and roles when they are combined
by using a RACI matrix, as shown in Figure A-1.

Figure A-1 Operational tasks, actors, and roles



Figure A-1 on page 557 is an example of a RACI matrix for the operational performance
subprocess, with all the tasks, actors, and roles identified and defined:
򐂰 Provide performance trends report: This report is an important input for the operational
performance subprocess. With this data, you can identify and define the thresholds that
best fit your DS8000 storage system. Consider how the workload is distributed between
the internal components of the DS8000 storage system: HAs, processor complexes, DAs,
and ranks. This analysis avoids the definition of thresholds that generate false-positive
performance alerts and ensures that you monitor only what is relevant to your environment.
򐂰 Define the thresholds to be monitored and their respective values, severity, queue to open
the ticket, and additional instructions: In this task, by using the baseline performance
report, you can identify and set the relevant threshold values. You can use Tivoli Storage
Productivity Center to create alerts when these thresholds are exceeded. For example,
you can configure Tivoli Storage Productivity Center to send the alerts through SNMP
traps to IBM Tivoli Enterprise Console® (TEC) or through email. However, the opening of
an incident ticket must be performed by the Monitoring team that needs to know the
severity to set, on which queue to open the ticket, and any additional information that is
required in the ticket. Figure A-2 is an example of the required details; a minimal code sketch
of such a threshold table follows this list.

Figure A-2 Thresholds definitions table

򐂰 Implement performance monitoring and alerting: After you define the DS8000 components
to monitor, set their corresponding threshold values. For more information about how to
configure Tivoli Storage Productivity Center, see the IBM Spectrum Control or Tivoli
Storage Productivity Center documentation, found at:
http://www.ibm.com/support/knowledgecenter/SS5R93_5.2.8/com.ibm.spectrum.sc.doc
/fqz0_c_wg_managing_resources.html
򐂰 Publish the documentation to the IT Management team: After you implement the
monitoring, send the respective documentation to those people who need to know.
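
The threshold definitions can also be captured in a machine-readable form so that the Monitoring
team and any custom reporting use the same values. The following Python sketch shows one
possible representation; the metric names, limits, severities, and queue names are invented
examples and not IBM defaults.

   # Hypothetical threshold table for DS8000 performance monitoring.
   # All metric names, limits, severities, and queue names are examples only.
   thresholds = {
       "rank_utilization_pct": {"limit": 70.0, "severity": 2, "queue": "STORAGE"},
       "port_throughput_mbps": {"limit": 600.0, "severity": 3, "queue": "STORAGE"},
       "volume_resp_time_ms":  {"limit": 15.0, "severity": 2, "queue": "STORAGE"},
   }

   def check(metric_name, value):
       """Return an alert record if the measured value exceeds its threshold."""
       rule = thresholds.get(metric_name)
       if rule and value > rule["limit"]:
           return {"metric": metric_name, "value": value,
                   "severity": rule["severity"], "queue": rule["queue"]}
       return None

   alert = check("rank_utilization_pct", 83.0)
   if alert:
       print(f"Open a severity {alert['severity']} ticket on queue {alert['queue']}: "
             f"{alert['metric']} = {alert['value']}")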

Performance troubleshooting
If an incident ticket is open for performance issues, you might be asked to investigate. The
following tips can help during your problem determination.



Sample questions for an AIX host
The following questions and comments are examples of the types of questions to ask when
engaged to analyze a performance issue. They might not all be appropriate for every
environment:
򐂰 What was running during the sample period (backups, production batch, or online
queries)?
򐂰 Describe your application (complex data warehouse, online order fulfillment, or DB2).
򐂰 Explain the types of delays experienced.
򐂰 What other factors might indicate a storage area network (SAN)-related or DS8000 related
I/O issue?
򐂰 Describe any recent changes or upgrades to the application, OS, Licensed Internal Code,
or database.
򐂰 When did the issue start and is there a known frequency (specify the time zone)?
򐂰 Does the error report show any disk, logical volume (LV), HA, or other I/O-type errors?
򐂰 What is the interpolicy of the LVs (maximum or minimum)?
򐂰 Describe any striping, mirroring, or RAID configurations in the affected LVs.
򐂰 Is this workload a production or development workload, and is it associated with
benchmarking or breakpoint testing?

In addition to the answers to these questions, the client must provide server performance and
configuration data. For more information, see the relevant host chapters in this book.

Identifying problems and recommending a fix


To identify problems in the environment, the storage and performance management team
must have the tools to monitor, collect, and analyze performance data. Although the tools
might vary, these processes are worthless without storage resource management tools. The
following sample process provides a way to correct a DS8000 disk bottleneck:
1. The performance management team identifies the hot RAID array and the logical unit
number (LUN) that must be migrated to alleviate the disk array bottleneck.
2. The performance team identifies the target RAID array with low utilization.
3. The client that uses the hot LUN is contacted and requested to open a change request to
allocate a new LUN on a lesser used array.
4. The client opens a ticket and requests a new LUN.
5. The storage management team defines and zones the new LUN.
6. The client migrates data from the old LUN to the new LUN. The specifics of this step are
OS-specific and application-specific and are not explained.
7. The client opens a change request to delete the old LUN.
8. Performance management evaluates the change and confirms success. If the disk array
still has disk contention, further changes might be recommended.



Tactical performance subprocess
This process deals with activities that occur over a cycle of weeks or months. These activities
relate to the collection of data and the generation of reports for performance and utilization
trends. With these reports, you can produce tuning recommendations. With this process, you
can gain a better understanding of the workload of each application or host system. You can
also verify that you are getting the expected benefits of new tuning recommendations
implemented on the DS8000 storage system. The performance reports generated by this
process are also used as inputs for the operational and strategic performance subprocesses.

Tip: Start with the tactical performance subprocess for the implementation of a
performance management process.

Regardless of the environment, the implementation of proactive processes to identify


potential performance issues before they become a crisis saves time and money. All the
methods that are described depend on storage management reporting tools. These tools
must provide for the long-term gathering of performance metrics, allow thresholds to be set,
and provide for alerts to be sent. These capabilities permit Performance Management to
effectively analyze the I/O workload and establish proactive processes to manage potential
problems. Key performance indicators provide information to identify consistent performance
hot spots or workload imbalances.

Inputs
The following inputs are necessary to make this process effective:
򐂰 Product specifications: Documents that describe the characteristics and features of the
DS8000 storage system, such as data sheets, Announcement Letters, and planning
manuals
򐂰 Product documentation: Documents that provide information about the installation and use
of the DS8000 storage system, such as user manuals, white papers, and IBM Redbooks
publications
򐂰 Performance SLOs/performance SLAs: The documentation of performance SLO/SLAs to
which the client agreed for the DS8000 storage system.

Outputs
Performance reports with tuning recommendations and performance trends reports are the
outputs that are generated by this process.

Performance reports with tuning recommendations


Create your performance report with at least three chapters:
򐂰 Hardware view: This view provides information about the performance of each component
of the DS8000 storage system. You can determine the health of the storage subsystem by
analyzing key performance metrics for the HA, port, array, and volume. You can generate
workload profiles at the DS8000 subsystem or component level by gathering key
performance metrics for each component.
򐂰 Application or host system view: This view helps you identify the workload profile of each
application or host system that accesses the DS8000 storage system. This view helps you
understand which RAID configuration performs best or whether it is advisable to move a
specific system from ranks with DDMs of 300 GB / 15-K RPM to ranks with DDMs of
1200 GB / 10-K RPM, for example.



This information also helps other support teams, such as Database Administrators, the IT
Architect, the IT Manager, or even the clients. This type of report can assist you in
meetings to become more interactive with more people asking questions.
򐂰 Conclusions and recommendations: This section is the information that the IT Manager,
the clients, and the IT Architect read first. Based on the recommendations, an agreement
can be reached about actions to implement to mitigate any performance issue.

Performance trends reports


It is important for you to see how the performance of the DS8000 storage system is changing
over time. As in “Performance reports with tuning recommendations” on page 560, create
three chapters:
򐂰 Hardware view: Analysis of the key performance indicators that are referenced in 7.3, “IBM
Spectrum Control data collection considerations” on page 238.
򐂰 Application or host system view: One technique for creating workload profiles is to group
volume performance data into logical categories based on their application or host system.
򐂰 Conclusions and recommendations: The reports might help you recommend changing the
threshold values for your DS8000 performance monitoring or alerting. The report might
show that a specific DS8000 component will soon reach its capacity or performance limit.

Tasks, actors, and roles


Figure A-3 shows an example of a RACI matrix for the tactical performance subprocess with
all the tasks, actors, and roles identified and defined.

Figure A-3 Tactical tasks, actors, and roles

The RACI matrix includes the following tasks:


򐂰 Collect configuration and raw performance data: Use Tivoli Storage Productivity Center for
Disk daily probes to collect configuration data. Set up the Subsystem Performance Monitor
to run indefinitely and to collect data at 15-minute intervals for each DS8000 storage
system.
򐂰 Generate performance graphics: Produce one key metric for each physical component in
the DS8000 storage system over time. For example, if your workload is evenly distributed
across the entire day, show the average daily disk utilization for each disk array. Configure
thresholds in the chart to identify when a performance constraint might occur. Use a
spreadsheet to create a linear trend line based on the previously collected data and identify
when a constraint might occur (see the trend projection sketch after this list for an
equivalent calculation in code).



򐂰 Generate performance reports with tuning recommendations: You must collect and review
host performance data on a regular basis. When you discover a performance issue,
Performance Management must work with the Storage team and the client to develop a
plan to resolve it. This plan can involve some form of data migration on the DS8000
storage system. For key I/O-related performance metrics, see the chapters in this book for
your specific OS. Typically, the end-to-end I/O response time is measured because it
provides the most direct measurement of the health of the SAN and disk subsystem.
򐂰 Generate performance reports for trend analysis: Methodologies that are typically applied
in capacity planning can be applied to the storage performance arena. These
methodologies rely on workload characterization, historical trends, linear trending
techniques, I/O workload modeling, and “what if” scenarios. You can obtain details about
disk management in Chapter 6, “Performance planning tools” on page 159.
򐂰 Schedule meetings with the involved areas: The frequency depends on the dynamism of
your IT environment. The rate of change (such as the deployment of new systems,
upgrades of software and hardware, fix management, allocation of new LUNs, and the
implementation of Copy Services) determines the frequency of the meetings. For the
performance reports with tuning recommendations, have weekly meetings. For
performance trends reports, have meetings monthly. You might want to change the
frequency of these meetings after you gain more confidence and familiarity with the
performance management process. At the end of these meetings, define with the other
support teams and the IT Manager the actions to resolve any potential issues that are
identified.
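
The linear trend projection that a spreadsheet provides can also be done in a few lines of code.
The following Python sketch fits a straight line to historical utilization values and estimates when
a chosen limit would be crossed; the monthly figures and the 70% limit are invented examples.

   # Minimal sketch: project when a utilization trend crosses a limit.
   history = [42.0, 45.5, 47.0, 50.5, 53.0, 55.5]   # invented monthly averages (%)
   limit = 70.0                                      # constraint to watch for (%)

   n = len(history)
   xs = list(range(n))                               # month index 0, 1, 2, ...

   # Ordinary least-squares fit of utilization = slope * month + intercept.
   mean_x = sum(xs) / n
   mean_y = sum(history) / n
   slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, history))
            / sum((x - mean_x) ** 2 for x in xs))
   intercept = mean_y - slope * mean_x

   if slope > 0:
       months_to_limit = (limit - intercept) / slope
       print(f"Trend: {slope:.2f}% per month; "
             f"the {limit}% limit is reached around month {months_to_limit:.1f}")
   else:
       print("No upward trend; the limit is not projected to be reached")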

Strategic performance subprocess


The strategic performance subprocess relates to activities that occur over a cycle of
6 - 12 months. These activities define or review standards and the architecture or size new or
existing DS8000 storage systems.

It might be an obvious observation, but it is important to remember that the IT resources are
finite and some day they will run out. In the same way, the money to invest in IT Infrastructure
is limited, which is why this process is important. In each company, there is normally a time
when the budget for the next year is decided. So, even if you present a list of requirements
with performance reports to justify the investments, you might not be successful. The timing
of the request and the benefit of the investment to the business are also important
considerations.

Just keeping the IT systems running is not enough. The IT Manager and Chief Information
Officer (CIO) must show the business benefit for the company. Usually, this benefit means
providing the service at the lowest cost, but also showing a financial advantage that the
services provide. This benefit is how the IT industry grew over the years while it increased
productivity, reduced costs, and enabled new opportunities.

You must check with your IT Manager or Architect to learn when the budget is set and start
3 - 4 months before this date. You can then define the priorities for the IT infrastructure for the
coming year to meet the business requirements.

Inputs
The following inputs are required to make this process effective:
򐂰 Performance reports with tuning recommendations
򐂰 Performance trends reports



Outputs
The following outputs are generated by this process:
򐂰 Standards and architectures: Documents that specify the following items:
– Naming convention for the DS8000 components: ranks, extent pools, volume groups,
host connections, and LUNs.
– Rules to format and configure the DS8000 storage system: Arrays, RAID, ranks, extent
pools, volume groups, host connections, LSSs, and LUNs.
– Policy allocation: When to pool the applications or host systems on the same set of ranks. When to segment the applications or host systems onto different ranks. Which type of workload must use RAID 5, RAID 6, or RAID 10? Which type of workload must use flash drives, or DDMs of 300 GB / 15 K RPM, 600 GB / 10 K RPM, or 1200 GB / 10 K RPM?
򐂰 Sizing of new or existing DS8000 storage systems: According to the business demands,
what are the recommended capacity, cache, and host ports for a new or existing DS8000?
򐂰 Plan configuration of new DS8000 storage systems: What is the planned configuration of
the new DS8000 storage system based on your standards and architecture and according
to the workload of the systems that will be deployed?

Tasks, actors, and roles


Figure A-4 is an example of a RACI matrix for the strategic performance subprocess with all
the tasks, actors, and roles identified and defined.

Figure A-4 Strategic tasks, actors, and roles

򐂰 Define priorities of new investments: In defining the priorities of where to invest, you must
consider these four objectives:
– Reduce cost: The simplest example is storage consolidation. There might be several
storage systems in your data center that are nearing the ends of their useful lives. The
costs of maintenance are increasing, and the storage systems use more energy than
new models. The IT Architect can create a case for storage consolidation, but needs
your help to specify and size the new storage.
– Increase availability: There are production systems that need to be available 24x7. The
IT Architect must submit a new solution for this case to provide data mirroring. The IT
Architect requires your help to specify the new storage for the secondary site and to
provide figures for the necessary performance.



– Mitigate risks: Consider a case where a system is running on an old storage model without a support contract from the vendor. That system started as an unimportant pilot. Over time, the system performed well and is now a key application for the company. The IT Architect must submit a proposal to migrate it to a new storage system. Again, the IT Architect needs your help to specify the new storage requirements.
– Business units’ demands: Depending on the target results that each business unit must
meet, the business units might require additional IT resources. The IT Architect
requires information about the additional capacity that is required.
򐂰 Define and review standards and architectures: After you define the priorities, you might need to review the standards and architecture. New technologies appear, so you might need to specify new standards for new storage models. Perhaps, after a period of analyzing the performance of your DS8000 storage system, you discover that for a certain workload you need to change a standard.
򐂰 Size new or existing DS8000 storage system: Modeling tools, such as Disk Magic, which
is described in 6.1.7, “Disk Magic modeling” on page 164, can gather multiple workload
profiles based on host performance data into one model and provide a method to assess
the impact of one or more changes to the I/O workload or DS8000 configuration.

Tip: For environments with multiple applications on the same physical servers or on
logical partitions (LPARs) that use the same Virtual I/O Servers (VIOSs), defining new
requirements can be challenging. Build profiles at the DS8000 level first and eventually
move into more in-depth study and understanding of the other shared resources in the
environment.

򐂰 Plan configuration of a new DS8000 storage system: Configuring the DS8000 storage
system to meet the specific I/O performance requirements of an application reduces the
probability of production performance issues. To produce a design to meet these
requirements, Storage Management needs to know the following items:
– IOPS
– Read/write ratios
– I/O transfer size
– Access type: Sequential or random
For help in converting application profiles to I/O workload, see Chapter 5, “Understanding
your workload” on page 141.
After the I/O requirements are identified, documented, and agreed upon, the DS8000 layout and logical planning can begin. For more information and considerations about planning for performance, see Chapter 4, “Logical configuration performance considerations” on page 83. A simple worked example of deriving these figures from an application profile follows.
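The following worked example illustrates the conversion of an application profile into the I/O figures listed above. All of the input numbers (transactions per second, I/Os per transaction, read ratio, and transfer size) are hypothetical assumptions chosen only to show the arithmetic; substitute the values from your own application profile.

# Hypothetical application profile (all values are assumptions for illustration).
transactions_per_second = 500
ios_per_transaction = 8          # logical I/Os issued per transaction
read_ratio = 0.70                # 70% reads, 30% writes
transfer_size_kib = 8            # average I/O transfer size in KiB

total_iops = transactions_per_second * ios_per_transaction
read_iops = total_iops * read_ratio
write_iops = total_iops - read_iops
throughput_mibps = total_iops * transfer_size_kib / 1024

print(f"Total IOPS:         {total_iops:,.0f}")
print(f"  Read IOPS:        {read_iops:,.0f}")
print(f"  Write IOPS:       {write_iops:,.0f}")
print(f"Throughput (MiBps): {throughput_mibps:,.1f}")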

Communication: A lack of communication between the Application Architects and the Storage Management team regarding I/O requirements is likely to result in production performance issues. It is essential that these requirements are clearly defined.

For existing applications, you can use Disk Magic to analyze an application I/O profile. Details about Disk Magic are in Chapter 6, “Performance planning tools” on page 159.




Appendix B. Benchmarking
Benchmarking storage systems is complex because of the many hardware and software components that are involved. This appendix describes the goals of a storage benchmark and the ways to conduct one effectively.

This appendix includes the following topics:


򐂰 Goals of benchmarking
򐂰 Requirements for a benchmark



Goals of benchmarking
Today, clients face difficult choices because of the number of storage vendors and their product portfolios. Performance information that is provided by storage vendors can be generic and often is not representative of real client environments. Benchmarking can help you decide because it provides an accurate representation of a storage product's performance in a simulated application environment. The main objective of benchmarking is to identify the performance capabilities of a storage system for a specific production environment and to compare the performance of two or more storage systems. Ideally, the benchmark includes real production data.

To conduct a benchmark, you need a solid understanding of all of the parts of your
environment. This understanding includes the storage system requirements and also the
storage area network (SAN) infrastructure, the server environments, and the applications.
Emulating the actual environment, including actual applications and data, along with user
simulation, provides an efficient and accurate analysis of the performance of the storage system that is tested. A key characteristic of a performance benchmark is that its results must be reproducible to validate the integrity of the test.

What to benchmark
Benchmarking can be a simple thing, such as when you want to see the performance impact
of upgrading to a 16 Gb host adapter (HA) from an 8 Gb HA. The simplest scenario is if you
have the new HA card that is installed on your test system, in which case you run your normal
test workload by using the old HA card and rerun the same workload by using the new HA
card. Analyzing the performance metrics of the two runs shows you how the 16 Gb HA compares with, and ideally improves on, the 8 Gb HA. You can compare response time, port intensity, HA utilization, and, for z/OS, the connect time.
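A simple way to perform such a comparison is to place the corresponding metrics from the two runs side by side and compute the change. The following Python sketch uses invented sample values purely for illustration; replace them with the averages reported by your monitoring tool.

# Illustrative averages from the two measurement runs (values are assumptions).
run_8gb  = {"response_time_ms": 1.8, "ha_utilization_pct": 62.0, "mbps": 410.0}
run_16gb = {"response_time_ms": 1.2, "ha_utilization_pct": 38.0, "mbps": 405.0}

print(f"{'Metric':<22}{'8 Gb HA':>10}{'16 Gb HA':>10}{'Change %':>10}")
for metric, old in run_8gb.items():
    new = run_16gb[metric]
    change = (new - old) / old * 100
    print(f"{metric:<22}{old:>10.1f}{new:>10.1f}{change:>+10.1f}")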

Benchmarking can be a complex and laborious project. The hardware that is required to do
the benchmark can be substantial. An example is if you want to benchmark your production
workload on the new DS8880 storage system in a Metro/Global Mirror environment at a
1000 km distance. Fortunately, there is equipment that can simulate the distance from your
primary to secondary site, so you do not need to have a physical storage system at a remote
location 1000 km away to perform this benchmark.

Here are some benchmark options:


򐂰 Local onsite location, where the hardware to be tested is already installed in your test
environment:
– New 16 Gb HA
– New High-Performance Flash Enclosure (HPFE) with flash cards
– New DS8880 storage system
򐂰 Benchmark at an IBM Lab, where a much more complex configuration is required:
– Benchmarking a new storage system, such as the DS8886 storage system
– A Metro Mirror or Global Mirror environment with a remote/distant location
– Benchmarking a Global Mirror environment versus a z/OS Global Mirror (XRC)
environment



Benchmark key indicators
The popularity of a benchmark is based on how representative the workload is and whether the results are meaningful. Three key indicators can be derived from the benchmark results to evaluate a storage system:
򐂰 Performance results in a real application environment
򐂰 Reliability
򐂰 Total cost of ownership (TCO)

Performance is not the only component to consider in benchmark results. Reliability and
cost-effectiveness must be considered. Balancing benchmark performance results with
reliability, functions, and TCO of the storage system provides a global view of the storage
product value.

To help clients understand the intrinsic value of storage products in the marketplace, vendor-neutral independent organizations developed several generic benchmarks. One of the well-known organizations is the Storage Performance Council (SPC). You can learn more about SPC at the following website:
http://www.storageperformance.org

Selecting one of these workloads from SPC depends on how representative that workload is of your current production workload or of a new workload that you plan to implement. If none of
them fits your needs, then you must either build your own workload or ask your IBM account
team or IBM Business Partner for assistance in creating one. This way, the benchmark result
reflects what you expected to evaluate in the first place.

Requirements for a benchmark


You must carefully review your requirements before you set up a storage benchmark and use
these requirements to develop a detailed but reasonable benchmark specification and time
frame. Furthermore, you must clearly identify the objective of the benchmark with all the
participants and precisely define the success criteria of the results.

Defining the benchmark architecture


The benchmark architecture includes the specific storage equipment that you want to test, the servers that
host your application, the servers that are used to generate the workload, and the SAN
equipment that is used to interconnect the servers and the storage system. The monitoring
equipment and software are also part of the benchmark architecture.

Defining the benchmark workload


Your application environment can have different categories of data processing requirements. In most cases, two data processing types can be identified: one is characterized as online transaction processing (OLTP), and the other as batch processing.

The OLTP category typically has many users, who all access the same storage system and a common set of files. The requests are typically random and spread across many files with a small transfer size (typically 4 KB records).



Batch workloads consist of frequent sequential accesses to the databases with a large data transfer size. Batch processing is expected to run within a particular window and to finish before the OLTP period starts. Month-end batch processing often causes a problem when it takes longer than usual and runs into the OLTP window.

To identify the specification of your production workload, you can use monitoring tools that are
available at the operating system level.

To set up a benchmark environment, there are two ways to generate the workload.
򐂰 The first way to generate the workload, which is the most complex, is to create a copy of
the production environment, including the applications software and the application data.
In this case, you must ensure that the application is well-configured and optimized on the
server operating system. The data volume also must be representative of the production
environment. Depending on your application, a workload can be generated by using
application scripts or an external transaction simulation tool. These tools provide a
simulation of users accessing your application. To build up an external simulation tool, you
first record several typical requests from several users and then generate these requests
multiple times. This process can provide an emulation of hundreds or thousands of
concurrent users to put the application through the rigors of real-life user loads and
measure the response times of key performance metrics.
򐂰 The other way to generate the workload is to use a standard workload generator or an I/O driver. These tools, which are specific to each operating system, produce different kinds of I/O loads on the storage systems. You can configure and tune these tools to match your application workload (a minimal scripted sketch follows this list). Here are the main performance metrics that must be tuned to closely simulate your current workload:
– I/O Rate
– Read to Write Ratio
– Read Hit Ratio
– Read and Write Transfer Size
– % Read and Write Sequential I/O
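As one possible way to drive such a profile, the following Python sketch writes a job file for the open source fio I/O driver. It is only a starting point: the device name, the target rate, and the read/write mix are hypothetical, the option set is deliberately minimal, and you should verify the options against the fio documentation for your platform. Never point a write workload at a device that contains production data.

# Hypothetical workload profile derived from production monitoring data.
profile = {
    "target_iops": 20000,     # aggregate I/O rate to drive
    "read_pct": 70,           # read-to-write ratio
    "block_size": "8k",       # average transfer size
    "device": "/dev/dm-10",   # benchmark LUN -- never a production volume!
}

job = f"""[global]
ioengine=libaio
direct=1
time_based
runtime=600

[oltp-like]
filename={profile['device']}
rw=randrw
rwmixread={profile['read_pct']}
bs={profile['block_size']}
iodepth=16
numjobs=8
rate_iops={profile['target_iops'] // 8}
"""

with open("benchmark.fio", "w") as f:
    f.write(job)
print("Wrote benchmark.fio; run it with: fio benchmark.fio")

Because rate_iops in fio applies per job, the target rate is divided by the number of jobs in this sketch.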

Running the benchmark


When you start the workload, it ramps up from no activity until it reaches the target I/O load. At that point, let it run at this steady state for several measurement periods. Then, increase the I/O load to the next target and run the new steady state for some period. Repeat this sequence as many times as necessary until you reach the maximum I/O load that you planned.
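A stepped load profile such as the one just described can be scripted. The sketch below is a bare skeleton under assumed values (the step sizes and the 15-minute steady-state window are arbitrary); in a real benchmark, the body of run_at would reconfigure or restart your workload generator at the new target rate while your monitoring tools collect data.

import time

# Assumed load steps (IOPS) and steady-state duration per step (seconds).
load_steps = [5000, 10000, 15000, 20000, 25000]
steady_state_seconds = 900          # 15-minute measurement window per step

def run_at(target_iops: int) -> None:
    # Placeholder: reconfigure or restart your workload generator here,
    # for example by rewriting its job file with the new target rate.
    print(f"Driving workload at {target_iops} IOPS ...")

for target in load_steps:
    run_at(target)
    time.sleep(steady_state_seconds)    # hold steady state while metrics are collected
print("Maximum planned load reached; stop the workload and collect the data.")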

After the benchmark is completed, the performance measurement data that is collected can
then be analyzed. The benchmark can be repeated and compared to the previous results.
This action ensures that there is no anomaly during the workload run.

Considering all the effort and resources that are required to set up the benchmark, it is prudent to plan other benchmark scenarios that you might want to run. As an example, you might run other types of workloads, such as OLTP and batch. Another scenario might be running with different cache sizes.



Monitoring the performance
Monitoring is a critical component of benchmarking and must be fully integrated into the benchmark architecture. The more information you have about component activity at each level of the benchmark environment, the better you understand where the weaknesses of the solution are. With this critical source of information, you can precisely identify bottlenecks, optimize component utilization, and improve the configuration.

At a minimum, monitoring tools are required at the following levels in a storage benchmark architecture:
򐂰 Storage level: Monitoring at the storage level provides information about the intrinsic performance of the storage equipment components. Most of these monitoring tools report storage server utilization, storage cache utilization, volume performance, RAID array performance for both disk drive modules (DDMs) and SSD or flash, and adapter utilization.
򐂰 SAN level: Monitoring at the SAN level provides information about the interconnect workloads between servers and storage systems. This monitoring helps you check that the workload is balanced across the different paths that are used for production and verify that the interconnection is not a performance bottleneck.
򐂰 Server level: Monitoring at the server level provides information about server component utilization (processor, memory, storage adapter, and file system). This monitoring helps you understand the type of workload that the application hosted on the server is generating and evaluate the storage performance, in terms of response time and bandwidth, from the application point of view.
򐂰 Application level: This monitoring is the most powerful tool in performance analysis because it measures performance from the user's point of view and highlights bottlenecks of the entire solution. Monitoring the application is not always possible; only a few applications provide a performance module for monitoring application processes.

Based on the monitoring reports, bottlenecks can be identified. You can then either modify the workload or add hardware, such as more flash ranks, if possible.

Defining the benchmark time frame


Consider the following information when you plan and define the benchmark:
򐂰 Time to define the workload that will be benchmarked. Using the monitoring tool, you can
obtain the I/O characteristics of the workload, and probably find the peak periods that you
want to model, for example, peak I/O Rate and peak Write MBps for a Remote Copy
environment.
򐂰 Time to set up the environment (hardware and software), restore or create your data, and
validate that the solution works.
򐂰 Time of execution of each scenario, considering that each scenario must be run several
times.
򐂰 Time to analyze the monitoring data that is collected.
򐂰 After each run, the benchmark data might have been changed, inserted, deleted, or otherwise modified, so it must be restored before another test iteration. In that case, consider the time that is needed to restore the original data after each run.

During a benchmark, each scenario must be run several times so that you can use monitoring tools to understand how the different components perform, identify bottlenecks, and then test different ways to get an overall performance improvement by tuning each component.



Using the benchmark results to configure the storage system
Based on the benchmark results, you decide which storage system configuration to select for your installation. There are some considerations to keep in mind:
򐂰 The benchmark might be based on only the core application. If so, it does not include the
activity from other applications.
򐂰 The benchmark might be based on your I/O workload characteristics, but only on one measurement interval, for example, a peak interval during the online period. It also might be based on the average I/O workload characteristics during the online period of 8:00 AM - 5:00 PM. Neither of these two choices reflects the variance of the workload during that online period.
򐂰 If the monitoring tool indicates that there are any resources that are close to saturation,
consider increasing the size or number of that particular resource.



Related publications

The publications that are listed in this section are considered suitable for a more detailed
discussion of the topics covered in this book.

IBM Redbooks publications


For information about ordering these publications, see “How to get IBM Redbooks
publications” on page 573. Most of the documents referenced here will be available in
softcopy only:
򐂰 AIX 5L Performance Tools Handbook, SG24-6039
򐂰 AIX 5L Practical Performance Tools and Tuning Guide, SG24-6478
򐂰 Best Practices for DB2 on AIX 6.1 for POWER Systems, SG24-7821
򐂰 Database Performance Tuning on AIX, SG24-5511
򐂰 DS8000 I/O Priority Manager, REDP-4760
򐂰 DS8000: Introducing Solid State Drives, REDP-4522
򐂰 Effective zSeries Performance Monitoring Using Resource Measurement Facility,
SG24-6645
򐂰 End to End Performance Management on IBM i, SG24-7808
򐂰 FICON Native Implementation and Reference Guide, SG24-6266
򐂰 High Availability and Disaster Recovery Options for DB2 for Linux, UNIX, and Windows,
SG24-7363
򐂰 IBM AIX Version 7.1 Differences Guide, SG24-7910
򐂰 IBM DB2 11 for z/OS Performance Topics, SG24-8222
򐂰 IBM DS8000 Easy Tier, REDP-4667
򐂰 IBM DS8870 Copy Services for IBM z Systems, SG24-6787
򐂰 IBM DS8870 Copy Services for Open Systems, SG24-6788
򐂰 IBM i and IBM System Storage: A Guide to Implementing External Disks on IBM i,
SG24-7120
򐂰 IBM ProtecTIER Implementation and Best Practices Guide, SG24-8025
򐂰 IBM System Storage DS8000: Architecture and Implementation, SG24-8886
򐂰 IBM System Storage DS8000: Host Attachment and Interoperability, SG24-8887
򐂰 IBM System Storage SAN Volume Controller and Storwize V7000 Best Practices and
Performance Guidelines, SG24-7521
򐂰 IBM System Storage SAN Volume Controller and Storwize V7000 Replication Family
Services, SG24-7574
򐂰 IBM System Storage Solutions Handbook, SG24-5250
򐂰 IBM System Storage TS7600 with ProtecTIER Version 3.3, SG24-7968
򐂰 Implementing IBM Storage Data Deduplication Solutions, SG24-7888
򐂰 Implementing the IBM System Storage SAN Volume Controller V7.4, SG24-7933



򐂰 Introduction to Storage Area Networks, SG24-5470
򐂰 Linux for IBM System z9 and IBM zSeries, SG24-6694
򐂰 Linux Handbook A Guide to IBM Linux Solutions and Resources, SG24-7000
򐂰 Linux on IBM System z: Performance Measurement and Tuning, SG24-6926
򐂰 Linux Performance and Tuning Guidelines, REDP-4285
򐂰 Performance Metrics in TotalStorage Productivity Center Performance Reports,
REDP-4347
򐂰 Ready to Access DB2 for z/OS Data on Solid-State Drives, REDP-4537
򐂰 Tuning Linux OS on System p The POWER Of Innovation, SG24-7338
򐂰 Virtualizing an Infrastructure with System p and Linux, SG24-7499
򐂰 z/VM and Linux on IBM System z, SG24-7492

Other publications
These publications are also relevant as further information sources:
򐂰 AIX Disk Queue Depth Tuning for Performance, TD105745
򐂰 Application Programming Interface Reference, GC27-4211
򐂰 Command-Line Interface User's Guide, SC27-8526
򐂰 Driving Business Value on Power Systems with Solid-State Drives, POW03025USEN
򐂰 DS8870 Introduction and Planning Guide, GC27-4209
򐂰 Host Systems Attachment Guide, SC27-8527
򐂰 IBM DS8000 Performance Configuration Guidelines for Implementing Oracle Databases
with ASM, WP101375
򐂰 IBM DS8880 Introduction and Planning Guide, GC27-8525
򐂰 IBM i Shared Storage Performance using IBM System Storage DS8000 I/O Priority
Manager, WP101935
򐂰 IBM System Storage DS8700 and DS8800 Performance with Easy Tier 2nd Generation,
WP101961
򐂰 IBM System Storage DS8800 and DS8700 Performance with Easy Tier 3rd Generation,
WP102024
򐂰 IBM System Storage DS8800 Performance Whitepaper, WP102025
򐂰 DS8800 and DS8700 Introduction and Planning Guide, GC27-2297
򐂰 Multipath Subsystem Device Driver User’s Guide, GC52-1309
򐂰 Tuning SAP on DB2 for z/OS on z Systems, WP100287
򐂰 Tuning SAP with DB2 on IBM AIX, WP101601
򐂰 Tuning SAP with Oracle on IBM AIX, WP100377
򐂰 “Web Power – New browser-based Job Watcher tasks help manage your IBM i
performance” in the IBM Systems Magazine:
http://www.ibmsystemsmag.com/ibmi



Online resources
These websites and URLs are also relevant as further information sources:
򐂰 Documentation for the DS8000 storage system:
http://www.ibm.com/systems/storage/disk/ds8000/index.html
򐂰 IBM Announcement letters (for example, search for R8.0):
http://www.ibm.com/common/ssi/index.wss
򐂰 IBM Disk Storage Feature Activation (DSFA):
http://www.ibm.com/storage/dsfa
򐂰 IBM i IBM Knowledge Center:
http://www-01.ibm.com/support/knowledgecenter/ssw_ibm_i/welcome
򐂰 IBM System Storage Interoperation Center (SSIC):
http://www.ibm.com/systems/support/storage/config/ssic/index.jsp
򐂰 IBM Techdoc Library - The IBM Technical Sales Library:
http://www.ibm.com/support/techdocs/atsmastr.nsf/Web/Techdocs

How to get IBM Redbooks publications


You can search for, view, or download IBM Redbooks publications, Redpapers, Hints and
Tips, draft publications and Additional materials, and order hardcopy IBM Redbooks
publications or CD-ROMs, at this website:
ibm.com/redbooks

Help from IBM


IBM Support and downloads
ibm.com/support

IBM Global Services


ibm.com/services
