DS8000 Performance Monitoring and Tuning
Bert Dufrasne Brett Allison John Barnes Jean Iyabi Rajesh Jeyapaul Peter Kimmel
Chuck Laing Anderson Nobre Rene Oehme Gero Schmidt Paulus Usong
ibm.com/redbooks
International Technical Support Organization
DS8000 Performance Monitoring and Tuning
March 2009
SG24-7146-01
Note: Before using this information and the product it supports, read the information in Notices on page xiii.
Second Edition (March 2009)
This edition applies to the IBM System Storage DS8000 with Licensed Machine Code 5.4.1.xx.xx (Code bundles 64.1.x.x).
Copyright International Business Machines Corporation 2009. All rights reserved. Note to U.S. Government Users Restricted Rights -- Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
Contents
Notices
Trademarks
Preface
The team that wrote this IBM Redbooks publication
Special thanks
Become a published author
Comments welcome
Chapter 1. DS8000 characteristics
1.1 The storage server challenge
1.1.1 Performance numbers
1.1.2 Recommendations and rules
1.1.3 Modeling your workload
1.1.4 Allocating hardware components to workloads
1.2 Meeting the challenge: DS8000
1.2.1 DS8000 models and characteristics
1.3 DS8000 performance characteristics overview
1.3.1 Advanced caching techniques
1.3.2 IBM System Storage multipath Subsystem Device Driver (SDD)
1.3.3 Performance characteristics for System z
Chapter 2. Hardware configuration
2.1 Storage system
2.2 Processor memory and cache
2.2.1 Cache and I/O operations
2.2.2 Determining the right amount of cache storage
2.3 RIO-G interconnect and I/O enclosures
2.3.1 RIO-G loop
2.3.2 I/O enclosures
2.4 Disk subsystem
2.4.1 Device adapters
2.4.2 Fibre Channel disk architecture in the DS8000
2.4.3 Disk enclosures
2.4.4 Fibre Channel drives compared to FATA and SATA drives
2.4.5 Arrays across loops
2.4.6 Order of installation
2.4.7 Performance Accelerator feature (Feature Code 1980)
2.5 Host adapters
2.5.1 Fibre Channel and FICON host adapters
2.5.2 ESCON host adapters
2.5.3 Multiple paths to Open Systems servers
2.5.4 Multiple paths to System z servers
2.5.5 Spreading host attachments
2.6 Tools to aid in hardware planning
2.6.1 White papers
2.6.2 Disk Magic
2.6.3 Capacity Magic
Chapter 3. Understanding your workload
3.1 General workload types
3.1.1 Standard workload
3.1.2 Read intensive cache unfriendly workload
3.1.3 Sequential workload
3.1.4 Batch jobs workload
3.1.5 Sort jobs workload
3.2 Database workload
3.2.1 DB2 query workload
3.2.2 DB2 logging workload
3.2.3 DB2 transaction environment workload
3.2.4 DB2 utilities workload
3.3 Application workload
3.3.1 General file serving
3.3.2 Online transaction processing
3.3.3 Data mining
3.3.4 Video on demand
3.3.5 Data warehousing
3.3.6 Engineering and scientific applications
3.3.7 Digital video editing
3.4 Profiling workloads in the design phase
3.5 Understanding your workload type
3.5.1 Monitoring the DS8000 workload
3.5.2 Monitoring the host workload
Chapter 4. Logical configuration concepts and terminology
4.1 RAID levels and spares
4.1.1 RAID 5 overview
4.1.2 RAID 6 overview
4.1.3 RAID 10 overview
4.1.4 Spare creation
4.2 The abstraction layers for logical configuration
4.2.1 Array sites
4.2.2 Arrays
4.2.3 Ranks
4.2.4 Extent pools
4.2.5 Logical volumes
4.2.6 Space Efficient volumes
4.2.7 Allocation, deletion, and modification of LUNs and CKD volumes
4.2.8 Logical subsystems (LSS)
4.2.9 Address groups
4.2.10 Volume access
4.2.11 Summary of the logical configuration hierarchy
4.3 Understanding the array to LUN relationship
4.3.1 How extents are formed together to make DS8000 LUNs
4.3.2 Understanding data I/O placement on ranks and extent pools
Chapter 5. Logical configuration performance considerations
5.1 Basic configuration principles for optimal performance
5.1.1 Workload isolation
5.1.2 Workload resource-sharing
5.1.3 Workload spreading
5.1.4 Using workload isolation, resource-sharing, and spreading
5.2 Analyzing application workload characteristics
5.2.1 Determining isolation requirements
5.2.2 Reviewing remaining workloads for feasibility of resource-sharing
5.3 Planning allocation of disk and host connection capacity
5.3.1 Planning DS8000 hardware resources for isolated workloads
5.3.2 Planning DS8000 hardware resources for resource-sharing workloads
5.4 Planning volume and host connection spreading
5.4.1 Spreading volumes for isolated and resource-sharing workloads
5.4.2 Spreading host connections for isolated and resource-sharing workloads
5.5 Planning array sites
5.5.1 DS8000 configuration example 1: Array site planning considerations
5.5.2 DS8000 configuration example 2: Array site planning considerations
5.5.3 DS8000 configuration example 3: Array site planning considerations
5.6 Planning RAID arrays and ranks
5.6.1 RAID-level performance considerations
5.6.2 RAID array considerations
5.6.3 Rank considerations
5.7 Planning extent pools
5.7.1 Single-rank and multi-rank extent pools
5.7.2 Extent allocation methods for multi-rank extent pools
5.7.3 Balancing workload across available resources
5.7.4 Assigning workloads to extent pools
5.7.5 Planning for multi-rank extent pools
5.7.6 Planning for single-rank extent pools
5.8 Plan address groups, LSSs, volume IDs, and CKD PAVs
5.8.1 Volume configuration scheme using application-related LSS/LCU IDs
5.8.2 Volume configuration scheme using hardware-bound LSS/LCU IDs
5.9 Plan I/O port IDs, host attachments, and volume groups
5.9.1 DS8000 configuration example 1: I/O port planning considerations
5.9.2 DS8000 configuration example 2: I/O port planning considerations
5.9.3 DS8000 configuration example 3: I/O port planning considerations
5.10 Implement and document DS8000 logical configuration
Chapter 6. Performance management process
6.1 Introduction
6.2 Purpose
6.3 Operational performance subprocess
6.3.1 Inputs
6.3.2 Outputs
6.3.3 Tasks, actors, and roles
6.3.4 Performance troubleshooting
6.4 Tactical performance subprocess
6.4.1 Inputs
6.4.2 Outputs
6.4.3 Tasks, actors, and roles
6.5 Strategic performance subprocess
6.5.1 Inputs
6.5.2 Outputs
6.5.3 Tasks, actors, and roles
Chapter 7. Performance planning tools
7.1 Disk Magic
7.1.1 The need for performance planning and modeling tools
7.1.2 Overview and characteristics
7.1.3 Output information
7.1.4 Disk Magic modeling
7.2 Disk Magic for System z (zSeries)
7.2.1 Process the DMC file
7.2.2 zSeries model to merge the two ESS-800s to a DS8300
7.2.3 Disk Magic performance projection for zSeries model
7.2.4 Workload growth projection for zSeries model
7.3 Disk Magic for Open Systems
7.3.1 Process the TotalStorage Productivity Center csv output file
7.3.2 Open Systems model to merge the two ESS-800s to a DS8300
7.3.3 Disk Magic performance projection for an Open Systems model
7.3.4 Workload growth projection for an Open Systems model
7.4 Workload growth projection
7.5 Input data needed for Disk Magic study
7.5.1 z/OS environment
7.5.2 Open Systems environment
7.6 Configuration guidelines
Chapter 8. Practical performance management
8.1 Introduction to practical performance management
8.2 Performance management tools
8.2.1 TotalStorage Productivity Center overview
8.2.2 TotalStorage Productivity Center data collection
8.2.3 TotalStorage Productivity Center measurement of DS8000 components
8.2.4 General TotalStorage Productivity Center measurement considerations
8.3 TotalStorage Productivity Center data collection
8.3.1 Timestamps
8.3.2 Duration
8.3.3 Intervals
8.4 Key performance metrics
8.4.1 DS8000 key performance indicator thresholds
8.5 TotalStorage Productivity Center reporting options
8.5.1 Alerts
8.5.2 Predefined performance reports in TotalStorage Productivity Center
8.5.3 Ad hoc reports
8.5.4 Batch reports
8.5.5 TPCTOOL
8.5.6 Volume Planner
8.5.7 TPC Reporter for Disk
8.6 Monitoring performance of a SAN switch or director
8.6.1 SAN configuration examples
8.6.2 TotalStorage Productivity Center for Fabric alerts
8.6.3 TotalStorage Productivity Center for Fabric reporting
8.6.4 TotalStorage Productivity Center for Fabric metrics
8.7 End-to-end analysis of I/O performance problems
8.7.1 Performance analysis examples
8.8 TotalStorage Productivity Center for Disk in mixed environment
Chapter 9. Host attachment
9.1 DS8000 host attachment
9.2 Attaching Open Systems hosts
9.2.1 Fibre Channel
9.2.2 SAN implementations
9.2.3 Multipathing
9.3 Attaching IBM System z and S/390 hosts
9.3.1 ESCON
9.3.2 FICON
9.3.3 FICON configuration and performance considerations
9.3.4 z/VM, z/VSE, and Linux on System z attachment
Chapter 10. Performance considerations with Windows Servers
10.1 General Windows performance tuning
10.2 I/O architecture overview
10.3 Windows Server 2008 I/O Manager enhancements
10.4 Filesystem
10.4.1 Windows filesystem overview
10.4.2 NTFS guidelines
10.5 Volume management
10.5.1 Microsoft Logical Disk Manager (LDM)
10.5.2 Microsoft LDM software RAID
10.5.3 Veritas Volume Manager (VxVM)
10.5.4 Determining volume layout
10.6 Multipathing and the port layer
10.6.1 SCSIport scalability issues
10.6.2 Storport scalability features
10.6.3 Subsystem Device Driver
10.6.4 Subsystem Device Driver Device Specific Module
10.6.5 Veritas Dynamic MultiPathing (DMP) for Windows
10.7 Host bus adapter (HBA) settings
10.8 I/O performance measurement
10.8.1 Key I/O performance metrics
10.8.2 Windows Performance console (perfmon)
10.8.3 Performance log configuration and data export
10.8.4 Collecting configuration data
10.8.5 Correlating performance and configuration data
10.8.6 Analyzing performance data
10.8.7 Windows Server Performance Analyzer
10.9 Task Manager
10.9.1 Starting Task Manager
10.10 I/O load testing
10.10.1 Types of tests
10.10.2 Iometer
Chapter 11. Performance considerations with UNIX servers
11.1 Planning and preparing UNIX servers for performance
11.1.1 UNIX disk I/O architecture
11.2 AIX disk I/O components
11.2.1 AIX Journaled File System (JFS) and Journaled File System 2 (JFS2)
11.2.2 Veritas File System (VxFS) for AIX
11.2.3 General Parallel FileSystem (GPFS)
11.2.4 IBM Logical Volume Manager (LVM)
11.2.5 Veritas Volume Manager (VxVM)
11.2.6 IBM Subsystem Device Driver (SDD) for AIX
11.2.7 MPIO with SDDPCM
11.2.8 Veritas Dynamic MultiPathing (DMP) for AIX
11.2.9 FC adapters
11.2.10 Virtual I/O Server (VIOS)
11.3 AIX performance monitoring tools
11.3.1 AIX vmstat
11.3.2 pstat
11.3.3 AIX iostat
11.3.4 lvmstat
11.3.5 topas
11.3.6 nmon
11.3.7 fcstat
11.3.8 filemon
11.4 Solaris disk I/O components
11.4.1 UFS
11.4.2 Veritas FileSystem (VxFS) for Solaris
11.4.3 SUN Solaris ZFS
11.4.4 Solaris Volume Manager (formerly Solstice DiskSuite)
11.4.5 Veritas Volume Manager (VxVM) for Solaris
11.4.6 IBM Subsystem Device Driver for Solaris
11.4.7 MPxIO
11.4.8 Veritas Dynamic MultiPathing (DMP) for Solaris
11.4.9 Array Support Library (ASL)
11.4.10 FC adapter
11.5 Solaris performance monitoring tools
11.5.1 fcachestat and directiostat
11.5.2 Solaris vmstat
11.5.3 Solaris iostat
11.5.4 vxstat
11.5.5 dtrace
11.6 HP-UX Disk I/O architecture
11.6.1 HP-UX High Performance File System (HFS)
11.6.2 HP-UX Journaled File System (JFS)
11.6.3 HP Logical Volume Manager (LVM)
11.6.4 Veritas Volume Manager (VxVM) for HP-UX
11.6.5 PV Links
11.6.6 Native multipathing in HP-UX
11.6.7 Subsystem Device Driver (SDD) for HP-UX
11.6.8 Veritas Dynamic MultiPathing (DMP) for HP-UX
11.6.9 Array Support Library (ASL) for HP-UX
11.6.10 FC adapter
11.7 HP-UX performance monitoring tools
11.7.1 HP-UX sar
11.7.2 vxstat
11.7.3 GlancePlus and HP Perfview/Measureware
11.8 SDD commands for AIX, HP-UX, and Solaris
11.8.1 HP-UX SDD commands
11.8.2 Sun Solaris SDD commands
11.9 Testing and verifying DS8000 Storage
11.9.1 Using the dd command to test sequential rank reads and writes
11.9.2 Verifying your system
Chapter 12. Performance considerations with VMware
12.1 Disk I/O architecture overview
12.2 Multipathing considerations
12.3 Performance monitoring tools
12.3.1 Virtual Center Performance Statistics
12.3.2 Performance monitoring with esxtop
12.3.3 Guest-based performance monitoring
12.4 VMware specific tuning for maximum performance
12.4.1 Workload spreading
12.4.2 Virtual Machines sharing the same LUN
12.4.3 ESX filesystem considerations
12.4.4 Aligning partitions
12.5 Tuning of Virtual Machines
Chapter 13. Performance considerations with Linux
13.1 Supported platforms and distributions
13.2 Linux disk I/O architecture
13.2.1 I/O subsystem architecture
13.2.2 Cache and locality of reference
13.2.3 Block layer
13.2.4 I/O device driver
13.3 Specific configuration for storage performance
13.3.1 Host bus adapter for Linux
13.3.2 Multipathing in Linux
13.3.3 Software RAID functions
13.3.4 Logical Volume Manager
13.3.5 Tuning the disk I/O scheduler
13.3.6 Filesystems
13.4 Linux performance monitoring tools
13.4.1 Disk I/O performance indicators
13.4.2 Finding disk bottlenecks
Chapter 14. IBM System Storage SAN Volume Controller attachment
14.1 IBM System Storage SAN Volume Controller
14.1.1 SAN Volume Controller concepts
14.1.2 SAN Volume Controller multipathing
14.1.3 SVC Advanced Copy Services
14.2 SAN Volume Controller performance considerations
14.3 DS8000 performance considerations with SVC
14.3.1 DS8000 array
14.3.2 DS8000 rank format
14.3.3 DS8000 extent pool implications
14.3.4 DS8000 volume considerations with SVC
14.3.5 Volume assignment to SAN Volume Controller
14.3.6 Managed Disk Group for DS8000 Volumes
14.4 Performance monitoring
14.4.1 Using TotalStorage Productivity Center for Disk to monitor the SVC
14.5 Sharing the DS8000 between a server and the SVC
14.5.1 Sharing the DS8000 between Open Systems servers and the SVC
14.5.2 Sharing the DS8000 between System i server and the SVC
14.5.3 Sharing the DS8000 between System z server and the SVC
14.6 Advanced functions for the DS8000
14.6.1 Cache-disabled VDisks
14.7 Configuration guidelines for optimizing performance
Chapter 15. System z servers
15.1 Overview
15.2 Parallel Access Volumes
15.2.1 Static PAV, Dynamic PAV, and HyperPAV
15.2.2 HyperPAV compared to dynamic PAV test
15.2.3 PAV and large volumes
15.3 Multiple Allegiance
15.4 How PAV and Multiple Allegiance work
15.4.1 Concurrent read operation
15.4.2 Concurrent write operation
15.5 I/O Priority Queuing
15.6 Logical volume sizes
15.6.1 Selecting the volume size
15.6.2 Larger volume compared to smaller volume performance
15.6.3 Planning the volume sizes of your configuration
15.7 FICON
15.7.1 Extended Distance FICON
15.7.2 High Performance FICON
15.7.3 MIDAW
15.8 z/OS planning and configuration guidelines
15.8.1 Channel configuration
15.8.2 Extent pool
15.8.3 Considerations for mixed workloads
15.9 DS8000 performance monitoring tools
15.10 RMF
15.10.1 I/O response time
15.10.2 I/O response time components
15.10.3 IOP/SAP
15.10.4 FICON host channel
15.10.5 FICON director
15.10.6 Processor complex
15.10.7 Cache and NVS
15.10.8 DS8000 FICON/Fibre port and host adapter
15.10.9 Extent pool and rank
15.11 RMF Magic for Windows
15.11.1 RMF Magic analysis process
15.11.2 Data collection step
15.11.3 RMF Magic reduce step
15.11.4 RMF Magic analyze step
15.11.5 Data presentation and reporting step
15.11.6 Hints and tips
Chapter 16. Databases
16.1 DB2 in a z/OS environment
16.1.1 Understanding your database workload
16.1.2 DB2 overview
16.1.3 DB2 storage objects
16.1.4 DB2 dataset types
16.2 DS8000 considerations for DB2
16.3 DB2 with DS8000 performance recommendations
16.3.1 Know where your data resides
16.3.2 Balance workload across DS8000 resources
16.3.3 Take advantage of VSAM data striping
16.3.4 Large volumes
16.3.5 Modified Indirect Data Address Words (MIDAWs)
16.3.6 Adaptive Multi-stream Prefetching (AMP)
16.3.7 DB2 burst write
16.3.8 Monitoring DS8000 performance
16.4 DS8000 DB2 UDB in an Open Systems environment
16.4.1 DB2 UDB storage concepts
16.5 DB2 UDB with DS8000 performance recommendations
16.5.1 Know where your data resides
16.5.2 Balance workload across DS8000 resources
16.5.3 Use DB2 to stripe across containers
16.5.4 Selecting DB2 logical sizes
16.5.5 Selecting the DS8000 logical disk sizes
16.5.6 Multipathing
16.6 IMS in a z/OS environment
16.6.1 IMS overview
16.6.2 IMS logging
16.7 DS8000 considerations for IMS
16.8 IMS with DS8000 performance recommendations
16.8.1 Know where your data resides
16.8.2 Balance workload across DS8000 resources
16.8.3 Large volumes
16.8.4 Monitoring DS8000 performance
Chapter 17. Copy Services performance
17.1 Copy Services introduction
17.2 FlashCopy
17.2.1 FlashCopy performance considerations
17.2.2 Performance planning for IBM FlashCopy SE
17.3 Metro Mirror
17.3.1 Metro Mirror configuration considerations
17.3.2 Metro Mirror performance considerations
17.3.3 Scalability
17.4 Global Copy
17.4.1 Global Copy configuration considerations
17.4.2 Global Copy performance consideration
17.4.3 Scalability
17.5 Global Mirror
17.5.1 Global Mirror performance considerations
17.5.2 Global Mirror Session parameters
17.5.3 Avoid unbalanced configurations
17.5.4 Growth within Global Mirror configurations
17.6 z/OS Global Mirror
17.6.1 z/OS Global Mirror control dataset placement
17.6.2 z/OS Global Mirror tuning parameters
17.6.3 z/OS Global Mirror enhanced multiple reader
17.6.4 zGM enhanced multiple reader performance improvement
17.6.5 XRC Performance Monitor
17.7 Metro/Global Mirror
17.7.1 Metro/Global Mirror performance
17.7.2 z/OS Metro/Global Mirror
17.7.3 z/OS Metro/Global Mirror performance
Appendix A. Logical configuration examples
A.1 Considering hardware resource availability for throughput
A.2 Resource isolation or sharing
Scenario 1: Spreading everything with no isolation
Scenario 2: Spreading data I/O with partial isolation
Scenario 3: Grouping unlike RAID types together in the extent pool
Scenario 4: Grouping like RAID types in the extent pool
Scenario 5: More isolation of RAID types in the extent pool
Scenario 6: Balancing mixed RAID type ranks and capacities
Appendix B. Windows server performance log collection
B.1 Windows Server 2003 log file configuration
Configuring logging of disk metrics Windows Server 2003
Saving counter log settings
Importing counter logs properties
Analyzing disk performance from collected data
Retrieving data from a counter log file
Exporting logged data on Windows Server 2003
B.2 Windows Server 2008 log file configuration
Windows Server 2008 Export
Appendix C. UNIX shell scripts
C.1 Introduction
C.2 vgmap
C.3 lvmap
C.4 vpath_iostat
C.5 ds_iostat
C.6 test_disk_speeds
C.7 lsvscsimap.ksh
C.8 mkvscsimap.ksh
Appendix D. Post-processing scripts
D.1 Introduction
D.2 Dependencies
D.2.1 Running the scripts
Appendix E. Benchmarking
E.1 Goals of benchmarking
E.2 Requirements for a benchmark
Define the benchmark architecture
Define the benchmark workload
Monitoring the performance
Define the benchmark time frame
E.3 Caution using benchmark results to design production
Related publications
IBM Redbooks publications
Other publications
Online resources
How to get IBM Redbooks publications
Help from IBM
556 561 562 563 565 566 571 572 572 575 576 576 576 578 580 584 587 588 588 589 590 594 597 598 602 607 608 608 609 623 624 624 625 625 626 626 627 629 629 630 630 630 631
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 633
Notices
This information was developed for products and services offered in the U.S.A. IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service. IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not give you any license to these patents. You can send license inquiries, in writing, to: IBM Director of Licensing, IBM Corporation, North Castle Drive, Armonk, NY 10504-1785 U.S.A. The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you. This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice. Any references in this information to non-IBM Web sites are provided for convenience only and do not in any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the materials for this IBM product and use of those Web sites is at your own risk. IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you. Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products. This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental. COPYRIGHT LICENSE: This information contains sample application programs in source language, which illustrate programming techniques on various operating platforms. 
You may copy, modify, and distribute these sample programs in any form without payment to IBM, for the purposes of developing, using, marketing or distributing application programs conforming to the application programming interface for the operating platform for which the sample programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these programs.
Trademarks
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. These and other IBM trademarked terms are marked on their first occurrence in this information with the appropriate symbol (® or ™), indicating US registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at http://www.ibm.com/legal/copytrade.shtml The following terms are trademarks of the International Business Machines Corporation in the United States, other countries, or both:
AIX 5L AIX alphaWorks CICS DB2 Universal Database DB2 DS4000 DS6000 DS8000 ECKD Enterprise Storage Server ESCON eServer FICON FlashCopy GDPS Geographically Dispersed Parallel Sysplex GPFS HACMP i5/OS IBM iSeries Iterations OMEGAMON OS/390 Parallel Sysplex POWER5 POWER5+ POWER6 PowerHA PowerPC PowerVM POWER pSeries Rational Redbooks Redbooks (logo) RS/6000 S/390 Sysplex Timer System i System p5 System p System Storage System x System z10 System z9 System z Tivoli Enterprise Console Tivoli TotalStorage xSeries z/Architecture z/OS z/VM z/VSE z9 zSeries
The following terms are trademarks of other companies: Acrobat, and Portable Document Format (PDF) are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States, other countries, or both. Disk Magic, IntelliMagic, and the IntelliMagic logo are trademarks of IntelliMagic BV in the United States, other countries, or both. ITIL is a registered trademark, and a registered community trademark of the Office of Government Commerce, and is registered in the U.S. Patent and Trademark Office. Novell, SUSE, the Novell logo, and the N logo are registered trademarks of Novell, Inc. in the United States and other countries. Oracle, JD Edwards, PeopleSoft, Siebel, and TopLink are registered trademarks of Oracle Corporation and/or its affiliates. QLogic, and the QLogic logo are registered trademarks of QLogic Corporation. SANblade is a registered trademark in the United States. SAP R/3, SAP, and SAP logos are trademarks or registered trademarks of SAP AG in Germany and in several other countries. VMotion, VMware, the VMware "boxes" logo and design are registered trademarks or trademarks of VMware, Inc. in the United States and/or other jurisdictions. J2EE, Java, JNI, S24, Solaris, Solstice, Sun, ZFS, and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both. Excel, Internet Explorer, Microsoft, MS-DOS, MS, PowerPoint, SQL Server, Visual Basic, Windows NT, xiv
Windows Server, Windows, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both. Intel, Intel logo, Intel Inside logo, and Intel Centrino logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States, other countries, or both. UNIX is a registered trademark of The Open Group in the United States and other countries. Linux is a trademark of Linus Torvalds in the United States, other countries, or both. Other company, product, or service names may be trademarks or service marks of others.
Preface
This IBM Redbooks publication provides guidance about how to configure, monitor, and manage your IBM System Storage DS8000 to achieve optimum performance. It describes the DS8000 performance features and characteristics and how they can be exploited with the various server platforms that attach to the DS8000. Then, in separate chapters, we detail specific performance recommendations and discussions that apply for each server environment, as well as for database and DS8000 Copy Services environments. We also outline the various tools available for monitoring and measuring I/O performance for different server environments, as well as describe how to monitor the performance of the entire DS8000 subsystem.
Rajesh Jeyapaul is an AIX Development Support Specialist in IBM India. He has nine years of experience in AIX, specializing in investigating the performance impact of processes running in AIX. Currently, he is leading a technical team responsible for providing Development support to various AIX components. He holds a Masters Degree in Software Systems from the University of BITS, India, and an MBA from University of MKU, India. His areas of expertise include System p, AIX, and High-Availability Cluster Multi-Processing (HACMP). Peter Kimmel is an IT Specialist and the ATS team lead of the Enterprise Disk Performance team at the European Storage Competence Center in Mainz, Germany. He joined IBM Storage in 1999 and since then worked with SSA, VSS, the various ESS generations, and DS8000/DS6000. He has been involved in all Early Shipment Programs (ESPs), early installs for the Copy Services rollouts, and has co-authored several DS8000 IBM Redbooks publications so far. Peter holds a Diploma (MSc) degree in Physics from the University of Kaiserslautern. Chuck Laing is a Senior IT Architect and Master Certified IT Specialist with The Open Group. He is also an IBM Certified IT Specialist, specializing in IBM Enterprise Class and Midrange Disk Storage Systems/Configuration Management in the Americas ITD. He has co-authored eight previous IBM Redbooks publications about the IBM TotalStorage Enterprise Storage Server and the DS8000/6000. He holds a degree in Computer Science. He has worked at IBM for over ten years. Before joining IBM, Chuck was a hardware CE on UNIX systems for ten years and taught Computer Science at Midland College for six and a half years in Midland, Texas. Anderson Ferreira Nobre is a Certified IT Specialist and Certified Advanced Technical Expert - IBM System p5 in Strategic Outsourcing in Hortolndia (Brazil). He has 10 years of experience with UNIX (mainly with AIX). He was assigned to the UNIX team in 2005 to plan, manage, and support the UNIX, SAN, and Storage environments for IBM Outsourcing clients. Rene Oehme is an IBM Certified Specialist for High-End Disk Solutions, working for the Germany and CEMAAS Hardware Support Center in Mainz, Germany. Rene has more than six years of experience in IBM hardware support, including Storage Subsystems, SAN, and Tape Solutions, as well as System p and System z. Currently, he provides support for clients and service representatives with High End Disk Subsystems, such as the DS8000, DS6000, and ESS. His main focus is Open Systems attachment of High-End Disk Subsystems, including AIX, Windows, Linux, and VMware. He holds a degree in Information Technology from the University of Cooperative Education (BA) Stuttgart. Gero Schmidt is an IT Specialist in the IBM ATS technical sales support organization in Germany. He joined IBM in 2001 working at the European Storage Competence Center (ESCC) in Mainz, providing technical support for a broad range of IBM storage products (SSA, ESS, DS4000, DS6000, and DS8000) in Open Systems environments with a primary focus on storage subsystem performance. During his seven years of experience with IBM storage products, he participated in various beta test programs for ESS 800 and especially in the product rollout and beta test program for the DS6000/DS8000 series. He holds a degree in Physics (Dipl.-Phys.) from the Technical University of Braunschweig, Germany. Paulus Usong started his IBM career in Indonesia decades ago. He rejoined IBM at the Santa Teresa Lab (now known as the Silicon Valley Lab). 
In 1995, he joined the Advanced Technical Support group in San Jose. Currently, he is a Certified Consulting I/T Specialist and his main responsibilities are handling mainframe DASD performance critical situations
and performing Disk Magic study and remote copy sizing for clients who want to implement the IBM solution for their disaster recovery system.
The team: Rene, Bert, John, Brett, Gero, Anderson, Jean, Paulus, and Peter
Special thanks
For hosting this residency at the ESCC in Mainz, Germany, we want to thank: Rainer Zielonka - Director ESCC Dr. Friedrich Gerken - Manager Services and Technical Sales Support Rainer Erkens - Manager ESCC Service & Support Management Bernd Mller - Manager Enterprise Disk High-End Solutions Europe, for dedicating so many resources to this residency Stephan Weyrich - Opportunity Manager ESCC Workshops We especially want to thank Lee La Frese (IBM, Tucson) for being our special advisor and development contact for this book. Many thanks to those people in IBM in Mainz, Germany, who helped us with access to equipment as well as technical information and review: Uwe Heinrich Mueller, Uwe Schweikhard, Guenter Schmitt, Joerg Zahn, Werner Deul, Mike Schneider, Markus Oscheka, Hartmut Bohnacker, Gerhard Pieper, Alexander Warmuth, Kai Jehnen, Frank Krueger, and Werner Bauer Special thanks to: John Bynum DS8000 World Wide Technical Support Marketing Lead
Thanks to the following people for their contributions to this project: Mary Anne Bromley Garry Bennet Jay Kurtz Rosemary McCutchen Brian J. Smith Sonny Williams IBM US Nick Clayton Patrick Keyes Andy Wharton Barry Whyte IBM UK Brian Sherman IBM Canada
Comments welcome
Your comments are important to us. We want our IBM Redbooks publications to be as helpful as possible. Send us your comments about this or other IBM Redbooks publications in one of the following ways: Use the online Contact us review IBM Redbooks publication form found at: ibm.com/redbooks Send your comments in an e-mail to: [email protected] Mail your comments to: IBM Corporation, International Technical Support Organization Dept. HYTD Mail Station P099 2455 South Road Poughkeepsie, NY 12601-5400
Chapter 1. DS8000 characteristics
This chapter contains a high level discussion and introduction to the storage server performance challenge. Then, we provide an overview of the DS8000 model characteristics that allow the DS8000 to meet this performance challenge.
components. Just remember, only use general rules when there is no information available to make a more informed decision.
[Table fragment: DS8000 model characteristics — rows include expansion frames (minimum - maximum), expansion frame model, host adapters (minimum - maximum), ports per Fibre Channel Protocol (FCP)/Fibre Channel connection (FICON) host adapter, and ports per Enterprise Systems Connection (ESCON) host adapter. For the DS8100 Turbo model 931: disk drive modules (DDMs), minimum - maximum, 16 - 384.]
Internal fabric
The DS8000 comes with a high bandwidth, fault tolerant internal interconnection, which is also used in the IBM System p servers. It is called RIO-2 (Remote I/O) and can operate at speeds up to 1 GHz and offers a 2 GB/s sustained bandwidth per link.
Disk drives
The DS8000 offers a selection of industry standard Fibre Channel (FC) disk drives. There are 15k rpm FC drives available with 146 GB, 300 GB, or 450 GB capacity. The 500 GB Fibre Channel Advanced Technology Attachment (FATA) drives (7200 rpm) allow the system to scale up to 512 TB of capacity.
Host adapters
The DS8000 offers enhanced connectivity with the availability of four-port Fibre Channel/FICON host adapters. The 4 Gb/s Fibre Channel/FICON host adapters, which are offered in longwave and shortwave, can also auto-negotiate to 2 Gb/s or 1 Gb/s link speeds. This flexibility enables immediate exploitation of the benefits offered by the higher performance, 4 Gb/s storage area network (SAN)-based solutions, while also maintaining compatibility with existing 2 Gb/s infrastructures. In addition, the four ports on the adapter can be configured with an intermix of Fibre Channel Protocol (FCP) and FICON, which can help protect your investment in Fibre adapters, and increase your ability to migrate to new servers. The DS8000 also offers two-port ESCON adapters. A DS8000 can support up to a maximum of 32 host adapters, which provide up to 128 Fibre Channel/FICON ports.
With all these new components, the DS8000 is positioned at the top of the high performance category. As previously mentioned in this chapter, the following components contribute to the high performance of the DS8000: Redundant Array of Independent Disks (RAID), array across loops (AAL), POWER5+ processors, RIO, and the FC-AL implementation with a truly switched FC back end. In addition to these, there are even more contributions to performance as illustrated in the following sections.
bandwidth of a single disk drive or a single ESCON channel, but FICON, working together with other DS8000 series functions, provides a high-speed pipe supporting a multiplexed operation.

High Performance FICON for z (zHPF) takes advantage of the hardware available today, with enhancements that are designed to reduce the overhead associated with supported commands, which can improve FICON I/O throughput on a single DS8000 port by 100%. Enhancements have been made to the z/Architecture and the FICON interface architecture to deliver improvements for online transaction processing (OLTP) workloads. When exploited by the FICON channel, the z/OS operating system, and the control unit, zHPF is designed to help reduce overhead and improve performance.

Parallel Access Volume (PAV) enables a single System z server to simultaneously process multiple I/O operations to the same logical volume, which can help to significantly reduce device queue delays. This function is achieved by defining multiple addresses per volume. With Dynamic PAV, the assignment of addresses to volumes can be managed automatically to help the workload meet its performance objectives and reduce overall queuing. PAV is an optional feature on the DS8000 series.

HyperPAV allows an alias address to be used to access any base on the same control unit image per I/O base. This capability also allows different HyperPAV hosts to use one alias to access different bases, which reduces the number of alias addresses required to support a set of bases in a System z environment with no latency in targeting an alias to a base. This functionality is also designed to enable applications to achieve equal or better performance than possible with the original PAV feature alone while also using the same or fewer z/OS resources.

Multiple Allegiance expands the simultaneous logical volume access capability across multiple System z servers. This function, along with PAV, enables the DS8000 series to process more I/Os in parallel, helping to improve performance and enabling greater use of large volumes.

I/O priority queuing allows the DS8000 series to use I/O priority information provided by the z/OS Workload Manager to manage the processing sequence of I/O operations.
Chapter 2. Hardware configuration
In this chapter, we look at the DS8000 hardware configuration, specifically as it pertains to the performance of the device. Understanding the hardware components, including the functions performed by each component and the technology that they use, will help you select the components to order and the quantities of each component. However, do not focus too much on any one hardware component. Instead, make sure to have a good balance of components that work together effectively. The ultimate criterion of whether a storage server is performing well is its total throughput. We look at the major DS8000 hardware components:
- Storage unit, processor complex, and storage logical partitions (LPARs)
- Cache
- RIO-G interconnect
- Disk subsystem and device adapters
- Host adapters
Storage unit
A storage unit consists of a single DS8000 (including expansion frames). A storage unit can consist of several frames: one base frame and up to four expansion frames. The storage unit ID is the DS8000 base frame serial number, ending in 0 (for example, 75-06570).
Processor complex
A DS8000 processor complex is one POWER5+ p570 copper-based symmetric multiprocessor (SMP) system unit. On the DS8100 Turbo Model 931, each processor complex is a 2-way server running at 2.2 GHz. On the DS8300 Turbo Models 932 and 9B2, each processor complex is a 4-way server running at 2.2 GHz. On all DS8000 models, there are two processor complexes (servers), which are housed in the base frame. These processor complexes form a redundant pair so that if either processor complex fails, the surviving processor complex continues to run the workload.
[Figure: Storage Facility Images on the DS8300 LPAR model — processor LPARs on processor complex 0 and processor complex 1 form server 0 and server 1 of a Storage Facility Image]
Each of the two Storage Facility Images has parts of the following DS8000 resources dedicated to its use:
- Processors
- Cache and persistent memory
- I/O enclosures
- Disk enclosures
Note: Licensed Machine Code (LMC) level 5.4.0xx.xx or later supports variable SFIs (or storage LPARs) on DS8000 Models 9B2 and 9A2. You can configure the two storage LPARs (or SFIs) for a 50/50 or 25/75% ratio.
The two SFIs can actually have different amounts of disk drives and host adapters available.
Read operations
When a host sends a read request to the DS8000:
- A cache hit occurs if the requested data resides in the cache. In this case, the I/O operation will not disconnect from the channel/bus until the read is complete. A read hit provides the highest performance.
- A cache miss occurs if the data is not in the cache. The I/O operation is logically disconnected from the host, allowing other I/Os to take place over the same interface, and a stage operation from the disk subsystem takes place.
technique does not apply for the RAID 10 arrays, because there is no parity generation required and therefore no penalty involved when writing to RAID 10 arrays. It is possible that the DS8000 cannot copy write data to the persistent cache because it is full, which can occur if all data in the persistent cache is still waiting for destage to disk. In this case, instead of a fast write hit, the DS8000 sends a command to the host to retry the write operation. Having full persistent cache is obviously not a good situation, because it delays all write operations. On the DS8000, the amount of persistent cache is sized according to the total amount of system memory and is designed so that there is a low probability of full persistent cache occurring in normal processing.
Cache management
The DS8000 system offers superior caching algorithms, the Sequential Prefetching in Adaptive Replacement Cache (SARC) algorithm and the Adaptive Multi-stream Prefetching (AMP) algorithm, which were developed by IBM Storage Development in partnership with IBM Research. We explain these technologies next.
Demand paging means that eight disk blocks (a 4K cache page) are brought in only on a
cache miss. Demand paging is always active for all volumes and ensures that I/O patterns with some locality find at least some recently used data in the cache.
Prefetching means that data is copied into the cache even before it is requested. To prefetch,
a prediction of likely future data accesses is required. Because effective, sophisticated prediction schemes need extensive history of the page accesses, the algorithm uses prefetching only for sequential workloads. Sequential access patterns are commonly found in video-on-demand, database scans, copy, backup, and recovery. The goal of sequential prefetching is to detect sequential access and effectively preload the cache with data in order to minimize cache misses. For prefetching, the cache management uses tracks. A track is a set of 128 disk blocks (16 cache pages). To detect a sequential access pattern, counters are maintained with every track to record if a track has been accessed together with its predecessor. Sequential prefetching becomes active only when these counters suggest a sequential access pattern. In this manner, the DS8000 monitors application read patterns and dynamically determines whether it is optimal to stage into cache:
- Just the page requested
- The page requested plus remaining data on the disk track
- An entire disk track or multiple disk tracks that have not yet been requested
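To make the track-counter mechanism more concrete, here is a minimal sketch in Python. It is only an illustrative model, not the DS8000 microcode: tracks are plain integers, the counter and threshold names are ours, and a short run of sequentially accessed tracks is taken as the trigger to stage data ahead of demand.

```python
# Illustrative model of sequential detection for prefetching (not DS8000 microcode).
# A read counts as "sequential" if the predecessor track was read before it;
# once a run of sequential reads is seen, the next tracks are staged ahead of demand.

class SequentialDetector:
    def __init__(self, threshold=2):
        self.seq_counter = {}        # per-track count of consecutive sequential accesses
        self.recently_read = set()   # tracks already seen
        self.threshold = threshold

    def on_read(self, track):
        if (track - 1) in self.recently_read:
            self.seq_counter[track] = self.seq_counter.get(track - 1, 0) + 1
        else:
            self.seq_counter[track] = 0
        self.recently_read.add(track)
        if self.seq_counter[track] >= self.threshold:
            return [track + 1, track + 2]    # candidate tracks to prefetch
        return []                            # random access: stage only on demand

detector = SequentialDetector()
for t in [10, 11, 12, 13, 40]:
    print(t, "prefetch:", detector.on_read(t))
```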
The decision of when and what to prefetch is essentially made on a per-application basis (rather than a system-wide basis) to be responsive to the different data reference patterns of various applications that can be running concurrently. To decide which pages are flushed when the cache is full, sequential and random (non-sequential) data is separated into different lists as illustrated in Figure 2-2.
[Figure 2-2: Cache lists of the SARC algorithm — a RANDOM list and a SEQ list, each managed from a Most Recently Used (MRU) head to a Least Recently Used (LRU) tail]
In Figure 2-2, a page, which has been brought into the cache by simple demand paging, is added to the Most Recently Used (MRU) head of the RANDOM list. With no further references to that page, it moves down to the Least Recently Used (LRU) bottom of the list. A page, which has been brought into the cache by a sequential access or by sequential prefetching, is added to the MRU head of the sequential (SEQ) list and then moves down in that list as more sequential reads are done. Additional rules control the management of pages between the lists in order to not keep the same pages in memory twice. To follow workload changes, the algorithm trades cache space between the RANDOM and SEQ lists dynamically. Trading cache space allows the algorithm to prevent one-time sequential requests from filling the entire cache with blocks of data that have a low probability of being read again. The algorithm maintains a desired size parameter for the SEQ list. The desired size is continually adapted in response to the workload. Specifically, if the bottom portion of the SEQ list is found to be more valuable than the bottom portion of the RANDOM list, the desired size of the SEQ list is increased; otherwise, the desired size is decreased. The constant adaptation strives to make optimal use of limited cache space and delivers greater throughput and faster response times for a given cache size.
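The list handling described above can be sketched as a toy model. The names (for example, desired_seq) are ours, and the real adaptation of the desired SEQ size is far more elaborate, but the sketch shows the basic behavior: pages enter at the MRU end of the RANDOM or SEQ list, and when the cache is full, the LRU page is evicted from whichever list exceeds its desired share.

```python
from collections import OrderedDict

class TwoListCache:
    """Toy model of the RANDOM/SEQ list split used by SARC (illustrative only)."""

    def __init__(self, capacity, desired_seq):
        self.capacity = capacity          # total pages the cache can hold
        self.desired_seq = desired_seq    # continually adapted in the real algorithm
        self.random = OrderedDict()       # LRU at the front, MRU at the end
        self.seq = OrderedDict()

    def _evict_if_full(self):
        while len(self.random) + len(self.seq) > self.capacity:
            # Evict from SEQ if it is above its desired size, else from RANDOM.
            victim = self.seq if len(self.seq) > self.desired_seq else self.random
            victim.popitem(last=False)    # drop the LRU page of that list

    def access(self, page, sequential=False):
        self.random.pop(page, None)       # a re-referenced page moves back to MRU
        self.seq.pop(page, None)
        (self.seq if sequential else self.random)[page] = True
        self._evict_if_full()

cache = TwoListCache(capacity=4, desired_seq=2)
for p in (1, 2, 3):
    cache.access(p, sequential=True)      # a one-time sequential burst
cache.access(100)                         # random pages
cache.access(101)
print("RANDOM:", list(cache.random), "SEQ:", list(cache.seq))
```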
SARC Performance
IBM performed a simulation comparing cache management with and without the SARC algorithm. The new algorithm, with no change in hardware, provided:
- Effective cache space: 33% greater
- Cache miss rate: 11% reduced
- Peak throughput: 12.5% increased
- Response time: 50% reduced
Figure 2-3 on page 15 shows the improvement in response time due to SARC.
SARC and AMP play complementary roles. While SARC carefully divides the cache between the RANDOM and the SEQ lists to maximize the overall hit ratio, AMP manages the contents of the SEQ list to maximize the throughput obtained for the sequential workloads. While SARC impacts cases that involve both random and sequential workloads, AMP helps any workload that has a sequential read component, including pure sequential read workloads.
Finally, we have the disks themselves. The disks are commonly referred to as disk drive modules (DDMs).
[Figure: Switched FC-AL disk enclosures — front and rear enclosures attach through FC switches that provide 4 FC-AL ports to the device adapter pair]
These switches use FC-AL protocol and attach FC-AL drives through a point-to-point connection. The arbitration message of a drive is captured in the switch and processed and propagated back to the drive without routing it through all the other drives in the loop. Performance is enhanced, because both DAs connect to the switched Fibre Channel disk subsystem. Note that each DA port can concurrently send and receive data.
DDMs
The DS8000 provides a choice of several DDM types:
- 146 GB, 15K rpm FC disk
- 300 GB, 15K rpm FC disk
- 450 GB, 15K rpm FC disk
- 1000 GB, 7200 rpm Serial Advanced Technology Attachment (SATA) disk
For existing installations:
- 73 GB, 15K rpm FC disk
- 500 GB, 7200 rpm FC Advanced Technology Attachment (FATA) disk
These disks provide a range of options to meet the capacity and performance requirements of various workloads.
A third aspect regarding the difference between these drive types is the RAID rebuild time after a drive failure. Because this rebuild time grows with larger capacity drives, RAID 6 can be advantageous for the large-capacity SATA and FATA drives to prevent a failing second disk, which causes a loss of data during the rebuild of a first failed disk. We explain more detail about RAID 6 in 4.1.2, RAID 6 overview on page 43.
DS8100
On the DS8100 Turbo, there can be up to twelve disk enclosure pairs spread across two frames. These twelve disk enclosure pairs are supported on four DA pairs, numbered 0-3. Note that these DA pair numbers are just assigned and do not indicate any order of installation, which is illustrated in Figure 2-5.
[Figure 2-5: DS8100 DA pair installation order — up to four device adapter pairs, installed in the order 2, 0, 3, 1; DA pairs 2 and 0 attach up to 128 DDMs each, and DA pairs 3 and 1 attach up to 64 DDMs each]
In Figure 2-5, DA pair 2 attaches to the first two disk enclosure pairs, which contain a total of 64 DDMs. DA pair 0 attaches to the next two disk enclosure pairs, which again contain a total of 64 DDMs. This method continues in a like manner for DA pairs 3 and 1 (in that order). At this point, all four DA pairs are in use, and there are 256 DDMs installed. DA pair 2 (which already has two disk enclosure pairs attached) is now used to attach the next two disk enclosure pairs. Then, DA pair 0 (which also has two disk enclosure pairs already attached) is used to attach two more disk enclosure pairs. At this point, the DS8100 holds its maximum configuration of 384 DDMs. DA pairs 2 and 0 each have 128 DDMs attached. DA pairs 3 and 1 each have 64 DDMs attached. Note: If you have more than 256 DDMs installed, several DA pairs will have more than 64 DDMs. For large configurations, modeling your configuration is even more important to ensure that your DS8100 has sufficient DAs and other resources to handle the workload.
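The installation order can be turned into a small calculation. The helper below is our own sketch, not an IBM planning tool: each installation step attaches two disk enclosure pairs (64 DDMs) to one DA pair, following the DS8100 order 2, 0, 3, 1 and then 2, 0 again.

```python
# Sketch: how installed DDMs are distributed over DA pairs on a DS8100,
# following the installation order described above (64 DDMs per step).

DS8100_STEPS = [2, 0, 3, 1, 2, 0]        # DA pair that receives each 64-DDM increment

def ddms_per_da_pair(total_ddms, steps=DS8100_STEPS, ddms_per_step=64):
    counts = {}
    remaining = total_ddms
    for da_pair in steps:
        if remaining <= 0:
            break
        increment = min(ddms_per_step, remaining)
        counts[da_pair] = counts.get(da_pair, 0) + increment
        remaining -= increment
    return counts

print(ddms_per_da_pair(256))    # {2: 64, 0: 64, 3: 64, 1: 64}
print(ddms_per_da_pair(384))    # {2: 128, 0: 128, 3: 64, 1: 64}
```

The same idea applies to the DS8300 models, where the DA pairs are filled in the order 2, 0, 6, 4, 7, 5, 3, 1 before the first DA pairs are used again.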
DS8300
The DS8300 Turbo models can connect to one, two, three, or four Expansion Frames, which provides the following configuration alternatives:
- With one Expansion Frame, the storage capacity and number of adapters of the DS8300 models can expand:
  - Up to 384 DDMs in total (as with the DS8100), for a maximum disk storage capacity of 172.8 TB when using 450 GB FC DDMs.
  - Up to 32 host adapters (HAs), which can be an intermix of Fibre Channel/FICON (four-port) adapters and ESCON (two-port) adapters.
- With two Expansion Frames, the disk capacity of the DS8300 models expands up to 640 DDMs in total, for a maximum disk storage capacity of 288 TB when using 450 GB FC DDMs.
- With three Expansion Frames, the disk capacity of the DS8300 models expands up to 896 DDMs in total, for a maximum disk storage capacity of 403.2 TB when using 450 GB FC DDMs.
- With four Expansion Frames, the disk capacity of the DS8300 models expands up to 1024 DDMs in total, for a maximum disk storage capacity of 460.8 TB when using 450 GB FC DDMs (512 TB with the 500 GB FATA DDMs).
There are no additional DAs installed for the second, third, and fourth Expansion Frames. Installing all possible 1024 DDMs results in their even distribution over all the DA pairs; refer to Figure 2-6.
[Figure 2-6: DS8300 DA pair installation order. For an LPAR model, DA pairs 2, 6, 7, and 3 are dedicated to SFI 1, and DA pairs 0, 4, 5, and 1 are dedicated to SFI 2.]
DA pair 2 attaches to the first two disk enclosure pairs, which contain a total of 64 DDMs. DA pair 0 attaches to the next two disk enclosure pairs, which again contain a total of 64 DDMs. This method continues in a like manner for DA pairs 6, 4, 7, 5, 3, and 1 (in that order). At this point, all eight DA pairs are in use, and there are 512 DDMs installed.
DA pair 2 (which already has two disk enclosure pairs attached) is now used to attach the next two disk enclosure pairs. Then, DA pair 0 (which also has two disk enclosure pairs already attached) is used to attach two more disk enclosure pairs. At this point, the DS8300 holds a configuration of 640 DDMs. DA pairs 2 and 0 have 128 DDMs attached each. DA pairs 6, 4, 7, 5, 3, and 1 have 64 DDMs attached each. The installation sequence for the third and fourth Expansion Frames mirrors the installation sequence of the first and second Expansion Frames with the exception of the last 128 DDMs in the fourth Expansion Frame. Note: If you have more than 512 DDMs installed, several DA pairs will have more than 64 DDMs. For large configurations, modeling your configuration is even more important to ensure that your DS8000 has sufficient DAs and other resources to handle the workload.
DS8300 LPAR
The rules for adding DDMs on the DS8300 LPAR model are the same as the DS8300 non-LPAR model. Disks can be added to one storage image without regard to the number of disks in the other storage image. Each storage image within the DS8300 LPAR model can have up to half the disk hardware components of the DS8300. DA pairs 2, 6, 7, and 3 (and associated disk enclosures) are dedicated to storage image one and DA pairs 0, 4, 5, and 1 (and associated disk enclosures) are dedicated to storage image two. Each storage image can have no more than ten disk enclosures and no more than 512 DDMs. As is the case with the non-LPAR models, DA pairs 2 and 0 can have 128 DDMs each. All other DA pairs have 64 DDMs each.
Figure 2-7 Sequential throughput increasing when using additional DA pairs (922 model)
[Figure: DS8000 host adapter layout — Fibre Channel protocol engines and protocol chipset, a 1 GHz PPC 750GX processor, data mover, buffer, and QDR memory]
These adapters are designed to hold four Fibre Channel ports, which can be configured to support either FCP or FICON. They are also enhanced in their configuration flexibility and provide more logical paths, from 256 with an ESS FICON port to 2048 per FICON port on the DS8000 series. The front end with the 4 Gbps ports scales up to 128 ports for a DS8300, which results in a theoretical aggregated host I/O bandwidth of 128 times 4 Gbps and outperforms an ESS by a factor of eight. The DS8100 still provides four times more bandwidth at the front end than an ESS. However, note that the 4 Gbps and the 2 Gbps HBAs essentially have the same architecture, with the exception of the protocol engine chipset. Hence, while the throughput for an individual port doubles between the 2 Gb HBA and the 4 Gb HBA, the aggregated throughput for the overall HBA when using all ports practically stays the same. For high performance configurations requiring the highest sequential throughputs, we recommend that you actively use two of the four ports of a 2 Gb HBA only, and when using the 4 Gb HBA, use one port in that case. The remaining ports can serve for pure attachment purposes.
Because host connections frequently go through various external connections between the server and the DS8000, an availability-oriented approach is to have enough host connections for each server so that if half of the connections fail, processing can continue at the same level as before the failure. This approach requires that each connection carry only half the data traffic that it otherwise might carry. These multiple lightly loaded connections also help to minimize the instances when spikes in activity might cause bottlenecks at the host adapter or port. A multiple-path environment requires at least two connections. Four connections are typical, and eight connections are not unusual.
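That sizing rule (keep each connection at no more than half of its usable throughput so that the surviving half of the paths can absorb the full load) can be expressed as a small calculation. The per-path throughput below is a placeholder value of our own; substitute the effective figure you have measured or modeled for your adapters and SAN.

```python
import math

def paths_needed(peak_workload_mbps, per_path_mbps, headroom=0.5):
    """Connections required so that half of them can fail and the remainder
    still carries the peak load (each path at most `headroom` loaded normally)."""
    usable_per_path = per_path_mbps * headroom
    return max(math.ceil(peak_workload_mbps / usable_per_path), 2)

# Example with assumed numbers: 700 MB/s peak host load,
# 360 MB/s effective throughput per path.
print(paths_needed(700, 360))    # -> 4 connections
```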
We also offer the following general guidelines:
- If you run on a non-LPAR DS8300 Turbo (which has two RIO-G loops), spread the host adapters across the RIO-G loops.
- For Fibre Channel and FICON paths with high or moderate utilization, use only two or three ports on each host adapter, which might increase the number of host adapters required.
- Spread multiple paths from a single host as widely as possible across host adapters, I/O enclosures, and RIO-G loops to maximize performance and minimize the points where a failure causes outages on multiple paths.
Chapter 3. Understanding your workload
Table 3-1 Workload types Workload type Sequential read Characteristics Large record reads - QSAM half track - Open 64 KB blocks Large files from disk Large record writes Large files to disk Random 4 KB record R/W ratio 3.4 Read hit ratio 84% Random 4 KB record R/W ratio 3.0 Read hit ratio 78% Random 4 KB record R/W ratio 5.0 Read hit ratio 92% Random 4 KB record R/W ratio 2.0 Read hit ratio 40% Random 4 KB record Read% = 67% Hit ratio 28% Random 4 KB record Read% = 70% Hit ratio 50% Random 4 KB record Read% = 100% Hit ratio 0% Representative of Database backups Large queries Batch Reports Database restores and loads Batch Average database CICS/VSAM IMS Representative of typical database conditions Interactive Existing software DB2 logging
Open read-intensive
Very large DB DB2 OLTP filesystem Decision support Warehousing Large DB inquiry
Open standard
Open read-intensive
2. Need for high throughput and a mix of R/W, similar to category 1 (large transfer sizes). In addition to 100% read operations, this situation has a mixture of reads and writes in the 70/30 and 50/50 ratios. Here, the DBMS is typically sequential, but random and 100% writes operation also exist. 3. Need for high I/O rate and throughput. This category requires both performance characteristics of IOPS and MBps. Depending upon the application, the profile is typically sequential access, medium to large transfer sizes (16 KB, 32 KB, and 64 KB), and 100/0, 0/100, and 50/50 R/W ratios. 4. Need for high I/O rate. With many users and applications running simultaneously, this category can consist of a combination of small to medium-sized transfers (4 KB, 8 KB, 16 KB, and 32 KB), 50/50 and 70/30 R/W ratios, and a random DBMS. Note: Certain applications have synchronous activities, such as locking database tables during an online backup, or logging activities. These types of applications are highly sensitive to any increase in disk response time and must be handled with extreme care. Table 3-2 summarizes these workload categories and common applications that can be found at any installation.
Table 3-2 Application workload types

Category | Application | Read/write ratio | I/O size | Access type
4 | General file serving | All simultaneously | 4 KB - 32 KB | Random and sequential
4 | Online transaction processing | 50/50, 70/30 | 4 KB, 8 KB | Random
4 | Batch update | 50/50 | 16 KB, 32 KB | Random and sequential
1 | Data mining | 100/0 | 32 KB, 64 KB, or larger | Mainly sequential, some random
1 | Video on demand | 100/0 | 64 KB or larger | Sequential
2 | Data warehousing | 100/0, 70/30, 50/50 | 64 KB or larger | Mainly sequential, random easier
2 | Engineering and scientific | 100/0, 0/100, 70/30, 50/50 | 64 KB or larger | Sequential
3 | Digital video editing | 100/0, 0/100, 50/50 | 32 KB, 64 KB | Sequential
3 | Image processing | 100/0, 0/100, 50/50 | 16 KB, 32 KB, 64 KB | Sequential
1 | Backup | 100/0 | 64 KB or larger | Sequential
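Because the workload categories are characterized both by I/O rate and by throughput, it helps to keep the relation MBps = IOPS x transfer size in mind when reading Table 3-2. The figures below are purely illustrative.

```python
def mbps(iops, transfer_size_kb):
    """Throughput implied by an I/O rate at a given transfer size."""
    return iops * transfer_size_kb / 1024.0

# The same I/O rate creates very different bandwidth demands
# depending on the transfer size of the workload.
print(mbps(6000, 4))     # ~23 MBps  - small-block, IOPS-driven workload
print(mbps(6000, 64))    # ~375 MBps - large-block, throughput-driven workload
```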
For general rules for application types, refer to Table 3-1 on page 31. Requirements for developing an application I/O profile include: User population Determining the user population requires understanding the total number of potential users, which for an online banking application might represent the total number of customers. From this total population, you need to derive the active population that represents the average number of persons using the application at any given time, which is usually derived from experiences with other similar applications. In Table 3-3, we use 1% of the total population. From the average population, we estimate the peak. The peak workload is some multiplier of the average and is typically derived based on experience with similar applications. In this example, we use a multiple of 3.
Table 3-3 User population

Total potential users | Average active users | Peak active users
50000 | 500 | 1500
Transaction distribution Table 3-4 breaks down the number of times that key application transactions are executed by the average user and how much I/O is generated per transaction. Detailed knowledge of the application and database are required in order to identify the number of I/Os and the type of I/Os per transaction. The following information is a sample.
Table 3-4 Transaction distribution

Transaction | Iterations per user | I/Os | I/O type
Look up savings account | 1 | 4 | Random read
Look up checking account | 1 | 4 | Random read
Transfer money to checking | .5 | 4 reads/4 writes | Random read/write
Configure new bill payee | .5 | 4 reads/4 writes | Random read/write
Submit payment | 1 | 4 writes | Random write
Look up payment history | 1 | 24 reads | Random read
Logical I/O profile An I/O profile is created by combining the user population and the transaction distribution. Table 3-5 provides an example of a logical I/O profile.
Table 3-5 Logical I/O profile from user population and transaction profiles

Transaction | Iterations per user | I/Os | I/O type | Average user I/Os | Peak user I/Os
Look up savings account | 1 | 4 | Random read I/Os (RR) | 2000 | 6000
Look up checking account | 1 | 4 | RR | 2000 | 6000
Transfer money to checking | .5 | 4 reads/4 writes | RR/Random write I/Os (RW) | 1000 R, 1000 W | 3000 R, 3000 W
Configure new bill payee | .5 | 4 reads/4 writes | RR/RW | 1000 R, 1000 W | 3000 R, 3000 W
Submit payment | 1 | 4 writes | RW | 2000 | 6000
Look up payment history | 1 | 24 reads | RR | 12000 | 36000
Physical I/O profile The physical I/O profile is based on the logical I/O with the assumption that the database will provide cache hits to 90% of the read I/Os. All write I/Os are assumed to require a physical I/O. This physical I/O profile results in a read miss ratio of (1-.9) = .1 or 10%. Table 3-6 is an example, and every application will have different characteristics.
Table 3-6 Physical I/O profile

Transaction | Average user logical I/Os | Average active users physical I/Os | Peak active users physical I/Os
Look up savings account | 2000 | 200 RR | 600 RR
Look up checking account | 2000 | 200 RR | 600 RR
Transfer money to checking | 1000, 1000 | 100 RR, 1000 RW | 300 RR, 3000 RW
Configure new bill payee | 1000, 1000 | 100 RR, 1000 RW | 300 RR, 3000 RW
Submit payment | 2000 | 200 RR | 600 RR
Look up payment history | 12000 | 1200 SR | 3600 RR
Totals | 20000 R, 2000 W | 2000 RR, 2000 RW | 6000 RR, 6000 RW
As you can see in Table 3-6, in order to meet the peak workloads, you need to design an I/O subsystem to support 6000 random reads/sec and 6000 random writes/sec:
- Physical I/Os: the number of physical I/Os per second from the host perspective
- RR: Random Read I/Os
- RW: Random Write I/Os
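The totals above follow directly from the stated assumptions, as the short sketch below shows. It uses the logical I/O totals of Table 3-6, the 90% database read cache hit ratio, and the peak factor of three from the user population example; it is an illustration only.

```python
# Reproducing the physical I/O totals of the example above.

db_read_hit = 0.90        # reads absorbed by the database buffer cache
peak_factor = 3           # peak workload is three times the average

logical_reads = 20000     # average active users, logical reads (Table 3-6)
logical_writes = 2000     # average active users, logical writes (Table 3-6)

phys_reads = logical_reads * (1 - db_read_hit)   # only read misses reach the disk
phys_writes = logical_writes                     # every write becomes a physical I/O

print("average:", phys_reads, phys_writes)                               # 2000.0 2000
print("peak:   ", phys_reads * peak_factor, phys_writes * peak_factor)   # 6000.0 6000
```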
To determine the appropriate configuration to support your unique workload, refer to Chapter 6, Performance management process on page 147.
System i environment
Here are the most popular tools on System i:
- Collection Services
- Disk Watcher
- Job Watcher
- iSeries Navigator Monitors
- IBM Performance management for System i
- Performance Tools for System i
Most of these comprehensive planning tools address the entire spectrum of workload performance on System i, including CPU, system memory, disks, and adapters. The main IBM System i performance data collector is called Collection Services. It is designed to run 24x7x365 and is documented in detail in the System i Information Center at:
http://publib.boulder.ibm.com/infocenter/systems/scope/i5os/index.jsp
Collection Services is a sample-based engine (usually 5 to 15 minute intervals) that looks at jobs, threads, CPU, disk, and communications. It also has a set of specific statistics for the DS8000. For the new systems that are IOP-less hardware (System i POWER6 with IOP-less Fibre Channel), the i kernel keeps track of service and wait time (and therefore response times (S+W)) in buckets per logical unit number (LUN).
Disk Watcher is a new function of IBM i5/OS that provides disk data to help identify the
source of disk-related performance problems on System i platform. It can either collect information about every I/O in trace mode or collect information in buckets in statistics mode. In statistics mode, it can run much more often than Collection Services to see more granular statistics. The command strings and file layouts are documented in the System i Information Center. The usage is covered in the following articles: A New Way to Look at Disk Performance: http://www.ibmsystemsmag.com/i5/may07/administrator/15631p1.aspx Analyzing Disk Watcher Data: http://www.ibmsystemsmag.com/i5/may08/tipstechniques/20662p1.aspx Disk Watcher gathers detailed information associated with I/O operations to disk units. Disk Watcher provides data beyond the data that is available in tools, such as Work with Disk Status (WRKDSKSTS), Work with System Status (WRKSYSSTS), and Work with System Activity (WKSYSACT). Disk Watcher, like other tools, provides data about disk I/O, paging rates, CPU use, and temporary storage use. But Disk Watcher goes further by simultaneously collecting the program, object, job, thread, and task information that is associated with disk I/O operations.
Job Watcher is an advanced tool for collecting and analyzing performance information as a
means to effectively manage your system or to analyze a performance issue. It is job-centric and thread-centric and can collect data at intervals of seconds. The collection contains vital information, such as job CPU and wait statistics, call stacks, SQL statements, objects waited on, sockets, TCP, and more. For more information about using Job Watcher, refer to Web Power - New browser-based Job Watcher tasks help manage your IBM i performance at: http://www.ibmsystemsmag.com/i5/november08/administrator/22431p1.aspx
System z environment
The z/OS systems have proven performance monitoring and management tools available to use for performance analysis. Resource Measurement Facility (RMF), a z/OS performance tool, collects performance data and reports it for the desired interval. It also provides cache
reports. The cache reports are similar to the disk-to-cache and cache-to-disk reports that are available in the TotalStorage Productivity Center for Disk, except that RMF's cache reports are provided in text format. RMF collects the performance statistics of the DS8000 that are related to the link or port and also to the rank and extent pool. The REPORTS(ESS) parameter in the RMF report generator produces the reports that are related to those resources. For more information, refer to Chapter 15, System z servers on page 441.
Chapter 4.
RAID 5 theory
The DS8000 series supports RAID 5 arrays. RAID 5 is a method of spreading volume data plus parity data across multiple disk drives. RAID 5 provides faster performance by striping data across a defined set of disk drive modules (DDMs). Data protection is provided by the generation of parity information for every stripe of data. If an array member fails, its contents can be regenerated by using the parity data.
RAID 6 theory
Starting with Licensed Machine Code 5.4.0.xx.xx, the DS8000 supports RAID 6 protection. RAID 6 presents an efficient method of data protection in case of double disk errors, such as two drive failures, two coincident medium errors, or a drive failure and a medium error. RAID 6 allows for additional fault tolerance by using a second independent distributed parity scheme (dual parity). Data is striped on a block level across a set of drives, similar to RAID 5 configurations, and a second set of parity is calculated and written across all the drives. RAID 6 is best used in combination with large capacity disk drives, such as the 500 GB Fibre ATA (FATA) drives, because these drives have a longer rebuild time, but RAID 6 can also be used with Fibre Channel (FC) drives when the primary concern is higher reliability.
RAID 10 theory
RAID 10 provides high availability by combining features of RAID 0 and RAID 1. RAID 0 optimizes performance by striping volume data across multiple disk drives at a time. RAID 1 provides disk mirroring, which duplicates data between two disk drives. By combining the features of RAID 0 and RAID 1, RAID 10 provides a second optimization for fault tolerance. Data is striped across half of the disk drives in the RAID 1 array. The same data is also striped across the other half of the array, creating a mirror. Access to data is preserved if one disk in each mirrored pair remains available. RAID 10 offers faster data reads and writes than RAID 5, because it does not need to manage parity. However, with half of the DDMs in the group used for data and the other half to mirror that data, RAID 10 disk groups have less capacity than RAID 5 disk groups.
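The capacity and write-cost trade-off between the three RAID levels can be summarized with a small sketch. The drive counts assume an eight-drive array site with no spare taken from it (7+P, 6+P+Q, 4x2), and the disk-operations-per-random-write figures (4, 6, and 2) are the usual textbook costs of parity and mirroring rather than DS8000-specific measurements.

```python
# Rough comparison of an 8-drive array site under the supported RAID levels.

RAID_LEVELS = {
    # RAID level: (data drives out of 8, disk operations per random host write)
    "RAID 5":  (7, 4),   # read old data and parity, write new data and parity
    "RAID 6":  (6, 6),   # as RAID 5, plus the second, independent parity
    "RAID 10": (4, 2),   # write the data block and its mirror copy
}

def summarize(drive_capacity_gb):
    for raid, (data_drives, write_cost) in RAID_LEVELS.items():
        usable_gb = data_drives * drive_capacity_gb
        print(f"{raid}: ~{usable_gb} GB usable, {write_cost} disk ops per random write")

summarize(300)    # with 300 GB DDMs
```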
then mirrored. If spares do not exist on the array site, eight DDMs are used to make a four-disk RAID 0 array, which is then mirrored.
Floating spares
The DS8000 implements a smart floating technique for spare DDMs. The DS8000 microcode might choose to allow the hot spare to remain where it has been moved, but it can instead choose to migrate the spare to a more optimum position. This move is done to better balance the spares across the DA pairs, the loops, and the enclosures. It might be preferable that a DDM that is currently in use as an array member is converted to a spare. In this case, the data on that DDM will be migrated in the background onto an existing spare. This process does not fail the disk that is being migrated, though it does reduce the number of available spares in the DS8000 until the migration process is complete. A smart process is used to ensure that the larger or higher rpm DDMs always act as spares. This design is preferable, because if we rebuild the contents of a 146 GB DDM onto a 300 GB DDM, approximately half of the 300 GB DDM will be wasted, because that space is not needed. The problem here is that the failed 146 GB DDM will be replaced with a new 146 GB DDM. So, the DS8000 microcode will most likely migrate the data back onto the recently replaced 146 GB DDM. When this process completes, the 146 GB DDM will rejoin the array and the 300 GB DDM will become the spare again. Another example is if we fail a 73 GB 15K rpm DDM onto a 146 GB 10K rpm DDM. The data has now moved to a slower DDM, but the replacement DDM will be the same as the failed DDM. The array will have a mix of rpms, which is not desirable. Again, a smart migration of the data will be performed when suitable spares become available.
[Figure 4-1: Array site — eight DDMs attached through the Fibre Channel switches, four DDMs on loop 1 and four DDMs on loop 2]
As you can see from Figure 4-1, array sites span loops. Four DDMs are taken from loop 1 and another four DDMs from loop 2.
4.2.2 Arrays
An array is created from one array site. Forming an array means defining it as a specific RAID type. The supported RAID types are RAID 5, RAID 6, and RAID 10 (refer to 4.1, RAID levels and spares on page 42). For each array site, you can select a RAID type. The process of selecting the RAID type for an array is also called defining an array. Note: In the DS8000 implementation, one array is defined using one array site. Figure 4-2 on page 46 shows the creation of a RAID 5 array with one spare, which is also called a 6+P+S array (capacity of 6 DDMs for data, capacity of one DDM for parity, and a spare drive). According to the RAID 5 rules, parity is distributed across all seven drives in this example.
On the right side in Figure 4-2, the terms D1, D2, D3, and so on stand for the set of data contained on one disk within a stripe on the array. If, for example, 1 GB of data is written, it is distributed across all the disks of the array.
[Figure 4-2: Creation of an array — an array site is formed into a RAID 5 array (6+P+S) with six data drives, distributed parity (D1, D2, D3, ... P), and a spare]
So, an array is formed using one array site, and while the array can be accessed by each adapter of the device adapter pair, it is managed by one device adapter. You define which adapter and which server manage this array later in the configuration process.
4.2.3 Ranks
In the DS8000 virtualization hierarchy, there is another logical construct, a rank. When defining a new rank, its name is chosen by the DS Storage Manager, for example, R1, R2, R3, and so on. You have to add an array to a rank. Note: In the DS8000 implementation, a rank is built using just one array. The available space on each rank will be divided into extents. The extents are the building blocks of the logical volumes. An extent is striped across all disks of an array as shown in Figure 4-3 on page 47 and indicated by the small squares in Figure 4-4 on page 48. The process of forming a rank performs two jobs: The array is formatted for either fixed block (FB) type data (Open Systems) or count key data (CKD) (System z). This formatting determines the size of the set of data contained on one disk within a stripe on the array. The capacity of the array is subdivided into equal-sized partitions, which are called extents. The extent size depends on the extent type: FB or CKD. An FB rank has an extent size of 1 GB (where 1 GB equals 230 bytes).
Figure 4-3 shows an example of an array that is formatted for FB data with 1 GB extents (the squares in the rank just indicate that the extent is composed of several blocks from different DDMs).
[Figure 4-3: Creation of a rank — a RAID array (D1 ... D6, P) is formatted into an FB rank of 1 GB extents, with each extent striped across all disks of the array]
Storage Pool Striping was made available with Licensed Machine Code 5.3.xx.xx and allows
you to create logical volumes striped across multiple ranks, which typically enhances performance. To benefit from Storage Pool Striping (see Storage Pool Striping extent rotation on page 51), more than one rank in an extent pool is required.
Storage Pool Striping can significantly enhance performance. However, when you lose one rank, not only is the data of this rank lost, but also, all of the data in this extent pool is lost, because data is striped across all ranks. Therefore, you must keep the number of ranks in an extent pool in the range of four to eight. The minimum number of extent pools is two, with one extent pool assigned to server 0 and the other extent pool assigned to server 1 so that both servers are active. In an environment where both FB type data and CKD type data are to go onto the DS8000 storage server, four extent pools will provide one FB pool for each server and one CKD pool for each server, to balance the capacity between the two servers. Figure 4-4 is an example of a mixed environment with CKD and FB extent pools. Additional extent pools might also be desirable to segregate ranks with different DDM types. Extent pools are expanded by adding more ranks to the pool. Ranks are organized in two rank groups; rank group 0 is controlled by server 0 and rank group 1 is controlled by server 1.
[Figure 4-4: Mixed extent pools — extent pools built from 1 GB FB and CKD extents, with rank group 0 served by server 0 and rank group 1 served by server 1]
CKD volumes
A System z CKD volume is composed of one or more extents from one CKD extent pool. CKD extents are the size of 3390 Model 1, which has 1113 cylinders. However, when you define a System z CKD volume, you do not specify the number of 3390 Model 1 extents but the number of cylinders that you want for the volume. Prior to Licensed Machine Code 5.4.0.xx.xx, the maximum size for a CKD volume was 65520 cylinders. Now, you can define CKD volumes with up to 262668 cylinders, which is about 223 GB. This new volume capacity is called Extended Address Volume (EAV) and the device type is 3390 Model A. For more information about EAV volumes, refer to DS8000 Series: Architecture and Implementation, SG24-6786. Important: EAV volumes can only be exploited by z/OS Version 1.10 or later. If the number of specified cylinders is not an exact multiple of 1113 cylinders, part of the space in the last allocated extent is wasted. For example, if you define 1114 or 3340 cylinders, 1112 cylinders are wasted. For maximum storage efficiency, consider allocating volumes that are exact multiples of 1113 cylinders. In fact, consider multiples of 3339 cylinders for future compatibility. A CKD volume cannot span multiple extent pools, but a volume can have extents from different ranks in the same extent pool or you can stripe a volume across the ranks (see Storage Pool Striping extent rotation on page 51). The allocation process for FB volumes is similar, and it is shown in Figure 4-5.
Figure 4-5 Creation of an FB LUN: a logical 3 GB LUN is allocated from free 1 GB extents on Rank-a and Rank-b (part of the last allocated extent, for example 100 MB, can remain unused)
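To relate the CKD sizing guidance above to an actual command, the following hedged DSCLI sketch (the extent pool ID, volume ID, and name are hypothetical, and options can vary by release) creates a CKD volume sized as an exact multiple of 3339 cylinders:

dscli> mkckdvol -extpool P2 -cap 3339 -name ckd3339 0000

With 3339 cylinders, the volume uses exactly three 1113-cylinder extents and wastes no space in the last extent.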
A Space Efficient volume does not occupy physical capacity when it is created. Space gets allocated when data is actually written to the volume. The amount of space that gets physically allocated is a function of the amount of data changes that are performed on the volume. The sum of all defined Space Efficient volumes can be larger than the physical capacity available. This function is also called over provisioning or thin provisioning.

Note: In the current implementation (Licensed Machine Code 5.4.1.xx.xx), Space Efficient volumes are supported as FlashCopy target volumes only.
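A hedged sketch of the corresponding DSCLI steps follows; the exact commands and options (in particular the repository creation and the -sam parameter) depend on the installed code level and should be verified against the DSCLI reference, and all IDs and sizes here are hypothetical:

dscli> mksestg -extpool P0 -repcap 100 -vircap 400
dscli> mkfbvol -extpool P0 -cap 100 -sam tse 1100

The first command reserves a 100 GB repository with 400 GB of virtual capacity in extent pool P0; the second creates a track space efficient volume that draws physical space from that repository only as data is written, which is why it is suitable as a FlashCopy target.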
Figure 4-6 A 40 GB LUN created from 40 sequentially allocated 1 GB extents in an extent pool (one rank per extent pool)
Figure 4-6 shows how the extents are grouped the first time that the LUNs are created in a one-to-one (1:1) ratio between ranks and extent pools. In this example, 40 extents are used in a sequential pattern to create the first LUN. It is important to note that if one or more LUNs are deleted, new LUNs are created from the free extents, as illustrated in Figure 4-7 on page 51.
In Figure 4-7, the 13 colored extents in the Extpool on the left represent the free extents that make up the 13 GB LUN on the right side of the diagram. The extents that are not shaded or colored represent used extents.
Figure 4-7 Existing free extents that can be used to form a LUN after LUN deletions occur
The DS8000 provides two extent allocation algorithms: the rotate volumes allocation method and the Storage Pool Striping (rotate extents) allocation method.
When you create striped volumes and non-striped volumes in an extent pool, one rank can fill up before the other ranks. A full rank is skipped when you create new striped volumes. There is no reorganization function for the extents in an extent pool: if you add one or more ranks to an existing extent pool, the existing extents are not redistributed.

Tip: If you have to add capacity to an extent pool because it is nearly full, it is better to add several ranks at one time instead of just one rank. This method allows new volumes to be striped across the added ranks.
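As a hedged illustration of this tip (the pool, rank, and volume IDs are hypothetical, and syntax can vary by DSCLI release), expanding a nearly full pool with two ranks at once before creating the next striped volumes might look like this:

dscli> chrank -extpool P2 R8
dscli> chrank -extpool P2 R9
dscli> mkfbvol -extpool P2 -cap 100 -eam rotateexts 1200-1203

The new volumes 1200 to 1203 are then striped across both added ranks, whereas adding only one rank would concentrate all new extents on that single rank.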
[Figure: extent allocation in an extent pool with three ranks (R1, R2, R3). The rank on which the first volume starts is determined at power on (say, R2). A striped volume with two extents is created; the next striped volume (five extents in this example) starts at the next rank (R3) after the rank from which the previous volume was started; a non-striped volume then starts at the next rank (R1), going in a round-robin; the next striped volume starts at R2 (extents 13 to 15).]
By using striped volumes, you distribute the I/O load to a LUN or CKD volume across more than just one set of eight disk drives. The ability to distribute a workload to many physical drives can greatly enhance performance for a logical volume. In particular, operating systems that do not have a volume manager that can perform striping will benefit most from this allocation method.

However, if you have extent pools with many ranks and all volumes are striped across the ranks, and you lose just one rank (for example, because two disk drives in the same rank fail at the same time and it is not a RAID 6 rank), you will lose a significant portion of your data. Therefore, it might be better to have extent pools with only about four to eight ranks.

Also, if you already perform, for example, Physical Partition striping in AIX, double striping probably will not improve performance any further; the same is true when the DS8000 LUNs are used by a SAN Volume Controller (SVC) that stripes data across LUNs.

If you decide to use Storage Pool Striping, it is probably better to use this allocation method for all volumes in the extent pool to keep the ranks equally filled and utilized.
Tip: If you configure a new DS8000, do not mix striped volumes and non-striped volumes in an extent pool.
[Figure: volume IDs and LSSs. A volume created in an extent pool (for example, Extent Pool FB-1 containing Rank-c and Rank-d) is assigned a volume ID that determines its LSS (for example, LSS X'1E') and therefore the managing server (server 0 or server 1).]
LSSs are numbered X'ab' where a is the address group and b denotes an LSS within the address group. So, for example, X'10' to X'1F' are LSSs in address group 1.
All LSSs within one address group have to be of the same type, either CKD or FB. The first LSS defined in an address group fixes the type of that address group.

Important: System z users who still want to use ESCON to attach hosts to the DS8000 must be aware that ESCON supports only the 16 LSSs of address group 0 (LSS X'00' to X'0F'). Therefore, this address group must be reserved for ESCON-attached CKD devices in this case and not used as FB LSSs.

The LUN identifications X'gabb' are composed of the address group X'g', the LSS number within the address group X'a', and the position of the LUN within the LSS X'bb'. For example, LUN X'2101' denotes the second (X'01') LUN in LSS X'21' of address group 2.
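As a hedged DSCLI sketch (the extent pool and volume IDs are hypothetical), the LSS of a volume is chosen implicitly by the volume ID that you specify when you create it:

dscli> mkfbvol -extpool P1 -cap 50 2101

This creates LUN X'2101', that is, the LUN at position X'01' in LSS X'21' of address group 2. Because even-numbered LSSs are associated with rank group 0 (server 0) and odd-numbered LSSs with rank group 1 (server 1), the volume is created in an extent pool of rank group 1 in this sketch.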
Host attachment
Host bus adapters (HBAs) are identified to the DS8000 in a host attachment construct that specifies the HBAs' worldwide port names (WWPNs). A set of host ports can be associated through a port group attribute that allows a set of HBAs to be managed collectively. This port group is referred to as a host attachment within the GUI.

Each host attachment can be associated with a volume group to define which LUNs that HBA is allowed to access. Multiple host attachments can share the same volume group. The host attachment can also specify a port mask that controls which DS8000 I/O ports the HBA is allowed to log in to. Whichever ports the HBA logs in to, it sees the same volume group that is defined in the host attachment associated with this HBA. The maximum number of host attachments on a DS8000 is 8192.
Volume group
A volume group is a named construct that defines a set of logical volumes. When used in conjunction with CKD hosts, there is a default volume group that contains all CKD volumes and any CKD host that logs in to a FICON I/O port has access to the volumes in this volume group. CKD logical volumes are automatically added to this volume group when they are created and automatically removed from this volume group when they are deleted. When used in conjunction with Open Systems hosts, a host attachment object that identifies the HBA is linked to a specific volume group. You must define the volume group by indicating which fixed block logical volumes are to be placed in the volume group. Logical volumes can be added to or removed from any volume group dynamically.
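For Open Systems hosts, a hedged DSCLI sketch of these two constructs follows (the WWPN, host type, names, and volume IDs are hypothetical, and options can vary by release):

dscli> mkvolgrp -type scsimask -volume 1000-1003 AIX_prod_vg
dscli> mkhostconnect -wwname 10000000C9123456 -hosttype pSeries -volgrp V0 AIX_prod_hba0

The first command builds a volume group containing FB volumes 1000 to 1003; the second defines a host attachment for one HBA WWPN and links it to that volume group (V0 is the system-assigned volume group ID reported by mkvolgrp in this sketch), so the HBA sees exactly those LUNs on whichever I/O ports it logs in to.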
In summary, logical volumes are created within the extent pools (optionally striping the volumes) and assigned a logical volume number that determines to which logical subsystem they will be associated and which server will manage them. Space Efficient volumes can be created within the repository of the extent pool. Then, the LUNs can be assigned to one or more volume groups. Finally, the HBAs are configured into a host attachment that is associated with a volume group.

This virtualization concept provides for greater flexibility. Logical volumes can dynamically be created, deleted, and resized. They can be grouped logically to simplify storage management. Large LUNs and CKD volumes reduce the total number of volumes, which also contributes to a reduction of the management effort. Figure 4-10 summarizes the virtualization hierarchy.
Figure 4-10 The DS8000 virtualization hierarchy: array site, RAID array (data, parity, and spare disks), rank (type FB), extent pool with 1 GB FB extents, logical volume, LSS (FB), address group (for example, X'2x' FB with 4096 addresses, LSS X'27'), volume group, and host attachment
Figure 4-11 How logical extents are formed from the DS8000 in a 6+P+S type array format
Logical-extent3 is created from the middle sections of the DDMs. In this example, we show only five extents equaling the capacity of the entire DS8000 array. This example is for illustration purposes only; in reality, the number of extents equals the capacity of the entire array. For example, a RAID 5 array that consists of 300 GB raw DDMs actually produces 1576 logical extents.

Parity is not placed on one disk; instead, it is striped across all of the disks in the array. It is important to keep in mind that RAID 5 arrays contain one disk's worth of parity, which is striped throughout the array. In Figure 4-12 on page 57, we have a 6+P+S (one spare) array. One of the chunks in the array is parity, which is striped throughout the seven disks.
Figure 4-13 How logical extents are formed from the DS8000 in a 7+P type array format
For fine-tuning of performance-intensive workloads, you might find that a 7+P RAID 5 array performs better than a 6+P array, and that a 4x2 RAID 10, as shown in Figure 4-20 on page 60, performs better than a 3x2, as shown in Figure 4-19 on page 60. For random I/O, you might see up to 15% greater throughput on a 7+P or a 4x2 array than on a 6+P or a 3x2 array. For sequential applications, the differences are minimal. As a general rule though, try to balance workload activity evenly across RAID arrays, regardless of their size; it is not worth the management effort to do otherwise.

It is important to remember that RAID 5 arrays contain one disk's worth of parity, which is striped throughout the array. In Figure 4-14 on page 58, we show a 7+P array (no spare). One of the chunks (shaded) in the array is parity, which is striped throughout the eight disks.
Figure 4-15 How logical extents are formed from the DS8000 in a 5+P+Q+S type array format
Two of the chunks in the array are parity and are striped throughout the seven disks as shown in Figure 4-15.
Figure 4-17 How logical extents are formed from the DS8000 in a 6+P+Q type array format
Two of the chunks in the array are parity stripes, which are striped throughout the eight disks as shown in Figure 4-18.
Figure 4-18 Parity chunks interleaved with the data chunks (Chunk1 to Chunk8) across the disks of the array
Figure 4-19 How logical extents are formed from the DS8000 in a 3X2+S+S mirror array format
Figure 4-20 How logical extents are formed from the DS8000 in a 4X2 mirror type array format
It is important to note that the stripe widths differ in size from the stripes in a 3X2+S+S RAID array configuration.

Note: Due to the different stripe widths that make up the extent from each type of RAID array, it is important not to intermix the RAID array types within the same extent pool.
Figure 4-21 LUNs created with the rotate extents function: LUN 1 to LUN 8 each have extents spread across Rank1, Rank2, Rank3, and Rank4 in the extent pool
In Figure 4-21, the rotate extents function was used to create LUNs by spreading the extents across the ranks in the extent pool. LUN 1 is assigned to host A, and LUN 5 is assigned to host B. These LUNs share the same heads and spindles in the arrays/ranks and are not isolated. To obtain true disk capacity isolation at the rank level, all the LUNs in the array must be assigned to one database or workload. Workloads must be strategically placed and distributed as evenly as possible for proper sharing or isolation.

When the rotate extents function is not used, each rank operates independently in the extent pool. LUN isolation to one rank is more achievable, but a LUN can still spread across multiple ranks when space on one rank is depleted. To achieve even more isolation, we recommend that you place fewer ranks, or even just one rank, in an extent pool.
Chapter 5. Logical configuration performance considerations
storage type (CKD or FB) than the disk types, RAID types, or storage types that are used by other workloads. You must consider workloads with heavy, large blocksize, sequential activity for DA-level isolation, because these workloads tend to consume all of the DA resources that are available to them.

Processor complex level: All ranks assigned to extent pools managed by processor complex 0, or all ranks assigned to extent pools managed by processor complex 1, are dedicated to a workload. We typically do not recommend this approach, because it can reduce the processor and cache resources available to the workload by 50%.

Storage Image level: This level applies to the 2107 Models 9A2 or 9B2 only. All ranks owned by one Storage Image (Storage Image 1 or Storage Image 2) are dedicated to a workload. That is, an entire logical DS8000 is dedicated to a workload.

Storage unit level: All ranks in a physical DS8000 are dedicated to a workload. That is, the physical DS8000 runs only one workload.
Multiple resource-sharing workloads can have logical volumes on the same ranks and can access the same DS8000 host adapters or even I/O ports. Resource-sharing allows a workload to access more DS8000 hardware than could be dedicated to it, providing greater potential performance, but this hardware sharing can result in resource contention between applications that impacts performance at times. It is important to allow resource-sharing only for workloads that will not consume all of the DS8000 hardware resources that are available to them.

It might be easier to understand the resource-sharing principle for workloads on disk arrays by referring to Figure 5-1. An application workload (for example, a database) can use two logical unit numbers (LUNs), such as LUNs 1 and 3, in an extent pool, and another application (for example, another database) can use another two LUNs, LUNs 5 and 7, in the same extent pool. If the extents of these LUNs also share the same ranks (for example, using Storage Pool Striping or the rotate extents volume allocation algorithm as shown in Figure 5-1), I/O contention can easily occur if the two application workloads peak at the same time, because these LUNs physically share the same disk heads and disk spindles in the arrays.
Figure 5-1 Example for resource-sharing workloads with different LUNs sharing the same ranks
You must allocate DS8000 hardware resources to either an isolated workload or multiple resource-sharing workloads in a balanced manner. That is, you must allocate either an isolated workload or resource-sharing workloads to DS8000 ranks that are assigned to device adapters (DAs) and both processor complexes in a balanced manner. You must allocate either type of workload to I/O ports that are spread across host adapters and I/O enclosures in a balanced manner.

You must distribute volumes and host connections for either an isolated workload or a resource-sharing workload in a balanced manner across all DS8000 hardware resources that have been allocated to that workload. You must create volumes as evenly as possible across all ranks and DAs allocated to those workloads. You can then use host-level striping (Open Systems Logical Volume Manager (LVM) striping or z/OS storage groups) across all of the volumes belonging to either type of workload. You can obtain more information about host-level striping in the appropriate chapters for the various operating systems and platforms in this book.

One exception to the recommendation of spreading volumes is when specific files or datasets will never be accessed simultaneously, such as multiple log files for the same application where only one log file will be in use at a time. In that case, you can optimize the overall workload performance by placing all volumes required by these datasets or files on a single DS8000 rank.

You must also configure host connections as evenly as possible across the I/O ports, host adapters, and I/O enclosures available to either an isolated or a resource-sharing workload. Then, you can use host server multipathing software to optimize performance over multiple host connections. For more information about multipathing software, refer to Chapter 9, Host attachment on page 265.
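The preceding paragraphs mention host-level striping. As a hedged illustration on an Open Systems host (here Linux LVM2; the volume names, multipath device paths, stripe count, and stripe size are hypothetical and must be adapted to the actual number of DS8000 volumes and the workload):

# pvcreate /dev/mapper/mpatha /dev/mapper/mpathb /dev/mapper/mpathc /dev/mapper/mpathd
# vgcreate ds8k_vg /dev/mapper/mpatha /dev/mapper/mpathb /dev/mapper/mpathc /dev/mapper/mpathd
# lvcreate -i 4 -I 64 -L 400G -n db_lv ds8k_vg

The -i 4 option stripes the logical volume across all four DS8000 LUNs and -I 64 sets a 64 KB stripe size, so that the host spreads its I/O evenly across the volumes that were created evenly across the ranks.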
Business Intelligence and Data Mining
Disk copies (including Point-in-Time Copy background copies, remote mirroring target volumes, and tape simulation on disk)
Video/imaging applications
Engineering/scientific applications
Certain batch workloads

You must consider workloads for all applications for which DS8000 storage will be allocated, including current workloads that will be migrated from other installed storage subsystems and new workloads that are planned for the DS8000. Also, consider projected growth for both current and new workloads.

For existing applications, consider historical experience first. For example, is there an application where certain datasets or files are known to have heavy, continuous I/O access patterns? Is there a combination of multiple workloads that might result in unacceptable performance if their peak I/O times occur simultaneously? Consider workload importance (workloads of critical importance and workloads of lesser importance).

For existing applications, you can also use performance monitoring tools that are available for the existing storage subsystems and server platforms to understand current application workload characteristics, such as:
Read/Write ratio
Random/sequential ratio
Average transfer size (blocksize)
Peak workload (I/Os per second for random access and MB per second for sequential access)
Peak workload periods (time of day and time of month)
Copy Services requirements (Point-in-Time Copy and Remote Mirroring)
Host connection utilization and throughput (FCP host connections and FICON and ESCON channels)
Remote mirroring link utilization and throughput

Estimate the requirements for new application workloads and for current application workload growth. You can obtain information about general workload characteristics in Chapter 3, Understanding your workload on page 29. As new applications are rolled out and current applications grow, you must monitor performance and adjust projections and allocations. You can obtain more information in Chapter 6, Performance management process on page 147 and in Chapter 8, Practical performance management on page 203. You can use the Disk Magic modeling tool to model the current or projected workload and estimate the required DS8000 hardware resources. We introduce Disk Magic in 7.1, Disk Magic on page 162.
4. Assign each required logical volume to a different rank or a different set of aggregated ranks (which means an extent pool with multiple ranks using Storage Pool Striping) if possible:
If the number of volumes required is less than the number of ranks (or sets of aggregated ranks), assign the volumes evenly to ranks or extent pools that are owned by processor complex 0 and ranks or extent pools that are owned by processor complex 1, on as many DA pairs as possible.
If the number of volumes required is greater than the number of ranks (or sets of aggregated ranks), assign additional volumes to the ranks and DAs in a balanced manner. Ideally, the workload has the same number of logical volumes on each of its ranks, on each DA available to it.
5. Then, you can use host-level striping (such as Open Systems Logical Volume Manager striping or z/OS storage groups) across all logical volumes.
Figure 5-2 DS8000 configuration example 1 with two disk enclosures populated (base frame only)
Note: In the schematic, each of the two green rectangles at the top of the DS8000 represents one disk enclosure pair, or 16 disk drives in the front and 16 disk drives in the rear for a total of 32 disk drives. The two green rectangles together make up a pair of disk enclosure pairs, for a total of 64 disk drives. The number 2 in the green rectangles indicates that the disk enclosure pairs are cabled to DA 2, which is shown by the boxes in I/O enclosures 2 and 3 at the bottom of the DS8000.

Example 5-1 shows the output of the lsarraysite command issued for this DS8000. The lsarraysite output shows that this DS8000 has:
Eight array sites (S1 - S8)
One DA in use (DA2). DA0 will not be used until the remaining two disk enclosure pairs in the base frame are populated with disk drives. DA3 and DA1 will not be used until an expansion frame is added.
Homogeneous disk drives (all 146 GB and 10K rpm)
Example 5-1 DS8000 configuration example 1: Array sites, DA pairs, and disk drive types
dscli> lsarraysite -l -dev ibm.2107-7506571
Date/Time: September 4, 2005 2:30:15 PM EDT IBM DSCLI Version: 5.0.5.52 DS:IBM.2107-7506571
arsite DA Pair dkcap (10^9B) diskrpm State      Array
===================================================
S1     2       146.0         10000   Unassigned -
S2     2       146.0         10000   Unassigned -
S3     2       146.0         10000   Unassigned -
S4     2       146.0         10000   Unassigned -
S5     2       146.0         10000   Unassigned -
S6     2       146.0         10000   Unassigned -
S7     2       146.0         10000   Unassigned -
S8     2       146.0         10000   Unassigned -
to processor complex 1 extent pools. Again, you can create volumes for the resource-sharing workloads evenly on all ranks, so that all workloads are able to take advantage of all six ranks' performance capabilities as well as the processor and cache resources of both processor complexes.
[Figure: DS8000 configuration example 2 with Base Frame A and 1st Expansion Frame B fully populated; the disk enclosure pairs in the base frame are cabled to DA pairs 2 and 0, and those in the first expansion frame to DA pairs 6, 4, 7, and 5, which connect to the I/O enclosures at the bottom of the frames]
Example 5-2 on page 76 shows output from the DSCLI lsarraysite command issued for this DS8000. Because of the two fully populated frames, the lsarraysite output shows a total of 48 array sites (384 disk drives):
Eight array sites on DA2 (S1 - S8)
Eight array sites on DA0 (S9 - S16)
Eight array sites on DA7 (S17 - S24)
Eight array sites on DA6 (S25 - S32)
Eight array sites on DA5 (S33 - S40)
Eight array sites on DA4 (S41 - S48)
A total of six DA pairs are used (DA 2, 0, 7, 6, 5, and 4). DA3 and DA1 will not be used until a second expansion frame is added. All disk drives are the same capacity and speed (73 GB and 15K rpm).

Important: It is important to note the association of array sites and DA pairs as shown in the lsarraysite output, because array sites do not have any fixed or predetermined relationship to physical disk drive locations in the DS8000. Array sites are created and assigned to device adapters dynamically during the DS8000 installation and can vary from one DS8000 to another DS8000. In this example, the association between array sites (S1 - S48) and DA pairs is not the same as the order of installation of disks on DA pairs (DA2, DA0, DA6, DA4, DA7, and DA5).
Example 5-2 DS8000 configuration example 2: Array sites, DA pairs, and disk drive types
dscli> lsarraysite -dev ibm.2107-7520331 -l
Date/Time: September 9, 2005 2:57:27 PM EDT IBM DSCLI Version: 5.0.5.52 DS:IBM.2107-7520331
arsite DA Pair dkcap (10^9B) diskrpm State      Array
===================================================
S1     2       73.0          15000   Unassigned -
S2     2       73.0          15000   Unassigned -
S3     2       73.0          15000   Unassigned -
S4     2       73.0          15000   Unassigned -
S5     2       73.0          15000   Unassigned -
S6     2       73.0          15000   Unassigned -
S7     2       73.0          15000   Unassigned -
S8     2       73.0          15000   Unassigned -
S9     0       73.0          15000   Unassigned -
S10    0       73.0          15000   Unassigned -
S11    0       73.0          15000   Unassigned -
S12    0       73.0          15000   Unassigned -
S13    0       73.0          15000   Unassigned -
S14    0       73.0          15000   Unassigned -
S15    0       73.0          15000   Unassigned -
S16    0       73.0          15000   Unassigned -
S17    7       73.0          15000   Unassigned -
S18    7       73.0          15000   Unassigned -
S19    7       73.0          15000   Unassigned -
S20    7       73.0          15000   Unassigned -
S21    7       73.0          15000   Unassigned -
S22    7       73.0          15000   Unassigned -
S23    7       73.0          15000   Unassigned -
S24    7       73.0          15000   Unassigned -
S25    6       73.0          15000   Unassigned -
S26    6       73.0          15000   Unassigned -
S27    6       73.0          15000   Unassigned -
S28    6       73.0          15000   Unassigned -
S29    6       73.0          15000   Unassigned -
S30    6       73.0          15000   Unassigned -
S31    6       73.0          15000   Unassigned -
S32    6       73.0          15000   Unassigned -
S33    5       73.0          15000   Unassigned -
S34    5       73.0          15000   Unassigned -
S35    5       73.0          15000   Unassigned -
S36    5       73.0          15000   Unassigned -
S37    5       73.0          15000   Unassigned -
S38    5       73.0          15000   Unassigned -
S39    5       73.0          15000   Unassigned -
S40    5       73.0          15000   Unassigned -
S41    4       73.0          15000   Unassigned -
S42    4       73.0          15000   Unassigned -
S43    4       73.0          15000   Unassigned -
S44    4       73.0          15000   Unassigned -
S45    4       73.0          15000   Unassigned -
S46    4       73.0          15000   Unassigned -
S47    4       73.0          15000   Unassigned -
S48    4       73.0          15000   Unassigned -
[Figure annotations: 160 disk drives and 20 array sites per Storage Image; only one populated disk enclosure pair on DA7; six DA pairs total; Storage Image 1 uses DAs 0, 4, and 5 (I/O enclosures 0, 1, 4, and 5); Storage Image 2 uses DAs 2, 6, and 7 (I/O enclosures 2, 3, 6, and 7); DA1 and DA3 are not used without a second expansion frame]
Figure 5-4 DS8000 configuration example 3 with dual Storage Image and 10 disk enclosure pairs
In order to see all of the array sites in this DS8000, you must issue the DS command-line interface (CLI) lsarraysite command twice, one time for each Storage Image:
Storage Image 1: IBM.2107-7566321
Storage Image 2: IBM.2107-7566322

The DSCLI lsarraysite output for Storage Image 1 (IBM.2107-7566321) in Example 5-3 on page 79 shows a total of 20 array sites (160 disk drives):
Eight array sites on DA0 (S1 - S8): 64 x 73 GB, 15K rpm disk drives.
Four array sites on DA5 (S9 - S12): 32 x 73 GB, 15K rpm disk drives. The second disk enclosure pair on DA5 is not populated with disk drives.
Eight array sites on DA4 (S13 - S20): 32 x 300 GB, 10K rpm disk drives and 32 x 73 GB, 15K rpm disk drives.

A total of three DA pairs are used (DA0, DA4, and DA5). Storage Image 1 does not show array sites on DA1, because DA1 will not be used until a second expansion frame is added. The disk drives are not homogeneous. There are:
128 x 73 GB, 15K rpm drives
32 x 300 GB, 10K rpm drives
Important: Note the association of array sites and DA pairs as shown in the lsarraysite output, because array sites do not have any fixed or predetermined relationship to physical disk drive locations in the DS8000. Array sites are created and assigned to device adapters dynamically during DS8000 installation and can vary from one DS8000 to another DS8000. In this example, the association between array sites (S1 - S20 on each Storage Image) and DA pairs is not the same as the order of installation of disks on DA pairs (DA0, DA4, and DA5 for Storage Image 1 and DA2, DA6, and DA7 for Storage Image 2).
Example 5-3 DS8000 example 3 Storage Image 1: Array sites, DA pairs, and disk drive types
dscli> lsarraysite -l -dev ibm.2107-7566321
Date/Time: September 4, 2005 2:23:20 PM EDT IBM DSCLI Version: 5.0.5.52 DS:IBM.2107-7566321
arsite DA Pair dkcap (10^9B) diskrpm State      Array
===================================================
S1     0       73.0          15000   Unassigned -
S2     0       73.0          15000   Unassigned -
S3     0       73.0          15000   Unassigned -
S4     0       73.0          15000   Unassigned -
S5     0       73.0          15000   Unassigned -
S6     0       73.0          15000   Unassigned -
S7     0       73.0          15000   Unassigned -
S8     0       73.0          15000   Unassigned -
S9     5       73.0          15000   Unassigned -
S10    5       73.0          15000   Unassigned -
S11    5       73.0          15000   Unassigned -
S12    5       73.0          15000   Unassigned -
S13    4       300.0         10000   Unassigned -
S14    4       300.0         10000   Unassigned -
S15    4       300.0         10000   Unassigned -
S16    4       300.0         10000   Unassigned -
S17    4       73.0          15000   Unassigned -
S18    4       73.0          15000   Unassigned -
S19    4       73.0          15000   Unassigned -
S20    4       73.0          15000   Unassigned -
The DSCLI lsarraysite output for Storage Image 2 (IBM.2107-7566322) in Example 5-4 on page 80 also shows a total of 20 array sites (160 disk drives):
Four array sites on DA7 (S1 - S4): 32 x 73 GB, 15K rpm disk drives. The second disk enclosure pair on DA7 is not populated with disk drives.
Eight array sites on DA2 (S5 - S12): 64 x 73 GB, 15K rpm disk drives.
Eight array sites on DA6 (S13 - S20): 32 x 300 GB, 10K rpm disk drives (S13 - S16) and 32 x 73 GB, 15K rpm disk drives (S17 - S20).

A total of three DA pairs are used (DA2, DA6, and DA7). Storage Image 2 shows no array sites on DA3, because DA3 will not be used until a second expansion frame is added. The disk drives are not homogeneous. There are:
128 x 73 GB, 15K rpm drives
32 x 300 GB, 10K rpm drives
Example 5-4 DS8000 example 3 Storage Image 2: Array sites, DA pairs, and disk drive types
dscli> lsarraysite -l -dev ibm.2107-7566322
Date/Time: September 4, 2005 2:23:34 PM EDT IBM DSCLI Version: 5.0.5.52 DS:IBM.2107-7566322
arsite DA Pair dkcap (10^9B) diskrpm State      Array
===================================================
S1     7       73.0          15000   Unassigned -
S2     7       73.0          15000   Unassigned -
S3     7       73.0          15000   Unassigned -
S4     7       73.0          15000   Unassigned -
S5     2       73.0          15000   Unassigned -
S6     2       73.0          15000   Unassigned -
S7     2       73.0          15000   Unassigned -
S8     2       73.0          15000   Unassigned -
S9     2       73.0          15000   Unassigned -
S10    2       73.0          15000   Unassigned -
S11    2       73.0          15000   Unassigned -
S12    2       73.0          15000   Unassigned -
S13    6       300.0         10000   Unassigned -
S14    6       300.0         10000   Unassigned -
S15    6       300.0         10000   Unassigned -
S16    6       300.0         10000   Unassigned -
S17    6       73.0          15000   Unassigned -
S18    6       73.0          15000   Unassigned -
S19    6       73.0          15000   Unassigned -
S20    6       73.0          15000   Unassigned -
also be used with smaller FC drives when the primary concern is a higher level of data protection than is provided by RAID 5. RAID 10 optimizes high performance while maintaining fault tolerance for disk drive failures. The data is striped across several disks, and the first set of disk drives is mirrored to an identical set. RAID 10 can tolerate at least one, and in most cases even multiple, disk failures as long as the primary and secondary copy of a mirrored disk pair do not fail at the same time.

In addition to the considerations for data protection and capacity requirements, the question typically arises about which RAID level performs better: RAID 5, RAID 6, or RAID 10. As with most complex issues, the answer is that it depends. A number of workload attributes influence the relative performance of RAID 5, RAID 6, and RAID 10, including the use of cache, the relative mix of read as compared to write operations, and whether data is referenced randomly or sequentially.

Regarding read I/O operations, either random or sequential, there is generally no noteworthy difference between RAID 5, RAID 6, and RAID 10. When a DS8000 subsystem receives a read request from a host system, it first checks whether the requested data is already in cache. If the data is in cache (that is, a read cache hit), there is no need to read the data from disk, and the RAID level of the array does not matter at all. For reads that must actually be satisfied from disk (that is, from the array or back end), the performance of RAID 5, RAID 6, and RAID 10 is roughly equal, because the requests are spread evenly across all disks in the array. In RAID 5 and RAID 6 arrays, data is striped across all disks, so I/Os are spread across all disks. In RAID 10, data is striped and mirrored across two sets of disks, so half of the reads are processed by one set of disks and half by the other set, reducing the utilization of the individual disks.

Regarding random write I/O operations, the RAID levels vary considerably in their performance characteristics. With RAID 10, each write operation at the disk back end initiates two disk operations to the rank. With RAID 5, an individual random small block write operation to the disk back end typically causes a RAID 5 write penalty, which initiates four I/O operations to the rank: reading the old data and the old parity block before finally writing the new data and the new parity block. For RAID 6 with two parity blocks, the write penalty increases to six required I/O operations at the back end for a single random small block write operation.

Note that this assumption is a worst-case scenario that is quite helpful for understanding the back-end impact of random workloads with a given read:write ratio for the various RAID levels. It permits a rough estimation of the expected back-end I/O workload and helps to plan for the proper number of arrays. On a heavily loaded system, RAID 5 and RAID 6 arrays might actually take fewer I/O operations on average than expected, because the optimization of the queue of write I/Os waiting in cache for the next destage operation can lead to a high number of partial or even full stripe writes to the arrays, with fewer back-end disk operations required for the parity calculation.
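As a rough planning sketch based on the worst-case write penalties above (read cache hits and destage optimizations are ignored, so real systems typically need fewer back-end operations), the back-end I/O rate of a rank can be estimated from the host I/O rate and the read:write mix:

\[
\mathrm{IOPS_{back\ end}} \approx \mathrm{IOPS_{host}} \times \bigl( r + (1 - r) \times p \bigr),
\qquad p = \begin{cases} 2 & \text{RAID 10} \\ 4 & \text{RAID 5} \\ 6 & \text{RAID 6} \end{cases}
\]

where r is the read fraction. For example, 1000 host IOPS with 70% reads produce approximately 0.7 x 1000 + 0.3 x 1000 x 4 = 1900 back-end operations per second on a RAID 5 rank, but only 1300 on a RAID 10 rank.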
It is important to understand that on modern disk systems, such as the DS8000, write operations are generally cached by the storage subsystem and handled asynchronously, with short write response times for the attached host systems, so that any RAID 5 or RAID 6 write penalties are generally shielded from the attached host systems in terms of disk response time. Typically, a write request that is sent to the DS8000 subsystem is written into storage server cache and persistent cache, and the I/O operation is then acknowledged immediately to the host system as completed. As long as there is room in these cache areas, the response time seen by the application is only the time to get the data into the cache, and it does not matter whether RAID 5, RAID 6, or RAID 10 is used. However, if the host systems send data to the cache areas faster than the storage server can destage the data to the arrays (that is, move it from cache to the physical disks), the cache can occasionally fill up with no space for the next write request, and therefore the storage server will signal the host system to retry the I/O write operation. In the time that it takes the host system to retry the I/O write
operation, the storage server will likely have time to destage part of the data, providing free space in the cache and allowing the I/O operation to complete on the retry attempt.

When random small block write data is destaged from cache to disk, RAID 5 and RAID 6 arrays can experience a severe write penalty with four or six required back-end disk operations, while RAID 10 always requires only two disk operations per small block write request. Because RAID 10 performs only half the disk operations of RAID 5 for random writes, a RAID 10 destage completes faster and thereby reduces the busy time of the disk subsystem. So with steady and heavy random write workloads, the back-end write operations to the ranks (the physical disk drives) can become a limiting factor, so that only a RAID 10 configuration (instead of additional RAID 5 or RAID 6 arrays) will provide enough back-end disk performance at the rank level to meet the workload performance requirements.

While RAID 10 clearly outperforms RAID 5 and RAID 6 with regard to small block random write operations, RAID 5 and also RAID 6 show excellent performance with regard to sequential write I/O operations. With sequential write requests, all of the blocks required for the RAID 5 parity calculation can be accumulated in cache, and thus the destage operation with parity calculation can be done dynamically as a full stripe write without the need for additional disk operations to the array. So with only one additional parity block for a full stripe write (for example, seven data blocks plus one parity block for a 7+P RAID 5 array), RAID 5 requires fewer disk operations at the back end than RAID 10, which always requires twice the number of write operations due to data mirroring. RAID 6 also benefits from sequential write patterns, with most of the data blocks required for the double parity calculation staying in cache, which reduces the number of additional disk operations to the back end considerably. For sequential writes, a RAID 5 destage completes faster and thereby reduces the busy time of the disk subsystem.

Comparing RAID 5 to RAID 6, the performance of small block random reads and of sequential reads is roughly equal. Due to the higher write penalty, the RAID 6 small block random write performance is noticeably lower than with RAID 5. Also, the maximum sequential write throughput is slightly less with RAID 6 than with RAID 5 due to the additional second parity calculation. However, RAID 6 rebuild times are close to RAID 5 rebuild times (for the same size disk drive modules (DDMs)), because rebuild times are primarily limited by the achievable write throughput to the spare disk during data reconstruction. So, RAID 6 is mainly a significant reliability enhancement with a trade-off in random write performance. It is most effective for large capacity disks that hold mission-critical data and that are properly sized for the expected write I/O demand. Workload planning is especially important before implementing RAID 6 for write-intensive applications, including Copy Services targets and FlashCopy Space Efficient (SE) repositories.

RAID 10 is not as commonly used as RAID 5 for two key reasons. First, RAID 10 requires more raw disk capacity for every GB of effective capacity. Second, when you consider a standard workload with a typically high number of read operations and only a small amount of write operations, RAID 5 generally offers the best trade-off between overall performance and usable capacity.
In many cases, RAID 5 write performance is adequate, because disk systems tend to operate at I/O rates below their maximum throughputs, and differences between RAID 5 and RAID 10 will primarily be observed at maximum throughput levels. Consider using RAID 10 for critical workloads with a high percentage of steady random write requests, which can easily become rank-limited. Here, RAID 10 provides almost twice the throughput of RAID 5 (because of the write penalty). The trade-off for better performance with RAID 10 is about 40% less usable disk capacity. Larger drives can be used with RAID 10 to get the random write performance benefit while maintaining about the same usable capacity as a RAID 5 array with the same number of disks. The individual performance characteristics of the RAID arrays can be summarized as follows:
For read operations from disk, either random or sequential, there is no significant difference in RAID 5, RAID 6, and RAID 10 performance.
For random writes to disk, RAID 10 outperforms RAID 5 and RAID 6.
For random writes to disk, RAID 5 performs better than RAID 6.
For sequential writes to disk, RAID 5 tends to perform better.

Table 5-1 shows a short overview of the advantages and disadvantages for the RAID levels with regard to reliability, space efficiency, and random write performance.
Table 5-1 RAID-level comparison with regard to reliability, space efficiency, and write penalty

RAID level   Reliability (number of erasures)   Space efficiency (a)           Performance write penalty (number of disk operations)
RAID 5       1                                  6/8 to 7/8 of the drives       4
RAID 6       2                                  5/8 to 6/8 of the drives       6
RAID 10      At least 1                         3/8 to 4/8 of the drives       2

a. The space efficiency in this table is based on the number of disks remaining available for data storage. The actual usable, decimal capacities are up to 5% less.
In general, workloads that make effective use of storage subsystem cache for reads and writes see little difference between RAID 5 and RAID 10 configurations. For workloads that perform better with RAID 5, the difference in RAID 5 performance over RAID 10 is typically small. However, for workloads that perform better with RAID 10, the difference in RAID 10 performance over RAID 5 performance, or even RAID 6 performance, can be significant.

Because RAID 5, RAID 6, and RAID 10 basically perform equally well for both random and sequential read operations, RAID 5 and RAID 6 might be a good choice with regard to space efficiency and performance for standard workloads with a high percentage of read requests. RAID 6 offers a higher level of data protection than RAID 5, especially for large capacity drives, but the random write performance of RAID 6 is less due to the second parity calculation. Therefore, we highly recommend a proper performance sizing, especially for RAID 6. RAID 5 tends to have a slight performance advantage for sequential writes, whereas RAID 10 performs better for random writes. RAID 10 is generally considered to be the RAID type of choice for business-critical workloads with a high amount of random write requests (typically more than 35% writes) and low response time requirements.

For array rebuilds, RAID 5, RAID 6, and RAID 10 require approximately the same elapsed time, although RAID 5 and RAID 6 require significantly more disk operations and therefore are more likely to impact other disk activity on the same disk array. Note that you can select RAID types for each individual array site. So, you can select the RAID type based on the specific performance requirements of the data that will be located there. The best way to compare the performance of a given workload using RAID 5, RAID 6, or RAID 10 is to run a Disk Magic model. For additional information about the capabilities of this tool, refer to 7.1, Disk Magic on page 162.

For workload planning purposes, it might be convenient to have a general idea of the I/O performance that a single RAID array can provide. Figure 5-5 on page 85 and Figure 5-6 on page 86 show measurement results (see footnote 1) for a single array built from eight 146 GB 15k FC disk drives when configured as RAID 5, RAID 6, or RAID 10. These numbers are not
DS8000-specific, because they simply represent the limits that you can expect from a simple set of eight physical disks forming a RAID array.
[Chart: back-end response time versus IOps (0 to 2500) for a single rank configured as RAID5 (7+P), RAID6 (6+P+Q), and RAID10 (4x2)]
Figure 5-5 Single rank RAID-level comparison for a 4 KB random workload with 100% reads
For small block random read workloads, there is no significant performance difference between RAID 5, RAID 6, and RAID 10, as seen in Figure 5-5. Without taking any read cache hits into account, 1800 read IOPS with high back-end response times above 30 ms mark the upper limit of the capabilities of a single array for random read access using all of the available capacity of that array.

Small block random writes, however, make the difference between the various RAID levels with regard to performance. Even for a typical 70:30 random small block workload with 70% reads (no read cache hits) and 30% writes, as shown in Figure 5-6 on page 86, the different performance characteristics of the RAID levels already become evident. With an increasing amount of random writes, RAID 10 clearly outperforms RAID 5 and RAID 6. Here, for a standard random small block 70:30 workload, 1500 IOPS mark the upper limit of a RAID 10 array, 1100 IOPS for a RAID 5 array, and 900 IOPS for a RAID 6 array.

Note that in both Figure 5-5 and Figure 5-6 on page 86, no read cache hits have been considered. Furthermore, the I/O requests were spread across the entire available capacity of each RAID array. So, depending on the read cache hit ratio of a given workload and the capacity used on the array (using less capacity on an array simply means reducing disk arm movements and thus reducing average access times), you can expect typically lower overall response times and even higher I/O rates. Also, the read:write ratio, as well as the access pattern of a particular workload, either random or sequential, determine the achievable performance of a rank. Figure 5-5 and Figure 5-6 on page 86 (examples of small block random I/O requests) simply help to give you an idea of the performance capabilities of a single rank for different RAID levels.

1. The measurements were done with IOmeter (http://www.iometer.org) on Windows Server 2003 utilizing the entire available capacity on the array for I/O requests. The performance data contained herein was obtained in a controlled, isolated environment. Actual results that might be obtained in other operating environments can vary significantly. There is no guarantee that the same or similar results will be obtained elsewhere.
[Chart: back-end response time versus IOps (0 to 1800) for a single rank configured as RAID5 (7+P), RAID6 (6+P+Q), and RAID10 (4x2)]
Figure 5-6 Single rank RAID-level comparison for a 4 KB random workload with 70% reads 30% writes
Despite the different RAID levels and the actual workload pattern (read:write ratio, sequential or random access), it is also important to note that the maximum I/O rate per rank depends on the type of disk drives used. As a mechanical device, each disk drive is only capable of processing a limited number of random I/O operations per second, depending on the drive characteristics. So the mere number of disk drives used for a given amount of storage capacity finally determines the achievable random IOPS performance. The 15k drives offer approximately 30% more random IOPS performance than 10k drives.

As a general rule for random IOPS planning calculations, you can use 160 IOPS per 15k FC drive and 120 IOPS per 10k FC drive. Be aware that at these levels of disk utilization, you might already see elevated response times. So for excellent response time expectations, consider even lower IOPS limits.

Low spinning, large capacity FATA or SATA disk drives offer a considerably lower maximum random access I/O rate per drive (approximately half of a 15k FC drive). Therefore, they are only intended for environments with fixed content, data archival, reference data, or near-line applications that require large amounts of data at low cost and do not require drive duty cycles greater than 20%. Note that duty cycles smaller than 20% are enforced on FATA drives on the DS8000 for data protection reasons by throttling FATA drives if the duty cycle exceeds 20%.
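Combining this rule of thumb with the write penalties discussed earlier gives a rough, worst-case planning estimate of the host IOPS that a single 8-drive 7+P RAID 5 rank of 15k FC drives can sustain (cache hits and destage optimizations are ignored, so measured values such as those in Figure 5-6 are typically higher):

\[
\mathrm{IOPS_{host}} \approx \frac{n_{drives} \times \mathrm{IOPS_{drive}}}{r + (1 - r)\,p}
= \frac{8 \times 160}{0.7 + 0.3 \times 4} \approx 674
\]

The same calculation with p = 2 for a RAID 10 (4x2) rank yields approximately 985 host IOPS at a 70:30 read:write ratio.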
RAID 5: 6+P+S
RAID 6: 5+P+Q+S
RAID 10: 3x2+2S

And, other arrays do not contain any spares, such as:
RAID 5: 7+P
RAID 6: 6+P+Q
RAID 10: 4x2

This requirement essentially leads to arrays with different storage capacities and performance characteristics, although they have been created from array sites with identical disk types and RAID levels. The spares are assigned during array creation. Typically, the first arrays created from an unconfigured set of ranks on a given DA pair contain spare drives until the minimum requirements, as outlined in 4.1.4, Spare creation on page 44, are met.

With regard to the distribution of the spare drives, you might need to plan the sequence of array creation carefully if a mixture of RAID 5, RAID 6, and RAID 10 arrays is required on the same DA pair. Otherwise, you might not meet your initial capacity requirements and end up with more spare drives on the system than actually required, simply wasting storage capacity. For example, if you plan for two RAID 10 (3x2+2S) arrays on a given DA pair with homogeneous array sites, you might start with the creation of these arrays first, because these arrays will already reserve two spare drives per array, so that the final RAID 5 or RAID 6 arrays will not contain any spares. Or, if you prefer to obtain RAID 10 (4x2) arrays without spare drives, you can instead start with the creation of four RAID 5 or RAID 6 arrays, which then contain the required number of spare drives, before creating the RAID 10 arrays; a hedged command sketch follows.

In order to spread the available storage capacity, and thus the overall workload, evenly across both DS8000 processor complexes, you must assign an equal number of arrays containing spares to processor complex 0 (rank group 0) and processor complex 1 (rank group 1). Furthermore, note that performance can differ between RAID arrays containing spare drives and RAID arrays without spare drives, because the arrays without spare drives offer more storage capacity and also provide more active disk spindles for processing I/O operations.

Note: You must confirm spare allocation after array creation and make any necessary adjustments to the logical configuration plan before creating ranks and assigning them to extent pools.
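As an illustration only (the array site IDs are hypothetical, and the spare assignment is ultimately decided by the system), creating the RAID 10 arrays first so that they absorb the spares on a DA pair might look like this in the DSCLI:

dscli> mkarray -raidtype 10 -arsite S1
dscli> mkarray -raidtype 10 -arsite S2
dscli> mkarray -raidtype 5 -arsite S3
dscli> mkarray -raidtype 5 -arsite S4

Afterwards, verify the result with lsarray -l and confirm which arrays actually received the spare drives before you create the ranks.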
When creating arrays from array sites, it might help to order the array IDs by DA pair, array size (that is, arrays with or without spares), RAID level, or even disk type depending on the available hardware resources and workload planning considerations. The mapping of the array sites to particular DA pairs can be taken from the output of the DSCLI lsarraysite command as shown in Example 5-5 on page 87. Array sites are numbered starting with S1, S2, and so forth by the DS8000 microcode. Arrays are numbered with system-generated IDs starting with A0, A1, and so forth in the sequence that they are created.
Example 5-5 Array sites and DA pair association as taken from the DSCLI lsarraysite command
dscli> lsarraysite -l
Date/Time: 27 October 2008 17:45:59 CET IBM DSCLI Version: 5.4.2.257 DS: IBM.2107-75GB192
arsite DA Pair dkcap (10^9B) diskrpm State      array diskclass encrypt
=========================================================================
S1     2       146.0         15000   Unassigned       ENT       unsupported
S2     2       146.0         15000   Unassigned       ENT       unsupported
S3     2       146.0         15000   Unassigned       ENT       unsupported
S4     2       146.0         15000   Unassigned       ENT       unsupported
S5     2       146.0         15000   Unassigned       ENT       unsupported
S6     2       146.0         15000   Unassigned       ENT       unsupported
S7     2       146.0         15000   Unassigned       ENT       unsupported
S8     2       146.0         15000   Unassigned       ENT       unsupported
S9     6       146.0         15000   Unassigned       ENT       unsupported
S10    6       146.0         15000   Unassigned       ENT       unsupported
S11    6       146.0         15000   Unassigned       ENT       unsupported
S12    6       146.0         15000   Unassigned       ENT       unsupported
S13    6       146.0         15000   Unassigned       ENT       unsupported
S14    6       146.0         15000   Unassigned       ENT       unsupported
S15    6       146.0         15000   Unassigned       ENT       unsupported
S16    6       146.0         15000   Unassigned       ENT       unsupported
S17    7       146.0         15000   Unassigned       ENT       unsupported
S18    7       146.0         15000   Unassigned       ENT       unsupported
S19    7       146.0         15000   Unassigned       ENT       unsupported
S20    7       146.0         15000   Unassigned       ENT       unsupported
S21    7       146.0         15000   Unassigned       ENT       unsupported
S22    7       146.0         15000   Unassigned       ENT       unsupported
S23    7       146.0         15000   Unassigned       ENT       unsupported
S24    7       146.0         15000   Unassigned       ENT       unsupported
For configurations using only single-rank extent pools with the maximum control of volume placement and performance management, consider creating the arrays (A0, A1, and so forth) ordered by DA pair, which in most cases means simply following the sequence of array sites (S1, S2, and so forth) as shown in Example 5-6. But, note that the sequence of array sites is initially determined by the system and might not always strictly follow the DA pair order. Refer to Configuration technique for simplified performance management on page 112 for more information about this specific configuration strategy with single-rank extent pools and a hardware-related volume and logical subsystem (LSS)/logical control unit (LCU) ID configuration concept.
Example 5-6   Array ID sequence sorted by DA pair
dscli> lsarray -l
Date/Time: 24 October 2008 11:35:08 CEST IBM DSCLI Version: 5.4.2.257 DS: IBM.2107-75GB192
Array State      Data   RAIDtype    arsite rank DA Pair DDMcap (10^9B) diskclass encrypt
===========================================================================================
A0    Unassigned Normal 6 (5+P+Q+S) S1     -    2       146.0          ENT       unsupported
A1    Unassigned Normal 6 (5+P+Q+S) S2     -    2       146.0          ENT       unsupported
A2    Unassigned Normal 6 (5+P+Q+S) S3     -    2       146.0          ENT       unsupported
A3    Unassigned Normal 6 (5+P+Q+S) S4     -    2       146.0          ENT       unsupported
A4    Unassigned Normal 6 (6+P+Q)   S5     -    2       146.0          ENT       unsupported
A5    Unassigned Normal 6 (6+P+Q)   S6     -    2       146.0          ENT       unsupported
A6    Unassigned Normal 6 (6+P+Q)   S7     -    2       146.0          ENT       unsupported
A7    Unassigned Normal 6 (6+P+Q)   S8     -    2       146.0          ENT       unsupported
A8    Unassigned Normal 6 (5+P+Q+S) S9     -    6       146.0          ENT       unsupported
A9    Unassigned Normal 6 (5+P+Q+S) S10    -    6       146.0          ENT       unsupported
A10   Unassigned Normal 6 (5+P+Q+S) S11    -    6       146.0          ENT       unsupported
A11   Unassigned Normal 6 (5+P+Q+S) S12    -    6       146.0          ENT       unsupported
A12   Unassigned Normal 6 (6+P+Q)   S13    -    6       146.0          ENT       unsupported
A13   Unassigned Normal 6 (6+P+Q)   S14    -    6       146.0          ENT       unsupported
A14   Unassigned Normal 6 (6+P+Q)   S15    -    6       146.0          ENT       unsupported
A15   Unassigned Normal 6 (6+P+Q)   S16    -    6       146.0          ENT       unsupported
A16   Unassigned Normal 6 (5+P+Q+S) S17    -    7       146.0          ENT       unsupported
A17   Unassigned Normal 6 (5+P+Q+S) S18    -    7       146.0          ENT       unsupported
A18   Unassigned Normal 6 (5+P+Q+S) S19    -    7       146.0          ENT       unsupported
A19   Unassigned Normal 6 (5+P+Q+S) S20    -    7       146.0          ENT       unsupported
A20   Unassigned Normal 6 (6+P+Q)   S21    -    7       146.0          ENT       unsupported
A21   Unassigned Normal 6 (6+P+Q)   S22    -    7       146.0          ENT       unsupported
A22   Unassigned Normal 6 (6+P+Q)   S23    -    7       146.0          ENT       unsupported
A23   Unassigned Normal 6 (6+P+Q)   S24    -    7       146.0          ENT       unsupported
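As a minimal sketch only (assuming RAID 6 and the array site IDs shown above; the actual values depend on the installed hardware), the array ID sequence of Example 5-6 simply results from creating the arrays in ascending array site ID order:
dscli> mkarray -raidtype 6 -arsite S1     ## becomes A0 (DA pair 2)
dscli> mkarray -raidtype 6 -arsite S2     ## becomes A1 (DA pair 2)
...
dscli> mkarray -raidtype 6 -arsite S24    ## becomes A23 (DA pair 7)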
For initial configurations that use multi-rank extent pools, especially on a storage unit with a homogeneous hardware base (for example, a single type of DDM and only one RAID level) and resource-sharing workloads, consider configuring the arrays in a round-robin fashion across all available DA pairs by creating the first array from the first array site on the first DA pair, the second array from the first array site on the second DA pair, and so on. This sequence also sorts the arrays by array size (that is, arrays with or without spares), creating the smaller capacity arrays with spare drives first, as shown in Example 5-7. If the ranks are then created from the arrays in the same ascending ID sequence, the rank ID sequence will also cycle through all DA pairs in a round-robin fashion, as seen in Example 5-8 on page 90, which might enhance the distribution of volumes across ranks from different DA pairs within multi-rank extent pools. The creation of successive volumes (using the rotate volumes allocation method) or extents (using the rotate extents allocation method) within a multi-rank extent pool also follows the ascending numerical sequence of rank IDs.
Example 5-7   Array ID sequence sorted by array size (with and without spares) and cycling through all available DA pairs
dscli> lsarray -l
Date/Time: 24 October 2008 11:35:08 CEST IBM DSCLI Version: 5.4.2.257 DS: IBM.2107-75GB192
Array State      Data   RAIDtype    arsite rank DA Pair DDMcap (10^9B) diskclass encrypt
===========================================================================================
A0    Unassigned Normal 6 (5+P+Q+S) S1     -    2       146.0          ENT       unsupported
A1    Unassigned Normal 6 (5+P+Q+S) S9     -    6       146.0          ENT       unsupported
A2    Unassigned Normal 6 (5+P+Q+S) S17    -    7       146.0          ENT       unsupported
A3    Unassigned Normal 6 (5+P+Q+S) S2     -    2       146.0          ENT       unsupported
A4    Unassigned Normal 6 (5+P+Q+S) S10    -    6       146.0          ENT       unsupported
A5    Unassigned Normal 6 (5+P+Q+S) S18    -    7       146.0          ENT       unsupported
A6    Unassigned Normal 6 (5+P+Q+S) S3     -    2       146.0          ENT       unsupported
A7    Unassigned Normal 6 (5+P+Q+S) S11    -    6       146.0          ENT       unsupported
A8    Unassigned Normal 6 (5+P+Q+S) S19    -    7       146.0          ENT       unsupported
A9    Unassigned Normal 6 (5+P+Q+S) S4     -    2       146.0          ENT       unsupported
A10   Unassigned Normal 6 (5+P+Q+S) S12    -    6       146.0          ENT       unsupported
A11   Unassigned Normal 6 (5+P+Q+S) S20    -    7       146.0          ENT       unsupported
A12   Unassigned Normal 6 (6+P+Q)   S5     -    2       146.0          ENT       unsupported
A13   Unassigned Normal 6 (6+P+Q)   S13    -    6       146.0          ENT       unsupported
A14   Unassigned Normal 6 (6+P+Q)   S21    -    7       146.0          ENT       unsupported
A15   Unassigned Normal 6 (6+P+Q)   S6     -    2       146.0          ENT       unsupported
A16   Unassigned Normal 6 (6+P+Q)   S14    -    6       146.0          ENT       unsupported
A17   Unassigned Normal 6 (6+P+Q)   S22    -    7       146.0          ENT       unsupported
A18   Unassigned Normal 6 (6+P+Q)   S7     -    2       146.0          ENT       unsupported
A19   Unassigned Normal 6 (6+P+Q)   S15    -    6       146.0          ENT       unsupported
A20   Unassigned Normal 6 (6+P+Q)   S23    -    7       146.0          ENT       unsupported
A21   Unassigned Normal 6 (6+P+Q)   S8     -    2       146.0          ENT       unsupported
A22   Unassigned Normal 6 (6+P+Q)   S16    -    6       146.0          ENT       unsupported
A23   Unassigned Normal 6 (6+P+Q)   S24    -    7       146.0          ENT       unsupported
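By contrast, a sketch of the creation sequence behind Example 5-7 (again assuming RAID 6 and the array site to DA pair mapping of Example 5-5) cycles through the DA pairs, taking the next unused array site of each DA pair in turn:
dscli> mkarray -raidtype 6 -arsite S1     ## becomes A0 (DA pair 2)
dscli> mkarray -raidtype 6 -arsite S9     ## becomes A1 (DA pair 6)
dscli> mkarray -raidtype 6 -arsite S17    ## becomes A2 (DA pair 7)
dscli> mkarray -raidtype 6 -arsite S2     ## becomes A3 (DA pair 2)
dscli> mkarray -raidtype 6 -arsite S10    ## becomes A4 (DA pair 6)
dscli> mkarray -raidtype 6 -arsite S18    ## becomes A5 (DA pair 7)
...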
Note that depending on the installed hardware resources in the DS8000 storage subsystem, you might have different numbers of DA pairs and even different numbers of arrays per DA pair. Also, be aware that you might not be able to strictly follow your initial array ID numbering scheme anymore when upgrading storage capacity by adding array sites to the Storage Unit later.
The DSCLI command lsrank -l, as illustrated in Example 5-9 on page 91, shows the actual capacity of the ranks and, after their assignment to extent pools, the association to extent pools and rank groups. This information is important for subsequently configuring the extent pools with regard to the planned workload and capacity requirements. It is important to note that unassigned ranks do not have a fixed or predetermined relationship to any DS8000 processor complex. Each rank can be assigned to any extent pool or any rank group. Only when assigning a rank to an extent pool, and thus to rank group 0 or rank group 1, does the rank become associated with processor complex 0 or processor complex 1. Ranks from rank group 0 (even-numbered extent pools: P0, P2, P4, and so forth) are managed by processor complex 0, and ranks from rank group 1 (odd-numbered extent pools: P1, P3, P5, and so forth) are managed by processor complex 1. For a balanced distribution of the overall workload across both processor complexes, half of the ranks must be assigned to rank group 0 and half of the ranks must be assigned to rank group 1. Also, the ranks with and without spares must be spread evenly across both rank groups. Furthermore, it is important that the ranks from each DA pair are distributed evenly across both processor complexes; otherwise, you might seriously limit the available back-end bandwidth and thus the system's overall throughput. If, for example, all ranks of a DA pair are assigned to only one processor complex, only one DA card of the DA pair is used to access the set of ranks, and thus, only half of the available DA pair bandwidth is available.
Example 5-9   Rank, array, and capacity information provided by the DSCLI command lsrank -l
dscli> lsrank -l
Date/Time: 28 October 2008 14:28:24 CET IBM DSCLI Version: 5.4.2.257 DS: IBM.2107-75GB192
ID  Group State      datastate array RAIDtype extpoolID extpoolnam stgtype exts usedexts encryptgrp
=======================================================================================================
R0  -     Unassigned Normal    A0    6        -         -          fb      634  -        -
R1  -     Unassigned Normal    A1    6        -         -          fb      634  -        -
R2  -     Unassigned Normal    A2    6        -         -          fb      634  -        -
R3  -     Unassigned Normal    A3    6        -         -          fb      634  -        -
R4  -     Unassigned Normal    A4    6        -         -          fb      634  -        -
R5  -     Unassigned Normal    A5    6        -         -          fb      634  -        -
R6  -     Unassigned Normal    A6    6        -         -          fb      634  -        -
R7  -     Unassigned Normal    A7    6        -         -          fb      634  -        -
R8  -     Unassigned Normal    A8    6        -         -          fb      634  -        -
R9  -     Unassigned Normal    A9    6        -         -          fb      634  -        -
R10 -     Unassigned Normal    A10   6        -         -          fb      634  -        -
R11 -     Unassigned Normal    A11   6        -         -          fb      634  -        -
R12 -     Unassigned Normal    A12   6        -         -          fb      763  -        -
R13 -     Unassigned Normal    A13   6        -         -          fb      763  -        -
R14 -     Unassigned Normal    A14   6        -         -          fb      763  -        -
R15 -     Unassigned Normal    A15   6        -         -          fb      763  -        -
R16 -     Unassigned Normal    A16   6        -         -          fb      763  -        -
R17 -     Unassigned Normal    A17   6        -         -          fb      763  -        -
R18 -     Unassigned Normal    A18   6        -         -          fb      763  -        -
R19 -     Unassigned Normal    A19   6        -         -          fb      763  -        -
R20 -     Unassigned Normal    A20   6        -         -          fb      763  -        -
R21 -     Unassigned Normal    A21   6        -         -          fb      763  -        -
R22 -     Unassigned Normal    A22   6        -         -          fb      763  -        -
R23 -     Unassigned Normal    A23   6        -         -          fb      763  -        -
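As an illustration only (the pool names are hypothetical, no extent pools are assumed to exist yet, and the rank IDs are those of Example 5-9), a simple two extent pool configuration that alternates the ranks between both rank groups, and thus spreads every DA pair and the ranks with and without spares across both processor complexes, could be set up as follows:
dscli> mkextpool -rankgrp 0 -stgtype fb fb_pool_0    ## becomes P0, managed by processor complex 0
dscli> mkextpool -rankgrp 1 -stgtype fb fb_pool_1    ## becomes P1, managed by processor complex 1
dscli> chrank -extpool P0 R0    ## R0 (DA pair 2) to rank group 0
dscli> chrank -extpool P1 R1    ## R1 (DA pair 6) to rank group 1
dscli> chrank -extpool P0 R2    ## R2 (DA pair 7) to rank group 0
dscli> chrank -extpool P1 R3    ## R3 (DA pair 2) to rank group 1
...                             ## continue alternating up to R23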
The assignment of the ranks to extent pools, together with an appropriate concept for the logical configuration and volume layout, is the most essential step in optimizing overall subsystem performance. When an appropriate DS8000 hardware base has been selected for the planned workloads (that is, isolated and resource-sharing workloads), the next goal is to provide a logical configuration concept that largely ensures a balanced workload distribution across all available hardware resources within the storage subsystem at any time: from the beginning, when only part of the available storage capacity is used, up to the end, when almost all of the capacity of the subsystem is allocated. Next, we outline several concepts for the logical configuration for spreading the identified workloads evenly across the available hardware resources.
Single-rank extent pools provide an easy one-to-one mapping between ranks and extent pools. Because a volume is always created from a single extent pool, single-rank extent pools allow you to precisely control the volume placement across selected ranks and thus manually manage the I/O performance of the different workloads at the rank level. Furthermore, you can obtain the relationship of a volume to its extent pool by using the output of the DSCLI lsfbvol or lsckdvol command. Thus, with single-rank extent pools, there is a direct relationship between volumes and ranks based on the volume's extent pool, which makes performance management and analysis easier, especially with host-based tools, such as Resource Measurement Facility (RMF) on System z, and a preferred hardware-related assignment of LSS/LCU IDs. However, the administrative effort increases, because you have to create the volumes for a given workload in multiple steps from each extent pool separately when distributing the workload across multiple ranks. Furthermore, you choose a configuration design that limits the capabilities of a created volume to the capabilities of a single rank with regard to capacity and performance. With single-rank extent pools, a single volume cannot exceed the capacity or the I/O performance provided by a single rank. So, for demanding workloads, consider creating multiple volumes from different ranks and using host-based techniques, such as volume striping, to distribute the workload. You can also waste storage capacity and are likely to benefit less from features, such as dynamic volume expansion (DVE), if unused extents are left on ranks in different extent pools, because a single volume can only be created from extents within a single extent pool, not across extent pools. The decision to strictly use single-rank extent pools also limits the use of features, such as Storage Pool Striping or FlashCopy Space Efficiency, which exploit the capabilities of multiple ranks within a single extent pool.
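For illustration only (extent pool and volume IDs are hypothetical), distributing one workload across four ranks with single-rank extent pools therefore means creating the volumes in separate steps, one command per extent pool:
dscli> mkfbvol -extpool P0 -cap 100 -type ds -name appA_vol 1000
dscli> mkfbvol -extpool P1 -cap 100 -type ds -name appA_vol 1100
dscli> mkfbvol -extpool P2 -cap 100 -type ds -name appA_vol 1200
dscli> mkfbvol -extpool P3 -cap 100 -type ds -name appA_vol 1300
The four volumes can then be combined on the host, for example, with host-level striping, to spread the workload across the four underlying ranks.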
Multi-rank extent pools allow you to fully exploit the features of the DS8000's virtualization
architecture, providing ease of use and also a more efficient usage of all of the available storage capacity in the ranks. Consider multi-rank extent pools especially for workloads that are to be evenly spread across multiple ranks. The DS8000 has always supported multi-rank extent pools with constantly developing volume allocation algorithms in a history of regular performance and usability enhancements. Multi-rank extent pools help to simplify management and volume creation, and they also allow the creation of single volumes that can span multiple ranks and thus even exceed the capacity and performance limits of a single rank. With a properly planned concept for the extent pools and a reasonable volume layout with regard to the various workloads and the workload planning principles outlined in 5.1, Basic configuration principles for optimal performance on page 64, the latest DS8000 volume allocation algorithms, such as rotate volumes (-eam rotatevols) and rotate extents (-eam rotateexts) take care of spreading the volumes and thus the individual workloads evenly across the ranks within homogeneous multi-rank extent pools. Multi-rank extent pools using Storage Pool Striping reduce the level of complexity for standard performance and configuration management by shifting the overall effort from managing a large number of individual ranks (micro-performance management) to a small number of multi-rank extent pools (macro-performance management). In most standard cases, manual allocation of ranks or even the use of single-rank extent pools is obsolete, because it only achieves the same result as multi-rank extent pools using the rotate volumes algorithm, but with higher administrative effort and the limitations for single-rank extent pools that we previously outlined. Especially when using homogeneous extent pools, which strictly contain only identical ranks of the same RAID level, DDM type, and capacity, together with standard volume sizes, multi-rank extent pools can help to
considerably reduce management efforts while still achieving a well balanced distribution of the volumes across the ranks. Furthermore, even multi-rank extent pools provide full control of volume placement across the ranks in cases where it is necessary to manually enforce a special volume allocation scheme. You can use the DSCLI command chrank -reserve to reserve all of the extents from a rank in an extent pool from being used for the next creation of volumes. Alternatively, you can use the DSCLI command chrank -release to release a rank and make the extents available again. The major drawback when using multi-rank extent pools compared to single-rank extent pools with regard to performance monitoring and analysis is a slightly higher effort in figuring out the exact relationship of volumes to ranks, because volumes within a multi-rank extent pool can be located on different ranks depending on the extent allocation method that is used and the availability of extents on the ranks during volume creation. The extent pool ID alone, which is given by the output of the lsfbvol or lsckdvol command, generally is insufficient to tell which ranks contribute extents to a given volume. While single-rank extent pools offer a direct relationship between volume, extent pool, and rank due to the one-to-one mapping of ranks to extent pools, you must use the DSCLI commands showfbvol or showckdvol -rank or showrank with multi-rank extent pools in order to determine the location of the volumes on the ranks. The showfbvol or showckdvol -rank command (Example 5-10) lists all of the ranks that contribute extents to a specific volume and the showrank command (Example 5-11 on page 95) reveals a list of all of the volumes that use extents from the specific rank. When gathering the logical configuration of a whole subsystem, you might prefer the use of the showrank command for each rank, because there are typically considerably fewer ranks than volumes on a DS8000 subsystem. Using a showfbvol -rank or showckdvol -rank command for each volume on a DS8000 can take a considerable amount of time and might be appropriate when investigating the particular extent distribution for the individual volumes in question.
Example 5-10   Use of showfbvol -rank command to relate volumes to ranks when using rotate extents
dscli> showfbvol -rank 1a10
Date/Time: 05 November 2008 11:33:52 CET IBM DSCLI Version: 5.4.2.257 DS: IBM.2107-75GB192
Name             w2k_1A10
ID               1A10
accstate         Online
datastate        Normal
configstate      Normal
deviceMTM        2107-900
datatype         FB 512
addrgrp          1
extpool          P2
exts             192
captype          DS
cap (2^30B)      192.0
cap (10^9B)      -
cap (blocks)     402653184
volgrp           V0
ranks            6
dbexts           0
sam              Standard
repcapalloc      -
eam              rotateexts      ## Volume 1A10 uses Storage Pool Striping
reqcap (blocks)  402653184
==============Rank extents==============
rank extents
============
R2   32
R3   32
R4   32
R5   32
R6   32
R7   32
## Volume 1A10 has 32 extents on ranks R2, R3, R4, R5, R6 and R7
Example 5-11   Use of showrank command to relate volumes to ranks in multi-rank extent pools
dscli> showrank r2
Date/Time: 05 November 2008 11:34:06 CET IBM DSCLI Version: 5.4.2.257 DS: IBM.2107-75GB192
ID         R2
SN         -
Group      0
State      Normal
datastate  Normal
Array      A18
RAIDtype   6
extpoolID  P2
extpoolnam fb_146GB15k_RAID6_SPS_0
volumes    1A00,1A10      ## Volumes 1A00 and 1A10 have extents on rank R2
stgtype    fb
exts       763
usedexts   64
widearrays 1
nararrays  0
trksize    128
strpsize   384
strpesize  0
extsize    16384
encryptgrp -
dscli> showrank r3
Date/Time: 05 November 2008 11:34:11 CET IBM DSCLI Version: 5.4.2.257 DS: IBM.2107-75GB192
ID         R3
SN         -
Group      0
State      Normal
datastate  Normal
Array      A19
RAIDtype   6
extpoolID  P2
extpoolnam fb_146GB15k_RAID6_SPS_0
volumes    1A01,1A10      ## Volumes 1A01 and 1A10 have extents on rank R3,
stgtype    fb             ## so Volume 1A10 has extents on ranks R2 and R3
exts       763
usedexts   64
widearrays 1
nararrays  0
trksize    128
strpsize   384
strpesize  0
extsize    16384
encryptgrp -
Single-rank extent pools originally were recommended primarily for reasons of performance management, especially in conjunction with the initially released DS8000 extent allocation method, which followed a simple Fill and Spill algorithm. Therefore, in configurations where performance was a major concern, single-rank extent pools were preferred for providing full control of volume placement, as well as performance management. However, single-rank extent pools implicitly require operating system (OS) striping or database (DB) striping to prevent hot volumes. Single-rank extent pools are no guarantee against hot spots. They are just a tool to facilitate strict host-based striping techniques, and they require careful performance planning. Multi-rank extent pools have always offered advantages with respect to ease of use and space efficiency. And especially with the latest rotate extents algorithm, multi-rank extent pools provide both ease of use and good performance for standard environments, and therefore, they are a good choice to start with for workload groups that have a sufficient number of ranks dedicated to them.
Fill and Spill (now referred to as legacy with the lsfbvol command)
LUNs are created on the first rank in the extent pool until all extents are used, and then volume creation continues on the next rank in the extent pool. This initial allocation method does not lead to a balanced distribution of the volumes across multiple ranks in an extent pool.
Most Empty (now referred to as legacy with the lsfbvol command)
Each new LUN is created on the rank (in the specified extent pool) with the largest total number of available extents. If more than one rank in the specified extent pool has the same total number of free extents, the volume is allocated on the rank with the lowest rank ID (Rx). If the required volume capacity is larger than the number of free extents on any single rank, volume allocation begins on the rank with the largest total number of free extents and continues on the next rank in ascending numerical sequence of rank IDs (Rx). All extents for a volume are on a single rank unless the volume is larger than the size of a rank or the volume starts towards the end of one rank and spills over onto another rank. If all ranks in the extent pool have the same amount of available extents and if LUNs of the same size are created, a balanced distribution of the volumes across all ranks in ascending rank ID sequence can be achieved.
Rotate LUNs/Rotate Volumes (rotatevols)
This more advanced volume allocation algorithm ensures more strictly that successive LUN allocations to a multi-rank extent pool are assigned to different ranks by using an internal pointer, which points to the next rank within the extent pool to start with when creating the next volume. This algorithm especially improves the LUN distribution across the ranks within a multi-rank extent pool independent of LUN sizes or the available free capacity on the ranks.
Rotate extents (rotateexts, which is also referred to as Storage Pool Striping)
In addition to the rotate volumes extent allocation method, which remains the default, the new rotate extents algorithm is introduced as an additional option of the mkfbvol command (mkfbvol or mkckdvol -eam rotateexts), which evenly distributes the extents of a single volume across all the ranks within a multi-rank extent pool. This new algorithm, which is also known as Storage Pool Striping (SPS), provides the maximum granularity available on the DS8000 (that is, on the extent level = 1 GB for FB volumes and 0.94 GB or 1113 cylinders for CKD volumes), spreading each single volume across multiple ranks and thus evenly balancing the workload within an extent pool.
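A short sketch of the difference on the command line (extent pool and volume IDs are hypothetical; rotate volumes is the default, and rotate extents must be requested explicitly):
dscli> mkfbvol -extpool P2 -cap 50 -type ds -name appB_vol 1a20-1a23
## default (rotate volumes): successive volumes are placed on different ranks of P2
dscli> mkfbvol -extpool P2 -cap 50 -type ds -name appB_vol -eam rotateexts 1a24-1a27
## rotate extents (Storage Pool Striping): the extents of each volume are spread across the ranks of P2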
The second generation algorithm is called the Most Empty algorithm, which was introduced with DS8000 code level 6.0.500.46 (August 2005). Each new volume was created on whichever rank in the specified extent pool happened to have the largest total number of available extents. If more than one rank in the specified extent pool had the same total number of free extents, the volume was allocated to the rank with the lowest rank ID (Rx). If the required volume capacity was larger than the number of free extents on any single rank, volume allocation began on the rank with the largest total number of free extents and
continued on the next rank in ascending numerical sequence of rank IDs (Rx). All extents for a volume were on a single rank unless the volume was larger than the size of a rank or the volume started toward the end of one rank and spilled over onto another rank. If all ranks in the extent pool had the same amount of available extents, and if multiple volumes of the same capacity were created, they were allocated on different ranks in ascending rank ID sequence. With DS8000 code level 6.2.420.21 (September 2006), the algorithm was further improved and finally replaced by the third volume allocation algorithm called rotate volumes. New volumes now are allocated to ranks in a round-robin fashion, as long as the rank has enough available extents. Typically, volumes have a relationship to a single rank unless the capacity of available extents on a single rank is exceeded. In this case, the allocation continues on subsequent ranks until the volume is fully provisioned. In most respects, this algorithm looks similar to the second algorithm. However, it avoids stacking many small capacity LUNs that were created in sequence to a single rank. The algorithm now more strictly ensures that successive LUN allocations to a multi-rank extent pool are assigned to different ranks by using an internal pointer which points to the next rank within the extent pool to be used when creating the next volume. It especially improves the LUN distribution across the ranks within a multi-rank extent pool when different LUN sizes are used. With DS8000 code level 63.0.102.0 (December 2007), the fourth and latest volume allocation algorithm called rotate extents or Storage Pool Striping was introduced as an option in addition to the default rotate volumes algorithm. It tries to evenly distribute the extents of a single volume across all the ranks within a multi-rank extent pool in a round-robin fashion. If a rank runs out of available extents, it is skipped. Also, the next new volume to be allocated will start on a different rank from the starting rank of the previous volume, provided there is another rank with available extents. This algorithm further ensures that volumes start on different ranks. Where the volumes end depends on the volume size and the number of ranks containing available extents. This new algorithm provides the maximum granularity available on the DS8000 to spread single volumes across several ranks and to evenly balance the workload within an extent pool.
Prior to the introduction of Storage Pool Striping on the DS8000, the maximum I/O performance of a single DS8000 LUN was simply limited by the I/O performance of the underlying rank, because a LUN was generally located on a single rank only. Using host-level striping with LUNs created from several ranks was the recommended way to achieve a single logical volume on the attached host system that was capable of a considerably higher random I/O performance. Now, with the new rotate extents algorithm, even single LUNs can be created that deliver the I/O performance of multiple ranks, taking advantage of all the available disk spindles within an extent pool. When Storage Pool Striping is used, you typically expect the extents, and thus the workload, to be evenly distributed across all ranks within an extent pool, so generally a volume simply can be related to an extent pool again, which in this case represents a set of evenly used ranks instead of only a single rank. So with Storage Pool Striping, the level of depth for standard performance management and analysis is shifted from a large number of individual ranks (micro-performance management) to a small number of extent pools (macro-performance management), which considerably reduces the overall management effort. However, if an extent pool is not homogeneous and it is created from ranks of different capacities (simply due to the use of RAID arrays with and without spares), a closer investigation of the individual ranks in an extent pool and the distribution of extents across the ranks for given volumes might be required with regard to performance management. Certain volumes that were created from the last available extents in this type of an extent pool might only be spread across a smaller number of large capacity ranks in the extent pool. In this case, the DSCLI commands showfbvol -rank, showckdvol -rank, or showrank help to determine the association of volumes to ranks. Furthermore, the DS8000 only provides performance metrics on the I/O port, rank, and volume level. There are no DS8000 performance metrics available for extents to provide a hot spot analysis on the extent level. So when using rotate extents as the volume allocation algorithm, where the extents of each volume are spread across multiple ranks, you cannot tell how much I/O workload a certain extent or even a volume contributes to a specific rank. Typically, the workload is well balanced across the ranks with rotate extents, so that a single rank becoming a hot spot within a multi-rank extent pool is highly unlikely.
Note: The extents for a single volume are not spread across ranks in a multi-rank extent pool by default. You need to manually specify the -eam rotateexts option of the mkfbvol or mkckdvol command in order to spread the extents of a volume across multiple ranks in an extent pool.
Certain application environments might particularly benefit from the use of Storage Pool Striping.
Examples for such environments include:
- Operating systems that do not directly support host-level striping
- VMware datastores
- Microsoft Exchange 2003 or Exchange 2007 databases
- Windows clustering environments
- Older Solaris environments
- Environments that need to sub-allocate storage from a large pool
- Resource-sharing workload groups dedicated to a large number of ranks with a variety of different host operating systems, which do not all commonly use or even support host-level striping techniques or application-level striping techniques
- Applications with multiple volumes and volume access patterns that differ from day to day
There also are many valid reasons for not using Storage Pool Striping, mainly to avoid unnecessary additional layers of striping and reorganizing I/O requests, which might only increase latency and do not actually help you to achieve a more evenly balanced workload distribution. Multiple independent striping layers might even be counterproductive under certain circumstances. For example, we do not recommend that you create a number of volumes from a single multi-rank extent pool using Storage Pool Striping and then, additionally, use host-level striping or application-based striping on the same set of volumes. In this case, two layers of striping are combined with no overall performance benefit at all. In contrast, creating four volumes from four different extent pools from both rank groups using Storage Pool Striping and then using host-based striping or application-based striping on these four volumes to aggregate the performance of the ranks in all four extent pools and both processor complexes is reasonable.
Examples where you might consider single-rank extent pools or multi-rank extent pools using the default rotate volumes algorithm (with volumes assigned to distinct ranks) include:
- SAN Volume Controller (SVC): It is preferable to use SVC with dedicated LUN to rank associations and a small number of LUNs in each rank (for example, one or two volumes per rank). These LUNs become SVC managed disks (MDisks) in MDisk groups. OS logical volumes are virtual disks (VDisks) sub-allocated from MDisk groups, usually striping allocation extents across all the MDisks in the MDisk group. There are many similarities between the design of SVC MDisk groups and DS8000 striped storage pools; however, SVC provides more granular and even customizable stripe sizes than the DS8000.
- System i: System i controls its own striping and has its own recommendations about how to allocate storage volumes. So there are no common recommendations to use Storage Pool Striping, although there is likely no adverse consequence in using it. For large System i installations with a large number of ranks, it might be a valid option to use Storage Pool Striping with two or more identical ranks within an extent pool simply to reduce the overall management effort.
- System z: System z also controls its own striping and thus is not dependent on striping at the subsystem level. Furthermore, it has its own recommendations about how to allocate storage volumes. Often, multiple successive volumes are created on single ranks with a common LCU ID assignment scheme in place that is related to physical ranks. So here, single-rank extent pools and the use of System z storage management subsystem (SMS) striping might offer a benefit with regard to configuration and performance management. With a direct relation of volumes to ranks and a reasonable strategy for LCU and volume ID numbering, performance management and analysis can be done more easily just with native host-based tools, such as Resource Measurement Facility (RMF), without the need for additional DSCLI outputs in order to relate volumes to ranks. However, for System z installations with a large number of ranks, it might still be a valid option to use Storage Pool Striping with two or more ranks of the same type within an extent pool simply to reduce the overall administration effort by shifting management from rank level to extent pool level, with an extent pool simply representing a set of aggregated ranks with an even workload distribution.
Refer to 15.8.2, Extent pool on page 461 for more information about System z-related recommendations with regard to multi-rank extent pools.
- Database volumes: If these volumes are used by databases or applications that explicitly manage the workload distribution by themselves, these applications might achieve maximum performance simply by using their native techniques for spreading their workload across independent LUNs from different ranks. Especially with IBM DB2 or Oracle where the vendor recommends specific volume configurations, for example, DB2 balanced configuration units (BCUs) or Oracle Automatic Storage Management (ASM), it is preferable to simply follow those recommendations.
- Applications that have evolved particular storage strategies over a long period of time, which have proven their benefits, and where it is not clear whether they will additionally benefit from using Storage Pool Striping. When in doubt, simply follow the vendor recommendations.
Note: For environments where dedicated volume-to-rank allocations are preferred or even required, you can use either single-rank or multiple-rank extent pools if the workloads need to be spread across multiple ranks with successive volume IDs. Multi-rank extent pools using the default rotate volumes extent allocation method together with a carefully planned volume layout will in most cases achieve the same volume distribution with less administration effort than can be achieved with single-rank extent pools.
DS8000 Storage Pool Striping is based on spreading extents across different ranks. So with extents of 1 GB (FB) or 0.94 GB (1113 cylinders/CKD), the size of a data chunk is rather large. For distributing random I/O requests, which are evenly spread across each volume's capacity, this chunk size generally is quite appropriate. However, depending on the individual access pattern of a given application and the distribution of the I/O activity across the volume capacity, certain applications might provide a higher overall performance with more granular stripe sizes, optimizing the distribution of their I/O requests across different RAID arrays by using host-level striping techniques or by having the application manage the workload distribution across independent volumes from different ranks.
Additional considerations for the use of Storage Pool Striping for selected applications or environments include:
- DB2: Excellent opportunity to simplify storage management using Storage Pool Striping. You probably will still prefer to use DB2 traditional recommendations for DB2 striping for performance sensitive environments.
- DB2 and similar data warehouse applications, where the database manages storage and parallel access to data: Here, consider generally independent volumes on individual ranks with a careful volume layout strategy that does not use Storage Pool Striping. Containers or database partitions are configured according to recommendations from the database vendor.
- Oracle: Excellent opportunity to simplify storage management for Oracle. You will probably prefer to use Oracle traditional recommendations involving ASM and Oracle's striping capabilities for performance sensitive environments.
- Small, highly active logs or files: Small, highly active files or storage areas smaller than 1 GB with an extraordinarily high access density might require spreading across multiple ranks for performance reasons. However, Storage Pool Striping only offers a striping granularity on the extent level of around 1 GB, which is too large in this case. Here, continue to exploit host-level striping techniques or application-level striping techniques that support smaller stripe sizes. For example, assume that there is a 0.8 GB log file with extreme write activity, and you want to spread this log file across several RAID arrays. Assume that you intend to spread its activity across four ranks. At least four 1 GB extents must be allocated, one extent on each rank (which is the smallest possible allocation). Creating four separate volumes, each with a 1 GB extent from a different rank, and then using Logical Volume Manager (LVM) striping with a relatively small stripe size (for example, 16 MB) effectively distributes the workload across all four ranks. Creating a single LUN of four extents, which is also distributed across the four ranks using DS8000 Storage Pool Striping, cannot effectively spread the file's workload evenly across all four ranks simply due to the large stripe size of one extent, which is larger than the actual size of the file.
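A sketch of the DSCLI part of this log file example (assuming four extent pools P0 - P3 that map to four different ranks; the pool and volume IDs are hypothetical):
dscli> mkfbvol -extpool P0 -cap 1 -type ds -name log_vol 0000
dscli> mkfbvol -extpool P1 -cap 1 -type ds -name log_vol 0100
dscli> mkfbvol -extpool P2 -cap 1 -type ds -name log_vol 0200
dscli> mkfbvol -extpool P3 -cap 1 -type ds -name log_vol 0300
Each volume consists of a single 1 GB extent from a different rank; the host then stripes across these four LUNs with a small stripe size (for example, 16 MB) using LVM striping.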
- Tivoli Storage Manager storage pools: Tivoli Storage Manager storage pools work well in striped pools. But in adherence to long-standing Tivoli Storage Manager recommendations, the Tivoli Storage Manager databases need to be allocated in a separate pool or pools.
- AIX volume groups (VGs): LVM and physical partition (PP) striping continue to be powerful tools for managing performance. In combination with Storage Pool Striping, considerably fewer stripes are now required for common environments. Instead of striping across a large set of volumes from many ranks (for example, 32 volumes from 32 ranks), striping is only required across a small number of volumes from just a small set of different multi-rank extent pools from both DS8000 rank groups using Storage Pool Striping (for example, four volumes from four extent pools, each with eight ranks). For specific workloads, even using the advanced AIX LVM striping capabilities with a smaller granularity on the KB or MB level, instead of Storage Pool Striping with 1 GB extents (FB), might be preferable in order to achieve the highest possible level of performance.
- Windows volumes: Typically, only a small number of large LUNs per host system are preferred, and host-level striping is not commonly used. So, basically, Storage Pool Striping is an ideal option for Windows environments. It easily allows the creation of single, large capacity volumes that offer the performance capabilities of multiple ranks. A single volume no longer is limited by the performance limits of a single rank, and the DS8000 simply handles spreading the I/O load across multiple ranks.
- Microsoft Exchange: Storage Pool Striping makes it much easier for DS8000 to conform to Microsoft sizing recommendations for Microsoft Exchange databases and logs.
- Microsoft SQL Server: Storage Pool Striping makes it much easier for DS8000 to conform to Microsoft sizing recommendations for Microsoft SQL Server databases and logs.
- VMware datastores for virtual machine storage (VMware ESX Server Filesystem (VMFS) or virtual raw device mapping (RDM) access): Because datastores concatenate LUNs rather than striping them, just allocate the LUNs inside a striped storage pool. Estimating the number of disks (or ranks) to support any given I/O load is straightforward based on the given requirements.
In general, Storage Pool Striping helps to improve overall performance and reduce the effort of performance management by evenly distributing workloads across a larger set of ranks, reducing skew and hot spots. Certain application workloads can also benefit from the higher number of disk spindles behind only a single volume. But there are cases where host-level striping or application-level striping might achieve even higher performance, of course at the cost of higher overall administration effort. Storage Pool Striping still might deliver good performance in these cases, but manual striping with careful configuration planning is required to achieve the best possible performance. So with regard to overall performance and ease of use, Storage Pool Striping might still offer an excellent compromise for many environments, especially for larger workload groups where host-level striping techniques or application-level striping techniques are not widely used or not even available.
Note: Business and performance critical applications always require careful configuration planning and individual decisions on a case by case basis about whether to use Storage Pool Striping or LUNs from dedicated ranks together with host-level striping techniques or application-level striping techniques for the best performance.
Storage Pool Striping is best suited for completely new extent pools. Adding new ranks to an existing extent pool will not restripe volumes (LUNs) that are already allocated in that pool. So, adding single ranks to an extent pool that uses Storage Pool Striping when running out of available capacity simply undermines the concept of Storage Pool Striping and easily leads to hot spots on the added ranks. Thus, you need to perform capacity planning
for Storage Pool Striping at the extent pool level and not at the rank level. In order to upgrade capacity on a DS8000 using Storage Pool Striping, simply add new extent pools (preferably in groups of two: one for each rank group or processor complex) with a specific number of ranks per extent pool based on your individual configuration concept (for example, using multi-rank extent pools with four to eight ranks).
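A sketch of such a capacity upgrade (the pool names, the resulting pool IDs, and the new rank IDs are hypothetical and depend on what is already configured):
dscli> mkextpool -rankgrp 0 -stgtype fb fb_exp_0    ## new pool for rank group 0, for example, P8
dscli> mkextpool -rankgrp 1 -stgtype fb fb_exp_1    ## new pool for rank group 1, for example, P9
dscli> chrank -extpool P8 R24
dscli> chrank -extpool P9 R25
dscli> chrank -extpool P8 R26
dscli> chrank -extpool P9 R27
...                                                 ## assign the new ranks evenly to both new pools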
As the first step, you can visualize the rank and DA pair association in a simple spreadsheet based on the graphical scheme given in Figure 5-7.
Figure 5-7 Basic scheme for rank and DA pair association with regard to extent pool planning
This example represents a homogeneously configured DS8100 with four DA pairs and 32 ranks all configured to RAID 5. Based on the specific DS8000 hardware and rank configuration, the scheme typically becomes more complex with regard to the number of DA pairs, ranks, different RAID levels, disk drives, spare distribution, and storage types. Based on this scheme, you can easily start planning an initial assignment of ranks to your planned workload groups, either isolated or resource-sharing, and extent pools with regard to your capacity requirements as shown in Figure 5-8.
Figure 5-8 Initial spreadsheet for workload assignments to ranks and extent pools with regard to capacity requirements (workload A = OLTP main business application, workload B = various Open Systems workloads, workload C = isolated System z workload)
After this initial assignment of ranks to extent pools and appropriate workload groups, you can create additional spreadsheets to hold more details about the logical configuration and finally the volume layout with regard to array site IDs, array IDs, rank IDs, DA pair association, extent pool IDs, and even volume IDs, as well as their assignment to volume groups and host connections as, for example, shown in Figure 5-9 on page 106.
Figure 5-9 Example of a detailed spreadsheet for planning and documenting the logical configuration
Figure 5-10 Example of a set of uniformly configured ranks with their association to DA pairs
The minimum recommended number of extent pools in this case is two (P0 and P1) on a uniformly equipped subsystem in order to spread the workload evenly across both DS8000 processor complexes as shown in Figure 5-11 on page 108. Note that here you have two extent pools, which are equal in available capacity, but contain ranks with different numbers of extents per rank. So here, you need to be aware that the last volumes will only be created from extents of the large capacity ranks as soon as the capacity of the small capacity ranks is exceeded. Two large extent pools might be a good choice when planning to use the rotate volumes extent allocation method with dedicated volumes on multiple ranks and a standard volume size as shown in Figure 5-11 on page 108. A two extent pool configuration might be convenient if FlashCopy SE is not used and all workloads are meant to share the same resources (workload resource-sharing). Here, simply distributing the ranks from each DA pair evenly across extent pools P0 and P1 offers all of the flexibility as well as ease of use. Using a standard volume size and aligning the number of volumes of a given workload to the number of ranks within the dedicated extent pools will help achieve a balanced workload distribution. The use of host-level striping or application-level striping will further optimize workload distribution evenly across the hardware resources. However, with regard to the different rank sizes and due to the distribution of spare drives (6+P+S and 7+P arrays for RAID 5 in this example), you might also consider using four strictly homogeneous extent pools (P0, P1, P2, and P3), where each extent pool has only ranks of the same capacity as shown in Figure 5-11 on page 108. Note that with homogeneous RAID 5 and RAID 6 configurations, you might have four arrays with spares on a DA pair, but only two with a RAID 10 configuration. So with RAID 10 configurations, there might not be enough 3x2+2S arrays available to use only homogeneous extent pools. Of course, your configuration also depends on the number of available arrays, as well as the number of ranks required per extent pool. In either case, try to spread ranks from each available DA pair evenly across all extent pools, so that the overall workload is spread evenly across all DA pairs. Be aware that you still need to manually distribute the volumes of each application workload evenly across all extent pools dedicated to this workload group. Furthermore, when using Storage Pool Striping, typically, you plan for a total of four to eight ranks per extent pool for an optimum performance benefit.
Figure 5-11 Configuration with two extent pools (P0, P1) and configuration with four extent pools (P0 - P3)
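Referring to the four extent pool variant in Figure 5-11, a hedged sketch of its creation (the pool names are hypothetical, no other extent pools are assumed to exist yet, and the rank IDs of the 6+P+S and 7+P ranks are illustrative) might look like this:
dscli> mkextpool -rankgrp 0 -stgtype fb fb_6ps_0    ## becomes P0: 6+P+S ranks, rank group 0
dscli> mkextpool -rankgrp 1 -stgtype fb fb_6ps_1    ## becomes P1: 6+P+S ranks, rank group 1
dscli> mkextpool -rankgrp 0 -stgtype fb fb_7p_0     ## becomes P2: 7+P ranks, rank group 0
dscli> mkextpool -rankgrp 1 -stgtype fb fb_7p_1     ## becomes P3: 7+P ranks, rank group 1
dscli> chrank -extpool P0 R0     ## 6+P+S rank from DA2
dscli> chrank -extpool P1 R1     ## 6+P+S rank from DA0
dscli> chrank -extpool P2 R16    ## 7+P rank from DA2
dscli> chrank -extpool P3 R17    ## 7+P rank from DA0
...                              ## continue so that each pool receives ranks from every DA pair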
Homogeneous extent pools with ranks of the same capacity offer the best basis for the DS8000 volume allocation algorithms in order to achieve a strictly balanced distribution of the volumes across all ranks within an extent pool, especially with standard volume sizes. Homogeneous extent pools with ranks of the same capacity, however, also lead to extent pools with different amounts of available capacity. On the other hand, they reduce management efforts by providing identical performance characteristics for all of the volumes created from the same extent pool up to the allocation of the last extents, especially when using rotate extents as the preferred extent allocation method. Alternatively, if having extent pools that are equal in size is a major concern, consider four non-homogeneous extent pools with a mixed number of ranks with and without spares as shown in Figure 5-12 on page 109. In this case, however, additional management effort and care are required to control the volume placement when the capacity of the smaller ranks is exceeded, especially when using the rotate extents allocation method. The administrator needs to be aware that the volumes that are created from the last available extents provide lower performance, because they are only distributed across a smaller number of large capacity ranks and thus use fewer disk spindles when compared to the initially created volumes that span all ranks. Because there is no warning message when this condition applies, one method to control the usage of these final extents in non-homogeneous extent pools, and to reserve them for less demanding applications, is to initially create dummy volumes on these large capacity ranks by using the chrank -reserve and chrank -release DSCLI commands, thus reserving these additional extents.
Figure 5-12 Alternate example of a four extent pool configuration with equal extent pool capacities
You can achieve a homogeneous extent pool configuration with extent pools of different rank capacities simply by following these steps after the initial creation of the extent pools (a condensed command sequence follows the list):
1. Identify the number of available extents on the ranks and the assignment of the ranks to the extent pools from the output of the lsrank -l command. Calculate the amount of extents that make up the difference between the small and large ranks.
2. Use the DSCLI command chrank -reserve against all smaller (for example, 6+P+S) ranks in order to reserve all extents on these ranks within each extent pool from being used for the creation of the dummy volumes in the next step.
3. Now, create a number of dummy volumes using the mkfbvol command from each extent pool according to the number of large ranks in the extent pool and the additional capacity of these ranks in comparison to the smaller arrays. For example, with 16 ranks in extent pool P0 as shown in Figure 5-11 on page 108, you have eight small capacity (6+P+S) ranks and eight large capacity (7+P) ranks. With 73 GB disk drives, you get 388 extents per 6+P+S rank and 452 extents per 7+P rank. In this case, you need to create eight dummy volumes of 64 extents in size (volume size = 64 GB, binary) per extent pool using two mkfbvol DSCLI commands:
# mkfbvol -extpool P0 -cap 64 -type ds -name dummy_vol ee00-ee07
# mkfbvol -extpool P1 -cap 64 -type ds -name dummy_vol ef00-ef07
In this example, we use LSS ee with volume IDs ee00-ee07 for P0 (even extent pool) dummy volumes and LSS ef with volume IDs ef00-ef07 for P1 dummy volumes. The rotate extents volume allocation algorithm automatically distributes the volumes across the ranks.
4. Use the DSCLI command chrank -release against all smaller (6+P+S) ranks in order to release all extents on these ranks again so that finally all ranks in the extent pools are available for the creation of volumes for the attached host systems.
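Condensed into a single DSCLI sequence (the rank IDs of the smaller 6+P+S ranks are illustrative and must be taken from your own lsrank -l output):
dscli> lsrank -l                                     ## step 1: identify rank sizes and pool assignments
dscli> chrank -reserve R0 R2 R4 R6 R8 R10 R12 R14    ## step 2: reserve the 6+P+S ranks of P0
dscli> chrank -reserve R1 R3 R5 R7 R9 R11 R13 R15    ## step 2: reserve the 6+P+S ranks of P1
dscli> mkfbvol -extpool P0 -cap 64 -type ds -name dummy_vol ee00-ee07    ## step 3
dscli> mkfbvol -extpool P1 -cap 64 -type ds -name dummy_vol ef00-ef07    ## step 3
dscli> chrank -release R0 R2 R4 R6 R8 R10 R12 R14    ## step 4: release the reserved ranks of P0
dscli> chrank -release R1 R3 R5 R7 R9 R11 R13 R15    ## step 4: release the reserved ranks of P1
When the reserved capacity is eventually needed, the dummy volumes can simply be deleted again, for example, with rmfbvol ee00-ee07 and rmfbvol ef00-ef07.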
Now, we have created a homogeneous extent pool configuration with ranks of equal size. You can remove the dummy volumes when the last amount of storage capacity needs to be allocated. However, here you need to remember that the volumes that will be created from these final extents on the large (7+P) arrays are distributed only across half of the ranks in the extent pool, so consider using this capacity primarily for applications with lower I/O demands. Of course, you can also apply a similar procedure to a four extent pool configuration with 6+P+S and 7+P ranks mixed in each extent pool if you prefer four identical extent pools with exactly the same amount of storage capacity. However, simply using separate extent pools for 6+P+S and 7+P ranks reduces the administration effort.
Another consideration for the number of extent pools to create is the usage of Copy Services, such as FlashCopy Space Efficient (FlashCopy SE). If you use FlashCopy SE, you also might consider a minimum of four extent pools with two extent pools per rank group, as shown in Figure 5-13. As the FlashCopy SE repository for the Space Efficient target volumes is distributed across all available ranks within the extent pool (comparable to using Storage Pool Striping), we recommend that you distribute the source and target volumes across different extent pools (that is, different ranks) from the same DS8000 processor complex (that is, the same rank group) for the best FlashCopy performance. Each extent pool can hold FlashCopy source volumes, as well as repository space for Space Efficient FlashCopy target volumes whose source volumes are in the alternate extent pool. However, for certain environments, consider a dedicated set of extent pools using RAID 10 arrays just for FlashCopy SE target volumes while the other extent pools using RAID 5 arrays are only used for source volumes.
You can still separate workloads using different extent pools with regard to the principles of workload isolation, as seen in Figure 5-14 on page 111, using, for example, rotate extents (Storage Pool Striping) as the extent allocation method in the extent pools for the resource-sharing workload and rotate volumes in the extent pools for the isolated workload. The isolated workload can further use host-level striping or application-level striping. The workload isolation in this example is done on the DA pair level.
Figure 5-13 Extent pool configuration for FlashCopy SE (Space Efficient repository pools)
Figure 5-14 Example of multi-rank extent pools with workload isolation on the DA pair level
Under certain circumstances, you might consider modifying this strict single-rank approach to extent pools with more than just one rank using Storage Pool Striping, for example:
- A volume size larger than a single rank is required. In this case, use single-rank extent pools where possible, and assign the minimum number of ranks needed to multi-rank extent pools. Here, you can even consider the use of Storage Pool Striping so that the volume is evenly related to the set of ranks associated with this extent pool.
- Required volume sizes result in unacceptable unused capacity if single-rank extent pools are used. In this case, use as low a rank-to-extent pool ratio as possible. Even here, you might consider the usage of Storage Pool Striping for these extent pools so that the volumes are evenly associated with all the ranks within the same extent pool.
- Installations with a large number of ranks require considerable administration efforts. If performance management on individual rank level is not actually required, you can reduce this overall administration effort by using extent pools with two or more identical ranks and Storage Pool Striping instead of single-rank extent pools.
Note that in these cases when using multi-rank extent pools with Storage Pool Striping, the level of performance management is shifted from rank to extent pool level (a set of uniformly utilized ranks) and that the capability to control performance on the rank level for the
Array sites are assigned system-generated IDs in sequence beginning with S1 at DS8000 installation. Arrays, ranks, and extent pools are assigned system-generated IDs when they are created by the user during logical configuration. They are sequentially numbered in order of resource creation. For example, the first array created will be assigned array ID A0, the second array created will be assigned array ID A1, and so on, independent of which array sites are used to create the arrays. The same applies to ranks with rank IDs R0, R1, R2, and so on, which are assigned in the sequence of their creation independent of the array ID that is used. Extent pool IDs are also system-generated and sequentially numbered in order of creation, but extent pool IDs are also affected by the user assignment of the extent pool to processor complex 0 (even-numbered IDs) or processor complex 1 (odd-numbered IDs). The first extent pool created and assigned to processor complex 0 (rank group 0) will be assigned extent pool ID P0. The second extent pool created and assigned to processor complex 0 will be assigned extent pool ID P2, and so on. The first extent pool created and assigned to processor complex 1 (rank group 1) will be assigned extent pool ID P1. The second extent pool created and assigned to processor complex 1 will be assigned extent pool ID P3, and so on.
Note: When an array, rank, or extent pool is deleted, the corresponding resource ID is freed and will be used for the next resource of the same resource type that is created. For example, if arrays A0 - A7 were created on array sites S1 - S8, and A0 (on array site S1) is deleted, when array site S9 is used to create an array, it will be assigned array ID A0. If array site S1 is then used to create an array, it will be assigned array ID A8.
A consistent association of array site, array, rank, extent pool, and LSS/LCU IDs simplifies performance analysis and management, because the volume ID (which includes the LSS ID) implies a corresponding rank and DA pair, and the reverse is true also. The consistent association shown in the following example is based on the order of installation of disks on DA pairs (DA2, DA0, DA6, DA4, DA7, DA5, DA3, DA1, DA2, DA0, DA6, DA4, DA7, DA5, DA3, DA1). For more information about the order of installation of disks on DA pairs, refer to 2.4.6, Order of installation on page 21.
Note: Because array site IDs begin at 1 and array IDs begin at 0, it is not uncommon to have odd-numbered array sites associated with even-numbered array IDs, or to have array site IDs that are one greater than array IDs. However, array, rank, and extent pool IDs need to have the same number (for example, A10, R10, and P10) by following the process that we describe next. The LSS ID needs to be the hexadecimal equivalent of that number (for example, A10, R10, and P10 correspond to LSS 0A, because decimal 10 is 0x0A).
You can achieve a consistent association between array site, array, rank, extent pool, and LSS/LCU IDs by following these steps:
1. Create arrays from all array sites associated with DA2 in ascending order of array site ID. That is, if DA2 has array sites S1 - S8, create an array on array site S1 first, create an array on array site S2 next, and so on.
Note: DA2/DA0 can have more than eight array sites if disk enclosure pairs are installed in a second/third expansion frame.
2. Repeat creating arrays from all array sites in ascending order of array site ID for array sites associated with DA0, DA6, DA4, DA7, DA5, DA3, and DA1 in order as long as array sites exist. Note that the order of array sites assigned to DAs in this way might not necessarily follow the system-generated sequential numbering of array site IDs. Note: DA0 might have more than eight array sites if disk enclosure pairs are installed in a second expansion frame.
3. Create ranks from the arrays in array ID numerical order. That is, create the first rank from array A0, the second rank from array A1, and so on.
4. Create as many extent pools as ranks, with half of the extent pools assigned to processor complex 0 (also referred to as server0, managing rank group 0) and half assigned to processor complex 1 (also referred to as server1, managing rank group 1).
5. Assign ranks with even-numbered rank IDs to even-numbered extent pools, in ascending order. That is, assign rank R0 to extent pool P0, rank R2 to extent pool P2, and so on.
6. Assign ranks with odd-numbered rank IDs to odd-numbered extent pools, in ascending order. That is, assign rank R1 to extent pool P1, rank R3 to extent pool P3, and so on.
7. When planning the volumes, use hexadecimal volume IDs whose first two digits match the hexadecimal equivalent of the extent pool ID: for example, create volumes 00zz from extent pool P0, volumes 01zz from extent pool P1, and so on.

If additional volume addresses are required (more than 256 per rank), one or more unique LSS/LCU IDs can be added to each rank. Refer to Volume configuration scheme using hardware-bound LSS/LCU IDs on page 124 for more details about this hardware-related configuration concept.

You can simplify performance analysis if you follow a similar approach for the logical configuration of all your DS8000s. Next, we apply this configuration strategy to a DS8300 with a fully populated base frame and one fully populated expansion frame, as introduced in 5.5.2, DS8000 configuration example 2: Array site planning considerations on page 75. Figure 5-15 on page 115 shows a schematic of this DS8300 with a fully populated base frame and one fully populated expansion frame.
Figure 5-15 DS8000 configuration example 2 (fully populated base and expansion frame)
Figure 5-16 on page 116 and Figure 5-17 on page 117 show a schematic of an example logical configuration for this DS8000 with:
- A unique extent pool for each rank.
- One unique LSS for each rank. If additional addresses are needed (for example, to provide additional small CKD volumes or PAVs, or to identify different workloads), you can add additional unique LSSs.
- A one-to-one relationship between the array, rank, extent pool, and LSS, so that each horizontal box in the schematic represents an array, a rank, an extent pool, and an LSS. All volumes on a given rank are associated with the same unique extent pool and LSS.
- Array site, array, rank, and extent pool IDs that match up to show the association between these hardware resources and imply a DA association that is based on the convention of using array sites in the order of the installation of disks on DAs.

Figure 5-16 on page 116 shows the logical configuration for the DS8000 base frame. For the base frame, the sequence of array site IDs matches the order of the installation of disks on DA pairs (DA2 followed by DA0). Array sites on DA2 and DA0 were configured in ascending order, so the array site IDs (which begin with S1) are all one greater than the corresponding array, rank, and pool numbers. The LSS IDs are the hexadecimal equivalent of the corresponding array, rank, and pool IDs.
Figure 5-17 on page 117 shows a schematic of the logical configuration for the first expansion frame. Logical configuration of this expansion frame also implements a one-to-one relationship between the array, rank, extent pool, and LSS, so again each horizontal box in the schematic represents an array, a rank, an extent pool, and an LSS, and all volumes on a given rank will be associated with the same unique extent pool and LSS. For this expansion frame, the sequence of array site IDs (S17 - S48) created dynamically at DS8000 installation does not show a clear association to the order of installation of disks on DA pairs (DA6, DA4, DA7, and DA5). For example, the next array site ID (array site S17) is on DA7 rather than DA6. To ensure that the array, rank, extent pool, and LSS IDs reflect a clear and consistent association with a DA pair, the arrays, ranks, extent pools, and LSSs were configured in sequence according to the order of installation of disks on DA pairs (DA2, DA0, DA6, DA4, DA7, DA5, DA3, and DA1). Using this approach for logical configuration, a volume ID (which includes the LSS ID) can be used to deduce the associated DA pair as well as the array, rank, and extent pool IDs. For example, a volume ID of 0000 indicates the array, rank, pool, and LSS 0 and implies DA2. A volume ID of 0900 indicates array, rank, pool, and LSS 9 and implies DA0.
Array sites were configured in array site ID sequence within the order of installation of disks on the DA pairs. Array site S17 on DA7 is the next candidate for array creation in pure array site ID sequence (S1 - S48). However, the disks on DA6 are next in the order of installation, so the lowest-numbered array site on DA6 (S25) is used to create the next array (A16). Array A16 is used to create the next rank (R16), which is assigned to the next extent pool on processor complex 0 (P16), which is used to create volumes in the next LSS (0x10). Then, the next array site on DA6 (S26) is used to create array A17. After all array sites on DA6 have been used to create arrays, the first array site on DA4 is used. After all array sites on DA4 have been configured, the array sites on DA7 are configured, and finally the array sites on DA5. Therefore, all the array, rank, and pool IDs are the same (and the LSS IDs are the hexadecimal equivalent) for this expansion frame, while the array site IDs do not have a consistent numerical relationship. The clear association of DA pair to rank, extent pool, and LSS ID allows quick identification of the rank and DA pair based on the volume ID, which contains the LSS ID. For example, given a hexadecimal volume ID 10zz (LSS 0x10), we can deduce that the volume is in extent pool P16, rank R16, and array A16 on DA6.
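As an illustration only, the creation sequence just described maps to DSCLI commands similar to the following sketch (the capacity, volume names, pool name, and the omitted -dev parameter are assumptions and must be adapted to the actual environment):

# Create the next array from the lowest-numbered array site on DA6 (becomes A16)
mkarray -raidtype 5 -arsite S25
# Create a fixed block rank from array A16 (becomes R16)
mkrank -array A16 -stgtype fb
# Create the next extent pool on processor complex 0 (becomes P16)
mkextpool -rankgrp 0 -stgtype fb fb_da6_p16
# Assign rank R16 to extent pool P16
chrank -extpool P16 R16
# Create volumes 1000-100F (LSS 0x10) from extent pool P16
mkfbvol -extpool P16 -cap 64 -name p16_#d 1000-100F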
5.8 Plan address groups, LSSs, volume IDs, and CKD PAVs
After creating the extent pools and evenly distributing the back-end resources (DA pairs and ranks) across both DS8000 processor complexes, you can start with the creation of host volumes from these extent pools. When creating the host volumes, it is important to follow a strict volume layout scheme that evenly spreads the volumes of each application workload across all ranks and extent pools that are dedicated to this workload, in order to achieve a balanced I/O workload distribution across ranks, DA pairs, and DS8000 processor complexes. So, the next step is to plan the volume layout, and thus the mapping of address groups and LSSs to the volumes created from the various extent pools, with regard to the identified workloads and workload groups.

For performance analysis reasons, it is important to be able to easily identify the association of given volumes to ranks or extent pools when investigating resource contention. Although the mapping of volumes to ranks can be taken from the DSCLI showrank or showfbvol/showckdvol -rank commands, performance analysis and management are significantly easier if a well-planned logical configuration strategy is in place, using a numbering scheme that easily relates volume IDs to workloads, extent pools, and ranks. Each volume is associated with a hexadecimal 4-digit volume ID that has to be specified when creating the volume, as shown, for example, in Table 5-3 for volume ID 1101.
Table 5-3 Understanding the volume ID relation to address groups and LSSs/LCUs (example: volume ID 1101)

Digits                      Description
1st digit: 1xxx             Address group (0-F); 16 address groups on a DS8000 subsystem
1st and 2nd digits: 11xx    Logical subsystem (LSS) ID for FB or logical control unit (LCU) ID for CKD (x0-xF: 16 LSSs or LCUs per address group)
3rd and 4th digits: xx01    Volume number within an LSS or LCU (00-FF: 256 volumes per LSS or LCU)
The first digit of the hexadecimal volume ID specifies the address group, 0 to F, of that volume. Each address group can only be used by a single storage type, either FB or CKD. Note that volumes accessed by ESCON channels need to be defined in address group 0, using LCUs 00 to 0F. So if ESCON channels are used, reserve address group 0 for these volumes with a range of volume IDs from 0000 to 0FFF.

The first and second digits together specify the logical subsystem ID (LSS ID) for Open Systems volumes (FB) or the logical control unit ID (LCU ID) for System z volumes (CKD), providing 16 LSS/LCU IDs per address group. The third and fourth digits specify the volume number within the LSS/LCU, 00 - FF, providing 256 volumes per LSS/LCU. So the volume with volume ID 1101 is the volume with volume number 01 of LSS 11, belonging to address group 1 (first digit).

Important: You must define volumes accessed by ESCON channels in address group 0 using LCUs 00 through 0F.
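For reference, the association of a volume to its rank and extent pool can also be checked directly with the DSCLI, as mentioned earlier. A short sketch follows (volume ID 1101 and rank R17 are only illustrative values):

# Show the attributes of FB volume 1101, including its extent pool and, with
# -rank, the rank(s) on which its extents are allocated
showfbvol -rank 1101
# Equivalent command for a CKD volume
showckdvol -rank 1101
# Show the details of a rank, such as its array and extent pool association
showrank R17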
The LSS/LCU ID is furthermore related to a rank group. Even LSS/LCU IDs are restricted to volumes created from rank group 0, which are serviced by processor complex 0. Odd LSS/LCU IDs are restricted to volumes created from rank group 1, which are serviced by processor complex 1. So the volume ID also reflects the affinity of that volume to a DS8000
processor complex. All volumes that are created from even-numbered extent pools (P0, P2, P4, and so forth) have even LSS IDs and are managed by DS8000 processor complex 0, whereas all volumes created from odd-numbered extent pools (P1, P3, P5, and so forth) have odd LSS IDs and are managed by DS8000 processor complex 1.

There is no direct DS8000 performance implication as a result of the number of LSSs/LCUs or the association of LSSs/LCUs and ranks, with the exception of the additional CKD PAVs that are potentially available with multiple LCUs assigned to a single rank. For the z/OS CKD environment, a DS8000 volume ID is required for each PAV. The maximum of 256 addresses per LCU includes both CKD base volumes and PAVs, so the number of volumes and PAVs determines the number of LCUs required. Especially with single-rank extent pools, more than one LCU per rank might be needed.

When planning the volume layout, you can basically choose between two concepts for LSS/LCU IDs: you can either relate them strictly to hardware resources (as on former IBM 2105 subsystems) or relate them to application workloads.

Especially with single-rank extent pools, it is a common practice to relate LSS/LCU IDs to the physical back-end resources, such as ranks and extent pools, so that the relation of each volume to a specific rank (or set of ranks) can simply be taken from the volume ID. In certain homogeneous environments, this concept can reduce the effort of performance management and analysis. You can initially use this concept with native host-based tools, without having to use additional DSCLI configuration data to investigate the relation of volumes to ranks. This concept is suitable in environments where performance management is the major concern and full control of performance at the rank level is required. Typically, this concept also requires host-level or application-level striping techniques to balance workloads, and it increases the configuration effort if the volumes of a given workload are spread across multiple ranks.

The other approach is to relate an LSS/LCU to a specific application workload, with a meaningful numbering scheme for the volume IDs with regard to the specific applications, ranks, and extent pools. Each LSS can have 256 volumes, with volume numbers ranging from 00 to ff. So relating the LSS/LCU to a certain application workload and the volume number to the physical location of the volume, such as the rank (when using the rotate volumes algorithm with multi-rank extent pools or single-rank extent pools) or the extent pool (when using the rotate extents algorithm with multi-rank extent pools), might be a reasonable choice. Because the volume IDs are transparent to the attached host systems, this approach helps the administrator of the host system to determine the relation of volumes to ranks simply from the volume ID, and thus easily identify independent volumes from different ranks when setting up host-level striping or when separating, for example, DB table spaces from DB logs onto volumes from physically different arrays. Ideally, all volumes that belong to a certain application workload or a group of related host systems are within the same LSS.
However, because the volumes need to also be spread evenly across both DS8000 processor complexes, at least two logical subsystems are typically required per application workload: one even LSS for the volumes managed by processor complex 0 and one odd LSS for volumes managed by processor complex 1 (for example, LSS 10 and LSS 11). Furthermore, the assignment of LSS IDs to application workloads significantly reduces management effort when using DS8000-related Copy Services, because basic management steps (such as establishing Peer-to-Peer Remote Copy (PPRC) paths and consistency groups) are related to LSSs. Even if Copy Services are not currently planned, plan the volume layout accordingly, because management will be easier if you need to introduce Copy Services in the future (even, for example, when migrating to a new subsystem using Copy Services). Note that a single mkfbvol or mkckdvol command can create a set of volumes with successive volume IDs from the specified extent pool. So if you want to assign LSS/LCU IDs to specific hardware resources, such as ranks, where you need to create a large number of successive
volumes on a single rank, consider single-rank extent pools. You can also extend this concept to assigning LSS/LCU IDs to extent pools with more than one rank using Storage Pool Striping; in this case, the LSS/LCU ID is simply related to an extent pool as a set of aggregated and evenly utilized ranks. If you want to assign LSS/LCUs based on application workloads and typically need to spread volumes across multiple ranks, with the allocation of volumes or extents on successive ranks, multi-rank extent pools might be a good choice, simply taking advantage of the DS8000 volume allocation algorithms to spread the volumes, and thus the workload, evenly across the ranks with considerably less management effort. Single-rank extent pools require much more configuration effort to distribute a set of successive volumes of a given workload with unique LSS/LCU IDs across multiple ranks. However, the actual strategy for the assignment of LSS/LCU IDs to resources and workloads can vary, depending on the particular requirements of an environment. The following subsections introduce suggestions for LSS/LCU and volume ID numbering schemes that help to relate volume IDs to application workloads, extent pools, and ranks.
Figure 5-18 Volume layout example for two shared extent pools using the rotate volumes algorithm
Figure 5-19 Volume layout example for four shared extent pools using the rotate volumes algorithm
When the volumes of each host or application are distributed evenly across the ranks, DA pairs, and both processor complexes using the rotate volumes algorithm, consider using striping techniques on the host system in order to achieve a uniform distribution of the I/O activity across the assigned volumes (and thus, the DS8000 hardware resources). Refer to Figure 5-20.
Figure 5-20 Example of using host-level striping with volumes on distinct ranks
We typically recommend, for host-level striping, a large-granularity stripe size of at least 8 MB or 16 MB for standard workloads, in order to span multiple full stripe sets of a DS8000 rank and not to disable the sequential read-ahead detection mechanism of the DS8000 cache algorithms. For example, using an AIX host system with the AIX Logical Volume Manager (LVM), this means building an AIX LVM volume group (VG) with LUNs 1000-100f and LUNs 1100-110f (as shown in Figure 5-20 on page 122) and creating AIX logical volumes from this volume group with an INTERDISK-POLICY of maximum and PP sizes of at least 8 MB, in order to spread the workload across all LUNs in the volume group (which is called AIX PP striping). If you have another set of volumes from the same ranks, for example, for a second host system, you simply configure them in another AIX LVM VG, as shown in Figure 5-20 on page 122.

Note that this is a general recommendation for standard workloads. If you know your workload characteristics in more detail, you might also consider another stripe size on your host system, aligned to your particular I/O request size as well as to the DS8000 array stripe set size. However, perform a test with the real application to evaluate whether the chosen stripe size really yields better performance before implementing it in the production environment.
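A minimal AIX sketch of this PP striping approach follows, assuming the DS8000 LUNs appear as hdisk2 through hdisk5 and using a 16 MB PP size (the volume group and logical volume names are arbitrary):

# Create a volume group with a 16 MB physical partition size over the DS8000 LUNs
# (list all LUNs from both rank groups here)
mkvg -s 16 -y ds8kvg hdisk2 hdisk3 hdisk4 hdisk5
# Create a logical volume with an inter-physical volume allocation policy of
# maximum (-e x), so that successive PPs are spread round-robin across all LUNs
mklv -y applv -e x -t jfs2 ds8kvg 256

With a 16 MB PP size, the data of the logical volume is spread across the LUNs in 16 MB pieces, which satisfies the large-granularity recommendation above.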
The stripe size that is internally used on DS8000 depends on the storage type, either FB or CKD, and the RAID level. FB uses a stripe size of 256 KB segments per disk drive for RAID 5 and RAID 10 arrays and 192 KB for RAID 6 arrays. CKD uses a stripe size of 224 KB for RAID 5 and RAID 10 arrays and 168 KB for RAID 6 arrays.
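For illustration only (simple arithmetic based on the segment sizes above, not additional measured data), a full stripe of an FB RAID 5 rank covers:

6+P+S rank: 6 data drives x 256 KB = 1536 KB
7+P rank:   7 data drives x 256 KB = 1792 KB

So a host-level stripe size of 8 MB (8192 KB) spans roughly four to five full DS8000 stripe sets, which is consistent with the recommendation above to span multiple full stripe sets of a rank.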
Figure 5-21 Volume layout example for four shared extent pools using the rotate extents algorithm
Here, the third digit of the volume ID is used to relate to an extent pool, and thus to a set of uniformly utilized ranks. Two LSS/LCU IDs (one even, one odd) are assigned to a workload to spread the I/O activity evenly across both processor complexes (both rank groups). Furthermore, you can quickly create volumes with successive volume IDs for a given workload per extent pool with a single DSCLI mkfbvol/mkckdvol command. Using host-level or application-level striping across volumes from different extent pools is still a valid option to balance the overall workload across the extent pools.
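For example (a sketch; the capacities and volume names are assumptions), the volumes of workload A in Figure 5-21 can be created with one DSCLI command per extent pool, so that the volume IDs directly encode the LSS (processor complex) and, in the third digit, the extent pool:

# Rank group 0: volumes 1000-1007 (LSS 10) from P0 and 1010-1017 from P2
mkfbvol -extpool P0 -cap 64 -name appA_#d 1000-1007
mkfbvol -extpool P2 -cap 64 -name appA_#d 1010-1017
# Rank group 1: volumes 1100-1107 (LSS 11) from P1 and 1110-1117 from P3
mkfbvol -extpool P1 -cap 64 -name appA_#d 1100-1107
mkfbvol -extpool P3 -cap 64 -name appA_#d 1110-1117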
When using the rotate extents algorithm (Storage Pool Striping) with multi-rank extent pools, the extents of each volume are spread across all the ranks in the extent pool, so that performance and configuration management can simply be shifted from the rank level to the extent pool level. Instead of a large number of ranks, only a small number of extent pools now need to be managed, with considerably less overall administration effort. This approach combines the benefit of a strict volume-to-physical-resource association, as provided with single-rank extent pools, with ease of use and a reduction of the overall management effort. You still need to apply the appropriate considerations outlined in 5.7.5, Planning for multi-rank extent pools on page 106, for evenly distributing the ranks and DA pairs across the extent pools with regard to the workload configuration principles.

A simple example of assigning unique LSS/LCU IDs to extent pools is given in Figure 5-22 on page 125. Here, you can easily relate a volume ID to a set of ranks; for example, volume 1108 is evenly related to the set of ranks in extent pool P1. However, you cannot relate the workload of these volumes to individual ranks anymore, because all ranks in an extent pool are shared by all volumes in that extent pool.

When using only identical ranks within an extent pool, as shown in this example, be aware that the extent pools, as aggregations of multiple ranks, can still have different performance capabilities, depending on the number of disk drives actively servicing I/O requests. Here, extent pools P0 and P1 are built from ranks with spare drives, so they have a smaller number of disks actively servicing I/O requests than extent pools P2 and P3, which were built from the same type of ranks but without spare drives. Thus, extent pools P0 and P1 have less total storage capacity and also provide slightly lower overall performance capabilities than extent pools P2 and P3. However, the overall I/O activity of a given workload typically scales with its capacity allocation, based on a given I/O access density, so the difference in capacity for those extent pools might scale accordingly with their performance capabilities with regard to this workload and its space allocation.
Figure 5-22 Volume layout example for hardware-related volume IDs with multi-rank extent pools
Note: The access density is a measure of I/O rate per unit of usable storage capacity, expressed in IOPS per usable GB of storage capacity.
We now return to our example DS8000 configurations that were introduced in 5.3, Planning allocation of disk and host connection capacity on page 70 to see several possible LSS assignment options. For this discussion, we focus on isolation considerations for different storage types (CKD and FB). However, the same isolation strategies can be used for the isolation of different workloads of a single storage type.
(Schematic: hardware-related LSS assignment example for single or mixed storage types (CKD and FB), mapping LSS IDs one-to-one to ranks R0/Pool P0 through R15/Pool P15, with even LSSs on processor complex 0 and odd LSSs on processor complex 1, grouped by DA pair.)
If only a single storage type is required, another option for LSS assignment to the 16 ranks in the base frame of DS8000 configuration 2 is shown in Figure 5-25 on page 130. Only a single storage type can be supported, because LSSs from the same address groups (address group 2 and address group 3) are used on all ranks. This strategy of LSS assignment is suitable for resource-sharing (such as multiple workloads sharing ranks). Each workload can be assigned volumes from a single address group that are spread across multiple ranks, DA pairs, and servers, increasing the maximum potential performance of the workload. Another benefit of this LSS assignment pattern is that it is easy to define additional volumes on all ranks by simply adding an LSS from another address group to each rank. Assigning LSSs to ranks in this way might also be convenient for differentiating between different standard sizes of volumes. For example, for CKD volumes, address group 2 can be used for 3390 Mod3 volumes, address group 3 for 3390 Mod9 volumes, and so on.
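A hedged DSCLI sketch of this convention follows (the LCU IDs, SSIDs, capacities, and extent pool are assumptions; 3339 and 10017 cylinders correspond to the 3390 Mod3 and Mod9 sizes):

# Address group 2: LCU 20 for 3390 Mod3 volumes
mklcu -qty 1 -id 20 -ss 2020
mkckdvol -extpool P0 -cap 3339 -name mod3_#d 2000-2007
# Address group 3: LCU 30 for 3390 Mod9 volumes on the same ranks
mklcu -qty 1 -id 30 -ss 3030
mkckdvol -extpool P0 -cap 10017 -name mod9_#d 3000-3007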
Figure 5-25 Alternate hardware-related LSS assignment example for DS8000 configuration 2
Figure 5-26 shows one option for LSS assignment to ranks belonging to the two Storage Images. This simple pattern with a one-to-one correspondence between a rank and an LSS is
only appropriate when a maximum of 256 volume addresses is required per rank. However, if additional volume addresses are needed, one or more additional unique LSSs can be added to each rank. As shown in the schematic, the same LSSs can be used on both Storage Images for two identically configured logical DS8000s (for example, for production and test, or for workloads isolated to one Storage Image). However, performance analysis and management can be simplified by assigning different LSSs in different address groups to the ranks in each Storage Image.

Note: Any of the LSS assignment patterns shown for DS8000 configuration examples 1 and 2 can also be applied to one or both Storage Images in example 3.
5.9 Plan I/O port IDs, host attachments, and volume groups
Finally, when planning the attachment of the host systems to the storage subsystem's host adapter I/O ports, you also need to achieve a balanced workload distribution across the available front-end resources for each workload, with appropriate isolation and resource-sharing considerations. Therefore, distribute the FC connections from the host systems evenly across the DS8000 host adapter (HA) ports, HA cards, I/O enclosures, and, if available, RIO-G loops.

For high availability, each host system must use a multipathing device driver, such as Subsystem Device Driver (SDD), and have a minimum of two host connections to HA cards in different I/O enclosures on the DS8000, preferably using one left side (even-numbered) I/O enclosure and one right side (odd-numbered) I/O enclosure, so that there is a shortest path via the RIO-G loop to either DS8000 processor complex for a good balance of the I/O requests to both rank groups. For a host system with four FC connections to a DS8100, consider using one HA port in each of the four I/O enclosures. If a host system with four FC connections is attached to a DS8300, consider spreading two connections across two different I/O enclosures in the first RIO-G loop and two connections across different I/O enclosures in the second RIO-G loop.

The number of host connections per host system is primarily determined by the required bandwidth. Use an appropriate number of HA cards in order to satisfy high throughput demands. Because the overall bandwidth of one HA card scales well up to two ports (2 Gbps), while the other two ports simply provide additional connectivity, use only one of the upper pair of FCP ports and one of the lower pair of FCP ports of a single HA card for workloads with high sequential throughputs, and spread the workload across several HA cards. However, with typical transaction-driven workloads showing high numbers of random, small block-size I/O operations, all four ports can be used alike. For the best performance of workloads with very different I/O characteristics, consider isolation of large block sequential and small block random workloads at the I/O port level or even at the HA card level. The best practice is to use dedicated I/O ports for Copy Services paths and host connections. For more information about performance aspects related to Copy Services, refer to 17.3.1, Metro Mirror configuration considerations on page 519.

In order to assign FB volumes to the attached Open Systems hosts using LUN masking, these volumes need to be grouped in DS8000 volume groups. A volume group can be assigned to multiple host connections, and each host connection is specified by the worldwide port name (WWPN) of the host's FC port. A set of host connections from the same host system is called a host attachment. Each host connection can only be assigned to a single volume group: you cannot assign the same host connection to multiple volume groups, but the same volume group can be assigned to multiple host connections. In order to share volumes between multiple host systems, the most convenient way is to create a separate volume group for each host system and assign the shared volumes to each of the individual volume groups as required, because a single volume can be assigned to multiple volume groups.
Only if a group of host systems shares exactly the same set of volumes, and there is no need to assign additional non-shared volumes independently to particular hosts of this group, can you consider using a single shared volume group for all host systems in order to simplify management. Typically, there are no significant DS8000 performance implications due to the number of DS8000 volume groups or the assignment of host attachments and volumes to DS8000 volume groups.

Do not omit additional host attachment and host system considerations, such as SAN zoning, multipathing software, and host-level striping. For additional information, refer to Chapter 9, Host attachment on page 265, Chapter 11, Performance considerations with UNIX servers on page 307, Chapter 13, Performance considerations with Linux on page 401, and Chapter 15, System z servers on page 441.

After the DS8000 has been installed, you can use the DSCLI lsioport command to display and document I/O port information, including the I/O port IDs, host adapter type, I/O enclosure location, and WWPN. Use this information to add specific I/O port IDs, the required protocol (FICON, FCP, or ESCON), and DS8000 I/O port WWPNs to the plan of host and remote mirroring connections identified in 5.3, Planning allocation of disk and host connection capacity on page 70. Additionally, the I/O port IDs might be required as input to DS8000 host definitions if host connections need to be restricted to specific DS8000 I/O ports using the -ioport option of the mkhostconnect DSCLI command. If host connections are configured to allow access to all DS8000 I/O ports, which is the default, the paths typically must be restricted by SAN zoning; here, the I/O port WWPNs will be required as input for the SAN zoning. The lshostconnect -login DSCLI command might help to verify the final allocation of host attachments to DS8000 I/O ports, because it lists the host port WWPNs that are logged in, sorted by the DS8000 I/O port IDs for known connections. The lshostconnect -unknown DSCLI command might further help to identify host port WWPNs that have not yet been configured to host connections when creating host attachments using the mkhostconnect DSCLI command.

The DSCLI lsioport output will identify:
- The number of I/O ports on each host adapter installed
- The type of host adapters installed (SW FCP/FICON-capable, LW FCP/FICON-capable, or ESCON)
- The distribution of host adapters across I/O enclosures
- The WWPN of each I/O port

DS8000 I/O ports have predetermined, fixed DS8000 logical port IDs in the form I0xyz, where:
- x = I/O enclosure
- y = slot number within the I/O enclosure
- z = port within the adapter card

For example, I0101 is the I/O port ID for:
- I/O enclosure 1
- Slot 0
- Second port
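As a sketch only (the volume group name, volume range, WWPN, host type, and I/O port IDs are assumptions), assigning a set of volumes to an Open Systems host and restricting the host connection to specific I/O ports might look like this:

# Create a volume group containing the host volumes
mkvolgrp -type scsimask -volume 1000-100F hostA_vg
# Create one host connection per host FC port, assigned to the new volume group
# (assuming it was given ID V0) and restricted to two I/O ports in different
# I/O enclosures; without -ioport, all I/O ports are allowed and access is
# typically restricted by SAN zoning instead
mkhostconnect -wwname 10000000C9123456 -hosttype pSeries -volgrp V0 -ioport I0001,I0131 hostA_fcs0
# Verify which host ports are logged in to which DS8000 I/O ports
lshostconnect -login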
Note: The slot numbers for logical I/O port IDs are one less than the physical location numbers for host adapter cards, as shown on the physical labels and in TotalStorage Productivity Center for Disk. For example, I0101 is R1-XI2-C1-T2.

A simplified example of spreading the DS8000 I/O ports evenly across two redundant SAN fabrics is given in Figure 5-27. Note that SAN implementations can vary depending on individual requirements, workload considerations for isolation and resource-sharing, and the available hardware resources.
Figure 5-27 Example of spreading DS8000 I/O ports evenly across two redundant SAN fabrics
Now, we return again to the three DS8000 configuration examples introduced in 5.3, Planning allocation of disk and host connection capacity on page 70 to see the I/O enclosure, host adapter, and I/O port resources available.
Figure 5-28 DS8000 example 1: I/O enclosures, host adapters, and I/O ports
The DSCLI lsioport command output for the DS8000 in Example 5-12 shows a total of 36 I/O ports available:
- Four multi-mode (shortwave) 4-port FCP/FICON-capable adapters, one in each base frame I/O enclosure (I/O enclosures 0 - 3): IDs I0000 - I0003, I0130 - I0133, I0240 - I0243, and I0310 - I0313
- Four single-mode (longwave) 4-port FCP/FICON-capable adapters, one in each base frame I/O enclosure (I/O enclosures 0 - 3): IDs I0030 - I0033, I0100 - I0103, I0200 - I0203, and I0330 - I0333
- Two 2-port ESCON adapters: one in I/O enclosure 2 (IDs I0230 and I0231) and one in I/O enclosure 3 (IDs I0300 and I0301)
Example 5-12 DS8000 example 1: I/O enclosures, host adapters, and I/O ports dscli> lsioport -l -dev ibm.2107-7506571 Date/Time: September 4, 2005 2:35:44 PM EDT IBM DSCLI Version: 5.0.5.52 DS:IBM.2107-7506571 ID WWPN State Type topo portgrp =============================================================== I0000 50050763030000B6 Online Fibre Channel-SW FC-AL 0 I0001 50050763030040B6 Online Fibre Channel-SW FC-AL 0 I0002 50050763030080B6 Online Fibre Channel-SW FC-AL 0 I0003 500507630300C0B6 Online Fibre Channel-SW FC-AL 0 I0030 50050763030300B6 Online Fibre Channel-LW FICON 0 I0031 50050763030340B6 Online Fibre Channel-LW FICON 0 I0032 50050763030380B6 Online Fibre Channel-LW FICON 0
I0033 500507630303C0B6 Online Fibre Channel-LW FICON 0
I0100 50050763030800B6 Online Fibre Channel-LW FICON 0
I0101 50050763030840B6 Online Fibre Channel-LW FICON 0
I0102 50050763030880B6 Online Fibre Channel-LW FICON 0
I0103 500507630308C0B6 Online Fibre Channel-LW FICON 0
I0130 50050763030B00B6 Online Fibre Channel-SW FC-AL 0
I0131 50050763030B40B6 Online Fibre Channel-SW FC-AL 0
I0132 50050763030B80B6 Online Fibre Channel-SW FC-AL 0
I0133 50050763030BC0B6 Online Fibre Channel-SW FC-AL 0
I0200 50050763031000B6 Online Fibre Channel-LW FICON 0
I0201 50050763031040B6 Online Fibre Channel-LW FICON 0
I0202 50050763031080B6 Online Fibre Channel-LW FICON 0
I0203 500507630310C0B6 Online Fibre Channel-LW FICON 0
I0230 - Online ESCON - -
I0231 - Online ESCON - -
I0240 50050763031400B6 Online Fibre Channel-SW FC-AL 0
I0241 50050763031440B6 Online Fibre Channel-SW FC-AL 0
I0242 50050763031480B6 Online Fibre Channel-SW FC-AL 0
I0243 500507630314C0B6 Online Fibre Channel-SW FC-AL 0
I0300 - Online ESCON - -
I0301 - Online ESCON - -
I0310 50050763031900B6 Online Fibre Channel-SW FC-AL 0
I0311 50050763031940B6 Online Fibre Channel-SW FC-AL 0
I0312 50050763031980B6 Online Fibre Channel-SW FC-AL 0
I0313 500507630319C0B6 Online Fibre Channel-SW FC-AL 0
I0330 50050763031B00B6 Online Fibre Channel-LW FICON 0
I0331 50050763031B40B6 Online Fibre Channel-LW FICON 0
I0332 50050763031B80B6 Online Fibre Channel-LW FICON 0
I0333 50050763031BC0B6 Online Fibre Channel-LW FICON 0
Note: As seen in Example 5-12 on page 134, the default I/O topology configurations are FICON for longwave (LW) host adapters and FC-AL for shortwave (SW) host adapters.
The second port can be configured for FCP host access:
- I0031
- I0101
- I0201
- I0331

On two (or more) of the four LW cards, the third port can be configured for remote mirroring links (two links might be sufficient for remote mirroring requirements):
- I0032
- I0102

It is possible to have Open Systems host workloads and remote mirroring workloads share the same I/O ports, but if enough I/O ports are available, the best practice is to separate FCP host connections and FCP remote mirroring connections onto dedicated I/O ports.

In each I/O enclosure of this DS8000, there is one SW FICON/FCP-capable host adapter and one LW FICON/FCP-capable host adapter. Each port on a DS8000 4-port SW adapter or a DS8000 4-port LW adapter can be configured either to the FCP protocol for Open Systems host connections and DS8000 remote mirroring links, or to the FICON protocol for z/OS host connections. Generally, SAN distance considerations and host server adapter types determine whether SW or LW adapters are appropriate. In our case, the fact that both LW and SW adapters have been ordered for this DS8000 implies two sets of host server or SAN requirements, so it is likely that certain workloads will be separated by adapter type.

Note: For host server connections or remote mirroring connections known to require high throughput, we recommend a maximum of two I/O ports on a single DS8000 host adapter.
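A hedged sketch of setting the port protocol with the DSCLI follows (verify the exact setioport syntax and topology values for your DSCLI level):

# Configure the second port of each LW adapter for FCP (Open Systems) host access
setioport -topology scsi-fcp I0031
setioport -topology scsi-fcp I0101
setioport -topology scsi-fcp I0201
setioport -topology scsi-fcp I0331
# Configure two third ports for FCP remote mirroring (PPRC) links
setioport -topology scsi-fcp I0032
setioport -topology scsi-fcp I0102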
Figure 5-29 DS8000 example 2: Base frame I/O enclosures, host adapters, and I/O ports
Figure 5-30 DS8000 example 2: First expansion frame I/O enclosures, host adapters, and I/O ports
Note: The host adapters installed in the DS8000 in this example do not necessarily follow the current order of installation of host adapters.

Example 5-13 shows output from the DSCLI lsioport command for the DS8000 in this example. As shown in the lsioport output, a total of 16 host adapters (64 I/O ports) are installed:
- Four 4-port SW FCP/FICON-capable host adapters in the base frame (I/O enclosures 0 - 3): IDs I0010 - I0013, I0140 - I0143, I0210 - I0213, and I0340 - I0343
- Eight 4-port LW FCP/FICON-capable host adapters in the base frame (I/O enclosures 0 - 3): IDs I0030 - I0033, I0040 - I0043, I0100 - I0103, I0110 - I0113, I0230 - I0233, I0240 - I0243, I0300 - I0303, and I0310 - I0313
- Four 4-port FCP/FICON-capable host adapters in the expansion frame (I/O enclosures 4 - 7): IDs I0400 - I0403, I0530 - I0533, I0600 - I0603, and I0730 - I0733
Example 5-13 DS8000 example 2: I/O enclosures, host adapters, and I/O ports dscli> lsioport -l -dev ibm.2107-7520331 Date/Time: September 9, 2005 3:06:36 PM EDT IBM DSCLI Version: 5.0.5.52 DS:IBM.2107-7520331 ID WWPN State Type topo portgrp =============================================================== I0010 5005076303010194 Online Fibre Channel-SW FC-AL 0 I0011 5005076303014194 Online Fibre Channel-SW FC-AL 0 I0012 5005076303018194 Online Fibre Channel-SW FC-AL 0 I0013 500507630301C194 Online Fibre Channel-SW FC-AL 0 I0030 5005076303030194 Online Fibre Channel-LW FICON 0 I0031 5005076303034194 Online Fibre Channel-LW FICON 0 I0032 5005076303038194 Online Fibre Channel-LW FICON 0 I0033 500507630303C194 Online Fibre Channel-LW FICON 0 I0040 5005076303040194 Online Fibre Channel-LW FICON 0 I0041 5005076303044194 Online Fibre Channel-LW FICON 0 I0042 5005076303048194 Online Fibre Channel-LW FICON 0 I0043 500507630304C194 Online Fibre Channel-LW FICON 0 I0100 5005076303080194 Online Fibre Channel-LW FICON 0 I0101 5005076303084194 Online Fibre Channel-LW FICON 0 I0102 5005076303088194 Online Fibre Channel-LW FICON 0 I0103 500507630308C194 Online Fibre Channel-LW FICON 0 I0110 5005076303090194 Online Fibre Channel-LW FICON 0 I0111 5005076303094194 Online Fibre Channel-LW FICON 0 I0112 5005076303098194 Online Fibre Channel-LW FICON 0 I0113 500507630309C194 Online Fibre Channel-LW FICON 0 I0140 50050763030C0194 Online Fibre Channel-SW FC-AL 0 I0141 50050763030C4194 Online Fibre Channel-SW FC-AL 0 I0142 50050763030C8194 Online Fibre Channel-SW FC-AL 0 I0143 50050763030CC194 Online Fibre Channel-SW FC-AL 0 I0210 5005076303110194 Online Fibre Channel-SW FC-AL 0 I0211 5005076303114194 Online Fibre Channel-SW FC-AL 0 I0212 5005076303118194 Online Fibre Channel-SW FC-AL 0 I0213 500507630311C194 Online Fibre Channel-SW FC-AL 0 I0230 5005076303130194 Online Fibre Channel-LW FICON 0 I0231 5005076303134194 Online Fibre Channel-LW FICON 0 I0232 5005076303138194 Online Fibre Channel-LW FICON 0 I0233 500507630313C194 Online Fibre Channel-LW FICON 0 I0240 5005076303140194 Online Fibre Channel-LW FICON 0 I0241 5005076303144194 Online Fibre Channel-LW FICON 0 I0242 5005076303148194 Online Fibre Channel-LW FICON 0
I0243 500507630314C194 Online Fibre Channel-LW FICON 0
I0300 5005076303180194 Online Fibre Channel-LW FICON 0
I0301 5005076303184194 Online Fibre Channel-LW FICON 0
I0302 5005076303188194 Online Fibre Channel-LW FICON 0
I0303 500507630318C194 Online Fibre Channel-LW FICON 0
I0310 5005076303190194 Online Fibre Channel-LW FICON 0
I0311 5005076303194194 Online Fibre Channel-LW FICON 0
I0312 5005076303198194 Online Fibre Channel-LW FICON 0
I0313 500507630319C194 Online Fibre Channel-LW FICON 0
I0340 50050763031C0194 Online Fibre Channel-SW FC-AL 0
I0341 50050763031C4194 Online Fibre Channel-SW FC-AL 0
I0342 50050763031C8194 Online Fibre Channel-SW FC-AL 0
I0343 50050763031CC194 Online Fibre Channel-SW FC-AL 0
I0400 5005076303200194 Online Fibre Channel-LW FICON 0
I0401 5005076303204194 Online Fibre Channel-LW FICON 0
I0402 5005076303208194 Online Fibre Channel-LW FICON 0
I0403 500507630320C194 Online Fibre Channel-LW FICON 0
I0530 50050763032B0194 Online Fibre Channel-LW FICON 0
I0531 50050763032B4194 Online Fibre Channel-LW FICON 0
I0532 50050763032B8194 Online Fibre Channel-LW FICON 0
I0533 50050763032BC194 Online Fibre Channel-LW FICON 0
I0600 5005076303300194 Online Fibre Channel-LW FICON 0
I0601 5005076303304194 Online Fibre Channel-LW FICON 0
I0602 5005076303308194 Online Fibre Channel-LW FICON 0
I0603 500507630330C194 Online Fibre Channel-LW FICON 0
I0730 50050763033B0194 Online Fibre Channel-LW FICON 0
I0731 50050763033B4194 Online Fibre Channel-LW FICON 0
I0732 50050763033B8194 Online Fibre Channel-LW FICON 0
I0733 50050763033BC194 Online Fibre Channel-LW FICON 0
Notes:
1. The port IDs shown in the lsioport command output are logical port IDs. In logical port IDs, the slot numbers are one less than the physical location numbers for host adapter cards listed below the host adapter cards in the DS8000.
2. As shown in Example 5-13 on page 138, the default I/O topology configurations are FICON for LW host adapters and Fibre Channel Arbitrated Loop (FC-AL) for SW host adapters.
It is possible for Open Systems host workloads and remote mirroring workloads to share the same I/O ports, but if enough I/O ports are available, separation of the FCP host connections and the FCP remote mirroring connections might simplify performance management.

Note: For host server connections or remote mirroring connections known to require high throughput, we recommend a maximum of two I/O ports on a single DS8000 host adapter.
Figure 5-31 DS8000 example 3: Base frame I/O enclosures, host adapters, and I/O ports
Figure 5-31 on page 140 and Figure 5-32 are schematics of the I/O enclosures in the base frame and the first expansion frame. A total of 12 4-port FCP/FICON-capable host adapters are installed in this DS8000, for a total of 48 available I/O ports. Six LW host adapters are installed in the base frame (I/O enclosures 0 - 3), with three host adapters dedicated to each Storage Image. Six LW host adapters are installed in the expansion frame (I/O enclosures 4 - 7), with three host adapters dedicated to each Storage Image. All installed host adapters are the same type (LW 4-port FCP/FICON-capable). In this example, each Storage Image owns two host adapters in its first I/O enclosure in the base frame and expansion frame (I/O enclosures 0 and 4 for Storage Image 1 and I/O enclosures 2 and 6 for Storage Image 2). Each Storage Image owns one host adapter in its second I/O enclosure in the base frame and expansion frame (I/O enclosures 1 and 5 for Storage Image 1 and I/O enclosures 3 and 7 for Storage Image 2).
Figure 5-32 DS8000 example 3: First expansion frame I/O enclosures, host adapters, and I/O ports
The DS8000 in this example is a DS8000 model with dual Storage Images, so in order to see all the host adapters, you must issue the DSCLI lsioport command twice, once for each Storage Image:
- Storage Image 1: IBM.2107-7566321
- Storage Image 2: IBM.2107-7566322
As shown in the output from the lsioport command for Storage Image 1 in Example 5-14 and the output from the lsioport command for Storage Image 2 in Example 5-15, half of the I/O enclosures are dedicated to each Storage Image:
- I/O enclosures 0, 1, 4, and 5 (as indicated by I/O port IDs I00yz, I01yz, I04yz, and I05yz) appear in the output for Storage Image 1 (IBM.2107-7566321).
- I/O enclosures 2, 3, 6, and 7 (as indicated by I/O port IDs I02yz, I03yz, I06yz, and I07yz) appear in the output for Storage Image 2 (IBM.2107-7566322).
Example 5-14 DS8000 example 3: Storage Image 1 I/O enclosures, host adapters, and I/O ports dscli> lsioport -l -dev ibm.2107-7566321 Date/Time: September 4, 2005 5:32:12 PM EDT IBM DSCLI Version: 5.0.5.52 DS:IBM.2107-7566321 ID WWPN State Type topo portgrp ============================================================ I0000 50050763030003BD Online Fibre Channel-LW FICON 0 I0001 50050763030043BD Online Fibre Channel-LW FICON 0 I0002 50050763030083BD Online Fibre Channel-LW FICON 0 I0003 500507630300C3BD Online Fibre Channel-LW FICON 0 I0030 50050763030303BD Online Fibre Channel-LW FICON 0 I0031 50050763030343BD Online Fibre Channel-LW FICON 0 I0032 50050763030383BD Online Fibre Channel-LW FICON 0 I0033 500507630303C3BD Online Fibre Channel-LW FICON 0 I0100 50050763030803BD Online Fibre Channel-LW FICON 0 I0101 50050763030843BD Online Fibre Channel-LW FICON 0 I0102 50050763030883BD Online Fibre Channel-LW FICON 0 I0103 500507630308C3BD Online Fibre Channel-LW FICON 0 I0400 50050763032003BD Online Fibre Channel-LW FICON 0 I0401 50050763032043BD Online Fibre Channel-LW FICON 0 I0402 50050763032083BD Online Fibre Channel-LW FICON 0 I0403 500507630320C3BD Online Fibre Channel-LW FICON 0 I0430 50050763032303BD Online Fibre Channel-LW FICON 0 I0431 50050763032343BD Online Fibre Channel-LW FICON 0 I0432 50050763032383BD Online Fibre Channel-LW FICON 0 I0433 500507630323C3BD Online Fibre Channel-LW FICON 0 I0500 50050763032803BD Online Fibre Channel-LW FICON 0 I0501 50050763032843BD Online Fibre Channel-LW FICON 0 I0502 50050763032883BD Online Fibre Channel-LW FICON 0 I0503 500507630328C3BD Online Fibre Channel-LW FICON 0 Example 5-15 DS8000 example 3: Storage Image 2 I/O enclosures, host adapters, and I/O ports dscli> lsioport -l -dev ibm.2107-7566322 Date/Time: September 4, 2005 5:32:15 PM EDT IBM DSCLI Version: 5.0.5.52 DS:IBM.2107-7566322 ID WWPN State Type topo portgrp =============================================================== I0200 5005076303100BBD Online Fibre Channel-LW FICON 0 I0201 5005076303104BBD Online Fibre Channel-LW FICON 0 I0202 5005076303108BBD Online Fibre Channel-LW FICON 0 I0203 500507630310CBBD Online Fibre Channel-LW FICON 0 I0230 5005076303130BBD Online Fibre Channel-LW FICON 0 I0231 5005076303134BBD Online Fibre Channel-LW FICON 0 I0232 5005076303138BBD Online Fibre Channel-LW FICON 0 I0233 500507630313CBBD Online Fibre Channel-LW FICON 0 I0300 5005076303180BBD Online Fibre Channel-LW FICON 0 I0301 5005076303184BBD Online Fibre Channel-LW FICON 0 I0302 5005076303188BBD Online Fibre Channel-LW FICON 0 I0303 500507630318CBBD Online Fibre Channel-LW FICON 0 I0600 5005076303300BBD Online Fibre Channel-LW FICON 0 I0601 5005076303304BBD Online Fibre Channel-LW FICON 0
I0602 5005076303308BBD Online Fibre Channel-LW FICON 0
I0603 500507630330CBBD Online Fibre Channel-LW FICON 0
I0630 5005076303330BBD Online Fibre Channel-LW FICON 0
I0631 5005076303334BBD Online Fibre Channel-LW FICON 0
I0632 5005076303338BBD Online Fibre Channel-LW FICON 0
I0633 500507630333CBBD Online Fibre Channel-LW FICON 0
I0700 5005076303380BBD Online Fibre Channel-LW FICON 0
I0701 5005076303384BBD Online Fibre Channel-LW FICON 0
I0702 5005076303388BBD Online Fibre Channel-LW FICON 0
I0703 500507630338CBBD Online Fibre Channel-LW FICON 0
Notes:
1. The port IDs shown in the lsioport command output are logical port IDs. In logical port IDs, the slot numbers are one less than the physical location numbers for host adapter cards on the labels below the host adapter cards at the bottom of the DS8000.
2. As shown in the DSCLI output, the default I/O topology configurations are FICON for LW host adapters and FC-AL for SW host adapters.
3. The host adapters installed in the DS8000 in this example do not necessarily follow the current order of installation of host adapters.
11. Create FB LUNs.
12. Create Open Systems host definitions.
13. Create Open Systems DS8000 volume groups.
14. Assign Open Systems hosts and volumes to DS8000 volume groups.
15. Configure I/O ports.
16. Implement SAN zoning, multipathing software, and host-level striping as desired.

After the logical configuration has been created on the DS8000, it is important to document it. You can use the DS Storage Manager to export information in spreadsheet format (that is, save it as a comma-separated values (csv) file). You can use this information, together with the planning spreadsheet shown in Figure 5-9 on page 106, to document the logical configuration.

The DSCLI provides a set of list (ls) and show commands, whose output can be redirected and appended into a plain text or csv file. A list of selected DSCLI commands, as shown in Example 5-16, can easily be invoked as a DSCLI script (using the DSCLI command dscli -script) to collect the logical configuration of a DS8000 Storage Image. This output can be used as a text file or imported into a spreadsheet to document the logical configuration. You can obtain more advanced scripts to collect the logical configuration of a DS8000 subsystem in Appendix C, UNIX shell scripts on page 587.

Example 5-16 only collects a minimum set of DS8000 logical configuration information, but it illustrates a simple DSCLI script implementation and runs quickly within a single DSCLI command session. Dependent on the environment, you can modify this script to include more commands to provide more information, for example, about Copy Services configurations and source/target relations. Note that the DSCLI script terminates with the first command that returns an error, which, for example, can even be a simple lslcu command if no LCUs are defined. You can adjust the output of the ls commands in a DSCLI script to meet special formatting and delimiter requirements using the appropriate options for format, delim, or header in the specified DS8000 profile file or on selected ls commands.
Example 5-16 Example of a minimum DSCLI script get_config.dscli to gather the logical configuration
> dscli -cfg profile/DEVICE.profile -script get_config.dscli > DEVICE_SN_config.out
CMMCI9029E showrank: rank R48 does not exist.
> cat get_config.dscli
ver -l
lssu -l
lssi -l
lsarraysite -l
lsarray -l
lsrank -l
lsextpool -l
lsaddressgrp
lslss
#lslcu        # Use only if CKD volumes and LCUs have been configured,
              # otherwise the command returns an error and the script terminates.
lsioport -l
lshostconnect
lsvolgrp
lsfbvol -l    # Use only if FB volumes have been configured
#lsckdvol -l  # Use only if CKD volumes have been configured,
              # otherwise the command returns an error and the script terminates.
showrank R0   # Modify this list of showrank commands so that the showrank
showrank R1   # command is run on all available ranks! Note that an error is
...           # returned if the specified rank is not present. The script
showrank R48  # terminates on the first non-existing rank. Check for gaps in
              # the rank ID sequence.
Chapter 6. Performance management process
6.1 Introduction
The IBM System Storage DS8000 series is designed to support the most demanding business applications with its exceptional performance and superior data throughput. This strength, combined with its world-class resiliency features, makes it an ideal storage platform for supporting today's 24x7, global business environment. Moreover, with its tremendous scalability, broad server support, and flexible virtualization capabilities, the DS8000 can help simplify the storage environment and consolidate multiple storage systems onto a single DS8000 system.

This power is the potential of the DS8000, but careful planning and management are essential to realize that potential in a complex IT environment. Even a well-configured system will be subject to changes over time that affect performance, such as:
- Additional host systems
- Increasing workload
- Additional users
- Additional DS8000 capacity
A typical case
To demonstrate the performance management process, we look at a typical situation where DS8000 performance has become an issue. Users begin to open incident tickets to the IT Help Desk, claiming that the system is slow and therefore is delaying the processing of orders from their clients and the submission of invoices. IT Support investigates and detects that there is contention in I/O to the host systems. The Performance and Capacity team is involved and analyzes performance reports together with the IT Support teams. Each IT Support team (operating system, storage, database, and application) issues its report defining the actions necessary to resolve the problem. Certain actions might have a marginal effect but are faster to implement; other actions might be more effective but need more time and resources to put in place. Among the actions, the Storage team and the Performance and Capacity team report that additional storage capacity is required to support the I/O workload of the application and ultimately to resolve the problem.

IT Support presents its findings and recommendations to the company's Business Unit, requesting application downtime to implement the changes that can be made immediately. The Business Unit accepts the report but says it has no money for the purchase of new storage. They ask the IT department how they can ensure that the additional storage will resolve the performance issue. Additionally, the Business Unit asks the IT department why the need for additional storage capacity was not submitted as a draft proposal three months ago, when the budget was finalized for the next year, knowing that the system is one of the most critical systems of the company.

Incidents such as this one make us realize the distance that can exist between the IT department and the company's business strategy. In many cases, the IT department plays a key role in determining the company's strategy. Therefore, the questions to consider are:
- How can we avoid situations like those just described?
- How can we make performance management become more proactive and less reactive?
- What are best practices for performance management?
- What are the key performance indicators of the IT infrastructure and what do they mean from the business perspective?
- Are the defined performance thresholds adequate?
- How can we identify the risks in managing the performance of assets (servers, storage systems, and applications) and mitigate them?
In the following pages, we present a method to implement a performance management process. The goal is to give you ideas and insights with particular reference to the DS8000. We assume in this instance that data from IBM TotalStorage Productivity Center is available. To better align the understanding between the business and the technology, we use as a guide the Information Technology Infrastructure Library (ITIL) to develop a process for performance management as applied to DS8000 performance and tuning.
6.2 Purpose
The purpose of performance management is to ensure that the performance of the IT infrastructure matches the demands of the business. The activities involved are:
- Definition and review of performance baselines and thresholds
- Performance data collection from the DS8000
- Checking whether the performance of resources is within the defined thresholds
- Performance analysis, using the DS8000 performance data collected, and tuning recommendations
- Definition and review of standards and IT architecture related to performance
- Analysis of performance trends
- Sizing of new storage capacity requirements

Certain activities are related to operational activities, such as the analysis of the performance of DS8000 components; other activities are related to tactical activities, such as performance analysis and tuning; and other activities are related to strategic activities, such as storage capacity sizing. We can split the process into three subprocesses:
- Operational performance subprocess: Analyze the performance of DS8000 components (processor complexes, device adapters, host adapters, ranks, and so forth) and ensure that they are within the defined thresholds and service level objectives (SLOs) and service level agreements (SLAs).
- Tactical performance subprocess: Analyze performance data and generate reports for tuning recommendations and the review of baselines and performance trends.
- Strategic performance subprocess: Analyze performance data and generate reports for storage sizing and the review of standards and architectures that are related to performance.

Every process is composed of the following elements:
- Inputs: Data and information required for analysis. The possible inputs are:
  - Performance data collected by TotalStorage Productivity Center
  - Historical performance reports
  - Product specifications (benchmark results, performance thresholds, and performance baselines)
  - User specifications (SLOs and SLAs)
- Outputs: The deliverables or results from the process. Possible types of output are:
  - Performance reports and tuning recommendations
  - Performance trends
  - Performance alerts
- Tasks: The activities that are the smallest units of work of a process. These tasks can be:
  - Performance data collection
  - Performance report generation
  - Analysis and tuning recommendations
- Actors: A department or person in the organization that is specialized to perform a certain type of work. Actors can vary from organization to organization, and in smaller organizations a single person can own multiple actor responsibilities. Examples are:
  - Capacity and performance team
  - Storage team
  - Server teams
  - Database team
  - Application team
  - Operations team
  - IT Architect
  - IT Manager
  - Clients
- Roles: The relationship of an actor to a task. One actor might execute a task, another actor might own it, and other actors might just be consulted. Knowing the roles is helpful when you define the steps of the process and who is going to do what. The roles can be:
  - Responsible: The person that executes the task but is not necessarily the owner of that task. Suppose that the capacity team owns the generation of the performance report with the tuning recommendations, but the specialist that holds the skill to make tuning recommendations for the DS8000 is the Storage Administrator.
  - Accountable: The owner of the activity. There can only be one owner.
  - Consulted: The people that are consulted and whose opinions are considered. Suppose that the IT Architect proposes a new architecture for the storage. Normally, the opinion of the Storage Administrator is requested.
  - Informed: The people who are kept up-to-date on progress. The IT Manager normally wants to know the evolution of activities.

When assigning the tasks, you can use a Responsible, Accountable, Consulted, and Informed (RACI) matrix to list the actors and the roles that are necessary to define a process or subprocess. A RACI diagram, or RACI matrix, describes the roles and responsibilities of various teams or people in delivering a project or performing an operation. It is especially useful in clarifying roles and responsibilities in cross-functional and cross-departmental projects and processes.
Performance SLA
A performance SLA is a formal agreement between IT Management and User representatives concerning the performance of the IT resources. Often, these SLAs provide goals for end-to-end transaction response times. In the case of storage, these goals typically relate to average disk response times for different types of storage. Missing the technical goals described in the SLA results in financial penalties to the IT service management providers.
Performance SLO
Performance SLOs are similar to SLAs with the exception that misses do not carry financial penalties. While SLO misses do not carry financial penalties, misses are in many cases a breach of contract and can lead to serious consequences if not remedied.

Having reports that show how many alerts and how many SLO/SLA misses have occurred over time is very important. They tell how effective your storage strategy (standards, architectures, and policy allocation) is in the steady state. In fact, the numbers in those reports are inversely proportional to the effectiveness of your storage strategy: the more effective your storage strategy, the fewer performance threshold alerts are registered and the fewer SLO/SLA targets are missed.

It is not necessary to have implemented SLOs or SLAs to discover the effectiveness of your current storage strategy. The definition of an SLO/SLA requires a deep and clear understanding of your storage strategy and how well your DS8000 is running. That is why, before implementing this process, we recommend that you start with the tactical performance process:
- Generate the performance reports
- Define tuning recommendations
- Review the baseline after implementing tuning recommendations
- Generate performance trends reports
Then, redefine the thresholds with fresh performance numbers. If you fail to do so, you will spend time dealing with performance incident tickets caused by false-positive alerts instead of analyzing the performance and making tuning recommendations for your DS8000. Let us look at the characteristics of this process.
6.3.1 Inputs
The inputs necessary to make this process effective are:
- Performance trends reports of DS8000 components: Many people ask for the IBM recommended thresholds. In our opinion, the best thresholds are those that fit your environment. The best thresholds are extremely dependent on the configuration of your DS8000 and the type of workloads. For example, you need to define thresholds for IOPS if your application is a transactional system. If the application is a data warehouse, you need to define thresholds for throughput. Also, you must not expect the same performance from different ranks where one set of ranks has 73 GB, 15k revolutions per minute (rpm) Fibre Channel disk drive modules (DDMs) and another set of ranks has 500 GB, 7200 rpm Fibre Channel Advanced Technology Attachment (FATA) DDMs. Check the outputs generated from the tactical performance subprocess for additional information.
- Performance SLO and performance SLA: You can define the SLO/SLA requirements in two ways:
  - By hardware (IOPS by rank or MB/s by port): This performance report is the easiest way to implement an SLO or SLA but the most difficult method for which to get client agreement. The client normally does not understand the technical aspects of a DS8000.
  - By host or application (IOPS by system or MB/s by host): Most probably, this performance report is the only way that you are going to get an agreement from the client, but even this agreement is not certain. As we said before, the client does not normally understand the technical aspects of the IT infrastructure. The most typical way to define a performance SLA is by the average execution time or response time of a transaction in the application.

  So, the performance SLA/SLO for the DS8000 is normally an internal agreement among the support teams, which creates additional work for you to generate those reports, and there is no predefined solution. It depends on your environment's configuration and the conditions that define those SLOs/SLAs. When configuring the DS8000 with SLO/SLA requirements, we recommend that you separate the applications or hosts by LSS (reserve two LSSs, one even and one odd, for each host, system, or instance). The benefit of generating performance reports using this method is that they are more meaningful to the other support teams and to the client. Consequently, the level of communication will increase significantly and reduce the chances for misunderstandings.

Note: When defining a DS8000-related SLA or SLO, ensure that the goals are based on empirical evidence of performance within the environment. Application architects with applications that are highly sensitive to changes in I/O throughput or response time need to consider the measurement of percentiles or standard deviations as opposed to average values over an extended period of time. IT management must ensure that the technical requirements are appropriate for the technology.
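As a minimal sketch of this percentile-based approach (not part of the original text), the following Python fragment computes the average, 95th percentile, and standard deviation of response times from a performance data export. The file name and the "Response Time (ms)" column are illustrative assumptions; adjust them to the actual layout of your TotalStorage Productivity Center report.

import csv
import statistics

def suggest_threshold(samples, percentile=95):
    # Value below which `percentile` percent of the samples fall (nearest-rank method).
    ordered = sorted(samples)
    index = int(round((percentile / 100.0) * (len(ordered) - 1)))
    return ordered[index]

# Hypothetical export: one row per interval with a "Response Time (ms)" column.
with open("ds8000_volume_perf.csv", newline="") as f:
    response_times = [float(row["Response Time (ms)"]) for row in csv.DictReader(f)]

print("average         :", round(statistics.mean(response_times), 2), "ms")
print("95th percentile :", round(suggest_threshold(response_times, 95), 2), "ms")
print("std deviation   :", round(statistics.pstdev(response_times), 2), "ms")

Basing a threshold on a high percentile rather than the average reduces the number of false-positive alerts caused by short, isolated spikes.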
In cases where contractual penalties are associated with production performance SLA or SLO misses, be extremely careful in the management and implementation of the DS8000.
Even in the cases where no SLA or SLO exists, users have performance expectations that are not formally communicated. In these cases, they will let IT management know when the
performance of the IT resources is not meeting their expectations. Unfortunately, by the time they communicate their missed expectations they are often frustrated, and their ability to manage their business is severely impacted by performance issues. While there might not be any immediate financial penalties associated with missed user expectations, prolonged negative experiences with under-performing IT resources will result in low user satisfaction.
6.3.2 Outputs
The outputs generated by this process are:
- Documentation of the defined DS8000 performance thresholds. It is important to document the agreed-to thresholds, not just for you, but also for other members of your team or other teams that need to know.
- DS8000 alerts for performance utilization. These alerts are generated when a DS8000 component reaches a defined level of utilization. With TotalStorage Productivity Center for Disk, you can automate the performance data collection and also configure TotalStorage Productivity Center to send an alert when this type of event occurs.
- Performance reports comparing the performance utilization of the DS8000 with the performance SLO and SLA.
Figure 6-1 is an example of a RACI matrix for the operational performance subprocess, with all the tasks, actors, and roles identified and defined.
- Provide performance trends report: This report is an important input for the operational performance subprocess. With this data, you can identify and define the thresholds that best fit your DS8000. Consider how the workload is distributed among the internal components of the DS8000: host adapters, processor complexes, device adapters, and ranks. This analysis avoids the definition of thresholds that generate false-positive performance alerts and ensures that you monitor only what is relevant to your environment.
- Define the thresholds to be monitored and their respective values, severity, queue to open the ticket, and additional instructions: In this task, using the baseline performance report, you can identify and set the relevant threshold values. You can use TotalStorage Productivity Center to create alerts when these thresholds are exceeded. For example, you can configure TotalStorage Productivity Center to send the alerts via SNMP traps to Tivoli Enterprise Console (TEC) or via e-mail. However, the opening of an incident ticket needs to be performed by the Monitoring team, who will need to know the severity to set, on which queue to open the ticket, and any additional information that is required in the ticket. Figure 6-2 is an example of the required details.
- Implement performance monitoring and alerting: After you have defined which DS8000 components to monitor, set their corresponding threshold values. For detailed information about how to configure TotalStorage Productivity Center, refer to the IBM TotalStorage Productivity Center documentation, which can be found at:
  http://publib.boulder.ibm.com/infocenter/tivihelp/v4r1/index.jsp
- Publish the documentation to the IT Management team: After you have implemented the monitoring, send the respective documentation to those people who need to know.
- Does the error report show any disk, logical volume (LV), host adapter card, or other I/O-type errors?
- What is the interpolicy of the logical volumes (maximum or minimum)?
- Describe any striping, mirroring, or RAID configurations in the affected LVs.
- Is this a production or development workload, and is it associated with benchmarking or breakpoint testing?

In addition to the answers to these questions, the client must provide server performance and configuration data. Refer to the relevant host chapters in this book for more detail.
problems. Key performance indicators provide information to identify consistent performance hot spots or workload imbalances.
6.4.1 Inputs
The inputs necessary to make this process effective are:
- Product specifications: Documents that describe the characteristics and features of the DS8000, such as data sheets, Announcement Letters, and planning manuals
- Product documentation: Documents that provide information about the installation and use of the DS8000, such as user manuals, white papers, and IBM Redbooks publications
- Performance SLOs/performance SLAs: The documentation of the performance SLOs/SLAs to which the client has agreed for the DS8000
6.4.2 Outputs
Performance reports with tuning recommendations and performance trends reports are the outputs that are generated by this process.
The tasks include:
- Collect configuration and raw performance data: Use TotalStorage Productivity Center for Disk daily probes to collect configuration data. Set up the Subsystem Performance Monitor to run indefinitely and to collect data at 15-minute intervals for each DS8000.
- Generate performance graphics: Produce one key metric for each physical component in the DS8000 over time. For example, if your workload is evenly distributed across the entire day, show the average daily disk utilization for each disk array. Configure thresholds in the chart to identify when a performance constraint might occur. Use a spreadsheet to create a linear trend line based on the data previously collected and identify when a constraint might occur (a small trend-line sketch follows this list).
- Generate performance reports with tuning recommendations: You must collect and review host performance data on a regular basis. On discovery of a performance issue, Performance Management must work with the Storage team and the client to develop a plan to resolve it. This plan can involve some form of data migration on the DS8000. For key I/O-related performance metrics, refer to the chapters in this book for your specific operating system. Typically, the end-to-end I/O response time is measured, because it provides the most direct measurement of the health of the SAN and disk subsystem.
- Generate performance reports for trend analysis: Methodologies typically applied in capacity planning can be applied to the storage performance arena. These methodologies rely on workload characterization, historical trends, linear trending techniques, I/O workload modeling, and "what if" scenarios. You can obtain details in Chapter 7, Performance planning tools on page 161.
- Schedule meetings with the involved areas: The frequency depends on the dynamism of your IT environment. The greater the rate of change (the deployment of new systems, upgrades of software and hardware, fix management, allocation of new LUNs, implementation of Copy Services, and so on), the more frequent the meetings need to be. For the performance reports with tuning recommendations, we recommend weekly meetings. For performance trends reports, we recommend monthly meetings. You might want to change the frequency of these meetings after you gain more confidence and familiarity with the performance management process. At the end of these meetings, define with the other support teams and the IT Manager the actions to resolve any potential issues that have been identified.
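The trend-line step above can also be approximated outside of a spreadsheet. The following Python sketch (an illustration, not part of the documented procedure) fits a least-squares line to weekly average disk array utilization and estimates when an assumed amber threshold of 60% would be reached; the sample values and the threshold are hypothetical.

def linear_trend(x, y):
    # Least-squares slope and intercept, the same math as a spreadsheet trend line.
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / \
            sum((xi - mean_x) ** 2 for xi in x)
    return slope, mean_y - slope * mean_x

weeks = list(range(12))                               # week index, oldest first
utilization = [38.0, 39.5, 41.2, 42.0, 44.1, 45.3,    # weekly average disk array
               47.0, 48.2, 49.9, 51.5, 52.8, 54.1]    # utilization in percent

slope, intercept = linear_trend(weeks, utilization)
amber = 60.0                                          # assumed amber threshold
print("Utilization grows about %.1f%% per week" % slope)
print("Projected to reach %.0f%% around week %.0f" % (amber, (amber - intercept) / slope))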
6.5.1 Inputs
The inputs required to make this process effective are:
- Performance reports with tuning recommendations
- Performance trends reports
6.5.2 Outputs
The outputs generated by this process are:
- Standards and architectures: Documents that specify:
  - Naming conventions for the DS8000 components: ranks, extent pools, volume groups, host connections, and LUNs
  - Rules to format and configure the DS8000: arrays, RAID, ranks, extent pools, volume groups, host connections, logical subsystems (LSSs), and LUNs
  - Policy allocation: When to pool the applications or host systems on the same set of ranks. When to segment the applications or host systems in different ranks. Which type of workload must use RAID 5, RAID 6, or RAID 10? Which type of workload must use DDMs of 73 GB/15k rpm, 146 GB/15k rpm, or 300 GB/15k rpm?
- Sizing of a new or existing DS8000: According to the business demands, what are the recommended capacity, cache, and host ports for a new or existing DS8000?
- Planned configuration of a new DS8000: What is the planned configuration of the new DS8000 based on your standards and architecture and according to the workload of the systems that will be deployed?
Figure 6-4 is an example of a RACI matrix for the strategic performance subprocess, with all the tasks, actors, and roles identified and defined:
- Define priorities of new investments: In defining the priorities of where to invest, you must consider these four objectives:
  - Reduce cost: The simplest example is storage consolidation. There might be several storage systems in your data center that are nearing the end of their useful life. The costs of maintenance are increasing, and the storage subsystems use more energy than new models. The IT Architect can create a case for storage consolidation but will need your help to specify and size the new storage.
  - Increase availability: There are production systems that need to be available 24x7. The IT Architect needs to submit a new solution for this case to provide data mirroring. The IT Architect will require your help to specify the new storage for the secondary site and to provide figures for the necessary performance.
  - Mitigate risks: Consider a case where a system is running on an old storage model without a support contract from the vendor. That system started as a pilot with no importance. Over time, that system presented great performance and is now a key application for the company. The IT Architect needs to submit a proposal to migrate to a new storage system. Again, the IT Architect will need your help to specify the new storage requirements.
  - Business unit demands: Depending on the target results that each business unit has to meet, the business units might require additional IT resources. The IT Architect will require information about the additional capacity that is required.
- Define and review standards and architectures: After you have defined the priorities, you might need to review the standards and architecture. New technologies will appear, so you might need to specify new standards for new storage models. Or maybe, after a period of time analyzing the performance of your DS8000, you discover that for a certain workload you need to change a standard.
- Size new or existing DS8000: Modeling tools, such as Disk Magic, which is described in 7.1.4, Disk Magic modeling on page 163, can gather multiple workload profiles based on host performance data into one model and provide a method to assess the impact of one or more changes to the I/O workload or DS8000 configuration.

Note: For environments with multiple applications on the same physical servers or on logical partitions (LPARs) using the same Virtual I/O servers, defining new requirements can be quite challenging. We recommend building profiles at the DS8000 level first and eventually moving into a more in-depth study and understanding of the other shared resources in the environment.
- Plan configuration of new DS8000: Configuring the DS8000 to meet the specific I/O performance requirements of an application will reduce the probability of production performance issues. To produce a design to meet these requirements, Storage Management needs to know:
  - I/Os per second
  - Read to write ratios
  - I/O transfer size
  - Access type: sequential or random
For help in translating application profiles to I/O workload, refer to Chapter 3, Understanding your workload on page 29. After the I/O requirements have been identified, documented, and agreed upon, the DS8000 layout and logical planning can begin. Refer to Chapter 5, Logical configuration performance considerations on page 63 for additional detail and considerations for planning for performance.

Note: A lack of communication between the Application Architects and the Storage Management team regarding I/O requirements will likely result in production performance issues. It is essential that these requirements are clearly defined. For existing applications, you can use Disk Magic to analyze an application's I/O profile. Details about Disk Magic are in Chapter 7, Performance planning tools on page 161.
Chapter 7. Performance planning tools
2. In this particular example, we select the JL292059.dmc file, which opens the following window, shown in Figure 7-3 on page 166.
3. Here, we see that there are four LPARs (SYSA, SYSB, SYSC, and SYSD) and two disk subsystems (IBM-12345 and IBM-67890). Clicking the IBM-12345 icon opens the general information that is related to this disk subsystem (Figure 7-4). It shows that this is an ESS-800 with 32 GB of cache, and that it was created in RMF Magic using the subsystem identifier (SSID) or logical control unit (LCU) level. The number of subsystem identifiers (SSIDs) or LCUs is 12, as shown in the Number of zSeries LCUs field.
4. Selecting Hardware Details on Figure 7-4 brings up the window in Figure 7-5 on page 167 and allows you to change the following features, based on the actual hardware configuration of the ESS-800:
- SMP Type
- Number of host adapters
- Number of device adapters
- Cache Size
5. Next, click the Interface tab shown in Figure 7-4 on page 166. We see that each LPAR connects to the disk subsystem through eight FICON Express2 2 Gb channels. If this is not correct, you can change it by clicking Edit.
6. Selecting From Disk Subsystem in Figure 7-6 shows the interface used by the disk subsystem. Figure 7-7 on page 168 indicates that ESS IBM-12345 uses eight FICON Ports. In this panel, you also indicate if there is a Remote Copy relationship between this ESS-800 and a remote disk subsystem. You also get a choice to define the connections used between the Primary site and the Secondary site.
7. The next step is to look at the DDM by clicking the zSeries Disk tab. The DDM type shows up here as 36 GB/15K rpm. Because the DDM used is actually 73 GB/10K rpm, we update this information by clicking Edit. The 3390 types or models here are 3390-3 and 3390-9 (Figure 7-8). Because any 3390 model that has a greater capacity than a 3390-9 model will show up as a 3390-9 in the DMC file, we need to know the actual models of the 3390s. Generally, there is a mixture of 3390-9, 3390-27, and 3390-54.
8. To see the last option, select zSeries Workload. Because this DMC file is created using the SSID or LCU option, here we see the I/O statistics for each LPAR by SSID (Figure 7-9 on page 169). If we click the Average tab to the right of SYSA (the Average tab at the top in Figure 7-9 on page 169) and scroll to the right of SSID 4010 and click the Average tab (the Average tab at the bottom in Figure 7-10 on page 169), we get the total I/O rate from all four LPARs to this ESS-800, which is 9431.8 IOPS (Figure 7-10 on page 169).
Figure 7-10 zSeries I/O statistics from all LPARs to this ESS-800
9. Clicking Base creates the base model for this ESS-800. If the workload statistics mean that the base model cannot be created (the cause might be an excessive CONN time, for example), we have to find another DMC file from a different time period and try to create the base model from that DMC file. After creating this base model for IBM-12345, we must also create the base model for IBM-67890, following this same procedure.
Figure 7-11 zSeries merge and create new target disk subsystem
2. Because we want to merge the ESS-800s to a DS8300, we need to modify this Merge Target1. Clicking IBM DS8100 on the Hardware Type option opens a window presenting choices, where we can select the IBM DS8300 Turbo. We also select Parallel Access Volumes so that Disk Magic will model the DS8300 to take advantage of this feature.
3. Selecting Hardware Details opens the window in Figure 7-13. If we had selected the IBM DS8300LPAR Turbo option, this option also allows us to select from the Processor Percentage option, with the choices of 25, 50, or 75. The Failover Mode option allows you to model the performance of the DS8000 when one processor server with its associated processor storage has been lost. Here, we can select the cache size, in this case, we select 64 GB, because the two ESS-800s each has 32 GB cache. In a DS8300, this selection automatically also determines the nonvolatile storage (NVS) size. Disk Magic computes the number of host adapters on the DS8000 based on the specification on the Interfaces page, but you can, to a certain extent, override these numbers. We recommend that you use one host adapter for every two ports, for both the Fibre Channel connection (FICON) ports and the Fibre ports. The Fibre Ports are used for Peer-to-Peer Remote Copy (PPRC) links. Here, we select 4 FICON Host Adapters because we are using eight FICON ports on the DS8300 (refer to the Count column in Figure 7-15 on page 172).
4. Clicking the Interfaces tab opens the From Servers dialog (Figure 7-14 on page 172). Because the DS8300 FICON ports are running at 4 Gbps, we need to update this option on all four LPARs and also on the From Disk Subsystem (Figure 7-15 on page 172) dialog. If the Host CEC uses different FICON channels than what is specified here, it also needs to be updated. At this point, you select and determine the Remote Copy Interfaces. You need to select the Remote Copy type and the connections used for the Remote Copy links.
5. To select the DDM capacity and rpm used, click the zSeries Disk tab in Figure 7-15. Now, you can select the DDM type used by clicking Edit in Figure 7-16 on page 173. In our example, we select DS8000 146GB/15k DDM. Usually, you do not specify the number of volumes used, but let Disk Magic determine it by adding up all the 3390s coming from the merge source disk subsystems. If you know the configuration that will be used as the target subsystem and want the workload to be spread over all the DDMs in that configuration, you can select the number of volumes on the target subsystem so that it will reflect the number of ranks configured. You can also specify the RAID type used for this DDM set.
6. Merge the second ESS onto the target subsystem. In Figure 7-17, right-click IBM-67890, select Merge, and then, select Add to Merge Source Collection.
7. Perform the merge procedure. From the Merge Target window (Figure 7-18 on page 174), click Start Merge.
8. This selection initiates Disk Magic to merge the two ESS-800s onto the new DS8300 and creates Merge Result1 (Figure 7-19).
9. To see the DDM configured for the DS8300, select zSeries Disk on MergeResult1. Here, you can see the total capacity configured based on the total number of volumes on the two ESS-800s (Figure 7-20 on page 175). There are 11 ranks of 146 GB/15K rpm DDM required.
10.Selecting zSeries Workload shows the Disk Magic predicted performance of the DS8300. You can see that the modelled DS8300 will have an estimated response time of 1.1 msec. Note that Disk Magic assumes that the workload is spread evenly among the ranks within the extent pool configured for the workload (Figure 7-21).
11.Clicking Utilization brings up the utilization statistics of the various DS8300 components. In Figure 7-22 on page 176, you can see that the Average FICON HA Utilization is 39.8% and has a darker (amber) background color. This amber background is an indication that the utilization of that resource is approaching its limit. This percentage is still acceptable at this point, but it is a warning that workload growth might push this resource to its limit. Any resource that is a bottleneck will be shown with a red background. If a resource has a red background, you need to increase the size of that resource to resolve the bottleneck.
12.Figure 7-23 on page 177 can be used as a guideline for the various resources in a DS8000. The middle column has an amber background color, and the rightmost column has a red background color. The amber number indicates a warning that if the resource utilization reaches this number, an increase in the workload might soon cause the resource to reach its limit. The red numbers are the utilization numbers that indicate that the resource is already saturated and will cause an increase in one of the components of the response time.
13.Of course, it is better if the merge result shows that none of the resource utilizations fall into the amber category.
2. Hold the Ctrl key down on the keyboard and select IBM-12345, IBM-67890, and MergeResult1. Right-click any of them and a small window pops up. Select Graph from this window. In the panel that appears (Figure 7-25), select Clear to clear up any graph option that might have been set up before.
3. Click Plot to produce the response time components graph of the three disk subsystems that you selected in a Microsoft Excel spreadsheet. Figure 7-26 is the graph that was created based on the numbers from the Excel spreadsheet.
Figure 7-26 zSeries response time components (IOSQ, PEND, DISC, and CONN) for the two ESS-800s and the DS8300 (ESS-67890 at 6901 I/Os per second, DS8300 at 16332 I/Os per second)
You might have noticed that the DS8300 response time on the chart shown in Figure 7-26 on page 178 is 1.0 msec, while the Disk Magic projected response time of the DS8300 in Figure 7-28 on page 180 is 1.1 msec. They differ because Disk Magic rounds the performance statistics to one decimal place.
Figure 7-28 zSeries response time projection with workload growth (the modelled response time grows from 1.1 msec at the base I/O rate to 3.6 msec at the highest projected I/O rate)
In Figure 7-29, observe that the FICON host adapter started to reach the red area at 22000 I/Os per second. The workload growth projection stops at 40000 I/Os per second, because the FICON host adapter reaches 100% utilization when the I/O rate is greater than 40000 I/Os per second.
Figure 7-29 lists the utilization of each DS8300 resource (Average SMP, Average Bus, Average Logical Device, Highest DA, Highest HDD, Average FICON HA, and Highest FICON Port) for total I/O rates from 16000 to 40000 I/Os per second in steps of 2000. The amber thresholds for these resources are 60%, 70%, n/a, 60%, 60%, 35%, and 35%, and the red thresholds are 80%, 90%, n/a, 80%, 80%, 50%, and 50%, respectively. The Average FICON HA utilization grows from 39.0% at 16000 I/Os per second to 97.6% at 40000 I/Os per second, crossing its red threshold at 22000 I/Os per second.
TotalStorage Productivity Center comma separated values (csv) output file for the period that we want to model with Disk Magic. The periods to model are usually:
- Peak I/O period
- Peak Read + Write throughput in MBps
- Peak Write throughput in MBps
TotalStorage Productivity Center for Disk creates the TotalStorage Productivity Center csv output files.
2. The result is shown in Figure 7-31 on page 182. To include both csv files, click Select All and then click Process to display the I/O Load Summary by Interval table (refer to Figure 7-35 on page 184). This table shows the combined load of both ESS11 and ESS14 for all the intervals recorded in the TotalStorage Productivity Center csv file.
3. Selecting Excel in Figure 7-35 on page 184 creates a spreadsheet with graphs for the I/O rate (Figure 7-32), Total MBps (Figure 7-33 on page 183), and Write MBps (Figure 7-34 on page 183) for the combined workload on both of the ESS-800s by time interval. Figure 7-32 shows the I/O Rate graph and that the peak is approximately 18000+ IOPS.
Figure 7-32 Open Systems I/O rate by time interval
Figure 7-33 shows the total MBps graph and that the peak is approximately 4700 MBps. This graph shows that this peak looks out-of-line compared to the other total MBps numbers from the other periods. Before using this interval to model the peak total MBps, investigate to learn whether this peak is real or if something unusual happened during this period that might have caused this anomaly.
Figure 7-33 Open Systems total MBps by time interval
Also, investigate the situation for the peak write MBps on Figure 7-34. The peak period here, as expected, coincides with the peak period of the total MBps.
Write MB/s
1400 1200 1000 800 600 400 200 0 1-Aug 3-Aug 4-Aug 5-Aug 6-Aug 8-Aug 9-Aug 10-Aug 11-Aug 13-Aug 14-Aug 15-Aug 16-Aug 18-Aug 19-Aug 20-Aug 22-Aug 23-Aug
Interval Time
Figure 7-34 Open Systems write MBps by time interval Chapter 7. Performance planning tools
4. Clicking the I/O Rate column header in Figure 7-35 highlights the peak I/O rate for the combined ESS11 and ESS14 disk subsystems (Figure 7-35).
5. From this panel, select Add Model and then select Finish. A pop-up window prompts you with Did you add a Model for all the intervals you need? because you can include multiple workload intervals in the model. However, we just model one workload interval, so we respond Yes. The window in Figure 7-36 opens.
6. In Figure 7-36, double-click ESS11 to get the general information related to this disk subsystem as shown in Figure 7-37 on page 185. Figure 7-37 on page 185 shows that this disk subsystem is an ESS-800 with 16 GB of cache and 2 GB of NVS.
Figure 7-37 Open Systems general information about the ESS11 disk subsystem
7. Selecting Hardware Details in Figure 7-37 allows you to change the following features, based on the actual hardware configuration of the ESS-800:
- SMP type
- Number of host adapters
- Number of device adapters
- Cache size
In this example (Figure 7-38), we change the cache size to the actual cache size of the ESS11, which is 32 GB.
8. Next, click the Interface tab in Figure 7-37. The From Servers panel (Figure 7-39 on page 186) shows that each server connects to the disk subsystem through four 2 Gb Fibre Channels. If this information is not correct, you can change it by clicking Edit.
9. Selecting the From Disk Subsystem option in Figure 7-39 displays the interface used by the disk subsystem. Figure 7-40 shows that ESS11 uses eight Fibre 2 Gb ports. We need to know how many Fibre ports are actually used here, because there are two servers accessing this ESS-800 and each server uses four Fibre Channels, so there can be up to eight Fibre ports on the ESS-800. In this particular case, there are eight Fibre ports on the ESS-800. If there are more (or fewer) Fibre ports, you can update this information by clicking Edit.
10.On this panel, you also indicate whether there is a Remote Copy relationship between this ESS-800 and a remote disk subsystem. It also gives you a choice to define the connections used between the Primary site and the Secondary site.
11.The next step is to look at the DDM by choosing the Open Disk tab. Figure 7-41 shows DDM options by server. Here, we fill in or select the actual configuration specifics of ESS11, which is accessed by server Sys_ESS11. The configuration details are:
- Total capacity: 12000 GB
- DDM type: 73 GB/10K rpm
- RAID type: RAID 5
12.Next, click Sys_ESS14 in Figure 7-41 and leave the Total Capacity at 0, because ESS11 is accessed by Sys_ESS11 only.
13.Selecting the Total tab in Figure 7-41 displays the total capacity of the ESS-800, which is 12 TB on 28 RAID ranks of 73GB/10K rpm DDMs, as shown in Figure 7-42.
14.To see the last option, select Open Workload in Figure 7-42 on page 187. Figure 7-43 shows that the I/O rate from Sys_ESS11 is 6376.8 IOPS and the service time is 4.5 msec. If we click Average, we observe the same I/O statistics, because ESS11 is accessed by Sys_ESS11 only. Clicking Base creates the base model for this ESS-800.
15.We can now also create the base model for ESS14 by following the same procedure.
Figure 7-44 Open Systems merge and create a new target disk subsystem
2. Because we want to merge the ESS-800s to a DS8300, we need to modify Merge Target1. Clicking the Hardware Type option in Figure 7-45 opens a list box where we select IBM DS8300 Turbo.
3. Selecting Hardware Details opens the window shown in Figure 7-46. If we had selected the IBM DS8300LPAR Turbo option, this window would also allow us to select from the Processor Percentage option, which offers the choices of 25, 50, or 75. The Failover Mode option allows you to model the performance of the DS8000 when one processor server with its associated processor storage has been lost. Here, we can select the cache size, in this case 64 GB, which is the sum of the cache sizes of ESS11 and ESS14. In a DS8300, this selection automatically also determines the NVS size. Disk Magic computes the number of host adapters on the DS8000 based on what is specified on the Interfaces page, but you can, to a certain extent, override these numbers. We recommend that you use one host adapter for every two Fibre ports. In this case, we select four Fibre host adapters, because we are using eight Fibre ports.
4. Clicking the Interface option in Figure 7-45 opens the dialog that is shown in Figure 7-46.
5. Because the DS8300 Fibre ports run at 4 Gbps, we need to update this option on both servers and also on the From Disk Subsystem dialog (Figure 7-48). If the servers use different Fibre Channels than the Fibre Channels that are specified here, update this information. Select and determine the Remote Copy Interfaces. You need to select the Remote Copy type and the connections used for the Remote Copy links.
6. To select the DDM capacity and rpm used, click the Open Disk tab in Figure 7-49 on page 191. Then, select the DDM used by clicking Add. Select the HDD type used (146GB/15K rpm) and the RAID type (RAID 5), and enter capacity in GB (24000). Now, click OK.
7. Now, merge the second ESS onto the target subsystem. In Figure 7-50, right-click ESS14, select Merge, and then, select Add to Merge Source Collection.
8. To start the merge, in the Merge Target window that is shown in Figure 7-51 on page 192, click Start Merge. This selection initiates Disk Magic to merge the two ESS-800s onto the new DS8300. A pop-up window allows you to select whether to merge all workloads or
only a subset of the workloads. Here, we select I want to merge all workloads on the selected DSSs, which creates Merge Result1 (refer to Figure 7-52).
9. Clicking the Open Disk tab in Figure 7-52 shows the disk configuration. In this case, it is 24 TB on 28 ranks of 146 GB/15K rpm DDMs (Figure 7-53 on page 193).
10.Selecting the Open Workload tab in Figure 7-53 shows the Disk Magic predicted performance of the DS8300. Here, we see that the modelled DS8300 will have an estimated service time of 5.9 msec (Figure 7-54).
11.Click Utilizations in Figure 7-54 to show the utilization statistics of the various components of the DS8300. In Figure 7-55 on page 194, we see that the Highest HDD Utilization is 60.1% and has a darker (amber) background color. This amber background is an indication that the utilization of that resource is approaching its limit. It is still acceptable at this point, but the color is a warning that a workload increase might push this resource to its limit. Any resource that is a bottleneck will be shown with a red background. If a resource shows a red background, you need to increase that resource to resolve the bottleneck.
Use Figure 7-23 on page 177 as a guideline for the various resources in a DS8000. The middle column has an amber background color and the rightmost column has a red background color. The amber number indicates a warning that if the resource utilization reaches this number, a workload increase might soon cause the resource to reach its limit. The red numbers are utilization numbers that will cause an increase in one of the components of the response time.
Hold the Ctrl key down and select ESS11, ESS14, and MergeResult1. Right-click any of them, and a small window appears. Select Graph from this window. On the panel that appears (Figure 7-57 on page 195), select Clear to clear any graph option that might have been set up before. Click Plot to produce the service time graph of the three disk subsystems selected (Figure 7-58 on page 195).
Figure 7-58 Open Systems service time comparison (ESS11: 4.5 msec, ESS14: 11.4 msec, DS8300: 5.9 msec)
40000, and 1000. Select Line for the graph type, and then select Clear to clear any graph that might have been set up before. Now, select Plot. An error message displays, showing that the HDD utilization is greater than 100%. This message indicates that we cannot increase the I/O rate up to 40000 IOPS because of the DDM bottleneck. Clicking OK completes the graph creation.
The graph shows the service time plotted against the I/O rate increase as shown in Figure 7-60.
Figure 7-60 Open Systems service time projection with workload growth (the service time rises from 5.7 msec to 33.4 msec as the I/O rate grows)
This graph shows the DS8300 resource utilization growth with the increase in I/O rate. We can observe here that the HDD utilization starts to reach the red area at an I/O rate > 24000 IOPS. This utilization impacts the service time and can be seen in Figure 7-60 on page 196 where at greater than 24000 IOPS, the service time increases more rapidly. After selecting Utilization Overview in the Graph Data choices, click Clear and click Plot, which produces the resource utilization table in Figure 7-61.
Utilization by Total I/O Rate (I/Os per second)

Utilizations (Amber/Red thresholds)   18000    20000    22000    24000    26000    28000    30000
Average SMP (60%/80%)                 24.1%    26.8%    29.5%    32.2%    34.8%    37.5%    40.2%
Average Bus (70%/90%)                 12.1%    13.5%    14.8%    16.2%    17.5%    18.9%    20.2%
Average Logical Device (n/a)           3.6%     4.6%     5.8%     7.7%    10.8%    17.4%    42.2%
Highest DA (60%/80%)                   8.9%     9.9%    10.9%    11.9%    12.9%    13.9%    14.9%
Highest HDD (60%/80%)                 57.3%    63.7%    70.1%    76.5%    82.8%    89.2%    95.6%
Average FICON HA (35%/50%)             0.0%     0.0%     0.0%     0.0%     0.0%     0.0%     0.0%
Highest FICON Port (35%/50%)           0.0%     0.0%     0.0%     0.0%     0.0%     0.0%     0.0%
Average Fibre HA (60%/80%)            28.8%    32.0%    35.2%    38.4%    41.6%    44.8%    48.1%
Highest Fibre Port (60%/80%)          26.7%    29.7%    32.7%    35.6%    38.6%    41.6%    44.5%
Data collection
The preferred data collection method for a Disk Magic study is by using TotalStorage Productivity Center. For each control unit to be modeled, collect performance data, create a report for each control unit, and export each report as a comma separated values (csv) file. You can obtain the detailed instructions for this data collection from the IBM representative.
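Once the csv files exist, the peak intervals that are usually modeled (peak I/O rate, peak total throughput, and peak write throughput) can be located with a short script. The following Python sketch is illustrative only; the file name and column headings are assumptions and must be adjusted to the actual layout of the exported report.

import csv

# Hypothetical column names in the exported report.
TIME, IO_RATE, READ_MB, WRITE_MB = "Interval", "Total I/O Rate", "Read MB/s", "Write MB/s"

with open("ess_subsystem_report.csv", newline="") as f:
    rows = list(csv.DictReader(f))

peak_io    = max(rows, key=lambda r: float(r[IO_RATE]))
peak_total = max(rows, key=lambda r: float(r[READ_MB]) + float(r[WRITE_MB]))
peak_write = max(rows, key=lambda r: float(r[WRITE_MB]))

print("Peak I/O interval        :", peak_io[TIME], peak_io[IO_RATE], "IOPS")
print("Peak total MB/s interval :", peak_total[TIME])
print("Peak write MB/s interval :", peak_write[TIME])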
Workload
The example is based on an online workload with the assumption that the transfer size is 4K and all the read and write operations are random I/Os. The workload is a 70/30/50 online transaction processing (OLTP) workload, which is an online workload with 70% reads, 30% writes, and a 50% read-hit ratio. The estimated workload characteristics are:
- Maximum host I/O rate is 10000 IOPS.
- Write efficiency is 33%, which means that 67% of the writes are destaged.
- Ranks use a RAID 5 configuration.
DDM speed
For I/O intensive workloads, consider 15K rpm DDMs.
DDM capacity
Determine the choice among the 146 GB, 300 GB, or 450 GB DDMs based on an estimate of:
- Total capacity needed in GB
- Estimated read and write I/O rates
- RAID type used

For a discussion about this topic, refer to 5.6, Planning RAID arrays and ranks on page 81.
Based on the workload characteristics, the calculation is:
- Reads: Read misses are 10000 x 70% x 50% = 3500 IOPS
- Writes: 10000 x 30% x 67% x 4 = 8040 IOPS
- Total = 11540 IOPS

Note: The RAID 5 write penalty of four I/O operations per write is shown in Table 5-1 on page 84 under the heading Performance Write Penalty.

Calculate the number of ranks required based on:
- A 15K rpm DDM can sustain a maximum of 200 IOPS/DDM. For a 10K rpm DDM, reduce this number by 33%.
- For planning purposes, use a DDM utilization of 50%: the (6+P) rank will be able to sustain 200 x 7 x 50% = 700 IOPS. The (7+P) rank will be able to handle a higher IOPS. To be on the conservative side, calculate these estimates based on the throughput of a (6+P) rank.

Based on the 11540 total IOPS, this calculation yields: 11540 / 700 = 17 ranks. Because you can only order DDMs based on a multiple of two ranks, requiring a capacity with 17 ranks will require a DS8000 configuration with 18 ranks. Depending on the DDM size, the table in Figure 7-62 shows how much capacity you will get with 18 ranks. Knowing the total GB capacity needed for this workload, use this chart to select the DDM size that will meet the capacity requirement.
Note: Only use larger DDM sizes for applications that are less I/O intensive.
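As a minimal sketch (not part of the original sizing example), the following Python fragment reproduces the rank-count arithmetic above so that the assumptions (read ratio, read-hit ratio, write destage percentage, RAID 5 penalty, and IOPS per rank) can be varied easily.

import math

# Back-end I/O and rank-count estimate for the 70/30/50 workload described above.
host_iops      = 10000
read_ratio     = 0.70
read_hit_ratio = 0.50
write_destage  = 0.67    # 33% write efficiency means 67% of the writes are destaged
raid5_penalty  = 4       # disk operations per destaged write in RAID 5

backend_reads  = host_iops * read_ratio * (1 - read_hit_ratio)                  # 3500 IOPS
backend_writes = host_iops * (1 - read_ratio) * write_destage * raid5_penalty   # 8040 IOPS
backend_iops   = backend_reads + backend_writes                                 # 11540 IOPS

iops_per_rank  = 200 * 7 * 0.50          # 15K rpm DDMs, (6+P) rank, 50% DDM utilization
ranks_needed   = math.ceil(backend_iops / iops_per_rank)    # 17
ranks_ordered  = ranks_needed + (ranks_needed % 2)          # round up to a multiple of two: 18

print(backend_iops, ranks_needed, ranks_ordered)

Changing, for example, the read-hit ratio or the IOPS-per-rank assumption immediately shows the effect on the number of ranks required.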
For Open Systems loads, make sure that the DS8000 adapters and ports do not reach their throughput limits (MB/s values). Hence, if you have to fulfill specific throughput requirements,
make sure that each port under peak conditions is only loaded with a maximum of approximately 70% of its nominal throughput and leave every second port on a DS8000 idle. If no throughput requirements are given, use the following rule to initially estimate the correct number of host adapters: For each TB of capacity used, configure a nominal throughput of 100 MB/s. For instance, a 16 TB disk capacity then leads to 1600 MB/s required nominal throughput. With 2-Gbps ports assumed, you need eight ports. Following the recommendation of using every second port only, you need four host adapters.
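The port and host adapter rule of thumb above can be expressed the same way. The following Python sketch only illustrates the arithmetic, under the stated assumptions of 100 MB/s per TB of capacity, a 200 MB/s nominal throughput per 2-Gbps port, and the use of only every second port on each four-port adapter.

import math

capacity_tb       = 16      # usable disk capacity
mbps_per_tb       = 100     # rule of thumb: nominal throughput per TB of capacity
port_nominal_mbps = 200     # nominal throughput of one 2-Gbps Fibre Channel port
ports_per_adapter = 2       # use only every second port on a four-port host adapter

required_mbps = capacity_tb * mbps_per_tb                     # 1600 MB/s
ports         = math.ceil(required_mbps / port_nominal_mbps)  # 8 ports
adapters      = math.ceil(ports / ports_per_adapter)          # 4 host adapters

print(required_mbps, "MB/s ->", ports, "ports on", adapters, "host adapters")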
DS8000 cache
Use Figure 7-63 as a guideline for the DS8000 cache size if there is no workload experience that can be used.
Recommendation
To finalize your DS8000 configuration, contact your IBM representative to further validate the workload against the selected configuration.
Chapter 8. Practical performance management
Certain features that are required to support the performance management processes are not provided in TotalStorage Productivity Center. These features are shown in Table 8-2 on page 205.
Table 8-2 Other tools required

Process      Activity                                         Alternative
Strategic    Sizing                                           Disk Magic, general rules; refer to Chapter 7, Performance planning tools on page 161.
Strategic    Planning                                         Logical configuration performance considerations; refer to Chapter 5, Logical configuration performance considerations on page 63.
Operational  Host data collection performance and alerting    Native host tools; refer to the OS chapters for more detail.
Tactical     Host performance analysis and tuning             Native host tools; refer to the OS chapters for more detail.
There are many components in a TotalStorage Productivity Center environment. An example of the complexity of a TotalStorage Productivity Center environment is provided in Figure 8-2.
Figure 8-2 Example TotalStorage Productivity Center environment, including GUIs, proxy CIMOMs, host agents, the Backup Master Console, and the SVC Master Console (Proxy CIMOM, GUI)
Figure 8-2 does not include TotalStorage Productivity Center for Fabric or switch components. This architecture was designed by the Storage Networking Industry Association (SNIA), an industry workgroup. The architecture is not simple, but it is open, meaning any company can use SMI-S standard CIMOMs to manage and monitor storage and switches.
For additional information about the configuration and deployment of TotalStorage Productivity Center, refer to: TotalStorage Productivity Center V3.3 Update Guide, SG24-7490: http://www.redbooks.ibm.com/abstracts/sg247490.html?Open Storage Subsystem Performance Monitoring using TotalStorage Productivity Center, which is in Monitoring Your Storage Subsystems with TotalStorage Productivity Center, SG24-7364
Figure 8-3 DS8000 performance reporting levels: by subsystem, by controller (Controller 1 and Controller 2 with their read cache, write cache, and write cache mirror), by port, and by array
The amount of the available information or available metrics depends on the type of subsystem involved. The SMI-S standard does not require vendors to provide detailed performance data. For the DS8000, IBM provides extensions to the standard that include much more information than required by the SMI-S standard.
Subsystem
On the subsystem level, metrics have been aggregated from multiple records to a single value per metric in order to give the performance of a storage subsystem from a high-level view, based on the metrics of other components. This is done by adding values, or calculating average values, depending on the metric.
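As a simple illustration of this kind of aggregation (the exact formulas used by TotalStorage Productivity Center depend on the metric and are not reproduced here), the sketch below sums per-component I/O rates and computes an I/O-rate-weighted average response time; the component names and values are invented.

# Illustrative aggregation of per-component samples into one subsystem-level value:
# I/O rates are added, while response times are averaged, weighted by I/O rate.
components = [
    {"name": "rank R1", "io_rate": 1200.0, "resp_ms": 5.2},
    {"name": "rank R2", "io_rate":  800.0, "resp_ms": 7.4},
    {"name": "rank R3", "io_rate":  400.0, "resp_ms": 3.1},
]

total_io = sum(c["io_rate"] for c in components)
avg_resp = sum(c["io_rate"] * c["resp_ms"] for c in components) / total_io

print("Subsystem I/O rate      : %.0f IOPS" % total_io)
print("Subsystem response time : %.1f ms" % avg_resp)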
Cache
Notice the cache in Figure 8-3. This cache is a subcomponent of the subsystem, because the cache plays a crucial role in the performance of any storage subsystem. You do not find the cache as a selection in the Navigation Tree in TotalStorage Productivity Center, but there are available metrics that provide information about cache. Cache metrics for the DS8000 are available in the following report types:
- Subsystem
- Controller
- Array
- Volume
Cache metrics
Metrics, such as disk-to-cache operations, show the number of data transfer operations from disks to cache, referred to as staging, for a specific volume. Disk-to-cache operations are directly linked to read activity from hosts. When data is not found in the DS8000 cache, the data is first staged from the back-end disks into the cache of the DS8000 server and then transferred to the host.
Read hits occur when all the data requested for a read data access is located in cache. The DS8000 improves the performance of read caching by using Sequential Prefetching in Adaptive Replacement Cache (SARC) staging algorithms. Refer to 1.3.1, Advanced caching techniques on page 6 for more information about the SARC algorithm. The SARC algorithm seeks to store in cache those data tracks that have the greatest probability of being accessed by a read operation.

The cache-to-disk operation shows the number of data transfer operations from cache to disks, referred to as destaging, for a specific volume. Cache-to-disk operations are directly linked to write activity from hosts to this volume. Data written is first stored in the persistent memory (also known as nonvolatile storage (NVS)) at the DS8000 server and then destaged to the back-end disk. The DS8000 destaging is enhanced automatically by striping the volume across all the disk drive modules (DDMs) in one or several ranks (depending on your configuration). This striping provides automatic load balancing across the DDMs in the ranks and eliminates hot spots.

The DASD fast write delay percentage due to persistent memory allocation gives us information about the cache usage for write activities. The DS8000 stores data in the persistent memory before sending an acknowledgement to the host. If the persistent memory is full of data (no space available), the host receives a retry for its write request. In parallel, the subsystem has to destage data stored in its persistent memory to the back-end disk before accepting new write operations from any host. If a volume is experiencing write operations delayed due to a persistent memory constraint, consider moving the volume to a rank that is less used or spreading the volume over multiple ranks (increasing the number of DDMs used). If this solution does not fix the persistent memory constraint problem, you can consider adding cache capacity to your DS8000.
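A rough sketch of how such counters can be turned into ratios is shown below. The counter names and values are hypothetical and only approximate the real TotalStorage Productivity Center metrics (staging operations are track-based, so the hit ratio computed this way is an approximation).

# Hypothetical per-volume counters for one collection interval.
read_ops          = 15000   # read operations from hosts
disk_to_cache_ops = 4500    # staging operations for this volume
write_ops         = 6000    # write operations from hosts
nvs_delayed_ops   = 90      # writes delayed by a persistent memory (NVS) shortage

read_hit_ratio = 1.0 - disk_to_cache_ops / read_ops     # roughly 0.70
nvs_delay_pct  = 100.0 * nvs_delayed_ops / write_ops    # 1.5%

print("approximate read hit ratio : %.0f%%" % (read_hit_ratio * 100))
print("DASD fast write delay      : %.1f%%" % nvs_delay_pct)
if nvs_delay_pct > 3.0:   # example action threshold, not an IBM recommendation
    print("Consider spreading the volume over more ranks or adding cache.")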
Controller
TotalStorage Productivity Center refers to the DS8000 processor complexes as controllers. A DS8000 has two processor complexes, and each processor complex independently provides major functions for the disk subsystem. Examples include directing host adapters for data transfer to and from host processors, managing cache resources, and directing lower device interfaces for data transfer to and from physical disks. To analyze performance data, you need to know that most volumes can only be assigned to and used by one controller at a time. You can use the controller reports to identify whether the DS8000's processor complexes are busy and whether the persistent memory is sufficient. Write delays can occur due to write performance limitations on the back-end disk (at the rank level) or a limitation of the persistent memory size.
Ports
The port information reflects the performance metrics for the front-end DS8000 ports that connect the DS8000 to the SAN switches or hosts. The DS8000 host adapter (HA) card has four ports. The SMI-S standards do not reflect this aggregation, so TotalStorage Productivity Center does not show any grouping of ports belonging to the same HA. Monitoring and analyzing the ports belonging to the same card are beneficial, because the aggregate throughput is less than the sum of the stated bandwidth of the individual ports. For more information about DS8000 port cards, refer to 2.5.1, Fibre Channel and FICON host adapters on page 24.

Note: TotalStorage Productivity Center reports on many port metrics; therefore, be aware that the ports on the DS8000 are the front-end part of the storage device.
The array reports include both front-end metrics and back-end metrics. The back-end metrics are specified by the keyword Backend. They provide metrics from the perspective of the controller to the back-end array sites. The front-end metrics relate to the activity between the server and the controller. There is a relationship between array operations, cache hit ratio, and percentage of read requests. When the cache hit ratio is low, the DS8000 has frequent transfers from DDMs to cache (staging). When the percentage of read requests is high and cache hit ratio is also high, most of the I/O requests can be satisfied without accessing the DDMs due to the cache management prefetching algorithm. When the percentage of read requests is low, the DS8000 write activity to the DDMs can be high. The DS8000 has frequent transfers from cache to DDMs (destaging). Comparing the performance of different arrays shows if the global workload is equally spread on the DDMs of your DS8000. Spreading data across multiple arrays increases the number of DDMs used and optimizes the overall performance. Important: Back-end write metrics do not include the RAID overhead. In reality, the RAID 5 write penalty adds additional unreported I/O operations.
Volumes
The volumes, which are also called logical unit numbers (LUNs), are shown in Figure 8-5 on page 212. The host server sees the volumes as physical disk drives and treats them as physical disk drives.
Analysis of volume data facilitates the understanding of the I/O workload distribution among volumes as well as workload characteristics (random or sequential and cache hit ratios). A DS8000 volume can belong to one or several ranks as shown in Figure 8-5. For more information about volumes, refer to 4.2.5, Logical volumes on page 48. Analysis of volume metrics will show how busy the volumes are on your DS8000. This information helps to:
- Determine where the most accessed data is located and what performance you get from the volume.
- Understand the type of workload your application generates (sequential or random and the read or write operation ratio).
- Determine the cache benefits for read operations (cache management prefetching algorithm SARC).
- Determine cache bottlenecks for write operations.
- Compare the I/O response time observed on the DS8000 with the I/O response time observed on the host.
Table 8-3 DS8000 I/O types and behavior

I/O type          DS8000 high-level behavior
Sequential read   Pre-stage reads in cache to increase cache hit ratio.
Random read       Attempt to find data in cache. If not present in cache, read from back end.
Sequential write  Write data to NVS of processor complex owning volume and send copy of data to cache in other processor complex. Upon back-end destaging, perform prefetching of read data and parity into cache to reduce the number of disk operations on the back end.
Random write      Write data to NVS of processor complex owning volume and send copy of data to cache in other processor complex. Destage modified data from NVS to disk as determined by microcode.
We recommend that you monitor the read hit ratio over an extended period of time: If the cache hit ratio has been historically low, it is most likely due to the nature of the data access patterns. Defragmenting the filesystem and making indexes if none exist might help more than adding cache. If you have a high cache hit ratio initially and it is decreasing as the workload increases, adding cache or moving part of the data to volumes associated with the other processor complex might help.
8.3.1 Timestamps
The TotalStorage Productivity Center server uses the timestamp of the source devices when it inserts data into the database; it does not add any offset if the TotalStorage Productivity Center server clock is not synchronized with the rest of your environment. This matters because you might need to compare the performance data of the DS8000 with the data gathered on a server. Although the devices' time information is written to the database, reports are always based on the time of the TotalStorage Productivity Center server. TotalStorage Productivity Center actually receives the time zone information from the devices (or the CIMOMs) and uses this information to adjust the time in the reports to the local time. Certain devices might convert the time into Greenwich mean time (GMT) timestamps and not provide any time zone information.

This complexity is necessary to be able to compare the information from two subsystems located in different time zones from a single administration point. This administration point is the GUI, not the TotalStorage Productivity Center server. If you open the GUI in different time zones, a performance diagram might show a distinct peak at different times, depending on its local time zone. When using TotalStorage Productivity Center to compare data from a server (for example, iostat data) with the data of the storage subsystem, it is important to know the timestamp of the storage subsystem. Unfortunately, TotalStorage Productivity Center does not provide a report to see the time zone information for a device, most likely because the devices or CIMOMs convert the timestamps into GMT timestamps before they are sent.
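When lining up such data manually, a small conversion helper can avoid mistakes. The following Python sketch is illustrative only; the time zone of the host is an assumption. It converts a GMT timestamp, as it might be stored for a DS8000 sample, into the local time of the host where the iostat data was collected.

from datetime import datetime, timezone, timedelta

# A GMT timestamp as it might be stored for a DS8000 performance sample.
tpc_sample_gmt = datetime(2009, 3, 10, 14, 30, tzinfo=timezone.utc)

# Assumption: the host that produced the iostat data runs in UTC-5.
host_tz = timezone(timedelta(hours=-5))

host_local_time = tpc_sample_gmt.astimezone(host_tz)
print("TPC sample (GMT):", tpc_sample_gmt.isoformat())
print("Host local time :", host_local_time.isoformat())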
In order to ensure that the timestamps on the DS8000 are synchronized with the other infrastructure components, the DS8000 provides features for configuring a Network Time Protocol (NTP) server. In order to modify the time and configure the hardware management console (HMC) to utilize an NTP server, the following steps are required:
1. Log on to the HMC.
2. Select HMC Management.
3. Select Change Date and Time.
4. A dialog box similar to Figure 8-6 will appear. Change the Time here to match the current time for the time zone.
5. In order to configure an NTP server, select the NTP Configuration tab. A dialog box similar to Figure 8-7 on page 216 will display.
6. Select Add NTP Server and provide the IP address and the NTP version.
7. Check Enable NTP service on this HMC and click OK.
Note: These configuration changes will require a reboot of the HMC. These steps were tested on DS8000 code Version 4.0 and later.
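As a quick sanity check (not part of the HMC procedure itself), you can confirm from any UNIX or Linux management workstation that the NTP server you configured is answering and agrees with the rest of your infrastructure. This is only a sketch; it assumes the ntpdate utility is installed and that 192.0.2.10 is a hypothetical NTP server address:
# Query the NTP server without setting the local clock
ntpdate -q 192.0.2.10
# Display the local clock in UTC for comparison with other infrastructure components
date -u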
8.3.2 Duration
TotalStorage Productivity Center provides the ability to collect data continuously. From a performance management perspective, collecting data continuously means that performance data exists to facilitate reactive, proactive, and even predictive processes as described in Chapter 8, Practical performance management on page 203. For ongoing performance management of the DS8000, we recommend one of the following approaches to data collection:
- Run continuously. The benefit of this approach is that, at least in theory, data always exists. The downside is that if a component of TotalStorage Productivity Center goes into a bad state, it will not always generate an alert. In these cases, data collection might stop with only a warning, and a Simple Network Management Protocol (SNMP) alert will not be generated. In certain cases, the only obvious indication of a problem is a lack of performance data.
- Restart collection every n number of hours. In this approach, configure the collection to run for somewhere between 23 and 168 hours. For larger environments, a significant delay period might need to be configured between the last interval and the first interval in the next data collection. The benefit of this approach is that data collection failures will result in an alert every time that the job fails. You can configure this alert to go to an operational monitoring tool, such as Tivoli Enterprise Console (TEC). In this case, performance data loss is limited to the configured duration. The downside to this approach is that there will always be data missing for a period of time as TotalStorage Productivity Center begins to start the data collection on all of the devices. For large environments, this technique might not be tenable for an interval less than 72 hours, because the start-up costs related to starting the collection on a large number of devices can be significant.
- There is at least a gap of one hour every n number of hours.
- You get alerts for every job that fails.
- Manually restarted jobs can cause trouble, because the jobs can easily overlap with the next scheduled job, which prevents the scheduled job from starting.
- A logfile is created for each scheduled run. The Navigation Tree shows the status of the current and past jobs, and whether the job was successful in the past.
Logfiles
Usually, you only see a single logfile. You see multiple logfiles only if you have stopped and restarted the job manually.
8.3.3 Intervals
In TotalStorage Productivity Center, the data collection interval is referred to as the sample interval. The sample interval for DS8000 performance data collection tasks is from five minutes to 60 minutes. A shorter sample interval results in a more granular view of performance data at the expense of requiring additional database space. The appropriate sample interval depends on the objective of the data collection. Table 8-5 on page 218 displays example data collection objectives and reasonable values for a sample interval.
Table 8-5 Sample interval examples
Objective | Sample interval (minutes)
Problem determination/service level agreement (SLA) | 5
Ongoing performance management | 15
Baseline or capacity planning | 60
To reduce the growth of the TotalStorage Productivity Center database while watching for potential performance issues, TotalStorage Productivity Center has the ability to only store samples in which an alerting threshold is reached. This skipping function is useful for SLA reporting and longer term capacity planning. In support of ongoing performance management, a reasonable sample interval is 15 minutes. An interval of 15 minutes usually provides enough granularity to facilitate reactive performance management. In certain cases, the level of granularity required to identify the performance issue is less than 15 minutes. In these cases, you can reduce the sample interval. TotalStorage Productivity Center also provides reporting at higher intervals, including hourly and daily. TotalStorage Productivity Center provides these views automatically.
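As a rough illustration of the database growth trade-off (the volume count below is hypothetical, and this is not a TotalStorage Productivity Center sizing formula), the number of volume samples stored per day scales with the number of volumes and shrinks as the sample interval grows:
# 1,000 volumes at a 5-minute sample interval: 288 samples per volume per day
echo $((1000 * 24 * 60 / 5))     # 288000 volume samples per day
# The same 1,000 volumes at a 15-minute sample interval
echo $((1000 * 24 * 60 / 15))    # 96000 volume samples per day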
Table 8-6 TotalStorage Productivity Center metrics for the DS8000
Metric | Definition
Read I/O Rate (overall) | Average number of read operations per second during the sample interval.
Write I/O Rate (overall) | Average number of write operations per second during the sample interval.
Total I/O Rate (overall) | Average number of read plus write operations per second during the sample interval.
Read Cache Hits Percentage (overall) | Percentage of read operations that were satisfied from cache during the sample interval.
Write Cache Delay I/O Rate | The rate of I/Os (actually writes) that are delayed during the sample interval because of write cache. This must be 0.
Read Data Rate | Average read data rate in megabytes per second during the sample interval.
Write Data Rate | Average write data rate in megabytes per second during the sample interval.
Total Data Rate | Average total (read+write) data rate in megabytes per second during the sample interval.
Read Response Time | Average response time in milliseconds for reads during the sample interval. For this report, this metric is an average of read hits in cache as well as read misses.
Write Response Time | Average response time in milliseconds for writes during the sample interval.
Overall Response Time | Average response time in milliseconds for all I/O in the sample interval, including both cache hits as well as misses to back-end storage if required.
Read Transfer Size | Average transfer size in kilobytes for reads during the sample interval.
Write Transfer Size | Average transfer size in kilobytes for writes during the sample interval.
Overall Transfer Size | Average transfer size in kilobytes for all I/O during the sample interval.
Backend Read I/O Rate | The average read rate in reads per second caused by read misses. This rate is the read rate to the back-end storage for the sample interval.
Backend Write I/O Rate | The average write rate in writes per second caused by front-end write activity. This rate is the write rate to the back-end storage for the sample interval. These writes are logical writes, and the actual number of physical I/O operations depends on the type of RAID architecture.
Backend Read Data Rate | Average number of megabytes per second read from back-end storage during the sample interval.
Backend Write Data Rate | Average number of megabytes per second written to back-end storage during the sample interval.
Total Backend Data Rate | Sum of the Backend Read and Write Data Rates for the sample interval.
Backend Read Response Time | Average response time in milliseconds for read operations to the back-end storage.
Backend Write Response Time | Average response time in milliseconds for write operations to the back-end storage. This time might include several physical I/O operations, depending on the type of RAID architecture.
Overall Backend Response Time | The weighted average of Backend Read and Write Response Times during the sample interval.
Disk Utilization Percentage | Average disk utilization during the sample interval. This percentage is also the utilization of the RAID array, because the activity is uniform across the array.
Port Send I/O Rate | The average rate per second for operations that send data from an I/O port, typically to a server. This operation is typically a read from the server's perspective.
Port Receive I/O Rate | The average rate per second for operations where the storage port receives data, typically from a server. This operation is typically a write from the server's perspective.
Total Port I/O Rate | Average read plus write I/O rate per second at the storage port during the sample interval.
Port Send Data Rate | The average data rate in megabytes per second for operations that send data from an I/O port, typically to a server.
Port Receive Data Rate | The average data rate in megabytes per second for operations where the storage port receives data, typically from a server.
Total Port Data Rate | Average (read+write) data rate in megabytes per second at the storage port during the sample interval.
Port Send Response Time | Average number of milliseconds that it took to service each port send (server read) operation for a particular port over the sample interval.
Port Receive Response Time | Average number of milliseconds that it took to service each port receive (server write) operation for a particular port over the sample interval.
Total Port Response Time | Weighted average port send and port receive time over the sample interval.
Port Send Transfer Size | Average size in kilobytes per port send operation during the sample interval.
Port Receive Transfer Size | Average size in kilobytes per port receive operation during the sample interval.
Total Port Transfer Size | Average size in kilobytes per port transfer during the sample interval.
The following Redpaper provides a more exhaustive list of TotalStorage Productivity Center metrics for DS8000: http://www.redbooks.ibm.com/abstracts/redp4347.html?Open&pdfbookmark
Table 8-7 DS8000 key performance indicators
Component | Metric | Threshold | Comment
Volume | Read Cache Hits Percentage (overall) | > 90 | Look for opportunities to move volume data to application or database cache.
Volume | Write Cache Hits Percentage (overall) | < 100 | Cache misses can indicate busy back end or need for additional cache.
Volume | Read I/O Rate (overall) | | Look for high rates.
Volume | Write I/O Rate (overall) | | Look for high rates.
Volume | Read Response Time | | Indicates disk or port contention.
Volume | Write Response Time | | Indicates cache misses, busy back end, and possible front-end contention.
Volume | Write Cache Delay Percentage | > 1 | Cache misses may indicate busy back end or need for additional cache.
Volume | Read Transfer Size | > 100 | Indicates throughput intensive workload.
Volume | Write Transfer Size | > 100 | Indicates throughput intensive workload.
Port | Total Port I/O Rate | > 2500 | Indicates transaction intensive load.
Port | Total Port Data Rate | ~= 2/4 Gb | If port data rate is close to bandwidth, this rate indicates a saturation.
Port | Port Send Response Time | > 20 | Indicates contention on I/O path from DS8000 to host.
Port | Port Receive Response Time | > 20 | Indicates potential issue on I/O path or DS8000 back end.
Port | Total Port Response Time | > 20 | Indicates potential issue on I/O path or DS8000 back end.
Table 8-8 Report category, usage, and considerations
Report type | Performance process | Advantages | Disadvantages
Alerting/Constraints | Operational | Facilitates operational reporting for certain failure conditions and threshold exception reporting for support of SLAs and service level objectives (SLOs) | Requires thorough understanding of workload to configure appropriate thresholds
Predefined performance reports | Tactical | Ease of use; Top 10 reports provide method for quickly viewing entire environment | Limited metrics; lacks scheduling; inflexible charting; limited to 2500 rows displayed; can only export multiple metrics of same data type at a time
Ad hoc reports | Tactical, Strategic | Ease of use; flexible | Lack of scheduling; inflexible charting; limited to 2500 rows displayed; time stamps are in AM/PM format; volume data does not contain array correlation
Batch reports | Tactical, Strategic | Ease of use; ability to export all metrics available; can be scheduled; drill downs with preestablished relationships | No charting
TPCTOOL | Tactical, Strategic | Flexible; programmable; highly customizable | Non-intuitive; output to flat files that must be post-processed in a spreadsheet or other reporting tool; requires some DB and reporting skills; does not take into account future or potential changes to the environment
All of the reports utilize the metrics available for the DS8000 as described in Table 8-6 on page 218. In the remainder of this section, we describe each of the report types in detail.
8.5.1 Alerts
TotalStorage Productivity Center provides support for the performance management operational subprocesses via performance alerts and constraint violations. In this section, we discuss the difference between the alerts and constraint violations and how to implement them.
While TotalStorage Productivity Center is not an online performance monitoring tool, it uses the term performance monitor for the name of the job that is set up to gather data from a subsystem. The performance monitor is a performance data collection task. TotalStorage Productivity Center collects information at certain intervals and stores the data in its database. After inserting the data, the data is available for analysis using several methods that we discuss in this section. Because the intervals are usually 5 - 15 minutes, TotalStorage Productivity Center is not an online or real-time monitor. You can use TotalStorage Productivity Center to define performance-related alerts that can trigger an event when the defined thresholds are reached. Even though TotalStorage Productivity Center works in a similar manner to a monitor without user intervention, the actions are still performed at the intervals specified during the definition of the performance monitor job. Before discussing alerts, we must clarify the terminology.
Alerts
Generally, alerts are the notifications defined for different jobs. TotalStorage Productivity Center creates an alert on certain conditions, for example, when a probe or scan fails. There are various ways to be notified: SNMP traps, Tivoli Enterprise Console (TEC) events, and e-mail are the most common methods. All the alerts are always stored in the Alert Log, even if you have not set up notification. This log can be found in the Navigation Tree at IBM TotalStorage Productivity Center → Alerting → Alert Log. In addition to the alerts that you set up when you define a certain job, you can also define alerts that are not directly related to a job, but instead to specific conditions, such as when a new subsystem has been discovered. This type of alert is defined in Disk Manager → Alerting → Storage Subsystem Alerts. These types of alerts are either condition-based or threshold-based. When we discuss setting up a threshold, we really mean setting up an alert that defines a threshold. The same is true if someone says that they set up a constraint; they really set up an alert that defines a constraint. These values or conditions need to be exceeded or met in order for an alert to be generated.
Constraints
In contrast to the alerts that are defined with a probe or a scan job, the alerts defined in the Alerting navigation subtree are kept in a special constraint report available in the Disk Manager → Reporting → Storage Subsystem Performance → Constraint Violation navigation subtree. This report lists all the threshold-based alerts, which can be used to identify hot spots within the storage environment. In order to effectively utilize thresholds, the analyst must be familiar with the workloads. Figure 8-8 on page 225 shows all the available constraint violations. Unfortunately, most of them are not applicable to the DS8000.
Table 8-9 shows the constraint violations applicable to the DS8000. For those constraints without predefined values, we provide suggestions. You need to configure the exact values appropriately for the environment. Most of the metrics that are used for constraint violations are I/O rates and I/O throughput. It is difficult to configure thresholds based on these metrics, because absolute threshold values depend on the hardware capabilities and the workload. It might be perfectly acceptable for a tape backup to utilize the full bandwidth of the storage subsystem ports during backup periods. If the thresholds are configured to identify a high data rate, a threshold exception will be generated. In these cases, the thresholds are exceeded, but the information does not necessarily indicate a problem. These types of exceptions are called false positives. Other metrics, such as Disk Utilization Percentage, Overall Port Response Time, Write Cache Delay Percentage, and perhaps Cache Hold Time, tend to be more predictive of actual resource constraints and need to be configured in every environment. These constraints are the first four rows in Table 8-9.
Table 8-9 DS8000 constraints
Condition | Threshold (critical stress / warning stress) | Comment
Disk Utilization Percentage Threshold | 80 / 50 | Can be effective in identifying consistent disk hot spots.
Overall Port Response Time Threshold | 20 / 10 | Can be used to identify hot ports.
Write Cache Delay Percentage Threshold | 10 / 3 | Percentage of total I/O operations per processor complex delayed due to write cache space constraints.
Cache Hold Time Threshold | 30 / 60 | Amount of time in seconds that the average track persisted in cache per processor complex.
Total Port I/O Rate Threshold | | Indicates highly active ports.
Total Port Data Rate Threshold | | Indicates highly active port.
Total I/O Rate Threshold | Depends | Difficult to use, because I/O rates vary depending on workload and configuration.
Total Data Rate Threshold | Depends | Difficult to use, because data rates vary depending on workload and configuration.
For information about the exact meaning of these metrics and thresholds, refer to 8.3, TotalStorage Productivity Center data collection on page 214. Figure 8-9 on page 227 is a diagram that illustrates the four thresholds, which create five regions. Stress alerts define levels that, when exceeded, trigger an alert. An idle threshold level triggers an alert when the data value drops below the defined idle boundary. There are two types of alerts for both the stress category and the idle category:
- Critical Stress: No warning stress alert is created, because both (warning and critical) levels are exceeded within the interval.
- Warning Stress: It does not matter that the metric shows a lower value than in the last interval. An alert is triggered, because the value is still above the warning stress level.
- Normal workload and performance: No alerts are generated.
- Warning Idle: The workload drops significantly, and this drop might indicate a problem (which does not have to be performance-related).
- Critical Idle: The same applies as for critical stress.
It is unnecessary to specify a threshold value for all levels. In order to configure a constraint, perform the following steps:
1. Go to Disk Manager → Alerting.
2. Right-click Storage Subsystem Alerts.
3. Select Create Storage Subsystem Alert. A window appears that is similar to Figure 8-10.
4. Select the triggering condition from the list box and scroll down until you see the desired metrics.
5. Select the Condition to configure.
6. Set the Critical Stress and Warning Stress levels.
7. On the Storage Subsystems tab, select the systems to which the constraint applies.
8. Configure any Triggered Actions, such as SNMP Trap, TEC Event, Login Notification, Windows Event Log, Run Script, or email.
9. Save the Alert and provide a name.
10. You can view the alerts in the Disk Manager → Reporting → Storage Subsystem Performance → Constraint Violation navigation subtree.
The predefined TotalStorage Productivity Center performance reports are customized reports. The Top Volume reports show only a single metric over a given time period. These reports provide a way to identify the busiest volumes in the entire environment or by storage subsystem. You can use Selection and Filter for these reports. We describe the Selection and Filter options in detail in 8.5.4, Batch reports on page 232.
If you click the drill down icon in Figure 8-16, you get a report containing all the volumes that are stored on that specific array. If you click the drill up icon, you get a performance report at the controller level. In Figure 8-17 on page 232, we show you the DS8000 components and levels to which you can drill down. TotalStorage Productivity Center refers to the DS8000 processor complexes as controllers.
8.5.4 Batch reports
2. Right-click Batch Reports and select Create Batch Report.
3. Select the report type and provide a description as shown in Figure 8-19.
4. On the Selection tab, select the date and time range, the interval, and the Subsystem components as shown in Figure 8-20.
Note: Avoid using the Selection button when extracting volume data. In this case, we recommend using Filter. Use the following syntax to gather volumes for only the subsystem of interest: DS8000-2107-#######-IBM. Refer to Figure 8-21 for an example. Replace ####### with the seven character DS8000 serial number.
5. In order to reduce the amount of data, we suggest creating a filter that requires the selected component to contain at least 1 for the Total I/O Rate (overall) (ops/s) as shown in Figure 8-22.
6. Click the Options tab and select Include Headers.
7. Leave the radio button selected for CSV File. This option exports the data to a comma-separated values file that can be analyzed with spreadsheet software.
8. Select an agent computer. Usually, this batch report runs on the TotalStorage Productivity Center server. Refer to Figure 8-23 for an example.
9. Another consideration is When to Run. Click When to Run to see the available options. The default is Run Now. While this option is fine for ad hoc reporting, you might also schedule the report to Run Once at a certain time or Run Repeatedly. This tab also contains an option for setting the time zone for the report. The default is to use the local time in each time zone. Refer to the discussion about time stamps in 8.3.1, Timestamps on page 214 for more information.
10. Prior to running the job, configure any desired alerts in the Alert tab, which provides a means for sending alerts if the job fails. This feature can be useful if the job is a regularly scheduled job.
11. In order to run the batch report, immediately click the Save icon (diskette) in the toolbar as shown in Figure 8-24.
12. When you click the Save icon, a prompt displays: Specify a Batch Report name. Enter a name that is descriptive enough for later reference.
13. After submitting the job, it will either be successful or unsuccessful. Examine the log under Batch Reports to perform problem determination on unsuccessful jobs.
Note: The location of the batch file reports is not intuitive. It is located in the TotalStorage Productivity Center installation directory as shown in Figure 8-25.
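Because the exported batch report is an ordinary CSV file, you can pre-screen it with standard UNIX tools before opening it in a spreadsheet. The following sketch uses a hypothetical file name and assumes that the Total I/O Rate (overall) metric landed in column 7 of your export; check the header line of your own file first, because the column layout depends on the metrics that you selected:
# Number the column headers so that the field of interest can be identified
head -1 volume_batch_report.csv | tr ',' '\n' | cat -n
# List the ten rows with the highest value in the assumed Total I/O Rate column (field 7)
awk -F',' 'NR > 1 {print $7, $0}' volume_batch_report.csv | sort -rn | head -10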
8.5.5 TPCTOOL
You can use the TPCTOOL command line interface to extract data from the TotalStorage Productivity Center database. While it requires no knowledge of the TotalStorage Productivity Center schema or SQL query skills, you need to understand how to use the tool. It is not obvious. Nevertheless, it has advantages over the TotalStorage Productivity Center GUI, such as:
- Multiple components: Extract information about multiple components, such as volumes and arrays, by specifying a list of component IDs. If the list is omitted, every component for which data has been gathered is returned.
- Multiple metrics: The multiple metrics feature is probably the most important feature of the TPCTOOL reporting function. While exporting data from a history chart allows data from multiple samples for multiple components, it is limited to a single metric type. In TPCTOOL, the metrics are specified by the columns parameter.
- The data extraction can be completely automated. TPCTOOL, when used in conjunction with shell scripting, can provide an excellent way to automate the TotalStorage Productivity Center data extracts, which can be useful for loading data into a consolidated performance history repository for custom reporting and data correlation with other data sources.
- TPCTOOL can be useful if you need to create your own metrics using supplied metrics or counters. For example, you can create a metric that shows the access density: the number of I/Os per GB. For this metric, you also need information from other TotalStorage Productivity Center reports that include the volume capacity. Manipulating the data will require additional work.
Nevertheless, TPCTOOL also has a few limitations:
- Single subsystem or fabric: Reports can only include data of a single subsystem or a single fabric, regardless of the components, ctypes, and metrics that you specify.
- Identification: The identification of components, subsystems, and fabrics is not so easy, because TPCTOOL uses worldwide names (WWNs) and Globally Unique Identifiers (GUIDs) instead of the user-defined names or labels. At least for certain commands, you can tell TPCTOOL to return more information by using the -l parameter. For example, lsdev also returns the user-defined label when using the -l parameter.
- Correlation: The drill-down relationships provided in the GUI are not maintained in the TPCTOOL extracts. Manual correlation of volume data with the TotalStorage Productivity Center array can be done, or a script can be used to automate this process. A script is provided in Correlate TotalStorage Productivity Center volume batch reports with rank data obtained from DSCLI on page 618 for mapping volume data with the TotalStorage Productivity Center arrays. The script is specific to volume data extracted using batch reports; however, the logic can be applied to TPCTOOL-extracted volume data.
TPCTOOL has one command for creating reports and several list commands (starting with ls) for querying information needed to generate a report. To generate a report with TPCTOOL:
1. Launch TPCTOOL by clicking tpctool.bat in the installation directory. Typically, tpctool.bat is in C:\Program Files\IBM\TPC\cli.
2. List the devices using lsdev as shown in Figure 8-26. Note the devices from which to extract data. In this example, we use 2107.1303241+0.
3. Determine the component type to report by using the lstype command as shown in Figure 8-27 on page 238.
4. Next, decide which metrics to include in the report. The metrics returned by the lsmetrics command are the same as the columns in the TotalStorage Productivity Center GUI. Figure 8-28 provides an example of the lsmetrics command.
tpctool> lsmetrics -user <USERID> -pwd <PASSWORD> -ctype subsystem -url localhost:9550 -subsys 2107.1303241+0
Metric Value
==============================================
Read I/O Rate (normal) 801
Read I/O Rate (sequential) 802
Read I/O Rate (overall) 803
Write I/O Rate (normal) 804
Write I/O Rate (sequential) 805
Write I/O Rate (overall) 806
Total I/O Rate (normal) 807
Total I/O Rate (sequential) 808
Total I/O Rate (overall) 809
Read Cache Hit Percentage (normal) 810
Record Mode Read I/O Rate 828
Read Cache Hits Percentage (sequential) 811
Read Cache Hits Percentage (overall) 812
Write Cache Hits Percentage (normal) 813
Write Cache Hits Percentage (sequential) 814
Write Cache Hits Percentage (overall) 815
Total Cache Hits Percentage (normal) 816
Total Cache Hits Percentage (sequential) 817
Total Cache Hits Percentage (overall) 818
Cache Holding Time 834
Read Data Rate 819
Write Data Rate 820
Total Data Rate 821
Read Response Time 822
Write Response Time 823
Overall Response Time 824
Read Transfer Size 825
Write Transfer Size 826
Overall Transfer Size 827
Record Mode Read Cache Hit Percentage 829
Disk to Cache Transfer Rate 830
Cache to Disk Transfer Rate 831
NVS Full Percentage 832
NVS Delayed I/O Rate 833
Backend Read I/O Rate 835
Backend Write I/O Rate 836
Total Backend I/O Rate 837
Backend Read Data Rate 838
Backend Write Data Rate 839
Total Backend Data Rate 840
Backend Read Response Time 841
Backend Write Response Time 842
Overall Backend Response Time 843
Backend Read Transfer Size 847
Backend Write Transfer Size 848
Overall Backend Transfer Size 849
5. Determine the start date and time and put it in the following format: YYYY.MM.DD:HH:MM:SS.
6. Determine the data collection interval in seconds: 86400 (1 day).
7. Determine the summarization level: sample, hourly, or daily.
8. Run the report using the getrpt command as shown in Figure 8-29. The command output can be redirected to a file for analysis in a spreadsheet. The <USERID> and <PASSWORD> variables need to be replaced with the correct values for your environment.
tpctool> getrpt -user <USERID> -pwd <PASSWORD> -ctype array -url localhost:9550 -subsys 2107.1303241+0 -level hourly -start 2008.11.04:10:00:00 -duration 86400 -columns 801,802
Timestamp Interval Device Component 801 802
================================================================================
2008.11.04:00:00:00 3600 DS8000-2107-1303241-IBM 2107.1303241-10 78.97 26.31
2008.11.04:01:00:00 3600 DS8000-2107-1303241-IBM 2107.1303241-10 54.73 14.85
2008.11.04:02:00:00 3600 DS8000-2107-1303241-IBM 2107.1303241-10 43.72 11.13
2008.11.04:03:00:00 3600 DS8000-2107-1303241-IBM 2107.1303241-10 40.92 8.36
2008.11.04:04:00:00 3600 DS8000-2107-1303241-IBM 2107.1303241-10 50.92 10.03
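You do not have to work at the tpctool> prompt. tpctool.bat also accepts a single command on the invocation line, which makes it easy to redirect the report into a file for later analysis. The following is a sketch only: the credentials are placeholders, the option set mirrors the interactive example above, and the exact behavior can vary by TotalStorage Productivity Center release:
C:\Program Files\IBM\TPC\cli>tpctool getrpt -user <USERID> -pwd <PASSWORD> -url localhost:9550 -ctype array -subsys 2107.1303241+0 -level hourly -start 2008.11.04:10:00:00 -duration 86400 -columns 801,802 -fs ";" > array_report.csv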
Tip: If you want to import the data into Excel later, we recommend using a semicolon as the field separator (-fs parameter). A comma can easily be mistaken for a decimal or decimal grouping symbol. The book TotalStorage Productivity Center Advanced Topics, SG24-7438, contains instructions for importing TPCTOOL data into Excel. The book also provides a Visual Basic macro that can be used to modify the time stamp to the international standard.
The lstime command is extremely helpful, because it provides information that can be used to determine if performance data collection is running. It provides three fields:
Start: The date and time of the start of performance data collections
Duration: The number of seconds that the job ran
Option: The location
tpctool> lstime -user <USERID> -pwd <PASSWORD> -ctype array -url localhost:9550 -level hourly -subsys 2107.1303241+0
Start Duration Option
===================================
2008.10.23:13:00:00 370800 server
2008.10.27:20:00:00 928800 server
In order to identify if the performance job is still running, use the following logic:
1. Identify the start time of the last collection (2008.10.27 at 20:00:00).
2. Identify the duration (928800).
3. Add the start time to the duration (use Excel =Sum(2008.10.27 20:00:00+(928800/86400))).
4. Compare the result to the current time. The result is 2008.11.07 at 14:00, which happens to be the current time. This result indicates that data collection is running.
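If you prefer to perform the start-plus-duration arithmetic on a UNIX or Linux system instead of in Excel, GNU date can do it directly (the -d option is GNU-specific and is not available in the default AIX or Solaris date command). The values below are taken from the lstime output above:
start="2008-10-27 20:00:00"   # lstime Start column, with the dots replaced by dashes
duration=928800               # lstime Duration column, in seconds
date -d "$start $duration seconds" '+%Y.%m.%d %H:%M:%S'   # prints 2008.11.07 14:00:00
date '+%Y.%m.%d %H:%M:%S'                                 # current time, for comparison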
pools that can satisfy the request. If you explicitly specify the storage pool and controller information, the Volume Planner checks to see whether the input performance and capacity requirements can be satisfied.
4. A prompt will ask for the host name (IP), user ID, and password. Click OK.
5. If it connects properly, it will then prompt you to select one or more device serial numbers as shown in Figure 8-31 on page 241.
7. Select the components: Ports, arrays, and volumes.
8. Enter the client information and click Continue.
9. Enter the IBM/Business Partner information and click Continue.
10. Select the file name to save the report as and click Save.
11. Click Exit to close the reporter.
A report is now in the location selected in the previous steps. The report will contain the following information:
- Configuration and Capacity
- Performance overview - Subsystem-level averages
- Subsystem-level charts of key metrics
- Subsystem definitions
- Port information
- Port performance summary
- Port detail charts
- Port metric definitions
- Array configuration information
- Array performance summary
- Array detail charts
- Array metric definitions
- Volume information
- Volume performance summary
- Volume detail charts
- Volume metric definitions
Note: TPC Reporter for Disk is an excellent way to generate regular performance healthcheck reports especially for port and array. Due to the quantity of reports generated, we suggest excluding the volumes from the report unless a problem is identified with the array or port data that requires additional detail.
(Figure: Host_1 connects through SAN Switch/Director_1, an ISL, and SAN Switch/Director_2 to the I/O drawers of DS8000_1.)
A second type of configuration in which SAN statistics can be useful is shown in Figure 8-34 on page 244. In this configuration, host bus adapters or channels from multiple servers access the same set of I/O ports on the DS8000 (server adapters 1 - 4 share access to DS8000 I/O ports 5 and 6). In this environment, the performance data available from only the host server or only the DS8000 might not be enough to confirm load balancing or to identify each server's contribution to I/O port activity on the DS8000, because more than one host is accessing the same DS8000 I/O ports. If DS8000 I/O port 5 is highly utilized, it might not be clear whether Host_A, Host_B, or both hosts are responsible for the high utilization. Taken together, the performance data available from Host_A, Host_B, and the DS8000 might be enough to isolate each server connection's contribution to I/O port utilization on the DS8000; however, the performance data available from the SAN switch or director might make it easier to see load balancing and relationships between I/O traffic on specific host server ports and DS8000 I/O ports at a glance, because it can provide real-time utilization and traffic statistics for both host server SAN ports and DS8000 SAN ports in a single view, with a common reporting interval and metrics.
TotalStorage Productivity Center for Fabric can be used for analysis of historical data, but it does not collect data in real time.
SAN statistics can also be helpful in isolating the individual contributions of multiple DS8000s to I/O performance on a single server. In Figure 8-35 on page 245, host bus adapters or channels 1 and 2 from a single host (Host_A) access I/O ports on multiple DS8000s (I/O ports 3 and 4 on DS8000_1 and I/O ports 5 and 6 on DS8000_2). In this configuration, the performance data available from either the host server or from the DS8000 might not be enough to identify each DS8000's contribution to adapter activity on the host server, because the host server is accessing I/O ports on multiple DS8000s. For example, if adapters on Host_A are highly utilized or if I/O delays are experienced, it might not be clear whether this is due to traffic that is flowing between Host_A and DS8000_1, between Host_A and DS8000_2, or between Host_A and both DS8000_1 and DS8000_2. The performance data available from the host server and from both DS8000s can be used together to identify the source of high utilization or I/O delays. Additionally, you can use TotalStorage Productivity Center for Fabric or vendor point products to gather performance data for both host server SAN ports and DS8000 SAN ports.
Another configuration in which SAN statistics can be important is a remote mirroring configuration, such as the configuration shown in Figure 8-36. Here, two DS8000s are connected through a SAN for synchronous or asynchronous remote mirroring or remote copying, and the SAN statistics can be collected to analyze traffic for the remote mirroring links.
You must check SAN statistics to determine if there are SAN bottlenecks limiting DS8000 I/O traffic. You can also use SAN link utilization or throughput statistics to break down the I/O activity contributed by adapters on different host servers to shared storage subsystem I/O ports. Conversely, you can use SAN statistics to break down the I/O activity contributed by different storage subsystems accessed by the same host server. SAN statistics can also highlight whether multipathing/load balancing software is operating as desired, or whether there are performance problems that need to be resolved.
The available TotalStorage Productivity Center for Fabric thresholds are highlighted in gray in Table 8-10. Unfortunately, the available thresholds are aggregated at the switch level and are not granular enough to identify individual port saturation.
Figure 8-37 TotalStorage Productivity Center for Fabric: Switch alert threshold
2. Configure the Critical Stress and Warning Stress rates.
3. Enable any Triggered Actions and save.
Report name | Comments
Command line tool for extracting data from TPC | Extract data for analysis in spreadsheet software. Can be automated.
Create custom queries using BIRT | Useful for creating reports not available in TotalStorage Productivity Center.
The process of using TotalStorage Productivity Center for Fabric to create reports is similar to the process that is used to create reports in TotalStorage Productivity Center for Disk as described in 8.5, TotalStorage Productivity Center reporting options on page 222.
Table 8-12 TotalStorage Productivity Center for Fabric metrics
Port Peak Send Data Rate
Port Peak Receive Data Rate
Port Send Packet Size
Port Receive Packet Size
Overall Port Packet Size
Error Frame Rate
Dumped Frame Rate
Link Failure Rate: The average number of link errors per second during the sample interval
Loss of Sync Rate: The average number of times per second that synchronization was lost during the sample interval
Loss of Signal Rate: The average number of times per second that the signal was lost during the sample interval
CRC Error Rate: The average number of frames received per second in which the cyclic redundancy check (CRC) in the frame did not match the CRC computed by the receiver during the sample interval
The most important metric for determining if a SAN bottleneck exists is the Total Port Data Rate. When you use it in conjunction with the port configuration information, you can identify port saturation. For example, if the inter-switch links (ISLs) between two switches are rated at 4 Gbit/sec, then a throughput of greater than or equal to 3.5 Gbit/sec indicates saturation.
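To put that rule of thumb in context: Fibre Channel at 1, 2, and 4 Gbit/sec uses 8b/10b encoding (10 bits on the wire per data byte), so a 4 Gbit/sec link can move roughly 400 MBps of payload, and a sustained 3.5 Gbit/sec corresponds to almost 90% of the nominal link rate. A small sketch of the arithmetic, with the link speed as the only input:
awk 'BEGIN {
  link_gbps = 4.0; observed_gbps = 3.5
  printf "Link utilization: %.1f%%\n", observed_gbps / link_gbps * 100    # 87.5%
  printf "Approximate payload limit: %.0f MBps\n", link_gbps * 1000 / 10  # 8b/10b encoding
}'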
Perceived or actual I/O bottlenecks can result from hardware failures on the I/O path, contention on the server, contention on the SAN Fabric, contention on the DS8000 front-end ports, or contention on the back-end disk adapters or disk arrays. In this section, we provide a process for diagnosing these scenarios using TotalStorage Productivity Center and external data. This process was developed for identifying specific types of problems and is not a substitute for common sense, knowledge of the environment, and experience. Figure 8-39 shows the high-level process flow.
Figure 8-39 I/O performance analysis process (phases: Definition, Classification, Identification, and Validation)
I/O bottlenecks as referenced in this section relate to one or more components on the I/O path that have reached a saturation point and can no longer achieve the I/O performance requirements. I/O performance requirements are typically throughput-oriented or transaction-oriented. Heavy sequential workloads, such as tape backups or data warehouse environments, might require maximum bandwidth and use large sequential transfers. However, they might not have stringent response time requirements. Transaction-oriented workloads, such as online banking systems, might have stringent response time requirements but have no requirements for throughput. If a server CPU or memory resource shortage is identified, it is important to take the necessary remedial actions. These actions might include but are not limited to adding additional CPUs, optimizing processes or applications, or adding additional memory. In general, if there are not any resources constrained on the server but the end-to-end I/O response time is higher than expected for the DS8000 (See General rules on page 293), there is likely a resource constraint in one or more of the SAN components. In order to troubleshoot performance problems, TotalStorage Productivity Center for Disk and TotalStorage Productivity Center for Fabric data must be augmented with host performance and configuration data. Figure 8-40 on page 251 shows a logical end-to-end view from a measurement perspective.
As shown in Figure 8-40, TotalStorage Productivity Center does not provide host performance, configuration, or error data. TotalStorage Productivity Center for Fabric provides performance and error log information about SAN switches. TotalStorage Productivity Center for Disk provides DS8000 storage performance and configuration information.
Process assumptions
This process assumes that:
- The server is connected to the DS8000 natively.
- Tools exist to collect the necessary performance and configuration data for each component along the I/O path (server disk, SAN fabric, and DS8000 arrays, ports, and volumes).
- Skills exist to utilize the tools, extract data, and analyze data.
- Data is collected in a continuous fashion to facilitate performance management.
Process flow
The order in which you conduct the analysis is important. We suggest the following process:
1. Define the problem. A sample questionnaire is provided in Sample questions for an AIX host on page 154. The goal is to assist in determining the problem background and understanding how the performance requirements are not being met.
Note: Before proceeding any further, ensure that adequate discovery is pursued to identify any changes in the environment. In our experience, there is a significant correlation between changes in the environment and sudden, unexpected performance issues.
2. Properly classify the problem by identifying hardware or configuration issues. Hardware failures often manifest themselves as performance issues, because I/O is significantly degraded on one or more paths. If a hardware issue is identified at this point, all problem determination efforts must be focused on identifying the root cause of the hardware errors:
a. Gather any errors on any of the host paths.
Note: If you notice significant errors in the datapath query device or pcmpath query device output and the errors are increasing, there is likely a problem with a physical component on the I/O path.
b. Gather the host error report and look for Small Computer System Interface (SCSI) or FIBRE errors.
Note: Often, a hardware error relating to a component on the I/O path will manifest itself as a TEMP error. A TEMP error does not necessarily exclude hardware failure. You must perform diagnostics on all hardware components in the I/O path, including the host bus adapter (HBA), SAN switch ports, and DS8000 HBA ports.
c. Gather the SAN switch configuration and errors. Every switch vendor provides different management software. All of the SAN switch software provides error monitoring and a way to identify if there is a hardware failure with a port or application-specific integrated circuit (ASIC). Refer to your vendor-specific manuals or contact vendor support for more information about identifying hardware failures.
Note: As you move from the host to external resources, remember any patterns. A common error pattern involves errors affecting only those paths on the same HBA. If both paths on the same HBA experience errors, the errors are a result of a common component. The common component is likely to be the host HBA, the cable from the host HBA to the SAN switch, or the SAN switch port itself. Ensure that all of these components are thoroughly reviewed before proceeding.
d. If errors exist on one or more of the host paths, determine if there are any DS8000 hardware errors. Log on to the HMC as customer/cust0mer and look to make sure that there are no hardware alerts. Figure 8-41 provides a sample of a healthy DS8000. If there are any errors, you might need to open a problem ticket (PMH) with DS8000 hardware support (2107 engineering).
3. After validating that no hardware failures exist, analyze server performance data and identify any disk bottlenecks.
The fundamental premise of this methodology is that I/O performance degradation relating to SAN component contention can be observed at the server via analysis of key server-based I/O metrics. Degraded end-to-end I/O response time is the strongest indication of I/O path contention. Typically, server physical disk response times measure the time that a physical I/O request takes from the moment that the request was initiated by the device driver until the device driver receives an interrupt from the controller that the I/O completed. The measurements are displayed as either service time or response time. They are usually averaged over the measurement interval. Typically, server wait or queue metrics refer to time spent waiting at the HBA, which is usually an indication of HBA saturation. In general, you need to interpret the service times as response times, because they include potential queuing at various storage subsystem components, for example, switch, storage HBA, storage cache, storage back-end disk controller, storage back-end paths, and disk drives.
Note: Subsystem-specific load balancing software usually does not add any performance overhead and can be viewed as a pass-through layer.
In addition to the disk response time and disk queuing data, gather the disk activity rates, including read I/Os, write I/Os, and total I/Os, because they show which disks are active:
a. Gather performance data as shown in Table 8-13.
Table 8-13 Native tools and key metrics
OS | Native tool | Command/Object | Metric/Counter
AIX | iostat (5.3), filemon | iostat -D, filemon -o /tmp/fmon.log -O all | read time (ms), write time (ms), reads, writes, queue length
HP-UX | sar | sar -d | avserv (ms), avque, blks/s
Linux | *iostat | iostat -d | svctm (ms), avgqu-sz, tps
Solaris | iostat | iostat -xn | svc_t (ms), Avque, blks/s
Windows server | perfmon | Physical Disk | Avg Disk Sec/Read, Avg Disk Sec/Write, Read Disk Queue Length, Write Disk Queue Length, Disk Reads/sec, Disk Writes/sec
System z | N/A
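For example, on AIX 5.3 you might capture the detailed per-disk statistics listed above at the same granularity as the TotalStorage Productivity Center sample interval, so that host and DS8000 data points line up. This is only a sketch; the interval, count, and output file name are arbitrary choices:
# 96 samples of 900 seconds (15 minutes) each = 24 hours of detailed disk statistics
date > /tmp/iostat_D.$(hostname).out
iostat -D 900 96 >> /tmp/iostat_D.$(hostname).out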
Note: The number of total I/Os per second indicates the relative activity of the device. This relative activity provides a metric to prioritize the analysis. Those devices with high response times and high activity are obviously more important to understand than devices with high response times and infrequent access. If analyzing the data in a spreadsheet, consider creating a combined metric of Average I/Os x Average Response Time to provide a method for identifying the most I/O-intensive disks. You can obtain additional detail about OS-specific server analysis in the OS-specific chapters.
b. Gather configuration data (Subsystem Device Driver (SDD)/Subsystem Device Driver Path Control Module (SDDPCM)) as shown in Table 8-14. In addition to the multipathing configuration data, you need to collect configuration information for the host and DS8000 HBAs, including the bandwidth of each adapter.
Table 8-14 Path configuration data
OS | Tool | Command | Key | Other information
All UNIX | SDD, SDDPCM | datapath query essmap, pcmpath query essmap | LUN serial | *Rank, logical subsystem (LSS), Storage subsystem
Windows | SDD | datapath query essmap | LUN serial | *Rank, LSS, Storage subsystem
Note: The rank column is not meaningful for multi-rank extent pools on the DS8000. For single rank extent pools, it only provides a mechanism for understanding that different volumes are located on different ranks.
Note: Ensure that multipathing behaves as designed. For example, if there are two paths zoned per HBA to the DS8000, there must be four paths active per LUN. Both SDD and SDDPCM use an active/active configuration of multipathing, which means that traffic flows across all the paths fairly evenly. For native DS8000 connections, the absence of activity on one or more paths indicates a problem with the SDD behavior.
c. Format the data. Format the data and correlate the host LUNs with their associated DS8000 resources. Formatting the data is not required for analysis, but it is easier to analyze formatted data in a spreadsheet. The following steps represent the logical steps required to perform the formatting; they do not represent literal steps. You can codify these steps in scripts. You can obtain examples of these scripts in Appendix D, Post-processing scripts on page 607:
i. Read the configuration file.
ii. Build an hdisk hash with key = hdisk and value = LUN serial number.
iii. Read the I/O response time data.
iv. Create hashes for each of the following values with hdisk as the key: Date, Start time, Physical Volume, Reads, Avg Read Time, Avg Read Size, Writes, Avg Write Time, and Avg Write Size.
v. Print the data to a file with headers and commas to separate the fields.
vi. Iterate through the hdisk hash and use the common hdisk key to index into the other hashes and print those hashes that have values.
d. Analyze the host performance data:
i. Determine if I/O bottlenecks exist by summarizing the data and analyzing key performance metrics for values in excess of the thresholds discussed in General rules on page 293. Identify those vpaths/LUNs with poor response time. We show an example in 10.8.6, Analyzing performance data on page 297. At this point, you need to have excluded hardware errors and multipathing configuration issues, and you must have identified the hot LUNs. Proceed to step 4 to determine the root cause of the performance issue.
ii. If no degraded disk response times exist, the issue is likely related to something internal to the server.
4. If there were disk constraints identified, continue the identification of the root cause by collecting and analyzing DS8000 configuration and performance data:
a. Gather the configuration information. A script called DS8K-Config-Gatherer.cmd is provided in Appendix C, UNIX shell scripts on page 587. TotalStorage Productivity Center can also be used to gather configuration data via the topology viewer or from Data Manager → Reporting → Asset → By Storage Subsystem as shown in Figure 8-42.
Note: While analysis of the SAN fabric and the DS8000 performance data can be completed in either order, SAN bottlenecks occur much less frequently than disk bottlenecks, so it is more efficient to analyze DS8000 performance data first.
b. Use TotalStorage Productivity Center to gather DS8000 performance data for subsystem port, array, and volume. Compare the key performance indicators from Table 8-7 on page 221 with the performance data. Follow these steps to analyze the performance:
i. For those server LUNs that had poor response time, analyze the associated volumes during the same period. If the problem is on the DS8000, a correlation exists between the high response times observed on the host and the volume response times observed on the DS8000.
Note: Meaningful correlation with the host performance measurements and the previously identified hot LUNs requires analysis of the DS8000 performance data for the same time period that the host data was collected. Refer to 8.3.1, Timestamps on page 214 for more information about time stamps.
ii. Correlate the hot LUNs with their associated disk arrays. When using the TotalStorage Productivity Center GUI, the relationships are provided automatically within the drill-down feature. If you use batch exports and want to correlate the volume data with the rank data, you can perform this correlation manually or by using the script provided in Correlate TotalStorage Productivity Center volume batch reports with rank data obtained from DSCLI. In the case of multiple ranks per extent pool and Storage Pool Striping, one volume can exist on multiple ranks.
Note: TotalStorage Productivity Center performance reports always refer to the hardware numbering scheme, which is bound to the array sites Sxy and not to the array number (DSCLI array number - 1). Refer to Example 8-2 and Figure 8-4 on page 211 for more information.
iii. Analyze the storage subsystem ports associated with the server in question.
5. Continue the identification of the root cause by collecting and analyzing SAN fabric configuration and performance data:
a. Gather the connectivity information and establish a visual diagram of the environment. If you have TotalStorage Productivity Center for Fabric, you can use the Topology Viewer to quickly create a visual representation of your SAN environment as shown in Figure 8-38 on page 249.
Note: Sophisticated tools are not necessary for creating this type of view; however, the configuration, zoning, and connectivity information must be available in order to create a logical visual representation of the environment.
b. Gather the SAN performance data. Each vendor provides SAN management applications that provide alerting and some level of performance management. Often, the performance management software is limited to real-time monitoring, and historical data collection features require additional licenses. In addition to the vendor-provided solutions, TotalStorage Productivity Center provides a component called TotalStorage Productivity Center for Fabric. TotalStorage Productivity Center for Fabric can collect the metrics that are shown in Table 8-12 on page 248.
c. Consider graphing the Overall Port Response Time and Total Port Data Rate metrics to determine if any of the ports along the I/O path are saturated during the time when the response time was degraded. If the Total Port Data Rate is close to the maximum expected throughput for the link, this situation is likely a contention point. You can add additional bandwidth to mitigate this type of issue either by adding additional links or by adding faster links, which might require upgrades of the server HBAs and the DS8000 host adapter cards in order to take advantage of the additional switch link capacity.
Besides the ability to create ad hoc reports using TotalStorage Productivity Center for Fabric metrics, TotalStorage Productivity Center provides the following reports:
i. IBM Tivoli Storage Productivity Center Reporting System Reporting Fabric Switch Performance
ii. IBM Tivoli Storage Productivity Center Reporting System Reporting Fabric Top Switch Port Data Rate
iii. IBM Tivoli Storage Productivity Center Reporting System Reporting Fabric Top Switch Port Packet Rate
Problem definition
The application owner complains of poor response time for transactions during certain times of the day.
Problem classification
There are no hardware errors, configuration issues, or host performance constraints.
Identification
Figure 8-43 on page 258 shows the average read response time for a Windows Server 2003 server performing a random workload in which the response time increases steadily over time.
Figure 8-43 Windows Server 2003 perfmon average physical disk read response time
At approximately 18:39, the average read response time jumps from approximately 15 ms to 25 ms. Further investigation on the host reveals that the increase in response time correlates with an increase in load as shown in Figure 8-44.
As discussed in 8.7, End-to-end analysis of I/O performance problems on page 249, there are several possibilities for high average disk read response time:
- DS8000 array contention
- DS8000 port contention
- SAN fabric contention
- Host HBA saturation
Because the most probable reason for the elevated response times is the disk utilization on the array, gather and analyze this metric first. Figure 8-45 shows the disk utilization on the DS8000.
Recommend changes
We recommend adding volumes on additional disks. For environments where host striping is configured, you might need to recreate the host volumes to spread the I/O from an existing workload across the new volumes.
Validate changes
Gather performance data and determine if the issue is resolved.
Classify the problem
After reviewing the hardware configuration and the error reports for all hardware components, we have determined that there are errors on the paths associated with one of the host HBAs as shown in Figure 8-46 on page 260. This output shows the errors on path 0 and path 1, which are both on the same HBA (SCSI port 1). For a Windows Server 2003 server running SDD, additional information about the host adapters is available via the gethba.exe command. The command that you use to identify errors depends on the multipathing software installation.
Identify the root cause
A further review of the switch software revealed significant errors on the switch port associated with the paths in question. A visual inspection of the environment revealed the cable from the host to the switch was kinked.
Implement changes to resolve the problem
Replace the cable.
Validate the problem resolution
Since implementing the change, the error counts have stopped increasing and nightly backups have completed within the backup window.
Identify the root cause
The IBM service support representative (SSR) ran the IBM diagnostics on the host HBA, and the card did not pass diagnostics.
Note: In the cases where there is a path with significant errors, you can disable the path with the multipathing software, which allows the non-working paths to be disabled without causing performance degradation to the working paths. With SDD, disable the path by using datapath set device # path # offline.
Implement changes to resolve the problem
Replace the card.
Validate the problem resolution
The errors did not persist after the card was replaced and the paths were brought online.
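For illustration only, assuming a hypothetical SDD device 3 whose path 1 reports the errors (use the device and path numbers shown by datapath query device on your own host), the sequence to take the failing path offline and return it to service after the repair might look like this:

datapath query device
datapath set device 3 path 1 offline
datapath set device 3 path 1 online

The first command confirms the device and path numbering before any path is disabled, and the last command brings the repaired path back into the load balancing rotation.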
(Chart: throughput in KB/sec from 17:39 to 19:24 for the development server disks (Dev Disk4 through Disk7) and the production server disks (Production Disk1, Disk2, and Disk5))
DS8000 port data reveals a peak throughput of around 300 MBps per port.
(Chart: DS8000 port data rate for ports R1-I3-C4-T0 and R1-I3-C1-T0, plotted at 5-minute intervals)
Implement changes to resolve the problem
Rezone ports for the production servers and the development servers so that they do not use the same DS8000 ports. Add additional ports so that each server HBA is zoned to two DS8000 ports according to best practices.
Validate the problem resolution
After implementing the new zoning that separated the production server and the development server, the storage ports were no longer the bottleneck.
Disconnect time is an indication of cache miss activity or destage wait (due to persistent memory high utilization) for logical disks behind the DS8000s. Device busy delay is an indication that another system has locked up a volume, or that an extent conflict has occurred among z/OS hosts, or among applications in the same host, when using Parallel Access Volumes. The DS8000 multiple allegiance and Parallel Access Volume capabilities allow it to process multiple I/Os against the same volume at the same time. However, if a read or write request against an extent is pending while another I/O is writing to the extent, or if a write request against an extent is pending while another I/O is reading or writing data from the extent, the DS8000 delays the I/O by queuing it. This condition is referred to as an extent conflict. Queuing time due to extent conflict is accumulated into the device busy (DB) delay time. An extent is a sphere of access; the unit of increment is a track. Usually, I/O drivers or system routines decide and declare the sphere.
To determine the possible cause of high disconnect times, check the read cache hit ratios, read-to-write ratios, and bypass I/Os for those volumes. If you see that the cache hit ratio is lower than usual although you have not added other workload to your System z environment, I/Os against Open Systems fixed block (FB) volumes might be the cause of the problem. Possibly, FB volumes defined on the same server had a cache-unfriendly workload, thus impacting the hit ratio of your System z volumes. In order to get more information about cache usage, you can check the cache statistics of the FB volumes that belong to the same server. You might be able to identify the FB volumes that have a low read hit ratio and a short cache holding time.
Moving the workload of these Open Systems logical disks, or the System z CKD volumes about which you are concerned, to the other side of the cluster, so that cache-friendly I/O workload is concentrated on either cluster, will improve the situation. If you cannot move the workload, or if the condition has not improved after this move, consider balancing the I/O distribution across more ranks. Balancing the I/O distribution across more ranks will optimize the staging and destaging operations. The approaches described in this chapter for using the data of other tools in conjunction with IBM TotalStorage Productivity Center for Disk do not cover all the possible situations that you will encounter. But if you understand how to interpret the DS8000 performance reports, and you also have a good understanding of how the DS8000 works, you will be able to develop your own ideas about how to correlate the DS8000 performance reports with other performance measurement tools when approaching specific situations in your production environment.
Chapter 9.
Host attachment
This chapter discusses the attachment considerations between host systems and the DS8000 series for availability and performance. Topics include:
- DS8000 attachment types
- Attaching Open Systems hosts
- Attaching System z hosts
We provide detailed information about performance tuning considerations for specific operating systems in subsequent chapters of this book.
In this largest configuration, you can support up to 128 direct connect host attachments, but the current implementation of the DS8300 allows for the installation of even more host adapters, up to 64 adapters or 256 ports. Consider these additional ports for connectivity purposes if you choose not to use Fibre Channel switched connections. Do not expect additional performance or throughput capabilities beyond the installation of 32 host adapters.
LUN masking
In Fibre Channel attachment, logical unit number (LUN) affinity is based on the worldwide port name (WWPN) of the adapter on the host, independent of the DS8000 host adapter port to which the host is attached. This LUN masking function on the DS8000 is provided through the definition of DS8000 volume groups. A volume group is defined using the DS Storage Manager or the DSCLI, and host WWPNs are connected to the volume group. The LUNs that are to be accessed by the hosts connected to the volume group are then defined to reside in that volume group. While it is possible to limit through which DS8000 host adapter ports a given WWPN connects to volume groups, we recommend that you define the WWPNs to have access to all available DS8000 host adapter ports. Then, using the recommended process of creating Fibre Channel zones, as discussed in Importance of establishing zones on page 268, you can limit the desired host adapter ports through the Fibre Channel zones. In a switched fabric with multiple connections to the DS8000, this concept of LUN affinity enables the host to see the same LUNs on different paths.
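As a hedged sketch of the DSCLI steps (the volume group name, host connection nickname, WWPN, host type, and volume IDs below are hypothetical placeholders; take the actual values from your own configuration and the DS8000 Host Systems Attachment Guide):

mkvolgrp -type scsimask -volume 1000-1003 prod_host_vg
mkhostconnect -wwname 10000000C9123456 -hosttype pSeries -volgrp V11 prod_host_fcs0
lsvolgrp -l
lshostconnect -l

Here, mkvolgrp creates the volume group and assigns the listed volumes to it, mkhostconnect associates the host WWPN with the volume group ID that mkvolgrp returned (V11 in this sketch), and the two list commands verify the result. Note that the host connection is deliberately not restricted to particular DS8000 I/O ports; the path restriction is done with Fibre Channel zoning, as recommended above.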
You can see how the number of logical devices presented to a host can increase rapidly in a SAN environment if you are not careful in selecting the size of logical disks and the number of paths from the host to the DS8000. Typically, we recommend that you cable the switches and create zones in the SAN switch software for dual-attached hosts so that each server host adapter has two to four paths from the switch to the DS8000. With hosts configured this way, you can let the multipathing module balance the load across the four host adapters in the DS8000. Zoning more paths, such as eight connections from the host to DS8000, generally does not improve SAN performance and only causes twice as many devices to be presented to the operating system.
9.2.3 Multipathing
Multipathing describes a technique that allows you to attach one host to an external storage device via more than one path. The use of multipathing can improve the fault tolerance and the performance of the overall system, because the failure of a single component in the environment can be tolerated without an impact to the host. Also, you can increase the overall system bandwidth, which positively influences the performance of the system. As illustrated in Figure 9-2, attaching a host system using a single-path connection creates a solution with several single points of failure. In this example, a single link failure (either between the host system and the switch or between the switch and the storage system), a failure of the host adapter in the host system or in the DS8000 storage system, or even a failure of the switch itself leads to a loss of access for the host system. Additionally, the path performance of the whole system is limited by the slowest component in the link.
Figure 9-2 SAN single-path connection
Adding additional paths requires you to use multipathing software, because otherwise the same LUN behind each path is handled as a separate disk by the operating system, which does not allow failover support. Multipathing provides DS8000-attached Open Systems hosts running Windows, AIX, HP-UX, Sun Solaris, or Linux with:
- Support for several paths per LUN.
- Load balancing between multiple paths when there is more than one path from a host server to the DS8000. This approach might eliminate I/O bottlenecks that occur when many I/O operations are directed to common devices via the same I/O path, thus improving I/O performance.
- Automatic path management, failover protection, and enhanced data availability for users that have more than one path from a host server to the DS8000. It eliminates a potential single point of failure by automatically rerouting I/O operations to the remaining active paths when a data path fails.
- Dynamic reconfiguration after changes to the environment, including zoning, LUN masking, and adding or removing physical paths.
Figure 9-3 DS8000 multipathing implementation using two paths
The DS8000 supports several multipathing implementations. Depending on the environment, host type, and operating system, only a subset of those implementations is available. This section introduces their concepts and gives general information about the implementation, usage, and specific benefits.
Note: Do not intermix several multipathing solutions within one host system; usually, the multipathing software solutions cannot coexist.
Note: Do not share LUNs among multiple hosts without the protection of Persistent Reserve (PR). If you share LUNs among hosts without PowerHA, you are exposed to data corruption situations. You must also use PR when using FlashCopy.
It is important to note that the IBM Subsystem Device Driver does not support booting from, or placing a system primary paging device on, an SDD pseudo device. For certain servers running AIX, booting off the DS8000 is supported. In that case, the LUNs used for booting are manually excluded from the SDD configuration by using the querysn command to create an exclude file. You can obtain more information in querysn for multi-booting AIX off the DS8000 on page 370.
For more information about installing and using SDD, refer to IBM System Storage Multipath Subsystem Device Driver Users Guide, GC52-1309. This publication and other information are available at:
http://www.ibm.com/servers/storage/support/
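As an illustration only (hdisk0 is a placeholder for the actual boot device, and the exact flags depend on your SDD level, so verify the syntax in the Multipath Subsystem Device Driver User's Guide), excluding an AIX boot disk from the SDD configuration might look like this:

querysn -l hdisk0

This records the serial number of the boot device in the SDD exclude file so that SDD does not build a vpath device over it.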
From an availability point of view, we discourage this configuration because of the single fiber cable from the host to the SAN switch. However, this configuration is better than a single path from the host to the DS8000, and this configuration can be useful for preparing for maintenance on the DS8000.
Multipath I/O
Multipath I/O (MPIO) summarizes the native multipathing technologies that are available in several operating systems, such as AIX, Linux, and Windows. Although the implementation differs for each of the operating systems, the basic concept is almost the same:
- The multipathing module is delivered with the operating system.
- The multipathing module supports failover and load balancing for standard SCSI devices, such as simple SCSI disks or SCSI arrays.
- To add device-specific support and functions for a specific storage device, each storage vendor might provide a device-specific module implementing advanced functions for managing that storage device.
IBM currently provides a device-specific module for the DS8000 for AIX, Linux, and Windows according to the information in Table 9-1.
Table 9-1 Available DS8000-specific MPIO path control modules

Operating system   Multipathing solution     Device-specific module          Acronym
AIX                MPIO                      SDD Path Control Module         SDDPCM
Windows            MPIO                      SDD Device Specific Module      SDDDSM (Subsystem Device Driver Device Specific Module)
Linux              Device-Mapper Multipath   DM-MPIO configuration file      DM-MPIO
Check the System Storage Interoperation Center (SSIC) Web site for your specific hardware configuration: http://www.ibm.com/systems/support/storage/config/ssic/
9.3.1 ESCON
The ESCON adapter in the DS8000 has two ports and is intended for connection to older System z hosts that do not support FICON. For good performance and high availability, the ESCON adapters (refer to 2.5.2, ESCON host adapters on page 25) must be available through all I/O enclosures and provide the following configuration characteristics:
- Access to only the first 16 (3390) logical control units (LCUs)
- Up to 32 ESCON links for the DS8100 and 64 ESCON links for the DS8300; two per ESCON host adapter
- A maximum of 64 logical paths per port or link and 256 logical paths per control unit image (or logical subsystem (LSS))
- Access to all 16 LCUs (4096 CKD devices) over a single ESCON port
- 17 MB/s native data rate
For System z environments with ESCON attachment, it is not possible to take full advantage of the DS8000 performance capacity. When configuring for ESCON, consider these general recommendations (refer to Figure 9-4 on page 276):
- Use 4-path or 8-path groups (preferably eight) between each System z host and LSS.
- Plug channels for a 4-path group into four host adapters across different I/O enclosures.
- Plug channels for an 8-path group into four host adapters across different I/O enclosures (using both ports per adapter) or into eight host adapters across different I/O enclosures.
- One 8-path group is better than two 4-path groups. This way, the host system and the DS8000 are able to balance all of the work across the eight available paths.
Figure 9-4 ESCON connectivity example: LPARs on two CECs attached through ESCON Directors to LCUs 00-07 on the DS8000
You can use ESCON cables to attach the DS8000 directly to a S/390 or System z host, or to an ESCON director, channel extender, or a dense wave division multiplexer (DWDM). ESCON cables cannot be used to connect to another DS8000, either directly or via an ESCON director or DWDM, for Remote Copy (PPRC). The maximum unrepeated distance of an ESCON link from the DS8000 to the host channel port, ESCON switch, or extender is 3 km (1.86 miles) using 62.5 micron fiber or 2 km (1.24 miles) using 50 micron fiber. The FICON bridge card in the ESCON Director 9032 Model 5 enables connections to ESCON host adapters in the storage unit. The FICON bridge architecture supports up to 16384 devices per channel.
Note: The IBM ESCON Director 9032 (including all models and features) has been withdrawn from marketing. There is no IBM replacement for the IBM 9032. Third-party vendors might be able to provide functionality similar to the IBM 9032 Model 5 ESCON Director.
9.3.2 FICON
FICON is the Fibre Connection protocol used with System z servers. Each storage unit host adapter has four ports, and each port has a unique worldwide port name (WWPN). You can configure a port to operate with the FICON upper layer protocol. When configured for FICON, the storage unit provides the following configuration characteristics:
- Either fabric or point-to-point topology
- A maximum of 64 host ports for DS8100 Models 921/931 and a maximum of 128 host ports for DS8300 Models 922/9A2 and 932/9B2
- A maximum of 2048 logical paths on each Fibre Channel port
- Access to all 255 control unit images (65280 CKD devices) over each FICON port
The connection speeds are 100 - 200 MB/s, which is similar to Fibre Channel for Open Systems. FICON channels were introduced in the IBM 9672 G5 and G6 servers with the capability to run at 1 Gbps. These channels were enhanced to FICON Express channels and then to FICON Express2 channels, both capable of running at transfer speeds of 2 Gbps. The fastest link speeds currently available are provided by FICON Express4 channels. They are designed to support 4 Gbps link speeds and can also auto-negotiate to 1 or 2 Gbps link speeds depending on the capability of the director or control unit port at the other end of the link. Operating at 4 Gbps speeds, FICON Express4 channels are designed to achieve up to 350 MBps for a mix of large sequential read and write I/O operations, as depicted in the following charts. Figure 9-5 shows a comparison of the overall throughput capabilities of various generations of channel technology.
As you can see, the FICON Express4 channel on the IBM System z9 EC and z9 BC represents a significant improvement in maximum bandwidth capability compared to FICON Express2 channels and previous FICON offerings. The response time improvements are expected to be noticeable for large data transfers. The speed at which data moves across a 4 Gbps link is effectively 400 MBps compared to 200 MBps with a 2 Gbps link. The maximum number of I/Os per second that was measured on a FICON Express4 channel running an I/O driver benchmark with a 4 KB per I/O workload is approximately 13000, which is the same as what was measured with a FICON Express2 channel. Changing the link speed has no effect on the number of small block (4 KB per I/O) I/Os that can be processed. The greater performance capabilities of the FICON Express4 channel make it a good match with the performance characteristics of the new DS8000 host adapters.
Note: FICON Express2 SX/LX and FICON Express SX/LX are supported on System z10 and System z9 servers only if carried forward on an upgrade. The FICON Express LX feature is required to support CHPID type FCV.
The System z10 and System z9 servers offer FICON Express4 SX and LX features that have four (or two for the 2-port SX and LX features) independent channels. Each feature occupies a single I/O slot and utilizes one CHPID per channel. Each channel supports 1 Gbps, 2 Gbps, and 4 Gbps link data rates with auto-negotiation to support existing switches, directors, and storage devices.
Note: FICON Express4-2C SX/4KM LX are only available on z10 BC and z9 BC. FICON Express4 is the last feature to support 1 Gbps link data rates. Future FICON features will not support auto-negotiation to 1 Gbps link data rates.
For any generation of FICON channels, you can attach directly to a DS8000 or you can attach via a FICON-capable Fibre Channel switch. When you use a Fibre Channel/FICON host adapter to attach to FICON channels, either directly or through a switch, the port is dedicated to FICON attachment and cannot be simultaneously attached to FCP hosts. When you attach a DS8000 to FICON channels through one or more switches, the maximum number of FICON logical paths is 2048 per DS8000 host adapter port. The directors provide extremely high availability with redundant components and no single points of failure.
Figure 9-6 on page 279 shows an example of FICON attachment to connect a System z server through FICON switches, using 16 FICON channel paths to eight host adapter ports on the DS8000, and addressing eight Logical Control Units (LCUs). This channel consolidation might be possible when your host workload does not exceed the performance capabilities of the DS8000 host adapter and is most appropriate when connecting to the original generation FICON channel. It is likely, again depending on your workload, that FICON Express2 channels must be configured one to one with a DS8000 host adapter port.
Figure 9-6 FICON attachment example: two System z servers with FICON (FC) channels connected through FICON Directors to DS8000 host adapter ports addressing LCUs 20-27
Chapter 10. Performance considerations with Windows servers
In order to initiate an I/O request, an application issues an I/O request using one of the supported I/O request calls. The I/O manager receives the application I/O request and passes the I/O request packet (IRP) from the application to each of the lower layers that route the IRP to the appropriate device driver, port driver, and adapter-specific driver. Windows server filesystems can be configured as FAT, FAT32, or NTFS. The file structure is specified for a particular partition or logical volume. A logical volume can contain one or more physical disks. All Windows volumes are managed by the Windows Logical Disk Management utility. For additional information relating to the Windows Server 2003 and Windows Server 2008 I/O stacks and performance, refer to the following documents:
http://download.microsoft.com/download/5/6/6/5664b85a-ad06-45ec-979e-ec4887d715eb/Storport.doc
http://download.microsoft.com/download/5/b/9/5b97017b-e28a-4bae-ba48-174cf47d23cd/STO089_WH06.ppt
I/O priorities
The Windows Server 2008 I/O subsystem provides a mechanism to specify I/O processing priorities. Windows will primarily use this mechanism to prioritize critical I/O requests over background I/O requests. API extensions exist to provide application vendors file-level I/O priority control. The prioritization code has some processing overhead and can be disabled for disks that are targeted for similar I/O activities (such as an SQL database).
10.4 Filesystem
A filesystem is a part of the operating system that determines how files are named, stored, and organized on a volume. A filesystem manages files, folders, and the information needed to locate and access these files and folders for local or remote users.
Compression
compression. In addition to causing additional CPU overhead, the I/O subsystem will not honor asynchronous I/O calls made to compressed files. Refer to the following link for additional detail:
http://support.microsoft.com/kb/156932
Defragment disks
Over time, files become fragmented in noncontiguous clusters across disks, and disk response time suffers as the disk head jumps between tracks to seek and reassemble the files when they are required. We recommend regularly defragmenting volumes.
For Windows 2000 Server and Windows Server 2003 servers, use diskpar.exe and diskpart.exe respectively to force sector alignment. Windows Server 2008 automatically enforces a 1 MB offset for the first sector in the partition, which negates the need for using diskpart.exe. For additional information, refer to the following documents:
http://download.microsoft.com/download/9/c/5/9c5b2167-8017-4bae-9fde-d599bac8184a/Perf-tun-srv.docx
http://support.microsoft.com/kb/929491
Note: The start sector offset must be 256 KB due to the stripe size on the DS8000. Workloads with small, random I/Os (<16 KB) will not likely experience any significant performance improvement from sector alignment on DS8000 LUNs.
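As a hedged sketch of forcing the 256 KB alignment mentioned above with diskpart.exe on Windows Server 2003 SP1 or later (the disk number, partition, and drive letter are hypothetical, and the align parameter is specified in KB; verify the option against your diskpart version before using it on production disks):

diskpart
DISKPART> select disk 4
DISKPART> create partition primary align=256
DISKPART> assign letter=F
DISKPART> exit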
Block alignment
Basic disk
All dynamic disks contain an LDM database that keeps track of changes to the volume state and synchronizes the databases across disks for the purpose of recovery. If all dynamic disks exist on the SAN and there is an unplanned outage to the SAN disks, the LDM on the SAN disks will all be in the same state. If you have dynamic disks both locally and on the SAN, there is a high probability that the LDMs will be out of synch if you take an outage to your SAN disks only. For this reason, Microsoft recommends the following approach when configuring a system with SAN-attached disks, such as DS8000.
Use dynamic disks for SAN-attached storage and basic disks for local storage, or use basic disks for SAN-attached storage and dynamic disks for local storage. For more information about this recommendation, refer to the following article at: http://support.microsoft.com/kb/816307 Note: Concatenated volumes provide capacity scalability but do not distribute allocation units across the physical disks. Volume space is allocated sequentially starting at the first drive in the drive set, which often leads to hot spots on certain volumes within a concatenated volume.
Notes:
- RAID 0 provides no availability improvement. If the two physical disks (DS8000 LUNs) reside on the same DS8000 rank, there is no additional availability.
- While RAID 0 has some potential performance benefits for sequential I/O streams, the Microsoft LDM implementation does not allow a software RAID 0 volume to be extended, which makes it impractical for enterprise-class servers.
- We do not recommend using Microsoft software RAID in conjunction with physical disks provisioned from a DS8000.
VxVM includes the following features:
- Support for concatenated, striped (RAID 0), mirrored (RAID 1), mirrored striped (RAID 1+0), and RAID 5 volumes
- Dynamic expansion of all volume types
- Dynamic MultiPathing (DMP) as an optional component
- Support for Microsoft Cluster Service (might require additional hardware and software)
- Support for up to 256 physical disks in a dynamic volume
The Veritas Storage Foundation Administrator Guide contains additional information relating to VxVM, and you can refer to it at:
http://seer.entsupport.symantec.com/docs/286744.htm
Note: For applications requiring high sequential throughput, consider using striped volumes. Striped volumes must be extended by the number of drives in the stripe set. For example, a volume striped across a series of four physical disks will require four physical disks to be added during any extension of the striped volume. To have any performance benefit, the physical disks (DS8000 LUNs) have to reside on separate DS8000 ranks.
Table 10-1 Example applications (such as SQL Server online transaction processing (OLTP) and IIS Server workloads) with their typical throughput and response time sensitivity characteristics
Note: Consider the applications listed in Table 10-1 as general examples only and not specific rules.
On Windows servers, multipathing implementations rely on either native multipathing (Microsoft MPIO plus the Storport driver) or non-native multipathing with the Small Computer System Interface (SCSI) port driver (SCSIport). The following sections discuss the performance considerations for each of these implementations.
Key features addressed by the Storport driver include:
- Adapter limits removed: There are no adapter limits. There is a limit of 254 requests queued per device.
- Improvement in I/O request processing: Storport decouples the StartIo and interrupt processing, enabling parallel processing of start and completion requests.
- Improved IRQL processing: Storport provides a mechanism to perform part of the I/O request preparation work at a low priority level, reducing the number of requests queued at the same elevated priority level.
- Improvement in data buffer processing: Lists of information are exchanged between the Storport driver and the miniport driver, as opposed to single-element exchanges.
- Improved queue management: Granular queue management functions provide HBA vendors and device driver developers the ability to improve management of queued I/O requests.
For additional information about the Storport driver, refer to the following document:
http://download.microsoft.com/download/5/6/6/5664b85a-ad06-45ec-979e-ec4887d715eb/Storport.doc
In Windows Server 2003, the MPIO drivers are provided as part of the SDDDSM package. On Windows Server 2008, they ship with the operating system.
Note: For non-clustered environments, we recommend using SDDDSM for the performance and scalability improvements previously described.
At the application layer, there are application-specific tools and metrics available for monitoring and analyzing application performance on Windows servers. Application-specific objects and counters are outside the scope of this text. The I/O Manager provides a control mechanism for interacting with the lower layer devices. Many of the I/O Manager calls are monitored and recorded in Event Trace for Windows (ETW), which is available in the Windows Performance console (perfmon). While the information provided by ETW can be excellent for problem determination, it is often complex to interpret and much too detailed for general disk performance issues, particularly in Windows 2000 Server and Windows Server 2003. The usability of ETW improved when Microsoft provided a utility called Windows Server Performance Analyzer (SPA) that works on Windows Server 2003 servers. It provides a simple way to collect system performance statistics (volume metrics), as well as ETW information, and to process and correlate it into a user-friendly report. In Windows Server 2008, all of the functionality of SPA was incorporated into perfmon.
In an effort to appeal to the widest possible audience, we take a generic approach that can be applied across Windows servers. In this approach, we utilize the basic perfmon logging facilities to collect key PhysicalDisk and LogicalDisk counters to diagnose the existence of disk performance issues. We do not provide analysis of the ETW events, although you can analyze the ETW events with the use of SPA. We also demonstrate how to correlate the Windows physical disks to the DS8000 LUNs using the configuration information provided with IBM SDD/SDDDSM. In this section, we discuss:
- Overview of I/O metrics
- Overview of perfmon
- Overview of logs
- Mechanics of logging
- Mechanics of exporting data
- Collecting multipath data
- Correlating the configuration and performance data
- Analyzing the performance data
access from the disk subsystem, resulting in delays of milliseconds per I/O request. Due to the relatively long processing times for I/O requests, the disk subsystem often composes the largest component of end-to-end application response time. As a result, the disk subsystem can be the single most important aspect of overall application performance. In this section, we discuss the key performance metrics available for diagnosing performance issues on Windows servers.
In Windows servers, there are two kinds of disk counters: PhysicalDisk object counters and LogicalDisk object counters. PhysicalDisk object counters are used to monitor single disks or hardware RAID arrays (DS8000 LUNs) and are enabled by default. LogicalDisk object counters are used to monitor logical disks or software RAID arrays and are enabled by default on Windows Server 2003 and Windows Server 2008 servers. In Windows 2000 Server, the logical disk performance counters are disabled by default but can be enabled by typing the command DISKPERF -ye and then restarting the server.
Tip: When attempting to analyze disk performance bottlenecks, always use physical disk counters to identify performance issues with individual DS8000 LUNs.
Table 10-2 describes the key I/O-related metrics that are reported by perfmon.
Table 10-2 Performance monitoring counters

Object          Counter                           Description
Physical Disk   Average Disk sec/Read             The average amount of time in seconds to complete an I/O read request. Because most I/Os complete in milliseconds, three decimal places are appropriate for viewing this metric. These results are end-to-end disk response times.
Physical Disk   Average Disk sec/Write            The average amount of time in seconds to complete an I/O write request. Because most I/Os complete in milliseconds, three decimal places are appropriate for viewing this metric. These results are end-to-end disk response times.
Physical Disk   Disk Reads/sec                    The average number of disk reads per second during the collection interval.
Physical Disk   Disk Writes/sec                   The average number of disk writes per second during the collection interval.
Physical Disk   Disk Read Bytes/sec               The average number of bytes read per second during the collection interval.
Physical Disk   Disk Write Bytes/sec              The average number of bytes written per second during the collection interval.
Physical Disk   Average Disk Read Queue Length    Indicates the average number of read I/O requests waiting to be serviced.
Physical Disk   Average Disk Write Queue Length   Indicates the average number of write I/O requests waiting to be serviced.
General rules
We provide the following rules based on our field experience. These rules are provided as general guidelines and do not represent service level agreements (SLAs) or service level objectives (SLOs) that have been endorsed by IBM for your DS8000. Prior to using these rules for anything specific, such as a contractual SLA, you must carefully analyze and
consider these technical requirements: disk speeds, RAID format, workload variance, workload growth, measurement intervals, and acceptance of response time and throughput variance. The general rules are:
- In general, average write disk response times for Fibre Channel-based DS8000 LUNs must be between 2 and 6 ms.
- In general, average read disk response times for Fibre Channel-based DS8000 LUNs must be between 5 and 15 ms.
- Average values higher than the top end of these ranges indicate contention either in the fabric or in the DS8000.
- It is beneficial to look at both the I/O rates and the disk response times, because often the disks with the highest response times had extremely low I/O rates. Focus on those disks that have high I/O rates and high response times.
- Shared storage environments are more likely to have a variance in disk response time. If your application is highly sensitive to variance in response time, you need to isolate the application at either the processor complex, device adapter (DA), or rank level.
- On average, the total I/Os to any single volume (partial rank - DS8000 LUN) must not exceed 500 IOPS, particularly if this workload is a write-intensive workload on a RAID 5 LUN. If you consistently issue more than 500 IOPS to a single LUN, look at spreading out the data in the volume across more than one physical disk, or use DS8000 Storage Pool Striping (SPS) to rotate the extents across ranks.
Note: By default, Windows provides the response time in seconds. In order to convert to milliseconds, you must multiply by 1000, which is done automatically in the perfmon-essmap.pl script that is provided in Running the scripts on page 609.
The Performance console is a snap-in for the Microsoft Management Console (MMC). You use the Performance console to configure the System Monitor and the Performance Logs and Alerts tools. You can open the Performance console by clicking Start → Programs → Administrative Tools → Performance or by typing perfmon on the command line.
As with Windows Server 2003, you can open the Performance console by clicking Start → Programs → Administrative Tools → Performance or by typing perfmon on the command line.
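Alternatively, a counter log can be created and started from the command line with the logman utility. The log name, counter list, one-minute interval, and output path below are illustrative placeholders only; adjust them to your environment:

logman create counter ds8000_disk -c "\PhysicalDisk(*)\Avg. Disk sec/Read" "\PhysicalDisk(*)\Avg. Disk sec/Write" "\PhysicalDisk(*)\Disk Reads/sec" "\PhysicalDisk(*)\Disk Writes/sec" -si 00:01:00 -f csv -o C:\perflogs\ds8000_disk
logman start ds8000_disk

The resulting CSV file can then be correlated to the DS8000 LUNs by using the multipathing configuration data described in the following steps.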
3. By default, the output shows on the display. In order to save the data, you need to redirect the output to a file. Enter datapath query essmap > $servername.essmap.txt where $servername is the actual name of the server and press Enter as shown in Example 10-1.
Example 10-1 The datapath query essmap command C:\Program Files\IBM\Subsystem Device Driver>datapath query essmap > $servername.essmap.txt
4. Place the $servername.essmap.txt file in the same directory as the performance comma-separated values (CSV) file.
5. In addition to the multipathing configuration data, gather the host HBA configuration information, including but not limited to the HBA bandwidth, errors, and HBA queue settings.
The following case involves a Windows Server 2003 server with two 2 Gbps QLogic HBAs. A highly sequential read workload ran on the system. After the performance data is correlated to the DS8000 LUNs and reformatted, open the performance data file in Microsoft Excel; it looks similar to Figure 10-6.
Figure 10-6 Correlated performance data in Microsoft Excel (columns include date, time, subsystem, LUN serial number, disk, disk reads/sec, average read response time in ms, average read queue length, and read KB/sec)
Summary
Microsoft Excel provides an excellent way to summarize and analyze data via pivot tables. After you have the normalized data, you can create a pivot table. After you have created the pivot table, you can summarize the performance data by placing the disk and the LUN data in the rows section and all of the key metrics in the data section. Figure 10-7 shows the summarized data.
Figure 10-7 Pivot table summary by LUN and disk (average disk reads/sec, average read response time of approximately 20-21 ms, average read queue length, and average read KB/sec)
Observations:
- The average read response time of approximately 20 ms indicates a bottleneck.
- I/Os are spread evenly across the disks. There is not a single disk that has a bottleneck.
- The sum of the average combined throughput is approximately 396,360 KB/sec, which is extremely close to the theoretical limit of two 2 Gbps HBAs.
Note: In order to have a meaningful summary or averaged data, you must collect the data for a period of time that reflects the goal of the collection. High response times during a backup period might not be problematic, whereas high response times during an online period might indicate problems. If the data is collected over too long a period, the averages can be misleading and reflect multiple workloads.
(Chart: Disk Read KB/sec over the collection period from 13:33 to 15:33, with throughput leveling off at approximately 400,000 KB/sec)
Observations:
- Throughput peaked early during the measurement period and resulted in a horizontal line at approximately 400,000 KB/sec. It appears that this workload's throughput was limited from the beginning of the collection period.
- We know from previously gathered data that the system has two HBAs with a 2 Gbps capacity each, which equates to roughly 200 MBps per adapter.
Recommendations
Confirm that the other SAN fabric components and the DS8000 host adapters are 4 Gbps capable. If the server HBAs are the limiting 2 Gbps components, replacing the current 2 Gbps cards with 4 Gbps cards will increase the bandwidth available to the host for additional throughput.
3. Produce the report.
4. Analyze the report.
At the time of the writing of this book, an excellent guide about using SPA to perform disk performance diagnosis was available at the following link:
http://www.codeplex.com/PerfTesting/Wiki/View.aspx?title=How%20To%3a%20Identify%20a%20Disk%20Performance%20Bottleneck%20Using%20SPA1&referringTitle=How%20Tos
Processes tab
In Figure 10-9, you can see the resources being consumed by each of the processes currently running. You can click the column headings to change the sort order, which will be based on that column.
Click View → Select Columns. This selection displays the window shown in Figure 10-10, from which you can select additional data to be displayed for each process.
Table 10-3 shows the columns available in the Windows Server 2003 operating system that are related to disk I/O.
Table 10-3 Task Manager disk-related columns

Column           Description
Paged Pool       The paged pool (user memory) usage of each process. The paged pool is virtual memory available to be paged to disk. It includes all of the user memory and a portion of the system memory.
Non-paged Pool   The amount of memory reserved as system memory and not pageable for this process.
Base Priority    The process's base priority level (low/normal/high). You can change the process's base priority by right-clicking it and selecting Set Priority. This option remains in effect until the process stops.
I/O Reads        The number of read input/output (file, network, and disk device) operations generated by the process.
I/O Read Bytes   The number of bytes read in input/output (file, network, and disk device) operations generated by the process.
I/O Writes       The number of write input/output operations (file, network, and disk device) generated by the process.
I/O Write Bytes  The number of bytes written in input/output operations (file, network, and device) generated by the process.
I/O Other        The number of input/output operations generated by the process that are neither reads nor writes (for example, a control type of operation).
I/O Other Bytes  The number of bytes transferred in input/output operations generated by the process that are neither reads nor writes (for example, a control type of operation).
Performance tab
The Performance view shows you performance indicators, as shown in Figure 10-11.
The charts show you the CPU and memory usage of the system as a whole. The bar charts on the left show the instantaneous values, and the line graphs on the right show the history since Task Manager was started. The four sets of numbers under the charts are:
- Totals:
  Handles: Current total handles of the system
  Threads: Current total threads of the system
  Processes: Current total processes of the system
- Physical Memory (K):
  Total: Total RAM installed (in KB)
  Available: Total RAM available to processes (in KB)
  File Cache: Total RAM released to the file cache on demand (in KB)
- Commit Charge (K):
  Total: Total amount of virtual memory in use by all processes (in KB)
  Limit: Total amount of virtual memory (in KB) that can be committed to all processes without adjusting the size of the paging file
  Peak: Maximum virtual memory used in the session (in KB)
- Kernel Memory (K):
  Total: Sum of paged and non-paged kernel memory (in KB)
  Paged: Size of the paged pool allocated to the operating system (in KB)
  Non-paged: Size of the non-paged pool allocated to the operating system (in KB)
The following section provides several examples of types of tests related to I/O performance.
synthetic workload, you can validate the performance characteristics of the I/O subsystem without the added complication of the application. For example, if an application owner states that the I/O subsystem is under-performing, take the following steps:
- Gather the I/O workload characteristics of the application (reads/writes and sequential/random).
- Generate load by using a synthetic load tool.
- Gather measurements.
- Analyze the data.
- Compare the results to the hypothesis: does the I/O subsystem perform reasonably, or is it under-performing?
Perform various what-if scenarios. Often, it is necessary to understand the implications of major future changes to the environment or workload. In these types of situations, you perform the same steps as described previously in 10.10, I/O load testing on page 304.
10.10.2 Iometer
Iometer is an I/O subsystem measurement and characterization tool for single and clustered
systems. Formerly, Iometer was owned by Intel Corporation, but Intel has discontinued work on Iometer, and it was given to the Open Source Development Lab. For more information about Iometer, go to:
http://www.iometer.org/
Iometer is both a workload generator (it performs I/O operations in order to stress the system) and a measurement tool (it examines and records the performance of its I/O operations and their impact on the system). It can be configured to emulate the disk or network I/O load of any program or benchmark, or it can be used to generate entirely synthetic I/O loads. It can generate and measure loads on single or multiple (networked) systems. Iometer can be used for the measurement and characterization of:
- Performance of disk and network controllers
- Bandwidth and latency capabilities of buses
- Network throughput to attached drives
- Shared bus performance
- System-level hard drive performance
- System-level network performance
Iometer consists of two programs, Iometer and Dynamo:
- Iometer is the controlling program. Using Iometer's graphical user interface, you configure the workload, set operating parameters, and start and stop tests. Iometer tells Dynamo what to do, collects the resulting data, and summarizes the results in output files. Only run one copy of Iometer at a time; it is typically run on the server machine.
- Dynamo is the workload generator. It has no user interface. At Iometer's command, Dynamo performs I/O operations, records performance information, and then returns the data to Iometer. There can be more than one copy of Dynamo running at a time: typically, one copy runs on the server machine, and one additional copy runs on each client machine. Dynamo is multi-threaded; each copy can simulate the workload of multiple client programs. Each running copy of Dynamo is called a manager. Each thread within a copy of Dynamo is called a worker.
Iometer provides the ability to configure:
- Read/write ratios
- Sequential/random access
- Arrival rate and queue depth
- Block size
- Number of concurrent streams
With these configuration settings, you can simulate and test most types of workloads. Specify the workload characteristics to reflect the workload in your environment.
Chapter 11. Performance considerations with UNIX servers
Apply the recommended patches. Go to the required patches Web page for additional information about how to download and install them:
http://www.hp.com/products1/unix/java/patches/index.html
Also, always consult the IBM System Storage DS8000 Host System Attachment Guide, SC26-7917-02, for detailed information about how to attach and configure a host system to a DS8000:
http://www-01.ibm.com/support/docview.wss?rs=1114&context=HW2B2&dc=DA400&q1=ssg1*&uid=ssg1S7001161&loc=en_US&cs=utf-8&lang=en
I/O requests normally go through these layers:
- Application/database layer: This layer is the top-level layer where many of the I/O requests start. Each application generates I/Os that follow a pattern or profile. The characteristics that compose an application's I/O profile are:
  IOPS: The number of I/Os (reads and writes) per second.
  Throughput: How much data is transferred in a given sample time. Typically, throughput is measured in MB/s or KB/s.
  I/O size: The result of MB/s divided by IOPS.
  Read ratio: The percentage of I/O reads compared to the total number of I/Os.
  Disk space: The total amount of disk space needed by the application.
- I/O system calls layer: Through the system calls provided by the operating system, the application issues I/O requests to the storage. By default, all I/O operations are synchronous. Many operating systems also provide asynchronous I/O, which is a facility that allows an application to overlap processing time while it issues I/Os to the storage. Typically, databases take advantage of this feature.
- Filesystem layer: The filesystem is the operating system's way to manipulate data in the form of files. Many filesystems support buffered and unbuffered I/Os. If your application has its own caching mechanism and supports a type of direct I/O, we
recommend that you enable it, because this avoids double-buffering and reduces CPU utilization. Otherwise, your application can take advantage of features such as file caching, read-ahead, and write-behind.
- Volume manager layer: A volume manager is a key component for distributing the I/O workload over the logical unit numbers (LUNs) of the DS8000. You need to understand how volume managers work in order to combine strategies for spreading the workload at the storage level with spreading the workload at the operating system level, and consequently maximize the I/O performance of the database.
- Multipathing/disk layer: Today, there are several multipathing solutions available: hardware multipathing, software multipathing, and operating system multipathing. It is usually better to adopt the operating system multipathing solution. However, depending on the environment, you might face limitations and prefer to use a hardware or software multipathing solution. Always try not to exceed a maximum of four paths for each LUN unless required.
- FC adapter layer: The need to make configuration changes in the FC adapters depends on the operating system and vendor model. Always consult the DS8000 Host Attachment Guide for specific instructions about how to set up the FC adapters. Also, check the compatibility matrix for dependencies among the firmware level, operating system patch levels, and adapter models.
- Fabric layer: The Storage Area Network (SAN) is used to interconnect storage devices and servers.
- Array layer: The array layer is the DS8000 in our case.
Normally, in each of these layers there are performance indicators that enable you to assess how that particular layer impacts performance.
The wio value might be an indication that there is a disk I/O bottleneck, but it is not enough to conclude from wio alone that there is a disk I/O constraint. We must observe other counters, such as the blocked processes in the kernel threads column and the statistics generated by iostat or an equivalent tool. Disk technology has evolved significantly. In the past, disks were only capable of about 120 I/Os per second and had no cache memory. Consequently, utilization levels of 10 to 30% were considered extremely high. Today, with arrays of the DS8000 class (supporting tens or hundreds of gigabytes of cache memory and hundreds of physical disks at the back end), even utilization levels above 80 or 90% still might not indicate an I/O performance problem. It is fundamental to check the queue length, the service time, and the I/O size averages being reported in the disk statistics:
- If the queue length and the service time are low, there is no performance problem.
- If the queue length is low and the service time and the I/O size are high, that is also not evidence of a performance problem.
Performance thresholds might only indicate that something has changed in the system. However, they are not able to say why or how it has changed. Only a good interpretation of the data is able to answer these types of questions. Here is a real case: A user was complaining that the transactional system had a performance degradation of 12% in the average transaction response time. The Database Administrator (DBA) argued that the database spent a good part of the time in disk I/O operations. At the operating system and the storage level, all performance indicators were excellent. The only curious fact was that the system had a cache hit ratio of 80%, which did not make much sense, because the access characteristic of a transactional system is random. The high cache hit ratio indicated that somehow the system was reading sequentially. By analyzing the application, it was discovered that about 30% of the database workload was related to eleven queries. These eleven queries were accessing the tables without the help of an index. The fix was the creation of those indexes with specific fields to optimize the access to disk.
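As a sketch (the hdisk name, interval, and count are placeholders), the AIX extended disk statistics that report per-disk service times and queue lengths can be gathered as follows:

iostat -D hdisk4 30 3
vmstat 30 5

iostat -D prints read and write service times plus the wait-queue statistics for the device, and the b column of vmstat shows blocked kernel threads, which helps to qualify whether a high wio percentage really points to a disk constraint.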
For a complete discussion of AIX tuning, refer to the following links:
- AIX 6.1 Performance Tuning Manual, SC23-5253-00:
  http://publib.boulder.ibm.com/infocenter/systems/scope/aix/topic/com.ibm.aix.prftungd/doc/prftungd/file_sys_perf.htm?tocNode=int_214554
- AIX 5.3 Performance Tuning Manual, SC23-4905-04:
  http://publib.boulder.ibm.com/infocenter/pseries/v5r3/topic/com.ibm.aix.prftungd/doc/prftungd/file_sys_perf.htm
- The Performance Tuning Manual discusses the relationships between the Virtual Memory Manager (VMM) and the buffers used by the filesystems and the Logical Volume Manager (LVM):
  http://publib.boulder.ibm.com/infocenter/systems/index.jsp?topic=/com.ibm.aix.prftungd/doc/prftungd/vmm_page_replace_tuning.htm
- A paper providing tuning recommendations for Oracle on AIX:
  http://www-03.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/WP100883
- A paper discussing the setup and tuning of Direct I/O with SAS 9:
  http://www-03.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/WP100890
- A paper describing the performance improvement of an Oracle database with concurrent I/O (CIO):
  http://www-03.ibm.com/systems/resources/systems_p_os_aix_whitepapers_db_perf_aix.pdf
- A paper discussing how to optimize Sybase ASE on AIX:
  http://www.ibm.com/servers/enable/site/peducation/wp/b78a/b78a.pdf
11.2.1 AIX Journaled File System (JFS) and Journaled File System 2 (JFS2)
JFS and JFS2 are the AIX standard filesystems. JFS was created for the 32-bit kernels. Both implement the concept of a transactional filesystem, where all of the I/O operations on the metadata information are kept in a log. The practical impact is that, in the case of a recovery of a filesystem, the fsck command looks at that log to see which I/O operations were completed and rolls back only those operations that were not completed. Of course, from a performance point of view, there is overhead. However, it is generally an acceptable compromise to ensure the recovery of a corrupted filesystem.
The JFS file organization method is a linear algorithm. You can mount JFS filesystems with the Direct I/O option. You can adjust the mechanisms of sequential read ahead, sequential and random write behind, delayed write operations, and others. You can tune its buffers to increase performance. It also supports asynchronous I/O.
JFS2 was created for 64-bit kernels. Its file organization method is a B+ tree algorithm. It supports all the features described for JFS, with the exception of delayed write operations. It also supports concurrent I/O (CIO).
Read ahead
JFS and JFS2 have read ahead algorithms that can be configured to buffer data for sequential reads into the filesystem cache before the application requests it. Ideally, this feature reduces the percentage of I/O wait (%iowait) and increases I/O throughput as seen from the operating system. Configuring the read ahead algorithms too aggressively will result in unnecessary I/O. The VMM tunable parameters that control read ahead behavior are:
For JFS:
- minpgahead = max(2, <application's blocksize> / <filesystem's blocksize>)
- maxpgahead = max(256, (<application's blocksize> / <filesystem's blocksize> * <application's read ahead block count>))
For JFS2:
- j2_minPgReadAhead = max(2, <application's blocksize> / <filesystem's blocksize>)
- j2_maxPgReadAhead = max(256, (<application's blocksize> / <filesystem's blocksize> * <application's read ahead block count>))
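As an illustration (the values are hypothetical placeholders that should be derived from your application's I/O size, and -p makes the change persistent across reboots), the JFS2 read ahead tunables are set with the ioo command:

ioo -p -o j2_minPgReadAhead=8 -o j2_maxPgReadAhead=512
ioo -a | grep -i readahead

The second command lists the current values so that you can verify the change before driving a test workload.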
I/O pacing
The purpose of I/O pacing is to manage concurrency to files and segments by limiting the CPU resources for processes that exceed a specified number of pending write I/Os to a discrete file or segment. When a process exceeds the maxpout limit (high-water mark), it is put to sleep until the number of pending write I/Os to the file or segment is less than minpout (low-water mark). This pacing allows another process to access the file or segment. Disabling I/O pacing (default) improves backup times and sequential throughput. Enabling I/O pacing ensures that no single process dominates the access to a file or segment. Typically, we recommend leaving I/O pacing disabled. There are certain circumstances where it is appropriate to have I/O pacing enabled: For HACMP, we recommend enabling I/O pacing to ensure that heartbeat activities complete. If you enable it, start with settings of maxpout=321 and minpout=240. Beginning with AIX 5.3, I/O pacing can be enabled at the filesystem level with mount command options. In AIX Version 6, I/O pacing is technically enabled but with such high settings that the I/O pacing will not become active except under extreme situations. In summary, enabling I/O pacing improves user response time at the expense of throughput.
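A hedged sketch of enabling system-wide I/O pacing with the high-water and low-water marks mentioned above (sys0 attribute names are the standard AIX ones; verify the values against your HACMP planning documentation):

chdev -l sys0 -a maxpout=321 -a minpout=240
lsattr -El sys0 -a maxpout -a minpout

Beginning with AIX 5.3, the same marks can also be applied to an individual filesystem through mount options, for example: mount -o maxpout=321,minpout=240 /app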
Write behind
This mechanism enables the operating system to initiate I/O that is normally controlled by the syncd daemon. Writes are triggered when a specified number of sequential 16 KB clusters are updated. The tunable parameters are:
Sequential write behind: numclust for JFS; j2_nPagesPerWriteBehindCluster for JFS2
Random write behind: maxrandwrt for JFS; j2_maxRandomWrite and j2_nRandomCluster for JFS2
Note that setting j2_nPagesPerWriteBehindCluster to 0 disables JFS2 sequential write behind, and setting j2_maxRandomWrite to 0 disables JFS2 random write behind.
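A minimal sketch of adjusting these thresholds with ioo (the values are illustrative only and should be validated against the workload):
   ioo -p -o numclust=64 -o maxrandwrt=32
   ioo -p -o j2_nPagesPerWriteBehindCluster=64 -o j2_maxRandomWrite=128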
Mount options
Use the release behind mount options when appropriate:
The release behind mount options can reduce syncd and lrud overhead. They modify the filesystem behavior so that it does not maintain data in the JFS2 cache. Use these options if you know that data going into or out of certain filesystems will not be requested again by the application before the data is likely to be paged out. The lrud daemon then has less work to do to free up cache, and any syncd overhead for this filesystem is eliminated. One example of a situation where you can use these options is a Tivoli Storage Manager server with disk storage pools in filesystems, where the read ahead mechanism is configured to increase throughput, especially when a migration takes place from disk storage pools to tape storage pools. The options are:
-rbr for release behind after a read
-rbw for release behind after a write
-rbrw for release behind after a read or a write
Direct I/O (DIO):
Bypasses the JFS/JFS2 cache
No read ahead
An option of the mount command
Useful for databases that use filesystems rather than raw logical volumes. If an application has its own cache, it does not make sense to also keep the data in the filesystem cache.
Concurrent I/O (CIO):
Same as DIO but without inode locking, so the application must ensure data integrity for multiple simultaneous I/Os to a file.
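A sketch of how these options are specified on the mount command (the mount points are hypothetical):
   mount -o rbrw /tsm/stgpool1
   mount -o dio /oracle/data
   mount -o cio /oracle/data
The same options can be made persistent by adding them to the options line of the filesystem stanza in /etc/filesystems.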
Asynchronous I/O
Asynchronous I/O is the AIX facility that allows an application to issue an I/O request and continue processing without waiting for the I/O to finish: Since AIX 5.2, there are two types of asynchronous I/O: the legacy AIO and the new POSIX-compliant AIO. Many databases already take advantage of legacy AIO, so normally, AIX legacy AIO will be enabled.
For additional information about the two types of asynchronous I/O, consult AIX 5L Differences Guide Version 5.2 Edition, SG24-5765, section 2.5 POSIX-compliant AIO (5.2.0):
http://www.redbooks.ibm.com/abstracts/sg245765.html?Open
With AIX 5.3 Technology Level (TL) 05, a new aioo command was shipped with the AIO fileset (bos.rte.aio) that allows you to increase the values of three tunable parameters (minservers, maxservers, and maxreqs) online without a reboot. However, a reduction of any of these values requires a server reboot to take effect.
With AIX Version 6, the tunables fastpath and fsfastpath are classified as restricted tunables and are now set to a value of 1 by default. Therefore, all asynchronous I/O requests to a raw logical volume are passed directly to the disk layer using the corresponding strategy routine (legacy AIO or POSIX-compliant AIO), and all asynchronous I/O requests for files opened with CIO are passed directly to LVM or disk using the corresponding strategy routine. Also, there are no longer any AIO devices in the Object Data Manager (ODM); all of their parameters have become tunables managed with the ioo command, and the newer aioo command is removed. For additional information, refer to IBM AIX Version 6.1 Differences Guide, SG24-7559, at:
http://www.redbooks.ibm.com/abstracts/sg247559.html?Open
Number of threads
The workerThreads parameter controls the maximum number of concurrent file operations at any instant. The recommended value is the same as the value of maxservers in AIX 5.3. There is a limit of 550 threads.
The prefetchThreads parameter controls the maximum number of threads dedicated to prefetching data for files that are read sequentially, or to handling sequential write behind. For Oracle RAC, set this value to 548. There is a limit of 550 threads.
maxMBpS
Increase the maxMBpS to 80% of the total bandwidth for all HBAs in a single host. The default value is 150 MB/s.
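These GPFS settings are cluster configuration parameters. As an illustrative sketch (the values are examples only, and the parameter names follow the text above), they are typically changed with mmchconfig; depending on the parameter, the change may only take effect after the GPFS daemon is restarted:
   mmchconfig maxMBpS=400
   mmchconfig prefetchThreads=548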
maxblocksize
Configure the GPFS blocksize (maxblocksize) to match the application's I/O size, the RAID stripe size, or a multiple of the RAID stripe size. For example, if you use an Oracle database, it is better to choose a value that matches the product of the DB_BLOCK_SIZE and DB_FILE_MULTIBLOCK_READ_COUNT parameters. If the application does a lot of sequential I/O, it is better to configure a blocksize from 8 to 16 MB to take advantage of the sequential prefetching algorithm on the DS8000.
For additional information, consult the following links:
GPFS manuals:
http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp?topic=/com.ibm.cluster.gpfs.doc/gpfsbooks.html
Tuning considerations in the Concepts, Planning, and Installation Guide V3.2.1, GA76-0413-02:
http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/topic/com.ibm.cluster.gpfs321.install.doc/bl1ins_tuning.html
Deploying Oracle 10g RAC on AIX V5 with GPFS, SG24-7541:
http://www.redbooks.ibm.com/abstracts/sg247541.html?Open
Configuration and Tuning GPFS for Digital Media Environments, SG24-6700:
http://www.redbooks.ibm.com/abstracts/sg246700.html?Open
In Figure 11-2, the DS8000 LUNs that are under the control of the LVM are called physical volumes (PVs). The LVM splits the disk space into smaller pieces called physical partitions (PPs). A logical volume (LV) is composed of several logical partitions (LPs), and each LP can point to up to three corresponding PPs; this ability to point a single LP to multiple PPs is how LVM implements mirroring (RAID 1). A filesystem can be mounted over an LV, or the LV can be used as a raw device.
To set up the volume layout with DS8000 LUNs, you can adopt one of the following strategies:
Storage Pool Striping (SPS): You spread the workload at the storage level. At the operating system level, you just need to create the LVs with the inter-policy attribute set to minimum, which is the default option when creating an LV.
PP Striping: A set of LUNs is created on different ranks inside the DS8000. When the LUNs are recognized in AIX, a volume group (VG) is created and the LVs are spread evenly over the LUNs by setting the inter-policy to maximum, which is the most common method used to distribute the workload. The advantage of this method compared to SPS is the granularity of the data spread over the LUNs: with SPS, the data is spread in chunks of 1 GB, while in a VG you can create PP sizes from 8 MB to 16 MB. The advantage compared to LVM Striping is that you have more flexibility to manage the LVs, such as adding more disks and redistributing the LVs evenly across all disks by reorganizing the VG.
LVM Striping: As in PP Striping, a set of LUNs is created on different ranks inside the DS8000. After the LUNs are recognized in AIX, a VG is created with larger PP sizes, such as 128 MB or 256 MB, and the LVs are spread evenly over the LUNs by setting the stripe size of the LV from 8 MB to 16 MB.
From a performance standpoint, LVM Striping and PP Striping provide the same performance. You might see an advantage in a scenario of HACMP with LVM Cross-Site and VGs of 1 TB or more when you perform cluster verification, or you might see that operations related to creating, modifying, or deleting LVs are faster.
PP Striping
Figure 11-3 shows an example of PP Striping. The volume group contains four LUNs and has created 16 MB physical partitions on the LUNs. The logical volume in this example is composed of a group of 16 MB physical partitions from four logical disks: hdisk4, hdisk5, hdisk6, and hdisk7.
Figure 11-3 PP Striping. In this figure, vpath0, vpath1, vpath2, and vpath3 are hardware-striped LUNs on different DS8000 Extent Pools; 8 GB LUNs with 16 MB partitions give approximately 500 physical partitions per LUN (pp1-pp500); /dev/inter-disk_lv is made up of eight logical partitions (lp1 + lp2 + lp3 + lp4 + lp5 + lp6 + lp7 + lp8) = 8 x 16 MB = 128 MB.
The first step is to create a volume group. We recommend that you create a VG with a set of DS8000 LUNs where each LUN is located in a separate Extent Pool. If you are going to add a new set of LUNs to a host, define another VG, and so on. To create the VG data01vg with a PP size of 16 MB, execute the following command:
mkvg -S -s 16 -y data01vg hdisk4 hdisk5 hdisk6 hdisk7
Note: To create the volume group, if you use SDD, you use the mkvg4vp command, and if you use SDDPCM, you use the mkvg command. All the flags for the mkvg command apply to the mkvg4vp command.
After you create the VG, the next step is to create the LVs. For a VG with four disks (LUNs), we recommend that you create the LVs as a multiple of the number of disks in the VG times the PP size; in our case, we create the LVs in multiples of 64 MB. You implement PP Striping by using the option -e x. To create an LV of 1 GB, execute the following command:
mklv -e x -t jfs2 -y inter-disk_lv data01vg 64 hdisk4 hdisk5 hdisk6 hdisk7
Preferably, use inline logs for JFS2 logical volumes, because then there is one log for every filesystem and it is automatically sized. Having one log per filesystem improves performance, because it avoids serialization of access when multiple filesystems make metadata changes. The disadvantage of inline logs is that they cannot be monitored for I/O rates, which can provide an indication of the rate of metadata changes for a filesystem.
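A minimal sketch of creating a JFS2 filesystem with an inline log on the LV from the previous command (the mount point is hypothetical):
   crfs -v jfs2 -d inter-disk_lv -m /interdiskfs -A yes -a logname=INLINE
   mount /interdiskfs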
LVM Striping
Figure 11-4 shows an example of a striped logical volume. The logical volume called /dev/striped_lv uses the same capacity as /dev/inter-disk_lv (shown in Figure 11-3), but it is created differently.
Figure 11-4 LVM Striping. In this figure, hdisk4, hdisk5, hdisk6, and hdisk7 are hardware-striped LUNs on different DS8000 Extent Pools; 8 GB LUNs with 256 MB partitions give approximately 32 physical partitions per LUN (pp1-pp32); /dev/striped_lv is made up of eight logical partitions (8 x 256 MB = 2 GB); each logical partition is subdivided into equal-sized chunks (only three chunks are shown per logical partition); /dev/striped_lv = lp1.1 + lp2.1 + lp3.1 + lp4.1 + lp1.2 + lp2.2 + lp3.2 + lp4.2 + lp5.1, and so on.
Notice that /dev/striped_lv is also made up of eight 256 MB physical partitions, but each partition is then subdivided into 32 chunks of 8 MB; only three of the 8 MB chunks are shown per logical partition for space reasons.
Again, the first step is to create a VG. To create a VG for LVM Striping, execute the following command:
mkvg -S -s 256 -y data01vg hdisk4 hdisk5 hdisk6 hdisk7
To create a striped LV, you need to combine the following options:
Stripe width (-C): Sets the maximum number of disks across which to spread the data. The default value is taken from the upperbound option.
Copies (-c): Only required when you create mirrors. You can set from 1 to 3 copies. The default value is 1.
Strict allocation policy (-s): Only required when you create mirrors; it is then necessary to use the value s (superstrict).
Stripe size (-S): Sets the size of a chunk of a sliced PP. Since AIX 5.3, the valid values include 4K, 8K, 16K, 32K, 64K, 128K, 256K, 512K, 1M, 2M, 4M, 8M, 16M, 32M, 64M, and 128M.
Upperbound (-u): Sets the maximum number of disks for a new allocation. If you set the allocation policy to superstrict, the upperbound value must be the result of the stripe width times the number of copies that you want to create.
Important: Do not set the option -e with LVM Striping.
Execute the following command to create a striped LV:
mklv -C 4 -c 1 -s s -S 8M -t jfs2 -u 4 -y striped_lv data01vg 4 hdisk4 hdisk5 hdisk6 hdisk7
AIX 5.3 introduced a new feature, the striped column, which allows you to extend an LV onto a new set of disks after the current disks where the LV is spread are full.
Memory buffers
Adjust the LVM memory buffers (pv_min_pbuf) to increase performance. Set pv_min_pbuf to 1568 for AIX 5.3.
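A minimal sketch of how this is typically set (pv_min_pbuf is a system-wide value; the per-VG pv_pbuf_count shown later with lvmo can also be raised; the VG name and values are illustrative):
   ioo -p -o pv_min_pbuf=1568
   lvmo -v data01vg -o pv_pbuf_count=1024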
Scheduling policy
If you have a dual-site cluster solution using PowerHA with LVM Cross-Site, you can reduce the link requirements among the sites by changing the scheduling policy of each LV to parallel write/sequential read (ps). You must remember that the first copy of the mirror needs to point to the local storage.
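As a sketch, the scheduling policy of an existing mirrored LV can be changed with chlv (the LV name is hypothetical; ps stands for parallel write/sequential read):
   chlv -d ps mirrlv01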
If you use a multipathing solution with Virtual I/O Server (VIOS), use MPIO. There are several limitations when using SDD with VIOS. Refer to 11.2.10, Virtual I/O Server (VIOS) on page 323 and the VIOS support site for additional information: http://www14.software.ibm.com/webapp/set2/sas/f/vios/documentation/datasheet.ht ml#multipath
11.2.9 FC adapters
FC adapters or host bus adapters (HBAs) provide the connection between the host and the storage devices. There are three important parameters that we recommend that you configure:
num_cmd_elems: Sets the maximum number of commands to queue to the adapter. When a large number of supported storage devices are configured, you can increase this attribute to improve performance. The default value is 200. The maximum values are:
LP10000 adapters: 2048
LP9000 adapters: 2048
LP7000 adapters: 1024
dyntrk: Beginning with AIX 5.2 TL1, the AIX Fibre Channel (FC) driver supports FC dynamic device tracking, which enables dynamically changing FC cable connections on switch ports or on supported storage ports without unconfiguring and reconfiguring the hdisk and SDD vpath devices. Note: The disconnected cable must be reconnected within 15 seconds.
fc_err_recov: Beginning with AIX 5.1 and AIX 5.2 TL02, the fc_err_recov attribute enables fast failover during error recovery. Enabling this attribute can reduce the amount of time that the AIX disk driver takes to fail I/O in certain conditions and, therefore, reduce the overall error recovery time. The default value for fc_err_recov is delayed_fail. To enable FC adapter fast failover, change the value to fast_fail.
Note: Only change the attributes fc_err_recov to fast_fail and dyntrk to yes if you use a multipathing solution with more than one path.
Example 11-1 shows the output of the attributes of an fcs device.
Example 11-1 Output of an fcs device
# lsattr -El fcs0
bus_intr_lvl   8673        Bus interrupt level                                  False
bus_io_addr    0xffc00     Bus I/O address                                      False
bus_mem_addr   0xfffbf000  Bus memory address                                   False
init_link      al          INIT Link flags                                      True
intr_priority  3           Interrupt priority                                   False
lg_term_dma    0x800000    Long term DMA                                        True
max_xfer_size  0x100000    Maximum Transfer Size                                True
num_cmd_elems  ...         Maximum number of COMMANDS to queue to the adapter   True
pref_alpa      ...         Preferred AL_PA                                      True
sw_fc_class    ...         FC Class for Fabric                                  True
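A sketch of how these attributes are typically changed (the adapter instance numbers are examples; -P defers the change until the devices can be reconfigured or the system is rebooted, because the adapter is usually busy):
   chdev -l fcs0 -a num_cmd_elems=1024 -P
   chdev -l fscsi0 -a dyntrk=yes -a fc_err_recov=fast_fail -P
Note that num_cmd_elems belongs to the fcsX device, while dyntrk and fc_err_recov belong to the corresponding fscsiX device.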
For additional information, consult the SDD User's Guide at the following link:
http://www-01.ibm.com/support/docview.wss?rs=540&context=ST52G7&uid=ssg1S7000303
When you assign several LUNs from the DS8000 to the VIOS and then map those LUNs to the LPAR clients, over time trivial activities, such as upgrading the SDDPCM device driver, can become somewhat challenging. To ease the complexity, we created two scripts: the first script generates a list of the mappings between the LUNs and the LPAR clients; the second script, based on that output, creates the commands needed to recreate those mappings. The scripts are available in Appendix C, UNIX shell scripts on page 587.
For additional information about VIOS, refer to the following links:
Introduction to PowerVM Editions on IBM p5 Servers:
http://www.redbooks.ibm.com/redpieces/abstracts/sg247940.html
IBM System p PowerVM Editions Best Practices:
http://www.redbooks.ibm.com/abstracts/redp4194.html
Virtual I/O Server and Integrated Virtualization Manager command descriptions:
http://publib.boulder.ibm.com/infocenter/systems/scope/hw/topic/iphcg/iphcg.pdf
Also, check the VIOS frequently asked questions (FAQs), which explain in more detail several restrictions and limitations, such as the lack of the load balancing feature for AIX VSCSI MPIO devices:
http://www14.software.ibm.com/webapp/set2/sas/f/vios/documentation/faq.html
Performance recommendations
Here are our performance recommendations when configuring Virtual SCSI:
CPU:
Typical entitlement is .25 Virtual CPU of 2
Always run uncapped
Run at higher priority (weight factor >128)
More CPU power with high network loads
Memory:
Typically >= 1 GB (at least 512 MB of memory is required; the minimum is 512 MB + 4 MB per hdisk)
Add more memory if there are extremely high device (vscsi and hdisk) counts
Small LUNs drive up the memory requirements
For multipathing with VIOS, check the configuration of the following parameters:
fscsi devices on the VIOS:
The attribute fc_err_recov is set to fast_fail
The attribute dyntrk is set to yes with the command chdev -l fscsiX -a dyntrk=yes
hdisk devices on the VIOS:
The attribute algorithm is set to load_balance
The attribute reserve_policy is set to no_reserve
The attribute hcheck_mode is set to nonactive
The attribute hcheck_interval is set to 20
Virtual SCSI devices on the client LPAR:
The attribute vscsi_path_to is set to 30
The attribute algorithm is set to failover
The attribute reserve_policy is set to no_reserve
The attribute hcheck_mode is set to nonactive
The attribute hcheck_interval is set to 20
Note: Only change the reserve_policy parameter to no_reserve if you are going to map the LUNs of DS8000 directly to the client LPAR. For additional information, refer to the following link: http://publib.boulder.ibm.com/infocenter/systems/scope/hw/index.jsp?topic=/iphb1/i phb1_vios_planning_vscsi_sizing.htm
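A sketch of setting these attributes (device names are examples). From the VIOS restricted shell (padmin), -perm defers the change until the device is next reconfigured:
   chdev -dev fscsi0 -attr fc_err_recov=fast_fail dyntrk=yes -perm
   chdev -dev hdisk2 -attr reserve_policy=no_reserve hcheck_interval=20 -perm
On the client LPAR, the standard AIX chdev form applies:
   chdev -l hdisk0 -a hcheck_interval=20 -P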
In an I/O-bound system, look for:
A high I/O wait percentage, shown in the cpu column under the wa sub-column. Example 11-3 shows that a majority of CPU cycles are waiting for I/O operations to complete.
A high number of blocked processes, shown in the kthr column under the sub-columns b and p, which are the wait queue (b) and the wait queue for raw devices (p) respectively. A high number of blocked processes normally indicates I/O contention among the processes.
Paging activity, seen under the page column. High file paging values (the fi and fo columns when vmstat is run with the -I option) indicate intensive file caching activity.
Example 11-4 shows you another option that you can use, vmstat -v, from which you can determine whether blocked I/Os are due to a shortage of buffers.
Example 11-4 The vmstat -v utility output for filesystem buffer activity analysis
[root@p520-tic-3]# vmstat -v | tail -7
                    0 pending disk I/Os blocked with no pbuf
                    0 paging space I/Os blocked with no psbuf
                 2484 filesystem I/Os blocked with no fsbuf
                    0 client filesystem I/Os blocked with no fsbuf
                    0 external pager filesystem I/Os blocked with no fsbuf
                    0 Virtualized Partition Memory Page Faults
                 0.00 Time resolving virtualized partition memory page faults
[root@p520-tic-3]#
In Example 11-4, notice that filesystem buffer (fsbuf) and LVM buffer (pbuf) space are used to hold the I/O requests at the filesystem and LVM level respectively. If a substantial number of I/Os are blocked due to insufficient buffer space, both buffers can be increased using the ioo command, but an excessively large value can result in overall poor system performance. Hence, we suggest that you increase the buffers incrementally and monitor the system performance after each increase. For the best practice values, consult the application papers listed under AIX filesystem caching on page 312. Using lvmo, you can also check whether contention is happening due to a lack of LVM memory buffers, as illustrated in Example 11-5.
Example 11-5 Output of lvmo -a [root@p520-tic-3]# lvmo -a -v rootvg vgname = rootvg pv_pbuf_count = 512 total_vg_pbufs = 1024 max_vg_pbuf_count = 16384 pervg_blocked_io_count = 0 pv_min_pbuf = 512 global_blocked_io_count = 0 [root@p520-tic-3]#
As you can see in Example 11-5, there are two incremental counters: pervg_blocked_io_count and global_blocked_io_count. The first counter indicates how many times an I/O was blocked because of a lack of LVM pinned memory buffers (pbufs) on that VG. The second counter indicates how many times an I/O was blocked due to a lack of LVM pbufs in the whole operating system. Other indicators of an I/O-bound system can be seen in the disk xfer part of the vmstat output when it is run against physical disks, as shown in Example 11-6.
Example 11-6 Output of vmstat for disk xfer
# vmstat hdisk0 hdisk1 1 8
kthr    memory              page               faults          cpu        disk xfer
----- ------------ ------------------------ -------------- ------------ -----------
 r  b   avm    fre  re  pi  po  fr  sr  cy   in    sy   cs  us sy id wa   1   2 3 4
 0  0  3456  27743   0   0   0   0   0   0  131   149   28   0  1 99  0   0   0
 0  0  3456  27743   0   0   0   0   0   0  131    77   30   0  1 99  0   0   0
 1  0  3498  27152   0   0   0   0   0   0  153  1088   35   1 10 87  2   0  11
 0  1  3499  26543   0   0   0   0   0   0  199  1530   38   1 19  0 80   0  59
The disk xfer part provides the number of transfers per second to the specified physical volumes that occurred in the sample interval. This count does not imply an amount of data that was read or written. You can consult the man page of vmstat for additional details: http://publib.boulder.ibm.com/infocenter/systems/scope/aix/topic/com.ibm.aix.cmds/ doc/aixcmds6/vmstat.htm?tocNode=int_216867
11.3.2 pstat
The pstat command counts how many legacy asynchronous I/O servers are being used on the server. There are two asynchronous I/O (AIO) subsystems: legacy AIO and POSIX-compliant AIO. In AIX Version 5.3, you can use the command pstat -a | grep aioserver | wc -l to get the number of legacy AIO servers that are running, and the command pstat -a | grep posix_aioserver | wc -l to see the number of POSIX AIO servers.
Example 11-7 pstat -a output to measure the legacy AIO activity [root@p520-tic-3]# pstat -a | grep aioserver | wc -l 0 [root@p520-tic-3]#
Note: If you use raw devices, you have to use ps -k instead of pstat -a to measure the legacy AIO activity. Example 11-7 shows that the host does not have any AIO servers running. This function is not enabled by default. You can enable this function with mkdev -l aio0 or by using SMIT. For Posix AIO, substitute posix_aio for aio0. In AIX Version 6, both AIO subsystems are loaded by default but will be activated only when an AIO request is initiated by the application. Use the command pstat -a | grep aio to see the AIO subsystems loaded, as shown in Example 11-8.
Example 11-8 pstat -a output to show the AIO subsystems defined in AIX 6
[root@p520-tic-3]# pstat -a | grep aio
 18 a   1207c      1  1207c     0     0     1  aioLpool
 33 a   2104c      1  2104c     0     0     1  aioPpool
[root@p520-tic-3]#
In AIX Version 6, you can use the new ioo tunables to show whether the AIO is being used. An illustration is given in Example 11-9 on page 329.
Example 11-9 ioo -a output to show the AIO subsystem activity in AIX 6
[root@p520-tic-3]# ioo -a | grep aio
                     aio_active = 0
                    aio_maxreqs = 65536
                 aio_maxservers = 30
                 aio_minservers = 3
          aio_server_inactivity = 300
               posix_aio_active = 0
              posix_aio_maxreqs = 65536
           posix_aio_maxservers = 30
           posix_aio_minservers = 3
    posix_aio_server_inactivity = 300
[root@p520-tic-3]#
From Example 11-9, aio_active and posix_aio_active show whether the respective AIO subsystem is being used. The parameters aio_server_inactivity and posix_aio_server_inactivity show how long an AIO server sleeps without servicing an I/O request. To check the asynchronous I/O configuration in AIX 5.3, use the command shown in Example 11-10.
Example 11-10 lsattr -El aio0 output to list the configuration of legacy AIO
[root@p520-tic-3]# lsattr -El aio0
autoconfig defined  STATE to be configured at system restart  True
fastpath   enable   State of fast path                        True
kprocprio  39       Server PRIORITY                           True
maxreqs    4096     Maximum number of REQUESTS                True
maxservers 10       MAXIMUM number of servers per cpu         True
minservers 1        MINIMUM number of servers                 True
[root@p520-tic-3]#
Notes: If your AIX 5.3 is between TL05 and TL08, you can also use the aioo command to list and increase the values of maxservers, minservers, and maxreqs. If you use AIX Version 6, there are no more Asynchronous I/O devices in ODM, and the command aioo has been removed. You must use the ioo command to change them. The general rule is to monitor the I/O wait using the vmstat command. If the I/O wait is more than 25%, consider enabling AIO, which reduces the IO wait but does not help disks that are overly busy. You can monitor busy disks by using iostat, which we explain in the next section.
11.3.3 iostat
The iostat command reports I/O statistics for disks and adapters. The option -a gets the adapter-level details, the option -D gets the disk-level details, and the option -R resets the min* and max* values at each interval. Refer to Example 11-11.
Example 11-11 Disk-level and adapter-level details using iostat -aDR
[root@p520-tic-3]# iostat -aDR 1 1

System configuration: lcpu=2 drives=1 paths=1 vdisks=1 tapes=0

Vadapter:  vscsi0
  xfer:      Kbps      tps   bkread   bkwrtn   partition-id
             29.7      3.6      2.8      0.8              0
  read:       rps  avgserv  minserv  maxserv
              0.0    48.2S      1.6     25.1
  write:      wps  avgserv  minserv  maxserv
          30402.8      0.0      2.1     52.8
  queue:  avgtime  mintime  maxtime  avgwqsz  avgsqsz   sqfull
              0.0      0.0      0.0      0.0      0.0      0.0

Paths/Disks:  hdisk0
  xfer:       tps    bread    bwrtn
              3.6    23.7K     6.7K
  read:   minserv  maxserv timeouts
              1.6     25.1        0
  write:  minserv  maxserv timeouts
              2.1     52.8        0
  queue:  maxtime  avgwqsz  avgsqsz
             34.4      0.0      0.0
When analyzing the output of iostat:
Check whether the number of I/Os is balanced among the disks. If not, it might indicate a problem in the distribution of PPs over the LUNs. With the information provided by lvmstat or filemon, pick the most active LV and, with the lslv -m command, check whether its PPs are distributed evenly among the disks of the VG. If not, check whether the inter-policy attribute of the LVs is set to maximum. If the PPs are not distributed evenly and the LV inter-policy attribute is set to minimum, change the attribute to maximum and reorganize the VG.
Check in the read section whether avgserv is larger than 15 ms. That might indicate that the bottleneck is in a lower layer, which can be the HBA, the SAN, or even the storage. Also, check whether the same problem occurs with other disks of the same VG. If yes, add up the number of I/Os per second and the throughput by vpath (if applicable), rank, and host, and compare them with the performance numbers from TotalStorage Productivity Center for Disk.
Check in the write section whether avgserv is larger than 3 ms. Writes averaging significantly and consistently higher indicate that the write cache is full and there is a bottleneck in the disk.
Check in the queue section whether avgwqsz is larger than avgsqsz, and compare with other disks in the storage. Check whether the PPs are distributed evenly across all disks in the VG. If avgwqsz is smaller than avgsqsz, compare with other disks in the storage; if there are differences and the PPs are distributed evenly in the VG, it might indicate that the imbalance is at the rank level.
The following example shows how multipathing needs to be considered to interpret the iostat output.
In this example, a server has two Fibre Channel adapters and is zoned so that it uses four paths to the DS8000. In order to determine the I/O statistics for vpath0 for the example given in Figure 11-6, you need to add up the iostats for hdisk1 - hdisk4. One way to find out which disk devices make a vpath is to use the datapath query essmap command that is included with SDD. Tip: When using iostat on a server that is running SDD with multiple attachments to the DS8000, each disk device is really just a single path to the same logical disk (LUN) on the DS8000. To understand how busy a logical disk is, you need to add up iostats for each disk device making up a vpath.
Figure 11-6 A vpath to a DS8000 LUN: I/O calls from the OS use vpath0 from SDD (Subsystem Device Driver), which provides load balancing and failover over paths through Enclosure 0 and Enclosure 1 to DS8000 LUN 1; iostat reports on the individual hdisk paths.
Another way is shown in Example 11-12. The command, datapath query device 0, lists the paths (hdisks) to vpath0. In this example, the logical disk on the DS8000 has LUN serial number 75065513000. The disk devices presented to the operating system are hdisk4, hdisk12, hdisk20, and hdisk28, so we can add up the iostats for these four hdisk devices to see how busy vpath0 is.
Example 11-12 The datapath query device command {CCF-part2:root}/ -> datapath query device 0 DEV#: 0 DEVICE NAME: vpath0 TYPE: 2107900 POLICY: Optimized SERIAL: 75065513000 ========================================================================== Path# Adapter/Hard Disk State Mode Select Errors 0 fscsi0/hdisk4 OPEN NORMAL 155 0 1 fscsi0/hdisk12 OPEN NORMAL 151 0 2 fscsi1/hdisk20 OPEN NORMAL 144 0 3 fscsi1/hdisk28 OPEN NORMAL 131 0
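A sketch of pulling the hdisk names out of the datapath output and then watching their combined activity (the awk field position is assumed from the output format shown in Example 11-12, and the interval and count values are arbitrary):
   datapath query device 0 | awk '$2 ~ /hdisk/ {split($2,a,"/"); print a[2]}'
   iostat -d hdisk4 hdisk12 hdisk20 hdisk28 5 3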
The option shown in Example 11-13 provides details in a record format, which can be used to sum up the disk activity.
Example 11-13 Output of iostat -alDRT
[root@p520-tic-3]# iostat -alDRT 1 5

System configuration: lcpu=6 drives=32 paths=32 vdisks=0 tapes=0

Adapter:           xfers                                       time
                     bps     tps    bread    bwrtn
fcs1                 0.0     0.0      0.0      0.0             15:43:22

Disks:     xfers (%tm_act bps tps bread bwrtn)  read/write (rps/wps avgserv minserv maxserv timeouts fails)  queue (avgtime mintime maxtime avgwqsz avgsqsz sqfull)  time
hdisk10   0.0  0.0  0.0  0.0  0.0   0.0  0.0  0.0  0.0  0  0   0.0  0.0  0.0  0.0  0.0  0.0   15:43:22
hdisk18   0.0  0.0  0.0  0.0  0.0   0.0  0.0  0.0  0.0  0  0   0.0  0.0  0.0  0.0  0.0  0.0   15:43:22
hdisk9    0.0  0.0  0.0  0.0  0.0   0.0  0.0  0.0  0.0  0  0   0.0  0.0  0.0  0.0  0.0  0.0   15:43:22
hdisk5    0.0  0.0  0.0  0.0  0.0   0.0  0.0  0.0  0.0  0  0   0.0  0.0  0.0  0.0  0.0  0.0   15:43:22
hdisk15   0.0  0.0  0.0  0.0  0.0   0.0  0.0  0.0  0.0  0  0   0.0  0.0  0.0  0.0  0.0  0.0   15:43:22
[root@p520-tic-3]#
It is not unusual to see a device reported by iostat as 90% to 100% busy, because a DS8000 volume that is spread across an array of multiple disks can sustain a much higher I/O rate than a single physical disk. A device being 100% busy is generally a problem for a single device, but it is probably not a problem for a RAID 5 device.
Asynchronous I/O can be further monitored through iostat -A for legacy AIO and iostat -P for POSIX AIO. Because the asynchronous I/O queues are assigned by filesystem, it is more interesting to measure the queues per filesystem: if you have several instances of the same application, where each instance uses its own set of filesystems, you can see which instances are consuming more resources. Execute the iostat -AQ command to see legacy AIO activity by filesystem, which is shown in Example 11-14. Similarly, for POSIX-compliant AIO statistics, use iostat -PQ.
Example 11-14 iostat -AQ output to measure legacy AIO activity by filesystem
[root@p520-tic-3]# iostat -AQ 1 2

System configuration: lcpu=4

aio: avgc avfc maxg maif  maxr   avg-cpu: % user % sys % idle % iowait
        0    0    0    0 16384              0.0   0.1   99.9      0.0

Queue#   Count   Filesystems
129          0   /
130          0   /usr
132          0   /var
133          0   /tmp
136          0   /home
137          0   /proc
138          0   /opt

aio: avgc avfc maxg maif  maxr   avg-cpu: % user % sys % idle % iowait
        0    0    0    0 16384              0.0   0.1   99.9      0.0

Queue#   Count   Filesystems
129          0   /
130          0   /usr
132          0   /var
133          0   /tmp
136          0   /home
137          0   /proc
138          0   /opt
Refer to the paper Oracle Architecture and Tuning on AIX (v1.2) listed under 11.2, AIX disk I/O components on page 311 for recommended configuration values for legacy AIO. This paper lists several considerations about the implementation of legacy AIO in AIX 5.3 as well as in AIX 6.1. If your AIX system is in a SAN environment, you might have so many hdisks that iostat will not give you too much information. We recommend using nmon, which can report iostats based on vpaths or ranks, as discussed in Interactive nmon options for DS8000 performance monitoring on page 336. For detailed information about the implementation of asynchronous I/O statistics in iostat, consult AIX 5L Differences Guide Version 5.3 Edition, SG24-7463. Refer to section 6.3 Asynchronous I/O statistics: http://www.redbooks.ibm.com/abstracts/sg247463.html?Open Or, consult the iostat man pages: http://publib.boulder.ibm.com/infocenter/systems/scope/aix/index.jsp?topic=/com.ib m.aix.cmds/doc/aixcmds3/iostat.htm&tocNode=int_215801
11.3.4 lvmstat
The lvmstat command reports input and output statistics for logical partitions, logical volumes, and volume groups. It is useful in determining the I/O rates to LVM volume groups, logical volumes, and logical partitions, and for dealing with unbalanced I/O situations where the data layout was not considered initially.
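A brief sketch of typical usage (the VG and LV names reuse the examples from earlier in this chapter; statistics collection must be enabled per volume group before lvmstat reports anything):
   lvmstat -v data01vg -e
   lvmstat -v data01vg 5 3
   lvmstat -l inter-disk_lv 5 3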
The lvmstat tool has powerful options, such as reporting on a specific logical volume or only reporting busy logical volumes in a volume group. For additional information about usage, check the following links:
AIX 5L Differences Guide Version 5.2 Edition, SG24-5765:
http://www.redbooks.ibm.com/abstracts/sg245765.html?Open
AIX 5L Performance Tools Handbook, SG24-6039:
http://www.redbooks.ibm.com/abstracts/sg246039.html?Open
The man page of lvmstat:
http://publib.boulder.ibm.com/infocenter/systems/scope/aix/index.jsp?topic=/com.ibm.aix.cmds/doc/aixcmds3/lvmstat.htm&tocNode=int_215986
11.3.5 topas
The interactive AIX tool topas is convenient if you want to get a quick overall view of the system's current activity. A fast snapshot of memory usage or user activity can be a helpful starting point for further investigation. Figure 11-7 contains a sample topas output.
With AIX 6.1, the topas monitor has enhanced monitoring capabilities and now also provides I/O statistics for filesystems:
Enter ff (the first f turns it off, the next f expands it) to expand the filesystem I/O statistics.
Type F to get an exclusive and even more detailed view of the filesystem I/O statistics.
Expanded disk I/O statistics can be obtained by typing dd or D in the topas initial window.
Consult the topas manual page for more details:
http://publib.boulder.ibm.com/infocenter/systems/scope/aix/index.jsp?topic=/com.ibm.aix.cmds/doc/aixcmds5/topas.htm&tocNode=int_123659
11.3.6 nmon
The nmon tool and analyzer for AIX and Linux is a great storage performance analysis resource, and it is free. It was written by Nigel Griffiths who works for IBM in the United Kingdom. We use this tool, among others, when we perform client benchmarks. It is available at: http://www.ibm.com/developerworks/eserver/articles/analyze_aix/ Note: The nmon tool is not formally supported. No warranty is given or implied, and you cannot obtain help or maintenance from IBM. The nmon tool currently comes in two versions to run on different levels of AIX: The nmon Version 12e for AIX 5L and AIX 6.1 The nmon Version 9 for AIX 4.X. This version is functionally established and will not be developed further. The interactive nmon tool is similar to monitor or topas, which you might have used before to monitor AIX, but it offers many more features that are useful for monitoring DS8000 performance. We will explore these interactive options. Unlike topas, the nmon tool can also record data that can be used to establish a baseline of performance for comparison later. Recorded data can be saved in a file and imported into the nmon analyzer (a spreadsheet format) for easy analysis and graphing.
(nmon interactive screen: nmon version 12e build=5300-06 - written by Nigel Griffiths, [email protected]; see also http://www.ibm.com/collaboration/wiki/display/WikiPtype/nmon)
The nmon tool disk group performance
The nmon Version 10 tool has a feature called disk grouping. For example, you can create a disk group based on your AIX volume groups. First, you need to create a file that maps hdisks to nicknames. For example, you can create a map file like that shown in Example 11-19 on page 338.
Example 11-19 The nmon tool disk group mapping file
vi /tmp/vg-maps
rootvg hdisk0 hdisk1
6000vg hdisk2 hdisk3 hdisk4 hdisk5 hdisk6 hdisk7 hdisk8 hdisk9 hdisk10 hdisk11 hdisk12 hdisk13 hdisk14 hdisk15 hdisk16 hdisk17
8000vg hdisk26 hdisk27 hdisk28 hdisk29 hdisk30 hdisk31 hdisk32 hdisk33
Then, type nmon with the -g flag to point to the map file: nmon -g /tmp/vg-maps When nmon starts, press the G key to view statistics for your disk groups. An example of the output is shown in Example 11-20.
Example 11-20 The nmon tool disk-group output --nmon-v10p---N=NFS--------------Host=san5198b-------Refresh=1 secs---14:02.10----Disk-Group-I/O Name Disks AvgBusy Read|Write-KB/s TotalMB/s xfers/s R:W-SizeKB rootvg 2 0.0% 0.0|0.0 0.0 0.0 0.0 6000vg 16 45.4% 882.6|93800.1 92.5 2131.5 44.4 8000vg 8 95.3% 1108.7|118592.0 116.9 2680.7 44.7 Groups= 3 TOTALS 26 5.4% 1991.3|212392.2 209.4 4812.2
Notice that:
The nmon tool reports real-time iostats for different disk groups. In this case, the disk groups that we created are for volume groups.
You can create logical groupings of hdisks for any kind of group that you want. You can make multiple disk-group map files and start nmon -g <map-file> to report on different groups.
To enable nmon to report iostats based on ranks, you can make a disk-group map file listing ranks with the associated hdisk members. Use the SDD command datapath query essmap to provide a view of your host system's logical configuration on the DS8000 or DS6000. You can, for example, create an nmon disk group by storage type (DS8000 or DS6000), logical subsystem (LSS), rank, port, and so forth to give you unique views into your storage performance.
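As an illustrative sketch of a rank-based map file (the rank names and hdisk membership are invented for the example and would normally be taken from the datapath query essmap output):
   vi /tmp/rank-maps
   R16 hdisk2 hdisk3 hdisk4 hdisk5
   R17 hdisk6 hdisk7 hdisk8 hdisk9
   nmon -g /tmp/rank-maps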
Recording nmon information for import into the nmon analyzer tool
A great benefit that the nmon tool provides is the ability to collect data over time to a file and then import the file into the nmon analyzer tool, which is available at:
http://www.ibm.com/developerworks/aix/library/au-nmon_analyser/
To collect nmon data in comma-separated value (csv) file format for easy spreadsheet import:
1. Run nmon with the -f flag. Refer to nmon -h for the details, but as an example, to run nmon for an hour capturing data snapshots every 30 seconds, use:
nmon -f -s 30 -c 120
2. This command creates an output file in the current directory called <hostname>_date_time.nmon.
The nmon analyzer is a macro-customized Microsoft Excel spreadsheet. After transferring the output file to the machine running the nmon analyzer, simply start the nmon analyzer, enabling the macros, and click Analyze nmon data. You will be prompted to select your spreadsheet and then to save the results. Many spreadsheets have fixed numbers of columns and rows. We suggest that you collect up to a maximum of 300 snapshots to avoid experiencing these issues. When you capture data to a file, the nmon tool disconnects from the shell to ensure that it continues running even if you log out, which means that nmon can appear to fail, but it is still running in the background until the end of the analysis period.
11.3.7 fcstat
The fcstat command displays statistics from a specific Fibre Channel adapter. Example 11-21 shows the output of the fcstat command.
Example 11-21 The fcstat command output # fcstat fcs0 FIBRE CHANNEL STATISTICS REPORT: fcs0 skipping......... FC SCSI Adapter Driver Information No DMA Resource Count: 0 No Adapter Elements Count: 0 No Command Resource Count: 99023 skipping.........
The No Command Resource Count indicates how many times the num_cmd_elems value was exceeded since AIX was booted. You can keep taking snapshots every 3 to 5 minutes during a peak period to evaluate if you need to increase the value of num_cmd_elems. For additional information, consult the man pages of the fcstat command: http://publib.boulder.ibm.com/infocenter/systems/scope/aix/index.jsp?topic=/com.ib m.aix.cmds/doc/aixcmds2/fcstat.htm&tocNode=int_122561
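A simple sketch of taking such periodic snapshots during a peak period (the adapter name and the 5-minute interval are examples):
   while true
   do
     date
     fcstat fcs0 | grep "No Command Resource Count"
     sleep 300
   done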
11.3.8 filemon
The filemon command monitors a trace of filesystem and I/O system events, and reports performance statistics for files, virtual memory segments, logical volumes, and physical volumes. The filemon command is useful to individuals whose applications are believed to be disk-bound, and who want to know where and why. The filemon command provides a quick test to determine if there is an I/O problem by measuring the I/O service times for reads and writes at the disk and logical volume level. The filemon command resides in /usr/bin and is part of the bos.perf.tools file set, which can be installed from the AIX base installation media.
filemon measurements
To provide a more complete understanding of filesystem performance for an application, the filemon command monitors file and I/O activity at four levels:
Logical filesystem: The filemon command monitors logical I/O operations on logical files. The monitored operations include all read, write, open, and seek system calls, which might or might not result in actual physical I/O, depending on whether the files are already buffered in memory. I/O statistics are kept on a per-file basis.
Virtual memory system: The filemon command monitors physical I/O operations (that is, paging) between segments and their images on disk. I/O statistics are kept on a per-segment basis.
Logical volumes: The filemon command monitors I/O operations on logical volumes. I/O statistics are kept on a per-logical-volume basis.
Physical volumes: The filemon command monitors I/O operations on physical volumes. At this level, physical resource utilizations are obtained. I/O statistics are kept on a per-physical-volume basis.
filemon examples
A simple way to use filemon is to run the command shown in Example 11-22, which will: Run filemon for two minutes and stop the trace. Store output in /tmp/fmon.out. Just collect logical volume and physical volume output.
Example 11-22 Using filemon #filemon -o /tmp/fmon.out -T 500000 -PuvO lv,pv; sleep 120; trcstop
Note: To set the size of the buffer of option -T, in general, you can start with 2 MB per logical CPU. For additional information about filemon, check the man pages of filemon: http://publib.boulder.ibm.com/infocenter/systems/scope/aix/index.jsp?topic=/com.ib m.aix.cmds/doc/aixcmds2/filemon.htm&tocNode=int_215661 To produce sample output for filemon, we ran a sequential write test in the background, and started a filemon trace, as shown in Example 11-23. We used the lmktemp command to create a 2 GB file full of nulls while filemon gathered I/O statistics.
Example 11-23 Using filemon with a sequential write test cd /interdiskfs time lmktemp 2GBtest 2000M & filemon -o /tmp/fmon.out -T 500000 -PuvO lv,pv; sleep 120; trcstop
In Example 11-24 on page 341, we look at parts of the /tmp/fmon.out file. When analyzing the output from filemon, focus on the following points:
Most active physical volumes: Look for balanced I/O across disks. A lack of balance might be a data layout problem.
I/O service times at the physical volume layer:
Writes to cache that average less than 3 ms are good. Writes averaging significantly and consistently longer times indicate that the write cache is full, and there is a bottleneck in the disk.
Reads averaging less than 10 ms to 20 ms are good. The disk subsystem read cache hit rate affects this value considerably; higher read cache hit rates result in lower I/O service times, often near 5 ms or less. If reads average greater than 15 ms, it can indicate that something between the host and the disk is a bottleneck, though it usually indicates a bottleneck in the disk subsystem.
Look for consistent I/O service times across physical volumes. Inconsistent I/O service times can indicate unbalanced I/O or a data layout problem.
Longer I/O service times can be expected for I/Os that average greater than 64 KB in size.
The difference between the I/O service times at the logical volume and the physical volume layers: A significant difference indicates queuing or serialization in the AIX I/O stack.
The fields in the filemon report are (the field names correspond to the column headings shown in Example 11-24):
util: Utilization of the volume (fraction of time busy). The rows are sorted by this field, in decreasing order. The first number, 1.00, means 100 percent.
#rblk: Number of 512-byte blocks read from the volume.
#wblk: Number of 512-byte blocks written to the volume.
KB/s: Total transfer throughput in kilobytes per second.
volume: Name of the volume.
description: Contents of the volume; either a filesystem name or a logical volume type (jfs2, paging, jfslog, jfs2log, boot, or sysdump). Also indicates whether the filesystem is fragmented or compressed.
Example 11-24 The filemon most active logical volumes report
Thu Oct 6 21:59:52 2005
System: AIX CCF-part2  Node: 5  Machine: 00E033C44C00
Cpu utilization: 73.5%
Most Active Logical Volumes -----------------------------------------------------------------------util #rblk #wblk KB/s volume description -----------------------------------------------------------------------0.73 0 20902656 86706.2 /dev/305glv /interdiskfs 0.00 0 472 2.0 /dev/hd8 jfs2log 0.00 0 32 0.1 /dev/hd9var /var 0.00 0 16 0.1 /dev/hd4 / 0.00 0 104 0.4 /dev/jfs2log01 jfs2log Most Active Physical Volumes -----------------------------------------------------------------------util #rblk #wblk KB/s volume description -----------------------------------------------------------------------0.99 0 605952 2513.5 /dev/hdisk39 IBM FC 2107 0.99 0 704512 2922.4 /dev/hdisk55 IBM FC 2107 0.99 0 614144 2547.5 /dev/hdisk47 IBM FC 2107 0.99 0 684032 2837.4 /dev/hdisk63 IBM FC 2107 0.99 0 624640 2591.1 /dev/hdisk46 IBM FC 2107
0.99                            /dev/hdisk54
0.98                            /dev/hdisk38
skipping...........
------------------------------------------------------------------------
Detailed Logical Volume Stats (512 byte blocks)
------------------------------------------------------------------------
VOLUME: /dev/305glv  description: /interdiskfs
writes:                 81651   (0 errs)
  write sizes (blks):   avg 256.0  min 256  max 256  sdev 0.0
  write times (msec):   avg 1.816  min 1.501  max 2.409  sdev 0.276
  write sequences:      6
  write seq. lengths:   avg 3483776.0  min 423936  max 4095744  sdev 1368402.0
seeks:                  6 (0.0%)
  seek dist (blks):     init 78592, avg 4095744.0  min 4095744  max 4095744  sdev 0.0
time to next req(msec): avg 1.476  min 0.843  max 13398.588  sdev 56.493
throughput:             86706.2 KB/sec
utilization:            0.73
skipping...........
------------------------------------------------------------------------
Detailed Physical Volume Stats (512 byte blocks)
------------------------------------------------------------------------
VOLUME: /dev/hdisk39  description: IBM FC 2107
writes:                 2367    (0 errs)
  write sizes (blks):   avg 256.0  min 256
  write times (msec):   avg 1.934  min 0.002
  write sequences:      2361
  write seq. lengths:   avg 256.7  min 256
seeks:                  2361 (99.7%)
  seek dist (blks):     init 14251264, avg 1928.4  min 256  max 511232  sdev 23445.5
  seek dist (%tot blks):init 10.61802, avg 0.00144  min 0.00019  max 0.38090  sdev 0.01747
time to next req(msec): avg 50.666  min 1.843  max 14010.230  sdev 393.436
throughput:             2513.5 KB/sec
utilization:            0.99

VOLUME: /dev/hdisk55  description: IBM FC 2107
writes:                 2752    (0 errs)
  write sizes (blks):   avg 256.0  min 256
  write times (msec):   avg 1.473  min 0.507
  write sequences:      2575
  write seq. lengths:   avg 273.6  min 256
seeks:                  2575 (93.6%)
  seek dist (blks):     init 14252544, avg 1725.9  min 256  max 511232  sdev 22428.8
  seek dist (%tot blks):init 10.61897, avg 0.00129  min 0.00019  max 0.38090  sdev 0.01671
time to next req(msec): avg 43.573  min 0.844  max 14016.443  sdev 365.314
throughput:             2922.4 KB/sec
utilization:            0.99
skipping to end.....................
In the filemon output in Example 11-24 on page 341, we notice:
The most active logical volume is /dev/305glv (/interdiskfs); it is the busiest logical volume with an average data rate of 87 MBps.
The Detailed Logical Volume Stats show an average write time of 1.816 ms for /dev/305glv.
The Detailed Physical Volume Stats show an average write time of 1.934 ms for the busiest disk, /dev/hdisk39, and 1.473 ms for the next busiest disk, /dev/hdisk55.
The filemon command is a useful tool to determine where a host is spending I/O. More details about the filemon options and reports are available in AIX 5L Performance Tools Handbook, SG24-6039, which you can download from:
http://www.redbooks.ibm.com/abstracts/sg246039.html?Open
11.4.1 UFS
UFS is the standard filesystem of Solaris. You can configure a journaling feature, adjust cache filesystem parameters, implement Direct I/O, and adjust the mechanism of sequential read ahead. For more information about UFS, consult the following links:
Introduction to the Solaris filesystem:
http://www.solarisinternals.com/si/reading/sunworldonline/swol-05-1999/swol-05filesystem.html
http://www.solarisinternals.com/si/reading/fs2/fs2.html
http://www.solarisinternals.com/si/reading/sunworldonline/swol-07-1999/swol-07filesystem3.html
Tunable Parameters Reference Manual for Solaris 8:
http://docs.sun.com/app/docs/doc/816-0607
Tunable Parameters Reference Manual for Solaris 9:
http://docs.sun.com/app/docs/doc/806-7009
Tunable Parameters Reference Manual for Solaris 10:
http://docs.sun.com/app/docs/doc/817-0404
More information about Sun Solaris commands and tuning options is available from:
http://www.solarisinternals.com
Filesystem blocksize
The smallest allocation unit of a filesystem is the blocksize. In VxFS, you can choose from 512 bytes to 8192 bytes. To decide which size is best for your application, consider the average size of the application's files. If the application is a file server and the average file size is about 1 KB, choose a blocksize of 1 KB. But if the application is a database with a few large files, choose the maximum size of 8 KB. The default blocksize is 2 KB. In addition, when creating and allocating file space inside VxFS using standard tools, such as mkfile (Solaris only) or database commands, you might see performance degradation. For additional information, refer to:
http://seer.entsupport.symantec.com/docs/192660.htm
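A sketch of setting the blocksize when the filesystem is created (the disk group and volume names are hypothetical; bsize is specified in bytes):
   mkfs -F vxfs -o bsize=8192 /dev/vx/rdsk/datadg/datavol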
Quick I/O
Quick I/O is a licensed feature from Storage Foundation for Oracle, DB2, or Sybase databases. However, the binaries come in the VRTSvxfs package. It allows a database access to pre-allocated VxFS files as raw character devices. It combines the performance of a raw device with the convenience to manipulate files that a filesystem can provide. There is an interesting document describing the use of Quick I/O compared to other I/O access methods such as Direct I/O at: http://eval.veritas.com/webfiles/docs/qiowp.pdf
Veritas File System 5.0 - Administrator's Guide for HP-UX:
ftp://exftpp.symantec.com/pub/support/products/FileSystem_UNIX/283704.pdf
Veritas File System 5.0 - Administrator's Guide for Linux:
ftp://exftpp.symantec.com/pub/support/products/FileSystem_UNIX/283836.pdf
In addition, there is an IBM Redbooks publication about the VERITAS Storage Foundation Suite:
http://www.redbooks.ibm.com/abstracts/sg246619.html?Open
Storage pools
ZFS implements built-in features of volume management. You define a storage pool and add disks for that storage pool. It is not necessary to partition the disks and create filesystems on top of those partitions. Instead, you simply define the filesystems, and ZFS allocates disk space dynamically. ZFS does what virtual memory does by abstracting the real memory through a virtual memory address space. You also can implement quotas and reserve space for a specific filesystem. Note: ZFS is not supported with SDD, only with MPxIO.
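A minimal sketch of this model (the pool, filesystem, and device names are invented for the example): a pool is created directly on whole LUNs, filesystems are carved out of it dynamically, and quotas or reservations can then be applied per filesystem:
   zpool create datapool c2t0d0 c2t1d0
   zfs create datapool/app01
   zfs set quota=50G datapool/app01
   zfs set reservation=20G datapool/app01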
Another important detail is that every time ZFS issues a write to disk, it does not know whether the storage has nonvolatile random access memory (NVRAM), so it requests that the storage flush the data from cache to disk. If your Solaris release is 11/06 or later, you can disable the flush by setting zfs:zfs_nocacheflush=1 (for example, in /etc/system). Additional details are provided at the following link:
http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#FLUSH
Dynamic striping
With the storage pool model, after you add more disks to the storage pool, ZFS automatically redistributes the data among the disks.
Multiple blocksize
In ZFS, there is no need to define a blocksize for each filesystem; ZFS tries to match the blocksize to the application I/O size. However, if your application is a database, we recommend that you enforce a blocksize that matches the database blocksize. The parameter is recordsize and can range from 512 bytes to 128 KB. For example, to configure a blocksize of 8 KB, use the command zfs set recordsize=8192 <pool name>/<filesystem>.
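As a sketch (the dataset name is hypothetical), the property is applied per filesystem and affects newly written data:
   zfs set recordsize=8K datapool/oradata
   zfs get recordsize datapool/oradata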
Cache management
Cache management is implemented by a modified version of the Adaptive Replacement Cache (ARC) algorithm. By default, ZFS tries to use the real memory of the system as utilization increases and while there is free memory in the system. Therefore, if your application also uses a lot of memory, such as a database, you might need to limit the amount of memory available to the ZFS ARC. For detailed information and instructions about how to limit the ZFS ARC, check the following link:
http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#ARCSIZE
For more information about ZFS, consult the following links:
A presentation that introduces ZFS and its new features:
http://www.sun.com/software/solaris/zfs_lc_preso.pdf
A Web site with ZFS documentation and a link to a guide about ZFS best practices:
http://opensolaris.org/os/community/zfs/docs/
The ZFS Evil Tuning Guide, a Web site with the latest recommendations about how to tune ZFS:
http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide
A blog with performance recommendations for tuning ZFS with a database:
http://blogs.sun.com/realneel/entry/zfs_and_databases
The Solaris ZFS Administrator's Guide:
http://dlc.sun.com/pdf/817-2271/817-2271.pdf
For detailed information, consult the following links:
A paper explaining the advantages of moving from VxVM to SVM:
http://www.sun.com/software/whitepapers/solaris9/transition_volumemgr.pdf
A paper providing the performance best practices with SVM:
http://www.sun.com/blueprints/1103/817-4368.pdf
The Solaris Volume Manager Administration Guide - Solaris 10:
http://docs.sun.com/app/docs/doc/816-4520
The DS8000 LUNs that are under the control of VxVM are called VM Disks. You can split those VM Disks into smaller pieces called subdisks. A plex looks like a mirror and is composed of a set of subdisks; it is at the plex level that you can configure RAID 0, RAID 5, or simply concatenate the subdisks. A volume is composed of one or more plexes, and when you add more than one plex to a volume, you can implement RAID 1. On top of that volume, you can create a filesystem or simply use it as a raw device.
To set up the volume layout with DS8000 LUNs, you can adopt one of the following strategies:
Storage Pool Striping (SPS): You spread the workload at the storage level. At the operating system level, you just need to create the plexes with the layout attribute set to concat (the default option when creating a plex).
Striped Plex: A set of LUNs is created on different ranks inside the DS8000. After the LUNs are recognized in Solaris, a Disk Group (DG) is created, the plexes are spread evenly over the LUNs, and the stripe size of a plex is set from 8 MB to 16 MB.
RAID considerations
When using VxVM with DS8000 LUNs, spread the workload over several DS8000 LUNs by creating RAID 0 plexes. Base the stripe size on the I/O size of your application: if your application issues 1 MB I/Os, define the stripe size as 1 MB. If your application performs many sequential I/Os, it is better to configure stripe sizes of 4 MB or more to take advantage of the DS8000 prefetch algorithm. Refer to Chapter 11, Performance considerations with UNIX servers on page 307 for details about RAID configuration.
vxio:vol_maxio
When you use VxVM on DS8000 LUNs, set the VxVM maximum I/O size parameter (vol_maxio) to match the I/O size of your application or the stripe size of the VxVM RAID 0 layout. For example, if the I/O size of your application is 1 MB, edit /etc/system and add the entry set vxio:vol_maxio=2048. The value is in blocks of 512 bytes.
For detailed information about how to configure and use VxVM on several platforms, consult the following links:
Veritas Volume Manager 5.0 - Administrators Guide for Solaris:
ftp://exftpp.symantec.com/pub/support/products/Foundation_Suite/283916.pdf
Veritas Volume Manager 5.0 - Administrators Guide for AIX:
ftp://exftpp.symantec.com/pub/support/products/VolumeManager_UNIX/284310.pdf
Veritas Volume Manager 5.0 - Administrators Guide for HP-UX:
ftp://exftpp.symantec.com/pub/support/products/VolumeManager_UNIX/283742.pdf
Veritas Volume Manager 5.0 - Administrators Guide for Linux:
ftp://exftpp.symantec.com/pub/support/products/VolumeManager_UNIX/283835.pdf
Veritas Storage Foundation 5.0 - Administrators Guide for Windows 2000 Server and Windows Server 2003:
ftp://exftpp.symantec.com/pub/support/products/Storage_Foundation_for_Windows/286744.pdf
11.4.7 MPxIO
MPxIO is the multipathing device driver that comes with Solaris and is required when implementing Sun Cluster; a short enablement sketch follows the links below. For additional information, consult the following links:
A presentation providing an overview of MPxIO:
http://opensolaris.org/os/project/mpxio/files/mpxio_toi_sio.pdf
The home page of MPxIO:
http://opensolaris.org/os/project/mpxio
The Solaris SAN Configuration and Multipathing Guide:
http://docs.sun.com/app/docs/doc/820-1931
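A minimal sketch of enabling and checking MPxIO on Solaris 10; treat it as an assumption and check the Solaris SAN Configuration and Multipathing Guide above for the procedure that applies to your release:

# Enable MPxIO on the Fibre Channel ports and update the device paths (prompts for a reboot)
stmsboot -e
# After the reboot, list the multipathed logical units and their paths
mpathadm list lu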
11.4.10 FC adapter
AMCC (formerly JNI), Emulex, QLogic, and SUN FC adapters are described, together with recommended performance parameters, in the DS8000 Host System Attachment Guide, SC26-7917-02. For more information, refer to the following link:
http://publib.boulder.ibm.com/infocenter/dsichelp/ds8000ic/index.jsp?topic=/com.ibm.storage.ssic.help.doc/f2c_agrs62105inst_1atzyy.html
fcachestat
The fcachestat command and its output are illustrated in Example 11-25.
Example 11-25 The fcachestat command output
[root@v480-1]# fcachestat 1
 --- dnlc ----  --- inode ---  --- ufsbuf ---  --- segmap ---  --- segvn ----
  %hit   total   %hit   total    %hit   total    %hit   total    %hit   total
 99.68  145.4M  17.06  512720   99.95   17.6M   42.05   11.0M   80.97    9.7M
100.00       1   0.00       0   99.99    8653    0.00    4288   50.01    4289
100.00       8   0.00       0   99.99    7747    0.00    3840   50.00    3840
100.00       1   0.00       0   99.99    9653    0.00    4780   50.92    4694
100.00       1   0.00       0   99.99    8533    0.00    4215   49.15    4279
100.00       1   0.00       0   99.99    8651    0.00    4289   51.18    4195
100.00       4   0.00       0   99.98    9358    0.00    4636   48.82    4752
100.00       1   0.00       0   99.99    8781    0.00    4352   50.00    4352
100.00       7   0.00       0   99.99    8387    0.00    4154   50.00    4154
  0.00       0   0.00       0   99.99    9215    0.00    4566   50.00    4566
  0.00       0   0.00       0   99.99    8237    0.00    4080   50.14    4069
  0.00       0   0.00       0   99.99    7789    0.00    3842   49.86    3853
[root@v480-1]#
With this tool, you can measure the filesystem buffer utilization. Ignore the first line, because it is an accumulation of statistics since the server was booted:
Directory Name Lookup Cache (DNLC) and inode cache: Every time a process looks up a directory, an inode, or file metadata, it first looks in cache memory. If the information is not there, it must go to disk. If the hit ratio does not stay above 90%, you might need to increase the size of these caches.
UFS buffer cache: Whenever a process accesses a file, it first checks the buffer cache to see if the pages of that file are still there. If not, it has to get the data from disk. If you see a cache hit percentage below 90%, you might have problems with data buffering in memory; the buffer size is limited by the bufhwm parameter. You can check the actual buffer size with the sysdef command shown in Example 11-26.
Example 11-26 The sysdef command output [root@v480-1]# sysdef
...skipping...
*
* Tunable Parameters
*
 85114880   maximum memory allowed in buffer cache (bufhwm)
    30000   maximum number of processes (v.v_proc)
       99   maximum global priority in sys class (MAXCLSYSPRI)
    29995   maximum processes per user id (v.v_maxup)
       30   auto update time limit in seconds (NAUTOUP)
       25   page stealing low water mark (GPGSLO)
        1   fsflush run rate (FSFLUSHR)
       25   minimum resident memory for avoiding deadlock (MINARMEM)
       25   minimum swapable memory for avoiding deadlock (MINASMEM)
...skipping...
[root@v480-1]#
To make a change, edit /etc/system and set the bufhwm parameter; a small sketch follows the links below. For additional information and to download this tool, go to the following link:
http://www.brendangregg.com/cachekit.html
You can also use sar -a, sar -b, and sar -v to check the DNLC and inode cache utilizations. Check the following links for more details about how to use sar in Solaris:
The sar -a command:
http://docs.sun.com/app/docs/doc/817-0403/enueh?a=view
The sar -b command:
http://docs.sun.com/app/docs/doc/817-0403/enuef?a=view
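A hedged sketch of raising the buffer cache limit and checking the related sar counters; the 128 MB value is only an illustrative assumption:

* /etc/system: allow up to 128 MB for the metadata buffer cache (bufhwm is specified in KB); reboot required
set bufhwm=131072

# Check name lookup (DNLC) and buffer cache activity, 5 samples at 5-second intervals
sar -a 5 5
sar -b 5 5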
directiostat
The directiostat command and its output are illustrated in Example 11-27.
Example 11-27 directiostat output
[root@v480-1]# ./directiostat 1 5
 lreads lwrites  preads pwrites     Krd     Kwr holdrds  nflush
      0       0       0       0       0       0       0       0
      0       0       0       0       0       0       0       0
      0       0       0       0       0       0       0       0
      0       0       0       0       0       0       0       0
      0       0       0       0       0       0       0       0
[root@v480-1]#
With this tool, you can measure the I/O requests being executed on filesystems that are mounted with the Direct I/O option enabled. For additional information and to download this tool, go to the following link:
http://www.solarisinternals.com/wiki/index.php/Direct_I/O
Example 11-28 The vmstat -p command output
[root@v480-1]# vmstat -p 1 5
     memory            page           executable     anonymous      filesystem
    swap    free  re  mf  fr  de  sr  epi epo epf   api apo apf   fpi fpo fpf
 3419192 3244504   3   3   0   0   0    0   0   0     0   0   0     0   0   0
 3686632 3669336   8  19   0   0   0    0   0   0     0   0   0     0   0   0
 3686632 3669392   0   0   0   0   0    0   0   0     0   0   0     0   0   0
 3686632 3669392   0   0   0   0   0    0   0   0     0   0   0     0   0   0
 3686632 3669392   0   0   0   0   0    0   0   0     0   0   0     0   0   0
[root@v480-1]#
The vmstat output has five major column groups (memory, page, executable, anonymous, and filesystem). The filesystem group contains three sub-columns:
fpi: File pages in. It tells how many file pages were copied from disk to memory.
fpo: File pages out. It tells how many file pages were copied from memory to disk.
fpf: File pages freed. It tells how many file pages were freed in each sample interval.
If you see no paging activity in the anonymous columns (api/apo) but a lot of file page activity (fpi/fpo), you do not have a memory constraint, but there is too much file page activity and you might need to optimize it. One way is to enable Direct I/O on the filesystems of your application. Another way is to adjust the read-ahead mechanism, if it is enabled, or to adjust the virtual memory scanner parameters. The recommended values are as follows (a sample /etc/system sketch follows this list):
fastscan: This parameter sets how many memory pages are scanned per second. Configure it for 1/4 of real memory, with a limit of 1 GB.
handspreadpages: This parameter sets the distance between the two hands of the clock algorithm that looks for candidate memory pages to be reclaimed when memory is low. Configure it with the same value as the fastscan parameter.
maxpgio: This parameter sets the maximum number of pages that can be queued by the Virtual Memory Manager. Configure it for 1024 if you use eight or more ranks in the DS8000 and a high-end server.
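A hedged /etc/system sketch for a host with 16 GB of memory and 8 KB pages; the numbers are illustrative assumptions derived from the rules above (1/4 of 16 GB is capped at 1 GB, and 1 GB divided by 8 KB is 131072 pages):

* Scan up to 1 GB worth of 8 KB pages per second
set fastscan=131072
* Keep the clock hands the same distance apart as the fastscan value
set handspreadpages=131072
* Allow up to 1024 queued page-out I/Os (eight or more DS8000 ranks, high-end server)
set maxpgio=1024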
                     extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t0d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c1t0d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c1t1d0
                     extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t0d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c1t0d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c1t1d0
                     extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t0d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c1t0d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c1t1d0
When analyzing the output:
Look to see if the number of I/Os is balanced among the disks. If not, it might indicate that you have problems in the distribution of subdisks or plexes (VxVM), or volumes (SVM), over the LUNs. You can use the -p option of iostat to see the workload among the slices of a LUN.
If r/s is larger than w/s and asvc_t is larger than 15 ms, your bottleneck is in a lower layer, which can be the HBA, the SAN, or even the storage; for some reason, it is taking too much time to process the I/O requests. Also, check if the same problem occurs with other disks of the same disk set or disk group (DG). If yes, add up the number of I/Os per second and the throughput by vpath (if applicable), rank, and host, and compare them with the performance numbers from TotalStorage Productivity Center for Disk.
If r/s is smaller than w/s and asvc_t is larger than 3 ms, writes averaging significantly and consistently higher indicate that the write cache is full and there is a bottleneck in the disk. Confirm this information with the TotalStorage Productivity Center for Disk reports.
If wait is greater than actv, compare with the other disks of the storage subsystem, and check whether the subdisks or plexes (VxVM) or volumes (SVM) are evenly distributed among all disks in the disk set or DG.
If wait is smaller than actv, compare with the other disks of the storage. If there are differences although the subdisks or plexes (VxVM) or volumes (SVM) are distributed evenly in the disk set or DG, it might indicate that the imbalance is at the rank level. Confirm this information with the TotalStorage Productivity Center for Disk reports.
For detailed information about the options of iostat, check the following link:
http://docs.sun.com/app/docs/doc/816-5166/iostat-1m?a=view
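A small hedged helper for spotting slow LUNs in iostat -xn output; the 15 ms threshold follows the guideline above, and the column positions assume the standard Solaris iostat -xn format, where asvc_t is the eighth field and the device name is the last field:

# Print any device whose average service time exceeds 15 ms, sampling every 30 seconds
iostat -xn 30 | awk '$1 ~ /^[0-9.]+$/ && $8 > 15 { print $NF, "asvc_t =", $8, "ms" }'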
11.5.4 vxstat
The vxstat performance tool is part of VxVM. With this tool, you can collect performance data related to VM disks, subdisks, plexes, and volumes. It can provide the following information:
Operations (reads/writes): The number of I/Os over the sample interval.
Blocks (reads/writes): The number of 512-byte blocks over the sample interval.
Avg time (reads/writes): The average response time for reads and writes over the sample interval.
With the DS8000, we recommend that you collect performance information from VM disks and subdisks. To display 10 sets of disk statistics at one-second intervals, use vxstat -i 1 -c 10 -d. To display 10 sets of subdisk statistics at one-second intervals, use
vxstat -i 1 -c 10 -s. Discard the first sample, because it provides statistics accumulated since the boot of the server. When analyzing the output of vxstat, focus on the following points:
Whether the number of I/Os is balanced among the disks. If not, it might indicate that you have problems in the distribution of subdisks or plexes (VxVM) over the LUNs.
If operations/read is bigger than operations/write and Avg time/read is longer than 15 ms, your bottleneck is in a lower layer, which can be the HBA, the SAN, or even the storage; for some reason, it takes too much time to process the I/O requests. Also, check if the same problem occurs with other disks of the same disk group (DG) or other DS8000 LUNs. If yes, add up the number of I/Os per second and the throughput by VM disk, rank, and host, and compare them with the performance numbers from TotalStorage Productivity Center for Disk.
If operations/read is smaller than operations/write and Avg time/write is greater than 3 ms, writes averaging significantly and consistently higher might indicate that the write cache is full and that there is a bottleneck in the disk. Confirm this information with the TotalStorage Productivity Center for Disk reports.
Whether the distribution of subdisks or plexes (VxVM) is even across all disks in the DG. If there are differences although the subdisks or plexes (VxVM) are distributed evenly in the DG, it can indicate that the imbalance is at the rank level.
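A hedged example of capturing both sets of statistics to files for later comparison; the disk group name datadg is an illustrative assumption:

# Collect 61 one-second samples; the first, boot-time sample is discarded during analysis
vxstat -g datadg -i 1 -c 61 -d > vxstat_disks.out
vxstat -g datadg -i 1 -c 61 -s > vxstat_subdisks.out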
11.5.5 dtrace
DTrace is not just a trace tool; it is also a framework for dynamically tracing the operating system kernel and the applications that run on top of Solaris 10. You can write your own tools for performance analysis by using the D programming language, whose syntax is based on the C programming language with several specific commands for tracing instrumentation. There are many scripts already developed that you can use for performance analysis. You can start by downloading the DTrace Toolkit from the following link:
http://www.brendangregg.com/dtrace.html#DTraceToolkit
Follow the instructions at the Web site to install the DTrace Toolkit. When it is installed, set your PATH environment variable to avoid having to type the full path every time, as shown in Example 11-30 on page 354.
Example 11-30 Setting PATH environment variable for DTrace Toolkit [root@v480-1]# export PATH=$PATH:/opt/DTT:/opt/DTT/Bin
One example use is checking whether a very large sequential I/O is hitting a limit. Refer to the script output in Example 11-31.
Example 11-31 bitesize.d script
[root@v480-1]# dd if=/dev/zero of=/export/home/test_file.dd bs=2048k count=1000 &
[1] 6516
[root@v480-1]# bitesize.d
Tracing... Hit Ctrl-C to end.
1000+0 records in
1000+0 records out
^C
     PID  CMD
       0  sched\0
           value  ------------- Distribution ------------- count
           ...

          fsflush\0
           value  ------------- Distribution ------------- count
             512 |                                         0
            1024 |                                         1
            2048 |                                         1
            4096 |                                         0
            8192 |@@@@@@@                                  15
           16384 |@@@@@@                                   14
           32768 |@@@@@                                    11
           65536 |@@@@@@                                   12
          131072 |@@                                       5
          262144 |@@                                       4
          524288 |@@@@@@@@@@@                              24
         1048576 |                                         0

    6516  dd if=/dev/zero of=/export/home/test_file.dd bs=2048k count=1000\0
           value  ------------- Distribution ------------- count
            2048 |                                         0
            4096 |                                         2
            8192 |@@                                       118
           16384 |@                                        29
           32768 |@                                        47
           65536 |@@                                       142
          131072 |@                                        62
          262144 |@                                        47
          524288 |@@@                                      150
         1048576 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@           1692
         2097152 |                                         0
In the previous example, we executed a dd command with a blocksize of 2 MB, but when we measure the I/O activity, we can see that in fact the maximum I/O size is 1 MB and not 2 MB, which might be related to the maximum physical I/O size that can be executed in the operating system. Let us confirm the maxphys value (Example 11-32).
Example 11-32 Checking the maxphys parameter
[root@v480-1]# grep maxphys /etc/system
[root@v480-1]#
As you can see, the maxphys parameter is not set in the /etc/system configuration file. It means that Solaris 10 is using the default value, which is 1 MB. You can increase the value of maxphys to increase the size of I/O requests.
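A hedged /etc/system sketch for raising the limit; the 2 MB value is only an illustrative assumption, and if VxVM is used, keep it consistent with related settings such as vxio:vol_maxio:

* Raise the maximum physical I/O size to 2 MB (value in bytes); a reboot is required
set maxphys=2097152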
For additional information about how to use DTrace, check the following links:
An introduction to DTrace:
http://www.solarisinternals.com/wiki/index.php/DTrace_Topics_Intro
The DTrace Toolkit:
http://www.solarisinternals.com/wiki/index.php/DTraceToolkit
Asynchronous I/O
Asynchronous I/O is a feature of HP-UX that is not enabled by default. It allows an application to keep processing while issuing I/O requests, without needing to wait for a reply, consequently reducing the application's response time. Database applications normally take advantage of this feature. If your application supports asynchronous I/O, enable it in the operating system as well. For detailed information about how to configure asynchronous I/O, refer to the appropriate application documentation:
For Oracle 11g with HP-UX using asynchronous I/O:
http://download.oracle.com/docs/cd/B28359_01/server.111/b32009/appb_hpux.htm#BABBFDCI
For Sybase Adaptive Server Enterprise (ASE) 15.0 using asynchronous I/O:
http://infocenter.sybase.com/help/index.jsp?topic=/com.sybase.dc35823_1500/html/uconfig/BBCBEAGF.htm
The DS8000 LUNs that are under the control of LVM are called physical volumes (PVs). LVM splits the disk space into smaller pieces that are called physical extents (PEs). A logical volume (LV) is composed of several logical extents (LEs); a filesystem is created on top of an LV, or the LV is simply used as a raw device. Each LE can point to up to two corresponding PEs in LVM Version 1.0 and up to five corresponding PEs in LVM Version 2.0/2.1, which is how LVM implements mirroring (RAID 1).
Note: In order to implement mirroring (RAID 1), it is necessary to install an optional product, HP MirrorDisk/UX.
To set up the volume layout with DS8000 LUNs, you can adopt one of the following strategies:
Storage Pool Striping (SPS): In this case, you spread the workload at the storage level. At the operating system level, you just need to create the LVs with the inter-policy attribute set to minimum (the default option when creating an LV).
Distributed Allocation Policy: A set of LUNs is created on different ranks inside the DS8000. When the LUNs are recognized in HP-UX, a VG is created and the LVs are spread evenly over the LUNs with the option -D. The advantage of this method compared to SPS is the granularity of the data spread over the LUNs: whereas with SPS the data is spread in chunks of 1 GB, you can create PE sizes from 8 MB to 16 MB in a VG.
LVM Striping: As with the Distributed Allocation Policy, a set of LUNs is created on different ranks inside the DS8000. When the LUNs are recognized in HP-UX, a VG is created with larger PE sizes, such as 128 MB or 256 MB, and the LVs are spread evenly over the LUNs by setting the LV stripe size to a value from 8 MB to 16 MB. From a performance standpoint, LVM Striping and the Distributed Allocation Policy provide the same results.
Figure 11-10 shows the logical volume /dev/inter-disk_lv distributed across four DS8000 LUNs:
disk4, disk5, disk6, and disk7 are hardware-striped LUNs on different DS8000 Extent Pools.
With 8 GB LUNs and 16 MB physical extents, there are approximately 500 physical extents per LUN (pe1 - pe500).
/dev/inter-disk_lv is made up of eight logical extents (le1 + le2 + le3 + le4 + le5 + le6 + le7 + le8) = 8 x 16 MB = 128 MB.
The first step is to initialize the PVs with the following commands:
pvcreate /dev/rdsk/disk4
pvcreate /dev/rdsk/disk5
pvcreate /dev/rdsk/disk6
pvcreate /dev/rdsk/disk7
The next step is to create the VG. We recommend creating a VG with a set of DS8000 LUNs where each LUN is located in a different Extent Pool. If you add a new set of LUNs to a host, define another VG, and so on. To create the VG data01vg with a PE size of 16 MB:
1. Create the directory /dev/data01vg with a character special file called group:
mkdir /dev/data01vg
mknod /dev/data01vg/group c 70 0x020000
2. Create the VG with the following command:
vgcreate -g data01pvg01 -s 16 /dev/data01vg /dev/dsk/disk4 /dev/dsk/disk5 /dev/dsk/disk6 /dev/dsk/disk7
Then, you can create the LVs with the option -D, which stripes the logical volume from one LUN to the next LUN in chunks the size of the physical extent size of the volume group. For instance:
lvcreate -D y -l 16 -m 1 -n inter-disk_lv -s g /dev/data01vg
LVM striping
An example of a striped logical volume is shown in Figure 11-11. The logical volume called /dev/striped_lv uses the same capacity as /dev/inter-disk_lv (shown in Figure 11-10 on page 359), but it is created differently. Notice that /dev/striped_lv is also made up of eight 256 MB extents, but each extent is then subdivided into 64 chunks of 4 MB (only three of the 4 MB chunks are shown per logical extent, for space reasons).
Figure 11-11 shows the striped logical volume /dev/striped_lv:
disk4, disk5, disk6, and disk7 are hardware-striped LUNs on different DS8000 Extent Pools.
With 8 GB LUNs and 256 MB physical extents, there are approximately 32 physical extents per LUN (pe1 - pe32).
/dev/striped_lv is made up of eight logical extents (8 x 256 MB = 2048 MB).
Each logical extent is divided into 64 equal parts of 4 MB (only three of the 4 MB parts are shown for each logical extent).
/dev/striped_lv = le1.1 + le2.1 + le3.1 + le4.1 + le1.2 + le2.2 + le3.2 + le4.2 + le5.1 and so on.
As with the LVM Distributed Allocation Policy, the first step is to initialize the PVs with the following commands:
pvcreate /dev/rdsk/disk4
pvcreate /dev/rdsk/disk5
pvcreate /dev/rdsk/disk6
pvcreate /dev/rdsk/disk7
Then, you need to create the VG with a PE size of 256 MB:
1. Create the directory /dev/data01vg with a character special file called group:
mkdir /dev/data01vg
mknod /dev/data01vg/group c 70 0x020000
2. Create the VG with the following command:
vgcreate -g data01pvg01 -s 256 /dev/data01vg /dev/dsk/disk4 /dev/dsk/disk5 /dev/dsk/disk6 /dev/dsk/disk7
The last step is to create all of the needed LVs with a stripe size of 8 MB. To create a striped LV, you need to combine the following options:
Number of LEs (-l): This option sets the size of your LV. In our case, we want to create a 2 GB LV; knowing that the PE size is 256 MB, we divide 2048 by 256 to find out how many LEs are needed.
Stripes (-i): This option sets the number of disks over which the data is spread.
Stripe size (-I): This option sets the size of the stripe in kilobytes.
Name of LV (-n): This option sets the name of the LV.
For each LV, execute the following command:
lvcreate -l 8 -i 4 -I 8192 -n striped_lv /dev/data01vg
For additional information, refer to:
LVM Limits White Paper:
http://docs.hp.com/en/6054/LVM_Limits_White_Paper_V4.pdf
LVM Version 2.0 Volume Groups in HP-UX 11i v3:
http://docs.hp.com/en/lvm-v2/L2_whitepaper_8.pdf
LVM New Features in HP-UX 11i v3:
http://docs.hp.com/en/LVM-11iv3features/LVM_New_Features_11iv3_final.pdf
LVM documentation at docs.hp.com:
http://docs.hp.com/en/oshpux11iv3.html#LVM%20Volume%20Manager
11.6.5 PV Links
PV Links is the multipathing solution that comes with HP-UX. It primarily provides a failover capability, but if the storage allows it, you can use the alternate path for load balancing. It is important to first check the DS8000 Interoperability Matrix to decide which multipathing solution to use:
http://www.ibm.com/systems/resources/systems_storage_disk_ds8000_interop.pdf
11.6.10 FC adapter
The FC adapter provides the connection between the host and the storage devices.
HP-UX rx4640-1 B.11.31 U ia64    11/13/08

10:57:56    %usr    %sys    %wio   %idle
10:57:57       3      23      12      62
10:57:58       1       6      42      51
10:57:59       1       9      36      54
10:58:00       1      10      37      52
10:58:01       1       1      70      28

Average        1      10      39      50
[root@rx4640-1]#
Not all sar options are the same for AIX, HP-UX, and Sun Solaris, but the sar -u output is the same. The output in the example shows CPU information every 1 second, 5 times. To check if a system is I/O-bound, the important column to check is the %wio column. The %wio includes time spent waiting on I/O from all drives, including internal and DS8000 logical disks. If %wio values exceed 40, you need to investigate to understand storage I/O performance. The next action is to look at I/O service times reported by the sar -Rd command.
Example 11-34 HP-UX sar -d output
[root@rx4640-1]# sar -Rd 1 5

HP-UX rx4640-1 B.11.31 U ia64    11/13/08

11:01:11   device   %busy    r/s    w/s  blks/s  avwait  avserv
11:01:12   disk3     9.90     11      6    1006    0.00   28.01
11:01:13   disk3    15.00     27     28     792    9.11   11.45
11:01:14   disk3     9.00      4     23     402   16.18   22.43
11:01:15   disk3     9.09      6     18     598    7.86   19.53
11:01:16   disk3    11.88      9     20     388    8.36   16.87

Average    disk3              11     19     638    9.00   17.56
The avwait and avserv columns show the average times spent in the wait queue and the service queue respectively, and the avque column represents the average number of I/Os in the queue of that device. With HP-UX 11i v3, the sar command has new options to monitor performance:
-H reports I/O activity by HBA
-L reports I/O activity by lunpath
-R with option -d splits the number of I/Os per second between reads and writes
-t reports I/O activity by tape device
When analyzing the output of sar:
Check if the number of I/Os is balanced among the disks. If not, it might indicate that you have problems in the distribution of logical volumes (LVs) over the LUNs.
Check if the I/O is balanced among the controllers (HBAs). If not, certain paths might be in failed status.
If r/s is larger than w/s and avserv is larger than 15 ms, your bottleneck is in a lower layer, which can be the HBA, the SAN, or even the storage; it is taking too much time to process the I/O requests. Also, check if the same problem occurs with other disks of the same volume group (VG) or other DS8000 LUNs. If yes, add up the number of I/Os per second and the throughput by LUN, rank, and host, and compare them with the performance numbers from TotalStorage Productivity Center for Disk.
If r/s is smaller than w/s, avserv is greater than 3 ms, and writes are averaging significantly and consistently higher, it might indicate that the write cache is full and there is a bottleneck in the disk. Confirm this information with the TotalStorage Productivity Center for Disk reports.
Check that the LVs are evenly distributed across all disks in the VG. If there are differences although the LVs are distributed evenly in the VG, it might indicate that the imbalance is at the rank level.
For additional information about the HP-UX sar command, go to:
http://docs.hp.com/en/B2355-60130/sar.1M.html
Remember, sa2 will remove the data collection files over a week old as scheduled in cron. You can save sar information to view later with the commands:
sar -A -o data.file interval count > /dev/null &     (save sar data to data.file)
sar -f data.file                                     (read the sar information back from the saved file)
All data is captured in binary form and saved to a file (data.file). The data can then be selectively displayed with the sar command using the -f option.
sar summary
The sar tool helps to tell quickly if a system is I/O-bound. Remember, though, that a busy system can mask I/O issues, because io_wait counters are not increased if the CPUs are busy. The sar tool can also help you save a history of I/O performance, so that you have a baseline measurement for each host and can verify whether tuning changes make a difference. You might want, for example, to collect sar data for a week and create reports for 8 a.m. - 5 p.m. Monday - Friday if that is the prime time for random I/O, and for 6 p.m. - 6 a.m. Saturday - Sunday if those times are batch/backup windows.
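A hedged sketch of scheduling the collection with cron; the sa1 path and arguments vary by platform (for example, /usr/lib/sa/sa1 on Solaris and /usr/lbin/sa/sa1 on HP-UX), so treat these crontab entries as an assumption to adapt:

# Online window: one sample every 10 minutes, Monday - Friday, 8 a.m. - 5 p.m.
0,10,20,30,40,50 8-17 * * 1-5 /usr/lib/sa/sa1
# Batch/backup window: one sample every 10 minutes during weekend nights
0,10,20,30,40,50 18-23,0-6 * * 6,0 /usr/lib/sa/sa1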
11.7.2 vxstat
The vxstat tool is a performance tool that comes with VxVM. Refer to 11.5.4, vxstat on page 353 for additional information.
Table 11-3 SDD command matrix
SDD command       AIX    HP-UX   Solaris
chgvpath                  X
ckvpath                   X
datapath          X       X       X
defvpath                  X
get_root_disks            X
gettrace                  X
hd2vp             X       X
pathtest          X       X
querysn           X       X
rmvpath                   X
showvpath                 X       X
vp2hd             X       X
vpathmkdev                        X
cfgvpath                          X
addpaths          X
cfallvpath        X
dpovgfix          X
extendvg4vp       X
lquerypr          X
lsvpcfg           X
mkvg4vp           X
restvg4vp         X
savevg4vp         X
Table 11-4 AIX SDD commands
Command       Description
addpaths      Dynamically adds paths to SDD devices while they are in the Available state.
lsvpcfg       Queries the SDD configuration state.
dpovgfix      Fixes an SDD volume group that has mixed vpath and hdisk physical volumes.
hd2vp         The SDD script that converts a DS8000 hdisk device volume group to a Subsystem Device Driver vpath device volume group.
vp2hd         The SDD script that converts an SDD vpath device volume group to a DS8000 hdisk device volume group.
querysn       The SDD driver tool to query unique serial numbers of DS8000 devices. This tool is used to exclude certain LUNs from SDD, for example, boot disks.
lquerypr      The SDD driver persistent reserve command tool.
mkvg4vp       Creates an SDD volume group.
extendvg4vp   Extends SDD devices to an SDD volume group.
savevg4vp     Backs up all files belonging to a specified volume group with SDD devices.
pathtest      Used with tracing functions.
cfallvpath    Fast-path configuration method to configure the SDD pseudo-parent dpo and all of the SDD vpath devices.
restvg4vp     Restores all files belonging to a specified volume group with SDD devices.
addpaths
In a SAN environment, where servers are attached to SAN switches, the paths from the server to the DS8000 are controlled by zones created with the SAN switch software. You might want to add a new path and remove another path for planned maintenance on the DS8000 or for proper load balancing. You can take advantage of the addpaths command to make the changes live.
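A hedged sketch of the typical flow on AIX after a zoning change; the commands are standard AIX and SDD commands, but the exact sequence shown here is an assumption of typical usage rather than a quoted procedure:

# Discover the hdisks that became visible through the new zone
cfgmgr
# Add the new paths to the existing SDD vpath devices without disruption
addpaths
# Confirm that every vpath now shows the expected number of OPEN/NORMAL paths
datapath query device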
lsvpcfg
To display which DS8000 vpath devices are available to provide failover protection, run the lsvpcfg command. You will see output similar to that shown in Example 11-35.
Example 11-35 The lsvpcfg command for AIX output # lsvpcfg vpath0 (Avail pv vpathvg)018FA067=hdisk1 (Avail ) vpath1 (Avail )019FA067=hdisk2 (Avail ) vpath2 (Avail )01AFA067=hdisk3 (Avail ) vpath3 (Avail )01BFA067=hdisk4 (Avail )hdisk27(Avail ) vpath4 (Avail )01CFA067=hdisk5 (Avail )hdisk28 (Avail ) vpath5 (Avail )01DFA067=hdisk6 (Avail )hdisk29 (Avail ) vpath6 (Avail )01EFA067=hdisk7(Avail )hdisk30 (Avail ) vpath7(Avail )01FFA067=hdisk8 (Avail )hdisk31 (Avail ) vpath8 (Avail )020FA067=hdisk9 (Avail )hdisk32 (Avail ) vpath9 (Avail pv vpathvg)02BFA067=hdisk20 (Avail )hdisk44 (Avail ) vpath10 (Avail pv vpathvg)02CFA067=hdisk21 (Avail )hdisk45 (Avail ) vpath11 (Avail pv vpathvg)02DFA067=hdisk22 (Avail )hdisk46 (Avail ) vpath12 (Avail pv vpathvg)02EFA067=hdisk23 (Avail )hdisk47(Avail ) vpath13 (Avail pv vpathvg)02FFA067=hdisk24 (Avail )hdisk48 (Avail
Notice in the example that vpath0, vpath1, and vpath2 all have a single path (hdisk device) and, therefore, will not provide failover protection, because there is no alternate path to the DS8000 LUN. The other SDD vpath devices have two paths and, therefore, can provide failover protection and load balancing.
(lsvg -p output fragment: the FREE DISTRIBUTION column of a volume group in mixed mode, containing both hdisk and vpath physical volumes.)
To fix this problem, run the command dpovgfix volume_group_name. Then, rerun the lsvpcfg or lsvg command to verify. Note: In order for the dpovgfix shell script to be executed, you must unmount all mounted filesystems of this volume group. After successful completion of the dpovgfix shell script, mount the filesystems again.
lquerypr
The lquerypr command implements certain SCSI-3 persistent reservation commands on a device. The device can be either an hdisk or an SDD vpath device. This command supports the following persistent reserve service actions: read reservation key, release persistent reservation, preempt-abort persistent reservation, and clear persistent reservation.
The syntax and options are:
lquerypr [[-p]|[-c]|[-r]][-v][-V][-h/dev/PVname]
Flags:
p   If the persistent reservation key on the device differs from the current host reservation key, it preempts the persistent reservation key on the device.
c   If there is a persistent reservation key on the device, it removes any persistent reservation and clears all reservation key registrations on the device.
r   Removes the persistent reservation key on the device made by this host.
v   Displays the persistent reservation key if it exists on the device.
V   Verbose mode. Prints detailed messages.
To query the persistent reservation on a device, type:
lquerypr -h /dev/vpath30
This command queries the persistent reservation on the device. If there is a persistent reserve on the disk, it returns 0 if the device is reserved by the current host and 1 if the device is reserved by another host. Use caution with this command, especially when implementing the preempt-abort or clear persistent reserve service actions. With the preempt-abort service action, the current persistent reserve key is preempted, and tasks on the LUN that originated from the initiators registered with the preempted key are aborted. With the clear service action, both the persistent reservation and the reservation key registrations are cleared from the device or LUN. This command is useful if a disk was attached to one system and was not varied off, leaving SCSI reserves on the disks and thus preventing another system from accessing them.
Table 11-5 HP-UX SDD commands
Command                      Description
showvpath                    Lists the configuration mapping between SDD devices and underlying disks.
chgvpath                     Configures SDD vpath devices. Updates the information in /etc/vpath.cfg and /etc/vpathsave.cfg. The -c option updates the configuration file. The -r option updates the device configuration without a system reboot.
defvpath                     Second part of the chgvpath command configuration during startup time.
datapath                     SDD driver console command tool.
rmvpath [-all, -vpathname]   Removes SDD vpath devices from the configuration.
hd2vp                        Converts a volume group from DS8000 hdisks into SDD vpaths.
get_root_disks               Generates a file called /etc/vpathexcl.cfg to exclude bootable disks from the SDD configuration.
querysn                      Lists all disk storage devices by serial number.
pathtest                     Debug tool.
gettrace                     Debug tool that gets trace information.
vp2hd                        Converts a volume group from SDD vpaths into DS8000 hdisks.
showvpath
The showvpath command for HP-UX is similar to the lsvpcfg command for AIX. Use showvpath to verify that an HP-UX vpath uses multiple paths to the DS8000. We show an example of the output from showvpath in Example 11-37.
Example 11-37 The showvpath command for HP-UX
#/opt/IBMsdd/bin/showvpath
vpath1:
        /dev/rdsk/c12t0d0
        /dev/rdsk/c10t0d0
        /dev/rdsk/c7t0d0
        /dev/rdsk/c5t0d0
vpath2:
        /dev/rdsk/c12t0d1
Notice that vpath1 in the example has four paths to the DS8000. The vpath2 device, however, has a single point of failure, because it only uses a single path. Tip: You can use the output from showvpath to modify iostat or sar information to report statistics based on vpaths instead of the underlying disk devices. Gather iostat data to a file, and then replace the disk names with the corresponding vpaths.
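A hedged sketch of that substitution, using the device names from Example 11-37 (adapt the names and the iostat interval to your own configuration):

# Collect iostat samples, then map the underlying disk names back to their vpath
iostat 30 120 > iostat.out
sed -e 's/c12t0d0/vpath1/g' -e 's/c10t0d0/vpath1/g' \
    -e 's/c7t0d0/vpath1/g'  -e 's/c5t0d0/vpath1/g' \
    -e 's/c12t0d1/vpath2/g' iostat.out > iostat_by_vpath.out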
On Sun Solaris, SDD resides above the Sun SCSI disk driver (sd) in the protocol stack. For more information about how SDD works, refer to Subsystem Device Driver (SDD) on page 272. SDD is supported for the DS8000 on Solaris 8/9. There are specific commands that SDD provides to Sun Solaris, which we list next, as well as the steps to update SDD after making DS8000 logical disk configuration changes for a Sun server.
cfgvpath
The cfgvpath command configures vpath devices using the following process:
Scan the host system to find all DS8000 devices (LUNs) that are accessible by the Sun host.
Determine which DS8000 devices (LUNs) are the same devices that are accessible through different paths.
Create the configuration file /etc/vpath.cfg to save the information about the DS8000 devices.
With the -c option, cfgvpath exits without initializing the SDD driver; the SDD driver is initialized after reboot. This option is used to reconfigure SDD after a hardware reconfiguration. Without the -c option, cfgvpath initializes the SDD device driver vpathdd with the information stored in /etc/vpath.cfg and creates the pseudo-vpath devices /devices/pseudo/vpathdd*.
Note: Do not use cfgvpath without the -c option after hardware reconfiguration, because the SDD driver is already initialized with the previous configuration information. A reboot is required to properly initialize the SDD driver with the new hardware configuration information.
vpathmkdev
The vpathmkdev command creates files vpathMsN in the /dev/dsk/ and /dev/rdsk/ directories by creating links to the pseudo-vpath devices /devices/pseudo/vpathdd*, which are created by the SDD driver. Files vpathMsN in the /dev/dsk/ and /dev/rdsk/ directories provide block and character access to an application the same way as the cxtydzsn devices created by the system. The vpathmkdev command is executed automatically during SDD package installation and must be executed manually to update files vpathMsN after hardware reconfiguration.
showvpath
The showvpath command lists all SDD devices and their underlying disks. We illustrate the showvpath command in Example 11-38.
Example 11-38 Sun Solaris showvpath command output
# showvpath
vpath0c:
        c1t8d0s2    /devices/pci@1f,0/pci@1/scsi@2/sd@1,0:c,raw
        c2t8d0s2    /devices/pci@1f,0/pci@1/scsi@2,1/sd@1,0:c,raw
Tip: You can use the output from showvpath to modify iostat or sar information to report statistics based on vpaths instead of the underlying cXtYdZsN devices. Gather iostat data to a file, and then replace the disk device names with the corresponding vpaths.
11.9.1 Using the dd command to test sequential rank reads and writes
To test the sequential read speed of a rank, you can run the command:
time dd if=/dev/rvpath0 of=/dev/null bs=128k count=781
The rvpath0 device is the character (raw) device file for the LUN presented to the operating system by SDD. This command reads 100 MB off rvpath0 and reports how long it takes in seconds. Divide 100 MB by the number of seconds reported to determine the MB/s read speed. If you determine that the average read speed for your vpaths is, for example, 50 MB/s, you know you need to stripe your future logical volumes across at least four different ranks to achieve 200 MB/s sequential read speeds. We show an example of the output from the test_disk_speeds sample shell script in Example 11-39.
Example 11-39 The test_disk_speeds script output # test_disk_speeds rvpath0 100 MB/sec 100 MB bs=128k
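The book references a test_disk_speeds sample script but does not list it; a minimal sketch that produces similar output might look like the following (an assumption, not the actual script, and it relies on the ksh built-in SECONDS timer, so timing is only accurate to whole seconds):

#!/bin/ksh
# Usage: ./test_disk_speeds rvpath0
dev=/dev/$1
SECONDS=0                                              # ksh built-in elapsed-time counter
dd if=$dev of=/dev/null bs=128k count=781 2>/dev/null  # read ~100 MB (781 x 128 KB)
elapsed=$SECONDS
[ "$elapsed" -eq 0 ] && elapsed=1                      # guard against sub-second runs
print "$1  $((100 / elapsed)) MB/sec  100 MB  bs=128k"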
Let us explore the dd command more. Issue the following command:
dd if=/dev/rvpath0 of=/dev/null bs=128k
Your nmon monitor (the e option) reports that this command has imposed a sustained 100 MB/s bandwidth with a 128 KB blocksize on vpath0. Notice the xfers/sec column; xfers/sec is IOPS. Now, if your dd command has not already errored out because it reached the end of the disk, press Ctrl-C to stop the process; nmon reports idle. Next, issue the following dd command with a 4 KB blocksize and put it in the background:
dd if=/dev/rvpath0 of=/dev/null bs=4k &
For this command, nmon reports a lower MB/s but a higher IOPS, which is the nature of I/O as a function of blocksize. Use the previous kill-grep-awk command to clear out all the dd processes from your system. Try your dd sequential read command with bs=1024k and you see a high MB/s but a reduced IOPS. Now, start several of these commands and watch your throughput increase until it reaches a plateau; something in your configuration (CPU? HBAs? DS8000 rank?) has become a bottleneck. This plateau is as fast as your hardware configuration can perform sequential reads for a specific blocksize. The kill-grep-awk script clears everything out of the process table for you. Try loading up another raw vpath device (vpath1) and watch the performance of your HBAs (nmon a option) approach 200 MB/s.
You can perform the same kinds of tests against the block vpath device, vpath0. What is interesting here is that you will always observe the same I/O characteristics, no matter what blocksize you specify, because, in AIX anyway, the Logical Volume Manager breaks everything up into 4 KB blocks for reads and writes. Run the following two commands separately; nmon reports about the same for both:
dd if=/dev/vpath0 of=/dev/null bs=128k
dd if=/dev/vpath0 of=/dev/null bs=4k
Use caution when using the dd command to test sequential writes. If LUNs have been incorporated into the operating system using Logical Volume Manager (LVM) commands, and the dd command is used to write to those LUNs, they will not be part of the operating system anymore, and the operating system will not like that. For example, if you want to write to a vpath, that vpath must not be part of an LVM volume group. And if you want to write to an LVM
logical volume, it must not have a filesystem on it, and if the logical volume has a logical volume control block (LVCB), you must skip over the LVCB when writing to the logical volume. It is possible to create a logical volume without an LVCB by using the mklv -T O option.
Important: Use extreme caution when using the dd command to perform a sequential write operation. Ensure that the dd command is not writing to a device file that is part of the UNIX operating system.
The following commands perform sequential writes to your LUNs:
dd if=/dev/zero of=/dev/rvpath0 bs=128k
dd if=/dev/zero of=/dev/rvpath0 bs=1024k
time dd if=/dev/zero of=/dev/rvpath0 bs=128k count=781
Try different blocksizes, different raw vpath devices, and combinations of reads and writes. Run the commands against the block device (/dev/vpath0) and notice that the blocksize does not affect performance.
2. Next, run sequential reads and writes to all of the vpath devices (raw or block) for about an hour. Use the commands that we discussed in 11.9.1, Using the dd command to test sequential rank reads and writes on page 376. Then, look at your SAN infrastructure to see how it performs. Look at the UNIX error report. Problems will show up as storage errors, disk errors, or adapter errors. If there are problems, they will not be hard to find in the error report, because there will be a lot of them. The source of the problem can be hardware problems on the storage side of the SAN, Fibre Channel cables or connections, down-level device drivers, or device (HBA) microcode. If you see errors similar to the errors shown in Example 11-41, stop and get them fixed.
Example 11-41 SAN problems reported in UNIX error report
IDENTIFIER TIMESTAMP  T C RESOURCE_NAME DESCRIPTION
3074FEB7   0915100805 T H fscsi0        ADAPTER ERROR
3074FEB7   0915100805 T H fscsi3        ADAPTER ERROR
3074FEB7   0915100805 T H fscsi3        ADAPTER ERROR
825849BF   0915100705 T H fcs0          ADAPTER ERROR
3074FEB7   0915100705 T H fscsi3        ADAPTER ERROR
3074FEB7   0915100705 T H fscsi0        ADAPTER ERROR
3074FEB7   0914175405 T H fscsi0        ADAPTER ERROR
3074FEB7   0914175405 T H fscsi0        ADAPTER ERROR
3074FEB7   0914175305 T H fscsi0        ADAPTER ERROR
Ensure that after running an hour's worth of dd commands on all your vpaths, there are no storage errors in the UNIX error report.
3. Next, issue the following command to see if SDD is correctly load balancing across the paths to the LUNs:
datapath query device
Output from this command looks like Example 11-42.
Example 11-42 The datapath query device command output {CCF-part2:root}/tmp/perf/scripts -> datapath query device|more Total Devices : 16
DEV#: 0 DEVICE NAME: vpath0 TYPE: 2107900 POLICY: Optimized SERIAL: 75065513000 ========================================================================== Path# Adapter/Hard Disk State Mode Select Errors 0 fscsi0/hdisk4 OPEN NORMAL 220544 0 1 fscsi0/hdisk12 OPEN NORMAL 220396 0 2 fscsi1/hdisk20 OPEN NORMAL 223940 0 3 fscsi1/hdisk28 OPEN NORMAL 223962 0 DEV#: 1 DEVICE NAME: vpath1 TYPE: 2107900 POLICY: Optimized SERIAL: 75065513001 ========================================================================== Path# Adapter/Hard Disk State Mode Select Errors
DEV#: 2 DEVICE NAME: vpath2 TYPE: 2107900 POLICY: Optimized SERIAL: 75065513002 ========================================================================== Path# Adapter/Hard Disk State Mode Select Errors 0 fscsi0/hdisk6 OPEN NORMAL 218881 0 1 fscsi0/hdisk14 OPEN NORMAL 219835 0 2 fscsi1/hdisk22 OPEN NORMAL 222697 0 3 fscsi1/hdisk30 OPEN NORMAL 223918 0 DEV#: 3 DEVICE NAME: vpath3 SERIAL: 75065513003 TYPE: 2107900 POLICY: Optimized
Check to make sure, for every LUN, that the counters under the Select column are the same and that there are no errors. 4. Next, spot-check the sequential read speed of the raw vpath device. The following command is an example of the command run against a LUN called vpath0. For the LUNs that you test, ensure that they each yield the same results: time dd if=/dev/rvpath0 of=/dev/null bs=128k count=781
Tip: For this dd command, for the first time that it is run against rvpath0, the I/O must be read from disk and staged to the DS8000 cache. The second time that this dd command is run, the I/O is already in cache. Notice the shorter read time when we get an I/O cache hit. Of course, if any of these LUNs are on ranks that are also used by another application, you see a variation in the throughput. If there is a large variation in the throughput, perhaps you need to give that LUN back to the storage administrator; trade for another LUN. You want all your LUNs to have the same performance. If everything looks good, continue with the configuration of volume groups and logical volumes.
Observe the following behaviors:
Performance is the same for all the logical volumes.
Raw logical volume devices (/dev/rlvname) are faster than the counterpart block logical volume devices (/dev/lvname) as long as the blocksize specified is more than 4 KB.
Larger blocksizes result in higher MB/s but reduced IOPS for raw logical volumes.
The blocksize does not affect the throughput of a block (not raw) logical volume, because, in AIX, the LVM imposes an I/O blocksize of 4 KB. Verify this by running the dd command against a raw logical volume with a blocksize of 4 KB; the performance is the same as running the dd command against the non-raw logical volume.
Reads are faster than writes.
With inter-disk logical volumes, nmon does not report that all the LUNs have input at the same time, as it does with a striped logical volume. This result is normal and has to do with the nmon refresh rate and the characteristics of inter-disk logical volumes.
4. Ensure that the UNIX errorlog is clear of storage-related errors.
the DS8000 cache, because these results are a more realistic measure of how the application will perform. For HP-UX, use the prealloc command instead of lmktemp for AIX to create large files. For Sun Solaris, use the mkfile command. Note: The prealloc command for HP-UX and the lmktemp command for AIX have a 2 GB size limitation. Those commands are not able to create a file greater than 2 GB in size. If you want a file larger than 2 GB for a sequential read test, concatenate a couple of 2 GB files.
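A hedged example of building a larger-than-2-GB read-test file on Solaris by concatenation; mkfile is the Solaris command mentioned above, and the /testfs path is an assumption:

# Create two 2 GB files and concatenate them into a ~4 GB sequential-read test file
mkfile 2g /testfs/chunk1
mkfile 2g /testfs/chunk2
cat /testfs/chunk1 /testfs/chunk2 > /testfs/seqread_test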
Chapter 12.
(Figure: VMware storage layers - a virtual machine's virtual disks (vmdk files) reside in a VMFS volume on the ESX Server, which is backed by external storage in the form of a DS8000 logical volume.)
For VMware to use external storage, VMware needs to be configured with logical volumes that are defined in accordance with the user's expectations, which might include using RAID or striping at the storage hardware level. These logical volumes have to be presented to the ESX Server. For the DS8000, host access to the volumes includes the proper configuration of logical volumes, host mapping, correct LUN masking, and zoning of the involved SAN fabric. At the ESX Server layer, these logical volumes can be addressed as a VMware ESX Server File System (VMFS) volume or as a raw disk using Raw Device Mapping (RDM). A VMFS volume is a storage resource that can serve several VMs as well as several ESX Servers as
consolidated storage, whereas a RDM volume is intended for usage as isolated storage by a
single VM. On the Virtual Machine layer, you can configure one or several Virtual Disks (VMDKs) out of a single VMFS volume. These Virtual Disks can be configured to be used by several VMs.
(Figure: an ESX Server hosting three virtual machines; each virtual machine accesses its virtual disk through a virtual SCSI controller, and all of the virtual disks reside in a single VMFS volume on LUN0.)
VMFS is optimized to run multiple virtual machines as one workload to minimize disk I/O overhead. A VMFS volume can be spanned across several logical volumes, but there is no striping available to improve disk throughput in these configurations. Each VMFS volume can be extended by adding additional logical volumes while the virtual machines use this volume. A raw device mapping (RDM) is implemented as a special file in a VMFS volume that acts as a proxy for a raw device. RDM combines advantages of direct access to physical devices with the advantages of virtual disks in the VMFS. In certain special configurations, you must use RDM raw devices, such as in Microsoft Cluster Services (MSCS) clustering, using virtual
machine snapshots, or using VMotion, which enables the migration of virtual machines from one datastore to another with zero downtime. With RDM volumes, ESX Server supports the usage of N_Port ID Virtualization (NPIV). This HBA virtualization technology allows a single physical host bus adapter (HBA) port to function as multiple logical ports, each with its own separate worldwide port name (WWPN). This function can be helpful when migrating virtual machines between ESX Servers using VMotion as well as to separate workloads of multiple VMs configured to the same paths on the HBA level for performance measuring purposes. The VMware ESX virtualization of datastores is shown in Figure 12-3.
(Figure 12-3: VMware ESX virtualization of datastores - two virtual machines, each with two virtual disks accessed through virtual SCSI controllers; the virtual disks are stored as .vmdk files in a VMFS volume (LUN0, LUN1) and as a raw device mapping (RDM) to LUN4.)
actually two or more storage device names. After a rescan or reboot, the path information displayed by ESX Server might change; however, the name still refers to the same physical device. In Figure 12-4, the identical LUN can be addressed as vmhba2:0:1 or vmhba2:1:1, which can easily by verified in the Canonical Path column. If more than one HBA is used to connect to the SAN-attached storage for redundancy reasons, this LUN can also be addressed via a different HBA, in this configuration, for example, vmhba5:0:1 and vmhba5:1:2.
ESX Server provides built-in multipathing support, which means that it is not necessary to install an additional failover driver. External failover drivers, such as SDD, are not supported for VMware ESX. In the current implementation, ESX Server only supports path failover, which means that only one of the paths to a LUN is active at a time. Only in the case of a failure of any element in the active path does the multipathing module perform a path failover, that is, change the active path to another path that is still available. VMware ESX currently does not support dynamic load balancing. A Round-robin load balancing scheme is already available, but it is considered experimental and is not to be used in production environments.
ESX Server 3 provides two multipathing policies for use in production environments: Fixed and Most Recently Used (MRU). The MRU policy is designed for active/passive storage devices, such as the DS4000, which only have one active controller available per LUN. The Fixed policy is the best choice for attaching a DS8000 to VMware ESX, because this policy makes sure that the designated preferred path to the storage is used whenever it is available. During a path failure, an alternate path is used, and as soon as the preferred path is available again, the multipathing module switches back to it as the active path. The multipathing policy and the preferred path can be configured from the VI Client or by using the command line tool esxcfg-mpath. Figure 12-5 shows how the preferred path is changed from the VI Client.
By using the Fixed multipathing policy, you can implement static load balancing if several LUNs are attached to the VMware ESX Server. The multipathing policy is set on a per-LUN basis, and the preferred path is chosen for each LUN individually. If the ESX Server is connected over four paths to its DS8000 storage, we recommend that you spread the preferred paths over all four available physical paths. For example, when you configure four LUNs, assign the preferred path of LUN0 via the first path, the one for LUN1 via the second path, the preferred path for LUN2 via the third path, and the one for LUN3 via the fourth path. This method allows you to spread the throughput over all physical paths in the SAN fabric and thus results in optimized performance on the physical connections between the ESX Server and the DS8000.
If the workload varies greatly between the accessed LUNs, it might be a good approach to monitor the performance on the paths and adjust the configuration according to the workload. It might be necessary to assign one path as preferred to just one LUN with a high workload, but to share another path as preferred among five separate LUNs showing moderate workloads. Of course, this static load balancing only works while all paths are available. As soon as one path fails, all LUNs that have selected the failing path as preferred fail over to another path and put additional workload onto those paths. Furthermore, there is no capability to influence to which path the failover will occur.
When the active path fails, for example, due to a physical path failure, I/O might pause for about 30 to 60 seconds until the FC driver determines that the link is down and fails over to one of the remaining paths. This behavior can cause the virtual disks used by the operating systems of the virtual machines to appear unresponsive. After failover is complete, I/O resumes normally. The timeout value for detecting a failed link can be adjusted; it is usually set in the HBA BIOS or driver, and thus the way to set this option depends on the HBA hardware and vendor. In general, the recommended failover timeout value is 30 seconds. With VMware ESX, you can adjust this value by editing the device driver options for the installed HBAs in /etc/vmware/esx.conf. Additionally, you can increase the standard disk timeout value in the virtual machine's operating system to make sure that the operating system is not extensively disrupted and is not logging permanent errors during the failover phase. How to adjust this timeout depends on the operating system that is used; refer to the appropriate technical documentation for details.
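A hedged illustration of raising the guest operating system disk timeout to 60 seconds; the device name sda is an example, and the Windows registry value is expressed in seconds as a REG_DWORD:

# Linux guest, per device and not persistent across reboots:
echo 60 > /sys/block/sda/device/timeout
# Windows guest: set the following registry value to 60 (decimal) and reboot:
#   HKLM\SYSTEM\CurrentControlSet\Services\Disk\TimeOutValue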
In interactive mode, esxtop is first switched to the disk storage utilization panel; the display is then expanded for adapter vmhba2, SCSI channel 0, and SCSI target 0, and the resulting configuration is written to the ~/.esxtop3rc file.
After this initial configuration, the performance counters are displayed as shown in Example 12-2.
Example 12-2 Disk performance metrics in esxtop
 1:25:39pm up 12 days 23:37, 86 worlds; CPU load average: 0.36, 0.14, 0.17

 ADAPTR CID TID LID NCHNS NTGTS NLUNS NVMS AQLEN LQLEN WQLEN ACTV %USD LOAD   CMDS/s  READS/s WRITES/s MBREAD/s
 vmhba1              2     1     1    32   238               0    0         4.11     0.20     3.91     0.00
 vmhba2   0   0   0  1     1     1    10   4096  32    0     8   25   0.25 25369.69 25369.30     0.39   198.19
 vmhba2   0   0   1  1     1     1    10   4096  32    0     0    0   0.00     0.00     0.00     0.00     0.00
 vmhba2   0   0   2  1     1     1    10   4096  32    0     0    0   0.00     0.39     0.00     0.39     0.00
 vmhba2   0   0   3  1     1     1     9   4096  32    0     0    0   0.00     0.39     0.00     0.39     0.00
 vmhba2   0   0   4  1     1     1     4   4096  32    0     0    0   0.00     0.00     0.00     0.00     0.00
 vmhba2   0   0   5  1     1     1    17   4096  32    0     0    0   0.00     0.00     0.00     0.00     0.00
 vmhba2   0   0   6  1     1     1     4   4096  32    0     0    0   0.00     0.00     0.00     0.00     0.00
 vmhba2   0   0   7  1     1     1     4   4096  32    0     0    0   0.00     0.00     0.00     0.00     0.00
 vmhba2   0   0   8  1     1     1     4   4096  32    0     0    0   0.00     0.00     0.00     0.00     0.00
 vmhba2   0   0   9  1     1     1     4   4096  32    0     0    0   0.00     0.00     0.00     0.00     0.00
 vmhba2   0   1      1     1    10    76   4096              0    0   0.00     0.00     0.00     0.00     0.00
 vmhba3              1     2     4    16   4096              0    0   0.00     0.00     0.00     0.00     0.00
 vmhba4              1     2     4    16   4096              0    0   0.00     0.00     0.00     0.00     0.00
 vmhba5              1     2    20   152   4096              0    0   0.00     0.78     0.00     0.78     0.00
Additionally, you can change the field order and select or deselect various performance counters in the view. The minimum refresh rate is 2 seconds, and the default setting is 5 seconds. When using esxtop in Batch Mode, always include all of the counters by using the option -a. To collect the performance counters every 10 seconds for 100 iterations and save them to file, invoke esxtop this way: esxtop -b -a -d 10 -n 100 > perf_counters.csv Additional information about how to use esxtop is provided in the ESX Server Resource Management Guide: http://www.vmware.com/pdf/vi3_301_201_resource_mgmt.pdf
commands for a QLogic HBA, and Emulex supports up to 128. The default value for both vendors is 32. If a virtual machine generates more commands to a LUN than the LUN queue depth, these additional commands are queued in the ESX kernel, which increases the latency. The queue depth is defined on a per-LUN basis, not per initiator; an HBA (SCSI initiator) supports many more outstanding commands.
For ESX Server, if two virtual machines access their virtual disks on two different LUNs, each VM can generate as many active commands as the LUN queue depth. But if those two virtual machines have their virtual disks on the same LUN (within the same VMFS volume), the total number of active commands that the two VMs combined can generate without queuing I/Os in the ESX kernel is equal to the LUN queue depth. Therefore, when several virtual machines share a LUN, the maximum number of outstanding commands to that LUN from all those VMs together must not exceed the LUN queue depth.
Within ESX Server, there is a configuration parameter, Disk.SchedNumReqOutstanding, which can be configured from the Virtual Center. If the total number of outstanding commands from all virtual machines for a specific LUN exceeds this parameter, the remaining commands are queued in the ESX kernel. This parameter must always be set to the same value as the queue depth for the HBA.
To reduce latency, it is important to make sure that the sum of active commands from all virtual machines of an ESX Server does not frequently exceed the LUN queue depth. If the LUN queue depth is exceeded regularly, you can either increase the queue depth or move the virtual disks of a few virtual machines to different VMFS volumes, thereby lowering the number of virtual machines accessing a single LUN. The maximum LUN queue depth per ESX Server must not exceed 64; it can be up to 128 only when a server has exclusive access to a LUN.
VMFS is a filesystem for clustered environments, and it uses SCSI reservations during administrative operations, such as creating or deleting virtual disks or extending VMFS volumes. A reservation makes sure that, at a given time, a LUN is available to only one ESX Server exclusively. These SCSI reservations are usually only used for administrative tasks that require a metadata update. To avoid SCSI reservation conflicts in a production environment with several ESX Servers accessing shared LUNs, it might be helpful to perform those administrative tasks at off-peak hours. If this is not possible, perform the administrative tasks from the ESX Server that also hosts the I/O-intensive virtual machines; those virtual machines are impacted less, because the SCSI reservation is set at the SCSI initiator level, which means for the complete ESX Server.
The maximum number of virtual machines that can share the same LUN depends on several conditions. In general, virtual machines with heavy I/O activity result in a smaller number of possible VMs per LUN. Additionally, you must consider the already discussed LUN queue depth limits per ESX Server and the storage system specific limits.
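A hedged sketch of aligning the kernel parameter with an HBA LUN queue depth of 64 from the ESX 3 service console; verify the exact option path for your ESX release:

# Set Disk.SchedNumReqOutstanding to the same value as the HBA LUN queue depth
esxcfg-advcfg -s 64 /Disk/SchedNumReqOutstanding
# Display the current value
esxcfg-advcfg -g /Disk/SchedNumReqOutstanding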
RDM offers two configuration modes: virtual compatibility mode and physical compatibility mode. When using physical compatibility mode, all SCSI commands toward the virtual disk are passed directly to the device, which means all physical characteristics of the underlying hardware become apparent. Within virtual compatibility mode, the virtual disk is mapped as a file within a VMFS volume, allowing advanced file locking support and the usage of snapshots. Figure 12-7 compares both possible RDM configuration modes and VMFS.
Figure 12-7 Comparison of RDM virtual and physical modes with VMFS
The implementations of VMFS and RDM imply a possible impact on the performance of the virtual disks; therefore, all three possible implementations have been tested together with the DS8000. This section summarizes the outcome of those performance tests. In general, it turned out that the filesystem selection has only a very limited impact on performance:
- For random workloads, the measured throughput is almost equal for VMFS, RDM physical, and RDM virtual. Only for read requests of 32 KB, 64 KB, and 128 KB transfer sizes do both RDM implementations show a slight performance advantage (Figure 12-8 on page 395).
- For sequential workloads, a slight performance advantage of both RDM implementations over VMFS was verified for all transfer sizes. For all sequential write and certain read requests, the measured throughput for RDM virtual was slightly higher than for RDM physical mode, which might be caused by additional caching of data within the virtualization layer, which is not used in RDM physical mode (Figure 12-9 on page 395).
Figure 12-8 Result of random workload test for VMFS, RDM physical, and RDM virtual
Figure 12-9 Result of sequential workload test for VMFS, RDM physical, and RDM virtual
To summarize, the choice between the available filesystems, VMFS and RDM, has only a very limited influence on the I/O performance of the Virtual Machines. These tests verified a possible performance advantage for RDM of only about 2 - 3%.
Figure 12-10 Processing of a data request in an unaligned structure
An aligned partition setup makes sure that a single I/O request results in a minimum number of physical disk I/Os, eliminating the additional disk operations, which results in an overall performance improvement.

Operating systems using the x86 architecture create partitions with a master boot record (MBR) that takes up 63 sectors. This design is a relic of legacy BIOS code from personal computers that used cylinder, head, and sector addressing instead of Logical Block Addressing (LBA). The first track is always reserved for the MBR, and the first partition starts at the second track
(cylinder 0, head 1, and sector 1), which is sector 63 in LBA. Even in today's operating systems, the first 63 sectors cannot be used for data partitions; the first possible start sector for a partition is 63.

In a VMware ESX environment, because of the additional virtualization layer implemented by ESX, this partition alignment has to be performed at both layers: VMFS and the guest filesystems. Because of that additional layer, using properly aligned partitions is considered to have an even higher performance effect than in usual host setups without an additional virtualization layer. Figure 12-11 illustrates how a single I/O request is fulfilled within an aligned setup without causing additional physical disk I/O.
Figure 12-11 Processing a data request in an aligned structure
Partition alignment is a known issue in filesystems, but its effect on performance is somewhat controversial. In performance lab tests, it turned out that, in general, all workloads show a slight increase in throughput when the partitions are aligned. A significant effect can only be verified for sequential workloads: starting with transfer sizes of 32 KB and larger, we recognized performance improvements of up to 15%.

In general, aligning partitions can improve the overall performance. For random workloads, we only identified a slight effect, whereas for sequential workloads, a performance gain of about 10% seems to be realistic. So, we recommend aligning partitions, especially for mainly sequential workload characteristics.

Aligning partitions within an ESX Server environment requires two steps. First, the VMFS partition needs to be aligned, and then, the partitions within the VMware guest system filesystems have to be aligned as well for maximum effectiveness.

You can only align the VMFS partition when configuring a new datastore. When using the VI client, the new partition is automatically configured with an offset of 128 sectors (64 KB). In fact, this configuration is not ideal when using DS8000 disk storage. Because the DS8000 uses larger stripe sizes, the offset must be configured to at least the stripe size. For RAID 5 and
RAID 10 in Open Systems attachment, the stripe size is 256 KB, and it is a good approach to set the offset to 256 KB (or 512 sectors). Configuring an individual offset is only possible from the ESX Server command line. Example 12-3 shows how to create an aligned partition with an offset of 512 sectors using fdisk.
Example 12-3 Creating an aligned VMFS partition using fdisk

fdisk /dev/sdf                          #invoke fdisk for /dev/sdf

Device contains neither a valid DOS partition table, nor Sun, SGI or OSF disklabel
Building a new DOS disklabel. Changes will remain in memory only,
until you decide to write them. After that, of course, the previous
content won't be recoverable.

The number of cylinders for this disk is set to 61440.
There is nothing wrong with that, but this is larger than 1024,
and could in certain setups cause problems with:
1) software that runs at boot time (e.g., old versions of LILO)
2) booting and partitioning software from other OSs
   (e.g., DOS FDISK, OS/2 FDISK)
Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite)

Command (m for help): n                 #create a new partition
Command action
   e   extended
   p   primary partition (1-4)
p
Partition number (1-4): 1
First cylinder (1-61440, default 1):
Using default value 1
Last cylinder or +size or +sizeM or +sizeK (1-61440, default 61440):
Using default value 61440

Command (m for help): t                 #set partition's system id
Selected partition 1
Hex code (type L to list codes): fb     #fb = VMware VMFS volume
Changed system type of partition 1 to fb (Unknown)

Command (m for help): x                 #enter expert mode

Expert command (m for help): b          #set starting block number
Partition number (1-4): 1
New beginning of data (32-125829119, default 32): 512    #partition offset set to 512

Expert command (m for help): w          #save changes
The partition table has been altered!

Calling ioctl() to re-read partition table.
Syncing disks.

fdisk -lu /dev/sdf                      #check the partition config

Disk /dev/sdf: 64.4 GB, 64424509440 bytes
64 heads, 32 sectors/track, 61440 cylinders, total 125829120 sectors
Units = sectors of 1 * 512 = 512 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sdf1             512   125829119    62914304   fb  Unknown
Afterwards a new VMFS filesystem has to be created within the aligned partition using the vmkfstools command as shown in Example 12-4.
Example 12-4 Creating a VMFS volume using vmkfstools

vmkfstools -C vmfs3 -b 1m -S LUN0 vmhba2:0:0:1
Creating vmfs3 file system on "vmhba2:0:0:1" with blockSize 1048576 and volume label "LUN0".
Successfully created new volume: 490a0a3b-cabf436e-bf22-001a646677d8
As the last step, all the partitions at the virtual machine level must be aligned as well. This task needs to be performed from the operating system of each VM using the available tools. For example, for Windows, use the diskpart utility as shown in Example 12-5. Windows only allows you to align basic partitions, and the offset size is set in KB (not in sectors).
Example 12-5 Creating an aligned NTFS partition using diskpart

DISKPART> create partition primary align=256

DiskPart succeeded in creating the specified partition.

DISKPART> list partition

  Partition ###  Type              Size     Offset
  -------------  ----------------  -------  -------
* Partition 1    Primary             59 GB   256 KB

DISKPART>
You can obtain additional information about aligning VMFS partitions and the performance effects from the document VMware Infrastructure 3: Recommendations for aligning VMFS partitions at: http://www.vmware.com/pdf/esx3_partition_align.pdf
In Virtual Machines running Windows with the LSILogic driver, I/Os larger than 64 KB might be split into multiple I/Os with a maximum size of 64 KB, which can negatively affect I/O performance. You can improve I/O performance by editing the following registry setting:
HKLM\SYSTEM\CurrentControlSet\Services\Symmpi\Parameters\Device\MaximumSGList
For further information, refer to Large Block Size Support at:
http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&docType=kc&externalId=9645697&sliceId=1
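As a hedged illustration, the registry value can be changed from the guest command line with the standard reg utility. The value of 255 shown here is an assumption taken as a commonly used maximum; validate it against the VMware knowledge base article above, and note that the guest requires a reboot for the change to take effect.

reg add "HKLM\SYSTEM\CurrentControlSet\Services\Symmpi\Parameters\Device" /v MaximumSGList /t REG_DWORD /d 255
reg query "HKLM\SYSTEM\CurrentControlSet\Services\Symmpi\Parameters\Device" /v MaximumSGList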
Chapter 13. Performance considerations with Linux
The architecture discussed here applies to Open Systems servers attached to DS8000 using the Fibre Channel Protocol (FCP). If Linux is installed on System z servers, a special disk I/O setup might apply depending on the specific hardware implementation and configuration. For further information about disk I/O setup and configuration for System z, refer to the IBM Redbooks publication Linux for IBM System z9 and IBM zSeries, SG24-6694.
For a quick overview of overall I/O subsystem operations, we use the example of writing data to a disk. The following sequence outlines the fundamental operations that occur when a disk-write operation is performed, assuming that the file data is on sectors on the disk platters, has already been read, and is in the page cache:
1. A process requests to write a file through the write() system call.
2. The kernel updates the page cache mapped to the file.
3. A pdflush kernel thread takes care of flushing the page cache to disk (a tuning sketch follows this sequence).
4. The filesystem layer puts each block buffer together into a bio struct (refer to 13.2.3, Block layer on page 405) and submits a write request to the block device layer.
5. The block device layer gets the requests from the upper layers, performs an I/O elevator operation, and puts the requests into the I/O request queue.
6. A device driver, such as the Small Computer System Interface (SCSI) driver or another device-specific driver, takes care of the write operation.
7. The disk device firmware performs the hardware operations, such as seeking the head, rotation, and data transfer to the sector on the platter.

This sequence is simplified, because it only reflects I/Os to local physical disks (a SCSI disk attached via a native SCSI adapter). Storage configurations using additional virtualization layers and SAN attachment, such as the DS8000 storage system, require additional operations and layers.
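The pdflush writeback behavior mentioned in step 3 can be tuned through the /proc/sys/vm interface on 2.6 kernels. The following sketch only shows how to inspect and adjust the thresholds; the values are examples, and the right settings depend on the workload and the amount of memory in the server.

# Display the current writeback thresholds (percent of memory that may be dirty)
sysctl vm.dirty_background_ratio vm.dirty_ratio
# Example values only: start background writeback earlier and cap dirty memory at 20%
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=20
# Add the same settings to /etc/sysctl.conf to make them persistent across reboots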
(Figure: temporal locality and spatial locality. With temporal locality, data that was recently staged from memory into the CPU cache and registers is likely to be accessed again on subsequent accesses; with spatial locality, data located next to recently accessed data, such as Data1 and Data2, is likely to be accessed together.)
Linux uses this principle in many components, such as page cache, file object cache (i-node cache, directory entry cache, and so on), read ahead buffer, and more.
SCSI
The Small Computer System Interface (SCSI) is the most commonly used I/O device technology, especially in the enterprise server environment. In Linux kernel implementations, SCSI devices are controlled by device driver modules, which consist of the following types (Figure 13-3 on page 406):
- Upper level drivers: sd_mod, sr_mod (SCSI-CDROM), st (SCSI tape), sg (SCSI generic device), and so on. These drivers provide the functionality to support several types of SCSI devices, such as SCSI CD-ROM, SCSI tape, and so on.
- Middle level driver: scsi_mod. This driver implements the SCSI protocol and common SCSI functionality.
- Low level drivers. These drivers provide lower level access to each device. A low level driver is basically specific to a hardware device and is provided for each device, for example, ips for the IBM ServeRAID controller, qla2300 for the QLogic HBA, mptscsih for the LSI Logic SCSI controller, and so on.
- Pseudo driver: ide-scsi. This driver is used for IDE-SCSI emulation.
If specific functionality is required for a device, it must be implemented in the device firmware and in the low level device driver. The supported functionality depends on which hardware you use and which version of the device driver you use; the device itself must also support the desired functionality. Specific functions are usually tuned by device driver parameters.
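To see which parameters a given low level driver exposes for tuning, you can query the module itself, as in the following quick sketch; the qla2xxx module name is an example, and the parameter list differs between driver versions and vendors.

# List the tunable parameters and their descriptions for the QLogic driver
modinfo -p qla2xxx
# Show the currently loaded SCSI-related modules
lsmod | grep -e scsi -e qla -e lpfc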
To configure the HBA properly, refer to the IBM TotalStorage DS8000: Host Systems Attachment Guide, SC26-7625, which includes detailed procedures and recommended settings. Also, read the readme files and manuals of the driver, BIOS, and HBA.

Each HBA driver allows you to configure several parameters. The list of available parameters depends on the specific HBA type and driver implementation. If these settings are not configured correctly, performance might suffer or the system might not work properly. You can configure each parameter either temporarily or persistently. For temporary configurations, you can use the modprobe command. Persistent configuration is performed by editing the following file (based on the distribution):
- /etc/modprobe.conf for RHEL
- /etc/modprobe.conf.local for SLES

To set the queue depth of an Emulex HBA to 20, add the following line to modprobe.conf(.local):
options lpfc lpfc_lun_queue_depth=20

Specific HBA types support failover at the HBA level, for example, QLogic HBAs. When using Device Mapper - Multipath I/O (DM-MPIO) or the Subsystem Device Driver (SDD) for multipathing, this failover at the HBA level needs to be manually disabled. To disable failover on a QLogic qla2xxx adapter, simply add the following line to modprobe.conf(.local):
options qla2xxx ql2xfailover=0

For performance reasons, the queue depth parameter and the various timeout and retry parameters that apply in the case of path errors are the most interesting. Changing the queue depth allows you to queue more outstanding I/Os at the adapter level, which can, in certain configurations, have a positive effect on throughput. However, increasing the queue depth cannot be recommended in general, because it can slow performance or cause delays, depending on the actual configuration. Thus, the complete setup needs to be checked carefully before adjusting the queue depth. Change the I/O timeout and retry values when using DM-MPIO, which handles the path failover and recovery scenarios. In those cases, we recommend that you decrease those values to allow a fast reaction of the multipathing module in case of path or adapter problems.
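The following sketch shows the temporary variant mentioned above for an Emulex HBA. It assumes that the lpfc module can be unloaded (no devices in use) and that a queue depth of 20 is appropriate for the configuration, so treat it as an example rather than a recommendation.

# Temporary change: reload the Emulex driver with a different LUN queue depth
rmmod lpfc
modprobe lpfc lpfc_lun_queue_depth=20
# Verify the active value (the sysfs path can vary by driver version)
cat /sys/module/lpfc/parameters/lpfc_lun_queue_depth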
The purpose of multipathing software, such as SDD or DM-MPIO based on the Device Mapper, is to balance the workload of I/O operations across all available paths as well as to detect defective links and fail over to the remaining links.

Currently (as of November 2008), IBM only supports SDD for the SLES 8 and 9 and RHEL 3 and 4 versions. For the new distribution releases, SLES 10 and RHEL 5, DM-MPIO is the only supported multipathing solution. We recommend using DM-MPIO if possible for your system configuration. DM-MPIO is already the preferred multipathing solution for most Linux 2.6 kernels. It is also available for 2.4 kernels but needs to be manually included and configured during kernel compilation. DM-MPIO is the required multipathing setup for LVM2. Further information about supported distribution releases, kernel versions, and multipathing software is documented at the IBM Subsystem Device Driver for Linux Web site:
http://www.ibm.com/support/docview.wss?rs=540&context=ST52G7&uid=ssg1S4000107

DM-MPIO provides round-robin load balancing for up to eight paths per LUN. The userspace component is responsible for automated path discovery and grouping, as well as path handling and retesting of previously failed paths. The framework is extensible for hardware-specific functions and additional load balancing or failover algorithms.

IBM provides a device-specific configuration file for the DS8000 for the supported levels of Red Hat Enterprise Linux (RHEL) and SUSE Linux Enterprise Server (SLES). This file needs to be copied to /etc/multipath.conf before the multipath driver and multipath tools are started. It sets default parameters for the scanned LUNs and creates user friendly names for the multipath devices that are managed by DM-MPIO. Further configuration, such as adding aliases for certain LUNs or blacklisting specific devices, can be done manually by editing this file (a minimal sketch follows the reference list below). Using DM-MPIO, you can configure various path failover policies, path priorities, and failover priorities. This type of configuration can be done for each device individually in the /etc/multipath.conf setup.

When using DM-MPIO, consider changing the default HBA timeout settings. If a path fails, the failure must be reported to the multipath module as fast as possible to avoid delays caused by I/O retries at the HBA level. DM-MPIO is then able to react quickly and fail over to one of the remaining healthy paths. This setting needs to be configured at the HBA level by editing the file /etc/modprobe.conf or /etc/modprobe.conf.local (depending on the distribution). For example, you can use the following settings for a QLogic qla2xxx HBA (Example 13-1).
Example 13-1 HBA settings in /etc/modprobe.conf.local

cat /etc/modprobe.conf.local
#
# please add local extensions to this file
#
options qla2xxx qlport_down_retry=1 ql2xfailover=0 ql2xretrycount=5
For further configuration and setup information, refer to the following publications:
- For SLES:
http://www.novell.com/documentation/sles10/stor_evms/index.html?page=/documentation/sles10/stor_evms/data/mpiotools.html
- For RHEL:
http://www.redhat.com/docs/manuals/enterprise/RHEL-5-manual/en-US/RHEL510/DM_Multipath/
- Considerations and comparisons between IBM SDD for Linux and DM-MPIO:
http://www.ibm.com/support/docview.wss?uid=ssg1S7001664&rs=555
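For orientation only, a minimal /etc/multipath.conf stanza for DS8000 LUNs might look like the following sketch. The DS8000 reports the SCSI product ID 2107900; the alias, the example WWID, and the multibus path grouping policy are assumptions for illustration, and the IBM-provided configuration file for your distribution level remains the authoritative source of the recommended settings.

defaults {
        user_friendly_names yes
}
devices {
        device {
                vendor                  "IBM"
                product                 "2107900"
                path_grouping_policy    multibus    # spread I/O across all available paths
        }
}
multipaths {
        multipath {
                wwid    36005076303ffc08c0000000000001234    # example WWID only
                alias   ds8000_data01
        }
}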
Using the command mdadm --detail --scan returns the configuration of all software RAIDs.
Example 13-3 Displaying the configuration of all software RAIDs

x345-tic-20:/dev/disk/by-name # mdadm --detail --scan
ARRAY /dev/md0 level=raid0 num-devices=2 UUID=9fd0170a:d18e3d37:2d44e795:d9cc87b4
You can obtain further documentation about how to use the command line RAID tools in Linux from: http://tldp.org/HOWTO/Software-RAID-HOWTO-5.html
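As a simple sketch of how such a software RAID 0 might be created across two DS8000 LUNs, the following commands stripe two multipath devices with a 64 KB chunk size; the device names, chunk size, and filesystem are assumptions that need to be adapted to the actual configuration.

# Stripe two multipath devices into one software RAID 0 array
mdadm --create /dev/md0 --level=0 --raid-devices=2 --chunk=64 \
      /dev/mapper/ds8000_data01 /dev/mapper/ds8000_data02
# Create a filesystem on the new array and verify the layout
mkfs.ext3 /dev/md0
mdadm --detail /dev/md0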
Figure 13-4 LVM striped mapping of three LUNs to a single logical volume
Furthermore, LVM2 offers additional functions and flexibility:
- Logical volumes can be resized during operation.
- Data from one physical volume can be relocated during operation, for example, in data migration scenarios.
- Logical volumes can be mirrored between several physical volumes for redundancy.
- Logical volume snapshots can be created for backup purposes.

With the Linux 2.6 kernel levels, only LVM2 is supported. This configuration also uses the Device Mapper multipathing functionality; thus, every LUN can be an mpath device that is available via several paths for redundancy.

Both basic functions, host-based striping and host-based mirroring, can be implemented either with the software RAID functions using mdadm or with the Logical Volume Management functions using LVM2. From a performance point of view, both solutions deliver comparable results, perhaps with a slight performance advantage for mdadm because of its lower implementation overhead compared to LVM. Both implementations can also be configured using the EVMS management functions. Further documentation about LVM2 can be obtained from:
http://sources.redhat.com/lvm2/
http://www.tldp.org/HOWTO/LVM-HOWTO/index.html
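The following sketch outlines how a striped logical volume, comparable to Figure 13-4, could be built on three DS8000 LUNs with LVM2; the device names, volume group name, 64 KB stripe size, and volume size are illustrative assumptions only.

# Prepare three multipath devices as physical volumes
pvcreate /dev/mapper/mpath0 /dev/mapper/mpath1 /dev/mapper/mpath2
# Group them into one volume group
vgcreate ds8000vg /dev/mapper/mpath0 /dev/mapper/mpath1 /dev/mapper/mpath2
# Create a 100 GB logical volume striped across all three physical volumes (-i 3)
# with a 64 KB stripe size (-I 64)
lvcreate -i 3 -I 64 -L 100G -n datalv ds8000vg
lvdisplay -m /dev/ds8000vg/datalv      # verify the striped mapping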
With the capability to have different I/O elevators per disk subsystem, the administrator now can isolate a specific I/O pattern on a disk subsystem (such as write intensive workloads) and select the appropriate elevator algorithm (a sketch of how to select the elevator follows this list):
- Synchronous filesystem access: Certain types of applications need to perform filesystem operations synchronously, which can be true for databases that might even use a raw filesystem or for large disk subsystems where caching asynchronous disk accesses simply is not an option. In those cases, the anticipatory elevator usually has the least throughput and the highest latency. The three other schedulers perform equally well up to an I/O size of roughly 16 KB, where the CFQ and NOOP elevators begin to outperform the deadline elevator (unless disk access is very seek-intensive).
- Complex disk subsystems: Benchmarks have shown that the NOOP elevator is an interesting alternative in high-end server environments. When using configurations with enterprise-class disk subsystems, such as the DS8000, the lack of ordering capability of the NOOP elevator becomes its strength. Enterprise-class disk subsystems can contain multiple SCSI or Fibre Channel disks that each have individual disk heads and data striped across the disks. It becomes difficult for an I/O elevator to correctly anticipate the I/O characteristics of such complex subsystems, so you might often observe at least equal performance at less overhead when using the NOOP I/O elevator. Most large scale benchmarks that use hundreds of disks most likely use the NOOP elevator.
- Database systems: Due to the seek-oriented nature of most database workloads, some performance gain can be achieved when selecting the deadline elevator for these workloads.
- Virtual machines: Virtual machines, regardless of whether in VMware or VM for System z, usually communicate through a virtualization layer with the underlying hardware. So, a virtual machine is not aware of whether the assigned disk device consists of a single SCSI device or an array of Fibre Channel disks on a DS8000. The virtualization layer takes care of the necessary I/O reordering and the communication with the physical block devices.
- CPU-bound applications: While certain I/O schedulers can offer superior throughput, they can at the same time create more system overhead. The overhead that, for instance, the CFQ or deadline elevators cause comes from aggressively merging and reordering the I/O queue. Sometimes, the workload is not so much limited by the performance of the disk subsystem as by the performance of the CPU, for example, with a scientific workload or a data warehouse processing very complex queries. In these scenarios, the NOOP elevator offers an advantage over the other elevators, because it causes less CPU overhead, as shown in Figure 13-5 on page 414. However, note that when comparing CPU overhead to throughput, the deadline and CFQ elevators are still the best choices for most access patterns to asynchronous filesystems.
- Single ATA or SATA disk subsystems: If you choose to use a single physical ATA or SATA disk, for example, for the boot partition of your Linux system, consider using the anticipatory I/O elevator, which reorders disk writes to accommodate the single disk head found in these devices.
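As referenced above, the elevator can be selected per block device on newer 2.6 kernels through sysfs, or globally at boot time. The device name sdb and the choice of NOOP below are examples only, and the sysfs interface is not available on every distribution kernel.

# Show the available elevators; the active one is displayed in brackets
cat /sys/block/sdb/queue/scheduler
noop anticipatory deadline [cfq]
# Switch this device to the NOOP elevator (example choice for a DS8000 LUN)
echo noop > /sys/block/sdb/queue/scheduler
# Alternatively, set the elevator globally with the kernel boot parameter, for example:
#   elevator=noop   (added to the kernel line in the boot loader configuration)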
13.3.6 Filesystems
The filesystems that are available for Linux have been designed with different workload and availability characteristics in mind. If your Linux distribution and the application allow the selection of a different filesystem, it might be worthwhile to investigate whether Ext, the Journaled File System (JFS), ReiserFS, or the eXtended File System (XFS) is the optimal choice for the planned workload.

Generally speaking, ReiserFS is more suited to accommodate small I/O requests, whereas XFS and JFS are tailored toward large filesystems and large I/O sizes. Ext3 fills the gap between ReiserFS and JFS/XFS, because it can accommodate small I/O requests while offering good multiprocessor scalability. The workload patterns that JFS and XFS are best suited for are high-end data warehouses, scientific workloads, large Symmetric Multi Processor (SMP) servers, and streaming media servers. ReiserFS and Ext3 are typically used for file, Web, or mail serving. For write-intense workloads that create smaller I/Os of up to 64 KB, ReiserFS might have an edge over Ext3 with the default journaling mode. However, this advantage only applies to synchronous file operations.

An option to consider is the Ext2 filesystem. Due to its lack of journaling abilities, Ext2 outperforms ReiserFS and Ext3 for synchronous filesystem access regardless of the access pattern and I/O size. So, Ext2 might be an option when performance is more important than data integrity.
Figure 13-5 Random write throughput comparison between Ext and ReiserFS (synchronous)
In the most common scenario of an asynchronous filesystem, ReiserFS most often delivers solid performance and outperforms Ext3 with the default journaling mode (data=ordered). However, Ext3 is equal to ReiserFS as soon as the default journaling mode is switched to writeback.
small I/O sizes as shown in Figure 13-6 on page 416. The benefit of using writeback journaling declines as I/O sizes grow. Also, note that the journaling mode of your filesystem only impacts write performance. Therefore, a workload that performs mainly reads (for example, a Web server) will not benefit from changing the journaling mode.
There are three ways to change the journaling mode of a filesystem:
- When executing the mount command:
mount -o data=writeback /dev/sdb1 /mnt/mountpoint
In this example, /dev/sdb1 is the filesystem being mounted.
- By including the option in the options section of the /etc/fstab file:
/dev/sdb1 /testfs ext3 defaults,data=writeback 0 0
- If you want to modify the default data=ordered option on the root partition, make the change to the /etc/fstab file, and then execute the mkinitrd command to scan the changes in the /etc/fstab file and create a new image. Update grub or lilo to point to the new image.
Blocksizes

The blocksize, the smallest amount of data that can be read or written to a drive, can have a direct impact on a server's performance. As a guideline, if your server handles many small files, a smaller blocksize is more efficient. If your server is dedicated to handling large files, a larger blocksize might improve performance. Blocksizes cannot be changed dynamically on existing filesystems; only a reformat will modify the current blocksize. Most Linux distributions allow blocksizes of 1 KB, 2 KB, or 4 KB. As benchmarks have shown, there is hardly any performance improvement to be gained from changing the blocksize of a filesystem, so it is generally better to leave it at the default of 4 KB.
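If a non-default blocksize is nevertheless required, it has to be chosen when the filesystem is created, as in the following sketch; the device name and the choice of ext3 are assumptions for illustration.

# Create an ext3 filesystem with an explicit 4 KB blocksize (the usual default)
mkfs.ext3 -b 4096 /dev/sdb1
# Verify the blocksize of an existing ext2/ext3 filesystem
tune2fs -l /dev/sdb1 | grep "Block size"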
Analysis

If only a single swap device or area is used, it might cause performance problems. To improve performance, create multiple swap devices or areas.
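One hedged way to let the kernel use several swap areas in parallel is to give them the same priority in /etc/fstab, as in the following sketch; the device names are examples, and each swap partition should ideally reside on a different physical disk or LUN.

# /etc/fstab entries: equal pri= values make the kernel distribute swap activity
# across the devices in a round-robin fashion
/dev/sda2    swap    swap    defaults,pri=3    0 0
/dev/sdb2    swap    swap    defaults,pri=3    0 0
/dev/sdc2    swap    swap    defaults,pri=3    0 0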
With the iostat tool, you can monitor the I/O device loading in real time. Various options enable you to drill down even deeper to gather the necessary data. Example 13-6 shows a potential I/O bottleneck on the device /dev/sdb1: the output shows average wait times (await) of about 2.7 seconds and service times (svctm) of about 270 ms.
Example 13-6 Sample of an I/O bottleneck as shown with iostat 2 -x /dev/sdb1
[root@x232 root]# iostat 2 -x /dev/sdb1

avg-cpu:  %user   %nice    %sys   %idle
          11.50    0.00    2.00   86.50

Device:    rrqm/s  wrqm/s   r/s   w/s  rsec/s   wsec/s   rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
/dev/sdb1  441.00 3030.00  7.00 30.50 3584.00 24480.00 1792.00 12240.00   748.37   101.70 2717.33 266.67 100.00

avg-cpu:  %user   %nice    %sys   %idle
          10.50    0.00    1.00   88.50

Device:    rrqm/s  wrqm/s   r/s   w/s  rsec/s   wsec/s   rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
/dev/sdb1  441.00 3030.00  7.00 30.00 3584.00 24480.00 1792.00 12240.00   758.49   101.65 2739.19 270.27 100.00

avg-cpu:  %user   %nice    %sys   %idle
          10.95    0.00    1.00   88.06

Device:    rrqm/s  wrqm/s   r/s   w/s  rsec/s   wsec/s   rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
/dev/sdb1  438.81 3165.67  6.97 30.35 3566.17 25576.12 1783.08 12788.06   781.01   101.69 2728.00 268.00 100.00
Example 13-7 shows the output of the iostat command on an LPAR configured with a 1.2 CPU running Red Hat Enterprise Linux AS 4 while issuing server writes to the disks sda and dm-2, where the transfers per second are 130 for sda and 692 for dm-2. Also, the iowait is 6.37%.
Example 13-7 Shows output of iostat

# iostat
Linux 2.6.9-42.EL (rhel)      09/29/2006

avg-cpu:  %user   %nice    %sys %iowait   %idle
           2.70    0.11    6.50    6.37   84.32

Device:     tps    Blk_read/s    Blk_read    Blk_wrtn
sda      130.69       1732.56      265688      893708
...        1.24          2.53         388           0
...        4.80          5.32         816           4
...      790.73       1717.40      263364      893704
...       96.19       1704.71      261418       44840
...        0.29          2.35         360           0
dm-2     692.66          6.38         978      848864
Example 13-8 shows the output of the iostat command on an LPAR configured with a 1.2 CPU running Red Hat Enterprise Linux AS 4 issuing server writes to the disks sda and dm-2, where the transfers per second are 428 for sda and 4024 for dm-2. Also, the iowait has gone up to 12.42%.
Example 13-8 Shows output of iostat to illustrate disk I/O bottleneck

# iostat
Linux 2.6.9-42.EL (rhel)      09/29/2006

avg-cpu:  %user   %nice    %sys %iowait
           2.37    0.20   27.22   12.42   ...

Device:     tps    Blk_read/s   ...
sda      428.14        235.64
sda1       0.17          0.34
...
Changes made to the elevator algorithm as described in 13.3.5, Tuning the disk I/O scheduler on page 412 will be seen in avgrq-sz (average size of request) and avgqu-sz (average queue length). As the latencies are lowered by manipulating the elevator settings, avgrq-sz decreases. You can also monitor the rrqm/s and wrqm/s to see the effect on the number of merged reads and writes that the disk can manage.
sar command
The sar command, which is included in the sysstat package, uses the standard system activity daily data file to generate a report. The system has to be configured to collect and log this information; therefore, a cron job must be set up. Example 13-9 shows the lines to add to /etc/crontab for automatic log reporting with cron.
Example 13-9 Example of automatic log reporting with cron

....
# 8am-7pm activity reports every 10 minutes during weekdays.
0 8-18 * * 1-5 /usr/lib/sa/sa1 600 6 &
# 7pm-8am activity reports every hour during weekdays.
0 19-7 * * 1-5 /usr/lib/sa/sa1 &
# Activity reports every hour on Saturday and Sunday.
0 * * * 0,6 /usr/lib/sa/sa1 &
# Daily summary prepared at 19:05
5 19 * * * /usr/lib/sa/sa2 -A &
....
You get a detailed overview of your CPU utilization (%user, %nice, %system, %idle), memory paging, network I/O and transfer statistics, process creation activity, activity for block devices, and interrupts/second over time. Using sar -A (the -A is equivalent to -bBcdqrRuvwWy -I SUM -I PROC -n FULL -U ALL, which selects the most relevant counters of the system) is the most effective way to gather all relevant performance counters.

Using sar is recommended to analyze whether a system is disk I/O-bound and spending a lot of time waiting, which results in filled-up memory buffers and low CPU usage. Furthermore, this method is useful to monitor the overall system performance over a longer period of time, for example, days or weeks, to further understand at which times a claimed performance bottleneck occurs.

A number of additional performance data collection utilities are available for Linux, most of them ported from UNIX systems. You can obtain more details about those additional tools in Chapter 11, Performance considerations with UNIX servers on page 307.
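For a quick interactive check, sar can also be run ad hoc or against the collected daily file, as in the following sketch; the interval, count, and file name are examples, and the file name depends on the day of the month.

# Report CPU utilization every 10 seconds, 6 times
sar -u 10 6
# Report block device activity from a collected daily data file (example for the 29th)
sar -d -f /var/log/sa/sa29
# Dump all collected counters for post-processing
sar -A -f /var/log/sa/sa29 > sar_all.txt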
Chapter 14. IBM System Storage SAN Volume Controller attachment
Note: The preferred node by no means signifies absolute ownership. The data will still be accessed by the partner node in the I/O Group in the event of a failure or if the preferred node workload becomes too high.

Beyond automatic configuration and cluster administration, the data transmitted from attached application servers is also treated in the most reliable manner. When data is written by the server, the preferred node within the I/O Group stores the write in its own write cache and in the write cache of its partner (non-preferred) node before sending an I/O complete status back to the server application. To ensure that data is written in the event of a node failure, the surviving node empties its write cache and proceeds in write-through mode until the cluster is returned to a fully operational state.

Note: Write-through mode is where the data is not cached in the nodes but is written directly to the disk subsystem instead. While operating in this mode, performance is slightly degraded.

Furthermore, each node in the I/O Group is protected by its own dedicated uninterruptible power supply (UPS).
Note: For performance considerations, we recommend that you create Managed Disk Groups using only MDisks that have the same characteristics in terms of performance and reliability.

An MDG provides a pool of capacity (extents), which is used to create volumes, known as Virtual Disks (VDisks). When creating VDisks, the default option of striped allocation is normally the best choice. This option helps to balance I/Os across all the managed disks in an MDG, which optimizes overall performance and helps to reduce hot spots. Conceptually, this method is represented in Figure 14-1.
The virtualization function in the SAN Volume Controller maps the VDisks seen by the application servers onto the MDisks provided by the back-end controllers. I/O traffic for a particular VDisk is, at any one time, handled exclusively by the nodes in a single I/O Group. Thus, although a cluster can have many nodes within it, the nodes handle I/O in independent pairs, which means that the I/O capability of the SAN Volume Controller scales well (almost linearly), because additional throughput can be obtained by simply adding additional I/O Groups. Figure 14-2 on page 425 summarizes the various relationships that bridge the physical disks through to the virtual disks within the SAN Volume Controller architecture.
(Figure 14-2 callouts: Storage subsystem SCSI LUNs are directly mapped to the SVC cluster as Managed Disks. Managed Disks are grouped into Managed Disk Groups (storage pools) depending on their characteristics. Virtual Disks are created within a Managed Disk Group and are mapped to the hosts. The SVC isolates hosts from any storage modifications and manages the relationship between Virtual Disks and Managed Disks.)
Note: SAN Volume Controller Copy Services functions are not compatible with the ESS, DS6000, and DS8000 Copy Services.
For details about the configuration and management of SAN Volume Controller Copy Services, refer to SVC V4.2.1 Advanced Copy Services, SG24-7574.

A FlashCopy mapping can be created between any two VDisks in a cluster. It is not necessary for the VDisks to be in the same I/O Group or in the same Managed Disk Group. This functionality provides the ability to optimize your storage allocation by using a secondary storage subsystem (with, for example, lower performance) as the target of the FlashCopy. In this case, the resources of your high performance storage subsystem are dedicated to production, while your low-cost (lower performance) storage subsystem is used for a secondary application (for example, backup or development).

An advantage of SAN Volume Controller remote copy is that we can implement such relationships between two SAN Volume Controller clusters with different back-end disk subsystems. In this case, you can reduce the overall cost of the disaster recovery infrastructure. The production site can use high performance back-end disk subsystems, and the recovery site can use low-cost back-end disk subsystems, even where the back-end disk subsystem Copy Services functions are not compatible (for example, different models or different manufacturers). This relationship is established at the VDisk level and does not depend on the back-end disk storage subsystem Copy Services.

Important: For Metro Mirror copies, the recovery site VDisks need to have performance characteristics similar to the production site VDisks when a high write I/O rate is present in order to maintain the I/O response level for the host system.
To see how busy your CPUs are, you can use the TotalStorage Productivity Center performance report by selecting CPU Utilization. Several of the activities that affect CPU utilization are:
- VDisk activity: The preferred node is responsible for I/Os to the VDisk and coordinates sending the I/Os to the alternate node. While both nodes will exhibit similar CPU utilization, the preferred node is a little busier. To be precise, a preferred node is always responsible for the destaging of writes for the VDisks that it owns.
- Cache management: The purpose of the cache component is to improve the performance of read and write commands by holding some read or write data in SVC memory. Because the nodes in a caching pair have physically separate memories, the cache component must keep the caches on both nodes consistent.
- FlashCopy activity: Each node (of the FlashCopy source) maintains a copy of the bitmap; CPU utilization is similar.
- Mirror Copy activity: The preferred node is responsible for coordinating copy information to the target and also for ensuring that the I/O Group is up-to-date with the copy progress information or change block information. As soon as Global Mirror is enabled, there is an additional 10% overhead on I/O work due to the buffering and general I/O overhead of performing asynchronous remote copy.

With a newly added I/O Group, the SVC cluster can potentially double the I/O rate (IOPS) that it can sustain. An SVC cluster can be scaled up to an eight node cluster, which quadruples the total I/O rate.
Note: The SAN Volume Controller extent size does not have a great impact on the performance of an SVC installation, and the most important consideration is to be consistent across MDGs. You must use the same extent size in all MDGs within an SVC cluster to avoid limitations when migrating VDisks from one MDG to another MDG. For additional information, refer to the SAN Volume Controller Best Practices and Performance Guidelines, SG24-7521.
It is when the rank is assigned that the DS8000 processor complex (or server group) affinity is determined. To balance ranks on each device adapter (DA), we recommend that you assign equal numbers of ranks of a given capacity on each DA pair to each DS8000 processor complex as shown in Figure 14-1 on page 424.
Figure 14-3 Example showing configuration with multiple ranks to an extent pool
In this example, an extent pool (Extent Pool 1) is defined on a DS8000. This extent pool includes three ranks, each of which is 519 GB. The overall capacity of this extent pool is 1.5 TB. This capacity is available for LUN creation as a set of 1 GB DS8000 extents.
In this pool of available extents, we create one DS8000 Logical Volume, called Volume0, which contains all the extents in the extent pool. Volume0 is 1.5 TB. Due to the DS8000 internal Logical Volume creation algorithm, the extents from rank1 are assigned first, then the extents of rank2, and then the extents of rank3. In this case, the data stored on the first third of Volume0 is physically located on rank1, the second third on rank2, and the last third on rank3.

When Volume0 is assigned to the SAN Volume Controller, the Logical Volume is identified by the SAN Volume Controller cluster as a Managed Disk, MDiskB. MDiskB is assigned to a Managed Disk Group, MDG0, where the SAN Volume Controller extent size is defined as 512 MB. Two other Managed Disks, MDiskA and MDiskC, both 1.5 TB, from the same DS8000 but from different extent pools, are defined in this Managed Disk Group. These extent pools are configured similarly to Extent Pool 1. The overall capacity of the Managed Disk Group is 4.5 TB. This capacity is available through a set of 512 MB SAN Volume Controller extents.

Next, a SAN Volume Controller Virtual Disk called VDisk0 is created in Managed Disk Group 0. VDisk0 is 50 GB, or 100 SAN Volume Controller extents. VDisk0 is created in SAN Volume Controller Striped mode in the hope of obtaining optimum performance. But actually, a performance bottleneck was just created.

When VDisk0 was created, it was assigned sequentially one SAN Volume Controller extent from MDiskA, then one SAN Volume Controller extent from MDiskB, then one extent from MDiskC, and so on. In total, VDisk0 was assigned the first 34 extents of MDiskA, the first 33 of MDiskB, and the first 33 of MDiskC. This is where the bottleneck occurs. All of the first 33 extents used from MDiskB are physically located at the beginning of Volume0, which means that all of these extents belong to DS8000 rank1. This configuration does not follow the performance recommendation to spread the workload assigned to VDisk0 across all the ranks defined in the extent pool. In this case, performance is limited to the performance of a single rank.

Furthermore, if the configurations of MDiskA and MDiskC are equivalent to MDiskB, the data stored on VDisk0 is spread across only three of the nine ranks available within the three DS8000 extent pools used by the SAN Volume Controller. This example shows the bottleneck for VDisk0, but more generally, almost all of the VDisks created in this Managed Disk Group will be spread over only three ranks instead of the nine ranks available.

Important: The configuration presented in Figure 14-3 on page 431 is not optimized for performance in a SAN Volume Controller environment.
Note: The rotate extents algorithm for Storage Pool Striping within an extent pool is not recommended for use with the SVC. There is no perceived advantage, because the SVC will stripe across MDisks by default.
Figure 14-4 Example showing configuration with a single rank per extent pool
In this example, nine extent pools are defined on a DS8000. Each extent pool includes only one rank of 519 GB, so the overall capacity of each extent pool is 519 GB. This capacity is available through a set of 1 GB DS8000 extents. In each extent pool, we create one volume that is assigned all the capacity of the extent pool. The nine volumes created each have a size of 519 GB.

Volume1 through Volume9 are assigned to the SAN Volume Controller. These volumes are identified by the SAN Volume Controller cluster as Managed Disks, MDisk1 through MDisk9. These Managed Disks are assigned to a Managed Disk Group, MDG0, where the SAN Volume Controller extent size is defined as 512 MB. The overall capacity of the Managed Disk Group is 4.5 TB.

A Virtual Disk (VDisk0) of 50 GB (100 SAN Volume Controller extents) is created in this storage pool. The Virtual Disk is created in SAN Volume Controller Striped mode in order to obtain the greatest performance. This mode implies that VDisk0 is assigned sequentially one extent from MDisk1, then one extent from MDisk2, and so on, until one extent has been taken from MDisk9, and the allocation then returns to MDisk1. VDisk0 is assigned the first 12 extents of MDisk1 and the first 11 extents of MDisk2 through MDisk9.

In this case, all the SAN Volume Controller extents assigned to VDisk0 are physically located on all the assigned ranks of the DS8000. This configuration permits you to spread the workload applied to VDisk0 across all nine ranks of the DS8000, so we efficiently use the available hardware for each VDisk of the Managed Disk Group.
Important: The configuration presented in Figure 14-4 on page 433 is optimized for performance in a SAN Volume Controller environment.
New volumes can be added dynamically to the SAN Volume Controller. After the volume has been added to the volume group, run the command svctask detectmdisk on the SAN Volume Controller to add it as a new MDisk. Before you delete or unmap a volume that is allocated to the SAN Volume Controller, remove the MDisk from the Managed Disk Group, which automatically migrates any extents for defined VDisks to other MDisks in the Managed Disk Group, provided there is space available. When the volume has been unmapped on the DS8000, run the command svctask detectmdisk and then run the maintenance procedure on the SAN Volume Controller to confirm its removal.
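A hedged sketch of this add and remove workflow on the SVC command line follows; the MDisk and Managed Disk Group names are examples, and the exact sequence should be verified against the SVC documentation for your code level.

svctask detectmdisk                               # discover the newly mapped DS8000 volume
svcinfo lsmdisk -filtervalue mode=unmanaged       # identify the new, unmanaged MDisk
svctask addmdisk -mdisk mdisk8 MDG0               # add it to the Managed Disk Group
# Before unmapping a volume on the DS8000, remove the MDisk (extents are migrated off)
svctask rmmdisk -mdisk mdisk8 MDG0
svctask detectmdisk                               # rescan after the DS8000 unmapping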
Figure 14-5 Managed Disk Group using multiple DS8000 DA pairs
If a volume is added to an existing Managed Disk Group, the performance can become unbalanced due to the extents already assigned. These extents can be rebalanced manually or by using the Perl script provided as part of the SVCTools package from the alphaWorks Web site. Refer to SAN Volume Controller Best Practices and Performance Guidelines, SG24-7521, for information about performing this task.
14.4.1 Using TotalStorage Productivity Center for Disk to monitor the SVC
To configure TotalStorage Productivity Center for Disk to monitor IBM SAN Volume Controller, refer to SAN Volume Controller Best Practices and Performance Guidelines, SG24-7521.
14.5.1 Sharing the DS8000 between Open Systems servers and the SVC
If you have a mixed environment that includes the IBM SAN Volume Controller and Open Systems servers, we recommend sharing as much of the DS8000 resources as possible between both environments. Our storage configuration recommendation is to create one extent pool per rank. In each extent pool, create one volume allocated to the IBM SAN Volume Controller environment and allocate one or more other volumes to the Open Systems servers. In this configuration, each environment can benefit from the overall DS8000 performance. If an extent pool has multiple ranks, we recommend that all SVC volumes are created using the rotate volumes algorithm. Server volumes can use the rotate extents algorithm if desired.
IBM supports sharing a DS8000 between a SAN Volume Controller and an Open Systems server. However, if a DS8000 port is in the same zone as a SAN Volume Controller port, that same DS8000 port must not be in the same zone as another server.
14.5.2 Sharing the DS8000 between System i server and the SVC
The IBM SAN Volume Controller does not support System i server attachment. If you have a mixed server environment that includes the IBM SAN Volume Controller and System i servers, you have to share your DS8000 to provide direct access to the System i volumes and access to the Open Systems server volumes through the IBM SAN Volume Controller. In this case, we recommend sharing as much of the DS8000 resources as possible between both environments. Our storage configuration recommendation is to create one extent pool per rank. In each extent pool, create one volume allocated to the IBM SAN Volume Controller environment and create one or more other volumes allocated to the System i servers.

IBM supports sharing a DS8000 between a SAN Volume Controller and System i servers. However, if a DS8000 port is in the same zone as a SAN Volume Controller port, that same DS8000 port must not be in the same zone as the System i servers.
14.5.3 Sharing the DS8000 between System z server and the SVC
The IBM SAN Volume Controller does not support System z server attachment. If you have a mixed server environment that includes the IBM SAN Volume Controller and System z servers, you have to share your DS8000 to provide direct access to the System z volumes and access to the Open Systems server volumes through the IBM SAN Volume Controller. In this case, you have to split your DS8000 resources between the two environments. Some of the ranks have to be created using the count key data (CKD) format (used for System z access), and the other ranks have to be created in Fixed Block (FB) format (used for IBM SAN Volume Controller access). In this case, both environments will get performance related to the allocated DS8000 resources. A DS8000 port cannot be shared between System z and the IBM SAN Volume Controller, because System z servers use Fibre Channel connection (FICON) and the IBM SAN Volume Controller only supports FCP connections.
target of the remote copy from a server directly, rather than through the SAN Volume Controller. The SAN Volume Controller Copy Services can be usefully employed with the image mode VDisk representing the primary of the controller copy relationship, but it does not make sense to use SAN Volume Controller Copy Services with the VDisk at the secondary site, because the SAN Volume Controller does not see the data flowing to this LUN through the controller.

Cache-disabled VDisks are primarily used when virtualizing an existing storage infrastructure where you need to retain the existing storage system Copy Services. You might want to use cache-disabled VDisks where there is a lot of intellectual capital in existing Copy Services automation scripts. We recommend that you keep the use of cache-disabled VDisks to a minimum for normal workloads.

Another case where you might need to use cache-disabled VDisks is where you have servers, such as System i or System z, that are not supported by the SAN Volume Controller, but you need to maintain a single Global Mirror session for consistency between all servers. In this case, the DS8000 Global Mirror must be able to manage the LUNs for all server systems.

Important: When configuring cache-disabled VDisks, consider the guidelines for configuring the DS8000 for host system attachment, which will have as great an impact on performance as the SAN Volume Controller. Because the SAN Volume Controller will not perform any striping of these VDisks, it might be an advantage to use extent pools with multiple ranks to allow volumes to be created using the rotate extents algorithm. The guidelines for the use of different DA pairs for FlashCopy source and target LUNs with affinity to the same DS8000 server also apply.

Cache-disabled VDisks can also be used to control the allocation of cache resources. By disabling the cache for certain VDisks, more cache resources are available to cache I/Os to other VDisks in the same I/O Group. This technique is particularly effective where an I/O Group is serving some VDisks that benefit from cache and other VDisks where the benefits of caching are small or non-existent. Currently, there is no direct way to enable the cache for previously cache-disabled VDisks. You will need to remove the VDisk from the SAN Volume Controller and redefine it with the cache enabled.
- Avoid splitting an extent pool into multiple volumes at the DS8000 layer. Where possible, create one, or at most two, volumes on the entire capacity of the extent pool.
- Ensure that you have an equal number of extent pools and volumes spread equally across the device adapters and the two processor complexes of the DS8000 storage subsystem.
- Ensure that Managed Disk Groups contain MDisks with similar characteristics and the same capacity. Consider the following factors:
  - The number of DDMs in the array, for example, 6+P+S or 7+P
  - The disk rotational speed: 10k rpm or 15k rpm
  - The DDM attachment: Fibre Channel or FATA/SATA
  - The underlying RAID type: RAID 5, RAID 6, or RAID 10
  Using the same MDisk capacity within a Managed Disk Group makes efficient use of the SAN Volume Controller striping.
- Do not mix MDisks of differing performance in the same Managed Disk Group. The overall group performance will be limited by the slowest MDisk in the group.
- Do not mix MDisks from different controllers in the same Managed Disk Group.
- For Metro Mirror configurations, always use DS8000 MDisks with similar characteristics for both the master VDisk and the auxiliary VDisk.
Chapter 15. System z servers
In this chapter, we describe the performance features and other enhancements that enable higher throughput and lower response time when connecting a DS8000 to your System z. We also review several of the monitoring tools and their usage with the DS8000.
15.1 Overview
The special synergy between the DS8000 features and the System z operating systems (mainly the z/OS operating system) makes the DS8000 an outstanding performer in that environment. The specific DS8000 performance features, as they relate to application I/O in a z/OS environment, include:
- Parallel Access Volumes (PAVs)
- Multiple allegiance
- I/O priority queuing
- Logical volume sizes
- Fibre Channel connection (FICON)

In the following sections, we describe those DS8000 features and discuss how to best use them to boost performance.
perform alias device reassignments from one base device to another base device to help meet its goals and to minimize IOS queuing as workloads change. WLM manages PAVs across all the members of a sysplex. When making decisions on alias reassignment, WLM considers I/O from all systems in the sysplex. By default, the function is turned off and must be explicitly activated for the sysplex through an option in the WLM service definition and through a device-level option in the hardware configuration definition (HCD). Dynamic alias management requires your sysplex to run in WLM Goal mode.

In a HyperPAV environment, WLM is no longer involved in managing alias addresses. When the base address UCB is in use by an I/O operation, each additional I/O against that address is assigned an alias that is picked from a pool of alias addresses within the same LCU. This approach eliminates the latency caused by WLM having to manage the alias movement from one base address to another. Also, as soon as the I/O that uses the alias is finished, the alias is dropped and returned to the alias pool. HyperPAV allows different hosts to use one alias to access different base addresses, which reduces the number of alias addresses required to support a set of base addresses in a System z environment.
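As an illustrative sketch only, HyperPAV is enabled on z/OS through the IECIOSxx parmlib member or dynamically with a system command, assuming the HyperPAV licensed feature is active on the DS8000 and the required z/OS level is installed; verify the exact procedure for your environment before use.

In the IECIOSxx parmlib member:
HYPERPAV=YES

Or dynamically from the operator console, with a display to verify:
SETIOS HYPERPAV=YES
D IOS,HYPERPAV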
I/O rate:
- HyperPAV: The I/O rate reaches its maximum almost immediately. The I/O rate achieved is around 5780 IOPS.
- Dynamic PAV: The I/O rate starts to creep up from the beginning of the test and takes several minutes to reach its maximum of about 5630 IOPS.
Figure 15-2 shows the number of PAVs assigned to the base address.
- HyperPAV: The number of PAVs almost immediately jumps to around 10 and fluctuates between 9 and 11.
- Dynamic PAV: The number of PAVs starts at one, and WLM gradually increases the PAVs one at a time until it reaches a maximum of nine.

In this test, we see that HyperPAV assigns more aliases compared to dynamic PAV, but we also see that HyperPAV reaches a higher I/O rate than dynamic PAV. Note that this is an extreme test that tries to show how HyperPAV reacts to a very high concurrent I/O rate to a single volume, as compared to how dynamic PAV responds to the same condition. The conclusion is that HyperPAV can and will react immediately to a condition where there is a high concurrent demand on a volume. The other advantage of HyperPAV is that there is no overhead for assigning and releasing an alias for every single I/O operation that needs an alias.
will be satisfied from either the cache or one of the disk drive modules (DDMs) on a rank where the volume resides.
(Figure: concurrent access to device 100 from two z/OS images. Without an alias, Appl.B on z/OS 1 waits on UCB busy (IOSQ time) and Appl.C on z/OS 2 waits on device busy (PEND time). With alias UCB 1FF assigned to base UCB 100, the I/Os from z/OS 1 run in parallel, and Multiple Allegiance allows the I/O from z/OS 2 to run concurrently as well.)
(Figure: Multiple Allegiance — READ1 from z/OS 1 and READ2 from z/OS 2 against the same volume execute concurrently.)
(Figure: Multiple Allegiance — WRITE1 from z/OS 1 and WRITE2 from z/OS 2 against the same volume execute concurrently.)
Besides these standard models, there is the 3390-27 that supports up to 32760 cylinders and the 3390-54 that supports up to 65520 cylinders. With the availability of the Extended Address Volume (EAV), we now have the capability to support extremely large volumes of up to 262668 cylinders.
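For a quick feel for what these cylinder counts mean in bytes, the following sketch converts them to capacity. It assumes the standard 3390 geometry of 15 tracks per cylinder and 56,664 bytes per track; these constants are an assumption of the sketch rather than something stated in this chapter.

# Convert 3390 cylinder counts to capacity, assuming standard 3390 track geometry.
TRACKS_PER_CYLINDER = 15
BYTES_PER_TRACK = 56664

def capacity(cylinders):
    """Return the capacity of a CKD volume in decimal GB and in binary GiB."""
    volume_bytes = cylinders * TRACKS_PER_CYLINDER * BYTES_PER_TRACK
    return volume_bytes / 10**9, volume_bytes / 2**30

for model, cyls in [("3390-3", 3339), ("3390-9", 10017), ("3390-27", 32760),
                    ("3390-54", 65520), ("EAV", 262668)]:
    gb, gib = capacity(cyls)
    print(f"{model:8} {cyls:7} cyl  ~{gb:6.1f} x 10^9 B  ~{gib:6.1f} x 2^30 B")

As a cross-check, a custom volume of 30051 cylinders works out to roughly 25.5 x 10^9 bytes (23.8 x 2^30 bytes), which matches the cap fields that showckdvol reports for the volume in Example 15-1 later in this chapter.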
(Figure: number of volumes that fit on a rank for 3390-3, 3390-9, 3390-27, 3390-54, and EAV volume sizes on 146 GB, 300 GB, and 450 GB DDMs.)
Random workload
Our measurements for DB2 and IMS online transaction workloads showed only a slight difference in device response time between a six-volume 3390-27 configuration and a 60-volume 3390-3 configuration of equal capacity on the ESS-F20 using FICON channels. The measurements for DB2 are shown in Figure 15-8. Note that even when the device response time for a large volume configuration is higher, the online transaction response time can sometimes be lower due to the reduced system overhead of managing fewer volumes.
(Figure 15-8: device response time (msec) for the DB2 workload, 3390-3 compared to 3390-27, at the two measured workload levels (2101 and 3535).)
The measurements were carried out so that all volumes were initially assigned with zero or one alias. WLM dynamic alias management then assigned additional aliases as needed. The number of aliases at the end of the test run reflects the number that was adequate to keep IOSQ down. For this DB2 benchmark, the alias assignment done by WLM resulted in an approximately 4:1 reduction in the total number of UCBs used.
Sequential workload
Figure 15-9 on page 453 shows elapsed time comparisons between nine 3390-3s and one 3390-27 when a DFSMSdss full volume physical dump and full volume physical restore are executed. The workloads were run on a 9672-XZ7 processor connected to an ESS-F20 with eight FICON channels. The volumes are dumped to or restored from a single 3590E tape on an A60 Control Unit with one FICON channel. No PAV aliases were assigned to any volumes for this test, even though an alias might have improved the performance.
(Figure 15-9: elapsed time for a DFSMSdss full volume dump and full volume restore, nine 3390-3 volumes compared to one 3390-27 volume.)
Larger volumes
To avoid potential I/O bottlenecks when using large volumes, you might also consider the following recommendations:
- Use PAVs to reduce IOS queuing. Parallel Access Volume (PAV) is of key importance when using large volumes. PAV enables one z/OS system image to initiate multiple I/Os to a device concurrently, which keeps IOSQ times down even with many active datasets on the same volume. PAV is a practical must with large volumes. In particular, we recommend using HyperPAV.
- Rely on Multiple Allegiance, a function that the DS8000 automatically provides. Multiple Allegiance allows multiple I/Os from different z/OS systems to be executed concurrently, which reduces the Device Busy Delay time, a part of PEND time.
- Eliminate unnecessary reserves. As volume sizes grow larger, more data and datasets reside on a single CKD device address. Thus, the larger the volume, the greater the multi-system performance impact will be when serializing volumes with RESERVE processing. You need to exploit a Global Resource Serialization (GRS) Star configuration and convert all RESERVEs possible into system ENQ requests.
Some applications might use poorly designed channel programs that define the whole volume, or the whole extent of the dataset they are accessing, as their extent range or domain, instead of just the actual track on which the I/O operates. This design prevents other I/Os from running simultaneously if a write I/O is being executed against that volume or dataset, even when PAV is used. You need to identify such applications and allocate the datasets on volumes where they do not conflict with other applications. Custom volumes are an option here. For an Independent Software Vendor (ISV) product, asking the vendor for an updated version might help solve the problem.
Other benefits of using large volumes can be briefly summarized as follows:
- Reduced number of UCBs required. We reduce the number of UCBs by consolidating smaller volumes into larger volumes, and we also reduce the number of total aliases required, as explained in 15.2.3, PAV and large volumes on page 445.
- Simplified storage administration.
- Larger pools of free space, which reduces the number of X37 abends and allocation failures.
- Reduced number of multivolume datasets to manage.
15.7 FICON
FICON provides several benefits as compared to ESCON, from the simplified system connectivity to the greater throughput that can be achieved when using FICON to attach the host to the DS8000. FICON allows you to significantly reduce the batch window processing time.

Response time improvements accrue particularly for data stored using larger blocksizes. The data transfer portion of response time is greatly reduced because of the much higher data rate during transfer with FICON. This improvement leads to significant reductions in the connect time component of the response time. The larger the transfer, the greater the reduction as a percentage of the total I/O service time. The pending time component of the response time that is caused by director port busy is totally eliminated, because collisions in the director are eliminated with the FICON architecture. For users whose ESCON directors are experiencing as much as 45 - 50% busy conditions, FICON will provide a significant response time reduction.

Another performance advantage delivered by FICON is that the DS8000 accepts multiple channel command words (CCWs) concurrently without waiting for completion of the previous CCW, which allows setup and execution of multiple CCWs from a single channel to happen concurrently. Contention among multiple I/Os accessing the same data is now handled in the FICON host adapter and queued according to the I/O priority indicated by the Workload Manager.

FICON Express2 and FICON Express4 channels on the z9 EC and z9 BC systems also support the Modified Indirect Data Address Word (MIDAW) facility and a maximum of 64 open exchanges per channel, compared to the maximum of 32 open exchanges available on FICON, FICON Express, and IBM System z 990 (z990) and IBM System z 890 (z890) FICON Express2 channels.

Significant performance advantages can also be realized by users accessing the data remotely. FICON eliminates the data rate droop effect for distances up to 100 km (62.1 miles) for both read and write operations by using enhanced data buffering and pacing schemes. FICON thus extends the DS8000's ability to deliver high bandwidth potential to the logical volumes needing it, when they need it.
For additional information about FICON, refer to 9.3.2, FICON on page 276.
zHPF is available as a licensed feature (7092 and 0709) on DS8000 series Turbo Models with R4.1. The software requirements are as follows. zHPF support for CHPID type FC on the System z10 EC requires at a minimum:
- z/OS V1.8, V1.9, or V1.10 with PTFs
- z/OS V1.7 with the IBM Lifecycle Extension for z/OS V1.7 (5637-A01) with PTFs
Figure 15-11 shows 4k write hit performance with 4 Gbps FICON and with HPF. On a single port, the write rate is 11.8 kIOPS for 4 Gbps FICON and 20.7 kIOPS for HPF (a 57% improvement). The write improvement is almost the same (55%) for a single card.
Figure 15-12 shows 4k read/write hit performance with 4 Gbps FICON and with HPF. On a single port, the read/write rate is 12 kIOPS for 4 Gbps FICON and 23.5 kIOPS for HPF (a 51% improvement). The read/write improvement is even higher (57%) for a single card.
15.7.3 MIDAW
The IBM System z9 server introduces a Modified Indirect Data Address Word (MIDAW) facility, which in conjunction with the DS8000 and the FICON Express4 channels delivers enhanced I/O performance for Media Manager applications running under z/OS 1.7 and z/OS 1.8. It is also supported under z/OS 1.6 with PTFs.
The MIDAW facility is a modification to a channel programming technique that has existed since S/360 days. MIDAW is a new method of gathering data into and scattering data from noncontiguous storage locations during an I/O operation, which decreases channel, fabric, and control unit overhead by reducing the number of channel command words (CCWs) and frames processed. No tuning is needed to use the MIDAW facility. The requirements to take advantage of the MIDAW facility are:
- A z9 server
- Applications that use Media Manager
- Applications that use long chains of small blocks
The biggest performance benefit comes with FICON Express4 channels running on 4 Gbps links, especially when processing extended format datasets. Compared to ESCON channels, using FICON channels improves performance. This performance improvement is more significant for I/Os with bigger blocksizes, because FICON channels can transfer data much faster, which reduces the connect time. The improvement for I/Os with smaller blocksizes is not as significant. In these cases, where chains of small records are processed, MIDAW can significantly improve FICON Express4 performance if the I/Os use Media Manager.
Figure 15-13 on page 459 shows the results for a 32x4k READ channel program. Without MIDAWs, a FICON Express4 channel is pushed to 100% channel processor utilization at just over 100 MBps of throughput, which is about the limit of a 1 Gigabit/s FICON link. Two FICON Express4 channels are needed to get to 200 MBps with a 32x4k channel program. With MIDAWs, 100 MBps is achieved at only 30% channel utilization, and 200 MBps, which is about the limit of a 2 Gigabit/s FICON link, is achieved at about 60% channel utilization. A FICON Express4 channel operating at 4 Gigabit/s link speeds can achieve over 325 MBps. This measurement was done with a single FICON Express4 channel connected through a FICON director to two 4 Gigabit/s Control Unit (CU) ports.
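A rough way to work with these numbers: if channel processor utilization scales approximately linearly with throughput for this 32x4k channel program (a simplifying assumption made only for this sketch), you can estimate how many FICON Express4 channels a target data rate requires with and without MIDAWs.

# Rough channel-count estimate for a 32x4k read channel program.
# Assumes utilization grows linearly with throughput (illustration only).
MBPS_PER_CHANNEL = {
    "without MIDAW": 100.0,   # ~100 MBps drives one channel to ~100% utilization
    "with MIDAW": 325.0,      # measured ceiling of a FICON Express4 channel at 4 Gb/s
}

def channels_needed(target_mbps, mode):
    per_channel = MBPS_PER_CHANNEL[mode]
    full, remainder = divmod(target_mbps, per_channel)
    return int(full) + (1 if remainder else 0)

for mode in MBPS_PER_CHANNEL:
    print(f"{mode}: {channels_needed(200, mode)} channel(s) for 200 MBps")

# without MIDAW: 2 channels, matching the measurement above; with MIDAW: 1 channel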
Daisy chaining can go either way:
- Connecting more than one FICON channel to one DS8000 port
- Connecting one FICON channel to more than one DS8000 port
(Figure: throughput in MB/sec for DS8000 2 Gb and 4 Gb ports with FICON Express 2Gb, FICON Express2 2Gb, and FICON Express4 channels.)
However, if you have multiple DS8000s installed, it might be a good option to balance the channel load on the System z server. You can double the number of required FICON ports on the DS8000s and daisy chain these FICON ports to the same channels on the System z server. This design provides the advantage of being able to balance the load on the FICON channels, because the load on the DS8000 fluctuates during the day. Figure 15-15 on page 461 shows configuration A with no daisy chaining. In this configuration, each DS8000 uses 8 FICON ports, and each port is connected to a separate FICON channel on the host. In this case, we have two sets of 8 FICON ports connected to 16 FICON channels on the System z host. In configuration B, we double the number of FICON ports on both DS8000s and keep the same number of FICON channels on the System z server. We can now connect each FICON channel to two FICON ports, one on each DS8000. The advantages of configuration B are:
- Workload from each DS8000 is now spread across more FICON ports, which lowers the load on the FICON ports and FICON host adapters.
- Any imbalance in the load going to the two DS8000s is now spread more evenly across the 16 FICON channels.
(Figure 15-15: configuration A without daisy chaining and configuration B with daisy chaining between the CEC and two DS8000s. Assumption: each line from a FICON channel in the CEC and each line from a FICON port in the DS8000 represents a set of 8 paths.)
Many performance issues are caused by a rank that is being over-driven by the I/O load of the volumes that reside on that rank. The only real solution is to spread the I/O over multiple ranks. Traditionally, you spread the I/O over multiple ranks by moving volumes from a busier rank to one or more less utilized ranks. This is often not practical, because the heavily loaded volumes on one day can differ from the heavily loaded volumes on another day.

With Storage Pool Striping (SPS), we can proactively involve the mechanical capabilities of multiple ranks in handling the I/O activity of those volumes. We utilize those multiple resources, and the load is balanced across all the ranks within the SPS extent pool. This load balancing is the main advantage compared to the single-rank extent pool and multi-rank extent pool configurations using the previous allocation methods. For the SPS performance considerations, refer to Multiple ranks on one extent pool using Storage Pool Striping on page 476.

Tip: Considering the performance benefits, we highly recommend that you configure the DS8000 using Storage Pool Striping.

Using SPS does not mean that we can now allocate all the volumes of one application on one single extent pool. It is still a prudent practice to spread those volumes across all extent pools, including extent pools on the other processor complex (server). If an extent pool is used exclusively for SPS volumes, the showrank command for all the ranks within that extent pool shows the same list of volumes, except when:
- The ranks defined in the extent pool have different numbers of DDMs, such as (6+P) and (7+P). In this case, all the extents on the (6+P) ranks can become fully occupied, so new volumes will be located on the (7+P) ranks only.
- Several volumes have a capacity of less than (n x 1113) cylinders, where n = the number of ranks defined in the extent pool. For example, a 3390-1 will only be allocated on one rank.

Example 15-1 is the output of a showckdvol command for an SPS volume. Here, we can observe for the volume with ID 9730:
- It is allocated on extent pool P1, which is a multi-rank extent pool with two ranks: R4 and R7.
- The volume is defined as a 3390-27 with 27 extents (30051 cylinders).
- It is allocated with the rotateexts (rotate extents) algorithm, which is the DSCLI and DS Storage Manager term for SPS.
- Because there are only two ranks in this P1 extent pool, volume 9730 is allocated with 13 extents on R4 and 14 extents on R7.
Example 15-1 The showckdvol command output
dscli> showckdvol -rank 9730
Date/Time: November 13, 2008 7:47:29 AM PST IBM DSCLI Version: 5.4.2.257 DS: IBM.2107-7512321
Name          ITSO_RotExts
ID            9730
accstate      Online
datastate     Normal
configstate   Normal
deviceMTM     3390-9
volser
datatype      3390
voltype       CKD Base
orgbvols
addrgrp       9
extpool       P1
exts          27
cap (cyl)     30051
cap (10^9B)   25.5
cap (2^30B)   23.8
ranks         2
sam           Standard
repcapalloc
eam           rotateexts
reqcap (cyl)  30051
==============Rank extents==============
rank extents
============
R4   13
R7   14
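The extent arithmetic behind this allocation is straightforward: a CKD extent is 1113 cylinders, and rotateexts places a volume's extents round-robin across the ranks of the extent pool. The following minimal sketch in Python (with invented helper names, not a DSCLI or DS8000 interface) reproduces the 13/14 split shown above.

CYLINDERS_PER_EXTENT = 1113            # size of one CKD extent

def rotateexts_allocation(cylinders, ranks):
    """Round-robin distribution of a volume's extents across the ranks of an extent pool."""
    extents = -(-cylinders // CYLINDERS_PER_EXTENT)   # ceiling division
    per_rank = {rank: 0 for rank in ranks}
    for i in range(extents):
        per_rank[ranks[i % len(ranks)]] += 1          # one extent at a time, rank by rank
    return extents, per_rank

extents, spread = rotateexts_allocation(30051, ["R4", "R7"])
print(extents, spread)   # 27 extents, split 14/13 (which rank gets the extra extent
                         # depends on where the allocation starts)

The same arithmetic explains the exception noted above: a 3390-1 of 1113 cylinders is exactly one extent, so it can only ever occupy one rank.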
15.10 RMF
Resource Management Facility (RMF), which is part of the z/OS operating system, provides performance information for the DS8000 and other disk subsystems. RMF can help with monitoring the following performance components:
- I/O response time
- IOP/SAP
- FICON host channel
- FICON director
- SMP
- Cache and nonvolatile storage (NVS)
- FICON/Fibre port and host adapter
- Extent pool and rank/array
Starting with z/OS Release 1.10, this report also shows the number of cylinders allocated to the volume. Example 15-2 shows 3390-9 volumes that have either 10017 or 30051 cylinders.
Example 15-2 RMF Direct Access Device Activity (DASD) report
D I R E C T   A C C E S S   D E V I C E   A C T I V I T Y

STORAGE DEV  DEVICE VOLUME PAV  LCU  NUMBER DEVICE   AVG  AVG  AVG  AVG  AVG  AVG  AVG   %    %    %    AVG    %     %
GROUP   NUM  TYPE   SERIAL           OF CYL ACTIVITY RESP IOSQ CMR  DB   PEND DISC CONN  DEV  DEV  DEV  NUMBER ANY   MT
                                            RATE     TIME TIME DLY  DLY  TIME TIME TIME  CONN UTIL RESV ALLOC  ALLOC PEND
        A100 33903  AISL00 4    007E 3339   0.017    1.39 .000 .128 .000 .213 .836 .341  0.00 0.00 0.0  0.0    100.0 0.0
        A101 33903  AISL01 4    007E 3339   0.014    1.29 .000 .128 .000 .207 .788 .295  0.00 0.00 0.0  0.0    100.0 0.0
        A102 33909  D26454 5    007E 30051  0.198    5.39 .000 .152 .000 .252 3.21 1.92  0.01 0.02 0.0  10.0   100.0 0.0
        A103 33909  D26455 5    007E 30051  11.218   1.47 .000 .140 .000 .237 .846 .389  0.09 0.28 0.0  9.0    100.0 0.0
        A104 33909  D26456 5    007E 30051  0.990    2.73 .000 .144 .000 .242 1.89 .597  0.01 0.05 0.0  6.0    100.0 0.0
        A105 33909  D26457 5    007E 30051  0.903    12.9 .000 .140 .000 .237 12.0 .640  0.01 0.23 0.0  11.0   100.0 0.0
        A106 33909  D26458 8*   007E 30051  2.654    .803 .000 .139 .000 .238 .228 .337  0.01 0.02 0.0  18.0   100.0 0.0
        A107 33909  P26531 5    007E 10017  0.154    2.53 .000 .149 .000 .250 .492 1.78  0.01 0.01 0.1  0.0    100.0 0.0
        A108 33909  P26532 5    007E 10017  0.741    6.50 .000 .142 .000 .245 4.96 1.30  0.02 0.09 0.0  10.0   100.0 0.0
        A109 33909  P26533 5    007E 10017  0.186    3.31 .000 .136 .000 .232 2.58 .497  0.00 0.01 0.0  7.0    100.0 0.0
        A10A 33909  P26534 5    007E 10017  0.111    1.60 .000 .133 .000 .224 .914 .463  0.00 0.00 0.0  1.0    100.0 0.0
PAV
PAV is the number of addresses assigned to a UCB, which includes the base address plus the number of aliases assigned to that base address. RMF will report the number of PAV addresses (or in RMF terms, exposures) that have been used by a device. In a dynamic PAV environment, when the number of exposures has changed during the reporting interval, there will be an asterisk next to the PAV number. Example 15-2 shows that address A106 has a PAV of 8*, the asterisk indicates that the number of PAVs was either lower or higher than 8 during the previous RMF period. For HyperPAV, the number of PAVs is shown in this format: n.nH. The H indicates that this volume is supported by HyperPAV, and the n.n is a one decimal number showing the average number of PAVs assigned to the address during the RMF report period. Example 15-3 shows that address 9505 has an average of 9.6 PAVs assigned to it during this RMF period. When a volume has no I/O activity, the PAV is always 1, which means that there is no alias assigned to this base address, because in HyperPAV, an alias is used or assigned to a base address only during the period required to execute an I/O. The alias is then released and put back into the alias pool after the I/O is completed. Note: The number of PAVs includes the base address plus the number of aliases assigned to it. Thus, a PAV=1 means that the base address has no aliases assigned to it.
Example 15-3 RMF DASD report for HyperPAV volumes (report created on pre-z/OS 1.10)
D I R E C T   A C C E S S   D E V I C E   A C T I V I T Y

STORAGE VOLUME PAV   LCU  DEVICE   AVG  AVG  AVG  AVG  AVG  AVG  AVG   %     %     %    AVG    %     %
GROUP   SERIAL            ACTIVITY RESP IOSQ CMR  DB   PEND DISC CONN  DEV   DEV   DEV  NUMBER ANY   MT
                          RATE     TIME TIME DLY  DLY  TIME TIME TIME  CONN  UTIL  RESV ALLOC  ALLOC PEND
        HY9500 1.0H  0227 0.900    0.8  0.0  0.3  0.0  0.4  0.0  0.4   0.04  0.04  0.0  0.0    100.0 0.0
        HY9501 1.0H  0227 0.000    0.0  0.0  0.0  0.0  0.0  0.0  0.0   0.00  0.00  0.0  0.0    100.0 0.0
        HY9502 1.0H  0227 0.000    0.0  0.0  0.0  0.0  0.0  0.0  0.0   0.00  0.00  0.0  0.0    100.0 0.0
        HY9503 1.0H  0227 0.000    0.0  0.0  0.0  0.0  0.0  0.0  0.0   0.00  0.00  0.0  0.0    100.0 0.0
        HY9504 1.0H  0227 0.000    0.0  0.0  0.0  0.0  0.0  0.0  0.0   0.00  0.00  0.0  0.0    100.0 0.0
        HY9505 9.6H  0227 5747.73  1.6  0.0  0.4  0.0  0.5  0.0  1.0   60.44 60.83 0.0  10.9   100.0 0.0
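When post-processing these reports with a script, the PAV column can be interpreted exactly as described above: a plain number for a static count, a trailing asterisk when dynamic PAV changed the count during the interval, and n.nH for a HyperPAV average. A minimal sketch follows; the helper is hypothetical and not part of RMF.

def parse_pav(field):
    """Interpret the RMF PAV column: '4', '8*' (dynamic, count changed), or '9.6H' (HyperPAV)."""
    field = field.strip()
    if field.endswith("H"):              # HyperPAV: average PAVs during the interval
        return {"mode": "HyperPAV", "avg_pav": float(field[:-1])}
    if field.endswith("*"):              # dynamic PAV: count changed during the interval
        return {"mode": "dynamic", "pav": int(field[:-1]), "changed": True}
    return {"mode": "static or dynamic", "pav": int(field), "changed": False}

# Remember that the PAV count includes the base address, so aliases = pav - 1.
print(parse_pav("8*"))     # device A106 in Example 15-2
print(parse_pav("9.6H"))   # device 9505 in Example 15-3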
IOSQ time
IOSQ time is the time measured when an I/O request is being queued in the LPAR by z/OS.
The following situations can cause high IOSQ time:
- One of the other response time components is high. When you see a high IOSQ time, look at the other response time components to investigate where the problem actually exists.
- Sometimes, the IOSQ time is due to the unavailability of aliases to initiate an I/O request.
- There is also a slight possibility that the IOSQ is caused by a long busy condition during device error recovery.
To reduce high IOSQ time:
- Reduce the other components of the response time. Lowering the other components' response time automatically lowers the IOSQ time.
- Lower the I/O load through data in memory, or use faster storage devices.
- Provide more aliases. Using HyperPAV is the best option.
PEND time
PEND time represents the time that an I/O request waits in the hardware. PEND time can be increased by:
- High FICON Director port or DS8000 FICON port utilization. This can be caused by a high activity rate on those ports. More commonly, it is due to daisy chaining multiple FICON channels from different CECs to the same port on the FICON Director or the DS8000 FICON host adapter. In this case, the FICON channel utilization as seen from the host might be low, but the combined utilization of the channels that share the same port (either on the Director or the DS8000) can be significant. For more information, refer to FICON director on page 469 and DS8000 FICON/Fibre port and host adapter on page 472.
- High FICON host adapter utilization. Using too many ports within a DS8000 host adapter can overload the host adapter. We recommend that only two of the four ports in a host adapter are used.
- I/O Processor (IOP/SAP) contention at the System z host. More IOPs might be needed. The IOP is the processor in the CEC that is assigned to handle I/Os. For more information, refer to IOP/SAP on page 468.
- CMR Delay, which is a component of PEND time (refer to Example 15-2 on page 465). It is the initial selection time for the first I/O command in a chain for a FICON channel. It can be elongated by contention downstream from the channel, such as a busy control unit.
- Device Busy Delay, which is also a component of PEND time (refer to Example 15-2 on page 465). Device Busy Delay is caused by a domain conflict, because of a read or write operation against a domain that is in use for update. A high Device Busy Delay time can be caused by the domain of the I/O not being limited to the track that the I/O operation is accessing. If you use an Independent Software Vendor (ISV) product, ask the vendor for an updated version, which might help solve this problem.
DISC time
If the major cause of delay is the DISC time, you need to search further to find the cause. The most probable cause of high DISC time is having to wait while data is staged from the DS8000 rank into cache because of a read miss operation. This time can be elongated by:
- A low read hit ratio. Refer to Cache and NVS on page 470. The lower the read hit ratio, the more read operations have to wait for the data to be staged from the DDMs to the cache. Adding cache to the DS8000 can increase the read hit ratio.
- High DDM utilization. You can verify high DDM utilization from the ESS Rank Statistics report. Refer to Extent pool and rank on page 474. Look at the rank read response time. As a general rule, this number must be less than 35 ms. If it is higher than 35 ms, it is an indication that this rank is too busy, because the DDMs are saturated. If the rank is too busy, consider spreading the busy volumes allocated on this rank to other ranks that are not as busy.
- A persistent memory full condition or nonvolatile storage (NVS) full condition can also elongate the DISC time. Refer to Cache and NVS on page 470.
- In a Metro Mirror environment, a significant transmission delay between the primary and the secondary site also causes a higher DISC time.
CONN time
For each I/O operation, the channel subsystem measures the time that the DS8000, channel, and CEC are connected during the data transmission. When there is a high level of utilization of resources, significant time can be spent in contention rather than in transferring data. There are several reasons for high CONN time:
- FICON channel saturation. If the channel or BUS utilization at the host exceeds 50%, it elongates the CONN time. Refer to FICON host channel on page 468. In FICON channels, the data transmitted is divided into frames, and when the channel is busy with multiple I/O requests, the frames from one I/O are multiplexed with the frames from other I/Os, which elongates the elapsed time that it takes to transfer all of the frames that belong to that I/O. The total of this time, including the transmission time of the other multiplexed frames, is counted as CONN time.
- Contention in the FICON Director, FICON port, and FICON host adapter elongates the PEND time, which also has the same effect on CONN time. Refer to the PEND time discussion in PEND time on page 466.
- Rank saturation caused by high DDM utilization increases DISC time, which also increases CONN time. Refer to the DISC time discussion in DISC time on page 467.
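The components discussed in these sections add up to the device response time that RMF reports (CMR delay and device busy delay are already contained in PEND time). The following small sketch decomposes a response time, using the values of device A100 from Example 15-2; the field names are invented for illustration.

# Decompose an RMF device response time into its components (values in ms).
# Values taken from device A100 in Example 15-2.
components = {"iosq": 0.000, "pend": 0.213, "disc": 0.836, "conn": 0.341}

def response_time(c):
    # AVG RESP TIME = IOSQ + PEND + DISC + CONN;
    # CMR delay and device busy delay are sub-components of PEND.
    return c["iosq"] + c["pend"] + c["disc"] + c["conn"]

total = response_time(components)
for name, value in components.items():
    print(f"{name.upper():4} {value:5.3f} ms  ({value / total:6.1%} of the response time)")
print(f"RESP {total:5.3f} ms")   # matches the 1.39 ms that RMF reports for A100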
15.10.3 IOP/SAP
The IOP/SAP is the CEC processor that handles the I/O operation. We check the I/O QUEUING ACTIVITY report (Example 15-4) to determine whether the IOP is saturated. An average queue length greater than 1 indicates that the IOP is saturated; an average queue length greater than 0.5 is already considered a warning sign. A burst of I/O can also trigger a high average queue length. If only certain IOPs are saturated, redistributing the channels assigned to the disk subsystems can help balance the load on the IOPs, because an IOP is assigned to handle a certain set of channel paths. Assigning all of the channels from one IOP to access a very busy disk subsystem can therefore cause saturation of that particular IOP. Refer to the appropriate hardware manual of the CEC that you use.
Example 15-4 I/O Queuing Activity report
                              I/O   Q U E U I N G   A C T I V I T Y
z/OS V1R6            SYSTEM ID SYS1          DATE 06/15/2005     INTERVAL 00.59.997
                     RPT VERSION V1R5 RMF    TIME 11.34.00       CYCLE 1.000 SECONDS
TOTAL SAMPLES = 59   IODF = 94   CR-DATE: 06/15/2005   CR-TIME: 09.34.25   ACT: ACTIVATE
         - INITIATIVE QUEUE -  -- IOP UTILIZATION --        - % I/O REQUESTS RETRIED -       ---- RETRIES / SSCH ----
   IOP   ACTIVITY   AVG Q   % IOP   I/O START   INTERRUPT        CP    DP    CU    DV         CP    DP    CU    DV
         RATE       LNGTH   BUSY    RATE        RATE       ALL   BUSY  BUSY  BUSY  BUSY  ALL  BUSY  BUSY  BUSY  BUSY
   00    278.930    0.00    1.25    278.930     471.907    0.2   0.0   0.2   0.0   0.0   0.00 0.00  0.00  0.00  0.00
   01    551.228    0.04    1.55    551.228     553.744    17.2  17.2  0.0   0.0   0.0   0.21 0.21  0.00  0.00  0.00
   SYS   830.158    0.02    1.40    830.158     1025.651   12.2  12.1  0.1   0.0   0.0   0.14 0.14  0.00  0.00  0.00
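The queue-length thresholds above translate directly into a check you can apply when scanning this report programmatically. A trivial sketch (the thresholds come from the guidance above; the helper name is invented):

def iop_status(avg_queue_length):
    """Classify an IOP using the average initiative queue length from the report."""
    if avg_queue_length > 1.0:
        return "saturated"
    if avg_queue_length > 0.5:
        return "warning"
    return "ok"

# AVG Q LNGTH values for IOP 00, IOP 01, and the system average in Example 15-4
for iop, q_length in [("00", 0.00), ("01", 0.04), ("SYS", 0.02)]:
    print(iop, q_length, iop_status(q_length))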
The link between the director and the DS8000 can run at 1, 2, or 4Gbps. If the channel is point-to-point connected to the DS8000 FICON port, the G field indicates the speed that was negotiated between the FICON channel and the DS8000 port.
Example 15-5 Channel Path Activity report
C H A N N E L z/OS V1R6 SYSTEM ID SYS1 RPT VERSION V1R5 RMF P A T H A C T I V I T Y INTERVAL 00.59.997 CYCLE 1.000 SECONDS
IODF = 94 CR-DATE: 06/15/2005 CR-TIME: 09.34.25 ACT: ACTIVATE MODE: LPAR CPMF: EXTENDED MODE --------------------------------------------------------------------------------------------------------------------------------DETAILS FOR ALL CHANNELS --------------------------------------------------------------------------------------------------------------------------------CHANNEL PATH UTILIZATION(%) READ(MB/SEC) WRITE(MB/SEC) CHANNEL PATH UTILIZATION(%) READ(MB/SEC) WRITE(MB/SEC) ID TYPE G SHR PART TOTAL BUS PART TOTAL PART TOTAL ID TYPE G SHR PART TOTAL BUS PART TOTAL PART TOTAL 2E FC_S 2 Y 0.15 0.66 4.14 0.02 0.13 0.05 0.08 36 FC_? OFFLINE 2F FC_S 2 Y 0.15 0.66 4.14 0.02 0.12 0.05 0.08 37 FC_? OFFLINE 30 FC_S 2 Y 0.02 0.14 3.96 0.00 0.00 0.00 0.00 38 FC_S 2 Y 0.00 5.17 4.45 0.00 0.00 0.00 0.00 31 FC_S 2 Y 0.02 0.14 3.96 0.00 0.00 0.00 0.00 39 FC_S 2 Y 0.00 4.47 4.37 0.00 0.00 0.00 0.00 32 FC_? 2 Y 0.00 0.13 3.96 0.00 0.00 0.00 0.00 3A FC_S 2 Y 0.02 0.14 3.96 0.00 0.00 0.00 0.00 3B FC_S 2 Y 9.20 9.20 10.20 0.00 0.00 13.28 13.28 43 FC_S OFFLINE 3C FC_S 2 Y 3.09 3.14 6.53 6.37 6.37 0.00 0.00 44 FC_S 2 Y 9.27 9.27 10.37 0.00 0.00 14.07 14.07 3D FC_S 2 Y 3.25 3.31 6.50 6.34 6.34 0.00 0.00 45 FC 2 0.00 0.13 3.96 0.00 0.00 0.00 0.00
(FICON Director Activity report for switch type 005000: average read and write frame sizes and read and write port bandwidth in MB/sec for each director port.)
SYSTEM ID SYS1 DATE 06/15/2005 INTERVAL 00.59.997 RPT VERSION V1R5 RMF TIME 11.34.00 SUBSYSTEM 2107-01 CU-ID 7015 SSID 1760 CDATE 06/15/2005 CTIME 11.34.02 CINT 00.59 TYPE-MODEL 2107-922 MANUF IBM PLANT 75 SERIAL 000000012331 -----------------------------------------------------------------------------------------------------------------------------------CACHE SUBSYSTEM STATUS -----------------------------------------------------------------------------------------------------------------------------------SUBSYSTEM STORAGE NON-VOLATILE STORAGE STATUS CONFIGURED 31104M CONFIGURED 1024.0M CACHING - ACTIVE AVAILABLE 26290M PINNED 0.0 NON-VOLATILE STORAGE - ACTIVE PINNED 0.0 CACHE FAST WRITE - ACTIVE OFFLINE 0.0 IML DEVICE AVAILABLE - YES -----------------------------------------------------------------------------------------------------------------------------------CACHE SUBSYSTEM OVERVIEW -----------------------------------------------------------------------------------------------------------------------------------TOTAL I/O 19976 CACHE I/O 19976 CACHE OFFLINE 0 TOTAL H/R 0.804 CACHE H/R 0.804 CACHE I/O -------------READ I/O REQUESTS----------------------------------WRITE I/O REQUESTS---------------------% REQUESTS COUNT RATE HITS RATE H/R COUNT RATE FAST RATE HITS RATE H/R READ NORMAL 14903 252.6 10984 186.2 0.737 5021 85.1 5021 85.1 5021 85.1 1.000 74.8
SEQUENTIAL 0 0.0 0 0.0 N/A CFW DATA 0 0.0 0 0.0 N/A TOTAL 14903 252.6 10984 186.2 0.737 -----------------------CACHE MISSES----------------------REQUESTS READ RATE WRITE RATE TRACKS RATE NORMAL 3919 SEQUENTIAL 0 CFW DATA 0 TOTAL 3919 ---CKD STATISTICS--WRITE WRITE HITS 0 0 66.4 0 0.0 3921 0.0 0 0.0 0 0.0 0 0.0 RATE 66.4 ---RECORD CACHING--READ MISSES WRITE PROM 0 3456 66.5 0.0
52 0 5073
0.9 52 0.9 0.0 0 0.0 86.0 5073 86.0 ------------MISC-----------COUNT RATE DFW BYPASS 0 0.0 CFW BYPASS 0 0.0 DFW INHIBIT 0 0.0 ASYNC (TRKS) 3947 66.9 ----HOST ADAPTER ACTIVITY--BYTES BYTES /REQ /SEC READ 6.1K 1.5M WRITE 5.7K 491.0K
52 0 5073
0.9 1.000 0.0 0.0 N/A N/A 86.0 1.000 74.6 ------NON-CACHE I/O----COUNT RATE ICL 0 0.0 BYPASS 0 0.0 TOTAL 0 0.0
--------DISK ACTIVITY------RESP BYTES BYTES TIME /REQ /SEC READ 6.772 53.8K 3.6M WRITE 12.990 6.8K 455.4K
Following the report in Example 15-7 on page 470 is the CACHE SUBSYSTEM ACTIVITY report by volume serial number, as shown in Example 15-8. Here, you can see to which extent pool each volume belongs. Analyzing a performance problem on an LCU is easier if we have the following setup:
- One extent pool has one rank.
- All volumes on an LCU belong to the same extent pool.
If we look at the rank statistics in the report in Example 15-11 on page 474, we then know that all the I/O activity on that rank comes from the same LCU. So, we can concentrate our analysis on the volumes of that LCU only.
Note: Depending on the DDM size used and the 3390 model selected, you can put multiple LCUs on one rank, or you can also have an LCU that spans more than one rank.
Example 15-8 Cache Subsystem Activity by volume serial number
C A C H E z/OS V1R6 S U B S Y S T E M A C T I V I T Y
SYSTEM ID SYS1 DATE 06/15/2005 INTERVAL 00.59.997 RPT VERSION V1R5 RMF TIME 11.34.00 SUBSYSTEM 2107-01 CU-ID 7015 SSID 1760 CDATE 06/15/2005 CTIME 11.34.02 CINT 00.59 TYPE-MODEL 2107-922 MANUF IBM PLANT 75 SERIAL 000000012331 -----------------------------------------------------------------------------------------------------------------------------------CACHE SUBSYSTEM DEVICE OVERVIEW -----------------------------------------------------------------------------------------------------------------------------------VOLUME DEV XTNT % I/O ---CACHE HIT RATE-----------DASD I/O RATE---------ASYNC TOTAL READ WRITE % SERIAL NUM POOL I/O RATE READ DFW CFW STAGE DFWBP ICL BYP OTHER RATE H/R H/R H/R READ *ALL 100.0 338.6 186.2 86.0 0.0 66.4 0.0 0.0 0.0 0.0 66.9 0.804 0.737 1.000 74.6 *CACHE-OFF 0.0 0.0 *CACHE 100.0 338.6 186.2 86.0 0.0 66.4 0.0 0.0 0.0 0.0 66.9 0.804 0.737 1.000 74.6 PR7000 7000 0000 22.3 75.5 42.8 19.2 0.0 13.5 0.0 0.0 0.0 0.0 14.4 0.821 0.760 1.000 74.6 PR7001 7001 0000 11.5 38.8 20.9 10.5 0.0 7.5 0.0 0.0 0.0 0.0 7.6 0.807 0.736 1.000 73.1 PR7002 7002 0000 11.1 37.5 20.4 9.5 0.0 7.6 0.0 0.0 0.0 0.0 7.0 0.797 0.729 1.000 74.7 PR7003 7003 0000 11.3 38.3 22.0 8.9 0.0 7.4 0.0 0.0 0.0 0.0 6.8 0.806 0.747 1.000 76.8 PR7004 7004 0000 3.6 12.0 6.8 3.0 0.0 2.3 0.0 0.0 0.0 0.0 2.6 0.810 0.747 1.000 75.2 PR7005 7005 0000 3.7 12.4 6.8 3.2 0.0 2.4 0.0 0.0 0.0 0.0 2.7 0.808 0.741 1.000 74.1 PR7006 7006 0000 3.8 12.8 6.5 3.6 0.0 2.6 0.0 0.0 0.0 0.0 3.1 0.796 0.714 1.000 71.5 PR7007 7007 0000 3.6 12.3 6.9 3.1 0.0 2.4 0.0 0.0 0.0 0.0 2.5 0.806 0.742 1.000 75.2 PR7008 7008 0000 3.6 12.2 6.7 3.4 0.0 2.2 0.0 0.0 0.0 0.0 2.7 0.821 0.753 1.000 72.5 PR7009 7009 0000 3.6 12.2 6.8 2.9 0.0 2.5 0.0 0.0 0.0 0.0 2.3 0.796 0.732 1.000 76.4
If you specify REPORTS(CACHE(DEVICE)) when running the cache report, you will get the detailed report by volume as in Example 15-9 on page 472. This report gives you the detailed cache statistics of each volume. By specifying REPORTS(CACHE(SSID(nnnn))), you can limit this report to only certain LCUs. The report basically shows the same performance statistics as in Example 15-7 on page 470, but at the level of each volume.
SYSTEM ID WIN5 DATE 11/05/2008 INTERVAL 00.59.976 CONVERTED TO z/OS V1R10 RMF TIME 21.54.00 SUBSYSTEM 2107-01 CU-ID C01C SSID 0847 CDATE 11/05/2008 CTIME 21.54.01 CINT 01.00 TYPE-MODEL 2107-932 MANUF IBM PLANT 75 SERIAL 0000000AB171 VOLSER @9C02F NUM C02F extent POOL 0000 -------------------------------------------------------------------------------------------------------------------------CACHE DEVICE STATUS -------------------------------------------------------------------------------------------------------------------------CACHE STATUS DUPLEX PAIR STATUS CACHING - ACTIVE DUPLEX PAIR - NOT ESTABLISHED DASD FAST WRITE - ACTIVE STATUS - N/A PINNED DATA - NONE DUAL COPY VOLUME - N/A -------------------------------------------------------------------------------------------------------------------------CACHE DEVICE ACTIVITY -------------------------------------------------------------------------------------------------------------------------TOTAL I/O 3115 CACHE I/O 3115 CACHE OFFLINE N/A TOTAL H/R 0.901 CACHE H/R 0.901 CACHE I/O -------------READ I/O REQUESTS----------------------------------WRITE I/O REQUESTS---------------------% REQUESTS COUNT RATE HITS RATE H/R COUNT RATE FAST RATE HITS RATE H/R READ NORMAL 2786 46.4 2477 41.3 0.889 329 5.5 329 5.5 329 5.5 1.000 89.4 SEQUENTIAL 0 0.0 0 0.0 N/A 0 0.0 0 0.0 0 0.0 N/A N/A CFW DATA 0 0.0 0 0.0 N/A 0 0.0 0 0.0 0 0.0 N/A N/A TOTAL 2786 46.4 2477 41.3 0.889 329 5.5 329 5.5 329 5.5 1.000 89.4 -----------------------CACHE MISSES----------------------------------MISC-----------------NON-CACHE I/O----REQUESTS READ RATE WRITE RATE TRACKS RATE COUNT RATE COUNT RATE DFW BYPASS 0 0.0 ICL 0 0.0 NORMAL 309 5.1 0 0.0 311 5.2 CFW BYPASS 0 0.0 BYPASS 0 0.0 SEQUENTIAL 0 0.0 0 0.0 0 0.0 DFW INHIBIT 0 0.0 TOTAL 0 0.0 CFW DATA 0 0.0 0 0.0 ASYNC (TRKS) 173 2.9 TOTAL 309 RATE 5.1 ---CKD STATISTICS-----RECORD CACHING------HOST ADAPTER ACTIVITY----------DISK ACTIVITY------BYTES BYTES RESP BYTES BYTES WRITE 0 READ MISSES 0 /REQ /SEC TIME /REQ /SEC WRITE HITS 0 WRITE PROM 111 READ 4.1K 190.1K READ 14.302 55.6K 288.4K WRITE 4.0K 21.8K WRITE 43.472 18.1K 48.1K
and CONN time on page 467. This rule does not apply for PPRC ports, especially if the distance between the primary site and the secondary site is significant. If the DS8000 is shared between System z and Open Systems, the report in Example 15-10 also shows the port activity used by the Open Systems. It shows up as SCSI READ and SCSI WRITE on ports 0200 and 0201 in Example 15-16 on page 484.
Example 15-10 DS8000 link statistics
E S S z/OS V1R7 SERIAL NUMBER 00000ABC01 ------ADAPTER-----SAID TYPE 0000 FIBRE 2Gb L I N K S T A T I S T I C S INTERVAL 14.59.778 CYCLE 1.000 SECONDS CTIME 01.14.01 CINT 14.59 RESP TIME /OPERATION 0.1 0.2 I/O INTENSITY 131.6 123.4 -----255.0 79.9 101.4 -----181.2 1024.9 0.0 -----1024.9 998.0 0.0 -----998.0 67.5 83.3 -----150.8 53.3 3.5 -----56.8
SYSTEM ID SYSA DATE 02/01/2008 CONVERTED TO z/OS V1R10 RMF TIME 01.14.00 TYPE-MODEL 002107-921 CDATE 02/01/2008 BYTES /SEC 17.2M 7.7M BYTES /OPERATION 9.9K 14.5K OPERATIONS /SEC 1735.2 533.9
9.1M 7.7M
8.4K 17.0K
1087.2 455.9
0.1 0.2
6.0M 0.0
53.1K 0.0
112.2 0.0
9.1 0.0
6.2M 0.0
53.1K 0.0
115.9 0.0
8.6 0.0
10.8M 1.9M
30.7K 31.5K
352.4 60.9
0.2 1.4
9.0M 135.0K
38.7K 10.7K
232.0 12.6
0.2 0.3
(Figure: DS8000 I/O enclosures 0 through 3, showing the device adapter slots and the host adapter port IDs (for example, I0240-I0243, I0300-I0303, I0310-I0313, I0330-I0333, and I0340-I0343) by enclosure and slot.)
SYSTEM ID SYSA DATE 10/29/2008 CONVERTED TO z/OS V1R10 RMF TIME 02.14.00 00000ABCD1 TYPE-MODEL 002107-921 CDATE 10/29/2008 ------ READ OPERATIONS ------OPS BYTES BYTES RTIME /SEC /OP /SEC /OP 3.7 35.8K 133.6K 9.6 32.9 48.3K 1.6M 4.5 11.2 43.4K 484.2K 7.2 57.7 49.9K 2.9M 8.8 153.3 53.4K 8.2M 9.5 329.8 45.5K 15.0M 5.6 110.7 46.1K 5.1M 7.2
--EXTENT POOL-ID TYPE 0000 CKD 1Gb 0001 CKD 1Gb 0002 CKD 1Gb 0003 CKD 1Gb 0004 CKD 1Gb 0005 CKD 1Gb 0006 CKD 1Gb
------ WRITE OPERATIONS -----OPS BYTES BYTES RTIME /SEC /OP /SEC /OP 11.3 57.9K 656.8K 61.9 11.7 35.0K 410.7K 24.1 37.9 123.1K 4.7M 120.1 88.1 145.2K 12.8M 126.3 87.2 143.5K 12.5M 135.1 28.4 156.6K 4.4M 68.3 25.6 16.3K 418.3K 53.4
--ARRAY-NUM WDTH 1 6 1 6 1 6 1 6 1 6 1 6 1 6
MIN RPM 15 15 15 15 15 15 15
5 5 5 5 5 5 5
1 1 1
6 6 6
15 15 15
Multiple ranks on one extent pool without Storage Pool Striping (SPS)
Now, we examine the effect on performance analysis when we have multiple ranks defined on one extent pool. Example 15-12 shows the rank statistics for that configuration. In this example, extent pool 0000 contains ranks with RRID 0001, 0003, 0005, 0007, 0009, 000B, 000D, 000F, 0011, 0013, and 001F. Each rank's performance statistics, as well as the weighted average performance of the extent pool, are reported here.

Regardless of the way in which we define the LCU in relationship to the extent pool, identifying the cause of a performance problem is complicated, because we can only see the association of a volume to an extent pool and not to a rank. Refer to Example 15-8 on page 471. The DSCLI showrank command can provide a list of all of the volumes that reside on the rank that we want to investigate further. Analyzing performance based on this showrank output can be difficult, because it can show volumes from multiple LCUs that reside on this one rank.

Note: Depending on the technique used to define the CKD volumes on the extent pool (refer to Table 5-2 on page 97), certain volumes can be allocated on multiple ranks within the extent pool, which complicates the performance analysis even further.
Example 15-12 Rank statistics for multiple ranks on one extent pool without SPS
E S S z/OS V1R8 SERIAL NUMBER 00000BCDE1 R A N K S T A T I S T I C S DATE 09/28/2008 TIME 11.59.00 09/28/2008 CTIME INTERVAL 14.59.942 CYCLE 1.000 SECONDS 11.59.01 CINT 14.59
0001
CKD 1Gb
RRID 0001 0003 0005 0007 0009 000B 000D 000F 0011 0013 001F POOL 0000 0002 0004 0006 0008 000A 000C 000E 0010 0012 001E POOL
------ READ OPERATIONS ------OPS BYTES BYTES RTIME /SEC /OP /SEC /OP 458.1 53.1K 24.3M 2.5 95.2 41.3K 3.9M 3.4 580.4 54.0K 31.3M 2.7 146.6 45.5K 6.7M 5.0 30.8 22.2K 685.7K 6.1 167.6 47.2K 7.9M 4.2 49.2 26.1K 1.3M 5.9 255.4 53.1K 13.5M 3.1 103.0 39.3K 4.1M 7.2 127.0 47.5K 6.0M 3.2 1.0 9.4K 9.6K 7.5 2014.3 49.5K 99.8M 3.4 129.4 51.3K 6.6M 2.3 228.6 50.9K 11.6M 2.0 51.0 36.1K 1.8M 5.2 189.3 52.6K 10.0M 1.8 160.5 49.1K 7.9M 2.1 71.4 38.2K 2.7M 6.3 230.1 50.9K 11.7M 2.7 183.0 50.6K 9.3M 2.6 150.1 50.3K 7.6M 2.6 193.2 51.6K 10.0M 4.4 4.5 49.8K 224.8K 16.6 1591.2 49.9K 79.4M 2.9
------ WRITE OPERATIONS -----OPS BYTES BYTES RTIME /SEC /OP /SEC /OP 16.8 10.6K 178.5K 15.2 19.4 11.9K 230.5K 13.5 15.8 22.8K 359.4K 16.4 14.6 14.4K 210.4K 12.6 14.5 31.8K 462.2K 14.2 20.7 52.9K 1.1M 13.1 12.5 12.3K 152.9K 12.1 11.8 26.8K 317.5K 15.4 20.4 21.5K 437.2K 15.0 7.3 39.3K 285.9K 13.7 1.8 43.6K 78.7K 22.7 155.5 24.5K 3.8M 14.2 6.8 21.9K 149.3K 12.5 16.7 38.9K 648.1K 14.7 7.3 17.1K 125.8K 11.0 7.3 17.3K 126.4K 11.6 8.8 28.1K 246.5K 13.6 12.2 11.5K 140.1K 10.4 11.8 15.0K 176.4K 13.1 16.6 86.8K 1.4M 15.1 6.2 22.0K 136.2K 12.5 7.6 24.1K 182.8K 14.3 0.6 57.0K 32.7K 83.1 101.8 33.4K 3.4M 13.5
--ARRAY-NUM WDTH 1 6 1 7 1 6 1 7 1 6 1 7 1 6 1 6 1 7 1 6 1 7 11 71 1 6 1 7 1 6 1 6 1 6 1 6 1 7 1 6 1 7 1 7 1 7 11 71
MIN RPM 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15
RANK CAP 876G 1022G 876G 1022G 876G 1022G 876G 876G 1022G 876G 1022G 10366G 876G 1022G 876G 876G 876G 876G 1022G 876G 1022G 1022G 1022G 10366G
RAID TYPE RAID RAID RAID RAID RAID RAID RAID RAID RAID RAID RAID RAID RAID RAID RAID RAID RAID RAID RAID RAID RAID RAID RAID RAID
5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
0001
CKD 1Gb
RRID 0000 0002 0008 000A 0010 0012 0018 001A POOL 0004 0006 000C 000E 0014 0016 001C 001E POOL
------ READ OPERATIONS ------OPS BYTES BYTES RTIME /SEC /OP /SEC /OP 40.9 49.1K 2.0M 6.4 39.2 47.6K 1.9M 6.8 40.9 48.2K 2.0M 6.4 37.7 48.4K 1.8M 6.5 40.9 48.9K 2.0M 6.2 40.0 47.8K 1.9M 6.2 42.1 47.8K 2.0M 6.6 45.9 48.6K 2.2M 6.7 327.8 48.3K 15.8M 6.5 40.8 47.8K 1.9M 6.7 42.0 48.8K 2.0M 6.9 38.4 49.4K 1.9M 6.5 43.0 49.0K 2.1M 6.6 40.4 49.2K 2.0M 6.6 40.5 48.8K 2.0M 6.5 37.9 47.0K 1.8M 6.2 39.6 48.6K 1.9M 7.2 322.7 48.6K 15.7M 6.7
------ WRITE OPERATIONS -----OPS BYTES BYTES RTIME /SEC /OP /SEC /OP 11.3 13.1K 148.5K 7.1 11.9 15.8K 187.9K 7.5 10.9 16.5K 179.1K 7.2 10.3 13.8K 142.0K 7.2 11.4 14.4K 163.8K 7.3 10.5 12.3K 128.9K 7.1 12.1 12.4K 150.7K 7.2 12.7 15.1K 192.2K 7.3 91.1 14.2K 1.3M 7.2 11.2 14.0K 157.3K 9.2 11.9 16.1K 192.2K 9.2 9.5 16.5K 157.3K 8.8 12.1 16.6K 201.0K 8.5 10.8 15.2K 163.8K 8.4 11.5 16.0K 183.5K 8.5 10.5 14.5K 152.9K 8.2 10.9 16.5K 179.1K 10.0 88.4 15.7K 1.4M 8.9
--ARRAY-NUM WDTH 1 6 1 6 1 6 1 6 1 6 1 6 1 6 1 6 8 48 1 7 1 7 1 7 1 7 1 7 1 7 1 7 1 7 8 56
MIN RPM 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15
RANK CAP 438G 438G 438G 438G 438G 438G 438G 438G 3504G 511G 511G 511G 511G 511G 511G 511G 511G 4088G
RAID TYPE RAID RAID RAID RAID RAID RAID RAID RAID RAID RAID RAID RAID RAID RAID RAID RAID RAID RAID
5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
makes it much easier to analyze the I/O configuration and performance. RMF Magic automatically determines the I/O configuration from your RMF data, showing the relationship between the disk subsystem serial numbers, SSIDs, LCUs, device numbers, and device types. With RMF Magic, there is no need to manually consolidate printed RMF reports.

While RMF Magic reports are based on information from RMF records, the analysis and reporting go beyond what RMF provides. In particular, RMF Magic computes accurate estimates for the read and write bandwidth (MB/s) for each disk subsystem and down to the device level. With this unique capability, RMF Magic can size the links in a future remote copy configuration, because it knows the bandwidth that is required for the links, both in I/O requests and in megabytes per second, for each point in time.

RMF Magic consolidates the information from RMF records with channel, disk, LCU, and cache information into one view per disk subsystem, per SSID (LCU), and per storage group. RMF Magic gives insight into your performance and workload data for each RMF interval within a period selected for the analysis, which can span weeks. Where RMF postprocessor reports are sorted by host and LCU, RMF Magic reports are sorted by disk subsystem and SSID (LSS). With this information, you can plan migrations and consolidations more effectively, because RMF Magic provides detailed insight into the workload, both from a disk subsystem and a storage group perspective.

RMF Magic's graphical capabilities allow you to find hot spots and tuning opportunities in your disk configuration. Based on user-defined criteria, RMF Magic can automatically identify peaks within the analysis period. On top of that, the graphical reports make all of the peaks and anomalies stand out immediately, which helps you with analyzing and identifying performance bottlenecks.

You can use RMF Magic to analyze subsystem performance for z/OS hosts. If your DS8000 also provides storage for Open Systems hosts, the tool provides reports on rank statistics and host port link statistics. The DS8000 storage subsystem provides these Open Systems statistics to RMF whenever performance data is reported to RMF, and they are available for reporting through RMF Magic. Of course, if you have a DS8000 that has only Open Systems activity and does not include any z/OS 3390 volumes, this data cannot be collected by RMF and reported on by RMF Magic.
You execute an RMF Magic study in four steps:
1. Data collection: RMF records are collected at the site to be analyzed and sorted in time sequence.
2. RMF Magic reduce step: This step compresses the input data (better than 10:1) and creates a database. The reduce step also validates the input data.
3. RMF Magic analyze step: Based on information found in the RMF data, this step computes supplemental values, such as write MB/s. The analyst defines the groups into which to summarize the data for Remote Copy sizing or performance reporting. The output of this step consists of:
   a. Reports that are created in the form of comma separated values (csv) files to use as input for the next step.
   b. Top-10 interval lists based on user criteria.
   The csv files are loaded into a reporting database or into a Microsoft Excel workbook.
4. Data presentation and analysis: RMF Magic for Windows is used to create graphical summaries of the data stored in the reporting database. The analyst can now profile the workload, looking at various workload characteristics from the storage unit point of view. This process might require additional analysis runs, because various interest groups (or application data groups) are identified for analysis.
SORT FIELDS=(11,4,CH,A,7,4,CH,A),EQUALS
MODS E15=(ERBPPE15,500,,N),E35=(ERBPPE35,500,,N)
The other standard charts show the breakdown by disk subsystem or by storage group, and also the data transfer rate in MBps. These charts show the power of being able to see a graphical representation of your storage subsystem over time. Figure 15-17 is a sampling of the standard summary charts that are automatically created by the RMF Magic reporting tool. Additional standard reports include back-end RAID activity rates, read hit percentages, and a variety of breakdowns of I/O response time components.
In addition to the graphical view of your performance data, RMF Magic also provides detailed spreadsheet views of important measurement data, with highlighting of spreadsheet cells for more visual access to the data. Figure 15-18 on page 481 is an example of the Performance Summary for a single disk subsystem. For each of the data measurement points (for example, column C shows I/O Rate while column D shows Response Time), rows 10 through 15 show a summary of the information. Rows 10 through 12 show the maximum rate that was measured, the RMF interval in which that rate occurred, and which row of the spreadsheet shows that maximum number. Easy navigation to that hot spot is available by selecting a cell in the desired column and then clicking Goto Max. For example, selecting cell D10 and then clicking Goto Max moves the cursor to highlight cell D22, which is the peak response time of 2.15 ms. At this position, you can also see the I/O rate and data rate during this interval.

Figure 15-18 on page 481 also shows color-coded highlighting for cells that show the top intervals of measurement data. For example, if you are viewing this text in color, you see that column F has cells that are highlighted in pink (in this particular view, there is only one cell highlighted). The pink cells represent all of the measurement intervals that have values higher than the 95th percentile. This feature of the RMF Magic tool provides visual access to potential performance hot spots represented by your measurement data.
Figure 15-18 I/O and data rate summary for a single subsystem
Figure 15-19 on page 482 shows a spreadsheet similar in appearance to Figure 15-18 but, in this case, shows the summary of the cache measurement data. Again, the tool highlights those cells that have the highest measurement intervals for ease of navigation within the data.
(Figure 15-19: RMF Magic (RMFM4W 4.3.9) Performance Summary with cache information for each disk subsystem — I/O rate, response time, read hit percentage, and read/write ratio per RMF interval, with the maximum, 98th percentile, 95th percentile, and average values summarized at the top.)
In summary, you use the RMF Magic for Windows tool to get a view of I/O activity from the disk subsystem point of view, as opposed to the operating system point of view shown by the standard RMF reporting tools. This approach allows you to analyze the effect that each of the operating system images has on the various disk subsystem resources.
Figure 15-20 shows the spreadsheet created from the csv file. Only the ECKD links and performance statistics are shown on this spreadsheet. The following fields are calculated using these formulas:
Read KB/op = (Read MBps) x 1000 / (Read op/sec)
Write KB/op = (Write MBps) x 1000 / (Write op/sec)
At this RMF interval period, the average is 12.5 KB/op for reads and 8.1 KB/op for writes. You can plot this data over a period of time where the CONN time is higher, and then compare the chart to a period in the past to see if the transfer size in KB/op has increased. If the transfer size in KB/op has increased, this increase explains the increase in the CONN time.
(Figure 15-20: spreadsheet of ECKD link statistics for one RMF interval — link speed (2 Gbps), read and write MB/sec, write operations per second, read and write response times, and the calculated read and write KB/op, averaging 12.5 KB/op for reads and 8.1 KB/op for writes.)
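The same calculation is easy to script when you work with the csv data directly. A minimal sketch follows; the read operations-per-second value used in the read example is back-computed from the 12.4 KB/op shown in Figure 15-20, so treat it as an assumed input rather than a value taken from the report.

def kb_per_op(mb_per_sec, ops_per_sec):
    """Transfer size in KB per operation, as in the Read KB/op and Write KB/op columns."""
    if ops_per_sec == 0:
        return 0.0
    return mb_per_sec * 1000.0 / ops_per_sec

# Read example: ~31.95 MB/sec at an assumed ~2580 read ops/sec -> ~12.4 KB/op
print(round(kb_per_op(31.95, 2580.0), 1))
# Write example from the first row of Figure 15-20: 10.51 MB/sec at 1277.36 ops/sec -> ~8.2 KB/op
print(round(kb_per_op(10.51, 1277.36), 1))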
busydevice=(criterion=iorate,threshold=99);

Example 15-16 Parameters for read + write throughput criteria
interestgroup sp2,
 limitto=(dss=(ibm-12345))
 select=(devnum=((begin1, end1),..,(beginN, endN)),
 busydevice=(criterion=mbs,threshold=5,miniorate=1);
The output that you want to look at is the xxx$DDEV file, which has an entry for each individual volume that meets the above criteria. Based on either or both csv files, you should be able to determine which volumes are the biggest contributors to the rank saturation. After identifying these volumes, you need to move some of them to other, less busy extent pools. If all the other extent pools are just as busy, then it is time to add more ranks to handle the workload. Figure 15-21 shows the spreadsheet created from the xxx$DDEV.csv file for the read + write throughput criteria. The r+w mbs column is calculated from (writembs) + (readmbs). You can sort this spreadsheet by the r+w mbs column to identify the volumes that are the biggest load contributors.
(Figure 15-21: spreadsheet created from the xxx$DDEV.csv file for the read + write throughput criteria — device address, volser, I/O rate, response time components (IOSQ, PEND, CONN, DISC), read hit ratio, read and write rates and MB/sec, and the calculated r+w mbs column, sorted by r+w mbs in descending order.)
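If you prefer to do this outside the spreadsheet, the same calculation and sort take only a few lines. The sketch below assumes the csv column names match the headings shown in Figure 15-21 (addr, Volser, readmbs, writembs); adjust them to the actual file layout.

import csv

def busiest_volumes(csv_path, top_n=10):
    """Rank volumes by combined read + write MB/sec from the RMF Magic device-level csv."""
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))
    for row in rows:
        row["r+w mbs"] = float(row["readmbs"]) + float(row["writembs"])
    rows.sort(key=lambda r: r["r+w mbs"], reverse=True)
    return rows[:top_n]

# Example usage (path and column names are assumptions):
# for row in busiest_volumes("xxx$DDEV.csv"):
#     print(row["addr"], row["Volser"], round(row["r+w mbs"], 1))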
Chapter 16. Databases
This chapter reviews the major IBM database systems and the performance considerations when they are used with the DS8000 disk subsystem. We limit our discussion to the following databases:
- DB2 Universal Database (DB2 UDB) in a z/OS environment
- DB2 in an open environment
- IMS in a z/OS environment
You can obtain additional information at this Web site:
http://www.ibm.com/software/data/db2/udb
TABLE
All data managed by DB2 is associated to a table. The table is the main object used by DB2 applications.
TABLESPACE
A tablespace is used to store one or more tables. A tablespace is physically implemented with one or more datasets. Tablespaces are VSAM linear datasets (LDS). Because tablespaces can be larger than the largest possible VSAM dataset, a DB2 tablespace can require more than one VSAM dataset.
INDEX
A table can have one or more indexes (or can have no index). An index contains keys. Each key points to one or more data rows. The purpose of an index is to get direct and faster access to the data in a table.
INDEXSPACE
An index space is used to store an index. An index space is physically represented by one or more VSAM LDSs.
DATABASE
The database is a DB2 representation of a group of related objects. Each of the previously named objects has to belong to a database. DB2 databases are used to organize and manage these objects.
STOGROUP
A DB2 storage group is a list of storage volumes. STOGROUPs are assigned to databases, tablespaces, or index spaces when using DB2-managed objects. DB2 uses STOGROUPs for disk allocation of the table and index spaces. Installations that are storage management subsystem (SMS)-managed can define STOGROUP with VOLUME(*). This specification implies that SMS assigns a volume to the table and index spaces in that STOGROUP. In order to assign a volume to the table and index spaces in the STOGROUP, SMS uses automatic class selection (ACS) routines to assign a storage class, a management class, and a storage group to the table or index space.
The recovery datasets are:
- Bootstrap dataset: DB2 uses the bootstrap dataset (BSDS) to manage recovery and other DB2 subsystem information. The BSDS contains information needed to restart and to recover DB2 from any abnormal circumstance. For example, all log datasets are automatically recorded in the BSDS. While DB2 is active, the BSDS is open and updated. DB2 always requires two copies of the BSDS, because they are critical for data integrity. For availability reasons, the two BSDS datasets must be put on separate servers on the DS8000 or in separate logical control units (LCUs).
- Active logs: The active log datasets are used for data recovery and to ensure data integrity in case of software or hardware errors. DB2 uses active log datasets to record all updates to user and system data. The active log datasets are open as long as DB2 is active. Active log datasets are reused when the total active log space is used up, but only after the active log (to be overlaid) has been copied to an archive log. DB2 supports dual active logs. We strongly recommend that you use dual active logs for all DB2 production environments. For availability reasons, the log datasets must be put on separate servers on the DS8000 or in separate LCUs.
- Archive logs: Archive log datasets are DB2-managed backups of the active log datasets. Archive log datasets are automatically created by DB2 whenever an active log is filled. DB2 supports dual archive logs, and we recommend that you use dual archive log datasets for all production environments. Archive log datasets are sequential datasets that can be defined on disk or on tape and migrated and deleted with standard procedures.
If you want optimal performance from the DS8000, do not treat it entirely as a black box. Understand how DB2 tables map to the underlying volumes and how the volumes map to RAID arrays.
Measurements designed to determine how larger volumes affect DB2 performance have shown that response times similar to those of the smaller, standard-size 3390-3 volumes can be obtained when using larger volumes (refer to 15.6.2, Larger volume compared to smaller volume performance on page 451 for a discussion).
Instances
An instance is a logical database manager environment where databases are cataloged and configuration parameters are set. An instance is similar to an image of the actual database manager environment. You can have several instances of the database manager product on the same database server. You can use these instances to separate the development environment from the production environment, tune the database manager to a particular environment, and protect sensitive information from a particular group of users. For database partitioning features (DPF) of the DB2 Enterprise Server Edition (ESE), all data partitions reside within a single instance.
Databases
A relational database structures data as a collection of database objects. The primary database object is the table (a defined number of columns and any number of rows). Each database includes a set of system catalog tables that describe the logical and physical structure of the data, configuration files containing the parameter values allocated for the database, and recovery logs. DB2 UDB allows multiple databases to be defined within a single database instance. Configuration parameters can also be set at the database level, thus allowing you to tune, for example, memory usage and logging.
Database partitions
A partition number in DB2 UDB terminology is equivalent to a data partition. Databases with multiple data partitions residing on an SMP system are also called multiple logical node (MLN) databases.
Partitions are identified by the physical system where they reside as well as by a logical port number within the physical system. The partition number, which can be from 0 to 999, uniquely defines a partition. Partition numbers must be in ascending sequence (gaps in the sequence are allowed). The configuration information of the database is stored in the catalog partition. The catalog partition is the partition from which you create the database.
Partitioning map
When a partitiongroup is created, a partitioning map is associated to it. The partitioning map in conjunction with the partitioning key and hashing algorithm is used by the database manager to determine which database partition in the partitiongroup will store a given row of data. Partitioning maps do not apply to non-partitioned databases.
Containers
A container is the way of defining where on the storage device the database objects will be stored. Containers can be assigned from filesystems by specifying a directory. These containers are identified as PATH containers. Containers can also reference files that reside within a directory. These containers are identified as FILE containers, and a specific size must be identified. Containers can also reference raw devices. These containers are identified as DEVICE containers, and the device must already exist on the system before the container can be used. All containers must be unique across all databases; a container can belong to only one tablespace.
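As an illustration, the DB2 UDB statements that follow sketch the three container types; the tablespace names, directories, files, and raw devices are hypothetical (container sizes are given in pages):

CREATE TABLESPACE ts_sms MANAGED BY SYSTEM
   USING ('/db2/ts_sms/c0', '/db2/ts_sms/c1');

CREATE TABLESPACE ts_dms_file MANAGED BY DATABASE
   USING (FILE '/db2/ts_dms/c0.dat' 262144, FILE '/db2/ts_dms/c1.dat' 262144);

CREATE TABLESPACE ts_dms_dev MANAGED BY DATABASE
   USING (DEVICE '/dev/rvpath0' 262144, DEVICE '/dev/rvpath1' 262144);

The first tablespace uses PATH (directory) containers managed by the operating system, the second uses pre-allocated FILE containers, and the third uses raw DEVICE containers, such as SDD vpaths on AIX.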
Tablespaces
A database is logically organized in tablespaces. A tablespace is a place to store tables. To spread a tablespace over one or more disk devices, you simply specify multiple containers. For partitioned databases, the tablespaces reside in partitiongroups. In the create tablespace command execution, the containers themselves are assigned to a specific partition in the partitiongroup, thus maintaining the shared nothing character of DB2 UDB DPF. Tablespaces can be either system-managed space (SMS) or data-managed space (DMS). For an SMS tablespace, each container is a directory in the filesystem, and the operating system file manager controls the storage space (Logical Volume Manager (LVM) for AIX). For a DMS tablespace, each container is either a fixed-size pre-allocated file, or a physical device, such as a disk (or in the case of the DS8000, a vpath), and the database manager controls the storage space. There are three major types of user tablespaces: Regular (index and data), temporary, and long. In addition to these user-defined tablespaces, DB2 requires that you define a system tablespace, the catalog tablespace. For partitioned database systems, this catalog tablespace resides on the catalog partition.
Tables
Tables consist of a series of logically linked blocks of storage that have been given the same name. They also have a unique structure for storing information that allows the information to be related to information in other tables. When creating a table, you can choose to have certain objects, such as indexes and large object (LOB) data, stored separately from the rest of the table data, but you must define such a table in a DMS tablespace. Indexes are defined for a specific table and assist in the efficient retrieval of data to satisfy queries. They can also be used to assist in the clustering of data. Large objects (LOBs) can be stored in columns of the table. These objects, although logically referenced as part of the table, can be stored in their own tablespace when the base table is defined in a DMS tablespace, which allows for more efficient access of both the LOB data and the related table data.
Pages
Data is transferred to and from devices in discrete blocks that are buffered in memory. These discrete blocks are called pages, and the memory reserved to buffer a page transfer is called an I/O buffer. DB2 UDB supports various page sizes, including 4 KB, 8 KB, 16 KB, and 32 KB. When an application accesses data randomly, the page size determines the amount of data transferred; this size corresponds to the size of the data transfer request made to the DS8000, which is sometimes referred to as the physical record. Sequential read patterns can also influence the page size selected. Larger page sizes for workloads with sequential read patterns can enhance performance by reducing the number of I/Os.
Extents
An extent is a unit of space allocation within a container of a tablespace for a single tablespace object. This allocation consists of multiple pages. The extent size (number of pages) for an object is set when the tablespace is created: An extent is a group of consecutive pages defined to the database. The data in the tablespaces is striped by extent across all the containers in the system.
Buffer pools
A buffer pool is main memory allocated on the host processor to cache table and index data pages as they are being read from disk or being modified. The purpose of the buffer pool is to improve system performance. Data can be accessed much faster from memory than from disk; therefore, the fewer times that the database manager needs to read from or write to disk (I/O), the better the performance. Multiple buffer pools can be created.
The amount of data being prefetched determines the amount of parallel I/O activity. Ordinarily, the database administrator defines a prefetch value large enough to allow parallel use of all of the available containers. Consider the following example: A tablespace is defined with a page size of 16 KB using raw DMS. The tablespace is defined across four containers, each container residing on a separate logical device, and the logical devices are on different DS8000 ranks. The extent size is defined as 16 pages (or 256 KB). The prefetch value is specified as 64 pages (number of containers x extent size). A user makes a query that results in a tablespace scan, which then results in DB2 performing a prefetch operation. The following actions happen: DB2, recognizing that this prefetch request for 64 pages (1 MB) evenly spans four containers, makes four parallel I/O requests, one against each of those containers. The request size to each container is 16 pages (or 256 KB). After receiving several of these requests, the DS8000 recognizes that these DB2 prefetch requests are arriving as sequential accesses, causing the DS8000 sequential prefetch to take effect, which results in all of the disks in all four DS8000 ranks operating concurrently, staging data into the DS8000 cache to satisfy the DB2 prefetch operations.
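A DDL sketch that matches this example follows; the tablespace name, the raw device names, and the 16 KB buffer pool are hypothetical:

CREATE BUFFERPOOL bp16k SIZE 50000 PAGESIZE 16K;

CREATE TABLESPACE ts_scan PAGESIZE 16K MANAGED BY DATABASE
   USING (DEVICE '/dev/rvpath0' 1310720,
          DEVICE '/dev/rvpath1' 1310720,
          DEVICE '/dev/rvpath2' 1310720,
          DEVICE '/dev/rvpath3' 1310720)
   EXTENTSIZE 16
   PREFETCHSIZE 64
   BUFFERPOOL bp16k;

EXTENTSIZE 16 gives 256 KB extents (16 pages x 16 KB), and PREFETCHSIZE 64 (4 containers x 16 pages) lets a single prefetch request drive all four containers, and therefore all four ranks, in parallel.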
Page cleaners
Page cleaners are present to make room in the buffer pool before prefetchers read pages on
disk storage and move them into the buffer pool. For example, if a large amount of data has been updated in a table, many data pages in the buffer pool might be updated but not written into disk storage (these pages are called dirty pages). Because prefetchers cannot place fetched data pages onto the dirty pages in the buffer pool, these dirty pages must be flushed to disk storage and become clean pages so that prefetchers can place fetched data pages from disk storage.
Logs
Changes to data pages in the buffer pool are logged. Agent processes updating a data record in the database update the associated page in the buffer pool and write a log record into a log buffer. The written log records in the log buffer will be flushed into the log files asynchronously by the logger. To optimize performance, neither the updated data pages in the buffer pool nor the log records in the log buffer are written to disk immediately. They are written to disk by page cleaners and the logger, respectively. The logger and the buffer pool manager cooperate and ensure that the updated data page is not written to disk storage before its associated log record is written to the log. This behavior ensures that the database manager can obtain enough information from the log to recover and protect a database from being left in an inconsistent state when the database has crashed as a result of an event, such as a power failure.
Parallel operations
DB2 UDB extensively uses parallelism to optimize performance when accessing a database. DB2 supports several types of parallelism, including query and I/O parallelism.
Query parallelism
There are two dimensions of query parallelism: Inter-query parallelism and intra-query parallelism. Inter-query parallelism refers to the ability of multiple applications to query a database at the same time. Each query executes independently of the other queries, but they are all executed at the same time. Intra-query parallelism refers to the simultaneous processing of parts of a single query, using intra-partition parallelism, inter-partition parallelism, or both: Intra-partition parallelism subdivides what is usually considered a single database operation, such as index creation, database loading, or SQL queries, into multiple parts, many or all of which can be run in parallel within a single database partition. Inter-partition parallelism subdivides what is usually considered a single database operation, such as index creation, database loading, or SQL queries, into multiple parts, many or all of which can be run in parallel across multiple partitions of a partitioned database on one machine or on multiple machines. Inter-partition parallelism only applies to DPF.
I/O parallelism
When there are multiple containers for a tablespace, the database manager can exploit parallel I/O. Parallel I/O refers to the process of writing to, or reading from, two or more I/O devices simultaneously. Parallel I/O can result in significant improvements in throughput. DB2 implements a form of data striping by spreading the data in a tablespace across multiple containers. In storage terminology, the part of a stripe that is on a single device is a strip. The DB2 term for strip is extent. If your tablespace has three containers, DB2 will write one extent to container 0, the next extent to container 1, the next extent to container 2, and then back to container 0. The stripe width (a generic term not often used in DB2 literature) is equal to the number of containers, or three in this case. Extent sizes are normally measured in numbers of DB2 pages. Containers for a tablespace are ordinarily placed on separate physical disks, allowing work to be spread across those disks, and allowing disks to operate in parallel. Because the DS8000 logical disks are striped across the rank, the database administrator can allocate DB2 containers on separate logical disks residing on separate DS8000 arrays, which takes advantage of the parallelism both in DB2 and in the DS8000. For example, four DB2 containers residing on four DS8000 logical disks on four different 7+P ranks will have data spread across 32 physical disks.
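Because each DS8000 logical disk is itself striped across the disks of a rank, you can also tell DB2 to issue parallel prefetch I/O within a single container; a minimal sketch using the DB2 registry follows (the setting applies to all tablespaces and takes effect after an instance restart):

# Indicate that containers reside on striped (multi-disk) devices so that the
# prefetchers issue parallel I/O requests within each container
db2set DB2_PARALLEL_IO=*
db2stop
db2start

Without this setting, DB2 assumes that each container resides on a single disk when it calculates prefetch parallelism.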
If you want optimal performance from the DS8000, do not treat it completely like a black box. Establish a storage allocation policy that allocates data using several DS8000 ranks. Understand how DB2 tables map to underlying logical disks, and how the logical disks are allocated across the DS8000 ranks. One way of making this process easier to manage is to maintain a modest number of DS8000 logical disks.
Figure 16-2 Allocating DB2 containers using a spread your data approach
In addition, consider that:
You can intermix data, indexes, and temp spaces on the DS8000 ranks.
Your I/O activity is more evenly spread, and thus you avoid the skew effect, which you otherwise see if the components are isolated.
For DPF systems, establish a policy that allows partitions and containers within partitions to be spread evenly across DS8000 resources. Alternatively, choose a vertical mapping, in which DB2 partitions are isolated to specific arrays, with containers spread evenly across those arrays.
Page size
Page sizes are defined for each tablespace. There are four supported page sizes: 4 K, 8 K, 16 K, and 32 K. Factors that affect the choice of page size include:
The maximum number of records per page is 255. To avoid wasting space on a page, do not make the page size greater than 255 times the row size plus the page overhead.
The maximum size of a tablespace is proportional to the page size of its tablespace. In SMS, the data and index objects of a table have limits, as shown in Table 16-1. In DMS, these limits apply at the tablespace level.
Table 16-1 Page size relative to tablespace size
Page size    Maximum data/index object size
4 KB         64 GB
8 KB         128 GB
16 KB        256 GB
32 KB        512 GB
Select a page size that can accommodate the total expected growth requirements of the objects in the tablespace.
For OLTP applications that perform random row read and write operations, a smaller page size is preferable, because it wastes less buffer pool space with unwanted rows. For DSS applications that access large numbers of consecutive rows at a time, a larger page size is better, because it reduces the number of I/O requests that are required to read a specific number of rows. Tip: Experience indicates that page size can be dictated to a certain degree by the type of workload. For pure OLTP workloads, we recommend a 4 KB page size. For a pure DSS workload, we recommend a 32 KB page size. For a mixture of OLTP and DSS workload characteristics, we recommend either an 8 KB page size or a 16 KB page size.
Extent size
If you want to stripe across multiple arrays in your DS8000, assign a LUN from each rank to be used as a DB2 container. During writes, DB2 will write one extent to the first container, the next extent to the second container, and so on until all eight containers have been addressed before cycling back to the first container. DB2 stripes across containers at the tablespace level. Because DS8000 stripes at a fairly fine granularity (256 KB), selecting multiples of 256 KB for the extent size makes sure that multiple DS8000 disks are used within a rank when a DB2 prefetch occurs. However, keep your extent size below 1 MB. I/O performance is fairly insensitive to the selection of extent sizes, mostly due to the fact that DS8000 employs sequential detection and prefetch. For example, even if you picked an extent size, such as 128 KB, which is smaller than the full array width (it will involve accessing only four disks in the array), the DS8000 sequential prefetch keeps the other disks in the array busy.
Prefetch size
The tablespace prefetch size determines the degree to which separate containers can operate in parallel. Although larger prefetch values might enhance throughput of individual queries, mixed applications generally operate best with moderate-sized prefetch and extent parameters. You will want to engage as many arrays as possible in your prefetch to maximize throughput. It is worthwhile to note that prefetch size is tunable. We mean that prefetch size can be altered after the tablespace has been defined and data loaded, which is not true for extents and page sizes that are set at tablespace creation time and cannot be altered without redefining the tablespace and reloading the data. Tip: The prefetch size must be set so that as many arrays as desired can be working on behalf of the prefetch request. For other than the DS8000, the general recommendation is to calculate prefetch size to be equal to a multiple of the extent size times the number of containers in your tablespace. For the DS8000, you can work with a multiple of the extent size times the number of arrays underlying your tablespace.
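For example, if the hypothetical tablespace sketched earlier had eight containers on eight arrays, its prefetch size could be widened accordingly without recreating the tablespace:

ALTER TABLESPACE ts_scan PREFETCHSIZE 128;

The new value (8 x EXTENTSIZE 16 = 128 pages) takes effect in place, unlike a change to the extent size or page size, which requires redefining the tablespace and reloading the data.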
The DS8000 supports a high degree of parallelism and concurrency on a single logical disk. As a result, a single logical disk the size of an entire array achieves the same performance as many smaller logical disks. However, you must consider how logical disk size affects both the host I/O operations and the complexity of your organization's systems administration. Smaller logical disks provide more granularity, with their associated benefits, but they also increase the number of logical disks seen by the operating system. Select a DS8000 logical disk size that allows for granularity and growth without proliferating the number of logical disks. Take into account your container size and how the containers will map to AIX logical volumes and DS8000 logical disks. In the simplest situation, the container, the AIX logical volume, and the DS8000 logical disk are the same size.
Tip: Try to strike a reasonable balance between flexibility and manageability for your needs. Our general recommendation is that you create no fewer than two logical disks in an array, and the minimum logical disk size needs to be 16 GB. Unless you have an extremely compelling reason, standardize on a unique logical disk size throughout the DS8000.
The advantages and disadvantages of larger and smaller logical disk sizes include:
Advantages of smaller logical disks:
Easier to allocate storage for different applications and hosts.
Greater flexibility in performance reporting; for example, PDCU reports statistics for logical disks.
Disadvantages of smaller logical disks:
Small logical disk sizes can contribute to proliferation of logical disks, particularly in SAN environments and large configurations. Administration gets complex and confusing.
Advantages of larger logical disks:
Simplifies understanding of how data maps to arrays.
Reduces the number of resources used by the operating system.
Storage administration is simpler, thus more efficient and with fewer chances for mistakes.
Disadvantages of larger logical disks:
Less granular storage administration, resulting in less flexibility in storage allocation.
Examples
Let us assume a 6+P array with 146 GB disk drives. Suppose you wanted to allocate disk space on your 16-array DS8000 as flexibly as possible. You can carve each of the 16 arrays up into 32 GB logical disks or logical unit numbers (LUNs), resulting in 27 logical disks per array (with a little left over). This design yields a total of 16 x 27 = 432 LUNs. Then, you can implement 4-way multipathing, which in turn makes 4 x 432 = 1728 hdisks visible to the operating system. Not only does this approach create an administratively complex situation, but at every reboot, the operating system will query each of those 1728 disks. Reboots might take a long time.
Alternatively, you create just 16 large logical disks. With multipathing and attachment of four Fibre Channel ports, you have 4 x 16 = 128 hdisks visible to the operating system. Although this number is large, it is certainly more manageable, and reboots are much faster. Having overcome that problem, you can then use the operating system logical volume manager to carve this space up into smaller pieces for use. There are problems with this large logical disk approach as well, however. If the DS8000 is connected to multiple hosts or it is on a SAN, disk allocation options are limited when you have so few logical disks. You have to allocate entire arrays to a specific host, and if you wanted to add additional space, you must add it in array-size increments.
16.5.6 Multipathing
Use DS8000 multipathing along with DB2 striping to ensure the balanced use of Fibre Channel paths. Multipathing is the hardware and software support that provides multiple avenues of access to your data from the host computer. You need to provide at least two Fibre Channel paths from the host computer to the DS8000. Paths are defined by the number of host adapters on the DS8000 that service a certain host system's LUNs, the number of Fibre Channel host bus adapters on the host system, and the SAN zoning configuration. The total number of paths ultimately includes consideration for the throughput requirements of the host system. If the host system requires more than 400 MBps (2 x 200 MBps) of throughput, two host bus adapters are not adequate. DS8000 multipathing requires the installation of multipathing software. For AIX, you have two choices: the Subsystem Device Driver Path Control Module (SDDPCM) or the IBM Subsystem Device Driver (SDD). For AIX, we recommend SDDPCM. We discuss these products in Chapter 11, Performance considerations with UNIX servers on page 307 and Chapter 9, Host attachment on page 265. There are several benefits you receive from using multipathing: higher availability, higher bandwidth, and easier management. A high availability implementation is one in which your application can still access data using an alternate resource if a component fails. Easier performance management means that the multipathing software automatically balances the workload across the paths.
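To confirm that the expected number of paths is available and that I/O is spread across them, the multipathing drivers provide query commands; a sketch follows (output varies by configuration):

# SDDPCM (AIX MPIO) environments
pcmpath query device

# SDD environments
datapath query device
datapath query devstats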
Figure: Device response time (msec) for 32 x 3390-3 volumes compared to 4 x 3390-27 volumes at increasing I/O rates
Chapter 17. Copy Services performance
z/OS Global Mirror, previously known as Extended Remote Copy (XRC)
z/OS Metro/Global Mirror across three sites with Incremental Resync
Table 17-1 Reference chart for DS Copy Services on DS8000
DS8000 function            ESS800 Version 2 function    Formerly known as:
FlashCopy                  FlashCopy                    FlashCopy
FlashCopy SE               N/A                          N/A
Global Mirror              Global Mirror                Asynchronous PPRC
Metro Mirror               Metro Mirror                 Synchronous PPRC
Global Copy                Global Copy                  PPRC Extended Distance
z/OS Global Mirror         z/OS Global Mirror           Extended Remote Copy (XRC)
z/OS Metro/Global Mirror   z/OS Metro/Global Mirror     3-site solution using Sync PPRC and XRC
Metro/Global Mirror        Metro/Global Mirror          2 or 3 site Asynchronous Cascading PPRC
Refer to the Interoperability Matrixes for the DS8000 and ESS to confirm which products are supported on a particular disk subsystem.
Note: TotalStorage Productivity Center for Replication currently does not manage Global Copy sessions.
In addition to using these methods, there are several possible interfaces available specifically for System z users to manage DS8000 Copy Services relationships. Table 17-2 lists these tools, which are:
TSO
ICKDSF
DFSMSdss
The ANTRQST macro
Native TPF commands (for z/TPF only)
Table 17-2 Copy Services Management Tools
Tool                                               Runs on z/OS   Runs on Open Systems server   Manages count key data (CKD)   Manages fixed block data (FB)
TSO                                                Yes            No                            Yes                            Yes (1)
ANTRQST                                            Yes            No                            Yes                            Yes (1)
ICKDSF                                             Yes            No                            Yes                            No
DSCLI                                                             Yes                           Yes                            Yes
TotalStorage Productivity Center for Replication                  Yes                           Yes                            Yes
GDPS                                               Yes            No                            Yes                            Yes (1)
1. A CKD unit address (and host unit control block (UCB)) must be defined in the same DS8000 server against which host I/O can be issued to manage Open Systems (FB) LUNs.
Refer to the following IBM Redbooks publications for detailed information about DS8000 Copy Services: IBM System Storage DS8000: Copy Services with IBM System z, SG24-6787 IBM System Storage DS8000: Copy Services in Open Environments, SG24-6788
17.2 FlashCopy
FlashCopy can help reduce or eliminate planned outages for critical applications. FlashCopy is designed to allow read and write access to the source data and the copy almost immediately following the FlashCopy volume pair establishment. Standard FlashCopy uses a normal volume as the target volume. This target volume must be the same size as (or larger than) the source volume, and the space is allocated in the storage subsystem. IBM FlashCopy SE, introduced with Licensed Machine Code 5.3.0.xxx, uses Space Efficient volumes as FlashCopy target volumes. A Space Efficient target volume has a virtual size that is equal to or greater than the source volume size. However, space is not allocated for this volume when the volume is created and the FlashCopy initiated. Only when updates are made to the source volume are the original tracks of the source volume that will be modified copied to the Space Efficient volume. Space in the repository is allocated for just these tracks (or for any write to the target itself).
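As an illustration, the DS CLI sketch that follows defines Space Efficient target volumes; it assumes that a repository has already been created in the extent pool (with the mksestg command), and the extent pool ID, capacity, and volume IDs are hypothetical and will differ in your configuration (exact parameters can also vary by code level):

dscli> mkfbvol -extpool p1 -cap 100 -sam tse 1200-1203

The -sam tse parameter requests track Space Efficient volumes; the four 100 GB targets consume no repository space until source tracks are actually copied to them.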
FlashCopy objectives
The FlashCopy feature of the DS8000 provides you with the capability of making an immediate copy of a logical volume at a specific point in time, which we also refer to as a point-in-time-copy, instantaneous copy, or time zero copy (T0 copy), within a single DS8000 Storage Facility Image. There are several points to consider when you plan to use FlashCopy that might help you minimize any impact that the FlashCopy operation can have on host I/O performance. As Figure 17-1 illustrates, when FlashCopy is invoked, a relationship (or session) is established between the source and target volumes of the FlashCopy pair. This session includes the creation of the necessary bitmaps and metadata information needed to control the copy operation. This FlashCopy establish process completes quickly, at which point: The FlashCopy relationship is fully established. Control returns to the operating system or task that requested the FlashCopy. Both the source volume and its time zero (T0) target volume are available for full read/write access. At this time, a background task within the DS8000 starts copying the tracks from the source to the target volume. Optionally, you can suppress this background copy task using the nocopy option, which is efficient, for example, if you are making a temporary copy just to take a backup to tape. In this case, the FlashCopy relationship remains until explicitly withdrawn. FlashCopy SE can be used in this instance. FlashCopy SE provides most of the function of FlashCopy but can only be used to provide a nocopy copy.
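For example, a temporary copy taken only to run tape backups can be established with the nocopy option and withdrawn afterward; in this DS CLI sketch the source:target volume pairs are hypothetical:

dscli> mkflash -nocp 1000:1100 1001:1101
dscli> rmflash 1000:1100 1001:1101

The first command establishes the relationships without a background copy; the second withdraws them after the backup of the target volumes completes.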
Figure 17-1 FlashCopy: after the T0 establish, both the source and target volumes are available for reads and writes; the optional physical copy progresses in the background, and when the copy is complete, the relationship between source and target ends.
The DS8000 keeps track of which data has been copied from source to target. As Figure 17-1 shows, if an application wants to read data from the target that has not yet been copied, the data is read from the source. Otherwise, the read can be satisfied from the target volume.
FlashCopy SE is designed for temporary copies. Copy duration generally does not last longer than 24 hours unless the source and target volumes have little write activity. FlashCopy SE is optimized for use cases where a small percentage of the source volume is updated during the life of the relationship. If much more than 20% of the source is expected to change, there might be trade-offs in terms of performance as opposed to space efficiency. In this case, standard FlashCopy might be considered as a good alternative. FlashCopy has several options. Not all options are available to all user interfaces. It is important right from the beginning to know the purpose for which the target volume will be used afterward. Knowing this purpose, the options to be used with FlashCopy can be identified, and the environment that supports the selected options can be chosen.
Figure: MB/sec throughput for base FlashCopy and FlashCopy SE (FB 7+P rank, CKD 6+P rank, and FB with 64 ranks and 8 DA pairs)
Figure: IO/sec throughput for base, FlashCopy, and FlashCopy SE (6+P rank; FB 64 ranks, DBO 70/30/50; CKD 48 ranks, cache standard)
Note: This chapter is equally valid for System z volumes and Open Systems LUNs. In the following sections of the present chapter, we use only the terms volume or volumes, but the text is equally valid if the terms LUN and LUNs are used, unless otherwise noted.
Table 17-3 FlashCopy source and target volume location
Processor complex: keep the source and target volumes on the same processor complex
Device adapter: unimportant in most cases; a different device adapter where heavy background copy activity is expected
Rank: place the source and target volumes on different ranks
Tip: To find the relative location of your volumes, you can use the following procedure:
1. Use the lsfbvol command to learn which extent pool contains the relevant volumes.
2. Use the lsrank command to display both the device adapter and the rank for each extent pool.
3. To determine which processor complex contains your volumes, look at the extent pool name. Even-numbered extent pools are always from Server 0, while odd-numbered extent pools are always from Server 1.
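A DS CLI sketch of this procedure follows; the storage image ID and volume range are hypothetical:

dscli> lsfbvol -dev IBM.2107-75ABCDE 1000-100F
dscli> lsrank -dev IBM.2107-75ABCDE -l

The lsfbvol output includes the extent pool of each volume, and lsrank -l relates each rank to its extent pool and device adapter, as described in the tip.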
Rank characteristics
Normal performance planning also includes the tasks to select the disk drives (capacity and rpms) and the RAID configurations that best match the performance needs of the applications. Be aware that with FlashCopy nocopy relations, the DS8000 does copy-on-write for each first change to a source volume track. If the disks of the target volume are slower than the disks of the source volume, copy-on-write might slow down production I/O. A full copy FlashCopy produces an extremely high write activity on the disk drives of the target volume. Therefore, it is always a good practice to use target volumes on ranks with the same characteristics as the source volumes. Finally, you can achieve a small performance improvement by using identical rank geometries for both the source and target volumes. In other words, if the source volumes are located on a rank with a 7 + P configuration, the target volumes are also located on a rank configured as 7 + P.
There is a modest performance impact to logical FlashCopy establish performance when using incremental FlashCopy. In the case of incremental FlashCopy, the DS8000 must create additional metadata (bitmaps). However, the impact is negligible in most cases. Finally, the placement of the FlashCopy source and target volumes has an effect on the establish performance. You can refer to the previous section for a discussion of this topic as well as to Table 17-3 on page 513 for a summary of the recommendations.
The recommended placement of the FlashCopy source and target volumes was discussed in the previous section. Refer to Table 17-3 on page 513 for a summary of the conclusions. For the best background copy performance, always place the source and target volumes in different ranks. There are additional criteria to consider if the FlashCopy is a full box copy that involves all ranks. Note: The term full box copy implies that all rank resources are involved in the copy process. Either all or nearly all ranks have both source and target volumes, or half the ranks have source volumes and half the ranks have target volumes. For full box copies, still place the source and target volumes in different ranks. When all ranks are participating in the FlashCopy, you can still place the source and target volumes in different ranks by doing a FlashCopy of volumes on rank R0 onto rank R1 and volumes on rank R1 onto rank R0 (for example). Additionally, if there is heavy application activity in the
source rank, performance is less affected if the background copy target is in another rank that has lighter application activity.
Note: If Storage Pool Striping is used when allocating volumes, all ranks will be more or less equally busy. Therefore, there is less need to be so concerned about the placement of data, but make sure to still keep the source and the target on the same processor complex.
If background copy performance is of high importance in your environment, use incremental FlashCopy as much as possible. Incremental FlashCopy will greatly reduce the amount of data that needs to be copied and, therefore, greatly reduces the background copy time. If the FlashCopy relationship was established with the -nocp (no copy) option, only write updates to the source volume will force a copy from the source to the target. This forced copy is also called a copy-on-write.
Note: The term copy-on-write describes a forced copy from the source to the target, because a write to the source has occurred. This forced copy occurs on the first write to a track. Note that because the DS8000 writes to nonvolatile cache, there is typically no direct response time delay on host writes. A write to the source will result in a copy of the track.
FlashCopy nocopy
In a FlashCopy nocopy relationship, a copy-on-write is done whenever a write to a source
track occurs for the first time after the FlashCopy was established. This type of FlashCopy is ideal when the target volumes are needed for a short time only, for example, to run the backup jobs. FlashCopy nocopy puts only a minimum additional workload on the back-end adapters and disk drives. However, it affects most of the writes to the source volumes as long
as the relationship exists. When you plan to keep your target volumes for a long time, this choice might not be the best solution.
Incremental FlashCopy
Another important performance consideration is whether to use incremental FlashCopy. Use incremental FlashCopy when you always perform FlashCopies to the same target volumes at regular time intervals. The first FlashCopy will be a full copy, but subsequent FlashCopy operations will copy only the tracks of the source volume that have been modified since the last FlashCopy. Incremental FlashCopy has the least impact on applications. During normal operation, no copy-on-write is done (as in a nocopy relation), and during a resync, the load on the back end is much lower compared to a full copy. There is only a very small overhead for the maintenance of out-of-sync bitmaps for the source and target volumes.
Note: The incremental FlashCopy resyncflash command does not have a -nocp (no copy) option. Using resyncflash will automatically use the copy option, regardless of whether the original FlashCopy was copy or no copy.
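A DS CLI sketch follows; the initial FlashCopy must be established with change recording and persistence so that later resynchronizations copy only changed tracks (the volume pair is hypothetical, and the options can vary by code level):

dscli> mkflash -record -persist 1000:1100
dscli> resyncflash -record -persist 1000:1100

The first command performs the initial full copy while recording changes; each later resyncflash copies only the tracks modified since the previous FlashCopy.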
With FlashCopy SE, data written to a Space Efficient target is placed in the repository, and a track table maps the target volume's logical tracks to their locations in the repository. Refer to Figure 17-5. Each time that a track in the repository is accessed, it has to go through this mapping process. Consequently, the attributes of the volume hosting the repository are important when planning a FlashCopy SE environment.
Figure 17-5 New process with IBM FlashCopy SE: during destage of an updated source track, the DS8000 checks for a FlashCopy relationship and performs a lookup in the track table of the repository before the data is released from NVS.
Because of space efficiency, data is not physically ordered in the same sequence on the repository disks as it is on the source. Processes that might access the source data in a sequential manner might not benefit from sequential processing when accessing the target. Another important consideration for FlashCopy SE is that we always have nocopy relationships. A full copy or incremental copy is not possible.
If there are many source volumes that have targets in the same extent pool, all updates to these source volumes cause write activity to this one extent pool's repository. We can consider a repository as something similar to a volume. So, we have writes to many source volumes being copied to just one volume, the repository. Where a dedicated extent pool is defined specifically for use as a FlashCopy SE repository, there will be less space in the repository than the total capacity (sum) of the source volumes. You might be tempted to use fewer disk spindles or disk drive modules (DDMs) for this extent pool. By definition, fewer spindles mean less performance, and so careful planning is needed to achieve the required throughput and response times from the Space Efficient volumes.
A good strategy is to keep the number of spindles roughly equivalent but just use smaller, faster drives (but do not use Fibre Channel Advanced Technology Attachment (FATA) or Serial Advanced Technology Attachment (SATA) drives). For example, if your source volumes are 300 GB 15K rpm disks, using 146 GB 15K rpm disks on the repository can provide both space efficiency and excellent repository performance. Another possibility is to consider RAID 10 for the repository, although that goes somewhat against space efficiency (you might be better off using standard FlashCopy with RAID 5 than SE with RAID 10). However, there might be cases where trading off some of the space efficiency gains for a performance boost justifies RAID 10. Certainly if RAID 10 is used at the
source, consider it for the repository (note that the repository will always use striping when in a multi-rank extent pool).
Note: There is no advantage in using RAID 6 for the repository other than resilience. Only consider it where RAID 6 is used as the standard throughout the DS8000.
Storage Pool Striping has good synergy with the repository (volume) function. With Storage Pool Striping, the repository space is striped across multiple RAID arrays in an extent pool, which helps to balance the volume skew that might appear on the sources. It is generally best to use four RAID arrays in the multi-rank extent pool intended to hold the repository, and no more than eight RAID arrays. Finally, as mentioned before, try to use at least the same number of disk spindles on the repository as the source volumes. Avoid severe fan-in configurations, such as 32 ranks of source disk being mapped to an eight-rank repository. This type of configuration will likely have performance problems unless the update rate to the source is extremely modest.
It is possible to share the repository with production volumes on the same extent pool, but use caution when doing so, because contention between the repository and the production volumes can impact performance. In this case, the repository for the one extent pool can be placed in a different extent pool so that source and target volumes are on different ranks but on the same processor complex.
To summarize, we can expect a high random write workload for the repository. To prevent the repository from becoming overloaded, you can take the following precautions:
Have the repository in an extent pool with several ranks (a repository is always striped).
Use fast 15K rpm disk drives for the repository ranks.
Consider using RAID 10 instead of RAID 5, because RAID 10 can sustain a higher random write workload.
Avoid placing standard source and repository target volumes in the same extent pool.
Because FlashCopy SE does not need a lot of capacity if your update rate is not too high, you might want to make several FlashCopies from the same source volume. For example, you might want to make a FlashCopy several times a day to set checkpoints to protect your data against viruses or for other reasons. Of course, creating more than one FlashCopy SE relationship for a source volume will increase the overhead, because each first change to a source volume track has to be copied several times, once for each FlashCopy SE relationship. Therefore, keep the number of concurrent FlashCopy SE relationships to a minimum, or test how many relationships you can have without affecting your application performance too much. Note that from a performance standpoint, avoiding multiple relationships also applies to normal FlashCopy. There are no restrictions on the amount of virtual space or the number of SE volumes that can be defined for either z/OS or Open Systems storage.
17.3 Metro Mirror
Metro Mirror is typically used for applications that cannot suffer any data loss in the event of a failure. As data is transferred synchronously, the distance between primary and secondary disk subsystems will determine the effect on application response time. Figure 17-6 illustrates the sequence of a write update with Metro Mirror.
Figure 17-6 Metro Mirror write sequence
When the application performs a write update operation to a primary volume, this process happens:
1. Write to primary volume (DS8000 cache)
2. Write to secondary (DS8000 cache)
3. Signal write complete on the secondary DS8000
4. Post I/O complete to host server
The Fibre Channel connection between primary and secondary subsystems can be direct, through a Fibre Channel SAN switch, via a SAN router using Fibre Channel over Internet Protocol (FCIP), or through other supported distance solutions, such as Dense Wave Division Multiplexing (DWDM).
Distance
The distance between your primary and secondary DS8000 subsystems will have an effect on the response time overhead of the Metro Mirror implementation. Note that with the requirement of diverse connections for availability, it is common to have certain paths that are longer distance than others. Contact your IBM Field Technical Sales Specialist (FTSS) to assist you in assessing your configuration and the distance implications if necessary. The maximum supported distance for Metro Mirror is 300 km (186.4 miles). There is approximately a 1 ms overhead per 100 km (62 miles) for write I/Os (this relation between latency and physical distance might be different when using a WAN). The DS8000 Interoperability Matrix gives details of SAN, network, and DWDM supported devices. Distances of over 300 km (186.4 miles) are possible and are supported by RPQ. Due to network configuration variability, the client must work with the channel extender vendor to determine the appropriate configuration to meet their requirements.
Logical paths
A Metro Mirror logical path is a logical connection between the sending LSS and the receiving LSS. An FC link can accommodate multiple Metro Mirror logical paths. Figure 17-7 on page 521 shows an example where we have a 1:1 mapping of source to target LSSs, and where the three logical paths are accommodated in one Metro Mirror link:
LSS1 in DS8000 1 to LSS1 in DS8000 2
LSS2 in DS8000 1 to LSS2 in DS8000 2
LSS3 in DS8000 1 to LSS3 in DS8000 2
Alternatively, if the volumes in each of the LSSs of DS8000 1 map to volumes in all three secondary LSSs in DS8000 2, there will be nine logical paths over the Metro Mirror link (not fully illustrated in Figure 17-7 on page 521). Note that we recommend a 1:1 LSS mapping.
Figure 17-7 Logical paths for Metro Mirror
Metro Mirror links have certain architectural limits, which include:
A primary LSS can maintain paths to a maximum of four secondary LSSs. Each secondary LSS can reside in a separate DS8000.
Up to eight logical paths per LSS-to-LSS relationship can be defined. Each Metro Mirror path requires a separate physical link.
An FC port can host up to 2048 logical paths. These paths are the logical and directional paths that are made from LSS to LSS.
An FC path (the physical path from one port to another port) can host up to 256 logical paths (Metro Mirror paths).
An FC port can accommodate up to 126 different physical paths (DS8000 port to DS8000 port through the SAN).
For Metro Mirror, consistency requirements are managed through use of the consistency group or Critical Mode option when you are defining Metro Mirror paths between pairs of LSSs. Volumes or LUNs, which are paired between two LSSs whose paths are defined with the consistency group option, can be considered part of a consistency group. Consistency is provided by means of the extended long busy (for z/OS) or queue full (for Open Systems) condition. These conditions are triggered when the DS8000 detects a condition where it cannot update the Metro Mirror secondary volume. The volume pair that first detects the error will go into the extended long busy or queue full condition, so that it will not perform any I/O. For z/OS, a system message will be issued (IEA494I state change message); for Open Systems, an SNMP trap message will be issued. These messages can be used as a trigger for automation purposes to provide data consistency by use of the Freeze/Run (or Unfreeze) commands. Metro Mirror itself does not offer a means of controlling this scenario; it offers the consistency group and Critical attributes, which, along with appropriate automation solutions, can manage data consistency and integrity at the remote site. The Metro Mirror volume pairs are always consistent, due to the synchronous nature of Metro Mirror. However, cross-volume or LSS data consistency must have an external management method. IBM offers TotalStorage Productivity Center for Replication to deliver solutions in this area.
Note: The consistency group and Critical attributes must not be set for any PPRC paths or LSSs unless there is an automation solution in place to manage the freeze and run commands. If these attributes are set and there is no automation tool to manage a freeze, the host systems will lose access to the volumes until the timeout occurs at 300 seconds.
Bandwidth
Prior to establishing your Metro Mirror solution, you must determine what your peak bandwidth requirement will be. Determining your peak bandwidth requirement will help to ensure that you have enough Metro Mirror links in place to support that requirement. To avoid any response time issues, establish the peak write rate for your systems and ensure that you have adequate bandwidth to cope with this load and to allow for growth. Remember that only writes are mirrored across to the target volumes after synchronization. There are tools to assist you, such as TotalStorage Productivity Center (TPC) or the operating system-dependent tools, such as iostat. Another method, but not quite so exact, is to monitor the traffic over the FC switches using FC switch tools and other management tools, and remember that only writes will be mirrored by Metro Mirror. You can also get an idea about the proportion of read to writes by issuing datapath query devstats on SDD-attached servers. A single 2 Gb Fibre Channel link can provide approximately 200 MBps throughput for the Metro Mirror establish. This capability scales up linearly with additional links up to six links. The maximum of eight links for an LSS pair provides a throughput of approximately 1400 MBps. Note: A minimum of two links is recommended between each DS8000 pair for resilience. The remaining capacity with a failed link is capable of maintaining synchronization.
LSS design
Because the DS8000 has made the LSS a topological construct, which is not tied to a physical array as in the ESS, the design of your LSS layout can be simplified. It is now possible to assign LSSs to applications, for example, without concern about the under-allocation or the over-allocation of physical disk subsystem resources. Assigning LSSs to applications can also simplify the Metro Mirror environment, because it is possible to reduce the number of commands that are required for data consistency.
Symmetrical configuration
As an aid to planning and management of your Metro Mirror environment, we recommend that you maintain a symmetrical configuration in terms of both physical and logical elements. As well as making the maintenance of the Metro Mirror configuration easier, maintaining a symmetrical configuration in terms of both physical and logical elements has the added benefit of helping to balance the workload across the DS8000. Figure 17-8 on page 523 shows a logical configuration. This idea applies equally to the physical aspects of the DS8000. You need to attempt to balance workload and apply symmetrical concepts to all aspects of your DS8000, which has the following benefits:
Ensure even performance: The secondary site volumes must be created on ranks with DDMs of the same capacity and speed as the primary site.
Simplify management: It is easy to see where volumes will be mirrored and processes can be automated.
Reduce administrator overhead: There is less administrator overhead due to automation and the simpler nature of the solution.
Ease the addition of new capacity into the environment: New arrays can be added in a modular fashion.
Ease problem diagnosis: The simple structure of the solution will aid in identifying where any problems might exist.
Figure 17-8 on page 523 shows this idea in a graphical form. DS8000 #1 has Metro Mirror paths defined to DS8000 #2, which is in a remote location. On DS8000 #1, volumes defined in LSS 00 are mirrored to volumes in LSS 00 on DS8000 #2 (volume P1 is paired with volume S1, P2 with S2, P3 with S3, and so on). Volumes in LSS 01 on DS8000 #1 are mirrored to volumes in LSS 01 on DS8000 #2, and so on. Requirements for additional capacity can be added in a symmetrical way also, by the addition of volumes into existing LSSs and by the addition of new LSSs when needed (for example, the addition of two volumes in LSS 03 and LSS 05 and one volume to LSS 04 will bring them to the same number of volumes as the other LSSs). Additional volumes can then be distributed evenly across all LSSs, or additional LSSs can be added.
Consider an asymmetrical configuration where the primary site has volumes defined on ranks comprised of 146 GB DDMs. The secondary site has ranks comprised of 300 GB DDMs. Because the capacity of the destination ranks is double that of the source ranks, it seems feasible to define twice as many LSSs per rank on the destination side. However, this situation, where four primary LSSs on four ranks were feeding into four secondary LSSs on two ranks, creates a performance bottleneck on the secondary rank and slows down the entire Metro Mirror process.
Volumes
You will need to consider which volumes to mirror to the secondary site. One option is to mirror all volumes, which is advantageous for the following reasons:
You will not need to consider whether any required data has been missed.
Users will not need to remember which logical pool of volumes is mirrored and which is not.
The addition of volumes to the environment is simplified; you will not have two processes for the addition of disk (one process for mirrored volumes and another process for non-mirrored volumes).
You will be able to move data around your disk environment easily without concern about whether the target volume is a mirrored volume or not.
Note: Consider the bandwidth that you need to mirror all volumes. The amount of bandwidth might not be an issue if there are many volumes with a low write I/O rate. Review data from TotalStorage Productivity Center for Disk if it is available. You can choose not to mirror all volumes (for example, swap devices for Open Systems or temporary work volumes for z/OS can be omitted). In this case, you will need careful control over what data is placed on the mirrored volumes (to avoid any capacity issues) and what data is placed on the non-mirrored volumes (to avoid missing any required data). You can place all mirrored volumes in a particular set of LSSs, in which all volumes have Metro Mirror enabled, and direct all data requiring mirroring to these volumes. For testing purposes, additional volumes can be configured at the remote site. These volumes can be used to take a FlashCopy of a consistent Metro Mirror image on the secondary volume and then allow the synchronous copy to restart while testing is performed.
Figure: Metro Mirror between the primary site and the secondary site through SAN switches, with a FlashCopy from the secondary volumes to tertiary volumes for testing
To create a consistent copy for testing, the host I/O needs to be quiesced, or automation code, such as Geographically Dispersed Parallel Sysplex (GDPS) and TotalStorage Productivity Center for Replication, needs to be used to create a consistency group on the primary disks so that all dependent writes are copied to the secondary disks.
Distance is an important value for both the initial establish data rate and synchronous write performance. Data must go to the other site, and the acknowledgement goes back. Add possible latency times of certain active components on the way. We think it is a good rule to calculate 1 ms additional response time per 100 km (62 miles) of site separation for a write I/O. Distance also affects the establish data rate. Know your workload characteristics. Factors, such as blocksize, read/write ratio, and random or sequential processing, are all key Metro Mirror performance considerations. Monitor link performance to determine if links are becoming overutilized. Use tools, such as TotalStorage Productivity Center and Resource Measurement Facility (RMF), or review SAN switch statistics.
Testing Metro Mirror performance has been done for the following configuration with different workload patterns (Figure 17-10). This test is an example of the performance that you can expect. Use these results for guidance only:
Primary: DS8300 Turbo, eight DA pairs, 512 73 GB/15k rpm disks, and RAID 10
Secondary: DS8300, eight DA pairs, 512 73 GB/15k rpm disks, and RAID 10
PPRC links: 2 Gb
Host I/Os are driven from the AIX host (AIX 5.3.0.40) with 8 host paths
PPRC distance simulation: Four links
Channel extender: CNT Edge Router (1 Gb FC Adapter and 1 Gb Ethernet network interface)
Distance simulator: Empirix PacketSphere (10 microsecond bidirectional = 1 km (.62 miles))
Figure 17-10 70/30/50 Metro Mirror with channel extender
Note: 70/30/50 is an open workload where read/write ratio = 2.33, read hits = 50%, destage rate = 17.2%, and transfer size = 4 K. In this instance, it shows the minimal impact of Metro Mirror on I/O response times at a distance between sites of 50 km (31 miles), although it does fall off more sharply and at a lower I/O rate than with no copy running.
17.3.3 Scalability
The DS8000 Metro Mirror environment can be scaled up or down as required. If new volumes are added to the DS8000 that require mirroring, they can be dynamically added. If additional Metro Mirror paths are required, they also can be dynamically added. Note: The mkpprcpath command is used to add Metro Mirror paths. If paths are already established for the LSS pair, they must be included in the mkpprcpath command together with any additional path or the existing paths will be removed.
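For example, if two paths already exist for an LSS pair and a third is needed, all three port pairs are specified in a single command; in this DS CLI sketch the storage image IDs, WWNN, LSS numbers, and port IDs are hypothetical:

dscli> mkpprcpath -dev IBM.2107-75ABCDE -remotedev IBM.2107-75FGHIJ -remotewwnn 5005076303FFC1A5 -srclss 01 -tgtlss 01 I0010:I0210 I0040:I0240 I0110:I0310

Omitting the two existing port pairs from the command would remove them, leaving only the newly specified path.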
Figure 17-11 Global Copy write sequence
In Figure 17-11:
1. The host server requests a write I/O to the primary DS8000. The write is staged through cache and nonvolatile storage (NVS).
2. The write returns to the host server's application.
3. A few moments later, in a nonsynchronous manner, the primary DS8000 sends the necessary data so that the updates are reflected on the secondary volumes. The updates are grouped in batches for efficient transmission. Note also that if the data is still in cache, only the changed sectors are sent. If the data is no longer in cache, the full track is read from disk.
4. The secondary DS8000 returns write completed to the primary DS8000 when the updates are secured in the secondary DS8000 cache and NVS. The primary DS8000 then resets its change recording information.
The primary volume remains in the Copy Pending state while the Global Copy session is active. This status only changes if a command is issued or the links between the storage subsystems are lost.
Note: The consistency group is not specified on the establish path command. Data on Global Copy secondaries is not consistent, so there is no need to maintain the order of dependent writes.
The decision about when to use Global Copy can depend on a number of factors, such as:
The recovery of the system does not need to be current with the primary application system.
There is a minor impact to application write I/O operations at the primary location.
The recovery uses copies of data created by the user on tertiary volumes.
Distances beyond the ESCON limit of 103 km (64 miles) or the FCP limit of 300 km (186 miles) are required (an RPQ is available for greater distances).
You can use Global Copy as a tool to migrate data between data centers.
Distance
The maximum supported distance for a direct Fibre Channel connection is 10 km (6.2 miles). If you want to use Global Copy over longer distances, you can use the following connectivity technologies to extend this distance:
- Fibre Channel routers using Fibre Channel over Internet Protocol (FCIP)
- Channel extenders over Wide Area Network (WAN) lines
- Dense Wavelength Division Multiplexers (DWDM) on fiber
With DWDM, each multiplexed channel retains the bandwidth capability of the individual channel. Because the wavelength of light is, from a practical perspective, infinitely divisible, DWDM technology is limited only by the sensitivity of its receptors with regard to the total possible aggregate bandwidth.

A complete and current list of Global Copy supported environments, configurations, networks, and products is available in the DS8000 Interoperability Matrix. Contact the multiplexer vendor regarding hardware and software prerequisites when using the vendor's products in a DS8000 Global Copy configuration.
Figure 17-12 Global Copy over long distance: the primary volume at the primary site connected through channel extenders to the secondary volume at the secondary site, with a FlashCopy to a tertiary volume at the secondary site
Following these steps, the user creates consistent data:
1. Quiesce I/O.
2. Suspend the pairs (go-to-sync and suspend). FREEZE can be used, and extended long busy will not be returned to the server, because the consistency group was not specified on the establish path.
3. FlashCopy the secondary to the tertiary. The tertiary will have consistent data.
4. Reestablish the paths (if necessary).
5. Resynchronize (resumepprc) Global Copy.
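A hedged DS CLI sketch of this sequence for one LSS pair (the storage image IDs, WWNN, LSS, and volume IDs are placeholders; the go-to-sync step is done by reissuing mkpprc with -type mmir over the existing Global Copy pairs, and freezepprc removes the paths, which is why they are re-established before resuming):

dscli> mkpprc -dev IBM.2107-75XXXXX -remotedev IBM.2107-75YYYYY -type mmir 1000-1003:1000-1003
dscli> freezepprc -dev IBM.2107-75XXXXX -remotedev IBM.2107-75YYYYY 10:10
dscli> mkflash -dev IBM.2107-75YYYYY 1000-1003:1100-1103
dscli> mkpprcpath -dev IBM.2107-75XXXXX -remotedev IBM.2107-75YYYYY -remotewwnn 5005076303FFXXXX -srclss 10 -tgtlss 10 I0010:I0210
dscli> resumepprc -dev IBM.2107-75XXXXX -remotedev IBM.2107-75YYYYY -type gcp 1000-1003:1000-1003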
For implementations over extended distances, Global Copy becomes an excellent trade-off solution. You can estimate the Global Copy application impact as being similar to that of an application working with suspended Metro Mirror volumes. For the DS8000, there is additional work to do with the Global Copy volumes compared to suspended volumes, because with Global Copy the changes have to be sent to the remote DS8000. But this impact is negligible overhead for the application compared with the typical synchronous overhead. No host system resources are consumed by Global Copy volume pairs, excluding any management solution, because Global Copy is managed by the DS8000 subsystem.

If you take a FlashCopy at the recovery site in your Global Copy implementation, consider the interaction between Global Copy and the FlashCopy background copy. If you use FlashCopy with the nocopy option at the recovery site, when the Global Copy target receives an update, the track on the FlashCopy source, which is also the Global Copy target, has to be copied to the FlashCopy target before the data transfer operation completes. This copy operation to the FlashCopy target can complete by using the DS8000 cache and NVS without waiting for a physical write to the FlashCopy target. However, this data movement can influence the Global Copy activity. So, when estimating the network bandwidth, consider that the FlashCopy effect on the Global Copy activity might in fact decrease the bandwidth utilization during certain intervals.
17.4.3 Scalability
The DS8000 Global Copy environment can be scaled up or down as required. If new volumes that require mirroring are added to the DS8000, they can be dynamically added. If additional Global Copy paths are required, they also can be dynamically added.
Addition of capacity
As we have previously mentioned, the logical nature of the LSS has made a Global Copy implementation on the DS8000 easier to plan, implement, and manage. However, if you need to add more LSSs to your Global Copy environment, your management and automation solutions must be set up to add this capacity.
With Global Mirror, data written to the storage unit at the local site is asynchronously copied to the storage unit at the remote site. A consistent copy of the data is then periodically and automatically maintained on the storage unit at the remote site by forming a consistency group at the local site and subsequently creating a tertiary copy of the data at the remote site with FlashCopy. This two-site data mirroring function is designed to provide a high performance, cost-effective, global distance data replication and disaster recovery solution.
Figure 17-13 Global Mirror overview: host write (1) and acknowledgement (2) at the local site, asynchronous write to the secondary B volume, and automatic FlashCopy to the C volume in an automatic cycle within the active session
The DS8000 manages the following sequence to create a consistent copy at the remote site (Figure 17-14):
- Asynchronous long distance copy (Global Copy) with little to no impact to application writes.
- Momentarily pause application writes (a fraction of a millisecond to a few milliseconds).
- Create a point-in-time consistency group across all primary subsystems in the out-of-sync (OOS) bitmap. New updates are saved in the Change Recording bitmap.
- Restart application writes and complete the write (drain) of the point-in-time consistent data to the remote site.
- Stop the drain of data from the primary after all consistent data has been copied to the secondary.
- Logically FlashCopy all data to the C volumes to preserve the consistent data.
- Restart Global Copy writes from the primary.
- Automatically repeat the sequence every few seconds, minutes, or hours (this interval is selectable and can be immediate).
Figure 17-14 Global Mirror sequence between the local and remote sites: 1. Create a consistency group of volumes at the local site; 2. Send the increment of consistent data to the remote site; 3. FlashCopy at the remote site; 4. Resume Global Copy (copy out-of-sync data only); 5. Repeat all the steps according to the defined time period
The data at the remote site is typically current within 3 to 5 seconds, but this recovery point objective (RPO) depends on the workload and on the bandwidth available to the remote site.

Note: The copy created with the consistency group is a power-fail consistent copy, not necessarily an application-based consistent copy. When you use this copy for recovery, you might need to perform additional recovery operations, such as the fsck command in an AIX filesystem.

This section discusses performance aspects to consider when planning and configuring for Global Mirror, together with the potential impact to application write I/Os caused by the process used to form a consistency group.
We also consider distributing the target Global Copy and target FlashCopy volumes across various ranks to balance load over the entire target storage server and minimize the I/O load for selected busy volumes.
The PPRC links need to use dedicated DS8000 host adapter ports to avoid any conflict with host I/O. If any subordinate storage subsystems are included in the Global Mirror session, the FC links to those subsystems must also use dedicated host adapter ports.
Figure 17-15 Global Copy with write hit at the remote site
The steps of the write I/O operation with FlashCopy on the target DS8000 are (refer to Figure 17-15):
1. The application write I/O completes immediately to volume A1 at the local site.
2. Global Copy nonsynchronously replicates the application I/O and reads the data at the local site to send it to the remote site.
3. The modified track is written across the link to the remote B1 volume.
4. FlashCopy nocopy detects that the track is about to change.
5. The track is written to the C1 volume before the write to the B1 volume.

This process is an approximation of the sequence of internal I/O events; optimization and consolidation effects make the entire process quite efficient. Figure 17-15 shows the normal sequence of I/Os within a Global Mirror configuration. The critical path is between points (2) and (3). Usually, (3) is simply a write hit in NVS in B1, and some time later, after (3) completes, the original FlashCopy source track is copied from B1 to C1. If NVS is overcommitted in the secondary storage server, there is a potential impact on the performance of the Global Copy data replication operations. Refer to Figure 17-16.
Figure 17-16 Application write I/O within two consistency group points
Figure 17-16 summarizes roughly what happens when NVS in the remote storage server is overcommitted. A read (3) and a write (4) to preserve the source track and write it to the C volume are required before the write (5) can complete. Eventually, the track gets updated on the B1 volume to complete the write (5). But usually, all writes are quick writes to cache and persistent memory and happen in the order outlined in Figure 17-14 on page 532. You can obtain a more detailed explanation of this processing in IBM System Storage DS8000: Copy Services with IBM System z, SG24-6787, and IBM System Storage DS8000: Copy Services in Open Environments, SG24-6788.
Figure 17-17 Coordination time and how it impacts application write I/Os
The coordination time, which you can limit by specifying a number of milliseconds, is the maximum impact to an application's write I/Os that you allow when forming a consistency group. The intention is to keep the coordination time value as small as possible. The default of 50 ms might be high in a transaction processing environment; a valid number might be in the single-digit range. The required communication between the Master storage server and potential Subordinate storage servers is in-band over PPRC paths between the Master and Subordinates. This communication is highly optimized and allows you to minimize the potential application write I/O impact to, for example, 3 ms. There must be at least one PPRC FC link between a Master storage server and each Subordinate storage server, although for redundancy we recommend that you use two PPRC FC links.

One of the key design objectives for Global Mirror is to not impact the production applications. The consistency group formation process involves holding production write activity in order to create dependent write consistency across multiple devices and multiple disk subsystems.
This process must therefore be fast enough that the impact is extremely small. With Global Mirror, the process of forming a consistency group is designed to take 1 to 3 ms. If we form consistency groups every 3 to 5 seconds, the percentage of production writes impacted and the degree of impact are therefore very small.

The following example shows the type of impact that might be seen from consistency group formation in a Global Mirror environment. We assume that we are running 24,000 I/Os per second with a 3:1 read/write ratio. We therefore perform 6,000 write I/Os per second, each write I/O takes 0.5 ms, and it takes roughly 3 ms to create a consistent set of data. Approximately 0.0035 x 6000 = 21 write I/Os are affected by the creation of consistency. If each of these 21 I/Os experiences a 3 ms delay, and this delay happens every 3 seconds, we have an average response time (RT) delay of (21 x 0.003)/18000 = 0.0035 ms. A 0.0035 ms average impact to a 0.5 ms write is a 0.7% increase in response time, and normal performance reporting tools will not detect this level of impact.
Note that subsequent writes to this same track do not experience any delay, because the track has already been replicated to the remote site.
Figure 17-18 Remote storage server configuration: All ranks contain equal numbers of volumes
The goal is to put the same number of each volume type into each rank. The volume types that we discuss here refer to B volumes and C volumes within a Global Mirror configuration. In order to avoid performance bottlenecks, spread busy volumes over multiple ranks. Otherwise, hot spots can be concentrated on single ranks when you put the B and C volumes on the same rank. We recommend spreading B and C volumes as Figure 17-18 suggests. With mixed DDM capacities and different speeds at the remote storage server, consider spreading B volumes not only over the fast DDMs but over all ranks. Basically, follow a similar approach as Figure 17-18 recommends. You might keep particularly busy B volumes and C volumes on the faster DDMs. If the DDMs used at the remote site are double the capacity but the same speed as those DDMs used at the production site, an equal number of ranks can be formed. In a failover situation when the B volume is used for production, it will provide the same performance as the production site, because the C volume will not then be in use. Note: Keep the FlashCopy target C volume on the same processor complex as the FlashCopy source B volume. Figure 17-19 on page 540 introduces the D volumes.
Figure 17-19 Remote storage server configuration with additional D volumes: A, B, and C volumes spread across Ranks 1 to 3 as in Figure 17-18, plus D volumes (D1 - D4) placed on a separate Rank 4
Figure 17-19 shows, besides the three Global Mirror volumes, the addition of D volumes that you can create for test purposes. Here, we suggest, as an alternative, a rank with larger and perhaps slower DDMs. The D volumes can be read from another host, and any other I/O to the D volumes does not impact the Global Mirror volumes in the other ranks. Note that with a nocopy relationship between the B and D volumes, reads through a D volume are satisfied from the B volume. So, you might consider a physical copy when you create D volumes on a different rank, which separates additional I/O to the D volumes from I/O to the ranks with the B volumes.

If you plan to use the D volumes as the production volumes at the remote site in a failover situation, the D volume ranks must be configured in the same way as the A volume ranks and use identical DDMs. You must make a full copy to the D volumes for both testing and failover. When using TotalStorage Productivity Center for Replication, the Copy Sets for Global Mirror Failover/Failback with Practice are defined in this way. The TotalStorage Productivity Center for Replication volume definitions are:
- A volume defined as H1 volume (Host site 1)
- B volume defined as I2 volume (Intermediate site 2)
- C volume defined as J2 volume (Journal site 2)
- D volume defined as H2 volume (Host site 2)

The use of FlashCopy SE for the C volumes requires a different configuration again. These volumes are physically allocated in a data repository. A repository volume per extent pool is used to provide physical storage for all Space Efficient volumes in that extent pool. Figure 17-20 on page 541 shows an example of a Global Mirror setup with FlashCopy SE. In this example, the FlashCopy targets use a common repository.
Figure 17-20 Remote disk subsystem with Space Efficient FlashCopy target volumes
FlashCopy SE is optimized for use cases where less than 20% of the source volume is updated during the life of the relationship. In most cases, Global Mirror is configured to form consistency groups at an interval of a few seconds, which means that only a small amount of data is copied to the FlashCopy targets. From this point of view, Global Mirror is a well-suited area of application for FlashCopy SE. In contrast, standard FlashCopy will generally have superior performance to FlashCopy SE.

The FlashCopy SE repository is critical with regard to performance. When provisioning a repository, Storage Pool Striping (SPS) is automatically used with a multi-rank extent pool to balance the load across the available disks. In general, we recommend that the extent pool contain a minimum of four RAID arrays. Depending on the logical configuration of the DS8000, you might also consider using multiple Space Efficient repositories for the FlashCopy target volumes in a Global Mirror environment, at least one on each processor complex. Note that the repository extent pool can also contain additional non-repository volumes, but contention can arise if the extent pool is shared.

After the repository is defined, you cannot expand it, so it is important to plan its size carefully. If the repository fills, the FlashCopy SE relationships fail, and Global Mirror will not be able to successfully create consistency groups.
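A hedged DS CLI sketch of preparing a Space Efficient repository and Track Space Efficient FlashCopy target volumes in a multi-rank extent pool (the device ID, extent pool, capacities, and volume IDs are placeholders for your environment):

dscli> mksestg -dev IBM.2107-75YYYYY -extpool P4 -repcap 500 -vircap 2000
dscli> mkfbvol -dev IBM.2107-75YYYYY -extpool P4 -cap 100 -sam tse 1100-1103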
Effective with the availability of DS8000 Release 3.0 licensed internal code (LIC), the maximum Global Mirror volume relationships on a secondary DS8000 subsystem have been increased three-fold. Table 17-4 includes the old and new values.
Table 17-4 Global Mirror secondary volume guidelines
Disk subsystem/cache
ESS M800
DS8000/16 GB
DS8000/32 GB
DS8000/64 GB
DS8000/128 GB
DS8000/256 GB
Note: The recommendations are based solely on the number of Global Mirror volume relationships; the capacity of the volumes is irrelevant. One way to avoid exceeding these recommendations is to use fewer, larger volumes with Global Mirror.
Suspending a Global Copy pair that belongs to an active Global Mirror session impacts the formation of consistency groups. When you intend to remove Global Copy volumes from an active Global Mirror session, follow these steps:
1. Remove the desired volumes from the Global Mirror session.
2. Withdraw the FlashCopy relationship between the B and C volumes.
3. Terminate the Global Copy pair to bring volume A and volume B into simplex mode.

Note: When you remove A volumes without pausing Global Mirror, you might see this situation reflected as an error condition with the showgmir -metrics command, indicating that the consistency group formation failed. However, this error condition does not mean that you have lost a consistent copy at the remote site, because Global Mirror does not take the FlashCopy (B to C) for the failed consistency group data. The message indicates that just one consistency group formation has failed, and Global Mirror will retry the sequence.
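A hedged DS CLI sketch of these three steps for a single volume (the storage image IDs, session number, LSS, and volume IDs are placeholders):

dscli> chsession -dev IBM.2107-75XXXXX -lss 10 -action remove -volume 1000 01
dscli> rmflash -dev IBM.2107-75YYYYY 1000:1100
dscli> rmpprc -dev IBM.2107-75XXXXX -remotedev IBM.2107-75YYYYY 1000:1000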
Figure 17-21 z/OS Global Mirror components and data flow: primary systems and primary volumes at the primary site, the System Data Mover (SDM) with its journal, control, and state datasets, and the secondary systems and secondary volumes at the secondary site
Figure 17-21 illustrates a simplified view of the z/OS Global Mirror components and the data flow logic. When a z/OS Global Mirror pair is established, the host system's DFSMSdfp software starts to time-stamp all subsequent write I/Os to the primary volumes, which provides the basis for managing data consistency across multiple LCUs. If these primary volumes are shared by systems running on different CECs, an IBM Sysplex Timer is required to provide a common time reference for these time stamps. If all the primary systems are running in different LPARs within the same CEC, the system time-of-day clock can be used. z/OS Global Mirror is implemented in a cooperative way between the DS8000s on the primary site and the DFSMSdfp host system software component System Data Mover (SDM).

The logic for the data flow is (refer to Figure 17-21):
1. The primary system writes to the primary volumes.
2. The application I/O operation is signalled complete when the data is written to primary DS8000 cache and NVS, which is when channel end and device end are returned to the primary system. The application write I/O operation has now completed, and the updated data is mirrored asynchronously according to the following steps.
3. The DS8000 groups the updates into record sets, which are asynchronously off-loaded from the cache to the SDM system. Because z/OS Global Mirror uses this asynchronous copy technique, there is no performance impact on the primary application's I/O operations.
4. The record sets, perhaps from multiple primary storage subsystems, are processed into consistency groups (CGs) by the SDM. The CG contains records that have their order of update preserved across multiple LCUs within a DS8000, across multiple DS8000s, and across other storage subsystems participating in the same z/OS Global Mirror session. This preservation of order is absolutely vital for dependent write I/Os, such as databases and their logs. The creation of CGs guarantees that z/OS Global Mirror copies data to the secondary site with update sequence integrity.
5. When a CG is formed, it is written from the SDM's real storage buffers to the Journal datasets.
6. Immediately after the CG has been hardened on the Journal datasets, the records are written to their corresponding secondary volumes, also from the SDM's real storage buffers. Because of the data in transit between the primary and secondary sites, the currency of the data on the secondary volumes lags slightly behind the currency of the data at the primary site.
7. The control dataset is updated to reflect that the records in the CG have been written to the secondary volumes.
PARMLIB members
When z/OS Global Mirror first starts up using the XSTART command, it searches for member ANTXIN00 in SYS1.PARMLIB. When a parameter value in the ANTXIN00 member differs from that value specified in the XSTART command, the version in the XSTART command will override the value found in ANTXIN00. After processing the ANTXIN00 member, z/OS Global Mirror then looks for member ALL in the dataset hlq.XCOPY.PARMLIB, where hlq is the value found for the hlq parameter in the ANTXIN00 member, or the overriding value entered in the XSTART command. The ALL member can contain values that you want to be common across all logical sessions on a system.
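For example, a hedged sketch of starting a logical session with an overriding high-level qualifier (the session name and qualifier are placeholders; verify the exact operands in the DFSMS Advanced Copy Services documentation):

XSTART XRC1 ERRORLEVEL(SESSION) SESSIONTYPE(XRC) HLQ(XRCPROD)

With this invocation, z/OS Global Mirror would then look for member ALL in XRCPROD.XCOPY.PARMLIB.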
You typically suspend a z/OS Global Mirror session for a planned activity, such as a maintenance update, or to move the SDM to a different site or a different LPAR. The XSUSPEND TIMEOUT command ends the ANTASnnn address space and informs the involved LCUs that their z/OS Global Mirror session has been suspended. The DS8000 then records changed tracks in the hardware bitmap and frees the write updates from the cache. When the z/OS Global Mirror session is restarted with the XSTART command and volumes are added back to the session with the XADDPAIR command, the hardware bitmap maintained by the DS8000 while the session was suspended is used to resynchronize the volumes, and thus full volume resynchronization is avoided.
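A hedged sketch of such a planned suspend and restart (the session name, high-level qualifier, and especially the timeout value format are assumptions; check the command reference for your DFSMS level):

XSUSPEND XRC1 TIMEOUT(00.05.00)
(perform the maintenance or move the SDM)
XSTART XRC1 SESSIONTYPE(XRC) HLQ(XRCPROD)
XADDPAIR XRC1 SUSPENDED

The XADDPAIR SUSPENDED form re-adds the previously suspended pairs so that only the tracks flagged in the hardware bitmap are resynchronized.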
Note: PARMLIB parameters BITMAP ChangedTracks and DelayTime have an equivalent function.
Write Pacing
Write Pacing works by injecting a small delay as each zGM record set is created in cache for a given volume. As the device residual count increases, so does the magnitude of the pacing, eventually reaching a maximum value at a target residual count. You can specify both this maximum value and the target residual count at which it takes effect for each volume through the XRC XADDPAIR command. Write Pacing provides a greater level of flexibility than Device Blocking. Furthermore, the device remains ready to process I/O requests, allowing application read activity to continue while the device is being paced, which is not the case with Device Blocking. You can set the write pacing levels to one of fifteen fixed values between 0.02 ms and 2 ms. Write pacing levels are used for volumes with high rates of small blocksize writes, such as database logs, where you need to minimize the response time impact.

A dynamic workload balancing algorithm was introduced with z/OS Global Mirror (zGM) Version 2. The objective of this mechanism is to balance the write activity from the primary systems with the SDM's capability to off-load cache during write peaks or during a temporary lack of resources for the SDM, with minimal impact to the primary systems. In situations where the SDM off-load rate falls behind the primary systems' write activity, data starts to accumulate in cache. This accumulation is dynamically detected by the primary DS8000 microcode, which responds by slowly but progressively reducing the available write bandwidth for the primary systems, thus giving the SDM a chance to catch up.

The DS8000 implements device-level blocking. The update rate for a volume continues unrestricted unless the volume reaches a threshold of residual data waiting to be collected by the SDM. Whenever that threshold is exceeded, application updates to the single volume are paused to allow the SDM to read them from the cache of the subsystem.
Tip: By using the DONOTBLOCK parameter of the XADDPAIR command, you can request that z/OS Global Mirror does not block specific devices. You can use this option for IMS WADs, DB2 logs, CICS logs, or spool datasets that use small blocksizes, perform numerous updates, and are critical to application response time.
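For example, a hedged sketch of adding a log volume pair with device blocking disabled (the session name and volume serial numbers are placeholders; verify the exact operands in the DFSMS Advanced Copy Services documentation):

XADDPAIR XRC1 VOLUME(DB2L01,DB2L81) DONOTBLOCK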
subsystem involved in the zGM logical session. Multiple reader support can also help to simplify the move to larger devices, and it can reduce the sensitivity of zGM in draining updates as workload characteristics change or capacity growth occurs. Less manual effort will be required to manage the SDM off-load process. For more information about multiple reader, refer to IBM System Storage DS8000: Copy Services with IBM System z, SG24-6787.
Figure 17-23 on page 551 shows the comparison when running a 27 KB sequential write workload to a single volume. Here, we compare the MB per second throughput. Even though the improvement is not as dramatic as on the 4 KB sequential write workload, we still see that the multiple reader provides better performance compared to the single reader.
Figure 17-24 on page 552 shows the benchmark result where we compare the application throughput measured by the total I/O rate when running a database random write to one LSS. Here, we again see a significant improvement on the throughput when running with multiple readers.
Figure 17-25 Metro/Global Mirror: normal application I/Os to the A volumes, Metro Mirror to the B volumes, and Global Mirror over an asynchronous long-distance network. Global Mirror consistency group formation (CG): (a) write updates to the B volumes are paused (less than 3 ms) to create the CG; (b) the CG updates on the B volumes are drained to the C volumes; (c) after all updates are drained, the changed data is FlashCopied from the C to the D volumes
IBM offers services and solutions for the automation and management of the Metro Mirror environment, which include GDPS for System z and TotalStorage Productivity Center for Replication. You can obtain more details about GDPS at the following Web site: http://www.ibm.com/systems/z/advantages/gdps
Figure 17-26 z/OS Metro/Global Mirror: Metro Mirror from the Local Site to a DS8000 Metro Mirror secondary at the Intermediate Site over metropolitan distance, and z/OS Global Mirror to a DS8000 secondary at the Remote Site over unlimited distance
In the example that is shown in Figure 17-26, the System z environment in the Local Site is normally accessing the DS8000 disk in the Local Site. These disks are mirrored back to the Intermediate Site with Metro Mirror to another DS8000. At the same time, the Local Site disk has z/OS Global Mirror pairs established to the Remote Site to another DS8000, which can be at continental distances from the Local Site.
Appendix A.
Figure A-1 shows a configuration with all arrays formatted as RAID 5 and array types of 6+P+S or 7+P. The red circle surrounding the ranks represents the extent pools created. The ranks, designated by Rn, such as R0, represent the arrays logically grouped from eight physical DDMs (disks). In this example, all arrays were formatted as RAID 5 and, due to the sparing rules, four arrays from each DA pair have a spare and are designated as 6+P+S. The remaining four arrays are configured as 7+P types.
Figure A-1 DS8000 grouping of arrays into extent pools: ranks R0 - R15 formatted as 6+P+S and 7+P RAID 5 arrays, with each rank placed in its own extent pool (Exp0 - Exp15)
The team decided to place each rank into its own extent pool. This method is an excellent way to isolate each rank from every other rank, but because of the way that the team laid out the LUNs in each extent pool, no isolation at the host application level was achieved.

The view shown in Figure A-1 can be transposed to a spreadsheet to further define the granularity of the logical layout and the LUNs configured from each extent pool. The chart in Figure A-2 on page 558 shows further granularity of the capacity resources available at the DA pair, server complex, rank, extent pool, logical subsystem (LSS), and logical unit number (LUN) level. Included in the chart is a specific color coding of the LUNs to assigned hosts and the hosts' host bus adapter (HBA) worldwide port names (WWPNs). Orienting yourself with the chart and understanding how to read and use this type of chart is beneficial in understanding the illustrations presented in this appendix.

In Figure A-2 on page 558, we show a layout of LUNs with the respective numbering associated to the server complexes, extent pools, and LSSs, numbered in rows 6 - 38 and columns A, B, and C. In this illustration, we have made the array numbering equal to the rank numbering, and the rank numbering equal to the extent pool and LSS numbering. For example, S1=A0=R0=Extpool0=LSS0, as shown by the box around rows 6 and 7, surrounded by the box in red (horizontally). In this example, the LUNs are numbered sequentially (in hexadecimal format) from left to right:
- Box number 1 (column A) shows the server complexes, numbered 0 and 1.
- Box number 2 (column B) shows the extent pools, numbered 1 - 23.
- Box number 3 (column C) shows the LSSs.

The chart is read vertically by server, extent pool, LSS, and the associated LUNs carved from the respective arrays/ranks/extent pools. Box number 4 shows the arrays owned by the DA pairs;
for example, there are two DA pairs in this illustration. Box number 5 shows a number of LUNs, by color code, that are owned by various hosts represented by matching colors.
Figure A-2 Before hardware capacity resource chart to track LUN placement at the rank to extent pool level
The chart shown in Figure A-2 has value in that you can quickly view the logical layout at a glance without having to run many commands to visualize your environment. (Notice that all but two LUNs were spread across every array/rank/extent pool and both server complexes.) Hot spots can be more readily pinpointed and confirmed with tools, such as Tivoli Productivity Center. This type of chart can, however, help you more easily and quickly reconfigure the logical layout when hot spots are found. Migrating LUNs to other, less busy arrays, DA pairs, or server complexes can be more easily planned.

In the example shown in Figure A-2, notice that all host data LUNs are spread across all the hardware resources, regardless of server complex, DA pair, array characteristics, and so forth. The initial belief of the storage team was that the more the data was spread across spindles, the better the performance. After the implementation, the client experienced performance hot spots on the production hosts, identified in box number 6. One of these production hosts' application workloads peaked at the same time as that of one of the non-priority hosts, identified in box number 7. Upon investigation and reconfiguration, the problem was solved by separating the production hosts from the nonproduction hosts and placing the production host LUNs on the arrays with a 7+P type format.

The chart shown in Figure A-3 on page 559 shows the remediated logical configuration. Notice that the production host LUNs were isolated from the nonproduction hosts and placed
on arrays that have a 7+P type. Realize that 7+P type arrays will outperform 6+P+S type arrays simply because there are more physical heads and spindles across which to spread the I/O. For further information and discussion about array type characteristics, refer to 4.3, Understanding the array to LUN relationship on page 55.

Another key factor in the performance throughput gain was separating and isolating the production servers' application I/O at the array level from the nonproduction servers' application I/O. For example, the production hosts' I/O resides on the LUNs shown in the box numbered 1. The nonproduction LUNs now reside in separate arrays from the production LUNs as well, shown in the box numbered 2. Separating the host applications' I/O at the array level kept the workloads from peaking at the same time. This new configuration still accommodates a spread of I/O across several arrays, eight to be exact (4, 5, 6, 7, 12, 13, 14, and 15), DA pairs (0 and 2), and server complexes (0 and 1).
Figure A-3 After hardware capacity resource chart to track LUN placement at the rank to extent pool level
The ideal point between isolating and spreading was achieved as shown in Figure A-3. Spreading applies to both isolated workloads and resource-sharing workloads.
In scenario 1, we used a logical configuration for maximum isolation of spindles at the rank/extent pool level. To further illustrate this concept and to familiarize you with the charts that we use in many of the illustrations in this appendix, refer to Figure A-4. This chart is similar to the previous chart shown in Figure A-3 on page 559, except for the LUN-to-array granularity. In Figure A-4, we show a logical configuration of 48 ranks. Each rank is placed in its own extent pool to isolate its spindles from the spindles of all other ranks. The boxes surrounding each rank represent the 8 physical disks grouped into a logical array. Each array is represented in columns A through E, over two rows. For example, columns A - E, rows 3 and 4, represent an array (in this example, rank 0, shown on the left). Each of the arrays contains information identifying the server complex, extent pool, rank number, capacity in GB, and RAID type. In the middle of the diagram, we show a representation of the eight physical disks for ranks 0, 33, and 15, just to illustrate that each rank is made up of eight physical DDMs.
Figure A-4 Maximum isolation of ranks: each of the 48 ranks in its own extent pool; each array label identifies the DA pair, rank, server complex, extent pool, LSS, and capacity (for example, DA_Pair2,Rank15,Server1,Extpool15,LSS15 @1582GB)
We have isolated all the ranks from each other. The benefit of maximum isolation is for application requirements that call for the most spindle separation. SAN Volume Controller (SVC) and other virtualization appliances benefit from this type of configuration, because they use their own virtualization. Note: Although the ranks are isolated from each other, the I/O traffic is still shared within the associated owning DA pair and server complex for each rank.
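A hedged DS CLI sketch of this one-rank-per-extent-pool layout for the first two ranks (the device ID, array IDs, and pool names are placeholders; the pattern repeats for all 48 ranks, alternating rank group 0 and 1 to balance the server complexes):

dscli> mkextpool -dev IBM.2107-75XXXXX -rankgrp 0 -stgtype fb iso_pool_R0
dscli> mkrank -dev IBM.2107-75XXXXX -array A0 -stgtype fb -extpool P0
dscli> mkextpool -dev IBM.2107-75XXXXX -rankgrp 1 -stgtype fb iso_pool_R1
dscli> mkrank -dev IBM.2107-75XXXXX -array A1 -stgtype fb -extpool P1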
Figure A-5 DS8300 grouping of arrays into extent pools: one rank from each DA pair grouped into each of the eight extent pools (Extpool 0 through Extpool 7)
To achieve a well-balanced DS8000 from the array to extent pool perspective, take an array/rank from each DA pair of equal size and the same characteristics and group them into an extent pool as shown in Figure A-5. With this configuration and considering the large
number of arrays and DA pairs, a good general rule is to create as many extent pools as there are arrays in a DA pair. For example, in Figure A-5 on page 561, we show six DA pairs and eight ranks in each DA pair. Next, we show that one rank from each DA pair has been grouped into one of the eight extent pools. This way, you can achieve a well balanced configuration in the extent pool, because it leverages and distributes the I/O evenly across all the hardware resources, such as the arrays, DA pairs, and server complexes.
Figure A-6 A bad example of maximum sharing between two extent pools
Although the ranks in the two extent pools are divided to separate the 6+P+S arrays from the 7+P RAID types, we are unable to achieve an even balance across the server complexes. All the LUNs created and assigned to a host from extent pool 0 funnel I/O traffic through server complex 0, and all the LUNs assigned to a host from extent pool 1 funnel I/O traffic through server complex 1. In order to achieve I/O balance from a host perspective, you need to assign LUNs from both extent pools to each host. By spreading the application across both extent pools, however, we lose any isolation: all I/O is shared between all hosts and workloads.

The advantage of this type of configuration is that I/O is spread across the maximum number of heads and spindles, maintaining reasonable balance across the DA pairs and server complexes. The disadvantage is a lack of I/O isolation. Everything is spread and striped, which introduces I/O contention between all hosts and applications. If a hot spot is encountered, you have nowhere to move the data to improve I/O throughput for the
application, especially if two or more applications peak at the same time. You also negate any performance gains from the 7+P arrays by mixing them with the 6+P+S array types. A properly utilized and balanced system is configured as shown in scenario 4, which is illustrated in Figure A-7.
Figure A-7 A good example of maximum sharing to maintain balance and proper utilization
In Figure A-7, we show a suggested logical configuration. Two ranks from each DA pair are grouped together in an extent pool, which provides excellent balance for optimal, evenly distributed I/O throughput. This configuration also allows you to take advantage of the aggregate throughput of 12 ranks in each extpool. In extent pools 0 and 1, you spread the I/O across 84 physical drives, because one drive in each rank is a spare. Extent pools 2 and 3 allow you to take advantage of spreading the I/O over 96 physical drives, thereby, outperforming I/O residing in extent pools 0 and 1. For a more detailed understanding of why the 7+P arrays outperform the 6+P+S arrays, refer to 4.3, Understanding the array to LUN relationship on page 55. By configuring the unit this way, you can control the balance of I/O more evenly for LUN to host assignments, taking advantage of I/O from both server complexes and all DA pairs, for example, by assigning LUNs to a host from extent pool 0 and 1, or extent pool 2 and 3.
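A hedged DS CLI sketch of assigning balanced LUNs to one host from extent pools 0 and 1 (the device ID, capacities, volume IDs, volume names, and volume group are placeholders):

dscli> mkfbvol -dev IBM.2107-75XXXXX -extpool P0 -cap 64 -name hostA_#h -volgrp V10 0000-0001
dscli> mkfbvol -dev IBM.2107-75XXXXX -extpool P1 -cap 64 -name hostA_#h -volgrp V10 0100-0101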
The advantages are:
1. Partial built-in isolation, separating arrays by RAID types
2. Faster I/O throughput from the use of larger arrays
3. At least two types of extent pools, one for each array characteristic, such as 6+P and 7+P

If I/O hot spots are encountered, you have the ability to move the hot spot to less busy hardware resources that perform differently at the rank level. This move aids in analyzing and confirming whether I/O can be improved at the DS8000 disk level. To see the LUN-level granularity with this type of configuration, refer to Figure A-8 for the LUN assignments from the extent pools configured in this manner.

In Figure A-8, we show that hot spots were encountered on both 512 GB LUNs configured in extent pools 0 and 1. These LUNs are shown in cells G2 and N2. The data residing on these LUNs can be migrated, by using a host migration technique, to extent pools 2 and 3. Extent pools 2 and 3 contain RAID types of 7+P, which perform faster than extent pools 0 and 1, which are made up of the 6+P+S array type.
Figure A-8 shows an example of four extent pools and the associated LUN mappings to the hosts from each extent pool.

Note: To move the data of an I/O hot spot between extent pools, use a host migration technique.

As the database grows and I/O increases, there might be a need for further isolation to remediate I/O contention at the spindle level. Scenario 5 shows how to further isolate at the rank to extent pool level while still taking advantage of the aggregate throughput gained from spreading the I/O across multiple arrays/ranks.
Figure A-9 More isolation but maintaining balance and spread across hardware resources
We suggest the logical configuration shown in Figure A-9, where one rank from each DA pair of like capacity and characteristics is grouped together in an extent pool, which results in more extent pools, as shown in Figure A-10 on page 566. More extent pools allow you to isolate to a finer granularity. Be careful when assigning LUNs to hosts to preserve this isolation. Remember that you can quickly negate this granularity of isolation by spreading the same application's I/O across multiple extent pools.
Figure A-10 How extent pools might look with this configuration
In Figure A-10, we show the extent pool breakout at the LUN level and have isolated the host LUN assignments to take advantage of this granularity. We still have a good spread of arrays to DA pairs, in that a rank from each DA pair has been pooled together for the aggregate throughput of all the DA pairs in each extent pool. We show that each host takes advantage of the load balancing offered by LUNs assigned from both DS8000 server complexes. The odd-numbered extent pools are owned by server complex 1, and the even-numbered extent pools are owned by server complex 0. At least one LUN from an even-numbered and an odd-numbered extent pool has been assigned to each host. For example, two LUNs from extent pool 4 and two LUNs from extent pool 5 are assigned to HostJ, volume group 13 (V13).
Figure A-11 DS8000 mixed RAID and density drives example of logical configuration for performance
The example shown in Figure A-11 might not be an ideal logical configuration, but balance is near perfect. Notice that in DA pair 0, we have one rank configured as a 3X2+S+S and another at 4X2. Rank2 is owned by server complex 0 and rank3 is owned by server complex 1. Each rank is tied to either server complex 0 or 1 without another rank of the same size or RAID format type from the same server complex. In Figure A-12 on page 568, we show a chart that we have converted from Figure A-11 so that we can better understand at a glance how this DS8000 is configured and why. We can more easily see the RAID types (RAID 5, RAID 6, or RAID 10), the capacity information rendered by each array and RAID type, the spread of arrays/ranks across the DA pairs, and the spread of the extent pools to the server complexes. By knowing all this information, you will be better informed as to how to provision the LUNs from each extent pool to exploit the performance rendered by each LUN for the best possible balanced throughput.
Figure A-12 Array characteristics chart showing, for each array, the DA pair, server complex number, extent pool number, rank number, array capacity, and RAID type
Figure A-12 shows the characteristics and relationship of each array by size, RAID type, capacity, extent pool, DA pair, and server complex ownership.

Note: As a reminder, if you can take advantage of the aggregate throughput rendered by spreading the application's I/O across the ranks, DA pairs, and server complexes, performance is almost always better. Separation of array/rank types is an important factor as well and enables the option of host server-level striping at RAID 0 for striping granularity. Mixing array type characteristics is allowed but can make performance worse at the host level due to the different strip widths rendered by the extent creation. For a better understanding of the strip width's relationship to the RAID type characteristic, refer to 4.3, Understanding the array to LUN relationship on page 55.

Figure A-13 on page 569 shows the breakdown of the LUNs configured in each extent pool, according to the mapping relationship shown in Figure A-12.
In Figure A-13, we show the LUN assignments made to the hosts. The important thing to note here is the balance and spread between the hardware resources, such as the server complexes, DA pairs, and arrays. Notice that we did not spread everything across everything. Instead, we split all the arrays of equal capacity and characteristics per DA pair to allow us to make extent pools equally divided between the server complexes, which accommodates more achievable load balancing at the host server level. For example, HostL has four LUNs assigned from 4 extent pools. Each LUN is spread across at least two ranks, two DA pairs, and two server complexes.
Appendix B.
Performance Logs and Alerts provides the following functions:

Counter logs: This function lets you create a log file with specific system objects and counters and their instances. Log files can be saved in different formats (file name + file number, or file name + file creation date) for use in System Monitor or for exporting to database or spreadsheet applications. You can schedule the logging of data, or the counter log can be started manually using program shortcuts. Counter log settings can also be saved in HTML format for use in a browser, either locally or remotely via TCP/IP.

Trace logs: This function lets you create trace logs that contain trace data provider objects. Trace logs differ from counter logs in that they measure data continuously rather than at specific intervals. You can log operating system or application activity using event providers. There are two kinds of providers: system and non-system providers.

Alerts: This function lets you track objects and counters to ensure that they are within a specified range. If the counter's value is under or over the specified value, an alert is issued.
4. Click Add Objects, select the PhysicalDisk and LogicalDisk objects, and select Add.
5. In the General tab, Sample data every: is used to set how frequently you capture the data. When configuring the interval, specify an interval that will provide enough granularity to allow you to identify the issue. For example, if you have a performance problem that only lasts 15 seconds, you need to set the interval to at most 15 seconds, or it will be difficult to capture the issue.

Note: As a general rule for post-processing, more than 2000 intervals is difficult to analyze in spreadsheet software. When configuring the interval, remember both the problem profile and the planned collection duration in order to not capture more data than can be reasonably analyzed. You might require multiple log files.

6. In the Run As field, enter the account with sufficient rights to collect the information about the server to be monitored, and then click Set Password to enter the relevant password.
7. The Log Files tab, shown in Figure B-3 on page 574, lets you set the type of the saved file, the suffix that is appended to the file name, and an optional comment. You can use two types of suffixes in a file name: numbers or dates. The log file types are listed in Table B-1 on page 574. If you click Configure, you can also set the location, file name, and file size for a log file. The Binary File format takes the least amount of space and is suggested for most logging.
Table B-1 Log file types
Log file type: Description
Text File (CSV): Comma-delimited values log file (CSV extension). Use this format to export to a spreadsheet.
Text File (TSV): Tab-delimited log file (TSV extension). Use this format to export the data to a spreadsheet program.
Binary File: Sequential, binary log file (BLG extension). Use this format to capture data intermittently (stopping and resuming log operation).
Binary Circular File: Circular, binary-format log file (BLG extension). Use this format to store log data to the same log file, overwriting old data.
SQL Database: Logs are output as an SQL database.
8. The Schedule tab shown in Figure B-4 on page 575 lets you specify when this log is started and stopped. You can select the option box in the start log and stop log section to manage this log manually using the Performance console shortcut menu. You can configure to start a new log file or run a command when this log file closes.
9. After clicking OK, the logging starts automatically. If for any reason it does not, simply right-click the log settings file in the perfmon window and click Start.
10. To stop the counter log, click the Stop the selected log icon on the toolbar.
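As an alternative to the GUI, a similar counter log can be created and controlled from the command line with the logman tool. A hedged sketch (the log name, counter paths, interval, and output path are examples only, and the exact options should be verified on your Windows level):

logman create counter Disk_Performance -c "\PhysicalDisk(*)\*" "\LogicalDisk(*)\*" -si 15 -f bin -o C:\PerfLogs\Disk_Performance
logman start Disk_Performance
logman stop Disk_Performance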
Figure B-6 Windows Server 2003 System Monitor Properties (Source tab)
3. At the System Monitor Properties dialog, select the Data tab. You now see any counter that you specified when setting up the Counter Log as shown in Figure B-7. If you only selected counter objects, the Counters section will be empty. To add counters from an object, simply click Add, and then select the appropriate counters.
4. In the Monitor View, click Add and select all the key PhysicalDisk counters for all instances and click Add. 5. You will see a window similar to Figure B-9 on page 579.
6. Right-click the chart side, and click Save Data As. 7. In the Save Data As window, select a file type of Text File (Comma delimited)(*.csv). 8. Provide a file name, such as Disk_Performance.
To run the Performance Monitor:
1. Expand the Data Collector Sets element in the tree view as referenced in Figure B-11 on page 581. Right-click User Defined.
Figure B-11 Windows Server 2008 User Defined Data Collector Sets
2. Select New Data Collector Set. 3. The Data Collector Set Wizard will be displayed as shown in Figure B-12. Click Next.
4. Provide a file name, such as Disk_Performance. 5. Select Create manually (Advanced) and click Next. 6. As shown in Figure B-13 on page 582, check the box next to Performance counter and click Next.
Figure B-13 Windows Server 2008 Advanced Data Collector Set configuration
7. As shown in Figure B-14, you will be asked to select the desired performance counters. Click Add and you will see a window similar to Figure B-15. Select all instances of the disks except the Total and manually select all the individual counters identified in Table 10-2 on page 293. Select the desired computer as well. Click OK.
8. You will now be at the Create new Data Collector Set Wizard window. In the Sample interval list box, set how frequently you capture the data. When configuring the interval, specify an interval that will provide enough granularity to allow you to identify the issue. For example, if you have a performance problem that only lasts 15 seconds, you need to set up the interval to be at the most 15 seconds or it will be difficult to capture the issue. Click Next. Note: As a general rule for post-processing, more than 2000 intervals is difficult to analyze in spreadsheet software. When configuring the interval, remember both the problem profile and the planned collection duration in order to not capture more data than can be reasonably analyzed. You might need multiple log files. 9. You will be prompted to enter the location of the log file and the file name as shown in Figure B-16 on page 584. In this example, the default file name is shown. After entering the file name and directory, click Finish.
Figure B-17 Windows Server 2008 Performance Monitor with new Data Collector Set
11.Right-click the Data Collector Set and select Start. 12.After you have collected the data for the desired period, you can select the Stop icon or right-click Data Collector Set and select Stop.
3. In the Monitor View, click Add, select all the key PhysicalDisk counters for all instances, and click Add. 4. You will see a window similar to Figure B-17 on page 584.
5. Right-click the Chart side and left-click Save Data As. 6. In the Save Data As window, select a file type of Text File (Comma delimited)(*.csv). 7. Provide a file name, such as Disk_Performance.
Appendix C.
C.1 Introduction
The scripts presented in this appendix were written and tested on AIX servers, but you can modify them to work with SUN Solaris and Hewlett-Packard UNIX (HP-UX). They have been modified slightly from the scripts published in earlier versions of performance and tuning IBM Redbooks publications. These modifications are mainly due to the absence of the ESS utility commands, lsess and lssdd, which are not available for the DS8000; we use the Subsystem Device Driver (SDD) datapath query essmap command instead. By downloading the Acrobat PDF version of this publication, you can copy and paste these scripts for easy installation on your host systems.

To function properly, the scripts presented here rely on:
- An AIX host running AIX 5L
- Subsystem Device Driver (SDD) for AIX Version 1.3.1.0 or later

The scripts presented in this appendix are:
- vgmap
- lvmap
- vpath_iostat
- ds_iostat
- test_disk_speeds

Important: These scripts are provided on an as-is basis. They are not supported or maintained by IBM in any formal way. No warranty is given or implied, and you cannot obtain help with these scripts from IBM.
C.2 vgmap
The vgmap script displays which vpaths a volume group uses and also to which rank each vpath belongs. Use this script to determine if a volume group is made up of vpaths on several different ranks and which vpaths to use for creating striped logical volumes. Example output of the vgmap command is shown in Example C-1. The vgmap shell script is in Example C-2.
Example: C-1 The vgmap output
# vgmap testvg
PV_NAME    RANK    PV STATE    TOTAL PPs    FREE PPs
testvg:
vpath0     1100    active      502          502
vpath2     1000    active      502          502
Example: C-2 The vgmap shell script
#!/bin/ksh
##############################################
# vgmap
# usage: vgmap <vgname>
#
# Displays DS8000 logical disks and RANK ids for each
# disk in the volume group
#
# Author: Pablo Clifton [email protected]
# Date: August 28, 2005
##############################################
datapath query essmap > /tmp/lssdd.out
lssddfile=/tmp/lssdd.out
workfile=/tmp/work.$0
sortfile=/tmp/sort.$0

# AIX
lsvg -p $1 | grep -v "PV_NAME" > $workfile

echo "\nPV_NAME    RANK    PV STATE    TOTAL PPs    FREE PPs    Free Distribution"

for i in `cat $workfile | grep vpath | awk '{print $1}'`
do
   #echo "$i ... rank"
   rank=`grep -w $i $lssddfile | awk '{print $11}' | head -n 1`
   sed "s/$i /$i $rank/g" $workfile > $sortfile
   cp $sortfile $workfile
done
cat $workfile
rm $workfile
rm $sortfile
########################## THE END ######################
C.3 lvmap
The lvmap script displays which vpaths and ranks a logical volume uses. Use this script to determine if a logical volume spans vpaths on several different ranks. The script does not tell you if a logical volume is striped or not. Use lslv <lv_name> for that information or modify this script. An example output of the lvmap command is shown in Example C-3. The lvmap shell script is in Example C-4.
Example: C-3 The lvmap output
# lvmap.ksh 8000stripelv

LV_NAME          RANK
8000stripelv:    N/A
vpath4           0000
vpath5           ffff

Example: C-4 The lvmap shell script
#!/bin/ksh
##############################################
# LVMAP
# usage: lvmap <lvname>
#
# displays logical disk and rank ids for each
# disk a logical volume resides on
# Note: the script depends on correct lssdd info in
# /tmp/lssdd.out
#
# Before running the first time, run:
# Author: Pablo Clifton [email protected]
# Date: August 28, 2005
##############################################
datapath query essmap > /tmp/lssdd.out
lssddfile=/tmp/lssdd.out
workfile=/tmp/work.$0
sortfile=/tmp/sort.$0
lslv -l $1 | grep -v " COPIES " > $workfile
for i in `cat $workfile | grep vpath | awk '{print $1}'`
do
   #echo "$i ... rank"
   rank=`grep -w $i $lssddfile | awk '{print $11}' | head -n 1`
   sed "s/$i /$i   $rank/g" $workfile > $sortfile
   cp $sortfile $workfile
done
echo "\nLV_NAME        RANK    COPIES    IN BAND    DISTRIBUTION"
cat $workfile
rm $workfile
rm $sortfile
###################### End #######################
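Because lvmap does not report striping, a quick way to check it is with lslv, as suggested above. The following one-liner is a sketch only; the logical volume name comes from Example C-3, and the attribute names filtered for are the ones AIX lslv normally prints for striped logical volumes.

# Show whether the logical volume is striped and with what strip size.
lslv 8000stripelv | egrep -i "stripe|upper bound"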
C.4 vpath_iostat
The vpath_iostat script is a wrapper program for AIX that converts iostat information that is based on hdisk devices into vpath-based statistics. The script first builds a map file that lists the hdisk devices and their associated vpaths, and then converts the iostat information from hdisks to vpaths. Before you run the script, make sure that the SDD datapath query essmap command is working properly; that is, all volume groups must use vpaths rather than hdisk devices. The command syntax is:
vpath_iostat                          (Ctrl-C to break out)
Or:
vpath_iostat <interval> <iteration>
An example of the output that vpath_iostat produces is shown in Example C-5. The vpath_iostat shell script is in Example C-6 on page 591.
Example: C-5 The vpath_iostat output
garmo-aix: Total VPATHS used: 8      16:16 Wed 26 Feb 2003      5 sec interval
garmo-aix   Vpath:     MBps       tps       KB/trans   MB_read   MB_wrtn
garmo-aix   vpath0     12.698     63.0      201.5      0.0       63.5
garmo-aix   vpath6     12.672     60.6      209.1      0.0       63.4
garmo-aix   vpath14    11.238     59.8      187.9      0.0       56.2
garmo-aix   vpath8     11.314     44.6      253.7      0.0       56.6
garmo-aix   vpath2     6.963      44.2      157.5      0.0       34.8
garmo-aix   vpath12    7.731      30.2      256.0      0.0       38.7
garmo-aix   vpath4     3.840      29.4      130.6      0.0       19.2
garmo-aix   vpath10    2.842      13.2      215.3      0.0       14.2
------------------------------------------------------------------------------------------
garmo-aix   TOTAL READ:   0.00 MB        TOTAL WRITTEN:  346.49 MB
garmo-aix   READ SPEED:   0.00 MB/sec    WRITE SPEED:    70.00 MB/sec

Example: C-6 The vpath_iostat shell script
#!/bin/ksh
#####################################################################
# Usage:
#  vpath_iostat                     (default: 5 second intervals, 1000 iterations)
#  vpath_iostat <interval> <count>
#
# Function:
#  Gather IOSTATS and report on DS8000 VPATHS instead of disk devices
#   AIX     hdisks
#   HP-UX   [under development]
#   SUN     [under development]
#   Linux   [under development]
#
# Note:
#
#  A small amount of free space < 1MB is required in /tmp
#
# Author: Pablo Clifton [email protected]
# Date: August 28, 2005
#####################################################################
##########################################################
# set the default period for number of seconds to collect
# iostat data before calculating average
period=5
iterations=1000
essfile=/tmp/disk-vpath.out        # File to store output from lssdd command
ifile=/tmp/lssdd.out               # Input file containing LSSDD info
ds=`date +%d%H%M%S`                # time stamp
hname=`hostname`                   # get Hostname
ofile=/tmp/vstats                  # raw iostats
wfile=/tmp/wvfile                  # work file
wfile2=/tmp/wvfile2                # work file
pvcount=`iostat | grep hdisk | wc -l | awk '{print $1}'`
#############################################
# Create a list of the vpaths this system uses
# Format: hdisk DS-vpath
# datapath query essmap output MUST BE correct or the IO stats reported
# will not be correct
#############################################
if [ ! -f $ifile ]
then
   echo "Collecting DS8000 info for disk to vpath map..."
   datapath query essmap > $ifile
fi
cat $ifile | awk '{print $2 "\t" $1}' > $essfile
#########################################
for internal in `lsdev -Cc disk | grep SCSI | awk '{print $1}'`
do
   echo "$internal $internal" >> $essfile
done
###############################################
# Set interval value or leave as default
if [[ $# -ge 1 ]]
then
   period=$1
fi
##########################################
# Set <iteration> value
if [[ $# -eq 2 ]]
then
   iterations=$2
fi
#################################################################
# ess_iostat <interval> <count>
i=0
while [[ $i -lt $iterations ]]
do
   iostat $period 2 > $ofile
   grep hdisk $ofile > $ofile.temp     # only gather hdisk info - not cd
                                       # or other devices
   tail -n $pvcount $ofile.temp | grep -v "0.0       0.0       0.0        0         0" \
      | sort +4 -n -r | head -n 100 > $wfile
   ###########################################
   #Converting hdisks to vpaths....           #
   ###########################################
   for j in `cat $wfile | awk '{print $1}'`
   do
      vpath=`grep -w $j $essfile | awk '{print $2}'`
      sed "s/$j /$vpath/g" $wfile > $wfile2
      cp $wfile2 $wfile
   done
   ###########################################
   # Determine Number of different VPATHS used
   ###########################################
   numvpaths=`cat $wfile | awk '{print $1} ' | grep -v hdisk | sort -u | wc -l`
   dt=`date +"%H:%M %a %d %h %Y"`
   print "\n$hname: Total VPATHS used: $numvpaths     $dt     $period sec interval"
   printf "%s\t%s\t\t%-9s\t%-9s\t%-9s\t%-9s\t%-9s\n" "$hname" "Vpath:" "MBps" "tps" \
      "KB/trans" "MB_read" "MB_wrtn"
   ###########################################
   # Sum Usage for EACH VPATH and Internal Hdisk
   ###########################################
   for x in `cat $wfile | awk '{ print $1}' | sort -u`
   do
      cat $wfile | grep -w $x | awk '{ printf ("%4d\t\t%-9s\t%-9s\t%-9s\t%-9s\t%-9s\n" , \
         $1, $2, $3, $4, $5, $6) }' | awk 'BEGIN { }
         { tmsum=tmsum+$2 }
         { kbpsum=kbpsum+$3 }
         { tpsum=tpsum+$4 }
         { kbreadsum=kbreadsum+$5 }
         { kwrtnsum=kwrtnsum+$6 }
         END {
         if ( tpsum > 0 )
            printf ("%-7s\t%4s\t\t%-9.3f\t%-9.1f\t%-9.1f\t%-9.1f\t%-9.1f\n" , hname, \
               vpath, kbpsum/1000, tpsum, kbpsum/tpsum , kbreadsum/1000, kwrtnsum/1000)
         else
            printf ("%-7s\t%4s\t\t%-9.3f\t%-9.1f\t%-9.1f\t%-9.1f\t%-9.1f\n" , hname, \
               vpath, kbpsum/1000, tpsum, "0", kbreadsum/1000, kwrtnsum/1000)
         }' hname="$hname" vpath="$x" >> $wfile2.tmp
   done
   #############################################
   # Sort VPATHS/hdisks by NUMBER of TRANSACTIONS
   #############################################
   if [[ -f $wfile2.tmp ]]
   then
      cat $wfile2.tmp | sort +3 -n -r
      rm $wfile2.tmp
   fi
   ##############################################################
   # SUM TOTAL IO USAGE for ALL DISKS/LUNS over INTERVAL
   ##############################################################
   #Disks:      % tm_act   Kbps   tps   Kb_read   Kb_wrtn
   # field 5 read   field 6 written
   tail -n $pvcount $ofile.temp | grep -v "0.0       0.0       0.0        0         0" \
      | awk 'BEGIN { }
      { rsum=rsum+$5 }
      { wsum=wsum+$6 }
      END {
      rsum=rsum/1000
      wsum=wsum/1000
      printf ("------------------------------------------------------------------------------------------\n")
      if ( divider > 1 ) {
         printf ("%-7s\t%14s\t%4.2f\t%s\t%14s\t%4.2f\t%s\n", hname, "TOTAL READ: ", \
            rsum, "MB", "TOTAL WRITTEN: ", wsum, "MB")
      }
      printf ("%-7s\t%14s\t%4.2f\t%s\t%14s\t%4.2f\t%s\n\n\n", hname, "READ SPEED: ", \
         rsum/divider, "MB/sec", "WRITE SPEED: ", wsum/divider, "MB/sec" )
      }' hname="$hname" divider="$period"
   let i=$i+1
done
#
rm $ofile
rm $wfile
rm $wfile2
rm $essfile
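A typical way to use vpath_iostat is to let it sample for a while and keep a copy of the output for later analysis. The following invocation is only an illustration; the interval (30 seconds), the iteration count (120), and the log file name are placeholders.

# Sample vpath statistics every 30 seconds for one hour and keep a log.
./vpath_iostat 30 120 | tee /tmp/vpath_iostat.`date +%Y%m%d`.log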
C.5 ds_iostat
The ds_iostat script is a wrapper program for AIX that converts iostat information that is based on hdisk devices into rank-based statistics. The ds_iostat script depends on the SDD datapath query essmap command and on iostat. The script first builds a map file that lists the hdisk devices and their associated ranks, and then converts the iostat information from hdisks to ranks. To run the script, enter:
ds_iostat                          (Ctrl-C to break out)
Or:
ds_iostat <interval> <iteration>
An example of the ds_iostat output is shown in Example C-7. The ds_iostat shell script is in Example C-8.
Example: C-7 The ds_iostat output
# ds_iostat 5 1

garmo-aix: Total RANKS used: 12      20:01 Sun 16 Feb 2003      5 sec interval
garmo-aix   Ranks:   MBps       tps       KB/trans   MB_read   MB_wrtn
garmo-aix   1403     9.552      71.2      134.2      47.8      0.0
garmo-aix   1603     6.779      53.8      126.0      34.0      0.0
garmo-aix   1703     5.743      43.0      133.6      28.8      0.0
garmo-aix   1503     5.809      42.8      135.7      29.1      0.0
garmo-aix   1301     3.665      32.4      113.1      18.4      0.0
garmo-aix   1601     3.206      27.2      117.9      16.1      0.0
garmo-aix   1201     2.734      22.8      119.9      13.7      0.0
garmo-aix   1101     2.479      22.0      112.7      12.4      0.0
garmo-aix   1401     2.299      20.4      112.7      11.5      0.0
garmo-aix   1501     2.180      19.8      110.1      10.9      0.0
garmo-aix   1001     2.246      19.4      115.8      11.3      0.0
garmo-aix   1701     2.088      18.8      111.1      10.5      0.0
------------------------------------------------------------------------------------------
garmo-aix   TOTAL READ:  430.88 MB        TOTAL WRITTEN:  0.06 MB
garmo-aix   READ SPEED:  86.18 MB/sec     WRITE SPEED:    0.01 MB/sec

Example: C-8 The ds_iostat shell script
#!/bin/ksh
#set -x
#########################
#Usage:
# ds_iostat                     (default: 5 second intervals, 1000 iterations)
# ds_iostat <interval> <count>
#
# Function:
# Gather IOSTATS and report on DS8000 RANKS instead of disk devices
#  AIX     hdisks
#  HP-UX
#  SUN
#  Linux
#
# Note:
# ds_iostat depends on valid rank ids from the datapath query essmap command
#
# A small amount of free space < 1MB is required in /tmp
#
# Author: Pablo Clifton [email protected]
# Date: Feb 28, 2003
#####################################################################
##########################################################
# set the default period for number of seconds to collect
# iostat data before calculating average
period=5
iterations=1000
essfile=/tmp/lsess.out
ds=`date +%d%H%M%S`                # time stamp
hname=`hostname`                   # get Hostname
ofile=/tmp/rstats                  # raw iostats
wfile=/tmp/wfile                   # work file
wfile2=/tmp/wfile2                 # work file
pvcount=`iostat | grep hdisk | wc -l | awk '{print $1}'`
#############################################
# Create a list of the ranks this system uses
# Format: hdisk DS-rank
# datapath query essmap output MUST BE correct or the IO stats reported
# will not be correct
#############################################
datapath query essmap|grep -v "*"|awk '{print $2 "\t" $11}' > $essfile
#########################################
# ADD INTERNAL SCSI DISKS to RANKS list
#########################################
for internal in `lsdev -Cc disk | grep SCSI | awk '{print $1}'`
do
   echo "$internal $internal" >> $essfile
done
###############################################
# Set interval value or leave as default
if [[ $# -ge 1 ]]
then
   period=$1
fi
##########################################
# Set <iteration> value
if [[ $# -eq 2 ]]
then
   iterations=$2
fi
#################################################################
# ess_iostat <interval> <count>
i=0
while [[ $i -lt $iterations ]]
do
   iostat $period 2 > $ofile
   grep hdisk $ofile > $ofile.temp     # only gather hdisk info - not cd
                                       # or other devices
   tail -n $pvcount $ofile.temp | grep -v "0.0       0.0       0.0        0         0" \
      | sort +4 -n -r | head -n 100 > $wfile
   ###########################################
   #Converting hdisks to ranks....            #
   ###########################################
   for j in `cat $wfile | awk '{print $1}'`
   do
      rank=`grep -w $j $essfile | awk '{print $2}'`
      sed "s/$j /$rank/g" $wfile > $wfile2
      cp $wfile2 $wfile
   done
   ###########################################
   # Determine Number of different ranks used
   ###########################################
   numranks=`cat $wfile | awk '{print $1} ' | grep -v hdisk | cut -c 1-4| sort -u -n | wc -l`
   dt=`date +"%H:%M %a %d %h %Y"`
   print "\n$hname: Total RANKS used: $numranks     $dt     $period sec interval"
   printf "%s\t%s\t\t%-9s\t%-9s\t%-9s\t%-9s\t%-9s\n" "$hname" "Ranks:" "MBps" "tps" \
      "KB/trans" "MB_read" "MB_wrtn"
   ###########################################
   # Sum Usage for EACH RANK and Internal Hdisk
   ###########################################
   for x in `cat $wfile | awk '{ print $1}' | sort -u`
   do
      cat $wfile | grep -w $x | awk '{ printf ("%4d\t\t%-9s\t%-9s\t%-9s\t%-9s\t%-9s\n" , \
         $1, $2, $3, $4, $5, $6) }' | awk 'BEGIN { }
         { tmsum=tmsum+$2 }
         { kbpsum=kbpsum+$3 }
         { tpsum=tpsum+$4 }
         { kbreadsum=kbreadsum+$5 }
         { kwrtnsum=kwrtnsum+$6 }
         END {
         if ( tpsum > 0 )
            printf ("%-7s\t%4s\t\t%-9.3f\t%-9.1f\t%-9.1f\t%-9.1f\t%-9.1f\n" , hname, \
               rank, kbpsum/1000, tpsum, kbpsum/tpsum , kbreadsum/1000, kwrtnsum/1000)
         else
            printf ("%-7s\t%4s\t\t%-9.3f\t%-9.1f\t%-9.1f\t%-9.1f\t%-9.1f\n" , hname, \
               rank, kbpsum/1000, tpsum, "0", kbreadsum/1000, kwrtnsum/1000)
         }' hname="$hname" rank="$x" >> $wfile2.tmp
   done
   #############################################
   # Sort RANKS/hdisks by NUMBER of TRANSACTIONS
   #############################################
   if [[ -f $wfile2.tmp ]]
   then
      cat $wfile2.tmp | sort +3 -n -r
      rm $wfile2.tmp
   fi
   ##############################################################
   # SUM TOTAL IO USAGE for ALL DISKS/LUNS over INTERVAL
   ##############################################################
   #Disks:      % tm_act   Kbps   tps   Kb_read   Kb_wrtn
   # field 5 read   field 6 written
   tail -n $pvcount $ofile.temp | grep -v "0.0       0.0       0.0        0         0" \
      | awk 'BEGIN { }
      { rsum=rsum+$5 }
      { wsum=wsum+$6 }
      END {
      rsum=rsum/1000
      wsum=wsum/1000
      printf ("------------------------------------------------------------------------------------------\n")
      if ( divider > 1 ) {
         printf ("%-7s\t%14s\t%4.2f\t%s\t%14s\t%4.2f\t%s\n", hname, "TOTAL READ: ", \
            rsum, "MB", "TOTAL WRITTEN: ", wsum, "MB")
      }
      printf ("%-7s\t%14s\t%4.2f\t%s\t%14s\t%4.2f\t%s\n\n\n", hname, "READ SPEED: ", \
         rsum/divider, "MB/sec", "WRITE SPEED:", wsum/divider, "MB/sec" )
      }' hname="$hname" divider="$period"
   let i=$i+1
done
rm $ofile
rm $wfile
rm $wfile2
rm $essfile
################################## THE END ##########################
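When ds_iostat reports a busy rank (rank 1403 is the busiest one in Example C-7), you usually want to know which vpaths and hdisks sit on that rank. The following one-liner is a hedged illustration; it assumes the same essmap column positions as the scripts (field 1 = vpath, field 2 = hdisk, field 11 = rank) and that 1403 is the rank that you are interested in.

# List the vpaths and hdisks that belong to rank 1403.
datapath query essmap | awk '$11 == "1403" {print $1, $2}' | sort -u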
C.6 test_disk_speeds
Use the test_disk_speeds script to run a 100 MB sequential read against one raw vpath (rvpath0) and to record the speed at different times throughout the day, so that you can determine the average read speed that a rank can deliver in your environment.

You can change the amount of data read, the block size, and the vpath by editing the script and changing these variables:
tsize=100       # MB
bs=128          # KB
vpath=rvpath0   # disk to test

The test_disk_speeds shell script is shown in Example C-9.
Example: C-9 The test_disk_speeds
#!/bin/ksh
##########################################################
# test_disk_speeds
# Measure disk speeds using dd
#
# tsize    = total test size in MB
# bs       = block size in KB
# testsize = total test size in KB; tsize*1000
# count    = equal to the number of test blocks to read which is
#            testsize/bsize
# Author: Pablo Clifton [email protected]
# Date: August 28, 2005
#########################################################
# SET these 2 variables to change the block size and total
# amount of data read. Set the vpath to test
tsize=100       # MB
bs=128          # KB
vpath=rvpath0   # disk to test
#########################################################
let testsize=$tsize*1000
let count=$testsize/$bs
# calculate start time, dd file, calculate end time
stime=`perl -e "print time"`
dd if=/dev/$vpath of=/dev/null bs="$bs"k count=$count
etime=`perl -e "print time"`
# get total run time in seconds
let totalt=$etime-$stime
let speed=$tsize/$totalt
printf "$vpath\t%4.1f\tMB/sec\t$tsize\tMB\tbs="$bs"k\n" $speed
########################## THE END ###############################
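To record the read speed at different times of the day, as suggested above, you can wrap the script in a simple loop. This is a sketch only; the number of samples (8), the one-hour sleep, and the log file name are placeholders.

#!/bin/ksh
# Run test_disk_speeds once an hour for eight hours and append the results.
i=0
while [[ $i -lt 8 ]]
do
   date >> /tmp/disk_speeds.log
   ./test_disk_speeds >> /tmp/disk_speeds.log
   sleep 3600
   let i=$i+1
done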
C.7 lsvscsimap.ksh
When you assign many logical unit numbers (LUNs) from the DS8000 to a Virtual I/O Server (VIOS) and then map those LUNs to logical partition (LPAR) clients, over time even trivial activities, such as upgrading the Subsystem Device Driver Path Control Module (SDDPCM) device driver, can become challenging. For this reason, we created two scripts: the first script (Example C-10 on page 599) generates a list of the mappings between the LUNs and the LPAR clients. The second script uses that output to create the commands needed to recreate the mappings between the LUNs and the LPAR clients.
To list the configuration, execute the following commands:
# cd /home/padmin
# ./lsvscsimap.ksh
To save the configuration in a file, type the following command:
# ./lsvscsimap.ksh -s test.out
Example: C-10 The lsvscsimap.ksh script #!/usr/bin/ksh93 ######################################################################### # # # Name of script: lsvscsimap.ksh # # Path: /home/padmin # # Node(s): # # Info: Script to generate the configuration mappings among the LUNs # # and LPAR Clients, based on the S/N of disks. # # # # Author: Anderson F. Nobre # # Creation date: 14/03/2008 # # # # Modification data: ??/??/???? # # Modified by: ???????? # # Modifications: # # - ????????????????? # # # ######################################################################### #-----------------------------------------------------------------------# Function: usage #-----------------------------------------------------------------------function usage { printf "Usage: %s [-s value]\n" $0 printf "Where:\n" printf " -s <value>: Generate the mappings among the LUNs \n" printf " and LPAR Clients\n" } #-----------------------------------------------------------------------# Function: lsmapbyvhost #-----------------------------------------------------------------------function lsmapbyvhost { cat <<EOF > /tmp/lsmapbyvhost.awk /^vhost/ {INI=1; VHOST=\$1; VSCSISRVSLOT=substr(\$2, index(\$2, "-C")+2); LPAR=\$3} /^VTD/ {VTDISK=\$2} /^Backing device/ {HDISK=\$3} /^Physloc/ {printf "%s %s %s %s %d\n", VHOST, HDISK, VTDISK, VSCSISRVSLOT, LPAR} EOF ioscli lsmap -all > /tmp/lsmap-all.out cat /tmp/lsmap-all.out | awk -f /tmp/lsmapbyvhost.awk > /tmp/lsmapbyvhost.out } #-----------------------------------------------------------------------Appendix C. UNIX shell scripts
# Function: ppqdhdbysn #-----------------------------------------------------------------------function ppqdhdbysn { cat <<EOF > /tmp/ppqdhdbysn.awk /DEVICE NAME:/ {DEVNUM=\$2; HDISK=\$5} /SERIAL:/ {print \$2, HDISK} EOF pcmpath query device > /tmp/ppqd.out cat /tmp/ppqd.out | awk -f /tmp/ppqdhdbysn.awk > /tmp/ppqdhdbysn.out } #-----------------------------------------------------------------------# Function: lslvbyvg #-----------------------------------------------------------------------function lslvbyvg { lsvg -o | while read vg do ppsize=$(lsvg ${vg} | awk '/PP SIZE:/ {print $6}') lsvg -l ${vg} | egrep "raw|jfs|jfs2" | egrep -v "jfs2log|jfslog" | awk '{print $1, $3}' | while read lv ppnum do lvhdisks=$(lslv -l ${lv} | egrep "hdisk|vpath" | awk '{print $1}' | tr '\n' ',') printf "%s %s %s %s\n" ${vg} ${lv} $(expr ${ppsize} \* ${ppnum}) $(echo ${lvhdisks} | sed -e 's/,$//') done done > /tmp/lslvbyvg.out } #-----------------------------------------------------------------------# Function: lspvidbyhd #-----------------------------------------------------------------------function lspvidbyhd { lspv > /tmp/lspv.out } #-----------------------------------------------------------------------# Declaring global environment variables... #-----------------------------------------------------------------------export PATH=/usr/bin:/etc:/usr/sbin:/usr/ucb:/usr/bin/X11:/sbin:/usr/java14/jre/bin:/usr/java14/bin:/usr/ios/cli:/ usr/ios/utils:/usr/ios/lpm/bin:/usr/ios/oem:/usr/ios/ldw/bin:$HOME ######################################################################### # Main Logic Script... # ######################################################################### SFLAG=
while getopts s:h name do case $name in s ) SFLAG=1 SVAL=$OPTARG ;; h|\? ) usage exit -1 ;; esac done shift $(($OPTIND - 1)) #-----------------------------------------------------------------------# Collecting the necessary information and formating... #-----------------------------------------------------------------------lsmapbyvhost ppqdhdbysn lslvbyvg lspvidbyhd #-----------------------------------------------------------------------# Generating the file with the configuration mappings... #-----------------------------------------------------------------------# The config file of VSCSI Server has the following fields: # Name of VIOS Server typeset VIOSNAME=$(hostname) # List of hdisks where is the LV or hdisk of storage mapped typeset -A ALVHDISKS # Name of VG of LV mapped typeset -A AVGNAME # Name of LV typeset -A ALV # Size of LV/hdisk in MB typeset -A ALVHDSIZE # Virtual SCSI Server Device typeset -A AVSCSISRV # Server slot typeset -A ASRVSLOT # Client LPAR ID typeset -A ACLTLPAR # Client slot. Always "N/A" CLTSLOT="N/A" # Virtual target device typeset -A AVTDEVICE # PVID, case an LV is mapped typeset -A ALUNPVID # S/N, case a LUN is mapped typeset -A ALUNSN cat /tmp/ppqdhdbysn.out | while read LUNSN LVHDISKS do
ALUNSN[${LVHDISKS}]=${LUNSN} done cat /tmp/lslvbyvg.out | while read VGNAME LV LVHDSIZE LVHDISKS do AVGNAME[${LV}]=${VGNAME} ALVHDSIZE[${LV}]=${LVHDSIZE} ALVHDISKS[${LV}]=${LVHDISKS} done cat /tmp/lspv.out | while read LVHDISKS LUNPVID VGNAME VGSTATUS do ALUNPVID[${LVHDISKS}]=${LUNPVID} done if [[ ${SFLAG} -eq 1 ]] then cat /tmp/lsmapbyvhost.out | while read VSCSISRV LVHDISKS VTDEVICE SRVSLOT CLTLPAR do if [[ ${LVHDISKS} == @(*hdisk*) ]] then printf "%s %s %s %s %s %s %s %s %s %s %s %s\n" ${VIOSNAME} ${LVHDISKS} "N/A" "N/A" "N/A" ${VSCSISRV} ${SRVSLOT} ${CLTLPAR} ${CLTSLOT} ${VTDEVICE} ${ALUNPVID[${LVHDISKS}]} ${ALUNSN[${LVHDISKS}]} else printf "%s %s %s %s %s %s %s %s %s %s %s %s\n" ${VIOSNAME} ${ALVHDISKS[${LVHDISKS}]} ${AVGNAME[${LVHDISKS}]} ${LVHDISKS} ${ALVHDSIZE[${LVHDISKS}]} ${VSCSISRV} ${SRVSLOT} ${CLTLPAR} ${CLTSLOT} ${VTDEVICE} "N/A" "N/A" fi done | sort -k6 > ${SVAL} else cat /tmp/lsmapbyvhost.out | while read VSCSISRV LVHDISKS VTDEVICE SRVSLOT CLTLPAR do if [[ ${LVHDISKS} == @(*hdisk*) ]] then printf "%s %s %s %s %s %s %s %s %s %s %s %s\n" ${VIOSNAME} ${LVHDISKS} "N/A" "N/A" "N/A" ${VSCSISRV} ${SRVSLOT} ${CLTLPAR} ${CLTSLOT} ${VTDEVICE} ${ALUNPVID[${LVHDISKS}]} ${ALUNSN[${LVHDISKS}]} else printf "%s %s %s %s %s %s %s %s %s %s %s %s\n" ${VIOSNAME} ${ALVHDISKS[${LVHDISKS}]} ${AVGNAME[${LVHDISKS}]} ${LVHDISKS} ${ALVHDSIZE[${LVHDISKS}]} ${VSCSISRV} ${SRVSLOT} ${CLTLPAR} ${CLTSLOT} ${VTDEVICE} "N/A" "N/A" fi done | sort -k6 fi
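The records that lsvscsimap.ksh writes contain one mapping per line, with the vhost in field 6, the virtual target device in field 10, and the LUN serial number in field 12 (per the printf statements in Example C-10). A small, hypothetical lookup against the saved file therefore looks as follows; test.out and the serial number are placeholders.

# Which vhost and virtual target device serve the LUN with this serial number?
LUNSN=75ABCDE1234                                  # placeholder LUN serial number
awk -v sn="$LUNSN" '$12 == sn {print "vhost: " $6 "  VTD: " $10}' test.out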
C.8 mkvscsimap.ksh
Suppose that you need to remove the configuration mappings between the LUNs and the LPAR clients, remove the hdisk devices, upgrade the multipath I/O (MPIO) device driver, and then rediscover the hdisk devices. Afterward, you still need to rebuild the mappings between the LUNs and the LPARs, but the hdisk device names are no longer in the same order after the upgrade. Based on the saved configuration file, you can execute the following command to generate the commands that recreate the mappings with the new hdisk names (the mkvscsimap.ksh script is shown in Example C-11):
# ./mkvscsimap.ksh -c test.out -s test2.out -r
Example: C-11 The mkvscsimap.ksh script #!/usr/bin/ksh93 ######################################################################### # # # Name of script: mkvscsimap.ksh # # Path: /home/padmin # # Node(s): # # Info: Script to create the commands to map the LUNs and vhosts, # # based on S/N of disks. # # # # Author: Anderson F. Nobre # # Creation date: 14/03/2008 # # # # Modification date: ??/??/???? # # Modified by: ???????? # # Modifications: # # - ????????????????? # # # ######################################################################### #-----------------------------------------------------------------------# Function: usage #-----------------------------------------------------------------------function usage { printf "Usage: %s [-c value] [-s value] [-x] [-r]\n" $0 printf "Where:\n" printf " -c <value>: Configuration file of VSCSI mappings\n" printf " -s <value>: Generates the script with the commands to map the LUNs in the VSCSI devices\n" printf " -x: Creates the mappings of LUNs in the VSCSI devices\n" printf " -r: Recreates the mappings with the new hdisks\n" } #-----------------------------------------------------------------------# Function: ppqdhdbysn #-----------------------------------------------------------------------function ppqdhdbysn { cat <<EOF > /tmp/ppqdhdbysn.awk /DEVICE NAME:/ {DEVNUM=\$2; HDISK=\$5} /SERIAL:/ {print \$2, HDISK} EOF pcmpath query device > /tmp/ppqd.out cat /tmp/ppqd.out | awk -f /tmp/ppqdhdbysn.awk > /tmp/ppqdhdbysn.out } #-----------------------------------------------------------------------# Declaring the global environment variables... #-----------------------------------------------------------------------export PATH=/usr/bin:/etc:/usr/sbin:/usr/ucb:/usr/bin/X11:/sbin:/usr/java14/jre/bin:/usr/java14/bin:/usr/ios/cli:/ usr/ios/utils:/usr/ios/lpm/bin:/usr/ios/oem:/usr/ios/ldw/bin:$HOME
######################################################################### # Main Logic of Script... # ######################################################################### # If there are no arguments, then print the usage... if [[ $# -eq 0 ]] then usage exit -1 fi CFLAG= SFLAG= XFLAG= RFLAG= while getopts c:s:xr name do case $name in c ) CFLAG=1 CVAL=$OPTARG ;; s ) SFLAG=1 SVAL=$OPTARG ;; x ) XFLAG=1 ;; r ) RFLAG=1 ;; h|\? ) usage exit -1 ;; esac done shift $(($OPTIND - 1)) # If the configuration file won't be informed, then finishes with error... if [[ $CFLAG -eq 0 ]] then printf "The configuration file needs to be informed!!!\n" usage exit -1 fi # If the options 's' and 'x' has been selected, then finishes with error... if [[ $SFLAG -eq 1 && $XFLAG -eq 1 ]] then printf "The options '-s' and '-x' can't be used together!!!\n" usage exit -1 fi
if [[ $SFLAG -eq 1 && $RFLAG -eq 0 ]] then cat $CVAL | while read VIOSNAME LVHDISKS AVGNAME LV LVHDSIZE VSCSISRV SRVSLOT CLTLPAR CLTSLOT VTDEVICE LUNPVID LUNSN do if [[ ${AVGNAME} == @(N/A) ]] then printf "mkvdev -vdev ${LVHDISKS} -vadapter ${VSCSISRV} -dev ${VTDEVICE}\n" else printf "mkvdev -vdev ${LV} -vadapter ${VSCSISRV} -dev ${VTDEVICE}\n" fi done > ${SVAL} elif [[ $SFLAG -eq 1 && $RFLAG -eq 1 ]] then typeset -A ALUNHD ppqdhdbysn cat /tmp/ppqdhdbysn.out | while read LUNSN LUNHD do ALUNHD[${LUNSN}]=${LUNHD} done cat $CVAL | while read VIOSNAME LVHDISKS AVGNAME LV LVHDSIZE VSCSISRV SRVSLOT CLTLPAR CLTSLOT VTDEVICE LUNPVID LUNSN do if [[ ${AVGNAME} == @(N/A) ]] then printf "mkvdev -vdev ${ALUNHD[${LUNSN}]} -vadapter ${VSCSISRV} -dev vt${ALUNHD[${LUNSN}]}\n" else printf "mkvdev -vdev ${LV} -vadapter ${VSCSISRV} -dev ${VTDEVICE}\n" fi done > ${SVAL} fi if [[ $XFLAG -eq 1 && $RFLAG -eq 0 ]] then cat $CVAL | while read VIOSNAME LVHDISKS AVGNAME LV LVHDSIZE VSCSISRV SRVSLOT CLTLPAR CLTSLOT VTDEVICE LUNPVID LUNSN do if [[ ${AVGNAME} == @(N/A) ]] then printf "mkvdev -vdev ${LVHDISKS} -vadapter ${VSCSISRV} -dev ${VTDEVICE}\n" # ioscli mkvdev -vdev ${LVHDISKS} -vadapter ${VSCSISRV} -dev ${VTDEVICE} else printf "mkvdev -vdev ${LV} -vadapter ${VSCSISRV} -dev ${VTDEVICE}\n" # ioscli mkvdev -vdev ${LV} -vadapter ${VSCSISRV} -dev ${VTDEVICE} fi done elif [[ $XFLAG -eq 1 && $RFLAG -eq 1 ]] then typeset -A ALUNHD ppqdhdbysn cat /tmp/ppqdhdbysn.out | while read LUNSN LUNHD do
ALUNHD[${LUNSN}]=${LUNHD} done cat $CVAL | while read VIOSNAME LVHDISKS AVGNAME LV LVHDSIZE VSCSISRV SRVSLOT CLTLPAR CLTSLOT VTDEVICE LUNPVID LUNSN do if [[ ${AVGNAME} == @(N/A) ]] then printf "mkvdev -vdev ${ALUNHD[${LUNSN}]} -vadapter ${VSCSISRV} -dev vt${ALUNHD[${LUNSN}]}\n" # ioscli mkvdev -vdev ${ALUNHD[${LUNSN}]} -vadapter ${VSCSISRV} -dev vt${ALUNHD[${LUNSN}]} else printf "mkvdev -vdev ${LV} -vadapter ${VSCSISRV} -dev ${VTDEVICE}\n" # ioscli mkvdev -vdev ${LV} -vadapter ${VSCSISRV} -dev ${VTDEVICE} fi done fi
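Putting the two scripts together, a driver-upgrade cycle looks roughly like the following sketch. The removal and upgrade steps in the middle are intentionally left as comments because they are environment specific; only the two script invocations and their options come from this appendix.

# 1. Save the current LUN-to-LPAR mappings.
./lsvscsimap.ksh -s test.out
# 2. Remove the virtual target devices and hdisk devices (environment specific).
# 3. Upgrade SDDPCM and rediscover the hdisk devices (environment specific).
# 4. Generate the mkvdev commands that recreate the mappings with the new hdisk names.
./mkvscsimap.ksh -c test.out -s test2.out -r
# 5. Review test2.out and run the generated mkvdev commands from the VIOS command line.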
Appendix D.
Post-processing scripts
This appendix provides several scripts for post-processing and correlating data from different sources.
D.1 Introduction
In our experience, it is often necessary to correlate data from different data sources. There are several ways to do this, including manual correlation, Microsoft Excel macros, and shell scripts. This appendix demonstrates how to use Perl scripts to correlate and post-process data from multiple sources. Although these scripts can be converted to run on UNIX fairly easily, we assume that you will run them on a Windows system.

In this appendix, we describe:
- Dependencies
- Running the scripts
- perfmon-essmap.pl
- iostat_aix53_essmap.pl
- iostat_sun-mpio.pl
- tpc_volume-lsfbvol-showrank.pl

Note: The purpose of these scripts is not to provide a toolkit that addresses every possible configuration scenario; rather, it is to demonstrate several of the possibilities that are available.
D.2 Dependencies
In order to execute the scripts described in this section, you need to prepare your system.
Software
You need to have the following software installed:
- ActivePerl 5.6.x or later. At the time of writing this book, you can download Perl from the following Web sites:
  http://www.download.com/3000-2229_4-10634113.html
  http://www.activestate.com/Products/activeperl/features.plex
- Cygwin. Although it is not absolutely necessary, users who prefer a UNIX-style bash shell can download and install Cygwin; all of the samples provided in this appendix assume that Cygwin is installed. At the time of writing this book, you can download Cygwin from:
  http://www.cygwin.com/
Script location
There is nothing magical about the location of the scripts; however, the shell from which you run a script must be able to find it. We suggest placing the scripts that you plan to reuse in a common directory, such as:
C:\Performance\scripts
C:\Performance\bin
For simplicity, we place the scripts in the same directory as the performance and capacity configuration data.
Script creation
To run these scripts, copy the entire contents of each script into a file and name the file with a .pl suffix. You can copy the contents and name the file in Notepad, but you need to ensure that the Save as type option is set to All Files, as shown in Figure D-1 on page 609.
6. Open the file in Excel. In this case, we redirected the output to Win2003_ITSO_Rotate_Volumes_RAID0_RANDOM.csv.
Note: If your file is blank or contains only headers, determine whether the issue is with the input files or whether you damaged the script when you created it. For interpretation and analysis of Windows data, refer to 10.8.6, Analyzing performance data on page 297.
By default, all of the scripts print to standard output. If you want to place the output in a file, simply redirect the output to a file.
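For example, assuming that you are in a Cygwin shell in the directory that holds both the script and the collected data, an invocation of the first script with its output captured to a file might look like the following; the input and output file names are placeholders.

# Correlate the perfmon export with the SDD essmap output and save the result.
perl perfmon-essmap.pl Disk_Performance.csv essmap.out > perfmon_by_lun.csv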
D.2.1.1 perfmon-essmap.pl
The purpose of the perfmon-essmap.pl script (Example D-1 on page 610) is to correlate Windows server disk performance data with SDD datapath query essmap data. Note that the rank column is not valid for DS8000 configurations with multi-rank extent pools. The script takes two input arguments, and no flags are required: the first parameter must be the perfmon data, and the second parameter must be the datapath query essmap data. An example of the output of perfmon-essmap.pl is shown in Figure D-3.
Several of the columns have been hidden in order to fit the example output into the space provided.
Example: D-1 The perfmon-essmap.pl script #!C:\Perl\bin\Perl.exe ############################################################### # Script: perfmon-essmap.pl # Purpose: Correlate perfmon data with datapath query essmap # and create a normalized data for input into spreadsheet # The following perfmon counters are needed: # Avg Disk sec/Read # Avg Disk sec/Write # Disk Reads/sec # Disk Writes/sec # Disk Total Response Time - Calculated to sum avg rt * avg i/o rate # Disk Read Queue # Disk Write Queue # Disk Read KB/sec - calc # Disk Write KB/sec - calc ############################################################### $file = $ARGV[0]; # Set perfmon csv output to 1st arg open(PERFMON,$file) or die "cannot open perfmon $file\n"; $datapath = $ARGV[1];# Set 'datapath query essmap' output to 2nd arg open(DATAPATH,$datapath) or die "cannot open datapath $datapath\n"; ######################################################################## # Read in essmap and create hash with hdisk as key and LUN SN as value # ######################################################################## while (<DATAPATH>) { if (/^$/) { next; }# Skip empty lines if (/^Disk /) { next; }# Skip empty lines if (/^--/) { next; }# Skip empty lines @line = split(/\s+|\t/,$_);# Build temp array of current line
610
$lun = $line[4];# Set lun ID $path = $line[1];# Set path $disk = $line[0];# Set disk# $hba = $line[2];# Set hba port - use sdd gethba.exe to get wwpn $size = $line[7];# Set size in gb $lss = $line[8];# Set ds lss $vol = $line[9];# Set DS8K volume $rank = $line[10];# Set rank - DOES NOT WORK FOR ROTATE VOLUME OR ROTATE extent $c_a = $line[11]; # Set the Cluster and adapter accessing rank $dshba = $line[13];# Set shark hba - this is unusable with perfmon which isn't aware of paths $dsport = $line[14];# Set shark port physical location - this is unusable with perfmon which isn't aware of paths $lun{$disk} = $lun;# Set the LUN in hash with disk as key for later lookup $disk{$lun} = $disk;# Set vpath in hash with lun as key $lss{$lun} = $lss;# Set lss in hash with lun as key $rank{$lun} = $rank;# Set rank in hash with lun as key $dshba{$lun} = $dshba;# Set dshba in hash with lun as key - this is unusable with perfmon which isn't aware of paths $dsport{$lun} = $dsport;# Set dsport in hash with lun as key - this is unusable with perfmon which isn't aware of paths if (length($lun) > 8) { $ds = substr($lun,0,7);# Set the DS8K serial } else { $ds = substr($lun,3,5);# Set the ESS serial } $ds{$lun} = $ds; # set ds8k in hash with LUN as key } ################ # Print Header # ################ print "DATE,TIME,Subsystem Serial,Rank,LUN,Disk,Disk Reads/sec, Avg Read RT(ms),Disk Writes/sec,Avg Write RT(ms),Avg Total Time,Avg Read Queue Length,Avg Write Queue Length,Read KB/sec,Write KB/sec\n"; ################################################################################################## # Read in perfmon and create record for each hdisk and split the first column into date and time # ################################################################################################## while (<PERFMON>) { if (/^$/) { next; } # Skip empty lines if (/^--/) { next; } # Skip empty lines if (/PDH-CSV/) { @header = split(/,/,$_); # Build header array shift(@header); # Remove the date element unshift(@header,"Date","Time"); # Add in date, time next; # Go to next line } @line = split(/\t|,/,$_); # Build temp array for current line @temp = split(/\s|\s+|\"/,$line[0]);# Split the first element into array $date = $temp[1]; # Set date to second element of array $time = $temp[2]; # Set time to third element of array shift(@line); # Remove old date unshift(@line,"\"$date\"","\"$time\"");# Add in the new date and time chomp(@line); # Remove carriage return at end for ($i=0; $i<=$#line; $i++) { # Loop through each element in line $line[$i] =~ s/"//g; # Remove double quotes from input line $header[$i] =~ s/"//g; # Remove double quotes from header array @arr = split(/\\/,$header[$i]);# Split current header array element $hostname = $arr[2]; # Extract hostname from header $disk = $arr[3]; # Set disk to the 4th element $counter = $arr[4]; # Set counter to 5th element if ($disk =~ /Physical/) { # If we find Physical Object
611
if ($disk =~ /Total/) { next; }# If disk instance is Total then skip @tmpdisk = split(/Physical|\(|\s|\)/,$disk);# Create temp array of disk name $newdisk = $tmpdisk[1] . $tmpdisk[2];# Create newly formatted disk name to match SDD output if ($counter =~ /Avg. Disk sec\/Read/) {# If counter is Avg. Disk sec/Read $diskrrt{$date}{$time}{$newdisk} = $line[$i]*1000;# Then set disk read response time hash } if ($counter =~ /Avg. Disk sec\/Write/) {# If counter is Avg. Disk sec/Write $diskwrt{$date}{$time}{$newdisk} = $line[$i]*1000;# Then set disk write response time has } if ($counter =~ /Disk Reads\/sec/) {# If counter is Disk Reads/sec $diskreads{$date}{$time}{$newdisk} = $line[$i];# Then set Disk Reads/sec hash } if ($counter =~ /Disk Writes\/sec/) {# If counter is Disk Writes/sec $diskwrites{$date}{$time}{$newdisk} = $line[$i];# Then set Disk Writes/sec hash } if ($counter =~ /Avg. Disk Read Queue Length/) {# If counter is Disk Read Queue Length $diskrql{$date}{$time}{$newdisk} = $line[$i];# Then set Disk Read Queue Length hash } if ($counter =~ /Avg. Disk Write Queue Length/) {# If counter is Disk Write Queue Length $diskwql{$date}{$time}{$newdisk} = $line[$i];# Then set Disk Write Queue Length hash } if ($counter =~ /Disk Read Bytes\/sec/) {# If counter is Disk Read Bytes/sec $diskrkbs{$date}{$time}{$newdisk} = $line[$i]/1024;# Then calc kb an set in hash } if ($counter =~ /Disk Write Bytes\/sec/) {# If counter is Disk Write Bytes/sec $diskwkbs{$date}{$time}{$newdisk} = $line[$i]/1024;# Then calc kb and set in hash } } } } ### Print out the data here - key is date, time, disk while (($date,$times) = each(%diskrrt)) {# Loop through each date-time hash while (($time,$disks) = each(%$times)) {# Nested Loop through each time-disk hash while (($disk,$value) = each(%$disks)) {# Nest loop through disk-value hash $diskrrt = $diskrrt{$date}{$time}{$disk};# Set shortnames for easier print $diskwrt = $diskwrt{$date}{$time}{$disk}; $diskreads = $diskreads{$date}{$time}{$disk}; $diskwrites = $diskwrites{$date}{$time}{$disk}; $total_time = ($diskrrt*$diskreads)+($diskwrt*$diskwrites); $diskrql = $diskrql{$date}{$time}{$disk}; $diskwql = $diskwql{$date}{$time}{$disk}; $diskrkbs = $diskrkbs{$date}{$time}{$disk}; $diskwkbs = $diskwkbs{$date}{$time}{$disk}; $lun = $lun{$disk}; # Lookup lun for current disk print "$date,$time,$ds{$lun},$rank{$lun},$lun,$disk,$diskreads,$diskrrt,$diskwrites,$diskwrt,$total_time,$diskrql ,$diskwql,$diskrkbs,$diskwkbs\n"; } } }
D.2.1.2 iostat_aix53_essmap.pl
The purpose of the iostat_aix53_essmap.pl script (Example D-2 on page 613) is to correlate AIX 5.3 iostat -D data with SDD datapath query essmap output for analysis. Beginning with AIX 5.3, the iostat command provides the ability to continuously collect read and write response times; prior to AIX 5.3, filemon was required to collect disk read and write response times. Example output of the iostat_aix53_essmap.pl script is shown in Figure D-4.
Figure D-4 Example output of iostat_aix53_essmap.pl (columns include TIME, STORAGE, DS HBA, DS PORT, LSS, RANK, LUN, VPATH, HDISK, #BUSY, KBPS, KB READ, and KB WRITE)
Several of the columns have been hidden in order to fit the example output into the space provided.
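The script expects the raw iostat -D output, optionally preceded by a time stamp line that starts with the day of the week, plus a matching datapath query essmap capture. A hypothetical collection-and-conversion sequence on the AIX host could therefore look as follows; the 60-second interval, the sample count, and the file names are placeholders (the -i and -e flags come from the script's usage text).

# Collect one hour of 60-second samples plus the disk-to-rank mapping, then convert.
( date; iostat -D 60 60 ) > iostat_aix53.out
datapath query essmap > essmap.out
perl iostat_aix53_essmap.pl -i iostat_aix53.out -e essmap.out > aix_by_lun.csv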
Example: D-2 iostat_aix53_essmap.pl #!C:\Perl\bin\Perl.exe ############################################################### # Script: iostat_aix53_essmap.pl # Purpose: Process AIX 5.3 iostat disk and essmap # normalize data for input into spreadsheet ############################################################### use Getopt::Std; ####################################################################################### main(); # Start main logic sub main{ parseparms(); # Get input parameters readessmap($essmap); # Invoke routine to read datapath query essmap output readiostat($iostat); # Invoke routine to read iostat } ############## Parse input parameters ################################################## sub parseparms { $rc = getopts('i:e:'); # Define inputs $iostat = $opt_i; # Set value for iostat $essmap = $opt_e; # Set value for dp query essmap (defined $iostat) or usage(); # If iostat is not set exit (defined $essmap) or usage(); # If essmap is not set exit } ############# Usage ################################################################### sub usage { print "\nUSAGE: iostat_aix53_essmap.pl [-ie] -h"; print "\n -i The file containing the iostat output"; print "\n -e The file containing the datapath query essmap output"; print "\n ALL ARGUMENTS ARE REQUIRED!\n"; exit ; } ### Read in pcmpath and create hash with hdisk as key and LUN SN as value ############# $file = $ARGV[0]; # Set iostat -D for 1st arg $essmap = $ARGV[1]; # Set 'datapath query essmap' output to 2nd arg Appendix D. Post-processing scripts
### Read in essmap and create hash with hdisk as key and LUN SN as value sub readessmap($essmap) { open(ESSMAP,$essmap) or die "cannot open $essmap\n"; while (<ESSMAP>) { if (/^$/) { next; } # Skip empty lines if (/^--/) { next; } # Skip empty lines if (/^Disk/) { next; } # Skip header @line = split(/\s+|\t/,$_); # Build temp array $lun = $line[4]; # Set lun $hdisk = $line[1]; # set hdisk $vpath = $line[0]; # set vpath $hba = $line[3]; # set hba $lss = $line[8]; # set lss $rank = $line[10]; # set rank $dshba = $line[13]; # set shark hba $dsport = $line[14]; # set shark port $vpath{$lun} = $vpath; # Set vpath in hash $lss{$lun} = $lss; # Set lss in hash $rank{$lun} = $rank; # Set rank in hash $dshba{$hdisk} = $dshba; # Set dshba in hash $dsport{$hdisk} = $dsport; # Set dsport in hash $lun{$hdisk} = $lun; # Hash with hdisk as key and lun as value if (length($lun) > 8) { $ds = substr($lun,0,7); # Set ds serial to first 7 chars } else { $ds = substr($lun,3,5); # or this is ESS and only 5 chars } $ds{$lun} = $ds; # set the ds serial in a hash } } ### Read in iostat and create record for each hdisk sub readiostat($iostat) { ### Print Header print "TIME,STORAGE SN,DS HBA,DS PORT,LSS,RANK,LUN,VPATH,HDISK,#BUSY,KBPS,KB READ PS,KB WRITE PS,TPS,RPS,READ_AVG_SVC,READ_MIN_SVC,READ_MAX_SVC,READ_TO,WPS,WRITE_AVG_SVC,WRITE_MIN_SVC,WRITE_MAX_SVC,WRI TE_TO,AVG QUE,MIN QUE, MAX QUE,QUE SIZE\n"; $time = 0; # Set time variable to 0 $cnt = 0; # Set count to zero open(IOSTAT,$iostat) or die "cannot open $iostat\n";# Open iostat file while (<IOSTAT>) { # Read in iostat file if ($time == 0) { if (/^Mon|^Tue|^Wed|^Thu|^Fri|^Sat|^Sun/) {# This only works if a time stamp was in file $date_found = 1; # Set flag for @line = split(/\s+|\s|\t|,|\//,$_);# build temp array $date = $line[1] . " " . $line[2] . " " . $line[5];# Create date $time = $line[3]; # Set time $newtime = $time; # Set newtime $interval = 60; # Set interval to 60 seconds next; } else { $time++; # Set time counter to 1 if no time is in file } } if (/^#|System/) { next; } # Skip notes if (/^hdisk|^dac/) { # If line starts with hdisk @line = split(/\s+|\t|,|\//,$_);# build temp array $pv = $line[0]; # Set physical disk to 1st element $xfer = 1; # we are in the transfer stanza now
next; } if (/-------/) { $cnt++; # count the number of stanzas if ($date_found == 1) { # If date flag is set $newtime = gettime($time,$interval,$cnt);# get time based on original time,cnt, and interval } else { $time++; $newtime = 'Time_' . $time;# Set a relative time stamp } next; # Go to next line } if ($xfer == 1) { # If in transfer section @line = split(/\s+|\t|,|\//,$_);# build temp array $busy{$pv} = $line[1]; # Set busy time if ($line[2] =~ /[K]/) { # If K is in value $line[2] =~ s/K//g; # remove K $kbps{$pv} = $line[2]; # Set value } elsif ($line[2] =~ /[M]/) {# If M $line[2] =~ s/M//g; # remove M $kbps{$pv} = $line[2]*1024;# Multiply value by 1024 } else { $kbps{$pv} = $line[2]/1024;# Else its bytes so div by 1024 } $tps{$pv} = $line[3]; # Set transfer per sec if ($line[4] =~ /[K]/) { # If K then remove K $line[4] =~ s/K//g; $bread{$pv} = $line[4]; } elsif ($line[4] =~ /[M]/) {# If Mbytes convert to K $line[4] =~ s/M//g; $bread{$pv} = $line[4]*1024; } else { $bread{$pv} = $line[4]/1024;# Else bytes and convert ot K } if ($line[5] =~ /[K]/) { # Same logic as above $line[5] =~ s/K//g; $bwrtn{$pv} = $line[5]; } elsif ($line[5] =~ /[M]/) { $line[5] =~ s/M//g; $bwrtn{$pv} = $line[5]*1024; } else { $bwrtn{$pv} = $line[5]/1024; } $xfer = 0; next; } if (/read:/) { $read = 1; next; } if ($read == 1) { # If read flag is set @line = split(/\s+|\t|,|\//,$_);# build temp array $rps{$pv} = $line[1]; # set reads per second $rrt{$pv} = $line[2]; # set read rt $rminrt{$pv} = $line[3]; # set read min rt if ($line[4] =~ /[S]/) { $line[4] =~ s/S//g; $rmaxrt{$pv} = $line[4]*1000;# set read max rt convert from seconds if necessary } else { $rmaxrt{$pv} = $line[4]; } $rto{$pv} = $line[5]; $read = 0;
next; } if (/write:/) { $write = 1; next; }# If write flag is set if ($write == 1) { @line = split(/\s+|\t|,|\//,$_);# build temp array $wps{$pv} = $line[1]; # set writes per sec $wrt{$pv} = $line[2]; # set write rt $wminrt{$pv} = $line[3]; # set min write rt if ($line[4] =~ /[S]/) { $line[4] =~ s/S//g; $wmaxrt{$pv} = $line[4]*1000;# set max rt and convert from secs if necessary } else { $wmaxrt{$pv} = $line[4]; } $wto{$pv} = $line[5]; $write = 0; next; } if (/queue:/) { $queue = 1; next; }# If queue flag is set if ($queue == 1) { @line = split(/\s+|\t|,|\//,$_);# build temp array $qt{$pv} = $line[1]; # set queue time $qmint{$pv} = $line[2]; # set queue min time $qmaxt{$pv} = $line[3]; # Set queue max time $qsize{$pv} = $line[4]; $queue = 0; $time{$pv} = $newtime; $lun = $lun{$pv}; print "$time{$pv},$ds{$lun},$dshba{$pv},$dsport{$pv},$lss{$lun},$rank{$lun},$lun,$vpath{$lun},$pv,$busy{$pv},$kbp s{$pv},$bread{$pv},$bwrtn{$pv},$tps{$pv},$rps{$pv},$rrt{$pv},$rminrt{$pv},$rmaxrt{$pv},$rto{$pv},$wps{$pv}, $wrt{$pv},$wminrt{$pv},$wmaxrt{$pv},$wto{$pv},$qt{$pv},$qmint{$pv},$qmaxt{$pv},$qsize{$pv}\n"; next; } if (/^--/) { $time++; } } } ############# Convert Time #################################################################### sub gettime() { my $time = $_[0]; my $interval = $_[1]; my $cnt = $_[2]; my $hr = substr($time,0,2); my $min = substr($time,3,2); my $sec = substr($time,6,2); $hrsecs =$hr * 3600; $minsecs = $min * 60; my $addsecs = $interval * $cnt; my $totsecs = $hrsecs + $minsecs + $sec + $addsecs; $newhr = int($totsecs/3600); $newsecs = $totsecs%3600; $newmin = int($newsecs/60); $justsecs = $newsecs%60; $newtime = $newhr . ":" . $newmin . ":" . $justsecs; return $newtime; }
D.2.1.3 iostat_sun-mpio.pl
The purpose of the iostat_sun-mpio.pl script (Example D-3 on page 617) is to reformat Solaris iostat -xn data so that it can be analyzed in a spreadsheet. The LUN identification works properly only on Solaris systems running MPxIO; in the iostat -xn output with MPxIO, there is only one disk shown per LUN. The iostat_sun-mpio.pl script takes only one argument, the iostat file, and no flags are required. An example of the output is shown in Figure D-5.
Figure D-5 Example output of iostat_sun-mpio.pl (columns: DATE, TIME, LUN, HDISK, #READS, AVG_READ_KB, WRITES, AVG_WRITE_KB, AVG_WAIT, and AVG_SVC)
Example: D-3 The iostat_sun-mpio.pl script #!C:\Perl\bin\Perl.exe ############################################################### # Script: iostat_sun-mpio.pl # Purpose: Correlate disk# with Sun iostat and extract LUN # normalize data for input into spreadsheet ############################################################### use Getopt::Std; $file = $ARGV[0]; # Set iostat -xcn to 1st arg open(IOSTAT,$file) or die "cannot open $file\n";# Open file ### Print Header print "DATE,TIME,LUN,HDISK,#READS,AVG_READ_KB,WRITES,AVG_WRITE_KB,AVG_WAIT,AVG_SVC\n"; ### Read in iostat and create record for each hdisk $cnt=0; $time=0; while (<IOSTAT>) { if ($time == 0) { if (/^Mon|^Tue|^Wed|^Thu|^Fri|^Sat|^Sun/) { @line = split(/\s+|\s|\t|,|\//,$_);# build temp array $date = $line[1] . " " . $line[2] . " " . $line[5]; $time = $line[3]; $interval = 60; next; } else { $date = 'Date Not Available'; $time++; # Set time counter to 1 if no time is in file } } if (/tty|tin|md|r\/s|^$|#/) { next;} if (/extended/) { $cnt++; # count the number of stanzas if ($date_found == 1) { # If date flat is set $newtime = gettime($time,$interval,$cnt);# get time based on original time,cnt, and interval } else { $newtime = 'Time' . $time;# Set a relative time stamp $time++; }
next; } @line = split(/\s+|\t|,|\//,$_);# Build temp array for each line $pv = $line[11]; # Set pv to 11th element $lun = substr($pv,31,4); # Set lun to substring of pv $date{$pv} = $date; # Set date to date hash $time{$pv} = $newtime; # Set time to time hash $reads{$pv} = $line[1]; # Set read hash $writes{$pv} = $line[2]; # Set write hash $readkbps{$pv} = $line[3]; # Set read kbps $writekbps{$pv} = $line[4]; # Set write kbps $avg_wait{$pv} = $line[7]; # Set avg wait time $avg_svc{$pv} = $line[8]; # Set avg response time print "$date{$pv},$time{$pv}, $lun,$pv, $reads{$pv},$readkbps{$pv},$writes{$pv},$writekbps{$pv},$avg_wait{$pv},$avg_svc{$pv}\n"; }
sub gettime() { my $time = $_[0]; my $interval = $_[1]; my $cnt = $_[2]; my $hr = substr($time,0,2); my $min = substr($time,3,2); my $sec = substr($time,6,2); $hrsecs =$hr * 3600; $minsecs = $min * 60; my $addsecs = $interval * $cnt; my $totsecs = $hrsecs + $minsecs + $sec + $addsecs; $newhr = int($totsecs/3600); $newsecs = $totsecs%3600; $newmin = int($newsecs/60); $justsecs = $newsecs%60; $newtime = $newhr . ":" . $newmin . ":" . $justsecs; return $newtime; }
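A matching collection-and-conversion sequence on the Solaris host might look like the following; it assumes MPxIO device names in the iostat output, and the interval, count, and file names are placeholders.

# Collect one hour of 60-second extended device statistics, then convert.
( date; iostat -xn 60 60 ) > iostat_sun.out
perl iostat_sun-mpio.pl iostat_sun.out > sun_by_lun.csv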
D.2.1.4 tpc_volume-lsfbvol-showrank.pl
The tpc_volume-lsfbvol-showrank.pl script correlates TotalStorage Productivity Center volume batch reports with rank data obtained from the DSCLI. Although TotalStorage Productivity Center provides many excellent drill-down and correlation features, you might want to have the data in a spreadsheet for further analysis and graphing. Unfortunately, when you export volume performance data using batch reports (refer to 8.5.4, Batch reports on page 232), the relationships between the volumes and the array sites are lost. The purpose of this script is to re-establish the relationship between the array site, the DS8000 extent pool, and the volume. The tpc_volume-lsfbvol-showrank.pl script (Example D-5 on page 619) requires data obtained from the DSCLI to establish these relationships. The script also splits the time stamp into two fields, date and time, and converts the time field into a 24-hour time stamp.
$ c:/Performance/bin/tpc_volume-lsfbvol-showrank.pl

USAGE: tpc_volume-lsfbvol-showrank.pl [-fart] -h
  -f The file containing the lsfbvol output
  -a The file containing the lsarray output
  -r The file containing the showrank output
  -t The file containing the tpc volume performance output
  ALL ARGUMENTS ARE REQUIRED!

Figure D-6 shows example output of tpc_volume-lsfbvol-showrank.pl. Several of the columns have been hidden in order to fit the example output into the space provided.
Figure D-6 Example output of tpc_volume-lsfbvol-showrank.pl (columns include Date, Time, Subsystem, Volume, Interval, Read I/O Rate (normal, sequential, overall), Write I/O Rate (normal, sequential, overall), VG, CAPACITY, and EXTENTPOOL)
Note: The delimiter is the pipe (|) character. This means that when you open the data in Excel, you need to set the delimiter to |. Do not redirect the output to a file with a .csv suffix, or Excel will open it assuming that the delimiter is a comma, and the output will look strange.
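Given the four input files, a complete invocation that honors the note above (redirecting to a .txt file rather than a .csv file) might look like the following; all of the file names are placeholders for your own DSCLI captures and TotalStorage Productivity Center batch report export.

# Correlate the TPC volume report with the DSCLI configuration data.
perl tpc_volume-lsfbvol-showrank.pl -f lsfbvol.out -a lsarray.out -r showrank.out \
    -t tpc_volumes.csv > tpc_volumes_by_rank.txt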
Example: D-5 The tpc_volume-lsfbvol-showrank.pl #!C:\Perl\bin\Perl.exe ############################################################### # Script: tpc_volume-lsfbvol-showrank.pl # Purpose: correlate tpc volume data to extent pool rank(s) and put in csv output # Requirements: Must contain lsfbvol output from DS8K-Config-Gatherer-v1.2.cmd ############################################################### use Getopt::Std; ####################################################################################### main(); # Start main logic sub main{ parseparms(); # Get input parameters readlsfbvol($lsfbvol); # Invoke routine to read lsfbvol readshowrank($showrank); # Invoke routine to read showrank output readlsarray($lsarray); # Invoke routine to read lsarray output readtpcvol($tpcvol); # Invoke routine to read tpc volume } ####################################################################################### sub parseparms { $rc = getopts('f:a:r:t:'); # Define inputs $lsfbvol = $opt_f; # Set value for lsfbvol $lsarray = $opt_a; # Set value for lsarray $showrank = $opt_r; # Set value for showrank $tpcvol = $opt_t; # Set value for tpcvol (defined $lsfbvol) or usage(); # If lsfbvol is not set exit
(defined $lsarray) or usage(); # If lsarray is not set exit (defined $showrank) or usage(); # If showrank is not set exit (defined $tpcvol) or usage(); # If tpcvol is not set exit } ####################################################################################### sub usage { print "\nUSAGE: tpc_volume-lsfbvol-showrank.pl [-fart] -h"; print "\n -f The file containing the lsfbvol output"; print "\n -a The file containing the lsarray output"; print "\n -r The file containing the showrank output"; print "\n -t The file containing the tpc volume performance output"; print "\n ALL ARGUMENTS ARE REQUIRED!\n"; exit ; } ############# BEGIN PROCESS LSFBVOL OUTPUT ###################################### sub readlsfbvol($lsfbvol) { # Define subroutine open(LSFBVOL,$lsfbvol) or die "cannot open $lsfbvol\n";# Open lsfbvol file while (<LSFBVOL>) { # Loop through every line in lsfbvol if (/^$|^Date|^Name|^==/) { next; } # Skip empty and header lines @line = split(/:/,$_); # Build temp array of each line $vol_id = $line[1]; # Set volume ID to the 2nd element $ep = $line[7]; # Set the exent pool to the 8th element $cap = $line[10]; # Set capacity in gb to 11th element $vg = $line[13]; # Set volume group to 14th element chomp($vg); # Remove carriage returnh $vg{$vol_id} = $vg; # Create vg hash with vol as key $cap{$vol_id} = $cap; # Create capacity hash with vol as key $ep{$vol_id} = $ep; # Create extent pool has with vol as key } } ############# END PROCESS LSFBVOL OUTPUT ###################################### ############# BEGIN PROCESS SHOWRANK OUTPUT ###################################### sub readshowrank($showrank) { # Define subroutine open(SHOWRANK,$showrank) or die "cannot open $showrank\n";# Open showrank file while (<SHOWRANK>) { # Iterate through each line if (/^$/) { next; } # skip empty lines chomp($_); # Remove carriage return $_ =~ s/\"//g; # Remove quotations @temp = split(/\t|\s+|\s/, $_);# build array of data for each line if (/^ID/) { # If line begins with has $rank = $temp[1]; # Set rank to the 2nd element } if (/^Array/) { # If line begins with array $array = $temp[1]; # Set array to 2nd element } if (/^extpoolID/) { # If line begins with extpoolID $ep = $temp[1]; # Set ep to 2nd element } if (/^volumes/) { # If line begins with volumes @vol_list = split(/,/,$temp[1]);# Split list of vols into array foreach $vol (@vol_list) { # loop through each volume if ($array_vols{$vol}) { # If its already existing if ($array_vols{$vol} =~ $array) {# check to see if we have it stored move on next; } else { $array_vols{$vol} = $array_vols{$vol}.','.$array;# else add it to existing list } } else { $array_vols{$vol} = $array;
############# BEGIN PROCESS LSARRAY OUTPUT ######################################
sub readlsarray($lsarray) {                              # Define subroutine
   open(LSARRAY,$lsarray) or die "cannot open $lsarray\n"; # Open lsarray file
   while (<LSARRAY>) {                                   # Iterate through each line
      if (/^$/) { next; }                                # Skip empty lines
      chomp($_);                                         # Remove carriage return
      $_ =~ s/\"//g;                                     # Remove quotation marks
      @temp = split(/,/, $_);                            # Build array of data for each line
      local $array = $temp[0];                           # Set array to the first element
      $array_site{$array} = $temp[4];                    # Build hash with array as key to array site value
   }
}
############# END PROCESS LSARRAY OUTPUT ######################################
############# PROCESS TPC VOLUME DATA #########################################
sub readtpcvol($tpcvol) {                                # Define subroutine
   open(TPCVOL,$tpcvol) or die "cannot open $tpcvol\n";  # Open TPC volume file
   while (<TPCVOL>) {
      if (/^$/) { next; }                                # Skip empty lines
      if (/^-/) { next; }                                # Skip separator lines
      if (/^Subsystem/) {                                # Header line
         @line = split(/,/,$_);                          # Build temp array
         $subsystem = $line[0];                          # Capture subsystem
         $volume = $line[1];                             # Capture volume
         @temp = split(/\s|\s+/,$line[2]);               # Split the first element into Date and Time
         $time = $temp[0];                               # Capture time
         shift(@line); shift(@line); shift(@line);       # Drop the first 3 elements
         unshift(@line,'Date',$time,$subsystem,$volume); # Add back the elements in the desired order
         $cnt=0;                                         # Start counter
         for ($i=0;$i<=$#line;$i++) {                    # Loop through the elements in the header line
            $header[$cnt] = $line[$i];                   # Build header array
            if ($i == $#line) {                          # If end of line, remove carriage return
               chomp($header[$cnt]);
            }
            print "$header[$cnt]|";                      # Print out header with pipe delimiter
            $cnt++;                                      # Increment count
         }
         print "VG|CAPACITY|EXTENTPOOL|TPC-ARRAY\n";     # Print out the extra new headers
         next;                                           # Next line please
      }
      @line = split(/,|\(|\)|\:|\s/,$_);                 # Build temp array
      $subsystem = $line[0];                             # Set subsystem to the first element
      $volume = $line[4];                                # Set volume to the 5th element
      $volume =~ tr/a-z/A-Z/;                            # Make volume uppercase
      $date = $line[6];                                  # Set date to the 7th element
      $hr = twelveto24($line[7],$line[9]);               # Convert time from am-pm to 24-hour
      $time = $hr . ":" . $line[8];                      # Put hr and min together
      shift(@line); shift(@line); shift(@line); shift(@line); shift(@line);
      shift(@line); shift(@line); shift(@line); shift(@line); shift(@line); # Drop the first 10 elements
      unshift(@line,$date,$time,$subsystem,$volume);     # After removal of elements, re-add for proper order
      $serial = substr($line[2],12,7);                   # Set DS8K serial number
      $ep = $ep{$volume};                                # Look up the extent pool for this volume
      $tpc_array = array2arraysite($array_vols{$volume}); # Get array
      chomp($_);                                         # Remove carriage return
      foreach $ele (@line) {                             # Iterate through elements
         print "$ele|";                                  # Print each element
      }
      print "$vg{$volume}|$cap{$volume}|$ep{$volume}|$tpc_array\n"; # Print the extra elements
   }
}
############# ITERATE THROUGH ARRAY LIST AND CONVERT TO ARRAY SITE ################
sub array2arraysite{
   my $list = $_[0];
   my @array_list = split(/,/,$list);
   my $arr;
   my $ars_list;
   my $cnt = 0;
   foreach $arr (@array_list) {
      $ars = $array_site{$arr};                          # Look up array site
      $tpc_array = '2107' . "." . $serial . "-" . substr($ars,1); # Convert array site to TPC array
      if ($cnt == 0) {
         $ars_list = $tpc_array;
      } else {
         $ars_list = $ars_list . "," . $tpc_array;
      }
      $cnt++;
   }
   return $ars_list;
}
############# END PROCESS TPC VOLUME DATA #########################################
sub twelveto24{
   ($hr, $per) = @_;
   $per =~ tr/A-Z/a-z/;
   if ($hr == 12 && $per eq 'am') {
      return 0;
   } elsif ($hr != 12 && $per eq 'pm') {
      return $hr+12;
   } else {
      return $hr;
   }
}
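The script reads the four input files named on the command line and writes its pipe-delimited report to standard output, so a typical invocation redirects the output to a file. The following call is only an illustrative sketch; the input and output file names are placeholders for whatever names your configuration gatherer and TPC export produced:

   C:\> perl tpc_volume-lsfbvol-showrank.pl -f lsfbvol.out -a lsarray.out -r showrank.out -t tpc_volume_perf.csv > volume_rank_report.txt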
Appendix E. Benchmarking
Benchmarking storage systems has become extremely complex over the years, given the many hardware and software components that make up a storage environment. In this appendix, we discuss the goals of a storage benchmark and the ways to conduct one effectively.
time frame. Furthermore, you need to clearly identify the objective of the benchmark with all the participants and precisely define the success criteria of the results.
Important: Define a minimum time duration for each workload test so that side effects from the warm-up period, such as cache population, do not skew the results.
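As a simple illustration of this point (not part of the scripts in this book), the following Perl sketch averages per-interval throughput samples only after an assumed five-minute warm-up has elapsed. The input format ("elapsed_seconds,MBps" per line) and the warm-up length are assumptions that you must adapt to your own benchmark data:

   #!/usr/bin/perl
   # Average throughput samples after discarding an assumed warm-up period.
   # Assumed input format: one line per interval, "elapsed_seconds,MBps".
   use strict;
   use warnings;

   my $warmup_seconds = 300;                    # Assumed 5-minute warm-up to discard
   my ($sum, $count) = (0, 0);

   while (my $line = <>) {                      # Read samples from the file named on the command line
      chomp $line;
      my ($elapsed, $mbps) = split /,/, $line;
      next unless defined $mbps;                # Skip malformed lines
      next unless $elapsed =~ /^\d+(\.\d+)?$/;  # Skip header or non-numeric lines
      next if $elapsed < $warmup_seconds;       # Ignore samples taken during warm-up
      $sum   += $mbps;
      $count += 1;
   }

   printf "Steady-state average: %.1f MBps over %d samples\n",
          $count ? $sum / $count : 0, $count;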
test different ways to get an overall performance improvement by tuning each of the different components.
Related publications
The publications listed in this section are considered particularly suitable for a more detailed discussion of the topics covered in this book.
IBM Redbooks

IBM ESS and IBM DB2 UDB Working Together, SG24-6262
AIX 5L Performance Tools Handbook, SG24-6039
AIX 5L Differences Guide Version 5.3 Edition, SG24-7463
Other publications
These publications are also relevant as further information sources:

IBM TotalStorage DS8000 Command-Line Interface User's Guide, SC26-7625
IBM System Storage DS8000 Host Systems Attachment Guide, SC26-7917
IBM TotalStorage DS8000: Introduction and Planning Guide, GC35-0495
IBM TotalStorage DS8000: User's Guide, SC26-7623
IBM TotalStorage DS Open Application Programming Interface Reference, GC35-0493
IBM TotalStorage DS8000 Messages Reference, GC26-7659
z/OS DFSMS Advanced Copy Services, SC35-0428
Device Support Facilities: User's Guide and Reference, GC35-0033
DB2 for OS/390 V5 Administration Guide, SC26-8957
IBM System Storage Multipath Subsystem Device Driver User's Guide, GC52-1309
IBM TotalStorage XRC Performance Monitor Installation and User's Guide, GC26-7479
Online resources
These Web sites and URLs are also relevant as further information sources:

IBM Disk Storage Feature Activation (DSFA)
http://www.ibm.com/storage/dsfa

Documentation for the DS8000
http://www.ibm.com/systems/storage/disk/ds8000/index.html

IBM System Storage Interoperation Center (SSIC)
http://www.ibm.com/systems/support/storage/config/ssic/index.jsp
Index
A
AAL 6, 20 addpaths command 368 address groups 53 Advanced Copy Services 422, 426, 438 affinity 47 agile view 362 AIO 314, 328329 aioo 315, 329 AIX filemon command 339341 lvmstat command 334 nmon command 336 SDD commands 367, 371 secondary system paging 370 topas command 335 aligned partition 396, 398399 allocation 47, 4950 anticipatory 413 anticipatory I/O Scheduler 412 ANTXIN00 member 545 application workload 32 arbitrated loop 267 array site 20, 45 array sites 4445, 54 arrays 42, 4445, 47 arrays across loops see AAL Assymetric Logical Unit Access (ALUA) 362 cache management 13 performance of SARC 14 capacity 526, 530 Capacity Magic 27 cfgvpath command 373 cfvgpath 373 characterize performance 380 choosing CKD volume size 450 disk size with DB2 UDB 500 CKD volumes 4950, 54 allocation and deletion 50 commands addpaths 368 cfgvpath 373 cron 365 datapath 331, 366 dd 355, 375376 filemon 339341 lquerypr 370 lsvpcfg 368369, 372 lvmstat 334 nmon 336 sar 363, 365, 420 SDD commands in AIX 367 SDD commands in HP-UX 371 SDD commands in Sun 373 SDD datapath 274 showvpath 372, 374 topas 335 vpathmkdev 374 compatability mode 394 Complete Fair Queuing 412413, 415 Completely Fair Queuing I/O Scheduler 412 concurrent write operation 448 CONN time 464, 466467 connectivity Global Copy 528 considerations DB2 performance 489 DB2 UDB performance 492, 497 logical disks in a SAN 269 z/OS planning 459 zones 268 Consistency Group drain time 537 constraints definition 224 containers 492, 494496, 498 control data sets 545 Copy Pending 527 Copy Services 507 FlashCopy 508510 FlashCopy objectives 510 introduction 508
B
balanced DB2 workload 490 IMS workload 505 bandwidth 519520, 522 benchmark cautions 627 requirements 624 benchmarking 623624, 626 goals 624 bio structure 405 block layer 405 block size 416417 buffer pools 495 BusLogic 399
C
cache 1214, 208, 211405, 417 as intermediate repository 12 fast writes 12 cache and I/O operations read operations 12 write operations 12 cache information 208
Metro Mirror 518 Metro Mirror configuration considerations 519 cron command 365
D
DA 1719 data mining 34 data rate droop effect 454 data warehousing 34 databases 485488, 493 buffer pools 495 containers 492, 494, 496 DB2 in a z/OS environment 486 DB2 overview 487 DB2 prefetch 490491, 495 extent size 495496, 500 extents 495 IMS in a z/OS environment 502 logs 489, 493, 496, 503 MIDAWs 491 multi-pathing 501502 page cleaners 496 page size 495496, 499 pages 495497 parallel operations 496 partitiongroups 494 partitioning map 494 partitions 493 prefetch size 500 selecting DB2 logical sizes 499 tables, indexes and LOBs 494 tablespace 494, 496 workload 31, 486 dataset placement 545 datastore 385 DB2 485487 in a z/OS environment 486 logging 3132 overview 487 prefetch 490491, 495 query 32 recovery data sets 488 selecting logical sizes 499 storage objects 487 transaction environment 32 DB2 UDB 492494 container 492, 494495, 498 instance 493 performance considerations 492, 497 striping 499 tablespace 487, 492, 494 dd command 355, 375376 DDMs 16, 1819 deadline 412413 deadline I/O scheduler 412 Dense Wavelength Division Multiplexing (DWDM) 528 description and characteristics of a SAN 267 device adapter see DA device busy 446 Device Mapper - Mutlipath I/O (DM-MPIO) 407
DFW Bypass 491 DFWBP 491 digital video editing 35 direct connect 267268 Direct I/O 309, 312, 314, 344 dirty buffer 405 flushing 405 DISC time 466467, 470 Discovered Direct I/O 344 disk bottlenecks iostat command 418419 vmstat command 418 disk enclosure 19, 21 disk I/O subsystem block layer 405 cache ??403403405, 417 I/O subsystem architecture 403 Disk Magic 161163 modeling 163 Disk Magic for open systems 180 Disk Magic for zSeries 165 hardware configuration 179, 195 disk subsystem 18 fragmentation 285 disk xfer 327328 diskpart 399 disks configuring for SAN 269 size with DB2 UDB 500 distance 519520 distributions Linux 401402 DONOTBLOCK 548549 drain period 537 drain time 537538 ds_iostat 588, 594 DS8000 AAL cache management 13 Copy Services 507 DA DDMs 16, 1819 disk enclosure 19, 21 FICON 56 FICON port concurrency 469 hardware overview 3 host adapter 5 host attachment 266 I/O priority queuing 7 Multiple Allegiance 7 PAV POWER5 5 priority queuing 449 SDD z/OS Global Mirror DS8000 nI/O ports 527 DS8100 21 processor memory 12 DS8300 10, 17, 2122 disk subsystem 18
LPAR 17, 23, 27 DWDM 276 dynamic alias management 442, 452 dynamic buffer cache 357 Dynamic Path Reconnect (DPR) 26 Dynamic Path Selection (DPS) 26 dynamic PAV 442444
E
EAV 49 elevator model 405 engineering and scientific applications 35 Enterprise Volume Management System (EVMS) 411 ESCON 26 ess_iostat script 594 ESX Server multipathing 387 ESX server 384385, 387 esxcfg-mpath 387 esxtop 390391 EVMS (Enterprise Volume Management System) 410 examples datapath query output 331, 366 ESCON connection 275 FICON connection 278 filemon command outputs 340 larger vs. smaller volumes - random workload 451 rank device spreading 318, 359 sar output 363 zoning in a SAN environment 269 Ext2 414 Ext3 414415 Extended Address Volume (EAV) 49 Extended Distance FICON (EDF) 455 eXtended File System 414 Extended Remote Copy see z/OS Global Mirror extent pool 47, 513, 517518 extent pools 4749 extent rotation 51, 55 extent size 495496, 500 extent type 4647 extents 495
FICON 56, 22, 2426, 276, 442, 451452, 454 host adapters 22, 2425, 27 host attachment 278 FICON Express2 454455, 468 FICON Express4 266, 277278, 454455, 457458 FICON Open Exchange 469 filemon 330, 339340 filemon command 339341 filemon measurements 339 fixed block LUNs 48 Fixed policy 387 FlashCopy 508510 performance 510511 FlashCopy objectives 510 fragmentation, disk 285 fsbuf 327
G
Global Copy 508509, 524, 526 capacity 530 connectivity 528 distance considerations 528 DS8000 I/O ports 527 performance 526, 529531, 533535 planning considerations 529 scalability 530 Global Mirror add or remove storage servers or LSSs 543 avoid unbalanced configurations 538 Consistency Group drain time 537 growth within configurations 541 performance 532533 performance aspects 533 performance considerations at coordination time 536 remote storage server configuration 538 volumes 536, 538539, 542 guidelines z/OS planning 459
H
hardware configuration planning allocating hardware components to workloads 3 array sites, RAID, and spares 20 cache as intermediate 12 Capacity Magic 27 disk subsystem 18 DS8000 overview 3 DS8300 10, 17, 2122 DS8300 LPAR 17, 23, 27 Fibre Channel and FICON host adapters 24 host adapters 11, 1617, 24 I/O enclosures 1718 modeling your workload 3 multiple paths to open systems servers 25 multiple paths to zSeries servers 26 order of installation 21 performance numbers 2 processor complex 10, 12, 17 processor memory 12, 16
F
fast write 1213 fast writes 12 FATA 517 FC adapter 273 FC-AL overcoming shortcomings 18 switched 5 FCP supported servers 267 fcstat 336, 339 Fibre Channel 267 distances 2425 host adapters 22, 2425, 27 topologies 267268 Fibre Channel and FICON host adapters 24
RIO-G interconnect and I/O enclosures 17 RIO-G loop 17 spreading host attachments 26 storage server challenge 2 switched disk architecture 18 tools to aid in hardware planning 27 whitepapers 27 HCD (hardware configuration definition) 26 High performance FICON (zHPF) 455 High Performance FileSystem (HFS) 356 host adapter (HA) 5 host adapters 11, 1617, 24 Fibre Channel 22, 2425, 27 FICON 22, 2425, 27 host attachment 5455 description and characteristics of a SAN 267 FC adapter 273 Fibre Channel 267 Fibre Channel topologies 267 FICON 276 SAN implementations 267 SDD load balancing 273 single path mode 273 supported Fibre Channel 267 HP MirrorDisk/UX 358 HyperPAV 442443, 445
J
Journal File System 414415 Journaled File System (JFS) 357 journaling mode 414416 Journals 545
L
large volume support planning volume size 453 legacy AIO 314, 328329 links Fibre Channel 520 Linux 401403 load balancing SDD 273 locality of reference 404 log buffers 503 logging 502503 logical configuration disks in a SAN 269 logical control unit (LCU) 53 logical device choosing the size 450 configuring in a SAN 269 planning volume size 453 logical paths 520 Logical Session 549 logical subsystem see LSS logical volumes 4548 logs 489, 493, 496, 503 lquerypr 367368, 370371 lquerypr command 370 LSI Logic 399 LSS 5354, 519520, 543 add or remove in Global Mirror 543 LSS design 522 lsvpcfg command 368369, 372 LUN masking 268, 271 LUN queue depth 392393 LUNs allocation and deletion 50 fixed block 48 LVM 2 410411 lvmap 588589 lvmap script 589 lvmstat 330, 334335 lvmstat command 334
I
I/O priority queuing 449 I/O elevator 404405, 415 anticipatory 413 Complete Fair Queuing 412413, 415 deadline 412413 NOOP 413 I/O enclosure 1718 I/O latency 12 I/O priority queuing 7 I/O scheduler 405, 417 I/O subsystem architecture 403 IBM TotalStorage Multipath Subsystem Device Driver see SDD IBM TotalStorage XRC Performance Monitor 552 implement the logical configuration 143 IMS 485, 502503 logging 502503 performance considerations 502, 504 WADS 503504 IMS in a z/OS environment 502 index 487488, 494 indexspace 488 installation planning host attachment 266 instance 493 IOCDS 26 Iometer 305 ionice 415 IOP/SAP 464, 467468 IOSQ time 445, 466 iostat command 418419
M
Managed Disk Group 427429 Managed Disks Group 423 maximum drain time 537 metric 207, 226, 228 Metro Mirror 518 adding capacity in new DS6000s 526 addition of capacity 526 bandwidth 519520, 522
distance 519520 Fibre Channel links 520 logical paths 520 LSS design 522 performance 520, 523524 scalability 526 symmetrical configuration 522523 volumes 518, 520, 523 Metro Mirror configuration considerations 519 Metro/Global Mirror performance 541, 553 Microsoft Cluster Services (MSCS) 385 MIDAW 491 MIDAWs 491 modeling your workload 3 monitor host workload 38 open systems servers 38 monitoring DS8000 workload 38 monitoring tools Disk Magic 161163 Iometer 305 Performance console 294296 Task Manager 301 mount command 415416 MPIO 254 MRU policy 387 multi-pathing 501502 multipathing DB2 UDB environment 502 SDD 288, 290, 292 Multiple Allegiance 7, 442, 446, 453 Multiple Allegiance (MA) 446 multiple reader 549550 multi-rank 518, 541
tuning I/O buffers 313 verify the storage subsystem 377 vpathmkdev 367, 373374 order of installation 21 DS8100 21 over provisioning 50 overhead 512, 516, 518
P
page cache 403 page cleaners 496 page size 495496, 499 PAGEFIX 546 pages 495497 Parallel 496 Parallel Access Volume (PAV) 453 Parallel Access Volumes (PAV) 53 Parallel Access Volumes see PAV parallel operations 496 partitiongroups 494 partitioning map 494 path failover 274 PAV 7, 442445, 465 PAV and Multiple Allegiance 446 pbuf 327 pdflush 403, 405 PEND time 446, 448, 466 performance 459, 507, 511, 517, 529, 533535 AIX secondary system paging 370 DB2 considerations 489 DB2 UDB considerations 492, 497 Disk Magic tool 161163 FlashCopy overview 510511 IMS considerations 502, 504 planning for UNIX systems 308 size of cache 16 tuning Windows systems 281282 Performance Accelerator feature 23 Performance console 294296 performance data collection 217, 224, 239 performance data collection task 224 performance logs 296 performance monitor 224, 239 performance numbers 2 plan Address Groups and LSSs 118 plan RAID Arrays and Ranks 81 planning logical volume size 453 UNIX servers for performance 308 planning and monitoring tools Disk Magic 162 Disk Magic for open systems 180 Disk Magic for zSeries 165 Disk Magic modeling 163 Disk Magic output information 163 workload growth projection 197 planning considerations 529 ports 209, 225 posix AIO 328 POWER5 5 Index
N
N_Port ID Virtualization (NPIV) 386 nmon command 336 noatime 415 nocopy 510, 513, 515, 517 NOOP 413 Noop 412 Noop I/O Scheduler 412
O
OLDS 503504 OLTP 486, 498, 500 open systems servers 38 cfvgpath 373 characterize performance 380 dynamic buffer cache 357 filemon 330, 339340 lquerypr 367368, 370371 lvmstat 330, 334335 performance logs 296 removing disk bottlenecks 300 showvpath 367, 372, 374 topas 335336
prefetch size 500 priority queuing 449 processor complex 10, 12, 17 processor memory 12, 16 pstat 328
R
RAID 20 RAID 10 517518, 525 drive failure 4344 implementation 43 theory 43 RAID 5 517518 drive failure 42 implementation 4243 theory 42 RAID 6 4243, 45, 52 ranks 4649 RAS RAID 42 RAID 10 43 spare creation 44 Raw Device Mapping (RDM) 384, 393 RDM 384386, 393 real-time sampling and display 363 recovery data sets 488 Red Hat 402, 408, 415 Redbooks Web site 630 Contact us xx ReiserFS 414415 reorg 52 Resource Management Facility (RMF) 464 RFREQUENCY 547 RIO-G loop 17 RMF 525, 533, 538 RMZ 543 rotate extents 51, 55 Rotated volume 51 rotated volume 51 RTRACKS 547
S
SAN 26 cabling for availability 267 implementation 267 zoning example 269 SAN implementations 267 SAN Statistics monitoring performance 242 SAN Volume Controller (SVC) 424 SAR display previously captured data 365 real-time sampling and display 363 sar summary 366 sar command 363, 365, 420 sar summary 366 SARC 6 scalability 526, 530 scripts
ess_iostat 594 lvmap 589 test_disk_speeds 597 vgmap 588 vpath_iostat 590 SCSI See Small Computer System Interface SCSI reservation 393 SDD 6, 254, 288, 290, 292 addpaths 368 commands in AIX 367 commands in HP-UX 371 commands in Sun 373 DB2 UDB environment 502 lsvpcfg command 368369, 372 SDD load balancing 273 sequential measuring with dd command 375 Sequential Prefetching in Adaptive Replacement Cache see SARC server affinity 47 setting I/O schedulers 412 showpath command 374 showvpath 367, 372, 374 showvpath command 372 single path mode 273 Small Computer System Interface 405 SMI-S standard IBM extensions 208 ports 209 space efficient volume 49 spares 20, 44 floating 44 spindle 517 spreading host attachments 26 static and dynamic PAVs 442 storage facility image 10 storage image 10, 23 storage LPAR 10 Storage Pool Striping 4748, 51, 53, 515, 518, 541 Storage pool striping 51 storage server challenge 2 storage servers add or remove in Global Mirror 543 storage unit 10 striped volume 52 striping DB2 UDB 499 VSAM data striping 490 Subsystem Device Driver (SDD) 407 Sun SDD commands 373 supported Fibre Channel 267 supported servers FCP attachment 267 SuSE 402 SVC 254 switched fabric 267268 switched FC-AL 5 symmetrical configuration 522523
System Data Mover (SDM) 543 System Management Facilities (SMF) 198, 469 System z servers concurrent write operation 448 CONN time 464, 466467 DISC time 466467, 470 IOSQ time 445, 466 PAV and Multiple Allegiance 446 PEND time 446, 448, 466
T
table 487488, 493 tables, indexes, and LOBs 494 tablespace 487, 492, 494, 496 application 488 system 488, 494 Task Manager 301 test_disk_speeds script 597 thin provisioning 50 TIMEOUT 546547 timestamps 214, 223 topas 335336 topas command 335 topologies arbitrated loop 267 direct connect 267268 switched fabric 267268 tpctool CLI 236 tpctool lstype command 237 tuning disk subsystem ??412, ??412 Windows systems 281282 tuning I/O buffers 313 tuning parameters 545
hierarchy 46, 5455 host attachment 5455 logical volumes 4548 ranks 4649 volume group 5455 VMFS datastore 386, 392 VMotion 386 vmstat 326328 vmstat command 418 VMware datastore 385 volume space efficient 49 volume groups 5455 volume manager 52 volumes 209211, 518, 520, 523 add or remove 536, 538539, 542 CKD 4950, 54 vpath_iostat script 590 vpathmkdev 367, 373374 vpathmkdev command 374
W
WADS 503504 Wavelength Division Multiplexing (WDM) 528 whitepapers 27 Windows Iometer 305 Task Manager 301 tuning 281282 Windows Server 2003 fragmentation 285 workload 512, 515516, 518 databases 486 workload growth projection 197 Workload Manager 442, 449, 454 write ahead data sets, see WADS Write Pacing 548
U
UNIX shell scripts ds_iostat 588, 594 introduction 572, 588, 608 ivmap 588589 vgmap 588 using Task Manager 301
X
XADDPAIR 547549 XRC PARMLIB members 545 XRC Performance Monitor 552 XRC see z/OS Global Mirror xSeries servers Linux 401403 XSTART command 545547 XSUSPEND 546547
V
VDisk 424425 Veritas Dynamic MultiPathing (DMP) 291 vgmap 588 vgmap script 588 video on demand 34 Virtual Center (VC) 389 Virtual Machine File System (VMFS) 393 virtualization address groups 53 array site 45 array sites 4445, 54 arrays 42, 4445, 47 extent pools 4749
Z
z/OS planning guidelines 459 z/OS Global Mirror 508, 543544 dataset placement 545 IBM TotalStorage XRC Performance Monitor 552 tuning parameters 545 z/OS Workload Manager 442 zGM multiple reader 549
zSeries servers overview 442 PAV 442443, 445, 465 static and dynamic PAVs 442
Back cover
Understand the performance aspects of the DS8000 architecture

Configure the DS8000 to fully exploit its capabilities

Use planning and monitoring tools with the DS8000
This IBM Redbooks publication provides guidance about how to configure, monitor, and manage your IBM TotalStorage DS8000 to achieve optimum performance. We describe the DS8000 performance features and characteristics and how they can be exploited with the different server platforms that can attach to it. Then, in consecutive chapters, we detail specific performance recommendations and discussions that apply to each server environment, as well as to database and DS8000 Copy Services environments.

We also outline the various tools available for monitoring and measuring I/O performance in different server environments, and describe how to monitor the performance of the entire DS8000 subsystem.

This book is intended for individuals who want to maximize the performance of their DS8000 and investigate the planning and monitoring tools that are available.