
Linux on z/VM Performance

Understanding Disk I/O


Rob van der Heij
Velocity Software
http://www.velocitysoftware.com/
[email protected]

Copyright © 2013 Velocity Software, Inc. All Rights Reserved. Other products and company names mentioned herein may be trademarks of their respective owners.
Agenda

• I/O Performance Model


• ECKD Architecture
• RAID Disk Subsystems
• Parallel Access Volumes
• Virtual Machine I/O
• Linux Disk I/O

http://zvmperf.wordpress.com/

2
Linux on z/VM Tuning Objective

Resource Efficiency
 Achieve SLA at minimal cost
• “As Fast As Possible” is a very expensive SLA target
 Scalability has its limitations
• The last 10% peak capacity is often the most expensive

Recommendations are not always applicable


 Every customer environment is different
 Very Few Silver Bullets
 Consultant skills and preferences

5
Benchmark Challenges

Benchmarks have limited value for real workload


 Every real life workload is different
• All are different from synthetic benchmarks
• There are just too many options and variations to try
 Benchmarks can help understand the mechanics
• Provide evidence for the theoretical model
Use performance data from your real workload
 Focus on the things that really impact service levels

[Chart: benchmark results for write, rewrite, read, reread, random read, and random write]

6
Anatomy of Basic Disk I/O

Who Cares About Disk


“Disks are very fast today”
“Our response time is a few ms”

Selection Criteria
 Capacity
 Price
Reality: In comparison, disk I/O today is slow

                    IBM 3380-AJ4 (1981)   Seagate Momentus 7200.3 (2011)
  Price             $80K                  $60
  Capacity          2.5 GB                250 GB
  Latency           8.3 ms                4.2 ms
  Seek Time         12 ms                 11 ms
  Host Interface    3 MB/s                300 MB/s
  Device Interface  2.7 MB/s              150 MB/s

Source: © 2010 Brocade, SHARE in Seattle, "Understanding FICON I/O Performance"
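
The Latency row is simply average rotational delay: half a revolution at the drive's spindle speed. A small sketch of that arithmetic (the 3380's 3,600 RPM is inferred here from its 8.3 ms figure; the Momentus spins at 7,200 RPM):

# Average rotational latency = half a revolution, derived from spindle speed.
def avg_latency_ms(rpm):
    return 60_000 / rpm / 2        # one revolution in ms, halved

for name, rpm in [("IBM 3380-AJ4", 3600), ("Seagate Momentus 7200.3", 7200)]:
    print(f"{name:24s} {avg_latency_ms(rpm):.1f} ms")   # 8.3 ms and 4.2 ms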

7
Anatomy of Basic Disk I/O

Reading from disk


 Seek – Position the heads over the right track
 Latency – Wait for the right sector
 Read – Copy the data into memory

Average I/O Operation
 Seek over 1/3 of the tracks ~ 10 ms
 Wait for 1/2 a rotation ~ 3 ms
 Read the data ~ 1 ms

[Diagram: the host issues a Start I/O; the disk performs Seek, Locate, and Data Transfer; the I/O response time spans the whole sequence. Host and disk are decoupled by a speed matching buffer.]

8
Classic DASD Configuration

CKD – Count Key Data Architecture


 Large system disk architecture since 60’s
 Track based structure
• Disk record size to match application block size
 Disk I/O driven by channel programs
• Autonomous operation of control unit and disk
• Reduced CPU and memory requirements
 ECKD – Extended Count Key Data
• Efficient use of cache control units
• Improved performance with ESCON and FICON channel

FBA – Fixed Block Architecture


 Popular with 9370 systems
 Not supported by z/OS
 Access by block number
 Uniform block size

9
Classic DASD Configuration

Channel Attached DASD


 Devices share a channel
 Disconnect and reconnect
 Track is cached in control unit buffer

[Diagram: a read request flows from the Application through the Host OS, CHPID, and Control Unit to the Device: Start I/O, command transfer, data transfer, I/O complete, data available.]

IOSQ
 Device Contention
 Interrupt Latency
PEND
 Channel Busy
 Path Latency
 Control Unit Busy
 Device Busy
DISC
 Seek
 Latency
 Rotational Delay
CONN
 Data Transfer
 Channel Utilization
http://zvmperf.wordpress.com/2013/06/07/disk-io-response-time-metrics/
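
A rough sketch of how these components combine in the monitor's numbers (the standard decomposition; the values below are placeholders):

# Service time covers the channel and device work; response time adds the
# time the request queued in the host (IOSQ). Values are placeholder ms.
def service_time(pend, disc, conn):
    return pend + disc + conn

def response_time(iosq, pend, disc, conn):
    return iosq + service_time(pend, disc, conn)

print(response_time(iosq=0.1, pend=0.2, disc=5.0, conn=0.3))   # 5.6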

10
Classic DASD Configuration

Instrumentation provided by z/VM Monitor


 Metrics from z/VM and Channel
• Traditionally used to optimize disk I/O performance
 Response time improvement through seek optimization
• Relocating data sets to avoid multiple hot spots
• I/O scheduling – elevator algorithm

Screen: ESADSD2 ESAMON 3.807 03/23 16:24-16:33


1 of 3 DASD Performance Analysis - Part 1 DEVICE 3505 2097

Dev Device %Dev <SSCH/sec-> <-----Response times (ms)--->


Time No. Serial Type Busy avg peak Resp Serv Pend Disc Conn
-------- *--- ------ ------ ---- *---- ----- ----- ----- ----- ----- -----
16:25:00 3505 0X3505 3390-? 26.3 728.8 728.8 0.4 0.4 0.2 0.0 0.2
16:26:00 3505 0X3505 3390-? 76.9 977.4 977.4 0.8 0.8 0.3 0.1 0.4
16:27:00 3505 0X3505 3390-? 62.0 480.0 977.4 1.3 1.3 0.5 0.1 0.6
16:28:00 3505 0X3505 3390-? 15.8 198.9 977.4 0.8 0.8 0.1 0.5 0.2

11
Contemporary Disk Subsystem

Big Round Brown Disk


 Specialized Mainframe DASD
 One-to-one map of Logical Volume on Physical Volume
 Physical tracks in CKD format
 ECKD Channel Programs to exploit hardware capability

Contemporary Disk Subsystem


 Multiple banks of commodity disk drives
• RAID configuration
• Dual power supply
• Dual controller
 Microcode to emulate ECKD channel programs
• Data spread over banks, ranks, array sites
 Lots of memory to cache the data

12
RAID Configuration

RAID: Redundant Array of Independent Disks


 Setup varies among vendors and models
 Error detection through parity data
 Error correction and hot spares
 Spreading the I/O over multiple disks

Performance Considerations
 The drives are “just disks”
 RAID does not avoid latency
 Large data cache to avoid I/O
 Cache replacement strategy
Additional Features
 Instant copy
 Autonomous backup
 Data replication

[Diagram: FICON channels, ECKD emulation, and cache inside the disk subsystem.]

13
RAID Configuration

Provides Performance Metrics like 3990-3


 Model is completely different
 DISC includes all internal operations
• Reading data into cache
• Data duplication and synchronization

Bimodal Service Time distribution
[Chart: probability vs. response time, with separate peaks for cache hits and cache misses]
 Cache read hit
• Data available in subsystem cache
• No DISC time
 Cache read miss
• Back-end reads to collect data
• Service time unrelated to logical I/O

Average response time is misleading


 Cache hit ratio
 Service time for cache read miss

14
RAID Configuration

Example:
 Cache Hit Ratio 90%
 Average DISC 0.5 ms
 Service Time Miss 5 ms
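
A quick sketch of why the average hides the bimodal shape, using the numbers from this example:

# 90% hits with essentially no DISC, 10% misses at about 5 ms of DISC:
# the 0.5 ms average describes neither population.
hit_ratio = 0.90
disc_hit = 0.0       # cache hit: data served from cache, no back-end I/O
disc_miss = 5.0      # cache miss: back-end reads (ms)

avg_disc = hit_ratio * disc_hit + (1 - hit_ratio) * disc_miss
print(f"average DISC = {avg_disc:.1f} ms")    # 0.5 ms, as in the example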

Read Prediction
 Detecting sequential I/O
 ECKD: Define Extent

RAID does not improve hit ratio
 Read-ahead can improve hit ratio
 RAID makes read-ahead cheaper

[Diagram: FICON channels, ECKD emulation, and cache inside the disk subsystem.]

15
Disk I/O Example
210K blocks per second = 105 MB/s -> 6.3 GB written
105 MB/s & 272 context switches -> ~ 400 KB I/O's

<----------Rates (per sec)-------->
<Processor Pct Util> Idle <-Swaps-> <-Disk IO-> Switch Intrpt
Time Node Total Syst User Nice Pct In Out In Out Rate Rate
-------- -------- ----- ---- ---- ---- ---- ---- ---- ----- ----- ------ ------
15:12:00 roblnx2 5.9 5.7 0.2 0 60.2 0 0 0 210K 272.1 0

Dev Device Total ERP %Dev <SSCH/sec-> <-----Response times (ms)--->


Time No. Serial Type SSCH SSCH Busy avg peak Resp Serv Pend Disc Conn
-------- ---- ------ ------ ----- ---- ---- ----- ----- ----- ----- ----- ----- -----
15:12:00 954A PR954A 3390-9 6350 0 36.8 105.8 105.8 3.5 3.5 0.2 1.2 2.1
15:12:00 95D5 PR954A 3390-9 6677 0 35.9 111.3 111.3 3.2 3.2 0.2 1.1 1.9
15:12:00 95D6 PR954A 3390-9 6532 0 35.7 108.9 108.9 3.3 3.3 0.2 1.2 2.0

Pct. <---- Total I/O ----> <------ Write Activity ------>


Dev Actv <Per Sec> Cache Total DFW DFW Seq NVS
Time No. Serial Samp I/O Hits Hit% Read% I/O I/O Hits I/O Hit% Full
-------- ---- ------ ---- ---- ---- ----- ----- ----- ---- ---- ---- ---- ----
15:12:00 954A PR954A 100 326 326 100.0 0 325.7 326 326 308 100 123

Pct. <---- Total I/O ----> <-Tracks/second->


Dev Actv <Per Sec> Cache <--Cache---> <-Staged-> De-
Time No. Serial Samp I/O Hits Hit% Read% Inhib Bypass Seq Nseq staged
-------- ---- ------ ---- ---- ---- ----- ----- ----- ------ ----- ---- ------
15:12:00 954A PR954A 100 326 326 100.0 0 0 0 0 0 2194

16
Parallel Access Volumes

S/390 I/O Model: Single Active I/O per Logical Volume


 Made sense with one logical volume per physical volume
 Too restrictive on contemporary DASD subsystems
• Logical volume can be striped over multiple disks
• Cached data could be accessed without real disk I/O
• Even more restrictive with large logical volumes
[Diagram: several LPARs, including the z/VM LPAR, reach logical volumes a, b, and c through the channel subsystem and FICON channels; the disk subsystem presents them via ECKD emulation and cache.]

17
Parallel Access Volumes

Base and Alias Subchannels


 Aliases appear like normal device subchannels
• Host and DASD subsystem know it maps on the same set of data
• Simultaneous I/O possible on base and each alias subchannel
 DASD subsystem will run them in parallel when possible
• Operations may be performed in different order
[Diagram: the same configuration, but each logical volume (a, b, c) now has a base subchannel plus alias subchannels, so multiple I/Os to one volume can be in flight through the channel subsystem at the same time.]

18
Parallel Access Volumes

Access to cached data while previous I/O is still active


 I/O throughput mainly determined by cache miss operations
• Assumes moderate hit ratio and an alias subchannel available

Example
 Cache hit ratio of 90%
• Cache hit response time 0.5 ms
• Cache miss response 5.5 ms

[Timing diagram: a cache miss (PEND 0.2 ms, DISC 5.0 ms, CONN 0.3 ms) followed by several cache hits. On a single subchannel the hits wait until the miss completes; with a base and an alias subchannel the hits run on the alias while the miss is disconnected, shortening the elapsed time.]
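
A back-of-the-envelope sketch of this example (90% hits at 0.5 ms, 10% misses at 5.5 ms), under the simplifying assumption that with an alias available only the misses still serialize on the volume:

# Single subchannel: every I/O occupies the subchannel for its full
# response time. Base + alias: hits overlap the miss DISC time, so the
# ceiling is set mainly by the miss traffic (rough upper bound).
hit, miss = 0.90, 0.10
t_hit, t_miss = 0.5e-3, 5.5e-3          # seconds

serialized = hit * t_hit + miss * t_miss
print(f"single subchannel : ~{1 / serialized:4.0f} I/O per second")   # ~1000

overlapped = miss * t_miss
print(f"base + alias      : ~{1 / overlapped:4.0f} I/O per second")   # ~1800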

19
Parallel Access Volumes

Queuing of next I/O closer to the device


 Interesting with high cache hit ratio when PEND is significant
 Avoids delay due to PEND time
• Service time for cache hit determined only by CONN time
• Assuming sufficient alias subchannels
Example
 Cache hit ratio of 95%
• Cache hit response time 0.5 ms
• Cache miss response 5.5 ms

[Timing diagram: single subchannel vs. base plus alias subchannels. With the next request queued closer to the device, a cache hit no longer pays the PEND time (0.2 ms) and completes in roughly its CONN time (0.3 ms).]

20
Parallel Access Volumes

Multiple parallel data transfers over different channels


 Parallel operations retrieving from data cache
• Depends on DASD subsystem architecture and bandwidth
• Configuration aspects (ranks, banks, etc)
• Implications on FICON capacity planning
 Cache hit service time improved by the number of channels
• Combined effect: service time determined by aggregate bandwidth
• Assumes infinite number of alias subchannels
• Assumes sufficiently high cache hit ratio

[Timing diagram: a base and several alias subchannels transfer data from cache over different channels at the same time, so the aggregate channel bandwidth determines the cache hit service time.]

21
Parallel Access Volumes

Performance Benefits
1. Access to cached data while previous I/O is still active
• Avoids DISC time for cache miss
2. Queuing the request closer to the device
• Avoid IOSQ and PEND time
3. Multiple operations in parallel retrieving data from cache
• Utilize multiple channels for single logical volume

Restrictions
 PAV is a chargeable feature on DASD subsystems
• An infinite number of alias devices is impractical and expensive
 Workload must issue multiple independent I/O operations
• Typically demonstrated by an I/O queue for the device (IOSQ time)
 A single workload can monopolize your I/O subsystem
• Requires additional monitoring and tuning

22
Parallel Access Volumes

Static PAV
 Alias devices assigned in DASD Subsystem configuration
 Association observed by host Operating System

Dynamic PAV
 Assignment can be changed by higher power (z/OS WLM)
 Moving an alias takes coordination between parties
 Linux and z/VM tolerate but do not initiate Dynamic PAV

HyperPAV
 Pool of alias devices is associated with set of base devices
 Alias is assigned for the duration of a single I/O
 Closest to “infinite number of alias devices assumed”

23
Parallel Access Volumes

Virtual machines can exploit PAV

PAV-aware guests (Linux)


 Dedicated Base and Alias devices
 Costly when the guest does not need it all the time

PAV-aware guests with minidisks


 Uses virtual HyperPAV alias devices
 Requires sufficient real HyperPAV alias devices

PAV-unaware guests (CMS or Linux)


 Minidisks on shared logical volumes
 z/VM manages and shares the alias devices
 Break large volumes into smaller minidisks to exploit PAV

24
Linux Disk I/O

Virtual machines are just like real machines


 Prepare a channel program for the I/O
 Issue a SSCH instruction to virtual DASD (minidisk)
 Handle the interrupt that signals completion

z/VM does the smoke and mirrors


 Translate the channel program
• Virtual address translation, locking user pages
• Fence minidisk with a Define Extent CCW
 Issue the SSCH to the real DASD
 Reflect interrupt to the virtual machine

Diagnose I/O
 High-level Disk I/O protocol
 Easier to manage
 Synchronous and Asynchronous

[Diagram: the Linux guest's virtual devices are minidisks (x, y, z) that z/VM maps onto real devices through the channel subsystem.]

25
Linux Disk I/O

Linux provides different driver modules
 ECKD – Native ECKD DASD
• Minidisk or dedicated DASD
• Also for Linux in LPAR
 FBA – Native FBA DASD
• Does not exist in real life
• Virtual FBA – z/VM VDISK
• Disk in CMS format
• Emulated FBA – EDEVICE
 DIAG – z/VM Diagnose 250
• Disk in CMS reserved format
• Device independent
 Real I/O is done by z/VM
 No obvious performance favorite
• Very workload dependent

[Chart: benchmark results for write, rewrite, read, reread, random read, and random write with the DIAG, ECKD, and FBA drivers.]
[Diagram: the dasd driver's diag, eckd, and fba disciplines in the Linux guest; real I/O is issued by z/VM through the channel subsystem.]

26
Linux Disk I/O

Virtual Machine I/O also uses other resources


 CPU – CCW Translation, dispatching
 Paging – Virtual machine pages for I/O operation

[Sequence diagram: the application's read becomes a virtual Start I/O. z/VM spends time on IOSQ, CCW translation, paging, and dispatching before issuing the real Start I/O; the real I/O then passes through PEND (command transfer), DISC, and CONN (data transfer). When the real I/O completes, z/VM reflects the virtual I/O interrupt and the data is available to the application.]

27
Linux Disk I/O

Linux Physical Block Device
 Abstract model for a disk
• Divided into partitions
 Data arranged in blocks (512 byte)
 Blocks referenced by number

Linux Block Device Layer
 Data block addressed by
• Device number (major / minor)
• Block number
 All devices look similar

Linux Page Cache
 Keep recently used data
 Buffer data to be written out

[Diagram: applications on top of file systems, the page cache, the block layer, and the dasd driver's diag, eckd, and fba disciplines.]

28
Linux Disk I/O

Buffered I/O
 By default Linux will buffer application I/O using Page Cache
• Lazy Write – updates written to disk at “later” point in time
• Data Cache – keep recently used data “just in case”
• Read Ahead – avoid I/O for sequential reading
 Performance improvement
• More efficient disk I/O
• Overlap of I/O and processing

[Chart: Buffered I/O Throughput – read and write throughput (MB/s) against block size from 4 KB to 512 KB.]

29
Linux Disk I/O

Buffered I/O
 By default Linux will buffer application I/O using Page Cache
• Lazy Write – updates written to disk at “later” point in time
• Data Cache – keep recently used data “just in case”
• Read Ahead – avoid I/O for sequential reading
 Performance improvement
• More efficient disk I/O
• Overlap of I/O and processing

Direct I/O
 Avoids Linux page cache
• Application decides on buffering
• No guessing at what is needed next
 Same performance at lower cost
• Not every application needs it

[Chart: Direct I/O vs Buffered I/O – read and write throughput (MB/s) against block size from 4 KB to 512 KB, buffered and direct.]
[Chart: Disk Write – CPU Cost – Buffered vs Direct I/O, split into CP and user CPU time.]
http://zvmperf.wordpress.com/2012/04/17/cpu-cost-of-buffered-io/
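
A minimal sketch of what direct I/O looks like from the application side, assuming Linux and a file system that supports O_DIRECT (tmpfs, for instance, does not); the path is a placeholder:

# Write one block with O_DIRECT so it bypasses the Linux page cache.
# O_DIRECT requires buffer, offset, and length aligned to the logical block
# size; mmap returns page-aligned memory, which satisfies that here.
import mmap
import os

BLOCK = 4096
buf = mmap.mmap(-1, BLOCK)            # anonymous, page-aligned buffer
buf.write(b"\0" * BLOCK)

fd = os.open("/path/to/testfile", os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o600)
try:
    os.write(fd, buf)                 # goes to the device, not the page cache
finally:
    os.close(fd)
    buf.close()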

30
Linux Disk I/O

Myth: Direct I/O not supported for ECKD disks


 Frequently told by DB2 experts

Truth: DB2 does not do 4K aligned database I/O


 The NO FILESYSTEM CACHING option is rejected
 Database I/O is buffered by Linux
• Uses additional CPU to manage page cache
• Touches all excess memory to cache data
 FCP disks recommended for databases with business data
• May not be an option for installations with large FICON investment

Experimental work to provide a bypass for this restriction


 Interested to work with customers who need this

31
Linux Disk I/O

Synchronous I/O
 Single threaded application model
 Processing and I/O are interleaved
[Diagram: within one transaction, CPU and I/O phases alternate: CPU, I/O, CPU, I/O, CPU, I/O.]

Asynchronous I/O
 Allow for overlap of processing and I/O
 Improves single application throughput
 Assumes a balance between I/O and CPU
[Diagram: CPU phases overlap with the I/O phases.]

Matter of Perspective
 From a high level everything is asynchronous
 Looking closer, everything is serialized again

Linux on z/VM
 Many virtual machines competing for resources
 Processing of one user overlaps I/O of the other
 Unused capacity is not wasted
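
A small sketch of the overlap idea in application code (path and sizes are placeholders): while the CPU processes the block just read, the next read is already in flight on a worker thread.

# Keep one read in flight while processing the previous block.
from concurrent.futures import ThreadPoolExecutor

def read_block(path, offset, size=4096):
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(size)

def process(data):
    return sum(data)                      # stand-in for the CPU part of the transaction

with ThreadPoolExecutor(max_workers=1) as pool:
    pending = pool.submit(read_block, "/path/to/data", 0)
    for offset in range(4096, 10 * 4096, 4096):
        data = pending.result()           # wait for the I/O issued last iteration
        pending = pool.submit(read_block, "/path/to/data", offset)   # next I/O starts now
        process(data)                     # CPU work overlaps that I/O
    pending.result()                      # drain the last read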

32
Linux Disk I/O

Myth of Linux I/O Wait Percentage


 Shown in “top” and other Linux tools
 High percentage: good or bad?
 Just shows there was idle CPU and active I/O
• Less demand for CPU shows high iowait%
• Adding more virtual CPUs increases iowait%
• High iowait% does not indicate an “I/O problem”

top - 11:49:20 up 38 days, 21:27, 2 users, load average: 0.57, 0.13, 0.04
Tasks: 55 total, 2 running, 53 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.3%us, 1.3%sy, 0.0%ni, 0.0%id, 96.7%wa, 0.3%hi, 0.3%si, 1.0%st

top - 11:53:32 up 38 days, 21:31, 2 users, load average: 0.73, 0.38, 0.15
Tasks: 55 total, 3 running, 52 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 31.1%sy, 0.0%ni, 0.0%id, 62.5%wa, 0.3%hi, 4.3%si, 1.7%st
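
The iowait figure comes from CPU accounting: ticks where the CPU was idle while I/O was outstanding. A minimal sketch that derives the same percentages top shows, from two samples of the aggregate "cpu" line in /proc/stat:

# /proc/stat "cpu" fields: user nice system idle iowait irq softirq steal ...
import time

def cpu_ticks():
    with open("/proc/stat") as f:
        return [int(x) for x in f.readline().split()[1:]]

before = cpu_ticks()
time.sleep(5)
after = cpu_ticks()

delta = [b - a for a, b in zip(before, after)]
total = sum(delta[:8])
names = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]
for name, d in zip(names, delta):
    print(f"{name:8s} {100.0 * d / total:5.1f}%")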

33
Linux Disk I/O

Myth of Linux Steal Time


 Shown in “top” and other Linux tools
• “We have steal time, can the user run native in LPAR?”
 Represents time waiting for resources
• CPU contention
• Paging virtual machine storage
• CP processing on behalf of the workload
• Idle Linux guest with application polling
 Linux on z/VM is a shared resource environment
• Your application does not own the entire machine
• Your expectations may not match the business priorities
 High steal time may indicate a problem
• Need other data to analyze and explain

top - 11:53:32 up 38 days, 21:31, 2 users, load average: 0.73, 0.38, 0.15
Tasks: 55 total, 3 running, 52 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 31.1%sy, 0.0%ni, 0.0%id, 62.5%wa, 0.3%hi, 4.3%si, 1.7%st

http://zvmperf.wordpress.com/2013/02/28/explaining-linux-steal-percentage/

34
Linux Disk I/O

Logical Block Devices
 Device Mapper
 Logical Volume Manager

Creates new block device
 Rearranges physical blocks
 Avoid excessive mixing of data

Be aware of more exotic methods
 Mirrors and redundancy
 Anything beyond RAID 0
 Do not mess with the I/O scheduler

[Diagram: the device mapper sits below the file systems, page cache, and block layer in the Linux guest, building logical block devices by concatenation or striping of physical devices.]

35
Linux Disk I/O

Disk Striping
 Function provided by LVM and mdadm
 Engage multiple disks in parallel for your workload

Like shipping with many small trucks
 Will the small trucks be faster?
• What if everyone does this?
 What is the cost of reloading the goods?
• Extra drivers, extra fuel?
 Will there be enough small trucks?
• Cost of another round trip?

[Diagram: striping splits large I/O into small I/O's, queues them for the proper devices, and merges them back into large I/O's.]

36
Linux Disk I/O

Performance Aspects of Striping


 Break up a single large I/O into many small ones
• Expecting that small ones are quicker than a large one
• Expect the small ones to go in parallel
 Engage multiple I/O devices for your workload
• No benefit if all devices already busy
• Your disk subsystem may already engage more devices
• You may end up just waiting on more devices

Finding the Optimal Stripe Size is Hard


 Large stripes may not result in spreading of the I/O
 Small stripes increase cost
• Cost of split & merge proportional to number of stripes
 Some applications will also stripe the data
 Easy approach: avoid it until performance data shows a problem
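
An illustrative sketch (not the actual LVM or mdadm code) of what striping implies: one large logical I/O becomes a series of per-device chunks that must each be issued, completed, and merged back.

# Split a logical I/O into per-device chunks for a striped volume.
def split_io(offset, length, stripe_size, n_devices):
    """Yield (device, device_offset, chunk_length) for each piece."""
    end = offset + length
    while offset < end:
        stripe = offset // stripe_size            # which stripe this byte falls in
        device = stripe % n_devices               # round-robin placement
        within = offset % stripe_size
        chunk = min(stripe_size - within, end - offset)
        dev_off = (stripe // n_devices) * stripe_size + within
        yield device, dev_off, chunk
        offset += chunk

# One 1 MiB write on a 4-disk volume with 64 KiB stripes becomes 16 small I/O's:
for dev, off, length in split_io(0, 1024 * 1024, 64 * 1024, 4):
    print(f"device {dev}: offset {off:7d} length {length}")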

37
The Mystery of Lost Disk Space

Claim: ECKD formatting is less efficient (1)
 "because it requires low-level format"

Is this likely to be true?


 Design is from when space was very expensive
 Fixed Block has low level format too – but hidden from us

ECKD allows for very efficient use of disk space


 Allows application to pick most efficient block size
 Capacity of a 3390 track varies with block size
• 48 KB with 4K block size
• 56 KB as single block
 Complicates emulation of 3390 tracks on fixed block device
• Variable length track size (log-structured architecture)
• Fixed size at maximum capacity (typically 64 KB for easy math)

(1) Claim in various IBM presentations

38
Conclusion

Avoid using synthetic benchmarks for tuning


 Hard to correlate to real life workload

Measure application response


 Identify any workload that does not meet the SLA
 Review performance data to understand the bottleneck
• Be aware of misleading indicators and instrumentation
• Some Linux experts fail to understand virtualization
 Address resources that cause the problem
• Don’t get tricked into various general recommendations

Performance Monitor is a must


 Complete performance data is also good for chargeback
 Monitoring should not cause performance problems
 Consider a performance monitor with performance support

Avoid betting with your Linux admin on synthetic benchmarks


 Drop me a note if you cannot avoid it

39
Linux on z/VM Performance

Understanding Disk I/O


Rob van der Heij
Velocity Software
Session 13522

http://www.velocitysoftware.com/
[email protected]

Copyright © 2013 Velocity Software, Inc. All Rights Reserved. Other products and company names mentioned herein may be trademarks of their respective owners.
