
Chapter 6: I/O

Why I/O?

who cares and what to consider

device characteristics and types

I/O system architecture
• buses, I/O processors

high-performance disk architectures

I/O performance

Amdahl's law
• speed up only the CPU and I/O becomes the bottleneck
• e.g., suppose I/O takes 10% of the time
• speed up the CPU 10 times
• the system speeds up only about 5 times (a quick sketch follows)
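A quick Python sketch of that Amdahl's law arithmetic (the 10% / 10x figures are the slide's example):

```python
# Amdahl's law: speeding up only the CPU leaves I/O as the bottleneck.
def speedup(io_fraction, cpu_speedup):
    return 1 / (io_fraction + (1 - io_fraction) / cpu_speedup)

print(speedup(io_fraction=0.10, cpu_speedup=10))   # ~5.3x, not 10x
```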

Throughput vs latency

"There is an old network saying: bandwidth problems can be cured with money. Latency problems are harder because the speed of light is fixed - you can't bribe God." - David Clark

throughput
• bandwidth
• I/Os per second

latency
• response time

who cares about latency?
• why don't you just context switch?
• fallacy
• requires more memory
• requires more processes (jobs)
• human productivity increases super-linearly as response time decreases
I/O Overlap

I/O overlaps with computation in complicated ways

[figure: timeline of jobs 1, 2, 3 switching between USER, OS, and I/O activity, driven by I/O requests and I/O-completion interrupts]

I/O Performance

time_job = time_cpu + time_io - time_overlap
• e.g., 10 = 10 + 4 - 4

speed up the CPU by 2x; what is time_job? (a sketch follows)
• time_job = 5 + 4 - 4 = 5 (best)
• time_job = 5 + 4 - 0 = 9 (worst)
• time_job = 5 + 4 - 2 = 7 (average?)
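A minimal sketch of the overlap formula with the slide's numbers:

```python
# time_job = time_cpu + time_io - time_overlap
def job_time(cpu, io, overlap):
    return cpu + io - overlap

print(job_time(10, 4, 4))   # 10: the original job
print(job_time(5, 4, 4))    #  5: CPU 2x faster, best-case overlap
print(job_time(5, 4, 0))    #  9: CPU 2x faster, worst case (no overlap)
```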

I/O Characteristics

supercomputers
• data transfer rate important
• many MBs per second for large files

transaction processing
• I/O rate important
• "random" accesses
• disk I/Os per second

time-sharing filesystems
• small files
• sequential accesses
• many creates/deletes
Device Characteristics

behavior
• input - read once
• output - write once
• storage - read many times; usually written as well

partner
• human
• machine

data rate
• peak transfer rate

Device            I or O?   Partner   Data Rate (KB/s)
mouse             I         human     0.01
graphics display  O         human     60,000
modem             I/O       machine   2-8
LAN               I/O       machine   500-6,000
tape              storage   machine   2,000
disk              storage   machine   2,000-10,000

Magnetic Disks

[figure: disk assembly - platters on a spindle, an arm with a head per surface, tracks divided into sectors with inter-sector gaps, and a cylinder spanning the same track on all platters]

Disk Parameters

spindles: 1-4 (most 1)

platters per spindle: 1-20

rpm: 3000-6000 RPM (most 3600)

platter diameter: 1.3"-8"
• trend towards smaller disks
• higher RPM
• mass production

tracks per surface: 500-2500
Disk Parameters

sectors per track: 32 typical
• sector layout: sector # - gap - data+ECC
• fixed-length sectors (except IBM)
• typically fixed sectors per track
• recently constant bit density

Disk Operations

seek: move head to track
• avg seek time = (sum over i = 1..n of seek(i)) / n
• n is the # of tracks, seek(i) is the time to seek to the ith track

rotational latency: wait for sector
• avg rotational latency = 0.5 rev / 3600 RPM = 0.5/60 s = 8.3 ms

transfer rate
• typically 1-4 MB per second

Disk Operations

overhead
• controller delay
• queuing delay

Disk Performance

avg disk access = avg seek time + avg rotational delay + transfer + overhead

e.g.,
• 3600 RPM; 2 MB/s transfer rate
• avg seek time: 9 ms
• controller overhead: 1 ms
• read a 512-byte sector
• 9 ms + (0.5/3600) min + 0.5 KB / (2 MB/s) + 1 ms
• = 9 + 8.3 + 0.25 + 1 ≈ 18.6 ms
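A minimal sketch of this calculation (decimal KB/MB, as the slide's numbers assume):

```python
# Average access time for the slide's example disk.
AVG_SEEK_MS = 9.0
RPM = 3600
TRANSFER_MB_S = 2.0
OVERHEAD_MS = 1.0
SECTOR_KB = 0.5

rot_ms = 0.5 / RPM * 60 * 1000                        # half a revolution, on average
xfer_ms = SECTOR_KB / (TRANSFER_MB_S * 1000) * 1000   # 512 bytes at 2 MB/s

print(AVG_SEEK_MS + rot_ms + xfer_ms + OVERHEAD_MS)   # ~18.6 ms
```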
Alternatives to Disks

DRAMs
• SSD - solid state disk
• standard disk interface
• DRAM and battery backup
• ES - expanded storage
• software-controlled cache
• large (4K) blocks
+ no seek time
+ fast transfer rate
– cost

FLASH memory
+ no seek time
+ fast transfer
+ non-volatile
– bulk erase before write
– slow writes
– "wears" out over time

Optical Disks

read-only
• CD-ROM
• cheap and reliable
• slow

write-once
• not-so cheap
• slow

write-many
• expensive, slow

Graphics Display - CRT

[figure: electron gun fires through X + Y deflectors onto a phosphor-coated screen]

screen has many scan lines, each of which has many pixels

phosphor acts as a capacitor - refresh 30-60 times/second
Graphics Displays - Frame Buffer

[figure: CPU and memory write the frame buffer at ~0.2 MB/s; the frame buffer feeds the CRT at ~30 MB/s]

frame buffer stores a bit map
• one entry per pixel
• black and white - 1 bit per pixel
• gray-scale - 4-8 bits per pixel
• color (RGB) - 8 bits per color

• typical size 1560 x 1280 pixels
• black and white: 250 KB
• color (RGB): 5.7 MB
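The sizes follow from the geometry; a quick check in Python (the slide's 250 KB and 5.7 MB round these):

```python
W, H = 1560, 1280          # the slide's typical display

bw_bytes  = W * H // 8     # black and white: 1 bit per pixel
rgb_bytes = W * H * 3      # color: 8 bits for each of R, G, B

print(bw_bytes)            # 249,600 bytes, i.e. ~250 KB
print(rgb_bytes / 2**20)   # ~5.7 MB
```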

Reducing cost of Frame Buffer

key idea: only a small number of colors are used in one image

color map: frame buffer stores a color map index
• color map translates the index to a full 24-bit color

[figure: the frame buffer holds an 8-bit index per pixel - e.g., pixel (X0, Y0) holds 17; entry 17 of the 256 x 24 color map holds the 24-bit RGB value 120 014 074, which drives the CRT]

• 1560 x 1280 with a 256-entry color map - factor of 3 reduction (8-bit index vs. 24-bit RGB)

Frame Buffer Operations

logically output only
• but read as well

BIT BLTs: bit block transfers
• read-modify-write operations
• e.g., read-xor-write (a toy sketch follows)
• used for cursors etc.

open question
• OS only?
• or direct user access? protection?
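A toy sketch of why read-xor-write suits cursors (assumes numpy; the frame-buffer shapes are made up): XORing the same pattern twice restores the original pixels, so no saved background is needed.

```python
import numpy as np

fb = np.random.randint(0, 256, (16, 16), dtype=np.uint8)  # toy 8-bit frame buffer
cursor = np.full((4, 4), 0xFF, dtype=np.uint8)

before = fb.copy()
fb[0:4, 0:4] ^= cursor     # read-modify-write: draw the cursor
fb[0:4, 0:4] ^= cursor     # same BIT BLT again: cursor erased
assert (fb == before).all()
```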
Frame Buffer Implementation

1560 x 1280 RGB display
• bandwidth required = 1560 x 1280 x 24 bits x 30/s ≈ 1.4 Gb/s ≈ 171 MB/s

how can we implement this?
• Video DRAMs
• dual-ported DRAM
• regular random-access port
• serial video port
• use 24 in parallel for RGB

what about bandwidth? interleave video DRAMs

Other Issues in Displays

double buffering
• duplicate frame buffer
• to prevent displaying incomplete updates
• may be necessary for animation

z-buffer
• for displaying 3-D images
• assign a z-dimension to each pixel
• store the z-dimension in the frame buffer
• BIT BLTs compare the z-dimension

Networks

Terminal networks
• machine-terminal
• star - point-to-point
• 0.3-19 Kbits/s, RS232 protocol

LANs
• machine-machine
• bus, ring, star
• 0.1-100 Mbits/s, < 10 km
• ethernet

Long-haul networks
• machine-machine
• irregular structure - point-to-point
• 50-2000 Kbits/s, > 10 km
• Internet
LAN

e.g., Ethernet
• one-write bus with collisions and exponential backoff
• within a building
• 10 Mb

now Ethernet is
• point-to-point to clients (switched network)
• with hubs
• client s/w unchanged
• 100 Mb

ATM - Asynchronous Transfer Mode
• phone companies use it for long-haul networks (packet-switched)
• not a viable LAN yet

WAN

e.g., ARPANET, Internet
• arranged as a DAG
• backbones now 1 Gb/s; 100 Gb/s in the future

TCP/IP - protocol stack
• Transmission Control Protocol, Internet Protocol

key issues:
• top-to-bottom systems issues
• getting the net into homes
• cable modem, ISDN, ??

I/O System Architecture

hierarchical data paths
• divides bandwidth going down the hierarchy
• often buses at each level

I/O processing
• program controlled
• DMA
• dedicated I/O processors
I/O System Architecture

[figure: CPU and cache on a CPU-memory bus with memory, an IOP, and a frame buffer driving a CRT; the IOP bridges to an I/O bus carrying disk controllers and a network interface]

Buses

Option                         High performance   Low cost
address/data lines separate?   yes                no
data lines                     wider              narrower
transfer size                  multiple words     single word
bus masters                    multiple           one
split transactions             yes                no
clocking                       synchronous        asynchronous

Buses

CPU-memory buses
• want speed
• usually custom designs (fast - several GB/s)
• e.g., SGI Challenge, Sun SD, HP Summit

I/O buses
• compatibility is important
• usually standard designs - PCI (Express), SCSI (slower - <= GB/s)

CPU interface

physical connection
• direct to cache
+ no coherence problems
– pollutes cache
– CPU and I/O arbitrate for cache
• CPU-memory bus
+ DMA
– may not be standard
CPU Interface

• I/O bus
+ industry standard
– slower than the memory bus
– indirection through an I/O processor

[figures: left, "direct to cache" - I/O attaches at the cache, between the CPU and the memory bus; right, the usual organization - CPU and cache on the CPU-memory bus with memory and an IOP, and the IOP bridging to the I/O bus where devices attach]

Bus Arbitration

centralized star connection
• high cost
• high performance

daisy chain
• cheap
• low performance

distributed arbitration
• medium price/performance

arbitration for the next bus mastership overlaps with the current transfer

Distributed Arbitration

set of wire-OR priority lines

set of wire-OR timing and control lines

each requesting device indicates its priority

a device removes its less-significant bits if a higher priority is present

eventually only the highest priority remains

special care to ensure fairness
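A toy simulation of the bit-elimination scheme (function name and bit width are ours; this is not a cycle-accurate model): scanning from the MSB down, a device drops out as soon as the wire-OR bus shows a 1 in a position where it has a 0.

```python
def arbitrate(priorities, width=8):
    contenders = set(priorities)
    for bit in reversed(range(width)):                 # MSB first
        bus = any(p & (1 << bit) for p in contenders)  # wire-OR of this bit
        if bus:                                        # someone drove a 1 here
            contenders = {p for p in contenders if p & (1 << bit)}
    return contenders                                  # highest priority survives

print(arbitrate({0b0101, 0b0110, 0b0011}))             # {6}
```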
Bus Switching Methods

circuit-switched buses
• bus is held until the request is complete
• simple protocol
• latency of the device affects bus utilization

split transaction, or packet-switched (or pipelined)
• bus is released after the request is initiated
• others use the bus until the reply comes back
• complex bus control
• better utilization of the bus

Standard I/O Buses

                  S bus       MicroChannel   PCI-Express   SCSI
data width        32 bits     32             32-64         8-16
clock             16-25 MHz   asynch         256           10 MHz / asynch
# masters         multiple    multiple       multiple      multiple
b/w, 32-bit read  33 MB/s     20             150+          20 or 6
b/w, peak         89          75             800+          20 or 6

Memory buses

            HP Summit   SGI Challenge   Sun XDbus
data width  128 bits    256             144
clock       60 MHz      48              66
# masters   multiple    multiple        multiple
b/w, peak   960 MB/s    1200            1056

these are older buses; currently: 128 bits, 250 MHz+, DDR, several 10s of GB/s

I/O Processing

program controlled
• CPU explicitly manages all transfers
• high I/O overhead => big minus!

DMA - direct memory access
• DMA controller manages single block transfers

I/O processors (IOPs)
• processors dedicated to I/O operations
• capable of executing I/O programs
• may be special-purpose or general-purpose
Communicating with I/O processors

I/O control
• memory mapped
• ld/st to "special" addresses => operations occur
• protected by virtual memory
• I/O instructions
• special instructions initiate I/O operations
• protected by privileged instructions

I/O completion
• polling
• wait for a status bit to change
• periodic checking
• interrupt
• I/O completion interrupts the CPU
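A toy sketch of polling (the device class is a made-up stand-in for a memory-mapped status register, not a real driver API):

```python
import time

class ToyDevice:
    """Stand-in for a device with a memory-mapped status register."""
    DONE = 0x1
    def __init__(self, busy_s=0.01):
        self._ready_at = time.time() + busy_s
    def status(self):                        # stands in for a load from the register
        return self.DONE if time.time() >= self._ready_at else 0

def wait_for_completion(dev, poll_s=0.001):
    # polling: periodically check until the status bit changes
    while not (dev.status() & ToyDevice.DONE):
        time.sleep(poll_s)

wait_for_completion(ToyDevice())             # an interrupt would avoid this loop
```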

IBM 3990 I/O processing

channel == IOP

1 user program sets up a table in memory with the I/O request (pointer to a channel program), then executes a syscall

2 OS checks for protection, then executes the "start subchannel" instruction

3 pointer to the channel program is passed to the IOP; the IOP executes the channel program

4 IOP interacts with the storage director to execute individual channel commands; the IOP is free to do other work between channel commands

5 on completion, the IOP places status in memory and interrupts the CPU

High-Performance Disk Architectures

extensions to conventional disks

disk arrays

redundant arrays of inexpensive disks (RAIDs)
Extensions to conventional disks

fixed-head disk
• head per track; heads do not seek
• seek time eliminated
• rotational latency unchanged
• low track density
• not economical

solid state disks

parallel transfer disk
• read from multiple surfaces at the same time
• difficulty in locking onto different tracks on multiple surfaces
• lower-cost alternatives possible (disk arrays)

increasing disk density
• an on-going process
• requires increasingly sophisticated lock-on control
• increases cost

Extensions to conventional disks

disk caches
• RAM to buffer data between device and host
• fast writes - the buffer acts as a write buffer
• better utilization of the host-to-device path
• high miss rate increases request latency

disk scheduling
• schedule simultaneous I/O requests to reduce latency
• e.g., schedule the request with the shortest seek time
• works best for unlikely cases (long queues)

Disk Arrays

collection of individual disks
• each disk has its own arm/head

data distributions across four disks:

Independent      Fine-grain       Coarse-grain
A0 B0 C0 D0      A0 A0 A0 A0      A0 A1 A2 A3
A1 B1 C1 D1      A1 A1 A1 A1      A4 A5 A6 A7
A2 B2 C2 D2      A2 A2 A2 A2      |  |  |  |
                 A3 A3 A3 A3      B0 B1 B2 B3
                 A4 A4 A4 A4
                 |  |  |  |
                 B0 B0 B0 B0
                 B1 B1 B1 B1
                 B2 B2 B2 B2
Disk Arrays

independent addressing
• s/w or the user distributes the data
• load balancing is an issue

fine-grain striping
• stripe unit of one bit, one byte, or one sector
• #disks x stripe unit evenly divides the smallest accessible data
• perfect load balance; only one request served at a time
• effective transfer rate approx. N times better than a single disk
• access time can go up, unless the disks are synchronized

coarse-grain striping (see the sketch below)
• data transfer parallelism for large requests
• concurrency for small requests
• load balanced by statistical randomization

must consider the workload to determine stripe size
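A minimal sketch (names ours) of mapping a logical block to a disk under coarse-grain striping; fine-grain striping is the stripe_unit == 1 special case:

```python
def locate(block, n_disks, stripe_unit):
    """Map a logical block number to (disk, block-within-disk)."""
    stripe, offset = divmod(block, stripe_unit)
    disk = stripe % n_disks                           # round-robin across disks
    block_on_disk = (stripe // n_disks) * stripe_unit + offset
    return disk, block_on_disk

print(locate(13, n_disks=4, stripe_unit=4))           # (3, 1): disk 3, block 1
```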

Redundancy Mechanisms

disk failures are a significant fraction of hardware failures
• striping increases the # of corrupted files per failure

data replication
• disk mirroring
• allows multiple reads
• writes must be synchronized

parity protection
• use a parity disk

Redundant Array of Inexpensive Disks - RAID

arrays of small, cheap disks to provide high performance/reliability

D = # data disks, C = # check disks

level 1: mirrored disks (D=1, C=1)
• overhead too high

level 2: bit-interleaved array for soft errors (e.g., D=10, C=4)
• layout like ECC for DRAMs
• read all bits across groups
• merge updated bits with bits not updated; recompute parity
• rewrite the full group including checks
Redundant Array of Inexpensive Disks - RAID

level 3: hard error detection and parity (e.g., D=4, C=1)
• key: the failed disk is easily identified by the controller
• no need for a special code to identify the failed disk
• striped data - N data disks and 1 parity disk
• because the failed disk is known, parity is enough for recovery (see the sketch below)

level 4: intra-group parallelism
• coarse-grain striping
• like level 3 + the ability to do more than one small I/O at a time
• a write must update both the data disk and the parity disk

level 5: rotated parity to parallelize writes
• parity spread out across the disks in a group
• different parity updates go to different disks

level 6: two-dimensional array
• the data is arranged as a two-dimensional array
• with row and column parities
• tolerates more than 1 failure
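A minimal sketch of the level-3 recovery argument: the parity is the XOR of the data blocks, so XORing the parity with the surviving blocks reconstructs the known-failed one.

```python
from functools import reduce
from operator import xor

data = [0b1011, 0b0110, 0b1100, 0b0001]        # D = 4 data blocks
parity = reduce(xor, data)                      # C = 1 check block

failed = 2                                      # controller knows which disk died
survivors = [d for i, d in enumerate(data) if i != failed]
assert reduce(xor, survivors, parity) == data[failed]
```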

I/O Performance - Method 1

like the Iron Law, we can do simple calculations for I/O performance

better option: I/O is a shared resource and sees requests from many jobs, so if the jobs are independent enough, I/O requests will be random enough that we can use queuing theory (ECE 600, 547)

think of I/O as a queuing system
• requests enter the queue at a certain rate
• wait for service
• service takes a certain time
• requests leave the system at a certain rate
• we can calculate the response time for each request

I/O Performance

assume steady state => arrival rate == departure rate

Little's Law:
• rate = avg. # in system / avg. response time
• applies to any queue in equilibrium

[figure: arrivals enter a queue that feeds a server]
I/O Performance

total time in system = time in queue + time in service

total time is the response time - that's what matters

service rate = 1 / time to serve

length of system = length of queue + avg. # of jobs in service

utilization = arrival rate / service rate

note that Little's law can be applied to individual components
• server: # in server = arrival rate x time in service
• queue: queue length = arrival rate x time in queue
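A quick sketch of Little's law applied to the whole system and to each component (the rate and times are made-up illustration numbers):

```python
arrival_rate = 50.0     # requests/s; steady state, so departures match
t_service = 0.010       # s in service
t_queue = 0.030         # s in queue

print(arrival_rate * (t_queue + t_service))  # 2.0 requests in the system
print(arrival_rate * t_service)              # 0.5 in the server
print(arrival_rate * t_queue)                # 1.5 in the queue
```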

I/O Performance

for FIFO queues

time in system = queue length x time in service + residual service time

residual service time
• depends on the probability distribution of the service time
• exponential => memoryless property

avg. residual service time = 1/2 x mean x (1 + C)
• C - squared coefficient of variation
• C = variance / mean^2
• variance = E(X^2) - (E(X))^2

C - squared coefficient of variation
• = 1: exponential
• > 1: hyperexponential
• < 1: hypoexponential
I/O Performance

time in queue = queue length x service time + util x avg. residual time

time in queue = (service time x (1 + C) x util) / (2 x (1 - util))

if C = 1
• time in queue = service time x (util / (1 - util))
• which is why util should not get too high (see the sketch below)

avoid bottlenecks in the I/O system

designing an I/O system
• list the I/O devices
• list cost
• record the CPU demand
• list the memory or bus demand of each device
• determine the performance of each option
• simulation or queuing theory
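A sketch of the queueing-delay formula; the service time and utilizations below are illustrative, and the output shows why utilization must be kept well below 1:

```python
def time_in_queue(service_time, util, C=1.0):
    # time_in_queue = service_time * (1 + C) * util / (2 * (1 - util))
    assert 0 <= util < 1, "delay blows up as utilization approaches 1"
    return service_time * (1 + C) * util / (2 * (1 - util))

svc = 10.0  # ms
for util in (0.2, 0.5, 0.8, 0.95):
    print(util, round(time_in_queue(svc, util), 1))   # 2.5, 10, 40, 190 ms
```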

I/O Performance Bottleneck Analysis

choice of large or small disk drives - find the I/Os per second
• 500 MIPS CPU
• 16-byte, 100 ns memory
• 200 MB/s I/O bus - up to 20 SCSI buses and controllers
• 10000 instructions per I/O
• 16 KB per I/O

need to find the slowest component (the "weakest link")

I/O Performance

SCSI-2 strings - 20 MB/s with 15 disks per bus

SCSI-2 - 1 ms overhead per I/O

large 8-GB disks or small 2-GB disks
• both 7200 RPM, 8-ms avg. seek, 6 MB/s transfer

total storage = 200 GB
I/O Performance

CPU limit = 500 MIPS / 10000 instrs per I/O = 50000 IOPS

memory limit = (16 B / 100 ns) / 16 KB per I/O = 10000 IOPS

I/O bus limit = 200 MB/s / 16 KB = 12500 IOPS

memory limits performance to 10000 IOPS

SCSI-2 transfer = 16 KB / 20 MB/s = 0.8 ms

SCSI-2 limit = 1 / (1 ms + 0.8 ms) = 556 IOPS

disk performance
• I/O time = 8 ms + (0.5/7200) min + 16 KB / (6 MB/s) = 8 + 4.2 + 2.7 ≈ 14.9 ms
• disk limit = 1 / 14.9 ms = 67 IOPS

25 8-GB disks => 25 x 67 = 1675 IOPS

100 2-GB disks => 100 x 67 = 6700 IOPS

I/O Performance

minimum SCSI-2 buses for 25 8-GB disks = ceil(25/15) = 2

minimum SCSI-2 buses for 100 2-GB disks = ceil(100/15) = 7

max IOPS for 2 SCSI-2 = 2 x 556 = 1112

max IOPS for 7 SCSI-2 = 7 x 556 = 3892

SCSI strings have slightly less performance than their disks

number of disks per SCSI string at full b/w = 556/67 = 8

number of SCSI strings for 8-GB disks = ceil(25/8) = 4

number of SCSI strings for 2-GB disks = ceil(100/8) = 13

so we have
• 25 8-GB disks with 2 or 4 SCSI strings
• 100 2-GB disks with 7 or 13 SCSI strings

use cost to pick the best (the whole analysis is sketched below)
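A Python sketch of the whole analysis (decimal KB/MB, numbers from the slides); the achievable IOPS is the minimum over the component limits:

```python
from math import ceil

cpu_iops = 500e6 / 10_000                    # 500 MIPS, 10000 instrs per I/O
mem_iops = (16 / 100e-9) / 16e3              # 16 B / 100 ns memory, 16 KB per I/O
bus_iops = 200e6 / 16e3                      # 200 MB/s I/O bus

io_ms = 8 + 0.5 / 7200 * 60e3 + 16e3 / 6e6 * 1e3   # seek + half rotation + transfer
disk_iops = 1e3 / io_ms                      # ~67 IOPS per disk

scsi_ms = 1 + 16e3 / 20e6 * 1e3              # 1 ms overhead + 0.8 ms transfer
scsi_iops = 1e3 / scsi_ms                    # ~556 IOPS per string

for n_disks in (25, 100):                    # 8-GB vs 2-GB options for 200 GB
    strings = ceil(n_disks / 15)             # capacity-minimum strings
    print(n_disks, "disks:", round(min(cpu_iops, mem_iops, bus_iops,
                                       n_disks * disk_iops,
                                       strings * scsi_iops)))
```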
I/O Performance

we assumed 100% utilization for some of the components

but queuing delay worsens severely at high utilization

so we need to limit utilization - rules of thumb
• I/O bus < 75%
• disk string < 40%
• disk arm < 60%
• disk < 80%
• recalculate performance based on these limits

Unix File System Performance

cache files in memory
• memory is much faster than disks

the file cache is key to I/O performance
• OS parameters - cache size, write policy
• asynchronous writes => the processor continues
• coherence in client/server systems
