Input/Output & System Performance Issues
System Architecture = System components and how the components are connected (system interconnects)

Components of Total System Execution Time (or response time): CPU, Memory, I/O

[Diagram: CPU (control + datapath: registers, ALU, buses) connected over system interconnects to memory (instructions, data) and to the I/O subsystem (input, output).]
• Design factors:
  – I/O device characteristics (input, output, storage, etc.) and performance.
  – I/O connection structure (degree of separation from memory operations): Isolated I/O system architecture.

[Diagram: CPU connected over the Back Side Bus (BSB) to one or more levels of caches, and over the system bus to the Memory Controller (Chipset North Bridge) and the I/O subsystem.]

Thus, Two Types of System Interconnects/Buses:
  1 – CPU-Memory Bus or interconnect
  2 – I/O Buses/interfaces

EECC551 - Shaaban, Lec # 9, Fall 2012, 10-23-2012
Typical FSB-Based System Architecture
System Architecture = System Components + System Component Interconnects

CPU Core: 1 GHz - 3.8 GHz, 4-way superscalar, RISC or RISC-core (x86): deep instruction pipelines, dynamic scheduling, multiple FP/integer FUs, dynamic branch prediction, hardware speculation. All non-blocking caches:
  – L1: 16-128K, 1-2 way set associative (on chip), separate or unified
  – L2: 256K-2M, 4-32 way set associative (on chip), unified
  – L3: 2-16M, 8-32 way set associative (on or off chip, possibly on-chip), unified

Front Side Bus (FSB) / System Bus examples:
  – Alpha, AMD K7: EV6, 200-400 MHz
  – Intel PII, PIII: GTL+, 133 MHz
  – Intel P4: 800 MHz

Memory Bus (via the Memory Controller):
  – SDRAM PC100/PC133: 100-133 MHz system bus, 64-128 bits wide, 2-way interleaved, ~900 MB/s (64-bit)
  – Double Data Rate (DDR) SDRAM PC3200: 200 MHz DDR, 64-128 bits wide, 4-way interleaved, ~3.2 GB/s (64-bit)
  – RAMbus DRAM (RDRAM): 400 MHz DDR, 16 bits wide (32 banks), ~1.6 GB/s

Main I/O Bus (via the Bus Adapter):
  – PCI: 33-66 MHz, 32-64 bits wide, 133-528 MB/s
  – PCI-X: 133 MHz, 64 bits wide, 1066 MB/s
  – I/O controllers, NICs; I/O devices: disks, displays, keyboards, networks

Chipset (system logic): North Bridge and South Bridge (Isolated I/O Subsystem); Graphics/GPU and System Memory attach to the North Bridge.

Current System Architecture: Isolated I/O: separate memory (system) and I/O buses.
Thus, Two Types of System Interconnects/Buses:
  1 – CPU-Memory Bus or interconnect
  2 – I/O Buses/interfaces
Intel Pentium 4 and Core 2 System Architecture (Using The Intel 925 Chipset)
System Architecture = System Components + System Component Interconnects

– CPU (including cache) connects over the System Bus (Front Side Bus, FSB) to the Memory Controller Hub (Chipset North Bridge, part of the system core logic). FSB bandwidth usually should match or exceed that of main memory.
– Memory Controller Hub: two 8-byte DDR2 channels to System Memory; Graphics I/O Bus (PCI Express) to the Graphics/GPU.
– I/O Controller Hub (Chipset South Bridge, part of the system core logic): Storage I/O (Serial ATA), Main I/O Bus (PCI), misc. I/O interfaces, and the Basic Input Output System (BIOS).

Current System Architecture: Isolated I/O: separate memory and I/O buses.
Source: http://www.anandtech.com/showdoc.aspx?i=2088&p=4
Intel Core i7 “Nehalem” System Architecture
– Intel's QuickPath Interconnect (QPI): point-to-point system interconnect used instead of the Front Side Bus (FSB).
– Memory controllers integrated on the processor chip (three DDR3 channels to System Memory).
– Isolated I/O.

QuickPath Interconnect: Intel's first point-to-point interconnect, introduced in 2008 with the Nehalem architecture as an alternative to HyperTransport.
FSB = Front Side Bus (processor-memory bus or system bus)
Example CPU-Memory System Buses (Front Side Buses, FSBs)

  Bus                 Summit     Challenge  XDBus      SP         P4
  Originator          HP         SGI        Sun        IBM        Intel
  Clock Rate (MHz)    60         48         66         111        800
  Split transaction?  Yes        Yes        Yes        Yes        Yes
  Address lines       48         40         ??         ??         ??
  Data lines          128        256        144        128        64
  Clocks/transfer     4          5          4          ??         ??
  Peak (MB/s)         960        1200       1056       1700       6400
  Master              Multi      Multi      Multi      Multi      Multi
  Arbitration         Central    Central    Central    Central    Central
  Addressing          Physical   Physical   Physical   Physical   Physical
  Length              13 inches  12 inches  17 inches  ??         ??

FSB bandwidth matched with a single 8-byte channel of SDRAM.
Main System I/O Bus Example: PCI, PCI-Express

  Specification                Bus Width (bits)  Bus Frequency (MHz)  Peak Bandwidth (MB/sec)
  Legacy PCI:  PCI 2.3         32                33.3                 133
               PCI 2.3         64                33.3                 266
               PCI 2.3         64                66.6                 533
  PCI-Express
  (formerly Intel's 3GIO)      1-32              ???                  500-16,000
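The peak-bandwidth columns in the bus tables above mostly follow from bus width times clock rate (times transfers per clock for double-pumped buses). A minimal sketch; the function name is mine, not from the slides:

```python
def peak_mb_s(width_bits, clock_mhz, transfers_per_clock=1):
    """Peak bus bandwidth in MB/s = bytes per transfer x million transfers/sec.

    transfers_per_clock is 1 for single-data-rate buses, 2 for DDR-style buses.
    """
    bytes_per_transfer = width_bits / 8
    return bytes_per_transfer * clock_mhz * transfers_per_clock

# P4 FSB: 64 data lines at an effective 800 MT/s -> 6400 MB/s, as in the table.
print(peak_mb_s(64, 800))
# Legacy PCI: 32 bits at 33.3 MHz -> ~133 MB/s.
print(peak_mb_s(32, 33.3))
# DDR SDRAM PC3200: 64 bits, 200 MHz clock, 2 transfers/clock -> 3200 MB/s.
print(peak_mb_s(64, 200, 2))
```

Note that not every row reduces to this product: some buses carry ECC/parity on their data lines (e.g. XDBus lists 144 data lines but peaks as if 128 were used) or lose some cycles to protocol overhead.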
I/O Controller Architecture

[Diagram: the CPU (host processor) connects through the Chipset North Bridge (part of the system core logic) to the peripheral or main I/O bus (PCI, PCI-X, etc.) via the Chipset South Bridge; an I/O controller on that bus drives an I/O channel interface to the I/O devices (SCSI, IDE, USB, ...) over industry-standard interfaces.]

Time(workload) = Time(CPU) + Time(I/O) - Time(Overlap)

With no overlap, CPU time and I/O time simply add; with overlap of CPU processing time and I/O processing time, total workload time is reduced by the overlapped portion.
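The workload-time relation above can be sketched in a few lines. This is a direct transcription of the formula; the guard that overlap cannot exceed either component is my addition:

```python
def workload_time(t_cpu, t_io, t_overlap=0.0):
    """Time(workload) = Time(CPU) + Time(I/O) - Time(Overlap)."""
    # Overlap is time spent doing CPU work and I/O simultaneously, so it
    # cannot exceed either component on its own.
    assert t_overlap <= min(t_cpu, t_io), "overlap cannot exceed either component"
    return t_cpu + t_io - t_overlap

# No overlap: the components simply add.
print(workload_time(4.0, 3.0))        # 7.0
# I/O fully overlapped with CPU processing: total is just the CPU time.
print(workload_time(4.0, 3.0, 3.0))   # 4.0
```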
I/O: A System Performance Perspective
• CPU performance: improvement of ~60% per year.
• I/O performance, i.e. storage devices (hard drives), improves much more slowly, so I/O increasingly limits overall system performance.
System & I/O Performance Metrics: Response Time
• Response time measures how long a storage (or I/O) system, or the entire system, takes to complete a task.
• The use of DMA, I/O device queues, and multiple I/O devices servicing a queue may make throughput >> 1 / response time.
I/O Modeling: Producer-Server Model
Timesystem = time in system for a task = Response Time = Queuing Time + Service Time

[Diagram: a producer (e.g. the CPU) generates I/O tasks at an average arrival rate of r tasks/sec; tasks wait in a queue (queue wait time Tq) before the server (I/O device + controller) services each task with service time Tser.]

Throughput vs. Response Time: as utilization (AKA loading factor) rises, the queue is full most of the time and tasks spend more time in the queue, so response time grows.
Magnetic Disks
Characteristics:
• Diameter (form factor): 1.8in - 3.5in
• Rotational speed: 5,400 RPM - 15,000 RPM (current: 7,200 - 15,000 RPM)
• Tracks per surface.
• Sectors per track: outer tracks contain more sectors.
• Recording or areal density: Tracks/in x Bits/in = Bits/in^2. Current areal density ~ 500 Gbits/in^2.
• Cost per megabyte.
• Seek time: (2-12 ms) The time needed to move the read/write head arm. Reported values: minimum, maximum, average.
• Rotation latency or delay: (2-8 ms) The time for the requested sector to be under the read/write head (~ time for half a rotation).
• Transfer time: the time needed to transfer a sector of bits.
• Type of controller/interface: SCSI, EIDE (PATA, SATA)
• Disk controller delay or time.

• Average time to access a sector of data =
    average seek time + average rotational delay + transfer time + disk controller overhead
  (ignoring queuing time)
  Access time = average seek time + average rotational delay
Basic Disk Performance Example
• Given the following disk parameters:
  – Average seek time is 5 ms
  – Disk spins at 10,000 RPM
  – Transfer rate is 40 MB/sec
  – Controller overhead is 0.1 ms
  – Assume that the disk is idle, so no queuing delay exists.
• What is the average disk read or write service time for a 500-byte (0.5 KB) sector?

Ave. seek + ave. rotational delay (time for half a rotation) + transfer time + controller overhead
  = 5 ms + 0.5 / (10,000 RPM / 60) + 0.5 KB / (40 MB/s) + 0.1 ms
  = 5 + 3 + 0.0125 + 0.1 = 8.1125 ms (access time = seek + rotational delay = 8 ms)

The actual time to process the disk request, Tservice or Tser (disk service time for this request), is greater and may include CPU I/O processing time and queuing time.
Here: 1 KByte = 10^3 bytes, 1 MByte = 10^6 bytes, 1 GByte = 10^9 bytes.
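The service-time arithmetic can be checked directly; note that transferring 0.5 KB at 40 MB/s takes 0.0125 ms. A minimal sketch, with a function name of my own choosing:

```python
def disk_service_time_ms(seek_ms, rpm, sector_bytes, rate_mb_s, ctrl_ms):
    """Average sector service time = seek + half a rotation + transfer + controller."""
    half_rotation_ms = 0.5 * 60_000.0 / rpm               # 60,000 ms per minute
    transfer_ms = sector_bytes / (rate_mb_s * 1e6) * 1e3  # using 1 MB = 10^6 bytes
    return seek_ms + half_rotation_ms + transfer_ms + ctrl_ms

# Parameters from the example: 5 ms seek, 10,000 RPM, 500-byte sector,
# 40 MB/s transfer rate, 0.1 ms controller overhead.
print(disk_service_time_ms(5.0, 10_000, 500, 40.0, 0.1))  # 8.1125 ms
```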
Historic Perspective of Hard Drive Characteristics Evolution: Areal Density
Drive areal density has increased by a factor of 8.5 million since the first disk drive, IBM's RAMAC,
was introduced in 1957. Since 1991, the rate of increase in areal density has accelerated to 60% per year,
and since 1997 this rate has further accelerated to an incredible 100% per year.
Internal data transfer rate increase is influenced by the increase in areal density.
Historic Perspective of Hard Drive Characteristics Evolution: Access/Seek Time
[Figure: hard drive access/seek time evolution over the years.]
A Little Queuing Theory
[Diagram: a producer (the CPU, on behalf of the OS or user) sends tasks; tasks wait Tq in the queue before the I/O controller (IOC) and device service them.]

Tsys = Tq + Tser
• Given: an I/O system in equilibrium (input rate is equal to output rate) and:
  – Tser: average time to service a task = 1 / service rate
  – Tq: average time per task in the queue
  – Tsys: average time per task in the system, or the response time (ignoring CPU processing time and other system delays); the sum of Tser and Tq, thus Tsys = Tser + Tq
  – r: average number of arriving tasks/sec (i.e. task arrival rate)
  – Lser: average number of tasks in service
  – Lq: average length of queue
  – Lsys: average number of tasks in the system, the sum of Lq and Lser
  – u: server utilization; u must be between 0 and 1, otherwise more tasks would be arriving than could be serviced
Here a server is the device (i.e. hard drive) and its I/O controller (IOC).
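The averages above are tied together by Little's Law, L = r x T, applied to the queue, the server, and the whole system. A small sketch under those definitions; the function name is mine:

```python
def queue_lengths(r, t_ser, t_q):
    """Apply Little's Law (L = r x T) to each part of an I/O system in equilibrium."""
    u = r * t_ser                # server utilization; also Lser, avg tasks in service
    assert 0.0 <= u < 1.0, "input rate exceeds service rate: not in equilibrium"
    l_q = r * t_q                # Lq: average queue length
    l_sys = r * (t_q + t_ser)    # Lsys = Lq + Lser: average tasks in the system
    return u, l_q, l_sys

# Hypothetical values: 40 tasks/sec, Tser = 10 ms, Tq = 20 ms.
print(queue_lengths(40.0, 0.01, 0.02))  # u = 0.4, Lq = 0.8, Lsys = 1.2
```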
A Little Queuing Theory
FIFO System (Single Queue + Single Server)
[Diagram: tasks arrive at rate r tasks/sec, wait Tq in the queue, then are serviced by the I/O controller (IOC) and device with task service time Tser.]

We need to estimate the waiting time in queue, Timequeue = Tq:
  Timequeue = Lengthqueue x Timeserver + time for the server to complete the current task
  Time for the server to complete the current task = server utilization x remaining service time of the current task
  Lengthqueue = Arrival Rate x Timequeue (Little's Law)

Here a server is the device (i.e. hard drive) and its I/O controller (IOC).
The response time above does not account for other factors such as CPU time.
A Little Queuing Theory: Average Queue Wait Time Tq
For Single Queue + Single Server
• Calculating the average wait time in queue, Tq:
  – If something is at the server, it takes on average m1(z) = 1/2 x Tser x (1 + C^2) to complete.
  – Chance the server is busy = u; average delay is u x m1(z) = 1/2 x u x Tser x (1 + C^2)
  – All customers in line must complete; each takes on average Tser.
  Timequeue = time for the server to complete the current task + Lengthqueue x Timeserver
  Timequeue = average residual service time + Lengthqueue x Timeserver
A version of this derivation is in the textbook, page 385 (3rd Edition: page 726).
A Little Queuing Theory: M/G/1 and M/M/1
Single Queue + Single Server
• M/G/1: Tq = u x m1(z) / (1 - u) = Tser x u x (1 + C^2) / (2 x (1 - u))
• M/M/1: exponentially distributed service times give C^2 = 1, thus Tq = Tser x u / (1 - u)
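Combining the residual-service-time result m1(z) = 1/2 x Tser x (1 + C^2) with the derivation above gives the standard M/G/1 queue-wait formula, with M/M/1 as the special case C^2 = 1. A minimal sketch; function names are mine:

```python
def tq_mg1(t_ser, u, c2):
    """M/G/1 average queue wait: Tq = Tser * u * (1 + C^2) / (2 * (1 - u))."""
    return t_ser * u * (1.0 + c2) / (2.0 * (1.0 - u))

def tq_mm1(t_ser, u):
    """M/M/1: exponential service times mean C^2 = 1, so Tq = Tser * u / (1 - u)."""
    return tq_mg1(t_ser, u, 1.0)

# Exponential service (C^2 = 1) vs. deterministic service (C^2 = 0):
# deterministic service halves the queue wait at the same utilization.
print(tq_mm1(10.0, 0.4))        # ~6.67 ms
print(tq_mg1(10.0, 0.4, 0.0))   # ~3.33 ms
```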
I/O Queuing Performance: An M/M/1 Example
• Previous example with a faster disk: average disk service time Tser = 10 ms (changed from 20 ms to 10 ms).
• The processor still sends 40 disk I/O requests per second; requests & service times are exponentially distributed (i.e. C^2 = 1).
• On average:
  – How utilized is the disk, u?
  – What is the average time spent in the queue, Tq?
  – What is the average response time for a disk request, Tsys?
• We have:
  r     average number of arriving requests/second = 40
  Tser  average time to service a request = 10 ms (0.01 s)
• We obtain:
  u     server utilization: u = r x Tser = 40/s x 0.01 s = 0.4 or 40%
  Tq    average time/request in queue = Tser x u / (1 - u)
        = 10 x 0.4 / (1 - 0.4) = 10 x 0.4 / 0.6 = 6.67 ms (0.0067 s)
  Tsys  average time/request in system (i.e. mean response time): Tsys = Tq + Tser = 6.67 + 10 = 16.67 ms
• Response time is 100/16.67 = 6 times faster even though the new service time is only 2 times faster, due to the lower queuing time: 6.67 ms in queue instead of 80 ms.
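The example above, including the comparison with the slower 20 ms disk, can be reproduced with the M/M/1 formulas; the function name is my own:

```python
def mm1_metrics(r, t_ser):
    """Return (utilization, queue wait, response time) for an M/M/1 queue.

    r is the arrival rate in requests/sec, t_ser the service time in seconds.
    """
    u = r * t_ser
    assert 0.0 <= u < 1.0, "system not in equilibrium"
    t_q = t_ser * u / (1.0 - u)
    return u, t_q, t_q + t_ser

# Slower disk (Tser = 20 ms): u = 0.8, Tq = 80 ms, Tsys = 100 ms.
# Faster disk (Tser = 10 ms): u = 0.4, Tq = 6.67 ms, Tsys = 16.67 ms.
for t_ser in (0.020, 0.010):
    u, t_q, t_sys = mm1_metrics(40.0, t_ser)
    print(f"u={u:.1f}  Tq={t_q * 1e3:.2f} ms  Tsys={t_sys * 1e3:.2f} ms")
```

Note how halving the service time cuts the response time by a factor of 6: at u = 0.8 the disk spends most of its time with a backlog, so the queuing term dominates.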
Factors Affecting System & I/O Performance
• I/O processing computational requirements (CPU):
  – CPU computations available for I/O operations.
  – Operating system I/O processing policies/routines.
  – I/O data transfer/processing method used.
    • CPU cycles needed: Polling >> Interrupt Driven > DMA
• I/O subsystem performance (service time Tser, throughput):
  – Raw performance of I/O devices (i.e. magnetic disk performance).
  – I/O bus capabilities.

From a worked example with multiple disks (using the expression for Tq for M/M/m):
• The average I/O bandwidth is 4020 IOPS x 16 KB per I/O = 64.3 MB/sec
• Tq = Tser x u / [m (1 - u)] = 14.9 ms x 0.6 / [100 x 0.4] = 0.22 ms
• Response Time = Tser + Tq + Tcpu + Tmemory + Tscsi
Here: 1 KByte = 10^3 bytes, 1 MByte = 10^6 bytes, 1 GByte = 10^9 bytes.
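The two computed values in the multi-disk fragment above can be checked directly. A sketch under these assumptions: m = 100 disks is inferred from the [100 x 0.4] factor, the Tq expression is the simplified M/M/m form used on the slide (not the exact Erlang-C result), and the function names are mine:

```python
def tq_mmm_approx(t_ser_ms, u, m):
    """Simplified M/M/m queue wait used on the slide: Tq = Tser * u / (m * (1 - u))."""
    return t_ser_ms * u / (m * (1.0 - u))

def io_bandwidth_mb_s(iops, kb_per_io):
    """Aggregate I/O bandwidth: I/Os per second x bytes per I/O (1 KB = 10^3 bytes)."""
    return iops * kb_per_io * 1e3 / 1e6

# Tser = 14.9 ms, u = 0.6, m = 100 disks -> ~0.22 ms queue wait.
print(tq_mmm_approx(14.9, 0.6, 100))
# 4020 IOPS at 16 KB per I/O -> ~64.3 MB/s.
print(io_bandwidth_mb_s(4020, 16))
```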