UNIT-IV Memory And I/O

Memory Hierarchy Design


• Motivated by the principle of locality - A 90/10 type of rule
– Take advantage of 2 forms of locality
• Spatial - nearby references are likely
• Temporal - same reference is likely soon
• Also motivated by cost/performance structures
– Smaller hardware is faster: SRAM, DRAM, Disk, Tape
– Access vs. bandwidth variations
– Fast memory is more expensive
• Goal – Provide a memory system with cost almost as low as the cheapest level and speed
almost as fast as the fastest level

Levels in A Typical Memory Hierarchy

Sample Memory Hierarchy

Cache
• The first level of the memory hierarchy encountered once the address leaves the CPU
– Persistent mismatch between CPU and main-memory speeds

– Exploit the principle of locality by providing a small, fast memory between CPU
and main memory -- the cache memory
• The term cache is now applied whenever buffering is employed to reuse commonly occurring
items (e.g., file caches)
• Caching – copying information into faster storage system
– Main memory can be viewed as a cache for secondary storage

General Hierarchy Concepts


• At each level - block concept is present (block is the caching unit)
– Block size may vary depending on level
• Amortize longer access by bringing in larger chunk
• Works if locality principle is true
– Hit - access where the block is present; the hit rate is the fraction of accesses that hit
– Miss - access where the block is absent and must be fetched from a lower level - miss rate
• Mirroring and consistency
– Data residing in higher level is subset of data in lower level
– Changes at a higher level must eventually be reflected down to the lower level
• The policy governing when this happens is the consistency mechanism
• Addressing
– Whatever the organization you have to know how to get at it!
– Address checking and protection

Physical Address Structure


• Key is that you want different block sizes at different levels

Latency and Bandwidth

• The time required for the cache miss depends on both latency and bandwidth of the
memory (or lower level)
• Latency determines the time to retrieve the first word of the block
• Bandwidth determines the time to retrieve the rest of this block
• A cache miss is handled by hardware and causes processors following in-order execution
to pause or stall until the data are available

Predicting Memory Access Times


• On a hit: simple access time to the cache
• On a miss: access time + miss penalty
– Miss penalty = access time of lower + block transfer time

– Block transfer time depends on
• Block size - bigger blocks mean longer transfers
• Bandwidth between the two levels of memory
– Bandwidth usually dominated by the slower memory and the bus
protocol
• Performance
– Average-Memory-Access-Time = Hit-Access-Time + Miss-Rate * Miss-Penalty
– Memory-stall-cycles = IC * Memory-reference-per-instruction * Miss-Rate *
Miss-Penalty
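
A hedged worked instance of these two formulas follows; all of the numbers (1-cycle hit time, 5% miss rate, 100-cycle miss penalty, 1.5 memory references per instruction, 10^9 instructions) are assumed for illustration only.

/* Hedged worked example: AMAT and memory stall cycles with assumed parameters */
#include <stdio.h>

int main(void) {
    double hit_time = 1.0;         /* cycles, assumed */
    double miss_rate = 0.05;       /* assumed */
    double miss_penalty = 100.0;   /* cycles, assumed */
    double refs_per_instr = 1.5;   /* memory references per instruction, assumed */
    double instr_count = 1e9;      /* IC, assumed */

    double amat = hit_time + miss_rate * miss_penalty;   /* 1 + 0.05*100 = 6 cycles */
    double stalls = instr_count * refs_per_instr * miss_rate * miss_penalty;

    printf("AMAT = %.1f cycles, memory stall cycles = %.3g\n", amat, stalls);
    return 0;
}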

Headaches of Memory Hierarchies


• CPU never knows for sure if an access will hit
• How deep will a miss be - i. e. miss penalty
– If short then the CPU just waits
– If long then probably best to work on something else – task switch
• Implies that the amount can be predicted with reasonable accuracy
• Task switch better be fast or productivity/efficiency will suffer
• Implies some new needs
– More hardware accounting
– Software readable accounting information (address trace)

Four Standard Questions


• Block Placement
– Where can a block be placed in the upper level?
• Block Identification
– How is a block found if it is in the upper level?
• Block Replacement
– Which block should be replaced on a miss?
• Write Strategy
– What happens on a write?

Block Placement Options


• Direct Mapped
– (Block address) MOD (# of cache blocks)
• Fully Associative
– Can be placed anywhere
• Set Associative
– Set is a group of n blocks -- each block is called a way
– Block first mapped into a set → (Block address) MOD (# of cache sets)
– Placed anywhere in the set
– Most caches are direct mapped, 2- or 4-way set associative
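
A minimal sketch of these mapping rules, assuming a byte-addressed cache with a 64-byte block and power-of-two block and set counts; the sizes and the example address are illustrative only.

/* Hedged sketch: direct-mapped vs. set-associative block placement */
#include <stdio.h>

#define BLOCK_SIZE 64      /* bytes per block, assumed */
#define NUM_BLOCKS 1024    /* block frames in the cache, assumed */
#define NUM_SETS   256     /* sets in a 4-way set-associative cache of the same size, assumed */

int main(void) {
    unsigned long addr = 0x12345678UL;             /* example byte address, assumed */
    unsigned long block_addr = addr / BLOCK_SIZE;  /* block address */

    unsigned long dm_frame = block_addr % NUM_BLOCKS; /* direct mapped: exactly one frame */
    unsigned long sa_set   = block_addr % NUM_SETS;   /* set associative: any way within this set */

    printf("block address %lu -> direct-mapped frame %lu, set-associative set %lu\n",
           block_addr, dm_frame, sa_set);
    return 0;
}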

Block Replacement
• Random: just pick one and chuck it
– Simple hash game played on target block frame address

– Some use truly random
• But lack of reproducibility is a problem at debug time
• LRU - least recently used
– Need to keep time since each block was last accessed
• Expensive if the number of blocks is large due to the global compare
• Hence an approximation is often used, e.g., a use-bit tag per block, or LFU
• FIFO

Write Options
• Write through: write posted to cache line and through to next lower level
– Incurs write stall (use an intermediate write buffer to reduce the stall)
• Write back
– Only write to cache not to lower level
– Implies that cache and main memory are now inconsistent
• Mark the line with a dirty bit
• If this block is replaced and dirty then write it back
• Pros and cons → both are useful
– Write through
• No write on read miss, simpler to implement, no inconsistency with main
memory
– Write back
• Uses less main memory bandwidth, write times independent of main
memory speeds
• Multiple writes within a block require only one write to the main memory
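
A hedged sketch contrasting the two policies on a write hit, using a simplified cache-line structure; the field names and the write_lower_level() stub are illustrative, not taken from any particular design.

/* Hedged sketch: write-through vs. write-back behaviour on a write hit */
#include <stdbool.h>
#include <stdint.h>

struct cache_line {
    uint32_t data;
    bool     dirty;                      /* used only by the write-back policy */
};

static void write_lower_level(uint32_t data) { (void)data; /* stand-in for a memory write */ }

/* Write through: update the line and post the write to the next lower level (or a write buffer) */
static void write_through_hit(struct cache_line *line, uint32_t value) {
    line->data = value;
    write_lower_level(value);            /* every write also goes below */
}

/* Write back: update only the cache and mark the line dirty */
static void write_back_hit(struct cache_line *line, uint32_t value) {
    line->data = value;
    line->dirty = true;                  /* memory is now inconsistent until eviction */
}

/* On replacement, a dirty line must be written back before being reused */
static void evict(struct cache_line *line) {
    if (line->dirty)
        write_lower_level(line->data);
    line->dirty = false;
}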

Write Miss Options


• Two choices for implementation
– Write allocate – or fetch on write
• Load the block into cache, and then do the write in cache
• Usually the choice for write-back caches
– No-write allocate – or write around
• Modify the block where it is, but do not load the block in the cache
• Usually the choice for write-through caches
• Danger - goes against the locality principle grain
• But other delayed completion games are possible

Unified vs. Split Cache


• Instruction cache and data cache
• Unified cache
– Structural hazards when an instruction fetch and a load/store data access occur in the same cycle
• Split cache
– Most recent processors choose split cache
– Separate ports for instruction and data caches – double bandwidth
– Opportunity of optimizing each cache separately – different capacity, block sizes,
and associativity

5.3 Cache Performance

Improving Cache Performance


• Average-memory-access-time = Hit-time + Miss-rate * Miss-penalty
• Strategies for improving cache performance
– Reducing the miss penalty
– Reducing the miss rate
– Reducing the miss penalty or miss rate via parallelism
– Reducing the time to hit in the cache

5.4 Reducing Cache Miss Penalty


Techniques for Reducing Miss Penalty
• Multilevel Caches (the most important)
• Critical Word First and Early Restart
• Giving Priority to Read Misses over Writes
• Merging Write Buffer
• Victim Caches

1) Multi-Level Caches

• Probably the best miss-penalty reduction
• Performance measurement for 2-level caches
– AMAT = Hit-time-L1 + Miss-rate-L1 * Miss-penalty-L1
– Miss-penalty-L1 = Hit-time-L2 + Miss-rate-L2 * Miss-penalty-L2
– AMAT = Hit-time-L1 + Miss-rate-L1 * (Hit-time-L2 + Miss-rate-L2 * Miss-penalty-L2)
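
A hedged worked instance of the two-level formula above, with assumed rates and times; the numbers are illustrative only.

/* Hedged worked example: two-level cache AMAT with assumed parameters */
#include <stdio.h>

int main(void) {
    double hit_time_l1 = 1.0;        /* cycles, assumed */
    double miss_rate_l1 = 0.04;      /* also the global L1 miss rate, assumed */
    double hit_time_l2 = 10.0;       /* cycles, assumed */
    double miss_rate_l2 = 0.25;      /* local L2 miss rate, assumed */
    double miss_penalty_l2 = 100.0;  /* cycles to main memory, assumed */

    double miss_penalty_l1 = hit_time_l2 + miss_rate_l2 * miss_penalty_l2;  /* 10 + 0.25*100 = 35 */
    double amat = hit_time_l1 + miss_rate_l1 * miss_penalty_l1;             /* 1 + 0.04*35 = 2.4 */
    double global_l2 = miss_rate_l1 * miss_rate_l2;                         /* 1% of all accesses */

    printf("AMAT = %.2f cycles, global L2 miss rate = %.2f%%\n", amat, global_l2 * 100.0);
    return 0;
}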

• Definitions:
– Local miss rate: misses in this cache divided by the total number of memory
accesses to this cache (Miss-rate-L2)
– Global miss rate: misses in this cache divided by the total number of memory
accesses generated by CPU (Miss-rate-L1 x Miss-rate-L2)
– Global Miss Rate is what matters
• Advantages:
– Capacity misses in L1 end up with a significant penalty reduction since they
likely will get supplied from L2
• No need to go to main memory
– Conflict misses in L1 similarly will get supplied by L2

Comparing Local and Global Miss Rates


• Huge 2nd level caches
• Global miss rate close to single level cache rate provided L2 >> L1
• Global cache miss rate should be used when evaluating second-level caches (or 3rd, 4th,…
levels of hierarchy)
• L2 sees many fewer hits than L1, so the design target is to reduce misses

2) Critical Word First and Early Restart

• Do not wait for full block to be loaded before restarting CPU


– Critical Word First – request the missed word first from memory and send it to
the CPU as soon as it arrives; let the CPU continue execution while filling the rest
of the words in the block. Also called wrapped fetch and requested word first
– Early restart -- as soon as the requested word of the block arrives, send it to the
CPU and let the CPU continue execution
• Benefits of critical word first and early restart depend on
– Block size: generally useful only in large blocks
– Likelihood of another access to the portion of the block that has not yet been
fetched
• Spatial locality problem: programs tend to want the next sequential word anyway, so it is
not clear how much these techniques benefit

3) Giving Priority to Read Misses Over Writes

• In write through, write buffers complicate memory access in that they might hold the
updated value of location needed on a read miss
– RAW conflicts with main memory reads on cache misses
• Making the read miss wait until the write buffer is empty → increases the read miss penalty (by
50% on the old MIPS 1000 with a 4-word buffer)
• Check write buffer contents before read, and if no conflicts, let the memory access
continue
• Write Back?
– Read miss replacing dirty block
– Normal: Write dirty block to memory, and then do the read
– Instead copy the dirty block to a write buffer, then do the read, and then do the
write

– The CPU stalls less since it restarts as soon as the read is done

4) Merging Write Buffer


• An entry of the write buffer often contains multiple words; however, a write often involves a
single word
– A single-word write occupies a whole entry if there is no write merging
• Write merging: check whether the address of the new data matches the address of a valid
write-buffer entry; if so, the new data are combined with that entry
• Advantages
– Multi-word writes are usually faster than single-word writes
– Reduce the stalls due to the write buffer being full
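
A hedged sketch of the merge check, using a simplified 4-word buffer entry; the structure and field names are illustrative, not from a specific design.

/* Hedged sketch: merging a single-word write into a valid write-buffer entry */
#include <stdbool.h>
#include <stdint.h>

#define WORDS_PER_ENTRY 4            /* assumed entry width */

struct wb_entry {
    bool     valid;
    uint64_t block_addr;             /* aligned block this entry covers */
    bool     word_valid[WORDS_PER_ENTRY];
    uint32_t word[WORDS_PER_ENTRY];
};

/* Returns true if the write was absorbed into an existing entry (no new entry needed) */
bool try_merge(struct wb_entry *buf, int n_entries, uint64_t addr, uint32_t data) {
    uint64_t block = addr / (4 * WORDS_PER_ENTRY);   /* 4-byte words assumed */
    int      w     = (int)((addr / 4) % WORDS_PER_ENTRY);

    for (int i = 0; i < n_entries; i++) {
        if (buf[i].valid && buf[i].block_addr == block) {
            buf[i].word[w] = data;                   /* combine new data with that entry */
            buf[i].word_valid[w] = true;
            return true;
        }
    }
    return false;    /* caller allocates a new entry, or stalls if the buffer is full */
}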

Write-Merging Illustration

5) Victim Caches
• Remember what was just discarded in case it is needed again
• Add small fully associative cache (called victim cache) between the cache and the refill
path
– Contain only blocks discarded from a cache because of a miss
– Are checked on a miss to see if they have the desired data before going to the next
lower-level of memory
• If yes, swap the victim block and cache block
– Addressing both victim and regular cache at the same time
• The penalty will not increase
Victim Cache Organization

5.5 Reducing Miss Rate
Classify Cache Misses - 3 C’s
• Compulsory → independent of cache size
– First access to a block → no choice but to load it
– Also called cold-start or first-reference misses
– Measured with an (ideal) infinite cache
• Capacity → decreases as cache size increases
– The cache cannot contain all the blocks needed during execution, so blocks are
discarded and later retrieved
– Measured with a fully associative cache
• Conflict (Collision) → decreases as associativity increases
– Side effect of set-associative or direct mapping
– A block may be discarded and later retrieved if too many blocks map to the
same cache set

Miss Distributions vs. the 3 C’s (Total Miss Rate)

Techniques for Reducing Miss Rate
• Larger Block Size
• Larger Caches
• Higher Associativity
• Way Prediction and Pseudo-associative Caches
• Compiler optimizations

Larger Block Sizes


• Obvious advantage: reduces compulsory misses
– The reason is spatial locality
• Obvious disadvantages
– Higher miss penalty: a larger block takes longer to transfer
– May increase conflict misses, and capacity misses if the cache is small

Large Caches
• Help with both conflict and capacity misses
• May need longer hit time AND/OR higher HW cost
• Popular in off-chip caches

Higher Associativity
• 8-way set associative is, for practical purposes, as effective at reducing misses as fully
associative
• 2:1 rule of thumb
– A 2-way set-associative cache of size N/2 misses about as often as a direct-mapped cache of
size N (holds for cache sizes < 128 KB)
• Greater associativity comes at the cost of increased hit time

Way Prediction
• Extra bits are kept in cache to predict the way, or block within the set of the next cache
access
• Multiplexor is set early to select the desired block, and only a single tag comparison is
performed that clock cycle
• A miss results in checking the other blocks for matches in subsequent clock cycles
• Alpha 21264 uses way prediction in its 2-way set-associative instruction cache.
Simulation using SPEC95 suggested way prediction accuracy is in excess of 85%

Pseudo-Associative Caches
• Attempt to get the miss rate of set-associative caches and the hit speed of direct-mapped
cache
• Idea
– Start with a direct mapped cache
– On a miss check another entry
– Usual method is to invert the high order index bit to get the next try
• 010111 → 110111
• Problem - fast hit and slow hit
– May have the problem that you mostly need the slow hit

– In this case it is better to swap the contents of the blocks
• Drawback: CPU pipeline design is hard if a hit can take 1 or 2 cycles
– Better for caches not tied directly to the processor (L2)
– Used in the MIPS R10000 L2 cache; similar approach in UltraSPARC

Relationship Between a Regular Hit Time, Pseudo Hit Time and Miss Penalty

Compiler Optimization for Code


• Code can easily be rearranged without affecting correctness
• Reordering the procedures of a program might reduce instruction miss rates by reducing
conflict misses
• McFarling's observation using profiling information [1988]
– Reduce miss by 50% for a 2KB direct-mapped instruction cache with 4-byte
blocks, and by 75% in an 8KB cache
– Optimized programs on a direct-mapped cache missed less than unoptimized ones
on an 8-way set-associative cache of same size

Compiler Optimization for Data


• Idea – improve the spatial and temporal locality of the data
• Lots of options
– Array merging – Allocate arrays so that paired operands show up in same cache
block
– Loop interchange – Exchange inner and outer loop order to improve cache
performance
– Loop fusion – For independent loops accessing the same data, fuse these loops
into a single aggregate loop

Merging Arrays Example


/* Before: 2 sequential arrays */
int val[SIZE];
int key[SIZE];

/* After: 1 array of structures */
struct merge {
    int val;
    int key;
};
struct merge merged_array[SIZE];

Reducing conflicts between val & key; improve spatial locality

Loop Interchange Example


/* Before */
for (j = 0; j < 100; j = j+1)
    for (i = 0; i < 5000; i = i+1)
        x[i][j] = 2 * x[i][j];
/* After */
for (i = 0; i < 5000; i = i+1)
    for (j = 0; j < 100; j = j+1)
        x[i][j] = 2 * x[i][j];
Sequential accesses instead of striding through memory every 100 words; improve spatial
locality

Loop Fusion Example


/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        d[i][j] = a[i][j] + c[i][j];
/* After */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        a[i][j] = 1/b[i][j] * c[i][j];
        d[i][j] = a[i][j] + c[i][j];
    }

2 misses per access to a & c vs. one miss per access; improves temporal locality

5.7 Reducing Hit Time

Techniques for Reducing Hit Time


• Small and Simple Caches
• Avoid Address Translation during Indexing of the Cache
• Pipelined Cache Access
• Trace Caches

Small and Simple Caches

• A time-consuming portion of a cache hit: use the index portion to read the tag and then
compare it to the address
• Small caches – smaller hardware is faster
– Keep the L1 cache small enough to fit on the same chip as CPU
– Keep the tags on-chip, and the data off-chip for L2 caches
• Simple caches – direct-Mapped cache
– Trading hit time for increased miss-rate

• Small direct mapped misses more often than small associative caches
• But simpler structure makes the hit go faster

Virtual Addressed Caches


• Parallel rather than sequential access
– Physical addressed caches access the TLB to generate the physical address, then
do the cache access
• Avoid address translation during cache index
– Implies virtual addressed cache
– Address translation proceeds in parallel with cache index
• If translation indicates that the page is not mapped - then the result of the
index is not a hit
• Or if a protection violation occurs - then an exception results
• All is well when neither happen
• Too good to be true?

Paging Hardware with TLB

Problems with Virtual Caches


• Protection – necessary part of the virtual to physical address translation
– Copy protection information on a miss, add a field to hold it, and check it on
every access to virtually addressed cache.
• A task switch causes the same virtual address to refer to a different physical address
– Hence the cache must be flushed
• Creating huge task-switch overhead
• Also creating huge compulsory miss rates for the new process
– Use PIDs (process identifiers) as part of the tag to aid discrimination

Problems with Virtual Caches


• Synonyms or aliases
– OS and user code have different virtual addresses which map to the same
physical address (facilitates copy-free sharing)
– Two copies of the same data in a virtual cache → consistency issue
– Anti-aliasing (HW) mechanisms guarantee a single copy
• On a miss, check to make sure none match the PA of the data being fetched
(requires VA → PA translation); otherwise, invalidate
– SW can help - e.g. SUN’s version of UNIX
• Page coloring - aliases must have same low-order 18 bits
• I/O – uses PAs
– Requires mapping to VAs to interact with a virtual cache

Pipelining Writes for Fast Write Hits – Pipelined Cache


• Write hits usually take longer than read hits
– Tag must be checked before writing the data
• Pipelines the write
– 2 stages – Tag Check and Update Cache (can be more in practice)
– Current write tag check & previous write cache update
• Result
– Looks like a write happens on every cycle
– Cycle time can stay short since the real write is spread over two pipeline stages
– Mostly works if the CPU is not dependent on data from a write
• Spot any problems if read and write ordering is not preserved by the
memory system?
• Reads play no part in this pipeline since they already operate in parallel with the tag
check

Trace Caches
• Conventional caches limit the instructions in a static cache block to spatially contiguous ones
• A conventional cache block may be entered from and exited by a taken branch → the first and last
portions of a block go unused
– Taken branches or jumps occur about once every 5 to 10 instructions
• A 64-byte block holds 16 instructions → space-utilization problem
• A trace cache stores instructions only from the branch entry point to the exit of the trace →
avoids header and trailer overhead

• Complicated address mapping mechanism, as addresses are no longer aligned to power of
2 multiples of word size
• May store the same instructions multiple times in the I-cache
– Conditional branches making different choices result in the same instructions
being part of separate traces, which each occupy space in the cache
• Intel NetBurst (foundation of Pentium 4)

Cache Optimization Summary

5.9 Main Memory

Main Memory -- 3 important issues


• Capacity
• Latency
– Access time: time between when a read is requested and when the word arrives
– Cycle time: minimum time between requests to memory (> access time)
• Memory needs the address lines to be stable between accesses
– Amortize the latency by addressing big chunks - like an entire cache block
– Latency is critical to cache performance when the miss goes to main memory
• Bandwidth -- # of bytes read or written per unit time
– Affects the time it takes to transfer the block

Improving Main Memory Performance


• Simple:
– CPU, Cache, Bus, Memory same width (32 or 64 bits)
• Wide:
– CPU/Mux 1 word; Mux/Cache, Bus, Memory N words (Alpha: 64 bits & 256
bits; UltraSPARC 512 bits)
• Interleaved:
– CPU, Cache, Bus 1 word: Memory N Modules
(4 Modules); example is word interleaved

3 Examples of Bus Width, Memory Width, and Memory Interleaving to Achieve Memory
Bandwidth

Wider Main Memory
1) Doubling or quadrupling the width of the cache or memory will double or quadruple
the memory bandwidth
a. Miss penalty is reduced correspondingly
2) Cost and drawbacks
a. More cost for the memory bus
b. The multiplexer between the cache and the CPU may be on the critical path (the CPU
still accesses the cache one word at a time)
i. Multiplexors can be put between L1 and L2
c. The design of error correction becomes more complicated
i. If only a portion of the block is updated, all other portions must be read for
calculating the new error correction code
d. Since main memory is traditionally expandable by the customer, the minimum
increment is doubled or quadrupled

Simple Interleaved Memory


Bank_# = address MOD #_of_banks
Address_within_bank = Floor(Address / #_of_banks)

• Memory chips are organized into banks to read or write multiple words at a time, rather
than a single word
– Share address lines with a memory controller
– Keep the memory bus the same but make it run faster
– Take advantage of potential memory bandwidth of all DRAMs banks
– The banks are often one word wide
– Good for accessing consecutive memory locations
• Example miss penalty for a four-word block with four banks (4 CC to send the address, 56 CC per
word access, 4 CC per word transfer): 4 + 56 + 4 * 4 = 76 CC (0.4 bytes per CC)
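
A minimal sketch of the bank-mapping formulas above, assuming four one-word-wide banks; the loop bound and bank count are illustrative.

/* Hedged sketch: word-interleaved bank mapping with 4 banks */
#include <stdio.h>

#define NUM_BANKS 4    /* assumed */

int main(void) {
    for (unsigned long word_addr = 0; word_addr < 8; word_addr++) {
        unsigned long bank   = word_addr % NUM_BANKS;  /* Bank_# = address MOD #_of_banks */
        unsigned long offset = word_addr / NUM_BANKS;  /* Address_within_bank = floor(address / #_of_banks) */
        printf("word %lu -> bank %lu, offset %lu\n", word_addr, bank, offset);
    }
    return 0;
}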

Simple Interleaved Memory


• Interleaved memory is logically a wide memory, except that accesses to the banks are staged
over time to share the bus
• How many banks should be included?
– More than the # of CC it takes to access a word in a bank
• So that a new bank can deliver information each
clock for sequential accesses → avoid waiting
• Disadvantages
– Making multiple banks is expensive → larger chips mean fewer chips, hence fewer banks
• 512MB RAM
– 256 chips of 4M*4 bits → 16 banks of 16 chips
– 16 chips of 64M*4 bits → only 1 bank
– More difficulty in main-memory expansion (as with wider memory)

Independent Memory Banks
• Memory banks for independent accesses vs. faster sequential accesses (like wider or
interleaved memory)
– Multiple memory controllers
• Good for…
– Multiprocessor I/O
– CPU with Hit under n Misses, Non-blocking Cache

Memory Technology
DRAM Technology
• Semiconductor Dynamic Random Access Memory
• Emphasis is on cost per bit and capacity
• Multiplexed address lines → cut the # of address pins in half
– Row access strobe (RAS) first, then column access strobe (CAS)
– Memory as a 2D matrix – rows go to a buffer
– Subsequent CAS selects subrow
• Use only a single transistor to store a bit
– Reading that bit can destroy the information
– Refresh each bit periodically (e.g., every 8 milliseconds) by writing it back
• Keep the refresh time to less than 5% of the total time
• DRAM capacity is 4 to 8 times that of SRAM
Nowadays:
• DIMM: Dual inline memory module
– DRAM chips are commonly sold on small boards called DIMMs
– DIMMs typically contain 4 to 16 DRAMs
• Slowing down in DRAM capacity growth
– Four times the capacity every three years, for more than 20 years
– New chips only double capacity every two years, since 1998
• DRAM performance is growing at a slower rate
– RAS (related to latency): 5% per year
– CAS (related to bandwidth): 10%+ per year

SRAM Technology
• Cache uses SRAM: Static Random Access Memory
• SRAM uses six transistors per bit to prevent the information from being disturbed when
read → no need to refresh
– SRAM needs only minimal power to retain the charge in standby mode → good
for embedded applications
– No difference between access time and cycle time for SRAM
• Emphasis is on speed and capacity
– SRAM address lines are not multiplexed
• SRAM speed is 8 to 16x that of DRAM

ROM and Flash


• Embedded processor memory
• Read-only memory (ROM)

– Programmed at the time of manufacture
– Only a single transistor per bit to represent 1 or 0
– Used for the embedded program and for constants
– Nonvolatile and indestructible
• Flash memory:
– Nonvolatile but allow the memory to be modified
– Reads at almost DRAM speeds, but writes 10 to 100 times slower
– DRAM capacity per chip and MB per dollar is about 4 to 8 times greater than
flash

Improving Memory Performance in a Standard DRAM Chip


• Fast page mode: timing signals that allow repeated accesses to the row buffer without another row
access time
• Synchronous DRAM (SDRAM): add a clock signal to the DRAM interface, so that
repeated transfers do not bear the overhead of synchronizing with the controller
– Asynchronous DRAM involves overhead to sync with the controller
– Peak speed per memory module 800-1200 MB/sec in 2001
• Double data rate (DDR): transfer data on both the rising edge and falling edge of DRAM
clock signal
– Peak speed per memory module 1600—2400MB/sec in 2001

RAMBUS
• RAMBUS optimizes the interface between DRAM and CPU
• RAMBUS makes a single chip act more like a memory system than a memory
component
– Each chip has interleaved memory and high-speed interface
• 1st generation RAMBUS: RDRAM
– Replace RAS/CAS with a bus that allows other accesses over it between the
sending of the address and return of the data
– Each chip has four banks, each with their own row buffer
– A chip can return a variable amount of data from a single request, and even
perform its refresh
– Clock signal and transfer on both edges of its clock
– 300 MHz clock

Storage Systems
Motivation: Who Cares About I/O?
• CPU performance: doubles every 18 months
• I/O performance limited by mechanical delays (disk I/O)

Position of I/O in Computer Architecture – Past


• An orphan in the architecture domain
• I/O meant the non-processor and memory stuff
– Disk, tape, LAN, WAN, etc.
– Performance was not a major concern
• Devices characterized as:

– Extraneous, non-priority, infrequently used, slow
• Exception is swap area of disk
– Part of the memory hierarchy
– Hence part of system performance but you’re hosed if you use it often

Position of I/O in Computer Architecture – Now


• Trends – I/O is the bottleneck
– Communication is frequent
• Voice response & transaction systems, real-time video
• Multimedia expectations
– Even standard networks come in gigabit/sec flavors
– For multi-computers
• Result
– Significant focus on system bus performance
• Common bridge to the memory system and the I/O systems
• Critical performance component for the SMP server platforms

System vs. CPU Performance


• Care about speed at which user jobs get done
– Throughput - how many jobs/time (system view)
– Latency - how quick for a single job (user view)
– Response time – time between when a command is issued and results appear (user
view)
• CPU performance main factor when:
– Job mix fits in the memory → there are very few page faults
• I/O performance main factor when:
– The job is too big for the memory - paging is dominant
– When the job reads/writes/creates a lot of unexpected files
• OLTP – Decision support -- Database
– And then there is graphics & specialty I/O devices

System Performance
• Depends on many factors in the worst case
– CPU
– Compiler
– Operating System
– Cache
– Main Memory
– Memory-IO bus
– I/O controller or channel
– I/O drivers and interrupt handlers
– I/O devices: there are many types
• Level of autonomous behavior
• Amount of internal buffer capacity
• Device specific parameters for latency and throughput

I/O Systems

Keys to a Balanced System


• It’s all about overlap - I/O vs CPU
– Time_workload = Time_CPU + Time_I/O - Time_overlap
• Consider the benefit of just speeding up one
– Amdahl’s Law (see P4 as well)
• Latency vs. Throughput

I/O System Design Considerations


• Depends on type of I/O device
– Size, bandwidth, and type of transaction
– Frequency of transaction
– Defer vs. do now
• Appropriate memory bus utilization
• What should the controller do
– Programmed I/O
– Interrupt vs. polled
– Priority or not
– DMA
– Buffering issues - what happens on over-run
– Protection
– Validation

Types of I/O Devices


• Behavior
– Read, Write, Both
– Once, multiple
– Size of average transaction
– Bandwidth
– Latency
• Partner - the speed of the slowest link theory
– Human operated (interactive or not)
– Machine operated (local or remote)

Some I/O Device Characteristics

Is I/O Important?
• Depends on your application
– Business - disks for file system I/O
– Graphics - graphics cards or special co-processors
– Parallelism - the communications fabric
• Our focus = mainline uniprocessing
– Storage subsystems (Chapter 7)
– Networks (Chapter 8)
• Noteworthy Point
– The traditional orphan
– But now often viewed more as a front line topic

Types of Storage Devices


Magnetic Disks
• 2 important Roles
– Long term, non-volatile storage – file system and OS
– Lowest level of the memory hierarchy
• Most of the virtual memory is physically resident on the disk
• Long viewed as a bottleneck
– Mechanical system à slow
– Hence they seem to be an easy target for improved technology
– Disk improvement w.r.t. density has done better than Moore’s law

Disks are organized into platters, tracks, and sectors

Physical Organization Options
• Platters – one or many
• Density - fixed or variable
– (Do all tracks have the same number of sectors?)
• Organization - sectors, cylinders, and tracks
– Actuators - 1 or more
– Heads - 1 per track or 1 per actuator
– Access - seek time vs. rotational latency
• Seek related to distance but not linearly
• Typical rotation: 3600 RPM or 15000 RPM
• Diameter – 1.0 to 3.5 inches

Typical Physical Organization


• Multiple platters
– Metal disks covered with magnetic recording material on both sides
• Single actuator (since they are expensive)
– Single R/W head per arm
– One arm per surface
– All heads therefore over same cylinder
• Fixed sector size
• Variable density encoding
• Disk controller – usually a built-in processor + buffering

Anatomy of a Read Access


• Steps
– Memory mapped I/O over bus to controller
– Controller starts access
– Seek + rotational latency wait
– Sector is read and buffered (validity checked)
– Controller says ready or DMA’s to main memory and then says ready

Access Time
• Access Time
– Seek time: time to move the arm over the proper track
• Very non-linear: accelerate and decelerate times complicate
– Rotational latency (delay): time for the requested sector to rotate under the head (on
average, half a rotation = 0.5 / (RPM/60) seconds)
– Transfer time: time to transfer a block of bits (typically a sector) under the read-
write head
– Controller overhead: the overhead the controller imposes in performing an I/O
access
– Queuing delay: time spent waiting for a disk to become free

Average access time = Average seek time + Average rotational delay + Transfer time +
Controller overhead + Queuing delay
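
A hedged worked instance of this formula with assumed, illustrative parameters (5 ms average seek, 10,000 RPM, 0.5 KB sector, 40 MB/sec transfer rate, 0.1 ms controller overhead, no queuing delay).

/* Hedged worked example: average disk access time with assumed parameters */
#include <stdio.h>

int main(void) {
    double seek_ms       = 5.0;                                   /* average seek time, assumed */
    double rpm           = 10000.0;                               /* assumed */
    double rotation_ms   = 0.5 / (rpm / 60.0) * 1000.0;           /* half a rotation = 3 ms */
    double sector_kb     = 0.5;                                   /* sector size, assumed */
    double transfer_ms   = sector_kb / (40.0 * 1000.0) * 1000.0;  /* 40 MB/sec -> ~0.0125 ms */
    double controller_ms = 0.1;                                   /* assumed */
    double queuing_ms    = 0.0;                                   /* assumed idle disk */

    double access_ms = seek_ms + rotation_ms + transfer_ms + controller_ms + queuing_ms;
    printf("average access time = %.2f ms\n", access_ms);         /* about 8.1 ms */
    return 0;
}
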
Cost vs. Performance
• Large-diameter drives have much more data over which to amortize the cost of electronics → lowest
cost per GB
• Higher sales volume → lower manufacturing cost
• The 3.5-inch drive, the largest surviving drive in 2001, also has the highest sales volume, so it
unquestionably has the best price per GB

Disk Alternatives

• Optical Disks
– Optical compact disks (CD) – 0.65GB
– Digital video discs, digital versatile disks (DVD) – 4.7GB * 2 sides
– Rewritable CD (CD-RW) and write-once CD (CD-R)
– Rewritable DVD (DVD-RAM) and write-once DVD (DVD-R)
• Robotic Tape Storage
• Optical Juke Boxes
• Tapes – DAT, DLT
• Flash memory
– Good for embedded systems
– Nonvolatile storage and rewritable ROM

Bus – Connecting I/O Devices to CPU/Memory


I/O Connection Issues
• Shared communication link between subsystems
– Typical choice is a bus

• Advantages
– Shares a common set of wires and protocols → low cost
– Often based on a standard - PCI, SCSI, etc. → portability and versatility
• Disadvantages
– Poor performance
– Multiple devices imply arbitration and therefore contention
– Can be a bottleneck

I/O Connection Issues – Multiple Buses


• I/O bus
– Lengthy
– Many types of connected devices
– Wide range in device bandwidth
– Follow a bus standard
– Accept devices varying in latency and bandwidth capabilities
• CPU-memory bus
– Short
– High speed
– Match to the memory system to maximize CPU-memory bandwidth
– The designer knows all the types of devices that must connect together

Typical Bus Synchronous Read Transaction

Bus Design Decisions


• Other things to standardize as well
– Connectors
– Voltage and current levels
– Physical encoding of control signals
– Protocols for good citizenship

• Bus master: devices that can initiate a R/W transaction

– Multiple masters: multiple CPUs and I/O devices can initiate bus transactions
– Multiple bus masters need arbitration (fixed priority or random)
• Split transaction for multiple masters
– Use packets for the full transaction (does not hold the bus)
– A read transaction is broken into read-request and memory-reply transactions
– Make the bus available for other masters while the data is read/written from/to the
specified address
– Transactions must be tagged
– Higher bandwidth, but also higher latency

Split Transaction Bus

• Clocking: Synchronous vs. Asynchronous


– Synchronous
• Include a clock in the control lines, and a fixed protocol for address and
data relative to the clock
• Fast and inexpensive (little or no logic to determine what's next)
• Everything on the bus must run at the same clock rate
• Short length (due to clock skew)
• CPU-memory buses
– Asynchronous
• Easier to connect a wide variety of devices, and lengthen the bus
• Scale better with technological changes
• I/O buses

Synchronous or Asynchronous?

Standards
• The Good
– Let the computer and I/O-device designers work independently
– Provides a path for second party (e.g. cheaper) competition
• The Bad
– Become major performance anchors
– Inhibit change
• How to create a standard
– Bottom-up
• A company tries to get the standards committee to approve its latest philosophy
in hopes that it will get the jump on the others (e.g. S bus, PC-AT bus, ...)
• De facto standards
– Top-down
• Design by committee (PCI, SCSI, ...)

Connecting the I/O Bus


• To main memory
– The I/O bus and the CPU-memory bus may be the same
• I/O commands on the bus could interfere with the CPU’s access to memory
– Since cache misses are rare, does not tend to stall the CPU
– Problem is lack of coherency
– Currently, we consider this case
• To cache
• Access
– Memory-mapped I/O or distinct instruction (I/O opcodes)
• Interrupt vs. Polling
• DMA or not
– Autonomous control allows overlap and latency hiding
– However there is a cost impact

A typical interface of I/O devices and an I/O bus to the CPU-memory bus

Processor Interface Issues


• Processor interface
– Interrupts
– Memory mapped I/O
• I/O Control Structures
– Polling
– Interrupts
– DMA
– I/O Controllers
– I/O Processors
• Capacity, Access Time, Bandwidth
• Interconnections
– Busses

I/O Controller

Memory Mapped I/O

• Single memory & I/O bus - no separate I/O instructions; device registers and buffers are reached
with ordinary loads and stores
• [Figure: CPU with cache and L2 on the memory bus; a bus adaptor bridges to the I/O bus, which
connects ROM, RAM, and peripheral interfaces]

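A hedged C sketch of how a driver might touch a memory-mapped device register; the register addresses and the status bit are entirely hypothetical, for illustration only.

/* Hedged sketch: memory-mapped I/O - device registers accessed with ordinary loads/stores */
#include <stdint.h>

#define UART_BASE   0x10000000UL                         /* hypothetical device address, assumed */
#define UART_STATUS (*(volatile uint32_t *)(UART_BASE + 0x0))
#define UART_TXDATA (*(volatile uint32_t *)(UART_BASE + 0x4))
#define TX_READY    0x1u                                 /* hypothetical status bit, assumed */

/* Send one byte; no special I/O instructions are needed */
void uart_putc(char c) {
    while ((UART_STATUS & TX_READY) == 0)
        ;                                                /* spin until the device is ready */
    UART_TXDATA = (uint32_t)c;                           /* store to the mapped data register */
}
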
Programmed I/O
• Polling
• The I/O module performs the action on behalf of the processor
• But the I/O module does not interrupt the CPU when the I/O is done
• Processor is kept busy checking status of I/O module
– not an efficient way to use the CPU unless the device is very fast!
• Byte by Byte…
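
A hedged polling-loop sketch; the device registers and status bit are hypothetical. The busy-wait per byte is exactly what makes programmed I/O an inefficient use of the CPU.

/* Hedged sketch: programmed I/O - the CPU polls status and moves data byte by byte */
#include <stddef.h>
#include <stdint.h>

#define DEV_STATUS (*(volatile uint32_t *)0x20000000UL)  /* hypothetical register, assumed */
#define DEV_DATA   (*(volatile uint32_t *)0x20000004UL)  /* hypothetical register, assumed */
#define DATA_READY 0x1u                                  /* hypothetical status bit, assumed */

void read_block_polled(uint8_t *buf, size_t len) {
    for (size_t i = 0; i < len; i++) {
        while ((DEV_STATUS & DATA_READY) == 0)
            ;                            /* the CPU busy-waits for every single byte */
        buf[i] = (uint8_t)DEV_DATA;      /* each byte passes through the processor */
    }
}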

Interrupt-Driven I/O
• The processor is interrupted when the I/O module is ready to exchange data
• Processor is free to do other work
• No needless waiting
• Consumes a lot of processor time because every word read or written passes through the
processor and requires an interrupt
• Interrupt per byte

Direct Memory Access (DMA)


• CPU issues request to a DMA module (separate module or incorporated into I/O module)
• DMA module transfers a block of data directly to or from memory (without going
through CPU)
• An interrupt is sent when the task is complete
– Only one interrupt per block, rather than one interrupt per byte
• The CPU is only involved at the beginning and end of the transfer
• The CPU is free to perform other tasks during data transfer

Reliability, Availability, and Dependability
Dependability, Faults, Errors, and Failures
Computer system dependability is the quality of delivered service such that reliance
can justifiably be placed on this service. The service delivered by a system is its
observed actual behavior as perceived by other system(s) interacting with this
system's users.
Each module also has an ideal specified behavior, where a service specification is an
agreed description of the expected behavior.
A system failure occurs when the actual behavior deviates from the specified
behavior. The failure occurred because of an error, a defect in that module. The cause
of an error is a fault.
When a fault occurs, it creates a latent error, which becomes effective when it is
activated; when the error actually affects the delivered service, a failure occurs. The
time between the occurrence of an error and the resulting failure is the error latency.
Thus, an error is the manifestation in the system of a fault, and a failure is the
manifestation on the service of an error.

Faults, Errors, and Failures


• A fault creates one or more latent errors
• The properties of errors are
– A latent error becomes effective once activated
– An error may cycle between its latent and effective states
– An effective error often propagates from one component to another, thereby
creating new errors.
• A component failure occurs when the error affects the delivered service.
• These properties are recursive and apply to any component

Example of Faults, Errors, and Failures
• Example 1
– A programming mistake: fault
– The consequence is an error or latent error
– Upon activation, the error becomes effective
– When this effective error produces erroneous data that affect the delivered
service, a failure occurs
• Example 2
– An alpha particle hitting a DRAM → fault
– It changes the memory → latent error
– Affected memory word is read → effective error
– The effective error produces erroneous data that affect the delivered service
→ failure (if ECC corrected the error, a failure would not occur)
Service Accomplishment and Interruption
• Service accomplishment: service is delivered as specified
• Service interruption: delivered service is different from the specified service
• Transitions between these two states are caused by failures or restorations
Measure Reliability And Availability
• Reliability: measure of the continuous service accomplishment from a reference initial
instant
– Mean time to failure (MTTF)
– The reciprocal of MTTF is a rate of failures
– Service interruption is measured as mean time to repair (MTTR)
• Availability: measure of the service accomplishment w.r.t the alternation between the
above-mentioned two states
– Measured as: MTTF/(MTTF + MTTR)
– Mean time between failure = MTTF+ MTTR
Example
• A disk subsystem
– 10 disks, each rated at 1,000,000-hour MTTF
– 1 SCSI controller, 500,000-hour MTTF
– 1 power supply, 200,000-hour MTTF
– 1 fan, 200,000-hour MTTF
– 1 SCSI cable, 1,000,000-hour MTTF
• Component lifetimes are exponentially distributed (the component’s age does not affect its
probability of failure), and failures are independent
• Failure rate = 10 * 1/1,000,000 + 1/500,000 + 1/200,000 + 1/200,000 + 1/1,000,000
= (10 + 2 + 5 + 5 + 1) / 1,000,000 = 23 / 1,000,000 failures per hour
• MTTF = 1 / Failure rate = 1,000,000 / 23 ≈ 43,500 hours (≈ 5 years)
Cause of Faults
• Hardware faults: devices that fail
• Design faults: faults in software (usually) and hardware design (occasionally)

• Operation faults: mistakes by operations and maintenance personnel
• Environmental faults: fire, flood, earthquake, power failure, and sabotage

Classification of Faults
• Transient faults exist for a limited time and are not recurring
• Intermittent faults cause a system to oscillate between faulty and fault-free
operation
• Permanent faults do not correct themselves with the passing of time

Reliability Improvements
• Fault avoidance: how to prevent, by construction, fault occurrence
• Fault tolerance: how to provide, by redundancy, service complying with the
service specification in spite of faults having occurred or that are occurring
• Error removal: how to minimize, by verification, the presence of latent errors
• Error forecasting: how to estimate, by evaluation, the presence, creation, and
consequences of errors

RAID: Redundant Arrays of Inexpensive Disks


3 Important Aspects of File Systems
• Reliability – is anything broken?
– Redundancy is main hack to increased reliability
• Availability – is the system still available to the user?
– When a single point of failure occurs, is the rest of the system still usable?
– ECC and various correction schemes help (but cannot improve reliability)
• Data Integrity
– You must know exactly what is lost when something goes wrong
Disk Arrays
• Multiple arms improve throughput, but do not necessarily improve latency
• Striping
– Spreading data over multiple disks
• Reliability
– General metric is N devices have 1/N reliability
• Rule of thumb: MTTF of a disk is about 5 years
– Hence need to add redundant disks to compensate
• MTTR ::= mean time to repair (or replace) (hours for disks)
• If MTTR is small then the array’s MTTF can be pushed out significantly
with a fairly small redundancy factor
Data Striping
• Bit-level striping: split the bits of each byte across multiple disks
– The no. of disks can be a multiple of 8, or a divisor of 8
• Block-level striping: blocks of a file are striped across multiple disks; with n disks, block
i goes to disk (i mod n) + 1 (see the sketch after this list)
• Every disk participates in every access
– The number of I/Os per second is the same as for a single disk
– The amount of data per second is improved
• Provides high data-transfer rates, but does not improve reliability
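
A minimal sketch of the block-level mapping just described, assuming n = 4 data disks; the counts are illustrative.

/* Hedged sketch: block-level striping - file block i maps to disk (i mod n) + 1 */
#include <stdio.h>

#define NUM_DISKS 4    /* n, assumed */

int main(void) {
    for (int block = 0; block < 8; block++) {
        int disk   = (block % NUM_DISKS) + 1;   /* disks numbered 1..n as in the text */
        int stripe = block / NUM_DISKS;         /* position of the block on that disk */
        printf("file block %d -> disk %d, stripe %d\n", block, disk, stripe);
    }
    return 0;
}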

Redundant Arrays of Disks


• Files are "striped" across multiple disks
• Availability is improved by adding redundant disks
– If a single disk fails, the lost information can be reconstructed from redundant
information
– Capacity penalty to store redundant information
– Bandwidth penalty to update
• RAID
– Redundant Arrays of Inexpensive Disks
– Redundant Arrays of Independent Disks

Raid Levels, Reliability, Overhead

RAID Levels 0 – 1
• RAID 0 – No redundancy (Just block striping)
– Cheap but unable to withstand even a single failure
• RAID 1 – Mirroring
– Each disk is fully duplicated onto its "shadow"
– Files are written to both; if one fails, flag it and get data from the mirror
– Reads may be optimized – use the disk delivering the data first
– Bandwidth sacrifice on write: logical write = two physical writes
– Most expensive solution: 100% capacity overhead
– Targeted for high I/O rate, high-availability environments
• RAID 0+1 – stripe first, then mirror the stripe
• RAID 1+0 – mirror first, then stripe the mirror

RAID Levels 2 & 3


• RAID 2 – Memory style ECC
– Cuts down number of additional disks
– Actual number of redundant disks will depend on correction model
– RAID 2 is not used in practice

• RAID 3 – Bit-interleaved parity
– Reduce the cost of higher availability to 1/N (N = # of disks)
– Use one additional redundant disk to hold parity information
– Bit interleaving allows corrupted data to be reconstructed
– Interesting trade off between increased time to recover from a failure and cost
reduction due to decreased redundancy
– Parity = sum of the corresponding blocks on all data disks (modulo 2)
• Hence all disks must be accessed on a write – potential bottleneck
– Targeted for high bandwidth applications: Scientific, Image Processing

RAID Level 3: Parity Disk

RAID Levels 4 & 5 & 6


• RAID 4 – Block interleaved parity
– Similar idea as RAID 3 but sum is on a per block basis
– Hence only the parity disk and the target disk need be accessed
– Problem still with concurrent writes since parity disk bottlenecks
• RAID 5 – Block interleaved distributed parity
– Parity blocks are interleaved and distributed on all disks
– Hence parity blocks no longer reside on same disk

– The probability of write collisions on a single drive is reduced
– Hence higher performance in the concurrent-write situation
• RAID 6
– Similar to RAID 5, but stores extra redundant information to guard against
multiple disk failures

Raid 4 & 5 Illustration


Small Write Update on RAID 3

Small Writes Update on RAID 4/5


1 Logical Write = 2 Physical Reads + 2 Physical Writes
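
A hedged sketch of why a small write on RAID 4/5 costs two reads and two writes: the new parity can be computed from the old data, old parity, and new data with XOR. The block size and function name are illustrative only.

/* Hedged sketch: RAID 4/5 small-write update via parity XOR */
#include <stddef.h>
#include <stdint.h>

#define BLOCK_BYTES 512    /* assumed block size */

/* new parity = old data XOR old parity XOR new data */
void small_write_update(const uint8_t old_data[BLOCK_BYTES],
                        const uint8_t old_parity[BLOCK_BYTES],
                        const uint8_t new_data[BLOCK_BYTES],
                        uint8_t new_parity[BLOCK_BYTES]) {
    for (size_t i = 0; i < BLOCK_BYTES; i++)
        new_parity[i] = old_data[i] ^ old_parity[i] ^ new_data[i];
    /* The controller reads the old data and old parity (2 physical reads), then writes the
       new data and new parity (2 physical writes) - matching the caption above. */
}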

I/O Performance Measures


• Some similarities with CPU performance measures
– Bandwidth - 100% utilization is maximum throughput
– Latency - often called response time in the I/O world
• Some unique
– Diversity - what types can be connected to the system
– Capacity - how many and how much storage on each unit
• Usual relationship between Bandwidth & Latency

Latency VS. Throughput

• Response time (latency): the time a task takes from the moment it is placed in the buffer
until the server finishes the task
• Throughput: the average number of tasks completed by the server over a time period
• Knee of the curve (L VS. T): the area where a little more throughput results in much
longer response time, or, a little shorter response time results in much lower throughput

Transaction Model
• In an interactive environment, faster response time is important
• Impact of inherent long latency
• Transaction time: sum of 3 components
– Entry time - time it takes user (usually human) to enter command
– System response time - command entry to response out
– Think time - user reaction time between response and next entry

7.11 Designing an I/O System


I/O Design Complexities
• Huge variety of I/O devices
– Latency
– Bandwidth
– Block size
• Expansion is a must – longer buses, larger power and cabinets
• Balanced Performance and Cost
• Yet another n-dimensional conflicting-constraint problem
– Yep - it’s NP hard just like all the rest
– Experience plays a big role since the solutions are heuristic

7 Basic I/O Design Steps


• List types of I/O devices and buses to be supported
• List physical requirements of I/O devices
– Volume, power, bus slots, expansion slots or cabinets, ...
• List cost of each device and associated controller
• List the reliability of each I/O device
• Record CPU resource demands - e.g. cycles
– Start, support, and complete I/O operation
– Stalls due to I/O waits
– Overhead - e.g. cache flushes and context switches
• List memory and bus bandwidth demands
• Assess the performance of different ways to organize I/O devices
– Of course you’ll need to get into queuing theory to get it right
