UNIT-IV Memory and I/O
Cache
• The first level of the memory hierarchy encountered once the address leaves the CPU
– Persistent mismatch between CPU and main-memory speeds
– Exploit the principle of locality by providing a small, fast memory between CPU
and main memory -- the cache memory
• Caching is now applied wherever buffering is employed to reuse commonly occurring items (e.g. file caches)
• Caching – copying information into faster storage system
– Main memory can be viewed as a cache for secondary storage
• The time required for the cache miss depends on both latency and bandwidth of the
memory (or lower level)
• Latency determines the time to retrieve the first word of the block
• Bandwidth determines the time to retrieve the rest of this block
• A cache miss is handled by hardware and causes processors following in-order execution
to pause or stall until the data are available
– Block transfer time depends on
• Block size - bigger blocks mean longer transfers
• Bandwidth between the two levels of memory
– Bandwidth usually dominated by the slower memory and the bus
protocol
• Performance
– Average-Memory-Access-Time = Hit-Time + Miss-Rate * Miss-Penalty
– Memory-stall-cycles = IC * Memory-references-per-instruction * Miss-Rate * Miss-Penalty
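As a quick numeric check of these two formulas, the sketch below plugs in illustrative values; the hit time, miss rate, miss penalty, instruction count, and references per instruction are assumptions, not figures from the text.

    #include <stdio.h>

    int main(void) {
        double hit_time     = 1.0;    /* cycles per access on a hit (assumed) */
        double miss_rate    = 0.05;   /* fraction of accesses that miss       */
        double miss_penalty = 100.0;  /* cycles to service a miss             */

        double amat = hit_time + miss_rate * miss_penalty;

        double ic            = 1e9;   /* instruction count                    */
        double refs_per_inst = 1.3;   /* memory references per instruction    */
        double stall_cycles  = ic * refs_per_inst * miss_rate * miss_penalty;

        printf("AMAT = %.2f cycles\n", amat);          /* 1 + 0.05*100 = 6.00 */
        printf("Memory stall cycles = %.0f\n", stall_cycles);
        return 0;
    }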
Block Replacement
• Random: just pick one and chuck it
– Often a simple pseudo-random hash of the target block-frame address
– Some use truly random
• But lack of reproducibility is a problem at debug time
• LRU - least recently used
– Need to keep time since each block was last accessed
• Expensive if number of blocks is large due to global compare
• Hence an approximation is often used, e.g. a use-bit tag or LFU
• FIFO
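A minimal sketch of exact LRU bookkeeping for one set, assuming a small set size; the structure and field names are illustrative, and real hardware typically uses the bit-based approximations mentioned above because a global compare like this is expensive.

    #include <stdint.h>

    #define WAYS 4

    /* One cache set with a last-used timestamp per block (exact LRU). */
    struct cache_set {
        uint64_t tag[WAYS];
        uint64_t last_used[WAYS];   /* time of most recent access */
        int      valid[WAYS];
    };

    /* Pick the victim: an invalid block if any, otherwise the least recently used. */
    static int choose_victim(const struct cache_set *s) {
        int victim = 0;
        for (int w = 0; w < WAYS; w++) {
            if (!s->valid[w])
                return w;                          /* free slot, no eviction needed      */
            if (s->last_used[w] < s->last_used[victim])
                victim = w;                        /* older access time => better victim */
        }
        return victim;
    }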
Write Options
• Write through: write posted to cache line and through to next lower level
– Incurs write stall (use an intermediate write buffer to reduce the stall)
• Write back
– Only write to cache not to lower level
– Implies that cache and main memory are now inconsistent
• Mark the line with a dirty bit
• If this block is replaced and dirty then write it back
• Pros and cons → both are useful
– Write through
• No write on read miss, simpler to implement, no inconsistency with main
memory
– Write back
• Uses less main memory bandwidth, write times independent of main
memory speeds
• Multiple writes within a block require only one write to the main memory
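The dirty-bit logic behind write-back can be sketched as below; the structure, field names, and the lower-level write helper are hypothetical stand-ins, not a specific machine's design.

    #include <string.h>
    #include <stdint.h>

    #define BLOCK_BYTES 64

    struct cache_line {
        uint64_t tag;
        int      valid;
        int      dirty;                 /* set on any write hit */
        uint8_t  data[BLOCK_BYTES];
    };

    /* Stand-in for writing a block to the next lower level of the hierarchy. */
    static void write_to_memory(uint64_t tag, const uint8_t *data) {
        (void)tag; (void)data;
    }

    /* Replace a line: only a dirty victim costs a write to main memory. */
    void replace_line(struct cache_line *victim, uint64_t new_tag, const uint8_t *new_data) {
        if (victim->valid && victim->dirty)
            write_to_memory(victim->tag, victim->data);   /* write-back on eviction          */
        victim->tag   = new_tag;
        victim->valid = 1;
        victim->dirty = 0;                                /* freshly fetched block is clean  */
        memcpy(victim->data, new_data, BLOCK_BYTES);
    }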
5.3 Cache Performance
1) Multi-Level Caches
2) Probably the best miss-penalty reduction
3) Performance measurement for 2-level caches
a. AMAT = Hit-time-L1 + Miss-rate-L1 * Miss-penalty-L1
b. Miss-penalty-L1 = Hit-time-L2 + Miss-rate-L2 * Miss-penalty-L2
c. AMAT = Hit-time-L1 + Miss-rate-L1 * (Hit-time-L2 + Miss-rate-L2 * Miss-penalty-L2)
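The sketch below evaluates the two-level formula with assumed numbers and also computes the local vs. global L2 miss-rate distinction defined next; all values are illustrative assumptions.

    #include <stdio.h>

    int main(void) {
        double hit_l1 = 1.0, miss_rate_l1 = 0.04, hit_l2 = 10.0;   /* assumed */
        double local_miss_rate_l2 = 0.25;      /* L2 misses / accesses to L2  */
        double miss_penalty_l2    = 200.0;     /* cycles to main memory       */

        double miss_penalty_l1 = hit_l2 + local_miss_rate_l2 * miss_penalty_l2;
        double amat            = hit_l1 + miss_rate_l1 * miss_penalty_l1;

        double global_miss_rate_l2 = miss_rate_l1 * local_miss_rate_l2;  /* misses / CPU refs */

        printf("AMAT = %.2f cycles\n", amat);            /* 1 + 0.04*(10 + 0.25*200) = 3.40 */
        printf("Global L2 miss rate = %.3f\n", global_miss_rate_l2);     /* 0.010 */
        return 0;
    }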
• Definitions:
– Local miss rate: misses in this cache divided by the total number of memory
accesses to this cache (Miss-rate-L2)
– Global miss rate: misses in this cache divided by the total number of memory
accesses generated by CPU (Miss-rate-L1 x Miss-rate-L2)
– Global Miss Rate is what matters
• Advantages:
– Capacity misses in L1 end up with a significant penalty reduction since they
likely will get supplied from L2
• No need to go to main memory
– Conflict misses in L1 similarly will get supplied by L2
• In write through, write buffers complicate memory access in that they might hold the
updated value of location needed on a read miss
– RAW conflicts with main memory reads on cache misses
• A read miss that waits until the write buffer is empty → increased read-miss penalty (by 50% on the old MIPS 1000 with a 4-word buffer)
• Check write buffer contents before read, and if no conflicts, let the memory access
continue
• Write Back?
– Read miss replacing dirty block
– Normal: Write dirty block to memory, and then do the read
– Instead copy the dirty block to a write buffer, then do the read, and then do the
write
– The CPU stalls less since it restarts as soon as the read is done
Write-Merging Illustration
5) Victim Caches
• Remember what was just discarded in case it is needed again
• Add small fully associative cache (called victim cache) between the cache and the refill
path
– Contain only blocks discarded from a cache because of a miss
– Are checked on a miss to see if they have the desired data before going to the next
lower-level of memory
• If yes, swap the victim block and cache block
– Addressing both victim and regular cache at the same time
• The penalty will not increase
Victim Cache Organization
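A rough sketch of the lookup path with a victim cache; the helper functions are hypothetical stand-ins, but the control flow mirrors the bullets above (probe the regular and victim caches, swap on a victim hit, otherwise go to the next level).

    #include <stdint.h>

    /* Stand-in helpers - in a real design these probe the tag arrays. */
    static int  main_cache_lookup(uint64_t addr)    { (void)addr; return 0; }
    static int  victim_cache_lookup(uint64_t addr)  { (void)addr; return 0; }
    static void swap_with_victim(uint64_t addr)     { (void)addr; }
    static void fetch_from_next_level(uint64_t addr){ (void)addr; }

    void cache_access(uint64_t addr) {
        if (main_cache_lookup(addr))
            return;                           /* regular hit                       */
        if (victim_cache_lookup(addr)) {
            swap_with_victim(addr);           /* victim hit: small extra penalty   */
            return;
        }
        fetch_from_next_level(addr);          /* true miss: go to next lower level */
    }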
5.5 Reducing Miss Rate
Classify Cache Misses - 3 C’s
• Compulsory → independent of cache size
– First access to a block → no choice but to load it
– Also called cold-start or first-reference misses
– Measured with an infinite cache (ideal)
• Capacity → decreases as cache size increases
– The cache cannot contain all the blocks needed during execution, so blocks being discarded will later be retrieved
– Measured with a fully associative cache
• Conflict (Collision) → decreases as associativity increases
– Side effect of set-associative or direct mapping
– A block may be discarded and later retrieved if too many blocks map to the same cache block
Techniques for Reducing Miss Rate
• Larger Block Size
• Larger Caches
• Higher Associativity
• Way Prediction and Pseudo-associative Caches
• Compiler optimizations
Large Caches
• Help with both conflict and capacity misses
• May need longer hit time AND/OR higher HW cost
• Popular in off-chip caches
Higher Associativity
• 8-way set associative is for practical purposes as effective in reducing misses as fully
associative
• 2:1 rule of thumb
– A 2-way set-associative cache of size N/2 has about the same miss rate as a direct-mapped cache of size N (holds for cache sizes < 128 KB)
• Greater associativity comes at the cost of increased hit time
Way Prediction
• Extra bits are kept in cache to predict the way, or block within the set of the next cache
access
• Multiplexor is set early to select the desired block, and only a single tag comparison is
performed that clock cycle
• A miss results in checking the other blocks for matches in subsequent clock cycles
• Alpha 21264 uses way prediction in its 2-way set-associative instruction cache.
Simulation using SPEC95 suggested way prediction accuracy is in excess of 85%
Pseudo-Associative Caches
• Attempt to get the miss rate of set-associative caches and the hit speed of direct-mapped
cache
• Idea
– Start with a direct mapped cache
– On a miss check another entry
– The usual method is to invert the high-order index bit to get the next try (see the sketch after this list)
• 010111 → 110111
• Problem - fast hit and slow hit
– May have the problem that you mostly need the slow hit
– In this case it is better to swap the blocks
• Drawback: CPU pipelining is hard if a hit can take 1 or 2 cycles
– Better for caches not tied directly to the processor (L2)
– Used in the MIPS R10000 L2 cache; similar in UltraSPARC
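The "second try" index is formed by flipping the most significant index bit, as in the 010111 → 110111 example above; a minimal sketch, where the 6-bit index width is just for that example.

    #include <stdint.h>

    #define INDEX_BITS 6   /* width used in the 010111 -> 110111 example */

    /* Flip the high-order index bit to get the alternate (pseudo-associative) set. */
    static uint32_t second_try_index(uint32_t index) {
        return index ^ (1u << (INDEX_BITS - 1));
    }
    /* second_try_index(0x17 /* 010111 */) == 0x37 /* 110111 */ */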
Relationship Between a Regular Hit Time, Pseudo Hit Time and Miss Penalty
Reducing Hit Time
• A time-consuming portion of a cache hit: use the index portion to read the tag and then
compare it to the address
• Small caches – smaller hardware is faster
– Keep the L1 cache small enough to fit on the same chip as CPU
– Keep the tags on-chip, and the data off-chip for L2 caches
• Simple caches – direct-mapped cache
– Trades a higher miss rate for a lower hit time
• A small direct-mapped cache misses more often than a small associative cache
• But its simpler structure makes the hit go faster
Virtually Addressed Caches
• Aliases (synonyms): different virtual addresses that map to the same physical address
– OS and User code have different virtual addresses which map to the same
physical address (facilitates copy-free sharing)
– Two copies of the same data in a virtual cache → consistency issue
– Anti-aliasing (HW) mechanisms guarantee a single copy
• On a miss, check that no other line matches the PA of the data being fetched (requires VA → PA translation); otherwise, invalidate
– SW can help - e.g. SUN’s version of UNIX
• Page coloring - aliases must have same low-order 18 bits
• I/O uses physical addresses
– Requires mapping to virtual addresses to interact with a virtual cache
Trace Caches
• A conventional cache limits the instructions in a static cache block to those that are spatially contiguous
• A conventional cache block may be entered from and exited by a taken branch → the first and last portions of a block go unused
– Taken branches or jumps occur about once every 5 to 10 instructions
• A 64-byte block holds 16 instructions → space-utilization problem
• A trace cache stores instructions only from the branch entry point to the exit of the trace → avoids header and trailer overhead
• Complicated address mapping mechanism, as addresses are no longer aligned to power of
2 multiples of word size
• May store the same instructions multiple times in the I-cache
– Conditional branches making different choices result in the same instructions
being part of separate traces, which each occupy space in the cache
• Intel NetBurst (foundation of Pentium 4)
5.9 Main Memory
3 Examples of Bus Width, Memory Width, and Memory Interleaving to Achieve Memory
Bandwidth
Wider Main Memory
1) Doubling or quadrupling the width of the cache or memory will double or quadruple the memory bandwidth
a. The miss penalty is reduced correspondingly
2) Cost and Drawback
a. More cost on memory bus
b. The multiplexer between the cache and the CPU may be on the critical path (the CPU still accesses the cache one word at a time)
i. Multiplexers can be put between L1 and L2
c. The design of error correction becomes more complicated
i. If only a portion of the block is updated, all other portions must be read for calculating the new error-correction code
d. Since main memory is traditionally expandable by the customer, the minimum
increment is doubled or quadrupled
Interleaved Memory
• Memory chips are organized into banks to read or write multiple words at a time, rather than a single word
– Banks share address lines with a memory controller
– Keep the memory bus the same but make it run faster
– Take advantage of potential memory bandwidth of all DRAMs banks
– The banks are often one word wide
– Good for accessing consecutive memory locations
• Miss penalty of 4 + 56 + 4 * 4 = 76 CC (0.4 bytes per CC), assuming 4 CC to send the address, 56 CC per access, and 4 CC to transfer each of the 4 words (see the worked check below)
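A small worked check of that arithmetic; the specific timings (4 CC address, 56 CC access, 4 CC per word transfer, 4-word block) are assumptions about the underlying example, and the non-interleaved case is included only for contrast.

    #include <stdio.h>

    int main(void) {
        int addr = 4, access = 56, xfer = 4, words = 4;   /* assumed timings */

        int simple      = words * (addr + access + xfer); /* one-word-wide memory: 256 CC */
        int interleaved = addr + access + words * xfer;   /* 4 banks overlap accesses: 76 CC */

        printf("simple = %d CC, interleaved = %d CC\n", simple, interleaved);
        return 0;
    }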
Independent Memory Banks
• Memory banks for independent accesses vs. faster sequential accesses (like wider or
interleaved memory)
– Multiple memory controllers
• Good for…
– Multiprocessor I/O
– CPUs with hit-under-n-misses (non-blocking) caches
Memory Technology
DRAM Technology
• Semiconductor Dynamic Random Access Memory
• Emphasis on cost per bit and capacity
• Multiplexed address lines → cuts the number of address pins in half
– Row access strobe (RAS) first, then column access strobe (CAS)
– Memory is organized as a 2D matrix – a whole row goes to a buffer
– A subsequent CAS selects the subrow from that buffer
• Use only a single transistor to store a bit
– Reading that bit can destroy the information
– Refresh each bit periodically (e.g. every 8 milliseconds) by writing it back
• Keep refreshing time less than 5% of the total time
• DRAM capacity is 4 to 8 times that of SRAM
Nowadays –
• DIMM: Dual inline memory module
– DRAM chips are commonly sold on small boards called DIMMs
– DIMMs typically contain 4 to 16 DRAMs
• Slowing down in DRAM capacity growth
– Four times the capacity every three years, for more than 20 years
– New chips only double capacity every two years, since 1998
• DRAM performance is growing at a slower rate
– RAS (related to latency): 5% per year
– CAS (related to bandwidth): 10%+ per year
SRAM Technology
• Cache uses SRAM: Static Random Access Memory
• SRAM uses six transistors per bit to prevent the information from being disturbed when read → no need to refresh
– SRAM needs only minimal power to retain the charge in standby mode → good for embedded applications
– No difference between access time and cycle time for SRAM
• Emphasis on speed and capacity
– SRAM address lines are not multiplexed
• SRAM speed is 8 to 16x that of DRAM
ROM and Flash
• Read-only memory (ROM)
– Programmed at the time of manufacture
– Only a single transistor per bit to represent 1 or 0
– Used for the embedded program and for constants
– Nonvolatile and indestructible
• Flash memory:
– Nonvolatile but allow the memory to be modified
– Reads at almost DRAM speeds, but writes 10 to 100 times slower
– DRAM capacity per chip and MB per dollar is about 4 to 8 times greater than
flash
RAMBUS
• RAMBUS optimizes the interface between DRAM and CPU
• RAMBUS makes a single chip act more like a memory system than a memory
component
– Each chip has interleaved memory and high-speed interface
• 1st generation RAMBUS: RDRAM
– Replace RAS/CAS with a bus that allows other accesses over it between the
sending of the address and return of the data
– Each chip has four banks, each with their own row buffer
– A chip can return a variable amount of data from a single request, and even
perform its refresh
– Transfers data on both edges of its clock signal
– 300 MHz clock
Storage Systems
Motivation: Who Cares About I/O?
• CPU performance: doubles every 18 months
• I/O performance limited by mechanical delays (disk I/O)
– Extraneous, non-priority, infrequently used, slow
• Exception is swap area of disk
– Part of the memory hierarchy
– Hence part of system performance but you’re hosed if you use it often
System Performance
• Depends on many factors in the worst case
– CPU
– Compiler
– Operating System
– Cache
– Main Memory
– Memory-IO bus
– I/O controller or channel
– I/O drivers and interrupt handlers
– I/O devices: there are many types
• Level of autonomous behavior
• Amount of internal buffer capacity
• Device specific parameters for latency and throughput
I/O Systems
(Figure: I/O system organization; devices signal the CPU via interrupts.)
Is I/O Important?
• Depends on your application
– Business - disks for file system I/O
– Graphics - graphics cards or special co-processors
– Parallelism - the communications fabric
• Our focus = mainline uniprocessing
– Storage subsystems (Chapter 7)
– Networks (Chapter 8)
• Noteworthy Point
– The traditional orphan
– But now often viewed more as a front line topic
Physical Organization Options
• Platters – one or many
• Density - fixed or variable
– (Do all tracks have the same number of sectors?)
• Organization - sectors, cylinders, and tracks
– Actuators - 1 or more
– Heads - 1 per track or 1 per actuator
– Access - seek time vs. rotational latency
• Seek related to distance but not linearly
• Typical rotation: 3600 RPM or 15000 RPM
• Diameter – 1.0 to 3.5 inches
Access Time
• Access Time
– Seek time: time to move the arm over the proper track
• Very non-linear: accelerate and decelerate times complicate
– Rotational latency (delay): time for the requested sector to rotate under the head (on average, half a rotation: 0.5 / (RPM/60) seconds)
– Transfer time: time to transfer a block of bits (typically a sector) under the read-
write head
– Controller overhead: the overhead the controller imposes in performing an I/O
access
– Queuing delay: time spent waiting for a disk to become free
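Putting the components together, a small sketch of the average access-time estimate (ignoring queuing delay); all of the parameter values below are assumptions for illustration, not figures from the text.

    #include <stdio.h>

    int main(void) {
        double avg_seek_ms   = 5.0;      /* assumed average seek             */
        double rpm           = 10000.0;  /* assumed spindle speed            */
        double transfer_mb_s = 40.0;     /* assumed media transfer rate      */
        double sector_kb     = 0.5;      /* one 512-byte sector              */
        double controller_ms = 0.1;      /* assumed controller overhead      */

        double rotation_ms = 0.5 * (60.0 / rpm) * 1000.0;             /* half a rotation  */
        double transfer_ms = sector_kb / 1024.0 / transfer_mb_s * 1000.0;

        double access_ms = avg_seek_ms + rotation_ms + transfer_ms + controller_ms;
        printf("average access time = %.2f ms\n", access_ms);         /* about 8.1 ms here */
        return 0;
    }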
Disk Alternatives
• Optical Disks
– Optical compact disks (CD) – 0.65GB
– Digital video discs, digital versatile disks (DVD) – 4.7GB * 2 sides
– Rewritable CD (CD-RW) and write-once CD (CD-R)
– Rewritable DVD (DVD-RAM) and write-once DVD (DVD-R)
• Robotic Tape Storage
• Optical Juke Boxes
• Tapes – DAT, DLT
• Flash memory
– Good for embedded systems
– Nonvolatile storage and rewritable ROM
Buses
• Advantages
– Shares a common set of wires and protocols à low cost
– Often based on standard - PCI, SCSI, etc. à portable and versatility
• Disadvantages
– Poor performance
– Multiple devices imply arbitration and therefore contention
– Can be a bottleneck
• Bus masters
– Single: only the CPU initiates bus transactions
– Multiple: multiple CPUs and I/O devices can initiate bus transactions
– Multiple bus masters need arbitration (fixed priority or random)
• Split transaction for multiple masters
– Use packets for the full transaction (does not hold the bus)
– A read transaction is broken into read-request and memory-reply transactions
– Make the bus available for other masters while the data is read/written from/to the
specified address
– Transactions must be tagged
– Higher bandwidth, but also higher latency
Synchronous or Asynchronous?
Standards
• The Good
– Let the computer and I/O-device designers work independently
– Provides a path for second party (e.g. cheaper) competition
• The Bad
– Become major performance anchors
– Inhibit change
• How to create a standard
– Bottom-up
• A company tries to get a standards committee to approve its latest philosophy in hopes that it gets the jump on the others (e.g. S-bus, PC-AT bus, ...)
• De facto standards
– Top-down
• Design by committee (PCI, SCSI, ...)
A typical interface of I/O devices and an I/O bus to the CPU-memory bus
I/O Controller
Memory Mapped I/O
(Figure: memory-mapped I/O - the CPU, ROM, RAM, and peripheral interfaces share a single memory & I/O bus, so no separate I/O instructions are needed; a bus adaptor connects the memory bus, where the CPU and L1/L2 caches sit, to the I/O bus.)
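With memory-mapped I/O, a device register is read and written with ordinary loads and stores to a reserved address; the addresses and register names below are purely hypothetical, not a real device map.

    #include <stdint.h>

    /* Hypothetical memory-mapped register addresses. */
    #define DEV_STATUS  ((volatile uint32_t *)0x10000000u)
    #define DEV_DATA    ((volatile uint32_t *)0x10000004u)

    uint32_t read_status(void)      { return *DEV_STATUS; }  /* ordinary load  */
    void     write_data(uint32_t v) { *DEV_DATA = v; }       /* ordinary store */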
Programmed I/O
• Polling
• The I/O module performs the action on behalf of the processor
• But I/O module does not interrupt CPU when I/O is done
• Processor is kept busy checking status of I/O module
– not an efficient way to use the CPU unless the device is very fast!
• Byte by Byte…
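A sketch of the byte-by-byte polling loop; the register addresses and the ready bit are the same hypothetical memory-mapped registers sketched earlier, not a real device.

    #include <stddef.h>
    #include <stdint.h>

    #define DEV_STATUS ((volatile uint32_t *)0x10000000u)  /* hypothetical */
    #define DEV_DATA   ((volatile uint32_t *)0x10000004u)
    #define RX_READY   0x1u

    /* Programmed I/O: the CPU spins on the status register for every byte. */
    void read_block(uint8_t *buf, size_t n) {
        for (size_t i = 0; i < n; i++) {
            while ((*DEV_STATUS & RX_READY) == 0)
                ;                              /* busy-wait: CPU does no other work */
            buf[i] = (uint8_t)*DEV_DATA;       /* one byte per poll                 */
        }
    }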
Interrupt-Driven I/O
• The processor is interrupted when the I/O module is ready to exchange data
• Processor is free to do other work
• No needless waiting
• Consumes a lot of processor time because every word read or written passes through the
processor and requires an interrupt
• Interrupt per byte
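By contrast, an interrupt-driven sketch moves one byte per interrupt and lets the CPU run other work in between; the handler name, buffer, and register address are hypothetical, and how the handler gets registered is platform-specific.

    #include <stdint.h>

    #define DEV_DATA ((volatile uint32_t *)0x10000004u)  /* hypothetical register */

    static volatile uint8_t  rx_buf[256];
    static volatile unsigned rx_count;

    /* Invoked by hardware once per received byte: still one interrupt per byte. */
    void device_rx_isr(void) {
        rx_buf[rx_count % 256] = (uint8_t)*DEV_DATA;
        rx_count++;
    }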
Reliability, Availability, and Dependability
Dependability, Faults, Errors, and Failures
Computer system dependability is the quality of delivered service such that reliance
can justifiably be placed on this service. The service delivered by a system is its
observed actual behavior as perceived by other system(s) interacting with this
system's users.
Each module also has an ideal specified behavior, where a service specification is an
agreed description of the expected behavior.
A system failure occurs when the actual behavior deviates from the specified
behavior. The failure occurred because of an error, a defect in that module. The cause
of an error is a fault.
When a fault occurs, it creates a latent error, which becomes effective when it is
activated; when the error actually affects the delivered service, a failure occurs. The
time between the occurrence of an error and the resulting failure is the error latency.
Thus, an error is the manifestation in the system of a fault, and a failure is the
manifestation on the service of an error.
Example of Faults, Errors, and Failures
• Example 1
– A programming mistake: fault
– The consequence is an error or latent error
– Upon activation, the error becomes effective
– When this effective error produces erroneous data that affect the delivered
service, a failure occurs
• Example 2
– An alpha particle hitting a DRAM → fault
– It changes the memory → latent error
– Affected memory word is read → effective error
– The effective error produces erroneous data that affect the delivered service → failure (if ECC corrected the error, a failure would not occur)
Service Accomplishment and Interruption
• Service accomplishment: service is delivered as specified
• Service interruption: delivered service is different from the specified service
• Transitions between these two states are caused by failures or restorations
Measure Reliability And Availability
• Reliability: measure of the continuous service accomplishment from a reference initial
instant
– Mean time to failure (MTTF)
– The reciprocal of MTTF is a rate of failures
– Service interruption is measured as mean time to repair (MTTR)
• Availability: measure of the service accomplishment w.r.t the alternation between the
above-mentioned two states
– Measured as: MTTF/(MTTF + MTTR)
– Mean time between failures (MTBF) = MTTF + MTTR
Example
• A disk subsystem
– 10 disks, each rated at 1,000,000-hour MTTF
– 1 SCSI controller, 500,000-hour MTTF
– 1 power supply, 200,000-hour MTTF
– 1 fan, 200,000-hour MTTF
– 1 SCSI cable, 1,000,000-hour MTTF
• Component lifetimes are exponentially distributed (the component age is not important in
probability of failure), and independent failure
Failure rate = 10 * (1/1,000,000) + 1/500,000 + 1/200,000 + 1/200,000 + 1/1,000,000
= 23/1,000,000 failures per hour
MTTF = 1 / Failure rate = 1,000,000/23 ≈ 43,500 hours (about 5 years)
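The arithmetic above can be checked directly; the component counts and MTTF ratings come straight from the example.

    #include <stdio.h>

    int main(void) {
        /* Failure rates in failures per hour, one term per component above. */
        double rate = 10.0 / 1000000.0   /* 10 disks      */
                    +  1.0 /  500000.0   /* SCSI controller */
                    +  1.0 /  200000.0   /* power supply  */
                    +  1.0 /  200000.0   /* fan           */
                    +  1.0 / 1000000.0;  /* SCSI cable    */

        double mttf = 1.0 / rate;        /* 1,000,000 / 23 ~= 43,478 hours (~5 years) */
        printf("failure rate = %.1f per million hours, MTTF = %.0f hours\n",
               rate * 1e6, mttf);
        return 0;
    }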
Cause of Faults
• Hardware faults: devices that fail
• Design faults: faults in software (usually) and hardware design (occasionally)
• Operation faults: mistakes by operations and maintenance personnel
• Environmental faults: fire, flood, earthquake, power failure, and sabotage
Classification of Faults
• Transient faults exist for a limited time and are not recurring
• Intermittent faults cause a system to oscillate between faulty and fault-free
operation
• Permanent faults do not correct themselves with the passing of time
Reliability Improvements
• Fault avoidance: how to prevent, by construction, fault occurrence
• Fault tolerance: how to provide, by redundancy, service complying with the
service specification in spite of faults having occurred or that are occurring
• Error removal: how to minimize, by verification, the presence of latent errors
• Error forecasting: how to estimate, by evaluation, the presence, creation, and
consequences of errors
Disk Arrays (RAID)
• Striping data across multiple disks provides high data-transfer rates, but does not by itself improve reliability
RAID Levels 0 – 1
• RAID 0 – No redundancy (Just block striping)
– Cheap but unable to withstand even a single failure
• RAID 1 – Mirroring
– Each disk is fully duplicated onto its "shadow"
– Files written to both, if one fails flag it and get data from the mirror
– Reads may be optimized – use the disk delivering the data first
– Bandwidth sacrifice on write: Logical write = two physical writes
– Most expensive solution: 100% capacity overhead
– Targeted for high-I/O-rate, high-availability environments
• RAID 0+1 – stripe first, then mirror the stripe
• RAID 1+0 – mirror first, then stripe the mirror
• RAID 3 – Bit-interleaved parity
– Reduce the cost of higher availability to 1/N (N = # of disks)
– Use one additional redundant disk to hold parity information
– Bit interleaving allows corrupted data to be reconstructed
– Interesting trade off between increased time to recover from a failure and cost
reduction due to decreased redundancy
– Parity = sum of the corresponding blocks on all data disks (modulo 2)
• Hence all disks must be accessed on a write – potential bottleneck
– Targeted for high bandwidth applications: Scientific, Image Processing
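Parity here is just the bitwise XOR (modulo-2 sum) of the corresponding data on every disk, which is what allows a lost disk's contents to be rebuilt; a minimal sketch with illustrative function and parameter names.

    #include <stddef.h>
    #include <stdint.h>

    /* Compute the parity block as the XOR of the corresponding block on each data disk. */
    void compute_parity(uint8_t *parity, uint8_t *const data[], int ndisks, size_t len) {
        for (size_t i = 0; i < len; i++) {
            uint8_t p = 0;
            for (int d = 0; d < ndisks; d++)
                p ^= data[d][i];
            parity[i] = p;               /* modulo-2 sum across all data disks */
        }
    }
    /* If one data disk fails, XOR-ing the parity with the surviving disks rebuilds it. */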
• RAID 5 – Distributed block-interleaved parity
– Parity is spread across all disks rather than kept on one dedicated parity disk
– Probability of write collisions on a single drive is reduced
– Hence higher performance in the consecutive-write situation
• RAID 6
– Similar to RAID 5, but stores extra redundant information to guard against
multiple disk failures
(Figure: RAID 4 vs. RAID 5 block and parity layout)
Response Time and Throughput
• Response time (latency): the time a task takes from the moment it is placed in the buffer
until the server finishes the task
• Throughput: the average number of tasks completed by the server over a time period
• Knee of the curve (latency vs. throughput): the region where a little more throughput results in much longer response time, or a little shorter response time results in much lower throughput
Transaction Model
• In an interactive environment, faster response time is important
• Impact of inherent long latency
• Transaction time: sum of 3 components
– Entry time - time it takes user (usually human) to enter command
– System response time - command entry to response out
– Think time - user reaction time between response and next entry