Cache Memory

Chapter 4 discusses cache memory, detailing its characteristics, organization, and principles of operation within computer memory systems. It explains the hierarchy of memory types, access times, and the importance of cache in improving performance through various mapping and replacement algorithms. The chapter also addresses cache coherency issues in multi-processor systems and outlines write policies to maintain data consistency.

Let's Talk About

Chapter 4
Cache Memory

Presented by Group 4
Cache Memory

Computer Memory System Overview
Cache Memory Principles
Elements of Cache Design
Pentium 4 Cache Organization
Chapter 4.1
Characteristics of Memory Systems

Introduction to Memory Systems

MEMORY SYSTEMS

Memory systems are essential for storing and retrieving data in a computer.
They come in various forms and function at different levels of performance, cost, and capacity.
Main Categories of
Memory Systems

INTERNAL MEMORY
Registers, cache, RAM (closer to the CPU).

EXTERNAL MEMORY
Disks, tapes (used for large-scale storage).
Location

INTERNAL MEMORY
Found within the CPU or very close to it, such as cache and registers.

EXTERNAL MEMORY
Located further from the CPU, including hard drives, SSDs, and tapes.
Capacity

DEFINITION
The total amount of data the memory system can hold at one time.

HIERARCHY
Registers and cache: smallest but fastest memory (in KB or MB).
Main memory (RAM): larger but slightly slower (in GB).
External storage (HDD, SSD): largest, typically in the range of terabytes, but slowest.
Unit of Transfer

INTERNAL MEMORY
Transfers data in words, the smallest fixed-size units of data (e.g., 32 or 64 bits).

EXTERNAL MEMORY
Transfers data in blocks, larger chunks of data (e.g., 512-byte or 4 KB blocks).
Unit of Transfer

WORD
The natural unit of memory organization, typically equal to the number of bits used to represent an integer.
Example: Intel x86 has a word size of 32 bits.

ADDRESSABLE UNITS
The smallest unit that can be uniquely addressed in memory. Often this is a byte, but some systems use the word as the addressable unit.
The number of addressable units (N) is related to the number of bits in the address (A): 2^A = N.
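A quick sketch of the 2^A = N relationship in Python (the memory sizes used here are illustrative, not from the slides):

```python
import math

def address_bits(n_addressable_units: int) -> int:
    """Number of address bits A needed so that 2**A covers N addressable units."""
    return math.ceil(math.log2(n_addressable_units))

# Example: a 4 GB byte-addressable memory needs A = 32 address bits.
print(address_bits(4 * 2**30))   # 32
# Conversely, a 16-bit address identifies 2**16 = 65,536 addressable units.
print(2**16)                     # 65536
```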
Access Time

SEQUENTIAL ACCESS
Data is accessed in a linear order, such as in tape storage. Access time varies as the read/write mechanism moves through intermediate records.

DIRECT ACCESS
Data is accessed by moving to a general area (direct access), followed by a search within that area.
Example: disk storage.
Access Time

RANDOM ACCESS
Any memory location can be accessed directly, without regard to previous locations.
Example: main memory and some cache systems.

ASSOCIATIVE ACCESS
Data is accessed based on a portion of its content rather than its address.
Common in certain cache memories.
Performance

ACCESS TIME (LATENCY)
Time required to read or write data. For random-access memory, this is the time between the request and the availability of the data.

MEMORY CYCLE TIME
Includes access time and the time required before a second access can start. Important for system bus performance.
Performance

TRANSFER RATE
The rate at which data can be transferred.
For random-access memory: Transfer rate = 1 / (cycle time).
For non-random-access memory: Tn = Ta + n/R, where Tn is the average time to read/write n bits, Ta is the average access time, and R is the transfer rate (in bits per second).
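As a small worked illustration of Tn = Ta + n/R (the timing numbers below are assumed, not from the chapter):

```python
def average_transfer_time(ta_seconds: float, n_bits: int, rate_bps: float) -> float:
    """Tn = Ta + n/R: average time to read or write n bits from a
    non-random-access memory with average access time Ta and transfer rate R."""
    return ta_seconds + n_bits / rate_bps

# Assumed example: 5 ms average access time, 4096-bit block, 1 Mbit/s transfer rate.
print(average_transfer_time(0.005, 4096, 1_000_000))   # ~0.0091 s
```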
Types of Memory

VOLATILE MEMORY
Requires constant power to retain data.
Example: RAM.

NON-VOLATILE MEMORY
Retains data without power.
Example: hard drives, SSDs, ROM.
Memory
Organization

DEFINITION
Memory organization refers to the physical arrangement of bits into words in memory.

INTERNAL MEMORY ORGANIZATION
Bits are grouped to form words, which are the "natural" units of data that the CPU processes.
Memory can be organized in various ways depending on the system, sometimes not in a straightforward manner.
THE MEMORY
HIERARCHY

DESIGN CONSTRAINTS
How much (capacity)?
How fast (access time)?
How expensive (cost)?

TRADE-OFFS
Faster access = higher cost per bit.
Greater capacity = lower cost per bit.
Greater capacity = slower access time.
THE DILEMMA

Designers need large, low-cost memory for capacity, but fast, expensive memory for performance.
MEMORY HIERARCHY
SOLUTION

Employ a hierarchy of memory technologies.

As you move down the hierarchy:
Cost per bit decreases.
Capacity increases.
Access time increases.
Frequency of access by the processor decreases.
LOCALITY OF
REFERENCE

Memory references by the processor tend to cluster.
Programs contain iterative loops and subroutines with repeated references to a small set of instructions.
Operations on tables and arrays involve access to clustered sets of data words.
EXAMPLE 4.1

Suppose that the processor has access to two levels of memory. Level 1 contains 1000 words
and has an access time of 0.01 μs; level 2 contains 100,000 words and has an access time of
0.1 μs. Assume that if a word to be accessed is in level 1, then the processor accesses it
directly. If it is in level 2, then the word is first transferred to level 1 and then accessed by the
processor. For simplicity, we ignore the time required for the processor to determine whether
the word is in level 1 or level 2. Figure 4.2 shows the general shape of the curve that covers
this situation. The figure shows the average access time to a two-level memory as a function
of the hit ratio H, where H is defined as the fraction of all memory accesses that are found in
the faster memory (e.g., the cache), T1 is the access time to level 1, and T2 is the access time
to level 2. As can be seen, for high percentages of level 1 access, the average total access
time is much closer to that of level 1 than that of level 2.

In our example, suppose 95% of the memory accesses are found in level 1. Then the average
time to access a word can be expressed as

(0.95)(0.01 μs) + (0.05)(0.01 μs + 0.1 μs) = 0.0095 + 0.0055 = 0.015 μs

The average access time is much closer to 0.01 μs than to 0.1 μs, as desired.
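The same calculation can be written as a short function; this is a minimal sketch that assumes the miss penalty is T1 + T2, exactly as in Example 4.1:

```python
def two_level_avg_access(h: float, t1_us: float, t2_us: float) -> float:
    """Average access time for a two-level memory: a hit costs T1, while a miss
    costs T1 + T2 (the word is moved to level 1 and then accessed)."""
    return h * t1_us + (1 - h) * (t1_us + t2_us)

# Reproduces Example 4.1: H = 0.95, T1 = 0.01 us, T2 = 0.1 us.
print(two_level_avg_access(0.95, 0.01, 0.1))   # ~0.015 microseconds
```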
DATA ORGANIZATION

Data can be organized across the hierarchy so that the percentage of accesses to each lower level is significantly less than that of the level above.

TWO-LEVEL EXAMPLE
Level 2 memory contains all program instructions and data, while current clusters are placed in level 1. On average, most references are to instructions and data in level 1.

MEMORY TYPES
The fastest, smallest, and most expensive memory consists of processor registers.
Main memory is the primary internal memory system, typically extended with a higher-speed, smaller cache.
MEMORY CHARACTERISTICS
Volatile memory types include registers, cache, and main memory.
External, nonvolatile memory (secondary or auxiliary memory) includes hard disks, removable media, tape, and optical storage.

DISK CACHE
Disk writes are clustered, which improves performance and minimizes processor involvement.
Data in the software cache can be accessed faster than from the disk if referenced before the next dump to disk.
Chapter 4.2

CACHE MEMORY PRINCIPLES

CACHE

Small amount of fast memory.
Sits between normal main memory and the CPU.
May be located on the CPU chip or module.
CONCEPT
MULTIPLE LEVELS
CACHE OPERATION

CPU requests the contents of a memory location.
Check the cache for this data.
If present, get it from the cache (fast).
If not present, read the required block from main memory into the cache, then deliver it from the cache to the CPU.
The cache includes tags to identify which block of main memory is in each cache slot.
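A minimal Python sketch of this check-then-fetch flow (the block size, the dictionary-based cache, and the main_memory list are illustrative assumptions, not the actual hardware organization):

```python
BLOCK_SIZE = 4   # words per block (assumed for illustration)
cache = {}       # block number (tag) -> words of that block currently in the cache

def read(address: int, main_memory: list) -> int:
    """Return the word at `address`, loading its block into the cache on a miss."""
    block_no = address // BLOCK_SIZE          # tag identifying the main memory block
    offset = address % BLOCK_SIZE
    if block_no not in cache:                 # miss: read the whole block from memory
        start = block_no * BLOCK_SIZE
        cache[block_no] = main_memory[start:start + BLOCK_SIZE]
    return cache[block_no][offset]            # deliver the word from the cache

memory = list(range(100))
print(read(42, memory))   # miss: block 10 is loaded, then word 42 is returned
print(read(43, memory))   # hit: same block, served directly from the cache
```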
STRUCTURE
(Figure: cache and main memory structure)
READ OPERATION

Cache hit: when the cache has the data, it communicates directly with the processor, bypassing the system bus and buffers.

Cache miss: if the cache does not have the data, the system bus is used to fetch the data from main memory. The data is then sent to both the cache and the processor via the data buffer.

ORGANIZATION
Chapter 4.3

ELEMENTS OF CACHE DESIGN
TOPICS

Cache Addresses
-Logical
-Physical
Cache Size
Mapping Function
-Direct
-Associative
-Set associative
Replacement Algorithm
-Least recently used (LRU)
-First in first out (FIFO)
-Least frequently used (LFU)
-Random
Write Policy
-Write through
-Write back
Line Size
Number of Caches
-Single or two level
-Unified or split
HPC
HIGH-PERFORMANCE COMPUTING
CACHE ADDRESSES

LOGICAL/VIRTUAL CACHE
-Virtual addresses
-Faster: the cache can be searched without waiting for virtual-to-physical address translation

PHYSICAL CACHE
-Physical addresses

(Figures: logical cache and physical cache organization)
CACHE SIZE

SIZE, COST, SPEED
Cache size is a trade-off: the cache should be small enough that its cost per bit does not dominate, yet large enough that the hit ratio keeps the average access time low; very large caches also tend to be slightly slower.
MAPPING
FUNCTION
Mapping
Function
Because there are fewer cache
lines than main memory blocks,
an algorithm is needed for
mapping main memory blocks
into cache lines. Further, a means
is needed for determining which
main memory block currently
occupies a cache line. The choice
of the mapping function dictates
how the cache is organized.
Direct Mapping

i = j modulo m
where i = cache line number
j = main memory block number
m = number of lines in the cache
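A short sketch of how an address is split under direct mapping; the block size and number of cache lines below are assumptions for illustration:

```python
WORDS_PER_BLOCK = 16   # assumed block size
CACHE_LINES = 1024     # m, assumed number of cache lines

def direct_map(address: int):
    """Split an address into (tag, cache line, word offset) under direct mapping."""
    block = address // WORDS_PER_BLOCK     # j = main memory block number
    line = block % CACHE_LINES             # i = j modulo m
    tag = block // CACHE_LINES             # distinguishes blocks that share a line
    offset = address % WORDS_PER_BLOCK
    return tag, line, offset

# Blocks j and j + m compete for the same cache line and differ only in tag:
print(direct_map(0))                               # (0, 0, 0)
print(direct_map(CACHE_LINES * WORDS_PER_BLOCK))   # (1, 0, 0)
```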
Direct Mapping

Thrashing: a drawback of direct mapping is that if a program repeatedly references words from two blocks that map to the same cache line, the blocks are continually swapped in and out and the hit ratio stays low.
Associative Mapping

Associative mapping overcomes the disadvantage of direct mapping by permitting each main memory block to be loaded into any line of the cache.
Associative Mapping
Set-Associative Mapping

m = v * k
i = j modulo v
where
i = cache set number
j = main memory block number
m = number of lines in the cache
v = number of sets
k = number of lines in each set
Set-Associative Mapping
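A companion sketch for set-associative mapping, using assumed values of v and k; the block may then be placed in any of the k lines of the selected set:

```python
V_SETS = 256           # v, assumed number of sets
K_WAYS = 4             # k, assumed lines per set (so m = v * k = 1024 lines)
WORDS_PER_BLOCK = 16   # assumed block size

def set_associative_map(address: int):
    """Return (tag, set number) under set-associative mapping: i = j modulo v."""
    block = address // WORDS_PER_BLOCK   # j = main memory block number
    set_no = block % V_SETS              # i = j modulo v
    tag = block // V_SETS
    return tag, set_no

print(set_associative_map(0))                          # (0, 0)
print(set_associative_map(V_SETS * WORDS_PER_BLOCK))   # (1, 0) - same set, new tag
```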
Replacement Algorithms

When the cache is full and a new block is brought in, an existing block must be replaced to make room for it.
For direct mapping: there is only one possible line for any specific block, so no choice is available in selecting which block to replace.
For associative and set-associative mapping: a replacement algorithm is needed to decide which block to replace. To achieve high speed, the algorithm must be implemented in hardware.
Four common replacement algorithms are:

LEAST RECENTLY USED (LRU):
Concept: replace the block that has been in the cache the longest without being referenced.
For two-way set-associative caches: when a line is referenced, its USE bit is set to 1 and the other line's USE bit is set to 0. The block whose USE bit is 0 is replaced.
For fully associative caches: a list of indexes to all lines is maintained. The most recently used line moves to the front of the list, and the least recently used line (at the back of the list) is replaced.

FIRST-IN-FIRST-OUT (FIFO):
Concept: replace the block that has been in the cache the longest, regardless of how often it has been used.
Easily implemented using a round-robin or circular buffer technique.

LEAST FREQUENTLY USED (LFU):
Concept: replace the block that has experienced the fewest references.
A counter is associated with each line, and the line with the lowest count is replaced.

RANDOM REPLACEMENT:
Concept: pick a line at random for replacement, without considering its usage.
Simulation studies suggest random replacement provides only slightly worse performance than usage-based algorithms.
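A minimal sketch of LRU for a fully associative cache, mirroring the list-of-indexes scheme above (the capacity and the load_block callback are illustrative assumptions):

```python
from collections import OrderedDict

class LRUCache:
    """Fully associative cache with LRU replacement: the most recently used line
    moves to the front of the ordering, and the line at the back is the victim."""

    def __init__(self, num_lines: int):
        self.num_lines = num_lines
        self.lines = OrderedDict()   # block number -> block data, ordered by recency

    def access(self, block_no: int, load_block) -> str:
        if block_no in self.lines:
            self.lines.move_to_end(block_no)          # mark as most recently used
            return "hit"
        if len(self.lines) >= self.num_lines:
            self.lines.popitem(last=False)            # evict the least recently used line
        self.lines[block_no] = load_block(block_no)   # bring in the new block
        return "miss"

cache = LRUCache(num_lines=2)
fetch = lambda b: f"block {b}"
print([cache.access(b, fetch) for b in (0, 1, 0, 2, 1)])
# ['miss', 'miss', 'hit', 'miss', 'miss'] - block 1 was evicted when block 2 arrived
```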
Write Policy
When a block in the cache is replaced:
If the block hasn't been altered: It can be overwritten directly with a
new block.
If the block has been altered (at least one write operation): The
modified block must be written back to the main memory before the
new block is loaded into the cache.

Key Problems with Cache and Memory Consistency:


1. Multiple devices accessing main memory:
If only the cache is updated, main memory could be outdated (invalid
data).
If an I/O device changes main memory, the cache could hold
outdated data.
2. Multiple processors with local caches:
If one processor modifies a word in its cache, it could invalidate the
corresponding data in the other processors' caches.
Two main write policies to handle these issues:
Write Through:
Every time data is written to the cache, it is also written to the main memory
simultaneously.
Advantage: Main memory always holds valid data, which ensures consistency. Other
processors or devices can monitor memory traffic and keep their own caches
updated.
Disadvantage: Generates a lot of memory traffic, which can slow down the system
and create bottlenecks.
Write Back:
Concept: Updates are made only in the cache. When a block is modified, a dirty bit (or
use bit) is set to indicate the change.
When a block is replaced: It is written back to main memory only if the dirty bit is set.
Advantage: Reduces memory writes, minimizing memory traffic.
Disadvantage: Main memory may hold invalid data, meaning I/O devices must go
through the cache to ensure they get updated data. This requires more complex
circuitry and can also cause bottlenecks.
In general, about 15% of memory references are write operations. In certain high-
performance computing (HPC) tasks, such as:
Vector-vector multiplication: Write operations may reach 33%.
Matrix transposition: Write operations can go as high as 50%.
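A small sketch contrasting the two policies (the dictionary-based cache and memory are assumptions for illustration). Writes set the dirty bit and touch only the cache; main memory is updated only when the dirty block is replaced. With write through, the write method would also update main memory immediately:

```python
class WriteBackCache:
    """Minimal sketch of the write-back policy with a dirty bit per cached block."""

    def __init__(self, main_memory: dict):
        self.main_memory = main_memory
        self.lines = {}   # block number -> {"data": ..., "dirty": bool}

    def write(self, block_no: int, data) -> None:
        self.lines[block_no] = {"data": data, "dirty": True}   # update cache only

    def evict(self, block_no: int) -> None:
        line = self.lines.pop(block_no)
        if line["dirty"]:                                      # write back only if modified
            self.main_memory[block_no] = line["data"]

memory = {7: "old"}
cache = WriteBackCache(memory)
cache.write(7, "new")
print(memory[7])      # 'old'  - main memory is stale until the block is replaced
cache.evict(7)
print(memory[7])      # 'new'  - dirty block written back on replacement
```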
In a system where multiple
devices (like processors) have
their own caches and share main
memory, altering data in one
cache can invalidate the same
data in both the main memory
and other caches. Even with a
write-through policy, other
caches may hold outdated data.
A system that solves this issue is
said to maintain cache coherency

Cache coherency:
A system that ensures all
caches and main memory are
synchronized is said to
maintain cache coherency.
Possible approaches to maintain cache coherency:

BUS WATCHING WITH WRITE THROUGH:
Each cache controller monitors the bus for write operations by other devices.
If another device writes to a shared memory location present in its cache, the cache controller invalidates that cache entry.
This depends on using the write-through policy for all caches.

HARDWARE TRANSPARENCY:
Extra hardware ensures all updates to main memory from a cache are reflected in other caches.
When one processor updates its cache, the update is written to both main memory and any matching data in other caches.

NONCACHEABLE MEMORY:
A portion of main memory is marked as noncacheable and shared by multiple processors.
Any access to this shared memory results in a cache miss, as the data is not cached.
Noncacheable memory is identified using chip-select logic or high-address bits.
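A toy sketch of bus watching with write through (the class names and the broadcast mechanism are invented for illustration): every write goes through to main memory over the bus, and the other cache controllers that observe it invalidate their matching entries.

```python
class SnoopingCache:
    """Cache controller that writes through to memory and snoops bus writes."""

    def __init__(self, bus):
        self.lines = {}          # address -> cached value
        self.bus = bus
        bus.caches.append(self)

    def write(self, address: int, value) -> None:
        self.lines[address] = value
        self.bus.broadcast_write(self, address, value)   # write through via the bus

    def snoop(self, address: int) -> None:
        self.lines.pop(address, None)                    # invalidate a stale copy

class Bus:
    def __init__(self):
        self.caches = []
        self.memory = {}

    def broadcast_write(self, writer, address, value) -> None:
        self.memory[address] = value                     # main memory stays valid
        for cache in self.caches:
            if cache is not writer:
                cache.snoop(address)                     # other caches invalidate

bus = Bus()
c1, c2 = SnoopingCache(bus), SnoopingCache(bus)
c2.lines[100] = "stale copy"
c1.write(100, "fresh")
print(bus.memory[100], c2.lines.get(100))   # fresh None
```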
Line Size
When data is fetched into the cache, both the requested word and nearby words are retrieved
together as a block.
Effect of Increasing Block Size on Hit Ratio:
Initial Increase in Hit Ratio:
As block size increases, the hit ratio improves because nearby data is more likely to be
accessed soon (due to the principle of locality).
Decrease in Hit Ratio for Larger Blocks:
After a certain point, as the block size gets bigger, the chance of using the newly fetched data decreases.
The probability of reusing the data that was replaced becomes higher than that of using the newly fetched data.
Two Main Effects of Larger Blocks:
1. Fewer Blocks Fit in Cache:
Larger blocks reduce the total number of blocks that can fit in the cache.
This causes useful data to be overwritten quickly.
2. Farther Words are Less Useful:
As a block becomes larger, words farther from the requested word are less likely to be
accessed soon.
Complex Relationship Between Block Size and Hit Ratio:
The optimal block size depends on the locality characteristics of a program.
Block sizes between 8 and 64 bytes are generally close to the optimum.
For high-performance computing (HPC) systems, cache line sizes of 64 to 128 bytes are often
used.
Number of Caches

When caches were originally introduced, the typical system had a single cache. More recently, the use of multiple caches has become the norm. Two aspects of this design issue concern the number of levels of caches and the use of unified versus split caches.
MULTILEVEL CACHES
As chip technology has improved, it is now possible to place a cache directly on
the processor chip (on-chip cache).
Compared to a cache accessed via an external bus, the on-chip cache reduces the
processor's need to use the external bus, speeding up execution and improving
system performance.

On-Chip Cache Benefits:


When the processor finds the requested data in the on-chip cache, there’s no
need to access the bus.
The data paths within the processor are shorter, so accessing the on-chip cache is
much faster than accessing external memory, even in "zero-wait" bus cycles.

The Need for External Cache (L2 Cache):


Despite the presence of an on-chip cache, an external (off-chip) cache may still
be useful.
Modern designs usually include both on-chip and external caches.
The simplest setup is a two-level cache system, with the internal cache being
Level 1 (L1) and the external cache being Level 2 (L2).
MULTILEVEL CACHES
If there's no L2 cache, and the processor cannot find the required data in the L1
cache, it must access slower DRAM or ROM memory, which leads to poor
performance.
The L2 cache, often made of fast SRAM (static RAM), helps in quickly retrieving
missing data from L1 cache.
If the L2 cache matches the speed of the bus, the data transfer can occur using
"zero-wait state" transactions, which are the fastest.
Many designs with an off-chip L2 cache avoid using the system bus and instead
use a separate data path between the processor and the L2 cache, reducing bus
traffic.
With the continued miniaturization of components, many processors now include
the L2 cache on the chip, further improving performance.
Performance Impact of Multilevel Caches:
The effectiveness of the L2 cache depends on the hit rates of both the L1 and L2
caches.
Studies have shown that multilevel caches improve performance.
However, adding multiple cache levels complicates design factors, including
cache size, replacement algorithms, and write policies.
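One way to see why both hit rates matter is a simple average-access-time model (a sketch; the additive miss-penalty model and the timing numbers are assumptions, not measurements from the chapter):

```python
def avg_access_time(h1: float, h2: float, t1: float, t2: float, t_mem: float) -> float:
    """Average access time with L1 and L2 caches. h1 and h2 are the fractions of all
    references satisfied by L1 and by L2 respectively (h1 + h2 <= 1); t1, t2, and
    t_mem are the L1, L2, and main-memory access times."""
    return h1 * t1 + h2 * (t1 + t2) + (1 - h1 - h2) * (t1 + t2 + t_mem)

# Assumed numbers: 1 ns L1, 5 ns L2, 60 ns DRAM, 90% L1 hits, 8% L2 hits.
print(avg_access_time(0.90, 0.08, 1, 5, 60))   # 2.7 ns
```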
Figure 4.17
Figure 4.17 looks at how well a two-level cache system
performs based on the size of the caches. The graph
assumes both caches (L1 and L2) have the same line
size and measures the total "hit ratio," meaning it
counts when the needed data is found in either the L1
or L2 cache.
Impact of L2 Cache Size: The L2 cache doesn't
make a big difference in improving hits until it is at
least twice the size of the L1 cache.
Key Points on Cache Sizes:
For an L1 cache of 8 KB, the biggest
improvement happens when the L2 cache is 16
KB.
For an L1 cache of 16 KB, the biggest boost
occurs with an L2 cache of 32 KB.
Performance: Before reaching these sizes, the L2
cache doesn't significantly improve overall cache
performance.
Unified Cache:
In early on-chip cache designs, there was a single cache used for both data and instructions. This unified cache stored references to both types of memory access.

Advantages of Unified Cache:
A unified cache has a higher hit rate compared to split caches because it adjusts based on the balance between instruction and data fetches. If the program requires more instruction fetches, the cache will hold more instructions, and if more data fetches are needed, the cache will hold more data.
Only one cache design is needed, simplifying the implementation.

Split Cache:
Modern designs often separate the cache into two parts:
One for instructions (instruction L1 cache)
One for data (data L1 cache)
These two caches are at the same level, typically as two L1 caches. The processor consults the instruction L1 cache when fetching instructions and the data L1 cache when fetching data.

Trend Toward Split Caches (L1 Level):
At the L1 cache level, there is a trend towards using split caches (one for instructions, one for data).
Higher levels (like L2) tend to use unified caches, especially in superscalar machines that focus on executing multiple instructions in parallel and prefetching future instructions.

Key Advantage of Split Cache:
It avoids cache contention between the instruction fetch/decode unit and the execution unit.
In designs that use pipelining (where the processor fetches instructions in advance to store in a buffer for execution), avoiding cache contention is crucial.

Unified Cache Issue:
With a unified cache, if the execution unit needs to access memory for data while the instruction prefetcher requests an instruction, the instruction request may be delayed.
The cache prioritizes the data request from the execution unit to complete the current instruction, causing a delay in fetching the next instruction.
This delay can affect the performance of the instruction pipeline.

Split Cache Benefit:
By separating the instruction and data caches, contention is eliminated, ensuring smoother and more efficient use of the instruction pipeline, leading to better performance.
Chapter 4.4
Pentium 4
Cache Organization
Table 4.4 Intel Cache Evolution
Figure 4.18 Pentium 4 Block Diagram
Table 4.5 Pentium 4 Cache Operating Modes
Thank You For
Listening
