Notes M5
MODULE 5
Chapter 12 CACHES
Cache
– is a small, fast array of memory placed between the processor core and main memory
that stores portions of recently referenced main memory.
– The word cache is a French word meaning “a concealed place for storage”.
Write Buffer
– Often used with a cache is a write buffer, a very small first-in-first-out (FIFO) memory
placed between the processor core and main memory.
– The purpose of a write buffer is to free the processor core and cache memory from the
slow write time associated with writing to main memory.
The innermost level of the hierarchy is at the processor core. This memory is so
tightly coupled to the processor that in many ways it is difficult to think of it
as separate from the processor. This memory is known as a register file.
At the primary level, memory components are connected to the processor core
through dedicated on-chip interfaces. It is at this level we find tightly coupled
memory (TCM) and level 1 cache.
Also at the primary level is main memory. It includes volatile components like
SRAM and DRAM, and non-volatile components like flash memory.
The purpose of main memory is to hold programs while they are running on a
system.
Secondary memory is used to store unused portions of very large programs that
do not fit in main memory and programs that are not currently executing.
A cache may be incorporated between any two levels in the hierarchy where there is
a significant access time difference between memory components.
Figure 12.1 includes a level 1 (L1) cache and write buffer. The L1 cache is an
array of high-speed, on-chip memory that temporarily holds code and data from
a slower level.
A cache holds this information to decrease the time required to access both
instructions and data.
The write buffer is a very small FIFO buffer that supports writes to main
memory from the cache.
Not shown in the figure is a level 2 (L2) cache. An L2 cache is located between
the L1 cache and slower memory. The L1 and L2 caches are also known as the
primary and secondary caches.
Figure 12.2 shows the relationship that a cache has with the main memory system and the
processor core.
The upper half of the figure shows a block diagram of a system without a cache.
Main memory is accessed directly by the processor core using the datatypes
supported by the processor core.
The lower half of the diagram shows a system with a cache. The cache memory
is much faster than main memory and thus responds quickly to data requests by
the core.
The cache's relationship with main memory involves the transfer of small
blocks of data from the slower main memory to the faster cache memory.
These blocks of data are known as cache lines.
If a cached core supports virtual memory, the cache can be located between the core
and the memory management unit (MMU), or between the MMU and physical
memory. Figure 12.3 shows the difference between the two caches.
A logical cache stores data in a virtual address space. A logical cache is located
between the processor and the MMU.
The processor can access data from a logical cache directly without going
through the MMU. A logical cache is also known as a virtual cache.
A physical cache stores memory using physical addresses.
A physical cache is located between the MMU and main memory.
For the processor to access memory, the MMU must first translate the virtual
address to a physical address before the cache memory can provide data to the
core.
ARM cached cores with an MMU use logical caches in the ARM7 through ARM10
processor families, including the Intel StrongARM and Intel XScale
processors. The ARM11 processor family uses a physical cache.
The improvement a cache provides is possible because computer programs
execute in a non-random way.
The principle of locality of reference explains the performance improvement
provided by the addition of a cache memory to a system.
The repeated use of the same code or data in memory, or those very near, is the
reason a cache improves performance.
By loading the referenced code or data into faster memory when first accessed,
each subsequent access will be much faster. It is the repeated access to the faster
memory that improves performance.
In processor cores using the Von Neumann architecture, there is a single cache
used for instruction and data.
This type of cache is known as a unified cache. A unified cache memory
contains both instruction and data values.
The Harvard architecture has separate instruction and data buses to improve
overall system performance, but supporting the two buses requires two caches.
In processor cores using the Harvard architecture, there are two caches: an
instruction cache (I-cache) and a data cache (D-cache). This type of cache is
known as a split cache.
In a split cache, instructions are stored in the instruction cache and data values
are stored in the data cache.
The size of a cache is defined as the actual code or data the cache can store from
main memory. Not included in the cache size is the cache memory required to
support cache-tags or status bits.
Two common status bits are the valid bit and dirty bit.
A valid bit marks a cache line as active, meaning it contains live data originally
taken from main memory and is currently available to the processor core on
demand.
A dirty bit defines whether or not a cache line contains data that is different
from the value it represents in main memory.
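As a rough illustration (not the layout of any particular ARM core), the bookkeeping attached to one cache line can be modelled in C with a structure like the one below; the line length of eight words is an assumption for the sketch.

#include <stdint.h>

#define WORDS_PER_LINE 8      /* assumed line length of eight 32-bit words */

/* Hypothetical model of one cache line and its status bits. */
struct cache_line {
    uint32_t tag;                   /* cache-tag: high-order address bits of the stored block  */
    unsigned valid : 1;             /* valid bit: line holds live data copied from main memory */
    unsigned dirty : 1;             /* dirty bit: line differs from its copy in main memory    */
    uint32_t data[WORDS_PER_LINE];  /* the cached code or data itself                          */
};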
The cache controller is hardware that copies code or data from main memory
to cache memory automatically.
It performs this task automatically to conceal cache operation from the software
it supports.
The cache controller intercepts read and write memory requests before passing
them on to the memory controller.
It processes a request by dividing the address of the request into three fields:
the tag field, the set index field, and the data index field. These three bit fields are
shown in Figure 12.4.
First, the controller uses the set index portion of the address to locate the cache
line within the cache memory that might hold the requested code or data. This
cache line contains the cache-tag and status bits, which the controller uses to
determine the actual data stored there.
The controller then checks the valid bit to determine if the cache line is active,
and compares the cache-tag to the tag field of the requested address. If both the
status check and comparison succeed, it is a cache hit. If either the status check
or comparison fails, it is a cache miss.
On a cache miss, the controller copies an entire cache line from main memory
to cache memory and provides the requested code or data to the processor. The
copying of a cache line from main memory to cache memory is known as a
cache line fill.
On a cache hit, the controller supplies the code or data directly from cache
memory to the processor. To do this it moves to the next step, which is to use
the data index field of the address request to select the actual code or data in the
cache line and provide it to the processor.
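A minimal C sketch of this lookup, assuming a hypothetical direct-mapped cache of 4 KB with 32-byte cache lines (and therefore 128 sets), might look like the following:

#include <stdbool.h>
#include <stdint.h>

#define LINE_BYTES 32    /* assumed: 8 words per cache line   */
#define NUM_SETS   128   /* assumed: 4 KB direct-mapped cache */

struct line {
    uint32_t tag;
    unsigned valid : 1;
    uint8_t  data[LINE_BYTES];
};

static struct line cache[NUM_SETS];

/* Split the requested address into its three fields and test for a hit. */
bool cache_lookup(uint32_t addr, uint8_t *out)
{
    uint32_t data_index = addr % LINE_BYTES;               /* byte within the cache line */
    uint32_t set_index  = (addr / LINE_BYTES) % NUM_SETS;  /* selects the cache line     */
    uint32_t tag        = addr / (LINE_BYTES * NUM_SETS);  /* remaining high-order bits  */

    struct line *l = &cache[set_index];

    if (l->valid && l->tag == tag) {   /* status check and cache-tag comparison */
        *out = l->data[data_index];    /* cache hit: supply the data directly   */
        return true;
    }
    return false;                      /* cache miss: a cache line fill is needed */
}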
Figure 12.5 shows where portions of main memory are temporarily stored in
cache memory. The figure represents the simplest form of cache, known as a direct-
mapped cache.
Figure 12.6 takes Figure 12.5 and overlays a simple, contrived software
procedure to demonstrate thrashing. The procedure calls two routines
repeatedly in a do while loop.
Each routine has the same set index address; that is, the routines are found at
addresses in physical memory that map to the same location in cache memory.
The first time through the loop, routine A is placed in the cache as it executes.
When the procedure calls routine B, it evicts routine A a cache line at a time as
routine B is loaded into the cache and executed.
On the second time through the loop, routine A replaces routine B, and then
routine B replaces routine A.
Repeated cache misses result in continuous eviction of the routine that is not
running. This is cache thrashing.
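The contrived procedure of Figure 12.6 can be sketched in C as below; routine_A and routine_B are hypothetical routines assumed, for the sake of the example, to be linked at addresses that share the same set index.

extern void routine_A(void);   /* hypothetical routines assumed to map to */
extern void routine_B(void);   /* the same set index in cache memory      */

void procedure(int iterations)
{
    do {
        routine_A();   /* loads A, evicting B's cache lines */
        routine_B();   /* loads B, evicting A's cache lines */
    } while (--iterations > 0);   /* every pass refills from main memory: thrashing */
}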
The storing of data in cache lines within a set does not affect program execution.
Two sequential blocks from main memory can be stored as cache lines in the
same way or two different ways.
The placement of values within a set is exclusive to prevent the same code or
data block from simultaneously occupying two cache lines in a set.
The mapping of main memory to a cache changes in a four-way set associative
cache. Figure 12.8 shows the differences.
Any single location in main memory now maps to four different locations in
the cache.
The bit field for the tag is now two bits larger, and the set index bit field is two
bits smaller.
The size of the area of main memory that maps to cache is now 1 KB instead
of 4 KB. This means that the likelihood of mapping cache line data blocks to
the same set is now four times higher. This is offset by the fact that a cache line
is now only one fourth as likely to be evicted.
The ideal goal would be to maximize the set associativity of a cache by designing it so any
main memory location maps to any cache line. A cache that does this is known as a fully
associative cache.
One method hardware designers use to increase the set associativity of a cache
is a content addressable memory (CAM). However, as the
associativity increases, so does the complexity of the hardware that supports it.
A CAM uses a set of comparators to compare the input tag address with a cache-tag
stored in each valid cache line.
A CAM works in the opposite way a RAM works. Where a RAM produces data when
given an address value, a CAM produces an address if a given data value exists in the
memory.
The cache controller uses the address tag as the input to the CAM and the output selects
the way containing the valid cache line.
The tag portion of the requested address is used as an input to the four CAMs that
simultaneously compare the input tag with all cache-tags stored in the 64 ways.
The controller enables one of four CAMs using the set index bits. The indexed CAM
then selects a cache line in cache memory and the data index portion of the core address
selects the requested word, halfword, or byte within the cache line.
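As a software analogue, the parallel CAM comparison can be modelled by a loop over the ways of a set; the sizes below (4 KB, four ways, 32-byte lines) are assumptions for illustration only.

#include <stdbool.h>
#include <stdint.h>

#define LINE_BYTES 32   /* assumed cache line size                  */
#define NUM_WAYS   4    /* four-way set associative                 */
#define NUM_SETS   32   /* assumed: 4 KB / (4 ways * 32-byte lines) */

struct line {
    uint32_t tag;
    unsigned valid : 1;
    uint8_t  data[LINE_BYTES];
};

static struct line cache[NUM_SETS][NUM_WAYS];

/* The hardware compares the tag against every way in parallel using CAMs;
 * this sketch performs the same comparison sequentially. */
bool lookup_4way(uint32_t addr, uint8_t *out)
{
    uint32_t data_index = addr % LINE_BYTES;
    uint32_t set_index  = (addr / LINE_BYTES) % NUM_SETS;
    uint32_t tag        = addr / (LINE_BYTES * NUM_SETS);

    for (int way = 0; way < NUM_WAYS; way++) {
        struct line *l = &cache[set_index][way];
        if (l->valid && l->tag == tag) {
            *out = l->data[data_index];
            return true;             /* hit in this way */
        }
    }
    return false;                    /* miss in every way */
}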
A write buffer is a very small, fast FIFO memory buffer that temporarily holds data
that the processor would normally write to main memory. In a system without a write buffer,
the processor writes directly to main memory.
The write buffer reduces the processor time taken to write small blocks of sequential
data to main memory.
The FIFO memory of the write buffer is at the same level in the memory hierarchy as
the L1 cache and is shown in Figure 12.1.
The efficiency of the write buffer depends on the ratio of main memory writes to the
number of instructions executed.
If the write buffer does not fill, the running program continues to execute out of cache
memory using registers for processing, cache memory for reads and writes, and the
write buffer for holding evicted cache lines while they drain to main memory.
A write buffer also improves cache performance; the improvement occurs during cache
line evictions.
If the cache controller evicts a dirty cache line, it writes the cache line to the write
buffer instead of main memory.
The new cache line data will be available sooner, and the processor can continue
operating from cache memory.
Data written to the write buffer is not available for reading until it has exited the write
buffer to main memory.
The ARM10 family, for example, supports coalescing, the merging of write
operations into a single cache line.
The write buffer will merge the new value into an existing cache line in the write buffer
if they represent the same data block in main memory. Coalescing is also known as
write merging, write collapsing, or write combining.
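A rough model of the put side of a coalescing write buffer is sketched below; the FIFO depth, line size, and function name are illustrative assumptions, not ARM10 details.

#include <stdbool.h>
#include <stdint.h>

#define LINE_BYTES 32   /* assumed data block (cache line) size */
#define FIFO_DEPTH 8    /* assumed write buffer depth           */

struct wb_entry {
    uint32_t line_addr;              /* line-aligned address of the data block */
    uint8_t  bytes[LINE_BYTES];
    bool     byte_valid[LINE_BYTES];
};

static struct wb_entry fifo[FIFO_DEPTH];
static int fifo_count;               /* entries waiting to drain to main memory */

/* Merge the write into an existing entry when it targets the same data
 * block (coalescing); otherwise queue a new FIFO entry. Draining the
 * buffer to main memory is not shown. */
bool write_buffer_put(uint32_t addr, uint8_t value)
{
    uint32_t line_addr = addr & ~(uint32_t)(LINE_BYTES - 1);
    uint32_t offset    = addr &  (uint32_t)(LINE_BYTES - 1);

    for (int i = 0; i < fifo_count; i++) {
        if (fifo[i].line_addr == line_addr) {     /* same block: coalesce */
            fifo[i].bytes[offset]      = value;
            fifo[i].byte_valid[offset] = true;
            return true;
        }
    }
    if (fifo_count == FIFO_DEPTH)
        return false;                             /* buffer full: the core must wait */

    fifo[fifo_count].line_addr          = line_addr;
    fifo[fifo_count].bytes[offset]      = value;
    fifo[fifo_count].byte_valid[offset] = true;
    fifo_count++;
    return true;
}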
There are two terms used to characterize the cache efficiency of a program:
the cache hit rate and the cache miss rate. The hit rate is the number of cache hits
divided by the total number of memory requests over a given time interval. The value is
expressed as a percentage:

hit rate = (cache hits / total memory requests) × 100

The miss rate is similar in form: the total cache misses divided by the total number of
memory requests over the same interval, expressed as a percentage:

miss rate = (cache misses / total memory requests) × 100

Note that the miss rate also equals 100 minus the hit rate.
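As a minimal helper, with hypothetical counter names, the two measures can be computed as:

/* Hypothetical counters collected over some measurement interval. */
double hit_rate(unsigned long cache_hits, unsigned long memory_requests)
{
    return 100.0 * (double)cache_hits / (double)memory_requests;
}

double miss_rate(unsigned long cache_misses, unsigned long memory_requests)
{
    /* equivalently 100.0 - hit_rate(...), since hits + misses = requests */
    return 100.0 * (double)cache_misses / (double)memory_requests;
}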
Two other terms used in cache performance measurement are the hit time, the time it
takes to access a memory location in the cache, and the miss penalty, the time it takes
to load a cache line from main memory into the cache.
The cache write policy determines where data is stored during processor write operations.
The replacement policy selects the cache line in a set that is used for the next line fill
during a cache miss.
The allocation policy determines when the cache controller allocates a cache line.
Writethrough
– The cache controller writes to both cache memory and main memory when there is a cache hit on
a write, ensuring that the cache and main memory stay coherent at all times, but this is slower
than writeback.
Writeback
– The cache controller writes to valid cache data memory and not to main
memory. Consequently, valid cache lines and main memory may contain different
data.
– The line data will be written back to main memory when evicted.
– Must use one or more of the dirty bits.
One performance advantage a writeback cache has over a writethrough cache is in the
frequent use of temporary local variables by a subroutine.
These variables are transient in nature and never really need to be written to main memory.
An example of one of these transient variables is a local variable that overflows onto a
cached stack because there are not enough registers in the register file to hold the variable.
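The difference between the two write policies can be sketched as follows; the helper functions are assumed for illustration and are not part of any real cache controller interface.

#include <stdint.h>

/* Assumed helpers, not part of any real cache controller interface. */
extern int  cache_hit(uint32_t addr);
extern void cache_data_write(uint32_t addr, uint32_t value);
extern void cache_mark_dirty(uint32_t addr);
extern void main_memory_write(uint32_t addr, uint32_t value);

/* Writethrough: on a cache hit the write goes to both cache memory and
 * main memory, so the two always stay coherent, at the cost of the slow
 * main memory write time on every store. */
void write_through(uint32_t addr, uint32_t value)
{
    if (cache_hit(addr))
        cache_data_write(addr, value);
    main_memory_write(addr, value);
}

/* Writeback: on a cache hit only cache memory is updated and the dirty
 * bit is set; main memory is updated later, when the dirty line is
 * evicted. Miss behaviour depends on the allocation policy and is not
 * modelled here. */
void write_back(uint32_t addr, uint32_t value)
{
    if (cache_hit(addr)) {
        cache_data_write(addr, value);
        cache_mark_dirty(addr);
    } else {
        main_memory_write(addr, value);
    }
}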
Pseudorandom replacement
– randomly selects the next cache line in a set to replace. The selection algorithm uses a
nonsequential incrementing victim counter. In a pseudorandom replacement algorithm, the
controller increments the victim counter by randomly selecting an increment value and
adding this value to the victim counter. When the victim counter reaches a maximum value,
it is reset to a defined base value.
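A minimal sketch of such a victim counter, assuming a four-way set associative cache, a base value of zero, and the C library rand() as the source of randomness:

#include <stdlib.h>

#define NUM_WAYS 4    /* assumed four-way set associative cache     */
#define BASE_WAY 0    /* assumed base value the counter is reset to */

static unsigned victim_counter = BASE_WAY;

/* Pseudorandom replacement: advance the victim counter by a randomly
 * selected increment; when it reaches the maximum value, reset it to
 * the defined base value. The returned way receives the next line fill. */
unsigned next_victim_way(void)
{
    unsigned increment = 1u + (unsigned)(rand() % NUM_WAYS);  /* nonsequential step */

    victim_counter += increment;
    if (victim_counter >= NUM_WAYS)    /* reached the maximum value */
        victim_counter = BASE_WAY;     /* reset to the base value   */

    return victim_counter;
}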