Module 5
Introduction
A cache is a small, fast memory between the processor and main memory. It holds recently used
data to speed up the processor, reducing the need to access slower main memory. A write buffer,
often used together with a cache, queues data for efficient writing to main memory. Caches and write
buffers are transparent to software, improving performance without requiring code changes.
However, they make program execution time harder to predict: cache eviction removes old data to
make room for new data, which can affect performance in ways that are difficult to anticipate.
The memory hierarchy around the processor core has several levels:
• Processor Core: Innermost level tightly coupled with a register file for fastest memory
access.
❖ Registers provide immediate storage for data being actively processed.
• Primary Level:
❖ Tightly Coupled Memory (TCM): On-chip memory directly connected to the
processor core via dedicated interfaces.
❖ Level 1 (L1) Cache: High-speed on-chip memory holding frequently accessed
data.
❖ Main Memory: Includes SRAM, DRAM, and flash memory; stores programs
during execution.
• Secondary Storage: Larger, slower devices like disk drives used for storing large
programs and data not currently in use.
❖ Characterized by longer access times compared to main memory.
• TCM and SRAM use similar technologies but differ in placement (on-chip vs. board-
mounted).
Cache Functionality:
• Enhances system performance by reducing the time required to access instructions and
data stored in slower memory levels.
• Temporarily moves frequently accessed instructions and data from slower levels of the
hierarchy into faster levels closer to the processor.
L2 Cache:
• Located between L1 cache and slower memory, further optimizing data access.
• Figure 12.1: Illustrates L1 cache and write buffer, essential for optimizing data flow.
• Figure 12.2: Demonstrates how caches speed up data retrieval compared to direct access
to slower main memory.
o Shows data movement in cache lines between main memory and faster cache
memory.
o Write buffer temporarily holds data before efficiently writing it to main memory.
12.1.1 Caches and Memory Management Units
• In cached cores that support virtual memory, the cache can be placed either between the
processor core and the Memory Management Unit (MMU), or between the MMU and physical memory.
• Placement determines whether the cache operates in the virtual or physical addressing
realm.
• Logical (Virtual) Cache:
• Stores data using virtual addresses; located between the processor core and the MMU.
• Allows the processor to access cached data without requiring MMU address translation.
• Physical Cache:
• Stores data using physical addresses, located between the MMU and main memory.
• Requires MMU translation of virtual addresses to physical addresses before accessing
memory.
• ARM7 through ARM10, Intel StrongARM, and Intel XScale processors use logical
caches.
• ARM11 processors utilize physical caches.
• Performance Improvement:
• Caches improve average memory-access time because programs exhibit locality of reference:
recently used code and data (temporal locality) and nearby addresses (spatial locality) tend
to be accessed again soon.
• Von Neumann Architecture:
• Single cache used for both instructions and data (unified cache).
• Instructions and data share the same memory space.
• Harvard Architecture:
• Separate instruction and data buses, and therefore separate instruction and data caches
(a split cache).
• Cache Basics:
• A basic cache consists of three main parts: a directory store, a data section, and status
information.
• Cache Components:
• Directory store: Stores cache-tags, identifying where cache lines came from in main
memory.
• Data section: Holds actual data read from main memory.
• Status information: Includes status bits like valid and dirty bits.
• Cache-Tags:
• Directory entries that specify the origin address in main memory for each cache line.
• Data Storage:
• Actual data from main memory stored in the data section of the cache.
• Cache Size:
• Refers to the amount of actual code or data the cache can hold, excluding the space used for
cache-tags and status bits.
• Status Bits:
• Valid bit: Marks if a cache line contains valid data from main memory.
• Dirty bit: Indicates if data in a cache line differs from the corresponding data in main
memory.
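The three parts listed above can be pictured as fields of a single structure. The C sketch below is illustrative only; the line size, field widths, and layout are assumptions, not the organization of any particular ARM cache.

    #include <stdint.h>
    #include <stdio.h>

    #define LINE_WORDS 4                /* assumed 16-byte line: 4 x 32-bit words */

    /* One cache line: directory entry (cache-tag), status bits, and data section. */
    struct cache_line {
        uint32_t tag;                   /* cache-tag: where the line came from in main memory   */
        unsigned valid : 1;             /* valid bit: line holds data loaded from main memory   */
        unsigned dirty : 1;             /* dirty bit: line differs from the copy in main memory */
        uint32_t data[LINE_WORDS];      /* data section: the actual code or data                */
    };

    int main(void)
    {
        /* The quoted cache size counts only the data section, not the
         * tag and status storage modeled alongside it here.            */
        printf("bytes of data per line: %zu\n", sizeof(((struct cache_line *)0)->data));
        return 0;
    }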
❖ The cache controller automatically copies data between main memory and cache to
optimize performance without software intervention.
❖ It intercepts memory read and write requests, dividing addresses into tag, set index, and
data index fields.
❖ Using the set index, it locates potential cache lines in cache memory and checks tags and
status bits.
❖ A cache hit occurs if the selected cache line is valid and its cache-tag matches the tag
field of the requested address; otherwise, it is a cache miss.
❖ On a cache miss, it performs a cache line fill by copying the entire cache line from main
memory to cache.
❖ For cache hits, it directly provides the requested data from cache memory to the
processor using the data index field.
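As a rough model of that address split, the sketch below assumes the 4 KB direct-mapped geometry used later in the text (256 lines of 16 bytes), giving a 4-bit data index, an 8-bit set index, and a 20-bit tag; the arrays and function names are made up for illustration.

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    /* Assumed geometry: 4 KB direct-mapped cache, 256 lines x 16 bytes.
     * Bits [3:0] = data index, bits [11:4] = set index, bits [31:12] = tag. */
    #define DATA_INDEX_BITS 4
    #define SET_INDEX_BITS  8
    #define NUM_LINES       (1u << SET_INDEX_BITS)

    static uint32_t cache_tag[NUM_LINES];    /* directory store (cache-tags)    */
    static bool     cache_valid[NUM_LINES];  /* status information (valid bits) */

    static uint32_t data_index(uint32_t addr) { return addr & ((1u << DATA_INDEX_BITS) - 1); }
    static uint32_t set_index(uint32_t addr)  { return (addr >> DATA_INDEX_BITS) & (NUM_LINES - 1); }
    static uint32_t tag_of(uint32_t addr)     { return addr >> (DATA_INDEX_BITS + SET_INDEX_BITS); }

    /* A hit requires the line selected by the set index to be valid
     * and its stored cache-tag to match the tag field of the address. */
    static bool is_hit(uint32_t addr)
    {
        uint32_t line = set_index(addr);
        return cache_valid[line] && cache_tag[line] == tag_of(addr);
    }

    int main(void)
    {
        uint32_t addr = 0x00008824u;         /* arbitrary example address ending in 0x824 */
        printf("tag=0x%05x set=0x%02x word=0x%x hit=%d\n",
               (unsigned)tag_of(addr), (unsigned)set_index(addr),
               (unsigned)data_index(addr), is_hit(addr));
        return 0;
    }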
12.2.3 The Relationship between Cache and Main Memory
• Set index selects the cache location for addresses ending in 0x824.
• Data index selects the specific data within the cache line.
• Tag field is compared to cache-tag to determine data presence in cache.
• Main memory has one million possible locations for every one cache location.
• Only one value from main memory's million can be in cache at a time.
• Tag comparison determines if requested data is in cache or needs fetching.
• A direct-mapped cache maps each main memory address to exactly one cache location.
• The cost of this simple design is the potential for high levels of thrashing.
• Thrashing happens when multiple program elements fight for the same cache location.
• This leads to frequent loading and eviction of cache lines.
• Figure 12.6 overlays a software example to demonstrate thrashing.
• Shows two routines called repeatedly in a loop with the same set index address.
• Routines are placed in main memory addresses that map to the same cache location.
• First execution loads and executes routine A in cache.
• Calling routine B evicts routine A from cache.
• Cycle repeats, causing routines to swap places in cache during each loop iteration.
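A minimal sketch of why this happens, under the same assumed 4 KB direct-mapped field split as above; the two routine addresses are hypothetical, chosen only so that they differ by a multiple of the cache size.

    #include <stdint.h>
    #include <stdio.h>

    /* Same assumed split as before: bits [11:4] of the address form the set index. */
    static uint32_t set_index(uint32_t addr) { return (addr >> 4) & 0xFFu; }

    int main(void)
    {
        uint32_t routine_a = 0x00008240u;   /* hypothetical address of routine A            */
        uint32_t routine_b = 0x00009240u;   /* hypothetical address of routine B, 4 KB away */

        /* Both addresses produce the same set index, so they compete for the same
         * cache line: calling A then B in a loop evicts one to load the other on
         * every iteration, which is the thrashing described above.               */
        printf("routine A -> line 0x%02x, routine B -> line 0x%02x\n",
               (unsigned)set_index(routine_a), (unsigned)set_index(routine_b));
        return 0;
    }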
12.2.4 Set Associativity
• Some caches incorporate a design enhancement to mitigate thrashing, as shown in Figure 12.7.
• This feature divides the cache memory into smaller units called "ways."
• Figure 12.7 depicts a 4 KB cache in which the set index now addresses multiple cache lines,
one in each way.
• Previously, the direct-mapped cache had 256 lines; now there are four ways of 64 lines each.
• Cache lines sharing the same set index are grouped into a "set," hence the term "set index."
❖ Set associative caches enhance performance by grouping cache lines into sets, as
illustrated in Figure 12.8.
❖ Each set includes multiple cache lines (e.g., four ways in a set).
❖ Data or code blocks from main memory can be allocated to any of these cache lines
within a set without impacting program execution.
❖ This flexibility allows two sequential blocks from main memory to be stored in the same
set or different sets.
❖ Unlike direct-mapped caches, where a specific main memory location maps to only one
cache location, a four-way set associative cache allows a single main memory location to
map to four different cache locations.
❖ Figure 12.8 shows how this mapping changes compared to Figure 12.5, despite both
being 4 KB caches.
❖ Key differences include:
o Tag field is larger by two bits, while set index field is smaller by two bits.
o This results in four million main memory addresses mapping to one set of four
cache lines, instead of one million addresses mapping to one location.
❖ The contiguous area of main memory that maps onto the cache is now 1 KB instead of 4 KB.
❖ This increases the chance that two data blocks map to the same set, but because each set
offers four cache lines, the likelihood of eviction is reduced.
❖ In a practical scenario like the example in Figure 12.6, using a four-way set associative
cache would reduce thrashing as routines and data establish unique places in the available
cache locations within a set, assuming their sizes fit within the 1 KB mapping area.
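To make the field-width change concrete, the sketch below models the 4 KB four-way arrangement (4 ways of 64 sets, 16-byte lines) under the same assumptions as the earlier direct-mapped sketch; it is illustrative, not the layout of a specific ARM cache.

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    /* Assumed geometry: 4 KB, four-way set associative, 64 sets of 16-byte lines.
     * Compared with the direct-mapped case the set index shrinks from 8 bits to 6
     * and the tag grows by 2 bits, so four million addresses map onto each set.   */
    #define WAYS            4
    #define SET_INDEX_BITS  6
    #define NUM_SETS        (1u << SET_INDEX_BITS)

    static uint32_t cache_tag[NUM_SETS][WAYS];
    static bool     cache_valid[NUM_SETS][WAYS];

    static uint32_t set_index(uint32_t addr) { return (addr >> 4) & (NUM_SETS - 1); }
    static uint32_t tag_of(uint32_t addr)    { return addr >> (4 + SET_INDEX_BITS); }

    /* A hit may be found in any of the four ways of the selected set. */
    static int hit_way(uint32_t addr)
    {
        uint32_t set = set_index(addr);
        for (int way = 0; way < WAYS; ++way)
            if (cache_valid[set][way] && cache_tag[set][way] == tag_of(addr))
                return way;
        return -1;                      /* miss: no way in the set holds this block */
    }

    int main(void)
    {
        /* Two hypothetical addresses 4 KB apart now land in the same set but can
         * occupy different ways, so they no longer have to evict each other.      */
        uint32_t a = 0x00008240u, b = 0x00009240u;
        printf("A -> set %u, B -> set %u, hit(A)=%d\n",
               (unsigned)set_index(a), (unsigned)set_index(b), hit_way(a));
        return 0;
    }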
12.2.5 Write Buffers
❖ A write buffer is a small, fast FIFO (First-In-First-Out) memory buffer.
❖ It temporarily holds data that the processor intends to write to main memory.
❖ In systems without a write buffer, the processor writes directly to main memory.
❖ With a write buffer, data is initially written quickly to the FIFO and then transferred at a
slower pace to main memory.
❖ The purpose of the write buffer is to reduce the time the processor spends writing small
blocks of sequential data to main memory.
❖ The FIFO memory of the write buffer is positioned at the same level in the memory hierarchy
as the L1 cache, as depicted in Figure 12.1.
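A minimal model of that FIFO behavior is sketched below; the depth, entry format, and function names are assumptions made for illustration, not a description of a specific ARM write buffer.

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define WB_DEPTH 8                              /* assumed buffer depth */

    struct wb_entry { uint32_t addr; uint32_t data; };

    static struct wb_entry fifo[WB_DEPTH];
    static unsigned head, tail, count;

    /* Processor side: the write completes quickly by queuing into the FIFO.
     * Returns false when the buffer is full (a real core would stall).      */
    static bool wb_push(uint32_t addr, uint32_t data)
    {
        if (count == WB_DEPTH)
            return false;
        fifo[tail] = (struct wb_entry){ addr, data };
        tail = (tail + 1) % WB_DEPTH;
        count++;
        return true;
    }

    /* Memory side: entries drain to slower main memory in first-in, first-out order. */
    static bool wb_drain(struct wb_entry *out)
    {
        if (count == 0)
            return false;
        *out = fifo[head];
        head = (head + 1) % WB_DEPTH;
        count--;
        return true;
    }

    int main(void)
    {
        struct wb_entry e;
        wb_push(0x1000u, 0xAAu);                    /* fast writes from the processor */
        wb_push(0x1004u, 0xBBu);
        while (wb_drain(&e))                        /* slower drain to main memory    */
            printf("write 0x%08x to 0x%08x\n", (unsigned)e.data, (unsigned)e.addr);
        return 0;
    }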
Cache Hit Rate: This measures how often the processor finds requested data in the cache rather
than having to retrieve it from slower main memory.
The hit rate is the number of cache hits divided by the total number of memory requests over a
given time interval, expressed as a percentage:
hit rate = (cache hits / total memory requests) × 100
A higher hit rate indicates better cache efficiency and faster data access.
Cache Miss Rate: The miss rate is similar in form: the total cache misses divided by the total
number of memory requests expressed as a percentage over a time interval. Note that the miss
rate also equals 100 minus the hit rate.
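As a purely illustrative calculation: if a program issues 1,000 memory requests over some interval and 950 of them are satisfied from the cache, the hit rate is (950 / 1,000) × 100 = 95%, and the miss rate is 100 - 95 = 5%.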
Types of Hit and Miss Rates: These terms can specify performance for reads, writes, or both,
offering insights into how effectively the cache handles different types of memory operations.
• Hit Time: This refers to the time taken to access data from the cache when a hit occurs. It's
typically much faster than accessing data from main memory due to the cache's proximity to the
processor.
• Miss Penalty: This is the time delay when data isn't found in the cache (a miss) and must be
fetched from main memory. It represents the performance cost of cache misses compared to hits.
12.3 Cache Policy
There are three policies that determine the operation of a cache:
• The cache write policy determines where data is stored during processor write operations.
• The replacement policy selects the cache line in a set that is used for the next line fill during
a cache miss.
• The allocation policy determines when the cache controller allocates a cache line.
• When the processor core writes to memory, the cache controller has two alternatives for its
write policy. The controller can write to both the cache and main memory, updating the
values in both locations; this approach is known as writethrough. Alternatively, the cache
controller can write to cache memory and not update main memory; this is known as
writeback or copyback.
Writethrough: When the cache controller uses a writethrough policy, it writes to both cache
and main memory when there is a cache hit on write, ensuring that the cache and main memory
stay coherent at all times. Under this policy, the cache controller performs a write to main
memory for each write to cache memory. Because of the write to main memory, a writethrough
policy is slower than a writeback policy.
Writeback: When a cache controller uses a writeback policy, it writes to cache memory
and not to main memory. Consequently, valid cache lines and main memory may
contain different data. The cache line holds the most recent data, and main memory contains
older data, which has not been updated. Caches configured as writeback caches must use one or
more of the dirty bits in the cache line status information block. When a cache controller in
writeback writes a value to cache memory, it sets the dirty bit true. If the core accesses the cache
line at a later time, it knows by the state of the dirty bit that the cache line contains data not in
main memory. If the cache controller evicts a dirty cache line, it is automatically written out to
main memory. The controller does this to prevent the loss of vital information held in cache
memory and not in main memory. One performance advantage a writeback cache has over a
writethrough cache is in the frequent use of temporary local variables by a subroutine: these
transient values can be written repeatedly in cache without each write generating a slower write
to main memory.
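The contrast between the two policies on a write hit, and the extra work a dirty line causes at eviction, can be sketched as follows; the single-line cache, the write counter, and the helper names are illustrative assumptions, not any particular controller's implementation.

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    struct line { uint32_t tag; bool valid; bool dirty; uint32_t data; };

    static unsigned main_memory_writes;             /* counts slow bus writes */

    static void write_to_main_memory(uint32_t addr, uint32_t data)
    {
        (void)addr; (void)data;
        main_memory_writes++;                       /* stand-in for the slow bus transaction */
    }

    /* Writethrough: every write hit updates both the cache line and main memory,
     * so the two stay coherent at all times.                                      */
    static void write_hit_writethrough(struct line *l, uint32_t addr, uint32_t data)
    {
        l->data = data;
        write_to_main_memory(addr, data);
    }

    /* Writeback: a write hit updates only the cache line and sets the dirty bit;
     * main memory is not touched until the line is evicted.                       */
    static void write_hit_writeback(struct line *l, uint32_t data)
    {
        l->data  = data;
        l->dirty = true;
    }

    /* Eviction: a valid, dirty line must be written out before it is replaced. */
    static void evict(struct line *l, uint32_t addr)
    {
        if (l->valid && l->dirty)
            write_to_main_memory(addr, l->data);
        l->valid = false;
        l->dirty = false;
    }

    int main(void)
    {
        struct line l = { .tag = 0, .valid = true, .dirty = false, .data = 0 };

        /* 100 writes under writeback touch main memory only once, at eviction. */
        for (int i = 0; i < 100; ++i)
            write_hit_writeback(&l, (uint32_t)i);
        evict(&l, 0x2000u);
        printf("writeback:    %u main-memory writes\n", main_memory_writes);

        /* The same 100 writes under writethrough touch main memory every time. */
        main_memory_writes = 0;
        l.valid = true;
        for (int i = 0; i < 100; ++i)
            write_hit_writethrough(&l, 0x2000u, (uint32_t)i);
        printf("writethrough: %u main-memory writes\n", main_memory_writes);
        return 0;
    }

Under these assumptions the writeback run reaches main memory once (at eviction), while the writethrough run reaches it on every store, which is the advantage noted above for temporary local variables.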
• Cache Line Replacement Policies:
On a cache miss, the cache controller must select a cache line from the available set in cache
memory to store the new information from main memory. The cache line selected is known as the victim.
If the victim contains valid, dirty data, the controller must write the dirty data from the cache
memory to main memory before it copies new data into the victim cache line.
The process of selecting and replacing a victim cache line is known as eviction.
The strategy implemented in a cache controller to select the next victim is called its replacement
policy.
The replacement policy selects a cache line from the available associative member set; that is, it
selects the way to use in the next cache line replacement.
ARM cached cores support two replacement policies, either pseudorandom or round-robin.
■ Round-robin or cyclic replacement simply selects the next cache line in a set to replace. The
selection algorithm uses a sequential, incrementing victim counter that increments each time the
cache controller allocates a cache line. When the victim counter reaches a maximum value, it is
reset to a defined base value.
■ Pseudorandom replacement randomly selects the next cache line in a set to replace. The
selection algorithm uses a nonsequential incrementing victim counter. In a pseudorandom
replacement algorithm the controller increments the victim counter by randomly selecting an
increment value and adding this value to the victim counter. When the victim counter reaches a
maximum value, it is reset to a defined base value.
Most ARM cores support both policies (see Table 12.1 for a comprehensive list of ARM cores
and the policies they support).
The round-robin replacement policy has greater predictability, which is desirable in an embedded
system.
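The two victim counters described above can be sketched as follows for an assumed four-way cache; rand() merely stands in for whatever pseudorandom source the hardware uses.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define WAYS 4                                  /* assumed four-way set associative cache */

    /* Round-robin: a sequential victim counter that steps through the ways and
     * wraps back to its base value after reaching the maximum.                  */
    static unsigned rr_counter;

    static unsigned round_robin_victim(void)
    {
        unsigned victim = rr_counter;
        rr_counter = (rr_counter + 1) % WAYS;
        return victim;
    }

    /* Pseudorandom: the victim counter advances by a randomly chosen increment,
     * so the sequence of selected ways is nonsequential.                         */
    static unsigned pr_counter;

    static unsigned pseudorandom_victim(void)
    {
        unsigned victim = pr_counter;
        pr_counter = (pr_counter + 1u + (unsigned)rand() % WAYS) % WAYS;
        return victim;
    }

    int main(void)
    {
        srand((unsigned)time(NULL));
        for (int i = 0; i < 8; ++i)                 /* round-robin repeats predictably    */
            printf("%u ", round_robin_victim());
        printf("\n");
        for (int i = 0; i < 8; ++i)                 /* pseudorandom order is nonsequential */
            printf("%u ", pseudorandom_victim());
        printf("\n");
        return 0;
    }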
• Core-Specific Implementations: