Cache and Caching: Electrical and Electronic Engineering
Characteristics of Cache
A cache is small, active, transparent and automatic.
Small – Most caches are at most about 10% of the size of the main store and therefore hold only a correspondingly small fraction of the data.
Active – The cache has an active mechanism that examines each request and decides how to respond: if the item is available locally it is returned; if not, a copy of the item is retrieved from the data store. The mechanism also decides which items to keep in the cache.
Transparent – A cache can be inserted without making changes to the requester or the data store. The cache presents to the requester the same interface as the data store does, and presents to the data store the same interface as a requester does.
Automatic – The cache mechanism does not receive instructions on how to act or which data items to store. Instead it implements an algorithm that examines the sequence of requests and uses that sequence to determine how to manage the cache.
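To make the transparent and automatic points concrete, here is a minimal sketch in C of a look-through cache. The names (store_lookup, cache_lookup, the slot layout) are invented for illustration, and the placement policy is deliberately trivial; the point is only that the cache offers the same interface as the store and decides by itself when to go to the store.

```c
#include <stdbool.h>
#include <stdio.h>

#define CACHE_SLOTS 8

typedef struct {
    bool valid;
    int  key;
    int  value;
} slot_t;

static slot_t slots[CACHE_SLOTS];

/* Stand-in for the slow underlying data store. */
static int store_lookup(int key)
{
    return key * key;                    /* pretend this is expensive */
}

/* Same interface as store_lookup(): callers cannot tell the cache is there. */
static int cache_lookup(int key)
{
    int i = key % CACHE_SLOTS;           /* trivial placement policy          */
    if (slots[i].valid && slots[i].key == key)
        return slots[i].value;           /* hit: no access to the store       */

    int value = store_lookup(key);       /* miss: fetch a copy from the store */
    slots[i] = (slot_t){ true, key, value };
    return value;
}

int main(void)
{
    printf("%d %d\n", cache_lookup(7), cache_lookup(7));  /* miss, then hit */
    return 0;
}
```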
Importance
Caching is important because of its flexibility of use:
o Hardware, software, or a combination of the two
o Small, medium and large data items
o Generic data items
o Application-specific data
o Textual and non-textual data
o A wide variety of computers
o Systems designed to retrieve data (the web) as well as systems that store data (physical memories)
Cache terminologies
The terminology varies with the application:
Memory system – the data store is called the backing store.
Cached web pages – the requester is the browser and the data store is the origin server.
Database lookups – the requester is the client and the data store is the database server (the system that handles requests).
Hit – a request that can be satisfied without any need to access the underlying data store.
Miss – a request that cannot be satisfied by the cache alone.
High locality of reference – a request sequence containing many repetitions of the same request.
Cost = r·Ch + (1 − r)·Cm, where r is the hit ratio and Ch and Cm are the costs of accessing the cache and the data store respectively.
Miss ratio = 1 − hit ratio.
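As a quick check of the formula, the fragment below computes the average cost per request for a hit ratio of 0.95; the numbers are made up for illustration.

```c
#include <stdio.h>

int main(void)
{
    double r  = 0.95;   /* hit ratio (miss ratio = 1 - r = 0.05) */
    double Ch = 1.0;    /* cost of a cache access, e.g. in ns    */
    double Cm = 60.0;   /* cost of a data-store access           */

    double cost = r * Ch + (1.0 - r) * Cm;
    printf("average cost per request = %.2f\n", cost);   /* 3.95 */
    return 0;
}
```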
Replacement policy
To increase the hit ratio, the replacement policy should:
1. Retain those items that are most likely to be referenced again
2. Be inexpensive to implement
3. In practice the LRU (least recently used) method is preferred (a sketch follows below)
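The sketch below makes the LRU idea concrete. The slot structure, the logical clock and store_lookup() are invented for this example; real hardware uses cheaper approximations (see the pseudo-LRU bits near the end of these notes).

```c
#include <stdbool.h>
#include <stdio.h>

#define SLOTS 4

typedef struct { bool valid; int key; int value; unsigned long last_used; } slot_t;

static slot_t cache[SLOTS];
static unsigned long now;                  /* logical clock, bumped per access */

static int store_lookup(int key) { return key * 10; }   /* stand-in store */

static int lru_lookup(int key)
{
    int victim = 0;
    ++now;
    for (int i = 0; i < SLOTS; i++) {
        if (cache[i].valid && cache[i].key == key) {     /* hit */
            cache[i].last_used = now;
            return cache[i].value;
        }
        /* track an empty slot, or the least recently used valid slot */
        if (!cache[i].valid ||
            (cache[victim].valid && cache[i].last_used < cache[victim].last_used))
            victim = i;
    }
    /* miss: replace the least recently used (or empty) slot */
    cache[victim] = (slot_t){ true, key, store_lookup(key), now };
    return cache[victim].value;
}

int main(void)
{
    for (int k = 0; k < 6; k++)
        printf("lookup(%d) = %d\n", k % 5, lru_lookup(k % 5));
    return 0;
}
```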
Multi level cache
More than one cache is used along the path from requester to data store. The cost of accessing the new cache is lower than the cost of accessing the original cache.
Preloading Caches
During start-up the hit ratio is very low, since every item must first be fetched from the data store. This can be improved by preloading the cache:
o using anticipation of requests (e.g. items that are requested repeatedly).
A cache can be viewed as the main memory, with the data store acting as external storage.
Caches in multiprocessors
Write through and write back
Write through
This is a method of writing to memory in which the cache keeps a copy and forwards every write operation to the underlying memory.
Write back scheme
The cache keeps the data item locally and only writes the value to memory when necessary. This is the case when the value reaches the end of the LRU list and must be replaced. To determine whether a value has to be written back, the cache keeps a bit, termed the dirty bit, with each entry.
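A rough sketch of the two write policies, with invented names: mem_write() stands in for the underlying memory and a single line stands in for the whole cache.

```c
#include <stdbool.h>
#include <stdio.h>

typedef struct { bool valid; bool dirty; int addr; int data; } line_t;

static void mem_write(int addr, int data)        /* stand-in for main memory */
{
    printf("memory write: [%d] = %d\n", addr, data);
}

/* Write-through: the cache keeps a copy and forwards every write to memory. */
static void write_through(line_t *line, int addr, int data)
{
    *line = (line_t){ true, false, addr, data };
    mem_write(addr, data);                       /* every write reaches memory */
}

/* Write-back: only the cached copy is updated; the dirty bit records that
 * memory is stale and must be updated when the line is replaced. */
static void write_back(line_t *line, int addr, int data)
{
    *line = (line_t){ true, true, addr, data };
}

static void evict(line_t *line)
{
    if (line->valid && line->dirty)              /* only dirty lines go back */
        mem_write(line->addr, line->data);
    line->valid = line->dirty = false;
}

int main(void)
{
    line_t a = {0}, b = {0};
    write_through(&a, 100, 1);                   /* memory written immediately   */
    write_back(&b, 200, 2);                      /* memory written only on evict */
    evict(&b);
    return 0;
}
```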
Cache Coherence
Performance can be improved by using a write-back scheme rather than a write-through scheme. Performance can also be improved by giving each processor its own cache. Unfortunately the two techniques (write back and multiple caches) conflict when READ and WRITE operations refer to the same address.
To avoid conflicts, all devices that access memory must follow a cache coherence protocol that coordinates the values: each processor must inform the other processors of its operations so that their views of each address do not become inconsistent.
The cache behaves like physical memory, with the data store acting as external memory.
On each request the cache performs two tasks simultaneously: it passes the request on to the physical memory and searches locally for the item.
When the OS runs a program, the addresses the program issues always look the same, i.e. they start from zero. If the OS switches to another program, it must also change that information in the cache, since the new program uses the same addresses to refer to a different set of values. The cache must therefore have a way to resolve these duplicate application addresses:
1. Cache flush operation
The cache is flushed whenever the OS changes to a new virtual address space.
2. Disambiguation
On every application swap, the OS loads the application's ID into an address space ID register, so that cache entries belonging to different applications can be told apart.
The cache divides both the memory and the cache itself into blocks, where the block size is a power of two.
To distinguish blocks, a unique tag value is assigned to each group of blocks; tags therefore identify a large group of bytes rather than a single byte.
The associative approach provides hardware that can search all tags simultaneously.
A fully associative cache can be viewed as a set of caches that each contain only one slot, where each slot can hold an arbitrary block; searching all slots in parallel makes it equivalent to a Content Addressable Memory (CAM).
Example for programmers
Programmers who understand the cache can write code that exploits it.
Array
Assume many operations must be applied to a large array. Perform all the operations on a single element of the array before moving to the next element, so that the program iterates through the array only once (see the sketch below).
Paging
The same single-iteration approach also minimises demand paging, because each page needs to be brought in only once.
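A small sketch of the array advice above; the array size and the operation are placeholders. The first routine streams through the data once, while the second touches each cache line and page over and over.

```c
#include <stddef.h>

#define N 1024

static double A[N][N];

/* Row-major traversal: consecutive accesses fall in the same cache line and
 * page, so the array is streamed through the cache only once. */
static void process_cache_friendly(double a[N][N])
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            a[i][j] = 2.0 * a[i][j] + 1.0;
}

/* Column-major traversal of the same data: successive accesses land on
 * different cache lines (and, for large N, different pages), so lines and
 * pages are fetched repeatedly. */
static void process_cache_hostile(double a[N][N])
{
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            a[i][j] = 2.0 * a[i][j] + 1.0;
}

int main(void)
{
    process_cache_friendly(A);
    process_cache_hostile(A);
    return 0;
}
```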
TLB
The translation lookaside buffer (TLB) is itself a cache: it holds recently used virtual-to-physical address translations.
The register file in the CPU is accessible by both the integer and the floating point units, or
each unit may have its own specialized registers. The out-of-order execution units are
intelligent enough to know the original order of the instructions in the program and re-impose
program order when the results are to be committed (‘retired’) to their final destination
registers.
Multi-level caches introduce new design decisions. For instance, in some processors, all data
in the L1 cache must also be somewhere in the L2 cache. These caches are called strictly
inclusive. Other processors (like the AMD Athlon) have exclusive caches — data is
guaranteed to be in at most one of the L1 and L2 caches, never in both. Still other processors
(like the Intel Pentium II, III, and 4) do not require that data in the L1 cache also reside in the
L2 cache, although it may often do so. There is no universally accepted name for this
intermediate policy, although the term mainly inclusive has been used.
The advantage of exclusive caches is that they store more data. This advantage is larger when
the exclusive L1 cache is comparable to the L2 cache, and diminishes if the L2 cache is many
times larger than the L1 cache. When the L1 misses and the L2 hits on an access, the hitting
cache line in the L2 is exchanged with a line in the L1. This exchange is quite a bit more
work than just copying a line from L2 to L1, which is what an inclusive cache does.
One advantage of strictly inclusive caches is that when external devices or other processors in
a multiprocessor system wish to remove a cache line from the processor, they need only have
the processor check the L2 cache. In cache hierarchies which do not enforce inclusion, the L1
cache must be checked as well. As a drawback, there is a correlation between the
associativities of L1 and L2 caches: if the L2 cache does not have at least as many ways as all
L1 caches together, the effective associativity of the L1 caches is restricted.
Another advantage of inclusive caches is that the larger cache can use larger cache lines,
which reduces the size of the secondary cache tags. (Exclusive caches require both caches to
have the same size cache lines, so that cache lines can be swapped on an L1 miss, L2 hit). If
the secondary cache is an order of magnitude larger than the primary, and the cache data is an
order of magnitude larger than the cache tags, this tag area saved can be comparable to the
incremental area needed to store the L1 cache data in the L2.
Then comes an enormous Level 3 cache memory (8 MB) for managing communications
between cores. That means that if a core tries to access a data item and it’s not present in the
Level 3 cache, there’s no need to look in the other cores’ private caches—the data item won’t
be there either. Conversely, if the data are present, four bits associated with each line of the
cache memory (one bit per core) show whether or not the data are potentially present
(potentially, but not with certainty) in the lower-level cache of another core, and which one.
Pipelining in Microprocessors
Modern microprocessors have a structured design and contain many internal processing units, each of which performs a particular task. In a real sense, each of these processing units is a special-purpose microprocessor. The processor can therefore work on several instructions simultaneously, each at a different stage of execution. This ability is called pipelining. The Intel 8086 was the first processor to make use of idle memory time by fetching the next instruction while executing the current one. This process accelerates the overall execution of a program.
Figure 9 shows how an Intel i486 executes instructions in a pipelined fashion. While one instruction is being fetched, another is being decoded, a third is being executed and a fourth is being written back. All these activities take place within the same clock period, giving an overall execution rate of one instruction per clock cycle. Compared with the conventional approach, which requires 4 clock cycles to fetch, execute and write back a single instruction, the pipelined approach is much superior. If the start and end times of the operation are included, the overall (average) rate comes out to be nearly one instruction per clock (equivalently, slightly more than one clock per instruction).
[Timing diagram: the upper rows show non-pipelined execution (8085), in which fetch, decode and execute for each instruction follow one another serially and the bus is idle during decode/execute; the lower rows show pipelined execution on the i486, in which the bus unit, instruction (decode) unit, execution unit and address unit each work on a different instruction during the same clock period.]
Fig. 9 Pipelining of instructions
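The rate claim above can be checked with a little arithmetic: a k-stage pipeline finishes N instructions in k + (N − 1) clocks. The instruction count below is arbitrary; k = 4 matches the fetch/decode/execute/write-back stages of Fig. 9.

```c
#include <stdio.h>

int main(void)
{
    const double k = 4.0;        /* pipeline stages       */
    const double N = 1000.0;     /* instructions executed */

    double pipelined     = (k + (N - 1.0)) / N;  /* clocks per instruction */
    double non_pipelined = k;                    /* clocks per instruction */

    printf("non-pipelined: %.2f clocks/instruction\n", non_pipelined);
    printf("pipelined:     %.3f clocks/instruction\n", pipelined);  /* ~1.003 */
    return 0;
}
```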
The pipelining approach is very much a part of RISC architecture, besides being applicable to CISC architectures. Other factors that contributed to RISC-like performance on the i486 are the on-chip MMU and the 8 KB primary cache.
Additional Notes
40 MHz clock – 25 ns cycle time
DRAM chips – access time 60–100 ns
SRAM – access time 15–25 ns
SRAM (ECL) – access time 12 ns, but expensive
For comparison, an aircraft moving at 850 km/h travels, in 12 ns, only about one tenth of the diameter of a human hair.
A cache attempts to combine the speed advantage of SRAM with the low cost of DRAM, to achieve the most effective memory system.
[Block diagram: the CPU is connected to a cache controller; the controller manages the cache (SRAM) and the main memory (DRAM).]
The cache can be on-chip or separate. It is typically between 1/10 and 1/1000 of the size of main memory.
A cache hit means the requested information is in the cache, while a cache miss indicates that the requested information is not in the cache.
Cache hit – the cache controller delivers the addressed data directly, without wait states.
Cache miss – the cache controller disables the ready signal so that the CPU inserts wait states, and reads a complete cache line from main memory (a cache line fill). The data bytes addressed by the CPU are passed on by the cache controller before the whole cache line fill is completed.
Cache line – typically 16 or 32 bytes in size. The next CPU request is likely to fall within the same cache line, which increases the hit rate.
Cache controllers use burst mode, in which a block of data containing more bytes than the width of the data bus is transferred in one operation. Burst mode doubles the bus transfer rate.
Cache strategies – write through
Write through – data is always transferred to main memory, even when there is a cache hit.
- Write operations incur wait states.
- Fast write buffers can be used to improve write performance.
- Main memory consistency is enhanced.
Multiprocessors have difficulty with this strategy unless an inquiry cycle is performed to re-establish consistency.
Write back – writes go to the cache only; main memory is updated later, when a line has to be written back.
Write allocate – on a write miss, the cache controller allocates a cache line and fills it with the data for the address being written.
Usually the data is also written through to main memory; the cache controller then reads the applicable cache line, containing the entry to be updated, into the cache. The cache controller performs the write allocate independently, in parallel with CPU operation.
Because of this complication, write misses are often simply switched through to main memory and ignored by the cache.
CACHE ORGANIZATION AND ASSOCIATIVE MEMORY (CAM)
Types: direct mapping, 4-way (set-associative), tag, associative memory.
Assume a cache memory of 16 KB and a cache line of 16 bytes. A 32-bit address is then split into a 20-bit tag address, an 8-bit set address and a 4-bit byte address.
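A sketch of that split for an arbitrary 32-bit address; the masks and shifts simply encode the 4/8/20-bit fields above.

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t addr = 0x12345678u;                 /* arbitrary example address */

    uint32_t byte_in_line = addr & 0xFu;         /* 4-bit byte address  */
    uint32_t set_index    = (addr >> 4) & 0xFFu; /* 8-bit set address   */
    uint32_t tag          = addr >> 12;          /* 20-bit tag address  */

    printf("tag = 0x%05X, set = %u, byte = %u\n", tag, set_index, byte_in_line);
    return 0;
}
```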
TAG – an element of the cache directory; the tag comparison determines whether an access is a hit or a miss.
Valid bit – indicates that the cache line holds valid data.
Flush – resets the valid bits of the cache lines.
Write protect – prevents the cache line from being overwritten.
SET – the tags of the corresponding cache lines in all ways together form a set.
Way – for a given set address, the tag addresses of all ways are simultaneously compared with the tag part of the address issued by the CPU to decide hit or miss.
Capacity – 4 ways × 256 sets × 16-byte cache line = 16 KB.
On a miss, the LRU bits are checked to choose the line to replace.
Algorithms
Direct mapping – each cache line can reside in only one position.
Associative – a cache line can be placed anywhere within the four ways, so overwriting a recently used line can be avoided. A 2-way cache is faster than a 4-way cache, since the comparison logic is simpler. The associative-memory concept is also known as Content Addressable Memory (CAM).
Cache hit determination
With a 32-bit microprocessor, the 4 GB address space is divided into 2^20 cache pages of 4 KB each; a page contains 256 sixteen-byte cache lines, one for each of the 256 sets.
Organization
There are no fixed restrictions on the organization. As an example, an L2 cache may be organized as 512 KB in a 2-way organization with a cache line of 64 bytes, giving 8192 cache lines (4096 sets).
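The set counts used in these notes follow directly from size / (ways × line size); the helper below (names invented) re-derives both examples: 16 KB / 4-way / 16 B gives 256 sets, and 512 KB / 2-way / 64 B gives 4096 sets (8192 lines).

```c
#include <stdio.h>

static void geometry(unsigned size_bytes, unsigned ways, unsigned line_bytes)
{
    unsigned lines = size_bytes / line_bytes;   /* total cache lines      */
    unsigned sets  = lines / ways;              /* lines grouped per set  */
    printf("%7u B, %u-way, %2u B lines: %5u lines, %5u sets\n",
           size_bytes, ways, line_bytes, lines, sets);
}

int main(void)
{
    geometry(16u * 1024u,  4u, 16u);   /* L1 example: 256 sets  */
    geometry(512u * 1024u, 2u, 64u);   /* L2 example: 4096 sets */
    return 0;
}
```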
Large caches can be implemented with external SRAMs. The tags may be held in SRAMs with a short access time (15 ns), while the data is held in SRAMs with an access time of 20 ns, which can be external.
Replacement strategies
The cache controller uses the LRU bits assigned to each set of cache lines to mark the most recently addressed way of the set.
Replacement decision (pseudo-LRU):
- If B0 = 0: if B1 = 0, replace way 0; otherwise replace way 1.
- If B0 = 1: if B2 = 0, replace way 2; otherwise replace way 3.
Random replacement is also possible. Comprehensive statistical analyses have shown that there is very little difference between the efficiency of the LRU and random replacement algorithms, so the choice of replacement policy rests with the cache designer.
Access and addressing
If the last access was to way 0 or way 1, the controller sets LRU bit B0; an access to way 2 or way 3 clears B0. Within the first pair, an access to way 0 sets bit B1 and an access to way 1 clears B1. Similarly, an access to way 2 sets bit B2 and an access to way 3 clears B2.
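A sketch of these three bits for one 4-way set, assuming the i486-style pseudo-LRU scheme the notes describe: update_lru() applies the access rules above and victim() walks the B0/B1/B2 decision tree to pick the way to replace.

```c
#include <stdio.h>

typedef struct { unsigned b0 : 1, b1 : 1, b2 : 1; } lru_t;

static void update_lru(lru_t *l, int way)
{
    switch (way) {
    case 0: l->b0 = 1; l->b1 = 1; break;   /* last access in pair 0/1 was way 0 */
    case 1: l->b0 = 1; l->b1 = 0; break;   /* ...was way 1                      */
    case 2: l->b0 = 0; l->b2 = 1; break;   /* last access in pair 2/3 was way 2 */
    case 3: l->b0 = 0; l->b2 = 0; break;   /* ...was way 3                      */
    }
}

static int victim(const lru_t *l)
{
    if (l->b0 == 0)                        /* pair 0/1 is the older pair   */
        return l->b1 == 0 ? 0 : 1;
    return l->b2 == 0 ? 2 : 3;             /* otherwise replace in pair 2/3 */
}

int main(void)
{
    lru_t l = {0, 0, 0};
    update_lru(&l, 0);
    update_lru(&l, 2);
    update_lru(&l, 1);
    printf("replace way %d\n", victim(&l)); /* ways 0, 2, 1 used; way 3 chosen */
    return 0;
}
```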