
ELECTRICAL AND ELECTRONIC ENGINEERING

Cache and Caching


Caching refers to an important optimization technique used to reduce the Von Neumann bottleneck (the time spent performing memory accesses, which can limit overall performance) and to improve the performance of any hardware or software system that retrieves information. A cache acts as an intermediary between a requester and an underlying data store.

Characteristics of a Cache
A cache is small, active, transparent and automatic.
Small	Most caches are about 10% of the main memory size and hold a correspondingly small fraction of the data.
Active	The cache has an active mechanism that examines each request and decides how to respond: whether the item is available or not, and if it is not available, how to retrieve a copy of the item from the data store. It also decides which items to keep in the cache.
Transparent	A cache can be inserted without making changes to the requester or the data store. The cache presents to the requester the same interface as the data store does, and vice versa.
Automatic	The cache mechanism does not receive instructions on how to act or which data items to store in the cache storage. Instead it implements an algorithm that examines the sequence of requests and uses the requests to determine how to manage the cache.
Importance
Flexibility of usage:
o Hardware, software, or a combination of the two
o Small, medium and large data items
o Generic data items
o Application-specific types of data
o Textual and non-textual data
o A variety of computers
o Systems designed to retrieve data (the web) or to store it (physical memories)

Cache terminologies
The terminology depends on the application:
Backing store	The underlying data store in a memory system
Browser cache / origin server	A cache of web pages is kept by the browser; the origin server is the data store
Database lookups	Client requests are cached by the system that handles requests to the database servers
Hit	A request that can be satisfied without any need to access the underlying data store
Miss	A request that cannot be satisfied by the cache alone
High locality of reference	A request sequence containing repetitions of the same request

Hit ratio = (number of requests that are hits) / (total number of requests)

Cost = r × Ch + (1 − r) × Cm, where r is the hit ratio and Ch and Cm are the costs of accessing the cache and the data store respectively.
Miss ratio = 1 − hit ratio
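As a worked illustration (the numbers are assumed, not from the notes): with r = 0.9, Ch = 2 ns and Cm = 60 ns, the average cost is 0.9 × 2 + 0.1 × 60 = 7.8 ns. A minimal C sketch of the same calculation:

```c
#include <stdio.h>

/* Effective access cost for a single-level cache:
 * Cost = r*Ch + (1 - r)*Cm, where r is the hit ratio. */
static double effective_cost(double r, double ch, double cm)
{
    return r * ch + (1.0 - r) * cm;
}

int main(void)
{
    double r  = 0.9;    /* assumed hit ratio                */
    double ch = 2.0;    /* assumed cache access time, ns    */
    double cm = 60.0;   /* assumed main memory access time, ns */

    printf("Effective cost = %.1f ns\n", effective_cost(r, ch, cm));
    /* prints: Effective cost = 7.8 ns */
    return 0;
}
```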
Replacement policy
To increase the hit ratio, the replacement policy should:
1. retain those items that will be referenced most frequently
2. be inexpensive to implement
3. in practice the LRU (least recently used) method is preferred
Multi-level cache
More than one cache is used along the path from the requester to the data store. The cost of accessing the new cache is lower than the cost of accessing the original cache.

Cost = r1 × Ch1 + r2 × Ch2 + (1 − r1 − r2) × Cm


Multi-level caches generally operate by checking the smallest Level 1 (L1) cache first; if it
hits, the processor proceeds at high speed. If the smaller cache misses, the next larger cache
(L2) is checked, and so on, before external memory is checked.
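A rough C sketch of this lookup cascade, using toy stand-in functions for the three levels (the names and behaviour are assumptions for illustration, not a real controller):

```c
#include <stdbool.h>
#include <stdio.h>

/* Toy stand-ins for the three levels of the hierarchy. */
static bool l1_lookup(unsigned addr, int *value) { (void)addr; (void)value; return false; }
static bool l2_lookup(unsigned addr, int *value) { if (addr % 2 == 0) { *value = 42; return true; } return false; }
static int  memory_read(unsigned addr)           { return (int)addr; }

/* Check the smallest cache first; on a miss fall through to the next
 * larger cache, and finally to external memory. */
static int read_with_hierarchy(unsigned addr)
{
    int value;
    if (l1_lookup(addr, &value)) return value;  /* L1 hit: fastest   */
    if (l2_lookup(addr, &value)) return value;  /* L2 hit: slower    */
    return memory_read(addr);                   /* miss: main memory */
}

int main(void)
{
    printf("%d %d\n", read_with_hierarchy(4), read_with_hierarchy(5));
    return 0;
}
```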

Preloading Caches
During start-up the hit ratio is very low, since every item has to be fetched from the data store. The hit ratio can be improved by preloading the cache:

o anticipating requests that are likely to be repeated

o loading frequently used pages

o pre-fetching related data


If a processor accesses a byte of memory, the cache fetches 64 bytes. Thus if the processor
fetches the next byte, the value will come from the cache. Modern computer systems employ
multiple caches. Caching is used with both virtual and physical memory as well as secondary
memory.
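A small sketch of why the next byte hits, assuming the 64-byte line size mentioned above: consecutive addresses share the same line number (the address divided by 64), so the second access finds the line already loaded. The addresses below are arbitrary examples.

```c
#include <stdio.h>

#define LINE_SIZE 64u   /* cache line size assumed in the text */

int main(void)
{
    unsigned addr1 = 0x1000;   /* arbitrary example address */
    unsigned addr2 = 0x1001;   /* the very next byte        */

    /* Both bytes live in the same 64-byte line, so once the first
     * access has filled the line, the second access is a hit. */
    printf("line of 0x%x = %u\n", addr1, addr1 / LINE_SIZE);
    printf("line of 0x%x = %u\n", addr2, addr2 / LINE_SIZE);
    return 0;
}
```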
A Translation Lookaside Buffer (TLB) contains digital circuits that move values into a Content Addressable Memory (CAM) at high speed.

 The cache can be viewed as the main memory, while the data store acts as the external storage.

Caches in multiprocessors
Write through and write back
Write through
This is the method of writing to memory in which the cache keeps a copy and forwards the write operation to the underlying memory.
Write back scheme
The cache keeps the data item locally and only writes the value to memory when necessary, for example when the value reaches the end of the LRU list and must be replaced. To determine whether a value must be written back, the cache keeps a bit termed the dirty bit.
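A minimal sketch of the two write schemes, assuming a toy single-line cache and a stand-in memory_write helper (illustrative names, not a description of real hardware):

```c
#include <stdbool.h>

/* Toy single-entry cache model used to contrast the two write schemes. */
struct cache_line {
    unsigned addr;
    int      value;
    bool     valid;
    bool     dirty;   /* dirty bit: cache value newer than main memory */
};

static int main_memory[1024];                 /* stand-in for main memory */
static void memory_write(unsigned addr, int value) { main_memory[addr % 1024] = value; }

/* Write-through: keep a copy and forward every write to memory. */
void write_through(struct cache_line *line, unsigned addr, int value)
{
    line->addr = addr; line->value = value; line->valid = true;
    memory_write(addr, value);                /* memory always updated */
}

/* Write-back: update only the cache and mark the line dirty; memory is
 * written later, when the line is evicted. */
void write_back(struct cache_line *line, unsigned addr, int value)
{
    line->addr = addr; line->value = value; line->valid = true;
    line->dirty = true;                       /* must be flushed on eviction */
}

void evict(struct cache_line *line)
{
    if (line->valid && line->dirty)
        memory_write(line->addr, line->value);   /* write back now */
    line->valid = false;
    line->dirty = false;
}

int main(void)
{
    struct cache_line line = {0, 0, false, false};
    write_through(&line, 7, 1);   /* memory updated immediately          */
    write_back(&line, 7, 2);      /* only the cache holds the new value  */
    evict(&line);                 /* dirty line flushed to memory here   */
    return 0;
}
```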
Cache Coherence
Performance is better with the write-back scheme than with the write-through scheme. Performance can also be improved by giving each processor its own cache. Unfortunately the two techniques (write back and multiple caches) conflict during READ and WRITE operations on the same address.
To avoid conflicts, all devices that access memory must follow a cache coherence protocol that coordinates the values. Each processor must inform the other processors of its operations so that values held for the same address do not become inconsistent.

Physical memory cache

 Demand paging is a form of caching

 In this analogy the physical memory behaves as the cache and the external storage as the data store

 The page replacement policy acts as the cache replacement policy


A cache inserted between the processor and memory needs to understand physical addresses. We can imagine the cache receiving a read request, checking whether the request can be answered locally, and, if the item is not present, passing the request to the underlying memory. Once the item is retrieved from memory, the cache saves a copy locally and then returns the value to the processor.
Example
READ

 The cache performs two tasks simultaneously: it passes the request to physical memory and searches its local storage

 If the answer is found locally, the memory operation is cancelled

 If there is no local answer, the cache waits for the underlying memory operation to complete

 When the answer arrives, the cache saves a copy and transfers the answer to the processor (this read path is sketched below)
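A rough C sketch of this read path, with assumed helper names standing in for the cache store and the memory bus (real hardware performs the local search and the memory access in parallel; plain C can only approximate this sequentially):

```c
#include <stdbool.h>
#include <stdio.h>

/* Toy stand-ins for the cache store and the underlying memory. */
static bool cache_search(unsigned addr, int *value) { (void)addr; (void)value; return false; }
static void cache_save(unsigned addr, int value)    { (void)addr; (void)value; }
static int  memory_read(unsigned addr)              { return (int)(addr * 2); }

/* Read path from the example: answer locally if possible; otherwise wait
 * for the underlying memory, save a copy, and return the value.
 * (Real hardware issues the memory request and the local search together
 * and cancels the memory operation on a hit.) */
static int cache_read(unsigned addr)
{
    int value;
    if (cache_search(addr, &value))
        return value;              /* hit: no memory access needed */
    value = memory_read(addr);     /* miss: wait for memory        */
    cache_save(addr, value);       /* keep a local copy            */
    return value;
}

int main(void)
{
    printf("%d\n", cache_read(21));
    return 0;
}
```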


Instructions and Data caches
Should all memory references pass through a single cache? To understand the question, imagine instructions being executed and data being accessed.
Instruction fetches tend to exhibit high locality, since in many cases the next instruction is found at an adjacent memory address, and loops are small routines that can fit into the cache.
Data fetches may be at random addresses, not necessarily adjacent in memory. Also, any time memory is referenced, the cache keeps a copy even though the value may not be needed again.
The overall performance of the cache is therefore reduced. Architects vary in their choice between separate instruction and data caches and one large cache that allows intermixing.
Virtual memory caching and cache flush

When the OS is running a program, the addresses it issues always look the same, i.e. starting from zero. If the OS switches to another program, it must also change that information in the cache, since the new program uses the same addresses to refer to a new set of values. The cache must have a way to resolve these multiple application address spaces:
1. Cache flush operation
The cache is flushed whenever the OS changes to a new virtual space.
2. Disambiguation

 Use extra bits that identify the address space

 The processor contains an extra hardware register that holds an address space ID

 Each program is allocated a unique number

 On any application swap, the OS loads the application's ID into the address space ID register

 The processor creates artificially longer addresses, containing the ID, before passing an address to the cache (see the sketch below)
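A sketch of the disambiguation idea, assuming the address space ID is simply stored alongside the tag and compared on every lookup (an illustrative structure, not any specific processor's design):

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Each cache entry carries the address space ID (ASID) of the program
 * that loaded it, effectively lengthening the address. */
struct entry {
    uint8_t  asid;     /* address space ID at fill time */
    uint32_t tag;      /* tag portion of the address    */
    bool     valid;
};

/* A lookup only hits when both the tag and the current ASID match, so two
 * programs using the same virtual address cannot see each other's data. */
static bool entry_matches(const struct entry *e, uint8_t current_asid, uint32_t tag)
{
    return e->valid && e->asid == current_asid && e->tag == tag;
}

int main(void)
{
    struct entry e = { 3, 0x12345, true };              /* filled by program with ASID 3 */
    printf("%d %d\n", entry_matches(&e, 3, 0x12345),    /* hit                  */
                      entry_matches(&e, 4, 0x12345));   /* miss: different ASID */
    return 0;
}
```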
Implementation of memory cache
Originally the memory cache contained two values per entry: a memory address and the content found at that address. Newer methods are:
1. Direct mapped cache
2. Set associative cache

 Powers of two are used to minimise computation


Direct mapped cache

 The memory and the cache are divided into blocks whose sizes are powers of two

 To distinguish blocks, a unique tag value is assigned to each group of blocks

 From figure 3, tag 2 can occupy block 0 in the cache

 Tags identify a large group of bytes rather than a single byte

 Cache look-up becomes extremely efficient (a lookup sketch follows below)

 Newer technology involves the addressing shown in figure 4
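A minimal sketch of a direct-mapped lookup, assuming power-of-two block and cache sizes (the sizes chosen here are arbitrary) so that the index and tag follow directly from the block number:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define BLOCK_SIZE 16u     /* bytes per block (power of two, assumed)  */
#define NUM_BLOCKS 1024u   /* blocks in the cache (power of two, assumed) */

struct block {
    uint32_t tag;
    bool     valid;
    uint8_t  data[BLOCK_SIZE];
};

static struct block cache[NUM_BLOCKS];

/* Direct mapping: each memory block can live in exactly one cache block,
 * chosen by (block number mod NUM_BLOCKS); the tag identifies which of
 * the many memory blocks sharing that slot is currently present. */
bool direct_mapped_hit(uint32_t addr)
{
    uint32_t block_no = addr / BLOCK_SIZE;
    uint32_t index    = block_no % NUM_BLOCKS;
    uint32_t tag      = block_no / NUM_BLOCKS;

    return cache[index].valid && cache[index].tag == tag;
}

int main(void)
{
    printf("%d\n", direct_mapped_hit(0x1234));   /* cold cache: always a miss */
    return 0;
}
```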


Associative memory cache

 A set associative cache uses hardware parallelism to provide more flexibility

 The associative approach provides hardware that can search all of the underlying caches simultaneously

 A reference may therefore reside in any of the underlying caches

 A fully associative cache has underlying caches containing only one slot each, but the slot can hold an arbitrary value; this is equivalent to a Content Addressable Memory (CAM), as sketched below
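A software approximation of the fully associative idea: every slot is searched against the requested tag. A hardware CAM performs all comparisons simultaneously; a C loop can only do them one after another. The slot count and names are assumptions.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_SLOTS 8u   /* assumed number of fully associative slots */

struct slot {
    uint32_t tag;      /* the full block address acts as the tag */
    bool     valid;
};

static struct slot cam[NUM_SLOTS];

/* Fully associative lookup: a block may sit in any slot, so every slot's
 * tag is compared against the requested tag (a CAM does all of these
 * comparisons at once in hardware). */
int cam_lookup(uint32_t tag)
{
    for (unsigned i = 0; i < NUM_SLOTS; i++)
        if (cam[i].valid && cam[i].tag == tag)
            return (int)i;      /* hit: return the matching slot */
    return -1;                  /* miss */
}

int main(void)
{
    cam[5].tag = 0xCAFE; cam[5].valid = true;
    printf("slot = %d\n", cam_lookup(0xCAFE));   /* prints: slot = 5 */
    return 0;
}
```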
Example to programmers
Programmers who understand caching can write code that exploits the cache (see the sketch below).
Array
 Assume many operations on a large array
 Perform all the operations on a single element of the array before moving to the next element, so that the program iterates through the array only once
Paging	The same single-iteration approach also keeps demand paging to a minimum
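A common illustration of this advice (not taken from the notes): traversing a two-dimensional C array row by row walks through memory sequentially and so stays within cache lines, whereas column-first traversal jumps across lines on almost every access.

```c
#include <stdio.h>

#define ROWS 1024
#define COLS 1024

static double a[ROWS][COLS];

int main(void)
{
    double sum = 0.0;

    /* Cache-friendly: C stores rows contiguously, so iterating the
     * columns in the inner loop walks through memory sequentially. */
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            sum += a[i][j];

    /* Cache-unfriendly (for comparison): swapping the loops strides
     * COLS * sizeof(double) bytes between consecutive accesses, so
     * almost every access touches a different cache line. */
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            sum += a[i][j];

    printf("%f\n", sum);
    return 0;
}
```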

TLB

The register file in the CPU is accessible by both the integer and the floating point units, or each unit may have its own specialized registers. The out-of-order execution units are intelligent enough to know the original order of the instructions in the program and re-impose program order when the results are to be committed ('retired') to their final destination registers.

Exclusive versus inclusive cache

Multi-level caches introduce new design decisions. For instance, in some processors, all data
in the L1 cache must also be somewhere in the L2 cache. These caches are called strictly
inclusive. Other processors (like the AMD Athlon) have exclusive caches — data is
guaranteed to be in at most one of the L1 and L2 caches, never in both. Still other processors
(like the Intel Pentium II, III, and 4), do not require that data in the L1 cache also reside in the
L2 cache, although it may often do so. There is no universally accepted name for this
intermediate policy, although the term mainly inclusive has been used.

The advantage of exclusive caches is that they store more data. This advantage is larger when
the exclusive L1 cache is comparable to the L2 cache, and diminishes if the L2 cache is many
times larger than the L1 cache. When the L1 misses and the L2 hits on an access, the hitting
cache line in the L2 is exchanged with a line in the L1. This exchange is quite a bit more
work than just copying a line from L2 to L1, which is what an inclusive cache does.

One advantage of strictly inclusive caches is that when external devices or other processors in
a multiprocessor system wish to remove a cache line from the processor, they need only have
the processor check the L2 cache. In cache hierarchies which do not enforce inclusion, the L1
cache must be checked as well. As a drawback, there is a correlation between the
associativities of L1 and L2 caches: if the L2 cache does not have at least as many ways as all
L1 caches together, the effective associativity of the L1 caches is restricted.

Another advantage of inclusive caches is that the larger cache can use larger cache lines,
which reduces the size of the secondary cache tags. (Exclusive caches require both caches to
have the same size cache lines, so that cache lines can be swapped on an L1 miss, L2 hit). If
the secondary cache is an order of magnitude larger than the primary, and the cache data is an
order of magnitude larger than the cache tags, this tag area saved can be comparable to the
incremental area needed to store the L1 cache data in the L2.

Three-Level Cache Hierarchy


As the latency difference between main memory and the fastest cache has become larger,
some processors have begun to utilize as many as three levels of on-chip cache. The Itanium 2
(2003) had a 6 MB unified level 3 (L3) cache on-die; the AMD Phenom II (2008) has up to 6
MB on-die unified L3 cache; and the Intel Core i7 (2008) has an 8 MB on-die unified L3
cache that is inclusive, shared by all cores. The benefits of an L3 cache depend on the
application's access patterns.
The memory hierarchy of Conroe was extremely simple and Intel was able to concentrate on
the performance of the shared L2 cache, which was the best solution for an architecture that
was aimed mostly at dual-core implementations. But with Nehalem, the engineers started
from scratch and came to the same conclusions as their competitors: a shared L2 cache was
not suited to native quad-core architecture. The different cores can too frequently flush data
needed by another core and that surely would have involved too many problems in terms of
internal buses and arbitration to provide all four cores with sufficient bandwidth while
keeping latency sufficiently low. To solve the problem, the engineers provided each core with
a Level 2 cache of its own. Since it's dedicated to a single core and relatively small (256 KB), it can be made very fast.

Then comes an enormous Level 3 cache memory (8 MB) for managing communications
between cores. That means that if a core tries to access a data item and it’s not present in the
Level 3 cache, there’s no need to look in the other cores’ private caches—the data item won’t
be there either. Conversely, if the data are present, four bits associated with each line of the cache memory (one bit per core) show whether or not the data are potentially present (potentially, but not with certainty) in the lower-level cache of another core, and which one.

Pipelining in Microprocessors
Modern microprocessors are highly structured and contain many internal processing units, each of which performs a particular task. In a real sense, each of these processing units is a special-purpose microprocessor. The processor can process several instructions simultaneously at various stages of execution. This ability is called pipelining. The Intel 8086 was the first processor to make use of idle memory time by fetching the next instruction while executing the current one. This process accelerates the overall execution of a program.
Figure 9 shows how an Intel i486 executes instructions in a pipelined fashion. While one instruction is being fetched, another is decoded, a third is being executed and a fourth is being written back. All these activities take place within the same time duration, giving an overall execution rate of one instruction per clock cycle. Compared with the conventional approach, which requires 4 clock cycles to fetch, execute and write back one instruction, the pipelining approach is much superior. If the start and end times of the operation are considered, the overall (average) rate of processing comes out to be nearly one instruction per clock (slightly more than one clock per instruction).

Fig. 9 Pipelining of instructions
Non-pipelined execution (8085): the processor fetches, decodes and executes one instruction at a time, so the bus alternates between busy (during each fetch) and idle (during decode and execute).
Pipelined execution (i486): in each clock the bus unit fetches a new instruction (or performs a data read or store), while the instruction unit decodes the previous instruction, the execution unit executes the one before that, and the address unit generates addresses for memory operands.

The pipelining approach is very much a part of RISC architecture, besides being suitable for CISC architectures. Other factors that have contributed to RISC-like behaviour on the i486 are the MMU and the 8 KB primary cache.

Additional Notes
40 MHz clock – 25 ns cycle time
DRAM chips – access time 60 – 100 ns
SRAM – access time 15 – 25 ns
SRAM (ECL) – access time 12 ns, but expensive
For a sense of scale, assume an aircraft moving at 850 km/h (about 236 m/s): the distance it moves in 12 ns is roughly 2.8 µm, about 1/10 of the diameter of a hair.
A cache attempts to combine the speed advantage of SRAM with the cheapness of DRAM, to achieve the most effective memory system.

[Block diagram: CPU connected through a cache controller to an SRAM cache and to the DRAM main memory]
The cache can be on-chip or separate. It can be between 1/10 and 1/1000 of the size of the main memory.
A cache hit means the requested information is in the cache, while a cache miss indicates that the requested information is not in the cache.
On a miss:
- the cache controller disables the READY signal, forcing the CPU to insert wait states;
- the cache controller reads a complete cache line from main memory, called a cache line fill;
- the data bytes addressed by the CPU are passed on by the cache controller immediately, before the whole cache line fill is completed.
A cache line is 16 or 32 bytes in size; the next CPU request may fall within the same cache line, hence increasing the hit rate.
Cache controllers use burst mode, in which a block of data containing more bytes than the data bus width is transferred. Burst mode doubles the bus transfer rate.
Cache write strategies:

- Write through

- Write back (copy back)

- Write allocate

Write through – always transfers the data to main memory, even when there is a cache hit.

- Wait states are incurred on writes.
- Fast write buffers are used to try to improve the write operations.
- Main memory consistency is enhanced.

Multiprocessors would have difficulty with this strategy unless an inquiry cycle is performed to re-establish consistency.
Write back – writes go to the cache only, unless specified otherwise.
Write allocate – on a write miss, the cache controller fills the cache space for a cache line with the data for the address being written.
Usually the data is written through to main memory; the cache controller then reads into the cache the applicable cache line containing the entry to be updated. The cache controller performs the write allocate independently, in parallel with CPU operation.
Write misses – because of this complication, writes that miss are often simply switched through to main memory and ignored by the cache.
CACHE ORGANIZATION AND ASSOCIATIVE MEMORY (CAM)
Types: Direct mapping
4 way
Tag
Associative memory
Assume a cache memory of 16 KB and a cache line of 16 bytes. A 32-bit address is then split into three fields:
Tag address (20 bits) | Set address (8 bits) | Byte address (4 bits)
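A small sketch showing how the three fields can be separated with shifts and masks, matching the 20/8/4 split above (the example address is arbitrary):

```c
#include <stdint.h>
#include <stdio.h>

/* 32-bit address split used above: 20-bit tag | 8-bit set | 4-bit byte. */
int main(void)
{
    uint32_t addr = 0x12345678u;                 /* arbitrary example address */

    uint32_t byte_offset = addr        & 0xFu;   /* lowest 4 bits             */
    uint32_t set_index   = (addr >> 4) & 0xFFu;  /* next 8 bits: 256 sets     */
    uint32_t tag         = addr >> 12;           /* remaining 20 bits         */

    printf("tag=0x%05x set=%u byte=%u\n", tag, set_index, byte_offset);
    /* prints: tag=0x12345 set=103 byte=8 */
    return 0;
}
```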

A cache entry consists of a cache directory entry plus the cache memory.

Cache directory
The cache directory is stored internally in the cache controller or in an external RAM (hence more SRAMs than strictly necessary).
The cache memory stores the actual data.
e.g. a 4-way cache – cache directory:
TAG	Element of the cache directory; determines whether an access is a hit or a miss.
Valid bit	Indicates that the cache line holds valid data.
Flush	Resets the valid bits of the cache lines.
Write protect	The cache line may not be overwritten.

SET
The tags of the corresponding cache lines in all ways are the elements of a set.
Way	For a given set address, the tag addresses of all ways are simultaneously compared with the tag part of the address issued by the CPU to decide hit or miss (see the sketch below).
Capacity	4 ways × 256 sets × 16-byte cache lines = 16 KB
On a miss, the LRU bits are checked to choose the line to replace.
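A sketch of the hit/miss check for one set of the 4-way cache described above (16 KB, 256 sets, 16-byte lines). In hardware the four tag comparisons happen simultaneously; in C they can only be written as a loop. The structure and names are assumptions.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_WAYS 4u
#define NUM_SETS 256u   /* 4 ways x 256 sets x 16-byte lines = 16 KB */

struct dir_entry {
    uint32_t tag;
    bool     valid;
};

static struct dir_entry directory[NUM_SETS][NUM_WAYS];

/* Compare the tag of every way in the addressed set against the tag
 * part of the CPU address; any valid match is a hit. */
int hit_way(uint32_t set_index, uint32_t tag)
{
    for (unsigned way = 0; way < NUM_WAYS; way++)
        if (directory[set_index][way].valid &&
            directory[set_index][way].tag == tag)
            return (int)way;    /* hit in this way            */
    return -1;                  /* miss: consult the LRU bits */
}

int main(void)
{
    directory[5][2].tag = 0xABCDE; directory[5][2].valid = true;
    printf("way = %d\n", hit_way(5, 0xABCDE));   /* prints: way = 2 */
    return 0;
}
```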
Algorithms
Direct mapping	A cache line can occupy only one position.
Associative	A cache line can be placed anywhere within the four ways, so overwriting can be avoided. A 2-way cache would be faster than a 4-way cache. The associative memory concept is also known as Content Addressable Memory (CAM).
Cache hit determination

For a 32-bit microprocessor, the 4 GB address space is divided into 2^20 cache pages of 4 KB each, which map onto the 256 sets. Each page is further divided into 16-byte cache lines.
Organization
No restrictions.
An L2 cache is organized as 512 KB in a 2-way organization. A cache line is taken as 64 bytes and there are 8192 sets.

Large caches can be implemented with external SRAMs. The tags may be held in SRAMs with a short access time (15 ns) while the data is held in SRAMs with an access time of 20 ns, which can be external.

Replacement strategies
The cache controller uses the LRU bits assigned to each set of cache lines to mark the most recently addressed way of the set.
Replacement policy (choosing the cache entry to replace):
- If not all lines in the set are valid, replace an invalid line.
- Otherwise, if B0 = 0: replace way 0 if B1 = 0, else replace way 1.
- Otherwise (B0 = 1): replace way 2 if B2 = 0, else replace way 3.
Random replacement is also possible. Comprehensive statistical analyses have shown that there is very little difference between the efficiency of the LRU and random replacement algorithms. The choice of replacement policy rests solely with the cache designer.
Access and addressing
If the last access was to way 0 or way 1, the controller sets LRU bit B0. An access to way 0 sets bit B1, while addressing way 1 clears bit B1. Accessing way 2 sets bit B2, while addressing way 3 clears bit B2.
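A sketch of this pseudo-LRU scheme with the three bits B0, B1, B2 per set. The update rules follow the text above (with B0 assumed to be cleared on an access to way 2 or 3, which the scheme needs in order to be symmetric) and the victim selection follows the decision procedure in the previous section; a sketch under those assumptions, not any particular controller's implementation.

```c
#include <stdbool.h>
#include <stdio.h>

/* Three pseudo-LRU bits per set, as described above. */
struct plru { bool b0, b1, b2; };

/* Update the bits after an access to the given way (0..3). */
void plru_touch(struct plru *p, int way)
{
    if (way == 0 || way == 1) {
        p->b0 = true;          /* last access was in ways 0/1               */
        p->b1 = (way == 0);    /* way 0 sets B1, way 1 clears it            */
    } else {
        p->b0 = false;         /* assumed: access to ways 2/3 clears B0     */
        p->b2 = (way == 2);    /* way 2 sets B2, way 3 clears it            */
    }
}

/* Choose the way to replace when all lines in the set are valid,
 * following the decision procedure above. */
int plru_victim(const struct plru *p)
{
    if (!p->b0)                /* ways 0/1 were used least recently */
        return p->b1 ? 1 : 0;  /* B1 = 0 -> way 0, else way 1       */
    else                       /* ways 2/3 were used least recently */
        return p->b2 ? 3 : 2;  /* B2 = 0 -> way 2, else way 3       */
}

int main(void)
{
    struct plru p = { false, false, false };
    plru_touch(&p, 0);                          /* way 0 most recently used */
    printf("victim = %d\n", plru_victim(&p));   /* prints: victim = 2       */
    return 0;
}
```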

