Thesis
EMBEDDED SYSTEMS
MASTER OF SCIENCE
December 2009
Performance and power consumption are among the most important issues in the design of embedded systems. Several studies have shown that cache memory consumes about 50% of the total power in these systems. Thus, the architecture of the cache governs both the performance and the power usage of embedded systems. A new N-way reconfigurable data cache is proposed especially for embedded systems. This thesis explores the design of a data cache that can be configured as direct-mapped, two-way, or four-way set associative using a mode selector. The module has been designed and simulated in Xilinx ISE 9.1i and ModelSim SE 6.3e using the Verilog hardware description language.
ACKNOWLEDGEMENTS
I would like to thank my major professor, who has motivated and guided me throughout my studies and thesis work. I would also like to thank my committee members for their support, advice and suggestions. I am also grateful to Dr. Armin Mikler for being a member of my committee.
I would like to thank my husband, Nishith Bani, and my family for their encouragement and patience throughout my research work. I could not have made it without them. They gave me a reason to work very hard. I would also like to thank Dr. Dhruva Ghai and Oluwayomi Bamidele Adamo for their advice and for the time they spent guiding me. Finally, I would like to thank the friendly staff and my colleagues in the computer science department for their help throughout this work.
TABLE OF CONTENTS

Chapter

1. INTRODUCTION

1.1 Motivation

2.1 Improving Performance of Direct-Mapped Caches by Reducing Miss Rate

4. PROPOSED ARCHITECTURE OF RECONFIGURABLE DATA CACHE FOR EMBEDDED SYSTEMS

5.1 Basic Design Considerations

REFERENCES
LIST OF TABLES

Table 1-3 Impact of Cache Associativity on Miss Rate and Access Time
Table 2-1 Prior Research on Improving the Performance of Direct-Mapped Caches
Table 2-2 Prior Research on Reducing the Access Time of Set-Associative Caches
Table 2-3 Prior Research on Reducing the Power Consumption of Set-Associative Caches
Table 2-4 Prior Research on Reconfigurable Caches
Table 3-1 Effect of Cache Size on Miss Rate and Access Time
Table 3-2 Elements Available for Cache Design and Possessed by the Proposed Reconfigurable Cache Architecture
Table 5-2 Comparison of Various Design Metrics of Proposed Design with Direct-Mapped, 2-Way, and 4-Way Set Associative Caches
Table 5-3 Comparison of Design Metrics of Reconfigurable Data Cache for Various FPGA Technologies
Table 5-4 Comparison of Proposed Design with Existing Reconfigurable Memories
LIST OF FIGURES

Figure 1-4 Mapping of Main Memory Block 15 in Three Different Cache Architectures
Figure 1-5 Miss Rate (a) and Access Time (b) of SPEC92 Benchmarks on 1KB Data Caches of Different Associativities
Figure 4-1 High Level View of Proposed Architecture of Reconfigurable Data Cache
Figure 5-3 Simulation Waveform of Direct Mapped Cache Showing Read Miss
Figure 5-5 Simulation Waveform of Two-Way Associative Cache Showing Read Hit
Figure 5-7 Simulation Waveform of Four-Way Set-Associative Cache Showing Read Miss
Figure 5-9 Simulation Waveform of Reconfigurable Cache in Mode '00' (Direct Mapped) Showing Read Miss
Figure 5-10 Simulation Waveform of Reconfigurable Cache in Mode '01' (Two-Way Set-Associative) Showing Write Miss
Figure 5-11 Simulation Waveform of Reconfigurable Cache in Mode '11' (Four-Way Set-Associative) Showing Write Hit
CHAPTER 1
INTRODUCTION
The need for mobile systems, portable devices, and many other appliances used in our modern life results in a growing demand for embedded computing systems. As this growth occurs at a tremendous rate, it shrinks the time-to-market window, in which efficient design methodologies play a vital role [30]. Several programming languages and electronic design automation (EDA) tools are currently available for embedded processors, which make programming easier; designers must still weigh the design metrics and find a compromise between power, cost, performance, and time-to-market. Cache memory, a crucial part of embedded systems, is responsible for consuming approximately half of the total power in these systems. Research into the design of optimal cache architectures for portable and mobile devices is being actively pursued. According to certain studies, the use of separate data and instruction caches is one way of improving the performance of today's microprocessors. A proper cache architecture can bring down the time overhead of accessing data and instructions from off-chip main memory, thereby reducing power consumption. High gains in performance have been achieved through careful tuning of the cache architecture. Cache size, degree of associativity, block replacement algorithm, write policy, and block (cache line) size are the core parameters for optimizing the cache architecture. Suitable selection of these design parameters can enhance cache performance in terms of hit ratio, access time, and power consumption.
1.1 Motivation
Embedded systems have always been cost sensitive. Cache occupies approx.
fifty percent of the total area and also accounts for approximately fifty percent of a
processor’s total power in embedded systems, including both static and dynamic
components [5][6]. Thus, cache governs the performance and cost of application
specific embedded systems. The direct-mapped (DM) cache architecture is very popular
in embedded systems because of its simplicity, faster access time and low power
consumption. A DM cache is more energy efficient and uses less power than the same
sized two-way or four-way set associative cache since it accesses only one location of
tag and data arrays per access [2]. Moreover, a direct-mapped cache has faster access
time as it does not require a multiplexer to select the requested data from multiple
accessed data items in different sets. Although direct mapped cache has the
advantage of consuming less area and power, it suffers from poor performance. One
way to improve the performance of such systems is to use set associative cache at the
expense of larger area and higher power consumption compared to direct mapped
cache. Other ways of improving the performance of direct-mapped caches are discussed in Section 2.1. The data cache architectures of the most commonly used embedded microprocessors are listed in Table 1.1.

Table 1-1 Data Cache Associativities of Popular Embedded Microprocessors [2].
From Table 1.1, it is clear that the data cache associativity required by almost all embedded systems is one-way, two-way, or four-way. In order to match the cost, performance, and power constraints of a given application, the data cache should be able to adapt to the application's associativity requirement.
1.2 History
Recent advances in chip technology, such as the ability to place more transistors on the same die together with increased operating speeds, have led to a tremendous gap between processor speed and main memory speed. Processor speed improves at a much higher rate each year than main memory speed, which grows by only about 10% per year [4]. Thus, the gap between processor and main memory speed keeps widening. To bridge this gap, the processor accesses an on-chip cache memory instead of the off-chip main memory. Cache is a small on-chip memory situated between a high speed processor and low speed main memory. A cache is implemented using SRAM (Static Random Access Memory), which makes the cache fast, unlike the main memory, which consists of DRAM (Dynamic Random Access Memory). Fig. 1.2 illustrates the memory hierarchy of such a system.
Figure 1-2 Memory Hierarchy: Processor with On-Chip I-Cache and D-Cache, and Off-Chip Main Memory
As cache is very fast and on-chip, the processor can access it more quickly than
main memory. A cache is local memory in a computing system that stores a copy of
data and instructions currently used by the processor. The architecture of cache
memory is largely determined by the behavior of the application using that cache. Single-function embedded systems such as scanners, fax machines, and digital cameras are designed to execute a small range of well-defined tasks over the system's lifetime,
requiring a small, high performance, low power cache. In contrast, a desktop computer
has to support various applications, like word processors, spreadsheets, CAD software,
etc. which need large amounts of cache [14]. Desktop systems afford greater flexibility
for the design of cache memories in terms of cache size, associativity, block size, line
size, and multi-level caches. Due to limitations on physical size and energy budget, the design of caches for embedded applications is more constrained than that for desktop applications.
1.3 Principles of Cache Memory
At any given time, the processor needs only a small amount of data [28]. The
cache memory tries to predict the range of memory locations, which the processor will
need in the near future and copies the content of these locations in advance. Whenever
the processor needs any data, first it attempts to retrieve it from cache and if data is not
available there, it has to wait until the data is loaded from main memory to cache. At this
time data from nearby locations of the requested address are also copied to cache.
The basic principle behind cache operation is locality of reference, also known as the principle of locality: a processor needs to access only a small portion of its total available address space at a given point of time [26]. There are two basic types of reference locality:
1. Temporal locality – if a data item is used, then there is a high probability that it will be required again in the near future.
2. Spatial locality – if a memory address is used at a given instant of time, then there is a high probability that nearby addresses will be required soon.
Main memory with n-bit address lines has a total space of 2^n words. It is divided into a number of blocks containing k words each. Thus, it has a total of MB = 2^n / k main memory blocks. Cache memory is also divided into a number of lines containing k words each. The total number of cache lines (CL) is considerably smaller than the number of main memory blocks (MB). Thus, only a few main memory blocks are mapped to the cache at any point of time. When the processor initiates a read request for data not present in the cache, the whole block containing the requested data item is mapped to one of the cache lines [24]. A cache line cannot be uniquely assigned to an individual main memory block because of the large number of blocks compared to cache lines. Hence a tag is associated with each cache line to identify the physical address corresponding to that particular line. Fig. 1.3 illustrates the structure of cache and main memory.
Figure 1-3 Structure of Cache and Main Memory: cache lines 0 to CL-1, each holding a tag and a k-word line; main memory blocks 0 to MB-1
The performance of a cache is determined by how quickly and how often it can supply the requested data. It is measured in terms of the hit or miss ratio, and the access time. When the requested data is found in the cache, it is called a cache hit; otherwise, it is a cache miss. Hit ratio is defined as the number of memory accesses found in the cache divided by the total number of requested memory accesses. Miss ratio is given as (1 - hit ratio) [28]. The time taken by the cache to provide the requested data in case of a hit is called the hit time. When there is a cache miss, the requested data is fetched from main memory and mapped to the cache. The time required for fetching and mapping of the data is called the miss penalty. Techniques for improving cache performance (by increasing the hit ratio or by decreasing the miss ratio) are generally categorized as: (1) increasing block size and cache size, (2) increasing associativity, (3) cache probing, (4) supplementing the regular cache with a victim cache, and (5) hardware prefetching of data and instructions.
Cache memory is responsible for half of the total power and area usage in
architecture, which means how the cache is mapped to the system’s main memory.
Mapping reduces the chance that a moved-out block will be used again in the near
future. There are three types of mapping: direct, fully associative, and n-way set
associative.
block from main memory is assigned to one particular cache line. Mapping is based on
the following relation [24], CL = (MB mod CL), where MB is the main memory block which
is mapped to the cache line number CL and CL is the total number of lines in the cache.
Fig. 1.4 shows how block number 15 from main memory can be placed in the three different cache architectures.
Figure 1-4 Mapping of Main Memory Block 15 in Three Different Cache Architectures
[25].
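As a concrete illustration of the figure, assume a cache of 8 lines, organized for the set-associative case as 4 sets of 2 lines (these sizes are chosen only for this example):

Direct mapped:            line = 15 mod 8 = 7, so block 15 can go only to line 7.
Two-way set associative:  set = 15 mod 4 = 3, so block 15 can go to either line of set 3.
Fully associative:        block 15 can be placed in any of the 8 lines.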
Simple design and comparatively easy hardware implementation are the two
benefits of direct mapped cache [29]. However, the only problem associated with this
design is the mapping of each block from main memory to one specific cache line. In
spite of the fact that the cache exploits the principle of locality, some programs or applications may use a few data items very frequently that happen to map to the same cache line. These data items will be moved in and out of the cache continuously, causing a low hit ratio. In single-tasking embedded systems this situation is unusual, but in multi-tasking systems it can arise fairly often and hence deteriorate the overall performance.
1.6.2 Fully Associative Cache
In fully associative cache each block from main memory can be assigned to any
of the cache lines, as shown in Fig. 1.4, and thus provides the best hit ratio. At the same
time, it suffers from the overhead of cost, complexity, hardware and access time
involved in searching for the requested address within the cache. All cache tags must be compared simultaneously to determine whether a requested memory address is in the cache. In addition to this, extra logic is required to find out
which cache line should be replaced when requested data is not available in the cache.
Generally, fully associative cache is not used due to its high cost, complexity, and
access time.
1.6.3 Set-Associative Cache
In a set associative cache architecture, the cache memory is divided into a number of small, direct mapped modules, where each module is called a set. A main memory block can be assigned to any cache line within one particular set according to the relation [24], CSN = (MB mod S), where MB is the main memory block number, S is the number of sets in the cache, and CSN is the cache set number to which the block is mapped. In order to determine whether a requested memory address is in the cache, only the locations within the set indicated by the index field need to be searched. A set associative cache is a compromise between a direct mapped and a fully associative cache; thus its performance lies between these two cache architectures. Two-way and four-way set associative caches give the best performance
in terms of hit ratio and access time for embedded systems [24].
The associativity of cache memory not only affects its performance but also
greatly impacts the overall energy consumed. A direct mapped cache consumes less
energy per access than a two-way or four-way set-associative cache, because only one
tag and one data array are read during an access, rather than two or four arrays.
However, for some applications direct mapped cache has a higher miss rate and access
time, consuming higher energy for accessing the off-chip main memory. In such cases,
increasing the cache associativity is one way to reduce the miss rate and access time,
which in turn reduces the overall energy consumed by the cache [2]. Fig. 1.5(a) shows
the miss rate for the SPEC92 benchmark and Fig. 1.5(b) shows average memory
access time for these miss rates under the assumption that higher associativity will
increase the clock cycle [25]. The impact of cache associativity on miss rate and access time is summarized in Table 1.3.
Table 1-3 Impact of Cache Associativity on Miss Rate and Access Time
As shown in Fig. 1.5(a), the total miss rate for a one-way 1 KB cache is 13.3%,
for a two-way cache is 10.5% and for four-way cache is only 9.5%. It is clear from Fig.
1.5(b) that the average memory access time decreases with increase in associativity.
Figure 1-5 Miss Rate (a) and Access Time (b) of SPEC92 Benchmarks on 1KB Data
Caches of Different Associativities.
Although more energy per access is required for accessing a four-way set
associative cache, due to the additional hardware required to support line replacement,
the extra energy may be compensated by reduction in access time and energy that
would have been caused by misses. Thus, choosing the associativity appropriate to a particular application is essential to reduce energy, which motivates the need for a cache whose associativity can be configured to suit the application.
1.7 Replacement Policy
When a cache miss occurs, the requested address line must be placed into the
cache. To load a new line into the cache, one of the existing cache lines must be replaced. A cache line replacement policy is a technique for selecting the line that should be replaced when all the lines in a set associative or fully associative cache are full. In a direct-mapped cache, the requested line can go to exactly one location in the cache. In a set associative cache, the requested line can go in one of a fixed number of cache locations; thus we have a choice as to where to place the requested line and hence a choice of which line to replace. In a fully associative cache, the requested line can go to any location in the cache, so all cache lines are candidates for replacement. The proposed design can work either as direct-mapped, which does not need any replacement policy, or as set associative, which requires a replacement policy for line replacement on a cache miss. The four most common replacement policies are [24]:
1. Random – A random replacement policy selects the line to be replaced randomly; it is very simple to implement in hardware.
2. Least Recently Used (LRU) – An LRU replacement policy replaces the cache line that has not been used for the longest time within its set.
3. First In - First Out (FIFO) – A FIFO replacement policy uses a queue of size N to keep track of the order in which cache lines were loaded, and replaces the cache line that was loaded longest ago.
4. Least Frequently Used (LFU) – An LFU replacement policy replaces the cache line that has been used least often. It has been found that this policy is a poor replacement line selection method; in practice, it performs worse than any of the other three replacement methods mentioned above.
The FIFO replacement policy is also easy to implement in hardware via a queue of size N for cache lines. The most commonly used replacement policy is LRU, which replaces the line that has been least recently used within a set, relative to the other lines. According to the principle of locality, a recently accessed cache line is more likely to be referenced again in the near future. Thus, LRU tends to give the best performance among these methods. This policy provides an excellent hit rate, but relatively expensive hardware is required for its implementation.
1.8 Write Policy
It is necessary to check whether the cache line has been modified or not, before
replacement. If the content of the cache line has not been updated since its arrival in the
cache, there is no need to modify the main memory corresponding to this cache line
prior to its replacement. When we write to a particular cache line, the data contents of the cache are modified by the write operation; the line would therefore have a different value from the corresponding main memory location. In such a case, there is a need to
update the main memory before replacement of that cache line. The cache line and
corresponding main memory location should hold the same data. There are three
different write policies which can be used to ensure that the cache and main memory
contents are the same: write-through, write-through with write buffer, and write-back
[26].
1. Write-through – In this policy, the data is immediately written to both the cache and the main memory during write operations. Easy implementation and consistency between main memory and cache are the two major benefits of this policy. On the other hand, the write-through policy introduces a significant amount of delay, as the processor has to wait until the write operation to the main memory is completed. In spite of this delay, many embedded caches use this policy because of its simplicity.
2. Write-through with write buffer – This policy reduces the processor's waiting delay during write operations. This approach uses a write buffer queue that holds data which is waiting to be written into the main memory. As soon as the cache controller writes the data into the cache and into the write buffer queue, the processor can continue execution. Thus, the processor does not have to wait to write data into main memory, which saves valuable clock cycles. If the write buffer queue is completely full, with pending data to be written to main memory, and a write request comes from the processor, then the processor has to wait until it gets an empty location in the write buffer queue. One way to solve this problem is to increase the size of the write buffer queue.
3. Write-back – This policy is also known as the posted write or copy back policy. During a write operation the data is written only to the line in the cache. This takes less time and allows the processor to continue execution immediately. The updated cache line is written to the main memory only when it is replaced. In order to keep track of a cache line which has been updated and must be written to main memory before replacement, an extra bit called the dirty bit is associated with each line. The status of the cache line is indicated by this bit: whether the line has been updated (dirty) while in the cache or not updated (clean). If the cache line is not updated, there is no need to copy its content back before replacement on a cache miss. The advantage of the write-back policy is that it can improve system performance when the processor issues write operations faster than they can be handled by main memory; however, it requires more complex hardware than the other two write policies.
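To make the dirty-bit bookkeeping described above concrete, the following minimal Verilog sketch tracks which cache lines would need a write-back before replacement. It is illustrative only: the proposed design in this thesis actually uses write-through, and the module and signal names here are assumptions.

module dirty_bits #(parameter LINES = 64, parameter IDXW = 6) (
    input  wire            clk,
    input  wire            reset,        // clear all dirty bits
    input  wire            write_hit,    // processor write hit on the indexed line
    input  wire            line_fill,    // indexed line refilled from main memory
    input  wire [IDXW-1:0] index,        // which cache line is being accessed
    output wire            victim_dirty  // indexed line must be written back first
);
    reg dirty [0:LINES-1];
    integer i;

    always @(posedge clk) begin
        if (reset)
            for (i = 0; i < LINES; i = i + 1)
                dirty[i] <= 1'b0;
        else if (write_hit)
            dirty[index] <= 1'b1;    // cache line is now newer than main memory
        else if (line_fill)
            dirty[index] <= 1'b0;    // cache and main memory agree again
    end

    assign victim_dirty = dirty[index];
endmodule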
1.9 Organization of the Thesis
The rest of this thesis is organized as follows. Chapter 2 reviews related research on cache design. Chapter 3 discusses the design considerations for the proposed cache. Chapter 4 presents the proposed architecture of the reconfigurable data cache. In Chapter 5, the prototypes of the direct-mapped, two-way and four-way associative data caches along with the proposed reconfigurable data cache are explained, together with the simulation and synthesis results. Chapter 6 concludes the thesis.
CHAPTER 2
RELATED RESEARCH
Cache plays a vital role in any processor based embedded system in order to
achieve high performance. Several cache designs have been proposed by researchers
either to improve the performance of direct-mapped cache or to reduce the access time
and power consumption of set associative cache. In this chapter, a review of the
research work done in the field of cache design is presented. A classification of the related research is given in Fig. 2.1. Section 2.1 reviews various research works on improving the performance of direct-mapped caches by reducing the miss rate, since cache misses incur accesses to off-chip main memory, which are both power costly and time consuming. Sections 2.2 and 2.3 briefly discuss various research works done to reduce the access time and power consumption of set associative caches. Section 2.4 covers the work done in the area of reconfigurable caches. An overview of FPGA technology is given in Section 2.5, and the contributions of this thesis are summarized in Section 2.6.
Figure 2-1 Classification of Related Research (Zhang 2003, Park 2004, Zhang 2005, Zhang 2006) and the Proposed Design
2.1 Improving Performance of Direct-Mapped Caches by Reducing Miss Rate
The ever growing need for high performance processors, especially for embedded applications, has motivated computer designers to study and work on better cache architectures. As discussed in Chapter 1, the most commonly and efficiently used method for increasing performance is the on-chip cache, which accounts for approximately half of the total power consumed by the system. Direct-mapped cache is very popular in embedded systems due to its small size and high speed. The key to improving the performance of a direct-mapped cache is to reduce the miss rate or the miss penalty.
A technique called Column Associative Cache for reducing the miss rate of direct
mapped caches has been proposed by Agarwal and Pudar [12]. This design uses an
extra bit for dynamically choosing alternate hashing functions and a multiplexer for the
address generation to achieve almost the same miss rate as that of a two-way set associative cache.
Efficient direct mapped cache [3] architectures reduce the accesses to overused
cache blocks and increase the accesses to underused blocks without increasing the
cache access time, associativity, and area. Cache memory is divided into sub arrays in
order to achieve the best trade-off among power consumption, area, and performance.
In addition to this, conventional decoders have been replaced by SRAM, CAM (Content
Addressable Memory), and XCAM (CAM with ‘don’t care’) based configurable decoders
to maintain the same access time as that of the original direct mapped design. The
small amount of extra power consumption comes from the fact that instead of the
original four AND gate decoders, eight rows of five-bit long configurable decoders have
been used in this design. The total power overhead due to replacing the original decoder is therefore small. Compared to a conventional two-way set associative cache, the efficient cache consumes less energy but achieves almost the same hit rate. The design of this cache suffers from the drawback that it requires simulation of applications in advance to find out the best decoding configuration.
Balanced Cache [10] introduces a programmable decoder and a block replacement policy to increase the accesses to underutilized cache blocks in a direct-mapped cache. The decoder is divided into a programmable decoder (PD) and a conventional non-programmable decoder, which together reduce the accesses to heavily used blocks to one eighth as compared to the original design. The balanced cache design provides the advantage of block replacement at the expense of a 10.5% overhead.
The design features of the approaches discussed above are tabulated in Table
2.1.
Table 2-1 Prior Research on Improving the Performance of Direct-Mapped Caches
2.2 Reducing the Access Time of Set-Associative Caches
Set associative caches perform better than direct-mapped caches in terms of hit ratio but have relatively higher access time and power consumption than a same-sized direct mapped cache. Several approaches have been proposed to improve their access time.
Cache design with partial address matching [20] uses a hit way predicting
technique to reduce the access time. The tag field is divided into two arrays: Main
Directory (MD) and Partial Address Directory (PAD). Both Main Directory and Partial
Address Directory arrays possess the same parameters such as number of sets and
associativity, as the cache. The Main Directory array contains the thirteen most
significant bits of the original tag field while the remaining least significant five bits are
moved out to the Partial Address Directory. Only the PAD array is compared to predict
the hit instead of comparing the full tag field, and this hit is verified by MD comparison.
This design is faster because initially only the five bit PAD comparison is required to
predict the hit; if the PAD prediction is not correct, the full MD comparison is then performed.
Another design, based on a difference bit, achieves an access time almost the same as that of a conventional one-way set associative cache. This
cache design is based on the fact that there is a difference of at least one bit among two
tags of a set. This bit is called difference bit and the corresponding bit position in which
these two tags differ is called the difference-index. These difference indices and
difference bits are key design features and are used for way selection. Area overhead
depends on the size of the cache; it is 2% in the case of 8K and 1% in the case of 16K.
The design features of the approaches discussed above are tabulated in Table
2.2.
Table 2-2 Prior Research on Reducing the Access Time of Set-Associative Caches
2.3 Reducing the Power Consumption of Set-Associative Caches
Filter cache [9] is an unusually small, low power direct mapped cache, which is
positioned in front of the regular cache. In embedded systems most of the processor’s
time is spent in executing just a few tasks, so most hits would take place in the small
filter cache. Therefore, the power hungry regular cache would not be accessed
frequently. Filter cache designs achieve an overall power reduction, at the cost of some performance degradation.
A CAM based cache [22] is designed to reduce the power consumption of set
associative caches. Each standard CAM cell consists of ten transistors, compared to the
standard SRAM cell which consists of six transistors, hence occupying about twice the
area of an SRAM cell. CAM-tag caches have comparable access latency, but give lower
hit energy and higher hit rates than RAM-tag set-associative caches at the expense of
area overhead. CAM-tag caches can provide lower total memory access energy by
reducing the hit energy cost of the high associativities required to avoid costly misses.
There is no significant performance overhead associated with CAM designs, except for a small increase in access delay.
An energy efficient cache memory architecture [17] proposes a low power two-way set associative cache for embedded applications. The design consists of a modified two-way set associative cache, a decoder for the skewing function, and a prediction mechanism. The two cache banks cooperate logically with each other: one behaves as a main cache and the other as a cooperative cache, and the design uses prediction for way selection, thereby obtaining a higher hit rate. Moreover, this design has
the ability to be converted into one-way set associative for special applications. This
cache structure saves up to 55% power over conventional set associative cache.
Way halting cache [19] is a four way set associative cache with a halt tag array.
Halt tag array is a small fully associative memory, which stores the lowest four bits of all
ways tags. There is simultaneous comparison of the halt tag array with the requested
address tag, in parallel with address decoding of data and tag. This design consumes
less power as it uses static logic only instead of CAM based dynamic logic which is
used in modern highly associative cache. The halting tag array is the key component of
the design, which is responsible for overall power savings. Way halting cache achieves
about 55% energy savings over a conventional four-way set-associative cache at the cost of a small amount of additional hardware.
The design of the configurable line size cache [21] is based on the fact that reducing the number of switches per access can reduce the overall power per access. The configurable line size is
achieved by a configurable counter which is placed in the cache controller and specifies
how many words to read at a time from the off chip main memory.
The design features of the approaches discussed above are tabulated in Table
2.3.
Table 2-3 Prior Research on Reducing the Power Consumption of Set-Associative Caches
2.4 Reconfigurable Cache Architectures
Several researchers have explored reconfigurable cache designs for embedded applications. In [11] the authors propose a reconfigurable multi-function computing cache architecture, based on the observation that not all applications use the complete cache storage. This reconfigurable cache structure is divided into a regular cache and a configurable cache. The configurable part of the cache architecture can be converted into a functional unit for either of two computing functions, one of them being the Discrete Cosine Transform (DCT/IDCT). Additional logic is embedded into the cache structure to convert the cache memory into a functional unit. This cache design thus allows otherwise idle cache storage to be reused for computation.
Zhang [2] proposed a highly configurable cache architecture which can be tuned to the requirements of a particular application in embedded systems. This architecture allows the cache system to be configured at the optimal settings for an application; in particular, it can configure the cache into a direct mapped, two-way, or four-way associative architecture. This structure achieves static and dynamic power savings with very little size and performance overhead. However, designer effort is needed to exploit this energy efficiency, since there is no simple method to obtain the optimal configuration dynamically.
A reconfigurable memory architecture is proposed in [16], which can emulate
many memory structures, including a cache, a FIFO, and a simple scratchpad memory. The design is built from memory "mats", which consist of a memory array with metadata bits and a small amount of peripheral circuitry. In addition, extra functional blocks and flexible status bits have been used to support reconfigurability. This additional logic accounts for a 32% area overhead.
The design features of the approaches discussed above are tabulated in Table
2.4.
Table 2-4 Prior Research on Reconfigurable Caches
2.5 FPGA Technology Overview
A field programmable gate array (FPGA) consists of three basic elements: configurable combinational logic, programmable interconnect, and I/O blocks [38]. Fig. 2.2 shows the basic structure of an FPGA that incorporates all these
elements. The combinational logic is further divided into small units called combinational
logic blocks (CLBs) or logic elements (LEs). Fig. 2.3 shows the architecture of the
combinational logic block (CLB). A typical CLB consists of a look-up table (LUT), which
can be configured to implement a specific logic function, and a flip-flop, thus providing combinational as well as sequential logic [39]. A typical FPGA contains hundreds or
thousands of CLBs.
Figure 2-2 Basic Structure of an FPGA: an Array of CLBs Surrounded by I/O Blocks
Figure 2-3 Architecture of a Combinational Logic Block: a Look-Up Table (LUT) with Inputs, Enable, and a Flip-Flop
The CLBs communicate through programmable interconnect. FPGAs usually offer various types of interconnect based on the distance to be spanned, typically routing/wiring channels that run horizontally and vertically through the chip. I/O blocks
are used as an interface between package pins and the internal configurable logic and
often provide other features such as high-speed or low power connections. FPGAs are
slower than ASICs, and cannot accommodate complex designs, but are ideal to check
the functionality of designs. Once the logic design of any module is written using a
hardware description language, it needs to be mapped onto the low-level logic blocks of
the FPGA. When a design is loaded onto an FPGA, three specific functions are carried out: (1) the logic is mapped onto the look-up tables of the CLBs, (2) the mapped blocks are placed at specific locations on the device, and (3) the connections between the blocks are routed through the programmable interconnect. Verilog and VHDL are common hardware description languages. A single FPGA can thus implement a complete digital design.
2.6 Contributions of This Thesis
This thesis proposes a new N-way reconfigurable data cache architecture. The proposed data cache can be configured as (i) direct mapped, (ii) two-way associative, and (iii) four-way associative. We have designed a Mode
Selector module within the data cache which allows the data cache to work in any of
these three configurations. This N-way reconfigurable data cache has been designed to
target embedded systems, as most of the popular embedded systems use direct-
mapped, two-way or four-way data cache, as given in Table 1.1. The proposed
architecture has been prototyped using Verilog in Xilinx ISE 9.1i and simulated through ModelSim SE 6.3e.
CHAPTER 3
This chapter discusses the basic elements of cache design, such as cache size, mapping function, replacement policy, write policy, and block size. Table 3.2 lists the design elements available for cache design and those adopted by the proposed architecture.
3.1 Cache Size
The performance of the cache depends on its size. Table 3.1 shows the effect of cache size on miss rate and access time. The performance of the cache increases with increasing cache size, but at the same time there is an increase in power consumption and cost.
Table 3-1 Effect of Cache Size on Miss Rate and Access Time [25]
The size of the cache should be small enough to reduce the per bit storage cost
and it should be large enough to provide better hit rate. Currently, a range of cache
sizes from a few words to several megawords are available. For simplicity, we have chosen a small cache size of 256 bytes for the proposed design.
Table 3-2 Elements Available for Cache Design and Possessed by the Proposed
Reconfigurable Cache Architecture [24].
3.2.1 Direct-Mapped Cache
In direct mapping, illustrated in Fig. 3.1 [14], the physical address is divided into three fields: tag, index, and offset. The tag bits (most significant bits) are unique identifiers for each memory block currently mapped to the cache. Index
(middle) bits specify the location of the requested address within the cache and also
determine the size of the cache. Offset (least significant) bits represent a particular word
within the cache line. Whenever the processor initiates a request to access a particular
memory address, the tag comparator compares the contents of the indexed tag with the
tag of the desired address and simultaneously reads the content of the indexed data
array. If the requested address is currently in the cache, it generates a high Match
signal. The cache produces a high on the hit line if the requested address is in cache
and valid.
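The tag comparison just described amounts to a few lines of Verilog. The sketch below is illustrative only (module and signal names are assumptions); the field widths default to the 8-bit tag and 8-bit index used in Chapter 4, and the offset field is omitted for brevity:

module dm_lookup #(parameter TAGW = 8, parameter IDXW = 8) (
    input  wire [TAGW+IDXW-1:0] paddr,       // physical address = {tag, index}
    input  wire [TAGW-1:0]      stored_tag,  // tag read from the Tag RAM at `index`
    input  wire                 valid,       // valid bit read at `index`
    output wire [IDXW-1:0]      index,       // address for the Tag/Valid/Data RAMs
    output wire                 hit          // high when the access is a cache hit
);
    wire [TAGW-1:0] tag = paddr[TAGW+IDXW-1:IDXW];   // most significant bits
    assign index = paddr[IDXW-1:0];                   // least significant bits
    assign hit   = valid & (tag == stored_tag);       // Match AND Valid
endmodule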
3.2.2 Two-Way Set-Associative Cache
In a two-way set-associative cache, the cache memory is divided into two sets of direct-mapped cache. Each set has its separate tag, valid, and data arrays, as shown in Fig. 3.2. Whenever the processor initiates a request to
access a particular memory address, both tag comparators compare the contents of the
indexed tag with the tag of the desired address, and simultaneously the cache reads the contents of the indexed data array from both sets. In case of a cache hit, the multiplexer, with
the help of the encoder routes the corresponding data to the cache output.
Figure 3-2 Two-Way Set-Associative Cache Architecture
Fig. 3.3 depicts the architecture of a four-way set-associative cache where the
physical address is divided into tag field, index field, and line offset field. Being four-way
set associative, the cache consists of four sets of tag, valid and data arrays. During an
access initiated by the processor, the cache first decodes the address bits of the index
field and then concurrently reads out the contents from the appropriate locations of all sets of
tag, valid and data arrays. The cache simultaneously compares the content of all four
38
selected tag locations with the content of physical address’ tag field. If any of the
selected tags matches the requested address's tag, the corresponding comparator generates a high Match signal. A valid entry in the requested location together with a high Match signal produces a high hit signal
corresponding to that set. If there is a hit in any of the sets, the multiplexer in
combination with the encoder passes the corresponding data to the cache output.
3.3 Replacement Policy
An efficient replacement policy can reduce the power consumption of a set
associative cache. The replacement policy basically reduces the number of cache
misses, which in turn reduces power consumption [43]. The performance of a replacement policy depends on how well it predicts the future use of lines in the cache based on past accesses. LRU is the most popular replacement policy, and it gives higher performance for set associative and fully associative cache designs. However, as the associativity increases, the hardware complexity of LRU increases the delay associated with reaching the best candidate for replacement. Even if a highly associative cache with the LRU policy is designed, its performance can be limited by this replacement overhead.
Various ways of implementing LRU policy for an N-way set associative cache are
as follows [8]:
3. Counter Implementation
4. Phase Implementation
In the proposed design of the reconfigurable data cache, we have implemented the LRU policy using the counter implementation method. In this method, each set has its own LRU unit, which has a log2(N)-bit down counter corresponding to each cache line within the set. The value of the counter shows the order in which the associated cache line has been used within the set: the largest value indicates the most recently used line, while the smallest value indicates the least recently used cache line. Whenever a cache line is accessed, the corresponding counter value is compared with the other counters' values within the set. The counter values that are greater than the counter value of the currently accessed line are decremented by one, and the counter value of the currently accessed line is set to the highest value (N-1). Initially all counters are set to zero.
The proposed design can work either as two-way set associative or as four-way set associative. To implement the LRU policy for the two-way configuration, a single LRU bit is associated with each cache line within the set. Whenever a particular cache line within a set is referenced, the corresponding LRU bit is set to 1 to indicate the most recently used line, and the LRU bit of the other cache line is set to zero to indicate the least recently used line. The cache line whose LRU bit is currently 0 is selected for replacement whenever a miss occurs. To implement the LRU policy for the four-way set associative configuration, a 2-bit (log2 4) counter is needed for each cache line within the set. The counter implementation method provides good performance only for caches of low associativity, as its complexity grows with associativity [8].
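A minimal Verilog sketch of the counter method for one four-way set is shown below. It is illustrative only, not the thesis RTL, and all names are assumptions: on every access the used way's counter is set to 3 and the counters larger than its old value are decremented; the way whose counter is 0 is the replacement victim.

module lru_counter_4way (
    input  wire       clk,
    input  wire       reset,       // synchronous reset: all counters to 0
    input  wire       access,      // a line in this set was accessed this cycle
    input  wire [1:0] used_way,    // which of the four ways was accessed
    output reg  [1:0] victim_way   // way to replace on a miss (counter == 0)
);
    reg [1:0] count [0:3];         // one log2(4) = 2-bit counter per way
    integer i, j;

    always @(posedge clk) begin
        if (reset) begin
            for (i = 0; i < 4; i = i + 1)
                count[i] <= 2'd0;
        end else if (access) begin
            for (i = 0; i < 4; i = i + 1) begin
                if (i == used_way)
                    count[i] <= 2'd3;                // most recently used
                else if (count[i] > count[used_way])
                    count[i] <= count[i] - 2'd1;     // shift down by one
            end
        end
    end

    // Pick the way whose counter is zero (least recently used).
    always @(*) begin
        victim_way = 2'd0;
        for (j = 3; j >= 0; j = j - 1)
            if (count[j] == 2'd0)
                victim_way = j[1:0];
    end
endmodule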
3.4 Write Policy
A write operation can create discrepancy between cache and main memory if
only the cache is updated after a write operation [42]. In our design, to prevent this inconsistency, the write-through policy has been adopted, so that main memory is updated along with the cache on every write access.
3.5 Line Size
Another element of cache design is the size of the cache line, or main memory block, which indicates the number of words per cache line. Whenever there is a cache miss, the whole block of main memory containing the requested word is mapped to the cache. According to the principle of locality, the hit ratio initially increases with increasing line size. At the same time, if the block size is increased too far, the hit ratio starts to decrease, because the probability of using the newly fetched words becomes smaller than the probability of reusing the words that were replaced.
CHAPTER 4
PROPOSED ARCHITECTURE OF RECONFIGURABLE DATA CACHE FOR EMBEDDED SYSTEMS
A data cache memory lies between the processor and the main memory [7]. Fig.
4.1 shows the high level block diagram of the proposed design along with the signals
that the cache needs to communicate with the processor and main memory interface.
The processor interface consists of address bus (PAddress), data bus (PData), and
three control signals (PR_W, PStrobe, PReady). The processor starts a bus
transaction when PStrobe is high and a requested address is placed on the address
bus. The cache sends a PReady signal to the processor when the bus transaction is
completed. The PR_W signal is low for a write operation and high for a read operation. The main memory interface consists of an address bus (MAddress), a data bus (MData), and three control signals (MR_W, MStrobe, MReady).
In order to access the main memory, the requested address is first placed on
the MAddress bus along with the MStrobe and MR_W control signals. The MR_W signal is low
for write operation and high for read operation. MReady is used to signal the cache
memory that the bus transaction is completed by main memory. Two global signals, Clk
and Reset are used to synchronize the cache operation with the processor. There is an
additional signal called mode, which makes the cache reconfigurable. Depending upon
the value of the mode signal, the proposed reconfigurable data cache can work in any one of three configurations:
1. Direct-mapped cache
2. Two-way set-associative cache
3. Four-way set-associative cache
Figure 4-1 High Level View of Proposed Architecture of Reconfigurable Data Cache.
We have used a top-down approach to design our module. The top-level design module is further divided into sub-modules, as shown in Fig. 4.2. The proposed reconfigurable cache consists of six sub-modules: Tag RAMs, Valid RAMs, Data RAMs, Line Replacement Unit, Mode Selector Unit, and Cache Controller.
Figure 4-2 Block Diagram of the Proposed Reconfigurable Data Cache, Showing the Mode Selector Unit, Tag RAM, Valid RAM, Data RAM, LRU Unit, and Cache Controller with the Processor and Memory Interface Signals
The detailed architecture of the proposed design is shown in Fig. 4.3, which consists of the sub-modules along with connecting logic such as the Encoder, Tag Comparators, and Multiplexers. The design assumes:
1. A 16-bit address bus, which gives a total main memory address space of 64K.
2. 256 bytes of cache, which means only the 8 least significant address bits are required to address a location within the cache.
The main memory is divided up into 256 blocks of 256 bytes each, where each
block is mapped to the cache. Because only 8 address bits are needed to identify an address in the cache, the 16-bit physical address is divided into an 8-bit tag field and an 8-bit index field. Due to the reconfigurable architecture, the total cache of 256 bytes is divided into four sets of 64 locations each. Thus, we have used four sets of Tag, Valid, and Data RAMs.
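To make the address partitioning concrete (these numbers simply restate the parameters given above): the 16-bit address space holds 2^16 = 65,536 bytes, organized as 256 blocks of 256 bytes each. A 256-byte cache needs log2(256) = 8 index bits, leaving 16 - 8 = 8 tag bits. Since the cache is split into four sets, each set holds 256 / 4 = 64 locations; which sets are enabled for a given access is determined by address bits A9 and A8 together with the mode signal, as described below.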
Figure 4-3 Architecture of Reconfigurable Data Cache
1. Tag RAM - Contains the tag fields of the physical addresses which are currently mapped to the cache.
2. Valid RAM - Contains a valid bit associated with each location of cache memory.
This valid bit indicates whether the cache entry is valid or not. Initially, all entries
are set to invalid. As the data contents are moved from main memory to cache, the corresponding valid bits are set.
3. Data RAM - Contains the data contents of physical addresses which are currently
mapped to cache.
4. Tag Comparators - Whenever the processor initiates a memory access, all tag comparators simultaneously compare the content of the tag array locations indicated by the index field with the requested address's tag field. If any of the tag arrays holds the requested address, the corresponding tag comparator generates a high Match signal.
5. Line Replacement Unit – When a cache miss occurs, the line replacement unit
determines which line should be removed from the cache. According to the
mode selection signals, this unit will replace the least recently used (LRU) line within the currently selected set.
6. Mode Selector - The data cache can be configured as one-, two-, or four-way set associative using a 2-bit mode signal. In order to generate the enable control signal for each of the four sets, these two mode bits are combined with the A9 and A8 bits of the physical address. Operation of the mode selector unit is summarized in Table 4.1, and the partitioning of the total cache memory in the three modes is shown in Fig. 4.4.
Table 4-1 Mode Selector Operation

A9 A8 | Direct Mapped (Mode 00) | 2-Way (Mode 01) | 4-Way (Mode 11)
0  0  | S0                      | S0, S2          | S0, S1, S2, S3
0  1  | S1                      | S1, S3          | S0, S1, S2, S3
1  0  | S2                      | S0, S2          | S0, S1, S2, S3
1  1  | S3                      | S1, S3          | S0, S1, S2, S3
(i) When mode = "00", only one of the four output signals is set high for a particular address. Thus, only one set of data, tag, and valid arrays will be active, and the cache works as a direct mapped cache.
(ii) When mode = "11", all four output signals are set high, which in turn activates all the associated data, tag, and valid arrays. As all four sets are active in this case, the cache works as a four-way set associative cache.
(iii) When mode = “01”, cache works as a two-way set associative, because two
sets of data, tag, and valid arrays will be active for any given address.
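The behavior summarized in Table 4.1 reduces to a few lines of combinational logic. The sketch below is illustrative only (the module and port names are assumptions, not the thesis RTL); it simply encodes the set-enable patterns from the table:

module mode_selector (
    input  wire [1:0] mode,    // 00 = direct mapped, 01 = 2-way, 11 = 4-way
    input  wire       a9,      // physical address bit 9
    input  wire       a8,      // physical address bit 8
    output reg  [3:0] set_en   // enables for sets {S3, S2, S1, S0}
);
    always @(*) begin
        case (mode)
            2'b00:   set_en = 4'b0001 << {a9, a8};     // exactly one set active
            2'b01:   set_en = a8 ? 4'b1010 : 4'b0101;  // {S3,S1} or {S2,S0}
            default: set_en = 4'b1111;                 // 11 (and unused 10): all sets
        endcase
    end
endmodule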
Figure 4-4 Partitioning of the Cache Memory into Sets S0-S3 for the Three Configurations
7. Cache Controller – It controls all operations within the cache and is implemented
using a finite state machine, as shown in Fig. 4.5. The control signals asserted in each state are summarized in Table 4.2.
Figure 4-5 Cache Controller State Machine
(i) Idle State: There is no memory access and the processor is idle in this state. The controller remains in the idle state until a read or write operation is requested by the processor, at which point control transfers to the read or write state.
(ii) Read State: In this state, the cache is checked for availability of the requested address. If the requested address is currently in the cache, a cache hit occurs and control returns to the idle state on the next active clock. Otherwise, a cache miss occurs and control transfers to the readmiss state.
(iii) ReadMiss State: A main memory read access is initiated by the cache controller, and control transfers to the readmemory state.
(iv) ReadMemory State: Controller has to wait until main memory finds the
requested address and reads the data from that location. Once main memory
loads the data contents onto the data bus, it asserts a ready signal to the controller, and control transfers to the readdata state.
(v) ReadData State: The requested data is now available on the data bus. The controller writes this data into the cache and places it on the processor data bus to complete the read request. After the completion of the processor's request, control returns to the idle state.
(vi) Write State: In this state, the cache is checked for availability of the requested address. If the requested address is currently in the cache, a cache hit occurs and control transfers to the writehit state on the next active clock. Otherwise, a cache miss occurs and control transfers to the writemiss state.
(vii) WriteHit State: On a cache hit for a write operation, the controller asserts the write control signal of the data cache to write the data contents sent by the processor, and also initiates the write-through to main memory. Control transfers to the writedata state on the next active clock.
(viii) WriteMiss State: On a cache miss for a write operation, the data contents sent by the processor are written to the least recently used cache line, and the associated tag and valid RAMs are also updated. The controller initiates the write-through to main memory and transfers control to the writedata state on the next active clock.
(ix) WriteData State: Controller has to wait until main memory completes the
write operation and sends back a ready signal to controller. After completing
the requested write operation, controller will come back to idle state.
Table 4-2 Control Signals Asserted in Each State

State       | Signals asserted
ReadMemory  | PR_W, PDataOE, MR_W, CacheDataSelect, Ready
Write       | PStrobe
WriteHit    | Hit, MStrobe, MDataOE
WriteMiss   | Miss, MStrobe, MDataOE
WriteMemory | MStrobe, Ready
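The state transitions described in items (i)-(ix) can be summarized by the skeletal Verilog sketch below. The state names follow the text; the encoding, signal polarities, and output logic are assumptions, with the Moore outputs only indicated in comments:

module cache_ctrl_fsm (
    input  wire       clk,
    input  wire       reset,
    input  wire       pstrobe,   // processor request strobe
    input  wire       pr_w,      // 1 = read request, 0 = write request
    input  wire       hit,       // result of the tag comparison
    input  wire       mready,    // main memory finished its transaction
    output reg  [3:0] state
);
    localparam IDLE      = 4'd0, READ      = 4'd1, READMISS  = 4'd2,
               READMEM   = 4'd3, READDATA  = 4'd4, WRITE     = 4'd5,
               WRITEHIT  = 4'd6, WRITEMISS = 4'd7, WRITEDATA = 4'd8;

    always @(posedge clk) begin
        if (reset)
            state <= IDLE;
        else case (state)
            IDLE:      state <= !pstrobe ? IDLE : (pr_w ? READ : WRITE);
            READ:      state <= hit ? IDLE : READMISS;      // hit: data returned
            READMISS:  state <= READMEM;                    // issue MStrobe, MR_W
            READMEM:   state <= mready ? READDATA : READMEM;
            READDATA:  state <= IDLE;                       // fill line, assert PReady
            WRITE:     state <= hit ? WRITEHIT : WRITEMISS;
            WRITEHIT:  state <= WRITEDATA;                  // write cache, start write-through
            WRITEMISS: state <= WRITEDATA;                  // update LRU line, start write-through
            WRITEDATA: state <= mready ? IDLE : WRITEDATA;
            default:   state <= IDLE;
        endcase
    end
endmodule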
CHAPTER 5
SIMULATION AND SYNTHESIS RESULTS
This chapter presents the simulation and synthesis results of the reconfigurable
data cache. The mode selector unit has been considered as the basis of the proposed design. A trace file of 20 different test cases has been used for functional simulation of the design. In order to
study the comparative performance of the proposed design, direct mapped, two-way
and four-way set-associative caches are also implemented on the same target device.
5.1 Basic Design Considerations
A typical design flow for designing VLSI systems with the Verilog HDL is shown in Fig. 5.1. The design flow usually begins with the design specification and then moves through behavioral and RTL descriptions to the structural level (gates and registers). Finally, the physical description (layout) of the design is produced.
There are two basic approaches to design VLSI circuits: a top-down approach
and a bottom-up approach [34]. In the top-down approach, we first define the top-level design module and then identify the sub-modules required to build this main module. The bottom-
up approach is the reverse of the top-down approach: we first define the basic building blocks and then build the main design module using these blocks. A digital designer can choose either of these two approaches. However, a top-down approach is generally preferred,
as system level design can be done through HDL design, which can be converted into
RTL, gate, and physical levels with the help of EDA tools. It is easier to detect and fix any error or fault at the system level. This thesis has performed the system level implementation of the reconfigurable data cache in Xilinx ISE 9.1i and simulation using ModelSim SE 6.3e.
Figure 5-1 Typical Verilog HDL Design Flow: Design Specification, Behavioral Description, RTL Description, Gate Level Netlist, and Physical Layout
5.2 Verilog HDL
Verilog HDL is a hardware description language used to describe a digital system. Once the Verilog code is written, it can be loaded onto an FPGA board and tested. A Verilog design is composed of modules and primitives [23]. The basic building block in Verilog is the module. A module
consists of the keyword module, the name of the design unit, a list of I/O ports, the module internals, and the keyword endmodule. Each module can be defined at four levels of abstraction, in order to fulfill design requirements:
1. Behavioral level: This is the highest level of abstraction; the module is implemented in terms of the desired algorithm, without concern for the hardware implementation details.
2. Dataflow level: At this level, the module is implemented using hardware registers.
3. Gate level: At this level, the design is implemented using logic gates.
4. Switch level: This is the lowest level of abstraction in Verilog HDL design. At this level, a module is implemented in terms of switches (transistors), storage nodes, and the interconnections between them.
Verilog provides the flexibility to mix and match all four levels of abstraction in a
single module. Generally, the design is technology independent and more flexible at higher levels of abstraction.
In addition to the design modules, a stimulus block is required for functional verification. A stimulus block serves two main purposes:
1. It generates the test vectors that exercise the design.
2. It applies the test cases to the design under test (DUT) and also collects the output responses.
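As an illustration of such a stimulus block, the fragment below drives one read and one write request into the cache. It is a minimal sketch under assumptions: the DUT instantiation is shown commented out because the exact port list of the thesis RTL is not reproduced here, and the signal names simply follow Chapter 4.

module stimulus;
    reg         clk = 0, reset = 1;
    reg  [1:0]  mode = 2'b01;             // exercise the two-way configuration
    reg         pstrobe = 0, pr_w = 1;
    reg  [15:0] paddress = 16'h0000;

    // Hypothetical instantiation; the actual port list of the thesis RTL is assumed.
    // reconfigurable_cache dut (.Clk(clk), .Reset(reset), .mode(mode),
    //     .PStrobe(pstrobe), .PR_W(pr_w), .PAddress(paddress), ...);

    always #5 clk = ~clk;                 // free-running simulation clock

    initial begin
        #20 reset = 0;
        // read request
        #10 paddress = 16'h00A5; pr_w = 1; pstrobe = 1;
        #10 pstrobe = 0;
        // write request, widely spaced so the previous transaction can finish
        #200 paddress = 16'h01A5; pr_w = 0; pstrobe = 1;
        #10 pstrobe = 0;
        #200 $finish;
    end
endmodule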
5.3 FPGA Prototyping
The architecture of reconfigurable data cache was modeled using Verilog and the
functional simulation was carried out using ModelSim SE 6.3e. The code is written in a modular fashion, following the top-down approach described above.
This Verilog code was compiled using Xilinx ISE 9.1i. The implementations are targeted
for Xilinx’s Virtex-5 family of FPGAs. The Virtex-5 family of FPGA from Xilinx is built in a
65-nm copper CMOS technology. The Virtex-5 Configurable Logic Blocks (CLBs) are
based on 6-input look-up tables and flip-flops. The CLBs are the main logic resources for implementing both sequential and combinational circuits. Each CLB contains a pair of slices; each slice consists of four 6-input look-up tables and four flip-flops, for a total of eight 6-input look-up tables and eight flip-flops per CLB [45]. Complete synthesis of all the Verilog modules is performed, along with mapping, placement, and routing for the target device.
The RTL schematic and functional simulation result of the direct mapped cache are shown in Figs. 5.2 and 5.3, respectively. For functional verification of the designed cache module in direct mapped mode, a trace file of 20 different test cases has been applied through a driver. The simulation waveform shows only one data access, which is a read miss.
Figure 5-2 RTL Schematic of Direct Mapped Cache
Figure 5-3 Simulation Waveform of Direct Mapped Cache Showing Read Miss
Fig. 5.4 shows the RTL schematic of two-way set-associative cache. The same
trace file is used for functional verification. Fig. 5.5 depicts a data access which is a read hit.
Figure 5-4 RTL Schematic of Two-Way Set-associative Cache
Figure 5-5 Simulation Waveform of Two-Way Associative Cache Showing Read Hit
The RTL schematic and functional simulation result of the four-way set-associative cache are shown in Figs. 5.6 and 5.7, respectively. For functional verification of the design, the same trace file has been applied through a driver. The simulation waveform shows a data access which is a read miss.
Figure 5-6 RTL Schematic of Four-Way Set-associative Cache
Figure 5-7 Simulation Waveform of Four-Way Set-Associative Cache Showing Read
Miss
The RTL schematic of the reconfigurable cache is shown in Fig. 5.8. For
functional verification of the designed cache module in all three different modes, the
same trace file of 20 test cases has been applied through a driver. The simulation
waveforms of the reconfigurable cache in direct mapped, two-way, and four-way modes are shown in Figs. 5.9, 5.10, and 5.11, respectively.
Figure 5-8 RTL Schematic of Reconfigurable Cache
Figure 5-9 Simulation Waveform of Reconfigurable Cache in Mode ‘00’ (Direct Mapped)
Showing Read Miss
Figure 5-10 Simulation Waveform of Reconfigurable Cache in Mode ‘01’ (Two-Way Set-
Associative) Showing Write Miss
Figure 5-11 Simulation Waveform of Reconfigurable Cache in Mode ‘11’ (Four-Way Set-
Associative) Showing Write Hit.
Complete results obtained from the trace file in the three different modes are
summarized in Table 5.1. The memory write time for all three configurations (1-way, 2-
way, and 4-way) is the same because we have used the write-through policy for write
operations. In this policy, main memory is simultaneously updated with data cache for
every write access. The total numbers of hits and misses are the same for the 2-way and 4-way
configurations, as we have considered a very small trace file for functional simulation of
our design. A significant difference in the number of hits and misses for these two configurations could be obtained by considering a much larger trace file with many more memory accesses.
Table 5-1 Summary of Reconfigurable Cache Operation in Three Different Modes
Various design metrics of the proposed design in direct mapped, two-way and
four-way set-associative modes are summarized in Table 5.2. The timing path from a
clock to any other clock in the design indicates the minimum period [47]. The proposed
design possesses the smallest minimum period and the highest maximum operating frequency compared with the 1-way, 2-way, and 4-way set-associative caches. The
maximum path delay is an indicator of maximum path from inputs to outputs. This delay
is smallest for the proposed design. The proposed design obtains these gains in minimum period and maximum path delay largely due to the way the clock is distributed in the target device (Virtex-5). In the Virtex-5 family of FPGAs, there are a total of 32 global clocks, and each device is divided into regions for clock distribution. The proposed module is mapped into the target device in
a very compact manner compared to the other three configurations, thus producing a minimal timing path from one synchronous element to another. Since the proposed design is mapped compactly, the maximum delay from any one node to any other node is also small compared to the other designs. The cell usage, expressed in BELs, reports the
count of all the logical cells that are basic elements of the Virtex technology, for
example, LUTs, MUXCY, MUXF5, MUXF6, MUXF7, MUXF8 [46]. Flip-flops or slice
register count indicates the total number of latches and flip flop used by the design.
The reconfigurable cache design occupies a larger number of cells, slice registers, and LUTs than the other three designs, since it contains the additional mode selection logic.
Xilinx gives us the flexibility of implementing the HDL code on all the leading
devices of Virtex, Spartan, and Cool Runner families. The reconfigurable cache also
has been targeted to two other FPGA devices. Table 5.3 summarizes the design metrics of the reconfigurable data cache for three different FPGA technologies. The power figures reported are those estimated by the Xilinx synthesis tools.
Table 5-2 Comparison of Various Design Metrics of Proposed Design with Direct
Mapped, 2-Way, and 4-Way Set Associative Caches.
*Power calculated incorrectly for small designs (1, 2-way) due to software (Xilinx ISE 9.1) bugs.
Comparison charts: Maximum Frequency (MHz), Minimum Period (ns), and Slice LUTs for the four cache designs.
Table 5-3 Comparison of Design Metrics of Reconfigurable Data Cache for Various
FPGA Technologies.
Table 5-4 Comparison of Proposed Design with Existing Reconfigurable Memories.
CHAPTER 6
CONCLUSION
This thesis presented the architecture and design of a new N-way reconfigurable data cache for embedded systems. The proposed data cache can be configured as direct mapped, two-way, or four-way set associative, according to the application's requirements. We have achieved this reconfigurability with the help of a mode selector unit while utilizing the full capacity of the cache. FPGA implementations of the conventional direct-mapped, two-way, and four-way set-associative caches have also been developed, and the performance of the proposed design has been compared with these designs in terms of various design metrics. The proposed design can be further optimized in terms of speed, area, and power consumption.
REFERENCES
[4] P. Grun, N. Dutt, and Alexandru Nicolau, “Memory Architecture Exploration for
Programmable Embedded Systems", Boston: Kluwer Academic Publishers, 2003.
[5] A. Malik, B. Moyer and D. Cermak, “A Low Power Unified Cache Architecture
Providing Power and Performance Flexibility,” International Symposium on Low
Power Electronics and Design, June 2000.
[7] Crisu, D., “An Architectural Survey and Modeling of Data Cache Memories in
Verilog HDL”, in Proceedings of International Semiconductor Conference,
Vol. 1, pp.139-142, 1999.
[8] T. S. B. Sudarshan, Rahil Abbas Mir, and S. Vijayalakshmi, "Highly Efficient LRU
Implementations for High Associativity Cache Memory”, in Proceedings of 12th
IEEE International Conference on Advanced Computing and Communications,
pp.24-35, Dec 2004.
[9] J. Kin, M. Gupta and W. Mangione-Smith, “The Filter Cache: An Energy Efficient
Memory Structure,” International Symposium on Microarchitecture, Dec 1997, pp.
184-193.
[11] H. Kim, A.K. Somani, and A. Tyagi, “A Reconfigurable Multi-Function Computing
Cache Architecture,” IEEE Transactions on VLSI, Vol. 9, No. 4, pp. 509-523, Aug.
2001.
[13] C. Zhang, X. Zhang, and Y. Yan, "Two Fast and High-Associativity Cache
Schemes,” IEEE Micro, Vol. 17(5), pp. 40-49, Sep/Oct 1997.
[14] Frank Vahid, and Tony Givargis, “Embedded System Design: A Unified
Hardware/Software Introduction”, John Wiley & Sons, Inc., 2002.
[15] N. H. E. Weste and D. Harris, “CMOS VLSI Design: A Circuit and Systems
Perspective”, Addison Wesley, 2005.
[16] K. Mai, R. Ho, E. Alon, D. Liu, Y. Kim, D. Patil, and M.A. Horowitz, “ Architecture
and Circuit Techniques for a 1.1-GHz 16-kb Reconfigurable Memory in 0.18-μm
CMOS” , IEEE Journal of Solid-State Circuits, Vol.40, No.1, pp. 261-275, Jan
2005
[17] J.W. Park, C.G. Kim, J.H. Lee, and S.D. Kim, “An Energy Efficient Cache
Memory Architecture for Embedded Systems”, in Proceedings of the ACM
symposium on Applied Computing, pp. 884-890, 2004.
[19] C. Zhang, F. Vahid, J. Yang, and W. Najjar, “A Way-Halting Cache for Low-
Energy High-Performance Systems,” ACM Transactions on Architecture and
Code Optimization, Vol. 2, Issue 1, pp. 34-54, March 2005.
[20] L. Liu, “Cache Design with Partial Address Matching,” in Proceedings of the
International Symposium on Microarchitecture, 1994.
[21] C. Zhang, F. Vahid, and W. Najjar, “Energy Benefits of a Configurable Line Size
Cache for Embedded Systems,” International Symposium on VLSI Design, 2003.
[22] M. Zhang and K. Asanovic, "Highly-Associative Caches for Low-Power
Processors,” Kool Chips Workshop, in conjunction with International Symposium
on Microarchitecture, Dec. 2000.
[23] Z. Navabi, “Verilog Digital System Design”, New York: McGraw-Hill, 1999.
[24] http://www.faculty.iu-bremen.de/birk/lectures/PC101
2003/07cache/cache%20memory.htm
[26] J. L. Hennessy and D. A. Patterson, "Computer Organization and Design: The
Hardware/Software Interface,” 4th edition, Morgan-Kaufmann Publishing Co.,
2009.
[27] http://www.pcguide.com/ref/mbsys/cache/funcComparison-c.html
[28] http://webster.cs.ucr.edu/AoA/Windows/HTML/MemoryArchitecturea2.html
[29] www.cse.uconn.edu/~huang/spring08_340/May_5/Wei_Zeng.ppt
[31] http://blogs.sun.com/toddjobson/resource/Memory_Latency.png
[33] A. J. Smith, “Cache memories”, ACM Computing Surveys, Vol.14, Issue 3, pp.
473-530, Sep.1982.
[34] S. Palnitkar, “Verilog HDL: A Guide to Digital Design and Synthesis”, SunSoft
Press, 1996.
[35] P.R. Panda, N. Dutt, and Alexandra Nicolau, “Memory Issues in Embedded
Systems-On-Chip: Optimizations and Exploration”, Kluwer Academic Publishers,
1999.
[36] Wilson, Peter R., "Design Recipes for FPGAs", Elsevier/Newnes, 2007.
[37] Wolf, Wayne Hendrix, “FPGA-based System Design”, Prentice Hall PTR, 2004.
[44] K. Coffman, “Real World FPGA Design with Verilog”, Prentice Hall Modern
Semiconductor Design Series, 2000.
[45] www.xilinx.com