5035V - 4th Semester - Computer Science and Engineering
A memory is just like a human brain: it is used to store data and instructions. Computer memory is the storage space in the computer where the data to be processed and the instructions required for processing are stored. The memory is divided into a large number of small parts called cells. Each location or cell has a unique address, which varies from zero to memory size minus one. For example, if the computer has 64K words, then this memory unit has 64 * 1024 = 65,536 memory locations, with addresses ranging from 0 to 65,535.
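As a quick check of the arithmetic above, a few lines of Python (illustrative only) reproduce these numbers:

    words = 64 * 1024                   # 65536 locations
    print(f"addresses: 0 .. {words - 1}")
    bits = words.bit_length() - 1       # 65536 = 2**16, so 16 address bits
    print(f"address bits required: {bits}")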
Memory is primarily of three types −
Cache Memory
Primary Memory/Main Memory
Secondary Memory
Cache Memory
Cache memory is a very high-speed semiconductor memory that can speed up the CPU. It acts as a buffer between the CPU and the main memory, and is used to hold the parts of data and program that are most frequently used by the CPU. These parts of data and programs are transferred from the disk to cache memory by the operating system, from where the CPU can access them.
Advantages
Cache memory is faster than main memory
It consumes less access time as compared to main memory
It stores data for temporary use
Disadvantages
Cache memory has limited capacity
It is very expensive
Secondary Memory
This type of memory is also known as external or non-volatile memory. It is slower than the main memory and is used for storing data/information permanently. The CPU does not access these memories directly; they are accessed via input-output routines. The contents of secondary memories are first transferred to the main memory, and then the CPU can access them. Examples: disk, CD-ROM, DVD, etc.
RAM (Random Access Memory) is the internal memory of the CPU for storing data, programs, and program results. It is a read/write memory which stores data as long as the machine is working; as soon as the machine is switched off, the data is erased.
Access time in RAM is independent of the address: each storage location inside the memory is as easy to reach as any other location and takes the same amount of time. Data in RAM can be accessed randomly, but RAM is very expensive.
RAM is volatile, i.e. data stored in it is lost when we switch off the computer or if there
is a power failure. Hence, a backup Uninterruptible Power System (UPS) is often used
with computers. RAM is small, both in terms of its physical size and in the amount of
data it can hold.
RAM is of two types −
Static RAM (SRAM)
Dynamic RAM (DRAM)
SRAM has the following characteristics −
Long life
No need to refresh
Faster
Used as cache memory
Large size
Expensive
High power consumption
ROM stands for Read Only Memory: memory from which we can only read, but to which we cannot write. This type of memory is non-volatile; the information is stored permanently in such memories during manufacture. A ROM stores the instructions that are required to start a computer, an operation referred to as bootstrap. ROM chips are used not only in computers but also in other electronic items like washing machines and microwave ovens.
The various types of ROM include PROM (Programmable ROM), EPROM (Erasable and Programmable ROM), and EEPROM (Electrically Erasable and Programmable ROM).
Advantages of ROM
The advantages of ROM are as follows −
Non-volatile in nature
Cannot be accidentally changed
Cheaper than RAMs
Easy to test
More reliable than RAMs
Static and do not require refreshing
Contents are always known and can be verified
The smallest unit of information is the bit (binary digit), and in one memory cell we can store one bit of information. Eight bits together are termed a byte.
The maximum size of main memory that can be used in any computer is determined by the addressing scheme. A computer that generates 16-bit addresses is capable of addressing up to 2^16 = 64K memory locations. Similarly, with 32-bit addresses the total capacity will be 2^32 = 4G memory locations.
In some computers, the smallest addressable unit of information is a memory word, and the machine is called word-addressable.
The data transfer between main memory and the CPU takes place through two CPU registers: the memory address register (MAR) and the memory data register (MDR).
If the MAR is k bits long, then the total number of addressable memory locations will be 2^k.
If the MDR is n bits long, then n bits of data are transferred in one memory cycle.
The transfer of data takes place through the memory bus, which consists of an address bus and a data bus. In the above example, the size of the data bus is n bits and the size of the address bus is k bits.
The bus also includes control lines like Read, Write and Memory Function Complete (MFC) for coordinating data transfers. In the case of a byte-addressable computer, another control line has to be added to indicate a byte transfer instead of a whole-word transfer.
The CPU initiates a memory operation by loading the appropriate address into the MAR.
If it is a memory read operation, the CPU sets the read control line to 1. The contents of the memory location are then brought to the MDR, and the memory control circuitry indicates this to the CPU by setting MFC to 1.
If it is a memory write operation, the CPU places the data into the MDR and sets the write control line to 1. Once the contents of the MDR are stored in the specified memory location, the memory control circuitry indicates the end of the operation by setting MFC to 1.
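The handshake described above can be modelled with a toy sketch; the register and signal names (MAR, MDR, MFC) follow the text, while the Python class itself is purely illustrative:

    class Memory:
        """Toy model of the CPU-memory interface described in the text."""
        def __init__(self, size):
            self.cells = [0] * size
            self.MFC = 0                   # Memory Function Complete line

        def read(self, MAR):
            self.MFC = 0
            MDR = self.cells[MAR]          # contents brought to the MDR
            self.MFC = 1                   # memory signals completion
            return MDR

        def write(self, MAR, MDR):
            self.MFC = 0
            self.cells[MAR] = MDR          # MDR stored at the addressed cell
            self.MFC = 1

    mem = Memory(64 * 1024)
    mem.write(MAR=100, MDR=42)             # CPU loads MAR/MDR and raises Write
    assert mem.read(MAR=100) == 42 and mem.MFC == 1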
A useful measure of the speed of a memory unit is the time that elapses between the initiation of an operation and its completion (for example, the time between Read and MFC). This is referred to as the memory access time. Another measure is the memory cycle time: the minimum time delay between the initiation of two independent memory operations (for example, two successive memory read operations). The memory cycle time is slightly longer than the memory access time.
The storage part is modelled here with an SR latch, but in reality it is an electronic circuit made up of transistors. Memory constructed with the help of transistors is known as semiconductor memory. Semiconductor memories are termed Random Access Memory (RAM), because it is possible to access any memory location at random.
Depending on the technology used to construct a RAM, there are two types −
SRAM: Static Random Access Memory
DRAM: Dynamic Random Access Memory
Dynamic RAM (DRAM): A DRAM is made with cells that store data as charge on capacitors. The presence or absence of charge in a capacitor is interpreted as binary 1 or 0. Because capacitors have a natural tendency to discharge due to leakage current, dynamic RAM requires periodic charge refreshing to maintain data storage. The term dynamic refers to this tendency of the stored charge to leak away, even with power continuously applied.
Static RAM (SRAM): In an SRAM, binary values are stored using traditional flip-flops constructed with the help of transistors. A static RAM will hold its data as long as power is supplied to it.
SRAM versus DRAM:
• Both static and dynamic RAMs are volatile, that is, they retain the information only as long as power is applied.
• A dynamic memory cell is simpler and smaller than a static memory cell. Thus a DRAM is more dense, i.e., its packing density is higher (more cells per unit area), and DRAM is less expensive than the corresponding SRAM.
• DRAM requires supporting refresh circuitry. For larger memories, the fixed cost of the refresh circuitry is more than compensated for by the lower cost of the DRAM cells.
• SRAM cells are generally faster than DRAM cells. Therefore, to construct faster memory modules (like cache memory), SRAM is used.
A memory cell is capable of storing one bit of information. A number of memory cells are organized in the form of a matrix to form the memory chip. One such organization is shown in Figure 3.5.
Each row of cells constitutes a memory word, and all cells of a row are connected to a common line referred to as the word line. An address decoder is used to drive the word lines; at any particular instant, one word line is enabled, depending on the address present on the address bus. The cells in each column are connected by two lines known as bit lines, which are connected to the data input and data output lines through a Sense/Write circuit. During a Read operation, the Sense/Write circuit senses, or reads, the information stored in the cells selected by a word line and transmits this information to the output data line. During a Write operation, the Sense/Write circuit receives information and stores it in the cells of the selected word.
A memory chip consisting of 16 words of 8 bits each is usually referred to as a 16 x 8 organization. The data input and data output lines of each Sense/Write circuit are connected to a single bidirectional data line in order to reduce the number of pins required. For 16 words, we need an address bus of size 4. In addition to the address and data lines, two control lines, R/W and CS, are provided. The R/W (Read/Write) line is used to specify the required operation, read or write. The CS (Chip Select) line is required to select a given chip in a multi-chip memory system.
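A toy model of such a 16 x 8 chip might look as follows; the R/W and CS signal names follow the text, everything else is an illustrative assumption:

    class Chip16x8:
        """16 words of 8 bits, with Chip Select (CS) and Read/Write (RW) lines."""
        def __init__(self):
            self.words = [0] * 16                  # the decoder drives 16 word lines

        def access(self, CS, RW, address, data_in=0):
            if not CS:                             # chip not selected: ignore the bus
                return None
            if RW == 1:                            # read: Sense/Write drives data out
                return self.words[address]
            self.words[address] = data_in & 0xFF   # write: store 8 bits in the word
            return None

    chip = Chip16x8()
    chip.access(CS=1, RW=0, address=9, data_in=0xA5)    # write 0xA5 into word 9
    assert chip.access(CS=1, RW=1, address=9) == 0xA5   # read it back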
8. The Memory Hierarchy (2) - The Cache
Many modern computers have more than one cache: it is common to find an instruction cache together with a data cache, and in many systems the caches are themselves structured as a hierarchy. Most microprocessors on the market today have an internal cache with a size of a few KBytes, and allow an external cache with a much larger capacity, tens to hundreds of KBytes.
8.2 Placing a block in the cache
Freedom of placing a block into the cache ranges from absolute, when the block can be placed anywhere in the cache, to zero, when the block has a strictly predefined position:
• if the block can be placed anywhere in the cache, the cache is said to be fully associative;
• if the block has a strictly predefined position in the cache, the cache is said to be direct mapped;
• if the block can be placed in a restricted set of positions, the cache is said to be set associative.
Transfers between the lower level of the memory and the cache occur in blocks; for this reason we can see the memory address as divided in two fields: the block-frame address (which block this is) and the block offset (the position of the byte within the block).
What is the size of the two fields in an address, if the address size is 32 bits and the block is 16 bytes wide?
Answer:
Assuming that the memory is byte addressable, 4 bits are necessary to specify the position of a byte in the block. The other 28 bits in the address identify a block in the lower level of the memory hierarchy. For instance, an address whose upper 28 bits equal 3 and whose lower 4 bits equal 13 refers to block number 3 in the lower level; inside that block, byte number 13 will be accessed.
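In code, the split can be sketched like this (illustrative Python; the 16-byte block and the block 3 / byte 13 values are the ones used above):

    BLOCK_SIZE = 16                     # bytes; 16 = 2**4, so 4 offset bits
    OFFSET_BITS = 4

    def split_address(addr):
        """Split a byte address into (block-frame address, offset in block)."""
        return addr >> OFFSET_BITS, addr & (BLOCK_SIZE - 1)

    addr = 3 * BLOCK_SIZE + 13          # a byte in block 3, at offset 13
    print(split_address(addr))          # prints: (3, 13)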
For a cache that has a power of two number of blocks (say 2^m blocks), finding the position in a direct mapped cache is trivial: the position (index) is indicated by the last (the least significant) m = log2(2^m) bits of the block-frame address.
For a set associative cache that has a power of two number of sets (say 2^k sets), the set where a given block has to be mapped is indicated by the last (the least significant) k = log2(2^k) bits of the block-frame address.
The address can thus be viewed as having three fields: the block-frame address is split into two fields, the tag and the index, plus the block offset:
| tag | index | block offset |
Example 8.2: A CPU has a 7-bit address; the cache is direct mapped and has 4 blocks of 8 bytes each. The CPU addresses the byte at address 107. Suppose this is a miss, and show where the corresponding block will be placed.
Answer:
(107)10 = (1101011)2
With an 8-byte block, the least significant three bits of the address (011) are used to indicate the position of a byte within a block.
The most significant four bits ((1101)2 = (13)10) represent the block-frame address, i.e. the number of the block in the lower level of the memory.
Because it is a direct mapped cache, the position of block number 13 in the cache is given by:
13 mod 4 = 1
Hence block number 13 in the lower level of the memory hierarchy will be placed in position 1 in the cache. This is precisely the same as using the last log2(4) = 2 bits of the block-frame address (01) as the index.
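The same computation can be checked with a short sketch (illustrative Python, using the numbers of this example):

    NUM_BLOCKS = 4                       # direct mapped cache with 4 lines
    BLOCK_SIZE = 8                       # bytes per block -> 3 offset bits

    addr = 107                           # (1101011) in binary
    offset = addr % BLOCK_SIZE           # 011 -> byte 3 within the block
    block_frame = addr // BLOCK_SIZE     # 1101 -> block number 13
    index = block_frame % NUM_BLOCKS     # 13 mod 4 = 1 -> cache line 1
    print(block_frame, index, offset)    # prints: 13 1 3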
Figure 8.2 is a graphical representation for this example. Figures 8.1 and 8.3 are graphical representations of the same problem as in Example 8.2, but for fully associative and set associative caches respectively.
Because the cache is smaller than the memory level below it, there are several blocks that will map to the same position in the cache; using Example 8.2 it is easy to see that blocks number 1, 5, 9 and 13 will all map to the same position. The question now is: how can we determine if the block in the cache is the one we are looking for, or not?
8.3 Finding a Block in the Cache
Each line in the cache is augmented with a tag field that holds the tag field of the address corresponding to that block. When the CPU issues an address there are, possibly, several blocks in the cache that could contain the desired information; the one chosen is the one whose tag matches the tag field of the address issued by the CPU.
Figure 8.4 presents the same cache we had in Figures 8.1 to 8.3, improved with the tag fields. In the case of a fully associative cache, all tags in the cache must be checked against the address's tag field, because in a fully associative cache blocks may be placed anywhere. Because the cache must be very fast, the checking process must be done in parallel: all the cache's tags must be compared at the same time with the address tag field. For a set associative cache there is less work than in a fully associative cache: there is only one set in which the block can be, therefore only the tags of the blocks in that set have to be compared against the address tag field. If the cache is direct mapped, the block can have only one position in the cache: only the tag of that block is compared with the address tag field.
Figure 8.6 presents the status of a four-line, direct mapped cache, similar to the one we had in Example 8.2, after a sequence of misses; suppose that after reset (or power-on), the CPU issues the following sequence of reads at addresses (in decimal notation): 78, 79, 80, 77, 109, 27, 81. Hits don't change the state of the cache when only reads are performed; therefore only the state of the cache after misses is presented in Figure 8.6. Below is the binary representation of the addresses involved in the process (tag | index | offset):
78 = 10 01 110
79 = 10 01 111
80 = 10 10 000
77 = 10 01 101
109 = 11 01 101
27 = 00 11 011
81 = 10 10 001
FIGURE 8.1 A fully associative, four blocks (lines) cache connected to a 16-block lower level of the memory hierarchy. Block 13 can go anywhere in the cache.
FIGURE 8.2 A direct mapped, four blocks (lines) cache connected to a 16-block memory. Block 13 can go only in position 1 (13 mod 4) in the cache.
FIGURE 8.3 A two-way set associative, four blocks cache connected to a 16-block memory. Block 13 goes to set 1 (13 mod 2); within set 1 it can occupy any position.
FIGURE 8.4 Finding a block in the cache implies comparing the tag field of the actual address with the content of one or more tags in the cache. For a set associative cache the block can be in only one set; only the tags of that set must be checked.
• Address 78: miss, because the valid bit is 0 (Not Valid); a block is brought in and placed into the cache at index 01.
• Address 79: hit; as Figure 8.6.b points out, the content of this memory address is already in the cache.
• Address 80: miss, because the valid bit at index 10 in the cache is 0 (Not Valid); a block is brought into the cache and placed at this index.
• Address 77: hit; it belongs to the same block as addresses 78 and 79, which is already in the cache at index 01.
• Address 109: miss; the block being transferred from the lower level of the hierarchy is placed in the cache at index 01, thus replacing the previous block.
• Address 27: miss; the block is transferred into the cache at index 11.
• Address 81: hit; the item is found in the cache at index 10.
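The whole sequence can be replayed with a small simulation (an illustrative sketch of this four-line direct mapped cache; the valid bits start at 0, as after reset):

    NUM_LINES, BLOCK_SIZE = 4, 8
    valid = [0] * NUM_LINES                      # all lines invalid after reset
    tag = [0] * NUM_LINES

    for addr in [78, 79, 80, 77, 109, 27, 81]:
        block_frame = addr // BLOCK_SIZE         # drop the 3 offset bits
        index = block_frame % NUM_LINES          # 2 index bits
        t = block_frame // NUM_LINES             # 2 tag bits
        hit = valid[index] and tag[index] == t
        print(f"address {addr:3}: {'hit' if hit else 'miss'} at index {index:02b}")
        if not hit:                              # on a miss the block is brought in
            valid[index], tag[index] = 1, t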
It is a common mistake to neglect the tag field when computing the amount
of memory necessary for a cache.
Answer:
The cache will have a number of lines equal to the cache size divided by the block size, which here gives 2^10 = 1024 lines. Hence the number of bits in the index field of an address is 10, and the tag field in an address is 19 bits. Each line of the cache then holds, counting the valid bit, the tag and the data:
1 + 19 + 16 * 8 = 148 bits
This figure is about 16% larger than the "useful" (data-only) size of the cache, and is hardly negligible.
8.4 Replacing Policies
FIGURE 8.5 Finding a block in a direct mapped cache: the index field of the CPU address selects a line, the stored tag is compared (COMP) with the tag field of the address, and the data is selected through a multiplexor (MUX); a tag match asserts the Hit signal.
The most common replacing policies are:
• random: the block to be replaced is selected at random among the candidates;
• LRU (Least Recently Used): the block that has been unused for the longest time is selected for replacement, on the assumption that it is the least likely to be needed again soon;
• FIFO (First In First Out): the oldest block in the cache (or in the set, for a set associative cache) is selected for replacement. This policy does not take into account the addressing pattern in the past: it may happen that a block has been heavily used in the previous addressing cycles and yet is chosen for replacement. The FIFO policy is outperformed by the random policy, which has, as a plus, the advantage of being easier to implement.
Consider a fully associative four block cache, and the following stream of
block-frame addresses: 2, 3, 4, 2, 5, 2, 3, 1, 4, 5, 2, 2, 2, 3. Show the content
of the cache in two cases:
a) using a LRU algorithm for replacing blocks;
b) using a FIFO policy.
Answer:
LRU (the subscripts indicate the age of each block in the cache):

Address: 2   3   4   2   5   2   3   1   4   5   2   2   2   3
         2₁  2₂  2₃  2₁  2₂  2₁  2₂  2₃  2₄  5₁  5₂  5₃  5₄  5₅
             3₁  3₂  3₃  3₄  3₅  3₁  3₂  3₃  3₄  2₁  2₁  2₁  2₂
                 4₁  4₂  4₃  4₄  4₅  1₁  1₂  1₃  1₄  1₅  1₆  3₁
                         5₁  5₂  5₃  5₄  4₁  4₂  4₃  4₄  4₅  4₆
         M   M   M       M           M   M   M   M           M

FIFO (a star marks the next block to be replaced, i.e. the oldest one):

Address: 2   3   4   2   5   2   3   1   4   5   2   2   2   3
         2*  2*  2*  2*  2*  2*  2*  1   1   1   1   1   1   1
             3   3   3   3   3   3   3*  3*  3*  2   2   2   2
                 4   4   4   4   4   4   4   4   4*  4*  4*  3
                         5   5   5   5   5   5   5   5   5   5*
         M   M   M       M           M           M           M
For the LRU policy, the subscripts indicate the age of the blocks in the cache. For the FIFO policy, a star is used to indicate which block is the next to be replaced. The Ms under the columns of the tables indicate the misses. For the short sequence of block-frame addresses in this example, the FIFO policy yields a smaller number of misses, 7 as compared with 9 for LRU. However, in most cases the LRU strategy proves to be better than FIFO.
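A short simulation (illustrative Python; the function name misses is an arbitrary choice) reproduces the miss counts of this example:

    def misses(stream, size, policy):
        """Count misses in a fully associative cache of `size` blocks."""
        cache = []                        # front of the list = oldest entry
        count = 0
        for block in stream:
            if block in cache:
                if policy == "LRU":       # a hit renews the block's age under LRU;
                    cache.remove(block)   # under FIFO the queue is left untouched
                    cache.append(block)
                continue
            count += 1                    # miss: the block must be brought in
            if len(cache) == size:
                cache.pop(0)              # evict the victim at the front
            cache.append(block)
        return count

    stream = [2, 3, 4, 2, 5, 2, 3, 1, 4, 5, 2, 2, 2, 3]
    print(misses(stream, 4, "LRU"), misses(stream, 4, "FIFO"))   # prints: 9 7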
8.5 Cache Write Policies
So far we have discussed how reads are handled in a cache. Writes are more difficult and affect performance more than reads do. If we take a closer look at the block scheme in Figure 8.5, we realize that, in the case of a read, the two basic operations are performed in parallel: the tag and the data block are read at the same time. The tags must then be compared, and the delay in the comparator (COMP) is slightly higher than the delay through the multiplexor (MUX): if we have a hit, then the data is already stable at the cache's outputs; if there is a miss, there is no harm in reading some improper data from the cache, we simply ignore it.
There are two options when writing into the cache, depending upon how the information in the lower level of the hierarchy is updated:
• write through: the item is written both into the cache and into the corresponding block in the lower level of the hierarchy; as a consequence, the cache and the lower level always hold consistent copies;
• write back: writes occur only in the cache; the modified block is
written into the lower level of the hierarchy only when it has to be
replaced.
With the write-back policy it is useless to write back a block (i.e. to write a block into the lower level of the hierarchy) if the block has not been modified while in the cache. To keep track of whether a block was modified or not, a bit, called the dirty bit, is used for every block in the cache; when the block is brought into the cache this bit is set to Not-dirty (0), and the first write to that block sets the bit to Dirty (1). When the replacement decision is taken, the control checks if the block is dirty or clean. If the block is dirty, it has to be written to the lower level of the memory; otherwise a new block coming from the lower level of the hierarchy can simply overwrite that block in the cache.
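The dirty-bit bookkeeping can be sketched as follows (an illustrative fragment, not a complete cache; write_block_to_memory is a hypothetical stand-in for the block transfer described in the text):

    class Line:
        """One cache line: valid/dirty bits, tag and (block) data."""
        def __init__(self):
            self.valid, self.dirty, self.tag, self.data = 0, 0, 0, None

    def write(line, value):
        line.data = value
        line.dirty = 1                    # the first write marks the line Dirty

    def replace(line, new_tag, new_data, write_block_to_memory):
        if line.valid and line.dirty:     # only modified blocks are written back
            write_block_to_memory(line.tag, line.data)
        line.valid, line.dirty = 1, 0     # the incoming block starts Not-dirty
        line.tag, line.data = new_tag, new_data

    line = Line()
    write(line, 42)                       # dirty bit set
    replace(line, 5, 0, lambda tag, data: print("writing back", tag, data))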
For fully or set associative caches, where several blocks may be candidates for replacement, it is common to prefer one which is clean (if any), thus saving the time necessary to transfer a block from the cache to the lower level of the memory.
The two cache write policies have their advantages and disadvantages: write through keeps the lower level of the hierarchy up to date and is simpler to implement, but every write pays the access time of the lower level; write back performs writes at cache speed and reduces the traffic to the lower level, but the lower level is temporarily inconsistent with the cache, and whole blocks must be transferred at replacement time.
8.6 The Cache Performance
The CPU time can be written as:
CPUtime = (CPUexec + Memory_stalls) * Tck
where both the execution time and the stalls are expressed in clock cycles.
Now the natural question we may ask is: do we include the cache access time in CPUexec or in Memory_stalls? Both ways are possible: it is possible to count the cache access time in Memory_stalls, simply because the cache is a part of the memory hierarchy. On the other hand, because the cache is supposed to be very fast, we can include the hit time in the CPU execution time, as the item sought in the cache will be delivered very quickly, maybe during the same execution cycle. As a matter of fact, this is the widely accepted convention.
Memory_stalls will include the stalls due to misses, for both reads and writes:
Memory_stalls = IC * Mem_accesses_per_instruction * miss_rate * miss_penalty
The above formula can also be written using misses per instruction:
Memory_stalls = IC * Misses_per_instruction * miss_penalty
where Misses_per_instruction = Mem_accesses_per_instruction * miss_rate.
Answer:
CPUtime = IC*(CPIexec + Mem_accesses_per_instruction*miss_rate*miss_penalty)*Tck
The IC and Tck are the same in both cases, with and without cache, so the
result of including the cache's behavior is an increase in CPUtime by
8.25 / 7 - 1 = 17.8%
The following example presents the impact of the cache for a system with a lower CPI (as is the case with pipelined CPUs):
The CPI for a CPU is 1.5, there are on average 1.4 memory accesses per instruction, the miss rate is 5%, and the miss penalty is 10 clock cycles. What is the performance if the cache is considered?
Answer:
CPUtime = IC*(CPIexec + Mem_accesses_per_instruction*miss_rate*miss_penalty)*Tck = IC*(1.5 + 1.4*0.05*10)*Tck = IC*2.2*Tck
The CPUtime is therefore 2.2 / 1.5 - 1 = 46.7% larger than it would be with an ideal (no-miss) cache.
Note that for a machine with lower CPI the impact of the cache is more
significant than for a machine with a higher CPI.
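The formula used in these answers is straightforward to wrap in a helper (an illustrative sketch; the parameter values below are those of the example above):

    def cpu_time(IC, CPI_exec, accesses_per_instr, miss_rate, miss_penalty, Tck):
        """CPUtime = IC * (CPIexec + accesses/instr * miss_rate * penalty) * Tck"""
        return IC * (CPI_exec + accesses_per_instr * miss_rate * miss_penalty) * Tck

    ideal = cpu_time(1, 1.5, 1.4, 0.00, 10, 1)   # perfect cache: no misses
    real = cpu_time(1, 1.5, 1.4, 0.05, 10, 1)    # 5% miss rate, 10 cycle penalty
    print(f"slowdown: {real / ideal - 1:.1%}")   # prints: slowdown: 46.7%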
The following example shows the impact of the cache on systems with different clock rates.
Example 8.7 CPU PERFORMANCE WITH CACHE, CPI AND CLOCK RATES:
Answer:
CPUtime = IC*(CPIexec + Mem_accesses_per_instruction*miss_rate*miss_penalty)*Tck
For the CPU running with a 20 ns clock cycle, the miss penalty is 140/20 = 7 clock cycles, and the performance is given by:
The effect of the cache, for this machine, is to stretch the execution time by 32%. For the machine running with a 10 ns clock cycle, the miss penalty is 140/10 = 14 clock cycles, so the stalls per instruction double and the execution time is stretched by 64%: the faster the clock, the more expensive the (fixed, 140 ns) miss penalty becomes in relative terms.
The cache performance can be improved in two ways:
• reducing the miss rate: the easy way is to increase the cache size; however, there is a serious limitation in doing so for on-chip caches: the space. Most on-chip caches are only a few kilobytes in size.
• reducing the miss penalty: in most cases the access time dominates the miss penalty; since the access time is determined by the technology used for the memories and, as a result, cannot easily be lowered, it is possible instead to use intermediate levels of cache between the internal (on-chip) cache and the main memory.
Cache misses are usually classified in three categories:
• compulsory: the very first access to a block cannot find it in the cache, so the block must be brought in; these are also called cold-start misses;
• capacity: if the cache does not contain all the blocks needed for the execution of the program, then some blocks will be replaced and then, later, brought back into the cache;
• conflict: in direct mapped or set associative caches, a miss can occur because too many blocks compete for the same position (set), even though other positions in the cache may be free.
As for capacity misses, the solution is larger caches, both internal and external. If the cache is too small to fit the requirements of some program, then most of the time will be spent in transferring blocks between the cache and the lower level of the hierarchy; this is called thrashing. A thrashing memory hierarchy has a performance that is close to that of the memory in the lower level, or even poorer due to the miss overhead.
Early caches were meant to hold both data and instructions; these caches are called unified or mixed. It is possible however to have separate caches for instructions and data, as the CPU knows if it is fetching an instruction or loading/storing data. Having separate caches allows the CPU to perform an instruction fetch at the same time as a data read/write, as happens in pipelined implementations. As the table in section 8.6 shows, most of today's architectures have separate caches. Separate caches give the designer the opportunity to optimize each cache separately: they may have different sizes, organizations, and block sizes. The main observation is that instruction caches have lower miss rates than data caches, for the main reason that instructions expose better spatial locality than data.
Exercises
Redo the design in problem 8.1 but for a 4-way set associative
cache. Compare your design with the fully associative cache and
the direct mapped cache.
Assume you have two machines with the same CPU and the same main memory, but different caches:
cache 1: a 16-set, 2-way set associative cache, 16 bytes per block, write through;
cache 2: a 32-line direct mapped cache, 16 bytes per block, write back.
Also assume that, for both machines, a miss takes 10 times longer than a hit. For the write-through cache, a word write takes 5 times longer than a hit; for the write-back cache, the transfer of a block from the cache to the memory takes 15 times as long as a hit.
a) write a program that makes machine 1 run faster than machine 2 (by as
much as possible);
b) write a program that makes machine 2 run faster than machine 1 (by as
much as possible).
Write Through and Write Back in Cache
Cache is a technique of storing a copy of data temporarily in rapidly accessible storage. A cache stores the most recently used words in a small memory to increase the speed at which data is accessed; it acts like a buffer between RAM and the CPU and thus increases the speed at which data is available to the processor.
Whenever the processor wants to write a word, it checks to see whether the address it wants to write the data to is present in the cache or not. If the address is present in the cache, it is a Write Hit: we can update the value in the cache and avoid an expensive main memory access. But this results in the Inconsistent Data Problem: as the cache and main memory now hold different data, problems arise when two or more devices share the main memory (as in a multiprocessor system).
This is where Write Through and Write Back come into the picture.
Write Through:
In write through, data is updated in the cache and in memory simultaneously. This process is simpler and more reliable, and is used when there are no frequent writes to the cache (the number of write operations is low).
It helps with data recovery (in case of a power outage or system failure). A data write will experience latency (delay), as we have to write to two locations (both memory and cache). Write through solves the inconsistency problem, but it undermines the advantage of having a cache for write operations (the whole point of using a cache was to avoid multiple accesses to main memory).
Write Back:
The data is updated only in the cache and written to memory at a later time. Data is updated in memory only when the cache line is ready to be replaced (cache line replacement is done using algorithms such as Belady's optimal algorithm, Least Recently Used (LRU), FIFO, LIFO and others, depending on the application).
Write Back is also known as Write Deferred.
If a write occurs to a location that is not present in the cache (a Write Miss), there are two options: Write Allocation and Write Around.
Write Allocation:
In write allocation, the data is loaded from memory into the cache and then updated. Write allocation works with both write back and write through, but it is generally used with write back, because with write through it is unnecessary to bring the data from memory into the cache only to then update it in both the cache and main memory. Thus write through is often used with no-write-allocate.
Write Around:
Here the data is written/updated directly to main memory, without disturbing the cache. It is better to use this when the data is not immediately used again.
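The four combinations can be seen in a toy sketch (illustrative Python; a dictionary plays the role of a one-word cache, which is enough to show where each policy sends the data):

    def handle_write(addr, value, cache, memory, policy, on_miss):
        """Toy one-level cache: route a CPU write according to the two policies."""
        if addr in cache or on_miss == "write-allocate":
            cache[addr] = value              # update (or load and update) the cache
            if policy == "write-through":
                memory[addr] = value         # ...and main memory, immediately
            # under write-back, memory is updated only when the line is evicted
        else:                                # write-around on a miss:
            memory[addr] = value             # bypass the cache entirely

    cache, memory = {}, {}
    handle_write(7, 1, cache, memory, "write-through", "write-around")
    print(cache, memory)   # {} {7: 1}    - the miss went straight to memory
    handle_write(7, 2, cache, memory, "write-back", "write-allocate")
    print(cache, memory)   # {7: 2} {7: 1} - memory is stale until eviction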