
Unit 5 Design Space of Pipelines


Structure:
5.1 Introduction
Objectives
5.2 Design Space of Pipelines
Basic layout of a pipeline
Dependency resolution
5.3 Pipeline Instruction Processing
5.4 Pipelined Execution of Integer and Boolean Instructions
The design space
Logical layout of FX pipelines
Implementation of FX pipelines
5.5 Pipelined Processing of Loads and Stores
Subtasks of load and store processing
The design space
Sequential consistency of instruction execution
Instruction issuing and parallel execution
5.6 Summary
5.7 Glossary
5.8 Terminal Questions
5.9 Answers

5.1 Introduction
In the previous unit, you studied pipelined processors in great detail, with a
short review of pipelining and examples of some pipelines in modern
processors. You also studied various kinds of pipeline hazards and the
techniques available to handle them.
In this unit, we will introduce you to the design space of pipelines. The
day-by-day increasing complexity of chips has led to higher operating speeds.
These speeds are achieved by overlapping instruction latencies, that is, by
implementing pipelining. Early models used a discrete pipeline, which performs
the task in stages such as fetch, decode, execute, memory access and
write-back. Every pipeline stage requires one cycle, and as there are 5 stages
the instruction latency is five cycles. Longer pipelines spread over more cycles
can hide instruction latencies, which allows processors to attain higher clock
speeds. Instruction pipelining has significantly improved the performance of
today's processors. In this unit, you will study the design space of pipelines,
which is further divided into basic
layout of a pipeline and dependency resolution. We focus primarily on
pipelined execution of Integer and Boolean instructions and pipelined
processing of loads and stores.
Objectives:
After studying this unit, you should be able to:
• explain design space of pipelines
• describe pipeline instruction processing
• identify pipelined execution of Integer and Boolean instructions
• discuss pipelined processing of loads and stores

5.2 Design Space of Pipelines


In this section, we will learn about the design space of pipelines. The design
space of pipelines can be subdivided into two aspects, as shown in figure 5.1.

Figure 5.1: Principle Aspects of Design Space of Pipelines

Let’s discuss each one of them in detail.


5.2.1 Basic Layout of a pipeline
To understand a pipeline in depth, it is necessary to know about those
decisions which are fundamental to the layout of a pipeline. Let’s discuss them
below:
1. The number of pipeline stages used to perform a given task,
2. Specification of the subtasks to be performed in each of the pipeline
stages,
3. Layout of the stage sequence, that is, whether the stages are used in a
strict sequential manner or some stages are recycled,
4. Use of bypassing, and
5. Timing of the pipeline operations, that is, whether pipeline operations are

controlled synchronously or asynchronously.

Figure 5.2 depicts these stages diagrammatically.

(The figure shows the five aspects of the basic layout of a pipeline: number of
stages; specification of the subtasks to be performed in each of the stages;
layout of the stage sequence; use of bypassing; and timing of the pipeline
operations.)

Figure 5.2: Overall Stage Layout of a pipeline

5.2.2 Dependency resolution


Pipeline design has another aspect, called dependency resolution. Earlier,
some pipelined computers followed the MIPS approach (Microprocessor without
Interlocked Pipeline Stages) and used static dependency resolution, which is
also called static scheduling or software interlock resolution.
Here the detection and proper resolution of dependencies is done by the
compiler. Examples of static dependency resolution are:
• Original MIPS designs (like the MIPS and the MIPS-X)
• Some less famous RISC processors (like RCA, Spectrum)
• Intel processor (i860) which has both VLIW and scalar operation modes.
A more advanced resolution scheme is the combined static/dynamic
dependency resolution. This has been employed by MIPS R processors like
R2000, R3000, R4000, R4200 and R6000. In the first MIPS processors
(R2000, R3000) hardware interlocks were used for the long latency
operations, such as multiplication, division and conversion, while the
resolution of short latency operations relied entirely on the compiler. Newer R-
series implementations have extended the range of hardware interlocks
further and further, first to the load/store hazards (R6000) and then to other
short latency operations as well (R4000). In the R4000, the only instructions
which rely on a static dependency resolution are the coprocessor control
instructions.
In recent processors dependencies are resolved dynamically, by extra
hardware. Nevertheless, compilers for these processors are assumed to

perform a parallel optimisation by code reordering, in order to increase
performance. Figure 5.3 shows the various possibilities of resolving the
pipeline hazards.

Figure 5.3: Possibilities for Resolving Pipeline Hazards

Self Assessment Questions


1. The full form of MIPS is ____________________ .
2. In recent processors dependencies are resolved ________________ , by
extra hardware.
Activity 1:
Visit a library and find out the features of R2000, R3000, R4000, R4200 and
R6000. Compare them in a chart.

5.3 Pipeline Instruction Processing


An instruction pipeline operates on a stream of instructions by overlapping and
decomposing the three phases (fetch, decode and execute) of the instruction
cycle. It has been extensively used in RISC machines and many high-end
mainframes as one of the major contributors to achieving high performance. A
typical instruction execution in a pipelined architecture consists of the
following sequence of operations:
1. Fetch instruction: In this operation, the next expected instruction is read
into a buffer from cache memory.
2. Decode instruction/register fetch: The Instruction Decoder reads the

next instruction from the memory, decodes it, optimizes the order of
execution and sends the instruction on to the destinations.
3. Calculate operand address: Now, the effective address of each source
operand is calculated.
4. Fetch operand/memory access: Then, the memory is accessed to fetch
each operand. For a load instruction, data returns from memory and is
placed in the Load Memory Data (LMD) register. If it is a store, then data
from register is written into memory. In both cases, the operand address
as computed in the prior cycle is used.
5. Execute instruction: In this operation, the ALU performs the indicated
operation on the operands prepared in the prior cycle and stores the result
in the specified destination operand location.
6. Write back operand: Finally, the result is written into the register file or
stored into the memory.
These six stages of instruction pipeline are shown in a flowchart in figure 5.4.


Figure 5.4: Flowchart of an Instruction Pipeline
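To make the overlap concrete, here is a minimal C sketch (ours, not part of the original text) that prints which of the six stages each of four instructions occupies in every clock cycle, assuming one cycle per stage and no stalls. The abbreviations FI, DI, CO, FO, EI and WO are shorthand for the six operations listed above.

#include <stdio.h>

int main(void) {
    /* Six stages named after the list above: fetch instruction, decode
       instruction, calculate operand address, fetch operand, execute
       instruction, write back operand. */
    const char *stage[] = { "FI", "DI", "CO", "FO", "EI", "WO" };
    const int nstages = 6, ninstr = 4;

    printf("cycle");
    for (int c = 1; c <= ninstr + nstages - 1; c++)
        printf("%5d", c);
    printf("\n");

    for (int i = 0; i < ninstr; i++) {
        printf("I%-4d", i + 1);
        for (int c = 1; c <= ninstr + nstages - 1; c++) {
            int s = c - 1 - i;   /* stage occupied by instruction i in cycle c */
            printf("%5s", (s >= 0 && s < nstages) ? stage[s] : "");
        }
        printf("\n");
    }
    return 0;
}

Each instruction enters the pipeline one cycle after its predecessor, so after the pipeline fills, one instruction finishes in every cycle even though each individual instruction still takes six cycles.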

Self Assessment Questions


3. In _______________ the result into the register file is written or stored
into the memory.
4. In Decode Instruction/Register Fetch operation, the ______________
and the _______________ are determined and the register file is
accessed to read the registers.
5.4 Pipelined Execution of Integer and Boolean Instructions
Now let us discuss Pipelined execution of integer and Boolean instructions

with respect to the design space.


5.4.1 The design space
In this section, first we will overview the salient aspects of pipelined execution
of FX instructions. (In this section, the abbreviation FX will be used to denote
integer and Boolean.) With reference to figure 5.5 we emphasise two basic
aspects of the design space: how FX pipelines are laid out logically and how
they are implemented.

Figure 5.5: Design Space of the Pipelined Execution of FX instructions

A logical layout of an FX pipeline consists, first, of the specification of how
many stages an FX pipeline has and what tasks are to be performed in these
stages. These issues will be discussed in Section 5.4.2 for RISC and CISC
pipelines. The other key aspect of the design space is how FX pipelines are
implemented. In this respect we note that the term FX pipeline can be
interpreted in both a broader and a narrower sense. In the broader sense, it
covers the full task of instruction fetch, decode, execute and, if required, write
back. In this case, it is usually also employed for the execution of load/store (LS)
and branch instructions and is termed a master pipeline. By contrast, in the
narrower sense, an FX pipeline is understood to deal only with the execution
and writeback phases of the processing of FX instructions. Then, the
preceding tasks of instruction fetch, decode and, in the case of superscalar
execution, instruction issue are performed by a separate part of the processor.


5.4.2 Logical layout of FX pipelines


Integer and Boolean instructions account for a considerable proportion of
programs. Together, they amount to 30-40% of all executed instructions.
Therefore, the layout of FX pipelines is fundamental to obtaining a high-
performance processor.
In the following topic, we discuss how FX pipelines are laid out. However, we
describe the FX pipelines for RISC and CISC processors separately, since
each type has a slightly different scope. While processing operates
instructions, RISC pipelines have to cope only with register operands. By
contrast, CISC pipelines must be able to deal with both register and memory
operands as well as destinations.
Pipeline in RISC architecture: Before discussing pipelines in RISC
machines, let us first discuss what is a RISC machine? The term RISC stands
for Reduced Instruction Set Computing. RISC computers reduce chip
complexity by using simpler instructions. As a result, RISC compilers have to
generate software routines to perform complex instructions that would have
been done in hardware by CISC (Complex Instruction Set Computing)
computers. The salient features of RISC architecture are as follows:
• RISC architecture has instructions of uniform length.
• Instruction sets are streamlined to carry efficient and important
instructions.
• Memory addressing method is simplified. The complex references are split
up into various reduced instructions.
• The number of registers is increased. RISC processors can have a
minimum of 16 and a maximum of 64 registers. These registers hold
variables that are frequently used.
Pipelining is a standard feature in RISC processors. A typical RISC processor
pipeline operates in the following steps:
1. Fetch instructions from the memory
2. Read the registers and then decode instruction
3. Either execute instruction or compute the address
4. Access the operand stored at that memory location
5. Write the calculated result into the register
RISC instructions are simpler than the instructions used in CISC
processors, which suits the pipelining used there. CISC instructions are

of variable length, while RISC instructions are of the same length, so RISC
instructions can be fetched in a single operation. Theoretically, one clock cycle
should be taken by each stage in RISC processor so that the processor
completes execution of one instruction in one clock cycle. But practically, RISC
processors take more than one cycle for one instruction. The processor may
sometimes stall due to branch instructions and data dependencies. Data
dependency takes place if an instruction waits for the output of previous
instruction. Delay can also be due to the reason that instruction is waiting for
some data which is not currently available in the register. So, the processor
cannot finish an instruction in one clock cycle.
Branch instructions are those that tell the processor to make a decision about
which instruction to execute next. They are generally based on the results
of another instruction, so they can also create problems in a pipeline
if a branch is conditional on the results of an instruction, which has not yet
finished its path through the pipeline. In that case also, the processor takes
more than one clock cycle to finish one instruction.
Pipeline in CISC architecture: CISC is an acronym for Complex Instruction
Set Computer. The CISC machines are easy to program and make efficient
use of memory. Since the earliest machines were programmed in assembly
language and memory was slow and expensive, the CISC philosophy was
commonly implemented in large computers such as PDP-11. Most common
microprocessor designs such as the Intel 80x86 and Motorola 68K series have
followed the CISC philosophy. The CISC instructions sets have the following
main features:
• Two-operand format; here instructions have both source & destination.
• Register to register, memory to register and register to memory
commands.
• Multiple addressing modes for memory, having specialised modes for
indexing through arrays
• Depending upon the addressing mode, the instruction length varies
• Multiple clock cycles required by instructions to execute.
The Intel 80486, a CISC machine, uses a 5-stage pipeline. Here the CPU tries to
maintain one instruction execution per clock cycle. However, this architecture
does not provide maximum potential performance improvement due to the
following reasons:


• Occurrence of sub-cycles between the initial fetch and the instruction
execution.
• Execution of an instruction waiting for previous instruction output.
• Occurrence of the branch instruction.
5.4.3 Implementation of FX pipelines
Most of today's arithmetic pipelines are designed to perform fixed functions.
These arithmetic/logic units (ALUs) perform fixed-point and floating-point
operations separately. The fixed-point unit is also called the integer unit. The
floating-point unit can be built either as part of the central processor or on a
separate coprocessor. These arithmetic units perform scalar operations
involving one pair of operands at a time. The pipelining in scalar arithmetic
pipelines is controlled by software loops. Vector arithmetic units can be
designed with pipeline hardware directly under firmware or hardwired control.
Scalar and vector arithmetic pipelines differ mainly in the areas of register files
and control mechanisms involved. Vector hardware pipelines are often built as
add-on options to a scalar processor or as an attached processor driven by a
control processor. Both scalar and vector processors are used in modern
supercomputers.
Arithmetic pipeline stages: Depending on the function to be implemented,
different pipeline stages in an arithmetic unit require different hardware logic.
Since all arithmetic operations (such as add, subtract, multiply, divide,
squaring, square rooting, logarithm, etc.) can be implemented with the basic
add and shifting operations, the core arithmetic stages require some form of
hardware to add and to shift. For example, a typical three-stage floating-point
adder includes a first stage for exponent comparison and equalisation which
is implemented with an integer adder and some shifting logic; a second stage
for fraction addition using a high-speed carry look ahead adder; and a third
stage for fraction normalisation and exponent readjustment using a shifter and
another addition logic.


Arithmetic or logical shifts can be easily implemented with shift registers. High-
speed addition requires either the use of a carry-propagation adder (CPA)
which adds two numbers and produces an arithmetic sum as shown in
figure 5.6a, or the use of a carry-save adder (CSA) to "add" three input
numbers and produce one sum output and a carry output as exemplified in
figure 5.6b.

(Figure 5.6a shows a 4-bit CPA example: A = 1011, B = 0111, S = 10010 = A + B.
Figure 5.6b shows a CSA taking three 4-bit inputs X, Y and Z and producing a
bitwise sum vector Sb and a carry vector C such that Sb + C = X + Y + Z.)
(a) An n-bit carry-propagate adder (CPA), which either allows carry propagation
or applies the carry-lookahead technique
(b) An n-bit carry-save adder (CSA), where Sb is the bitwise sum of X, Y and Z,
and C is a carry vector generated without carry propagation between digits

Figure 5.6: Distinction between a Carry-propagate Adder (CPA) and a Carry-save Adder (CSA)


In a CPA, the carries generated in successive digits are allowed to propagate
from the low end to the high end, using either ripple carry propagation or some
carry look-ahead technique. In a CSA, the carries are not allowed to propagate
but instead are saved in a carry vector. In general, an n-bit CSA is specified
as follows: Let X, Y, and Z be three n-bit input numbers, expressed as
X = (xn-1, xn-2, ..., x1, x0) and so on. The CSA performs bitwise operations
simultaneously on all columns of digits to produce two n-bit output numbers,
denoted as Sb = (0, Sn-1, Sn-2, ..., S1, S0) and C = (Cn, Cn-1, ..., C1, 0). Note
that the leading bit of the bitwise sum Sb is always a 0, and the tail bit of the
carry vector C is always a 0. The input-output relationships are expressed
below:
Si = xi ⊕ yi ⊕ zi
Ci+1 = xi yi ∨ yi zi ∨ zi xi        ...(5.1)
for i = 0, 1, 2, ..., n - 1, where ⊕ is the exclusive OR and ∨ is the logical OR
operation. Note that the arithmetic sum of the three input numbers, i.e.,
S = X + Y + Z, is obtained by adding the two output numbers, i.e., S = Sb + C,
using a CPA. We use the CPA and CSAs to implement the pipeline stages of a
fixed-point multiply unit as follows.
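As a quick illustration of equation (5.1), the following C sketch (our own, with arbitrary 4-bit inputs) forms Sb and C with bitwise operators and then checks, using an ordinary addition in place of the CPA, that Sb + C = X + Y + Z.

#include <stdio.h>
#include <stdint.h>

/* Forms Sb and C according to equation (5.1) and checks, with an ordinary
   addition standing in for the CPA, that Sb + C = X + Y + Z. */
void csa(uint32_t x, uint32_t y, uint32_t z, uint32_t *sb, uint32_t *c) {
    *sb = x ^ y ^ z;                           /* Si   = xi XOR yi XOR zi        */
    *c  = ((x & y) | (y & z) | (z & x)) << 1;  /* Ci+1 = xi yi OR yi zi OR zi xi */
}

int main(void) {
    uint32_t x = 11, y = 7, z = 14, sb, c;     /* arbitrary 4-bit inputs */
    csa(x, y, z, &sb, &c);
    printf("Sb=%u C=%u Sb+C=%u X+Y+Z=%u\n",
           (unsigned)sb, (unsigned)c, (unsigned)(sb + c), (unsigned)(x + y + z));
    return 0;
}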
Multiply Pipeline Design: Consider as an example the multiplication of two
8-bit integers A x B = P, where P is the 16-bit product. This fixed-point
multiplication can be written as the summation of eight partial products as
shown below: P = A x B = P0 + P1 + P2 + ... + P7, where x and + are
arithmetic multiply and add operations, respectively.
(The worked example multiplies A = 10110101 by B = 10010011; the eight
partial products P0 to P7 are summed to give P = 0110011111101111.)

Note that the partial product Pj is obtained by multiplying the multiplicand A by
the jth bit of B and then shifting the result j bits to the left, for j = 0, 1, 2, ..., 7.
Thus Pj is (8 + j) bits long with j trailing zeros. The summation of the eight
partial products is done with a Wallace tree of CSAs plus a CPA at the final
stage, as shown in figure 5.7.

Figure 5.7: A Pipeline Unit for Fixed-point Multiplication of 8-bit Integers

The first stage (S1) generates all eight partial products, ranging from 8 bits to
15 bits, simultaneously. The second stage (S2) is made up of two levels of four
CSAs, and it essentially merges eight numbers into four numbers ranging from
13 to 15 bits. The third stage (S3) consists of two CSAs, and it merges four
numbers from S2 into two 16-bit numbers. The final stage (S4) is a CPA, which
adds up the last two numbers to produce the final product P.
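The C sketch below traces this four-stage organisation in software for the 8-bit example above. The grouping of the CSAs into stages S1-S4 follows the description in the text, but the helper names and the exact pairing of the eight numbers into CSAs are our own illustrative choices.

#include <stdio.h>
#include <stdint.h>

typedef struct { uint32_t sum, carry; } CSAOut;   /* a number in carry-save form */

/* One carry-save adder: merges three numbers into a sum word and a carry word. */
CSAOut csa(uint32_t x, uint32_t y, uint32_t z) {
    CSAOut o;
    o.sum   = x ^ y ^ z;
    o.carry = ((x & y) | (y & z) | (z & x)) << 1;
    return o;
}

int main(void) {
    uint32_t a = 0xB5, b = 0x93;   /* A = 10110101, B = 10010011 from the example */

    /* S1: generate the eight partial products Pj = (A << j) when bit j of B is 1. */
    uint32_t p[8];
    for (int j = 0; j < 8; j++)
        p[j] = ((b >> j) & 1u) ? (a << j) : 0u;

    /* S2: two levels of four CSAs merge the eight numbers into four. */
    CSAOut l1a = csa(p[0], p[1], p[2]);
    CSAOut l1b = csa(p[3], p[4], p[5]);
    CSAOut l2a = csa(l1a.sum, l1a.carry, l1b.sum);
    CSAOut l2b = csa(l1b.carry, p[6], p[7]);

    /* S3: two CSAs merge the four remaining numbers into two. */
    CSAOut l3a = csa(l2a.sum, l2a.carry, l2b.sum);
    CSAOut l3b = csa(l3a.sum, l3a.carry, l2b.carry);

    /* S4: the final CPA adds up the last two numbers to give the product P. */
    uint32_t product = l3b.sum + l3b.carry;
    printf("P = %u, A*B = %u\n", (unsigned)product, (unsigned)(a * b));
    return 0;
}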
For a maximum width of 16 bits, the CPA is estimated to need four gate levels

of delay. Each level of the CSA can be implemented with two-gate-level
logic. The delay of the first stage (S1) also involves two gate levels.
Thus the entire pipeline stages have an approximately equal amount of delay.
The matching of stage delays is crucial to the determination of the number of
pipeline stages, as well as the clock period. If the delay of the CPA stage can
be further reduced to match that of a single CSA level, then the pipeline can
be divided into six stages with a clock rate twice as fast. The basic concepts
can be extended to operands with a larger number of bits.
Self Assessment Questions
5. While processing operates instructions, RISC pipelines have to cope only
with __________________ .
6. In RISC architecture, instructions are of a uniform length (True/ False).
7. Name two microprocessors which follow the CISC philosophy.
8. ____________ adds two numbers and produces an arithmetic sum.

Activity 2:
Access the internet and find out more about the difference between fixed point
and floating point units.

5.5 Pipelined Processing of Loads and Stores


Now let us study pipelined processing of loads and stores in detail.
5.5.1 Subtasks of load and store processing
Loads and stores are frequent operations, especially in RISC code. While
executing RISC code we can expect to encounter about 25-35% load
instructions and about 10% store instructions. Thus, it is of great importance
to execute load and store instructions effectively. How this can be done is the
topic of this section.
To start with, we summarise the subtasks which have to be performed during
a load or store instruction.
Let us first consider a load instruction. Its execution begins with the
determination of the effective memory address (EA) from where data is to be
fetched. In straightforward cases, like RISC processors, this can be done in
two steps: fetching the referenced address register(s) and calculating the
effective address. However, for CISC processors address calculation may be
a difficult task, requiring multiple subsequent register fetches and address

calculations, as for instance in the case of indexed, postincremented, relative
addresses. Once the effective address is available, the next step is usually, to
forward the effective (virtual) address to the MMU for translation and to access
the data cache. Here, and in the subsequent discussion, we shall not go into
details of whether the referenced cache is physically or virtually addressed,
and thus we neglect the corresponding issues. Furthermore, we assume that
the referenced data is available in the cache and thus it is fetched in one or a
few cycles. Usually, fetched data is made directly available to the requesting
unit, such as the FX or FP unit, through bypassing. Finally, the last subtask to
be performed is writing the accessed data into the specified register.
For a store instruction, the address calculation phase is identical to that already
discussed for loads. However, subsequently both the virtual address and the
data to be stored can be sent out in parallel to the MMU and the cache,
respectively. This concludes the processing of the store instruction. Figure 5.8
shows the subtasks involved in executing load and store instructions.

Figure 5.8: Subtasks of Executing Load and Store Instructions


5.5.2 The design space
While considering the design space of pipelined load/store processing we take
into account only one aspect, namely whether load/store operations are
executed sequentially or in parallel with FX instructions (Figure 5.9).
In traditional pipeline implementations, load and store instructions are
processed by the master pipeline. Thus, loads and stores are executed
sequentially with other instructions (Figure 5.9).


(The figure gives examples of both approaches: processors whose load/store
(LS) addresses are calculated by the FX (master) pipeline, such as the i960CA
(1989), R4000 (1992), Pentium (1993) and 68060 (1993), and processors with
one or more autonomous load/store units, such as the MC88110 (1991),
SuperSPARC (1992), the PowerPC 601/603/604/620 and the R8000 (1994, with
2 LS units). The performance trend is towards separate load/store units.)

Figure 5.9: Sequential vs. Parallel Execution of Load/Store Instructions

In this case, the required address calculation of a load/store instruction can be
performed by the adder of the execution stage. However, one instruction slot
is needed for each load or store instruction.

A more effective technique for load/store instruction processing is to do it in
parallel with data manipulations (see again Figure 5.9). Obviously, this
approach assumes the existence of an autonomous load/store unit which can
perform address calculations on its own.
Let’s discuss both these techniques in detail.
5.5.3 Sequential consistency of instruction execution
By operating multiple EUs (Execution Units) in parallel, a processor can finish
instruction execution much faster. However, instruction execution should still
maintain sequential consistency, which covers two aspects:
1. Processor consistency - the order of instruction execution;

2. Memory consistency - the order of accessing the memory.
Processor consistency: The phrase processor consistency refers to the
consistency of instruction completion with sequential instruction execution.
Superscalar processors reflect two types of processor consistency, namely
weak and strong consistency.
Weak processor consistency allows instructions to complete in any order,
provided that no data dependencies are violated; data dependencies must be
detected and resolved during execution.
Strong processor consistency forces instructions to complete in program order.
This can be attained through a ROB (reorder buffer), a storage area through
which results are read and written back in program order.
Memory consistency: Another facet of superscalar instruction execution is
whether memory accesses are performed in the same order as in a sequential
processor.
Memory consistency is weak if memory accesses may occur out of order with
respect to strict sequential program execution, provided that data
dependencies are not violated. Simply stated, weak consistency permits load
and store reordering, as long as memory data dependencies are detected and
resolved.
Memory consistency is strong, if memory access occurs strictly in program
order and load/store reordering is prohibited.
Load and Store reordering
Load and store instructions affect both the processor and the memory. First
the ALU or an address unit computes the address, and then the load or store
instruction is executed.
A load then fetches the referenced data from the data cache (or memory). A
store instruction can send out its operand once the generated address has
been received.
A processor affirming weak memory consistency permits memory access
reordering. This can be considered advantageous for the following three
reasons:
1. Permitting load/store bypassing,
2. Making speculative loads or stores feasible
3. Allowing hiding of cache misses.
Load/Store bypassing
Load/Store bypassing means that any of the two can bypass each other. This
means that either stores can bypass loads or vice versa, without violating the
memory data dependencies. Allowing loads to bypass stores provides the
advantage of run-time overlapping of loops.
This is accomplished by permitting loads at the beginning of an iteration to
access memory without having to wait until the stores at the end of the
previous iteration have finished. In order to prevent fetching a false data value,
a load can bypass pending stores only if none of the previous stores has the
same target address as the load. Nevertheless, certain addresses of pending
stores may not yet be available, as the sketch below illustrates.
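A hedged sketch of this rule is shown below in C: a load may bypass the pending stores only when every earlier store has a known target address that differs from the load's own address. The data structure and names are illustrative assumptions, not taken from any particular processor.

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t addr;        /* computed target address                   */
    bool     addr_known;  /* false while the address is still pending  */
} PendingStore;

/* A load may bypass the pending stores only if every earlier store has a
   known target address that differs from the load's address. */
bool load_may_bypass(uint32_t load_addr, const PendingStore *stores, int n) {
    for (int i = 0; i < n; i++) {
        if (!stores[i].addr_known)        /* unknown address: be conservative     */
            return false;
        if (stores[i].addr == load_addr)  /* same target address: must not bypass */
            return false;
    }
    return true;
}

int main(void) {
    PendingStore pending[2] = { { 0x1000, true }, { 0x2000, true } };
    return load_may_bypass(0x3000, pending, 2) ? 0 : 1;  /* bypass allowed here */
}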
Speculative loads
Speculative loads avoid memory access delays that would otherwise be caused
by addresses that have not yet been computed or by clashes among the
addresses. Speculative loads must be checked for correctness and, if required,
corrective measures must be taken; in this respect they are similar to
speculative branches.
To check the addresses, the computed target addresses of the loads and stores
are written into the ROB (reorder buffer), where the address comparison is
carried out.
Reorder buffer (ROB)
The ROB was introduced in 1988 as a solution to the precise interrupt problem.
Currently, the ROB is the tool that assures sequentially consistent execution
when multiple EUs operate in parallel.

The ROB is a circular buffer with head and tail pointers. Instructions enter the
ROB in program order only, and an instruction can be retired only if it has
finished and all of its previous instructions have already retired.
Sequential consistency is maintained by directing instructions to update the
program state, that is, to write their results into memory or the referenced
architectural register(s), in proper program order. The ROB can also
successfully support both interrupt handling and speculative execution.
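The following C sketch (our own, with illustrative names) models the ROB as a circular buffer with head and tail pointers: entries are allocated at the tail in program order and retired from the head only once they have finished, which is what preserves sequential consistency.

#include <stdbool.h>

#define ROB_SIZE 16

typedef struct {
    bool busy;       /* entry allocated                   */
    bool finished;   /* execution complete, result ready  */
    int  dest_reg;   /* architectural register to update  */
    long result;
} RobEntry;

typedef struct {
    RobEntry e[ROB_SIZE];
    int head, tail, count;
} Rob;

/* Instructions enter at the tail in program order (at issue). */
int rob_dispatch(Rob *r, int dest_reg) {
    if (r->count == ROB_SIZE) return -1;            /* ROB full: issue must stall */
    int idx = r->tail;
    r->e[idx] = (RobEntry){ .busy = true, .finished = false,
                            .dest_reg = dest_reg, .result = 0 };
    r->tail = (r->tail + 1) % ROB_SIZE;
    r->count++;
    return idx;
}

/* An entry retires from the head only when it has finished and every earlier
   instruction has already retired; the program state is updated in order. */
bool rob_retire(Rob *r, long regfile[]) {
    if (r->count == 0 || !r->e[r->head].finished) return false;
    regfile[r->e[r->head].dest_reg] = r->e[r->head].result;
    r->e[r->head].busy = false;
    r->head = (r->head + 1) % ROB_SIZE;
    r->count--;
    return true;
}

int main(void) {
    Rob rob = { .head = 0, .tail = 0, .count = 0 };
    long regfile[32] = {0};
    int slot = rob_dispatch(&rob, 4);       /* issue an instruction writing reg 4 */
    rob.e[slot].finished = true;            /* later: execution completes         */
    rob.e[slot].result = 42;
    rob_retire(&rob, regfile);              /* retire in program order            */
    return (int)regfile[4] - 42;            /* 0 on success                       */
}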
5.5.4 Instruction Issuing and parallel execution
In this phase, execution tuples are created, and it is then decided which
execution tuples can be issued. Checking the availability of data and resources
at run-time in this way is known as instruction issuing. The instruction-issue
area feeds a number of pipelines.
In figure 5.10 you can see a reorder buffer which follows FIFO order.

Figure 5.10: A Reorder Buffer.


In this buffer the entries are received and released in FIFO order. An
instruction can be executed once its input operands are present; otherwise it
remains waiting in the instruction issue buffer.
Other constraints are associated with the buffers carrying the execution tuples.
Figure 5.11 shows the Parallel Execution Schedule (PES) of an iteration. The
PES assumes hardware resources consisting of one path to memory, two
integer units and one branch unit.

(The figure shows, cycle by cycle, which operations of the loop body are
scheduled on Integer Unit 1, Integer Unit 2, the Memory Unit and the Branch
Unit.)

Figure 5.11: Example of PES

The rows show the time steps and the columns show the operations performed
on each unit in a time step. In this PES the branch unit predicts the "ble"
branch as not taken and instructions from the predicted path are executed
speculatively. In this example renamed values are shown only for register r3,
but other registers can be renamed as well; the various values assigned to r3
are bound to different physical registers (R1, R2, R3, R4).
We now look at several ways of arranging the instruction issue buffer, in order
of increasing complexity.
Single queue method: Renaming is not needed in the single queue method
because there is only one queue and no out-of-order issue. In this method,
operand availability can be managed through simple reservation bits allotted to
every register: when an instruction that modifies a register is issued, the
register is reserved, and when the modification finishes, the reservation is
cleared.
Multiple queue method: In the multiple queue method, instructions issue in
order from each individual queue, but the queues may issue out of order with
respect to one another. The individual queues are organised by instruction
type.
Reservation stations: With reservation stations, instruction issue does not
follow FIFO order. Consequently, the reservation stations must monitor the
availability of their source operands. The conventional way of doing this is to
hold the operand data in the reservation station itself: when a reservation
station receives an instruction, any operand values that are already available
are read and placed in it.
After that, the station compares the operand designators of the data that is not
yet available with the result designators of completing instructions; on a
match, the result value is copied into the matching reservation station. An
instruction is issued once all of its operands are ready in the reservation
station. The reservation stations can be partitioned by instruction type to
reduce data paths, or they may behave
as a single block.
Self Assessment Questions
9. In traditional pipeline implementations, load and store instructions are
processed by the ___________________ .
10. The consistency of instruction completion with that of sequential
instruction execution is specified by ______________ .
11. Reordering of memory accesses is not allowed by a processor which
endorses weak memory consistency. (True/False)
12. ____________ is not needed in single queue method.
13. In reservation stations, the instruction issue does not follow the FIFO
order. (True/ False).

5.6 Summary
• The design space of pipelines can be subdivided into two aspects:
basic layout of a pipeline and dependency resolution.
• An Instruction pipeline operates on a stream of instructions by
overlapping and decomposing the three phases (fetch, decode and
execute) of the instruction cycle.
• Two basic aspects of the design space are how FX pipelines are laid out
logically and how they are implemented.
• A logical layout of an FX pipeline consists, first, of the specification of how
many stages an FX pipeline has and what tasks are to be performed in
these stages.
• The other key aspect of the design space is how FX pipelines are
implemented.
• In logical layout of FX pipelines, the FX pipelines for RISC and CISC
processors have to be taken separately, since each type has a slightly
different scope.
• Pipelined processing of loads and stores consist of sequential consistency
of instruction execution and parallel execution.

5.7 Glossary
• CISC: It is an acronym for Complex Instruction Set Computer. The CISC
machines are easy to program and make efficient use of memory.
• CPA: It stands for carry-propagation adder which adds two numbers
and produces an arithmetic sum.
• CSA: It stands for carry-save adder, which adds three input numbers
and produces one sum output and one carry output.
• LMD: Load Memory Data.
• Load/Store bypassing: It means that either loads can bypass stores or
vice versa, without violating the memory data dependencies.
• Memory consistency: It is used to find out whether memory access is
performed in the same order as in a sequential processor.
• Processor consistency: It is used to indicate the consistency of
instruction completion with that of sequential instruction execution.
• RISC: It stands for Reduced Instruction Set Computing. RISC
computers reduce chip complexity by using simpler instructions.
• ROB: It stands for Reorder Buffer. ROB is an assurance tool for
sequential consistency execution where multiple EUs operate in parallel.
• Speculative loads: They avoid memory access delay. This delay can be
caused due to the non- computation of required addresses or clashes
among the addresses.
• Tomasulo’s algorithm: It allows the replacement of sequential order by
data-flow order.

5.8 Terminal Questions


1. Name the two sub divisions of design space of pipelines and write short
notes on them.
2. What do you mean by pipeline instruction processing?
3. Explain the concept of pipelined execution of Integer and Boolean
instructions.
4. Describe the logical layout of both RISC and CISC computers.
5. Write in brief the process of implementation of FX pipelines.
6. Explain the various subtasks involved in load and store processing
7. Write short notes on:
a. Sequential Consistency of Instruction Execution
b. Instruction Issuing and Parallel Execution

5.9 Answers
Self Assessment Questions
1. Microprocessor without Interlocked Pipeline Stages
2. Dynamically
3. Write Back Operand
4. Opcode, operand specifiers
5. Register operands
6. True

7. Intel 80x86 and Motorola 68K series
8. Carry-propagation adder (CPA)
9. Master pipeline
10. Processor Consistency
11. False
12. Renaming
13. True

Terminal Questions
1. The design space of pipelines can be sub divided into two aspects: basic
layout of a pipeline and dependency resolution. Refer Section 5.2.
2. A pipeline instruction processing technique is used to increase the
instruction throughput. It is used in the design of modern CPUs,
microcontrollers and microprocessors. Refer Section 5.3 for more details.
3. There are two basic aspects of the design space of pipelined execution of
Integer and Boolean instructions: how FX pipelines are laid out logically
and how they are implemented. Refer Section 5.4.
4. While processing operates instructions, RISC pipelines have to cope only
with register operands. By contrast, CISC pipelines must be able to deal
with both register and memory operands as well as destinations. Refer
Section 5.4.
5. Depending on the function to be implemented, different pipeline stages in
an arithmetic unit require different hardware logic. Refer Section 5.4.
6. The execution of load and store instructions begins with the
determination of the effective memory address (EA) from where data is to
be fetched. This can be broken down into subtasks. Refer
Section 5.5.
7. The overall instruction execution of a processor should mimic sequential
execution, i.e. it should preserve sequential consistency. Refer Section
5.5. The first step is to create and buffer execution tuples and then determine
which tuples can be issued for parallel execution. Refer Section 5.5.

References:
• Hwang, K. (1993) Advanced Computer Architecture. McGraw-Hill.
• Godse D. A. & Godse A. P. (2010). Computer Organisation, Technical
Publications. pp. 3-9.
• Hennessy, John L., Patterson, David A. & Goldberg, David (2002)
Computer Architecture: A Quantitative Approach, (3rd edition), Morgan
Kaufmann.
• Sima, Dezso, Fountain, Terry J. & Kacsuk, Peter (1997) Advanced
computer architectures - a design space approach, Addison-Wesley-
Longman: I-XXIII, 1-766.

E-references:
• http://www.eecg.toronto.edu/~moshovos/ACA06/readings/ieee-proc.superscalar.pdf
• http://webcache.googleusercontent.com/search?q=cache:yU5nCVnju9cJ:www.ic.uff.br/~vefr/teaching/lectnotes/AP1-topico3.5.ps.gz+load+store+sequential+instructions&cd=2&hl=en&ct=clnk&gl=in

Unit 6 Instruction-Level Parallelism and its Exploitation

Structure:
6.1 Introduction
Objectives
6.2 Dynamic Scheduling
Advantages of dynamic scheduling Limitations of dynamic
Scheduling
6.3 Overcoming Data Hazards
6.4 Dynamic Scheduling Algorithm - The Tomasulo Approach
6.5 High performance Instruction Delivery
Branch target buffer
Advantages of branch target buffer
6.6 Hardware-based Speculation
6.7 Summary
6.8 Glossary
6.9 Terminal Questions
6.10 Answers

6.1 Introduction
In pipelining, two or more instructions that are independent of each other can
overlap. This possibility of overlap is known as ILP (instruction-level
parallelism), so called because the instructions may be evaluated in parallel.
The level of parallelism is quite small in straight-line code, where there are
no branches except at the entry and exit. The easiest and most widely used
methodology to enhance parallelism is to exploit parallelism among the

loop iterations. This is termed as “loop-level parallelism”.
In the previous unit, you studied design space of pipelines. You studied various
aspects such as pipelined execution of integer and Boolean instructions and
pipelined processing of loads and stores. In this unit, we will throw light on the
process of overcoming hazards with dynamic scheduling, its examples and
algorithm. We will also examine high-performance instruction delivery and
hardware-based speculation.
Objectives:
After studying this unit, you should be able to:
• describe the process of overcoming the data hazards with dynamic
scheduling
• give examples of dynamic scheduling
• describe the Tomasulo approach of dynamic scheduling algorithm
• identify techniques of overcoming data hazards with dynamic scheduling
• analyse the concept of high performance instruction delivery
• explain hardware based speculation

6.2 Dynamic Scheduling


A pipeline fetches an instruction and executes it. This flow is restrained if
there is a data dependence between an instruction already in the pipeline and
the fetched instruction that cannot be hidden with bypassing or forwarding.
When the data dependence between the instructions cannot be hidden, the
hazard detection hardware generally stalls the instruction pipeline; new
instructions are neither fetched nor issued until the dependence is resolved.
When the compiler schedules the instructions so as to separate the dependent
instructions and thereby decrease the actual hazards and their resultant stalls,
this is termed static scheduling.
There is another category of scheduling known as dynamic scheduling, which
is hardware-based. In this approach, the hardware rearranges the instruction
execution to reduce the stalls while maintaining the data flow and exception
behaviour of the instruction execution.
6.2.1 Advantages of dynamic scheduling
There are various advantages of dynamic scheduling. They are as follows:
1. Dynamic scheduling is helpful in situations where the data dependencies
between the instructions are not known during the time of compilation.
2. Dynamic scheduling also helps to simplify the task of the compiler.
3. It permits code compiled with one pipeline in mind to execute efficiently on
some other pipeline.
6.2.2 Limitations of dynamic scheduling
Dynamic scheduling has several limitations:
• The pipelining techniques we have used so far use in-order instruction
issue, which is a major limitation. In-order issue means that subsequent
instructions cannot proceed if any instruction is stalled in the pipeline.
Therefore, when two closely positioned instructions are dependent on each
other, a stall occurs.
With multiple functional units, this limitation can leave those units idle.
Suppose an instruction j depends on a long-latency instruction i that is
currently executing in the pipeline; then all instructions following j must be
stalled until i completes and j can begin execution. For example, consider
this code sequence:

DIVD F0, F2, F4
ADDD F10, F0, F8
SUBD F12, F8, F14

Here F0, F1, F2, ..., F14 are the floating-point registers (FPRs), and DIVD,
ADDD and SUBD are floating-point operations on double precision values
(denoted by D). The dependence of ADDD on DIVD causes a stall in the
pipeline, and thus the SUBD instruction cannot execute. If the instructions did
not have to execute in this sequence, this limitation could be removed.
In the DLX pipeline (DLX is a RISC processor architecture), structural and
data hazards are checked during instruction decode (ID). If an instruction can
execute properly, it is issued from ID. To allow the SUBD to commence
execution, we need to examine the following two issues separately:
• first, we need to check for any structural hazards, and
• second, we need to wait until no data hazard remains.

Structural hazards must be checked at the time of issue; therefore, in-order
instruction issue is still used. Moreover, instruction execution must

initiate at the instant when the data operands are available for access.
Therefore the pipeline which executes out-of-order results in out-of-order
completion.
But the out-of-order completion results in various types of difficulties in
exception handling. The exceptions generated in a dynamically scheduled
processor are also imprecise, because an instruction may be entirely
executed before a previously issued instruction generates an exception. In
such a scenario, it is quite challenging to restart after the interrupt.
For carrying out out-of-order execution, we necessarily need to separate the
ID (Instruction Decode) pipe stage into two stages, as follows:
1. Issue - In this stage, the instructions are decoded and a check for
identifying structural hazards is performed.
2. Read operands - In this stage the operands are read after no data hazards
are detected.
IF (instruction fetch) comes before the issue stage. The IF can fetch and issue
instructions from a queue or latch. The EX (Execution) stage follows the read
operands stage. Based on the complexity of operation, the execution may
involve various cycles. Consequently, there must be a demarcation between
the initiation of instruction execution and completion of instruction execution.
Doing so will allow simultaneous execution of multiple instructions.
Self Assessment Questions
1. The methodology, which involves separation of dependent instructions,
minimizes data/structural hazards and consequential stalls is termed as

2. To commence with the execution of the SUBD, we need to separate the
issue method into 2 parts: firstly __________________ and secondly

3. ______________ stage precedes the issue phase.


4. The _______________ stage follows the read operands stage similar
to the DLX pipeline.

6.3 Overcoming Data Hazards


Now let us discuss the methods of overcoming data hazards with dynamic
scheduling in this section.
Dynamic Scheduling with a Scoreboard
In a dynamically scheduled pipeline, all instructions pass through the issue
stage in order (in-order issue); however, they can be stalled or bypass each
other in the second stage (read operands) and thus enter execution out of
order. Scoreboarding is a method of permitting out-of-order instruction
execution when sufficient resources are available and there are no data
dependencies. The CDC 6600 (a family of mainframe computers manufactured
by Control Data Corporation) first developed this capability, and the technique
is named after its scoreboard.
Out-of-order instruction execution may give rise to WAR (Write after Read, a
type of data hazard) hazards which are not present in DLX floating point and
integer pipelines.
Let us consider that the SUBD destination is F8 in the earlier example; then
the code sequence will be as shown below:

DIVD F0, F2, F4
ADDD F10, F0, F8
SUBD F8, F8, F14

In this example you can see that ADDD and SUBD are interdependent (ADDD
reads F8, which SUBD writes). If SUBD is executed before ADDD has read its
operands, the data dependence will be violated, resulting in wrong execution.
Similarly, to avoid violating output dependencies, it is essential to detect WAW
(Write after Write) data hazards.
The scoreboard technique helps to minimise or remove both the structural and
the data hazards: the scoreboard stalls the later instruction that is involved in
the dependence. The scoreboard's goal is to execute an instruction in each
clock cycle (when no structural hazards exist); therefore, when one instruction
stalls, other independent instructions may still be executed.
The scoreboard technique takes complete responsibility for instruction issue
and execution, including all hazard detection. Taking advantage of out-of-order
execution requires several instructions to be in execution simultaneously. We
can achieve this in either of two ways:
1. by utilizing pipelined functional units, or
2. by using multiple functional units.
These two approaches are essentially equivalent for pipeline control; here we
will consider the use of multiple functional units.
The CDC 6600 comprises 16 distinct functional units, of the following
types:
• Four FPUs (floating-point units)
• Five units for memory references
• Seven units for integer operations.
FPUs are of prime importance in DLX scoreboards in comparison to other FU
(functional units).
For example: We have 2 multipliers, 1 adder, 1 divide unit, and 1 integer unit
for all integer operations, memory references and branches.
The methodology for the DLX & CDC 6600 is quite similar as both of these are
load-store architectures. Given below in figure 6.1 is the basic structure of a
DLX Processor with a Scoreboard.

Figure 6.1: The Basic Structure of a DLX Processor with a Scoreboard


Here every instruction involves four execution steps, considering only the FP
operations. Now let us analyse in detail the manner in which the scoreboard
stores the essential information used to determine when to move from one
step to another. Figure 6.2 below shows these steps.


Figure 6.2: Steps Replaced in the Standard DLX Pipeline

Now let us study the four steps in the scoreboard technique in detail.
1. Issue: The issue step replaces part of the ID step of the DLX pipeline. In
this step the instruction is forwarded to the FU and the internal data
structures are updated. This is done only in two situations:
• the FU for the instruction is free, and
• no other active instruction has the same register as its destination, which
ensures that the operation is free from WAW (Write after Write) hazards.
When a structural or WAW hazard is detected, a stall occurs and the issue
of all subsequent instructions is stopped until these hazards have been
cleared. When a stall occurs in this stage, the buffer between instruction
fetch and issue fills up. If the buffer holds a single instruction, instruction
fetch stalls at once; if the buffer is a queue, fetch stalls only after the
queue is completely full.
2. Read operands: The scoreboard examines whether the source operands
are available. A source operand is available when no earlier issued active
instruction is going to write it. When the source operands become
available, the scoreboard tells the FU to read the operands from the
registers and begin execution. RAW (Read after Write) hazards are
resolved dynamically in this step, and instructions may be sent into
execution out of order. The issue and read-operands steps together
complete the functions of the ID step of the DLX
pipeline.
3. Execution: After receiving the operands, the FU starts execution. When
the result is ready, the FU notifies the scoreboard that execution has
completed. This step replaces the EX step of the DLX pipeline, although
here it may take multiple cycles.
4. Write result: After the FU completes execution, the scoreboard checks
whether a WAR hazard is present; if one is detected, it stalls the
completing instruction. A WAR hazard occurs in code such as our earlier
example of ADDD and SUBD, where both utilize F8. The code for that
example is again shown below:

DIVD F0, F2, F4
ADDD F10, F0, F8
SUBD F8, F8, F14

Here you can see that the source operand F8 of ADDD is the same as the
destination register of SUBD. Moreover, ADDD is dependent on the previous
instruction DIVD and so has not yet read its operands. In this case, the
scoreboard will stall SUBD in its write result stage until ADDD has read its
operands.
A completing instruction is not permitted to write its result when:
• there is an instruction preceding it (in issue order) that has not yet read
its operands, and
• one of those operands is the same register as the result of the completing
instruction.

Once the WAR hazard has been cleared, the scoreboard prompts the FU to
store its result into the destination register. This step is a replacement of the
WB step of the DLX pipeline.
The DLX scoreboard keeps track of the functional units. Figure 6.3 shows what
the scoreboard's information looks like partway through the execution of this
simple sequence of instructions:

LD F6, 34(R2)
LD F2, 45(R3)
MULTD F0, F2, F4
SUBD F8, F6, F2
DIVD F10, F0, F6
ADDD F6, F8, F2

Scoreboard shows three types of status. These are:


1. Instruction status: Indicates which of the four steps the instruction is
currently in.
2. Functional unit status: Indicates the state of the functional unit (FU).
There are nine fields for every FU: Busy (whether the unit is busy or idle),
Op (the operation to perform), Fi (the destination register), Fj and Fk (the
source registers), Qj and Qk (the functional units that will produce Fj and
Fk), and Rj and Rk (flags that indicate when Fj and Fk are ready).
3. Register result status: It indicates which FU will write each register, if an
active instruction has that register as its destination. When no pending
instruction will write a register, the field is left blank.
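A compact way to picture this bookkeeping is the C sketch below, which groups the nine per-unit fields and the register result status into structures. The layout is our own, in the spirit of a CDC 6600-style scoreboard; can_issue() expresses the issue condition from step 1 above (unit free and no pending write to the destination register).

#include <stdbool.h>

enum { NUM_REGS = 32, NUM_UNITS = 5, NO_UNIT = -1 };

typedef struct {
    bool busy;      /* Busy: is the unit in use?                             */
    int  op;        /* Op:   operation to perform (e.g. add, sub, mult)      */
    int  fi;        /* Fi:   destination register                            */
    int  fj, fk;    /* Fj, Fk: source registers                              */
    int  qj, qk;    /* Qj, Qk: units producing Fj / Fk (NO_UNIT when ready)  */
    bool rj, rk;    /* Rj, Rk: source operands ready and not yet read?       */
} FunctionalUnitStatus;

typedef struct {
    FunctionalUnitStatus fu[NUM_UNITS];
    int register_result[NUM_REGS];  /* unit that will write each register,
                                       NO_UNIT when no write is pending      */
} Scoreboard;

/* Issue condition from step 1: the unit is free and no active instruction
   has the same destination register (avoids structural and WAW hazards). */
bool can_issue(const Scoreboard *sb, int unit, int dest_reg) {
    return !sb->fu[unit].busy && sb->register_result[dest_reg] == NO_UNIT;
}

int main(void) {
    Scoreboard sb = {0};
    for (int r = 0; r < NUM_REGS; r++)
        sb.register_result[r] = NO_UNIT;   /* no pending register writes     */
    return can_issue(&sb, 0, 8) ? 0 : 1;   /* unit 0 free, F8 free: can issue */
}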


(The figure tabulates, at this point in the execution, the instruction status of
each instruction in the sequence, the nine functional unit status fields for the
Integer, Mult1, Mult2, Add and Divide units, and the register result status for
registers F0 to F30.)


Figure 6.3: Components of the Scoreboard

Self Assessment Questions


5. When the pipeline executes ____________________ before ADDD, it
violates the interdependence between instructions leading to wrong
execution.
6. The objective of scoreboard is achieved with _________________ or
______________________ functional units or both.
7. The source operand for ADDD is ______________ , and is similar to
destination register of SUBD.
8. The FU status ___________________ shows whether it is busy or
idle.

6.4 Dynamic Scheduling Algorithm - The Tomasulo Approach
This dynamic scheduling algorithm was proposed by Robert Tomasulo.
Tomasulo's scheme combines the important constituents of the scoreboard
methodology with the introduction of register renaming. The scheme has many
variants; the basic idea behind the algorithm is to avoid WAR and WAW data
hazards by renaming registers.
The Tomasulo algorithm
It was formulated for the IBM 360/91 in 1967, approximately three years after
the CDC 6600. The algorithm is described here with emphasis on the FPUs, in
relation to a pipelined FPU for DLX. The key distinction between DLX and the
IBM 360 is that the IBM 360 processor contains register-memory instructions.
Tomasulo's algorithm makes use of a load FU, so no major alterations are
needed to add register-memory addressing modes. One significant addition is an
extra bus. The IBM 360/91 also contains pipelined FUs rather than multiple
FUs; the only difference this makes is that a pipelined FU can start at most
one operation per clock cycle. In this respect there are no major variations
between the IBM 360/91 and the CDC 6600. The IBM 360/91 can hold 3 operations
for the FP (floating-point) adder and 2 for the FP multiplier. Additionally,
it may have a maximum of 6 FP loads (or memory references) and 3 FP stores
outstanding. Load data buffers and store data buffers are used for this.
There are various differences between Tomasulo’s scheme and
scoreboarding. These are given below:
• In Tomasulo’s scheme, the control and buffers are dispersed between FUs
(Functional Units) but it is centralised in score board technique. In case of
Tomasulo’s scheme register renaming is done to avoid the data and
structural hazards but no register renaming is done in score board
technique.
• A CDB (Common Data Bus) broadcasts the results to all FUs in Tomasulo's
scheme, whereas the scoreboard technique writes the results into the
registers.
• Tomasulo's algorithm reads operands from the registers and the CDB and
writes results to the CDB only, while in the scoreboard technique operands are
read from, and results written to, the registers.

• In Tomasulo's scheme, issue can take place only when a reservation station
(RS) is free, whereas in the scoreboard technique issue can take place when
the FU is free.
Figure 6.4 shows the basic structure of a Tomasulo-based floating-point unit
for DLX.

Figure 6.4: Basic Structure of a DLX Floating-Point Unit using Tomasulo's Algorithm

The reservation station contains the following:


• issued instructions which are waiting for execution by the FU,
• operand values for the instruction, if they have already been computed
(otherwise, the source of the operands), and
• the information required to control the instruction once it has started
execution.
Addresses that come from or go to memory are held in the load buffers and
store buffers. A pair of buses connects the FP registers to the FUs, and a bus
connects the FP registers to the store buffers. The common data bus (CDB)
carries results

from the FUs and from memory to everywhere except the load buffers. The
buffers and reservation stations (RS) contain tag fields that are used for
hazard control.
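
The following C sketch shows how one reservation station entry and the CDB
broadcast might be represented in a simulator. The field names (Busy, Op, Vj,
Vk, Qj, Qk) are the conventional ones for Tomasulo's algorithm; the exact
widths, the tag encoding and the helper function are illustrative assumptions,
not the IBM 360/91 hardware.

#include <stdbool.h>

/* One reservation station entry. */
struct rs_entry {
    bool   busy;        /* holds an issued, not yet completed operation      */
    int    op;          /* operation to perform, e.g. ADDD or SUBD           */
    double Vj, Vk;      /* source operand values, once they are known        */
    int    Qj, Qk;      /* tags of the entries producing them; 0 = available */
    int    tag;         /* tag this entry broadcasts on the CDB              */
};

/* When a result with tag t and value v appears on the common data bus,
 * every waiting entry that named t as a producer captures the value. */
void cdb_broadcast(struct rs_entry *rs, int n, int t, double v)
{
    for (int i = 0; i < n; i++) {
        if (!rs[i].busy) continue;
        if (rs[i].Qj == t) { rs[i].Vj = v; rs[i].Qj = 0; }
        if (rs[i].Qk == t) { rs[i].Vk = v; rs[i].Qk = 0; }
    }
}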
Tomasulo's scheme is appealing when designers are compelled to pipeline an
architecture for which it is difficult to schedule code, or which has an
insufficient number of registers. When evaluated in terms of cost, however,
the benefits of the Tomasulo approach over compiler scheduling for an
efficient single-issue pipeline are small. But with the increasing demand for
issue capability and for better performance on difficult-to-schedule code, the
techniques of dynamic scheduling and register renaming are becoming more
widespread.
Self Assessment Questions
9. The Tomasulo scheme was invented by _____________ .
10. The ________________ could hold 3 operations for the FP adder and
2 for the FP multiplier.
11. The ____________ and _____________ are used to store the data/
addresses that come from or go to memory.

Activity 1:
Imagine yourself as a computer architect. Explain the measures you will take
to overcome data hazards with dynamic scheduling.

6.5 High Performance Instruction Delivery


In the MIPS 5-stage pipeline, the address of the next instruction fetch must
be known before the completion of the current Instruction Fetch (IF) cycle.
Consequently, for zero branch penalties, it must be known whether the fetched
(as-yet undecoded) instruction is a branch or not; if it is a branch, the next
PC (program counter) must also be known. This is accomplished by introducing a
cache which contains the address of the following instruction for both the
taken and the not-taken case. This cache is known as the Branch-Target Cache
or Branch-Target Buffer (BTB). The branch-prediction buffer is accessed during
the ID phase, after instruction decode, i.e., we know the branch-target
address at the end of the ID stage and can fetch the next predicted
instruction. This is shown in figure 6.5.


Figure 6.5: Branch Prediction

6.5.1 Branch target buffer


The Branch Target Buffer has three fields:
• Lookup: addresses of the known branch instructions (predicted as taken)
• Predicted PC: the predicted next PC (branch target) for a fetched
instruction that hits in the buffer
• Prediction state: optional extra prediction state bits
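
A minimal C sketch of a BTB built from the three fields above is given below.
The table size, the direct-mapped index function and the field widths are
illustrative assumptions, not those of any particular processor.

#include <stdbool.h>
#include <stdint.h>

#define BTB_ENTRIES 512

struct btb_entry {
    bool     valid;
    uint64_t branch_pc;      /* lookup: address of a known (taken) branch        */
    uint64_t predicted_pc;   /* target to fetch if the branch is predicted taken */
    uint8_t  state;          /* optional 2-bit prediction state                  */
};

struct btb_entry btb[BTB_ENTRIES];

/* Probe the buffer during instruction fetch: returns true and the
 * predicted target when the fetch address hits in the BTB. */
bool btb_lookup(uint64_t pc, uint64_t *target)
{
    struct btb_entry *e = &btb[(pc >> 2) % BTB_ENTRIES];   /* direct-mapped */
    if (e->valid && e->branch_pc == pc) {
        *target = e->predicted_pc;
        return true;
    }
    return false;
}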
The Branch Target Buffer has the following complications:
• Complications arise in using a 2-bit predictor, because it uses information
for both taken and not-taken branches
• This complication is resolved in PowerPC processors by using both a target
buffer and a prediction buffer
The penalty can be calculated by looking at the probability of two events:
(i) Branch predicted taken but ends up not taken
= buffer hit rate x fraction of incorrect predictions
= 0.95 x 0.1 = 0.095
(ii) Branch is taken but is not found in the buffer
= fraction of incorrect predictions = 0.1
The penalty in both cases is 2 cycles; therefore,
Branch penalty = (0.095 + 0.1) x 2 = 0.195 x 2 = 0.39
Example:
Consider a branch-target buffer implemented for conditional branches only, for
a pipelined processor.

Assuming that:
• Misprediction penalty = 4 cycles
• Buffer miss penalty = 3 cycles
• Hit rate and accuracy each = 90%
• Branch frequency = 15%
Solution:
The speedup with a branch target buffer versus no BTB is expressed as:
Speedup = CPI no BTB / CPI BTB
= (CPI base + Stalls no BTB) / (CPI base + Stalls BTB)
The stalls are determined as:
Stalls = Σ (Frequency x Penalty)
i.e., the sum over all the stall cases of the product of the frequency of each
case and its stall penalty.
i) Stalls no BTB = 0.15 x 2 = 0.30
ii) To find Stalls BTB, we have to consider each outcome from the BTB.
There are three possibilities:
a) Branch misses the BTB:
Frequency = 15% x 0.1 = 1.5% = 0.015
Penalty = 3
Stalls = 0.045
b) Branch hits and is correctly predicted:
Frequency = 15% x 0.9 (hit) x 0.9 (correct) = 12.1% = 0.121
Penalty = 0
Stalls = 0
c) Branch hits but is incorrectly predicted:
Frequency = 15% x 0.9 (hit) x 0.1 (misprediction) = 1.3% = 0.013
Penalty = 4
Stalls = 0.052
iii) Stalls BTB = 0.045 + 0 + 0.052 = 0.097
Speedup = (CPI base + Stalls no BTB) / (CPI base + Stalls BTB)
= (1.0 + 0.3) / (1.0 + 0.097)
= 1.2
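The arithmetic can be checked with a short C program. The values are taken
from the example above; note that the text rounds 1.35% down to 1.3%, so it
reports Stalls BTB = 0.097 where exact arithmetic gives 0.099, and the speedup
comes out at about 1.2 either way.

#include <stdio.h>

int main(void)
{
    double freq = 0.15, hit = 0.90, accurate = 0.90, cpi_base = 1.0;

    double stalls_no_btb = freq * 2.0;                        /* 0.30        */
    double stalls_btb =
        freq * (1.0 - hit) * 3.0 +                            /* miss        */
        freq * hit * accurate * 0.0 +                         /* correct     */
        freq * hit * (1.0 - accurate) * 4.0;                  /* mispredict  */

    printf("speedup = %.2f\n",
           (cpi_base + stalls_no_btb) / (cpi_base + stalls_btb));
    return 0;
}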
In order to improve instruction delivery further, one possible variation of
the Branch Target Buffer is:
• to store one or more target instructions instead of, or in addition to, the
predicted target address
6.5.2 Advantages of branch target buffer
There are several advantages of branch target buffer. They are as follows:
• It potentially allows a larger BTB, since the buffer access is allowed to
take longer than the time between consecutive instruction fetches
• Buffering the actual target instructions allows branch folding, i.e.,
zero-cycle unconditional branching and sometimes zero-cycle conditional
branching
Self Assessment Questions
12. The branch-prediction buffer is accessed during the _____ stage.
13. The _____ field helps check the addresses of the known branch
instructions.
14. Buffering the actual Target-Instructions allow ___________ .

6.6 Hardware-Based Speculation


Hardware-based speculation is the methodology for cutting down the
consequences of control dependencies in multiple issue processors.
This methodology is based upon three basic ideas:
• Dynamic branch prediction, which determines which particular instructions
have to be executed,
• Speculation that permits the instructions execution prior to resolving the
control dependencies and
• Dynamic scheduling that relates to scheduling of various grouping of the
basic blocks.
When a processor combines branch prediction with dynamic scheduling, it
fetches and issues instructions as if the branch-prediction results were
always correct.
Hardware-based speculation makes use of dynamic data dependences to
select when to carry out instructions. This technique of executing programs is
basically a data-flow execution: operations carry out as soon as their operands
are accessible. This can be seen in figure 6.6.


Figure 6.6: Hardware-based Speculation

Advantages of hardware-based speculation


Some of the major advantages of hardware-based speculation (in comparison to
software-based speculation) are as follows:
1. A hardware-based speculation helps in disambiguation of memory
references at the run time mostly in case of pointers. This permits us to
transfer loads past stores at runtime.
2. Hardware-based speculation is more beneficial whenever hardware-based
branch prediction is superior to software-based branch prediction performed
at compile time. This is true for many integer programs.
For instance, a profile-based static predictor has an approximate
misprediction rate of 16% for four of the five integer SPEC programs we
use, while a hardware predictor has an approximate misprediction rate of
11%. As speculated instructions might retard the computation rate
whenever the prediction is wrong, this variation is substantial.
3. Hardware-based speculation helps to maintain an entirely accurate
exception model for speculated instructions.

4. Hardware-based speculation demands neither compensation nor bookkeeping
code.
5. Hardware-based speculation with dynamic scheduling does not require
distinct code sequences to achieve good performance on different
implementations of the architecture. Compiler-based speculation and
scheduling, in contrast, need code sequences tailored to the machine, and
older or different code sequences may degrade performance.
6. Although hardware speculation and scheduling can benefit from compilers
that schedule and tune code for the processor, hardware-based approaches are
expected to perform well even with older or dissimilar code sequences.
Although this advantage is the hardest to quantify, it may be the most
significant in the end.
Self Assessment Questions
15. __________ makes use of dynamic data dependences to select
when to carry out instructions.
16. Hardware-based speculation is more beneficial whenever the
hardware-based branch prediction is higher-up to software
___________________ performed at time of compilation
17. Hardware-based speculation helps to maintain an entirely accurate
exception model for __________ .
18. Hardware-based speculation demands neither ________________ nor
________________ .
Activity 2:
In a computer designing situation, discuss why software-based speculation
is not superior.

6.7 Summary
Let us recapitulate the important concepts discussed in this unit:
• In pipelining, implementation of instructions independent of one another
can overlap. This possible overlap is known as instruction-level parallelism
(ILP)
• Pipeline fetches an instruction and executes it.
• In DLX pipelining, all the structural & data hazards are analyzed
throughout the process of instruction decode (ID).
• A dynamic scheduling is the hardware based scheduling. In this
approach, the hardware rearranges the instruction execution to reduce the
stalls.

• Score board is a method of permitting out-of-order instruction execution
when sufficient resources are available and there are no data
dependencies. There are four steps in this technique.
• The Tomasulo algorithm focuses on the floating-point units, in relation to
a pipelined FPU for DLX.
• The addresses, which come from or go to the memory are held in the load
buffers and store buffers.
• Hardware-based speculation makes use of dynamic data dependences to
select when to carry out instructions.

6.8 Glossary
• Dynamic scheduling: Hardware based scheduling that rearranges the
instruction execution to reduce the stalls.
• EX: Execution stage
• FP: Floating-Point Unit
• ID: Instruction Decode
• ILP: Instruction-Level Parallelism
• Instruction-level parallelism: Overlap of independent instructions on one
another
• Static scheduling: Separating dependent instructions and minimising the
number of actual hazards and resultant stalls.
6.9 Terminal Questions
1. What do you understand by instruction-level parallelism? Also, explain
loop-level parallelism.
2. Describe the concept of dynamic scheduling.
3. How does the execution of instructions take place under dynamic
scheduling with score boarding?
4. What is the goal of score boarding?
5. Explain the Tomasulo approach.

6.10 Answers
Self Assessment Questions
1. Static scheduling
2. Check the structural hazards, wait for the absence of a data hazards
3. An instruction fetch
4. EX
5. SUBD, ADDD
6. Pipelined, multiple

7. F8
8. Busy
9. Robert Tomasulo
10. IBM 360/91
11. Load buffers, store buffers
12. ID
13. Lookup
14. Branch Folding
15. Hardware-based speculation
16. Software-based branch prediction
17. Speculated instructions
18. Compensation, bookkeeping code
Terminal Questions
1. In pipelining, implementation of instructions independent of one another
can overlap. This possible overlap is known as instruction-level parallelism
(ILP). Refer Section 6.1.
2. In dynamic scheduling, instructions can be executed out of program order
without hampering the result. Refer Section 6.2.
3. A dynamic scheduling is the hardware based scheduling. In this approach,
the hardware rearranges the instruction execution to reduce the stalls.
Refer Section 6.3.
4. The objective of a scoreboard is to maintain an execution rate of one
instruction per clock cycle (when there are no structural hazards) by
executing an instruction as early as possible. Refer Section 6.3.
5. Robert Tomasulo proposed this technique and is therefore named after
him. In this methodology the important elements of the scoreboarding
scheme are merged with the prologue of register renaming. Refer Section
6.4.

References:
• John L. Hennessy and David A. Patterson, Computer Architecture: A
Quantitative Approach, Fourth Edition, Morgan Kaufmann Publishers.
• David Salomon, Computer Organisation, 2008, NCC Blackwell.
• Joseph D. Dumas II; Computer Architecture; CRC Press.
• Nicholas P. Carter; Schaum's Outline of Computer Architecture; McGraw-Hill
Professional.

E-references:
• http://cnx.org/content/m29416/latest/
• http://www.ece.unm.edu/
• www.nvidia.com
• www.jilp.org/
• www-ee.eng.hawaii.edu/
Unit 7 Exploiting Instruction-Level Parallelism with Software Approach
Structure:
7.1 Introduction
Objectives
7.2 Types of Branches
Unconditional branch Conditional branch
7.3 Branch Handling
7.4 Delayed Branching
7.5 Branch Processing
7.6 Branch Prediction
Fixed branch prediction Static branch prediction Dynamic branch
prediction
7.7 The Intel IA-64 Architecture and Itanium Processor
7.8 ILP in the Embedded and Mobile Markets
7.9 Summary
7.10 Glossary
7.11 Terminal Questions
7.12 Answers

7.1 Introduction
In the previous unit, you studied Instruction-level parallelism and its dynamic
exploitation. You learnt how to overcome data hazards with dynamic
scheduling besides performance instruction delivery and hardware based
speculation.
As mentioned in the previous unit, it is an inherent property of a sequence of
instructions that some of them can be executed in parallel; this is known as
instruction-level parallelism (ILP).
how much parallelism can be achieved. We can approach this upper bound
via a series of transformations that either expose or allow more ILP to be
exposed to later transformations. The best way to exploit ILP is to have a

collection of transformations that operate on or across program blocks, either
producing “faster code” or exposing more ILP. In this unit, you will study the
software approach of exploiting Instruction-level parallelism. You will also learn
about various concepts like types of branches, branch handling, delayed
branching, branch processing, and static branch prediction. Beside these, we
will also discuss the Intel IA-64 architecture and Itanium processor. We will
conclude this unit by discussing ILP in the embedded and mobile markets.

Objectives:
After studying this unit, you should be able to:
• identify the various types of branches
• explain the concept of branch handling
• describe the role of delayed branching
• recognise branch processing
• discuss the process of branch prediction
• explain Intel IA-64 architecture and Itanium processor
• discuss the use of ILP in the embedded and mobile markets

7.2 Types of Branches


Branching is implemented by using a branch instruction, which contains the
address of the target instruction. In some processors (for example, the
Pentium), this instruction is also called a jump instruction.
The different types of branches are:
• unconditional
• conditional
In both unconditional and conditional branches, the method of transferring
control is similar. This is shown in figure 7.1 below:


Figure 7.1: Control Flow in Branching


7.2.1 Unconditional Branch
This type of branch is considered as the simplest one. It is used to transfer
control to a particular target. Let us discuss an example as follows:
branch target
Target address specification can be performed in any of the following ways:
• absolute address
or
• PC-relative address
With an absolute address, the actual address of the target instruction is
specified. PC-relative addressing specifies the address of the target
instruction relative to the contents of the PC. Many processors support only
absolute addresses for unconditional branches; others support both formats.
For instance, MIPS processors support an absolute-address branch by means of
j target
and a PC-relative unconditional branch by means of
b target
The latter is actually an assembly language instruction, even though the
processor itself supports only the j instruction. The PowerPC permits every
branch instruction to use either an absolute or a PC-relative address. The
instruction encoding contains a bit, called the AA (absolute address) bit,
which specifies the address type: if AA is 1 the address is absolute,
otherwise it is PC-relative.

When an absolute address is used, the processor transfers control simply by
loading the target address into the PC register. When PC-relative addressing
is used, the specified offset is added to the contents of the PC and the
result is placed in the PC. In both cases, since the PC holds the address of
the next instruction, the processor fetches the instruction at the intended
target address. The major benefit of PC-relative addressing is that the code
can be moved from one block of memory to another without changing the target
addresses. Such code is called relocatable code, which is impossible with
absolute addresses.
7.2.2 Conditional Branch
Here the jump is taken only if a particular condition is satisfied. For
instance, a branch may be needed when two values are equal. Such conditional
branches can be handled in either of the following fundamental ways:
• Set-then-Jump: This design separates condition testing from branching. A
condition code register is used for communication between the test
instruction and the branch instruction. The Pentium follows this design: it
uses the flags register to record the outcome of the test condition. The cmp
(compare) instruction is used to test the condition; it sets various flag
bits, which indicate the relationship between the two compared values. For
instance, consider the zero bit: it is set when the two values are equal. If
the zero bit is set, the conditional jump instruction can then be used to
jump to the target location. This sequence is illustrated by the following
code segment, in which the values in register AX and register BX are
compared:

        cmp AX,BX       ;compare the two values in AX and BX
        je  target      ;if equal, transfer control to target
        sub AX,BX       ;if not, this instruction is executed
        ...
target: add AX,BX       ;control is transferred here if AX = BX
Here je is the jump-if-equal instruction, which transfers control to target
when the two values in register AX and register BX are equal.

• Test-and-Jump: Many processors merge the testing and the branching into a
single instruction. The MIPS processor is used here to demonstrate this
approach. MIPS offers several branch instructions that test and branch in one
step. Below is the branch-on-equal instruction:
beq Rsrc1,Rsrc2,target

The conditional branch instruction given above performs the testing of the
contents available in two registers, that is, Rsrc1 as well as Rsrc2 for
equality. The control is transferred to the target if their values appear to be
equal. Let us suppose that the numbers that are to be compared are
placed in register t0 and register t1. For this, the branch instruction is
written as below:

beq $t1,$t0,target
The instruction given above substitutes the two-instruction cmp/je
sequence which is utilised by Pentium.
Registers are maintained by some of the processors. This is done for recording
the condition of arithmetic as well as logical operations. We call these registers
as condition code registers.
The status of the last arithmetic or logical operation is recorded by these
registers. For instance, if two 32-bit integers are added, i then the sum might
need more than 32 bits. It is an overflow condition which should be recorded
by the system. Usually, this overflow condition is indicated by setting a bit in
condition code register. For example, the MIPS, does not make use of
condition registers. Rather, it to flag the overflow condition exceptions is used.
Alternatively, th processors such as the Pentium, SPARC, and Power PC

make use of condition registers. In the Pentium, this information is recorded
in the flags register; in the PowerPC, the XER register keeps this record;
SPARC uses a condition code register. Many instruction sets also provide
branches based on comparisons with zero; SPARC and MIPS are examples of
processors that offer this kind of branch instruction.
Highly pipelined RISC processors support what is called delayed branch
execution. Refer to figure 7.1 to observe the difference between delayed and
normal branch execution. In normal branch execution, the branch instruction
transfers control to the target immediately; the Pentium, for instance, uses
this kind of branching. In delayed branch execution, control is transferred
to the target only after the instruction that follows the branch instruction
has been executed. In figure 7.1, for instance, instruction y is executed
before control is transferred. This instruction slot is called the delay
slot. SPARC, for example, uses delayed branch execution; in fact, it uses
delayed execution for procedure calls as well. This approach helps because,
by the time the processor is decoding the branch instruction, the next
instruction has already been fetched; efficiency is therefore improved by
executing it rather than throwing it away. The approach requires several
instructions to be rearranged.
Self Assessment Questions
1. Branch instruction like Pentium is also known as ___________ .
2. It is possible to have Re-locatable code in case of absolute addresses.
(True/False)

Activity 1:
Work on an MIPS processor to find out the difference between conditional
and unconditional branching.

7.3 Branch Handling


A branch is a flow-altering instruction that must be handled in a special
manner in pipelined processors. The branch instruction's impact on the
pipeline is shown in figure 7.2(a) below:


Figure 7.2: Branch Instruction’s impact on Pipeline


As shown in figure 7.2, instruction Ib is a branch instruction. When the
branch is taken, control is transferred to instruction It; when the branch is
not taken, the instructions already in the pipeline remain useful.
When the branch is taken, every instruction in the pipeline, at its various
stages, has to be discarded. In the example above, instructions I2, I3 and I4
must be removed, and instruction fetch restarts at the target address.
Because of this, the pipeline does no useful work for three clock cycles;
this loss is called the branch penalty.
Now let us discuss how this branch penalty can be reduced. In figure 7.2 we
wait until the IE (execution) stage before starting the instruction fetch at
the target address. The delay can be reduced if the branch outcome is
determined earlier. As shown in figure 7.2(b), for instance, if we can find
out whether the branch is taken, together with the target address, during the

ID (decode) stage, we incur a penalty of only one cycle. In the example above,
only one instruction (I2) has to be removed. But can the required information
be obtained at the decode stage? For many branch instructions the target
address is specified as part of the instruction, so calculating the target
address is comparatively simple. Determining whether the branch is taken
during the decode stage, however, may not be easy: for instance, it may be
necessary to fetch operands and compare their values in order to find out
whether the branch is taken, which means waiting until the execution stage.
Self Assessment Questions
3. ______ is a flow altering instruction that is required to be handled in
a special manner in pipelined processors.
4. Wasteful work done by pipeline for a considerable time is called the

7.4 Delayed Branching


Figure 7.2(b) shows that the branch penalty can be reduced to one cycle. The
branch penalty can be reduced further by means of delayed branch execution.
The scheme is based on the observation that the instruction following the
branch is always fetched before it is known whether the branch is taken. So
why not execute this instruction rather than throw it away? This means that a
useful instruction should be placed in this instruction slot, which we call a
delay slot. In other words, branching is delayed until after the instruction
in the delay slot has been executed. A number of processors, for example MIPS
and SPARC, use delayed execution for procedure calls as well as branching.
When this method is applied, the program must be modified so that a useful
instruction is placed in the delay slot. For example, consider the code
segment given below:
        add     R2, R3, R4
        branch  target
        sub     R5, R6, R7
        ...
target: mult    R8, R9, R10
When delayed branching is used, the instructions can be rearranged so that the
branch instruction is moved up by one position and the add instruction fills
the delay slot. This is shown below:

        branch  target
        add     R2, R3, R4      /* branch delay slot */
        sub     R5, R6, R7
        ...
target: mult    R8, R9, R10
Programmers do not need to worry about moving instructions into delay slots;
this task is accomplished by compilers and assemblers. If no useful
instruction can be moved into the delay slot, a NOP (no operation) is placed
there. Note that if the branch is not taken, we may not want the delay-slot
instruction to take effect; that is, we would like to nullify the instruction
in the delay slot. A number of processors, such as SPARC, offer this
nullification option.
Self Assessment Questions
5. A number of processors such as __________ and _________ make
use of delayed execution for procedure calls as well as branching.
6. If any valuable instruction cannot be moved into the delay slot, ______ is placed.

7.5 Branch Processing


Branch Processing helps in instruction execution. It receives branch
instructions and resolves the conditional branches as early as possible. For
resolving it uses static and dynamic branch prediction. Effective processing of
branches has become a cornerstone of increased performance in ILP-
processors. No wonder, therefore, that in the pursuit of more performance,
predominantly in the past few years, computer architects have developed a
confusing variety of branch processing schemes.
After the recent announcements of a significant number of new processors,
we are in a position to discern trends and to emphasise promising solutions.
Branch processing has two aspects, its layout and its micro-architectural
implementation, as shown in figure 7.3.


detection conditional branches branch target path


Figure 7.3: Design Space of Branch Processing

As far as its layout is concerned, branch processing involves three major


subtasks: detecting branches, handling of unresolved conditional branches
during instruction decoding and accessing the branch target path.
The earlier a processor detects branches, the earlier branch processing can be
started and the fewer penalties there are. Therefore, novel schemes try to
detect branches as early as possible.
The next aspect of the layout is the handling of unresolved conditional
branches. We note that we designate a conditional branch unresolved if the
specified condition is not yet available at the time when it is evaluated during
branch processing. The last aspect of the layout of branch processing is how
the branch target path is accessed.
Self Assessment Questions
7. Branch processing has two aspects _______ and _______ .
8. Name the major sub tasks of branch processing.

7.6 Branch Prediction


Branch prediction is a method which is basically utilised for handling the
problems related to branch. Different strategies of branch prediction include:
• Fixed branch prediction
• Static branch prediction
• Dynamic branch prediction

These approaches are discussed as below:


7.6.1 Fixed branch prediction
In fixed branch prediction, prediction is considered to be fixed. This approach
of branch prediction is easy to implement. This approach presumes either of
the following: • branch is never taken
or
• branch is always taken

The examples of branch-never-taken strategy include VAX 11/780 and


Motorola 68020.
The benefit of the never-taken approach is that the processor keeps fetching
instructions sequentially, so the pipeline stays filled, and if the prediction
turns out to be wrong the penalty is small. The always-taken strategy, on the
other hand, makes the processor prefetch the instruction at the branch target
address. In a paged environment this may cause a page fault, and a special
mechanism is required to handle this situation.
Now in case of loop structure, the never-taken strategy is not appropriate. If a
loop is repeated 200 times, then the branch is taken 199 times out of 200 times.
The always-taken strategy is a better one in case of loops. Likewise, we prefer
the always-taken strategy for procedure calls as well as returns.
7.6.2 Static branch prediction
We have seen so far that, instead of using a fixed approach, performance can
be improved by using a strategy that depends on the type of branch. This type
of approach is known as static branch prediction. It uses the instruction
opcode to predict whether the branch is taken, and it gives good prediction
accuracy. To illustrate this, consider sample data for typical environments,
in which the distribution of branch-type operations is roughly as follows:
• branches are about 70%
• loops are about 10%
• the remaining operations are procedure calls/returns
About 40% of the branches are unconditional. Using a never-taken strategy for
conditional branches and an always-taken strategy for the remaining
branch-type operations yields about 82% prediction accuracy, as shown in
table 7.1 below.
Table 7.1: Static Branch Prediction Accuracy

Instruction type        Instruction distribution (%)   Prediction: branch taken?   Correct prediction (%)
Unconditional branch    70 x 0.4 = 28                  Yes                          28
Conditional branch      70 x 0.6 = 42                  No                           42 x 0.6 = 25.2
Loop                    10                             Yes                          10 x 0.9 = 9
Call/return             20                             Yes                          20

Overall prediction accuracy = 82.2%

The data in the table above assume that conditional branches are not taken
approximately 60% of the time, so the not-taken prediction for conditional
branches is correct only 60% of the time. This gives:
42 x 0.6 = 25.2%
as the contribution of conditional branches to the overall prediction
accuracy. Likewise, loops jump back with 90% probability; since loops occur
about 10% of the time, their contribution is 9%. Perhaps surprisingly, even
this simple static prediction approach provides an accuracy of about 82%.
7.6.3 Dynamic branch prediction
For more accurate predictions, this approach takes run-time history into
account: the outcomes of the last n executions of the branch are used to
predict the next one.
The empirical study by Smith and Lee suggests that this approach gives a major
improvement in prediction accuracy. Table 7.2 summarises what they found.
Table 7.2: Effect of Utilising the Information of Past Branches on Prediction
Accuracy

The algorithm applied is simple: the next prediction is the majority outcome
of the last n executions of the branch. For instance, suppose n = 3; if the
branch was taken in two or more of the last three executions, the prediction
is that the branch will be taken.
In table 7.2, the data suggest that if we consider the last two branch
executions of
the past, then about 90% prediction accuracy is obtained for most of the
mixes; beyond that, only minor improvement is gained. From the implementation
viewpoint, only two bits are required to hold the history of the past two
branch executions.
The process is simple: keep the existing prediction unless the past two
predictions were both wrong. In particular, the prediction is not changed just
because the last prediction was wrong. This scheme can be expressed as a
four-state finite state machine, shown in figure 7.4.

Figure 7.4: State Diagram for Branch Prediction

In the figure above, the left bit signifies the prediction and the right bit
signifies the actual outcome of the branch (whether it was taken or not). If
the left bit is 0, the prediction is "not taken"; otherwise the prediction is
"branch taken". The right bit records the actual outcome of the branch
instruction: a 0 signifies "branch not taken" (the branch instruction did not
jump), while a 1 signifies "branch taken". For instance, state 00 means that
the branch was predicted not taken (left bit 0) and was indeed not taken
(right bit 0). We therefore stay in state 00 as long as the branch is not
taken. If the prediction turns out to be wrong, we move to state 01; "branch
not taken" is still predicted, since we have been wrong only once. If the next
prediction is correct, we move back to state 00; if it is wrong again, the
prediction is changed to "branch taken" and we move to state 10. Thus two
wrong predictions in a row make us change the prediction.
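
The scheme can be coded as a small state machine. The sketch below expresses
the four states as a 2-bit saturating counter; treating the transitions this
way is an assumption made here for illustration (the figure itself is not
reproduced), and the state names are likewise illustrative.

enum pred_state { STRONG_NOT_TAKEN, WEAK_NOT_TAKEN, WEAK_TAKEN, STRONG_TAKEN };

/* Returns 1 when the branch is predicted taken. */
int predict(enum pred_state s)
{
    return s == WEAK_TAKEN || s == STRONG_TAKEN;
}

/* Update the state with the actual outcome of the branch. */
enum pred_state update(enum pred_state s, int taken)
{
    if (taken)
        return (s == STRONG_TAKEN) ? STRONG_TAKEN : (enum pred_state)(s + 1);
    else
        return (s == STRONG_NOT_TAKEN) ? STRONG_NOT_TAKEN : (enum pred_state)(s - 1);
}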

Self Assessment Questions
9. In case of Fixed Branch Prediction, the branch is presumed to be either
or _____ .
10. Static strategy makes use of _______________ for predicting whether
the branch is taken.
Activity 2:
Find out examples of processors which use the above mentioned three types
of branch predictions.

7.7 The Intel IA-64 Architecture and Itanium Processor


Due to the complex structure of superscalar and related architecture
technology, a need for the development of new technology was felt. The two
main features of superscalar technology: linear growth of functional unit area
with respect to number of units and square growth of scheduler area with
respect to number of units contributed to the quest for new technology.
As a result of these, the cost performance reaches the level of diminishing
returns. Moreover, traditional architecture exhibited limited parallelism. Thus,
to overcome these factors, the Intel IA-64 architecture and Itanium processor
were developed. Let’s study them in detail.
The Intel IA-64
Intel is rapidly reaching the point where it has extracted almost everything
possible from the IA-32 ISA (Intel Architecture, 32-bit, Instruction Set
Architecture) and the Pentium II line of processors. New models can still
benefit from improvements in manufacturing technology, but finding new ways to
speed up the implementation even further is becoming difficult, because the
constraints imposed by the IA-32 ISA loom ever larger. The real solution is to
set IA-32 aside as the main line of development and introduce a new ISA, and
this is in fact what Intel proposes to do. The new architecture, developed
jointly by Hewlett-Packard and Intel, is known as IA-64. It is a full 64-bit
machine from beginning to end, and an entire series of processors
implementing this architecture is expected in the coming years.
The Itanium processor comprises a group of 64-bit Intel microprocessors
which provides execution to the Intel Itanium architecture. This architecture
was initially known as IA-64. The processors are sold by Intel for enterprise
servers and high-performance computing. The beginning of this architecture

took place at Hewlett-Packard (HP). Afterwards, it was modernized by the joint
efforts of both HP and Intel. The compiler of Itanium architecture, based on
explicit instruction-level parallelism, chooses the instructions to be executed in
parallel. This is quite different from superscalar architectures that rely upon
CPU to administer instruction dependencies at the time of execution.
Itanium cores up to and including Tukwila can execute about six instructions
per clock cycle. The first Itanium processor appeared in 2001 and was named
Merced (a dual-mode processor capable of executing programs for both IA-32 and
IA-64). At present, HP is not the sole manufacturer of Itanium-based systems;
various other manufacturers have also entered this field.
Itanium was regarded as the 4th-most used microprocessor architecture for
enterprise class systems and comes just after x86-64, IBM POWER, and
SPARC. Initially planned for release in 2007, Tukwila is the most recent
processor of this category and was released on February 8, 2010. The
beginning point for IA-64 architecture was a high-end 64-bit RISC processor,
for example, UltraSPARC II (Scalable performance Architecture).
IA-64 architecture is considered as a load/store architecture having 64-bit
addresses as well as 64-bit broad registers. We have 64 general registers
which are available to IA-64 programs. Also there are some more registers
which are available to IA-32 programs).
Every instruction has the same fixed format: an opcode, two 6-bit source
register fields, a 6-bit destination register field, and another 6-bit field.
Most instructions take two register operands, perform some computation on
them, and put the result back in the destination register. Several functional
units are available so that different operations can be performed in parallel.
Up to this point the architecture resembles most RISC machines; what is
unusual is the idea of a bundle of associated instructions. Instructions come
in groups of three, called a bundle, as shown in figure 7.5 below.

Figure 7.5: IA-64: Bundles of 3 Instructions


Each 128-bit bundle holds three 40-bit fixed-format instructions and an 8-bit
template. Bundles can be chained together by means of an end-of-bundle bit, so
that more than three instructions can be grouped. The template contains
information about which instructions can be executed in parallel. This scheme,
together with the large number of registers, allows the compiler to isolate
blocks of instructions and tell the processor that they can be executed in
parallel. It is therefore the compiler, rather than the hardware, that has to
reorder instructions, check for dependences, make sure functional units are
available, and so on.
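
As an illustration only, the C sketch below packs three instruction slots and
a template into a 128-bit bundle using the 40-bit-slot, 8-bit-template layout
described in the text. The layout ordering is an assumption made here, and the
shipped Itanium encoding uses slightly different field widths, so this is not
the real IA-64 format.

#include <stdint.h>

struct bundle {
    uint64_t lo, hi;                        /* 128 bits as two 64-bit words */
};

struct bundle pack_bundle(uint8_t template_bits,
                          uint64_t slot0, uint64_t slot1, uint64_t slot2)
{
    struct bundle b;
    slot0 &= (1ULL << 40) - 1;              /* keep 40 bits per slot */
    slot1 &= (1ULL << 40) - 1;
    slot2 &= (1ULL << 40) - 1;

    /* layout, low bits first: [template:8][slot0:40][slot1:40][slot2:40] */
    b.lo = (uint64_t)template_bits | (slot0 << 8) | (slot1 << 48);
    b.hi = (slot1 >> 16) | (slot2 << 24);
    return b;
}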
By exposing the internal workings of the machine and requiring compiler
writers to make sure that every bundle contains compatible instructions, the
task of scheduling the RISC instructions is moved from the hardware (at run
time) to the compiler (at compile time). This is why the model is known as
Explicitly Parallel Instruction Computing (EPIC). Scheduling instructions at
compile time has several benefits, discussed below:
• Since all the work is now performed by the compiler, the hardware can be
much simpler, saving numerous transistors for other valuable functions such
as larger level-1 caches.
• For any given program, scheduling has to be performed only once, at compile
time.
• Since the compiler does all the work, a software vendor can use a compiler
that spends a long time optimising its program; every user then benefits each
time the program is executed.
The idea of instruction bundles makes it possible to create an entire family
of CPUs. On low-end processors, one bundle may be issued per clock cycle.
The CPU then waits until all the instructions in the bundle have completed
before issuing the next one. On high-end processors, it may be possible to
issue several bundles during the same clock cycle, similar to existing
superscalar designs.
Self Assessment Questions
11. IA-64 architecture is considered as a load/store architecture having 64-bit
_______________ as well as 64-bit broad __________ .
12. IA-64 model is also called ______________ .

7.8 ILP in the Embedded and Mobile Markets


The Crusoe chips and the Trimedia represent interesting strategies for
applying the concepts of the very long instruction word (VLIW) in the embedded
space. The Trimedia processor may be the closest existing processor to a
"classic" VLIW processor. It supports a method for compressing instructions
while they are in main memory and in the instruction cache, and for
decompressing them during instruction fetch; this strategy addresses the
code-size drawback of VLIW processors. The Crusoe processor, in contrast, uses
software translation from the x86 architecture to a VLIW processor, and
thereby achieves lower power consumption than typical x86 processors.
Now we will focus on Trimedia TM32 architecture in detail.
The Trimedia TM32 Architecture
Media processor is the name given to a class of embedded processors dedicated
to multimedia processing. Like other embedded processors they are usually
cost sensitive, but they follow the compiler orientation of desktop and server
computing. Like DSPs, they operate on narrower data types than the desktop and
must frequently handle endless, continuous streams of data. Figure 7.6 lists
media application areas together with benchmark algorithms for media
processors.


Application area        Benchmarks
Data communication      Viterbi decoding
Audio coding            AC3 decode
Video coding            MPEG2 encode, DVD decode
Video processing        Layered natural motion, dynamic noise reduction, peaking
Graphics                3D renderer library

Figure 7.6: Media Processor Application Areas and Example Benchmarks

The TM32 CPU is an example of this class. Since multimedia applications
contain significant parallelism in handling these data streams, the
instruction set architectures often look quite different from those of the
desktop. The TM32 is intended for products such as advanced televisions and
set-top boxes. It has a large register set: 128 32-bit registers, which can
hold either integer or floating-point data. Partitioned-ALU, or SIMD,
instructions are provided to permit computations on several data elements at
once. Figure 7.7 shows the operations found in the Trimedia TM32 CPU.

Operation category          Examples                                              Number   Comments
Load/store                  ld8, ld16, ld32, imm, st8, st16, st32                 33       signed, unsigned; register indirect,
                                                                                           indexed, scaled addressing
Byte shuffles               shift right 1, 2, 3 bytes; select byte; merge; pack   11       SIMD type conversion
Bit shifts                  asl, asr, lsl, lsr, rol                               10       shifts, SIMD, round, saturate,
                                                                                           2's complement
Multiplies and multimedia   mul, sum of products, sum-of-SIMD-elements            23       multimedia, e.g. sum of products (FIR), SIMD
Integer arithmetic          add, sub, min, max, abs, average, bitand, bitor,      62       signed, unsigned, immediate, saturate,
                            bitxor, bitinv, bitandinv, eql, neq, gtr, geq,                 2's complement, sum of absolute
                            les, leq, sign extend, zero extend                             differences, SIMD
Floating point              add, sub, neg, mul, div, sqrt, eql, neq, gtr,         42       scalar, IEEE flags
                            geq, les, leq
Special ops                 alloc, prefetch, copy back, read tag, read cache      20       cache, special registers
                            status, read counter
Branch                      jmpt, jmpf                                            6        (un)interruptible
Total                                                                             207

Figure 7.7: Operations found in the Trimedia TM32 CPU
One unusual characteristic, from the desktop point of view, is that the
programmer is allowed to specify five independent operations to be issued
simultaneously. If five independent instructions are not available (that is,
the remaining ones are dependent), then NOPs (no operations) are
placed in the remaining slots. This method of instruction encoding is called a
VLIW (Very Long Instruction Word) method.
Because the Trimedia TM32 CPU has long instruction words that frequently
include NOPs, its instructions are compressed in memory and decompressed to
their full size when they are loaded into the cache. Figure 7.8 shows the
TM32 CPU instruction mix for the EEMBC benchmarks.

Figure 7.8: TM32 CPU Instruction Mix for EEMBC Customer Benchmark
With unmodified, "out-of-the-box" C code, the instruction mix is similar to
that of general-purpose computers, although byte data transfers are more
prominent. Large numbers of pack and merge instructions are observed, used to
align the data for the SIMD instructions. The SIMD instructions, along with
pack and merge, are exploited mainly by the hand-

optimised C code, where merge instructions are again used to align the data.
The middle column of the figure shows the relative instruction mix for the
unmodified kernels, while the right column reflects code modified at the C
level. These columns list all operations that were responsible for at least 1%
of the total in any of the mixes.
Self Assessment Questions
13. Trimedia processor may be the closest existing processor to a

14. State two uses of Trimedia TM32 CPU.

7.9 Summary
• Implementation of branching is done by using a branch instruction. The
address of target instruction is included in the branch instruction
• The branch penalty can be reduced to one cycle. It can be efficiently
reduced further by means of Delayed branch execution.
• Effective processing of branches has become a cornerstone of increased
performance in ILP-processors.
• Branch prediction is a method which is basically utilised for handling the
problems related to branch. Different strategies of branch prediction
include:
❖ Fixed branch prediction
❖ Static branch prediction
❖ Dynamic branch prediction
• The new architecture, developed jointly by Hewlett-Packard and Intel, is
known as IA-64.
• IA-64 model is also known as Explicitly Parallel Instruction Computing
(EPIC).
• Itanium comprises a group of 64-bit Intel microprocessors which provides
execution to the Intel Itanium architecture. This architecture was initially
known as IA-64.
• Interesting strategies are represented by the Crusoe chips and Trimedia
for applying the concepts of Very long instruction word (VLIW) in an
embedded space. Trimedia processor may be the closest existing
processor to a "classic" processor of VLIW.

7.10 Glossary

• Branch penalty: Wasteful work done by pipelines for a considerable time.
• Condition code registers: A condition code register is used for attaining
communication among the instructions for condition as well as branching.
• EPIC: Explicitly Parallel Instruction Computing.
• ILP: Instruction level parallelism.
• Merced: A dual mode processor, which is capable of executing the
programs of both IA-32 as well as IA-64.
• VLIW: Very Long Instruction Word.

7.11 Terminal Questions


1. Differentiate between unconditional and conditional branch.
2. Explain the concept of branch handling.
3. What do you understand by delayed branching?
4. Define branch processing.
5. What do you mean by branch prediction?
6. Write short notes on:
a) Fixed Branch Prediction
b) Intel IA-64 architecture
c) Static Branch Prediction
d) Itanium processor
e) Dynamic Branch Prediction
7. Explain the concept of Trimedia TM32 Architecture.
7.12 Answers
Self Assessment Questions
1. Jump instruction
2. False
3. Branch
4. Branch penalty
5. SPARC, MIPS
6. No operation (NOP)
7. Layout, micro-architectural implementation
8. a) Detecting branches
b) Handling of unresolved conditional branches during instruction
decoding.
c) Accessing the branch target path
9. Never taken, always taken
10. Instruction opcode
11. Addresses, registers
12. Explicitly Parallel Instruction Computing (EPIC)
13. "Classic" VLIW processor.
14. Set top boxes and advanced televisions.
Terminal Questions
1. This type of branch is considered as the simplest one. It is used to transfer
control to a particular target. In conditional branches, if a particular
condition meets its requirements, then only the jump is conducted. Refer
Section 7.2.
2. Branch Handling is executed when the flow of control is altered. For
example branch requires special handling in pipelined processors. Refer
Section 7.3.
3. Delayed branching is the reduction of branch penalty to one cycle. Refer
Section 7.4.
4. Branch processing receives branch instructions and resolves the
conditional branches as early as possible. Refer Section 7.5.
5. Branch prediction predicts the outcome of branch. Refer Section 7.6.
6. a) In Fixed Branch Prediction, prediction is fixed. Refer Section 7.6.1.
b) The new architecture, generated mutually by means of Hewlett
Packard as well as Intel , is known as IA-64. Refer section 7.7.1.
c) This approach makes use of instruction opcode for predicting
whether the branch is taken. Refer section 7.6.2.
d) Itanium comprises a group of 64-bit Intel microprocessors which
provides execution to the Intel Itanium architecture. This architecture
was initially known as IA-64. Refer Section 7.7.2.
e) For making more accurate predictions, this approach considers run-
time history. Here the n branch executions of history are considered
and this information is used for predicting the next one. Refer section
7.6.3.
7. TM32 CPU is considered as an example of multimedia applications. The
multimedia applications comprise significant parallelism in managing the
data streams Refer Section 7.8.

References:
• Hwang, K. Advanced Computer Architecture. McGraw-Hill.
• Godse, D. A. & Godse, A. P. Computer Organization. Technical
Publications.
• Hennessy, John L., Patterson, David A. & Goldberg David. Computer

Architecture: A Quantitative Approach, Morgan Kaufmann.
• Sima, Dezso, Fountain, Terry J. & Kacsuk, Peter, Advanced computer
architectures - a design space approach. Addison-Wesley-Longman.

E-references:
• http://www.scribd.com/doc/46312470/37/Branch-processing
• http://www.scribd.com/doc/60519412/15/Another-View-The-Trimedia-TM32-CPU-151

Unit 8 Memory Hierarchy Technology

Structure:
8.1 Introduction
Objectives
8.2 Memory Hierarchy
Cache memory organisation
Basic operation of cache memory
Performance of cache memory
8.3 Cache Addressing Modes
Physical address mode
Virtual address mode
8.4 Mapping
Direct mapping
Associative mapping
8.5 Elements of Cache Design
8.6 Cache Performance
Improving cache performance
Techniques to reduce cache miss
Techniques to decrease cache miss penalty
Techniques to decrease cache hit time
8.7 Shared Memory organisation
8.8 Interleaved Memory Organisation
8.9 Bandwidth and Fault Tolerance
8.10 Consistency Models
Strong consistency models
Weak consistency models
8.11 Summary
8.12 Glossary

8.13 Terminal Questions
8.14 Answers

8.1 Introduction
The memory system is an important part of a computer system. The input data,
the instructions necessary to manipulate the input data, and the output data
are all stored in the memory.

The memory unit is an essential part of any digital computer because the
computer can process data only if it is stored somewhere in its memory. For
example, if the computer has to compute f(x) = sin x for a given value of x,
then x is first stored somewhere in memory, and a routine containing the
program that calculates the sine of the given x is then called. Memory is thus
an indispensable component of a computer. We will cover all this in this unit.
In the previous unit, we explored the software approach of exploiting
instruction-level parallelism, in which you studied types of branches, branch
handling, delayed branching, branch processing, and static branch prediction.
You also studied the Intel IA-64 architecture, the Itanium processor, and ILP
in the embedded and mobile markets.
In this unit, we will study memory hierarchy technology. We will cover cache
memory organisation, cache addressing modes, direct mapping and
associative caches. We will also discuss the elements of cache design,
techniques to reduce cache misses via parallelism, techniques to reduce
cache penalties, and techniques to reduce cache hit time. Also, we will study
the shared memory organisation and interleaved memory organisation.
Objectives:
After studying this unit, you should be able to:
• explain the concept of cache memory organisation
• label different cache addressing modes
• explain the concept of mapping
• identify the elements of cache design
• analyse the concept of cache performance
• describe various techniques to reduce cache misses
• explain the concept of shared and interleaved memory organisation
• discuss bandwidth and fault tolerance
• discuss strong and weak consistency models

8.2 Memory Hierarchy
Computer memory is used for storing and retrieving data and instructions.
The memory system includes the management and control of the storage
devices along with the information and algorithms contained in them. Just as
computers aim to increase the speed of computing, the main aim of the memory
system is to give the CPU fast and continuous access to memory. Small
computers do not require much additional storage because their limited
applications can be easily satisfied.

General-purpose computers, however, perform much better with additional
storage capacity beyond that of main memory. Main memory deals directly
with the processor. Auxiliary memory provides large-capacity backup storage;
it is not directly accessible by the CPU but is connected to main memory. The
early forms of auxiliary memory were punched paper tape, punched cards and
magnetic drums. Since the 1980s the devices employed as auxiliary memory
are tapes and optical and magnetic disks. Cache memory is an extremely
high-speed memory used to boost the speed of computation by supplying the
required instructions and data to the processor quickly. Cache memory is
introduced into the system to overcome the difference in speed between main
memory and the CPU. Cache memory stores the program segments currently
being executed in the processor as well as the temporary data required in
current computations. Computer performance increases because the cache
supplies these segments and data at very high speed. Just as the input/output
processor is concerned with data transfer between main memory and auxiliary
memory, cache memory is concerned with information transfer between the
processor and main memory. The objective of the memory system is to obtain
maximum access speed while minimising the total cost of the memory
organisation.
Memories vary in their design, capacity and speed of operation, which is why
we have a hierarchical memory system. A typical computer can have all types
of memories. According to their nearness to the CPU, memories form a
hierarchical structure, as shown in figure 8.1.


Now, let us discuss cache memory and cache memory organisation.
8.2.1 Cache memory organisation
A cache memory is an intermediate memory between two memories that have
a large difference between their speeds of operation. Cache memory is located
between main memory and the CPU. It may also be inserted between the CPU
and RAM to hold the most frequently used data and instructions. Communicating
through a cache memory placed in between enhances the performance of a
system significantly. Locality of reference is the common observation that, over
a particular time interval, references to memory tend to be confined to a few
localised memory areas. It can be illustrated using a control structure such as a
'loop'. Cache memories exploit this behaviour to enhance overall performance.

Whenever a loop occurs in a program, the CPU executes it repeatedly. Hence,
for instruction fetching, loops and subroutines act as localities of reference to
memory. Data references also tend to be localised: a table look-up procedure,
for example, continually refers to the memory portion in which the table is
stored. These are the properties of locality of reference. The fundamental idea
of cache organisation is that by keeping the most frequently accessed
instructions and data in the fast cache memory, the average memory access
time approaches the access time of the cache.
8.2.2 Basic operation of cache memory
Whenever the CPU needs to access memory, the cache is examined first. If the
required word is found in the cache, it is read from this fast memory. If the word
is missing from the cache, main memory is accessed to read it, and a block of
words containing the one just accessed is then transferred from main memory
to cache memory.

8.2.3 Performance of cache memory
Cache memory performance is measured in terms of the hit ratio. If the processor
finds a referenced word in the cache, a "hit" is said to occur; if the processor
cannot find the word in the cache, a "miss" occurs. The hit ratio is the ratio of
hits to the total number of memory references (hits plus misses). A high hit ratio
confirms the validity of locality of reference, because it means the processor is
mostly accessing the fast cache memory rather than main memory.
An important feature of cache memory is its very fast response, so little time is
wasted in locating words in the cache. The process by which data is transferred
from main memory to cache memory is known as the mapping process.
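As an illustration of these ideas, the following minimal Python sketch (all
numbers are assumed purely for illustration, not taken from this unit) computes
a hit ratio and the resulting average access time for a cache/main-memory pair:

# Assumed, illustrative counts: 950 hits and 50 misses out of 1000 references
hits, misses = 950, 50
cache_time, memory_time = 2, 50                 # access times in ns (assumed)

hit_ratio = hits / (hits + misses)              # hits relative to all references
avg_time = hit_ratio * cache_time + (1 - hit_ratio) * memory_time
print(hit_ratio, avg_time)                      # 0.95 and 4.4 ns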
Self Assessment Questions
1. ____________ directly deals with the processor.
2. ____________ provides backup storage and is not directly accessible
by the CPU.
3. A __________ memory is an intermediate memory between two
memories having large difference between their speeds of operation.
4. If the processor detects a word in cache, while referring that word in main
memory is known to produce a ____________________ .

8.3 Cache Addressing Modes


The operation to be performed is specified by the operation field of the
instruction. The operation is executed on data stored in computer registers or
memory words. During program execution, the selection of operands depends
upon the addressing mode of the instruction; an addressing mode is a rule that
specifies how the address field in the instruction is translated or modified before
the operand is referenced. While accessing a cache, the CPU can address it in
two ways:
• Physical address mode
• Virtual address mode
Now let us go into the details of these addressing modes.
8.3.1 Physical address mode
In physical address mode, a physical memory address is used to access a
cache.
Implementation on unified cache: Generally, both instructions and data are
stored into the same cache. This design is called the Unified (or Integrated)
cache design. When it is implemented on unified cache, the cache is indexed
and tagged with the physical address. When the processor issues an address,
the address is translated by the Translation Lookaside Buffer (TLB) or the
Memory Management Unit (MMU) before any cache lookup, as illustrated in
figure 8.2. The TLB is a small cache in which a set of recently used address
translations is maintained.
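The order of operations described above can be pictured with a small, purely
illustrative Python sketch; the page size, TLB contents and cache structure are
assumptions made only for the example:

PAGE = 4096
tlb = {0x4: 0x9}                                # assumed: virtual page 4 maps to physical frame 9

def physically_addressed_access(vaddr, cache):
    vpn, offset = divmod(vaddr, PAGE)
    pfn = tlb[vpn]                              # translation in the TLB/MMU happens first
    paddr = pfn * PAGE + offset
    return cache.get(paddr, "miss")             # only then is the cache indexed with the physical address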


Cache Hit: When the addressed data or instruction is found in cache during
operation, it is called a cache hit.
Cache Miss: When the addressed data or instruction is not found in the cache
during operation, it is called a cache miss. On a cache miss, a complete cache
block is loaded from the corresponding memory locations in one operation.
Implementation on Split Cache: When physical address is used in split
cache, both data cache and instruction cache are accessed with a physical
address after translation from MMU. In this design, the first-level D-cache uses
write-through policy as it is a small one (64 KB) and the second-level D-cache
uses write-back policy as it is larger (256 KB) with slower speed. The I-cache
is a single level cache that has a smaller size (64 KB). The implementation of
physical address mode on split cache is illustrated in figure 8.3.

Figure 8.3: Implementation of Physical Address Mode on Split Cache Design


Advantages of physical address mode: The main advantage of the physical
mode of cache addressing is that the design is simple: it requires little
intervention from the operating system, and no aliasing problems arise in
accessing the physical addresses because the cache is indexed and tagged
with the same physical address.
Disadvantage of physical address mode: The main disadvantage of the
physical mode of cache addressing is that cache access is slower, because of
the time taken by the MMU/TLB to complete the address translation before the
lookup can begin.
8.3.2 Virtual address mode
In the virtual mode of cache addressing, the cache is indexed and tagged with
the virtual address; the virtual address is presented to the cache and the MMU
simultaneously, so address translation by the MMU proceeds in parallel with the
cache lookup. The process of virtual cache addressing in the case of a unified
cache is illustrated in figure 8.4.

In figure 8.4 you can see that the unified cache is accessed directly with the
virtual address; it is therefore known as a virtual address cache. You can also
see that MMU translation and cache lookup are performed simultaneously.
The cache lookup does not use the physical address produced by the MMU,
but that address can be saved for later use. The attraction of a virtual address
cache is the improved efficiency of cache access gained by overlapping it with
the MMU translation.
Advantages of virtual address mode: The virtual mode of cache addressing
offers the following advantages:
• It eliminates the address translation time on a cache hit, and hits are far
more common than misses.
• Cache lookup is not delayed.

• The cache access efficiency is faster than physical addressing mode.
• The MMU translation yields physical main memory address which is saved
for later use by the cache.
Self Assessment Questions
5. When both instructions and data are stored into the same cache, the
design is called the _____________ cache design.
6. TLB stands for ___________________ .

8.4 Mapping
Mapping refers to the translation of a main memory address to a cache
memory address. The transfer of information from main memory to cache
memory is conducted in units of cache blocks, placed on cache lines. Blocks in
the cache are called block frames and are denoted as
Bi for i = 1, 2, ..., j
where j is the total number of block frames in the cache.
The corresponding memory blocks are denoted as
Bj for j = 1, 2, ..., k
where k is the total number of blocks in memory. It is assumed that
k >> j, k = 2^s and j = 2^r
where s is the number of bits required to address a main memory block, and
r is the number of bits required to address a cache memory block.
There are four types of mapping schemes: direct mapping, associative
mapping, set-associative mapping, and sequential mapping. Here, we will
discuss the first two types of mapping.
8.4.1 Direct mapping
Associative memories are very costly compared to RAMs because of the
additional logic associated with each cell. Suppose there are 2^j words in main
memory and 2^k words in cache memory. The j-bit memory address is divided
into two fields: k bits are used for the index field and the remaining j−k bits form
the tag field. The direct-mapping cache organisation uses the k-bit index to
access the cache memory and the full j-bit address to access main memory.
Each cache word contains data and its associated tag. In direct mapping, every
memory block is assigned to one particular line of the cache; if that line already
holds a memory block when a new block is to be placed there, the old block is
removed. Figure 8.5 illustrates the mapping of multiple blocks to the same line
in the cache memory; blocks can be sent only to their assigned lines. In figure
8.5, the memory address has a block identification portion that contains 8 bits.

Figure 8.5: Direct Mapping

The tag bits are stored alongside the data bits when a new word enters the
cache. When the processor generates a memory request, the index field of the
main memory address is used to access the cache. The tag of the word in the
cache is compared with the tag field of the processor's address. If the
comparison matches, there is a hit and the word is taken from the cache. If it
does not match, there is a miss and the word is read from main memory; the
word is then stored in the cache with the new tag, replacing the previous value.
Demerit of direct mapping: If two or more words whose addresses have the
same index but different tags are accessed repeatedly, the hit ratio falls
substantially, because they keep evicting each other from the same cache line.
A brief sketch of how an address is split and looked up in a direct-mapped
cache is given below.
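The sketch below, written in Python only for illustration, shows how an address
could be split into tag, index and offset fields and looked up in a direct-mapped
cache; the field widths are assumptions, not values from figure 8.5:

INDEX_BITS, OFFSET_BITS = 8, 4                  # assumed: 256 lines of 16-byte blocks

def split_address(addr):
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

def direct_mapped_lookup(cache, addr):          # cache: list of (valid, tag, block) entries
    tag, index, _ = split_address(addr)
    valid, stored_tag, block = cache[index]
    return block if valid and stored_tag == tag else None   # hit only when the tags match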
8.4.2 Associative mapping
Associative mapping gives the quickest and most flexible cache organisation.
Both the address of a word and the content of the word are stored in an
associative memory, which means that any word of main memory can be held
anywhere in the cache. For example, in figure 8.6, the CPU address is first
placed in the argument register and then the associative memory is searched
for a match with that address.

CPU address → Argument register → Associative memory → Cache memory

Figure 8.6: Flow of Search

If the address is found somewhere in the associative memory, the
corresponding word is available from the cache immediately. If the cache
memory has no vacant space for storing new information, a vacancy is created
using a replacement policy.
In associative mapping technique, the entire cache array is implemented as a
single associative memory. The associative memory is also called content
addressable memory (CAM). When a memory address produced by the
processor is sent to the CAM, the CAM simultaneously compares that address
to all addresses currently stored in the cache.
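A rough software analogue of this behaviour, assuming a Python dictionary
stands in for the parallel comparison performed by a hardware CAM, is
sketched below:

cam = {}                                        # address -> word; any word may reside anywhere

def associative_read(addr):
    return cam.get(addr)                        # hardware compares all stored addresses in parallel

def associative_write(addr, word, capacity=256):        # capacity is an assumed figure
    if addr not in cam and len(cam) >= capacity:
        cam.pop(next(iter(cam)))                # no vacancy: evict some entry (placeholder replacement policy)
    cam[addr] = word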
Self Assessment Questions
7. The translation of main memory address to the cache memory address
is known as ________________ .
8. _______________ memories are expensive compared to RAMs.

Activity 1:
Visit an organisation and find out the cache memory size and the costs they
are incurring to hold it. Also try to retrieve the size of the data stored in the
cache memory.

8.5 Elements of Cache Design


The main elements of cache design are:
1. Cache Size: The cache should be small enough that the overall cost per
bit is close to that of main memory, yet large enough that the overall
average access time is close to the cache access time. Large caches tend
to be slower than smaller ones because more gates are involved in
addressing a large cache.
2. Mapping Function: The two types of mapping are direct mapping and
associative mapping (explained earlier in this unit).

3. Replacement Algorithm: LRU (Least Recently Used), FIFO (First-In,
First-Out), LFU (Least Frequently Used), or a random policy, which is
simple to build in hardware.
4. Write Policy: Write-through, write-back or write-once.
5. Line Size: The optimum size depends on the workload.
6. Number of Caches: Single or two levels, and unified or split.
Generally, both instructions and data are stored into the same cache. This
design is called the Unified (or Integrated) cache design. However at times
different caches are used to store and access instructions and data separately.
A cache used only to store instructions but not data is called Instruction
Cache (I-Cache) while a cache used only to store data is called Data Cache
(D-Cache).
The advantage of restricting a cache to store only instructions is that
instructions relatively do not change. Therefore, the contents of an instruction
cache need never be written again to main memory. However, the contents of
D-Cache undergo frequent changes and require to be written again to main
memory to keep the memory updated. A design where instructions and data
are stored in different caches for execution conveniences is called Split (or
Mixed) cache design.
Self Assessment Questions
9. A design where instructions and data are stored in different caches for
execution conveniences is called _____________ cache design.
10. I-Cache denotes ________________________.

8.6 Cache Performance


The instruction count is independent of the hardware and is therefore generally
used to evaluate processor performance. Similarly, a computer designer may
concentrate on the miss rate for evaluating memory-hierarchy performance, as
it too is independent of the speed of the hardware. However, the miss rate can
sometimes be a misleading performance measure, and the average memory
access time should be used instead:
Average memory access time = Hit time + Miss rate × Miss penalty
This equation also assists in deciding between a unified and a split cache.
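The following short Python sketch evaluates the formula for two assumed,
purely illustrative sets of parameters; it only shows how the equation is applied
and does not report measured figures:

def amat(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty          # average memory access time

print(amat(1, 0.05, 40))    # e.g. a unified cache: 1 + 0.05 * 40 = 3.0 cycles
print(amat(1, 0.03, 40))    # e.g. a split instruction cache: 1 + 0.03 * 40 = 2.2 cycles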
8.6.1 Improving cache performance
The gap between CPU and main memory speeds has been increasing over
the past few years, and this has drawn the attention of many computer
designers. The average memory access time formula helps to organise the
techniques used to improve caches. The techniques for improving cache
performance are:
• Reduce Miss Rate
• Reduce Cache Miss Penalty
• Reduce Cache Hit Time

Now let’s discuss these techniques in detail.


8.6.2 Techniques to reduce cache miss
For reducing the miss rate the following techniques are used:
• Hardware prefetching of instructions and data
• Victim caches
• Pseudo-associative Caches
Hardware prefetching of instructions and data
Hardware-based prefetching approaches can be classified into two categories:
spatial and temporal. In spatial schemes, access to the current block is the
basis for the prefetch decision, and prefetches are triggered when there is a
miss on a cache block. Temporal schemes rely on look-ahead decoding of the
instruction stream and attempt to have data in the cache "just in time" to be
used.
8.6.3 Techniques to reduce cache miss penalty
For reducing cache miss penalty the following techniques are used:
• Early restart and critical word first
• Giving Priority To Read Misses Over Writes
• Sub-block Placement
Early restart and critical word first
• Do not wait for the whole block to be loaded before restarting the processor:
❖ Early restart: As soon as the requested word of the block arrives, send it
to the processor so that execution can continue.
❖ Critical word first: Request the missed word from memory first and send
it to the processor as soon as it arrives; the processor continues
execution while the rest of the block is filled.
• These techniques are normally useful only for caches with large blocks.

8.6.4 Techniques to reduce cache hit time


• Avoiding address translation during cache indexing

• Small and simple caches
• Pipelining writes for fast write hits
Avoiding address translation
A cache that receives virtual addresses is called a virtually addressed cache,
or simply a virtual cache. It raises the following issues:
• On a process (context) switch, the cache must be flushed, or else false hits
may occur.
• Synonyms or aliases must be handled:
❖ Different virtual addresses may map to the same physical address.
❖ The physical address is still needed because the input/output system
has to interact with the cache.
• Solutions for synonyms or aliases:
❖ Hardware guarantee: the hardware ensures that every cache block
corresponds to a unique physical address.
❖ Software guarantee: the lower n bits of all aliases are kept identical; if
these bits cover the index field of a direct-mapped cache, aliases map to
the same block. This technique is called page colouring.
• Solution for the cache flush:
❖ Add a process identifier tag to each entry. The tag identifies the process,
so an address belonging to the wrong process cannot get a hit.
Self Assessment Questions
11. ____________ affects the clock cycle rate of the processor.
12. Average memory access time = Hit time + Miss rate x ____________ .

8.7 Shared Memory Organisation


The common issues in the architecture of a shared memory organisation are
access control, synchronisation, security and protection.
• Access control determines which processes may access which resources.
• Synchronisation constraints restrict when processes may access shared
resources (a minimal software sketch of such a constraint follows this list).
• Protection prevents processes from making arbitrary accesses to resources
that belong to other processes.
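The sketch below illustrates a synchronisation constraint in software terms,
using Python threads and a lock; it is only an analogy for the hardware-level
issue, and all numbers are assumed:

import threading

counter = 0                                     # shared data
lock = threading.Lock()                         # restricts when the shared resource may be accessed

def worker():
    global counter
    for _ in range(100000):
        with lock:                              # only one thread updates the shared value at a time
            counter += 1

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)                                  # 400000 with the lock; unpredictable without it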
Computer technology has advanced to the point where it is very difficult to
improve the performance of superscalar processors further by exploiting more
instruction-level parallelism (ILP). A better solution is to rely on thread-level
parallelism (TLP) rather than ILP. The various forms of TLP
are as follows:
• Explicit multithreading
• Chip-level Multiprocessing (CMP)
• Symmetric Multiprocessing (SMP)
• Asymmetric Multiprocessing (ASMP)
• Uniform Memory Access multiprocessing (UMA)
• Non-Uniform Memory Access multiprocessing (NUMA)
• Clustered multiprocessing
• Cache Only Memory Architecture (COMA)

All of the above architectures except clustered multiprocessors provide all


cores in the system with access to a shared physical address space.
A simple architecture of the shared memory organisation is shown in figure
8.7.

Figure 8.7: Shared Memory Organisation

It basically has the following features:

• The bus is usually a simple physical connection.
• The bus bandwidth limits the number of CPUs.
• There could be multiple memory elements in the system.
• A single 'on-chip' cache is universal.
• A second-level cache could also be 'on chip' and could be shared as part of
the memory system.
Designs of shared memory processor
There are various approaches for designing a shared memory processor. The
available design alternatives for a shared memory processor are as follows:
1. No physical sharing: In this memory system organisation, every
processor, or node consisting of more than one processor, has its own
private main memory. It can access remote memory attached to other
nodes through an interconnection network. This architecture is known as
the non-uniform memory access (NUMA) architecture and is shown in
figure 8.8. In the NUMA design, the cost of access to local memory is much
lower than that of remote memory access.

Figure 8.8: NUMA Architecture

2. Shared main memory: In this memory system organisation, every


processor or core has its own private L1 and L2 caches, but all processors
share the common main memory. Although this was the dominating
architecture for small-scale multiprocessors, some of the recent
architectures abandoned the shared memory organisation and switched to
the NUMA organisation.
3. Shared L1 cache: This design is only used in chips with explicit
multithreading, where all logical processors share a single pipeline.
4. Shared L2 cache: This design minimises the on-chip data replication and
makes more efficient use of cache capacity. Some Chip-level
Multiprocessing (CMP) systems are built with shared L2 caches.
Self Assessment Questions
13. ILP stands for ______________________ .
14. TLP is the abbreviation for ____________________ .

8.8 Interleaved Memory Organisation


Interleaved memory organisation (or memory interleaving) is a technique
aimed at enhancing the efficiency of memory usage in a system where more
than one data item or instruction must be fetched simultaneously by the CPU,
as in the case of pipeline processors and vector processors. To understand
the concept, let us consider a system with a CPU and a memory, as shown in
figure 8.9.


Figure 8.9: Interleaved Memory Organisation

As long as the processor requires a single memory read at a time, the above
memory arrangement with a single MAR, a single MDR, a single Address bus
and a single Data bus is sufficient. However, if more than one read is required
simultaneously, the arrangement fails. This problem can be overcome by
adding as many address and data bus pairs along with respective MARs and
MDRs. But buses are expensive as equal number of bus controllers will be
required to carry out the simultaneous reads.
An alternative technique to handle simultaneous reads with comparatively low
cost overheads is memory interleaving. Under this scheme, the memory is
divided into numerous modules which is equivalent to the number of
simultaneous reads required, having their own sets of MARs and MDR but
sharing common data and address buses. For example, if an instruction
pipeline processor requires two simultaneous reads at a time, the memory is
partitioned into two modules having two MARs and two MDRs, as shown in
figure 8.10.


Figure 8.10: MAR and MDR in Interleaved Memory Modules

The memory modules are assigned mutually exclusive memory address
spaces. Suppose, in this case, that memory module 1 is assigned the even
addresses and memory module 2 is assigned the odd addresses. Now, when
the CPU needs two instructions to be fetched from memory, say those located
at addresses 2 and 3, MAR1 of the first memory module is loaded with address
2. While the first instruction is being read into MDR1, MAR2 is loaded with
address 3.
When both the instructions are ready to be read into the CPU from the
respective MDRs, the CPU reads them one after the other from these two
registers. This is an example of a two-way interleaved memory architecture; in
a similar way, an n-way interleaved memory may be designed. Without this
technique, two sets of address and data buses, MARs and MDRs would have
been required to achieve the same objective.
This type of modular memory architecture is helpful for systems that use vector
or pipeline processing. By suitably arranging the memory accesses, the
effective memory cycle time can be reduced by a factor equal to the number of
memory modules. The same technique is also employed to enhance the speed
of read/write operations in secondary storage devices such as hard disks. A
small sketch of how addresses map onto modules is given below.
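A minimal Python sketch of this address-to-module assignment, assuming
two-way low-order interleaving as in the example above, is:

M = 2                                           # number of interleaved modules (assumed)

def module_and_word(addr):
    return addr % M, addr // M                  # low-order bits pick the module, the rest pick the word

print(module_and_word(2), module_and_word(3))   # addresses 2 and 3 land in different modules,
                                                # so the two fetches can overlap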

Self Assessment Questions


15. ________________ is a technique aimed at enhancing the
efficiency of memory usages in a system
16. ________________ share common data and address buses.

8.9 Bandwidth and Fault Tolerance
H. Hellerman (1967) has derived an equation to estimate the effective increase
in memory bandwidth through multiway interleaving. A single memory module
is assumed to deliver one word per memory cycle and thus, has a bandwidth
of 1.
Memory Bandwidth: The memory bandwidth B of an m-way interleaved
memory is upper-bounded by m and lower-bounded by 1. Hellerman estimated
B as:
B = m^0.56 ≈ √m
In this equation, m is the number of interleaved memory modules. This equation


implies that if 16 memory modules are used, then the effective memory
bandwidth is approximately four times that of a single module. This pessimistic
estimate is due to the fact that block access of various lengths and access of
single words are randomly mixed in user programs. Hellerman’s estimate was
based on a single-processor system. If memory-access conflicts from multiple
processors, such as the hot-spot problem, are considered, the effective
memory bandwidth will be further reduced.
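A small Python sketch of Hellerman's estimate, evaluated for a few assumed
module counts, shows the square-root behaviour:

for m in (4, 16, 64):
    print(m, round(m ** 0.56, 2))               # 16 modules give roughly 4.7, i.e. about four times one module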
In vector processing, the access time of a long vector with n elements and
stride distance 1 has been estimated by Cragon (1992) as follows. It is
assumed that the n elements are stored in contiguous memory locations in an
m-way interleaved memory system with memory cycle time θ. The average
time t1 required to access one element of the vector is estimated by

t1 = (θ/m) × (1 + (m − 1)/n)

As n → ∞ (a very long vector), t1 → θ/m; as n → 1 (scalar access), t1 → θ.
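Evaluating Cragon's estimate for assumed, purely illustrative parameters
confirms the two limiting cases:

def t1(theta, m, n):
    return (theta / m) * (1 + (m - 1) / n)      # average access time per vector element

print(t1(8, 8, 1))                              # scalar access (n = 1): equals theta = 8
print(t1(8, 8, 1000))                           # very long vector: approaches theta / m = 1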
Fault Tolerance: High- and low-order interleaving could be mixed to generate
various interleaved memory organisations. Sequential addresses are assigned
in the high-order interleaved memory in each memory module.
This makes it easier to isolate faulty memory modules in a memory bank of m
memory modules. When one module failure is detected, the remaining
modules can still be used by opening a window in the address space. This fault
isolation cannot be carried out in a low-order interleaved memory, in which a

module failure may paralyse the entire memory bank. Thus, low-order
interleaved memory is not fault-tolerant.
Self Assessment Questions
17. The memory bandwidth is upper-bounded by _______________ and
lower-bounded by _________________ .
18. ________________ are assigned in the high-order interleaved
memory in each memory module.

8.10 Consistency Models


Usually the logical data store is physically distributed and replicated across
several processes. A consistency model acts as a contract between the data
store and the processes: the store works correctly only if the processes follow
the rules of the model. Consistency models help in understanding how
simultaneous reads and writes behave on shared memory. They are applicable
to shared-memory multiprocessors as well as to shared databases and cache
coherence algorithms.
Consistency Models are divided into two models: Strong and Weak.
8.10.1 Strong consistency models
In these models, the operations on shared data are synchronised. The various
strong consistency models are:
i) Strict consistency: As the name suggests, this is the strictest model: any
read of a data item returns the value of the most recent write to that item.
The main drawback of this model is that it depends on absolute global time.
ii) Sequential consistency: In this model, the result of any execution is the
same as if the read and write operations of all processes were executed
in some single sequential order, and the operations of each process
appear in this order in the sequence specified by its program (a small
sketch of the allowed interleavings is given after this list). Figure 8.11
shows the sequential consistency model.


Figure 8.11: Sequential Consistency Model

iii) Causal consistency: In causal consistency, causally related writes must
be seen in the same order by all processes, while concurrent writes may
be seen in a different order on different machines.
iv) FIFO consistency: In FIFO consistency, writes done by a single process
are seen by all other processes in the order in which they were issued,
but writes from different processes may be seen in a different order by
different processes.
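To make the sequential-consistency rule concrete, the tiny Python sketch below
enumerates the interleavings of an assumed two-process program (P1: write x,
then read y; P2: write y, then read x) that keep each process's program order;
all sequentially consistent executions come from this set:

from itertools import permutations

ops = [("P1", "W", "x"), ("P1", "R", "y"), ("P2", "W", "y"), ("P2", "R", "x")]

def keeps_program_order(seq):
    return seq.index(ops[0]) < seq.index(ops[1]) and seq.index(ops[2]) < seq.index(ops[3])

legal = [s for s in permutations(ops) if keeps_program_order(s)]
print(len(legal))   # 6 interleavings; in every one, at least one read sees the other process's write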
8.10.2 Weak consistency models
In these models, synchronisation happens when the shared data is locked and
unlocked. The weaker a consistency model is, the easier it is to build a scalable
solution. The various weak consistency models are:
i) General weak consistency: Accesses to the synchronisation variables
associated with a data store are sequentially consistent, and no operation
on a synchronisation variable is allowed until all previous writes have
completed everywhere.
ii) Release consistency: All previous reads and writes performed by a
process must be completed before the process releases the shared data,
so that subsequent read/write operations on that data are valid.
iii) Entry consistency: A process is not permitted to access (enter) a
synchronisation variable until the shared data guarded by it has been
brought up to date with respect to that process.
Self Assessment Questions
19. _________ model is a contract between processes and a data store.
20. The two categories of consistency models are _____ and ________ .
Activity 2:
Visit an organisation. Find the number of m-interleaved memory modules.
Now, calculate the memory bandwidth using the formula of B.
8.11 Summary
Let us recapitulate the important concepts discussed in this unit:
• Small computers do not require additional storage because they have
limited applications that can be easily fulfilled.
• If the processor finds a referenced word in the cache, a "hit" is said to occur;
if the processor cannot find the word in the cache, a "miss" occurs.
• An important characteristic of cache memory is its very fast response, so
little time is wasted in locating words in the cache.
• When physical address is used in split cache, both data cache and
instruction cache are accessed with a physical address after translation
from MMU.
• Mapping refers to the translation of main memory address to the cache
memory address.
• A computer designer may concentrate on the miss rate for evaluating
memory-hierarchy performance, as it too is independent of the speed of the
hardware.
• Interleaved Memory Organisation (or Memory Interleaving) is a technique
aimed at enhancing the efficiency of memory usages in a system where
more than one data/instruction is required to be fetched simultaneously by
the CPU.
• The memory bandwidth B of an m-way interleaved memory is upper-
bounded by m and lower-bounded by 1.

8.12 Glossary
• Associative Mapping: The quickest and most flexible cache organisation,
in which any word of main memory can be stored in any cache location.
• Auxiliary Memory: A large-capacity memory that provides backup storage;
it is not directly accessible by the CPU but is connected to main memory.
• Cache Memory Organisation: A small, fast and costly memory that is
placed between a processor and main memory.
• Main memory: Refers to physical memory that is internal to the computer.
• Memory Interleaving: A category of techniques for increasing memory
speed.
• NUMA Multiprocessing: Non-Uniform Memory Access multiprocessing.
• RAM: Random-access memory
• Split Cache Design: A design where instructions and data are stored in
different caches for execution conveniences.

8.13 Terminal Questions


1. Explain Memory-Hierarchy?
2. Explain the meaning of Cache Memory Organisation.
3. Describe the term addressing modes. List the different types of addressing
modes.
4. Define the following terms:
A. Cache Hit
B. Cache Miss
5. What is meant by Direct Mapping? Discuss the various types of Mapping.
6. Explain the concept of Shared Memory Organisation.

8.14 Answers
Self Assessment Questions
1. Main memory
2. Auxiliary memory
3. Cache
4. Hit
5. Unified
6. Translation Lookaside Buffer
7. Mapping
8. Associative
9. Split
10. Instruction Cache
11. Hit time
12. Miss Penalty
13. Instruction-Level Parallelism
14. Thread-Level Parallelism
15. Interleaved Memory Organisation
16. MARs and MDRs
17. m, 1
18. Sequential addresses
19. Consistency
20. Strong, Weak

Terminal Questions
1. The memory hierarchy arranges memories of different speeds and
capacities according to their nearness to the CPU. Refer Section 8.2.
2. A cache memory is an intermediate memory between two memories
having large difference between their speeds of operation. Refer Section
8.2.
3. Addressing modes has a rule that says “the address field in the instruction
is translated or changed before the operand is referenced”. Refer Section
8.3.
4. When the addressed data or instruction is found in cache during operation,
it is called a cache hit. When the addressed data or instruction is not found
in cache during operation, it is called a cache miss. Refer Section 8.3.
5. Mapping refers to the translation of main memory address to the cache
memory address. Refer Section 8.4.
6. Shared memory organization is a process by which program processes
can exchange data faster than by reading and writing using the regular
operating system functions. Refer Section 8.7.

References:
• Kai Hwang: Advanced Computer Architecture, Parallelism, Scalablility,
Programmability - MGH
• Michael J. Flynn: Computer Architecture, Pipelined & Parallel Processor
Design - Narosa.
• J. P. Hayes: Computer Architecture & Organisation - MGM
• Nicholas P. Carter; Schaum’s Outline of Computer Architecture; Mc. Graw-
Hill Professional

E-references:
• www.csbdu.in/
• www.cs.hmc.edu/
• www.usenix.org
• cse.yeditepe.edu.tr/


Unit 9 Vector Processors

Structure:
9.1 Introduction
Objectives
9.2 Use and Effectiveness of Vector Processors
9.3 Types of Vector Processing
Memory-memory vector architecture
Vector register architecture
9.4 Vector Length and Stride Issues
Vector length
Vector stride
9.5 Compiler Effectiveness in Vector Processors
9.6 Summary
9.7 Glossary
9.8 Terminal Questions
9.9 Answers

9.1 Introduction
In the previous unit, you learnt about memory hierarchy technology and related
aspects such as cache addressing modes, mapping, elements of cache
design, cache performance, shared & interleaved memory organisation,
bandwidth & fault tolerance, and consistency models. In this unit, we will
introduce you to vector processors.
A processor designed to perform mathematical operations on multiple data
elements at the same time is called a vector processor. This is the opposite of
a scalar processor, which handles just a single element at a time. A vector
processor is also called an array processor. Vector processing was first
successfully implemented in the CDC STAR-100 and the Advanced Scientific
Computer (ASC) of Texas Instruments. The vector technique was first fully
exploited in the famous Cray-1. The Cray design had eight vector registers,
each holding sixty-four 64-bit words. The Cray-1 usually delivered a
performance of about 80 MFLOPS (million floating-point operations per
second), but with up to three chains running it could peak at 240 MFLOPS.
In this unit, you will study various aspects of these processors, such as their
types, uses and effectiveness. You will also study vector length and stride
issues, and compiler effectiveness in vector processors.
Objectives:
After studying this unit, you should be able to:
• state the use and effectiveness of vector processors
• identify the types of vector processing
• describe memory-memory vector architecture
• discuss the use of CDC Cyber 200 model 205 computer
• explain vector register architecture
• recognise the functional units of vector processor
• discuss vector instructions and vector processor implementation (CRAY-
1)
• solve vector length and stride issues
• explain compiler effectiveness in vector processors

9.2 Use and Effectiveness of Vector Processors


There is a class of computational problems that is beyond the capabilities of a
conventional computer: these problems need such a large number of
computations that a conventional computer would take days or even weeks to
complete them. In most science and engineering applications, the problems
can be formulated in terms of vectors and matrices that lend themselves to
vector processing. Computers with vector processing capabilities are therefore
in demand in specialised applications. Vector processing is of utmost
importance in the following representative application areas:
• Image processing
• Seismic data analysis
• Aerodynamics and space flight simulations
• Long-range weather forecasting
• Medical diagnosis
• Petroleum explorations
• Mapping the human genome
• Artificial intelligence and expert systems
Advantages of Vector Processors
Vector processors provide the following benefits:
1. Vector processors take advantage of data parallelism in huge scientific as

well as multimedia applications.


2. Once a vector instruction begins executing, only the register buses and
the functional unit feeding it need to be powered. Power can be turned
off for the fetch unit, decode unit, re-order buffer (ROB), etc. This leads
to a reduction in power usage.
3. Vector processors are able to operate on a whole vector with a single
instruction. Therefore, they lessen the fetch and decode bandwidth
because fewer instructions are fetched.
4. In vector processing, the size of programs is small, because it needs fewer
numbers of instructions.
5. Vector memory access does not cause the kind of wastage that cache
access can: each data item requested by the processor is actually used.
6. Vector instructions also eliminate many branches by implementing an
entire loop in a single instruction (see the sketch below).
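The contrast between scalar and vector execution can be sketched with
NumPy, used here purely as an illustration of the idea rather than as real
vector-processor code:

import numpy as np

a = np.arange(64.0)
b = np.arange(64.0)

c = np.empty(64)
for i in range(64):          # scalar style: one element per instruction, plus loop-control branches
    c[i] = a[i] + b[i]

d = a + b                    # vector style: the whole 64-element operation expressed at once
print(np.array_equal(c, d))  # True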
Self Assessment Questions
1. __________ is able to function on one whole vector in a single
instruction.
2. They also take advantage of ____________ in huge scientific as well
as multimedia applications.

9.3 Types of Vector Processing


Depending on the way the operands are fetched, vector processors can be
divided into the following two groups:
Memory-memory vector architecture: Operands are straight away streamed
from the memory to the functional units and outcomes are written back to
memory at the time the vector operation advances in this architecture.
Vector-register architecture: Operands are read into vector registers
wherein they are fed to the functional units and outcomes of operations are
written to vector registers in this architecture.
In the next section we will learn more about these two types of processors.
9.3.1 Memory-memory vector architecture
A vector processor that allows vector operands to be fetched directly from
memory into the vector pipelines, and the results to be written directly back to
memory, is known as a memory-memory vector processor. As the elements of
the vector have to be taken from memory rather than from a register, it takes
somewhat longer to start a vector operation; this is partly because of the cost
of a memory access. An example of a memory-memory vector processor is
the CDC Cyber 205.
Because of their ability to overlap memory accesses and to reuse values held
in vector registers, vector-register processors are normally more efficient than
memory-memory vector processors. However, as the length of the vectors in
a computation increases, this difference in effectiveness between the two
kinds of architectures decreases; for very long vectors, memory-memory
vector processors can in fact prove quite efficient. Experience shows, though,
that shorter vectors are more commonly used.
Based on the concepts introduced in the CDC STAR-100, the first commercial
model of the CDC Cyber 205 was delivered in 1981. This supercomputer is a
memory-memory vector machine: it fetches vectors directly from memory to
load the pipelines and stores the pipeline results directly back to memory, and
it contains no vector registers. Consequently, the vector pipelines have large
start-up times. Instead of pipelines designed for specific operations, the
machine contains up to four general-purpose pipelines. It also provides gather
and scatter functions. The ETA-10 is an updated shared-memory
multiprocessor version of the CDC Cyber 205. The next section provides more
detail on this model.
CDC Cyber 200 model 205 computer overview: The Model 205 computer is
a super-scale, high-speed, logical and arithmetic computing system. It utilises
LSI circuits in both the scalar and vector processors that improve performance
to complement the many advanced features that were implemented in the
STAR-100 and CYBER 203 (these are the two Control Data Corp. computers
with built-in vector processors), like hardware macroinstructions, virtual
addressing and stream processing. The Model 205 contains separate scalar
and vector processors particularly designed for sequential and parallel
operations on single bit, 8-bit bytes, and 32-bit or 64-bit floating-point operands
and vector elements.
The central memory of the Model 205 is a high-performance semiconductor
memory with single-error correction, double-error detection (SECDED) on
each 32-bit half word, providing extremely high storage integrity. Virtual


addressing uses a high-speed mapping technique to convert a logical to an


absolute storage address to allow programs to appear logically contiguous
while being physically discontinuous in the storage system.
The basic Model 205 computer consists of the central processor unit (CPU), 1
million 64-bit words of central memory with SECDED, 6 input/output ports, and
a maintenance control unit (MCU). The CPU consists of the scalar processor
and a vector processor with one vector pipeline. Central memory is field-
expandable from one million 64-bit words to two or four million words of
semiconductor memory. The vector pipelines can be expanded to two or four
and the input/output ports are expandable to 16.The Model 205 central
processor consists of all instruction and streaming control, scalar and vector
arithmetic processors, and control for communication with central memory by
the CPU and the input/output channels.
The basic functional areas of the Model 205 CPU are:
• Scalar Processor
• Vector Processor
• Memory Interface
• Maintenance Control Unit
The LSI scalar processor contains a scalar arithmetic unit with independent
high-speed scalar arithmetic functional units. The scalar processor also
contains a semiconductor register file of 256 64-bit words utilised for indexing
and storing constants, instruction and operand addressing and field length
counts. Additionally it also holds operands and results for scalar instructions.
The scalar processor performs instruction control and virtual address
comparison and translation. A feature is provided to select, via an operating
system software installation parameter, a small page size of 512, 2048, or
8192 words. A large page size of 65,536 words is also provided. The vector
processor contains one, two or four parallel, segmented pipelines to facilitate
high-speed vector processing. The vector processor control is contained in the
stream unit. The string and all logical operations are performed in the string
unit. The memory interface provides the read and write ports of central memory
for the scalar and vector processors. Each port contains a one-SWORD (512-
bit Super Word) buffer to facilitate high transfer rates. The CPU processes
input and output by issuing relatively simple high-level messages to high-
speed peripheral stations or a front-end processor connected to the


input/output ports.
9.3.2 Vector register architecture
In a vector-register processor, all vector operations except load and store take
place among the vector registers. Such architectures are the vector equivalent
of a load-store architecture. Since the late 1980s, all major vector computers
have used a vector-register architecture, including the Cray Research
processors (Cray-1, Cray-2, X-MP, Y-MP, C90, T90 and SV1), the Japanese
supercomputers (NEC SX/2 through SX/5, Fujitsu VP200 through VPP5000,
and the Hitachi S820 and S-8300), and the mini-supercomputers
(Convex C-1 through C-4).
In a memory-memory vector processor, all vector operations are memory to
memory; the earliest vector computers and CDC's vector computers were of
this kind. Vector-register architectures possess various benefits over vector
memory-memory architectures. A memory-memory architecture has to write
all intermediate results to memory and later read them back, whereas a
vector-register architecture can keep intermediate results in the vector
registers close to the vector functional units, decreasing temporary storage
needs, inter-instruction latency and memory bandwidth requirements.
If a vector result is required by several other vector instructions, a
memory-memory architecture must read it from memory again and again,
while a vector-register machine can reuse the value from the vector registers,
further decreasing the memory bandwidth requirement. For these reasons,
vector-register machines have proved more effective in practice.
Components of a vector register processor: The major components of the
vector unit of a vector register machine are as given below:
1. Vector registers: There are many vector registers that can perform
different vector operations in an overlapped manner. Every vector register
is a fixed-length bank that consists of one vector with multiple elements
and each element is 64-bit in length. There are also many read and write
ports. A pair of crossbars connects these ports to the inputs/ outputs of
functional unit.
2. Scalar registers: The scalar registers are also linked to the functional
units with the help of the pair of crossbars. They are used for various
purposes such as computing addresses for passing to the vector


load/store unit and as buffer for input data to the vector registers.
3. Vector functional units: These units are generally floating-point units that
are completely pipelined. They are able to initiate a new operation on each
clock cycle. They comprise all operation units that are utilised by the vector
instructions.
4. Vector load and store unit: This unit can also be pipelined and perform
an overlapped but independent transfer to or from the vector registers.
5. Control unit: This unit decodes and coordinates among functional units.
It can detect data hazards as well as structural hazards: data hazards are
conflicts in register accesses, while structural hazards are conflicts in
functional units.

Figure 9.1 gives a clear picture of the above-mentioned functional units of a
vector processor.

Figure 9.1: Vector Register Architecture


Types of Vector Instructions: The various types of vector instructions for a
register-register vector processor are:
(a) Vector-scalar instructions
(b) Vector-vector instructions
(c) Vector-memory instructions
(d) Gather and scatter instructions
(e) Masking instructions
(f) Vector reduction instructions
Let us discuss these.


(a) Vector-scalar instructions: Using these instructions, a scalar operand


can be combined with a vector one. If A and B are vector registers and f
is a function that performs some operation on each element of a single
or two vector operands, a vector-scalar operand can be defined as
follows:
Ai: = f (scalar, Bi)
(b) Vector-vector instructions: Using these instructions, one or two vector
operands are fetched from respective vector registers and produce
results in another vector register. If A, B, and C are three vector registers,
a vector-vector operand can be defined as follows:
Ai: = f (Bi, Ci)
(c) Vector-memory instructions: These instructions correspond to vector
load or vector store. The vector load can be defined as follows:
A: = f (M) where M is a memory register
The vector store can be defined as follows:
M: = f (A)
(d) Gather and scatter instructions: Gather is an operation that fetches the
non-zero elements of a sparse vector from memory as defined below:
A x Vo: = f (M)
Scatter stores a vector in a sparse vector into memory as defined below:
M: = f (A x Vo)
(e) Masking instructions: These instructions use a mask vector to expand
or compress a vector as defined below:
V = f (A x VM) where V is a mask vector
(f) Vector reduction instructions: These instructions accept one or two
vectors as input and produce a scalar as output. (A rough illustration of
these instruction classes is sketched below.)
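The rough NumPy sketch below pairs each instruction class above with a
software analogue; the arrays and indices are assumptions made only for
illustration, and this is not CRAY syntax:

import numpy as np

B = np.array([1.0, 2.0, 3.0, 4.0]); C = np.array([5.0, 6.0, 7.0, 8.0]); s = 2.0
M = np.zeros(16)                   # stands in for a region of memory

A = s * B                          # vector-scalar:   Ai := f(scalar, Bi)
A = B + C                          # vector-vector:   Ai := f(Bi, Ci)
M[0:4] = A                         # vector store (a vector load is the reverse direction)
idx = np.array([3, 0, 2])
G = M[idx]                         # gather: fetch the elements at the given indices
M[idx] = G                         # scatter: store elements back to the given indices
V = B[B > 1.5]                     # masking: compress a vector under a mask vector
total = B.sum()                    # vector reduction: vectors in, a scalar out
print(A, G, V, total)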
Vector processor implementation (CRAY-1): The CRAY-1 is one of the
oldest processors to implement vector processing and is considered the
world's first vector supercomputer. It was introduced in 1975 by Seymour Cray.
It is basically a register-oriented, RISC-like machine requiring all operands to
be in registers. It has five kinds of registers:
(a) A registers: A set of 8 24-bit registers
(b) B registers: A set of 64 24-bit registers


(c) S registers: A set of 8 64-bit registers


(d) T registers: A set of 64 64-bit registers
(e) Vector registers: A set of 8 64-element floating point registers
There are 12 functional units in CRAY-1:
(a) 2 24-bit units for address calculation
(b) 4 64-bit integer scalar units for integer operations
(c) 6 deeply pipelined units for vector operations
CRAY-1 uses 16-bit instructions. All vector operations can be executed in one
16-bit instruction. The block diagram of CRAY-1 architecture is shown in figure
9.2.
