E-references:
• http://www.lc3help.com/tutorials/Basic_LC-3_Instructions/ Retrieved on
03-04-2012
• http://www.scribd.com/doc/4596293/LC3-Instruction-Details Retrieved
on 02-04-2012
• http://xavier.perseguers.ch/programmation/mips-
assembler/references/5-stage-pipeline.html
5.1 Introduction
In the previous unit, you studied pipelined processors in great detail, with a
short review of pipelining and examples of some pipelines in modern
processors. You also studied various kinds of pipeline hazards and the
techniques available to handle them.
In this unit, we will introduce you to the design space of pipelines. The
day-by-day increasing complexity of chips has led to higher operating speeds. These
speeds are obtained by overlapping instruction latencies, that is, by implementing
pipelining. In the early models, a discrete pipeline was used. A discrete pipeline
performs its task in stages such as fetch, decode, execute, memory access and write-
back. Every pipeline stage requires one cycle of time, and as
there are 5 stages, the instruction latency is five cycles. Longer pipelines
spread over more cycles can hide instruction latencies.
This enables processors to attain higher clock speeds. Instruction pipelining
has significantly improved the performance of today’s processors. In this unit,
you will study the design space of pipelines, whose basic aspects are shown in the figure below.
(Figure: the design space of pipelines — the number of stages, the specification of the subtasks to be performed in each of the stages, the layout of the stage sequence, the use of bypassing, and the timing of the pipeline operations.)
next instruction from the memory, decodes it, optimises the order of
execution and further sends the instruction to its destination.
3. Calculate operand address: Now, the effective address of each source
operand is calculated.
4. Fetch operand/memory access: Then, the memory is accessed to fetch
each operand. For a load instruction, data returns from memory and is
placed in the Load Memory Data (LMD) register. If it is a store, then data
from register is written into memory. In both cases, the operand address
as computed in the prior cycle is used.
5. Execute instruction: In this operation, the ALU performs the indicated
operation on the operands prepared in the prior cycle and stores the result
in the specified destination operand location.
6. Write back operand: Finally, the result is written into the register file or
stored into memory.
These six stages of instruction pipeline are shown in a flowchart in figure 5.4.
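To make the overlap of these stages concrete, here is a small Python sketch (an illustrative model only, with made-up stage abbreviations, one cycle per stage and no hazards) that prints which stage each instruction occupies in successive clock cycles:

# Simplified illustration of a six-stage instruction pipeline.
# Assumes one cycle per stage and no hazards or stalls.
STAGES = ["FI", "DI", "CO", "FO", "EI", "WO"]   # fetch, decode, calc. operand addr.,
                                                # fetch operand, execute, write back

def pipeline_timeline(num_instructions):
    """Return a cycle-by-cycle table showing the stage each instruction is in."""
    total_cycles = len(STAGES) + num_instructions - 1
    table = []
    for cycle in range(total_cycles):
        row = []
        for instr in range(num_instructions):
            stage_index = cycle - instr          # instruction i enters the pipe at cycle i
            row.append(STAGES[stage_index] if 0 <= stage_index < len(STAGES) else "--")
        table.append(row)
    return table

for cycle, row in enumerate(pipeline_timeline(4), start=1):
    print(f"cycle {cycle:2d}: " + "  ".join(row))

With four instructions, the timeline shows that once the pipeline is full, one instruction completes in every cycle even though each individual instruction still takes six cycles.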
Arithmetic or logical shifts can be easily implemented with shift registers. High-
speed addition requires either the use of a carry-propagate adder (CPA),
which adds two numbers and produces an arithmetic sum as shown in
figure 5.6a, or the use of a carry-save adder (CSA) to "add" three input
numbers and produce one sum output and a carry output as exemplified in
figure 5.6b.
Figure 5.6: (a) An n-bit carry-propagate adder (CPA), which either propagates the carries or applies the carry-lookahead technique; for example, with n = 4, A = 1011 and B = 0111 give the sum S = A + B = 10010. (b) An n-bit carry-save adder (CSA), where Sb is the bitwise sum of X, Y and Z, and C is a carry vector generated without carry propagation between digits, so that S = Sb + C = X + Y + Z.
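The behaviour of a CSA can be sketched in a few lines of code. The function below is an illustrative bit-level model (not a hardware description): it "adds" three numbers without propagating carries, producing a bitwise sum vector Sb and a carry vector C that a final carry-propagate addition then combines.

# Carry-save addition of three numbers: no carry propagation between digits.
# Sb holds the bitwise (mod-2) sum, C holds the carries shifted left by one place.
def carry_save_add(x, y, z):
    bitwise_sum = x ^ y ^ z                             # sum bit of each column
    carry_vector = ((x & y) | (y & z) | (x & z)) << 1   # carry out of each column
    return bitwise_sum, carry_vector

x, y, z = 0b1011, 0b0110, 0b1110
sb, c = carry_save_add(x, y, z)
print(bin(sb), bin(c))
# A carry-propagate adder (here Python's ordinary +) finishes the job:
assert sb + c == x + y + z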
(Figure: the eight partial products P0 to P7 of an 8 × 8 multiplication, which are summed to produce the final product P.)
The first stage (S1) generates all eight partial products, ranging from 8 bits to
15 bits, simultaneously. The second stage (S2) is made up of two levels of four
CSAs, and it essentially merges eight numbers into four numbers ranging from
13 to 15 bits. The third stage (S3) consists of two CSAs, and it merges four
numbers from S2 into two 16-bit numbers. The final stage (S4) is a CPA, which
adds up the last two numbers to produce the final product P.
For a maximum width of 16 bits, the CPA is estimated to need four gate levels
Activity 2:
Access the internet and find out more about the difference between fixed point
and floating point units.
You can see that the rows show the time steps and the columns show the
operations performed in each time step. In this example you can see that the
branch "ble" is not taken and the processor is speculatively executing instructions from the
predicted path. In this example we have shown the renaming of values only for the r3
register, but other registers can also be renamed. The various values allotted to register r3
are bound to different physical registers (R1, R2, R3, R4).
Now let us look at the various ways of organising the instruction issue buffer,
in increasing order of complexity.
Single queue method: Renaming is not needed in the single queue method
because this method has one queue and no out-of-order issue. In this method,
operand availability can be handled through simple reservation bits allotted
to every register. When an instruction that modifies a register is issued, the
register is reserved, and when the modification finishes the register is cleared.
Multiple queue method: In the multiple queue method, each queue issues
instructions in order, but a queue may slip ahead of the other queues. The
individual queues are organised according to instruction type.
Reservation stations: With reservation stations, the instruction issue does not
follow the FIFO order. As a result, all the reservation stations have to monitor
their source operands for data availability at the same time. The conventional
way of doing this is to hold the operand data in the reservation station: when a
reservation station receives an instruction, the operand values that are already
available are read and placed in it.
After that, the operand designators of the data that are not yet available are
compared with the result designators of completing instructions. If there is a
match, the result value is forwarded to the matching reservation station. An
instruction is issued once all of its operands are ready in its reservation station.
Reservation stations may be divided by instruction type to reduce data paths, or may behave
as a single block.
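A minimal sketch of this behaviour is given below. The class layout and field names are illustrative assumptions, not the design of any particular processor: each station records operands either as captured values or as tags naming the instructions that will produce them, captures broadcast results, and reports itself ready once both operands are present.

# Illustrative reservation-station model: each operand is either a captured value
# or a tag naming the instruction that will produce the value.
class ReservationStation:
    def __init__(self, op, src1, src2):
        self.op = op
        self.operands = [src1, src2]   # each entry: ("value", v) or ("tag", t)

    def capture(self, tag, value):
        """Called when a finishing instruction broadcasts (tag, value)."""
        for i, (kind, item) in enumerate(self.operands):
            if kind == "tag" and item == tag:
                self.operands[i] = ("value", value)

    def ready(self):
        return all(kind == "value" for kind, _ in self.operands)

rs = ReservationStation("add", ("value", 5), ("tag", "mul1"))
print(rs.ready())          # False: still waiting for the result tagged "mul1"
rs.capture("mul1", 12)     # result of the multiply is broadcast
print(rs.ready())          # True: the instruction can now be issued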
Self Assessment Questions
9. In traditional pipeline implementations, load and store instructions are
processed by the ___________________ .
10. The consistency of instruction completion with that of sequential
instruction execution is specified by ______________ .
11. A processor which endorses weak memory consistency does not allow
reordering of memory accesses. (True/False)
12. ____________ is not needed in single queue method.
13. In reservation stations, the instruction issue does not follow the FIFO
order. (True/ False).
5.6 Summary
• The design space of pipelines can be sub divided into two aspects:
basic layout of a pipeline and dependency resolution.
• An Instruction pipeline operates on a stream of instructions by
overlapping and decomposing the three phases (fetch, decode and
execute) of the instruction cycle.
• Two basic aspects of the design space are how FX pipelines are laid out
logically and how they are implemented.
• A logical layout of an FX pipeline consists, first, of the specification of how
many stages an FX pipeline has and what tasks are to be performed in
these stages.
• The other key aspect of the design space is how FX pipelines are imple-
mented.
• In logical layout of FX pipelines, the FX pipelines for RISC and CISC
processors have to be taken separately, since each type has a slightly
different scope.
• Pipelined processing of loads and stores consists of sequential consistency
of instruction execution and parallel execution.
5.7 Glossary
• CISC: It is an acronym for Complex Instruction Set Computer. The CISC
machines are easy to program and make efficient use of memory.
• CPA: It stands for carry-propagation adder which adds two numbers
and produces an arithmetic sum.
• CSA: It stands for carry-save adder which adds three input numbers
and produces one sum output.
• LMD: Load Memory Data.
• Load/Store bypassing: It defines that either loads can bypass stores or
vice versa, without violating the memory data dependencies.
• Memory consistency: It is used to find out whether memory access is
performed in the same order as in a sequential processor.
• Processor consistency: It is used to indicate the consistency of
instruction completion with that of sequential instruction execution.
• RISC: It stands for Reduced Instruction Set Computing. RISC
computers reduce chip complexity by using simpler instructions.
• ROB: It stands for Reorder Buffer. ROB is an assurance tool for
sequential consistency execution where multiple EUs operate in parallel.
• Speculative loads: They avoid memory access delay. This delay can be
caused due to the non- computation of required addresses or clashes
among the addresses.
• Tomasulo’s algorithm: It allows the replacement of sequential order by
data-flow order.
5.9 Answers
Self Assessment Questions
1. Microprocessor without Interlocked Pipeline Stages
2. Dynamically
3. Write Back Operand
4. Opcode, operand specifiers
5. Register operands
6. True
Terminal Questions
1. The design space of pipelines can be sub divided into two aspects: basic
layout of a pipeline and dependency resolution. Refer Section 5.2.
2. A pipeline instruction processing technique is used to increase the
instruction throughput. It is used in the design of modern CPUs,
microcontrollers and microprocessors. Refer Section 5.3 for more details.
3. There are two basic aspects of the design space of pipelined execution of
Integer and Boolean instructions: how FX pipelines are laid out logically
and how they are implemented. Refer Section 5.4.
4. While processing operates instructions, RISC pipelines have to cope only
with register operands. By contrast, CISC pipelines must be able to deal
with both register and memory operands as well as destinations. Refer
Section 5.4.
5. Depending on the function to be implemented, different pipeline stages in
an arithmetic unit require different hardware logic. Refer Section 5.4.
6. The execution of load and store instructions begins with the
determination of the effective memory address (EA) from where data is to
be fetched. This can be broken down into subtasks. Refer
Section 5.5.
7. The overall instruction execution of a processor should mimic sequential
execution, i.e. it should preserve sequential consistency. The first step is to
create and buffer execution tuples and then determine which tuples can be
issued for parallel execution. Refer Section 5.5.
References:
• Hwang, K. (1993) Advanced Computer Architecture. McGraw-Hill.
• Godse D. A. & Godse A. P. (2010). Computer Organisation, Technical
Publications. pp. 3-9.
• Hennessy, John L., Patterson, David A. & Goldberg, David (2002)
Computer Architecture: A Quantitative Approach, (3rd edition), Morgan
Kaufmann.
• Sima, Dezso, Fountain, Terry J. & Kacsuk, Peter (1997) Advanced
computer architectures - a design space approach, Addison-Wesley-
Longman: I-XXIII, 1-766.
E-references:
• http://www.eecg.toronto.edu/~moshovos/ACA06/readings/ieee-
proc.superscalar.pdf
• http://webcache.googleusercontent.com/search?q=cache:yU5nCVnju9
cJ:www.ic.uff.br/~vefr/teaching/lectnotes/AP1-topico3.5.ps.gz+load+
store+sequential+instructions&cd=2&hl=en&ct=clnk&gl=in
Structure:
6.1 Introduction
Objectives
6.2 Dynamic Scheduling
Advantages of dynamic scheduling
Limitations of dynamic scheduling
6.3 Overcoming Data Hazards
6.4 Dynamic Scheduling Algorithm - The Tomasulo Approach
6.5 High performance Instruction Delivery
Branch target buffer
Advantages of branch target buffer
6.6 Hardware-based Speculation
6.7 Summary
6.8 Glossary
6.9 Terminal Questions
6.10 Answers
6.1 Introduction
In pipelining, two or more instructions that are independent of each other can
overlap. This possibility of overlap is known as ILP (instruction-level
parallelism), because the instructions may be evaluated in parallel. The level of
parallelism is quite small in straight-line code, where there are
no branches except at the entry or exit. The easiest and most widely used
methodology to enhance parallelism is by exploiting parallelism among the
Here F0, F1, F2, ..., F14 are the floating-point registers (FPRs) and DIVD, ADDD
and SUBD are the floating-point operations on double precision (denoted by
D). The dependence of ADDD on DIVD causes a stall in the pipeline; and thus,
the SUBD instruction cannot execute. If the instructions were not required to execute in the
same sequence, this limitation could be removed.
In case of the DLX (DLX is a RISC processor architecture) pipeline, the structural
and data hazards are examined during instruction decode (ID). If an
instruction can proceed properly, it is issued from ID. To commence
the execution of the SUBD, we need to examine the following two issues
separately:
• Firstly, we need to check for any structural hazards.
• Secondly, we need to wait for the absence of any data hazard.
In this example you can see that ADDD and SUBD are interdependent. If
SUBD is executed before ADDD, then the data dependence will be
violated, resulting in wrong execution. Similarly, to prevent violation of output
dependences, it is essential to detect WAW (Write After Write) data hazards.
The scoreboard technique helps to minimise or remove both the structural as well
as the data hazards. The scoreboard stalls the later instruction that is involved in
the dependence. The scoreboard’s goal is to execute an instruction in each
clock cycle (in situations where no structural hazards exist). Therefore, when
any instruction is stalled, some other independent instructions may be executed.
The scoreboard takes full responsibility for issuing and executing the
instructions, including all hazard detection. Taking advantage of out-of-order
execution requires several instructions to be in execution simultaneously.
We can achieve this by use of
either of the two ways:
1. By utilizing pipelined functional units
2. By using multiple functional units
Either way places additional demands on pipeline control. Here we will consider
the use of multiple functional units.
The CDC 6600 comprises 16 distinct functional units. These are of the following
types:
• Four FPUs (floating-point units)
• Five units for memory references
• Seven units for integer operations.
FPUs are of prime importance in the DLX scoreboard in comparison to the other FUs
(functional units).
For example, we have 2 multipliers, 1 adder, 1 divide unit, and 1 integer unit
for all integer operations, memory references and branches.
The methodology for the DLX & CDC 6600 is quite similar as both of these are
load-store architectures. Given below in figure 6.1 is the basic structure of a
DLX Processor with a Scoreboard.
Now let us study the four steps in the scoreboard technique in detail.
1. Issue: The issue step replaces a part of the ID step of the DLX
pipeline. In this step the instruction is forwarded to the FU and the internal data
structures are updated. This is done only when two conditions hold:
• The FU for the instruction is free.
• No other active instruction has the same register as destination. This
ensures that the operation is free from WAW (Write After Write) data
hazards.
When a structural or WAW hazard is detected, a stall occurs and
the issue of all subsequent instructions is stopped until these hazards
have been cleared. When a stall occurs in this stage, the buffer between
instruction fetch and issue fills up. If the buffer holds a single instruction,
then instruction fetch also stalls at once; but if the buffer space is
a queue, fetch stalls only after the buffer queue is completely full.
2. Read operands: The scoreboard examines whether the source operands are
available. A source operand is available when no earlier issued
active instruction is going to write it. When the source operands become
available, the scoreboard prompts the FU to read them from the data registers and start
execution. Read After Write (RAW) hazards are resolved dynamically during this stage,
and instructions may be sent into execution out of order. The issue and read
operand steps together complete the functions of the ID step of the DLX
pipeline.
3. Execution: After receiving the operands, the FU starts execution. On
completion of execution, the result is generated, and the FU informs the
scoreboard about the completion of the execution step. The execution step replaces
the EX step of the DLX pipeline, although here it may take multiple
cycles.
4. Write result: After the FU completes execution, the scoreboard checks
whether any WAR (Write After Read) hazard is present. If a WAR hazard is
detected, it stalls the completing instruction. A WAR hazard occurs in code such as
our earlier example of ADDD and SUBD, where both
use F8. The code for that example is shown again below:
Here you can see that the source operand of ADDD is F8, which is the same as the
destination register of SUBD. However, ADDD in fact depends on the
previous instruction DIVD. In this case, the scoreboard will stall SUBD in its
write result stage until ADDD has read its operands.
A completing instruction is not permitted to write its result when both of the following hold:
• there exists an instruction that precedes it (in issue order) and has not yet
read its operands, and
• one of those operands is the same register as the result of the completing
instruction.
(Scoreboard tables: the instruction status table, and the functional-unit status table with the fields Name, Busy, Op, Fi, Fj, Fk, Qj, Qk, Rj, Rk.)
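These functional-unit status fields can be sketched as a simple record, as below. The field meanings follow the usual scoreboard convention (Fi is the destination register, Fj and Fk the source registers, Qj and Qk the units that will produce those sources, Rj and Rk flags saying whether each source is ready and not yet read); the example values are made up.

from dataclasses import dataclass
from typing import Optional

# One row of the functional-unit status table kept by a scoreboard.
@dataclass
class FunctionalUnitStatus:
    name: str                 # e.g. "Mult1", "Add", "Divide"
    busy: bool = False
    op: Optional[str] = None  # operation being performed, e.g. "MULTD"
    fi: Optional[str] = None  # destination register
    fj: Optional[str] = None  # first source register
    fk: Optional[str] = None  # second source register
    qj: Optional[str] = None  # unit that will produce Fj (None if already available)
    qk: Optional[str] = None  # unit that will produce Fk (None if already available)
    rj: bool = False          # Fj is ready and not yet read
    rk: bool = False          # Fk is ready and not yet read

mult1 = FunctionalUnitStatus("Mult1", busy=True, op="MULTD",
                             fi="F0", fj="F2", fk="F4",
                             qj=None, qk="Add", rj=True, rk=False)
print(mult1)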
Activity 1:
Imagine yourself as a computer architect. Explain the measures you will take
to overcome data hazards with dynamic scheduling.
Assuming that:
• Misprediction penalty = 4 cycles
• Buffer miss-penalty = 3cycles
• Hit rate and accuracy each = 90%
• Branch Frequency = 15%
Solution:
The speedup with a Branch Target Buffer (BTB) versus no BTB is expressed as:
Speedup = CPI_noBTB / CPI_BTB = (CPI_base + Stalls_noBTB) / (CPI_base + Stalls_BTB)
The stalls are determined as:
Stalls = Σ (Frequency × Penalty)
that is, the sum over all stall cases of the product of the frequency of the stall
case and its stall penalty.
i) Stalls_noBTB = 0.15 × 2 = 0.30
ii) To find Stalls_BTB, we have to consider each outcome from the BTB.
There exist three possibilities:
a) Branch misses the BTB:
Frequency = 15% × 0.1 = 1.5% = 0.015
Penalty = 3
Stalls = 0.045
b) Branch hits the BTB and is correctly predicted:
Frequency = 15% × 0.9 (hit) × 0.9 (correct prediction) ≈ 12.1% = 0.121
Penalty = 0
Stalls = 0
c) Branch hits the BTB but is incorrectly predicted:
Frequency = 15% × 0.9 (hit) × 0.1 (misprediction) ≈ 1.3% = 0.013
Penalty = 4
Stalls = 0.052
iii) Stalls_BTB = 0.045 + 0 + 0.052 = 0.097
Speedup = (CPI_base + Stalls_noBTB) / (CPI_base + Stalls_BTB)
= (1.0 + 0.3) / (1.0 + 0.097)
≈ 1.2
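The arithmetic above can be reproduced with a few lines of code; the 2-cycle penalty for the no-BTB case is taken from the worked figures, and the small difference from 0.097 and 1.2 is only due to the rounding of intermediate frequencies in the text.

# Reproduce the branch-target-buffer speedup calculation above.
branch_freq = 0.15
hit_rate    = 0.90
accuracy    = 0.90
cpi_base    = 1.0

stalls_no_btb = branch_freq * 2                               # 2-cycle penalty without a BTB
miss_stalls   = branch_freq * (1 - hit_rate) * 3              # BTB miss: 3-cycle penalty
mispredict    = branch_freq * hit_rate * (1 - accuracy) * 4   # hit but mispredicted: 4 cycles
stalls_btb    = miss_stalls + 0 + mispredict                  # a correct prediction costs nothing

speedup = (cpi_base + stalls_no_btb) / (cpi_base + stalls_btb)
print(round(stalls_btb, 3), round(speedup, 2))   # 0.099 and 1.18, i.e. approximately 1.2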
In order to achieve more instruction delivery, one possible variation in the
Branch Target Buffer is:
• To keep one or more than one target instructions, instead of or in addition to,
the anticipated Target Address
6.5.2 Advantages of branch target buffer
There are several advantages of branch target buffer. They are as follows:
• It possibly allows a larger BTB, since the access is allowed to take more time between
consecutive instruction fetches.
• Buffering the actual target instructions allows branch folding, i.e., ZERO-cycle
unconditional branching or sometimes ZERO-cycle conditional
branching.
Self Assessment Questions
12. The branch-prediction buffer is accessed during the _____ stage.
13. The _____ field helps check the addresses of the known branch
instructions.
14. Buffering the actual Target-Instructions allow ___________ .
6.7 Summary
Let us recapitulate the important concepts discussed in this unit:
• In pipelining, implementation of instructions independent of one another
can overlap. This possible overlap is known as instruction-level parallelism
(ILP)
• Pipeline fetches an instruction and executes it.
• In DLX pipelining, all the structural & data hazards are analyzed
throughout the process of instruction decode (ID).
• Dynamic scheduling is hardware-based scheduling. In this
approach, the hardware rearranges the instruction execution to reduce the
stalls.
6.8 Glossary
• Dynamic scheduling: Hardware based scheduling that rearranges the
instruction execution to reduce the stalls.
• EX: Execution stage
• FP: Floating-Point Unit
• ID: Instruction Decode
• ILP: Instruction-Level Parallelism
• Instruction-level parallelism: Overlap of independent instructions on one
another
• Static scheduling: Separating dependent instructions and minimising the
number of actual hazards and resultant stalls.
6.9 Terminal Questions
1. What do you understand by instruction-level parallelism? Also, explain
loop-level parallelism.
2. Describe the concept of dynamic scheduling.
3. How does the execution of instructions take place under dynamic
scheduling with score boarding?
4. What is the goal of score boarding?
5. Explain the Tomasulo approach.
6.10 Answers
Self Assessment Questions
1. Static scheduling
2. Check the structural hazards, wait for the absence of a data hazards
3. An instruction fetch
4. EX
5. SUBD, ADDD
6. Pipelined, multiple
References:
• John L. Hennessy and David A. Patterson, Computer Architecture: A
Quantitative Approach, Fourth Edition, Morgan Kaufmann Publishers.
• David Salomon, Computer Organisation, 2008, NCC Blackwell.
• Joseph D. Dumas II; Computer Architecture; CRC Press.
• Nicholas P. Carter; Schaum’s Outline of Computer Architecture; McGraw-Hill Professional.
7.1 Introduction
In the previous unit, you studied Instruction-level parallelism and its dynamic
exploitation. You learnt how to overcome data hazards with dynamic
scheduling besides performance instruction delivery and hardware based
speculation.
As mentioned in the previous unit, an inherent property of a sequence of
instructions is that some of them can be executed in parallel; this is
known as instruction-level parallelism (ILP). There is an upper bound as to
how much parallelism can be achieved. We can approach this upper bound
via a series of transformations that either expose ILP or allow more ILP to be
exposed to later transformations. The best way to exploit ILP is to have a
Objectives:
After studying this unit, you should be able to:
• identify the various types of branches
• explain the concept of branch handling
• describe the role of delayed branching
• recognise branch processing
• discuss the process of branch prediction
• explain Intel IA-64 architecture and Itanium processor
• discuss the use of ILP in the embedded and mobile markets
The conditional branch instruction given above performs the testing of the
contents available in two registers, that is, Rsrc1 as well as Rsrc2 for
equality. The control is transferred to the target if their values appear to be
equal. Let us suppose that the numbers that are to be compared are
placed in register t0 and register t1. For this, the branch instruction is
written as below:
beq $t1,$t0,target
The instruction given above substitutes the two-instruction cmp/je
sequence which is utilised by Pentium.
Some processors maintain registers for recording the condition of arithmetic
as well as logical operations. We call these registers condition code registers.
The status of the last arithmetic or logical operation is recorded by these
registers. For instance, if two 32-bit integers are added, then the sum might
need more than 32 bits. This is an overflow condition which should be recorded
by the system. Usually, this overflow condition is indicated by setting a bit in the
condition code register. MIPS, for example, does not make use of
condition code registers; rather, exceptions are used to flag the overflow condition.
Alternatively, processors such as the Pentium, SPARC, and PowerPC
Activity 1:
Work on an MIPS processor to find out the difference between conditional
and unconditional branching.
target:
mult R8, R9, R10
...
The process of moving instructions into delay slots is not something the
programmer needs to worry about; this task is accomplished by compilers and
assemblers. If no useful instruction can be moved into the delay slot, a NOP
(no operation) is placed there. Note that if the branch is not
taken, we may not want to execute the delay slot instruction; that is, we
would like to nullify the instruction in the delay slot. A number of
processors such as SPARC offer this option of nullification.
Self Assessment Questions
5. A number of processors such as __________ and _________ make
use of delayed execution for procedure calls as well as branching.
6. If any valuable instruction cannot be moved into the delay slot, __________ is placed.
The data in the table given above presume that approximately 60% of
the time conditional branches are not taken. Therefore this prediction of
conditional branches is accurate only sixty per cent of the time. So we get
the following:
42 × 0.6 = 25.2%
This is the prediction accuracy in the case of conditional branches.
Likewise, loop branches jump back with 90% probability. As loops appear about 10%
of the time, predictions are accurate for another 9%. Surprisingly, even this
static prediction approach provides an accuracy of about 82%.
7.6.3 Dynamic branch prediction
For making more accurate predictions, this approach considers run-time
history. Here the n branch executions of history are considered and this
information is used for predicting the next one.
The empirical study done by Smith and Lee suggests that this approach
provides a major enhancement in prediction accuracy. In table 7.2, we have
shown a summary of what they studied.
Table 7.2: Effect of Utilising the Information of Past Branches on Prediction
Accuracy
The algorithm applied is simple: the next branch prediction is the
majority outcome of the past n branch executions. For instance, suppose n = 3.
If the past three branch executions include two or more taken
branches, then the prediction is that the branch will be taken.
In table 7.2, the data suggest that if we consider the last two branch executions of
In the figure given above, the left bit signifies the prediction whereas the right
bit signifies the status of the branch (that is, whether the branch was taken or not). If
the left bit is "0", then the prediction is "not
taken"; otherwise it is predicted that the branch is taken. The actual outcome of the
branch instruction is given by the right bit: "branch not taken" is
signified by a "0", meaning the branch instruction did not jump, while
"branch taken" is signified by a "1". For instance, state 00 signifies that
the left zero bit predicts that the branch will not be taken and the right zero bit records
that the branch was indeed not taken. Thus, we stay in state 00 as long as the branch is
not taken. If the prediction is incorrect, we move to state 01, but still
"branch not taken" is predicted since we were incorrect just once. If the
prediction is then right, we move back to state 00. If the prediction turns out to be
incorrect again, then we change the prediction to "branch taken" and
move to state 10. Thus, two wrong predictions one after
the other make us change the prediction.
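One common way to realise this behaviour, changing the prediction only after two wrong guesses in a row, is sketched below; this is an illustrative model, not the exact state encoding of the figure.

# A two-bit dynamic branch predictor: the prediction is changed only after
# two consecutive mispredictions (one common formulation of the scheme above).
class TwoBitPredictor:
    def __init__(self):
        self.prediction = False   # False = "not taken", True = "taken"
        self.misses = 0           # consecutive mispredictions with the current prediction

    def predict(self):
        return self.prediction

    def update(self, taken):
        """Tell the predictor what the branch actually did."""
        if taken == self.prediction:
            self.misses = 0                  # correct: stay with the current prediction
        else:
            self.misses += 1
            if self.misses == 2:             # two wrong guesses in a row:
                self.prediction = taken      # flip the prediction
                self.misses = 0

p = TwoBitPredictor()
outcomes = [True, True, False, True, True, True]   # actual branch behaviour
for taken in outcomes:
    guess = p.predict()
    p.update(taken)
    print(f"predicted {'T' if guess else 'N'}, actual {'T' if taken else 'N'}")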
Figure 7.7: Operations found in the Trimedia TM32 CPU. The operation categories include load/store (signed, unsigned, register-indirect, indexed and scaled addressing), byte shuffles (SIMD type conversion, select byte, merge, pack), bit shifts, multiplies and multimedia operations (e.g. SIMD sum of products), integer arithmetic, floating point, special operations (cache and special registers) and branches, for a total of 207 operations.
One of the unusual characteristics from the desktop point of view is that the
programmer is allowed to specify five independent operations that can be issued
simultaneously. If five independent operations are not available
(which means that the others are dependent), then no operations (NOPs) are
placed in the remaining slots. We call this method of instruction coding a
VLIW (Very Long Instruction Word) method.
Because the Trimedia TM32 CPU has long instruction words that
frequently include NOPs, the instructions of Trimedia are compressed in
memory and decoded to their full size when they are
loaded into the cache. In Figure 7.8, we have shown the TM32 CPU instruction
mix for the EEMBC benchmarks.
Figure 7.8: TM32 CPU Instruction Mix for EEMBC Customer Benchmark
With unmodified source code, the instruction mix is similar to that of other
machines, even though there are more byte data transfers. The large number of
pack and merge instructions is due to aligning the data for the SIMD instructions.
In other words, the instruction mix of the "out-of-the-box" C code resembles that of
general-purpose computers, apart from the greater importance of byte data transfers.
The single instruction, multiple data (SIMD) instructions along with the pack
instructions are used by means of the hand-
7.9 Summary
• Implementation of branching is done by using a branch instruction. The
address of target instruction is included in the branch instruction
• The branch penalty can be reduced to one cycle. It can be efficiently
reduced further by means of Delayed branch execution.
• Effective processing of branches has become a cornerstone of increased
performance in ILP-processors.
• Branch prediction is a method which is basically utilised for handling the
problems related to branch. Different strategies of branch prediction
include:
❖ Fixed branch prediction
❖ Static branch prediction
❖ Dynamic branch prediction
• The new architecture, developed jointly by Hewlett-Packard
and Intel, is known as IA-64.
• IA-64 model is also known as Explicitly Parallel Instruction Computing
(EPIC).
• Itanium comprises a group of 64-bit Intel microprocessors which provides
execution to the Intel Itanium architecture. This architecture was initially
known as IA-64.
• Interesting strategies are represented by the Crusoe chips and Trimedia
for applying the concepts of Very long instruction word (VLIW) in an
embedded space. Trimedia processor may be the closest existing
processor to a "classic" processor of VLIW.
7.10 Glossary
References:
• Hwang, K. Advanced Computer Architecture. McGraw-Hill.
• Godse, D. A. & Godse, A. P. Computer Organization. Technical
Publications.
• Hennessy, John L., Patterson, David A. & Goldberg David. Computer
E-references:
• http://www.scribd.com/doc/46312470/37/Branch-processing,
• http://www.scribd.com/doc/60519412/15/Another-View-The-Trimedia-
TM32-CPU-151.
Structure:
8.1 Introduction
Objectives
8.2 Memory Hierarchy
Cache memory organisation
Basic operation of cache memory
Performance of cache memory
8.3 Cache Addressing Modes
Physical address mode
Virtual address mode
8.4 Mapping
Direct mapping
Associative mapping
8.5 Elements of Cache Design
8.6 Cache Performance
Improving cache performance
Techniques to reduce cache miss
Techniques to decrease cache miss penalty
Techniques to decrease cache hit time
8.7 Shared Memory organisation
8.8 Interleaved Memory Organisation
8.9 Bandwidth and Fault Tolerance
8.10 Consistency Models
Strong consistency models
Weak consistency models
8.11 Summary
8.12 Glossary
8.1 Introduction
You can say that Memory system is the important part of a computer system.
The input data, the instructions necessary to manipulate the input data and the
output data are all stored in the memory.
Now, we let us discuss cache memory and the cache memory organisation.
8.2.1 Cache memory organisation
A cache memory is an intermediate memory between two memories having
large difference between their speeds of operation. Cache memory is located
between main memory and CPU. It may also be inserted between CPU and
RAM to hold the most frequently used data and instructions. Communicating
with devices with a cache memory in between enhances the performance of a
system significantly. Locality of reference is the common observation that, over a
particular time interval, memory references tend to be confined to a few localised
memory areas. This can be illustrated by making use of a control structure
like a 'loop'. Cache memories exploit this situation to enhance the overall
performance.
Whenever a loop is executed in a program, the CPU executes the loop repeatedly.
Hence, instruction fetches for subroutines and loops exhibit locality of
reference. Data references are similarly localised: a table look-up procedure, for
example, continually refers to the memory portion in which the table is
stored. These are the properties of locality of reference. The fundamental idea of
cache organisation is that by keeping the most frequently accessed instructions and
data in the fast cache memory, the average memory access time will come close
to the access time of the cache.
8.2.2 Basic operation of cache memory
Whenever CPU needs to access the memory, cache is examined. If the file is
found in the cache, it is read from the fast memory. If the file is missing in cache
then main memory is accessed to read the file. A set of files just accessed by
CPU is then transferred from main memory to cache memory.
Cache Hit: When the addressed data or instruction is found in cache during
operation, it is called a cache hit.
Cache Miss: When the addressed data or instruction is not found in cache
during operation, it is called a cache miss. At the time of cache miss, a
complete cache block is loaded from the equivalent memory location at one
time.
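The benefit of a high hit rate can be expressed with the usual average access time relation; the timings used below are illustrative assumptions, not figures from this unit.

# Average memory access time with a cache:
#   t_avg = hit_ratio * t_cache + (1 - hit_ratio) * t_main
def average_access_time(hit_ratio, t_cache_ns, t_main_ns):
    return hit_ratio * t_cache_ns + (1 - hit_ratio) * t_main_ns

# Illustrative numbers: 10 ns cache, 100 ns main memory, 95% hit ratio.
print(average_access_time(0.95, 10, 100))   # 14.5 ns, close to the cache access time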
Implementation on Split Cache: When physical address is used in split
cache, both data cache and instruction cache are accessed with a physical
address after translation from MMU. In this design, the first-level D-cache uses
write-through policy as it is a small one (64 KB) and the second-level D-cache
uses write-back policy as it is larger (256 KB) with slower speed. The I-cache
is a single level cache that has a smaller size (64 KB). The implementation of
physical address mode on split cache is illustrated in figure 8.3.
In figure 8.4 you can see that a unified cache is accessed directly with the virtual
address; this is known as a virtual address cache. In the figure you can also
see that MMU translation and cache lookup are performed
simultaneously. The cache lookup operation does not use the physical address
produced by the MMU, but the physical address can be saved for later use. The virtual
address cache is motivated by the improved efficiency of fast cache
access, overlapped with the MMU translation.
Advantages of virtual address mode: The virtual mode of cache addressing
offers the following advantages:
• It eliminates the address translation time on a cache hit, and hits are far more
common than misses.
• Cache lookup is not delayed.
8.4 Mapping
Mapping refers to the translation of main memory addresses to cache
memory addresses. The transfer of information from main memory to cache
memory is conducted in units of cache blocks or cache lines. Blocks in the cache
are called block frames, which are denoted as
Bi for i = 1, 2, ..., j
where j is the total number of block frames in the cache.
The corresponding memory blocks are denoted as
Bj for j = 1, 2, ..., k
where k is the total number of blocks in memory. It is assumed that
k >> j, k = 2^s and j = 2^r
where s is the number of bits required to address a main memory block, and
r is the number of bits required to address a cache memory block.
There are four types of mapping schemes: direct mapping, associative
mapping, set-associative mapping, and sequential mapping. Here, we will
discuss the first two types of mapping.
8.4.1 Direct mapping
Associative memories are very costly compared to RAM due to the
additional logic associated with each cell. Suppose there are 2^j words in main
memory and 2^k words in cache memory. The j-bit memory address is
divided into two fields: k bits form the index field and the remaining j − k bits form
the tag field. The direct mapping cache organisation uses the k-bit index to access the
cache memory and the full j-bit address for main memory. Each cache word contains
data and the associated tag. Every memory block is assigned to a particular line of cache in
direct mapping. If a line already contains a memory block when a new block is
to be placed there, the old memory block is removed. Figure 8.5 illustrates this.
Tag bits are stored alongside the data bits as a new word enters the cache. Once the
processor has produced a memory request, the index field of the main memory
address is used to access the cache. The tag of the word in the cache is compared
with the tag field of the processor address. If this comparison is positive, there is a hit
and the word is read from the cache. If the comparison is negative, then it is a miss
and the word is read from main memory. The word is then stored in the cache with
the new tag, replacing the previous value.
Demerits of direct mapping: If two words with the same index but different tags
are accessed repeatedly, the hit ratio will fall substantially.
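The index/tag split described above can be sketched as follows; the address and index widths are illustrative assumptions.

# Direct-mapped cache lookup: the low-order k bits of the address index the
# cache line, and the remaining (j - k) bits are kept as the tag.
J_BITS = 12            # j-bit main-memory address (4096 words)  -- illustrative
K_BITS = 8             # k-bit cache index (256 cache words)     -- illustrative

cache = [None] * (1 << K_BITS)         # each entry: (tag, data) or None

def split(address):
    index = address & ((1 << K_BITS) - 1)
    tag = address >> K_BITS
    return tag, index

def read(address, main_memory):
    tag, index = split(address)
    entry = cache[index]
    if entry is not None and entry[0] == tag:
        return entry[1], "hit"
    data = main_memory[address]         # miss: fetch from main memory,
    cache[index] = (tag, data)          # store with the new tag (old value replaced)
    return data, "miss"

memory = list(range(1 << J_BITS))       # toy main memory
print(read(0x123, memory))              # miss on the first access
print(read(0x123, memory))              # hit on the second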
8.4.2 Associative mapping
Associative mapping is the quickest and most flexible cache organisation.
Both the address of a word and the content of the word are
stored in associative memory. This means the cache can store any word of main
memory.
For example, in figure 8.6, the CPU address is first placed in the argument register
and then the associative memory is searched for a match with this address.
Activity 1:
Visit an organisation and find out the cache memory size and the costs they
are incurring to hold it. Also try to retrieve the size of the data stored in the
cache memory.
Figure 8.7: Shared Memory Organisation
As long as the processor requires a single memory read at a time, the above
memory arrangement with a single MAR, a single MDR, a single Address bus
and a single Data bus is sufficient. However, if more than one read is required
simultaneously, the arrangement fails. This problem can be overcome by
adding as many address and data bus pairs along with respective MARs and
MDRs. But buses are expensive as equal number of bus controllers will be
required to carry out the simultaneous reads.
An alternative technique to handle simultaneous reads with comparatively low
cost overheads is memory interleaving. Under this scheme, the memory is
divided into a number of modules equal to the number of
simultaneous reads required, each with its own MAR and MDR but
sharing common data and address buses. For example, if an instruction
pipeline processor requires two simultaneous reads at a time, the memory is
partitioned into two modules having two MARs and two MDRs, as shown in
figure 8.10.
t1 = (θ / m) × (1 + (m − 1) / n)
where θ is the memory module cycle time and m is the number of interleaved modules.
As n → ∞ (a very long vector), t1 → θ/m = τ. As n → 1 (scalar access), t1 → θ.
t1 ^ 0
Fault Tolerance: High- and low-order interleaving could be mixed to generate
various interleaved memory organisations. Sequential addresses are assigned
in the high-order interleaved memory in each memory module.
This makes it easier to isolate faulty memory modules in a memory bank of m
memory modules. When one module failure is detected, the remaining
modules can still be used by opening a window in the address space. This fault
isolation cannot be carried out in a low-order interleaved memory, in which a
8.12 Glossary
• Associative Mapping: Associative mapping is used in cache
organization which is the quickest and most supple organization.
• Auxiliary memory: Memory which provides backup storage; it is not directly
accessible by the CPU but is connected with main memory.
• Cache Memory Organisation: A small, fast and costly memory that is
placed between a processor and main memory.
• Main memory: Refers to physical memory that is internal to the computer.
• Memory interleaving: A category of techniques for increasing memory
speed. NUMA Multiprocessing: Non-Uniform Memory Access
multiprocessing.
• RAM: Random-access memory
• Split Cache Design: A design where instructions and data are stored in
different caches for execution convenience.
8.14 Answers
Self Assessment Questions
1. Main memory
2. Auxiliary memory
3. Cache
4. Hit
5. Unified
6. Translation Lookaside Buffer
7. Mapping
8. Associative
9. Split
10. Instruction Cache
11. Hit time
12. Miss Penalty
13. Instruction-Level Parallelism
14. Thread-Level Parallelism
15. Interleaved Memory Organisation
16. MARs and MDRs
17. m, 1
18. Sequential addresses
19. Consistency
20. Strong, Weak
Terminal Questions
1. Memory hierarchy contains the Cache Memory Organisation. Refer
References:
• Kai Hwang: Advanced Computer Architecture: Parallelism, Scalability,
Programmability - MGH
• Michael J. Flynn: Computer Architecture, Pipelined & Parallel Processor
Design - Narosa.
• J. P. Hayes: Computer Architecture & Organisation - MGH
• Nicholas P. Carter; Schaum’s Outline of Computer Architecture; McGraw-Hill
Professional
E-references:
• www.csbdu.in/
• www.cs.hmc.edu/
• www.usenix.org
• cse.yeditepe.edu.tr/
Structure:
9.1 Introduction
Objectives
9.2 Use and Effectiveness of Vector Processors
9.3 Types of Vector Processing
Memory-memory vector architecture
Vector register architecture
9.4 Vector Length and Stride Issues
Vector length
Vector stride
9.5 Compiler Effectiveness in Vector Processors
9.6 Summary
9.7 Glossary
9.8 Terminal Questions
9.9 Answers
9.1 Introduction
In the previous unit, you learnt about memory hierarchy technology and related
aspects such as cache addressing modes, mapping, elements of cache
design, cache performance, shared & interleaved memory organisation,
bandwidth & fault tolerance, and consistency models. In this unit, we will
introduce you to vector processors.
A processor design which has the capacity to perform mathematical operations
on multiple data elements at the same time is called a vector processor.
This is in contrast to a scalar processor, which handles just a single
element at a time. A vector processor is also called an array processor. Vector
processing was first successfully implemented in the CDC STAR-100 and the
Advanced Scientific Computer (ASC) of Texas Instruments. The vector
technique was first fully exploited in the famous Cray-1. The Cray design had
eight vector registers which held sixty-four 64-bit words each. The Cray-1
usually had a performance of about 80 MFLOPS (million floating-point
operations per second), but with up to three chains running, it could peak
at 240 MFLOPS.
In this unit, you will study various aspects of these processors, such as their types, uses
and effectiveness. You will also study vector length and
stride issues, and compiler effectiveness in vector processors.
Objectives:
After studying this unit, you should be able to:
• state the use and effectiveness of vector processors
• identify the types of vector processing
• describe memory-memory vector architecture
• discuss the use of CDC Cyber 200 model 205 computer
• explain vector register architecture
• recognise the functional units of vector processor
• discuss vector instructions and vector processor implementation (CRAY-
1)
• solve vector length and stride issues
• explain compiler effectiveness in vector processors
vector processor. As the elements of the vector have to be fetched from
memory rather than from a register, it takes somewhat longer to start a vector operation;
this is partly because of the cost of a memory access. An instance of a
memory-memory vector processor is the CDC Cyber 205.
Because of the ability to overlap memory accesses as well as the possible
reuse of vector registers, ‘vector-register processors’ are normally
more productive and efficient than ‘memory-memory vector
processors’. However, as the length of the vectors in a computation rises,
this difference in effectiveness between the two kinds of architectures drops.
In fact, memory-memory vector processors can prove more
efficient for very long vectors. However, experience shows that
shorter vectors are more commonly used.
Based on the concepts introduced in the CDC Star 100, the first commercial
model of the CDC Cyber 205 was delivered in 1981. This supercomputer
is a memory-memory vector machine and fetches vectors directly from
memory to load the pipelines as well as stores the pipeline outcomes directly
to memory. Besides, it does not contain any vector registers. Consequently,
the vector pipelines have large start-up times. Instead of pipelines designed
for specific operations, such a machine consists of up to four general-purpose
pipelines. It also provides gather as well as scatter functions. ETA-10 is an
updated modern shared-memory multiprocessor version of the CDC Cyber
205.The next section provides more detail of this model.
CDC Cyber 200 model 205 computer overview: The Model 205 computer is
a super-scale, high-speed, logical and arithmetic computing system. It utilises
LSI circuits in both the scalar and vector processors that improve performance
to complement the many advanced features that were implemented in the
STAR-100 and CYBER 203 (these are the two Control Data Corp. computers
with built-in vector processors), like hardware macroinstructions, virtual
addressing and stream processing. The Model 205 contains separate scalar
and vector processors particularly designed for sequential and parallel
operations on single bit, 8-bit bytes, and 32-bit or 64-bit floating-point operands
and vector elements.
The central memory of the Model 205 is a high-performance semiconductor
memory with single-error correction, double-error detection (SECDED) on
each 32-bit half word, providing extremely high storage integrity. Virtual
input/output ports.
9.3.2 Vector register architecture
In a vector-register processor, the entire vector operations excluding load and
store are in the midst of the vector registers. Such architectures are the vector
equivalent of load-store architecture. Since the late 1980s, all major vector
computers have been using a vector-register architecture which includes the
Cray Research processors (Cray-1, Cray-2, X-MP, YMP, C90, T90 and SV1),
Japanese supercomputers (NEC SX/2 through SX/5, Fujitsu VP200 through
VPP5000, and the Hitachi S820 and S-8300), and the mini-
supercomputers(Convex C-1 through C-4).
All vector operations are memory to memory in a memory-memory vector
processor; the earliest vector computers and CDC’s vector computers were of
this kind. Vector-register architectures possess various benefits over vector
memory-memory architectures. A vector memory-memory
architecture has to write all intermediate results to memory and
later read them back from memory. A vector-register architecture is able to
keep intermediate results in the vector registers, close to the vector
functional units, decreasing temporary storage needs, inter-instruction latency
and memory bandwidth needs.
If a vector result is required by multiple other vector instructions, a
memory-memory architecture must read it from memory repeatedly,
whereas a vector-register machine can reuse the value from the vector registers,
further decreasing memory bandwidth needs. For these reasons, vector-register
machines have proved to be more effective in practice.
Components of a vector register processor: The major components of the
vector unit of a vector register machine are as given below:
1. Vector registers: There are many vector registers that can perform
different vector operations in an overlapped manner. Every vector register
is a fixed-length bank that consists of one vector with multiple elements
and each element is 64-bit in length. There are also many read and write
ports. A pair of crossbars connects these ports to the inputs/ outputs of
functional unit.
2. Scalar registers: The scalar registers are also linked to the functional
units with the help of the pair of crossbars. They are used for various
purposes such as computing addresses for passing to the vector
load/store unit and as buffer for input data to the vector registers.
3. Vector functional units: These units are generally floating-point units that
are completely pipelined. They are able to initiate a new operation on each
clock cycle. They comprise all operation units that are utilised by the vector
instructions.
4. Vector load and store unit: This unit can also be pipelined and perform
an overlapped but independent transfer to or from the vector registers.
5. Control unit: This unit decodes instructions and coordinates among the functional units.
It can detect data hazards as well as structural hazards: data hazards are
conflicts in register accesses, while structural hazards are conflicts for
functional units.
Figure 9.1 gives you a clear picture of the above mentioned functional units
of vector processor.
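As a rough picture of how these components cooperate, the sketch below models a vector-register add: operands are loaded from memory into fixed-length vector registers, the vector functional unit operates element by element, and the result register is stored back. The register length of 64 elements matches the Cray-1 mentioned earlier; everything else is an illustrative assumption.

# Toy model of a vector-register machine: load, element-wise add, store.
MVL = 64                                   # maximum vector length (64 elements, as in the Cray-1)

def vload(memory, base, n):
    """Vector load/store unit: load n elements (n <= MVL) into a vector register."""
    return memory[base:base + n]

def vadd(v1, v2):
    """Vector functional unit: element-wise add of two vector registers."""
    return [a + b for a, b in zip(v1, v2)]

def vstore(memory, base, vreg):
    memory[base:base + len(vreg)] = vreg

mem = list(range(200))
V1 = vload(mem, 0, MVL)                    # V1 <- mem[0..63]
V2 = vload(mem, 64, MVL)                   # V2 <- mem[64..127]
V3 = vadd(V1, V2)                          # V3 <- V1 + V2
vstore(mem, 128, V3)                       # mem[128..191] <- V3
print(mem[128:132])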