Computer Architecture Assignment Help

The document discusses improving branch prediction performance by adding a branch history table (BHT) and branch target buffer (BTB) to a pipeline. It also analyzes cache performance for a vector addition loop by varying cache block size and capacity. Key aspects covered include the operation and interactions of the BHT and BTB, analyzing pipeline timing with branch prediction, rewriting code to optimize for a single-block cache, and deriving equations for average memory access time under different cache configurations.

For any assignment-related queries, call us at: +1 678 648 4277

You can mail us at: [email protected]
or reach us at: https://www.architectureassignmenthelp.com/
Computer Architecture Assignment Help
Part A: Branch Prediction

Consider a fetch pipeline based on the UltraSparc-III processor (as seen in Lecture 5). In this part, we evaluate the impact of branch prediction on the processor's performance. Assume there are no branch delay slots.

The following table clarifies when the direction and the target of a branch/jump are known. (B is the Branch Address Calc/Begin Decode stage; R is the Register Read stage.)

Instruction    Taken known? (at the end of)    Target known? (at the end of)
BEQZ/BNEZ      R                               B
J              B (always taken)                B
JR             B (always taken)                R
Question 1.

We add a branch history table (BHT) in the fetch pipeline as shown below.
The A stage of the pipeline generates the next sequential PC address
(PC+4) by default, and this is used to fetch the next instruction unless the
PC is redirected by a branch prediction or an actual branch execution.
 
In the B stage (Branch Address Calc/Begin Decode), a conditional branch
instruction (BEQZ/BNEZ) checks the BHT, but an unconditional jump does
not. If the branch is predicted to be taken, some of the instructions are
flushed and the PC is redirected to the calculated branch target address.
The BHT determines if the PC needs to be redirected at the end of the B
stage and the new PC is inserted at the beginning of the A stage. (The A
stage also contains the multiplexing circuitry to select the correct PC.)
 
To improve the branch performance further, we decide to add a branch
target buffer (BTB) as well. Here is a description of the operation of the
BTB.

1. The BTB holds entry_PC, target_PC pairs for jumps and branches
predicted to be taken. Assume that the target_PC predicted by the BTB is
always correct for this question. (Yet the direction still might be wrong.)

2. The BTB is checked every cycle. If there is a match for the current fetch
PC, the PC is redirected to the target_PC predicted by the BTB (unless PC
is redirected by an older instruction). The BTB determines if the PC needs
to be redirected at the end of the P stage and the new PC is inserted at
the beginning of the A stage.
For a particular branch, if the BHT predicts “taken” and the BTB did not
make a prediction, the BHT will redirect the PC. If the BHT predicts “not-
taken” it does not redirect the PC.
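
The interaction between the two predictors amounts to a priority rule: a redirect from an older, executed instruction wins over a BTB redirect, which wins over a BHT redirect, and the BHT only ever redirects toward a taken target. The Python sketch below collapses these checks (which in the real pipeline happen at the end of the P and B stages, respectively) into one decision for a conditional branch; all names and helper arguments are illustrative, not from the handout.

# Minimal sketch of the redirect priority for a conditional branch at 'pc'.
# The real hardware spreads these checks across the P and B stages rather
# than making one combined decision.
def next_fetch_pc(pc, btb, bht_predicts_taken, decoded_target,
                  execute_redirect=None):
    if execute_redirect is not None:
        return execute_redirect        # an older, executed branch always wins
    if pc in btb:
        return btb[pc]                 # BTB hit: redirect at the end of the P stage
    if bht_predicts_taken(pc):
        return decoded_target          # BHT "taken" with no BTB prediction:
                                       # redirect to the target computed in B
    return pc + 4                      # otherwise keep the sequential PC from A

# Example: BTB miss but BHT predicts taken -> fetch from the decoded target.
btb = {0x1000: 0x2000}
print(hex(next_fetch_pc(0x1004, btb, lambda pc: True, decoded_target=0x1800)))   # 0x1800
print(hex(next_fetch_pc(0x1000, btb, lambda pc: False, decoded_target=0x1800)))  # 0x2000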

a) Fill out the following table of the number of pipeline bubbles (only for
conditional branches).
For both part b) and part c) below, assume the BTB contains few entries
and is fully associative, and the BHT contains many entries and is direct
mapped.
 
b) For the case where the BHT predicts ‘not taken’ and the BTB predicts
‘taken’, why is it more probable that the BTB made a more accurate
prediction than the BHT?
 
Since the BTB stores the full entry_PC as a tag, a BTB hit is certain to correspond to that particular branch. Also, because the BTB is much smaller than the BHT, when there is a BTB hit it is likely to be for a branch that was taken recently. The BHT, in contrast, is direct-mapped and untagged, so the entry it returns may have been updated by a different, conflicting branch and may not actually describe this one. Therefore, in this case the BTB is more likely to make the better prediction.
 
c) For the case where the BHT predicts ‘taken’ and the BTB predicts ‘not
taken’, why is it more probable that the BHT made a more accurate
prediction than the BTB?
 
If a branch is not found in the BTB, it could mean either of two things: 1) the branch was predicted not taken, or 2) the branch was predicted taken but its entry was evicted from the BTB. Since the BTB has only a few entries, it is quite possible that the branch was simply evicted. The BHT has many more entries, so the branch is likely still tracked there, and the BHT will probably make the better prediction in this case.
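
The structural difference behind both answers is that the BTB is tagged with the full entry_PC while the direct-mapped BHT is untagged, so two branches can silently share a BHT entry. A minimal Python sketch, with the table size and index function as assumptions:

BHT_ENTRIES = 1024            # many entries, direct-mapped, untagged (size assumed)
bht = [0] * BHT_ENTRIES       # each entry is a small counter; no PC tag is stored

def bht_index(pc):
    return (pc >> 2) % BHT_ENTRIES    # only low PC bits select the entry

btb = {}                      # few entries, fully associative: entry_PC -> target_PC

def btb_hit(pc):
    return pc in btb          # a hit requires the full PC to match

# Two branches whose addresses differ by 4 * BHT_ENTRIES bytes share the same
# BHT counter, so its contents may describe the other branch:
pc_a = 0x00400000
pc_b = pc_a + 4 * BHT_ENTRIES
print(bht_index(pc_a) == bht_index(pc_b))   # True: the BHT entry is shared
print(btb_hit(pc_a), btb_hit(pc_b))         # BTB entries, by contrast, never alias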
Question 2.
We will be examining the behavior of the branch prediction pipeline for
the following program:

Given a snapshot of the BTB and the BHT states on entry to the loop, fill
in the timing diagram of one iteration (plus two instructions) on the next
page. (Don’t worry about the stages beyond the E stage.) We assume the
following for this question:

1. The initial values of R5 and R1 are zero, so BR1 is always taken.
2. We disregard any possible structural hazards. There are no pipeline bubbles (except for those created by branches).
3. We fetch only one instruction per cycle.
4. We use a two-bit predictor whose state diagram is shown below. In state 1X we guess not taken; in state 0X we guess taken. BR1 and BR2 do not conflict in the BHT. (A code sketch of such a predictor follows this list.)
5. We use a two-entry fully-associative BTB with the LRU replacement policy.
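
The handout's state diagram is not reproduced here, but the encoding in item 4 is consistent with a standard two-bit saturating counter. The sketch below is one such predictor, assuming the usual saturating transitions (the actual diagram may differ in detail):

# Two-bit counter states 0..3 (binary 00, 01, 10, 11).  Per the encoding above,
# states 0X (0, 1) predict taken and states 1X (2, 3) predict not taken.
def predict(state):
    return state <= 1               # True means "predict taken"

def update(state, taken):
    if taken:
        return max(state - 1, 0)    # move toward the strongly-taken state 00
    else:
        return min(state + 1, 3)    # move toward the strongly-not-taken state 11

# Example: a branch that is always taken quickly saturates at state 0 ("taken").
s = 2
for _ in range(3):
    s = update(s, taken=True)
print(s, predict(s))                # 0 True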
Question 3.
 
What will be the BTB and BHT states right after the 6 instructions in
Question 2 have updated the branch predictors’ states? Fill in (1) the BTB
and (2) the entries corresponding to BR1 and BR2 in the BHT.

Part B: Cache Performance Analysis

The following MIPS code loop adds the vectors X, Y, and Z and stores the
result in vector W. All vectors are of length n and contain words of four
bytes each.
 
W[i] = X[i] + Y[i] + Z[i] for 0 <= i < n
Registers 1 through 5 are initialized with the following values:
•R1 holds the address of X
•R2 holds the address of Y
•R3 holds the address of Z
•R4 holds the address of W
•R5 holds n

All vector addresses are word aligned. The MIPS code for our function is:

loop:
LW R6, 0(R1) ; load X
LW R7, 0(R2) ; load Y
LW R8, 0(R3) ; load Z
ADD R9, R6, R7 ; do the add
ADD R9, R9, R8
SW R9, 0(R4) ; store W
ADDI R1, R1, 4 ; increment the vector indices
ADDI R2, R2, 4 ; 4 bytes per word
ADDI R3, R3, 4
ADDI R4, R4, 4
ADDI R5, R5, -1 ; decrement n
BNEZ R5, loop
We run the loop using a cache simulator that reports the
average memory access time for a range of cache
configurations.
 
In Figure B-1 (on the next page) we plot the results of one set
of experiments. For these simulations, we used a single level
cache of fixed capacity and varied the block size. We used large
values of n to ensure the loop settled into a steady state. For
each block size, we plot the average memory access time, i.e.,
the average time in cycles required to access each data
memory operand (cache hit time plus memory access time).
Assume that capacity (C) and block size (b) are restricted to
powers of 2 and measured in four byte words. The cache is fully
associative and uses an LRU replacement policy. Assume the
processor stalls during a cache miss and that the cache miss
penalty is independent of the block size.
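
A small simulation of this kind is enough to reproduce the shape of Figure B-1. The sketch below is a rough model, not the simulator actually used: it replays the loop's four data accesses per iteration (three loads and one store) against a fully-associative LRU cache with capacity C and block size b in words, and reports Tave = h + p * (miss rate), where h is the hit time and p the extra miss penalty. The values of h, p, n, and the vector base addresses are illustrative assumptions.

from collections import OrderedDict

def amat(C_words, b_words, n, h=1, p=20):
    num_blocks = C_words // b_words
    block_bytes = 4 * b_words
    # Assumed word-aligned, non-overlapping base addresses for X, Y, Z, W.
    X, Y, Z, W = 0x10000, 0x20000, 0x30000, 0x40000
    cache = OrderedDict()                 # block number -> None, oldest first (LRU)
    hits = misses = 0
    for i in range(n):
        for base in (X, Y, Z, W):         # the loop's LW X, LW Y, LW Z, SW W
            blk = (base + 4 * i) // block_bytes
            if blk in cache:
                hits += 1
                cache.move_to_end(blk)    # refresh LRU position
            else:
                misses += 1
                cache[blk] = None
                if len(cache) > num_blocks:
                    cache.popitem(last=False)   # evict least recently used block
    return h + p * misses / (hits + misses)

# Example: with C = 512 words the access time drops sharply once b <= 128 (C >= 4b).
for b in (32, 64, 128, 256, 512):
    print(b, round(amat(512, b, n=4096), 3))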
Question 4
a) Based on the results in Figure B-1, what is the cache capacity C?
 
When C < 4b, there are fewer than 4 blocks in the cache, so all 4 vectors (X, Y, Z, W) cannot fit in the cache at the same time. With an LRU replacement policy, the block holding each vector's data is evicted before that vector is accessed again on the next iteration, resulting in a miss on every access.
When C >= 4b, there are enough blocks to hold one block from each of the 4 vectors, so a vector's block survives in the cache until the next iteration. Each block then supplies hits until either the loop finishes or the end of the block is reached and the next block of that vector must be loaded. Larger values of b therefore result in fewer compulsory misses and a smaller average memory access time.
The curve should therefore drop sharply once C = 4b and then climb again as b decreases further. In Figure B-1 the drop occurs at b = 128, so C = 4 × 128 = 512 words, or 2K bytes.

b) Assume that the cache capacity, C, remains the same as in a). When b
is 256, what is the relation for the average memory access time in terms
of 'b', the hit time 'h', the miss penalty 'p', and the number of elements in
each vector 'n'?
 
When b >= 256 words, C < 4b, so every access is a miss. The average memory access time is:
Tave = h + p (where p is the miss penalty added on top of the hit time h), or
Tave = p (where p is the total time for servicing a miss)
c) Assume that the cache capacity, C, remains the same as in a). What is the relation for the average memory access time when b <= 128?
 
When b<=128, C>=4b. There will be 1 cold start miss for a given vector,
then b-1 hits for the remaining elements on each cache line. This means
there will be 1 miss every b accesses, resulting in a miss rate of 1/b.
Tave = h + p*(miss rate) = h + p/b
 Alternatively, if we assumed that p is the total time for servicing a miss,
we would get
 Tave = (hit rate) * h + (miss rate) * p
= (1 - (1/b))*h + (1/b)*p
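
Plugging illustrative numbers into the two regimes (h and p are assumptions, not given in the problem) shows the expected behavior:

h, p = 1, 20
b = 128
print(h + p / b)                     # C >= 4b: one miss every b accesses -> 1.156
print((1 - 1/b) * h + (1/b) * p)     # same regime, if p is the total miss time
b = 256
print(h + p)                         # C < 4b: every access misses -> 21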

Question 5
Now let us replace the data cache with one that consists of only a
SINGLE block (with 4 words total cache capacity). Explain how to rewrite
the vector code to get better performance from the single-block cache.
Note that total code size is not a constraint in this problem.
 
Only a single 4-word block from one of the vectors can be held in the cache at any given moment. Thus, to optimize the loop, we unroll it 4 times and load each successive element into a different register. If we make 4 consecutive accesses to the same vector, we can amortize the cost of one miss over the 3 subsequent hits.
This solution results in 5 block-level memory operations per unrolled iteration: 4 block fills (one miss for each of X, Y, Z, and W) and 1 write-back of the dirty W block before the next block is loaded. This gives:
5 memory ops/iteration * (n/4) iterations =
5n/4 total memory ops
or
5/4 memory ops per original loop iteration
 
For the solution below, we assume that (R5) % 4 = 0 (R5 contains the loop counter), since the unrolled loop works on 4 elements per iteration.
loop:
LW R6, 0(R1) ; load X1
LW R10, 4(R1) ; load X2
LW R14, 8(R1) ; load X3
LW R18, 12(R1) ; load X4
LW R7, 0(R2) ; load Y1
LW R11, 4(R2) ; load Y2
LW R15, 8(R2) ; load Y3
LW R19, 12(R2) ; load Y4
LW R8, 0(R3) ; load Z1
LW R12, 4(R3) ; load Z2
LW R16, 8(R3) ; load Z3
LW R20, 12(R3) ; load Z4
ADD R9, R6, R7 ; do the add for element 1
ADD R9, R9, R8
ADD R13, R10, R11 ; do the add for element 2
ADD R13, R13, R12
ADD R17, R14, R15 ; do the add for element 3
ADD R17, R17, R16
ADD R21, R18, R19 ; do the add for element 4
ADD R21, R21, R20
SW R9, 0(R4) ; store W1
SW R13, 4(R4) ; store W2
SW R17, 8(R4) ; store W3
SW R21, 12(R4) ; store W4
ADDI R1, R1, 16 ; increment the vector indices
ADDI R2, R2, 16 ; 4 words = 16 bytes per iteration
ADDI R3, R3, 16
ADDI R4, R4, 16
ADDI R5, R5, -4 ; decrement the count by 4
BNEZ R5, loop
We gave partial credit for the following answer:
 
Interleave the elements of the vectors X, Y, Z, W, so that one cache block
will contain one element from each array.
This layout results in 1 miss per iteration of the loop. However, since we store to W, the block is marked dirty and must be written back to memory each iteration before the next block is fetched. That gives 2 block-level memory operations per iteration, or 2n in total, whereas the unrolled solution above needs only 5/4 memory operations per original iteration, so it is the better answer.
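
Counting block-level memory traffic makes this comparison concrete. The sketch below assumes a write-allocate, write-back policy for the single 4-word block, matching the reasoning above:

def unrolled_traffic(n):
    # Per unrolled iteration (4 elements): 4 block fills (X, Y, Z, W) plus
    # 1 write-back of the dirty W block when it is evicted on the next pass.
    return 5 * (n // 4)

def interleaved_traffic(n):
    # One interleaved block {X[i], Y[i], Z[i], W[i]} per element: 1 fill plus
    # 1 write-back of the dirty block on every iteration.
    return 2 * n

n = 4096
print(unrolled_traffic(n), interleaved_traffic(n))   # 5n/4 = 5120 vs 2n = 8192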
