Computer Architecture Assignment Help

The document discusses improving branch prediction performance by adding a branch history table (BHT) and branch target buffer (BTB) to a pipeline. It also analyzes cache performance for a vector addition loop by varying cache block size and capacity. Key aspects covered include the operation and interactions of the BHT and BTB, analyzing pipeline timing with branch prediction, rewriting code to optimize for a single-block cache, and deriving equations for average memory access time under different cache configurations.

For any assignment-related queries, call us at: +1 678 648 4277

You can mail us at: [email protected]
or reach us at: https://www.architectureassignmenthelp.com/
Computer Architecture Assignment Help
Part A: Branch Prediction

Consider a fetch pipeline based on the UltraSparc-III processor (as seen in Lecture 5). In this part, we evaluate the impact of branch prediction on the processor's performance. Assume there are no branch delay slots.

The following table clarifies when the direction and the target of a branch/jump are known. (B is the Branch Address Calc/Begin Decode stage; R is the Register Read stage.)

Instruction    Taken known? (at the end of)    Target known? (at the end of)
BEQZ/BNEZ      R                               B
J              B (always taken)                B
JR             B (always taken)                R
Question 1.

We add a branch history table (BHT) in the fetch pipeline as shown below.
The A stage of the pipeline generates the next sequential PC address
(PC+4) by default, and this is used to fetch the next instruction unless the
PC is redirected by a branch prediction or an actual branch execution.
 
In the B stage (Branch Address Calc/Begin Decode), a conditional branch
instruction (BEQZ/BNEZ) checks the BHT, but an unconditional jump does
not. If the branch is predicted to be taken, some of the instructions are
flushed and the PC is redirected to the calculated branch target address.
The BHT determines if the PC needs to be redirected at the end of the B
stage and the new PC is inserted at the beginning of the A stage. (The A
stage also contains the multiplexing circuitry to select the correct PC.)
 
To improve the branch performance further, we decide to add a branch
target buffer (BTB) as well. Here is a description of the operation of the
BTB.

1. The BTB holds entry_PC, target_PC pairs for jumps and branches
predicted to be taken. Assume that the target_PC predicted by the BTB is
always correct for this question. (Yet the direction still might be wrong.)

2. The BTB is checked every cycle. If there is a match for the current fetch
PC, the PC is redirected to the target_PC predicted by the BTB (unless PC
is redirected by an older instruction). The BTB determines if the PC needs
to be redirected at the end of the P stage and the new PC is inserted at
the beginning of the A stage.
For a particular branch, if the BHT predicts “taken” and the BTB did not
make a prediction, the BHT will redirect the PC. If the BHT predicts “not-
taken” it does not redirect the PC.
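
The interaction between the two predictors amounts to a priority rule: a redirect from an older, executed instruction wins over a BTB redirect, which wins over a BHT redirect, and the BHT only ever redirects toward a taken target. The Python sketch below collapses these checks (which in the real pipeline happen at the end of the P and B stages, respectively) into one decision for a conditional branch; all names and helper arguments are illustrative, not from the handout.

# Minimal sketch of the redirect priority for a conditional branch at 'pc'.
# The real hardware spreads these checks across the P and B stages rather
# than making one combined decision.
def next_fetch_pc(pc, btb, bht_predicts_taken, decoded_target,
                  execute_redirect=None):
    if execute_redirect is not None:
        return execute_redirect        # an older, executed branch always wins
    if pc in btb:
        return btb[pc]                 # BTB hit: redirect at the end of the P stage
    if bht_predicts_taken(pc):
        return decoded_target          # BHT "taken" with no BTB prediction:
                                       # redirect to the target computed in B
    return pc + 4                      # otherwise keep the sequential PC from A

# Example: BTB miss but BHT predicts taken -> fetch from the decoded target.
btb = {0x1000: 0x2000}
print(hex(next_fetch_pc(0x1004, btb, lambda pc: True, decoded_target=0x1800)))   # 0x1800
print(hex(next_fetch_pc(0x1000, btb, lambda pc: False, decoded_target=0x1800)))  # 0x2000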

a) Fill out the following table of the number of pipeline bubbles (only for
conditional branches).
For both part b) and part c) below, assume the BTB contains few entries
and is fully associative, and the BHT contains many entries and is direct
mapped.
 
b) For the case where the BHT predicts ‘not taken’ and the BTB predicts
‘taken’, why is it more probable that the BTB made a more accurate
prediction than the BHT?
 
Since the BTB stores the full entry_PC as a tag, a BTB hit is certain to correspond to that particular branch. Also, because the BTB is much smaller than the BHT, when there is a BTB hit it is likely to be for a branch that was taken recently. The BHT, in contrast, is direct-mapped and untagged, so the entry it returns may have been updated by a different, conflicting branch and may not actually describe this one. Therefore, in this case the BTB is more likely to make the better prediction.
 
c) For the case where the BHT predicts ‘taken’ and the BTB predicts ‘not
taken’, why is it more probable that the BHT made a more accurate
prediction than the BTB?
 
If a branch is not found in the BTB, it could mean either of two things: 1) the branch was predicted not taken, or 2) the branch was predicted taken but its entry was evicted from the BTB. Since the BTB has only a few entries, it is quite possible that the branch was simply evicted. The BHT has many more entries, so the branch is likely still tracked there, and the BHT will probably make the better prediction in this case.
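
The structural difference behind both answers is that the BTB is tagged with the full entry_PC while the direct-mapped BHT is untagged, so two branches can silently share a BHT entry. A minimal Python sketch, with the table size and index function as assumptions:

BHT_ENTRIES = 1024            # many entries, direct-mapped, untagged (size assumed)
bht = [0] * BHT_ENTRIES       # each entry is a small counter; no PC tag is stored

def bht_index(pc):
    return (pc >> 2) % BHT_ENTRIES    # only low PC bits select the entry

btb = {}                      # few entries, fully associative: entry_PC -> target_PC

def btb_hit(pc):
    return pc in btb          # a hit requires the full PC to match

# Two branches whose addresses differ by 4 * BHT_ENTRIES bytes share the same
# BHT counter, so its contents may describe the other branch:
pc_a = 0x00400000
pc_b = pc_a + 4 * BHT_ENTRIES
print(bht_index(pc_a) == bht_index(pc_b))   # True: the BHT entry is shared
print(btb_hit(pc_a), btb_hit(pc_b))         # BTB entries, by contrast, never alias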
Question 2.
We will be examining the behavior of the branch prediction pipeline for
the following program:

Given a snapshot of the BTB and the BHT states on entry to the loop, fill
in the timing diagram of one iteration (plus two instructions) on the next
page. (Don’t worry about the stages beyond the E stage.) We assume the
following for this question:

1. The initial values of R5 and R1 are zero, so BR1 is always taken.
2. We disregard any possible structural hazards. There are no pipeline bubbles (except for those created by branches).
3. We fetch only one instruction per cycle.
4. We use a two-bit predictor whose state diagram is shown below. In state 1X we guess not taken; in state 0X we guess taken. BR1 and BR2 do not conflict in the BHT. (A code sketch of such a predictor follows this list.)
5. We use a two-entry fully-associative BTB with the LRU replacement policy.
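
The handout's state diagram is not reproduced here, but the encoding in item 4 is consistent with a standard two-bit saturating counter. The sketch below is one such predictor, assuming the usual saturating transitions (the actual diagram may differ in detail):

# Two-bit counter states 0..3 (binary 00, 01, 10, 11).  Per the encoding above,
# states 0X (0, 1) predict taken and states 1X (2, 3) predict not taken.
def predict(state):
    return state <= 1               # True means "predict taken"

def update(state, taken):
    if taken:
        return max(state - 1, 0)    # move toward the strongly-taken state 00
    else:
        return min(state + 1, 3)    # move toward the strongly-not-taken state 11

# Example: a branch that is always taken quickly saturates at state 0 ("taken").
s = 2
for _ in range(3):
    s = update(s, taken=True)
print(s, predict(s))                # 0 True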
Question 3.
 
What will be the BTB and BHT states right after the 6 instructions in
Question 2 have updated the branch predictors’ states? Fill in (1) the BTB
and (2) the entries corresponding to BR1 and BR2 in the BHT.

Part B: Cache Performance Analysis

The following MIPS code loop adds the vectors X, Y, and Z and stores the
result in vector W. All vectors are of length n and contain words of four
bytes each.
 
W[i] = X[i] + Y[i] + Z[i] for 0 <= i < n
Registers 1 through 5 are initialized with the following values:
•R1 holds the address of X
•R2 holds the address of Y
•R3 holds the address of Z
•R4 holds the address of W
•R5 holds n

All vector addresses are word aligned. The MIPS code for our function is:

loop:
LW R6, 0(R1) ; load X
LW R7, 0(R2) ; load Y
LW R8, 0(R3) ; load Z
ADD R9, R6, R7 ; do the add
ADD R9, R9, R8
SW R9, 0(R4) ; store W
ADDI R1, R1, 4 ; increment the vector indices
ADDI R2, R2, 4 ; 4 bytes per word
ADDI R3, R3, 4
ADDI R4, R4, 4
ADDI R5, R5, -1 ; decrement n
BNEZ R5, loop
We run the loop using a cache simulator that reports the
average memory access time for a range of cache
configurations.
 
In Figure B-1 (on the next page) we plot the results of one set
of experiments. For these simulations, we used a single level
cache of fixed capacity and varied the block size. We used large
values of n to ensure the loop settled into a steady state. For
each block size, we plot the average memory access time, i.e.,
the average time in cycles required to access each data
memory operand (cache hit time plus memory access time).
Assume that capacity (C) and block size (b) are restricted to
powers of 2 and measured in four byte words. The cache is fully
associative and uses an LRU replacement policy. Assume the
processor stalls during a cache miss and that the cache miss
penalty is independent of the block size.
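
A small simulation of this kind is enough to reproduce the shape of Figure B-1. The sketch below is a rough model, not the simulator actually used: it replays the loop's four data accesses per iteration (three loads and one store) against a fully-associative LRU cache with capacity C and block size b in words, and reports Tave = h + p * (miss rate), where h is the hit time and p the extra miss penalty. The values of h, p, n, and the vector base addresses are illustrative assumptions.

from collections import OrderedDict

def amat(C_words, b_words, n, h=1, p=20):
    num_blocks = C_words // b_words
    block_bytes = 4 * b_words
    # Assumed word-aligned, non-overlapping base addresses for X, Y, Z, W.
    X, Y, Z, W = 0x10000, 0x20000, 0x30000, 0x40000
    cache = OrderedDict()                 # block number -> None, oldest first (LRU)
    hits = misses = 0
    for i in range(n):
        for base in (X, Y, Z, W):         # the loop's LW X, LW Y, LW Z, SW W
            blk = (base + 4 * i) // block_bytes
            if blk in cache:
                hits += 1
                cache.move_to_end(blk)    # refresh LRU position
            else:
                misses += 1
                cache[blk] = None
                if len(cache) > num_blocks:
                    cache.popitem(last=False)   # evict least recently used block
    return h + p * misses / (hits + misses)

# Example: with C = 512 words the access time drops sharply once b <= 128 (C >= 4b).
for b in (32, 64, 128, 256, 512):
    print(b, round(amat(512, b, n=4096), 3))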
Question 4
a) Based on the results in Figure B-1, what is the cache capacity C?
 
When C < 4b, there are fewer than 4 blocks in the cache, so all 4 vectors (X, Y, Z, W) cannot fit in the cache at the same time. With an LRU replacement policy, the block holding each vector's data is evicted before that vector is accessed again on the next iteration, resulting in a miss on every access.
When C >= 4b, there are enough blocks to hold one block from each of the 4 vectors, so a vector's block survives in the cache until the next iteration. Each block then supplies hits until either the loop finishes or the end of the block is reached and the next block of that vector must be loaded. Larger values of b therefore result in fewer compulsory misses and a smaller average memory access time.
The curve should therefore drop sharply once C = 4b and then climb again as b decreases further. In Figure B-1 the drop occurs at b = 128, so C = 4 × 128 = 512 words, or 2K bytes.

b) Assume that the cache capacity, C, remains the same as in a). When b
is 256, what is the relation for the average memory access time in terms
of 'b', the hit time 'h', the miss penalty 'p', and the number of elements in
each vector 'n'?
 
When b >= 256 words, C < 4b, so every access is a miss. The average memory access time is:
Tave = h + p (where p is the miss penalty added on top of the hit time h), or
Tave = p (where p is the total time for servicing a miss)
c) Assume that the cache capacity, C, remains the same as in a). What is the relation for the average memory access time when b <= 128?
 
When b<=128, C>=4b. There will be 1 cold start miss for a given vector,
then b-1 hits for the remaining elements on each cache line. This means
there will be 1 miss every b accesses, resulting in a miss rate of 1/b.
Tave = h + p*(miss rate) = h + p/b
 Alternatively, if we assumed that p is the total time for servicing a miss,
we would get
 Tave = (hit rate) * h + (miss rate) * p
= (1 - (1/b))*h + (1/b)*p
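
Plugging illustrative numbers into the two regimes (h and p are assumptions, not given in the problem) shows the expected behavior:

h, p = 1, 20
b = 128
print(h + p / b)                     # C >= 4b: one miss every b accesses -> 1.156
print((1 - 1/b) * h + (1/b) * p)     # same regime, if p is the total miss time
b = 256
print(h + p)                         # C < 4b: every access misses -> 21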

Question 5
Now let us replace the data cache with one that consists of only a
SINGLE block (with 4 words total cache capacity). Explain how to rewrite
the vector code to get better performance from the single-block cache.
Note that total code size is not a constraint in this problem.
 
Only a single 4-word block from one of the vectors can be held in the cache at any given moment. Thus, to optimize the loop, we unroll it 4 times and load each successive element into a different register. If we make 4 consecutive accesses to the same vector, we can amortize the cost of one miss over the 3 subsequent hits.
This solution results in 5 block-level memory operations per unrolled iteration: 4 block fills (one miss for each of X, Y, Z, and W) and 1 write-back of the dirty W block before the next block is loaded. This gives:
5 memory ops/iteration * (n/4) iterations =
5n/4 total memory ops
or
5/4 memory ops per original loop iteration
 
For the solution below, we assume that (R5) % 4 = 0 (R5 contains the loop counter), since the unrolled loop works on 4 elements per iteration.
loop:
LW R6, 0(R1) ; load X1
LW R10, 4(R1) ; load X2
LW R14, 8(R1) ; load X3
LW R18, 12(R1) ; load X4
LW R7, 0(R2) ; load Y1
LW R11, 4(R2) ; load Y2
LW R15, 8(R2) ; load Y3
LW R19, 12(R2) ; load Y4
LW R8, 0(R3) ; load Z1
LW R12, 4(R3) ; load Z2
LW R16, 8(R3) ; load Z3
LW R20, 12(R3) ; load Z4
ADD R9, R6, R7 ; do the add for element 1
ADD R9, R9, R8
ADD R13, R10, R11 ; do the add for element 2
ADD R13, R13, R12
ADD R17, R14, R15 ; do the add for element 3
ADD R17, R17, R16
ADD R21, R18, R19 ; do the add for element 4
ADD R21, R21, R20
SW R9, 0(R4) ; store W1
SW R13, 4(R4) ; store W2
SW R17, 8(R4) ; store W3
SW R21, 12(R4) ; store W4
ADDI R1, R1, 16 ; increment the vector indices
ADDI R2, R2, 16 ; 4 words = 16 bytes per iteration
ADDI R3, R3, 16
ADDI R4, R4, 16
ADDI R5, R5, -4 ; decrement the count by 4
BNEZ R5, loop
We gave partial credit for the following answer:
 
Interleave the elements of the vectors X, Y, Z, W, so that one cache block
will contain one element from each array.
This layout results in 1 miss per iteration of the loop. However, since we store to W, the block is marked dirty and must be written back to memory each iteration before the next block is fetched. That gives 2 block-level memory operations per iteration, or 2n in total, whereas the unrolled solution above needs only 5/4 memory operations per original iteration, so it is the better answer.
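
Counting block-level memory traffic makes this comparison concrete. The sketch below assumes a write-allocate, write-back policy for the single 4-word block, matching the reasoning above:

def unrolled_traffic(n):
    # Per unrolled iteration (4 elements): 4 block fills (X, Y, Z, W) plus
    # 1 write-back of the dirty W block when it is evicted on the next pass.
    return 5 * (n // 4)

def interleaved_traffic(n):
    # One interleaved block {X[i], Y[i], Z[i], W[i]} per element: 1 fill plus
    # 1 write-back of the dirty block on every iteration.
    return 2 * n

n = 4096
print(unrolled_traffic(n), interleaved_traffic(n))   # 5n/4 = 5120 vs 2n = 8192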
