CompEng 361 - Homework 3 Solutions

Uploaded by

Aaron Sun

Northwestern University

CompEng 361: Computer Architecture


Fall 2023
Homework 3

1. In this exercise, we examine how pipelining affects the clock cycle time of the processor.
Problems in this exercise assume that individual stages of the datapath have the following
latencies:

IF       ID       EX       MEM      WB
300 ps   450 ps   200 ps   400 ps   200 ps

Also, assume that instructions executed by the processor are broken down as follows:

R-Type   beq   lw    sw
60%      15%   15%   10%

a. What is the clock cycle time in a pipelined and non-pipelined processor?

Non-Pipelined: 300 + 450 + 200 + 400 + 200 = 1550 ps


Pipelined: max(300, 450, 200, 400, 200) = 450 ps

b. What is the total latency of an lw instruction in a pipelined and non-pipelined processor?

Non-Pipelined: 300 + 450 + 200 + 400 + 200 = 1550 ps


Pipelined: 450 * 5 = 2250 ps

c. If we can split one stage of the pipelined datapath into two new stages, each with half the
latency of the original stage, which stage would you split and what is the new clock cycle
time of the processor?

Split the ID stage because it is the longest.


New Cycle Time = max(300, 225, 225, 200, 400, 200) = 400 ps
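The arithmetic in parts a through c can be double-checked with a short script. This is only a sketch: the variable names are illustrative, and the stage latencies are the ones given in the problem statement.

```python
# Stage latencies in picoseconds, taken from the problem statement.
stages = {"IF": 300, "ID": 450, "EX": 200, "MEM": 400, "WB": 200}

# Part a: a non-pipelined clock must cover every stage; a pipelined
# clock only has to cover the slowest stage.
non_pipelined_cycle = sum(stages.values())
pipelined_cycle = max(stages.values())

# Part b: a pipelined lw still passes through all five stages.
lw_latency_pipelined = pipelined_cycle * len(stages)

# Part c: split the slowest stage into two halves and recompute the clock.
slowest = max(stages, key=stages.get)
split = [v for k, v in stages.items() if k != slowest] + [stages[slowest] / 2] * 2
new_cycle = max(split)

print(non_pipelined_cycle, pipelined_cycle, lw_latency_pipelined, slowest, new_cycle)
```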

d. Assuming there are no stalls or hazards, what is the utilization of the data memory?

Pct of Stores + Loads = 10% + 15% = 25%

e. Assuming there are no stalls or hazards, what is the utilization of the write-register port of
the “Registers” unit?

Pct of R-Type + Loads = 60% + 15% = 75 %
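Parts d and e both reduce to summing the instruction-mix entries that use a given resource. A minimal sketch, with the mix written as integer percentages and the set names chosen here for illustration:

```python
# Instruction mix in percent, from the problem statement.
mix = {"R-Type": 60, "beq": 15, "lw": 15, "sw": 10}

# Which instruction classes touch each resource (no stalls or hazards assumed).
uses_data_memory = {"lw", "sw"}        # loads read it, stores write it
writes_register = {"R-Type", "lw"}     # only these write a result register

data_mem_util = sum(mix[i] for i in uses_data_memory)
reg_write_util = sum(mix[i] for i in writes_register)

print(f"data memory: {data_mem_util}%, register write port: {reg_write_util}%")
```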

f. Instead of a single-cycle organization, we can use a multi-cycle organization where each
instruction takes multiple cycles but one instruction finishes before another is fetched. In
this organization, an instruction only goes through stages it actually needs (e.g., ST only
takes 4 cycles because it does not need the WB stage). Compare clock cycle times and
execution times with single-cycle, multi-cycle, and pipelined organization.
Multi-cycle CPI: 0.6 * 4 + 0.15 * 3 + 0.15 * 5 + 0.1 * 4 = 4

CPU                            Single-cycle   Multi-cycle   Pipelined
Cycle Time                     1550 ps        450 ps        450 ps
CPI                            1              4             1
Execution Time / Instruction   1550 ps        1800 ps       450 ps
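The comparison in part f can be reproduced as follows. This sketch assumes the per-class cycle counts used in the CPI expression above (R-Type 4, beq 3, lw 5, sw 4); the dictionary names are illustrative.

```python
stages_ps = [300, 450, 200, 400, 200]
mix = {"R-Type": 0.60, "beq": 0.15, "lw": 0.15, "sw": 0.10}
# Cycles each class needs in the multi-cycle design (assumed as in the solution).
cycles_needed = {"R-Type": 4, "beq": 3, "lw": 5, "sw": 4}

# Weighted average of cycles per instruction for the multi-cycle design.
multi_cpi = sum(mix[k] * cycles_needed[k] for k in mix)

# (cycle time in ps, CPI) for each organization.
designs = {
    "single-cycle": (sum(stages_ps), 1.0),
    "multi-cycle": (max(stages_ps), multi_cpi),
    "pipelined": (max(stages_ps), 1.0),
}
time_per_instr = {name: cycle * cpi for name, (cycle, cpi) in designs.items()}

for name, t in time_per_instr.items():
    print(f"{name}: {t:.0f} ps per instruction")
```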

2. In this exercise, we examine how data dependencies affect execution in the basic 5-stage
pipeline described in Section 4.5. Problems in this exercise refer to the following sequence of
instructions:

or r1, r2, r3 // (i)
or r2, r1, r4 // (ii)
or r1, r1, r2 // (iii)

Also, assume the following cycle times for each of the options related to forwarding:

Without Forwarding   With Full Forwarding   With ALU-ALU Forwarding Only
300 ps               350 ps                 340 ps

a. Indicate dependences and their type.


1. RAW dependency for r1 between i and ii
2. RAW dependency for r1 between i and iii
3. RAW dependency for r2 between ii and iii
4. WAR dependency for r2 between i and ii
5. WAR dependency for r1 between ii and iii
6. WAW dependency for r1 between i and iii
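The six dependences listed above can be found mechanically by comparing the destination and source registers of every ordered pair of instructions. The sketch below is a simple pairwise check (it does not account for an intervening write to the same register, which does not matter for this sequence); the tuple layout and function name are illustrative.

```python
from itertools import combinations

# (label, destination, sources) for each instruction, in program order.
program = [
    ("i",   "r1", ("r2", "r3")),
    ("ii",  "r2", ("r1", "r4")),
    ("iii", "r1", ("r1", "r2")),
]

def dependences(prog):
    """Return (type, register, earlier, later) for every pairwise dependence."""
    found = []
    for (a, a_dst, a_src), (b, b_dst, b_src) in combinations(prog, 2):
        if a_dst in b_src:      # later instruction reads what the earlier one wrote
            found.append(("RAW", a_dst, a, b))
        if b_dst in a_src:      # later instruction overwrites what the earlier one read
            found.append(("WAR", b_dst, a, b))
        if a_dst == b_dst:      # both instructions write the same register
            found.append(("WAW", a_dst, a, b))
    return found

for dep in dependences(program):
    print(dep)
```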
b. Assume there is no forwarding in this pipelined processor. Indicate hazards and add nop
instructions to eliminate them.

or r1, r2, r3
nop
nop
// Data hazard on r1
or r2, r1, r4
nop
nop
// Data hazard on r1, r2
or r1, r1, r2
c. Assume there is full forwarding. Indicate hazards and add nop instructions to eliminate
them.

No hazards…nothing to do!
d. What is the total execution time of this instruction sequence without forwarding and with
full forwarding? What is the speedup achieved by adding full forwarding to a pipeline that
had no forwarding?

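With full forwarding: 3 instructions + 4 cycles to drain the pipeline = 7 cycles; 7 * 350 ps = 2450 ps
Without forwarding: 7 instructions (3 + 4 nops) + 4 cycles = 11 cycles; 11 * 300 ps = 3300 ps
Speedup: 3300 / 2450 = ~1.35 times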

e. Add nop instructions to this code to eliminate hazards if there is ALU-ALU forwarding only
(no forwarding from the MEM to the EX stage).

or r1, r2, r3
or r2, r1, r4
nop
nop
or r1, r1, r2
f. What is the total execution time of this instruction sequence with only ALU-ALU
forwarding?

With ALU-ALU forwarding only: 5 instructions (3 + 2 nops) + 4 cycles = 9 cycles; 9 * 340 ps = 3060 ps
Without forwarding: 11 cycles; 11 * 300 ps = 3300 ps
Speedup: 3300 / 3060 = ~1.08 times
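The three execution times in parts d and f follow one pattern: the number of issue slots (instructions plus nops) plus four cycles for the last instruction to drain the 5-stage pipeline, multiplied by the cycle time. A sketch, using the nop counts from parts b, c, and e; the helper name is illustrative.

```python
STAGES = 5

def total_cycles(instructions, nops):
    """Cycles for a hazard-free stream: one issue slot each, plus pipeline drain."""
    return (instructions + nops) + (STAGES - 1)

# (nops required, cycle time in ps) for each forwarding option.
configs = {
    "no forwarding":   (4, 300),
    "full forwarding": (0, 350),
    "ALU-ALU only":    (2, 340),
}

times = {name: total_cycles(3, nops) * cycle for name, (nops, cycle) in configs.items()}
baseline = times["no forwarding"]

for name, t in times.items():
    print(f"{name}: {t} ps, speedup over no forwarding {baseline / t:.2f}")
```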

3. The importance of having a good branch predictor depends on how often conditional branches
are executed. Together with branch predictor accuracy, this will determine how much time is
spent stalling due to mispredicted branches. In this exercise, assume that the breakdown of
dynamic instructions into various instruction categories is as follows:

R-Type beq jmp lw sw

50% 20% 5% 20% 5%

Also, assume the following branch predictor accuracies:

Always-Taken Always-Not-Taken 2-Bit

40% 60% 85%

a. Stall cycles due to mispredicted branches increase the CPI. What is the extra CPI due to
mispredicted branches with the always-taken predictor? Assume that branch outcomes
are determined in the EX stage, that there are no data hazards, and that no delay slots are
used.
b. Repeat 3a. for the “always-not-taken” predictor.
c. Repeat 3a. for the 2-bit predictor.
Predictor          Miss Rate   Occurrence   Stall Cycles   Extra CPI
Always Taken (a)   0.6         0.2          3              0.36
Always NT (b)      0.4         0.2          3              0.24
2-Bit (c)          0.15        0.2          3              0.09
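Each row of the table is the product of miss rate, branch frequency, and stall cycles per misprediction. A minimal sketch, assuming the 3-cycle penalty used in this solution; the constant names are illustrative.

```python
BRANCH_FRACTION = 0.20   # beq share of dynamic instructions
PENALTY = 3              # stall cycles per mispredicted branch, as used in the solution

accuracy = {"always-taken": 0.40, "always-not-taken": 0.60, "2-bit": 0.85}

# Extra CPI = miss rate * branch frequency * stall cycles per miss.
extra_cpi = {
    name: (1 - acc) * BRANCH_FRACTION * PENALTY
    for name, acc in accuracy.items()
}

for name, value in extra_cpi.items():
    print(f"{name}: {value:.2f}")
```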

d. With the 2-bit predictor, what speedup would be achieved if we could convert half of the
branch instructions in a way that replaces a branch instruction with an ALU instruction?
Assume that correctly and incorrectly predicted instructions have the same chance of
being replaced.

New extra CPI: 0.15 * 0.1 * 3 = 0.045
Speedup (base CPI of 1 plus the extra CPI, before and after): 1.09 / 1.045 = ~1.043

e. With the 2-bit predictor, what speedup would be achieved if we could convert half of the
branch instructions in a way that replaced each branch instruction with two ALU
instructions? Assume that correctly and incorrectly predicted instructions have the same
chance of being replaced.

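New extra cycles per original instruction: 0.15 * 0.1 * 3 + 0.1 * 1 = 0.145 (the 0.1 * 1 term is the one additional ALU instruction for each replaced branch)
Speedup: 1.09 / 1.145 = ~0.952 (a slowdown)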
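Parts d and e can be compared on a common footing by counting cycles per original instruction, which absorbs the larger instruction count in part e. This sketch assumes a base CPI of 1 and the 3-cycle penalty used in part c; the function name is illustrative.

```python
def cycles_per_original_instr(branch_frac, extra_instrs, miss=0.15, penalty=3):
    """One cycle per original instruction, plus added instructions and branch stalls."""
    return 1.0 + extra_instrs + miss * branch_frac * penalty

baseline = cycles_per_original_instr(0.20, 0.0)   # all branches kept
part_d = cycles_per_original_instr(0.10, 0.0)     # half replaced by one ALU op each
part_e = cycles_per_original_instr(0.10, 0.10)    # half replaced by two ALU ops each

print(f"d: speedup {baseline / part_d:.3f}")
print(f"e: speedup {baseline / part_e:.3f}")
```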

f. Some branch instructions are much more predictable than others. If we know that 80% of
all executed branch instructions are easy-to predict loop-back branches that are always
predicted correctly, what is the accuracy of the 2-bit predictor on the remaining 20% of the
branch instructions?

0.8 + x * 0.2 = 0.85 => x = 0.25


25 %
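The same weighted-average equation, rearranged to solve for the unknown accuracy, as a quick check (variable names are illustrative):

```python
overall_accuracy = 0.85   # 2-bit predictor accuracy over all branches
easy_fraction = 0.80      # loop-back branches, always predicted correctly
hard_fraction = 1 - easy_fraction

# overall = easy_fraction * 1.0 + hard_fraction * x, solved for x
hard_accuracy = (overall_accuracy - easy_fraction * 1.0) / hard_fraction

print(f"accuracy on the remaining branches: {hard_accuracy:.0%}")
```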

4. This exercise examines the accuracy of various branch predictors for the following repeating
pattern (e.g., in a loop) of branch outcomes:

NT, T, NT, NT, T

a. What is the accuracy of always-taken and always-not-taken predictors for this sequence of
branch outcomes?

Always T: 0.4
Always NT: 0.6

b. What is the accuracy of the two-bit predictor for the first 4 branches in this pattern,
assuming that the predictor starts off in the bottom left state from the lecture slides
(strongly predict not taken)?

Branch        NT          T         NT          NT
Prediction    NT          NT        NT          NT
Outcome       Correct     Wrong     Correct     Correct
State After   Strong NT   Weak NT   Strong NT   Strong NT

Accuracy is 75%

c. What is the accuracy of the two-bit predictor if this pattern is repeated forever?

Branch        NT          T         NT          NT          T
Prediction    NT          NT        NT          NT          NT
Outcome       Correct     Wrong     Correct     Correct     Wrong
State After   Strong NT   Weak NT   Strong NT   Strong NT   Weak NT

Accuracy is 60%
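The results for parts b and c can be checked by simulating the predictor. The sketch below models it as a 2-bit saturating counter (0 = strong NT, 1 = weak NT, 2 = weak T, 3 = strong T), which is an assumption about the state machine in the lecture slides; for this pattern the state never leaves the two not-taken states, so common 2-bit variants give the same answer.

```python
from itertools import cycle, islice

PATTERN = [0, 1, 0, 0, 1]  # NT, T, NT, NT, T (1 = taken)

def simulate(outcomes, state=0):
    """Return the accuracy of a 2-bit saturating counter over the outcomes."""
    correct = 0
    total = 0
    for taken in outcomes:
        prediction = 1 if state >= 2 else 0
        correct += prediction == taken
        # Move toward strong-taken on a taken branch, else toward strong-not-taken.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
        total += 1
    return correct / total

first_four = simulate(PATTERN[:4])
steady = simulate(islice(cycle(PATTERN), 5000))

print(f"first four branches: {first_four:.0%}, repeated forever: {steady:.0%}")
```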

d. Design a predictor that would achieve a perfect accuracy if this pattern is repeated forever.
Your predictor should be a sequential circuit with one output that provides a prediction (1
for taken, 0 for not taken) and no inputs other than the clock and the control signal that
indicates that the instruction is a conditional branch.

There are several ways to show this (including Verilog code, Gate diagram w/ FFs). Here is the
most straightforward way to do this with a state diagram:

Slight variations on this are fine. Note that this predictor must be initialized in the correct state to
predict the pattern perfectly.

e. What is the accuracy of your predictor from part d if it is given a repeating pattern that is
the exact opposite of this one?

This predictor is always wrong => 0% accuracy.
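One way to model the predictor from part d, and to confirm the result in part e, is a five-state ring counter whose output depends only on the current state. This is a behavioral sketch rather than the circuit itself; the class and method names are illustrative, and the predictor is assumed to start in the state aligned with the first branch of the pattern.

```python
class PatternPredictor:
    """Five-state ring counter; each state outputs one element of the pattern."""
    PATTERN = (0, 1, 0, 0, 1)  # NT, T, NT, NT, T (1 = taken)

    def __init__(self, start=0):
        self.state = start     # must match the pattern position at reset

    def predict(self):
        return self.PATTERN[self.state]

    def clock(self, is_branch):
        # Advance only when the current instruction is a conditional branch.
        if is_branch:
            self.state = (self.state + 1) % len(self.PATTERN)

def accuracy(outcomes, periods=100):
    predictor = PatternPredictor()
    correct = 0
    for _ in range(periods):
        for taken in outcomes:
            correct += predictor.predict() == taken
            predictor.clock(is_branch=True)
    return correct / (periods * len(outcomes))

original = [0, 1, 0, 0, 1]
opposite = [1 - t for t in original]
print(accuracy(original), accuracy(opposite))
```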

f. Repeat 4d., but now your predictor should be able to eventually (after a warm-up period
during which it can make wrong predictions) start perfectly predicting both this pattern and
its opposite. Your predictor should have an input that tells it what the real outcome was.
Hint: this input lets your predictor determine which of the two repeating patterns it is given.

Again, there are many ways to show this. The simplest thing to do is to distinguish between the
two patterns early in the sequence (warm up). Again, we will assume that you are initialized into
one of the two starting states. Here is a state diagram:
Note that the two inputs identify if the current instruction is a branch and what the real outcome
was. Slight variations on this are fine.

Given that we have an input to tell us what the real outcome was, we can actually devise a
more complex predictor that will eventually correctly predict either pattern but won’t need to be
initialized into the correct state.
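A behavioral sketch of the simpler, initialized variant of the part f predictor: a five-state ring counter plus one "invert" bit that flips whenever a prediction turns out wrong. This is one possible design, not necessarily the one in the original state diagram; it assumes the counter starts aligned with the first branch of the pattern, and all names are illustrative.

```python
class AdaptivePredictor:
    """Ring counter over the base pattern, with a polarity bit learned from outcomes."""
    BASE = (0, 1, 0, 0, 1)  # NT, T, NT, NT, T (1 = taken)

    def __init__(self):
        self.position = 0   # assumed aligned with the start of the pattern
        self.invert = 0     # 0: base pattern, 1: opposite pattern

    def predict(self):
        return self.BASE[self.position] ^ self.invert

    def update(self, is_branch, taken):
        if not is_branch:
            return
        if self.predict() != taken:
            self.invert ^= 1    # a wrong guess means the other pattern is running
        self.position = (self.position + 1) % len(self.BASE)

def mispredictions(pattern, periods=100):
    predictor = AdaptivePredictor()
    misses = 0
    for _ in range(periods):
        for taken in pattern:
            misses += predictor.predict() != taken
            predictor.update(True, taken)
    return misses

original = [0, 1, 0, 0, 1]
opposite = [1 - t for t in original]
print(mispredictions(original), mispredictions(opposite))
```

Under the alignment assumption, the only misprediction occurs on the very first branch of the opposite pattern, which is the warm-up period the problem allows.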
