Ee457 Final Fall2023

Cover page

EE457 Final Exam (~33.5%)


Closed-book Closed-notes Exam; Verilog Guides are not needed and are not allowed.
This is a traditional paper pencil exam. Smart phones, laptops, iPads, tablets, and all kinds of computing/Internet devices are not allowed.
This is a Crowdmark exam. Please do not write on margins or on the backside. Use a dark HB or H1 pencil.
Fall 2023
Instructor: Gandhi Puvvada
Final Exam (~33.5%): Saturday, Dec. 9, 2023, 01:15 PM - 04:15 PM PST in THH 101

I have previously read the Viterbi Code of Integrity and other related material at the site
https://viterbischool.usc.edu/academic-integrity/ and I will abide by these rules of conduct. I will neither seek help from others nor offer help to others in my exams.

_____________________________ <== Student’s signature

Ques#          Topic                                  Page#                 Points
1              Mutual Exclusion, MSI and MOESI        2-7                   83
2              Miscellaneous advanced topics          8-9                   50
3              Virtual Memory and Cache               10-12                 73
4              FIFO                                   13                    28
5              Lab 7 Part 3 SP 3 Verilog RTL coding   13-14                 20
6              Lab 7 P3 SP2 modification              15-20                 181
Just FYI       Early Branch Block diagram             21
Total                                                 Cover + 2-to-20 + 2   435
Perfect Score                                                               420
EE457 Final - Fall2023 1 / 22 C Copyright 2023 Gandhi Puvvada
Viterbi School of Engineering, University of Southern California
Q1P2 Page total 31 pts
1 ( points) min. Mutual Exclusion, MSI, and MOESI

1.1 Reproduced on the next four pages is the 14-step sequence from our class notes showing how
three competing threads in three single-threaded cores can obtain the lock in a mutually exclusive
manner. Revise the 14-step sequence for MOESI in place of MSI, showing the O state when
needed. Change the contents of the L1 and L2 caches as needed. Answer the 3 questions on the
last page of these 4 pages related to MOESI and FMM. Hint: E-state is never used here.
(52 pts, on the next 4 pages)

If the three threads are all in one single three-threaded core, you expect that (9 pts)

1. mutual exclusion is still possible even though there are no SCUs involved. _____ (T/F).
2. most of the polling (checking for the lock to be released) is done in ______ (M/S/I) state
by the threads that are waiting to lock the lock.
3. here we can just use LW and SW and do not need LL and SC. _____ (T/F).

1.2 Legend: A = desirable; B = undesirable; C = wrong; D = none of the above (6 pts)

It is ______________ (A/B/C/D) to map two independent locks, such as a Student Database Lock
(SDBL) and a Faculty Database Lock (FDBL), to locations in the same cache block.
It is ______________ (A/B/C/D) to map two independent locks, such as a Student Database Lock
(SDBL) and a Faculty Database Lock (FDBL), to locations in the same virtual page.

1.3 After locking the SDBL and entering the Student Database, can the L1 cache of that core choose
to voluntarily replace the block containing the lock (replace it to bring in some other block)?
_______ (Yes/No). Does it pose any problem? _______ (Yes/No). (6 pts)

1.4 An application to admit students is written to admit one student at a time, release the SDBL
lock, and not seek it again for the next millisecond. During this millisecond, there may or may not
be others looking for the lock! If no one else is looking for the lock, do you expect that the block
containing SDBL in this core would be in the M, S, or I state at the end of the millisecond? (10 pts)
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________

Blank rectangle (for rough work)

[Pages 3 to 6 (Q1P3 to Q1P6): the 14-step sequence referred to in 1.1 and its related questions; figures not reproduced.]
Q1P7 Page total 14 pts

1.5 Complete the MOESI state diagram: state transition conditions in the two boxes for the two
state transition arrows, one from I to S state and the other from I to E state. (6 pts)

If you used any open-collector (open-drain) signal, identify it and state why it should be an
open-collector (open-drain) signal. (8 pts)
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________

Blank rectangle (for rough work)



Q2P8 Page total 32 pts

2 ( points) min. Miscellaneous advanced topics

2.1 MPI (Miss rate Per Instruction) in the case of a hierarchy of caches

Calculate the effective CPI assuming that there are no other losses of clocks due to stalling or
flushes etc. (i.e., CPI would be 1 if there are no cache misses in L1 cache). (8 pts)

[Figure: memory hierarchy P, Cache L1, Cache L2, Cache L3, Main Memory]

MPI for L1 cache: MPI_1 = 6%
L1 miss penalty: L1_M_P = 25 clocks
MPI for L2 cache: MPI_2 = 1%
L2 miss penalty: L2_M_P = 50 clocks
MPI for L3 cache: MPI_3 = 0.5%
L3 miss penalty: L3_M_P = 200 clocks

L3 MPI is always less than L2 MPI which is always less than L1 MPI. True / False (2 pts)
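
One possible reading of the data in 2.1, offered only as a sketch: assume each MPI counts misses per instruction at that level and each miss penalty is the extra clocks charged per miss at that level, so the stall clocks per instruction simply add up level by level. The function name `effective_cpi` and this interpretation are assumptions made for illustration, not the official solution.

```python
# Sketch under the stated assumptions (not the official solution):
# effective CPI = base CPI + sum over levels of (misses per instruction * miss penalty).

def effective_cpi(base_cpi, levels):
    """levels is a list of (misses_per_instruction, miss_penalty_in_clocks) pairs."""
    stall_clocks = sum(mpi * penalty for mpi, penalty in levels)
    return base_cpi + stall_clocks

# Values copied from 2.1: (MPI, miss penalty) for L1, L2, and L3.
levels = [(0.06, 25), (0.01, 50), (0.005, 200)]
cpi = effective_cpi(1.0, levels)
print(f"effective CPI = {cpi:.2f}")
```

Under a different reading (for example, local miss rates per level), the terms would have to be multiplied through the hierarchy instead of added.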

2.2 Branch prediction: A 2Kx2 2-bit BPB in the ID stage is indexed by Mr. Trojan using
______________ ( PC[12:2] / PC[31:21] ) whereas Mr. Bruin used ______________ ( PC[12:2]
/ PC[31:21] ) for indexing. Why is Mr. Bruin wrong? (8 pts)
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________

2.3 A direct mapped cache in a 32-bit address 32-bit data system uses 4-word blocks. The cache size
is 256KB (= 2^18 bytes = 2^16 words = 2^14 blocks). Identify which of the following two address
divisions was made by Mr. Trojan and which was made by Mr. Bruin. (6 pts)

Address division by Mr. ____________ (Trojan/Bruin)
Tag (14) = A31-A18 | Index (14) = A17-A4 | Word = A3-A2 | Byte = A1-A0

Address division by Mr. ____________ (Trojan/Bruin)
Index (14) = A31-A18 | Tag (14) = A17-A4 | Word = A3-A2 | Byte = A1-A0

Mr. Bruin further says that, since both the index and the tag fields are equal in size (14 bits each),
the sizes of the TAG RAM, the DATA RAM, and the tag comparison unit are all equal either way. ___ (T / F)
Why is Mr. Bruin wrong? (8 pts)
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
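
The field widths quoted in 2.3 can be checked with a few lines of arithmetic. The sketch below only computes widths; it does not say which bits should serve as the index. The helper names `log2_exact` and `direct_mapped_fields` are made up for this illustration.

```python
# Sketch: derive the address field widths of a direct mapped cache from its parameters.

def log2_exact(n):
    """Return log2(n) for a positive power of two; raise otherwise."""
    if n <= 0 or n & (n - 1):
        raise ValueError(f"{n} is not a power of two")
    return n.bit_length() - 1

def direct_mapped_fields(addr_bits, cache_bytes, words_per_block, bytes_per_word):
    byte_bits = log2_exact(bytes_per_word)        # byte within a word
    word_bits = log2_exact(words_per_block)       # word within a block
    blocks = cache_bytes // (words_per_block * bytes_per_word)
    index_bits = log2_exact(blocks)               # one index value per block
    tag_bits = addr_bits - index_bits - word_bits - byte_bits
    return {"tag": tag_bits, "index": index_bits, "word": word_bits, "byte": byte_bits}

# Parameters from 2.3: 32-bit address, 256 KB cache, 4-word blocks, 4-byte words.
fields = direct_mapped_fields(32, 256 * 1024, 4, 4)
print(fields)
```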
Q2P9 Page total 18 pts
2.4 In the following extract from our class notes, we show that the BPB is accessed in the IF stage
and is processed in the ID stage. We cited a timing advantage (register re-balancing). Even if
there is no need for a timing advantage, show that there is a cost advantage in doing this. Assume
(and state) a reasonably sized 2-bit prediction BPB, and explain quantitatively the cost advantage. (6 pts)

BPB size assumed: __________________________________________________________________________

Cost Advantage: __________________________________________________________________________

__________________________________________________________________________________________

[Figure: extract from class notes showing BPB access in the IF stage and processing in the ID stage]

2.5 CMP: Intel's HTT (Hyper Threading Technology) is essentially the same as
___________________________ (fine-grain / coarse-grain / simultaneous) multi-threading. (2 pts)

2.6 Tomasulo 3: (10 pts)

PRF stands for _____________________________________

FRL stands for _____________________________________

RAT stands for _____________________________________

Legend: A = Dispatch unit, B = Instruction Retirement logic

FRAT is updated by the _________________ (A / B).

RRAT is updated by the _________________ (A / B).



Q3P10 Page total 15 pts

3 ( points) min. Virtual Memory and Cache


Specs of our Trojan computer (a 32-bit address, 32-bit data, byte-addressable machine) with a
physically addressed cache (more specifically, a PIPT cache):

Virtual address space = 4GB, Virtual address = 32 bits (VA31-VA0) (2^32 = 4G)
Physical address space = 4GB, Physical address = 32 bits (PA31-PA0) (2^32 = 4G)
Page size = 16 KB (2^14 = 16K)
TLB size = 64 entries (fully associative) (2^6 = 64)
Page table organization: 2-level table with a 128-entry (2^7 = 128) page directory (top level table)
Cache size = 224 KB (7 * 2^15 = 7 * 32K = 224K)
Cache block (cache line) size = two 32-bit words (8 bytes total) (2^3 = 8)
Cache mapping: Set-associative with _____ blocks per set (choose a minimum number of blocks per
set suitable for the 224 KB cache). State another possible choice _____ (but do not use this
choice). (4 pts)
Main memory organization: Lower-order interleaved. Degree of interleaving to suit the most
efficient access of the main-memory block for transferring it to cache.

3.1 Divide the virtual address into VPN (Virtual Page Number) and Page offset fields. Since the
TLB is a fully associative TLB, we ____________ (further divide / do not divide) the VPN into TAG
and SET fields. (8 pts)
How many comparators of what size are needed in the TLB? ______________

Virtual address VA31-VA0; Bank Enables (Byte enables) BE3-BE0
VA31 VA30 VA29 VA28 VA27 VA26 VA25 VA24 VA23 VA22 VA21 VA20 VA19 VA18 VA17 VA16 VA15 VA14 VA13 VA12 VA11 VA10 VA9 VA8 VA7 VA6 VA5 VA4 VA3 VA2 VA1 VA0
Byte

Is any portion of the virtual address used for "indexing" the TLB? ______________ (Yes / No).

3.2 Divide the virtual address into VPN and Page offset fields again and further divide the VPN
(based on the page table organization information) into page directory index
and 2nd-level page table index. (3 pts)

Virtual address VA31-VA0; Bank Enables (Byte enables) BE3-BE0
VA31 VA30 VA29 VA28 VA27 VA26 VA25 VA24 VA23 VA22 VA21 VA20 VA19 VA18 VA17 VA16 VA15 VA14 VA13 VA12 VA11 VA10 VA9 VA8 VA7 VA6 VA5 VA4 VA3 VA2 VA1 VA0
Byte
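
The virtual address division asked for in 3.1 and 3.2 follows from the page size and the page directory size given in the specs. The sketch below assumes the page directory index occupies the upper bits of the VPN and the 2nd-level index the remaining lower bits; that ordering, the helper names, and the sample address are assumptions for illustration only.

```python
# Sketch: split a 32-bit virtual address for 16 KB pages and a 128-entry page directory.
# Assumption: the directory index sits in the top bits of the VPN.

def log2_exact(n):
    """Return log2(n) for a positive power of two; raise otherwise."""
    if n <= 0 or n & (n - 1):
        raise ValueError(f"{n} is not a power of two")
    return n.bit_length() - 1

VA_BITS = 32
PAGE_BYTES = 16 * 1024
DIR_ENTRIES = 128

offset_bits = log2_exact(PAGE_BYTES)     # page offset width
vpn_bits = VA_BITS - offset_bits         # virtual page number width
dir_bits = log2_exact(DIR_ENTRIES)       # page directory index width
second_bits = vpn_bits - dir_bits        # 2nd-level page table index width

def split_va(va):
    """Return (directory index, 2nd-level index, page offset) for a virtual address."""
    offset = va & ((1 << offset_bits) - 1)
    vpn = va >> offset_bits
    second_index = vpn & ((1 << second_bits) - 1)
    dir_index = vpn >> second_bits
    return dir_index, second_index, offset

sample_va = 0xDEADBEEF  # arbitrary 32-bit pattern, not taken from the exam
print(offset_bits, vpn_bits, dir_bits, second_bits)
print(split_va(sample_va))
```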



Page total 24 pts
Q3P11

3.3 Divide the physical address into PPFN (Physical Page Frame Number) and Page offset fields. (2 pts)

Physical address PA31-PA0; Bank Enables (Byte enables) BE3-BE0
PA31 PA30 PA29 PA28 PA27 PA26 PA25 PA24 PA23 PA22 PA21 PA20 PA19 PA18 PA17 PA16 PA15 PA14 PA13 PA12 PA11 PA10 PA9 PA8 PA7 PA6 PA5 PA4 PA3 PA2 PA1 PA0
Byte

3.4 Divide the physical address (based on cache specifications) into TAG, SET, WORD and BYTE fields. (3 pts)

Physical address PA31-PA0; Bank Enables (Byte enables) BE3-BE0
PA31 PA30 PA29 PA28 PA27 PA26 PA25 PA24 PA23 PA22 PA21 PA20 PA19 PA18 PA17 PA16 PA15 PA14 PA13 PA12 PA11 PA10 PA9 PA8 PA7 PA6 PA5 PA4 PA3 PA2 PA1 PA0
Byte

3.5 If the 32-bit physical byte address (produced by address translation through TLB or Page Table)
is 70586124 H (0111_0000_0101_1000_0110_0001_0010_0100B), which set in the cache will you be
approaching? _____________________________ (set # in binary) (13 pts)
Does this set number form an index (an address) into _____________________________ (the
multiple TAG RAMs / the single TAG RAM / neither of these)?
Complete the TAG RAM details in the side panel.

[Side panel: TAG RAM + valid, with Address, Data_in, Data_out, Comparator, and HIT. Size = ______.
_____ more (besides the above) are needed in this cache.]

3.6 Complete the Cache DATA RAM details below. (6 pts)

[Figure: DATA RAM with an Address input and four byte_wide banks (D31-D24, D23-D16, D15-D8, D7-D0)
connected to the Trojan Processor data bus D31-D0]

Size: Each of the 4 byte_wide banks is a ______ x 8.
______ more such DATA RAM units (besides the one on the side)

Blank rectangle (for rough work)



Q3P12 Page total 34 pts
3.7 Complete the Interleaved Main Memory details below. (6 pts)

[Figure: one main memory unit made of four byte-wide banks (D31-D24, D23-D16, D15-D8, D7-D0),
addressed by PA__ - PA__, with 32 bit XCVRs onto the data bus D31-D0]

Each of these 4 is __________ MB in size.
______ more such units (besides the one on the left) exist in Main Memory.
3.8 A TLB miss leads to a _________________ (cache look up / a PT look up). (4 pts)
During TLB look up, a Read/Write/Execute violation (a memory protection violation) causes a TRAP. T / F
3.9 In a set associative cache of 2 blocks per set and 4 words per block, the degree of lower-order
interleaving recommended for the main memory is __________ (1-way/2-way/4-way/8-way/
other namely ...) and the number of TAG RAMs is __________ (8/16/32/other namely ...). (6 pts)
The depth of a TAG RAM is determined by ________________________________________.
3.10 In a 4-core processor with each core running 8 threads, (6 pts)
in each core, there is/are ______________ (1 / 4 / 8) PTBR(s),
in each core, there is/are ______________ (1 / 4 / 8) L1 Data cache(s),
in each core, there is/are ______________ (1 / 4 / 8) PC(s),
in each core, there is/are ______________ (1 / 4 / 8) Register Files.

Assume that our TLB entries have a field containing the Address Space number. Write a few lines
stating the number of TLBs in a core, and whether flushing of the TLB occurs on thread switching by
hardware, on context switching by the operating system, in both situations, or in neither. (6 pts)
___________________________________________________________________________
___________________________________________________________________________
___________________________________________________________________________
___________________________________________________________________________

3.11 The Oracle T1 processor has the TS (Thread Select) stage before the ID stage, whereas in our
EE560 CMP, we have the TS stage after the ID stage. Both are right in their own context because ... (6 pts)
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________



Q4P13 Page total 28 pts
4 ( points) min. FIFO
An 8x4 single-clock FIFO (8 locations, each of 4 bits) can use one of the two methods below. (28 pts)
(i) n-bit (3-bit) WP and RP pointers with an AF/AE FF (Almost Full/Almost Empty flip-flop to disambiguate the WP-RP=0 situation)
(ii) (n+1)-bit (4-bit) WP and RP pointers.
The number of pins on the FIFO chip is _________ (the same / different) in the two choices.
In method (i), for WP = 110 (6), if the FIFO is full, find the values: RP = ______, AF/AE FF = ______
In method (i), for WP = 110 (6), if the FIFO is empty, find the values: RP = ______, AF/AE FF = ______
In method (i), for depth calculation, you perform ______ (3-bit/4-bit) subtraction ___________________
(WP-RP/RP-WP) with modulo _____.
In method (ii), for WP = 1100 (12), if the FIFO is full, find the values: RP = ______
In method (ii), for WP = 1100 (12), if the FIFO is empty, find the values: RP = ______
In method (ii), for depth calculation, you perform ______ (3-bit/4-bit) subtraction ___________________
(WP-RP/RP-WP) with modulo _____.
For the following four figures, if possible, calculate the depth and show the calculation (mod subtraction).
If not possible, state the reason.
[Four figures: circular pointer diagrams marking WP and RP, two for Method (i) (locations 0 to 7)
and two for Method (ii) (pointer values 0 to 15); diagrams not reproduced.]

Depth = _______ Depth = _______ Depth = _______ Depth = _______
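
As a rough illustration of the two pointer schemes described in this question, the sketch below computes depth by modular subtraction. It assumes depth is taken as WP minus RP, and that in method (i) the AF/AE flip-flop, when set, means the FIFO was last almost full, so equal pointers then mean full rather than empty. The function names, the flag polarity, and the sample pointer values are assumptions, not taken from the class notes.

```python
# Sketch: FIFO depth for an 8-location FIFO under the two pointer schemes.

N = 3                # log2 of the number of locations
LOCATIONS = 1 << N   # 8

def depth_method_ii(wp, rp):
    """(n+1)-bit pointers: depth is (WP - RP) modulo 2^(n+1)."""
    return (wp - rp) % (1 << (N + 1))

def depth_method_i(wp, rp, almost_full_flag):
    """n-bit pointers: (WP - RP) modulo 2^n, with the flag resolving the WP == RP case."""
    diff = (wp - rp) % (1 << N)
    if diff == 0 and almost_full_flag:
        return LOCATIONS  # equal pointers reached from the almost-full side: full
    return diff           # otherwise the plain difference (0 means empty)

# Sample values chosen for illustration only.
print(depth_method_ii(5, 13))          # pointers differ only in the extra MSB
print(depth_method_ii(2, 14))          # write pointer has wrapped around
print(depth_method_i(3, 3, True))
print(depth_method_i(3, 3, False))
print(depth_method_i(1, 6, False))
```

The extra pointer bit in method (ii) is what removes the need for a separate flag: a full FIFO and an empty FIFO have the same lower n bits but differ in the top bit.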

5 ( points) min.
Lab 7 Part 3 Subpart 3 Verilog RTL coding: A couple of figures to refresh your memory.
[Figures, ORIGINAL and MODIFIED: (a) the EX2/WB portion of the datapath, with labels including
X2_Mux, R2_Mux, ADD4, FU2, FORW2, SKIP2, EX2_ADDER_IN, EX2_ADDER_OUT, WB_EX2_ADDER_IN,
WB_EX2_ADDER_OUT, EX2_XMEX1, WB_SKIP2, WB_Write, WB_RA, and WB_RD; (b) the register file
(Reg. File), with the IFRF circuit (IFRF_Mux) in the MODIFIED version. Diagrams not reproduced.]



Q4P14 Page total 20 pts
Suppose we had only one change from subpart 1, namely the change of the negative-edge triggered
register file to a positive-edge triggered register file with an internal forwarding mechanism, but no
R2-Mux movement to the WB stage.
Consider each of the following four choices for coding the write port (only the write port) of the
register file and state whether you agree or disagree.

Student #1 chooses to write a separate clocked block as shown below:

    always @(posedge CLK)
    begin : RegFile_Block
        if (WB_WRITE)
        begin
            reg_file[WB_RA] <= WB_RD;
        end
    end

Students #2, #3, and #4 choose to write the following 4 lines in the main clocked block

    if (WB_WRITE)
    begin
        reg_file[WB_RA] <= WB_RD;
    end

after the following line:

    else // else if posedge CLK

However, students #2, #3, and #4 differed in where they placed the above 4 lines.

Student #2 placed the 4 lines at the beginning of the else block (i.e before
producing WB_WRITE, WB_RA, and WB_RD).

Student #3 placed the 4 lines at the end of the else block (i.e after
producing WB_WRITE, WB_RA, and WB_RD).

Student #4 placed the 4 lines in the middle of the else block (i.e after
producing WB_WRITE and WB_RA, but before producing WB_RD).

You agree with Student #1. Yes / No (6 pts)
You agree with Student #2. Yes / No
You agree with Student #3. Yes / No
You agree with Student #4. Yes / No

Your short explanation: (14 pts)
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________



Q5P15 Page total 10 pts
6 ( points) min. Lab 7 P3 SP2 modification

An ADD8 instruction (besides an ADD4 instruction) can be supported in Lab 7 Part3 by replacing
the SUB3 unit in EX1 with another ADD4 unit. Instead of having an ADD4 unit in each of the two
EX stages, EX1 and EX2, here, we have merged those two stages, EX1 and EX2 into EX12. So
ADD8 needs an extra clock in EX12 as it has to go through the second ADD4 also.

Instruction      Operation                   Opcode                MSD   32-bit instruction in hex
                                             MOV BZ ADD4 ADD8            (D=Destination, S=Source)
NOP                                           0   0   0    0        0     000000DS
MOV  $R, $X;     ($R) <= ($X)                 1   0   0    0        8     800000DS
SUB3 $R, $X;     ($R) <= ($X) - 3             0   1   0    0        4     400000DS
BZ   $X, JJJJ;   (PC) <= JJJJ if ($X) = 0     0   1   0    0        4     4JJJJ0DS
ADD4 $R, $X;     ($R) <= ($X) + 4             0   0   1    0        2     200000DS
ADD8 $R, $X;     ($R) <= ($X) + 8             0   0   0    1        1     100000DS

We have a BZ (Branch if Zero) instruction. It uses the opcode previously allocated to the SUB3
instruction. The instructions are 32-bits in size, but the addresses are only 16-bit. PC is 16-bit wide
and is incremented by a "1". The JJJJ in the BZ $X, JJJJ stands for a 16-bit (4-digit hex)
absolute branch address. If the source register $X is a zero then we branch to JJJJ [ (PC) <= JJJJ if
($X) = 0 ]. The "D" in "4JJJJ0DS" is a random hex digit and should not be treated as a valid destination,
similar to the "DS" in "000000DS" for a NOP instruction. BZ executes from the ID stage.
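
The instruction format described above can be illustrated with a small decoder. This is only a sketch: it assumes the most significant hex digit is the one-hot opcode from the table, that JJJJ occupies the next four hex digits of a BZ instruction, and that D and S are the two least significant hex digits. The names `OPCODES` and `decode`, and the sample instruction words, are made up for illustration.

```python
# Sketch: decode the 32-bit instruction words of the table (assumed field positions).

OPCODES = {0x0: "NOP", 0x8: "MOV", 0x4: "BZ", 0x2: "ADD4", 0x1: "ADD8"}

def decode(instr):
    """Return the opcode name and the D, S (and for BZ, JJJJ) fields of a 32-bit word."""
    msd = (instr >> 28) & 0xF                 # most significant hex digit: one-hot opcode
    name = OPCODES.get(msd, "UNKNOWN")
    fields = {"op": name, "D": (instr >> 4) & 0xF, "S": instr & 0xF}
    if name == "BZ":
        fields["JJJJ"] = (instr >> 12) & 0xFFFF   # 16-bit absolute branch address
    return fields

print(decode(0x41234056))   # a BZ with a sample branch address
print(decode(0x20000012))   # an ADD4
```

For BZ and NOP, the D field returned here is not a valid destination, as the text notes; a real design would simply ignore it.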

You need to complete the early branch mechanism: dependency stalls, branch execution by causing
PC to be changed to JJJJ and flushing the junior instruction in the IF stage, avoiding spurious branch
execution during stalling of the ID stage (stalling BZ due to its dependency on ADD4 or ADD8 in the
EX12 stage), etc. A copy of our Lab 6 Early Branch design is given at the end of the
exam just FYI (for your information).

6.1 Complete the design on the page after next (page 17).

6.2 In your Lab 7 Part 3 Subpart 2 (EX1 and EX2 merged case), you used the left-side circuit below to stall
ADD1 for 1 clock. Complete the design by labeling the STALL signal. Suppose you are given a
flip-flop with an asynchronous set as shown on the right side below (instead of the FF with an asynchronous
clear as shown on the left). Redesign your stall circuit with this FF and show the STALL signal. (10 pts)

[Figure, left: D flip-flop with asynchronous clear (CLR), with labels EX12_ADD1, CLK, RESET_B, and STALL??.
Right: D flip-flop with asynchronous set (SET), with labels CLK, RESET_B, and STALL??.]



Q5P16 Page total 43 pts
6.3 When STALL_ADD8 is active, you stall the entire pipeline. True / False (6 pts)
When STALL_BR is active, you stall the entire pipeline. True / False
The IF_Flush mechanism here is ___________________ (the same as / different from) the wrist-band
mechanism used in our pipelined CPU design.

6.4 In this design we have implemented an early branch. Would a medium branch from EX12 be better?
Yes / No / It depends. Explain. (6 pts)
_______________________________________________________________________________
_______________________________________________________________________________
_______________________________________________________________________________
_______________________________________________________________________________
_______________________________________________________________________________
Is it possible to postpone executing the BZ instruction all the way into the WB stage (WB!, not EX12)?
Not possible / possible but undesirable / possible and desirable. Explain. (6 pts)
_______________________________________________________________________________
_______________________________________________________________________________
_______________________________________________________________________________
_______________________________________________________________________________

6.5 Combining EX1 and EX2 into one EX12 stage (as done here) is ____________________________
(always better / always worse / depends on the instruction sequence in the program). Explain. (10 pts)
_______________________________________________________________________________
_______________________________________________________________________________
_______________________________________________________________________________
_______________________________________________________________________________

6.6 Why did we carry (PC + 4) to the ID stage in our 5-stage early branch CPU design (copy
at the end of the exam), but do not carry (PC+1) to the ID stage here? (5 pts)
[Figure: 16-bit adder producing PC + 1]
_______________________________________________________________________________
_______________________________________________________________________________

6.7 We had HDU_BR, FU_BR, HDU, and FU in our 5-stage early branch CPU design (copy at the end of the exam).
Why do we not have an HDU here? We have the other three pieces here. (5 pts)
_______________________________________________________________________________
_______________________________________________________________________________

6.8 Produce STALL_BR below. (5 pts)

[Figure: Comp Station in ID Stage producing ID_XMEX12, and HDU_BR producing STALL_BR]



Q5P17 Page total 15 pts

[Figure for 6.1: the 4-stage pipeline (IF, ID, EX12, WB) to be completed. Labels in the diagram
include PC, I-MEM, PCSource, Branch Address JJJJ, IF_Flush, STALL_IF_ID, STALL_ADD8, STALL_BR,
HDU_BR, FU_BR, FU, XD_ZERO, Reg. File (IFRF), X0_Mux, X1_Mux, R1_Mux, R2_Mux, two ADD4 units,
FORW0, FORW1, SKIP1, SKIP2, ID_XMEX12, EX12_XMEX12, EX12_A4_A8, and the per-stage control signals
(ID_MOV, ID_BZ, ID_ADD4, ID_ADD8, EX12_MOV, EX12_ADD4, EX12_ADD8, EX12_Write, EX12_RA, WB_Write,
WB_RA, WB_RD). Diagram not reproduced.]

Comp Station in ID Stage: ID_XA (P) is compared with EX12_RA (Q); P=Q produces ID_XMEX12
(ID_XA matched with EX12_RA).

1. Complete the 6 connections to/from
2. Complete the STALL_ADD8 logic in EX12 (generate it).
3. On a separate page, draw logic to produce STALL_BR, PCSource, FORW0, and FORW1.
4. Draw needed logic to produce IF_Flush, SKIP1, SKIP2 on this page itself.
Q5P18 Page total 36 pts
6.9 In our Lab 6 early branch design (copy at the end of the exam), we produced a BR1 signal, which _______ (A/B).
A = may go active based on obsolete values, but no harm is done because of the guardian angel HDU_BR.
B = is generated carefully and does not require any guardian angel's help!
For the current design, produce a BR1 with less logic if possible, and state if there is a guardian
angel to help the BR1. State, if appropriate, how the simple-minded BR1 does not cause any harm.
In the same box below, produce PCSource (mux select line to select the next value for the PC)
and IF_Flush (to flush the Junior1 after a taken branch). (8 pts)

[Figure: PC update logic with a 16-bit adder (PC + 1) and a mux selected by PCSource; BR1, PCSource,
and IF_Flush are to be produced.]

Your above BR1 _______ (A/B). (8 pts)
A = may go active based on obsolete values, but no harm is done because of the guardian angel HDU_BR.
B = is generated carefully and does not require any guardian angel's help!
Explain: __________________________________________________________________________
___________________________________________________________________________________
___________________________________________________________________________________

6.10 Guardian angel: In our 5-stage early branch CPU design (copy at the end of the exam), we said that
HDU_BR acts like a guardian angel to FU_BR and FU_BR could use ___________________
____________________________________________________________________________
(use words like register-writing, memory-reading, register-writing-but-not-memory-reading, if they fit).
Here, while FU_BR could use ____________________ (EX2_Write/EX2_MOV) in place of
the more precise ________________ (EX2_Write/EX2_MOV), using the more precise signal
creates a __________ (slower/faster) timing path! Produce FORW0 and FORW1 below. (10 pts)
Can the FU_Br be generous and help other instructions besides the ID_Bz? _____ (Y/N)

[Figure: forwarding logic with labels ID_XMEX12, EX12_XMEX12, FU_BR, FU, XD, X0_Mux, X1_Mux,
FORW0, and FORW1]

6.11 The ID stage gets stalled for _______ (0/1/2) clock(s), if the BZ in ID is dependent on a MOV in EX12. (10 pts)

The ID stage gets stalled for _______ (0/1/2) clock(s), if the BZ in ID is dependent on an ADD4 in EX12.

The ID stage gets stalled for _______ (0/1/2) clock(s), if the BZ in ID is dependent on an ADD8 in EX12.

The EX12 stage gets stalled for _______ (0/1/2) clock(s), if an ADD8 is present in EX12.



Q5P19 Page total 24 pts
6.12 Complete the following "Single Cycle CPU" version of the 4-stage pipeline design. Complete the
control unit and the six marked points. (16 pts)

[Figure: Single Cycle CPU. Labels in the diagram include PC, PC_EN, PCSource, I-MEM, Branch Address
JJJJ, XD_ZERO, Reg. File (XA, XD, RA, RD, R-Write), R1_Mux, R2_Mux, two ADD4 units, SKIP1, SKIP2,
RESET_B, and the control unit CU (to be completed) with outputs MOV, BZ, ADD4, ADD8, Branch, and
RegWrite. Diagram not reproduced.]

For this single-cycle CPU, do you expect to have a clock with a clock period equivalent to that of
the 4-stage pipeline, or half of it, or double of it? ____________________ (Short answer first). Brief
explanation: (8 pts)
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________

Blank rectangle (for rough work)



Q5P20 Page total 53 pts
6.13 Mr. Trojan thought of an improvement to our above 4-stage pipeline, saving one clock
occasionally. He wants you to implement the improvement below. He has given you enough clues
below in the form of observations, questions, and suggestions. (18 pts)

1. Do you agree that a BZ instruction does not do anything in the last two stages (EX12 and WB)? _______ (Y / N).
2. Unless BZ itself wants to stall because of its dependency on its senior#1, can we let BZ execute
and vanish while an ADD8 is stalled in EX12? ________ (Yes / No).
3. The word "execute" in the preceding sentence may include both taken as well as untaken
branches. ________ (Yes / No).
4. You may want to avoid stalling the PC and the IF/ID stage register to save a clock under that
special occasion.
4.1. However, if a non-branch instruction (other than a NOP) is in the ID stage, the ADD8 related
stall shall stall PC and the IF/ID stage register to avoid loss of the ID stage instruction. ______ (T/F)
5. WB stage and IFRF: The senior in the WB stage may be helping the BZ in the ID stage through
the register file, which is an IFRF. What does IFRF mean? ______________________________
6. The senior in the WB stage may be helping the ADD8 in the EX12 stage also. ______ (T / F)
7. WB stage instruction's behavior when it is stalled: (15 pts)
A register writing WB stage instruction should write to the Register File
Choice #1 (C#1): in every clock even if it is stalled by STALL_ADD8
Choice #2 (C#2): only in clocks when it is not stalled
7.1 In our original design without Trojan's improvement, ______________________________
(C#1 only/C#2 only/both/neither) is/are acceptable.
7.2 In the new design with Trojan's improvement, ______________________________
(C#1 only/C#2 only/both/neither) is/are acceptable.
Discuss/Explain: _____________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________

(20 pts)
[Figure, left: current design without Trojan's improvement, with labels STALL_ADD8, STALL_BR,
STALL_IF_ID, and the EN inputs of PC and IF_ID.]
[Figure, right: please implement Trojan's improvement, with labels STALL_IF_ID and the EN inputs of
PC and IF_ID.]
P21
Lab 6 Early Branch Design (Just FYI)
Non-Grading page. DEN students: No need to submit this page

[Figure: block diagram of the Lab 6 5-stage early branch CPU. Labels in the diagram include PC,
Instruction memory, Control, Registers, Sign ext., Shift Left 2, ALU, ALU ctrl, Data memory, the
Hazard detection unit (with HDU_Br; STALL, STALL_BEQ, STALL_LW), the Forwarding Unit (with FU_Br;
FW_RS, FW_RT, FW_RS_MEM, FW_RS_WB, FW_RT_MEM, FW_RT_WB), BR1, and IF.Flush. Diagram not reproduced.]
P22 Non-Grading page. DEN students: No need to submit this page
Blank page: Please write your name and email. Tear it off and use for rough work. Do not submit.
Student’s First & Last Name:______________________ email: __________________

We enjoyed teaching this course. Hope you liked it too!


Best Wishes!
Gandhi, TA: Rakshith Jayanth Mentor-cum-Graders: Shubham Rana, Ziyu Liu, Wenkai Zhang, Haochen Wu, Junjie Chen, Godha Lakshmi Garudaiahgari
