EE457 Final Fall2023
I have previously read the Viterbi Code of Integrity and other related material at the site
https://viterbischool.usc.edu/academic-integrity/ and I will abide by these rules of conduct.
I will neither seek help from others nor offer help to others in my exams.
1.1 (52 pts, on next 4 pages) Reproduced on the next four pages is the 14-step sequence from our
class notes showing how three competing threads in three single-threaded cores can obtain the lock
in a mutually exclusive manner. Revise the 14-step sequence for MOESI in place of MSI, showing the
O state when needed. Change the contents of the L1 and L2 caches as needed. Answer the 3 questions
on the last page of these 4 pages related to MOESI and FMM. Hint: E-state is never used here.
1.2 (9 pts) If the three threads are all in one single three-threaded core, you expect that
1. mutual exclusion is still possible even though there are no SCUs involved. _____ (T/F).
2. most of the polling (checking for the lock to be released) is done in ______ (M/S/I) state
by the threads that are waiting to acquire the lock.
3. here we can just use LW and SW and do not need LL and SC. _____ (T/F).
1.3 (6 pts) After locking the SDBL and entering the Student Database, can the L1 cache of that core
choose to voluntarily replace the block containing the lock (replace it to bring in some other block)?
_______ (Yes/No). Does it pose any problem? _______ (Yes/No).
1.4 (10 pts) An application to admit students is written to admit one student at a time, release the
SDBL lock, and not seek it again for the next millisecond. During this millisecond, there may or may
not be others looking for the lock! If no one else is looking for the lock, do you expect that the block
containing SDBL in this core would be in the M, S, or I state at the end of the millisecond?
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
2.1 MPI (Miss rate Per Instruction) in the case of a hierarchy of caches
[Figure: processor (P) connected to a hierarchy of caches starting with Cache L1]
Calculate the effective CPI assuming that there are no other losses of clocks
due to stalling or flushes etc. (i.e., CPI would be 1 if there are no cache
misses in L1 cache).
(2 pts) L3 MPI is always less than L2 MPI, which is always less than L1 MPI. True / False
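The effective-CPI calculation asked for above can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `effective_cpi`, the three-level hierarchy, and all numeric values are hypothetical (the figure's actual MPI values and penalties are not reproduced here), and each MPI is taken as misses per instruction at that level, so every miss at level i simply adds that level's miss penalty.

```python
def effective_cpi(base_cpi, mpi_per_level, penalty_per_level):
    """Return base CPI plus stall clocks per instruction caused by cache misses.

    mpi_per_level[i]     : misses per instruction at cache level i (L1 first)
    penalty_per_level[i] : extra clocks paid each time level i misses
                           (the time to reach the next level down)
    """
    if len(mpi_per_level) != len(penalty_per_level):
        raise ValueError("need exactly one penalty per cache level")
    stall = sum(m * p for m, p in zip(mpi_per_level, penalty_per_level))
    return base_cpi + stall


# Hypothetical numbers for illustration only (not taken from the exam figure).
example_cpi = effective_cpi(1.0, [0.05, 0.01, 0.002], [10, 40, 200])
print(round(example_cpi, 3))
```

Because the MPI values are all normalized per instruction, the stall terms can be added directly without multiplying local miss rates together.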
2.2 (8 pts) Branch prediction: A 2Kx2 2-bit BPB in the ID stage is indexed by Mr. Trojan using
______________ ( PC[12:2] / PC[31:21] ) whereas Mr. Bruin used ______________ ( PC[12:2]
/ PC[31:21] ) for indexing. Why is Mr. Bruin wrong? _________ _______________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
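To see how the two candidate index fields behave, the short sketch below computes both for a pair of nearby branch addresses. The helper names `index_low` and `index_high` and the two PC values are hypothetical; the only facts taken from the question are the 2K-entry BPB (11 index bits) and the two bit ranges.

```python
BPB_ENTRIES = 2048             # 2K entries -> 11 index bits
INDEX_MASK = BPB_ENTRIES - 1


def index_low(pc):
    """Index with PC[12:2]: drop the 2 byte-offset bits, keep the next 11 bits."""
    return (pc >> 2) & INDEX_MASK


def index_high(pc):
    """Index with PC[31:21]: keep only the 11 most significant bits."""
    return (pc >> 21) & INDEX_MASK


# Two hypothetical branches in consecutive instruction slots.
pc_a, pc_b = 0x00400000, 0x00400004
print(index_low(pc_a), index_low(pc_b))
print(index_high(pc_a), index_high(pc_b))
```

With these two addresses the low-order indices differ, while the high-order indices coincide, so both branches would share one BPB entry under the high-order scheme.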
2.3 A direct mapped cache in a 32-bit address 32-bit data system uses 4-word blocks. The cache size
is 256KB (= 2^18 bytes = 2^16 words = 2^14 blocks). Identify which of the following two address
divisions was made by Mr. Trojan and which was made by Mr. Bruin.
(6 pts)
Address division by Mr. ____________ (Trojan/Bruin)
Tag (14): A31-A18 | Index (14): A17-A4 | Word: A3-A2 | Byte: A1-A0
Address division by Mr. ____________ (Trojan/Bruin)
Index (14): A31-A18 | Tag (14): A17-A4 | Word: A3-A2 | Byte: A1-A0
(8 pts) Mr. Bruin further says that, since both the index and the tag fields are equal in size
(14 bits each), the sizes of the TAG RAM, DATA RAM, and tag comparison unit are all equal either
way. ___ (T / F)
Why is Mr. Bruin wrong? ________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
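The field extraction for a cache of this size can be sketched as below. This is only an illustration of the first division shown above (index bits directly above the word bits, tag in the upper bits); the name `split_address` and the sample address are hypothetical.

```python
from collections import namedtuple

Fields = namedtuple("Fields", "tag index word byte")


def split_address(addr):
    """Split a 32-bit address as Tag(14) | Index(14) | Word(2) | Byte(2)."""
    byte = addr & 0x3                # A1-A0
    word = (addr >> 2) & 0x3         # A3-A2
    index = (addr >> 4) & 0x3FFF     # A17-A4, selects one of 2^14 blocks
    tag = (addr >> 18) & 0x3FFF      # A31-A18, stored in the TAG RAM
    return Fields(tag, index, word, byte)


# Hypothetical sample address.
fields = split_address(0x12345678)
print([hex(f) for f in fields])
```

With this layout, consecutive blocks in memory fall into consecutive cache lines, since the index changes as soon as the address crosses a 16-byte block boundary.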
EE457 Final - Fall2023 8 / 22 C Copyright 2023 Gandhi Puvvada
Q2P9 Page total 18 pts
2.4 In the following extract from our class notes, we show that the BPB is accessed in the IF stage
and is processed in the ID stage. We cited a timing advantage (register re-balancing). Even if
there is no need for a timing advantage, show that there is a cost advantage in doing this. Assume
(and state) a reasonably sized 2-bit prediction BPB, and explain quantitatively the cost advantage.
__________________________________________________________________________________________
2.6 Tomasulo 3:
Virtual address space = 4GB, Virtual address = 32 bits (VA31-VA0) (2^32 = 4G),
Physical address space = 4GB, Physical address = 32 bits (PA31-PA0) (2^32 = 4G)
3.1 (8 pts) Divide the virtual address into VPN (Virtual Page Number) and Page offset fields. Since
the TLB is fully associative, we ____________ (further divide / do not divide) the
VPN into TAG and SET fields.
How many comparators of what size are needed in the TLB? ______________
[Answer template: virtual address bits VA31-VA0 with the Byte field marked; Bank Enables BE3-BE0 (Byte enables)]
Is any portion of the virtual address used for "indexing" the TLB? ______________ (Yes / No).
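A fully associative lookup can be modeled in a few lines. This is a rough sketch, not the exam's TLB: the 4 KB page size (12 offset bits), the entry layout `(valid, vpn, ppfn)`, the function name `tlb_lookup`, and the sample entries are all assumptions made for illustration.

```python
PAGE_OFFSET_BITS = 12   # ASSUMPTION: 4 KB pages; the exam's page size is not shown in this section


def tlb_lookup(tlb_entries, virtual_address):
    """Return the physical address on a TLB hit, or None on a TLB miss.

    Every valid entry's VPN is compared against the incoming VPN; hardware
    does these comparisons in parallel, one comparator per entry.
    """
    vpn = virtual_address >> PAGE_OFFSET_BITS
    offset = virtual_address & ((1 << PAGE_OFFSET_BITS) - 1)
    for valid, entry_vpn, ppfn in tlb_entries:
        if valid and entry_vpn == vpn:
            return (ppfn << PAGE_OFFSET_BITS) | offset
    return None


# Hypothetical TLB contents: (valid, VPN, PPFN).
tlb = [(True, 0x00403, 0x1F2E0), (False, 0x12345, 0x00001)]
pa = tlb_lookup(tlb, 0x00403ABC)
print(hex(pa))
```

The loop stands in for the parallel compare: the whole VPN acts as the tag, and no part of it selects a particular entry.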
3.2 Divide the virtual address into VPN and Page offset fields again and further divide the VPN
(based on the page table organization information) into page directory index
and 2nd-level page table index.
[Answer template: address bit row with the Byte field marked]
3.3 Divide the physical address into PPFN (Physical Page Frame Number) and Page offset fields.
[Answer template: address bit row with the Byte field marked]
3.4 Divide the physical address (based on cache specifications) into TAG, SET, WORD and BYTE fields.
[Answer template: physical address bits PA31-PA0 with the Byte field marked]
(0111_0000_0101_1000_0110_0001_0010_0100B), which set in the cache will you be approaching?
_____________________________ (set # in binary)
Does this set number form an index (an address) into _____________________________ (the
multiple TAG RAMs / the single TAG RAM / neither of these)?
Complete the TAG RAM details in the side panel.
[Side panel: TAG RAM with Data_in and Data_out, a Comparator producing HIT (+ valid); Size = ______;
_____ more (besides the above) are needed in this cache.]
[Figure: Processor data bus D31-D0 with byte lanes D23-D16, D15-D8, and D7-D0 served by x 8 RAM units (6 pts)]
3.8 (4 pts) TLB miss leads to a _________________ (cache look up / PT look up).
During TLB look up, a Read/Write/Execute violation (a memory protection violation) causes a TRAP. T / F
3.9 (6 pts) In a set associative cache of 2 blocks per set and 4 words per block, the degree of
lower-order interleaving recommended for the main memory is __________ (1-way/2-way/4-way/8-way/
other namely ...) and the number of TAG RAMs is __________ (8/16/32/other namely ...).
The depth of a TAG RAM is determined by ________________________________________.
3.10 (6 pts) In a 4-core processor with each core running 8 threads,
in each core, there is/are ______________ (1 / 4 / 8) PTBR(s),
in each core, there is/are ______________ (1 / 4 / 8) L1 Data cache(s),
in each core, there is/are ______________ (1 / 4 / 8) PC(s),
in each core, there is/are ______________ (1 / 4 / 8) Register File(s).
(6 pts) Assume that our TLB entries have a field containing the Address Space number. Write a few
lines stating the number of TLBs in a core, and whether flushing of the TLB occurs on thread switching
by hardware, on context switching by the operating system, in both situations, or in neither.
___________________________________________________________________________
___________________________________________________________________________
___________________________________________________________________________
___________________________________________________________________________
3.11 (6 pts) The Oracle T1 processor has the TS (Thread Select) stage before the ID stage, whereas in
our EE560 CMP we have the TS stage after the ID stage. Both are right in their own context because .. _______
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
[Figure: circular FIFO diagrams showing write pointer (WP) and read pointer (RP) positions]
5 ( points) min.
Lab 7 Part 3 Subpart 3 Verilog RTL coding: A couple of figures to refresh your memory.
[Figure: ORIGINAL and MODIFIED versions of the IFRF circuit. Labels shown: IFRF_Mux; Reg. File with ports
RA, RD, XA, XD, R-Write, CLK, RESET_B; reg_file[XA]; EX2/WB stage register signals EX2_XMEX1, EX2_RA, EX2_MOV,
EX2_SUB3, EX2_ADD4, EX2_ADD1, WB_RA, WB_Write, WB_SKIP2; FU2 with its P=Q comparator, qualifying signals, and
FORW2; ADD4 with EX2_ADDER_IN, EX2_ADDER_OUT, WB_EX2_ADDER_OUT, Cout, SKIP2]
However, students #2, #3, and #4 differed in where they placed the above 4 lines.
Student #2 placed the 4 lines at the beginning of the else block (i.e., before
producing WB_WRITE, WB_RA, and WB_RD).
Student #3 placed the 4 lines at the end of the else block (i.e., after
producing WB_WRITE, WB_RA, and WB_RD).
Student #4 placed the 4 lines in the middle of the else block (i.e., after
producing WB_WRITE and WB_RA, but before producing WB_RD).
An ADD8 instruction (besides an ADD4 instruction) can be supported in Lab 7 Part 3 by replacing
the SUB3 unit in EX1 with another ADD4 unit. Instead of having an ADD4 unit in each of the two
EX stages, EX1 and EX2, here we have merged those two stages into EX12. So
ADD8 needs an extra clock in EX12, as it has to go through the second ADD4 also.
NOP 0 0 0 0 0 000000DS
We have a BZ (Branch if Zero) instruction. It uses the opcode previously allocated to the SUB3
instruction. The instructions are 32-bits in size, but the addresses are only 16-bit. PC is 16-bit wide
and is incremented by a "1". The JJJJ in the BZ $X, JJJJ stands for a 16-bit (4-digit hex)
absolute branch address. If the source register $X is a zero then we branch to JJJJ [ (PC) <= JJJJ if
($X) = 0 ]. The "D" in "4JJJJ0DS" is a random hex digit and should not be treated as a valid destination,
similar to the "DS" in "000000DS" for a NOP instruction. BZ executes from the ID stage.
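The BZ behavior described above can be modeled with a small sketch. Several things here are assumptions for illustration rather than facts from the lab: the opcode value 4 is read off the "4JJJJ0DS" pattern, the last hex digit S is taken to be the source register number $X, a 16-entry register file is assumed, and the names `decode` and `next_pc` are hypothetical.

```python
BZ_OPCODE = 0x4   # ASSUMPTION: read off the "4JJJJ0DS" pattern in the text


def decode(instr):
    """Split a 32-bit instruction word laid out as opcode | JJJJ | 0 | D | S."""
    opcode = (instr >> 28) & 0xF      # top hex digit
    target = (instr >> 12) & 0xFFFF   # JJJJ: 16-bit absolute branch address
    source = instr & 0xF              # S: assumed source register number ($X)
    return opcode, target, source     # the D digit is ignored, as for NOP


def next_pc(pc, instr, regs):
    """Return JJJJ for a taken BZ, otherwise PC + 1 (16-bit wraparound)."""
    opcode, target, source = decode(instr)
    if opcode == BZ_OPCODE and regs[source] == 0:
        return target
    return (pc + 1) & 0xFFFF


# Hypothetical register contents and instruction word.
regs = [0] * 16
taken = next_pc(0x0010, 0x4ABCD073, regs)       # $3 is zero
regs[3] = 5
not_taken = next_pc(0x0010, 0x4ABCD073, regs)   # $3 is non-zero
print(hex(taken), hex(not_taken))
```

This models only the architectural effect of BZ; it says nothing about the stall, forwarding, or flush logic that the pipeline needs to produce the same result.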
You need to complete the early branch mechanism: dependency stalls, branch execution by causing
PC to be changed to JJJJ and flushing the junior instruction in IF stage, avoiding spurious branch
execution during stalling of ID stage (stalling BZ due to its dependency on ADD4 or ADD8 in the
EX12 stage), etc. A copy of our Lab 6 Early Branch design is given at the end of the
exam just FYI (for your information).
6.1 Complete the design on the page after next (page 17).
6.2 In your lab 7 Part 3 Subpart 2 (EX1 and EX2 merged case), you used the left side circuit below to stall
ADD1 for 1 clock. Complete the design by labeling the STALL signal. Suppose you are given a
flipflop with an asynchronous set as shown in the right side below (instead of the FF with an asynchronous
clear as shown on the left). Redesign your stall circuit with this FF and show the STALL signal.
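(10 pts)
[Figure: left, a D flip-flop with an asynchronous clear (CLR); right, a D flip-flop with an asynchronous
set (SET). Labels shown: EX12_ADD1, CLK, RESET_B, and the STALL?? signal to be labeled on each side]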
6.4 (6 pts) In this design we have implemented an early branch. Would a medium branch from EX12 be better?
Yes / No / It depends. Explain. ____________________________________________________
_______________________________________________________________________________
_______________________________________________________________________________
_______________________________________________________________________________
_______________________________________________________________________________
_______________________________________________________________________________
(6 pts) Is it possible to postpone executing the BZ instruction all the way into the WB stage (WB!, not EX12)?
Not possible / possible but undesirable / possible and desirable. Explain __________________
_______________________________________________________________________________
_______________________________________________________________________________
_______________________________________________________________________________
_______________________________________________________________________________
6.5 (10 pts) Combining EX1 and EX2 into one EX12 stage (as done here) is ____________________________
(always better / always worse / depends on the instruction sequence in the program). Explain. ___
_______________________________________________________________________________
_______________________________________________________________________________
_______________________________________________________________________________
_______________________________________________________________________________
6.6 (5 pts) Why did we carry (PC + 4) to the ID stage in our 5-stage early branch CPU design (copy
at the end of the exam), but do not carry (PC + 1) to the ID stage here? __________________
_______________________________________________________________________________
_______________________________________________________________________________
6.7 (5 pts) We had HDU_BR, FU_BR, HDU, and FU in our 5-stage early branch CPU design (copy at the end of the exam).
Why do we not have an HDU here? We have the other three pieces here. ________________
_______________________________________________________________________________
_______________________________________________________________________________
[Figure (page 17): datapath to be completed. Labels shown: PC, I-MEM, Reg. File / IFRF with ports XA, XD, RA, RD,
R-Write; X0_Mux, X1_Mux, R1_Mux, R2_Mux; two ADD4 units with Cout and EN; HDU_BR, FU_BR, P=Q comparator
(ID_XA vs EX12_RA); signals ID_XMEX12, EX12_XMEX12, XMEX12, XD_ZERO, STALL_BR, JJJJ Branch Address, ID_BZ, ID_MOV,
ID_ADD4, ID_ADD8, EX12_MOV, EX12_ADD4, EX12_ADD8, EX12_A4_A8, EX12_RA, EX12_Write, WB_RA, WB_RD, WB_Write,
IF_Flush, SKIP1, SKIP2, FORW0, FORW1, RESET_B, CLK]
1. Complete the 6 connections to/from
2. Complete the STALL_ADD8 logic in EX12 (generate it).
3. On a separate page, draw logic to produce STALL_BR, PCSource, FORW0, and FORW1.
4. Draw needed logic to produce IF_Flush, SKIP1, SKIP2 on this page itself.
Q5P18 Page total 36 pts
6.9 In our Lab 6 early branch design (copy at the end of the exam), we produced a BR1 signal, which _______ (A/B).
A = may go active based on obsolete values, but no harm is done because of the guardian angel HDU_BR.
B = is generated carefully and does not require any guardian angel’s help!
For the current design, produce a BR1 with less logic if possible, and state if there is a guardian
angel to help the BR1. State, if appropriate, how the simple-minded BR1 does not cause any harm.
In the same box below, produce PCSource (mux select line to select the next value for the PC)
and IF_Flush (to flush the Junior1 after a taken branch).
(8 pts)
[Answer box: 16-bit PC with its input mux (select PCSource, inputs 0 and 1) and a +1 adder; BR1, PCSource, and
IF_Flush to be produced]
6.10 (10 pts) Guardian angel: In our 5-stage early branch CPU design (copy at the end of the exam), we said that
HDU_BR acts like a guardian angel to FU_BR and FU_BR could use ___________________
____________________________________________________________________________
(use words like register-writing, memory-reading, register-writing-but-not-memory-reading, if they fit).
Here, while FU_BR could use ____________________ (EX2_Write/EX2_MOV) in place of
the more precise ________________ (EX2_Write/EX2_MOV), using the more precise signal
creates a __________ (slower/faster) timing path! Produce FORW0 and FORW1 below.
Can the FU_Br be generous and help other instructions besides the ID_Bz? _____ (Y/N)
[Answer box: FU_BR and FU with ID_XMEX12 and EX12_XMEX12; X0_Mux and X1_Mux on the XD path; FORW0 and FORW1
to be produced]
6.11 (10 pts) The ID stage gets stalled for _______ (0/1/2) clock(s), if the BZ in ID is dependent on a MOV in EX12.
The ID stage gets stalled for _______ (0/1/2) clock(s), if the BZ in ID is dependent on an ADD4 in EX12.
The ID stage gets stalled for _______ (0/1/2) clock(s), if the BZ in ID is dependent on an ADD8 in EX12.
The EX12 stage gets stalled for _______ (0/1/2) clock(s), if an ADD8 is present in EX12.
[Figure: single-cycle CPU to be completed. Labels shown: PC with PC_EN, I-MEM, Reg. File, CU decoding ADD4, MOV,
ADD8, and BZ, two ADD4 units, a +1 adder, XD_ZERO, Branch Address, SKIP1, SKIP2, RESET_B; "complete this"]
(8 pts) For this single-cycle CPU, do you expect to have a clock with a clock period equivalent to that of
the 4-stage pipeline, half of it, or double it? ____________________ (Short answer first). Brief
explanation: _________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
(20 pts)
[Figure: stage-register enable (EN) logic with STALL_ADD8, STALL_IF_ID, and STALL_BR]
[Figure (page 21): 5-stage early branch CPU design. Labels shown: PC, Instruction memory, IF_ID register, Control
(opcode, RegDst, ALUSrc, ALUOp, MemRead, MemWrite, MemtoReg, RegWrite, Branch), Registers (rs, rt, rd), STALL,
BR1, FU_Br (FW_RS, FW_RT), Sign ext. and Shift Left 2, ALU and ALU ctrl (funct), Forwarding Unit (FW_RS_MEM,
FW_RS_WB, FW_RT_MEM, FW_RT_WB), WriteRegister_EX, WriteRegister_MEM, Data memory (Store_data, MEM_data),
REG_data, ALU_result, Zero, IF.Flush]
P22 Non-Grading page. DEN students: No need to submit this page
Blank page: Please write your name and email. Tear it off and use for rough work. Do not submit.
Student’s First & Last Name:______________________ email: __________________