
Name: ______________________________        USC ID#: ______________________________

I hereby affirm that all the answers below are my own. I have
neither searched online nor taken assistance from any external
entity.
---------------------------------
Student Signature Above

EE557—Spring 2024
MIDTERM 1
Open Books and Notes
No electronics allowed except a calculator
Time limit: 1 hour and 50 minutes

Q1: / 20
Q2: / 20
Q3: / 10
Q4: / 18
Q5: / 16
Q6: / 20
TOTAL: / 104

The total number of pages is 18 (including this page). Staple and turn in all pages.

Put your name on every page where noted



Question 1 [Power] (20 points)


A) What is the difference between dynamic and static power? (2 points)

Static power is the power dissipated while the chip is powered on but its inputs are not actively switching.

Dynamic power is the power dissipated when the inputs are actively switching.

Give full credit if students mention anything similar to toggling and switching.

B) Describe what sub-threshold power leakage is for CMOS chips. Will clock-gating help to address
sub-threshold power leakage? (3 points)

(2pts) Even when the transistors are OFF, a small current still flows through them, causing energy dissipation.

(1pt) Clock-gating cannot mitigate sub-threshold power leakage, because the transistors leak regardless of whether the clock is toggling.

C) Chip designers embrace hybrid CPU architectures where E (efficient) and P (performance) cores exist in a single package. E cores are optimized for power efficiency, and P cores are for high-performance tasks. In this question, assume the core specifications when clocked at f given in Table 1. Given 100 units of workload and a CPU with 2 P cores and 4 E cores, how should the workload be distributed across those 6 cores to achieve the shortest runtime? Write down the shortest runtime and the energy consumption in terms of P_D, P_S, and T. (5 points)

                                    P Cores     E Cores
Dynamic Power                       6 · P_D     P_D
Static Power                        2 · P_S     P_S
Time taken for 1 unit of workload   T/4         T

Table 1: Core specifications when clocked at f.

1pt Give each P core 33.33 units of workload.

1pt Give each E core 8.33 units of workload.

1pt Execution finishes in 8.33T.

2pt Total energy consumption is (16·P_D + 8·P_S) ⋅ 8.33T = 133.28·T·P_D + 66.64·T·P_S
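As an optional sanity check (not part of the required answer), here is a minimal Python sketch of the workload-balancing arithmetic; the function name balance and all variable names are illustrative, and the energy totals are expressed as coefficients of P_D and P_S from Table 1.

# Balance the workload so that all active cores finish at the same time.
# Rates are in units of workload per time T (Table 1: T/4 per unit on a P core, T on an E core).
def balance(p_cores, e_cores, work=100.0):
    p_rate, e_rate = 4.0, 1.0
    runtime = work / (p_cores * p_rate + e_cores * e_rate)   # runtime in multiples of T
    # Per-core shares: each P core does p_rate*runtime units, each E core does e_rate*runtime units.
    dyn = 6 * p_cores + 1 * e_cores                          # total dynamic power, coefficient of P_D
    sta = 2 * p_cores + 1 * e_cores                          # total static power, coefficient of P_S
    return runtime, dyn * runtime, sta * runtime             # (runtime/T, energy/(T*P_D), energy/(T*P_S))

print(balance(2, 4))   # Part C: (8.33, 133.3, 66.7) -> runtime 8.33T, energy (133.3*P_D + 66.7*P_S)*T
print(balance(4, 0))   # Part D: (6.25, 150.0, 50.0)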


(Question 1 continued)

D) Repeat Part C for 100 units of workload, but the CPU now has 4 P cores and no E cores. (5 points)

2pt Give each P core 25 units of workload.

1pt Execution finishes in 6.25T.

2pt Total energy is (24·P_D + 8·P_S) ⋅ 6.25T = 150·T·P_D + 50·T·P_S

E) Repeat Part C if we increase VDD on the E cores only and clock them at 1.5f. (5 points)

1pt Give each P core 28.57 units of workload.

1pt Give each E core 10.71 units of workload.

1pt Execution finishes in 7.14T.

2pt With only the E cores sped up, each E core's dynamic power becomes 2.25·P_D and its static power 1.5·P_S, while the P cores are unchanged, so the totals are 2·6·P_D + 4·2.25·P_D = 21·P_D and 2·2·P_S + 4·1.5·P_S = 10·P_S. Total energy is (21·P_D + 10·P_S) ⋅ 7.14T = 149.94·T·P_D + 71.4·T·P_S


Question 2 [Speedup] (20 points)


A) You are developing a new enhancement that provides a 2.5x speedup to certain kinds of instructions.
What percentage of a program, as measured by its original execution time, must consist of these
instructions if you want to gain an overall speedup of 10%? (3 points)

Solution: 1.1 = 1/(1 - f + f/2.5)

1/(1 - 0.6f) = 1.1

0.6f = 1 - 1/1.1

f = 0.152 or 15.2%

Grading: 2 points for equation, 1 for final simplification
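As an optional check, the same equation can be evaluated numerically; this short Python sketch (variable names are illustrative) simply rearranges Amdahl's law to isolate the enhanced fraction f.

# Solve 1/(1 - f + f/2.5) = 1.1 for f:  1 - 1/1.1 = f * (1 - 1/2.5)
target_speedup = 1.10
enhancement = 2.5
f = (1 - 1 / target_speedup) / (1 - 1 / enhancement)
print(f)   # ~0.152, i.e. about 15.2% of the original execution time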

B) For the program identified in Part (A), what is the maximum possible speedup if the described
enhancement provides infinite speedup? (2 points)

Solution: Speedup = 1/(1-0.152) = 1.18

Grading: 2 points for the equation

C) Now let us consider a different machine where two enhancements are proposed: one that can enhance
40% of execution time with a speedup of 1.5, and another that can enhance 25% of execution time
with some greater speedup value. Only one of these two can be implemented. How much of a
speedup is necessary in the second enhancement to beat the first enhancement? (3 points)

Solution: T_new1 > T_new2

0.6 + 0.4/1.5 > 0.75 + 0.25/s

7/60 > 0.25/s

s > 15/7, i.e. s > 2.14

Grading: 1 point per side of the equation and 1 point for simplification (do not deduct for calculation errors)
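A quick numeric check of this crossover point (an optional sketch; the variable names are illustrative):

# Enhancement 1: 40% of execution time sped up 1.5x.  Enhancement 2: 25% of time sped up by s.
t_new1 = 0.6 + 0.4 / 1.5          # new execution time with enhancement 1 (fraction of original)
s_min = 0.25 / (t_new1 - 0.75)    # smallest s for which 0.75 + 0.25/s < t_new1
print(t_new1, s_min)              # 0.8667, 2.142857... (= 15/7)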


(Question 2 continued)

D) Suppose an important program you run has the following characteristics.

Instruction Type      % of execution time
Load from memory      12
Store to memory       6
FP multiplication     17
Others                65

You are considering upgrading your machine to one of two possible configurations. M1 reduces the
contribution of loads to the execution time by 3✕ and that of stores by 2✕. M2 reduces the contribution
of (only) floating-point multiplications to the execution time by 10✕. What is the overall speedup of the
program for each upgrade? You may express your answer in terms of an equation with all variables
explicitly substituted. You are not required to perform numerical calculations. (3 points)

Solution:

Speedup of M1 = 1/(0.12/3 + 0.06/2 + 0.17 + 0.65) = 1/0.89 = 1.124 (a 12.4% improvement)

Speedup of M2 = 1/(0.12 + 0.06 + 0.17/10 + 0.65) = 1/0.847 = 1.181 (an 18.1% improvement)

Grading: 1.5 points for the speedup equation for M1. 1.5 points for the speedup equation for M2.

E) What is the maximum possible speedup if all memory accesses could be sped up infinitely? What is
the maximum speedup if all (and only) floating-point multiplications could be sped up infinitely? You
may express your answer in terms of an equation with all variables explicitly substituted. You are not
required to perform numerical calculations. (3 points)

Solution:

Speedup memory: 1/(0.17+0.65) = 1/0.82

Speedup FP: 1/(0.12+0.06+0.65) = 1/0.83

Grading: 1.5 points for the speedup equation for memory. 1.5 points for the speedup equation for FP.
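For completeness, a small Python sketch (optional; variable names are illustrative) that evaluates the speedup expressions from Parts D and E numerically:

load, store, fpmul, other = 0.12, 0.06, 0.17, 0.65
m1 = 1 / (load / 3 + store / 2 + fpmul + other)    # Part D, M1: loads 3x faster, stores 2x faster
m2 = 1 / (load + store + fpmul / 10 + other)       # Part D, M2: FP multiplies 10x faster
mem_inf = 1 / (fpmul + other)                      # Part E: all memory accesses infinitely fast
fp_inf = 1 / (load + store + other)                # Part E: all FP multiplies infinitely fast
print(m1, m2, mem_inf, fp_inf)                     # ~1.124, ~1.181, ~1.220, ~1.205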


(Question 2 continued)

F) Several researchers have suggested that adding a register-memory addressing mode to a load-store
machine might be useful. The idea is to replace sequences of

LD R1, 0(R8)
ADD R2, R2, R1

with

ADD R2, 0(R8)

Assume that the new instruction will cause the clock cycle time to increase by 5% and will not affect
the CPI (clocks per instruction). Also, assume loads constitute 25.1% of all instructions. What
percentage of the loads must be eliminated for the machine with the new instruction to have at least
the same performance? (6 points)

Solution:

Let L be the fraction of loads that are eliminated. This means that 0.251 × L of all instructions are eliminated.

CPU time_old = (# of instructions) × CPI × (cycle time)

CPU time_new = ((1 - 0.251 × L) × # of instructions) × CPI × (1.05 × cycle time)

CPU time_new ≤ CPU time_old

(1 - 0.251 × L) × 1.05 ≤ 1

0.251 × L ≥ 1 - 1/1.05

L ≥ 0.19, or 19%

Grading: 2 points for the basic time equation. 3 points for setting up the equation correctly to solve the
problem. No deductions for calculation mistakes.
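As an optional check on the algebra above, a short computation in Python (names are illustrative):

# With CPI unchanged and cycle time 5% longer, break-even requires (1 - 0.251*L) * 1.05 <= 1.
loads_fraction = 0.251
cycle_time_penalty = 1.05
L = (1 - 1 / cycle_time_penalty) / loads_fraction
print(L)   # ~0.19, i.e. about 19% of loads must be eliminated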


Question 3 [Tomasulo] (10 points)


Your processor currently has a small 2-bit saturating counter-based branch predictor which performs
moderately well. It has 8 Integer Functional Units and 4 Floating Point Units (FPUs), 256KB of on-chip
caches, 4 reservation stations for the Integer Units, and 2 reservation stations for FPUs. The Reorder
Buffer has 8 entries. The processor has a 25 stage pipeline.

The applications you care for have a small code size and work on small data sets in the range of 64 KB.
These applications spend most of their time in loops whose iterations are independent of each other, but
typically have only a limited amount of ILP within a single iteration (within the current processor
implementation).

You have some extra transistor budget that you could spend in (possibly several of) the following ways:

1. Improve the branch predictor accuracy.


2. Add more reservation stations to your Tomasulo’s Algorithm-based Dynamic Scheduler.
3. Add more FPUs and Integer Units.
4. Add more Reorder Buffer entries.

Some of these may be desirable additions, while others may not be too beneficial given the current
configuration. Which of the above four additions should you support, and which ones should you oppose
(you can support/oppose multiple of these)? You need to justify your choices below to receive credit.

1. Improve the branch predictor accuracy: Support or Oppose, and why?

1. Improve the branch predictor accuracy: This is a desirable addition. The problem states that the
branch predictor performs only moderately well and the processor has a long pipeline making branch
mispredictions expensive. Thus, improving branch prediction accuracy is quite likely to increase
performance.


(Question 3 continued)

2. Add more reservation stations to your Tomasulo’s Algorithm-based Dynamic Scheduler: Support or
Oppose, and why?

2. Add more reservation stations to your Tomasulo’s Algorithm-based Dynamic Scheduler: This is a
desirable addition. More reservation stations would mean a larger window within which the processor
can search for ready instructions to execute, thus it can discover more parallelism and keep execution
units busy. This would lead to better performance, especially since our application needs to discover
parallelism across loop iterations.

3. Add more FPUs and Integer Units: Support or Oppose, and why?

3. Add more FPUs and Integer Units: This doesn’t seem to be a good addition. The current machine
already has enough FUs, and we should try to improve other aspects of the processor. Adding more FUs
won’t help if the processor is unable to discover enough parallelism in the instruction stream to keep
them busy.

4. Add more Reorder Buffer entries: Support or Oppose, and why?

4. Add more Reorder Buffer entries: This is a desirable addition. The current configuration has very few
ROB entries. A larger ROB helps to mask the effects of long-latency instructions and helps the processor
search for parallelism within a larger window (this goes together with (3)).

Grading: 2.5 points for correctly analyzing each part. 10 points total. Give partial credit (1 point) if the
student gives a valid reason why an improvement can be avoided.


Question 4 [Software Optimization] (18 points)


Consider a single-issue, in-order five stage pipeline similar to those studied in class, but with the
following specification:

Functional Unit   Cycles in EX   No. of Functional Units   Pipelined
Integer           1              1                         Yes
FP Add/Sub        3              1                         Yes
FP Mult           8              1                         Yes
FP Division       24             1                         Yes

• The integer functional unit performs integer addition (including effective address calculation for
loads/stores), subtraction, and logic operations.
• There is full forwarding and bypassing, including forwarding from the end of a functional unit to
the MEM stage for stores.
• Loads and stores complete in one cycle. That is, they spend one cycle in the MEM stage after the
effective address calculation.
• There are as many registers, both FP and integer, as you need.
• Branches are resolved in ID and there is one branch delay slot.
• While the hardware has full forwarding and bypassing, it is the responsibility of the compiler to
schedule such that the operands of each instruction are available when needed by each
instruction.
• If multiple instructions finish their EX stages in the same cycle, then we will assume they can all
proceed to the MEM stage together. Similarly, if multiple instructions finish their MEM stages in
the same cycle, then we will assume they can all proceed to the WB stage together. In other
words, for the purpose of this problem, you are to ignore structural hazards on the MEM and WB
stages.

This problem explores the ability of the compiler to schedule code as efficiently as possible for such a
pipeline. Consider the following code (also repeated on the next pages for reference):

loop:
L.D    F4, 0(R1)
MUL.D  F8, F4, F0
L.D    F6, 0(R2)
ADD.D  F10, F6, F2
ADD.D  F12, F8, F10
S.D    F12, 0(R3)
DADDIU R1, R1, #8
DADDIU R2, R2, #8
DADDIU R3, R3, #8
SUB.D  R5, R4, R1
BNEZ   R5, loop

Note: DADDIU is just like ADDIU but for 64 bits; it performs an add-immediate operation.


(Question 4 continued)

A) Rewrite the above loop (repeated below for reference), but let every row take a cycle (each row can
be an instruction or a stall). If an instruction can’t be issued on a given cycle (because the current
instruction has a dependency that will not be resolved in time), write “stall” instead, and move on to
the next cycle (row) to see if it can be issued then. Assume that a NOP is scheduled in the branch
delay slot (effectively stalling 1 cycle after the branch). Explain the cause of all stalls, but don’t
reorder instructions. How many cycles elapse before the second iteration begins? (6 points)

L.D F4, 0(R1)


stall RAW F4
MUL.D F8, F4, F0
L.D F6, 0(R2)
stall RAW F6
ADD.D F10, F6, F2
stall RAW F8, F10
stall RAW F8, F10
stall RAW F8
stall RAW F8
ADD.D F12, F8, F10
stall RAW F12
S.D F12, 0(R3)
DADDIU R1, R1, #8
DADDIU R2, R2, #8
DADDIU R3, R3, #8
SUB.D R5, R4, R1
stall RAW R5 (branch resolved in ID)
BNEZ R5, loop
NOP stall for branch delay

20 cycles elapse before the second iteration begins.


Grading: 1 point for each sequence of stalls (1*5); 1 point for a correct cycle count; ½ point partial credit
for identifying that a stall is needed between a pair of instructions, but with an incorrect number of
cycles. Negative ½ point for each unnecessary sequence of stalls


(Question 4 continued)

B) Now reschedule the loop to compute the same results as quickly as possible. You can change
immediate values and memory offsets and reorder instructions without violating any of the
dependencies, but don’t change anything else. Show any stalls that remain. How many cycles elapse
before the second iteration begins? Show your work. (6 points)

Solution:

L.D F4, 0(R1)


L.D F6, 0(R2)
MUL.D F8, F4, F0
ADD.D F10, F6, F2
DADDIU R1, R1, #8
DADDIU R2, R2, #8
DADDIU R3, R3, #8
SUB.D R5, R4, R1
stall RAW F8
stall RAW F8
ADD.D F12, F8, F10
BNEZ R5, loop
S.D F12, -8(R3)

13 cycles elapse before the second iteration begins.

Grading: Full points for any correct sequence with minimum number of stalls. Partial credit only if the
sequence does the same computation and reduces some stalls. Deduct ½ point for each error e.g.
incorrect index. Deduct ½ point for each stall in excess of 2.


(Question 4 continued)

C) Now unroll the loop the minimum number of times needed to eliminate all stalls (with rescheduling).
Show the unrolled and rescheduled loop. You can, and should, remove redundant instructions. How
many original iterations of the loop are in an iteration of your new unrolled loop? How many cycles
elapse before the next iteration of the unrolled loop begins? Don’t worry about start-up or clean-up
code outside the unrolled loop. Assume a very large number of iterations for the original loop. Show
your work. (6 points)

Solution: Note that in the solution below, the registers used could be different and there is some
flexibility in scheduling the instructions.

L.D F4, 0(R1)


L.D F6, 0(R2)
MUL.D F8, F4, F0
L.D F14, 8(R1)
L.D F16, 8(R2)
MUL.D F18, F14, F0
ADD.D F10, F6, F2
ADD.D F20, F16, F2
DADDIU R1, R1, #16
DADDIU R2, R2, #16
DADDIU R3, R3, #16
ADD.D F12, F8, F10
SUB.D R5, R4, R1
ADD.D F22, F18, F20
S.D F12, -16(R3)
BNEZ R5, loop
S.D F22, -8(R3)

The loop has an unroll factor of 2: there are two iterations of the original loop in a single iteration of the
new loop. 17 cycles elapse before the next iteration of the unrolled loop begins.

Grading: 1 point for the correct iteration count. Deduct ½ point for each error or stall cycle. Give partial
credit (3 points) if three iterations are used instead of two and the solution is correct with three
iterations.


Question 5 [Tomasulo] (16 points)


In this problem, we try to understand the implications of the Reorder Buffer (ROB) size on performance.

• Consider a processor implementing the ROB.


• Each instruction goes through issue (IS), execute (EX), write result on the CDB, and commit/retire (CM).
• Assume IS includes instruction fetch, decode and issue.
• Assume IS, CDB, EX and CM each take one cycle (once all the conditions for these stages are
met), as discussed in class.
• Assume our machine can fetch, decode, issue, and commit 4 instructions each cycle.
• Assume a branch misprediction is handled when the branch instruction reaches the head of the
ROB. It involves flushing that ROB entry and all entries following that entry.
• For now, assume there are no memory accesses (this will change in part C).

A) Suppose we have a perfect branch predictor and there is no data dependency between instructions.
We have infinite execution units of each type and infinite reservation stations. All instructions take
one cycle in the EX-stage. What is the maximum achievable IPC? What is the minimum ROB size
required to guarantee that IPC? (4 points)

Solution: Each instruction holds its ROB entry for 4 cycles (Issue, Execute, CDB, and Commit). Therefore,
we must have a minimum ROB size of 16 to avoid any stalls. In that case, the throughput would be 4
instructions per cycle.

Grading: 2 points for the correct IPC. 2 points for the correct ROB size.

B) Suppose different FUs have different latencies in the EX-stage, as given by the following table.
Everything else is the same as in the previous part. What is the minimum size of the ROB required
now to avoid any issue stalls due to a full ROB? (4 points)

Functional Unit Cycles

Integer ALU 3

Floating Point Addition 7

Floating Point Multiplication 13

Solution: If we issue an instruction to the FP multiplier, it will occupy its ROB entry for 16 cycles, and
would also block the instructions following it from committing. During that period, we could have issued
64 instructions in all. Thus, we need a minimum ROB size of 64 to avoid stalls. In that case we would get
a throughput of 4 instructions per cycle.

Grading: 2 points for realizing that FP multiplier is the bottleneck. 2 points for correct ROB size.
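An optional sketch of the ROB-sizing arithmetic used in Parts A and B (the helper name rob_size is illustrative): an instruction occupies its entry for the IS, EX, CDB, and CM cycles, and the ROB must cover that many cycles of issue at full width.

issue_width = 4
def rob_size(ex_cycles):
    occupancy = 1 + ex_cycles + 1 + 1      # IS + EX cycles + CDB + CM
    return issue_width * occupancy

print(rob_size(1))    # Part A: 4 * 4  = 16 entries
print(rob_size(13))   # Part B: 4 * 16 = 64 entries (FP multiply is the longest occupant)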


(Question 5 continued)

C) In addition to the latencies above, now every 10th instruction is a load instruction. Assume the
address calculation and cache/memory access parts of the load both happen in the EX-stage. The hit
rate in the data cache is 95% and the misses are uniformly spaced through the instruction stream. A
hit takes 1 cycle in the EX-stage. However, upon a miss, the data has to be fetched from the memory
and this results in 120 cycles in the EX-stage. What is the ROB size required now to avoid any issue
stalls? (4 points)

Solution: A load instruction that misses in the cache would occupy its ROB entry for 1 + 120 + 1 + 1 = 123
cycles. During this time, we could have issued 492 instructions. This is the size of the ROB required to
continuously issue 4 instructions each cycle.

Grading: 2 points for calculating the correct latency of loads. 2 points for correct ROB size.

D) Now additionally assume we don’t have perfect branch prediction anymore. Instead, we have a
predictor with an accuracy of 95%. Assume every 8th instruction is a branch and the mispredictions
are uniformly spaced through the instruction stream. After a misprediction, how many instructions are
issued before the next misprediction is encountered? In light of this result, do you think we need a
ROB of the size you derived in Part C? Why/why not? If not, what do you think would be a good
ROB size to have? (4 points)

Solution: We would issue 8 × 20 = 160 instructions before a mispredict is encountered. Given this result,
it doesn’t make sense to have a ROB of size 492, because once we have a cache miss, soon after we
would run into a branch mispredict, and all the work done after that would be wasted anyway. A
reasonable ROB size would be 160.

Grading: 2 points for stating that a smaller ROB is sufficient. 2 points for the correct ROB size.
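A similar optional sketch for Parts C and D (names are illustrative):

issue_width = 4
miss_occupancy = 1 + 120 + 1 + 1                 # a load that misses holds its ROB entry for 123 cycles
print(issue_width * miss_occupancy)              # Part C: 492 entries to never stall issue
branch_every, mispredict_rate = 8, 0.05
print(int(branch_every / mispredict_rate))       # Part D: 160 instructions between mispredictions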


Question 6 [Branch Prediction] (20 Points)

Assume that the current state of global history register is: ghist[8:0]= 111010111 and the current branch
PC is PC[5:0]= 000101. You are asked to update a TAGE Branch predictor that has the following
predictor tables:

Table 0 (bimodal):

address   ctr (2 bits)
000       01
001       10
010       00
011       01
100       01
101       11
110       01
111       00

Tables 1-4 (each entry: ctr, tag, u):

address   Table 1 (ctr tag u)   Table 2 (ctr tag u)   Table 3 (ctr tag u)   Table 4 (ctr tag u)
00        001 110 00            101 101 01            110 111 01            100 011 00
01        101 010 11            000 011 00            101 100 00            101 100 00
10        100 100 01            101 000 10            010 010 01            000 101 11
11        111 100 10            111 110 11            011 000 10            111 001 10

You will have to use the global history, the branch PC, and the predictor tables to predict the outcome of the branch, and then update the predictor tables.

You index the bimodal table (table 0) using the following bits from the branch PC: PC[5:3]

You index the predictor tables (tables 1-4) and compute the tag as follows. FHi is the folded history of i
bits that is computed from a subset of ghist bits as specified below.

Address: PC[4:3] ⊕ FH2, Tag: PC[2:0] ⊕ FH3, where ‘⊕’ is an xor operation

Note that table 1 uses history ghist[1:0], table 2 uses history ghist[3:0], table 3 uses history ghist[5:0], and
table 4 uses history ghist[7:0]. These ghist bits are used to compute FH2, FH3 which are then used for tag
and address computation. FHi is the folded global history and here is an example to show you how to
compute it:

Assume we are to fold 8 bits of global history ghist[7:0] into 4 bits or 3 bits:

To fold it into 4 bits → FH4 = ghist[7:4] ⊕ ghist[3:0]

To fold it into 3 bits → FH3 = ghist[8:6] ⊕ ghist[5:3] ⊕ ghist[2:0]

For all non-existing bits, we treat them as 0. Here ghist[8] is treated as 0.

Note also that we are using a 3-bit counter in the tagged predictor components. Hence, 1xx is predicted
taken and 0xx is predicted not taken. A 2-bit counter is used in Table 0 (base predictor) so 1x is predicted
taken and 0x is predicted not taken.


(Question 6 continued)

a) Find the outcome of the prediction from each predictor component (i.e., tables 0-4). Show your
steps in detail. (13 points)

Table 0:

Prediction: Use PC[5:3] = 000 to index Table 0 and read ctr = 01 → predict not taken

Table 1 (uses ghist[1:0] = 11, so FH2 = 11 and FH3 = 011):

Address computation: PC[4:3] ⊕ FH2 = 00 ⊕ 11 = 11

Tag computation: PC[2:0] ⊕ FH3 = 101 ⊕ 011 = 110

Prediction: stored tag is 100 → tag doesn't match → can't predict

Table 2 (uses ghist[3:0] = 0111, so FH2 = 01 ⊕ 11 = 10 and FH3 = 000 ⊕ 111 = 111):

Address computation: PC[4:3] ⊕ FH2 = 00 ⊕ 10 = 10

Tag computation: PC[2:0] ⊕ FH3 = 101 ⊕ 111 = 010

Prediction: stored tag is 000 → tag doesn't match → can't predict

Table 3 (uses ghist[5:0] = 010111, so FH2 = 01 ⊕ 01 ⊕ 11 = 11 and FH3 = 010 ⊕ 111 = 101):

Address computation: PC[4:3] ⊕ FH2 = 00 ⊕ 11 = 11

Tag computation: PC[2:0] ⊕ FH3 = 101 ⊕ 101 = 000

Prediction: stored tag is 000 → tag matches, ctr = 011 → not taken

Table 4 (uses ghist[7:0] = 11010111, so FH2 = 11 ⊕ 01 ⊕ 01 ⊕ 11 = 00 and FH3 = 011 ⊕ 010 ⊕ 111 = 110):

Address computation: PC[4:3] ⊕ FH2 = 00 ⊕ 00 = 00

Tag computation: PC[2:0] ⊕ FH3 = 101 ⊕ 110 = 011

Prediction: stored tag is 011 → tag matches, ctr = 100 → taken

The overall TAGE prediction is supplied by the matching component with the longest history, Table 4, so the branch is predicted taken.
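For anyone who wants to double-check the folded-history arithmetic, here is a small Python sketch (the helper fold and the loop are illustrative, not part of the expected answer); it reproduces the per-table addresses and tags derived above.

ghist = 0b111010111          # ghist[8:0], written MSB first
pc = 0b000101                # PC[5:0]

def fold(history, nbits, width):
    # XOR-fold the low nbits of history into width-bit chunks; missing bits are treated as 0.
    out = 0
    history &= (1 << nbits) - 1
    while history:
        out ^= history & ((1 << width) - 1)
        history >>= width
    return out

pc43 = (pc >> 3) & 0b11      # PC[4:3]
pc20 = pc & 0b111            # PC[2:0]
history_bits = {1: 2, 2: 4, 3: 6, 4: 8}          # number of ghist bits used by Tables 1-4
for table, nbits in history_bits.items():
    addr = pc43 ^ fold(ghist, nbits, 2)
    tag = pc20 ^ fold(ghist, nbits, 3)
    print(table, format(addr, '02b'), format(tag, '03b'))
# Prints: 1 11 110 / 2 10 010 / 3 11 000 / 4 00 011 -- matching the hand computation above.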


(Question 6 continued)

b) Assuming the branch is taken, how will you update the predictor components? Mark the tables
below and show your steps. (7 points)

Table 0 (bimodal):

address   ctr (2 bits)
000       01
001       10
010       00
011       01
100       01
101       11
110       01
111       00

Tables 1-4 (each entry: ctr, tag, u):

address   Table 1 (ctr tag u)   Table 2 (ctr tag u)   Table 3 (ctr tag u)   Table 4 (ctr tag u)
00        001 110 00            101 101 01            110 111 01            101 011 01   <- Table 4 entry updated
01        101 010 11            000 011 00            101 100 00            101 100 00
10        100 100 01            101 000 10            010 010 01            000 101 11
11        111 100 10            111 110 11            011 000 10            111 001 10

Table 4 → increment the counter (100 → 101) and increment the u bit (00 → 01), since the prediction was correct.

