Advanced Computer Network Assignment Help
Code addresses: 1000, 1004, 1008, 100c, 1010, 1014, 1018
Path 1: 1004 taken to 1014, taken to 1004
Path 2: 1000 not taken to 1004, not taken to 1008
Both paths hash to table entry 2.
b) Construct two distinct legal paths through the code, both ending
at the BRANCH at address 1014, but that produce the same 2
bit global history.
2. You are designing the new Illin 511 processor, a 2-wide in-order pipeline
that will run at 2GHz. The pipeline front end (address generation, fetch,
decode, and in-order issue) is 14 cycles deep. Branch instructions then
take 6 cycles to execute in the back end, load and floating-point
instructions take 8 cycles, and integer ALU operations take 4 cycles to
execute.
a) Your lab partner has written some excellent assembly code that would
be able to achieve a sustained throughput on the Illin 511 of 4 billion
instructions/second, as long as no branches mispredict. Assume that
an average of 1 out of 10 instructions is a branch, and that branches
are correctly predicted at a rate of p. Give an expression for the
average sustained throughput in terms of p.
If we increase the cache size, the branch mispredict penalty increases from
20 cycles to 25 cycles, so, by part (a), utilization would be
$\frac{10}{60 - 50p}$. If we keep the i-cache small, it will have a
non-zero miss rate, $m$. We still fetch $e$ instructions between
mispredicts, but now, of those $e$, $em$ are 10-cycle (= 20-slot) L1 misses.
So the number of instruction slots needed to execute our instructions is
$e(1 + 20m) + 40 = \frac{10}{1 - p}(1 + 20m) + 40$. Thus utilization will be
$\frac{10}{50 + 200m - 40p}$. The larger cache will provide better
utilization when
$\frac{10}{60 - 50p} > \frac{10}{50 + 200m - 40p}$,
or $50 + 200m - 40p > 60 - 50p$, or $p > 1 - 20m$.
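As a sanity check on the algebra above, the following Python sketch evaluates both utilization expressions numerically. The names `utilization` and `larger_cache_wins` are illustrative and not part of the assignment; the sketch assumes the 2-wide machine, one branch per 10 instructions, and 10-cycle L1 misses stated in the problem.

```python
WIDTH = 2            # issue slots per cycle (2-wide pipeline)
BRANCH_EVERY = 10    # one branch per 10 instructions
MISS_CYCLES = 10     # L1 i-cache miss cost in cycles (= 20 slots)


def utilization(p, penalty_cycles, miss_rate=0.0):
    """Useful slots / total slots, counted per 10 instructions (one branch).

    p              -- branch prediction accuracy
    penalty_cycles -- mispredict penalty in cycles
    miss_rate      -- i-cache miss rate m (0 for the larger cache)
    """
    useful = BRANCH_EVERY
    miss_slots = BRANCH_EVERY * miss_rate * WIDTH * MISS_CYCLES
    mispredict_slots = (1.0 - p) * WIDTH * penalty_cycles
    return useful / (useful + miss_slots + mispredict_slots)


def larger_cache_wins(p, m):
    """True when the 25-cycle-penalty, no-miss design has higher utilization."""
    return utilization(p, 25) > utilization(p, 20, miss_rate=m)


if __name__ == "__main__":
    m = 0.01  # hypothetical miss rate, chosen only for illustration
    for p in (0.70, 0.90):
        print(p, utilization(p, 25), utilization(p, 20, m), larger_cache_wins(p, m))
```

With the hypothetical m = 0.01, the crossover predicted by p > 1 - 20m is p = 0.8, so the larger cache should lose at p = 0.70 and win at p = 0.90.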
3. I mentioned, in passing, that the return address stack predictor needs to
be carefully repaired after a mispredict. Assume we are working with an
architecture where it is “easy” to identify call and return instructions (as
distinct from other branch instructions or non-branch instructions).
Assume also that the BTB always hits and provides accurate information
about the instruction being fetched.
The return address stack (RAS) is a small structure that contains an array of
(say) 8 addresses and a “head pointer” that points to one entry of the array.
Each time we speculatively fetch a call instruction we push the expected
return point of the call (PC+4) onto the RAS. We do this by incrementing
the head pointer (mod 8) and writing the return address at the
corresponding RAS array location. Each time we speculatively fetch a
return instruction we predict that the nextPC will be the PC currently
pointed to by the head pointer. Then we decrement the head pointer (mod
8).
a) First, assume that we don’t do any “repair” of the RAS after a branch
mispredict. Give a (short) sequence of instructions and events that
will cause the RAS to “get out of sync” (i.e., mispredict all the return
addresses currently listed in the RAS).
call A
branch mispredict
call B (bumps the RAS pointer)
Roll back to the branch. Now, no matter what we do, every return will get
its target from the RAS entry one above where it should be looking.
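The sequence above can be replayed on a small model of the RAS. This is a minimal sketch, assuming the push/pop behaviour described in the question; the class name `RAS`, the string addresses, and the extra outer call (added so that two returns can be observed) are illustrative assumptions.

```python
class RAS:
    """Return address stack: a circular array plus a head pointer, no repair."""

    def __init__(self, size=8):
        self.size = size
        self.entries = [None] * size
        self.head = 0

    def push(self, return_addr):
        # Speculative fetch of a call: bump the head pointer, then write.
        self.head = (self.head + 1) % self.size
        self.entries[self.head] = return_addr

    def pop(self):
        # Speculative fetch of a return: read at the head, then decrement.
        predicted = self.entries[self.head]
        self.head = (self.head - 1) % self.size
        return predicted


ras = RAS()
ras.push("outer+4")   # correct-path call (assumed, to show a second return)
ras.push("A+4")       # call A on the correct path
# A branch is mispredicted here; fetch continues down the wrong path.
ras.push("B+4")       # wrong-path call B bumps the head pointer
# Rollback to the branch, but the RAS is left untouched (no repair).

first_return = ras.pop()    # should have been A+4
second_return = ras.pop()   # should have been outer+4
print(first_return, second_return)
```

Both predictions come from one entry above the correct one, which is the "out of sync" behaviour the answer describes.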
b) Now, assume that we “repair” the state of the RAS as follows: with
each branch instruction we carry down the pipeline the value of the
RAS head pointer, as it was when the branch was fetched. (Much
the same as we do with the global history). When a branch
misprediction is detected, we reset the RAS head pointer to the
position we carried with the mispredicted branch. Give a (short)
sequence of instructions and events that will cause at least the next
return address to be mispredicted, but not the contents of the rest of
the stack.
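The head-pointer repair scheme described in part (b) can be modelled as follows. This is a sketch under assumptions: the class name `CheckpointedRAS`, the `checkpoint`/`restore` helpers, and the extra outer call are hypothetical, added only to show that entries below the head survive. Because of the outer call, call A lands in entry 2 here rather than entry 1.

```python
class CheckpointedRAS:
    """RAS whose head pointer (but not its contents) is restored on a mispredict."""

    def __init__(self, size=8):
        self.size = size
        self.entries = [None] * size
        self.head = 0

    def push(self, return_addr):
        self.head = (self.head + 1) % self.size
        self.entries[self.head] = return_addr

    def pop(self):
        predicted = self.entries[self.head]
        self.head = (self.head - 1) % self.size
        return predicted

    def checkpoint(self):
        # Value carried down the pipeline with each branch.
        return self.head

    def restore(self, saved_head):
        # Repair on mispredict: only the pointer is reset.
        self.head = saved_head


ras = CheckpointedRAS()
ras.push("outer+4")        # assumed earlier call; head = 1
ras.push("A+4")            # call A; head = 2
saved = ras.checkpoint()   # branch fetched, records head = 2
ras.pop()                  # wrong-path return; head = 1
ras.push("B+4")            # wrong-path call B overwrites entry 2; head = 2
ras.restore(saved)         # mispredict detected; head back to 2

next_return = ras.pop()    # reads the overwritten entry
after_that = ras.pop()     # reads an entry the wrong path never touched
print(next_return, after_that)
```

Only the entry at the checkpointed head position is stale; the entry beneath it still predicts correctly.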
A: call (headpointer <= 1, write value A+4 in RAS[1])
Fetch branch that will mispredict (record headpointer = 1)
Return (headpointer <= 0)
B: call (headpointer <= 1, and we overwrite RAS[1] <= B+4)
Roll back to the branch that mispredicted. This will restore the head
pointer <= 1, but RAS[1] = B+4, not A+4. Thus the next return will try
to jump to B+4 and end up mispredicted.
4. You are designing the register renaming circuit for the new OOPS
Corp. processor (I guess OOPS must be an acronym for “Out-of-Order
Processing Systems”). The OOPS is a 2-wide out-of-order superscalar.
The OOPS ISA is much like MIPS or DLX: there are 32 architectural
registers, register 0 always contains the value 0, instructions have at most
two source operands and at most one destination operand. If an
instruction does not have a destination operand the decoder will tell the
renamer that the instruction’s “destination” is register 0.
a) What is the size of the physical register file you should use if
you want to guarantee the renamer will never stall because the free
list is empty? (Don't worry about the extra register or two that you
would need to keep to account for latch delay) Give an example of
a sequence of instruction dispatches that would require you to
allocate this many registers (assuming that W = 3, A = 4, and R =
5).
A+R.
Example: we would need 9 registers. Suppose we start with the machine
empty, the RAT and RRAT each with 4 entries:
1: p1
2: p2
3: p3
4: p4
and the freelist with 5 more regs: p5, p6, p7, p8, p9.
Now we rename and dispatch 5 instructions, each with arch reg 4 as dest
(but none of the instructions has retired yet).
Now the RRAT hasn’t yet changed state (so p1, p2, p3, p4 are all still
mapped), p5, p6, p7, p8, p9 are allocated to the 5 instructions in the
ROB and the RAT has the state:
1: p1
2: p2
3: p3
4: p9
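The example above can be replayed in a few lines of Python. This is a minimal sketch, assuming A = 4 architectural registers and R = 5 in-flight instructions as in the example; the helper `rename` and the tuple layout of the `rob` list are illustrative, not a description of the actual OOPS renamer.

```python
from collections import deque

A, R = 4, 5  # architectural registers and in-flight instructions, as in the example

rat = {r: f"p{r}" for r in range(1, A + 1)}    # speculative map: 1..4 -> p1..p4
rrat = dict(rat)                               # retirement map, unchanged until retire
free_list = deque(f"p{i}" for i in range(A + 1, A + R + 1))  # p5..p9
rob = []                                       # (arch dest, new preg, old preg)


def rename(dest):
    """Allocate a fresh physical register for one dispatched instruction."""
    new_preg = free_list.popleft()             # an empty free list would mean a stall
    rob.append((dest, new_preg, rat[dest]))    # old mapping is freed only at retire
    rat[dest] = new_preg
    return new_preg


for _ in range(R):
    rename(4)                                  # every instruction writes arch reg 4

# Registers still in use: everything the RRAT maps plus everything in the ROB.
live = set(rrat.values()) | {new for _, new, _ in rob}
print(rat, list(free_list), len(live))
```

After the five dispatches the free list is empty and A + R = 9 physical registers are live, so any smaller register file would have forced a stall.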
b) Explain why you should not actually make the physical register file as
big as your calculation in part (a) suggests.
Neither store instructions nor branches have a dest operand. A reasonable
assumption might be that between 25% and 40% of instrs are branches and
stores, so even if the ROB were full we might be able to get by with A
+ .75R physical regs.
6. You are designing the issue unit for a 1-wide pipeline with in-order issue
and out-of-order completion. As in the “standard” pipeline we’ve been using
in class, once an instruction issues it reads its source register values on the
first cycle and then proceeds to the ALU on the following cycle. Branch
instructions take 6 cycles to execute in the back end, load and floating-point
instructions take 8 cycles, and integer ALU operations take 4 cycles to
execute (including the cycle to read the registers). Instructions write to the
register file on the cycle they complete. Branch mispredictions are detected 6
cycles after the branch issues. On a mispredict all preceding pipeline stages
are flushed. Luckily, none of the instructions can ever take an exception.
(So you don’t need to support precise exceptions).
a) One of the rules that the instruction issue unit must follow is: an
instruction must wait at the issue unit for its source operands to be
written/completed before the instruction can issue. Since this is an in-
order processor the instruction issue unit also follows the rule: an
instruction must wait at the issue unit for all preceding instructions to
issue. Give the rest of the rules that the issue unit must follow.
An instruction must wait for its dest operand to be completed before
issue. (Otherwise we might have a problem with a load followed by an
ALU to the same destination).
Any integer ALU instruction immediately following a branch must
stall at least one cycle before issuing (after that the branch will
flush the instr before it overwrites data).
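The two extra rules can be checked with simple cycle arithmetic. The sketch below assumes a timing convention that the problem statement leaves open: an instruction issued at cycle t writes its register at cycle t + latency, and a branch issued at cycle b squashes every write scheduled for cycle b + 6 or later. The function names are hypothetical.

```python
LATENCY = {"branch": 6, "load": 8, "fp": 8, "alu": 4}       # cycles from issue to completion
WRITES_REG = {"branch": False, "load": True, "fp": True, "alu": True}
MISPREDICT_DETECT = 6                                        # cycles after the branch issues


def min_gap_after_branch(kind):
    """Smallest issue distance behind a branch at which a flush still arrives in time."""
    if not WRITES_REG[kind]:
        return 1                     # nothing to overwrite, so no extra stall
    gap = 1
    while gap + LATENCY[kind] < MISPREDICT_DETECT:
        gap += 1                     # the write would land before the mispredict is known
    return gap


def waw_hazard(older, younger, gap=1):
    """True if the younger instruction would write the shared dest before the older one."""
    return gap + LATENCY[younger] < LATENCY[older]


if __name__ == "__main__":
    for kind in LATENCY:
        print(kind, "may issue", min_gap_after_branch(kind), "cycle(s) after a branch")
    print("load then alu, same dest:", waw_hazard("load", "alu"))
```

Under this convention only an integer ALU instruction needs the extra stall cycle behind a branch, and a load followed by an ALU operation to the same destination is the write-after-write case that the first rule prevents.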