(a)
No. of Cycles IF ID EXE MEM WB
1. 1, 2
2. 3, 4 1, 2
3. 5, 6 3, 4 1, 2
4. 3, 4 1, 2
5. 3, 4 2 1
6. 4, 5 3 2 1
7. 4, 5 3 2
8. 5, 6 4 3
9. 5, 6 4 3
10. 6 5 4
11. 6 5 4
12. 6 5
13. 1, 2 6 5
14. 3, 4 1, 2 6
15. 5, 6 3, 4 1, 2 6
16. 3, 4 1, 2
17. 3, 4 2 1
18. 4, 5 3 2 1
19. 4, 5 3 2
20. 5, 6 4 3
21. 5, 6 4 3
22. 6 5 4
23. 6 5 4
24. 6 5
25. 6 5
26. 6
27. 6
(b)
After 2-level loop unrolling | With stalls            | Optimized schedule
(without false dependencies) |                        | of instructions
-----------------------------+------------------------+-----------------------
Loop: LD R3, 40(R5)          | Loop: LD R3, 40(R5)    | Loop: DIV R2, R2, R5
      DIV R2, R2, R5         |       DIV R2, R2, R5   |       LD R3, 40(R5)
      ADD R2, R2, R3         |       2 stalls         |       SUB R8, R5, 2
      ST R2, 20(R5)          |       ADD R2, R2, R3   |       LD R6, 38(R5)
      SUB R8, R5, 2          |       1 stall          |       ADD R2, R2, R3
      LD R6, 38(R5)          |       ST R2, 20(R5)    |       DIV R7, R7, R8
      DIV R7, R2, R8         |       SUB R8, R5, 2    |       ST R2, 20(R5)
      ADD R2, R7, R6         |       LD R6, 38(R5)    |       SUB R5, R5, 4
      ST R2, 18(R5)          |       DIV R7, R2, R8   |       ADD R7, R7, R6
      SUB R5, R5, 4          |       2 stalls         |       1 stall
      BEQ R5, R0, Loop       |       ADD R2, R7, R6   |       BEQ R5, R0, Loop
                             |       1 stall          |       ST R7, 18(R5)
                             |       ST R2, 18(R5)    |
                             |       SUB R5, R5, 4    |
                             |       2 stalls         |
                             |       BEQ R5, R0, Loop |
                             |       1 stall          |
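As a rough check on the benefit of rescheduling, the two schedules can be compared by counting issue slots per loop iteration. The sketch below is only a tally under a simplifying assumption (one slot per instruction and one slot per stall cycle); it is not an exact cycle count for the pipeline in (a) and (c). The helper name `issue_slots` is hypothetical.

```python
def issue_slots(schedule):
    """Count slots: 1 per instruction (string), n per stall entry (int n)."""
    return sum(entry if isinstance(entry, int) else 1 for entry in schedule)

# "With stalls" schedule; integers are the stall counts listed in the table.
with_stalls = [
    "LD R3, 40(R5)", "DIV R2, R2, R5", 2, "ADD R2, R2, R3", 1, "ST R2, 20(R5)",
    "SUB R8, R5, 2", "LD R6, 38(R5)", "DIV R7, R2, R8", 2, "ADD R2, R7, R6", 1,
    "ST R2, 18(R5)", "SUB R5, R5, 4", 2, "BEQ R5, R0, Loop", 1,
]

# "Optimized schedule" as listed in the table.
optimized = [
    "DIV R2, R2, R5", "LD R3, 40(R5)", "SUB R8, R5, 2", "LD R6, 38(R5)",
    "ADD R2, R2, R3", "DIV R7, R7, R8", "ST R2, 20(R5)", "SUB R5, R5, 4",
    "ADD R7, R7, R6", 1, "BEQ R5, R0, Loop", "ST R7, 18(R5)",
]

print(issue_slots(with_stalls))  # 11 instructions + 9 stall slots
print(issue_slots(optimized))    # 11 instructions + 1 stall slot
```

Under this assumption the optimized schedule removes 8 of the 9 stall slots per iteration.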
(c)
No. of Cycles IF ID EXE MEM WB
1. 1, 2
2. 3, 4 1, 2
3. 5, 6 3, 4 1, 2
4. 3, 4 1, 2
5. 3, 4 2 1
6. 6, 7 4, 5 3 2 1
7. 4, 5 3 2
8. 7, 8 5, 6 4 3
9. 5, 6 4 3
10. 8, 9 6, 7 5 4
11. 6, 7 5 4
12. 10, 11 8, 9 6, 7 5
13. 8, 9 6, 7 5
14. 8, 9 7 6
15. 9, 10 8 7 6
16. 9, 10 8 7
17. 10, 11 9 8
18. 10, 11 9 8
19. 11 10 9
20. 11 10 9
21. 11 10
22. 11 10
23. 11
24. 11
Q2
L1 cache:
L2 cache:
Q3
Suppose the processor has a base CPI of 1 (I = number of instructions).
a) Design one
Instruction miss cycles = 3% * 200 * I = 6I cycles
Data miss cycles = 8% * 25% * 200 * I = 4I cycles
Total cycles per instruction = 1 + 4 + 6 = 11 cycles
Design two
Instruction miss cycles = 5% * 200 * I = 10I cycles
Data miss cycles = 5% * 25% * 200 * I = 2.5I cycles
Total cycles per instruction = 1 + 2.5 + 10 = 13.5 cycles
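The two design totals follow the same pattern: base CPI plus instruction-miss stalls plus data-miss stalls. The sketch below redoes that arithmetic. The function and parameter names are illustrative, and reading 25% as the fraction of instructions that access data and 200 as the miss penalty in cycles is an interpretation of the figures above, not something the solution states.

```python
def effective_cpi(base_cpi, inst_miss_rate, data_miss_rate,
                  data_ref_fraction, miss_penalty):
    """Base CPI plus average stall cycles per instruction from cache misses."""
    inst_stall = inst_miss_rate * miss_penalty                      # every instruction is fetched
    data_stall = data_ref_fraction * data_miss_rate * miss_penalty  # only data references
    return base_cpi + inst_stall + data_stall

# Assumed reading: 25% of instructions reference data, 200-cycle miss penalty.
design_one = effective_cpi(1, 0.03, 0.08, 0.25, 200)
design_two = effective_cpi(1, 0.05, 0.05, 0.25, 200)

print(f"Design one CPI: {design_one:.1f}")
print(f"Design two CPI: {design_two:.1f}")
print(f"Design two / design one: {design_two / design_one:.3f}")
```

With these inputs design one comes out ahead, since its lower instruction miss rate outweighs its higher data miss rate.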
2) Small on-chip
L2 hit = 3 ns = 3/0.25 cycles = 12 cycles
Instruction fetch cycles = L1 hit + L2 hit + memory access
= 2*10^6 + 0.08*10^6*12 + 0.625*0.08*10^6*1000
= (2 + 0.96 + 50) * 10^6 = 52960000 cycles
Data access cycles (50% of instructions) = 2*0.5*10^6 + 0.08*0.5*10^6*12 + 0.625*0.08*0.5*10^6*1000
= (1 + 0.48 + 25) * 10^6 = 26480000 cycles
Total memory access cycles = 26480000 + 52960000 = 79440000 cycles
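The memory-access total for the small on-chip L2 can be re-derived with a short script. This is a sketch that mirrors the expressions above; the names are illustrative, and the 0.25 ns cycle time is taken from the L2 hit conversion (3 ns = 12 cycles).

```python
CYCLE_NS = 0.25                  # cycle time implied by 3 ns = 12 cycles
L2_HIT_CYCLES = 3 / CYCLE_NS     # 12 cycles

def memory_cycles(accesses, l1_hit_cycles, l1_miss_rate,
                  l2_hit_cycles, l2_miss_rate, mem_cycles):
    """Cycles for `accesses` references through L1, then L2, then main memory."""
    per_access = (l1_hit_cycles
                  + l1_miss_rate * l2_hit_cycles
                  + l1_miss_rate * l2_miss_rate * mem_cycles)
    return accesses * per_access

# 10^6 instruction fetches, and data accesses for 50% of the instructions.
fetch = memory_cycles(1e6, 2, 0.08, L2_HIT_CYCLES, 0.625, 1000)
data = memory_cycles(0.5e6, 2, 0.08, L2_HIT_CYCLES, 0.625, 1000)

print(f"fetch: {fetch:.0f} cycles")
print(f"data:  {data:.0f} cycles")
print(f"total: {fetch + data:.0f} cycles")
```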
g) 1) Large off-chip
For LD (7*0.3 + 2*60*0.08*0.3 + 1000*0.375*0.08*0.3) * 10^6 = 13.98*10^6 cycles
For ST (7*0.2 + 2*60*0.08*0.2 + 1000*0.375*0.08*0.2) * 10^6 = 9.32*10^6 cycles
For INT (5*0.3 + 60*0.08*0.3 + 1000*0.375*0.08*0.3) * 10^6 = 11.94*10^6 cycles
For BR (5*0.1 + 60*0.08*0.1 + 1000*0.375*0.08*0.1) * 10^6 = 3.98*10^6 cycles
For FL (21*0.1 + 60*0.08*0.1 + 1000*0.375*0.08*0.1) * 10^6 = 5.58*10^6 cycles
Total cycles = 44.8 * 10^6 = 44800000
Total Execution Time = 44800000 * 0.25 * 10^-9 s = 0.0112 s (11.2 ms)
2) Small on-chip
For LD (7*0.3 + 2*12*0.08*0.3 + 1000*0.625*0.08*0.3) * 10^6 = 17.676*10^6 cycles
For ST (7*0.2 + 2*12*0.08*0.2 + 1000*0.625*0.08*0.2) * 10^6 = 11.784*10^6 cycles
For INT (5*0.3 + 12*0.08*0.3 + 1000*0.625*0.08*0.3) * 10^6 = 16.788*10^6 cycles
For BR (5*0.1 + 12*0.08*0.1 + 1000*0.625*0.08*0.1) * 10^6 = 5.596*10^6 cycles
For FL (21*0.1 + 12*0.08*0.1 + 1000*0.625*0.08*0.1) * 10^6 = 7.196*10^6 cycles
Total cycles = 59.04 * 10^6 = 59040000
Total Execution Time = 59040000 * 0.25 * 10^-9 s = 0.01476 s (14.76 ms)
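The per-class sums for both L2 options can be cross-checked with the sketch below. It mirrors the expressions exactly as written, in which only the L2 term is doubled for LD and ST while the memory term is not. The table layout and names are illustrative. With a 0.25 ns cycle time, multiplying cycles by 0.25 * 10^-9 gives a time in seconds.

```python
CYCLE_TIME_S = 0.25e-9   # 0.25 ns per cycle
INSTRUCTIONS = 1e6
L1_MISS_RATE = 0.08
MEMORY_CYCLES = 1000

# class: (base cycles, fraction of instructions, multiplier on the L2 term)
CLASSES = {
    "LD":  (7, 0.3, 2),
    "ST":  (7, 0.2, 2),
    "INT": (5, 0.3, 1),
    "BR":  (5, 0.1, 1),
    "FL":  (21, 0.1, 1),
}

def total_cycles(l2_hit_cycles, l2_miss_rate):
    """Sum the weighted per-class cycle counts as written in the solution."""
    per_instruction = 0.0
    for base, fraction, l2_multiplier in CLASSES.values():
        l2_term = l2_multiplier * l2_hit_cycles * L1_MISS_RATE
        mem_term = MEMORY_CYCLES * l2_miss_rate * L1_MISS_RATE
        per_instruction += fraction * (base + l2_term + mem_term)
    return per_instruction * INSTRUCTIONS

large = total_cycles(l2_hit_cycles=60, l2_miss_rate=0.375)  # large off-chip L2
small = total_cycles(l2_hit_cycles=12, l2_miss_rate=0.625)  # small on-chip L2

for name, cycles in (("large off-chip", large), ("small on-chip", small)):
    print(f"{name}: {cycles / 1e6:.2f}e6 cycles, {cycles * CYCLE_TIME_S:.5f} s")
```

On these numbers the large off-chip L2 is faster despite its longer hit time, because its lower miss rate avoids more of the 1000-cycle memory accesses.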