No. of Cycles IF ID EXE MEM WB

The document contains the pipeline stages of different instructions over multiple cycles in tabular form. It also contains code snippets of a loop with instructions before and after optimizations like loop unrolling and scheduling instructions to reduce stalls. Finally, it contains questions related to cache organization, memory hierarchy performance, and estimating execution time based on cache hit rates and memory access latencies.


Q1

(a)
No. of Cycles IF ID EXE MEM WB
1. 1, 2
2. 3, 4 1, 2
3. 5, 6 3, 4 1, 2
4. 3, 4 1, 2
5. 3, 4 2 1
6. 4, 5 3 2 1
7. 4, 5 3 2
8. 5, 6 4 3
9. 5, 6 4 3
10. 6 5 4
11. 6 5 4
12. 6 5
13. 1, 2 6 5
14. 3, 4 1, 2 6
15. 5, 6 3, 4 1, 2 6
16. 3, 4 1, 2
17. 3, 4 2 1
18. 4, 5 3 2 1
19. 4, 5 3 2
20. 5, 6 4 3
21. 5, 6 4 3
22. 6 5 4
23. 6 5 4
24. 6 5
25. 6 5
26. 6
27. 6

(b)
Instructions after 2-level loop unrolling (without false dependencies):

Loop: LD  R3, 40(R5)
      DIV R2, R2, R5
      ADD R2, R2, R3
      ST  R2, 20(R5)
      SUB R8, R5, 2
      LD  R6, 38(R5)
      DIV R7, R2, R8
      ADD R2, R7, R6
      ST  R2, 18(R5)
      SUB R5, R5, 4
      BEQ R5, R0, Loop

With stalls:

Loop: LD  R3, 40(R5)
      DIV R2, R2, R5
      (2 stalls)
      ADD R2, R2, R3
      (1 stall)
      ST  R2, 20(R5)
      SUB R8, R5, 2
      LD  R6, 38(R5)
      DIV R7, R2, R8
      (2 stalls)
      ADD R2, R7, R6
      (1 stall)
      ST  R2, 18(R5)
      SUB R5, R5, 4
      (2 stalls)
      BEQ R5, R0, Loop
      (1 stall)

Optimized schedule of instructions:

Loop: DIV R2, R2, R5
      LD  R3, 40(R5)
      SUB R8, R5, 2
      LD  R6, 38(R5)
      ADD R2, R2, R3
      DIV R7, R7, R8
      ST  R2, 20(R5)
      SUB R5, R5, 4
      ADD R7, R7, R6
      (1 stall)
      BEQ R5, R0, Loop
      ST  R7, 18(R5)

(c)
No. of Cycles IF ID EXE MEM WB
1. 1, 2
2. 3, 4 1, 2
3. 5, 6 3, 4 1, 2
4. 3, 4 1, 2
5. 3, 4 2 1
6. 6, 7 4, 5 3 2 1
7. 4, 5 3 2
8. 7, 8 5, 6 4 3
9. 5, 6 4 3
10. 8, 9 6, 7 5 4
11. 6, 7 5 4
12. 10, 11 8, 9 6, 7 5
13. 8, 9 6, 7 5
14. 8, 9 7 6
15. 9, 10 8 7 6
16. 9, 10 8 7
17. 10, 11 9 8
18. 10, 11 9 8
19. 11 10 9
20. 11 10 9
21. 11 10
22. 11 10
23. 11
24. 11

Q2
L1 cache:

Cache size = 128KB
Block size = 16B → offset bits = 4 bits
Total blocks in cache = 128KB / 16B = 8K
Direct-mapped cache, total sets = 8K → index bits = 13 bits
Tag bits = 36 – (4 + 13) = 19 bits
Size of tag array = 19 * 8K = 152K bits = 19KB

L2 cache:

Cache size = 4MB
Block size = 16B → offset bits = 4 bits
Total blocks in cache = 4MB / 16B = 256K
4-way set associative cache, total sets = 64K → index bits = 16 bits
Tag bits = 36 – (4 + 16) = 16 bits
Size of tag array = 16 * 64K * 4 = 4M bits = 512KB
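
The bit-field arithmetic for both caches can be checked with a short script. This is a minimal sketch: the function name `tag_array_bits` and its parameters are illustrative, and it assumes the 36-bit address used in the calculation above and power-of-two sizes.

```python
def tag_array_bits(cache_bytes, block_bytes, ways, addr_bits=36):
    """Return (offset_bits, index_bits, tag_bits, total_tag_array_bits).

    Assumes cache size, block size and set count are powers of two.
    """
    blocks = cache_bytes // block_bytes
    sets = blocks // ways
    offset_bits = block_bytes.bit_length() - 1   # log2(block size)
    index_bits = sets.bit_length() - 1           # log2(number of sets)
    tag_bits = addr_bits - offset_bits - index_bits
    # One tag per block, so the tag array holds tag_bits * blocks bits.
    return offset_bits, index_bits, tag_bits, tag_bits * blocks


KB, MB = 1024, 1024 * 1024
l1 = tag_array_bits(128 * KB, 16, ways=1)   # direct-mapped L1
l2 = tag_array_bits(4 * MB, 16, ways=4)     # 4-way set associative L2

for name, result in (("L1", l1), ("L2", l2)):
    offset, index, tag, bits = result
    print(f"{name}: offset={offset} index={index} tag={tag} "
          f"tag array={bits // 8 // KB} KB")
```

Running it should give 19 tag bits and a 19KB tag array for L1, and 16 tag bits and a 512KB tag array for L2, matching the hand calculation.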

Q3
Assume the processor has a base CPI of 1, and let I be the number of instructions.

a) Design one
Instruction miss cycles = 3% * 200 * I = 6I cycles
Data miss cycles = 8% * 25% * 200 * I = 4I cycles
Total cycles per instruction = 1 + 4 + 6 = 11 cycles

Design two
Instruction miss cycles = 5% * 200 * I = 10I cycles
Data miss cycles = 5% * 25% * 200 * I = 2.5I cycles
Total cycles per instruction = 1 + 2.5 + 10 = 13.5 cycles

Design one is faster than design two by about 22.73 percent (13.5 / 11 ≈ 1.2273).
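
The comparison of the two designs can be reproduced with a few lines of code. This is a sketch under the assumptions used in the solution (base CPI of 1, 25% of instructions access data, 200-cycle miss penalty); the function name `cpi_single_level` is illustrative.

```python
def cpi_single_level(base_cpi, instr_miss, data_miss, data_frac, penalty):
    """Effective CPI with one cache level in front of main memory."""
    instr_stall = instr_miss * penalty              # every instruction is fetched
    data_stall = data_frac * data_miss * penalty    # only loads/stores access data
    return base_cpi + instr_stall + data_stall


design_one = cpi_single_level(1, 0.03, 0.08, 0.25, 200)
design_two = cpi_single_level(1, 0.05, 0.05, 0.25, 200)
ratio = design_two / design_one

print(f"design one CPI = {design_one:.2f}")
print(f"design two CPI = {design_two:.2f}")
print(f"design one is {100 * (ratio - 1):.2f}% faster")
```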

b) Instruction miss cycles for L1 cache = 5% * 20 * I = 1I
Instruction miss cycles for L2 cache = 50% * 5% * 200 * I = 5I
Total instruction miss cycles = 6I
Data miss cycles for L1 cache = 5% * 25% * 20 * I = 0.25I
Data miss cycles for L2 cache = 50% * 5% * 25% * 200 * I = 1.25I
Total data miss cycles = 1.5I
Total cycles per instruction = 1 + 6 + 1.5 = 8.5 cycles
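
The two-level case follows the same pattern, with an L1 miss costing an L2 access and half of those also going to main memory. The sketch below uses the figures from part b) (5% L1 miss rate, 50% local L2 miss rate, 20-cycle L2 penalty, 200-cycle memory penalty, 25% data instructions); the function name `cpi_two_level` is illustrative. Summing the terms 1, 6 and 1.5 gives 8.5 cycles per instruction.

```python
def cpi_two_level(base_cpi, l1_miss, l2_local_miss, data_frac,
                  l2_penalty, mem_penalty):
    """Effective CPI with an L1 and an L2 cache in front of main memory."""
    # Average stall cycles for one memory access (fetch or data).
    stall_per_access = (l1_miss * l2_penalty
                        + l1_miss * l2_local_miss * mem_penalty)
    instr_stall = stall_per_access               # one fetch per instruction
    data_stall = data_frac * stall_per_access    # data accesses per instruction
    return base_cpi + instr_stall + data_stall


cpi = cpi_two_level(1, 0.05, 0.50, 0.25, 20, 200)
print(f"CPI with L2 = {cpi:.2f}")
```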
Q4
a->2, b->2, c->2, d->1, f->3, g->5
a) (5*30% + 5*20% + 4*30% + 4*10% + 20*10%) * 10^6 = 6.1M cycles = 6100000 cycles

b) Clock speed = 4GHz → clock cycle time = 1/4G = 0.25ns
10^6 instruction fetches to L1 → 2 * 10^6 cycles (L1 hit time)
8% miss rate → 0.08 * 10^6 * 250ns = 20 * 10^6 ns → (20 * 10^6 / 0.25) = 80 * 10^6 cycles
Total time = 2 * 10^6 + 80 * 10^6 = 82 * 10^6 cycles = 82000000 cycles

c) Clock speed = 4GHz → clock cycle time = 1/4G = 0.25ns
50% data instructions → 5 * 10^5 accesses to L1 → 2 * 5 * 10^5 = 10^6 cycles (L1 hit time)
8% miss rate → 0.08 * 5 * 10^5 * 250ns = 10^7 ns → (10^7 / 0.25) = 40 * 10^6 cycles
Total time = 10^6 + 40 * 10^6 = 41 * 10^6 cycles = 41000000 cycles

d) Total memory access cycles = instruction fetch + data access
= 82000000 + 41000000
= 123 * 10^6 cycles = 123000000 cycles
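
Parts b) to d) can be cross-checked with a small script. It is a sketch that takes its figures from the solution above (10^6 instructions, 2-cycle L1 hit time, 8% L1 miss rate, 250ns memory latency, 4GHz clock, 50% data instructions); the names `CYCLE_NS` and `access_cycles` are illustrative.

```python
CLOCK_HZ = 4e9
CYCLE_NS = 1e9 / CLOCK_HZ   # 0.25 ns per cycle at 4 GHz


def access_cycles(accesses, l1_hit_cycles, l1_miss_rate, mem_latency_ns):
    """Cycles spent on memory accesses with L1 backed directly by memory."""
    hit_cycles = accesses * l1_hit_cycles
    miss_cycles = accesses * l1_miss_rate * (mem_latency_ns / CYCLE_NS)
    return hit_cycles + miss_cycles


instructions = 1_000_000
fetch = access_cycles(instructions, 2, 0.08, 250)        # part b)
data = access_cycles(0.5 * instructions, 2, 0.08, 250)   # part c)
total = fetch + data                                     # part d)

print(f"fetch = {fetch:,.0f} cycles")
print(f"data  = {data:,.0f} cycles")
print(f"total = {total:,.0f} cycles")
```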

f) 3% of 10^6 = 37.5% of 8% of 10^6, and 5% of 10^6 = 62.5% of 8% of 10^6

1) Large off-chip
L2 hit = 15ns = 15/0.25 cycles = 60 cycles
Fetch instruction cycles = L1 hit + L2 hit + memory access
= 2 * 10^6 + 0.08 * 10^6 * 60 + 0.375 * 0.08 * 10^6 * 1000
= (2 + 4.8 + 30) * 10^6 = 36800000 cycles
50% data instruction cycles = 2 * 0.5 * 10^6 + 0.08 * 0.5 * 10^6 * 60 + 0.375 * 0.08 * 0.5 * 10^6 * 1000
= (1 + 2.4 + 15) * 10^6 = 18400000 cycles
Total memory access cycles = 18400000 + 36800000 = 55200000 cycles

2) Small on-chip
L2 hit = 3ns = 3/0.25 cycles = 12 cycles
Fetch instruction cycles = L1 hit + L2 hit + memory access
= 2 * 10^6 + 0.08 * 10^6 * 12 + 0.625 * 0.08 * 10^6 * 1000
= (2 + 0.96 + 50) * 10^6 = 52960000 cycles
50% data instruction cycles = 2 * 0.5 * 10^6 + 0.08 * 0.5 * 10^6 * 12 + 0.625 * 0.08 * 0.5 * 10^6 * 1000
= (1 + 0.48 + 25) * 10^6 = 26480000 cycles
Total memory access cycles = 26480000 + 52960000 = 79440000 cycles
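
Both L2 options in part f) use the same formula with different L2 hit times and local miss rates, so they can be checked together. The sketch below takes its numbers from the solution (0.25ns cycle, 250ns main memory, 2-cycle L1 hit, 8% L1 miss rate, 50% data instructions); the function names are illustrative.

```python
CYCLE_NS = 0.25                  # 4 GHz clock
MEM_CYCLES = 250 / CYCLE_NS      # 250 ns main memory = 1000 cycles


def memory_cycles(accesses, l2_hit_ns, l2_local_miss,
                  l1_hit_cycles=2, l1_miss=0.08):
    """Cycles for a stream of accesses: L1 hit + L2 hit + memory access."""
    l2_hit_cycles = l2_hit_ns / CYCLE_NS
    per_access = (l1_hit_cycles
                  + l1_miss * l2_hit_cycles
                  + l1_miss * l2_local_miss * MEM_CYCLES)
    return accesses * per_access


def total_memory_cycles(l2_hit_ns, l2_local_miss,
                        instructions=1_000_000, data_frac=0.5):
    """Instruction fetches plus data accesses for the whole program."""
    fetch = memory_cycles(instructions, l2_hit_ns, l2_local_miss)
    data = memory_cycles(instructions * data_frac, l2_hit_ns, l2_local_miss)
    return fetch + data


large_off_chip = total_memory_cycles(l2_hit_ns=15, l2_local_miss=0.375)
small_on_chip = total_memory_cycles(l2_hit_ns=3, l2_local_miss=0.625)

print(f"large off-chip: {large_off_chip:,.0f} cycles")
print(f"small on-chip:  {small_on_chip:,.0f} cycles")
```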

g) 1) Large off-chip
For LD → (7*0.3 + 2*60*0.08*0.3 + 1000*0.375*0.08*0.3) * 10^6 = 13.98 * 10^6 cycles
For ST → (7*0.2 + 2*60*0.08*0.2 + 1000*0.375*0.08*0.2) * 10^6 = 9.32 * 10^6 cycles
For INT → (5*0.3 + 60*0.08*0.3 + 1000*0.375*0.08*0.3) * 10^6 = 11.94 * 10^6 cycles
For BR → (5*0.1 + 60*0.08*0.1 + 1000*0.375*0.08*0.1) * 10^6 = 3.98 * 10^6 cycles
For FL → (21*0.1 + 60*0.08*0.1 + 1000*0.375*0.08*0.1) * 10^6 = 5.58 * 10^6 cycles
Total cycles = 44.8 * 10^6 = 44800000
Total execution time = 44800000 * 0.25 * 10^-9 s = 0.0112 s

2) Small on-chip
For LD → (7*0.3 + 2*12*0.08*0.3 + 1000*0.625*0.08*0.3) * 10^6 = 17.676 * 10^6 cycles
For ST → (7*0.2 + 2*12*0.08*0.2 + 1000*0.625*0.08*0.2) * 10^6 = 11.784 * 10^6 cycles
For INT → (5*0.3 + 12*0.08*0.3 + 1000*0.625*0.08*0.3) * 10^6 = 16.788 * 10^6 cycles
For BR → (5*0.1 + 12*0.08*0.1 + 1000*0.625*0.08*0.1) * 10^6 = 5.596 * 10^6 cycles
For FL → (21*0.1 + 12*0.08*0.1 + 1000*0.625*0.08*0.1) * 10^6 = 7.196 * 10^6 cycles
Total cycles = 59.04 * 10^6 = 59040000
Total execution time = 59040000 * 0.25 * 10^-9 s = 0.01476 s

Option 1 (large off-chip) is about 1.318 times faster than option 2 (59.04 / 44.8 ≈ 1.318), i.e. it needs about 24% less execution time.
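
The per-class sums in part g) can be verified programmatically. This sketch mirrors the expressions exactly as written in the solution, including its choice to charge the L2 hit twice for LD and ST but the main-memory term only once; the table `MIX` and the function `total_cycles` are illustrative names, not part of the original problem.

```python
# class: (fraction of instructions, base cycles, L2 accesses charged)
MIX = {
    "LD":  (0.3, 7, 2),
    "ST":  (0.2, 7, 2),
    "INT": (0.3, 5, 1),
    "BR":  (0.1, 5, 1),
    "FL":  (0.1, 21, 1),
}

CYCLE_S = 0.25e-9   # 4 GHz clock, in seconds


def total_cycles(l2_hit_cycles, l2_local_miss,
                 instructions=1_000_000, l1_miss=0.08, mem_cycles=1000):
    """Total cycles over the instruction mix for one L2 option."""
    cycles_per_instr = 0.0
    for frac, base, l2_accesses in MIX.values():
        per_class = (base
                     + l2_accesses * l1_miss * l2_hit_cycles
                     + l1_miss * l2_local_miss * mem_cycles)
        cycles_per_instr += frac * per_class
    return cycles_per_instr * instructions


large = total_cycles(l2_hit_cycles=60, l2_local_miss=0.375)
small = total_cycles(l2_hit_cycles=12, l2_local_miss=0.625)

print(f"large off-chip: {large:,.0f} cycles, {large * CYCLE_S:.5f} s")
print(f"small on-chip:  {small:,.0f} cycles, {small * CYCLE_S:.5f} s")
print(f"ratio small/large = {small / large:.3f}")
```

The ratio printed on the last line is the cycle count of the small on-chip option divided by that of the large off-chip option, so a value above 1 means the large off-chip option finishes sooner.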
