Final Exam Comp Org

The document discusses computer performance metrics, including instruction execution times, CPI calculations, and speedup from instruction optimizations. It also covers cache memory structure, pipelined processor design, and instruction set architecture details. Key calculations and examples illustrate the impact of different architectural choices on performance and efficiency.


Computer Performance - A program consisting of 2,000,000 instructions is executed running at a 1 GHz clock rate. The program uses the following mix of instructions: 40% R-type, 20% LDUR, 10% STUR, 30% CBZ. CBZ and LDUR have a CPI of 5; STUR and R-type have a CPI of 4.
- What is the average CPI? CPIavg (cycles/instr) = sum(CPI * fraction)
CPIavg = 5*(0.3) + 5*(0.2) + 4*(0.4) + 4*(0.1) = 4.5 (cycles/instr)
- How long does it take to execute the program?
texe (seconds) = (# instr * CPIavg cycles/instr) / (clock rate cycles/sec); smaller texe : faster
texe = (2x10^6 instr * 4.5 cycles/instr) / (10^9 cycles/sec) = 0.009 (seconds)
- What is the speedup if CBZ is improved to a CPI of 3?
Speedup = texe(old) / texe(new) = CPIavg(old) / CPIavg(new), since the instruction count and clock rate are unchanged
new CPI = 3*(0.3) + 5*(0.2) + 4*(0.4) + 4*(0.1) = 3.9
Speedup = 4.5/3.9 = 1.15

Little & Big Endian
Memory contents:
Address            | Data
0x8000000000000000 | 0xA1
0x8000000000000001 | 0x32
0x8000000000000002 | 0x5B
0x8000000000000003 | 0xC6
0x8000000000000004 | 0x92
0x8000000000000005 | 0x58
0x8000000000000006 | 0x32
0x8000000000000007 | 0x11
- What are the contents of X11 after LDUR X11, [X9, #0] is executed?
big endian: 0xA1325BC692583211
little endian: 0x11325892C65B32A1

Numbers & arithmetic - Binary multiplication: Step 1: check the rightmost bit of the multiplier: if it's 0, do nothing; if it's 1, add the multiplicand to the product. Step 2: shift the result (the entire 8 bits) to the right.
Multiply 7 (multiplicand) times 6 (multiplier):
Iteration | Step           | Multiplicand | Product|Multiplier | Action
0         | Initial values | 0111         | 0000 0110          |
1         | 1              | 0111         | 0000 0110          | Lowest bit of multiplier is 0, don't add multiplicand
1         | 2              | 0111         | 0000 0011          | Shift right
2         | 1              | 0111         | 0111 0011          | Lowest bit of multiplier is 1, add multiplicand
2         | 2              | 0111         | 0011 1001          | Shift right
3         | 1              | 0111         | 1010 1001          | Lowest bit of multiplier is 1, add multiplicand
3         | 2              | 0111         | 0101 0100          | Shift right
4         | 1              | 0111         | 0101 0100          | Lowest bit of multiplier is 0, don't add multiplicand
4         | 2              | 0111         | 0010 1010          | Shift right
Final product: 0010 1010 = 42
Parts of the computer:
- Input: writes data & instructions into memory
- Control: sends signals that determine the operations of the datapath, memory, input & output
- Datapath: executes instructions
- Memory: stores instructions & data
- Output: presents results to the user

Things that are determined by the design of the instruction set architecture: format of machine code; types of instructions; number of bytes in a register; number of registers; set-up of registers; data types and how they're expressed.

Single Cycle Data Path
- What are the control values input to the multiplexors (mux's) for a CBZ instruction where the output of the Zero control line is 1? Muxes 12, 13, and 15: Mux 12: 1, Mux 13: 0, Mux 15: 1
- ADD and STUR read data from two registers.

Binary addition:
- If the sum is 0 or 1, write it down in the result.
- If the sum is 2, the sum bit is 0 and the carry bit is 1.
- If the sum is 3, the sum bit is 1 and the carry bit is 1.
*** When multiplying two 16-bit numbers, 32 bits are required for the product.

ARM ISA: R-format instruction
opcode (11 bits, basic op) | rm (5 bits, 2nd source) | shamt (6 bits, shift) | rn (5 bits, 1st source) | rd (5 bits, destination)
What is the 32-bit machine code corresponding to LSL X10, X9, #3?
opcode = 11010011011, rm is not used = 00000, shamt = 000011, rn = 01001, and rd = 01010
➝ 1101 0011 0110 0000 0000 1101 0010 1010 ➝ 0xD3600D2A

Decimal ↔ IEEE 754 (doesn't provide an exact representation of a floating point number)
normalized form: (sign ±)1.(fraction)_2 x 2^(exponent)
sign: 1 bit; exponent: 8 bits; fraction: 23 bits
- What number does 0x0C000000 represent if it is an IEEE 754 format floating point number?
0x0C000000 = 0 00011000 00000000000000000000000 → Sign: 0 (+); Exponent: 24 - 127 = -103; Fraction: 0 → 1.0 x 2^-103 = 9.861 x 10^-32
- Express -4.75 in IEEE 754 format hexadecimal:
4.75 = 4 + 0.5 + 0.25 = 2^2 + 2^-1 + 2^-2 = 100.11 in binary
100.11 = 1.0011 x 2^2 normalized
Sign = 1 (negative); Exponent = 2 + 127 = 129 = 10000001; Fraction = 00110000000000000000000
→ 1 10000001 00110000000000000000000 → 0xC0980000
Floating point: IEEE-754 special values
Special Value                | Exponent  | Fraction
±0 (two zeroes)              | 0000 0000 | 0
Denormalized number (2^-126) | 0000 0000 | Non-zero
NaN (Not a Number)           | 1111 1111 | Non-zero
± Infinity                   | 1111 1111 | 0
- What number does 0xFF800000 represent if it is an IEEE 754 format floating point number?
sign: 1, exponent: 1111 1111, fraction: 0 → special value: negative infinity

Which control signals are forwarded along the pipeline…
1. to the EX stage? ALUSrc, ALUOp
2. to the Memory stage? Branch, MemWrite, MemRead
3. for use in the WriteBack stage? MemtoReg, RegWrite

Control signals, and which datapath elements produce output, by instruction type:
- Integer arithmetic & logical, R-format (ADD, SUB, AND, ORR): Reg2Loc 0, ALUSrc 0, MemtoReg 0, RegWrite 1, MemRead 0, MemWrite 0, Branch 0, ALUOp 10; elements 1, 2, 3, 4, 5, 12, 13, 14
- Memory-reference, LDUR: Reg2Loc X, ALUSrc 1, MemtoReg 1, RegWrite 1, MemRead 1, MemWrite 0, Branch 0, ALUOp 00; elements 1, 2, 3, 4, 5, 6, 7, 13, 14
- Memory-reference, STUR: Reg2Loc 1, ALUSrc 1, MemtoReg X, RegWrite 0, MemRead 0, MemWrite 1, Branch 0, ALUOp 00
- Branch, CBZ: Reg2Loc 1, ALUSrc 0, MemtoReg X, RegWrite 0, MemRead 0, MemWrite 0, Branch 1, ALUOp 01; elements 1, 2, 3, 4, 5, 6, 12, 13, 8?, 9, 10, 11, 15

ARM ISA: I-format instruction
opcode (10 bits, basic op) | immediate (12 bits, value of operand) | rn (5 bits, source reg) | rd (5 bits, destination)
ADDI X9, X10, #100: opcode = 1001000100, immediate = 000001100100, rn = 01010, and rd = 01001
➝ 1001 0001 0000 0001 1001 0001 0100 1001 ➝ 0x91019149

ARM ISA: D-format instruction
Load - transfers data from memory to a register; Store - transfers data from a register to memory. LDURB - 1 byte; LDURH - half (2 bytes); LDURW - word (4 bytes); LDUR - double (8 bytes)
opcode (11 bits) | address (9 bits) | op2 (2 bits) | rn (5 bits) | rt (5 bits)
LDURH X11, [X10, #4]: opcode = 01111000010, address = 000000100, op2 = 00, rn = 01010, and rt = 01011
➝ 0111 1000 0100 0000 0100 0001 0100 1011 ➝ 0x7840414B
Pipelined Data Path
1. A single cycle processor with a cycle time of 1000 ps is split into a 5-stage pipelined processor with a cycle time of 250 ps.
- How long does it take an instruction to execute in the pipelined processor? 250*5 = 1250 ps
- What is the speedup for executing 5 instructions?
5 instructions single cycle = 1000*5 = 5000 ps
5 instructions pipelined = 1250 + 250*4 = 2250 ps (1250 for the first instruction; after that, an instruction completes every 250 ps)
Speedup = old time/new time = 5000/2250 = 2.22
- What would the speedup be for an infinite number of instructions? 1000/250 = 4

2. We are designing a pipelined processor. The pipeline stages have the following execution times: Instruction Fetch: 150 ps; Instruction Decode: 200 ps; ALU Execute: 150 ps; Memory Read/Write: 150 ps; Write Back: 150 ps. The longest stage time sets the cycle time.
- What is the clock rate for the pipelined processor? 1/200 ps = 5 GHz

3. Using the stage times from problem 2:
- What would the cycle time be for a single cycle processor? 800 ps, the time required by an instruction to pass through all stages (150+200+150+150+150)
- What would the cycle time be for a pipelined processor? 200 ps, the longest time required in a stage
- How long does one instruction take to get through the pipelined processor? 200*5 = 1000 ps (the longest stage time * the number of stages)
- What is the speedup of the pipelined processor compared to the single cycle processor when executing 10 instructions?
Single cycle time: 800*10 = 8000 ps
Pipelined time: 1000 + 9*200 = 2800 ps
Speedup = old time/new time = 8000/2800 = 2.86

Cache
Why Do We Have Cache?
- Registers: fast access; expensive per byte; on chip; limited size (number of bytes); data operated on by the ALU must be in registers; data has to be transferred in from main memory
- Main memory: slower access | less expensive per byte | large
- Cache: used to make transfers to registers faster; faster access than main memory; more expensive per byte than main memory; smaller than main memory; cache is a subset of main memory; put items in cache that are likely to be accessed:
  - temporal locality: recently accessed items may be accessed again
  - spatial locality: items near recently accessed items may be accessed (arrays are together in memory)

What causes cache misses (not finding the data in cache ➝ have to use main memory)?
- Compulsory misses: the cache is empty when a process starts
- Capacity misses: the cache is full; based on the size of the cache (a fully associative cache still has capacity misses)
- Conflict misses: more than one block maps to the same line (fully associative: no conflict misses, any block can go anywhere; direct mapped: the highest rate of conflict misses)

3. A system has a main memory of 2^32 bytes, a block size of 16 bytes, and a cache size of 2^12 bytes. If address 0xABCABC00 is requested, and the cache is fully associative, what is the tag and what is the offset? Where will the system look in the cache for the tag? How many possible places could this byte be if it is in the cache?
Offset = log2(block size) = log2(16) = 4 bits = 0000
Tag = log2(memory size) - log2(block size) = 32 - 4 = 28 bits
Address = 1010 1011 1100 1010 1011 1100 0000 0000
Tag = 1010 1011 1100 1010 1011 1100 0000; Offset = 0000
# of lines = cache size / block size = (2^12 bytes/cache) / (2^4 bytes/line) = 2^8 lines/cache
The system will look in every line of the cache: 2^8 possible places.

4. The same system, but the cache is direct mapped: what is the tag, line, and offset? Where will the system look in the cache for the tag? How many possible places could this byte be if it is in the cache?
Offset field = log2(16) = 4 bits
Line field = log2(2^12 / 2^4) = 8 bits (# of lines = (2^12 bytes/cache) / (2^4 bytes/line) = 2^8 lines/cache)
Tag field = 32 - 8 - 4 = 20 bits
Address = 1010 1011 1100 1010 1011 1100 0000 0000
Tag = 1010 1011 1100 1010 1011; Line = 1100 0000; Offset = 0000
There is one possible place where this byte could be, and that is at line 1100 0000. The system will look at that line to see if the tag matches.
Branch Prediction:
- Static branch prediction: predict the branch at compile time (predict "not taken" for branching out of a loop; predict "taken" for a backward branch to stay in a loop)
- Dynamic branch prediction: predict while the program is running (1-bit predictor: do whatever was done last time; 2-bit predictor: change the prediction after getting it wrong twice; keep a table of past predictions)

forward loop: predict not taken
LoopStart: cbz x9, LoopDone
           ldur x10, [x13, x16]
           ldur x11, [x14, x16]
           add x12, x10, x11
           stur x12, [x15, x16]
           sub x9, x9, 1
           add x16, x16, 8
           b LoopStart
LoopDone:  br x30

backward loop: predict taken
LoopStart: ldur x10, [x13, x16]
           ldur x11, [x14, x16]
           add x12, x10, x11
           stur x12, [x15, x16]
           sub x9, x9, 1
           add x16, x16, 8
           cbnz x9, LoopStart
           br x30

Stalls: data hazards requiring a stall: load-use (LDUR) | CBZ
LDUR X9, [X10, #0]
CBZ X9, #8
Requires two stalls, because the CBZ ID stage needs to wait for the LDUR MEM stage.

LDUR X9, [X20]
CBZ X9, Label
SUB X12, X10, X9
SUB X12, X12, X19
LDUR X13, [X20, #8]
Label: ADD X9, X9, X13
→ control hazard: requires a stall to wait for the branch decision. Move branch checking to earlier in the pipeline to avoid 3 cycles in which the instruction to fetch is unknown. There is still one cycle in which we don't know what instruction to fetch.

Binary - Hex
0000-0  0001-1  0010-2  0011-3  0100-4  0101-5  0110-6  0111-7
1000-8  1001-9  1010-A  1011-B  1100-C  1101-D  1110-E  1111-F

2. A system has a main memory of 2^16 bytes and is split into blocks of 2^6 bytes. The system has a cache that is 2^12 bytes.
- How many bits wide is the address? log2(2^16) = 16
- How many bits wide is the offset field? log2(block size) = log2(2^6) = 6
- How many bits describe the block? 16 - 6 = 10
- How many lines does the cache have? (2^12 bytes/cache) / (2^6 bytes/line) = 2^6 lines/cache = 64 lines/cache
- If the cache is fully associative, into how many possible lines can a block be placed? 64: any block can go to any line
- If the cache is direct mapped, into how many possible lines can a block be placed? 1: each block maps to one line in the cache
- If the cache is 4-way set associative, into how many possible lines can a block be placed? 4: a block maps to a set, and that set has four lines

5. A system has a main memory of 2^32 bytes, a block size of 16 bytes, and a cache size of 2^12 bytes. If address 0xABCABC00 is requested, and the cache is 2-way set associative, what is the tag, set, and offset? Where will the system look in the cache for the tag? How many possible places could this byte be if it is in the cache?
Offset field = log2(block size) = log2(16) = 4 bits
Number of sets = (2^12 bytes/cache) / ((2^4 bytes/line) * (2 lines/set)) = 2^7 sets/cache; set field = log2(2^7) = 7 bits
Tag field = 32 - 7 - 4 = 21 bits
Address = 1010 1011 1100 1010 1011 1100 0000 0000
Tag = 1010 1011 1100 1010 1011 1; Set = 100 0000; Offset = 0000
The system will look at set 100 0000. There are two lines there, each with a tag. It will look at those two tags and see if one of them is a match.

1. Consider a system with a main memory access time of 200 ns, an L1 cache having a 20 ns access time and a hit rate of 90%, and an L2 cache having a 40 ns access time and a hit rate of 95%.
AMAT, look-aside (parallel) cache: 20*0.9 + 40*0.1*0.95 + 200*0.1*0.05 = 22.8 ns
AMAT, look-through cache (100% of accesses go through L1): 1*20 + 0.1*40 + 0.1*0.05*200 = 20 + 0.1*(40 + 0.05*200) = 25 ns
Flynn's Taxonomy (instruction streams x data streams):
- SISD: one processor, one data path for a single instruction
- SIMD (vector processing): the same operation on a vector of values; handled with vector instructions and registers
- MISD: no "processor" does this, but fault tolerance (multiple computers operating on the same data) could fall into this category
- MIMD: single program, multiple data; shared memory processing; multiple processors networked together to accomplish a task; threading is used in MIMD

STATIC: compile time; DYNAMIC: running time
MULTIPROCESSOR: a computer with more than one processor, also known as multicore
PARALLEL PROCESSING: a single program running on multiple processors simultaneously
SHARED MEMORY PROCESSING: multiple cores; threading; a single physical address space for all processors
- A multi-core processor is shared memory processing.
- Cache coherence concerns shared memory processing.

MSI Protocol:
- M flag: this core has written to its copy of the cached value
- S flag: the block is unmodified and can be safely shared, meaning that multiple L1 caches may safely store a copy of the block and read from their copy
- I flag: indicates that the cached block is invalid or contains an older copy of the data that does not reflect the current value of the block of memory
SNOOPING: a processor needs some mechanism to identify when accesses to one core's L1 cache content require coherency state changes involving the other cores' L1 cache contents ➝ implemented by snooping on a bus that is shared by all L1 caches; a core can identify any read or write from another L1 cache for a block it stores.

Instruction packets:
Packet 1: sub x5, x3, x4
Packet 2: add x1, x3, x4 (all commands after this depend on add)
Packet 3: stur x1, [x0, #8]; ldur x2, [x0, #4]
Packet 4: add x5, x2, x2; stur x5, [x0, #12]
Packet 5: sub x7, x5, x2 (with a stall because of the data hazard); cbz x7, #3

Virtual Memory - Address Translation, TLB
- A memory request results in a page table hit and a cache miss. What is the order that resources are checked? TLB, Page Table, Cache, MM
- Name a type of instruction that will cause the cache to be checked: Load and Store

A virtual memory system has 2^12 byte pages, a four-entry fully associative TLB with LRU replacement, and the following TLB and page table states. Virtual addresses are 16 bits and physical addresses are 15 bits (16 virtual pages, 8 physical frames).
TLB:
Valid bit | Virtual page # | Physical frame # | Access time
1         | 0x0            | 2                | 19
1         | 0xf            | 5                | 18
1         | 0x3            | 3                | 22
1         | 0xe            | 0                | 23
Page table (valid entries used below): page 0x0 → frame 2; page 0x3 → frame 3; page 0x9 → frame 4

1. The virtual address 0x91bb is requested. Is it a TLB hit, page table hit, or page fault?
Offset = log2(2^12) = 12 bits; page = the leftmost 16 - 12 = 4 bits = 9
Page 9 is not in the TLB → check the page table: page 9 is valid → page table hit
- Translate the virtual address to a physical address: replace the page number 9 (1001, 4 bits) with the frame number 4 (100, 3 bits)
Physical address: 100 000110111011 = 0x41bb
- What is the updated state of the TLB? LRU → replace the oldest entry (0xf, access time 18) with 0x9 → frame 4, at the next access time (24)

2. The virtual address 0x09b3 is requested. Is it a TLB hit, page table hit, or page fault?
Offset = log2(2^12) = 12 bits; offset = 0x9b3; virtual page = 0
Page 0 is in the TLB → TLB hit
- Translate: replace the page number 0 (0000) with the frame number 2 (010)
Physical address: 010 100110110011 = 0x29b3
- What is the updated state of the TLB? The 0x0 entry stays; its access time is updated to the next access time

3. A virtual memory system has 2^7 byte pages, a four-entry fully associative TLB with LRU replacement, and the TLB and page table states above. Physical memory has 8 frames and there are 16 virtual pages.
- How many bits are in the physical address? log2(2^7) = 7 bits in the offset; 8 frames requires 3 bits in the frame field → 3+7 = 10 bits
- How many bits are in the virtual address? 7 bits in the offset; 16 pages requires 4 bits in the page field → 4+7 = 11 bits
- If address 0x1A3 is requested, is it a hit in the TLB? Is it a page table hit? A page fault?
0x1A3 = 0011|0100011 → page = 0011 = 3; page 3 is in the TLB → TLB hit
- The access time is 25. What is the state of the TLB after the access? The 0x3 entry's access time is updated to 25.

4. A virtual memory system has 2^12 byte pages. There are 8 frames in main memory and 16 pages in the page table.
- Bits in the physical address: 8 = 2^3 frames → 12+3 = 15 bits
- Bits in the virtual address: 16 = 2^4 pages → 12+4 = 16 bits

5. Main memory contains a total of 2^4 frames and each frame is 2^3 bytes.
- How many bits are in the offset field? log2(2^3) = 3 bits
- How many bits are required for the physical address? 4+3 = 7 bits
- How many bits are required for the virtual address? 8 bits
- Convert the following virtual address to a physical address: 0xFB
0xFB = 11111|011 → page = 11111 = 31 → frame = 5 (check the page table)
5 = 101 → replace 11111 with 101; the offset is 011 → 010 1011 or 0x2B

6. A byte-addressable system has 2^4 bytes of main memory and a direct mapped cache of 8 bytes. A line is 2 bytes. The system is little endian. The state of the cache is as follows:
Line number | Tag | Data (2-byte line)
00          | 1   | 0xACD7
01          | 0   | 0x32A8
10          | 0   | 0x7B52
11          | 1   | 0xBB65
A program requests access to address 0x5. What is the value of the data (one byte) at that address?
Offset: log2(line size) = log2(2) = 1 bit
Line field: (2^3 bytes/cache) / (2^1 bytes/line) = 2^2 lines → 2 bits
Tag = MM bits - everything else = 4 - 1 - 2 = 1 bit
0x5 = 0101 → tag: 0; line: 10; offset: 1
Line 10 holds tag 0 with data 0x7B52; little endian with offset = 1 → the value is 0x7B.
More examples:
Using little endian: offset = 0 → value 0x52; offset = 1 → value 0x7B
Using big endian: offset = 0 → value 0x7B; offset = 1 → value 0x52
- Write ARM code that corresponds to this C++ code:

int myProc1(int i, int j) {
    int k = myProc2(i);
    k = k + j;
    return k;
}

//create a stack frame: i, j, x29, x30 → 32 bytes
SUB sp, sp, 32
STUR x0, [sp, #0]
STUR x1, [sp, #8]
STUR x29, [sp, #16]
STUR x30, [sp, #24]
//call myProc2
BL myProc2
//save k
MOV x2, x1
//restore items from the stack
LDUR x0, [sp, #0]
LDUR x1, [sp, #8]
LDUR x29, [sp, #16]
LDUR x30, [sp, #24]
ADD sp, sp, #32    //reset stack ptr
ADD x2, x2, x1     //update k (k = k + j)
//return
BR x30

long long int procedureA(long long int arg1, long long int arg2) {
    long long int var1, var2;
    var1 = procedureB(arg2);
    var2 = var1 + arg1;
    return var2;
}

We will convert this code to ARM. Assume var1 is in X9 and var2 is in X10.
//first we will prepare a stack frame for procedureA before the call to procedureB
//we will put arg1, arg2, the frame pointer, and the return address on the stack
sub sp, sp, #32    //make room on the stack
stur X0, [sp, #24] //place arg1 on the stack
stur X1, [sp, #16] //place arg2 on the stack
stur X30, [sp, #8] //place the return address on the stack
stur X29, [sp, #0] //place the frame pointer on the stack
mov X0, X1         //put arg2 in the appropriate register to be the argument for procedureB
bl procedureB      //call procedureB
mov x9, x1         //load x9 (var1) with the value returned by procedureB
ldur X0, [sp, #24] //restore arg1 from the stack
ldur X1, [sp, #16] //restore arg2 from the stack
ldur X30, [sp, #8] //restore the return address from the stack
ldur x29, [sp, #0] //restore the frame pointer from the stack
add sp, sp, #32    //move the stack pointer back to where it was before
add x10, x9, x0    //var2 = var1 + arg1
mov x2, x10        //move var2 to the appropriate register for the return from procedureA
br X30             //return: the BR mnemonic followed by a register

for (i = 0; i < 10; i++) {
    myArr[i] = myArr[i] + 1;
}

The address of myArr is in X9. The variable i will be in X10.
mov X10, #0          //initialize i to zero
Loop:                //loop label
sub X13, X10, #10    //subtract 10 from i and place in X13
cbz X13, Exit        //if X13 == 0, jump to the Exit label
lsl X14, X10, #3     //multiply i by 8 and place in X14
ldur X16, [X9, X14]  //get myArr[i]
add X16, X16, #1     //add 1 to myArr[i]
stur X16, [X9, X14]  //save myArr[i]
add X10, X10, #1     //add 1 to i
b Loop               //branch to Loop
Exit:

int f(int a, int b) {
    int c = a + g(b);
    return b + g(c);
}

//prepare the stack with a, b, the return address, and the frame pointer
SUB SP, SP, #32
STUR X0, [SP, #24]
STUR X1, [SP, #16]
STUR X30, [SP, #8]
STUR X29, [SP, #0]
MOV X0, X1         //move b into X0
BL G               //call g
MOV X9, X1         //X1 is the return value from g; move it into another register
LDUR X0, [SP, #24] //restore a from the stack; keep the stack frame in place because there is another call
ADD X10, X0, X9    //calculate c
MOV X0, X10        //place c in X0
BL G               //call g
MOV X11, X1        //the return value is in X1; save it to another register
//restore b and the return address from the stack
LDUR X1, [SP, #16]
LDUR X30, [SP, #8]
ADD SP, SP, #32    //reset the stack pointer to release the memory
ADD X2, X1, X11    //calculate the return value (b + g(c)) and put it in X2
BR X30             //return

Consider the following sequence of instructions executed on the 5-stage pipeline. The pipeline uses forwarding for data hazards, and the conditional branch is evaluated in the instruction decode stage. Which instructions are stalled because of a data hazard?

40 ADD X1, X2, X3
44 CBZ X1, branch        //since CBZ is evaluated in the ID stage, it needs to wait for X1 to be calculated
48 ADD X5, X1, X1
52 LDUR X1, [X6, #3]
56 ADD X2, X1, X5        //LDUR loads X1 in the MEM stage, so this ADD needs to wait for X1
60 branch: LDUR X7, [X6, #4]
64 ADD X3, X2, X6
for (i = 0; i < n; i++)
    if (a[i] <= b[i])
        c[i] = a[i];
    else
        c[i] = b[i];

Assume n is in X9, the base address for a is in X10, the base address for b is in X11, and the base address for c is in X12.
add x13, xzr, xzr          //establish variable i in x13 (i = 0)
loop: sub x14, x9, x13     //set up the loop
cbz x14, exit              //if i == n, exit the loop
//set up variables a[i], b[i], c[i]
lsl x15, x13, #3           //x15 is i*8, the byte offset from the base address
ldur x17, [x10, x15]       //x17 contains a[i]
ldur x19, [x11, x15]       //x19 contains b[i]
subs x21, x17, x19         //set the flags with x17 - x19
b.gt else                  //if a[i] > b[i], take the else branch
stur x17, [x12, x15]       //c[i] = a[i]
addi x13, x13, #1
b loop
else: stur x19, [x12, x15] //c[i] = b[i]
addi x13, x13, #1
b loop
exit:

- R-type data hazards can be eliminated with forwarding (forward the result directly to the ALU)
- A load-use hazard requires a stall (even with forwarding)

LDUR X19, [X10, #0] //not a load-use hazard: X19 is not being used in the next instruction
ADD X9, X9, X10     //X10 does not get a new value
SUB X12, X10, X9    //forwarding takes care of updating X9 before the first SUB needs it
SUB X12, X12, X19   //forwarding updates X12 before the second SUB needs it
LDUR X13, [X20, #8]
ADD X9, X9, X13     //load-use hazard!!! insert a stall so that ADD can wait for the value from memory to be loaded

Hazards: cause the instruction to not be able to execute
- Data hazard: an instruction needs data from a register before it has been calculated
- Control hazard: which instruction to fetch is unknown because the result of a branch decision is not known yet
- Structural hazard: two instructions need the same hardware in the same cycle
