Final Exam Comp Org

The document discusses computer performance metrics, including instruction execution times, CPI calculations, and speedup from instruction optimizations. It also covers cache memory structure, pipelined processor design, and instruction set architecture details. Key calculations and examples illustrate the impact of different architectural choices on performance and efficiency.
Computer Performance
A program consisting of 2,000,000 instructions is executed at a 1 GHz clock rate. The program uses the following mix of instructions: 40% R-type, 20% LDUR, 10% STUR, 30% CBZ. CBZ and LDUR have a CPI of 5; STUR and R-type have a CPI of 4.

- What is the average CPI? CPIavg (cycles/instr) = Σ(CPI × fraction)
  CPIavg = 5(0.3) + 5(0.2) + 4(0.4) + 4(0.1) = 4.5 cycles/instr
- How long does it take to execute the program? texe (seconds) = (instruction count × CPIavg) / clock rate; smaller texe means faster.
  texe = (2×10^6 instr × 4.5 cycles/instr) / (10^9 cycles/sec) = 0.009 seconds
- What is the speedup if CBZ is improved to a CPI of 3?
  new CPI = 3(0.3) + 5(0.2) + 4(0.4) + 4(0.1) = 3.9
  Speedup = texe,old / texe,new = CPIold / CPInew = 4.5/3.9 = 1.15

Little & Big Endian

Memory Address         Data
0x8000000000000000     0xA1
0x8000000000000001     0x32
0x8000000000000002     0x5B
0x8000000000000003     0xC6
0x8000000000000004     0x92
0x8000000000000005     0x58
0x8000000000000006     0x32
0x8000000000000007     0x11
0x8000000000000008     0x13

- What are the contents of X11 after LDUR X11, [X9, #0] is executed (X9 holds 0x8000000000000000)?
  big endian: 0xA1325BC692583211
  little endian: 0x11325892C65B32A1

Parts of a Computer
- Input: writes data & instructions into memory
- Control: sends signals that determine the operations of the datapath, memory, input & output
- Memory: stores instructions & data
- Datapath: executes instructions
- Output: presents results to the user

Numbers & Arithmetic
Binary multiplication (shift-add): Step 1: check the rightmost bit of the multiplier: if it is 0, do nothing; if it is 1, add the multiplicand to the product. Step 2: shift the result (the entire 8 bits) to the right. Multiply 7 (multiplicand, 0111) by 6 (multiplier, 0110):

Iteration | Step                                    | Multiplicand | Product/Multiplier
    0     | Initial values                          |     0111     | 0000 0110
    1     | Lowest bit of multiplier is 0, no add   |     0111     | 0000 0110
          | Shift right                             |     0111     | 0000 0011
    2     | Lowest bit of multiplier is 1, add      |     0111     | 0111 0011
          | Shift right                             |     0111     | 0011 1001
    3     | Lowest bit of multiplier is 1, add      |     0111     | 1010 1001
          | Shift right                             |     0111     | 0101 0100
    4     | Lowest bit of multiplier is 0, no add   |     0111     | 0101 0100
          | Shift right                             |     0111     | 0010 1010

The final product is 0010 1010 = 42.
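The shift-add table above can be sanity-checked with a few lines of Python (a sketch; the function name and bit width are illustrative, not part of the notes):

```python
def shift_add_multiply(multiplicand, multiplier, bits=4):
    """Shift-add binary multiplication: the product/multiplier register starts
    holding the multiplier in its low half; each iteration optionally adds the
    multiplicand to the high half, then shifts the whole 2*bits register right."""
    product = multiplier                      # low half holds the multiplier
    for _ in range(bits):
        if product & 1:                       # lowest bit of multiplier is 1
            product += multiplicand << bits   # add multiplicand to the high half
        product >>= 1                         # shift the entire register right
    return product

print(shift_add_multiply(7, 6))  # 42, i.e. 0010 1010
```

Each loop iteration reproduces one pair of rows from the table (optional add, then shift right).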
Things that are determined by the design of the instruction set architecture: format of machine code; types of instructions; number of bytes in a register; number of registers; set-up of registers; data types and how they are expressed.

Binary addition:
- If the sum is 0 or 1, write it down in the result.
- If the sum is 2, the sum bit is 0 and the carry bit is 1.
- If the sum is 3, the sum bit is 1 and the carry bit is 1.
*** When multiplying two 16-bit numbers, 32 bits are required for the product.

Single Cycle Data Path
- What are the control values input to the multiplexors (mux's) for a CBZ instruction where the output of the Zero control line is 1? Mux 12: 1, Mux 13: 0, Mux 15: 1
- ADD and STUR read data from two registers.

IEEE 754 Floating Point (does not provide an exact representation of every floating-point number)
normalized form: (±)1.(fraction)₂ × 2^exponent
sign: 1 bit; exponent: 8 bits (biased by 127); fraction: 23 bits

- What number does 0x0C000000 represent if it is an IEEE 754 format floating-point number?
  0x0C000000 = 0 00011000 00000000000000000000000 → Sign: 0 (+); Exponent: 24 − 127 = −103; Fraction: 0 → 1.0 × 2^−103 ≈ 9.861 × 10^−32
- Express −4.75 in IEEE 754 format hexadecimal:
  4.75 = 4 + 0.5 + 0.25 = 2^2 + 2^−1 + 2^−2 = 100.11 in binary
  100.11 = 1.0011 × 2^2 normalized
  Sign = 1 (negative); Exponent = 2 + 127 = 129 = 10000001; Fraction = 00110000000000000000000
  → 1 10000001 00110000000000000000000 → 0xC0980000

IEEE 754 special values:
Special Value               | Exponent  | Fraction
±0 (two zeroes)             | 0000 0000 | 0
Denormalized number (2^−126)| 0000 0000 | non-zero
NaN (Not a Number)          | 1111 1111 | non-zero
± Infinity                  | 1111 1111 | 0

- What number does 0xFF800000 represent if it is an IEEE 754 format floating-point number?
  exponent: 1111 1111, fraction: 0 → special value: negative infinity

ARM ISA: R-format instruction
opcode 11 bits (basic op) | rm 5 bits (2nd source) | shamt 6 bits (shift) | rn 5 bits (1st source) | rd 5 bits (destination)

- What is the 32-bit machine code corresponding to LSL X10, X9, #3?
  opcode = 11010011011, rm is not used = 00000, shamt = 000011, rn = 01001, rd = 01010
  ➝ 1101 0011 0110 0000 0000 1101 0010 1010 ➝ 0xD3600D2A

ARM ISA: I-format instruction
opcode 10 bits (basic op) | immediate 12 bits (value of operand) | rn 5 bits (source reg) | rd 5 bits (destination)

- ADDI X9, X10, #100: opcode = 1001000100, immediate = 000001100100, rn = 01010, rd = 01001
  ➝ 1001 0001 0000 0001 1001 0001 0100 1001 ➝ 0x91019149
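The field packing for the two encodings above can be checked by shifting and OR-ing the fields together; a minimal sketch in Python (helper names are illustrative, field widths are the ones given in the formats above):

```python
def encode_r(opcode, rm, shamt, rn, rd):
    """Pack R-format fields (opcode 11 | rm 5 | shamt 6 | rn 5 | rd 5) into 32 bits."""
    return (opcode << 21) | (rm << 16) | (shamt << 10) | (rn << 5) | rd

def encode_i(opcode, imm, rn, rd):
    """Pack I-format fields (opcode 10 | immediate 12 | rn 5 | rd 5) into 32 bits."""
    return (opcode << 22) | (imm << 10) | (rn << 5) | rd

# LSL X10, X9, #3 and ADDI X9, X10, #100 from the worked examples:
print(hex(encode_r(0b11010011011, 0, 3, 9, 10)))  # 0xd3600d2a
print(hex(encode_i(0b1001000100, 100, 10, 9)))    # 0x91019149
```

The shift amount for each field is the total width of the fields to its right.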
Which control signals are forwarded along the pipeline?
1. To the EX stage: ALUSrc, ALUOp
2. To the Memory stage: Branch, MemWrite, MemRead
3. For use in the WriteBack stage: MemtoReg, RegWrite

ARM ISA: D-format instruction. Load transfers data from memory to a register; Store transfers data from a register to memory. LDURB (1 byte); LDURH, half (2 bytes); LDURW, word (4 bytes); LDUR, double (8 bytes).
opcode 11 bits | address 9 bits | op2 2 bits | rn 5 bits | rt 5 bits

- LDURH X11, [X10, #4]: opcode = 01111000010, address = 000000100, op2 = 00, rn = 01010, rt = 01011
  ➝ 0111 1000 0100 0000 0100 0001 0100 1011 ➝ 0x7840414B

Which datapath elements produce output for each instruction type?
- Integer arithmetic & logical, R-format (ADD, SUB, AND, ORR): 1, 2, 3, 4, 5, 12, 13, 14
- Memory-reference (LDUR, STUR): 1, 2, 3, 4, 5, 6, 7, 13, 14
- Branch (CBZ): 1, 2, 3, 4, 5, 6, 8?, 9, 10, 11, 12, 13, 15

Main control signal values:
Instruction | Reg2Loc | ALUSrc | MemtoReg | RegWrite | MemRead | MemWrite | Branch | ALUOp1 | ALUOp0
R-format    |    0    |   0    |    0     |    1     |    0    |    0     |   0    |   1    |   0
LDUR        |    X    |   1    |    1     |    1     |    1    |    0     |   0    |   0    |   0
STUR        |    1    |   1    |    X     |    0     |    0    |    1     |   0    |   0    |   0
CBZ         |    1    |   0    |    X     |    0     |    0    |    0     |   1    |   0    |   1

Pipelined Data Path
1. A single-cycle processor with a cycle time of 1000 ps is split into a 5-stage pipelined processor with a cycle time of 250 ps.
- How long does it take an instruction to execute in the pipelined processor? 250 × 5 = 1250 ps
- What is the speedup for executing 5 instructions?
  5 instructions single cycle = 1000 × 5 = 5000 ps
  5 instructions pipelined = 1250 + 250 × 4 = 2250 ps (1250 for the first instruction; after that, an instruction completes every 250 ps)
  Speedup = old time / new time = 5000/2250 = 2.22
- What would the speedup be for an infinite number of instructions? 1000/250 = 4

2. We are designing a pipelined processor. The pipeline stages have the following execution times: Instruction Fetch: 150 ps; Instruction Decode: 200 ps; ALU Execute: 150 ps; Memory Read/Write: 150 ps; Write Back: 150 ps.
- What would the cycle time be for a single-cycle processor? 800 ps, the total time an instruction needs through all stages
- What would the cycle time be for a pipelined processor? 200 ps, the longest time required in a stage
- What is the clock rate for the pipelined processor? 1/200 ps = 5 GHz
- How long does one instruction take to get through the pipelined processor? 200 × 5 = 1000 ps (the longest stage time × the number of stages)
- What is the speedup of the pipelined processor compared to the single-cycle processor when executing 10 instructions?
  Single-cycle time: 800 × 10 = 8000 ps
  Pipelined time: 1000 + 9 × 200 = 2800 ps
  Speedup = old time / new time = 8000/2800 = 2.85
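The timing results in problem 2 can be reproduced with a few lines of Python (a sketch; the function name is illustrative):

```python
def pipeline_vs_single(stage_ps, n_instr):
    """Compare single-cycle and pipelined execution times (in ps) for n_instr
    instructions, given the per-stage latencies."""
    cycle = max(stage_ps)                     # pipelined cycle = slowest stage
    single = sum(stage_ps) * n_instr          # single cycle pays the full path each time
    piped = len(stage_ps) * cycle + (n_instr - 1) * cycle  # fill, then one per cycle
    return single, piped, single / piped

single, piped, speedup = pipeline_vs_single([150, 200, 150, 150, 150], 10)
print(single, piped, round(speedup, 2))  # 8000 2800 2.86
```

Note the single-cycle time here is taken as the sum of the stage latencies (800 ps), matching problem 2.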
Branch Prediction
- Static branch prediction: predict the branch at compile time. Predict "not taken" for a forward branch (e.g., branching out of a loop); predict "taken" for a backward branch (to stay in a loop).
- Dynamic branch prediction: predict while the program runs. A 1-bit predictor does whatever the branch did last time; a 2-bit predictor changes its prediction only after getting it wrong twice. Keep a table of past predictions.

forward branch: predict not taken       backward branch: predict taken

LoopStart: cbz x9, LoopDone             LoopStart: ldur x10, [x13, x16]
           ldur x10, [x13, x16]                    ldur x11, [x14, x16]
           ldur x11, [x14, x16]                    add x12, x10, x11
           add x12, x10, x11                       stur x12, [x15, x16]
           stur x12, [x15, x16]                    sub x9, x9, 1
           sub x9, x9, 1                           add x16, x16, 8
           add x16, x16, 8                         cbnz x9, LoopStart
           b LoopStart                             br x30
LoopDone:  br x30

Stalls
Data hazards requiring a stall even with forwarding: load-use (LDUR) and a branch that needs a just-loaded value.

    LDUR X9, [X10, #0]
    CBZ X9, #8

requires two stalls, because the CBZ ID stage needs to wait for the LDUR MEM stage. Example sequence:

    LDUR X9, [X20]
    CBZ X9, Label
    SUB X12, X10, X9
    SUB X12, X12, X19
    LDUR X13, [X20, #8]
Label: ADD X9, X9, X13

Control hazards: which instruction to fetch is unknown because the result of the branch decision has not yet been calculated, so a stall is required. Moving branch checking earlier in the pipeline avoids 3 cycles in which the instruction to fetch is unknown; there is still one cycle in which we do not know what instruction to fetch.

Binary - Hex
0000-0 0001-1 0010-2 0011-3 0100-4 0101-5 0110-6 0111-7
1000-8 1001-9 1010-A 1011-B 1100-C 1101-D 1110-E 1111-F

Cache
Why do we have cache?
- Registers: fast access; expensive per byte; on chip; limited size (number of bytes). Data operated on by the ALU must be in registers, so values have to be transferred from main memory.
- Main memory: slower access; less expensive per byte; large.
- Cache: used to make transfers to registers faster. Faster access than main memory; more expensive per byte than main memory; smaller than main memory. The cache is a subset of main memory: put items in the cache that are likely to be accessed.
  - temporal locality: recently accessed items may be accessed again
  - spatial locality: items near recently accessed items may be accessed (arrays are together in memory)

What causes cache misses (the data is not found in the cache, so main memory must be used)?
- Compulsory misses: the cache is empty when a process starts.
- Capacity misses: the cache is full; depends on the size of the cache. A fully associative cache still has capacity misses.
- Conflict misses: more than one block maps to the same line. A fully associative cache has no conflict misses (any block can go anywhere); a direct-mapped cache has the highest rate of conflict misses.

Cache problems:
1. Consider a system with a main memory access time of 200 ns, an L1 cache with a 20 ns access time and a hit rate of 90%, and an L2 cache with a 40 ns access time and a hit rate of 95%.
- Look-aside (parallel) caches: AMAT = 20×0.9 + 0.1×(40×0.95 + 200×0.05) = 18 + 4.8 = 22.8 ns
- Look-through caches (100% of accesses go through L1 first): AMAT = 20 + 0.1×(40 + 0.05×200) = 25 ns

2. A system has a main memory of 2^16 bytes split into blocks of 2^6 bytes. The system has a cache of 2^12 bytes.
- How many bits wide is the address? log2(2^16) = 16
- How many bits wide is the offset field? log2(block size) = log2(2^6) = 6
- How many bits describe the block? 16 − 6 = 10
- How many lines does the cache have? (2^12 bytes/cache) / (2^6 bytes/line) = 2^6 = 64 lines/cache
- If the cache is fully associative, into how many possible lines can a block be placed? 64: any block can go to any line
- If the cache is direct mapped? 1: each block maps to one line in the cache
- If the cache is 4-way set associative? 4: a block maps to a set, and that set has four lines

3. A system has a main memory of 2^32 bytes, a block size of 16 bytes, and a cache size of 2^12 bytes. Address 0xABCABC00 is requested and the cache is fully associative. What are the tag and the offset? Where will the system look in the cache for the tag? How many possible places could this byte be if it is in the cache?
  Offset = log2(block size) = log2(16) = 4 bits; Tag = log2(memory size) − offset bits = 32 − 4 = 28 bits
  Address = 1010 1011 1100 1010 1011 1100 0000 0000
  Tag = 1010101111001010101111000000; Offset = 0000
  # of lines = cache size / block size = (2^12 bytes/cache) / (2^4 bytes/line) = 2^8 lines/cache
  The system will look in every line of the cache: 2^8 possible places.

4. The same system, but the cache is direct mapped. What are the tag, line, and offset?
  Offset field = log2(16) = 4 bits
  Line field = log2(2^12/2^4) = 8 bits
  Tag field = 32 − 8 − 4 = 20 bits
  Address = 1010 1011 1100 1010 1011 1100 0000 0000
  Tag = 10101011110010101011; Line = 11000000; Offset = 0000
  There is one possible place where this byte could be, and that is at line 11000000. The system will look at that line to see if the tag matches.

5. The same system, but the cache is 2-way set associative. What are the tag, set, and offset?
  Offset field = log2(16) = 4 bits
  Number of sets = (2^12 bytes/cache) / ((2^4 bytes/line) × (2 lines/set)) = 2^7 sets/cache → set field = 7 bits
  Tag field = 32 − 7 − 4 = 21 bits
  Tag = 101010111100101010111; Set = 1000000; Offset = 0000
  The system will look at set 1000000. There are two lines there, each with a tag. It will look at those two tags and see if one of them is a match.
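The tag/index/offset splits in problems 3-5 can be computed mechanically; a sketch in Python (assumes power-of-two sizes; the function name is illustrative):

```python
def split_address(addr, block_bytes, cache_bytes, ways):
    """Split addr into (tag, index, offset) for a set-associative cache.
    ways = lines per set: 1 = direct mapped; lines/cache = fully associative."""
    offset_bits = (block_bytes - 1).bit_length()   # log2(block size)
    n_sets = cache_bytes // block_bytes // ways    # lines / ways
    index_bits = (n_sets - 1).bit_length()         # log2(number of sets)
    offset = addr & (block_bytes - 1)
    index = (addr >> offset_bits) & (n_sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset

print(split_address(0xABCABC00, 16, 2**12, 1))    # direct mapped: index 0b11000000
print(split_address(0xABCABC00, 16, 2**12, 2))    # 2-way: set 0b1000000
print(split_address(0xABCABC00, 16, 2**12, 256))  # fully associative: no index
```

With ways equal to the number of lines there is a single set, the index field shrinks to zero bits, and everything above the offset becomes the tag, matching problem 3.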
Parallel Processing
Flynn's Taxonomy (instruction streams × data streams):
- SISD (single instruction, single data): one processor, one data path for a single instruction.
- SIMD (single instruction, multiple data): vector processing, the same operation on a vector of values, handled with vector instructions and vector registers.
- MISD (multiple instruction, single data): no "processor" does this, but fault tolerance (multiple computers operating on the same data) could fall into this category.
- MIMD (multiple instruction, multiple data): single program, multiple data; shared-memory processing; multiple processors networked together to accomplish a task. Threading is used in MIMD.

Definitions: MULTIPROCESSOR: a computer with more than one processor, also known as multicore. PARALLEL PROCESSING: a single program running on multiple processors simultaneously. SHARED MEMORY PROCESSING: multiple cores, threading, and a single physical address space for all processors. STATIC: decided at compile time; DYNAMIC: decided while running.

Cache coherence (MSI protocol): a multi-core processor with shared memory must keep the cores' caches consistent.
- M flag: this core has written to its copy of the cached value.
- S flag: the block is unmodified and can be safely shared, meaning multiple L1 caches may safely store a copy of the block and read from their copy.
- I flag: indicates the cached block is invalid or contains an older copy of the data that does not reflect the current value of the block of memory.

SNOOPING: a processor needs some mechanism to identify when accesses to one core's L1 cache content require coherency-state changes involving the other cores' L1 cache contents. It is implemented by snooping on a bus that is shared by all L1 caches; a cache can identify any read or write from another L1 cache for a block it stores.

Dual-issue packet example: the sequence sub x5,x3,x4; add x1,x3,x4; stur x1,[x0,#8]; ldur x2,[x0,#4]; add x5,x2,x2; stur x5,[x0,#12]; sub x7,x5,x2; cbz x7,#3 is grouped into issue packets 1-5. All of the instructions after add x5, x2, x2 depend on that add, and a stall is needed because of the data hazard.

Virtual Memory - Address Translation, TLB
- A memory request results in a page table hit and a cache miss. In what time order are resources checked? TLB, Page Table, Cache, MM.

1. A virtual memory system has 2^12-byte pages and a four-entry fully associative TLB with LRU replacement. Virtual addresses are 16 bits and physical addresses are 15 bits. The TLB state (valid, access time, virtual page, physical frame):

1 | 19 | 0x0 | 2
1 | 18 | 0xf | 5
1 | 22 | 0x3 | 3
1 | 23 | 0xe | 0

In the page table, page 0x9 is valid and maps to frame 4.

The virtual address 0x91bb is requested. Is it a TLB hit, page table hit, or page fault?
- Offset = log2(2^12) = 12 bits; the page number is the leftmost 16 − 12 = 4 bits → page = 9.
- Page 9 is not in the TLB → check the page table. Page 9 is valid there → page table hit.
- Physical address: replace page 9 (1001, 4 bits) with frame 4 (100, 3 bits): 100 000110111011 = 0x41bb.
- Updated TLB state: LRU replaces the oldest entry (0xf, access time 18) with page 0x9 → frame 4; its access time becomes the next increment, 24.

2. The virtual address 0x09b3 is requested next. Is it a TLB hit, page table hit, or page fault?
- Offset = 0x9b3; virtual page = 0. Page 0 is in the TLB → TLB hit.
- Physical address: replace page 0 (0000) with frame 2 (010): 010 100110110011 = 0x29b3.
- Updated TLB state: the access time of the 0x0 entry is updated to 25.

3. A virtual memory system has 2^7-byte pages and a four-entry fully associative TLB with LRU replacement. Physical memory has 8 frames and there are 16 virtual pages.
- How many bits are in the physical address? log2(2^7) = 7 bits in the offset; 8 frames require 3 bits in the frame field → 3 + 7 = 10 bits.
- How many bits are in the virtual address? 7 bits in the offset; 16 pages require 4 bits in the page field → 4 + 7 = 11 bits.
- If address 0x1A3 is requested, is it a hit in the TLB? Is it a page table hit? A page fault? 0x1A3 = 0|0011|0100011 → page = 0011 = 3, which is in the TLB → TLB hit.
- The access time is 25. What is the state of the TLB after the access? The access time of the page-3 entry is updated to 25.

4. A virtual memory system has 2^12-byte pages. There are 8 frames in main memory and 16 pages in the page table.
- Bits in the physical address: 8 = 2^3 frames → 12 + 3 = 15 bits.
- Bits in the virtual address: 16 = 2^4 pages → 12 + 4 = 16 bits.

5. Main memory contains a total of 2^4 frames and each frame is 2^3 bytes; there are 32 virtual pages.
- How many bits are in the offset field? log2(2^3) = 3 bits.
- How many bits are required for the physical address? 4 + 3 = 7 bits.
- How many bits are required for the virtual address? 5 + 3 = 8 bits.
- Convert the virtual address 0xFB to a physical address: 0xFB = 11111|011 → page 11111 = 31 → frame 5 (from the page table); 5 = 101 → replace 11111 with 101 and keep offset 011 → 0101011 = 0x2B.

6. A byte-addressable system has 2^4 bytes of main memory and a direct-mapped cache of 8 bytes. A line is 2 bytes. Addresses are little endian. The state of the cache is as follows:

Line | Tag | Data (2-byte line)
 00  |  1  | 0xACD7
 01  |  0  | 0x32A8
 10  |  0  | 0x7B52
 11  |  1  | 0xBB65

A program requests access to address 0x5. What is the value of the data (one byte) at that address?
- Offset: log2(line size) = log2(2) = 1 bit
- Line field: (2^3 bytes/cache) / (2^1 bytes/line) = 2^2 lines → 2 bits
- Tag = address bits − line − offset = 4 − 2 − 1 = 1 bit
- 0x5 = 0101 → tag: 0; line: 10; offset: 1. The tag stored at line 10 is 0, so it matches; the line's data is 0x7B52, and the offset selects one byte of it.
- Little endian: offset 0 → value 0x52; offset 1 → value 0x7B. So the value is 0x7B.
- Big endian would give: offset 0 → value 0x7B; offset 1 → value 0x52.

ARM Code Examples
1. Write ARM code that corresponds to this C++ code. The address of myArr is in X9; the variable i will be in X10.

    for (i = 0; i < 10; i++) {
        myArr[i] = myArr[i] + 1;
    }

    mov X10, #0            //initialize i to zero
Loop:                      //loop label
    sub X13, X10, #10      //subtract 10 from i and place in X13
    cbz X13, Exit          //if X13==0, jump to Exit label
    lsl X14, X10, #3       //multiply i by 8 and place in X14
    ldur X16, [X9, X14]    //get myArr[i]
    add X16, X16, #1       //add 1 to myArr[i]
    stur X16, [X9, X14]    //save myArr[i]
    add X10, X10, #1       //add 1 to i
    b Loop                 //branch to loop
Exit:

2. Convert this code to ARM. Assume var1 is in X9 and var2 is in X10.

    long long int procedureA(long long int arg1, long long int arg2) {
        long long int var1, var2;
        var1 = procedureB(arg2);
        var2 = var1 + arg1;
        return var2;
    }

    //first we will prepare a stack frame for procedureA before the call to procedureB
    //we will put arg1, arg2, frame pointer, and return address on the stack
    sub sp, sp, #32        //make room on the stack
    stur X0, [sp, #24]     //place arg1 on the stack
    stur X1, [sp, #16]     //place arg2 on the stack
    stur X30, [sp, #8]     //place the return address on the stack
    stur X29, [sp, #0]     //place the frame pointer on the stack
    mov X0, X1             //put arg2 in the appropriate register to be the argument for procedureB
    bl procedureB          //call procedureB
    mov x9, x1             //load x9 (var1) with the value returned by procedureB
    ldur X0, [sp, #24]     //restore arg1 from the stack
    ldur X1, [sp, #16]     //restore arg2 from the stack
    ldur X30, [sp, #8]     //restore the return address from the stack
    ldur x29, [sp, #0]     //restore the frame pointer from the stack
    add sp, sp, #32        //move the stack pointer back to where it was before
    add x10, x9, x0        //var2 = var1 + arg1
    mov x2, x10            //move var2 to the appropriate register for return from procedureA
    br X30                 //return

3. int myProc1(int i, int j) {
        int k = myProc2(i);
        k = k + j;
        return k;
    }

    //create a stack frame: i, j, x29, x30 → 32 bytes
    SUB sp, sp, #32
    STUR x0, [sp, #0]
    STUR x1, [sp, #8]
    STUR x29, [sp, #16]
    STUR x30, [sp, #24]
    BL myProc2             //call myProc2
    MOV x2, x1             //save k
    //restore items from the stack
    LDUR x0, [sp, #0]
    LDUR x1, [sp, #8]
    LDUR x29, [sp, #16]
    LDUR x30, [sp, #24]
    ADD sp, sp, #32        //reset stack ptr
    ADD x2, x2, x1         //update k: k = k + j
    BR x30                 //return

4. int f(int a, int b) {
        int c = a + g(b);
        return b + g(c);
    }

    //prepare the stack with a, b, return address and frame pointer
    SUB SP, SP, #32
    STUR X0, [SP, #24]
    STUR X1, [SP, #16]
    STUR X30, [SP, #8]
    STUR X29, [SP, #0]
    MOV X0, X1             //move b into X0
    BL G                   //call g
    MOV X9, X1             //X1 is the return value from G; move it into another register
    LDUR X0, [SP, #24]     //restore a from the stack; keep the stack frame in place because there is another call
    ADD X10, X0, X9        //calculate c
    MOV X0, X10            //place c in X0
    BL G                   //call G
    MOV X11, X1            //return value is in X1; save it to another register
    //restore b and the return address from the stack
    LDUR X1, [SP, #16]
    LDUR X30, [SP, #8]
    ADD SP, SP, #32        //reset the stack pointer to release the memory
    ADD X2, X1, X11        //calculate the return value b + g(c), put it in X2
    BR X30                 //return

5. Assume n is in X9, the base address for a is in X10, the base address for b is in X11, and the base address for c is in X12.

    for (i=0; i<n; i++)
        if (a[i] <= b[i])
            c[i] = a[i];
        else
            c[i] = b[i];

    addi x13, xzr, xzr      #establish variable i in x13
loop:
    sub x14, x9, x13        #set up the loop
    cbz x14, exit           #if i==n, exit the loop
    #set up variables a[i], b[i], c[i]
    lsl x15, x13, #3        #x15 is i*8, the byte offset from the base address
    ldur x17, [x10, x15]    #x17 contains a[i]
    ldur x19, [x11, x15]    #x19 contains b[i]
    subs x21, x17, x19      #set flags with x17 and x19
    b.gt else
    stur x17, [x12, x15]    #c[i] = a[i]
    addi x13, x13, #1
    b loop
else:
    stur x19, [x12, x15]    #c[i] = b[i]
    addi x13, x13, #1
    b loop
exit:

6. Consider the following sequence of instructions executed on the 5-stage pipeline. The pipeline uses forwarding for data hazards, and the conditional branch is evaluated in the instruction decode stage. Which instructions are stalled because of a data hazard?

    40  ADD X1, X2, X3
    44  CBZ X1, branch      //since cbz is evaluated in the ID stage, it needs to wait for X1 to be calculated
    48  ADD X5, X1, X1
    52  LDUR X1, [X6, #3]
    56  ADD X2, X1, X5      //ldur loads X1 in the MEM stage, so the add needs to wait for X1
    60  branch: LDUR X7, [X6, #4]
    64  ADD X3, X2, X6

The stalled instructions are the CBZ at 44 and the ADD at 56.

7. Forwarding examples:
- R-type data hazards can be eliminated with forwarding (forward the result directly to the ALU).
- A load-use hazard requires a stall even with forwarding.

    LDUR X19, [X10, #0]    //not a hazard: X19 is not used in the next instruction
    ADD X9, X9, X10        //X10 does not get a new value
    SUB X12, X10, X9       //forwarding updates X9 before the first SUB needs it
    SUB X12, X12, X19      //and X12 before the second SUB needs it
    LDUR X13, [X20, #8]
    ADD X9, X9, X13        //load-use hazard!!! insert a stall so ADD can wait for the value from memory to be loaded

Hazards: cause an instruction to not be able to execute.
- Structural hazard: two instructions need the same hardware in the same cycle.
- Data hazard: an instruction needs data from a register before it has been calculated.
- Control hazard: which instruction to fetch is unknown because the result of a branch decision is not known yet.
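The load-use rule in the summary above can be expressed as a small check; a sketch in Python (the tuple encoding of instructions is illustrative):

```python
def load_use_stalls(instrs):
    """Return indices of instructions that need a one-cycle stall: with
    forwarding, only an instruction that reads the destination register of the
    immediately preceding load (LDUR) must stall."""
    stalls = []
    for i in range(1, len(instrs)):
        prev_op, prev_dest, _ = instrs[i - 1]
        _, _, srcs = instrs[i]
        if prev_op == 'LDUR' and prev_dest in srcs:
            stalls.append(i)
    return stalls

# The forwarding example sequence as (opcode, destination, sources):
seq = [
    ('LDUR', 'X19', ['X10']),          # not a hazard: X19 unused next
    ('ADD',  'X9',  ['X9', 'X10']),
    ('SUB',  'X12', ['X10', 'X9']),    # forwarding covers this
    ('SUB',  'X12', ['X12', 'X19']),
    ('LDUR', 'X13', ['X20']),
    ('ADD',  'X9',  ['X9', 'X13']),    # load-use hazard -> stall
]
print(load_use_stalls(seq))  # [5]
```

Only the last ADD is flagged, matching the annotated sequence: every other dependence is resolved by forwarding.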