03 Instructions II
Chapter 02 - II
RISC-V Instructions: Use Case Studies and Compilation
Hyoukjun Kwon
EECS 112 (Spring 2024)
Organization of Digital Computers
Preliminary: Stored Program in Computers
§ How is the program stored?
• Programs are also “binary data” in memory
Preliminary: Basic Blocks
§ A basic block is a sequence of instructions with
• No embedded branches (except at end)
• No branch targets (except at beginning)
• Meaning: a contiguous block of instructions that does not make any jump. Within the basic block, the PC always moves to PC + 4 for the next instruction.
[Figure: instruction memory in the main memory space, with instructions at addresses 0x100, 0x104, 0x108, 0x10C, 0x110, and 0x114, starting from the low address]
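The two conditions in the definition translate into a simple rule for finding where basic blocks start: the first instruction, every branch target, and every instruction that follows a branch. The C sketch below illustrates that rule. The `Insn` record and `find_leaders` are hypothetical names made up for this illustration, and instructions are identified by index (index i would sit at address base + 4*i) rather than by a real RISC-V encoding.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified instruction record (not a real RISC-V encoding). */
typedef struct {
    bool   is_branch;  /* true for a branch or jump */
    size_t target;     /* index of the branch target, valid if is_branch */
} Insn;

/* Sets leader[i] = true when instruction i starts a basic block.
   Returns the number of basic blocks found. */
size_t find_leaders(const Insn *code, size_t n, bool *leader)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        leader[i] = false;
    if (n > 0)
        leader[0] = true;                 /* the first instruction starts a block */
    for (size_t i = 0; i < n; i++) {
        if (!code[i].is_branch)
            continue;
        assert(code[i].target < n);
        leader[code[i].target] = true;    /* a branch target starts a block */
        if (i + 1 < n)
            leader[i + 1] = true;         /* so does the instruction after a branch */
    }
    for (size_t i = 0; i < n; i++)
        if (leader[i])
            count++;
    return count;
}
```

Each basic block then runs from one leader up to, but not including, the next leader.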
Procedure (Function) Calling
§ Steps required
1. Place parameters in registers x10 to x17
2. Transfer control to procedure
3. Acquire storage for procedure
4. Perform procedure’s operations
5. Place result in register for caller
6. Return to place of call (address in x1)
Procedure (Function) Calling w/ More Details
1. Place parameters in registers x10 to x17
• Store function arguments into registers
2. Transfer control to procedure (jal)
• Store the address of the following instruction (the one immediately after jal) in x1
o “Return address”: where to resume after coming back to the original procedure (caller)
• Jump to the target address (of the “callee” function)
Stack: Data Structure for Temporary Values
§ When it is used
• When spilling registers (i.e., when we need more space beyond the register file)
• “Spilling”: when we do not have sufficient space, we move data to another, larger memory
o E.g., when we have more data than available registers, we store temporary values in main memory (stack)
§ Where it is stored
• Main memory space
• It is likely that the stack values reside in cache memory
§ How it is managed
• Using the stack pointer (“sp”) stored in register x2
• The stack pointer points to the address of the most recently allocated space
• The stack grows from high addresses toward low addresses
§ Stack Operations
• Push: placing data onto the stack
• Pop: removing data from the stack
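The push and pop operations can be modeled in a few lines of C. This is only a sketch under stated assumptions: the `Stack` struct, the 64-word size, and the function names are hypothetical, and `sp` is a byte offset into a small array standing in for main memory. As on RISC-V, the stack grows toward lower addresses and `sp` points at the most recently allocated word.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_WORDS 64

/* Hypothetical model: a small word array standing in for main memory. */
typedef struct {
    int32_t  mem[STACK_WORDS];
    uint32_t sp;   /* byte offset of the most recently allocated word */
} Stack;

void stack_init(Stack *s)
{
    s->sp = STACK_WORDS * 4;   /* empty stack: sp starts at the high end */
}

/* Push: like addi sp,sp,-4 followed by sw value,0(sp). */
void stack_push(Stack *s, int32_t value)
{
    assert(s->sp >= 4);                /* no room left would be an overflow */
    s->sp -= 4;
    s->mem[s->sp / 4] = value;
}

/* Pop: like lw value,0(sp) followed by addi sp,sp,4. */
int32_t stack_pop(Stack *s)
{
    assert(s->sp < STACK_WORDS * 4);   /* popping an empty stack is an error */
    int32_t value = s->mem[s->sp / 4];
    s->sp += 4;
    return value;
}
```

Pushing three words lowers `sp` by 12 bytes, and popping them in reverse order raises it back, which is the pattern a procedure uses when it saves and restores registers.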
Local Data on the Stack
• Push three temporary values: SP = SP - 12
• Pop three temporary values: SP = SP + 12
Leaf Procedure Example
§ C code:
int leaf_example (int g, int h, int i, int j)
{
    int f;
    f = (g + h) - (i + j);
    return f;
}
• Arguments g, h, i, j in x10, …, x13
• f in x20
• Temporaries x5, x6
• Return register: x10
• Save x5, x6, x20 on stack
• Saving x5 and x6 is optional
§ RISC-V code:
leaf_example:
    addi sp,sp,-12     // Adjust stack to make room for 3 items
    sw x5,8(sp)        // Save x5, x6, x20 on stack
    sw x6,4(sp)
    sw x20,0(sp)
    add x5,x10,x11     // x5 = g + h
    add x6,x12,x13     // x6 = i + j
    sub x20,x5,x6      // f = x5 - x6
    addi x10,x20,0     // copy f to return register
    lw x20,0(sp)       // Restore x5, x6, x20 from stack
    lw x6,4(sp)
    lw x5,8(sp)
    addi sp,sp,12      // Adjust stack to delete 3 items
    jalr x0,0(x1)      // Return to caller
Register Usage
§ x5 – x7, x28 – x31: temporary registers
• Not preserved by the callee
Non-Leaf Procedures
§ Procedures that call other procedures
§ For nested call, caller needs to save on the stack:
• Its return address
• Any arguments and temporaries needed after the call
§ Restore from the stack after the call
Non-Leaf Procedure Example
§ C code:
int fact (int n)
{
    if (n < 1) return 1;
    else return n * fact(n - 1);
}
• Argument n in x10
• RISC-V code:
fact:
    addi sp,sp,-8      // Save return address and n on stack
    sw x1,4(sp)
    sw x10,0(sp)
    addi x5,x10,-1     // x5 = n - 1
    bge x5,x0,Else     // if n >= 1, go to Else
    addi x10,x0,1      // otherwise, set return value to 1
    addi sp,sp,8       // Pop stack, don’t bother restoring values
• Another way to test n >= 1:
    addi x5,x0,1       // x5 = 1
    bge x10,x5,Else    // if n >= 1, go to Else
Local Data on the Stack
§ Local data allocated by the callee that does not fit in registers must be stored on the stack
• e.g., local arrays or structures
§ Procedure frame (or activation record): segment of stack containing local data
• Frame pointer, fp, or register x8, used by some compilers to manage stack storage
• fp points to the first word of a procedure frame
RISC-V Register Conventions
Conditional Operations
§ Branch to a labeled instruction if a condition is true
• Otherwise, continue sequentially
Compiling If Statements
§ C code:
if (i==j) f = g+h;
else f = g-h;
• f, g, h, i, j in x19, x20, x21, x22, x23
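One way to see how this if statement maps onto branches is to rewrite it in C with explicit `goto`s, since each `goto` corresponds to one branch or jump instruction. The sketch below is illustrative only: the function name `if_else_lowered` is hypothetical, and the instructions named in the comments are what a compiler might plausibly emit for the register assignment above, not a listing taken from the slides.

```c
/* f = g + h if i == j, else f = g - h, written with explicit branches. */
int if_else_lowered(int g, int h, int i, int j)
{
    int f;
    if (i != j) goto Else;   /* like bne x22,x23,Else */
    f = g + h;               /* like add x19,x20,x21 */
    goto Exit;               /* unconditional jump over the else part */
Else:
    f = g - h;               /* like sub x19,x20,x21 */
Exit:
    return f;
}
```

Note that the branch tests the opposite condition (i != j) so that the "then" part can fall through sequentially.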
More Conditional Operations
§ blt rs1, rs2, L1
• if (rs1 < rs2) branch to instruction labeled L1
§ bge rs1, rs2, L1
• if (rs1 >= rs2) branch to instruction labeled L1
§ Example
§ if (a > b) a += 1;
§ a in x22, b in x23
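Because only blt and bge are listed here, a test such as a > b is typically handled by swapping operands or inverting the condition. The C sketch below shows the inverted form with an explicit `goto`. The function name is hypothetical, and the instructions in the comments are one plausible translation under the register assignment above.

```c
/* if (a > b) a += 1; using only a "branch if greater or equal" test. */
int increment_if_greater(int a, int b)
{
    if (b >= a) goto Exit;   /* like bge x23,x22,Exit: skip when a > b is false */
    a += 1;                  /* like addi x22,x22,1 */
Exit:
    return a;
}
```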
Signed vs. Unsigned
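The same 32-bit pattern can compare differently depending on whether it is read as signed or unsigned. RISC-V provides blt/bge for signed comparison and bltu/bgeu for unsigned comparison. The C sketch below models both on raw register bits. The function names are hypothetical, and the cast to `int32_t` assumes the usual two's-complement behavior of mainstream compilers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Signed comparison, as blt/bge perform it. */
bool less_signed(uint32_t rs1, uint32_t rs2)
{
    return (int32_t)rs1 < (int32_t)rs2;
}

/* Unsigned comparison, as bltu/bgeu perform it. */
bool less_unsigned(uint32_t rs1, uint32_t rs2)
{
    return rs1 < rs2;
}
```

With rs1 = 0xFFFFFFFF and rs2 = 1, the signed view sees -1 < 1, while the unsigned view sees 4,294,967,295 > 1, so the two branch families take opposite decisions.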
Section 3. String
Communicating with People: Character Data
§ Byte-encoded character sets
[Table: UTF-8 examples, with columns Character, Code Point, and UTF-8 Binary Encoding]
Byte/Halfword/Word Operations
String Copy Example
§ C code:
• Null-terminated string
void strcpy (char x[], char y[])
{
    size_t i;
    i = 0;
    while ((x[i]=y[i])!='\0')
        i += 1;
}
§ Base addresses for x and y in x10 and x11
§ i in x19
• RISC-V code:
strcpy:
    addi sp,sp,-4      // adjust stack for one more item
    sw x19,0(sp)       // save x19
    add x19,x0,x0      // i = 0 + 0
L1: add x5,x19,x11     // x5 = addr of y[i]
    lbu x6,0(x5)       // x6 = y[i]
    add x7,x19,x10     // x7 = addr of x[i]
    sb x6,0(x7)        // x[i] = y[i]
    beq x6,x0,L2       // if y[i] == 0 then exit
    addi x19,x19,1     // i = i + 1
    jal x0,L1          // next iteration of loop
L2: lw x19,0(sp)       // restore saved old x19
    addi sp,sp,4       // pop one word off stack
    jalr x0,0(x1)      // return
Loading a 32-bit Constant into a Register
§ Most constants are small
• 12-bit immediate is sufficient
§ For the occasional 32-bit constant
lui rd, constant
• Copies 20-bit constant to bits [31:12] of rd
• Add in the lowest 12 bits (e.g., with addi)
§ lui (Load Upper Immediate)
• Loads a 20-bit constant into bits 12 through 31 of a register, with the lower 12 bits filled with 0
• Uses a new instruction format, U-type, to accommodate such a large constant
U-type Format
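Splitting a 32-bit constant for a lui/addi pair has one subtlety: addi sign-extends its 12-bit immediate, so when bit 11 of the constant is set, the upper 20-bit part has to be one larger to compensate. The C sketch below works through that arithmetic. The function names `split_constant` and `lui_addi` are hypothetical, and the model assumes a 32-bit register.

```c
#include <stdint.h>

/* Splits a 32-bit constant into the 20-bit lui field and the 12-bit addi immediate. */
void split_constant(uint32_t value, uint32_t *upper20, int32_t *lower12)
{
    /* Low 12 bits, sign-extended the way addi will interpret them. */
    int32_t lo = (int32_t)(value & 0xFFFu);
    if (lo >= 0x800)
        lo -= 0x1000;
    /* The upper part absorbs the difference when lo is negative. */
    uint32_t hi = (value - (uint32_t)lo) >> 12;
    *upper20 = hi & 0xFFFFFu;
    *lower12 = lo;
}

/* Models lui rd,upper20 followed by addi rd,rd,lower12 on a 32-bit register. */
uint32_t lui_addi(uint32_t upper20, int32_t lower12)
{
    uint32_t rd = upper20 << 12;      /* lui: bits [31:12] set, bits [11:0] zero */
    return rd + (uint32_t)lower12;    /* addi: add the sign-extended immediate */
}
```

For example, 0x12345678 splits into 0x12345 and 0x678 directly, whereas 0xDEADBEEF splits into 0xDEADC and -273, because its low 12 bits (0xEEF) are negative when sign-extended.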
Section 4. Addressing
Branch Addressing
§ Branch instructions specify
• Opcode, two registers, target address
§ Most branch targets are near branch
• Forward or backward
§ S-type and B-type:
• B-type fields (most to least significant bit): imm[12] | imm[10:5] | rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode
§ PC-relative addressing
• Target address = PC + immediate × 2
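To make the scattered immediate bits concrete, the C sketch below reassembles the branch offset from a 32-bit B-type instruction word and adds it to the PC. The function names are hypothetical. The value it decodes is already a byte offset: the instruction stores imm[12:1] and leaves imm[0] implicitly 0, which is where the factor of 2 comes from.

```c
#include <stdint.h>

/* Extracts the signed byte offset from a 32-bit B-type instruction word. */
int32_t btype_offset(uint32_t insn)
{
    uint32_t imm = 0;
    imm |= ((insn >> 31) & 0x1u)  << 12;  /* imm[12]   from bit 31 */
    imm |= ((insn >> 7)  & 0x1u)  << 11;  /* imm[11]   from bit 7 */
    imm |= ((insn >> 25) & 0x3Fu) << 5;   /* imm[10:5] from bits 30:25 */
    imm |= ((insn >> 8)  & 0xFu)  << 1;   /* imm[4:1]  from bits 11:8 */
    /* Sign-extend from 13 bits; imm[0] is always 0. */
    if (imm & 0x1000u)
        return (int32_t)imm - 0x2000;
    return (int32_t)imm;
}

/* PC-relative target: the offset is added to the address of the branch itself. */
uint32_t branch_target(uint32_t pc, uint32_t insn)
{
    return pc + (uint32_t)btype_offset(insn);
}
```

For instance, the word 0x00000463 (beq x0,x0 with offset +8) at PC 0x100 targets 0x108, and 0xFE000EE3 (offset -4) targets 0xFC.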
RISC-V Addressing Summary
Section 5. Synchronization
Synchronization
§ Two processors sharing an area of memory
• P1 writes, then P2 reads
• Data race if P1 and P2 don’t synchronize
o Result depends on order of accesses
§ Hardware support required
• Atomic read/write memory operation
• No other access to the location allowed between the read and write
§ Could be a single instruction
• E.g., atomic swap of register ↔ memory
• Or an atomic pair of instructions
Synchronization in RISC-V
§ Load reserved: lr.w rd,(rs1)
• Load from address in rs1 to rd
• Place reservation on memory address
Synchronization in RISC-V
§ Example 1: atomic swap (to test/set lock variable)
again: lr.w x10,(x20)
       sc.w x11,x23,(x20)   // x11 = status
       bne x11,x0,again     // branch if store failed
       addi x23,x10,0       // x23 = loaded value
§ Example 2: lock
       addi x12,x0,1        // copy locked value
again: lr.w x10,(x20)       // read lock
       bne x10,x0,again     // check if it is 0 yet
       sc.w x11,x12,(x20)   // attempt to store
       bne x11,x0,again     // branch if fails
• Unlock:
       sw x0,0(x20)         // free lock
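C has no direct syntax for lr/sc, but C11 `<stdatomic.h>` offers atomic operations that a compiler can implement with such instruction pairs. The sketch below expresses the lock and unlock sequences with `atomic_compare_exchange_strong` and `atomic_store`. The `SpinLock` type and function names are hypothetical, and how a given compiler lowers these calls on RISC-V is an assumption not covered by the slides.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock type: 0 means free, 1 means held. */
typedef struct {
    atomic_int state;
} SpinLock;

void spin_init(SpinLock *l)
{
    atomic_init(&l->state, 0);
}

/* One attempt: succeeds only if the lock was 0 and is atomically set to 1. */
bool spin_try_lock(SpinLock *l)
{
    int expected = 0;
    return atomic_compare_exchange_strong(&l->state, &expected, 1);
}

/* Retry until an attempt succeeds, like the "again" loop. */
void spin_lock(SpinLock *l)
{
    while (!spin_try_lock(l)) {
        /* busy-wait */
    }
}

/* Unlock: store 0, like sw x0,0(x20). */
void spin_unlock(SpinLock *l)
{
    atomic_store(&l->state, 0);
}
```

A second `spin_try_lock` on a held lock fails without changing it, which mirrors the failing path of the lock sequence.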
Section 6. Compilation
Translation and Startup
Static linking
Producing an Object Module
§ Assembler (or compiler) translates program into machine instructions
§ Provides information for building a complete program from the pieces
• Header: describes contents of object module
• Text segment: translated instructions
• Static data segment: data allocated for the life of the program
• Relocation info: for contents that depend on absolute location of loaded program
• Symbol table: global definitions and external refs
• Debug info: for associating with source code
Linking Object Modules
§ Produces an executable image
1. Merges segments
2. Resolves labels (determines their addresses)
3. Patches location-dependent and external refs
§ Could leave location dependencies for fixing by a relocating loader
• But with virtual memory, no need to do this
• Program can be loaded into absolute location in virtual memory space
Loading a Program
§ Load from image file on disk into memory
1. Read header to determine segment sizes
2. Create virtual address space
3. Copy text and initialized data into memory
o Or set page table entries so they can be faulted in
4. Set up arguments on stack
5. Initialize registers (including sp, fp, gp)
6. Jump to startup routine
o Copies arguments to x10, … and calls main
o When main returns, do exit syscall
Dynamic Linking
• Avoids image bloat caused by static linking of all (transitively) referenced libraries
Lazy Linkage
• Indirection table
• Linker/loader code
• Dynamically mapped code
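The interplay of these three pieces can be imitated in plain C with a table of function pointers. In the sketch below, every call goes through the table, and the entry initially points at a stub that plays the role of the linker/loader code: on the first call it patches the entry to the real routine, so later calls skip the stub. All names are hypothetical, and a real dynamic linker resolves symbols by name from a shared library rather than from a function in the same file.

```c
typedef int (*func_t)(int);

/* Stands in for a routine living in dynamically mapped library code. */
static int real_square(int x) { return x * x; }

static int resolve_count = 0;          /* how many times the "linker" ran */
static int stub_square(int x);         /* forward declaration */

/* Indirection table: entry 0 initially points at the stub. */
static func_t indirection_table[1] = { stub_square };

/* Stub: plays the role of the linker/loader code on the first call. */
static int stub_square(int x)
{
    resolve_count++;
    indirection_table[0] = real_square;   /* patch the table entry */
    return indirection_table[0](x);       /* then complete the original call */
}

/* Callers always go through the table. */
int call_square(int x)
{
    return indirection_table[0](x);
}

int get_resolve_count(void)
{
    return resolve_count;
}
```

The cost of resolution is paid only once and only for routines that are actually called, which is the point of linking lazily.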
Starting Java Applications
• Simple portable instruction set for the JVM
• Interprets bytecodes
• Compiles bytecodes of “hot” methods into native code for host machine
C Sort Example
§ Illustrates use of assembly instructions for a C bubble sort function
§ Swap procedure (leaf)
void swap(long long int v[], long long int k)
{
    long long int temp;
    temp = v[k];
    v[k] = v[k+1];
    v[k+1] = temp;
}
• v in x10, k in x11, temp in x5
swap:
    slli x6,x11,3      // reg x6 = k * 8
    add x6,x10,x6      // reg x6 = v + (k * 8)
    ld x5,0(x6)        // reg x5 (temp) = v[k]
    ld x7,8(x6)        // reg x7 = v[k + 1]
    sd x7,0(x6)        // v[k] = reg x7
    sd x5,8(x6)        // v[k+1] = reg x5 (temp)
    jalr x0,0(x1)      // return to calling routine
The Sort Procedure in C
§ Non-leaf (calls swap)
void sort (long long int v[], size_t n)
{
    size_t i, j;
    for (i = 0; i < n; i += 1) {
        for (j = i - 1;
             j >= 0 && v[j] > v[j + 1];
             j -= 1) {
            swap(v,j);
        }
    }
}
• v in x10, n in x11, i in x19, j in x20
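For experimenting with the routine outside the slides, the following is a compilable C sketch of the same sort together with swap. It makes one deliberate change, which is an assumption of this sketch rather than something the slides state: the loop indices are declared as a signed type instead of size_t. With an unsigned j, the test j >= 0 can never be false, whereas the assembly checks j with a signed blt against x0.

```c
#include <stddef.h>

/* Swap v[k] and v[k+1]. */
void swap(long long int v[], long long int k)
{
    long long int temp = v[k];
    v[k] = v[k + 1];
    v[k + 1] = temp;
}

void sort(long long int v[], size_t n)
{
    /* Signed indices (an assumption, see the note above) so j can drop below 0. */
    long long int i, j;
    for (i = 0; i < (long long int)n; i += 1) {
        /* Move v[i] left until the element before it is not larger. */
        for (j = i - 1; j >= 0 && v[j] > v[j + 1]; j -= 1) {
            swap(v, j);
        }
    }
}
```

Calling `sort` on an array such as {5, 2, 9, 1, 3} with n = 5 leaves it in ascending order.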
The Outer Loop
§ Skeleton of outer loop:
• for (i = 0; i < n; i += 1) {
li x19,0 // i = 0
for1tst:
bge x19,x11,exit1 // go to exit1 if x19 ≥ x11 (i≥n)
addi x19,x19,1 // i += 1
j for1tst // branch to test of outer loop
exit1:
The Inner Loop
§ Skeleton of inner loop:
• for (j = i - 1; j >= 0 && v[j] > v[j + 1]; j -= 1) {
    addi x20,x19,-1    // j = i - 1
for2tst:
    blt x20,x0,exit2   // go to exit2 if x20 < 0 (j < 0)
    slli x5,x20,3      // reg x5 = j * 8
    add x5,x10,x5      // reg x5 = v + (j * 8)
    ld x6,0(x5)        // reg x6 = v[j]
    ld x7,8(x5)        // reg x7 = v[j + 1]
    ble x6,x7,exit2    // go to exit2 if x6 ≤ x7
    mv x21,x10         // copy parameter x10 into x21
    mv x22,x11         // copy parameter x11 into x22
    mv x10,x21         // first swap parameter is v
    mv x11,x20         // second swap parameter is j
    jal x1,swap        // call swap
    addi x20,x20,-1    // j -= 1
    j for2tst          // branch to test of inner loop
exit2:
Preserving Registers
§ Preserve saved registers:
addi sp,sp,-40 // make room on stack for 5 regs
sd x1,32(sp) // save x1 on stack
sd x22,24(sp) // save x22 on stack
sd x21,16(sp) // save x21 on stack
sd x20,8(sp) // save x20 on stack
sd x19,0(sp) // save x19 on stack
Effect of Compiler Optimization
Compiled with gcc for Pentium 4 under Linux
[Figure: bar charts over optimization levels none, O1, O2, and O3]
Effect of Language and Algorithm
[Figure: bar charts over C/none, C/O1, C/O2, C/O3, Java/int, and Java/JIT, including a “Bubblesort Relative Performance” panel]
Lessons Learnt
§ Instruction count and CPI are not good performance indicators in isolation
§ Compiler optimizations are sensitive to the algorithm
§ Java/JIT compiled code is significantly faster than JVM interpreted
• Comparable to optimized C in some cases
§ Nothing can fix a dumb algorithm!