03 Instructions II

EECS 112 (Spring 2024)

Organization of Digital Computers

Chapter 02 - II
RISC-V Instructions: Use Case Studies
and Compilation
Hyoukjun Kwon
[email protected]

Section 1. Calling Convention

Preliminary: Stored Program in Computers
§ How is the program stored?
• Programs are also “binary data” in memory

§ Programs can operate on programs


• e.g., compilers, linkers, …

§ Binary compatibility allows compiled programs to work on different computers
• Standardized ISAs

§ Storing programs as data greatly simplifies both the memory hardware and the software of computer systems

Preliminary: Basic Blocks
§ A basic block is a sequence of instructions with
• No embedded branches (except at end)
• No branch targets (except at beginning)
[Figure: instruction memory, addresses 0x100 through 0x114 in steps of 4]
• Meaning: a contiguous block of instructions that does not make any jump. Within the basic block, the PC always moves to PC + 4 for the next instruction.

§ A compiler identifies basic blocks for optimization


§ An advanced processor can accelerate execution of basic blocks
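One way a compiler might identify basic blocks is to mark "leaders": the first instruction, every branch target, and every instruction that follows a branch. The C sketch below illustrates this under simplifying assumptions; the `Insn` record and `find_leaders` function are hypothetical names for illustration, not part of any real compiler or RISC-V decoder.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified instruction record (not a real RISC-V decoder). */
typedef struct {
    bool is_branch;     /* true for branches and jumps */
    size_t target;      /* index of the branch target, valid if is_branch */
} Insn;

/* Marks leader[i] = true when instruction i starts a basic block.
 * Returns the number of basic blocks found. */
size_t find_leaders(const Insn *code, size_t n, bool *leader)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        leader[i] = false;
    if (n > 0)
        leader[0] = true;                       /* program entry */
    for (size_t i = 0; i < n; i++) {
        if (!code[i].is_branch)
            continue;
        if (code[i].target < n)
            leader[code[i].target] = true;      /* branch target starts a block */
        if (i + 1 < n)
            leader[i + 1] = true;               /* instruction after a branch */
    }
    for (size_t i = 0; i < n; i++)
        if (leader[i])
            count++;
    return count;
}
```

Each basic block then runs from one leader up to, but not including, the next leader.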
Preliminary: Memory Layout
§ Memory Layout: How we split the memory space and utilize each region
• Reserved: Used for system/OS functionalities (e.g., kernel, device driver, etc.)
• Text: Contains the instructions (machine code) of programs
• Static data: Stores constants and global/static variables (e.g., const int x = 3;)
• Dynamic data: Stores dynamic objects created by “malloc,” “calloc,” “new,” and so on
High Address
“Stack” grows downward

Region for Dynamic Data: “Heap”


Note: Memory layout is not a part of RISC-V spec;
it is dependent on the OS. However, most OS
follow a similar style

Low Address
<Main Memory Space>
Procedure (Function) Calling

§ Steps required
1. Place parameters in registers x10 to x17
2. Transfer control to procedure
3. Acquire storage for procedure
4. Perform procedure’s operations
5. Place result in register for caller
6. Return to place of call (address in x1)

Procedure (Function) Calling w/ More Details
1. Place parameters in registers x10 to x17
• Store function arguments into registers

2. Store the return address (i.e., where to resume) in x1
• The address of the instruction after the function call: PC + 4, written by jal

3. Transfer control to procedure


• Update PC with the address of the function to be called (“callee”)

4. Acquire storage for procedure


• Allocate stack space; update stack pointer

5. Perform procedure’s operations


6. Place result in register for caller
• Store the return value in a register

7. Return to place of call (address stored in x1)


Instructions for Procedure Call / Return
§ Procedure call: jump and link
jal x1, ProcedureLabel

• Store the address of following instruction (the one immediately after jal) in x1
o “Return address:” Where to resume after coming back to the original procedure (caller)
• Jumps to target address (of the “callee” function)

§ Procedure return: jump and link register


jalr x0, 0(x1)

• Like jal, but jumps to 0 + address in x1
• Use x0 as rd (x0 cannot be changed, so no return address is saved)
• Can also be used for computed jumps
o e.g., for case/switch statements
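The effect of jal and jalr on the PC and the link register can be modeled in a few lines of C. This is a minimal sketch on a hypothetical 32-bit machine state (`Cpu`, `exec_jal`, and `exec_jalr` are illustrative names, not a real simulator API). It assumes the offset has already been decoded into a byte offset.

```c
#include <stdint.h>

/* Hypothetical machine state: program counter plus 32 integer registers. */
typedef struct {
    uint32_t pc;
    uint32_t x[32];
} Cpu;

/* Writes rd unless rd is x0, which always reads as zero. */
static void write_reg(Cpu *cpu, unsigned rd, uint32_t value)
{
    if (rd != 0)
        cpu->x[rd] = value;
}

/* jal rd, offset: link (rd = pc + 4), then jump relative to pc. */
void exec_jal(Cpu *cpu, unsigned rd, int32_t offset)
{
    uint32_t link = cpu->pc + 4;
    cpu->pc = cpu->pc + (uint32_t)offset;
    write_reg(cpu, rd, link);
}

/* jalr rd, offset(rs1): link, then jump to rs1 + offset with bit 0 cleared. */
void exec_jalr(Cpu *cpu, unsigned rd, unsigned rs1, int32_t offset)
{
    uint32_t link = cpu->pc + 4;
    cpu->pc = (cpu->x[rs1] + (uint32_t)offset) & ~(uint32_t)1;
    write_reg(cpu, rd, link);
}
```

A call followed by a return corresponds to `exec_jal(&cpu, 1, offset)` and later `exec_jalr(&cpu, 0, 1, 0)`, which brings the PC back to the instruction after the call.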

Stack: Data Structure for Temporary Values
§ When it is used
• When spilling registers (i.e., when we need more space than the register file provides)
• “Spilling”: When we do not have sufficient space, we send data to another, larger memory
o E.g., when we have more data than available registers, we store temporary values in main memory (stack)

§ Where it is stored
• Main memory space
• It is likely that the stack values reside in cache memory

§ How it is managed
• Using the stack pointer (“sp”) stored in register x2
• The stack pointer points to the address of the most recently allocated space

§ Stack Operations
• Push: placing data onto the stack
• Pop: removing data from the stack
[Figure: memory space drawn from high address (top) to low address (bottom)]

Local Data on the Stack

• Push three temporary values: SP = SP - 12
• Pop three temporary values: SP = SP + 12
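Push and pop on a downward-growing stack can be sketched in C as follows. The `Machine` type and its 64-byte memory are hypothetical, chosen only for illustration, and the sketch does no overflow or underflow checking.

```c
#include <stdint.h>
#include <string.h>

#define MEM_BYTES 64u

/* Hypothetical toy machine: a small byte-addressed memory and a stack pointer. */
typedef struct {
    uint8_t mem[MEM_BYTES];
    uint32_t sp;            /* address of the most recently allocated word */
} Machine;

void machine_init(Machine *m)
{
    memset(m->mem, 0, sizeof m->mem);
    m->sp = MEM_BYTES;      /* empty stack starts at the high address */
}

/* Push: sp = sp - 4, then store the word at 0(sp). */
void push_word(Machine *m, uint32_t value)
{
    m->sp -= 4;
    memcpy(&m->mem[m->sp], &value, sizeof value);
}

/* Pop: load the word at 0(sp), then sp = sp + 4. */
uint32_t pop_word(Machine *m)
{
    uint32_t value;
    memcpy(&value, &m->mem[m->sp], sizeof value);
    m->sp += 4;
    return value;
}
```

Pushing three 4-byte words lowers sp by 12, and popping them restores it, with values returned in last-in, first-out order.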

Leaf Procedure Example
§ C code:
int leaf_example (int g, int h, int i, int j)
{
    int f;
    f = (g + h) - (i + j);
    return f;
}
• Arguments g, h, i, j in x10, …, x13
• f in x20
• Temporaries x5, x6
• Return register: x10
• Save x5, x6, x20 on stack
• Saving x5 and x6 is optional

§ RISC-V code:
leaf_example:
    addi sp,sp,-12     // Adjust stack to make room for 3 items
    sw x5,8(sp)        // Save x5, x6, x20 on stack
    sw x6,4(sp)
    sw x20,0(sp)
    add x5,x10,x11     // x5 = g + h
    add x6,x12,x13     // x6 = i + j
    sub x20,x5,x6      // f = x5 - x6
    addi x10,x20,0     // copy f to return register
    lw x20,0(sp)       // Restore x5, x6, x20 from stack
    lw x6,4(sp)
    lw x5,8(sp)
    addi sp,sp,12      // Adjust stack to delete 3 items
    jalr x0,0(x1)      // Return to caller

Register Usage
§ x5 – x7, x28 – x31: temporary registers
• Not preserved by the callee

§ x8 – x9, x18 – x27: saved registers


• If used, the callee saves and restores them

Non-Leaf Procedures
§ Procedures that call other procedures
§ For nested call, caller needs to save on the stack:
• Its return address
• Any arguments and temporaries needed after the call
§ Restore from the stack after the call

Non-Leaf Procedure Example
§ C code:
int fact (int n)
{
    if (n < 1) return 1;
    else return n * fact(n - 1);
}
• Argument n in x10
• Result in x10

§ RISC-V code:
fact:
    addi sp,sp,-8      // Save return address and n on stack
    sw x1,4(sp)
    sw x10,0(sp)
    addi x5,x10,-1     // x5 = n - 1
    bge x5,x0,Else     // if n >= 1, go to Else
    addi x10,x0,1      // Otherwise (n < 1), set return value to 1
    addi sp,sp,8       // Pop stack, don’t bother restoring values
    jalr x0,0(x1)      // Return
Else:
    addi x10,x10,-1    // n >= 1: n = n - 1
    jal x1,fact        // call fact(n-1)
    addi x6,x10,0      // move result of fact(n - 1) to x6
    lw x10,0(sp)       // Restore caller’s n
    lw x1,4(sp)        // Restore caller’s return address
    addi sp,sp,8       // Adjust stack to pop 2 items
    mul x10,x10,x6     // return n * fact(n-1)
    jalr x0,0(x1)      // Return to the caller

• Another way to test n >= 1 (replaces the addi/bge pair above):
    addi x5,x0,1       // x5 = 1
    bge x10,x5,Else    // if n >= 1, go to Else

Local Data on the Stack

§ Local data allocated by the callee that does not fit in registers must be stored on the stack
• e.g., local arrays or structures
§ Procedure frame (or activation record): segment of the stack containing local data
• Frame pointer (fp, register x8): used by some compilers to manage stack storage
• fp points to the first word of a procedure frame
RISC-V Register Conventions


Section 2. Conditional Statements

Conditional Operations
§ Branch to a labeled instruction if a condition is true
• Otherwise, continue sequentially

§ beq rs1, rs2, L1


• if (rs1 == rs2) branch to instruction labeled L1

§ bne rs1, rs2, L1


• if (rs1 != rs2) branch to instruction labeled L1

Compiling If Statements
§ C code:
if (i==j) f = g+h;
else f = g-h;
• f, g, h, i, j in x19, x20, x21, x22, x23

§ Compiled RISC-V code:


bne x22, x23, Else
add x19, x20, x21
beq x0, x0, Exit // unconditional
Else: sub x19, x20, x21
Exit: …

Assembler calculates addresses


Compiling Loop Statements
§ C code:
while (save[i] == k) i += 1;
• i in x22, k in x24, address of save in x25
• Assume each element in save has 4 bytes

§ Compiled RISC-V code:


Loop: slli x10, x22, 2 // Temp reg x10 = i * 4
add x10, x10, x25 // x10 = address of save[i]
lw x9, 0(x10) // Temp reg x9 = save[i]
bne x9, x24, Exit // go to Exit if save[i] != k
addi x22, x22, 1 // i = i + 1
beq x0, x0, Loop // go to Loop
Exit: …

More Conditional Operations
§ blt rs1, rs2, L1
• if (rs1 < rs2) branch to instruction labeled L1
§ bge rs1, rs2, L1
• if (rs1 >= rs2) branch to instruction labeled L1
§ Example
• if (a > b) a += 1;
• a in x22, b in x23

bge x23, x22, Exit // branch if b >= a
addi x22, x22, 1
Exit:

Signed vs. Unsigned

§ Signed comparison: blt, bge


§ Unsigned comparison: bltu, bgeu
§ Example
• x22 = 1111 1111 1111 1111 1111 1111 1111 1111
• x23 = 0000 0000 0000 0000 0000 0000 0000 0001
• x22 < x23 // signed
o –1 < +1
• x22 > x23 // unsigned
o +4,294,967,295 > +1
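The same comparison can be reproduced in C by reinterpreting one 32-bit pattern as signed or unsigned. This is a small sketch with hypothetical helper names. It assumes a two's-complement platform, which is what RISC-V and mainstream C compilers use.

```c
#include <stdbool.h>
#include <stdint.h>

/* blt: compare the two 32-bit patterns as signed integers. */
bool less_than_signed(uint32_t a, uint32_t b)
{
    return (int32_t)a < (int32_t)b;
}

/* bltu: compare the same bit patterns as unsigned integers. */
bool less_than_unsigned(uint32_t a, uint32_t b)
{
    return a < b;
}
```

With a = 0xFFFFFFFF and b = 1, the signed comparison treats a as -1 and reports a < b, while the unsigned comparison treats a as 4,294,967,295 and reports the opposite.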


Section 3. String

Communicating with People: Character Data
§ Byte-encoded character sets
• ASCII: 128 characters
o 95 graphic, 33 control
• Latin-1: 256 characters
o ASCII, +96 more graphic characters
§ Unicode: 32-bit character set
• Used in Java, C++ wide characters, …
• Most of the world’s alphabets, plus symbols
• UTF-8, UTF-16: variable-length encodings

UTF-8 Examples
CHARACTER   CODE POINT   UTF-8 BINARY ENCODING
A           U+0041       01000001
a           U+0061       01100001
0           U+0030       00110000
9           U+0039       00111001
!           U+0021       00100001
Ø           U+00D8       11000011 10011000
ڃ           U+0683       11011010 10000011
ಚ           U+0C9A       11100000 10110010 10011010
𠜎           U+2070E      11110000 10100000 10011100 10001110
😁           U+1F601      11110000 10011111 10011000 10000001
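The variable-length rule behind the UTF-8 examples can be written as a short C function. This is a sketch; the function name `utf8_encode` is chosen here for illustration and is not from any particular library.

```c
#include <stddef.h>
#include <stdint.h>

/* Encodes one Unicode code point as UTF-8 into out (at least 4 bytes).
 * Returns the number of bytes written, or 0 if cp is not encodable. */
size_t utf8_encode(uint32_t cp, uint8_t out[4])
{
    if (cp <= 0x7F) {                       /* 1 byte: 0xxxxxxx */
        out[0] = (uint8_t)cp;
        return 1;
    }
    if (cp <= 0x7FF) {                      /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {                     /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                       /* surrogates are not characters */
        out[0] = (uint8_t)(0xE0 | (cp >> 12));
        out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* 4 bytes: 11110xxx then three 10xxxxxx */
        out[0] = (uint8_t)(0xF0 | (cp >> 18));
        out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (uint8_t)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For example, U+00D8 falls in the two-byte range and encodes to the bytes 0xC3 0x98, which is 11000011 10011000 in binary.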
Byte/Halfword/Word Operations

§ RISC-V byte/halfword/word load/store


• Load byte/halfword/word: Sign extend to 64 bits in rd
o lb rd, offset(rs1)
o lh rd, offset(rs1)
o lw rd, offset(rs1)
• Load byte/halfword/word unsigned: Zero extend to 64 bits in rd
o lbu rd, offset(rs1)
o lhu rd, offset(rs1)
o lwu rd, offset(rs1)
• Store byte/halfword/word: Store rightmost 8/16/32 bits
o sb rs2, offset(rs1)
o sh rs2, offset(rs1)
o sw rs2, offset(rs1)
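The difference between sign extension and zero extension can be illustrated in C for the byte case. The helper names below are hypothetical, and memory is modeled as a plain byte array. The halfword and word variants follow the same pattern with 16-bit and 32-bit types.

```c
#include <stdint.h>

/* lb: read one byte and sign-extend it to 64 bits. */
int64_t load_byte(const uint8_t *mem, uint64_t addr)
{
    return (int64_t)(int8_t)mem[addr];
}

/* lbu: read one byte and zero-extend it to 64 bits. */
uint64_t load_byte_unsigned(const uint8_t *mem, uint64_t addr)
{
    return (uint64_t)mem[addr];
}

/* sb: store only the rightmost 8 bits of the register value. */
void store_byte(uint8_t *mem, uint64_t addr, uint64_t reg)
{
    mem[addr] = (uint8_t)(reg & 0xFF);
}
```

A stored byte of 0x80 reads back as -128 through the signed load and as 128 through the unsigned load, assuming a two's-complement platform.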

String Copy Example
§ C code:
• Null-terminated string
void strcpy (char x[], char y[])
{
    size_t i;
    i = 0;
    while ((x[i]=y[i])!='\0')
        i += 1;
}
§ Base addresses for x and y in x10 and x11
§ i in x19

§ RISC-V code:
strcpy:
    addi sp,sp,-4      // adjust stack for one more item
    sw x19,0(sp)       // save x19
    add x19,x0,x0      // i = 0 + 0
L1: add x5,x19,x11     // x5 = addr of y[i]
    lbu x6,0(x5)       // x6 = y[i]
    add x7,x19,x10     // x7 = addr of x[i]
    sb x6,0(x7)        // x[i] = y[i]
    beq x6,x0,L2       // if y[i] == 0 then exit
    addi x19,x19,1     // i = i + 1
    jal x0, L1         // next iteration of loop
L2: lw x19,0(sp)       // restore saved old x19
    addi sp,sp,4       // pop one word off stack
    jalr x0,0(x1)      // return

Loading a 32-bit Constant into a Register
§ Most constants are small
• 12-bit immediate is sufficient
§ For the occasional 32-bit constant
lui rd, constant
• Copies 20-bit constant to bits [31:12] of rd
• Add in the lowest 12 bits (e.g., with addi)

§ lui (Load Upper Immediate)
• Loads a 20-bit constant into bits 12 through 31 of a register; the lower 12 bits are filled with 0
• Uses a new instruction format, U-type, to accommodate such a large constant

lui x19, 976 // 0x003D0


0000 0000 0011 1101 0000 0000 0000 0000

addi x19,x19,1280 // 0x500


0000 0000 0011 1101 0000 0101 0000 0000

U-type Format
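Splitting a 32-bit constant into its lui and addi parts can be sketched in C. The function names are hypothetical, and the sketch models a 32-bit register. One detail not shown in the example above: addi sign-extends its 12-bit immediate, so when bit 11 of the constant is set, the upper part must be incremented by one to compensate.

```c
#include <stdint.h>

/* Splits a 32-bit constant into the lui and addi immediates.
 * addi sign-extends its 12-bit immediate, so when bit 11 of the constant is
 * set, the upper part is incremented to compensate. */
void split_constant(uint32_t value, uint32_t *upper20, int32_t *lower12)
{
    int32_t low = (int32_t)(value & 0xFFF);
    if (low >= 0x800)
        low -= 0x1000;                    /* value addi would actually add */
    *lower12 = low;
    *upper20 = ((value - (uint32_t)low) >> 12) & 0xFFFFF;
}

/* Recombines the two immediates the way lui followed by addi would. */
uint32_t combine_constant(uint32_t upper20, int32_t lower12)
{
    return (upper20 << 12) + (uint32_t)lower12;
}
```

For the constant 0x003D0500 this yields the upper part 976 and the lower part 1280, matching the lui/addi pair in the example.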


Section 4. Addressing

Branch Addressing
§ Branch instructions specify
• Opcode, two registers, target address
§ Most branch targets are near branch
• Forward or backward
§ B-type format (a variant of S-type):
imm[12] | imm[10:5] | rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode

§ PC-relative addressing
• Target address = PC + immediate × 2
• The immediate represents the number of halfwords between the branch and the branch target
• Each instruction is 4 bytes (one word)
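Reassembling the scattered immediate bits of a B-type instruction can be sketched in C as below. The function names are hypothetical. The returned value is the byte offset, that is, the halfword count already multiplied by 2, because the encoding leaves out bit 0 of the offset.

```c
#include <stdint.h>

/* Extracts the sign-extended byte offset from a 32-bit B-type instruction.
 * Instruction bits: [31]=imm[12], [30:25]=imm[10:5], [11:8]=imm[4:1], [7]=imm[11]. */
int32_t btype_offset(uint32_t insn)
{
    uint32_t imm = 0;
    imm |= ((insn >> 31) & 0x1)  << 12;
    imm |= ((insn >> 7)  & 0x1)  << 11;
    imm |= ((insn >> 25) & 0x3F) << 5;
    imm |= ((insn >> 8)  & 0xF)  << 1;
    if (imm & 0x1000)                     /* sign bit imm[12] set */
        return (int32_t)imm - 0x2000;
    return (int32_t)imm;
}

/* PC-relative target: the branch's own address plus the byte offset. */
uint32_t branch_target(uint32_t pc, uint32_t insn)
{
    return pc + (uint32_t)btype_offset(insn);
}
```

For instance, the word 0x00000463 encodes beq x0, x0 with a forward offset of 8 bytes, so a branch at address 0x100 targets 0x108.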
Jump Addressing
§ Jump and link (jal) target uses 20-bit immediate for
larger range
§ UJ format:
imm[20] | imm[10:1] | imm[11] | imm[19:12] | rd (5 bits) | opcode (7 bits)

§ For long jumps, e.g., to 32-bit absolute address


• lui: load address[31:12] to temp register
• jalr: add address[11:0] and jump to target
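The jal immediate can be decoded the same way as a branch immediate, only with more bits. The C sketch below uses a hypothetical function name and returns the signed byte offset. With 20 encoded bits plus an implicit zero in bit 0, the reachable range is roughly ±1 MiB around the jal.

```c
#include <stdint.h>

/* Extracts the sign-extended byte offset from a 32-bit jal instruction.
 * Instruction bits: [31]=imm[20], [30:21]=imm[10:1], [20]=imm[11], [19:12]=imm[19:12]. */
int32_t jtype_offset(uint32_t insn)
{
    uint32_t imm = 0;
    imm |= ((insn >> 31) & 0x1)   << 20;
    imm |= ((insn >> 12) & 0xFF)  << 12;
    imm |= ((insn >> 20) & 0x1)   << 11;
    imm |= ((insn >> 21) & 0x3FF) << 1;
    if (imm & 0x100000)                   /* sign bit imm[20] set */
        return (int32_t)imm - 0x200000;
    return (int32_t)imm;
}
```

Targets outside this range need the lui plus jalr sequence described above.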

RISC-V Addressing Summary

[Figure: RISC-V addressing modes; operands are shaded in color]


RISC-V Encoding Summary


Section 5. Synchronization

Synchronization
§ Two processors sharing an area of memory
• P1 writes, then P2 reads
• Data race if P1 and P2 don’t synchronize
o Result depends on order of accesses
§ Hardware support required
• Atomic read/write memory operation
• No other access to the location allowed between the read and write
§ Could be a single instruction
• E.g., atomic swap of register ↔ memory
• Or an atomic pair of instructions

Synchronization in RISC-V
§ Load reserved: lr.w rd,(rs1)
• Load from address in rs1 to rd
• Place reservation on memory address

§ Store conditional: sc.w rd,rs2,(rs1)


• Store from rs2 to address in rs1
• Succeeds if location not changed since the lr.w
o Returns 0 in rd
• Fails if location is changed
o Returns non-zero value in rd

Synchronization in RISC-V
§ Example 1: atomic swap (to test/set lock variable)
again: lr.w x10,(x20)
sc.w x11,x23,(x20) // x11 = status
bne x11,x0,again // branch if store failed
addi x23,x10,0 // x23 = loaded value

§ Example 2: lock
addi x12,x0,1 // copy locked value
again: lr.w x10,(x20) // read lock
bne x10,x0,again // check if it is 0 yet
sc.w x11,x12,(x20) // attempt to store
bne x11,x0,again // branch if fails
• Unlock:
sw x0,0(x20) // free lock
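In C, the same lock can be expressed with C11 atomics. The `SpinLock` type and function names below are hypothetical. The sketch uses `atomic_compare_exchange_strong` from `<stdatomic.h>` in place of the hand-written lr.w/sc.w pair, and how a compiler lowers it on a given RISC-V target is not shown here.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock word: 0 = free, 1 = held. */
typedef struct {
    atomic_int state;
} SpinLock;

/* One attempt: succeeds only if the lock was 0 and we stored 1 atomically. */
bool spin_trylock(SpinLock *lock)
{
    int expected = 0;
    return atomic_compare_exchange_strong(&lock->state, &expected, 1);
}

/* Retries until the attempt succeeds, like the branch back to "again". */
void spin_lock(SpinLock *lock)
{
    while (!spin_trylock(lock)) {
        /* busy-wait */
    }
}

/* Unlock: store 0, like "sw x0,0(x20)". */
void spin_unlock(SpinLock *lock)
{
    atomic_store(&lock->state, 0);
}
```

The lock word should be initialized to 0 (for example with `atomic_init`) before first use. A critical section is then bracketed by `spin_lock` and `spin_unlock`.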


Section 6. Compilation

Translation and Startup

[Figure: translation and startup flow]
• Many compilers produce object modules directly
• Static linking

Producing an Object Module
§ Assembler (or compiler) translates program into machine instructions
§ Provides information for building a complete program from the pieces
• Header: describes contents of object module
• Text segment: translated instructions
• Static data segment: data allocated for the life of the program
• Relocation info: for contents that depend on absolute location of loaded program
• Symbol table: global definitions and external refs
• Debug info: for associating with source code

Linking Object Modules
§ Produces an executable image
1. Merge segments
2. Resolve labels (determine their addresses)
3. Patch location-dependent and external refs
§ Could leave location dependencies for fixing by a relocating loader
• But with virtual memory, no need to do this
• Program can be loaded into absolute location in virtual memory space

Loading a Program
§ Load from image file on disk into memory
1. Read header to determine segment sizes
2. Create virtual address space
3. Copy text and initialized data into memory
o Or set page table entries so they can be faulted in
4. Set up arguments on stack
5. Initialize registers (including sp, fp, gp)
6. Jump to startup routine
o Copies arguments to x10, … and calls main
o When main returns, do exit syscall

Dynamic Linking

§ Only link/load library procedure when it is called


• Requires procedure code to be relocatable

• Avoids image bloat caused by static linking of all (transitively) referenced libraries

• Automatically picks up new library versions

Lazy Linkage

[Figure: lazy linkage, with these labeled parts]
• Indirection table
• Stub: loads routine ID, jumps to linker/loader
• Linker/loader code
• Dynamically mapped code
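The idea of lazy linkage can be imitated in plain C with a function pointer that initially points at a stub. This is only an analogy under simplifying assumptions. All names are hypothetical, the "library" routine lives in the same file, and no real dynamic loader is involved.

```c
typedef long (*routine_fn)(long);

/* Hypothetical "library" routine that would normally live in a shared object. */
static long library_square(long v)
{
    return v * v;
}

static long stub_square(long v);

/* Indirection table entry: initially points at the stub. */
static routine_fn table_square = stub_square;

/* Counts how many times the (simulated) linker/loader ran. */
static int resolve_count = 0;

/* Stub: on first call, "resolve" the routine, patch the table, then call it. */
static long stub_square(long v)
{
    resolve_count++;
    table_square = library_square;      /* later calls skip the stub */
    return table_square(v);
}

/* Callers always go through the indirection table. */
long call_square(long v)
{
    return table_square(v);
}

/* Reports how many times resolution happened. */
int get_resolve_count(void)
{
    return resolve_count;
}
```

Only the first call pays the resolution cost. Every later call jumps through the patched table entry directly to the routine.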

Starting Java Applications
[Figure: starting Java applications, with these labeled parts]
• Simple portable instruction set for the JVM
• Interprets bytecodes
• Compiles bytecodes of “hot” methods into native code for host machine
C Sort Example
§ Illustrates use of assembly instructions for a C bubble sort function
§ Swap procedure (leaf)
void swap(long long int v[], long long int k)
{
    long long int temp;
    temp = v[k];
    v[k] = v[k+1];
    v[k+1] = temp;
}
• v in x10, k in x11, temp in x5

swap:
    slli x6,x11,3      // reg x6 = k * 8
    add x6,x10,x6      // reg x6 = v + (k * 8)
    ld x5,0(x6)        // reg x5 (temp) = v[k]
    ld x7,8(x6)        // reg x7 = v[k + 1]
    sd x7,0(x6)        // v[k] = reg x7
    sd x5,8(x6)        // v[k+1] = reg x5 (temp)
    jalr x0,0(x1)      // return to calling routine

The Sort Procedure in C
§ Non-leaf (calls swap)
void sort (long long int v[], size_t n)
{
    long long int i, j;   // signed, so that the test j >= 0 can fail
    for (i = 0; i < (long long int) n; i += 1) {
        for (j = i - 1;
             j >= 0 && v[j] > v[j + 1];
             j -= 1) {
            swap(v,j);
        }
    }
}
• v in x10, n in x11, i in x19, j in x20

The Outer Loop
§ Skeleton of outer loop:
• for (i = 0; i < n; i += 1) {

li x19,0 // i = 0
for1tst:
bge x19,x11,exit1 // go to exit1 if x19 ≥ x11 (i ≥ n); the full procedure tests a saved copy of n, because x11 is overwritten when calling swap

(body of outer for-loop)

addi x19,x19,1 // i += 1
j for1tst // branch to test of outer loop
exit1:

The Inner Loop
§ Skeleton of inner loop:
• for (j = i - 1; j >= 0 && v[j] > v[j + 1]; j -= 1) {
// Passing parameters (done once at procedure entry, before the loops)
mv x21, x10 // copy parameter x10 into x21
mv x22, x11 // copy parameter x11 into x22
// Inner loop
addi x20,x19,-1 // j = i - 1
for2tst:
blt x20,x0,exit2 // go to exit2 if x20 < 0 (j < 0)
slli x5,x20,3 // reg x5 = j * 8
add x5,x10,x5 // reg x5 = v + (j * 8)
ld x6,0(x5) // reg x6 = v[j]
ld x7,8(x5) // reg x7 = v[j + 1]
ble x6,x7,exit2 // go to exit2 if x6 ≤ x7
mv x10, x21 // first swap parameter is v
mv x11, x20 // second swap parameter is j
jal x1,swap // call swap
addi x20,x20,-1 // j -= 1
j for2tst // branch to test of inner loop
exit2:

Preserving Registers
§ Preserve saved registers:
addi sp,sp,-40 // make room on stack for 5 regs
sd x1,32(sp) // save x1 on stack
sd x22,24(sp) // save x22 on stack
sd x21,16(sp) // save x21 on stack
sd x20,8(sp) // save x20 on stack
sd x19,0(sp) // save x19 on stack

§ Restore saved registers:
exit1:
ld x19,0(sp) // restore x19 from stack
ld x20,8(sp) // restore x20 from stack
ld x21,16(sp) // restore x21 from stack
ld x22,24(sp) // restore x22 from stack
ld x1,32(sp) // restore x1 from stack
addi sp,sp,40 // restore stack pointer
jalr x0,0(x1) // return to calling routine

Effect of Compiler Optimization
Compiled with gcc for Pentium 4 under Linux
[Figure: four bar charts across optimization levels none, O1, O2, O3: Relative Performance, Instruction count, Clock Cycles, CPI]

Effect of Language and Algorithm
[Figure: three bar charts across C/none, C/O1, C/O2, C/O3, Java/int, Java/JIT: Bubblesort Relative Performance, Quicksort Relative Performance, Quicksort vs. Bubblesort Speedup]

Lessons Learnt
§ Instruction count and CPI are not good performance indicators in isolation
§ Compiler optimizations are sensitive to the algorithm
§ Java/JIT compiled code is significantly faster than JVM interpreted
• Comparable to optimized C in some cases
§ Nothing can fix a dumb algorithm!
