
IT3106E System Programming

Course Expert Team:


Pham Ngoc Hung, Hoang Van Hiep, Nguyen Dinh Thuan

1
Chapter 1. Machine Level Programming
 I. Basics
 II. Control
 III. Procedures
 IV. Data
 V. Advance

With materials from Computer Systems: A Programmer's Perspective, 3/E (CS:APP3e)


Randal E. Bryant and David R. O'Hallaron, Carnegie Mellon University

2
Machine Level Programming I:
Basics

3
Machine Level Programming I: Basics
 History of Intel processors and architectures
 C, assembly, machine code
 Assembly Basics: Registers, operands, move
 Arithmetic & logical operations

4
Intel x86 Processors
 Dominate laptop/desktop/server market

 Evolutionary design
▪ Backwards compatible up until 8086, introduced in 1978
▪ Added more features as time goes on

 Complex instruction set computer (CISC)


▪ Many different instructions with many different formats
▪ But, only a small subset is encountered with Linux programs
▪ Hard to match performance of Reduced Instruction Set Computers (RISC)
▪ But, Intel has done just that!
▪ In terms of speed; less so for low power

5
Intel x86 Evolution: Milestones
Name Date Transistors MHz
 8086 1978 29K 5-10
▪ First 16-bit Intel processor. Basis for IBM PC & DOS
▪ 1MB address space
 386 1985 275K 16-33
▪ First 32-bit Intel processor, referred to as IA32
▪ Added “flat addressing”, capable of running Unix
 Pentium 4E 2004 125M 2800-3800
▪ First 64-bit Intel x86 processor, referred to as x86-64
 Core 2 2006 291M 1060-3500
▪ First multi-core Intel processor
 Core i7 2008 731M 1700-3900
▪ Four cores (our shark machines)

6
Intel x86 Processors, cont.
 Machine Evolution
▪ 386 1985 0.3M
▪ Pentium 1993 3.1M
▪ Pentium/MMX 1997 4.5M
▪ PentiumPro 1995 6.5M
▪ Pentium III 1999 8.2M
▪ Pentium 4 2001 42M
▪ Core 2 Duo 2006 291M
▪ Core i7 2008 731M
 Added Features
▪ Instructions to support multimedia operations
▪ Instructions to enable more efficient conditional operations
▪ Transition from 32 bits to 64 bits
▪ More cores

7
2015 State of the Art
▪ Core i7 Broadwell 2015

 Desktop Model
▪ 4 cores
▪ Integrated graphics
▪ 3.3-3.8 GHz
▪ 65W

 Server Model
▪ 8 cores
▪ Integrated I/O
▪ 2-2.6 GHz
▪ 45W

8
x86 Clones: Advanced Micro Devices (AMD)
 Historically
▪ AMD has followed just behind Intel
▪ A little bit slower, a lot cheaper
 Then
▪ Recruited top circuit designers from Digital Equipment Corp. and other
downward trending companies
▪ Built Opteron: tough competitor to Pentium 4
▪ Developed x86-64, their own extension to 64 bits
 Recent Years
▪ Intel got its act together
▪Leads the world in semiconductor technology
▪ AMD has fallen behind
▪ Relies on external semiconductor manufacturer

9
Intel’s 64-Bit History

 2001: Intel Attempts Radical Shift from IA32 to IA64


▪ Totally different architecture (Itanium)
▪ Executes IA32 code only as legacy
▪ Performance disappointing
 2003: AMD Steps in with Evolutionary Solution
▪ x86-64 (now called “AMD64”)
 Intel Felt Obligated to Focus on IA64
▪ Hard to admit mistake or that AMD is better
 2004: Intel Announces EM64T extension to IA32
▪ Extended Memory 64-bit Technology
▪ Almost identical to x86-64!
 All but low-end x86 processors support x86-64
▪ But, lots of code still runs in 32-bit mode
10
Course Coverage
 IA32
▪ The traditional x86
 x86-64
▪ The standard
▪ shark> gcc hello.c
▪ shark> gcc –m64 hello.c

 Presentation
▪ Book covers x86-64
▪ Web aside on IA32
▪ We will only cover x86-64

11
Machine Programming: Basics
 History of Intel processors and architectures
 C, assembly, machine code
 Assembly Basics: Registers, operands, move
 Arithmetic & logical operations

12
Definitions
 Architecture: (also ISA: instruction set architecture) The parts of
a processor design that one needs to understand or write
assembly/machine code.
▪ Examples: instruction set specification, registers.
 Microarchitecture: Implementation of the architecture.
▪ Examples: cache sizes and core frequency.
 Code Forms:
▪ Machine Code: The byte-level programs that a processor executes
▪ Assembly Code: A text representation of machine code

 Example ISAs:
▪ Intel: x86, IA32, Itanium, x86-64
▪ ARM: Used in almost all mobile phones

13
Assembly/Machine Code View
CPU Memory
Addresses
Registers
Data Code
PC Data
Condition Instructions Stack
Codes

Programmer-Visible State
▪ PC: Program counter ▪ Memory
▪ Address of next instruction ▪ Byte addressable array
▪ Called “RIP” (x86-64) ▪ Code and user data
▪ Stack to support procedures
▪ Register file
▪ Heavily used program data
▪ Condition codes
▪ Store status information about most
recent arithmetic or logical operation
▪ Used for conditional branching 14
Turning C into Object Code
▪ Code in files p1.c p2.c
▪ Compile with command: gcc –Og p1.c p2.c -o p
▪ Use basic optimizations (-Og) [New to recent versions of GCC]
▪ Put resulting binary in file p

text C program (p1.c p2.c)

Compiler (gcc –Og -S)

text Asm program (p1.s p2.s)

Assembler (gcc or as)

binary Object program (p1.o p2.o) Static libraries


(.a)
Linker (gcc or ld)

binary Executable program (p)

15
Compiling Into Assembly
C Code (sum.c)

long plus(long x, long y);

void sumstore(long x, long y, long *dest)
{
    long t = plus(x, y);
    *dest = t;
}

Generated x86-64 Assembly (sum.s)

sumstore:
    pushq   %rbx
    movq    %rdx, %rbx
    call    plus
    movq    %rax, (%rbx)
    popq    %rbx
    ret

Obtain (on shark machine) with command
    gcc -Og -S sum.c
Produces file sum.s
Warning: Will get very different results on non-Shark
machines (Andrew Linux, Mac OS-X, ...) due to different
versions of gcc and different compiler settings.
16
AT&T vs Intel format
 ATT is the default format for GCC, objdump
 To generate Intel format
▪ gcc -Og -S -masm=intel mstore.c
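For reference, a small illustrative sketch (not from the original slides) of the same store instruction from sumstore written in the two syntaxes:

    # AT&T (GCC default): source first, destination last, registers prefixed with %
        movq    %rax, (%rbx)
    # Intel (-masm=intel): destination first, no % prefixes, explicit size keyword
        mov     QWORD PTR [rbx], rax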

17
Assembly Characteristics: Data Types

 “Integer” data of 1, 2, 4, or 8 bytes


▪ Data values
▪ Addresses (untyped pointers)

 Floating point data of 4, 8, or 10 bytes

 Code: Byte sequences encoding series of instructions

 No aggregate types such as arrays or structures


▪ Just contiguously allocated bytes in memory

18
Assembly Characteristics: Operations
 Perform arithmetic function on register or memory data

 Transfer data between memory and register


▪ Load data from memory into register
▪ Store register data into memory

 Transfer control
▪ Unconditional jumps to/from procedures
▪ Conditional branches

20
Object Code
Code for sumstore (total of 14 bytes, starting at address 0x0400595):
    0x53 0x48 0x89 0xd3 0xe8 0xf2 0xff 0xff 0xff 0x48 0x89 0x03 0x5b 0xc3
• Each instruction is 1, 3, or 5 bytes

 Assembler
▪ Translates .s into .o
▪ Binary encoding of each instruction
▪ Nearly-complete image of executable code
▪ Missing linkages between code in different files

 Linker
▪ Resolves references between files
▪ Combines with static run-time libraries
    ▪ E.g., code for malloc, printf
▪ Some libraries are dynamically linked
    ▪ Linking occurs when program begins execution

21
Machine Instruction Example
 C Code
*dest = t;
▪ Store value t where designated by
dest
 Assembly
movq %rax, (%rbx)
▪ Move 8-byte value to memory
▪Quad words in x86-64 parlance
▪ Operands:
t: Register %rax
dest: Register %rbx
*dest: Memory M[%rbx]
 Object Code
0x40059e: 48 89 03
▪ 3-byte instruction
▪ Stored at address 0x40059e

22
Disassembling Object Code
Disassembled
0000000000400595 <sumstore>:
400595: 53 push %rbx
400596: 48 89 d3 mov %rdx,%rbx
400599: e8 f2 ff ff ff callq 400590 <plus>
40059e: 48 89 03 mov %rax,(%rbx)
4005a1: 5b pop %rbx
4005a2: c3 retq

 Disassembler
objdump –d sum
▪ Useful tool for examining object code
▪ Analyzes bit pattern of series of instructions
▪ Produces approximate rendition of assembly code
▪ Can be run on either a.out (complete executable) or .o file
23
Alternate Disassembly
Object (starting at 0x0400595):
    0x53 0x48 0x89 0xd3 0xe8 0xf2 0xff 0xff 0xff 0x48 0x89 0x03 0x5b 0xc3

Disassembled
Dump of assembler code for function sumstore:
   0x0000000000400595 <+0>:   push  %rbx
   0x0000000000400596 <+1>:   mov   %rdx,%rbx
   0x0000000000400599 <+4>:   callq 0x400590 <plus>
   0x000000000040059e <+9>:   mov   %rax,(%rbx)
   0x00000000004005a1 <+12>:  pop   %rbx
   0x00000000004005a2 <+13>:  retq

 Within gdb Debugger
    gdb sum
    disassemble sumstore
▪ Disassemble procedure
    x/14xb sumstore
▪ Examine the 14 bytes starting at sumstore
24
What Can be Disassembled?
% objdump -d WINWORD.EXE

WINWORD.EXE: file format pei-i386

No symbols in "WINWORD.EXE".
Disassembly of section .text:

30001000 <.text>:
30001000: 55 push %ebp
30001001: 8b ec mov %esp,%ebp
30001003: 6a ff push $0xffffffff
30001005: 68 90 10 00 30 push $0x30001090
3000100a: 68 91 dc 4c 30 push $0x304cdc91

Reverse engineering forbidden by


Microsoft End User License Agreement
 Anything that can be interpreted as executable code
 Disassembler examines bytes and reconstructs assembly source
25
Machine Programming: Basics
 History of Intel processors and architectures
 C, assembly, machine code
 Assembly Basics: Registers, operands, move
 Arithmetic & logical operations

26
x86-64 Integer Registers

%rax %eax %r8 %r8d

%rbx %ebx %r9 %r9d

%rcx %ecx %r10 %r10d

%rdx %edx %r11 %r11d

%rsi %esi %r12 %r12d

%rdi %edi %r13 %r13d

%rsp %esp %r14 %r14d

%rbp %ebp %r15 %r15d

▪ Can reference low-order 4 bytes (also low-order 1 & 2 bytes)


27
Some History: IA32 Registers
Origin
(mostly obsolete)

%eax %ax %ah %al accumulate

%ecx %cx %ch %cl counter


general purpose

%edx %dx %dh %dl data

%ebx %bx %bh %bl base

source
%esi %si index

destination
%edi %di index
stack
%esp %sp
pointer
base
%ebp %bp
pointer

16-bit virtual registers


(backwards compatibility) 28
Moving Data
 Moving Data
    movq Source, Dest
 Operand Types
▪ Immediate: Constant integer data
    ▪ Example: $0x400, $-533
    ▪ Like C constant, but prefixed with '$'
    ▪ Encoded with 1, 2, or 4 bytes
▪ Register: One of 16 integer registers
  (%rax, %rbx, %rcx, %rdx, %rsi, %rdi, %rsp, %rbp, %r8–%r15)
    ▪ Example: %rax, %r13
    ▪ But %rsp reserved for special use
    ▪ Others have special uses for particular instructions
▪ Memory: 8 consecutive bytes of memory at address given by register
    ▪ Simplest example: (%rax)
    ▪ Various other "address modes"

29
movq Operand Combinations

  Source   Dest    Src,Dest                  C Analog
  Imm      Reg     movq $0x4,%rax            temp = 0x4;
  Imm      Mem     movq $-147,(%rax)         *p = -147;
  Reg      Reg     movq %rax,%rdx            temp2 = temp1;
  Reg      Mem     movq %rax,(%rdx)          *p = temp;
  Mem      Reg     movq (%rax),%rdx          temp = *p;

Cannot do memory-memory transfer with a single instruction (see the sketch below)
30
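To make the last point concrete, here is a minimal sketch (the register choice is just one plausible compilation, not taken from the slides): an assignment between two memory locations becomes a load into a register followed by a store.

void copy(long *p, long *q)
{
    *q = *p;
}

# p in %rdi, q in %rsi -- one plausible compilation:
#     movq (%rdi), %rax    # load *p into a register
#     movq %rax, (%rsi)    # store it to *q
#     ret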
Simple Memory Addressing Modes
 Normal (R) Mem[Reg[R]]
▪ Register R specifies memory address
▪ Aha! Pointer dereferencing in C

movq (%rcx),%rax

 Displacement D(R) Mem[Reg[R]+D]


▪ Register R specifies start of memory region
▪ Constant displacement D specifies offset

movq 8(%rbp),%rdx

34
Example of Simple Addressing Modes

void swap(long *xp, long *yp)
{
    long t0 = *xp;
    long t1 = *yp;
    *xp = t1;
    *yp = t0;
}

swap:
    movq (%rdi), %rax
    movq (%rsi), %rdx
    movq %rdx, (%rdi)
    movq %rax, (%rsi)
    ret
35
Understanding Swap()

void swap(long *xp, long *yp)
{
    long t0 = *xp;
    long t1 = *yp;
    *xp = t1;
    *yp = t0;
}

  Register   Value
  %rdi       xp
  %rsi       yp
  %rax       t0
  %rdx       t1

swap:
    movq (%rdi), %rax   # t0 = *xp
    movq (%rsi), %rdx   # t1 = *yp
    movq %rdx, (%rdi)   # *xp = t1
    movq %rax, (%rsi)   # *yp = t0
    ret
36
Understanding Swap() — execution trace
Initial state: %rdi = 0x120 (xp), %rsi = 0x100 (yp), M[0x120] = 123, M[0x100] = 456

swap:
    movq (%rdi), %rax   # t0 = *xp   -> %rax = 123
    movq (%rsi), %rdx   # t1 = *yp   -> %rdx = 456
    movq %rdx, (%rdi)   # *xp = t1   -> M[0x120] = 456
    movq %rax, (%rsi)   # *yp = t0   -> M[0x100] = 123
    ret

Final state: M[0x120] = 456, M[0x100] = 123 — the values have been swapped.
37–41
Simple Memory Addressing Modes
 Normal (R) Mem[Reg[R]]
▪ Register R specifies memory address
▪ Aha! Pointer dereferencing in C

movq (%rcx),%rax

 Displacement D(R) Mem[Reg[R]+D]


▪ Register R specifies start of memory region
▪ Constant displacement D specifies offset

movq 8(%rbp),%rdx

42
Complete Memory Addressing Modes
 Most General Form
D(Rb,Ri,S) Mem[Reg[Rb]+S*Reg[Ri]+ D]
▪ D: Constant “displacement” 1, 2, or 4 bytes
▪ Rb: Base register: Any of 16 integer registers
▪ Ri: Index register: Any, except for %rsp
▪ S: Scale: 1, 2, 4, or 8 (why these numbers?)

 Special Cases
(Rb,Ri) Mem[Reg[Rb]+Reg[Ri]]
D(Rb,Ri) Mem[Reg[Rb]+Reg[Ri]+D]
(Rb,Ri,S) Mem[Reg[Rb]+S*Reg[Ri]]

43
Carnegie Mellon

Address Computation Examples

%rdx 0xf000
%rcx 0x0100

Expression Address Computation Address


0x8(%rdx) 0xf000 + 0x8 0xf008
(%rdx,%rcx) 0xf000 + 0x100 0xf100
(%rdx,%rcx,4) 0xf000 + 4*0x100 0xf400
0x80(,%rdx,2) 2*0xf000 + 0x80 0x1e080

44
Machine Programming: Basics
 History of Intel processors and architectures
 C, assembly, machine code
 Assembly Basics: Registers, operands, move
 Arithmetic & logical operations

45
Carnegie Mellon

Address Computation Instruction


 leaq Src, Dst
▪ Src is address mode expression
▪ Set Dst to address denoted by expression

 Uses
▪ Computing addresses without a memory reference
▪E.g., translation of p = &x[i];
▪ Computing arithmetic expressions of the form x + k*y
▪ k = 1, 2, 4, or 8

 Example

long m12(long x)
{
    return x*12;
}

Converted to ASM by compiler:
    leaq (%rdi,%rdi,2), %rax   # t <- x+x*2
    salq $2, %rax              # return t<<2

46
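Another common use of leaq, shown here as a hypothetical illustration (not from the slides), is computing the address &a[i] of an 8-byte array element without touching memory:

long *elem_addr(long *a, long i)
{
    return &a[i];          /* a + 8*i */
}

# a in %rdi, i in %rsi -- one plausible compilation:
#     leaq (%rdi,%rsi,8), %rax
#     ret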
Carnegie Mellon

Some Arithmetic Operations


 Two Operand Instructions:
Format Computation
addq Src,Dest Dest = Dest + Src
subq Src,Dest Dest = Dest − Src
imulq Src,Dest Dest = Dest * Src
salq Src,Dest Dest = Dest << Src Also called shlq
sarq Src,Dest Dest = Dest >> Src Arithmetic
shrq Src,Dest Dest = Dest >> Src Logical
xorq Src,Dest Dest = Dest ^ Src
andq Src,Dest Dest = Dest & Src
orq Src,Dest Dest = Dest | Src
 Watch out for argument order!
 No distinction between signed and unsigned int (why?)
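For example, a minimal sketch (not from the slides) showing that the destination is both an input and the output, and that operand order matters for subtraction:

#     subq %rax, %rbx    # computes %rbx = %rbx - %rax   (not %rax - %rbx)
#     addq $4,   %rdi    # computes %rdi = %rdi + 4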

47
Carnegie Mellon

Some Arithmetic Operations


 One Operand Instructions
incq Dest Dest = Dest + 1
decq Dest Dest = Dest − 1
negq Dest Dest = − Dest
notq Dest Dest = ~Dest

 See book for more instructions

48
Carnegie Mellon

Arithmetic Expression Example

long arith(long x, long y, long z)
{
    long t1 = x+y;
    long t2 = z+t1;
    long t3 = x+4;
    long t4 = y * 48;
    long t5 = t3 + t4;
    long rval = t2 * t5;
    return rval;
}

arith:
    leaq  (%rdi,%rsi), %rax      # t1
    addq  %rdx, %rax             # t2
    leaq  (%rsi,%rsi,2), %rdx
    salq  $4, %rdx               # t4
    leaq  4(%rdi,%rdx), %rcx     # t5
    imulq %rcx, %rax             # rval
    ret

Interesting Instructions
▪ leaq: address computation
▪ salq: shift
▪ imulq: multiplication
    ▪ But, only used once
49

Carnegie Mellon

Understanding Arithmetic Expression Example
The same function, with the registers the compiler chose for each value:

  Register   Use(s)
  %rdi       Argument x
  %rsi       Argument y
  %rdx       Argument z
  %rax       t1, t2, rval
  %rdx       t4
  %rcx       t5
50
Machine Programming I: Summary
 History of Intel processors and architectures
▪ Evolutionary design leads to many quirks and artifacts
 C, assembly, machine code
▪ New forms of visible state: program counter, registers, ...
▪ Compiler must transform statements, expressions, procedures into low-
level instruction sequences
 Assembly Basics: Registers, operands, move
▪ The x86-64 move instructions cover wide range of data movement forms
 Arithmetic
▪ C compiler will figure out different instruction combinations to carry out
computation

51
Carnegie Mellon

Machine-Level Programming II:


Control

52
Carnegie Mellon

Machine-Level Programming II: Control


 Control: Condition codes
 Conditional branches
 Loops
 Switch Statements

53
Carnegie Mellon

Processor State (x86-64, Partial)

 Information about currently executing program
▪ Temporary data ( %rax, ... )
▪ Location of runtime stack ( %rsp )
▪ Location of current code control point ( %rip, ... )
▪ Status of recent tests ( CF, ZF, SF, OF )

Registers
  %rax  %rbx  %rcx  %rdx  %rsi  %rdi  %rsp  %rbp
  %r8   %r9   %r10  %r11  %r12  %r13  %r14  %r15
  %rip          Instruction pointer
  %rsp          Current stack top
  CF ZF SF OF   Condition codes
54
Condition Codes (Implicit Setting)
 Single bit registers
▪CF Carry Flag (for unsigned) SF Sign Flag (for signed)
▪ZF Zero Flag OF Overflow Flag (for signed)

 Implicitly set (think of it as side effect) by arithmetic operations


Example: addq Src,Dest t = a+b
CF set if carry out from most significant bit (unsigned overflow)
ZF set if t == 0
SF set if t < 0 (as signed)
OF set if two’s-complement (signed) overflow
(a>0 && b>0 && t<0) || (a<0 && b<0 && t>=0)

 Not set by leaq instruction

55
Carnegie Mellon

Condition Codes (Explicit Setting: Compare)


 Explicit Setting by Compare Instruction
▪cmpq Src2, Src1
▪cmpq b,a like computing a-b without setting destination

▪CF set if carry out from most significant bit (used for unsigned comparisons)
▪ZF set if a == b
▪SF set if (a-b) < 0 (as signed)
▪OF set if two’s-complement (signed) overflow
(a>0 && b<0 && (a-b)<0) || (a<0 && b>0 && (a-b)>0)

56
Carnegie Mellon

Condition Codes (Explicit Setting: Test)


 Explicit Setting by Test instruction
▪testq Src2, Src1
▪testq b,a like computing a&b without setting destination

▪Sets condition codes based on value of Src1 & Src2


▪Useful to have one of the operands be a mask

▪ZF set when a&b == 0


▪SF set when a&b < 0
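For example, a minimal sketch (not from the slides; the branch labels are placeholders) of testing one bit with a mask, and of the common register-against-itself idiom for a zero test:

#     testq $0x8, %rdi     # set flags from %rdi & 0x8
#     jne   bit_set        # taken when bit 3 of %rdi is 1
#
#     testq %rdi, %rdi     # flags from %rdi & %rdi
#     je    is_zero        # taken when %rdi == 0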

57
Carnegie Mellon

Reading Condition Codes


 SetX Instructions
▪ Set low-order byte of destination to 0 or 1 based on combinations of
condition codes
▪ Does not alter remaining 7 bytes

SetX Condition Description


sete ZF Equal / Zero
setne ~ZF Not Equal / Not Zero
sets SF Negative
setns ~SF Nonnegative
setg ~(SF^OF)&~ZF Greater (Signed)
setge ~(SF^OF) Greater or Equal (Signed)
setl (SF^OF) Less (Signed)
setle (SF^OF)|ZF Less or Equal (Signed)
seta ~CF&~ZF Above (unsigned)
setb CF Below (unsigned)
58
x86-64 Integer Registers

%rax %al %r8 %r8b

%rbx %bl %r9 %r9b

%rcx %cl %r10 %r10b

%rdx %dl %r11 %r11b

%rsi %sil %r12 %r12b

%rdi %dil %r13 %r13b

%rsp %spl %r14 %r14b

%rbp %bpl %r15 %r15b

▪ Can reference low-order byte


59
Carnegie Mellon

Reading Condition Codes (Cont.)


 SetX Instructions:
▪ Set single byte based on combination of
condition codes
 One of addressable byte registers
▪ Does not alter remaining bytes
▪ Typically use movzbl to finish job
▪ 32-bit instructions also set upper 32 bits to 0
int gt (long x, long y)
{
    return x > y;
}

  Register   Use(s)
  %rdi       Argument x
  %rsi       Argument y
  %rax       Return value

    cmpq   %rsi, %rdi   # Compare x:y
    setg   %al          # Set when >
    movzbl %al, %eax    # Zero rest of %rax
    ret
60
Carnegie Mellon

Machine-Level Programming II: Control


 Control: Condition codes
 Conditional branches
 Loops
 Switch Statements

61
Carnegie Mellon

Jumping

 jX Instructions
▪ Jump to different part of code depending on condition codes

jX Condition Description
jmp 1 Unconditional
je ZF Equal / Zero
jne ~ZF Not Equal / Not Zero
js SF Negative
jns ~SF Nonnegative
jg ~(SF^OF)&~ZF Greater (Signed)
jge ~(SF^OF) Greater or Equal (Signed)
jl (SF^OF) Less (Signed)
jle (SF^OF)|ZF Less or Equal (Signed)
ja ~CF&~ZF Above (unsigned)
jb CF Below (unsigned)

62
Carnegie Mellon

Conditional Branch Example (Old Style)


 Generation
    shark> gcc -Og -S -fno-if-conversion control.c

long absdiff(long x, long y)
{
    long result;
    if (x > y)
        result = x-y;
    else
        result = y-x;
    return result;
}

absdiff:
    cmpq %rsi, %rdi    # x:y
    jle  .L4
    movq %rdi, %rax
    subq %rsi, %rax
    ret
.L4:                   # x <= y
    movq %rsi, %rax
    subq %rdi, %rax
    ret

  Register   Use(s)
  %rdi       Argument x
  %rsi       Argument y
  %rax       Return value
63
Carnegie Mellon

Expressing with Goto Code


 C allows goto statement
 Jump to position designated by label

long absdiff(long x, long y)
{
    long result;
    if (x > y)
        result = x-y;
    else
        result = y-x;
    return result;
}

long absdiff_j(long x, long y)
{
    long result;
    int ntest = x <= y;
    if (ntest) goto Else;
    result = x-y;
    goto Done;
Else:
    result = y-x;
Done:
    return result;
}

64
Carnegie Mellon

General Conditional Expression Translation


(Using Branches)
C Code
val = Test ? Then_Expr : Else_Expr;

val = x>y ? x-y : y-x;

Goto Version
ntest = !Test; ▪ Create separate code regions for
if (ntest) goto Else;
then & else expressions
val = Then_Expr;
goto Done; ▪ Execute appropriate one
Else:
val = Else_Expr;
Done:
. . .

65
Carnegie Mellon

Using Conditional Moves

 Conditional Move Instructions
▪ Instruction supports:
    if (Test) Dest <- Src
▪ Supported in post-1995 x86 processors
▪ GCC tries to use them
    ▪ But, only when known to be safe
 Why?
▪ Branches are very disruptive to instruction flow through pipelines
▪ Conditional moves do not require control transfer

C Code
    val = Test ? Then_Expr : Else_Expr;

Goto Version
    result = Then_Expr;
    eval = Else_Expr;
    nt = !Test;
    if (nt) result = eval;
    return result;
66
Carnegie Mellon

Conditional Move Example

long absdiff(long x, long y)
{
    long result;
    if (x > y)
        result = x-y;
    else
        result = y-x;
    return result;
}

  Register   Use(s)
  %rdi       Argument x
  %rsi       Argument y
  %rax       Return value

absdiff:
    movq   %rdi, %rax   # x
    subq   %rsi, %rax   # result = x-y
    movq   %rsi, %rdx
    subq   %rdi, %rdx   # eval = y-x
    cmpq   %rsi, %rdi   # x:y
    cmovle %rdx, %rax   # if <=, result = eval
    ret
67
Carnegie Mellon

Bad Cases for Conditional Move

Expensive Computations
val = Test(x) ? Hard1(x) : Hard2(x);

 Both values get computed


 Only makes sense when computations
are very simple
Risky Computations
val = p ? *p : 0;

 Both values get computed


 May have undesirable effects
Computations with side effects
val = x > 0 ? x*=7 : x+=3;

 Both values get computed


 Must be side-effect free 68
Carnegie Mellon

Machine-Level Programming II: Control


 Control: Condition codes
 Conditional branches
 Loops
 Switch Statements

69
Carnegie Mellon

“Do-While” Loop Example

C Code

long pcount_do(unsigned long x) {
    long result = 0;
    do {
        result += x & 0x1;
        x >>= 1;
    } while (x);
    return result;
}

Goto Version

long pcount_goto(unsigned long x) {
    long result = 0;
loop:
    result += x & 0x1;
    x >>= 1;
    if (x) goto loop;
    return result;
}

 Count number of 1’s in argument x (“popcount”)


 Use conditional branch to either continue looping or to exit
loop

70
Carnegie Mellon

“Do-While” Loop Compilation


Goto Version

long pcount_goto(unsigned long x) {
    long result = 0;
loop:
    result += x & 0x1;
    x >>= 1;
    if (x) goto loop;
    return result;
}

  Register   Use(s)
  %rdi       Argument x
  %rax       result

    movl $0, %eax         # result = 0
.L2:                      # loop:
    movq %rdi, %rdx
    andl $1, %edx         # t = x & 0x1
    addq %rdx, %rax       # result += t
    shrq %rdi             # x >>= 1
    jne  .L2              # if (x) goto loop
    rep; ret
71
Carnegie Mellon

General “Do-While” Translation

C Code Goto Version


do loop:
Body Body
while (Test); if (Test)
goto loop
 Body: {
Statement1;
Statement2;

Statementn;
}

72
Carnegie Mellon

General “While” Translation #1


 “Jump-to-middle” translation
 Used with -Og

Goto Version
goto test;
loop:
While version Body
while (Test) test:
Body if (Test)
goto loop;
done:

73
Carnegie Mellon

While Loop Example #1

C Code

long pcount_while(unsigned long x) {
    long result = 0;
    while (x) {
        result += x & 0x1;
        x >>= 1;
    }
    return result;
}

Jump to Middle Version

long pcount_goto_jtm(unsigned long x) {
    long result = 0;
    goto test;
loop:
    result += x & 0x1;
    x >>= 1;
test:
    if (x) goto loop;
    return result;
}

 Compare to do-while version of function


 Initial goto starts loop at test

74
Carnegie Mellon

General “While” Translation #2

While version
 “Do-while” conversion
while (Test)
 Used with –O1
Body

Goto Version
Do-While Version if (!Test)
if (!Test) goto done;
goto done; loop:
do Body
Body if (Test)
while(Test); goto loop;
done: done:
75
Carnegie Mellon

While Loop Example #2

C Code Do-While Version


long pcount_while long pcount_goto_dw
(unsigned long x) { (unsigned long x) {
long result = 0; long result = 0;
while (x) { if (!x) goto done;
result += x & 0x1; loop:
x >>= 1; result += x & 0x1;
} x >>= 1;
return result; if(x) goto loop;
} done:
return result;
}

 Compare to do-while version of function


 Initial conditional guards entrance to loop

76
Carnegie Mellon

“For” Loop Form

General Form
    for (Init; Test; Update)
        Body

For pcount_for:
    Init:    i = 0
    Test:    i < WSIZE
    Update:  i++
    Body:    {
                 unsigned bit = (x >> i) & 0x1;
                 result += bit;
             }

#define WSIZE 8*sizeof(int)

long pcount_for(unsigned long x)
{
    size_t i;
    long result = 0;
    for (i = 0; i < WSIZE; i++)
    {
        unsigned bit = (x >> i) & 0x1;
        result += bit;
    }
    return result;
}
77
Carnegie Mellon

“For” Loop → While Loop


For Version
for (Init; Test; Update )
Body

While Version
Init;
while (Test ) {
Body
Update;
}
78
Carnegie Mellon

For-While Conversion
long pcount_for_while
Init (unsigned long x)
{
i = 0 size_t i;
long result = 0;
Test i = 0;
i < WSIZE while (i < WSIZE)
{
unsigned bit =
Update
(x >> i) & 0x1;
i++ result += bit;
i++;
Body }
{ return result;
unsigned bit = }
(x >> i) & 0x1;
result += bit;
}

79
Carnegie Mellon

“For” Loop Do-While Conversion

Goto Version long pcount_for_goto_dw


C Code
(unsigned long x) {
long pcount_for size_t i;
(unsigned long x) long result = 0;
{ i = 0; Init
size_t i; if (!(i < WSIZE))
long result = 0; goto done; !Test
for (i = 0; i < WSIZE; i++) loop:
{ {
unsigned bit = unsigned bit =
(x >> i) & 0x1; (x >> i) & 0x1; Body
result += bit; result += bit;
} }
return result; i++; Update
} if (i < WSIZE)
Test
goto loop;
 Initial test can be optimized done:
away return result;
}
80
Carnegie Mellon

Machine-Level Programming II: Control


 Control: Condition codes
 Conditional branches
 Loops
 Switch Statements

81
Carnegie Mellon

Switch Statement Example

long switch_eg(long x, long y, long z)
{
    long w = 1;
    switch(x) {
    case 1:
        w = y*z;
        break;
    case 2:
        w = y/z;
        /* Fall Through */
    case 3:
        w += z;
        break;
    case 5:
    case 6:
        w -= z;
        break;
    default:
        w = 2;
    }
    return w;
}

 Multiple case labels
▪ Here: 5 & 6
 Fall through cases
▪ Here: 2
 Missing cases
▪ Here: 4
82
Carnegie Mellon

Jump Table Structure

Switch Form                 Jump Table            Jump Targets

switch(x) {                 jtab:  Targ0          Targ0:    Code Block 0
   case val_0:                     Targ1          Targ1:    Code Block 1
      Block 0                      Targ2          Targ2:    Code Block 2
   case val_1:                       •                         •
      Block 1                        •                         •
      •                            Targn-1        Targn-1:  Code Block n–1
   case val_n-1:
      Block n–1
}

Translation (Extended C)
    goto *JTab[x];
83
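The "Extended C" translation can be sketched with GCC's labels-as-values extension; this is a hypothetical illustration of the idea (not the compiler's actual output), and the block bodies are placeholders:

/* goto *JTab[x]; sketched with GCC's &&label / computed-goto extension */
long switch_sketch(long x)
{
    static void *JTab[4] = { &&blk0, &&blk1, &&blk2, &&blk3 };
    long w = 1;
    if (x < 0 || x > 3) goto dflt;   /* out-of-range values take the default */
    goto *JTab[x];
blk0: w = 10; goto done;
blk1: w = 11; goto done;
blk2: w = 12; goto done;
blk3: w = 13; goto done;
dflt: w = 2;
done: return w;
}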
Carnegie Mellon

Switch Statement Example

long switch_eg(long x, long y, long z)


{
long w = 1;
switch(x) {
. . .
}
return w;
}

Setup:
Register Use(s)
switch_eg:
movq %rdx, %rcx %rdi Argument x
cmpq $6, %rdi # x:6 %rsi Argument y
ja .L8
jmp *.L4(,%rdi,8) %rdx Argument z
%rax Return value
What range of values Note that w not
takes default? initialized here 84
Carnegie Mellon

Switch Statement Example

long switch_eg(long x, long y, long z)


{
long w = 1;
switch(x) {
. . .
Jump table
.section .rodata
} .align 8
return w; .L4:
} .quad .L8 # x = 0
.quad .L3 # x = 1
.quad .L5 # x = 2
Setup: .quad .L9 # x = 3
.quad .L8 # x = 4
switch_eg: .quad .L7 # x = 5
movq %rdx, %rcx .quad .L7 # x = 6
cmpq $6, %rdi # x:6
ja .L8 # Use default
Indirect jmp *.L4(,%rdi,8) # goto *JTab[x]
jump

85
Carnegie Mellon

Assembly Setup Explanation

 Table Structure Jump table


▪ Each target requires 8 bytes
.section .rodata
▪ Base address at .L4 .align 8
.L4:
.quad .L8 # x = 0
.quad .L3 # x = 1
 Jumping .quad .L5 # x = 2
.quad .L9 # x = 3
▪ Direct: jmp .L8 .quad .L8 # x = 4
▪ Jump target is denoted by label .L8 .quad
.quad
.L7 # x
.L7 # x
=
=
5
6

▪ Indirect: jmp *.L4(,%rdi,8)


▪ Start of jump table: .L4
▪ Must scale by factor of 8 (addresses are 8 bytes)
▪ Fetch target from effective Address .L4 + x*8
▪ Only for 0 ≤ x ≤ 6

86
Carnegie Mellon

Jump Table

Jump table
switch(x) {
.section .rodata case 1: // .L3
.align 8 w = y*z;
.L4: break;
.quad .L8 # x = 0
.quad .L3 # x = 1
case 2: // .L5
.quad .L5 # x = 2 w = y/z;
.quad .L9 # x = 3 /* Fall Through */
.quad .L8 # x = 4
case 3: // .L9
.quad .L7 # x = 5
.quad .L7 # x = 6 w += z;
break;
case 5:
case 6: // .L7
w -= z;
break;
default: // .L8
w = 2;
}

87
Carnegie Mellon

Code Blocks (x == 1)

switch(x) { .L3:
case 1: // .L3 movq %rsi, %rax # y
w = y*z; imulq %rdx, %rax # y*z
break; ret
. . .
}

Register Use(s)
%rdi Argument x
%rsi Argument y
%rdx Argument z
%rax Return value

88
Carnegie Mellon

Handling Fall-Through

long w = 1;
. . .
switch(x) { case 2:
. . . w = y/z;
case 2: goto merge;
w = y/z;
/* Fall Through */
case 3:
w += z;
break;
. . .
case 3:
}
w = 1;

merge:
w += z;

89
Carnegie Mellon

Code Blocks (x == 2, x == 3)

.L5: # Case 2
long w = 1; movq %rsi, %rax
. . . cqto
switch(x) { idivq %rcx # y/z
. . . jmp .L6 # goto merge
case 2: .L9: # Case 3
w = y/z; movl $1, %eax # w = 1
/* Fall Through */ .L6: # merge:
case 3: addq %rcx, %rax # w += z
w += z; ret
break;
. . .
} Register Use(s)
%rdi Argument x
%rsi Argument y
%rdx Argument z
%rax Return value
90
Carnegie Mellon

Code Blocks (x == 5, x == 6, default)

switch(x) { .L7: # Case 5,6


. . . movl $1, %eax # w = 1
case 5: // .L7 subq %rdx, %rax # w -= z
case 6: // .L7 ret
w -= z; .L8: # Default:
break; movl $2, %eax # 2
default: // .L8 ret
w = 2;
}

Register Use(s)
%rdi Argument x
%rsi Argument y
%rdx Argument z
%rax Return value
91
Carnegie Mellon

Summarizing
 C Control
▪ if-then-else
▪ do-while
▪ while, for
▪ switch
 Assembler Control
▪ Conditional jump
▪ Conditional move
▪ Indirect jump (via jump tables)
▪ Compiler generates code sequence to implement more complex control
 Standard Techniques
▪ Loops converted to do-while or jump-to-middle form
▪ Large switch statements use jump tables
▪ Sparse switch statements may use decision trees (if-elseif-elseif-else)
92
Carnegie Mellon

Summary
 Today
▪ Control: Condition codes
▪ Conditional branches & conditional moves
▪ Loops
▪ Switch statements
 Next Time
▪ Stack
▪ Call / return
▪ Procedure call discipline

93
Carnegie Mellon

Machine-Level Programming III:


Procedures

94
Mechanisms in Procedures

 Passing control
▪ To beginning of procedure code
▪ Back to return point
 Passing data
▪ Procedure arguments
▪ Return value
 Memory management
▪ Allocate during procedure execution
▪ Deallocate upon return
 Mechanisms all implemented with machine instructions
 x86-64 implementation of a procedure uses only those mechanisms required

P(…) {
    •
    y = Q(x);
    print(y)
    •
}

int Q(int i)
{
    int t = 3*i;
    int v[10];
    •
    return v[t];
}
95
Carnegie Mellon

Machine-Level Programming III: Procedures


 Procedures
▪ Stack Structure
▪ Calling Conventions
▪ Passing control
▪ Passing data
▪ Managing local data
▪ Illustration of Recursion

96
Carnegie Mellon

x86-64 Stack
Stack “Bottom”
 Region of memory managed
with stack discipline
Increasing
 Grows toward lower addresses
Addresses

 Register %rsp contains


lowest stack address
▪ address of “top” element
Stack
Grows
Down
Stack Pointer: %rsp

Stack “Top”

97
Carnegie Mellon

x86-64 Stack: Push


 pushq Src
Stack “Bottom”
▪ Fetch operand at Src
▪ Decrement %rsp by 8
▪ Write operand at address given by %rsp Increasing
Addresses

Stack
Grows
Down
Stack Pointer: %rsp
-8

Stack “Top”
98
Carnegie Mellon

x86-64 Stack: Pop


Stack “Bottom”
 popq Dest
▪ Read value at address given by %rsp
▪ Increment %rsp by 8 Increasing
Addresses
▪ Store value at Dest (must be register)

Stack
Grows
+8 Down
Stack Pointer: %rsp

Stack “Top”

99
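In terms of %rsp updates, a minimal sketch of the equivalent operations (not from the original slides):

# pushq %rax   is equivalent to:
#     subq $8, %rsp
#     movq %rax, (%rsp)
#
# popq %rdx    is equivalent to:
#     movq (%rsp), %rdx
#     addq $8, %rsp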
Carnegie Mellon

Machine-Level Programming III: Procedures


 Procedures
▪ Stack Structure
▪ Calling Conventions
▪ Passing control
▪ Passing data
▪ Managing local data
▪ Illustration of Recursion

100
Code Examples

void multstore(long x, long y, long *dest)
{
    long t = mult2(x, y);
    *dest = t;
}

0000000000400540 <multstore>:
  400540: push  %rbx            # Save %rbx
  400541: mov   %rdx,%rbx       # Save dest
  400544: callq 400550 <mult2>  # mult2(x,y)
  400549: mov   %rax,(%rbx)     # Save at dest
  40054c: pop   %rbx            # Restore %rbx
  40054d: retq                  # Return

long mult2(long a, long b)
{
    long s = a * b;
    return s;
}

0000000000400550 <mult2>:
  400550: mov   %rdi,%rax       # a
  400553: imul  %rsi,%rax       # a * b
  400557: retq                  # Return
101
Carnegie Mellon

Procedure Control Flow


 Use stack to support procedure call and return
 Procedure call: call label
▪ Push return address on stack
▪ Jump to label
 Return address:
▪ Address of the next instruction right after call
▪ Example from disassembly
 Procedure return: ret
▪ Pop address from stack
▪ Jump to address

102
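Conceptually, a sketch of the semantics (not an exact instruction encoding):

# call label   behaves roughly like:
#     pushq $<address of the instruction after the call>   # save return address
#     jmp   label
#
# ret           behaves roughly like:
#     pop the saved return address into the program counter and jump there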
Control Flow Example

0000000000400540 <multstore>:
  •
  400544: callq 400550 <mult2>
  400549: mov   %rax,(%rbx)
  •

0000000000400550 <mult2>:
  400550: mov   %rdi,%rax
  •
  400557: retq

Step 1 — about to execute the callq:                  %rsp = 0x120, %rip = 0x400544
Step 2 — callq pushes the return address 0x400549
         at 0x118 and jumps to mult2:                 %rsp = 0x118, %rip = 0x400550
Step 3 — mult2 runs; about to execute retq:           %rsp = 0x118, %rip = 0x400557
Step 4 — retq pops the return address and jumps
         back into multstore:                         %rsp = 0x120, %rip = 0x400549
103–106
Carnegie Mellon

Machine-Level Programming III: Procedures


 Procedures
▪ Stack Structure
▪ Calling Conventions
▪ Passing control
▪ Passing data
▪ Managing local data
▪ Illustrations of Recursion & Pointers

107
Carnegie Mellon

Procedure Data Flow


Registers Stack
 First 6 arguments

%rdi
•••
%rsi Arg n
%rdx
%rcx
•••
%r8 Arg 8
%r9 Arg 7
 Return value

%rax  Only allocate stack space


when needed
108
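As a quick illustration, a hypothetical function (not from the original slides) with eight integer arguments: the first six arrive in registers, the last two on the stack.

long sum8(long a, long b, long c, long d,
          long e, long f, long g, long h)
{
    return a+b+c+d+e+f+g+h;
}
/* a..f arrive in %rdi, %rsi, %rdx, %rcx, %r8, %r9;
   g and h are read from the caller's stack frame: with the return
   address at (%rsp) on entry, g is at 8(%rsp) and h at 16(%rsp). */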
Data Flow Examples

void multstore(long x, long y, long *dest)
{
    long t = mult2(x, y);
    *dest = t;
}

0000000000400540 <multstore>:
  # x in %rdi, y in %rsi, dest in %rdx
  •••
  400541: mov   %rdx,%rbx       # Save dest
  400544: callq 400550 <mult2>  # mult2(x,y)
  # t in %rax
  400549: mov   %rax,(%rbx)     # Save at dest
  •••

long mult2(long a, long b)
{
    long s = a * b;
    return s;
}

0000000000400550 <mult2>:
  # a in %rdi, b in %rsi
  400550: mov   %rdi,%rax       # a
  400553: imul  %rsi,%rax       # a * b
  # s in %rax
  400557: retq                  # Return
109
Carnegie Mellon

Machine-Level Programming III: Procedures


 Procedures
▪ Stack Structure
▪ Calling Conventions
▪ Passing control
▪ Passing data
▪ Managing local data
▪ Illustration of Recursion

110
Carnegie Mellon

Stack-Based Languages

 Languages that support recursion


▪ e.g., C, Pascal, Java
▪ Code must be “Reentrant”
▪Multiple simultaneous instantiations of single procedure
▪ Need some place to store state of each instantiation
▪ Arguments
▪ Local variables
▪ Return pointer

 Stack discipline
▪ State for given procedure needed for limited time
▪From when called to when return
▪ Callee returns before caller does
 Stack allocated in Frames
▪ state for single procedure instantiation
111
Carnegie Mellon

Call Chain Example


Example
yoo(…) Call Chain
{
• yoo
• who(…)
who(); { who
• • • •
• amI();
} amI(…) amI amI
• • • {
amI(); •
• • • amI

} amI();
• amI

}

Procedure amI() is recursive

112
Carnegie Mellon

Stack Frames
Previous
Frame
 Contents
▪ Return information
Frame Pointer: %rbp
▪ Local storage (if needed) (Optional) x
▪ Temporary space (if needed) Frame for
proc

Stack Pointer: %rsp


 Management
▪ Space allocated when enter Stack “Top”
procedure
▪ “Set-up” code
▪ Includes push by call instruction
▪ Deallocated when return
▪ “Finish” code
▪ Includes pop by ret instruction
113
Carnegie Mellon

Example Stack

yoo(…)              who(…)                 amI(…)
{                   {                      {
  •                   • • •                  •
  who();              amI();                 amI();    /* recursive */
  •                   • • •                  •
}                     amI();               }
                      • • •
                    }

As the call chain  yoo → who → amI → amI → amI  unfolds and unwinds,
frames are pushed and popped in last-in, first-out order; at each point
%rbp and %rsp bracket the frame of the procedure currently executing:

  yoo running                        frames: yoo
  yoo calls who                      frames: yoo, who
  who calls amI                      frames: yoo, who, amI
  amI calls amI (recursion)          frames: yoo, who, amI, amI
  amI calls amI again                frames: yoo, who, amI, amI, amI
  deepest amI returns                frames: yoo, who, amI, amI
  next amI returns                   frames: yoo, who, amI
  first amI returns                  frames: yoo, who
  who calls amI a second time        frames: yoo, who, amI
  that amI returns                   frames: yoo, who
  who returns                        frames: yoo
114–124
Carnegie Mellon

x86-64/Linux Stack Frame

 Current Stack Frame ("Top" to Bottom)
▪ "Argument build": parameters for function about to call
▪ Local variables, if they can't be kept in registers
▪ Saved register context
▪ Old frame pointer (optional)

 Caller Stack Frame
▪ Return address
    ▪ Pushed by call instruction
▪ Arguments for this call

Layout (high to low addresses):

  Caller frame:     Arguments 7+
                    Return Addr
  %rbp (optional):  Old %rbp
                    Saved Registers + Local Variables
  %rsp:             Argument Build (optional)
125
Carnegie Mellon

Example: incr

long incr(long *p, long val) {
    long x = *p;
    long y = x + val;
    *p = y;
    return x;
}

incr:
    movq (%rdi), %rax
    addq %rax, %rsi
    movq %rsi, (%rdi)
    ret

  Register   Use(s)
  %rdi       Argument p
  %rsi       Argument val, y
  %rax       x, Return value
126
Carnegie Mellon

Example: Calling incr

long call_incr() {
    long v1 = 15213;
    long v2 = incr(&v1, 3000);
    return v1+v2;
}

call_incr:
    subq  $16, %rsp           # allocate 16-byte frame
    movq  $15213, 8(%rsp)     # v1 at %rsp+8
    movl  $3000, %esi         # second argument: val = 3000
    leaq  8(%rsp), %rdi       # first argument: p = &v1
    call  incr
    addq  8(%rsp), %rax       # add v1 (now 18213) to returned 15213
    addq  $16, %rsp           # deallocate frame
    ret

Stack structure during the call:

  Initial:            ... | Rtn address                       <- %rsp
  After subq/movq:    ... | Rtn address | 15213 (at %rsp+8) | Unused   <- %rsp
  After call incr:    v1 in memory becomes 18213; incr returns 15213 in %rax
  After addq $16:     ... | Rtn address                       <- %rsp
  After ret:          ...                                     <- %rsp
127–131
Carnegie Mellon

Register Saving Conventions

 When procedure yoo calls who:
▪ yoo is the caller
▪ who is the callee
 Can register be used for temporary storage?

yoo:                               who:
    • • •                              • • •
    movq $15213, %rdx                  subq $18213, %rdx
    call who                           • • •
    addq %rdx, %rax                    ret
    • • •
    ret

▪ Contents of register %rdx overwritten by who
▪ This could be trouble ➙ something should be done!
▪ Need some coordination
132
Carnegie Mellon

Register Saving Conventions


 When procedure yoo calls who:
▪ yoo is the caller
▪ who is the callee
 Can register be used for temporary storage?
 Conventions
▪ “Caller Saved”
▪ Caller saves temporary values in its frame before the call
▪ “Callee Saved”
▪ Callee saves temporary values in its frame before using
▪ Callee restores them before returning to caller

133
Carnegie Mellon

x86-64 Linux Register Usage #1


 %rax
▪ Return value
Return value %rax
▪ Also caller-saved
▪ Can be modified by procedure %rdi
 %rdi, ..., %r9 %rsi
▪ Arguments %rdx
Arguments
▪ Also caller-saved %rcx
▪ Can be modified by procedure
%r8
 %r10, %r11
%r9
▪ Caller-saved
▪ Can be modified by procedure %r10
Caller-saved
temporaries %r11

134
Carnegie Mellon

x86-64 Linux Register Usage #2


 %rbx, %r12, %r13, %r14
▪ Callee-saved %rbx
▪ Callee must save & restore Callee-saved
%r12
 %rbp Temporaries %r13
▪ Callee-saved %r14
▪ Callee must save & restore
%rbp
▪ May be used as frame pointer Special
▪ Can mix & match %rsp
 %rsp
▪ Special form of callee save
▪ Restored to original value upon
exit from procedure

135
Carnegie Mellon

Callee-Saved Example #1
Initial Stack Structure
long call_incr2(long x) {
long v1 = 15213;
long v2 = incr(&v1, 3000); ...
return x+v2;
} Rtn address %rsp

call_incr2:
pushq %rbx
Resulting Stack Structure
subq $16, %rsp
movq %rdi, %rbx
movq $15213, 8(%rsp) ...
movl $3000, %esi
leaq 8(%rsp), %rdi
call incr Rtn address
addq %rbx, %rax Saved %rbx
addq $16, %rsp
15213 %rsp+8
popq %rbx
ret Unused %rsp
136
Carnegie Mellon

Callee-Saved Example #2
Resulting Stack Structure
long call_incr2(long x) {
long v1 = 15213; ...
long v2 = incr(&v1, 3000);
return x+v2;
Rtn address
}
Saved %rbx
15213 %rsp+8
call_incr2:
pushq %rbx Unused %rsp
subq $16, %rsp
movq %rdi, %rbx
movq $15213, 8(%rsp) Pre-return Stack Structure
movl $3000, %esi
leaq 8(%rsp), %rdi
call incr ...
addq %rbx, %rax
addq $16, %rsp Rtn address %rsp
popq %rbx
ret
137
Carnegie Mellon

Machine-Level Programming III: Procedures


 Procedures
▪ Stack Structure
▪ Calling Conventions
▪ Passing control
▪ Passing data
▪ Managing local data
▪ Illustration of Recursion

138
Carnegie Mellon

Recursive Function

/* Recursive popcount */
long pcount_r(unsigned long x) {
    if (x == 0)
        return 0;
    else
        return (x & 1)
             + pcount_r(x >> 1);
}

pcount_r:
    movl  $0, %eax
    testq %rdi, %rdi
    je    .L6
    pushq %rbx
    movq  %rdi, %rbx
    andl  $1, %ebx
    shrq  %rdi        # (by 1)
    call  pcount_r
    addq  %rbx, %rax
    popq  %rbx
.L6:
    rep; ret
139
Carnegie Mellon

Recursive Function Terminal Case

/* Recursive popcount */ pcount_r:


long pcount_r(unsigned long x) { movl $0, %eax
if (x == 0) testq %rdi, %rdi
return 0; je .L6
else pushq %rbx
return (x & 1) movq %rdi, %rbx
+ pcount_r(x >> 1); andl $1, %ebx
} shrq %rdi # (by 1)
call pcount_r
addq %rbx, %rax
popq %rbx
.L6:
rep; ret
Register Use(s) Type
%rdi x Argument
%rax Return value Return value

140
Carnegie Mellon

Recursive Function Register Save


pcount_r:
/* Recursive popcount */ movl $0, %eax
long pcount_r(unsigned long x) { testq %rdi, %rdi
if (x == 0) je .L6
return 0; pushq %rbx
else movq %rdi, %rbx
return (x & 1) andl $1, %ebx
+ pcount_r(x >> 1); shrq %rdi # (by 1)
} call pcount_r
addq %rbx, %rax
popq %rbx
.L6:
rep; ret

Register Use(s) Type


%rdi x Argument
...

Rtn address
Saved %rbx %rsp
141
Carnegie Mellon

Recursive Function Call Setup

/* Recursive popcount */ pcount_r:


long pcount_r(unsigned long x) { movl $0, %eax
if (x == 0) testq %rdi, %rdi
return 0; je .L6
else pushq %rbx
return (x & 1) movq %rdi, %rbx
+ pcount_r(x >> 1); andl $1, %ebx
} shrq %rdi # (by 1)
call pcount_r
addq %rbx, %rax
popq %rbx
.L6:
rep; ret
Register Use(s) Type
%rdi x >> 1 Rec. argument
%rbx x & 1 Callee-saved

142
Carnegie Mellon

Recursive Function Call

/* Recursive popcount */ pcount_r:


long pcount_r(unsigned long x) { movl $0, %eax
if (x == 0) testq %rdi, %rdi
return 0; je .L6
else pushq %rbx
return (x & 1) movq %rdi, %rbx
+ pcount_r(x >> 1); andl $1, %ebx
} shrq %rdi # (by 1)
call pcount_r
addq %rbx, %rax
popq %rbx
.L6:
rep; ret
Register Use(s) Type
%rbx x & 1 Callee-saved
%rax Recursive call return
value

143
Carnegie Mellon

Recursive Function Result

/* Recursive popcount */ pcount_r:


long pcount_r(unsigned long x) { movl $0, %eax
if (x == 0) testq %rdi, %rdi
return 0; je .L6
else pushq %rbx
return (x & 1) movq %rdi, %rbx
+ pcount_r(x >> 1); andl $1, %ebx
} shrq %rdi # (by 1)
call pcount_r
addq %rbx, %rax
popq %rbx
.L6:
rep; ret
Register Use(s) Type
%rbx x & 1 Callee-saved
%rax Return value

144
Carnegie Mellon

Recursive Function Completion


pcount_r:
/* Recursive popcount */ movl $0, %eax
long pcount_r(unsigned long x) { testq %rdi, %rdi
if (x == 0) je .L6
return 0; pushq %rbx
else movq %rdi, %rbx
return (x & 1) andl $1, %ebx
+ pcount_r(x >> 1); shrq %rdi # (by 1)
} call pcount_r
addq %rbx, %rax
popq %rbx
.L6:
rep; ret

Register Use(s) Type


%rax Return value Return value
...
%rsp

145
Carnegie Mellon

Observations About Recursion


 Handled Without Special Consideration
▪ Stack frames mean that each function call has private storage
▪ Saved registers & local variables
▪ Saved return pointer
▪ Register saving conventions prevent one function call from corrupting
another’s data
▪ Unless the C code explicitly does so (e.g., buffer overflow in Lecture
9)
▪ Stack discipline follows call / return pattern
▪ If P calls Q, then Q returns before P
▪ Last-In, First-Out

 Also works for mutual recursion


▪ P calls Q; Q calls P

146
Carnegie Mellon

x86-64 Procedure Summary


 Important Points
▪ Stack is the right data structure for procedure
call / return
▪ If P calls Q, then Q returns before P Caller
 Recursion (& mutual recursion) handled Frame
Arguments
by normal calling conventions 7+
▪ Can safely store values in local stack frame and Return Addr
in callee-saved registers %rbp Old %rbp
▪ Put function arguments at top of stack (Optional)
Saved
▪ Result return in %rax Registers
 Pointers are addresses of values +
Local
▪ On stack or global Variables

Argument
Build
%rsp
147
Machine-Level Programming IV:
Data

148
Machine-Level Programming IV: Data
 Arrays
▪ One-dimensional
▪ Multi-dimensional (nested)
▪ Multi-level
 Structures
▪ Allocation
▪ Access
▪ Alignment
 Floating Point

149
Array Allocation
 Basic Principle
T A[L];
▪ Array of data type T and length L
▪ Contiguously allocated region of L * sizeof(T) bytes in memory

char string[12];

x x + 12

int val[5];

x x+4 x+8 x + 12 x + 16 x + 20

double a[3];

x x+8 x + 16 x + 24

char *p[3];

x x+8 x + 16 x + 24

150
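A quick way to check these layouts — a small sketch assuming a typical x86-64 Linux target where int is 4 bytes and pointers are 8:

#include <stdio.h>

int main(void) {
    char   string[12];
    int    val[5];
    double a[3];
    char  *p[3];
    printf("%zu %zu %zu %zu\n",
           sizeof(string), sizeof(val), sizeof(a), sizeof(p));
    /* expected on x86-64 Linux: 12 20 24 24 */
    return 0;
}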
Array Access
 Basic Principle
T A[L];
▪ Array of data type T and length L
▪ Identifier A can be used as a pointer to array element 0: Type T*

int val[5]; 1 5 2 1 3
x x+4 x+8 x + 12 x + 16 x + 20

 Reference Type Value


val[4] int 3
val int * x
val+1 int * x+4
&val[2] int * x+8
val[5] int ??
*(val+1) int 5
val + i int * x+4i
151
Array Example
#define ZLEN 5
typedef int zip_dig[ZLEN];

zip_dig cmu = { 1, 5, 2, 1, 3 };
zip_dig mit = { 0, 2, 1, 3, 9 };
zip_dig ucb = { 9, 4, 7, 2, 0 };

zip_dig cmu; 1 5 2 1 3
16 20 24 28 32 36
zip_dig mit; 0 2 1 3 9
36 40 44 48 52 56
zip_dig ucb; 9 4 7 2 0
56 60 64 68 72 76

 Declaration “zip_dig cmu” equivalent to “int cmu[5]”


 Example arrays were allocated in successive 20 byte blocks
▪ Not guaranteed to happen in general
152
Array Accessing Example

zip_dig cmu;    1   5   2   1   3
               16  20  24  28  32  36

int get_digit(zip_dig z, int digit)
{
    return z[digit];
}

x86-64
  # %rdi = z
  # %rsi = digit
  movl (%rdi,%rsi,4), %eax   # z[digit]

◼ Register %rdi contains starting address of array
◼ Register %rsi contains array index
◼ Desired digit at %rdi + 4*%rsi
◼ Use memory reference (%rdi,%rsi,4)
153
Array Loop Example
void zincr(zip_dig z) {
size_t i;
for (i = 0; i < ZLEN; i++)
z[i]++;
}

# %rdi = z
movl $0, %eax # i = 0
jmp .L3 # goto middle
.L4: # loop:
addl $1, (%rdi,%rax,4) # z[i]++
addq $1, %rax # i++
.L3: # middle
cmpq $4, %rax # i:4
jbe .L4 # if <=, goto loop
rep; ret

154
Multidimensional (Nested) Arrays
 Declaration A[0][0] • • • A[0][C-1]
T A[R][C];
• •
▪ 2D array of data type T • •
▪ R rows, C columns • •
▪ Type T element requires K bytes
A[R-1][0] • • • A[R-1][C-1]
 Array Size
▪ R * C * K bytes
 Arrangement
▪ Row-Major Ordering

int A[R][C];
A A A A A A
[0] • • • [0] [1] • • • [1] • • • [R-1] • • • [R-1]
[0] [C-1] [0] [C-1] [0] [C-1]

4*R*C Bytes
155
Nested Array Example
#define PCOUNT 4
zip_dig pgh[PCOUNT] =
{{1, 5, 2, 0, 6},
{1, 5, 2, 1, 3 },
{1, 5, 2, 1, 7 },
{1, 5, 2, 2, 1 }};

zip_dig
1 5 2 0 6 1 5 2 1 3 1 5 2 1 7 1 5 2 2 1
pgh[4];

76 96 116 136 156

 “zip_dig pgh[4]” equivalent to “int pgh[4][5]”


▪ Variable pgh: array of 4 elements, allocated contiguously
▪ Each element is an array of 5 int’s, allocated contiguously
 “Row-Major” ordering of all elements in memory
156
Nested Array Row Access
 Row Vectors
▪ A[i] is array of C elements
▪ Each element of type T requires K bytes
▪ Starting address A + i * (C * K)

int A[R][C];

A[0] A[i] A[R-1]

A A A A A A
[0] ••• [0] • • • [i] ••• [i] • • • [R-1] ••• [R-1]
[0] [C-1] [0] [C-1] [0] [C-1]

A A+(i*C*4) A+((R-1)*C*4)

157
Nested Array Row Access Code

1 5 2 0 6 1 5 2 1 3 1 5 2 1 7 1 5 2 2 1

pgh int *get_pgh_zip(int index)


{
return pgh[index];
}
# %rdi = index
leaq (%rdi,%rdi,4),%rax # 5 * index
leaq pgh(,%rax,4),%rax # pgh + (20 * index)

 Row Vector
▪ pgh[index] is array of 5 int’s
▪ Starting address pgh+20*index
 Machine Code
▪ Computes and returns address
▪ Compute as pgh + 4*(index+4*index)
158
Nested Array Element Access
 Array Elements
▪ A[i][j] is element of type T, which requires K bytes
▪ Address A + i * (C * K) + j * K = A + (i * C + j)* K

int A[R][C];

A[0] A[i] A[R-1]

A A A A A
[0] ••• [0] • • • ••• [i] ••• • • • [R-1] ••• [R-1]
[0] [C-1] [j] [0] [C-1]

A A+(i*C*4) A+((R-1)*C*4)

A+(i*C*4)+(j*4)
159
Nested Array Element Access Code

pgh:   1 5 2 0 6   1 5 2 1 3   1 5 2 1 7   1 5 2 2 1

int get_pgh_digit(int index, int dig)
{
    return pgh[index][dig];
}

    leaq (%rdi,%rdi,4), %rax    # 5*index
    addl %rax, %rsi             # 5*index+dig
    movl pgh(,%rsi,4), %eax     # M[pgh + 4*(5*index+dig)]

 Array Elements
▪ pgh[index][dig] is int
▪ Address: pgh + 20*index + 4*dig
▪        = pgh + 4*(5*index + dig)
160
160
Multi-Level Array Example

zip_dig cmu = { 1, 5, 2, 1, 3 };  Variable univ denotes


zip_dig mit = { 0, 2, 1, 3, 9 }; array of 3 elements
zip_dig ucb = { 9, 4, 7, 2, 0 };  Each element is a pointer
#define UCOUNT 3 ▪ 8 bytes
int *univ[UCOUNT] = {mit, cmu, ucb};  Each pointer points to array
of int’s

cmu
1 5 2 1 3
univ
16 20 24 28 32 36
160 36 mit
0 2 1 3 9
168 16
176 56 ucb 36 40 44 48 52 56
9 4 7 2 0
56 60 64 68 72 76

161
Element Access in Multi-Level Array
int get_univ_digit
(size_t index, size_t digit)
{
return univ[index][digit];
}

salq $2, %rsi # 4*digit


addq univ(,%rdi,8), %rsi # p = univ[index] + 4*digit
movl (%rsi), %eax # return *p
ret

 Computation
▪ Element access Mem[Mem[univ+8*index]+4*digit]
▪ Must do two memory reads
▪ First get pointer to row array
▪ Then access element within array
162
Array Element Accesses

Nested array Multi-level array


int get_pgh_digit int get_univ_digit
(size_t index, size_t digit) (size_t index, size_t digit)
{ {
return pgh[index][digit]; return univ[index][digit];
} }

Accesses looks similar in C, but address computations very different:

Mem[pgh+20*index+4*digit] Mem[Mem[univ+8*index]+4*digit]

163
N X N Matrix Code

 Fixed dimensions
▪ Know value of N at compile time

#define N 16
typedef int fix_matrix[N][N];
/* Get element a[i][j] */
int fix_ele(fix_matrix a, size_t i, size_t j)
{
    return a[i][j];
}

 Variable dimensions, explicit indexing
▪ Traditional way to implement dynamic arrays

#define IDX(n, i, j) ((i)*(n)+(j))
/* Get element a[i][j] */
int vec_ele(size_t n, int *a, size_t i, size_t j)
{
    return a[IDX(n,i,j)];
}

 Variable dimensions, implicit indexing
▪ Now supported by gcc

/* Get element a[i][j] */
int var_ele(size_t n, int a[n][n], size_t i, size_t j)
{
    return a[i][j];
}
164
16 X 16 Matrix Access

 Array Elements
▪ Address A + i * (C * K) + j * K
▪ C = 16, K = 4

/* Get element a[i][j] */


int fix_ele(fix_matrix a, size_t i, size_t j) {
return a[i][j];
}

# a in %rdi, i in %rsi, j in %rdx


salq $6, %rsi # 64*i
addq %rsi, %rdi # a + 64*i
movl (%rdi,%rdx,4), %eax # M[a + 64*i + 4*j]
ret

165
n X n Matrix Access
 Array Elements
▪ Address A + i * (C * K) + j * K
▪ C = n, K = 4
▪ Must perform integer multiplication
/* Get element a[i][j] */
int var_ele(size_t n, int a[n][n], size_t i, size_t j)
{
return a[i][j];
}

# n in %rdi, a in %rsi, i in %rdx, j in %rcx


imulq %rdx, %rdi # n*i
leaq (%rsi,%rdi,4), %rax # a + 4*n*i
movl (%rax,%rcx,4), %eax # a + 4*n*i + 4*j
ret

166
Machine-Level Programming IV: Data
 Arrays
▪ One-dimensional
▪ Multi-dimensional (nested)
▪ Multi-level
 Structures
▪ Allocation
▪ Access
▪ Alignment
 Floating Point

167
Structure Representation
r
struct rec {
int a[4];
size_t i; a i next
struct rec *next;
0 16 24 32
};

 Structure represented as block of memory


▪ Big enough to hold all of the fields
 Fields ordered according to declaration
▪ Even if another ordering could yield a more compact
representation
 Compiler determines overall size + positions of fields
▪ Machine-level program has no understanding of the structures
in the source code

168
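To see the compiler-chosen layout directly, a small sketch (the printed offsets should match the figure above under the usual x86-64 alignment rules):

#include <stdio.h>
#include <stddef.h>

struct rec {
    int a[4];
    size_t i;
    struct rec *next;
};

int main(void) {
    printf("a: %zu  i: %zu  next: %zu  size: %zu\n",
           offsetof(struct rec, a), offsetof(struct rec, i),
           offsetof(struct rec, next), sizeof(struct rec));
    /* expected: a: 0  i: 16  next: 24  size: 32 */
    return 0;
}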
Generating Pointer to Structure Member

struct rec {
    int a[4];
    size_t i;
    struct rec *next;
};
Layout:  a at offset 0, i at 16, next at 24 (size 32); element address is r+4*idx

 Generating Pointer to Array Element
▪ Offset of each structure member determined at compile time
▪ Compute as r + 4*idx

int *get_ap(struct rec *r, size_t idx)
{
    return &r->a[idx];
}

# r in %rdi, idx in %rsi
    leaq (%rdi,%rsi,4), %rax
    ret
169
Following Linked List

struct rec {
    int a[4];
    int i;
    struct rec *next;
};
Layout:  a at offset 0, i at 16, next at 24 (size 32)

 C Code

void set_val(struct rec *r, int val)
{
    while (r) {
        int i = r->i;
        r->a[i] = val;
        r = r->next;
    }
}

  Register   Value
  %rdi       r
  %rsi       val

.L11:                              # loop:
    movslq 16(%rdi), %rax          # i = M[r+16]
    movl   %esi, (%rdi,%rax,4)     # M[r+4*i] = val
    movq   24(%rdi), %rdi          # r = M[r+24]
    testq  %rdi, %rdi              # Test r
    jne    .L11                    # if !=0 goto loop
170
Structures & Alignment

struct S1 {
    char c;
    int i[2];
    double v;
} *p;

 Unaligned Data
    c | i[0] | i[1] | v
    p  p+1    p+5    p+9 ... p+17

 Aligned Data
▪ Primitive data type requires K bytes
▪ Address must be multiple of K

    c | 3 bytes pad | i[0] | i[1] | 4 bytes pad | v
   p+0              p+4    p+8                  p+16         p+24
   (i[] starts at a multiple of 4; v starts at a multiple of 8;
    the structure itself starts and ends at a multiple of 8)
171
Alignment Principles
 Aligned Data
▪ Primitive data type requires K bytes
▪ Address must be multiple of K
▪ Required on some machines; advised on x86-64
 Motivation for Aligning Data
▪ Memory accessed by (aligned) chunks of 4 or 8 bytes (system dependent)
▪ Inefficient to load or store datum that spans quad word boundaries
▪ Virtual memory trickier when datum spans 2 pages
 Compiler
▪ Inserts gaps in structure to ensure correct alignment of fields

172
Specific Cases of Alignment (x86-64)

 1 byte: char, …
▪ no restrictions on address
 2 bytes: short, …
▪ lowest 1 bit of address must be 02
 4 bytes: int, float, …
▪ lowest 2 bits of address must be 002
 8 bytes: double, long, char *, …
▪ lowest 3 bits of address must be 0002
 16 bytes: long double (GCC on Linux)
▪ lowest 4 bits of address must be 00002

173
Satisfying Alignment with Structures

 Within structure:
▪ Must satisfy each element’s alignment requirement
 Overall structure placement
▪ Each structure has alignment requirement K
▪ K = Largest alignment of any element
▪ Initial address & structure length must be multiples of K
 Example:
▪ K = 8, due to double element

struct S1 {
    char c;
    int i[2];
    double v;
} *p;

Layout: c at p+0, 3-byte gap, i[0] at p+4, i[1] at p+8, 4-byte gap, v at p+16, end at p+24
p+0, p+16, p+24 multiples of 8; p+4 multiple of 4
174
Meeting Overall Alignment Requirement

 For largest alignment requirement K


 Overall structure must be multiple of K

struct S2 {
    double v;
    int i[2];
    char c;
} *p;

Layout: v at p+0, i[0] at p+8, i[1] at p+12, c at p+16, then 7 bytes of trailing padding; total size 24, a multiple of K=8

175
Arrays of Structures
struct S2 {
    double v;
    int i[2];
    char c;
} a[10];

 Overall structure length multiple of K
 Satisfy alignment requirement for every element

Array layout: a[0] at a+0, a[1] at a+24, a[2] at a+48, a[3] at a+72, ... (element size 24)
Within a[1]: v at a+24, i[0] at a+32, i[1] at a+36, c at a+40, 7 bytes padding up to a+48
176
Accessing Array Elements
struct S3 {
    short i;
    float v;
    short j;
} a[10];

 Compute array offset 12*idx
▪ sizeof(S3), including alignment spacers
 Element j is at offset 8 within structure
 Assembler gives offset a+8
▪ Resolved during linking

Array layout: a[0] at a+0, a[1] at a+12, ..., a[idx] at a+12*idx
Within a[idx]: i (2 bytes + 2 padding), v, j (2 bytes + 2 padding); j at a+12*idx+8

short get_j(int idx)
{
    return a[idx].j;
}

# %rdi = idx
leaq (%rdi,%rdi,2),%rax    # 3*idx
movzwl a+8(,%rax,4),%eax   # a + 12*idx + 8
177
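The 12-byte element size with internal padding can be confirmed with a short sketch (illustrative); on x86-64 it should report offsets 0, 4, 8 and sizeof 12.

#include <stdio.h>
#include <stddef.h>

struct S3 {
    short i;
    float v;
    short j;
};

int main(void) {
    /* 2 bytes of padding after i, 2 bytes of padding after j */
    printf("i: %zu  v: %zu  j: %zu  sizeof: %zu\n",
           offsetof(struct S3, i), offsetof(struct S3, v),
           offsetof(struct S3, j), sizeof(struct S3));
    return 0;
}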
Saving Space
 Put large data types first

struct S4 {
    char c;
    int i;
    char d;
} *p;

struct S5 {
    int i;
    char c;
    char d;
} *p;

 Effect (K=4)
▪ S4 layout: c, 3 bytes padding, i, d, 3 bytes padding (12 bytes)
▪ S5 layout: i, c, d, 2 bytes padding (8 bytes)

178
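A sketch comparing the two orderings (illustrative); with K = 4 for int on x86-64, sizeof(struct S4) should be 12 while sizeof(struct S5) is 8.

#include <stdio.h>

struct S4 { char c; int i; char d; };   /* c, 3-byte gap, i, d, 3-byte gap */
struct S5 { int i; char c; char d; };   /* i, c, d, 2-byte gap */

int main(void) {
    printf("S4: %zu bytes, S5: %zu bytes\n",
           sizeof(struct S4), sizeof(struct S5));
    return 0;
}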
Machine-Level Programming IV: Data
 Arrays
▪ One-dimensional
▪ Multi-dimensional (nested)
▪ Multi-level
 Structures
▪ Allocation
▪ Access
▪ Alignment
 Floating Point

179
Background
 History
▪ x87 FP
▪ Legacy, very ugly
▪ SSE FP
▪ Supported by Shark machines
▪ Special case use of vector instructions
▪ AVX FP
▪ Newest version
▪ Similar to SSE
▪ Documented in book

180
Programming with SSE3
XMM Registers
◼ 16 total, each 16 bytes
◼ 16 single-byte integers

◼ 8 16-bit integers

◼ 4 32-bit integers

◼ 4 single-precision floats

◼ 2 double-precision floats

◼ 1 single-precision float

◼ 1 double-precision float

181
Scalar & SIMD Operations
◼ Scalar Operations: Single Precision addss %xmm0,%xmm1
%xmm0

+
%xmm1
◼ SIMD Operations: Single Precision addps %xmm0,%xmm1
%xmm0

+ + + +
%xmm1
◼ Scalar Operations: Double Precision
addsd %xmm0,%xmm1
%xmm0

+
%xmm1 182
FP Basics

 Arguments passed in %xmm0, %xmm1, ...


 Result returned in %xmm0
 All XMM registers caller-saved
float fadd(float x, float y) double dadd(double x, double y)
{ {
return x + y; return x + y;
} }

# x in %xmm0, y in %xmm1 # x in %xmm0, y in %xmm1


addss %xmm1, %xmm0 addsd %xmm1, %xmm0
ret ret

183
FP Memory Referencing

 Integer (and pointer) arguments passed in regular registers


 FP values passed in XMM registers
 Different mov instructions to move between XMM registers,
and between memory and XMM registers

double dincr(double *p, double v)


{
double x = *p;
*p = x + v;
return x;
}

# p in %rdi, v in %xmm0
movapd %xmm0, %xmm1 # Copy v
movsd (%rdi), %xmm0 # x = *p
addsd %xmm0, %xmm1 # t = x + v
movsd %xmm1, (%rdi) # *p = t
ret 184
Other Aspects of FP Code

 Lots of instructions
▪ Different operations, different formats, ...
 Floating-point comparisons
▪ Instructions ucomiss and ucomisd
▪ Set condition codes CF, ZF, and PF
 Using constant values
▪ Set XMM0 register to 0 with instruction xorpd %xmm0, %xmm0
▪ Others loaded from memory

185
Summary
 Arrays
▪ Elements packed into contiguous region of memory
▪ Use index arithmetic to locate individual elements
 Structures
▪ Elements packed into single region of memory
▪ Access using offsets determined by compiler
▪ Possibly require internal and external padding to ensure alignment
 Combinations
▪ Can nest structure and array code arbitrarily
 Floating Point
▪ Data held and operated on in XMM registers

186
Understanding Pointers & Arrays #1

Decl An *An
Cmp Bad Size Cmp Bad Size
int A1[3]
int *A2

 Cmp: Compiles (Y/N)


 Bad: Possible bad pointer reference (Y/N)
 Size: Value returned by sizeof

187
Understanding Pointers & Arrays #1

Decl An *An
Cmp Bad Size Cmp Bad Size
int A1[3] Y N 12 Y N 4
int *A2 Y N 8 Y Y 4

A1 Allocated pointer
Unallocated pointer
A2
Allocated int
Unallocated int

 Cmp: Compiles (Y/N)


 Bad: Possible bad pointer reference (Y/N)
 Size: Value returned by sizeof

188
Understanding Pointers & Arrays #2

Decl An *An **An


Cmp Bad Size Cmp Bad Size Cmp Bad Size
int A1[3]
int *A2[3]
int (*A3)[3]
int (*A4[3])

 Cmp: Compiles (Y/N)


 Bad: Possible bad pointer reference (Y/N)
 Size: Value returned by sizeof

189
Understanding Pointers & Arrays #2
Decl An *An **An
Cmp Bad Size Cmp Bad Size Cmp Bad Size
int A1[3]       Y N 12    Y N 4     N - -
int *A2[3]      Y N 24    Y N 8     Y Y 4
int (*A3)[3]    Y N 8     Y Y 12    Y Y 4
int (*A4[3])    Y N 24    Y N 8     Y Y 4

A1

A2/A4

A3

Allocated pointer
Unallocated pointer
Allocated int
Unallocated int 190
Understanding Pointers & Arrays #3

Decl An *An **An


Cmp Bad Size   Cmp Bad Size   Cmp Bad Size
int A1[3][5]
int *A2[3][5]
int (*A3)[3][5]
int *(A4[3][5])
int (*A5[3])[5]
Decl ***An
                   Cmp Bad Size
int A1[3][5]
int *A2[3][5]
int (*A3)[3][5]
int *(A4[3][5])
int (*A5[3])[5]

 Cmp: Compiles (Y/N)
 Bad: Possible bad pointer reference (Y/N)
 Size: Value returned by sizeof
191
Declarations:
int A1[3][5]
int *A2[3][5]
int (*A3)[3][5]
int *(A4[3][5])
int (*A5[3])[5]

(Figure: box diagrams for A1, A2/A4, A3, and A5, using the legend: allocated pointer, allocated pointer to unallocated int, unallocated pointer, allocated int, unallocated int)

192
Understanding Pointers & Arrays #3
Decl An *An **An
Cmp Bad Size   Cmp Bad Size   Cmp Bad Size
int A1[3][5] Y N 60 Y N 20 Y N 4
int *A2[3][5] Y N 120 Y N 40 Y N 8
int (*A3)[3][5] Y N 8 Y Y 60 Y Y 20
int *(A4[3][5]) Y N 120 Y N 40 Y N 8
int (*A5[3])[5] Y N 24 Y N 8 Y Y 20

Decl               ***An
                   Cmp Bad Size
int A1[3][5]       N   -   -
int *A2[3][5]      Y   Y   4
int (*A3)[3][5]    Y   Y   4
int *(A4[3][5])    Y   Y   4
int (*A5[3])[5]    Y   Y   4

 Cmp: Compiles (Y/N)
 Bad: Possible bad pointer reference (Y/N)
 Size: Value returned by sizeof
193
Machine-Level Programming V:
Advanced Topics

194
Machine-Level Programming V: Advance
 Memory Layout
 Buffer Overflow
▪ Vulnerability
▪ Protection
 Unions

195
x86-64 Linux Memory Layout not drawn to scale

 Stack 00007FFFFFFFFFFF
Stack
▪ Runtime stack (8MB limit)
8MB
▪ E. g., local variables
 Heap
▪ Dynamically allocated as needed
▪ When call malloc(), calloc(), new()
 Data
▪ Statically allocated data Shared
▪ E.g., global vars, static vars, string constants Libraries
 Text / Shared Libraries
▪ Executable machine instructions
▪ Read-only Heap
Data
Text
Hex Address 400000
000000 196
Memory Allocation Example not drawn to scale

Stack
char big_array[1L<<24]; /* 16 MB */
char huge_array[1L<<31]; /* 2 GB */

int global = 0;

int useless() { return 0; }

int main ()
{ Shared
void *p1, *p2, *p3, *p4; Libraries
int local = 0;
p1 = malloc(1L << 28); /* 256 MB */
p2 = malloc(1L << 8); /* 256 B */
p3 = malloc(1L << 32); /* 4 GB */
p4 = malloc(1L << 8); /* 256 B */ Heap
/* Some print statements ... */ Data
} Text
Where does everything go? 197
x86-64 Example Addresses not drawn to scale
00007F
Stack
address range ~2^47
Heap
local 0x00007ffe4d3be87c
p1 0x00007f7262a1e010
p3 0x00007f7162a1d010
p4 0x000000008359d120
p2 0x000000008359d010
big_array 0x0000000080601060
huge_array 0x0000000000601060
main() 0x000000000040060c
useless() 0x0000000000400590

Heap

Data
Text
000000
198
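To get a similar picture on your own machine, the sketch below (assumes Linux on x86-64; exact addresses will differ between runs because of address randomization) prints where a few objects land.

#include <stdio.h>
#include <stdlib.h>

int global_var = 1;                 /* Data segment */

int main(void) {
    int local_var = 2;              /* Stack */
    void *p = malloc(1 << 20);      /* Heap (1 MB) */
    /* casting a function pointer to void* is a common, if non-portable, trick */
    printf("text  (main)    %p\n", (void *) main);
    printf("data  (global)  %p\n", (void *) &global_var);
    printf("heap  (malloc)  %p\n", p);
    printf("stack (local)   %p\n", (void *) &local_var);
    free(p);
    return 0;
}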
Machine-Level Programming V: Advance
 Memory Layout
 Buffer Overflow
▪ Vulnerability
▪ Protection
 Unions

199
Recall: Memory Referencing Bug Example


typedef struct {
int a[2];
double d;
} struct_t;

double fun(int i) {
volatile struct_t s;
s.d = 3.14;
s.a[i] = 1073741824; /* Possibly out of bounds */
return s.d;
}

fun(0)  3.14
fun(1)  3.14
fun(2)  3.1399998664856
fun(3)  2.00000061035156
fun(4)  3.14
fun(6)  Segmentation fault

▪ Result is system specific


200
Memory Referencing Bug Example

typedef struct { fun(0)  3.14


int a[2]; fun(1)  3.14
double d; fun(2)  3.1399998664856
} struct_t;
fun(3)  2.00000061035156
fun(4)  3.14
fun(6)  Segmentation fault

Explanation:
Critical State 6
? 5
? 4
d7 ... d4 3 Location accessed by
fun(i)
d3 ... d0 2
struct_t
a[1] 1
a[0] 0

201
Such problems are a BIG deal
 Generally called a “buffer overflow”
▪ when exceeding the memory size allocated for an array
 Why a big deal?
▪ It’s the #1 technical cause of security vulnerabilities
▪ #1 overall cause is social engineering / user ignorance
 Most common form
▪ Unchecked lengths on string inputs
▪ Particularly for bounded character arrays on the stack
▪ sometimes referred to as stack smashing

202
String Library Code
 Implementation of Unix function gets()
/* Get string from stdin */
char *gets(char *dest)
{
int c = getchar();
char *p = dest;
while (c != EOF && c != '\n') {
*p++ = c;
c = getchar();
}
*p = '\0';
return dest;
}

▪ No way to specify limit on number of characters to read


 Similar problems with other library functions
▪ strcpy, strcat: Copy strings of arbitrary length
▪ scanf, fscanf, sscanf, when given %s conversion specification
203
Vulnerable Buffer Code
/* Echo Line */
void echo()
{
char buf[4]; /* Way too small! */
gets(buf);
puts(buf);
}
(btw, how big is big enough?)

void call_echo() {
echo();
}

unix>./bufdemo-nsp
Type a string:012345678901234567890123
012345678901234567890123

unix>./bufdemo-nsp
Type a string:0123456789012345678901234
Segmentation Fault

204
Buffer Overflow Disassembly
echo:
00000000004006cf <echo>:
4006cf: 48 83 ec 18 sub $0x18,%rsp
4006d3: 48 89 e7 mov %rsp,%rdi
4006d6: e8 a5 ff ff ff callq 400680 <gets>
4006db: 48 89 e7 mov %rsp,%rdi
4006de: e8 3d fe ff ff callq 400520 <puts@plt>
4006e3: 48 83 c4 18 add $0x18,%rsp
4006e7: c3 retq

call_echo:
4006e8: 48 83 ec 08 sub $0x8,%rsp
4006ec: b8 00 00 00 00 mov $0x0,%eax
4006f1: e8 d9 ff ff ff callq 4006cf <echo>
4006f6: 48 83 c4 08 add $0x8,%rsp
4006fa: c3 retq

205
Buffer Overflow Stack
Before call to gets

Stack layout:
  Stack Frame for call_echo
  Return Address (8 bytes)
  20 bytes unused
  buf: [3] [2] [1] [0]   <-- %rsp

/* Echo Line */
void echo()
{
    char buf[4]; /* Way too small! */
    gets(buf);
    puts(buf);
}

echo:
    subq $24, %rsp
    movq %rsp, %rdi
    call gets
    . . .
206
Buffer Overflow Stack Example
Before call to gets

Stack layout:
  Stack Frame for call_echo
  Return Address (8 bytes): 00 00 00 00 00 40 06 f6
  20 bytes unused
  buf: [3] [2] [1] [0]   <-- %rsp

void echo()
{
    char buf[4];
    gets(buf);
    . . .
}

echo:
    subq $24, %rsp
    movq %rsp, %rdi
    call gets
    . . .

call_echo:
    . . .
    4006f1: callq 4006cf <echo>
    4006f6: add $0x8,%rsp
    . . .

207
Buffer Overflow Stack Example #1
After call to gets

Stack layout (input "01234567890123456789012": 23 chars + terminating '\0' = 24 bytes written):
  Stack Frame for call_echo
  Return Address (8 bytes): 00 00 00 00 00 40 06 f6   (unchanged)
  20 bytes unused, now holding: 00 32 31 30 / 39 38 37 36 / 35 34 33 32 / 31 30 39 38 / 37 36 35 34
  buf: 33 32 31 30   <-- %rsp

(echo and call_echo code as on the previous slide)

unix>./bufdemo-nsp
Type a string:01234567890123456789012
01234567890123456789012

Overflowed buffer, but did not corrupt state


208
Buffer Overflow Stack Example #2
After call to gets

Stack layout (input "0123456789012345678901234": 25 chars + terminating '\0' = 26 bytes written):
  Stack Frame for call_echo
  Return Address (8 bytes): 00 00 00 00 00 40 00 34   (two low bytes corrupted)
  20 bytes unused, now holding: 33 32 31 30 / 39 38 37 36 / 35 34 33 32 / 31 30 39 38 / 37 36 35 34
  buf: 33 32 31 30   <-- %rsp

(echo and call_echo code as on the previous slides)

unix>./bufdemo-nsp
Type a string:0123456789012345678901234
Segmentation Fault

Overflowed buffer and corrupted return pointer


209
Buffer Overflow Stack Example #3
After call to gets

Stack layout (input "012345678901234567890123": 24 chars + terminating '\0' = 25 bytes written):
  Stack Frame for call_echo
  Return Address (8 bytes): 00 00 00 00 00 40 06 00   (low byte overwritten by the '\0')
  20 bytes unused, now holding: 33 32 31 30 / 39 38 37 36 / 35 34 33 32 / 31 30 39 38 / 37 36 35 34
  buf: 33 32 31 30   <-- %rsp

(echo and call_echo code as on the previous slides)

unix>./bufdemo-nsp
Type a string:012345678901234567890123
012345678901234567890123

Overflowed buffer, corrupted return pointer, but program seems to work!


210
Buffer Overflow Stack Example #3 Explained
After call to gets

Stack layout: as in Example #3, with the corrupted return address 00 00 00 00 00 40 06 00

register_tm_clones:
    . . .
    400600: mov %rsp,%rbp
    400603: mov %rax,%rdx
    400606: shr $0x3f,%rdx
    40060a: add %rdx,%rax
    40060d: sar %rax
    400610: jne 400614
    400612: pop %rbp
    400613: retq

“Returns” to unrelated code


Lots of things happen, without modifying critical state
Eventually executes retq back to main

211
Code Injection Attacks
Stack after call to gets():
  P stack frame
  return address A  (overwritten with B)
  data written by gets(): pad + exploit code (buffer B starts here)
  Q stack frame

void P(){
    Q();
    ...
}

int Q() {
    char buf[64];
    gets(buf);
    ...
    return ...;
}

 Input string contains byte representation of executable code


 Overwrite return address A with address of buffer B
 When Q executes ret, will jump to exploit code
212
Exploits Based on Buffer Overflows
 Buffer overflow bugs can allow remote machines to execute
arbitrary code on victim machines
 Distressingly common in real programs
▪ Programmers keep making the same mistakes 
▪ Recent measures make these attacks much more difficult
 Examples across the decades
▪ Original “Internet worm” (1988)
▪ “IM wars” (1999)
▪ Twilight hack on Wii (2000s)
▪ … and many, many more
 You will learn some of the tricks in attacklab
▪ Hopefully to convince you to never leave such holes in your programs!!

213
Example: the original Internet worm (1988)
 Exploited a few vulnerabilities to spread
▪ Early versions of the finger server (fingerd) used gets() to read the
argument sent by the client:
▪ finger [email protected]
▪ Worm attacked fingerd server by sending phony argument:
▪ finger “exploit-code  padding  new-return-address”
▪ exploit code: executed a root shell on the victim machine with a
direct TCP connection to the attacker.
 Once on a machine, scanned for other machines to attack
▪ invaded ~6000 computers in hours (10% of the Internet ☺ )
▪ see June 1989 article in Comm. of the ACM
▪ the young author of the worm was prosecuted…
▪ and CERT was formed… still homed at CMU

214
Example 2: IM War
 July, 1999
▪ Microsoft launches MSN Messenger (instant messaging system).
▪ Messenger clients can access popular AOL Instant Messaging Service
(AIM) servers

(Diagram: several AIM clients and one MSN Messenger client all connect to the AIM server; the MSN client also connects to the MSN server)

215
IM War (cont.)
 August 1999
▪ Mysteriously, Messenger clients can no longer access AIM servers
▪ Microsoft and AOL begin the IM war:
▪AOL changes server to disallow Messenger clients
▪ Microsoft makes changes to clients to defeat AOL changes
▪ At least 13 such skirmishes
▪ What was really happening?
▪ AOL had discovered a buffer overflow bug in their own AIM clients
▪ They exploited it to detect and block Microsoft: the exploit code
returned a 4-byte signature (the bytes at some location in the AIM
client) to server
▪ When Microsoft changed code to match signature, AOL changed
signature location

216
Date: Wed, 11 Aug 1999 11:30:57 -0700 (PDT)
From: Phil Bucking <[email protected]>
Subject: AOL exploiting buffer overrun bug in their own software!
To: [email protected]

Mr. Smith,

I am writing you because I have discovered something that I think you


might find interesting because you are an Internet security expert with
experience in this area. I have also tried to contact AOL but received
no response.

I am a developer who has been working on a revolutionary new instant


messaging client that should be released later this year.
...
It appears that the AIM client has a buffer overrun bug. By itself
this might not be the end of the world, as MS surely has had its share.
But AOL is now *exploiting their own buffer overrun bug* to help in
its efforts to block MS Instant Messenger.
....
Since you have significant credibility with the press I hope that you
can use this information to help inform people that behind AOL's
friendly exterior they are nefariously compromising peoples' security.

Sincerely,
Phil Bucking It was later determined that this
Founder, Bucking Consulting
[email protected]
email originated from within
Microsoft!
217
Aside: Worms and Viruses
 Worm: A program that
▪ Can run by itself
▪ Can propagate a fully working version of itself to other computers

 Virus: Code that


▪ Adds itself to other programs
▪ Does not run independently

 Both are (usually) designed to spread among computers and to


wreak havoc

218
OK, what to do about buffer overflow attacks
 Avoid overflow vulnerabilities

 Employ system-level protections

 Have compiler use “stack canaries”

 Lets talk about each…

219
1. Avoid Overflow Vulnerabilities in Code (!)

/* Echo Line */
void echo()
{
char buf[4]; /* Way too small! */
fgets(buf, 4, stdin);
puts(buf);
}

 For example, use library routines that limit string lengths


▪ fgets instead of gets
▪ strncpy instead of strcpy
▪ Don’t use scanf with %s conversion specification
▪ Use fgets to read the string
▪ Or use %ns where n is a suitable integer

220
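A hedged sketch of a safer echo (illustrative, not the version used in the course labs): fgets never writes more than the buffer size, and the newline is stripped by hand. The buffer size 64 is an arbitrary choice.

#include <stdio.h>
#include <string.h>

#define BUFSIZE 64

void echo_safe(void)
{
    char buf[BUFSIZE];
    if (fgets(buf, sizeof(buf), stdin) == NULL)  /* reads at most BUFSIZE-1 chars */
        return;
    buf[strcspn(buf, "\n")] = '\0';              /* drop trailing newline, if any */
    puts(buf);
}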
2. System-Level Protections can help

 Randomized stack offsets
▪ At start of program, allocate random amount of space on stack
▪ Shifts stack addresses for entire program
▪ Makes it difficult for hacker to predict beginning of inserted code
▪ E.g.: 5 executions of memory allocation code, printing the address of the same local variable:
  0x7ffe4d3be87c  0x7fff75a4f9fc  0x7ffeadb7c80c  0x7ffeaea2fdac  0x7ffcd452017c
▪ Stack repositioned each time program executes

(Figure: stack base with a random allocation above main and the application code; the attacker cannot predict the address B of the pad + exploit code)

221
2. System-Level Protections can help
 Nonexecutable code segments
▪ In traditional x86, can mark region of memory as either “read-only” or “writeable”
▪ Can execute anything readable
▪ x86-64 added explicit “execute” permission
▪ Stack marked as non-executable

(Figure: same stack-after-gets() picture as before, with return address A overwritten by B and the injected exploit code sitting in the Q stack frame)

Any attempt to execute this code will fail

222
3. Stack Canaries can help
 Idea
▪ Place special value (“canary”) on stack just beyond buffer
▪ Check for corruption before exiting function
 GCC Implementation
▪ -fstack-protector
▪ Now the default (disabled earlier)

unix>./bufdemo-sp
Type a string:0123456
0123456

unix>./bufdemo-sp
Type a string:01234567
*** stack smashing detected ***

223
Protected Buffer Disassembly
echo:
40072f: sub $0x18,%rsp
400733: mov %fs:0x28,%rax
40073c: mov %rax,0x8(%rsp)
400741: xor %eax,%eax
400743: mov %rsp,%rdi
400746: callq 4006e0 <gets>
40074b: mov %rsp,%rdi
40074e: callq 400570 <puts@plt>
400753: mov 0x8(%rsp),%rax
400758: xor %fs:0x28,%rax
400761: je 400768 <echo+0x39>
400763: callq 400580 <__stack_chk_fail@plt>
400768: add $0x18,%rsp
40076c: retq

224
Setting Up Canary
Before call to gets

Stack layout:
  Stack Frame for call_echo
  Return Address (8 bytes)
  20 bytes unused
  Canary (8 bytes)
  buf: [3] [2] [1] [0]   <-- %rsp

/* Echo Line */
void echo()
{
    char buf[4]; /* Way too small! */
    gets(buf);
    puts(buf);
}

echo:
    . . .
    movq %fs:40, %rax    # Get canary
    movq %rax, 8(%rsp)   # Place on stack
    xorl %eax, %eax      # Erase canary
    . . .
225
Checking Canary
After call to gets (Input: 0123456)

Stack layout:
  Stack Frame for call_echo
  Return Address (8 bytes)
  20 bytes unused
  Canary (8 bytes)
  bytes written by gets: buf holds 33 32 31 30, the next 4 bytes hold 34 35 36 00
  <-- %rsp points at buf

/* Echo Line */
void echo()
{
    char buf[4]; /* Way too small! */
    gets(buf);
    puts(buf);
}

echo:
    . . .
    movq 8(%rsp), %rax       # Retrieve from stack
    xorq %fs:40, %rax        # Compare to canary
    je .L6                   # If same, OK
    call __stack_chk_fail    # FAIL
.L6: . . .
226
Return-Oriented Programming Attacks
 Challenge (for hackers)
▪ Stack randomization makes it hard to predict buffer location
▪ Marking stack nonexecutable makes it hard to insert binary code
 Alternative Strategy
▪ Use existing code
▪ E.g., library code from stdlib
▪ String together fragments to achieve overall desired outcome
▪ Does not overcome stack canaries
 Construct program from gadgets
▪ Sequence of instructions ending in ret
▪Encoded by single byte 0xc3
▪ Code positions fixed from run to run
▪ Code is executable

227
Gadget Example #1

long ab_plus_c
(long a, long b, long c)
{
return a*b + c;
}

00000000004004d0 <ab_plus_c>:
4004d0: 48 0f af fe imul %rsi,%rdi
4004d4: 48 8d 04 17 lea (%rdi,%rdx,1),%rax
4004d8: c3 retq

rax ← rdi + rdx


Gadget address = 0x4004d4

 Use tail end of existing functions

228
Gadget Example #2

void setval(unsigned *p) {


*p = 3347663060u;
}

Encodes movq %rax, %rdi


<setval>:
4004d9: c7 07 d4 48 89 c7 movl $0xc78948d4,(%rdi)
4004df: c3 retq

rdi ← rax
Gadget address = 0x4004dc

 Repurpose byte codes

229
ROP Execution

Stack
Gadget n code c3




Gadget 2 code c3

%rsp
Gadget 1 code c3

 Trigger with ret instruction


▪ Will start executing Gadget 1
 Final ret in each gadget will start next one

230
Machine-Level Programming V: Advance
 Memory Layout
 Buffer Overflow
▪ Vulnerability
▪ Protection
 Unions

231
Union Allocation

 Allocate according to largest element


 Can only use one field at a time
union U1 {
    char c;
    int i[2];
    double v;
} *up;

Union layout: c, i[0..1], and v all start at up+0; i[1] at up+4; total size 8 (up+0 to up+8)

struct S1 {
    char c;
    int i[2];
    double v;
} *sp;

Struct layout: c at sp+0, 3-byte gap, i[0] at sp+4, i[1] at sp+8, 4-byte gap, v at sp+16, end at sp+24
232
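A sketch contrasting the two declarations above (illustrative); on x86-64, sizeof(union U1) should be 8 (the largest member) while sizeof(struct S1) is 24.

#include <stdio.h>

union U1 { char c; int i[2]; double v; };
struct S1 { char c; int i[2]; double v; };

int main(void) {
    /* union members all start at offset 0 and overlap */
    printf("union U1:  %zu bytes\n", sizeof(union U1));
    printf("struct S1: %zu bytes\n", sizeof(struct S1));
    return 0;
}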
Using Union to Access Bit Patterns

typedef union {
    float f;
    unsigned u;
} bit_float_t;

(f and u overlap, occupying bytes 0-4)

float bit2float(unsigned u)
{
    bit_float_t arg;
    arg.u = u;
    return arg.f;
}

unsigned float2bit(float f)
{
    bit_float_t arg;
    arg.f = f;
    return arg.u;
}

Same as (float) u ? Same as (unsigned) f ?

233
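The answer to the question above is no: a cast converts the value, while the union reinterprets the bits. A usage sketch (illustrative): the IEEE-754 bit pattern of 1.0f is 0x3f800000, which float2bit recovers, while the cast (unsigned) 1.0f yields 1.

#include <stdio.h>

typedef union { float f; unsigned u; } bit_float_t;

unsigned float2bit(float f) { bit_float_t arg; arg.f = f; return arg.u; }

int main(void) {
    printf("bits of 1.0f:    0x%08x\n", float2bit(1.0f));  /* 0x3f800000 */
    printf("(unsigned) 1.0f: %u\n", (unsigned) 1.0f);      /* 1: value conversion */
    return 0;
}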
Byte Ordering Revisited
 Idea
▪ Short/long/quad words stored in memory as 2/4/8 consecutive bytes
▪ Which byte is most (least) significant?
▪ Can cause problems when exchanging binary data between machines
 Big Endian
▪ Most significant byte has lowest address
▪ Sparc
 Little Endian
▪ Least significant byte has lowest address
▪ Intel x86, ARM Android and IOS
 Bi Endian
▪ Can be configured either way
▪ ARM

234
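A minimal run-time check (illustrative): store a known 32-bit value and inspect the byte at the lowest address.

#include <stdio.h>

int main(void) {
    unsigned int x = 0x01020304;
    unsigned char *p = (unsigned char *) &x;  /* byte with the lowest address */
    if (*p == 0x04)
        printf("little endian\n");   /* least significant byte first (x86-64) */
    else
        printf("big endian\n");      /* most significant byte first (Sparc) */
    return 0;
}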
Byte Ordering Example
union {
unsigned char c[8];
unsigned short s[4];
unsigned int i[2];
unsigned long l[1];
} dw;

32-bit c[0] c[1] c[2] c[3] c[4] c[5] c[6] c[7]


s[0] s[1] s[2] s[3]
i[0] i[1]
l[0]

64-bit c[0] c[1] c[2] c[3] c[4] c[5] c[6] c[7]


s[0] s[1] s[2] s[3]
i[0] i[1]
l[0]
235
Byte Ordering Example (Cont).
int j;
for (j = 0; j < 8; j++)
dw.c[j] = 0xf0 + j;

printf("Characters 0-7 ==
[0x%x,0x%x,0x%x,0x%x,0x%x,0x%x,0x%x,0x%x]\n",
dw.c[0], dw.c[1], dw.c[2], dw.c[3],
dw.c[4], dw.c[5], dw.c[6], dw.c[7]);

printf("Shorts 0-3 == [0x%x,0x%x,0x%x,0x%x]\n",


dw.s[0], dw.s[1], dw.s[2], dw.s[3]);

printf("Ints 0-1 == [0x%x,0x%x]\n",


dw.i[0], dw.i[1]);

printf("Long 0 == [0x%lx]\n",
dw.l[0]);

236
Byte Ordering on IA32
Little Endian

f0 f1 f2 f3 f4 f5 f6 f7
c[0] c[1] c[2] c[3] c[4] c[5] c[6] c[7]
s[0] s[1] s[2] s[3]
i[0] i[1]
l[0]
LSB MSB LSB MSB
Print

Output:
Characters 0-7 == [0xf0,0xf1,0xf2,0xf3,0xf4,0xf5,0xf6,0xf7]
Shorts 0-3 == [0xf1f0,0xf3f2,0xf5f4,0xf7f6]
Ints 0-1 == [0xf3f2f1f0,0xf7f6f5f4]
Long 0 == [0xf3f2f1f0]

237
Byte Ordering on Sun
Big Endian

f0 f1 f2 f3 f4 f5 f6 f7
c[0] c[1] c[2] c[3] c[4] c[5] c[6] c[7]
s[0] s[1] s[2] s[3]
i[0] i[1]
l[0]
MSB LSB MSB LSB
Print

Output on Sun:
Characters 0-7 == [0xf0,0xf1,0xf2,0xf3,0xf4,0xf5,0xf6,0xf7]
Shorts 0-3 == [0xf0f1,0xf2f3,0xf4f5,0xf6f7]
Ints 0-1 == [0xf0f1f2f3,0xf4f5f6f7]
Long 0 == [0xf0f1f2f3]

238
Byte Ordering on x86-64
Little Endian

f0 f1 f2 f3 f4 f5 f6 f7
c[0] c[1] c[2] c[3] c[4] c[5] c[6] c[7]
s[0] s[1] s[2] s[3]
i[0] i[1]
l[0]
LSB MSB
Print

Output on x86-64:
Characters 0-7 == [0xf0,0xf1,0xf2,0xf3,0xf4,0xf5,0xf6,0xf7]
Shorts 0-3 == [0xf1f0,0xf3f2,0xf5f4,0xf7f6]
Ints 0-1 == [0xf3f2f1f0,0xf7f6f5f4]
Long 0 == [0xf7f6f5f4f3f2f1f0]

239
Summary of Compound Types in C
 Arrays
▪ Contiguous allocation of memory
▪ Aligned to satisfy every element’s alignment requirement
▪ Pointer to first element
▪ No bounds checking
 Structures
▪ Allocate bytes in order declared
▪ Pad in middle and at end to satisfy alignment
 Unions
▪ Overlay declarations
▪ Way to circumvent type system

240
Program Optimization

241
Program Optimization
 Overview
 Generally Useful Optimizations
▪ Code motion/precomputation
▪ Strength reduction
▪ Sharing of common subexpressions
▪ Removing unnecessary procedure calls
 Optimization Blockers
▪ Procedure calls
▪ Memory aliasing
 Exploiting Instruction-Level Parallelism
 Dealing with Conditionals

242
Performance Realities
 There’s more to performance than asymptotic complexity

 Constant factors matter too!


▪ Easily see 10:1 performance range depending on how code is written
▪ Must optimize at multiple levels:
▪ algorithm, data representations, procedures, and loops
 Must understand system to optimize performance
▪ How programs are compiled and executed
▪ How modern processors + memory systems operate
▪ How to measure program performance and identify bottlenecks
▪ How to improve performance without destroying code modularity and
generality

243
Optimizing Compilers
 Provide efficient mapping of program to machine
▪ register allocation
▪ code selection and ordering (scheduling)
▪ dead code elimination
▪ eliminating minor inefficiencies
 Don’t (usually) improve asymptotic efficiency
▪ up to programmer to select best overall algorithm
▪ big-O savings are (often) more important than constant factors
▪ but constant factors also matter
 Have difficulty overcoming “optimization blockers”
▪ potential memory aliasing
▪ potential procedure side-effects

244
Limitations of Optimizing Compilers
 Operate under fundamental constraint
▪ Must not cause any change in program behavior
▪ Except, possibly, when the program makes use of nonstandard language features
▪ Often prevents it from making optimizations that would only affect behavior
under pathological conditions.
 Behavior that may be obvious to the programmer can be obfuscated by
languages and coding styles
▪ e.g., Data ranges may be more limited than variable types suggest
 Most analysis is performed only within procedures
▪ Whole-program analysis is too expensive in most cases
▪ Newer versions of GCC do interprocedural analysis within individual files
▪ But, not between code in different files
 Most analysis is based only on static information
▪ Compiler has difficulty anticipating run-time inputs
 When in doubt, the compiler must be conservative
245
Generally Useful Optimizations
 Optimizations that you or the compiler should do regardless
of processor / compiler

 Code Motion
▪ Reduce frequency with which computation performed
▪ If it will always produce same result
▪ Especially moving code out of loop
void set_row(double *a, double *b,
             long i, long n)
{
    long j;
    for (j = 0; j < n; j++)
        a[n*i+j] = b[j];
}

/* after code motion */
long j;
int ni = n*i;
for (j = 0; j < n; j++)
    a[ni+j] = b[j];

246
Compiler-Generated Code Motion (-O1)
void set_row(double *a, double *b,
             long i, long n)
{
    long j;
    for (j = 0; j < n; j++)
        a[n*i+j] = b[j];
}

/* equivalent of the generated code */
long j;
long ni = n*i;
double *rowp = a+ni;
for (j = 0; j < n; j++)
    *rowp++ = b[j];

set_row:
testq %rcx, %rcx # Test n
jle .L1 # If 0, goto done
imulq %rcx, %rdx # ni = n*i
leaq (%rdi,%rdx,8), %rdx # rowp = A + ni*8
movl $0, %eax # j = 0
.L3: # loop:
movsd (%rsi,%rax,8), %xmm0 # t = b[j]
movsd %xmm0, (%rdx,%rax,8) # M[A+ni*8 + j*8] = t
addq $1, %rax # j++
cmpq %rcx, %rax # j:n
jne .L3 # if !=, goto loop
.L1: # done:
rep ; ret

247
Reduction in Strength
▪ Replace costly operation with simpler one
▪ Shift, add instead of multiply or divide
16*x --> x << 4
▪ Utility machine dependent
▪ Depends on cost of multiply or divide instruction
– On Intel Nehalem, integer multiply requires 3 CPU cycles
▪ Recognize sequence of products

for (i = 0; i < n; i++) {
    int ni = n*i;
    for (j = 0; j < n; j++)
        a[ni + j] = b[j];
}

/* after strength reduction */
int ni = 0;
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++)
        a[ni + j] = b[j];
    ni += n;
}

248
Share Common Subexpressions
▪ Reuse portions of expressions
▪ GCC will do this with –O1

/* Sum neighbors of i,j */
up    = val[(i-1)*n + j  ];
down  = val[(i+1)*n + j  ];
left  = val[i*n     + j-1];
right = val[i*n     + j+1];
sum = up + down + left + right;

/* sharing the common subexpression i*n + j */
long inj = i*n + j;
up    = val[inj - n];
down  = val[inj + n];
left  = val[inj - 1];
right = val[inj + 1];
sum = up + down + left + right;

3 multiplications: i*n, (i-1)*n, (i+1)*n

leaq 1(%rsi), %rax    # i+1
leaq -1(%rsi), %r8    # i-1
imulq %rcx, %rsi      # i*n
imulq %rcx, %rax      # (i+1)*n
imulq %rcx, %r8       # (i-1)*n
addq %rdx, %rsi       # i*n+j
addq %rdx, %rax       # (i+1)*n+j
addq %rdx, %r8        # (i-1)*n+j

1 multiplication: i*n

imulq %rcx, %rsi          # i*n
addq %rdx, %rsi           # i*n+j
movq %rsi, %rax           # i*n+j
subq %rcx, %rax           # i*n+j-n
leaq (%rsi,%rcx), %rcx    # i*n+j+n

249
Optimization Blocker #1: Procedure Calls
 Procedure to Convert String to Lower Case

void lower(char *s)


{
size_t i;
for (i = 0; i < strlen(s); i++)
if (s[i] >= 'A' && s[i] <= 'Z')
s[i] -= ('A' - 'a');
}
▪ Extracted from 213 lab submissions, Fall, 1998

250
Lower Case Conversion Performance

▪ Time quadruples when double string length
▪ Quadratic performance

(Plot: CPU seconds (0-250) vs. string length (0-500,000) for lower1; the curve grows quadratically)

251
Convert Loop To Goto Form
void lower(char *s)
{
size_t i = 0;
if (i >= strlen(s))
goto done;
loop:
if (s[i] >= 'A' && s[i] <= 'Z')
s[i] -= ('A' - 'a');
i++;
if (i < strlen(s))
goto loop;
done:
}

▪ strlen executed every iteration

252
Calling Strlen
/* My version of strlen */
size_t strlen(const char *s)
{
size_t length = 0;
while (*s != '\0') {
s++;
length++;
}
return length;
}

 Strlen performance
▪ Only way to determine length of string is to scan its entire length, looking for
null character.
 Overall performance, string of length N
▪ N calls to strlen
▪ Require times N, N-1, N-2, …, 1
▪ Overall O(N2) performance

253
Improving Performance
void lower(char *s)
{
size_t i;
size_t len = strlen(s);
for (i = 0; i < len; i++)
if (s[i] >= 'A' && s[i] <= 'Z')
s[i] -= ('A' - 'a');
}

▪ Move call to strlen outside of loop


▪ Since result does not change from one iteration to another
▪ Form of code motion

254
Lower Case Conversion Performance
▪ Time doubles when double string length
▪ Linear performance of lower2

(Plot: CPU seconds (0-250) vs. string length (0-500,000); lower1 grows quadratically while lower2 stays nearly flat)

255
Optimization Blocker: Procedure Calls
 Why couldn’t compiler move strlen out of inner loop?
▪ Procedure may have side effects
▪ Alters global state each time called
▪ Function may not return same value for given arguments
▪ Depends on other parts of global state
▪ Procedure lower could interact with strlen

 Warning:
▪ Compiler treats procedure call as a black box
▪ Weak optimizations near them
 Remedies:
▪ Use of inline functions
▪ GCC does this with –O1
– Within single file
▪ Do your own code motion

/* Example: a strlen variant with a side effect */
size_t lencnt = 0;
size_t strlen(const char *s)
{
    size_t length = 0;
    while (*s != '\0') {
        s++; length++;
    }
    lencnt += length;
    return length;
}
256
Memory Matters
/* Sum rows of n X n matrix a
and store in vector b */
void sum_rows1(double *a, double *b, long n) {
long i, j;
for (i = 0; i < n; i++) {
b[i] = 0;
for (j = 0; j < n; j++)
b[i] += a[i*n + j];
}
}

# sum_rows1 inner loop


.L4:
movsd (%rsi,%rax,8), %xmm0 # FP load
addsd (%rdi), %xmm0 # FP add
movsd %xmm0, (%rsi,%rax,8) # FP store
addq $8, %rdi
cmpq %rcx, %rdi
jne .L4

▪ Code updates b[i] on every iteration


▪ Why couldn’t compiler optimize this away?
257
Memory Aliasing
/* Sum rows of n X n matrix a
and store in vector b */
void sum_rows1(double *a, double *b, long n) {
long i, j;
for (i = 0; i < n; i++) {
b[i] = 0;
for (j = 0; j < n; j++)
b[i] += a[i*n + j];
}
}

Value of B:
double A[9] = init: [4, 8, 16]
{ 0, 1, 2,
4, 8, 16},
i = 0: [3, 8, 16]
32, 64, 128};

double B[3] = A+3; i = 1: [3, 22, 16]

sum_rows1(A, B, 3); i = 2: [3, 22, 224]

▪ Code updates b[i] on every iteration


▪ Must consider possibility that these updates will affect program
behavior
258
Removing Aliasing
/* Sum rows of n X n matrix a
and store in vector b */
void sum_rows2(double *a, double *b, long n) {
long i, j;
for (i = 0; i < n; i++) {
double val = 0;
for (j = 0; j < n; j++)
val += a[i*n + j];
b[i] = val;
}
}

# sum_rows2 inner loop


.L10:
addsd (%rdi), %xmm0 # FP load + add
addq $8, %rdi
cmpq %rax, %rdi
jne .L10

▪ No need to store intermediate results

259
Optimization Blocker: Memory Aliasing
 Aliasing
▪ Two different memory references specify single location
▪ Easy to have happen in C
▪ Since allowed to do address arithmetic
▪ Direct access to storage structures
▪ Get in habit of introducing local variables
▪ Accumulating within loops
▪ Your way of telling compiler not to check for aliasing

260
Exploiting Instruction-Level Parallelism
 Need general understanding of modern processor design
▪ Hardware can execute multiple instructions in parallel
 Performance limited by data dependencies
 Simple transformations can yield dramatic performance
improvement
▪ Compilers often cannot make these transformations
▪ Lack of associativity and distributivity in floating-point arithmetic

261
Benchmark Example: Data Type for Vectors

/* data structure for vectors */
typedef struct{
    size_t len;
    data_t *data;
} vec;

(Figure: len field followed by data, a pointer to elements 0 .. len-1)

 Data Types
▪ Use different declarations for data_t
▪ int
▪ long
▪ float
▪ double

/* retrieve vector element and store at val */
int get_vec_element(vec *v, size_t idx, data_t *val)
{
    if (idx >= v->len)
        return 0;
    *val = v->data[idx];
    return 1;
}
262
Benchmark Computation
void combine1(vec_ptr v, data_t *dest)
{
    long int i;
    *dest = IDENT;
    for (i = 0; i < vec_length(v); i++) {
        data_t val;
        get_vec_element(v, i, &val);
        *dest = *dest OP val;
    }
}

Compute sum or product of vector elements

 Data Types
▪ Use different declarations for data_t: int, long, float, double
 Operations
▪ Use different definitions of OP and IDENT
▪ + / 0
▪ * / 1
263
Cycles Per Element (CPE)
 Convenient way to express performance of program that operates on
vectors or lists
 Length = n
 In our case: CPE = cycles per OP
 T = CPE*n + Overhead
▪ CPE is slope of line

(Plot: cycles vs. number of elements; the psum1 line has slope 9.0, the psum2 line has slope 6.0)

264
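One way to estimate CPE in practice is sketched below (illustrative only, not the course's measurement harness; psum1 here is a simple prefix sum standing in for any routine whose run time is CPE*n + overhead, and clock()-based timing is coarse, so large n is assumed): time the routine at two lengths and take the slope.

#include <stdio.h>
#include <time.h>

/* simple prefix sum: p[i] = a[0] + ... + a[i] */
void psum1(float a[], float p[], long n) {
    p[0] = a[0];
    for (long i = 1; i < n; i++)
        p[i] = p[i-1] + a[i];
}

static double seconds(void (*f)(float *, float *, long),
                      float *a, float *p, long n) {
    clock_t start = clock();
    f(a, p, n);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

int main(void) {
    enum { N1 = 1000000, N2 = 2000000 };
    static float a[N2], p[N2];          /* zero-initialized test data */
    double t1 = seconds(psum1, a, p, N1);
    double t2 = seconds(psum1, a, p, N2);
    /* slope in seconds per element; multiply by the clock rate (Hz) to get CPE */
    printf("seconds per element: %g\n", (t2 - t1) / (N2 - N1));
    return 0;
}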
Benchmark Performance
void combine1(vec_ptr v, data_t *dest)
{
    long int i;
    *dest = IDENT;
    for (i = 0; i < vec_length(v); i++) {
        data_t val;
        get_vec_element(v, i, &val);
        *dest = *dest OP val;
    }
}

Compute sum or product of vector elements

Method Integer Double FP


Operation Add Mult Add Mult
Combine1 22.68 20.02 19.98 20.18
unoptimized
Combine1 –O1 10.12 10.12 10.17 11.14

265
Basic Optimizations

void combine4(vec_ptr v, data_t *dest)


{
long i;
long length = vec_length(v);
data_t *d = get_vec_start(v);
data_t t = IDENT;
for (i = 0; i < length; i++)
t = t OP d[i];
*dest = t;
}

 Move vec_length out of loop


 Avoid bounds check on each cycle
 Accumulate in temporary

266
Effect of Basic Optimizations

void combine4(vec_ptr v, data_t *dest)


{
long i;
long length = vec_length(v);
data_t *d = get_vec_start(v);
data_t t = IDENT;
for (i = 0; i < length; i++)
t = t OP d[i];
*dest = t;
}

Method Integer Double FP


Operation Add Mult Add Mult
Combine1 –O1 10.12 10.12 10.17 11.14
Combine4 1.27 3.01 3.01 5.01

 Eliminates sources of overhead in loop


267
Modern CPU Design
(Block diagram: the Instruction Control unit -- fetch control, instruction cache, instruction decode, retirement unit, register file -- sends operations to the Execution unit's functional units (Branch, three Arith, Load, Store), which exchange addresses and data with the data cache and return operation results and register updates; "Prediction OK?" feedback flows from Execution back to Instruction Control)
268
Superscalar Processor
 Definition: A superscalar processor can issue and execute
multiple instructions in one cycle. The instructions are retrieved
from a sequential instruction stream and are usually scheduled
dynamically.

 Benefit: without programming effort, superscalar processor can


take advantage of the instruction level parallelism that most
programs have

 Most modern CPUs are superscalar.


 Intel: since Pentium (1993)

269
Pipelined Functional Units

long mult_eg(long a, long b, long c) {
    long p1 = a*b;
    long p2 = a*c;
    long p3 = p1 * p2;
    return p3;
}

Time        1     2     3     4     5     6     7
Stage 1    a*b   a*c               p1*p2
Stage 2          a*b   a*c               p1*p2
Stage 3                a*b   a*c               p1*p2

▪ Divide computation into stages


▪ Pass partial computations from stage to stage
▪ Stage i can start on new computation once values passed to i+1
▪ E.g., complete 3 multiplications in 7 cycles, even though each
requires 3 cycles
270
Haswell CPU
▪ 8 Total Functional Units
 Multiple instructions can execute in parallel
2 load, with address computation
1 store, with address computation
4 integer
2 FP multiply
1 FP add
1 FP divide
 Some instructions take > 1 cycle, but can be pipelined
Instruction Latency Cycles/Issue
Load / Store 4 1
Integer Multiply 3 1
Integer/Long Divide 3-30 3-30
Single/Double FP Multiply 5 1
Single/Double FP Add 3 1
Single/Double FP Divide 3-15 3-15

271
x86-64 Compilation of Combine4
 Inner Loop (Case: Integer Multiply)
.L519: # Loop:
imull (%rax,%rdx,4), %ecx # t = t * d[i]
addq $1, %rdx # i++
cmpq %rdx, %rbp # Compare length:i
jg .L519 # If >, goto Loop

Method Integer Double FP


Operation Add Mult Add Mult
Combine4 1.27 3.01 3.01 5.01
Latency 1.00 3.00 3.00 5.00
Bound

272
Combine4 = Serial Computation (OP = *)
 Computation (length=8)
((((((((1 * d[0]) * d[1]) * d[2]) * d[3]) * d[4]) * d[5]) * d[6]) * d[7])
 Sequential dependence
▪ Performance: determined by latency of OP

(Dataflow figure: a single chain of multiplies, 1 * d0 * d1 * ... * d7, each waiting for the previous result)

273
Loop Unrolling (2x1)
void unroll2a_combine(vec_ptr v, data_t *dest)
{
long length = vec_length(v);
long limit = length-1;
data_t *d = get_vec_start(v);
data_t x = IDENT;
long i;
/* Combine 2 elements at a time */
for (i = 0; i < limit; i+=2) {
x = (x OP d[i]) OP d[i+1];
}
/* Finish any remaining elements */
for (; i < length; i++) {
x = x OP d[i];
}
*dest = x;
}

 Perform 2x more useful work per iteration


274
Effect of Loop Unrolling

Method Integer Double FP


Operation Add Mult Add Mult
Combine4 1.27 3.01 3.01 5.01
Unroll 2x1 1.01 3.01 3.01 5.01
Latency 1.00 3.00 3.00 5.00
Bound

 Helps integer add


x = (x OP d[i]) OP d[i+1];
▪ Achieves latency bound
 Others don’t improve. Why?
▪ Still sequential dependency

275
Loop Unrolling with Reassociation (2x1a)
void unroll2aa_combine(vec_ptr v, data_t *dest)
{
long length = vec_length(v);
long limit = length-1;
data_t *d = get_vec_start(v);
data_t x = IDENT;
long i;
/* Combine 2 elements at a time */
for (i = 0; i < limit; i+=2) {
x = x OP (d[i] OP d[i+1]);
}
/* Finish any remaining elements */
for (; i < length; i++) {
x = x OP d[i];
}
*dest = x;
}

Compare to before:  x = (x OP d[i]) OP d[i+1];

 Can this change the result of the computation?


 Yes, for FP. Why?
276
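The reason can be seen in a two-line sketch (illustrative): floating-point addition is not associative, so regrouping can change the rounded result.

#include <stdio.h>

int main(void) {
    double big = 1e20;
    /* (big + -big) + 3.14 = 3.14, but big + (-big + 3.14) = 0.0:
       the 3.14 is lost when added to 1e20 before the cancellation */
    printf("%g\n", (big + -big) + 3.14);
    printf("%g\n",  big + (-big + 3.14));
    return 0;
}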
Effect of Reassociation
Method Integer Double FP
Operation Add Mult Add Mult
Combine4 1.27 3.01 3.01 5.01
Unroll 2x1 1.01 3.01 3.01 5.01
Unroll 2x1a 1.01 1.51 1.51 2.51
Latency 1.00 3.00 3.00 5.00
Bound
Throughput 0.50 1.00 1.00 0.50
Bound

 Nearly 2x speedup for Int *, FP +, FP *
▪ Reason: Breaks sequential dependency
x = x OP (d[i] OP d[i+1]);
▪ Why is that? (next slide)

(Functional-unit counts: 2 func. units for FP * and 2 for load; 4 func. units for int + and 2 for load)
277
Reassociated Computation

x = x OP (d[i] OP d[i+1]);

 What changed:
▪ Ops in the next iteration can be started early (no dependency)

 Overall Performance
▪ N elements, D cycles latency/op
▪ (N/2+1)*D cycles: CPE = D/2

(Dataflow figure: the pairwise products d0*d1, d2*d3, d4*d5, d6*d7 run off the critical path; only the chain that folds them into the accumulator remains sequential)

278
Loop Unrolling with Separate Accumulators
(2x2) void unroll2a_combine(vec_ptr v, data_t *dest)
{
long length = vec_length(v);
long limit = length-1;
data_t *d = get_vec_start(v);
data_t x0 = IDENT;
data_t x1 = IDENT;
long i;
/* Combine 2 elements at a time */
for (i = 0; i < limit; i+=2) {
x0 = x0 OP d[i];
x1 = x1 OP d[i+1];
}
/* Finish any remaining elements */
for (; i < length; i++) {
x0 = x0 OP d[i];
}
*dest = x0 OP x1;
}

 Different form of reassociation


279
Effect of Separate Accumulators
Method Integer Double FP
Operation Add Mult Add Mult
Combine4 1.27 3.01 3.01 5.01
Unroll 2x1 1.01 3.01 3.01 5.01
Unroll 2x1a 1.01 1.51 1.51 2.51
Unroll 2x2 0.81 1.51 1.51 2.51
Latency Bound 1.00 3.00 3.00 5.00
Throughput Bound 0.50 1.00 1.00 0.50

 Int + makes use of two load units


x0 = x0 OP d[i];
x1 = x1 OP d[i+1];

 2x speedup (over unroll2) for Int *, FP +, FP *


280
Separate Accumulators
x0 = x0 OP d[i];
x1 = x1 OP d[i+1];

 What changed:
▪ Two independent “streams” of operations

 Overall Performance
▪ N elements, D cycles latency/op
▪ Should be (N/2+1)*D cycles: CPE = D/2
▪ CPE matches prediction!

(Dataflow figure: two parallel chains, 1*d0*d2*d4*d6 and 1*d1*d3*d5*d7, combined by a final multiply)

What Now?

281
Unrolling & Accumulating
 Idea
▪ Can unroll to any degree L
▪ Can accumulate K results in parallel
▪ L must be multiple of K

 Limitations
▪ Diminishing returns
Cannot go beyond throughput limitations of execution units

▪ Large overhead for short lengths
▪ Finish off iterations sequentially

282
Unrolling & Accumulating: Double *
 Case
▪ Intel Haswell
▪ Double FP Multiplication
▪ Latency bound: 5.00. Throughput bound: 0.50
FP * Unrolling Factor L
K 1 2 3 4 6 8 10 12
1 5.01 5.01 5.01 5.01 5.01 5.01 5.01
Accumulators

2 2.51 2.51 2.51


3 1.67
4 1.25 1.26
6 0.84 0.88
8 0.63
10 0.51
12 0.52

283
Unrolling & Accumulating: Int +
 Case
▪ Intel Haswell
▪ Integer addition
▪ Latency bound: 1.00. Throughput bound: 1.00
Int + Unrolling Factor L
K 1 2 3 4 6 8 10 12
1 1.27 1.01 1.01 1.01 1.01 1.01 1.01
Accumulators

2 0.81 0.69 0.54


3 0.74
4 0.69 1.24
6 0.56 0.56
8 0.54
10 0.54
12 0.56

284
Achievable Performance
Method Integer Double FP
Operation Add Mult Add Mult
Best 0.54 1.01 1.01 0.52
Latency Bound 1.00 3.00 3.00 5.00
Throughput Bound 0.50 1.00 1.00 0.50

 Limited only by throughput of functional units


 Up to 42X improvement over original, unoptimized code

285
Programming with AVX2
YMM Registers
◼ 16 total, each 32 bytes
◼ 32 single-byte integers

◼ 16 16-bit integers

◼ 8 32-bit integers

◼ 8 single-precision floats

◼ 4 double-precision floats

◼ 1 single-precision float

◼ 1 double-precision float
286
SIMD Operations
◼ SIMD Operations: Single Precision
vaddps %ymm0, %ymm1, %ymm1
%ymm0
+ + + + + + + +
%ymm1

◼ SIMD Operations: Double Precision


vaddpd %ymm0, %ymm1, %ymm1
%ymm0
+ + + +
%ymm1

287
Using Vector Instructions
Method Integer Double FP
Operation Add Mult Add Mult
Scalar Best 0.54 1.01 1.01 0.52
Vector Best 0.06 0.24 0.25 0.16
Latency Bound 0.50 3.00 3.00 5.00
Throughput Bound 0.50 1.00 1.00 0.50
Vec Throughput 0.06 0.12 0.25 0.12
Bound

 Make use of AVX Instructions


▪ Parallel operations on multiple data elements
▪ See Web Aside OPT:SIMD on CS:APP web page

288
What About Branches?
 Challenge
▪ Instruction Control Unit must work well ahead of Execution Unit
to generate enough operations to keep EU busy

404663: mov $0x0,%eax


404668: cmp (%rdi),%rsi
Executing
40466b: jge 404685 How to continue?
40466d: mov 0x8(%rdi),%rax

. . .

404685: repz retq

▪ When encounters conditional branch, cannot reliably determine where to


continue fetching

289
Modern CPU Design
(Block diagram, repeated from earlier: the Instruction Control unit -- fetch control, instruction cache, instruction decode, retirement unit, register file -- sends operations to the Execution unit's functional units (Branch, three Arith, Load, Store), which exchange addresses and data with the data cache and return operation results and register updates; "Prediction OK?" feedback flows from Execution back to Instruction Control)
290
Branch Outcomes
▪ When encounter conditional branch, cannot determine where to continue
fetching
▪ Branch Taken: Transfer control to branch target
▪ Branch Not-Taken: Continue with next instruction in sequence
▪ Cannot resolve until outcome determined by branch/integer unit

404663: mov $0x0,%eax


404668: cmp (%rdi),%rsi
40466b: jge 404685
Branch Not-Taken
40466d: mov 0x8(%rdi),%rax

. . . Branch Taken
404685: repz retq

291
Branch Prediction
 Idea
▪ Guess which way branch will go
▪ Begin executing instructions at predicted position
▪ But don’t actually modify register or memory data

404663: mov $0x0,%eax


404668: cmp (%rdi),%rsi
40466b: jge 404685
40466d: mov 0x8(%rdi),%rax Predict Taken
. . .

404685: repz retq Begin


Execution

292
Branch Prediction Through Loop
401029: vmulsd (%rdx),%xmm0,%xmm0 Assume
40102d: add $0x8,%rdx vector length = 100
401031: cmp %rax,%rdx
401034: jne 401029 i = 98
Predict Taken (OK)
401029: vmulsd (%rdx),%xmm0,%xmm0
40102d: add $0x8,%rdx
401031: cmp %rax,%rdx
401034: jne 401029 i = 99
Predict Taken
401029: vmulsd (%rdx),%xmm0,%xmm0
(Oops)
40102d: add $0x8,%rdx
401031: cmp %rax,%rdx Read Executed
401034: jne 401029 i = 100 invalid
location
401029: vmulsd (%rdx),%xmm0,%xmm0
40102d: add $0x8,%rdx Fetched
401031: cmp %rax,%rdx
401034: jne 401029 i = 101
293
Branch Misprediction Invalidation
401029: vmulsd (%rdx),%xmm0,%xmm0 Assume
40102d: add $0x8,%rdx vector length = 100
401031: cmp %rax,%rdx
401034: jne 401029 i = 98
Predict Taken (OK)
401029: vmulsd (%rdx),%xmm0,%xmm0
40102d: add $0x8,%rdx
401031: cmp %rax,%rdx
401034: jne 401029 i = 99
Predict Taken
401029: vmulsd (%rdx),%xmm0,%xmm0
(Oops)
40102d: add $0x8,%rdx
401031: cmp %rax,%rdx
401034: jne 401029 i = 100
Invalidate
401029: vmulsd (%rdx),%xmm0,%xmm0
40102d: add $0x8,%rdx
401031: cmp %rax,%rdx
401034: jne 401029 i = 101
294
Branch Misprediction Recovery
401029: vmulsd (%rdx),%xmm0,%xmm0
40102d: add $0x8,%rdx
401031: cmp %rax,%rdx i= 99 Definitely not taken
401034: jne 401029
401036: jmp 401040
. . . Reload
401040: vmovsd %xmm0,(%r12) Pipeline

 Performance Cost
▪ Multiple clock cycles on modern processor
▪ Can be a major performance limiter

295
Getting High Performance
 Good compiler and flags
 Don’t do anything stupid
▪ Watch out for hidden algorithmic inefficiencies
▪ Write compiler-friendly code
▪Watch out for optimization blockers:
procedure calls & memory references
▪ Look carefully at innermost loops (where most work is done)

 Tune code for machine


▪ Exploit instruction-level parallelism
▪ Avoid unpredictable branches
▪ Make code cache friendly (Covered later in course)

296