IT3106E SP 01 Machine Level Programming
IT3106E SP 01 Machine Level Programming
1
Chapter 1. Machine Level Programming
I. Basics
II. Control
III. Procedures
IV. Data
V. Advance
2
Machine Level Programming I:
Basics
3
Machine Level Programming I: Basics
History of Intel processors and architectures
C, assembly, machine code
Assembly Basics: Registers, operands, move
Arithmetic & logical operations
4
Intel x86 Processors
Dominate laptop/desktop/server market
Evolutionary design
▪ Backwards compatible all the way back to the 8086, introduced in 1978
▪ Added more features as time goes on
5
Intel x86 Evolution: Milestones
Name Date Transistors MHz
8086 1978 29K 5-10
▪ First 16-bit Intel processor. Basis for IBM PC & DOS
▪ 1MB address space
386 1985 275K 16-33
▪ First 32-bit Intel processor, referred to as IA32
▪ Added “flat addressing”, capable of running Unix
Pentium 4E 2004 125M 2800-3800
▪ First 64-bit Intel x86 processor, referred to as x86-64
Core 2 2006 291M 1060-3500
▪ First multi-core Intel processor
Core i7 2008 731M 1700-3900
▪ Four cores (our shark machines)
6
Intel x86 Processors, cont.
Machine Evolution
▪ 386 1985 0.3M
▪ Pentium 1993 3.1M
▪ Pentium/MMX 1997 4.5M
▪ PentiumPro 1995 6.5M
▪ Pentium III 1999 8.2M
▪ Pentium 4 2001 42M
▪ Core 2 Duo 2006 291M
▪ Core i7 2008 731M
Added Features
▪ Instructions to support multimedia operations
▪ Instructions to enable more efficient conditional operations
▪ Transition from 32 bits to 64 bits
▪ More cores
7
2015 State of the Art
▪ Core i7 Broadwell 2015
Desktop Model
▪ 4 cores
▪ Integrated graphics
▪ 3.3-3.8 GHz
▪ 65W
Server Model
▪ 8 cores
▪ Integrated I/O
▪ 2-2.6 GHz
▪ 45W
8
x86 Clones: Advanced Micro Devices (AMD)
Historically
▪ AMD has followed just behind Intel
▪ A little bit slower, a lot cheaper
Then
▪ Recruited top circuit designers from Digital Equipment Corp. and other
downward trending companies
▪ Built Opteron: tough competitor to Pentium 4
▪ Developed x86-64, their own extension to 64 bits
Recent Years
▪ Intel got its act together
▪Leads the world in semiconductor technology
▪ AMD has fallen behind
▪ Relies on external semiconductor manufacturer
9
Intel’s 64-Bit History
Presentation
▪ Book covers x86-64
▪ Web aside on IA32
▪ We will only cover x86-64
11
Machine Programming: Basics
History of Intel processors and architectures
C, assembly, machine code
Assembly Basics: Registers, operands, move
Arithmetic & logical operations
12
Definitions
Architecture: (also ISA: instruction set architecture) The parts of
a processor design that one needs to understand in order to write
assembly/machine code.
▪ Examples: instruction set specification, registers.
Microarchitecture: Implementation of the architecture.
▪ Examples: cache sizes and core frequency.
Code Forms:
▪ Machine Code: The byte-level programs that a processor executes
▪ Assembly Code: A text representation of machine code
Example ISAs:
▪ Intel: x86, IA32, Itanium, x86-64
▪ ARM: Used in almost all mobile phones
13
Assembly/Machine Code View
(figure: CPU — PC, Registers, Condition Codes — exchanging addresses, data, and
instructions with Memory, which holds code, data, and the stack)

Programmer-Visible State
▪ PC: Program counter
  ▪ Address of next instruction
  ▪ Called "RIP" (x86-64)
▪ Register file
  ▪ Heavily used program data
▪ Condition codes
  ▪ Store status information about most recent arithmetic or logical operation
  ▪ Used for conditional branching
▪ Memory
  ▪ Byte addressable array
  ▪ Code and user data
  ▪ Stack to support procedures
14
Turning C into Object Code
▪ Code in files p1.c p2.c
▪ Compile with command: gcc -Og p1.c p2.c -o p
▪ Use basic optimizations (-Og) [New to recent versions of GCC]
▪ Put resulting binary in file p
15
Compiling Into Assembly
C Code (sum.c)

long plus(long x, long y);

void sumstore(long x, long y, long *dest)
{
    long t = plus(x, y);
    *dest = t;
}

Generated x86-64 Assembly

sumstore:
    pushq %rbx
    movq  %rdx, %rbx
    call  plus
    movq  %rax, (%rbx)
    popq  %rbx
    ret

Obtain (on shark machine) with command
    gcc -Og -S sum.c
Produces file sum.s
Warning: Will get very different results on non-Shark
machines (Andrew Linux, Mac OS-X, …) due to different
versions of gcc and different compiler settings.
16
AT&T vs Intel format
AT&T is the default format for GCC and objdump
To generate Intel format
▪ gcc -Og -S -masm=intel mstore.c
17
Assembly Characteristics: Data Types
18
Assembly Characteristics: Data Types
19
Assembly Characteristics: Operations
Perform arithmetic function on register or memory data
Transfer control
▪ Unconditional jumps to/from procedures
▪ Conditional branches
20
Object Code

Code for sumstore (starts at address 0x0400595; total of 14 bytes; each instruction 1, 3, or 5 bytes):
    0x53 0x48 0x89 0xd3 0xe8 0xf2 0xff 0xff 0xff 0x48 0x89 0x03 0x5b 0xc3

Assembler
▪ Translates .s into .o
▪ Binary encoding of each instruction
▪ Nearly-complete image of executable code
▪ Missing linkages between code in different files

Linker
▪ Resolves references between files
▪ Combines with static run-time libraries
  ▪ E.g., code for malloc, printf
▪ Some libraries are dynamically linked
  ▪ Linking occurs when program begins execution
21
Machine Instruction Example
C Code
*dest = t;
▪ Store value t where designated by
dest
Assembly
movq %rax, (%rbx)
▪ Move 8-byte value to memory
▪Quad words in x86-64 parlance
▪ Operands:
t: Register %rax
dest: Register %rbx
*dest: Memory M[%rbx]
Object Code
0x40059e: 48 89 03
▪ 3-byte instruction
▪ Stored at address 0x40059e
22
Disassembling Object Code
Disassembled
0000000000400595 <sumstore>:
400595: 53 push %rbx
400596: 48 89 d3 mov %rdx,%rbx
400599: e8 f2 ff ff ff callq 400590 <plus>
40059e: 48 89 03 mov %rax,(%rbx)
4005a1: 5b pop %rbx
4005a2: c3 retq
Disassembler
objdump –d sum
▪ Useful tool for examining object code
▪ Analyzes bit pattern of series of instructions
▪ Produces approximate rendition of assembly code
▪ Can be run on either a.out (complete executable) or .o file
23
Alternate Disassembly

Object (14 bytes starting at 0x0400595):
    0x53 0x48 0x89 0xd3 0xe8 0xf2 0xff 0xff 0xff 0x48 0x89 0x03 0x5b 0xc3

Disassembled:
Dump of assembler code for function sumstore:
    0x0000000000400595 <+0>:   push  %rbx
    0x0000000000400596 <+1>:   mov   %rdx,%rbx
    0x0000000000400599 <+4>:   callq 0x400590 <plus>
    0x000000000040059e <+9>:   mov   %rax,(%rbx)
    0x00000000004005a1 <+12>:  pop   %rbx
    0x00000000004005a2 <+13>:  retq

Within gdb Debugger
    gdb sum
    disassemble sumstore
▪ Disassemble procedure
    x/14xb sumstore
▪ Examine the 14 bytes starting at sumstore
24
What Can be Disassembled?
% objdump -d WINWORD.EXE
No symbols in "WINWORD.EXE".
Disassembly of section .text:
30001000 <.text>:
30001000: 55 push %ebp
30001001: 8b ec mov %esp,%ebp
30001003: 6a ff push $0xffffffff
30001005: 68 90 10 00 30 push $0x30001090
3000100a: 68 91 dc 4c 30 push $0x304cdc91
26
x86-64 Integer Registers
(register-file figure)
▪ %esi / %si — source index
▪ %edi / %di — destination index
▪ %esp / %sp — stack pointer
▪ %ebp / %bp — base pointer
29
movq Operand Combinations
30
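The combinations themselves are not listed in the deck here; as a sketch of the usual movq source/destination pairings (standard operand rules; the C analogies and the names temp, temp1, temp2, p are mine, for illustration only):

    Imm -> Reg    movq $0x4,%rax        /* temp = 0x4;    */
    Imm -> Mem    movq $-147,(%rax)     /* *p = -147;     */
    Reg -> Reg    movq %rax,%rdx        /* temp2 = temp1; */
    Reg -> Mem    movq %rax,(%rdx)      /* *p = temp;     */
    Mem -> Reg    movq (%rax),%rdx      /* temp = *p;     */

A memory-to-memory transfer cannot be done with a single movq instruction.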
Moving data
31
Moving data
32
Moving data
33
Simple Memory Addressing Modes
Normal: (R)  Mem[Reg[R]]
▪ Register R specifies memory address
▪ Aha! Pointer dereferencing in C
    movq (%rcx),%rax
Displacement: D(R)  Mem[Reg[R]+D]
▪ Register R specifies start of memory region; constant D gives the offset
    movq 8(%rbp),%rdx
34
Example of Simple Addressing Modes

void swap(long *xp, long *yp)
{
    long t0 = *xp;
    long t1 = *yp;
    *xp = t1;
    *yp = t0;
}

swap:
    movq (%rdi), %rax
    movq (%rsi), %rdx
    movq %rdx, (%rdi)
    movq %rax, (%rsi)
    ret
35
Understanding Swap()

void swap(long *xp, long *yp)
{
    long t0 = *xp;
    long t1 = *yp;
    *xp = t1;
    *yp = t0;
}

Register  Value
%rdi      xp
%rsi      yp
%rax      t0
%rdx      t1

swap:
    movq (%rdi), %rax   # t0 = *xp
    movq (%rsi), %rdx   # t1 = *yp
    movq %rdx, (%rdi)   # *xp = t1
    movq %rax, (%rsi)   # *yp = t0
    ret
36
Understanding Swap() — Step-by-Step Trace
(memory/register figure: %rdi = 0x120 points at xp, %rsi = 0x100 points at yp;
initially M[0x120] = 123 and M[0x100] = 456)

swap:
    movq (%rdi), %rax   # t0 = *xp   -> %rax = 123
    movq (%rsi), %rdx   # t1 = *yp   -> %rdx = 456
    movq %rdx, (%rdi)   # *xp = t1   -> M[0x120] = 456
    movq %rax, (%rsi)   # *yp = t0   -> M[0x100] = 123
    ret
37-41
Simple Memory Addressing Modes
Normal (R) Mem[Reg[R]]
▪ Register R specifies memory address
▪ Aha! Pointer dereferencing in C
movq (%rcx),%rax
movq 8(%rbp),%rdx
42
Complete Memory Addressing Modes
Most General Form
D(Rb,Ri,S) Mem[Reg[Rb]+S*Reg[Ri]+ D]
▪ D: Constant “displacement” 1, 2, or 4 bytes
▪ Rb: Base register: Any of 16 integer registers
▪ Ri: Index register: Any, except for %rsp
▪ S: Scale: 1, 2, 4, or 8 (why these numbers?)
Special Cases
(Rb,Ri) Mem[Reg[Rb]+Reg[Ri]]
D(Rb,Ri) Mem[Reg[Rb]+Reg[Ri]+D]
(Rb,Ri,S) Mem[Reg[Rb]+S*Reg[Ri]]
43
Carnegie Mellon
Address Computation Example
(given register values: %rdx = 0xf000, %rcx = 0x0100)
44
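A worked illustration of the general form (the particular operand expressions are mine; the arithmetic follows directly from Mem[Reg[Rb]+S*Reg[Ri]+D] with the register values above):

    Expression       Computation           Address
    0x8(%rdx)        0xf000 + 0x8          0xf008
    (%rdx,%rcx)      0xf000 + 0x100        0xf100
    (%rdx,%rcx,4)    0xf000 + 4*0x100      0xf400
    0x80(,%rdx,2)    2*0xf000 + 0x80       0x1e080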
Machine Programming: Basics
History of Intel processors and architectures
C, assembly, machine code
Assembly Basics: Registers, operands, move
Arithmetic & logical operations
45
Carnegie Mellon
Address Computation Instruction: leaq

Uses
▪ Computing addresses without a memory reference
  ▪ E.g., translation of p = &x[i];
▪ Computing arithmetic expressions of the form x + k*y
  ▪ k = 1, 2, 4, or 8

Example

long m12(long x)
{
    return x*12;
}

Converted to ASM by compiler:
    leaq (%rdi,%rdi,2), %rax   # t <- x+x*2
    salq $2, %rax              # return t<<2
46
Machine Programming I: Summary
History of Intel processors and architectures
▪ Evolutionary design leads to many quirks and artifacts
C, assembly, machine code
▪ New forms of visible state: program counter, registers, ...
▪ Compiler must transform statements, expressions, procedures into low-
level instruction sequences
Assembly Basics: Registers, operands, move
▪ The x86-64 move instructions cover wide range of data movement forms
Arithmetic
▪ C compiler will figure out different instruction combinations to carry out
computation
51
Carnegie Mellon
Information about currently executing program
▪ Temporary data ( %rax, … )
▪ Location of runtime stack ( %rsp )
▪ Location of current code control point ( %rip, … )
▪ Status of recent tests ( CF, ZF, SF, OF )

Registers
    %rax  %rbx  %rcx  %rdx  %rsi  %rdi  %rsp  %rbp
    %r8   %r9   %r10  %r11  %r12  %r13  %r14  %r15
    %rip — Instruction pointer (%rsp holds the current stack top)
    CF ZF SF OF — Condition codes
54
Condition Codes (Implicit Setting)
Single-bit registers
▪ CF  Carry Flag (for unsigned)
▪ ZF  Zero Flag
▪ SF  Sign Flag (for signed)
▪ OF  Overflow Flag (for signed)
55
Carnegie Mellon
Condition Codes (Explicit Setting: Compare)
cmpq Src2(b), Src1(a) computes a-b without storing a destination, and sets:
▪ CF set if carry out from most significant bit (used for unsigned comparisons)
▪ ZF set if a == b
▪ SF set if (a-b) < 0 (as signed)
▪ OF set if two's-complement (signed) overflow
  (a>0 && b<0 && (a-b)<0) || (a<0 && b>0 && (a-b)>0)
56
Carnegie Mellon
Jumping
jX Instructions
▪ Jump to different part of code depending on condition codes
jX Condition Description
jmp 1 Unconditional
je ZF Equal / Zero
jne ~ZF Not Equal / Not Zero
js SF Negative
jns ~SF Nonnegative
jg ~(SF^OF)&~ZF Greater (Signed)
jge ~(SF^OF) Greater or Equal (Signed)
jl (SF^OF) Less (Signed)
jle (SF^OF)|ZF Less or Equal (Signed)
ja ~CF&~ZF Above (unsigned)
jb CF Below (unsigned)
62
Carnegie Mellon
64
Carnegie Mellon
Goto Version
▪ Create separate code regions for then & else expressions
▪ Execute appropriate one

    ntest = !Test;
    if (ntest) goto Else;
    val = Then_Expr;
    goto Done;
Else:
    val = Else_Expr;
Done:
    . . .
65
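Applying this template to the absdiff function that appears a few slides later gives the following C sketch (my own rendering for illustration, not compiler output):

long absdiff_goto(long x, long y)
{
    long result;
    int ntest = !(x > y);     /* ntest = !Test   */
    if (ntest) goto Else;
    result = x - y;           /* Then_Expr       */
    goto Done;
Else:
    result = y - x;           /* Else_Expr       */
Done:
    return result;
}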
Carnegie Mellon
66
Carnegie Mellon
long absdiff(long x, long y)
{
    long result;
    if (x > y)
        result = x-y;
    else
        result = y-x;
    return result;
}

Register  Use(s)
%rdi      Argument x
%rsi      Argument y
%rax      Return value

absdiff:
    movq   %rdi, %rax   # x
    subq   %rsi, %rax   # result = x-y
    movq   %rsi, %rdx
    subq   %rdi, %rdx   # eval = y-x
    cmpq   %rsi, %rdi   # x:y
    cmovle %rdx, %rax   # if <=, result = eval
    ret
67
Carnegie Mellon
Expensive Computations
val = Test(x) ? Hard1(x) : Hard2(x);
69
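A minimal sketch (mine, not from the deck) of why this pattern resists the conditional-move transformation: with a conditional move, both arms are evaluated before one result is selected, so the transformation only pays off when both expressions are cheap and free of side effects.

#include <stdio.h>

static int  Test (long x) { return x > 0; }
static long Hard1(long x) { return x * x; }   /* stands in for one expensive computation */
static long Hard2(long x) { long s = 0; for (long i = 0; i < x*x; i++) s += i; return s; }

long pick(long x)
{
    /* If this were turned into a conditional move, BOTH Hard1(x) and Hard2(x)
       would run; a conditional branch evaluates only the chosen one. */
    return Test(x) ? Hard1(x) : Hard2(x);
}

int main(void) { printf("%ld\n", pick(10)); return 0; }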
Carnegie Mellon
While version

while (Test)
    Body

Goto Version ("jump-to-middle")

    goto test;
loop:
    Body
test:
    if (Test)
        goto loop;
done:
73
Carnegie Mellon
74
Carnegie Mellon
While version

while (Test)
    Body

"Do-while" conversion — used with -O1

Do-While Version

do
    Body
while (Test);

Goto Version

    if (!Test)
        goto done;
loop:
    Body
    if (Test)
        goto loop;
done:
75
Carnegie Mellon
76
Carnegie Mellon
While Version
Init;
while (Test ) {
Body
Update;
}
78
Carnegie Mellon
For-While Conversion

Init:    i = 0
Test:    i < WSIZE
Update:  i++
Body:    {
           unsigned bit = (x >> i) & 0x1;
           result += bit;
         }

long pcount_for_while(unsigned long x)
{
    size_t i;
    long result = 0;
    i = 0;
    while (i < WSIZE)
    {
        unsigned bit = (x >> i) & 0x1;
        result += bit;
        i++;
    }
    return result;
}
79
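For reference, a self-contained sketch of the original for-loop form (the WSIZE macro is not shown in the deck; here it is assumed to be the number of bits in an unsigned long):

#include <stdio.h>

#define WSIZE (8 * sizeof(unsigned long))   /* assumed definition: bits per word */

long pcount_for(unsigned long x)
{
    size_t i;
    long result = 0;
    for (i = 0; i < WSIZE; i++) {
        unsigned bit = (x >> i) & 0x1;      /* extract bit i */
        result += bit;
    }
    return result;
}

int main(void)
{
    printf("%ld\n", pcount_for(0xFFUL));    /* prints 8 */
    return 0;
}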
Carnegie Mellon
81
Carnegie Mellon
Switch Statement Example

long switch_eg(long x, long y, long z)
{
    long w = 1;
    switch(x) {
    case 1:
        w = y*z;
        break;
    case 2:
        w = y/z;
        /* Fall Through */
    case 3:
        w += z;
        break;
    case 5:
    case 6:
        w -= z;
        break;
    default:
        w = 2;
    }
    return w;
}

Multiple case labels
▪ Here: 5 & 6
Fall through cases
▪ Here: 2
Missing cases
▪ Here: 4
82
Carnegie Mellon
Translation (Extended C):
    goto *JTab[x];

Setup:

switch_eg:
    movq %rdx, %rcx
    cmpq $6, %rdi        # x:6
    ja   .L8
    jmp  *.L4(,%rdi,8)

Register  Use(s)
%rdi      Argument x
%rsi      Argument y
%rdx      Argument z
%rax      Return value

What range of values takes default?
Note that w is not initialized here.
84
Carnegie Mellon
Jump Table

.section .rodata
    .align 8
.L4:
    .quad .L8  # x = 0
    .quad .L3  # x = 1
    .quad .L5  # x = 2
    .quad .L9  # x = 3
    .quad .L8  # x = 4
    .quad .L7  # x = 5
    .quad .L7  # x = 6

switch(x) {
  case 1:     // .L3
    w = y*z;
    break;
  case 2:     // .L5
    w = y/z;
    /* Fall Through */
  case 3:     // .L9
    w += z;
    break;
  case 5:
  case 6:     // .L7
    w -= z;
    break;
  default:    // .L8
    w = 2;
}
87
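A minimal C sketch (mine, not from the deck) of the same dispatch idea: an array indexed by x selects the code to run, with an explicit range check standing in for the cmpq $6 / ja guard. Standard C has no computed goto, so the table holds function pointers and the fall-through case is handled separately.

typedef long (*handler_t)(long w, long y, long z);

static long do_default(long w, long y, long z) { (void)w; (void)y; (void)z; return 2; }
static long do_case1  (long w, long y, long z) { (void)w; return y * z; }
static long do_case3  (long w, long y, long z) { (void)y; return w + z; }
static long do_case56 (long w, long y, long z) { (void)y; return w - z; }

long switch_eg_table(long x, long y, long z)
{
    long w = 1;
    static const handler_t jtab[7] = {
        do_default, do_case1, do_default /* x == 2 handled below */,
        do_case3, do_default, do_case56, do_case56
    };
    if ((unsigned long)x > 6)      /* out of range -> default */
        return do_default(w, y, z);
    if (x == 2)                    /* fall-through sequence: w = y/z; w += z; */
        return (y / z) + z;
    return jtab[x](w, y, z);
}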
Carnegie Mellon
Code Blocks (x == 1)

switch(x) {
  case 1:     // .L3
    w = y*z;
    break;
  . . .
}

.L3:
    movq  %rsi, %rax   # y
    imulq %rdx, %rax   # y*z
    ret

Register  Use(s)
%rdi      Argument x
%rsi      Argument y
%rdx      Argument z
%rax      Return value
88
Carnegie Mellon
Handling Fall-Through

long w = 1;
. . .
switch(x) {
  . . .
  case 2:
    w = y/z;
    /* Fall Through */
  case 3:
    w += z;
    break;
  . . .
}

Translation:

case 2:
    w = y/z;
    goto merge;

case 3:
    w = 1;
merge:
    w += z;
89
Carnegie Mellon
Code Blocks (x == 2, x == 3)

long w = 1;
. . .
switch(x) {
  . . .
  case 2:
    w = y/z;
    /* Fall Through */
  case 3:
    w += z;
    break;
  . . .
}

.L5:                        # Case 2
    movq  %rsi, %rax
    cqto
    idivq %rcx              # y/z
    jmp   .L6               # goto merge
.L9:                        # Case 3
    movl  $1, %eax          # w = 1
.L6:                        # merge:
    addq  %rcx, %rax        # w += z
    ret

Register  Use(s)
%rdi      Argument x
%rsi      Argument y
%rdx      Argument z
%rax      Return value
90
Carnegie Mellon
Register Use(s)
%rdi Argument x
%rsi Argument y
%rdx Argument z
%rax Return value
91
Carnegie Mellon
Summarizing
C Control
▪ if-then-else
▪ do-while
▪ while, for
▪ switch
Assembler Control
▪ Conditional jump
▪ Conditional move
▪ Indirect jump (via jump tables)
▪ Compiler generates code sequence to implement more complex control
Standard Techniques
▪ Loops converted to do-while or jump-to-middle form
▪ Large switch statements use jump tables
▪ Sparse switch statements may use decision trees (if-elseif-elseif-else)
92
Carnegie Mellon
Summary
Today
▪ Control: Condition codes
▪ Conditional branches & conditional moves
▪ Loops
▪ Switch statements
Next Time
▪ Stack
▪ Call / return
▪ Procedure call discipline
93
Carnegie Mellon
94
Mechanisms in Procedures

Passing control
▪ To beginning of procedure code
▪ Back to return point
Passing data
▪ Procedure arguments
▪ Return value
Memory management
▪ Allocate during procedure execution
▪ Deallocate upon return

Mechanisms all implemented with machine instructions.
x86-64 implementation of a procedure uses only those mechanisms required.

P(…) {
    •
    •
    y = Q(x);
    print(y)
    •
}

int Q(int i)
{
    int t = 3*i;
    int v[10];
    •
    •
    return v[t];
}
95
Carnegie Mellon
96
Carnegie Mellon
x86-64 Stack

▪ Region of memory managed with stack discipline
▪ Grows toward lower addresses (the stack "Bottom" is at the highest address, the stack "Top" at the lowest)
▪ Stack Pointer %rsp holds the address of the current stack top
▪ Push: decrement %rsp by 8 (-8), then store at the new top
▪ Pop: read the value at the top, then increment %rsp by 8 (+8)
97-99
Carnegie Mellon
100
Code Examples

void multstore(long x, long y, long *dest)
{
    long t = mult2(x, y);
    *dest = t;
}

0000000000400540 <multstore>:
    400540: push  %rbx             # Save %rbx
    400541: mov   %rdx,%rbx        # Save dest
    400544: callq 400550 <mult2>   # mult2(x,y)
    400549: mov   %rax,(%rbx)      # Save at dest
    40054c: pop   %rbx             # Restore %rbx
    40054d: retq                   # Return
102
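mult2 itself is not shown in the deck; from the calling convention visible above (x in %rdi, y in %rsi, result returned in %rax) it is presumably just the following (an assumption):

long mult2(long a, long b)
{
    long s = a * b;   /* product returned in %rax */
    return s;
}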
Control Flow Examples #1-#4 (call and return between multstore and mult2)

0000000000400540 <multstore>:
    . . .
    400544: callq 400550 <mult2>
    400549: mov   %rax,(%rbx)
    . . .

0000000000400550 <mult2>:
    400550: mov %rdi,%rax
    . . .
    400557: retq

(stack/register figure; the stack initially holds entries at 0x130, 0x128, 0x120)

#1 Before the call:          %rsp = 0x120, %rip = 0x400544
#2 After callq executes:     return address 0x400549 stored at 0x118;
                             %rsp = 0x118, %rip = 0x400550
#3 About to execute retq:    %rsp = 0x118, %rip = 0x400557
#4 After retq:               return address popped; %rsp = 0x120, %rip = 0x400549
103-106
Carnegie Mellon
107
Carnegie Mellon
Passing Data (arguments and return value)

First 6 arguments passed in registers:
    %rdi  Arg 1      %rcx  Arg 4
    %rsi  Arg 2      %r8   Arg 5
    %rdx  Arg 3      %r9   Arg 6
Remaining arguments (Arg 7, Arg 8, …, Arg n) passed on the stack
Return value in %rax

0000000000400540 <multstore>:
    # x in %rdi, y in %rsi, dest in %rdx
    •••
    400541: mov   %rdx,%rbx        # Save dest
    400544: callq 400550 <mult2>   # mult2(x,y)
                                   # t in %rax
    400549: mov   %rax,(%rbx)      # Save at dest
    •••
110
Carnegie Mellon
Stack-Based Languages
Stack discipline
▪ State for given procedure needed for limited time
▪From when called to when return
▪ Callee returns before caller does
Stack allocated in Frames
▪ state for single procedure instantiation
111
Carnegie Mellon
112
Carnegie Mellon
Stack Frames

Contents
▪ Return information
▪ Local storage (if needed)
▪ Temporary space (if needed)

(figure: the "Frame for proc" sits below the Previous Frame on the stack;
Frame Pointer %rbp (optional) marks the start of the current frame,
Stack Pointer %rsp marks the stack top)

Example Stack: yoo(…) calls who(); the call sequence is traced on the following slides.
114
Carnegie Mellon
Example Stack (animation)

Call structure:

yoo(…)
{
    •
    who();
    •
}

who(…)
{
    • • •
    amI();
    • • •
    amI();
    • • •
}

amI(…)
{
    •
    amI();
    •
}

(stack-frame figure: each call pushes a new frame; %rbp and %rsp always delimit the
frame of the procedure currently executing)

Sequence traced by the animation: yoo calls who; who calls amI; amI calls itself
recursively, so nested amI frames build up below who's frame; the recursive calls then
return one by one, popping frames back to who; who makes its second call to amI; that
call returns; who returns; only yoo's frame remains.
115-124
Carnegie Mellon
Example: incr
126
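The body of incr is not shown on this slide. A sketch consistent with how it is used on the following slides — it must add val into *p (15213 becomes 18213 when val is 3000) and, in the CS:APP version, return the old value of *p — with the exact code treated as an assumption:

long incr(long *p, long val)
{
    long x = *p;        /* old value, returned in %rax */
    long y = x + val;
    *p = y;             /* 15213 -> 18213 for val = 3000 */
    return x;
}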
Carnegie Mellon
call_incr:
    subq  $16, %rsp
    movq  $15213, 8(%rsp)
    movl  $3000, %esi
    leaq  8(%rsp), %rdi
    call  incr
    addq  8(%rsp), %rax
    addq  $16, %rsp
    ret

Resulting Stack Structure:
    ...
    Rtn address
    15213          <- %rsp+8
    Unused         <- %rsp
127
Carnegie Mellon
long call_incr() {
    long v1 = 15213;
    long v2 = incr(&v1, 3000);
    return v1+v2;
}

call_incr:
    subq  $16, %rsp
    movq  $15213, 8(%rsp)
    movl  $3000, %esi
    leaq  8(%rsp), %rdi
    call  incr
    addq  8(%rsp), %rax
    addq  $16, %rsp
    ret

Register  Use(s)
%rax      Return value

Stack after the call to incr:
    ...
    Rtn address
    18213          <- %rsp+8   (v1 updated by incr)
    Unused         <- %rsp

Updated Stack Structure (after addq $16, %rsp):
    ...
    Rtn address    <- %rsp

Final Stack Structure (after ret):
    ...            <- %rsp
130-131
Carnegie Mellon
Callee-Saved Example #1

long call_incr2(long x) {
    long v1 = 15213;
    long v2 = incr(&v1, 3000);
    return x+v2;
}

call_incr2:
    pushq %rbx
    subq  $16, %rsp
    movq  %rdi, %rbx
    movq  $15213, 8(%rsp)
    movl  $3000, %esi
    leaq  8(%rsp), %rdi
    call  incr
    addq  %rbx, %rax
    addq  $16, %rsp
    popq  %rbx
    ret

Initial Stack Structure:
    ...
    Rtn address    <- %rsp

Resulting Stack Structure:
    ...
    Rtn address
    Saved %rbx
    15213          <- %rsp+8
    Unused         <- %rsp
136
Carnegie Mellon
Callee-Saved Example #2

long call_incr2(long x) {
    long v1 = 15213;
    long v2 = incr(&v1, 3000);
    return x+v2;
}

call_incr2:
    pushq %rbx
    subq  $16, %rsp
    movq  %rdi, %rbx
    movq  $15213, 8(%rsp)
    movl  $3000, %esi
    leaq  8(%rsp), %rdi
    call  incr
    addq  %rbx, %rax
    addq  $16, %rsp
    popq  %rbx
    ret

Resulting Stack Structure:
    ...
    Rtn address
    Saved %rbx
    15213          <- %rsp+8
    Unused         <- %rsp

Pre-return Stack Structure:
    ...
    Rtn address    <- %rsp
137
Carnegie Mellon
138
Carnegie Mellon
Recursive Function

/* Recursive popcount */
long pcount_r(unsigned long x) {
    if (x == 0)
        return 0;
    else
        return (x & 1)
               + pcount_r(x >> 1);
}

pcount_r:
    movl  $0, %eax
    testq %rdi, %rdi
    je    .L6
    pushq %rbx
    movq  %rdi, %rbx
    andl  $1, %ebx
    shrq  %rdi          # (by 1)
    call  pcount_r
    addq  %rbx, %rax
    popq  %rbx
.L6:
    rep; ret
139
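A quick usage sketch (assuming the C definition above is compiled alongside):

#include <stdio.h>

long pcount_r(unsigned long x);   /* defined above */

int main(void)
{
    /* 0 has no set bits, 7 has three, an all-ones 64-bit word has 64 */
    printf("%ld %ld %ld\n", pcount_r(0), pcount_r(7), pcount_r(~0UL));
    return 0;
}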
Carnegie Mellon
140
Carnegie Mellon
Rtn address
Saved %rbx %rsp
141
Carnegie Mellon
Argument
Build
%rsp
147
Machine-Level Programming IV:
Data
148
Machine-Level Programming IV: Data
Arrays
▪ One-dimensional
▪ Multi-dimensional (nested)
▪ Multi-level
Structures
▪ Allocation
▪ Access
▪ Alignment
Floating Point
149
Array Allocation
Basic Principle
T A[L];
▪ Array of data type T and length L
▪ Contiguously allocated region of L * sizeof(T) bytes in memory
char string[12];
x x + 12
int val[5];
x x+4 x+8 x + 12 x + 16 x + 20
double a[3];
x x+8 x + 16 x + 24
char *p[3];
x x+8 x + 16 x + 24
150
Array Access
Basic Principle
T A[L];
▪ Array of data type T and length L
▪ Identifier A can be used as a pointer to array element 0: Type T*
int val[5];    1  5  2  1  3
               x  x+4  x+8  x+12  x+16  x+20

Example (each zip_dig is an array of 5 int's):

zip_dig cmu = { 1, 5, 2, 1, 3 };
zip_dig mit = { 0, 2, 1, 3, 9 };
zip_dig ucb = { 9, 4, 7, 2, 0 };

zip_dig cmu;   1 5 2 1 3   at addresses 16..36
zip_dig mit;   0 2 1 3 9   at addresses 36..56
zip_dig ucb;   9 4 7 2 0   at addresses 56..76

Array Access Example

int get_digit(zip_dig z, int digit)
{
    return z[digit];
}

x86-64 code:
    # %rdi = z
    # %rsi = digit
    movl (%rdi,%rsi,4), %eax   # z[digit]

◼ Register %rdi contains starting address of array
◼ Register %rsi contains array index
◼ Desired digit at %rdi + 4*%rsi
◼ Use memory reference (%rdi,%rsi,4)
153
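The zip_dig type itself is never defined in this deck; given the five-element initializers and the 20-byte stride, it is presumably the CS:APP definition (an assumption):

typedef int zip_dig[5];   /* assumed: five decimal digits, 5 * 4 = 20 bytes */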
Array Loop Example
void zincr(zip_dig z) {
size_t i;
for (i = 0; i < ZLEN; i++)
z[i]++;
}
# %rdi = z
movl $0, %eax # i = 0
jmp .L3 # goto middle
.L4: # loop:
addl $1, (%rdi,%rax,4) # z[i]++
addq $1, %rax # i++
.L3: # middle
cmpq $4, %rax # i:4
jbe .L4 # if <=, goto loop
rep; ret
154
Multidimensional (Nested) Arrays

Declaration
    T A[R][C];
▪ 2D array of data type T
▪ R rows, C columns
▪ Type T element requires K bytes
(figure: elements A[0][0] … A[0][C-1] through A[R-1][0] … A[R-1][C-1])

Array Size
▪ R * C * K bytes

Arrangement
▪ Row-Major Ordering

int A[R][C];
(layout: row A[0][0..C-1], then A[1][0..C-1], …, then A[R-1][0..C-1]; 4*R*C bytes total)
155
Nested Array Example

#define PCOUNT 4
zip_dig pgh[PCOUNT] =
    {{1, 5, 2, 0, 6},
     {1, 5, 2, 1, 3},
     {1, 5, 2, 1, 7},
     {1, 5, 2, 2, 1}};

zip_dig pgh[4] layout (row-major):
    1 5 2 0 6 | 1 5 2 1 3 | 1 5 2 1 7 | 1 5 2 2 1

int A[R][C];
(layout: row A[i] starts at A + i*C*4; the last row starts at A + (R-1)*C*4)
157
Nested Array Row Access Code
1 5 2 0 6 1 5 2 1 3 1 5 2 1 7 1 5 2 2 1
Row Vector
▪ pgh[index] is array of 5 int’s
▪ Starting address pgh+20*index
Machine Code
▪ Computes and returns address
▪ Compute as pgh + 4*(index+4*index)
158
Nested Array Element Access

Array Elements
▪ A[i][j] is element of type T, which requires K bytes
▪ Address A + i * (C * K) + j * K = A + (i * C + j) * K

int A[R][C];
(layout: A[i][j] sits within row i, at A + (i*C*4) + (j*4); the last row starts at A + ((R-1)*C*4))
159
Nested Array Element Access Code
1 5 2 0 6 1 5 2 1 3 1 5 2 1 7 1 5 2 2 1
Array Elements
▪ pgh[index][dig] is int
▪ Address: pgh + 20*index + 4*dig
▪ = pgh + 4*(5*index + dig)
160
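The access code is not shown here; a sketch consistent with the address formula above (the function name is the one CS:APP uses, treated as an assumption; it relies on the pgh declaration shown earlier):

int get_pgh_digit(size_t index, size_t dig)
{
    return pgh[index][dig];   /* compiled to a load from pgh + 4*(5*index + dig) */
}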
Multi-Level Array Example

(figure: three zip_dig arrays and an array univ of pointers to them)
    cmu at addresses 16..36:  1 5 2 1 3
    mit at addresses 36..56:  0 2 1 3 9
    ucb at addresses 56..76:  9 4 7 2 0

    univ[0] at address 160 holds 36  -> mit
    univ[1] at address 168 holds 16  -> cmu
    univ[2] at address 176 holds 56  -> ucb
161
Element Access in Multi-Level Array
int get_univ_digit
(size_t index, size_t digit)
{
return univ[index][digit];
}
Computation
▪ Element access Mem[Mem[univ+8*index]+4*digit]
▪ Must do two memory reads
▪ First get pointer to row array
▪ Then access element within array
162
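The declaration of univ is not shown; from the figure (8-byte elements holding the addresses of mit, cmu, and ucb, in that order) it is presumably an array of row pointers (an assumption):

int *univ[3] = {mit, cmu, ucb};   /* univ[index] points at one zip_dig array */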
Array Element Accesses
Mem[pgh+20*index+4*digit] Mem[Mem[univ+8*index]+4*digit]
163
N X N Matrix Code

Fixed dimensions
▪ Know value of N at compile time

#define N 16
typedef int fix_matrix[N][N];

/* Get element a[i][j] */
int fix_ele(fix_matrix a, size_t i, size_t j)
{
    return a[i][j];
}

Variable dimensions, explicit indexing
▪ Traditional way to implement dynamic arrays

#define IDX(n, i, j) ((i)*(n)+(j))
/* Get element a[i][j] */
int vec_ele(size_t n, int *a, size_t i, size_t j)
{
    return a[IDX(n,i,j)];
}

Array Elements
▪ Address A + i * (C * K) + j * K
▪ C = 16, K = 4
165
n X n Matrix Access
Array Elements
▪ Address A + i * (C * K) + j * K
▪ C = n, K = 4
▪ Must perform integer multiplication
/* Get element a[i][j] */
int var_ele(size_t n, int a[n][n], size_t i, size_t j)
{
return a[i][j];
}
166
Machine-Level Programming IV: Data
Arrays
▪ One-dimensional
▪ Multi-dimensional (nested)
▪ Multi-level
Structures
▪ Allocation
▪ Access
▪ Alignment
Floating Point
167
Structure Representation
r
struct rec {
int a[4];
size_t i; a i next
struct rec *next;
0 16 24 32
};
168
Generating Pointer to Structure Member
r r+4*idx
struct rec {
int a[4];
size_t i; a i next
struct rec *next;
0 16 24 32
};
169
Following Linked List

C Code:

struct rec {
    int a[4];
    int i;
    struct rec *next;
};

void set_val(struct rec *r, int val)
{
    while (r) {
        int i = r->i;
        r->a[i] = val;
        r = r->next;
    }
}

(layout: a at offset 0, i at offset 16, next at offset 24; 32 bytes total)

Register  Value
%rdi      r
%rsi      val

.L11:                            # loop:
    movslq 16(%rdi), %rax        # i = M[r+16]
    movl   %esi, (%rdi,%rax,4)   # M[r+4*i] = val
    movq   24(%rdi), %rdi        # r = M[r+24]
    testq  %rdi, %rdi            # Test r
    jne    .L11                  # if !=0 goto loop
170
Structures & Alignment
Aligned Data
▪ Primitive data type requires K bytes
▪ Address must be multiple of K
Multiple of 4 Multiple of 8
Multiple of 8 Multiple of 8
171
Alignment Principles
Aligned Data
▪ Primitive data type requires K bytes
▪ Address must be multiple of K
▪ Required on some machines; advised on x86-64
Motivation for Aligning Data
▪ Memory accessed by (aligned) chunks of 4 or 8 bytes (system dependent)
▪ Inefficient to load or store datum that spans quad word boundaries
▪ Virtual memory trickier when datum spans 2 pages
Compiler
▪ Inserts gaps in structure to ensure correct alignment of fields
172
Specific Cases of Alignment (x86-64)
1 byte: char, …
▪ no restrictions on address
2 bytes: short, …
▪ lowest 1 bit of address must be 0
4 bytes: int, float, …
▪ lowest 2 bits of address must be 00
8 bytes: double, long, char *, …
▪ lowest 3 bits of address must be 000
16 bytes: long double (GCC on Linux)
▪ lowest 4 bits of address must be 0000
173
Satisfying Alignment with Structures
Multiple of 4 Multiple of 8
Multiple of 8 Multiple of 8
174
Meeting Overall Alignment Requirement
Multiple of K=8
175
Arrays of Structures

struct S2 {
    double v;
    int i[2];
    char c;
} a[10];

▪ Overall structure length must be a multiple of K (the largest element alignment, here 8)
▪ Satisfy alignment requirement for every element
▪ Each a[idx] occupies 24 bytes: v at a+24*idx, i[] at a+24*idx+8, c at a+24*idx+16, then padding

Saving space: order fields from largest to smallest alignment requirement.

struct S4 { char c; int i; char d; } *p;
struct S5 { int i; char c; char d; } *p;

Effect (K=4)
▪ S4: c, 3 bytes pad, i, d, 3 bytes pad  -> 12 bytes
▪ S5: i, c, d, 2 bytes pad               ->  8 bytes
178
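A quick way to check the padding claims above (my own sketch, not from the deck):

#include <stdio.h>
#include <stddef.h>

struct S4 { char c; int i; char d; };   /* c, 3 bytes pad, i, d, 3 bytes pad */
struct S5 { int i; char c; char d; };   /* i, c, d, 2 bytes pad */

int main(void)
{
    printf("S4: size %zu, i at offset %zu, d at offset %zu\n",
           sizeof(struct S4), offsetof(struct S4, i), offsetof(struct S4, d));
    printf("S5: size %zu, c at offset %zu, d at offset %zu\n",
           sizeof(struct S5), offsetof(struct S5, c), offsetof(struct S5, d));
    return 0;   /* on x86-64: S4 is 12 bytes, S5 is 8 bytes */
}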
Machine-Level Programming IV: Data
Arrays
▪ One-dimensional
▪ Multi-dimensional (nested)
▪ Multi-level
Structures
▪ Allocation
▪ Access
▪ Alignment
Floating Point
179
Background
History
▪ x87 FP
▪ Legacy, very ugly
▪ SSE FP
▪ Supported by Shark machines
▪ Special case use of vector instructions
▪ AVX FP
▪ Newest version
▪ Similar to SSE
▪ Documented in book
180
Programming with SSE3
XMM Registers
◼ 16 total, each 16 bytes
◼ 16 single-byte integers
◼ 8 16-bit integers
◼ 4 32-bit integers
◼ 4 single-precision floats
◼ 2 double-precision floats
◼ 1 single-precision float
◼ 1 double-precision float
181
Scalar & SIMD Operations
◼ Scalar Operations: Single Precision addss %xmm0,%xmm1
%xmm0
+
%xmm1
◼ SIMD Operations: Single Precision addps %xmm0,%xmm1
%xmm0
+ + + +
%xmm1
◼ Scalar Operations: Double Precision
addsd %xmm0,%xmm1
%xmm0
+
%xmm1 182
FP Basics
183
FP Memory Referencing
# p in %rdi, v in %xmm0
movapd %xmm0, %xmm1 # Copy v
movsd (%rdi), %xmm0 # x = *p
addsd %xmm0, %xmm1 # t = x + v
movsd %xmm1, (%rdi) # *p = t
ret 184
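A C function consistent with this assembly (the deck does not show the source; the name and signature are my assumption): it adds v into *p and returns the old value of *p.

double incr_fp(double *p, double v)
{
    double x = *p;   /* movsd (%rdi), %xmm0 */
    *p = x + v;      /* addsd, then movsd back to (%rdi) */
    return x;        /* old value left in %xmm0 */
}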
Other Aspects of FP Code
Lots of instructions
▪ Different operations, different formats, ...
Floating-point comparisons
▪ Instructions ucomiss and ucomisd
▪ Set condition codes CF, ZF, and PF
Using constant values
▪ Set XMM0 register to 0 with instruction xorpd %xmm0, %xmm0
▪ Others loaded from memory
185
Summary
Arrays
▪ Elements packed into contiguous region of memory
▪ Use index arithmetic to locate individual elements
Structures
▪ Elements packed into single region of memory
▪ Access using offsets determined by compiler
▪ May require internal and external padding to ensure alignment
Combinations
▪ Can nest structure and array code arbitrarily
Floating Point
▪ Data held and operated on in XMM registers
186
Understanding Pointers & Arrays #1
Decl An *An
Cmp Bad Size Cmp Bad Size
int A1[3]
int *A2
187
Understanding Pointers & Arrays #1
Decl An *An
Cmp Bad Size Cmp Bad Size
int A1[3] Y N 12 Y N 4
int *A2 Y N 8 Y Y 4
A1 Allocated pointer
Unallocated pointer
A2
Allocated int
Unallocated int
188
Understanding Pointers & Arrays #2
189
Understanding Pointers & Arrays #2
Decl An *An **An
Cmp Bad Size Cmp Bad Size Cmp Bad Size
int A1[3] Y N 12 Y N 4 N - -
int *A2[3] Y N 24 Y N 8 Y Y 4
int Y N 8 Y Y 12 Y Y 4
(*A3)[3]
int Y N 24 Y N 8 Y Y 4
(*A4[3])
A1
A2/A4
A3
Allocated pointer
Unallocated pointer
Allocated int
Unallocated int 190
Understanding Pointers & Arrays #3
A2/A4
A3
A5
192
Understanding Pointers & Arrays #3
Cmp: Compiles (Y/N)    Bad: Possible bad pointer reference (Y/N)    Size: Value returned by sizeof

Decl                  An              *An             **An            ***An
                      Cmp Bad Size    Cmp Bad Size    Cmp Bad Size    Cmp Bad Size
int A1[3][5]          Y   N   60      Y   N   20      Y   N   4       N   -   -
int *A2[3][5]         Y   N   120     Y   N   40      Y   N   8       Y   Y   4
int (*A3)[3][5]       Y   N   8       Y   Y   60      Y   Y   20      Y   Y   4
int *(A4[3][5])       Y   N   120     Y   N   40      Y   N   8       Y   Y   4
int (*A5[3])[5]       Y   N   24      Y   N   8       Y   Y   20      Y   Y   4
193
Machine-Level Programming V:
Advanced Topics
194
Machine-Level Programming V: Advance
Memory Layout
Buffer Overflow
▪ Vulnerability
▪ Protection
Unions
195
x86-64 Linux Memory Layout (not drawn to scale)

Stack
▪ Runtime stack (8MB limit)
▪ E.g., local variables

Heap
▪ Dynamically allocated as needed
▪ When call malloc(), calloc(), new()

Data
▪ Statically allocated data
▪ E.g., global vars, static vars, string constants

Text / Shared Libraries
▪ Executable machine instructions
▪ Read-only

(address-space figure, from high to low: Stack near 00007FFFFFFFFFFF, growing down with the 8MB limit;
Shared Libraries; Heap growing up; Data; Text starting at hex address 400000; down to 000000)
196
Memory Allocation Example (not drawn to scale)

char big_array[1L<<24];   /*  16 MB */
char huge_array[1L<<31];  /*   2 GB */

int global = 0;

int main ()
{
    void *p1, *p2, *p3, *p4;
    int local = 0;
    p1 = malloc(1L << 28);  /* 256 MB */
    p2 = malloc(1L << 8);   /* 256 B  */
    p3 = malloc(1L << 32);  /*   4 GB */
    p4 = malloc(1L << 8);   /* 256 B  */
    /* Some print statements ... */
}

Where does everything go?
(figure: Stack, Shared Libraries, Heap, Data, Text)
197
x86-64 Example Addresses not drawn to scale
00007F
Stack
address range ~2^47
Heap
local 0x00007ffe4d3be87c
p1 0x00007f7262a1e010
p3 0x00007f7162a1d010
p4 0x000000008359d120
p2 0x000000008359d010
big_array 0x0000000080601060
huge_array 0x0000000000601060
main() 0x000000000040060c
useless() 0x0000000000400590
Heap
Data
Text
000000
198
Machine-Level Programming V: Advance
Memory Layout
Buffer Overflow
▪ Vulnerability
▪ Protection
Unions
199
Carnegie Mellon
double fun(int i) {
    volatile struct_t s;
    s.d = 3.14;
    s.a[i] = 1073741824;   /* Possibly out of bounds */
    return s.d;
}

fun(0)  ->  3.14
fun(1)  ->  3.14
fun(2)  ->  3.1399998664856
fun(3)  ->  2.00000061035156
fun(4)  ->  3.14
fun(6)  ->  Segmentation fault

Explanation (stack layout, by 4-byte words; word i is the location accessed by fun(i)):
    6:  Critical State
    5:  ?
    4:  ?
    3:  d7 ... d4
    2:  d3 ... d0
    1:  a[1]          }  struct_t
    0:  a[0]          }
Words 0-1 hold the array a, words 2-3 hold d; for larger i the write lands beyond the struct.
201
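The struct_t definition is not shown in this deck; the layout in the figure (a[0] and a[1] in the two lowest words, d spread over the next two) matches the CS:APP example, so presumably (an assumption):

typedef struct {
    int a[2];     /* words 0 and 1 in the figure */
    double d;     /* words 2 and 3 (d0..d3, d4..d7) */
} struct_t;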
Such problems are a BIG deal
Generally called a “buffer overflow”
▪ when exceeding the memory size allocated for an array
Why a big deal?
▪ It’s the #1 technical cause of security vulnerabilities
▪ #1 overall cause is social engineering / user ignorance
Most common form
▪ Unchecked lengths on string inputs
▪ Particularly for bounded character arrays on the stack
▪ sometimes referred to as stack smashing
202
String Library Code
Implementation of Unix function gets()
/* Get string from stdin */
char *gets(char *dest)
{
int c = getchar();
char *p = dest;
while (c != EOF && c != '\n') {
*p++ = c;
c = getchar();
}
*p = '\0';
return dest;
}
void call_echo() {
echo();
}
unix>./bufdemo-nsp
Type a string:012345678901234567890123
012345678901234567890123
unix>./bufdemo-nsp
Type a string:0123456789012345678901234
Segmentation Fault
204
Buffer Overflow Disassembly
echo:
00000000004006cf <echo>:
4006cf: 48 83 ec 18 sub $0x18,%rsp
4006d3: 48 89 e7 mov %rsp,%rdi
4006d6: e8 a5 ff ff ff callq 400680 <gets>
4006db: 48 89 e7 mov %rsp,%rdi
4006de: e8 3d fe ff ff callq 400520 <puts@plt>
4006e3: 48 83 c4 18 add $0x18,%rsp
4006e7: c3 retq
call_echo:
4006e8: 48 83 ec 08 sub $0x8,%rsp
4006ec: b8 00 00 00 00 mov $0x0,%eax
4006f1: e8 d9 ff ff ff callq 4006cf <echo>
4006f6: 48 83 c4 08 add $0x8,%rsp
4006fa: c3 retq
205
Buffer Overflow Stack

Before call to gets:
    Stack Frame for call_echo
    Return Address (8 bytes)
    20 bytes unused
    buf              <- %rsp

/* Echo Line */
void echo()
{
    char buf[4];  /* Way too small! */
    gets(buf);
    puts(buf);
}

echo:
    subq $24, %rsp
    movq %rsp, %rdi
    call gets
    . . .
206
Buffer Overflow Stack Example

Before call to gets:
    Stack Frame for call_echo
    Return Address (8 bytes):  00 00 00 00 00 40 06 f6
    20 bytes unused
    buf  [3][2][1][0]          <- %rsp

void echo()
{
    char buf[4];
    gets(buf);
    . . .
}

echo:
    subq $24, %rsp
    movq %rsp, %rdi
    call gets
    . . .

call_echo:
    . . .
    4006f1: callq 4006cf <echo>
    4006f6: add   $0x8,%rsp
    . . .
207

Buffer Overflow Stack Example #1

After call to gets (input "01234567890123456789012", 23 characters plus the terminating '\0'):
    Stack Frame for call_echo
    Return Address (8 bytes):  00 00 00 00 00 40 06 f6
    20 bytes unused:           00 32 31 30
                               39 38 37 36
                               35 34 33 32
                               31 30 39 38
                               37 36 35 34
    buf:                       33 32 31 30   <- %rsp

unix>./bufdemo-nsp
Type a string:01234567890123456789012
01234567890123456789012

unix>./bufdemo-nsp
Type a string:0123456789012345678901234
Segmentation Fault

unix>./bufdemo-nsp
Type a string:012345678901234567890123
012345678901234567890123
211
Code Injection Attacks
Stack after call to gets()
void P(){
P stack frame
Q(); return
... address
} A B
213
Example: the original Internet worm (1988)
Exploited a few vulnerabilities to spread
▪ Early versions of the finger server (fingerd) used gets() to read the
argument sent by the client:
▪ finger [email protected]
▪ Worm attacked fingerd server by sending phony argument:
▪ finger “exploit-code padding new-return-
address”
▪ exploit code: executed a root shell on the victim machine with a
direct TCP connection to the attacker.
Once on a machine, scanned for other machines to attack
▪ invaded ~6000 computers in hours (10% of the Internet ☺ )
▪ see June 1989 article in Comm. of the ACM
▪ the young author of the worm was prosecuted…
▪ and CERT was formed… still homed at CMU
214
Example 2: IM War
July, 1999
▪ Microsoft launches MSN Messenger (instant messaging system).
▪ Messenger clients can access popular AOL Instant Messaging Service
(AIM) servers
AIM
client
AIM
client
215
IM War (cont.)
August 1999
▪ Mysteriously, Messenger clients can no longer access AIM servers
▪ Microsoft and AOL begin the IM war:
▪AOL changes server to disallow Messenger clients
▪ Microsoft makes changes to clients to defeat AOL changes
▪ At least 13 such skirmishes
▪ What was really happening?
▪ AOL had discovered a buffer overflow bug in their own AIM clients
▪ They exploited it to detect and block Microsoft: the exploit code
returned a 4-byte signature (the bytes at some location in the AIM
client) to server
▪ When Microsoft changed code to match signature, AOL changed
signature location
216
Date: Wed, 11 Aug 1999 11:30:57 -0700 (PDT)
From: Phil Bucking <[email protected]>
Subject: AOL exploiting buffer overrun bug in their own software!
To: [email protected]

Mr. Smith,

Sincerely,
Phil Bucking
Founder, Bucking Consulting
[email protected]

It was later determined that this email originated from within Microsoft!
217
Aside: Worms and Viruses
Worm: A program that
▪ Can run by itself
▪ Can propagate a fully working version of itself to other computers
218
OK, what to do about buffer overflow attacks
Avoid overflow vulnerabilities
219
1. Avoid Overflow Vulnerabilities in Code (!)
/* Echo Line */
void echo()
{
char buf[4]; /* Way too small! */
fgets(buf, 4, stdin);
puts(buf);
}
220
2. System-Level Protections can help
Randomized stack offsets
▪ Stack repositioned each time program executes
(figure: a random pad above the exploit code makes the overwritten address B? hard to predict)
221

2. System-Level Protections can help

Nonexecutable code segments
▪ In traditional x86, can mark region of memory as either "read-only" or "writeable"
  ▪ Can execute anything readable
▪ x86-64 added explicit "execute" permission
▪ Stack marked as non-executable

(figure: stack after call to gets(): P stack frame, overwritten return address B,
pad and exploit code written by gets() into Q's stack frame)
222
3. Stack Canaries can help
Idea
▪ Place special value (“canary”) on stack just beyond buffer
▪ Check for corruption before exiting function
GCC Implementation
▪ -fstack-protector
▪ Now the default (disabled earlier)
unix>./bufdemo-sp
Type a string:0123456
0123456
unix>./bufdemo-sp
Type a string:01234567
*** stack smashing detected ***
223
Protected Buffer Disassembly
echo:
40072f: sub $0x18,%rsp
400733: mov %fs:0x28,%rax
40073c: mov %rax,0x8(%rsp)
400741: xor %eax,%eax
400743: mov %rsp,%rdi
400746: callq 4006e0 <gets>
40074b: mov %rsp,%rdi
40074e: callq 400570 <puts@plt>
400753: mov 0x8(%rsp),%rax
400758: xor %fs:0x28,%rax
400761: je 400768 <echo+0x39>
400763: callq 400580 <__stack_chk_fail@plt>
400768: add $0x18,%rsp
40076c: retq
224
Setting Up Canary

Before call to gets:
    Stack Frame for call_echo
    Return Address (8 bytes)
    20 bytes unused, with the Canary (8 bytes) at %rsp+8, just above buf
    buf              <- %rsp

/* Echo Line */
void echo()
{
    char buf[4];  /* Way too small! */
    gets(buf);
    puts(buf);
}

echo:
    . . .
    movq %fs:40, %rax    # Get canary
    movq %rax, 8(%rsp)   # Place on stack
    xorl %eax, %eax      # Erase canary
    . . .
225
Checking Canary

After call to gets (Input: 0123456):
    Stack Frame for call_echo
    Return Address (8 bytes)
    Canary (8 bytes) at %rsp+8, still intact
    [7..4]: 00 36 35 34   ("456" plus the terminating '\0')
    [3..0]: 33 32 31 30   buf  <- %rsp

/* Echo Line */
void echo()
{
    char buf[4];  /* Way too small! */
    gets(buf);
    puts(buf);
}

echo:
    . . .
    movq 8(%rsp), %rax      # Retrieve from stack
    xorq %fs:40, %rax       # Compare to canary
    je   .L6                # If same, OK
    call __stack_chk_fail   # FAIL
.L6:
    . . .
226
Return-Oriented Programming Attacks
Challenge (for hackers)
▪ Stack randomization makes it hard to predict buffer location
▪ Marking stack nonexecutable makes it hard to insert binary code
Alternative Strategy
▪ Use existing code
▪ E.g., library code from stdlib
▪ String together fragments to achieve overall desired outcome
▪ Does not overcome stack canaries
Construct program from gadgets
▪ Sequence of instructions ending in ret
▪Encoded by single byte 0xc3
▪ Code positions fixed from run to run
▪ Code is executable
227
Gadget Example #1
long ab_plus_c
(long a, long b, long c)
{
return a*b + c;
}
00000000004004d0 <ab_plus_c>:
4004d0: 48 0f af fe imul %rsi,%rdi
4004d4: 48 8d 04 17 lea (%rdi,%rdx,1),%rax
4004d8: c3 retq
228
Gadget Example #2
rdi rax
Gadget address = 0x4004dc
229
ROP Execution
Stack
Gadget n code c3
Gadget 2 code c3
%rsp
Gadget 1 code c3
230
Machine-Level Programming V: Advance
Memory Layout
Buffer Overflow
▪ Vulnerability
▪ Protection
Unions
231
Union Allocation
typedef union { u
float f;
f
unsigned u;
} bit_float_t; 0 4
233
Byte Ordering Revisited
Idea
▪ Short/long/quad words stored in memory as 2/4/8 consecutive bytes
▪ Which byte is most (least) significant?
▪ Can cause problems when exchanging binary data between machines
Big Endian
▪ Most significant byte has lowest address
▪ Sparc
Little Endian
▪ Least significant byte has lowest address
▪ Intel x86, ARM Android and IOS
Bi Endian
▪ Can be configured either way
▪ ARM
234
Byte Ordering Example
union {
unsigned char c[8];
unsigned short s[4];
unsigned int i[2];
unsigned long l[1];
} dw;
printf("Characters 0-7 ==
[0x%x,0x%x,0x%x,0x%x,0x%x,0x%x,0x%x,0x%x]\n",
dw.c[0], dw.c[1], dw.c[2], dw.c[3],
dw.c[4], dw.c[5], dw.c[6], dw.c[7]);
printf("Long 0 == [0x%lx]\n",
dw.l[0]);
236
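The initialization of dw is not shown; one way to produce the outputs listed on the next slides (an assumption) is to fill the eight bytes with the values 0xf0 through 0xf7:

int j;
for (j = 0; j < 8; j++)
    dw.c[j] = 0xf0 + j;   /* byte j holds 0xf0 + j regardless of endianness */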
Byte Ordering on IA32
Little Endian
f0 f1 f2 f3 f4 f5 f6 f7
c[0] c[1] c[2] c[3] c[4] c[5] c[6] c[7]
s[0] s[1] s[2] s[3]
i[0] i[1]
l[0]
LSB MSB LSB MSB
Print
Output:
Characters 0-7 == [0xf0,0xf1,0xf2,0xf3,0xf4,0xf5,0xf6,0xf7]
Shorts 0-3 == [0xf1f0,0xf3f2,0xf5f4,0xf7f6]
Ints 0-1 == [0xf3f2f1f0,0xf7f6f5f4]
Long 0 == [0xf3f2f1f0]
237
Byte Ordering on Sun
Big Endian
f0 f1 f2 f3 f4 f5 f6 f7
c[0] c[1] c[2] c[3] c[4] c[5] c[6] c[7]
s[0] s[1] s[2] s[3]
i[0] i[1]
l[0]
MSB LSB MSB LSB
Print
Output on Sun:
Characters 0-7 == [0xf0,0xf1,0xf2,0xf3,0xf4,0xf5,0xf6,0xf7]
Shorts 0-3 == [0xf0f1,0xf2f3,0xf4f5,0xf6f7]
Ints 0-1 == [0xf0f1f2f3,0xf4f5f6f7]
Long 0 == [0xf0f1f2f3]
238
Byte Ordering on x86-64
Little Endian
f0 f1 f2 f3 f4 f5 f6 f7
c[0] c[1] c[2] c[3] c[4] c[5] c[6] c[7]
s[0] s[1] s[2] s[3]
i[0] i[1]
l[0]
LSB MSB
Print
Output on x86-64:
Characters 0-7 == [0xf0,0xf1,0xf2,0xf3,0xf4,0xf5,0xf6,0xf7]
Shorts 0-3 == [0xf1f0,0xf3f2,0xf5f4,0xf7f6]
Ints 0-1 == [0xf3f2f1f0,0xf7f6f5f4]
Long 0 == [0xf7f6f5f4f3f2f1f0]
239
Summary of Compound Types in C
Arrays
▪ Contiguous allocation of memory
▪ Aligned to satisfy every element’s alignment requirement
▪ Pointer to first element
▪ No bounds checking
Structures
▪ Allocate bytes in order declared
▪ Pad in middle and at end to satisfy alignment
Unions
▪ Overlay declarations
▪ Way to circumvent type system
240
Program Optimization
241
Program Optimization
Overview
Generally Useful Optimizations
▪ Code motion/precomputation
▪ Strength reduction
▪ Sharing of common subexpressions
▪ Removing unnecessary procedure calls
Optimization Blockers
▪ Procedure calls
▪ Memory aliasing
Exploiting Instruction-Level Parallelism
Dealing with Conditionals
242
Performance Realities
There’s more to performance than asymptotic complexity
243
Optimizing Compilers
Provide efficient mapping of program to machine
▪ register allocation
▪ code selection and ordering (scheduling)
▪ dead code elimination
▪ eliminating minor inefficiencies
Don’t (usually) improve asymptotic efficiency
▪ up to programmer to select best overall algorithm
▪ big-O savings are (often) more important than constant factors
▪ but constant factors also matter
Have difficulty overcoming “optimization blockers”
▪ potential memory aliasing
▪ potential procedure side-effects
244
Limitations of Optimizing Compilers
Operate under fundamental constraint
▪ Must not cause any change in program behavior
▪ Except possibly when the program makes use of nonstandard language features
▪ Often prevents it from making optimizations that would only affect behavior
under pathological conditions.
Behavior that may be obvious to the programmer can be obfuscated by
languages and coding styles
▪ e.g., Data ranges may be more limited than variable types suggest
Most analysis is performed only within procedures
▪ Whole-program analysis is too expensive in most cases
▪ Newer versions of GCC do interprocedural analysis within individual files
▪ But, not between code in different files
Most analysis is based only on static information
▪ Compiler has difficulty anticipating run-time inputs
When in doubt, the compiler must be conservative
245
Generally Useful Optimizations
Optimizations that you or the compiler should do regardless
of processor / compiler
Code Motion
▪ Reduce frequency with which computation performed
▪ If it will always produce same result
▪ Especially moving code out of loop
void set_row(double *a, double *b, long i, long n)
{
    long j;
    for (j = 0; j < n; j++)
        a[n*i+j] = b[j];
}

After code motion:

    long j;
    int ni = n*i;
    for (j = 0; j < n; j++)
        a[ni+j] = b[j];
246
Compiler-Generated Code Motion (-O1)
void set_row(double *a, double *b, long i, long n)
{
    long j;
    for (j = 0; j < n; j++)
        a[n*i+j] = b[j];
}

Equivalent to:

    long j;
    long ni = n*i;
    double *rowp = a+ni;
    for (j = 0; j < n; j++)
        *rowp++ = b[j];
set_row:
testq %rcx, %rcx # Test n
jle .L1 # If 0, goto done
imulq %rcx, %rdx # ni = n*i
leaq (%rdi,%rdx,8), %rdx # rowp = A + ni*8
movl $0, %eax # j = 0
.L3: # loop:
movsd (%rsi,%rax,8), %xmm0 # t = b[j]
movsd %xmm0, (%rdx,%rax,8) # M[A+ni*8 + j*8] = t
addq $1, %rax # j++
cmpq %rcx, %rax # j:n
jne .L3 # if !=, goto loop
.L1: # done:
rep ; ret
247
Reduction in Strength
▪ Replace costly operation with simpler one
▪ Shift, add instead of multiply or divide
16*x --> x << 4
▪ Utility machine dependent
▪ Depends on cost of multiply or divide instruction
– On Intel Nehalem, integer multiply requires 3 CPU cycles
▪ Recognize sequence of products
for (i = 0; i < n; i++) {
    int ni = n*i;
    for (j = 0; j < n; j++)
        a[ni + j] = b[j];
}

becomes

int ni = 0;
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++)
        a[ni + j] = b[j];
    ni += n;
}
248
Share Common Subexpressions
▪ Reuse portions of expressions
▪ GCC will do this with –O1
249
Optimization Blocker #1: Procedure Calls
Procedure to Convert String to Lower Case
250
Lower Case Conversion Performance
(plot: CPU seconds vs. string length, 0 to 500000; lower1 grows quadratically, reaching roughly 200 CPU seconds at the largest lengths)
251
Convert Loop To Goto Form
void lower(char *s)
{
size_t i = 0;
if (i >= strlen(s))
goto done;
loop:
if (s[i] >= 'A' && s[i] <= 'Z')
s[i] -= ('A' - 'a');
i++;
if (i < strlen(s))
goto loop;
done:
}
252
Calling Strlen
/* My version of strlen */
size_t strlen(const char *s)
{
size_t length = 0;
while (*s != '\0') {
s++;
length++;
}
return length;
}
Strlen performance
▪ Only way to determine length of string is to scan its entire length, looking for
null character.
Overall performance, string of length N
▪ N calls to strlen
▪ Require times N, N-1, N-2, …, 1
▪ Overall O(N2) performance
253
Improving Performance
void lower(char *s)
{
size_t i;
size_t len = strlen(s);
for (i = 0; i < len; i++)
if (s[i] >= 'A' && s[i] <= 'Z')
s[i] -= ('A' - 'a');
}
254
Lower Case Conversion Performance
▪ Time doubles when double string length
▪ Linear performance of lower2
(plot: CPU seconds vs. string length, 0 to 500000; lower1's quadratic curve dwarfs lower2, which stays near zero)
255
Optimization Blocker: Procedure Calls
Why couldn’t compiler move strlen out of inner loop?
▪ Procedure may have side effects
▪ Alters global state each time called
▪ Function may not return same value for given arguments
▪ Depends on other parts of global state
▪ Procedure lower could interact with strlen
Warning:
▪ Compiler treats procedure call as a black box
▪ Weak optimizations near them
Remedies:
▪ Use of inline functions
  ▪ GCC does this with -O1
    - Within single file
▪ Do your own code motion

/* strlen variant with a side effect: updates a global counter */
size_t lencnt = 0;
size_t strlen(const char *s)
{
    size_t length = 0;
    while (*s != '\0') {
        s++; length++;
    }
    lencnt += length;
    return length;
}
256
Memory Matters
Memory Matters

/* Sum rows of n X n matrix a and store in vector b */
void sum_rows1(double *a, double *b, long n) {
    long i, j;
    for (i = 0; i < n; i++) {
        b[i] = 0;
        for (j = 0; j < n; j++)
            b[i] += a[i*n + j];
    }
}

double A[9] =
    { 0,  1,  2,
      4,  8, 16,
     32, 64, 128};

Value of B — evidently B points into A (B = A+3), so each write to b[i] also changes a:
    init:  [4, 8, 16]
    i = 0: [3, 8, 16]
259
Optimization Blocker: Memory Aliasing
Aliasing
▪ Two different memory references specify single location
▪ Easy to have happen in C
▪ Since allowed to do address arithmetic
▪ Direct access to storage structures
▪ Get in habit of introducing local variables
▪ Accumulating within loops
▪ Your way of telling compiler not to check for aliasing
260
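The standard remedy (CS:APP calls it sum_rows2; sketched here) accumulates each row sum in a local variable, so the compiler no longer has to assume that writing b[i] might change a row of a:

/* Sum rows of n X n matrix a and store in vector b, without the aliasing penalty */
void sum_rows2(double *a, double *b, long n)
{
    long i, j;
    for (i = 0; i < n; i++) {
        double val = 0;          /* local accumulator */
        for (j = 0; j < n; j++)
            val += a[i*n + j];
        b[i] = val;              /* one write per row */
    }
}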
Exploiting Instruction-Level Parallelism
Need general understanding of modern processor design
▪ Hardware can execute multiple instructions in parallel
Performance limited by data dependencies
Simple transformations can yield dramatic performance
improvement
▪ Compilers often cannot make these transformations
▪ Lack of associativity and distributivity in floating-point arithmetic
261
Benchmark Example: Data Type for Vectors
2500
2000
psum1
1500 Slope = 9.0
Cycles
1000
psum2
500 Slope = 6.0
0
0 50 100 150 200
Elements
264
Benchmark Performance
void combine1(vec_ptr v, data_t *dest)
{
long int i; Compute sum or
*dest = IDENT; product of vector
for (i = 0; i < vec_length(v); i++) { elements
data_t val;
get_vec_element(v, i, &val);
*dest = *dest OP val;
}
}
265
Basic Optimizations
266
Effect of Basic Optimizations
Functional
Branch Arith Arith Arith Load Store
Units
Operation Results
Addr. Addr.
Data Data
Data
Cache
Execution
268
Superscalar Processor
Definition: A superscalar processor can issue and execute
multiple instructions in one cycle. The instructions are retrieved
from a sequential instruction stream and are usually scheduled
dynamically.
269
Pipelined Functional Units

long mult_eg(long a, long b, long c) {
    long p1 = a*b;
    long p2 = a*c;
    long p3 = p1 * p2;
    return p3;
}

(pipeline diagram, 3 stages, cycles 1-7: a*b enters Stage 1 in cycle 1, a*c in cycle 2;
p1*p2 must wait for both results before entering Stage 1, finishing by cycle 7 —
a new multiply can start every cycle even though each one takes three stages)
271
x86-64 Compilation of Combine4
Inner Loop (Case: Integer Multiply)
.L519: # Loop:
imull (%rax,%rdx,4), %ecx # t = t * d[i]
addq $1, %rdx # i++
cmpq %rdx, %rbp # Compare length:i
jg .L519 # If >, goto Loop
272
Combine4 = Serial Computation (OP = *)

Computation (length=8):
    ((((((((1 * d[0]) * d[1]) * d[2]) * d[3])
        * d[4]) * d[5]) * d[6]) * d[7])

Sequential dependence
▪ Performance: determined by latency of OP
(data-flow figure: a single chain of * operations, each consuming the previous result and the next d[i])
273
Loop Unrolling (2x1)
void unroll2a_combine(vec_ptr v, data_t *dest)
{
long length = vec_length(v);
long limit = length-1;
data_t *d = get_vec_start(v);
data_t x = IDENT;
long i;
/* Combine 2 elements at a time */
for (i = 0; i < limit; i+=2) {
x = (x OP d[i]) OP d[i+1];
}
/* Finish any remaining elements */
for (; i < length; i++) {
x = x OP d[i];
}
*dest = x;
}
275
Loop Unrolling with Reassociation (2x1a)

void unroll2aa_combine(vec_ptr v, data_t *dest)
{
    long length = vec_length(v);
    long limit = length-1;
    data_t *d = get_vec_start(v);
    data_t x = IDENT;
    long i;
    /* Combine 2 elements at a time */
    for (i = 0; i < limit; i+=2) {
        x = x OP (d[i] OP d[i+1]);
    }
    /* Finish any remaining elements */
    for (; i < length; i++) {
        x = x OP d[i];
    }
    *dest = x;
}

Compare to before:
    x = (x OP d[i]) OP d[i+1];

Overall Performance
▪ N elements, D cycles latency/op
▪ (N/2+1)*D cycles: CPE = D/2
(data-flow figure: the d[i] OP d[i+1] products form a separate chain feeding the chain through x)
278
Loop Unrolling with Separate Accumulators (2x2)

void unroll2a_combine(vec_ptr v, data_t *dest)
{
    long length = vec_length(v);
    long limit = length-1;
    data_t *d = get_vec_start(v);
    data_t x0 = IDENT;
    data_t x1 = IDENT;
    long i;
    /* Combine 2 elements at a time */
    for (i = 0; i < limit; i+=2) {
        x0 = x0 OP d[i];
        x1 = x1 OP d[i+1];
    }
    /* Finish any remaining elements */
    for (; i < length; i++) {
        x0 = x0 OP d[i];
    }
    *dest = x0 OP x1;
}

What Now?
(data-flow figure: two independent OP chains, combined at the end)
281
Unrolling & Accumulating
Idea
▪ Can unroll to any degree L
▪ Can accumulate K results in parallel
▪ L must be multiple of K
Limitations
▪ Diminishing returns
Cannot go beyond throughput limitations of execution units
▪
▪ Large overhead for short lengths
▪ Finish off iterations sequentially
282
Unrolling & Accumulating: Double *
Case
▪ Intel Haswell
▪ Double FP Multiplication
▪ Latency bound: 5.00. Throughput bound: 0.50
FP * Unrolling Factor L
K 1 2 3 4 6 8 10 12
1 5.01 5.01 5.01 5.01 5.01 5.01 5.01
Accumulators
283
Unrolling & Accumulating: Int +
Case
▪ Intel Haswell
▪ Integer addition
▪ Latency bound: 1.00. Throughput bound: 1.00
Int + Unrolling Factor L
K 1 2 3 4 6 8 10 12
1 1.27 1.01 1.01 1.01 1.01 1.01 1.01
Accumulators
284
Achievable Performance
Method Integer Double FP
Operation Add Mult Add Mult
Best 0.54 1.01 1.01 0.52
Latency Bound 1.00 3.00 3.00 5.00
Throughput Bound 0.50 1.00 1.00 0.50
285
Programming with AVX2
YMM Registers
◼ 16 total, each 32 bytes
◼ 32 single-byte integers
◼ 16 16-bit integers
◼ 8 32-bit integers
◼ 8 single-precision floats
◼ 4 double-precision floats
◼ 1 single-precision float
◼ 1 double-precision float
286
SIMD Operations
◼ SIMD Operations: Single Precision
vaddps %ymm0, %ymm1, %ymm1
%ymm0
+ + + + + + + +
%ymm1
287
Using Vector Instructions
Method Integer Double FP
Operation Add Mult Add Mult
Scalar Best 0.54 1.01 1.01 0.52
Vector Best 0.06 0.24 0.25 0.16
Latency Bound 0.50 3.00 3.00 5.00
Throughput Bound 0.50 1.00 1.00 0.50
Vec Throughput 0.06 0.12 0.25 0.12
Bound
288
What About Branches?
Challenge
▪ Instruction Control Unit must work well ahead of Execution Unit
to generate enough operations to keep EU busy
. . .
289
Modern CPU Design
(block diagram: the Instruction Control unit — Fetch Control, Instruction Cache, Instruction Decode,
Register File, Retirement Unit — turns instructions into operations and issues register updates;
the Execution engine contains Branch, Arith (x3), Load, and Store functional units, exchanging
addresses and data with the Data Cache, and feeding "Prediction OK?" results back to instruction control)
290
Branch Outcomes
▪ When encounter conditional branch, cannot determine where to continue
fetching
▪ Branch Taken: Transfer control to branch target
▪ Branch Not-Taken: Continue with next instruction in sequence
▪ Cannot resolve until outcome determined by branch/integer unit
. . . Branch Taken
404685: repz retq
291
Branch Prediction
Idea
▪ Guess which way branch will go
▪ Begin executing instructions at predicted position
▪ But don’t actually modify register or memory data
292
Branch Prediction Through Loop
401029: vmulsd (%rdx),%xmm0,%xmm0 Assume
40102d: add $0x8,%rdx vector length = 100
401031: cmp %rax,%rdx
401034: jne 401029 i = 98
Predict Taken (OK)
401029: vmulsd (%rdx),%xmm0,%xmm0
40102d: add $0x8,%rdx
401031: cmp %rax,%rdx
401034: jne 401029 i = 99
Predict Taken
401029: vmulsd (%rdx),%xmm0,%xmm0
(Oops)
40102d: add $0x8,%rdx
401031: cmp %rax,%rdx Read Executed
401034: jne 401029 i = 100 invalid
location
401029: vmulsd (%rdx),%xmm0,%xmm0
40102d: add $0x8,%rdx Fetched
401031: cmp %rax,%rdx
401034: jne 401029 i = 101
293
Branch Misprediction Invalidation
401029: vmulsd (%rdx),%xmm0,%xmm0 Assume
40102d: add $0x8,%rdx vector length = 100
401031: cmp %rax,%rdx
401034: jne 401029 i = 98
Predict Taken (OK)
401029: vmulsd (%rdx),%xmm0,%xmm0
40102d: add $0x8,%rdx
401031: cmp %rax,%rdx
401034: jne 401029 i = 99
Predict Taken
401029: vmulsd (%rdx),%xmm0,%xmm0
(Oops)
40102d: add $0x8,%rdx
401031: cmp %rax,%rdx
401034: jne 401029 i = 100
Invalidate
401029: vmulsd (%rdx),%xmm0,%xmm0
40102d: add $0x8,%rdx
401031: cmp %rax,%rdx
401034: jne 401029 i = 101
294
Branch Misprediction Recovery
401029: vmulsd (%rdx),%xmm0,%xmm0
40102d: add $0x8,%rdx
401031: cmp %rax,%rdx i= 99 Definitely not taken
401034: jne 401029
401036: jmp 401040
. . . Reload
401040: vmovsd %xmm0,(%r12) Pipeline
Performance Cost
▪ Multiple clock cycles on modern processor
▪ Can be a major performance limiter
295
Getting High Performance
Good compiler and flags
Don’t do anything stupid
▪ Watch out for hidden algorithmic inefficiencies
▪ Write compiler-friendly code
▪Watch out for optimization blockers:
procedure calls & memory references
▪ Look carefully at innermost loops (where most work is done)
296