Operating Systems
Outline
2 Virtual machines
4 Binary translation
5 Hardware-assisted virtualization
7 Final remarks
Confining code with legacy OSes
• Often want to confine code on legacy OSes
• Analogy: Firewalls
[Figure: a firewall shielding a hopelessly insecure server from attackers]
Using chroot
Escaping chroot
System call interposition
• Why not use ptrace or other debugging facilities to
control untrusted programs?
• Almost any “damage” must result from a system call
- delete files → unlink
- overwrite files → open/write
- attack over network → socket/bind/connect/send/recv
- leak private data → open/read/socket/connect/write . . .
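A minimal sketch of the policy side of interposition, assuming a monitor that is shown each system call's name and arguments before the call runs. The names `SANDBOX`, `PATH_SYSCALLS`, `NETWORK_SYSCALLS`, and `check_syscall` are hypothetical and chosen for illustration; this is not a ptrace API.

```python
import os.path

# Hypothetical policy: the untrusted program may only touch files
# under SANDBOX and may not use the network at all.
SANDBOX = "/tmp/sandbox"
PATH_SYSCALLS = {"open", "unlink"}
NETWORK_SYSCALLS = {"socket", "bind", "connect", "send", "recv"}

def check_syscall(name, args):
    """Return True if the monitor should let the system call proceed."""
    if name in NETWORK_SYSCALLS:
        return False
    if name in PATH_SYSCALLS:
        # Normalize the path so "../" cannot step outside the sandbox.
        path = os.path.normpath(os.path.join(SANDBOX, args[0]))
        return path == SANDBOX or path.startswith(SANDBOX + os.sep)
    return True  # e.g. read/write on descriptors that are already open
```

The check is purely lexical: it normalizes the string but does not resolve symlinks, so it only decides what the path looked like at the moment of the check.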
Limitations of syscall interposition
• Race conditions
- Remember difficulty of eliminating TOCTTOU bugs?
- Now imagine malicious application deliberately doing this
- Symlinks, directory renames (so “..” changes), . . .
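The symlink race can be reproduced deterministically. The sketch below assumes a POSIX system where unprivileged symlinks are allowed. `racy_open` and `demo` are hypothetical names, and the `between_check_and_use` callback stands in for an attacker who wins the race between the monitor's check and the kernel's use of the path.

```python
import os
import tempfile

def racy_open(path, allowed_dir, between_check_and_use=lambda: None):
    """Check a path, then open it: two separate steps an attacker can split."""
    # Time of check: resolve symlinks and make sure the target is allowed.
    real = os.path.realpath(path)
    if not real.startswith(os.path.realpath(allowed_dir) + os.sep):
        raise PermissionError(path)
    between_check_and_use()  # stands in for the attacker winning the race
    # Time of use: the path is resolved again, from scratch.
    with open(path) as f:
        return f.read()

def demo():
    with tempfile.TemporaryDirectory() as root:
        allowed = os.path.join(root, "allowed")
        os.mkdir(allowed)
        secret = os.path.join(root, "secret.txt")
        harmless = os.path.join(allowed, "harmless.txt")
        link = os.path.join(allowed, "link")
        with open(secret, "w") as f:
            f.write("private")
        with open(harmless, "w") as f:
            f.write("public")
        os.symlink(harmless, link)

        def swap():
            # Attacker retargets the symlink after the check has passed.
            os.remove(link)
            os.symlink(secret, link)

        return racy_open(link, allowed, swap)
```

Here `demo()` returns the contents of the file outside the allowed directory, even though the check itself was correct when it ran.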
Review: What is an OS
What if. . .
How do process abstraction & HW differ?
Process                            Hardware
---------------------------------  ------------------------------------
Non-privileged registers and       All registers and instructions
instructions
Virtual memory                     Both virtual and physical memory,
                                   MMU functions, TLB/page tables, etc.
Errors, signals                    Trap architecture, interrupts
File system, directories, files,   I/O devices accessed using
raw devices                        programmed I/O, DMA, interrupts
Virtual Machine Monitor
Old idea from the 1960s
VMM benefits
• Software compatibility
- VMMs can run pretty much all software
• Isolation
- Seemingly total data isolation between virtual machines
- Leverage hardware memory protection mechanisms
• Encapsulation
- Virtual machines are not tied to physical machines
- Checkpoint/migration
OS backwards compatibility
• Backward compatibility is bane of new OSes
- Huge effort required to innovate but not break
• Security considerations may make it impossible
- Choice: Close security hole and break apps or be insecure
• Example: Windows XP end of life imminent
- Eventually hardware running WinXP will die
- What to do with legacy WinXP applications?
- Not all applications will run on later Windows
- Given the number of WinXP applications, practically any OS
change will break something
if (OS == WinXP) . . .
• Solution: Use a VMM to run both WinXP and Win8
- Obvious for OS migration as well: Windows → Linux
Logical partitioning of servers
• Run multiple servers on same box (e.g., Amazon EC2)
- Ability to give away less than one machine:
modern CPUs are more powerful than most services need
- 0.10U rack space machine – less power, cooling, space, etc.
- Server consolidation trend: N machines → 1 real machine
• Isolation of environments
- Printer server doesn’t take down Exchange server
- Compromise of one VM can’t get at data of others (though in practice
not so simple because of side-channel attacks [Ristenpart])
• Resource management
- Provide service-level agreements
• Heterogeneous environments
- Linux, FreeBSD, Windows, etc.
Complete Machine Simulation
• Simplest VMM approach, used by bochs
• Build a simulation of all the hardware
- CPU – A loop that fetches each instruction, decodes it, simulates
its effect on the machine state
- Memory – Physical memory is just an array, simulate the MMU on
all memory accesses
- I/O – Simulate I/O devices, programmed I/O, DMA, interrupts
• x86 example:
- Give CPU an IDT that vectors back to VMM
- Look up trap vector in VM’s “virtual” IDT
- Push virtualized %cs, %eip, %eflags, on stack
- Switch to virtualized privileged mode
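The fetch/decode/simulate loop described for the CPU, with physical memory as a plain array, can be sketched with a toy interpreter. The four-instruction set and the `run` function are invented for illustration and have nothing to do with the real x86 encoding or with how Bochs is written.

```python
def run(program, max_steps=1000):
    """Interpret a toy instruction set one instruction at a time."""
    regs = {"pc": 0, "acc": 0}
    memory = [0] * 16                      # physical memory is just an array
    for _ in range(max_steps):
        op, arg = program[regs["pc"]]      # fetch
        regs["pc"] += 1
        if op == "loadi":                  # decode, then simulate the effect
            regs["acc"] = arg
        elif op == "add":
            regs["acc"] += memory[arg]
        elif op == "store":
            memory[arg] = regs["acc"]
        elif op == "jnz":
            if regs["acc"] != 0:
                regs["pc"] = arg
        elif op == "halt":
            return regs, memory
        else:
            raise ValueError(f"illegal instruction {op!r}")
    raise RuntimeError("step limit reached")
```

Every simulated instruction costs many host instructions, which is why this approach is simple but slow.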
Virtualizing memory
• Basic MMU functionality:
- OS manages physical memory (0. . . MAX MEM)
- OS sets up page tables mapping VA → PA
- CPU accesses to VA should go to PA (if paging off, PA = VA)
- Used for every instruction fetch, load, or store
[Figure: on a physical machine, the host PT maps a host virtual address to a host physical address; in a virtual machine, the shadow page table maps a guest virtual address to a host physical address]
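A minimal sketch of how a shadow page table relates to the two mappings involved, assuming single-level tables represented as dictionaries keyed by page number. `build_shadow`, `translate`, and the `vmm_map` argument (guest physical page to host physical page) are hypothetical names.

```python
PAGE = 4096

def build_shadow(guest_pt, vmm_map):
    """Compose guest VA -> guest PA with the VMM's guest PA -> host PA map."""
    shadow = {}
    for vpn, gpn in guest_pt.items():
        if gpn in vmm_map:               # page is currently backed by host memory
            shadow[vpn] = vmm_map[gpn]
    return shadow

def translate(shadow, vaddr):
    """What the hardware MMU does with the shadow table."""
    vpn, offset = divmod(vaddr, PAGE)
    return shadow[vpn] * PAGE + offset
```

The guest never sees `shadow`; it only edits `guest_pt`, so the VMM has to rebuild or patch the shadow entries whenever the guest's table changes.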
Shadow page tables
Shadow PT issues
• Hardware only ever sees shadow page table
- Guest OS only sees its own VM page table, never shadow PT
• Consider the following
- Guest OS has a page table T mapping V_U → P_U
- T itself resides at guest physical address P_T
- Another guest page table entry maps V_T → P_T
- VMM stores P_U in host physical address M_U and P_T in M_T
• What can VMM put in shadow page table?
- Safe to map V_T → M_T or V_U → M_U
• Not safe to map both simultaneously!
- If OS writes to P_T, may make V_U → M_U in shadow PT incorrect
- If OS reads/writes V_U, may require accessed/dirty bits to be
changed in P_T (hardware can only change shadow PT)
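One common way to handle the first hazard is to trap guest writes to its own page table and discard the affected shadow entry. The toy model below assumes that approach. `ShadowMMU` and its methods are hypothetical, and the write-protection fault itself is only represented by a method call.

```python
class ShadowMMU:
    """Toy model: keep a shadow table coherent by trapping guest PT writes."""

    def __init__(self, vmm_map):
        self.vmm_map = vmm_map   # guest physical page -> host physical page
        self.guest_pt = {}       # guest's table: virtual page -> guest physical page
        self.shadow = {}         # what the hardware would actually walk

    def guest_writes_pte(self, vpn, gpn):
        # Assumed to be reached through a write-protection fault: the
        # guest's store to its page table traps into the VMM.
        self.guest_pt[vpn] = gpn
        self.shadow.pop(vpn, None)   # drop the now-stale shadow entry

    def access(self, vpn):
        # A miss in the shadow table is a fault the VMM fills in lazily.
        if vpn not in self.shadow:
            self.shadow[vpn] = self.vmm_map[self.guest_pt[vpn]]
        return self.shadow[vpn]
```

The sketch ignores accessed/dirty bits, which need separate handling because the hardware only updates them in the shadow table.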
Illustration
[Figure: the two options for the shadow PT: map the page that holds guest page table T, or map the user page that T maps]
I/O device virtualization
• Types of communication
- Special instruction – in/out
- Memory-mapped I/O (PIO)
- Interrupts
- DMA
• Virtualization
- Make in/out and PIO trap into monitor
- Use tracing for memory-mapped I/O
- Run simulation of I/O device
• Simulation
- Interrupt – Tell CPU simulator to generate interrupt
- DMA – Copy data to/from physical memory of virtual machine
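A sketch of the trap-and-simulate path for `in`/`out`, assuming the monitor keeps a table from port number to device model. `ToyOutputPort`, `Monitor`, the port number, and the returned values are made up for illustration and do not model any real device.

```python
class ToyOutputPort:
    """Hypothetical device model: collects bytes written to its data port."""

    def __init__(self):
        self.output = bytearray()

    def port_out(self, port, value):
        self.output.append(value & 0xFF)

    def port_in(self, port):
        return 0                        # this toy device has nothing to read


class Monitor:
    def __init__(self):
        self.devices = {}               # port number -> device model

    def attach(self, ports, device):
        for p in ports:
            self.devices[p] = device

    def on_out_trap(self, port, value):
        """Called when a guest `out` instruction traps into the monitor."""
        dev = self.devices.get(port)
        if dev is not None:
            dev.port_out(port, value)   # writes to unclaimed ports are dropped

    def on_in_trap(self, port):
        """Called when a guest `in` instruction traps into the monitor."""
        dev = self.devices.get(port)
        return dev.port_in(port) if dev is not None else 0xFF
```

Interrupts and DMA would be additional methods on the device model that call back into the CPU simulator or copy into the VM's memory array.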
CPU virtualization requirements
• Need protection levels to run VMs and monitors
• All unsafe/privileged operations should trap
- Example: disable interrupt, access I/O dev, . . .
- x86 problem: popfl (different semantics in different rings)
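The popfl problem can be made concrete with a toy model of the interrupt-enable flag. The function below is a simplified sketch that covers only IF and ignores every other special case of the real instruction.

```python
IF = 1 << 9   # interrupt-enable flag in EFLAGS

def popf(eflags, popped, cpl, iopl=0):
    """Toy model of how popf treats IF; other flags are simply copied."""
    new = popped
    if cpl > iopl:
        # Not privileged enough: IF keeps its old value and no trap occurs.
        new = (popped & ~IF) | (eflags & IF)
    return new
```

A guest kernel running deprivileged (say at CPL 1) that tries to clear IF sees no fault and no effect, so the VMM never gets a chance to emulate the change.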
Binary translation
• Cannot directly execute guest OS kernel code on x86
- Can maybe execute most user code directly
- But how to get good performance on kernel code?
• Original VMware solution: binary translation
- Don’t run slow instruction-by-instruction emulator
- Instead, translate guest kernel code into code that runs in
fully-privileged monitor mode (actually CPL 1, so that the VMM
has its own exception stack)
• Challenges:
- Don’t know the difference between code and data
(guest OS might include self-modifying code)
- Translated code may not be the same size as original
- Prevent translated code from messing with VMM memory
- Performance, performance, performance, . . .
VMware binary translator
• VMware translates kernel dynamically (like a JIT)
- Start at guest eip
- Accumulate up to 12 instructions until next control transfer
- Translate into binary code that can run in VMM context
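The "accumulate until next control transfer" step can be sketched as follows, assuming guest code is already decoded into a list of instruction strings and `eip` is an index into that list. The `CONTROL_TRANSFER` set and `next_translation_unit` are hypothetical; a real translator decodes raw bytes.

```python
CONTROL_TRANSFER = {"jmp", "jge", "call", "ret", "iret"}
MAX_TU_INSTRUCTIONS = 12

def next_translation_unit(code, eip):
    """Collect instructions from `eip` up to a control transfer or the cap."""
    unit = []
    while eip < len(code) and len(unit) < MAX_TU_INSTRUCTIONS:
        insn = code[eip]
        unit.append(insn)
        eip += 1
        if insn.split()[0] in CONTROL_TRANSFER:
            break                      # unit ends at the control transfer
    return unit, eip
```

The returned `eip` is where the next unit would start if execution falls through.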
Control transfer
• All branches/jumps require indirection
• Original:
    isPrime:   mov  %edi, %ecx     # %ecx = %edi (a)
               mov  $2, %esi       # i = 2
               cmp  %ecx, %esi     # is i >= a?
               jge  prime          # jump if yes
               ...
• Translated:
    isPrime’:  mov  %edi, %ecx     # IDENT
               mov  $2, %esi
               cmp  %ecx, %esi
               jge  [takenAddr]    # JCC
               jmp  [fallthrAddr]
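One plausible reading of the indirection through `[takenAddr]` and `[fallthrAddr]` is a lookup from guest address to translated code, translating on first use. The sketch below assumes that reading. `TranslationCache` and its `translate` callback are hypothetical and are not taken from VMware's implementation.

```python
class TranslationCache:
    """Map guest code addresses to translated blocks, translating lazily."""

    def __init__(self, translate):
        self.translate = translate   # guest address -> translated block
        self.cache = {}
        self.misses = 0

    def lookup(self, guest_addr):
        """Resolve a branch target such as [takenAddr] or [fallthrAddr]."""
        if guest_addr not in self.cache:
            self.misses += 1
            self.cache[guest_addr] = self.translate(guest_addr)
        return self.cache[guest_addr]
```

Because translated code can differ in size from the original, guest addresses cannot be used as jump targets directly, which is why some such mapping is needed.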
Hardware-assisted virtualization
• Both Intel and AMD now have hardware support
- Different mechanisms, similar concepts
- This lecture covers AMD (see [AMD Vol 2], Ch. 15)
- For Intel details, see [Intel Vol 3c]
• VM-enabled CPUs support new guest mode
- This is separate from kernel/user modes in bits 0–1 of %cs
- Less privileged than host mode (where VMM runs)
- Some sensitive instructions trap in guest mode (e.g., load %cr3)
- Hardware keeps shadow state for many things (e.g., %eflags)
• Enter guest mode with vmrun instruction
- Loads state from hardware-defined 1-KiB VMCB data structure
• Various events cause EXIT back to host mode
- On EXIT, hardware saves state back to VMCB
VMCB control bits
• Intercept vector specifies what ops should cause EXIT
- One bit for each of %cr0–%cr15 to say trap on read
- One bit for each of %cr0–%cr15 to say trap on write
- 32 analogous bits for the debug registers (%dr0–%dr15)
- 32 bits for whether to intercept exception vectors 0–31
- Bits for various other events (e.g., NMI, SMI, ...)
- Bit to intercept writes to sensitive bits of %cr0
- 8 bits to intercept reads and writes of IDTR, GDTR, LDTR, TR
- Bits to intercept rdtsc, rdpmc, pushf, popf, vmrun, hlt, invlpg,
int, iret, in/out (to selected ports), . . .
• EXIT code and reason (e.g., which inst. caused EXIT)
• Other control values
- Pending virtual interrupt, event/exception injection
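The per-register intercept bits amount to a bit vector that is consulted on each sensitive operation. The sketch below uses hypothetical bit positions for the control-register intercepts; the authoritative layout is the one in [AMD Vol 2].

```python
# Hypothetical bit positions, for illustration only.
INTERCEPT_CR_READ_BASE = 0     # bits 0-15: reads of %cr0-%cr15
INTERCEPT_CR_WRITE_BASE = 16   # bits 16-31: writes of %cr0-%cr15

def _bit(cr, write):
    base = INTERCEPT_CR_WRITE_BASE if write else INTERCEPT_CR_READ_BASE
    return 1 << (base + cr)

def intercept_cr(vector, cr, write):
    """Return a new intercept vector that also traps the given CR access."""
    return vector | _bit(cr, write)

def causes_exit(vector, cr, write):
    """Would this control-register access cause an EXIT to host mode?"""
    return bool(vector & _bit(cr, write))
```

For example, a VMM using shadow page tables would set the write bit for %cr3 so that every guest address-space switch causes an EXIT.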
Guest state saved in VMCB
Hardware vs. Software virtualization
• HW VM makes implementing VMM much easier
- Avoids implementing binary translation (BT)
• Hardware VM is better at entering/exiting kernel
- E.g., Apache on Windows benchmark: one address space, lots of
syscalls, hardware VM does better [Adams]
- Apache on Linux w. many address spaces: lots of context switches,
tracing faults, etc.; software VM is faster [Adams]
• Fork with copy-on-write bad for both HW & BT
- [Adams] reports fork benchmark where BT-based virtualization
37× and HW-based 106× slower than native!
• Newer CPUs support nested paging
- Eliminates shadow PT & tracing faults, simplifies VMM
- Guests can now manipulate %cr3 w/o VM EXIT
- But dramatically increases cost of TLB misses
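The TLB-miss cost can be estimated with a short calculation. This is a worst-case sketch that assumes no page-walk caching: each guest page-table pointer, plus the final guest physical address, needs its own walk of the nested table. `nested_walk_refs` is a hypothetical helper name.

```python
def nested_walk_refs(guest_levels, nested_levels):
    """Worst-case memory references for one guest address translation."""
    # One nested walk per guest page-table level, plus one more for the
    # final guest physical address.
    nested_walks = (guest_levels + 1) * nested_levels
    # Plus one read of the guest page-table entry at each guest level.
    return nested_walks + guest_levels
```

With 4-level tables on both sides this gives 24 references per miss, compared with 4 for a native 4-level walk.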
ESX mem. mgmt. [Waldspurger]
• Virtual machines see virtualized physical memory
- Can let VMs use more “physical” memory than in machine
Sharing pages across VMs
Idle memory tax
• Need machine page? What VM to take it from?
• Normal proportional share scheme
- Reclaim from VM with lowest “shares-to-pages” (S/P) ratio
- If A & B both have S = 1, reclaim from larger VM
- If A has twice B’s share, can use twice the machine memory
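The normal proportional-share rule can be sketched in a few lines, assuming each VM is described by a (shares, pages) pair. `pick_victim` is a hypothetical name, and the sketch covers only the plain S/P rule, not the idle memory tax adjustment.

```python
def pick_victim(vms):
    """vms maps VM name -> (shares, pages); return the VM to reclaim from."""
    # Reclaim from the VM with the lowest shares-to-pages ratio.
    return min(vms, key=lambda name: vms[name][0] / vms[name][1])
```

With equal shares the larger VM has the lower ratio and loses a page; a VM with twice the shares can hold twice the pages before its ratio falls to the same level.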
Final thoughts
How to learn more about OSes
• Take CS240 – Advanced Topics in Operating Systems
- Class will bring you up to speed on OS research
- Read & discuss 18–25 research papers
- By the end, should be ready to do OS research
• Get involved in research!
• Lots of interesting OS work at Stanford
- Rosenblum – launched the virtual machine resurgence
- Lam – collective system, software for mobile devices
- Levis – seminal work on sensor nets & power management
- Engler – tools to find OS bugs automatically
- Boneh/Mitchell – lots of practical OS security work
- Mazières – done multiple new OSes, FSes, and distributed
systems. Applying OS ideas to browser, language security.