Cloud Networking
K. K. Ramakrishnan
[email protected]
CS 208, Winter 2025
Tue-Thur 2 pm - 3:20 pm
Classes 2, 3: Virtualization in Operating Systems
Virtualization
• M. Rosenblum and T. Garfinkel, "Virtual Machine Monitors: Current
Technology and Future Trends," IEEE Computer, 38:5, May 2005.
• Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris,
Alex Ho, Rolf Neugebauer, Ian Pratt, and Andrew Warfield, "Xen and
the Art of Virtualization," Proc. of SOSP, October 2003.
Types of Interfaces
[Figure: interface layers from user space down to the VMM]
Paravirtualization
• Paravirtualization: the guest system is changed so that privileged,
sensitive instructions are redirected to the hypervisor, which retains
full control of the resources
– Redirect rather than ‘trap’
• Privileged instructions: storage protection setting, interrupt handling,
timer control, I/O, special processor status-setting instructions -
executed only in special privileged mode for the OS, not for user programs
• Guest OS modified to work with the hypervisor; the OS is aware
that it is virtualized
– Apps that run on top of the altered OS - no change required
• OS mods require work, but performance improves (a sketch of the
hypercall idea follows below)
– See: Crosby & Brown, ACM Queue 2006
(https://dl.acm.org/doi/pdf/10.1145/1189276.1189289)
• Paravirtualization is generally able to run on any system (no special
virtualization hardware needed)
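To make "redirect rather than trap" concrete, here is a toy, runnable C
model of the control transfer. All names (hypervisor_hypercall, the HC_*
operation numbers, guest_local_irq_disable) are hypothetical and do not
reflect the real Xen hypercall ABI; a real guest would enter the hypervisor
via a software interrupt or a hypercall page, not a function call.

#include <stdio.h>

/* Toy model: the paravirtualized guest kernel is compiled to call the
 * hypervisor explicitly (a "hypercall") instead of executing a
 * privileged instruction and relying on a trap. */

enum hc_op { HC_DISABLE_INTERRUPTS, HC_UPDATE_PTE };

/* Stands in for the hypervisor's hypercall entry point. */
static long hypervisor_hypercall(enum hc_op op, long arg)
{
    switch (op) {
    case HC_DISABLE_INTERRUPTS:
        printf("hypervisor: recording 'interrupts off' for this VM\n");
        return 0;
    case HC_UPDATE_PTE:
        printf("hypervisor: validating and applying PTE update %#lx\n", arg);
        return 0;
    }
    return -1;
}

/* Modified guest kernel code: instead of executing 'cli', which a
 * de-privileged guest cannot do, it invokes the hypervisor directly. */
static void guest_local_irq_disable(void)
{
    hypervisor_hypercall(HC_DISABLE_INTERRUPTS, 0);
}

int main(void)
{
    guest_local_irq_disable();
    hypervisor_hypercall(HC_UPDATE_PTE, 0xdeadb000L);
    return 0;
}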
A note on Operating System Rings
• On most OSs, Ring 0 is the level with the most
privileges; it interacts most directly with
physical hardware - CPU and memory.
• Special gates between rings are provided
to allow an outer ring to access an inner
ring's resources in a predefined manner,
as opposed to allowing arbitrary usage.
• Linux x86 ring usage: Linux uses only
rings 0 and 3: Ring 0 for the kernel,
Ring 3 for user space.
Hardware Assisted Virtualization
• Hardware-assisted virtualization: achieved by additional
functionality included in the CPU,
– specifically an additional execution mode called guest mode,
dedicated to virtual instances
– Requires specific hardware
• Intel & AMD realized full virtualization & paravirtualization were
major challenges and created new processor extensions: VT-x &
AMD-V
• Virtualization-aware hardware provides the support to build the
VMM and also ensures isolation of a guest OS
• A representative of this virtualization type is the Kernel-
based Virtual Machine (KVM); a minimal sketch of its userspace
API follows below
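As a concrete illustration, a minimal C sketch of reaching KVM from
userspace. The /dev/kvm ioctl interface shown (KVM_GET_API_VERSION,
KVM_CREATE_VM, KVM_CREATE_VCPU) is real, but error handling is
abbreviated; a working VMM would also map guest memory with
KVM_SET_USER_MEMORY_REGION and drive the vCPU with KVM_RUN.

#include <fcntl.h>
#include <linux/kvm.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
    int kvm = open("/dev/kvm", O_RDWR | O_CLOEXEC);
    if (kvm < 0) { perror("open /dev/kvm"); return 1; }

    /* Sanity-check the API version the kernel module speaks. */
    int version = ioctl(kvm, KVM_GET_API_VERSION, 0);
    printf("KVM API version: %d\n", version);

    /* Each VM is a file descriptor; vCPUs hang off the VM fd. */
    int vm   = ioctl(kvm, KVM_CREATE_VM, 0);   /* container for guest mode */
    int vcpu = ioctl(vm, KVM_CREATE_VCPU, 0);  /* one virtual CPU, id 0 */
    printf("vm fd=%d, vcpu fd=%d\n", vm, vcpu);

    close(vcpu); close(vm); close(kvm);
    return 0;
}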
[Figure: Type 1 hypervisor]
Binary Translation
[Figure: binary-translating VMM layered above the host OS and hardware]
True VMs vs. Paravirtualization
[Figure: a trap-based Type 1 VMM vs. redirect/hypercall from the guest in
paravirtualization, with user space above]
I/O Virtualization
More Details on Virtualization Functionality
Slides from Prof. Nael Abu-Ghazaleh @UCR in his OS Course
NUTS AND BOLTS
Full virtualization
• Idea: run guest operating systems unmodified
Example of Hypervisor Intervention:
Disable Interrupts
• Guest OS tries to disable interrupts
– the instruction is trapped by the VMM, which
makes a note that interrupts are disabled for that
VM (see the trap-and-emulate sketch below)
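A toy, runnable model of this trap-and-emulate path. struct vm_state and
handle_privileged_trap are illustrative names, not any real VMM's API;
the point is that the guest's attempt to clear the interrupt flag never
reaches the hardware, it only updates per-VM state.

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Per-VM state the VMM keeps for emulation. */
struct vm_state {
    int  id;
    bool virtual_if;   /* the guest's *virtual* interrupt flag */
};

/* Invoked when the de-privileged guest executes a privileged
 * instruction (modeled here as a mnemonic string) and the CPU traps. */
static void handle_privileged_trap(struct vm_state *vm, const char *insn)
{
    if (strcmp(insn, "cli") == 0) {
        /* Do not touch the real interrupt flag; just record that this
         * VM should receive no virtual interrupts for now. */
        vm->virtual_if = false;
        printf("VM %d: virtual interrupts disabled\n", vm->id);
    }
}

int main(void)
{
    struct vm_state vm = { .id = 1, .virtual_if = true };
    handle_privileged_trap(&vm, "cli");   /* guest attempted CLI */
    return 0;
}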
Xen VM interface: Memory
• Memory management
– Guest cannot install highest privilege
level segment descriptors; top end of
linear address space is not accessible
• Kernel direct mapping space
– Guest has direct (not trapped) read
access to hardware page tables; writes
are trapped and handled by the VMM
– Physical memory presented to guest is
not necessarily contiguous
Two Layers of Virtual Memory
[Figure: the guest app’s view of RAM (virtual addresses, pages 0-3) maps
onto the guest OS’s view of RAM (guest-physical addresses, known to the
guest OS), which in turn maps onto machine addresses
(0x00000000-0xFFFFFFFF, unknown to the guest OS); pages are rearranged at
each layer]
Guest’s Page Tables Are Invalid
• Guest OS page tables map virtual page
numbers (VPNs) to physical frame
numbers (PFNs)
• Problem: the guest is virtualized, doesn’t
actually know the true PFNs (locations)
– That true location is the machine frame number
(MFN)
– MFNs are known to the VMM and the host OS
• Guest page tables cannot be installed in cr3
– Map VPNs to PFNs, but the PFNs are incorrect
• How can the MMU translate addresses used
by the guest (VPNs) to MFNs?
Shadow Page Tables
• Solution: VMM creates shadow page tables
that map VPN -> MFN (as opposed to
VPN -> PFN); a sketch of the composition
follows below
• Guest page table (VPN -> PFN): maintained by the
guest OS; invalid for the MMU
  00 (0) -> 01 (1)
  01 (1) -> 10 (2)
  10 (2) -> 11 (3)
  11 (3) -> 00 (0)
• Shadow page table (VPN -> MFN): maintained by the
VMM; valid for the MMU
  00 (0) -> 10 (2)
  01 (1) -> 11 (3)
  10 (2) -> 00 (0)
  11 (3) -> 01 (1)
[Figure: the same four virtual pages laid out in physical memory per the
guest page table, and in machine memory per the shadow page table]
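A minimal, runnable C sketch of that composition, with table contents
matching the figure. guest_pt, p2m, and shadow are illustrative names;
p2m stands for the VMM's private PFN-to-MFN map, which the guest never
sees.

#include <stdio.h>

#define NPAGES 4

static unsigned guest_pt[NPAGES] = { 1, 2, 3, 0 };  /* VPN -> PFN (guest's) */
static const unsigned p2m[NPAGES] = { 1, 2, 3, 0 }; /* PFN -> MFN (VMM's)   */
static unsigned shadow[NPAGES];                     /* VPN -> MFN (MMU's)   */

int main(void)
{
    /* Shadow entry = p2m applied to the guest's mapping. Only the
     * shadow table is ever installed in cr3, so the MMU always sees
     * real machine frame numbers. */
    for (unsigned vpn = 0; vpn < NPAGES; vpn++) {
        shadow[vpn] = p2m[guest_pt[vpn]];
        printf("VPN %u -> PFN %u -> MFN %u\n",
               vpn, guest_pt[vpn], shadow[vpn]);
    }
    return 0;
}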
Building Shadow Tables
• Problem: how can the VMM maintain
consistent shadow page tables?
– The guest OS may modify its page tables at any
time
– Modifying the tables is a simple memory write, not
a privileged instruction
• Thus, there are no helpful CPU exceptions to trap this
action :(
• Solution: mark the hardware pages containing
the guest’s page tables as read-only
– If the guest updates a table, an exception is
generated
– VMM catches the exception, examines the faulting
write, and applies the corresponding update to the
shadow table (see the sketch below)
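Continuing the toy arrays from the previous sketch (again purely
illustrative), the handler the VMM would run when the guest's write to
its read-only page-table page faults might look like this:

/* Runs when the guest's write to its (read-only) page table traps
 * into the VMM: emulate the write, then resync the shadow entry.
 * guest_pt, p2m, and shadow are the arrays from the sketch above. */
static void shadow_write_fault(unsigned vpn, unsigned new_pfn)
{
    guest_pt[vpn] = new_pfn;       /* apply the guest's intended update */
    shadow[vpn]   = p2m[new_pfn];  /* keep VPN -> MFN coherent for MMU  */
}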
More VMM Tricks
• The VMM can play tricks with virtual
memory just like an OS can
• Ballooning:
– The VMM can page parts of a guest, or even an
entire guest, to disk
– A guest can be written to disk and brought
back online on a different machine!
• Deduplication:
– The VMM can share read-only pages between
guests
– Example: two guests both running Windows XP
Xen VM interface: CPU
• CPU
– Guest runs at lower privilege than VMM
– Exception handlers must be registered with
VMM
– Fast system call handler can be serviced
without trapping to VMM
• Allow direct calls from application to Guest OS,
rather than directing it through the VMM
– Hardware interrupts replaced by lightweight
event notification system
– Timer interface: both for real and virtual time
Details: CPU
• Frequent exceptions:
– Software interrupts for system calls
– Page faults
• Allow “guest” to register a ‘fast’
exception handler for system calls
that can be accessed directly by CPU in
ring 1, without switching to ring-0/Xen
– Handler is validated before installing in the
hardware exception table, to make sure
nothing executes with Ring 0 privilege
• Not used for page faults (the faulting address
lives in a privileged register, cr2, readable only
in Ring 0)
Xen VM interface: I/O
• I/O
– Virtual devices exposed as
asynchronous I/O rings to guests
– Event notification replaces interrupts
Details: I/O
• Xen does not emulate hardware devices
– Exposes device abstractions for simplicity
and performance
– I/O data transferred to/from guest via Xen
using shared-memory buffers
– Virtualized interrupts: light-weight event
delivery mechanism from Xen to the guest
• Update a bitmap in shared memory
• Optional call-back handlers registered by the guest
OS (a toy descriptor-ring sketch follows below)
[Figure: NIC I/O descriptor ring data structure]
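The asynchronous I/O rings are, in essence, single-producer,
single-consumer circular buffers in shared memory. Below is a minimal,
runnable C sketch under assumed names (struct io_ring, ring_put,
ring_get are illustrative, not Xen's actual ring macros).

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define RING_SIZE 8

/* One I/O request descriptor; a real ring would carry references to
 * the guest pages holding the data. */
struct io_req { uint32_t id; uint32_t len; };

/* Shared between the guest (producer of requests) and the driver side
 * (consumer); each side writes only its own index. Real code would
 * also need memory barriers between filling a slot and publishing. */
struct io_ring {
    volatile uint32_t prod, cons;
    struct io_req slots[RING_SIZE];
};

static bool ring_put(struct io_ring *r, struct io_req req)
{
    if (r->prod - r->cons == RING_SIZE) return false;   /* full  */
    r->slots[r->prod % RING_SIZE] = req;
    r->prod++;                 /* publish; would also raise an event */
    return true;
}

static bool ring_get(struct io_ring *r, struct io_req *out)
{
    if (r->cons == r->prod) return false;               /* empty */
    *out = r->slots[r->cons % RING_SIZE];
    r->cons++;
    return true;
}

int main(void)
{
    static struct io_ring ring;        /* stands in for shared memory */
    ring_put(&ring, (struct io_req){ .id = 1, .len = 1500 });

    struct io_req req;
    while (ring_get(&ring, &req))
        printf("request %u, %u bytes\n", req.id, req.len);
    return 0;
}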
VT-x: Motivation
• To solve the problem that some x86 instructions
cannot be virtualized (sensitive but unprivileged
instructions do not trap).
• Simplify VMM software by closing virtualization
holes by design:
– Ring compression
– Non-trapping instructions
– Excessive trapping
• Eliminate need for software virtualization (i.e.,
paravirtualization, binary translation).
VMX
• Virtual Machine Extensions define processor-
level support for virtual machines on the x86
platform by a new form of operation called VMX
operation.
• Kinds of VMX operation:
– root: VMM (hypervisor) runs in VMX root
operation
– non-root: Guest runs in VMX non-root
operation
• Eliminate de-privileging of Ring 0 for guest
OS.
Pre vs. Post VT-x
            without VT-x                       with VT-x
VMM         ring de-privileging of guest OS    VMM executes in VMX root mode
Guest OS    aware it is not at Ring 0          de-privileging eliminated; guest OS
                                               views itself as if it “runs directly
                                               on hardware”
VMX Transitions
• Transitions between VMX root operation and
VMX non-root operation.
• Kinds of VMX transitions:
– VM Entry: Transitions into VMX non-root
operation. Allows Guest OS to execute Priv.
instructions as if it is in Ring 0
– VM Exit: Transitions from VMX non-root
operation to VMX root operation.
• Registers and address space swapped in
one atomic operation.
VMX Transitions
[Figure: VM 1 ... VM n run in VMX non-root operation (apps in Ring 3,
guest kernels in Ring 0); a VM Exit transfers to VMX root operation,
where the hypervisor has access to privileged instructions, and
vmlaunch/vmresume re-enters a guest]
VMCS: VM Control Structure
The VMCS consists of six logical groups:
• Guest-state area: Processor state loaded on VM
entries from the guest-state area; saved into the guest-
state area on VM exits.
• Host-state area: Processor state loaded from the
host-state area on VM exits.
• VM-execution control fields: Fields controlling
processor operation in VMX non-root operation.
• VM-exit control fields: Fields that control VM exits.
• VM-entry control fields: Fields that control VM
entries.
• VM-exit information fields: Read-only fields that
receive information on VM exits describing the cause
of the exit. (A conceptual sketch of the grouping follows below.)
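Purely as an illustration of this grouping, a conceptual C sketch
follows. All field names are assumptions; the real VMCS region is
implementation-defined and must be accessed only through the
VMREAD/VMWRITE instructions, never as a plain struct.

#include <stdint.h>

/* Conceptual grouping only; not the real in-memory VMCS layout. */
struct vmcs_conceptual {
    struct {                      /* guest-state area */
        uint64_t rip, rsp, cr3;   /* loaded on VM entry, saved on VM exit */
    } guest_state;

    struct {                      /* host-state area */
        uint64_t rip, rsp, cr3;   /* loaded on VM exit */
    } host_state;

    uint32_t exec_controls;       /* VM-execution control fields */
    uint32_t exit_controls;       /* VM-exit control fields */
    uint32_t entry_controls;      /* VM-entry control fields */

    struct {                      /* VM-exit information (read-only) */
        uint32_t exit_reason;     /* why the exit occurred */
        uint64_t exit_qualification;
    } exit_info;
};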
MMU Virtualization with VT-x
VPID: Motivation
• First-generation VT-x forces a TLB flush on each
VMX transition.
• Performance loss on all VM exits.
• Performance loss on most VM entries
– Guest page tables are not always modified between
runs, so the flush is often unnecessary
• Better VMM software control of TLB
flushes is beneficial.
(A translation lookaside buffer is part of the chip’s memory-management
unit: the TLB caches the most recently used page-table entries.)
VPID: Virtual Processor Identifier
• 16-bit virtual-processor-ID field in the VMCS.
• Cached linear translations are tagged with the VPID
value.
• No flush of TLBs on VM entry or VM exit if VPID
is active.
• TLB entries of different virtual machines can
co-exist in the TLB.
Virtualizing Memory in Software
• Three abstractions of memory:
– Virtual address spaces (0-4GB): seen by the current guest process
– Virtual physical address spaces (0-4GB): seen by the guest OS -
virtual RAM, virtual devices, virtual ROM, virtual frame buffer
– Machine address space (0-4GB): the real RAM, devices, ROM, frame
buffer
Shadow Page Tables
• VMM maintains shadow page tables that
map guest-virtual pages directly to
machine pages.
• Guest modifications to V->P tables
synced to VMM V->M shadow page
tables.
– Guest OS page tables marked as read-only.
– Modifications of page tables by guest OS ->
trapped to VMM.
– Shadow page tables synced to the guest OS
tables.
Drawbacks: Shadow Page
Tables
• Under shadow paging, in order to provide
transparent MMU virtualization, the VMM
intercepts guest page table updates to keep the
shadow page tables coherent with the guest page
tables.
• Maintaining consistency between guest page
tables and shadow page tables leads to
overhead: VMM traps
• Loss of performance due to TLB flush on every
“world-switch”.
• Memory overhead due to shadow copying of guest
page tables.
[Figure: each guest page table is reached through the guest’s virtual
cr3, while the real cr3 is managed by the VMM]
Nested / Extended Page Tables
Advantages: EPT
• Simplified VMM design.
• Guest page table modifications need not be
trapped, hence VM exits reduced.
• Reduced memory footprint compared to
shadow page table algorithms.
Disadvantages: EPT
• A TLB miss is very costly, since the guest-physical-
to-machine translation needs an extra EPT walk for
each stage of the guest-virtual address translation
– e.g., with 4-level guest page tables and a 4-level EPT, a
worst-case walk can touch up to 24 memory locations,
vs. 4 natively
Virtual Appliances & Multi-Core
• Virtual appliance: pre-configured VM with OS/apps
pre-installed
– Just download and run (no need to install/configure)
– Software distribution using appliances
(see: “SnowFlock: Rapid Virtual Machine Cloning for Cloud Computing”, H.
Andrés Lagar-Cavilla et al., Eurosys 2009)
• Multi-core CPUs
– Run multiple VMs on multi-core systems
– Each VM assigned one or more vCPUs
– Mapping from vCPUs to physical CPUs
OS Virtualization
• Referred to as containers
– Solaris containers, BSD jails, Linux containers
Linux Containers (LXC)
Material courtesy of “Realizing Linux Containers” by Boden Russell, IBM
OS Mechanisms for LXC
• OS mechanisms for resource isolation and
management
• namespaces: process-based resource isolation
• cgroups: limits, prioritization, accounting, control
• chroot: change the apparent root directory for the
current running process and its children
– A program run in such a modified environment cannot name
files outside the designated directory tree (see the chroot
sketch after this list)
• Linux security modules, access control
• Tools (e.g., Docker) for easy management
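A minimal, runnable C sketch of chroot-based filesystem confinement.
chroot(2), chdir(2), and execl(3) are real calls; the program must run
as root, and "/srv/jail" is a hypothetical directory assumed to be
pre-populated with the files (e.g., a shell) the jailed process needs.

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    if (chroot("/srv/jail") != 0) { perror("chroot"); return 1; }
    if (chdir("/") != 0)          { perror("chdir");  return 1; }

    /* From here on, "/" refers to /srv/jail; paths outside the
     * jail can no longer be named. */
    execl("/bin/sh", "sh", (char *)0);   /* a shell copied into the jail */
    perror("execl");
    return 1;
}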
Linux Namespaces
• Namespace: restricts what a container can see
– Provides process-level isolation of global resources
• Processes have the illusion they are the only processes in
the system
• MNT: mount points, file systems (what files, dirs are
visible?)
• PID: what other processes are visible?
• NET: NICs, routing
• USER: what uids, gids are visible?
(A clone()-based namespace sketch follows below.)
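A minimal, runnable namespace demo in C (needs root or CAP_SYS_ADMIN):
clone(2) with CLONE_NEWPID and CLONE_NEWUTS places the child in fresh
PID and UTS namespaces, so it sees itself as PID 1 and can change the
hostname without affecting the host.

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

static char child_stack[1024 * 1024];

static int child(void *arg)
{
    (void)arg;
    sethostname("container", 9);                     /* private UTS ns */
    printf("in child: pid=%ld\n", (long)getpid());   /* prints 1 */
    return 0;
}

int main(void)
{
    /* The stack grows down, so pass the top of the buffer. */
    pid_t pid = clone(child, child_stack + sizeof(child_stack),
                      CLONE_NEWPID | CLONE_NEWUTS | SIGCHLD, NULL);
    if (pid < 0) { perror("clone"); return 1; }

    printf("in parent: child pid=%ld\n", (long)pid);
    waitpid(pid, NULL, 0);
    return 0;
}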
Proportional Share Scheduling
• Resource allocation
– Uses a variant of proportional-share scheduling
• Share-based scheduling:
– Assign each process a weight w_i (a “share”)
– E.g., CPU allocation is in proportion to the ‘share’
– Fairness: redistribute unused cycles to others in proportion to weight
– Examples: fair queuing, start-time fair queuing
• Hard limits: assign upper bounds (e.g., 30%), no
reallocation
• Credit-based: allocate credits every time period T; a process can
accumulate credits and burst up to its credit limit
– Can a process starve other processes? (A toy credit
simulation follows below.)
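A toy, runnable credit-based scheduler simulation in C. The policy and
all names are illustrative (this is not Xen's credit scheduler): every
period each task receives credits proportional to its weight, capped at
a burst limit, and each tick the task with the most credits runs.

#include <stdio.h>

#define NTASKS 3
#define PERIOD_TICKS 10
#define CREDIT_CAP 20

struct task { int weight, credits, ran; };

int main(void)
{
    struct task t[NTASKS] = { {1,0,0}, {2,0,0}, {3,0,0} };
    int total_weight = 1 + 2 + 3;

    for (int tick = 0; tick < 60; tick++) {
        if (tick % PERIOD_TICKS == 0)          /* refill each period */
            for (int i = 0; i < NTASKS; i++) {
                t[i].credits += PERIOD_TICKS * t[i].weight / total_weight;
                if (t[i].credits > CREDIT_CAP) t[i].credits = CREDIT_CAP;
            }

        /* Run the runnable task with the most remaining credits. */
        int best = -1;
        for (int i = 0; i < NTASKS; i++)
            if (t[i].credits > 0 && (best < 0 || t[i].credits > t[best].credits))
                best = i;
        if (best >= 0) { t[best].credits--; t[best].ran++; }
    }

    /* CPU time received tracks the weights 1:2:3, and the cap bounds
     * how far a bursting task can get ahead. */
    for (int i = 0; i < NTASKS; i++)
        printf("task %d (weight %d) ran %d ticks\n", i, t[i].weight, t[i].ran);
    return 0;
}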
Share-based Schedulers
Putting it all together
• Images: files/data for a container
– can run different distributions/apps on a host
• Linux security modules and access control
• Linux capabilities: per process privileges
Docker and Linux Containers
Docker
LXC Virtualization Using Docker
• Portable: docker images run anywhere Docker runs
• Docker decouples the LXC provider from operations
– uses virtual resources (LXC virtualization)
– e.g., rather than a fair share of the physical NIC,
containers use virtual NICs that are fair-shared
Docker Images and Use