
Cloud Computing and

Cloud Networking
K. K. Ramakrishnan
[email protected]
CS 208, Winter 2025
Tue-Thur 2 pm - 3:20 pm
Class 2, 3
Virtualization in Operating Systems

Virtualization

• Virtualization: extend or replace an existing interface to mimic
  the behavior of another system
  – Introduced in 1970s: run legacy software on newer mainframe
    hardware
• Handle platform diversity by running apps in VMs
  – Portability and flexibility
Thanks to Prof. Prashant Shenoy, UMass CS, for many of the slides here
Papers to Read

• M. Rosenblum and T. Garfinkel, "Virtual Machine Monitors: Current
  Technology and Future Trends", IEEE Computer, 38(5), May 2005
• Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris,
  Alex Ho, Rolf Neugebauer, Ian Pratt, and Andrew Warfield, "Xen and
  the Art of Virtualization", Proc. of SOSP, October 2003
Types of Interfaces

• Different types of interfaces
  – Assembly instructions
  – System calls
  – APIs
• Depending on what is replaced/mimicked, we obtain different forms
  of virtualization
Types of Virtualization
[Figure: four software stacks compared]
  Physical machine: Applications / OS / Hardware
  Native VM:        Applications / Guest OS / VMM (Hypervisor) / Hardware
  Hosted VM:        Applications / Guest OS / VMM / Host OS / Hardware
  Dual-Mode VM:     Applications / Guest OS / VMM part in non-privileged
                    (user) mode / Host OS + VMM part in privileged mode /
                    Hardware

Virtual machines can be built in multiple ways:
1) With a VMM on top of the bare metal: the hypervisor is in
   privileged mode, and mediates access to CPU, disk, etc.
2) VMM runs on top of a Host OS.
3) VMM implemented in dual mode: part of the VMM in user mode, part
   in the kernel (the OS needs modifications).
Two broad types of hypervisors

• Type 1: Native (bare metal) – example of full/native virtualization
  – Hypervisor runs on top of the bare metal machine
  – e.g., VMware ESXi, Xen
• Type 2: Hosted virtual machines
  – Hypervisor is an emulator running on a host OS
  – e.g., KVM, QEMU
Types of Hypervisors

• Type 1: hypervisor runs on “bare metal”


• Type 2: hypervisor runs on a host OS (like hosted VMs)
– Guest OS runs inside hypervisor
• Both VM types act like real hardware from application
standpoint
Types of Virtualization

• Full/native Virtualization (Type 1)


– VM simulates “enough” hardware to allow an unmodified
guest OS to be run in isolation (isolation from one VM to
another is an important characteristic for users sharing a cloud
platform)
• Same hardware CPU
– IBM VM family, VMWare Workstation, Parallels, VirtualBox
• Emulation (Type 2)
– VMM emulates/simulates complete hardware
– Unmodified guest OS for a different 'PC' can be run
• Bochs, VirtualPC for Mac, QEMU
Protection through Isolation
• Excerpted from Johanna Ullrich, Edgar R. Weippl, in “The Cloud
Security Ecosystem”, 2015 (Elsevier)
• Hypervisor provides efficient, isolated ‘duplicate’ of
physical machine for VMs.
• To build an effective hypervisor, all sensitive instructions (those
  changing resource availability or configuration) must be privileged
  instructions (Popek & Goldberg, '74)
  – All sensitive instructions then cross (go through) the hypervisor,
    which is able to control VMs appropriately
  – This is full virtualization: the advantage is that the guest OS does
    not have to be adapted, i.e., it is unaware of its virtualized
    environment
• But instruction sets often need additional measures to be
  virtualizable → paravirtualization, binary translation &
  hardware-assisted virtualization
Hybrid organizations

[Figure: Xen layering – guest user space above the VMM]
• Hybrid hypervisors are popular: e.g., Xen
  – Mostly bare-metal virtualization
  – But a special domain (VM0/Dom0) keeps device drivers out of the VMM
  – Paravirtualization (PV) front-end drivers talk to Dom0
More on Types of virtualization
• Para-virtualization
– VMM does not simulate all of the hardware capabilities
– Use special API that a modified guest OS must use
– Hypercalls trapped by the Hypervisor and serviced through Dom0
– Xen, VMWare ESX Server
• OS-level virtualization
– OS allows multiple 'secure' virtual servers to be run
– Guest OS is the same as the host OS, but appears isolated
• apps see an isolated OS
– Solaris Containers, BSD Jails, Linux Vserver, Linux containers, Docker
• Application level virtualization
– Application is given its own copy of components that are not shared
• (E.g., own registry files, global objects) - VE prevents conflicts
– JVM, Rosetta on Mac (also emulation), WINE

Paravirtualization
• Paravirtualization: changes to the system in order to redirect
  privileged, sensitive instructions over to the hypervisor to regain
  full control of the resources
  – Redirect rather than 'trap'
• Privileged instructions: storage protection setting, interrupt handling,
  timer control, I/O, special processor status-setting instructions -
  executed only in a special privileged mode for the OS but not for user
  programs
• Guest OS modified to work with the hypervisor. The OS is aware
  that it is virtualized.
  – Apps that run on top of the altered OS - no change required
• OS mods require work; but performance improves
  – See: Crosby & Brown, ACM Queue 2006
    (https://dl.acm.org/doi/pdf/10.1145/1189276.1189289)
• Paravirtualization is generally able to run on any system (no special
  hardware support required)
A note on Operating System Rings
• On most OSs, Ring 0 is the level with most privileges and interacts
  most directly with physical hardware - CPU and memory
• Special gates between rings are provided to allow an outer ring to
  access an inner ring's resources in a predefined manner, as opposed
  to allowing arbitrary usage
• Linux x86 ring usage: the Linux kernel only uses rings 0 and 3:
  – Ring 0 for the kernel – can do anything; can run privileged
    instructions
  – Ring 3 for users
• Ring 3 cannot run several instructions or write to several registers
[Figure: protection rings; focus only on Ring 0 (kernel) and
Ring 3 (user space)]
How does Virtualization work?
• CPU supports kernel and user mode (ring0, ring3)
– There is a set of instructions that can only be executed in kernel mode
• I/O, change MMU settings etc -- sensitive instructions
– Privileged instructions: cause a trap when executed in user mode
• Result: type 1 virtualization is feasible if sensitive instruction
subset are handled by the VMM, since it runs in Privileged mode
• Intel 386: ignores sensitive instructions in user mode
• Recent (well – ‘old’ now) Intel/AMD CPUs have hardware support
– Intel VT, AMD SVM
– Intel® Virtualization Technology provides hardware assist to the
virtualization software, reducing its size, cost, and complexity.
• Create an environment where a VM and guest OS can run
• Hypervisor uses a hardware bitmap to specify which instructions
  should trap
• So, a sensitive instruction in the guest traps to the hypervisor
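To make trap-and-emulate concrete, here is a minimal C sketch of a
hypervisor's exit-dispatch loop. All names here (exit reasons, the vcpu
struct, the handlers) are invented for illustration and do not
correspond to any real VMM's API.

    /* Minimal trap-and-emulate dispatch sketch (hypothetical names). */
    #include <stdio.h>

    enum exit_reason { EXIT_IO, EXIT_MMU_UPDATE, EXIT_HALT };

    struct vcpu {
        unsigned long regs[16];  /* guest general-purpose registers  */
        enum exit_reason reason; /* why the guest trapped to the VMM */
    };

    /* Each handler emulates the privileged operation for the guest. */
    static void emulate_io(struct vcpu *v)         { (void)v; printf("emulate I/O\n"); }
    static void emulate_mmu_update(struct vcpu *v) { (void)v; printf("emulate MMU update\n"); }

    static int vmm_dispatch(struct vcpu *v)
    {
        switch (v->reason) {
        case EXIT_IO:         emulate_io(v);         return 1; /* resume guest */
        case EXIT_MMU_UPDATE: emulate_mmu_update(v); return 1;
        case EXIT_HALT:       return 0;                        /* stop the VM  */
        }
        return 0;
    }

    int main(void)
    {
        struct vcpu v = { .reason = EXIT_IO };
        while (vmm_dispatch(&v))   /* run-guest / handle-exit loop */
            v.reason = EXIT_HALT;  /* pretend the next exit is a halt */
        return 0;
    }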
Hardware Assisted Virtualization
• Hardware-assisted virtualization: achieved by additional
functionality included into CPU,
– specifically an additional execution mode called guest mode,
dedicated to virtual instances
– Requires specific hardware
• Intel & AMD - realized full virtualization & paravirtualization were
major challenges and created new processor extensions: VT-x &
AMD-V
• Virtualization-aware hardware provides the support to build the
  VMM and also ensures isolation of a guest OS
• A representative of this virtualization type is the Kernel-based
  Virtual Machine (KVM)
Type 1 hypervisor

• Unmodified OS is running in user mode


– But it thinks it is running in kernel mode (virtual kernel mode)
– privileged instructions trap; sensitive instructions → use VT to
  make them trap
– Hypervisor is the “real kernel”
• Upon trap, executes privileged operations
• emulates what the hardware would do

Binary Translation

• What did we do before hardware assist was available?


• VMware example
– Upon loading program: scans code for basic blocks
• (basic block: straight-line code sequence with no branches)
– If sensitive instructions, replace by VMware procedure
• Binary translation
– Conversion can be expensive: Cache the modified basic block
in VMware cache for subsequent reuse
• Execute; load next basic block etc.
Type 2 Hypervisor
[Figure: hosted VM stack – Applications / Hosted VM / VMM / Host OS /
Hardware]

• Type 2 hypervisors can work without virtualization


technology (VT) support
– Host OS responsible for executing privileged instructions and
is running in kernel mode (with privileges)

True VMs vs. Paravirtualization
[Figure: paravirtualized guest – user space unchanged; the kernel's
sensitive instructions are redirected to the hypervisor via hypercalls]
• Both type 1 and 2 hypervisors work on unmodified OS


• Paravirtualization: modify OS kernel to replace all
sensitive instructions with hypercalls
– OS behaves like a user program making system calls
– Hypervisor executes the privileged operation invoked by
hypercall.
Standard Virtual machine Interface


• Standardize the VM interface (VMI) so a kernel can run on bare
  hardware or on any hypervisor
  – Can help a VM run on bare hardware (like a Type 1)
  – Allows for VMs using binary translation (like a Type 2) – as if
    VMware were the OS once translation is performed
  – Allows for VMs to run on Xen (paravirtualization)
Memory virtualization
• OS manages page tables
– Creating a new pagetable is sensitive priv. operation -> traps to
hypervisor
• Hypervisor manages multiple OSs
  – Needs a second, shadow page table to perform two-level mapping:
  – A typical OS translates a VM's virtual pages to the VM's "physical" pages
  – The hypervisor maps those to actual machine pages in the shadow
    page table
  – Need to catch changes to the guest's page table (but a page-table
    write is not a privileged instruction)
  – So mark the page table read-only → page fault on update
• Paravirtualized systems instead use hypercalls to inform the
  hypervisor of page-table changes

I/O Virtualization

• Each guest OS thinks it “owns” the disk


• Hypervisor creates “virtual disks”
– Large empty files on the physical disk that appear as “disks” to
the guest OS
• Hypervisor converts block # to file offset for I/O
– But DMA needs physical addresses
• Hypervisor needs to do the translation to physical addresses

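The block-number-to-file-offset translation is plain arithmetic. Below
is a minimal C sketch, assuming a flat image file backs the virtual
disk; vdisk_read, SECTOR_SIZE, and the image name are all illustrative,
not any real hypervisor's API.

    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <fcntl.h>

    #define SECTOR_SIZE 512

    /* Guest asks for sector 'lba'; the VMM turns it into a file offset. */
    static ssize_t vdisk_read(int backing_fd, uint64_t lba,
                              void *buf, size_t nsectors)
    {
        off_t offset = (off_t)(lba * SECTOR_SIZE); /* block # -> file offset */
        return pread(backing_fd, buf, nsectors * SECTOR_SIZE, offset);
    }

    int main(void)
    {
        int fd = open("guest-disk.img", O_RDONLY); /* large file acting as disk */
        if (fd < 0) { perror("open"); return 1; }
        char sector[SECTOR_SIZE];
        if (vdisk_read(fd, 0, sector, 1) != SECTOR_SIZE)
            perror("vdisk_read");
        close(fd);
        return 0;
    }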
More Details on
Virtualization
Functionality
Slides from Prof. Nael Abu-Ghazaleh @UCR in his OS Course
NUTS AND BOLTS

Full virtualization
• Idea: run guest operating systems unmodified
• However, the hypervisor is the real privileged software
• When the OS executes a privileged instruction, trap to the
  hypervisor, which executes it on the OS's behalf
• This can be very expensive
• Also, subject to quirks of the architecture
  – Example: x86 fails silently if some privileged instructions
    execute without privilege (e.g., popf silently ignores changes to
    the interrupt-enable flag in user mode)


Example of Hypervisor Intervention:
Disable Interrupts
• Guest OS tries to disable interrupts
– the instruction is trapped by the VMM which
makes a note that interrupts are disabled for that
VM

• Interrupts arrive for that machine


– Buffered at the VMM layer until the guest OS
enables interrupts.

• Other interrupts are directed to VMs that


have not disabled them
• This action can be expensive
Binary translation--making full
virtualization practical
• Use binary translation to modify OS to
rewrite silent failure instructions
• More aggressive translation can be used
– Translate OS mode instructions to
equivalent VMM instructions
• Some operations still expensive
• Cache for future use
• Used by VMWare ESXi and Microsoft Virtual Server
• Performance on x86 typically ~80-95% of
native
Binary Translation Example

Guest OS assembly:
    do_atomic_operation:
        cli                      ; disable interrupts
        mov  eax, 1
        xchg eax, [lock_addr]    ; try to take the lock
        test eax, eax
        jnz  spinlock
        ...
        mov  [lock_addr], 0      ; release the lock
        sti                      ; re-enable interrupts
        ret

Translated assembly:
    do_atomic_operation:
        call [vmm_disable_interrupts]    ; cli replaced by a VMM call
        mov  eax, 1
        xchg eax, [lock_addr]
        test eax, eax
        jnz  spinlock
        ...
        mov  [lock_addr], 0
        call [vmm_enable_interrupts]     ; sti replaced by a VMM call
        ret

CLI: Clear Interrupt Flag; STI: Set Interrupt Flag
Paravirtualization
• Modify the Guest OS to make it aware
of the hypervisor
– Can avoid these tricky features
– OS is aware of the fact it is virtualized
• Can implement optimizations
• How does it Compare to binary
translation?
• Amount of code change?
– 1.36% of Linux, 0.04% for Windows
Hardware supported virtualization
(Intel VT-x, AMD-V)
• Hardware support for virtualization
– Intel® Virtualization Technology
hardware assist for virtualization
• Makes implementing VMMs much
simpler
• Streamlines communication between
VM and OS
• Removes the need for
paravirtualization/binary translation
• EPT: hardware support (extended page tables) that replaces software
  shadow page tables
Virtualization Tasks
• Virtualize hardware
– Memory hierarchy
– CPUs
– Devices
• Implement data and control transfer
between guests and hypervisor
• We’ll cover this by example – Xen paper
– Slides modified from presentation by
Jianmin Chen
Xen
• Design principles:
– Unmodified applications: essential
– Full-blown multi-task O/Ss: essential
– Paravirtualization: necessary for
performance and isolation
• Paul Barham, Boris Dragovic, Keir Fraser,
Steven Hand, Tim Harris, Alex Ho, Rolf
Neugebauer, Ian Pratt, and Andrew
Warfield, Xen and the art of Virtualization
, Proc. of SOSP, October 2003
Xen
[Figure: Xen architecture – Domain 0 alongside guest OS domains;
guest OS implementation summary]
Xen VM interface: Memory
• Memory management
– Guest cannot install highest privilege
level segment descriptors; top end of
linear address space is not accessible
• Kernel direct mapping space
– Guest has direct (not trapped) read
access to hardware page tables; writes
are trapped and handled by the VMM
– Physical memory presented to guest is
not necessarily contiguous
• 'Guest' here means the Guest OS
Two Layers of Virtual Memory
• Virtual address → physical address (the guest OS's page tables,
  known to the guest OS)
• Physical address → machine address (the host's view of RAM,
  unknown to the guest OS)
[Figure: three views of RAM – the guest app's virtual address space,
the guest OS's view of "physical" RAM, and the host's machine memory;
pages are placed differently at each layer]
Guest’s Page Tables Are Invalid
• Guest OS page tables map virtual page
numbers (VPNs) to physical frame
numbers (PFNs)
• Problem: the guest is virtualized, doesn’t
actually know the true PFNs (locations)
– That true location is the machine frame number
(MFN)
– MFNs are known to the VMM and the host OS
• Guest page tables cannot be installed in cr3
– Map VPNs to PFNs, but the PFNs are incorrect
• How can the MMU translate addresses used by the guest (VPNs) to MFNs?
Shadow Page Tables
• Solution: VMM creates shadow page tables that map VPN → MFN
  (as opposed to VPN → PFN)

  Guest page table (VPN → PFN)     Shadow page table (VPN → MFN)
     00 (0) → 01 (1)                  00 (0) → 10 (2)
     01 (1) → 10 (2)                  01 (1) → 11 (3)
     10 (2) → 11 (3)                  10 (2) → 00 (0)
     11 (3) → 00 (0)                  11 (3) → 01 (1)

• Guest page table: maintained by the guest OS; invalid for the MMU
• Shadow page table: maintained by the VMM; valid for the MMU
[Figure: virtual, physical, and machine memory with pages placed
according to the two tables]
Building Shadow Tables
• Problem: how can the VMM maintain
consistent shadow pages tables?
– The guest OS may modify its page tables at any
time
– Modifying the tables is a simple memory write, not
a privileged instruction
• Thus, there are no helpful CPU exceptions to trap this
action :(
• Solution: mark the hardware pages containing
the guest’s tables as read-only (guest page
table)
– If the guest updates a table, an exception is generated
– The VMM catches the exception, examines the faulting write, and
  applies the equivalent update to the shadow page table
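The write-protect trick above can be sketched in C. This is a toy
model, not any real VMM's code: tables are flat arrays, and names
like on_guest_pt_write are hypothetical.

    /* Guest page tables are mapped read-only; a guest write faults
     * into the VMM, which emulates it and updates the shadow too. */
    #include <stdint.h>
    #include <stdio.h>

    #define NPAGES 16

    static uint64_t guest_pt[NPAGES];   /* guest's VPN -> PFN entries           */
    static uint64_t shadow_pt[NPAGES];  /* VMM's VPN -> MFN entries (MMU-valid) */
    static uint64_t pfn_to_mfn[NPAGES]; /* VMM's private PFN -> MFN mapping     */

    /* Called from the write-protection fault handler on a guest PTE update. */
    static void on_guest_pt_write(uint64_t vpn, uint64_t new_pfn)
    {
        guest_pt[vpn] = new_pfn;              /* emulate the guest's write  */
        shadow_pt[vpn] = pfn_to_mfn[new_pfn]; /* keep the shadow consistent */
        /* ...then invalidate the stale TLB entry and resume the guest. */
    }

    int main(void)
    {
        for (uint64_t i = 0; i < NPAGES; i++)
            pfn_to_mfn[i] = NPAGES - 1 - i;   /* arbitrary PFN -> MFN layout */
        on_guest_pt_write(3, 5);
        printf("VPN 3 -> PFN %llu -> MFN %llu\n",
               (unsigned long long)guest_pt[3],
               (unsigned long long)shadow_pt[3]);
        return 0;
    }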
More VMM Tricks
• The VMM can play tricks with virtual
memory just like an OS can
• Ballooning:
– The VMM can page parts of a guest, or even an
entire guest, to disk
– A guest can be written to disk and brought
back online on a different machine!
• Deduplication:
– The VMM can share read-only pages between
guests
– Example: two guests both running Windows XP
Xen VM interface: CPU
• CPU
– Guest runs at lower privilege than VMM
– Exception handlers must be registered with
VMM
– Fast system call handler can be serviced
without trapping to VMM
• Allow direct calls from application to Guest OS,
rather than directing it through the VMM
– Hardware interrupts replaced by lightweight
event notification system
– Timer interface: both for real and virtual time
Details: CPU
• Frequent exceptions:
– Software interrupts for system calls
– Page faults
• Allow “guest” to register a ‘fast’
exception handler for system calls
that can be accessed directly by CPU in
ring 1, without switching to ring-0/Xen
– Handler is validated before installing in
hardware exception table: To make sure
nothing executed in Ring 0 privilege.
• Not used for Page Fault
Xen VM interface: I/O
• I/O
– Virtual devices exposed as
asynchronous I/O rings to guests
– Event notification replaces interrupts
Details: I/O 1
• Xen does not emulate hardware devices
– Exposes device abstractions for simplicity
and performance
– I/O data transferred to/from guest via Xen
using shared-memory buffers
– Virtualized interrupts: light-weight event delivery mechanism from
  Xen to the guest
  • Update a bitmap in shared memory
  • Optional call-back handlers registered by guest OS
NIC data structure
• NIC manages incoming and outgoing packets using circular queues
  (rings) of buffer descriptors
• Each slot in the ring contains the length and physical address of
  the buffer
• NIC registers (CPU accessible) indicate the portion of the ring
  available for transmission and reception
Details: I/O 2
• I/O Descriptor Ring:
Data Transfer: Descriptor Ring
• Descriptors are allocated by a
domain (guest) and accessible
from Xen
• Descriptors do not contain I/O
data; instead, point to data buffers
also allocated by domain (guest)
– Facilitate zero-copy transfers of I/O
data into a domain
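A descriptor ring of this kind can be sketched in C. The layout below
is only in the spirit of Xen's split-driver rings; the field names,
ring size, and omitted memory barriers are simplifications, not Xen's
actual ABI.

    #include <stdint.h>

    #define RING_SIZE 64                /* power of two so masking works */

    struct io_desc {
        uint64_t buf_addr;              /* machine address of a data buffer
                                           allocated by the guest domain  */
        uint32_t len;                   /* buffer length in bytes         */
        uint32_t id;                    /* request id echoed in response  */
    };

    struct io_ring {
        volatile uint32_t prod;         /* producer index (requests queued)  */
        volatile uint32_t cons;         /* consumer index (requests handled) */
        struct io_desc ring[RING_SIZE];
    };

    /* Guest-side enqueue: the descriptor points at the guest's buffer,
     * so the I/O data itself never passes through the ring (zero-copy).
     * A real implementation needs memory barriers around 'prod'. */
    static int ring_enqueue(struct io_ring *r, struct io_desc d)
    {
        if (r->prod - r->cons == RING_SIZE)
            return -1;                  /* ring full */
        r->ring[r->prod & (RING_SIZE - 1)] = d;
        r->prod++;                      /* publish after the descriptor write */
        return 0;
    }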
OS Porting Cost
• Number of lines of code modified
or added compared with original
x86 code base (excluding device
drivers)
– Linux: 2995 (1.36%)
– Windows XP: 4620 (0.04%)
• Re-writing of privileged routines;
• Removing low-level system
initialization code
Control Transfer
• Guest synchronously calls into VMM
– Explicit control transfer from guest O/S to VM
monitor/hypervisor, similar to system calls
– “hypercalls”
• VMM delivers notifications to guest O/S
– E.g. data from network or when an I/O device is
ready
– Asynchronous event mechanism; guest O/S does
not see hardware interrupts, only Xen
notifications
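A hypercall looks much like a system call from the guest kernel's
perspective. The sketch below is modeled on classic 32-bit Xen, which
trapped into the hypervisor with software interrupt 0x82; the register
conventions are illustrative, and HYPERCALL_MMU_UPDATE is a made-up
constant.

    #define HYPERCALL_MMU_UPDATE 1      /* hypothetical hypercall number */

    /* Trap from the guest kernel into the hypervisor, passing the
     * hypercall number and two arguments in registers (x86-specific). */
    static inline long hypercall2(long nr, long a1, long a2)
    {
        long ret;
        __asm__ volatile ("int $0x82"
                          : "=a" (ret)
                          : "a" (nr), "b" (a1), "c" (a2)
                          : "memory");
        return ret;
    }

    /* A paravirtualized kernel would replace a privileged page-table
     * write with something like:
     *     hypercall2(HYPERCALL_MMU_UPDATE, pte_ptr, pte_val);        */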
Event notification
• Pending events stored in per-domain
bitmask
– E.g. incoming network packet received
– Updated by Xen before invoking guest OS
handler
– Xen-readable flag may be set by a domain
• To defer handling, based on time or number of
pending requests
• Analogous to interrupt disabling
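In C, the per-domain bitmask scheme might look like the following toy
model. The struct layout and names are only in the spirit of Xen's
event channels, and real code would use atomic bit operations:

    #include <stdint.h>

    struct shared_info {
        volatile uint64_t evtchn_pending;  /* set by Xen: events to deliver */
        volatile uint64_t evtchn_mask;     /* set by guest: deferred events */
    };

    /* Xen side: mark event 'port' pending before upcalling the guest. */
    static void xen_notify(struct shared_info *s, unsigned port)
    {
        s->evtchn_pending |= (1ULL << port);
    }

    /* Guest side: handle all pending, unmasked events. */
    static void guest_poll_events(struct shared_info *s)
    {
        uint64_t ready = s->evtchn_pending & ~s->evtchn_mask;
        while (ready) {
            unsigned port = (unsigned)__builtin_ctzll(ready); /* lowest set bit */
            /* ...dispatch the callback registered for 'port'... */
            s->evtchn_pending &= ~(1ULL << port);
            ready &= ready - 1;
        }
    }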
Network Virtualization
• Each domain has 1+ network interfaces:
virtual interfaces (VIFs)
– Each VIF has 2 I/O rings (send, receive)
– Each direction also has rules of the form
(<pattern>,<action>) that are inserted by
domain 0 (management)
• Xen’s role: models/acts as a virtual
firewall + router (VFR) to which all
domain VIFs connect
Network Virtualization
• Packet transmission:
– Guest adds request to I/O ring
– Xen copies packet header, applies
matching filter rules
• E.g. change header - IP source address for
NAT
• No change to payload; pages with payload
must be pinned to physical memory until
DMA to physical NIC for transmission is
complete
– Round-robin packet scheduler
Network Virtualization
• Packet reception:
– Xen applies pattern-matching rules to determine
destination VIF
– Guest O/S required to exchange unused page
frame for each packet received
• Xen exchanges packet buffer for page frame
in VIF’s receive ring
• If no receive frame is available, the packet is
dropped
• Avoids Xen-to-guest copies; requires page
aligned receive buffers to be queued at VIF’s
receive ring
Disk virtualization
• Domain0 has access to physical
disks
– Currently: SCSI and IDE
• All other domains: virtual block
device (VBD)
– Created & configured by management
software at domain0
– Accessed via I/O ring mechanism
– Possible reordering by Xen based on
knowledge about disk layout
Disk virtualization
• Xen maintains translation tables
for each virtual block device (VBD)
– Used to map requests for VBD
(ID,offset) to corresponding physical
device and sector address
– Zero-copy data transfers take place
using DMA between memory pages
pinned by requesting domain
• Scheduling: batches of requests in
round-robin fashion across
domains
Evaluation
Microbenchmarks
• Stat, open, close, fork, exec, etc
• Xen shows overheads of up to 2x with
respect to native Linux
– (context switch across 16 processes; mmap
latency)
• VMware shows up to 20x overheads
– (context switch; mmap latencies)
• User Mode Linux shows up to 200x
overheads
– Fork, exec, mmap; better than VMware in
context switches
CPU Virtualization with VT-x

VT-x : Motivation
• To solve the problem that the x86 architecture
instructions cannot be virtualized.
• Simplify VMM software by closing virtualization
holes by design.
– Ring Compression
– Non-trapping instructions
– Excessive trapping
• Eliminate need for software virtualization (i.e.,
paravirtualization, binary translation).

VMX
• Virtual Machine Extensions define processor-
level support for virtual machines on the x86
platform by a new form of operation called VMX
operation.
• Kinds of VMX operation:
– root: VMM (hypervisor) runs in VMX root
operation
– non-root: Guest runs in VMX non-root
operation
• Eliminate de-privileging of Ring 0 for guest
OS.
Pre and Post VT-x
• Without VT-x: the VMM relies on ring de-privileging of the guest
  OS; the guest OS is aware it is not at Ring 0
• With VT-x: the VMM executes in VMX root mode; guest OS
  de-privileging is eliminated – the guest OS views itself as if it
  "runs directly on hardware"
VMX Transitions
• Transitions between VMX root operation and
VMX non-root operation.
• Kinds of VMX transitions:
– VM Entry: Transitions into VMX non-root
operation. Allows Guest OS to execute Priv.
instructions as if it is in Ring 0
– VM Exit: Transitions from VMX non-root
operation to VMX root operation.
• Registers and address space swapped in
one atomic operation.
[Figure: VMX transitions]
• VMX non-root operation: VMs 1..n, each with apps in Ring 3 and its
  guest OS in Ring 0; the guest OS has access to privileged instructions
• VMX root operation: the hypervisor runs in Ring 0 with access to
  privileged instructions
• VM Entry (vmlaunch/vmresume) moves from root to non-root operation;
  VM Exit moves back; each VM has its own VMCS (VM Control Structure)


VMCS: VM Control Structure
• Data structure to manage VMX non-root operation
and VMX transitions.
• Specifies guest OS state.
• Configured by VMM.
• Controls when VM exits occur.

• On VM entry, execution mode (guest state), dedicated


to virtual instances allows Guest OS to execute
privileged instructions without trapping to hypervisor
– Defined by VM-execution control fields that control
processor operation in VMX non-root operation

VMCS: VM Control Structure
The VMCS consists of six logical groups:
• Guest-state area: Processor state loaded on VM
entries from guest-state area; saved into guest-state
area on VM exits.
• Host-state area: Processor state loaded from the
host-state area on VM exits.
• VM-execution control fields: Fields controlling
processor operation in VMX non-root operation.
• VM-exit control fields: Fields that control VM exits.
• VM-entry control fields: Fields that control VM
entries.
• VM-exit information fields: Read-only fields that receive
  information on VM exits, describing the cause and nature of the exit.


CPU Virtualization with VT-x
[Figure: CPU virtualization with VT-x] Source: [2]
MMU Virtualization with VT-x

VPID: Motivation
• First generation VT-x forces TLB flush on each
VMX transition.
• Performance loss on all VM exits.
• Performance loss on most VM entries
  – Guest page tables are often unchanged, so the flush is unnecessary
• Better VMM software control of TLB
flushes is beneficial.
A translation lookaside buffer is part of the chip's memory-management unit:
TLB contains page table entries that have been most recently used.

VPID: Virtual Processor
Identifier
• 16-bit virtual-processor-ID field in the VMCS.
• Cached linear translations tagged with VPID
value.
• No flush of TLBs on VM entry or VM exit if VPID
active.
• TLB entries of different virtual machines can all
co-exist in the TLB.

Virtualizing Memory in Software
• Three abstractions of memory:
  – Virtual address spaces (0–4GB): what the current guest process sees
  – Virtual "physical" address spaces (0–4GB): RAM, devices, frame
    buffer, and ROM as seen by the guest OS
  – Machine address space (0–4GB): the actual RAM, devices, frame
    buffer, and ROM of the host
Shadow Page Tables
• VMM maintains shadow page tables that
map guest-virtual pages directly to
machine pages.
• Guest modifications to V->P tables
synced to VMM V->M shadow page
tables.
– Guest OS page tables marked as read-only.
– Modifications of page tables by guest OS ->
trapped to VMM.
– Shadow page tables synced to the guest OS tables
Drawbacks: Shadow Page
Tables
• Under shadow paging, in order to provide
transparent MMU virtualization, the VMM
intercepts guest page table updates to keep the
shadow page tables coherent with the guest page
tables.
• Maintaining consistency between guest page
tables and shadow page tables leads to
overhead: VMM traps
• Loss of performance due to TLB flush on every
“world-switch”.
• Memory overhead due to shadow copying of guest
page tables.
[Figure: setting CR3 under shadow paging – the guest OS loads its
"virtual CR3" to point at one of its guest page tables; the VMM
intercepts the load and sets the real CR3 to the corresponding
shadow page table]
Nested / Extended Page
Tables
• With the introduction of EPT, the VMM can now rely on
hardware to eliminate the need for shadow page
tables. This removes much of the overhead incurred
to keep the shadow page tables up-to-date.
• Extended page-table mechanism (EPT) is used to
support the virtualization of physical memory.
• Translates the guest-physical addresses used in VMX
non-root operation.
• Guest-physical addresses are translated by traversing
a set of EPT paging structures to produce physical
addresses that are used to access memory.
• REF: https://www.vmware.com/pdf/Perf_ESX_Intel-EPT-eval.pdf
Use of EPT
• guest operating system continues to maintain LPN-
>PPN mappings in the guest page tables, but
• VMM maintains PPN->MPN mappings in an
additional level of page tables, called nested page
tables.
• Both guest page tables and the nested page tables
are exposed to hardware.
• When a logical address is accessed, hardware walks the guest page
  tables as in native execution
  – for every PPN accessed during the guest page-table walk, hardware
    also walks the nested page tables to determine the corresponding MPN
• This composite translation eliminates the need to maintain shadow
  page tables and synchronize them with the guest page tables
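The two-dimensional walk can be sketched as follows. For brevity each
multi-level walk is collapsed into a single array lookup; in real
hardware the nested translation happens at every level of the guest
walk, which is why a TLB miss becomes more expensive under EPT.

    #include <stdint.h>
    #include <stdio.h>

    #define NPAGES 16

    static uint64_t guest_pt[NPAGES];  /* guest-maintained: LPN -> PPN */
    static uint64_t ept[NPAGES];       /* VMM-maintained:   PPN -> MPN */

    /* Hardware composes both tables: logical -> guest-physical -> machine. */
    static uint64_t translate(uint64_t lpn)
    {
        uint64_t ppn = guest_pt[lpn];  /* guest page-table walk */
        uint64_t mpn = ept[ppn];       /* nested (EPT) walk     */
        return mpn;
    }

    int main(void)
    {
        guest_pt[2] = 7;               /* guest maps LPN 2 -> PPN 7  */
        ept[7] = 12;                   /* VMM maps   PPN 7 -> MPN 12 */
        printf("LPN 2 -> MPN %llu\n", (unsigned long long)translate(2));
        return 0;
    }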
Nested / Extended Page Tables
[Figure: hardware walks the guest page tables and the nested (EPT)
page tables together to translate guest-virtual addresses to machine
addresses] Source: [4]
Advantages: EPT
• Simplified VMM design.
• Guest page table modifications need not be
trapped, hence VM exits reduced.
• Reduced memory footprint compared to
shadow page table algorithms.

Disadvantages: EPT
• TLB miss is very costly since guest-physical
address to machine address needs an extra
EPT walk for each stage of guest-virtual
address translation.

Virtual Appliances & Multi-Core
• Virtual appliance: pre-configured VM with OS/ apps
pre-installed
– Just download and run (no need to install/configure)
– Software distribution using appliances
(see: "SnowFlock: Rapid Virtual Machine Cloning for Cloud Computing", H.
Andrés Lagar-Cavilla et al., EuroSys 2009)
• Multi-core CPUs
– Run multiple VMs on multi-core systems
– Each VM assigned one or more vCPU
– Mapping from vCPUs to physical CPUs

• Today: Virtual appliances have evolved into Docker containers


Examples

• Application-level virtualization: "process virtual machine"
• VMM / hypervisor
OS Virtualization

• Emulate OS-level interface with native interface


• “Lightweight” virtual machines
– No hypervisor, OS provides necessary support

• Referred to as containers
– Solaris containers, BSD jails, Linux containers
Linux Containers (LXC)

• Containers share OS kernel of the host


– OS provides resource isolation
• Benefits
– Fast provisioning, bare-metal like performance, lightweight

Material courtesy of
“Realizing Linux
Containers” by Boden
Russell, IBM

OS Mechanisms for LXC
• OS mechanisms for resource isolation and
management
• namespaces: process-based resource isolation
• Cgroups: limits, prioritization, accounting, control
• chroot: change the apparent root directory for the
current running process and its children.
– Program run in such modified environment cannot name
files outside designated directory tree.
• Linux security module, access control
• Tools (e.g., docker) for easy management

Linux Namespaces
• Namespace: restrict what a container can see
– Provide process level isolation of global resources
• Processes have illusion they are the only processes in
the system
• MNT: mount points, file systems (what files, dir are
visible)?
• PID: what other processes are visible?
• NET: NICs, routing
• Users: what uid, gid are visible?

• chroot: change root directory


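A minimal sketch of namespace creation using the real clone(2) API:
the child below starts in fresh PID and mount namespaces, so it sees
itself as PID 1 (requires root; error handling abbreviated).

    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static char child_stack[1024 * 1024];   /* stack for the cloned child */

    static int child(void *arg)
    {
        (void)arg;
        /* Inside the new PID namespace this process is PID 1. */
        printf("in container: pid = %d\n", (int)getpid());
        return 0;
    }

    int main(void)
    {
        /* CLONE_NEWPID: new PID namespace; CLONE_NEWNS: new mount namespace. */
        pid_t pid = clone(child, child_stack + sizeof(child_stack),
                          CLONE_NEWPID | CLONE_NEWNS | SIGCHLD, NULL);
        if (pid < 0) { perror("clone"); return 1; }
        waitpid(pid, NULL, 0);
        return 0;
    }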
Linux cgroups
• Resource isolation
– what and how much can a container use?
• Set upper bounds (limits) on resources that can be used
• Fair sharing of certain resources
• Examples:
– cpu: weighted proportional share of CPU for a group
– cpuset: cores that a group can access
– block io: weighted proportional block IO access
– memory: max memory limit for a group

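For instance, with cgroup v2 (assuming it is mounted at the usual
/sys/fs/cgroup and the program runs as root), limits are just writes
to control files; the group name "demo" is arbitrary:

    #include <stdio.h>
    #include <sys/stat.h>

    static int write_file(const char *path, const char *val)
    {
        FILE *f = fopen(path, "w");
        if (!f) { perror(path); return -1; }
        fputs(val, f);
        return fclose(f);
    }

    int main(void)
    {
        mkdir("/sys/fs/cgroup/demo", 0755);
        /* "30000 100000": at most 30 ms of CPU every 100 ms (~30%). */
        write_file("/sys/fs/cgroup/demo/cpu.max", "30000 100000");
        /* Hard memory limit for the group. */
        write_file("/sys/fs/cgroup/demo/memory.max", "256M");
        /* Writing "0" moves the calling process into the group. */
        write_file("/sys/fs/cgroup/demo/cgroup.procs", "0");
        return 0;
    }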
Proportional Share Scheduling
• Resource allocation
– Uses a variant of proportional-share scheduling
• Share-based scheduling:
– Assign each process a weight w_i (a “share”)
– E.g., CPU Allocation is in proportion to the ‘share'
– fairness: redistribute unused cycles to others in proportion to weight
– Examples: fair queuing, start time fair queuing
• Hard limits: assign upper bounds (e.g., 30%), no
reallocation
• Credit-based: allocate credits every time T, can
accumulate credits, and can burst up-to credit limit
– can a process starve other processes?
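A toy computation of weighted proportional shares, where each process
i with weight w_i receives w_i / Σw of the CPU:

    #include <stdio.h>

    int main(void)
    {
        const char *name[] = { "A", "B", "C" };
        double w[] = { 1.0, 2.0, 5.0 };          /* per-process shares */
        double total = 0;
        for (int i = 0; i < 3; i++) total += w[i];
        for (int i = 0; i < 3; i++)
            printf("%s gets %.1f%% of the CPU\n",
                   name[i], 100.0 * w[i] / total);
        return 0;   /* A: 12.5%, B: 25.0%, C: 62.5% */
    }

Under fair sharing, cycles left unused by one process would be
redistributed to the others in proportion to their weights.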
Share-based Schedulers

Putting it all together
• Images: files/data for a container
– can run different distributions/apps on a host
• Linux security modules and access control
• Linux capabilities: per process privileges

Docker and Linux Containers

• Linux containers are a set of kernel features


– Need user space tools to manage containers
– Virtuozzo, OpenVZ, VServer, lxc-tools, Warden, Docker
• What does Docker add to Linux containers?
– Portable container deployment across machines
– Application-centric: geared for app deployment
– Automatic builds: create containers from build files
– Component re-use
• Docker containers are self-contained: no dependencies

Docker

• Docker uses Linux containers

LXC Virtualization Using Docker
• Portable: docker images run anywhere Docker runs
• Docker decouples the LXC provider from operations
  – uses virtual resources (LXC virtualization)
  – e.g., fair share of the physical NIC vs. virtual NICs that are
    fair-shared

Docker Images and Use

• Docker uses a union file system (AuFS)


– allows containers to use host FS safely
• Essentially a copy-on-write file system
– read-only files shared (e.g., share glibc)
– make a copy upon write
• Allows for small efficient container images
• Docker Use Cases
– “Run once, deploy anywhere”
– Images can be pulled/pushed to repository
– Containers can be a single process (useful for
microservices) or a full OS
