M2 Today

The document provides an overview of virtualization, including its definition, components, and concepts such as processes, memory management, and CPU context. It discusses the differences between user mode and kernel mode, address translation, and the role of the operating system in managing these processes. Additionally, it highlights the benefits of virtualization in cloud computing, particularly in Infrastructure as a Service (IaaS).

Agenda

01 Virtualization
02 Virtual Machine
03 Hardware Virtualization
04 Hypervisor
05 OpenFlow
06 Containerization and Docker
07 Docker Components
08 Foundation Engine


Source Content Credits:

• Prof. Rajkumar Buyya (University of Melbourne, Australia)
• Prof. Mythili Vutukuru (IIT Bombay)
• Dr. Rajiv Ranjan and Devki Nandan Jha (Computing Science, Newcastle University, and University of Oxford, UK)

Copyright IntelliPaat, all rights reserved.


Virtualization

About Virtualization

Virtualization, in layman's terms, refers to running multiple operating systems on a single machine. While most computers have only one operating system installed, virtualization software allows a computer to run several operating systems on top of the same physical machine. Each self-contained virtual machine is isolated from the other VMs: software executed on a virtual machine is separated from the underlying hardware resources.

[Figure: several VMs, each with its own applications and operating system, run on a virtualisation layer above the shared physical hardware (CPU, memory, I/O devices).]
Recap of Operating System

• The following OS concepts are required to understand virtualization


• The concept of a process
• Virtual memory, paging, (segmentation)
• User mode and kernel mode of a process
• Interrupt/trap processing in kernel mode
• I/O handling
Concept of Process
• A process is a running program

• User writes a program, compiles it to generate an executable


• Executable contains machine/CPU instructions
• Every CPU architecture (e.g., x86) defines a certain set of instructions
• Compiler translates high level language code to instructions the CPU can run

• To create a process, OS allocates memory in RAM for the memory image of the process, containing
• Code and static/global data from the executable
• Heap memory for dynamic memory allocations (e.g., malloc)
• Stack to store arguments/return address/local variables during function calls

• All instructions and variables in the memory image are assigned memory addresses
• Starting at 0, up to some max value (4GB in 32-bit systems)
Memory Image of Process

[Figure: process memory layout from address 0 up to the 4GB maximum. Code/data (machine instructions, static/global data from the compiled executable) sit at the lowest addresses, followed by the heap (the address of dynamically allocated memory is returned by malloc). The heap and stack grow towards each other through the unused region between them. The stack holds one frame per function call, containing arguments, local variables, and the return address; a new frame is pushed by a function call and popped when the function returns.]
Process Execution
• When a process is run, CPU executes the code in the memory image

• When a process runs on the CPU, the CPU registers hold values related to the process execution
• The program counter (PC, or EIP in x86) has address of current instruction
• The CPU fetches the current instruction, decodes it, and executes it

• Any variables needed for operations are loaded from process memory into general purpose CPU registers (EAX,
EBX, ECX, EDX etc in x86)
• After instruction completes, values are stored from registers into memory

• The stack pointer (SP, or ESP in x86) has address of top of stack (current stack frame holds arguments/variables of
the current function that is running)

• The set of values of all CPU registers pertaining to a process execution is called its CPU context
CPU Context During Execution

[Figure: the CPU's EIP (PC) points into the code/data region of the process memory image, ESP points to the top of the stack, and the general-purpose registers (EAX … EDX) hold process data.]
Concurrent Exec., Context Switching
• To run a process, OS allocates memory, loads CPU context
• EIP points to instructions, ESP points to stack of process, registers have process data
• CPU now begins to run the process
• OS runs multiple processes concurrently by multiplexing them on the same CPU

• Context switch: After running a process for some time, OS switches from one process to another

• How does context switch happen? OS saves the CPU context of the old process and loads the context of new
process
• When EIP points to instruction of new process, new process starts to run

• Where is the context saved?


• OS has a data structure called Process Control Block (PCB) for each process
• PCB (specifically, a PCB field called kernel stack) temporarily stores context of a
process when it is not running
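The save/restore mechanics above can be sketched in Python. This is an illustrative toy model, not real OS code: the register set and PCB layout are simplified assumptions.

```python
# Toy model of a context switch: a PCB holds saved register values
# while the process is not running (register names follow x86).

class PCB:
    """Process Control Block: stores the CPU context of a non-running process."""
    def __init__(self, pid, eip, esp):
        self.pid = pid
        self.saved_context = {"EIP": eip, "ESP": esp, "EAX": 0}

cpu = {"EIP": 0, "ESP": 0, "EAX": 0}   # registers of the (single) CPU

def context_switch(cpu, old, new):
    old.saved_context = dict(cpu)      # save old process's registers into its PCB
    cpu.update(new.saved_context)      # load new process's saved registers onto the CPU

p1 = PCB(pid=1, eip=0x1000, esp=0x7000)
p2 = PCB(pid=2, eip=0x2000, esp=0x8000)

cpu.update(p1.saved_context)           # dispatch P1
cpu["EAX"] = 42                        # P1 computes something
context_switch(cpu, p1, p2)            # OS switches to P2
print(hex(cpu["EIP"]))                 # CPU now fetches P2's instructions: 0x2000
```

Once EIP holds P2's saved instruction pointer, the CPU simply continues fetching, and P2 runs; P1's progress (including EAX = 42) is safe in its PCB.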
Context Switch

[Figure: the CPU has switched execution from one process to another. EIP, ESP, and the general-purpose registers now point into the second process's memory image (code/data, heap, stack); the saved context of each process lives in its process control block in kernel memory, separate from user memory.]

Virtual Memory

• Addresses assigned in the memory image, which the CPU uses for loads and stores, are virtual/logical addresses
• These virtual addresses are not the actual addresses occupied by the instructions/data of the process; they are assigned from 0 for convenience

• Actual addresses where instructions/data bytes of process are stored are called physical
addresses
• RAM hardware needs physical addresses to fetch bytes
• CPU requests code/data at virtual addresses
• Translated to physical address so that RAM can fetch it
• Memory is allocated at granularity of pages (usually 4KB)
• Logical pages of a process are stored in physical frames in memory
• Logical page numbers translated to physical frame numbers
Virtual and Physical Address

[Figure: the CPU issues virtual addresses in the range 0 to 4GB; pages of the virtual address space are stored at unrelated physical addresses (X, Y, Z) in RAM.]
Address Translation with Paging

• On every memory access, a piece of hardware called MMU (memory management unit) translates virtual addresses to
physical addresses
• Page table of a process stores the mapping of logical page number to physical frame number
• OS builds the page table when allocating memory
• MMU uses this page table to translate addresses

• MMU looks up CR3 register of CPU (x86) which stores location of page table of current process
• Looks up page number in page table to translate address
• CR3 reset on every context switch

• Recent address translations are cached in the TLB (Translation Lookaside Buffer) located within the MMU
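The page-table walk plus TLB caching described above can be sketched as follows. This is a toy model: Python dicts stand in for the hardware page table and TLB, and the mappings are invented.

```python
# Toy MMU: translate a virtual address with 4KB pages, consulting a TLB
# before walking the page table.

PAGE_SIZE = 4096                  # 2^12 bytes per page

page_table = {0: 7, 1: 3, 2: 9}   # illustrative page -> frame mappings
tlb = {}                          # cache of recent translations

def translate(vaddr):
    page, offset = divmod(vaddr, PAGE_SIZE)
    if page in tlb:               # TLB hit: skip the page-table walk
        frame = tlb[page]
    else:                         # TLB miss: walk the page table, fill the TLB
        frame = page_table[page]
        tlb[page] = frame
    return frame * PAGE_SIZE + offset

# Virtual address 5KB lies in page 1 at offset 1KB; page 1 maps to frame 3.
print(hex(translate(5 * 1024)))   # 3*4KB + 1KB = 0x3400
```

A second access to any address in page 1 now hits the TLB, which is the whole point of caching translations.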
[Figure: CR3 points to the page table of the current process (e.g., page 1 maps to frame Y). The CPU issues a fetch of virtual address 5KB; the MMU splits the virtual address into a page number and an offset, looks the page number up in the page table, and fetches physical address Y+1KB. The physical address is the frame number followed by the same offset.]
Hierarchical Page Table
• Modern operating systems use hierarchical page tables

• 32-bit virtual address space (4GB), 2^12 byte (4KB) pages


• Each process can have up to 2^32/2^12 = 2^20 pages
• Each page has a page table entry (PTE), so 2^20 PTEs per process
• Assuming each PTE is 4 bytes, all page table entries occupy 4*2^20 = 4MB

• Cannot store a large page table contiguously in memory, so page table of a process is stored in memory in page sized
chunks
• Each 4KB page stores 2^10 PTEs, so 2^20/2^10 = 2^10 pages to store all PTEs
• Pointers to these “inner” pages stored in an outer page directory

• In 32-bit architectures, one outer page can store physical frame numbers of all 2^10 inner page table pages, so
2-level page table
• More levels in page table for 64-bit architectures
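The 10/10/12-bit split used by the two-level scheme above can be checked with a few lines of Python (the example address is arbitrary):

```python
# Split a 32-bit virtual address into the indices used by an x86-style
# two-level page table: 10-bit directory index, 10-bit table index,
# 12-bit offset within a 4KB page.

def split_va(vaddr):
    offset   = vaddr & 0xFFF            # low 12 bits: offset in page
    pt_index = (vaddr >> 12) & 0x3FF    # next 10 bits: inner page-table index
    pd_index = (vaddr >> 22) & 0x3FF    # top 10 bits: page-directory index
    return pd_index, pt_index, offset

# 0x00403004 -> directory entry 1, page-table entry 3, offset 4
print(split_va(0x00403004))
```

Each 10-bit index selects one of 2^10 entries, matching the 2^10 PTEs that fit in one 4KB page.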
Page Table Lookup

[Figure: the CPU's CR3 register holds the physical address of the outermost page directory. The page directory stores the physical frame numbers of the 2^10 inner page-table pages; each inner page holds an array of PTEs, and each page table entry stores a physical frame number. There are 2^32/2^12 = 2^20 pages, so 2^20 PTEs split among 2^10 pages. The 32-bit virtual address splits into two 10-bit indices and a 12-bit offset; the MMU caches the resulting mapping in the TLB, and the physical address is the frame number plus the offset. More levels of page tables can exist.]
Demand Paging

[Figure: the CPU fetches virtual address 5KB, but the page is not present in RAM, so the MMU raises a page fault. The OS assigns a physical frame to the page (possibly swapping a victim page out to disk), updates the page table, and reruns the faulting CPU instruction.]
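Demand paging can be sketched as a translation that fails and is repaired by a fault handler. This is a toy model; the frame numbers and disk contents are invented.

```python
# Toy demand paging: a lookup that "page faults" when the page is not
# resident, loads it from (simulated) disk, updates the page table, and
# then reruns the translation.

PAGE_SIZE = 4096
page_table = {0: 7}            # only page 0 is resident in RAM
disk = {1: "page-1 contents"}  # pages currently swapped out
free_frames = [5]

def access(page):
    if page not in page_table:      # MMU cannot translate: page fault
        frame = free_frames.pop()   # (a real OS may first evict a victim page)
        _ = disk[page]              # read the page contents in from disk
        page_table[page] = frame    # update the page table
    return page_table[page]        # rerun the translation, now it succeeds

print(access(1))   # faults, loads page 1 into frame 5, returns 5
```

The key property: the faulting access is transparent to the process, which simply sees the instruction complete after the OS fixes up the mapping.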
User Mode and Kernel Mode
• Two kinds of CPU instructions

• Unprivileged instructions (regular program code) – part of user code


• Privileged instructions (access to hardware etc.) – part of OS code

• Modern CPUs have multiple privilege levels (rings)


• Privileged instructions are executed only when CPU is at high privilege level

• User code runs at low privilege level / user mode (CPU set to ring 3 in x86)
• If user code runs a privileged instruction, the CPU raises a fault

• OS code runs at high privilege level / kernel mode (CPU set to ring 0 in x86)

• Allowed to run privileged instructions

• Privilege level checked by CPU and MMU (every page has privilege bit)

• When user process needs to perform privileged action, must jump to privileged OS code, set CPU to high privilege, and
then perform the action
Where is OS Code Located?

• OS is part of the high virtual address space of every process (e.g., the 3GB-4GB region)
• The page table of the process maps these kernel addresses to the location of kernel code in memory
• Only one copy of OS code exists in memory, but it is mapped into the virtual address space/page table of every process
• A process jumps to high virtual addresses to run OS code

[Figure: the region from 3GB to 4GB of every process's virtual address space maps to the single copy of kernel code in physical memory.]
Kernel Mode Execution

• When does a process go from user mode to kernel mode?


• System call: user requests some privileged action from OS (e.g., read syscall)
• Program fault: hardware raises a fault when user does some invalid action (e.g., privileged instruction in user
mode, page fault)
• Interrupt: I/O devices request attention (e.g., packet has arrived on NIC)
• All these are called traps in general
• How is a trap handled?
• Change CPU privilege level to high privilege/kernel mode
• Jump to OS code that handles trap and run it
• Return to user code after handling trap
• Can choose to return to user code of another process too (if context switch)
What Happens Upon a Trap?
• Trap handling begins by running a CPU instruction (int n in x86)
• Invoked by system call code or triggered by a hardware event
• What happens during execution of int n?
• Change CPU privilege level
• Look up the Interrupt Descriptor Table (IDT) with index "n" to get the address of the kernel code that handles this trap, and set EIP to this value
• Use the kernel stack of the process (in the PCB) as the stack: set ESP to this value
• Start saving the user context onto the (kernel) stack

• Next, kernel code to handle the trap runs
• Saves more user context, beyond what is saved by the hardware instruction
• System call processing, interrupt handling, etc.

• Finally, the kernel invokes the iret (x86) instruction to return back to user mode
• Reverses the changes of int n
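The int n / iret sequence above can be modeled roughly as follows. Handlers are ordinary Python functions, the "kernel stack" is a saved dict, and only the trap numbers (0x80 for syscalls, 14 for page faults) follow x86 convention; everything else is a simplification.

```python
# Toy trap dispatch through an IDT: raise privilege, save user context,
# jump to the handler the IDT points at, then "iret" back to user mode.

def syscall_handler():   return "syscall handled"
def pagefault_handler(): return "page fault handled"

IDT = {0x80: syscall_handler, 14: pagefault_handler}

def int_n(n, cpu):
    saved = dict(cpu)      # save user context (as if pushed on the kernel stack)
    cpu["ring"] = 0        # change CPU privilege level to kernel mode
    result = IDT[n]()      # set EIP from IDT[n] and run the kernel handler
    cpu.update(saved)      # iret: restore user context, back to ring 3
    return result

cpu = {"ring": 3, "EIP": 0x1000}
print(int_n(0x80, cpu), cpu["ring"])   # handler runs, then ring is 3 again
```

Note that the handler itself never sees or clobbers the user registers: they live on the (modeled) kernel stack until iret restores them.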
Trap Handling

[Figure: when int n executes, the CPU enters ring 0, sets EIP = IDT[n], switches ESP to the kernel stack of the process (part of the PCB kernel data structure), and saves the user context (EAX and other registers) onto that stack before OS code runs.]
Interrupt Descriptor Table

• Every interrupt has a number (IRQ)

• E.g., different hardware devices get different numbers, system call gets a different
number

• IDT has an entry for every interrupt number (IRQ)


• IDT entry specifies values of EIP and a few such CPU registers
• CPU uses this EIP to locate kernel interrupt handling code

• Pointer to IDT is stored in CPU register


• Setting IDT pointer in CPU is a privileged operation, done by OS
• Much like how setting CR3 to page table is a privileged operation
Segmentation

[Figure: each segment of the virtual address space is described by a base and a limit held in CPU segment registers: the heap segment (Base = BHeap, Limit = LHeap), the stack segment (Base = BStack, Limit = LStack), and the code segment (Base = BCode, Limit = LCode). Base = start address of the segment; Limit = end address of the segment.]
Use of Segmentation

[Figure: the CPU's segment registers (CS, DS, SS) index into a segment descriptor table holding (base, limit) pairs such as (BCode, LCode). The MMU takes an address of the form (segment, offset in segment), checks the offset against the segment limit, and adds the segment base to produce the resulting address.]
Use of Segmentation Today
• What do operating systems use today? Segmentation or Paging?
• Hardware is built to do segmentation and paging, but only paging in practice
• Modern OSes use "flat" segments: Base = 0, Limit = MAX (4GB)
• Virtual address remains same after segmentation-based translation

• Paging is used to manage virtual memory in reality


• After dummy translation with segmentation, paging does actual translation
• After adding segment base (0), linear address is translated using page table

• Segmentation is used to check permissions (along with paging)


• Modern OSes have different flat segments for different privilege levels

• CS has two different values in user and kernel mode (base=0, limit=MAX, only privilege level is different)
• CS has high privilege level when executing privileged instructions
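The base+offset translation with a limit check, and why a flat segment leaves the address unchanged, can be shown in a few lines. This is illustrative only: real x86 descriptors also carry privilege and type bits, which are omitted here.

```python
# Toy segment translation: check the offset against the segment limit,
# then add the base to produce the linear address (which paging then
# translates for real).

def seg_translate(base, limit, offset):
    if offset > limit:
        raise MemoryError("segmentation fault: offset beyond limit")
    return base + offset          # linear address handed on to paging

# A modern "flat" segment: base 0, limit = maximum 32-bit address.
FLAT_BASE, FLAT_LIMIT = 0, 2**32 - 1
print(seg_translate(FLAT_BASE, FLAT_LIMIT, 0x1234))   # unchanged: 0x1234
```

With base 0 and a maximal limit, the check always passes and the addition is a no-op, so paging alone does the actual translation; only the privilege bits of the segment still matter.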
I/O Subsystem
• Processes use system calls to access I/O devices
• Process P1 makes system call, goes into kernel mode
• OS device driver initiates I/O request to device (e.g., disk, network card)

• If the system call is blocking (i.e., cannot be completed right away), OS performs context switch to
another process P2
• When request completes, I/O device raises interrupt
• P2 goes into kernel mode, handles interrupt, marks P1 as ready to run
• P1 runs at a later time when scheduled by OS scheduler
I/O Subsystem
• DMA: I/O devices perform Direct Memory Access to store I/O data in memory

• When initiating I/O request, device driver provides (physical) address of memory buffer in which to store
I/O data (e.g., disk block)
• Device first stores data in DMA buffer before raising interrupt
• Interrupt handler need not copy data from device memory to RAM
Summary
• The concept of a process, memory image, CPU context
• CPU context is saved in PCB/kernel stack during user/kernel mode transitions of a
process, as well as during context switch between processes

• Virtual memory, paging, address translation by MMU using page table


• OS is mapped into the virtual address space of every process
• Modern OSes use flat segments + paging

• User mode and kernel mode of a process, privileged instructions, CPU privilege levels
• Privileged instructions in OS code run at highest CPU privilege level (ring 0)

• Interrupt/trap processing in kernel mode


• Change privilege level, save user context, jump to kernel code, handle trap

• I/O handling
COMPUTER ORGANISATION

[Figure: a computer consists of a CPU, memory, and I/O devices.]
Non-Virtualized System

[Figure: four applications (Blackboard, NESS, Student Admissions, and another application) each run on their own server cluster (Server Clusters 1-4). Each cluster consists of a processor, memory, disk, and network, and each is individually exposed to hardware failure.]
Non-Virtualized System

[Figure: the same four applications, each statically bound to its own server cluster.]

What do we want to achieve with cloud?

Before cloud: static partitioning, with each workload (e.g., Hadoop, Pregel, MPI) on its own dedicated machines. In clouds: dynamic sharing of a common IaaS pool across workloads.
Benefits of Virtualisation

■ Heterogeneous system consolidation


■ System testing and workload isolation
■ Hardware optimization
■ Application high availability
■ Elasticity
Virtualization (Overview)

■ Fundamental component of cloud computing, especially in IaaS.


■ Allows the creation of secure, customizable, and isolated execution environments; even untrusted code does NOT affect other users' applications.
■ Basic idea: ability of a computer program (software and hardware) to emulate an executing
environment separate from the one that hosts such programs.
■ For example, Windows OS can run on top of a virtual machine, which itself is running on Linux
OS.
■ A great opportunity to build elastically scalable systems
■ Enables resource provisioning at minimum cost.
■ Virtualization used to deliver customizable computing environments on demand.
Virtualization Reference Model

■ Virtual version of hardware, a software environment, storage, or a network.


■ Three major components: Guest, Host and, Virtualization layer.
■ The Guest represents the system component that interacts with the virtualization layer.
■ The Host represents the original environment in which the guest is to be managed.
■ The Virtualization layer recreates the environment where the guest will operate.
■ The most intuitive application is hardware virtualization, where the guest is represented by a system image comprising an operating system and installed applications.
Virtualization Reference Model

[Figure: the Guest is a virtual image running applications. It sits on virtual hardware, virtual storage, and virtual networking provided by the Virtualization Layer (software emulation), which in turn maps onto the Host's physical hardware, physical storage, and physical networking.]
Virtualization Reference Model

■ Guests are installed on top of virtual hardware, controlled and managed by the virtualization layer (the virtual machine manager, VMM).
■ The host is instead represented by the physical hardware and, in some cases, the operating system that defines the environment in which the VMM runs.
■ Virtual storage: the guest might be client applications that interact with virtual storage management software deployed on top of the real storage system.
■ Virtual networking: the guest interacts with a virtual private network (VPN), managed by specific software (a VPN client) using the physical network on the nodes.
■ VPNs are useful for creating the illusion of being within a different physical network, and thus accessing resources in it that would otherwise not be available.
Virtualisation in Conventional Software Infrastructure

[Figure: the operating system already presents virtual resources to applications on top of the hardware:]

Physical Resource → Virtual Resource
CPU → Process
Memory → Virtual Memory
Disk → Files
Network Card → TCP/UDP Sockets
Screen → Windows
Virtual Machine (VM)

■ A Virtual Machine (VM) is a compute resource that uses software instead of a physical computer
to run programs and deploy apps.
■ Architecture: a formal specification to an interface in the system, including the logical behavior
of the resources managed via the interface.
■ Implementation describes the actual embodiment of an architecture.
■ Abstraction levels correspond to implementation layers, having its own interface or architecture.
■ Guest – the system component that interacts with the Virtualization Layer.
■ Host – the original environment where the guest runs.
■ Virtualization Layer – recreates the same or a different environment in which the guest will run.
Virtualised Infrastructure in Clouds

[Figure: instead of one physical machine (CPU + memory + disk + network + …), a virtual cloud machine (vCPU + vMem + vDisk + vNet + …) is offered: multiple VMs, each running its own applications and OS, execute on a virtualisation layer above the hardware.]
Virtualization in Clouds

■ Virtualization refers to partitioning the resources of physical computing hardware (such as compute, storage, network, and memory) into multiple virtual resources or virtual machines (VMs).
■ It is a key enabling technology of cloud computing that allows pooling of resources.
■ In cloud computing, resources are pooled to serve multiple users using multi-tenancy.

[Figure: examples include server virtualisation and network functions virtualisation.]
WHY DO THAT?

• Save the cost (space, energy, personnel) of running several machines in place of one, or vice versa (a green aspect, too!)
• Use the (otherwise wasted) CPU power
• Clone servers (for example, for debugging) at low cost
• Migrate a machine (for example, when the load increases) at low cost
• Isolate an appliance, i.e., a server for a specific purpose (such as security), without buying new hardware
Virtualised Infrastructure in Clouds

The only way to enter the physical hardware is via a software (operating system) or hardware exception (an interrupt or a trap!).

[Figure: the virtualisation interface: all visitors (interrupts and traps) enter here. After Igor Faynberg, "Cloud Computing Introduction".]
Virtualised Infrastructure in Clouds

Server Virtualization: the division of a physical system's resources (e.g., processors, memory, I/O, and storage), where each such set of resources operates independently with its own system image instance and applications.

Virtual Machines
• Proxies for physical resources, with the same external interfaces and functions
• Composed from physical resources

Virtualization Layer
• Creates virtual resources and "maps" them to physical resources
• Provides isolation between virtual resources
• Accomplished through a combination of software, firmware, and hardware mechanisms

Physical Hardware Resources
• Hardware components with architected interfaces/functions
• Examples: memory, disk drives, networks, servers
Characteristics of Virtualized Environments

Increased security:
■ The guest gets a completely transparent and controlled execution environment.
■ The guest is executed in an emulated environment.
■ Guest operations are performed on the VM, which translates and applies them to the host.
■ The VMM controls and filters guest activity, preventing harmful operations from being performed; the host is hidden from, or protected against, the guest.
■ Sensitive information is naturally hidden without installing complex security policies.
■ Increased security is a requirement when dealing with untrusted code.
■ Hardware virtualization solutions (VMware Desktop, VirtualBox, and Parallels) create a virtual computer with customized virtual hardware, on top of which a new operating system can be installed.
■ By default, the file system exposed by the virtual computer is completely separated from that of the host machine.
Characteristics of Virtualized Environments

Managed execution:
■ Sharing: resource sharing in virtualized data centers reduces the number of active servers and limits power consumption.
■ Aggregation: a group of separate hosts is tied together and presented to guests as a single virtual host; implemented in middleware for distributed computing.
■ Emulation: controlling and tuning the environment that is exposed to guests. Example: an arcade-game emulator to play arcade games on a normal personal computer.
■ Isolation: allows multiple guests to run on the same host without interfering with each other; also provides separation between the host and the guest.
Characteristics of Virtualized Environments

Portability:
■ Hardware virtualization: the guest is packaged into a virtual image that can be safely moved and executed on top of different virtual machine managers.
■ Programming-level virtualization: as implemented by the JVM or the .NET runtime, binary code can be run without any recompilation.
■ This makes deployment flexible and very straightforward: one version is able to run on different platforms with no changes.
■ Your own system is always with you and ready to use, as long as the required virtual machine manager is available (the services you need are available to you anywhere you go).
Hands-On

Creation of an Ubuntu virtual machine using Oracle VirtualBox.

CASE STUDY: VIRTUAL MACHINE LIFECYCLE

[Figure: the lifecycle of a virtual machine on a physical machine, after http://wiki.libvirt.org/page/VM_lifecycle#Virtual_Machine_Lifecycle. A VM moves from Off to On via start, back to Off via shutdown, and can also enter a Paused state.]
Virtualization: how is it done?

Process level (technique → virtualization model):
• Emulation → Application
• High-Level VM → Programming Language
• Multiprogramming → Operating System

System level (technique → virtualization model):
• Hardware-assisted virtualization → Hardware
• Full virtualization → Hardware
• Paravirtualization → Hardware
• Partial virtualization → Hardware

Besides the execution environment, storage, network, and other resources can also be virtualized.
Execution Virtualization

■ An execution environment is emulated, separate from the host, and implemented on top of the hardware by the OS or by an application (libraries), either dynamically or statically.

Machine reference model
■ To virtualize an execution environment at different levels of the computing stack, a reference model is required that defines the interfaces between the levels of abstraction.
■ Virtualization techniques actually replace one of the layers.
■ A clear separation between layers simplifies their implementation.
■ It requires the emulation of interfaces and interaction with the underlying layer.
Computing Stack

[Figure: the execution stack and the virtualization level that targets each layer: Applications (application-level virtualization), Programming Languages (programming-language-level virtualization), Operating Systems (OS-level virtualization), Hardware (hardware-level virtualization).]
Execution virtualization: Interfaces

■ Virtualization: extend or replace an existing interface to mimic the behavior of another


system.
➢ Application Programming Interfaces (API): interfaces applications to libraries and/or the
underlying OS.
➢ Application Binary Interface (ABI): separates OS layer from application and libraries which
are managed by the OS, System Calls defined, allows portability of applications and libraries
across OS.
➢ Instruction Set Architecture (ISA): defines the supported instructions, data types, registers, and processor, memory, and interrupt management; it is the interface to the hardware.
➢ The ISA is divided into two security classes: privileged instructions and non-privileged instructions.
Machine Reference Model

[Figure: applications invoke libraries through the API (API calls); the ABI separates the OS layer from the applications and libraries managed by the OS, via system calls, so applications and libraries see only the User ISA. The operating system sits on the full ISA, which interfaces to the hardware.]

[Slide annotation, translated: "What is this User ISA?"]

• Depending on what is replaced/mimicked, we obtain different techniques of virtualization.
Execution Virtualization

■ Non-privileged instructions can be used without interfering with other tasks because they do not access shared resources (e.g., floating-point, fixed-point, and arithmetic instructions).
■ Privileged instructions are executed under specific restrictions and are mostly used for sensitive operations:
➢ Behavior-sensitive: operate on the I/O, or
➢ Control-sensitive: alter the state of the CPU.
■ Some architectures feature more than one class of privileged instructions.
■ For instance, a hierarchy of privileges (see the figure on the next slide), in the form of ring-based security: Ring 0, Ring 1, Ring 2, and Ring 3.
Security Rings and Privilege Modes

[Figure: concentric rings from Ring 0 (innermost, most privileged mode, supervisor mode) through Rings 1 and 2 (privileged modes) out to Ring 3 (least privileged mode, user mode).]

Ring 0 is the most privileged level and Ring 3 the least privileged. Ring 0 is used by the kernel of the OS, Rings 1 and 2 are used by OS-level services, and Ring 3 is used by the user.
Security Rings and Privilege Modes

Segment Table Revisited

[Figure: each segment-table entry holds a start address, a size, a ring number, and an access type = {read, write, execute}. Code permitted to access ring i may access ring j if and only if j > i.]
Execution Virtualization

■ The distinction between user and supervisor mode defines the role of the hypervisor.
■ Hypervisors run in supervisor mode, and the division between privileged and non-privileged instructions makes the design of virtual machine managers challenging.
■ All sensitive instructions should require supervisor mode, so that executing them from user mode raises a trap.
■ Without this assumption it is impossible to fully emulate and manage the status of the CPU for guest operating systems.
■ This is not true for the original x86 ISA, which allows 17 sensitive instructions to be called in user mode; this prevents multiple OSs managed by a single hypervisor from being isolated from each other,
■ since they access the privileged state of the processor and change it.

[Slide annotation, translated: "Without the hypervisor's knowledge (or without generating a trap), how will we virtualize? :)"]
Execution virtualization

■ Recent implementations of the x86 ISA (Intel VT and AMD Pacifica) solved this problem by redesigning such instructions as privileged ones.
■ In hypervisor-managed environment, all guest OS code will run in user mode in order to prevent it
from directly accessing the status of the CPU.
■ If there are sensitive instructions that can be called in user mode, it is no longer possible to
completely isolate the guest OS.
Hardware Virtualization

■ It provides an abstract execution environment in terms of computer hardware on top of which a


guest operating system can be run.
■ The guest is represented by the operating system, the host by the physical computer hardware,
the virtual machine by its emulation, and the virtual machine manager by the hypervisor (see
Figure).
■ The hypervisor is generally a program or a combination of software and hardware that allows
the abstraction of the underlying physical hardware.
■ Hardware-level virtualization is also called system-level virtualization,
■ since it exposes the ISA to virtual machines, which is the representation of the hardware interface of a system.
■ This differentiates it from process virtual machines, which expose the ABI to virtual machines.
A Hardware Virtualization Reference Model

[Figure: the guest is a virtual image, held both in an in-memory representation and in storage. The VMM emulates the host and runs the virtual machine using techniques such as binary translation, instruction mapping, and interpretation, on top of the host.]
Techniques to Design Virtual Machine Monitors

What does a VMM do?

• Multiple VMs run on a physical machine: the VMM multiplexes the underlying machine
• Similar to how the OS multiplexes processes on the CPU

[Figure: VMs run on a VMM just as processes run on an OS.]

• The VMM performs a machine switch (much like a context switch)
• Run a VM for a bit, save its context and switch to another VM, and so on…
• What is the problem?
• A guest OS expects unrestricted access to hardware and runs privileged instructions, unlike user processes
• But one guest cannot get such access; it must be isolated from the other guests
Trap and Emulate VMM (1)

• All CPUs have multiple privilege levels
• Rings 0, 1, 2, 3 in x86 CPUs
• Normally, user processes run in ring 3, the OS in ring 0
• Privileged instructions only run in ring 0
• Now, user processes run in ring 3, the VMM/host OS in ring 0
• The guest OS must be protected from guest apps
• But it is not fully privileged like the host OS/VMM
• So it can run in ring 1

[Figure: guest app in ring 3, guest OS in ring 1, VMM/host OS in ring 0.]

• Trap-and-emulate VMM: the guest OS runs at a lower privilege level than the VMM and traps to the VMM for privileged operations
Trap and Emulate VMM (2)

• A guest app's system call or an interrupt has to be handled
• The special trap instruction (int n) traps to the VMM
• The VMM doesn't know how to handle the trap
• The VMM jumps to the guest OS trap handler
• The trap is handled by the guest OS normally
• The guest OS performs a return from trap
• This is a privileged instruction, so it traps to the VMM
• The VMM jumps to the corresponding user process

• Any privileged action by the guest OS traps to the VMM and is emulated by the VMM
• Example: set IDT, set CR3, access hardware
• Sensitive data structures like the IDT must be managed by the VMM, not the guest OS

[Slide annotation, translated: "Is there a single IDT managed by the VMM, or one per guest OS, managed by the VMM?"]
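The trap-and-emulate flow can be sketched as follows. This is a toy model: "CR3" and "IDT" stand for privileged hardware state, and the per-guest shadow copy is a simplification of what a real VMM maintains.

```python
# Toy trap-and-emulate: the guest OS runs deprivileged (ring 1), so its
# privileged instructions trap to the VMM, which emulates them against a
# VMM-managed shadow of the hardware state.

shadow_state = {"CR3": None, "IDT": None}   # VMM's shadow copy for one guest

def vmm_trap(instr, value):
    # The VMM emulates the privileged instruction instead of letting the
    # guest touch the real hardware registers.
    shadow_state[instr] = value
    return "emulated"

def guest_execute(instr, value, ring):
    if instr in ("CR3", "IDT") and ring != 0:
        return vmm_trap(instr, value)    # privileged instr in ring 1: trap
    return "executed directly"           # non-sensitive code runs natively

print(guest_execute("CR3", 0x9000, ring=1))   # guest OS sets CR3: emulated
```

The guest believes it installed a page table; in reality the VMM recorded the request in its shadow state and remains in control of the real CR3.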
Problems with trap and emulate

• Guest OS may realize it is running at lower privilege level


• Some registers in x86 reflect CPU privilege level (code segment/CS)
• Guest OS can read these values and get offended!

• Some x86 instructions which change hardware state (sensitive instructions) run in both privileged and
unprivileged modes
• Such instructions behave differently when the guest OS is in ring 0 vs. the less privileged ring 1
• The OS behaves incorrectly in ring 1 and will not trap to the VMM

• Why these problems?


• OSes not developed to run at a lower privilege level

• Instruction set architecture of x86 is not easily virtualizable (x86 wasn’t designed with virtualization in
mind)
Example: Problems with trap and emulate
• The eflags register is a set of CPU flags
• IF (the interrupt flag) indicates whether interrupts are on or off
• Consider the popf instruction in x86: it pops the value on top of the stack and writes it into eflags
• Executed in ring 0, all flags are set normally
• Executed in ring 1, only some flags are set
• IF is not set, as it is a privileged flag
• So popf is a sensitive instruction that is not privileged: it does not trap, and behaves differently when executed at different privilege levels
• The guest OS is therefore buggy in ring 1
[Figure: process address space (code/data, heap, stack) and CPU state (EIP, EAX..EDX, eflags)]
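The popf behavior can be modeled with a small simulation. This is a simplified model of the semantics described above, not full x86 accuracy; IF is bit 9 of eflags.

```python
# Simplified model of popf's privilege-dependent behavior.
IF = 1 << 9   # interrupt flag, bit 9 of eflags

def emulate_popf(eflags, popped, ring):
    """Model: in ring 0, popf writes all flags. Outside ring 0 the IF
    bit is silently left unchanged, and (crucially) no trap occurs."""
    if ring == 0:
        return popped
    # Sensitive but unprivileged: the IF update is dropped, no fault.
    return (popped & ~IF) | (eflags & IF)

# A guest OS in ring 1 tries to disable interrupts by clearing IF:
before = IF            # interrupts currently enabled
after = emulate_popf(before, popped=0, ring=1)
print(bool(after & IF))  # True: IF is still set; the write was ignored
```

Because the failure is silent, the deprivileged guest OS believes it has disabled interrupts when it has not, which is exactly why this instruction breaks trap-and-emulate.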
Criteria of VMM:
■ Equivalence: a program running under the VMM exhibits the same behavior as when executed directly on the physical host.
■ Resource control: the VMM is in complete control of the virtualized resources.
■ Efficiency: a statistically dominant fraction of the machine instructions is executed without intervention from the VMM.
■ Theorems*: a classification of the instruction set, and three theorems that define the properties hardware instructions must satisfy in order to efficiently support virtualization.
* These criteria were established by Popek and Goldberg in 1974.
Popek Goldberg theorem

• Sensitive instruction = changes hardware state

• Privileged instruction = runs only in privileged mode


• Traps to ring 0 if executed from unprivileged rings

• Theorem 1: In order to build a VMM efficiently via trap-and-emulate method, sensitive instructions should be a subset
of privileged instructions
• x86 does not satisfy this criterion, so a trap-and-emulate VMM is not possible
[Figure: Venn diagram of CPU instructions. In x86, the set of sensitive instructions is not fully contained in the set of privileged instructions.]
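Theorem 1 reduces to a simple set-inclusion check. The sketch below uses tiny, illustrative instruction lists (not an exhaustive x86 classification):

```python
# Popek-Goldberg Theorem 1 as a set check (toy instruction lists).

def trap_and_emulate_possible(sensitive, privileged):
    """An efficient trap-and-emulate VMM can be built only if every
    sensitive instruction is also privileged (i.e., it traps)."""
    return set(sensitive) <= set(privileged)

# A well-behaved (virtualizable) ISA:
assert trap_and_emulate_possible({"lidt", "mov_cr3"},
                                 {"lidt", "mov_cr3", "hlt"})

# Classic x86 (pre-VT): popf is sensitive but NOT privileged, so the
# subset relation fails and trap-and-emulate is impossible.
x86_sensitive = {"lidt", "mov_cr3", "popf", "sgdt"}
x86_privileged = {"lidt", "mov_cr3", "hlt"}
print(trap_and_emulate_possible(x86_sensitive, x86_privileged))  # False
```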
Hardware Virtualization
Theorem 2*:
■ A conventional third-generation computer is recursively virtualizable if:
■ it is virtualizable, and
■ a VMM without any timing dependencies can be constructed for it.
Theorem 3*:
■ A hybrid VMM (HVM) may be constructed for any third-generation machine in which the set of user-sensitive instructions is a subset of the set of privileged instructions.
■ In an HVM, more instructions are interpreted rather than being executed directly.
Techniques to virtualize x86 (1)

• Paravirtualization: rewrite guest OS code to be virtualizable


• Guest OS won’t invoke privileged operations, makes “hypercalls” to VMM
• Needs OS source code changes, cannot work with unmodified OS
• Example: Xen hypervisor

• Full virtualization: CPU instructions of guest OS are translated to be virtualizable


• Sensitive instructions translated to trap to VMM
• Dynamic (on the fly) binary translation, so works with unmodified OS
• Higher overhead than paravirtualization
• Example: VMWare workstation
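A hypercall can be sketched as an explicit call from modified guest code into a VMM-provided handler. The names below (the "set_cr3" hypercall, the registry) are invented for illustration and are not Xen's actual hypercall interface:

```python
# Paravirtualization sketch: the guest kernel source is modified so
# that privileged operations become explicit hypercalls to the VMM.

HYPERCALLS = {}

def hypercall(nr):
    """Register a VMM handler under a hypercall name."""
    def register(fn):
        HYPERCALLS[nr] = fn
        return fn
    return register

@hypercall("set_cr3")
def vmm_set_cr3(page_table):
    # The VMM can validate the guest page table before installing it.
    return f"cr3<-{page_table}"

class ParavirtGuest:
    def switch_address_space(self, pt):
        # Instead of `mov cr3, pt` (privileged), call into the VMM.
        return HYPERCALLS["set_cr3"](pt)

print(ParavirtGuest().switch_address_space("pt0"))  # cr3<-pt0
```

The contrast with full virtualization is visible in the source: here the guest is rewritten to cooperate, whereas binary translation rewrites the compiled instructions without source changes.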
Techniques to virtualize x86 (2)

• Hardware assisted virtualization: KVM/QEMU in Linux


• CPU has a special VMX mode of execution
• x86 has 4 rings in non-VMX root mode, and another 4 rings in VMX mode

• VMM enters VMX mode to run guest OS in (special) ring 0

• Exit back to VMM on triggers (VMM retains control)

[Figure: in non-VMX root mode, the host app runs in ring 3 and the VMM/host OS in ring 0; in VMX mode, the guest app runs in ring 3 and the guest OS in ring 0. The VMM enters VMX mode to run the VM; the guest exits back to trap to the VMM, so the VMM retains control.]

Hardware Virtualization
VMM: the main modules coordinate to emulate the underlying hardware:
Dispatcher: Entry Point of VMM
■ Reroutes the instructions issued by VM
instance.
Allocator: Decides the system resources to be provided to the VM;
❑ invoked whenever a VM executes an instruction that results in changing the machine resources associated with that VM.
■ Invoked by dispatcher
Interpreter: Consists of interpreter routines
■ Executed whenever a VM executes a
privileged instruction.
■ Trap is triggered and the corresponding
routine is executed.
Hardware Virtualization

Hypervisors(or VMM):
■ Hypervisor runs above the physical hardware.
■ It runs in supervisor mode.
■ It recreates a h/w environment in which guest operating systems are installed.
■ It is a piece of s/w that enables us to run one or more VMs on a physical server(host).
■ Two major types of hypervisor
■ – Type -I
■ – Type-II
Hardware Virtualization

Hypervisors (Type-I):
■ It runs directly on top of the hardware, taking the place of the OS.
■ Directly interact with the ISA exposed by the underlying hardware, and emulate this interface in
order to allow the management of guest OS.
■ Also known as native virtual machine.
Hypervisors (Type-2):
■ It requires the support of an operating system to provide virtualization services.
■ It is a program managed by the OS, and interacts with the OS through the ABI.
■ Emulate the ISA of virtual h/w for guest OS.
■ Also called hosted virtual machine.
Hosted and native Hypervisors
[Figure: two stacks side by side. Native (Type 1): VMs run on the Virtual Machine Manager via the ISA, and the VMM runs directly on the hardware via the ISA. Hosted (Type 2): VMs run on the Virtual Machine Manager, which runs on an operating system via the ABI, and the OS runs on the hardware via the ISA.]
Type 1 Hypervisor

• Unmodified Operating System is running in user mode


– But it thinks it is running in kernel mode (virtual
kernel mode)
– privileged instructions trap;
– Hypervisor is the “real kernel”
• Upon trap, executes privileged operations
• Or emulates what the hardware would do
Ex: VMware ESXi or Microsoft Hyper-V
Type 1 Hypervisor CASE STUDY
MICROSOFT HYPER-V
Type 2 Hypervisor (note: a Type 1 hypervisor traps privileged instructions, whereas a Type 2 hypervisor relies on binary translation)

• VMWare example (Type 2 hypervisor)
– Upon loading a program, it scans the code for basic blocks
– If sensitive instructions are found, they are replaced by calls to a VMware procedure
• Binary translation
– Modified basic blocks are cached in the VMware translation cache
• Execute; load the next basic block, and so on
• Type 2 hypervisors work without VT support (no
automatic trapping of privileged instructions
by the hardware).
Hardware Virtualization

Hardware virtualization Techniques:


■ Each VM that runs on the host requires its own CPU.
■ CPU needs to be virtualized, done by hypervisor.
Types of HVT:
Hardware-assisted virtualization
⮚ Full virtualization
⮚ Para virtualization
Hardware-assisted CPU virtualization in
KVM/QEMU
Hardware Assisted Virtualization
• Modern technique, after hardware support for virtualization introduced in CPUs
• Original x86 CPUs did not support virtualization
• Intel VT-X or AMD-V support is widely available in modern systems
• Special CPU mode of operation called VMX mode for running VMs
• Many hypervisors use this H/W feature, e.g., QEMU/KVM in Linux

Works with binary translation if no hardware support


QEMU (Userspace process)
Sets up guest VM memory as part of user-space process

KVM (kernel module) When invoked, KVM switches to VMX mode to run guest

CPU with VMX mode CPU switches between VMX and non-VMX root modes
LibVirt and QEMU/KVM
• When you install QEMU/KVM on Linux, libvirt is also installed
• A set of tools to manage hypervisors, including QEMU/KVM
• A daemon runs on the system and communicates with hypervisors
• Exposes an API using which hypervisors can be managed, VM created etc.
• Commandline tool (virsh) and GUI (virt-manager) use this API to manage VMs

QEMU (Userspace process)


Libvirt
KVM (kernel module) API

CPU with VMX mode


QEMU Architecture
• QEMU is a userspace process
• KVM exposes a dummy device (/dev/kvm)
• QEMU talks to KVM via open/ioctl syscalls
• Allocates memory via mmap for the guest VM's physical memory
• Creates one thread for each virtual CPU (VCPU) in the guest
• Holds multiple file descriptors to /dev/kvm (one for QEMU, one for the VM, one per VCPU, and so on)
• Uses ioctl on these fds to talk to KVM
• The host OS sees QEMU as a regular multi-threaded process
[Figure: QEMU userspace process holding the guest VM physical memory and VCPU-0..VCPU-N threads, above the KVM kernel module, above the CPU with VMX mode]
QEMU Operations
open(/dev/kvm)
ioctl(qemu_fd, KVM_CREATE_VM)
ioctl(vm_fd, KVM_CREATE_VCPU)

for(;;) {                      // each VCPU runs this loop
    ioctl(vcpu_fd, KVM_RUN)    // blocks this thread; KVM switches to VMX mode and runs the guest VM,
                               // returning to QEMU on the host when the VM exits VMX mode
    switch(exit_reason) {
        case KVM_EXIT_IO:      // do I/O
        case KVM_EXIT_HLT:     // QEMU handles the exit and returns to the guest VM
    }
}
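The exit-handling structure of this loop can be simulated without /dev/kvm by scripting the exit reasons. In the real loop, each iteration issues ioctl(vcpu_fd, KVM_RUN) and reads the exit reason from the mmap'ed kvm_run structure; here a list of exit reasons stands in for that:

```python
# Sketch of the VCPU run loop's exit dispatch (simulation only).
KVM_EXIT_IO, KVM_EXIT_HLT = "KVM_EXIT_IO", "KVM_EXIT_HLT"

def run_vcpu(scripted_exits):
    log = []
    for exit_reason in scripted_exits:   # stands in for ioctl(KVM_RUN)
        if exit_reason == KVM_EXIT_IO:
            log.append("emulate I/O, re-enter guest")
        elif exit_reason == KVM_EXIT_HLT:
            log.append("guest halted, stop VCPU")
            break                        # leave the run loop
        else:
            log.append(f"unhandled exit: {exit_reason}")
    return log

print(run_vcpu([KVM_EXIT_IO, KVM_EXIT_IO, KVM_EXIT_HLT]))
```

The design point: most exits are handled and the guest is immediately re-entered; only terminal events (like HLT here) break the loop.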
QEMU/KVM Operation
• The VCPU thread has a kvm_run structure to share info between QEMU and KVM
1. The QEMU VCPU thread calls KVM_RUN
2. KVM shifts the CPU to VMX mode
3. The guest OS and its user applications run normally
4. The guest OS exits back to KVM on a trigger
5. KVM handles the exit or returns to the QEMU thread
[Figure: QEMU userspace process (guest VM physical memory, VCPU-0 with kvm_run) above the KVM kernel module, above the guest OS and guest applications running in VMX mode]
VMX Mode
• Special CPU instructions to enter and exit VMX mode
• VMLAUNCH, VMRESUME invoked by KVM to enter VMX mode
• VMEXIT invoked by guest OS to exit VMX mode

• On VMX entry/exit instructions, CPU switches context between host OS to guest OS


• Page tables (address space), CPU register values etc switched
• Hardware manages the mode switch

• Where is CPU context stored during mode switch?


• Cannot be stored in host OS or guest OS data structures alone (why?)
• VMCS (VM control structure); AMD's equivalent is called the VMCB (VM control block)
VM Control Structure
• What is the VMCS?
• A common memory area accessible in both modes: KVM (root mode) and guest OS (VMX mode)
• One VMCS per VM (KVM tells the CPU which VMCS to use)
• What is stored in the VMCS?
• Host CPU context: stored when launching the VM, restored on VM exit
• Guest CPU context: stored on VM exit, restored when the VM is run
• Guest entry/execution/exit control area: KVM can configure guest memory and CPU context, and which instructions and events should cause the VM to exit
• Exit information: exit reason and any other exit-related information
• VMCS information (e.g., the exit reason) is exchanged with QEMU via the kvm_run structure
• The VMCS is only accessible to KVM in kernel mode, not to QEMU userspace
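The save/restore discipline can be sketched as follows. This is a toy model: real VMCS fields are accessed with the VMREAD/VMWRITE instructions, and the field names here are simplified placeholders:

```python
# Toy model of VMCS-mediated context switching: host context is saved
# on VM entry and restored on VM exit; guest context the other way.

class VMCS:
    def __init__(self):
        self.host_area = None    # host CPU context (saved at entry)
        self.guest_area = None   # guest CPU context (saved at exit)
        self.exit_reason = None

def vm_entry(vmcs, host_ctx):
    vmcs.host_area = dict(host_ctx)      # save host context
    return dict(vmcs.guest_area or {})   # restore guest context

def vm_exit(vmcs, guest_ctx, reason):
    vmcs.guest_area = dict(guest_ctx)    # save guest context
    vmcs.exit_reason = reason
    return dict(vmcs.host_area)          # restore host context

vmcs = VMCS()
vmcs.guest_area = {"rip": 0x1000, "cr3": "guest_pt"}
guest = vm_entry(vmcs, {"rip": 0xFFFF8000, "cr3": "host_pt"})
host = vm_exit(vmcs, {"rip": 0x1004, "cr3": "guest_pt"}, "EXIT_IO")
print(host["cr3"], vmcs.exit_reason)  # host_pt EXIT_IO
```

This also answers the question raised earlier of why the context cannot live in host or guest data structures alone: both sides' contexts must survive while the other side is running, so a shared structure managed by hardware is needed.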
VMX Mode Execution
• How is guest OS execution in VMX mode different?

• Restrictions on guest OS execution, configurable exits to KVM


• Guest OS exits to KVM on certain instructions (e.g., I/O device access)

• No hardware access to guest, emulated by KVM

• Guest OS usually exits on interrupts (interrupts handled by KVM, assigned to the appropriate host or
guest OS)
• KVM can inject virtual interrupts into the guest OS during VMX mode entry (KVM sets up the IDT such that the guest OS traps to KVM)

• All of the above controlled by KVM via VMCS

• Mimics the trap-and-emulate architecture with hardware support


• Guest runs in a (special) ring 0, but trap-and-emulate achieved
QEMU/KVM Operation Revisited
1. The QEMU VCPU thread calls KVM_RUN
2. KVM executes VMRESUME/VMLAUNCH; the host context is saved in the VMCS and the guest context restored
3. The guest OS and user applications run normally, within restrictions
4. The guest OS executes VMEXIT upon a trigger; the guest context is saved and the host context restored from the VMCS
5. KVM handles the exit or returns to the QEMU thread, with exit info in kvm_run
[Figure: QEMU userspace process (guest VM physical memory, VCPU-0 with kvm_run) above the KVM kernel module with the VMCS, above the guest OS and guest applications in VMX mode]
Host View
• Host sees QEMU as regular multithreaded process
• Process that has memory-mapped memory, talks to KVM device via ioctl calls
• Multiple QEMU VCPU threads can be scheduled
in parallel on multiple cores

• When KVM launches a VM, the host OS context is stored in the VMCS
• Host OS execution is suspended (all host processes stop)
• The CPU loads the guest OS context and the guest OS starts running
• When the guest OS exits, the host OS context is restored from the VMCS
• The host OS resumes in KVM, where it stopped execution
• KVM can return to QEMU, or the host can switch to another process
• The host OS is not aware of guest OS execution
[Figure: QEMU and KVM/host OS in root mode, with the VMCS bridging to the guest OS in VMX mode]
Summary
Hardware-assisted CPU virtualization in QEMU/KVM:
• QEMU creates the guest physical memory, with one thread per VCPU
• The QEMU VCPU thread gives the KVM_RUN command to the KVM kernel module
• KVM configures VM information in the VMCS and launches the guest OS in VMX mode
• The guest OS runs natively on the CPU until a VM exit happens
• Control returns to KVM/host OS on VM exit
• VM exits are handled by KVM or QEMU
• The host schedules QEMU like any other process, and is not aware of the guest OS
Full virtualization
Full Virtualization

• x86 and other hardware lacked virtualization support


• But cloud computing increased demand for virtualization
• VMWare workstation first to solve the problem of virtualization existing
operating systems on x86 (basis for this lecture)
• Type 2 hypervisor based on trap-and-emulate approach
• Key idea: dynamic (on a need basis) binary (not source) translation of OS
instructions
• Problematic OS instructions translated before execution
• Subsequently, hardware support for virtualization (previous lecture)
• Binary translation is higher overhead than hardware-assisted virtualization
• Used when hardware support not available
Full Virtualization VMM Architecture
[Host OS context | VMM context]
1. ioctl call to run the VM
2. World switch to the VMM context
3. The guest OS and user applications run with less privilege (guest apps in ring 3, guest OS in ring 1)
4. Privileged actions trap to the VMM (ring 0, where the guest OS traps)
5. The VMM switches back to the host on interrupts, I/O requests, etc. (some traps are handled by the VMM without a world switch)
6. The VMM kernel driver or userspace process handles the exits
(Questions: does the guest know it is running in rings 1 and 3, and what about sensitive instructions? How is this different from VMX mode?)
Host And VMM Contexts
• Each context has separate page tables, CPU registers, IDTs, and so on
• VMM context: the VMM occupies the top 4MB of the address space
• The memory page containing the code/data of the world switch is mapped in both contexts (by both page tables)
• The host/VMM context is saved/restored in this special "cross" page by the world-switch code
[Figure: host OS context (host user processes, VMM kernel driver) and VMM context (guest user processes, guest OS, VMM), sharing the world-switch cross page]
Understand Difference with QEMU/KVM

• Where is context saved?


• Common cross page mapped into both host and guest address spaces
• KVM: Common memory (VMCS) accessible by CPU in both contexts via special
instructions
• Privilege level of guest OS?
• Guest OS runs in ring 1 (lower privilege). Instructions that do not run correctly at
lower privilege level are suitably translated to trap to VMM
• KVM: Guest OS runs in VMX ring 0. Some privileged instructions trigger exit to KVM
• How to trap to VMM?
• The VMM is located in the top 4MB of the guest address space; the guest OS traps to the VMM for privileged ops.
World switch to the host if the VMM cannot handle the trap in the guest context
• KVM: VMM is not in guest context, guest traps to VMM in host via VM exit
Binary Translation
• The guest OS binary is translated instruction-by-instruction and stored in a translation cache (TC), part of VMM memory
• Most code stays the same, unmodified
• OS code is modified to work correctly in ring 1
• Sensitive but unprivileged instructions are modified to trap
• Guest OS code executes from the TC in ring 1
• Privileged OS code traps to the VMM
• E.g., I/O, set IDT, set CR3, other privileged ops
• Emulated in the VMM context or by switching to the host
• The VMM sets sensitive data structures like the IDT (maintains shadow copies)
[Figure: guest user processes in ring 3, guest OS with its translation cache in ring 1, VMM in ring 0]
Dynamic Binary Translation
• VMM translator logic (ring 0) translates guest code one basic block at a time to produce a compiled code fragment (CCF)
• Basic block = a sequence of instructions up to a jump/return
• Once a CCF is created, execution moves to ring 1 to run the translated guest code
• When a CCF ends, it "calls out" to the VMM logic to compute the next instruction to jump to, translate it, run that CCF, and so on
• If the next CCF is already present in the TC, execution jumps directly to it without invoking the VMM translator logic
• This optimization is called chaining
[Figure: guest basic blocks translated into CCFs in the translation cache (ring 1), with the VMM translator in ring 0]
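The translate-cache-chain cycle can be sketched with string "instructions" standing in for machine code. This is purely illustrative; a real translator operates on x86 binaries:

```python
# Toy dynamic binary translator: sensitive-but-unprivileged
# instructions are rewritten to trap; translated basic blocks (CCFs)
# are cached so re-execution skips the translator ("chaining").

SENSITIVE_UNPRIV = {"popf", "sgdt"}

translation_cache = {}   # basic-block id -> compiled code fragment
translator_calls = 0

def translate(block_id, instructions):
    global translator_calls
    if block_id in translation_cache:   # chaining: reuse the CCF
        return translation_cache[block_id]
    translator_calls += 1
    ccf = [f"trap_to_vmm({i})" if i in SENSITIVE_UNPRIV else i
           for i in instructions]       # most code passes unmodified
    translation_cache[block_id] = ccf
    return ccf

bb = ["mov", "popf", "jmp"]
print(translate("bb0", bb))   # popf rewritten to trap to the VMM
translate("bb0", bb)          # second run: served from the TC
print(translator_calls)       # translator ran only once
```

Note how this addresses the popf problem directly: the instruction that would have failed silently in ring 1 is rewritten so that it traps, restoring trap-and-emulate semantics.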
Use of Segmentation for Protection
• Paging protects user code from kernel code via a bit in the page table entry
• Segments are "flat": separate flat segments for user and kernel modes
• Segmentation is used to protect the VMM from guest user processes
• The flat segments are truncated to exclude the VMM area
• The CS of the guest OS (ring 1) points into the VMM area (where the translation cache lives)
• The VMM (ring 0) segments point to the top 4MB
[Figure: host and guest user processes share flat ring-3 segments; guest OS pages use truncated ring-1 segments; the VMM code and TC sit in the ring-0 segments in the top 4MB]
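The truncation check can be sketched numerically, assuming a 32-bit flat address space with the VMM in the top 4MB (the constants below are illustrative):

```python
# Sketch of segment-limit truncation protecting the VMM.
GB = 1 << 30
ADDR_SPACE_TOP = 4 * GB
VMM_SIZE = 4 * (1 << 20)                  # VMM lives in the top 4MB
TRUNCATED_LIMIT = ADDR_SPACE_TOP - VMM_SIZE

def guest_access_ok(vaddr):
    """Guest code (rings 1-3) uses truncated segments: any access at
    or above the VMM base exceeds the segment limit and faults."""
    return vaddr < TRUNCATED_LIMIT

print(guest_access_ok(0x08048000))          # ordinary guest address
print(guest_access_ok(ADDR_SPACE_TOP - 1))  # inside the VMM area
```

The segment limit thus gives hardware-enforced protection even though the guest OS and the VMM share one address space, without relying on page-table permission bits alone.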
Special Case: GS Segment(Optional)

• Sometimes, translated guest code (ring 1) needs to access VMM data structures like saved register values,
program counters and so on
• In such cases, memory accesses are rewritten to use the GS segment, e.g., virtual address “GS:someAddress”
• GS register points to the 4MB VMM area in ring 1
• Ensures that the translated guest OS code can selectively access VMM data
structures
• Original guest code that uses GS (which is rare) is rewritten to use another segment like %fs
Summary
• VMWare workstation is example of full virtualization, where unmodified OS is run on x86 hardware via
dynamic binary translation
• VMM user process and kernel driver on host trigger world switch from host OS context to VMM
context
• World switch code/data is part of both host and VMM contexts, special cross page accessible in both
modes has saved contexts
• VMM is in top 4MB of address space in VMM context
• Translated guest code runs in ring 1, traps to VMM in ring 0 for privileged operations (trap-and-
emulate)
• Traps handled by VMM in ring 0, or VMM exits to host OS for emulation
• Segmentation used to protect VMM from guest OS

• “Bringing Virtualization to the x86 Architecture with the Original VMware Workstation”, Edouard Bugnion,
Scott Devine, Mendel Rosenblum, Jeremy Sugerman, Edward Y. Wang.
• “A Comparison of Software and Hardware Techniques for x86 Virtualization”, Keith Adams, Ole Agesen.
