M2 Today
Containerization and Docker
Recap of Operating System
• To create a process, OS allocates memory in RAM for the memory image of the process, containing
• Code and static/global data from the executable
• Heap memory for dynamic memory allocations (e.g., malloc)
• Stack to store arguments/return address/local variables during function calls
• All instructions and variables in the memory image are assigned memory addresses
• Starting at 0, up to some max value (4GB in 32-bit systems)
Memory Image of Process
• When a process runs on the CPU, the CPU registers hold values related to the process execution
• The program counter (PC, or EIP in x86) has address of current instruction
• The CPU fetches the current instruction, decodes it, and executes it
• Any variables needed for operations are loaded from process memory into general purpose CPU registers (EAX,
EBX, ECX, EDX etc in x86)
• After instruction completes, values are stored from registers into memory
• The stack pointer (SP, or ESP in x86) has address of top of stack (current stack frame holds arguments/variables of
the current function that is running)
• The set of values of all CPU registers pertaining to a process execution is called its CPU context
CPU Context During Execution
[Figure: CPU registers (EIP/PC, EAX…EDX, ESP) pointing into the process memory image: code/data, heap, and stack]
Concurrent Exec., Context Switching
• To run a process, OS allocates memory, loads CPU context
• EIP points to instructions, ESP points to stack of process, registers have process data
• CPU now begins to run the process
• OS runs multiple processes concurrently by multiplexing them on the same CPU
• Context switch: After running a process for some time, OS switches from one process to another
• How does context switch happen? OS saves the CPU context of the old process and loads the context of new
process
• When EIP points to instruction of new process, new process starts to run
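The save/restore of CPU context described above can be pictured as a small C sketch. The structure and function names here are illustrative (a simplified stand-in for a PCB), not actual kernel code:

```c
#include <assert.h>

/* Hypothetical CPU context: the register values the OS saves and
   restores on a context switch (names mirror x86, simplified). */
struct cpu_context {
    unsigned int eip;               /* program counter */
    unsigned int esp;               /* stack pointer   */
    unsigned int eax, ebx, ecx, edx; /* general purpose registers */
};

/* Simulated "CPU registers" of the single physical CPU. */
static struct cpu_context cpu;

/* Save the running process's context into its PCB. */
void save_context(struct cpu_context *pcb) { *pcb = cpu; }

/* Load another process's saved context onto the CPU. */
void load_context(const struct cpu_context *pcb) { cpu = *pcb; }

/* Context switch: save the old process, load the new one.
   Once cpu.eip holds the new process's instruction address,
   the new process starts to run. */
void context_switch(struct cpu_context *old_pcb,
                    const struct cpu_context *new_pcb) {
    save_context(old_pcb);
    load_context(new_pcb);
}
```

A real context switch is done in assembly by the kernel; this sketch only shows the data movement involved.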
[Figure: CPU has switched execution from one process to another; EIP and ESP now point into the new process's code/data, heap, and stack in user memory, with kernel memory above]
• The addresses assigned in the memory image, which the CPU uses for loads and stores, are virtual/logical addresses
• These virtual addresses are not the actual addresses occupied by the instructions/data of the process, but are assigned from 0 for convenience
• Actual addresses where instructions/data bytes of process are stored are called physical
addresses
• RAM hardware needs physical addresses to fetch bytes
• CPU requests code/data at virtual addresses
• Translated to physical address so that RAM can fetch it
• Memory is allocated at granularity of pages (usually 4KB)
• Logical pages of a process are stored in physical frames in memory
• Logical page numbers translated to physical frame numbers
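The page-number-to-frame-number translation above can be sketched in a few lines of C. This is a toy single-level page table (the array index is the logical page number); names are illustrative, not a real MMU:

```c
#include <assert.h>

#define PAGE_SIZE 4096u  /* 4KB pages */

/* Translate a virtual address to a physical address using a
   hypothetical page table: page_table[page number] = frame number. */
unsigned int translate(const unsigned int *page_table, unsigned int vaddr) {
    unsigned int page   = vaddr / PAGE_SIZE; /* logical page number */
    unsigned int offset = vaddr % PAGE_SIZE; /* offset within page  */
    unsigned int frame  = page_table[page];  /* physical frame      */
    return frame * PAGE_SIZE + offset;       /* physical address    */
}
```

For example, with page 1 mapped to frame 3, virtual address 4097 (page 1, offset 1) translates to physical address 3*4096 + 1.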
Virtual and Physical Address
[Figure: CPU issues virtual addresses (0 up to 4GB); these are translated to physical addresses X, Y, Z where the process's bytes actually reside in RAM]
Address Translation with Paging
• On every memory access, a piece of hardware called MMU (memory management unit) translates virtual addresses to
physical addresses
• Page table of a process stores the mapping of logical page number to physical frame number
• OS builds the page table when allocating memory
• MMU uses this page table to translate addresses
• MMU looks up CR3 register of CPU (x86) which stores location of page table of current process
• Looks up page number in page table to translate address
• CR3 reset on every context switch
• Recent address translations are cached in the TLB (Translation Lookaside Buffer) located within the MMU
[Figure: a virtual address splits into page number and offset; the page table maps the page number (e.g., page 1 = frame y) to produce the physical address]
• Cannot store a large page table contiguously in memory, so page table of a process is stored in memory in page sized
chunks
• Each 4KB page stores 2^10 PTEs, so 2^20/2^10 = 2^10 pages to store all PTEs
• Pointers to these “inner” pages stored in an outer page directory
• In 32-bit architectures, one outer page can store physical frame numbers of all 2^10 inner page table pages, so
2-level page table
• More levels in page table for 64-bit architectures
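The 2-level split described above can be shown concretely: a 32-bit virtual address with 4KB pages divides into a 10-bit page-directory index, a 10-bit page-table index, and a 12-bit offset. A minimal C sketch of the index extraction (illustrative helper names):

```c
#include <assert.h>

/* 32-bit virtual address layout with 4KB pages and a 2-level page table:
   [ 10-bit directory index | 10-bit table index | 12-bit offset ] */
unsigned int dir_index(unsigned int va)   { return (va >> 22) & 0x3FFu; }
unsigned int table_index(unsigned int va) { return (va >> 12) & 0x3FFu; }
unsigned int page_offset(unsigned int va) { return va & 0xFFFu; }
```

The MMU uses the directory index to find the inner page-table page, the table index to find the PTE, and the offset within the 4KB frame.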
Page Table Lookup
[Figure: the CPU's CR3 register holds the physical address of the outermost page directory; each page table entry (PTE) stores a physical frame number]
Page fault
• On a page fault, a physical frame is assigned to the faulting page (another victim page may be swapped to disk), the page table is updated, and the CPU instruction is rerun
User Mode and Kernel Mode
• Two kinds of CPU instructions
• User code runs at low privilege level / user mode (CPU set to ring 3 in x86)
• If user runs privileged instructions, error is thrown
• OS code runs at high privilege level / kernel mode (CPU set to ring 0 in x86)
• Privilege level checked by CPU and MMU (every page has privilege bit)
• When user process needs to perform privileged action, must jump to privileged OS code, set CPU to high privilege, and
then perform the action
Where is OS Code Located?
• OS is part of the high virtual address space of every process
• The page table of a process maps these kernel addresses to the location of kernel code in memory
• Only one copy of OS code in memory, but it is mapped into the virtual address space/page table of every process
[Figure: virtual address space of a process (0 to 4GB) with kernel code at the top, mapped to a single copy of the OS in the physical address space]
Kernel Mode Execution
• When "int n" is executed, the CPU moves to ring 0 and sets EIP = IDT[n], jumping into OS code
• E.g., different hardware devices get different numbers, system call gets a different number
[Figure: CPU registers (EIP, EAX, CS, DS, SS) during the transition into kernel mode via the IDT]
Use of Segmentation Today
• What do operating systems use today? Segmentation or Paging?
• Hardware is built to do segmentation and paging, but only paging in practice
• Modern OSes use “flat” segments: Base = 0, Limit = MAX (4GB)
• Virtual address remains same after segmentation-based translation
• CS has two different values in user and kernel mode (base=0, limit=MAX, only privilege level is different)
• CS has high privilege level when executing privileged instructions
I/O Subsystem
• Processes use system calls to access I/O devices
• Process P1 makes system call, goes into kernel mode
• OS device driver initiates I/O request to device (e.g., disk, network card)
• If the system call is blocking (i.e., cannot be completed right away), OS performs context switch to
another process P2
• When request completes, I/O device raises interrupt
• P2 goes into kernel mode, handles interrupt, marks P1 as ready to run
• P1 runs at a later time when scheduled by OS scheduler
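The P1/P2 state transitions above can be sketched as a tiny C state machine. The state names and functions are illustrative, not a real scheduler:

```c
#include <assert.h>

enum pstate { RUNNING, BLOCKED, READY };

/* P1 issues a blocking system call: P1 blocks on the I/O request
   and the OS context-switches to P2. */
void blocking_syscall(enum pstate *p1, enum pstate *p2) {
    *p1 = BLOCKED;  /* P1 waits for the device */
    *p2 = RUNNING;  /* OS runs P2 instead      */
}

/* I/O interrupt arrives while P2 runs: the interrupt handler
   (in P2's kernel mode) marks P1 as ready to run. */
void io_interrupt(enum pstate *p1) {
    *p1 = READY;    /* P1 runs later, when scheduled */
}
```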
I/O Subsystem
• DMA: I/O devices perform Direct Memory Access to store I/O data in memory
• When initiating I/O request, device driver provides (physical) address of memory buffer in which to store
I/O data (e.g., disk block)
• Device first stores data in DMA buffer before raising interrupt
• Interrupt handler need not copy data from device memory to RAM
Summary
• The concept of a process, memory image, CPU context
• CPU context is saved in PCB/kernel stack during user/kernel mode transitions of a
process, as well as during context switch between processes
• User mode and kernel mode of a process, privileged instructions, CPU privilege levels
• Privileged instructions in OS code run at highest CPU privilege level (ring 0)
• I/O handling
COMPUTER ORGANISATION
[Figure: a non-virtualized system consists of hardware: CPU, memory, and I/O devices]
[Figure: hardware failure takes down the whole non-virtualized system]
[Figure: applications such as Hadoop, Pregel, and MPI running on a shared IaaS]
Benefits of Virtualisation
■ Guests are installed on top of virtual hardware, controlled and managed by the virtualization layer (virtual machine manager, VMM).
■ The host is instead represented by the physical hardware, and in some cases the operating system, that defines the environment where the VMM is running.
■ Virtual storage: guests might be client applications that interact with the virtual storage management software deployed on top of the real storage system.
■ Virtual networking: the guest interacts with a virtual network (VPN) managed by specific software (a VPN client) using the physical network on the nodes.
■ VPNs are useful for creating the illusion of being within a different physical network and thus accessing the resources in it, which would otherwise not be available.
Virtualisation in Conventional Software
Infrastructure
■ A Virtual Machine (VM) is a compute resource that uses software instead of a physical computer
to run programs and deploy apps.
■ Architecture: a formal specification to an interface in the system, including the logical behavior
of the resources managed via the interface.
■ Implementation describes the actual embodiment of an architecture.
■ Abstraction levels correspond to implementation layers, each having its own interface or architecture.
■ Guest – system component interacts with Virtualization Layer.
■ Host – original environment where guest runs.
■ Virtualization Layer – recreate the same or different environment where guest will run.
Virtualised Infrastructure in Clouds
[Figure: virtual machines on top of a virtualisation layer on top of hardware]
Virtualization in Clouds
Save the cost (space, energy, personnel) by running one physical machine in place of several, or vice-versa (a green aspect, too!)
Use the (otherwise wasted) CPU power
Clone servers (for example, for debugging) at low cost
Migrate a machine (for example, when the load increases) at low cost
Isolate an appliance—a server for a specific purpose (such as security)—without buying
new hardware
Virtualised Infrastructure in Clouds
The only way to enter the physical hardware is via a software (operating system) or hardware exception (interrupt or trap!)
Virtualised Infrastructure in Clouds
Virtual Machines
•Proxies for physical resources and have the same external interfaces and functions
•Composed from physical resources
Virtualization Layer
•Creates virtual resources and "maps" them to physical resources
•Provides isolation between Virtual Resources
•Accomplished through a combination of
software, firmware, and hardware mechanisms
Increased security :
■ For guest, a completely transparent and controlled execution environment.
■ An emulated environment in which the guest is executed.
■ Guest operations are performed on the VM, which translates and applies them to the host.
■ The VMM controls and filters the guest activity, thus preventing some harmful operations from being performed; the host is hidden or protected from the guest.
■ Sensitive information naturally hidden without installing complex security policies.
■ Increased security is a requirement when dealing with untrusted code.
■ Hardware virtualization solutions (VMware Desktop, Virtual Box, and Parallels) create a
virtual computer with customized virtual hardware on top of which a new operating
system can be installed.
■ By default, the file system exposed by the virtual computer is completely separated from
the one of the host machine.
Characteristics of virtualized environments
Managed execution :
■ Sharing: sharing in virtualized data centers, to reduce the number of active servers and limit
power consumption
■ Aggregation: A group of separate hosts tied together and represented to guests as a single
virtual host, implemented in middleware for distributed computing.
■ Emulation: Controlling and tuning the environment that is exposed to guests. Example: an arcade-game emulator to play arcade games on a normal personal computer.
■ Isolation: Allows multiple guests to run on the same host without interfering with each other; also, separation between the host and the guest.
Characteristics of virtualized environments
Portability:
■ Hardware virtualization solution: guest packaged into a virtual image that can be safely
moved and executed on top of different virtual machines.
■ Programming-level virtualization: Implemented by JVM or .NET runtime, binary code
can be run without any recompilation.
■ This makes portability flexible and very straightforward: one version is able to run on different platforms with no changes.
■ Allows you to carry your own system always with you, ready to use as long as the required virtual machine manager is available (the services you need are available to you anywhere you go).
Hands-On
[Figure: VM lifecycle states (Off, On, Paused, shutdown) compared with a physical machine]
Virtualization: How is it done?
• Execution environment:
– Process level: emulation (application model); high-level VM (programming-language model); multiprogramming (operating-system model)
– System level: hardware-assisted virtualization, full virtualization, paravirtualization, partial virtualization (hardware model)
• Storage
• Network
• …
Execution Virtualization
■ An Execution Environment (EE) is emulated and separated from the host; it is implemented on top of the hardware by the OS or by applications (libraries), dynamically or statically.
Machine reference model
■ For virtualizing an execution environment at different levels of computing stack, a reference
model is required that defines interfaces between levels of abstractions.
■ Virtualization techniques actually replace one of the layers.
■ A clear separation between layers simplifies their implementation.
■ It requires the emulation of interfaces and interaction with underlying layer.
Computing Stack
Applications: application-level virtualization
Programming languages (execution stack): programming-language-level virtualization
Operating systems: OS-level virtualization
Hardware: hardware-level virtualization
Execution virtualization: Interfaces
[Figure: interfaces between layers: applications call libraries through the API; applications and libraries invoke the operating system through the ABI (system calls); the operating system and applications use the ISA exposed by the hardware, with applications seeing only the user ISA and the OS the full ISA]
■ Non-privileged instructions can be used without interfering with other tasks because they do not access shared resources (e.g., floating-point, fixed-point, and arithmetic instructions).
■ Privileged instructions are executed under specific restrictions and mostly used for sensitive
operations:
➢ Behavior-sensitive: operate on the I/O
➢ Control-sensitive: alter the state of the CPU.
■ Some architectures feature more than one class of privileged instructions.
■ For instance, a hierarchy of privileges (see figure on next slide) in the form of ring-based security: Ring 0, Ring 1, Ring 2, and Ring 3.
Security Rings and Privilege Modes.
[Figure: concentric security rings; Ring 3 (user mode) is the least privileged mode, Ring 0 (supervisor mode) the most privileged]
Ring 0 is in the most privileged level and Ring 3 in the least privileged level. Ring 0 is used by the
kernel of the OS, rings 1 and 2 are used by the OS-level services, and Ring 3 is used by the user.
Segment Table revisited
Each row: start address | size | ring | access type = {read, write, execute}
[Table: segment table with one row per segment, rows 0 .. n]
Code permitted to access ring i may access ring j if and only if j ≥ i.
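The ring-access rule above can be written as a one-line check. This sketch interprets the rule so that same-ring access is allowed (j ≥ i, with lower ring numbers being more privileged); names are illustrative:

```c
#include <assert.h>

/* Ring check from the segment table: code running in ring i may access
   a segment in ring j only if j >= i, i.e., code may access segments at
   its own or a less privileged level, never a more privileged one. */
int can_access(int code_ring, int segment_ring) {
    return segment_ring >= code_ring;
}
```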
Execution virtualization
■ The distinction between user and supervisor mode signifies the role of the hypervisor.
■ Hypervisors run in supervisor mode; the division between privileged and non-privileged instructions makes the design of virtual machine managers challenging.
■ Sensitive instructions should be executable only in privileged (supervisor) mode, so that running them in user mode traps.
■ Without this assumption it is impossible to fully emulate and manage the status of the CPU for guest operating systems.
■ This is not true for the original x86 ISA, which allows 17 sensitive instructions to be called in user mode; this prevents multiple OSs managed by a single hypervisor from being isolated from each other, since they can access the privileged state of the processor and change it.
(Note: without the hypervisor's knowledge, i.e., without generating a trap, how can we virtualize? :)
Execution virtualization
■ Recent implementations of ISA (Intel VT and AMD Pacifica) solved this problem by
redesigning such instructions as privileged ones.
■ In a hypervisor-managed environment, all guest OS code runs in user mode in order to prevent it from directly accessing the status of the CPU.
■ If there are sensitive instructions that can be called in user mode, it is no longer possible to
completely isolate the guest OS.
Hardware Virtualization
Guest
In memory
representation
Virtual Image
Storage
Virtual Machine
binary translation
instruction mapping
interpretation
……
Host
Techniques to design
Virtual Machine Monitors
What does VMM do?
• Trap-and-emulate VMM: guest OS runs at lower privilege level than VMM, traps to VMM for privileged operation
Trap and emulate VMM (2)
• Some x86 instructions which change hardware state (sensitive instructions) run in both privileged and unprivileged modes without trapping
• They behave differently when the guest OS is in ring 0 vs. in the less privileged ring 1
• The guest OS behaves incorrectly in ring 1 and will not trap to the VMM
• Instruction set architecture of x86 is not easily virtualizable (x86 wasn’t designed with virtualization in
mind)
Example: Problems with trap and emulate
■ Equivalence – a program running under the control of the VMM should show the same behavior as when executed directly on the physical host.
■ Resource control – VMM should be in complete control of virtualized resources.
■ Efficiency – a statistically dominant fraction of the machine instructions should be executed without
intervention from the VMM.
■ Theorems*: Classification of the instruction set and proposed three theorems that define the
properties that hardware instructions need to satisfy in order to efficiently support virtualization.
* These criteria were established by Popek and Goldberg in 1974
Popek Goldberg theorem
• Theorem 1: In order to build a VMM efficiently via trap-and-emulate method, sensitive instructions should be a subset
of privileged instructions
• x86 does not satisfy this criterion, so a trap-and-emulate VMM is not possible
[Figure: Venn diagram of CPU instructions; on x86, the set of sensitive instructions is not a subset of the set of privileged instructions]
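Theorem 1's subset condition can be checked mechanically. In this sketch, instruction classes over a toy instruction set are represented as bitmasks (one bit per instruction); the representation is illustrative:

```c
#include <assert.h>

/* Popek-Goldberg Theorem 1: a trap-and-emulate VMM can be built
   efficiently iff sensitive instructions are a subset of privileged
   instructions. Here each bit stands for one instruction. */
int virtualizable(unsigned long sensitive, unsigned long privileged) {
    /* No instruction may be sensitive but unprivileged. */
    return (sensitive & ~privileged) == 0;
}
```

x86 fails this check because instructions like POPF are sensitive yet do not trap in user mode.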
Hardware Virtualization
Theorem 2 *:
■ A conventional third-generation computer is recursively virtualizable if:
■ it is virtualizable, and
■ a VMM without any timing dependencies can be constructed for it.
■ Theorem 3 *:
■ A hybrid VMM may be constructed for any third-generation machine in which the set of user-sensitive instructions is a subset of the set of privileged instructions.
■ In an HVM, more instructions are interpreted rather than being executed directly.
Techniques to virtualize x86 (1)
Hypervisors (or VMMs):
■ A hypervisor runs above the physical hardware.
■ It runs in supervisor mode.
■ It recreates a h/w environment in which guest operating systems are installed.
■ It is a piece of s/w that enables us to run one or more VMs on a physical server (host).
■ Two major types of hypervisor:
■ – Type-I
■ – Type-II
Hardware Virtualization
Hypervisors (Type-I):
■ Runs directly on top of the hardware, taking the place of the OS.
■ Interacts directly with the ISA exposed by the underlying hardware, and emulates this interface in order to allow the management of guest OSs.
■ Also known as a native virtual machine.
Hypervisors (Type-II):
■ Requires the support of an operating system to provide virtualization services.
■ Programs managed by the OS interact with the OS through the ABI.
■ Emulates the ISA of the virtual h/w for the guest OS.
■ Also called a hosted virtual machine.
Hosted and native Hypervisors
[Figure: a Type-I (native) hypervisor runs directly on the hardware ISA; a Type-II (hosted) hypervisor runs on a host OS through the ABI and exposes a virtual ISA to its VMs]
• VMware example (Type-II hypervisor):
– Upon loading a program, scan its code for basic blocks
– If a block contains sensitive instructions, replace them by a call to a VMware procedure
• This is binary translation
– Cache the modified basic block in the VMware cache
• Execute; load the next basic block, etc.
• Type-II hypervisors work without VT support (no automatic trapping of privileged instructions by hypervisors)
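The scan-and-replace step above can be sketched in C. This is a toy model: opcodes are single bytes, and the two opcode values are stand-ins (real binary translation must decode variable-length x86 instructions):

```c
#include <assert.h>
#include <stddef.h>

/* Toy binary translation: scan a basic block of one-byte "opcodes" and
   replace each sensitive one with a call into a VMM emulation routine. */
#define OP_SENSITIVE 0x9D  /* stand-in for a sensitive instruction (e.g. POPF) */
#define OP_VMM_CALL  0xCC  /* stand-in for a jump to a VMM procedure */

/* Returns how many instructions were rewritten. */
size_t translate_block(unsigned char *code, size_t len) {
    size_t replaced = 0;
    for (size_t i = 0; i < len; i++) {
        if (code[i] == OP_SENSITIVE) {
            code[i] = OP_VMM_CALL;  /* emulate instead of executing directly */
            replaced++;
        }
    }
    return replaced;
}
```

The rewritten block is then cached so later executions skip re-translation.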
Hardware Virtualization
• KVM (kernel module): when invoked, KVM switches the CPU to VMX mode to run the guest
• CPU with VMX mode: the CPU switches between VMX root (host) and VMX non-root (guest) modes
LibVirt and QEMU/KVM
• When you install QEMU/KVM on Linux, libvirt is also installed
• libvirt is a set of tools to manage hypervisors, including QEMU/KVM
• A daemon runs on the system and communicates with hypervisors
• Exposes an API using which hypervisors can be managed, VM created etc.
• Commandline tool (virsh) and GUI (virt-manager) use this API to manage VMs
QEMU userspace control flow (annotated):
  open("/dev/kvm")
  ioctl(qemu_fd, KVM_CREATE_VM)
  ioctl(vm_fd, KVM_CREATE_VCPU)
  for (;;) {                    // each VCPU runs this loop
      ioctl(vcpu_fd, KVM_RUN)   // this ioctl blocks the thread; KVM switches to VMX mode and runs the guest VM
      // returns to QEMU on the host when the VM exits from VMX mode
      switch (exit_reason) {
          case KVM_EXIT_IO:     // do I/O
          case KVM_EXIT_HLT:
      }
      // QEMU handles the exit and returns to the guest VM
  }
QEMU/KVM Operation
[Figure: guest VM physical memory with the guest application inside the VM]
• VMCS information (e.g., exit reason) exchanged with QEMU via kvm_run structure
• VMCS only accessible to KVM in kernel mode, not to QEMU userspace
VMX Mode Execution
• How is guest OS execution in VMX mode different?
• Guest OS usually exits on interrupts (interrupts handled by KVM, assigned to the appropriate host or
guest OS)
• KVM can inject virtual interrupts into the guest OS during VMX mode entry
• KVM builds the guest IDT such that the guest OS traps to KVM
• Host OS resumes in KVM, where it stopped execution
• KVM can return to QEMU, or the host can switch to another process
• Host OS is not aware of guest OS execution
[Figure: QEMU (userspace process) with a kvm_run structure per VCPU; root mode vs. VMX mode]
Summary
Hardware-assisted CPU virtualization in QEMU/KVM
• QEMU creates guest physical memory, one thread per VCPU
• QEMU VCPU thread gives the KVM_RUN command to the KVM kernel module
• KVM configures VM information in the VMCS, launches the guest OS in VMX mode
• Guest OS runs natively on the CPU until a VM exit happens
[Figure: QEMU and KVM/host OS in root mode; guest OS and guest application, with guest VM physical memory, in VMX mode; the VMCS sits between the two modes]
(Note: does the guest know it is running in rings 1 and 3? What about sensitive instructions?)
1. ioctl call to run VM (from the VMM userspace process, ring 3)
2. World switch to VMM context
3. Guest OS (ring 1) and guest user applications (ring 3) run with less privilege
4. Privileged actions trap to the VMM (guest OS traps to the VMM kernel driver, ring 0)
5. VMM switches back to the host on interrupts, I/O requests, etc. (some traps are handled by the VMM without a world switch)
6. VMM kernel driver or userspace process handles exits
(Note: how is this different from VMX mode?)
Host And VMM Contexts
[Figure: separate host OS and VMM contexts]
Understand Difference with QEMU/KVM
• VMM translator logic (ring 0) translates guest code one basic block at a time to produce a compiled code fragment (CCF)
• Basic block = sequence of instructions until a jump/return
• Once a CCF is created, move to ring 1 to run the translated guest code
• Once a CCF ends, “call out” to VMM logic, compute the next instruction to jump to, translate it, run that CCF, and so on
• If the next CCF is present in the translation cache (TC) already, then directly jump to it without invoking the VMM translator logic
• This optimization is called chaining
[Figure: guest user and guest OS basic blocks are translated into CCFs in the translation cache (ring 1), managed by the VMM translator (ring 0)]
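The translation cache with chaining can be sketched as a small direct-mapped cache. Everything here is illustrative (slot count, hashing by modulo, the counter), a toy model of the behavior described above:

```c
#include <assert.h>

/* Toy translation cache: maps guest basic-block addresses to cached CCFs.
   Chaining: if the next block is already cached, jump to it directly
   instead of calling back into the VMM translator. */
#define TC_SLOTS 64

static unsigned int tc_guest_addr[TC_SLOTS]; /* which guest block is cached */
static int          tc_valid[TC_SLOTS];
static int          translator_calls = 0;    /* counts VMM translator invocations */

/* Returns the TC slot holding the CCF for guest_addr,
   translating the block first if it is not cached. */
int lookup_or_translate(unsigned int guest_addr) {
    int slot = (int)(guest_addr % TC_SLOTS);
    if (tc_valid[slot] && tc_guest_addr[slot] == guest_addr)
        return slot;              /* chained: no translator invocation */
    translator_calls++;           /* call out to VMM translator logic  */
    tc_guest_addr[slot] = guest_addr;
    tc_valid[slot] = 1;
    return slot;
}
```

Repeated execution of the same block hits the cache, so the expensive translator runs only once per block.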
Use of Segmentation for Protection
• Paging protects kernel code from user code via a bit in the page table entry
• Segments are “flat”: separate flat segments for user and kernel modes
• Segmentation is used to protect the VMM from guest user processes
• Flat segments are truncated to exclude the VMM
• CS of the guest OS (ring 1) points to the translation cache within the VMM area
• VMM (ring 0) segments point to the top 4MB
[Figure: host and guest user processes use flat ring-3 segments (cs, ds); guest OS pages sit in ring 1 with cs pointing to the TC; VMM segments (ring 0) point to the top 4MB]
Special Case: GS Segment(Optional)
• Sometimes, translated guest code (ring 1) needs to access VMM data structures like saved register values,
program counters and so on
• In such cases, memory accesses are rewritten to use the GS segment, e.g., virtual address “GS:someAddress”
• GS register points to the 4MB VMM area in ring 1
• Ensures that the translated guest OS code can selectively access VMM data
structures
• Original guest code that uses GS (which is rare) is rewritten to use another segment like %fs
Summary
• VMWare workstation is example of full virtualization, where unmodified OS is run on x86 hardware via
dynamic binary translation
• VMM user process and kernel driver on host trigger world switch from host OS context to VMM
context
• World switch code/data is part of both host and VMM contexts, special cross page accessible in both
modes has saved contexts
• VMM is in top 4MB of address space in VMM context
• Translated guest code runs in ring 1, traps to VMM in ring 0 for privileged operations (trap-and-emulate)
• Traps handled by VMM in ring 0, or VMM exits to host OS for emulation
• Segmentation used to protect VMM from guest OS
• “Bringing Virtualization to the x86 Architecture with the Original VMware Workstation”, Edouard Bugnion,
Scott Devine, Mendel Rosenblum, Jeremy Sugerman, Edward Y. Wang.
• “A Comparison of Software and Hardware Techniques for x86 Virtualization”, Keith Adams, Ole Agesen.