Cs498 Week 13 Slide

Cloud Computing

Virtualization
Prof. Reza Farivar
Sharing Resources

• The economics of clouds requires sharing resources
• How do we share a physical computer among multiple users?
• Answer: Abstraction
• Introduce an abstract model of what a generic computing resource should look like
• The physical computer resource then provides this abstract model to many users
Layers of Abstraction
• Introduce an abstract model of
what a generic computing resource
should look like
• The physical computer resource
then provides this abstract model to
many users
• Virtualization avoids creating
dependencies on physical resources

Virtualization: Foundation of Cloud
Computing
• Virtualization allows distributed computing
models without creating dependencies on
physical resources
• Clouds are based on virtualization
• Offer services based mainly on virtual machines,
remote procedure calls, and client/servers
• Provide lots of servers to lots of clients (e.g., phones)
• Simplicity of use and ease of programming require allowing client/server paradigms to be used to construct services from lots of resources
Software-Defined Data Center

[Figure: software-defined data center. A user reaches the cloud through a self-service application portal. Network, storage, and compute virtualization layers (network controller and scheduler, storage controller and scheduler, compute hypervisors) sit on top of the physical network, storage, and CPU, all tied together by automation.]
Types of Virtualization

• Emulation
• Full
• Software
• Binary Translation
• Paravirtualization
• Hardware assisted
• MicroVMs
• OS level
• Containers
Virtualization: Background
Brief History Lesson
• Single program computers
• VERY early mainframes (1950s)
• MS-DOS
• Single user program gets access to everything the hardware
has
• The OS is really a thin wrapper around BIOS
• No real notion of process
• Multi-user / Multi-tasking
• Need to isolate programs
• Need to isolate users
• Notion of Process
• "executing program and its context"

Cloud Computing Applications - Reza Farivar 2


World-view from a process
• An image of the program’s executable machine code
• Memory
• virtual address space → paging → a virtual memory page is brought into memory when the process attempts to use it → managed by the OS
• Process-specific data (input and output)
• Stack: temp data, e.g. function parameters, local variables, return
addresses, function call stack, and saved variables
• Heap to hold intermediate data during run time
• OS resource descriptors: e.g. file descriptors, data sources
and sinks
• Security attributes: e.g. process owner and the process's set of permissions (allowable operations)
• Processor state (context)
• Program Counter
• Content of registers and physical memory addressing
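The items on this slide can be read as the fields of a per-process record that the OS keeps. The sketch below is only an illustration under that assumption: the class names, field names, and the dictionary used to stand in for the CPU are all hypothetical and do not mirror any real kernel structure.

```python
from dataclasses import dataclass, field


@dataclass
class ProcessorState:
    """Saved CPU context: program counter and register contents."""
    program_counter: int = 0
    registers: dict = field(default_factory=dict)


@dataclass
class Process:
    """Hypothetical per-process record mirroring the list on the slide."""
    pid: int
    owner: str                                        # security attribute
    image: bytes                                      # executable machine code
    stack: list = field(default_factory=list)         # temporary data
    heap: dict = field(default_factory=dict)          # run-time data
    open_files: dict = field(default_factory=dict)    # descriptor -> resource
    permissions: set = field(default_factory=set)     # allowable operations
    context: ProcessorState = field(default_factory=ProcessorState)


def context_switch(current, nxt, cpu):
    """Save the running process's CPU state, then load the next one's."""
    current.context = ProcessorState(cpu["pc"], dict(cpu["regs"]))
    cpu["pc"] = nxt.context.program_counter
    cpu["regs"] = dict(nxt.context.registers)
```

Keeping the processor state inside the record is what lets the OS pause one process and resume another without either noticing.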
Process Isolation
• Need to isolate processes from each other
• Virtualized, idealized, machine
• A process is not capable of interacting with another process except
through secure, kernel managed mechanisms
• User Processes should not be allowed to issue sensitive
instructions
• Things like loading memory mapping tables and accessing I/O
devices.
• Normal applications must not use any of these instructions
• Imagine what would happen if a normal application like a word processor were suddenly able to write to arbitrary memory locations, or get raw access to your hard drive.
Dual Mode Operations in OS
• The CPU and the Operating System work
together to ensure process isolation

• To isolate processes from each other, the OS has two modes
• User Mode
• Kernel Mode


User and Kernel Modes
• User Mode
• User processes operate in user mode
• When the user application requests a service from the
operating system, or an interrupt occurs, or a system call is
made, there will be a transition from user to kernel mode to
fulfill the requests
• Kernel Mode
• When the system boots, hardware starts in kernel mode
• Privileged instructions execute only in kernel mode
• If a user process attempts to run a privileged instruction in user mode, the CPU treats the instruction as illegal and traps to the OS
• Examples of privileged operations: input/output management, interrupt handling
CPU privilege protection
• When a privileged instruction is executed (or a safe
instruction accesses a privileged resource), the CPU
checks whether the process is allowed or not
• Different mechanisms
• x86 Example: Ring levels
• Kernel mode code (OS, Device drivers, …) run in ring 0
• User processes run in ring 3
• The CPU raises a General Protection Fault (GPF) if a privileged instruction is executed in the wrong ring level


CPU + OS
• Certain operations are not allowed in user mode code
• Read and write from a hardware device
• Enabling/Disabling system interrupts
• Such operations only allowed in Kernel mode
• The task of enforcing this requirement is performed
by the CPU
• Examples of privileged operations
• HLT: Halt CPU till next interrupt
• INVLPG: Invalidate a page entry in the translation lookaside buffer (TLB)
• LIDT: Load Interrupt Descriptor Table
• MOV CR registers: load or store control registers
• In this case the MOV instruction (a non-privileged instruction on its own) is accessing a privileged register
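The ring check described on the last two slides can be sketched as a small simulation. This is a toy model, not how a real CPU decodes instructions: the function name `execute`, the exception class, and the choice of CR0/CR3/CR4 as the protected registers are assumptions made for illustration.

```python
class GeneralProtectionFault(Exception):
    """Stands in for the fault the CPU raises on a privilege violation."""


# Instructions from the slide that are privileged on their own.
PRIVILEGED_INSTRUCTIONS = {"HLT", "INVLPG", "LIDT"}
# Control registers: MOV is harmless by itself but not when it touches these.
PRIVILEGED_REGISTERS = {"CR0", "CR3", "CR4"}


def execute(instruction, ring, operand=None):
    """Allow the instruction only if the current ring permits it."""
    touches_privileged_state = (
        instruction in PRIVILEGED_INSTRUCTIONS
        or (instruction == "MOV" and operand in PRIVILEGED_REGISTERS)
    )
    if touches_privileged_state and ring != 0:
        raise GeneralProtectionFault(
            f"{instruction} {operand or ''}".strip() + f" attempted in ring {ring}"
        )
    return "executed"
```

In this model `execute("MOV", 3, "EAX")` succeeds, while `execute("MOV", 3, "CR3")` raises the fault, which matches the slide's point that the same MOV instruction becomes privileged only through the register it accesses.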


Virtualization: Full Virtualization
Full Virtualization
• The virtual machine simulates enough hardware to allow an
unmodified "guest" OS (one designed for the same CPU) to be
run in isolation
• The virtual machine looks and feels exactly like a real
computer, up to the point where a guest operating system
cannot tell the difference
• Examples:
• VirtualBox
• Virtual PC
• VMWare
• QEMU
[Figure: apps run on guest OSes; the guest OSes and a management component run on the hypervisor (VMM), which runs on the hardware.]
Virtualization: Privileged and non-privileged
instructions
• 1974 paper by Goldberg and Popek described
criteria to make a system virtualizable
• Trap and Emulate
• Safe Instructions
• Unsafe (sensitive) instructions
• Privileged instructions are a subset of unsafe instructions
• A privileged instruction should cause a trap when executed outside kernel mode
• For trap and emulate to work, every unsafe instruction must be privileged
• The original x86 was not virtualizable according to the above paper
• 17 unsafe instructions were not privileged
• Intel VT-x and AMD-V later made it possible to trap these


Trap and Emulate
• The classical way to implement a hypervisor is using the "trap and emulate" approach
• This approach was used by the very first
hypervisor developed by IBM in the late 60s
• IBM System 370
• Used again today on 64-bit Intel and AMD systems
• The approach usually has good performance, because the majority of the instructions will not cause a trap, and will execute straight on the CPU with no overhead.


Trap and Emulate
• The hypervisor lets executable code from the guest execute directly on the host CPU
• The hypervisor configures the CPU in such a way that all potentially unsafe instructions will cause a "trap"
• An unsafe instruction is one that, for example, tries to access or modify the memory of another guest.
• A trap is an exceptional condition that transfers control back to the hypervisor.
• Once the hypervisor has received a trap, it will inspect the offending instruction, emulate it in a safe way, and continue execution after the instruction
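The trap-and-emulate loop described above can be sketched as a toy interpreter. Everything here is hypothetical: the instruction names (`ADD`, `IO_WRITE`, `LOAD_PAGE_TABLE`), the dictionary that represents a guest's virtual state, and the `Hypervisor` class are invented for illustration, and a Python exception stands in for the hardware trap.

```python
class Trap(Exception):
    """Stands in for the hardware trap that hands control to the hypervisor."""

    def __init__(self, instruction):
        super().__init__(instruction)
        self.instruction = instruction


UNSAFE = {"IO_WRITE", "LOAD_PAGE_TABLE"}  # hypothetical unsafe instructions


def run_on_cpu(instruction, vm):
    """Direct execution: safe instructions run, unsafe ones trap."""
    op, arg = instruction
    if op in UNSAFE:
        raise Trap(instruction)
    if op == "ADD":
        vm["acc"] += arg


class Hypervisor:
    def __init__(self):
        self.traps = 0

    def emulate(self, instruction, vm):
        """Apply the instruction's effect to the guest's virtual state only."""
        op, arg = instruction
        if op == "IO_WRITE":
            vm["virtual_device"].append(arg)
        elif op == "LOAD_PAGE_TABLE":
            vm["virtual_page_table"] = arg

    def run(self, program, vm):
        for instruction in program:
            try:
                run_on_cpu(instruction, vm)
            except Trap as trap:
                self.traps += 1
                self.emulate(trap.instruction, vm)
            # Either way, execution continues after the instruction.
        return vm
```

Only the unsafe instructions pay the cost of a round trip through the hypervisor, which is why the approach performs well when such instructions are rare.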


Virtualization: Paravirtualization
Software-only Virtualization
• Problem: x86 processors were not virtualizable until the mid-2000s
• Software-only virtualization is a technique to work around the trap and emulate design of Popek and Goldberg
• Does not need special hardware support, e.g. the Intel "VT-x" or "AMD-V" features


Paravirtualization
• First approach to software-only virtualization
• The virtual machine does not necessarily simulate hardware, but instead (or in addition) offers a special API that can only be used by modifying the "guest" OS
• Paravirtualization is a technique in which a modified guest operating system kernel communicates to the hypervisor its intent to perform privileged CPU and memory operations
• The guest OS is specifically modified to run on a hypervisor
• Windows 7 and newer
• Linux Kernel version 3 and later
• Example:
• Xen
[Figure: apps run on modified guest OSes, each with a stub; the guests and a management component run on the hypervisor (VMM), which runs on the hardware.]
Xen and the Linux Kernel

• Xen was initially a university research project


• Invasive changes to the kernel to run Linux as a
paravirtualized guest
• Maintenance effort required on distributions
• Support was added in mainstream Linux Kernel 3.0 (2011)
• Usually very fast → Trap and Emulate has overhead, paravirtualization eliminates traps
• Risk of distributions dropping Xen support
Xen Concepts

• Control Domain 0 a.k.a. Dom0


• Dom0 kernel with drivers
• Xen management tool stack
• Trusted computing base
• Guest Domains
• Your apps
• For example, your cloud management stack
• Driver/Stub/Service Domain(s)
• A "driver, device model or control service in a box"
• De-privileged and isolated
• Lifetime: start, stop, kill
Virtualization: Binary Translation
Binary Translation
• Binary translation rewrites sensitive instructions on the fly into virtualizable instructions
• we only need to translate kernel code that is executing in ring 0
• Depending on the workload, this is a small fraction of the total code
• Examine the executable code of the virtual guest for "unsafe"
instructions
• Translate these into "safe" equivalents
• Execute the translated code
• Direct Execution
• most code is executed directly on the CPU, and only the code that needs to be translated is actually translated
• Binary translation is performed on the binary code that gets executed on the processor; it does not require changes to the guest operating system kernel
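The examine, translate, and execute steps above can be sketched in a few lines. This is a simplified model under stated assumptions: instructions are plain mnemonic strings, POPF and SMSW are used as examples of sensitive x86 instructions that do not trap, and the replacement strings (`CALL emulate_popf`, `CALL emulate_smsw`) as well as the translation cache are hypothetical illustrations rather than what any real VMM emits.

```python
# Hypothetical mapping from an unsafe instruction to a safe equivalent.
SAFE_EQUIVALENT = {
    "POPF": "CALL emulate_popf",
    "SMSW": "CALL emulate_smsw",
}

translation_cache = {}


def translate_block(block):
    """Rewrite unsafe instructions in a block; reuse earlier translations."""
    key = tuple(block)
    if key not in translation_cache:
        translation_cache[key] = [SAFE_EQUIVALENT.get(i, i) for i in block]
    return translation_cache[key]


def instructions_reaching_cpu(blocks):
    """blocks is a list of (instructions, ring) pairs from the guest."""
    stream = []
    for block, ring in blocks:
        if ring == 0:
            stream.extend(translate_block(block))  # guest kernel code
        else:
            stream.extend(block)                   # direct execution
    return stream
```

Caching translated blocks is the same idea that makes just-in-time compilers fast: a block of kernel code is rewritten once and then reused every time it runs again.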
Binary Translation
• Binary translation was first described in a
paper from 1992 by Digital Equipment
Corporation (DEC)
• Original VMWare Workstation 1.0 in 1999
• BT support deprecated since 2016
• Somewhat similar to just-in-time compilation for the Java Virtual Machine (JVM), JavaScript (V8 in Chrome), PHP 8 (since November 2020)


Virtualization: 1st Gen Hardware Virtualization
Hardware-Enabled Virtualization
• Intel VT (IVT)
• AMD virtualization (AMD-V)
• Allow "trapping" of sensitive instructions
• Popek & Goldberg → Trap and Emulate
• Examples:
• VMWare Fusion, ESX
• Parallels Desktop for Mac
• Parallels Workstation
[Figure: apps run on guest OSes, each guest OS inside a hardware VM, on top of the hardware.]
First Generation Hardware Virtualization
• First introduced in x86 in the mid-2000s
• Intel VT-x, AMD-V
• Virtual machine control block (VMCB).
• in-memory data structure
• The VMCB combines control state with a subset of
the guest VCPU state
• A new, less privileged execution mode, guest mode, supports direct execution of guest code, including privileged kernel code


First Generation Hardware Virtualization
• A new instruction, vmrun, transfers from host
to guest mode.
• Upon execution of vmrun, the hardware loads
guest state from the VMCB and continues
execution in guest mode
• Guest execution proceeds until some condition
(set by VMM) is reached
• The hardware performs an exit operation
• exit is the inverse of vmrun
• Guest state is saved to the VMCB, VMM state is loaded, and execution resumes in host mode, now in the VMM.
First Generation Hardware Virtualization
• First generation hardware support lacks
explicit support for memory virtualization
• The VMM must implement a software MMU using
shadow page tables
• Consequence: a context switch on each vmrun and exit
• Intel VT-x instructions: VMPTRLD, VMPTRST, VMCLEAR, VMREAD, VMWRITE, VMCALL, VMLAUNCH, VMRESUME, VMXOFF, VMXON, INVEPT, INVVPID, and VMFUNC
• With hardware assist, the guest runs at full speed, unless an exit is triggered
• Virtualization overheads are determined as the product of the exit frequency and the average cost of handling an exit
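The overhead rule on this slide is a simple product, which the following sketch turns into a worked calculation. The function name and the sample numbers (50,000 exits per second, 2 microseconds per exit) are made up for illustration and are not measurements from the slides.

```python
def virtualization_overhead(exits_per_second, avg_exit_cost_seconds):
    """Fraction of each second of guest run time spent handling exits."""
    return exits_per_second * avg_exit_cost_seconds


# Hypothetical workload: 50,000 exits per second at 2 microseconds each.
overhead = virtualization_overhead(50_000, 2e-6)
print(f"about {overhead:.0%} of CPU time goes to exit handling")
```

The product makes the two tuning knobs explicit: a VMM can either trigger fewer exits or make each exit cheaper to handle.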
MMU in First Generation Hardware
Virtualization
• First gen hardware virtualization does not virtualize
MMU
• The VMM has to get involved in MMU operations
• VMM write-protects primary page tables to trigger exits
when the guest updates primary page tables so that the
VMM can propagate the change into the shadow page tables
(e.g., invalidate).
• the VMM must request exits on page faults to distinguish
between hidden faults, which the VMM consumes to
populate shadow page tables, and true faults, which the
guest consumes to populate primary page tables.
• the VMM must request exits on guest context switches so that it can activate the shadow page tables corresponding to the new context.
• First-generation hardware support often did not outperform a BT-based VMM and was often slower


Virtualization: 2nd & 3rd Gen Hardware Virtualization
Second Generation Hardware Virtualization
• AMD’s RVI and Intel’s EPT (Extended Page Tables)
• The VMM maintains a hardware-walked "nested page table" that translates guest physical addresses (gPAs) to host physical addresses (hPAs)
• eliminating the need for VMM interposition
• Many issues of the first-gen are resolved
• No trace-induced exits
• no context-switch exits
• no hidden/true fault exits
• The VMM does not have to allocate memory for shadow
page tables, reducing memory usage
• The cost to service a TLB miss will be higher with nested paging than without
• TLB caching helps a lot
• Large memory pages (1 GB vs 2 MB)
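Why a TLB miss costs more under nested paging can be seen by counting memory references in a page walk. The sketch below uses a common back-of-the-envelope model that is not taken from the slides: every guest page-table pointer, and the final guest physical address, must itself be translated through the nested page table. It ignores page-walk caches, so real hardware does better than this worst case, and the function names are hypothetical.

```python
def native_walk_refs(levels=4):
    """Memory references for one page walk without virtualization."""
    return levels


def nested_walk_refs(guest_levels=4, nested_levels=4):
    """Worst-case memory references for one walk with nested paging.

    Each of the guest_levels table pointers plus the final guest physical
    address needs a full nested walk, and each guest table entry must
    also be read once.
    """
    nested_translations = (guest_levels + 1) * nested_levels
    guest_table_reads = guest_levels
    return nested_translations + guest_table_reads
```

With four levels on both sides this model gives 24 references instead of 4. Larger pages shorten the walk on either side, which is one reason the slide lists large memory pages next to TLB caching.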


I/O Virtualization
• Most hypervisors “emulate” I/O devices
• Generic display
• Generic network
• Generic storage
• Trap and Emulate idea
• Or paravirtualization
• Cloud Data Center requirements necessitate optimal performance
• Hardware-based I/O Virtualization


Third Generation Hardware Virtualization
• Since the Haswell microarchitecture (announced in
2013), Intel started to include VMCS shadowing as
a technology that accelerates nested
virtualization of VMMs
• Interrupt Virtualization (AMD AVIC and Intel APICv), 2012
• I/O MMU virtualization (AMD-Vi and Intel VT-d)
• An input/output memory management unit (IOMMU) allows guest virtual machines to directly use peripheral devices, such as Ethernet, accelerated graphics cards, and hard-drive controllers, through DMA and interrupt remapping.
• This is sometimes called PCI passthrough
• PCI-SIG Single Root I/O Virtualization (SR-IOV)
Virtualization: MicroVMs and Unikernels
MicroVMs
• A typical virtual machine usually has many virtual I/O
devices to make it usable
• Think about an EC2 instance
• Virtual storage
• Virtual Network
• Virtual Display
• USB, Audio, …
• The guest operating system supports device drivers, kernel modules, etc. for all of them
• Typical load time in tens of seconds, if not minutes
• MicroVMs are designed for cloud native
• Serverless Computing use cases
• Serverless containers
• Fargate
• Function as a Service
• Lambda
FireCracker
• Open Source Project by Amazon
• Based on top of the Linux KVM
• Similar to QEMU (VMM driver)
• Idea: as light weight as possible
• VMM
• Guest OS
• No support for graphics drivers
• The only virtual devices:
• Paravirtualized virtio net
• Paravirtualized virtio block
• a one-button keyboard
• To reset the VM
• Interrupt controller
• Timer
• Clock
FireCracker
• VMM starts in 8 ms
• VM Start time less than 125 ms
• Measured from the Firecracker InstanceStart API call to the start of the Linux guest user-space /sbin/init process
• Lightweight Linux guest, e.g. Alpine Linux
• Memory overhead less than 5 MiB

• OSv on FireCracker
• Specialized OS
• Boot time in less than 5 ms


Unikernel
• Unikernels are a relatively new concept
• Software is directly integrated with the kernel it is running on
• Compiling source code, along with only the required system
calls and drivers, into one executable program using a single
address space
• Unikernels can only run a single process, thus forking does
not exist
• The build process results in a complete (virtual) machine
image of minimal size that only contains and executes what it
absolutely needs
• Example: OSv → can run on FireCracker in 5 ms (compared to 125 ms for Linux), 18 MiB memory overhead, can run any Linux executable


Operating System-Level Virtualization

• Virtualizing a physical server at the operating system level, enabling multiple isolated and secure virtualized servers to run on a single physical server
• Examples:
• Linux-VServer
• Solaris Containers
• FreeBSD Jails
• Chroot
• CGroups
[Figure: several private servers run on one operating system, which runs on the hardware.]
Containers
Isolation
• “I once heard that hypervisors are the living
proof of operating system's incompetence”
• -Glauber Costa, 2012
• hypervisors have indeed provided a remedy
for certain deficiencies in operating system
design
• for some cases, containers may be an even better remedy for those deficiencies


Operating System-Level Virtualization

• Virtualizing a physical server at the operating system level, enabling multiple isolated and secure virtualized servers to run on a single physical server
• Examples:
• Solaris Containers (2004)
• FreeBSD Jails (2000)
• Linux Containers
• Linux Vserver (2001)
• OpenVZ (2005)
• Process Container (2006) → cgroups
• LXC (2008)
• Docker (2013)
[Figure: several private servers run on one operating system, which runs on the hardware.]
OS-Virtualization / Containers
• OS (operating system) virtualization is how we
generally refer to this type of "light-weight"
virtualization
• processes think they see a virtual kernel, but are all
sharing the same real kernel under the hood
• kernel acts as a sort of hypervisor in ensuring that
container/virtualization boundaries are not crossed
• The goal of containers is to support all of the resource-isolation use cases, without the overhead and complexity of running multiple kernel instances


Operating System-Level Virtualization
• Hypervisor (VM)
• One real HW, many virtual HWs, many OSs
• High versatility: can run different OSs
• Lower density, performance, scalability
• Performance overhead is mitigated by new hardware features (such as VT-d)
• Containers (CT)
• One real HW (no virtual HW), one kernel, many userspace instances
• Higher density, natural page sharing
• Dynamic resource allocation
• Native performance: [almost] no overhead
Pillars of Linux Containers
Three Pillars of Linux Containers
• cgroups
• Namespaces
• Unionfs

• Not chroot!


chroot
• In a Unix-like OS, the root directory (/) is the top directory
• All file system entries branch out of this root
• chroot was introduced in 1979, in Unix V7
• Each process has its own idea of what the root directory is
• By default, it is the actual system root
• But we can change this by using the chroot() system call


chroot
• chroot changes apparent root directory for
current running process and its children
• chroot() simply modifies pathname lookups for a
process and its children
• prepends the new root path to any name starting
with /
• The current directory is not modified, and relative paths can refer to locations outside of the new root
• chroot() does NOT provide secure isolation
• Docker uses mount namespace instead
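The lookup behaviour described above, and the reason it is not secure, can be modelled with plain string manipulation. This is a deliberately simplified sketch: the function `resolve` and the example paths are hypothetical, no real chroot() call is made, and the kernel's actual path walk is more involved than this.

```python
import posixpath


def resolve(path, new_root, cwd):
    """Return the real location a chrooted process ends up at.

    new_root: directory passed to chroot()
    cwd: the process's current directory, which chroot() leaves unchanged
    """
    if path.startswith("/"):
        # Absolute names are looked up under the new root, and ".." cannot
        # climb above it.
        inside = posixpath.normpath("/" + path.lstrip("/"))
        return posixpath.normpath(new_root + inside)
    # Relative names start from the unchanged current directory, which may
    # lie outside the new root.
    return posixpath.normpath(posixpath.join(cwd, path))
```

With `new_root="/srv/jail"` and `cwd="/home/user"`, the absolute name `/etc/passwd` stays inside the jail, while the relative name `../../etc/passwd` reaches the real `/etc/passwd`. That escape is the gap the slide refers to when it says chroot() does not provide secure isolation.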


Containers: cgroups
Cgroups
• Control Groups
• Linux kernel feature which limits, isolates and
measures resource usage of a group of processes
• Since Linux Kernel 2.6.24
• Resource quotas for memory, CPU, network and IO
• Create a control group and assign resource limits on it:
• e.g. 3GB of memory limit and 70% of CPU
• Add a process id to the group
• Process resource usage will be throttled
• The application may exceed the limits in normal scenarios
• it will be throttled back to the preset limits if the system is facing a resource crunch


Cgroup Controllers (v2)
Controller: Brief Description
cpu: Cgroups can be guaranteed a minimum number of "CPU shares" when a system is busy. This does not limit a cgroup's CPU usage if the CPUs are not busy.
cpuset: Binds the processes in a cgroup to a specified set of CPUs and NUMA nodes.
freezer: Can suspend and restore (resume) all processes in a cgroup. Freezing a cgroup /A also causes its children, for example, processes in /A/B, to be frozen.
hugetlb: Supports limiting the use of huge pages by cgroups.
io: Controls and limits access to specified block devices by applying IO control in the form of throttling and upper limits against leaf nodes and intermediate nodes in the storage hierarchy. Two policies are available. The first is a proportional-weight time-based division of disk implemented with CFQ. This is in effect for leaf nodes using CFQ. The second is a throttling policy which specifies upper I/O rate limits on a device.
memory: Supports reporting and limiting of process memory, kernel memory, and swap used by cgroups.
perf_event: Allows perf monitoring of the set of processes grouped in a cgroup.
pids: Permits limiting the number of processes that may be created in a cgroup (and its descendants).
rdma: Permits limiting the use of RDMA/IB-specific resources per cgroup.
Detailed Documentation at: https://www.kernel.org/doc/Documentation/admin-guide/cgroup-v2.rst
Cgroup example
• Controllers mounted in the cgroups file system
• /cgroup directory
• /sys/fs/cgroup/memory
• /sys/fs/cgroup/cpu
• Making a control group
• mkdir /sys/fs/cgroup/memory/mytestcgroup
• Setting limits
• echo 2097152 > /sys/fs/cgroup/memory/mytestcgroup/memory.limit_in_bytes
• echo 2097152 > /sys/fs/cgroup/memory/mytestcgroup/memory.memsw.limit_in_bytes
• Set both the memory limit AND the memory-plus-swap limit to 2 MiB (2097152 bytes)
• Running a process
• cgexec -g memory:mytestcgroup ./<binary_name>
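The same steps can be driven from a program, because a control group is configured by writing plain text into files. The sketch below assumes the cgroup v1 memory controller layout used on this slide. The helper names are hypothetical, and on a real system the writes need root privileges and swap accounting enabled, so the directory is passed in as a parameter and can be pointed at a scratch directory for experimentation.

```python
from pathlib import Path


def set_memory_limits(cgroup_dir, limit_bytes):
    """Create the control group directory and write both limits into it."""
    cgroup = Path(cgroup_dir)
    cgroup.mkdir(parents=True, exist_ok=True)
    # Same file names as the echo commands on the slide (cgroup v1).
    (cgroup / "memory.limit_in_bytes").write_text(str(limit_bytes))
    (cgroup / "memory.memsw.limit_in_bytes").write_text(str(limit_bytes))


def add_process(cgroup_dir, pid):
    """Move a process into the group by writing its PID to cgroup.procs."""
    (Path(cgroup_dir) / "cgroup.procs").write_text(str(pid))
```

A call such as `set_memory_limits("/sys/fs/cgroup/memory/mytestcgroup", 2 * 1024 * 1024)` would mirror the two echo commands, and `add_process` covers the "add a process id to the group" step that cgexec performs for a newly started binary.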


Cgroup Scheduling
• When we think of containers as lightweight VMs, it is natural
to think of resources in terms of discrete resources such as
number of processors.
• However, the Linux kernel schedules processes dynamically,
just as the hypervisor schedules requests onto discrete
hardware.
• the cpu subsystem schedules CPU access to each cgroup
using either the Completely Fair Scheduler (CFS)—the default
on Linux and Docker—or the Real-Time Scheduler (RT).
• Scheduling cgroups in CFS requires us to think in terms of
time slices instead of processor counts.
• CPU shares provide tasks in a cgroup with a relative amount of CPU time, providing an opportunity for the tasks to run.
• The file cpu.shares defines the number of shares allocated to the cgroup.
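Because shares are relative, the CPU time a cgroup receives depends on which other cgroups are competing at that moment. The following sketch works through that arithmetic. The function name, the group names, and the share values are hypothetical, and the model ignores everything else the scheduler takes into account.

```python
def cpu_fractions(shares, busy):
    """Split CPU time among the busy cgroups in proportion to their shares.

    shares: mapping of cgroup name -> value of its cpu.shares file
    busy: names of the cgroups that currently have runnable tasks
    """
    total = sum(shares[name] for name in busy)
    return {name: shares[name] / total for name in busy}


shares = {"web": 1024, "batch": 512, "logs": 512}
print(cpu_fractions(shares, ["web", "batch", "logs"]))
print(cpu_fractions(shares, ["batch"]))
```

When all three groups are busy, `web` gets half of the CPU time and the other two a quarter each. When `batch` is the only busy group it gets all of it, which is the sense in which shares are a relative guarantee under contention rather than a hard cap.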


Containers: Namespaces
Namespaces
• A namespace wraps a global system resource in an
abstraction that makes it appear to the processes
within the namespace that they have their own
isolated instance of the global resource
• Linux processes form a single hierarchy, with all
processes rooting at init.
• Usually privileged processes in this tree can trace
or kill other processes.
• Linux namespaces enable us to have many hierarchies of processes with their own "subtrees" such that processes in one subtree can NOT access or even know of those in another.


Namespaces
Namespace: Description (This namespace isolates …)
Cgroup: Cgroup root directory
IPC: System V IPC, POSIX message queues. The common characteristic of these IPC mechanisms is that IPC objects are identified by mechanisms other than filesystem pathnames.
Network *: Network devices, stacks, ports, etc. Each network namespace has its own network devices, IP addresses, IP routing tables, /proc/net directory, port numbers, etc.
Mount *: Mount points. Processes in different mount namespaces can have different views of the filesystem hierarchy.
PID *: Process IDs. In other words, processes in different PID namespaces can have the same PID.
Time: Boot and monotonic clocks
User *: User and group IDs. In other words, a process's user and group IDs can be different inside and outside a user namespace.
UTS: Hostname and NIS domain name. Allows each container to have its own hostname and NIS domain name. Affects nodename and domainname, as returned by the uname() system call.
PID Namespace example
• Without namespaces, all processes descend hierarchically
from PID 1 (init).
• If we create a PID namespace and run a process in it, that
first process becomes PID 1 in that namespace.
• The process that creates the namespace remains in the parent
namespace, but makes its child the root of the new process
tree.
• Processes within the new namespace cannot see the
parent process, but processes in the parent namespace can see
those in the child namespace.
• Processes within the new namespace have 2 PIDs: one in the
new namespace and one in the global namespace.
• PID namespaces also allow each container to have its own
init (PID 1), the "ancestor of all processes" that manages
various system initialization tasks and reaps orphaned
child processes when they terminate
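The "two PIDs" and one-way visibility rules above can be illustrated with a small in-memory model. This is a toy simulation under simplifying assumptions, not how the kernel implements PID namespaces; the class names PidNamespace and Process are hypothetical.

```python
import itertools


class PidNamespace:
    """Toy PID namespace: hands out PIDs and records which processes it can see."""

    def __init__(self, parent=None):
        self.parent = parent
        self._counter = itertools.count(1)  # first process gets PID 1
        self.visible = {}                   # pid in this namespace -> Process

    def allocate_pid(self):
        return next(self._counter)

    def chain(self):
        """Yield this namespace and then each ancestor, innermost first."""
        ns = self
        while ns is not None:
            yield ns
            ns = ns.parent


class Process:
    """A process gets one PID in its own namespace and one in every ancestor."""

    def __init__(self, name, namespace):
        self.name = name
        self.namespace = namespace
        self.pids = {}
        for ns in namespace.chain():
            pid = ns.allocate_pid()
            self.pids[ns] = pid
            ns.visible[pid] = self


if __name__ == "__main__":
    root = PidNamespace()
    init = Process("init", root)
    shell = Process("shell", root)
    container_ns = PidNamespace(parent=root)
    app = Process("app", container_ns)
    print("app inside container:", app.pids[container_ns])
    print("app on the host:", app.pids[root])
    print("visible in container:", [p.name for p in container_ns.visible.values()])
    print("visible on host:", [p.name for p in root.visible.values()])
```

In this model the containerized process is PID 1 in its own namespace while also holding an ordinary PID in the parent namespace, and the child namespace's table never contains host processes.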



Network Namespace
• Provides isolation of the system resources associated with
networking
• Each network namespace has its own network devices, IP
addresses, IP routing tables, /proc/net directory, port
numbers, and so on
• Network namespaces make containers useful from a
networking perspective
• Each container can have its own (virtual) network device and its
own applications that bind to the per-namespace port number
space
• Suitable routing rules in the host system can direct network
packets to the network device associated with a specific
container
• E.g. multiple containerized web servers can run on the same
host system, with each server bound to port 80 in its
(per-container) network namespace
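To see why a per-namespace port space matters, it helps to observe what happens without one: within a single network namespace, two TCP sockets cannot bind the same address and port. The sketch below demonstrates only that conflict on the loopback interface; it does not create namespaces (which would need elevated privileges), and the function name second_bind_fails is hypothetical.

```python
import errno
import socket


def second_bind_fails(host="127.0.0.1"):
    """Bind one listening socket, then try to bind a second to the same port.

    Returns True if the second bind is rejected with EADDRINUSE, which is
    the conflict that separate network namespaces avoid.
    """
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        first.bind((host, 0))      # port 0: let the kernel pick a free port
        first.listen()
        port = first.getsockname()[1]
        try:
            second.bind((host, port))
        except OSError as exc:
            return exc.errno == errno.EADDRINUSE
        return False
    finally:
        first.close()
        second.close()


if __name__ == "__main__":
    print("second bind rejected:", second_bind_fails())
```

With one network namespace per container, each web server would see its own port 80 and this conflict would not arise between containers.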



User Namespace
• A process's user and group IDs can be different inside
and outside a user namespace
• The most interesting case here is that a process can
have a normal unprivileged user ID outside a user
namespace while at the same time having a user ID
of 0 inside the namespace.
• This means that the process has full root privileges
for operations inside the user namespace, but is
unprivileged for operations outside the namespace.
• From a security perspective this is a great feature,
as it allows our containers to continue running with
root privileges, but without actually having any
root privilege on the host.
• Supported in Docker since 2016
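The inside/outside ID relationship is described by the mapping files /proc/<pid>/uid_map and /proc/<pid>/gid_map, where each line holds three numbers: the first ID inside the namespace, the first ID outside it, and the length of the range. The sketch below translates an inside ID to its outside ID from such text; the helper names and the sample mapping values are illustrative assumptions.

```python
def parse_id_map(text):
    """Parse uid_map/gid_map text into (inside_start, outside_start, length) tuples."""
    ranges = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        ranges.append(tuple(int(f) for f in fields))
    return ranges


def to_outside_id(inside_id, ranges):
    """Translate an ID seen inside the namespace to the ID seen outside, or None."""
    for inside_start, outside_start, length in ranges:
        if inside_start <= inside_id < inside_start + length:
            return outside_start + (inside_id - inside_start)
    return None  # the ID is not mapped in this namespace


if __name__ == "__main__":
    # Hypothetical mapping: root inside is the unprivileged user 1000 outside.
    sample = "0 1000 1\n1 100000 65536\n"
    ranges = parse_id_map(sample)
    print("inside 0 ->", to_outside_id(0, ranges))
    print("inside 33 ->", to_outside_id(33, ranges))
```

With the sample mapping, a process that is root (UID 0) inside the namespace acts as UID 1000 on the host, which is the situation the slide describes.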



Mount Namespace
• Mount namespaces were the first type of
namespace to be implemented on Linux,
appearing in 2002
• Isolate the set of filesystem mount points
seen by a group of processes
• Processes in different mount namespaces can
have different views of the filesystem
hierarchy
• By contrast with the use of the chroot() system
call, mount namespaces are a more secure and
flexible tool for this task



Cloud Computing

Containers: Union Filesystem


Prof. Reza Farivar
Union File System (Unionfs)
• Backbone of container images
• A stackable unification file system, which can
appear to merge the contents of several
directories (branches), while keeping their
physical content separate
• Overlays several directories into a single
mount point
• Contents of directories that have the same path
within the merged branches will be seen together
in a single merged directory, within the new
virtual filesystem



Union File System (Unionfs)
• With a union mount, the directories in the file
systems of the underlying layers are merged
with those from the upper-layer file
systems
• To access a file: first try the file on
the top branch, and if the file does not exist
there, continue on lower-level branches
• Copy-on-write (CoW): if the user tries to modify
a file on a lower-level read-only branch, the file
is copied to a higher-level read-write branch
• The program running inside the container
doesn't care which layer the files and
directories come from
[Figure: layer stack with the labels Nginx and Alpine Linux]
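The lookup order and copy-on-write behaviour described above can be sketched with plain dictionaries standing in for layers. This is a toy model only (no directories, deletions or whiteouts), and the class name UnionMount and the sample file contents are hypothetical.

```python
class UnionMount:
    """Toy union mount: read-only lower layers plus one read-write upper layer."""

    def __init__(self, lower_layers):
        # Each layer is a dict of path -> content; the list is topmost first.
        self.lower = list(lower_layers)
        self.upper = {}

    def read(self, path):
        """Search the upper layer first, then each lower layer from the top down."""
        for layer in [self.upper, *self.lower]:
            if path in layer:
                return layer[path]
        raise FileNotFoundError(path)

    def append(self, path, data):
        """Modify a file; copy it up from a lower layer first if needed (CoW)."""
        if path not in self.upper:
            self.upper[path] = self.read(path)
        self.upper[path] += data

    def listdir(self):
        """Merged view: the union of paths across all layers."""
        paths = set(self.upper)
        for layer in self.lower:
            paths.update(layer)
        return sorted(paths)


if __name__ == "__main__":
    alpine = {"/etc/os-release": "alpine\n", "/bin/sh": "busybox"}
    nginx = {"/etc/nginx/nginx.conf": "worker_processes 1;\n"}
    mount = UnionMount([nginx, alpine])
    mount.append("/etc/os-release", "patched\n")
    print(mount.listdir())
    print(repr(mount.read("/etc/os-release")))
    print(repr(alpine["/etc/os-release"]))  # the lower layer is untouched
```

The modified file lives only in the upper dictionary, so the same lower layers could back many mounts at once without interfering with each other.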
Docker images
• A container image is made of a stack of immutable
(read-only) layers
• At run time, the Docker engine adds a R/W layer on
top of the stack of immutable layers

Dockerfile
FROM node:12-alpine
WORKDIR /app
COPY . .
RUN yarn install --production
CMD ["node", "src/index.js"]


Graph Driver
• A local instance of a Docker engine has a cache
of Docker image layers
• This cache of layers is built up as
explicit docker pull commands are
executed, as well as by docker build
• The driver that handles these layers is called a
“graphdriver”


Graph Driver Options
• graphdriver options: vfs, aufs, overlay,
overlay2, btrfs, zfs, devicemapper, and
windows
• vfs: naïve implementation, does not use a
union filesystem or CoW technique
• overlay, overlay2, aufs: unionfs on top of a
real filesystem
• e.g. ext4, xfs
• btrfs, zfs, devicemapper, windows: the
underlying real filesystem performs the task of
union
More details: https://integratedcode.us/2016/08/30/storage-drivers-in-docker-a-deep-dive/
Cloud Computing

Containers: Docker Architecture


Prof. Reza Farivar
Docker Architecture

[Figure: Docker architecture diagram; the only surviving label is "Docker CLI"]


Container Runtime
• Docker was originally monolithic
• Later the runtime was separated
• Responsible for the whole life cycle of a container
• Pulls a container image (which is the template for a container)
from a registry
• Creates a container from that image
• Initializes and runs the container
• Eventually stops and removes the container from the system
• The container runtime on a Docker host consists
of containerd and runc
• Both are open source and were donated by Docker:
containerd to the CNCF and runc to the OCI
• CNCF: Cloud Native Computing Foundation, a Linux Foundation
project


Container Runtime: containerd
• containerd builds on runc and provides higher-level
functionality
• Image push and pull
• Managing storage
• Executing containers by calling runc with the right
parameters
• Managing network primitives for interfaces
• Managing network namespaces so that containers can
join existing namespaces
• Reference implementation of the OCI specifications
• OCI: Open Container Initiative, a Linux Foundation
project
• containerd was donated to and accepted by
the CNCF in 2017
Container Runtime: runc
• runc provides the low-level functionality of the container
runtime
• Full support for Linux namespaces
• Native support for all security features available on Linux
• SELinux
• AppArmor
• Seccomp
• cgroups
• Spawns and runs containers according to the Open
Container Initiative (OCI) specification
• Containers are configured using bundles
• A bundle for a container is a directory that includes a
specification file named "config.json" and a root filesystem
• The root filesystem contains the contents of the container
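The bundle layout described above (a directory holding config.json next to a root filesystem) can be sketched as follows. This writes only a handful of fields from the OCI runtime specification (ociVersion, process.args, process.cwd, root.path, root.readonly) and an empty rootfs directory; it is an illustration of the layout, not a configuration that is claimed to be sufficient for runc to start a container. The function name write_bundle and the version string are assumptions.

```python
import json
import tempfile
from pathlib import Path


def write_bundle(bundle_dir, args, oci_version="1.0.2"):
    """Create <bundle_dir>/rootfs and a skeletal <bundle_dir>/config.json."""
    bundle = Path(bundle_dir)
    rootfs = bundle / "rootfs"
    rootfs.mkdir(parents=True, exist_ok=True)  # the container's files go here
    config = {
        "ociVersion": oci_version,
        "process": {"args": list(args), "cwd": "/"},
        # root.path is relative to the bundle directory.
        "root": {"path": "rootfs", "readonly": True},
    }
    config_path = bundle / "config.json"
    config_path.write_text(json.dumps(config, indent=2))
    return config_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = write_bundle(tmp, ["sh"])
        print(path.read_text())
        print(sorted(p.name for p in Path(tmp).iterdir()))
```

In practice the root filesystem would be populated from an unpacked image, and a complete config.json is usually generated by tooling (for example the runc spec command) rather than written by hand.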



Docker Engine
• Docker engine provides additional
functionality on top of the container runtime
• E.g. network libraries or support for plugins.
• Provides a REST interface over which
all container operations can be automated
• The Docker command-line interface is one of the
consumers of this REST interface
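Because the engine exposes a REST interface, any HTTP client can play the role of the CLI. The sketch below issues a GET request over a Unix domain socket using only the standard library. It assumes a Linux host where the daemon listens on /var/run/docker.sock (the usual default, but configurable) and that the caller has permission to open it; the class and function names are hypothetical.

```python
import http.client
import json
import socket


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTPConnection that talks to a Unix domain socket instead of TCP."""

    def __init__(self, socket_path, timeout=5):
        super().__init__("localhost", timeout=timeout)
        self.socket_path = socket_path

    def connect(self):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.settimeout(self.timeout)
        try:
            sock.connect(self.socket_path)
        except OSError:
            sock.close()
            raise
        self.sock = sock


def engine_get(path, socket_path="/var/run/docker.sock"):
    """Send GET <path> to the engine and return (status, decoded JSON body)."""
    conn = UnixHTTPConnection(socket_path)
    try:
        conn.request("GET", path)
        response = conn.getresponse()
        return response.status, json.loads(response.read())
    finally:
        conn.close()


if __name__ == "__main__":
    try:
        status, body = engine_get("/version")
        print(status, body.get("Version"))
    except OSError as exc:
        print("could not reach the Docker engine:", exc)
```

The same helper could be pointed at other read-only endpoints of the Engine API, such as /containers/json, which lists running containers.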


Docker Overview
• Container orchestration
• Windows containers
Cloud Computing

Containers: Networking
Prof. Reza Farivar
Container Network Model
• Libnetwork implements the Container Network Model (CNM)
• Formalizes the steps required to provide networking for
containers while providing an abstraction that can be used to
support multiple network drivers


Container Network Model: Sandbox
• A Sandbox contains the configuration of a
container's network stack.
• This includes management of the
container's interfaces, routing table and
DNS settings.
• A Sandbox may contain many endpoints
from multiple networks.
• An implementation of a Sandbox could be
a Linux Network Namespace, a FreeBSD
Jail or other similar concept.
• Libnetwork implements the sandbox on Linux
through a network namespace
• It creates a Network Namespace for each
sandbox, which is uniquely identified by a
path on the host filesystem.


Container Network Model: Endpoint
• An Endpoint joins a Sandbox to a Network.
• An implementation of an Endpoint could be
a veth pair, an Open vSwitch internal port or
similar.
• The veth devices are virtual Ethernet devices.
• They can act as tunnels between network
namespaces to create a bridge to a physical
network device in another namespace
• Can also be used as standalone network devices
• veth devices are always created in interconnected
pairs
• One end is placed in one network namespace, and the
other end in another namespace
• An Endpoint can belong to only one Network, and
it can belong to only one Sandbox, if connected.
• Libnetwork delegates the actual implementation
to the drivers, which realize the functionality


Container Network Model: Network
• A Network is a group of
Endpoints that are able to
communicate with each other
directly.
• An implementation of a Network
could be a Linux bridge, a VLAN,
etc.
• Networks consist
of many endpoints.
• Libnetwork delegates the actual
implementation to the drivers,
which realize the functionality
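The relationships between the three CNM objects (a Sandbox holds many Endpoints, an Endpoint belongs to exactly one Network and at most one Sandbox, a Network groups Endpoints) can be captured in a small object model. This is an illustrative sketch in Python, not libnetwork's actual Go API; all names here are hypothetical.

```python
class Network:
    """A group of endpoints that can communicate with each other directly."""

    def __init__(self, name):
        self.name = name
        self.endpoints = []


class Sandbox:
    """A container's network stack; may hold endpoints from several networks."""

    def __init__(self, container):
        self.container = container
        self.endpoints = []


class Endpoint:
    """Joins a Sandbox to a Network: one network, and at most one sandbox."""

    def __init__(self, name, network):
        self.name = name
        self.network = network
        self.sandbox = None
        network.endpoints.append(self)

    def join(self, sandbox):
        if self.sandbox is not None:
            raise ValueError(f"endpoint {self.name} already joined a sandbox")
        self.sandbox = sandbox
        sandbox.endpoints.append(self)


def share_network(a, b):
    """True if sandboxes a and b have endpoints on a common network."""
    networks_a = {ep.network for ep in a.endpoints}
    return any(ep.network in networks_a for ep in b.endpoints)


if __name__ == "__main__":
    frontend, backend = Network("frontend"), Network("backend")
    web, app, db = Sandbox("web"), Sandbox("app"), Sandbox("db")
    Endpoint("web-fe", frontend).join(web)
    Endpoint("app-fe", frontend).join(app)
    Endpoint("app-be", backend).join(app)
    Endpoint("db-be", backend).join(db)
    print("web <-> app:", share_network(web, app))
    print("web <-> db:", share_network(web, db))
```

In the example the app sandbox holds endpoints on two networks, so it can reach both web and db, while web and db share no network and cannot talk directly.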



Driver packages
• Drivers extend libnetwork and provide the actual
implementation of the API
• driver.Config
• driver.CreateNetwork
• driver.DeleteNetwork
• driver.CreateEndpoint
• driver.DeleteEndpoint
• driver.Join
• driver.Leave


Default Drivers in Docker Libnetwork
• bridge: uses Linux bridging and iptables to provide connectivity
for containers
• It creates a single bridge, called docker0 by default, and attaches
a veth pair between the bridge and every endpoint.
• host: for standalone containers, removes network isolation
between the container and the Docker host, and uses the host’s
networking directly
• overlay: networking that can span multiple hosts using overlay
network encapsulations such as VXLAN
• Enables swarm services to communicate with each other
• macvlan: Macvlan networks allow you to assign a MAC address to
a container, making it appear as a physical device on your
network.
• Docker daemon routes traffic to containers by their MAC addresses
• none: disables networking for the container
• Note: The type of network a container uses is transparent from
within the container
Bridge Networks
• A bridge is usually a link-layer device that
forwards traffic between network segments
• In Docker, a bridge network uses a software bridge
that lets containers connected to the same bridge
network on the same host communicate
• while isolating them from containers not
connected to that bridge
• For communicating with containers on other hosts,
use an overlay network
• The Docker bridge driver automatically installs rules
on the host machine so that containers on different
bridge networks cannot communicate directly with
each other
• iptables rules on Linux


Default Bridge Network
• When you start Docker, a default bridge
network (also called bridge) is created
automatically, and newly started containers
connect to it unless otherwise specified.
• Containers on the default bridge network can only
access each other by IP address
• User-defined bridges provide automatic DNS
resolution between containers.
• The default bridge network is considered a legacy
detail of Docker and is NOT recommended for
production use


User-defined Networks
• Use --network to attach a container to a specific
network
• Better isolation
• DNS resolution
• On a user-defined bridge network, containers can
resolve each other by container name or alias
• Much better than messing with /etc/hosts
• Containers can be attached to and detached from
user-defined networks on the fly
• Containers connected to the same user-defined
bridge network effectively expose all ports to each
other


Publishing ports
• From the container’s point of view, it has a network
interface with an IP address, a gateway, a routing
table, DNS services, and other networking details
• For a port to be accessible to containers or
non-Docker hosts on different networks, that port must
be published using the -p or --publish flag.
• $ docker create --name my-nginx --network my-net --publish 8080:80 nginx:latest
• In 8080:80, the first number is the host port and the second is the container port


IPAM: IP Address Management
• IPAM tracks and manages IP addresses for
each network
• Subnet: taken from the private address ranges of RFC 1918
• E.g. 172.17.0.0/16
• All containers attached to this network will get an IP
address taken from this CIDR range
• Gateway
• E.g. 172.17.0.1
• Router for this network
• By default only egress traffic is allowed
• Containerized applications can reach the internet,
but they cannot be reached by any outside traffic
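The subnet and gateway roles can be made concrete with Python's ipaddress module. The sketch below reserves the first usable address of 172.17.0.0/16 as the gateway and hands out the following addresses to containers in order. It is a simplified illustration, not Docker's IPAM driver: addresses are never released, and the class name SimpleIpam is hypothetical.

```python
import ipaddress


class SimpleIpam:
    """Toy IP address manager for a single network."""

    def __init__(self, cidr="172.17.0.0/16"):
        self.subnet = ipaddress.ip_network(cidr)
        self._hosts = iter(self.subnet.hosts())
        # Reserve the first usable address for the gateway, e.g. 172.17.0.1.
        self.gateway = next(self._hosts)
        self.leases = {}  # container name -> assigned address

    def allocate(self, container):
        """Return the container's address, assigning the next free one if new."""
        if container not in self.leases:
            self.leases[container] = next(self._hosts)
        return self.leases[container]


if __name__ == "__main__":
    ipam = SimpleIpam()
    print("subnet:", ipam.subnet, "private:", ipam.subnet.is_private)
    print("gateway:", ipam.gateway)
    for name in ("web", "db", "web"):
        print(name, "->", ipam.allocate(name))
```

Asking for the same container twice returns the same lease, and every assigned address falls inside the network's CIDR range, as the slide describes.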



Containers in the same namespace
• We can have multiple containers in the same
network namespace
• Processes in two containers in the same
namespace can communicate through localhost
• Compare to bridge networking, with two containers
connected to the same network, where each container
gets its own IP address
• Note that a sandbox (i.e. the Linux network namespace) is
connected to a network
• We typically run each container in its own sandbox
• But multiple containers can run in the same sandbox
