cs6210
introduction
Advanced Operating Systems is a graduate-level course that addresses a broad range of
topics in operating system design and implementation, including:
By tracing the key ideas of today's most popular systems to their origins in research, the
class highlights key developments in operating system design over the last two decades
and illustrates how insight has evolved to implementation.
lesson1
abstractions
Definitions
Term Definition
abstraction A well understood interface that hides all the details within a subsystem.
Quizzes
Which is an OS?
Firefox
MacOS
Android
Which of these are abstractions?
Door knob
Instruction set architecture
Number of pins coming out of a chip
AND logic gate
Transistor
INT data type
Exact location of base pads in a baseball field
A door knob is something we use to enter a room, yet we don't always know the exact inner
workings of the knob - this is an abstraction. The instruction set architecture of a
processor is an abstraction - it tells us what the processor is capable of doing, not how it
does it. An AND logic gate is an abstraction - it doesn't describe how the logical AND is
carried out, just its functionality. A transistor lets us switch between states 0 and 1 - but we
don't know explicitly how it conducts this operation. The INT data type doesn't describe how
the compiler represents it; it just presents an integer.
The number of pins coming out of a chip is not an abstraction - it is very explicit. The exact
locations of the base pads on a baseball field explicitly describe their location, so they are not an
abstraction.
From the highest abstraction (Google Earth) to the lowest abstraction (electrons):
Applications
System Software (operating system, compilers, etc.)
Instruction Set Architecture
Machine Organization (datapath and control)
Sequential and Combinational Logic Elements
Logic Gates
Transistors
hardware resources
Hardware resources in a computing system
The internal organization of a computing system is the same for all manifestations. Below
is a representation of what the hardware resources would look like in a traditional
computing system.
Bus organization
Computing systems utilize communication buses to relay information between devices, the
CPU, and memory. Below is a representation of the buses that exist within a traditional
computing system.
The CPU utilizes the system bus to directly interface with the memory of the computing
system. The CPU utilizes a bridge to schedule interactions with devices residing on the I/O
bus. Devices on the I/O bus can also directly interact with the memory utilizing the bridge if
they are devices that leverage direct memory access (DMA).
Definitions
Term Definition
bus: The conduit for communication between the central processing unit (CPU) and the devices connected that make up the computing system.
system bus: A synchronous communication pathway between the CPU and memory. This bus has a high-speed communication bandwidth.
input / output (I/O) bus: A bus primarily intended for the devices to communicate with the CPU. This bus has a low-speed communication bandwidth.
bridge: A specialized input / output processor for scheduling the devices that need to communicate with the memory or the CPU.
direct memory access (DMA): A capability provided by some computer bus architectures that allows data to be sent directly from an attached device (such as a disk drive) to the memory on the computer's motherboard.
Quizzes
Are the internal organizations of the computing systems represented in this continuum
vastly different?
Yes
No
No is the right answer here. The hardware inside a computing system consists of the
processor, memory, and I/O devices. Whether we are discussing smartphones or cloud
computing systems, the internal organization will not change tremendously.
operating system functionality
What is an operating system?
An operating system contains code to access the physical resources of the computing
system and arbitrates the competing requests for these hardware resources. An operating
system is a complex program that abstracts these features for application programmers
using well-defined application programming interfaces (API).
In this scenario, we are running the Google Earth application. We select a point on the globe
by clicking the mouse. So what happens? When you click your mouse, an interrupt is
created on the dedicated interrupt line. This interrupt is served to the CPU. The operating
system fields this interrupt and determines the appropriate interrupt handler to execute for
this mouse click action. The results of the execution of this interrupt handler are passed to
the Google Earth program, allowing the user to interact with the program via the mouse.
Below is a high-level representation of how interrupts are served to applications via the
interrupt bus, CPU, interrupt handler, and operating system:
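To make the flow concrete, here is a minimal user-space sketch in C (with invented names such as MOUSE_IRQ and mouse_click_handler, not actual OS code) of the idea behind an interrupt vector: the operating system keeps a table mapping interrupt numbers to handler functions, and when an interrupt arrives it looks up and invokes the registered handler.

#include <stdio.h>

#define NUM_IRQS  16
#define MOUSE_IRQ 12   /* hypothetical interrupt line for the mouse */

/* Each entry in the interrupt vector is a pointer to a handler function. */
typedef void (*irq_handler_t)(int irq);

static irq_handler_t interrupt_vector[NUM_IRQS];

static void mouse_click_handler(int irq)
{
    /* In a real OS this would read the device registers and hand the
       coordinates to the windowing system / application (e.g. Google Earth). */
    printf("IRQ %d: mouse click fielded, notifying application\n", irq);
}

static void dispatch_interrupt(int irq)
{
    if (irq >= 0 && irq < NUM_IRQS && interrupt_vector[irq] != NULL)
        interrupt_vector[irq](irq);   /* OS invokes the registered handler */
    else
        printf("IRQ %d: no handler registered (spurious interrupt)\n", irq);
}

int main(void)
{
    interrupt_vector[MOUSE_IRQ] = mouse_click_handler;  /* OS registers the handler */
    dispatch_interrupt(MOUSE_IRQ);                      /* hardware raises the interrupt */
    return 0;
}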
Definitions
Term Definition
application
A system of tools and resources in an operating system, enabling
programming
developers to create software applications.
interface
(API)
A signal created and sent to the CPU that is caused by some action taken
interrupt
by a hardware device.
Are initiated by hardware interrupts, software interrupt instructions, or
interrupt
software exceptions, and are used for implementing device drivers or
handler
transitions between protected modes of operation, such as system calls.
Quizzes
Select the choices that apply to the functionalities provided by an operating system.
The last choice is not something you would want the operating system to do.
Mouse clicks result in a CPU interrupt. This interrupt will eventually result in a system
service reading the spatial coordinates of the mouse.
managing the cpu and memory
Introduction
The operating system serves to enact policies regarding the use of the resources of the
computing system. These include:
The operating system tries to accomplish all of these actions while also getting out of the
way as quickly as possible in order to enable the applications.
This memory footprint is created by the operating system loader when an application is
launched. The application is able to ask the operating system for more resources at
runtime. The operating system will perform this service on behalf of the application, and
the application can continue executing.
Below is a high-level representation of how operating systems cater to application resource
requirements.
Threads represent one instance of execution within a process on the CPU. A process
comprises the address space and the state of all the threads executing within an application.
Processes contain an address space that is distinct from another process. The operating
system provides the abstraction for each process, and protects processes from other
processes attempting to access their respective address space. The operating system
implements this abstraction using the underlying hardware capabilities of the computing
system.
Definitions
Term Definition
memory footprint: The amount of main memory that a program uses or references while running.
stack: An area of main storage (system memory) that is used for return addresses, procedure arguments, temporarily saved registers, and locally allocated variables.
Quizzes
Computers seemingly run several programs in parallel: e-mail, web browsing, music, etc.
If there is only one CPU in the computing system, how is this possible?
The operating system multiplexes multiple applications to run for a specific time on the
CPU, presenting the illusion that the applications are all running in parallel.
Yes
No
Maybe
The operating system is also a program that must execute on the CPU. A good operating
system will utilize the minimal amount of time to execute its functions. Most of the time,
the resources of the computing system will be primarily utilized by the applications.
lesson2
operating system structure
Goals of operating system structure
Goal Description
Protection: Protecting the user from the system and vice versa, as well as protecting users from one another. Also, implementing protection to protect the user from their own mistakes.
Monolithic structure
In the monolithic operating system structure, all operating system services are contained
within its own hardware address space, protected from all user-level applications. The
hardware is also directly managed by the operating system. Each application running on
the operating system is contained within its own hardware address space.
Due to the segregation implemented in the structure of this operating system, the operating
system is protected from the malfunctions that can occur in the execution of user-level
applications. When an application needs any system service, a context switch is conducted
into the hardware address space of the operating system.
A monolithic structured operating system reduces performance loss by consolidating all
system services. What is lost with the monolithic operating system structure is the ability
to customize operating system services for the needs of different applications.
DOS-like structure
The primary difference between the monolithic operating system structure and the DOS-like
operating system structure is that there exists no hard separation between the applications'
and the operating system's address spaces.
This structure provides applications the ability to make system calls into the operating
system and acquire system services at the same speed of a procedure call. This feature
also presents a security issue, as an errant application can compromise the integrity of the
operating system and corrupt its data structures.
The loss of protection that is experienced with a DOS-like structured operating system is
unacceptable for modern computing systems.
Below is a high-level representation of the DOS-like operating system structure.
Microkernel structure
In the microkernel operating system structure, each application has its own hardware
address space. The microkernel runs in a privileged mode of the architecture and
implements a small number of mechanisms for accessing hardware resources.
This is a key distinction, the microkernel implements no policies for use of the hardware
resources - it only provides abstractions for the underlying system hardware.
The operating system services are implemented as servers that reside on top of the
microkernel and execute with the same privilege as the user-level applications. Each system
service is contained within its own hardware address space, as well.
Listed are the performance costs associated with the microkernel operating system
structure:
Cost Description
Context switching: Switching from the application-level hardware address space to the microkernel address space, excessively.
Change in locality: Constant changes of the contents residing within the cache due to the frequent context switches occurring.
Content copying: The need to copy information from the system services' address spaces into the address space of the microkernel.
Definitions
Term Definition
operating system structure: The way the operating system is organized with respect to the applications that it serves and the underlying hardware that it manages.
Quizzes
Protection
Performance
Flexibility
Scalability
Agility
Responsiveness
What type of operating system structure are we looking to implement? In a perfect world,
we want an operating system structure that:
is thin and implements only mechanisms, not policies, like the microkernel operating
system structure.
provides access to resources without crossing between privilege levels, like the DOS-like
operating system structure.
is flexible in its resource management like a microkernel structured operating system,
without sacrificing the protection and performance of a monolithic structured operating
system.
Approaches to extensibility
Historically, there was research conducted on extensibility as far back as 1981. Hydra OS
attempted to tackle extensibility in a capability-based manner with these objectives:
Mach was also a microkernel-based operating system designed during that time period that
focused on extensibility and portability. Performance took a hit as Mach attempted to
maximize portability and extensibility for various different architectures.
The results of attempting to implement these two operating systems and the discovery of
their lack of performance in comparison to the monolithic kernel gave the microkernel
operating system structure negative publicity.
SPIN's approach to extensibility:
Co-location of the kernel and its extensions in an attempt to avoid crossing between
privilege levels too often.
Compiler enforced modularity utilizing a strongly-typed programming language,
preventing dangerous memory accesses from occurring between the kernel and the
extensions.
Utilizing logical protection domains, not hardware address spaces, to enforce
protection.
Application's ability to dynamically bind to different implementations of the same
interface functions, providing flexibility.
Because the extensions are co-located with the kernel, extensions and their advertised
system services are as cheap as a procedure call. No privilege level crossing occurs.
SPIN implements its logical protection domains using the Modula-3 programming
language. Modula-3 provides these features and abstractions:
Type safety
Auto storage management
Objects
Threads
Exceptions
Generic interfaces
All of these features enable the implementation of system services as objects with well-
defined entry points. This is leveraged to create logical protection domains within SPIN that
can be co-located with the kernel. Through this, these logical protection domains provide
both protection and performance.
This object-oriented approach provides the designer the ability to define system resources
as capabilities, with definitions as narrow or as broad as desired, thus providing protection.
Possible ways to declare resources as objects include:
Hardware resources (e.g. page frames)
Interfaces (e.g. the page allocation module)
A collection of interfaces (e.g. an entire virtual machine)
Capabilities to these objects were supported by the language in the form of pointers.
Mechanism Description
Create: Initialize with object file contents and export names from the namespace.
Resolve: Resolve names between the source and target domains. This dynamically binds two protection domains. Once resolution is complete, resource sharing can occur at memory speeds.
These mechanisms, enabled by the Modula-3 programming language, are how SPIN provides
extensibility while maintaining performance and protection.
SPIN must also deal with external events, including:
External interrupts
Exceptions
Page faults
System calls
SPIN supports external events using an event-based communication model. SPIN supports
the mapping of events to event handlers in a one-to-one, one-to-many, and many-to-one
manner.
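As a rough illustration of this event-based model (a C sketch with invented names rather than actual SPIN/Modula-3 code), an event can be bound to a list of handlers, so raising one event can dispatch to one or many handlers:

#include <stdio.h>

#define MAX_HANDLERS 4

typedef void (*event_handler_t)(const char *event);

/* An event with the handlers currently bound to it (a one-to-many mapping). */
struct event_binding {
    const char *event;
    event_handler_t handlers[MAX_HANDLERS];
    int count;
};

static void handler_a(const char *e) { printf("handler_a got %s\n", e); }
static void handler_b(const char *e) { printf("handler_b got %s\n", e); }

static void bind(struct event_binding *b, event_handler_t h)
{
    if (b->count < MAX_HANDLERS)
        b->handlers[b->count++] = h;
}

static void raise_event(struct event_binding *b)
{
    /* Raising the event invokes every bound handler. */
    for (int i = 0; i < b->count; i++)
        b->handlers[i](b->event);
}

int main(void)
{
    struct event_binding page_fault = { "PageFault", {0}, 0 };
    bind(&page_fault, handler_a);   /* e.g. a default kernel service */
    bind(&page_fault, handler_b);   /* e.g. an extension's handler   */
    raise_event(&page_fault);       /* one event, many handlers      */
    return 0;
}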
Below is a table listing the interface functions SPIN exposes to extensions for use in
implementing memory management.
Resource Description
Physical address: allocate, deallocate, reclaim
Virtual address: allocate, deallocate
Event handlers: page fault, access fault, bad address
These interface functions are just a header file provided by SPIN. All of the inner-workings
of these functions will have to be implemented by the extension author.
Below is a table listing the interface functions SPIN exposes to extensions for use in
implementing CPU scheduling.
Resource Description
SPIN global scheduler: Decides the amount of time given to a particular extension to run on the CPU.
Final assessment
Core, trusted services within SPIN provide access to hardware mechanisms. These services
will have to exit the language-defined protection to control the hardware resources. The
applications that run using an extension have to trust the extension and the information it
returns. Extensions to core services only affect the applications that use that extension.
Errant extensions will not affect applications not using the errant extension.
Definitions
Term Definition
Quizzes
What is the difference between pointers in Modula-3 and pointers in C?
No difference exists.
C pointers are more restricted.
Modula-3 pointers are type-specific.
Which of the below structures will result in the least amount of privilege level crossings?
Monolithic
Microkernel
SPIN
Either SPIN or Monolithic
the exokernel approach
Introduction
The main idea behind the exokernel is to explicitly expose the hardware resources to
processes attempting to use said resources. The exokernel decouples authorization from
use for hardware resources.
Library operating systems request a hardware resource from the exokernel and in turn the
exokernel provides the library operating system with an encrypted key. The exokernel binds
the library operating system to the requested hardware resource and the library operating
system can now utilize the hardware resource without the exokernel's interference by
presenting the encrypted key.
How the library operating system utilizes the hardware resource is entirely up to it. The
exokernel is only designed to protect the resources and expose the resources when correct
credentials are presented for authentication.
Three types of methods are available that allow the exokernel to securely bind resources to
library operating systems:
Method Description
Hardware mechanisms: specific hardware resources (e.g. a TLB entry or a physical page frame) are bound to a library operating system when the binding is established.
Software caching: the exokernel caches bindings on behalf of a library operating system, such as the per-library-operating-system software TLB described below.
Downloading code into the kernel: library operating system code is downloaded into the exokernel so it can be invoked without crossing into the library operating system on every event.
The exokernel doesn't resolve page faults or virtual to physical memory mappings for
pages. Instead, the exokernel provides the page fault to the library operating system or
application that owns the thread running on the CPU that generates the page fault. The
library operating system services the page fault and presents the correct mapping of virtual
to physical memory to the exokernel, along with the library operating system's key to access
the hardware.
The exokernel will proceed to update the TLB, install the mapping, and then the library
operating system's process / thread will be run again on the CPU.
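A minimal sketch of this interaction, assuming an invented exo_install_mapping call and a made-up key value (the real exokernel interface differs): the library operating system resolves the fault and presents a mapping plus its credential, and the exokernel only validates the credential before installing the mapping into a (simulated) TLB.

#include <stdio.h>

#define TLB_SIZE 8

/* One entry of a (simulated) hardware TLB: virtual page -> machine page. */
struct tlb_entry { unsigned vpn, mpn; int valid; };
static struct tlb_entry tlb[TLB_SIZE];

#define LIBOS_KEY 0xCAFEu   /* key handed out at secure-binding time (made up) */

/* The library OS resolves the fault and presents the mapping plus its key;
   the exokernel only checks the credential and installs the mapping. */
static int exo_install_mapping(unsigned key, unsigned vpn, unsigned mpn)
{
    if (key != LIBOS_KEY)
        return -1;                       /* credential check failed */
    struct tlb_entry *e = &tlb[vpn % TLB_SIZE];
    e->vpn = vpn; e->mpn = mpn; e->valid = 1;
    return 0;                            /* thread can now be run again */
}

int main(void)
{
    /* Library OS services a page fault on vpn 42 and proposes mpn 7. */
    if (exo_install_mapping(LIBOS_KEY, 42, 7) == 0)
        printf("mapping installed: vpn 42 -> mpn %u\n", tlb[42 % TLB_SIZE].mpn);
    return 0;
}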
When a context switch occurs, one of the biggest performance losses is the loss of locality
within the cache and the TLB. The exokernel utilizes a software TLB for each library
operating system that represents each system's memory mapping. When a context switch
occurs, the exokernel will place some subset of the library operating system's software TLB
into the hardware TLB. This helps to mitigate the loss of locality from context switching.
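A toy sketch of the idea, with invented structures: the exokernel keeps a software TLB per library operating system and, on a context switch, preloads some subset of it into the (simulated) hardware TLB so the incoming system does not start completely cold.

#include <stdio.h>

#define HW_TLB_SIZE 4
#define SW_TLB_SIZE 16

struct mapping { unsigned vpn, mpn; };

/* Per library-OS software TLB kept by the exokernel. */
struct soft_tlb { struct mapping entries[SW_TLB_SIZE]; int count; };

static struct mapping hw_tlb[HW_TLB_SIZE];   /* simulated hardware TLB */

/* On a context switch, preload a subset of the incoming library OS's
   software TLB entries into the hardware TLB. */
static void context_switch_in(const struct soft_tlb *s)
{
    int n = s->count < HW_TLB_SIZE ? s->count : HW_TLB_SIZE;
    for (int i = 0; i < n; i++)
        hw_tlb[i] = s->entries[i];
}

int main(void)
{
    struct soft_tlb libos = { { {1, 10}, {2, 20}, {3, 30} }, 3 };
    context_switch_in(&libos);
    printf("hw TLB[0]: vpn %u -> mpn %u\n", hw_tlb[0].vpn, hw_tlb[0].mpn);
    return 0;
}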
The exokernel maintains a linear vector of time slots for each library operating system to
occupy. Each slot represents a time quantum, and the time quantum dictates how long
each library operating system is allowed to execute on the CPU. On startup, the library
operating system is allowed to dictate which locations within the linear vector it wants to
occupy.
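A small sketch of this linear vector of time slots, with made-up sizes and identifiers: each slot is one time quantum, and at startup a library operating system claims the slots it wants.

#include <stdio.h>

#define NUM_SLOTS 8
#define FREE (-1)

/* Linear vector of time slots; each slot holds the id of the library OS
   that claimed it, or FREE. Each slot is one time quantum on the CPU. */
static int slots[NUM_SLOTS] = { FREE, FREE, FREE, FREE, FREE, FREE, FREE, FREE };

/* At startup a library OS asks for a particular slot in the vector. */
static int claim_slot(int libos_id, int slot)
{
    if (slot < 0 || slot >= NUM_SLOTS || slots[slot] != FREE)
        return -1;              /* already taken */
    slots[slot] = libos_id;
    return 0;
}

int main(void)
{
    claim_slot(1, 0);           /* library OS 1 wants the first quantum  */
    claim_slot(2, 1);           /* library OS 2 wants the second quantum */
    claim_slot(1, 4);

    for (int slot = 0; slot < NUM_SLOTS; slot++)   /* one pass of the round */
        if (slots[slot] != FREE)
            printf("quantum %d -> library OS %d\n", slot, slots[slot]);
    return 0;
}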
The exokernel needs a mechanism to revoke resources from different library operating
systems. Unlike SPIN with its strand abstraction, however, the exokernel has no
understanding of what a library operating system is using the CPU for.
The exokernel has a revoke call that provides a library operating system a repossession
vector, describing the resources the exokernel is going to revoke from the library operating
system.
The library operating system cleans up whatever it was doing with the defined resources in
the repossession vector, usually by saving the contents to disk. The library operating
system is provided the mechanism to seed the exokernel, allowing the library operating
system to request an autosave mechanism whenever the exokernel intends upon revoking
resources.
Code that is performance critical can be downloaded into the exokernel creating somewhat
of an extension for a particular library operating system.
The exokernel passes all program discontinuities to the library operating system that owns
the thread currently executing on the CPU. In order to do this with fine granularity, the
exokernel will maintain a state for each library operating system. This allows the exokernel
to maintain exceptions, page faults, syscall traps, external interrupts, etc. for the next time a
library operating system is scheduled to handle the discontinuity.
The exokernel maintains a PE data structure on behalf of each library operating system
containing:
entry points for library operating systems to deal with different discontinuities
exceptions
interrupts
syscalls
page faults (requests for an addressing context)
Software TLB
Downloading code into the kernel allows for first level interrupt handling by the exokernel
on behalf of a library operating system.
Below is a high-level representation of the exokernel data structures maintained for each
library operating system.
Performance results of SPIN and exokernel
Key takeaways:
SPIN and exokernel are better at procedure calls between protection domains than a
microkernel.
SPIN and exokernel performance is on par with the monolithic kernel operating system
structure for conducting syscalls.
Definitions
Term Definition
translation lookaside buffer (TLB): a memory cache that is used to reduce the time taken to access a user memory location.
page fault: a type of exception raised by computer hardware when a running program accesses a memory page that is not currently mapped by the memory management unit (MMU) into the virtual address space of a process.
software TLB: a TLB maintained by the exokernel for each library operating system in order to mitigate the performance loss caused by context switches.
repossession vector: a vector describing the resources the exokernel intends to revoke from a library operating system.
Quizzes
Exokernel's mechanism for library operating systems to "download code into the kernel"
vs. SPIN "extensions and logical protection domains". Which one of these mechanisms
compromises protection more?
SPIN
Exokernel
SPIN enforces protection at compile-time and has runtime verification. For the exokernel, a
library operating system can download arbitrary code into the kernel. Not very safe.
Give a couple of examples of how "code downloaded" from a library operating system
may be used by the exokernel.
the l3 microkernel approach
The creators of the SPIN and exokernel operating systems approached their research with
the assumption that microkernel design had inherently poor performance. This is due to the
fact that the most popular microkernel design of the period was Mach - a microkernel
aimed at providing portability, not performance.
In this portion of the lesson we will look at L3, an operating system design that attempts to
prove microkernel designs can provide performance.
The key distinction about the L3 microkernel is that each system service has to exist within
its own protection domain, however, each system service doesn't have to exist within its
own address space.
Below is a table describing the issues that exist within the design of a microkernel that
impact performance.
Issue Definition
Kernel-user switches: border crossing cost (context switches)
Address space switches: the basis for protected procedure calls across protection domains; going across hardware address spaces, at a minimum, requires a context switch and the flushing of the TLB
Thread switches and IPC: the kernel mediates all protected procedure calls
L3 provides empirical proof in their paper that their context switch utilizes 123 processor
cycles - including re-mapping the TLB and resolving cache misses.
Mach on the same hardware utilizes 900 processor cycles to conduct a context switch,
resolve cache misses, and re-map the TLB.
When a context switch takes place, do we have to flush the TLB? Is the entire virtual to
physical address mapping invalid?
It depends on whether the TLB has a mechanism to recognize that a virtual to physical
mapping belongs to a distinct process. This is called an address space tagged TLB, where
each TLB entry is tagged with the address space of the process that owns it.
With a tagged TLB there is no need to flush the TLB on a context switch, as the hardware
allows us to store multiple different processes' virtual to physical mappings within the TLB.
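A simplified sketch of why a tagged TLB avoids the flush, using an invented tlb_entry layout: each entry carries an address-space identifier, so a lookup only hits when both the tag and the virtual page match, and entries from several processes can coexist.

#include <stdio.h>

#define TLB_SIZE 4

/* Each TLB entry is tagged with the address-space (process) ID that owns it,
   so entries from several processes can coexist without a flush. */
struct tlb_entry { unsigned asid, vpn, ppn; int valid; };
static struct tlb_entry tlb[TLB_SIZE];

static int tlb_lookup(unsigned asid, unsigned vpn, unsigned *ppn)
{
    for (int i = 0; i < TLB_SIZE; i++)
        if (tlb[i].valid && tlb[i].asid == asid && tlb[i].vpn == vpn) {
            *ppn = tlb[i].ppn;
            return 1;          /* hit: tag and virtual page both match */
        }
    return 0;                  /* miss: walk the page table, no flush needed */
}

int main(void)
{
    tlb[0] = (struct tlb_entry){ .asid = 1, .vpn = 5, .ppn = 50, .valid = 1 };
    tlb[1] = (struct tlb_entry){ .asid = 2, .vpn = 5, .ppn = 90, .valid = 1 };

    unsigned ppn;
    if (tlb_lookup(2, 5, &ppn))    /* same vpn, different address space */
        printf("asid 2, vpn 5 -> ppn %u\n", ppn);
    return 0;
}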
We can use the hardware defined, segment register bounded protection domains to
determine if the TLB is valid for a specific protection domain. If the protection domain is
valid, then we know we don't need to flush the TLB.
If the segment boundaries overlap because the protection domains are excessively large,
we must conduct a TLB flush - if that TLB is not address space tagged. For large protection
domains, the implicit cost is greater than the explicit cost - our cache locality will greatly
suffer. Example: 864 cycles for TLB flush in a Pentium processor.
By construction, L3 has been shown to be competitive with the SPIN and exokernel architectures.
Context switching involves saving all volatile state of a processor.
Memory effects
What do we mean by memory effects? Not all the memory needed to allow a process to
execute will be located within physical memory or the cache. The memory effects are
essentially the implicit costs of switching protection domains: the loss of locality in the
caches and the TLB.
Liedtke suggests packing small protection domains into a single hardware address space and
protecting them using segment registers. The cache will be warmer more often because we
reduce the implicit cost of address space switching - our protection domains will occupy
less space.
You can't avoid the implicit costs imposed by protection domains with large address
spaces, even within monolithic, SPIN, or exokernel architectures.
Focus on Portability
Mach's focus on portability resulted in:
Code bloat - causes a large memory footprint
Less locality - more cache misses
All of this causes longer latency for border crossing. Mach's kernel memory footprint is the
culprit of Mach's lack of performance. Portability and performance objectives in Mach's
design contradict each other.
The thesis for the implementation of L3 is one related to how operating systems should be
structured. Here are some suggestions from the thesis:
Definitions
Term Definition
address space tag: a tag within a TLB that specifies the process ID generating the translation.
Quizzes
Why did Mach take roughly 800 more cycles than L3 for user-kernel switching?
L3 uses a faster processor.
Liedtke is smarter.
Due to Mach's design priorities.
Microkernels are slow by definition.
The design priorities of Mach were different than L3. Mach aimed for both extensibility and
portability rather than performance.
paper review questions
SPIN paper
These review questions are related to the paper, "Extensibility, Safety and Performance in
the SPIN Operating System" written by Brian N. Bershad et al.
What features of Modula-3 are essential for implementing the extensibility features of
SPIN? Why?
The design of SPIN leverages the language's safety and encapsulation mechanisms,
specifically:
Feature Description
Type safety: prevents code from accessing memory arbitrarily. Pointers may only refer to objects of their referent's type. This safety mechanism is enforced through a combination of compile-time and run-time checks.
Automatic storage management: prevents memory used by a live pointer's referent from being returned to the heap and reused for an object of a different type.
A strand is an abstraction provided by SPIN for processor scheduling. Explain what this
means. In particular, how will a library operating system use the strand abstraction to
achieve its desired functionality with respect to processor scheduling? [Hint: Think of the
data structures to be associated by the library operating system with a strand.]
SPIN provides core services by default, one of these is the global scheduler for
extensions. The global scheduler uses the different strands of each extension to
schedule them for dispatch to the CPU.
SPIN provides a physical page service that may, at any time, reclaim physical memory
by raising the PhysAddr.Reclaim event. This interface allows the handler of the event to
volunteer an alternative page, which may be of less importance than the candidate
page. The translation service invalidates any mappings to the reclaimed page.
Exokernel paper
These review questions are related to the paper, "Exokernel: An Operating System
Architecture for Application-Level Resource Management" written by Dawson R. Engler et al.
The memory management unit raises an exception for the page fault.
The Exokernel determines which library operating system owns the thread from which the
page fault emanated.
The Exokernel delivers the page fault to the library operating system for handling.
The library operating system services the page fault and presents the correct TLB
mapping to the Exokernel, along with its key to access memory.
The Exokernel updates the TLB, installs the mapping of virtual to physical memory, and
dispatches the thread of the library operating system onto the CPU.
"The high-level goals of SPIN and Exokernel are the same." Explain why this statement is
true.
Both operating system architectures attempt to create highly extensible structures
while still maintaining performance and protection. This is evident in their design which
establishes a very minimal kernel, allowing for most of the system services to be
defined and executed within user-space.
Explain how the mechanisms in Exokernel help to meet the high-level goals.
L3 microkernel paper
These review questions are related to the paper, "On micro-Kernel Construction" written by
Jochen Liedtke.
What is the difference between a "thread" as defined in Liedtke's paper and a Strand in
SPIN?
SPIN does not define a thread model for applications, but a structure on which an
implementation of a thread model rests. A strand has no minimal or requisite kernel state
other than a name.
Liedtke argues that a microkernel should fully exploit whatever the architecture gives to
get a performance-conscious implementation of the microkernel. With this argument as
the backdrop, explain how context switching overhead of a microkernel may be mitigated
in modern architectures. Specifically, discuss the difference in approaches to solving this
problem for the PowerPC and Intel Pentium architectures. Clearly explain the
architectural difference between the two that warrant the difference in the microkernel
implementation approaches.
Address space switches are often considered costly. These switches are often costly due to
the requirement to flush the TLB. Using tagged TLBs removes the overhead of flushing,
since an address space switch is transparent to the TLB.
All the subsystems in L3 live in user space. Every such subsystem has its own address
space. The address space mechanism "grant" allows an address space to give a physical
memory page frame to a peer subsystem. Can this compromise the "integrity" of a
subsystem? Explain why or why not.
The grant mechanism implies that there is a sender and receiver conducting IPC to
complete the transfer of memory. The onus is on the subsystem designer; the L3
microkernel will not conduct checking of the memory page being transferred via the grant
mechanism. When the two subsystems initiate IPC, it is possible that a malicious or errant
subsystem could compromise the integrity of another subsystem by granting a page frame
crafted to do so.
lesson3
virtualization basics
Platform virtualization
The main concept described in this slide is using virtualization to provide services to
multiple customers using one set of hardware. Every customer will have the same
experience at a lower cost.
Utility computing
The concept described here is that multiple customers can share resources, but also be able
to use a diverse set of operating systems on the hardware resources provided.
The customers are also able to split the cost of maintenance for the hardware, instead of
each customer acquiring their own hardware.
The hardware resources available will always be greater than the individual resource
requirements of each customer. This allows the customer to have access to more resources
than they would be able to leverage on their own, monetarily.
Virtual machine managers (or hypervisors) monitor and manage the guest operating
systems running on the shared hardware resources.
There are two types of hypervisors:
Hypervisor type Description
Native hypervisor: runs on bare metal; all guest operating systems interact directly with the native hypervisor. Provides the best performance for the guest operating systems. Example: VMWare ESXi.
Hosted hypervisor: runs on top of a host operating system, with the guest operating systems running as clients of the host. Example: VMWare Workstation.
Full virtualization
Full virtualization leaves the guest operating system completely untouched: not a single line
of its code has been modified for it to interact with the hypervisor.
This requires some finesse; guest operating systems are running as user processes, thus
they don't have access to privileged instructions. When these guest operating systems
attempt to execute privileged instructions, the hypervisor will trap them, emulate the intended
function, and provide the result back to the guest.
Some issues exist with this trap-and-emulate strategy - some privileged instructions fail
silently when executed at user level, so the guest will never know whether they succeeded or not.
To fix this, the hypervisor implements a binary translation strategy, looking for
instructions that will fail silently on the target hardware and dealing with those instructions
carefully to catch the issue and take appropriate actions.
Para virtualization
Another approach to virtualization is to modify the source code of the guest operating
system. This allows us to avoid problematic instructions, but also allows us to implement
optimizations.
Para-virtualized operating systems will be aware of the fact that they are guests within a
hypervisor, and thus will be knowledgeable about the hardware instructions directly
available for them to use.
Applications will still only be able to use the API of the guest operating system that they are
running on, but the operating system itself will utilize para-virtualization to optimize its
interaction with the hardware.
Definitions
Term Definition
Quizzes
What comes to your mind when you hear the word "virtualization"?
Memory systems
Data centers
JVM
Virtual Box
IBM VM/370
Google glass
Cloud computing
Dalvik
VMWare Workstation
The movie "Inception"
What percentage of guest operating system code may need modification with para-
virtualization?
~10%
~30%
~50%
less than 2%
Less than 2% of the guest operating system code needs to be modified in order to enable
para-virtualization. This was demonstrated by Xen's experiments with para-virtualization.
memory virtualization
Introduction
What needs to be done in the system software stack to support virtualization? The question
boils down to:
Memory hierarchy
The biggest issue here is handling virtual memory, namely the virtual address to physical
memory mapping. This is part of the key functionality of memory management for any
operating system.
Each process is contained within its own protection domain. The operating system
manages a page table for each process, and each process contains its own listing of virtual
page numbers that are contiguous in its view.
The virtual address space for a given process is not contiguous within the physical
memory, but scattered across it. When a process utilizes a virtual address, a translation is
conducted using the page table to reference a location within physical memory.
In this case, each guest operating system is considered its own process and protection
domain. Each guest operating system, within its own protection domain, contains its own
paging system for each process running inside of it.
The hypervisor does not track the page tables for each of the applications running inside of
each guest operating system. The hypervisor is only concerned with the page table of each
protection domain (guest operating system).
Guest operating systems will consider their physical memory contiguous, but in the real
physical memory (machine memory) the guest operating systems are not stored in a
contiguous manner. Their memory is also virtual and is mapped to physical memory by a
page table managed by the hypervisor.
The memory requirements of the guest operating systems are also bursty, randomly
requiring large amounts of memory but also releasing it later. The hypervisor may not be
able to provide another contiguous region of memory to a guest, making the contiguity of
memory within the guest operating system an illusion.
In a virtualized setting, there exist three levels of indirection. The hypervisor will maintain
machine page numbers (MPN) that will be utilized to conduct memory translations
between the guest operating systems and the host's memory resources.
This is what the memory hierarchy looks like for a virtualized system from most abstract to
least abstract:
Abstraction Definition
Virtual page numbers: pages as seen by the processes running inside a guest operating system; translated to physical page numbers by the guest operating system's page tables
Physical page numbers: pages managed by the guest operating systems; translated to machine page numbers using the hypervisor's shadow page table
Machine page numbers: pages managed by the hypervisor; these correspond to the real physical memory of the machine
Usually, the CPU uses the page table for address translation. The CPU will receive a virtual
address, access the Translation Lookaside Buffer (TLB) to see if translation is already
cached. If it is a hit, the translation is essentially immediate. If it is a miss, the CPU will
consult the page table for the virtual to physical address translation. Once this is complete,
it will stash the virtual to physical translation in the TLB.
In virtualized settings, the hardware page table is the shadow page table. Below is a high-
level representation of the concepts described above.
Efficient mapping (full virtualization)
How do we make the virtual to physical to machine address translation process efficient?
All changes to the page table by the guest operating system are privileged accesses. These
instructions will be trapped by the hypervisor, and the hypervisor will proceed to update the
same mapping in the shadow page table. The translations that occur from virtual to
physical are stored in the hardware TLB and hardware page table.
This bypasses the guest operating system's page table - each time a process generates a
virtual address, the translation is conducted using the hardware page table and TLB.
This means the guest operating system does not have to be involved in every virtual
address translation; it is only involved when it updates its page table, and that update is
trapped by the hypervisor.
Essentially, the hypervisor tracks the memory locations of every unprivileged process
running in every guest operating system. Below is a high-level representation of the
concepts described above.
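A minimal sketch of the bookkeeping, with invented names: when the guest's trapped page-table update maps a VPN to a guest PPN, the hypervisor consults its own PPN-to-MPN table and installs the resulting VPN-to-MPN entry in the shadow page table that the hardware actually walks.

#include <stdio.h>

#define NUM_PAGES 16

/* Hypervisor's table: guest "physical" page -> real machine page. */
static unsigned ppn_to_mpn[NUM_PAGES];

/* Shadow page table consulted by the (simulated) hardware: VPN -> MPN. */
static unsigned shadow_pt[NUM_PAGES];

/* Invoked when the guest's privileged page-table update traps into the
   hypervisor: the VPN -> PPN mapping the guest wanted becomes a
   VPN -> MPN mapping in the shadow page table. */
static void on_guest_pt_update(unsigned vpn, unsigned ppn)
{
    shadow_pt[vpn % NUM_PAGES] = ppn_to_mpn[ppn % NUM_PAGES];
}

int main(void)
{
    ppn_to_mpn[3] = 11;              /* hypervisor backed guest PPN 3 with MPN 11 */
    on_guest_pt_update(7, 3);        /* guest maps VPN 7 -> PPN 3 (trapped)       */
    printf("hardware translates vpn 7 -> mpn %u\n", shadow_pt[7]);
    return 0;
}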
Efficient mapping (para-virtualization)
The burden of efficient mapping is placed upon each guest operating system. Each guest
operating system maintains its contiguous "physical memory" and also knows where it
exists within the machine memory.
The burden of VPN to PPN to MPN mapping is all handled by the guest operating system.
For example, in Xen, a set of hypercalls are provided to the para-virtualized guest operating
systems to tell the hypervisor about changes to the hardware page table. This allows the
guest operating system the ability to allocate and initialize a page table data structure
within the memory originally allocated to it by the hypervisor.
A switch hypercall is also provided for the guest operating system to switch contexts to a
different user mode application. The hypervisor is unaware of the switches, it just handles
requests made by the guest operating system.
A facility for updating page tables is also provided: the guest operating system notifies the
hypervisor of what modifications need to be made to the page table data structure. Below is
a high-level representation of the concepts described above.
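A sketch of the batching idea, loosely modeled on Xen's page-table update hypercall but with invented names and no real validation: the guest queues its updates and ships them to the hypervisor in one crossing.

#include <stdio.h>

#define BATCH_SIZE 8

struct pt_update { unsigned vpn, mpn; };

/* Stand-in for the real hypercall: the hypervisor validates and applies a
   whole batch of page-table updates in one crossing. */
static void hypercall_mmu_update(const struct pt_update *u, int n)
{
    for (int i = 0; i < n; i++)
        printf("hypervisor applies vpn %u -> mpn %u\n", u[i].vpn, u[i].mpn);
}

/* Guest side: queue updates and issue the hypercall only when the batch is
   full (or at a convenient point), amortizing the guest/hypervisor crossing. */
static struct pt_update batch[BATCH_SIZE];
static int batched = 0;

static void queue_update(unsigned vpn, unsigned mpn)
{
    batch[batched++] = (struct pt_update){ vpn, mpn };
    if (batched == BATCH_SIZE) {
        hypercall_mmu_update(batch, batched);
        batched = 0;
    }
}

int main(void)
{
    for (unsigned vpn = 0; vpn < 10; vpn++)
        queue_update(vpn, 100 + vpn);
    if (batched > 0)                        /* flush the partial batch */
        hypercall_mmu_update(batch, batched);
    return 0;
}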
Dynamically increasing memory
How do we dynamically increase the amount of physical memory allocated to each guest
operating system that needs it?
Essentially we monitor the memory requirements of each operating system and reap
memory from operating systems not utilizing their full memory space. Then we provide that
memory to operating systems currently experiencing memory pressure. Below is a high-
level representation of the concepts described above.
Ballooning
A special device driver is installed in every guest operating system (whether full or para-
virtualization is being used). This device driver is called a balloon. This driver is the key for
managing memory pressures that may be experienced by guest operating systems.
Let's say the hypervisor needs more memory, suddenly. The hypervisor will contact
operating systems that are currently not utilizing all of their memory and interact with the
balloon driver. The hypervisor instructs the balloon device driver to begin inflating - the
balloon device driver will begin requesting memory from the guest operating system.
This forces the guest operating system to begin paging memory to disk in order to satisfy
the requirements of the balloon driver. Once the balloon driver has acquired all of the
requested memory, the balloon will return the memory back to the hypervisor.
So what if the guest needs more memory and the hypervisor has excess memory? The
hypervisor will contact the balloon driver of the guest operating system and instruct it to
deflate - contracting its memory footprint. Through this, it is actually releasing memory into
the guest operating system. This allows the guest operating system to page in the
processes that were previously paged out to the disk.
This technique assumes that the guest operating system and the balloon drivers coordinate
with the hypervisor. Below is a high-level representation of the concepts described above.
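A toy model of inflate and deflate, with made-up page counts and no paging-to-disk modeled: inflating the balloon takes pages away from the guest so the hypervisor can reclaim them, deflating gives them back.

#include <stdio.h>

/* A toy model of one guest: total memory, memory in use by the guest,
   and memory currently held by its balloon driver. */
struct guest {
    int total_pages;
    int guest_used;
    int balloon_pages;
};

/* Inflate: the balloon asks the guest for pages (forcing the guest to page
   out if needed) and hands them back to the hypervisor. Returns pages gained. */
static int balloon_inflate(struct guest *g, int want)
{
    int free_pages = g->total_pages - g->guest_used - g->balloon_pages;
    int got = want < free_pages ? want : free_pages;   /* simplification: no paging modeled */
    g->balloon_pages += got;
    return got;
}

/* Deflate: the balloon releases pages back to the guest. */
static void balloon_deflate(struct guest *g, int give)
{
    if (give > g->balloon_pages)
        give = g->balloon_pages;
    g->balloon_pages -= give;
}

int main(void)
{
    struct guest idle_guest = { .total_pages = 100, .guest_used = 40, .balloon_pages = 0 };
    int reclaimed = balloon_inflate(&idle_guest, 30);   /* hypervisor needs memory     */
    printf("hypervisor reclaimed %d pages via the balloon\n", reclaimed);
    balloon_deflate(&idle_guest, 30);                   /* pressure eased, give it back */
    printf("balloon now holds %d pages\n", idle_guest.balloon_pages);
    return 0;
}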
Can we share memory across virtual machines, without affecting the integrity of the
protection domains that exist between each guest operating system?
Yes, we can!
Based upon the similarity of virtual machines existing within a hypervisor, we can avoid
duplication by allowing guest operating systems to point to the same locations in memory.
Of course core pages are going to be separate, but applications that are similar can share
the same locations within the machine memory.
One method is to have the virtual machines and the hypervisor cooperate in sharing.
The guest operating system has hooks that allow the hypervisor to mark pages as copy-on-
write - when a guest operating system writes to a shared page, it faults and is given a
different page. All things are fine when the guest operating systems are reading a page,
but as soon as a write occurs, the guest operating systems must separate and begin
referencing to different pages. Below is a high-level representation of the concepts
described above.
So now we're trying to achieve the same effect of memory sharing, as annotated above, but
the guest operating systems will be completely oblivious to the fact they are sharing
memory locations.
If a page in one guest operating system has the same content as a page in a different guest
operating system, the guest operating systems will be oblivious to the fact that they are
sharing the same page in machine memory. The hypervisor maintains hashes of
the content of each page for every guest operating system, looks for matches, and utilizes
this as a hint as to whether or not we can have these guest operating systems share
locations in memory.
If we have a match, the hypervisor conducts a full comparison of the guest operating
system's memory footprints. This is done because a hash match is not an absolute
indicator that the memory footprints of two guest operating systems are exactly the same.
The content in both memory locations could've changed since the last check. Below is a
high-level representation of the concepts described above.
Successful match
If the two different memory locations contain the same content, the hypervisor will modify
the PPN -> MPN mapping for one of the guest operating systems, mapping the PPN
address to the same MPN address as the other operating system. Now, when a PPN -> MPN
translation is conducted for either guest operating system, they both will point to the same
machine memory location.
The reference count within the hash table entry for the specific hash will be incremented.
This mapping from PPN -> MPN will now be marked as copy-on-write - thus if either guest
operating system writes to the shared memory location, the guests will copy their content
into different memory locations and separate their usage.
This should not be done when there is active usage of the system - this is computationally
expensive. This is a function of the background activity of the server when a low load is
experienced.
This technique is applicable to both fully and para-virtualized systems.
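A sketch of the hash-as-a-hint check described above, using a simple placeholder hash (FNV-style) and tiny pages: a hash match only triggers a full byte-wise comparison, and only a full match makes the pages candidates for sharing with copy-on-write.

#include <stdio.h>
#include <string.h>
#include <stdint.h>

#define PAGE_SIZE 64   /* tiny "pages" just for illustration */

/* A deliberately simple hash used only as a hint, as in the lecture:
   a match is not proof, so a full byte-wise compare follows. */
static uint32_t page_hash(const unsigned char *page)
{
    uint32_t h = 2166136261u;
    for (int i = 0; i < PAGE_SIZE; i++)
        h = (h ^ page[i]) * 16777619u;   /* FNV-1a style */
    return h;
}

/* Returns 1 if the two pages can be shared (same hash AND same content). */
static int can_share(const unsigned char *a, const unsigned char *b)
{
    if (page_hash(a) != page_hash(b))
        return 0;                               /* hint says different */
    return memcmp(a, b, PAGE_SIZE) == 0;        /* full comparison confirms */
}

int main(void)
{
    unsigned char vm1_page[PAGE_SIZE], vm2_page[PAGE_SIZE];
    memset(vm1_page, 0xAB, PAGE_SIZE);
    memset(vm2_page, 0xAB, PAGE_SIZE);

    if (can_share(vm1_page, vm2_page))
        printf("pages identical: map both PPNs to one MPN, mark copy-on-write\n");
    return 0;
}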
Policy Description
Pure share-based approach: what you pay for is what you get. The problem with this policy is that it could lead to hoarding.
Working-set-based approach: if the working set of a guest operating system increases, more memory is allocated. If a low load is experienced, memory is released.
Dynamic idle-adjusted shares approach: tax idle pages more than active pages. Hoarders will be taxed more for memory they're not using. Guests using a majority of their memory will be taxed less.
Reclaim most idle memory: reclaim idle memory from all virtual machines that are not actively using it. This allows for sudden working-set increases that might be experienced by a particular domain.
Definitions
Term Definition
Machine page number: page numbers maintained within the shadow page table by the hypervisor; the actual physical memory in a virtualized environment
Balloon driver: a special device driver installed in every guest operating system, utilized to reclaim memory from or allocate memory to a guest operating system
Shadow page table: the page table maintained by the hypervisor in fully virtualized systems that contains virtual to machine page mappings
Quizzes
Who keeps the PPN -> MPN mapping in a fully virtualized setting?
Hypervisor
Guest Operating System
In the case of a fully virtualized system, the guest operating systems have no knowledge
that they are being virtualized. Therefore, the hypervisor must maintain the shadow page
table.
Who keeps the PPN -> MPN mapping in a para-virtualized setting?
Para-virtualized systems know that virtualization exists. Therefore, the guest operating
system knows its physical memory will be fragmented in the machine memory. Usually, the
PPN -> MPN mapping is kept in the guest operating system.
cpu and device virtualization
Introduction
The challenge that exists when virtualizing CPUs and devices is giving the illusion to each
guest operating system that they own the CPU. That is, each guest operating system
doesn't know that other guests exist running on the same CPU.
This illusion is being provided by the hypervisor at the granularity of the entire operating
system.
The hypervisor must field events arising due to the execution of processes that belong to a
guest operating system. There are going to be discontinuities or exceptions that will occur,
and the hypervisor must ensure that the correct guest operating system handles these
instances.
CPU virtualization
Each guest operating system is already scheduling the processes that it currently hosts
onto the CPU. This is great, however, the hypervisor sits in between each guest operating
system and the CPU. The hypervisor must provide the illusion to each guest operating
system that the guest owns the CPU. This allows each guest OS to schedule its processes
on the CPU.
The hypervisor must have a precise way of keeping track of the time a particular guest OS
uses on the CPU, from the point of view of billing the different customers (guest operating
systems).
Common policies hypervisors use to share the CPU among guests include:
Proportional share
Fair share
In either of these cases, the hypervisor has to account for the time used on the CPU on
behalf of a different guest during the time allocated for a particular guest. This can happen
if an external interrupt occurs that is intended for a guest operating system while a process
for a different VM is currently running on the CPU.
The hypervisor will essentially track the stolen time of a guest operating system used by
other guest operating systems. The hypervisor will reward those VMs that weren't allowed
to run due to some extenuating circumstance. Below is a high-level representation of the
concepts described above.
So what are the things this process can do that might require intervention and notification
to the guest operating system?
One example would be the opening of a file. This is a privileged instruction that will be
handled by the guest operating system.
Another example could be the process incurring a page fault. Maybe not all of its
virtual address space is currently mapped to machine memory.
Another example would be the creation of an Exception.
Another example could be an external interrupt.
These are all called program discontinuities that can interrupt the normal execution of a
process. All of these discontinuities have to be passed up, through the hypervisor, to the
parent of the process (the guest operating system).
These events will be delivered as software interrupts to the guest operating system.
Some quirks exist based upon the architecture, and the hypervisor may have to deal with
this. Some of the operations that the guest operating system may have to do to deal with
these discontinuities may require the guest operating system to have privileged access to
the hardware.
This presents a problem, especially in a fully virtualized environment - the guest operating
system is unaware that it's virtualized. When the guest operating system conducts the
privileged operation to remediate a discontinuity, the privileged instruction will trap into the
hypervisor. Unfortunately, some privileged instructions fail silently, the guest operating
system will have no knowledge that this instruction failed or never reached the hypervisor.
The hypervisor will have to determine, for fully virtualized systems, where these quirks may
exist and edit the binary of the guest operating system in order to catch these instructions.
In a fully virtualized environment, all interactions between the guest operating system and
the hypervisor are implicitly conducted via traps for system calls.
In a para-virtualized environment, the guest operating system is provided APIs in order to
ask the hypervisor to conduct some privileged instruction. This is evident when we
discussed how para virtualized operating systems conduct management of their page
tables for different user level processes. Below is a high-level representation of the
concepts described above.
The next issue is virtualizing devices, giving the illusion to the guest operating systems that
they own particular I/O devices.
Full virtualization: trap and emulate all accesses to those devices. The guest operating
system does not know it is currently virtualized.
Para-virtualization: more opportunity for innovation - the para-virtualized system is aware
of the devices available for it to use. The set of hardware devices that are available to the
hypervisor are also available to the guest operating system. The hypervisor has the ability to
provide an API for the guest operating system to use - providing some level of
transparency and optimization.
How do we transfer control of the devices between the hypervisor and the guest?
How do we transfer data from the devices to the correct recipient?
Control transfer
Full virtualization:
implicit traps by the guest -> handled by the hypervisor
software interrupts are served from the hypervisor -> guest
Para virtualization:
explicit hypercalls guest -> hypervisor (results in control transfer)
software interrupts are generated from the hypervisor -> guest
the additional facility that exists here is that the guest has control via
hypercalls on when event notifications need to be delivered from the
hypervisor.
similar to an operating system disabling interrupts.
Data transfer
Full virtualization
data transfer is, again, implicit
Para Virtualization
explicit -> provides an opportunity to innovate
The first aspect is CPU time: the hypervisor has to de-multiplex the data to the correct
owners upon receiving an interrupt from a device. The hypervisor must account for the
computational time it took to manage the buffers on behalf of the virtualized operating
systems.
The second aspect is how the memory buffers are managed. How are they allocated
either by the guest or by the hypervisor?
In Xen, to conduct data transfers and requests, the guest operating systems define a ring
buffer type structure in which they produce and place their I/O requests.
Xen will consume the request, process the request, acquire the data from the requested
device, and provide the data back to the guest via the ring.
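A simplified sketch of such a request ring, loosely modeled on the I/O ring described above (real rings also carry responses and event notifications): the guest advances a producer index as it places requests, and the hypervisor advances a consumer index as it services them.

#include <stdio.h>

#define RING_SLOTS 8

/* A simplified I/O request ring shared between the guest (producer) and
   the hypervisor / driver domain (consumer). */
struct io_ring {
    int requests[RING_SLOTS];    /* here a "request" is just an id */
    unsigned req_prod;           /* advanced by the guest          */
    unsigned req_cons;           /* advanced by the consumer       */
};

static int ring_produce(struct io_ring *r, int req)
{
    if (r->req_prod - r->req_cons == RING_SLOTS)
        return -1;                               /* ring full */
    r->requests[r->req_prod % RING_SLOTS] = req;
    r->req_prod++;                               /* guest advances producer */
    return 0;
}

static int ring_consume(struct io_ring *r, int *req)
{
    if (r->req_cons == r->req_prod)
        return -1;                               /* nothing pending */
    *req = r->requests[r->req_cons % RING_SLOTS];
    r->req_cons++;                               /* consumer advances */
    return 0;
}

int main(void)
{
    struct io_ring ring = { {0}, 0, 0 };
    ring_produce(&ring, 101);                    /* guest places an I/O request */
    ring_produce(&ring, 102);

    int req;
    while (ring_consume(&ring, &req) == 0)       /* hypervisor services requests */
        printf("servicing I/O request %d\n", req);
    return 0;
}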
The whole concept is having good ways of measuring these resources for accurate billing:
CPU usage
Memory usage
Storage usage
Network usage
Virtualized environments need to have mechanisms in place to store records for each one
of these metrics for customers.
Definitions
Term Definition
Ring buffer: a data structure that uses a single, fixed-size buffer as if it were connected end-to-end
Quizzes
paper review questions
Xen Paper
These review questions are related to the paper, "Xen and the Art of Virtualization" written by
Paul Barham et al.
1. The process spawning foo calls fork to create a new process, which includes its
process control block. This will result in a hypercall to Xen, as the guest operating
system needs to create a new page table for the newly created process. Xen will create
a new page table utilizing the Memory Management Unit (MMU) and will provide the
base pointer of the page table back to the guest OS.
2. exec will be called to load the foo application into memory, requiring another
hypercall to Xen. Xen will utilize the MMU to update the page table for the process.
3. Finally, when foo is dispatched to run by the guest OS, another hypercall will be
made to Xen to conduct a context switch for foo , and foo 's page table will be
utilized as it executes on the CPU.
Once foo is running, it executes a blocking system call:
fd = fopen("bar");
Show all the interactions between XenoLinux and Xen for this call. Clearly indicate
when foo resumes execution.
This causes a software trap into the guest OS.
A context switch is conducted as the working set for the OS is loaded.
The guest OS acquires the file descriptor for foo .
Another context switch is conducted as foo is dispatched back onto the CPU to
resume execution.
Xen provides para-virtualized operating systems the ability to install fast handlers for
system calls.
Upon resumption, foo executes another blocking system call:
Show all the interactions between XenoLinux and Xen for this call. Clearly indicate
when foo resumes execution.
This causes a software trap into the guest OS.
The guest OS will generate an I/O request and place the request into the
asynchronous I/O ring buffer maintained by the Xen hypervisor.
The guest OS will advance the Request Producer pointer in the asynchronous I/O
ring buffer.
The next time the Xen hypervisor is able to conduct round-robin servicing of I/O
requests, it will consume the I/O request made by the guest OS.
The Xen hypervisor will advance the Request Consumer pointer.
Once the request is complete, the Xen hypervisor will advance the Response
Producer pointer, however, this was a write operation so there will only be an
indication that the I/O request to the device was successful.
This all occurs asynchronously so, as soon as the Request Producer pointer is
updated by the guest OS, a context switch will occur and foo will resume
execution in order to either continue writing or continue execution.
Construct and analyze a similar example for network transmission and reception by foo.
All processes in XenoLinux occupy the virtual address space 0 through VMMAX, where
VMMAX is some system defined limit of virtual memory per process. XenoLinux itself is a
protection domain on top of Xen and contains all the processes that run on top of it. Given
this, how does XenoLinux provide protection of processes from one another? [Hint: Work
out the details of how the virtual address spaces of the process are mapped by
XenoLinux via Xen.]
1. Upon creation of the XenoLinux protection domain, Xen provisions some space within
machine memory for the protection domain. XenoLinux is aware of its memory
locations, whether they are contiguous or dis-contiguous.
2. With this in mind, when creating space for a new process and creating its page table,
XenoLinux does not have write access to the page table, but must explicitly conduct
hypercalls in order to have Xen create the required space for the new page table. This page
table is also marked as read-only by Xen, and all the mappings from virtual addresses to
physical addresses are created.
3. Any time the page table for a process needs to be updated, XenoLinux will hypercall ,
requesting that Xen conduct the updates. These updates can be batched and applied
all at once (see the sketch after this list).
4. XenoLinux will conduct protection of each process like an operating system normally
would. XenoLinux knows the location of each process's page table and its virtual to
physical memory mappings. With this, XenoLinux is able to determine valid and invalid
memory references.
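As a rough sketch of the batching mentioned in step 3, the guest can queue page table
updates locally and push them to Xen in a single hypercall. The wrapper and hypercall
names below are made up for illustration (Xen's real interface is its mmu_update family of
hypercalls, whose exact signatures are not reproduced here).

/* Hypothetical batching of page table updates into one hypercall. */
#define BATCH_MAX 32

struct pt_update { unsigned long pte_addr; unsigned long new_val; };

static struct pt_update batch[BATCH_MAX];
static int batch_len;

/* Assumed to be provided by the hypercall layer: Xen validates and applies
 * all updates against the read-only page tables on the guest's behalf. */
extern int hypercall_mmu_update(struct pt_update *updates, int count);

static void flush_pte_updates(void)
{
    if (batch_len > 0) {
        hypercall_mmu_update(batch, batch_len);   /* one trap, many updates */
        batch_len = 0;
    }
}

static void queue_pte_update(unsigned long pte_addr, unsigned long new_val)
{
    batch[batch_len].pte_addr = pte_addr;
    batch[batch_len].new_val  = new_val;
    if (++batch_len == BATCH_MAX)
        flush_pte_updates();                      /* batch is full, flush now */
}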
VMWare ESX
What is the difference between VMWare Workstation and VMWare ESX server? Discuss
how OS calls from a user process running inside a VM are handled in two cases.
VMWare Workstation is host-based. VMWare ESX server is a hypervisor that does not run
within a host OS.
Both provide the illusion to the guest VM that it is running bare-metal. VMWare
Workstation traps system calls, conducts the system call itself to the host OS, and then
returns the result to the guest OS. VMWare ESX server manages system hardware
directly - it validates privileged instructions, conducts the instructions on behalf of
guest VMs, and provides the results back to the guests.
VMWare references the actual physical memory as machine memory. Each guest OS
thinks its memory is actual physical memory; however, transparently to the guest OS, its
memory may not be contiguous in machine memory. In addition, the VMWare ESX hypervisor maintains a
shadow page table to translate virtual memory addresses to machine memory
addresses.
The standard method involves introducing another level of paging. This would involve
paging some of the pages of a VM to swap space on a disk. This would require a meta-
level page replacement policy to be implemented by the hypervisor. The hypervisor not
only has to identify a candidate to reclaim memory from, but also what particular
pages need to be swapped out from that VM.
Two mechanisms were generated from this venture:
Ballooning - installing a pseudo device driver into each guest VM. The hypervisor
can inflate the balloon to reclaim pages from a VM. The VM will use its own
internal paging policy to determine which memory is the best candidate to be paged
out. This allows for less interference within the VM by the hypervisor - this only
works with VMs that will cooperate. The balloon may also be unable to reclaim
memory quickly enough to meet system demands. Upper bounds of balloon sizes
may also be imposed by guest OSs.
Demand paging - memory is reclaimed by paging out to an ESX server swap area
on disk, without the guest's consent. This is able to meet the system's
demands for memory faster, but can affect guest OS performance significantly.
Also requires the implementation of a swap daemon and its algorithm to determine
candidate pages for swap.
Page sharing is transparent to all guest OSs. VMWare ESX periodically checks the
content of each page being used by each guest OS. VMWare ESX records a hash for
each page being used by different VMs and maintains a hash table. If a hash for the
content of a page matches a hash within the hash table, VMWare ESX will conduct a
full comparison of the page contents to ensure the pages match completely. Afterwards, VMWare
ESX, transparently to the guest OSs involved, will have each guest OS point to the same
page for that particular memory reference. All future changes will be copy-on-write to
maintain the privacy of each page between each guest OS - if either guest writes, the
guests separate their memory usage again.
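A simplified sketch of that scan is below; the helper functions (hash_page,
pages_identical, map_cow_shared, and the hash table operations) are assumed here, and the
real ESX implementation adds details such as hint frames and copy-on-write bookkeeping.

#include <stdint.h>
#include <stdbool.h>

struct page;                               /* a machine page being scanned */

extern uint64_t     hash_page(struct page *p);
extern bool         pages_identical(struct page *a, struct page *b);
extern void         map_cow_shared(struct page *dup, struct page *canonical);
extern struct page *hash_table_lookup(uint64_t h);
extern void         hash_table_insert(uint64_t h, struct page *p);

void try_share(struct page *candidate)
{
    uint64_t h = hash_page(candidate);
    struct page *match = hash_table_lookup(h);

    if (match == NULL) {
        hash_table_insert(h, candidate);   /* remember this content for later */
        return;
    }
    /* A hash match can be a collision, so confirm with a full comparison. */
    if (pages_identical(candidate, match))
        map_cow_shared(candidate, match);  /* both guests now reference one page;
                                              a later write breaks the share via
                                              copy-on-write */
}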
VMWare uses a shares-per-page ratio. Customers that have paid for more resources
will receive those resources, and victim clients that have paid less will be required to
release resources. There is a minimum amount of resources you can have allocated
based upon how much you pay. VMWare ESX also implements an idle memory tax to
help balance the inequality that might arise if a customer
pays a lot for resources, then hoards those resources and doesn't use them. VMWare
ESX by default implements a 75% taxation rate on idle memory. The basic idea is that
clients are charged more for idle pages than for active pages.
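One way to express that charging, based on my recollection of the ESX memory-management
formulation (treat the exact formula as an assumption rather than a quote): a client holding
S shares and P pages, of which a fraction f are active, is ranked by an adjusted
shares-per-page ratio of roughly S / (P * (f + (1 - f) / (1 - tax))). With the default tax rate of
0.75, an idle page counts about four times as much as an active one, so a client hoarding
idle memory becomes the preferred victim for reclamation.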
lesson4
lesson4
shared memory machines
Different structures for shared memory machines:
This section essentially covers the cache coherence problem for shared memory systems.
This also covers the memory consistency model between hardware and software.
The memory consistency model here is sequential consistency: the memory accesses of
each process execute in program order, and the accesses from different processes are
arbitrarily interleaved.
For example:
Process P1: a = a + 1; b = b + 1
Process P2: d = b; c = a
The interleavings allowed by this model can produce c = d = 0, c = d = 1, or c = 1, d = 0.
Term Definition
non-cache coherent multiprocessors - hardware that does not support cache coherence.
System software must implement the policies and mechanisms.
cache coherent multiprocessors - hardware that supports cache coherence. System
software is not required to handle the mechanisms.
Scalability
Share as little memory as possible across threads - shared memory machines work well
the less memory we actually share. Use of shared data structures needs to be kept to a
minimum. Below is a high-level representation of the concepts listed above.
Quizzes
Question: assume a = b = 0 initially.
Process P1: a = a + 1; b = b + 1
Process P2: d = b; c = a
c = 0, d = 1
c = d = 0
c = d = 1
c = 1, d = 0
synchronization
Synchronization primitives for shared memory programming
What is a lock? A lock is something that allows a thread to make sure that, when
accessing a particular piece of shared data, it is not being modified by another thread.
Locks come in different flavors.
mutual exclusion (mutex) lock - allows only one thread at a time to access the shared data
Instruction Description
Test-and-set() - returns the current value in mem_location and sets mem_location to 1
Fetch-and-inc() - returns the current value in mem_location and increments mem_location
This version will read the value of the lock from the cache. The cache consistency
mechanism implemented by the hardware will update the value of the lock in the cache of
each processor. This allows us to avoid contention across the communication bus.
while (L == locked);
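A minimal sketch of this spin-on-read (test-and-test-and-set) idea using C11 atomics; the
lecture discusses it at the architecture level, so treat this as illustrative only.

#include <stdatomic.h>

typedef atomic_int spinlock_t;            /* 0 = unlocked, 1 = locked */

void spin_lock(spinlock_t *L)
{
    for (;;) {
        /* Spin on a cached read; no bus traffic while the lock is held. */
        while (atomic_load(L) == 1)
            ;
        /* Lock looks free: try to grab it with one atomic test-and-set. */
        if (atomic_exchange(L, 1) == 0)
            return;
    }
}

void spin_unlock(spinlock_t *L)
{
    atomic_store(L, 0);
}

This spin-on-read variant reduces bus traffic while the lock is held, but it still does not
guarantee fairness.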
The ticket lock preserves fairness, but once the lock is released, contention is created on
the network as every waiter reads the updated ticket value. These spin algorithms all have
their issues: no fairness exists with the first two, and the ticket lock is fair but still noisy on
the communication bus.
Only one thread will be able to acquire the lock at the end of the day, so why are they all
contending? The thread that will receive the lock next should already be aware it is next in
line, and the rest of the threads will wait their turn.
The array-based queue lock is initialized to the number of processors currently in the
computing system. The queuelast variable is initialized at array location 0 . To join the
queue, a processor fetches the value of queuelast and saves it locally. Then the processor
increments the queuelast variable. This is an atomic operation; no race conditions will
occur when attempting to join the queue. If the architecture does not support
fetch_and_inc , we can use test_and_inc instructions.
The processor will continue to check its location in the array until it reads has_lock .
Releasing the lock involves the processor setting its own location to must_wait and then
setting the next location in the array to has_lock .
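A sketch of this array-based queueing lock using C11 atomics is below; N is assumed to be
the number of processors, and in practice each slot would also be padded to its own cache
line (omitted here for brevity).

#include <stdatomic.h>

#define N 8                                   /* number of processors */
enum { MUST_WAIT = 0, HAS_LOCK = 1 };

typedef struct {
    atomic_int  flags[N];                     /* initialize flags[0] to HAS_LOCK,
                                                 the rest to MUST_WAIT           */
    atomic_uint queuelast;                    /* next slot to hand out            */
} array_lock_t;

/* Returns the slot this caller spins on; pass it back to array_unlock. */
unsigned array_lock(array_lock_t *L)
{
    unsigned myplace = atomic_fetch_add(&L->queuelast, 1) % N;  /* fetch_and_inc */
    while (atomic_load(&L->flags[myplace]) == MUST_WAIT)
        ;                                     /* spin only on our own slot */
    return myplace;
}

void array_unlock(array_lock_t *L, unsigned myplace)
{
    atomic_store(&L->flags[myplace], MUST_WAIT);          /* recycle my slot  */
    atomic_store(&L->flags[(myplace + 1) % N], HAS_LOCK); /* pass the lock on */
}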
The benefit of this lock algorithm is that it is fair and not noisy (it avoids contention on
the communication bus). The downside is the size of the data structure needed to host
processor status (space complexity O(N)).
The link based queuing lock is initialized with a dummy node (the head), and is
dynamically sized with the number of processors requesting the lock. Every new requestor
of the lock creates a structure called a q_node .
An atomic instruction is used to join the queue for the lock, and logic is implemented when a
processor adds itself to the lock. The processor atomically determines which processor is
currently last in line for the lock, and has that specific processor's next field point to itself.
Then the processor will spin on its own got-it flag until it is true .
The unlock function for a processor involves the processor removing itself from the list
and signaling its successor. If there is no next , the processor will attempt to set the lock's
tail back to NULL . If a new request is forming at the same time, a race condition can occur,
so the processor conducts a comp_and_swap to check whether another processor is
actively attempting to acquire the lock before clearing the tail.
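A sketch of this linked-list (MCS-style) queueing lock with C11 atomics follows; the got-it
flag from the description is represented here by the locked field.

#include <stdatomic.h>
#include <stddef.h>

typedef struct qnode {
    _Atomic(struct qnode *) next;
    atomic_int              locked;     /* 1 = still waiting (got-it not yet set) */
} qnode_t;

typedef struct {
    _Atomic(qnode_t *) tail;            /* last requestor in line, or NULL */
} mcs_lock_t;

void mcs_lock(mcs_lock_t *L, qnode_t *me)
{
    atomic_store(&me->next, NULL);
    atomic_store(&me->locked, 1);
    qnode_t *pred = atomic_exchange(&L->tail, me);   /* atomically join the queue */
    if (pred != NULL) {
        atomic_store(&pred->next, me);               /* predecessor points to me  */
        while (atomic_load(&me->locked))
            ;                                        /* spin on my own node       */
    }
}

void mcs_unlock(mcs_lock_t *L, qnode_t *me)
{
    qnode_t *succ = atomic_load(&me->next);
    if (succ == NULL) {
        qnode_t *expected = me;
        /* No visible successor: if the tail is still me, release the lock. */
        if (atomic_compare_exchange_strong(&L->tail, &expected, NULL))
            return;
        /* A new requestor is mid-join; wait for it to link itself in. */
        while ((succ = atomic_load(&me->next)) == NULL)
            ;
    }
    atomic_store(&succ->locked, 0);                  /* hand the lock to successor */
}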
Quizzes
Assuming only read / write atomic instructions, is it possible to achieve the programmer's
intent?
P1 P2
Use struct_t a
Yes
No
Processors executing different threads will reach a barrier and will attempt to wait for all
other processors to reach the barrier. This portion of the lesson will discuss algorithms
used for barrier synchronization, their benefits and pitfalls.
Centralized barrier
Also known as a counting barrier, the centralized barrier uses a count that is equal to the
number N of processors. As processors reach the barrier, the count will be atomically
decremented. Processors will busy-wait on the count until it reaches 0 .
The last processor to arrive will decrement the count to 0 and then reset the count to N.
All the other processors, spinning on the count, will see that it reached 0 and will be
released to continue until they reach the next barrier.
The above algorithm contains an issue, however, in that processors being released can race
to the next barrier and fall through, detecting that the count is still 0 . Below is a fixed high-
level representation of the algorithm in which the processors spin until the count is N again,
then they are released.
In the sense reversing barrier, we remove the requirement to sense the count becoming 0 .
A sense variable is initialized as a boolean true , and all processors spin on this sense
variable until the sense variable becomes false .
As processors reach the barrier, they decrement the count and spin on the sense variable.
The last processor reaches the count, resets the count to N, and reverses the sense variable.
This removes one spin operation in comparison to the original centralized barrier, however,
because of the single shared sense variable, there is a large amount of contention across
the communication bus.
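A sketch of this sense-reversing counting barrier with C11 atomics; each thread carries its
own local_sense, and the shared sense starts out equal to it (true here, matching the
description above).

#include <stdatomic.h>
#include <stdbool.h>

#define N 8                                /* number of processors / threads */

static atomic_int  count = N;
static atomic_bool sense = true;           /* the single shared sense flag */

/* Each thread passes in its own local_sense, initialized to true. */
void barrier(bool *local_sense)
{
    *local_sense = !*local_sense;          /* the value this episode will end with */
    if (atomic_fetch_sub(&count, 1) == 1) {
        /* Last arriver: reset the count and reverse the shared sense. */
        atomic_store(&count, N);
        atomic_store(&sense, *local_sense);
    } else {
        while (atomic_load(&sense) != *local_sense)
            ;                              /* spin until the sense is reversed */
    }
}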
K processors share a memory location that contains a count variable and a locksense
variable. The last processor to arrive at the previous memory location shares another
memory location with L processors at the next level of the tree. The same series of actions
takes place at this memory location, waiting for a partner to arrive at the barrier.
This continues to happen until the two final processors meet at the tree's root, signifying
that all processors have arrived at the barrier. This algorithm essentially uses the sense
reversal barrier algorithm, but it attempts to reduce contention by breaking up the shared
memory locations into multiple barriers.
The lock sense flag that a particular processor spins on is not statically defined. It's
dynamic based upon the pattern that arises when processors reach a barrier.
Contention is based upon the arity (ary-ness) of the tree - the number of processors
allowed to spin on a given locksense flag.
Contention is also dependent upon whether or not the system is cache-coherent. If
cache-coherence is not forced, processors could be spinning on remote memory,
causing contention.
Below is a high-level representation of the tree barrier algorithm.
In this tree, parent processors are identified. Each processor has two data structures:
have_children and child_not_ready . The parents account for the ready status of their
children, at most four (4), and will wait for the children to be ready before crossing the
barrier. Each child will utilize the parent's static data structure within shared memory and
notify the parent it has arrived at the barrier.
A benefit of this algorithm is that each processor is assigned a unique spot in each parent's
data structure, reducing contention. In a cache-coherent system, all the arrival data can be
stored in one word, locally in the parent's cache further reducing contention.
In this tree, every processor is assigned a unique, static spot again. Each processor
contains a data structure called the child_pointer - parents utilize this data structure to
wake up its children.
When the parent at the root of the tree realizes that all processors have arrived at the barrier,
the root parent will signal the two children within its child_pointer data structure. This
happens recursively until all children of each parent are woken up.
Again, the memory locations are set in specific locations, no dynamic memory locations are
set. Through this static assignment, we reduce contention as much as possible.
This is essentially a barrier where matches are implemented between two processors.
Based upon N players, there will be log2N rounds. The matches are fixed, and the actual
lineup is fixed as well. p0 will always battle p1 and win, p2 will always battle p3 and
win, and so on. This allows the algorithm to statically define the location where the loser
will notify the winner that it has arrived, helping to reduce contention on the memory
locations that are being spun on.
After the final match, the final winner will tell the final loser that it's time to wake up,
everyone has reached the barrier. This winner to loser notification will happen recursively
until all processors are woken up.
The loser will also spin on a local, statically defined variable. This is convenient even for
non-cache-coherent systems, but overall it reduces contention because each process is
spinning on local variables.
Dissemination barrier
This barrier works by the diffusion of information among participating processors. What's
nice about this barrier is that the number of processors (N) does not need to be a power of
2. The idea is we have multiple rounds of information dissemination, and at the end each
processor is going to gossip to or receive gossip from every other processor.
Each processor will know a round is complete when it has both sent a message and
received a message. On the order of O(N) communication events occur every round.
In this algorithm, there is no hierarchy as communication is conducted via dissemination.
On shared memory machines, messages can be disseminated to each processor's statically
determined spin location. As always, sense reversal needs to be conducted at the end of
this barrier.
No hierarchy
Works for non-cache-coherent NUMA machines as well as clusters.
Every processor independently makes a decision to send a message for the round.
Every processor independently makes a decision to move on to the next round.
The communication complexity is O(N log2 N).
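A small sketch of the round structure, assuming N participants and a hypothetical
notify()/wait_for() messaging layer (on a shared memory machine these would just write to
and spin on statically assigned flags):

#include <math.h>

extern void notify(int to, int round);        /* "I have reached round k"       */
extern void wait_for(int from, int round);    /* block/spin until peer notifies */

void dissemination_barrier(int me, int n)
{
    int rounds = (int)ceil(log2((double)n));  /* ceil(log2 n) rounds, any n works */
    for (int k = 0; k < rounds; k++) {
        int to   = (me + (1 << k)) % n;            /* gossip forward by 2^k    */
        int from = ((me - (1 << k)) % n + n) % n;  /* and hear from 2^k behind */
        notify(to, k);
        wait_for(from, k);                    /* round k complete for this node */
    }
}

After ceil(log2 n) such rounds, every participant has heard, directly or transitively, from
every other participant.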
Performance evaluation
The only spin algorithms worth using for synchronization are ticket_lock ,
array_queue_lock , and list_queue_lock . The only barrier synchronization algorithms
Definitions
Term Definition
Quizzes
Yes. Before the last processor sets the count to N, other processors may race to the next
barrier and fall through.
n log2n
log2n
ceil(log2n)
n
The ceiling of log2(n) is the number of rounds. We use the ceiling because the number of
processors does not need to be a power of 2.
lightweight rpc
RPC and client-server systems
This portion of the lesson discusses efficient communication across address spaces within
a distributed system.
Remote procedure call (RPC) is the mechanism used to build these client-server systems
in a distributed system. Should we use RPC on the same machine? The main concern here
is the relationship between performance and safety.
We want to ensure servers and clients reside within different protection domains. This also
means we incur overhead because RPC will have to cross address spaces. We would like to
be able to make RPC across protection domains as efficient as normal procedure calls.
Why is that a good idea? This allows us to create a protected procedure call, and will
encourage system designers to separate services into different protection domains.
Procedure calls are resolved at compile time, essentially calling a function within a process.
For remote procedure call, it's the same, however, the call traps into the kernel. The kernel
validates the call, copies the arguments into kernel buffers, and copies the arguments into
the address space of the callee. The callee is then scheduled to execute the procedure.
When the callee has completed the execution of the procedure, the results are provided
back to the kernel, copied into kernel buffers, and then copied into the caller's address space.
All of these actions are resolved at runtime.
These context switches, traps, and copy instructions incur overhead thus hurting
performance. Throughout this exchange, the kernel is validating the procedure call /
exchange between the client and the server.
Copying overhead
For a regular procedure call, all data movement occurs within user address space within the
process. Copying between the kernel and user space occurs on every RPC. We would like to
avoid having the kernel involved in moving data between the client and server for
an RPC.
Let's break down an RPC. When a client prepares to conduct a RPC, it marshals the data
into an RPC message using a client stub. This is the only way the client can communicate
this information to the kernel. Next, the client traps into the kernel and the kernel copies the
marshaled data from user address space into a kernel buffer. The kernel then schedules the
server to execute on the CPU and copies the marshaled data from the kernel buffer into
server address space. The server then unmarshals the data from the buffer and extracts the
arguments using the server stub.
All of this is conducted in reverse for the server returning the data back to the client.
How do we remove the overheads present in RPC? We optimize the common case, as is
tradition. The common case is the actual calls being made by the client and the server. We
want to leverage the locality of the information used for each RPC that resides within the
cache.
First, we begin by binding the client to the server, which occurs once. How does this work?
The server advertises an entrypoint procedure, publishing it on some name server managed
by the kernel. Anyone on the system can use the name server to discover the entrypoint of
the server.
The server waits for bind requests to arrive from the kernel. Clients interested in the
procedure being advertised issue the RPC, the call traps into the kernel on the first
execution, and the kernel checks with the server to see if the client is a legitimate caller. The
server can determine whether to grant or deny permission.
Once the validation has been completed, the kernel creates a procedure descriptor. The
procedure descriptor is a data structure that resides within the kernel, representing the
procedure entrypoint. The server describes to the kernel the characteristics of the procedure
entrypoint, including the address of the procedure to be called in server address space, the
expected argument stack, and how many simultaneous calls will be accepted for a
procedure.
Below is a high-level representation of the concepts discussed above that are implemented
in pursuit of making RPC cheaper.
Once all this information is stored in the procedure descriptor, the kernel establishes a
buffer called the A-stack (argument stack). This argument stack is actually shared memory
mapped between the client and server address spaces. Now the client and server can read /
write from the A-stack and the kernel is no longer involved in the data transfer for an RPC.
The kernel authenticates the client, allowing the client to make future calls to the remote
procedure. The kernel provides the client a binding object, similar to a capability, that the
client must provide the kernel every time the client wants to conduct a RPC to the server.
All kernel copying overheads are now eliminated after the first call is conducted. The client
still utilizes the client stub to marshal data for the argument stack. All arguments must be
passed by value, not reference, because the data needs to traverse between the two address
spaces.
Client traps into the kernel, the client stub presents the binding object to the kernel, the
kernel exposes the procedure descriptor thus exposing the entrypoint procedure address for
the RPC.
As per the semantics of an RPC call, the client is blocked and the server is scheduled to run
at its entrypoint for the RPC. The kernel borrows and doctors the client thread to execute
within the address space of the server. The program counter of the client is set to the
address of the entrypoint procedure within the server address space. The server is provided
an execution stack by the kernel to conduct its work in.
Once the client is doctored, control is transferred to the server. The server stub will copy the
arguments from the A-stack into the execution stack. Once the procedure completes
successful execution, the results will be copied into the A-stack. The server will then trap
back into the kernel. The kernel has no need to validate the return trap.
Control is provided back to the client, the client stub copies the results from the A-stack, and
the client continues execution with the results provided.
Summary:
When implementing this on a shared memory multiprocessor, we can preload the server
domains on a particular processor, only allowing the server to run on the processor. The
caches will remain warm for a processor bound to a server domain, leveraging locality. The
kernel can also inspect the popularity of a particular service and determine if it wants to
dedicate other CPUs to be bound to the particular service and host duplicate server
domains.
In an RPC, how many times does the kernel copy data from user address spaces into the
kernel and vice versa?
once
twice
thrice
four times
Copy instructions from user address space into kernel space and vice versa is conducted
four times. That's a lot of overhead.
scheduling
Scheduling first principles
We are reminded that an effective way to increase performance, particularly for shared
memory systems, is to keep the caches warm and leverage locality.
Typically, the normal execution of a thread is to compute for a while and then make a
blocking I/O system call, attempt to synchronize with other threads, or, if it is a compute-
bound thread, have its time quantum expire. Fundamentally, this is the point at which the
scheduler must make a decision to pick another thread to execute on the processor.
The concept covered here is that, if a thread is descheduled, it makes a lot of sense to
reschedule the thread back onto the same processor in order to leverage the cache of that
processor. This policy attempts to exploit a thread's affinity for the cache of the processor
it last ran on.
A problem is raised when other threads execute on the processor and pollute the cache,
harming the locality that the original thread attempts to leverage.
Policy Description
first come, first served - ignores affinity for fairness
fixed processor - a thread always executes on a fixed processor
last processor - a processor will execute the thread that most recently ran on it
minimum intervening - for every thread, we store its affinity information and execute the
thread on the processor that it has the highest affinity for
In the minimum intervening policy, we determine the affinity of a thread for a processor by
counting the number of threads that intervened on that processor after the thread last
executed on it. So, if a thread was descheduled from a processor and then two other threads
executed on that processor, the intervention variable is now two (2).
The smaller the number, the higher the affinity for that particular processor. There is an
affinity index associated for every processor in the computing system. The scheduler will
pick a processor for the thread in which the intervention variable is the minimum.
If there is a large number of processors, we enact a limited minimum intervening policy.
This will only store the metadata for the top few processors, avoiding holding an unwieldy
amount of information for a particular thread.
Below is a high-level representation of the minimum intervening policy and the calculation
of the intervention variable for a thread's affinity index representing a particular processor.
This is still the same algorithm as the minimum intervening policy, however, when a
scheduling decision is made, another thread could be running on the chosen processor.
In this policy, the scheduler maintains a queue of threads that will be executing on a
particular processor. So, the intervention variable won't just be calculated from threads that
have executed on the processor in the past, but also future threads that will be executing on
the selected processor. This is done by viewing the queue being maintained for that
particular processor.
Policy Summary
minimum intervening plus queue - focuses on cache pollution at the time the thread
actually gets to run
In the above table, the amount of information the scheduler has to keep increases as you
traverse down the table. More information also allows the scheduler to make a better
decision.
Implementation issues
Let's discuss how an operating system would implement the previously discussed
scheduling policies.
One possibility is that the operating system maintains one global queue of the runnable
threads. This makes sense for the implementation of the first come, first served scheduling
policy.
The global queue becomes infeasible when the number of processors becomes very large.
The data structure becomes unmanageable, and contention increases as each processor
attempts to access the data structure.
Another option is to keep local queues for each processor based on affinity. The
organization of the local queue for each processor will depend on the scheduling policy
being implemented.
There exists a specific equation to determine a thread's position in the queue or its priority.
Variable Definition
BP_i - base priority of the thread when the thread was created
age_i - how long the thread has existed in the computing system
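As best I can reconstruct the lecture's formulation (the affinity term is an assumption here,
since only two variables survive in the table above), the priority of a thread works out to
something like priority_i = BP_i + age_i + affinity_i, so older threads and threads with higher
affinity for a processor move toward the head of that processor's local queue.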
Performance
Each scheduling policy will be rated on its performance based upon the metrics below:
Figure of merit Definition
Throughput - how many threads get executed or completed per unit of time?
Response time - user centric; if a thread is started, how long does it take to complete?
Variance - user centric; does the time it takes to run a thread vary depending upon when
the thread is run?
First come, first served scheduling is fair, however, it doesn't account for affinity or the size
of a job. This results in a high variance.
Regarding the memory footprint, the larger the memory footprint of a thread, the longer it
takes for the working set of a thread to be placed in the cache. This signifies to us that
cache affinity is important for performance. The minimum intervening policies are great
policies when the multiprocessor is handling a light to medium load.
If a very heavy load exists, it is likely that by the time a thread gets a chance to run on a
processor, all of the cache has been polluted. For a heavy load, the fixed processor
scheduling policy is best.
Essentially, we as operating system designers need to pay attention to the load on the
multiprocessor as well as the kind of workload we intend to cater to. These will be the
deciding factors for implementing a scheduling policy. Most operating systems
dynamically change their scheduling policy based on these heuristics.
There also exists the idea of procrastination. The processor is ready to do some work,
however, it inserts an idle loop. Why would this occur? A processor would read the runqueue
and realize there exists no thread that has run on it before. If it schedules any one of those
threads, none of those threads will have their working set loaded within the cache of the
processor.
As operating system designers, schedulers should leverage the structure of the multicore
multithreaded processors and attempt to schedule threads that have cache affinity for a
specific core. Schedulers want to try and ensure that the working set of a thread is
contained in either the L1 or L2 cache.
When conducting cache-aware scheduling, the operating system should schedule a mix
of cache-hungry and cache-frugal threads - the overall memory footprint of the
threads being scheduled should be no larger than the size of the L2 cache. The operating
system scheduler should be cache aware and attempt to leverage as much of the L2 cache
as possible, but also avoid scheduling threads with large memory footprints that will cause
other threads to have to conduct memory accesses.
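A toy sketch of that selection rule is below - greedily admit threads until the combined
footprint would exceed the L2 size. The L2 size and the footprint estimates are assumptions;
the estimates would come from the profiling heuristics discussed next.

#include <stddef.h>
#include <stdbool.h>

#define L2_SIZE (4u * 1024 * 1024)         /* assume a 4 MB L2 for this sketch */

struct thread_info {
    int    tid;
    size_t footprint;                      /* estimated working set size */
    bool   scheduled;
};

/* Admit a mix of threads whose combined estimated footprints fit in the L2.
 * Returns how many threads were admitted into this scheduling batch. */
int pick_batch(struct thread_info *threads, int n)
{
    size_t used = 0;
    int admitted = 0;
    for (int i = 0; i < n; i++) {          /* e.g. iterate cache-frugal threads first  */
        if (used + threads[i].footprint > L2_SIZE)
            continue;                      /* too hungry for the space that is left    */
        threads[i].scheduled = true;
        used += threads[i].footprint;
        admitted++;
    }
    return admitted;
}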
How does a scheduler know which threads are cache frugal or cache hungry? We have to
collect these heuristics over time as the threads execute. The scheduler won't initially know
the memory footprint of a thread; however, the scheduler will maintain heuristics for a thread
to make better scheduling decisions in the future. The overhead for this information
gathering has to be kept to a minimum in terms of time, of course.
Quizzes
How should the scheduler choose the next thread to run on the CPU?
Pu's queue
Pv's queue
We enqueue Ty onto Pu's queue because, by the time Ty executes on Pv, the number of
intervening tasks will be 5. The number of intervening tasks on Pu's queue by the time Ty
executes will be 3.
shared memory multiprocessor operating system
Operating system for parallel machines
Modern parallel machines face a lot of challenges when attempting to implement the
algorithms and techniques discussed previously in a scalable manner. What are some of
these challenges?
First, there is the size bloat of the operating system due to the additional features that need
to be added. This results in system software bottlenecks, especially for
global data structures.
In addition, there is the memory latency incurred when processors need to conduct a
memory access instead of hitting the cache - roughly a 100:1 ratio in the time required for a
memory access compared to a cache access. Now let's take into account the fact that we're
attempting to design a distributed system using some interconnection network with
Non-Uniform Memory Access (NUMA). Memory accesses by a processor can be done
locally, but they can also be done across the interconnection network, further increasing the
latency of memory accesses.
Another challenge we encounter is the deep memory hierarchy that exists for each node.
Modern processor cores contain an L1, L2, and L3 cache.
Finally, there's the issue of false sharing. Even though two threads running on two different
cores of a multiprocessor are programmatically unrelated, the cache hierarchy may place
the individual memory locations used by these two different cores on the same cache
block. The cores are executing completely unrelated operations and do not actually share
data, yet because their data lands on the same cache block, the hardware treats the block
as shared - this is false sharing of memory locations within the cache.
Below is a high-level illustration of a parallel machine, its nodes, and the challenges faced
when designing operating systems for parallel machines.
Principles
Let's discuss some general principles that should be followed by operating system
designers when designing operating systems for parallel machines.
Principle Description
cache conscious decisions - leverage locality and exploit cache affinity for scheduling
decisions
reduce shared data structures - limit shared system data structures to reduce contention
for memory locations
keep memory accesses local - reduce the distance between the processor and the memory
it is attempting to access
Servicing a page fault - locating the missing page, conducting an I/O operation to read it
in, allocating a page frame for the newly swapped-in memory, and updating the page
table - takes a lot of time and cannot be done in parallel. We want to avoid this serialized
set of actions as much as possible to avoid bottlenecks on the parallel system.
Let's take into account an easy example scenario. Two threads exist with independent
workloads. Both threads experience a page fault. Because there are no shared memory data
structures between the two threads, no serialization will be experienced.
Now let's take into account a hard example scenario. We have multiple threads executing
within a process, all sharing the same address space. The parallel operating system takes
advantage of the concurrency being implemented for this process and executes threads
from the process on two different nodes within the system. In this scenario, the address
space and page table is shared across each thread and the TLBs are shared for each thread
on a particular node.
What we would want to do for the scenario above is to limit the amount of sharing of
operating system data structures for the threads running on different nodes. The operating
system data structures that the segregated threads interact with should be distinct and
separate. This will help to ensure scalability for the parallel operating system.
The recipe for designing a scalable, parallel operating system will be discussed in this
section. First, for every subsystem we design in a parallel operating system, we should
consider these concepts:
Determine, functionally, what needs to be done for the service / subsystem we are
implementing. In the systems being discussed, the hardware is already parallelized,
completing most of the hard work for us.
To ensure the concurrent execution of the service, however, we should minimize the
shared data structures. Less sharing equals more scalability.
Depending on the usage of a data structure, we want to replicate and partition data
structures where possible. This provides more concurrency and less locking of shared
data structures.
Tornado achieves the scalability discussed above using a concept called the clustered
object. The key concept is that Tornado provides the illusion of a single object to each piece
of the operating system, however, when a piece of the operating system references said
object, it is actually using one of the many representations of the object. The representation
a reference resolves to can differ from node to node.
What's the degree of clustering? How is that defined? What is the granularity? The designer
of a system service for Tornado has the ability to make this decision. The designer can
choose a:
singleton representation
one per core representation
one per CPU representation
one per group of CPUs
Traditional structure
This is just a quick review of the traditional memory hierarchy used for memory
management. It covers file system descriptors hosted in the page cache (in RAM), the storage
subsystem, and the representations of processes' virtual memory, including their process
control block, software translation lookaside buffer, page table, and virtual pages on disk.
Now that we are abstracting away parts of the operating system into objects, we'll start with
the process object. The process object is similar to the process control block and
represents the address space of a process used by multiple threads of execution on the
same CPU.
In order to practice the concepts discussed above, the address space (process object) will be
broken into different regions. Threads don't execute over the entire address space of a
process, so we don't need to maintain the entire address space for each thread on a
different node.
With these regions being defined, another object is created called the File Cache Manager
that backs the memory locations of each region.
Another object, the page frame manager (DRAM object), acts as a page frame service.
When the page fault service needs a page frame, it contacts the page frame manager to
acquire the contents of the File Cache Manager from the storage subsystem for a particular
region.
Finally, there is the cached object representation (COR) that is responsible for knowing the
location of the object the page frame manager is looking for on the backing storage
subsystem. The COR conducts the actual page I/O.
Each clustered object reference is accompanied by a translation table. The translation table
maps a reference to a representation in memory. There also exists the possibility that an
object reference incurs a miss. When this happens, the object reference is added to the miss
handling table, and the object miss handler kicks in.
The object miss handler determines if the object reference requires a new representation be
created in the translation table or if the object reference is referring to an already existing
representation. Once a decision is made, the object miss handler will install the object
reference to representation mapping in the translation table.
There is a possibility that the object miss handler may not be local. This is because the
miss handling table is a partitioned data structure; it does not exist in full on every node.
This demonstrates the necessity for a global miss handler. If the miss handling table does not have
an object miss handler for a specific object reference, we use the global miss handler. The
global miss handler exists on every node, knows the location of the miss handling table for
each object reference, and can obtain representations of object references across all nodes.
The global miss handler will then populate the reference to representation mapping in the
local translation table for a node.
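A rough sketch of that lookup path, with made-up structure and handler names:

#include <stddef.h>

struct rep;                                        /* a per-node representation */

struct translation_entry { void *obj_ref; struct rep *rep; };

struct node {
    struct translation_entry table[256];           /* this node's translation table */
    int                      entries;
};

/* Assumed handlers: the local one only knows the references it is responsible
 * for (the miss handling table is partitioned); the global one exists on every
 * node and can always locate or create a representation. */
extern struct rep *object_miss_handler(void *obj_ref);
extern struct rep *global_miss_handler(void *obj_ref);

struct rep *resolve(struct node *n, void *obj_ref)
{
    for (int i = 0; i < n->entries; i++)           /* fast path: translation hit */
        if (n->table[i].obj_ref == obj_ref)
            return n->table[i].rep;

    /* Miss: try the local object miss handler, fall back to the global one. */
    struct rep *r = object_miss_handler(obj_ref);
    if (r == NULL)
        r = global_miss_handler(obj_ref);

    if (n->entries < 256) {                        /* install the new mapping */
        n->table[n->entries].obj_ref = obj_ref;
        n->table[n->entries].rep     = r;
        n->entries++;
    }
    return r;
}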
Hierarchical locking of objects kills concurrency. If we want integrity for objects, we of
course want to lock critical data structures; however, if the path taken by a thread for a page
fault is different from another thread's, there is little reason why the page fault servicing
can't be done in parallel - without locking the critical data structures.
So we still want integrity of the process object and there's a possibility we need to lock it so
it can't go away. How can a process object just go away? It's possible the operating system
has decided to migrate a process object from one processor to another processor.
To resolve this issue, we instantiate a reference count for the process object. The operating
system will utilize the reference count and existence guarantee instead of hierarchical
locking. So when a thread utilizes the process object, it increments a reference count for the
process object denoting that it is in use. Now other operating system entities will know not
to modify or migrate the object because the reference count is not 0 .
This non-hierarchical locking mechanism included with the existence guarantee promotes
concurrency.
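A minimal sketch of that existence guarantee: callers pin the object with a reference count
instead of taking a hierarchical lock, and migration (or destruction) is deferred until the
count drains to zero.

#include <stdatomic.h>
#include <stdbool.h>

struct proc_object {
    atomic_int refcount;                  /* threads currently using this object */
    /* ... process state ... */
};

void proc_get(struct proc_object *p)
{
    atomic_fetch_add(&p->refcount, 1);    /* object cannot go away while held */
}

void proc_put(struct proc_object *p)
{
    atomic_fetch_sub(&p->refcount, 1);
}

/* Migration or destruction is only safe once no one holds a reference. */
bool proc_can_migrate(struct proc_object *p)
{
    return atomic_load(&p->refcount) == 0;
}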
It's important to make sure that memory allocation is scalable, and this includes the heap
space of processes. One possibility is to take the heap space of the process and partition
it for the multiple threads executing within the process on separate nodes. The portions of
the heap that reside within physical memory are then associated with the nodes of the
threads utilizing the partitioned heap space.
For NUMA machines, this removes the bottleneck that exists for threads when they attempt
to acquire memory from the heap. Now, instead of allocating memory from some central
location, they can acquire the memory from the physical memory residing on their local
node. This mechanism also avoids false sharing of memory between the threads executing
on separate nodes.
IPC
The functionalities of the Tornado operating system are contained within the clustered
objects we described earlier. These clustered objects must communicate in order to
implement the services of the parallel operating system. We need some form of
interprocess communication between these objects, with the thinking that some objects
can act as a client, and other objects can act as a server.
So all of this interprocess communication is, again, realized by the protected procedure call
mechanism. If the protected procedure call is conducted between a client and a server
executing on the same processor, no context switch is required. If the protected procedure
call is conducted between a client and a server executing on different processors, a remote
protected procedure call, a full context switch is incurred.
Tornado uses this IPC mechanism to ensure clustered objects and their replicas remain
consistent across all nodes. This all has to happen in software: the replicas are software
copies, so even on a cache-coherent machine the hardware will not keep them consistent
across nodes.
Tornado summary
Tornado is an operating system designed for parallel systems. These are its features:
Object oriented design for scalability. Clustered objects and protected procedure call
attempts to preserve locality while also ensuring concurrency.
Multiple implementations of operating system objects. Allows for incremental
optimization and dynamic adaptation of object implementations.
Use of reference counting to avoid hierarchical locking of objects. Locks held by an
object are confined to the specific representation of the object.
Tornado also works to optimize the common case:
page fault handling
Limiting the sharing of operating system data structures by replicating critical data
structures to promote scalability and concurrency.
The Corey operating system attempts to prove this principle in its implementation. The
Corey operating system provides mechanisms for the applications to give hints to the
kernel. The Corey operating system is similar to Tornado in that it attempts to reduce the
amount of sharing of kernel data structures. Below is a comparison of features between the
Corey operating system and Tornado.
Corey vs Tornado
Corey: address ranges within an application, exposed to the application - threads hint to the
kernel where they are going to operate. Tornado: regions within an application, transparent
to the application.
Corey: shares - threads hint to the kernel which system data structures they intend to use;
threads can communicate their intent through the shares mechanism (whether access is
private or shared by multiple threads).
Cellular Disco was a study at Stanford in pursuit of finding an operating system that would
leverage virtualization in tandem with the multiprocessing nature of the underlying
hardware.
Cellular Disco was a thin virtualization layer that managed hardware resources such as the
CPU, I/O devices, memory management, etc. One thing that is always difficult in operating
system design is I/O - most of the device driver code that resides within an operating
system is third party code written specifically for the I/O device. Cellular Disco attempts to
tackle this issue by abstracting away I/O management for the virtualized systems by
construction. Below is a high-level representation of Cellular Disco on a multiprocessor
system.
Cellular Disco abstracts I/O operations for the guest operating system by utilizing the usual
trap and emulate mechanism of virtual machine monitors. When guests send I/O requests,
Cellular Disco forwards them to the host operating system for management, leveraging the
already existing I/O subsystem that will resolve interaction with the device drivers. When an
external I/O interrupt is received, Cellular Disco has identified itself as the interrupt handler,
mimics the external I/O interrupt to the host operating system, and then the host operating
system routes the external I/O interrupt back to Cellular Disco for handling - eventually the
external I/O reaches the guest operating system.
The purpose of Cellular Disco was to prove by construction how to develop an operating
system for new hardware without completely re-writing the operating system. They also
prove that a virtual machine monitor can manage the resources of a multiprocessor as well
as a native operating system. They manage to mitigate the amount of overhead required to
conduct the trap and emulate mechanism, maintaining efficiency for the applications
executing within the guest operating system.
The importance of the inequality between these two times is that this inequality dictates the
design of the algorithms that are going to be utilized to implement the distributed system.
An example
In distributed computing, there exist a set of beliefs or expectations that we have for the
order in which operations are conducted. We expect processes within an exchange to be
sequential, events to be totally ordered, and that sending occurs before receiving.
The happened before relation and notation is just an assertion that a particular action took
place before another particular action. This relation can be asserted for events that take
place within a process, as well as across a distributed system when involved with
communication between nodes.
The "happened before" relationship is also transitive. A particular action, c , takes place
after action b , and action b takes place after action a , then action c takes place after
action a .
Lastly, there are concurrent events, events in which there is no apparent relation. This
usually occurs when events take place on separate nodes and there is no communication
between the events. We cannot say anything about the ordering of these events because
they are not related in any way.
In the construction of a distributed system, it's important to understand and keep in mind
these two types of events. Not doing so could be the cause of synchronization and timing
bugs.
Identifying events
Below is the previous example of events taking place within a distributed system, with the
"happened before" and concurrent events identified.
Definitions
Term Definition
event-computation time - the time required for a node to conduct some significant
processing
communication time - the time required for a node to communicate with another node
within a distributed system
concurrent events - events in which there is no apparent relationship between the events
Quizzes
N1 N2
f -> a
b g
a -> b
b -> a
neither
You cannot say anything about the relation between a and b given the order of the
operations above.
lamport clocks
Lamport's logical clock
Nodes within a distributed system are only aware of two kinds of events: its own events
and its communication events.
Lamport's logical clock builds on this idea, attempting to associate a timestamp with every
event that occurs on the node, or every communication event that occurs between peer
nodes. A condition is that each timestamp of each event is unique and is increased
monotonically for sequential events - no two events will have the same timestamp.
In Lamport's logical clock, timestamps are also generated for communication events;
however, this must be done in coordination with the target node. The condition is that the
timestamp of the receive event on the target node must be greater than both the timestamp
of the send event on the source node and the maximum timestamp already seen on the
target node.
Thus, Lamport's logical clock gives us a partial order for when events happened across the
distributed system.
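A small sketch of those update rules (the message plumbing is assumed):

/* Lamport logical clock for one node. */
struct lamport_clock { unsigned long time; };

/* Local event or message send: advance the clock and stamp the event. */
unsigned long local_event(struct lamport_clock *c)
{
    return ++c->time;
}

/* Message receive: the receive timestamp must exceed both the sender's
 * timestamp and anything this node has already seen. */
unsigned long receive_event(struct lamport_clock *c, unsigned long msg_ts)
{
    c->time = (msg_ts > c->time ? msg_ts : c->time) + 1;
    return c->time;
}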
Will a partial order of events be enough to create deterministic algorithms for the
implementation of a distributed system? It works for many situations, however, this section
will discuss why a total order of events is needed for distributed system design.
In the example provided below, the professor describes the use of his car among his
family, and how negotiation is conducted by timestamping messages requesting the use of
the car. If a tie occurs, all nodes agree upon a tie breaker - in this case, age.
In order to assert that some event a occurs before some other event b in the total order,
one of these conditions must be met: the timestamp of a is less than the timestamp of b ,
or the timestamps are equal and the agreed-upon tie breaker (such as the process ID)
favors a .
So how do we create a distributed mutex using Lamport's clock? We don't have shared
memory to implement the lock, so we have to conduct this across an entire network using
messages. So here's how the algorithm works using Lamport's Logical Clock:
A process will check to see if its own request is at the top of its local queue.
The process has also received acknowledgements from all the other nodes in the
system, or has received lock requests from every other process with later timestamps
than its own.
When both of these conditions hold, the process can conclude that it holds the mutex.
So what?
Lamport's Logical Clock, providing us the ability to derive a total order from partial
orders, allows processes to make decisions on local information (the local queue)
about whether or not they hold a mutex within a distributed system.
Construction
The local queues are totally ordered and compliant with Lamport's Logical Clock
algorithm. Process IDs are used to break timestamp ties between requests.
Assumptions
Messages between any two processes arrive in order.
There is no loss of messages.
Message complexity
Essentially, when a lock request and subsequent unlock message is sent, the message
complexity is:
3(N − 1)
There is the initial lock request from a process (broadcasted), the lock request
acknowledgement from the peers, and then an unlock message from the initial requesting
process (broadcasted).
Of note, there is no unlock acknowledgement being sent by the peers. This is because there
is the assumption that there is no message loss within this distributed lock algorithm.
The message complexity can be reduced to 2(N − 1) with optimizations, for example by
deferring or combining acknowledgements.
Below is a high-level representation of a real-world scenario in which clock drift can cause
problems.
Lamport's Physical Clock is the same as the logical one, except with some minor
differences. Again, events have to occur chronologically (event a being before event b), but
now we conduct all calculations in real time. To do this, some conditions must be met:
Conditions that must be met for IPC time and clock drift to avoid anomalies:
1. The difference between the time the message was received and the time the message
was sent must be greater than 0.
2. The amount of clock drift should be negligible compared to the inter-process
communication time.
Using these two conditions, we can derive that inter-process communication time must be
greater than or equal to mutual clock drift divided by 1 minus individual clock drift. In
summary, the mutual clock drift is very small in comparison to inter-process
communication time.
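Written out, with mu for the inter-process communication time, epsilon for the mutual clock
drift, and kappa for the individual clock drift rate (symbols chosen here just for readability),
the derived condition is mu >= epsilon / (1 - kappa).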
Below is a high-level representation of these conditions as well as their equations.
Definitions
Term Definition
clock drift - refers to several related phenomena where a clock does not run at exactly the
same rate as a reference clock.
Quizzes
a -> b
b -> a
a -> b
if a and b are events in the same process (and a occurs before b)
if a is a send event and b is the corresponding receive
(a, 1) -> (b, 2) -> (c, 3) -> (f, 4) -> (d, 5) -> (e, 5)
How many messages are exchanged among all the nodes for each lock acquisition
followed by a lock release?
N-1
2 (N - 1)
3 (N - 1)
latency limits
Introduction
Lamport's clock laid the foundation and theoretical conditions required to achieve
deterministic communications between distributed systems. In this section, we will discuss
how these communication mechanisms can be implemented in operating systems
efficiently enough to satisfy the conditions recently discussed.
RPC performance is very crucial when it comes to building client / server systems. RPC is
the main method of building distributed systems. There are two components involved in
creating the latency observed during message communication within a distributed system:
Hardware overhead
Dependent upon how the network interfaces with the computer. Typically, most
modern computers have a network controller that transfers messages from system
memory to its network buffer using direct memory access (DMA). The network
controller then places the message onto the physical link - this is where bandwidth
imposed by the physical link occurs.
In some hardware systems, the CPU may be involved in the memory transfer
between physical memory and the network buffer of the network controller. The
CPU will conduct I/O operations to move the data between the two devices.
Software overhead
This overhead is the overhead generated by the operating system and is additive to
the hardware overhead.
This includes the mechanisms the operating system uses to make sure the
message is available in physical memory for transmission via the network
controller.
I won't go too in-depth into the components of an RPC call. At this point the reader should
understand everything involved with this form of communication.
1. Client call
1. Client sets up arguments for the RPC call.
2. Client makes a call into the kernel.
3. Kernel validates the call.
4. Kernel marshals the arguments into a network packet.
5. Kernel sets up network controller to conduct network transmission.
2. Controller latency
1. Network controller conducts DMA to retrieve message and places data into its
network buffer.
2. Network controller transmits message onto the physical link.
3. Time on the wire
1. Depends on the distance between the two nodes, as well as the bandwidth of the
physical link.
4. Interrupt handling
1. Message arrives at destination node in the form of an interrupt to the operating
system.
2. Network controller of the destination node moves message from its network buffer
into memory using DMA.
5. Server setup to execute procedure call
1. Locate the correct server procedure.
2. Dispatch the server procedure onto the CPU.
3. Unmarshal the network packet.
4. Execute the procedure call.
6. Server execution and reply
1. Latency involved with server execution of the procedure is outside our control.
2. Reply is formed into a network packet and marshaled.
3. Item number 2 repeats.
4. Item number 3 repeats.
5. Item number 4 repeats.
7. Client setup to receive results and restart execution
1. Client is dispatched onto the CPU to receive results.
The sources of overhead that we can control and mitigate when designing the RPC process
are:
Marshaling
Data copying
Control transfer
Protocol processing
How do we reduce the kernel overhead involved in RPC? We need to take what the hardware
provides us and leverage its capabilities in order to reduce overhead.
The biggest source of overhead in marshaling is the data copying overhead. Up to three
copies of the message data could be involved when marshaling data for an RPC. Where are
these copy operations being generated?
The client stub makes the first copy of the message data from the client process stack
into an RPC message.
Copying of the RPC message from client address space into kernel address space
(kernel buffer).
DMA from kernel buffer to network controller buffer.
Copying overhead is the biggest source of overhead for RPC latency. So how do we reduce
the number of copy operations?
The DMA of the message data from the kernel buffer into the network controller buffer
is unavoidable.
What we can do is marshal the arguments of the client's RPC directly into the kernel
buffer - avoiding copying arguments from the stack into a user-space buffer and then
having to copy them, again, into kernel space. The procedure for the client stub that the
client will call into will reside within kernel space.
The problem with placing stack data from user-space directly into the kernel is that this
operation is dangerous. We cannot guarantee the data being placed into the kernel buffer
by the client is correctly sanitized.
The second option to remove the number of copy operations to conduct an RPC is to utilize
shared descriptors between the client stub and the kernel. The client stub will remain in
user-space and will use the shared descriptors to describe to the kernel the data structures
that will reside within the kernel buffer.
The second source of overhead in an RPC is the control transfer overhead. There are
potentially four control transfers conducted for an RPC.
Steps 2 and 4 are in the critical path of RPC latency. Having to dispatch the server or client
procedures to receive RPC messages causes latency.
We can reduce the number of context switches down to two by overlapping the context
switches with the network communication being conducted for the RPC.
Can we reduce the number of context switches to one? Absolutely. If we can determine that
the RPC and the server procedure are relatively fast and respond quickly, the client can spin
on the CPU instead of blocking, allowing the client to send the call and receive the reply
quickly. This only works if the server procedure is fast, however. If we spin for too long,
we'll waste CPU time and resources.
Another portion of RPC that creates latency is protocol processing. Here we must decide
how we want to deliver the message to the other node. So what transport method are we
going to use for a particular RPC?
If we're in a Local Area Network (LAN), we know our connection is pretty reliable. This
indicates that we should probably focus on reducing latency. One axiom is that the
concepts of reliability and latency will always be at odds with one another. So in
choosing our transport method, we must always look to make the most sensible
compromise. So how do we reduce latency in a LAN?
Don't conduct low level acknowledgements of messages. Chances are our
message reached its intended destination.
Utilize hardware checksums for packet integrity.
Since the client blocks, we don't need to buffer the message on the client side. If
the message gets corrupted or lost, we'll just resend the call.
We do need buffering on the server side, however, for the reply. That way, if the
client resends the request because the reply was lost, we can easily resend the
reply - this avoids re-executing the server procedure.
Quizzes
It takes 1 minute to walk from the library to the classroom. The hallway is wide enough
for 5 people.
1. What is the latency incurred by traveling from the library to the classroom?
1 minute.
2. What is the throughput achieved if 5 people walk from the library to the classroom?
5 people per minute.
At this point the reader should understand the OSI model and TCP/IP stack. I only took
notes by exception, commenting on the things presented in these slides that were novel.
What does it mean to make a router active? A router becomes active when, instead of
utilizing its routing table to conduct the routing of a packet, it actually executes code in
order to dynamically decide the next hop of the currently inspected packet.
This mechanism is implemented by passing the router packets that contain code for the
router to execute; that way, the router carries out a custom set of instructions for that packet.
This provides us the ability to create customized flows for network services, and every
network can have its own way of choosing the route from source to destination.
Essentially we utilize this extra code placed into a packet for the router to analyze so that
we can leverage more powerful abstractions and requests for the routing of a packet. We
can conduct broadcasts to multiple destinations and recipients without having to
specifically send a packet to each recipient.
The operating system provides quality of service (QoS) APIs to the application developer
for use in hinting to the operating system how specific network interactions should be
treated. This allows the application developer to specify network flows that might have real-
time constraints, etc.
The operating system synthesizes these QoS hints and generates executable code that it then
places in the packet. In other words, the protocol stack of the operating system has to
support these QoS improvements and expose an API to support customized network flows.
The synthesized code placed into the packet by the operating system allows routers in
between the source and destination nodes to make more informed routing decisions,
allowing developers to create software-defined network flows.
The primary challenge is that the network stack of an operating system has to be
augmented to support the above idea, and the routers expected to conduct this software-
defined routing also have to support this new concept. We cannot expect all nodes and all
routers to have these features available.
ANTS toolkit
The ANTS toolkit provides a method of avoiding having to augment the operating system
and its protocol stack in order to support software-defined routing and QoS code
synthesizing.
The ANTS toolkit provides an API to the programmer to hint QoS requirements, and the
ANTS toolkit will encapsulate the original packet payload with an ANTS header. The
operating system will place an IP header onto the now encapsulated payload and transmit
the packet onto the network.
The IP packet with an ANTS encapsulated payload will route normally across a network,
unless the router that receives the packet is an active node. An active node will be able to
process the ANTS header and make intelligent routing decisions based upon the contents
of the header.
This encapsulation mechanism removes the pain point of having to modify the operating
system to support software-defined routing. However, what about the routers? Not all of
them will be active.
To defeat this limitation, we ensure that active nodes reside at the edge of the network. This
way, we can leave the core IP network unchanged and all of the intelligent routing decisions
will be made at the edge.
Of note, the ANTS capsule within the IP packet does not contain the code that needs to be
executed to process the ANTS capsule. The ANTS capsule only contains a type field used
as a vehicle to identify the code that needs to be executed to process the capsule.
As for the API, the API allows us to define the routing of the packet for intelligent forwarding
through the network. Thus, we can virtualize the network flow regardless of the actual
physical topology.
The second part of the API allows us to leverage soft storage. Soft storage is storage that's
available in every router node for personalizing the network flow with respect to a particular
type of capsule. Soft storage is where we store the code for a particular type of capsule so
that we can conduct intelligent routing decisions and code execution.
The API provides the programmer the ability to put, get, and remove objects from soft
storage, and soft storage's data structures use key, value pairs. Storing code associated
with the particular type of a capsule is important for personalizing the network flow of said
capsules - nodes that hold the personalized code within their soft store can re-use the code
for the respective type of capsule.
The soft store can also be used to store things like computed hints about the state of the
network. These hints can be used for future capsule processing for capsules of the same
type.
The final portion of the API allows the programmer to acquire metadata about the state of
the network.
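A minimal sketch of the soft storage part of the API described above (illustrative only - ANTS is a Java toolkit, but these names and the time-to-live eviction policy are assumptions, not its actual API). Entries are soft state: the node is free to evict them at any time, so here they simply expire after a TTL.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative soft store: key/value pairs (e.g., capsule type -> code, or computed
// hints about network state) that a node may discard whenever it likes.
final class SoftStore {
    private record Entry(Object value, long expiresAtNanos) {}

    private final Map<String, Entry> entries = new ConcurrentHashMap<>();
    private final long ttlNanos;

    SoftStore(long ttlMillis) {
        this.ttlNanos = ttlMillis * 1_000_000L;
    }

    void put(String key, Object value) {
        entries.put(key, new Entry(value, System.nanoTime() + ttlNanos));
    }

    Object get(String key) {
        Entry e = entries.get(key);
        if (e == null || System.nanoTime() > e.expiresAtNanos()) {
            entries.remove(key);   // soft state: silently gone once it expires
            return null;
        }
        return e.value();
    }

    void remove(String key) {
        entries.remove(key);
    }
}
```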
The final takeaway from this section is that the ANTS API is very small, and for good
reason. We are attempting to execute code on routers residing on the public internet, so we
need to be considerate of the other network interactions occurring across these nodes and
maintain important characteristics about our interactions with them.
Capsule implementation
Each capsule contains a type field that is a cryptographic fingerprint of the original
capsule code.
When a node receives a capsule, two things are possible:
The node has seen capsules of this type before, retrieves the code for this capsule
from its soft store, and executes the code - forwarding the capsule towards its
desired destination.
This is the first time this node has seen a capsule of this type. The node references
the prev node and requests the code for this type of capsule from the prev node.
After receiving the code from the prev node, the node executes the code and
stores the code into its local soft store.
The first time a particular capsule enters the network, no node will have the correct code for
it. However, as network flows are completed, nodes will all have the capsule's code stored in
soft storage - allowing us to leverage locality in the processing of network flows for a
capsule type.
How does a node ensure it got the correct code from the prev node?
By using the cryptographic hash contained in the type field of a capsule, a node can
hash the received code from the prev node and compare the hashes.
What if I request the code from the prev node and the prev node does not have the code in
its soft storage? In that case the capsule is simply dropped, and recovery is left to
higher-level, end-to-end mechanisms.
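Putting those steps together, a node's capsule handling might look roughly like the sketch below. The class names, the fetchCodeFrom call, and the SHA-256 choice are assumptions for illustration; only the overall flow (look the type up in the soft store, otherwise fetch the code from the previous node and verify it against the fingerprint carried in the capsule) comes from the lecture.

```java
import java.security.MessageDigest;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

final class CapsuleProcessor {
    // Soft store: capsule type (fingerprint of the code) -> the code itself.
    private final Map<String, byte[]> softStore = new HashMap<>();

    void onCapsule(byte[] typeFingerprint, String prevNode) throws Exception {
        String type = toHex(typeFingerprint);
        byte[] code = softStore.get(type);
        if (code == null) {
            // First capsule of this type seen here: ask the previous node for the code.
            code = fetchCodeFrom(prevNode, type);              // hypothetical network call
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(code);
            if (!Arrays.equals(digest, typeFingerprint)) {
                return;                                        // wrong code: drop the capsule
            }
            softStore.put(type, code);                         // cache it for later capsules
        }
        execute(code);                                         // process / forward the capsule
    }

    private byte[] fetchCodeFrom(String node, String type) { return new byte[0]; } // placeholder
    private void execute(byte[] code) {}                                           // placeholder

    private static String toHex(byte[] b) {
        StringBuilder sb = new StringBuilder();
        for (byte x : b) sb.append(String.format("%02x", x));
        return sb.toString();
    }
}
```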
Active networks can be used for a variety of applications - in particular, applications that
are difficult to deploy in the internet, since we overlay our own desired network flows on top
of the physical topology of the network. There are a few things to keep in mind for
application development leveraging active networks.
Feasible?
Router vendors hate opening up the network for application developers to extend or make
changes. It is only feasible to implement virtual network flows using active routers at the
edge of the network.
Software routing requires computation and is slower than hardware routing. This also
restricts software routing to only occur at the edge of the network.
And finally, there are social reasons why this is hard to adopt. The user community is not
comfortable having arbitrary code executing on public routing fabric where their traffic
exists. While there may be protection in place to prevent abuse, there can never be a
guarantee that malicious actors won't attempt to utilize software network flows for attacks.
Quizzes
We put theory and practice together using the design cycle. Here, we'll explain every step in
the design cycle as we work to create a system component.
Specify
Using IO Automator, we develop the system requirements and specify them using
the syntax of the language. (C-like syntax)
The composition operator in IO Automator allows for the expression of the
specification of an entire subsystem that we wish to build.
Code
Now we convert the specification expressed by IO Automator into code that can
actually be executed.
We use OCaml (object oriented categorical abstract machine language).
OCaml is a great candidate for component based design due to its object-oriented
nature and the fact that it's a functional programming language.
The code generated by OCaml is as efficient as C code, which is very important to
operating system developers.
Optimize
NuPrl is a theoretical framework for optimization of OCaml code. The input is
OCaml code and the output is optimized but functionally equivalent OCaml code.
NuPrl uses a theorem-proving framework in order to conduct optimization and
verifies through theorem-proving that the resulting code generated is equivalent to
the input code.
Next we create the Concrete Behavioral Specification, not implementation just yet, but
refinement of the abstract specification created earlier.
Finally, we arrive at implementation using OCaml, generating executable code that achieves
the behavior outlined in the original abstract behavioral specification. Keep in mind, this
generated executable code is unrefined and not the most optimal implementation - yet.
One word of caution, however, there is no guarantee that the implementation is actually
meeting the abstract behavioral specification. There's no easy way to show that the
implementation is the same as the abstract behavioral specification.
NuPrl can conduct static optimization of OCaml code; however, static optimization alone is
not enough because our components are built from multiple layers. We need to collapse the
common events that happen across those layers.
NuPrl can also conduct dynamic optimization, collapsing layers and generating bypasses if
the code's common case predicate is satisfied. Common case predicates are synthesized
from the decision statements within the micro-protocols - allowing us the ability to bypass
and optimize the code.
Finally, we convert all of this back to optimized OCaml code, and the theorem-proving
framework can prove that the unoptimized code and the optimized code are functionally
the same.
Research and industry are usually constrained by the marketplace they serve. Sun Microsystems
in its heyday was making Unix workstations, large complex server systems, etc. The
marketplace's demands influenced Sun's research - most of the applications running on
business machines were legacy, and it would have cost those businesses far more to move
to an entirely new operating system.
For Sun to innovate, they had to make changes where it made sense - without releasing an
entirely new product. This is where the Spring operating system came to fruition. They
attempted to preserve the external interfaces and APIs that application developers could
program against, while at the same time improving internal components and then
integrating them correctly.
Object oriented approaches to this problem are a good choice in ensuring that we can
innovate the internal components of a piece of software, while maintaining the external
interface exposed to the customer.
Monolithic kernels usually leverage a procedural design in their implementation. The code
is one monolithic entity with shared state and global variables, and maybe some private
state between procedure caller and callee. Subsystems make procedure calls to each other,
and this is how monolithic kernels are constructed.
In contrast, objects contain the state and the state is not visible outside of the object.
Methods are exposed to manipulate the state as well as retrieve information about the state
of the object. With object based design, you achieve the implementation of strong
interfaces and isolation of the state of an object from everything else.
So if we have strong interfaces and isolation, isn't this similar to what we discussed in the
structuring of operating systems? Is this similar to the implementation of protection
domains, and will we incur performance costs having to conduct border crossings?
There are ways to make the protection domain crossings performance conscious. The
Spring operating system, for example, applies object orientation in building the operating
system kernel; the key takeaway being if object oriented programming is good enough to
implement a high-performance kernel, it's good enough to implement higher levels of
software, as well.
Spring approach
The Spring approach to operating system design is to adopt strong interfaces for each
subsystem. The only thing exposed for a subsystem is what services are provided. The
internal components of a subsystem can be changed entirely, however, the external
interface will always remain unchanged.
The Spring approach also wanted to ensure that the system was open and flexible, allowing
for the integration of third party software into the operating system.
To be open and flexible, we must also accept that software can be implemented in different
languages. This is why the Spring approach adopted the interface definition language
(IDL). Third party software vendors can use the interface definitions in building their own
subsystems that integrate with the Spring operating system.
The Spring system utilizes the micro-kernel approach to extensibility. The Spring micro-
kernel includes:
the nucleus which provides abstractions for threads and inter-process communication
the virtual memory manager
All of the other system services are outside of the kernel. Spring is Sun's answer to building
a network operating system using the same interface (the UNIX interface).
door handle - essentially the same as a file descriptor, but for doors into domains
The nucleus is involved in every door call. When an invocation occurs, the nucleus inspects
the door handle, allocates a server thread in the target domain, and executes the invocation.
It's a protected procedure call - the client thread is deactivated in the client domain and a
thread is activated in the server domain's address space to carry out the invocation; on
return, that thread is deactivated and the client thread is reactivated in the client domain.
This is very similar to LRPC - conducting very fast cross-domain calls using this door
mechanism. These mechanisms leverage the good attributes of object oriented
programming while avoiding performance hits.
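A very rough sketch of the shape of a door call (these interfaces are illustrative; Spring's actual door interface is defined in IDL and the call is mediated by the nucleus, not by application code):

```java
// Illustrative only: a door is an entry point into a server domain, and a door
// handle is the client's capability to invoke it - much like a file descriptor.
interface Door {
    byte[] handleInvocation(byte[] request);   // runs on a server thread in the target domain
}

final class DoorHandle {
    private final Door target;                 // in Spring the nucleus, not the client, holds this mapping

    DoorHandle(Door target) { this.target = target; }

    // Conceptually: the nucleus inspects the handle, activates a thread in the target
    // domain, runs the invocation, and reactivates the caller with the result.
    byte[] call(byte[] request) {
        return target.handleInvocation(request);
    }
}
```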
Network proxies can potentially employ different network protocols when communicating
with domains hosted on different nodes. This allows the network interaction between a
client domain and a server domain to be completely transparent, the client and server only
have to worry about utilizing the correct doors.
A proxy on the server's node exports a net handle embedding a particular door for a server
domain. A proxy on the client's node that wishes to utilize the server domain uses the net
handle to connect its nucleus with the nucleus of the peer Spring system.
Client domains have the ability to transfer doors to other domains for use. The privileges of
the door do not have to be transitive, however. The client domain can decide if it wants to
provide the same privileges for other domains using the door, or it can provide lesser
privileges.
The virtual memory manager in Spring is in charge of managing the linear address space of
every process. The virtual memory manager breaks this linear address space into regions;
a region is a set of pages.
The virtual memory manager abstracts things further, creating what are called memory
objects. Regions are mapped to different memory objects. The abstraction of memory
objects allows a region of virtual memory to be associated with backing files, swap space,
etc. It is also possible that multiple memory objects can map to the same backing file, or
whatever entity is being referenced by a memory object.
The virtual memory manager needs to place the memory object into RAM for a process to
be able to access the memory it contains. The virtual memory manager leverages pager
objects that will make and establish connections between virtual memory and physical
memory. The pager object will create a cached object representation in RAM that will
represent a portion of the memory object in RAM for processes to address.
Below is a high-level representation displaying two processes with their linear address
space being managed by two different virtual memory managers. The virtual memory
managers utilize pager objects to represent a portion of memory objects in the linear
address space using cache objects.
So what happens when two different address spaces are referencing the same memory
object using two different paging objects? What about the coherence of the different
representations?
It is the responsibility of the two pager objects to coordinate cached object coherence if
needed.
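The relationships described above could be sketched as interfaces like the following (the names and methods are illustrative, not Spring's actual IDL): a region maps to a memory object, a pager object connects a memory object to a cached object in physical memory, and coherence between multiple cached objects is left to the pagers themselves.

```java
// Illustrative shapes only - Spring defines these objects in IDL.
interface MemoryObject {
    byte[] readBacking(long offset, int length);     // backed by a file, swap space, etc.
}

interface CachedObject {
    void fill(long offset, byte[] data);             // a portion of the memory object held in RAM
}

interface PagerObject {
    // Establish the connection between virtual memory (a region mapping a
    // memory object) and physical memory (a cached object), e.g., on a page fault.
    CachedObject cacheIn(MemoryObject source, long offset, int length);

    // Two pagers caching the same memory object coordinate coherence themselves.
    void maintainCoherence(PagerObject peer);
}
```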
Client / server interactions should be agnostic to where the client and server are physically
located - whether on the same machine or across a network. Additionally, the Spring system
can route client / server interactions to replicated servers if the workload is large enough.
There's also the ability to create a cached copy of the server that the client can interface
with.
So how are these dynamic relationships between clients and servers created? They use a
mechanism called the subcontract. You could use the analogy of off-loading work to some
third party, where that third party conducts work on behalf of the original vendor.
The subcontract is the interface provided for realizing the IDL interface between a client
and a server. Subcontracts hide the runtime implementation of a server that a client can
call.
This simplifies client side stub generation, everything is detailed in the subcontract. The
subcontract can be changed at any time - each one can be discovered and installed at run
time. All they do is advertise the IDL they implement, no more details need to be provided to
the client stub.
Below is a high-level representation of how the client and server stubs interface with the
subcontract. This includes the marshaling and unmarshaling of data, invoking the
subcontract, etc.
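A sketch of the division of labor: the generated stub only knows the IDL-level operation, and everything about how the call actually reaches a server (singleton, replicated, cached, remote) sits behind a subcontract interface roughly like the one below. The names are illustrative, not Spring's actual API.

```java
// Illustrative subcontract interface: the stub marshals against it and invokes it,
// and the subcontract hides whether the server is local, remote, replicated, or
// cached. A different subcontract can be discovered and installed at run time.
interface Subcontract {
    byte[] marshal(Object... args);
    Object unmarshal(byte[] reply);
    byte[] invoke(String operation, byte[] marshaledArgs);
}

final class ClientStub {
    private final Subcontract subcontract;   // installed at run time

    ClientStub(Subcontract subcontract) { this.subcontract = subcontract; }

    Object call(String operation, Object... args) {
        byte[] request = subcontract.marshal(args);
        byte[] reply = subcontract.invoke(operation, request);
        return subcontract.unmarshal(reply);
    }
}
```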
Quizzes
Identify the similarities and differences in abstractions provided by the Nucleus and
Liedtke's micro-kernel.
java rmi
Java distributed object model
The great thing about Java's distributed object model is that all of the things the
programmer would normally have to worry about - marshaling / unmarshaling, etc. - are
handled by the Java distributed object model. The model provides abstractions such as the
remote object and the remote interface.
With Java it's possible to re-use local implementation of objects and expose them to the
network by pairing them with remote interfaces. This is done by providing an extension of
the local version of the class. Clients will interface with the public version of the class,
however, all interactions will essentially be forwarded to the local implementation of the
object.
The second choice is to re-use the remote object class. Again, the developer creates the
bank account class and exposes the methods that are in the object using the remote
interface. Now clients can all interface with the remote version of the bank account.
The bank account class this time, however, is derived from the remote object and remote
server classes as an extension. Now, when bank account implementation objects are
instantiated, they immediately become visible on the network for clients to access. This
requires less setup and work required from the developer.
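Using the bank account example from the lecture, the second choice looks roughly like this in standard Java RMI: the remote interface extends java.rmi.Remote, and the implementation extends UnicastRemoteObject, so instances are exported (made visible on the network) as soon as they are constructed. The class and method names are mine; only the pattern comes from the lecture.

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.server.UnicastRemoteObject;

// Remote interface: only these methods are visible to clients.
// (In a real deployment this interface would be public and shared with client code.)
interface BankAccount extends Remote {
    void deposit(float amount) throws RemoteException;
    float getBalance() throws RemoteException;
}

// Implementation derived from the remote object / remote server classes:
// constructing an instance exports it to the network immediately.
class BankAccountImpl extends UnicastRemoteObject implements BankAccount {
    private float balance;

    BankAccountImpl() throws RemoteException {
        super();                       // exports this object
    }

    public synchronized void deposit(float amount) throws RemoteException {
        balance += amount;
    }

    public synchronized float getBalance() throws RemoteException {
        return balance;
    }
}
```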
RMI is implemented using a remote reference layer (RRL). Client stubs initiate remote
method invocation calls using RRL. Marshaling / unmarshaling, sending data over the
network, etc. is handled by the RRL.
Similarly, the server uses the skeleton to interface with the RRL for marshaling /
unmarshaling, etc. The skeleton calls into the server to execute the procedure, generate
results, and the results are returned back to the RRL via the skeleton.
These events are called serialization and deserialization by the Java community.
Transport layer
Here are some abstractions provided by the transport layer in Java RMI:
Abstraction Description
transport - the transport used to carry an invocation: UDP or TCP
The RRL layer makes the decision upon what type of transport is to be used to complete an
invocation between a client and a server.
Below is a high-level representation of the concepts discussed above.
Quizzes
When using the internet or interfacing with an organization, everything looks super simple,
however, an enterprise is made up of multiple networks.
Now let's look at the complexity of the inter enterprise view. Using some service could
include using multiple different enterprises, and as we can see, things can get pretty
complex.
interoperability
compatibility
scalability
reliability
N tier applications
N-tier applications are services provided to customers in which multiple layers are involved
in completing transactions. When we provide these types of services, there are a few things
we focus on to ensure the service is worth using:
Concern Description
persistence - retaining the data of a customer's session so that the customer can begin working from where they left off
transactions - some defined method of how transactions occur, like a shopping cart
caching - caching the data closer to the customer so that the data doesn't have to be retrieved from the database each time the client wishes to view it
Container Description
applet container - typically resides on a web server; augments the client container
The key hope is that we exploit as much reuse of components as possible, and this is done
with Java Beans. Containers host the beans, allowing you to package Java Beans and make
them available within containers. Java Beans come in different types; the discussion below
uses session beans and entity beans.
The finer the grain of the implementation of these beans, the more we can leverage
concurrency for requests. This also makes the business logic more complex, however.
A servlet represents a session with a client. A coarse-grained session bean is associated
with each servlet. The session bean worries about the data accesses to the database needed
for the business logic to execute correctly.
The EJB container usually helps the session beans in conducting their job, providing
required services, however, these coarse grained beans are responsible for handling all
interactions with the database required to complete the business logic. Thus, the EJB
container isn't having to do much and only worries about mitigating conflicts that might
arise in terms of external accesses to the database as sessions conduct concurrent
operations.
Accessing the database is probably the slowest link in the entire process of providing our
service to the customer. We need to leverage concurrency and parallelism and we do this by
placing the business logic into the web container.
Instead of using session beans in the business logic container, we use entity beans to
represent data access objects. This provides parallel access at the unit of granularity of
database access. Now servlets can create parallel database access requests, reducing the
time for data accesses.
When servlets request the same database access, there is also an opportunity for entity
beans to cluster these requests and provide the data back to the clients requesting it.
Entity beans also represent some form of persistence, and there are two ways of providing it: container-managed persistence and bean-managed persistence.
This corrects the fact that the previous alternative exposed the business logic to the web
container. We use session facades in tandem with the business logic; the entity bean data
access objects remain. The session facades worry about the data access needs of the
business logic they serve.
This structure allows us to still manage parallelism, because session facades can make
concurrent calls to multiple entity beans that represent data access objects.
There are two methods for session facades to communicate with the entity beans: via RMI
or locally. RMI is useful because it doesn't matter where the entity beans are located,
allowing for flexibility. Local access provides speed, but the entity beans have to be
co-located. We can choose either construction, as sketched below.
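As a rough illustration of this structure (using modern EJB 3 / JPA annotations rather than the J2EE-era API the lecture describes, and with invented class names): the session facade is a session bean whose methods hide the data access needs of the business logic, and the data access objects are entities it can reach either locally or remotely depending on deployment.

```java
import javax.ejb.Stateless;
import javax.persistence.Entity;
import javax.persistence.EntityManager;
import javax.persistence.Id;
import javax.persistence.PersistenceContext;

// Entity bean acting as a data access object.
@Entity
class CartItem {
    @Id long id;
    int quantity;
}

// Session facade: the business logic calls this bean, and the bean worries about
// the data access (possibly several entity beans accessed concurrently).
@Stateless
class CartFacade {
    @PersistenceContext
    private EntityManager em;

    CartItem find(long id) {
        return em.find(CartItem.class, id);
    }

    void updateQuantity(long id, int quantity) {
        CartItem item = em.find(CartItem.class, id);
        item.quantity = quantity;      // managed entity: changes are persisted at commit
    }
}
```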
Pros: business logic not exposed; able to leverage concurrency.
Cons: incurring additional network access in order to conduct data access (can be mitigated through co-location).
lesson7
lesson7
global memory systems
Context
This portion of the slides provides some context for the need for global memory systems, or
the problem that a global memory system attempts to solve.
For a distributed system, the memory pressure may be experienced differently across each
node. Is it possible to use idle memory across the cluster to alleviate the memory pressure
of a particular node? Is remote memory access faster than using paging from the node's
local disk?
Because of the advances in networking technology, 10 Gb/s speeds from the LAN to a host
are now possible, making it faster than reading approx. 2 MB/s from disk.
Traditional paging: the memory manager translates VA -> PA and pages to / from the local disk.
With a global memory system: the translation is still VA -> PA, but paging goes first to cluster memory and only then to disk.
GMS will only be used for reads. GMS will not interfere when it comes to writing to the disk.
The only pages that can be in cluster memory are pages that have been paged out that are
not dirty. This avoids worrying about node failures across the network.
GMS basics
State Description
local - pages within the local cache of a node; these can be private or shared, depending on whether the page is being actively shared by more than one node at a time
global - pages within the global cache of a node; everything in the global cache is a private (clean) copy of a page
The local memory space has increased by 1, and the global has decreased by 1.
In the below high-level example, all of the memory is being consumed by the working set in
Host P, there is no free memory available to provide global memory to the community
service.
The only difference between this case and the previous one is that the oldest page, the
candidate page, is being paged from local memory into global memory.
Handling page faults - case 3
In the below high-level example, the faulting page is not contained within clustered memory.
The only copy of the requested page resides on disk. GMS will identify the globally oldest
page and swap that page to disk. As the original faulting node receives the page from disk,
it will decrease its global memory space, replacing a page and placing it in the community
space.
If the globally oldest page is clean, we can just drop the page entirely. This happens usually
if the globally oldest page being replaced is in the global cache. It must be written to disk if
it's dirty, this usually happens if the globally oldest page is in the local cache.
Handling page faults - case 4
In the below high-level example, a node page faults on a page that is actively being shared.
The GMS will notice that the page currently exists within a different node's local cache, and
instead of ripping that page away from that node, it will just copy the page to the local
cache of the faulting node.
The faulting node will increase its local cache space, and its global cache space will
decrease. A page from the faulting node's global cache will be pushed to community space,
and the globally oldest page will be swapped out.
The same dirty / clean handling of the globally oldest page applies as in the previous cases.
Behavior of the algorithm
The behavior that emerges from this algorithm is described in the high-level representation
below. Essentially, the most idle host in the cluster will gradually become a memory server
for the rest of the cluster as its local memory shrinks to accommodate the global memory
cache.
If the idle node begins doing some work, however, the split between local and global is not
static. Thus, the node should begin to start reclaiming memory for private use as its
working set increases in size.
Geriatrics
We must manage the age of the oldest page within the GMS in order to swap it to disk
when necessary. The age management cannot be assigned to one particular system; we
must distribute the management workload evenly across all nodes within the cluster.
With this, the GMS establishes an initiator who declares that pages will be swapped in the
upcoming epoch. In each epoch, each node sends the age of each page to the initiator. The
initiator calculates a minimum age - this minimum age will be the youngest age a page can
be before it is swapped. The initiator also calculates a weight for each node dependent
upon the percentage of page replacement being conducted for that node in comparison to
the rest of the cluster. The minimum age, weight tuple for all nodes gets sent back to each
node, notifying them of the candidate pages for swapping.
The initiator of the management routine at the next epoch becomes the node with the
maximum weight. We can intuitively determine that the node with the maximum weight is
not very active. This makes it the prime candidate for becoming the next initiator.
For each node, when it comes to page replacement: if the page being replaced is older than
the minimum age, we can just discard it, since it would soon be evicted cluster-wide anyway.
If the page being replaced is younger than the minimum age, we send the page to a peer
node instead.
GMS requests a daemon to dump the TLB, and inspects this information to derive the age
information for pages at every node.
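A sketch of the epoch computation described above (the names and the exact selection rule are illustrative simplifications of the GMS algorithm): the initiator gathers page ages from every node, picks the minimum age so that the oldest pages in the cluster cover the expected number of replacements for the epoch, and assigns each node a weight equal to its share of those old pages.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class GeriatricsInitiator {
    record Result(long minAge, Map<String, Double> weights) {}

    // pageAgesByNode: node id -> ages of its pages; expectedReplacements: the number
    // of replacements the cluster expects to perform during the upcoming epoch.
    static Result computeEpoch(Map<String, List<Long>> pageAgesByNode, int expectedReplacements) {
        List<long[]> all = new ArrayList<>();              // entries of [age, nodeIndex]
        List<String> nodes = new ArrayList<>(pageAgesByNode.keySet());
        for (int i = 0; i < nodes.size(); i++) {
            for (long age : pageAgesByNode.get(nodes.get(i))) {
                all.add(new long[] { age, i });
            }
        }
        all.sort((a, b) -> Long.compare(b[0], a[0]));      // oldest pages first

        int m = Math.min(expectedReplacements, all.size());
        long minAge = m == 0 ? Long.MAX_VALUE : all.get(m - 1)[0];

        // Weight: the fraction of this epoch's candidate (oldest) pages held by each node.
        Map<String, Double> weights = new HashMap<>();
        for (String n : nodes) weights.put(n, 0.0);
        for (int i = 0; i < m; i++) {
            String n = nodes.get((int) all.get(i)[1]);
            weights.put(n, weights.get(n) + 1.0 / m);
        }
        return new Result(minAge, weights);
    }
}
```

The node with the maximum weight (the least active node) would then be chosen as the initiator for the next epoch, as described above.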
GMS distributed data structures
To track pages, GMS converts virtual addresses from nodes into universal IDs. These
universal IDs uniquely identify a page. There are three key data structures utilized by GMS
to manage universal IDs:
GCD (global cache directory) - a cluster-wide partitioned hash table; given a UID, it returns the node that holds the PFD entry for that page.
POD (page ownership directory) - replicated on all nodes; given a UID, it returns the node that holds the GCD entry for that page.
PFD (page frame directory) - a per-node mapping from a UID to the page frame holding that page on the node.
Putting the data structures to work
Below is a high-level representation of how the data structures described above assist in a
node resolving a page fault across a cluster. Up to three communication events can occur
in order to find a page in a cluster.
The below representation shows an uncommon case, usually page faults occur for non-
shared pages - Node A will usually contain both the POD and the GCD.
What happens when the PFD has a miss? This is possible when the PFD's node evicts a page
but has yet to notify the GCD holder; this scenario is common. An uncommon scenario,
however, is when the faulting node has a stale POD. Remember, the POD is replicated on all
nodes - a stale POD only occurs when a POD update has not yet propagated to the node.
POD updates occur when a node is added to or deleted from the cluster.
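A sketch of the lookup chain for the three data structures (the map types below are simplified, single-process stand-ins; the real structures are distributed across the cluster): the faulting node consults its replicated POD to find which node holds the GCD entry for the UID, that node's GCD names the node whose PFD should hold the page, and the PFD finally yields the page frame - up to three communication events.

```java
import java.util.Map;
import java.util.Optional;

final class GmsLookup {
    // POD: replicated on every node; UID -> node holding the GCD entry.
    private final Map<Long, String> pod;
    // GCD: partitioned across nodes; on each node, UID -> node whose PFD should hold the page.
    private final Map<String, Map<Long, String>> gcdByNode;
    // PFD: per node; UID -> page frame number on that node.
    private final Map<String, Map<Long, Integer>> pfdByNode;

    GmsLookup(Map<Long, String> pod, Map<String, Map<Long, String>> gcdByNode,
              Map<String, Map<Long, Integer>> pfdByNode) {
        this.pod = pod;
        this.gcdByNode = gcdByNode;
        this.pfdByNode = pfdByNode;
    }

    Optional<Integer> locate(long uid) {
        String gcdHolder = pod.get(uid);                      // POD lookup is local and free
        if (gcdHolder == null) return Optional.empty();

        Map<Long, String> gcd = gcdByNode.get(gcdHolder);     // hop 1: ask the GCD holder
        if (gcd == null) return Optional.empty();
        String pfdHolder = gcd.get(uid);
        if (pfdHolder == null) return Optional.empty();

        Map<Long, Integer> pfd = pfdByNode.get(pfdHolder);    // hop 2: ask the PFD holder
        if (pfd == null) return Optional.empty();
        return Optional.ofNullable(pfd.get(uid));             // a PFD miss means the page was evicted
    }
}
```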
Every node has a paging daemon as discussed earlier. The virtual memory manager utilizes
the paging daemon to conduct the page replacement algorithm and execute page
replacement mechanisms. Of note, the virtual memory manager always contains some free
pages on the node for page faults, because we want to conduct our operations in an
aggregate manner.
As noted earlier, the paging daemon is integrated with the cluster's GMS. Essentially, we
modify the virtual memory manager and the unified buffer cache to consult GMS before
attempting to resolve page faults from the disk. Writes to disk are unchanged, and the
pageout daemon now defers to GMS when discarding clean pages - dirty pages are still
written to disk. When the free-pages list of a node falls below a certain threshold, the
pageout daemon places the oldest pages on the node onto a different node, coordinating
with the cluster's GMS. The candidate node is chosen based on the weight information
derived from the geriatric management that occurs at each epoch.
The page daemon then updates the GCD with a new PFD for the respective UID that was
evicted to a different node. Again, all of these operations do not occur on every page
eviction, but aggregately when meeting the free page threshold.
Quizzes
Fill in the table with what happens to the boundary between local and global on each
node.
distributed shared memory
Cluster as a parallel machine
There's also the ability to create explicitly parallel programs leveraging message passing
between nodes. The programmer will leverage a message passing library to exploit the
true physical nature of the cluster. The only problem with this style of parallelization is that
it's a bit tougher to program using message passing libraries. Programming with shared
memory is definitely easier and requires less of a change in thinking.
All of what we discussed above drives the motivation for creating the abstraction of
distributed shared memory. We give the illusion to the programmer that all of the memory in
the cluster is shared using a distributed shared memory library.
The programmer can use the same set of primitives to parallelize a sequential program
without having to tailor their implementation to the clustered environment.
Using the example of a typical parallel program, we show that, in the sequential
consistency model, there is a delineation between lock read / write operations and shared
memory operations. With that in mind, once a process acquires a lock, it can safely assume
the shared memory it modifies is safe from outside read / write operations.
Thus, there is no need for coherence actions to occur across the cluster for shared memory
that is locked until the lock is released.
Release consistency
The above discussion spurred the need to research release consistency. The gist of
release consistency is that, if two processes utilize the same lock, all coherence actions
need to take place prior to the release of the lock by a process and the acquisition of a lock
by another process that's protecting some shared memory.
With this mechanism, all coherence actions only have to happen at the release point. This
allows us to overlap computation in a process with the communication that the coherence
mechanism performs for the memory accesses taking place inside a critical section.
Release consistency memory model
The release consistency model actually distinguishes between data and synchronization
read / write operations. Coherence actions will only take place when locks are released.
Eager release consistency is what we described earlier. All of the coherence operations take
place prior to the release of the lock.
For lazy release consistency (LRC) coherence actions only occur upon the acquisition of the
lock by some other process.
Eager vs Lazy RC
In Eager RC, all release operations require a broadcast push to all processes. In Lazy RC, all
acquisition operations are just a pull, no broadcast required.
Pros of Lazy RC over Eager?
Distributed shared memory (DSM) cannot manage the global memory for a cluster like
what would normally happen for multiple processors sharing memory, as the hardware
usually conducts all of the coherence operations - this is infeasible across a network.
To combat this, DSM provides a global virtual memory abstraction for all processors within
a cluster, and partitions the address space of the global virtual memory across the different
processors. DSM exercises distributed ownership: each processor is responsible for the
consistency of its assigned partition of global memory.
The DSM software layer tracks ownership of pages, providing it the ability to acquire the
owner of a page in order to make a copy of the page. This way, the DSM software can
handle page faults for processors across the network. The DSM software will then contact
the virtual memory manager of a processor to update the page table of the processor that
is experiencing a page fault.
DSM implements a single-writer protocol in which processors that wish to write to a page
will notify the DSM, the DSM will notify the page owner, and then the page owner will
invalidate all copies of the page across the system to maintain coherency. Any number of
readers can read a page, of course.
There still exists the possibility of false sharing, just like in a shared memory
multiprocessor. The granularity of consistency is being conducted at pages, so small parts
of the page can cause invalidation if updated - this causes ping-ponging of the pages as
invalidation occurs.
With the information described above, we can discern that the single-writer protocol and
page based coherence granularity don't play well together.
Multiple-writer coherence protocol with LRC
To combat false sharing, we essentially create diffs of items that were modified on a page
in shared memory. When another processor acquires the lock, it invalidates the pages.
When the processor enters a critical section, the processor will then fetch the current version
of the page by consulting the DSM for the owner of the page.
So what happens when there have been two changes to a shared page prior to a processor
acquiring the lock? The DSM software knows that two diffs exist for a page, and will
acquire the diffs from the different processors. Then it will provide the diffs and the original
page to the processor attempting to access the page.
If a new page is modified for a lock, the new page that has been modified gets associated
with that lock. Now, when other processors acquire the lock, they will also invalidate the
pages and have to fetch them from the DSM when they read / write.
DSM differentiates between locations within shared pages based upon the locks used to
access specific portions of memory. So if a different lock is used than the ones used
previously to read / write to a location on the page that is completely unrelated, the DSM
knows that that particular memory location is unrelated - preventing false sharing.
Implementation
When a write occurs for a page, the DSM creates a twin and maintains the original page. All
writes are made to the original page, the twin exists but does not reside within anyone's
page table.
Upon lock release, the DSM computes the diff between the twin and the now modified page -
a run-length encoded diff. The original page is then write-protected and cannot be written
to until the lock is acquired again, and the twin page is destroyed. When a different
processor acquires the lock, the DSM provides the diff of the edited page to that processor
when it attempts to read / write the modified page.
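A small sketch of the twin/diff mechanism (byte-level and in-memory only, with invented names - the real implementation works on write-protected virtual memory pages and run-length encodes the diff): on the first write after an acquire, keep an untouched twin; at release, record only the bytes that differ and discard the twin.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class TwinDiffPage {
    // One (offset, newValue) pair per modified byte - conceptually a run-length
    // encoded diff in the real system; kept simple here.
    record DiffEntry(int offset, byte newValue) {}

    private final byte[] page;
    private byte[] twin;                     // copy taken at the first write after an acquire

    TwinDiffPage(byte[] page) { this.page = page; }

    void write(int offset, byte value) {
        if (twin == null) {
            twin = Arrays.copyOf(page, page.length);   // create the twin lazily
        }
        page[offset] = value;                          // writes go to the original page
    }

    // At lock release: compute the diff against the twin, then destroy the twin.
    List<DiffEntry> releaseDiff() {
        List<DiffEntry> diff = new ArrayList<>();
        if (twin != null) {
            for (int i = 0; i < page.length; i++) {
                if (page[i] != twin[i]) diff.add(new DiffEntry(i, page[i]));
            }
            twin = null;
        }
        return diff;                                   // shipped to the next acquirer on demand
    }
}
```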
What happens when writes happen to the same location on a page under different locks?
This is a data race, and this is the programmer's fault. Nothing DSM can do about it.
After a while, there could be diffs lying around lots of nodes. It would be a hassle to go
gather all of the diffs for a processor attempting to use the page right? DSM uses garbage
collection. If the page has begun to change enough, the owner of the page will acquire all of
the diffs of the page and will apply the diffs to the original copy of the page. This occurs
when a paging daemon checks up on the diffs created at its node.
Non-page based DSM
There are methods of conducting non-page-based DSM at a granularity that is not so fine
that it causes a performance hit. These solutions are implemented at the application /
library level, requiring no operating system support.
As usual, as the number of processors increase, performance might decrease due to the
amount of overhead. This is true in the case of shared memory multiprocessors, and it's
also true in the case of DSM.
If the sharing is too fine-grained across a cluster, speed up expectations dwindle with DSM.
The computation to communication ratio has to be very high if we want to achieve
performance increase - our critical sections need to be lengthy.
In addition, if the code has a lot of dynamic data structures manipulated with pointers, this
can result in implicit communication across the network. Pointer codes may result in
increasing overhead for distributed shared memory.
distributed file systems
NFS
Clients on the network, servers on the network, files distributed across servers, and clients
see the file system as one centralized entity. Servers cache the files that they retrieve from
disk in memory, to avoid reading from disk all the time.
A single server fielding client requests becomes a bottleneck for performance. The file
system cache is also limited because it's confined to the space within the server. Can we
distribute file system access?
DFS
With a DFS, there is no central server for a set of files. We distribute files across several
different nodes across the network. The I/O bandwidth that's available can be cumulatively
leveraged to serve clients. Our file cache size increases as we can leverage the caches of all
the servers in the cluster. These mechanisms start to push us towards the implementation
of server-less filesystems.
XFS
Dynamic management
For log segments, we implement what are called stripe groups. Stripe groups are a subset
of the cluster of disks available to write log segments to, because we don't want to write log
segments across the entire cluster - avoid small writes.
Stripe group
Below is an example of stripe groups. Allows for parallel client activities, increased
availability and efficient log cleaning. Different subset of servers can complete different
client requests, increasing performance. Different cleaning services can be assigned to
different stripe groups as well. If a server fails, other stripe groups will still be able to serve
clients of that particular stripe group - our system won't degrade because we didn't
distribute logs across the entire cluster.
Cooperative caching
File blocks are used as units of coherence. A manager manages the coherence of a file
block and tracks the clients that are concurrently accessing a file. Clients request to a
manager the ability to write / read a file block.
The manager invalidates file blocks when they are written to, and the clients acknowledge
that they have invalidated their copies of the files. Then the write privilege is given to the
writing client for the block - it's a token. This token can be revoked by the manager at any
time.
If a client has a copy of the file, we can leverage cooperative caching - instead of going to
disk to get the file a client with the file in its cache will serve a read request.
Log cleaning
High-level representation shows how log maintenance is conducted, killing old log
segments. It also shows how logs are aggregated and then the old log segments are
garbage collected.
This isn't done by a single manager, it's actually done as a distributed activity. In XFS, the
clients are also responsible for log cleaning - there is no differentiation between clients and
servers in XFS. The clients are responsible for knowing the utilization of log segments for
aggregation.
XFS data structures
A lot more data structures are involved in XFS to manage files across the distributed file
system.
Below is a high-level representation of the XFS data structures used to conduct file access.
Fastest path for file access for a client. The client will read the file from its local cache.
Second fastest path is to read the file from a peer who has the file cached. The local
manager map will tell us who the manager node is for the particular file that I want to read.
The manager node tells us who has the file cached, and then I read the file from the cache
of the peer across the network.
The longest method of acquiring a file: the manager map points me to the manager node; the
manager node says the file is not in anyone's cache, so it looks up the file with the imap,
finds the log segment ID of the file block, consults the stripe group map, and then we finally
read the file block from the storage server. Once we finally acquire the file block, we cache
the file locally.
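The three paths can be summarized as a lookup sketch like the one below. All of the map types are simplified stand-ins for xFS's real data structures (manager map, peer caches, imap plus stripe group collapsed into one lookup); only the order of the lookups comes from the lecture.

```java
import java.util.Map;
import java.util.Optional;

final class XfsReadPath {
    private final Map<String, byte[]> localCache;                     // path 1: my own file cache
    private final Map<String, String> managerMap;                     // file block -> manager node
    private final Map<String, Map<String, byte[]>> peerCaches;        // peer node -> its cache
    private final Map<String, String> cachedAtPeer;                   // manager's view: block -> caching peer
    private final Map<String, byte[]> storageServers;                 // imap + stripe group, collapsed

    XfsReadPath(Map<String, byte[]> localCache, Map<String, String> managerMap,
                Map<String, Map<String, byte[]>> peerCaches, Map<String, String> cachedAtPeer,
                Map<String, byte[]> storageServers) {
        this.localCache = localCache;
        this.managerMap = managerMap;
        this.peerCaches = peerCaches;
        this.cachedAtPeer = cachedAtPeer;
        this.storageServers = storageServers;
    }

    Optional<byte[]> read(String block) {
        byte[] local = localCache.get(block);
        if (local != null) return Optional.of(local);                  // fastest path: local cache hit

        String manager = managerMap.get(block);                        // otherwise ask the manager node
        if (manager == null) return Optional.empty();

        String peer = cachedAtPeer.get(block);                         // second fastest: a peer has it cached
        if (peer != null) return Optional.ofNullable(peerCaches.get(peer).get(block));

        byte[] fromDisk = storageServers.get(block);                   // longest path: storage server
        if (fromDisk != null) localCache.put(block, fromDisk);         // cache locally once read
        return Optional.ofNullable(fromDisk);
    }
}
```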
Client writing files
The client writes to the file block in its local memory, knows the log segment that the file
block belongs to and then flushes the log segment to the stripe group. The client then
notifies the manager about the log segments that have been flushed to the stripe group.
lesson9
lesson9
Giant scale services
Generic service model of giant scale services
Below is a high level representation of the architecture used to implement giant scale
services. These types of services use multiple servers and data stores to satisfy a massive
number of client requests. Architectures within these sites use a very important mechanism
to maintain quality of service, a load manager.
A load manager performs two functions for giant scale services: balancing the incoming client requests across the servers, and hiding partial failures (downed servers) from the clients.
The high-level representation of the OSI stack below is just describing the increasing
functionality of the load manager as we travel up the stack. This is because applications
and protocols become more complex, and we can make smarter load balancing decisions.
Load management at the network level
If the load balancer operates at the network level, it essentially acts as a round-robin DNS
server: each client resolving the service's domain name is handed the address of a different
server, in sequential order.
To achieve this type of load balancing, all servers must be identical and replicate the data
that they serve. This way, clients notice no difference regardless of which server the domain
name resolves to.
The main problem with load balancing at the network level is the load balancer is not able
to hide downed server nodes.
Now we look at having the load balancer at a higher level of the OSI model. In this
architecture, we are able to hide server failures from clients as we are able to isolate
downed server nodes. A load balancer higher than the network layer could be implemented
with layer 4 switches.
This architecture also allows servers to implement server-specific front-end nodes. Instead
of dealing with arbitrary client requests, we are able to deal with specific kinds of requests.
Moving the load balancing to a higher level also provides us the ability to inspect the type
of device making a request, allowing us to make a decision to point it to a server node
servicing that particular device type.
Below is a high-level representation of the DQ principle. It shows the definitions of some
key values:
Value Definition
harvest (D) - the fraction of the complete data reflected in each response: data available / complete data
yield (Q) - the fraction of requests that are answered: queries completed / queries offered
DQ - the product of data per query and queries per second, which is bounded (roughly constant) for a given system at saturation
These values directly compete with one another. If you increase the yield, you must
decrease the harvest and vice versa.
Replication vs partitioning
Replication maintains 100 percent harvest under a fault. In contrast, partitioned data stores
will lose a percentage of their harvest under a fault. The opposite is true if we study the
yield during a fault. Replicated data stores will lose a percentage of their yield while
partitioned data stores will maintain 100 percent yield.
""" The traditional view of replication silently assumes that there is enough excess capacity
to prevent faults from affecting yield. We refer to this as the load redirection problem
because under faults, the remaining replicas have to handle the queries formerly handled
by the failed nodes. Under high utilization, this is unrealistic. """
"""[ ... ] you should always use replicas above some specified throughput. """
""" You will enjoy more control over harvest and support for disaster recovery, and it is easier
to grow systems via replication than by repartitioning onto more nodes. """
""" Finally, we can exploit randomization to make our lost harvest a random subset of the
data, (as well as to avoid hot spots in partitions). Many of the load-balancing switches can
use a (pseudorandom) hash function to partition the data, for example. This makes the
average and worst-case losses the same because we lose a random subset of “average”
value. """
Graceful degradation
""" [ ... ] graceful degradation: We can either limit Q (capacity) to maintain D, or we can
reduce D and increase Q. """
""" We can focus on harvest through admission control (AC), which reduces Q, or on yield
through dynamic database reduction, which reduces D, or we can use a combination of the
two. """
""" The larger insight is that graceful degradation is simply the explicit process for
managing the effect of saturation on availability; that is, we explicitly decide how
saturation should affect uptime, harvest, and quality of service. Here are some more
sophisticated examples taken from real systems: """
Example Description
cost-based admission control (AC) - perform AC based on the estimated query cost; reduces the average data required per query, increasing Q
priority or value-based AC - a user pays a commission to guarantee their query will be executed; reduce the required DQ by dropping low-value (unpaid for) queries independent of their DQ cost
reduced data freshness - under saturation, a site can make data expire less frequently; reduces freshness but also reduces work, thus increasing yield at the expense of harvest
""" As you can see, the DQ principle can be used as a tool for designing how saturation
affects our availability metrics. We first decide which metrics to preserve (or at least focus
on), and then we use sophisticated AC to limit Q and possibly reduce the average D. We
also use aggressive caching and database reduction to reduce D and thus increase
capacity. """
""" [ ... ] we must plan for continuous growth and frequent functionality updates. """
""" By viewing online evolution as a temporary, controlled reduction in DQ value, we can act
to minimize the impact on our availability metrics. """
ΔDQ = n × u × (average DQ per node) = DQ × u
Variable Description
n - the number of nodes
u - the time to upgrade a single node
average DQ per node - DQ / n
Approach Description
fast reboot - quickly reboot all nodes into the new version; guarantees some downtime; by upgrading during off-peak hours, we can reduce the yield impact
rolling upgrade - we upgrade nodes one at a time in a "wave" that rolls through the cluster; old and new versions must be compatible because they will coexist
big flip - update the cluster one half at a time by taking down and upgrading half the nodes at once
""" The three regions have the same area, but the DQ loss is spread over time differently. All
three approaches are used in practice, but rolling upgrades are easily the most popular. The
heavyweight big flip is reserved for more complex changes and is used only rarely. The
approaches differ in their handling of the DQ loss and in how they deal with staging areas
and version compatibility, but all three benefit from DQ analysis and explicit management
of the impact on availability metrics. """
- Lessons from Giant-Scale Services
Definitions
Term Definition
load manager - a level of indirection between the service's external name and the servers' physical names
servers - the system's workers, combining CPU, memory, and disks into replicable units
persistent data store - replicated or partitioned database that is spread across server disks
MTTR - mean-time-to-repair
Quizzes
Check off the ones you see below as "giant scale" services.
""" The computation takes a set of input key/value pairs, and produces a set of output
key/value pairs. The user of the MapReduce library expresses the computation as two
functions: Map and Reduce. """ - Simplified Data Processing on Large Clusters
Method Description
map - written by the user; takes an input pair and produces a set of intermediate key/value pairs.
reduce - accepts an intermediate key and a set of values for that key; merges together these values to form a possibly smaller set of values.
Why MapReduce?
MapReduce is applicable to a lot of real-world problems. Some examples:
Distributed grep
Count of URL access frequency
Reverse web-link graph
Term-vector per host
Inverted index
Distributed sort
Below is a high-level representation that uses MapReduce to search for the occurrence of a
targetUrl among a group of sourceUrls.
""" The Map invocations are distributed across multiple machines by automatically
partitioning the input data into a set of M splits. The input splits can be processed in
parallel by different machines. """ - Simplified Data Processing on Large Clusters
""" Reduce invocations are distributed by partitioning the intermediate key space into R
pieces using a partitioning function (e.g., hash(key) mod R). The number of partitions (R)
and the partitioning function are specified by the user. """ - Simplified Data Processing on
Large Clusters
The following actions occur when the user program calls the MapReduce function:
1. MapReduce library splits the input files into M pieces. The MapReduce library starts up
many copies of the program on a cluster of machines.
2. One of the copies of the program is special - the master. The rest of the copies are
workers that are assigned work by the master. There are M map tasks and R reduce
tasks to assign.
3. A worker assigned a Map task reads the contents of the input split. It parses key/value
pairs out of the input data and passes each pair to the user-defined Map function. The
intermediate key/value pairs produced by the Map function are buffered in memory.
4. Periodically, the buffered pairs are written to local disk, partitioned into R regions by the
partitioning function. The locations of these buffered pairs on the local disk are passed
back to the master, who is responsible for forwarding these locations to the reduce
workers.
5. When a reduce worker is notified by the master about these locations, it uses remote
procedure calls to read the buffered data from the local disks of the map workers.
When a reduce worker has read all intermediate data, it sorts it by the intermediate keys
so that all occurrences of the same key are grouped together. The sorting is needed
because typically many different keys map to the same reduce task. If the amount of
intermediate data is too large to fit in memory, an external sort is used.
6. The reduce worker iterates over the sorted intermediate data and for each unique
intermediate key encountered, it passes the key and the corresponding set of
intermediate values to the user's Reduce function. The output of the Reduce function is
appended to a final output file for this reduce partition.
7. When all map tasks and reduce tasks have been completed, the master wakes up the
user program. At this point, the MapReduce call in the user program returns back to the
user code.
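As a concrete (single-machine) illustration of the two user-supplied functions, here is a word-count-style map and reduce written as plain Java - not the actual MapReduce library API, just the shape of the computation the paper describes, with groupingBy standing in for the shuffle/sort the runtime would perform.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class WordCount {
    // Map: takes an input (document name, contents) pair and produces
    // intermediate (word, 1) pairs.
    static Stream<Map.Entry<String, Integer>> map(String docName, String contents) {
        return Stream.of(contents.split("\\s+"))
                     .filter(w -> !w.isEmpty())
                     .<Map.Entry<String, Integer>>map(w -> new SimpleEntry<String, Integer>(w, 1));
    }

    // Reduce: takes an intermediate key and all values for that key, and merges
    // them into a (usually smaller) set of values - here, a single sum.
    static int reduce(String word, List<Integer> counts) {
        return counts.stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        // The runtime would shuffle and sort intermediate pairs by key across R
        // reduce tasks; a local groupingBy stands in for that step here.
        Map<String, List<Integer>> grouped =
                map("doc1", "to be or not to be")
                        .collect(Collectors.groupingBy(Map.Entry::getKey,
                                 Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
        grouped.forEach((word, counts) -> System.out.println(word + " -> " + reduce(word, counts)));
    }
}
```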
Issues to be handled by the runtime
""" The master keeps several data structures. For each map task and reduce task, it stores
the state (idle, in-progress, or completed), and the identity of the worker machine (for non-
idle tasks). """ - Simplified Data Processing on Large Clusters
Fault tolerance
""" The master pings every worker periodically. If no response is received from a worker in a
certain amount of time, the master marks the worker as failed. Any map tasks completed
by the worker are reset back to their initial idle state, and therefore become eligible for
scheduling on other workers. Similarly, any map task or reduce task in progress on a failed
worker is also reset to idle and becomes eligible for rescheduling. """ - Simplified Data
Processing on Large Clusters
""" We rely on atomic commits of map and reduce task outputs to achieve this property.
Each in-progress task writes its output to private temporary files. A reduce task produces
one such file, and a map task produces R such files (one per reduce task). When a map task
completes, the worker sends a message to the master and includes the names of the R
temporary files in the message. If the master receives a completion message for an already
completed map task, it ignores the message. Otherwise, it records the names of R files in a
master data structure. """ - Simplified Data Processing on Large Clusters
Locality
""" We conserve network bandwidth by taking advantage of the fact that the input data is
stored on the local disks of the machines that make up our cluster. """ - Simplified Data
Processing on Large Clusters
Task granularity
""" We subdivide the map phase into M pieces and the reduce phase into R pieces, as
described above. Ideally, M and R should be much larger than the number of worker
machines. Having each worker perform many different tasks improves dynamic load
balancing, and also speeds up recovery when a worker fails: the many map tasks it has
completed can be spread out across all the other worker machines. """ - Simplified Data
Processing on Large Clusters
Backup tasks
""" We have a general mechanism to alleviate the problem of stragglers. When a MapReduce
operation is close to completion, the master schedules backup executions of the remaining
in-progress tasks. The task is marked as completed whenever either the primary or the
backup execution completes. """ - Simplified Data Processing on Large Clusters
Content distribution networks
DHT
Distributed hash tables (DHTs) are used to store the location of content on the internet.
DHTs store key, value pairs. Here are the objects that comprise a DHT:
Object Definition
content hash (key): a numerical value derived from hashing the content to be hosted within the content distribution network; a hashing algorithm is used to ensure content hashes don't collide
value: the node ID where the content is stored within the content distribution network
So where are the key, value pairs listed in a DHT actually stored? Key, value pairs are stored
on the node whose name (ID) is closest to the content hash. Clients query for the content,
and the node hosting the key, value pair for that content responds with the value - this
points the client to the node that hosts the content described in the key.
For a distributed hash table, two namespaces are used: key_space and node_space. The
keys in each space are hashes of the real names of the content and the node, and each
hashing algorithm provides a unique hash.
The objective is for the content key to be as close as possible to the node ID, with a best
case scenario of the content key and the node ID being an exact match. APIs are available
to the programmer to directly interact with the DHT, using putkey, and getkey procedure
calls.
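As a minimal sketch of this idea (assumed, not the lecture's code), the toy DHT below hashes content names and node names into one numeric namespace and stores each key, value pair on the node whose ID is closest to the key; putkey and getkey follow the simplified names used in these notes.

```python
import hashlib

def hash_name(name, bits=16):
    # Hash a real name (content or node) into the shared numeric namespace.
    return int(hashlib.sha1(name.encode()).hexdigest(), 16) % (1 << bits)

class SimpleDHT:
    def __init__(self, node_names):
        # node_space: node ID -> that node's local key/value store
        self.nodes = {hash_name(n): {} for n in node_names}

    def closest_node(self, key):
        # The key, value pair lives on the node whose ID is closest to the key.
        return min(self.nodes, key=lambda node_id: abs(node_id - key))

    def putkey(self, content_name, value):
        key = hash_name(content_name)
        self.nodes[self.closest_node(key)][key] = value

    def getkey(self, content_name):
        key = hash_name(content_name)
        return self.nodes[self.closest_node(key)].get(key)

dht = SimpleDHT(["nodeA", "nodeB", "nodeC"])
dht.putkey("lecture-video.mp4", "nodeB")   # value = node ID hosting the content
print(dht.getkey("lecture-video.mp4"))
```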
An overlay network is a routing table at the user level, implemented in a shared application
running on multiple nodes - virtual network on top of the physical network. Each node
within the content distribution network advertises their node ID, and maps this to their IP
address. Each server will be able to address a particular node within the CDN using its node
ID -> IP address mapping.
Below is a high-level representation of how nodes can route messages to one another using
a CDN routing table.
Overlay networks in general
Below is a high-level representation of how addresses from Layer 2 -> Layer 3 ->
Layer 7 are translated for the implementation of an overlay network.
The reason we place key, value pairs on a node N whose ID is close to the key is that it
makes routing to other nodes simpler. The routing table on each node in a CDN can only be so
large, so if someone requests content for a key, the fact that keys are stored on nodes with IDs
close to the key lets us best-guess that the key, value pair for that content is stored on node N.
If a node doesn't know the IP address of the node responsible for the target key, routing continues
hop by hop until we reach the node that contains the requested key, value pair, which can then
provide us the IP address of the node hosting the content.
Below is a high-level representation of the concepts discussed above, and probably clarifies
the concepts I attempted to describe.
Metadata overload
The traditional approach creates a scenario in which a particular server can become a
hotspot for metadata as well as for content retrieval. Imagine some scenario where a large
number of users are generating content and all of their keys are very similar. All of these
keys will reside on one particular node because its node ID is closest to their keys. This
server will become overloaded with metadata, and the storage of key, value pairs will be
uneven across the content distribution network.
Imagine a scenario in which a particular piece of content becomes really popular and
everyone is requesting that content. The node that contains the key, value pair for said
content will become a hotspot, and will be busy routing clients requesting the content to the
actual node storing the content. This server will become overloaded responding to content
requests - responses will not be conducted in a distributed manner as only one server will
be conducting work.
The origin server overload problem is the fact that one specific server hosting content will
receive all the requests for a particular piece of content - essentially DOSsing the server.
This DOSsing by people attempting to view the content of a server is called the slashdot
effect.
Utilize a webproxy to directly serve content after caching it for users close to the
webproxy
Not good enough for the slashdot effect, especially if people want live content
CDNs
content is mirrored and stored based upon geographic location
user requests are dynamically re-routed to geo-local mirrors
content producers pay CDN providers to avoid origin overload
Tree saturation
Tree saturation is what we described earlier, with keys matching particular nodes and nodes
being overloaded with metadata. The tree is saturated as well because all requests for a
particular key, value pair are happening only on specific parts of the tree that comprises the
CDN.
Coral DHT, a framework / application that attempts to democratize content distribution,
uses what's called a sloppy DHT. This sloppy DHT doesn't strictly adhere to the traditional
rule where a key must match as closely as possible to its node ID.
Key-based routing
The distance between a source node and a destination node in this scheme is the XOR of their IDs: distance = (N_src) XOR (N_dst).
The difference between this approach and the greedy approach is that we don't
immediately travel to the node whose XOR distance to our destination is smallest. Travel is similar to
a binary search: our first hop covers roughly half of the XOR distance, the second hop half of what
remains, and so on until we reach our destination.
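Below is a rough Python simulation of this binary-search-like routing (a Kademlia-style simplification, not Coral's actual code): at each hop we fix the highest bit in which the current node's ID still differs from the key, so the number of hops is bounded by the number of ID bits.

```python
import random

def xor_distance(a, b):
    # Distance between two IDs in the shared key/node namespace.
    return a ^ b

def route(src_id, key, node_ids):
    # At every hop, jump to a known node that already agrees with the key in
    # the highest bit where the current node still differs. Each hop clears
    # that bit, so lookups finish in at most "number of ID bits" hops.
    path, current = [src_id], src_id
    while xor_distance(current, key) != 0:
        top_bit = xor_distance(current, key).bit_length() - 1
        # nodes that agree with the key in bit position top_bit and above
        bucket = [n for n in node_ids if xor_distance(n, key) < (1 << top_bit)]
        if not bucket:
            break          # no closer node exists: current node is responsible for the key
        current = random.choice(bucket)   # a real node only knows one such peer per bit
        path.append(current)
    return path

nodes = [0b0001, 0b0100, 0b0111, 0b1010, 0b1101]
print(route(src_id=0b0001, key=0b0110, node_ids=nodes))
```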
Coral primitives
Primitive Description
put (key, value): value is the node ID of the proxy with content for the key; put can be initiated by the origin server or by a proxy that has just downloaded the content and is willing to cache and serve it; put places the key, value pair in an appropriate node
full: the full primitive identifies nodes (proxies) that can store at most l values for a key - a metadata (spatial) limitation
loaded: the loaded primitive identifies nodes (proxies) that can handle at most B requests per unit time for a key - a temporal limitation
The put operation is comprised of two phases, the forward phase and the backwards
phase. In the forward phase, the source of the put operation is attempting to place the key,
value pair in a particular node - some destination. The source will ask each node if it is full
or loaded. If not, the source will advance half the remaining distance to the next node, and continue
this cycle until it finds a node that is full or loaded.
The source will then retrace its steps to find a node that isn't full or loaded and will place its
key, value pair there. This helps to distribute key, value pair metadata across the CDN.
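A small sketch of this sloppy put (my own simplification of Coral's behavior; MAX_L and MAX_B are hypothetical limits): walk the path toward the key's home node, stop at the first node that reports full or loaded, then place the key, value pair on the last acceptable node.

```python
MAX_L = 4    # "full": at most l values stored per key (spatial limit)
MAX_B = 10   # "loaded": at most B requests per time unit per key (temporal limit)

class Node:
    def __init__(self, node_id):
        self.node_id = node_id
        self.store = {}          # key -> list of values (proxy node IDs)
        self.request_rate = {}   # key -> observed requests per time unit

    def full(self, key):
        return len(self.store.get(key, [])) >= MAX_L

    def loaded(self, key):
        return self.request_rate.get(key, 0) >= MAX_B

def sloppy_put(path_to_key, key, value):
    # Forward phase: move hop by hop toward the node closest to the key,
    # asking each node whether it is full or loaded, and remember the last
    # node that was neither.
    candidate = None
    for node in path_to_key:
        if node.full(key) or node.loaded(key):
            break                # stop early: don't pile more metadata here
        candidate = node
    # Backward phase: retrace to the last acceptable node and place the pair
    # there, spreading key, value metadata across the CDN instead of
    # saturating the single node whose ID is closest to the key.
    if candidate is not None:
        candidate.store.setdefault(key, []).append(value)
    return candidate

nodes = [Node(i) for i in (3, 7, 12)]   # hypothetical path toward the key's home node
print(sloppy_put(nodes, key=42, value="proxy-7").node_id)
```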
Quizzes
Who?
IBM
Napster
Facebook
Netflix
Akamai
Apple
Microsoft
Google
Why?
World news
E-commerce
Online education
Music sharing
Photo & video sharing
lesson10
lesson10
ts-linux
Introduction
Sources of latency
Latency source Description
timer latency: generated from the inaccuracy of the timing mechanism; the latency between the point an event happens and when the timer interrupts
scheduler latency: incurred when the application that is destined to receive an interrupt cannot be dispatched until a higher priority application is suspended
Timer Description
periodic: interrupts at regular periodic intervals; low overhead, but the worst-case timer latency is the length of the period
one-shot: exact timers; can be programmed for an exact point of delivery; extra overhead for the operating system
soft: no timer interrupt; the operating system polls at strategic times to check for events; extra polling overhead and latency
firm: combines the advantages of all of the above three timers; proposed in TS-Linux; attempts to avoid the cons of each timer type
The fundamental idea behind firm timer design is to provide accurate timing with low
overhead. It attempts to combine the mechanisms of the one-shot and soft timers.
Reminder: a one-shot timer fires exactly when we need it, however this incurs
overhead. Similarly, the soft timer incurs overhead because the operating system must
check for these events.
Firm timers implement an overshoot abstraction, a knob between hard and soft timers. The
overshoot represents the time between a one-shot timer expiration and a programmed one-
shot timer interrupt. During the overshoot time, if events occur that require kernel
interaction, such as a system call, the kernel will also identify expired one-shot timers and,
instead of waiting to dispatch them at a later time as programmed, will dispatch the timers
while the kernel is handling privileged operations. This allows the kernel to preemptively
handle one-shot timers while conducting unrelated privileged operations.
The advantage of this is that one-shot timers handled early will not cause interrupts -
saving us from the overhead incurred from the context switch that would've happened. The
firm timer design utilizing overshoots provides us the timing accuracy desired while also
avoiding the overhead of being interrupted by timers.
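A toy model of the overshoot idea (an assumption-laden sketch, not TS-Linux code): one-shot timers sit in a queue, and any kernel entry checks for timers that have expired or will expire within the overshoot window and dispatches them early, so their hardware interrupts are never needed.

```python
import heapq, itertools, time

OVERSHOOT = 0.001          # seconds: hypothetical knob between hard and soft timer behavior
timer_queue = []           # min-heap of (expiry_time, tiebreaker, callback)
_counter = itertools.count()

def arm_one_shot(delay, callback):
    # Program a one-shot timer; real hardware would also be programmed to
    # interrupt at expiry + OVERSHOOT.
    heapq.heappush(timer_queue, (time.monotonic() + delay, next(_counter), callback))

def on_kernel_entry():
    # Called on any kernel entry (system call, exception, other interrupt).
    # Firm timers piggyback here: fire timers that are already due, or due
    # within the overshoot window, avoiding a separate timer interrupt and
    # the context switch it would have cost.
    now = time.monotonic()
    while timer_queue and timer_queue[0][0] <= now + OVERSHOOT:
        _, _, callback = heapq.heappop(timer_queue)
        callback()

arm_one_shot(0.0005, lambda: print("timer dispatched early, interrupt avoided"))
time.sleep(0.0006)
on_kernel_entry()   # stand-in for an unrelated system call entering the kernel
```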
There are multiple approaches to reducing the kernel preemption latency. Two common ones are
inserting explicit preemption points into the kernel code, and allowing the kernel to be
preempted any time it is not manipulating shared data structures.
The lock-breaking preemptible kernel combines the two ideas above: acquire the lock,
manipulate a portion of the shared data, release the lock, and then allow the kernel to be
preempted before continuing.
TS-Linux avoids scheduling latency by using proportional period scheduling for admission
control and avoids priority inversion through priority-based scheduling. It reserves time for
time sensitive tasks and throughput sensitive tasks.
TS-Linux is able to provide quality of service guarantees for real-time applications running
on commodity operating systems such as Linux. Using admission control - proportional
period scheduling - TS-Linux ensures that throughput-oriented tasks are not starved for CPU
time.
pts
This lesson covers the middleware that resides between commodity operating systems and
novel multimedia applications that are realtime and distributed.
Programming paradigms
This section covers the abstractions that we currently use for distributed applications:
Parallel programs leverage the pthreads library / API for the implementation of parallel
programs.
Distributed programs leverage the sockets library / API for distributed program
implementation.
Conventional distributed programs leverage the sockets library / API to communicate with
distributed services, like a network file system, however this library doesn't have the level of
abstraction needed for emerging novel multimedia distributed applications.
Novel multimedia apps
An application can leverage this data and make decisions, prioritizing interesting data,
processing it, and acting upon it - creating a feedback loop. This feedback loop, depending
upon the data being sent from the sensors, can require computational intensive processing.
Given the example of a large-scale situational awareness multimedia application, there are
some overheads we probably want to avoid in order to increase the efficiency and usability
of the application:
Overhead Description
infrastructure overhead: large amounts of data are being generated by sensors; filtering must be done to avoid overloading the infrastructure with unnecessary data
Two abstractions are provided by PTS, threads and channels. Threads create time-
sequenced data objects and place these objects in the channels. For channels, there can be
multiple producers and consumers of the data residing in channels.
Channels basically represent the temporal evolution of a thread. Threads that wish to
consume data from channels utilize an upper-bound and lower-bound timestamp - the
channel returns the information stored within the specified timestamp range.
Below is a high-level representation of how a program could leverage the thread and
channel abstraction provided by PTS to compute and make decisions.
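Separately from that representation, here is a small Python sketch of the put/get-with-timestamps idea; the Channel class and its methods are my own stand-ins, not the actual PTS API.

```python
import bisect

class Channel:
    """Toy time-sequenced channel: producers put timestamped items, consumers
    get every item whose timestamp falls in [lower_bound, upper_bound]."""
    def __init__(self):
        self.timestamps = []   # kept sorted, i.e., the temporal evolution of the thread
        self.items = []

    def put(self, timestamp, item):
        i = bisect.bisect(self.timestamps, timestamp)
        self.timestamps.insert(i, timestamp)
        self.items.insert(i, item)

    def get(self, lower_bound, upper_bound):
        lo = bisect.bisect_left(self.timestamps, lower_bound)
        hi = bisect.bisect_right(self.timestamps, upper_bound)
        return list(zip(self.timestamps[lo:hi], self.items[lo:hi]))

camera = Channel()
camera.put(1.0, "frame-1")
camera.put(2.0, "frame-2")
camera.put(3.0, "frame-3")
print(camera.get(1.5, 3.0))   # a consumer thread asks for a time window
```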
Bundling streams
PTS provides programmers the abstraction of the groupget mechanism, where we can
bundle the acquisition of data from multiple streams at a required timestamp. This
abstraction removes the requirement of having to call get on every single source of
multimedia within a group. Programmers can identify specific streams as anchor streams,
all the other streams are dependent streams to the anchor, and all timestamps will be in
reference to the anchor.
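Continuing the toy Channel sketch above, the following (hypothetical) groupget shows the bundled acquisition: fetch from the anchor stream at the requested timestamp, then fetch from each dependent stream the item nearest to the anchor's timestamp.

```python
def nearest(channel, timestamp):
    # Item in a stream whose timestamp is closest to the requested one.
    pairs = list(zip(channel.timestamps, channel.items))
    return min(pairs, key=lambda p: abs(p[0] - timestamp)) if pairs else None

def groupget(anchor, dependents, timestamp):
    # Bundle the acquisition: one call instead of a get() per stream; all
    # dependent lookups are referenced to the anchor's timestamp.
    anchor_item = nearest(anchor, timestamp)
    return {"anchor": anchor_item,
            "dependents": [nearest(ch, anchor_item[0]) for ch in dependents]}

lidar, audio = Channel(), Channel()
lidar.put(1.1, "scan-1")
audio.put(0.9, "clip-1")
print(groupget(camera, [lidar, audio], timestamp=1.0))
```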
The power of simplicity
The below high level representation shows how PTS can simplify the sequential detection
application that was described earlier using abstractions like put and get.
Pts design principles
Below is a high level representation of the abstractions that PTS uses to make large scale
multimedia applications simple to implement.
PTS channels use time as a first class entity - meaning all data accesses are time-based,
queries can manipulate the specific time in which they want to put / get some data
PTS provides persistent streams under application control - meaning they are persisted
to hard storage like disks
Allows for seamless handling of historical and live data using the runtime system; all
operations are abstracted away from the end user / thread / application
PTS channels are similar to UNIX sockets
Component Description
Producers and consumers: get and put data into the channel
Garbage list: expired content that is no longer relevant in the context of the channel
Live channel layer: contains a chronological list of data from oldest to newest; keeps the most current events
Garbage collection thread: does housekeeping for the garbage list
Interaction layer: allows interaction between the live channel layer and the persistence layer
An application can keep data for as long as it wants; applications can decide how long
items persist. All of this occurs transparently to the user - the channel handles all of these
mechanisms internally.
lrvm
Who will use this abstraction? Subsystem designers, but only if it's performant.
If we make virtual memory persistent, we could possibly incur a lot of overhead and latency
as the virtual memory is consistently being written to disk upon each update. A solution to
this latency issue is to maintain a log of changes rather than continuously writing to disk,
essentially buffering the changes of virtual memory. Eventually when all the data is dirty
and needs to be written, we conduct one large I/O operation to write to disk.
The method described above avoids the latency incurred when writing every time, while also
avoiding the issue of writing to possibly random locations on the disk.
The below high level representation shows how a server could provide subsystem designers
the facility to maintain persistent data structures. The gist is that subsystem designers
would use an API to identify locations in memory that need to be backed by some
persistent data segment residing on the server. This avoids the possibility of persisting data
from the subsystem that isn't important, or meant to be volatile.
Reliable virtual memory (RVM) primitives
initialization
initialize(options)
provides the subsystem programmer the ability to initialize the log data structure to
begin recording changes to the application's memory
map(region, options)
application specifies the locations in memory in which it wants to persist to
external data segments
unmap(region)
undoes a map operation
begin_xact(tid, restore_mode)
begins a transaction - starts logging the changes to some location in memory and
buffers for writes to persistent memory
set_range(tid, addr, size)
sets the address range of memory to be logged for writing to persistent
storage
end_xact(tid, commit_mode)
completes a transaction and stores the changes for the memory addresses
modified into persistent memory
abort_xact(tid)
aborts the transaction, does not write to persistent memory
flush()
truncate()
done by LRVM automatically for the logs
also provided for application flexibility
miscellaneous
query_options(region)
set_options(options)
create_log(options, len, mode)
The main take away is RVM is simple in its design and provides primitives for control of its
operations. Transactions in the context of RVM are completely different than your normal
database definition, and much simpler. Transactions are only related to the
specification and persistence of memory regions.
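To show how these primitives fit together, here is a toy Python model of the call sequence (the real LRVM interface is a C library; the names below follow the simplified ones in these notes, and the in-memory undo/redo structures are my own illustration, not the actual implementation).

```python
class ToyRVM:
    """Illustrative model of the LRVM call sequence: set_range snapshots the
    old bytes (undo record), end_xact appends the new bytes to a redo log
    (which LRVM would flush to disk), abort_xact restores the snapshot."""
    def __init__(self):
        self.memory = bytearray(64)      # stands in for a mapped region
        self.redo_log = []               # what would be flushed to disk on commit
        self.undo = {}                   # tid -> list of (addr, old bytes)

    def begin_xact(self, tid):
        self.undo[tid] = []

    def set_range(self, tid, addr, size):
        # Record the old value so an abort can roll the change back.
        self.undo[tid].append((addr, bytes(self.memory[addr:addr + size])))

    def end_xact(self, tid):
        # Commit: record the new value of every declared range in the redo log,
        # then drop the undo records.
        for addr, old in self.undo.pop(tid):
            self.redo_log.append((addr, bytes(self.memory[addr:addr + len(old)])))

    def abort_xact(self, tid):
        for addr, old in self.undo.pop(tid):
            self.memory[addr:addr + len(old)] = old

rvm = ToyRVM()
rvm.begin_xact(tid=1)
rvm.set_range(tid=1, addr=0, size=4)
rvm.memory[0:4] = b"data"            # in-place update of the mapped region
rvm.end_xact(tid=1)                  # changes now recorded in the redo log
```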
Below is a high level representation of how RVM supports transactions, creates undo
records, creates redo logs, and flushes changes to disk.
Transaction optimizations
RVM allows developers to give hints to the library about when to utilize undo and redo
logs. You can forgo the undo log if you know the changes you want to make will be
committed. You can also defer the flushing of a redo log if you know future changes are
going to occur.
If a transaction's flush is deferred, this creates a window of vulnerability that the developer
has to accept as risk to their data being lost on system failure. Transactions can be seen as
insurance for a developer's memory for an application.
Implementation
Below is a high level representation of how RVM recovers from crashes using the redo log
record hosted on the disk.
Log truncation
Below is a high level representation of how RVM conducts log truncation. Essentially, LRVM
determines that the redo log has reached a critical size and some of the redo log needs to
be written to disk. The log is split into epochs, with a truncation epoch, a current epoch, and
a new record space generated on the truncation of an epoch.
Servers wait for transactions to occur, and those are stored in the current epoch. Truncation
epochs are moved to permanent storage, and a new record space is generated in tandem
(parallel) with forward processing for transactions.
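A simplified sketch of that epoch split (my own model, with a hypothetical TRUNCATION_THRESHOLD): once the redo log crosses a critical size, the existing records become the truncation epoch and are applied to the external data segment, while new transactions append to a fresh current epoch.

```python
TRUNCATION_THRESHOLD = 4   # hypothetical "critical size" in log records

def maybe_truncate(redo_log, data_segment):
    # redo_log: list of (addr, new_bytes) records; data_segment: a bytearray
    # standing in for the external data segment on disk.
    if len(redo_log) < TRUNCATION_THRESHOLD:
        return redo_log                     # nothing to truncate yet
    truncation_epoch = redo_log             # records being moved to permanent storage
    current_epoch = []                      # new transactions append here in the meantime
    for addr, new_bytes in truncation_epoch:
        data_segment[addr:addr + len(new_bytes)] = new_bytes
    return current_epoch                    # the truncated log: a fresh current epoch
```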
riovista
System crash
There are two problems that lead to a system crash: power failures and software crashes.
RioVista proposes a solution, using hardware to solve our problem of losing information at
power failure (e.g., a UPS). This makes the problem of power failures disappear, making
the only source of failure software crashes. RioVista proposes to store a portion of main
memory in a system that will survive crashes, both software and hardware.
LRVM revisited
Below is a high level representation covering the steps of a transaction using LRVM.
The biggest issue with LRVM is that the in-memory undo records and any unflushed redo log records are lost during a power failure.
Rio file cache
Below is a high level representation of how Rio creates a persistent file cache using
hardware that has a backup UPS power supply. The file cache also has virtual memory
protection to protect it from stray writes by the operating system while the OS is crashing upon power
failure.
Vista RVM on top of Rio
Vista RVM is similar to LRVM, however, it leverages the Rio file cache. The undo log that was
previously created in RAM is now hosted in the persistent memory of the Rio file cache. The
data segments that are identified as persistent by the programmer are also mapped to data
segments in the Rio file cache.
If the transaction is aborted, the undo log is merged into the memory, thus it's
also merged into the mapped portion of the file cache. If the transaction is committed, the
undo log is destroyed. There is no need to flush a redo log like LRVM, however, because all
the changes have already occurred - memory between the application and the file cache
was mapped the entire time.
The implication for this method of transactions is that there is no disk I/O at all in order to
create persistent memory.
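Contrasted with the LRVM sketch earlier (again an assumed model, not Vista's code): because the application's persistent region is mapped straight into the battery-backed Rio file cache, commit is just discarding the undo copy, abort is copying it back, and crash recovery is an abort of every in-flight transaction.

```python
class ToyVista:
    """Illustrative Vista-style transaction over a persistent file cache."""
    def __init__(self, size=64):
        self.file_cache = bytearray(size)   # battery-backed, survives crashes
        self.undo = {}                      # tid -> (addr, old bytes), also kept in the cache

    def begin_xact(self, tid, addr, size):
        # The undo copy itself lives in the persistent file cache.
        self.undo[tid] = (addr, bytes(self.file_cache[addr:addr + size]))

    def end_xact(self, tid):
        del self.undo[tid]                  # commit: the mapped data is already persistent

    def abort_xact(self, tid):
        addr, old = self.undo.pop(tid)
        self.file_cache[addr:addr + len(old)] = old

    def recover(self):
        # Crash recovery is treated like an abort of every in-flight transaction.
        for tid in list(self.undo):
            self.abort_xact(tid)

vista = ToyVista()
vista.begin_xact(tid=1, addr=0, size=4)
vista.file_cache[0:4] = b"data"             # update goes straight to persistent memory
vista.end_xact(tid=1)                       # no redo log, no disk I/O
```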
Below is a high level representation of how transactions are conducted with Vista RVM and
Rio file cache.
Crash recovery
Crash recovery is treated just like an abort. When the operating system boots, the file cache
will recover the old image of an application from the undo log and undo the changes to the
application.
Vista simplicity
Vista's performance is by virtue of its simplicity. Why is it so simple? There are no redo logs,
no log truncation code. Checkpointing and recovery code are simplified because all
transactions utilize the Rio file cache. No group commit optimizations have to be made because
commits are automatic; the only data structure maintained is the undo log.
Quicksilver is a project started by IBM in 1984. The main argument Quicksilver raises is that
system recovery should not be an afterthought to operating system design, but a first class
citizen - implemented into the core of an operating system. There are a lot of parallels
between Quicksilver's operating system structure and the structure of a network operating
system. Both have microkernels handling smaller, privileged operations, and dispersed
server processes that implement subsystems of the operating system.
Clients and servers within the Quicksilver operating system conduct inter-process
communication by using a service queue. The service queue is globally accessible, and
different servers can handle client requests.
The IPC mechanism provides the ability for synchronous and asynchronous calls. The
most important point is that recovery mechanisms are built into the IPC mechanism -
integrated as described before. Recovery is not taking a backseat in Quicksilver.
Below is a high level representation of the IPC mechanism used in Quicksilver's system
services.
Below is a high level representation of how a transaction tree expands as a client, server
interaction occurs to acquire files from a data server. Keep in mind, the owner can select
other nodes to be transaction managers so that cleanup can be done if the owner fails.
Distributed transaction
Below is a high level representation of how a distributed transaction occurs. At each node,
they have a transaction manager that will occasionally log the occurrences, results, and
data of transactions. At the top, the transaction coordinator is the client. Clients are usually
the most brittle of transaction managers, thus, other transaction managers are aware when
other managers crash.
The transaction coordinator can send vote requests, abort requests, and end commit /
abort messages. And all the other transaction managers will contact the coordinator and
acknowledge the completion of the request.
So say the client conducts a transaction to open a page on a window manager. If that client
closes, the delegated coordinator after the client will notify the window manager that the
transaction is aborted. The window manager can then acknowledge the abortion and
cleanup any state that it had for the transaction.
Upshot of bundling IPC and recovery management
An advantage of this setup is that Quicksilver provides a better ability to clean up breadcrumbs left
behind by failed clients and servers (memory, file handles, communication handles,
orphaned windows / processes, etc.).
No extra communication is required for recovery because it is all bundled in the transaction
/ IPC mechanisms. The overhead that exists for recovery is similar to LRVM - log records
must be written and flushed to disk to maintain persistent state.
Implementation
Services must choose mechanisms that match their recovery requirements. Some services
need large amounts of persistent state, while others may need to only persist some data
every once in a while.
lesson11
lesson11
principles of information security
terminologies
Term Definition
security: describes the techniques that control who may use or modify the computer or the information contained in it
They also define potential security violations, and these can be placed into three categories:
Category Description
unauthorized information release: an unauthorized person is able to read and take advantage of information stored in the computer
unauthorized information modification: an unauthorized person is able to make changes in stored information - a form of sabotage
unauthorized denial of use: an unauthorized person prevents legitimate users from reading or modifying the information stored in the computer
The goal of protecting a computing system is to prevent all of these potential security
violations. This is a negative statement, however, because it would be hard to prove that all
bugs have been found and remedied within every piece of software for a computing
system.
Levels of protection
Protection level Description
unprotected: no levels of protection; e.g., MS-DOS (contained mistake prevention, but no protection)
controlled sharing: access lists associated with files
user programmed sharing controls: similar to UNIX file system access rights (user, group, world)
Design principles
The paper mentioned above also identifies eight design principles that go hand-in-hand
with the levels of protection identified above:
Principle Description
economy of mechanisms: easy to verify
fail safe defaults: explicitly allow access to information (deny by default)
complete mediation: the security mechanism should not take any shortcuts
open design: the design of the mechanism should not rely on secrecy
separation of privileges: two keys held by two different entities are required to access
least privilege: every program and user operates with the fewest privileges necessary
least common mechanism: only place mechanisms in system context where they are necessary (if a common library conducts no privileged function, it must remain in user space - no point in placing code within kernel space)
psychological acceptability: the mechanisms are easy to use by users, else they won't follow security mechanisms / enforcement
Takeaways
Craft the system so that it's difficult to crack the protection boundary - computationally
infeasible.
Detection of security violations rather than prevention.
security in andrew
The purpose of the Andrew file system was to create a secure, distributed file system for
students to use on campus at any terminal. Students would be able to access their
personal files on any computer on campus, with security being enforced at every terminal
while the files were shared on a central server.
Andrew architecture
Below is a high level representation of the Andrew secure, distributed file system
architecture.
Below is a table with the description for each component of the architecture:
Component Description
vice: the collection of trusted servers and the secure network connecting them; hosts the shared files
virtue: the client workstation, which is not trusted by the system
venus: the client-side process running on each virtue workstation that communicates with vice on the user's behalf
Encryption refresher
Challenge Description
authentication of users: when a user logs in, the system has to verify that the user logging in is who they say they are
authentication of the server: when a user logs into the workstation and receives a message from the server, the user has to be assured that the server they're communicating with is the actual Andrew server
isolation of users: users are shielded from one another, from either unintentional or intentional actions
In order to conduct this method of security for communication, the identity of the sender
has to be communicated in clear text.
A dilemma presented to the designers is that, at this time, the normal method of
authentication was regular UNIX usernames and passwords. However, from the discussion
from the previous chapter, the overuse of a security mechanism can expose that security
mechanism to attack. Utilizing usernames and passwords for everything, to include the
secure RPC between virtues and the vice, presents a security hole and an opportunity for
attackers to violate the security of the system.
So what should the designers of the Andrew secure, distributed file system use as the
identity and private key in order to conduct secure RPC between the virtues and the vice?
Andrew solution
The solution implemented by the designers of Andrew to the predicament identified above
is as follows:
username and password were only used for logging in to the actual workstation
all communication between venus and vice uses an ephemeral id and keys
Class Description
RPC session establishment: venus establishes a session with the vice to retrieve files for work
file system access during session: leverages the venus RPC session with the vice to retrieve files for work. The RPC session is closed once the files are retrieved and stored locally within the cache. Once you're done working on the files, the files are returned to the server over a new venus-to-vice session.
Login process
Below is a high level representation of how a login is conducted within the Andrew secure,
distributed file system.
Users login with a username and password securely over the insecure links and are
presented with a pair of tokens, the cleartoken and the secrettoken.
Token Description
secrettoken: the cleartoken encrypted with a private key known only to vice - used as the ephemeral client id for this login session
handshake key client: carried inside the cleartoken; used as a private key for establishing a new RPC session
This system prevents the username and password being communicated over the insecure
link too often. The secret token will be used as the client id for all further communication,
and the handshake key will be used for encryption and establishing RPC sessions.
RPC session establishment
Below is a high level representation of how RPC sessions are established between venus
and the vice in the Andrew secure, distributed file system.
The random number (Xr) is incremented each time a handshake interaction occurs in
order to establish that the server is genuine.
The server provides a random number (Yr) that is incremented by the client each time a
handshake interaction occurs in order to establish that the client is genuine.
Once the handshake is complete, the server creates a sequence number for the rest of
the RPC session.
Once the handshake is complete, the server creates a session key for the rest of the
RPC session.
The mechanism of incrementing these random numbers avoids replay attacks by a person
sniffing traffic.
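A toy model of this challenge-response exchange (the encryption is stubbed out with a hypothetical xor_crypt helper; the real system encrypts with the handshake key using a proper cipher): each side proves it holds the shared key by returning the other's random number incremented by one.

```python
import os, random

def xor_crypt(key: bytes, data: bytes) -> bytes:
    # Stand-in for real encryption/decryption with the shared handshake key.
    return bytes(d ^ key[i % len(key)] for i, d in enumerate(data))

handshake_key = os.urandom(8)        # handshake key known to the client and to vice

def client_start():
    xr = random.randrange(1 << 16)   # Xr: the client's challenge
    return xr, xor_crypt(handshake_key, xr.to_bytes(4, "big"))

def server_respond(challenge):
    # Server proves itself by returning Xr + 1 and issues its own challenge Yr.
    xr = int.from_bytes(xor_crypt(handshake_key, challenge), "big")
    yr = random.randrange(1 << 16)
    reply = xor_crypt(handshake_key,
                      (xr + 1).to_bytes(4, "big") + yr.to_bytes(4, "big"))
    return yr, reply

def client_finish(xr, reply):
    plain = xor_crypt(handshake_key, reply)
    assert int.from_bytes(plain[:4], "big") == xr + 1, "server is not genuine"
    yr = int.from_bytes(plain[4:], "big")
    # Client proves itself by returning Yr + 1.
    return xor_crypt(handshake_key, (yr + 1).to_bytes(4, "big"))

xr, challenge = client_start()
yr, reply = server_respond(challenge)
proof = client_finish(xr, reply)
assert int.from_bytes(xor_crypt(handshake_key, proof), "big") == yr + 1
print("handshake complete; the server would now issue a session key and sequence number")
```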
Special notes:
The username and password are exposed over the encrypted link only once per login session.
The handshake key is only valid for the initiation for a new RPC session.
The session key is only valid for the duration of the RPC session.
The secrettoken provided by the vice that represents the login session is the only piece
of data that exists for the duration of the user's session with the Andrew file system.
Yes
No
Yes, the Andrew file system enforces suspicion of other users as well as suspicion of the
server.
Does the Andrew file system support protection from system for users?
Yes
No
There's no way to protect the user from the system. The user must trust the system not to
modify its files hosted on the system.
Yes
No
Users can consume a lot of network bandwidth and cause a denial of service.
Yes
No
Yes
No
The servers in the vice are protected by a firewall, but communication between the
servers is not encrypted. If an attacker compromises the vice, the Andrew file system's
integrity is compromised. The Andrew file system has to be physically secured in order to
support integrity.