CC - AB Notes
Unit 1
Aronya Baksy
January 2021
2 Computing Paradigms
• Centralized Computing: All the compute resources (storage, memory, CPU etc.) are held in
one central location, tightly coupled and shared among all clients. They are accessed from terminal
machines (eg: some datacenters, supercomputers)
• Parallel Computing: Multiple processors are either tightly coupled (shared memory) or loosely
coupled (distributed memory). Inter-Processor communication is done via shared memory or mes-
sage passing.
• Distributed Computing: Multiple autonomous compute nodes that each have their own private
memory and communicate via a network. Message passing is used as the mechanism for this
communication.
• The end goal of grid computing is to allow computational power to be offered as a utility (like electricity).
• The following are the benefits of grid computing
• One of the features of a cluster is the ability to merge the multiple systems into a Single System Image (SSI).
• An SSI is an illusion created by software or hardware that presents a collection of resources as one
integrated, powerful resource. SSI is implemented as a middleware layer (in hardware/software)
that presents CPU cores/IO devices/Disks as a single unit shared across all cluster nodes.
• Middleware support is needed to implement SSI as well as high availability (HA) which consists of
fault tolerance and recovery mechanisms.
• Instead of implementing SSI at many different levels, virtualization is used to create virtual clusters from a smaller number of actual nodes.
Parallel Computing                           | Distributed Computing
Use of a single compute node                 | Multiple compute nodes
Tasks run on multiple cores on a single chip | Tasks run on a network of computers
Shared or distributed memory                 | Distributed memory only
Processors communicate through a bus         | Processors communicate via message passing
Goal: improve system performance             | Goal: improve scalability, fault tolerance, resource sharing
• Virtualization technology
• Provision of compute and storage resources as a service. Physical resources are abstracted into
virtual containers and presented to the user.
• These virtual resources are allocated on demand to the user, and configured by the user to run any
software applications.
• IaaS has the greatest flexibility but the least application automation from the standpoint of the
user. It allows the user to have complete control over the software stack that they run.
• Building blocks of IaaS are:
– Physical data centers (large collections of server racks with multiple physical machines inside
each rack), managed by IaaS providers.
– Compute: the ability to provision VM instances with CPU/GPU configs depending on work-
load. Also provided are auto-scaling and load balancing services
– Networking: software abstraction of network devices like switches/routers, available typically
through APIs.
– Storage: either block, file or object storage. Block and file storage are the same as found on
traditional data centers, but struggle with scaling, performance and the distributed nature
of the cloud. Object storage on the other hand is infinitely scalable, accessible via HTTP,
works well with distributed systems like the cloud, uses commodity hardware and allows linear
growth of performance wrt cluster size.
• Advantages of IaaS:
– Flexible
– Control
– Pay-as-you-go
– Faster deployment
– High availability
• Disadvantages of IaaS:
– Security threats sourced from host or other VMs.
– Multi-tenant security: new VM users must not be able to access data left behind by previous users.
– Internal resources and training, i.e. the need to train IT managers in the use of IaaS management.
• IaaS providers: Google Compute Engine, AWS Elastic Compute Cloud (EC2), MS Azure VMs,
DigitalOcean Droplets
• Disadvantages of PaaS:
– Operational limitations (lack of control) due to management automation workflows (available on some PaaS providers) that affect provision, management and operation of PaaS systems
– Vendor lock-in
– Runtime issues: specific versions of frameworks may not work with the platform, or platform
may not be optimized for the frameworks/language used
– Security: limited control over hosting policies, risks with storing data on cloud servers
– Integration and customization with legacy services (like data residing on an existing data
center) is more complicated and outweighs the cost saving involved in switching to PaaS.
• PaaS providers: AWS Elastic Beanstalk, Azure DevOps, Google App Engine
• Advantages of SaaS:
– Flexible payment scheme, pay-as-you-go model
– High vertical scalability
– Automatic update of software
– Accessibility over the internet
• Disadvantages of SaaS:
– Security of data on cloud servers
– Greater latency in interaction with app, as compared to local deployment
– Total dependency on internet
– Vendor lock-in
5 Cloud Deployment Models
5.1 Public Cloud
• The infrastructure is owned by a cloud provider, an entity (individual or company) pays the provider
for access to this infrastructure.
• Resources are virtualized into pools and these pools are allocated among multiple clients that are
using the cloud provider’s infrastructure (multi-tenancy).
• Access to these resources is done over the internet and its associated protocols (SSH, FTP etc).
• The factors that make a particular cloud infrastructure public are: resource sharing using virtualiza-
tion, usage agreements on resources (pay-as-you-go may or may not be present), and management
(provider maintains hardware, networking and virtualization at the minimum)
5.1.1 Advantages
• Low cost
5.1.2 Disadvantages
• Security
• Compliance with security standards and government rules on data security
• Interoperability and vendor lock-in
5.2.1 Advantages
• More control over resources and hardware
5.2.2 Disadvantages
• Cost
• Under-utilization of resources
• Platform scaling: upward changes in requirements need scaling of physical infrastructure, as against simple scaling of virtual instances on a hosted cloud
5.3 Hybrid Cloud
• A mix of data centers maintained by the organization and hosted cloud infrastructure, connected
by a VPN
• A hybrid cloud model allows enterprises to deploy workloads in private IT environments or public
clouds and move between them as computing needs and costs change
• This gives a business greater flexibility and more data deployment options. A hybrid cloud workload
includes the network, hosting and web service features of an application.
• The OS and its resource sharing/scheduling policies determine the system image of the cluster
• A specific case is the 3-tier model, where client intelligence is moved to a middle tier to enable the use of stateless clients. This simplifies app deployment.
6.2 Interaction Models
6.2.1 Synchronous Distributed System
• All the components of the distributed system run on a common clock. The features of a synchronous
distributed system are:
1. Upper bound on message delivery time between processes or between nodes
2. Message delivery is always in order
3. Ordering of events happens at a global scale due to a shared clock
4. Lock-step based execution, meaning that similar operations performed by different nodes in parallel complete at the same time, not at different times.
• These systems are predictable in terms of timing behaviour, hence must be used for hard real-time
systems.
• Send-omission: Process completes send but message never reaches outgoing buffer
• Receive-omission: Process does not receive a message put in its incoming buffer
• Arbitrary (Byzantine): Arbitrary behaviour wrt message send/receive actions, or omissions, or
stopping/incorrect actions.
7 Business Drivers for Cloud Computing
• Cost: Low upfront cost of hardware, reduced investment in future scalability, reduced costs of resource under-utilization, and reduced management costs
• Assurance: Delegation of management responsibility to a cloud provider reduces need for skilled
IT admins and departments, while still maintaining high standards of security and availability.
• Agility: Faster response to customer requests for new services, due to faster deployment of new
services on the cloud. Also changing business requirements can be better handled
• Flexibility and Scalability: Easy to expand resources to meet increased workload
• Efficiency and improved customer experience: cloud computing allows streamlined enterprise workflows which result in better workplace productivity, and hence faster business growth
• The use of common communication standards is done for easy integration with existing services
(standard network protocols like HTTP/JSON, HTTP/SOAP are used to send requests for various
operations)
• Each service in an SOA embodies the code and data integrations required to execute a complete,
discrete business function. Services are loosely coupled, meaning that no underlying knowledge of
the service implementation is needed to use it.
• Two common SOAs are REST and Web Services
8.2 REST
• REpresentational State Transfer (REST) is an architectural style for distributed systems, used for
providing communication standards between APIs over the internet
• REST-compliant systems (aka RESTful systems) are characterized by their stateless behaviour and
the separation of concerns between server and client.
• A safe REST operation is one that does not modify any data
• An idempotent REST operation is one that does not change the state when applied multiple
times beyond the first time.
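• For example, in HTTP terms: GET is both safe and idempotent, PUT and DELETE are idempotent but not safe, and POST is neither safe nor idempotent.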
• REST architectural style is based on:
4. Stateless Interaction: Server and client need not maintain each other’s state, and a message
can be understood without referring to any past messages (all messages are independent).
Statelessness has the benefits of:
– Client is isolated against changes on the server
– Promotes redundancy and improves performance due to reduced synchronization over-
heads
State is normally maintained (only if needed) through compact and lightweight text objects
called cookies.
• Discovery Protocol: Centralized registry for web services to publish their location and description, as well as for clients to discover available services (UDDI not yet widely adopted)
8.3.2 SOAP
• Simple Object Access Protocol (SOAP) provides a standard packaging structure for transmission of XML documents over HTTP, SMTP or FTP. It allows interoperability between different middleware systems.
• Root element of SOAP message is called the envelope, which contains a:
1. Header : Authentication credentials, and routing info/transaction management/message pars-
ing instructions
2. Body: payload of the message
8.3.3 WSDL
• Web Services Description Language gives description of interface for web services (in terms of
possible operations)
• Standardized representation of input, output parameters, protocol bindings.
• Allows heterogeneous clients to communicate with the web service in a standardized manner
8.3.4 UDDI
• Universal Description, Discovery and Integration standard, a global registry for advertising and discovery of web services
• Search by name, ID, category or specification implemented
9 Models for inter-process communication
• Interaction between processes can be classified along two dimensions:
– First dimension: one-to-one vs one-to-many
– Second dimension: asynchronous vs synchronous
• Message queues can be used to decouple heavyweight processing, to buffer or batch work, and to smooth spiky workloads.
• Producer adds messages to the queue, Consumer reads messages from the queue and processes
them.
• A single queue can be used by multiple producer-consumer pairs, but each message can be read by only one consumer. Hence this model is used for point-to-point communication.
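• A minimal sketch of this point-to-point model in Python, using the standard library's thread-safe queue.Queue as a stand-in for a real message broker (names are illustrative):

    import queue
    import threading

    q = queue.Queue()  # thread-safe FIFO acting as the message queue

    def producer():
        for i in range(5):
            q.put(f"task-{i}")   # producer adds messages to the queue
        q.put(None)              # sentinel marking the end of the stream

    def consumer():
        while True:
            msg = q.get()        # each message is delivered to exactly one consumer
            if msg is None:
                break
            print("processed", msg)

    threading.Thread(target=producer).start()
    consumer()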
9.2 Pub-Sub Model
• The following are the components of a pub-sub communication model:
9.3 REDIS
• Remote Dictionary Server, a fast, open-source, in-memory key-value data store
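• As a sketch of Redis's pub-sub support (assuming the redis-py client and a Redis server on localhost:6379):

    import redis

    r = redis.Redis(host="localhost", port=6379)

    sub = r.pubsub()
    sub.subscribe("news")          # subscriber registers interest in a channel

    r.publish("news", "hello")     # publisher sends; every subscriber gets a copy

    for message in sub.listen():   # yields subscription events, then data messages
        if message["type"] == "message":
            print(message["data"])  # b'hello'
            break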
10.1 Service Oriented Architecture
• SOA breaks up the components required for applications into separate service modules that com-
municate with one another to meet specific business objectives.
• Microservice architecture is generally considered an evolution of SOA as its services are more
fine-grained, and function independently of each other
• Service decomposition leads to management and infrastructure overheads which can be resolved
using containerization technologies to simplify deployment and configuration vastly
• Shared database tables, as well as current state information used by multiple components of a monolithic application, can be modelled as a separate independent service
• Writing integration test cases is challenging, as it requires knowledge of all microservice components and because such apps are asynchronous
• Solution is adopting various testing methodologies and tools and leveraging continuous integration
capabilities through automation and standard agile methodologies
Cloud Computing (UE18CS352)
Unit 2
Aronya Baksy
February 2021
1 Introduction
• Virtualization is a framework for dividing a single hardware resource (compute or storage) into
multiple independent environments.
• This is done by applying concepts such as h/w and s/w partitioning, emulation, etc.
• A virtual machine (VM) is a complete compute environment with its own processing capability,
as well as memory and communication channels. It is an efficient, isolated duplicate of the physical
machine, with the ability to run a complete operating system.
• A hypervisor (also called a virtual machine monitor or VMM) is a software layer that is
responsible for creation and management of Virtual machines.
2 Types of Virtualization
2.1 Type 1 virtualization
• Type 1 hypervisors are installed directly on top of bare-metal hardware, and they have direct
control over hardware resources.
• Type 1 hypervisors behave like OSes with only virtualization functionality, and a limited GUI for
administrators to configure system properties.
• Type 1 hypervisors offer simpler setup (provided that compliant hardware exists), more scalability, more security and higher performance than type 2 hypervisors.
• e.g.: Xen, Oracle VM (based on Xen), VMWare ESXi server, Microsoft Hyper-V
2.2 Type 2 virtualization
• This type of hypervisor runs within a host OS that runs on top of physical hardware. For this
reason type 2 virtualization is also called hosted virtualization.
• They have interfaces to act as management consoles for all the deployed VMs
• Type 2 hypervisors offer simpler setup, but less scalability, larger performance overheads and less security than type 1.
• e.g.: VMWare Workstation, Oracle VirtualBox
• The hypervisor emulates the effect of such sensitive instructions so that the guest OS still gets the impression that it is running in kernel mode when it is actually not.
• In trap-and-emulate virtualization, the:
– Guest applications run on ring 3
– Guest OS runs on ring 1
– VMM runs on ring 0
• When a guest app in ring 3 issues a system call, an interrupt is issued to the guest OS in ring 1.
• The interrupt handler in the guest OS runs the system call routine. When a privileged instruction
is encountered as part of this routine, the guest OS kernel issues an interrupt to the VMM.
• The VMM emulates the functionality of that privileged instruction, returns control to the guest
OS.
• Essentially, trap-and-emulate is a method of fooling a guest OS (that is actually running on ring
1) into thinking that it is running in the kernel space on ring 0.
3.1 Issues with trap-and-emulate
• Some registers in the CPU reflect the actual privilege level. If the guest OS were to read these
registers and detect that it is not running in kernel mode it might stop functioning normally.
• Some instructions that change system state run in both kernel and user space, but with different semantics. This might lead to the guest not trapping to the VMM in case of a privileged instruction being encountered.
• High performance overheads in processing interrupts.
• Not all ISAs support trap-and-emulate out of the box. Most notably, Intel's x86 ISA did not support trap-and-emulate for a long time.
• There are 17 such instructions in the x86 ISA. Instructions like pushf reveal to the guest that it is
running in user mode, while instructions like popf discussed above do not execute accurately.
1. The VMM reads the next upcoming basic block of instructions. (By basic block we mean a block of instructions from the current point till the next branch)
2. Each instruction in this basic block is translated to the target ISA, and the result is stored in
a translation cache.
3. Translation involves 3 types of instructions:
– Instructions that can be directly translated and are safe (called ident instructions)
– Short instructions that must be emulated using a sequence of safe instructions (eg: inter-
rupt enable). This is called inline translation
– Other dangerous instructions need to be performed by emulation code in the monitor. These are called call-out instructions. (eg: instructions that change the PTBR).
3.5 Hardware-Assisted Virtualization
• The challenges of virtualizing x86 are outlined in section 3.2, and the methods to solve them were
adopted as part of Intel’s VT-x and AMD’s AMD-V feature set
• The CPU now has 2 modes of operation, a root mode and a non-root mode.
• Both root and non-root mode have 4 rings. The current hardware state is maintained separately
for both modes.
• The root mode is more powerful than the kernel mode. The host OS and VMM run in root mode,
while the guest OS and applications run in non-root mode.
• If any sensitive instructions are executed in non-root mode, a VMEXIT condition signals to the
processor to enter root mode. In root mode this sensitive operation is emulated by the VMM and
the processor switches back to non-root mode.
• The hardware state of a VM is maintained in a data structure called the Virtual Machine
Control Structure (VMCS). The VMM is in charge of creating the VMCS and modifying it
(when emulating sensitive instructions).
• While the guest is active, the VMM forces the processor to use the shadow page table (SPT) for all translations.
• Whenever the guest OS modifies the guest page table, the VMM must update the shadow page
table. This is implemented by making the guest page table write protected.
• This means that whenever the guest OS tries to write to the guest page table, a page fault is raised,
and a trap is set to the VMM. The VMM handles the trap and modifies the SPT.
• For every guest application there is one shadow page table. Every time a guest application context-switches, the processor traps to the VMM, which changes cr3 to point to the new shadow page table.
• The drawbacks of the shadow page table concept are the overheads involved in handling traps, and the fact that the TLB cache has to be flushed on every context switch.
4.1.2 Extended Page Tables
• The processor is made aware of the virtualization, and the two-level address translation that is
needed to support it.
• Guest-physical addresses are translated by traversing a set of EPT paging structures to produce
physical addresses that are used to access memory.
• A field in the VMCS maintains a pointer to the Extended page table, called the EPT Base Pointer.
• Benefits of EPT:
1. Performance increased due to reduced overheads over shadow paging (performance increase
is dependent on type of workload)
2. Reduced memory footprint compared to SPT scheme that requires maintaining of a table for
each VM that is started.
• The I/O access requests of the guest OS are trapped to the VMM which interacts with the I/O
devices.
• The front end and back end drivers interact with each other via shared memory.
• The front end driver intercepts I/O requests from the guest OS. The back end driver manages the physical I/O hardware as well as multiplexing the I/O data coming from different VMs
• Performance-wise, para I/O virtualization is better than full device virtualization, but it comes with a high CPU overhead.
• The VMM utilizes and configures technologies such as Intel VT-x and Intel VT-d to perform address
translation when sending data to and from an IO device.
• Advantage of faster performance, but limited scalability (as a single I/O device can only be assigned
to a single VM).
• Goldberg and Popek classified the instructions in an ISA into the following categories:
1. Behaviour sensitive instructions are those wherein the final result of the instruction is
dependent on the privilege level (i.e. executing that instruction in a lower privilege level leads
to a wrong output)
2. Control sensitive instructions are those which result in change of processor state or processor
privilege.
3. Privileged instructions are those that trap if the processor is in user mode and do not trap if it is in system mode (i.e. kernel or supervisor mode).
5.2 Theorems
5.2.1 Theorem 1
”For any conventional third generation computer, a VMM may be constructed if the set of sensitive
instructions for that computer is a subset of the set of privileged instructions”
• The theorem states that to build a VMM it is sufficient that all instructions that could affect the correct functioning of the VMM (sensitive instructions) always trap and pass control to the VMM.
5.2.2 Theorem 2
”A conventional third generation computer is recursively virtualizable if it is:
1. virtualizable, and
2. A VMM without any timing dependencies can be constructed for it.
”
5.2.3 Theorem 3
”A hybrid virtual machine monitor may be constructed for any conventional third generation machine
in which the set of user sensitive instructions are a subset of the set of privileged instructions.”
• Live migration is desired when load on the cluster becomes unbalanced and real-time correction is
needed.
• Migration also allows for scalability (up and down) as well as rapid provisioning.
• During the pre-copy phase, the functioning of the source VM is not interrupted.
6.2 Pre-copy and post-copy migration
• In pre-copy migration, the aim is not to impact the functioning of the source VM. However since
the migration daemon is making use of the network to transfer dirty pages, there is a degradation
of performance that occurs.
• Adaptive rate-limited migration is used to mitigate this to an extent.
• Moreover, the maximum number of iterations must be set because not all applications’ dirty pages
are ensured to converge to a small writable working set over multiple rounds.
• In post-copy migration, the migration is initiated by stopping the source VM; a minimal subset of the execution state of the VM is transferred to the target, and the VM is then resumed at the target.
• Concurrently, the source actively pushes the remaining memory pages of the VM to the target -
an activity known as pre-paging.
• At the target, if the VM tries to access a page that has not yet been transferred, it generates a
page-fault. These page faults are trapped, sent to the source and the source replies with the page
requested.
7 Lightweight Virtualization
7.1 Containers
• Containers are a logical packaging mechanism where the code and all of its dependencies are
abstracted away from their run time environment.
• This allows for much easier deployment on a wide variety of hardware, as well as more effective isolation and much lower CPU/memory overheads.
• Containers are an example of OS-level virtualization, and multiple containers running on a host
share the same OS. Similar to a VMM for full-scale virtual machines, containers are managed by
a container manager.
• Examples of real world implementation of container technology are Docker, Google’s Kubernetes
Engine, AWS Fargate, Microsoft Azure etc.
7.2 Docker
• Docker is a product that is used to deliver software in the form of containers, and it makes use
of Linux technologies that promote OS-level virtualization such as cgroups, namespaces and
others.
• Docker consists of 3 components:
1. The Docker engine
2. The Docker client (normally a command line interface which is called the Docker CLI)
3. The container registry
• The Docker daemon (dockerd) listens for Docker API requests and manages Docker objects such as
images, containers, networks, and volumes. A daemon can also communicate with other daemons
to manage Docker services.
• The container registry stores Docker images. An example of a publicly-available registry is Docker
Hub. By default, the docker pull and docker run commands pull the needed images from
Docker Hub.
• It is possible to configure Docker to look elsewhere for images, including one’s own privately set
up registry.
• The Docker Engine is a client-server program. The Docker CLI acts as a client and uses the Docker API to send requests. The engine listens for these requests, and sends them to the Docker daemon running on the server.
7.2.1 Docker Images
• A Docker image is a read-only template that is used to set up a running container.
• Layers that change most often are organized at the top, so that a minimal number of layers needs to be rebuilt each time a change occurs (when a layer changes, only the layers above it must be rebuilt).
• When a container is launched from an image, a thin writable layer called the container layer is
added at the top. The container layer stores all the changes made to the container state as it runs.
• This allows for multiple containers to share the same image layers but only have their distinct
container layers at the top.
• A Dockerfile is a plain-text file that specifies the steps involved in creating a Docker image.
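• A minimal illustrative Dockerfile (the image and file names here are hypothetical); note how the rarely-changing dependency layer comes before the frequently-changing application code, so that code edits invalidate as few layers as possible:

    # Base image layer (changes rarely)
    FROM python:3.10-slim
    WORKDIR /app
    # Dependency layer (changes rarely, so it is installed early)
    COPY requirements.txt .
    RUN pip install -r requirements.txt
    # Application code (changes often, so it is copied last)
    COPY . .
    CMD ["python", "app.py"]

Running docker build -t myapp . produces the read-only image layers; docker run myapp then adds the thin writable container layer on top.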
• Any mount/unmount operations that are done by the isolated process in its own mount namespace will not affect the parent mount namespace, nor any other isolated mount namespace in the hierarchy.
• Virtual network interfaces span multiple network namespaces, and allow interfaces in different namespaces to communicate with one another.
• A routing process takes data incoming at the physical interface, and routes it to the correct network
namespace via the virtual network interface
• Routing tables can be set up that route packets between virtual interfaces.
• UnionFS allows files and directories of separate file systems, known as branches, to be transparently overlaid, forming a single coherent file system.
• Contents of directories which have the same path within the merged branches will be seen together
in a single merged directory, within the new, virtual file system.
• This allows a file system to appear as writable, but without actually allowing writes to change the
file system, also known as copy-on-write.
• In the CoW mechanism, any changes that are made to any of the image layers that make up
the UnionFS, are reflected only in the topmost container layer. The image layer is copied to the
container layer FS and changes are written there.
7.4.1 Disadvantages
• Translating between different file systems' rules about file names and attributes, as well as different file systems' features.
• Copy-on-write makes memory-mapped file implementation hard
• Not appropriate for working with long-lived data or sharing data between containers, or a container
and the host.
8 DevOps on the cloud
• DevOps is an integration of Software Development methodologies and IT operations that are
involved in deployment and operation of software.
• DevOps automates the processes that occur between software development and IT teams so that software can be built, released and tested faster and more reliably.
• One of the key principles of DevOps is Continuous Integration along with Continuous Deployment and Continuous Delivery (this is commonly referred to as CI/CD).
• CI/CD promotes the practice of making small changes and integrating them with the main codebase often, and using automated deployment infrastructure to test on a production-like environment.
• The entire CI/CD sequence of stages is organized in the form of a sequential pipeline. The pipeline consists of a series of automated actions that take code from a developer environment to a production environment.
• Pipelines automate the build, test and publishing of artifacts so that they can be deployed to a
runtime environment.
• Tools such as Jenkins, Drone, and Travis CI are used for CI/CD pipeline management.
• A typical CI/CD pipeline is as follows:
– Developers push their changes to a centralized Git repository
– Build server automatically builds the application and runs unit tests and integration tests on
it
– If all tests pass then container image is pushed to the central container repository.
– The newly built container is automatically deployed to a staging environment
– The acceptance tests are carried out in this staging environment.
– Verified and tested container image is pushed to production environment.
8.3 Jenkins
• A self-contained, open source automation server which can be used to automate tasks related to
building, testing, and delivering or deploying software.
• It can be installed via package managers (apt-get, pacman), DockerHub, or natively built on a machine with the Java Runtime Environment (JRE).
• Plugins are used to extend Jenkins functionality as per the user-specific or organization-specific
needs
• Cluster management: joining multiple physical or virtual servers into a unified, reliable, fault-
tolerant group.
• Typically orchestrators take care of all 3: orchestration, scheduling and cluster management.
• Kubernetes (or K8s for short) is the most prominent example of such a software. Others are Docker
Swarm, Google Container Engine (built on Kubernetes), and Amazon ECS.
• The containers in a pod share cgroups, namespaces, storage and IP Addresses as they are co-located.
• Pods have a short lifetime, they are created, destroyed and restarted on demand.
9.1.2 K8s Service
• As pods are short-lived, there is no guarantee on their IP address, which makes communication hard
• A service is an abstraction on top of a number of pods, typically requiring a proxy to run on top for other services to communicate with it via a virtual IP address.
• Numerous pods can be exposed as a service with configured load balancing.
• The API server serves REST API requests according to the bound business logic.
• etcd is a consistent and simple key-value store that is used for service discovery and shared config
storage. It allows for CRUD operations and notification services to notify the cluster about config
changes.
• Scheduler deploys configured pods and services onto the worker nodes. It decides based on the resources available on each node.
• Controller-manager is a daemon that enables the use of various control services. It makes use of
the API server to watch the current state and make changes to the config to maintain the desired
state (e.g.: maintaining the replication factor by reviving any dead/failed pods)
Cloud Computing (UE18CS352)
Unit 3
Aronya Baksy
March 2021
1. Seek Time: The time needed for the controller to position the disk head to the correct
cylinder of the disk
2. Rotational Latency: The time needed for the first sector of the block to position itself
under the disk head
3. Transfer Time: Time needed for the disk controller to read/write all the sectors of the block.
• RAID (Redundant Array of Independent Disks) is a storage virtualization technology that com-
bines multiple physical disks into one or more logical volumes for increased redundancy and faster
performance.
• The driving technologies behind RAID are striping, mirroring and parity checking.
• DAS is only accessible from the node to which the storage device is attached physically.
• Network Attached Storage (NAS) is a file-level storage device connected to a heterogeneous
group of clients.
• A single NAS device containing physical storage devices (these may be arranged in RAID) serves
all file requests from any client in the connected network.
• NAS removes the responsibility of file serving from other servers on the network. Data is transferred
over Ethernet using TCP/IP protocol.
• Storage Area Network (SAN) is a network that provides access to block-level data storage.
• A SAN is built from a combination of servers and storage over a high speed, low latency interconnect
that allows direct Fibre Channel connections from the client to the storage volume to provide the
fastest possible performance.
• The SAN may also require a separate, private Ethernet network between the server and clients to
keep the file request traffic out of the Fibre Channel network for even more performance.
• It allows for simultaneous shared access, but it is more expensive than NAS and DAS.
• Distinct protocols were developed for SANs, such as Fibre Channel, iSCSI, Infiniband.
Figure 1: Storage Architectures
• LVM provides a method of allocating space on mass-storage devices that is more flexible than
conventional partitioning schemes to store volumes.
• The capabilities of LVM are:
1. Extend volumes while a volume is active and has a full file system (shrinking volumes requires unmounting and suitable storage requirements)
2. Collect multiple physical drives into a volume group
• LVM consists of the following basic components layered on top of each other:
– A physical volume corresponds to a physical disk that is detected by the OS (labelled often
as sda or sdb) (NOTE: partitions of a single actual disk are detected as separate disks by the
OS).
– A volume group groups together one or more physical volumes
– A logical volume is a logical partition of the volume group. Each logical volume runs a file
system.
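• As a hedged sketch of this layering using the standard lvm2 command-line tools (device and volume names here are hypothetical):

    pvcreate /dev/sdb /dev/sdc           # register two disks as physical volumes
    vgcreate vg_data /dev/sdb /dev/sdc   # group them into one volume group
    lvcreate -L 20G -n lv_home vg_data   # carve a 20 GB logical volume out of the group
    mkfs.ext4 /dev/vg_data/lv_home       # each logical volume gets its own file system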
• The /boot partition cannot be included in LVM as GRUB (the GRand Unified Bootloader, which loads the bootstrap program from the master boot record) cannot read LVM metadata.
2 Storage Virtualization
• Abstraction of physical storage devices into logical entities presented to the user, hiding the un-
derlying hardware complexity and access functionality (either direct access or network access)
• Advantages of storage virtualization are:
– Enables higher resource usage by aggregating multiple heterogeneous devices into pools
– Easy centralized management, provisioning of storage as per application needs (performance
and cost).
– The client passes this layout to a Logical Object Volume (LOV). The LOV maps the layout to objects and their actual locations on different OSTs
– The client then locks the file range being operated on and executes one or more parallel
reads/writes directly to the OSTs
– Server delivers the combined disk space of all the physical storage servers as a single file
system
– Client implements highly available, massively parallel access to each storage node along with
node failure handling
• A storage brick is a server (containing directly attached storage or connected to a SAN) on which
a file system (like ext3 or ext4) is created
• A translator is a layer between a brick and the actual user. It acts as a file system interface and
implements one single Gluster functionality
• I/O Scheduling Translators are responsible for load balancing.
• Automatic File Replication (AFR) translator keeps identical copies of a file/directory on all its
subvolumes (used for replication)
2.2.3 Network-Level BLV
• Most commonly implemented, scalable form, implemented as part of the interconnect network
between storage and hosts (e.g.: Fibre Channel SAN)
• Switch-based: the actual virtualization occurs in an intelligent switch in the network, and it
works in conjunction with a metadata manager
• Appliance-based: I/O is routed through an appliance that manages the virtualization layer
• In-band appliances perform all I/O with zero direct interaction between client and storage.
• Out-of-band appliances manage only metadata (control paths) while the actual data flows di-
rectly between client and storage server (each client having an agent to manage this)
– Access Control Lists: Set permissions to allow other users to access an object
– Audit Logs: Once enabled, stores the access log for a bucket. This enables one to identify the AWS account, IP address, time of access and operations performed by whoever accessed it.
• Data Security is maintained in S3 using:
– Replication: across multiple devices, allows for up to 2 replica failures (a cheaper option is Reduced Redundancy Storage, which survives only 1 replica failure), but consistency across replicas is not guaranteed.
– Versioning: If enabled, S3 stores the full history of each object. It allows for changes to be
undone, including file deletions.
– Regions: select location of S3 bucket for performance/legal reasons.
• S3 allows for large objects to be uploaded in parts. These parts can be uploaded in parallel for
maximum network utilization
• A table is a collection of items; an item is a collection of attribute-value pairs. A primary key identifies items uniquely in a table.
• A partition is an allocation of storage for a table, backed by SSDs and automatically replicated
across multiple Availability Zones within an AWS Region.
• RDS provides encryption at rest and in transit, as well as APIs for applications.
4 Partitioning
• Breaking down large DBs into smaller units that are stored on different machines. Each row belongs to exactly one partition
• Supports operations that touch multiple partitions at the same time.
• Motivation is scalability in terms of load balancing and query throughput, as well as fault tolerance
(when combined with replication)
• Small queries can be independently processed by one partition. Large queries can be parallelized
between multiple partitions.
• When some partitions have more data than others, they are said to be skewed. A partition with
disproportionately high load is called a hot spot
• Disadvantage: When trying to read a particular item, there is no way of knowing which node it is on, so all nodes need to be queried in parallel.
4.1.3 Partitioning by Hash of Key
• Using a suitable hash function for keys, each partition has a range of hash values assigned to it
(rather than a range of keys), and every key whose hash falls within a partition’s range will be
stored in that partition.
• A good hash function takes skewed data and makes it uniformly distributed
• Simple hash partitioning does not allow efficient range queries. This is solved using composite keys.
• Consistent hashing is a way of evenly distributing load across an internet-wide system of servers
such as a content delivery network
• It uses randomly chosen partition boundaries to avoid the need for central control or distributed
consensus
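• A minimal sketch of a consistent-hash ring in Python (no virtual nodes, which real systems add to even out load; the names are illustrative):

    import bisect
    import hashlib

    def h(key: str) -> int:
        # stable hash, roughly uniform over a large space
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    class HashRing:
        def __init__(self, nodes):
            self.ring = sorted((h(n), n) for n in nodes)

        def lookup(self, key):
            # first node clockwise from the key's position, wrapping around
            i = bisect.bisect(self.ring, (h(key), ""))
            return self.ring[i % len(self.ring)][1]

    ring = HashRing(["node-a", "node-b", "node-c"])
    print(ring.lookup("user:42"))  # adding/removing a node only moves nearby keys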
• Each partition maintains its own secondary index, covering only the documents in that partition.
• Reading involves reading from each and every partition and separately combining the results. This
approach is called scatter-gather, and it makes read queries expensive
• Even if the partitions are queried in parallel, scatter/gather is prone to tail latency amplification
• Writes are less efficient as a write a↵ects multiple partitions of the index. This requires a distributed
transaction across all partitions a↵ected by a write
• In practice, updates to global secondary indexes are often asynchronous
• Simple, but the drawback is that any change in N leads to rehashing of a large number of keys, which makes the rebalancing very expensive
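• For example, moving from hash(key) mod 10 to hash(key) mod 11 relocates roughly 10/11 (about 91%) of all keys.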
4.3.2 Fixed number of partitions
• Move only entire partitions. Assignment of keys to partitions does not change, but only assignment
of partitions to nodes changes.
• Create many more partitions than there are nodes and assign several partitions to each node
• If a node is added to the cluster, the new node can steal a few partitions from every existing node
until partitions are fairly distributed once again
• So many fixed-partition databases choose not to implement partition split and merge
• Choosing the right number of partitions is difficult if the size of the dataset is variable
• In dynamic partitioning, the partitions split if they grow beyond an upper bound. If the partition
shrinks below a lower bound, it can be merged with an adjacent partition
• Can be used with both key-range partitioned and hash partitioned data
4.4.1 ZooKeeper
• A distributed metadata management system for clusters.
• ZooKeeper maintains an authoritative mapping between partitions and nodes, and each node registers itself with the ZooKeeper service.
• Other actors, such as the routing tier or the partitioning-aware client, can subscribe to this infor-
mation in ZooKeeper
• When partitioning changes or node removal/addition occurs, ZooKeeper notifies the routing tier
5 Replication
• Keeping multiple copies of a single partition on different nodes connected by a network
• Motivation for replication:
– Reduce latency by reducing distance to user
– Increase availability by allowing fault tolerance
– Increase read throughput by allowing more parallel reads (scalable)
5.1.2 Implementation
• Statement Replication: The leader logs every write request that it executes and sends that
statement log to its followers (fails for non-deterministic functions like rand() and now())
• Write-Ahead Log Shipping: The leader writes the log (an append-only byte stream) to disk
and sends it across the network to its followers. When the follower processes this log, it builds a
copy of the exact same data structures as found on the leader.
• Logical Log Replication: Uses different log formats for replication and for the storage engine. A logical log (aka the replication log) is a sequence of records describing writes to database tables at the row level
• Trigger-Based Replication: A trigger on the leader table logs the change to another table where
an external process can read it. The external process applies the replication to another system
5.2 Replication lag
• The delay between a write happening on the leader and the same being reflected on a follower is
known as the replication lag.
• Read-After-Write consistency is a guarantee for a single user: if the same user reads the data at any time after writing it, the user will see the updated data.
• Solutions:
– Read critical data from leader, rest from follower (negates scaling advantage)
– Prevent queries on any follower that is lagging significantly behind the leader
– Client remembers the timestamp of their most recent write, and ensures that the node serving that user is updated at least till that timestamp
– Monotonic reads: each user read from the same replica always
– Consistent prefix reads - if a sequence of writes happen in a certain order, then anyone reading
those writes should see them appear in the same order
• In a multi-leader config, each datacenter can continue operating independently of the others, and
replication catches up when the failed datacenter is back online.
• In a single-leader config, the public internet is used for synchronous updates between leader and
follower, hence is sensitive to problems in this network
• A multi-leader config with asynchronous replication tolerates network problems better, as temporary network problems do not prevent writes from being processed
• In some implementations, the client sends writes to multiple nodes at the same time
• In others, a single co-ordinator node does this on behalf of the client, but it does not enforce a
particular order of writes (like a leader in a single-leader set up does)
• Writes may be sent to multiple nodes, but some of these nodes may fail and hence not complete the write. When the failed nodes come back online, any data on them is out of date (stale)
• To solve this issue, each data item has a version number associated with it. The client reading
from multiple replicas checks the version number of the data and selects the most recent one.
• When the client reads values with different version numbers, the client writes the most recent version of the data to all the nodes with less recent versions. This is called read repair
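• A sketch of the read-repair decision in Python (the replica get/put interface here is a hypothetical stand-in for the datastore's API):

    def read_with_repair(replicas, key):
        # gather (version, value) pairs from all reachable replicas
        responses = {r: r.get(key) for r in replicas}
        version, value = max(responses.values())   # highest version wins
        for r, (v, _) in responses.items():
            if v < version:
                r.put(key, version, value)         # repair the stale replica
        return value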
• A background process (rather than the client itself) monitors all data values and their versions
across all nodes, and periodically writes the latest value of the data to all the replicas. This is
called an anti-entropy process.
• Let there be n nodes. Let r nodes be queried for each read, and w nodes confirm each write. If

w + r > n

then an up-to-date copy of the data is guaranteed while reading, as at least one of the r nodes being read from must be up to date.
• Reads and writes that obey the above rule are called quorum reads and writes.
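• For example, with n = 3 replicas, choosing w = 2 and r = 2 satisfies w + r > n: any read quorum of 2 nodes must overlap any write quorum of 2 nodes in at least one up-to-date node.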
5.4.1 Monitoring
• Monitoring in leaderless systems is difficult as writes do not happen in any particular order
• In single-leader systems, the writes are in a fixed order maintained on the edit log of the leader.
The out-of-date follower can compare its position (timestamp) with that of the leader and make
the necessary changes.
• Version vectors are sent from the database replicas to clients when values are read, and need to be
sent back to the database when a value is subsequently written
• The version vector allows the database to distinguish between overwrites and concurrent writes,
and ensures that it is safe to read from one replica and write to another.
6 Consistency Models
• Most distributed systems only guarantee eventual consistency
• In eventual consistency, data read at any point may not be consistent across nodes, but if there
are no writes for some unspecified interval then all the nodes can catch up to the consistent state
• This is a weak guarantee, as it does not give any guarantees about actual time of consistency.
6.1 Linearizability
• The illusion that there is only one copy of a data item across a distributed system. (implies that
all data must be up to date at all times, no staleness in caching)
• Ensures that applications running on the distributed system do not need to worry about replication.
• Main point of linearizability: After any one read has returned the new value, all following reads
(on the same or other clients) must also return the new value.
• Compare-and-Set is an operation on the database:
– The CAS operation takes in 3 arguments: a memory location to read from (called X), an old value (v_old) and a new value (v_new)
– If X == v_old, then set X := v_new
– If X ≠ v_old, then return an error and don't change the value in X
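• A single-node sketch of compare-and-set in Python (in a real distributed system the datastore itself offers CAS as an atomic primitive):

    import threading

    class Register:
        def __init__(self, value):
            self._value = value
            self._lock = threading.Lock()

        def compare_and_set(self, old, new):
            # the comparison and the write happen atomically
            with self._lock:
                if self._value == old:
                    self._value = new
                    return True
                return False   # another writer changed the value first

    x = Register(10)
    assert x.compare_and_set(10, 20)       # succeeds: X == v_old
    assert not x.compare_and_set(10, 30)   # fails: X is now 20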
• Test for linearizable behaviour: record the timings of all requests and responses and check whether
a valid sequential ordering can be constructed from them.
• In synchronous mode, single leader replication is linearizable.
• Consensus algorithms implement measures to avoid stale replicas, and implement safe linearizable
storage. (e.g.: ZooKeeper)
• Multi-leader replication is not linearizable, and leaderless replication is probably not linearizable either
• The modern CAP goal is to maximize combinations of consistency and availability that make
sense for the specific application, while incorporating plans for unavailability and recovery of failed
partitions.
6.3.1 Phase 1
• Coordinator places the record Prepare T on its log. The message is then sent to all the sites.
• Each site that receives the message decides whether to commit the component of transaction T or to abort it.
• A site that wants to commit enters the pre-commit stage (in this state the site can no longer abort
the transaction)
• The site takes the necessary actions to ensure that its component of T will not be aborted, then
writes the log message Ready T.
• Once the log is stored on disk at the site, the site sends the Ready T message back to the
coordinator
• A site that doesn’t want to commit sends the message Don’t Commit T back to the coordinator
6.3.2 Phase 2
• If Coordinator gets Ready T from all the sites, it logs the message Commit T and sends it to
all the sites
• If the coordinator has received don’t commit T from one or more sites, it logs Abort T at its
site and then sends abort T messages to all sites involved in T
• If a site receives a commit T message, it commits the component of T at that site, logging
Commit T as it does
• If a site receives the message Abort T, it aborts T and writes the log record Abort T
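• A sketch of the coordinator's decision rule in Python (the site objects and their prepare/commit/abort methods are hypothetical stand-ins for the participating sites):

    def two_phase_commit(log, sites, T):
        # Phase 1: ask every site to prepare
        log.append(("Prepare", T))
        votes = [site.prepare(T) for site in sites]  # "Ready" or "Don't Commit"

        # Phase 2: commit only if every site voted Ready
        if all(v == "Ready" for v in votes):
            log.append(("Commit", T))
            for site in sites:
                site.commit(T)
        else:
            log.append(("Abort", T))
            for site in sites:
                site.abort(T)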
Cloud Computing (UE18CS352)
Unit 4
Aronya Baksy
March 2021
1.1.2 Advantages
• Easy maintenance and security
• Promotes sharing of resources and data between different h/w and s/w platforms
• Integration of services
1.1.3 Disadvantages
• Not scalable as number of workers increase
• The master is a SPOF
Figure 1: Client-Server Architecture
Figure 2: Client Server Model for Chat
1.2.2 Advantages
• No centralized point of failure.
• Highly scalable, addition of peers does not a↵ect quality of service
1.2.3 Disadvantages
• Maintaining decentralized coordination is tough (consistency of global state, needs distributed
coherency protocols)
• Computing power and bandwidth of nodes impacts the performance (i.e. all nodes are not the
same in a P2P network)
• Harder to program and build applications for P2P systems due to the decentralized nature.
1.2.4 Applications
• File Sharing applications with replication for fault tolerance (e.g.: Napster, BitTorrent clients
like µTorrent, Gnutella, KaZaa)
• Large-scale scientific computing for data analysis/mining (e.g.: SETI@Home, Folding@Home
used for protein dynamics simulations)
• Collaborative applications like instant messaging, meetings/teleconferences (e.g.: IRC/ICQ,
Google Meet, MS Teams etc.)
• The centralized server maintains a mapping between file names and IP addresses of nodes. Each time a client joins the network, it publishes its IP address and the list of files it shares to this database
• Any file lookup happens via the server. If the file is found, then the centralized server establishes a direct connection between the requesting node and the node that contains the requested file.
• e.g.: DNS hierarchy, where authority flows from the root name servers to the servers of the registered
name and so on
2 Unreliable Communication
• Issues with communication in distributed systems:
– Request or response is lost due to issues in the interconnect network
– Delay in sending request or response (due to queuing delays and network congestion)
– Remote node failure (permanent or temporary)
• Partial Failure in a distributed system occurs when some components (not all) start to function
unpredictably. Partial failures are non-deterministic.
• Distributed systems involve accepting partial failure, building fault-tolerant mechanisms into the
system.
• Reliability is the ability of a system to continue normal functioning even when components fail
or other issues occur.
• Formally, reliability is defined as the probability that a system meets certain performance standards
and yields correct outputs over a desired period of time
• Reliability includes:
– Tolerant to unexpected behaviour and inputs in the software
– Prevention of unauthorized access and abuse
– Adequate performance for the given use case under expected load and input size
• A fault is usually defined as one component of the system deviating from its spec
• A failure is when the system as a whole stops providing the required service to the user
• Fault-tolerant mechanisms prevent faults from causing system-wide failures.
• Classification of faults:
– Transient: appear once, then vanish entirely (e.g. first request from node A to node B fails
to reach, but the next one reaches on time)
– Intermittent: Occurs once, vanishes, but reappears after a random interval of time. (e.g.
loose hardware connections)
– Permanent: Occurs once, interrupts the functioning of the system until it is fixed. (e.g.
infinite loops or OOM errors in software)
• Classification of failures:
– Crash failure: A server halts, but functions correctly till that point
– Omission Failure: Could be send omission (server fails to send messages) or receive omission
(server fails to respond to incoming messages)
– Timing Failure: server response is delayed beyond the acceptable threshold
– Response Failure: Could be a value failure (response value is wrong for a request) or a state
transition failure (deviation from correct control flow)
– Arbitrary or Byzantine Failure: Arbitrary response produced at arbitrary times
2.1 Failure Detection
• Using timeout: Let d be the longest possible delivery time (all messages will reach within time d
after being sent or they will not reach at all), and r be the time needed by the server to process
the message.
• Then the round trip time 2d + r is a reasonable estimate for a timeout value beyond which it can
be assumed that a node has failed.
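• For example, if d = 200 ms and r = 100 ms, any reply should arrive within 2 × 200 + 100 = 500 ms; a longer wait suggests the node has failed.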
• Unfortunately, there are no such time guarantees in asynchronous communications that are used
in distributed systems.
• Network congestion causes queuing delays. If queues fill up at routers, then the packets can be
dropped, causing retransmission and further congestion
• Even VMs that give up control of the CPU core to another VM can face network delays, as they stop listening to the network for the short duration when they are not in control of the CPU. This leads to packets being dropped.
• Timeout values are measured experimentally.
– Data is collected on round-trip times across multiple machines in the network and over an
extended time period. Measure the variability in the delays (aka jitter)
– Taking into account this data, as well as the application characteristics, a timeout is chosen
that is a fair compromise between delay in failure detection and premature timeout.
– Instead of constant timeouts, the system constantly measures response time and jitter, and
dynamically adjusts the timeout value.
– This is used in Phi Accrual Failure Detector in systems like Cassandra and Akka (toolkit for
distributed applications in Java/Scala)
• In circuit switched networks, there is no queuing delay as the connection is already set up end-
to-end before message exchange, and the maximum end-to-end latency of the network is fixed
(bounded delay)
• The disadvantage of a circuit switched network is that it supports far fewer concurrent network users, and it leads to low bandwidth utilization.
2.4 Failure Detection using Heartbeats
• A heartbeat is a signal sent from a node to another at a fixed time interval that indicates that the
node is alive.
• Absence of a fixed number of consecutive heartbeats from a node is taken as evidence that the node is dead
• Heartbeat signals are organized in the following ways:
– Centralized: All nodes send heartbeats to a central monitoring service. Simplest organiza-
tion but the central service is now a SPOF
– Ring: Each node sends heartbeat only to one neighbour, forming a ring structure. If one of
the nodes fails, then the ring breaks and heartbeats cannot be sent properly
– All-to-All: Each node sends heartbeats to every other node in the system. High communi-
cation cost but every node keeps track of all other nodes hence high fault tolerance.
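• A sketch of the centralized variant in Python: the monitor records the last heartbeat per node and declares a node dead after k missed intervals (the parameter values are illustrative):

    import time

    INTERVAL = 1.0   # seconds between heartbeats
    K = 3            # consecutive missed heartbeats tolerated

    last_seen = {}   # node id -> timestamp of last heartbeat

    def on_heartbeat(node):
        last_seen[node] = time.monotonic()

    def dead_nodes():
        now = time.monotonic()
        return [n for n, t in last_seen.items() if now - t > K * INTERVAL]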
2.5 Failover
• The act of switching over from one service/node/application to a new instance of the same, upon
the failure or abnormal termination of the first one.
• Failover can be implemented in two ways: active-active (all instances run and serve traffic concurrently) and active-passive (a standby instance takes over when the active instance fails).
Figure 10: Active-Passive Failover Architecture
3.1.2 Mean Time to Repair (MTTR)
• Measured as

MTTR = (Total downtime caused by failures) / (Number of failures)    (2)
• It measures the average time to repair and restore a failed system
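• For example, 40 failures causing 20 hours of total downtime give MTTR = 20/40 hours = 30 minutes.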
• Fault tolerance can be implemented either in hardware (duplicate hardware), or in software (dupli-
cate instances running the same software) or using load balancers (redirect traffic away from failed
instances).
• Fault tolerant architectures, however, do not address software failures which are the most common
cause of downtime in real-life distributed systems.
• Build fault-tolerance into network architecture. Multiple paths between nodes in a data-center, or
mechanisms to handle link failure and switch failure
• Handle link failures transparently without affecting cloud functionality. Avoid forwarding packets on broken links.
• Chaos Monkey is used to purposefully introduce faults into systems that are under development
so that fault tolerance can be integrated as early as possible and tested at any time.
• By regularly ”killing” random instances of a software service, it is possible to test a redundant
architecture to verify that a server failure does not noticeably impact customers.
• Chaos Monkey relies on Spinnaker (an open source CI/CD tool similar to Jenkins) that can be
deployed on all major cloud providers (AWS, Google App Engine, Azure)
• After a timeout period, the client should try again and, if the new requests are successful, close
the circuit breaker.
Figure 12: Circuit Breaker Design Pattern
• Fallback logic must be simple and failure-proof as it is itself running due to a failure.
• If the request arrival rate exceeds the processing rate, the incoming requests can either be queued in a FIFO queue, or discarded when the queue fills up.
• When the service has capacity, it retrieves messages from this queue and processes them. When the request rate is greater than the available capacity, messages are still processed in order and are not lost.
4 Task Scheduling Algorithms
• Policies that assign tasks to the appropriate available resources (CPU, Memory, bandwidth) in a
manner that ensures maximum possible utilization of those resources
• Categorized into:
• Algorithm: Sort all available tasks in increasing order of their execution time, then assign the tasks to VMs in sequential order of the VMs (see the sketch below).
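• A sketch of this algorithm; task and VM names are illustrative:

```python
def assign_tasks(tasks, vms):
    """Sort tasks by execution time, then hand them to VMs in
    sequential (round-robin) order. `tasks` is a list of
    (task_id, exec_time); `vms` is a list of VM ids."""
    schedule = {vm: [] for vm in vms}
    for i, (task_id, _) in enumerate(sorted(tasks, key=lambda t: t[1])):
        schedule[vms[i % len(vms)]].append(task_id)
    return schedule

# Example: 5 tasks over 2 VMs
print(assign_tasks([("t1", 9), ("t2", 2), ("t3", 5), ("t4", 1), ("t5", 7)],
                   ["vm1", "vm2"]))
# {'vm1': ['t4', 't3', 't1'], 'vm2': ['t2', 't5']}
```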
5 Cluster Coordination
• Consensus is the task of getting all processes in a group to agree on a single value based on votes
gathered from all processes. The value agreed upon has to be submitted by one of the processes,
i.e. this value cannot be invented by the consensus algorithm
• Synchronous processes are those that follow a common clock, while asynchronous processes are
those where each process has an individual clock.
• In asynchronous systems, it is not possible to build a consensus algorithm, as it is impossible to distinguish between processes that are dead and those that are merely slow to respond.
• If even one process can crash in an asynchronous system, the consensus problem is provably unsolvable; this is the FLP impossibility result of Fischer, Lynch and Paterson (1985).
• Why solve consensus? It is important because many problems in distributed computing take a
similar form to the consensus problem such as:
– Leader election
– Perfect failure detection
– Mutual exclusion (agreement on which node gets access to a particular shared resource)
• The properties to be satisfied by asynchronous consensus are:
– Validity: The system cannot accept any value that was not proposed by at least one node. If every node proposes the same value, then that value is accepted
– Uniform Agreement: No two correct processes can agree on different values after a single complete run of the algorithm
– Non-Triviality/Termination: All the processes must eventually agree on a single value
• Acceptors receive proposals from proposers. They also inform the proposer in the event that a
value other than the one proposed by them was accepted
• Learners announce the outcome of the voting process to all the nodes in the distributed system.
5.1.3 Paxos Phase 2 (Propose - Accept)
• Once a proposer receives a prepare response from a majority of the acceptors, it can start sending
out accept requests.
• A proposer sends out an accept request containing its proposal ID and the value associated with the highest-numbered proposal it received among the prepare responses (or its own value, if none was reported).
• If an acceptor receives an accept request whose ID is greater than or equal to the highest ID it has already seen, it accepts the value and sends a notification to every learner
• A value is chosen by the Paxos algorithm when a learner discovers that a majority of acceptors
have accepted a value.
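• A sketch of the acceptor's side of both phases, under the simplifying assumption of direct in-memory calls instead of real network messages; names are illustrative:

```python
class Acceptor:
    """Sketch of Paxos acceptor logic. IDs are proposal numbers;
    `learners` is a list of callables to notify on acceptance."""
    def __init__(self, learners):
        self.promised_id = -1        # highest ID promised in phase 1
        self.accepted = None         # (id, value) accepted so far
        self.learners = learners

    def on_prepare(self, proposal_id):
        # Phase 1: promise only strictly higher proposal IDs,
        # reporting any previously accepted value back to the proposer
        if proposal_id > self.promised_id:
            self.promised_id = proposal_id
            return ("promise", self.accepted)
        return ("reject", None)

    def on_accept(self, proposal_id, value):
        # Phase 2: accept if the ID is >= the highest promised ID
        if proposal_id >= self.promised_id:
            self.promised_id = proposal_id
            self.accepted = (proposal_id, value)
            for learner in self.learners:    # notify every learner
                learner(proposal_id, value)
            return "accepted"
        return "rejected"
```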
• The simple ring election algorithm offers safety and liveness as long as nodes don't crash during the election.
• If a node crashes during the election, the election message can circulate the ring indefinitely; the election never terminates and liveness is violated.
• Supports concurrent elections – an initiator with a lower id blocks election messages by other
initiators
• If a node fails, the ring can be reconfigured to make it continuous again, provided all nodes in the ring know about each other.
• If the initiator is not faulty, then message complexity = 2N, turnaround time = 2N, and message size grows as O(N) (see the sketch below).
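• A toy simulation of these costs, assuming a non-faulty initiator and no crashes; the ID values are illustrative:

```python
def ring_election(ids, initiator):
    """Sketch of the ring election described above: the election
    message circulates once collecting IDs (so the message grows
    O(N)), then the coordinator announcement circulates once more,
    giving roughly 2N messages in total."""
    n = len(ids)
    start = ids.index(initiator)
    collected, messages = [], 0
    for step in range(n):                 # election pass around the ring
        collected.append(ids[(start + step) % n])
        messages += 1
    leader = max(collected)               # highest ID wins
    messages += n                         # coordinator announcement pass
    return leader, messages

print(ring_election([3, 7, 1, 9, 4], initiator=1))   # (9, 10): 2N = 10 messages
```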
6 Distributed Locking
• A lock is a mechanism that allows multiple processes/threads to access shared memory in a safe
manner avoiding race conditions. Locks are implemented as semaphores/mutexes/spinlocks.
• Locks are operated in the following sequence:
– Acquire the lock. This gives the process sole control over the shared resource
– Perform the tasks needed on the shared resource
– Release the lock. This gives the other waiting processes a chance to access the shared resource
• A distributed lock is one that can be acquired and released by different nodes (instead of processes and threads on only one node). The properties required of a distributed lock are:
– Mutual exclusion: Only one process can hold a lock at a given time
– Deadlock-free: locks must be held and released in a manner that avoids deadlocks between
processes. No one process can hold a lock indefinitely, locks are released after a certain
timeout.
– Consistency: Despite any failover situation caused by a node failure, the locks that the
original node held must still be maintained.
6.2 Fencing
• Every time the lock manager (LM) grants a lock (in response to an acquire), it sends back a fencing token to the client.
• Along with every write request to the DB, the client sends this fencing token.
• If the DB has processed a write request with token ID N, then it will not process write requests containing token IDs less than N (see the sketch below).
• A token ID less than N indicates that the node had acquired the lock earlier but the timeout has expired, so that lock is no longer valid.
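• A sketch of the storage side of fencing; the class and method names are illustrative:

```python
class FencedStore:
    """Sketch of fencing at the storage service: writes carrying a
    token older than the newest one seen are rejected, so a client
    whose lock lease expired cannot corrupt the data."""
    def __init__(self):
        self.max_token = -1
        self.data = {}

    def write(self, token, key, value):
        if token < self.max_token:
            raise PermissionError(f"stale fencing token {token}")
        self.max_token = token       # remember the newest token seen
        self.data[key] = value

store = FencedStore()
store.write(33, "x", "a")      # accepted; newest token is now 33
store.write(34, "x", "b")      # accepted
# store.write(33, "x", "c")    # would raise: token 33 < 34
```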
• A Distributed Lock Manager (DLM) provides software applications running on a distributed system with a means to synchronize their accesses to shared resources.
• The DLM uses a generalized concept of a resource, which is some entity to which shared access
must be controlled.
7 Zookeeper
• ZooKeeper is a service for coordinating processes of distributed applications
• Zookeeper offers a hierarchical key-value store, to provide a distributed configuration service, synchronization service, and naming registry for large distributed systems.
• Useful for lock management (keep it outside of programmer’s hands for best results) and avoiding
message based control (in async systems message delivery is unreliable)
• Zookeeper maintains configuration information, performs distributed synchronization, and enables group services.
• Properties of Zookeeper:
– Simple: leads to high robustness and fast performance
– Wait-free: slow/failed clients do not interfere with needs of properly-functioning clients (in-
teractions are loosely coupled)
– High availability, high throughput, low latencies
– Tuned for workloads with high % of reads
– Familiar interface
7.1 Use Cases
• Leader election
• Distributed locking to avoid race conditions (see the lock-recipe sketch after this list)
• Message queue implementation
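• A sketch of distributed locking using the kazoo Python client for Zookeeper; the host address, lock path and identifier are illustrative:

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")   # assumed local ensemble
zk.start()

# kazoo's Lock recipe coordinates lock ownership via znodes
lock = zk.Lock("/locks/resource-1", "worker-42")
with lock:                                 # blocks until acquired
    pass  # critical section: sole access to the shared resource

zk.stop()
```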
7.2 Advantages
• Simple distributed coordination
7.3 Disadvantages
• Less feature-rich compared to other such services (e.g. Consul which has service discovery included)
• Dependence on TCP connection for client-server communication
• Each znode can store up to 1 MB of data. The limit exists because Zookeeper is meant to store only configuration information (status, networking, location, etc.)
• Every znode is identified by a name, which is a path of components separated by /, with the root node's path being just '/'.
• A znode with children cannot be deleted.
7.4.2 Working
• Each Zookeeper server maintains an in-memory copy of the data tree that is replicated across all
the servers.
• For high throughput, only transaction logs are kept in a persistent data store.
• Each client connects to a single Zookeeper server using a TCP connection. The client can switch to another Zookeeper server if the current TCP connection fails.
• All updates made by Zookeeper are totally ordered. The order is maintained by the use of the zxid
or Zookeeper Transaction ID.
• Distributed synchronization is maintained using Zookeeper Atomic Broadcast or ZAB Protocol.
• A client can watch a znode, meaning that when any changes are made to the watched znode, the client receives a notification (see the sketch below).
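• A sketch of a data watch, again assuming the kazoo Python client; paths and values are illustrative:

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")   # assumed local ensemble
zk.start()
zk.ensure_path("/app/config")              # create the znode if absent
zk.set("/app/config", b"v1")

@zk.DataWatch("/app/config")
def on_change(data, stat):
    # Invoked once immediately, then on every update to the znode
    print("config is now:", data, "version:", stat.version)

zk.set("/app/config", b"v2")               # triggers the watch
```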
7.6 Alternatives to Zookeeper
• Consul: Service discovery and configuration tool, highly available and scalable
• etcd: Distributed key-value store, open source, tolerates failures during leader election
• YARN: parallelizes operations for greater throughput and resource utilization
• Eureka: a REST-based service for locating services, used for load balancing and failover of middle-tier servers
• Ambari: provisions, manages and monitors Hadoop clusters, with an intuitive web UI backed by RESTful APIs.
Cloud Computing (UE18CS352)
Unit 5
Aronya Baksy
April 2021
1.2 Forward Proxy
• Regulates outbound traffic in accordance with certain policies in shared networks. Collects requests
from clients, and interacts with servers on behalf of the client.
1.3 Nginx
• Nginx is a web server that can also be used as a reverse proxy, load balancer and HTTP cache.
• Load balancing is either done using round-robin scheduling, or the optional hash-based scheduler
that chooses an upstream server based on the hash of some value (can be request URL, incoming
HTTP headers, or some combination of the same)
• Scaling is done by simply changing the Nginx server configuration, i.e. by adding more servers and the corresponding IP addresses in the "upstream" section
2 Scalability
• Ability to increase/decrease IT resources deployed in response to changing workloads and demands.
• Scaling can be done for data storage, compute or networking, and must be done with minimal
downtime or service disruption.
• Elasticity refers to the system’s ability to allocate or deallocate resources for itself in response to
changing workloads
• On the other hand, scalability refers to the ability to use only existing resources to handle increased
workloads
• The tradeoff between elasticity and scalability depends on whether the app's workloads are predictable or highly variable.
• AWS and Azure cloud services offer many different instance sizes, so scaling vertically is possible for many types of resources (EC2 instances, RDS databases)
2.2.2 Horizontal Scaling
• Adding more instances of the same existing configuration and splitting workloads between the new
increased number of instances
• Increase number of instances instead of changing instance type
Think of it like this: vertical scaling is adding more floors to a single house, whereas horizontal scaling is building two more houses of the same size as the existing one.
4 Multi-Tenancy
• An architecture model wherein a single instance of an application or a hardware serves multiple
clients.
• Three types of multi-tenancy models:
– Shared Machine: each client has their own DB process and tables on a single shared machine
– Shared-Process: Each client has their own tables, but only one database process executes
queries for all clients
– Shared-Table: clients share database tables and process.
4.2 Types of Multi-tenant Architectures
4.2.1 Single multi-tenant database
• A single app instance, and a single database instance.
• Highly scalable. As more tenants are added, the database is scaled up by adding more storage.
• Low cost due to shared resources, but high operational complexity during design and setup
4.4.2 Implementing Resource Sharing
• Access control is provided using roles and business rules.
• A role is associated with a set of permissions specific to it. The ability to set permissions for roles
is also attached to a certain small set of roles.
• A business rule is a policy that provides fine-grained access control, based on the context of the running application (e.g. in a banking app, limit the amount of money withdrawn in a single transaction, or limit the time during which transactions can take place)
• Business rules are implemented using policy engines like Drools Guvnor and Drools Expert
• Two types of access control:
– Access Control List: Each object associated with a set of permissions for each role
– Capability-based Access Control: If a user holds a reference or capability (called a key) to an object, they have access to the object. (A combined role/business-rule sketch follows this list.)
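• A combined sketch of role-based permissions (ACL-style) plus a business rule, with entirely hypothetical roles, permissions and limits:

```python
ROLE_PERMISSIONS = {                 # hypothetical roles and permissions
    "teller":  {"read_account", "withdraw"},
    "auditor": {"read_account"},
}

def withdrawal_rule(context):
    # Business rule: fine-grained check on the request context
    # (cap the amount and restrict transactions to business hours)
    return context["amount"] <= 10000 and 9 <= context["hour"] <= 17

def authorize(role, permission, context):
    if permission not in ROLE_PERMISSIONS.get(role, set()):
        return False                 # coarse-grained role check (ACL)
    if permission == "withdraw":
        return withdrawal_rule(context)
    return True

print(authorize("teller", "withdraw", {"amount": 500, "hour": 11}))   # True
print(authorize("auditor", "withdraw", {"amount": 500, "hour": 11}))  # False
```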
• In a dedicated table approach, each tenant has their own table. Access to other tenant’s tables
is restricted.
5 Cloud Security
• A set of control-based safeguards and technologies that protect cloud resources from online theft,
leakage or data loss.
• Cloud security is partitioned into the physical and virtual domains. Basic objectives of cloud
security are confidentiality, integrity and availability
• Network Time Protocol (NTP) is used for time synchronization; encryption is used to guard against fake reference time sources.
5.2.2 Identity and Access Management
• IAM must be scalable, federated, allow a single identity and single sign-on, and must satisfy legal and policy requirements.
• Such a protocol must ensure that it can be executed only in emergencies under controlled situations
and that the alarm is triggered properly.
5.2.5 Auditing
• Capture all security-related events, together with data needed to analyze the event
• This data includes time, system on which the event occurred, and userid that initiated the event.
• The audit log is centralized and secure
• It must be possible to create a sanitized or stripped-down version to share with cloud customers
for further analysis.
• All software (releases or patches) is tested in a test-bed environment before deployment to the production environment.
• Testing happens on a continuous ongoing basis that identifies vulnerabilities in the cloud system.
• Risk management is domain-dependent, and must trade off risk impact against the cost of risk-mitigation measures.
• A security control is a safeguard that detects, responds to, or prevents a security risk. There are three broad categories of security controls: technical, operational and management, with each further divided into 18 families.
• Security breaches are classified as low-impact, medium-impact or high-impact based on the required level of security control.
• Low Impact Systems are those where a security breach causes limited degradation in capability,
but the system can still perform its primary functions.
• Medium Impact Systems are those where the system is still capable of performing its primary
functions but there is a significant degradation in the capabilities.
• High-Impact Systems are those where a security breach causes inability to perform primary
functions.
5.4.2 Honeypot
• Honeypots are systems that disguise themselves as valuable targets, while being monitored by
security personnel.
• While an attacker attempts to control the honeypot, the sysadmins monitoring the honeypot can
trap and stop the attack.
• Honeypot VMs can be deployed by the cloud provider or the cloud customers.
5.4.3 Sandboxing
• Execution of software inside a controlled environment within an operating system.
• Within the sandbox, the software has access only to the bare minimum resources it needs to function
properly. Hence any attacker gaining control of the software does not have unrestricted access to
the entire system.
• Sandboxes also provide defense in depth, as an attacker must also overcome the sandbox in order to gain unrestricted access.
5.5.2 Subnet Isolation
• Separate subnets for admin traffic, user traffic and storage network traffic.
• Physically separate networks are preferred as virtual LANs (VLAN) that are not physically separate
are hard to configure correctly.
• Routing between the networks is handled by firewalls.
• Simplifies implementation and management of IT services, allows all admins to have a consistent
view of the IT system.
• Multi-factor authentication can be made mandatory for increasing secure access to administrative
functions.
• The access to the public network is via two switches, to increase availability via redundancy.
5.6.2 Internal Network Access
• Separate physical networks for admin control functions and one for cloud user functions. Protects
control network from unauthorized access.
• The DBMS is connected to the public network via an aggregated set of links to provide increased
bandwidth and availability.
• PaaS service is accessible from public and private networks. But the security server need not be
accessible from the public network.
– Auditing of security
– Monitoring for security threats
– Hosting a security operations center
– Security scanning of the cloud infrastructure
5.7.3 ENISA
• European Network and Info. Security Agency provides a Cloud Computing Information As-
surance Framework.
• The framework is a set of assurance criteria designed to assess the risk of adopting cloud services, compare different cloud provider offers, obtain assurance from the selected cloud providers, and reduce the assurance burden on cloud providers
5.7.5 COBIT
• Control Objectives for Information and Related Technologies, developed by ISACA.
• A set of best practices for linking business and IT goals, with metrics and maturity models.
• Broader scope than ISO/IEC 27000
5.7.6 US NIST
• US National Institute for Standards and Technology releases many whitepapers in the Security
Management and Assurance working group.
• Targeted at US federal agencies (CIA, FBI, etc.), but they apply to many other organizations as well.
• Such laws must specify who is responsible for security and accuracy of the data stored on cloud.
• Issues to consider when framing laws:
– Cover all risks arising from a third party’s presence
– Need to ensure data security
– Obligations of the cloud provider during any litigation
5.9.3 Implementation
• Enterprise must ensure that the safeguards laid out in the contract are actually being followed
• It is also important to continuously re-evaluate the system periodically to check for changed cir-
cumstances (increased data sensitivity, revoked external certifications)
• Also ensure that sensitive data if any, is completely deleted from the original provider’s systems.
5.9.5 Data Privacy and Secondary use of Data
• Collected data must be used only for its intended purpose, and such data cannot be sold to third parties
• Privacy laws often state that individuals can access their own data and modify or delete it
• Enterprises must ensure that cloud service providers do not use the data for data mining or other
secondary usage.
5.9.9 Litigations
• During a litigation against an enterprise or a cloud provider, the provider must be able to make
available any data that is needed for this litigation.
• This is important as enterprises (not cloud providers) are responsible for responding to such re-
quests.
• In case a cloud provider is directly requested to provide data, then the affected business must be contacted and must be given the opportunity to oppose the request.
6.1 Terminologies
6.1.1 Project
• An abstraction used to group resources (servers, machine images etc.)
• Users or user groups are given access to projects using role assignments.
• The specific role assigned outlines the type of access and capabilities that a user/user group is entitled to.
6.1.2 Domain
• An abstraction that isolates the visibility of a set of projects and users (or user groups) to a single organization
• Domains enable splitting cloud resources into silos that can be used by each organization.
• Domains represent logical divisions within an enterprise, or may be entirely different enterprises
6.1.5 Roles
• "Assigned to" a user and "assigned on" a project.
• Convey a sense of authority, a particular responsibility to be fulfilled by an actor.
• A role assignment is a triple of actor, target (may be a project or a domain), and a role.
• Role assignments can be granted, revoked and inherited between users/projects.
6.1.6 Token
• Each API call authenticated by Keystone requires the passing of a token.
• Tokens are generated by Keystone upon successful authentication of a user against the service.
• A token has both a unique ID (unique per cloud) and a payload (data about the user)
• Used mainly for service discovery and access (such as creating VMs, storage allocation etc.)
• Each endpoint is broken down into a public URL, an internal URL and an admin URL (which may or may not all be the same)
6.2 Identity in Keystone
6.2.1 SQL
• Identity of actors (name, password, metadata) and groups stored on an SQL database (MySQL,
PostgreSQL, DB2)
• Keystone in this case serves as the identity provider
• Pros:
– Easy setup
– Manage users and groups via OpenStack APIs
• Cons:
– Keystone should not act as both the identity provider and the authenticator
– Weak password support: no password rotation or recovery
– Does not integrate with existing enterprise LDAP servers
6.2.2 LDAP
• Keystone can retrieve and store actors (Users and Groups) in Lightweight Directory Access Protocol
(LDAP).
• LDAP should be restricted to only read operations (searching) and authentication (bind).
• Keystone needs only a minimal amount of privilege to use the LDAP server (read access to the attributes defined in the configuration, or anonymous access)
• Pros:
• Cons:
– Complex set up
– User authentication must be domain-scoped
6.2.4 Identity Providers
• An identity provider is a service that abstracts the identity service backend and translates user information into some standard federated identity protocol.
• Keystone uses Apache modules for consuming authentication info from multiple Identity Providers.
• Such users are never stored in Keystone and are not permanent; their attributes are mapped into group-based role assignments
• From a Keystone perspective, an identity provider is a source for identities; it may refer to software
that is backed by various backends or Social Logins
• Pros:
– Leverage existing infra & software for user authentication
– Separation between Keystone service and user info
– Keystone never sees any user passwords
– The type of authentication (certificate-based, 2-factor) is abstracted away from Keystone
• Con: most complex setup
• The payload of the request contains information needed to find where the user exists, authenticate
the user, and optionally, retrieve a service catalog based on the user’s permissions on a scope
• The user section identifies the user (either within a domain, or using a globally unique user ID).
• The scope section identifies the project being worked on, and hence is used to retrieve the service catalog. It must contain information to identify a project and the owning domain. (A sketch of such a request follows.)
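• A sketch of a password-authenticated, project-scoped token request against the Keystone v3 API; the endpoint, user, domain and project names are hypothetical:

```python
import requests

# Payload shape follows Keystone v3 POST /v3/auth/tokens
payload = {
    "auth": {
        "identity": {
            "methods": ["password"],
            "password": {
                "user": {
                    "name": "alice",                  # hypothetical user
                    "domain": {"name": "Default"},
                    "password": "secret",
                }
            },
        },
        # scope: identifies the project (and owning domain) so the
        # response can include a service catalog
        "scope": {
            "project": {"name": "demo", "domain": {"name": "Default"}}
        },
    }
}

resp = requests.post("http://keystone:5000/v3/auth/tokens", json=payload)
token_id = resp.headers["X-Subject-Token"]   # the token ID is returned as a header
```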
6.3.2 Token
• A user may also request a new token by providing a current token.
• The payload contains the current token ID.
• This allows refreshing a token that will soon expire, or changing a token type from unscoped to
scoped.
• At the top of the policy file, targets are established that can be used in the evaluation of other targets.
• Here the meaning of admin, owner and other roles are defined.
Figure 3: Keystone Services and Backends
• DDoS attacks are carried out with networks of internet-connected machines that have been infected with malware that allows an attacker to control them remotely.
• Such a network of machines is called a botnet (with individual machines being called bots or
zombies). Each bot is a legitimate machine on the internet hence it is difficult to separate attack
traffic from actual traffic.
• The botnet floods the victim with requests, overwhelming the capacity and causing denial of service.
• The other side effect of this attack is the persistent degradation of service experienced by benign cloud users.
DDoS Attack                              EDoS Attack
Degrade/block cloud services             Make cloud resources economically infeasible
Short attack period                      Long attack period
Attacks occur above the EDoS region      Attacks occur between the normal traffic zone and the DDoS attack zone
• Network anomaly detection reveals abnormal traffic patterns, such as unauthorized episodes of
TCP connection sequences, against normal traffic patterns.
• Based on the anomaly pattern detected in covered network domains, the scheme detects a DDoS
attack before the victim is overwhelmed
• Personal firewalls at user ends, to protect shared data sets from Java, JavaScript, and ActiveX applets
• A privacy policy consistent with the cloud service provider’s policy, to protect against identity
theft, spyware, and web bugs
• VPN channels between resource sites to secure transmission of critical data objects
• Cloud storage provides a process for the generation, embedding, and extraction of the watermarks
in colored objects
• Data coloring takes a minimal number of calculations to color or decolor the data objects (compared
to encryption/decryption)
• Standardized cloud APIs can be built, but this requires providers to build infrastructure that adheres to OVF (a platform-independent, efficient, extensible, and open format for VMs)
• This will enable efficient, secure software distribution, facilitating the mobility of VMs.