
Cloud Computing (UE18CS352)

Unit 1
Aronya Baksy
January 2021

1 Introduction to Cloud Computing


• Cloud Computing: A model for enabling ubiquitous, convenient, on-demand network access
to a shared pool of configurable computing resources that can be rapidly provisioned
and released with minimal management effort and minimal interaction with service providers.
• Resources available on the cloud include networks, servers, storage, applications and services.

1.1 Characteristics of Cloud Infrastructure


• On-demand Service: Self or auto-provisioned resources in short time (may be compute, storage
or platform) with no need to interact with IT personnel and wait for approval.
• Broad Network Access: Resources should be accessible from any platform (mobile or desktop
device, running any OS). Most cloud service providers use the internet for this purpose.
• Resource Pooling: Ability to share a single hardware resource between multiple clients. This
allows more users to concurrently access the service. It is commonly achieved using Virtualization.
• Rapid Elasticity: Easy and fast increase/decrease of the number of resources deployed, based on
the current load or other criteria.
• Measured Service: Consumer pays only for the resources used by their application (eg: Sales-
Force charges proportionally with number of customers using the service).

2 Computing Paradigms
• Centralized Computing: All the compute resources (storage, memory, CPU etc.) are held in
one central location, tightly coupled and shared among all clients. They are accessed from terminal
machines (eg: some datacenters, supercomputers)
• Parallel Computing: Multiple processors are either tightly coupled (shared memory) or loosely
coupled (distributed memory). Inter-Processor communication is done via shared memory or mes-
sage passing.
• Distributed Computing: Multiple autonomous compute nodes that each have their own private
memory and communicate via a network. Message passing is used as the mechanism for this
communication.

2.1 Grid Computing


• A grid is a system that coordinates resources that are not subject to central control, using
standard, general purpose and open protocols/interfaces to deliver a non-trivial quality of
service.
• Grids have the ability to handle heterogeneous infrastructure. Trust and security between resources
pooled from different organizations on the grid is maintained using resource sharing agreements.

• The end goal of grid computing is to allow computational power to be offered as a utility (like
electricity).
• The following are the benefits of grid computing

1. Exploit underutilized resources


2. Load balancing between resources
3. Virtualization of resources at an enterprise level, enable collaboration across VOs
4. Creation of data grids for distributed storage or compute grids for distributed computing on
multiple nodes.

2.1.1 Virtual Organization


• A Virtual Organization (VO) forms the basic unit for enabling access to shared resources.
• The key technical problem addressed by grid technologies is to enable resource sharing among
mutually distrustful participants of a VO who may/may not have any prior relationship and enable
them to solve a common task.

2.1.2 Layered Architecture of Grid Computing


• Application Layer: Application running on the grid
• Collective: Implement a variety of sharing behaviours with directory, brokering, community au-
thorization and accounting services, as well as collaborative services.
• Resource: APIs for allocation of resources as well as secure negotiation, monitoring, control,
accounting and payment for operations on a single shared resource.

• Connectivity: Protocols for inter-node communication and authentication of these communications.
• Fabric: Provide physical resources (compute, storage, network resources, catalogs) or logical re-
sources (distributed file system, compute cluster) whose access is mediated by the higher-level grid
protocols.

2.1.3 Gridware (Grid Middleware)


• A type of middleware that enables sharing and management of grid components based on user
requirement as well as resource properties
• Functionality of gridware:

1. Run applications on suitable resources (brokering, scheduling tasks)


2. Provide uniform high level access to resources via semantic interfaces (Service Oriented Ar-
chitecture, Web Service architecture)
3. Address inter-domain security policies
4. Application level status monitoring and control

• Examples of gridware: Globus (U Chicago), Condor (U Wisconsin), Legion (U Virginia), IBP,
NetSolve (for high-throughput and data-intensive scientific applications)

2.2 Cluster Computing


• Set of loosely/tightly coupled compute nodes that can be viewed as a single system.
• Most clusters consist of homogeneous nodes (each node has same configuration), within a small
area, connected by a fast LAN.
• Each node in a compute cluster is configured to execute the same task, scheduled and controlled
by software.

• One of the features of a cluster is the ability to merge the multiple systems into a Single System
Image (SSI).
• An SSI is an illusion created by software or hardware that presents a collection of resources as one
integrated, powerful resource. SSI is implemented as a middleware layer (in hardware/software)
that presents CPU cores/IO devices/Disks as a single unit shared across all cluster nodes.
• Middleware support is needed to implement SSI as well as high availability (HA) which consists of
fault tolerance and recovery mechanisms.
• Instead of implementing SSI at many different levels, virtualization is used to create virtual clusters
from a smaller number of actual nodes.

2.3 HTC and HPC


• High Performance Computing: Usage of large compute resources for a relatively short period
of time. Such jobs are typically run on a single system consisting of multiple processors that run
tasks in parallel.
• Performance of HPC systems is measured in FLOPS (Floating point operations per second).
• High Throughput Computing: Usage of large compute resources for a long period of time. It
is defined as a computing paradigm that focuses on the efficient execution of a large number of
loosely-coupled tasks
• Performance of HTC systems is measured in terms of number of jobs or operations completed per
month or year.

2.4 Parallel Computing


• Computation paradigm wherein multiple tightly coupled processors execute smaller sub-tasks
within a larger application program.
• These sub-tasks are independent of one another and executed at the same time on different hard-
ware units.

2.4.1 Types of Parallelism


• Bit-level Parallelism: Focus on increasing processor word size (eg: an 8-bit processor needs 3
cycles to add 2 16-bit numbers but a 16-bit processor takes one cycle).
• Instruction-level Parallelism: Parallel execution of multiple instructions from a single program
(eg: parallel loops like vector addition can be converted from loop to parallel instruction).

• Task-level or Thread-level Parallelism: Independent threads of execution (performing different
tasks) that run on separate processing cores.
• Data Parallelism: Input data is split into batches, and all batches are processed in parallel. The
exact same instructions are applied to each batch of the data.
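A minimal sketch of data parallelism, using Python's standard multiprocessing module (the batch size and the per-element operation are arbitrary choices for illustration):

```python
# Data parallelism: the exact same function is applied to every batch,
# with batches processed in parallel by a pool of worker processes.
from multiprocessing import Pool

def process_batch(batch):
    # The same instructions applied to each batch of the data.
    return [x * x for x in batch]

if __name__ == "__main__":
    data = list(range(100))
    # Split the input data into batches of 10 elements each.
    batches = [data[i:i + 10] for i in range(0, len(data), 10)]
    with Pool(processes=4) as pool:
        results = pool.map(process_batch, batches)  # batches run in parallel
    print(sum(len(r) for r in results))  # 100 elements processed
```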

2.4.2 Techniques and Solutions in Parallel Computing


• Application Checkpointing: Record the current state of all components in the system so that it
can be restarted and restored from that point in time (see the sketch after this list)
• Automatic Parallelization: Automatic conversion of serial code to multi-threaded code that
can be used on an SMP machine (Shared Memory multi-Processor)
• Parallel Programming Languages: Classified as either using distributed memory (threads use
message passing to communicate) or using shared memory (threads use variables in shared memory
to communicate)
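A minimal sketch of application checkpointing, assuming a simple loop whose state can be pickled (the checkpoint file name and state layout are illustrative assumptions):

```python
# Application checkpointing: periodically record the current state so the
# program can be restarted and restored from that point in time.
import os
import pickle

CHECKPOINT = "state.ckpt"  # hypothetical checkpoint file

def load_state():
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)      # restore from the most recent checkpoint
    return {"i": 0, "total": 0}        # no checkpoint yet: start fresh

def save_state(state):
    with open(CHECKPOINT, "wb") as f:
        pickle.dump(state, f)          # record the current computation state

state = load_state()
for i in range(state["i"], 1_000_000):
    state["total"] += i
    state["i"] = i + 1
    if i % 100_000 == 0:
        save_state(state)              # checkpoint every 100,000 iterations
save_state(state)
print(state["total"])
```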

| Parallel Computing | Distributed Computing |
| --- | --- |
| Uses a single compute node | Uses multiple compute nodes |
| Tasks run on multiple cores on a single chip | Tasks run on a network of computers |
| Shared or distributed memory | Distributed memory only |
| Processors communicate through a bus | Processors communicate via message passing |
| Goal: improve system performance | Goal: improve scalability, fault tolerance, resource sharing |

3 Cloud Computing Models


3.1 Enabling technologies
• Broadband networks, internet architecture, web technologies (URLs, HTTP, XML/HTML)
• Multi-tenant technology: single instance of software running on a server and serving multiple clients
• Data center technology

• Virtualization technology

3.2 Cloud Service Models


3.2.1 Infrastructure as a Service (IaaS)
• Definition: Capability given to the user to provision processing, storage, network and other funda-
mental computing resources where the consumer is able to deploy and run any arbitrary software
platform (can include OSes and applications). The consumer has no direct control over the cloud infras-
tructure but has control over OS, storage, deployed applications and select control over networking
components like firewalls.

• Provision of compute and storage resources as a service. Physical resources are abstracted into
virtual containers and presented to the user.
• These virtual resources are allocated on demand to the user, and configured by the user to run any
software applications.

• IaaS has the greatest flexibility but the least application automation from the standpoint of the
user. It allows the user to have complete control over the software stack that they run.
• Building blocks of IaaS are:
– Physical data centers (large collections of server racks with multiple physical machines inside
each rack), managed by IaaS providers.
– Compute: the ability to provision VM instances with CPU/GPU configs depending on work-
load. Also provided are auto-scaling and load balancing services
– Networking: software abstraction of network devices like switches/routers, available typically
through APIs.
– Storage: either block, file or object storage. Block and file storage are the same as found on
traditional data centers, but struggle with scaling, performance and the distributed nature
of the cloud. Object storage on the other hand is infinitely scalable, accessible via HTTP,
works well with distributed systems like the cloud, uses commodity hardware and allows linear
growth of performance wrt cluster size.
• Advantages of IaaS:

– Flexible
– Control
– Pay-as-you-go
– Faster deployment
– High availability

• Disadvantages of IaaS:
– Security threats sourced from the host or other VMs.
– Multi-tenant security: new VM users must not be able to access data left behind by previous
users.
– Internal resources and training, i.e. the need to train IT managers in the use of IaaS manage-
ment.
• IaaS providers: Google Compute Engine, AWS Elastic Compute Cloud (EC2), MS Azure VMs,
DigitalOcean Droplets

3.2.2 Platform as a Service (PaaS)


• Definition: Capability given to the user to deploy consumer-created or acquired applications created
using programming languages or tools supported by the provider. The consumer does not manage
OS, storage, networking or compute infrastructure, but controls the deployed applications and
possibly the hosting environment.
• The physical hardware and its virtualization are controlled by the provider. The provider, in
addition to this, also delivers some selected middleware (like database software).
• The user can configure and build applications on top of this middleware (eg: create a new database
and build applications that use this new database).
• PaaS is well suited to users who use commonly available middleware that is also supported by a
cloud provider.
• Advantages of PaaS:
– Faster time to market due to reduced setup and install time for hardware
– Faster, easier, risk-free adoption of a large variety of resources (in terms of middleware, OS,
databases, libraries and components)
– Easy to develop for multiple platforms (including mobile)
– Cost-effective scalability
– Allows for geographically distributed teams, and effective product life-cycle management
(build, test, deploy, manage, update)

• Disadvantages of PaaS:
– Operational limitations (lack of control) due to management automation workflows (available
on some PaaS providers) that affect provision, management and operation of PaaS systems
– Vendor lock-in
– Runtime issues: specific versions of frameworks may not work with the platform, or platform
may not be optimized for the frameworks/language used
– Security: limited control over hosting policies, risks with storing data on cloud servers
– Integration and customization with legacy services (like data residing on an existing data
center) is more complicated, and this can outweigh the cost savings of switching to PaaS.

• PaaS providers: AWS Elastic Beanstalk, Azure DevOps, Google App Engine

3.2.3 Software as a Service (SaaS)


• Definition: Capability given to the user to use the applications provided by the provider. The user
accesses this application through a thin client such as a web browser on their local machine. The
user does not manage the underlying cloud infrastructure (OS, network, servers, storage) or even
the application capabilities directly, but instead only changes the application specific settings.

• SaaS is a no-programming model (very limited scripting/programming abilities can be provided in
order to change app configuration for advanced users).

• Advantages of SaaS:
– Flexible payment scheme, pay-as-you-go model
– High vertical scalability
– Automatic update of software
– Accessibility over the internet
• Disadvantages of SaaS:
– Security of data on cloud servers
– Greater latency in interaction with app, as compared to local deployment
– Total dependency on internet
– Vendor lock-in

4 Technological Challenges in Cloud Computing


• Elasticity (Scalability): Resource allocation and workload scheduling algorithms are needed to
scale up and down effectively.
• Performance Unpredictability: Ensuring reliability when resource sharing is involved
• Compliance: with privacy rules, ensure security of information stored on cloud. (in India, the
SEBI’s Clause 49 lays down these rules).
• Multi-Tenancy: Sharing of the same virtual resource by multiple users can cause concurrency issues
which lead to security problems (hence appropriate locking mechanisms are needed).
• Interoperability: Application on one platform should be able to incorporate services found on
another platform. This is now possible via web services, which however are complex to develop.
• Portability: Application should migrate seamlessly from one cloud provider to another without
change in design/programming.
• Availability: Reliability that is needed for high availability of cloud resources is hard to achieve
due to cascading failures (one failure causes other failures in turn, and so on). This high level of
availability is achieved using redundancy at the application, middleware or hardware level.

4.1 High Availability in Cloud


• Cloud uses failure detection and application recovery to ensure high availability.
• Failure detection: cloud detects failed instances/components and avoids directing requests to
such instances/components. This is achieved using
1. Heartbeats: Each instance/application sends a heartbeat signal periodically to a monitoring
service in the cloud. If the monitoring service doesn’t receive a specific number of consecutive
heartbeats from an instance then that instance can be declared as failed.
2. Probing: The monitoring service sends a probe to the application instance and waits for a
response from the instance. If the instance does not respond to a fixed number of consecutive
probes then it can be declared as failed.
• Setting a low threshold for the number of missed heartbeats/probes can lead to faster failure
detection but can also lead to false positives. Hence there is a trade-off between the speed and
accuracy of failure detection (see the sketch at the end of this section).
• After identifying failed instances, it is necessary to avoid routing new requests to these instances.
A common mechanism used for this in HTTP-based protocols is HTTP-redirection.
• Application Recovery: Most commonly achieved using checkpointing.
• In check-pointing, the application state is periodically saved in some backing store by the cloud
infrastructure. In case the application fails, it can be restarted from the most recent checkpoint.
• Checkpointing is also available at the middleware level (eg: Docker).
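A toy sketch of heartbeat-based failure detection (the interval and miss threshold are illustrative assumptions; a real monitoring service would run this check continuously):

```python
# Heartbeat failure detection: an instance is declared failed once the
# monitor misses a fixed number of consecutive heartbeat intervals.
import time

HEARTBEAT_INTERVAL = 1.0   # seconds between expected heartbeats
MISS_THRESHOLD = 3         # consecutive misses before declaring failure

class HeartbeatMonitor:
    def __init__(self):
        self.last_seen = {}            # instance id -> time of last heartbeat

    def heartbeat(self, instance_id):
        self.last_seen[instance_id] = time.time()

    def failed_instances(self):
        now = time.time()
        return [i for i, t in self.last_seen.items()
                if now - t > MISS_THRESHOLD * HEARTBEAT_INTERVAL]

monitor = HeartbeatMonitor()
monitor.heartbeat("vm-1")
time.sleep(3.5)                        # vm-1 misses more than 3 intervals
print(monitor.failed_instances())      # ['vm-1']
```

Lowering MISS_THRESHOLD in this sketch detects failures sooner but makes a transient delay more likely to be misreported as a failure, which is exactly the speed/accuracy trade-off noted above.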

5 Cloud Deployment Models
5.1 Public Cloud
• The infrastructure is owned by a cloud provider; an entity (individual or company) pays the provider
for access to this infrastructure.
• Resources are virtualized into pools and these pools are allocated among multiple clients that are
using the cloud provider’s infrastructure (multi-tenancy).

• Access to these resources is done using the internet and its associated protocols (SSH, FTP, etc).
• The factors that make a particular cloud infrastructure public are: resource sharing using virtualiza-
tion, usage agreements on resources (pay-as-you-go may or may not be present), and management
(provider maintains hardware, networking and virtualization at the minimum)

5.1.1 Advantages
• Low cost

• Less need for server management


• Time saving
• Analytics

• Unlimited scalability, greater redundancy and availability of resources

5.1.2 Disadvantages
• Security
• Compliance with security standards and government rules on data security
• Interoperability and vendor lock-in

5.2 Private Cloud


• Utilizing in-house infrastructure to host cloud services for a single organization
• Can utilize hardware at a local site owned by the organization, or can be a Virtual Private
Cloud (VPC).
• In a VPC the hardware, networking and virtualization infrastructure is hosted by a third party
but with additional security and provisioned config for a secure and exclusive network

5.2.1 Advantages
• More control over resources and hardware

• Security and compliance due to additional layers of security


• Customization, the ability to have custom configurations to run proprietary applications

5.2.2 Disadvantages
• Cost
• Under-utilization of resources

• Platform scaling: upward changes in requirements need scaling of physical infrastructure, as against
simply scaling virtual instances on a hosted cloud

5.3 Hybrid Cloud
• A mix of data centers maintained by the organization and hosted cloud infrastructure, connected
by a VPN

• A hybrid cloud model allows enterprises to deploy workloads in private IT environments or public
clouds and move between them as computing needs and costs change
• This gives a business greater flexibility and more data deployment options. A hybrid cloud workload
includes the network, hosting and web service features of an application.

6 Distributed System Architecture


• Three main types of models: architectural models, interaction models, fault models

6.1 Architectural Models


6.1.1 Cluster Architecture
• Building of scalable clusters by connecting smaller clusters using networks (LAN, WAN, SAN for
storage devices). The cluster may be connected to the internet via a VPN.

• The OS and its resource sharing/scheduling policies determine the system image of the cluster

6.1.2 Peer-to-peer Architecture


• Every node acts as both client and server, and all nodes are identical in terms of resources and
capabilities.
• No central coordination or database, no global view of the entire system from any one machine,
and hence no parent-child relationship between nodes.
• P2P systems are self organizing, and peers can join/leave autonomously.
• Distribution of compute and networking workloads among many links and nodes; thus the P2P
model is the most flexible and general model.

6.1.3 Client-Server Architecture


• Processes running on server nodes offer service to user requests coming from client nodes.
• Client-server model is implemented as a request-response interaction using send/receive primitives
or using Remote Procedure Calls (RPCs).

6.1.4 n-Tier Architecture


• Web applications that forward requests to other enterprise services.

• Specific case is the 3-tier model where client intelligence is moved to a middle tier to enable the
use of stateless clients. This simplifies app deployment.

6.1.5 Grid Computing


• A computing grid offers an infrastructure that couples computers, software/middleware, special
instruments, and people and sensors together.
• The grid is often constructed across LAN, WAN, or Internet backbone networks at a regional,
national, or global scale.
• Enterprises or organizations present grids as integrated computing resources. They can also be
viewed as virtual platforms to support virtual organizations.

6.2 Interaction Models
6.2.1 Synchronous Distributed System
• All the components of the distributed system run on a common clock. The features of a synchronous
distributed system are:
1. Upper bound on message delivery time between processes or between nodes
2. Message delivery is always in order
3. Ordering of events happens at a global scale due to a shared clock
4. Lock-step based execution, meaning that similar operations performed by different nodes in
parallel complete at the same time, not at different times.
• These systems are predictable in terms of timing behaviour, hence are well suited for hard real-time
systems.

• It is possible, and safe, to use timeouts to detect application or communication errors

6.2.2 Asynchronous Distributed Systems


• There is no shared clock in such a system, each node maintains its own clock. The following are
the properties of asynchronous systems
1. No bound on process execution time, and no assumptions about speed, reliability of individual
nodes.
2. No upper bound on message delivery time
3. Clock rates between different nodes may change due to the phenomenon of clock drift.

6.3 Fault Models


• Definition of behaviours to be undertaken upon the occurrence of a fault.
• Faults can be in both hardware and software, and fault tolerance (ie. predictable behaviour in case
of a fault) is essential.
• The following are the types of faults

6.3.1 Omission and Arbitrary failures


• Fail-stop: Process halts, remains halted. Can be detected by outside applications
• Crash: Process halts, remains halted. May not be detected by outside applications
• Omission: Message inserted in outgoing buffer never enters the incoming buffer of the other end

• Send-omission: Process completes send but message never reaches outgoing buffer
• Receive-omission: Process does not receive a message put in its incoming buffer
• Arbitrary (Byzantine): Arbitrary behaviour wrt message send/receive actions, or omissions, or
stopping/incorrect actions.

6.3.2 Timing Faults


• Clock drift from real time exceeds the acceptable threshold.
• Process exceeds bounds on task completion time or message transmission time

7 Business Drivers for Cloud Computing
• Cost: Low upfront cost of hardware, reduced investment in future scalability, reduced costs from
resource under-utilization, and reduced management costs
• Assurance: Delegation of management responsibility to a cloud provider reduces need for skilled
IT admins and departments, while still maintaining high standards of security and availability.

• Agility: Faster response to customer requests for new services, due to faster deployment of new
services on the cloud. Also changing business requirements can be better handled
• Flexibility and Scalability: Easy to expand resources to meet increased workload
• Efficiency and improved customer experience: cloud computing allows streamlined enterprise work-
flows which result in better workplace productivity, and hence faster business growth

8 REST and Web Services


8.1 Service Oriented Architecture
• A method to develop reusable software components using service interfaces.

• Common communication standards are used for easy integration with existing services (standard
network protocols like HTTP/JSON and HTTP/SOAP are used to send requests for various
operations)
• Each service in an SOA embodies the code and data integrations required to execute a complete,
discrete business function. Services are loosely coupled, meaning that no underlying knowledge of
the service implementation is needed to use it.
• Two common SOAs are REST and Web Services

8.2 REST
• REpresentational State Transfer (REST) is an architectural style for distributed systems, used for
providing communication standards between APIs over the internet
• REST-compliant systems (aka RESTful systems) are characterized by their stateless behaviour and
the separation of concerns between server and client.

• A safe REST operation is one that does not modify any data
• An idempotent REST operation is one that does not change the state when applied multiple
times beyond the first time.
• REST architectural style is based on:

1. Resource identification through URIs: A resource is a target of interaction between a
service and its clients. Every resource is identified by a global identifier called a URI. The
existence of a globally unique URI provides global interaction as well as service discovery
capability.
2. Uniform, constrained interface: The REST interface consists of 4 basic operations, each
marked below as safe (S) or not, and idempotent (I) or not (see the example after this list):
– GET: Retrieve a resource (S, I)
– PUT: Add a resource (update if it already exists) (not S, I)
– POST: Create a new resource (make a duplicate if it already exists) (not S, not I)
– DELETE: Remove a resource (not S, I)
3. Self-descriptive messages: Messages contain metadata (how to process and other informa-
tion about the data). In REST paradigm, resources are decoupled from their representations,
hence data can be represented in multiple forms as per the client’s understanding. Metadata
is used for cache control, transmission error detection, authentication or authoriza-
tion, and access control.

4. Stateless Interaction: Server and client need not maintain each other’s state, and a message
can be understood without referring to any past messages (all messages are independent).
Statelessness has the benefits of:
– Client is isolated against changes on the server
– Promotes redundancy and improves performance due to reduced synchronization over-
heads
State is normally maintained (only if needed) through compact and lightweight text objects
called cookies.
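The safe/idempotent properties of the four operations can be seen with any HTTP client; a small sketch using the Python requests library against httpbin.org, a public echo service used here purely as a stand-in endpoint:

```python
# The four basic REST operations; S = safe, I = idempotent.
import requests

BASE = "https://httpbin.org"  # stand-in service, not a real resource API

requests.get(f"{BASE}/get")                        # GET: S, I
requests.put(f"{BASE}/put", json={"name": "a"})    # PUT: not S, I (repeat = same state)
requests.post(f"{BASE}/post", json={"name": "a"})  # POST: not S, not I (repeat = duplicate)
r = requests.delete(f"{BASE}/delete")              # DELETE: not S, I
print(r.status_code)                               # 200
```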

8.3 Web Services


• A self contained, self describing modular application designed to be accessible by other applications
across the internet.
• Web services are designed to support interoperable machine-to-machine communication over the
internet. Other applications and web services can discover and invoke a web service and then
communicate with it.

8.3.1 Protocol Stack for Web Services


• Transport Protocol: transports messages between applications over a network (HTTP, SMTP, FTP,
Blocks Extensible Exchange Protocol or BEEP)
• Messaging Protocol: Encoding information in a common XML format for understanding at both
end-points of the communication (eg: XML-RPC, WS-Addressing, SOAP)
• Description Protocol: Public interface description for a web service (WSDL)

• Discovery Protocol: Centralized registry for web services to publish their location and descrip-
tion, as well as for clients to discover available services (UDDI, not yet widely adopted)

8.3.2 SOAP
• Simple Object Access Protocol (SOAP) provides a standard packaging structure for transmission of
XML documents over HTTP, SMTP or FTP. It allows interoperability between different middleware
systems.
• Root element of SOAP message is called the envelope, which contains a:
1. Header: Authentication credentials, and routing info/transaction management/message pars-
ing instructions
2. Body: payload of the message
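A minimal sketch of this envelope/header/body structure, built here with Python's standard library (the GetPrice operation and the urn:example namespace are illustrative assumptions):

```python
# Constructing a minimal SOAP 1.1 envelope: a header for credentials or
# routing info, and a body carrying the message payload.
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"
ET.register_namespace("soap", SOAP_NS)

envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
ET.SubElement(envelope, f"{{{SOAP_NS}}}Header")       # auth/routing metadata
body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")  # message payload
op = ET.SubElement(body, "{urn:example}GetPrice")     # hypothetical operation
ET.SubElement(op, "item").text = "widget"

print(ET.tostring(envelope, encoding="unicode"))
```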

8.3.3 WSDL
• Web Services Description Language gives description of interface for web services (in terms of
possible operations)
• Standardized representation of input, output parameters, protocol bindings.
• Allows heterogeneous clients to communicate with the web service in a standardized manner

8.3.4 UDDI
• Universal Description, Discovery and Integration standard, a global registry for advertising and
discovery of web services
• Supports search by name, ID, category, or specification

9 Models for inter-process communication
• Interaction between processes can be classified along two dimensions:
– First dimension: one-to-one vs one-to-many
– Second dimension: asynchronous vs synchronous

• The following are the types of one-to-one interaction:


– Synchronous request/response: The client sends a message and blocks while waiting for the
reply from the service. This style results from tightly coupled service and client interaction.
– Asynchronous request/response: Client sends a message and the service responds to this
message asynchronously. The client does not block while waiting as there is no guarantee of
the service sending the reply within some constrained time interval
– One-way notification: Service client sends a request to a service but no reply is expected
• The following are the types of one-to-many interaction:
– Pub-sub: A client publishes a message which is consumed by 0 or more interested services
– Publish/async response: A client publishes a request message and then waits for a certain
amount of time for responses from interested services
• Advantages of asynchronous messaging:
– Reduced coupling
– Multiple subscribers
– Failure isolation: If the consumer fails, the sender can still send messages and the consumer
can read them once it is up again. In a synchronous service the downstream client must always
be operational
– Load leveling: A queue can act as a buffer to level the workload, so that receivers can process
messages at their own rate
• Disadvantages of asynchronous messaging:
– Tight integration with messaging infrastructure
– High latency in case of high load (message queue overflows)
– Handling complex scenarios like duplicate messages, and coordinating request-response pairs
– Throughput reduction caused by enqueue and dequeue operations as well as locking mecha-
nisms within a queue.

9.1 Message Queue


• A form of asynchronous communication used in serverless and microservice architectures. A mes-
sage queue provides a lightweight buffer which temporarily stores messages, and endpoints for
software components to connect to in order to send and receive messages.
• A message is pushed into the queue and stays there until it is processed and deleted.

• Message queues can be used to decouple heavyweight processing, to buffer or batch work, and to
smooth spiky workloads.
• Producer adds messages to the queue, Consumer reads messages from the queue and processes
them.

• A single queue can be used by multiple producer-consumer pairs but only one consumer can read
a message. Hence this model is used for point-to-point communication.
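A minimal in-process sketch of this point-to-point model, using Python's standard queue module in place of a distributed message broker:

```python
# Point-to-point messaging: the queue buffers messages, and each message
# is read (and thereby removed) by exactly one consumer.
import queue
import threading

q = queue.Queue()                      # lightweight message buffer

def producer():
    for i in range(5):
        q.put(f"msg-{i}")              # push messages into the queue
    q.put(None)                        # sentinel: no more messages

def consumer():
    while True:
        msg = q.get()                  # a message is consumed exactly once
        if msg is None:
            break
        print("processed", msg)

threading.Thread(target=producer).start()
t = threading.Thread(target=consumer)
t.start()
t.join()
```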

9.2 Pub-Sub Model
• The following are the components of a pub-sub communication model:

– Topic: intermediary channel that maintains a list of subscribers to send a message to
– Message: Serialized data sent to a topic by publishers
– Publishers: Service that publishes the messages
– Subscribers: A service that subscribes to a topic in order to receive messages published on
that topic

• Advantages of pub-sub model:


– Loose coupling as publishers and subscribers are not aware of one another and are independent
of each other’s failures. Hence independent scaling of subscribers and publishers is also allowed.
– Scalability due to parallel operations, message caching, tree-based routing, and multiple other
features built into the pub/sub model
– Allows instantaneous push-based delivery, removing the need for polling; this yields faster
response times and reduced delivery latency
– Dynamic targeting as subscribers can dynamically add and remove themselves from a topic,
and the topic server can adjust to changing numbers of subscribers.
– Fewer callbacks and simpler code for communication makes it easier to maintain and extend.

9.3 REDIS
• Remote Dictionary Server, a fast, open-source, in-memory key-value data store

• It can be used as a database, a message-broker or a queue.


• In-memory storage eliminates disk seek time and allows microsecond-latency data access.
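A small sketch of Redis acting as a pub-sub message broker via the redis-py client; this assumes a Redis server is reachable on localhost:6379 and that the redis package is installed:

```python
# Pub-sub through Redis: a subscriber registers on a topic (channel),
# and a publisher pushes messages to it.
import redis

r = redis.Redis(host="localhost", port=6379)

sub = r.pubsub()
sub.subscribe("alerts")                 # subscriber side: register on a topic

r.publish("alerts", "disk usage high")  # publisher side: push to the topic

for message in sub.listen():            # blocking iteration over channel events
    if message["type"] == "message":    # skip the subscribe confirmation
        print(message["data"])          # b'disk usage high'
        break
```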

10 Monolithic and Micro-Services Applications


• Monolithic applications are built as a single unit, deployed on a single machine, and consist of a
client-side application, a server-side application and a database.
• Microservice applications divide each component of the application into independent units that
implement different parts of the business logic.

• Advantages of monolithic application:


– Easy to develop
– Simple testing and simple test automation
– Easy to deploy

• Advantages of microservice applications:


– Flexibility in adopting new technologies, maintaining smaller code bases.
– Reliability, as one service failing doesn't affect the remaining ones.
– Fast development due to reduced code base size, hence also easier to improve code quality.
– Building complex applications is easier once the boundaries for components are decided, each
can be developed independently and in parallel.
– Highly scalable
– Continuous deployment, meaning that microservice components can be independently updated
without affecting the rest of the software

10.1 Service Oriented Architecture
• SOA breaks up the components required for applications into separate service modules that com-
municate with one another to meet specific business objectives.

• Microservice architecture is generally considered an evolution of SOA as its services are more
fine-grained, and function independently of each other

10.2 Migration from monolithic to microservice model


• Some architectural challenges involved in this migration are:
– Decomposition of monolithic software into independent components
– Database decomposition in a consistent manner
– Transaction boundaries
– Performance and testing
– Inter-service communication

10.2.1 Service Decomposition


• Don’t add new features, start with existing loosely coupled components and identify those compo-
nents which are ripe for enhancement

• Service decomposition leads to management and infrastructure overheads, which can be mitigated
using containerization technologies that vastly simplify deployment and configuration

10.2.2 Database decomposition


• In a monolithic application, modules access data belonging to other modules using table joins.
In a microservice application this can be avoided by using APIs to access data, or using projec-
tion/replication of data.

• Shared database tables, as well as current state information used by multiple components of a
monolithic application, can be modelled as separate independent services

10.2.3 Transaction Boundaries


• ACID properties of a single database are easier to maintain than that of distributed databases in
a microservice application.
• In a 2-phase commit, the controlling node first asks all the participating nodes whether they are
ready to transact. Only if all nodes respond yes does the controller ask them to commit; if any
single node responds no, all nodes are made to roll back the transaction
• In a compensating or Saga transaction, each service performs its own transaction and publishes
an event. The other services listen to that event and perform the next local transaction. If one
transaction fails for some reason, then the saga also executes compensating transactions to undo
the impact of the preceding transactions.
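A minimal sketch of a saga with compensating transactions; the order-processing steps are illustrative assumptions, not from the notes:

```python
# Saga pattern: each local transaction is paired with a compensating
# action; when a step fails, compensations for the completed steps are
# executed in reverse order to undo their impact.
def reserve_stock():  print("stock reserved")
def release_stock():  print("stock released")
def charge_card():    print("card charged")
def refund_card():    print("card refunded")
def ship_order():     raise RuntimeError("no courier available")

saga = [
    (reserve_stock, release_stock),
    (charge_card,   refund_card),
    (ship_order,    lambda: None),
]

completed = []
try:
    for action, compensate in saga:
        action()
        completed.append(compensate)
except RuntimeError as e:
    print("saga failed:", e)
    for compensate in reversed(completed):  # undo in reverse order
        compensate()
```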

10.2.4 Performance and Testing


• Increased resource usage can cause microservice applications to perform more slowly
• This can be overcome by provisioning more hardware, logging to analyze bottlenecks, throttling,
dedicated thread pools, and asynchronous features to improve performance

• Writing integration test cases is challenging, as it requires knowledge of all microservice components
and because such applications are asynchronous
• The solution is to adopt various testing methodologies and tools, and to leverage continuous inte-
gration capabilities through automation and standard agile methodologies

Cloud Computing (UE18CS352)
Unit 2
Aronya Baksy
February 2021

1 Introduction
• Virtualization is a framework for dividing a single hardware resource (compute or storage) into
multiple independent environments.
• This is done by applying concepts such as h/w and s/w partitioning, emulation, etc.
• A virtual machine (VM) is a complete compute environment with its own processing capability,
as well as memory and communication channels. It is an efficient, isolated duplicate of the physical
machine, with the ability to run a complete operating system.
• A hypervisor (also called a virtual machine monitor or VMM) is a software layer that is
responsible for creation and management of Virtual machines.

1.1 Why Virtual Machines


• Operating System Diversity: can run multiple different OSes on a single machine
• Rapid provisioning, server consolidation: Allows for on-demand provisioning of hardware
resources
• High availability, Load Balancing: the ability to live-migrate VMs ensures that these 2 aspects
of cloud computing are handled
• Encapsulation of a single application’s execution environment.

1.2 Qualities of a Hypervisor


• Equivalence: virtual and real machines should have a similar interface
• Safety and isolation: VMs should be isolated from one another, and also from the underlying
physical hardware
• Low performance overheads: Virtual Machine should have similar performance to the physical
machine.

2 Types of Virtualization
2.1 Type 1 virtualization
• Type 1 hypervisors are installed directly on top of a bare metal hardware, and they have direct
control over hardware resources.
• Type 1 hypervisors behave like OSes with only virtualization functionality, and a limited GUI for
administrators to configure system properties.
• Type 1 hypervisors offer simpler setup (provided that compliant hardware exists), more scalability,
more security and higher performance than type 2 hypervisors.
• e.g.: Xen, Oracle VM (based on Xen), VMWare ESXi server, Microsoft Hyper-V

2.2 Type 2 virtualization
• This type of hypervisor runs within a host OS that runs on top of physical hardware. For this
reason type 2 virtualization is also called hosted virtualization.

• They have interfaces to act as management consoles for all the deployed VMs
• Type 2 hypervisors offer simpler setup than type 1, but less scalability, larger performance overheads
and less security than type 1.
• e.g.: VMWare Workstation, Oracle VirtualBox

2.3 Full Virtualization and Para-Virtualization


2.3.1 Para Virtualization
• In para-virtualization, the guest OS is modified in a way such that all privileged instructions in the
kernel that are addressed to the hardware are replaced by hypercalls to the hypervisor.
• The guest OS accesses privileged functions of the hardware through these hypercalls.
• Para-virtualization improves system performance and reduces the overheads of virtualization, as all
privileged instruction calls (which are handled using hypercalls) are handled at compile time,
instead of at run time.
• Disadvantage: the need for modifying the guest OS, and the resulting lack of portability of the
modified guest across different hypervisors.

2.3.2 Full Virtualization


• The VMM simulates the hardware at a level that allows any unmodified guest OS to run in isolation
on top of the host OS. It is also called transparent virtualization.
• Full virtualization is achieved using a combination of binary translation and direct execution.

3 Trap and Emulate Virtualization


• Instructions that cannot affect the state of the system (which is typically stored in control registers
on the CPU labelled as CR0 to CR4) can be run directly by the hypervisor on the hardware.
• Sensitive instructions that change system state cannot be executed in user mode (ring 3). Such an
attempt raises a trap, also called a general protection fault.

• The hypervisor emulates the effect of such sensitive instructions so that the guest OS still gets the
impression that it is running in kernel mode when it is actually not.
• In trap-and-emulate virtualization, the:
– Guest applications run on ring 3
– Guest OS runs on ring 1
– VMM runs on ring 0
• When a guest app in ring 3 issues a system call, an interrupt is issued to the guest OS in ring 1.
• The interrupt handler in the guest OS runs the system call routine. When a privileged instruction
is encountered as part of this routine, the guest OS kernel issues an interrupt to the VMM.

• The VMM emulates the functionality of that privileged instruction, returns control to the guest
OS.
• Essentially, trap-and-emulate is a method of fooling a guest OS (that is actually running on ring
1) into thinking that it is running in the kernel space on ring 0.
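A toy model of this control flow (the rings, the sensitive-instruction set, and the "virtual state" are simplified illustrative assumptions, not a real ISA):

```python
# Trap-and-emulate: code running in a lesser-privileged ring traps to the
# VMM when it executes a sensitive instruction; the VMM emulates the
# effect on the guest's virtual system state, preserving the illusion of
# kernel-mode execution.
SENSITIVE = {"write_cr3", "disable_interrupts"}

class VMM:                                  # runs in ring 0
    def __init__(self):
        self.virtual_state = {}             # guest's view of system state

    def emulate(self, instr):
        self.virtual_state[instr] = "applied"   # change virtual state only
        print(f"VMM emulated {instr}")

def guest_execute(instr, ring, vmm):
    if instr in SENSITIVE and ring > 0:
        vmm.emulate(instr)                  # trap: control passes to the VMM
    else:
        print(f"{instr} ran directly on hardware")

vmm = VMM()
guest_execute("add", ring=1, vmm=vmm)        # harmless: runs directly
guest_execute("write_cr3", ring=1, vmm=vmm)  # sensitive: trapped and emulated
```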

3.1 Issues with trap-and-emulate
• Some registers in the CPU reflect the actual privilege level. If the guest OS were to read these
registers and detect that it is not running in kernel mode it might stop functioning normally.

• Some instructions that change system state run in both kernel and user space, but with different
semantics. This might lead to the guest not trapping to the VMM when a privileged instruction
is encountered.
• High performance overheads in processing interrupts.

• Not all ISAs support trap-and-emulate out of the box. Most notably, Intel's x86 ISA did not
support trap-and-emulate for a long time.

3.2 Issues with x86 virtualization


• The popf instruction is an example of an instruction that does not work with trap-and-emulate.
• In user mode, popf is used to change the ALU status flags. In kernel mode, popf is used to change
system state flags (such as flags related to interrupt delivery).
• In user mode, popf silently ignores attempts to change the system state flags and generates no
interrupt. Hence even though the instruction is sensitive, it is not privileged, as it does not issue a trap.

• There are 17 such instructions in the x86 ISA. Instructions like pushf reveal to the guest that it is
running in user mode, while instructions like popf discussed above do not execute accurately.

3.3 Some definitions


3.3.1 Privileged
• State of the processor is privileged if
– Access to that state breaks the virtual machine’s isolation boundaries
– It is needed by the monitor to implement virtualization

3.3.2 Strictly Virtualizable


• A processor or mode of a processor is strictly virtualizable if, when executed in a lesser privileged
mode:
– All instructions that access privileged state trap
– All instructions either trap or execute identically

3.4 Binary Translation


• Binary translation is a method of implementing full virtualization. The steps involved in binary
translation are:

1. The VMM reads the next upcoming basic block of instructions. (By basic block we mean a
logical block of instructions from the current point up to the next branch.)
2. Each instruction in this basic block is translated to the target ISA, and the result is stored in
a translation cache.
3. Translation involves 3 types of instructions:
– Instructions that can be directly translated and are safe (called ident instructions)
– Short instructions that must be emulated using a sequence of safe instructions (eg: inter-
rupt enable). This is called inline translation
– Other dangerous instructions need to be performed by emulation code in the monitor.
These are called call-out instructions (eg: instructions that change the PTBR).
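A highly simplified sketch of this loop (the instruction classification and the string-based "translation" are illustrative assumptions; a real translator works on machine code):

```python
# Binary translation: basic blocks are translated once and cached;
# each instruction is handled as ident, inline, or call-out.
IDENT = {"add", "mov"}     # safe: copied through unchanged
INLINE = {"sti"}           # emulated in place by a short safe sequence
# anything else: call out to emulation code in the monitor

translation_cache = {}     # block start address -> translated block

def translate_block(code, start):
    out = []
    pc = start
    while pc < len(code):
        instr = code[pc]
        if instr in IDENT:
            out.append(instr)                          # ident: pass through
        elif instr in INLINE:
            out.append(f"inline_emulation_of({instr})")
        else:
            out.append(f"call_monitor({instr})")       # call-out to the VMM
        pc += 1
        if instr.startswith("jmp"):   # a branch ends the basic block
            break
    translation_cache[start] = out    # reused on the next visit to this block
    return out

print(translate_block(["mov", "sti", "write_cr3", "jmp 0"], 0))
```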

3.5 Hardware-Assisted Virtualization
• The challenges of virtualizing x86 are outlined in section 3.2, and the methods to solve them were
adopted as part of Intel’s VT-x and AMD’s AMD-V feature set

• The CPU now has 2 modes of operation, a root mode and a non-root mode.
• Both root and non-root mode have 4 rings. The current hardware state is maintained separately
for both modes.
• The root mode is more powerful than the kernel mode. The host OS and VMM run in root mode,
while the guest OS and applications run in non-root mode.
• If any sensitive instructions are executed in non-root mode, a VMEXIT condition signals to the
processor to enter root mode. In root mode this sensitive operation is emulated by the VMM and
the processor switches back to non-root mode.
• The hardware state of a VM is maintained in a data structure called the Virtual Machine
Control Structure (VMCS). The VMM is in charge of creating the VMCS and modifying it
(when emulating sensitive instructions).

4 Memory and I/O Virtualization


4.1 Memory Virtualization
• There are 3 address spaces that have to be translated for a successful memory reference by a guest
OS. They are:
1. The virtual address space of guest (also called guest virtual address or GVA)
2. The physical address space of the guest (also called as the guest RAM or the guest physical
address or GPA)
3. The host physical address (HPA)
• The guest OS page table translates from GVA to GPA.
• The page tables maintained by the VMM translate from GPA to HPA (via the virtual address
space of the host, also called the host virtual address or HVA).

4.1.1 Shadow Page Tables


• The VMM creates a mapping from the GVA to the HPA (direct mapping) by combining the
information available in the guest OS page table and the host OS page table.
• This direct mapping is called the shadow page table (SPT). It is offered to the hardware MMU
(memory management unit) as a pointer to the base location. The pointer is stored in the control
register cr3.

• While the guest is active, the VMM forces the processor to use the SPT for all translations.
• Whenever the guest OS modifies the guest page table, the VMM must update the shadow page
table. This is implemented by making the guest page table write protected.
• This means that whenever the guest OS tries to write to the guest page table, a page fault is raised,
and a trap is set to the VMM. The VMM handles the trap and modifies the SPT.
• For every guest application there is one shadow page table. Every time a guest application context
switches, there is a trap to the VMM to change cr3 to point to the new shadow page table.
• The drawbacks of the shadow page table concept are that it leads to overheads involved in handling
traps, and that the TLB cache has to be flushed on every context switch.
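The composition that the VMM performs can be sketched with plain dictionaries (the addresses are illustrative assumptions):

```python
# Shadow page table construction: the guest page table maps GVA -> GPA,
# the VMM's own map gives GPA -> HPA, and the shadow table composes the
# two into the direct GVA -> HPA mapping handed to the MMU via cr3.
guest_page_table = {0x1000: 0x2000}   # GVA -> GPA (maintained by the guest OS)
vmm_map          = {0x2000: 0x9000}   # GPA -> HPA (maintained by the VMM)

shadow_page_table = {
    gva: vmm_map[gpa]                 # compose the two translations
    for gva, gpa in guest_page_table.items()
    if gpa in vmm_map
}

print(hex(shadow_page_table[0x1000]))  # 0x9000: direct GVA -> HPA
```

With extended page tables (next section) the same two-level GVA -> GPA -> HPA translation is walked by the hardware itself, so the VMM no longer has to keep this composed table in sync.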

4.1.2 Extended Page Tables
• The processor is made aware of the virtualization, and the two-level address translation that is
needed to support it.

• Guest-physical addresses are translated by traversing a set of EPT paging structures to produce
physical addresses that are used to access memory.
• A field in the VMCS maintains a pointer to the Extended page table, called the EPT Base Pointer.
• Benefits of EPT:

1. Performance increased due to reduced overheads over shadow paging (performance increase
is dependent on type of workload)
2. Reduced memory footprint compared to the SPT scheme, which requires maintaining a table
for each guest application that is started.

4.2 I/O Virtualization


• It is a technology that uses software to abstract upper-layer protocols from physical connections
or physical transports.
• It involves managing and routing I/O requests from multiple virtual I/O devices and the shared
physical hardware underneath.
• The three types of I/O virtualization are:
– Full Device Emulation
– Para I/O Virtualization
– Direct I/O Virtualization

4.2.1 Full device emulation


• All functions of an I/O device (such as device enumeration and identification, interrupt handling,
DMA manipulation) are emulated entirely in software
• This software is located in the VMM and acts as a virtual device.

• The I/O access requests of the guest OS are trapped to the VMM which interacts with the I/O
devices.

4.2.2 Para I/O Virtualization


• This is also called the split driver model. It consists of a front end driver that runs on the guest
OS, and a back end driver that runs on the VMM.

• The front end and back end drivers interact with each other via shared memory.
• The front end driver intercepts I/O requests from the guest OS. The back end driver manages the
physical I/O hardware as well as multiplexing the I/O data coming from different VMs
• Performance-wise, para I/O virtualization is better than full device emulation, but it comes
with a high CPU overhead.

4.2.3 Direct I/O Virtualization


• Allows a guest to directly access the physical address of an I/O device. Virtual devices can directly
perform DMA accesses to/from host memory.
• Intel VT-x technologies (if enabled) allow for a VM to directly write control information to a
device’s control registers. The VT-d extension allows for I/O devices to write into the memory
that is controlled by VMs.

• The VMM utilizes and configures technologies such as Intel VT-x and Intel VT-d to perform address
translation when sending data to and from an IO device.
• Advantage of faster performance, but limited scalability (as a single I/O device can only be assigned
to a single VM).

4.2.4 Advantages of I/O Virtualization


• Flexibility due to abstraction of physical protocols, also leads to faster provisioning.
• Minimization of costs as there is now less need for hardware infrastructure like cables/switch
ports/network cards.
• Increased practical density of I/O as it allows more connections to exist in a given space.

5 Goldberg and Popek Theorems


• Fundamental results postulated by G. Popek and R. Goldberg in 1974 that justify and prove that
virtualization is possible to achieve.

• Goldberg and Popek classified the instructions in an ISA into the following categories:
1. Behaviour sensitive instructions are those wherein the final result of the instruction is
dependent on the privilege level (i.e. executing that instruction in a lower privilege level leads
to a wrong output)
2. Control sensitive instructions are those which result in change of processor state or processor
privilege.
3. Privileged instructions are those that trap if the processor is in user mode and do not trap
if it is in system mode (i.e kernel or supervisor mode).

5.1 Requirements for virtualization support


As postulated by Goldberg and Popek, the requirements for an ISA to support virtualization are:
• Equivalence: A program executing on a VM must display essentially identical behaviour as one
executing on an equivalent machine directly.

• Resource Control: A VM must be in total control of the virtualized resources


• Efficiency: A statistically dominant fraction of machine instructions must be executed without
VMM intervention.

5.2 Theorems
5.2.1 Theorem 1
"For any conventional third generation computer, a VMM may be constructed if the set of sensitive
instructions for that computer is a subset of the set of privileged instructions."

• The theorem states that to build a VMM it is sufficient that all instructions that could affect the
correct functioning of the VMM (sensitive instructions) always trap and pass control to the VMM.

5.2.2 Theorem 2
"A conventional third generation computer is recursively virtualizable if it is:
1. virtualizable, and
2. a VMM without any timing dependencies can be constructed for it."

5.2.3 Theorem 3
"A hybrid virtual machine monitor may be constructed for any conventional third generation machine
in which the set of user sensitive instructions are a subset of the set of privileged instructions."

6 Live Migration of VMs


• Allows for real-time transfer of VMs from one physical node to another
• The challenge here is the design of a migration strategy that allows for migration without
affecting the performance of the cluster of nodes.
• Why migration? Because of load balancing. Using the user login frequency and the load index, the
most appropriate node for a given VM is chosen, to improve response time and increase resource
utilization.

• Live migration is desired when load on the cluster becomes unbalanced and real-time correction is
needed.
• Migration also allows for scalability (up and down) as well as rapid provisioning.

6.1 6-step migration process


6.1.1 Step 0 and 1: Migration Start
• Determining the source VM and the destination host.
• This can be started manually by a human user or by an automated load-balancing or server
consolidation system.

6.1.2 Step 2: Iterative pre-copy


• Iteratively copy dirty pages from the source to the destination. Memory is copied page-wise, as it
reflects the current execution state of the VM and is required to continue the same functionality.
• This copy is carried out until the dirty portion of memory is small enough to be copied in a single
final round.

• During the pre-copy phase, the functioning of the source VM is not interrupted.
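A sketch of the pre-copy loop (page counts, thresholds, and the random dirtying model are illustrative assumptions):

```python
# Iterative pre-copy: dirty pages are copied round by round while the VM
# keeps running; when the dirty set is small enough, or a round limit is
# hit, the VM is stopped for the final copy.
import random

dirty = set(range(1000))   # initially every page must be transferred
STOP_THRESHOLD = 16        # small enough to copy in one final round
MAX_ROUNDS = 10            # bound the rounds: dirtying may never converge

rounds = 0
while len(dirty) > STOP_THRESHOLD and rounds < MAX_ROUNDS:
    send_now = set(dirty)
    dirty.clear()
    # ... transfer send_now to the destination over the network ...
    # The still-running VM dirties some pages during the transfer:
    dirty |= {random.randrange(1000) for _ in range(len(send_now) // 4)}
    rounds += 1

# Stop-and-copy: pause the VM and ship the remaining pages plus CPU/network state.
print(f"stop-and-copy after {rounds} rounds, {len(dirty)} pages remaining")
```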

6.1.3 Step 3: Stop and Copy


• The source VM is stopped, and the remaining memory state information is copied to the destination
VM.
• During this phase the source VM's functioning is paused. This "downtime" must be made as short
as possible.
• Non-memory state of the source VM such as the CPU state and network state is also sent in this
step.

6.1.4 Step 4 and 5: Commitment and Activation


• The VM reloads the states and recovers the execution of programs in it, and the service provided
by this VM continues.
• Then the network connection is redirected to the new VM and the dependency to the source host
is cleared.
• The whole migration process finishes by removing the original VM from the source host.

6.2 Pre-copy and post-copy migration
• In pre-copy migration, the aim is not to impact the functioning of the source VM. However since
the migration daemon is making use of the network to transfer dirty pages, there is a degradation
of performance that occurs.
• Adaptive rate-limited migration is used to mitigate this to an extent.
• Moreover, the maximum number of iterations must be capped, because not all applications' dirty
pages are guaranteed to converge to a small writable working set over multiple rounds.
• In post-copy migration, the migration is initiated by stopping the source VM; a minimal subset
of the execution state of the VM is transferred to the target, and the VM is then resumed at the
target.
• Concurrently, the source actively pushes the remaining memory pages of the VM to the target -
an activity known as pre-paging.
• At the target, if the VM tries to access a page that has not yet been transferred, it generates a
page-fault. These page faults are trapped, sent to the source and the source replies with the page
requested.

7 Lightweight Virtualization
7.1 Containers
• Containers are a logical packaging mechanism where the code and all of its dependencies are
abstracted away from their run time environment.
• This allows for much easier deployment on a wide variety of hardware, as well as more effective
isolation and much lower CPU/memory overheads.
• Containers are an example of OS-level virtualization, and multiple containers running on a host
share the same OS. Similar to a VMM for full-scale virtual machines, containers are managed by
a container manager.
• Examples of real world implementation of container technology are Docker, Google’s Kubernetes
Engine, AWS Fargate, Microsoft Azure etc.

7.2 Docker
• Docker is a product that is used to deliver software in the form of containers, and it makes use
of Linux technologies that promote OS-level virtualization such as cgroups, namespaces and
others.
• Docker consists of 3 components:
1. The Docker engine
2. The Docker client (normally a command line interface which is called the Docker CLI)
3. The container registry
• The Docker daemon (dockerd) listens for Docker API requests and manages Docker objects such as
images, containers, networks, and volumes. A daemon can also communicate with other daemons
to manage Docker services.
• The container registry stores Docker images. An example of a publicly-available registry is Docker
Hub. By default, the docker pull and docker run commands pull the needed images from
Docker Hub.
• It is possible to configure Docker to look elsewhere for images, including one’s own privately set
up registry.
• The Docker Engine is a client-server program. The Docker CLI acts as a client and uses the
Docker API to send requests. The engine listens for these requests, and sends them to the Docker
daemon running on the server.
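A small sketch of this client-server interaction using the Docker SDK for Python (pip install docker); it assumes a local Docker daemon is running:

```python
# Talking to the Docker daemon programmatically: pull an image from the
# default registry (Docker Hub) and run a container from it.
import docker

client = docker.from_env()   # connect to the daemon using environment config

client.images.pull("alpine")                 # same effect as `docker pull alpine`
output = client.containers.run("alpine", "echo hello from a container")
print(output)                                # b'hello from a container\n'

print([c.name for c in client.containers.list(all=True)])  # list containers
```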

7.2.1 Docker Images
• A Docker image is a read-only template that is used to set up a running container.

• It provides a convenient way to package up applications and pre-configured server environments,
which can be used for private use or shared publicly with other Docker users.
• Each of the files that make up a Docker Image is called a layer. Layers are treated by Docker as
intermediate images that are built in a specific order (each layer being dependent on the layers
below it)

• Layers that change most often are placed at the top, so that a minimal number of layers needs to be rebuilt each time a change occurs (when a layer changes, only the layers above it must be rebuilt).
• When a container is launched from an image, a thin writable layer called the container layer is
added at the top. The container layer stores all the changes made to the container state as it runs.

• This allows for multiple containers to share the same image layers but only have their distinct
container layers at the top.
• A Dockerfile is a plain-text file that specifies the steps involved in creating a Docker image.

7.3 Linux namespaces


• A namespace is a method of partitioning processes into groups such that different groups see different sets of resources.
• The resources that are partitioned in this way can be the file system, networking, interprocess
communication as well as process IDs.

7.3.1 Mount namespace


• Mount namespaces allow a process that lies within a namespace to have a completely different view of the system mount structure from the actual one.
• This promotes isolation as it allows each isolated process to have its own separate file system root,
hence it avoids exposing more information about the overall file system than is needed.

• Any mount/unmount operations that are done by the isolated process in its own mount namespace will not affect the parent mount namespace, nor any other isolated mount namespace in the hierarchy.

7.3.2 UTS namespace


• UTS (UNIX Time-Sharing) namespaces allow a single system to appear to have different host and domain names to different processes.
• When a process creates a new UTS namespace, the hostname and domain of the new UTS namespace are copied from the corresponding values in the caller's UTS namespace.

7.3.3 Network namespace


• Network namespaces allow processes to see entirely different sets of network interfaces.

• Each network namespace consists of the following objects:


– Network devices (labelled as veth or Virtual Ethernet devices)
– Bridge networks
– Routing tables
– IP Addresses
– Ports

• Virtual network interfaces span multiple network namespaces, and allow interfaces in different namespaces to communicate with one another.
• A routing process takes data incoming at the physical interface, and routes it to the correct network
namespace via the virtual network interface

• Routing tables can be set up that route packets between virtual interfaces.

7.3.4 Linux Control Groups (cgroups)


• Developed by Paul Menage, Rohit Seth and others at Google (2006), cgroups is a Linux kernel feature that allows the limiting, accounting and isolation of resources (CPU, memory, disk I/O, network, etc.) for a group of processes.

• Functionality of cgroups is as follows:


– Access: which devices can be accessed by a particular cgroup
– Resource limiting
– Resource prioritization (between cgroups)
– Accounting
– Control: freezing processes and checkpointing
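
• As a rough sketch of the resource-limiting interface (assuming a cgroup-v2 hierarchy mounted at /sys/fs/cgroup and root privileges; the group name here is made up), cgroups are driven entirely through the filesystem:

import os

CG = "/sys/fs/cgroup/demo"         # hypothetical cgroup name

os.makedirs(CG, exist_ok=True)     # creating the directory creates the cgroup

with open(os.path.join(CG, "memory.max"), "w") as f:
    f.write("104857600")           # resource limiting: cap the group at 100 MiB

with open(os.path.join(CG, "cgroup.procs"), "w") as f:
    f.write(str(os.getpid()))      # move the current process into the group

Container managers such as Docker perform essentially these writes on the container's behalf.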

7.4 Container File System: UnionFS


• The drawbacks of existing file systems w.r.t. containerized services are:
– Inefficient disk-space utilization: If 10 instances of a single Docker container (each of size 1 GB) are started, then on a traditional FS they take up 10 GB of disk space in total.
– Latency in startup: Given that a container is essentially a process, the only way for a process to be created is a fork() system call. The inefficiency arises because each time a container has to be started, all the image layers have to be copied into the new process address space, which takes time when there is a large number of image layers.
• UnionFS is a unified and coherent view to files in separate file systems. It allows for multiple file
systems to be mounted onto a single root.

• It allows files and directories of separate file systems, known as branches, to be transparently
overlaid, forming a single coherent file system.
• Contents of directories which have the same path within the merged branches will be seen together
in a single merged directory, within the new, virtual file system.
• This allows a file system to appear as writable, but without actually allowing writes to change the
file system, also known as copy-on-write.
• In the CoW mechanism, any changes that are made to any of the image layers that make up
the UnionFS, are reflected only in the topmost container layer. The image layer is copied to the
container layer FS and changes are written there.


7.4.1 Disadvantages
• Translating between different file systems' rules about file names and attributes, as well as their differing features, is difficult.
• Copy-on-write makes memory-mapped file implementation hard

• Not appropriate for working with long-lived data or sharing data between containers, or a container
and the host.

8 DevOps on the cloud
• DevOps is an integration of Software Development methodologies and IT operations that are
involved in deployment and operation of software.
• DevOps automates the processes that occur between software development and IT teams so that software can be built, released and tested faster and more reliably.

• One of the key principles of DevOps is Continuous Integration along with Continuous Deployment and Continuous Delivery (commonly referred to as CI/CD).
• CI/CD promotes the practice of making small changes and integrating them with the main codebase
often, and using automated deployment infrastructure to test on a production-like environment.

• The entire CI/CD sequence of stages is organized in the form of a sequential pipeline. The pipeline
consists of a series of automated actions that take code from a developer environment to a produc-
tion environment.
• Pipelines automate the build, test and publishing of artifacts so that they can be deployed to a
runtime environment.

• Tools such as Jenkins, Drone, and Travis CI are used for CI/CD pipeline management.
• A typical CI/CD pipeline is as follows:
– Developers push their changes to a centralized Git repository
– Build server automatically builds the application and runs unit tests and integration tests on
it
– If all tests pass then container image is pushed to the central container repository.
– The newly built container is automatically deployed to a staging environment
– The acceptance tests are carried out in this staging environment.
– Verified and tested container image is pushed to production environment.

8.1 Continuous Integration (CI)


• CI is a development practice wherein developers integrate code into a shared codebase (implemented
as a repository) at a high frequency (maybe several times a day).
• Each integration can then be verified by an automated build and automated tests.
• CI consists of the following workflows:
1. Development and unit testing in the developer’s local environment
2. Compile code on automated build server
3. Run additional static analyses, measure and profile performance, generate documentation and
facilitate manual QA processes
4. Integration with Continuous Delivery (make sure code is always at a deployable state) and
Continuous Deployment (automate deployment).

8.2 Continuous Deployment (CD)


• Automated deployment of successful builds to the production environment.
• In an environment in which data-centric microservices provide the functionality, and where the
microservices can be multiply instantiated, CD consists of instantiating the new version of a mi-
croservice and retiring the old version.

8.3 Jenkins
• A self-contained, open source automation server which can be used to automate tasks related to
building, testing, and delivering or deploying software.

• It can be installed via package managers (apt-get, pacman), DockerHub, or natively built on a machine with the Java Runtime Environment (JRE).
• Plugins are used to extend Jenkins functionality as per the user-specific or organization-specific
needs

• Some commonly used Jenkins plugins are:


– Dashboard view, view job filters
– Monitoring and metrics
– Kubernetes plugin
– Build pipeline
– Git and GitHub integration

9 Container Orchestration and Kubernetes


• The process of deploying containers on a compute cluster consisting of multiple nodes.
• Includes managing container lifecycles in large and dynamic environments (especially in microser-
vice architectures where each microservice is implemented as a container)
• Orchestrator is a software that virtualizes di↵erent physical nodes into a single compute infrastruc-
ture for the user to deploy containers on.

• It automates deployment, scaling, management, networking and availability of container-based


apps.
• Scheduling: managing the resources available and assigning workloads where they can most effi-
ciently be run.

• Cluster management: joining multiple physical or virtual servers into a unified, reliable, fault-
tolerant group.
• Typically orchestrators take care of all 3: orchestration, scheduling and cluster management.
• Kubernetes (or K8s for short) is the most prominent example of such a software. Others are Docker
Swarm, Google Container Engine (built on Kubernetes), and Amazon ECS.

9.1 K8s Architecture


9.1.1 K8s Pod
• K8s manages applications that consist of communicating microservices
• Often those microservices are tightly coupled, forming a group of containers that would typically, in a non-containerized setup, run together on one server.
• This smallest unit that can be scheduled to deploy on K8s is called a pod.

• The containers in a pod share cgroups, namespaces, storage and IP Addresses as they are co-located.
• Pods have a short lifetime, they are created, destroyed and restarted on demand.

9.1.2 K8s Service
• As pods are short-lived, there is no guarantee on their IP address, which makes communication hard.

• A service is an abstraction on top of a number of pods, typically running a proxy on top so that other services can communicate with it via a virtual IP address.
• Numerous pods can be exposed as a service with configured load balancing.

9.1.3 Master Node


• Consists of an API Server, a key-value store called etcd, scheduler and controller-manager

• The API server serves REST API requests according to the bound business logic.
• etcd is a consistent and simple key-value store that is used for service discovery and shared config
storage. It allows for CRUD operations and notification services to notify the cluster about config
changes.

• Scheduler deploys configured pods and services onto the worker nodes. It decides placement based on the resources available on each node.
• Controller-manager is a daemon that enables the use of various control services. It makes use of
the API server to watch the current state and make changes to the config to maintain the desired
state (e.g.: maintaining the replication factor by reviving any dead/failed pods)

9.1.4 Worker Node


• Consists of Docker, kubelet, kube-proxy, and kubectl
• Docker runs the configured pods, takes care of downloading the images and starting the containers.
• kubelet gets the configuration of a pod from the API Server and ensures that the described
containers are up and running. It also communicates with etcd to read and write details on
running services.
• kube-proxy is a network proxy and load balancer for a single node that routes TCP and UDP
traffic
• kubectl is a command line tool that sends API requests to the API server.
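
• Both kubectl and the other components are ultimately just API-server clients. As a sketch of the same interaction, the official Python client (the kubernetes package, assumed installed and pointed at a cluster via ~/.kube/config) can list pods much like kubectl does:

from kubernetes import client, config   # pip install kubernetes

config.load_kube_config()               # read credentials as kubectl does
v1 = client.CoreV1Api()                 # typed wrapper over the REST API server

for pod in v1.list_pod_for_all_namespaces(watch=False).items:
    print(pod.metadata.namespace, pod.metadata.name, pod.status.phase)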

Cloud Computing (UE18CS352)
Unit 3
Aronya Baksy
March 2021

1 Introduction: Disk Storage Fundamentals


• Disk latency has the following three components:

1. Seek Time: The time needed for the controller to position the disk head to the correct
cylinder of the disk
2. Rotational Latency: The time needed for the first sector of the block to position itself
under the disk head
3. Transfer Time: Time needed for the disk controller to read/write all the sectors of the block.

• RAID (Redundant Array of Independent Disks) is a storage virtualization technology that com-
bines multiple physical disks into one or more logical volumes for increased redundancy and faster
performance.
• The driving technologies behind RAID are striping, mirroring and parity checking.

1.1 Storage Architectures


• In Directly Attached Storage (DAS), the digital storage is directly attached to the network
node that is accessing that storage.

• DAS is only accessible from the node to which the storage device is attached physically.
• Network Attached Storage (NAS) is a file-level storage device connected to a heterogeneous
group of clients.
• A single NAS device containing physical storage devices (these may be arranged in RAID) serves
all file requests from any client in the connected network.

• NAS removes the responsibility of file serving from other servers on the network. Data is transferred
over Ethernet using TCP/IP protocol.
• Storage Area Network (SAN) is a network that provides access to block-level data storage.
• A SAN is built from a combination of servers and storage over a high speed, low latency interconnect
that allows direct Fibre Channel connections from the client to the storage volume to provide the
fastest possible performance.
• The SAN may also require a separate, private Ethernet network between the server and clients to
keep the file request traffic out of the Fibre Channel network for even more performance.

• It allows for simultaneous shared access, but it is more expensive than NAS and DAS.
• Distinct protocols were developed for SANs, such as Fibre Channel, iSCSI, Infiniband.

Figure 1: Storage Architectures

1.2 Logical Volume Management (LVM)


• LVM is a file-system virtualization layer

• LVM provides a method of allocating space on mass-storage devices that is more flexible than
conventional partitioning schemes to store volumes.
• LVM makes it possible to:
1. Extend volumes while a volume is active and has a full file system (shrinking volumes requires unmounting and suitable storage requirements)
2. Collect multiple physical drives into a volume group
• LVM consists of the following basic components layered on top of each other:
– A physical volume corresponds to a physical disk that is detected by the OS (labelled often
as sda or sdb) (NOTE: partitions of a single actual disk are detected as separate disks by the
OS).
– A volume group groups together one or more physical volumes
– A logical volume is a logical partition of the volume group. Each logical volume runs a file
system.

• The /boot partition cannot be included in LVM as GRUB (the GRand Unified Bootloader, which loads the bootstrap program from the master boot record) cannot read LVM metadata.

Figure 2: Logical Volume Management

2 Storage Virtualization
• Abstraction of physical storage devices into logical entities presented to the user, hiding the un-
derlying hardware complexity and access functionality (either direct access or network access)
• Advantages of storage virtualization are:
– Enables higher resource usage by aggregating multiple heterogeneous devices into pools
– Easy centralized management, provisioning of storage as per application needs (performance
and cost).

2.1 File-Level Virtualization


• An abstraction layer exists between client and server.
• This virtualization layer manages files, directories or file systems across multiple servers and allows
administrators to present users with a single logical file system
• Normally implemented as a network file system that has
– Standard protocol for file sharing
– Multiple file servers enable access to files
• NFS, CIFS, and Web interfaces like HTTP/HTTPS are examples of this.

2.1.1 Distributed File System


• DFS is a type of network file system that is spread across multiple interconnected nodes.
• The objective of DFS is to enable file directory replication (for fault tolerance) and location trans-
parency (using names to refer to resources rather than their actual location)
• Recently accessed disk blocks can be cached for better performance.
• Metadata management is important for performance reasons. It can be either centralized or distributed.

2.1.2 DFS with centralized metadata: Lustre


• All metadata operations by clients are directed to a single dedicated metadata server.
• Lock-based synchronization is used in every read or write operation from the clients.
• When workloads involve large files, such systems scale well. But the metadata server can become
a SPOF or a performance bottleneck when loads increase.
• Lustre is a massively parallel, scalable distributed file system for Linux that uses DFS with
centralized metadata.
• It is available under GNU General Public License, and used on many supercomputer grids that
run Linux.
• The components of Lustre are:
1. Object Storage Server (OSS), store file data on object storage targets (OSTs). A single
OSS can serve 2-8 OSTs. The total capacity of a Lustre FS is the sum of capacities provided
by the OSS across all the OST nodes.
2. Metadata target (MDT) stores metadata on one or more metadata servers (MDS)
3. Lustre clients access data over a network using a POSIX-compliant interface.
• The file access is done in the following sequence:
– Client performs a lookup on the MDS for a filename.
– MDS either returns layout for the existing file, or creates the metadata for a new file.

– The client passes this layout to a Logical Object Volume (LOV). The LOV maps the layout to objects and their actual locations on different OSTs.
– The client then locks the file range being operated on and executes one or more parallel
reads/writes directly to the OSTs

2.1.3 DFS with distributed metadata : Gluster


• Metadata distributed among all the network nodes. Involves greater complexity as metadata has
to be managed across multiple nodes
• Gluster is an open-source distributed file system with distributed metadata. It is optimized for
high performance, and scales up to 1000s of clients and PB of data.

• Gluster employs a modular architecture with a stackable user-space design.


• It aggregates multiple storage bricks over a network (Infiniband RDMA or TCP/IP interconnects) and delivers them as a network file system with a global namespace.
• The components of Gluster are:

– Server delivers the combined disk space of all the physical storage servers as a single file
system
– Client implements highly available, massively parallel access to each storage node along with
node failure handling
• A storage brick is a server (containing directly attached storage or connected to a SAN) on which
a file system (like ext3 or ext4) is created
• A translator is a layer between a brick and the actual user. It acts as a file system interface and
implements one single Gluster functionality
• I/O Scheduling Translators are responsible for load balancing.

• Automatic File Replication (AFR) translator keeps identical copies of a file/directory on all its
subvolumes (used for replication)

2.2 Block-Level Virtualization


• Virtualizes multiple physical disks into a single virtual disk
• Data blocks are mapped to one or more physical disk sub-systems.

2.2.1 Host-based BLV


• Uses LVM (section 1.2) to support dynamic resizing of volumes, or combine fragments of unused
disk space into a single volume, or create virtual disks (with size larger than physical disk)

2.2.2 Storage Device-level BLV


• Creates Virtual Volumes over the physical storage space of the specific storage subsystem.
• Using RAID techniques, logical units are created that span multiple disks.
• Host independent and low latency as virtualization is built into the firmware and hardware of the
storage device

2.2.3 Network-Level BLV
• Most commonly implemented, scalable form, implemented as part of the interconnect network
between storage and hosts (e.g.: Fibre Channel SAN)

• Switch-based: the actual virtualization occurs in an intelligent switch in the network, and it
works in conjunction with a metadata manager
• Appliance-based: I/O is routed through an appliance that manages the virtualization layer
• In-band appliances perform all I/O with zero direct interaction between client and storage.

• Out-of-band appliances manage only metadata (control paths) while the actual data flows di-
rectly between client and storage server (each client having an agent to manage this)

3 Object Storage Technologies


3.1 Amazon Simple Storage Service (S3)
• Highly reliable, available, scalable, fast cloud storage that supports storage and retrieval of large
amounts of data using simple web services
• Interaction with S3 is done via the GUI (Amazon Console), the TUI (Amazon CLI) or language
specific abstractions. A RESTful API is provided for basic HTTP operations
• Files are called objects. The key of an object is its identification (directory path + object name).
All objects are stored in buckets.
• S3 objects are replicated across multiple global zones. Versioning enables further recovery from
modification and deletion by accident.
• Security is maintained in S3 using:

– Access Control Lists: Set permissions to allow other users to access an object
– Audit Logs: Once enabled, stores the access log for an bucket. This enables one to identify
the AWS account, IP Address, time of access and operations performed by the one who
accessed.
• Data Security is maintained in S3 using:

– Replication: across multiple devices, allows for upto 2 replica failures (cheaper option is
Reduced Redundancy Storage which survives only 1 replica failure), but consistency across
replicas is not guaranteed.
– Versioning: If enabled, S3 stores the full history of each object. It allows for changes to be
undone, including file deletions.
– Regions: select location of S3 bucket for performance/legal reasons.
• S3 allows for large objects to be uploaded in parts. These parts can be uploaded in parallel for
maximum network utilization
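
• A minimal sketch of these operations with boto3, the AWS SDK for Python (assumed installed and configured with credentials; the bucket name and keys are made up, and region details are omitted):

import boto3                       # pip install boto3

s3 = boto3.client("s3")
bucket = "my-example-bucket-1234"  # hypothetical, must be globally unique

s3.create_bucket(Bucket=bucket)
# The key is the object's identification: directory path + object name.
s3.upload_file("backup.tar.gz", bucket, "backups/2021/backup.tar.gz")
# upload_file switches to a parallel multipart upload automatically
# once the file crosses the SDK's size threshold.

obj = s3.get_object(Bucket=bucket, Key="backups/2021/backup.tar.gz")
print(obj["ContentLength"], "bytes")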

3.2 DynamoDB - NoSQL Service


• Cloud-based NoSQL database that is available with AWS. Consists of tables created and defined
in advance (with some dynamic elements)
• Overall is schemaless.

• Supports only item-level consistency (similar to row-level consistency in RDBMS). If cross-item


consistency is needed then don’t use DynamoDB
• Joins are implemented only at the application side. DynamoDB does not support joins between tables.

• Table is collection of items, item is collection of attribute-value pairs. Primary key identifies items
uniquely in a table.
• A partition is an allocation of storage for a table, backed by SSDs and automatically replicated
across multiple Availability Zones within an AWS Region.

• Types of primary keys in DynamoDB:


– Partition Key: The value of the partition key attribute is passed into a hash function to
determine the physical partition on which that item will be stored
– Partition + Sort Keys: All items with the same partition key hash value are stored together
in sorted order by sort key value.
• Users can also create secondary keys in addition to primary keys for alternate queries.
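
• As a sketch, creating a table with a partition + sort key via boto3 (table and attribute names are hypothetical):

import boto3                       # assumes AWS credentials are configured

dynamodb = boto3.client("dynamodb")

dynamodb.create_table(
    TableName="SensorReadings",
    AttributeDefinitions=[
        {"AttributeName": "device_id", "AttributeType": "S"},
        {"AttributeName": "ts", "AttributeType": "N"},
    ],
    KeySchema=[
        {"AttributeName": "device_id", "KeyType": "HASH"},   # partition key
        {"AttributeName": "ts", "KeyType": "RANGE"},         # sort key
    ],
    BillingMode="PAY_PER_REQUEST",
)
# Items with the same device_id hash to the same partition and are
# stored there in sorted order of ts.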

3.3 Amazon Relational DB Service (RDS)


• Provides an abstraction of an RDBMS. Offers all major RDBMSs such as Amazon Aurora, PostgreSQL, MySQL, Oracle, MS SQL Server.
• AWS performs all admin tasks related to maintenance, as well as periodic backups of the DB state
and the ability to take snapshots.

• RDS provides encryption at rest and in transit, as well as APIs for applications.

4 Partitioning
• Breaking down large DBs into smaller units that are stored on different machines. Each row belongs to exactly one partition.
• Supports operations that touch multiple partitions at the same time.
• Motivation is scalability in terms of load balancing and query throughput, as well as fault tolerance
(when combined with replication)

• Small queries can be independently processed by one partition. Large queries can be parallelized
between multiple partitions.
• When some partitions have more data than others, they are said to be skewed. A partition with
disproportionately high load is called a hot spot

4.1 Partitioning Strategies


4.1.1 Randomized Partitioning
• Distribute the data quite evenly across the nodes

• Disadvantage: When trying to read a particular item, no way of knowing which node it is on, so
all nodes need to be queried in parallel.

4.1.2 Partitioning by Key Range


• Assign range of key values to a given partition. If partition boundaries are known then determining
which partition a given key is in is very simple
• Ranges may not be equal width, as data distribution is not uniform

• Each partition can have keys in sorted order


• Disadvantage: certain access patterns can lead to hot spots (e.g.: storing sensor data; if the key is a timestamp, then all writes go to one single partition, which is the current day's partition).

4.1.3 Partitioning by Hash of Key
• Using a suitable hash function for keys, each partition has a range of hash values assigned to it
(rather than a range of keys), and every key whose hash falls within a partition’s range will be
stored in that partition.
• A good hash function takes skewed data and makes it uniformly distributed
• Simple hash partitioning does not allow efficient range queries. This is solved using composite keys.
• Consistent hashing is a way of evenly distributing load across an internet-wide system of servers
such as a content delivery network
• It uses randomly chosen partition boundaries to avoid the need for central control or distributed
consensus
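
• A minimal sketch of consistent hashing in Python (the vnode count and hash choice are arbitrary): each node owns the arc of the ring up to its points, so adding or removing a node only remaps the keys on its own arcs:

import bisect, hashlib

def ring_hash(key: str) -> int:
    # Stable hash mapped onto a 2**32 ring
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2 ** 32)

class ConsistentHashRing:
    def __init__(self, nodes, vnodes=100):
        # Each node appears at many pseudo-random points ("virtual nodes")
        self.ring = sorted((ring_hash(f"{n}#{i}"), n)
                           for n in nodes for i in range(vnodes))
        self.points = [p for p, _ in self.ring]

    def node_for(self, key: str) -> str:
        i = bisect.bisect(self.points, ring_hash(key)) % len(self.ring)
        return self.ring[i][1]      # first node clockwise from the key

ring = ConsistentHashRing(["nodeA", "nodeB", "nodeC"])
print(ring.node_for("user:42"))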

4.2 Secondary Indexes


• Do not map neatly to partitions, but useful for increasing performance of queries made on a
particular key.

4.2.1 Document-based Secondary Indexing


• Also called local secondary indexing

• Each partition maintains its own secondary index, covering only the documents in that partition.
• Reading involves reading from each and every partition and separately combining the results. This
approach is called scatter-gather, and it makes read queries expensive
• Even if the partitions are queried in parallel, scatter/gather is prone to tail latency amplification

4.2.2 Term-based Secondary Indexing


• A single global secondary index covers data from all partitions.
• The index is stored on multiple nodes, partitioned by the term (for range scans) directly, or a hash
of the term (for load balancing)
• Reads are more efficient as a query is made only to the partition where the term resides

• Writes are less efficient as a write affects multiple partitions of the index. This requires a distributed transaction across all partitions affected by a write.
• In practice, updates to global secondary indexes are often asynchronous

4.3 Rebalancing Partitions


• The process of moving load from one node in the cluster to another is called rebalancing.
• Requirements of rebalancing:
– After rebalancing, loads should be shared fairly between all cluster nodes
– During rebalancing the system should still accept read/write requests
– Minimize the amount of data moved around to reduce network and I/O overheads
• The following are rebalancing strategies:

4.3.1 Hash mod n


• hash(key) % n returns a number between 0 and n-1, corresponding to a single partition

• Simple, but the drawback is that any change in n leads to rehashing of a large number of keys, which makes the rebalancing very expensive.
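
• A quick sketch of how expensive this is (Python's built-in hash stands in for a real key hash): growing the cluster from 4 to 5 nodes moves roughly 4 out of every 5 keys:

def partition(key: str, n: int) -> int:
    return hash(key) % n            # hash mod n placement

keys = [f"key-{i}" for i in range(100_000)]
moved = sum(partition(k, 4) != partition(k, 5) for k in keys)
print(f"{moved / len(keys):.0%} of keys change partition when n: 4 -> 5")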

4.3.2 Fixed number of partitions
• Move only entire partitions. Assignment of keys to partitions does not change, but only assignment
of partitions to nodes changes.

• Create many more partitions than there are nodes and assign several partitions to each node
• If a node is added to the cluster, the new node can steal a few partitions from every existing node
until partitions are fairly distributed once again
• Thus, many fixed-partition databases choose not to implement partition split and merge.

• Choosing the right number of partitions is difficult if the size of the dataset is variable

4.3.3 Dynamic Partitioning


• Fixed number of partitions can become imbalanced as data is inserted and removed from the
database
• In dynamic partitioning, the number of partitions adapts to the total data volume

• In dynamic partitioning, the partitions split if they grow beyond an upper bound. If the partition
shrinks below a lower bound, it can be merged with an adjacent partition
• Can be used with both key-range partitioned and hash partitioned data

4.3.4 Proportional to number of nodes


• The size of each partition grows proportionally to the dataset size while the number of nodes remains unchanged; but when the number of nodes increases, the partitions become smaller again.
• Keeps partition sizes stable
• When a new node joins the cluster, it randomly chooses a fixed number of existing partitions to
split, and then takes over half of each of those split partitions.

4.4 Request Routing


• In case of a dataset partitioned among multiple nodes, which node should read/write requests from
a client go to? Request routing solves this issue

• Approaches to routing are:


– Client contacts one node at random. If that node contains the request partition then it serves
the client, else it forwards the request to the appropriate node (this requires all nodes to be
aware of partition -> node assignments)
– Client contacts a routing tier, which is aware of all the node assignments. It forwards the
request to the appropriate node. The routing tier only acts as a partition-aware load balancer
– Client directly contacts the appropriate node on which the requested partition lies, requiring
each client to know about partitioning and assignment to nodes.

4.4.1 ZooKeeper
• A distributed metadata management system for clusters.

• ZooKeeper maintains an authoritative mapping between partitions and nodes, and each node registers itself with the ZooKeeper service.
• Other actors, such as the routing tier or the partitioning-aware client, can subscribe to this infor-
mation in ZooKeeper
• When partitioning changes or node removal/addition occurs, ZooKeeper notifies the routing tier

5 Replication
• Keeping multiple copies of a single partition on different nodes connected by a network
• Motivation for replication:
– Reduce latency by reducing distance to user
– Increase availability by allowing fault tolerance
– Increase read throughput by allowing more parallel reads (scalable)

5.1 Single Leader Replication


• Among all replicas, elect one leader and keep all the other replicas as followers.
• All write requests from clients are directed to the leader, but read requests can be served by leader or followers.
• When a leader gets a write request, it first updates its own write log. This write log is transmitted to all the followers, and the followers apply the changes in the same order as the leader.
• In synchronous replication, the leader waits for all followers to confirm that they have received
a write request, and only then sends success message to the user.
• In asynchronous replication, the leader sends the success message to the user without waiting
for followers to acknowledge the receipt of the write.
• In semi-synchronous replication, the leader waits for exactly one follower to confirm that it
has received a write request, and only then sends success message to the user.
• Sychronous replication sacrifices availability for consistency, vice versa for asynchronous

5.1.1 Node Failure


• Follower failure is handled using catch-up recovery. The follower stores the edit logs on its disk.
• If a failed follower is restarted, then it can ask the leader for all log entries between the time it crashed and the current time. Upon receiving these, the follower replays the log entries to get the updated data.
• In case of leader failure, one of the old followers has to be elected to the position of leader. Clients
are to be reconfigured to send their write queries to this new leader
• Leader failover takes place manually (by the actions of a system admin) or automatically. The steps
in leader failover are:
– Identify leader failure
– Elect a new leader
– Reconfigure system to use the new leader

5.1.2 Implementation
• Statement Replication: The leader logs every write request that it executes and sends that
statement log to its followers (fails for non-deterministic functions like rand() and now())
• Write-Ahead Log Shipping: The leader writes the log (an append-only byte stream) to disk
and sends it across the network to its followers. When the follower processes this log, it builds a
copy of the exact same data structures as found on the leader.
• Logical Log Replication: Uses different log formats for replication and for the storage engine.
A logical log (aka the replication log) is a sequence of records describing writes to database tables
at the row level
• Trigger-Based Replication: A trigger on the leader table logs the change to another table where
an external process can read it. The external process applies the replication to another system

5.2 Replication lag
• The delay between a write happening on the leader and the same being reflected on a follower is
known as the replication lag.

• Read-After-Write consistency is a guarantee for a single user: if the user reads the data at any time after writing it, that user will see the updated data.
• Solutions:
– Read critical data from leader, rest from follower (negates scaling advantage)
– Prevent queries on any follower that is lagging significantly behind the leader
– Client remembers the timestamp of their most recent write, and ensure that the node serving
that user is updated atleast till that timestamp
– Monotonic reads: each user always reads from the same replica
– Consistent prefix reads - if a sequence of writes happen in a certain order, then anyone reading
those writes should see them appear in the same order

5.3 Multi-Leader Replication


• Allow more scalability in writes by allowing multiple leaders. Each leader simultaneously acts as a
follower to the other leaders.
• Conflict Avoidance:
– Ensure that all writes for a particular record go through the same leader
– Give each write an unique ID and pick the write with the highest ID (throw the others away)
– Custom conflict resolution logic in the application code that may be executed on write/reads
• In a multi leader config, the writes can go to the nearest leader only and replicated asynchronously
to all the other leaders (better perceived performance)
• In a single leader config, the failure of a leader means there is downtime involved in failover.

• In a multi-leader config, each datacenter can continue operating independently of the others, and
replication catches up when the failed datacenter is back online.
• In a single-leader config, the public internet is used for synchronous updates between leader and
follower, hence is sensitive to problems in this network

• A multi-leader config with asynchronous replication tolerates network problems better as a tem-
porary network problems do not prevent writes being processed

5.4 Leaderless Replication


• No single dedicated leader, all replicas of a partition are the same from the client point of view.

• In some implementations, the client sends writes to multiple nodes at the same time
• In others, a single co-ordinator node does this on behalf of the client, but it does not enforce a
particular order of writes (like a leader in a single-leader set up does)
• Writes may be sent to multiple nodes while some of those nodes fail and hence cannot complete the write. If the nodes that failed come back online, then any data on them is now out of date (stale).
• To solve this issue, each data item has a version number associated with it. The client reading
from multiple replicas checks the version number of the data and selects the most recent one.

• When the client reads values with different version numbers, the client writes the most recent version of the data to all the nodes with less recent versions. This is called read repair.

• A background process (rather than the client itself) monitors all data values and their versions
across all nodes, and periodically writes the latest value of the data to all the replicas. This is
called an anti-entropy process.
• Let there be n nodes. Let r nodes be queried for each read, and w nodes confirm each write. If

w + r > n

then an up-to-date copy of the data is guaranteed while reading, as at least one of the r nodes being read from must be up to date.
• Reads and writes that obey the above rule are called quorum reads and writes.
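
• A toy in-memory sketch of quorum writes/reads with read repair (versions are supplied by the caller here; real systems generate them, and a real client would tolerate replica failures):

import random

N, W, R = 5, 3, 3                  # w + r > n, so reads overlap the last write
replicas = [dict() for _ in range(N)]      # each maps key -> (version, value)

def quorum_write(key, value, version):
    acked = 0
    for rep in random.sample(replicas, N):
        rep[key] = (version, value)        # in this toy, every replica succeeds
        acked += 1
    return acked >= W                      # success once w replicas confirm

def quorum_read(key):
    picked = random.sample(replicas, R)
    version, value = max(rep.get(key, (0, None)) for rep in picked)
    for rep in replicas:                   # read repair: push the newest value
        if rep.get(key, (0, None))[0] < version:
            rep[key] = (version, value)
    return value

quorum_write("x", "hello", version=1)
print(quorum_read("x"))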

5.4.1 Monitoring
• Monitoring in leaderless systems is difficult as writes do not happen in any particular order
• In single-leader systems, the writes are in a fixed order maintained on the edit log of the leader.
The out-of-date follower can compare its position (timestamp) with that of the leader and make
the necessary changes.

5.4.2 Multi-datacenter Operation


• Leaderless replication is suitable for multi-datacenter operation, since it is designed to tolerate
conflicting concurrent writes, network interruptions and latency spikes.
• The number of replicas of a single partition n is across all datacenters. Number of replicas in a
single datacenter can be configured
• All writes are sent to all replicas, but only a quorum of nodes within the local datacenter is sufficient
for the client to detect a success.
• The higher-latency writes to other datacenters are often configured to happen asynchronously

5.4.3 Detecting concurrent writes


• Several clients writing to the same key concurrently means that conflicts will occur (even if quorum
is followed)
• Events may arrive in a different order at different nodes, due to variable network delays and partial failures.
• Last Write Wins: each replica stores the value with the highest version number only (discarding
the rest of the data)
• Given two events A and B, A is said to happen before B if B knows about A, or depends on A
or builds on A.
• Definition of concurrency is dependent on this happens-before relationship. Server can determine
whether two operations are concurrent by looking at the version numbers

5.4.4 Algorithm for detecting concurrent operations


• A client must read a key before writing to it.
• When a client reads a key, the replica sends the values that have not been overwritten, as well as
the latest version number
• When a client writes a key, it must include the version number from the prior read, and it must
merge together all values that it received in the prior read.
• When the server receives a write with a particular version number, it can overwrite all values with
that version number or below, but still maintains the values with higher version numbers.
• The collection of all version numbers for an item across all its replicas is called the version vector

• Version vectors are sent from the database replicas to clients when values are read, and need to be
sent back to the database when a value is subsequently written
• The version vector allows the database to distinguish between overwrites and concurrent writes,
and ensures that it is safe to read from one replica and write to another.
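
• A small sketch of comparing and merging version vectors (replica names and counters are illustrative):

def dominates(a: dict, b: dict) -> bool:
    # a has seen everything b has: an overwrite, not a conflict
    return all(a.get(k, 0) >= v for k, v in b.items())

def merge(a: dict, b: dict) -> dict:
    # element-wise maximum of the two vectors
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}

v1 = {"replicaA": 2, "replicaB": 1}
v2 = {"replicaA": 1, "replicaB": 3}

if not dominates(v1, v2) and not dominates(v2, v1):
    print("concurrent writes detected; merged vector:", merge(v1, v2))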

6 Consistency Models
• Most distributed systems only guarantee eventual consistency
• In eventual consistency, data read at any point may not be consistent across nodes, but if there
are no writes for some unspecified interval then all the nodes can catch up to the consistent state
• This is a weak guarantee, as it does not give any guarantees about actual time of consistency.

6.1 Linearizability
• The illusion that there is only one copy of a data item across a distributed system. (implies that
all data must be up to date at all times, no staleness in caching)
• Ensures that applications running on the distributed system do not need to worry about replication.
• Main point of linearizability: After any one read has returned the new value, all following reads
(on the same or other clients) must also return the new value.
• Compare-and-Set (CAS) is an operation on the database (see the sketch at the end of this section):
– The CAS operation takes in 3 arguments: a memory location to read from (called X), an old value (v_old) and a new value (v_new)
– If X == v_old, then set X := v_new
– If X != v_old, then return an error and don't change the value in X
• Test for linearizable behaviour: record the timings of all requests and responses and check whether
a valid sequential ordering can be constructed from them.
• In synchronous mode, single leader replication is linearizable.
• Consensus algorithms implement measures to avoid stale replicas, and implement safe linearizable
storage. (e.g.: ZooKeeper)
• Multi-leader and leaderless replication are not linearizable (leaderless probably not)
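
• A single-node sketch of the CAS operation described above (a lock provides the atomicity here; in a replicated system a consensus protocol would play this role):

import threading

class CASRegister:
    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def compare_and_set(self, v_old, v_new) -> bool:
        with self._lock:             # operations appear to happen one at a time
            if self._value == v_old:
                self._value = v_new  # X == v_old: perform the swap
                return True
            return False             # X != v_old: report an error

reg = CASRegister(0)
print(reg.compare_and_set(0, 42))    # True : 0 -> 42
print(reg.compare_and_set(0, 99))    # False: the value is now 42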

6.2 CAP Theorem


• Consistency requires that all reads after a write return the same value, which is the latest value
• Availability requires that any node (that is not in a failure state) is ready to process requests.
• Partition Tolerance requires that a system is tolerant to any network or node failures by rerouting
the communications.
• The CAP Theorem states that a distributed system can satisfy at most two of these three constraints at a time.
• Consistent and Partition Tolerant systems: If a network outage causes one node to be unavailable,
then such a system can still use a majority consensus and deliver consistent results (e.g.: MongoDB,
Redis, BigTable)
• Available and Partition Tolerant systems: If a network outage disconnects two nodes, then they
can still independently process results but there is no consistency guarantee between the data on
the 2 nodes. (e.g.: Cassandra, Riak, CouchDB)
• Consistent and Available systems: Ones that cannot handle any network failures (e.g.: RDBMS
such as SQL Server, MySQL)

• The modern CAP goal is to maximize combinations of consistency and availability that make
sense for the specific application, while incorporating plans for unavailability and recovery of failed
partitions.

6.3 Two Phase Commit


• Every node has a transaction manager that:
– Maintains transaction log for recovery
– Co-ordinates concurrent transactions at the node

• Every node has a transaction co-ordinator that:


– Starts the execution of a transaction at the site, distributes the sub-transactions to other sites
– Co-ordinates the termination of the transaction (either a successful commit on all nodes or
abort on all nodes)

• A concurrency control system must:


– be resilient to node/communication link failure
– allow parallelism for greater throughput
– Optimize cost and communication delay
– Place constraints on atomic actions
• Either commit at all sites, or abort at all sites. The 2PC mechanism is designed to implement this.

6.3.1 Phase 1
• Coordinator places the record Prepare T on its log. The message is then sent to all the sites.
• Each site that receives the message decides whether to commit its component of transaction T or to abort it.
• A site that wants to commit enters the pre-commit stage (in this state the site can no longer abort
the transaction)
• The site takes the necessary actions to ensure that its component of T will not be aborted, then
writes the log message Ready T.
• Once the log is stored on disk at the site, the site sends the Ready T message back to the
coordinator
• A site that doesn’t want to commit sends the message Don’t Commit T back to the coordinator

6.3.2 Phase 2
• If Coordinator gets Ready T from all the sites, it logs the message Commit T and sends it to
all the sites
• If the coordinator has received Don't Commit T from one or more sites, it logs Abort T at its site and then sends Abort T messages to all sites involved in T
• If a site receives a commit T message, it commits the component of T at that site, logging
Commit T as it does
• If a site receives the message Abort T, it aborts T and writes the log record Abort T
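
• A toy single-process simulation of the protocol (sites are plain objects and "messages" are method calls; the log entries mirror the records above):

class Site:
    def __init__(self, name, will_commit=True):
        self.name, self.will_commit, self.log = name, will_commit, []

    def prepare(self, txn):
        if self.will_commit:
            self.log.append(f"Ready {txn}")    # forced to disk before replying
            return True
        return False                           # votes Don't Commit

    def finish(self, txn, commit):
        self.log.append(("Commit " if commit else "Abort ") + txn)

def two_phase_commit(sites, txn="T"):
    coordinator_log = [f"Prepare {txn}"]       # phase 1: solicit votes
    votes = [site.prepare(txn) for site in sites]
    decision = all(votes)                      # phase 2: unanimous => commit
    coordinator_log.append(("Commit " if decision else "Abort ") + txn)
    for site in sites:
        site.finish(txn, decision)
    return decision

print(two_phase_commit([Site("A"), Site("B")]))                      # True
print(two_phase_commit([Site("A"), Site("B", will_commit=False)]))   # False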

Cloud Computing (UE18CS352)
Unit 4
Aronya Baksy
March 2021

1 Master-Slave vs P2P Models


• Distributed System: A system that involves components on different physical machines that communicate and coordinate actions in order to appear as a single system to the end user
• Two main types of distributed system architectures are Master-Slave and Peer-to-Peer (P2P)

1.1 Master-Slave Architecture


1.1.1 Key Points
• Two types of nodes: master and worker
• Nodes are unequal, the master is above the worker nodes in the hierarchy. This makes the master
a single point of failure (SPOF)
• The master is the central coordinator of the system. All decisions regarding scheduling and resource
allocation are made by master
• The master becomes a performance bottleneck as the number of worker nodes increases.

1.1.2 Advantages
• Easy maintenance and security
• Promotes sharing of resources and data between different h/w and s/w platforms
• Integration of services

1.1.3 Disadvantages
• Not scalable as number of workers increase
• The master is a SPOF

1.2 P2P Architecture


1.2.1 Key Points
• No hierarchical relationships between the nodes
• No central coordination; each node takes its own decisions on resource allocation. But synchronization of all the decisions is difficult.
• Theoretically infinite scalability. No performance bottlenecks exist.
• Peers form groups and offer services/data within group members. Popular data is propagated within the group, unpopular data may die out.
• Peers can form a Virtual Overlay Network on top of the physical topology. Each peer routes traffic
through the overlay network.

Figure 1: Client-Server Architecture
Figure 2: Client Server Model for Chat
Figure 3: Client-Server Architecture for e-mail applications
Figure 4: 3-tier Client Server Model

1.2.2 Advantages
• No centralized point of failure.
• Highly scalable; addition of peers does not affect quality of service

1.2.3 Disadvantages
• Maintaining decentralized coordination is tough (consistency of global state, needs distributed
coherency protocols)
• Computing power and bandwidth of nodes impacts the performance (i.e. all nodes are not the
same in a P2P network)
• Harder to program and build applications for P2P systems due to the decentralized nature.

1.2.4 Applications
• File Sharing applications with replication for fault tolerance (e.g.: Napster, BitTorrent clients
like µTorrent, Gnutella, KaZaa)
• Large-scale scientific computing for data analysis/mining (e.g.: SETI@Home, Folding@Home
used for protein dynamics simulations)
• Collaborative applications like instant messaging, meetings/teleconferences (e.g.: IRC/ICQ,
Google Meet, MS Teams etc.)

1.3 P2P Topologies


1.3.1 Centralized Topology
• A centralized server must exist which is used to manage the files and user databases of multiple
peers that log onto it

• The centralized server maintains a mapping between file names and IP addresses of a node. Each
time a client joins the network, it publishes its IP address and list of files it shares to this database
• Any file lookup happens via the server. If the file is found then the centralized server establishes
a direct connection with the requesting node and the node that contains the requested file.

1.3.2 Ring Topology


• Consists of a cluster of machines that are arranged in the form of a ring to act as a distributed
server. The ring provides better load balancing and high availability.
• Typically used when nodes are physically nearby, such as a single organization.

1.3.3 Hierarchical Topology


• Suitable for systems that require a form of governance that involves delegation of rights or authority

• e.g.: DNS hierarchy, where authority flows from the root name servers to the servers of the registered
name and so on

1.3.4 Decentralized Topology


• All peers are equal, hence creating a flat, unstructured network topology
• In order to join the network, a peer must first, contact a bootstrapping node (node that is always
online), which gives the joining peer the IP address of one or more existing peers.
• Each peer, however, will only have information about its direct neighbours.
• Any file queries have to be flooded to all the nodes in the network.
• e.g.: GNutella, for file sharing especially music

Figure 5: Ring Topology
Figure 6: Centralized Topology
Figure 7: Decentralized Topology
Figure 8: Hierarchical Topology

2 Unreliable Communication
• Issues with communication in distributed systems:
– Request or response is lost due to issues in the interconnect network
– Delay in sending request or response (due to queuing delays and network congestion)
– Remote node failure (permanent or temporary)
• Partial Failure in a distributed system occurs when some components (not all) start to function
unpredictably. Partial failures are non-deterministic.
• Distributed systems involve accepting partial failure, building fault-tolerant mechanisms into the
system.

• Reliability is the ability of a system to continue normal functioning even when components fail
or other issues occur.
• Formally, reliability is defined as the probability that a system meets certain performance standards
and yields correct outputs over a desired period of time

• Reliability includes:
– Tolerant to unexpected behaviour and inputs in the software
– Prevention of unauthorized access and abuse
– Adequate performance for the given use case under expected load and input size

• Metric for reliability: mean time between failures (MTBF), defined as

MTBF = (total uptime) / (number of failures)

• A fault is usually defined as one component of the system deviating from its spec
• A failure is when the system as a whole stops providing the required service to the user
• Fault-tolerant mechanisms prevent faults from causing system-wide failures.
• Classification of faults:

– Transient: appear once, then vanish entirely (e.g. first request from node A to node B fails
to reach, but the next one reaches on time)
– Intermittent: Occurs once, vanishes, but reappears after a random interval of time. (e.g.
loose hardware connections)
– Permanent: Occurs once, interrupts the functioning of the system until it is fixed. (e.g.
infinite loops or OOM errors in software)
• Classification of failures:
– Crash failure: A server halts, but functions correctly till that point
– Omission Failure: Could be send omission (server fails to send messages) or receive omission
(server fails to respond to incoming messages)
– Timing Failure: server response is delayed beyond the acceptable threshold
– Response Failure: Could be a value failure (response value is wrong for a request) or a state
transition failure (deviation from correct control flow)
– Arbitrary or Byzantine Failure: Arbitrary response produced at arbitrary times

2.1 Failure Detection
• Using timeout: Let d be the longest possible delivery time (all messages will reach within time d
after being sent or they will not reach at all), and r be the time needed by the server to process
the message.
• Then the round trip time 2d + r is a reasonable estimate for a timeout value beyond which it can
be assumed that a node has failed.
• Unfortunately, there are no such time guarantees in asynchronous communications that are used
in distributed systems.
• Network congestion causes queuing delays. If queues fill up at routers, then the packets can be
dropped, causing retransmission and further congestion
• Even VMs that give up control of the CPU core to another VM can face network delays as they
stop listening to the network for that short duration when they are not in control of the CPU. This
leads to packets dropping.
• Timeout values are measured experimentally.
– Data is collected on round-trip times across multiple machines in the network and over an
extended time period. Measure the variability in the delays (aka jitter)
– Taking into account this data, as well as the application characteristics, a timeout is chosen
that is a fair compromise between delay in failure detection and premature timeout.
– Instead of constant timeouts, the system constantly measures response time and jitter, and dynamically adjusts the timeout value (a sketch of this idea appears at the end of this subsection).
– This is used in Phi Accrual Failure Detector in systems like Cassandra and Akka (toolkit for
distributed applications in Java/Scala)
• In circuit switched networks, there is no queuing delay as the connection is already set up end-
to-end before message exchange, and the maximum end-to-end latency of the network is fixed
(bounded delay)
• The disadvantage of circuit switched network is that it supports far less number of concurrent
network users, and it leads to low bandwidth utilization.
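
• A sketch of such an adaptive timeout (the window size and multiplier k are guesses, loosely in the spirit of the Phi Accrual detector):

import statistics

class AdaptiveTimeout:
    def __init__(self, window=100, k=4.0, floor=0.05):
        self.samples, self.window = [], window
        self.k, self.floor = k, floor

    def record_rtt(self, rtt):
        self.samples.append(rtt)
        self.samples = self.samples[-self.window:]   # keep recent history only

    def timeout(self):
        if len(self.samples) < 2:
            return 1.0                               # cold-start default
        mu = statistics.mean(self.samples)
        sigma = statistics.stdev(self.samples)
        return max(self.floor, mu + self.k * sigma)  # mean + k * jitter

det = AdaptiveTimeout()
for rtt in (0.020, 0.025, 0.030, 0.022, 0.150):      # jittery measurements
    det.record_rtt(rtt)
print(f"current timeout: {det.timeout():.3f} s")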

2.2 Failure Models


• Fail-Stop: Assumes that nodes can fail only by crashing. Once a node stops responding it never
responds until it is brought back online
• Fail-Recovery: Once a node stops responding it may start responding again after a random time
interval. Nodes are assumed to have stable disk storage that persists across failures, but in-memory
state is assumed to be lost
• Byzantine: A component such as a server can inconsistently appear both failed and functioning to failure-detection systems, showing different symptoms to different observers.

2.3 Byzantine Faults


• A Byzantine fault is a condition of a distributed system where components may fail and there is
imperfect information on whether a component has failed.
• A system is said to be Byzantine fault-tolerant if it continues to work properly despite nodes not
obeying correct protocol, or if malicious actors are interfering with the working of the system
• The Byzantine agreement:
– Used to build consensus that nodes have failed or messages are being corrupted
– General strategy is to have each node communicate not only its own status but any information
they have on any other nodes.
– Using this information, a majority consensus is built and the nodes that don’t agree with this
consensus are considered to be in failure state.

2.4 Failure Detection using Heartbeats
• A heartbeat is a signal sent from a node to another at a fixed time interval that indicates that the
node is alive.
• Absence of a fixed number of consecutive heartbeats from a node is assumed to be evidence that
that node is dead
• Heartbeat signals are organized in the following ways:
– Centralized: All nodes send heartbeats to a central monitoring service. Simplest organiza-
tion but the central service is now a SPOF
– Ring: Each node sends heartbeat only to one neighbour, forming a ring structure. If one of
the nodes fails, then the ring breaks and heartbeats cannot be sent properly
– All-to-All: Each node sends heartbeats to every other node in the system. High communi-
cation cost but every node keeps track of all other nodes hence high fault tolerance.
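
• A sketch of the centralized organization (the interval and miss count are arbitrary): the monitor declares a node dead after it misses a few consecutive heartbeat intervals:

import time

class HeartbeatMonitor:
    def __init__(self, interval=1.0, max_missed=3):
        self.interval, self.max_missed = interval, max_missed
        self.last_seen = {}                  # node id -> last heartbeat time

    def heartbeat(self, node_id):
        self.last_seen[node_id] = time.monotonic()

    def dead_nodes(self):
        cutoff = time.monotonic() - self.interval * self.max_missed
        return [n for n, t in self.last_seen.items() if t < cutoff]

mon = HeartbeatMonitor(interval=0.1)
mon.heartbeat("node-1"); mon.heartbeat("node-2")
time.sleep(0.35)                             # node-1 falls silent...
mon.heartbeat("node-2")                      # ...while node-2 keeps beating
print(mon.dead_nodes())                      # ['node-1']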

2.5 Failover
• The act of switching over from one service/node/application to a new instance of the same, upon
the failure or abnormal termination of the first one.
• Failover can be implemented in two ways:

2.5.1 Active-Active Failover Architecture


• Also called symmetric failover.
• Let there be 2 servers. Server 1 runs application A and server 2 runs application B. If server 1 fails
for some reason, then server 2 will now be tasked with running both applications A and B.
• Since the databases are replicated, it mimics having only one instance of the application, allowing
data to stay in sync.
• This scheme is called Continuous Availability because more servers are waiting to receive client
connections to replicate the primary environment if a failover occurs.

Figure 9: Active-Active Failover Architecture

2.5.2 Active-Passive Failover Architecture


• Also called asymmetric failover.
• A standby server is configured to take over the tasks run by the active primary server, but otherwise
the standby does not perform any functions.
• The server runs on the primary node until a failover occurs, then the single primary server is
restarted and relocated to the secondary node.
• Not necessary to shift functions back to the primary server once it comes back online (called
failback). In this situation the primary is the new standby, and the previous standby is currently
the primary.

Figure 10: Active-Passive Failover Architecture

3 Availability and Fault Tolerance


3.1 Availability
• Availability is used to describe the situation where a service is ready to respond to user requests,
as well as the time spent in actually servicing those requests.
• Uptime refers to the period during which a service is operational. Downtime refers to the period
during which a service in unavailable and non-operational.
• Most modern distributed systems give an uptime guarantee of 99.999% (5 nines).
• Downtime may be planned (maintenance is generally planned beforehand at regular intervals) or
unplanned (due to network outages, node failures, software crashes)
• The Service Level Agreement (SLA) between the cloud provider and end user includes an
uptime-downtime ratio.
• Available systems are generally designed by eliminating as many SPOFs as possible. As the size of the system increases, isolation of faults becomes harder, hence availability reduces.
• Most distributed systems have high availability with failover.

Figure 11: Availability for different types of distributed systems

• Availability is measured in terms of the following metrics:

3.1.1 Mean Time to Failure (MTTF)


• Measured as:

    MTTF = Total uptime / Number of tracked operations or components    (1)
• It is a measure of failure rate of a product
• MTTF is only used for non-repairable, replace-only components (such as motherboards, memory, disk drives, batteries etc.)

3.1.2 Mean Time to Repair (MTTR)
• Measured as:

    MTTR = Total downtime caused by failures / Number of failures    (2)
• It measures the average time to repair and restore a failed system

In terms of the above metrics, the system availability is defined as:

    Availability = MTTF / (MTTF + MTTR)    (3)
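For example, if a component fails on average once every 1000 hours of operation (MTTF = 1000 h) and the average repair takes 1 hour (MTTR = 1 h), then Availability = 1000 / (1000 + 1) ≈ 0.999, i.e. roughly "three nines".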

3.2 Fault Tolerance


• Ability of system to continue operation without interruption, despite failure of one or more com-
ponents.

• Fault tolerance can be implemented either in hardware (duplicate hardware), or in software (dupli-
cate instances running the same software) or using load balancers (redirect traffic away from failed
instances).
• Fault tolerant architectures, however, do not address software failures which are the most common
cause of downtime in real-life distributed systems.

Consideration | Highly Available | Fault Tolerant
Downtime | Minimal allowed downtime for service interruption (e.g. 99.999% uptime means about 5 minutes of downtime per year) | Zero downtime, continuous service expected
Cost | Less expensive to implement, as no redundant hardware is needed | More expensive to implement, as it requires redundant components
Scope | Uses shared resources to manage failures and minimize downtime | Uses power supply backup and components that can detect failure and automatically switch over to redundant components
Example | Non-critical software and IT services (Amazon etc.) | Critical applications (healthcare, defense etc.)

3.3 Implementation of Fault Tolerance


• Hardware systems backed by equivalent components. Replicate data and compute functions be-
tween redundant servers

• Build fault-tolerance into network architecture. Multiple paths between nodes in a data-center, or
mechanisms to handle link failure and switch failure
• Handle link failures transparently without affecting cloud functionality. Avoid forwarding packets
on broken links.

• Redundant power supply using generators.


• Software instances are made fault tolerant by setting up duplicate instances. (e.g. if DB is running
and it fails, switch to another instance of the same DB running on another machine)

3.3.1 Chaos Monkey


• Chaos Monkey is a suite of tools built by engineers at Netflix to randomly introduce failures in a production environment.

• Chaos Monkey is used to purposefully introduce faults into systems that are under development
so that fault tolerance can be integrated as early as possible and tested at any time.

• By regularly ”killing” random instances of a software service, it is possible to test a redundant
architecture to verify that a server failure does not noticeably impact customers.
• Chaos Monkey relies on Spinnaker (an open source CI/CD tool similar to Jenkins) that can be
deployed on all major cloud providers (AWS, Google App Engine, Azure)

• The suite of tools developed by Netflix under this includes:


– Chaos Kong: Drop an entire AWS Region
– Chaos Gorilla: Drop an entire AWS Availability Zone
– Latency Monkey: Introduces delays to simulate network outages or congestion
– Doctor Monkey: Monitor performance metrics, detect unhealthy instances, for analysis of
root causes and eventual fixing/retirement of the instance.
– Janitor Monkey: Identify and clean unused instances
– Conformity Monkey: Identifies non-conforming instances according to a set of rules (age
of instances, security groups of clustered instances etc.)
– Security Monkey: Search for and delete instances with known security vulnerabilities and
invalid config
– 10-18 Monkey: Detects problems with localization and internationalization for software serving customers across different geographic regions.

3.4 Fault Tolerant Design Patterns for Microservice Architectures


3.4.1 Use Asynchronous Communication
• Avoid long chains of synchronous HTTP calls when communicating between services, as this design can turn a single slow or failed service into a major outage.
• This helps minimize ripple effects caused by network outages.

3.4.2 Work around Network Timeouts


• To ensure that resources are not indefinitely occupied, use network timeouts.
• Clients should be designed not to block indefinitely and to always use timeouts when waiting for a
response

3.4.3 Retry with Exponential Backoff


• Perform retries to service calls at exponentially increasing intervals.
• This is done to counter intermittent failures when the service is only unavailable for a short time.
• Microservices should be designed with circuit breakers so that the increased network load due to
the successive retries does not cause Denial of Service (DoS)
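A minimal Python sketch of retry with exponential backoff and jitter (the tuning constants and the ConnectionError failure mode are illustrative assumptions):

    import random
    import time

    def retry_with_backoff(call, max_retries=5, base_delay=0.1, cap=10.0):
        for attempt in range(max_retries):
            try:
                return call()
            except ConnectionError:
                # Exponential backoff: 0.1 s, 0.2 s, 0.4 s, ... capped, with
                # jitter so that concurrent clients do not retry in lockstep.
                delay = min(cap, base_delay * (2 ** attempt))
                time.sleep(delay * random.uniform(0.5, 1.5))
        raise RuntimeError("service unavailable after retries")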

3.4.4 Circuit Breaker Pattern


• In this approach, the client process tracks the number of failed requests.
• If the error rate exceeds a configured limit, a ”circuit breaker” trips so that further attempts fail
immediately. (large number of failed requests implies that the service is unavailable, hence don’t
contact now)

• After a timeout period, the client should try again and, if the new requests are successful, close
the circuit breaker.
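A minimal Python sketch of this state machine (the threshold, timeout and the fallback hook, which anticipates the next subsection, are illustrative assumptions):

    import time

    class CircuitBreaker:
        def __init__(self, threshold=5, reset_timeout=30.0):
            self.failures = 0
            self.threshold = threshold
            self.reset_timeout = reset_timeout
            self.opened_at = None          # None while the circuit is closed

        def call(self, func, fallback=None):
            if self.opened_at is not None:
                if time.monotonic() - self.opened_at < self.reset_timeout:
                    # Circuit open: fail fast without contacting the service.
                    return fallback() if fallback else None
                self.opened_at = None      # timeout elapsed: half-open, try again
            try:
                result = func()
                self.failures = 0          # a success closes the circuit
                return result
            except ConnectionError:
                self.failures += 1
                if self.failures >= self.threshold:
                    self.opened_at = time.monotonic()   # trip the breaker
                return fallback() if fallback else None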

Figure 12: Circuit Breaker Design Pattern

3.4.5 Fallback Mechanisms


• Fallback provides an alternative solution during a service request failure.
• Fallback logic is implemented that runs when a request fails. The logic may involve returning a
default value, or returning some cached value.

• Fallback logic must be simple and failure-proof as it is itself running due to a failure.

Figure 13: Circuit Breaker Pattern with fallback logic

3.4.6 Limit number of queued requests


• Clients impose an upper bound on number of outstanding requests that can be sent to a service.
• In case this upper bound is crossed, the remaining requests should automatically fail.
• This is a form of rate limiting or throttling, controlling the rate of requests sent in a time period

• If request arrival rate exceeds the processing rate, the incoming requests can either be queued in a
FIFO queue, or discarded when the queue fills up.
• When the service has capacity, it retrieves messages from this queue and processes them. When the request rate exceeds the available capacity, messages are still processed in order and are not lost.
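A minimal Python sketch of bounding outstanding requests with a fixed-size FIFO queue (the queue size and worker loop are illustrative assumptions):

    import queue

    requests_q = queue.Queue(maxsize=100)   # upper bound on queued requests

    def submit(request):
        try:
            requests_q.put_nowait(request)  # enqueue without blocking
            return True
        except queue.Full:
            return False                    # bound crossed: fail immediately

    def worker(handle):
        # The service drains the queue at its own pace, in FIFO order.
        while True:
            handle(requests_q.get())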

4 Task Scheduling Algorithms
• Policies that assign tasks to the appropriate available resources (CPU, Memory, bandwidth) in a
manner that ensures maximum possible utilization of those resources
• Categorized into:

– Immediate scheduling algorithms assign tasks to VMs as soon as they arrive


– Batch schedulers group tasks into batches and schedule individual batches onto VMs
– Static schedulers do not take into account the current state of the system, but rather use only prior available information about it. They divide all traffic equally among all VMs, e.g. round-robin or random scheduling.
– Dynamic schedulers take into account the current system, do not need any prior information
on the system, and distribute tasks as per the relative capacities of the VMs
– Preemptive scheduling means that tasks can be interrupted and moved to other resources
where they can continue execution
– Non-preemptive scheduling means that a task cannot be reallocated to a new VM until its
execution is complete
• Levels of task scheduling:
– The Task level. Consists of tasks or cloudlets sent to the system by the users
– The scheduling level. It is responsible for mapping tasks to resources to get highest resource
utilization with minimum completion time for all tasks (aka minimum makespan)
– The VM level. Consists of VMs that execute the scheduled tasks

4.1 FCFS Scheduling


• Advantage: Simplest to implement and understand.

• Drawback: Leads to longer wait times and lower resource utilization


• Algorithm: Assign tasks to VMs in the order of their arrival time. If more tasks than VMs then
assign tasks in a round-robin manner.

4.2 SJF Scheduling


• Advantage: Lowest average wait time among all algorithms (provably optimal).
• Drawback: Long tasks are forced to wait for long times (starvation), and cannot be implemented
at the short-term level.

• Algorithm: Sort all available tasks in increasing order of their execution time. Then assign the
tasks to VMs in sequential order of the VMs.

4.3 Min-Max Scheduling


• Advantage: Efficient resource utilization
• Drawback: Leads to increased waiting time for small and medium tasks
• Algorithm: Sort tasks in decreasing order of execution time (longest task first). Sort VMs in
decreasing order of performance (most powerful VM first, i.e. VM with minimum latency first).
Now assign tasks to VMs in order, as in the sketch below.
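Minimal Python sketches of the three assignment policies above (the task and VM tuple formats are illustrative assumptions):

    # Tasks are (task_id, execution_time); VMs are (vm_id, latency).
    def fcfs(tasks, vms):
        # Tasks in arrival order, VMs assigned round-robin.
        return {t[0]: vms[i % len(vms)][0] for i, t in enumerate(tasks)}

    def sjf(tasks, vms):
        # Shortest task first, assigned to VMs in sequential order.
        ordered = sorted(tasks, key=lambda t: t[1])
        return {t[0]: vms[i % len(vms)][0] for i, t in enumerate(ordered)}

    def min_max(tasks, vms):
        # Longest task first onto the lowest-latency (most powerful) VM first.
        ordered = sorted(tasks, key=lambda t: t[1], reverse=True)
        fast_first = sorted(vms, key=lambda v: v[1])
        return {t[0]: fast_first[i % len(fast_first)][0]
                for i, t in enumerate(ordered)}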

5 Cluster Coordination
• Consensus is the task of getting all processes in a group to agree on a single value based on votes
gathered from all processes. The value agreed upon has to be submitted by one of the processes,
i.e. this value cannot be invented by the consensus algorithm
• Synchronous processes are those that follow a common clock, while asynchronous processes are
those where each process has an individual clock.
• In asynchronous systems, it is not possible to build a consensus algorithm, as it is impossible to distinguish between processes that are dead and those that are just slow to respond.
• If even one process crashes in an asynchronous system, then the consensus problem is unsolvable, as proved by Fischer, Lynch and Paterson (FLP) in 1985.
• Why solve consensus? It is important because many problems in distributed computing take a
similar form to the consensus problem such as:
– Leader election
– Perfect failure detection
– Mutual exclusion (agreement on which node gets access to a particular shared resource)
• The properties to be satisfied by asynchronous consensus are:
– Validity: The system cannot accept any value that was not proposed by at least one node. If
every node proposes the same value then that value is accepted
– Uniform Agreement: No two correct processes can agree on di↵erent values after a single
complete run of the algorithm
– Non-Triviality/Termination: All the processes must eventually agree on a single value

5.1 Consensus Algorithm: Paxos


5.1.1 Roles in Paxos
• Every node in a Paxos system is either a proposer, an acceptor or a learner.
• Proposers try to convince the acceptors that the value proposed by them is correct

• Acceptors receive proposals from proposers. They also inform the proposer in the event that a
value other than the one proposed by them was accepted
• Learners announce the outcome of the voting process to all the nodes in the distributed system.

5.1.2 Paxos Phase 1 (Prepare - Promise)


• Every proposer creates a prepare message containing the node ID of the proposer (node ID is a
single positive integer, monotonically increasing and unique to a single node), as well as the value
proposed.
• This message is sent to a majority of the acceptors.
• If the acceptor has never seen any prepare message before, it sends back a prepare response
message which indicates that the acceptor promises not to accept any prepare with an ID less than
the current one.
• In case the acceptor has seen a message before, it compares the ID in the incoming prepare message
with the max ID it has seen.
– If current ID is greater than max ID, then it sends back a prepare response. This promise
message contains the ID and value of the highest previously accepted message. The promise
is made not to accept any prepare message with ID less than current ID.
– If current ID is smaller than max ID then simply ignore the current prepare message

5.1.3 Paxos Phase 2 (Propose - Accept)
• Once a proposer receives a prepare response from a majority of the acceptors, it can start sending
out accept requests.

• A proposer sends out an accept request containing its node ID, and the highest value it received from all the prepare responses.
• If an acceptor receives an accept request for a higher or equal ID than it has already seen, it accepts
and sends a notification to every learner

• A value is chosen by the Paxos algorithm when a learner discovers that a majority of acceptors
have accepted a value.
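A minimal Python sketch of the acceptor's role in the two phases (message formats are illustrative assumptions; proposer and learner logic are omitted):

    class Acceptor:
        def __init__(self):
            self.promised_id = -1          # highest prepare ID promised
            self.accepted_id = -1          # ID of the highest accepted proposal
            self.accepted_value = None

        def on_prepare(self, proposal_id):
            if proposal_id > self.promised_id:
                self.promised_id = proposal_id
                # Promise, reporting any previously accepted proposal.
                return ("promise", self.accepted_id, self.accepted_value)
            return None                    # ignore lower-numbered prepares

        def on_accept(self, proposal_id, value):
            if proposal_id >= self.promised_id:
                self.promised_id = proposal_id
                self.accepted_id = proposal_id
                self.accepted_value = value
                return ("accepted", proposal_id, value)  # notify the learners
            return None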

5.2 Leader Election Algorithms


• Elect a single leader from a group of non-faulty processes such that all processes agree on who the
leader is
• Leader election criteria:
– Any process can call for an election, but a process can call for only one election at a time
– Multiple processes can call an election simultaneously. The result of all these elections is the
same
– Result of an election does not depend on which process called the election
• Conditions for successful leader election run:
– Safety: Every non faulty process p must either elect a process q with the best attribute value,
or NULL
– Liveness: An election, once started, must terminate, and the result of a terminated election
cannot be NULL.
• Each node has an attribute value which is its identifier. The value determines that node’s fitness
for leadership. (e.g.: CPU power, disk space, etc.)

5.2.1 Ring Election


• All N nodes are arranged in a logical ring. The node p[i] has a direct link to the next node p[(i+1) mod N]
• Messages are only sent in clockwise order around the ring. The node that discovers the failure of
the existing coordinator starts the process of election by sending its attribute value in an election
message to the next node.
• Algorithm:
– When a node receives an election message, it checks the attr value in that message.
– If the incoming attribute value is less than the node’s attribute value, the node overwrites the
message with its own attribute value and sends it to the next node in the ring.
– If the incoming attribute value is greater than the node’s attribute value, the node simply
forwards the message to the next node in the ring.
– If the incoming attribute value happens to be the same as the node’s attribute value, the election
stops. The newly elected leader sends out elected messages to the next node, which forwards
them to the next node and so on until all nodes in the ring have received an elected message
containing the leader’s ID.
• Worst case occurs when the leader is the anti-clockwise neighbour of the initiator. In this case the number of messages exchanged is 3N − 1 (N − 1 election messages to reach the leader, N election messages to confirm that no higher node exists and finally N elected messages)

• The simple ring election algorithm offers safety and liveness as long as nodes don’t crash during the election.
• If the nodes crash during election, then it could lead to an election message going around the ring
infinitely, thus the election goes on forever and liveness is not followed.

5.2.2 Modified Ring Election


• Similar set up to the simple ring election case.
• Algorithm:
– The initiator sends out an election message to the next running node in the ring. The first
message contains the attr value of the initiator.
– If a node receives an election message, it simply appends its own attribute value to the message and forwards it to the next node in the ring.
– When the election message completes one round and reaches the initiator again, the initiator
selects the process with the best attribute value and crafts a coordinator message.
– The coordinator message is of the form coord(n_i), where n_i is the elected node value. The
coordinator message is sent around the ring, each node appends its own attribute value to the
end of the message
– Once the coordinator message completes one round and reaches the initiator again, the initia-
tor checks if the value in the coord message is there in the appended list of IDs. If it is there
then election stops. If not then the initiator once again starts the election.

• Supports concurrent elections: an initiator with a lower ID blocks election messages by other initiators
• If a node fails, then the ring can be reconfigured to make it continuous again, provided all nodes in the ring know about each other.

• If the initiator is not faulty, then message complexity = 2N, turnaround time = 2N, and message size grows as O(N)

5.2.3 Bully Algorithm


• Modified ring leader election algorithm is not suitable for asynchronous systems where there is no
upper bound on message delays, meaning there can be arbitrarily slow processes.
• This is because a process p_i may not respond to election messages merely because it is slow (but not failed), and because initiation and ring reorganization (in case of node failure) are slow.
• In the Bully Algorithm, every process is aware of the Process ID (PID) of every other process.
The algorithm is summarized as follows:
– A process P initiates an election by sending election messages to processes that have a
higher PID than it. If there is no response to the election message within a timeout, then the
election is done, the process that sent the election messages is the leader.
– If P receives a reply to its election message, then P waits for the corresponding coordination
message from the higher PID process. If this does not arrive within a timeout, then the
election is restarted.
– At the end of this election message exchange, there is a process Pl that knows for sure that
it has the highest PID. Pl sends a coordination message to the processes with lower PID
than it. This is the end of the election.
• Worst case: message overhead is O(N^2), turnaround time is 5 message times
• Best case: N − 2 message overhead, turnaround time is 1 message time.
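A simplified Python sketch of the election decision (the is_alive reachability check stands in for the timeout-based message exchange and is an illustrative assumption):

    def run_election(my_pid, all_pids, is_alive):
        higher = [p for p in all_pids if p > my_pid]
        if not any(is_alive(p) for p in higher):
            # No higher-PID process answered within the timeout: this process
            # is the leader and would now send coordination messages to all
            # lower-PID processes.
            return my_pid
        # A higher-PID process answered; it "bullies" this process out, and
        # eventually the highest live PID announces itself as coordinator.
        return max(p for p in higher if is_alive(p))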

6 Distributed Locking
• A lock is a mechanism that allows multiple processes/threads to access shared memory in a safe
manner avoiding race conditions. Locks are implemented as semaphores/mutexes/spinlocks.
• Locks are operated in the following sequence:

– Acquire the lock. This gives the process sole control over the shared resource
– Perform the tasks needed on the shared resource
– Release the lock. This gives the other waiting processes a chance to access the shared resource
• A distributed lock is one that can be acquired and released by di↵erent nodes (instead of processes
and threads on only one node).

• Advantages of distributed locking:


– Efficiency: prevents expensive computations from happening multiple times.
– Correctness: Avoid inconsistency, corruption or loss of data.
• Features of distributed locks

– Mutual exclusion: Only one process can hold a lock at a given time
– Deadlock-free: locks must be held and released in a manner that avoids deadlocks between
processes. No one process can hold a lock indefinitely, locks are released after a certain
timeout.
– Consistency: Despite any failover situation caused by a node failure, the locks that the
original node held must still be maintained.

6.1 Types of Distributed Locks


6.1.1 Optimistic Locks
• Useful for stateless environments where there is low amount of data contention.
• Makes use of version numbers to maintain consistency, instead of locks.
• Use a version field on the database record being handled; when updating it, check that the version read earlier still matches the version currently stored before writing.
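A minimal Python sketch of this compare-before-write pattern using the standard sqlite3 module (the table name and schema are illustrative assumptions):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE account (id INTEGER, balance INTEGER, version INTEGER)")
    conn.execute("INSERT INTO account VALUES (1, 100, 0)")

    def withdraw(amount):
        balance, version = conn.execute(
            "SELECT balance, version FROM account WHERE id = 1").fetchone()
        # The write succeeds only if nobody changed the row since the read.
        cur = conn.execute(
            "UPDATE account SET balance = ?, version = ? "
            "WHERE id = 1 AND version = ?",
            (balance - amount, version + 1, version))
        return cur.rowcount == 1   # False means a conflict: the caller retries

    withdraw(30)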

6.1.2 Pessimistic Locks


• Involves the use of a single intermediate lock manager (LM) that allows nodes in the system to acquire and release locks. All acquire and release operations go through the LM
• The LM becomes a single point of failure. If the LM crashes then no node can acquire a lock hence
no operations can be performed.
• A node can perform a database transaction only after acquiring the lock from the LM. While the
lock is held by a node, the LM refuses all acquires from any other nodes. After the transaction is
done, the node releases its held lock.
• An expiration time is set on all locks so that no lock is held indefinitely. If this timer expires before the process finishes its task, another process may acquire the lock while the first is still working, so two processes hold the lock at once, causing inconsistency.
• The above is resolved using fencing

6.2 Fencing
• Every time the LM grants a lock (in response to an acquire) it sends back a fencing token to the client.

• Along with every write request to the DB, the client sends this fencing token.
• If the DB has processed a write request with token ID N then it will not process write requests
containing token ID less than N
• Token ID less than N indicates that the node had acquired the lock earlier but the timeout has
expired hence that lock is not valid anymore
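A minimal Python sketch of the check performed on the storage side (class and method names are illustrative assumptions):

    class FencedStore:
        def __init__(self):
            self.max_token = -1
            self.data = {}

        def write(self, token, key, value):
            # Reject writers holding an expired lock: their token is smaller
            # than one the store has already seen.
            if token < self.max_token:
                raise PermissionError("stale fencing token")
            self.max_token = token
            self.data[key] = value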

6.3 Distributed Lock Manager


• The DLM runs on all the cluster nodes, each having an identical copy of the same database.

• DLM provides software applications running on a distributed system with a means to synchronize
their accesses to shared resources.
• The DLM uses a generalized concept of a resource, which is some entity to which shared access
must be controlled.

7 Zookeeper
• ZooKeeper is a service for coordinating processes of distributed applications
• Zookeeper offers a hierarchical key-value store, to provide a distributed configuration service, synchronization service, and naming registry for large distributed systems.

• Useful for lock management (keep it outside of programmer’s hands for best results) and avoiding
message based control (in async systems message delivery is unreliable)
• Zookeeper maintains configuration information, performs distributed synchronization and enables group services.

• Properties of Zookeeper:
– Simple: leads to high robustness and fast performance
– Wait-free: slow/failed clients do not interfere with needs of properly-functioning clients (in-
teractions are loosely coupled)
– High availability, high throughput, low latencies
– Tuned for workloads with high % of reads
– Familiar interface

7.1 Uses of Zookeeper


• As a naming service similar to DNS but for nodes in a distributed system
• Configuration management : latest configuration information of the system for a joining node.
• Data consistency using atomic operations

• Leader election
• Distributed Locking to avoid race conditions
• Message queue implementation

7.2 Advantages
• Simple distributed coordination

• Synchronization can be implemented


• Ordered messages
• All data transfers are atomic (i.e. no partial data transfer, either full or none)
• Reliability through replication of data

7.3 Disadvantages
• Less feature-rich compared to other such services (e.g. Consul which has service discovery included)
• Dependence on TCP connection for client-server communication

• Fairly complex to understand and maintain

7.4 Working of Zookeeper


7.4.1 Znode
• Zookeeper essentially provides a stripped-down version of a highly available distributed file system.
The hierarchy of this ”file system” is made up of objects called znodes.
• A znode acts as a container of data (like a file) but also as a parent to other znodes (like a directory).

• Each znode can store up to 1 MB of data. The limited amount is because Zookeeper is used for storing only config information (status, networking, location etc.)
• Every znode is identified by a name, which is a path separated by /, with the root node having the
path as just ’/’.
• A znode with children cannot be deleted.

• Znodes can be:


– Ephemeral:
∗ Let a client C1 create a znode. If C1’s connection session with the server ends, then the ephemeral znode created by C1 will also be destroyed.
∗ Ephemeral znodes are visible to all clients despite their lifetime being tied to a single
client
∗ Ephemeral znodes cannot have children.
– Persistent:
∗ A persistent znode continues to stay in the database until and unless a client (not neces-
sarily the creator of the znode) explicitly deletes it.
– Sequential:
∗ Zookeeper assigns a sequential ID as part of the name of the znode, whenever a sequential
znode is created.
∗ The value of a monotonically increasing counter (maintained by the parent znode) is
appended to the name of the newly created one.
∗ Sequentially numbered znodes enforce a global ordering on the events in a distributed
system. Simple locks can be built using sequentially numbered znodes.

7.4.2 Working
• Each Zookeeper server maintains an in-memory copy of the data tree that is replicated across all
the servers.

• Only transaction logs are kept in a persistent data store for high throughput
• Each client connects to a single Zookeeper server using a TCP connection. A client can switch to another Zookeeper server if the current TCP connection fails.
• All updates made by Zookeeper are totally ordered. The order is maintained by the use of the zxid
or Zookeeper Transaction ID.
• Distributed synchronization is maintained using Zookeeper Atomic Broadcast or ZAB Protocol.
• A client can watch a znode, meaning that when any changes are made to the watched znode, the
client receives a notification.

7.5 Use Cases


7.5.1 Leader Election
• Client creates persistent znode called /election. All clients watch for children creation/deletion
under this /election znode.
• Each server that joins the cluster tries to create a znode called /election/leader. Only one server
succeeds in doing this, and that server is elected as the leader
• All the servers call getChildren("/election") to get the hostname associated with the child node
leader, which is the hostname of the leader.
• As the leader znode is ephemeral, if the leader crashes then that znode is automatically deleted
by the Zookeeper server. This delete operation triggers the watch on the /election znode as one
of its children has been destroyed
• All the servers who were watching the /election znode are triggered. These servers once again
repeat the same process and a new leader is elected
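A minimal sketch of this recipe using the kazoo Python client for ZooKeeper (the notes do not prescribe a client library, so kazoo is an assumption; hosts and names are illustrative):

    from kazoo.client import KazooClient
    from kazoo.exceptions import NodeExistsError

    zk = KazooClient(hosts="127.0.0.1:2181")
    zk.start()
    zk.ensure_path("/election")            # persistent parent znode

    ME = b"server-1.example.com"

    def try_to_lead():
        try:
            # Ephemeral: removed automatically if this client's session dies.
            zk.create("/election/leader", ME, ephemeral=True)
            print("I am the leader")
        except NodeExistsError:
            pass                            # someone else already leads

    def on_change(event):
        # Fires when a child of /election is created or deleted (e.g. the
        # leader crashed); re-run the election and re-register the watch.
        try_to_lead()
        zk.get_children("/election", watch=on_change)

    try_to_lead()
    zk.get_children("/election", watch=on_change)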

7.5.2 Distributed Locking


• Similar algorithm to leader election, with /lock used as the parent instead of /election (only the name differs)
• The client that successfully creates /lock acquires the lock, performs the operation and then
destroys /lock. Thus the lock is released

7.5.3 Queues and Priority Queues


• A znode called <path>/queue is designated to hold the queue.
• Insertion into the queue is done by creating an ephemeral and sequential znode of the form
<path>/queue/queue-X where X is a sequence number assigned by Zookeeper.
• Deletion is done by calling getChildren() on the /queue znode and processing the list obtained
from the lowest sequence number first.
• Priority queue implementation is different only in 2 ways:
– Always use up-to-date list of children. Invalidate all old children lists as soon as watch
notification is triggered
– Queue node names end with /queue/queue-YY where YY represents the priority (lower value
means higher priority)

7.6 Alternatives to Zookeeper
• Consul: Service discovery and configuration tool, highly available and scalable

• etcd: Distributed key-value store, open source, tolerates failures during leader election
• Yarn: parallelize operations for greater throughput and resource utilization
• Eureka: a REST based service used for locating load-balancing and failover services for middle-tier
servers

• Ambari: Provision, manage and monitor Hadoop clusters. Intuitive web UI backed up by RESTful
operations.

Cloud Computing (UE18CS352)
Unit 5
Aronya Baksy
April 2021

1 Proxies in the cloud


1.1 Reverse Proxy
• A reverse proxy receives HTTP connection requests from clients and routes the traffic to the
application’s origin server. Reverse proxies are maintained by the owner of the origin server.
• Reverse proxy servers (implemented in Apache, Nginx, Caddy) can inspect HTTP headers and
route requests directed at a single IP address to any one of many internal servers based on the
domain name.
• Reverse proxy servers improve security, performance and reliability
• Operation:
1. Receive connection request from client
2. Complete the three-way TCP handshake, terminate the original connection, connect with the origin server and complete the request.
• Benefits:
– Application security
– Load balancing when the origin server is replicated across multiple machines
– Caching
– SSL encryption

Figure 1: Reverse Proxy Configuration

1.2 Forward Proxy
• Regulates outbound traffic in accordance with certain policies in shared networks. Collects requests
from clients, and interacts with servers on behalf of the client.

• Forward proxies are useful in order to :


– Block access to certain websites for an organization, and monitor organization’s online activ-
ities.
– Block malicious traffic from reaching origin servers.
– Cache external site content and hence reduce response times.

1.3 Nginx
• Nginx is a web server that can also be used as a reverse proxy, load balancer and HTTP cache.

• Load balancing is either done using round-robin scheduling, or the optional hash-based scheduler
that chooses an upstream server based on the hash of some value (can be request URL, incoming
HTTP headers, or some combination of the same)
• Scaling is done by simply changing the Nginx server configuration i.e. by adding more servers and
the corresponding IP addresses in the ”upstream” section
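As an illustration, a minimal upstream configuration might look like the following sketch (addresses and names are assumptions; the blocks sit inside the http context of nginx.conf):

    upstream app_servers {
        server 10.0.0.11:8080;
        server 10.0.0.12:8080;    # scale out by adding more server lines
    }

    server {
        listen 80;
        location / {
            proxy_pass http://app_servers;    # round-robin by default
        }
    }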

2 Scalability
• Ability to increase/decrease IT resources deployed in response to changing workloads and demands.
• Scaling can be done for data storage, compute or networking, and must be done with minimal
downtime or service disruption.
• Elasticity refers to the system’s ability to allocate or deallocate resources for itself in response to
changing workloads
• On the other hand, scalability refers to the system’s ability to handle increased workloads using its existing resources or manually provisioned ones
• The trade-off between elasticity and scalability depends on whether the app’s workloads are predictable or highly variable.

2.1 Benefits of cloud scalability


• Cost Saving: pay-as-you-go model, avoid purchasing expensive hardware that soon may become
obsolete
• Disaster Recovery costs are reduced as need for maintaining secondary data centers is eliminated

• Convenience: rapid provisioning of resources, customized to organization needs


• Flexibility in handling variable workloads with minimal costs, helps small businesses greatly.

2.2 Scaling Strategies


2.2.1 Vertical Scaling
• Adding more resources (CPU, Memory, Disk, I/O) to an existing server, or replacing an existing
server with a more powerful one.
• AWS and Azure support vertical scaling by changing instance types.

• AWS and Azure cloud services have many di↵erent instance sizes, so scaling vertically is possible
for many types of resources (EC2 instances, RDS databases)

2.2.2 Horizontal Scaling
• Adding more instances of the same existing configuration and splitting workloads between the new
increased number of instances
• Increase number of instances instead of changing instance type
Think of it like this: vertical scaling is adding more floors to a single house, whereas horizontal scaling is building 2 more houses of the same size as the existing one.

2.3 Scaling through Reverse Proxies


• The reverse proxy might even be configured as a load balancer.
• To the outside world there is just a single server, but the load balancer takes each request and
forwards it on to an application server on the private network.
• Load balancer decides which server should receive the request based on some scheduling algorithm.

3 Hybrid Cloud and Cloud Bursting


• Combination of private and public clouds enabling expansion of local infrastructure to commercial
infrastructure on a need basis
• Organizations can leverage existing infrastructure and supplement it with cloud resources as per
demand
• Cloud bursting is a configuration between public and private clouds that allows for uninterrupted service by sending excess traffic, beyond the capacity of the private cloud, to the public cloud.
• Hybrid cloud enables Cloud Bursting of the private cloud by allowing the addition of extra capacity
to a private infrastructure by borrowing from a public cloud
• Benefits of cloud bursting
– Flexible and cost-effective solution to manage sudden workloads seamlessly.
– Simple to manage scaling up/down of resources in public cloud
– Cost savings on internal hardware procurement for an organization. Internal compute resources are freed up for better usage in other areas.
– Improved customer experience and customer retention levels due to uninterrupted access to
the application.

4 Multi-Tenancy
• An architecture model wherein a single instance of an application or a hardware serves multiple
clients.
• Three types of multi-tenancy models:
– Shared Machine: each client has their own DB process and tables on a single shared machine
– Shared-Process: Each client has their own tables, but only one database process executes
queries for all clients
– Shared-Table: clients share database tables and process.

4.1 Requirements of a Multi-Tenant System


• Fine-grained resource sharing: leads to greater scalability, but also necessitates better access control
and security.
• Security and isolation between tenants
• Customization of tables

4.2 Types of Multi-tenant Architectures
4.2.1 Single multi-tenant database
• A single app instance, and a single database instance.
• Highly scalable. As more tenants are added, the database is scaled up by adding more storage.
• Low cost due to shared resources, but high operational complexity during design and setup

4.2.2 One Database per Tenant


• Single app instance, one DB instance per tenant.
• Higher cost and less scalable than single multi-tenant architecture, but operational complexity is
low.
• Scalability may be achieved by adding more DB nodes.

4.2.3 Single-Tenant App with Single-Tenant DB


• The entire app is installed separately for each tenant. Each tenant has their own app instance and
their own DB instance.
• Highest level of data isolation, but high cost due to extra hardware needed to support.

4.3 Levels of Multi-Tenancy


4.3.1 Ad-Hoc or Custom Instances
• Each tenant has their own custom version of software.
• Found in current enterprise data centers.
• Management is difficult as each customer needs specialized management support.

4.3.2 Configurable Instances


• All tenants share same version of program, but configuration is possible to an extent.
• Significant management savings as only one copy of the software is to be maintained.

4.3.3 Configurable, Multi-Tenant instances


• Only one instance of the running program is shared by all customers.
• Leads to additional efficiency in resource usage as well as management

4.3.4 Scalable, configurable multi-tenant instances


• Instances can scale up or down depending on the number of customers, and demand of each
customer.
• Performance bottlenecks and capacity limitations from other levels are eliminated here to an extent

4.4 Challenges of Multi-Tenancy


4.4.1 Authentication
• Secure sharing of resources is enforced using authentication
• In a centralized authentication system, auth takes place using a centralized user database. The
cloud admin gives the tenants the right to manage their own accounts on this database.
• In a decentralized authentication system, each tenant maintains their own user database, and the
tenant deploys a federation service that interfaces between the authentication services of tenant
and cloud.

4.4.2 Implementing Resource Sharing
• Access control is provided using roles and business rules.

• A role is associated with a set of permissions specific to it. The ability to set permissions for roles
is also attached to a certain small set of roles.
• A business rule is a policy that provides fine-grained access control, based on the context of the running application (e.g. in a banking app, limit the amount of money withdrawn in a single transaction, or limit the time during which a transaction can take place)

• Business rules are implemented using policy engines like Drools Guvnor and Drools Expert
• Two types of access control:
– Access Control List: Each object associated with a set of permissions for each role
– Capability-based Access Control: If a user holds a reference or capability (called a key)
to an object, they have access to the object.
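A minimal Python sketch contrasting an ACL check with a business rule (role names, the ACL layout and the rule itself are illustrative assumptions):

    def is_allowed(role, action, acl):
        # Access Control List: each object carries a role -> actions map.
        return action in acl.get(role, set())

    def may_withdraw(amount, hour):
        # Business rule: a fine-grained, context-dependent check
        # (the transaction limit and allowed hours are hypothetical).
        return amount <= 10000 and 9 <= hour < 17

    doc_acl = {"admin": {"read", "write", "delete"}, "tenant": {"read", "write"}}
    assert is_allowed("tenant", "write", doc_acl)
    assert not is_allowed("tenant", "delete", doc_acl)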

4.4.3 Sharing Storage Resources


• In a shared table approach, all tenant’s data is stored in a single table. A metadata table stores
info about tenants.
• Shared table is more space efficient but requires multiple select statements for a query (to join
between metadata table and actual data table)

• In a dedicated table approach, each tenant has their own table. Access to other tenant’s tables
is restricted.

5 Cloud Security
• A set of control-based safeguards and technologies that protect cloud resources from online theft,
leakage or data loss.
• Cloud security is partitioned into the physical and virtual domains. Basic objectives of cloud
security are confidentiality, integrity and availability

5.1 Physical Security


• Physical security involves protection against physical threats like intruders, natural disasters and
human error (e.g. forgot to turn on the AC)

• Multi-layered physical security system involves:


1. Central monitoring and control center with dedicated staff
2. Monitoring for each type of physical threat
3. Training of staff in response to threat situations
4. Manual or automated backup systems to mitigate damage caused
5. Secure access to facility

5.2 Virtual Security: Best practices


5.2.1 Cloud Time Service
• Synchronize all nodes in the data center to the same clock.
• Synchronization is needed for correct ordering of operations, as well as analysis of system logs
across geographically distributed locations

• Network Time Protocol (NTP) is used for this. Encryption is used to avoid fake reference sources.

5.2.2 Identity and Access Management
• Identity management must be scalable and federated, allow a single identity with single sign-on, and must satisfy legal and policy requirements.

• Access Management allows access to cloud facilities only to authorized users.


• In addition, it controls the access of cloud management personnel, implements 2-factor authentication, disallows shared accounts and whitelists IP addresses to allow remote access.

5.2.3 Break-glass protocols


• In case of emergencies, bypass normal security controls and allow an alarm to be triggered

• Such a protocol must ensure that it can be executed only in emergencies under controlled situations
and that the alarm is triggered properly.

5.2.4 Key Management


• Secure facilities for the generation, assignment, revocation, and archiving of keys.
• Also generate procedures for recovering from compromised keys

5.2.5 Auditing
• Capture all security-related events, together with data needed to analyze the event
• This data includes time, system on which the event occurred, and userid that initiated the event.
• The audit log is centralized and secure
• It must be possible to create a sanitized or stripped-down version to share with cloud customers
for further analysis.

5.2.6 Security Monitoring and Testing


• Monitoring involves a system-wide anomaly and intrusion detection system installed on network
and host nodes.
• At times, cloud users can create their own anomaly and intrusion detection systems.

• All software (releases or patches) are tested in a test bed environment before deployment to pro-
duction environment.
• Testing happens on a continuous ongoing basis that identifies vulnerabilities in the cloud system.

5.3 Risk Management in Cloud


• Risks in data security and resource outages can be crippling for dependent organizations.
• Risk management is the process of identifying, evaluating, monitoring and controlling risks in a
business environment.

• Risk management is domain dependent, and must trade off risk impact against the cost of risk mitigation measures.
• A security control is a safeguard that detects, responds to or prevents a security risk. There are three broad categories of security controls: technical, operational and management, which are further divided into 18 families.

• Security breaches are classified as low-impact, medium-impact or high-impact based on the require-
ment for security control.
• Low Impact Systems are those where a security breach causes limited degradation in capability,
but the system can still perform its primary functions.

• Medium Impact Systems are those where the system is still capable of performing its primary
functions but there is a significant degradation in the capabilities.
• High-Impact Systems are those where a security breach causes inability to perform primary
functions.

5.3.1 Risk Management Process


• Categorize information resources on the basis of criticality (impact in case of failure) and sensitivity
(confidentiality)
• Select security controls appropriate to the levels of criticality and sensitivity chosen
• Evaluate the chosen security controls. Upgrade them if found insufficient against anticipated
threats.
• Implement chosen security controls. These may be administrative, technical or physical.
• Monitor effectiveness of deployed security controls
• Periodic review of all security controls must take place to protect against new threats, and to
account for operational changes in system design etc.

5.4 Security Design Patterns


5.4.1 Defense in Depth
• Layered defenses protect sensitive resources.
• e.g.: Remote access to a cloud allowed only through VPN. Access could be allowed only from certain whitelisted IP addresses. Admins may further be required to provide an OTP for access.

5.4.2 Honeypot
• Honeypots are systems that disguise themselves as valuable targets, while being monitored by
security personnel.
• While an attacker attempts to control the honeypot, the sysadmins monitoring the honeypot can
trap and stop the attack.
• Honeypot VMs can be deployed by the cloud provider or the cloud customers.

5.4.3 Sandboxing
• Execution of software inside a controlled environment within an operating system.
• Within the sandbox, the software has access only to the bare minimum resources it needs to function
properly. Hence any attacker gaining control of the software does not have unrestricted access to
the entire system.

• Sandboxes also provide defense in depth as any attacker is also needed to overcome the sandbox
in order to gain unrestricted access.

5.5 Network Design Patterns for Security


5.5.1 VM Isolation
• Encryption of traffic between VMs, or tightened network controls on VMs (using ACLs, restricting
port numbers)

5.5.2 Subnet Isolation
• Separate subnets for admin traffic, user traffic and storage network traffic.

• Physically separate networks are preferred as virtual LANs (VLAN) that are not physically separate
are hard to configure correctly.
• Routing between the networks is handled by firewalls.

5.5.3 Common Management DB


• A database that contains information regarding the components of an IT system (inventory of components, present config and status)

• Simplifies implementation and management of IT services, allows all admins to have a consistent
view of the IT system.

5.6 Example of PaaS Security

Figure 2: Security Architecture for PaaS System

5.6.1 External Network Access


• Distinct interfaces for distinct physical networks, one for admins and one for cloud users.
• Access to control network is limited to whitelisted IP addresses only.

• Multi-factor authentication can be made mandatory for increasing secure access to administrative
functions.
• The access to the public network is via two switches, to increase availability via redundancy.

5.6.2 Internal Network Access
• Separate physical networks for admin control functions and one for cloud user functions. Protects
control network from unauthorized access.
• The DBMS is connected to the public network via an aggregated set of links to provide increased
bandwidth and availability.
• PaaS service is accessible from public and private networks. But the security server need not be
accessible from the public network.

5.6.3 Database Server Security


• The identity server handles access management to the database.
• Database is further secured by restricting the allowed ports on which internet traffic is allowed.
• Additional security is implemented by checking the validity of the ODBC connection from the
client to the database

5.6.4 Security Service


• The diagram also includes a security server to perform security services such as

– Auditing of security
– Monitoring for security threats
– Hosting a security operations center
– Security scanning of the cloud infrastructure

5.7 Standards for Security Architecture


5.7.1 SSE-CMM
• System Security Engineering Capability Maturity Model, adaptation of CMM for software projects
by CMU
• Defines 5 capability levels for an organization
• Allows organizations to plan and implement processes for self improvement

5.7.2 ISO/IEC 27001-27006


• Set of related standards under the ISO/IEC 27000 family that provides an Info. Security Manage-
ment System.
• Specifies requirements to be satisfied by all organizations, and processes for evaluating security
risks
• Not specific to cloud

5.7.3 ENISA
• European Network and Info. Security Agency provides a Cloud Computing Information As-
surance Framework.
• The framework is a set of assurance criteria designed to assess the risk of adopting cloud services,
compare different Cloud Provider offers, obtain assurance from the selected cloud providers, and
reduce the assurance burden on cloud providers

5.7.4 ITIL Security Management


• ITIL is a comprehensive set of standards used for ITSM, based on ISO/IEC 27002.
• Shallow learning curve due to the fact that ITIL is already adopted in many data centers.

5.7.5 COBIT
• Control OBjectives for Information related Technologies, developed by the ISACA.

• A set of best practices for linking business and IT goals, with metrics and maturity models.
• Broader scope than ISO/IEC 27000

5.7.6 US NIST
• US National Institute for Standards and Technology releases many whitepapers in the Security
Management and Assurance working group.

• Targeted at US Federal Agencies (CIA, FBI etc.), but they apply to many other organizations as well.

5.8 Legal and Regulatory Issues with the cloud


• Local, national and international laws apply due to distributed nature of the cloud, as well as the
presence of a third party (i.e. the cloud provider)

• Such laws must specify who is responsible for security and accuracy of the data stored on cloud.
• Issues to consider when framing laws:
– Cover all risks arising from a third party’s presence
– Need to ensure data security
– Obligations of the cloud provider during any litigation

5.9 Legal Issues


5.9.1 Due Diligence
• Client must define scope of service provided, as well as regulations and compliance standards to be
followed
• Consider any risks arising from stability and reliability of the cloud provider, as well as the criticality
of the business function outsourced to the cloud.

5.9.2 Contract Negotiation


• Cloud services may have one-click standard agreements that are not customizable. Such agreements
are acceptable for low-risk scenarios.
• Cloud service providers can avoid negotiating custom agreements with each customer through
external accreditations

5.9.3 Implementation
• Enterprise must ensure that the safeguards laid out in the contract are actually being followed
• It is also important to continuously re-evaluate the system periodically to check for changed cir-
cumstances (increased data sensitivity, revoked external certifications)

5.9.4 Contract Termination


• Identify alternate service provider, ensure timely and secure transfer of services

• Also ensure that sensitive data if any, is completely deleted from the original provider’s systems.

5.9.5 Data Privacy and Secondary use of Data
• Use collected data only for intended purpose, and such data cannot be sold to third parties
• Privacy laws often state that individuals can access their own data and modify or delete it
• Enterprises must ensure that cloud service providers do not use the data for data mining or other
secondary usage.

5.9.6 Data Location


• Data handling laws differ between countries, hence transferring data between countries is a challenging process.
• Allow for the location of data to be known in advance so that such scenarios can be planned (e.g.
AWS allows selection of regions for all services)
• The enterprise must obey the most stringent of all the laws that apply across the countries where
data is stored.

5.9.7 Business Continuity Planning


• BCP is used to implement actions to keep a business running in the face of natural disasters that
affect infrastructure, including those maintained on a third-party cloud.
• BCP typically involves identifying the possible catastrophes, carrying out Business Impact Analysis,
and using the results of the analysis to formulate a recovery plan
• Disaster Recovery planning (DRP) is a part of BCP used for recovery of IT operations.
• BCP and DRP are made before deploying apps to cloud, and implemented during the deployment.
Some cloud providers provide features (e.g. multi-locations) that help in BCP and DRP

5.9.8 Security Breaches


• In case of a breach, cloud provider’s disclosure policy is important.
• Disclosure policy defines how quickly a customer is notified of a breach, so that corrective action
can be taken.
• To avoid ambiguity, the service agreement should specify the actions to be taken during a breach

5.9.9 Litigations
• During a litigation against an enterprise or a cloud provider, the provider must be able to make
available any data that is needed for this litigation.
• This is important as enterprises (not cloud providers) are responsible for responding to such re-
quests.
• In case a cloud provider is directly requested to provide data, then the affected business must be contacted and must be given the opportunity to oppose the request.

6 Cloud Authentication: Keystone


• Keystone is an OpenStack service that provides:
– API client authentication
– Service discovery
– Distributed multi-tenant authorization
using OpenStack’s Identity API
• The fundamental purpose of Keystone is to be a registry of projects and decide on access to projects.

6.1 Terminologies
6.1.1 Project
• An abstraction used to group resources (servers, machine images etc.)
• Users or user groups are given access to projects using role assignments.
• The specific role assigned outlines the type of access and capabilities that a user/user group is entitled to.

6.1.2 Domain
• An abstraction that isolates the visibility of a set of projects and users (or user groups) to a single organization
• Domains enable splitting cloud resources into silos that can be used by each organization.
• Domains represent logical divisions within an enterprise, or may be entirely different enterprises

6.1.3 Users and User Groups


• Also known as actors, they are the ones who utilize the cloud resources.
• User groups are groups of users that have some shared responsibility.

6.1.4 Relationship between the three


Domains are a collection of Users, Groups, and Projects. Roles are globally unique. Users may have
membership in many Groups.

6.1.5 Roles
• ”Assigned to” an user and ”assigned on” a project.
• Convey a sense of authority, a particular responsibility to be fulfilled by an actor.
• A role assignment is a triple of actor, target (may be a project or a domain), and a role.
• Role assignments can be granted, revoked and inherited between users/projects.

6.1.6 Token
• Each API call authenticated by Keystone requires the passing of a token.
• Tokens are generated by Keystone upon successful authentication of an user against the service.
• A token has both an unique ID (unique per cloud) and a payload (data about the user)

6.1.7 Service Catalog


• List of endpoints and URLs for different services on a cloud.

• Used mainly for service discovery and access (such as creating VMs, storage allocation etc.)
• Each endpoint is broken down into a public URL, an internal URL and an admin URL (all may
be the same or not)

6.2 Identity in Keystone
6.2.1 SQL
• Identity of actors (name, password, metadata) and groups stored on an SQL database (MySQL,
PostgreSQL, DB2)
• Keystone in this case serves as the identity provider
• Pros:

– Easy setup
– Manage users and groups via OpenStack APIs
• Cons:
– Keystone should not be identity provider as well as authenticator
– Weak password support: no password rotation or recovery
– Does not integrate with existing enterprise LDAP servers

6.2.2 LDAP
• Keystone can retrieve and store actors (Users and Groups) in Lightweight Directory Access Protocol
(LDAP).

• LDAP should be restricted to only read operations (searching) and authentication (bind).
• Keystone needs a minimal amount of privilege to use the LDAP (read access to attrs defined in
the configuration, as well as an anonymous access)
• Pros:

– No need to maintain copies of user accounts


– Keystone no longer acts as identity provider
• Cons:
– Service accounts need to be stored somewhere (may not be desirable to have them on LDAP
server)
– Keystone is still seeing user passwords in the request messages. Ideally Keystone should never
see user passwords.

6.2.3 Multiple backends


• Allow one identity source per Keystone domain.
• Allows service accounts and employee accounts to be separated, and allows use of multiple LDAPs
for flexibility in organization of departments.
• Pros:
– Support multiple LDAPs for various user accounts, SQL for service accounts
– Leverage existing LDAP identity

• Cons:
– Complex set up
– User authentication must be domain-scoped

6.2.4 Identity Providers
• An identity provider is a service that abstracts the identity backend and translates user information into some standard federated identity protocol.

• Keystone uses Apache modules for consuming authentication info from multiple Identity Providers.
• Such users are never stored in Keystone and are not permanent; their attributes are mapped into group-based role assignments
• From a Keystone perspective, an identity provider is a source for identities; it may refer to software
that is backed by various backends or Social Logins
• Pros:
– Leverage existing infra & software for user authentication
– Separation between Keystone service and user info
– Keystone never sees any user passwords
– Type of authentication (certificate-based, 2-factor) is abstracted away from keystone
• Con: most complex setup

6.3 Authentication in Keystone


6.3.1 Password
• User or service provides a password for authentication

• The payload of the request contains information needed to find where the user exists, authenticate
the user, and optionally, retrieve a service catalog based on the user’s permissions on a scope
• The user section identifies the user (either on a domain, or using a globally unique user ID),
• The scope section identifies the project being worked on, and hence is used to retrieve the service
catalog. Must contain information to identify a project and the owning domain.
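A minimal Python sketch of such a password-authentication request against the v3 Identity API, using the requests library (the endpoint URL, user, project and password are illustrative assumptions):

    import requests

    payload = {
        "auth": {
            "identity": {
                "methods": ["password"],
                "password": {
                    "user": {
                        "name": "alice",               # hypothetical user
                        "domain": {"name": "Default"},
                        "password": "secret",
                    }
                },
            },
            # The scope section names the project (and its owning domain)
            # whose service catalog should be returned.
            "scope": {"project": {"name": "demo",
                                  "domain": {"name": "Default"}}},
        }
    }

    r = requests.post("http://keystone:5000/v3/auth/tokens", json=payload)
    token_id = r.headers["X-Subject-Token"]   # token ID comes back in a header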

6.3.2 Token
• A user may also request a new token by providing a current token.
• The payload contains the current token ID.
• This allows refreshing a token that will soon expire, or changing a token type from unscoped to
scoped.

6.3.3 Access Management


• Keystone manages access to APIs using role-based access control.
• Consists of policies stored in JSON form at each API endpoint.
• Rules in JSON form consists of target:rule pairs.

• At the top of the file, targets are established that can be used for evaluation of other targets.
• Here the meanings of admin, owner and other roles are defined.
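An illustrative fragment of such a policy file (the rule names follow the style of older Keystone policy.json files and are shown here as assumptions):

    {
        "admin_required": "role:admin",
        "owner": "user_id:%(user_id)s",
        "admin_or_owner": "rule:admin_required or rule:owner",

        "identity:get_user": "rule:admin_or_owner",
        "identity:create_user": "rule:admin_required"
    }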

Figure 3: Keystone Services and Backends

7 Cloud Security Threats: Denial of Service


• A type of cyber attack in which malicious actors aim to render a device unavailable to its intended users by interrupting the device’s normal functioning.
• DoS attacks typically function by overwhelming or flooding a targeted machine with requests until normal traffic can no longer be processed, resulting in denial of service to additional users.
• The focus of a DoS attack is to overpower the capacity of a targeted machine, resulting in denial-
of-service to additional requests.
• Types of DoS attacks
– Buffer Overflow: force a machine to consume resources until capacity runs out. Leads to slow response times, system crashes and hence DoS
– Flood Attacks: Overpower the server by targeting a server with a massive number of packets.
For this attack the attacker must have greater bandwidth than the target.

7.1 DDoS Attacks


• Distributed DoS (aka DDoS) attacks utilize multiple computers as sources of attack traffic.

• DDoS attacks are carried out with networks of internet-connected machines that have been infected with malware that allows an attacker to control them remotely.
• Such a network of machines is called a botnet (individual machines being called bots or zombies). Each bot is a legitimate machine on the internet, which makes it difficult to separate attack traffic from genuine traffic.

• The botnet floods the victim with requests, overwhelming the capacity and causing denial of service.

7.2 EDoS: Economic Denial of Sustainability


• EDoS targets the vulnerabilities of cloud computing's utility pricing model.
• EDoS attackers steadily send illegitimate traffic that gradually consumes cloud resources such as VMs, network devices, security devices and databases, so as to trigger the cloud's auto-scaling features.
• Consequently, due to the additional resource usage, the target consumer is billed for the extra capacity, causing financial harm. As an illustration (with assumed prices), traffic that keeps an auto-scaling group at 20 VMs instead of 2, at $0.10 per VM-hour, adds roughly $1,300 to a month's bill.
• The other side effect of this attack is a persistent degradation of service for benign cloud users.

DDoS Attack                                  | EDoS Attack
Degrades or blocks cloud services            | Makes cloud resources economically infeasible
Short attack period                          | Long attack period
Attack traffic rises above the EDoS region   | Attack traffic lies between the normal-traffic zone and the DDoS attack zone

7.3 Intrusion Detection Systems (IDS)


• Signature-matching IDSs and anomaly detection can be implemented on VMs that are dedicated to running intrusion detection.
• Network anomaly detection reveals abnormal traffic patterns, such as unauthorized episodes of TCP connection sequences, by comparing them against normal traffic patterns.

7.3.1 Defense against DDoS Attack


• Detection is placed at successive attack-transit routers in the network, along the routing tree towards the victim.
• The mechanism is based on change-point detection at each router (a sketch follows this list).
• Based on the anomaly pattern detected across the covered network domains, the scheme detects a DDoS attack before the victim is overwhelmed.
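
A minimal sketch of the change-point idea at a single router, using a CUSUM-style detector; the baseline, drift, and threshold values are illustrative assumptions:

    def change_point(samples, baseline, drift=2.0, threshold=25.0):
        """Return the index at which cumulative excess traffic over the
        baseline rate exceeds the threshold, or None if no change is seen."""
        s = 0.0
        for t, x in enumerate(samples):
            s = max(0.0, s + (x - baseline - drift))   # accumulate only positive deviation
            if s > threshold:
                return t
        return None

    # Packets/sec observed at one attack-transit router; a flood starts at index 4.
    traffic = [100, 103, 98, 101, 150, 220, 300, 410]
    print(change_point(traffic, baseline=100))   # -> 4

Routers nearer the victim cross the threshold sooner, so alerts can be pushed along the tree and the attack flagged before the victim saturates.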

7.3.2 Data Integrity and Privacy Protection


• Special APIs for authentication, e-mail communication
• Fine-grained access control to deter hackers.

• Personal firewalls at user ends to protect shared data sets from malicious Java, JavaScript, and ActiveX applets
• A privacy policy consistent with the cloud service provider’s policy, to protect against identity
theft, spyware, and web bugs
• VPN channels between resource sites to secure transmission of critical data objects

7.3.3 Data Colouring


• Data colouring is a watermarking technique that secures data. Each data object is labelled with a unique colour.
• User identification information is also coloured, so that it corresponds with the colour of the user's data.
• This colour-matching process can be applied to implement different trust management events (a toy sketch follows this list).
• Cloud storage provides a process for the generation, embedding, and extraction of the watermarks in coloured objects.
• Data colouring takes a minimal number of calculations to colour or decolour data objects (compared to encryption/decryption).
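
A toy sketch of the colour-matching idea, not the textbook's colour-generation scheme: derive a cheap per-user tag with an HMAC, attach it to the object, and check it on access. The secret key and tag length are assumptions:

    import hmac, hashlib

    SECRET = b"provider-held-colouring-key"   # assumed provider-side secret
    TAG_LEN = 8

    def colour(user_id: str) -> bytes:
        """Derive the user's colour tag (far cheaper than full encryption)."""
        return hmac.new(SECRET, user_id.encode(), hashlib.sha256).digest()[:TAG_LEN]

    def colour_object(user_id: str, data: bytes) -> bytes:
        return colour(user_id) + data          # embed the watermark in the object

    def colours_match(user_id: str, obj: bytes) -> bool:
        return hmac.compare_digest(obj[:TAG_LEN], colour(user_id))

    blob = colour_object("alice", b"payload")
    print(colours_match("alice", blob), colours_match("bob", blob))   # True False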

7.3.4 Data Lock-In


• Data lock-in is the inability to move data from one cloud platform to another in order to do some other computation there.
• The causes of data lock-in are lack of interoperability (no standard APIs for access) and lack of application compatibility (applications are not standard across all clouds).
• Standardized cloud APIs can be built, but this requires providers to build infrastructure that adheres to OVF (the Open Virtualization Format: a platform-independent, efficient, extensible, and open format for VMs).
• This would enable efficient, secure software distribution, facilitating the mobility of VMs.
