MCT702: Distributed Computing
Course Objectives
The differences among concurrent, networked, distributed, and mobile computing.
Resource allocation and deadlock detection and avoidance
techniques.
Remote procedure calls.
IPC mechanisms in distributed systems.
Course Outcomes
Develop, test and debug RPC-based client-server programs in Unix.
Design and build application programs on distributed systems.
Improve the performance and reliability of distributed programs.
Design and build newer distributed file systems for any OS.
Syllabus
UNIT I
Introduction - Examples of Distributed Systems - Resource Sharing and the Web - Challenges - Case study on the World Wide Web - System Models - Introduction - Architectural Models - Fundamental Models.
Distributed Objects and Components: Introduction, Distributed Objects, From Objects to Components, Case Study: Enterprise JavaBeans and Fractal.
Remote Invocation- Remote Procedure Call-Events and Notifications.
UNIT II
Distributed Operating Systems - Introduction - Issues - Communication Primitives - Inherent Limitations - Lamport's Logical Clock; Vector Clock; Causal Ordering; Global State; Cuts; Termination Detection. Distributed Mutual Exclusion - Non-Token-based Algorithms - Lamport's Algorithm - Token-based Algorithms - Suzuki-Kasami's Broadcast Algorithm - Consensus and related problems. Distributed Deadlock Detection - Issues - Centralized Deadlock-Detection Algorithms - Distributed Deadlock-Detection Algorithms.
UNIT III
Distributed Resource Management - Distributed File Systems - Architecture - Mechanisms - Design Issues - Case Study: Sun Network File System - Distributed Shared Memory - Architecture - Algorithms - Protocols - Design Issues. Distributed Scheduling - Issues - Components - Algorithms - Load Distributing Algorithms, Load Sharing Algorithms.
UNIT IV
Transaction and Concurrency: Introduction, Transactions, Nested Transactions, Locks, Optimistic Concurrency Control, Timestamp Ordering.
UNIT V
Resource Security and Protection: Access and Flow Control - Introduction - The Access Matrix Model - Implementation of the Access Matrix Model - Safety in the Access Matrix Model - Advanced Models of Protection - Data Security - Introduction - Modern Cryptography: Private Key Cryptography, Public Key Cryptography.
UNIT VI
Distributed Multimedia Systems: Introduction - Characteristics - Quality of Service Management - Resource Management - Stream Adaptation - Case Study.
Designing Distributed Systems: Google Case Study - Introducing the Case Study: Google - Overall Architecture and Design Paradigm - Communication Paradigm - Data Storage and Coordination Services - Distributed Computation Services.
Text Books:
Distributed Systems: Concepts and Design, George Coulouris, Jean Dollimore and Tim Kindberg, Pearson Education, 5th Edition.
Advanced Concepts in Operating Systems, Mukesh Singhal and N. G. Shivaratri, McGraw-Hill.
Distributed Operating Systems, Pradeep K. Sinha, PHI, 2005.
References:
Distributed Computing: Principles, Algorithms and Systems, Ajay D. Kshemkalyani and Mukesh Singhal, Cambridge University Press.
Unit No. I
Chapter 1: Introduction to DS - Examples of Distributed Systems
Resource Sharing
Web Challenges
Case study on World wide web
Chapter 2: System Models - Introduction
Architectural Models
Fundamental Models
Chapter 3: Distributed Objects and Components - Introduction
Distributed Objects,
From objects to components, Essence of Components
Case study: Enterprise java beans and Fractals.
Chapter 4: Remote Invocation - Introduction
Remote Procedure Call
Events and Notifications
Chapter 1: Introduction to DS
A distributed system is a collection of
independent entities that cooperate
to solve a problem that cannot be
individually solved.
Definition :
A distributed system is one in which hardware or software components located at networked computers communicate and coordinate their actions only by passing messages.
Note : Computers that are connected by a
network may be spatially separated by any
distance.
They may be on separate continents, in the same
building or in the same room.
Message passing is the key feature of a distributed system.
Major Consequences
1. Concurrency
concurrent program execution
sharing resources
increasing the capacity of the system
2. No global clock
No shared memory (although distributed shared memory can provide the abstraction of a common address space)
Each computer has only its own idea of the time at which a program's actions occur
3. Independent Failures
Faults in the network result in the isolation of the computers that
are connected to it
The failure of a computer, or the unexpected termination of a
program somewhere in the system (a crash)
It is the responsibility of system designers to plan for the
consequences of possible failures
Examples of Distributed Systems
1. Application Domains
Web search
Massively multiplayer online games
Financial trading
2. Recent Trends
Pervasive networking and the modern Internet
Mobile and ubiquitous computing
Distributed multimedia systems
Distributed computing as a utility
1. Application Domains

1. Web search

Selected application domains and associated networked applications:

Sr. no.  Application domain                     Associated networked applications
01.      Finance & commerce                     E-commerce (companies like Amazon and eBay); online payments, trading & banking
02.      Information societies                  The WWW; search engines (Google & Yahoo); user-generated content (YouTube, Wikipedia and Flickr); social networking (Facebook and MySpace)
03.      Creative industries and entertainment  Online gaming; multimedia download sites; YouTube
04.      Healthcare                             Online electronic patient records; telemedicine in support of remote diagnosis
05.      Education                              E-learning; virtual learning environments; distance learning; community-based learning
06.      Transport & logistics                  Location technologies such as GPS; web-based map services (MapQuest, Google Maps and Google Earth)
07.      Science                                E-science: Grid technology has enabled worldwide collaboration between groups of scientists
08.      Environmental management               Sensor technology to help avoid natural disasters and to understand complex natural phenomena such as climate change
2. Massively multiplayer online games (MMOGs)
The objectives are to provide:
fast response times, to preserve the user experience of the game;
real-time propagation of events to the many players;
a consistent view of the shared world.
Solutions:
1. Client-server architecture (EVE Online)
2. Distributed architecture (EverQuest)
3. Peer-to-peer technology (a purely decentralized approach)
3. Financial Trading:
The industry employs automated
monitoring and trading applications
Distributed event-based systems
2. Recent Trends
1. Pervasive networking and the modern Internet
[Figure: A typical portion of the Internet - desktop computers and servers in intranets, connected through ISPs, backbone networks and satellite links. The Internet is a very large, open distributed system providing services such as the WWW.]
The Internet is also a very large distributed system which
make use of services such as the World Wide Web, email
and file transfer.
Programs running on the computers connected to it
interact by passing messages, employing a common
means of communication.
The figure shows a collection of intranets - subnetworks operated by companies and other organizations, typically protected by firewalls.
Internet Service Providers (ISPs) are companies that
provide broadband links and other types of connection to
individual users and small organizations, enabling them to
access services anywhere in the Internet as well as
providing local services such as email and web hosting.
A backbone is a network link with a high transmission
capacity, employing satellite connections, fibre optic cables
and other high-bandwidth circuits.
Problems: (i) the isolation of some systems, e.g. for confidential police matters, and (ii) improving firewalls with fine-grained mechanisms and policies.
2. Mobile and ubiquitous computing
Advances in device miniaturization and wireless networking.
Small portable devices such as laptops, smartphones, mobile phones, GPS devices, PDAs, wearable devices such as smart watches, digital cameras and video cameras.
Embedded appliances such as washing machines, hi-fi systems, cars and refrigerators.
Example of Mobile and Ubiquitous computing
[Figure: A mobile user visiting a host site - the user's mobile phone reaches the Internet via a WAP gateway; the host site's wireless LAN connects a laptop, camera and printer to the host intranet, which is linked to the user's home intranet. Key issues: spontaneous interoperation and service discovery.]
Mobile computing
Mobile computing is the performance of computing tasks while the user is on the move, or visiting places other than their usual environment, by providing access to resources via the devices they carry with them.
They can continue to access the Internet;
they can continue to access resources in their home intranet;
and there is increasing provision for users to utilize resources such as printers, or even sales points, that are conveniently nearby as they move around. The latter is also known as location-aware or context-aware computing.
Mobility introduces a number of challenges for distributed systems, including the need to deal with variable connectivity and indeed disconnection, and the need to maintain operation in the face of device mobility.
Ubiquitous computing
This involves many small, cheap computational devices that are present in users' physical environments, including the home, office and even natural settings.
it may be convenient for users to control their
washing machine or their entertainment
system from their phone or a universal remote
control device in the home.
Equally, the washing machine could notify the
user via a smart badge or phone when the
washing is done.
3. Distributed multimedia systems
The main objectives are the storage, transmission and presentation of:
discrete media types, such as pictures or text messages;
continuous media types, such as audio and video.
Webcasting is an application of distributed
multimedia technology to broadcast continuous
media, typically audio or video, over the Internet.
range of encoding and encryption formats
desired quality of service
resource management strategies
scheduling policies
adaptation strategies in open system
4. Distributed computing as a utility
A number of companies are promoting the view of
distributed resources as a commodity or utility.
Physical resources such as storage and processing.
(operating system virtualization)
Software services across the global Internet
(Google Apps)
Cloud computing
Clouds are generally implemented on cluster computers - sets of interconnected computers that cooperate closely to provide a single, integrated high-performance computing capability.
Blade servers are minimal computational elements containing, for example, processing and (main memory) storage capabilities.
Grid computing provides support for scientific applications.
Resource Sharing
We routinely share:

Resources                        Examples
Hardware resources               Printers, disks
Data resources                   Files, databases, web pages
Functionally specific services   Search engines

Service: an entity that manages a collection of related resources and presents their functionality to users and applications. Examples:

Purpose                       Service
Access to files               File service
Documents sent to printers    Printing service
Buying of goods               Electronic payment
A service can be accessed only via the set of operations that it exports; e.g. a file service provides read, write and delete operations on files.
Resources in a distributed system are physically encapsulated within computers and can only be accessed from other computers by means of communication.
Distributed systems commonly use the approach known as client-server computing.
Feature     Client                                       Server
Operation   Invokes an operation (a remote invocation)   Executes the operation
Nature      Active                                       Passive
Time        Lasts as long as the application runs        Works continuously
Objects     Invokes operations on the objects            Encapsulates and contains the objects
Challenges

Sr. no.  Challenge           Remarks
01.      Heterogeneity       Variety and differences
02.      Openness            Extension of resource-sharing services
03.      Security            Protection of shared resources
04.      Scalability         Addition of users vs. a constant set of resources
05.      Failure handling    Computer or network failures
06.      Concurrency         Presence of multiple users requesting the same resource
07.      Transparency        Concealment of operation, location etc. from users
08.      Quality of service  Performance, security and reliability; adaptability
1. Heterogeneity

Resources              Examples
Network                Internet Protocol over Ethernet
Computer hardware      Data types such as integers in messages exchanged between programs running on different hardware
Operating systems      The calls for exchanging messages differ between UNIX and Windows
Programming languages  Characters and data structures used by different programming languages must be communicable between each other
Implementations        Software from different developers must interoperate

Solutions:
1. Middleware: provides a programming abstraction as well as masking the heterogeneity, e.g. the Common Object Request Broker (CORBA) and Java Remote Method Invocation (RMI).
2. Mobile code: refers to the transfer of program code, e.g. Java applets.
3. Virtual machines: provide a way of making code executable on a variety of host computers, e.g. code produced by the Java compiler runs on the Java virtual machine.
2. Openness
Extension and re-implementation of new resource-sharing services, which can be made available for use by a variety of client programs.
Open systems are characterized by the fact that their key interfaces are published, e.g. the Requests For Comments (RFCs) for the Internet protocols.
Open distributed systems are based on the provision of a uniform communication mechanism and published interfaces for access to shared resources, e.g. the World Wide Web Consortium (W3C) publishes the standards by which the Web works.
Open distributed systems can be constructed from heterogeneous hardware and software, possibly from different vendors. But the conformance of each component to the published standard must be carefully tested and verified if the system is to work correctly.
3. Security
Security for information resources has
three components:
confidentiality
protection against disclosure to unauthorized
individuals
integrity
protection against alteration or corruption
availability
protection against interference with the means to
access the resources
Two security challenges:
Denial of service attacks: disrupting a service for some reason.
Security of mobile code: receipt of an executable program as an electronic mail attachment (the effect of running it is unpredictable).
4. Scalability
Distributed systems operate effectively and efficiently at
many different scales, ranging from a small intranet to the
Internet.
A system is described as scalable if it will remain effective
when there is a significant increase in the number of
resources and the number of users.
Major scalability challenges in distributed systems:
1. Controlling the cost of physical resources:
(file servers should be proportionate to file users)
2. Controlling the performance loss:
( data sets used in hierarchic structures scale better
than linear structures )
3. Preventing software resources running out:
(Adaptation of a new version of the Internet protocol
with 128-bit Internet addresses )
4. Avoiding performance bottlenecks:
(partitioning of name spaces or replication of web pages)
5. Failure handling
Detecting failures: some failures can only be suspected
Masking failures: hiding failures or making them less severe
Tolerating failures: e.g. reporting a failure directly to the user (direct alert)
Recovery from failures: rolling back processes to a consistent state
Redundancy: replication of components and data
6. Concurrency
Concurrency is an issue when two or more users access the same resource at the same time.
Each resource is encapsulated as an object and invocations are
executed in concurrent threads
Concurrency can be maintained by use of semaphores and other
mutual exclusion mechanisms.
Note:
Thread: the smallest sequence of programmed instructions that can be managed independently by an operating system scheduler.
Semaphore: a variable or abstract data type that provides a simple but useful abstraction for controlling access, by multiple processes, to a common resource in a parallel programming or multi-user environment (a minimal Java sketch follows).
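To make the semaphore idea concrete, here is a minimal Java sketch (the class name and pool size are illustrative, not from the source) using java.util.concurrent.Semaphore to bound concurrent access to a shared resource:

import java.util.concurrent.Semaphore;

public class PrinterPool {
    // At most 3 threads may use a printer at once.
    private static final Semaphore printers = new Semaphore(3);

    static void print(String job) throws InterruptedException {
        printers.acquire();              // block until a printer is free
        try {
            System.out.println(Thread.currentThread().getName() + " printing " + job);
            Thread.sleep(100);           // simulate the print job
        } finally {
            printers.release();          // always hand the printer back
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 10; i++) {
            final int id = i;
            new Thread(() -> {
                try {
                    print("job-" + id);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }).start();
        }
    }
}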
7. Transparency
Concealment of the separation of components
from users:
Access transparency: local and remote resources can be accessed using identical operations (e.g. ftp services).
Location transparency: resources can be accessed without knowledge of their whereabouts (e.g. URLs).
Concurrency transparency: processes can operate concurrently using shared resources without interference (threads and semaphores).
Failure transparency: faults can be concealed from users/applications (e.g. retransmission of mail).
Mobility transparency: resources/users can move within a system without affecting their operations (e.g. Airtel & Idea mobile communication).
Continued
Replication transparency : enables
multiple instances of resources to be
used to increase reliability and
performance without knowledge of the
replicas by users or application
programmers.
Performance transparency: system can
be reconfigured to improve performance
Scaling transparency: system can be
expanded in scale without change to the
applications
Transparency Examples
Distributed File System allows access transparency
and location transparency
URLs are location transparent, but are not mobility
transparent
Message retransmission governed by TCP is a
mechanism for providing failure transparency
Mobile phone is an example of mobility transparency
Note: Access transparency and location transparency together are called network transparency.
8. Quality of service
Responsiveness and computational throughput.
Ability to meet timeliness guarantees, depending on computing and communication resources.
QoS applies to operating systems as well as networks.
There must be resource managers that provide guarantees.
Reservation requests that cannot be met should be rejected.
Case Study: World Wide Web
Introduction
Short History
Concept
Major Components
HTML
URL
HTTP
Related Terms
Short History
The Web began life at the European
centre for nuclear research (CERN),
Switzerland, in 1989 as a vehicle for
exchanging documents between a
community of physicists connected
by the Internet.
Concept
Web provides
hypertext structure among the documents that it
stores
hyperlinks i.e. references to other documents
and resources that are also stored in the Web
Open system : it can be extended and
implemented in new ways without disturbing its
existing functionality.
Operation is based on communication standards
and document or content standards that are
freely published and widely implemented.
Concept
There are many types of browser, which are implemented
on several platforms and also there are many
implementations of web servers.
The users have access to browsers on the majority of the
devices that they use, from mobile phones to desktop computers.
The Web is open with respect to the types of resource that
can be published and shared on it.
If somebody invents, say, a new image-storage format, then
images in this format can immediately be published on the Web.
Note : The Web has moved beyond these simple data
resources to encompass services, such as electronic
purchasing of goods. It has evolved without changing its
basic architecture.
[Figure: Web servers and web browsers - browsers issue requests such as http://www.google.com/search?q=kindberg to www.google.com, http://www.cdk3.net/ to www.cdk3.net, and http://www.w3c.org/Protocols/Activity.html to www.w3c.org, where the path Protocols/Activity.html is resolved within the server's file system.]
Major Components
The Web is based on three main standard technological
components:
the HyperText Markup Language (HTML), a
language for specifying the contents and layout of
pages as they are displayed by web browsers;
Uniform Resource Locators (URLs), also known
as Uniform Resource Identifiers (URIs), which
identify documents and other resources stored as part
of the Web;
a client-server system architecture, with
standard rules for interaction (the HyperText
Transfer Protocol HTTP) by which browsers and other
clients fetch documents and other resources from web
servers.
1.HTML
The HyperText Markup Language is used to specify
the text and images that make up the contents of a
web page, and to specify how they are laid out and
formatted for presentation to the user.
A web page contains such structured items as
headings, paragraphs, tables and images.
HTML is also used to specify links and which
resources are associated with them.
Users may produce HTML by hand, using a standard text editor, but they more commonly use an HTML-aware WYSIWYG editor that generates HTML from a layout that they create graphically.
Example
Consider a piece of HTML stored in a file, say earth.html:

<IMG SRC = "http://www.cdk5.net/WebExample/Images/earth.jpg">
<P>
Welcome to Earth! Visitors may also be interested in taking a look at the
<A HREF = "http://www.cdk5.net/WebExample/moon.html">Moon</A>.
</P>

Output:
Welcome to Earth! Visitors may also be interested in taking a look at the Moon.
2. URL
The purpose of a Uniform Resource
Locator is to identify a resource.
A URL uses a scheme such as the File Transfer Protocol (FTP) or the HyperText Transfer Protocol (HTTP), e.g.:
ftp://ftp.downloadIt.com/software/aProg.exe
An HTTP URL has two main jobs: to identify which web server maintains the resource, and to identify which of the resources at that server is required.
In general, HTTP URLs are of the following form:
http://servername[:port][/pathName][?query][#fragment]
Publishing a resource: a resource with URL http://S/P is made available by placing a corresponding file at path P on the web server S.
3. HTTP
The HyperText Transfer Protocol defines the ways in
which browsers and other types of client interact with
web servers.
The main features are:
Request-reply interactions:
Operations: GET, to retrieve data from the resource, and POST, to provide data to the resource.
Content types:
text/html - the browser interprets the text as HTML and displays it;
image/GIF - the browser renders it as an image in GIF format;
application/zip - data compressed in zip format.
One resource per request:
The browser makes several requests concurrently, to reduce the overall delay to the user.
Simple access control:
By default, any user with network connectivity to a web server can access any of its published resources; servers can restrict access to particular resources.
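The request-reply interaction and content types just described can be observed with a few lines of standard Java (java.net.HttpURLConnection); the URL is illustrative:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class HttpGetExample {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://www.example.com/index.html"); // illustrative resource
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");        // request-reply: GET retrieves the resource
        System.out.println("Status: " + conn.getResponseCode());
        System.out.println("Content-Type: " + conn.getContentType()); // e.g. text/html
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);    // the body of the resource
            }
        }
        conn.disconnect();
    }
}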
Related Terms
Dynamic pages
Downloaded code
Common Gateway Interface (CGI) programs on the server
JavaScript, Asynchronous JavaScript And XML (AJAX)
Applets
Web services
Programmatic access to web resources
Web resources provide service-specific operations
GET, POST, PUT, DELETE
Web discussion
The Web faces problems of scale
Use of proxy servers
Clusters of computers
Chapter 2 : System Models
Introduction
Architecture Models
Client server model
Peer to peer model
Variations
Fundamental Models
Interaction models
Failure models
Security models
Introduction
Difficulties and threats for distributed systems
Widely varying modes of use: The component parts of systems are subject to wide variations in workload - for example, some web pages are accessed several million times a day. Some parts of a system may be disconnected, or poorly connected, some of the time - for example, when mobile computers are included in a system. Some applications have special requirements for high communication bandwidth and low latency - for example, multimedia applications.
Wide range of system environments: A distributed system must accommodate heterogeneous hardware, operating systems and networks. The networks may differ widely in performance - wireless networks operate at a fraction of the speed of local networks. Systems of widely differing scales, ranging from tens of computers to millions of computers, must be supported.
Internal problems: Non-synchronized clocks, conflicting data updates and many modes of hardware and software failure involving the individual system components.
External threats: Attacks on data integrity and secrecy, and denial of service attacks.
Introduction
The properties and design issues of distributed
systems can be captured and discussed
through the use of descriptive models.
Each type of model is intended to provide an
abstract, simplified but consistent description
of a relevant aspect of distributed system
design.
The basic models under consideration are
Architecture model
Fundamental model
And there is one more model : The Physical model
1. The Architecture models
Objectives (approaches):
Looking at the core underlying architectural elements that underpin modern distributed systems, highlighting the diversity of approaches that now exist.
Examining composite architectural patterns that can be used in isolation or, more commonly, in combination, in developing more sophisticated distributed systems solutions.
Considering the middleware platforms that are available to support the various styles of programming that emerge from the above architectural styles.
I. Architecture elements
Key questions:
What are the entities that are communicating in the distributed system?
How do they communicate or, more specifically, what communication paradigm is used?
What (potentially changing) roles and responsibilities do they have in the overall architecture?
How are they mapped on to the physical distributed infrastructure (what is their placement)?
1. Communication entities
System-oriented entities
Nodes
(primitive environment, based on OS layers)
Processes
( distributed environment, based on threads)
Problem oriented entities
Objects
decomposition for the given problem domain
interface definition language (IDL) ,
methods defined on an object
Components
accessed through interfaces
making all dependencies explicit
third-party development removing hidden dependencies.
Web services
defined by the web-based technologies
a software application identified by a URI
message exchanges via Internet-based protocols.
2. Communication paradigms
Inter-process communication
low-level support for communication between processes in distributed systems
Message passing: client-server architectures (a minimal sketch follows after this list)
Socket programming: use of IP (TCP/UDP)
Multicast communication: one message sent to many receivers
Remote invocation
a two-way exchange between entities in terms of remote operations, procedures or methods
Request-reply protocols: client-server communication, encoded as an array of bytes
Remote procedure calls: procedures in processes on remote computers can be called
Remote method invocation: a calling object can invoke a method in a remote object
Indirect communication
Group communication: one-to-many
Publish-subscribe systems: event-based systems
Message queues: point-to-point service
Tuple spaces: parallel programming
Distributed shared memory: for processes that do not share physical memory
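As a concrete illustration of the message-passing paradigm, a minimal TCP request-reply exchange in Java (the port number and message format are illustrative):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;

public class EchoPair {
    // Server: accepts one connection, receives a request message and replies.
    static void server() throws IOException {
        try (ServerSocket listener = new ServerSocket(9090);
             Socket s = listener.accept();
             BufferedReader in = new BufferedReader(new InputStreamReader(s.getInputStream()));
             PrintWriter out = new PrintWriter(s.getOutputStream(), true)) {
            String request = in.readLine();       // receive the message
            out.println("reply-to: " + request);  // send the reply
        }
    }

    // Client: sends a request and blocks waiting for the reply.
    static void client() throws IOException {
        try (Socket s = new Socket("localhost", 9090);
             PrintWriter out = new PrintWriter(s.getOutputStream(), true);
             BufferedReader in = new BufferedReader(new InputStreamReader(s.getInputStream()))) {
            out.println("hello");                 // the request message
            System.out.println(in.readLine());    // the reply message
        }
    }

    public static void main(String[] args) throws Exception {
        Thread t = new Thread(() -> {
            try { server(); } catch (IOException e) { e.printStackTrace(); }
        });
        t.start();
        Thread.sleep(200);  // crude wait so the server is listening first
        client();
        t.join();
    }
}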
3. Roles & Responsibilities
Roles are fundamental in establishing the overall architecture to be adopted, reflecting the responsibilities of each component in the distributed system.
Basic architectural models:
Client-server model
Peer-to-peer model
A. Client-server model:
The most important and most widely used distributed system architecture.
Client and server roles are assigned and changeable: servers may in turn be clients of other servers.
Services may be implemented as several interacting processes in different host computers to provide a service to client processes:
Servers partition the set of objects on which the service is based and distribute them among themselves (e.g. web data and web servers).
[Figure: Clients invoke individual servers - each client process sends an invocation to a server process and receives a result; a server may itself invoke another server. A search engine, for example, also runs background programs called web crawlers.]
B. Peer to Peer Model
Peer processes:
All processes play similar roles, without distinction between client and server.
They interact cooperatively to perform a distributed activity.
Communication patterns depend on application requirements.
In system architecture and networks, peer-to-peer is an architecture where computer resources and services are exchanged directly between computer systems.
These resources and services include the exchange of information, processing cycles, cache storage, and disk storage for files.
In such an architecture, computers that have traditionally been used solely as clients communicate directly among themselves and can act as both clients and servers, assuming whatever role is most efficient for the network.
[Figure: A distributed application based on peer processes - peers 1 to N each run the application and hold part of the sharable objects (e.g. the application database), so that the storage, processing and communication loads for access to objects are distributed across all peers, exploiting the resources (both data and hardware) of the participants.]
4. Placements
Variation in the models:
i. Mapping of services to multiple
servers
ii. Caching using web proxy servers
iii. Web applets in form of mobile
code
iv. Mobile agents
v. Thin client
i. Multiple servers
[Figure: A service provided by multiple servers - each client interacts with any of several server processes that together implement the service.]
ii. Web proxy server
[Figure: Clients access web servers indirectly via a proxy server that caches responses.]
Cache:
A store of recently used data objects that is closer to the client process than the remote objects themselves.
When an object is needed by a client process, the caching service checks the cache and supplies the object from there if an up-to-date copy is available.
Proxy server:
Provides a shared cache of web resources for client machines at a site or across several sites.
Increases the availability and performance of a service by reducing load on the WAN and web servers.
May be used to access remote web servers through a firewall.
iii. Web applets
[Figure: (a) a client request results in the downloading of applet code from a web server; (b) the client then interacts with the applet locally.]
Example: Java applets
The user running a browser selects a link to an applet whose code is stored on a web server.
The code is downloaded to the browser and runs there.
Advantage:
Good interactive response, since it does not suffer from the delays or variability of bandwidth associated with network communication.
Disadvantage:
Security threat to the local resources in the destination computer.
iv. Mobile Agents
A running program (including both code and data) that travels from one computer to another in a network, carrying out a task on someone's behalf.
It can make many invocations to local resources at each visited site.
Visited sites must decide which local resources the agent is allowed to use, based on the identity of the user owning the agent.
Advantage: reduces communication cost and time by replacing remote invocations with local ones.
Disadvantages:
Limited applicability.
Security threat to the visited site's resources.
v. Thin Client
Thin client refers to a software layer
that supports a window-based user
interface that is local to the user
while executing application programs
on a remote computer.
[Figure: Thin clients and compute servers - the thin client runs the local user interface while applications execute on a remote compute server.]
Continued
Same as the network computer scheme, but instead of downloading the application's code into the user's computer, it runs the applications on a server machine, a compute server.
A compute server is a powerful computer that has the capacity to run large numbers of applications simultaneously.
Disadvantage: increased delays in highly interactive graphical applications.
Recently this concept has led to the emergence of virtual network computing (VNC), which has superseded network computers.
Since all the application data and code is stored by a file server, users may migrate from one network computer to another.
II. Architecture Patterns
Layering
Tier Architecture
1. Layering
In the layered view of a system each layer offers its services to the level
above and builds its own service on the services of the layer below.
Software architecture is the structuring of software in terms of layers
(modules) or services that can be requested locally or remotely.
[Figure: Software and hardware service layers - applications and services; middleware; operating system; computer and network hardware. The operating system and hardware together form the platform.]
Platform:
The lowest-level layers, which provide services to the higher layers; they offer a system programming interface for communication and coordination between processes.
Examples: Pentium processor / Windows NT; SPARC processor / Solaris.
Middleware:
A layer of software that masks heterogeneity and provides a unified distributed programming interface to application programmers.
Provides infrastructure services for use by application programs.
Examples: the Object Management Group's Common Object Request Broker Architecture (CORBA); Java Remote Method Invocation (RMI); Microsoft's Distributed Common Object Model (DCOM).
Limitation: requires application-level involvement in some tasks.
2. Tier Architecture
III : Middleware Platform
Architectural Design Requirements
1. Performance Issues:
Considered under the following factors:
Responsiveness:
Fast and consistent response time is important for the
users of interactive applications.
Response speed is determined by the load and
performance of the server and the network and the
delay in all the involved software components.
System must be composed of relatively few software
layers and small quantities of transferred data to
achieve good response times.
Throughput:
The rate at which work is done for all users in a
distributed system.
Load balancing:
Enable applications and service processes to proceed
concurrently without competing for the same resources.
Exploit available processing resources.
Architectural Design Requirements
2. Quality of Service:
Main system properties that affect the service
quality are:
Reliability: related to failure fundamental model
(discussed later).
Performance: ability to meet timeliness guarantees.
Security: related to security fundamental model
(discussed later).
Adaptability: ability to meet changing resource
availability and system configurations.
3. Dependability issues:
A requirement in most application domains.
Achieved by:
Fault tolerance: continuing to function in the presence of
failures.
Security: locate sensitive data only in secure computers.
Correctness of distributed concurrent programs:
research topic.
2. Fundamental Models
Models of systems share some fundamental properties; fundamental models are more specific about their characteristics and about the failures and security risks they might exhibit.
The interaction model is concerned with the
performance of processes and communication
channels and the absence of a global clock.
The failure model classifies the failures of
processes and basic communication channels
in a distributed system.
The security model identifies the possible threats to processes and communication channels in an open distributed system.
I. Interaction Model
Distributed systems consist of multiple interacting processes, each with its own private set of data that it can access.
The behavior of distributed processes is described by distributed algorithms.
Distributed algorithms define the steps to be taken by each process in the system, including the transmission of messages between them.
Transmitted messages transfer information between these processes and coordinate their ordering and synchronization activities.
Two Significant Factors
1. Performance of communication channels, characterized by:
Latency: the delay between the sending and receipt of a message, including:
network access time (e.g. Ethernet transmission depends on the traffic);
the time for the first bit transmitted through a network to reach its destination (e.g. over a satellite link using radio signals);
processing time within the sending and receiving processes (which depends on the current load on the operating systems).
Throughput: the number of units (e.g., packets) delivered per time unit.
Bandwidth: the total amount of information transmitted per time unit (communication channels using the same network share the available bandwidth).
Jitter: the variation in the time taken to deliver a series of messages.
2. Computer clocks & event ordering:
Each computer in a distributed system has its own internal clock to supply the value of the current time to local processes.
Therefore, two processes running on different computers that read their clocks at the same time may obtain different time values.
Clock drift rate refers to the relative amount a
computer clock differs from a perfect reference clock.
Several approaches to correcting the times on computer clocks have been proposed, e.g.:
Radio receivers can get time readings from the Global Positioning System (GPS) with an accuracy of about 1 microsecond.
Clock corrections can be made by sending messages from a computer that has an accurate time to other computers, although the corrections will still be affected by network delays (a sketch of the round-trip estimation follows).
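A hedged sketch of the round-trip idea behind such message-based correction (Cristian's algorithm): the client estimates the server's current time as the reported time plus half the measured round trip. The getServerTime() call here is a hypothetical stand-in for a real network request.

public class ClockOffset {
    // Hypothetical placeholder: would really be a message exchange with a time server.
    static long getServerTime() {
        return System.currentTimeMillis();
    }

    public static void main(String[] args) {
        long t0 = System.currentTimeMillis();   // request sent
        long serverTime = getServerTime();      // server's clock reading
        long t1 = System.currentTimeMillis();   // reply received
        long roundTrip = t1 - t0;
        // Assume the server read its clock roughly half a round trip ago.
        long estimatedNow = serverTime + roundTrip / 2;
        long offset = estimatedNow - t1;        // correction to apply to the local clock
        System.out.println("round trip = " + roundTrip + " ms, estimated offset = " + offset + " ms");
    }
}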
Two Variations of the Interaction Model
Setting time limits for process execution, as well as for message delivery, in a distributed system is hard.
Two opposing extreme positions provide a
pair of simple interaction models:
1. Synchronous distributed systems:
A system in which the following bounds are
defined:
Time to execute each step of a process
has known lower and upper bounds.
Each message transmitted over a channel
is received within a known bounded time.
Each process has a local clock whose drift
rate from perfect time has a known bound.
Easier to handle, but determining realistic bounds can be hard or impossible.
A synchronous model is required for …
2. Asynchronous distributed systems:
A system in which there are no bounds
on:
process execution times.
message delivery times.
clock drift rate.
Allows no assumptions about the time
intervals involved in any execution.
Exactly models the Internet.
Browsers are designed to allow users to do
other things while they are waiting.
More abstract and general:
A distributed algorithm executing on one
system is likely to also work on another one.
Interaction Model
Event ordering: needed when we want to know whether an event at one process (sending or receiving a message) occurred before, after, or concurrently with another event at another process.
It is impossible for any process in a distributed
system to have a view on the current global state
of the system.
The execution of a system can be described in
terms of events and their ordering despite the
lack of accurate clocks.
Logical clocks define some event order based on
causality.
Logical time can be used to provide ordering among events in different computers in a distributed system (since real clocks cannot be perfectly synchronized).
Example:
1. User X sends a message with the subject Meeting.
2. Users Y and Z reply by sending messages with the subject Re: Meeting.
[Figure: Real-time ordering of events - X sends m1; Y receives m1 and sends m2; Z receives m1 and m2 and sends m3; an observer A receives m1, m2 and m3 at times t1, t2, t3 along the physical time axis. Each message's send event precedes its receive event.]
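Since real clocks cannot provide this ordering, a logical clock can; here is a minimal Java sketch of Lamport's logical clock rules (class and method names are illustrative):

public class LamportClock {
    private long time = 0;

    // Rule 1: increment before every local event, including a send.
    public synchronized long tick() {
        return ++time;
    }

    // Rule 2: on receiving a message, jump past the sender's timestamp, then increment.
    public synchronized long onReceive(long senderTimestamp) {
        time = Math.max(time, senderTimestamp);
        return ++time;
    }

    public static void main(String[] args) {
        LamportClock x = new LamportClock(), y = new LamportClock();
        long sendStamp = x.tick();                // X sends m1 at logical time 1
        long recvStamp = y.onReceive(sendStamp);  // Y receives m1 at logical time 2
        System.out.println("send at " + sendStamp + ", receive at " + recvStamp);
        // The send always gets a smaller timestamp than the receive: causality is preserved.
    }
}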
II. Failure Model
Defines the ways in which failure may occur in order
to provide an understanding of its effects.
A taxonomy of failures that distinguishes between the failures of processes and of communication channels is provided:
Omission failures
A process or channel fails to do something it is supposed to do; e.g. the chief omission failure of a process is to crash. Fail-stop means that other processes can detect the crash.
Send and receive omissions relate to the communication primitives.
Arbitrary failures
Any type of error can occur in processes or channels (the worst case).
Timing failures
Applicable only to synchronous distributed systems, where time limits may not be met.
1. Omission Failures
[Figure: Processes and channels - process p sends message m to process q over a communication channel, via p's outgoing message buffer and q's incoming message buffer.]
Loss of messages between the sending process and the outgoing message buffer: send-omission failures.
Loss of messages between the incoming message buffer and the receiving process: receive-omission failures.
2. Arbitrary Failures
The term arbitrary or Byzantine failure is used to describe
the worst possible failure semantics, in which any type of
error may occur. For example, a process may set wrong
values in its data items, or it may return a wrong value in
response to an invocation.
Arbitrary failures in processes cannot be detected by
seeing whether the process responds to invocations,
because it might arbitrarily omit to reply.
Communication channels can suffer from arbitrary failures;
eg, message contents may be corrupted, nonexistent
messages may be delivered or real messages may be
delivered more than once.
Arbitrary failures of communication channels are rare, because the communication software is able to recognize them and reject the faulty messages: e.g., checksums are used to detect corrupted messages, and message sequence numbers can be used to detect nonexistent and duplicated messages.
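A small sketch of the two mechanisms just mentioned: a CRC32 checksum rejects corrupted messages and a sequence-number check drops duplicates (the message layout is illustrative):

import java.util.HashSet;
import java.util.Set;
import java.util.zip.CRC32;

public class MessageChecker {
    private final Set<Long> delivered = new HashSet<>(); // sequence numbers already seen

    static long checksum(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload);
        return crc.getValue();
    }

    // Returns true if the message should be passed on to the application.
    boolean accept(long seqNo, byte[] payload, long claimedChecksum) {
        if (checksum(payload) != claimedChecksum) return false; // corrupted: reject
        if (!delivered.add(seqNo)) return false;                // duplicate: reject
        return true;
    }

    public static void main(String[] args) {
        MessageChecker checker = new MessageChecker();
        byte[] m = "hello".getBytes();
        long sum = checksum(m);
        System.out.println(checker.accept(1, m, sum));     // true: fresh and intact
        System.out.println(checker.accept(1, m, sum));     // false: duplicated message
        System.out.println(checker.accept(2, m, sum + 1)); // false: corrupted message
    }
}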
Failure Model
[Table: Omission and arbitrary failures - a taxonomy of process and channel failures.]
3. Timing Failures
Timing failures are applicable in synchronous distributed systems, where time limits are set on process execution time, message delivery time and clock drift rate.
Real-time operating systems are designed with a view to providing timing guarantees.
The typical timing failures are:

Class of failure  Affects  Description
Clock             Process  The process's local clock exceeds the bounds on its rate of drift from real time.
Performance       Process  The process exceeds the bounds on the interval between two steps.
Performance       Channel  A message's transmission takes longer than the stated bound.
III. Security Model
Secure processes and channels; protect the objects they encapsulate against unauthorized access.
Protecting access to objects
Access rights; in client-server systems this involves authentication of clients.
Protecting processes and interactions
Threats to processes: the problem of unauthenticated requests and replies, e.g. a "man in the middle".
Threats to communication channels: an enemy may copy, alter or inject messages as they travel across the network.
1. Protecting access to objects
[Figure: Objects and principals - a client (on behalf of a user principal) sends an invocation over the network to a server (a principal in its own right) and receives a result; the server holds objects with access rights, e.g. a user's private data (mailbox) or shared data (web pages). A principal is the authority on whose behalf an invocation is issued.]
2. Protecting processes and interactions
[Figure: The enemy - as message m travels from process p to process q over a communication channel, an enemy may capture a copy of m, alter it, or inject its own messages.]
Analysis of security threats:
Threats to processes: on the server side or the client side.
Threats to communication channels: against the privacy and integrity of information as it travels across the network.
Defeating security threats
[Figure: Secure channels - principals A and B communicate through processes p and q over a secure channel.]
Achieved by cryptography and shared secrets.
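To make "cryptography and shared secrets" concrete, a minimal sketch using the standard javax.crypto API: both principals hold the same AES key, and each message crossing the channel is encrypted with it (key distribution and the surrounding protocol are out of scope here):

import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import java.security.SecureRandom;

public class SecureChannelSketch {
    public static void main(String[] args) throws Exception {
        // Shared secret known to principals A and B (how it is distributed is not shown).
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(128);
        SecretKey key = kg.generateKey();

        byte[] iv = new byte[12];
        new SecureRandom().nextBytes(iv);   // fresh IV for each message

        // A encrypts message m before it crosses the insecure network.
        Cipher enc = Cipher.getInstance("AES/GCM/NoPadding");
        enc.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
        byte[] wire = enc.doFinal("pay Bob 100".getBytes());

        // B decrypts; GCM also detects tampering, protecting integrity.
        Cipher dec = Cipher.getInstance("AES/GCM/NoPadding");
        dec.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(128, iv));
        System.out.println(new String(dec.doFinal(wire)));
    }
}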
Other types of threats
Denial of service
e.g., pings to selected web sites
Generating debilitating network or server load
so that network services become de facto
unavailable
Mobile code:
Requires executability privileges on target
machine
Code may be malicious (e.g., mail worms)
Chapter 3: Distributed Objects & Components
Introduction
Distributed Objects
Need from Distributed Objects to Components
Components
Case Studies: Enterprise JavaBeans and Fractal
Overview of the architecture model
3 objectives: architectural elements, architectural patterns, available middleware platforms
4 architectural elements: entities, communication paradigms, roles & responsibilities, and placement
2 kinds of entities: system-oriented & problem-oriented entities
3 problem-oriented entities: objects, components & web services
3 communication paradigms: inter-process communication, remote invocation (RRP, RPC, RMI), indirect communication (e.g. event-based)
Introduction
This chapter discusses complete middleware solutions, presenting distributed objects and components as two of the most important styles of middleware in use today.
Software that allows a level of programming beyond processes and message passing is called middleware.
Middleware layers are based on protocols and application programming interfaces.
[Figure: Middleware layers - applications sit above RMI, RPC and events; these sit above the request-reply protocol and external data representation, which in turn sit above the operating system.]
Programming Models
Remote procedure calls - client programs call procedures in server programs.
Remote method invocation - objects invoke methods of remote objects on distributed hosts.
Event-based programming model - objects receive notice of events in other objects in which they have an interest.
Interface
Current programming languages allow programs to be
developed as a set of modules that communicate with
each other.
Permitted interactions between modules are defined by
interfaces.
A specified interface can be implemented by different
modules without the need to modify other modules
using the interface.
In a distributed system, a remote interface defines the remote objects on a server, with the input and output arguments of each of the objects' methods that are available to clients.
Remote objects can return objects as results back to the client.
Remote objects can return references to remote objects to the client.
Interfaces do not have constructors.
Benefits of Middleware
Location Transparency:
Remote Objects seem as if they are on the
same machine as the client
Communication Protocols:
Client/Server does not need to know if the
underlying protocol used by the middleware is
UDP or TCP
Computer Hardware/ Operating System:
Hides differences in data representation
caused by different computer hardware or
operating system
Programming Languages:
Allows the client and server programs to be written in different languages.
The major tasks of middleware:
To provide a higher-level programming abstraction for the development of distributed systems.
Through layering, to abstract over heterogeneity in the underlying infrastructure and so promote interoperability and portability.
The types of middleware in use today:
Distributed object middleware
Component-based middleware
1. Distributed object middleware
Adopts an object-oriented programming model, where the communicating entities are represented by objects.
The inherent encapsulation and data abstraction provide more dynamic and extensible solutions.
A range of middleware solutions based on distributed objects is available, including Java RMI and CORBA.
Benefits of Distributed Object Middleware
The encapsulation inherent in object-based solutions is well suited to distributed programming.
The related property of data abstraction provides a clean separation between the specification of an object and its implementation, allowing programmers to deal solely in terms of interfaces and not be concerned with implementation details such as the programming language and operating system used.
This approach also lends itself to more dynamic and extensible solutions, for example by enabling the introduction of new objects or the replacement of one object with another (compatible) object.
Limitations
Implicit dependencies
Programming complexity
Lack of separation of distribution concerns
No support for deployment
(Each limitation is expanded under "Need from Distributed Objects to Components" below.)
2. Component-based middleware
Component-based middleware responds to the limitations of distributed object middleware, adding significant support for distributed systems development and deployment.
Software components are like distributed objects in that they are encapsulated units of composition.
A given component specifies both the interfaces it provides to the outside world and its explicit dependencies on other components in the distributed environment.
Distributed Objects
Middleware based on distributed objects is designed to provide a programming model based on object-oriented principles, bringing the benefits of the object-oriented approach to distributed programming.
The term distributed objects (or remote objects) usually refers to software modules that are designed to work together but reside either on multiple computers connected via a network or in different processes on a single computer.
One object sends a message to another object on a remote machine or in another process to have some task performed, and the result is returned to the caller.
Continued
In a system for distributed objects, the
unit of distribution is the object.
Objects that can receive remote requests
for services are called remote objects.
Remote objects must have a way to be
accessed through a remote reference.
To invoke a method, its signature and
parameters must be defined in a remote
interface.
Together, these technologies are called remote method invocation (RMI).
Local object references (ordinary Java):
Rectangle r1 = new Rectangle();
Rectangle r2 = r1;

Interface for a remote object:
public interface Hello extends java.rmi.Remote {
    String sayHello() throws java.rmi.RemoteException;
}
[Figure: A remote object and its remote interface - the remote interface (defined in an IDL) specifies the methods m1-m3 that can be invoked remotely, while methods m4-m6 and the object's data are accessible only locally. A companion diagram shows remote and local method invocations among objects A-F residing in different processes.]
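Building on the Hello interface above, a hedged sketch of a matching RMI server and client (the registry name, port and class names are illustrative):

import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.rmi.server.UnicastRemoteObject;

// Server side: implement the remote interface and register a stub for it.
public class HelloServer implements Hello {
    public String sayHello() { return "Hello, world!"; }

    public static void main(String[] args) throws Exception {
        Hello stub = (Hello) UnicastRemoteObject.exportObject(new HelloServer(), 0);
        Registry registry = LocateRegistry.createRegistry(1099); // default RMI port
        registry.rebind("Hello", stub);
        System.out.println("Hello server ready");
    }
}

// Client side: obtain a remote reference and invoke the method.
class HelloClient {
    public static void main(String[] args) throws Exception {
        Registry registry = LocateRegistry.getRegistry("localhost");
        Hello h = (Hello) registry.lookup("Hello");
        System.out.println(h.sayHello()); // the remote method invocation
    }
}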
RMI should be able to raise Distributed
exceptions such as timeouts that are due to
distribution as well as those raised during
the execution of the method invoked.
Distributed garbage collection is generally
achieved by cooperation between the
existing local garbage collector and an
added module that carries out a form of
distributed garbage collection.
Due to the level of heterogeneity that may exist in a distributed system, the concepts of class and inheritance are either avoided or adapted.
The added complexities
1. Inter-object communication:
Distributed object middleware framework must offer one or
more mechanisms for objects to communicate in the
distributed environment.
2. Lifecycle management:
Lifecycle management is concerned with the creation,
migration and deletion of objects, with each step having to
deal with the distributed nature of the underlying
environment.
3. Activation and deactivation:
Activation is the process of making an object active in the
distributed environment by providing the necessary
resources for it to process incoming invocations effectively,
locating the object in virtual memory and giving it the
necessary threads to execute. Deactivation is then the opposite process, rendering an object temporarily unable to process invocations.
Continued
4. Persistence:
Objects typically have state, and it is important to
maintain this state across possible cycles of activation
and deactivation and indeed system failures.
Distributed object middleware must therefore offer
persistency management for stateful objects.
5. Additional services:
A comprehensive distributed object middleware framework must also provide support for the range of distributed system services, viz. naming, security and transaction services.
Examples of Distributed Object Middleware
[Table: examples include Java RMI and CORBA.]
Need from Distributed Object to Components
1. Implicit dependencies:
A distributed object offers a contract (interface) that represents a binding agreement between the provider of the object and the users of that object, in terms of its expected behavior (methods).
Implicit dependencies make it hard to replace one object with another, and hence also for third-party developers to implement one particular element in a distributed configuration.
Requirement: There is a clear requirement to specify not only the interfaces offered by an object but also the dependencies that the object has on other objects in the distributed configuration.
2. Interaction with the middleware:
Despite the goal of transparency, programmers are exposed to many relatively low-level details of the middleware architecture, which need further simplification.
Requirement: There is a clear need to simplify the programming of distributed applications and to present a clean separation of concerns between code related to operation in a middleware framework and code associated with the application.
3. Lack of separation of distribution concerns:
Programmers using distributed object
middleware also have to deal explicitly with
non-functional concerns related to issues
such as security, transactions, coordination
and replication.
Requirement: The separation of concerns related to the above issues should be extended by providing the full range of distributed system services.
The complexities of dealing with these distributed system services should be hidden wherever possible from the programmer.
4. No support for deployment:
Technologies such as Java RMI and CORBA do not support the deployment of the arbitrary distributed configurations that have been developed.
Requirement: Middleware platforms should provide intrinsic support for deployment, so that distributed software can be installed and deployed in the same way as software for a single machine, with the complexities of deployment hidden from the user.
Components
A component can be thought of as a collection of objects that provide a set of services to other systems.
The set of services includes code providing
graphing facilities, network communication
services, browsing services related to database
tables etc.
The Object Linking and Embedding (OLE) architecture was one of the first component-based frameworks; Microsoft Excel spreadsheets, for example, are built on it.
Rationale for Components
Highlights
Improved productivity/ reduced complexity
Emphasis on reuse
Programming by assembly
(manufacturing) rather than
development (engineering)
Reduced skills requirement
Key benefit on server side development
(for example EJB )
Essence of component
A component is specified in terms of a contract, which
includes
A set of provided interfaces - that is, interfaces that the component offers as services to other components.
A set of required interfaces - that is, the dependencies that this component has in terms of other components that must be present and connected to it for it to function correctly.
Note: Interfaces in component-based middleware include:
interfaces supporting RMI, as in CORBA and Java RMI;
interfaces supporting distributed events, as in indirect communication.
Component-based development
Programming in component-based systems is concerned with the development of components and their composition.
Goal:
Support a style of software development that parallels hardware development, using off-the-shelf components and composing them together to develop more sophisticated services.
This supports third-party development of software components and also makes it easier to adapt system configurations at runtime, by replacing one component with another.
Note: Components are encapsulated in containers.
Containers
Containers support a common pattern often encountered in distributed systems development, consisting of:
a front-end (web-based) client;
a container holding one or more components that implement the application or business logic;
system services that manage the associated data in persistent storage.
Tasks of a container:
Provides a managed server-side hosting environment for components.
Provides the necessary separation of concerns: the components deal with the application concerns; the container deals with the distributed systems and middleware issues.
Continued
The container implements middleware services such as:
authenticating users;
making an application remotely accessible;
providing transaction handling;
other services: activation and passivation, persistence, life-cycle management, container metadata (introspection), packaging and deployment.
The container invokes such services at the appropriate time during the execution of the business logic, in a transparent way.
The container supports modularization of services, which can be encapsulated and decoupled to tailor them to the specific application's needs.
Structure of a Container
[Figure] A number of components are encapsulated within a container.
The container does not provide direct access to the components but rather intercepts incoming invocations and then takes appropriate actions to ensure the desired properties of the distributed application.
Example of a Container (EJB)
[Figure]
Application Server
Middleware that supports the container
pattern and the separation of concerns implied
by this pattern is known as an application
server.
A wide range of application servers are now
available:
Note: The Enterprise JavaBeans specification is an example of an application server standard.
Component-based deployment
Component-based middleware provides support for
the deployment of component configurations.
Deployment descriptors fully describe how the configurations should be deployed in a distributed environment.
Deployment descriptors are typically written in XML and include sufficient information to ensure that:
components are correctly connected using appropriate protocols and associated middleware support;
the underlying middleware and platform are configured to provide the right level of support to the component configuration;
the associated distributed system services are set up to provide the right level of security, transaction support and so on.
Case Study 1: Enterprise Java
Bean(EJB)
What is EJB?
A server-side component architecture for
Java
Based on the concept of a container
Offers implicit distributed systems
management
Formalises the interface between a
managed bean (EJB) and its container
Event (callback) interface
Services expected in the container
Deployment using JAR files
Enterprise beans
The Enterprise JavaBeans architecture is a component
architecture for the development and deployment of
component-based distributed business applications.
Example: In an inventory control application, the enterprise beans might implement the business logic in methods called checkInventoryLevel and orderProduct.
Benefits of Enterprise Beans
The EJB container provides system-level services to enterprise beans, so the bean developer can concentrate on solving business problems.
The client developer can focus on the presentation of the client.
The application assembler can build new applications from existing beans.
When to use enterprise beans
The application must be scalable. To accommodate a growing number of users, there is a need to distribute an application's components across multiple machines. Not only can the enterprise beans of an application run on different machines, but their location will also remain transparent to the clients.
Transactions must ensure data integrity. Enterprise beans support transactions - the mechanisms that manage the concurrent access of shared objects.
The application will have a variety of clients. With only a few lines of code, remote clients can easily locate enterprise beans. These clients can be thin, various, and numerous.
Programming in EJB
The task of programming in EJB has been simplified significantly through the use of POJOs (plain old Java objects) together with Java annotations.
A bean is a POJO supplemented by annotations.
Annotations were introduced in Java 1.5 as a mechanism for associating metadata with packages, classes, methods, parameters and variables.
The following are examples of annotated bean definitions:
@Stateful public class eShop implements Orders {...}
@Stateless public class CalculatorBean implements Calculator {...}
@MessageDriven public class SharePrice implements MessageListener {...}
The following example introduces the Orders interface as a remote interface and the Calculator interface from the CalculatorBean as a local interface only:
@Remote public interface Orders {...}
@Local public interface Calculator {...}
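To make the use of such beans concrete, the following sketch shows how a remote client might look up and invoke the Orders bean via JNDI. This is a minimal sketch, assuming a Java EE 6+ application server and a portable java:global JNDI name; the application name "shop" and the placeOrder method are hypothetical.

import javax.ejb.Remote;
import javax.naming.InitialContext;
import javax.naming.NamingException;

// The remote interface from the slide above, given one hypothetical method.
@Remote
interface Orders {
    void placeOrder(String productId, int quantity);
}

public class OrdersClient {
    public static void main(String[] args) throws NamingException {
        InitialContext ctx = new InitialContext();
        // The container returns a proxy that forwards invocations to the
        // managed eShop bean instance (the JNDI name here is an assumption).
        Orders orders = (Orders) ctx.lookup("java:global/shop/eShop");
        orders.placeOrder("book-42", 1);   // invoked like a local call
    }
}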
Types of EJBs
1. Session Bean: an EJB used for implementing high-level business logic and processes.
Session beans handle complex tasks that require interaction with other components (entities, web services, messaging, etc.).
A session bean represents the state of a single interactive communication session between a client and the business tier of the server.
Session beans are transient:
when a session is completed, the associated bean is discarded;
in case of failures, session beans are lost, as they are not stored in stable storage.
There are two categories of session beans:
Stateful session bean: holds the conversational state required by the currently open session.
Stateless session bean: holds no state outside of a call; inputs come from the client tier, and instances may be pooled or reused.
2. Message Driven Bean
An EJB used to integrate with external services via asynchronous messages using the Java Message Service (JMS).
Usually, such an EJB delegates the business logic to session beans, invoked using RMI.
On the server tier it uses non-blocking primitives.
Note: there are also entity beans, which provide an in-memory copy of long-term data.
EJB containers
EJB container
A runtime environment that provides services such as transaction management, concurrency control, pooling, and security authorization.
Historically, application servers have added other features such as clustering, load balancing, and failover.
Some JEE Application Servers
GlassFish (Sun/Oracle, open source edition)
WebSphere (IBM)
WebLogic (Oracle)
JBoss (Red Hat)
WebObjects (Apple)
Need for Fractal
Lack of tailorability in the EJB container:
There is no mechanism to configure the EJB container.
There is no mechanism to configure infrastructure services.
It is not possible to add new services to the EJB container.
It prevents non-functional aspects such as levels of control facilities for components.
It lacks trade-offs such as degree of configurability vs. performance and space consumption.
Its frameworks and languages are not usable in different environments, e.g. embedded systems.
(Recall also that there is no support for the deployment of arbitrary distributed configurations.)
Case Study 2: Fractal
Goals:
Motivate the main features of the Fractal model:
composite components (to have a uniform view of applications at various abstraction levels),
shared components (to model resources),
introspection capabilities (to monitor a running system),
configuration and reconfiguration capabilities (to deploy and dynamically reconfigure an application)
Essence of Fractal
Fractal is a lightweight component model that can be used with various programming languages to design, implement, deploy and reconfigure systems and applications, from operating systems to middleware to GUIs.
The Fractal component model applies three separation-of-concerns design principles.
The Fractal model is also referred to as an open component model, in the sense that it also defines factory components, i.e. components that can create new components.
I. Various programming languages
Programming platforms
Julia and AOKell (Java-based)
Cecilia and Think (C-based)
FracNet (.NET-based)
FracTalk (Smalltalk-based)
Julio (Python-based)
Julia and Cecilia are treated as the reference implementations of Fractal.
Middleware platforms
Think (a configurable operating system kernel)
DREAM (supporting various forms of indirect communication)
GOTM (offering flexible transaction management)
ProActive (grid computing)
Jasmine (monitoring and management of SOA platforms)
II. Separation of concerns principles
1. Separation of interface and implementation
the bridge pattern: separation of design and implementation concerns
guarantees the replacement of one component with another, without worrying about class-inheritance problems
deals with the core component model
2. Component-oriented programming
separation of implementation concerns
deals with well-separated entities called components
3. Inversion of control
separation of functional and configuration concerns
levels of control
deals with the configuration and deployment of external entities
Core component model
Defined on the binding and structure of Fractal components, and based on interfaces.
Two types of interfaces are available:
server interfaces
which support incoming operational invocations
equivalent to provided interfaces
client interfaces
which support outgoing invocations
equivalent to required interfaces
Note: communication between Fractal components is only possible if their interfaces are bound. This leads to the composition of Fractal components.
Binding in Fractal
To enable composition, Fractal supports bindings between interfaces.
The two styles of binding are:
Primitive bindings
Composite bindings
Primitive bindings
A direct mapping between one client interface and one server interface within the same address space.
An operation invocation emitted by the client interface should be accepted by the specified server interface.
They can readily be implemented using pointers or direct language references (in Java, object references).
Composite bindings
Built out of a set of primitive bindings and binding components such as stubs, skeletons, adapters, etc.
Implemented in terms of a communication path between a number of component interfaces, potentially on different machines.
Composite bindings are themselves components in Fractal:
interconnection (remote invocation or indirect, point-to-point or multiparty)
reconfigurable at runtime (security and scalability)
Structure of Fractal components
A Fractal component is a runtime entity that is encapsulated, has a distinct identity and supports one or more interfaces.
The architecture is based on:
Membrane & controllers (non-functional concern)
supports the interfaces to introspect and reconfigure its internal features
defines the control capabilities
Content (functional concern)
consists of a finite set of other components (sub/nested or shared components)
[Figure: The structure of a Fractal component — external and internal interfaces, interceptors, and controllers for activity, threads and scheduling.]
Purpose of Controllers
1. Implementation of lifecycle management:
activation or deactivation of a process
allows even the replacement of a server by an enhanced server
2. Offers introspection capabilities:
interfaces associated with similar components
replacement of a client call by another client
3. Offers interception capabilities:
implement an access control policy
transparent invocation between client and server
Purpose of Membrane
To provide different levels of control:
simple encapsulation of components
support for non-functional issues such as transactions and security, as in application servers
Levels of control
Low-level controls (runtime entities, base components, e.g. objects in Java)
Middle-level controls (introspection level; provides component interfaces, e.g. external interfaces in client-server, COM services)
High-level controls (configuration level; exploits internal elements)
Additional levels of control (a framework for the instantiation of components)
Content
The content of a component is composed of (a finite number of) other components, called sub-components, which are under the control of the controller of the enclosing component.
The Fractal model is thus recursive and allows components to be nested (i.e. to appear in the content of enclosing components) at an arbitrary level.
A component that exposes its content is called a composite component.
A component that does not expose its content, but has at least one control interface, is called a primitive component.
A component without any control interface is called a base component.
Content
Sub-components:
hierarchy of components
components as run-time entities (computational units)
caller and callee interfaces in client-server
Sharing of components:
software architectures with resources
menu and toolbar components
sharing of an Undo button
III. Factory components
Components that can create new components.
Generic component factories:
create several kinds of components
provide the GenericFactory interface
Standard factories:
create only one kind of component
use templates and sub-templates
Benefits of the Fractal Component Model
1. It enforces the definition of good modular design in terms of the binding and structure of components.
2. It enforces the separation of interfaces and implementation, which ensures a minimum level of flexibility.
3. It enforces the separation between the functional, configuration and deployment concerns, which allows the application architecture to be described separately from the code.
Note: all these features increase productivity.
(Further reading: http://fractal.objectweb.org)
Chapter 4: Remote Invocation
1. Introduction
2. Remote Procedure Call
3. Events & Notification
1. Introduction
Middleware layers are based on protocols and application programming interfaces.
L1: Request-reply Protocol
[Figure: The client's doOperation sends a request message and waits; the server's getRequest receives it, selects the object, executes the method, and sendReply returns the reply message, after which the client continues.]
Operations of the request-reply protocol
public byte[] doOperation(RemoteObjectRef o, int methodId, byte[] arguments)
sends a request message to the remote object and returns the reply. The arguments specify the remote object, the method to be invoked and the arguments of that method.
public byte[] getRequest();
acquires a client request via the server port.
public void sendReply(byte[] reply, InetAddress clientHost, int clientPort);
sends the reply message reply to the client at its Internet address and port.
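As an illustration of how doOperation can be realized, here is a minimal client-side sketch over UDP. The message layout loosely follows the request-reply message structure on the next slide (messageType, requestId, methodId, arguments); the omission of the object reference, timeouts and retries is a simplifying assumption, so this sketch gives maybe semantics at best.

import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.ByteBuffer;

public class RequestReplyClient {
    private static final int MAX_REPLY = 4096;
    private int nextRequestId = 0;

    public byte[] doOperation(InetAddress serverHost, int serverPort,
                              int methodId, byte[] arguments) throws Exception {
        try (DatagramSocket socket = new DatagramSocket()) {
            // Marshal the request: messageType 0 = Request, then requestId
            // and methodId, followed by the already-marshalled arguments.
            ByteBuffer buf = ByteBuffer.allocate(12 + arguments.length);
            buf.putInt(0).putInt(nextRequestId++).putInt(methodId).put(arguments);
            byte[] request = buf.array();
            socket.send(new DatagramPacket(request, request.length,
                                           serverHost, serverPort));

            // Block until the reply arrives; no timeout or retransmission.
            byte[] reply = new byte[MAX_REPLY];
            DatagramPacket packet = new DatagramPacket(reply, reply.length);
            socket.receive(packet);
            byte[] result = new byte[packet.getLength()];
            System.arraycopy(reply, 0, result, 0, packet.getLength());
            return result;   // the raw reply bytes, to be unmarshalled by the caller
        }
    }
}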
Request-reply message structure
messageType        int (0 = Request, 1 = Reply)
requestId          int
objectReference    RemoteObjectRef
methodId           int or Method
arguments          array of bytes
Three modes of doOperation
1. Retry request message
whether to retransmit the message until either a reply arrives or the server appears to have failed
2. Duplicate filtering
when retransmissions are used, whether to filter out duplicate requests at the server
3. Retransmission of results
whether to keep a history of results, to avoid re-executing server operations
Note: combinations of these choices lead to a variety of invocation semantics.
Invocation Semantics
Maybe invocation semantics
At-least-once invocation semantics
At-most-once invocation semantics
Note: also known as call semantics.
1. Maybe invocation
The remote method may execute or not at all; the invoker cannot tell. Useful only if occasional failures are acceptable.
Invocation message lost... method not executed.
Result not received... was the method executed or not?
Server crash... before or after the method executed?
If a timeout is used, the result could still arrive after the timeout.
2. At-least-once invocation
The invoker receives a result (method executed at least once) or an exception (no result; executed once or not at all).
Achieved by retransmission of request messages.
Invocation message retransmitted... the method may be executed more than once:
arbitrary failure (a wrong result is possible)
the method must be idempotent (repeated execution has the same effect as a single execution)
Server crash... dealt with by timeouts, exceptions.
3. At-most-once invocation
The invoker receives a result (method executed exactly once) or an exception (no result was received).
Achieved by retransmission of reply & request messages plus duplicate filtering.
Best fault tolerance... arbitrary failures are prevented, since the method is called at most once.
Used by CORBA and Java RMI.
L2: Programming Models
Remote procedure call (RPC)
call a procedure in a separate process
client programs call procedures in server programs
Remote method invocation (RMI)
extension of local method invocation in the OO model
invoke the methods of an object of another process
Event-based model
register interest in events of other objects
receive notification of the events at other objects
Remote Procedure Call (RPC)
Introduction
Design issues
Implementation
Case study: Sun RPC
Introduction
Remote Procedure Call (RPC) is a high-level model for client-server communication.
It provides programmers with a familiar mechanism for building distributed systems.
Examples: file service, authentication service.
Introduction
Why do we need Remote Procedure Call (RPC)?
The client needs an easy way to call the procedures of the server to get some service.
RPC enables clients to communicate with servers by calling procedures in a similar way to the conventional use of procedure calls in high-level languages.
RPC is modelled on the local procedure call, but the called procedure is executed in a different process and usually on a different computer.
Introduction
How does RPC operate?
When a process on machine A calls a procedure on machine B, the calling process on A is suspended, and the execution of the called procedure takes place on B.
Information can be transported from the caller to the callee in the parameters and can come back in the procedure result.
No message passing or I/O at all is visible to the programmer.
Introduction
The RPC model
[Figure: The client calls the procedure and waits for the reply (blocking state); the server receives the request and starts procedure execution (executing state); the server sends the reply and waits for the next request, and the client resumes execution.]
Characteristics
The called procedure is in another process, which may reside on another machine.
The processes do not share address space:
passing parameters by reference and passing pointer values are not allowed;
parameters are passed by value.
The called remote procedure executes within the environment of the server process.
The called procedure does not have access to the calling procedure's environment.
Features
Simple call syntax
Familiar semantics
Well-defined interface
Ease of use
Efficiency
Can communicate between processes on the same machine or different machines
Limitations
Parameters are passed by value only; pointer values are not allowed.
Speed: remote procedure call (and return) time, i.e. the overhead, can be significantly (1-3 orders of magnitude) slower than a local procedure call.
This may affect real-time designs, and the programmer should be aware of its impact.
Failure: RPC is more vulnerable to failure, since it involves a communication system, another machine and another process.
The programmer should be aware of the call semantics: programs that make use of RPC must be able to handle errors that cannot occur in local procedure calls.
Design Issues
Exception handling
necessary because of the possibility of network and node failures;
RPC uses the return value to indicate errors.
Transparency
syntactic transparency is achievable: exactly the same syntax as a local procedure call;
semantic transparency is impossible because of RPC's limitations: failure modes are similar but not exactly the same.
Design Issues
Delivery guarantees
Retry request message: whether to retransmit the request message until either a reply is received or the server is assumed to have failed.
Duplicate filtering: when retransmissions are used, whether to filter out duplicates at the server.
Retransmission of replies: whether to keep a history of reply messages, to enable lost replies to be retransmitted without re-executing the server operations.
Call Semantics
Maybe call semantics
After an RPC time-out (or a client crash and restart), the client cannot tell whether the remote procedure (RP) was called or not.
This is the case when no fault tolerance is built into the RPC mechanism.
Clearly, maybe semantics is not desirable.
Call Semantics
At-least-once call semantics
With this call semantics, the client can assume that the RP was executed at least once (on return from the RP).
Can be implemented by retransmission of the (call) request message on time-out.
Acceptable only if the server's operations are idempotent, that is, f(f(x)) = f(x).
Call Semantics
At-most-once call semantics
When an RPC returns, it can be assumed that the remote procedure (RP) was called exactly once or not at all.
Implemented by the server's filtering of duplicate requests (which are caused by retransmissions due to IPC failure, or a slow or crashed server) and caching of replies (in a reply history; refer to the RRA protocol).
Call Semantics
This ensures the RP is called exactly once if the server does not crash during execution of the RP.
When the server crashes during the RP's execution, the partial execution may lead to erroneous results.
In this case, we want the effect that the RP has not been executed at all.
RPC Mechanism
[Figure: In the client process, the client program calls the client stub procedure, which hands the request to the communication module; in the server process, the communication module passes it to the dispatcher, which selects the server stub procedure, which in turn calls the service procedure; the reply travels back along the same path.]
RPC Mechanism:
[Figure: On the client computer, the client procedure makes a local call to the client stub, which marshals the arguments and sends the request; on the server computer, the communication module receives the request and selects the procedure, and the server stub unmarshals the arguments, executes the service procedure, marshals the results and sends the reply; the client stub receives the reply, unmarshals the results and performs the local return.]
RPC Mechanism:
1. The client provides the arguments and calls the client stub in the normal way.
2. The client stub builds (marshals) a message (call request) and traps to the OS & network kernel.
3. The kernel sends the message to the remote kernel.
4. The remote kernel receives the message and gives it to the server dispatcher.
5. The dispatcher selects the appropriate server stub.
6. The server stub unpacks (unmarshals) the parameters and calls the corresponding server procedure.
RPC Mechanism
7. The server procedure does the work and returns the result to the server stub.
8. The server stub packs (marshals) it in a message (call return) and traps to the OS & network kernel.
9. The remote (server) kernel sends the message to the client kernel.
10. The client kernel gives the message to the client stub.
11. The client stub unpacks (unmarshals) the result and returns it to the client.
A pair of Stubs
Client-side stub
looks like the local server function
same interface as the local function
bundles arguments into a message and sends it to the server-side stub
waits for the reply, unbundles the results and returns
Server-side stub
looks like a local client function to the server
listens on a socket for messages from the client stub
unbundles arguments into local variables
makes a local function call to the server
bundles the result into a reply message to the client stub
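The sketch below shows what a client-side stub might look like for a trivial Calculator interface, reusing the RequestReplyClient sketch shown earlier. The interface, the method numbering and the marshalling format are assumptions for illustration, not a standard API.

import java.net.InetAddress;
import java.nio.ByteBuffer;

interface Calculator {
    int add(int a, int b);
}

class CalculatorClientStub implements Calculator {
    private static final int ADD_METHOD_ID = 1;   // agreed with the server stub
    private final RequestReplyClient rr = new RequestReplyClient();
    private final InetAddress serverHost;
    private final int serverPort;

    CalculatorClientStub(InetAddress serverHost, int serverPort) {
        this.serverHost = serverHost;
        this.serverPort = serverPort;
    }

    // Looks like a local call: bundles the arguments into a message, sends it
    // to the server-side stub, waits for the reply and unbundles the result.
    public int add(int a, int b) {
        byte[] args = ByteBuffer.allocate(8).putInt(a).putInt(b).array();
        try {
            byte[] reply = rr.doOperation(serverHost, serverPort, ADD_METHOD_ID, args);
            return ByteBuffer.wrap(reply).getInt();   // unmarshal the int result
        } catch (Exception e) {
            throw new RuntimeException("RPC failed", e);   // surface comms failures
        }
    }
}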
RPC Implementation
Three main tasks:
Interface processing: integrate the RPC mechanism with client and server programs in conventional programming languages.
Communication handling: transmitting and receiving request and reply messages.
Binding: locating an appropriate server for a particular service.
Case Study: SUN RPC
Designed for client-server communication, as in the SUN NFS.
Also called ONC (Open Network Computing) RPC.
Supplied as part of the Sun OS product and available with Unix System V, Linux, BSD and OS X, and installed with NFS.
Uses at-least-once call semantics.
Interfaces are defined in an Interface Definition Language (IDL).
Interface definition language: XDR
initially XDR was used for data representation
a standard way of encoding data in a portable fashion between different systems
Interface compiler: rpcgen
used with the C programming language
a compiler that takes the definition of a remote procedure interface and generates the client stubs and the server stubs
Communication handling: TCP or UDP
with UDP, the length of request & reply messages is restricted — in theory to 64 kilobytes, but more often in practice to 8 or 9 kilobytes
Binding service: port mapper
RPC IDL
program numbers instead of interface names (unique)
procedure numbers instead of procedure names (version changes)
a single input procedure parameter (structs)
[Figure: an interface carries a program number and a version number; each procedure definition (e.g. the WRITE and READ file procedures) is identified by a procedure number within a version.]
Files interface in Sun XDR (sample code)
const MAX = 1000;
typedef int FileIdentifier;
typedef int FilePointer;
typedef int Length;
struct Data {
int length;
char buffer[MAX];
};
struct writeargs {
FileIdentifier f;
FilePointer position;
Data data;
};
struct readargs {
FileIdentifier f;
FilePointer position;
Length length;
};
program FILEREADWRITE {
version VERSION {
void WRITE(writeargs)=1;
Data READ(readargs)=2;
}=2;
} = 9999;
Compiler: rpcgen
rpcgen name.x produces:
name.h — header
name_svc.c — server stub
name_clnt.c — client stub
[name_xdr.c] — XDR conversion routines
What goes on in the system: server
Start the server.
The server stub creates a socket and binds any available local port to it.
It calls a function in the RPC library, svc_register, to register program # and port #.
It contacts the portmapper (rpcbind on SVR4):
a name server
keeps track of {program #, version #, protocol} → port # bindings
The server then listens and waits to accept connections.
What goes on in the system: client
The client calls clnt_create with:
name of server
program #
version #
protocol #
clnt_create contacts the port mapper on that server to get the port for that interface.
Early binding — done once, not per procedure call.
Authentication
SUN RPC request and reply messages have additional fields for authentication information to be passed between client and server.
SUN RPC supports the following authentication protocols:
UNIX style, using the uid and gid of the user
a shared key established for signing the RPC messages
the well-known Kerberos style of authentication
Advantages
You don't need to worry about getting a unique transport address (port), but with SUN RPC you do need a unique program number per server.
Greater portability.
Transport independence:
the protocol can be selected at run time;
the application does not have to deal with maintaining message boundaries, fragmentation, reassembly.
Applications need to know only one transport address: the port mapper.
The function call model can be used instead of send/receive.
Event-Notification model
Idea
One object reacts to a change occurring in another object.
Events cause changes in the objects that maintain the state of the application.
Objects that represent events are called notifications.
Event examples
modification of a document
entering text in a text box using the keyboard
clicking a button using the mouse
Publish/subscribe paradigm
an event generator publishes the types of events it offers
an event receiver subscribes to the types of events that are of interest to it
when an event occurs, the subscribers are notified
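The following is a minimal in-process sketch of the publish/subscribe paradigm described above. The string event types and the Notification and EventService names are assumptions for illustration; real distributed event systems add remote delivery, filtering and persistence.

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

class Notification {
    final String eventType;
    final Object payload;
    Notification(String eventType, Object payload) {
        this.eventType = eventType;
        this.payload = payload;
    }
}

interface Subscriber {
    void onNotification(Notification n);
}

class EventService {
    private final Map<String, List<Subscriber>> subscribers = new ConcurrentHashMap<>();

    // An event receiver subscribes to the types of events of interest to it.
    void subscribe(String eventType, Subscriber s) {
        subscribers.computeIfAbsent(eventType, k -> new CopyOnWriteArrayList<>()).add(s);
    }

    // An event generator publishes an event; every subscriber of that type
    // is notified.
    void publish(Notification n) {
        subscribers.getOrDefault(n.eventType, List.of())
                   .forEach(s -> s.onNotification(n));
    }
}

public class PubSubDemo {
    public static void main(String[] args) {
        EventService service = new EventService();
        service.subscribe("stock-price", n ->
                System.out.println("Dealer sees: " + n.payload));
        service.publish(new Notification("stock-price", "ACME = 42.10"));
    }
}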
A distributed event-based system has two characteristics:
1. Heterogeneous:
a way to standardize communication between heterogeneous systems not designed to communicate directly;
components in a DS that were not designed to interoperate can be made to work together;
e.g. heterogeneous components are used in applications that describe a user's location and activities.
2. Asynchronous:
decoupling of publisher and subscriber;
prevents publishers from needing to synchronize with subscribers.
Example: dealing room system
Requirements
allow dealers to see the latest market prices of the stocks they deal in.
The market price for a single named stock is represented by an object with several instance variables.
System components
Information provider process
receives new trading information
publishes stock price events
stock price update notifications
Dealer process
subscribes to stock price events
Dealing room system
[Figure: External sources feed two information provider processes, which send notifications of stock price events to dealer processes running on the dealers' computers.]
The participants in distributed event notification
The object of interest
its changes of state might be of interest to other objects
Event
an event occurs at an object of interest as the completion of a method execution
Notification
an object that contains information about an event
Subscriber
an object that has subscribed to some type of events in another object
Observer objects
their main purpose is to decouple an object of interest from its subscribers
avoid over-complicating the object of interest
Publisher
an object that declares that it will generate notifications of particular types of event; may be an object of interest or an observer
Architecture for distributed event notification
Event service: maintains a database of published events and of subscribers' interests; decouples the publishers from the subscribers.
[Figure: Three arrangements — (1) an object of interest inside the event service sends notifications directly to a subscriber; (2) an object of interest sends notifications via an observer to a subscriber; (3) an observer queries an outside object of interest and sends notifications to a subscriber.]
Three cases
Inside object without an observer: sends notifications directly to the subscribers.
Inside object with an observer: sends notifications via the observer to the subscribers.
Outside object (with an observer):
1. an observer queries the object of interest in order to discover when events occur;
2. the observer sends notifications to the subscribers.
Notification Delivery
Delivery semantics
unreliable, e.g. delivering the latest state of a player in an Internet game
reliable, e.g. the dealing room
real-time, e.g. a nuclear power station or a hospital patient monitor
Roles for observer processes
Forwarding
send notifications to subscribers on behalf of one or more objects of interest
Filtering of notifications
according to some predicate
reduces the number of notifications
Patterns of events
describe the relationship between several events
Notification mailboxes
notifications may be delayed until the subscriber is ready to receive them
Jini distributed event specification
Allows a potential subscriber in one Java Virtual Machine (JVM) to subscribe to and receive notifications of events in an object of interest in another JVM.
Main objects
event generators (publishers)
remote event listeners (subscribers)
remote events (events)
third-party agents (observers)
An object subscribes to events by informing the event generator about the type of event and specifying a remote event listener as the target for notification.
EventGenerator interface
provides the register method
the event generator implements it
a subscriber invokes it to subscribe to the events of interest
RemoteEventListener interface
provides the notify method
the subscriber implements it
the subscriber receives notifications when the notify method is invoked
RemoteEvent
a notification that is passed as an argument to the notify method
Third-party agents
interpose between an object of interest and a subscriber
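A hedged sketch of how these Jini-style interfaces fit together is shown below. The names mirror the roles described above, but the signatures are simplified assumptions (the real net.jini API also involves leases, sequence numbers and marshalled handback objects).

import java.rmi.Remote;
import java.rmi.RemoteException;

// A notification passed to the listener (simplified stand-in for RemoteEvent).
interface RemoteEventSketch {
    long eventId();          // which type of event occurred
    long sequenceNumber();   // ordering information for the subscriber
}

// Implemented by the subscriber; invoked to deliver a notification.
interface RemoteEventListenerSketch extends Remote {
    void notifyEvent(RemoteEventSketch event) throws RemoteException;
}

// Implemented by the event generator (publisher); a subscriber invokes
// register, naming a remote event listener as the target for notification.
interface EventGeneratorSketch extends Remote {
    void register(long eventId, RemoteEventListenerSketch listener)
            throws RemoteException;
}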
MCT702: UNIT II
Distributed Operating Systems
Chapter 1: Architectures of Distributed Systems
Introduction
Issues in distributed operating systems (9)
Communication primitives (2)
Chapter 2: Theoretical Foundation
Inherent limitations of a distributed system
Lamport's logical clock and its limitations
Vector clock
Causal ordering of messages
Global state and Chandy-Lamport's recording algorithm
Cuts of a distributed computation
Termination detection
Continued
Chapter 3: Distributed Mutual Exclusion
Non-token-based algorithms
Lamport's algorithm
Token-based algorithms
Suzuki-Kasami's broadcast algorithm
Consensus and related problems
Comparative performance analysis (4)
Chapter 4: Distributed Deadlock Detection
Issues
Centralized deadlock-detection algorithms (2)
Distributed deadlock-detection algorithms (4)
Reference: Chapters 4, 5, 6 and 7,
Advanced Concepts in Operating Systems by Mukesh Singhal and N. G. Shivaratri (Tata McGraw-Hill)
Chp 1: Architectures of Dist. Sys.
Introduction:
Distributed system describes a system with the following characteristics:
consists of several computers that do not share a memory or a clock;
the computers communicate with each other by exchanging messages over a communication network; and
each computer has its own memory and runs its own operating system.
Architecture of Distributed OS
System Architecture Types
Minicomputer model: the distributed system consists of several minicomputers, where each computer supports multiple users and provides access to remote resources. E.g. VAX processors.
(no. of processors / no. of users) < 1
Workstation-server model: consists of several workstations, where each user is provided with a workstation consisting of a powerful processor, memory and display. With the help of a DFS, users can access data regardless of its location. E.g. Athena and Andrew.
(no. of processors / no. of users) ≈ 1
Processor-pool model: allocates one or more processors according to the user's need. Once the processors complete their jobs, they return to the pool and await a new assignment. E.g. Amoeba; hybrid models use combinations of these.
(no. of processors / no. of users) > 1
Evolution of Modern Operating Systems
Definition:
Distributed operating system:
an integration of system services presenting a transparent view of a multiple-computer system with distributed resources and control;
consisting of concurrent processes accessing distributed, shared or replicated resources through message passing in a network environment.
Sharing of resources and coordination of distributed activities in networked environments are the main goals in the design of a distributed operating system.
The key distinction between a network OS and a distributed OS is the concept of transparency:
concurrency transparency (also in a centralized OS)
location transparency
parallelism and performance transparency
migration transparency
replication transparency
Distributed operating systems consist of three major components:
coordination of distributed processes
management of distributed resources
implementation of distributed algorithms
Issues in Designing Distributed Operating Sys.
1. Global knowledge
2. Naming
3. Scalability
4. Compatibility
5. Process synchronization
6. Resource management
7. Security
8. Structuring
9. Client-server computing models
1. Global Knowledge
Complete and accurate knowledge of all processes and resources is not easily available.
Difficulties arise due to:
absence of global shared memory
absence of a global clock
unpredictable message delays
Challenges:
decentralized system-wide control
total temporal ordering of system events
process synchronization (deadlocks, starvation)
2. Naming
Names are used to refer to objects, which include computers, printers, services, files and users.
Objects are encapsulated in servers, and the only visible entities in the system are servers. To contact a server, the server must be identifiable.
Three identification methods:
1. identification by name (name server)
2. identification by physical or logical address (network server)
3. identification by the service the servers provide (components)
Object models and their naming must be addressed early in the system design, as many things depend on the naming scheme, e.g.:
structure of the system
management of the namespace
name resolution
access methods
3. Scalability
Systems generally grow with time.
The design should be such that growth does not result in system unavailability or degraded performance.
E.g. broadcast-based protocols work well for small systems but not for large ones:
in a distributed file system, on a larger scale, the increase in broadcast queries for file locations affects the performance of every computer.
4. Compatibility
Refers to the interoperability among the resources in a system.
There are three levels of compatibility in a DS:
Binary level: all processes execute the same instruction set, even though the processors may differ in performance and in input-output.
E.g. the Emerald distributed system.
Program development is easy.
The DS cannot include computers with different architectures.
Rarely supported in large distributed systems.
Compatibility
Execution level: the same source code can be compiled and executed properly on any computer in the system.
E.g. the Andrew and Athena systems support execution-level compatibility.
Protocol level: the least restrictive form of compatibility.
Requires all system components to support a common set of protocols.
Individual computers can run different operating systems.
A distributed system supporting protocol-level compatibility employs common protocols for essential system services such as the file system.
5. Process Synchronization
Process synchronization is difficult because of the unavailability of shared memory.
A DOS has to synchronize processes running at different computers when they try to concurrently access shared resources: the mutual exclusion problem.
Requests must be serialized to secure the integrity of the shared resources.
In a DS, processes can request resources (local or remote) and release resources in any order.
If the sequence of resource allocation is not controlled, deadlock may occur, which can lead to decreased system performance.
6. Resource Management
Concerned with making both local and remote resources available to users in an effective manner.
Users should be able to access remote resources as easily as they can access local resources.
The specific location of resources should be hidden from users, in the following ways:
data migration
computation migration
distributed scheduling
6.1: Data Migration
Data can be either a file or the contents of physical memory.
In data migration, the data is brought by the DOS to the location of the computation that needs access to it.
If the computation updates a set of data, the original location may have to be updated.
In the case of files, a DFS is involved as the component of the DOS that implements a common file system available to the autonomous computers in the system.
Its primary goal is to provide the same functional capability to access files regardless of their location.
If the data accessed is in the physical memory of another system, then a computation's data request is handled by distributed shared memory.
This provides a virtual address space that is shared among all the computers in a DS; the main issues are consistency and delays.
6.2: Computation Migration
In computation migration, the computation migrates to another location.
It may be efficient when information is needed concerning a remote file directory:
it is more efficient to send the message and receive the information back than to transfer the whole directory.
Remote procedure call is commonly used for computation migration.
Normally, only a part of the computation of a process is carried out on a different machine.
6.3: Distributed Scheduling
Processes can be transferred from one computer to another by the DOS.
A process may be executed at a computer different from the one where it originated.
Required when a computer is overloaded or does not have the necessary resources.
Distributed scheduling is responsible for judiciously and transparently distributing processes amongst computers such that overall performance is maximized.
7: Security
The OS is responsible for the security of the computer system.
Two issues must be considered:
Authentication: the process of guaranteeing that an entity is what it claims to be.
Authorization: the process of deciding what privileges an entity has and making only those privileges available.
8: Structuring
1. Monolithic Kernel:
The kernel contains all the services provided by the operating system.
A copy of the huge kernel runs on all the machines of the system.
The limitation of this approach is that most machines will not require most of the services, but the kernel still provides them.
Note: one size fits all (diskless workstations, multiprocessors, and file servers).
2. Collective Kernel Approach:
The operating system is designed as a collection of independent processes.
Each process represents some service, such as distributed scheduling, the distributed file system, etc.
The kernel consists of a nucleus of the operating system, called a microkernel, which is installed on all the machines and provides basic functionality.
The microkernel also provides interaction between services running on different machines.
e.g. Galaxy, V-Kernel.
3. Object-Oriented Kernel:
All services of the operating system are implemented in the form of objects.
Each object encapsulates a data structure and also a set of operations for that data structure.
e.g. Amoeba, CLOUDS.
9. Client-Server Computing Model
Processes are categorized as servers (which provide services) and clients (which need services).
Servers merely respond to the requests of the clients and do not typically initiate conversations with clients.
In the case of multiple servers, the location and the conversation are transparent to the clients.
Clients generally make use of a cache to minimize the frequency of sending data requests to the servers.
A system structured on the client-server model can easily adopt the collective kernel structuring technique.
Communication Primitives
These are the means to send raw bit streams of data in a distributed environment.
Two models are widely accepted for developing a distributed operating system (DOS):
1. Message passing
2. Remote procedure call (RPC)
Note: in a DS, recall the communication paradigms in the architectural model:
inter-process (message passing, socket programming, multicast)
remote invocation (request-reply protocol, RPC, RMI)
indirect communication (group communication, event-based, etc.)
1. Message Passing Model
The message passing model provides two basic communication primitives: SEND and RECEIVE.
The SEND primitive has two parameters: a message and its destination.
The RECEIVE primitive also has two parameters: the source of a message and a buffer for storing the message.
An application of these primitives can be found in the client-server computing model.
The semantics of the SEND and RECEIVE primitives are decided by two design issues, namely:
non-blocking vs. blocking primitives
synchronous vs. asynchronous primitives
Non-Blocking vs. Blocking Primitives
In the standard message passing model, messages are copied three times:
from the user buffer to the kernel buffer;
from the kernel buffer on the sending computer to the kernel buffer on the receiving computer;
from the kernel buffer on the receiving computer to the user buffer.
This is known as the buffered option.
In the unbuffered option, data is copied from one user buffer to another user buffer directly.
[Figure: The buffered option — user a's SEND copies message m into the sending computer's kernel buffer, the message crosses the communication channel into the receiving computer's kernel buffer, and user b's RECEIVE copies it out.]
Non-blocking Primitives:
With non-blocking primitives, the SEND primitive returns control to the user process immediately.
The RECEIVE primitive responds by signaling and provides a buffer into which the message is copied.
The primary advantage is that programs have maximum flexibility to perform computation and communication in any order.
A significant disadvantage of non-blocking primitives is that programming becomes difficult.
A natural use of non-blocking communication occurs in a producer (SEND) - consumer (RECEIVE) relationship.
Blocking Primitives:
The SEND primitive does not return control to the user program
until the message has been sent (an unreliable blocking primitive), or
until an acknowledgment has been received (a reliable blocking primitive).
In both cases the user buffer can be reused.
The RECEIVE primitive does not return control until the message is copied into the buffer.
In the case of a reliable RECEIVE primitive, an acknowledgment is sent automatically;
in the case of an unreliable RECEIVE primitive, no acknowledgment is sent.
The advantage is that the behavior of the program is predictable, which makes programming relatively easy.
The disadvantage is the absence of concurrency between computation and communication.
(See the sketch below contrasting blocking and non-blocking RECEIVE.)
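A small sketch contrasting the two RECEIVE styles, using a queue as a stand-in for the kernel buffer (an assumption of this sketch; real primitives copy between user and kernel buffers over a channel):

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class ReceivePrimitives {
    private final BlockingQueue<byte[]> kernelBuffer = new LinkedBlockingQueue<>();

    // Buffered SEND: copying into the buffer lets the call return quickly.
    public void send(byte[] message) {
        kernelBuffer.offer(message);
    }

    // Blocking RECEIVE: does not return control until a message is available.
    public byte[] receiveBlocking() throws InterruptedException {
        return kernelBuffer.take();
    }

    // Non-blocking RECEIVE: returns immediately; null means "no message yet",
    // so the caller can keep computing and poll (or be signalled) later.
    public byte[] receiveNonBlocking() {
        return kernelBuffer.poll();
    }
}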
Synchronous vs. Asynchronous Primitives
The distinction is based on whether a buffer is used or not.
Both can be combined with blocking or non-blocking primitives.
Synchronous primitives:
A SEND primitive blocks until a corresponding RECEIVE primitive is executed at the receiving computer.
This strategy is referred to as a blocking synchronous primitive, or rendezvous.
With a non-blocking synchronous primitive, the message is first copied to a buffer at the sending side, allowing the process to perform other computation, except another SEND.
Asynchronous primitives:
The messages are buffered.
A SEND primitive does not block even if there is no corresponding execution of a RECEIVE primitive.
The RECEIVE primitive can be either a blocking or a non-blocking primitive.
The main disadvantage is that buffering increases complexity in terms of creating, managing and destroying the buffers.
2. Remote Procedure Call
A more natural way to communicate is through a procedure call:
every language supports it;
its semantics are well defined and understood;
it is natural for programmers to use.
A programmer using the message passing model must handle the following details, which RPC hides:
pairing of responses with request messages
data representation
knowing the address of the remote machine and the server
taking care of communication and system failures
Basic RPC Operation
The RPC mechanism is based on the observation that a procedure call is a well-known mechanism for the transfer of control and data within a program running on a single machine.
On invoking a remote procedure, the calling process is suspended.
Any parameters are passed to the remote machine, where the procedure executes.
On completion, the results are passed back from the server to the client, which resumes execution as if it had called a local procedure.
Design Issues in RPC
The RPC mechanism is based on the concept of stub procedures.
The server writer writes the server and links it with the server-side stubs; the client writer writes the client program and links it with the client-side stub.
The stubs are responsible for managing all details of the remote communication between client and server.
Design Issues (Contd)
Structure
When a program (client) makes a remote procedure call, say p(x,y), it actually makes a local call on a dummy procedure, the client-side stub procedure p.
The client-side stub procedure constructs a message containing the identity of the remote procedure and the parameters, and sends it to the remote machine.
The server-side stub procedure receives the message and makes a local call to the procedure specified in the message.
After execution, control returns to the server stub procedure, which returns control to the client-side stub.
The stub procedures can be generated at compile time or linked at run time.
Binding
Binding is the process that determines the remote procedure, and the machine on which it will be executed.
It may also check the compatibility of the parameters passed and the procedure type called.
A binding server essentially stores the server machines along with the services they provide.
In another approach to binding, the client specifies the machine and the service required, and the binding server returns the port number for communication.
Parameter and Result
Error Handling, Semantics and Correctness
An RPC can fail for at least two reasons:
computer failure
communication failure
The semantics of RPCs are classified as follows:

Semantics \ Execution | Success | Failure  | Partial
At-least-once         | >= 1    | 0 or more | possible
Exactly-once          | 1       | 0 or 1    | possible
At-most-once          | 1       | 0 or 1    | none

Correctness condition: C1 → C2 implies W1 → W2, where Ci denotes an RPC call, Wi the work done on shared data, and → denotes the happened-before relation.
RPC: Other Issues
Implementation issues for the RPC mechanism:
low-latency RPC calls (use UDP)
high-throughput RPC calls (use TCP)
Increasing the concurrency of RPC calls (a plain RPC blocks the calling process) via:
Multi-RPC (invokes one procedure on many servers, but without different parallel procedures)
Parallel RPC (invoking the parallel procedure call executes a procedure in n different address spaces in parallel)
asynchronous calls (avoid blocking, but programming becomes difficult)
Shortcomings of the RPC mechanism:
does not allow for returning incremental results
limits protocol flexibility: remote procedures are not first-class objects (e.g. they cannot be used everywhere local procedures/variables can be used)
Chp 2: Theoretical Foundation
Inherent limitations of a distributed system
Lamport's logical clock and its limitations
Vector clock
Causal ordering of messages
Global state and Chandy-Lamport's recording algorithm
Cuts of a distributed computation
Termination detection
Inherent Limitations of a Dist. System
A DS is a collection of computers that are spatially separated and do not share a common memory.
Processes communicate by exchanging messages over communication channels.
A DS suffers some inherent limitations because of:
lack of a system-wide common clock, i.e. absence of a global clock;
lack of common memory, i.e. absence of shared memory.
1. Absence of a Global Clock
There is no system-wide common clock in a DS.
Solutions could be:
either having one global clock common to all the computers, or
having synchronized clocks, one at each computer.
Both of these solutions are impractical, for the following reasons.
If one global clock is provided in the distributed system:
two processes will observe a global clock value at different instants due to unpredictable delays;
so two processes will falsely perceive two different instants in physical time to be a single instant in physical time.
Continued..
If instead we try to synchronize the clocks of different systems:
these clocks can drift from physical time, and the drift rate may vary from clock to clock due to technological limitations;
this ends up with the same result.
We cannot have a system of perfectly synchronized clocks.
Impact of the absence of global time
Temporal ordering of events is integral to the design and development of a DS.
E.g. an OS is responsible for scheduling processes:
a basic criterion used in scheduling is the temporal order in which requests to execute processes arrive.
Due to the absence of global time, it is difficult to reason about the temporal order of events in a DS.
Hence, algorithms for DSs are more difficult to design and debug.
Also, the up-to-date state of the system is harder to collect.
2. Absence of shared memory
Due to the lack of shared memory, an up-to-date state of the entire system is not available to any individual process.
Such a state is necessary for:
reasoning about the system's behavior,
debugging, and
recovery.
Information exchange is subject to arbitrary network delays.
One process in a DS can get either:
a coherent but partial view, or
an incoherent but complete (global) view of the system.
Coherent means:
all processes make their observations at the same time.
Note: incoherent means that the processes do not all make their observations at the same time.
Complete (or global) includes:
all local views of the state, plus
any messages that are in transit.
It is very difficult for every process to get a complete and coherent view of the global state.
Example: one person has two bank accounts, and is in the process of transferring $50 between the accounts.
Example: coherent but partial
[Figure: Observer S1 records A's local state ($500) and observer S2 records B's local state ($200) at the same time; the communication channel between A and B is not recorded.]
Note:
Coherence requires sequential consistency; in its absence, blocking may occur.
The communication channel cannot record its state by itself; the processes have to keep the record of the communication channels.
Example: incoherent but complete state
[Figure: Snapshots of the two accounts during the $50 transfer: (a) A = $500, B = $200; (b) A = $450, B = $200, with the $50 in transit on the channel; (c) A = $500, B = $250 — local states recorded at different instants, giving a complete but incoherent global view.]
Lamport's Logical Clock: Basic Concepts
Lamport proposed the following scheme to order events in a distributed system using logical clocks.
The execution of a process is characterized by a sequence of events.
Depending on the application,
the execution of a procedure could be one event, or
the execution of an instruction could be one event.
When processes exchange messages,
sending a message constitutes one event,
and receiving a message constitutes one event.
Logical Clock: Basic Concept
Time:
is one-dimensional;
cannot move backward;
cannot stop;
is derived from the concept of the order in which events occur.
The concepts of before and after need to be reconsidered in a distributed system.
Lamport's Logical Clock:
Due to the absence of perfectly synchronized clocks and global time in distributed systems, the order in which two events occur at two different computers cannot be determined based on the local times at which they occur.
However, under certain conditions, it is possible to ascertain the order in which two events occur based solely on the behavior exhibited by the underlying computation.
The happened-before relation (→) captures the causal dependencies between events, i.e., whether two events are causally related or not. The relation is defined on the following slide.
Happened-before relationship
a → b, if a and b are events in the same process and a occurred before b (one process).
a → b, if a is the event of sending a message m in a process and b is the event of receipt of the same message m by another process (two processes).
If a → b and b → c, then a → c, i.e., the → relation is transitive (more than two processes).
In distributed systems, processes interact with each other and affect the outcome of events in other processes.
Being able to ascertain the order between events is very important for designing, debugging, and understanding the sequence of execution in a distributed computation.
Lamport's Logical Clock: Events
In general, an event changes the system state, which in turn influences the occurrence and outcome of future events.
Past events influence future events, and this influence among causally related events (those events that can be ordered by →) is referred to as causally affects.
Lamport's Logical Clock: Events
CAUSALLY RELATED EVENTS: event a causally affects event b if a → b.
CONCURRENT EVENTS: two distinct events a and b are said to be concurrent (denoted a || b) if neither a → b nor b → a.
For any two events a and b in a system, either a → b, b → a, or a || b.
Lamport's Logical Clock: Events
[Figure: Space-time diagram against global time — process P1 with events e11, e12, e13, e14 and process P2 with events e21, e22, e23, e24, with message arrows between them; e22 and e14 are causally related via a message path, while e11 || e21.]
System of Logical Clocks
In order to realize the relation →, Lamport introduced the following system of logical clocks.
There is a clock Ci at each process Pi in the system.
The clock Ci can be thought of as a function that assigns a number Ci(a) to any event a at Pi, called the timestamp of event a.
The numbers assigned by the system of clocks have no relation to physical time, hence the name logical clocks.
The logical clocks take monotonically increasing values. These clocks can be implemented by counters. Typically, the timestamp of an event is the value of the clock when the event occurs.
Lamport's Logical Clock: Conditions
For any events a and b: if a → b, then C(a) < C(b).
The happened-before relation can now be realized using the logical clocks if the following two conditions are met:
[C1] For any two events a and b in a process Pi, if a occurs before b, then Ci(a) < Ci(b).
[C2] If a is the event of sending a message m in process Pi and b is the event of receiving the same message m at process Pj, then Ci(a) < Cj(b).
The following implementation rules (IR) for the clocks guarantee that the clocks satisfy the correctness conditions C1 and C2:
Lamport's Logical Clock: Implementation Rules
[IR1] Clock Ci is incremented between any two successive events in process Pi:
Ci := Ci + d (d > 0)
If a and b are two successive events in Pi and a → b, then Ci(b) = Ci(a) + d.
[IR2] If event a is the sending of message m by process Pi, then message m is assigned a timestamp tm = Ci(a)
(note that the value of Ci(a) is obtained after applying rule IR1).
On receiving the same message m, process Pj sets Cj to a value greater than or equal to its present value and greater than tm:
Cj := max(Cj, tm + d) (d > 0)
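Rules IR1 and IR2 translate directly into code. Below is a minimal sketch with d = 1; the class name and the way timestamps ride on messages are assumptions for illustration.

public class LamportClock {
    private long c = 0;   // the clock Ci of process Pi

    // IR1: increment the clock between any two successive local events.
    public long tick() {
        return ++c;
    }

    // IR2 (send): the outgoing message m is assigned the timestamp tm = Ci(a),
    // obtained after applying IR1.
    public long timestampForSend() {
        return tick();
    }

    // IR2 (receive): Cj := max(Cj, tm + d), so the receive event is stamped
    // with a value greater than both the local clock and tm.
    public long onReceive(long tm) {
        c = Math.max(c, tm + 1);
        return c;
    }
}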
Lamport's Logical Clock: How does it advance?
[Figure: P1's events e11-e17 take clock values (1)-(7); P2's events e21-e25 take values (1), (2), (3), (4), (7). On each message receipt the receiving clock jumps, e.g. e15 takes max(4+1, 2+1) = 5, e17 takes max(6+1, 4+1) = 7 and e25 takes max(4+1, 6+1) = 7.]
Lamport's Logical Clock: How does it advance?
Lamport's happened-before relation (→) defines an irreflexive partial order among the events.
The set of all the events in a distributed computation can be totally ordered (the ordering relation is denoted by =>) using the above system of clocks, as follows:
if a is any event at process Pi and b is any event at process Pj, then a => b if and only if either
Ci(a) < Cj(b), or
Ci(a) = Cj(b) and Pi < Pj, where < is any arbitrary relation that totally orders the processes to break ties.
A simple way to implement < is to assign unique identification numbers to each process; then Pi < Pj if i < j.
Partial ordering: only causal events are sequenced. Total ordering: all events are sequenced.
Partial Order Example
[Figure: a space-time diagram with clock values attached to events a, b, c, e, f, i on three processes p0, p1, p2.]
a → b: C0(a) = 1 < 2 = C0(b)
f → i: C1(f) = 4 < 5 = C2(i)
a → e: C0(a) = 1 < 3 = C2(e)
etc.
Getting a Total Order
If a total order is required, break ties using process ids.
In the example, C0(a) = (1,0), C1(c) = (1,1), etc.
Then C0(a) < C1(c).
Drawback of Logical Clocks
a → b implies C(a) < C(b), but C(a) < C(b) does not necessarily imply a → b.
In the previous example, C(g) = 1 and C(b) = 2, but g does not happen before b.
The reason is that "happens before" is a partial order, but logical clock values are integers, which are totally ordered.
Lamport's Logical Clock: Limitations
In Lamport's system of logical clocks, if a → b then C(a) < C(b).
But the reverse is not necessarily true if the events have occurred in different processes:
if a and b are events in different processes and C(a) < C(b), then a → b is not necessarily true; events a and b may or may not be causally related.
So Lamport's system of clocks is not powerful enough to capture such situations.
Lamport's Logical Clock: Limitations
[Figure: Three processes — P1 with e11 (1), e12 (2); P2 with e21 (1), e22 (3); P3 with e31 (1), e32 (2), e33 (3) — against global time. Clearly C(e11) < C(e22) and C(e11) < C(e32); e11 and e22 are causally related on the basis of a message path, but e11 and e32 are not.]
Note: if a and b are events in different processes and C(a) < C(b), then a → b is not necessarily true; events a and b may or may not be causally related.
Vector Clocks
Generalize logical clocks to provide non-causality information as well as causality information.
Implemented with values drawn from a partially ordered set instead of a totally ordered set.
Assign a value V(e) to each computation event e in an execution such that a → b if and only if V(a) < V(b).
Vector Timestamps Algorithm
Each pi keeps an n-vector Vi, initially all 0's.
Entry j in Vi is pi's estimate of how many steps pj has taken.
Every message pi sends is timestamped with the current value of Vi.
At every step, increment Vi[i] by 1.
When receiving a message with vector timestamp T, update Vi's components j ≠ i so that Vi[j] = max(T[j], Vi[j]).
If a is an event at pi, then assign V(a) to be the value of Vi at the end of a.
(A sketch of this algorithm follows.)
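A minimal sketch of this algorithm for process pi is given below; n (the number of processes) and the way timestamps accompany messages are assumptions of the sketch.

import java.util.Arrays;

public class VectorClock {
    private final int[] v;   // Vi: entry j estimates how many steps pj has taken
    private final int i;     // index of the local process pi

    public VectorClock(int n, int i) {
        this.v = new int[n];   // initially all 0's
        this.i = i;
    }

    // At every step (including sends), increment Vi[i] by 1.
    public int[] step() {
        v[i]++;
        return timestamp();
    }

    // On receiving a message with vector timestamp t, set Vi[j] = max(t[j], Vi[j])
    // for all j != i, then count the receive itself as a local step.
    public int[] onReceive(int[] t) {
        for (int j = 0; j < v.length; j++) {
            if (j != i) v[j] = Math.max(t[j], v[j]);
        }
        return step();
    }

    public int[] timestamp() {
        return Arrays.copyOf(v, v.length);   // value of Vi at the end of an event
    }
}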
Manipulating Vector Timestamps
Let V and W be two n-vectors of integers.
Equality: V = W iff V[i] = W[i] for all i. Example: (3,2,4) = (3,2,4).
Less than or equal: V ≤ W iff V[i] ≤ W[i] for all i. Example: (2,2,3) ≤ (3,2,4) and (3,2,4) ≤ (3,2,4).
Less than: V < W iff V ≤ W but V ≠ W. Example: (2,2,3) < (3,2,4).
Incomparable: V || W iff !(V ≤ W) and !(W ≤ V). Example: (3,2,4) || (4,1,4).
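These comparisons are easy to implement; here is a small helper sketch on plain int[] vectors:

import java.util.Arrays;

public final class VectorOrder {
    // V <= W iff V[i] <= W[i] for all i.
    public static boolean lessOrEqual(int[] v, int[] w) {
        for (int i = 0; i < v.length; i++) {
            if (v[i] > w[i]) return false;
        }
        return true;
    }

    // V < W iff V <= W but V != W.
    public static boolean lessThan(int[] v, int[] w) {
        return lessOrEqual(v, w) && !Arrays.equals(v, w);
    }

    // Incomparable vectors correspond to concurrent events (a || b).
    public static boolean incomparable(int[] v, int[] w) {
        return !lessOrEqual(v, w) && !lessOrEqual(w, v);
    }

    public static void main(String[] args) {
        int[] a = {3, 2, 4}, b = {4, 1, 4};
        System.out.println(incomparable(a, b));   // true: (3,2,4) || (4,1,4)
    }
}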
Vector Clock example
[Figure: Three processes against global time — P1: e11 (1,0,0), e12 (2,0,0), e13 (3,4,1); P2: e21 (0,1,0), e22 (2,2,0), e23 (2,3,1), e24 (2,4,1); P3: e31 (0,0,1), e32 (0,0,2).]
V(e31) = (0,0,1) and V(e12) = (2,0,0), which are incomparable.
Compare with logical clocks: C(e31) = 1 and C(e21) = 2.
Vector timestamps implement vector clocks: a → b implies V(a) < V(b).
Causal Ordering of Messages
If M1 is sent before M2, then every recipient of both messages must get M1 before M2.
This is not guaranteed by the communication network, since M1 may be from P1 to P2 and M2 may be from P3 to P4.
Consider a replicated database system: updates to the entries should be received in order!
Basic idea for message ordering:
deliver a message only if the preceding one has already been delivered;
otherwise, buffer it up, i.e. buffer a later message.
Violation of Causal Ordering of Messages
[Figure: space-time diagram of P1, P2, P3 in which Send(M1) causally precedes Send(M2), yet a recipient gets M2 before M1]
271
Causal Ordering of Messages
e.g.: send(M1) → send(M2) implies receive(M1) → receive(M2)
[Figure: M1 is sent with timestamp (0,0,1) and M2 with timestamp (0,1,1); P1 buffers M2 while at (0,0,1) and delivers it from the buffer only after M1, reaching (0,1,1)]
Note: this diagram uses message timestamps only (no vector clock increments for internal events) to show that M1 (0,0,1) is delivered before M2 (0,1,1) at P1.
272
PROTOCOLS
1. Birman-Schiper-Stephenson Protocol
2. Schiper-Eggli-Sandoz Protocol
273
1.Birman-Schiper-Stephenson
Protocol
BSS: Birman-Schiper-Stephenson Protocol
Broadcast based: a message sent is received by
all other processes.
Deliver a message to a process only if the message immediately preceding it has been delivered to the process.
Otherwise, buffer the message.
Accomplished by using a vector accompanying
the message.
274
Birman-Schiper-Stephenson
Protocol
Pi stamps each message m it sends with a vector time VTm.
Pj, upon receiving message m with timestamp VTm from Pi, buffers it until:
VTpj[i] = VTm[i] − 1, and
for all k ≠ i: VTpj[k] ≥ VTm[k].
When Pj delivers message m, it updates VTpj.
275
BSS Algorithm ...
1. Process Pi increments the vector time VTpi[i], timestamps, and broadcasts the message m. VTpi[i] − 1 denotes the number of messages preceding m.
2. Pj ≠ Pi receives m. m is delivered when:
a. VTpj[i] == VTm[i] − 1 [Pj has received all messages from Pi before m]
b. VTpj[k] >= VTm[k] for all k in {1,2,..,n} − {i}, where n is the total number of processes. Delayed messages are queued in sorted order. [Pj has received all those messages received by Pi before m]
c. Concurrent messages are ordered by time of receipt.
3. When m is delivered at Pj, VTpj is updated according to Rule 2 of vector clocks.
2(a): Pj has received all of Pi's messages preceding m.
2(b): Pj has received all other messages received by Pi before sending m.
276
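A minimal Python sketch of the BSS delivery test and delivery update (the data layout is an assumption for illustration):

    def bss_deliverable(vt_j, vt_m, i):
        # Rule 2(a): Pj has received all of Pi's messages preceding m.
        if vt_j[i] != vt_m[i] - 1:
            return False
        # Rule 2(b): Pj has received everything Pi had received before m.
        return all(vt_j[k] >= vt_m[k] for k in range(len(vt_j)) if k != i)

    def bss_deliver(vt_j, vt_m):
        # Rule 2 of vector clocks: component-wise maximum on delivery.
        for k in range(len(vt_j)):
            vt_j[k] = max(vt_j[k], vt_m[k])

A receiver keeps a buffer of undeliverable messages and re-runs the test whenever a delivery makes progress.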
BSS Algorithm
e.g. 1
[Figure: P3 broadcasts M1 with timestamp (0,0,1) and P2 broadcasts M2 with timestamp (0,2,1); vector times pass through (1,0,1) and (2,2,1) at P1, (0,1,1) and (0,2,1) at P2, (0,0,1) and (0,2,2) at P3]
277
Implementing causal order using Vector Clocks in BSS
e.g. 2
[Figure: application processes P1, P2, P3 above a message service; vectors shown include 1,0,0 and 2,2,0 at P1; 1,1,0 and 1,2,0 at P2; and 0,0,0 then 1,0,1 and 1,2,2 at P3]
P3's vector is at (0,0,0) and a message with timestamp (1,2,0) arrives from P2,
i.e. P2 has received a message from P1 that P3 hasn't seen.
More detail of P3's message service:
receiver vector | sender | sender vector | decision | new receiver vector
0,0,0 | P2 | 1,2,0 | buffer | 0,0,0 (P3 is missing a message from P1 that sender P2 has already received)
0,0,0 | P1 | 1,0,0 | deliver | 1,0,1
1,0,1 | P2 | 1,2,0 | deliver | 1,2,2
In each case: do the sender and receiver agree on the state of all other processes? If the sender has a higher state value for any of these others, the message is buffered.
278
2. SES Protocol
SES: Schiper-Eggli-Sandoz Algorithm.
No need for broadcast messages.
Each process maintains a vector V_P of size N − 1, where N is the number of processes in the system.
V_P is a vector of tuples (P, t): P is a destination process id and t a vector timestamp.
E.g. V_P: (P2, <1,1,0>)
Initially, V_P is empty.
Tm: logical time of sending message m.
Tpi: present logical time at Pi.
279
SES Algorithm
Sending a Message:
Send message M, timestamped tm, along with V_P1 to P2.
Insert (P2, tm) into V_P1, overwriting the previous value of (P2, t), if any.
(P2, tm) is not sent with M. Any future message carrying (P2, tm) in its vector cannot be delivered to P2 until tm < Tp2.
Delivering a message:
If V_M (the vector arriving with the message) does not contain any pair (P2, t), the message can be delivered.
/* (P2, t) exists */ If t > Tp2, buffer the message (don't deliver);
else (t < Tp2) deliver it.
280
SES Buffering Example
[Figure: P2 sends M1 to P1 and M2 to P3; P3 then sends M3 to P1. Tp1: (1,1,0), (2,2,2); Tp2: (0,1,0), (0,2,0); Tp3: (0,2,1), (0,2,2). V_P2 goes from empty to (P1, <0,1,0>); V_P3: (P1, <0,1,0>)]
281
SES Buffering Example...
M1 from P2 to P1: M1 + Tm (= <0,1,0>) + empty V_P2
M2 from P2 to P3: M2 + Tm (= <0,2,0>) + (P1, <0,1,0>)
M3 from P3 to P1: M3 + Tm (= <0,2,2>) + (P1, <0,1,0>)
M3 gets buffered because:
Tp1 is <0,0,0>, t in (P1, t) is <0,1,0>, and so Tp1 < t.
When M1 is received by P1:
Tp1 becomes <1,1,0>, by rules 1 and 2 of vector clocks.
After updating Tp1, P1 checks the buffered M3.
Now Tp1 > t [in (P1, <0,1,0>)], so M3 is delivered.
282
SES Algorithm ...
On delivering the message:
Merge V_M (in the message) with V_P2 as follows:
if (P, t) is not in V_P2, insert it;
if (P, t) is present in V_P2, update t to max(t in V_M, t in V_P2).
(Recall: a message cannot be delivered while the t it carries for the receiver is greater than the receiver's current clock.)
Update site P2's local logical clock.
Check buffered messages after the local clock update.
283
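A minimal Python sketch of the SES delivery check and merge; the dictionary encoding of V_P and the function names are assumptions for illustration:

    def ses_can_deliver(v_m, dest, t_dest):
        # v_m: {process id: vector timestamp} carried by the message.
        # Deliverable if v_m carries no constraint on dest, or the
        # constraint is already covered by the receiver's clock t_dest.
        if dest not in v_m:
            return True
        return all(a <= b for a, b in zip(v_m[dest], t_dest))

    def ses_merge(v_m, v_p):
        # On delivery: merge the message's vector V_M into local V_P.
        for proc, t in v_m.items():
            if proc in v_p:
                v_p[proc] = [max(a, b) for a, b in zip(v_p[proc], t)]
            else:
                v_p[proc] = list(t)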
Global state
Global state of a distributed system
Local state of each process
Messages sent but not received (state of the
queues)
Many applications need to know the state of the
system
Failure recovery, distributed deadlock detection
Problem: how to figure out the state of a
distributed system?
Each process is independent
No global clock or synchronization
284
Global State
Due to the absence of a global clock, states are recorded at different times.
For global consistency, the state of a communication channel should be the sequence of messages sent before the sender's state was recorded, excluding the messages received before the receiver's state was recorded.
Local states are defined in the context of an application:
a send is part of the local state if it happened before the state was recorded.
285
A message causes an inconsistency if its receipt is recorded but its send is not.
A collection of local states forms a global state.
The global state is consistent iff there are no pairwise inconsistencies between local states.
A message is in transit when it has been sent but not yet received.
The global state is transitless iff no pair of local states has a message in transit.
Transitless + Consistent ⇒ Strongly Consistent State
286
Global State:
1. GS = {LS1, LS2, LS3}
2. {LS11, LS22, LS32} is an inconsistent GS: the receipt of M2 is recorded but its send is not.
3. {LS12, LS23, LS33} is a consistent GS: every recorded receive has its send recorded, so inconsistency is avoided.
4. {LS11, LS21, LS31} is a strongly consistent GS.
[Figure: sites S1, S2, S3 with local states LS11–LS12, LS21–LS23, LS31–LS33 and messages M1, M2, M3]
287
Chandy-Lamport Global State Recording (GSR) Algorithm
The idea behind this algorithm is that we can record a
consistent state of the global system if we know that
all messages that have been sent by one process have
been received by another.
This is accomplished by the use of a Marker which
traverses the distributed system across all channels.
The Marker, in turn, causes each process to record a
snapshot of itself and, eventually, of the entire system.
As long as the Marker can traverse the entire network
in finite time, the algorithm works.
The primary benefit of the global state recording algorithm is the ability to detect a stable property of the distributed system, such as deadlock or termination.
288
Assumptions of C-L GSRA
There are a finite number of processes
and communications channels.
Communication channels have infinite
buffers that are error free.
Messages on a channel are received in
the same order as they are sent.
Processes in the distributed system do
not share memory or clocks.
289
C-L GSR Algorithm
Sender. (Process p).
1.1] Record the state of (p).
1.2] For each outgoing channel (c) incident to (p), send
a marker before sending ANY other messages.
Receiver (Process q receives marker on channel
c1).
2.1] If (q) has not yet recorded its state.
Record the state of (q).
Record the state of (c1) as empty.
For each outgoing channel (c) incident to (q), send a marker
before sending ANY other messages.
2.2] If (q) has already recorded its state.
Record the state of (c1) as all messages received since the
last time the state of (q) was recorded.
Algorithm terminates when every process has
received a marker from every other process
290
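A minimal Python sketch of the marker rules at one process; the message plumbing and names are assumptions for illustration:

    class SnapshotProcess:
        def __init__(self, in_channels, out_channels, send):
            self.state = None              # recorded local state
            self.chan = {}                 # in-channel -> recorded messages
            self.recording = set()         # in-channels still being recorded
            self.inc, self.outc = in_channels, out_channels
            self.send = send               # send(channel, message)

        def record(self, local_state):     # sender rule 1.1/1.2
            self.state = local_state
            for c in self.outc:            # marker before ANY other message
                self.send(c, "MARKER")
            self.recording = set(self.inc)

        def on_marker(self, c, local_state):
            if self.state is None:         # rule 2.1: first marker seen
                self.record(local_state)
                self.chan[c] = []          # channel c recorded as empty
            self.recording.discard(c)      # rule 2.2: stop recording on c

        def on_message(self, c, msg):      # ordinary message on channel c
            if self.state is not None and c in self.recording:
                self.chan.setdefault(c, []).append(msg)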
Pictured below is a system with three nodes.
All three processes begin with $500 and the channels are
empty.
Therefore the total amount of dollars in the system, $1500, is a stable property.
291
Technical requirement:
[Figure: an initiator p with channels c1–c4 connecting processes p, q and r; markers (x) and a checkpoint are shown on the channels]
292
Step 1:
Process p sends $10 to process q and then decides to initiate the global state recording algorithm (sender rule):
p records its current state ($490) and sends out a marker along channel c1.
Meanwhile, process q has sent $20 to p along channel c2, and q has sent $10 to r along channel c3.
293
Step 2:
Process q receives the $10 transfer (increasing its value to $480) and then receives the marker on channel c1 (receiver rule 2.1).
Because it received a marker, process q records its state as $480 and then sends markers along each of its outgoing channels c2 and c3.
Meanwhile, process r has sent $25 along c4.
Note that it does not matter whether r sent the message before or after q received the marker.
294
Step 3:
Process r receives the $10 transfer and the marker from channel c3 (receiver rule 2.1).
Therefore, r updates its value to $485 and records this state.
Process r also sends a marker on its outgoing channel, c4.
Meanwhile, process p has sent another $20 to process q along channel c1.
295
Step 4:
Process q receives the $20 transfer on channel c1 and updates its value to $500.
No marker has yet been found on that channel; notice that process q does not change its recorded state value.
Also, process p receives the $20 transfer on channel c2.
Process p records the $20 transfer as part of its recorded state (receiver rule 2.2), because it received this after the state recording algorithm had begun and the marker on that channel had not yet been received.
Process p then receives the marker on channel c2 and can stop recording any further messages on that channel.
[Figure note: at p, the old recorded state plus the $20 recorded on channel c2 is shown as 470 + 20 = 490]
296
Step 5:
Process p receives the $25 on channel c4.
p adds this to its recorded state and also changes its current value from $490 to $515.
When process p receives the marker on channel c4, the state recording is complete, because all processes have received markers on all of their input channels.
The final recorded state is shown in the table below.
[Figure note: previous recorded state, 490 + 25 = 515]
297
Snapshot Example
[Figure: space-time diagram of P1, P2, P3 with events, messages M, and a consistent cut]
1- P1 initiates snapshot: records its state (S1); sends Markers to P2 & P3;
turns on recording for channels C21 and C31
2- P2 receives Marker over C12, records its state (S2), sets state(C12) = {}
sends Marker to P1 & P3; turns on recording for channel C32
3- P1 receives Marker over C21, sets state(C21) = {a}
4- P3 receives Marker over C13, records its state (S3), sets state(C13) = {}
sends Marker to P1 & P2; turns on recording for channel C23
5- P2 receives Marker over C32, sets state(C32) = {b}
6- P3 receives Marker over C23, sets state(C23) = {}
7- P1 receives Marker over C31, sets state(C31) = {}
Consistent Cut = a time-cut across processors and channels such that no event after the cut happens-before an event before the cut
298
Notable points of C-L GSRA
The recorded global state may not be the same as any actual state, but it is equivalent to (reachable from) one and is consistent.
If Sinit and Sfinal are the global states when Chandy-Lamport's algorithm started and finished respectively, and S* is the state recorded by the algorithm, then:
S* is reachable from Sinit
Sfinal is reachable from S*
Specifically, one can show that there exists a computation seq' that is a permutation of seq, such that Sinit, S* and Sfinal all occur as global states in seq':
Sinit occurs earlier than S*: executing a prefix of seq' reaches S*
S* occurs earlier than Sfinal in seq': executing the rest of the actions reaches Sfinal
299
Stability Detection
The reachability property of the snapshot
algorithm is useful for detecting stable properties.
If a stable predicate is true in the state Ssnap then
we may conclude that the predicate is true in the
state Sfinal
Similarly if the predicate evaluates to False for
Ssnap, then it must also be False for Sinit.
300
Cut:
Cuts: a graphical representation of a global state.
Cut C = {c1, c2, .., cn} where ci is the cut event at site Si.
Theorem: A cut C is consistent iff no two cut events are causally related.
[Figure: sites S1, S2, S3 with cut events c1, c2, c3 and messages M1, M2, M3]
301
Time of a Cut
Let C = {c1, c2, .., cn} be a cut where ci is the cut event at site Si with vector timestamp VTci.
Vector time of the cut: VTc = sup(VTc1, VTc2, .., VTcn).
sup is the component-wise maximum, i.e., VTc[i] = max(VTc1[i], VTc2[i], .., VTcn[i]).
For consistency: no message is sent after one cut event and received before another cut event.
Theorem: A cut is consistent iff VTc = (VTc1[1], VTc2[2], .., VTcn[n]).
302
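The theorem gives a one-line consistency test; a Python sketch (the list-of-vectors representation is assumed for illustration):

    def cut_time(vts):                       # sup: component-wise maximum
        return [max(col) for col in zip(*vts)]

    def is_consistent(vts):                  # vts[i]: vector time of ci
        return cut_time(vts) == [vts[i][i] for i in range(len(vts))]

    assert is_consistent([[2, 0, 0], [1, 3, 0], [1, 2, 2]])
    assert not is_consistent([[2, 0, 0], [3, 3, 0], [1, 2, 2]])

In the second case, site 2's cut event has seen 3 events of site 1 while site 1's cut event records only 2, i.e. a message crossed the cut from right to left.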
Cut: consists of an event from each process
[Figure: space-time diagram of p0, p1, p2 with one cut event per process]
303
Cut: the vector time of the cut
[Figure: the same diagram annotated with the vector time of the cut]
304
Consistent cut: no messages cross the cut
[Figure]
305
Consistent cut: messages may cross from left to right of the cut
[Figure]
306
Inconsistent cut: messages cross from right to left of the cut
[Figure]
307
Inconsistent cut: messages cross from right to left of the cut
[Figure]
308
Termination Detection
Termination: completion of the sequence of algorithm steps (e.g., leader election, deadlock detection, deadlock resolution).
System Model:
processes can be active or idle (passive)
only active processes send messages
an idle process can become active on receiving a computation message
an active process can become idle at any time
Termination: all processes are idle and no computation message is in transit
A global snapshot can also be used to detect termination
309
Termination Detection
Use a controlling agent or a monitor process.
Initially, all processes are idle. The weight of the controlling agent is 1 (0 for all others).
Start of computation: a message from the controller to a process; the weight is split in half (0.5 each).
Repeat: any time a process sends a computation message to another process, it splits its weight between the two processes (e.g., 0.25 each the next time).
End of computation: a process sends its weight to the controller, which adds it to its own. (The sending process's weight becomes 0.)
Rule: the sum of all weights is always 1.
Termination: when the weight of the controller becomes 1 again.
310
Huang's Algorithm
B(DW): computation message, where DW is the weight carried.
C(DW): control / end-of-computation message.
Rule 1: Before sending B, compute W1, W2 such that W1 + W2 = W (the weight of the process). Send B(W2) to Pi and set W = W1.
Rule 2: On receiving B(DW), a process with weight W sets W = W + DW and becomes active.
Rule 3: On going from active to idle, send C(W) to the controlling agent and set W = 0.
Rule 4: On receiving C(DW), the controlling agent sets W = W + DW.
If W == 1, the computation has terminated.
311
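A minimal Python sketch of the four rules; exact rationals avoid floating-point loss (the class structure is an assumption for illustration):

    from fractions import Fraction

    class HuangProcess:
        def __init__(self, w=Fraction(0)):
            self.w, self.active = w, False

        def send_B(self):                  # Rule 1: split W into W1 + W2
            dw = self.w / 2
            self.w -= dw
            return dw                      # carried as B(DW)

        def recv_B(self, dw):              # Rule 2
            self.w += dw
            self.active = True

        def go_idle(self):                 # Rule 3: send C(W), W = 0
            dw, self.w, self.active = self.w, Fraction(0), False
            return dw

    class Controller(HuangProcess):
        def recv_C(self, dw):              # Rule 4
            self.w += dw
            return self.w == 1             # True => computation terminated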
Huang's Algorithm
Example: P1 (weight 1) sends to P2 and then to P3; P3 in turn sends to P4 and then to P5.
[Figure: weights after the successive splits: P1 = 1/4, P2 = 1/2, P3 = 1/16, P4 = 1/8, P5 = 1/16; the weights sum to 1]
312
Unit II : Chapter 3
Distributed Mutual Exclusion
Introduction: Mutual Exclusion
Non-Token based Algorithms - Lamport's Algorithm
Token based Algorithms - Suzuki-Kasami's Broadcast Algorithm
Consensus and related problems
313
Distributed Mutual Exclusion:
Introduction
In the problem of mutual exclusion, concurrent
access to a shared resource by several
uncoordinated user-requests is serialized to secure
the integrity of the shared resource.
It requires that the actions performed by a user on
a shared resource must be atomic (one at a time).
For correctness, it is necessary that the shared
resource be accessed by a single site (or process)
at a time.
Mutual exclusion is a fundamental issue in the
design of distributed systems.
An efficient and robust technique for mutual
exclusion is essential to the viable design of
distributed systems.
314
Single-Computer vs. Distributed
System
In single-computer systems, the status of
a shared resource and the status of users
is readily available in the shared memory,
and solutions to the mutual exclusion
problem can be easily implemented using
shared variables (e.g., semaphores)
However, in distributed systems, both the
shared resources and the users may be
distributed and shared memory does not
exist
Consequently, approaches based on shared variables are not applicable to distributed systems, and approaches based on message passing must be used.
315
DME Algorithms:
Classification
Mutual exclusion algorithms can be grouped into 2 classes.
1.The algorithms in the first class are nontoken-based
2. The algorithms in the second class are token-based
The execution of DME algorithms centers on the critical section (CS).
A critical section is the code segment in a process in which a shared resource is accessed.
316
1. The algorithms in the first class are nontoken-based:
These algorithms require two (e.g. REQUEST and REPLY) or more (e.g. RELEASE) successive rounds of message exchanges among the sites.
These algorithms are assertion (criteria) based, because a site can enter its critical section (CS) when an assertion defined on its local variables becomes true.
Mutual exclusion is enforced because the assertion becomes true at only one site at any given time.
317
2.The algorithms in the second class are
token-based:
In these algorithms, a unique token (also
known as the PRIVILEGE message) is
shared among the sites.
A site is allowed to enter its CS if it
possesses the token and it continues to
hold the token until the execution of the
CS is over.
These algorithms essentially differ in the
way a site carries out the search for the
token.
318
System Model
At any instant, a site may have several requests for
CS
A site queues up these requests and serves them
one at a time
A site can be in one of the following three states:
requesting CS: the site is blocked and cannot make further requests for the CS
executing CS: the site performs the defined task in the CS
neither requesting nor executing CS (i.e., idle): the site is executing outside its CS
In token-based algorithms, a site can also be in a state where it holds the token but is executing outside the CS.
319
DME: 5 Requirements
1.Maintain mutual exclusion: To
guarantee that only one request
accesses the CS at a time.
2.Freedom from Deadlocks. Two or
more sites should not endlessly wait for
messages that will never arrive.
3. Freedom from starvation. A site should not be forced to wait indefinitely to execute the CS while other sites are repeatedly executing the CS. That is, every requesting site should get to execute the CS in finite time.
320
Requirements conti.
4. Fairness. Fairness dictates that requests must be executed in the order they are made (or the order in which they arrive in the system). Since a physical global clock does not exist, time is determined by logical clocks. Note that fairness implies freedom from starvation, but not vice versa.
5.Fault Tolerance. A mutual exclusion algorithm
is fault-tolerant if in the wake of a failure, it can
reorganize itself so that it continues to function
without any (prolonged) disruptions.
321
DME: 4 Performance Metrics
The performance of mutual exclusion algorithms is generally measured by the following four metrics:
1. The number of messages necessary per CS invocation.
2. The synchronization delay: the time required after a site leaves the CS and before the next site enters the CS.
[Figure: timeline from "last site exits the CS" to "next site enters the CS"; this interval is the synchronization delay]
322
3. The response time: the time interval a request waits for its CS execution to be over after its request messages have been sent.
[Figure: timeline showing request messages sent, request arrives, the site enters the CS, the site exits the CS; the CS execution time and response time intervals are marked]
4. The system throughput: the rate at which the system executes requests for the CS.
system throughput = 1 / (sd + E)
where sd is the synchronization delay and E is the average critical section execution time.
323
Performance Measuring Parameters
1. Low and high LOAD performance
Low load condition: there is seldom more than one simultaneous request in the system.
High load condition: there is always a pending request at a site.
2. Best and worst CASE performance
Best case: reflects the best possible value of the response time.
Worst case: normally coincides with the best-case value in DME algorithms.
Note: when performance values fluctuate, we consider the average case.
324
DME: A Simple Solution
In a simple solution to distributed mutual
exclusion, a site, called the control site,
is assigned the task of granting permission
for the CS execution.
To request the CS, a site sends a REQUEST
message to the control site.
The control site queues up the requests
for the CS and grants them permission,
one by one
This method to achieve mutual exclusion
in distributed systems requires only three
messages per CS execution.
325
Non token-based algorithms
A site communicates with a set of other sites to arbitrate who should execute the CS next.
For a site Si, the request set Ri contains the ids of all sites from which Si must acquire permission to enter the CS.
Timestamps are used to order requests for the CS, which also helps in resolving conflicts.
Generally, a smaller-timestamp request has priority over larger-timestamp requests.
Logical clocks are maintained and updated by Lamport's scheme.
Depending upon the way a site carries out its assertions, there are numerous non token-based algorithms:
Lamport's algorithm
Ricart-Agrawala algorithm
Maekawa's algorithm
326
Lamport's Algorithm
Lamport proposed a DME algorithm based on his clock synchronization scheme.
In Lamport's algorithm, every site Si keeps a queue, request_queuei, which contains mutual exclusion requests ordered by their timestamps.
The algorithm requires messages to be delivered in FIFO order between every pair of sites.
327
DME: Lamport's Algorithm
Requesting the CS:
1. When a site Si wants to enter the CS, it sends a REQUEST(tsi, i) message to all the sites in its request set Ri and places the request on request_queuei.
Note: (tsi, i) is the timestamp of the request.
[Figure: sites S1 and S2 make REQUESTs for the CS with timestamps (2,1) and (1,2)]
328
DME: Lamport's Algorithm
Requesting the CS:
2. When a site Sj receives the REQUEST(tsi, i) message from site Si, it returns a timestamped REPLY message to Si and places site Si's request on request_queuej.
Note: (i) the lower-timestamp request has priority for the CS in the queue;
(ii) queues are kept ordered by timestamp.
[Figure: sites S1 and S2 exchange REPLY messages; every site's request_queue now holds both requests (2,1) and (1,2)]
329
DME: Lamport's Algorithm
Executing the CS:
Site Si enters the CS when the following two conditions hold:
L1: Si has received a message with timestamp larger than (tsi, i) from all other sites.
L2: Si's request is at the top of request_queuei.
[Figure: S2 enters the CS, since its request (1,2) is at the top of every queue]
330
DME: Lamport's Algorithm
Releasing the CS:
Site Si, upon exiting the CS, removes its request from the top of its request queue and sends a timestamped RELEASE message to all the sites in its request set.
When a site Sj receives a RELEASE message from site Si, it removes Si's request from its request queue.
[Figure: S2 exits the CS and sends RELEASE; (1,2) is removed from every queue, leaving (2,1); S1 then enters the CS]
331
Another Example
[Figure: request queues ordered by timestamp]
Note: requests are served in timestamp order, so (1,B) is executed first.
332
DME: Lamport's Algorithm
Correctness:
Lamport's algorithm achieves mutual exclusion.
Performance:
Requires 3(N−1) messages per CS invocation: (N−1) REQUEST, (N−1) REPLY, and (N−1) RELEASE messages. The synchronization delay is T.
Optimization:
Can be optimized to require between 2(N−1) and 3(N−1) messages per CS execution by suppressing REPLY messages in certain cases.
E.g. suppose site Sj receives a REQUEST message from site Si after it has sent its own REQUEST message with a timestamp higher than the timestamp of site Si's request.
In this case, site Sj need not send a REPLY message to site Si.
333
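A single-site Python sketch of the algorithm; the send() plumbing and the (clock, site-id) tuple timestamps are assumptions for illustration:

    import heapq

    class LamportMutex:
        def __init__(self, sid, peers, send):
            self.sid, self.peers, self.send = sid, peers, send
            self.clock, self.queue = 0, []   # heap of (ts, site) requests
            self.my_req, self.later = None, set()

        def request(self):                   # send REQUEST to all sites in Ri
            self.clock += 1
            self.my_req = (self.clock, self.sid)
            heapq.heappush(self.queue, self.my_req)
            self.later = set()
            for p in self.peers:
                self.send(p, ("REQUEST", self.my_req))

        def on_message(self, frm, kind, ts):
            self.clock = max(self.clock, ts[0]) + 1
            if kind == "REQUEST":
                heapq.heappush(self.queue, ts)
                self.send(frm, ("REPLY", (self.clock, self.sid)))
            elif kind == "RELEASE":
                self.queue = [r for r in self.queue if r[1] != frm]
                heapq.heapify(self.queue)
            if self.my_req and ts > self.my_req:
                self.later.add(frm)          # bookkeeping for condition L1

        def can_enter(self):                 # conditions L1 and L2
            return (self.later == set(self.peers) and self.queue
                    and self.queue[0] == self.my_req)

        def release(self):                   # exit CS: pop own request, RELEASE
            heapq.heappop(self.queue)
            self.my_req = None
            self.clock += 1
            for p in self.peers:
                self.send(p, ("RELEASE", (self.clock, self.sid)))

Note that any message with a timestamp larger than my_req counts toward L1, which is exactly what the REPLY-suppression optimization exploits.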
DME: Token-based Algorithms
In token-based algorithms, a unique token is shared among all sites.
A site is allowed to enter its CS if it possesses the token.
Token-based algorithms use sequence numbers instead of timestamps.
Every request for the token contains a sequence number, and the sequence numbers of sites advance independently.
A site increments its sequence-number counter every time it makes a request for the token.
A primary function of the sequence numbers is to distinguish between old and current requests.
334
DME: Token-based Algorithms
Depending upon the way a site carries out its search for the token, there are numerous token-based algorithms:
Suzuki-Kasami's broadcast algorithm
Singhal's heuristic algorithm
Raymond's tree-based algorithm
335
Suzuki-Kasami's broadcast algorithm
The main idea:
1. Completely connected network of processes.
2. There is one token in the network. The holder of the token has permission to enter the CS.
3. Any other process trying to enter the CS must acquire that token.
4. The token moves from one process to another based on demand.
[Figure: processes broadcasting "Request to enter CS"; the token holder passes the token on]
336
SK Algorithm's Requirements:
Process j broadcasts REQUEST(j, num), where num is the sequence number of the request.
Each process maintains an array req: RN[i] denotes the sequence number num of the latest request from process i.
Additionally, the holder of the token maintains:
an array last: LN[i] denotes the sequence number of the latest visit to the CS by process i;
a queue Q of waiting processes.
req: array[0..n-1] of integer
last: array[0..n-1] of integer
[Figure: every process holds a req array; the token holder additionally holds last and Q]
337
Algorithm:
If a node wants the TOKEN, it broadcasts a REQUEST message to all other nodes.
Requesting the CS:
1. Node j requesting its n-th CS invocation broadcasts REQUEST(j, n), n = 1, 2, 3, ... (sequence number).
2. Node i receives the REQUEST from j and updates RNi[j] = max(RNi[j], n), where RNi[j] is the largest sequence number received so far from node j.
338
Executing the CS
3. Node i executes the CS when it has received the TOKEN(Q, LN), where
Q is the queue of requesting nodes, and
LN is an array of size N such that LN[j] = the sequence number of the request of node j granted most recently.
Releasing the CS
When node i finishes executing the CS, it does the following:
4. Sets LN[i] = RNi[i] to indicate that the current request of node i has been granted (executed).
5. Every node k such that RNi[k] > LN[k] (i.e., node k is requesting) is appended to Q if it is not already there.
6. When these updates are complete, if Q is not empty, the front node is deleted from Q and the TOKEN is sent to it (first come, first served).
339
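A minimal Python sketch of the node logic; the send() plumbing and dictionary token encoding are assumptions for illustration:

    class SuzukiKasami:
        def __init__(self, i, n, send, has_token=False):
            self.i, self.n, self.send = i, n, send
            self.rn = [0] * n                      # RN_i
            self.token = {"LN": [0] * n, "Q": []} if has_token else None
            self.in_cs = False

        def request_cs(self):                      # step 1: broadcast REQUEST
            self.rn[self.i] += 1
            if self.token is None:
                for j in range(self.n):
                    if j != self.i:
                        self.send(j, ("REQUEST", self.i, self.rn[self.i]))
            else:
                self.in_cs = True                  # already holds the token

        def on_request(self, j, num):              # step 2
            self.rn[j] = max(self.rn[j], num)
            # an idle token holder passes the token to an outstanding request
            if (self.token and not self.in_cs
                    and self.rn[j] == self.token["LN"][j] + 1):
                tok, self.token = self.token, None
                self.send(j, ("TOKEN", tok))

        def release_cs(self):                      # steps 4-6
            self.in_cs = False
            ln, q = self.token["LN"], self.token["Q"]
            ln[self.i] = self.rn[self.i]           # step 4
            for k in range(self.n):                # step 5
                if self.rn[k] > ln[k] and k not in q:
                    q.append(k)
            if q:                                  # step 6: FCFS hand-off
                k = q.pop(0)
                tok, self.token = self.token, None
                self.send(k, ("TOKEN", tok))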
Example
There are three processes, p1, p2, and p3.
p1 and p3 seek mutually exclusive access
to a shared resource.
Initially: the token is at p2 and the token's
state is LN = [0, 0, 0] and Q empty;
p1's state is: n1 ( seq # ) = 0, RN1 = [0, 0,
0];
p2's state is: n2 = 0, RN2 = [0, 0, 0];
p3's state is: n3 = 0, RN3 = [0, 0, 0];
p1 sends REQUEST(1, 1) to p2 and
p3;
p1: n1 = 1, RN1 = [ 1, 0, 0 ]
340
Meanwhile, p3 sends REQUEST(3, 1) to p1 and p2;
p3: n3 = 1, RN3 = [ 0, 0, 1 ]
But p2 receives REQUEST(1, 1) from p1;
p2: n2 = 1, RN2 = [ 1, 0, 0 ], holding
token
p2 sends the token to p1
p1 receives REQUEST(3, 1) from p3: n1 =
1, RN1 = [ 1, 0, 1 ]
p2 receives REQUEST(3, 1) from p3:
RN2 = [ 1, 0, 1 ]
p3 receives REQUEST(1, 1) from p1;
p3: n3 = 1, RN3 = [ 1, 0, 1 ]
p1 receives the token from p2
p1 enters the critical section
p1 exits the critical section and
sets the token's state to LN = [ 1,
0, 0 ] and Q = ( 3 )
341
p1 sends the token to p3;
p1: n1 = 2, RN1 = [ 1, 0, 1 ], holding token;
token's state is LN = [ 1, 0, 0 ] and Q
empty
p3 receives the token from p1;
p3: n3 = 1, RN3 = [ 1, 0, 1 ],
holding token
p3 enters the critical section
p3 exits the critical section
sets the token's state to LN = [ 1,
0, 1 ] and Q empty
Note: the algorithm terminates once there are no more requests.
342
Correctness:
Mutual exclusion is guaranteed because there is only one token in the system, and a site holds the token during its CS execution.
Theorem: A requesting site enters the CS in finite time.
Proof:
Token request messages of a site Si reach other sites in finite time.
Since one of these sites will have the token in finite time, site Si's request will be placed in the token queue in finite time.
Since there can be at most N − 1 requests in front of this request in the token queue, site Si will get the token and execute the CS in finite time.
343
Performance:
No message is needed and the synchronization delay is zero if a site holds the idle token at the time of its request.
Otherwise it requires at most N message exchanges per CS execution ((N−1) REQUEST messages + the TOKEN message).
The synchronization delay in this algorithm is 0 or T.
Deadlock free (because of the single-token requirement).
No starvation (a requesting site enters the CS in finite time).
344
Consequences & Related Problems
Comparison of Lamport's algorithm and the Suzuki-Kasami broadcast algorithm:
The essential difference is in who keeps the queue.
In Lamport's algorithm, every site keeps its own local copy of the queue.
In the Suzuki-Kasami broadcast algorithm, the queue is passed around within the token.
345
Chap. 4:Distributed Deadlock Detection
Introduction
Issues
Centralized Deadlock-Detection Algorithms
1. The Completely Centralized Algorithm
2. The Ho-Ramamoorthy Algorithms
Distributed Deadlock-Detection Algorithms
1. A Path-Pushing Algorithm
2. An Edge-Chasing Algorithm
3. A Diffusion Computation Based
Algorithm
4. Global State Detection Based Algorithm
346
Deadlocks: An Introduction
What are DEADLOCKS?
A deadlock involves a blocked process that can never proceed unless there is some outside intervention.
For example: resource R1 is requested by process P1 but is held by process P2.
347
Condition for deadlock
Mutual exclusion :The resource can be used
by only one process at a time
No preemption: Resources are released
voluntarily; neither another process nor the
OS can force a process to release a resource.
Hold and wait: A process holds a resource
while waiting for other resources
Circular wait: A closed cycle of processes is
formed, where each process holds one or
more resources needed by the next process
in the cycle
348
Illustrating a Deadlock
Wait-For Graph (WFG):
Nodes: processes in the system
Directed edges: wait-for (blocking) relation
[Figure: Process 1 waits for Resource 1 held by Process 2; Process 2 waits for Resource 2 held by Process 1]
Wait-for graphs (WFG): P1 -> P2 implies P1 is waiting for a resource held by P2.
In the figure, P1 is blocked and waiting for P2 to release Resource 2.
A cycle represents a deadlock.
There are two basic deadlock models: the AND model and the OR model.
349
AND Model
Deadlock: presence of a cycle.
[Figure: WFG over P1–P5 containing a cycle]
350
OR Model
Deadlock: presence of a knot.
Knot: a subset of a graph such that, starting from any node in the subset, it is impossible to leave the knot by following the edges of the graph.
[Figure: WFG over P1–P6 containing a knot]
351
Cycle vs Knot
[Figure (left): WFG over P1–P5 with a cycle but no knot — deadlock in the AND model but no deadlock in the OR model]
[Figure (right): WFG over P1–P5 with both a cycle and a knot — deadlock in both the AND and OR models]
352
Distributed Deadlock Detection
Assumptions:
1. The system has only reusable resources (CPU, main memory, I/O devices, etc.).
2. Only exclusive access to resources (only one copy of each resource is present).
3. States of a process: running or blocked.
Running state: the process has all the resources it needs.
Blocked state: the process is waiting on one or more resources.
Types of distributed deadlocks:
1. Resource deadlocks
2. Communication deadlocks
353
Resource vs Communication
Deadlocks
Resource deadlocks:
Set of deadlocked processes,
where each process waits for a
resource held by another process
(e.g., data object in a database, I/O
resource on a server)
Communication deadlocks:
Set of deadlocked processes, where
each process waits to receive
messages (communication) from
other processes in the set.
354
Basic Issues
Deadlock detection and resolution entails addressing two basic issues:
first, detection of existing deadlocks, and second, resolution of detected deadlocks.
The detection of deadlocks involves two issues:
maintenance of the wait-for graph (WFG), and
searching the WFG for the presence of cycles (or knots).
In distributed systems, a cycle may involve several sites, so the search for cycles greatly depends upon how the WFG of the system is represented across the system.
Depending upon the manner in which WFG information is maintained and the search for cycles is carried out, there are centralized, distributed, and hierarchical algorithms for deadlock detection in distributed systems.
355
Basic Issue
A correct deadlock detection algorithm must satisfy the following two conditions:
1. Progress (no undetected deadlock):
The algorithm should detect all existing deadlocks in finite time,
should continuously be able to detect deadlocks, and
should not let any newly formed deadlock go unnoticed.
2. Safety (no false deadlock):
The algorithm should not report nonexistent deadlocks.
It should take care of the global state:
e.g., segments of a cycle may exist at different instants of time even though the complete cycle never exists.
356
Basic Issue (contd.)
Deadlock resolution involves breaking
existing wait-for dependencies in the WFG
system so as to resolve the deadlock.
It involves rolling back one or more
processes that are deadlocked and assigning
their resources to blocked processes in the
deadlock so that they can resume execution.
It also involves the timely cleanup of deadlock information from the system, since stale information may lead to false deadlocks.
357
Control Organisation
Centralized control:
A control site constructs wait-for graphs (WFGs) and checks for directed cycles.
The WFG can be maintained continuously, or built on demand by requesting WFGs from individual sites.
Distributed control:
The WFG is spread over different sites. Any site can initiate the deadlock detection process.
Hierarchical control:
Sites are arranged in a hierarchy.
A site checks for cycles only among its descendants.
358
Centralized Deadlock Detection
The Completely Centralized Algorithm
The Ho-Ramamoorthy Algorithms
359
The Completely Centralized Algorithm
A designated site called the control site (co-ordinator) maintains the WFG of the entire system.
It checks the WFG for the existence of deadlock cycles whenever a request edge is added to the WFG.
Sites request or release all resources, whether local or remote, through REQUEST and RELEASE messages to the control site.
However, this is highly inefficient due to the concentration of all messages at one site:
it imposes large delays, large communication overhead, and congestion of communication links.
Moreover, reliability is poor due to the single point of failure.
360
Example of Completely Centralized
Algorithm
361
Ho-Ramamoorthy 1-phase Algorithm
Each site maintains 2 status tables: a resource status table and a process status table.
Resource table: keeps track of the transactions that have locked or are waiting for resources.
Process table: keeps track of the resources locked by or waited on by transactions.
One of the sites becomes the central control site.
The central control site periodically collects these 2 tables from each site.
It constructs a WFG from the transactions common to both tables.
No cycle, no deadlocks.
362
Shortcomings:
Occurrence of phantom deadlocks.
High storage & communication costs.
Example of Phantom Deadlocks:
[Figure: P0, P1, P2 spread over systems A and B; P1 holds resource S and requests resource T]
P1 releases resource S and asks for resource T.
2 messages are sent to the control site:
1. Releasing S.
2. Waiting for T.
Message 2 arrives at the control site first.
The control site builds a WFG with a cycle, detecting a phantom deadlock.
363
Ho-Ramamoorthy 2-phase Algorithm
Each site maintains a status table of all processes initiated at that site, including all resources locked and all resources being waited on.
The controller periodically requests the status table from each site.
The controller then constructs a WFG from these tables and searches for cycle(s).
If there are no cycles, there are no deadlocks.
Otherwise (a cycle exists): request the status tables again.
Construct a WFG based only on the transactions common to the 2 consecutive tables.
If the same cycle is detected again, the system is in deadlock.
Note: it was later proved that cycles in 2 consecutive reports need not imply a deadlock. Hence, this algorithm can detect false deadlocks.
364
Distributed Deadlock
Detection Algorithms
A Path-Pushing Algorithm
An Edge-Chasing Algorithm
A Diffusion Computation Based
Algorithm
Global State Detection Based
Algorithm
365
An Overview
All sites collectively cooperate to detect a
cycle in the state graph that is likely to be
distributed over several sites of the system.
The algorithm can be initiated whenever a
process is forced to wait.
The algorithm can be initiated either by the
local site of the process or by the site where
the process waits.
366
These algorithms can be divided into four classes:
1. Path-pushing
Path information is transmitted and accumulated.
Distributed deadlocks are detected by maintaining an explicit global WFG (constructed locally and pushed to neighbors).
2. Edge-chasing (single resource model, AND model)
The presence of a cycle in a distributed graph structure can be verified by propagating special messages called probes along the edges of the graph.
The formation of a cycle is detected by a site if it receives the matching probe that it sent previously.
3. Diffusion computation (OR model, AND-OR model)
The deadlock detection computation is diffused through the WFG of the system.
4. Global state detection (unrestricted, P-out-of-Q model)
Take a snapshot of the system and examine it for the condition of a deadlock.
367
Obermarck's Path-Pushing Algorithm
Individual sites maintain local WFGs.
A virtual node x exists at each site; node x represents the external processes.
Detection process:
Case 1: if site Sn finds a cycle not involving x, a deadlock exists.
Case 2: if site Sn finds a cycle involving x, a deadlock is possible.
(contd.)
369
If Case 2:
Site Sn sends a message containing its detected cycles to other sites. All sites receive the message, update their WFGs, and re-evaluate the graph.
Consider that site Sj receives the message:
Site Sj checks for local cycles. If a cycle is found not involving x (of Sj), a deadlock exists.
If site Sj finds a cycle involving x, it forwards the updated message to other sites.
The process continues until a possible deadlock is found.
If a process sees its own label come back, then it is part of a cycle and a deadlock is finally detected.
Note: the algorithm can detect false deadlocks, due to asynchronous snapshots at different sites.
370
Path-pushing algorithm
Performance:
O(n(n−1)/2) message complexity to detect a deadlock, where n is the number of sites.
O(n) message size.
O(n) delay to detect a deadlock.
371
Obermarck's Algorithm Example
[Figure: initial state of the local WFGs at sites S1, S2, S3, S4]
372
Obermarck's Algorithm Example
[Figure: iterations 1 and 2 of path pushing]
373
[Figure: iterations 3 and 4 of path pushing]
Note: if a process sees its own label come back, then it is part of a cycle and a deadlock is finally detected.
374
2. Edge-Chasing Algorithm
Designed by Chandy-Misra-Haas for the AND request model.
A blocked process sends a probe message to the resource-holding process.
The probe message is a triplet (i, j, k), where:
i: detection initiated by Pi,
j: the message is sent by the site of Pj, and
k: the message is sent to the site of Pk.
When a probe is received by a blocked process, it forwards it to the processes holding the requested resources.
Deadlock is finally detected when a probe returns to its initiator.
375
ALGORITHM:
Let Pi be the initiator.
if Pi is locally dependent on itself
then declare a deadlock
else send probe (i, j, k) to the home site of Pk for each j, k such that all of the following hold:
Pi is locally dependent on Pj,
Pj is waiting on Pk,
Pj and Pk are on different sites.
376
On receipt of probe (i, j, k), check the following conditions:
Pk is blocked;
dependentk(i) = false;
Pk has not replied to all requests of Pj.
If these all hold, do the following:
set dependentk(i) = true;
if k = i, declare that Pi is deadlocked;
else send probe (i, m, n) to the home site of Pn for every m and n such that all of the following hold:
Pk is locally dependent on Pm,
Pm is waiting on Pn,
Pm and Pn are on different sites.
Note: k = i means the probe has returned to its initiator, which is exactly how deadlock is finally detected.
377
Analysis:
Message complexity: m(n−1)/2 messages for m processes at n sites.
Message length: fixed, 3 integer words (very small).
Message delay: O(n) delay to detect a deadlock.
378
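A simplified Python sketch of the probe handler; it forwards along each blocked process's direct cross-site wait-for edges, and the data structures are assumptions for illustration:

    def on_probe(probe, blocked, dependent, wait_for, site_of, send):
        i, j, k = probe                      # initiated by Pi, sent Pj -> Pk
        if not blocked.get(k) or dependent[k].get(i):
            return                           # discard: Pk active or probe seen
        dependent[k][i] = True
        if k == i:
            print(f"P{i} is deadlocked")     # probe returned to its initiator
            return
        for n in wait_for[k]:                # Pk waits on each such Pn
            if site_of[n] != site_of[k]:     # chase only cross-site edges
                send(site_of[n], (i, k, n))

    def initiate(i, wait_for, site_of, send):
        # Pi is blocked: start chasing each cross-site edge it waits on.
        for k in wait_for[i]:
            if site_of[k] != site_of[i]:
                send(site_of[k], (i, i, k))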
A Book Example
[Figure: processes P1–P10 across sites S1, S2, S3; probes (1,3,4), (1,6,8), (1,7,10) travel along cross-site edges, and probe (1,9,1) returns to the initiator P1, detecting the deadlock]
379
3. Diffusion Computation Based Algorithm
Designed by Chandy for the OR request model.
Processes are active or blocked.
A blocked process may start a diffusion.
If deadlock is not detected, the process will eventually unblock and terminate the algorithm.
message = query(i, j, k)
i = initiator of the check
j = immediate sender
k = immediate recipient
reply = reply(i, k, j)
numi(k) = number of query messages sent by i to k
380
ALGORITHM
Initiation: process Pi sends query(i, i, j) to all Pj on which Pi depends.
On receipt of query(i, j, k) by Pk (some blocked process):
if Pk is not blocked, discard the query;
if Pk is blocked:
if this is an engaging query, propagate query(i, k, m) to every Pm in the dependent set of Pk;
else, if Pk has not been continuously blocked since its engagement, discard the query;
else send reply(i, k, j) to Pj.
381
On receipt of reply(i, j, k) by Pk:
if this is not the last awaited reply, just decrement the awaited reply count: numk(i) = numk(i) − 1;
if this is the last reply:
if i = k, report a deadlock (at this point, a knot has been detected);
else send reply(i, k, m) to the engaging process Pm.
382
4. Global State Detection Based Algorithm
Take a snapshot of the distributed WFG.
Global state detection based deadlock detection algorithms exploit the following facts:
1. A consistent snapshot of a distributed system can be obtained without freezing the underlying computation, and
2. if a stable property holds in the system before the snapshot collection is initiated, this property still holds in the snapshot.
Use graph reduction to check for deadlock:
while there is an unblocked process, remove the process and all (resource-holding) edges to it;
there is a deadlock iff the remaining graph is non-null.
383
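A minimal Python sketch of the graph reduction test on a snapshot of the WFG; the encoding (wait_for[p] is the set of processes p is blocked on) is an assumption for illustration:

    def deadlocked(wait_for):
        g = {p: set(q) for p, q in wait_for.items()}
        reduced = True
        while reduced:
            reduced = False
            for p in [p for p, q in g.items() if not q]:
                del g[p]                     # p is unblocked: reduce it
                for q in g.values():
                    q.discard(p)             # its waiters may unblock next
                reduced = True
        return bool(g)                       # non-null remainder => deadlock

    assert deadlocked({1: {2}, 2: {1}})          # cycle: deadlock
    assert not deadlocked({1: {2}, 2: set()})    # P2 finishes, then P1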
[Slides 384–386: graph-reduction example figures]
Unit III : Distributed Resource
Management
Chapter 1.Distributed File Systems
-Architecture
-Mechanisms
-Design Issues
-Case Study: Sun Network File System
Chapter 2:Distributed Shared Memory
-Architecture
-DSM Algorithms (4)
-Memory Coherence : Protocols
-Design Issues.
387
Chapter 3 : Distributed Scheduling
- Issues in Load Distributing
- Components of a Load Distributing
Algorithm
- Load Distributing Algorithms(4)
- Load Sharing Algorithms.
388
Chapter 1: Distributed File System
A DFS is a resource management component of a distributed operating system.
It implements a common file system that can be shared by all the autonomous computers in the system.
Two important goals of distributed file systems:
Network Transparency
To provide the same functional capabilities to access files distributed over a network.
Users do not have to be aware of the location of files to access them.
High Availability
Users should have the same easy access to files, irrespective of their physical location.
System failures or regularly scheduled activities such as backups or maintenance should not result in the unavailability of files.
389
Architecture
Files can be stored at any machine and
computation can be performed at any machine.
A machine can access a file stored on a remote
machine where the file access operations (like
read operation) are performed and the data is
returned.
Alternatively, File Servers are provided as
dedicated to storing files and performing
storage and retrieval operations.
The two most important services in a DFS are:
Name Server: a process that maps names specified by clients to stored objects, e.g. files and directories.
Cache Manager: a process that implements file caching, i.e. copying a remote file to the client's machine when referenced by the client.
390
Architecture of DFS
A cache manager can be present at both the client and the file server.
The cache manager at the client reduces the access delay due to network latency.
The cache manager at the server caches files in main memory to reduce the delay due to disk latency.
391
Data Access Actions in DFS
392
Mechanisms for Building DFS
1. Mounting: binding together of different filename spaces.
2. Caching: reduces delays in the accessing of data.
3. Hints: an alternative to caching that helps in recovery.
4. Bulk data transfer: reduces the high cost of communication protocols.
5. Encryption: provides the security aspect of the DFS.
393
1. Mounting
Allows the binding together of different filename spaces to form a single hierarchically structured name space.
A name space (or collection of files) can be bound to or mounted at an internal node or a leaf node of the name space tree.
A node onto which a name space is mounted is known as a mount point.
The kernel maintains a structure called the mount table, which maps mount points to appropriate storage devices.
394
In the case of a DFS, file systems maintained by remote servers are mounted by the clients.
Two approaches to maintaining the mount information:
1. At the client, where each client individually mounts every required file system (e.g. Sun Network File System).
Every client need not see an identical filename space.
The client needs to update its mount table.
2. At the server, where each client sees an identical filename space (e.g. Sprite file system).
Mount information is updated at the servers.
395
2. Caching:
Reduces delays in the accessing of data by exploiting the temporal locality of reference exhibited by programs.
Data can be cached either in main memory or on the local disk of the client.
Data is also cached in main memory at the server (server cache) to reduce disk access latency.
Caching increases system performance and reduces the frequency of access to the file server and the communication network.
It improves the scalability of the file system.
396
3. Hints:
An alternative to caching that avoids the inconsistency problem arising when multiple clients access shared data.
Hints help in recovery when invalid cached data is discovered.
For example, after the name of a file or directory is mapped to the physical object, the address of the object can be stored as a hint in the cache. If the cached address fails to map to the object, the hint is discarded and the name is resolved again.
397
4. Bulk Data Transfer:
Used to overcome the high cost of executing communication protocols, i.e. assembly/disassembly of packets, copying of buffers between layers, etc.
Transferring data in bulk reduces the protocol processing overhead at both the server and the client.
Multiple consecutive data blocks are transferred from server to client instead of just the block referenced by the client.
5. Encryption:
Used to enforce security in distributed systems: two entities wishing to communicate establish a key for the conversation.
It is important to note that the conversation key is determined by the authentication server, which never sends it in plain text to either of the entities.
398
Design Issues
1. Naming and Name Resolution
2. Caches on Disk or Main Memory
3. Writing Policy
4. Cache Consistency
5. Availability
6. Scalability
7. Semantics
399
Naming and Name Resolution
A name in a file system is associated with an object (e.g. a file or a directory).
Name resolution refers to the process of mapping a name to an object, or, in the case of replication, to multiple objects.
A name space is a collection of names which may or may not share an identical resolution mechanism.
Three approaches to naming files in a distributed environment:
Concatenation of the host name to the (unique) names of files stored on that server.
Mounting of remote directories onto local directories (Sun NFS).
Maintaining a single global directory structure (Sprite and Apollo).
400
The Concept of Contexts:
The notion of a context is used to partition a name space based on:
geographical boundaries,
organizational boundaries,
specific host,
file system type, etc.
A context identifies the name space in which to resolve a given name:
In the x-Kernel Logical File System, a user defines his own file space hierarchy in which an internal node corresponds to a context.
In the Tilde Naming Scheme, the name space is partitioned into sets of logically independent directory trees called tilde trees.
401
Name Server:
Resolves names in distributed systems.
A name server is a process that maps names specified by clients to stored objects such as files and directories.
Clients can send their queries to a single name server, which maps each name to its object.
Drawbacks: single point of failure, performance bottleneck.
An alternative is to have several name servers, e.g. Domain Name Servers, where replication of tables achieves fault tolerance and high performance.
402
2. Caches on Disk or Main Memory
Cache in main memory:
Diskless workstations can also take advantage of caching.
Accessing a cache in main memory is much faster than accessing a cache on a local disk.
The server cache is in main memory, hence a single cache design serves both clients and servers.
Disadvantages:
It competes with the virtual memory system for physical memory space.
A more complex cache manager and memory management system is required.
Large files cannot be cached completely in memory.
Cache on local disk:
Large files can be cached without affecting performance.
Virtual memory management remains simple.
403
3. Writing Policy
Writing policy decides when the modified
cache block at a client should be transferred
to the server
Write-through policy
All writes requested by the applications at clients
are also carried out at the server immediately.
Delayed writing policy
Modifications due to a write are reflected at the
server after some delay.
Write on close policy
The updating of the files at the server is not done
until the file is closed.
404
4. Cache Consistency
Two approaches guarantee that the data returned to the client is valid:
Server-initiated approach
The server informs cache managers whenever the data in client caches becomes stale.
Cache managers at clients can then retrieve the new data or invalidate the blocks containing the old data.
Client-initiated approach
It is the responsibility of the cache managers at the clients to validate data with the server before returning it to the client.
Both are expensive, since communication costs are high.
405
Alternative approach:
Concurrent-write sharing approach
A file is open at multiple clients and at least one has it open for writing.
When this occurs for a file, the file server informs all the clients to flush their cached data items belonging to that file.
Major issue:
Sequential-write sharing causes cache inconsistency when:
a client opens a file while it still has outdated blocks in its cache, or
a client opens a file whose current data blocks are still in another client's cache, waiting to be flushed (e.g. under the delayed writing policy).
406
5. Availability
Immunity to the failure of server or the
communication network
Issue: what is the level of availability of files
in a distributed file system?
Resolution: use replication to increase
availability, i.e. many copies (replicas) of files
are maintained at different sites/servers.
It is expensive because
Extra storage space required
The overhead incurred in maintaining all the
replicas up to date
Replication Issues involve
How to keep replicas consistent?
How to detect inconsistency among replicas?
407
Causes of Inconsistency :
A replica is not updated due to failure of server
All the file servers are not reachable from all the clients
due to network partition
The replicas of a file in different partitions are updated
differently
Unit of Replication:
File
Group of files
a) Volume: group of all files of a user or group or all
files in a server
Advantage: ease of implementation
Disadvantage: wasteful, user may need only
a subset replicated
b) Primary pack vs. pack
Primary pack: all files of a user
Pack: subset of primary pack. Can receive a
different degree of replication for each pack
408
Replica Management:
Deals with the maintenance of replicas and with making use of them to provide increased availability.
Concerned with the consistency among replicas.
A weighted voting scheme (e.g. Roe File System):
the latest read/write updates are tracked with timestamps.
Designated agents scheme (e.g. Locus):
one or more processes/sites (also called current synchronization sites) are designated as agents controlling access to the replicas of files.
Backup servers scheme (e.g. Harp File System):
one designated site acts as the primary and the others as backups.
409
6. Scalability
The suitability of the design of a system to cater to the demands of a growing system.
As the system grow larger, both the size of the server
state and the load due to invalidations increase.
The structure of the server process also plays a major
role in deciding how many clients a server can support.
If the server is designed with a single process, then
many clients have to wait for a long time whenever
a disk I/O is initiated.
These waits can be avoided if a separate process is
assigned to each client.
An alternate is to use Lightweight processes
(threads).
410
7. Semantics
The semantics of a file system characterizes the effects of accesses on files.
Expected semantics: a read will return the data stored by the latest write.
To guarantee the above semantics, possible options are:
All the reads and writes from various clients go through the server.
Disadvantage: communication overhead.
Use of a lock mechanism: sharing is disallowed, either by the server or by the use of locks by client applications.
Disadvantage: the file is not always available.
411
Case Study:
The Sun Network File System (NFS)
Developed by Sun Microsystems to provide a distributed file system independent of the hardware and operating system.
The goal is to share a file system in a transparent way.
Uses the client-server model.
NFS is stateless:
The server does not maintain any record of past requests.
All client requests must be self-contained with their information.
Fast crash recovery is the major reason behind the stateless design.
412
Basic Design
Three important parts
The protocol
The client side
The server side
413
1. Protocol
Uses the Sun RPC mechanism and the Sun eXternal Data Representation (XDR) standard.
Defined as a set of remote procedures.
The protocol is stateless:
Each procedure call contains all the information necessary to complete the call.
The server maintains no state between calls.
414
2. Client side
Provides transparent interface to NFS
Mapping between remote file names and remote file
addresses is done through remote mount
Extension of UNIX mounts
Specified in a mount table
New virtual file system(VFS) interface supports
VFS calls, which operate on whole file system
VNODE calls, which operate on individual files
Treats all files in the same fashion.
Note : Vnode (Virtual Node):
There is a network-wide vnode for every object in the
file system (file or directory)- equivalent of UNIX inode
vnode has a mount table, allowing any node to be a
mount node
415
3. Server side
Server implements a write-through
policy
Required by statelessness
Any blocks modified by a write request
must be written back to disk before the
call completes.
416
NFS Architecture
1. System call interface layer:
a) Presents sanitized, validated requests in a uniform way to the VFS.
2. Virtual file system (VFS) layer:
b) Gives a clean layer between user and file system.
c) Acts as a deflection point by using global vnodes.
d) Understands the difference between local and remote names.
e) Keeps in-memory information about what should be deflected (mounted directories) and how to get to these remote directories.
417
4. NFS client code:
Creates an r-node (remote i-node) in its internal tables to hold the file handles. The v-node points to the r-node. Each v-node in the VFS layer will ultimately contain either a pointer to an r-node in the NFS client code, or a pointer to an i-node in the local operating system. Thus, from the v-node it is possible to see whether a file or directory is local or remote, and, if it is remote, to find its file handle.
5. Caching to improve performance:
Transfers between client and server are done in large chunks, normally 8 Kbytes, even if fewer bytes are requested. This is known as read-ahead.
The same holds for writes: if a write system call writes fewer than 8 Kbytes, the data are just accumulated locally. Only when the entire 8K chunk is full is it sent to the server. However, when a file is closed, all of its data are sent to the server immediately.
418
NFS (Cont.)
Naming and location:
Workstations are designated as clients or file servers
A client defines its own private file system by mounting a
subdirectory of a remote file system on its local file system
Each client maintains a table which maps the remote file
directories to servers
Mapping a filename to an object is done the first time a
client references the field. Example:
Filename: /A/B/C
Assume A corresponds to vnode1
Look up on vnode1/B returns vnode2 for B
where vnode2 indicates that the object is on server X
Client asks server X to lookup vnode2/C
file handle returned to client by server storing that file
Client uses file handle for all subsequent operation on
that file
419
NFS (Cont.)
Caching:
Caching done in main memory of clients
Caching done for: file blocks, translation of filenames to vnodes, and
attributes of files and directories
(1) Caching of file blocks
Cached on demand with time stamp of the file (when last modified on the
server)
Entire file cached, if under certain size, with timestamp when last modified
After certain age, blocks have to be validated with server
Delayed writing policy: Modified blocks flushed to the server after certain delay
(2) Caching of filenames to vnodes for remote directory names
Speeds up the lookup procedure
(3) Caching of file and directory attributes
Updated when new attributes received from the server, discarded after certain
time
Stateless Server :
Servers are stateless
File access requests from clients contain all needed information (pointer position,
etc)
Servers have no record of past requests
Simple recovery from crashes.
420
Chapter 2: Distributed Shared
Memory
- Distributed shared memory (DSM) implements the shared memory model in distributed systems, which have no physical shared memory.
- The shared memory model provides a virtual address space shared between all nodes.
- To overcome the high cost of communication in distributed systems, DSM systems move data to the location where it is accessed.
421
DSM Architecture
[Figure: n nodes, each with local memory and a mapping manager, connected by a communication network; the mapping managers together present a shared virtual address space.]
422
Architecture of DSM
- Programs access data in the shared address space just as they access data in traditional virtual memory.
- Data moves between main memory and secondary
memory (within a node) and between main
memories of different nodes
- Each data object is owned by a node
- Initial owner is the node that created the object
- Ownership can change as object moves from
node to node
- When a process accesses data in the shared
address space, the mapping manager maps shared
memory address to physical memory (local or
remote).
- Mapping manager: a layer of software, perhaps
bundled with the OS or as a runtime library routine.
423
Advantages of distributed shared memory
(DSM)
1. Data sharing is implicit, hiding the data
movement (as opposed to Send/Receive in
message passing model)
2. Passing data structures containing pointers is
easier (in message passing model data
moves between different address spaces)
3. Moving an entire object/page to the user takes advantage of locality of reference: the whole block or page of memory containing the referenced data/object is moved, which makes subsequent references to associated data local and easier.
424
Advantages of distributed shared memory
(DSM)
4. Less expensive to build than tightly coupled
multiprocessor system: off-the-shelf hardware, no
expensive interface to shared physical memory.
5. Very large total physical memory for all nodes:
Large programs can run more efficiently.
6. Tightly coupled multiprocessor systems access main memory via a common bus; with DSM there is no serialized access to a common bus for shared physical memory.
7. Programs written for shared-memory multiprocessors can be run on DSM systems with minimum changes.
425
Algorithms for implementing DSM
Issues
- How to keep track of the location of remote data
- How to minimize communication overhead when
accessing remote data
- How to access remote data concurrently from several nodes
Types of algorithms:
1. Central-server
2. Data migration
3. Read-replication
4. Full-replication
426
1. The Central Server
Algorithm
- Central server maintains all shared data
Read request: returns data item
Write request: updates data and returns
acknowledgement message
- Implementation
A timeout is used to resend a request if
acknowledgment fails
Associated sequence numbers can be
used to detect duplicate write requests
If an application's request to access
shared data fails repeatedly, a failure
condition is sent to the application
- Issues: performance and reliability
- central server can become a bottleneck.
- Possible solutions
Partition shared data between several
servers
Use a mapping function to
distribute/locate data
427
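A small sketch of the central-server algorithm in Java (all names invented for illustration): the server keeps all shared data, and per-client sequence numbers let it recognise a retransmitted write so a duplicated request is not applied twice.

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative sketch of the central-server algorithm.
    class CentralServer {
        private final Map<String, Integer> data = new HashMap<>();   // all shared data
        private final Map<String, Long> lastSeq = new HashMap<>();   // per-client sequence no.

        synchronized int read(String address) {
            return data.getOrDefault(address, 0);                    // read returns the data item
        }

        // Returns an acknowledgement; a retransmitted (duplicate) write is
        // detected by its sequence number and re-acknowledged without being
        // applied a second time.
        synchronized boolean write(String client, long seq, String address, int value) {
            long last = lastSeq.getOrDefault(client, -1L);
            if (seq > last) {
                data.put(address, value);
                lastSeq.put(client, seq);
            }
            return true;
        }
    }

On the client side, a timeout would trigger a resend of the same request with the same sequence number, matching the implementation notes above.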
2. The Migration Algorithm
- Operation
Ship (migrate) entire data object (page,
block) to requesting location
Allow only one node to access a shared
data at a time
- Advantages
Takes advantage of the locality of reference
DSM can be integrated with VM at each
node
- To locate a remote data object:
Use a location server
Maintain hints at each node
Broadcast query
- Issues
Only one node can access a data object at
a time
Thrashing can occur: to minimize it, set
minimum time data object resides at a
node
Thrashing: if two nodes compete for write access to a single data item, it may be transferred back and forth at such a high rate that no real work can get done (a ping-pong effect).
428
3. Read-replication Algorithm:
Extend migration algorithm:
Replicate data at multiple nodes for read access.
Write operation:
One node write access (multiple readers-one writer protocol)
After a write, invalidate all copies of shared data at various
nodes (or) update with modified value
[Figure: write operation in the read-replication algorithm — node i sends a data access request to node j, the data is replicated to node i, and the other copies are invalidated.]
DSM must keep track of the location of all the copies of shared data.
Read cost low, write cost higher.
429
4. The Full-Replication Algorithm:
Extension of the read-replication algorithm: multiple nodes can read and multiple nodes can write (multiple-readers, multiple-writers protocol)
Issue: consistency of data for multiple writers
Solution: use of gap-free sequencer
All writes sent to sequencer
Sequencer assigns sequence number and
sends write request to all sites that have
copies
Each node performs writes according to
sequence numbers
A gap in sequence numbers indicates a
missing write request: node asks for
retransmission of missing write requests
430
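The gap-free sequencer idea can be sketched as follows (illustrative names; retransmission is left as a stub): the sequencer hands out consecutive numbers, and a replica applies writes strictly in that order, asking for a resend when it detects a gap.

    import java.util.PriorityQueue;

    // Illustrative sketch of write ordering in the full-replication algorithm.
    class Sequencer {
        private long next = 0;
        synchronized long stamp() { return next++; }     // gap-free sequence numbers
    }

    class Replica {
        private long expected = 0;                       // next sequence number to apply
        private final PriorityQueue<long[]> pending =    // {seq, value} buffered out of order
                new PriorityQueue<>((a, b) -> Long.compare(a[0], b[0]));
        private long value;

        synchronized void deliver(long seq, long newValue) {
            pending.add(new long[]{seq, newValue});
            while (!pending.isEmpty() && pending.peek()[0] == expected) {
                value = pending.poll()[1];               // apply writes in sequence order
                expected++;
            }
            if (!pending.isEmpty() && pending.peek()[0] > expected) {
                requestRetransmission(expected);         // gap detected: a write is missing
            }
        }

        private void requestRetransmission(long missingSeq) { /* ask sequencer/peers */ }
    }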
Memory Coherence
The memory is said to be coherent when the value returned by a read operation is the value the programmer expects (e.g., the value of the most recent write).
In DSM, memory coherence is maintained when the coherence protocol is chosen in accordance with a consistency model.
A mechanism that controls/synchronizes accesses is needed to maintain memory coherence; it is based on one of the following models:
1. Strict consistency: requires total ordering of requests; a read returns the value of the most recent write.
431
2. Sequential consistency: a system is sequentially consistent if the result of any execution is the same as if the operations of all processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.
3. General consistency : All copies of a memory
location (replicas) eventually contain same data
when all writes issued by every processor have
been completed.
4. Processor consistency: Operations issued by a
processor are performed in the same order they
are issued.
5. Weak consistency : Synchronization operations
are guaranteed to be sequentially consistent.
6. Release consistency: provides acquire and release synchronization operations; shared data are made consistent when a release is performed.
432
Coherence Protocols
Issues
- How do we ensure that all replicas have the same
information.
- How do we ensure that nodes do not access
stale(old) data.
1. Write-invalidate protocol
- Invalidate(nullify) all copies except the one being
modified before the write can proceed.
- Once invalidated, data copies cannot be used.
- Advantage: good performance for
Many updates between reads
Per node locality of reference
- Disadvantage
Invalidations sent to all nodes that have
copies.
Inefficient if many nodes access same object.
433
Coherence Protocols
2. Write-update protocol
- Causes all copies of shared data to be updated.
- More difficult to implement,
- Guaranteeing consistency may be more difficult as
reads may happen in between write-updates.
Examples of Implementation of memory coherence
1. Cache coherence in PLUS system
2. Type specific memory coherence in the Munin
system
Based on Process synchronization
3. Unifying Synchronization and data transfer in
Clouds
434
Cache coherence in the PLUS System
Based on write-update protocol and supports general
consistency.
Memory Coherence Manager (MCM) running at each
node is responsible for maintaining consistency.
Unit of replication: a page (4 Kbytes)
Unit of memory access and coherence maintenance:
one 32-bit word.
A virtual page corresponds to a list of replicas of a
page.
One of the replicas is designated as the master copy.
Distributed link list (copy-list) identifies the replicas of
a page. Copy-list has 2 pointers
Master pointer
Next-copy pointer
435
PLUS: Read/Write Operations
Read operation:
On a read fault , if address points to local memory,
read it. Otherwise, local MCM sends a read request to
its counterpart at the specified remote node.
Data returned by remote MCM passed back to the
requesting processor.
Write operation:
To maintain consistency, writes are always performed
first on master copy and then propagated to copies
linked by the copy-list.
On write fault: update request sent to the remote
node pointed to by MCM.
If the remote node does not have the master copy,
update request sent to the node with master copy
and for further propagation.
436
PLUS Write-update Protocol: distributed copy-list
[Figure: page p holding X is replicated on nodes 1-3; the master copy is on node 1 and the next-copy pointers chain node 1 -> node 2 -> node 3 -> nil. Node 4 holds no replica; its page table maps X to node 2.]
Steps for a write to X issued at node 4:
1. Node 4's MCM sends the write request to node 2.
2. Node 2 forwards an update message to the master node (node 1).
3. The master's MCM updates X.
4. An update message is sent to the next copy (node 2).
5. Node 2's MCM updates X.
6. An update message is sent to the next copy (node 3).
7. Node 3 updates X.
8. The MCM sends an ack to node 4: update complete.
437
PLUS: Protocol
Node issuing write is not blocked on write
operation.
However, a read on that location (being written
into) gets blocked till the whole update is
completed. (i.e., remember pending writes).
Strong ordering within a single processor
independent of replication (in the absence of
concurrent writes by other processors), but not
with respect to another processor.
write-fence operation: strong ordering with
synchronization among processors. MCM waits
for previous writes to complete.
438
Type specific memory coherence
in Munin System
Use application-specific semantic information to classify shared objects.
Use class-specific handlers.
Shared object classes based on access pattern are :
1. Write-once objects: written at the start, read many times
after that. Replicated on-demand, accessed locally at each
site. For large objects, portions can be replicated instead of
the whole object.
2. Private objects: accessed by a single thread. Not managed
by coherence manager unless accessed by a remote thread.
3. Write-many objects: modified by multiple threads between
synchronization points. Munin employs delayed updates.
Updates are propagated only when thread synchronizes.
Weak consistency.
4. Result objects: Assumption is concurrent updates to
different parts of a result object will not conflict and object
is not read until all parts are updated -> delayed update can
be efficient.
439
5.Synchronization objects: (e.g.,) distributed locks for giving
exclusive access to data objects.
6. Migratory objects: accessed in phases where each phase is a
series of accesses by a single thread: lock + movement, i.e.,
migrate the object to the node requesting the lock.
7. Producer-consumer objects: written by 1 thread, read by
another. Strategy: move the object to the reading thread in
advance.
8. Read-mostly object: i.e., writes are infrequent. Use broadcasts
to update cached objects.
9. General read-write objects: does not fall into any of the above
categories: Use Berkeley ownership protocol supporting strict
consistency. Objects can be in states such as:
Invalid: no useful data.
Unowned: has valid data. Other nodes have copies of the
object and the object cannot be updated without first
acquiring ownership.
Owned exclusively: Can be updated locally. Can be
replicated on-demand.
Owned non-exclusively: Cannot be updated before
invalidating other copies.
440
Design Issues
1. Granularity: size of the shared memory unit.
For better integration of DSM and local memory management, the DSM page size can be a multiple of the local page size.
Integration with local memory management provides built-in protection mechanisms to detect faults, and to prevent and recover from inappropriate references.
Larger page size:
More locality of references.
Less overhead for page transfers.
Disadvantage: more contention for page accesses.
Smaller page size:
Less contention.
Reduces false sharing, which occurs when two different data items are not actually shared by two different processors, but contention occurs because they lie on the same page.
441
2. Page Replacement :
Needed as physical/main memory is limited.
Data may be used in many modes: shared,
private, read-only, writable etc
Least Recently Used (LRU) replacement policy
cannot be directly used in DSMs supporting data
movement. Modified policies more effective:
Private pages may be removed ahead of shared ones
as shared pages have to be moved across the network
Read-only pages can be deleted as owners will have a
copy
A page to be replaced should not be lost forever.
Swap it onto local disk.
Send it to the owner.
Use reserved memory in each node for swapping.
442
Unit III : Chapter 3
Distributed Scheduling
- Issues in Load Distributing
- Components of a Load Distributing
Algorithm
- Load Distributing Algorithms(4)
- Selection of Load Sharing Algorithms.
443
Introduction
Good resource allocation schemes are
needed to fully utilize the computing
capacity of the DS.
Distributed scheduler is a resource
management component of a DOS.
It focuses on judiciously and transparently
redistributing the load of the system
among the computers.
Target is to maximize the overall
performance of the system.
More suitable for DS based on LANs.
444
Issues in Load Distribution
1. Load
Resource queue lengths, and particularly the CPU queue length, are good indicators of load.
Measuring the CPU queue length is fairly
simple and carries little overhead.
CPU queue length does not always tell
the correct situation as the jobs may
differ in types.
Another load measuring criterion is the
processor utilization.
Requires a background process that
monitors CPU utilization continuously
and imposes more overhead.
It is used in most of the load-balancing algorithms.
445
2. Classification of LDA
Basic function is to transfer load from heavily
loaded systems to idle or lightly loaded
systems
These algorithms can be classified as :
(1) Static (load assigned before application
runs)
Does not consider system state.
Uses static information about average
behavior.
Load distribution decisions are hardwired into the algorithm using a priori knowledge of the system.
Little run-time overhead.
446
(2) Dynamic (load assigned as applications run)
Takes current system state into account to
make load distributing decisions
Further categorized as :
o Centralized (Tasks assigned by the master
or root process)
o De-centralized (Tasks reassigned among
slaves)
Has some overhead for state monitoring
(3)Adaptive
special case of dynamic algorithms in that
they modify the algorithm based on the
system state parameters.
For example, stop collecting information (go
static) if all nodes are busy so as not to
impose extra overhead.
447
3. Load Balancing vs. Load Sharing
Load-balancing approach:
Tries to equalize the load at all processors.
Moves tasks more often than load sharing; much more overhead.
A load-balancing algorithm transfers tasks at a higher rate than a load-sharing algorithm.
Load balancing is an NP-complete problem.
Requires background processes to measure processor utilization.
Load-sharing approach:
Tries to reduce the load on the heavily loaded processors only.
Probably a better solution; much less overhead.
If the transfer rate of load sharing rises, it approaches load balancing.
448
4. Preemptive vs. Non-preemptive transfer
Can a task be transferred to another processor once it
starts executing?
Non-preemptive transfer (task placement)
Can only transfer tasks that have not yet begun execution.
Has to transfer environment information, such as program code and data: environment variables, working directory, inherited privileges, etc.
It is simple.
Preemptive transfer
Can transfer a task that has partially executed.
Has to transfer the entire state of the task: virtual memory image, process control block, unread I/O buffers and messages, file pointers, timers that have been set, etc.
It is expensive.
449
Components of a load distribution algorithm
1.Transfer policy
Determines if a processor is in a suitable state to
participate in a task transfer.
2.Selection policy
Selects a task for transfer, once the transfer policy decides
that the processor is a sender.
3.Location policy
Finds suitable processors (senders or receivers) to share
load
4. Information policy
Decides:
When information about the state of other processors
should be collected?
Where it should be collected from?
What information should be collected?
450
1. Transfer policy
Determines whether a processor is a sender or a
receiver
Sender overloaded processor
Receiver underloaded processor
Threshold-based transfer
Establish a threshold, expressed in units of load
When a new task originates on a processor, if
the load on that processor exceeds the
threshold, the transfer policy decides that that
processor is a sender
When the load at a processor falls below the
threshold, the transfer policy decides that the
processor can be a receiver
451
2. Selection Policy
Selects which task to transfer
Newly originated and simple (the task has just started)
Long (the response-time improvement compensates for the transfer overhead)
Small size
Minimum of location-dependent system calls (residual bandwidth minimized)
Lowest priority
Priority assignment policy:
Selfish - local processes given priority
Altruistic - remote processes given priority
Intermediate - priority based on the ratio of local to remote processes in the system
452
3. Location Policy
Once the transfer policy designates a processor as a
sender, finds a receiver
Or, once the transfer policy designates a
processor as a receiver, finds a sender
Polling one processor polls another processor to
find out if it is a suitable processor for load
distribution, selecting the processor to poll either:
Randomly
Based on information collected in previous polls
On a nearest-neighbor basis
Can poll processors either serially or in parallel
(e.g., multicast)
Usually some limit on number of polls, and if that
number is exceeded, the load distribution is not
done
453
4. Information Policy
Decides:
When information about the state of other
processors should be collected
Where it should be collected from
What information should be collected
Demand-driven
A processor collect the state of the other
processors only when it becomes either a sender
or a receiver (based on transfer and selection
policies)
Dynamic driven by system state
Sender-initiated senders look for receivers
to transfer load onto
Receiver-initiated receivers solicit load from
senders
454
Symmetrically-initiated - a combination in which senders look for receivers and receivers solicit senders
Periodic
- Processors exchange load information at periodic
intervals.
- Based on information collected, transfer policy on a
processor may decide to transfer tasks.
- Does not adapt to system state: collects the same information (same overhead) at high system load as at low system load.
State-change-driven
Processors propagate state information whenever their state changes by a certain degree.
Differs from demand-driven in that a processor
propagates information about its own state, rather
than collecting information about the state of other
processors.
May send to central collection point or may send to
peers.
455
Stability
The two views of stability are,
The Queuing-Theoretic Perspective
A system is termed unstable if the CPU queues grow without bound, which happens when the long-term arrival rate of work to the system is greater than the rate at which the system can perform work.
The Algorithmic Perspective
If an algorithm can perform fruitless actions
indefinitely with finite probability, the
algorithm is said to be unstable.
456
Load Distributing Algorithms
Sender-Initiated Algorithms
Receiver-Initiated Algorithms
Symmetrically Initiated Algorithms
Adaptive Algorithms
457
1. Sender-Initiated
Algorithms
Activity is initiated
by an overloaded node
(sender)
A task is sent to an underloaded node
(receiver)
CPU queue threshold T is decided for all
nodes
Transfer Policy
A node is identified as a sender if a
new task originating at the node
makes the queue length exceed a
threshold T.
Selection Policy
Only newly arrived tasks are considered for transfer
458
Location Policy
Random: dynamic location policy (select any node
to transfer the task at random).
The selected node X may be overloaded.
If the transferred task is treated as a new arrival, then X may transfer the task again.
No prior information exchange.
Effective under light-load conditions.
Threshold: Poll nodes until a receiver is found.
Up to PollLimit nodes are polled.
If none is a receiver, the sender keeps the task and processes it locally.
Shortest: Among the polled nodes that were found to be receivers, select the one with the shortest
queue.
Information Policy
A demand-driven type
Stability
Location policies adopted cause system instability
at high loads
459
[Flowchart: sender-initiated threshold policy — when a task arrives and QueueLength + 1 > T, the node repeatedly selects a node i at random (not already in Poll-set), adds it to Poll-set, and polls it; if QueueLength at i < T the task is transferred to i, otherwise polling continues while the number of polls is below PollLimit; if no receiver is found the task is queued locally.]
460
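The flowchart can be rendered as a short routine (a simplified sketch with invented Task/Node interfaces; unlike the flowchart, it does not track a Poll-set, so a node may occasionally be polled twice):

    import java.util.Random;

    // Illustrative sketch of the sender-initiated threshold policy.
    class SenderInitiated {
        static final int T = 3;                 // CPU queue length threshold
        static final int POLL_LIMIT = 5;
        private final Random rnd = new Random();

        void onTaskArrival(Task task, Node self, Node[] nodes) {
            if (self.queueLength() + 1 <= T) {  // threshold not exceeded: not a sender
                self.enqueue(task);
                return;
            }
            for (int polls = 0; polls < POLL_LIMIT; polls++) {
                Node candidate = nodes[rnd.nextInt(nodes.length)];
                if (candidate != self && candidate.queueLength() < T) {
                    candidate.enqueue(task);    // receiver found: transfer the new task
                    return;
                }
            }
            self.enqueue(task);                 // no receiver within PollLimit: queue locally
        }
    }

    interface Task { }
    interface Node {
        int queueLength();
        void enqueue(Task t);
    }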
2. Receiver-Initiated
Algorithms
Initiated from an underloaded node (receiver) to obtain a task from an overloaded node (sender)
Transfer Policy
Triggered when a task departs: the node compares its CPU queue length with T, and if it is smaller, the node is a receiver.
Selection Policy
Any approach, preference to non-preemptive transfers.
Location Policy
Randomly poll nodes until a sender is found, and transfer a task
from it. If no sender is found, wait for a period or until a task
completes, and repeat.
Information Policy
A demand-driven type
Stability
At high loads, a receiver will find a sender with high probability using a small number of polls. At low loads, most polls will fail, but this is not a problem, since CPU cycles are available.
Most transfers are preemptive and therefore expensive
461
[Flowchart: receiver-initiated policy — when a task departs at node j and QueueLength < T, the node repeatedly selects a node i at random (not already in Poll-set), adds it to Poll-set, and polls it; if QueueLength at i > T a task is transferred from i to j, otherwise polling continues while the number of polls is below PollLimit; if no sender is found the node waits for a predetermined period and retries.]
462
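A matching sketch of the receiver-initiated poll loop (again with invented interfaces and no Poll-set bookkeeping):

    import java.util.Random;

    // Illustrative sketch of the receiver-initiated policy.
    class ReceiverInitiated {
        static final int T = 2;                     // queue length threshold
        static final int POLL_LIMIT = 5;
        private final Random rnd = new Random();

        void onTaskDeparture(Host self, Host[] hosts) {
            if (self.queueLength() >= T) return;    // not a receiver
            for (int polls = 0; polls < POLL_LIMIT; polls++) {
                Host candidate = hosts[rnd.nextInt(hosts.length)];
                if (candidate != self && candidate.queueLength() > T) {
                    Job job = candidate.takeJob();  // sender found: pull one task
                    if (job != null) self.enqueue(job);
                    return;
                }
            }
            // no sender found: wait for a predetermined period, then try again
        }
    }

    interface Job { }
    interface Host {
        int queueLength();
        void enqueue(Job j);
        Job takeJob();
    }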
3. Symmetrically Initiated
Algorithms
Both senders and receivers search for receiver and
senders, respectively, for task transfer.
Combine both sender-initiated and receiver-initiated
components in order to get a hybrid algorithm with
the advantages of both.
Care must be taken since otherwise, the hybrid
algorithm may inherit the disadvantages of both
sender and receiver initiated algorithms.
Most popular algorithm : The Above-Average
Algorithm given by Krueger and Finkel.
463
The above-average algorithm of
Krueger and Finkel
Maintain node load at an acceptable range of the system
average load.
Transfer policy
two thresholds are used, equidistant from the node's estimate of
the average load across all nodes.
Nodes are classified as senders, receivers, or OK.
Location policy
Has a sender-initiated and a receiver initiated component
Selection policy: same as before
Information policy
Average system load is determined individually.
The thresholds can be adaptive to system state to control
responsiveness.
464
Location Policy of Krueger & Finkel's
Algorithm
Sender-initiated part:
A sender sends a TooHigh msg, sets a TooHigh timeout, and listens for Accept msgs.
A receiver that gets a TooHigh msg cancels its TooLow timeout, sends an Accept msg, increases its load value, and sets an AwaitingTask timeout. If the AwaitingTask timeout expires, the load value is decreased.
A sender receiving an Accept msg transfers the task and cancels the timeout.
If a sender receives a TooLow msg from a receiver while waiting for an Accept, it sends a TooHigh msg to that receiver.
A sender whose TooHigh timeout expires broadcasts a ChangeAverage msg to all nodes, to increase the average load estimate at the other nodes.
465
Location Policy of Krueger & Finkel's
Algorithm
Receiver-initiated part:
A receiver sends a TooLow msg, sets a TooLow timeout, and starts listening for TooHigh msgs.
A receiver getting a TooHigh msg sends an Accept msg, increases its load value, and sets an AwaitingTask timeout. If it expires, the load value is decreased.
A receiver whose TooLow timeout expires sends a ChangeAverage msg to decrease the average load estimate at the other nodes.
466
4. Adaptive Algorithms
1. A Stable Symmetrically Initiated Algorithm
Utilizes the information gathered during polling to classify
the nodes in the system as either Sender, Receiver or OK.
The knowledge concerning the state of nodes is maintained
by a data structure at each node, comprised of a senders list,
a receivers list, and an OK list.
Initially, each node assumes that every other node is a
receiver.
Transfer Policy
Triggers when a new task originates or when a task departs.
Makes use of two threshold values, i.e. Lower (LT) and Upper
(UT)
Location Policy
Sender-initiated component: Polls the node at the head of
receivers list
Receiver-initiated component: polls in this order - head to tail (senders list), tail to head (OK list), tail to head (receivers list)
Selection Policy: newly arrived tasks (sender-initiated); other approaches (receiver-initiated)
Information Policy: A demand-driven type
467
2. A stable sender-initiated algorithm
Uses the sender-initiated component of the previous stable symmetric algorithm, as follows:
The information at each node J is augmented with a state vector: for every other node i, V(i) = sender, receiver, or OK, depending on whether J knows it is on node i's senders, receivers, or OK list.
Thus J keeps track of which list it belongs to at each other node.
The state vector is kept up-to-date during polling.
The receiver component is as follows:
Whenever a node becomes a receiver, it notifies all misinformed nodes, using its state vector.
468
Selecting a Suitable Load-Sharing Algorithm
Based on the performance trends of LSAs, one can select a load-sharing algorithm that is appropriate to the system under consideration, as follows:
1. If the system under consideration never attains
the high load, sender-initiated algorithms will give
an improved average response time over no load
sharing at all.
2. Stable scheduling algorithms are recommended
for systems that can reach high load. These
algorithms perform better than non-adaptive
algorithms for the following reasons:
469
a. Under sender-initiated algorithms, an overloaded
processor must send inquiry messages delaying the
existing tasks. If an inquiry fails, two overloaded
processors are adversely affected because of
unnecessary message handling. Therefore, the
performance impact of an inquiry is quite severe at high
system loads, where most inquiries fail.
b. Receiver-initiated algorithms remain effective at high
loads but require the use of preemptive task transfers.
Note that preemptive task transfers are expensive
compared to non-preemptive task transfers because
they involve saving and communicating a far more
complicated task state.
3. For a system that experiences a wide range of load fluctuations, the stable symmetrically initiated scheduling algorithm is recommended, because it provides improved performance and stability over the entire spectrum of system loads.
470
4. For a system that experiences wide fluctuations in
load and has a high cost for the migration of partly
executed tasks, stable sender-initiated algorithms
are recommended, as they perform better than
unstable sender-initiated algorithms at all loads,
perform better than receiver-initiated algorithms
over most system loads, and are stable at high
loads.
5. For a system that experiences heterogeneous
work arrival, adaptive stable algorithms are
preferable, as they provide substantial performance
improvement over non-adaptive algorithms.
471
Question bank
1. What are the central issues in load distributing?
2. What are the components of load distributing
algorithm?
3. Differentiate between load balancing & load
sharing.
4. Discuss the Above-average load sharing
algorithm.
5. How will you select a suitable load sharing
algorithm
6. Write short note on (expected any one)
Sender-Initiated Algorithms
Receiver-Initiated Algorithms
Symmetrically Initiated Algorithms
Adaptive Algorithms
472
Unit IV
Chapter 1: Transaction and Concurrency:
Introduction
Transactions
Nested Transactions
Methods of Concurrency Control:
Locks,
Optimistic concurrency control ,
Time Stamp Ordering,
Comparison for concurrency control.
Chapter 2 :Distributed Transactions:
Introduction
Flat and nested Distributed Transactions
Atomic Commit Protocols
Concurrency Control in Distributed Transactions
473
Unit IV: Chapter 1
Introduction :Transaction Concept
Supports daily operations of an
organization
Collection of database operations
Reliably and efficiently processed as
one unit of work
No lost data
No interference among multiple users
Tolerates failures
474
Airline Transaction Example
START TRANSACTION
Display greeting
Get reservation preferences from user
SELECT departure and return flight records
If reservation is acceptable then
UPDATE seats remaining of departure flight
record
UPDATE seats remaining of return flight record
INSERT reservation record
Print ticket if requested
End If
On Error: ROLLBACK
COMMIT
475
Transaction concept
Transaction: Specified by a client as a set of
operations on objects to be performed as an
indivisible unit where the servers manage those
objects.
Goal of transaction: Ensure all the objects managed
by a server remain in a consistent state when
accessed by multiple transactions (client side) and
in the presence of server crashes.
Objects that can be recovered after the server crashes are called recoverable objects.
Objects on server are stored on volatile memory (RAM)
or on persistent memory (disk)
Enhance reliability
Recovery from failures
Record in permanent storage
476
Introduction to this chapter
Focus is on single-server transactions.
A Transaction defines a sequence of server
operations that is guaranteed to be atomic
in the presence of multiple clients and
server crashes.
Nested transaction
Methods of concurrency control
All concurrency control protocols are
based on serial equivalence and are
derived from rules of conflicting operations
477
The banking example
Each account is represented by a remote object whose interface Account provides the following operations:
deposit(amount) - deposit amount in the account
withdraw(amount) - withdraw amount from the account
getBalance() -> amount - return the balance of the account
setBalance(amount) - set the balance of the account to amount
Each branch of the bank is represented by a remote object whose interface Branch provides the following operations:
create(name) -> account - create a new account with a given name
lookUp(name) -> account - return a reference to the account with the given name
branchTotal() -> amount - return the total of all the balances at the branch
The client works on behalf of users: it looks up accounts and then performs operations through the Account interface.
478
Simple Synchronization (without
Transactions)
Multi-threaded banking server :
Main issue: Unless a server is carefully
designed, its operations performed on behalf
of different clients may sometimes interfere
with one another. Such interference may
result in incorrect values in the object.
Client operations can be synchronized without recourse to transactions:
(i) Atomic operations at the server.
(ii) Enhancing Client Cooperation by Signaling
(synchronization of server operations)
479
(i) Atomic operations at the server :
The use of multiple threads is beneficial to the
performance. Multiple threads may access the same
objects.
For example, deposit and withdraw methods: the
actions of two concurrent executions of the methods
could be interleaved arbitrarily and have strange
effects on the instance variables of the account
object.
The synchronized keyword can be applied to a method in Java, so that only one thread at a time can access the object.
E.g. in the Account interface we can declare the method as synchronized:
public synchronized void deposit(int amount) { ... }
480
If one thread invokes a synchronized method on an object, then that object is locked; another thread that invokes one of the synchronized methods will be blocked until the lock is released.
Operations that are free from interference from concurrent operations being performed in other threads are called atomic operations.
The use of synchronized methods in Java is one way of achieving atomic operations; they can also be achieved by any other mutual exclusion mechanism.
481
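Putting the above together, a minimal self-contained Account class whose operations are atomic via synchronized methods might look like this (an illustrative sketch, not code from the textbook):

    // Illustrative sketch: atomic operations via Java synchronized methods.
    public class Account {
        private int balance;

        public Account(int initial) { balance = initial; }

        public synchronized void deposit(int amount) {
            balance += amount;          // the read-modify-write is now atomic
        }

        public synchronized void withdraw(int amount) {
            balance -= amount;
        }

        public synchronized int getBalance() {
            return balance;
        }
    }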
(ii) Enhancing Client Cooperation by
Signaling
Clients may use a server as a means
of sharing some resources. E.g. some
clients update the servers objects
and other clients access them.
However, in some applications,
threads need to communicate and
coordinate their actions.
Producer and Consumer problem.
Wait and Notify actions.
482
Failure model for transactions [Lamport 1981]
Writes to permanent storage may fail
Write nothing or write a wrong value
File storage may decay
Reading bad data can be detected (by checksums)
Servers may crash occasionally
Memory is recovered to the last updated state; recovery then continues using information in permanent storage
No arbitrary failures
Messages may be arbitrarily delayed; a message may be lost or duplicated
483
Transactions
The transaction concept comes originally from database management systems.
Clients require a sequence of separate
requests to a server to be atomic in the
sense that:
They are free from interference by operations
being performed on behalf of other concurrent
clients; and
Either all of the operations must be completed
successfully or they must have no effect at all in
the presence of server crashes.
484
E.g.: A client's banking transaction
A client performs a sequence of operations on particular accounts on behalf of a user.
Consider accounts with names A, B and C. The client looks them up and stores references to them in variables a, b and c of type Account.
Transaction T:
a.withdraw(100);
b.deposit(100);
c.withdraw(200);
b.deposit(200);
This is called an atomic transaction.
485
Two Aspects of Atomicity
All or nothing: A transaction either completes
successfully, and effects of all of its operations are
recorded in the object, or it has no effect at all.
Failure atomicity: effects are atomic even when server
crashes.
Durability: after a transaction has completed successfully, all its effects are saved in permanent storage for later recovery.
Isolation: Each transaction must be performed
without interference from other transactions. The
intermediate effects of a transaction must not be
visible to other transactions.
486
ACID properties of Transaction
Atomicity
A transaction must be all or nothing.
Consistency
A transaction takes the system from one
consistent state to another consistent state
The state during a transaction is invisible to other transactions
Isolation
Serially equivalent or serializable.
Durability
Successful transactions are saved and are recoverable.
487
Use a transaction
Transaction coordinator
Each transaction is created and
managed by a coordinator
Result of a transaction
Success
Aborted
Initiated by client
Initiated by server
488
Operations in Coordinator interface
openTransaction() -> trans;
starts a new transaction and delivers a unique TID trans. This identifier will be used in the other operations in the transaction.
closeTransaction(trans) -> (commit, abort);
ends a transaction: a commit return value indicates that the transaction has committed; an abort return value indicates that it has aborted.
489
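Expressed as a Java interface (an illustrative sketch; TransID and Outcome are invented types, and abortTransaction(trans) - which aborts the transaction - is the third operation of the standard coordinator interface, referenced by the life histories below):

    // Illustrative sketch of the Coordinator interface in Java.
    interface Coordinator {
        TransID openTransaction();                 // returns a fresh, unique TID
        Outcome closeTransaction(TransID trans);   // COMMIT or ABORT
        void abortTransaction(TransID trans);      // client-initiated abort
    }

    class TransID { final long id; TransID(long id) { this.id = id; } }
    enum Outcome { COMMIT, ABORT }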
Transaction life histories
Successful: openTransaction; operation; operation; ...; closeTransaction
Aborted by client: openTransaction; operation; operation; ...; abortTransaction
Aborted by server: openTransaction; operation; operation; server aborts transaction; operation ERROR reported to client
If a transaction aborts for any reason (self-abort or server abort), it must be guaranteed that future transactions will not see its effects, either in the objects or in their copies in permanent storage.
490
Major Issues of Transaction
1. Concurrency Control
2. Recoverability from Abort
491
1. Concurrency control
Problems of concurrent transaction
The lost update problem
Inconsistent retrievals
Conflict in operations
492
The lost update problem
Accounts a, b and c initially have balances of $100, $200 and $300. T transfers an amount from a to b. U transfers an amount from c to b.
Transaction T:
balance = b.getBalance();
b.setBalance(balance*1.1);
a.withdraw(balance/10)
Transaction U:
balance = b.getBalance();
b.setBalance(balance*1.1);
c.withdraw(balance/10)
Interleaving:
T: balance = b.getBalance()    $200
U: balance = b.getBalance()    $200
T: b.setBalance(balance*1.1)   $220
U: b.setBalance(balance*1.1)   $220
T: a.withdraw(balance/10)      $80
U: c.withdraw(balance/10)      $280
The final balance of b should be $242 rather than $220: one of the updates to b is lost.
493
The inconsistent retrievals problem
Accounts a and b both start with $200.
Transaction V:
a.withdraw(100);
b.deposit(100)
Transaction W:
aBranch.branchTotal()
Interleaving:
V: a.withdraw(100);             $100
W: total = a.getBalance()       $100
W: total = total+b.getBalance() $300
W: total = total+c.getBalance() ...
V: b.deposit(100)               $300
W's retrievals are inconsistent: V had performed only its withdrawal part at the time the sum was calculated, so the total for a and b shows $300 instead of $400.
494
How to overcomes these
problems
If these transactions are done one at a time
in some order, then the final result will be
correct.
If we do not want to sacrifice the
concurrency, an interleaving of the
operations of transactions may lead to the
same effect as if the transactions had been
performed one at a time in some order.
We say it is a serially equivalent
interleaving.
495
Serial equivalence
What is serial equivalence?
An interleaving of the operations of
transactions in which the combined
effect is the same as if the
transactions had been performed one
at a time in some order.
Significance
The criterion for correct concurrent
execution
Avoid lost update and inconsistent
retrieval
496
A serially equivalent interleaving of T and U
Transaction T:
balance = b.getBalance()
b.setBalance(balance*1.1)
a.withdraw(balance/10)
Transaction U:
balance = b.getBalance()
b.setBalance(balance*1.1)
c.withdraw(balance/10)
Interleaving:
T: balance = b.getBalance()    $200
T: b.setBalance(balance*1.1)   $220
U: balance = b.getBalance()    $220
U: b.setBalance(balance*1.1)   $242
T: a.withdraw(balance/10)      $80
U: c.withdraw(balance/10)      $278
497
A serially equivalent interleaving of V and W
Transaction V:
a.withdraw(100);
b.deposit(100)
Transaction W:
aBranch.branchTotal()
Interleaving:
V: a.withdraw(100);             $100
V: b.deposit(100)               $300
W: total = a.getBalance()       $100
W: total = total+b.getBalance() $400
W: total = total+c.getBalance() ...
498
Conflicting operations
When we say a pair of operations
conflicts we mean that their combined
effect depends on the order in which
they are executed. E.g. read and write
Serial equivalence of two transactions
All pairs of conflicting operations of the two
transactions be executed in the same order
at all of the objects they both access.
499
Read and write operation conflict rules
Operations of different transactions | Conflict | Reason
read, read   | No  | The effect of a pair of read operations does not depend on the order in which they are executed
read, write  | Yes | The effect of a read and a write operation depends on the order of their execution
write, write | Yes | The effect of a pair of write operations depends on the order of their execution
500
A non-serially equivalent interleaving of operations of
transactions T and U
TransactionT:
x = read(i)
write(i, 10)
write(j, 20)
TransactionU:
y = read(j)
write(j, 30)
z = read (i)
The ordering is not serially equivalent, as the pairs of conflicting operations are not done in the same order at both objects.
Serially equivalence ordering requires one of the following two
conditions:
1. T accesses i before U and T Accesses j before U
2. U accesses i before T and U Accesses j before T
501
Recoverability from aborts
The two problems here are
Dirty reads
Premature writes
502
Dirty Reads
The isolation property of transaction
requires that the transaction do not see
the uncommitted state of the other
transaction.
The dirty read problem is caused by the
interaction between the read operation in
one transaction and an earlier write
operation in another transaction
503
A dirty read when transaction T aborts
Transaction T:
balance = a.getBalance();     $100
a.setBalance(balance+10);     $110
abort transaction;
Transaction U:
balance = a.getBalance();     $110
a.setBalance(balance+20);     $130
commit transaction;
U has committed a value based on T's uncommitted (and subsequently aborted) write.
- Recoverability of transactions: delay commits until after the commitment of any other transaction whose uncommitted state has been observed.
- Cascading aborts: the aborting of one transaction may cause further transactions to be aborted; to avoid this, transactions are only allowed to read objects written by committed transactions.
504
Premature writes
This one is related to the interaction
between the write operations on the same
object belonging to different transactions.
It uses the concept of the 'before image' of a write operation.
505
Overwriting uncommitted values
Account a starts at $100.
Transaction T:
a.setBalance(105);     $105
Transaction U:
a.setBalance(110);     $110
U overwrites T's uncommitted write before T has committed or aborted.
- Strict execution of transactions: the server delays both read and write operations on an object until all transactions that previously wrote that object have either committed or aborted.
- Tentative versions: update operations performed during a transaction are done in tentative versions of objects in volatile memory.
506
Nested Transactions
Several transactions may be started from within a transaction, allowing transactions to be regarded as modules that can be composed as required.
[Figure: T is the top-level transaction, with T1 = openSubTransaction and T2 = openSubTransaction. T1 provisionally commits; its children T11 and T12 (opened with openSubTransaction) both provisionally commit. T2 aborts; its child T21 provisionally commits, and T21's child T211 also provisionally commits.]
A subtransaction appears atomic to its parent with respect to transaction failures and to concurrent access.
Subtransactions at the same level (say T1 and T2) can run concurrently.
507
The advantages of nested transactions
Additional concurrency
Sub transactions at one level may run
concurrently with other subtransactions
at the same level
E.g. concurrent getBalance calls in the branchTotal operation
More robust
Subtransactions can commit or abort
independently
508
The rules for commitment of nested transactions
Transaction commits (aborts) after its children complete
A transaction may commit or abort only after its child transactions have completed.
Child completes: provisional commit or abort
When a subtransaction completes, it makes an independent decision either to commit provisionally or to abort. Its decision to abort is final.
Parent aborts, children abort
When a parent aborts, all of its subtransactions are aborted.
509
The rules for commitment of nested transactions
Child aborts, parent may abort or not
When a subtransaction aborts, the parent can decide whether to abort or not.
Top-level transaction commits, all provisionally committed subtransactions commit
If the top-level transaction commits, then all of the subtransactions that have provisionally committed can commit too, provided that none of their ancestors has aborted.
510
Methods For Concurrent Control
1. Locks
2. Optimistic concurrency control
3. Time stamp ordering
511
1.Locks
A simple example of a serializing
mechanism is the use of exclusive locks.
Server can lock any object that is about to
be used by a client.
If another client wants to access the same
object, it has to wait until the object is
unlocked in the end.
512
Simple exclusive locks
Lock any object that is about to be used by any operation of a client's transaction.
Any other request to the same locked object is suspended until the object is unlocked.
Transaction T:
bal = b.getBalance()
b.setBalance(bal*1.1)
a.withdraw(bal/10)
Transaction U:
bal = b.getBalance()
b.setBalance(bal*1.1)
c.withdraw(bal/10)
Operations and locks:
T: openTransaction
T: bal = b.getBalance()      locks B
T: b.setBalance(bal*1.1)
U: openTransaction
U: bal = b.getBalance()      waits for T's lock on B
T: a.withdraw(bal/10)        locks A
T: closeTransaction          unlocks A, B
U: ...                       locks B
U: b.setBalance(bal*1.1)
U: c.withdraw(bal/10)        locks C
U: closeTransaction          unlocks B, C
513
Two phase locking
To ensure serial equivalence of
any two transactions
A transaction is not allowed any new locks after it has released a lock
Growing phase: acquire locks
Shrinking phase: release locks
Strict two-phase locking
Any locks applied during the progress of a
transaction are held until the transaction
commits or aborts
In fact, a lock between two reads is unnecessary
514
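A compact sketch of strict two-phase locking with exclusive locks only (an invented LockManager class): locks are acquired on first use during the growing phase, and all of them are released in one step at commit or abort, so there is no early shrinking phase.

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Illustrative sketch of strict 2PL with exclusive locks.
    class LockManager {
        private final Map<Object, String> owner = new HashMap<>();      // object -> tid
        private final Map<String, Set<Object>> held = new HashMap<>();  // tid -> objects

        synchronized void lock(String tid, Object obj) throws InterruptedException {
            while (owner.containsKey(obj) && !owner.get(obj).equals(tid)) {
                wait();                              // block until the object is unlocked
            }
            owner.put(obj, tid);                     // growing phase: acquire before access
            held.computeIfAbsent(tid, k -> new HashSet<>()).add(obj);
        }

        synchronized void releaseAll(String tid) {   // called only at commit or abort
            Set<Object> objs = held.remove(tid);
            if (objs != null) {
                for (Object o : objs) owner.remove(o);
                notifyAll();                         // wake transactions waiting for locks
            }
        }
    }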
Two-Phase Locking
Basic 2PL
When a transaction releases a lock, it may not request another lock.
[Figure: the number of locks held grows during phase 1 (from BEGIN to the lock point) and shrinks during phase 2 (to END).]
Conservative 2PL (static 2PL)
A transaction locks all the items it accesses before the transaction begins execution, by pre-declaring its read and write sets.
515
Strict Two-Phase Locking
Strict 2PL: a transaction does not release any of its locks until after it commits or aborts.
This leads to a strict schedule, which simplifies recovery.
[Figure: locks are obtained during the period of data-item use and are all released together at the END of the transaction.]
516
Lock Rules
Lock granularity
Should be as small as possible, to enhance concurrency.
Read lock / write lock
Before accessing an object, acquire its lock first.
Lock compatibility
If a transaction T has already performed a read operation on an object, then a concurrent transaction U must not write that object until T commits or aborts.
If a transaction T has already performed a write operation on an object, then a concurrent transaction U must not read or write that object until T commits or aborts.
517
Lock rules continued
Prevent lost update and
inconsistent retrieval
Promotion of a lock
From read lock to write lock
Promotion cannot be performed if the read lock is shared by another transaction
518
Locking rule for nested transactions
Locks that are acquired by a successful subtransaction are inherited by its parent and ancestors when it completes. Locks are held until the top-level transaction commits or aborts.
Parent transactions are not allowed to run concurrently with their child transactions.
Subtransactions at the same level are allowed to run concurrently.
519
Definition:
Deadlocks
A state in which each member of a group of transactions
is waiting for some other member to release a lock.
Prevention:
Lock all the objects used by a transaction when it starts - not a good approach.
Request locks on objects in a predefined order - premature locking and reduced concurrency.
Detection:
Find a cycle in a wait-for graph, then select a transaction in the cycle for aborting to break it. (The choice of the transaction to be aborted is not simple.)
Timeouts:
Each lock is given a limited period in which it is invulnerable (safe); after the timeout, the lock may be broken.
A transaction is sometimes aborted although there is actually no deadlock.
Choosing an appropriate timeout length is difficult.
520
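Deadlock detection by finding a cycle in the wait-for graph can be sketched with a simple depth-first search (illustrative code; transactions are identified by strings, and an edge T -> U means T waits for a lock held by U):

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Illustrative sketch: cycle detection in a wait-for graph via DFS.
    class WaitForGraph {
        private final Map<String, Set<String>> edges = new HashMap<>();

        void addWait(String waiter, String holder) {
            edges.computeIfAbsent(waiter, k -> new HashSet<>()).add(holder);
        }

        boolean hasDeadlock() {
            Set<String> visiting = new HashSet<>(), done = new HashSet<>();
            for (String t : edges.keySet())
                if (!done.contains(t) && dfs(t, visiting, done)) return true;
            return false;
        }

        private boolean dfs(String t, Set<String> visiting, Set<String> done) {
            if (visiting.contains(t)) return true;   // back edge: cycle (deadlock) found
            if (done.contains(t)) return false;
            visiting.add(t);
            for (String next : edges.getOrDefault(t, Set.of()))
                if (dfs(next, visiting, done)) return true;
            visiting.remove(t);
            done.add(t);
            return false;
        }
    }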
Two schemes for increasing concurrency in locking
Two-version locking:
The setting of exclusive locks is delayed until a transaction commits.
Hierarchic locks:
Mixed-granularity locks are used (e.g., a Branch at the upper level, the Accounts within it below).
521
Lock compatibility for two-version locking
For one object            Lock to be set:
Lock already set          read    write   commit
none                      OK      OK      OK
read                      OK      OK      wait
write                     OK      wait    -
commit                    wait    wait    -
Two-version locking: allows one transaction to write tentative
versions of objects when other transactions read from the
committed version of the same objects.
- read operations are delayed only while the transactions
are being committed rather than during entire execution.
- read operations can cause delay in committing other
transactions.
522
Lock compatibility for hierarchic locks
For one object            Lock to be set:
Lock already set          read    write   I-read  I-write
none                      OK      OK      OK      OK
read                      OK      wait    OK      wait
write                     wait    wait    wait    wait
I-read                    OK      wait    OK      OK
I-write                   wait    wait    OK      OK
523
Drawbacks of locking:
Lock maintenance represents an overhead that
is not present in systems that do not support
concurrent access to shared data.
Deadlock. Deadlock prevention reduces
concurrency. Deadlock detection or timeout not
wholly satisfactory for use in interactive
programs.
To avoid cascading aborts, locks cannot be released until the end of the transaction, which reduces potential concurrency.
524
2. Optimistic concurrency control:
Observation
In most applications the probability of two transactions
accessing the same object is low.
Scheme
No checking while the transaction is executing.
Check for conflicts after the transaction.
Checks are all made at once, so low transaction
execution overhead.
Relies on little interference between transactions
Updates are not applied until closeTransaction
Updates are applied to local copies in a transaction space.
Basic Idea
Transactions are allowed to proceed as though there were
no possibility of conflict with other transactions until the
client completes its task and issues a closeTransaction request.
525
Three Phases of a Transaction
Working phase:
Each transaction has a tentative version of each of
the objects that it updates.
Initially, it is a copy of the most recently committed version
Read are performed on the tentative version
Written values are recorded as tentative version
Read set / write set per transaction.
Validation phase:
Check the conflicts between overlapped
transactions when closeTransaction is issued
Success: commit
Fail: abort
Update phase:
Updates in tentative versions are made permanent.
526
Purpose: Validation of transactions
Transaction number
Each transaction is assigned a transaction number (in ascending sequence) when it enters the validation phase.
Transactions enter the validation phase in order of their transaction numbers.
Transactions commit in order of their transaction numbers.
Since the validation and update phases are short, there is only one transaction in these phases at a time.
Conflict rules
Validation uses the read-write conflict rules to ensure that the
scheduling of a particular transaction is serially equivalent
with respect to all other overlapping transactions.
E.g.: for Tv to be serializable with respect to an overlapping transaction Ti, their operations must conform to the following rules.
527
The validation test on transaction Tv is based on conflicts between operations in pairs of transactions Ti and Tv.
Tv    | Ti    | Rule
write | read  | 1. Ti must not read objects written by Tv
read  | write | 2. Tv must not read objects written by Ti
write | write | 3. Ti must not write objects written by Tv, and Tv must not write objects written by Ti
(Serializability of transaction Tv with respect to transaction Ti)
Note: the validation of a transaction must ensure that rules 1 and 2 are obeyed, by testing for overlaps between the objects of each pair of transactions Tv and Ti.
528
Forms of Validation
1. Backward validation: checks the transaction undergoing validation against other preceding overlapping transactions - those that entered the validation phase before it.
2. Forward validation: checks the transaction undergoing validation against later transactions that are still active (lagging behind in their validation phases).
529
Validation forms of transactions
[Figure: timeline with working, validation and update phases. Earlier committed transactions T1, T2, T3 precede the transaction Tv being validated; the backward form compares Tv against T1-T3, while the forward form compares Tv against the later active transactions (active1, active2).]
Notes:
1. The figure shows the overlapping transactions considered in the validation of a transaction Tv.
2. Time increases from left to right.
3. The earlier committed transactions are T1, T2 and T3.
4. T1 committed before Tv started; T2 and T3 committed before Tv finished its working phase.
5. There are two later active transactions, which have transaction identifiers but not yet transaction numbers.
530
i]. Backward validation
Tests against the previous overlapped transactions:
Rule 1 is satisfied, because the read operations of earlier transactions are not affected by the write operations of the current transaction Tv.
Rule 2 requires that the read set of Tv be compared with the write sets of T2 and T3. If there is an overlap, the validation fails.
To resolve any conflict, abort the transaction undergoing validation.
Transactions that have no read operations (only write operations) need not be checked.
531
Backward validation algorithm
startTn: the biggest transaction number assigned to some other committed transaction at the time when transaction Tv started its working phase.
finishTn: the biggest transaction number assigned to some other committed transaction at the time when Tv entered the validation phase.
boolean valid = true;
for (int Ti = startTn + 1; Ti <= finishTn; Ti++) {
    if (read set of Tv intersects write set of Ti)
        valid = false;
}
Serial equivalence of all committed transactions: since backward validation ensures that Tv commits after all previously committed transactions, all transactions are committed in a serially equivalent order.
532
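An executable version of the loop above, with read and write sets modelled as plain sets of object identifiers (an illustrative sketch):

    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Illustrative sketch of backward validation: writeSets maps a committed
    // transaction's number to its write set of object identifiers.
    class BackwardValidation {
        static boolean validate(Set<String> readSetOfTv,
                                Map<Integer, Set<String>> writeSets,
                                int startTn, int finishTn) {
            for (int ti = startTn + 1; ti <= finishTn; ti++) {
                Set<String> overlap = new HashSet<>(readSetOfTv);
                overlap.retainAll(writeSets.getOrDefault(ti, Set.of()));
                if (!overlap.isEmpty()) return false;   // read/write conflict: abort Tv
            }
            return true;                                // Tv passes validation
        }
    }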
ii]. Forward validation
Tests the still active (but lagging behind), later overlapped transactions
Rule 1:
Write set of the transaction being validated is
compared with the read sets of other overlapping
active transactions (still in working phase).
Rule 2:
Automatically fulfilled, because the active transactions do not write until after Tv has completed.
Forward validation algorithm:
boolean valid = true;
for ( int Tid = active1 ; Tid <= activen; Tid ++)
{
if (write set of Tv intersects read set of Tid)
valid = false;
}
533
Ways to resolve a conflict in
Forward validation
Suspend the validation until a later time,
when the conflicting transactions have
finished.
Some conflicting transactions may have
aborted.
Abort all the conflicting active transactions
and commit the transaction being
validated.
Abort the transaction being validated.
The future conflicting transactions may abort, so the aborting becomes unnecessary.
534
Comparison of forward and backward validation
Backward validation
Overhead of comparison
Read sets are usually larger than write sets, so the comparison in backward validation is heavier than that in forward validation.
Overhead of storage
Storing old write sets until they are no
longer needed.
Forward validation
Overhead of time
To validate a transaction, one must wait until all overlapping active transactions have finished.
535
3:Timestamp Ordering
Basic Idea:
Each transaction has a timestamp (TS) associated with it.
The TS is not necessarily real time; it can be a logical counter.
The TS is unique per transaction.
A new transaction has a larger TS than an older transaction.
Larger-TS transactions wait for smaller-TS transactions; smaller-TS transactions die and restart when confronting larger-TS transactions.
No deadlock.
536
Basic time stamp ordering
Rule
A transaction's request to write an object is valid only if that object was last read and written by earlier transactions.
A transaction's request to read an object is valid only if that object was last written by an earlier transaction.
537
Operation conflicts for timestamp ordering
Rule | Tc    | Ti    | Condition
1    | write | read  | Tc must not write an object that has been read by any Ti where Ti > Tc; this requires Tc >= the maximum read timestamp of the object.
2    | write | write | Tc must not write an object that has been written by any Ti where Ti > Tc; this requires Tc > the write timestamp of the committed object.
3    | read  | write | Tc must not read an object that has been written by any Ti where Ti > Tc; this requires Tc > the write timestamp of the committed object.
538
Timestamp ordering write rule:
Let D be an object and Tc a transaction requesting a write operation.
Based on rules 1 and 2:
if (Tc >= maximum read timestamp on D &&
    Tc > write timestamp on committed version of D)
    perform write operation on tentative version of D with write timestamp Tc
else
    /* write is too late */
    abort transaction Tc
539
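The write rule as runnable Java (a sketch with invented names; timestamps are plain longs):

    // Illustrative sketch of the timestamp-ordering write rule.
    class TimestampedObject {
        long maxReadTs;          // largest timestamp that has read this object
        long committedWriteTs;   // write timestamp of the committed version

        // Returns true if a tentative write is performed, false if Tc must abort.
        boolean write(long tc) {
            if (tc >= maxReadTs && tc > committedWriteTs) {
                createTentativeVersion(tc);     // the write goes to a tentative version
                return true;
            }
            return false;                       // write arrived too late: abort Tc
        }

        private void createTentativeVersion(long writeTs) { /* buffered update */ }
    }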
Write operations and timestamps
[Figure: four cases (a)-(d) of a T3 write, with T1 < T2 < T3 < T4. In (a) and (b), existing committed or tentative versions have smaller timestamps, so a tentative version with write timestamp T3 is inserted; in (c) the tentative version T4 has a larger timestamp and T3's tentative version is inserted before it; in (d) a committed version with timestamp T4 already exists, so transaction T3 aborts.]
540
Timestamp ordering read rule:
A decision is made to accept, to wait, or to reject a read operation requested by Tc on the object D.
Based on rule 3:
if (Tc > write timestamp on committed version of D) {
    let Dselected be the version of D with the maximum write timestamp <= Tc
    if (Dselected is committed)
        perform read operation on the version Dselected
    else
        wait until the transaction that made version Dselected commits or aborts, then reapply the read rule
}
else
    abort transaction Tc
541
Read operations and timestamps
[Figure: four cases of a T3 read, with T1 < T2 < T3 < T4. The key distinguishes committed objects from tentative objects produced by a transaction Ti (with write timestamp Ti). (a) The read proceeds on a committed version. (b) The read proceeds on the committed version selected from several versions. (c) The read waits, because the selected version is still tentative. (d) The only versions have later write timestamps, so the transaction aborts.]
542
Multiversion Timestamp ordering
Basic Idea:
A list of old committed versions, as well as tentative versions, is kept for each object.
Read operations that arrive too late need not be rejected.
The server directs the read operation to the most recent version of the object that is appropriate for the reading transaction.
543
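As a sketch of the version-selection step just described, the following Java fragment picks the version a read by transaction Tc should see; the Version record and the version list are illustrative assumptions, not part of the original slides.

class MvtoSketch {
    // A version of an object: its write timestamp, whether it has committed,
    // and its value. A read waits if the selected version is still tentative.
    record Version(long writeTs, boolean committed, Object value) {}

    // Serve a read by Tc with the version having the largest write
    // timestamp <= Tc; null means no version is visible to Tc.
    static Version selectForRead(java.util.List<Version> versions, long tc) {
        Version best = null;
        for (Version v : versions)
            if (v.writeTs() <= tc && (best == null || v.writeTs() > best.writeTs()))
                best = v;
        return best;
    }
}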
Comparative study of 3 methods
2PL is the most popular choice because
it is simple.
2PL is a pessimistic protocol because it
achieves serializability by restricting
the operation in a transaction.
Timestamp ordering is less pessimistic,
allows operations to execute freely.
Optimistic concurrency control ignores
conflicts during execution, but requires
very elaborate validation.
546
Timestamp ordering vs. two phase locking
Timestamp ordering
Decide the serialization order statically
Better than locking for read-dominated
transactions
Two phase lock
Decide the serialization order dynamically
Better than timestamp ordering for
update-dominated transactions
Both are pessimistic methods
547
Pessimistic methods vs. optimistic methods
Optimistic methods
Efficient when there are few conflicts
A substantial amount of work may have to
be repeated when a transaction is aborted
Pessimistic methods
Less concurrency, but simpler relative to optimistic methods.
548
Question Bank 4
Explain the concepts of Concurrency control and
Recoverability from abort used in the transactions.
Discuss the locking mechanism for the concurrency
control.
Write short notes on:
(i) Nested transactions
(ii) Timestamp ordering
What is the purpose of Validation of Transactions?
Explain the various forms of transaction with suitable
examples.
Illustrate a comparative study of the three methods of
concurrency control with suitable example.
549
Unit IV: Chapter 2
Introduction
In the previous chapter, we discussed transactions that access objects at a single server. In the general case, a transaction will access objects located on different computers.
A distributed transaction accesses objects managed by multiple servers.
The atomicity property requires that either all of the servers involved in the same transaction commit the transaction or all of them abort it. Agreement among the servers is necessary.
Transaction recovery ensures that all objects are recoverable: the values of the objects reflect all changes made by committed transactions and none of those made by aborted ones.
550
Transactions May Need More than
One Server
Begin transaction BookTrip
book a plane from Nagpur
book hotel from Shimla
book rental car from New Delhi
End transaction BookTrip
The Two Phase Commit Protocol is a classic solution.
551
Focus on this Chapter
Distributed transaction
A flat or nested transaction that accesses
objects managed by multiple servers
Atomicity of transaction
All or nothing for all involved servers
Two phase commit
Concurrency control
Serialize locally + serialize globally
Three concurrency control methods with respect to distributed transactions
552
Distributed transactions
(1) Flat transaction
A flat transaction sends out requests to different servers, and each request is completed before the client goes on to the next one (the BookTrip example).
In the figure, transaction T is a flat transaction that invokes operations on objects in servers X, Y and Z.
A flat transaction accesses the servers' objects sequentially.
When servers use locking, a transaction can only be waiting for one object at a time.
553
Distributed transactions
(2) Nested transaction
Here, the top-level transaction can open sub-transactions, and each sub-transaction can open further sub-transactions down to any depth of nesting (a parent-child relationship).
Each child starts after its parent and finishes before it.
Nested transactions allow sub-transactions at the same level to execute concurrently.
In the figure, T1 and T2 are concurrent, as they invoke objects in different servers. Likewise, the four sub-transactions T11, T12, T21 and T22 can run concurrently.
554
Nested banking transaction
T = openTransaction
    openSubTransaction
        a.withdraw(10);
    openSubTransaction
        b.withdraw(20);
    openSubTransaction
        c.deposit(10);
    openSubTransaction
        d.deposit(20);
closeTransaction
[Figure: client transaction T spawns sub-transactions T1..T4 on servers X, Y and Z, performing a.withdraw(10), b.withdraw(20), c.deposit(10) and d.deposit(20).]
Note: if this transaction is structured as a set of four nested transactions, the four requests (two deposits and two withdraws) can run in parallel, and the overall effect can be achieved with better performance than a simple transaction in which the four operations are invoked sequentially.
555
The architecture of distributed transactions
The coordinator ( in any server)
Accept client request
Coordinate behaviors on different
servers
Send result to client
Record a list of references to the
participants
The participant (in every server)
Manages objects accessed by a transaction
Keeps track of all recoverable objects at each server
Cooperates with the coordinator
556
Coordination in Distributed Transactions
Each server has a special participant process. The coordinator process (leader) resides in one of the servers and talks to the transaction and to the participants.
[Figure: the coordination process. (1) A client opens a transaction with the coordinator and receives a TID. (2) Invoking a.method(TID) on a server causes (3) that server's participant to call join(TID, ref) on the coordinator. The client later closes or aborts the transaction via the coordinator.]
557
A distributed (flat) banking transaction
T = openTransaction
    a.withdraw(4);
    c.deposit(4);
    b.withdraw(3);
    d.deposit(3);
closeTransaction
[Figure: the coordinator runs at BranchX; participants at BranchX (a.withdraw(4)), BranchY (b.withdraw(3)) and BranchZ (c.deposit(4); d.deposit(3)) join the transaction.]
Note: when the client invokes an operation such as b.withdraw(), the server B informs the participant at BranchY to join the coordinator. The coordinator is in one of the servers, e.g. BranchX.
558
Working of Coordinator
Servers for a distributed transaction need to coordinate
their actions.
A client starts a transaction by sending an openTransaction
request to a coordinator. The coordinator returns the TID
to the client. The TID must be unique (serverIP and number
unique to that server)
Coordinator is responsible for committing or aborting it.
Each other server in a transaction is a participant.
Participants are responsible for cooperating with the coordinator in carrying out the commit protocol, and each keeps track of all the recoverable objects that it manages.
Each coordinator has a set of references to the participants.
Each participant records a reference to the coordinator.
559
Interface for Coordinator
openTransaction() -> trans;
  Starts a new transaction and delivers a unique TID trans. The TID contains two parts: the identifier (say, the IP address) of the server that created it, and a number unique to that server. The identifier is used in the other operations of the transaction.
join(trans, reference to participant);   /* additional method */
  Informs a coordinator that a new participant has joined the transaction trans.
closeTransaction(trans) -> (commit, abort);
  Ends a transaction: a commit return value indicates that the transaction has committed; an abort return value indicates that it has aborted.
abortTransaction(trans);
  Aborts the transaction.
560
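The interface above can be rendered as a Java sketch; the type names TID and Participant are assumptions for illustration, not part of the original interface.

// Sketch of the coordinator interface described above.
record TID(String serverId, long seqNo) {}   // (server identifier, number unique to that server)
interface Participant {}                     // opaque reference to a participant
interface Coordinator {
    TID openTransaction();                   // starts a transaction, returns a unique TID
    void join(TID trans, Participant p);     // a new participant joins trans
    boolean closeTransaction(TID trans);     // true = committed, false = aborted
    void abortTransaction(TID trans);        // client-initiated abort
}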
Atomic Commit Protocols
Atomic Commitment
When a distributed transaction comes to
an end, either all or none of its
operations are carried out.
Due to atomicity, if one part of a
transaction is aborted, then the whole
transaction must also be aborted.
The Coordinator has the responsibility to
either commit or abort the transaction.
One phase atomic commit protocol
Two phase atomic commit protocol
561
One-phase atomic commit protocol
The protocol
The client requests to end a transaction.
The coordinator communicates the commit or abort request to all of the participants, and keeps repeating the request until all of them have acknowledged that they have carried it out.
The problem
Some servers commit while some servers abort.
How to deal with the situation where some servers decide to abort?
Go for the two-phase atomic commit protocol.
562
Introduction to the two-phase commit protocol
Allows any participant to abort.
First phase:
Each participant votes to commit or abort.
Second phase:
All participants reach the same decision:
If any one participant votes to abort, then all abort.
If all participants vote to commit, then all commit.
The challenge: to work correctly when errors happen (see the failure model).
563
The two-phase commit protocol (working)
When the client requests to abort:
The coordinator informs all participants to abort.
When the client requests to commit:
First phase:
The coordinator asks all participants whether they are prepared to commit.
If a participant is prepared to commit, it saves to permanent storage all of the objects that it has altered in the transaction and replies Yes. Otherwise, it replies No.
Second phase:
The coordinator tells all participants to commit (or abort).
564
Operations for two-phase commit
protocol
Three participant interface methods
canCommit?(trans) -> Yes / No
Call from coordinator to participant to ask
whether it can commit a transaction.
Participant replies with its vote.
doCommit(trans)
Call from coordinator to participant to tell
participant to commit its part of a
transaction.
doAbort(trans)
Call from coordinator to participant to tell
participant to abort its part of a transaction.
565
Operations for two-phase commit
protocol
Two coordinator interface methods
haveCommitted(trans, participant)
Call from participant to coordinator to
confirm that it has committed the
transaction.
getDecision(trans) -> Yes / No
Call from participant to coordinator to ask
for the decision on a transaction after it has
voted Yes but has still had no reply after
some delay. Used to recover from server
crash or delayed messages.
566
The two-phase commit protocol
Phase 1 (voting phase):
1. The coordinator sends a canCommit? request to each of the participants in the transaction.
2. When a participant receives a canCommit? request, it replies with its vote (Yes or No) to the coordinator. Before voting Yes, it prepares to commit by saving objects in permanent storage. If the vote is No, the participant aborts immediately.
567
The two-phase commit protocol
Phase 2 (completion according to outcome of vote):
3. The coordinator collects the votes (including its own).
(a) If there are no failures and all the votes are Yes, the coordinator decides to commit the transaction and sends a doCommit request to each of the participants.
(b) Otherwise, the coordinator decides to abort the transaction and sends doAbort requests to all participants that voted Yes.
4. Participants that voted Yes wait for a doCommit or doAbort request from the coordinator.
568
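A minimal Java sketch of the coordinator side of the two phases, assuming the participant operations from the earlier slide (canCommit?, doCommit, doAbort) are rendered as methods; timeouts and the restriction of doAbort to Yes-voters are omitted for brevity.

import java.util.List;

class TwoPhaseCommitSketch {
    interface Participant2PC {
        boolean canCommit(String trans);   // the participant's vote (true = Yes)
        void doCommit(String trans);
        void doAbort(String trans);
    }
    // Returns true if the transaction committed, false if it aborted.
    static boolean commit(String trans, List<Participant2PC> participants) {
        boolean allYes = true;
        for (Participant2PC p : participants)      // phase 1: collect the votes
            if (!p.canCommit(trans)) { allYes = false; break; }
        for (Participant2PC p : participants)      // phase 2: broadcast the decision
            if (allYes) p.doCommit(trans); else p.doAbort(trans);
        return allYes;
    }
}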
Communication in the two-phase commit protocol
[Figure: coordinator step 1: prepared to commit (waiting for votes), sends canCommit?; participant step 2: prepared to commit (uncertain), replies Yes; coordinator step 3: committed, sends doCommit; participant step 4: committed, replies haveCommitted; coordinator: done.]
569
Timeout actions in the two-phase commit protocol
New processes are created to mask crash failures:
Crashed coordinator and participant processes are replaced by new processes.
Timeouts for the participant:
Timeout while waiting for canCommit?: abort.
Timeout while waiting for doCommit (uncertain status): keep the updates in permanent storage and send a getDecision request to the coordinator.
Timeouts for the coordinator:
Timeout while waiting for the vote results: abort.
Timeout while waiting for haveCommitted: do nothing.
570
The protocol can work correctly without the haveCommitted confirmation.
Performance of two-phase commit
protocol
Provided that all servers and communication
channels do not fail, with N participants
N number of canCommit? Messages and replies
Followed by N doCommit messages
The cost in messages is proportional to 3N
The cost in time is three rounds of messages.
The haveCommitted messages are not counted; the protocol functions correctly without them, since their only role is to enable servers to delete stale coordinator information.
571
Failure of Coordinator
When a participant has voted Yes and is waiting for the coordinator to report on the outcome of the vote, the participant is in an uncertain state. If the coordinator has failed, the participant will not be able to get the decision until the coordinator is replaced, which can result in extensive delays for participants in the uncertain state.
One alternative strategy is to allow the participants to obtain the decision from other participants instead of contacting the coordinator. However, if all participants are in the uncertain state, they will not get a decision.
572
Two-phase commit protocol for Nested transactions
Structure
Top level transaction
Subtransaction at any depth
Parent child relationship
Nested transaction semantics
A subtransaction that completes makes an independent decision either to commit provisionally (a local decision) or to abort.
Parent transaction:
If it aborts, all of its subtransactions abort.
If it commits, aborting subtransactions are excluded.
A two-phase commit protocol is needed for nested transactions: it allows servers of provisionally committed transactions that have crashed to abort them when they recover.
573
Distributed nested transaction
Each (sub)transaction has a coordinator which
has an interface for two operations:
openSubTransaction(trans) -> subTrans
Opens a subtransaction whose parent is trans and returns a unique subtransaction identifier (an extension of its parent's TID).
getStatus(trans) -> committed / aborted / provisional
Asks the coordinator to report on the status of the transaction trans; the return value is one of committed, aborted, provisional.
Each sub-transaction starts after its parent starts and finishes before its parent finishes.
When a subtransaction completes, its provisionally committed updates are not saved in permanent storage.
574
Working of 2PCP for nested
transaction
The two operations above provide the interface for the coordinator of a subtransaction.
It allows it to open further subtransactions.
It allows its subtransactions to enquire about its status.
A client starts by using openTransaction to open a top-level transaction.
This returns a TID for the top-level transaction.
The TID can be used to open a subtransaction.
The subtransaction automatically joins the parent, and a TID for it is returned.
The client finishes a set of nested transactions by calling closeTransaction or abortTransaction on the top-level transaction.
575
Example: 2PC in nested transactions
[Figure: a nested distributed transaction T with subtransactions T1 and T2; T1 has children T11 (abort) and T12 (provisional commit); T2 (abort) has children T21 and T22 (provisional commit). Votes (Yes/No) propagate bottom-up in 2PC.]
576
An example of a nested transaction
Server status:
T11  aborted (at M)
T1   provisional commit (at X)
T12  provisional commit (at N)
T21  provisional commit (at N)
T2   aborted (at Y)
T22  provisional commit (at P)
Client status: transaction T decides whether to commit.
577
Information held by coordinators of nested transactions
Coordinator of   Child subtrans.   Participant           Provisional commit list   Abort list
T                T1, T2            yes                   T1, T12                   T11, T2
T1               T11, T12          yes                   T1, T12                   T11
T2               T21, T22          no (aborted)                                    T2
T11                                no (aborted)                                    T11
T12, T21                           T12 but not T21       T21, T12
T22                                no (parent aborted)   T22
When each sub-transaction was created, it joined its parent sub-transaction.
The coordinator of each parent sub-transaction has a list of its child sub-transactions.
When a nested transaction provisionally commits, it reports its status and the status of its descendants to its parent.
When a nested transaction aborts, it reports abort without giving any information about its descendants.
The top-level transaction receives a list of all sub-transactions, together with their status.
578
Execution process of the two phases
[Figure: the two-phase commit exchange; coordinator step 1: prepared to commit (waiting for votes), sends canCommit?; participant step 2: prepared to commit (uncertain), replies Yes; coordinator step 3: committed, sends doCommit; participant step 4: committed, replies haveCommitted; coordinator: done.]
Note:
Phase I (steps 1 and 2) is conducted on the participants of T, T1 and T12 (in the example); canCommit? is performed in either a hierarchic manner or a flat manner.
Phase II (steps 3 and 4) is the same as for the non-nested case, i.e. the coordinator collects the votes and sends doCommit/doAbort (step 3), and participants make the haveCommitted call in case of commit (step 4).
579
Hierarchic Two-Phase Commit Protocol for Nested Transactions
canCommit?(trans, subTrans) -> Yes / No
Call from a coordinator to the coordinator of a child subtransaction, asking whether it can commit subtransaction subTrans. The first argument, trans, is the transaction identifier of the top-level transaction. The participant replies with its vote, Yes or No.
The coordinator of the top-level transaction sends canCommit? to the coordinators of its immediate child subtransactions. The latter, in turn, pass it on to the coordinators of their child subtransactions.
Each participant collects the replies from its descendants before replying to its parent.
T sends canCommit? messages to T1 (but not T2, which has aborted); T1 sends canCommit? messages to T12 (but not T11).
If a coordinator finds no subtransaction matching the second parameter, it must have crashed, so it replies No.
580
Flat Two-Phase Commit Protocol for Nested Transactions
canCommit?(trans, abortList) -> Yes / No
Call from coordinator to participant, asking whether it can commit a transaction. The participant replies with its vote, Yes or No.
The coordinator of the top-level transaction sends canCommit? messages to the coordinators of all subtransactions in the provisional commit list (e.g., T1 and T12).
If the participant has any provisionally committed subtransactions that are descendants of the transaction with TID trans:
Check that they do not have any aborted ancestors in the abortList, then prepare to commit.
Those with aborted ancestors are aborted.
Send a Yes vote to the coordinator, listing the good subtransactions.
If the participant does not have a provisionally committed descendant, it must have failed after it performed a provisional commit; it sends a No vote to the coordinator.
581
Time-out actions in nested 2PC
With nested transactions, delays can occur in the same three places as before:
when a participant is prepared to commit;
when a participant has finished but has not yet received canCommit?;
when a coordinator is waiting for votes.
A fourth place:
provisionally committed subtransactions of aborted subtransactions, e.g. T22, whose parent T2 has aborted:
use getStatus on the parent, whose coordinator should remain active for a while;
if the parent does not reply, then abort.
582
Concurrency Control in Distributed
Transactions
Concurrency control for distributed transactions: each server applies local concurrency control to its own objects, which ensures transaction serializability locally.
However, the members of a collection of servers of distributed transactions are jointly responsible for ensuring that the transactions are performed in a serially equivalent manner; thus global serializability is required.
583
Methods of concurrency control for distributed transactions
1. Locking
2. Timestamp
3. Optimistic Concurrency control
584
1. Locking
Each participant sets locks on its objects locally (strict two-phase locking).
The lock manager at each server decides whether to grant a lock or make the requesting transaction wait.
Atomic commit protocol:
A server cannot release any locks until it knows that the transaction has been committed or aborted at all servers.
Note: the lock managers in different servers set their locks independently of one another. It is possible that different servers may impose different orderings on transactions.
585
Locking
T                               U
write(A) at X  (locks A)
                                write(B) at Y  (locks B)
read(B) at Y   (waits for U)
                                read(A) at X   (waits for T)
***************************************************
T is before U at server X, and U is before T at server Y. These different orderings can lead to cyclic dependencies between transactions, and a distributed deadlock situation arises.
586
2. Timestamp ordering concurrency control
A globally unique timestamp is issued to the client by the first coordinator accessed by a transaction.
The transaction timestamp is passed to the coordinator at each server.
Each server accesses shared objects according to the timestamp.
Resolution of a conflict:
587
3. Optimistic concurrency control
The validation takes place during the first phase of the two-phase commit protocol.
Commitment deadlock:
T                        U
read(A)  at X            read(B)  at Y
write(A)                 write(B)
read(B)  at Y            read(A)  at X
write(B)                 write(A)
588
Optimistic concurrency control
Parallel validation (Kung & Robinson):
Suitable for distributed transactions.
Write-write conflicts must be checked, as well as write-read conflicts, for backward validation.
The validation order may differ between servers.
Measure 1: a global validation check after the individual servers' validations, to ensure that the combined ordering is serializable.
Measure 2: each server validates according to a globally unique transaction number.
589
Question Bank 5.
Describe the Flat and Nested distributed
transaction. How these are utilized in a
distributed banking transaction?
How can a transaction be completed in an atomic manner? Explain in detail the working of the two-phase commit protocol.
Discuss in detail the concept of the two-phase commit protocol for nested transactions.
How can you achieve concurrency control in distributed transactions?
590
Unit V: Resource Security and
Protection
Chapter 1:Access and Flow
control
Introduction
The Access Matrix Model
Implementation of Access Matrix Model(3)
Safety in the Access Matrix Model
Advanced Models of Protection(3)
Chapter 2: Data Security
Introduction
Modern Cryptography:
Private Key Cryptography,
Public key Cryptography.
591
Chapter 1:Access and Flow control
Introduction
Deals with the control of unauthorized use of software and hardware.
Business applications such as banking require high security and protection during any transaction.
Security techniques should not only prevent the misuse of secret information but also its destruction.
592
Basics
Potential security violations [by Anderson]:
1. Unauthorized information release: an unauthorized person is able to read information, or makes unauthorized use of a computer program.
2. Unauthorized information modification: an unauthorized person is able to modify information, e.g. changing the grade of a university student, or changing account balances in bank databases.
3. Unauthorized denial of service: an unauthorized person should not succeed in preventing an authorized person from accessing information.
593
External vs Internal Security
1. External security:
Also called physical security.
Deals with regulating access to the location of computer systems (e.g. hardware, disks, tapes).
Can be enforced by placing a guard at the door, or by giving a secret key to authorized persons.
The issues to be dealt with are administrative.
2. Internal security:
Deals with the use of computer hardware and software, and the information stored in computer systems.
Requires authentication (e.g. logins).
594
Policies and Mechanisms
Policy
1. What should be done?
2. A policy gives the assignment of access rights to users for various resources.
3. Policies decide which user has access to which resources.
4. Policies can change with time and application.
Mechanism
1. How should it be done?
2. A protection mechanism provides a set of tools that can be used to design or specify a wide array of protection policies.
3. The protection mechanism in an OS controls user access to system resources.
4. A protection scheme must be amenable to a wide variety of policies.
5. Protection is a mechanism; security is a policy.
Separation of policies and mechanisms enhances design flexibility.
595
Protection Domain of a Process
Specifies the resources that a process can access and the types of operation that the process can perform on those resources.
Required for enforcing security.
Allows the process to use only those resources that it requires.
Every process executes in its protection domain, and the protection domain is switched appropriately whenever control jumps from process to process.
Advantage:
Eliminates the possibility of a process violating security maliciously or unintentionally, and increases accountability.
596
Design Principles for a Secure System
[by Saltzer & Schroeder]
1. Economy: the protection mechanism should be economical to develop and use; it should not add high costs to the system.
2. Complete mediation: requires that every request to access an object be checked for the authority to do so.
3. Open design: a protection mechanism should work even if its underlying principles are known to the attacker.
4. Separation of privileges: the protection mechanism requires two keys to unlock a lock.
597
Design Principles (continued)
5. Least privilege: a subject should be given the bare minimum rights needed for completion of its task.
6. Least common mechanism: the portion common to more than one user should be minimized. (Coupling among users represents a potential information path between users, and hence a potential threat to their security.)
7. Acceptability: the protection mechanism must be simple to use.
8. Fail-safe defaults: the default case should mean lack of access.
598
Access Matrix Model
Model proposed by Lampson. Enhanced and Refined
further by Graham, Denning and Harrison.
Protection System consists of mechanism
to control user access for various resources
or
to control information flow.
Basic concept
Objects: the protected entities, O
Subjects: the active entities acting on the objects, S
Rights: the controlled operations subjects can
perform on objects, R
599
Access Matrix Model
3 Components :
1. Current Objects : Finite set (O) of entities to
which access is to be controlled. [Files]
2. Current Subjects: Finite set (S) of entities that
access current objects. E.g subject may be a process.
Subjects themselves can be treated as objects and
can be accessed like an object by other subjects.
[Users]
3. Generic rights: a finite set of generic rights R = {r1, r2, ..., rm} gives the various access rights that subjects can have to objects, e.g. read, write, execute, own, delete.
600
Access Matrix Model cont..
Protection state of a system: the protection state of a system is represented by a triplet (S, O, P), where S is the set of current subjects, O is the set of current objects, and P is the access matrix.
Note: the access matrix has a row for every current subject and a column for every current object.
601
Access Matrix Model cont..
The access matrix P has a row for each subject s and a column for each object o.
P[s, o] is a subset of the set of generic rights R; it denotes the access rights which subject s has to object o.
602
Access Matrix Representing a Protection State
      O1           O2           O3 (S1)    O4 (S2)        O5 (S3)
S1    read, write  own, delete  own        sendmail       recmail
S2    execute      copy         recmail    own            block, wakeup
S3    own          read, write  sendmail   block, wakeup  own
603
Access: A Schematic View
A user requests access operations on objects/resources.
The reference monitor checks the validity of the request and either grants or denies access.
[Access request -> Reference monitor -> Grant / Deny]
604
Access Matrix Model cont
Enforcing a Security Policy
1. A security policy is enforced by validating every user access for appropriate access rights.
2. Every object has a monitor that validates all accesses to that object in the following manner:
(i) A subject s requests an access α to object o.
(ii) The protection system presents the triplet (s, α, o) to the monitor of o.
(iii) The monitor looks into the access rights of s to o. If α belongs to P[s, o], then the access is permitted; else it is denied.
605
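A minimal Java sketch of the monitor check in step (iii), with the access matrix P held as a map; this representation is an assumption for illustration, not part of the original model.

import java.util.Map;
import java.util.Set;

final class ReferenceMonitor {
    // P maps "subject|object" to the set of rights the subject holds on the object.
    private final Map<String, Set<String>> P;

    ReferenceMonitor(Map<String, Set<String>> p) { P = p; }

    // Step (iii): access alpha is permitted iff alpha is in P[s, o].
    boolean check(String s, String alpha, String o) {
        Set<String> rights = P.get(s + "|" + o);
        return rights != null && rights.contains(alpha);
    }
}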
Implementation of Access
Matrix Model
Three Implementations of Access matrix model
1. Capabilities Based
2. Access Control List
3. Lock-key Method
606
Capabilities
The capability-based method corresponds to a row-wise decomposition of the access matrix.
Each subject s is assigned a list of tuples (o, P[s, o]) for all objects o that it is allowed to access. These tuples are known as capabilities.
Typical view of a capability:
[ Object descriptor | Access rights (read, write, execute, etc.) ]
A capability has two fields: the object descriptor, an identifier for the object, and the allowed access rights for the object.
607
Capability lists (grouped by subject)
Access matrix: s1 has r1 on O1 and r2 on O3; s2 has r3 on O2 and r4 on O3; s3 has r5 on O1.
s1: (r1, O1), (r2, O3)
s2: (r3, O2), (r4, O3)
s3: (r5, O1)
608
Capabilities cont..
Possession of a capability is treated as evidence that the user has the authority to access the object in the ways specified in the capability.
At any point of time, a subject is authorized to access only those objects for which it has capabilities.
609
Capability Based Addressing
1. Capabilities can be used as an addressing mechanism by the system, via the object descriptor.
2. The main advantage of using capabilities as an addressing mechanism is that they provide addresses that are context independent (absolute addresses).
3. However, the system must allow the embedding of capabilities in user programs and data structures.
610
Capability Based Addressing cont..
[Figure: a user request to access a word within an object. The address in the program consists of a capability id (which object is to be accessed) and an offset (the relative location of the word within the object). The capability id is used to search the capability list of the user, yielding the access rights and an object descriptor; the object descriptor indexes the object table to find the entry for the object, which holds its base address and length in main memory, and the offset is added to the base.]
611
Capability Based Addressing cont..
A user program issues a request to access a word within an object.
The address contains the capability id of the object and an offset within the object.
The system uses the capability id to search the capability list of the user and locate the capability, which contains the allowed access rights and an object descriptor.
The system checks the access rights.
The object descriptor is used to search the object table and locate the entry for the object.
The object entry contains the base address of the object in main memory.
612
Capability Based Addressing cont..
Two salient features:
1. Relocatability: an object can be relocated anywhere within main memory without changing its capability.
2. Sharing: several programs can share the same object, using different names for it.
Implementation considerations:
1. To maintain forgery-free capabilities, a user should not be able to access (read, modify or construct) a capability.
2. Two ways of implementation:
(i) Tagged approach
(ii) Partitioned approach
613
1. Tagged approach
One or more tag bits are attached to each memory location and each processor register.
The tag indicates whether a memory word or register contains a capability.
If the tag is ON, the information is a capability; otherwise it is ordinary data.
When the tag is ON, the user cannot manipulate the word.
Examples: the Burroughs B6700 and the Rice Research Computer.
614
2. Partitioned approach:
Capabilities and ordinary data are partitioned (stored separately).
Every object has two segments: one for data, the other for capabilities.
The processor has two sets of registers: one for data, the other for capabilities.
Users cannot manipulate the segments and registers storing capabilities.
Examples: the Chicago Magic Number Machine and the Plessey System.
615
Advantages and Drawbacks of Capabilities
Advantages
1. Efficient: validity can be easily tested.
2. Simple: due to the natural correspondence between the structural properties of capabilities and the semantic properties of addressing variables.
3. Flexible: the user can decide which of his addresses contain capabilities.
616
Disadvantages:
1. Control of propagation: a copy of a capability can be passed from one subject to another without the knowledge of the first subject.
2. Review: determining all the subjects that can access a given object is difficult.
3. Revocation of access rights: typically requires destroying the object, which prevents all the undesired subjects from accessing it.
4. Garbage collection: when the capabilities for an object disappear from the system, the object is left inaccessible to users and becomes garbage.
617
II. Access Control List Method
Column-wise decomposition of the access matrix.
Each object o is assigned a list of pairs (s, P[s, o]) for all subjects s that are allowed to access the object. P[s, o] denotes the access rights that subject s has to o.
When a subject s requests access α to object o, the request is executed in the following manner:
1. The system searches the access control list of o to find out whether an entry (s, α) exists for subject s.
2. If it exists, the system checks whether the access is permitted (α ∈ P[s, o]).
3. If yes, access is granted; otherwise an exception is raised.
618
Access control lists (grouped by object)
Access matrix: s1 has r1 on O1 and r2 on O3; s2 has r3 on O2 and r4 on O3; s3 has r5 on O1.
O1: (s1, r1), (s3, r5)
O2: (s2, r3)
O3: (s1, r2), (s2, r4)
619
Schematic of an access control list
Subjects   Access rights
Smith      read, write, execute
Jones      read
Lee        write
Grant      execute
The execution efficiency of the access control list method is poor, because an access control list must be searched for every access to a protected object.
620
Access Control List Method cont..
Main features:
1. Easy revocation: revocation of access rights is simple, fast and efficient; it can be achieved simply by removing the subject's entry from the object's access control list.
2. Easy review of access: it can easily be determined which subjects have access rights to an object.
Implementation considerations:
1. Efficiency of execution: since the access control list must be searched for every access to a protected object, access can be very slow. (Can be mitigated using shadow registers.)
2. Efficiency of storage: the lists may require a huge amount of storage. (Can be mitigated using protection groups.)
621
Lock and Key Method
Subjects possess a set of keys: entries of the form (O, k).
Objects are associated with a set of locks: entries of the form (l, y), e.g. (k, {r1, r2, ...}).
622
Lock Key Method
A hybrid of the capability-based method and the access control list method.
Every subject has a capability list that contains tuples of the form (O, k), indicating that the subject can access object O using key k.
Every object has an access control list that contains tuples of the form (l, y), called lock entries. A lock entry indicates that any subject holding a key matching lock l can access this object in the modes contained in y.
When a subject makes a request to access object o in mode α, the request is executed in the following manner:
1. The system locates the tuple (o, k) in the capability list of the subject. If no such tuple is found, access is not permitted.
2. Otherwise, access is permitted only if there exists a lock entry (l, y) in the access control list of the object o such that k = l and α ∈ y.
623
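The two-step check can be sketched in Java as follows; the record names CapEntry and LockEntry are illustrative assumptions, not part of the original model.

import java.util.List;
import java.util.Set;

class LockKeySketch {
    record CapEntry(String object, long key) {}        // (O, k) in the subject's capability list
    record LockEntry(long lock, Set<String> modes) {}  // (l, y) in the object's access control list

    // Access to object o in mode alpha is allowed iff the subject holds (o, k)
    // and the object's list has an entry (l, y) with k == l and alpha in y.
    static boolean check(List<CapEntry> caps, List<LockEntry> acl, String o, String alpha) {
        for (CapEntry c : caps) {
            if (!c.object().equals(o)) continue;       // step 1: locate (o, k)
            for (LockEntry e : acl)                    // step 2: find a matching lock
                if (e.lock() == c.key() && e.modes().contains(alpha))
                    return true;
        }
        return false;
    }
}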
Comparison of methods
              Capability list   Access Control List   Locks & Keys
propagation   Good (1)          Good (3)              Good
review        Bad               Good                  Bad (4)
revocation    Bad               Good                  Good
reclamation   Bad (2)           Good                  Good
1. needs a copy bit/count for control
2. needs a reference count
3. needs user/hierarchical control
4. needs to know the subject-key mapping
624
Changing The Protection State
Access matrix is itself a protected object
Commands for changing protection state
Set of commands C for changing protection state
defined in the form of the following primitive operations
enter r into P [s, o]
delete r from P [s, o]
create subject s
create object o
destroy subject s
destroy object o
Primitive operations define the changes to be made to the access matrix P.
Example: Primitive operation delete r from P [s, o]
deletes access right r from the position P [s, o] in the
access matrix, i.e., access right r of subject s to object o
is withdrawn
625
Changing The Protection State (cont.)
Before the operation is performed (e.g., the delete in
previous example), a verification should be made
that the process has the right to perform this
operation on the access matrix:
Command syntax:
command < command id > (<formal parameters>)
if < conditions >
then
< list of primitive operations >
end.
Command execution
All checks in the condition part are evaluated. The
<conditions> part has checks in the form r in
P[s,o].
If all checks pass, primitive operations in <list of
primitive operations> are executed.
626
Changing The Protection State (cont.)
All accesses are validated by a mechanism
called a reference monitor: the reference
monitor can reject an access not allowed by the
access matrix.
Each object has an owner
If s is the owner of o, then own ∈ P[s, o]
The owner of an object can give a right to the
object to another subject
Example: command to create a file and assign own and read rights
to it
command create-read (process, file)
create object file
enter own into P [process, file]
enter read into P [process, file]
end.
627
Changing The Protection State (cont.)
Example: command owner of a file gives write
access rights to another process
command confer-write (owner, process, file)
if own ∈ P[owner, file]
then
enter write into P [process, file]
end.
628
Safety in the Access Matrix Model
The AMM is safe if a subject cannot acquire an access right to an object without the consent of the object's owner.
A command may leak right r from a state Q = (S, O, P) if it enters r into a cell of P that did not previously contain r.
The AMM is safe if a subject can determine whether its actions can result in the leakage of a right to unauthorized subjects.
A state Q is unsafe for r if there exists a command that leaks r from Q; otherwise we say Q is safe for r.
Safety is undecidable for general protection systems.
Safety can be decided for mono-operational systems.
629
Mono-Operational Commands
A single primitive operation in a command.
Example: make process p the owner of file g:
command makeowner(p, g)
enter own into P[p, g];
end
Note: mono-operational commands can also be conditional or biconditional.
630
Advanced model of
Protection
1. Take Grant model
2. Bell Lapadula model
3. Lattice model
631
1.Take-Grant Model
Principles:
Uses directed graphs to model access control
Protection state of system represented by
directed graph
More efficient than (sparsely populated)
access matrix.
632
1.Take-Grant Model
Model:
Graph nodes: subjects and objects
An edge from node x to node y indicates that
subject x has an access right to the object y: the
edge is tagged with the corresponding access rights
Access rights
Read (r), write (w), execute (e)
Special access rights for propagating access
rights to other nodes
Take: if node x has the access right take to node y, then subject x can take any access right that y has on another node.
Grant: if node x has the access right grant to node y, then node y can be granted any of the access rights that node x has.
633
Example: take operation
Node x has take access to node y.
Node y has read and write access to node z.
Node x can take the access right read from y and then holds this access right for object z: a directed edge labeled r is added from node x to node z.
[Figure: before, x --take--> y --r,w--> z; after, additionally x --r--> z.]
634
Example: grant operation
Node x has grant access to node y, and also has read and write access to node z.
Node x can grant read access for z to node y: a directed edge labeled r from y to z is added to the graph.
[Figure: before, y <--grant-- x --r,w--> z; after, additionally y --r--> z.]
635
State and state transitions:
The protection state of the system is
represented by the directed graph
System changes state (state transition) when
the directed graph changes
The directed graph changes with the following
operations
Take
Grant
Create: A new node is added to the graph
When node x creates a new node y, a directed edge is
added from x to y
Remove: A node deletes some of its access rights to
another node
636
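A small Java sketch of the protection state as a directed graph, with take and grant as state transitions; the class and method names are assumptions for illustration.

import java.util.*;

final class TakeGrantGraph {
    // Edges carry sets of rights: edges.get(from).get(to) = rights of from on to.
    private final Map<String, Map<String, Set<String>>> edges = new HashMap<>();

    Set<String> rights(String from, String to) {
        return edges.getOrDefault(from, Map.of()).getOrDefault(to, Set.of());
    }
    void addRight(String from, String to, String r) {
        edges.computeIfAbsent(from, k -> new HashMap<>())
             .computeIfAbsent(to, k -> new HashSet<>()).add(r);
    }
    // take: if x has "take" on y, x may copy any right y has on z to itself.
    void take(String x, String y, String z, String r) {
        if (rights(x, y).contains("take") && rights(y, z).contains(r))
            addRight(x, z, r);
    }
    // grant: if x has "grant" on y, x may give y any right x has on z.
    void grant(String x, String y, String z, String r) {
        if (rights(x, y).contains("grant") && rights(x, z).contains(r))
            addRight(y, z, r);
    }
}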
2. Bell-LaPadula Model
Used to control information flow
Model components
Subjects, objects, and access matrix
Several ordered security levels
Each subject has a (maximum) clearance and
a current clearance level
Each object has a classification (i.e., belongs to a security level)
637
Subjects can have the following access rights to objects
Read-only
Append: subject can only write object (no read permitted)
Execute: no read or write
Read-write: both read and write are permitted
Subject that creates an object has control attribute to
that object and is the controller of the object
Subject can pass any of the four access rights of the
controlled object to another subject
Properties for a state to be secure
simple security property (restricts reading up)
the star-property (prohibits writing down)
Tranquility principle
no operation may change the classification of an
active object
638
Bell-LaPadula Model (cont.)
Restrictions on information flow and access control (reading down and writing up properties):
1. The simple security property:
A subject cannot have read access to an object with a classification higher than the clearance level of the subject.
2. The *-property (star property):
A subject has append (i.e., write) access only to objects whose classification (i.e., security level) is higher than or equal to the current security clearance level of the subject.
A subject has read access only to objects whose classification is lower than or equal to the current security clearance level of the subject.
A subject has read-write access only to objects whose classification is equal to the current security clearance level of the subject.
639
Bell-LaPadula Model (cont.)
[Figure: security levels 1 .. n; a subject at level i can write to levels i and above and can read from levels i and below.]
640
Star property indication
[Figure: *-property; a subject with clearance at level i has write (w) access to objects classified at a level n >= i and read-write (r,w) access to objects classified at level i.]
641
3. The Lattice Model
The best-known Information Flow Model
Based upon the concept of lattice whose
mathematical meaning is a structure
consisting of a finite partially ordered set
together with a least upper bound and
greatest lower bound operator on the set.
Lattice is a Directed Acyclic Graph(DAG)
with a single source and sink.
Information is permitted to flow from a lower class to an upper class.
642
The lattice model (continued)
[Figure: the lattice of subsets of {x, y, z}, ordered by inclusion, from the source {} up to the sink {x, y, z}.]
643
The lattice model (continued)
This satisfies the definition of a lattice: there is a single source and a single sink.
The least upper bound of the security classes {x} and {z} is {x, z}, and the greatest lower bound of the security classes {x, y} and {y, z} is {y}.
644
Flow Properties of a Lattice
The can-flow relation → is reflexive, transitive and antisymmetric for all A, B, C ∈ SC (the set of security classes).
Reflexive: A → A.
Information flow from an object to another object in the same class does not violate security.
Transitive: A → B and B → C implies A → C.
This indicates that a valid flow does not necessarily occur only between two classes adjacent to each other in the partial ordering.
Antisymmetric: A → B and B → A implies A = B.
If information can flow back and forth between two objects, they must have the same class.
645
Flow Properties of a Lattice (Contd..)
Two other inherent properties are as follows:
Aggregation: A → C and B → C implies A ∪ B → C.
If information can flow from both A and B to C, the information aggregate of A and B can flow to C.
Separation: A ∪ B → C implies A → C and B → C.
If the information aggregate of A and B can flow to C, information can flow from either A or B to C.
646
Application of the Lattice model
Military security model:
The objects are related to the information which is to be protected.
The objects are ranked (R) in 4 security categories:
Unclassified: least sensitive, e.g. {}
Confidential: single entities, e.g. {x}, {y}, {z}
Secret: next-level combinations, e.g. {x,y}, {y,z}, {x,z}
Top secret: the most sensitive, or highest-level, combination, e.g. {x,y,z}
Note: the ranks can be defined as per the needs of the information flow.
647
Each object is associated with one or more compartments (C).
The compartments are based on subject relevance and enforce the need-to-know rule.
Subjects also have security levels and compartments.
A class is associated with each object such that O = (Ro, Co).
A clearance is associated with each subject such that S = (Rs, Cs).
The dominates relation between the classes of objects and the clearances of subjects defines a partial order that turns out to be a lattice.
648
A lattice for a military security model with two ranks, say unclassified (1) and confidential (2), is given by:
[Figure: the eight classes (1, {}), (1, {p}), (1, {s}), (1, {p,s}), (2, {}), (2, {p}), (2, {s}), (2, {p,s}) ordered into a lattice.]
The largest element is the class (2, {p,s}) and the smallest element is (1, {}).
649
Mode of Information Flow:
Information flow from object x to object y is denoted x → y.
It indicates that the information stored in x is used to derive information transferred to y.
Information flow can be:
an explicit flow, e.g. a direct assignment y := x, where y directly depends on x;
an implicit flow, e.g. an assignment to y under a condition on x, where y conditionally depends on x.
650
Question Bank 6
Explain various implementation of Access
matrix with suitable example .
Explain the Take-grant model of information
flow with suitable example
How the Bell-LaPadula model deals with the
control of information flow.
Explain the Lattice model of information flow
with suitable example.
Write short note on
(i) Protection State
(ii)Safety in the access matrix model
651
Chapter 2: Data Security
Introduction
An unauthorized user can gain access to confidential information.
A user may bypass the protection mechanisms of the system.
Extra protection techniques are needed to ensure that an intruder is unable to understand or make use of any information obtained by wrongful access.
Cryptography can be used for this extra protection: converting a piece of text into cryptic form before storing it on the computer.
652
Model of Cryptography
Terminology:
Plaintext [cleartext, the original message]
Ciphertext [the message in encrypted form]
Encryption [the process of converting plaintext to ciphertext]
Decryption [the process of converting ciphertext to plaintext]
Cryptosystem [a system for encryption and decryption of information]
Symmetric cryptography: the key is the same for both encryption and decryption.
Asymmetric cryptography: the key is not the same for encryption and decryption.
653
General Structure of a Cryptographic System
C = EKe(M), M = DKd(C)
where M = plaintext, C = ciphertext, Ke = encryption key, Kd = decryption key,
EKe = encryption operation using Ke, DKd = decryption operation using Kd,
SI = side information, CA = cryptanalyst.
Potential threats:
1. Ciphertext-only attack
2. Known-plaintext attack
3. Chosen-plaintext attack
654
Design Principles
Shannon's principles (support conventional cryptography):
1. Principle of diffusion: spread the correlation and dependencies among key-string variables over substrings as much as possible, so as to maximize the length of plaintext needed to break the system.
2. Principle of confusion: change the piece of information so that the output has no obvious relation to the input.
Exhaustive search principle (supports modern cryptography):
3. Determination of the key needed to break the system
4. requires an exhaustive search of a very large space.
655
Classification of Cryptographic Systems
Cryptographic systems are classified into:
Conventional systems
Modern (open design) systems:
Private key systems
Public key systems
656
Conventional Cryptography
Based on substitution ciphers.
1. Caesar cipher (number of keys <= 25):
A letter is transformed into the third letter following it in the alphabetical sequence:
E: M -> (M + 3) % 26, where 0 <= M <= 25.
2. Simple substitution (number of keys = 26!, almost > 10^26):
Any permutation of letters can be mapped to the English letters.
Positional correlation is eliminated.
3. Polyalphabetic ciphers (number of keys = (26!)^n):
Use a periodic sequence of n substitution alphabetic ciphers; the system switches among the n substitution ciphers periodically.
E.g. the Vigenère cipher: with the periodic sequence of integers 11, 19, 4, 22, 9, 25, positions 1, 7, 13, ... use shift 11, whereas positions 2, 8, 14, ... use shift 19, and so on.
657
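A minimal Java sketch of the Caesar transformation E: M -> (M + 3) % 26 on the letters A-Z:

class CaesarSketch {
    // Shift each letter three positions forward, wrapping around at Z.
    static String encrypt(String plain) {
        StringBuilder out = new StringBuilder();
        for (char ch : plain.toUpperCase().toCharArray()) {
            if (ch >= 'A' && ch <= 'Z')
                out.append((char) ('A' + (ch - 'A' + 3) % 26));
            else
                out.append(ch);   // leave non-letters unchanged
        }
        return out.toString();
    }
    // encrypt("ATTACK") yields "DWWDFN"
}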
Modern Cryptography
1. Private key cryptography
Based on the Data Encryption Standard developed by IBM.
Two basic operations:
1. Permutation: permutes the bits of a word (to provide diffusion).
2. Substitution: replaces an m-bit input by an n-bit output, with no simple correlation between input and output (to provide confusion). Implemented as:
(i) convert the m-bit input to decimal form;
(ii) the decimal output is permuted to give another decimal number;
(iii) the final decimal output is converted into the n-bit output.
658
Data Encryption Standard [DES]
DES is a block cipher that encrypts 64-bit data blocks using a 56-bit key.
Error detection is provided by adding 8 parity bits.
Basic components:
Plaintext: X
Initial permutation: IP()
Round i: 1 <= i <= 16
32-bit switch: SW()
Inverse initial permutation: IP^-1()
Ciphertext: Y
659
Three steps:
1. The plaintext undergoes an initial permutation (IP), in which the 64 bits of the block are permuted.
2. The permuted block goes through a complex transformation using the key, involving 16 iterations and then a 32-bit switch (SW).
3. The output of step (2) goes through a final permutation, which is the inverse of step (1).
<< The output of step (3) is the ciphertext. >>
[Figure: 64-bit plaintext X -> initial permutation (IP) -> 16 rounds, each using a 48-bit round key Ki derived from the 56-bit key K by key generation (KeyGen) -> inverse of the initial permutation (IP^-1) -> 64-bit ciphertext Y.]
660
Iterative Transformation
The iterative transformation step consists of 16 functionally identical iterations.
Let Li = the left 32-bit half and Ri = the right 32-bit half after the i-th iteration.
Li = Ri-1 and Ri = Li-1 XOR f(Ri-1, Ki), where Ki is the 48-bit round key.
[Figure: one round maps (Li-1, Ri-1) to (Li, Ri) as above.]
661
Steps of f:
1. The 32-bit Ri-1 is expanded to the 48-bit E(Ri-1) by permutation and duplication.
2. An Ex-OR operation is performed between the 48-bit key Ki and E(Ri-1). The 48-bit output is partitioned into 8 blocks S1, S2, ..., S8 of 6 bits each.
3. Each Si, 1 <= i <= 8, is fed into a separate 6-to-4 substitution box (S-box).
4. The 32-bit output of the 8 substitution boxes is fed to a permutation box whose 32-bit output is f.
662
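The round structure Li = Ri-1, Ri = Li-1 XOR f(Ri-1, Ki) can be expressed directly; in this sketch f is passed as a parameter, since the real DES round function involves the expansion, S-boxes and permutation described above.

class FeistelSketch {
    // One Feistel round on 32-bit halves: Li = Ri-1, Ri = Li-1 XOR f(Ri-1, Ki).
    // f stands in for the DES round function (expansion, key XOR, S-boxes, permutation).
    static int[] round(int left, int right, int roundKey,
                       java.util.function.IntBinaryOperator f) {
        int newLeft  = right;                                 // Li = Ri-1
        int newRight = left ^ f.applyAsInt(right, roundKey);  // Ri = Li-1 XOR f(Ri-1, Ki)
        return new int[] { newLeft, newRight };
    }
}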
Decryption
Uses the same algorithm as encryption, with the order of the round keys reversed (Key16, Key15, ..., Key1), based on:
Ri-1 = Li
Li-1 = Ri XOR f(Li, Ki)
For example, the first decryption step undoes the IP^-1 step of encryption, and the third decryption step undoes the permutation IP performed in the first encryption step, yielding the original plaintext block.
663
2. Public Key Cryptography
The encryption procedure E is in the public domain; the decryption procedure D is secret.
The encryption procedure E and the decryption procedure D must satisfy the following properties:
1. For every message M, D(E(M)) = M.
2. E and D can be efficiently applied to any message M.
3. Knowledge of E does not compromise security.
<< It should be impossible to derive D from E. >>
Public key cryptography allows two users to have secure communication even if they have not communicated before.
664
Rivest-Shamir-Adleman Method
Popularly known as the RSA method.
The binary plaintext is divided into blocks. Each block is represented by an integer between 0 and n-1.
The encryption key is a pair (e, n), where e is a positive integer.
A message M is encrypted by raising it to the e-th power modulo n:
C = M^e mod n
The ciphertext C is an integer between 0 and n-1.
Encryption does not increase the length of the plaintext.
The decryption key (d, n) is a pair where d is a positive integer.
665
Rivest-Shamir-Adleman cont..
A ciphertext block C is decrypted by raising it to the d-th power modulo n:
M = C^d mod n
A user X possesses an encryption key (eX, nX) and a decryption key (dX, nX); the encryption key is available in the public domain, but the decryption key is known only to the user.
666
Rivest-Shamir-Adleman cont..
M --(e, n): C = M^e mod n--> C --(d, n): M = C^d mod n--> M
<< (e, n) is the encryption key for the user; (d, n) is the decryption key for the user. >>
667
Determination of Keys
1. Choose two large prime numbers p and q and define n = p * q.
2. p and q should be chosen such that it is practically impossible to determine them by factoring n.
3. Choose any large integer d such that GCD(d, (p-1)*(q-1)) = 1.
4. Compute the integer e such that it is the multiplicative inverse of d modulo (p-1)*(q-1).
668
Example of RSA
Let p = 5 and q = 11, so that n = p x q = 55.
Therefore (p-1) x (q-1) = 40.
Let d = 23, as 23 and 40 are relatively prime, i.e. gcd(23, 40) = 1.
Choose e such that (d x e) mod 40 = 1; note e = 7.
Consider any integer M between 0 and 55 and execute encryption and decryption:
M    M^7        C = M^7 mod 55    C^23              M = C^23 mod 55
8    2097152    2                 8388608           8
9    4782969    4                 70368744177664    9
669
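The worked example can be checked with a few lines of Java using BigInteger; the numbers are the toy values from the slide, not secure parameters.

import java.math.BigInteger;

public class RsaToy {
    public static void main(String[] args) {
        BigInteger n = BigInteger.valueOf(55);   // n = 5 * 11
        BigInteger e = BigInteger.valueOf(7);    // public exponent
        BigInteger d = BigInteger.valueOf(23);   // private exponent
        BigInteger m = BigInteger.valueOf(9);    // plaintext block, 0 <= m < n

        BigInteger c = m.modPow(e, n);           // C = M^e mod n  -> 4
        BigInteger r = c.modPow(d, n);           // M = C^d mod n  -> 9
        System.out.println(c + " " + r);         // prints: 4 9
    }
}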
Question Bank 7
What do you mean by data security?
Explain in detail the model of Cryptography.
Explain the concept of Public Key
Cryptography with suitable example.
Explain the concept of Private Key
Cryptography with suitable examples.
Write a note on Data encryption standards.
Discuss the Rivest Shamir Adleman method
with suitable example.
670
Unit VI : Application of Dist.
System
Chapter 1 : Distributed Multimedia Systems:
Introduction
Characteristics of multimedia system
Quality of Service Management
Resource Management
Stream Adaptation
Case Study
Chapter 2: Designing Distributed Systems (Google Case Study)
Introducing the case study: Google
Overall architecture and design paradigm
Communication paradigm
Data storage and coordination services
Distributed computation services
671
Chapter 1: Distributed Multimedia Systems
Introduction:
Modern computers can handle streams of continuous, time-based data such as digital audio and video.
This capability has led to the development of distributed multimedia applications.
The requirements of multimedia applications differ significantly from those of traditional real-time applications:
Multimedia applications are highly distributed, and therefore compete with other distributed applications for network bandwidth and computing resources.
The resource requirements of multimedia applications are dynamic.
672
A distributed multimedia system
[Figure: a video camera and microphone on a local network, connected through a wide-area gateway to another local network with a video server and a digital TV/radio server.]
The figure illustrates a typical distributed multimedia system, capable of supporting a variety of applications:
Non-interactive applications: net radio and TV, video-on-demand, e-learning, ... (may be one-way communication)
673
Interactive applications: voice and video conferencing, ...
Characteristics of multimedia applications
The basic requirement of such systems:
Timely delivery of streams of multimedia data (audio samples, video frames) to end users.
To meet the timing requirements: QoS (quality of service) management.
Different from traditional real-time systems.
674
Typical multimedia applications without QoS management
Web-based multimedia:
Provides best-effort access to streams of audio/video data via the web.
Extensive buffering affects the performance.
Effective when there is little need for the synchronization of data streams.
Network phone and audio conferencing:
Requires relatively low bandwidth.
Efficient compression techniques.
High interactive latency.
Video-on-demand services:
Supply video information in digital form, from large online storage systems to the user's display.
Require sufficient dedicated network bandwidth.
Assume that the video server and the receiving stations are dedicated.
675
QoS management
Traditional real-time system
E.g. avionics, air traffic control, telephone
switching
Small quantities of data, strict time requirement
QoS management: fixed schedule that ensures
worst-case requirements are always met
Different requirements of multimedia
app.
General environment
Compete with other distributed app. for network
bandwidth, computing resource
Dynamic resource requirements
E.g. the number of participants of a video conference
may vary
Users participate in the control of resource consumption.
676
Highly Interactive Applications
Examples
Videoconference (cooperative, involves several users)
Distributed online ensemble (synchronous, close
coordination)
Requirements
Low-latency communication
round-trip delays of 100-300 ms are required for the interaction between users to be synchronous
Synchronous distributed state
If one user stops a video on a given frame, the other users
should see it stopped at the same frame
Media synchronization
All participants in a music performance should hear the
performance at approximately the same time
External synchronization (with other formats):
Sometimes, other information needs to be synchronized with the time-based multimedia streams.
Expecting rigorous QoS management
677
The Window of Scarcity
Many of today's computer systems provide some capacity to handle multimedia data, but the necessary resources are very limited.
Especially when dealing with large audio and video streams, many systems are constrained in the quantity and quality of the streams they can support.
This situation is depicted as the window of scarcity.
678
The window of scarcity for computing and communication
[Figure: the resource requirements of application classes (remote login, network file access, high-quality audio, interactive video) against system capacity from 1980 to 2000, showing regions of insufficient, scarce and abundant resources. A history of computer systems that support distributed data access.]
679
The Window of Scarcity (continued)
If a certain class of application lies within this window, a system needs to allocate and schedule its resources carefully in order to provide the desired service.
Before the window of scarcity is reached, a system has insufficient resources to execute the relevant applications.
Once an application class has left the window of scarcity, system performance will be sufficient to provide the service even under adverse circumstances.
680
Characteristics of
Multimedia data
Multimedia data (video and audio) is continuous and
time-based.
Continuous data is represented as sequence of
discrete values that replace each other over time.
Refer to the users view of the data
Video: a image array is replaced 25 times per second
Audio: the amplitude value is replaced 8000 times per
second
Time-based (or isochronous data) is so called
because timed data elements in audio and video
streams define the semantics or content of the
stream.
The times at which the values are played affect the validity of the data; hence, the timing must be preserved.
The delivery delay for each element is bounded by a maximum value.
681
Multimedia data is often bulky; hence it must be moved with high throughput.
[Table: typical data rates and frame/sample frequencies]
The resource bandwidth requirements for some are
very large especially for video of reasonable quality.
A standard TV/video stream requires more than 120 Mbps.
The figures for HDTV are even higher, and in video conferencing there is a need to handle multiple concurrent streams.
Data compression
Reduce bandwidth requirements by factors
between 10 and 100.
Available in various formats like GIF, TIFF,
JPEG, MPEG-1, MPEG-2, MPEG-4.
It imposes substantial additional loads on
processing resources at the source and
destination.
E.g. the video and audio coders/ decoders found on
video cards.
The compression methods in MPEG video formats are asymmetric, with a complex compression algorithm but simpler decompression algorithms.
The variety of modern devices also requires transcoding approaches for data compression and decompression to maintain the quality of the digital data.
683
QoS Management
When multimedia run in networks of PCs, they compete
for resources at workstations running the applications and
in the network.
In a multi-tasking operating system, the central processor is allocated to individual tasks in a round-robin or other scheduling scheme.
The key feature of these schemes is that they handle
increases in demand by spreading the available resources
more thinly between the competing tasks.
The timely processing and transmission of multimedia streams is crucial. In order to achieve timely delivery, applications need guarantees that the necessary resources will be allocated and scheduled at the required times.
The management and allocation of resources to provide such guarantees is referred to as Quality of Service management (QoS management).
684
QoS management is based on
Architecture of a typical system
Provides infrastructure for various components
of multimedia applications
Source
Stream processors
Connections
Network connection
In-memory transfer
Target
Each process must be allocated adequate CPU time, memory
capacity and network bandwidth
Resource requirement
Provides QoS specifications for components of
multimedia applications
QoS Manager
685
Typical infrastructure components
for multimedia applications
[Figure: two PC/workstations, each with a window system, connected by network connections; cameras and microphones feed codecs and a mixer, a video file system with a codec serves a video store, and screens display the resulting multimedia streams.]
White boxes represent media processing components, many of which are implemented in software, including:
codec: coding/decoding filter
mixer: sound-mixing component
686
The above figure shows the most commonly
used abstract architecture for multimedia
software.
Continuously flowing streams of media data elements are processed by a collection of processes and transferred between those processes by inter-process connections.
The processes produce, transform and consume continuous streams of multimedia data.
The connections link the processes in a sequence
from a source of media elements to a target.
For the elements of multimedia data to arrive at their target on time, each process must be allocated adequate resources to perform its task and must be scheduled to use the resources sufficiently frequently to enable it to deliver the data elements in its stream to the next process on time.
687
QoS specifications for components

Component          | Bandwidth                               | Latency     | Loss rate | Resources required
Camera             | Out: 10 frames/sec, raw video,          | Zero        |           |
                   | 640x480x16 bits                         |             |           |
Codec              | In: 10 frames/sec, raw video;           | Interactive | Low       | 10 ms CPU each 100 ms;
                   | Out: MPEG-1 stream                      |             |           | 10 Mbytes RAM
Mixer              | In: 2 x 44 kbps audio;                  | Interactive | Very low  | 1 ms CPU each 100 ms;
                   | Out: 1 x 44 kbps audio                  |             |           | 1 Mbyte RAM
Window system      | In: various;                            | Interactive | Low       | 5 ms CPU each 100 ms;
                   | Out: 50 frames/sec framebuffer          |             |           | 5 Mbytes RAM
Network connection | In/Out: MPEG-1 stream, approx. 1.5 Mbps | Interactive | Low       | 1.5 Mbps, low-loss stream protocol
Network connection | In/Out: audio 44 kbps                   | Interactive | Very low  | 44 kbps, very-low-loss stream protocol
The above table sets out the resource requirements for the main software components and network connections in the previous figure.
The required resources can be guaranteed only if there is a system component responsible for the allocation and scheduling of those resources.
688
QoS Manager's Tasks
The QoS Manager's two main subtasks are:
Quality of service negotiation
Applications specify their resource requirements
The QoS manager evaluates the feasibility and gives a positive or negative response
Admission control
Applications run under a resource contract
Released resources are recycled
689
The QoS manager's task (flowchart)
QoS negotiation:
1. Application components specify their QoS requirements to the QoS manager (as a flow spec).
2. The QoS manager evaluates the new requirements against the available resources. Sufficient?
Yes: reserve the requested resources, issue a resource contract, and allow the application to proceed; the application then runs with resources as per the resource contract.
No: negotiate a reduced resource provision with the application. Agreement? Yes: reserve as above. No: do not allow the application to proceed.
Admission control:
While running, the application notifies the QoS manager of increased resource requirements, re-entering negotiation.
690
QoS Negotiation
The application indicates its resource
requirements to the QoS manager.
To negotiate QoS between an application and its underlying system, the application must specify its QoS requirements to the QoS manager.
This is done by transmitting a set of
parameters.
691
QoS Negotiation Parameters
Bandwidth: The rate at which data flows
through a multimedia stream.
Latency: It is the time required for an
individual data element to move through a
stream from the source to the destination.
Loss Rate: The rate at which the data elements
are dropped due to untimely delivery.
692
Uses of the resource requirements spec.
Describe a multimedia stream
Describe the characteristics of a
multimedia stream in a particular
environment
E.g. a video conference
Bandwidth: 1.5Mbps; delay: 150ms, loss rate:
1%
Describe the resources
Describe the capabilities of resources to
transport a stream
E.g. a network may provide
Bandwidth: 64kbps; delay: 10ms; loss rate:
693
Specify the QoS parameters for streams
Bandwidth
Specified as minimum-maximum
value or average value
Required bandwidth varies according to the compression rate of the video, e.g. 1:50 to 1:100 for MPEG video
Specify burstiness
Different traffic patterns of streams with the
same average bandwidth
LBAP (linear-bounded arrival process) model: over any interval t, the amount of data does not exceed Rt + B, where R is the rate and B is the maximum burst size
694
Specify the QoS parameters for streams (2)
Latency
The frames of a stream should be processed at the same rate at which frames arrive
Latency below human perception: e.g. 150 ms for interactive apps, 500 ms for video-on-demand
No jitter
Jitter: the variation in the period between the delivery
of two adjacent frames
Loss rate
Typically expressed as a probability
Calculated based on worst-case assumptions or on standard distributions
695
Traffic Shaping
Traffic shaping is the term used to describe the use of output buffering to smooth the flow of data elements.
The bandwidth parameter of a multimedia stream provides an idealized approximation of the actual traffic pattern.
The closer the actual pattern matches the
description, the better the system will
handle the traffic.
696
LBAP Model of bandwidth variations
This calls for the regulation of burstiness
of the multimedia streams.
Any stream can be regulated by inserting
a buffer at the source and by defining a
method by which data elements leave the
buffer.
This can be illustrated using the following algorithms:
Leaky Bucket
Token Bucket
697
Leaky Bucket Algorithm
The bucket can be filled arbitrarily
with water until it is full. Through a
leak at the bottom of the bucket water
will flow out.
The algorithm ensures that a stream will never flow at a rate higher than R.
The size of the buffer B defines the maximum burst a stream can incur without losing elements.
This algorithm completely eliminates bursts.
698
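To make the mechanism concrete, here is a minimal leaky-bucket shaper sketched in Python; the class and parameter names are illustrative assumptions, not taken from the text, and time is modelled as discrete ticks.

```python
from collections import deque

class LeakyBucket:
    """Minimal leaky-bucket shaper: elements enter a buffer of size B
    and drain at a constant rate R, eliminating bursts entirely."""
    def __init__(self, rate_per_tick, capacity):
        self.rate = rate_per_tick   # R: elements released per tick
        self.capacity = capacity    # B: buffer size
        self.queue = deque()

    def arrive(self, element):
        """Admit an element if the buffer has room."""
        if len(self.queue) < self.capacity:
            self.queue.append(element)
            return True
        return False                # buffer full: element is lost

    def tick(self):
        """Called once per time unit; emits at most R elements."""
        out = []
        for _ in range(min(self.rate, len(self.queue))):
            out.append(self.queue.popleft())
        return out
```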
Token Bucket Algorithm
The elimination of bursts in the previous algorithm is not necessary as long as bandwidth is bounded over any time interval.
The token bucket algorithm
allows larger bursts to occur
when the stream has been idle
for a while.
Tokens are generated at a rate R
and collected in a bucket of size
B. Data can be sent only when at least S tokens are in the bucket.
This ensures that over any
interval t the amount of data
sent is not larger than Rt+B
699
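A matching sketch of the token-bucket regulator, again assuming discrete time ticks and illustrative names; starting with a full bucket is what permits an initial burst of up to B units.

```python
class TokenBucket:
    """Minimal token-bucket regulator: tokens accumulate at rate R up
    to bucket size B; sending consumes tokens, so over any interval t
    the data sent is bounded by R*t + B."""
    def __init__(self, rate_per_tick, bucket_size):
        self.rate = rate_per_tick
        self.bucket_size = bucket_size
        self.tokens = bucket_size   # start full: allows an initial burst

    def tick(self):
        """Called once per time unit; tokens never exceed B."""
        self.tokens = min(self.bucket_size, self.tokens + self.rate)

    def try_send(self, size):
        """Send 'size' units if enough tokens are available."""
        if size <= self.tokens:
            self.tokens -= size
            return True
        return False                # caller must wait or drop
```

Calling tick() once per time unit and try_send() before each transmission keeps the data sent over any interval t bounded by Rt + B.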
The RFC 1363 Flow Spec
Protocol version
Maximum transmission unit
Bandwidth:
Token bucket rate
Token bucket size
Maximum transmission rate
Delay:
Minimum delay noticed
Maximum delay variation
Loss:
Loss sensitivity
Burst loss sensitivity
Loss interval
Quality of guarantee
700
Flow Specifications
A collection of QoS parameters is
typically known as a flow
specification, or flow spec for short.
Several examples of flow specs exist. In Internet RFC 1363, a flow spec is defined as a set of 16-bit numeric values reflecting the QoS parameters.
701
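As an illustration only, the following Python dataclass mirrors the RFC 1363 field list shown above. The field names, types and unit comments are assumptions for readability; the actual RFC packs these as numeric values in a fixed wire layout.

```python
from dataclasses import dataclass

@dataclass
class FlowSpec:
    """Sketch of an RFC 1363-style flow spec; field names follow the
    slide above, units are illustrative assumptions."""
    protocol_version: int
    max_transmission_unit: int   # bytes
    # Bandwidth:
    token_bucket_rate: int       # bytes/sec
    token_bucket_size: int       # bytes
    max_transmission_rate: int   # bytes/sec
    # Delay:
    min_delay_noticed: int       # ms
    max_delay_variation: int     # ms (jitter bound)
    # Loss:
    loss_sensitivity: int
    burst_loss_sensitivity: int
    loss_interval: int
    quality_of_guarantee: int
```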
QoS Admission Control
Admission control regulates access to
resources to avoid resource overload.
It protects resources from requests that they cannot fulfil.
An admission control scheme is
based on the overall system capacity
and the load generated by each
application.
702
QoS Admission Control
Bandwidth reservation:
A common way to ensure a certain QoS level
for a multimedia stream is to reserve some
portion of resource bandwidth for its exclusive
use.
Used for applications that cannot adapt to
different QoS levels, e.g. x-ray video.
Statistical multiplexing:
Reserve minimum or average bandwidth.
Handle bursts that occasionally cause some drop in service level.
Hypothesis: with a large number of streams, the aggregate bandwidth required remains nearly constant regardless of the bandwidth of individual streams.
703
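A minimal sketch of bandwidth-reservation admission control, assuming a single resource with fixed capacity and illustrative names; a real QoS manager would evaluate full flow specs rather than a single bandwidth figure.

```python
class AdmissionController:
    """Sketch of bandwidth reservation: admit a new stream only if its
    requested bandwidth still fits within the remaining capacity."""
    def __init__(self, capacity_kbps):
        self.capacity = capacity_kbps
        self.reserved = 0

    def request(self, bandwidth_kbps):
        """Grant or refuse a reservation (the resource contract)."""
        if self.reserved + bandwidth_kbps <= self.capacity:
            self.reserved += bandwidth_kbps
            return True
        return False     # refuse, or negotiate a reduced QoS level

    def release(self, bandwidth_kbps):
        self.reserved -= bandwidth_kbps   # recycle released resources
```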
Resource Management
To provide a certain QoS level to an application, a system needs to have sufficient resources; it also needs to make the resources available to the application when they are needed (scheduling).
Resource scheduling: processes need to have resources assigned to them according to their priority. The following two methods are used:
Fair Scheduling
Round-robin
Packet-by-packet
Bit-by-bit
Weighted fair queuing
Real-time scheduling
Earliest-deadline-first (EDF)
704
(i)Fair Scheduling
If several streams compete for the same resource, it becomes necessary to consider fairness and to prevent ill-behaved streams from taking too much bandwidth.
A straightforward approach is to apply round-robin scheduling to all streams in the same class, to ensure fairness.
Nagle introduced a method on a packet-by-packet basis that provides more fairness w.r.t. varying packet sizes and arrival times. This is called Fair Queuing.
705
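The following sketch illustrates the simple round-robin variant described above (one packet per stream per round); it is not Nagle's packet-by-packet fair queuing, which additionally accounts for packet sizes and arrival times.

```python
from collections import deque

def round_robin(streams):
    """Visit each stream's queue in turn, serving one packet per visit,
    so no single stream can monopolize the link."""
    queues = [deque(s) for s in streams]
    served = []
    while any(queues):
        for q in queues:
            if q:
                served.append(q.popleft())
    return served

# e.g. round_robin([["a1", "a2", "a3"], ["b1"], ["c1", "c2"]])
# -> ['a1', 'b1', 'c1', 'a2', 'c2', 'a3']
```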
(ii)Real-time scheduling
Several algorithms were developed to meet the CPU scheduling needs of applications. Traditional real-time scheduling methods suit the model of regular continuous multimedia streams very well.
An Earliest-Deadline-First (EDF) scheduler uses the deadline associated with each of its work items to determine the next item: the item with the earliest deadline runs first.
706
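A minimal EDF sketch, assuming each work item carries a numeric deadline; names and example data are invented for illustration.

```python
import heapq

def edf_schedule(items):
    """Earliest-Deadline-First: repeatedly run the pending work item
    whose deadline is closest. 'items' is a list of (name, deadline)."""
    heap = [(deadline, name) for name, deadline in items]
    heapq.heapify(heap)
    order = []
    while heap:
        deadline, name = heapq.heappop(heap)
        order.append(name)   # earliest deadline runs first
    return order

# e.g. edf_schedule([("frame-b", 40), ("frame-a", 20), ("audio", 30)])
# -> ['frame-a', 'audio', 'frame-b']
```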
Stream Adaptation
The simplest form of adaptation when QoS cannot be guaranteed is adjusting a stream's performance by dropping pieces of information.
Two methodologies are used:
Scaling
Filtering
707
Scaling
Best applied when live streams are sampled.
Scaling algorithms are media-dependent, although the overall scaling approach is the same: to subsample a given signal.
A system to perform scaling consists of a monitor process at the target and a scaler process at the source.
The monitor keeps track of the arrival times of messages in a stream. Delayed messages are an indication of a bottleneck in the system.
The monitor then sends a scale-down message to the source; once the bottleneck clears, the stream can be scaled up again.
708
Filtering
It is a method that provides the best possible QoS to each target by applying scaling at each relevant node on the path from the source to the target.
Filtering requires that a stream be
partitioned into a set of hierarchical
substreams, each adding a higher level of
quality.
A substream is not filtered at an intermediate node if somewhere downstream a path exists that can carry the entire substream.
709
Case study: The Tiger video
file server
A video storage system that supplies
multiple real-time video streams
simultaneously is an important
component in supporting consumer-oriented multimedia applications.
One of the most advanced prototypes of these is the Tiger video file server.
710
Design goals
Video-on-demand for a large number of
users
A large stored digital movie library
Delay of receiving the first frame is within a few seconds
Users can perform pause, rewind, fast-forward
Quality of service
Constant rate
a maximum jitter and low loss rate
Scalable and distributed
Support up to 10000 clients simultaneously
Low-cost hardware
Constructed from commodity PCs
Fault tolerant
Tolerant to the failure of any single server or disk
711
System architecture
One controller
Connects with each server (cub) over a low-bandwidth network
Cubs: the server group
Each cub has a number of disks (2-4) attached
Cubs are connected to clients by an ATM network
[Figure: Tiger system architecture -- a controller linked to cubs 0..n over a low-bandwidth network; the cubs distribute video to clients over a high-bandwidth ATM switching network, and clients send start/stop requests.]
712
Storage organization
Striping
A movie is divided into blocks
The blocks of a movie are stored on disks attached to different cubs, in disk-number sequence
Delivering a movie: deliver the blocks of the movie from the different disks, in sequence
Load-balances the delivery of hotspot movies
Mirroring
Each block is divided into several portions
(secondaries)
The secondaries are stored on the successor disks:
If a block is on disk i, then its secondaries are stored on disks i+1 to i+d
713
Distributed Schedule
Slot
The work to be done to play one block of a movie
Deliver a stream
Deliver the blocks of the stream disk by disk
Can be viewed as a slot moving along disks step by step
Deliver multiple streams
Multiple slots moving along disks step by step
Viewer state
Network address of client
File ID for current movie
Number of next block
Viewer's next play slot
[Figure: the distributed schedule -- slots 0..7 move along the disks; slots 1, 2 and 6 are free, the others hold viewer state (viewers 0-4); block play time T and block service time t are marked.]
714
Distributed schedule (contd.)
Block play time - T
The time that will be required for a viewer to display a
block on the client computer
Typically about 1 second for all streams
The next block of a stream must begin to be delivered T after the current block began to be delivered
Block service time t ( a slot )
Read the next block into buffer
Deliver it to the client
Update viewer state in the schedule and pass the
updated slot to the next cub
T/t typically results in a value > 4
The maximum number of streams the Tiger system can support simultaneously (a worked example follows below):
T/t * the number of disks
715
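A worked instance of this bound, using purely illustrative values for T, t and the disk count (real Tiger figures depend on the hardware configuration):

```python
# Illustrative capacity calculation for a Tiger-style schedule.
# Assumed values: block play time T = 1.0 s, block service time
# t = 0.25 s, and 14 cubs with 4 disks each (as in the 1997 setup).
T = 1.0                      # seconds a client takes to play one block
t = 0.25                     # seconds a disk needs to service one slot
disks = 14 * 4               # total disks across all cubs
slots_per_disk = int(T / t)  # 4 slots fit into one block play time
max_streams = slots_per_disk * disks
print(max_streams)           # 224 streams under these assumed values
```

The service time actually achieved determines the real figure; the text notes only that T/t is typically greater than 4.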
Performance and scalability
Initial prototype [1994]
5 x cubs: 133MHz Pentium PCs(48M RAM,
2G SCSI disk, Windows NT), ATM network
68 simultaneous streams with perfect quality
With one cub failed, the loss rate was 0.02%
14 cubs: each 4 disks, ATM network
[1997]
602 simultaneous streams ( 2Mbps)
Loss rate < 0.01%; with one cub failed,
loss rate < 0.04%
The designers suggested that Tiger
could be scaled to 1000 cubs
supporting 30,000 clients.
716
Question bank
Explain the quality of service management and
resource management in multimedia applications.
Discuss the importance of Quality of service
negotiation and Admission control in the multimedia
applications.
What are the characteristics of multimedia streams?
Explain the impacts of Scaling and Filtering on
Stream adaptation
What is the purpose of Traffic shaping? What are
various approaches to avoid bursting of stream ?
Discuss the impact of distributed multimedia concepts in the Tiger video file server.
717
Chapter 2 :Google Case study
Google is a US-based corporation with its headquarters in Mountain View, CA, offering Internet search and broader web applications and earning revenue largely from advertising associated with such services.
The name is a play on the word googol, the number 10^100 (1 followed by a hundred zeros), emphasizing the sheer scale of information on the Internet today.
Google was born out of a research project at Stanford, with the company launched in 1998.
718
Google Distributed System: Design
Strategy
Google has diversified: as well as providing a search engine, it is now a major player in cloud computing.
It handled some 88 billion queries a month by the end of 2010.
Users can expect query results in around 0.2 seconds.
It performs well against the key requirements of scalability, reliability, performance and openness.
We will examine the strategies and design decisions behind that success, and provide insight into the design of complex distributed systems.
719
Google Search Engine
Consists of a set of services
Crawling: to locate and retrieve the contents of the web
and pass the content onto the indexing subsystem.
Performed by software called Googlebot.
Indexing: produce an index for the contents of the web
that is similar to an index at the back of a book, but on a
much larger scale. Indexing produces what is known as
an inverted index mapping words appearing in web
pages and other textual web resources onto the position
where they occur in documents. In addition, an index of links is also maintained to keep track of links to a given site.
Ranking: determines the relevance of the retrieved links. The ranking algorithm, called PageRank, was inspired by citation counts for academic papers. A page will be viewed as important if it is linked to by a large number of other pages.
720
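For illustration, a minimal power-iteration sketch of the PageRank idea follows; the damping factor and the handling of dangling links are simplifying assumptions, not Google's production algorithm.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Sketch of PageRank: a page is important if many (important)
    pages link to it. 'links' maps each page to its list of outlinks."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / len(pages) for p in pages}
        for p, outs in links.items():
            if outs:
                share = damping * rank[p] / len(outs)
                for q in outs:          # each outlink receives a share
                    new[q] = new.get(q, 0) + share
        rank = new
    return rank

# e.g. pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]})
```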
Outline architecture of the original Google search engine
[Brin and Page 1998]
721
Google as a cloud
provider
Google is now a major player in cloud computing, which is defined as a set of Internet-based application, storage and computing services sufficient to support most users' needs, thus enabling them to largely or totally dispense with local data storage and application software.
Software as a service: offering application-level software over the Internet as web applications. A prime example is the set of web-based applications including Gmail, Google Docs, Google Talk and Google Calendar, which aims to replace traditional office suites. (More examples in the following table.)
Platform as a service: concerned with offering distributed
system APIs and services across the Internet, with these APIs
used to support the development and hosting of web
applications. With the launch of Google App Engine, Google went beyond software as a service and now offers its distributed system infrastructure as a cloud service, enabling other organizations to run their own web applications on the Google platform.
722
Example Google applications
723
Google Physical Model
The key philosophy of Google in terms of physical infrastructure is to use very large numbers of commodity PCs to produce a cost-effective environment for distributed storage and computation.
Purchasing decisions are based on obtaining the best performance per dollar rather than absolute performance; indeed, Brin and Page built the first Google search engine from spare hardware scavenged from around the lab at Stanford University.
Typical spend is $1k per PC unit, with 2 terabytes of disk storage and 16 gigabytes of memory, running a cut-down version of the Linux kernel.
Physical Architecture of Google is constructed as:
724
Commodity PCs are organized in racks, with between 40 and 80 PCs in a given rack. Each rack has an Ethernet switch.
30 or more racks are organized into a cluster, which is a key unit of management for the placement and replication of services. Each cluster has two switches connecting it to the outside world or to other data centers.
Clusters are housed in data centers spread around the world.
Physical model
Organization of the Google physical infrastructure
(To avoid clutter the Ethernet connections are shown from only one of the clusters to
the external links)
725
Key Requirements
Scalability: (i) dealing with more data, (ii) dealing with more queries and (iii) seeking better results
Reliability: there is a need to provide 24/7 availability. Google offers a 99.9% service-level agreement to paying customers of Google Apps, covering Gmail, Google Calendar, Google Docs, Google Sites and Google Talk. The well-reported outage of Gmail on Sept. 1st 2009 (100 minutes, due to a cascading problem of overloaded servers) acts as a reminder of the challenges.
Performance: Low latency of user interaction. Achieving
the throughput to respond to all incoming requests
while dealing with very large datasets over the network.
Openness: Core services and applications should be
open to allow innovation and new applications.
726
The overall Google systems
architecture
727
Google infrastructure
728
Google Infrastructure
The underlying communication paradigms, including services for both
remote invocation and indirect communication.
Protocol buffers offer a common serialization format, including the serialization of requests and replies in remote invocation.
Publish-subscribe supports the efficient dissemination of events to large numbers of subscribers.
Data and coordination services provide unstructured and semi-structured abstractions for the storage of data, coupled with services to support access to the data.
GFS offers a distributed file system optimized for Google applications and services, such as the storage of large files.
Chubby supports coordination services and the ability to store small volumes of data.
Bigtable provides a distributed database offering access to semi-structured data.
Distributed computation services providing means for carrying out
parallel and distributed computation over the physical infrastructure.
MapReduce supports distributed computation over potentially very large datasets, for example those stored in Bigtable.
Sawzall provides a higher-level language for the execution of such distributed computations.
729
Protocol buffers example
730
Summary of design choices related
to communication paradigms - part 1
731
Summary of design choices related
to communication paradigms - part 2
732
Data Storage and Coordination
Service
1. Namespace for files
2. Access control
3. Mapping of each file to a set of chunks; each chunk is replicated on three chunkservers.
Each chunk: 64 megabytes
NFS and AFS are general-purpose distributed file systems offering file and directory abstractions. GFS offers similar abstractions but is specialized for the storage of and access to very large quantities of data (not a huge number of files, but each file is massive, 100 megabytes or 1 gigabyte) and for sequential reads and sequential writes, as opposed to random reads and writes. It must also run reliably in the face of any failure condition.
733
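A small sketch of the offset-to-chunk translation implied by fixed-size chunks; the function name is an illustrative assumption, and a real GFS client additionally obtains and caches chunk handles and replica locations from the master.

```python
CHUNK_SIZE = 64 * 1024 * 1024   # 64-megabyte chunks, as in GFS

def chunk_index(byte_offset):
    """Translate a byte offset within a file into a chunk index; the
    client then asks the master for that chunk's replica locations."""
    return byte_offset // CHUNK_SIZE

# e.g. chunk_index(200 * 1024 * 1024) -> 3 (the fourth chunk)
```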
Chubby API
Four distinct capabilities:
1. Distributes locks to synchronize distributed activities in a large-scale asynchronous environment.
2.File system offering reliable
storage of small files
complementing the service
offered by GFS.
3.Support the election of a
primary in a set of replicas.
4.Used as a name service within
Google.
It might appear to contradict the overall design principle of simplicity (doing one thing and doing it well). However, we will see that at its heart is one core service offering a solution to distributed consensus, from which the other facets emerge.
734
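A toy sketch of primary election built on a lock service (capability 3 above); the LockService class and its methods are hypothetical stand-ins, not Chubby's actual API.

```python
class LockService:
    """Hypothetical stand-in for a Chubby-style coarse-grained lock."""
    def __init__(self):
        self.holder = None

    def try_acquire(self, node):
        """Grant the lock to the first requester; later requesters fail."""
        if self.holder is None:
            self.holder = node
            return True
        return False

lock = LockService()
replicas = ["replica-1", "replica-2", "replica-3"]
# Every replica races for the same lock; the winner becomes primary.
primary = next(r for r in replicas if lock.try_acquire(r))
print(primary)   # replica-1
```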
Overall architecture of Chubby
735
Message exchanges in Paxos
(in absence of failures) - step 1
736
Message exchanges in Paxos
(in absence of failures) - step 2
737
Message exchanges in Paxos
(in absence of failures) - step 3
738
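Since the three figures are not reproduced here, the following sketch walks through the same failure-free exchange for single-decree Paxos; class and method names are illustrative, and real-world concerns (failures, retries, competing proposers) are omitted.

```python
class Acceptor:
    def __init__(self):
        self.promised = -1        # highest proposal number promised
        self.accepted = None      # (number, value) accepted, if any

    def prepare(self, n):
        """Step 1: promise not to accept proposals numbered below n."""
        if n > self.promised:
            self.promised = n
            return ("promise", self.accepted)
        return ("nack", None)

    def accept(self, n, value):
        """Step 2: accept unless a higher-numbered promise was made."""
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return "accepted"
        return "nack"

def propose(acceptors, n, value):
    # Step 1: send prepare(n); need promises from a majority.
    promises = [a.prepare(n) for a in acceptors]
    if sum(p[0] == "promise" for p in promises) <= len(acceptors) // 2:
        return None
    # If any acceptor already accepted a value, propose the one with
    # the highest proposal number instead of our own.
    prior = [p[1] for p in promises if p[1]]
    if prior:
        value = max(prior)[1]
    # Step 2: send accept(n, value); step 3: learners observe a majority.
    votes = [a.accept(n, value) for a in acceptors]
    if sum(v == "accepted" for v in votes) > len(acceptors) // 2:
        return value              # consensus reached on 'value'
    return None

print(propose([Acceptor() for _ in range(3)], 1, "v"))   # 'v'
```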
The table abstraction in Bigtable
For example, a web-pages table uses rows to represent individual web pages, and columns to represent data and metadata associated with a given web page.
For example, Google Earth uses rows to represent geographical segments and columns to represent the different images available for a segment.
GFS offers storage of and access to large flat files, whose contents are accessed relative to byte offsets within a file. It is efficient at storing large quantities of data and performing sequential read and (append) write operations. However, there is a strong need for a distributed storage system that provides access to data indexed in more sophisticated ways, related to its content and structure.
An existing relational database with a full set of relational operators (union, selection, projection, intersection and join) could have been adopted, but performance and scalability would be a problem. So Google uses Bigtable (published in 2008), which retains the table model but with a much simpler interface.
A given table is a three-dimensional structure containing cells indexed by a row key, a column key and a timestamp (to save multiple versions).
739
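A toy sketch of this three-dimensional cell model using nested Python dicts; the row key, column key and data shown are invented for illustration.

```python
# Cells indexed by (row key, column key, timestamp), as in Bigtable.
table = {}

def put(row, column, timestamp, value):
    table.setdefault(row, {}).setdefault(column, {})[timestamp] = value

def get_latest(row, column):
    versions = table[row][column]
    return versions[max(versions)]   # most recent timestamp wins

put("com.example/index.html", "contents:", 101, "<html>v1</html>")
put("com.example/index.html", "contents:", 205, "<html>v2</html>")
print(get_latest("com.example/index.html", "contents:"))   # v2
```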
Overall architecture of Bigtable
A Bigtable is broken up into tablets, with a given tablet being approximately 100 to 200 megabytes in size. Bigtable uses both GFS and Chubby for data storage and distributed coordination.
Three major components:
A library component on the client side
A master server
A potentially large number of tablet servers
740
The storage architecture in Bigtable
741
The hierarchical indexing scheme
adopted by Bigtable
A Bigtable client seeking the location of a tablet starts the search by
looking up a particular file in Chubby that is known to hold the
location of a root tablet (containing the root index of the tree
structure).
The root tablet contains metadata about other tablets, specifically about other metadata tablets, which in turn contain the locations of the actual data tablets.
742
Summary of design choices related
to data storage and coordination
743
Distributed Computation
Services
It is important to support high-performance distributed computation over the large datasets stored in GFS and Bigtable. The Google infrastructure supports distributed computation through the MapReduce service and the higher-level Sawzall language.
Carry out distributed computation by breaking up the data into
smaller fragments and carrying out analyses (sorting, searching and
constructing inverted indexes) of such fragments in parallel, making
use of the physical architecture.
MapReduce [Dean and Ghemawat 2008] is a simple programming model to support the development of such applications, hiding underlying detail from the programmer, including details related to the parallelization of the computation, monitoring and recovery from failure, data management, and load balancing onto the underlying physical infrastructure.
The key principle behind MapReduce is that many parallel computations share the same overall pattern, that is:
Break the input data into a number of chunks
Carry out initial processing on these chunks of data to produce intermediary results (map function)
Combine the intermediary results to produce the final output (reduce function)
744
Distributed Computation Services:
MapReduce
For example, searching the web for the phrase 'distributed system book':
Assume the map function is supplied with a web page name and its contents as input; it searches linearly through the contents, emitting a key-value pair consisting of the phrase followed by the name of the web document containing this phrase.
The reduce function in this case is trivial, simply emitting the intermediary results ready to be collated together into a complete index.
The MapReduce implementation is responsible for breaking the
data into chunks, creating multiple instances of the map and
reduce function, allocating and activating them on available
machines in the physical infrastructure, monitoring the
computations for any failures and implementing appropriate
recovery strategies, dispatching intermediary results and ensuring
optimal performance of the whole system.
745
Google reimplemented the main production indexing system in terms of MapReduce.
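A self-contained, sequential sketch of the inverted-index example above; in the real system many map and reduce instances run in parallel across machines, and the function names and documents here are illustrative.

```python
from collections import defaultdict

def map_fn(doc_name, contents):
    """Emit an intermediary (word, document) pair for each word."""
    for word in contents.split():
        yield (word, doc_name)

def reduce_fn(word, doc_names):
    """Collate the documents for one word into an index entry."""
    return (word, sorted(set(doc_names)))

docs = {"d1": "distributed system book", "d2": "system design"}
intermediate = defaultdict(list)
for name, text in docs.items():       # map phase (parallel in reality)
    for word, doc in map_fn(name, text):
        intermediate[word].append(doc)
index = dict(reduce_fn(w, ds) for w, ds in intermediate.items())
print(index["system"])                 # ['d1', 'd2']
```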
Examples of the use of MapReduce
746
The overall execution of a MapReduce
program
The first stage is to split the input file into M pieces, with each piece being typically 16-64 megabytes in size (no bigger than a single chunk in GFS). The intermediary results are also partitioned, into R pieces; so there are M map tasks and R reduce tasks.
The library then starts a set of worker machines from the pool available in the cluster, with one being designated as the master and the others being used for executing map or reduce steps.
A worker that has been assigned a map task will first read the contents of the
input file allocated to that map task, extract the key-value pairs and supply them
as input to the map function. The output of the map function is a processed set of
key/value pairs that are held in an intermediary buffer.
The intermediary buffers are periodically written to a file local to the map computation. At this stage the data are partitioned, resulting in R regions: usually a hash function is applied to the key, then modulo R to the hashed value, to produce the R partitions.
When a worker is assigned to carry out a reduce task, it reads its corresponding partition from the local disks of the map workers using RPC. The data are then sorted by key and the reduce function is applied to each key and its associated values, with the results collated into the final output.
747
The overall execution of a Sawzall
program
748
Summary of design choices related to
distributed computation
749
Question Bank
Discuss the overall Google architecture for
distributed computing.
Discuss in detail the data storage and coordination
services provided in the Google infrastructure.
What is the purpose of distributed computing
services?
Explain how the Google infrastructure supports distributed computation.
Write short notes on:
(i) Chubby
(ii) Communication paradigm
750