
RAJALAKSHMI ENGINEERING COLLEGE
[AUTONOMOUS]
RAJALAKSHMI NAGAR, THANDALAM, CHENNAI-602105
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

CS17501 DISTRIBUTED SYSTEMS


THIRD YEAR
FIFTH SEMESTER
FPP 2019 - 2020

Prepared by
Dr. B. Swaminathan
Mr. N. Duraimurugan
Dr. U. Karthikeyan

Dr. B. SWAMINATHAN, Page 2/181


RAJALAKSHMI ENGINEERING COLLEGE (AUTONOMOUS), THANDALAM

Department of Computer Science and Engineering

Vision

To promote highly ethical and innovative computer professionals through excellence in teaching, training and research.

Mission

To produce globally competent professionals, motivated to learn the emerging technologies and to be innovative in solving real world problems.

To promote research activities amongst the students and the members of faculty that could benefit the society.

To impart moral and ethical values in their profession.

Programme Educational Objectives (PEOs)

PEO I

To equip students with essential background in computer science, basic electronics and
applied mathematics.

PEO II

To prepare students with fundamental knowledge in programming languages and tools and
enable them to develop applications.

PEO III

To encourage the research abilities and innovative project development in the field of
networking, security, data mining, web technology, mobile communication and also emerging
technologies for the cause of social benefit.

PEO IV

To develop professionally ethical individuals enhanced with analytical skills, communication skills and organizing ability to meet industry requirements.



PROGRAM OUTCOMES (POs)
A graduate of the Computer Science and Engineering Program will demonstrate:
PO1: Engineering knowledge: Apply the knowledge of mathematics, science, engineering
fundamentals, and an engineering specialization to the solution of complex engineering
problems.
PO2: Problem analysis: Identify, formulate, review research literature, and analyze complex
engineering problems reaching substantiated conclusions using first principles of
mathematics, natural sciences, and engineering sciences.
PO3: Design/development of solutions: Design solutions for complex engineering problems
and design system components or processes that meet the specified needs with appropriate
consideration for the public health and safety, and the cultural, societal, and environmental
considerations.
PO4: Conduct investigations of complex problems: Use research-based knowledge and
research methods including design of experiments, analysis and interpretation of data, and
synthesis of the information to provide valid conclusions.
PO5: Modern tool usage: Create, select, and apply appropriate techniques, resources, and
modern engineering and IT tools including prediction and modeling to complex engineering
activities with an understanding of the limitations.
PO6: The engineer and society: Apply reasoning informed by the contextual knowledge to
assess societal, health, safety, legal and cultural issues and the consequent responsibilities
relevant to the professional engineering practice.
PO7: Environment and sustainability: Understand the impact of the professional engineering
solutions in societal and environmental contexts, and demonstrate the knowledge of, and need
for sustainable development.
PO8: Ethics: Apply ethical principles and commit to professional ethics and responsibilities
and norms of the engineering practice.
PO9: Individual and team work: Function effectively as an individual, and as a member or
leader in diverse teams, and in multidisciplinary settings.
PO10: Communication: Communicate effectively on complex engineering activities with the
engineering community and with society at large, such as, being able to comprehend and write
effective reports and design documentation, make effective presentations, and give and
receive clear instructions.
PO11: Project management and finance: Demonstrate knowledge and understanding of the
engineering and management principles and apply these to one’s own work, as a member and
leader in a team, to manage projects and in multidisciplinary environments.
PO12: Life-long learning: Recognize the need for, and have the preparation and ability to
engage in independent and life-long learning in the broadest context of technological change.



PROGRAM SPECIFIC OUTCOMES (PSOs)
A graduate of the Computer Science and Engineering Program will demonstrate:
PSO1: Foundation Skills: Ability to understand, analyze and develop computer programs in
the areas related to algorithms, system software, web design, machine learning, data analytics,
and networking for efficient design of computer-based systems of varying complexity.
Familiarity and practical competence with a broad range of programming languages and open source platforms.

PSO2: Problem-Solving Skills: Ability to apply mathematical methodologies to solve computational tasks and to model real-world problems using appropriate data structures and suitable algorithms. To understand the standard practices and strategies in software project development using open-ended programming environments to deliver a quality product.

PSO3: Successful Progression: Ability to apply knowledge in various domains to identify research gaps and to provide solutions to new ideas, inculcate passion towards higher studies, create innovative career paths to be an entrepreneur, and evolve as an ethically and socially responsible computer science professional.

Course Outcomes (COs)

CO 1  The student must gain knowledge of the goals and types of distributed systems.

CO 2  The student must have an ability to describe distributed OS and communications.

CO 3  The student must have a clear knowledge about distributed objects and file systems.

CO 4  The student must emphasize the benefits of using distributed transactions and concurrency.

CO 5  The student must have an ability to explicate issues related to developing fault-tolerant systems and security.



CO - PO – PSO matrices of course

CO / PO-PSO  PO1  PO2  PO3  PO4  PO5  PO6  PO7  PO8  PO9  PO10  PO11  PO12  PSO1  PSO2  PSO3
CO 1          3    2    2    2    2    2    2    2    3    1     3     2     2     2     3
CO 2          3    3    3    3    2    3    2    2    2    2     3     2     2     3     3
CO 3          3    3    3    2    2    3    3    2    2    2     2     2     2     2     2
CO 4          3    3    3    3    2    3    2    2    2    2     3     2     2     2     2
CO 5          3    3    3    2    2    2    2    2    2    2     3     2     3     3     2
Average       3    2.8  2.8  2.4  2    2.6  2.2  2    2.2  1.8   3     2     2.2   2.4   2.4

Note: Enter correlation levels 1, 2 or 3 as defined below:


1: Slight (Low) 2: Moderate (Medium) 3: Substantial (High)
If there is no correlation, put “-“



CS17501 DISTRIBUTED SYSTEMS    L T P C: 3 0 0 3

OBJECTIVES:
To know the goals and types of Distributed Systems
To describe Distributed OS and Communications
To learn about Distributed Objects and File Systems
To emphasize the benefits of using Distributed Transactions and Concurrency
To learn issues related to developing Fault-Tolerant Systems and Security

UNIT I INTRODUCTION 9
Introduction to Distributed systems – Design Goals - Types of Distributed Systems -
Architectural Styles – Middleware - System Architecture – Centralized and Decentralized
organizations – Peer-to-Peer System – Case Study: Skype and Bit-Torrent

UNIT II OPERATING SYSTEMS AND COMMUNICATIONS 9


Process – Threads – Virtualization – Client-Server Model - Case Study: Apache Web Server - Code Migration - Communication: Fundamentals - Remote Procedure Call – Stream oriented communication – Message oriented communication – Multicast communication.

UNIT III DISTRIBUTED OBJECTS AND FILE SYSTEM 9


Remote Invocation – Request Reply Protocol - Java RMI - Distributed Objects - CORBA - Introduction to Distributed File System - File Service architecture – Andrew File System, Sun Network File System - Introduction to Name Services - Name services and DNS - Directory and directory services - Case Study: Google File System

UNIT IV DISTRIBUTED TRANSACTIONS AND CONCURRENCY 9


Clock Synchronization – Logical Clocks – Global States – Mutual Exclusion - Election Algorithms – Introduction to Replication – Data-Centric Consistency Models – Client-Centric Consistency Models – Distribution Protocol – Consistency Protocol

UNIT V FAULT TOLERANCE AND SECURITY 9


Introduction to Fault Tolerance – Process Resilience – Reliable Communications –
Distributed Commit – Recovery – Introduction to Security – Secure Channels – Access
Control – Secure Naming - Security Management

TOTAL: 45 PERIODS
OUTCOMES:
Gain knowledge of the goals and types of Distributed Systems
Ability to Describe Distributed OS and Communications
A clear knowledge about Distributed objects and File System
Emphasize the benefits of using Distributed Transactions and Concurrency
Ability to explicate issues related to Developing Fault-Tolerant Systems and Security



TEXT BOOKS:
1. Tanenbaum, A. and van Steen, M., Distributed Systems: Principles and Paradigms, 2nd ed, Prentice Hall, 2007.
2. Coulouris, G., Dollimore, J., and Kindberg, T., Distributed Systems: Concepts and Design, 5th ed, Addison-Wesley, 2011.

REFERENCES:
1. Pradeep K Sinha, Distributed Operating Systems, Prentice-Hall of India, New Delhi, 1st ed, 2001.
2. Jean Dollimore, Tim Kindberg, George Coulouris, Distributed Systems - Concepts and Design, Pearson Education, 4th ed, 2005.
3. M.L. Liu, Distributed Computing Principles and Applications, Pearson Education, 1st ed, 2004.
4. Hagit Attiya and Jennifer Welch, Distributed Computing: Fundamentals, Simulations and Advanced Topics, Wiley, 1st ed, 2004.

Contents Beyond the Syllabus

UNIT-I: Google Classroom / Facebook

UNIT-II: Apache Tomcat (Testing)

UNIT-III: Hadoop (HDFS) / FAT

UNIT-IV: Coda

UNIT-V: Access Control & Authentication Encryption



RAJALAKSHMI ENGINEERING COLLEGE (AUTONOMOUS)
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
CS17501 - DISTRIBUTED SYSTEMS

Faculty Name: Dr. U. Karthikeyan, Dr. B. Swaminathan, Mr. N. Duraimurugan

Year and Semester / Section: III Year & V-Sem / CSE - A, B, C, D & E

Lesson Plan

Sl. No.  TOPIC                                            No. of Periods  Unit  Ref.

UNIT I : INTRODUCTION
1.   Introduction - Distributed Systems                   1               I     T1:1-2
2.   Design Goals                                         1               I     T1:3-9
3.   Types of Distributed Systems                         1               I     T1:17-24
4.   Architectural Styles                                 1               I     T1:34-35
5.   Middleware                                           1               I     T1:54-57
6.   System Architecture                                  1               I     T1:35-36
7.   Centralized and Decentralized organizations          1               I     T1:36-44
8.   Peer-to-Peer System                                  1               I     T1:44-51
9.   Case Study: Skype and Bit-Torrent                    1               I     T1:51-53 & Internet
TOTAL HOURS FOR UNIT I                                    9

UNIT II : OPERATING SYSTEMS AND COMMUNICATIONS
10.  Process & Threads                                    1               II    T1:69-70
11.  Virtualization                                       1               II    T1:79-82
12.  Client-Server Model                                  2               II    T1:82-103
13.  Case Study: Apache Web Server                        1               II    T1:556-558
14.  Code Migration                                       1               II    T1:103-112
15.  Communication: Fundamentals                          1               II    T1:116-125
16.  Remote Procedure Call                                1               II    T1:125-140
17.  Stream oriented communication                        1               II    T1:157-166
18.  Message oriented communication                       1               II    T1:140-157
19.  Multicast communication                              1               II    T1:166-174
TOTAL HOURS FOR UNIT II                                   11
UNIT III : DISTRIBUTED OBJECTS AND FILE SYSTEM
20.  Remote Invocation                                    1               III   T2:185-186
21.  Request Reply Protocol                               1               III   T2:187-194
22.  Java RMI                                             1               III   T2:217-224
23.  Distributed Objects                                  1               III   T2:336-339
24.  CORBA                                                1               III   T2:340-357
25.  Introduction to Distributed File System              1               III   T2:522-529
26.  File Service architecture                            1               III   T2:530-535
27.  Andrew File System, Sun Network File System          1               III   T2:536-556
28.  Introduction to Name Services                        1               III   T2:566-568
29.  Name services and DNS                                1               III   T2:569-583
30.  Directory and directory services                     1               III   T2:584-585
31.  Case Study: Google File System                       1               III   Internet
TOTAL HOURS FOR UNIT III                                  12
UNIT IV : DISTRIBUTED TRANSACTIONS AND CONCURRENCY
32.  Clock Synchronization                                1               IV    T1:232-238
33.  Logical Clocks                                       1               IV    T1:244-248
34.  Global States                                        1               IV
35.  Mutual Exclusion                                     1               IV    T1:252-259
36.  Election Algorithms                                  1               IV    T1:263-269
37.  Introduction – Replication                           1               IV    T1:274-275
38.  Data-Centric Consistency Models                      1               IV    T1:276-281
39.  Client-Centric Consistency Models                    1               IV    T1:288-295
40.  Distribution Protocol                                1               IV    T1:296-302
41.  Consistency Protocol                                 1               IV    T1:306-315
TOTAL HOURS FOR UNIT IV                                   10
UNIT V : FAULT TOLERANCE AND SECURITY
42.  Introduction to Fault Tolerance                      1               V     T1:322-326
43.  Process Resilience                                   1               V     T1:328-335
44.  Reliable Communications                              1               V     T1:336-348
45.  Distributed Commit                                   1               V     T1:355-360
46.  Recovery                                             1               V     T1:363-372
47.  Introduction to Security                             1               V     T1:378-389
48.  Secure Channels                                      1               V     T1:396-411
49.  Access Control                                       1               V     T1:413-427
50.  Secure Naming                                        1               V     Internet
51.  Security Management                                  1               V     T1:428-434
TOTAL HOURS FOR UNIT V                                    10
TOTAL HOURS                                               52
*->Content Beyond Syllabus



TEXT BOOKS:
1. Tanenbaum, A. and van Steen, M., Distributed Systems: Principles and Paradigms, 2nd ed, Prentice Hall, 2007.
2. Coulouris, G., Dollimore, J., and Kindberg, T., Distributed Systems: Concepts and Design, 5th ed, Addison-Wesley, 2011.

REFERENCES:
1. Pradeep K Sinha, Distributed Operating Systems, Prentice-Hall of India, New Delhi, 1st ed, 2001.
2. Jean Dollimore, Tim Kindberg, George Coulouris, Distributed Systems - Concepts and Design, Pearson Education, 4th ed, 2005.
3. M.L. Liu, Distributed Computing Principles and Applications, Pearson Education, 1st ed, 2004.
4. Hagit Attiya and Jennifer Welch, Distributed Computing: Fundamentals, Simulations and Advanced Topics, Wiley, 1st ed, 2004.



UNIT I INTRODUCTION
Introduction to Distributed systems – Design Goals - Types of Distributed Systems -
Architectural Styles – Middleware - System Architecture – Centralized and
Decentralized organizations – Peer-to-Peer System – Case Study: Skype and Bittorrent
1.1 Introduction to Distributed systems:
A distributed system is a group of computers working together so as to appear as a single computer to the end user. These machines share state, operate concurrently, and can fail independently without affecting the whole system's uptime.

CAP Theorem: The CAP theorem states that a distributed data store cannot simultaneously be consistent, available and partition tolerant.
• Consistency — what you read and write sequentially is what is expected: every read sees the most recent write.
• Availability — the whole system does not die: every non-failing node always returns a response. (The probability that the system is operational at a given time.)
• Partition Tolerance — the system continues to function and uphold its consistency/availability guarantees in spite of network partitions.
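As a toy sketch of this trade-off (illustrative only; the store, its names, and the partition flag are invented for this example, not a real datastore): with two replicas and a broken link between them, a read from the backup must either fail (choosing consistency over availability) or risk returning stale data (choosing availability over consistency).

```python
# Toy illustration of CAP: two replicas with a simulated network partition.
# When the link is down, a read must either refuse to answer (CP choice)
# or answer with possibly stale data (AP choice).

class Replica:
    def __init__(self):
        self.value = None

class TinyStore:
    def __init__(self, prefer_consistency):
        self.primary = Replica()
        self.backup = Replica()
        self.partitioned = False               # simulated network partition
        self.prefer_consistency = prefer_consistency

    def write(self, value):
        self.primary.value = value
        if not self.partitioned:               # replication works only while connected
            self.backup.value = value

    def read_from_backup(self):
        if self.partitioned and self.prefer_consistency:
            # CP choice: stay consistent by refusing to answer
            raise RuntimeError("unavailable: cannot confirm latest value")
        return self.backup.value               # AP choice: may be stale

cp = TinyStore(prefer_consistency=True)
ap = TinyStore(prefer_consistency=False)
for store in (cp, ap):
    store.write("v1")
    store.partitioned = True
    store.write("v2")                          # the backup misses this update

print(ap.read_from_backup())                   # "v1" -- available but stale
try:
    cp.read_from_backup()
except RuntimeError as e:
    print(e)                                   # consistent but not available
```

Neither store is "wrong": each simply picks a different guarantee to give up while the partition lasts.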

1.1.1 DEFINITION OF A DISTRIBUTED SYSTEM


“A distributed system is a collection of independent computers that appears to its users as a
single coherent system.”
A distributed system consists of components (i.e., computers) that are autonomous.
Users (be they people or programs) think they are dealing with a single system.
This means that one way or the other the autonomous components need to collaborate.

Figure 1-1. A distributed system organized as middleware.


The middleware layer extends over multiple machines, and offers each application the same interface.

Distributed systems should also be relatively easy to expand or scale. They will normally be continuously available, although perhaps some parts may be temporarily out of order. Users and applications should not notice that parts are being replaced or fixed, or that new parts are added to serve more users or applications.
Distributed systems are often organized by means of a layer of software that is logically placed between a higher-level layer consisting of users and applications, and a layer underneath consisting of operating systems and basic communication facilities. This layer is also called middleware.



1.2 Design Goals of Distributed Systems:
Making resources available
Distribution transparency
Openness
Scalability
(Other goals: flexibility, reliability and performance; also heterogeneity, i.e. variety and differences in networks, hardware, software and operating systems.)

1.2.1 Distribution transparency


• How to achieve the single-system image, i.e., how to make a collection of computers appear as a single computer.
• Hiding all the distribution from the users as well as the application programs can be achieved at two levels:
1) hide the distribution from users
2) at a lower level, make the system look transparent to programs.
Both 1) and 2) require uniform interfaces, such as access to files and communication.
Transparency   Description
Access         Hide differences in data representation and how an object is accessed.
Location       Hide where an object is located. Users cannot tell where hardware and
               software resources such as CPUs, printers, files and databases are located.
Relocation     Hide that an object may be moved to another location while in use.
Migration      Hide that an object may move to another location. Resources must be free
               to move from one location to another without their names changing,
               e.g. /usr/lee, /central/usr/lee.
Replication    Hide that an object is replicated. The OS can make additional copies of
               files and resources without users noticing.
Concurrency    Hide that an object may be shared by several independent users. The users
               are not aware of the existence of other users; multiple users must be able
               to concurrently access the same resource (lock and unlock for mutual
               exclusion).
Failure        Hide the failure and recovery of an object.
Parallelism    Automatic use of parallelism without having to program it explicitly;
               the holy grail for distributed and parallel system designers.

Degree of transparency
Aiming at full distribution transparency may be too much:
Users may be located in different continents
Completely hiding failures of networks and nodes is (theoretically and practically)
impossible
You cannot distinguish a slow computer from a failing one
You can never be sure that a server actually performed an operation before a crash
Full transparency will cost performance, exposing distribution of the system
Keeping Web caches exactly up-to-date with the master
Immediately flushing write operations to disk for fault tolerance

1.2.2 Openness of distributed systems


Open distributed system
Be able to interact with services from other open systems, irrespective of the underlying
environment:
Systems should conform to well-defined interfaces
Systems should support portability of applications
Systems should easily interoperate
Achieving openness
At least make the distributed system independent from heterogeneity of the underlying
environment:
Hardware
Platforms
Languages



Policies versus mechanisms
Implementing openness: requires support for different policies:
What level of consistency do we require for client-cached data?
Which operations do we allow downloaded code to perform?
Which QoS requirements do we adjust in the face of varying bandwidth?
What level of secrecy do we require for communication?
Implementing openness: ideally, a distributed system provides only mechanisms:
Allow (dynamic) setting of caching policies
Support different levels of trust for mobile code
Provide adjustable QoS parameters per data stream
Offer different encryption algorithms

1.2.3. Scale in distributed systems: Many developers of modern distributed systems easily use the adjective "scalable" without making clear why their systems actually scale.
Scalability: At least three components:
Number of users and/or processes (size scalability)
Maximum distance between nodes (geographical scalability)
Number of administrative domains (administrative scalability)

Techniques for scaling


Hide communication latencies: Avoid waiting for responses; do something else:
Make use of asynchronous communication
Have separate handler for incoming response
Problem: not every application fits this model
Distribution: Partition data and computations across multiple machines:
Move computations to clients (Java applets)
Decentralized information systems (WWW)
Decentralized naming services (DNS)
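The latency-hiding technique above can be sketched with asynchronous communication (a minimal illustration using Python's asyncio; the "remote" call is simulated with a sleep): the client issues the request, keeps doing useful local work, and a separate handler consumes the response when it arrives.

```python
import asyncio

# Hiding communication latency: issue a (simulated) remote call,
# keep working locally, and let a separate handler pick up the response.

async def remote_call(x):
    await asyncio.sleep(0.1)          # stand-in for network latency
    return x * 2

async def response_handler(pending, results):
    results.append(await pending)     # runs only once the response arrives

async def main():
    results = []
    call = asyncio.create_task(remote_call(21))
    handler = asyncio.create_task(response_handler(call, results))
    local_work = sum(range(1000))     # useful work while the call is in flight
    await handler                     # eventually collect the response
    return local_work, results

work, results = asyncio.run(main())
print(results)                        # [42]
```

As the text notes, not every application fits this model: interactive applications often have nothing useful to do while waiting.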

Scaling – Problem: Applying scaling techniques is easy, except for one thing:
• Having multiple copies (cached or replicated) leads to inconsistencies: modifying one copy makes that copy different from the rest.
• Always keeping copies consistent in a general way requires global synchronization on each modification.
• Global synchronization precludes large-scale solutions.
• If we can tolerate inconsistencies, we may reduce the need for global synchronization, but tolerating inconsistencies is application dependent.
Developing distributed systems: Pitfalls
Many distributed systems are needlessly complex caused by mistakes that required
patching later on. There are many false assumptions:
• The network is reliable
• The network is secure
• The network is homogeneous
• The topology does not change
• Latency is zero
• Bandwidth is infinite
• Transport cost is zero
• There is one administrator



1.3 Types of Distributed Systems
Distributed computing systems
Distributed information systems
Distributed pervasive systems

1.3.1 Distributed computing systems


Many distributed systems are configured for High-Performance Computing
Cluster Computing: Essentially a group of high-end systems connected through a LAN
Homogeneous: same OS, near-identical hardware
Single managing node

Grid Computing: The next step: lots of nodes from everywhere: Heterogeneous
Dispersed across several organizations
Can easily span a wide-area network

Cloud Computing: Make a distinction between four layers:

• Hardware: Processors, routers, power and cooling systems. Customers normally never
get to see these.
• Infrastructure: Deploys virtualization techniques. Evolves around allocating and
managing virtual storage devices and virtual servers.
• Platform: Provides higher-level abstractions for storage and such. Example: Amazon
S3 storage system offers an API for (locally created) files to be organized and stored in
so-called buckets.
• Application: Actual applications, such as office suites (text processors, spreadsheet
applications, presentation applications). Comparable to the suite of apps shipped with
OSes.



1.3.2 Distributed Information Systems: The vast majority of distributed systems in use today are forms of traditional information systems that now integrate legacy systems.

Example: Transaction processing systems.


BEGIN TRANSACTION(server, transaction)
    READ(transaction, file-1, data)
    WRITE(transaction, file-2, data)
    newData := MODIFIED(data)
    IF WRONG(newData) THEN
        ABORT TRANSACTION(transaction)
    ELSE
        WRITE(transaction, file-2, newData)
    END IF
END TRANSACTION(transaction)

Note: Transactions form an atomic operation.

Transactions Model: A transaction is a collection of operations on the state of an object


(database, object composition, etc.) that satisfies the following properties (ACID)
Atomicity: All operations either succeed, or all of them fail. When the transaction fails,
the state of the object will remain unaffected by the transaction.

Consistency: A transaction establishes a valid state transition. This does not exclude the
possibility of invalid, intermediate states during the transaction’s execution.

Isolation: Concurrent transactions do not interfere with each other. It appears to each
transaction T that other transactions occur either before T, or after T, but never both.

Durability: After the execution of a transaction, its effects are made permanent.

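A minimal runnable illustration of atomicity (not from the textbook; it uses Python's built-in sqlite3 module, and the account table and transfer helper are invented for the example): if any step of the transaction fails, the rollback leaves the object's state unaffected.

```python
import sqlite3

# Atomicity sketch with SQLite: either the whole transfer commits,
# or a rollback leaves both balances exactly as they were.

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INT)")
conn.execute("INSERT INTO account VALUES ('alice', 100), ('bob', 0)")
conn.commit()

def transfer(conn, src, dst, amount):
    try:
        with conn:  # BEGIN ... COMMIT, or ROLLBACK on exception
            conn.execute("UPDATE account SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            (bal,) = conn.execute("SELECT balance FROM account WHERE name = ?",
                                  (src,)).fetchone()
            if bal < 0:
                raise ValueError("insufficient funds")  # forces an abort
            conn.execute("UPDATE account SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
    except ValueError:
        pass  # aborted: the debit above was rolled back

transfer(conn, "alice", "bob", 60)   # succeeds
transfer(conn, "alice", "bob", 60)   # aborts: would leave alice at -20
print(conn.execute("SELECT balance FROM account ORDER BY name").fetchall())
# [(40,), (60,)]
```

The `with conn:` block is what gives the all-or-nothing behavior: the half-applied debit of the failing second transfer never becomes visible.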
Transaction processing monitor: In many cases, the data involved in a transaction is


distributed across several servers. A TP Monitor is responsible for coordinating the execution
of a transaction.



Distributed information systems:
Enterprise application integration: A TP monitor doesn’t separate apps from their databases.
Also needed are facilities for direct communication between apps.
Remote Procedure Call (RPC)
Message-Oriented Middleware (MOM)

1.3.3 Distributed pervasive systems: Emerging next-generation of distributed systems in


which nodes are small, mobile, and often embedded in a larger system, characterized by the
fact that the system naturally blends into the user’s environment.
Three (overlapping) subtypes
1. Ubiquitous computing systems: pervasive and continuously present, i.e., there is a continuous interaction between system and user.
Basic characteristics
• (Distribution) Devices are networked, distributed, and accessible in a
transparent manner
• (Interaction) Interaction between users and devices is highly unobtrusive
• (Context awareness) The system is aware of a user’s context in order to
optimize interaction
• (Autonomy) Devices operate autonomously without human intervention, and
are thus highly self-managed
• (Intelligence) The system as a whole can handle a wide range of dynamic actions and interactions

2. Mobile computing systems: pervasive, but emphasis is on the fact that devices are
inherently mobile.
Mobile computing systems are generally a subclass of ubiquitous computing systems
and meet all of the five requirements.
Typical characteristics
• Many different types of mobile devices: smart phones, remote controls, car equipment, and so on
• Wireless communication
• Devices may continuously change their location:
▪ setting up a route may be problematic, as routes can change frequently
▪ devices may easily be temporarily disconnected (disruption-tolerant networks)



3. Sensor (and actuator) networks: pervasive, with emphasis on the actual (collaborative)
sensing and actuation of the environment
Characteristics: The nodes to which sensors are attached are:
• Many (10s-1000s)
• Simple (small memory/compute/communication capacity)
• Often battery-powered (or even battery-less)
Sensor networks as distributed systems

1.4 Architectural Styles


Introduction
• Architectural styles
• Software architectures
• Architectures versus middleware
• Self-management in distributed systems
Architectural styles
Organize into logically different components, and distribute those components over the
various machines.

(a) Layered style is used for client-server system


(b) Object-based style for distributed object systems.

Decoupling processes in space (“anonymous”) and also time (“asynchronous”) has led to
alternative styles.

(a) Publish/subscribe [decoupled in space]


(b) Shared dataspace [decoupled in space and time]



1.5 System Architecture
1 .5 .1 Centralized Architectures
Basic Client–Server Model
Characteristics:
There are processes offering services (servers)
There are processes that use services (clients)
Clients and servers can be on different machines
Clients follow a request/reply model with respect to using services
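The request/reply interaction can be sketched with plain TCP sockets (an illustrative toy; the service, an upper-casing function, and the loopback setup are invented for the example): the server waits for a request and returns a reply, and the client blocks until the reply arrives.

```python
import socket
import threading

# Minimal client-server request/reply sketch over TCP: a server process
# offers a service (upper-casing a string) and a client process uses it.

def server(sock):
    conn, _ = sock.accept()
    with conn:
        request = conn.recv(1024)      # wait for the request
        conn.sendall(request.upper())  # compute and send the reply

srv = socket.socket()
srv.bind(("127.0.0.1", 0))             # any free loopback port
srv.listen(1)
threading.Thread(target=server, args=(srv,), daemon=True).start()

cli = socket.socket()
cli.connect(srv.getsockname())
cli.sendall(b"hello")                  # request
reply = cli.recv(1024)                 # block until the reply arrives
cli.close()
srv.close()
print(reply)                           # b'HELLO'
```

The two endpoints could just as well run on different machines; only the address passed to connect() would change.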

Application Layering
Traditional three-layered view
• User-interface layer contains units for an application’s user interface
• Processing layer contains the functions of an application, i.e. without specific data
• Data layer contains the data that a client wants to manipulate through the application
components

This layering is found in many distributed information systems, using traditional database
technology and accompanying applications.

Multi-Tiered Architectures
Single-tiered: dumb terminal/mainframe configuration
Two-tiered: client/single server configuration
Three-tiered: each layer on separate machine

Traditional two-tiered configurations:



1.5.2 Decentralized Architectures :
In the last couple of years we have been seeing a tremendous growth in peer-to-peer systems.
1.5.2.1. Structured P2P: nodes are organized following a specific distributed data structure
Organize the nodes in a structured overlay network such as a logical ring, or a
hypercube, and make specific nodes responsible for services based only on their ID.

1.5.2.2. Unstructured P2P: nodes have randomly selected neighbors


Many unstructured P2P systems are organized as a random overlay: two nodes are
linked with probability p.
We can no longer look up information deterministically, but will have to resort to
searching:
Flooding: node u sends a lookup query to all of its neighbors. A neighbor
responds, or forwards (floods) the request.
There are many variations:
Limited flooding (maximal number of forwarding)
Probabilistic flooding (flood only with a certain probability).
Random walk: Randomly select a neighbor v. If v has the answer, it replies,
otherwise v randomly selects one of its neighbors. Variation: parallel random walk.
Works well with replicated data.
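The limited flooding described above can be sketched as follows (an illustrative toy: the overlay is a random graph with link probability p, and node 23 is arbitrarily chosen as the object holder): in each round, every node on the frontier forwards the query to its neighbors, until the maximal number of forwardings (TTL) runs out.

```python
import random

# Limited flooding in an unstructured overlay: breadth-first forwarding
# of a lookup query, bounded by a time-to-live (TTL).

random.seed(7)
N, P = 30, 0.15
# Random overlay: any two nodes are linked with probability P.
edges = {(u, v) for u in range(N) for v in range(u + 1, N)
         if random.random() < P}
neighbors = {u: [] for u in range(N)}
for u, v in edges:
    neighbors[u].append(v)
    neighbors[v].append(u)

holder = 23                     # the node that owns the object we look for

def flood(start, target, ttl):
    seen, frontier = {start}, [start]
    while frontier and ttl > 0:
        ttl -= 1                # one more round of forwarding spent
        nxt = []
        for u in frontier:
            if u == target:
                return True     # a node holding the object responds
            for v in neighbors[u]:
                if v not in seen:
                    seen.add(v) # forward the query only once per node
                    nxt.append(v)
        frontier = nxt
    return target in frontier

print(flood(0, holder, ttl=5))
```

Whether the search succeeds depends on the random topology and the TTL, which is exactly the non-determinism the text contrasts with structured lookups.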

1.5.2.3. Hybrid P2P: some nodes are appointed special functions in a well-organized fashion
Hybrid Architectures: Client-server combined with P2P. Example: edge-server architectures, which are often used for Content Delivery Networks.

1.6. Architectures versus Middleware


In many cases, distributed systems/applications are developed according to a specific architectural style. The chosen style may not be optimal in all cases, so we need to (dynamically) adapt the behavior of the middleware.
Interceptors: Intercept the usual flow of control when invoking a remote object.

Self-managing Distributed Systems: Distinction between system and software architectures


blurs when automatic adaptivity needs to be taken into account:
Self-configuration, Self-managing, Self-healing, Self-optimizing



1.7 Case Study: Skype
Unstructured P2P computer network architectures: pure P2P, hybrid P2P, and centralized P2P.
Structured P2P computer network architectures: workstations (peers), and sometimes resources as well, are organized according to specific criteria and algorithms.

Advantages and weaknesses of P2P


P2P networks aggregate the resources of their clients, such as bandwidth, storage space and processing power. As nodes arrive and demand on the system increases, the total capacity of the whole system also increases.

Voice over Internet Protocol


Voice over Internet Protocol (Voice over IP, VoIP) is a methodology and group of
technologies for the delivery of voice communications and multimedia sessions over
Internet Protocol (IP) networks, such as the Internet.

Similar to traditional digital telephony, VoIP calls involve signaling, channel setup, digitization of the analog voice signals, and encoding. However, instead of being transmitted over a circuit-switched network, the digital information is packetized, and transmission occurs as IP packets over a packet-switched network.

Protocols Used
Voice over IP has been implemented in various ways using both proprietary protocols and protocols based on open standards. VoIP protocols include:
● Session Initiation Protocol (SIP)
● H.323
● Media Gateway Control Protocol (MGCP)
● Gateway Control Protocol (Megaco, H.248)
● Real-time Transport Protocol (RTP)
● Real-time Transport Control Protocol (RTCP)
● Secure Real-time Transport Protocol (SRTP)
● Session Description Protocol (SDP)
● Inter-Asterisk eXchange (IAX)
● Jingle XMPP VoIP extensions
● Skype protocol

Working of VoIP
Voice is converted from an analog signal to a digital signal. It is then sent over the
Internet in data packets to a location close to the destination. There it is converted back
to an analog signal for the remaining distance over the traditional circuit-switched
telephone network (PSTN), unless the call is VoIP to VoIP. Your call can be received by
traditional telephones worldwide, as well as by other VoIP users. VoIP-to-VoIP calls can
travel entirely over the Internet. Since your voice is converted to digital form, other
features such as voice messages to email, call forwarding, logs of incoming and outgoing
calls, caller ID, etc., can be included in your basic calling plan, all for one low price.
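Since RTP appears in the protocol list above, the packetization step can be sketched in Python. The 12-byte fixed RTP header layout is standard; the SSRC value, the 8 kHz / 20 ms framing, and payload type 0 (PCMU) are illustrative choices, and the "audio" here is just a silent byte buffer:

```python
import struct

def rtp_header(seq, timestamp, ssrc, payload_type=0):
    # Fixed 12-byte RTP header:
    # byte 0: version=2, padding=0, extension=0, CSRC count=0 -> 0x80
    # byte 1: marker=0, payload type (0 = PCMU/G.711 u-law audio)
    return struct.pack("!BBHII", 0x80, payload_type, seq, timestamp, ssrc)

def packetize(samples, ssrc=0x1234, frame=160):
    # 160 samples of 8 kHz audio = one 20 ms frame, a common VoIP framing.
    packets = []
    for seq, off in enumerate(range(0, len(samples), frame)):
        packets.append(rtp_header(seq, off, ssrc) + bytes(samples[off:off + frame]))
    return packets

packets = packetize(bytes(320))   # 320 zero samples -> two 20 ms packets
```

Each packet would then be handed to UDP; the sequence number and timestamp let the receiver reorder and play out the stream despite packet-switched delivery.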

Skype was the first peer-to-peer IP telephony network, created by the developers of KaZaa.
Skype uses wide-band codecs (iLBC, iSAC and iPCM, developed by GlobalIPSound), which allow
it to maintain reasonable call quality at an available bandwidth of 32 kb/s (Skype claims a
bandwidth usage of 3-16 kilobytes/s). The minimum and maximum audible frequencies the
Skype codec passes through are 50 Hz and 8,000 Hz respectively.
The network contains three types of entities:
Supernodes
Ordinary nodes
Login server
Each client maintains a host cache with the IP address and port numbers of reachable
supernodes. The Skype user directory is decentralized and distributed among the supernodes
in the network.

Key Components
Skype Client (SC): The Skype application, which can be used to place calls, send messages,
etc. The Skype network is an overlay network, and thus each SC needs to build and refresh a
table of reachable nodes. In Skype, this table is called the host cache (HC) and it contains the IP
addresses and port numbers of super nodes. The host cache is stored in an XML file called
"shared.xml", which also stores NAT and firewall information. If this file is not
present, the SC tries to establish a TCP connection with each of the seven Skype-maintained
default SNs' IP addresses on port 33033.
● Super Node (SN): Super nodes are the endpoints to which Skype clients connect. Any
node with a public IP address and sufficient CPU, memory, and network bandwidth is a
candidate to become a super node, and a Skype client cannot prevent itself from becoming a
super node. Also, if a SC cannot establish a TCP connection with a SN, it will report a
login failure.
● Skype Authentication Server: This is the only centralized Skype server which is used
to authenticate Skype users. An authenticated user is then announced to other peers and
buddies. If the user saves his/her credentials, authentication will not be necessary. This server
(IP address: 212.72.49.141 [Buddy list] or 195.215.8.141) also stores the buddy list for each
user. Note that the buddy list is also stored locally in an unencrypted file called "config.xml".
In addition, if two SCs have the same buddy, their corresponding config.xml files have a
different four-byte number for the same buddy. Finally, it has been shown that Skype routes
login messages through SNs if the authentication server is blocked.



● Start of Message (SoM) Structure: Skype uses the same port to communicate with the
outside world. Therefore, it needs an unencrypted structure at the beginning of each UDP
packet to analyze the sequence and flows at the application layer. This structure is called the SoM.

Skype Connections
● Skype to Skype (End to End, E2E): call signalling and media transfer
1. If both caller and receiver are on public IPs and receiver is in the buddy list of the
caller, then they establish a call through a direct TCP connection with each other and
transfer media using UDP.
2. If the caller or receiver is behind a port-restricted NAT, then they establish a call through
a few packets initially exchanged between caller, receiver, SN and other hosts, after which a UDP
connection is established between the caller and receiver, which is used to transfer media
as well.
3. If caller and receiver are behind a UDP-restricted firewall, they need a relay node
in between to establish TCP connections to, and the traffic (including media) then goes
through it from one side to the other.
For users that are not present in the buddy list, call placement is equal to user search plus
call signalling.
● Skype to PSTN (Public Switched Telephone Network) (SkypeOut): For SkypeOut, the
application initially contacts the SN and then the PSTN gateway at port 12340. The gateway
servers are a separate part of the architecture and not a part of the overlay network. In
addition, host servers 195.215.8.140 and 212.72.49.155 are only contacted when a user tries
to call another user on the PSTN network; therefore, we assume these servers to be Skype-to-
PSTN gateways (SkypeOut).
Skype Functions
● Startup: When SC v1.4 was run for the first time after installation, it sent an HTTP 1.1 GET
request to the Skype server (skype.com). The first line of this request contained
the keyword 'installed'.
● Login : Login is perhaps the most critical function to the Skype operation. It is during this
process a SC authenticates its user name and password with the login server, advertises its
presence to other peers and its buddies, determines the type of NAT and firewall it is behind,
discovers online Skype nodes with public IP addresses, and checks the availability of latest
Skype version.
● User Search : Skype uses its Global Index (GI) technology to search for a user. Skype
claims that search is distributed and is guaranteed to find a user if it exists and has logged in
during the last 72 hours. Extensive testing suggests that Skype was always able to locate users
who logged in using a public or private IP address in the last 72 hours.
● Call Establishment and Teardown: We consider call establishment for users that are in
the buddy list of the caller and for users that are not. It is important to
note that call signaling is always carried over TCP. For users that are not present in the buddy
list, call placement is equal to user search plus call signaling.
● Media Transfer and Codecs : If both Skype clients (v1.4) were on machines with public
IP addresses, then media traffic flowed directly between them over UDP. The media traffic
flowed to and from the UDP port configured in the options dialog box. The voice packet size
varied between 40 and 120 bytes. For two users connected to Internet over 100 Mbps Ethernet
with almost no congestion in the network, roughly 85 voice packets were exchanged both
ways in one second. The total uplink and downlink bandwidth used for voice traffic was 5
kilobytes/s. This bandwidth usage agrees with the Skype claim of 3-16 kilobytes/s.
● Conference Calling: During a conference call, the most powerful machine always gets
elected as the conference host, and the other clients send their data to that host.



1.8 Case Study: Bittorrent

What is BitTorrent?
BitTorrent is an efficient content-distribution system that uses file swarming. It usually
does not perform all the functions of a typical P2P system, such as searching.

Characteristics of the BitTorrent protocol


• Peer selection is about selecting peers who are willing to share files back to the current
peer
– Tit for tat in peer selection based on download-speed.
– The mechanism uses a choking/unchoking mechanism to control peer selection.
The goal is to get good TCP performance and mitigate free riders
• Optimistic unchoking
– The client uses a part of its available bandwidth for sending data to random peers
– The motivation for this mechanism is to avoid bootstrapping problem with the tit for
tat selection process and ensure that new peers can join the swarm
• Piece selection is about supporting high piece diversity
– Local Rarest First for piece selection (start with random, then finally use end game
mode)
– BITFIELD message after handshake with a peer, then HAVE messages for
downloaded pieces
• End game mode
– To avoid delays in obtaining the last blocks the protocol requests the last blocks
from all peers
– Sends cancel messages for downloaded blocks to avoid unnecessary transmissions
– When to start the end game mode is not detailed in the specification

File sharing
To share a file or group of files, a peer first creates a .torrent file, a small file that
contains:
(1) metadata about the files to be shared, and
(2) information about the tracker, the computer that coordinates the file distribution.
Peers first obtain a .torrent file, and then connect to the specified tracker, which tells
them from which other peers to download the pieces of the file.

Downloading a file
Large files are broken into pieces of size between 64 KB and 1 MB.



Pieces and Sub-Pieces
• A piece is broken into sub-pieces ... typically 16KB in size
• Until a piece is assembled, only download sub-pieces of that piece
• This policy lets pieces assemble quickly
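The two levels of granularity just described can be sketched together. The SHA-1 hash of each piece is what the .torrent metainfo stores, so an assembled piece can be verified immediately; the concrete 256 KB / 16 KB sizes below are illustrative values within the ranges given in the text:

```python
import hashlib

PIECE = 256 * 1024     # piece size: 64 KB - 1 MB in practice; 256 KB here
BLOCK = 16 * 1024      # sub-piece (block) size requested over the wire

def split_file(data):
    pieces = [data[i:i + PIECE] for i in range(0, len(data), PIECE)]
    # The SHA-1 digest of each piece is recorded in the .torrent file,
    # letting a downloader verify every piece as soon as it is assembled.
    hashes = [hashlib.sha1(p).digest() for p in pieces]
    blocks = [[p[j:j + BLOCK] for j in range(0, len(p), BLOCK)] for p in pieces]
    return hashes, blocks

hashes, blocks = split_file(bytes(300 * 1024))   # a 300 KB "file" of zeros
```

Here the first piece splits into sixteen 16 KB blocks, while the final 44 KB piece yields three blocks, the last one short.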

Pipelining
• When transferring data over TCP, always have several requests pending at once, to avoid
a delay between pieces being sent. At any point in time, some number, typically 5, are
requested simultaneously.
• Every time a piece or a sub-piece arrives, a new request is sent out.
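The sliding window of outstanding requests can be simulated in a few lines; the arrival of a block is modelled simply as popping the oldest pending request:

```python
from collections import deque

def pipelined_download(blocks, window=5):
    # Keep `window` requests outstanding; each arrival triggers a new request,
    # so the connection never drains between sub-pieces.
    outstanding = deque(blocks[:window])
    next_req = window
    received = []
    while outstanding:
        received.append(outstanding.popleft())   # a sub-piece arrives
        if next_req < len(blocks):               # immediately request another
            outstanding.append(blocks[next_req])
            next_req += 1
    return received

order = pipelined_download(list(range(12)))
```

The invariant is that up to five requests are in flight at any moment, which keeps the TCP pipe full.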

Piece Selection
• The order in which pieces are selected by different peers is critical for good performance
• If an inefficient policy is used, then peers may end up in a situation where each has an
identical set of easily available pieces, and none of the missing ones.
• If the original seed is prematurely taken down, then the file cannot be completely
downloaded! So what makes a "good" policy?

BT: internal Chunk Selection mechanisms


• Strict Priority
– First Priority
• Rarest First
– General rule
• Random First Piece
– Special case, at the beginning
• Endgame Mode
– Special case

Random First Piece


• Initially, a peer has nothing to trade
• Important to get a complete piece ASAP
• Select a random piece of the file and download it
Rarest Piece First
• Determine the pieces that are most rare among your peers, and download those first.
• This ensures that the most commonly available pieces are left till the end to download.
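A sketch of rarest-first ordering, assuming each peer's availability is known from its BITFIELD/HAVE messages (represented here as simple 0/1 lists indexed by piece number):

```python
from collections import Counter

def rarest_first(have, peer_bitfields):
    # Availability counts come from the BITFIELD/HAVE messages peers send.
    counts = Counter()
    for bits in peer_bitfields:
        for piece, present in enumerate(bits):
            if present:
                counts[piece] += 1
    # Download the pieces we lack, least-replicated first.
    missing = [p for p in counts if p not in have]
    return sorted(missing, key=lambda p: counts[p])

# Piece 0 is already held locally; pieces 1 and 2 are each held by one peer,
# piece 3 by all three peers, so 3 is deliberately fetched last.
order = rarest_first(have={0}, peer_bitfields=[[1, 1, 0, 1],
                                              [1, 0, 0, 1],
                                              [1, 0, 1, 1]])
```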

Data transport in BitTorrent


• Typically, BitTorrent uses TCP as its transport protocol for exchanging pieces,
and it uses HTTP for tracker communication.
• It is possible to use the HTTP port and real/fake HTTP headers for transport to avoid
throttling (not in the specification)
• The well-known TCP port range for BitTorrent traffic is 6881-6889 (and 6969 for the
tracker port).
• The DHT extension (peer-to-peer tracker) uses various UDP ports negotiated by
the peers.
• Web seeding (extension) Use HTTP to download pieces from Web sites
• Security extensions (similar to TLS: message stream encryption)



UNIT II OPERATING SYSTEMS AND COMMUNICATIONS 9
Process – Threads – Virtualization – Client-Server Model - Case Study: Apache Web
server -Code Migration- Communication: Fundamentals - Remote Procedure Call –
Stream oriented communication – Message oriented communication – Multicast
communication.
2.0 OPERATING SYSTEMS AND COMMUNICATIONS
Introduction to Process and Threads
Build virtual processors in software, on top of physical processors:
Processor: Provides a set of instructions along with the capability of automatically
executing a series of those instructions.
Thread: A minimal software processor in whose context a series of instructions can be
executed. Saving a thread context implies stopping the current execution and
saving all the data needed to continue the execution at a later stage.
Process: A software processor in whose context one or more threads may be executed.
Executing a thread means executing a series of instructions in the context of
that thread.
Context Switching
Processor context: The minimal collection of values stored in the registers of a
processor used for the execution of a series of instructions (e.g., stack
pointer, addressing registers, program counter).
Thread context: The minimal collection of values stored in registers and memory, used
for the execution of a series of instructions (i.e., processor context,
state).
Process context: The minimal collection of values stored in registers and memory,
used for the execution of a thread (i.e., thread context, but now also at
least MMU register values).

Threads share the same address space. Thread context switching can be done entirely
independent of the operating system.

Process switching is generally more expensive, as it involves getting the OS in the
loop, i.e., trapping to the kernel.

Creating and destroying threads is much cheaper than doing so for processes.
Process State
As a process executes, it changes state. The state of a process is defined in part by the
current activity of that process. A process may be in one of the following states:
• New. The process is being created.
• Running. Instructions are being executed.
• Waiting. The process is waiting for some event to occur (such as an I/O completion or
reception of a signal).
• Ready. The process is waiting to be assigned to a processor.
• Terminated. The process has finished execution.
CPU switch from process to process

2.1 Threads and Operating Systems


To execute a program, an operating system creates a number of virtual processors, each
one for running a different program. To keep track of these virtual processors, the operating
system has a process table, containing entries to store CPU register values, memory maps,
open files, accounting information, privileges, etc.

A solution lies in a hybrid form of user-level and kernel-level threads, generally referred
to as lightweight processes (LWPs). An LWP runs in the context of a single (heavy-weight)
process, and there can be several LWPs per process. In addition to having LWPs, a system
also offers a user-level thread package, offering applications the usual operations for creating
and destroying threads. The package also provides facilities for thread synchronization,
such as mutexes and condition variables. The important issue is that the thread package is
implemented entirely in user space.

Combining kernel-level lightweight processes and user-level threads.

User-level packages: user-space solution (implementation 1)

• All operations can be completely handled within a single process → implementations can
be extremely efficient.
• All services provided by the kernel are done on behalf of the process in which a thread
resides → if the kernel decides to block a thread, the entire process will be blocked.
• Threads are used when there are lots of external events: threads block on a per-event basis
→ if the kernel can't distinguish threads, how can it support signaling events to them?



Kernel solution (implementation 2)
The whole idea is to have the kernel contain the implementation of a thread package. This
means that all thread operations are implemented as system calls.
• Operations that block a thread are no longer a problem: the kernel schedules
another available thread within the same process.
• Handling external events is simple: the kernel (which catches all events) schedules
the thread associated with the event.
• The problem is (or used to be) the loss of efficiency due to the fact that each thread
operation requires a trap to the kernel.

Hybrid implementations try to mix user-level and kernel-level threads into a single concept;
however, the performance gain has not turned out to outweigh the increased complexity.

Threads and Distributed Systems


Multithreaded Web client : Hiding network latencies:
• Web browser scans an incoming HTML page, and finds that more files need to be
fetched.
• Each file is fetched by a separate thread, each doing a (blocking) HTTP request.
• As files come in, the browser displays them.
Multiple request-response calls to other machines (RPC)
• A client does several calls at the same time, each one by a different thread.
• It then waits until all results have been returned.
• Note: if calls are to different servers, we may have a linear speed-up.
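The fan-out described in the bullets above can be sketched with plain threads; the `sleep` stands in for one round-trip of network latency, and the server names are invented:

```python
import threading
import time

def rpc_call(server, results, idx):
    time.sleep(0.1)                       # stand-in for one round-trip of latency
    results[idx] = "reply from " + server

servers = ["s1", "s2", "s3"]
results = [None] * len(servers)
threads = [threading.Thread(target=rpc_call, args=(s, results, i))
           for i, s in enumerate(servers)]
start = time.time()
for t in threads:
    t.start()                             # all three calls are now in flight
for t in threads:
    t.join()                              # wait until every reply has arrived
elapsed = time.time() - start             # roughly one latency, not three
```

Because the three calls overlap, total waiting time is close to one round-trip rather than the sum of three: the near-linear speed-up the bullet mentions for calls to different servers.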
Improve performance:
• Starting a thread is much cheaper than starting a new process.
• Having a single-threaded server prohibits simple scale-up to a multiprocessor system.
• As with clients: hide network latency by reacting to next request while previous one is
being replied.
Better structure
• Most servers have high I/O demands. Using simple, well-understood blocking calls
simplifies the overall structure.
• Multithreaded programs tend to be smaller and easier to understand due to simplified
flow of control.

Advantages of using LWPs in combination with a user-level thread package:


1. Creating, destroying, and synchronizing threads is relatively cheap and involves no
kernel intervention at all.
2. Provided that a process has enough LWPs, a blocking system call will not suspend the
entire process.
3. There is no need for an application to know about the LWPs. All it sees are
user-level threads.
4. LWPs can be easily used in multiprocessing environments, by executing different
LWPs on different CPUs. This multiprocessing can be hidden entirely from the
application.

Multithreaded server



Differences between a process and a thread:

Definition:
  Process: a program under execution, i.e. an active program.
  Thread: a lightweight unit of execution that can be managed independently by a scheduler.
Context switching time:
  Process: requires more time for context switching, as processes are heavier.
  Thread: requires less time for context switching, as threads are lighter than processes.
Memory sharing:
  Process: processes are totally independent and do not share memory.
  Thread: a thread may share some memory with its peer threads.
Communication:
  Process: communication between processes requires more time than between threads.
  Thread: communication between threads requires less time than between processes.
Blocking:
  Process: if a process gets blocked, the remaining processes can continue execution.
  Thread: if a user-level thread gets blocked, all of its peer threads also get blocked.
Resource consumption:
  Process: processes require more resources than threads.
  Thread: threads generally need fewer resources than processes.
Dependency:
  Process: individual processes are independent of each other.
  Thread: threads are parts of a process and so are dependent on it.
Data and code sharing:
  Process: processes have independent data and code segments.
  Thread: a thread shares the data segment, code segment, files, etc. with its peer threads.
Treatment by the OS:
  Process: all processes are treated separately by the operating system.
  Thread: all user-level peer threads are treated as a single task by the operating system.
Time for creation:
  Process: processes require more time for creation.
  Thread: threads require less time for creation.
Time for termination:
  Process: processes require more time for termination.
  Thread: threads require less time for termination.
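The memory-sharing rows above can be demonstrated directly: all threads of one process update the same Python object, so only a lock is needed, whereas separate processes would need explicit IPC or shared-memory machinery:

```python
import threading

counter = {"value": 0}        # one object in the process's address space
lock = threading.Lock()

def work():
    for _ in range(10_000):
        with lock:            # threads share memory, so updates must be guarded
            counter["value"] += 1

threads = [threading.Thread(target=work) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Every thread saw and modified the same counter: 4 * 10_000 increments.
```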

2.2 Virtualization
Virtualization is a technology that allows operating systems to run as applications
within other operating systems.
Virtualization is one member of a class of software that also includes emulation.
Emulation is used when the source CPU type is different from the target CPU type. For
example, when Apple switched from the IBM Power CPU to the Intel x86 CPU for its
desktop and laptop computers, it included an emulation facility called “Rosetta,” which
allowed applications compiled for the IBM CPU to run on the Intel CPU. That same concept
can be extended to allow an entire operating system written for one platform to run on
another. Microsoft's product in this category is Hyper-V (a hypervisor).

It is becoming increasingly important:


• Hardware changes faster than software
• Ease of portability and code migration
• Isolation of failing or attacked components
Fig : a) General organization between a program, interface, and system.
b) General organization of virtualizing system A on top of system B.

Types of VM
Different kinds of virtual machines, each with different functions:
• System virtual machines (also termed full virtualization VMs) provide a substitute for a
real machine. They provide functionality needed to execute entire operating systems.
A hypervisor uses native execution to share and manage hardware, allowing for multiple
environments which are isolated from one another, yet exist on the same physical machine.
Modern hypervisors use hardware-assisted virtualization, i.e. virtualization-specific
hardware support, primarily from the host CPUs.

Process virtual machines are designed to execute computer programs in a platform-


independent environment. Some virtual machines, such as QEMU, are designed to also
emulate different architectures and allow execution of software applications and operating
systems written for another CPU or architecture. Operating-system-level virtualization allows
the resources of a computer to be partitioned via the kernel. The terms are not universally
interchangeable.

2.2.1 Architecture of VMs


Virtualization can take place at very different levels, strongly depending on the interfaces
as offered by various systems components:

Four Different types of Interfaces in VM :


1. An interface between the hardware and software, consisting of machine instructions
that can be invoked by any program.
2. An interface between the hardware and software, consisting of machine instructions that
can be invoked only by privileged programs, such as an operating system.
3. An interface consisting of system calls as offered by an operating system.
4. An interface consisting of library calls, generally forming what is known as an
application programming interface (API). In many cases, the aforementioned system
calls are hidden by an API.

Process VMs versus VM Monitors


(a) A process virtual machine, with multiple instances of (application, runtime)
combinations.
(b) A virtual machine monitor, with multiple instances of (applications, operating system)
combinations.
Fig a): Build a runtime system that essentially provides an abstract instruction set that is to
be used for executing applications. Instructions can be interpreted but could also be
emulated as is done for running Windows applications on UNIX platforms.

Fig b): A system that is essentially implemented as a layer completely shielding the
original hardware, but offering the complete instruction set of that same (or other)
hardware as an interface. Crucially, this interface can be offered simultaneously to
different programs. As a result, it is now possible to have multiple, and different,
operating systems running independently and concurrently on the same platform.

Process VM: A program is compiled to intermediate (portable) code, which is then executed
by a runtime system (Example: Java VM).



• VM Monitor: A separate software layer mimics the instruction set of hardware →
a complete operating system and its applications can be supported (Example:
VMware, VirtualBox).

2.2.3 VM Monitors on operating systems


VMMs run on top of existing operating systems.
• They perform binary translation: while executing an application or operating system,
they translate its instructions into those of the underlying machine.
• They distinguish sensitive instructions, which trap to the original kernel (think of
system calls, or privileged instructions).
• Sensitive instructions are replaced with calls to the VMM.

2.3 Client-Server Model:


Clients: User Interfaces
Clients and servers, and the ways they interact. A major part of client-side software is
focused on (graphical) user interfaces.
(a) A networked application with its own protocol. (b) A general solution to allow access
to remote applications.

One of the most widely used networked user interfaces is the X Window System. X is used
to control bit-mapped terminals, which include a monitor, keyboard, and a pointing device
such as a mouse. In a sense, X can be viewed as the part of an operating system that
controls the terminal. It is designed to provide a low-level mechanism for managing the
graphics display, but not to have any control over what is displayed. It has the
flexibility to be used in many different ways, from the simplest window management up to
the most complex desktop environments. X.Org currently maintains and provides an
open-source implementation of the X Window System.

Client-Side Software
• access transparency: client-side stubs for RPCs
• location/migration transparency: let client-side
software keep track of actual location.
• replication transparency: multiple invocations
handled by client stub.
• failure transparency: can often be placed only
at client (we’re trying to mask server and communication failures).
General Design Issues



(a) Client-to-server binding using a daemon. (b) Client-to-server binding using a superserver.
1. Servers and state
Stateless servers : Never keep accurate information about the status of a client after having
handled a request:
Don’t record whether a file has been opened (simply close it again after access)
Don’t promise to invalidate a client’s cache
Don’t keep track of your clients
Consequences :
Clients and servers are completely independent
State inconsistencies due to client or server crashes are reduced
Possible loss of performance because, e.g., a server cannot anticipate client behavior
(think of prefetching file blocks)

Stateful servers : Keep track of the status of their clients.


Record that a file has been opened, so that prefetching can be done
Knows which data a client has cached, and allows clients to keep local copies of shared data
The performance of stateful servers can be extremely high, provided clients are allowed to
keep local copies. As it turns out, reliability is not a major problem.
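The contrast can be made concrete with a toy file service; the interface below is invented for illustration. The stateless variant requires the client to supply the offset on every request, while the stateful variant remembers a per-client read position:

```python
FILES = {"notes.txt": b"abcdefgh"}

def stateless_read(name, offset, length):
    # The request carries all context; the server remembers nothing afterwards.
    return FILES[name][offset:offset + length]

class StatefulFileServer:
    def __init__(self, files):
        self.files = files
        self.positions = {}                     # per-(client, file) read position

    def open(self, client, name):
        self.positions[(client, name)] = 0      # server-side state begins here

    def read(self, client, name, length):
        pos = self.positions[(client, name)]
        self.positions[(client, name)] = pos + length
        return self.files[name][pos:pos + length]

chunk1 = stateless_read("notes.txt", 2, 3)
srv = StatefulFileServer(FILES)
srv.open("c1", "notes.txt")
chunk2 = srv.read("c1", "notes.txt", 4)
chunk3 = srv.read("c1", "notes.txt", 4)   # continues where the last read stopped
```

If the stateless server crashes and restarts, clients lose nothing; if the stateful one crashes, the `positions` table, and hence every open "session", is gone.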

2. Sequential vs. concurrent servers
Sequential: the server handles one request at a time. It can still service multiple
requests by employing events and asynchronous communication.
Concurrent: the server spawns a process or thread to service each request. It can also
use a pre-spawned pool of threads/processes (Apache).
Servers can thus be pure-sequential, event-based, thread-based, or process-based.
Discussion: which architecture is most efficient?
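A minimal thread-per-request (concurrent) server can be sketched with sockets; the upper-casing "service" and the loopback setup are purely illustrative:

```python
import socket
import threading

def handle(conn):
    data = conn.recv(1024)        # one request
    conn.sendall(data.upper())    # trivial service: upper-case the request
    conn.close()

def serve(srv, n_requests):
    for _ in range(n_requests):   # spawn one thread per accepted connection
        conn, _ = srv.accept()
        threading.Thread(target=handle, args=(conn,)).start()

srv = socket.socket()
srv.bind(("127.0.0.1", 0))        # port 0: let the OS pick a free port
srv.listen(5)
threading.Thread(target=serve, args=(srv, 2), daemon=True).start()

replies = []
for msg in (b"ping", b"pong"):
    c = socket.socket()
    c.connect(srv.getsockname())
    c.sendall(msg)
    replies.append(c.recv(1024))
    c.close()
srv.close()
```

A sequential server would instead call `handle(conn)` inline in the accept loop, so a slow request would delay every later client.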

Server Clusters
Request Handling: Having the first tier handle all
communication from/to the cluster may lead to a bottleneck.

Server clusters: three different tiers


The first tier is generally responsible for passing
requests to an appropriate server.

Distributed servers with stable IPv6 address(es)



Types of servers
Superservers: Servers that listen to several ports, i.e., provide several independent
services. In practice, when a service request comes in, they start a subprocess to handle
the request (UNIX inetd).

Iterative vs. concurrent servers: Iterative servers can handle only one client at a time, in
contrast to concurrent servers

Thick Client:
A thick client, also known as a fat, rich or heavy client, is a component of client-server
architecture that is connected to the server through a network connection and does not
consume any of the server's computing resources to execute applications.
Why To Select Thick Client?
A thick client is a type of client device in client-server architecture that has most hardware
resources on board to perform operations, run applications and perform other functions
independently.
Although a thick client can perform most operations, it still needs to be connected to the
primary server to download programs
Where To Implement Thick Client?
Thick clients are generally implemented in computing environments where the primary
server has low network speed or limited computing and storage capacity to serve the client
machines, or where there is a need to work offline.
Thin Client
A thin client, also known as a lean, zero or slim client, is a computer or computer program
that depends heavily on some other computer (the server) to fulfill its computational roles.
Why To Select Thin Client?
Thin clients occur as components of a broader computer infrastructure, where many
clients share their computations with the same server. As such, thin-client
infrastructure can be viewed as providing some computing service via several user
interfaces. This is desirable in contexts where individual thick clients would have much
more functionality or power than the infrastructure requires.
Thin-client computing is also a way of easily maintaining computational services at a
reduced total cost of ownership.

Other types of Servers


Application Servers, Audio/Video Servers, Chat Servers, Fax Servers, FTP Servers,
Groupware Servers, IRC Servers, List Servers, Mail Servers, News Servers, Proxy Servers,
Telnet Servers, and Web Servers.
Servers : A server is a process that waits for incoming service requests at a specific transport
address. In practice, there is a one-to-one mapping between a port and a service.
ftp-data   20   File Transfer [Default Data]
ftp        21   File Transfer [Control]
telnet     23   Telnet
           24   any private mail system
smtp       25   Simple Mail Transfer
login      49   Login Host Protocol
sunrpc    111   SUN RPC (portmapper)
courier   530   Xerox RPC



Out-of-band communication Issue:
Is it possible to interrupt a server once it has accepted (or is in the process of accepting) a
service request?

Solution 1 Use a separate port for urgent data:


Server has a separate thread/process for urgent messages
Urgent message comes in → associated request is put on hold
Note: this requires that the OS support priority-based scheduling

Solution 2 Use out-of-band communication facilities of the transport layer:


Example: TCP allows for urgent messages in same connection
Urgent messages can be caught using OS signaling techniques
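Solution 2 can be tried out with TCP's urgent-data facility: `select()` reports pending urgent data as an "exceptional condition" on the socket. This is only a loopback sketch, and MSG_OOB behavior varies somewhat across platforms:

```python
import select
import socket
import threading

def oob_server(srv, out):
    conn, _ = srv.accept()
    data, urgent = b"", None
    while urgent is None or len(data) < 5:
        # select's third set flags an "exceptional condition": urgent data pending.
        readable, _, exceptional = select.select([conn], [], [conn], 5.0)
        if not (readable or exceptional):
            break                                  # timed out: give up
        if exceptional:
            urgent = conn.recv(1, socket.MSG_OOB)  # fetch the out-of-band byte
        if readable:
            data += conn.recv(64)                  # ordinary in-band bytes
    out["data"], out["urgent"] = data, urgent
    conn.close()

srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
out = {}
t = threading.Thread(target=oob_server, args=(srv, out))
t.start()

cli = socket.socket()
cli.connect(srv.getsockname())
cli.sendall(b"hello")              # normal request data
cli.send(b"!", socket.MSG_OOB)     # urgent message on the same TCP connection
t.join()
cli.close()
srv.close()
```

The urgent byte travels on the same connection but is delivered outside the normal byte stream, which is exactly the facility Solution 2 relies on.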

Client Server Architecture


Client-server architecture is a shared architecture in which the load is divided between
clients and servers. It is a centralized resource system where the server holds all the
resources. The server receives numerous requests and shares its resources with clients on
demand. Client and server may run on the same machine or communicate over a network. The
server is highly stable and scalable in returning answers to clients. This architecture is
service-oriented, which means the client's service will not be interrupted. Client-server
architecture reduces network traffic by responding to specific queries from clients rather
than performing complete file transfers. It replaces the file server with a database server.
Client computers provide an interface that allows a user to request services of the
server and to present the results the server returns. Servers wait for requests to arrive
from clients and then respond to them. A server usually offers a standardized, simple
interface to clients, so that clients need not know the details of the server's hardware
and software. Clients are located at workplaces or on personal machines, while servers are
located somewhere more powerful in the network. This architecture is useful mostly when
clients and the server each have distinct tasks that they routinely perform. Many clients
can obtain the server's information concurrently, and a client computer can meanwhile
execute other tasks, for instance, sending e-mails.

Types of Client Server Architecture


1-tier architecture
In this category of client-server setting, the user interface, business logic and data
logic are present in the same system. This kind of service is inexpensive but hard to
manage, because data variance leads to replication of work. One-tier architecture consists
of layers, for example Presentation, Business and Data Access layers within a single
software package. The data is usually stored in the local system or on a shared drive.
Applications which handle all three tiers, such as an MP3 player or MS Office, are
one-tier applications.
2-tier architecture
In this type of client-server environment, the user interface is stored on the client
machine and the database is stored on the server. Database logic and business logic reside
at either the client or the server, and must be maintained there. If business logic and
data logic are kept on the client side, the result is called fat client, thin server
architecture. If business logic and data logic are handled on the server, it is called
thin client, fat server architecture, which is considered more affordable.
In two-tier architecture, client and server
have to come in direct incorporation. If a client is
giving an input to the server there shouldn’t be
any intermediate. This is done for rapid results
and to avoid confusion between different clients.
For instance, online ticket reservations software
use this two-tier architecture.

3-tier architecture
In this variety of client-server context, an extra middleware layer is used: the client request
goes to the server through that middle layer, and the server's response is received first by the
middleware and then passed on to the client. This architecture improves upon 2-tier
architecture and gives the best performance. The system is expensive but simple to use. The
middleware stores all the business logic and data access logic, and handles database staging,
queuing, application execution, scheduling, etc. Middleware improves flexibility and gives the
best performance.
The three-tier architecture is split into three parts, namely the Presentation layer (Client
Tier), the Application layer (Business Tier) and the Database layer (Data Tier). The client system
manages the Presentation layer, the application server takes care of the Application layer, and the
server system supervises the Database layer.
In the present scenario of online business, there is a growing demand for quick
responses and quality services, so multi-tier client architecture is crucial for business
activities. Companies usually explore such architectures to keep service and quality at levels that
maintain their position in the marketplace. The architecture increases
productivity through cost-efficient user interfaces, improved data storage,
expanded connectivity and secure services.
4-tier architecture
In a 4-tier architecture the system is layered as Database tier -> Application tier ->
Presentation tier -> Client tier.



List out or draw a typical client-server session communication model.
Step 1: Server and client each create a stream socket s with the socket() call.
Step 2: (Optional for the client) Server binds socket s to a local address with the bind() call.
Step 3: Server uses the listen() call to alert the TCP/IP stack of its willingness to
accept connections.
Step 4: Client connects socket s to a foreign host with the connect() call.
Step 5: Server accepts the connection and receives a second socket, for example ns, with
the accept() call.
Steps 6 and 7: Server reads and writes data on socket ns, and client reads and writes data on
socket s, using the send() and recv() calls, until all data has been exchanged.
Step 8: Server closes socket ns with the close() call. Client closes socket s, ending the
TCP/IP session, with the close() call. The server then returns to step 5 to accept the next connection.



2.4 Case Study: Apache Web server
Distributed Cooperative Apache Web Server (DC-Apache Web Server)
The DC-Apache system can dynamically manipulate the hyperlinks embedded in web
documents in order to distribute access requests among multiple cooperating web servers.

A collection of web documents can be viewed as a directed document graph, where each
document is a node and each hyperlink is a directed link from one node to another.
The DC-Apache solution takes this graph-based approach and is built upon the hypothesis
that most web sites have only a few well-known entry points from which users start
navigating through the documents on those sites. Empirical studies performed on the current
prototype system indicate that the DC-Apache system has a high potential for achieving
linear scalability by effectively removing potential bottlenecks caused by centralized
resources.

Apache allows its functionality to be extended by linking new modules directly into its
binary code.
The main process of the Apache server dispatches requests to several child processes.
Each child process services the request in several phases of the request processing cycle.
Using the handler mechanism, the DC-Apache system is implemented as an Apache module
that processes requests in the request processing cycle. The pinger process computes and
collects load information about participating servers; it also carries out the task of
migrating and replicating documents. The shared memory contains the document graph and
statistics information.



2.5 Code Migration- Communication:
Process migration (aka strong mobility)
– Improved system-wide performance – better utilization of system-wide resources
– Examples: Condor, DQS
Code migration (aka weak mobility)
– Shipment of server code to client – filling forms (reduce communication, no need to
pre-link stubs with client)
– Ship parts of client application to server instead of data from server to client
(e.g., databases)
– Improve parallelism – agent-based web searches

Flexibility – Dynamic configuration of distributed system – Clients don’t need
preinstalled software – download on demand

Migration models
• Process = code seg + resource seg + execution seg
• Weak versus strong mobility
– Weak => transferred program starts from initial state
• Sender-initiated versus receiver-initiated
– Sender-initiated (code is with sender)
• Client sending a query to database server
• Client should be pre-registered
– Receiver-initiated
• Java applets
• Receiver can be anonymous
• Code migration:
– Execute in a separate process
– [Applets] Execute in target process
• Process migration
– Remote cloning
– Migrate the process
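Weak mobility can be illustrated with a toy sketch: the code segment is shipped as plain text (the `handle` function here is invented for illustration), and the receiver installs it in a fresh namespace, so execution starts from the initial state rather than from any transferred execution state.

```python
# Weak mobility sketch: only the code segment travels, as text; the
# receiver starts it from its initial state (no execution state moves).
code_segment = """
def handle(request):
    # trivial 'migrated' service: echo the request in upper case
    return request.upper()
"""

namespace = {}                  # fresh initial state at the receiver
exec(code_segment, namespace)   # "install" the shipped code segment
result = namespace["handle"]("ping")
print(result)
```

This mirrors receiver-initiated migration such as applets: the receiver downloads code and runs it from scratch.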

Models for Code Migration



Resources Migrate
• Depends on resource to process binding
– By identifier: specific web site, ftp server
– By value: Java libraries
– By type: printers, local devices
• Depends on type of “attachments”
– Unattached to any node: data files
– Fastened resources (can be moved only at high cost)
• Database, web sites
– Fixed resources
• Local devices, communication end points

Resource Migration Actions

Actions to be taken with respect to the references to local resources when migrating code to
another machine.
• GR: establish global system-wide reference
• MV: move the resources
• CP: copy the resource
• RB: rebind process to locally available resource

Machine Migration
Rather than migrating code or a process, migrate an “entire machine” (OS + all processes)
– Feasible if virtual machines are used
– The entire VM is migrated
• Can handle small differences in architecture (Intel–AMD)
• Live VM migration: migrate while executing
– Assume a shared disk (no need to migrate disk state)
– Iteratively copy memory pages (memory state)
• Subsequent rounds: send only the pages dirtied in the prior round
• Final round: pause and switch to the new machine
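The iterative pre-copy scheme can be simulated in a few lines. This is only a sketch with made-up page contents and a random dirtying pattern, not a real hypervisor.

```python
# Simulation of iterative pre-copy: each round copies the pages dirtied
# in the previous round; the final round pauses the VM and copies the rest.
import random

random.seed(0)
PAGES = 100
source = {p: f"v{p}" for p in range(PAGES)}   # memory state at the source
dest = {}

dirty = set(source)                           # round 1: copy everything
rounds = 0
while len(dirty) > 5 and rounds < 10:         # stop when the dirty set is small
    for p in dirty:
        dest[p] = source[p]                   # copy pages dirtied last round
    rounds += 1
    # the VM keeps running and re-dirties some pages; simulate the writes
    dirty = set(random.sample(range(PAGES), max(1, len(dirty) // 4)))
    for p in dirty:
        source[p] = source[p] + "'"

# final round: pause the VM, copy the remaining dirty pages, switch over
for p in dirty:
    dest[p] = source[p]
print(rounds, dest == source)
```

The dirty set shrinks every round, so the final pause copies only a handful of pages.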

Server Design Issues



Server Design
– Iterative versus concurrent
• How to locate an end-point (port #)?
– Well known port #
– Directory service (port mapper in Unix)
– Super server (inetd in Unix)
• Stateful server
– Maintain state of connected clients
– Sessions in web servers
• Stateless server
– No state for clients
• Soft state
– Maintain state for a limited time; discarding state does not impact correctness
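Soft state can be sketched with a TTL per entry. The class name `SoftState` and the sample data are invented for illustration: the point is that a discarded entry only forces a refresh, never an error.

```python
# Soft-state sketch: entries expire after a TTL; discarding them does not
# break correctness, the client simply re-establishes its state.
import time

class SoftState:
    def __init__(self, ttl):
        self.ttl = ttl
        self.entries = {}                    # client -> (state, expiry time)

    def put(self, client, state):
        self.entries[client] = (state, time.monotonic() + self.ttl)

    def get(self, client):
        item = self.entries.get(client)
        if item is None:
            return None
        state, expiry = item
        if time.monotonic() > expiry:        # stale: silently discard
            del self.entries[client]
            return None
        return state

s = SoftState(ttl=0.05)
s.put("clientA", {"last_page": "/index.html"})
print(s.get("clientA"))                      # still fresh
time.sleep(0.1)
print(s.get("clientA"))                      # expired -> None
```

A web server session cache works this way: a lost session costs only a new login, not lost data.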

Server Cluster

• Web applications use tiered architecture


– Each tier may be optionally replicated; uses a dispatcher
– Use TCP splicing or handoffs

• Sequential
– Serve one request at a time
– Can service multiple requests by employing events and asynchronous communication
• Concurrent
– Server spawns a process or thread to service each request
– Can also use a pre-spawned pool of threads/processes (apache)
• Thus servers could be
– Pure-sequential, event-based, thread-based, process-based
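A pre-spawned pool of workers (as Apache uses) can be sketched with Python's standard thread pool; the `service` handler and the request strings are placeholders.

```python
# Concurrent server sketch: a pre-spawned pool of 4 worker threads
# services simulated requests; each request is handled by some worker.
from concurrent.futures import ThreadPoolExecutor

def service(request):
    # placeholder request handler
    return f"handled:{request}"

with ThreadPoolExecutor(max_workers=4) as pool:   # pre-spawned pool
    requests = [f"req{i}" for i in range(8)]
    replies = list(pool.map(service, requests))   # requests fan out to workers

print(replies)
```

`pool.map` returns results in request order even though the workers run concurrently, which keeps the sketch deterministic.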

Scalability
• Question:How can you scale the server capacity?
• Buy bigger machine!
• Replicate
• Distribute data and/or algorithms
• Ship code instead of data
• Cache



2.6 Fundamentals - Remote Procedure Call
Basics of RPC
• A remote procedure call makes a call to a remote service look like a local call
• RPC makes transparent whether server is local or remote
• RPC allows applications to become distributed transparently
• RPC makes architecture of remote machine transparent

Communication between caller & callee can be hidden by using procedure-call


mechanism.
Stubs: obtaining transparency
• The compiler generates, from the API, stubs for a procedure on the client and server
• Client stub:
– Marshals arguments into a machine-independent format
– Sends the request to the server
– Waits for the response
– Unmarshals the result and returns to the caller
• Server stub:
– Unmarshals arguments and builds a stack frame
– Calls the procedure
– Marshals the results and sends the reply

1 Client procedure calls client stub.
2 Stub builds message; calls local OS.
3 OS sends message to remote OS.
4 Remote OS gives message to stub.
5 Stub unpacks parameters and calls server.
6 Server makes local call and returns result to stub.
7 Stub builds message; calls OS.
8 OS sends message to client’s OS.
9 Client’s OS gives message to stub.
10 Client stub unpacks result and returns to the client.
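The ten steps can be illustrated without a network by letting the client stub hand its marshalled message straight to the server stub. JSON stands in for a real machine-independent marshalling format, and `add` is a made-up remote procedure.

```python
# Stub sketch: the client stub marshals the call, the server stub
# unmarshals it, invokes the procedure, and marshals the result back.
# A direct function call replaces steps 3 and 8 (message transport).
import json

def add(a, b):                        # the "remote" procedure
    return a + b

PROCEDURES = {"add": add}

def server_stub(message):             # steps 4-7 on the server side
    request = json.loads(message)     # unmarshal the arguments
    result = PROCEDURES[request["op"]](*request["args"])
    return json.dumps({"result": result})   # marshal the reply

def client_stub(op, *args):           # steps 1-2 and 9-10 on the client side
    message = json.dumps({"op": op, "args": args})  # marshal the arguments
    reply = server_stub(message)      # "transport" the request and reply
    return json.loads(reply)["result"]

print(client_stub("add", 2, 3))       # looks exactly like a local call
```

The caller never sees the marshalling, which is the transparency the stub layer provides.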



Asynchronous RPC
The interaction between client and server in a traditional RPC.

The interaction using asynchronous RPC.

A client and server interacting through two asynchronous RPCs.

2.7 Stream-Oriented Communication


• Support for continuous media
• Streams in distributed systems
• Stream management
Continuous media
Observation: All communication facilities discussed are essentially based on a discrete, that
is, time-independent exchange of information
Continuous media: Characterized by the fact that values are time dependent:
• Audio
• Video
• Animations
• Sensor data (temperature, pressure, etc.)
Transmission modes: Different timing guarantees with respect to data transfer:
• Asynchronous: no restrictions with respect to when data is to be delivered
• Synchronous: define a maximum end-to-end delay for individual data packets
• Isochronous: define a maximum and minimum end-to-end delay (jitter is
bounded)



Stream Definition: A (continuous) data stream is a connection-oriented communication
facility that supports isochronous data transmission

Some common stream characteristics:


• Streams are unidirectional. There is generally a single source, and one or more sinks
• Often, either the sink and/or source is a wrapper around hardware (e.g., camera, CD
device, TV monitor, dedicated storage)
Stream types:
• Simple: consists of a single flow of data (e.g., audio or video)
• Complex: multiple data flows (e.g., stereo audio or combination audio/video)

Issue: Streams can be set up between two processes at different machines, or directly between
two different devices. Combinations are possible as well.

Streams and QoS


Essence: Streams are all about timely delivery of data. How do you specify this Quality of
Service (QoS)? Make distinction between specification and implementation of QoS.



Flow specification: Use a token-bucket model and express QoS in that model.
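A token bucket can be sketched as follows. The rate and capacity values are arbitrary examples; a real flow specification would also carry delay and loss parameters.

```python
# Token-bucket sketch: tokens accumulate at `rate` per second up to
# `capacity`; a packet of size n conforms only if n tokens are available.
class TokenBucket:
    def __init__(self, rate, capacity):
        self.rate = rate              # token fill rate (tokens/second)
        self.capacity = capacity      # bucket size (maximum burst)
        self.tokens = capacity
        self.last = 0.0

    def allow(self, size, now):
        # refill for the elapsed time, bounded by the bucket capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if size <= self.tokens:
            self.tokens -= size
            return True               # packet conforms to the flow spec
        return False                  # packet exceeds the negotiated QoS

tb = TokenBucket(rate=100, capacity=200)   # 100 tokens/s, burst of 200
print(tb.allow(150, now=0.0))   # True: the burst fits the bucket
print(tb.allow(100, now=0.0))   # False: only 50 tokens remain
print(tb.allow(100, now=1.0))   # True: one second refills 100 tokens
```

The capacity bounds the burst size while the rate bounds the long-term average, which is exactly what a flow specification expresses.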

Implementing QoS
Problem: QoS specifications translate to resource reservations in underlying communication
system. There is no standard way of (1) QoS specs, (2) describing resources, (3) mapping
specs to reservations.

Approach: Use Resource reSerVation Protocol (RSVP) as first attempt. RSVP is a transport-
level protocol.

Stream Synchronization
Problem: Given a complex stream, how do you keep the different substreams in synch?
Example: Think of playing out two channels, that together form stereo sound. Difference
should be less than 20–30 µsec!



Alternative: multiplex all substreams into a single stream, and demultiplex at the receiver.
Synchronization is handled at multiplexing/demultiplexing point (MPEG).



2.8 Message oriented communication
Asynchronous communication
Sender continues immediately after it has submitted its message for transmission
Message may be stored, in a local buffer at sending host or at an intermediate
communication server
Asynchronous transmission mode: Data items in a stream are transmitted one
after the other, but there are no further timing constraints on when transmission of
items should take place.
Synchronous communication
Sender is blocked until its message is stored in a local buffer at receiving host or
actually delivered to receiver
Strongest form – Sender blocked until receiver has processed message
Synchronous transmission mode Maximum end-to-end delay defined for each unit in
a data stream.

Isochronous transmission mode It is necessary that data units are transferred on time. Data
transfer is subject to bounded (delay) jitter.

Persistent communication: A message is stored at a communication server for as long as it
takes to deliver it to the receiver.

Transient communication: A message is discarded by a communication server as soon as it
cannot be delivered at the next server or at the receiver.



1. Persistent, asynchronous communication
Messages are persistently stored in local host buffer, or at an intermediate communication
server e.g., e-mail ;

2. Persistent, synchronous communication


Messages can be persistently stored only at receiving host
Weaker form of synchronous communication
It isn’t necessary for receiving application to be executing

3. Transient, asynchronous communication


The message is temporarily stored in a local buffer and the sender immediately continues
e.g., UDP, or fire-and-forget RPC

4. Transient, synchronous communication


I. Weakest form
Receipt-based, transient, synchronous communication
Sender is blocked until the message is stored in a local buffer at the receiving host
Delivery-based, transient, synchronous communication
Sender is blocked until the message is delivered to the receiver for processing
e.g., asynchronous RPC (the delivery of the request is acknowledged)
II. Strongest form
Response-based, transient, synchronous communication
Sender is blocked until it receives a reply message
e.g., RPC & RMI

Message-oriented transient communication


Transport-level sockets
Message-Passing Interface (MPI)
Message transfer latency milliseconds to seconds

Message-oriented persistent communication


Message-queuing systems or Message-Oriented Middleware (MOM)
Provide intermediate-term storage capacity for messages
Do not require either the sender or the receiver to be active during message transmission
Message transfer latency: seconds to minutes
Applications communicate by inserting messages into a series of queues

Loosely-coupled communication
The sender is given a guarantee that its message will eventually be inserted in the
recipient’s queue
There is no guarantee on timing, or that the message will actually be read



Message-oriented communication:
• Transient messaging
• Message-Queuing System
• Message Brokers
• Example: IBM’s WebSphere
Transient messaging: sockets
Berkeley socket interface

Operation   Description
socket      Create a new communication end point
bind        Attach a local address to a socket
listen      Tell the operating system what the maximum number of pending
            connection requests should be
accept      Block caller until a connection request arrives
connect     Actively attempt to establish a connection
send        Send some data over the connection
receive     Receive some data over the connection
close       Release the connection

Sockets: Python code


Server
from socket import *

HOST, PORT = "127.0.0.1", 50007     # example address for illustration
s = socket(AF_INET, SOCK_STREAM)
s.bind((HOST, PORT))
s.listen(1)
(conn, addr) = s.accept()           # returns new socket and addr. client
while True:                         # forever
    data = conn.recv(1024)          # receive data from client
    if not data: break              # stop if client stopped
    conn.send(data + b"*")          # return sent data plus an "*"
conn.close()                        # close the connection

Client
from socket import *

HOST, PORT = "127.0.0.1", 50007     # example address for illustration
s = socket(AF_INET, SOCK_STREAM)
s.connect((HOST, PORT))             # connect to server (block until accepted)
s.send(b"Hello, world")             # send some data
data = s.recv(1024)                 # receive the response
print(data)                         # print the result
s.close()                           # close the connection
Message-oriented middleware
Essence



Asynchronous persistent communication through support of middleware-level queues.
Queues correspond to buffers at communication servers.

PUT      Append a message to a specified queue
GET      Block until the specified queue is nonempty, and remove the first message
POLL     Check a specified queue for messages, and remove the first. Never block
NOTIFY   Install a handler to be called when a message is put into the specified queue
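Three of the four primitives can be sketched with an in-process queue; the blocking GET is omitted for brevity, and a real MOM system would persist messages at communication servers rather than in local memory. The class and queue names are invented for illustration.

```python
# Sketch of the queue-interface primitives: PUT, POLL and NOTIFY over
# simple in-memory queues (GET, the blocking variant, is omitted).
from collections import defaultdict, deque

class QueueSystem:
    def __init__(self):
        self.queues = defaultdict(deque)
        self.handlers = {}

    def put(self, qname, msg):                 # PUT: append a message
        self.queues[qname].append(msg)
        handler = self.handlers.get(qname)
        if handler:                            # NOTIFY: fire the handler
            handler(msg)

    def poll(self, qname):                     # POLL: never block
        q = self.queues[qname]
        return q.popleft() if q else None

    def notify(self, qname, handler):          # NOTIFY: install a handler
        self.handlers[qname] = handler

qs = QueueSystem()
seen = []
qs.notify("orders", seen.append)
qs.put("orders", "order-1")
print(qs.poll("orders"), seen)
```

Sender and receiver never interact directly: they only agree on queue names, which is the loose coupling MOM provides.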

Message broker
Observation
Message queuing systems assume a common messaging protocol: all applications agree on
the message format (i.e., its structure and data representation). A message broker is a
centralized component that takes care of application heterogeneity: it transforms incoming
messages into the target format and often acts as an application-level gateway.

IBM’s WebSphere MQ
Basic concepts
• Application-specific messages are put into, and removed from, queues
• Queues reside under the regime of a queue manager
• Processes can put messages only in local queues, or through an RPC mechanism
Message transfer
• Messages are transferred between queues
• Message transfer between queues at different processes requires a channel
• At each endpoint of a channel is a message channel agent (MCA)
• Message channel agents are responsible for:
o Setting up channels using lower-level network communication facilities (e.g., TCP/IP)
o (Un)wrapping messages from/in transport-level packets
o Sending/receiving packets


• Channels are inherently unidirectional
• MCAs are started automatically when messages arrive
• Any network of queue managers can be created
• Routes are set up manually (system administration)
Routing
By using logical names, in combination with name resolution to local queues, it is possible to
put a message in a remote queue

Stream-oriented communication
• Support for continuous media
• Streams in distributed systems
• Stream management

Continuous media
All communication facilities discussed so far are essentially based on a discrete, that is,
time-independent exchange of information
Characterized by the fact that values are time dependent:
• Audio
• Video
• Animations
• Sensor data (temperature, pressure, etc.)



Transmission modes
Different timing guarantees with respect to data transfer:
Asynchronous: no restrictions with respect to when data is to bedelivered
Synchronous: define a maximum end-to-end delay for individualdata packets
Isochronous: define a maximum and minimum end-to-end delay(jitter is bounded)

Stream
Definition
A (continuous) data stream is a connection-oriented communication facility that supports
isochronous data transmission.
Some common stream characteristics
• Streams are unidirectional
• There is generally a single source, and one or more sinks
• Often, either the sink and/or source is a wrapper around hardware
(e.g., camera, CD device, TV monitor)
• Simple stream: a single flow of data, e.g., audio or video
• Complex stream: multiple data flows, e.g., stereo audio or
combination audio/video

Streams and QoS


Essence
Streams are all about the timely delivery of data. How do you specify this Quality of Service
(QoS)? Basics:
• The required bit rate at which data should be transported.
• The maximum delay until a session has been set up (i.e., when an application can start
sending data).
• The maximum end-to-end delay (i.e., how long it will take until a data unit makes it to
a recipient).
• The maximum delay variance, or jitter.
• The maximum round-trip delay.

Enforcing QoS: There are various network-level tools, such as differentiated services, by
which certain packets can be prioritized. Buffers can also be used to reduce jitter.

How can the effects of packet loss be reduced (when multiple samples are in a single packet)?



Stream synchronization: Given a complex stream, how do you keep the different substreams
in synch?
Example
Think of playing out two channels that together form stereo sound. The difference should be
less than 20–30 µsec!

Alternative
Multiplex all substreams into a single stream, and demultiplex at the receiver. Synchronization
is handled at the multiplexing/demultiplexing point (MPEG).
Multicast communication
• Application-level multicasting
• Gossip-based data dissemination
Application-level multicasting: organize the nodes of a distributed system into an overlay
network and use that network to disseminate data.
Chord-based tree building
1. The initiator generates a multicast identifier mid.
2. Look up succ(mid), the node responsible for mid.
3. The request is routed to succ(mid), which will become the root.
4. If P wants to join, it sends a join request to the root.
5. When the request arrives at Q:
• If Q has not seen a join request before, it becomes a forwarder; P becomes a
child of Q. The join request continues to be forwarded.
• If Q already knows about the tree, P becomes a child of Q. There is no need to
forward the join request any more.

ALM: Some costs



• Link stress: How often does an ALM message cross the same physical link?
Example: a message from A to D needs to cross the link ⟨Ra, Rb⟩ twice.
• Stretch: Ratio in delay between the ALM-level path and the network-level path.
Example: messages from B to C follow a path of length 71 at the ALM level, but 47 at the
network level => stretch = 71/47.
Gossiping: basic model
A server S that has an update to report contacts other servers. If S contacts a server to which
the update has already propagated, S stops contacting other servers with probability 1/k.
If s is the fraction of ignorant servers (i.e., those unaware of the update), it can be shown
that with many servers
s = e^(-(k+1)(1-s))
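The fraction of ignorant servers predicted by s = e^(-(k+1)(1-s)) can be obtained numerically by fixed-point iteration, a small sketch of which follows. For k = 1, roughly 20% of the servers are predicted to remain ignorant.

```python
# Solve s = e^{-(k+1)(1-s)} by fixed-point iteration to obtain the
# predicted fraction of servers that never hear the rumour.
import math

def ignorant_fraction(k, iterations=100):
    s = 0.5                                   # arbitrary starting guess
    for _ in range(iterations):
        s = math.exp(-(k + 1) * (1 - s))      # apply the update rule
    return s

for k in (1, 2, 4):
    print(k, round(ignorant_fraction(k), 4))
```

Increasing k (i.e., being more persistent before giving up) drives the ignorant fraction down rapidly.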
Deleting values: We cannot remove an old value from a server and expect the removal to
propagate. Instead, a mere removal will be undone in due time by the epidemic algorithms.
Solution
Removal has to be registered as a special update by inserting a death certificate.
Next problem
When to remove a death certificate (it is not allowed to stay forever): run a global algorithm
to detect whether the removal is known everywhere, and then collect the death certificates
(this resembles garbage collection). Alternatively, assume death certificates propagate in
finite time, and associate a maximum lifetime with each certificate (this can be done at the
risk of not reaching all servers).

It is necessary that a removal actually reaches all servers.

Example applications
• Data dissemination: perhaps the most important one. Note that there are many variants of
dissemination.
• Aggregation: let every node i maintain a variable x_i. When two nodes gossip, they each
reset their variable to (x_i + x_j)/2.

Result: in the end, each node will have computed the average of all the initial values.
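The averaging behaviour can be checked with a small simulation: ten nodes with made-up initial values, where each gossip exchange replaces both values with their mean.

```python
# Gossip-based aggregation sketch: at each exchange a random pair of
# nodes replace their values with the pair average; all values converge
# to the global mean while the sum stays (essentially) unchanged.
import random

random.seed(42)
x = [float(i) for i in range(10)]            # node i starts with value i
mean = sum(x) / len(x)                       # the target value, 4.5 here

for _ in range(5000):
    i, j = random.sample(range(len(x)), 2)   # pick two distinct nodes
    x[i] = x[j] = (x[i] + x[j]) / 2          # both adopt the pair average

print(max(abs(v - mean) for v in x))         # deviation is tiny
```

No node ever sees more than one peer at a time, yet every node ends up knowing the global average.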



2.9 Multicast communication
• Unicast
– One-to-one
– Destination – unique receiver host address
• Broadcast
– One-to-all
– Destination – address of network
• Multicast
– One-to-many
– Multicast group must be identified
– Destination – address of group
Send message to multiple nodes
• A node can join a multicast group, and receives all messages sent to that group
• The sender sends only once: to the group address
• The network takes care of delivering to all nodes in the group
• Note: groups are restricted to specific networks such as LANs & WANs – Multicast in
the university network will not reach nodes outside the network.
• A special version of broadcast (restricted to a subset of nodes)
• In a LAN
– Sender sends a broadcast
– Interested nodes accept the message others reject
• In larger networks we can use a tree
– Remember trees can be used for broadcast
– Interested nodes join the tree, and thus get messages
– All nodes can use the same tree to multicast to the same group
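Tree-based multicast can be sketched over a hypothetical overlay tree: the sender submits the message once at the root, and each tree link carries it exactly once on its way to every group member.

```python
# Tree-based multicast sketch: forward a message down a dissemination
# tree; every node delivers it locally and relays it to its children.
children = {                       # hypothetical overlay tree
    "root": ["A", "B"],
    "A": ["C", "D"],
    "B": ["E"],
    "C": [], "D": [], "E": [],
}

def multicast(node, msg, delivered):
    delivered.append((node, msg))          # deliver locally
    for child in children[node]:           # each tree link is used once
        multicast(child, msg, delivered)

delivered = []
multicast("root", "update-1", delivered)
print([n for n, _ in delivered])
```

The sender does no per-member work: one submission at the root reaches all six nodes.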

IP Multicast
• IP has a specific multicast protocol
• Addresses from 224.0.0.0 to 239.255.255.255 are reserved for multicast
– They act as groups
– Some of these are reserved for specific multicast based protocols
• Any message sent to one of the addresses goes to all processes subscribed to the group
– Must be in the same “network”
– Basically depends on how routers are configured
• In a LAN, communication is broadcast
• In more complex networks, tree-based protocols can be used
• Any process interested in joining a group informs its OS
• The OS informs the “network”
– The network interface (LAN card) receives and delivers group messages to the OS &
process
– The router may need to be informed
– IGMP – Internet group management protocol
• Sender sends only once
• Any router also forwards only once
• No acknowledgement mechanism – Uses UDP
• No guarantee that intended recipient gets the message
• Often used for streaming media type content
• Not good for critical information
• Other applications will use this service to perform multicasts.
• We have to ensure that everything goes correctly



Multicast application
• Financial services
– Delivery of news, stock quotes, financial indices, etc
• Remote conferencing/e-learning
– Streaming audio and video to many participants (clients, students)
– Interactive communication between participants
• Data distribution
– e.g., distribute experimental data from Large Hadron Collider (LHC) at
CERN lab to interested physicists around the world

IP multicast
• Highly efficient bandwidth usage
• Key architectural decision: add support for multicast in the IP layer
• Scalability problem (with the number of groups): routers maintain per-group state
• IP multicast is a best-effort multi-point delivery service
– Providing higher-level features such as reliability, congestion control, flow
control and security has proven to be more difficult than in the unicast case

Application layer multicast


Pros and Cons
• Scalability
– Routers do not maintain per-group state
– End systems do, but they participate in very few groups
• Potentially simplifies support for higher-level functionality
– Leverages computation and storage of end systems
– Leverages solutions for unicast congestion, error and flow control
• Efficiency concerns
– Redundant traffic on physical links
– Increase in latency due to end systems



UNIT III DISTRIBUTED OBJECTS AND FILE SYSTEM 9
Remote Invocation – Request Reply Protocol - Java RMI - Distributed Objects -
CORBA - Introduction to Distributed File System - File Service architecture – Andrew
File System, Sun Network File System - Introduction to Name Services- Name
services and DNS - Directory and directory services - Case Study: Google File System

3.1 Remote Invocation


3.1.1 Request Reply Protocol

Middleware layers

Request-Reply communication is synchronous because the client process blocks until the
reply arrives from the server. It can also be reliable because the reply from the server is
effectively an acknowledgement to the client.

Asynchronous request-reply communication is an alternative that may be useful in


situations where clients can afford to retrieve replies later.

Operations of the request-reply protocol

The request-reply protocol:


The protocol is based on three communication primitives: doOperation, getRequest
and sendReply. This request-reply protocol matches requests to replies. It may be designed to
provide certain delivery guarantees. If UDP datagrams are used, the delivery guarantees must
be provided by the request-reply protocol, which may use the server’s reply message as an
acknowledgement of the client’s request message.

doOperation: method is used by clients to invoke remote operations. Its arguments


specify the remote server and which operation to invoke, together with additional
information (arguments) required by the operation. Its result is a byte array containing
the reply.



getRequest: is used by a server process to acquire service requests.

sendReply: is used to send the reply message to the client. When the reply message is
received by the client the original doOperation is unblocked and execution of the
client program continues.

public byte[] doOperation (RemoteRef s, int operationId, byte[] arguments)


sends a request message to the remote server and returns the reply.
The arguments specify the remote server, the operation to be invoked and the arguments
of that operation.

public byte[] getRequest ();


acquires a client request via the server port.

public void sendReply (byte[] reply, InetAddress clientHost, int clientPort);


sends the reply message reply to the client at its Internet address and port.
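The three primitives can be sketched in-process, with a pair of queues standing in for the network; operation id 1 is a made-up upper-case operation, and messages are byte strings as in the signatures above.

```python
# Sketch of doOperation / getRequest / sendReply: queues replace the
# network, and a server thread plays the role of the remote process.
import queue
import threading

request_ch, reply_ch = queue.Queue(), queue.Queue()

def do_operation(server, operation_id, arguments):
    request_ch.put((server, operation_id, arguments))  # send request message
    return reply_ch.get()                              # block until the reply

def get_request():
    return request_ch.get()                            # server acquires a request

def send_reply(reply):
    reply_ch.put(reply)                                # unblocks do_operation

def server_loop():
    server, op, args = get_request()
    if op == 1:                                        # hypothetical operation 1
        send_reply(args.upper())

threading.Thread(target=server_loop).start()
result = do_operation("serverA", 1, b"hello")          # looks like a local call
print(result)
```

The client blocks inside do_operation until send_reply runs, which is exactly the synchronous request-reply behaviour the text describes.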

Styles of exchange protocols


Three protocols that produce differing behaviours in the presence of communication
failures are used for implementing various types of request behaviour.

They were originally identified by Spector [1982]:


• the request (R) protocol;
• the request-reply (RR) protocol;
• the request-reply-acknowledge reply (RRA) protocol.

Request-reply message structure

RPC exchange protocols

HTTP (HyperText Transfer Protocol)



HTTP is a protocol that specifies the messages involved in a request-reply exchange, the
methods, arguments and results, and the rules for representing them in the messages. It
supports a fixed set of methods (GET, PUT, POST, etc.) that are applicable to all of the
server’s resources. In addition to invoking methods on web resources, the protocol allows for
content negotiation and password-style authentication:
Content negotiation: Clients’ requests can include information as to what data
representations they can accept (for example, language or media type), enabling the server to
choose the representation that is the most appropriate for the user.
Authentication: Credentials and challenges are used to support password-style
authentication. On the first attempt to access a password-protected area, the server reply
contains a challenge applicable to the resource. When a client receives a challenge, it gets the
user to type a name and password and submits the associated credentials with subsequent
requests.
HTTP request message
HTTP Reply message

Interface definition languages IDL


An RPC mechanism can be integrated with a particular programming language if it

includes an adequate notation for defining interfaces, allowing input and output parameters to
be mapped onto the language’s normal use of parameters.
This approach is useful when all the parts of a distributed application can be written in the
same language. It is also convenient because it allows the programmer to use a single
language, for example, Java, for local and remote invocation. However, many existing useful
services are written in C++ and other languages. It would be beneficial to allow programs
written in a variety of languages, including Java, to access them remotely.
Interface definition languages (IDLs) are designed to allow procedures implemented in
different languages to invoke one another. An IDL provides a notation for defining interfaces
in which each of the parameters of an operation may be described as for input or output in
addition to having its type specified.

CORBA Interface Definition Languages (IDL) example


// In file Person.idl
struct Person {
    string name;
    string place;
    long year;
};
interface PersonList {
    readonly attribute string listname;
    void addPerson(in Person p);
    void getPerson(in string name, out Person p);
    long number();
};
RPC call semantics
Request-reply protocols showed that doOperation can be implemented in different ways to
provide different delivery guarantees.

The main choices are:


Retry request message: Controls whether to retransmit the request message until either a
reply is received or the server is assumed to have failed.

Duplicate filtering: Controls when retransmissions are used and whether to filter out
duplicate requests at the server.

Retransmission of results: Controls whether to keep a history of result messages to enable
lost results to be retransmitted without re-executing the operations at the server.
Call semantics: combining these choices yields the three RPC call semantics, namely maybe, at-least-once and at-most-once.
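As an illustration, at-most-once semantics rests on duplicate filtering plus retransmission of stored results. The sketch below is a minimal in-memory simulation of that server-side bookkeeping; all class and method names are invented for this example, not part of any real RPC library.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch (names invented): a server-side history that filters
// duplicate requests and replays stored results, which is the basis of
// at-most-once call semantics.
class RequestHistory {
    private final Map<Long, String> results = new HashMap<>();
    private int executions = 0;

    // Execute the operation only for a request id not seen before;
    // a retransmitted request gets the stored reply, with no re-execution.
    String handle(long requestId, String arg) {
        String stored = results.get(requestId);
        if (stored != null) return stored;   // duplicate: replay saved result
        executions++;                        // genuine request: execute it
        String reply = "echo:" + arg;
        results.put(requestId, reply);       // keep result for retransmissions
        return reply;
    }

    int executions() { return executions; }
}

public class AtMostOnceDemo {
    public static void main(String[] args) {
        RequestHistory server = new RequestHistory();
        String r1 = server.handle(1, "hello");
        String r2 = server.handle(1, "hello");   // simulated retransmission
        System.out.println(r1.equals(r2));       // true: same stored reply
        System.out.println(server.executions()); // 1: executed only once
    }
}
```

A real server would also discard old history entries once a reply is acknowledged, which this toy version omits.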

Role of client and server stub procedures in RPC

Remote and local method invocations



Remote object and its remote interface

Instantiation of remote objects

Role of proxy and skeleton in remote method invocation



3.1.2 Remote method invocation (RMI)
• Low-level sockets can be used to develop client/server distributed applications
• But in that case, a protocol must be designed
• Designing such a protocol is hard and error-prone (how to avoid deadlock?)
• RMI is an alternative to sockets

What is RMI?
• A core package of the JDK1.1+ that can be used to develop distributed applications
• Similar to the RPC mechanism found on other systems
• In RMI, methods of remote objects can be invoked from other JVMs
• In doing so, the programmer has the illusion of calling a local method (but all
arguments are actually sent to the remote object and results sent back to callers)

Local v. Remote method invocation

Goals and Features of RMI


• seamless object remote invocations
• callbacks from server to client
• distributed garbage collection
• NOTE: in RMI, all objects must be written in Java!

RMI System Architecture


• Built in three layers (they are all independent):
– Stub/Skeleton layer
– Remote reference layer
– Transport layer



The Stub/Skeleton layer
• The interface between the application layer and the rest of the system
• Stubs and skeletons are generated using the RMIC compiler
• This layer transmits data to the remote reference layer via the abstraction of marshal
streams (that use object serialization)
• This layer doesn’t deal with the specifics of any transport
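As a sketch of the marshalling these marshal streams perform, the example below round-trips arguments through Java object serialization, the mechanism the layer builds on. The demo class is invented for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

// Sketch of the marshalling idea: Java object serialization turns argument
// objects into a byte stream and back, the way RMI's marshal streams carry
// call data between address spaces. The class name is invented.
public class MarshalDemo {
    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(new int[]{34, 4});   // stub side: marshal arguments
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            int[] arg = (int[]) in.readObject(); // skeleton side: unmarshal
            System.out.println(arg[0] + arg[1]); // invoke the real operation
        }
    }
}
```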

The Stub/Skeleton
• Client stub responsible for:
– Initiating remote calls
– Marshalling arguments to be sent
– Informing the remote reference layer to invoke the call
– Unmarshalling the return value
– Informing the remote reference layer that the call is complete
• Server skeleton responsible for:
– Unmarshalling incoming arguments from the client
– Calling the actual remote object implementation
– Marshalling the return value for transport back to the client

The Remote Reference Layer


• The middle layer
• Provides the ability to support varying remote reference or invocation protocols
independent of the client stub and server skeleton
• Example: the unicast protocol provides point-to-point invocation, and multicast
provides invocation to replicated groups of objects, other protocols may deal with
different strategies…
• Not all these features are supported….

The Transport Layer


• A low-level layer that ships serialized objects between different address spaces
• Responsible for:
– Setting up connections to remote address spaces
– Managing the connections
– Listening to incoming calls
– Maintaining a table of remote objects that reside in the same address space
– Setting up connections for an incoming call
– Locating the dispatcher for the target of the remote call



How does RMI work?
• An invocation will pass through the stub/skeleton layer, which will transfer data to the
remote reference layer
• The semantics of the invocation are carried to the transport layer
• The transport layer is responsible for setting up the connection

The Naming Registry


• The remote object must register itself with the RMI naming registry
• A reference to the remote object is obtained by the client by looking up the registry

Distributed Garbage Collection


• RMI provides a distributed garbage collector that deletes remote objects no longer
referenced by a client
• Uses a reference-counting algorithm to keep track of live references in each Virtual
Machine
• RMI keeps track of VM identifiers, so objects are collected when no local or
remote references to them remain

RMI and the OSI reference model
• RMI can be described by this model

Security
• While RMI is a straightforward method
for creating distributed applications, some
security issues you should be aware of:
– Objects are serialized and
transmitted over the network in
plain text
– No authentication: a client requests
an object, all subsequent
communication is assumed to be
from the same client
– No security checks on the registry
– No version control

Programming with RMI: Anatomy of an RMI-based application
– Define a remote interface
– Provide an implementation of the remote interface
– Develop a client
– Generate stubs and skeletons
– Start the RMI registry
– Run the client and server

Define a remote interface


• It specifies the characteristics of the methods provided by a server and visible to clients
– Method signatures (method names and the type of their parameters)
• By looking at the interface, programmers know what methods are supported and how
to invoke them
• Remote method invocations must be able to handle error messages (e.g. can’t connect
to server or server is down)



Characteristics of remote interface
• Must be declared public
• To make an object a remote one, the interface must extend the java.rmi.Remote
interface
• Each method declared in the interface must declare java.rmi.RemoteException in its
throws clause.
• see DateSer
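A minimal interface with these characteristics might look as follows. DateService is an invented name for illustration; the main method simply verifies the listed properties (extends java.rmi.Remote, methods declare RemoteException) via reflection.

```java
import java.rmi.Remote;
import java.rmi.RemoteException;

// Illustrative remote interface (DateService is a made-up name):
// it is public, extends java.rmi.Remote, and every method declares
// RemoteException in its throws clause.
public class RemoteInterfaceDemo {
    public interface DateService extends Remote {
        String getDate() throws RemoteException;
    }

    public static void main(String[] args) throws Exception {
        // Check the two characteristics listed above via reflection.
        System.out.println(Remote.class.isAssignableFrom(DateService.class));
        boolean declares = false;
        for (Class<?> ex : DateService.class.getMethod("getDate").getExceptionTypes())
            if (ex == RemoteException.class) declares = true;
        System.out.println(declares);
    }
}
```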

Implement the remote interface


• The implementation needs to:
– Specify the remote interface being implemented
– Define the constructor of the remote object
– Implement the methods that can be invoked remotely
– Create an instance of a remote object
– Register it with the RMI registry
• see DateServerImpl example

Develop a client
• Example:DateClient.java

Generate stubs and skeletons


• Use the rmic compiler
• Place the remote interface and the stub class on the client side, and both the stub and
skeleton on the server side.
– alternately, some of these files could be dynamically loaded (will see this later)

Start the RMI registry


• It is a naming service that allows clients to obtains references to remote objects

Run the server and client


• Run the RMI registry (default port: 1099)
• Run the server
• Run the client

Working with the RMI Registry


• refer to the java.rmi.Naming class



RMI (Remote Method Invocation)
RMI (Remote Method Invocation) is an API that provides a mechanism to create
distributed applications in Java. RMI allows an object to invoke methods on an object
running in another JVM.
RMI provides remote communication between applications using two
objects: the stub and the skeleton.

Understanding stub and skeleton


RMI uses stub and skeleton objects for communication with the remote object.
A remote object is an object whose methods can be invoked from another JVM. Let's
understand the stub and skeleton objects:

Stub
The stub is an object that acts as a gateway on the client side. All outgoing requests
are routed through it. It resides on the client side and represents the remote object. When the
caller invokes a method on the stub object, it does the following tasks:
1. It initiates a connection with remote Virtual Machine (JVM),
2. It writes and transmits (marshals) the parameters to the remote Virtual Machine
(JVM),
3. It waits for the result
4. It reads (unmarshals) the return value or exception, and
5. It finally, returns the value to the caller.

Skeleton
The skeleton is an object that acts as a gateway for the server-side object. All
incoming requests are routed through it. When the skeleton receives an incoming
request, it does the following tasks:
1. It reads the parameter for the remote method
2. It invokes the method on the actual remote object, and
3. It writes and transmits (marshals) the result to the caller.

In the Java 2 SDK, a stub protocol was introduced that eliminates the need for skeletons.



Understanding requirements for distributed applications
If an application performs these tasks, it can be a distributed application:

1. The application needs to locate the remote method,
2. It needs to provide communication with the remote objects, and
3. The application needs to load the class definitions for the objects.

An RMI application has all these features, so it is called a distributed application.

Steps to write the RMI program


The following are the 6 steps to write an RMI program.
1. Create the remote interface
2. Provide the implementation of the remote interface
3. Compile the implementation class and create the stub and skeleton objects using the
rmic tool
4. Start the registry service by rmiregistry tool
5. Create and start the remote application
6. Create and start the client application

RMI Example
In this example, we have followed all 6 steps to create and run the RMI
application. The client application needs only two files: the remote interface and the
client application. In an RMI application, both client and server interact with the
remote interface. The client application invokes methods on the proxy object, RMI
sends the request to the remote JVM, and the return value is sent back to the proxy
object and then to the client application.

1) create the remote interface

For creating the remote interface, extend the Remote interface and declare the
RemoteException with all the methods of the remote interface. Here, we are creating a remote
interface that extends the Remote interface. There is only one method named add() and it
declares RemoteException.

import java.rmi.*;

public interface Adder extends Remote {
    public int add(int x, int y) throws RemoteException;
}



2) Provide the implementation of the remote interface

Now provide the implementation of the remote interface. For providing the
implementation of the Remote interface, we need to
• Either extend the UnicastRemoteObject class,
• or use the exportObject() method of the UnicastRemoteObject class
If you extend the UnicastRemoteObject class, you must define a constructor that
declares RemoteException.

import java.rmi.*;
import java.rmi.server.*;

public class AdderRemote extends UnicastRemoteObject implements Adder {
    AdderRemote() throws RemoteException {
        super();
    }
    public int add(int x, int y) { return x + y; }
}

3) create the stub and skeleton objects using the rmic tool.

The next step is to create the stub and skeleton objects using the RMI compiler. The
rmic tool invokes the RMI compiler and creates the stub and skeleton objects.

rmic AdderRemote

4) Start the registry service by the rmiregistry tool

Now start the registry service by using the rmiregistry tool. If you don't specify the
port number, it uses a default port number. In this example, we are using the port number
5000.

rmiregistry 5000

5) Create and run the server application


Now the RMI service needs to be hosted in a server process. The Naming class provides
methods to get and store the remote object. The Naming class provides 5 methods:
1. public static java.rmi.Remote lookup(java.lang.String) throws
java.rmi.NotBoundException, java.net.MalformedURLException,
java.rmi.RemoteException; it returns a reference to the remote object.
2. public static void bind(java.lang.String, java.rmi.Remote) throws
java.rmi.AlreadyBoundException, java.net.MalformedURLException,
java.rmi.RemoteException; it binds the remote object to the given name.
3. public static void unbind(java.lang.String) throws java.rmi.RemoteException,
java.rmi.NotBoundException, java.net.MalformedURLException; it removes the
binding for the given name.
4. public static void rebind(java.lang.String, java.rmi.Remote) throws
java.rmi.RemoteException, java.net.MalformedURLException; it binds the
remote object to the given name, replacing any existing binding.
5. public static java.lang.String[] list(java.lang.String) throws
java.rmi.RemoteException, java.net.MalformedURLException; it returns an array
of the names of the remote objects bound in the registry.
In this example, we are binding the remote object by the name sonoo.

import java.rmi.*;
import java.rmi.registry.*;

public class MyServer {
    public static void main(String args[]) {
        try {
            Adder stub = new AdderRemote();
            Naming.rebind("rmi://localhost:5000/sonoo", stub);
        } catch(Exception e) { System.out.println(e); }
    }
}

6) Create and run the client application

At the client, we get the stub object by the lookup() method of the Naming class
and invoke the method on this object. In this example, we run the server and client
applications on the same machine, so we use localhost. If you want to access the remote
object from another machine, change localhost to the host name (or IP address) where the
remote object is located.

import java.rmi.*;

public class MyClient {
    public static void main(String args[]) {
        try {
            Adder stub = (Adder) Naming.lookup("rmi://localhost:5000/sonoo");
            System.out.println(stub.add(34, 4));
        } catch(Exception e) { }
    }
}

For running this RMI example:

1) compile all the java files
   javac *.java
2) create the stub and skeleton objects with the rmic tool
   rmic AdderRemote
3) start the RMI registry in one command prompt
   rmiregistry 5000
4) start the server in another command prompt
   java MyServer
5) start the client application in another command prompt
   java MyClient

Java Remote interfaces Shape and ShapeList
import java.rmi.*;
import java.util.Vector;

public interface Shape extends Remote {
    int getVersion() throws RemoteException;
    GraphicalObject getAllState() throws RemoteException;       // 1
}

public interface ShapeList extends Remote {
    Shape newShape(GraphicalObject g) throws RemoteException;   // 2
    Vector allShapes() throws RemoteException;
    int getVersion() throws RemoteException;
}
The Naming class of the Java RMI registry
void rebind (String name, Remote obj)
This method is used by a server to register the identifier of a remote object by name, as
shown in line 3.

void bind (String name, Remote obj)


This method can alternatively be used by a server to register a remote object by name, but
if the name is already bound to a remote object reference an exception is thrown.

void unbind (String name, Remote obj)


This method removes a binding.

Remote lookup(String name)


This method is used by clients to look up a remote object by name, as shown in line 1. A
remote object reference is returned.

String [] list()
This method returns an array of Strings containing the names bound in the registry.
Java class ShapeListServer with main method
import java.rmi.*;

public class ShapeListServer {
    public static void main(String args[]) {
        System.setSecurityManager(new RMISecurityManager());
        try {
            ShapeList aShapeList = new ShapeListServant();   // 1
            Naming.rebind("Shape List", aShapeList);         // 2
            System.out.println("ShapeList server ready");
        } catch(Exception e) {
            System.out.println("ShapeList server main " + e.getMessage());
        }
    }
}



Java class ShapeListServant implements interface ShapeList
import java.rmi.*;
import java.rmi.server.UnicastRemoteObject;
import java.util.Vector;

public class ShapeListServant extends UnicastRemoteObject implements ShapeList {
    private Vector theList;    // contains the list of Shapes
    private int version;

    public ShapeListServant() throws RemoteException { ... }

    public Shape newShape(GraphicalObject g) throws RemoteException {   // 1
        version++;
        Shape s = new ShapeServant(g, version);                         // 2
        theList.addElement(s);
        return s;
    }
    public Vector allShapes() throws RemoteException { ... }
    public int getVersion() throws RemoteException { ... }
}
Java client of ShapeList
import java.rmi.*;
import java.rmi.server.*;
import java.util.Vector;

public class ShapeListClient {
    public static void main(String args[]) {
        System.setSecurityManager(new RMISecurityManager());
        ShapeList aShapeList = null;
        try {
            aShapeList = (ShapeList) Naming.lookup("//bruno.ShapeList");
            Vector sList = aShapeList.allShapes();
        } catch(RemoteException e) {
            System.out.println(e.getMessage());
        } catch(Exception e) {
            System.out.println("Client: " + e.getMessage());
        }
    }
}
Classes supporting Java RMI



3.1.3 Common Object Request Broker Architecture (CORBA)
CORBA is a standard defined by the Object Management Group (OMG) designed to
facilitate the communication of systems that are deployed on diverse platforms.
CORBA enables collaboration between systems on different operating systems,
programming languages, and computing hardware. CORBA uses an object-oriented
model, although the systems that use CORBA do not have to be object-oriented.
CORBA is an example of the distributed object paradigm.

Corba concepts & corba architecture


1. Why CORBA
• Rapid changes in HW and OS lead to advantages of client/server systems
• Result: greater system complexity, user demand, management expectations
• Additional pressure from the necessity to maintain legacy systems.

2. CORBA
• Object Management Group, (OMG) formed in 1989
• The Common Object Request Broker Architecture (CORBA) is a standard
defined by the Object Management Group (OMG) that enables software
components written in multiple computer languages and running on multiple
computers to work together (i.e., it supports multiple platforms).
• Focus on integration of systems and applications across heterogeneous
platforms.

3. Thus CORBA allows applications and their objects to communicate with each other no
matter where they are or who designed them!
• The only REAL competitor is, of course, Microsoft DCOM.
• After soliciting input, the CORBA standard was defined and introduced in 1991.

4. CORBA
• When introduced in 1991, CORBA defined the Interface Design Language,
(IDL) and Application Programming Interface, (API).
• These allow client/server interaction within a specific implementation of an
Object Request Broker, (ORB).
• The client sends an ORB request to the server/object implementation, which in
turn returns either an ORB result or an error to the client.
• CORBA is just a specification for creating and using distributed objects
• CORBA is not a programming language.
• CORBA is a standard (not a product!)
• Allows objects to transparently make requests and receive responses.

5. CORBA Architecture
• The CORBA architecture is based on the object model.
• A CORBA-based system is a collection of objects that isolates the requestors
of services (clients) from the providers of services(servers) by a well-defined
encapsulating interface.
• CORBA is composed of five major components: ORB, IDL, dynamic
invocation interface(DII), interface repositories (IR), and object adapters (OA).



6. ORB Interface: Contains functionality that might be required by clients or servers.
DII Interface: Dynamic Invocation Interface; used for dynamically invoking CORBA
objects that were not known at implementation time.
DSI Interface: Dynamic Skeleton Interface; helps to implement generic CORBA
servants.
BOA Interface: Basic Object Adapter; the API used by servers to register their object
implementations.
IFR – Interface Repository: Registry of fully qualified interface definitions;
provides the type information necessary to issue requests using the DII.

7. Dynamic skeleton interface


• Analogous to the DII is the server-side dynamic skeleton interface (DSI),
which allows servers to be written without having skeletons, or compile-time
knowledge, for the objects being implemented.
• Unlike DII, which was part of the initial CORBA specification, DSI was
introduced in CORBA 2.0.
• Its main purpose is to support the implementation of gateways between ORBs
which utilize different communication protocols.

8. Object Request Broker (ORB)

• Clients see only the object’s interface, never the implementation.
• For objects to communicate across the network, they need a communication
infrastructure named the Object Request Broker (ORB).
• Both client and object implementation are isolated from the ORB by an IDL
interface.
• A request does not pass directly from client to object implementation; instead,
every request is passed to the client’s local ORB, which manages it.
• The interface the client sees is completely independent of where the object is
located, what programming language it is implemented in, or any other aspect
that is not reflected in the object’s interface.
• The ORB is responsible for:
• finding the object implementation for the request,
• preparing the object implementation to receive the request,
• communicating the data making up the request.



• Intercepts calls
• Finds object
• Invokes method
• Passes parameters
• Returns results or error messages
• REGARDLESS OF THE OBJECT'S LOCATION, ITS PROGRAMMING
LANGUAGE OR EVEN THE OPERATING SYSTEMS INVOLVED!

9. Object Request Broker (ORB).

10. IDL Compiler.

11. Object Adapter


An object adapter is the primary means for an object implementation to access
ORB services such as object reference generation.

12. Interface Repository


The IR provides another way to specify the interfaces to objects.
Interfaces can be added to the interface repository service.
Using the IR, a client should be able to locate an object that is unknown at
compile time, find information about its interface, then build a request to be
forwarded through the ORB.

13. Dynamic invocation interface


Invoking operations can be done through either static or dynamic interfaces.



Static invocation interfaces are determined at compile time, and they are
presented to the client using stubs.
The DII, on the other hand, allows client applications to use server objects
without knowing the type of those objects at compile time.
It allows a client to obtain an instance of a CORBA object and make
invocations on that object by dynamically constructing requests.

14. CORBA Objects


• It is important to note that CORBA objects differ from typical programming
objects in three ways:
• CORBA objects can run on any platform.
• CORBA objects can be located anywhere on the network.
• CORBA objects can be written in any language that has an IDL mapping.

15. Dynamic invocation interface


o DII uses the interface repository to validate and retrieve the signature of the
operation on which a request is made.
o CORBA supports both the dynamic and the static invocation interfaces.

16. CORBA works with interfaces


• All CORBA Objects are encapsulated
• Objects are accessible through interface only.
• Separation of interfaces and implementation enables
multiple implementations for one interface.

17. Interface description language(IDL)


IDL is a specification language used to describe a software component's interface.
IDLs describe an interface in a language-neutral way, enabling communication
between software components that do not share a language, for example between
components written in C++ and components written in Java.

IDLs are commonly used in remote procedure call software. In these cases the
machines at either end of the "link" may be using different operating systems and
computer languages. IDLs offer a bridge between the two different systems.

18. Advantages of CORBA

Object Location Transparency:
The client does not need to know where an object is physically located. An
object can either be linked into the client, run in a different process on the same
machine, or run in a server on the other side of the planet. A request invocation
looks the same regardless, and the location of an object can change over time
without breaking applications.

Server Transparency:
The client is, as far as the programming model is concerned, ignorant of the
existence of servers. The client does not know (and cannot find out) which
server hosts a particular object, and does not care whether the server is running
at the time the client invokes a request.

Language Transparency:
Client and server can be written in different languages. This fact
encapsulates the whole point of CORBA; that is, the strengths of different
languages can be utilized to develop different aspects of a system, which
can interoperate through IDL. A server can be implemented in a different
language without clients being aware of this.

Implementation Transparency:
The client is unaware of how objects are implemented. A server can use
ordinary flat files as its persistent store today and use an OO database
tomorrow, without clients ever noticing a difference (other than
performance).

Architecture Transparency:
The idiosyncrasies of CPU architectures are hidden from both clients and
servers. A little-endian client can communicate with a big-endian server
with different alignment restrictions.

Operating System Transparency:
Client and server are unaffected by each other's operating system. In
addition, source code does not change if you need to port the source from
one operating system to another.

Protocol Transparency:
Clients and servers do not care about the data link and transport layer. They
can communicate via token ring, Ethernet, wireless links, ATM (Asynchronous
Transfer Mode), or any number of other networking technologies.



3.2.1 Introduction to Distributed File System
File service architecture
1. The relevant modules and their relationships are shown in the figure. Access to files is provided by structuring
the file service as three components:
1. Flat file service
2. Directory service
3. Client module

2. File service architecture (Figure 8.5): the client computer runs application programs and the client module;
the server computer runs the directory service (operations: Lookup, AddName, UnName, GetNames) and the flat
file service (operations: Read, Write, Create, Delete, GetAttributes, SetAttributes).

Responsibilities of various modules can be defined as follows:


Flat file service
Concerned with the implementation of operations on the contents of file.
Unique File Identifiers (UFIDs) are used to refer to files in all requests for flat file service operations.
Flat file service operations
1. Read(FileId, i, n): Reads a sequence of up to n items from a file starting at item i.
2. Write(FileId, i, Data): Writes a sequence of Data to a file, starting at item i.
3. Create(): Creates a new file of length 0 and delivers a UFID for it.
4. Delete(FileId): Removes the file from the file store.
5. GetAttributes(FileId): Returns the file attributes for the file.
6. SetAttributes(FileId, Attr): Sets the file attributes.
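The operations above can be sketched as a toy in-memory flat file service. This is only an illustration of the interface, not a real implementation; all class and variable names are invented, and attributes are omitted.

```java
import java.util.*;

// Minimal in-memory sketch of a flat file service (illustrative only):
// files are sequences of bytes addressed by unique file identifiers (UFIDs).
class FlatFileService {
    private final Map<Long, List<Byte>> store = new HashMap<>();
    private long nextUfid = 1;

    long create() {                               // Create(): empty file, new UFID
        long ufid = nextUfid++;
        store.put(ufid, new ArrayList<>());
        return ufid;
    }
    void write(long ufid, int i, byte[] data) {   // Write(FileId, i, Data)
        List<Byte> f = store.get(ufid);
        for (int k = 0; k < data.length; k++) {
            int pos = i + k;
            while (f.size() <= pos) f.add((byte) 0);  // extend file if needed
            f.set(pos, data[k]);
        }
    }
    byte[] read(long ufid, int i, int n) {        // Read(FileId, i, n)
        List<Byte> f = store.get(ufid);
        int end = Math.min(i + n, f.size());      // up to n items from item i
        byte[] out = new byte[Math.max(0, end - i)];
        for (int k = i; k < end; k++) out[k - i] = f.get(k);
        return out;
    }
    void delete(long ufid) { store.remove(ufid); } // Delete(FileId)
}

public class FlatFileDemo {
    public static void main(String[] args) {
        FlatFileService svc = new FlatFileService();
        long ufid = svc.create();
        svc.write(ufid, 0, "hello".getBytes());
        System.out.println(new String(svc.read(ufid, 0, 5)));
    }
}
```

Note that Read and Write take an explicit starting position, so the server keeps no open-file state between requests, matching the stateless design discussed below.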
Directory service
Provides mapping between text names for the files and their UFIDs.
Clients may obtain the UFID of a file by quoting its text name to directory service.
Directory service supports functions to add new files to directories.
Directory service operations
1. Lookup(Dir, Name): Locates the text name in the directory and returns the relevant UFID. If Name is not
in the directory, throws an exception.
2. AddName(Dir, Name, File): If Name is not in the directory, adds (Name, File) to the directory and updates
the file’s attribute record. If Name is already in the directory, throws an exception.
3. UnName(Dir, Name): If Name is in the directory, the entry containing Name is removed from the
directory. If Name is not in the directory, throws an exception.
4. GetNames(Dir, Pattern): Returns all the text names in the directory that match the regular expression
Pattern.
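Similarly, a toy in-memory directory service mapping text names to UFIDs might look like this. It is illustrative only; the names are invented, and unchecked exceptions stand in for the service's error reports.

```java
import java.util.*;

// Minimal in-memory sketch of a directory service (illustrative only):
// it maps text names to UFIDs, mirroring Lookup/AddName/UnName/GetNames.
class DirectoryService {
    private final Map<String, Long> dir = new HashMap<>();

    void addName(String name, long ufid) {          // AddName(Dir, Name, File)
        if (dir.containsKey(name)) throw new IllegalStateException("NameExists");
        dir.put(name, ufid);
    }
    long lookup(String name) {                      // Lookup(Dir, Name)
        Long ufid = dir.get(name);
        if (ufid == null) throw new NoSuchElementException("NotFound");
        return ufid;
    }
    void unName(String name) {                      // UnName(Dir, Name)
        if (dir.remove(name) == null) throw new NoSuchElementException("NotFound");
    }
    List<String> getNames(String pattern) {         // GetNames(Dir, Pattern)
        List<String> out = new ArrayList<>();
        for (String n : dir.keySet())
            if (n.matches(pattern)) out.add(n);
        Collections.sort(out);                      // deterministic order
        return out;
    }
}

public class DirectoryDemo {
    public static void main(String[] args) {
        DirectoryService ds = new DirectoryService();
        ds.addName("notes.txt", 7L);
        ds.addName("notes.bak", 8L);
        System.out.println(ds.lookup("notes.txt"));
        System.out.println(ds.getNames("notes\\..*"));
    }
}
```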
Client module
• It runs on each computer and provides integrated service (flat file and directory) as a single API to application
programs.
• It holds information about the network locations of flat-file and directory server processes.
Access control
In distributed implementations, access rights checks have to be performed at the server .
Hierarchic file system
A hierarchic file system consists of a number of directories arranged in a tree structure.
File Group
A file group is a collection of files that can be located on any server.



3.2.2 File Systems : Introduction
▪ File system were originally developed for centralized computer systems and desktop
computers.
▪ File system was as an operating system facility providing a convenient programming
interface to disk storage.
▪ Distributed file systems support the sharing of information in the form of files and
hardware resources.
▪ With the advent of distributed object systems (CORBA, Java) and the web, the picture
has become more complex.
▪ Figure 1 provides an overview of types of storage system.

▪ Figure 2 shows a typical layered module structure for the implementation of a non-
distributed file system in a conventional operating system.

Directory module: relates file names to file IDs

File module: relates file IDs to particular files

Access control module: checks permission for operation requested

File access module: reads or writes file data or attributes

Block module: accesses and allocates disk blocks

Device module: disk I/O and buffering

▪ File systems are responsible for the organization, storage, retrieval, naming, sharing
and protection of files.
▪ Files contain both data and attributes.



▪ A typical attribute record structure is illustrated in
Figure 3.

▪ Above figure summarizes the main operations on files that are available to
applications in UNIX systems.
▪ Distributed File system requirements: Related requirements in distributed file
systems are:
❖ Transparency
❖ Concurrency
❖ Replication
❖ Heterogeneity
❖ Fault tolerance
❖ Consistency
❖ Security
❖ Efficiency

File service architecture



• Stateless file service architecture
– Flat file service: unique file identifiers (UFID)
– Directory service: map names to UFIDs
– Client module
• integrate/extend flat file and directory services
• provide a common application programming interface (can emulate
different file interfaces)
• stores location of flat file and directory services
Flat file service interface
• RPC used by client modules, not by user-level programs
• Compared to UNIX
– no open/close
• Create is not idempotent
• at-least-once semantics
• reexecution gets a new file
– specify starting location in Read/Write
▪ stateless server

Access control
• UNIX checks access rights when a file is opened
o subsequent checks during read/write are not necessary
• distributed environment
o server has to check
o stateless approaches
▪ access check once when UFID is issued
• client gets an encoded "capability" (who can access and how)
• capability is submitted with each subsequent request
▪ access check for each request.
• second is more common
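The first stateless approach can be sketched as follows; the token format and class names are made up for illustration, and a real system would sign or encrypt the capability rather than store it server-side.

```java
import java.util.*;

// Illustrative capability-style access check (names and token format are
// invented): the server checks access rights once, when the UFID is issued,
// and hands back an opaque token recording (UFID, allowed operations);
// each later request carries the token and is checked against it.
class CapabilityServer {
    private final Map<String, String> issued = new HashMap<>();
    private int counter = 0;

    String issue(long ufid, String rights) {    // one-time access check here
        String token = "cap-" + (++counter);
        issued.put(token, ufid + ":" + rights);
        return token;
    }

    boolean allowed(String token, long ufid, char op) {
        String entry = issued.get(token);
        if (entry == null) return false;        // unknown or forged token
        String[] parts = entry.split(":");
        return Long.parseLong(parts[0]) == ufid && parts[1].indexOf(op) >= 0;
    }
}

public class CapabilityDemo {
    public static void main(String[] args) {
        CapabilityServer server = new CapabilityServer();
        String cap = server.issue(42L, "r");               // read-only capability
        System.out.println(server.allowed(cap, 42L, 'r')); // read permitted
        System.out.println(server.allowed(cap, 42L, 'w')); // write refused
    }
}
```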

Directory service operations


• A directory service interface translates text names to file identifiers and performs a
number of other services such as those listed among the sample commands in figure
8.7. This is the remote procedure call interface to extend the local directory services to
a distributed model.



File collections
• Hierarchical file system
o Directories containing other directories and files
o Each file can have more than one name (pathnames)
▪ how in UNIX, Windows?
• File groups
o a logical collection of files on one server
▪ a server can have more than one group
▪ a group can change server
▪ a file can't change to a new group (copying doesn't count)
▪ filesystems in unix
▪ different devices for non-distributed
▪ different hosts for distributed

3.2.1 Service architecture – Andrew File System,


• Designed by Carnegie Mellon University
o Developed during mid-1980s as part of the Andrew distributed computing
environment
o Designed to support a WAN of more than 5000 workstations
o Much of the core technology is now part of the Open Software
Foundation (OSF) Distributed Computing Environment (DCE), available
for most UNIX and some other operating systems
• AFS was made to span large campuses and scale well therefore the emphasis was
placed on offloading the work to the clients
• as much as possible data is cached on clients, uses session semantics - cache
consistency operations are done when file is opened or closed
• Provides transparent access to remote files on a WAN, for clients running on
UNIX and other operating systems
o Access to all files is via the usual UNIX file primitives
o Compatible with NFS — servers can mount NFS file systems
• AFS is stateful: when a client reads a file from a server, it holds a callback. The
server keeps track of callbacks; when one of the clients closes the file (synchronizing
its cached copy) after updating it, the server notifies all the other callback holders of
the change, breaking their callbacks. Callbacks can also be broken to conserve storage
at the server.

• Distribution of processes in the Andrew File System

• System call interception in AFS

Andrew File System (AFS):The main components of the Vice service interface

• problems with AFS:
o even if the data is in local cache - if the client performs a write a complex
protocol of local callback verification with the server must be used; cache
consistency preservation leads to deadlocks
o in a stateful model, it is hard to deal with crashes.
Caching in Andrew
• When a remote file is accessed, the server sends the entire file to the client
o The entire file is then stored in a disk cache on the client computer
▪ Cache is big enough to store several hundred files
• Implements session semantics
o Files are cached when opened
o Modified files are flushed to the server when they are closed
o Writes may not be immediately visible to other processes
• When client caches a file, server records that fact — it has a callback on the file
o When a client modifies and closes a file, other clients lose their callback,
and are notified by server that their copy is invalid
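The callback bookkeeping described above can be sketched as a toy model. The class and method names (`AfsServer`, `fetch`, `store`) are illustrative and do not follow the real Vice interface; only the callback-breaking behavior is taken from the text.

```python
from collections import defaultdict

class AfsServer:
    """Toy model of AFS callback promises (not the real Vice protocol)."""
    def __init__(self):
        self.files = {}                     # filename -> contents
        self.callbacks = defaultdict(set)   # filename -> clients holding a callback

    def fetch(self, client, name):
        # Whole-file transfer; the server records a callback promise for the client.
        self.callbacks[name].add(client)
        return self.files.get(name, "")

    def store(self, client, name, data):
        # A client closes a modified file: every other client's callback is broken.
        self.files[name] = data
        broken = self.callbacks[name] - {client}
        self.callbacks[name] = {client}
        return broken   # ids of clients whose cached copies are now invalid

srv = AfsServer()
srv.store("c1", "notes.txt", "v1")
srv.fetch("c2", "notes.txt")          # c2 caches the file, gets a callback
print(srv.store("c1", "notes.txt", "v2"))  # {'c2'}: c2 is told its copy is stale
```

This is why AFS reads are cheap while a file is unchanged: the client trusts its cache until the server breaks the callback.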

3.2.2 Sun Network File System


• Designed by Sun Microsystems
o First distributed file service designed as a project, introduced in 1985
o To encourage its adoption as a standard
▪ Definitions of the key interfaces were placed in the public domain in
1989
▪ Source code for a reference implementation was made available to
other computer vendors under license
▪ Currently the de facto standard for LANs
• Provides transparent access to remote files on a LAN, for clients running on UNIX and
other operating systems
o A UNIX computer typically has a NFS client and server module in its OS
kernel
▪ Available for almost any UNIX
▪ Client modules are available for Macintosh and PCs
• NFS - mounting remote file system (cont.)
o Remote file systems may be

o Hard mounted — when a user-level process accesses a file, it is suspended
until the request can be completed
▪ If a server crashes, the user-level process will be suspended until the
server recovers
o Soft mounted — after a small number of retries, the NFS client returns a failure
code to the user process
▪ Most UNIX utilities don’t check this code…
• Automounting
o The automounter dynamically mounts a file system whenever an “empty”
mount point is referenced by a client
▪ Further accesses do not result in further requests to the automounter…
▪ Unless there are no references to the remote file system for several
minutes, in which case the automounter unmounts it
• Virtual file system:

o Separates generic file-system operations from their implementation (can have
different types of local file systems)
o Based on a file descriptor called a vnode that is unique networkwide (UNIX
inodes are only unique on a single file system)
• NFS protocol provides a set of RPCs for remote file operations
o Looking up a file within a directory
o Manipulating links and directories
o Creating, renaming, and removing files
o Getting and setting file attributes
o Reading and writing files
• NFS is stateless
o Servers do not maintain information about their clients from one access to the
next
o There are no open-file tables on the server
• There are no open and close operations
o Each request must provide a unique file identifier, and an offset within the file
• Easy to recover from a crash, but file operations must be idempotent
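The statelessness described above can be illustrated with a toy server: every request carries the file identifier and an offset, there is no open-file table, and a whole-block write is idempotent, so replaying a request after a crash is harmless. Names and the in-memory file table are assumptions for illustration.

```python
class NfsToyServer:
    """Stateless toy server: every request carries the file id and an offset."""
    def __init__(self, files):
        self.files = files  # file_id -> bytes; no open-file table is kept

    def read(self, file_id, offset, count):
        return self.files[file_id][offset:offset + count]

    def write(self, file_id, offset, data):
        # Idempotent: repeating the same write leaves the same result.
        buf = bytearray(self.files[file_id])
        buf[offset:offset + len(data)] = data
        self.files[file_id] = bytes(buf)

srv = NfsToyServer({7: b"hello world"})
print(srv.read(7, 6, 5))        # b'world'
srv.write(7, 0, b"HELLO")
srv.write(7, 0, b"HELLO")       # replaying the request after a crash is harmless
print(srv.read(7, 0, 5))        # b'HELLO'
```

Contrast this with a stateful design, where a lost open-file table would make the duplicated request ambiguous.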

• Because NFS is stateless, all modified data must be written to the server’s disk before
results are returned to the client
o Server crash and recovery should be invisible to client —data should be intact
o Lose benefits of caching
▪ Solution — RAM disks with battery backup (un-interruptable power
supply), written to disk periodically
• A single NFS write is guaranteed to be atomic, and not intermixed with other writes to
the same file
o However, NFS does not provide concurrency control
▪ A write system call may be decomposed into several NFS writes,
which may be interleaved
▪ Since NFS is stateless, this is not considered to be an NFS problem

Caching in NFS
• Traditional UNIX
o Caches file blocks, directories, and file attributes
• Uses read-ahead (prefetching), and delayed-write (flushes every 30 seconds)
• NFS servers
• Same as in UNIX, except server’s write operations perform write-through
o Otherwise, failure of server might result in undetected loss of data by clients
• NFS clients
o Caches results of read, write, getattr, lookup, and readdir operations
o Possible inconsistency problems
▪ Writes by one client do not cause an immediate update of other clients’
caches
• File reads
o When a client caches one or more blocks from a file, it also caches a timestamp
indicating the time when the file was last modified on the server
o Whenever a file is opened, and the server is contacted to fetch a new block
from the file, a validation check is performed
▪ Client requests last modification time from server, and compares that
time to its cached timestamp
▪ If modification time is more recent, all cached blocks from that file are
invalidated
▪ Blocks are assumed to be valid for the next 3 seconds (30 seconds for
directories)
• File writes
o When a cached page is modified, it is marked as dirty, and is flushed when the
file is closed, or at the next periodic flush
o Now two sources of inconsistency: delay after validation, delay until flush
• Caching : Server caching
o caching file pages, directory/file attributes
o read-ahead: prefetch pages following the most-recently read file pages

o delayed-write: write to disk when the page in memory is needed for other
purposes
o "sync" flushes "dirty" pages to disk every 30 seconds
o two write options:
o write-through: write to disk before replying to the client
o cache and commit:
▪ stored in memory cache
▪ write to disk before replying to a "commit" request from the client
• Client caching
o caches results of read, write, getattr, lookup, readdir
o it is the client's responsibility to poll the server for consistency
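The client-side validation check described under "File reads" above can be sketched as a single predicate. The 3-second freshness window and the timestamp comparison come from the text; the function and parameter names are illustrative.

```python
def cached_blocks_valid(now, t_checked, t_mod_client, t_mod_server, freshness=3):
    """NFS-style client cache validity check.
    Cached blocks are used if they were validated within the freshness window,
    or if the server's last-modification time still matches the cached one."""
    if now - t_checked < freshness:
        return True                      # recently validated: skip the round trip
    return t_mod_client == t_mod_server  # contact server, compare timestamps

print(cached_blocks_valid(now=10.0, t_checked=9.0, t_mod_client=1.0, t_mod_server=5.0))  # True
print(cached_blocks_valid(now=20.0, t_checked=9.0, t_mod_client=1.0, t_mod_server=5.0))  # False
```

The first call succeeds only because of the freshness window; once it expires, the stale modification time forces the cached blocks to be invalidated.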

3.3.1 Introduction to Name Services


▪ In a distributed system, names are used to refer to a wide variety of resources such
as:
➢ Computers, services, remote objects, and files, as well as users.
▪ Basic design issues for name services, such as the structure and management of the
spaces of names recognized by the service and the operations that the name service
supports, are outlined and discussed in the context of the Internet Domain Name
Service.
▪ Resources are accessed using identifier or reference
➢ An identifier can be stored in variables and retrieved from tables quickly.
➢ Identifier includes or can be transformed to an address for an object.
❖ E.g. NFS file handle, Corba remote object reference.
➢ A name is human-readable value (usually a string) that can be resolved to an
identifier or address.
❖ Internet domain name, file pathname, process number
❖ E.g. /etc/passwd, http://www.cdk3.net/
▪ For many purposes, names are preferable to identifiers
➢ The binding of the named resource to a physical location is deferred and can be
changed.
➢ They are more meaningful to users.
▪ Resource names are resolved by name services
To give identifiers and other useful attributes.

Name Services and the Domain Name System
▪ A name service stores a collection of one or more naming contexts, sets of bindings
between textual names and attributes for objects such as computers, services, and
users.
▪ The major operation that a name service supports is to resolve names.
▪ DNS supports a model known as iterative navigation.
(Figure 2)

▪ Reason for NFS iterative name resolution:


➢ This is because the file service may encounter a symbolic link (i.e. an alias)
when resolving a name. A symbolic link must be interpreted in the client’s file
system name space because it may point to a file in a directory stored at
another server. The client computer must determine which server this is,
because only the client knows its mount points.

▪ DNS offers recursive navigation as an option, but iterative is the standard technique.
▪ Recursive navigation must be used in domains that limit client access to their DNS
information for security reasons.
(Figure 3)
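Iterative navigation can be sketched as follows: the client itself follows referrals from one name server to the next until an authoritative answer is found. The zone data, server names, and address are made up for illustration; real DNS uses NS/A records over UDP, not a Python dict.

```python
# Toy delegation chain (illustrative data, not real DNS): each server either
# answers authoritatively or refers the client to the next server.
SERVERS = {
    "root":    {"refer": {"net": "ns-net"}},
    "ns-net":  {"refer": {"cdk3.net": "ns-cdk3"}},
    "ns-cdk3": {"answer": {"www.cdk3.net": "10.0.0.5"}},
}

def resolve(name):
    """Iterative navigation: the client contacts one server after another."""
    server, path = "root", ["root"]
    while True:
        data = SERVERS[server]
        if name in data.get("answer", {}):
            return data["answer"][name], path
        # Follow the longest matching delegation (the referral).
        match = max((z for z in data.get("refer", {}) if name.endswith(z)),
                    key=len, default=None)
        if match is None:
            return None, path
        server = data["refer"][match]
        path.append(server)

addr, path = resolve("www.cdk3.net")
print(addr)   # 10.0.0.5
print(path)   # ['root', 'ns-net', 'ns-cdk3']
```

In recursive navigation the same chain of queries is made, but by the servers on the client's behalf rather than by the client itself.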

▪ DNS - The Internet Domain Name System


➢ DNS is a distributed naming database.
➢ The arrangement of some of the DNS database is shown in Figure 4.

DNS name servers

▪ Zone data are stored by name servers in files in one of several fixed types of resource
record.
(Figure 5)

3.3.2 Name services and DNS


3.3.3 Directory and directory services
Directories in Linux
User database : /etc/passwd, /etc/shadow
Group database : /etc/group

Host names : /etc/hosts
Network names: /etc/network
Protocol names: /etc/protocols
Service names: /etc/services
RPC program numbers: /etc/rpc
Known ethernet addresses: /etc/ethers
Automount maps: /etc/auto.master
The scalability problem
Example
• 13000 users and 5000 hosts
• Passwords valid for 30 days
• 50% of changes made at 8-10
• One change every 28.8 seconds
• Propagation time: 0.00567s

What is a directory service


A specialized database
• Attribute-value type information
• More reads than updates
• Consistency problems are sometimes OK
• No transactions or rollback
• Support for distribution and replication
• Clear patterns to searches

Directory services
Components
• A data model
• A protocol for searching
• A protocol for reading
• A protocol for updating
• Methods for replication
• Methods for distribution

Common directory services


• DNS
• X.500 Directory Service
• Network Information Service
• NIS+
• Active Directory (Windows NT)
• NDS (Novell Directory Service)
• LDAP (Lightweight X.500)

Global directory service


• Context: entire network or entire internet
• Namespace: uniform
• Distribution: usually
• Examples: DNS, X.500, NIS+,

LDAP: Local directory service


• Context: intranet or smaller
• Namespace: non-uniform
• Examples: NIS, local files

Directory services in Linux


Alias: name services
• /etc/nsswitch.conf selects service
• Several services per directory
• Modular design/implementation

Examples from /etc/nsswitch.conf


• users files,nis
• users nis[notfound=return],files
• hosts dns,files
• Network Information Service

Domain (NIS domain)


• Systems administered with NIS
• No connection to DNS domain

NIS server
• Server that has information
• accessible through NIS
• Serves one or more domains

NIS client
• Host that uses NIS as a directory service

Protocol
• RPC based
• No security
• No updates
• Replication support

Replication
• Master/slave servers

Distribution
• No distribution support!

Data model
• Directories known as maps
• Simple key-value mapping
• Values have no structure

Master server
• Maps built from text files
• Maps in /var/yp
• Maps built with make
• Maps stored in binary form
• Replication to slaves with yppush

Slave servers
• Receive data from master
• Load balancing and failover

Processes/commands
• ypserv Server process
• ypbind Client process
• ypcat To view maps
• ypmatch To search maps
• ypwhich Show status
• yppasswdd Change password

NIS client
• Knows its NIS domain
• Binds to a NIS server

Two options
• Broadcast
• Hard coded NIS-server
ypbind

Scalability problems
• Flat namespace
• No distribution

Security problems
• No access control
• Broadcast for binding
• Patched as an afterthought

Primitive protocol
• No updates
• Hack for password change
• Search only on key
• Primitive data model

Scalability
• Hierarchical namespace
• Distributed administration

Security
• Authentication of server, client and user
• Access control on per-cell level

New protocol
• Updates through NIS+
• General searches
• Data model with real tables

LDAP Protocol
• TCP-based
• Fine-grained access control
• Support for updates
• Flexible search protocol

Replication
• Replication is possible

Distribution
• Distributed management is possible

DNS: Data model

TYPE
• SOA – Start of authority
• NS – Name server
• MX – Mail exchanger
• A – Address
• A6 – IPv6 address
• AAAA – IPv6 address
• PTR – Domain name pointer

• CNAME – Canonical name
• TXT – Text
• … and many more

RDATA
• Binary data, hardcoded format
• TYPE determines format
• DNS: Namespace

Names
• Dot-separated parts
• one.part.after.another

FQDN
• Fully Qualified Domain Name
• Complete name
• Always ends in a dot

Partial name
• Suffix of name implicit
• Does not end in a dot

Namespace
• Global and hierarchical

DNS: Replication
Secondary/slave nameserver
• Indicated by NS RR
• Data transfer with AXFR/IXFR

Questions
• How does a slave NS know when there is new information?
• How often should a slave NS attempt to update?
• How long is replicated data valid?

Example
• sysi-00:~# host -t ns ida.liu.se
• ida.liu.se NS nsauth.isy.liu.se
• ida.liu.se NS ns.ida.liu.se
• ida.liu.se NS ns1.liu.se

Rule of thumb
• Every zone needs at least two nameservers

DNS: Distribution
Delegation
• A NS can delegate responsibility for a subtree to another NS
• Only entire subtrees can be delegated
Zone
• The part of the namespace that a NS is authoritative for
• Defined by SOA and NS
Domain
• A subtree of the namespace

DNS: Delegation

Delegating NS
NS record for delegated zone
A record (glue) for NS when needed
Example
a.example.com NS ns2.xmp.com
b.xmp.com NS ns.b.xmp.com
ns.b.xmp.com A 10.1.2.3

Delegated-to NS
SOA record for the zone
Example
b.xmp.com SOA ( ns.b.xmp.com
dns.xmp.com
20040909001
24H
2H
1W
2D )

DNS: Delegation

Format of SOA
MNAME Master NS
RNAME Responsible (email)
SERIAL Serial number
REFRESH Refresh interval
RETRY Retry interval
EXPIRE When slave data expires
MINIMUM TTL for negative reply

SERIAL
• Increase for every update
• Date format common
• 20040909001

REFRESH/RETRY
How often a secondary NS updates the zone

MINIMUM
How long to cache a negative (NXDOMAIN) reply
©2003–2004 David Byers
DNS: Caching
• Caching improves scalability
• Caching reduces tree traversal
• Caching of A and PTR records reduces duplicate DNS queries

Choosing good cache parameters is vital


Cache parameters
TTL – Set per RR
Negative TTL – Set in SOA

Example
$TTL 4H
SOA (MNAME RNAME
SERIAL REFRESH
RETRY 1H )
24H NS ns
ns 24H A 10.1.2.3

3.3.4 Case Study: Google File System

UNIT IV DISTRIBUTED TRANSACTIONS AND CONCURRENCY 9
Clock Synchronization – Logical Clocks – Global States – Mutual Exclusion -
Election Algorithms– Data-Centric Consistency Models – Client-Centric Consistency
Models – Distribution Protocol – Consistency Protocol

4.1 Clock Synchronization


Physical Clock
• It is impossible to guarantee that crystals in different computers all run at exactly
the same frequency. This difference in time values is clock skew.
• “Exact” time was computed by astronomers
◼ The difference between two transits of the sun is termed a solar day. Divide a solar
day by 24*60*60 yields a solar second.

• However, the earth is slowing! (35 days less in a year over 300 million years)
• There are also short-term variations caused by turbulence deep in the earth’s core.
◼ A large number of days (n) were averaged to obtain the mean day length; dividing
by 86,400 then gives the mean solar second.

Computation of the solar day

◼ Physicists take over from astronomers and count the transitions of cesium 133 atom
◼ 9,192,631,770 cesium transitions == 1 solar second
◼ 50 International labs have cesium 133 clocks.
◼ The Bureau Internationale de l’Heure (BIH) averages reported clock ticks to
produce the International Atomic Time (TAI).
◼ The TAI is mean number of ticks of cesium 133 clocks since midnight on
January 1, 1958 divided by 9,192,631,770 .
◼ To adjust for lengthening of mean solar day, leap seconds are used to translate
TAI into Universal Coordinated Time (UTC).

◼ UTC is broadcast by NIST from Fort Collins, Colorado over shortwave radio
station WWV. WWV broadcasts a short pulse at the start of each UTC
second. [accuracy 10 msec.]
◼ GEOS (Geostationary Environment Operational Satellite) also offer UTC
service. [accuracy 0.5 msec.]

◼ Computer timers go off H times/sec, and increment the count of ticks (interrupts) since
an agreed upon time in the past.
◼ This clock value is C.
◼ Using UTC time, the value of clock on machine p is Cp(t).
◼ For a perfect time, Cp(t) = t and dC/dt = 1.
For an ideal timer with H = 60, the clock should generate 216,000 ticks per hour

Clock Synchronization Algorithms


◼ But with a typical error of 10⁻⁵, the number of ticks per hour will vary from 215,998 to
216,002.
◼ Manufacturer specs give the maximum drift rate (ρ).
◼ Every Δt seconds, the worst-case drift between two clocks will be at most 2ρΔt.
◼ To guarantee two clocks never differ by more than δ, the clocks must re-synchronize
at least every δ/2ρ seconds using one of the various clock synchronization algorithms.
◼ Centralized Algorithms
◼ Cristian’s Algorithm (1989)
◼ Berkeley Algorithm (1989)
◼ Decentralized Algorithms
◼ Averaging Algorithms (e.g. NTP)
◼ Multiple External Time Sources

Cristian's Algorithm
◼ Assume one machine (the time server) has a WWV receiver and all other machines are
to stay synchronized with it.
◼ Every /2 seconds, each machine sends a message to the time server asking for the
current time.
◼ Time server responds with message containing current time, CUTC.

Getting the current time from a time server
◼ A major problem – the client clock is fast ➔ arriving value of CUTC will be smaller
than client’s current time, C.
One needs to gradually slow down client clock by adding less time per tick.
◼ Minor problem – the one-way delay from the server to client is “significant” and may
vary considerably.
◼ Measure this delay and add it to CUTC.
◼ The best estimate of delay is (T1 – T0)/2.
◼ In cases when T1 – T0 is above a threshold, then ignore the measurement.
{outliers}
◼ Can subtract off I (the server interrupt handling time).
◼ Can use average delay measurement or relative latency (shortest recorded delay).
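The delay estimate above can be captured in a few lines. T0/T1 are the client's send and receive times, and I is the server interrupt-handling time; the function name is illustrative.

```python
def cristian_estimate(t0, t1, server_time, interrupt_time=0.0):
    """Estimate the correct client time at the moment the server's reply arrives.
    The one-way delay is estimated as (T1 - T0 - I) / 2 and added to C_UTC."""
    delay = (t1 - t0 - interrupt_time) / 2
    return server_time + delay

# Client asked at T0 = 100.0, reply arrived at T1 = 100.5 carrying C_UTC = 200.0
print(cristian_estimate(100.0, 100.5, 200.0))  # 200.25
```

If T1 - T0 exceeds a threshold, the sample would be discarded as an outlier, as the text notes.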

The Berkeley Algorithm

a) The time daemon asks all the other machines for their clock values.
b) The machines answer and the time daemon computes the average.
c) The time daemon tells everyone how to adjust their clock.
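Steps (a)–(c) can be sketched in one function: the daemon averages all clock values (its own included) and returns the adjustment each machine should apply. Names are illustrative.

```python
def berkeley_round(daemon_clock, other_clocks):
    """One round of the Berkeley algorithm: average all clocks, then
    return the per-machine adjustment (daemon first, then the others)."""
    clocks = [daemon_clock] + other_clocks
    average = sum(clocks) / len(clocks)
    return [average - c for c in clocks]

# Daemon reads 0, the other machines report +5 and -5 (minutes of skew)
print(berkeley_round(0, [5, -5]))  # [0.0, -5.0, 5.0]
```

Note that no machine is told the "true" time; they all converge on the average, which is the point of the algorithm when no UTC source is available.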

Averaging Algorithms
◼ Every R seconds, each machine broadcasts its current time.
◼ The local machine collects all other broadcast time samples during some time interval,
S.
◼ The simple algorithm:: the new local time is set as the average of the value received
from all other machines.
◼ A slightly more sophisticated algorithm :: Discard the m highest and m lowest to
reduce the effect of a set of faulty clocks.
◼ Another improved algorithm :: Correct each message by adding to the received time
an estimate of the propagation time from the ith source.
◼ extra probe messages are needed to use this scheme.
◼ One of the most widely used algorithms in the Internet is the Network Time Protocol
(NTP).
◼ Achieves worldwide accuracy in the range of 1-50 msec.
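The "slightly more sophisticated" variant above (discard the m highest and m lowest samples, then average) is easy to sketch; the sample values are made up for illustration.

```python
def trimmed_average_time(samples, m):
    """Average the broadcast time samples after discarding the m highest
    and m lowest, limiting the influence of faulty clocks."""
    s = sorted(samples)
    kept = s[m:len(s) - m] if m else s
    return sum(kept) / len(kept)

# One machine's clock (42.0) is clearly faulty and gets trimmed away.
print(trimmed_average_time([10.0, 10.1, 9.9, 42.0, 10.0], m=1))  # ≈ 10.03
```

With m = 0 this degenerates to the simple averaging algorithm; the propagation-time correction mentioned in the text would be applied to each sample before trimming.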

Time notion:
Each computer is equipped with a physical (hardware) clock, which can be viewed as a
counter incremented by the ticks of an oscillator. At time t, the operating system (OS)
of a process i reads the hardware clock H(t) of the processor. It then generates the
software clock C(t) = a H(t) + b, which approximately measures the time t of process i.

Time in Distributed Systems (DS):


Time is a key factor in a DS to analyze how distributed executions evolve
Problems:
Lack of a global reference time: it is hard to know the state of a process during a
distributed computation.
However, it's important for processes to share a common time notion
The technique used to coordinate a common time notion among processes is known as
Clock Synchronization

4.2 Logical Clocks


◼ For a certain class of algorithms, it is the internal consistency of the clocks that
matters. The convention in these algorithms is to speak of logical clocks.
◼ Lamport showed clock synchronization need not be absolute. What is important is
that all processes agree on the order in which events occur
◼ Lamport defined a relation ”happens before”. a → b ‘a happens before b’.
◼ Happens before is observable in two situations:
1. If a and b are events in the same process, and a occurs before b, then a → b is true.
2. If a is the event of a message being sent by one process, and b is the event of the
same message being received by another process, then a → b is also true.

Lamport Timestamps
a) Each processes with own clock with different rates.
b) Lamport's algorithm corrects the clocks.
c) Can add machine ID to break ties

Totally-Ordered Multicasting

◼ San Fran customer adds $100, NY bank adds 1% interest


◼ San Fran will have $1,111 and NY will have $1,110
◼ Updating a replicated database and leaving it in an inconsistent state.
◼ Can use Lamport’s to totally order
◼ A multicast operation by which all messages are delivered in the same order to each
receiver.
◼ Lamport Details:
◼ Each message is timestamped with the current logical time of its sender.
◼ Multicast messages are conceptually also sent to their sender.
◼ Assume all messages sent by one sender are received in the order they were
sent and that no messages are lost.
◼ Receiving process puts a message into a local queue ordered according to
timestamp.
◼ The receiver multicasts an ACK to all other processes.
◼ Key Point from Lamport: the timestamp of the received message is lower than
the timestamp of the ACK.
◼ All processes will eventually have the same copy of the local queue →
consistent global ordering.
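The queue-and-ACK mechanism above can be sketched for a single receiving process; the transport (and the sending of ACKs) is simulated by direct method calls, and the class name is illustrative. A message is delivered only when it is at the head of the timestamp-ordered queue and every process has acknowledged it.

```python
import heapq

class TOMProcess:
    """One process in a totally-ordered multicast sketch (transport simulated)."""
    def __init__(self, pid, n):
        self.pid, self.n = pid, n
        self.queue = []        # heap of (timestamp, sender, message)
        self.acks = {}         # (timestamp, sender) -> set of ackers
        self.delivered = []

    def receive(self, ts, sender, msg):
        heapq.heappush(self.queue, (ts, sender, msg))
        self.acks.setdefault((ts, sender), set())

    def receive_ack(self, ts, sender, acker):
        self.acks.setdefault((ts, sender), set()).add(acker)
        self.try_deliver()

    def try_deliver(self):
        # Deliver the head of the queue once every process has acknowledged it.
        while self.queue:
            ts, sender, msg = self.queue[0]
            if len(self.acks[(ts, sender)]) < self.n:
                break
            heapq.heappop(self.queue)
            self.delivered.append(msg)

p = TOMProcess(0, n=2)
p.receive(2, 1, "interest")   # the later message happens to arrive first
p.receive(1, 0, "deposit")
for acker in (0, 1):
    p.receive_ack(1, 0, acker)
    p.receive_ack(2, 1, acker)
print(p.delivered)  # ['deposit', 'interest']: timestamp order, at every replica
```

Because every replica runs the same delivery rule over the same timestamps, the bank-account updates in the example above would be applied in the same order everywhere.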
4.3 Global States
Need for physical clocks
One or more processors share a common bus, time isn't much of a concern. The
entire system shares the same understanding of time: right or wrong, it is
consistent.
Physical clock - Multiple systems
In distributed systems, this is not the case. Unfortunately, each system has its own
timer that drives its clock. Each timer has different characteristics, and those
characteristics might change with time, temperature, etc. This implies that each
system's time will drift away from the true time at a different rate.

Logical clock
• Messages sent between machines may arrive zero or more times at any point
after they are sent.
• If two machines do not interact, no need to synchronize them
Can we order the events on different machines using local time?
Causality The purpose of a logical clock is not necessarily to maintain the same notion
of time as a reliable watch. Instead, it is to keep track of information pertaining to the
order of events
Lamport’s logical clock

Key Ideas
Processes exchange messages
Message must be sent before received
Send/receive used to order events and to synchronize clocks
Happened before relation
Causally ordered events
Concurrent events
Implementation
Limitation of Lamport’s clock
Happened before relation
• a -> b : Event a occurred before event b. Events in the same process p1.
• b -> c : If b is the event of sending a message m1 in a process p1 and c is the
event of receipt of the same message m1 by another process p2.
If a -> b and b -> c, then a -> c; “->” is transitive.
Causally Ordered Events
a -> b : Event a “causally” affects event b
Concurrent Events
a || e: if a !-> e and e !-> a

Algorithm
Sending end
time = time+1;
time_stamp = time;
send(message, time_stamp);
Receiving end
(message, time_stamp) = receive();
time = max(time_stamp, time)+1;
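The send/receive rules above translate directly into a small runnable class (names are illustrative):

```python
class LamportClock:
    """Runnable version of the sending/receiving rules above."""
    def __init__(self):
        self.time = 0

    def send(self, message):
        self.time += 1
        return (message, self.time)   # the message carries its timestamp

    def receive(self, stamped):
        message, ts = stamped
        self.time = max(ts, self.time) + 1
        return message

p1, p2 = LamportClock(), LamportClock()
m = p1.send("hello")     # p1.time becomes 1
p2.time = 5              # p2's clock is already ahead
p2.receive(m)            # p2.time becomes max(1, 5) + 1 = 6
print(p1.time, p2.time)  # 1 6
```

The `max(...) + 1` step is what guarantees that a receive event always gets a larger timestamp than the corresponding send.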

If a -> b, then C(a) < C(b).

If b -> c, then C(b) and C(c) must be assigned in such a way that C(b) < C(c), and the
clock time, C, must always go forward (increasing), never backward (decreasing).
Corrections to time can be made by adding a positive value, never by subtracting one.
An illustration: three processes, each with its own clock. The clocks run at different
rates, and Lamport's algorithm corrects the clocks.

Limitations
• m1 -> m3 implies C(m1) < C(m3)
• m2 -> m3 implies C(m2) < C(m3)
• But did m1 or m2 cause m3 to be sent? The timestamps alone cannot tell.
• Lamport’s logical clocks lead to a situation where all events in a distributed system
are totally ordered. That is, if a -> b, then we can say C(a)<C(b).
• Unfortunately, with Lamport’s clocks, nothing can be said about the actual time of a
and b. If the logical clock says a -> b, that does not mean in reality that a actually
happened before b in terms of real time.
• The problem with Lamport clocks is that they do not capture causality.
• If we know that a -> c and b -> c we cannot say which action initiated c.
• This kind of information can be important when trying to replay events in a distributed
system (such as when trying to recover after a crash).
• The theory goes that if one node goes down, if we know the causal relationships
between messages, then we can replay those messages and respect the causal
relationship to get that node back up to the state it needs to be in.
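The limitation above can be demonstrated in a few lines: two causally unrelated events still receive ordered Lamport timestamps, so C(a) < C(b) cannot be read as "a happened before b". The starting clock values are arbitrary.

```python
class LamportClock:
    def __init__(self, start=0):
        self.time = start

    def event(self):        # a purely local event
        self.time += 1
        return self.time

p1 = LamportClock(start=0)
p2 = LamportClock(start=7)  # a process whose clock is simply further along
c_a = p1.event()  # event a on p1
c_b = p2.event()  # event b on p2, concurrent with a (no messages exchanged)
print(c_a < c_b)  # True, even though a did not cause b
```

Capturing causality requires carrying more information than a single counter; this is the gap that motivates richer clock schemes.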

4.4 Mutual Exclusion


Concurrent access of processes to a shared resource or data is executed in mutually
exclusive manner.
Only one process is allowed to execute the critical section (CS) at any given time.
In a distributed system, shared variables (semaphores) or a local kernel cannot be used to
implement mutual exclusion.
Message passing is the sole means for implementing distributed mutual exclusion.
Distributed mutual exclusion algorithms must deal with unpredictable message delays and
incomplete knowledge of the system state.

Three basic approaches for distributed mutual exclusion:
Token-based approach:
A unique token is shared among the sites.
A site is allowed to enter its CS if it possesses the token.
Mutual exclusion is ensured because the token is unique.

Non-token based approach:


Two or more successive rounds of messages are exchanged among the sites to
determine which site will enter the CS next.

Quorum based approach:


Each site requests permission to execute the CS from a subset of sites (called a
quorum).
Any two quorums contain a common site.
This common site is responsible to make sure that only one request executes the CS at
any time.

System Model:
The system consists of N sites, S1, S2, ..., SN.
We assume that a single process is running on each site. The process at site Si is
denoted by pi.
A site can be in one of the following three states: requesting the CS, executing the CS,
or neither requesting nor executing the CS (i.e., idle).
In the ‘requesting the CS’ state, the site is blocked and can not make further requests
for the CS. In the ‘idle’ state, the site is executing outside the CS.
In token-based algorithms, a site can also be in a state where a site holding the token is
executing outside the CS (called the idle token state).
At any instant, a site may have several pending requests for CS. A site queues up these
requests and serves them one at a time.

Requirements of Mutual Exclusion Algorithms


Safety Property: At any instant, only one process can execute the critical section.
Liveness Property: This property states the absence of deadlock and starvation. Two
or more sites should not endlessly wait for messages which will never arrive.
Fairness: Each process gets a fair chance to execute the CS. Fairness property
generally means the CS execution requests are executed in the order of their arrival
(time is determined by a logical clock) in the system.

Performance Metrics
The performance is generally measured by the following four metrics:

Message complexity: The number of messages required per CS execution by a site.

Synchronization delay: the time required, after a site leaves the CS, before the
next site enters the CS

Response time: The time interval a request waits for its CS execution to be over after its
request messages have been sent out.

System throughput: The rate at which the system executes requests for the CS.
system throughput = 1/(SD+E)
where SD is the synchronization delay and
E is the average critical section execution time.
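The throughput formula is worth a quick numeric check (the SD and E values are illustrative):

```python
def system_throughput(sd, e):
    """Requests per second the system can serve: 1 / (SD + E)."""
    return 1 / (sd + e)

# 5 ms synchronization delay, 20 ms average CS execution time
print(system_throughput(0.005, 0.020))  # ≈ 40 requests per second
```

Halving the synchronization delay (an algorithmic property) raises throughput even if CS execution time (an application property) is unchanged.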

Low and High Load Performance:


We often study the performance of mutual exclusion algorithms under two special
loading conditions, viz., “low load” and “high load”.
The load is determined by the arrival rate of CS execution requests.
Under low load conditions, there is seldom more than one request for the critical
section present in the system simultaneously.
Under heavy load conditions, there is always a pending request for critical section at a
site.

Lamport’s Algorithm:
Requests for CS are executed in the increasing order of timestamps and time is
determined by logical clocks.
Every site Si keeps a queue, request queuei , which contains mutual exclusion requests
ordered by their timestamps.
This algorithm requires communication channels to deliver messages in FIFO order.

Algorithm: Requesting the critical section: When a site Si wants to enter the CS, it
broadcasts a REQUEST(tsi , i) message to all other sites and places the request on request
queuei . ((tsi , i) denotes the timestamp of the request.)
When a site Sj receives the REQUEST(tsi , i) message from site Si , it places site
Si ’s request on request queuej and returns a timestamped REPLY message to Si .

Executing the critical section: Site Si enters the CS when the following two
conditions hold:
L1: Si has received a message with timestamp larger than (tsi , i) from all other sites.
L2: Si ’s request is at the top of request queuei .

Releasing the critical section:
Site Si , upon exiting the CS, removes its request from the top of its request queue
and broadcasts a timestamped RELEASE message to all other sites.
When a site Sj receives a RELEASE message from site Si , it removes Si ’s request
from its request queue.

When a site removes a request from its request queue, its own request may come at the
top of the queue, enabling it to enter the CS.
4.5 Election Algorithms
• Any process can serve as coordinator
• Any process can “call an election” (initiate the algorithm to choose a new
coordinator).
– There is no harm (other than extra message traffic) in having multiple
concurrent elections.
• Elections may be needed when the system is initialized, or if the coordinator crashes or
retires.
Assumption
• Every process/site has a unique ID; e.g.
– the network address
– a process number
• Every process in the system should know the values in the set of ID numbers, although
not which processors are up or down.
• The process with the highest ID number will be the new coordinator.
Process groups (as with ISIS toolkit or MPI) satisfy these requirements.
Requirements:
• When the election algorithm terminates a single process has been selected and
every process knows its identity.
• Formalize: every process pi has a variable ei to hold the coordinator’s process number.
– ∀i, ei = undefined or ei = P, where P is the non-crashed process with highest id
– All processes (that have not crashed) eventually set ei = P.

Bully Algorithm – Overview


• Process p calls an election when it notices that the coordinator is no longer
responding.
• High-numbered processes “bully” low-numbered processes out of the election, until
only one process remains.
• When a crashed process reboots, it holds an election. If it is now the highest-numbered
live process, it will win.



Process p sends an election message to all higher-numbered processes in the system.
If no process responds, then p becomes the coordinator.
If a higher-numbered process (q) responds, it sends p a message that terminates p’s role in
the algorithm.
The process q now calls an election (if it has not already done so).
Repeat until no higher-numbered process responds. The last process to call an election
“wins” the election.
The winner sends a message to other processes announcing itself as the new
coordinator.
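The rounds above can be condensed into a toy model (a hedged sketch: the `alive` set is an assumed oracle standing in for timeout-based failure detection, and the ELECTION/OK/COORDINATOR message exchanges are folded into recursion):

```python
def bully_election(caller, alive):
    """One run of the bully algorithm from the caller's viewpoint.

    caller -- ID of the (live) process starting the election
    alive  -- set of IDs of processes currently up
    Returns the ID of the elected coordinator.
    """
    # ELECTION messages go only to higher-numbered processes.
    higher = [p for p in alive if p > caller]
    if not higher:
        return caller            # nobody responded: the caller wins
    # Any responder bullies the caller out of the election; responders
    # then hold their own elections, and so on up to the highest ID.
    return bully_election(min(higher), alive)
```

Whatever live process starts the election, the result is always the highest-numbered live process.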

Analysis
• Works best if communication in the system has bounded latency so processes can
determine that a process has failed by knowing the upper bound (UB) on message
transmission time (T) and message processing time (M).
– UB = 2 * T + M
However, if a process calls an election when the coordinator is still active, the
coordinator will win the election.
Ring Algorithm – Overview
• The ring algorithm assumes that the processes are arranged in a logical ring and each
process knows the order of processes in the ring.
• Processes are able to “skip” faulty systems: instead of sending to process j, send to j + 1.
• Faulty systems are those that don’t respond in a fixed amount of time.
• P thinks the coordinator has crashed; builds an ELECTION message which contains its
own ID number.
• Sends to first live successor
• Each process adds its own number and forwards to next.
• OK to have two elections at once.



• When the message returns to p, it sees its own process ID in the list and knows that the
circuit is complete.
• P circulates a COORDINATOR message with the new high number.
• Here, both 2 and 5 elect 6:
[5,6,0,1,2,3,4]
[2,3,4,5,6,0,1]
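The two circulations listed above can be reproduced with a short sketch (illustrative names; a "faulty" process is simply one absent from the `alive` set, here process 7 in a ring of 8):

```python
def ring_election(start, alive, n):
    """Ring algorithm sketch: the ELECTION message circulates clockwise,
    skipping crashed processes and collecting live IDs as it goes.

    start -- ID of the initiating process (assumed alive)
    alive -- set of live process IDs
    n     -- ring size (IDs are 0 .. n-1)
    """
    ids = [start]
    p = (start + 1) % n
    while p != start:
        if p in alive:               # faulty processes are skipped
            ids.append(p)
        p = (p + 1) % n
    # Back at the initiator: the highest collected ID becomes coordinator,
    # announced via a COORDINATOR message (not modeled here).
    return max(ids), ids
```

With process 7 crashed, initiators 2 and 5 collect the two lists shown above, and both elect 6.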



4.6 Data-Centric Consistency Models
Consistency and Replication
· Introduction (what's it all about)
· Data-centric consistency
· Client-centric consistency
· Replica management
· Consistency protocols

Introduction Reasons for Replication


Performance and Scalability
Main issue: To keep replicas consistent, we generally need to ensure that
all conflicting operations are done in the same order everywhere
Conflicting operations: From the world of transactions:
o Read-write conflict: a read operation and a write operation act concurrently
o Write-write conflict: two concurrent write operations
Guaranteeing global ordering on conflicting operations may be a costly operation,
downgrading scalability

Solution: weaken consistency requirements so that hopefully global synchronization can
be avoided.

Data-Centric Consistency Models


Consistency model: a contract between a (distributed) data store and processes, in
which the data store specifies precisely what the results of read and write operations
are in the presence of concurrency.

Essence: A data store is a distributed collection of storages accessible to clients:


The general organization of a logical data store, physically distributed and replicated
across multiple processes.

Continuous Consistency: Degree of consistency:


· replicas may differ in their numerical value
· replicas may differ in their relative staleness
· there may be differences with respect to the number and order of performed update
operations
A conit (consistency unit) specifies the data unit over which consistency is to be
measured.
· e.g. a stock record, a weather report, etc.

example: numerical and ordering deviations



Conit: contains the variables x and y:
· Each replica maintains a vector clock.
· B sends A the operation [⟨5,B⟩: x := x + 2];
A has made this operation permanent (it cannot be rolled back).
· A has three pending operations ⇒ order deviation = 3.
· A has missed one operation from B, yielding a maximum difference of 5 units ⇒
numerical deviation = (1, 5).

Strict Consistency
Any read on a data item ‘x’ returns a value corresponding to the result of the most recent
write on ‘x’ (regardless of where the write occurred).

With Strict Consistency, all writes are instantaneously visible to all processes
and absolute global time order is maintained throughout the distributed system.

This is the consistency model “Holy Grail” – not at all easy in the real world, and all
but impossible within a DS.

Example: Begin with:


Wi(x)a –write by process Pi to data item x with the value a
Ri(x)b - read by process Pi from data item x returning the value b
Assume:
Time axis is drawn horizontally, with time increasing from left to right
Each data item is initially NIL.

▪ P1 does a write to a data item x, modifying its value to a.


▪ Operation W1(x)a is first performed on a copy of the data store that is
local to P1, and is then propagated to the other local copies.
▪ P2 later reads the value NIL, and some time after that reads a (from its
local copy of the store).
NOTE: it took some time to propagate the update of x to P2, which is perfectly acceptable.



Behavior of two processes, operating on the same data item:
a)A strictly consistent data-store.
b)A data-store that is not strictly consistent.

Sequential Consistency
A weaker consistency model, which represents a relaxation of the rules.
It is also much easier (indeed, possible) to implement.
Sequential Consistency: The result of any execution is the same as if the (read and write)
operations by all processes on the data-store were executed in some sequential order,
and the operations of each individual process appear in this sequence in the order
specified by its program.

Example: Time independent process


Four processes operating on the same data item x.
(a) A sequentially consistent data store:
· Process P1 first performs W(x)a on x.
· Later (in absolute time), process P2 performs a write operation, setting the value
of x to b.
· Both P3 and P4 first read value b, and later value a.
· The write operation of process P2 therefore appears to have taken place before that of P1.
(b) A data store that is not sequentially consistent:
· Violates sequential consistency – not all processes see the same interleaving of
write operations.
· To process P3, it appears as if the data item has first been changed to b, and later to a.
· BUT, P4 will conclude that the final value is b.

Example:
·Three concurrently-executing processes P1, P2, and P3
·Three integer variables x, y, and z, which stored in a (possibly distributed) shared
sequentially consistent data store.
·Assume that each variable is initialized to 0.
·An assignment corresponds to a write operation, whereas a print statement corresponds to
a simultaneous read operation of its two arguments.
·All statements are assumed to be indivisible.
·Various interleaved execution sequences are possible.
·With six independent statements, there are potentially 720 (6!) possible execution
sequences



·Consider the 120 (5!) sequences that begin with x ← 1.
o Half of these have print (x,z) before y ← 1 and thus violate program order.
o Half have print (x,y) before z ← 1 and also violate program order.
o Only 1/4 of the 120 sequences, or 30, are valid.
o Another 30 valid sequences are possible starting with y ← 1.
o Another 30 can begin with z ← 1, for a total of 90 valid execution sequences.
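The count of 90 valid sequences can be checked by brute force (a small sketch: each indivisible statement is modeled as a (process, step) pair, and an interleaving is valid when it respects each process's program order):

```python
from itertools import permutations

# The six indivisible statements, tagged by process and program position:
# P1: x <- 1 ; print(y,z)   P2: y <- 1 ; print(x,z)   P3: z <- 1 ; print(x,y)
stmts = [("P1", 0), ("P1", 1), ("P2", 0), ("P2", 1), ("P3", 0), ("P3", 1)]

def respects_program_order(seq):
    # Within each process, statement 0 (the assignment) must precede
    # statement 1 (the print).
    return all(seq.index((p, 0)) < seq.index((p, 1))
               for p in ("P1", "P2", "P3"))

# Of the 720 (6!) permutations, each process's ordering constraint
# eliminates half, leaving 720 / 2^3 = 90 valid execution sequences.
valid = [s for s in permutations(stmts) if respects_program_order(s)]
print(len(valid))   # 90
```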
Four valid execution sequences for the processes. The vertical axis is time.
· Figure (a) - The three processes are in order - P1, P2, P3.
· Each of the three processes prints two variables.
o Since the only values each variable can take on are the initial value (0), or the
assigned value (1), each process produces a 2-bit string.
o The numbers after Prints are the actual outputs that appear on the output device.
· Output concatenation of P1, P2, and P3 in sequence produces a 6-bit string that
characterizes a particular interleaving of statements.
o This is the string listed as the Signature.

Examples of possible output:


· 000000 is not permitted - implies that the Print statements ran before the assignment
statements, violating the requirement that statements are executed in program order.
· 001001 – is not permitted:
o First two bits 00 – y and z were both 0 when P1 did its printing. This situation
occurs only when P1 executes both statements before P2 or P3 starts.
o Second two bits 10 – P2 must run after P1 has started but before P3 has started.
o Third two bits 01 – P3 must complete before P1 starts, but we have already
established that P1 runs first: a contradiction.



Causal Consistency
· Writes that are potentially causally related must be seen by all processes in the same
order.
· Concurrent writes (i.e. writes that are NOT causally related) may be seen in a different
order by different processes.

Example:
· Interaction through a distributed shared database.
· Process P1 writes data item x.
· Then P2 reads x and writes y.
· Reading of x and writing of y are potentially causally related because the computation
of y may have depended on the value of x as read by P2 (i.e., the value written by P1).
· Conversely, if two processes spontaneously and simultaneously write two different data
items, these are not causally related.
· Operations that are not causally related are said to be concurrent.
· For a data store to be considered causally consistent, it is necessary that the store obeys
the following condition:
o Writes that are potentially causally related must be seen by all processes in the
same order.
o Concurrent writes may be seen in a different order on different machines.
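One standard way to decide whether two writes are potentially causally related is to tag them with vector clocks (this mechanism is not part of the definition above, but is commonly used to implement it; a minimal sketch with illustrative names):

```python
def happens_before(vc_a, vc_b):
    """True if the event carrying vector clock vc_a causally precedes
    the event carrying vc_b (component-wise <=, and not equal)."""
    return all(a <= b for a, b in zip(vc_a, vc_b)) and vc_a != vc_b

def concurrent(vc_a, vc_b):
    # Neither write causally precedes the other: a causally consistent
    # store may deliver the two writes in different orders at different
    # replicas.
    return not happens_before(vc_a, vc_b) and not happens_before(vc_b, vc_a)
```

For instance, if P2 writes b after reading P1's a, its clock dominates P1's, so the writes are causally related and must be seen in the same order everywhere; two spontaneous writes with incomparable clocks are concurrent.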

Example 1: causal consistency


· This sequence is allowed with a causally-consistent store, but not with a sequentially
consistent store.

NOTE: The writes W2(x)b and W1(x)c are concurrent, so it is not required that all processes
see them in the same order.

Example 2: causal consistency


(a) A violation of a causally-consistent store:
· W2(x)b potentially depends on W1(x)a, because b may result from a computation
involving the value read by R2(x)a.
· The two writes are causally related, so all processes must see them in the same order.
· The sequence shown is therefore incorrect.
(b) A correct sequence of events in a causally-consistent store:
· The read has been removed, so W1(x)a and W2(x)b are now concurrent writes.
· A causally-consistent store does not require concurrent writes to be globally ordered,
so the sequence is correct.
· Note: this situation would not be acceptable for a sequentially consistent store.



FIFO Consistency
Writes done by a single process are seen by all other processes in the order in which they
were issued, but writes from different processes may be seen in a different order by different
processes.
· Also called “PRAM Consistency” – Pipelined RAM.
· Easy to implement – there are no guarantees about the order in which different
processes see writes, except that two or more writes from a single process must be
seen in order.
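FIFO consistency is typically achieved by tagging each write with a per-sender sequence number and buffering out-of-order arrivals (a hedged sketch; `FifoReceiver` and its fields are illustrative names, not a specific library API):

```python
from collections import defaultdict

class FifoReceiver:
    """Deliver each sender's writes in the order they were issued,
    buffering any gaps.  Writes from different senders may still be
    delivered in different relative orders at different replicas."""

    def __init__(self):
        self.next_seq = defaultdict(int)   # expected seq number per sender
        self.buffer = defaultdict(dict)    # sender -> {seq: value}
        self.delivered = []                # delivery order at this replica

    def receive(self, sender, seq, value):
        self.buffer[sender][seq] = value
        # Deliver as many consecutive in-order writes from this sender
        # as are now available.
        while self.next_seq[sender] in self.buffer[sender]:
            v = self.buffer[sender].pop(self.next_seq[sender])
            self.delivered.append((sender, v))
            self.next_seq[sender] += 1
```

A write that arrives before its predecessor from the same sender is simply held back until the gap is filled.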

Example:

· A valid sequence of FIFO consistency events.


· Note that none of the consistency models given so far would allow this sequence of
events.

Weak Consistency
· Not all applications need to see all writes, let alone seeing them in the same order.
· Leads to Weak Consistency (which is primarily designed to work
with distributed critical sections).
· This model introduces the notion of a “synchronization variable”, which is used to update
all copies of the data-store.
Properties Weak Consistency:
1. Accesses to synchronization variables associated with a data-store are sequentially
consistent.
2. No operation on a synchronization variable is allowed to be performed until all previous
writes have been completed everywhere.
3. No read or write operation on data items are allowed to be performed until all previous
operations to synchronization variables have been performed.
Meaning: by doing a sync,
· a process can force the just written value out to all the other replicas.
· a process can be sure it’s getting the most recently written value before it reads.
Essence: the weak consistency models enforce consistency on a group of operations, as
opposed to individual reads and writes (as is the case with strict, sequential, causal and FIFO
consistency).

Grouping Operations
· Accesses to synchronization variables are sequentially consistent.
· No access to a synchronization variable is allowed to be performed until all
previous writes have completed everywhere.
· No data access is allowed to be performed until all previous accesses to
synchronization variables have been performed.



Basic idea: You don't care that reads and writes of a series of operations are immediately
known to other processes.
· You just want the effect of the series itself to be known.

Convention: when a process enters its critical section it should acquire the
relevant synchronization variables, and likewise when it leaves the critical section, it releases
these variables.

Critical section: a piece of code that accesses a shared resource (data structure or device) that
must not be concurrently accessed by more than one thread of execution.

Synchronization variables: are synchronization primitives that are used to coordinate the
execution of processes based on asynchronous events.
· When allocated, synchronization variables serve as points upon which one or more
processes can block until an event occurs.
· Then one or all of the processes can be unblocked at the same time.
· Each synchronization variable has a current owner, namely, the process that last acquired
it.
o The owner may enter and exit critical sections repeatedly without having to
send any messages on the network.
o A process not currently owning a synchronization variable but wanting to
acquire it has to send a message to the current owner asking for ownership and
the current values of the data associated with that synchronization variable.
o It is also possible for several processes to simultaneously own a
synchronization variable in nonexclusive mode, meaning that they can read,
but not write, the associated data.
Note: that the data in a process' critical section may be associated to different synchronization
variables.
The following criteria must be met
1. An acquire access of a synchronization variable is not allowed to perform with respect
to a process until all updates to the guarded shared data have been performed with
respect to that process.
2. Before an exclusive mode access to a synchronization variable by a process is allowed
to perform with respect to that process, no other process may hold the synchronization
variable, not even in nonexclusive mode.
3. After an exclusive mode access to a synchronization variable has been performed, any
other process' next nonexclusive mode access to that synchronization variable may not
be performed until it has performed with respect to that variable's owner.
Example:

a) A valid sequence of events for weak consistency.


· This is because P2 and P3 have yet to synchronize, so there’s no guarantees about the
value in ‘x’.
b)An invalid sequence for weak consistency.
· P2 has synchronized, so it cannot read ‘a’ from ‘x’ – it should be getting ‘b’.
Release Consistency
· Question: how does a weakly consistent data-store know that the sync is the result of a
read or a write?
o Answer: It doesn’t!
· It is possible to implement efficiencies if the data-store is able to determine whether the
sync is a read or write.
· Two sync variables can be used, acquire and release, and their use leads to the Release
Consistency model.
· When a process does an acquire, the data-store will ensure that all the local copies of
the protected data are brought up to date to be consistent with the remote ones, if need be.
· When a release is done, protected data that have been changed are propagated out to
the local copies of the data-store.
Example:

A valid event sequence for release consistency.


· Process P3 has not performed an acquire, so there are no guarantees that the read of ‘x’
is consistent.
· The data-store is simply not obligated to provide the correct answer.
· P2 does perform an acquire, so its read of ‘x’ is consistent.
Release Consistency Rules
A distributed data-store is “Release Consistent” if it obeys the following rules:
1. Before a read or write operation on shared data is performed, all previous acquires
done by the process must have completed successfully.
2. Before a release is allowed to be performed, all previous reads and writes by the
process must have completed.
3. Accesses to synchronization variables are FIFO consistent (sequential consistency is
not required).

Entry consistency
● Acquire and release are still used, and the data-store meets the following conditions:
● An acquire access of a synchronization variable is not allowed to perform with respect
to a process until all updates to the guarded shared data have been performed with
respect to that process.
● Before an exclusive mode access to a synchronization variable by a process is allowed
to perform with respect to that process, no other process may hold the synchronization
variable, not even in nonexclusive mode.
● After an exclusive mode access to a synchronization variable has been performed, any
other process's next nonexclusive mode access to that synchronization variable may
not be performed until it has performed with respect to that variable's owner.
· At an acquire, all remote changes to guarded data must be brought up to date.
· Before a write to a data item, a process must ensure that no other process is trying to
write at the same time.
Locks are associated with individual data items, as opposed to the entire data-store.
· a lock is a synchronization mechanism for enforcing limits on access to a resource in
an environment where there are many threads of execution.
Example:
· P1 does an acquire for x, changes x once, after which it also does an acquire for y.
· Process P2 does an acquire for x but not for y, so that it will read value a for x, but
may read NIL for y.
· Because process P3 first does an acquire for y, it will read the value b when y is
released by P1.
Note: P2’s read on ‘y’ returns NIL as no locks have been requested.

A valid event sequence for entry consistency.

Summary of Consistency Models

Consistency models that do not use synchronization operations:

Strict: Absolute time ordering of all shared accesses matters.
Linearizability: All processes must see all shared accesses in the same order. Accesses
are furthermore ordered according to a (non-unique) global timestamp.
Sequential: All processes see all shared accesses in the same order. Accesses are not
ordered in time.
Causal: All processes see causally-related shared accesses in the same order.
FIFO: All processes see writes from each other process in the order they were issued.
Writes from different processes may not always be seen in that order.

Consistency models that do use synchronization operations:

Weak: Shared data can be counted on to be consistent only after a synchronization is done.
Release: Shared data are made consistent when a critical region is exited.
Entry: Shared data pertaining to a critical region are made consistent when a critical
region is entered.

Consistency versus Coherence


· A number of processes execute read and write operations on a set of data items.
· A consistency model describes what can be expected with respect to that set when
multiple processes concurrently operate on that data.
· The set is then said to be consistent if it adheres to the rules described by the model.
· Coherence models describe what can be expected for only a single data item.
· Example: coherence corresponds to the sequential consistency model applied to a
single data item.



4.7 Client-Centric Consistency Models
The consistency models above concern maintaining a consistent (globally accessible)
data-store in the presence of concurrent read/write operations. Another class of
distributed data store is characterized by the lack of simultaneous updates.
o Here, the emphasis is more on maintaining a consistent view of things for the
individual client process that is currently operating on the data-store.

Question: How fast should updates (writes) be made available to read-only processes?
· Most database systems: mainly read.
· DNS: write-write conflicts do no occur.
· WWW: as with DNS, except that heavy use of client-side caching is present: even
the return of stale pages is acceptable to most users.
NOTE: all exhibit a high degree of acceptable inconsistency … with the replicas gradually
become consistent over time.

Eventual Consistency
Special class of distributed data stores:
· Lack of simultaneous updates
· When updates occur they can easily be resolved.
· Most operations involve reading data.
· These data stores offer a very weak consistency model, called eventual consistency.

The eventual consistency model states that, when no updates occur for a long period of
time, eventually all updates will propagate through the system and all the replicas will
be consistent.

Example: Consistency for Mobile Users


· Consider a distributed database to which you have access through your notebook.
· Assume your notebook acts as a front end to the database.
o At location A you access the database doing reads and updates.
o At location B you continue your work, but unless you access the same server as
the one at location A, you may detect inconsistencies:
your updates at A may not have yet been propagated to B
you may be reading newer entries than the ones available at A
your updates at B may eventually conflict with those at A

Note: The only thing you really want is that the entries you updated and/or read at A, are
in B the way you left them in A. In that case, the database will appear to be consistent to you.



o Client-centric consistency models originate from the work on Bayou
o Bayou is a database system developed for mobile computing, where it is assumed that
network connectivity is unreliable and subject to various performance problems.
o Wireless networks and networks that span large areas, such as the Internet, fall
into this category.
o Bayou distinguishes four different consistency models:
1. monotonic reads
2. monotonic writes
3. read your writes
4. writes follow reads

Notation:
o xi[t] denotes the version of data item x at local copy Li at time t.
o WS(xi[t]) is the set of write operations at Li that led to version xi of x (at time t).
o If the operations in WS(xi[t1]) have also been performed at local copy Lj at a later
time t2, we write WS(xi[t1]; xj[t2]).
o If the ordering of operations or the timing is clear from the context, the time index
will be omitted.

Monotonic Reads
If a process reads the value of a data item x, any successive read operation on x by that
process will always return that same or a more recent value.
o Monotonic-read consistency guarantees that if a process has seen a value of x at time t, it
will never see an older version of x at a later time.
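The guarantee can be checked per client session: model each read as returning the version of x it observed, and require the sequence to be non-decreasing (an illustrative sketch, with integer version numbers standing in for the write sets WS(xi)):

```python
def monotonic_reads_ok(session_reads):
    """Check one client's session for monotonic-read consistency.

    session_reads -- versions of x returned by the client's successive
                     reads, possibly served by different local copies.
    Each read must return the same or a more recent version than the
    previous one.
    """
    return all(a <= b for a, b in zip(session_reads, session_reads[1:]))
```

A session [1, 1, 3] is fine (same then newer); a session [2, 1] violates the model, since the client sees an older version after a newer one.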

Example: Automatically reading your personal calendar updates from different servers.
o Monotonic Reads guarantees that the user sees all updates, no matter from which
server the automatic reading takes place.

Example: Reading (not modifying) incoming mail while you are on the move.



o Each time you connect to a different e-mail server, that server fetches (at least) all
the updates from the server you previously visited.

Example:
o The read operations performed by a single process P at two different local copies of
the same data store.
o Vertical axis - two different local copies of the data store are shown - L1 and L2
o Time is shown along the horizontal axis
o Operations carried out by a single process P in boldface are connected by a dashed
line representing the order in which they are carried out.
(a) A monotonic-read consistent data store. (b) A data store that does not provide
monotonic reads.

(a) A monotonic-read consistent data store:
o Process P first performs a read operation on x at L1, returning the value of x1 (at
that time).
o This value results from the write operations in WS(x1) performed at L1.
o Later, P performs a read operation on x at L2, shown as R(x2).
o To guarantee monotonic-read consistency, all operations in WS(x1) should have
been propagated to L2 before the second read operation takes place.
(b) A data store that does not provide monotonic reads:
o After process P has read x1 at L1, it later performs the operation R(x2) at L2.
o But only the write operations in WS(x2) have been performed at L2.
o No guarantees are given that this set also contains all operations contained in
WS(x1).

Monotonic Writes
In a monotonic-write consistent store, the following condition holds:
A write operation by a process on a data item x is completed before any successive
write operation on x by the same process.

Hence: A write operation on a copy of item x is performed only if that copy has been brought
up to date by means of any preceding write operation, which may have taken place on other
copies of x. If need be, the new write must wait for old ones to finish.

Example: Updating a program at server S2, and ensuring that all components on which
compilation and linking depends, are also placed at S2.

Example: Maintaining versions of replicated files in the correct order everywhere (propagate
the previous version to the server where the newest version is installed).

Example:
The write operations performed by a single process P at two different local copies of the same
data store.



(a) A monotonic-write consistent data store. (b) A data store that does not provide
monotonic-write consistency.

(a) A monotonic-write consistent data store:
o Process P performs a write operation on x at local copy L1, presented as the
operation W(x1).
o Later, P performs another write operation on x, but this time at L2, shown as W(x2).
o To ensure monotonic-write consistency, the previous write operation at L1 must
have been propagated to L2.
o This explains operation W(x1) at L2, and why it takes place before W(x2).
(b) A data store that does not provide monotonic-write consistency:
o Missing is the propagation of W(x1) to copy L2.
o No guarantees can be given that the copy of x on which the second write is being
performed has the same or a more recent value than it had when W(x1) completed
at L1.

Read Your Writes


o A client-centric consistency model that is closely related to monotonic reads is as
follows. A data store is said to provide read-your-writes consistency, if the following
condition holds:
The effect of a write operation by a process on data item x will always be seen by a
successive read operation on x by the same process.

Hence: a write operation is always completed before a successive read operation by the same
process, no matter where that read operation takes place.
Example: Updating your Web page and guaranteeing that your Web browser shows the
newest version instead of its cached copy.

Example:
(a) A data store that provides read-your-writes consistency. (b) A data store that does not.

(a) A data store that provides read-your-writes consistency:
o Process P performed a write operation W(x1) and later a read operation at a
different local copy.
o Read-your-writes consistency guarantees that the effects of the write operation can
be seen by the succeeding read operation.
o This is expressed by WS(x1; x2), which states that W(x1) is part of WS(x2).
(b) A data store that does not:
o W(x1) has been left out of WS(x2), meaning that the effects of the previous write
operation by process P have not been propagated to L2.



Writes Follow Reads

A data store is said to provide writes-follow-reads consistency, if the following holds:

A write operation by a process on a data item x following a previous read operation on


x by the same process is guaranteed to take place on the same or a more recent value
of x that was read.

Hence: any successive write operation by a process on a data item x will be performed on a
copy of x that is up to date with the value most recently read by that process.

Example: You see reactions to posted articles only if you have the original posting (a read
“pulls in” the corresponding write operation).

Example:
(a) A writes-follow-reads consistent data store. (b) A data store that does not provide
writes-follow-reads consistency.

(a) A writes-follow-reads consistent data store:
o A process reads x at local copy L1.
o The write operations that led to the value just read also appear in the write set
at L2, where the same process later performs a write operation.
o (Note that other processes at L2 see those write operations as well.)
(b) A data store that does not provide writes-follow-reads consistency:
o No guarantees are given that the operation performed at L2 is carried out on a
copy that is consistent with the one just read at L1.



4.8 Replica Management
Key issues:
o Decide where, when, and by whom replicas should be placed
o Which mechanisms to use for keeping the replicas consistent.

Placement problem:
o Placing replica servers
· Replica-server placement is concerned with finding the best locations to place a
server that can host (part of) a data store.
o Placing content.
· Content placement deals with finding the best servers for placing content.

Replica-Server Placement
Essence: Figure out what the best K places are out of N possible locations.

1. Greedy: in each round, select the best location among the remaining candidates, i.e. the
one for which the average distance to clients is minimal given the servers already chosen;
repeat until k servers are placed. (Note: the first chosen location minimizes the average
distance to all clients.) Computationally expensive.
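The greedy heuristic of step 1 can be sketched as follows (assuming a precomputed distance matrix between candidate locations and clients; each client is assumed to be served by its nearest chosen server):

```python
def greedy_placement(dist, k):
    """Greedy replica-server placement: repeatedly pick, among the
    remaining candidate locations, the one that minimizes the total
    client distance given the servers already chosen.

    dist -- dist[i][c]: distance from candidate location i to client c
    k    -- number of replica servers to place
    Returns the list of chosen candidate indices, in selection order.
    """
    n, chosen = len(dist), []
    for _ in range(k):
        best, best_cost = None, float("inf")
        for i in range(n):
            if i in chosen:
                continue
            # Each client is served by its nearest already-chosen server
            # (or by candidate i, if that is closer).
            cost = sum(min(dist[s][c] for s in chosen + [i])
                       for c in range(len(dist[0])))
            if cost < best_cost:
                best, best_cost = i, cost
        chosen.append(best)
    return chosen
```

Each round scans all remaining candidates against all clients, which is why the text flags the approach as computationally expensive for large N.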

2. Select the k-th largest autonomous system and place a server at the best-connected
host. Computationally expensive.
o An autonomous system (AS) can best be viewed as a network in which the
nodes all run the same routing protocol and which is managed by a single
organization.

3. Position nodes in a d-dimensional geometric space, where distance reflects latency.


Identify the K regions with highest density and place a server in every
one. Computationally cheap. (Szymaniak et al. 2006)

Content Replication and Placement


Model: We consider objects (and don't worry whether they contain just data or code, or both)
Distinguish different processes: A process is capable of hosting a replica of an object or
data:
· Permanent replicas: Process/machine always having a replica
· Server-initiated replica: Process that can dynamically host a replica on request of
another server in the data store
· Client-initiated replica: Process that can dynamically host a replica on request of a
client (client cache)
· The logical organization of different kinds of copies of a data store into three concentric
rings.



Permanent Replicas
Examples:
· The initial set of replicas that constitute a distributed data store.
· The number of permanent replicas is small.
· The files that constitute a site are replicated across a limited number of servers at a
single location.
o Whenever a request comes in, it is forwarded to one of the servers, for instance,
using a round-robin strategy.
· Alternatively, a database is distributed and possibly replicated across a number of
geographically dispersed sites.
o This architecture is generally deployed in federated databases

Server-Initiated Replicas : Copies of a data store are created to enhance performance


Example:
· Dynamic placement of replica servers in Web hosting services.
Example:
· Placement of content with replica servers in place
· Dynamic replication of files for Web hosting service is also there
· Algorithm is designed to support Web pages for which reason it assumes that updates
are relatively rare compared to read requests.
o Algorithm considers:
1. Replication can be used to reduce the load on a server.
2. Specific files on a server can be migrated or replicated to servers placed in
the proximity of clients that issue many requests for those files.
· Each server keeps track of:
o access counts per file
o where access requests originate
· Given a client C, each server can determine which of the servers in the Web hosting
service is closest to C.
· If client C1 and client C2 share the same "closest" server P, all access requests for
file F at server Q from C1 and C2 are jointly registered at Q as a single access count
cntQ(P,F).
· Removed from S - when the number of requests for a file F at server S drops below
a deletion threshold del (S,F)
· Replicate F – when replication threshold rep (S,F) is surpassed
· Migrate F – when the number of requests lies between the deletion and replication
thresholds
Counting access requests from different clients.
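The threshold logic above can be sketched as a small decision function. This is a minimal illustration of the delete/migrate/replicate choice, not the published algorithm; the function name and threshold values are assumptions:

```python
def placement_decision(count, deletion_threshold, replication_threshold):
    """Decide what server S does with file F based on its access count:
    below del(S,F) drop the copy, above rep(S,F) replicate it closer to
    the clients, and in between consider migrating it instead."""
    if count < deletion_threshold:
        return "delete"          # too few requests: remove the replica
    if count > replication_threshold:
        return "replicate"       # very popular: create an extra replica
    return "migrate"             # in between: move it toward the clients


# Illustrative thresholds: del(S,F) = 5, rep(S,F) = 50.
print(placement_decision(1, 5, 50))
print(placement_decision(100, 5, 50))
print(placement_decision(10, 5, 50))
```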



Client-Initiated Replicas
· Client-initiated replicas aka (client) caches.
· Cache is a local storage facility that is used by a client to temporarily store a copy of the
data it has just requested.
· Managing cache is left to the client.
· Used to improve access times to data.

Approaches to Cache placement:


· Traditional file systems: data files are rarely shared at all rendering a shared cache
useless.
· LAN caches : machine shared by clients on the same local-area network.
· WAN caches: place (cache) servers at specific points in a wide-area network and let
a client locate the nearest server.
o When the server is located, it can be requested to hold copies of the data the
client was previously fetching from somewhere else,

Content Distribution State versus Operations

What is to be propagated:

1. Propagate only a notification of an update.


· Performed by an invalidation protocol
· In an invalidation protocol, other copies are informed that an update has
taken place and that the data they contain are no longer valid.
i. Adv. - Use less bandwidth

2. Transfer data from one copy to another.


· Useful when the read-to-write ratio is relatively high
· Modifications:
i. Log the changes and transfer only those logs to
save bandwidth.
ii. Aggregate transfers so that multiple modifications
are packed into a single message to save communication overhead.

3. Propagate the update operation to other copies - active replication


· Do not to transfer any data modifications at all
· Tell each replica which update operation it should perform and send only the
parameter values that those operations need
· Assumes that each replica is represented by a process capable of "actively"
keeping its associated data up to date by performing operations (Schneider, 1990)
· Adv. –
o updates can be propagated at minimal bandwidth costs, provided the size
of the parameters associated with an operation are relatively small
o the operations can be of arbitrary complexity, which may allow further
improvements in keeping replicas consistent
· Disadv. -
o more processing power may be required by each replica



Pull versus Push Protocols
Should updates be pulled or pushed?
· Push-based - server-based protocols - updates are propagated to other replicas without
those replicas requesting updates.
o Used between permanent and server-initiated replicas, and can be used to push
updates to client caches.
o Used when replicas need to maintain a relatively high degree of consistency.

· Pull-based - client-based protocols - a server or client requests another server to send it


updates it has at that moment.
o Used by client caches.

Comparison between push-based and pull-based protocols in the case of multiple-client,
single-server systems:

Issue                    Push-based                                 Pull-based
State at server          Must keep list of client replicas/caches   None
Messages sent            Update (and possibly fetch update later)   Poll and update
Response time at client  Immediate (or fetch-update time)           Fetch-update time

· These trade-offs have led to a hybrid form of update propagation based on leases.


· A lease is a promise by the server that it will push updates to the client for a
specified time.
· When a lease expires, the client is forced to poll the server for updates and pull
in the modified data if necessary.
o Or a client requests a new lease for pushing updates when the previous lease
expires.
· Dynamically switches between push-based and pull-based strategy.

Issue: Make lease expiration time dependent on system's behavior


· Age-based leases: An object that hasn't changed for a long time, will not change in
the near future, so provide a long-lasting lease
· Renewal-frequency based leases: The more often a client requests a specific
object, the longer the expiration time for that client (for that object) will be
· State-based leases: The more loaded a server is, the shorter the expiration times
become
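The three lease-expiration policies can be sketched as simple functions. This is an illustrative sketch; the base values, caps, and formulas are assumptions for the example, not values from the literature:

```python
def age_based_lease(seconds_since_last_change, base=10, cap=600):
    """Age-based: objects that have been stable for longer get longer leases."""
    return min(base + seconds_since_last_change, cap)

def renewal_frequency_lease(requests_per_minute, base=10, cap=600):
    """Renewal-frequency based: clients that request an object more often
    get a longer expiration time for it."""
    return min(base * (1 + requests_per_minute), cap)

def state_based_lease(server_load, max_lease=600):
    """State-based: the more loaded the server (load in [0, 1]),
    the shorter the lease it hands out."""
    return max_lease * (1.0 - server_load)

print(age_based_lease(0))           # a freshly changed object
print(age_based_lease(100000))      # a long-stable object hits the cap
print(state_based_lease(1.0))       # a fully loaded server grants no lease
```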

Epidemic Protocols
· Used to implement Eventual Consistency (note: these protocols are used in Bayou).
· Main concern is the propagation of updates to all the replicas in as few
messages as possible.
· Idea is to “infect” as many replicas as quickly as possible.
Infective replica: a server that holds an update that can be spread to other
replicas.
Susceptible replica: a yet to be updated server.
Removed replica: an updated server that will not (or cannot) spread the update
to any other replicas.
· The trick is to get all susceptible servers to either infective or removed states as
quickly as possible without leaving any replicas out.



The Anti-Entropy Protocol
· Entropy: “a measure of the degradation or disorganization of the universe”.
· Server P picks Q at random and exchanges updates, using one of three approaches:
1.P only pushes to Q.
2.P only pulls from Q.
3.P and Q push and pull from each other.
· Sooner or later, all the servers in the system will be infected (updated). Works well.
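One round of anti-entropy can be sketched as follows, modeling each data store as a dict mapping items to version numbers. The dict representation is an assumption for illustration; real systems track updates with logs or vector timestamps:

```python
def anti_entropy_round(p_store, q_store, mode="push-pull"):
    """Server P exchanges updates with randomly chosen partner Q using one
    of the three approaches: push only, pull only, or push-pull.
    Newer versions of an item win."""
    if mode in ("push", "push-pull"):            # P pushes its updates to Q
        for item, version in p_store.items():
            if version > q_store.get(item, -1):
                q_store[item] = version
    if mode in ("pull", "push-pull"):            # P pulls Q's updates
        for item, version in q_store.items():
            if version > p_store.get(item, -1):
                p_store[item] = version

p = {"x": 2}
q = {"x": 1, "y": 3}
anti_entropy_round(p, q)    # after push-pull, both replicas agree
```

After enough random pairings, every server ends up holding every update, which is why the protocol is guaranteed to infect all replicas eventually.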

The Gossiping Protocol


This variant is referred to as “gossiping” or “rumour spreading”, and works as follows:
1. P has just been updated for item ‘x’.
2. It immediately pushes the update of ‘x’ to Q.
3. If Q already knows about ‘x’, P becomes disinterested in spreading any more updates
(rumours) and is removed.
4. Otherwise P gossips to another server, as does Q.
This approach is good, but can be shown not to guarantee the propagation of all updates to all
servers.
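The rumour-spreading rule can be simulated with a small sketch. The stopping rule follows the description above: a sender that contacts an already-infected partner loses interest and is removed. The function name and structure are illustrative assumptions:

```python
import random

def gossip(num_servers, seed=None):
    """Simulate rumour spreading: server 0 holds a fresh update and pushes
    it to random partners. A sender whose partner already knew the rumour
    is removed; otherwise both keep gossiping. Returns the set of servers
    that ended up infected (possibly not all of them)."""
    rng = random.Random(seed)
    infected = {0}               # server 0 holds the update
    active = [0]                 # infective servers still gossiping
    while active:
        p = active.pop()
        q = rng.choice([s for s in range(num_servers) if s != p])
        if q in infected:
            continue             # Q knew already: P is removed
        infected.add(q)
        active.extend([p, q])    # both P and Q keep spreading the rumour
    return infected

reached = gossip(20, seed=42)
```

Running this repeatedly shows that the rumour usually reaches most servers quickly, but no run is guaranteed to reach all of them, which is exactly the weakness noted above.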

The Best of Both Worlds


A mix of anti-entropy and gossiping is regarded as the best approach to rapidly infecting
systems with updates.
BUT... what about removing data?
o Updates are easy, deletion is much, much harder!
o Under certain circumstances, after a deletion, an “old” reference to the deleted
item may appear at some replica and cause the deleted item to be reactivated!
o One solution is to issue “Death Certificates” for data items – these are a special
type of update.
o Only problem remaining is the eventual removal of “old” death certificates (with
which timeouts can help).



4.9 Consistency Protocols
A consistency protocol describes an implementation of a specific consistency model.
The most widely implemented models are:
1. Sequential Consistency.
2. Weak Consistency (with synchronization variables).
3. Atomic Transactions.
(The latter two group operations through locking or transactions.)

Numerical Errors
Principle: consider a data item x and let weight(W) denote the numerical change in its value
after a write operation W.
· Assume that ∀W : weight(W) > 0.
· W is initially forwarded to one of the N replicas, denoted as origin(W).
· TW[i, j] is the total weight of the writes executed by server Si that originated from Sj:

    TW[i, j] = sum of weight(W) over all writes W executed at Si with origin(W) = Sj

Note: The actual value v(t) of x is:

    v(t) = v_init + SUM(k = 1..N) TW[k, k]

and the value vi of x as seen at replica i is:

    vi = v_init + SUM(k = 1..N) TW[i, k]

Problem: We need to ensure that v(t) - vi < δi for every server Si.
Approach: Let every server Sk maintain a view TWk[i, j] of what it believes is the value
of TW[i, j]. This information can be gossiped when an update is propagated.
Note: 0 ≤ TWk[i, j] ≤ TW[i, j] ≤ TW[j, j].
Solution: Sk sends operations from its log to Si when it sees that TWk[i, k] is lagging too far
behind TW[k, k], in particular, when TW[k, k] - TWk[i, k] > δi / (N - 1).
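The forwarding condition in the solution above can be sketched as a simple predicate. The function and parameter names are illustrative:

```python
def must_forward_log(tw_kk, twk_ik, delta_i, num_replicas):
    """Return True when server S_k should send pending write operations
    from its log to S_i: the writes S_k has executed locally (TW[k,k])
    have run too far ahead of what S_k believes S_i has seen of them
    (TWk[i,k]), relative to S_i's numerical-error bound delta_i."""
    return tw_kk - twk_ik > delta_i / (num_replicas - 1)

# N = 4 replicas, bound delta_i = 12, so the per-peer slack is 12/3 = 4.
print(must_forward_log(10, 2, 12, 4))   # gap of 8 exceeds 4: forward
print(must_forward_log(3, 2, 12, 4))    # gap of 1 is within bounds
```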

Primary-Based Protocols
· Used for sequential consistency
· Each data item is associated with a “primary” replica.
· The primary is responsible for coordinating writes to the data item.
· There are two types of Primary-Based Protocol:
1.Remote-Write.
2.Local-Write.

Remote-Write Protocols
· AKA primary backup protocols
· All writes are performed at a single (remote) server.
· Read operations can be carried out locally.
· This model is typically associated with traditional client/server systems.



Example:
1. A process wanting to perform a write operation on data item x, forwards that
operation to the primary server for x.
2. The primary performs the update on its local copy of x, and forwards the update to the
backup servers.
3. Each backup server performs the update as well, and sends an acknowledgment back
to the primary.
4. When all backups have updated their local copy, the primary sends an
acknowledgment back to the initial process.
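The four-step remote-write protocol above can be sketched as follows. The classes are illustrative stand-ins, with direct method calls playing the role of messages; a real implementation would use RPC and handle failures:

```python
class Backup:
    """A backup server that applies forwarded updates (step 3)."""
    def __init__(self):
        self.store = {}

    def apply(self, key, value):
        self.store[key] = value
        return "ack"                    # acknowledge to the primary

class Primary:
    """The primary for a data item: coordinates all writes."""
    def __init__(self, backups):
        self.store = {}
        self.backups = backups

    def write(self, key, value):        # step 1: client forwards write here
        self.store[key] = value         # step 2: update the local copy
        acks = [b.apply(key, value) for b in self.backups]  # forward to backups
        assert all(a == "ack" for a in acks)
        return "ack"                    # step 4: ack the initial process

backups = [Backup(), Backup()]
primary = Primary(backups)
primary.write("x", 42)                  # a blocking write: returns only after
                                        # every backup has been updated
```

Because the primary forwards every write to each backup in the same order, all replicas apply the same update sequence, which is what makes sequential consistency easy here.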

The Bad and Good of Primary-Backup


· Bad: Performance!
o All of those writes can take a long time (especially when a “blocking write
protocol” is used).
o Using a non-blocking write protocol to handle the updates can lead to fault
tolerant problems (which is our next topic).
· Good: as the primary is in control, all writes can be sent to each backup replica IN THE
SAME ORDER, making it easy to implement sequential consistency.

Local-Write Protocols
o AKA fully migrating approach
o A single copy of the data item is still maintained.
o Upon a write, the data item gets transferred to the replica that is writing.
o the status of primary for a data item is transferrable.

Process: whenever a process wants to update data item x, it locates the primary copy of x, and
moves it to its own location.



Example:
Primary-based local-write protocol in which a single copy is migrated between processes
(prior to the read/write).

Local-Write Issues
o Question to be answered by any process about to read from or write to the data item is:
“Where is the data item right now?”
o Processes can spend more time actually locating a data item than using it!
Primary-backup protocol in which the primary migrates to the process wanting to perform an
update.

Advantage:
o Multiple, successive write operations can be carried out locally, while reading processes
can still access their local copy.
o Can be achieved only if a nonblocking protocol is followed by which updates are
propagated to the replicas after the primary has finished with locally performing the
updates.
Replicated-Write Protocols
o AKA Distributed-Write Protocols
o Writes can be carried out at any replica.
o There are two types:
1.Active Replication.
2.Majority Voting (Quorums).



Active Replication
o A special process carries out the update operations at each replica.
o Lamport’s timestamps can be used to achieve total ordering, but this does not scale well
within Distributed Systems.
o An alternative/variation is to use a sequencer, which is a process that assigns a unique
ID# to each update, which is then propagated to all replicas.

Issue: replicated invocations

The problem of replicated invocations –


o ‘B’ is a replicated object (which itself calls ‘C’).
o When ‘A’ calls ‘B’, how do we ensure ‘C’ isn’t invoked three times?

a)Using a coordinator for ‘B’, which is responsible for forwarding an invocation request from
the replicated object to ‘C’.

b)Returning results from ‘C’ using the same idea: a coordinator is responsible for returning
the result to all ‘B’s. Note the single result returned to ‘A’.



Quorum-Based Protocols
o Clients must request and acquire permissions from multiple replicas before either
reading/writing a replicated data item.
o Methods devised by Thomas (1979) and generalized by Gifford (1979).

Example:
o A file is replicated within a distributed file system.
o To update a file, a process must get approval from a majority of the replicas to perform a
write.
o The replicas need to agree to also perform the write.
o After the update, the file has a new version # associated with it (and it is set at all the
updated replicas).
o To read, a process contacts a majority of the replicas and asks for the version # of the file.
o If all the version #s agree, that must be the most recent version, and the read can
proceed.

Gifford's method
o To read a file of which N replicas exist a client needs to assemble a read quorum, an
arbitrary collection of any NR servers, or more.
o To modify a file, a write quorum of at least NW servers is required.
o The values of NR and NW are subject to the following two constraints:
NR + NW > N
NW > N/2
o First constraint prevents read-write conflicts
o Second constraint prevents write-write conflicts.
Only after the appropriate number of servers has agreed to participate can a file be read or
written.
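Gifford's constraints, together with a quorum read that picks the copy with the highest version number, can be sketched as follows. The replica representation (a dict mapping server id to a (version, value) pair) is an assumption for illustration:

```python
def valid_quorum(n, n_read, n_write):
    """Check Gifford's two constraints: NR + NW > N rules out read-write
    conflicts, and NW > N/2 rules out write-write conflicts."""
    return (n_read + n_write > n) and (n_write > n / 2)

def quorum_read(replicas, read_quorum):
    """Contact a read quorum and return the value carried by the highest
    version number; the quorum intersects every write quorum, so the
    newest version is guaranteed to be present."""
    version, value = max(replicas[i] for i in read_quorum)
    return value

# N = 12 as in the example: NR = 3, NW = 10 satisfies both constraints,
# while NR = NW = 6 allows two disjoint write sets (a write-write conflict).
print(valid_quorum(12, 3, 10))
print(valid_quorum(12, 6, 6))
```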

Example:
o NR = 3 and NW = 10
o Most recent write quorum consisted of the 10 servers C through L.
o All get the new version and the new version number.
o Any subsequent read quorum of three servers will have to contain at least one member
of this set.
o When the client looks at the version numbers, it will know which is most recent and take
that one.

Three examples of the voting algorithm:


(a) A correct choice of read and write set.
(b) A choice that may lead to write-write conflicts.
(c) A correct choice, known as ROWA (read one, write all).



(b) a write-write conflict may occur because NW ≤ N/2.
o If one client chooses {A,B,C,E,F,G} as its write set and another client chooses
{D,H,I,J,K,L} as its write set, then the two updates will both be accepted without
detecting that they actually conflict.

(c) NR = 1, making it possible to read a replicated file by finding any copy and using it.
o Write performance is poor, because every write must update all copies (hence the
name Read-One, Write-All (ROWA)).

Cache-Coherence Protocols
These are a special case, as the cache is typically controlled by the client not the server.
Coherence Detection Strategy:
o When are inconsistencies actually detected?
o Statically at compile time: extra instructions inserted.
o Dynamically at runtime: code to check with the server.

Coherence Enforcement Strategy


o How are caches kept consistent?
o Server-sent invalidation messages.
o Update propagation techniques.
o Combinations are possible.



What about Writes to the Cache?
Read-only Cache: updates are performed by the server (i.e., pushed) or by the client (i.e.,
pulled whenever the client notices that the cache is stale).
Write-Through Cache: the client modifies the cache, then sends the updates to the
server.
Write-Back Cache: delay the propagation of updates, allowing multiple updates to be
made locally, then sends the most recent to the server (this can have a
dramatic positive impact on performance).
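A write-back cache can be sketched as follows; the class and its methods are illustrative assumptions, with a plain dict standing in for the server's store. Note how two local writes to the same key cost only one message to the server:

```python
class WriteBackCache:
    """Client-side cache that delays update propagation: writes stay
    local, and only the most recent value per key is flushed."""
    def __init__(self, server_store):
        self.server = server_store   # dict standing in for the server
        self.cache = {}              # local copies of data items
        self.dirty = set()           # keys modified since the last flush

    def write(self, key, value):
        self.cache[key] = value      # update locally only
        self.dirty.add(key)

    def read(self, key):
        if key not in self.cache:    # cache miss: fetch from the server
            self.cache[key] = self.server[key]
        return self.cache[key]

    def flush(self):
        for key in self.dirty:       # propagate only the latest values
            self.server[key] = self.cache[key]
        self.dirty.clear()

server = {"a": 1}
cache = WriteBackCache(server)
cache.write("a", 2)
cache.write("a", 3)                  # overwrites locally; server still stale
cache.flush()                        # one message carries the final value
```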



UNIT V FAULT TOLERANCE AND SECURITY 9
Introduction to Fault Tolerance – Process Resilience – Reliable Communications –
Distributed Commit – Recovery – Introduction to Security – Secure Channels –
Access Control – Secure Naming - Security Management.
5.1 Introduction to Fault Tolerance
Fault Tolerance
Dealing successfully with partial failure within a Distributed System.
Key technique: Redundancy.
Basic Concepts
Fault Tolerance is closely related to the notion of “Dependability”
In Distributed Systems, this is characterized under a number of headings:
• Availability – the system is ready to be used immediately.
• Reliability – the system can run continuously without failure.
• Safety – if a system fails, nothing catastrophic will happen.
• Maintainability – when a system fails, it can be repaired easily and quickly (and,
sometimes, without its users noticing the failure).
What Is “Failure”?
Definition: A system is said to “fail” when it cannot meet its promises.
• A failure is brought about by the existence of “errors” in the system.
• The cause of an error is a “fault”.
• Distinction between preventing, removing, and forecasting faults
• Fault tolerance - meaning that a system can provide its services even in the presence of
faults.
▪ The system can tolerate faults and continue to operate normally.
Types of Faults
• Transient Fault – appears once, then disappears.
• Intermittent Fault – occurs, vanishes, reappears; but: follows no real pattern (worst
kind).
• Permanent Fault – once it occurs, only the replacement/repair of a faulty component
will allow the DS to function normally.
Failure Models
Different types of failures:

Type of failure              Description
Crash failure                A server halts, but was working correctly until it halted
Omission failure             A server fails to respond to incoming requests
  - Receive omission         A server fails to receive incoming messages
  - Send omission            A server fails to send messages
Timing failure               A server's response lies outside the specified time interval
Response failure             A server's response is incorrect
  - Value failure            The value of the response is wrong
  - State-transition failure The server deviates from the correct flow of control
Arbitrary failure            A server may produce arbitrary responses at arbitrary times

Failure Masking by Redundancy


Strategy: hide the occurrence of failure from other processes using redundancy.



Three main types:
• Information Redundancy – add extra bits to allow for error detection/recovery (e.g.,
Hamming codes and the like).
• Time Redundancy – perform operation and, if needs be, perform it again.
▪ Think about how transactions work (BEGIN/END/COMMIT/ABORT).
• Physical Redundancy – add extra (duplicate) hardware and/or software to the system.

Distributed Systems Fault Tolerance Topics


1. Process Resilience
2. Reliable Client/Server Communications
3. Reliable Group Communication
4. Distributed COMMIT
5. Recovery Strategies
Process Resilience
• Processes can be made fault tolerant by arranging to have a group of processes, with
each member of the group being identical.
• A message sent to the group is delivered to all of the “copies” of the process (the
group members), and then only one of them performs the required service.
• If one of the processes fails, it is assumed that one of the others will still be able to
function (and service any pending request or operation).
Flat Groups versus Hierarchical Groups
(a) Communication in a flat group. (b) Communication in a simple hierarchical group.

Communication in a flat group – all the processes are equal, decisions are made
collectively.
• Note: no single point-of-failure, however: decision making is complicated as
consensus is required.
Communication in a simple hierarchical group - one of the processes is elected to be the
coordinator, which selects another process (a worker) to perform the operation.
• Note: single point-of failure, however: decisions are easily and quickly made by the
coordinator without first having to get consensus.

Failure Masking and Replication


By organizing a fault tolerant group of processes , we can protect a single vulnerable process.
Two approaches to arranging the replication of the group:



Primary (backup) Protocols
• A group of processes is organized in a hierarchical fashion in which a primary
coordinates all write operations.
• When the primary crashes, the backups execute some election algorithm to choose a
new primary.
Replicated-Write Protocols
• Replicated-write protocols are used in the form of active replication, as well as by
means of quorum-based protocols.
• Solutions correspond to organizing a collection of identical processes into a flat group.
• Adv. - these groups have no single point of failure, at the cost of distributed
coordination.
Agreement in Faulty Systems
Goal of distributed agreement algorithms - have all the non-faulty processes reach consensus
on some issue, and to establish that consensus within a finite number of steps.
Complications:
• Different assumptions about the underlying system require different solutions,
assuming solutions even exist.
• distinguish the following cases:
1. Synchronous versus asynchronous systems.
• A system is synchronous if and only if the processes are known to operate in a lock-
step mode.
• Formally, this means that there should be some constant c >= 1, such that if any
processor has taken c + 1 steps, every other process has taken at least 1 step.
• A system that is not synchronous is said to be asynchronous.
2. Communication delay is bounded or not.
• Delay is bounded if and only if we know that every message is delivered with a
globally and predetermined maximum time.
3. Message delivery is ordered or not.
• In other words, we distinguish the situation where messages from the same sender are
delivered in the order that they were sent, from the situation in which we do not have
such guarantees.
4. Message transmission is done through unicasting or multicasting.

Circumstances under which distributed agreement can be reached.

• In all other cases, it can be shown that no solution exists.


Note - most distributed systems in practice assume that processes behave asynchronously,
message transmission is unicast, and communication delays are unbounded.
• Known as the Byzantine agreement problem



History Lesson: The Byzantine Empire
• Time: 330-1453 AD.
• Place: Balkans and Modern Turkey.
• Endless conspiracies, intrigue, and untruthfulness were alleged to be common practice
in the ruling circles of the day (sounds strangely familiar … ).
• That is: it was typical for intentionally wrong and malicious activity to occur among
the ruling group. Similar behavior can surface in a DS, and is known as ‘byzantine
failure’.
• Question: how do we deal with such malicious group members within a distributed
system?
How does a process group deal with a faulty member?
The “Byzantine Generals Problem” for 3 loyal generals and 1 traitor:
1. The generals announce their troop strengths (in units of 1 kilosoldier) to the other
members of the group by sending a message.
2. Each general assembles a vector from the announcements of step 1 (each general
knows its own strength) and sends its vector to all the other generals.
3. Each general examines the vectors received in step 2. In each ‘column’, the majority
value is assumed to be correct, and it becomes clear to all that General 3 is the traitor.



Goal of Byzantine agreement is that consensus is reached on the value for the non-faulty
processes only.

Solution in computer terms:


• Assume that processes are synchronous, messages are unicast while preserving
ordering, and communication delay is bounded.
• Assume N processes, where each process i will provide a value vi to the others.
• Goal - let each process construct a vector V of length N, such that if process i is
non-faulty, V[i] = vi; otherwise V[i] is undefined.
• We assume that there are at most k faulty processes.
• Algorithm for the case of N = 4 and k = 1.

Algorithm operates in four steps.


1. Every non-faulty process i sends vi to every other process using reliable unicasting.
• Faulty processes may send anything and different values to different processes.
• Let vi = i. In (Fig. a), process 1 reports 1, process 2 reports 2, process 3 lies to
everyone (giving x, y, and z, respectively), and process 4 reports a value of 4.
2. The results of the announcements of step 1 are collected together in the form of the
vectors (Fig.b).
3. Every process passes its vector from (Fig.b) to every other process.
• Every process gets three vectors, one from every other process.
• Process 3 lies, inventing 12 new values, a through l.
• Results in (Fig.c).
4. Each process examines the ith element of each of the newly received vectors.
• If any value has a majority, that value is put into the result vector.
• If no value has a majority, the corresponding element of the result vector is marked
UNKNOWN.
• From (Fig. c) - processes 1, 2, and 4 all come to agreement on the values for v1, v2,
and v4, which is the correct result.
• What these processes conclude regarding v3 cannot be decided, but is also irrelevant.
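The four steps can be sketched for N = 4 and k = 1. The way the traitor's lies are supplied (one value per recipient in step 1, one whole invented vector per recipient in step 3) is an illustrative assumption; the process indices and names are made up for the example:

```python
from collections import Counter

def byzantine_round(values, traitor, step1_lies, step3_lies):
    """values[i]: process i's true value; traitor: index of the faulty one.
    step1_lies[j]: value the traitor announces to process j in step 1.
    step3_lies[j]: the vector the traitor forwards to process j in step 3.
    Returns each correct process's agreed result vector."""
    n = len(values)
    # Steps 1-2: each correct process assembles a first-hand vector.
    firsthand = {j: [step1_lies[j] if i == traitor else values[i]
                     for i in range(n)]
                 for j in range(n) if j != traitor}
    results = {}
    for j in firsthand:
        # Step 3: receive one vector from every other process
        # (the traitor invents fresh values per recipient).
        received = [step3_lies[j] if i == traitor else firsthand[i]
                    for i in range(n) if i != j]
        # Step 4: per-slot majority vote, else UNKNOWN.
        agreed = []
        for slot in range(n):
            val, cnt = Counter(v[slot] for v in received).most_common(1)[0]
            agreed.append(val if cnt > len(received) // 2 else "UNKNOWN")
        results[j] = agreed
    return results

# Process 3 (index 2) is the traitor, as in the figure.
result = byzantine_round(
    [1, 2, 0, 4], traitor=2,
    step1_lies={0: "x", 1: "y", 3: "z"},
    step3_lies={0: ["a", "b", "c", "d"],
                1: ["e", "f", "g", "h"],
                3: ["i", "j", "k", "l"]})
```

Every correct process ends up with the vector [1, 2, UNKNOWN, 4]: agreement on v1, v2, and v4, with the traitor's slot undecidable, just as in the worked example.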

Example Again:
With 2 loyal generals and 1 traitor.
Note: It is no longer possible to determine the majority value in each column, and the
algorithm has failed to produce agreement.
• Lamport et al. (1982) proved that in a system with k faulty processes, agreement can
be achieved only if 2k + 1 correctly functioning processes are present, for a total of 3k
+ 1.
• Agreement is possible only if more than two-thirds of the processes are working
properly.



Two correct processes and one faulty process.

Reliable Client-Server Communication


Kinds of Failures:
• Crash (system halts);
• Omission (incoming request ignored);
• Timing (responding too soon or too late);
• Response (getting the order wrong);
• Arbitrary/Byzantine (indeterminate, unpredictable).

Detecting process failures:


Processes actively send "are you alive?" messages to each other (for which they obviously
expect an answer)
• Makes sense only when it can be guaranteed that there is enough communication
between processes.
Processes passively wait until messages come in from different processes.
• In practice, actively pinging processes is usually followed.

Example: RPC Semantics and Failures


Remote Procedure Call (RPC) mechanism works well as long as both the client and server
function perfectly!!!

Five classes of RPC failure can be identified:


1. The client cannot locate the server, so no request can be sent.
2. The client’s request to the server is lost, so no response is returned by the server to the
waiting client.
3. The server crashes after receiving the request; the request may be left undone, and
no acknowledgment or reply reaches the client.
4. The server’s reply is lost on its way to the client, the service has completed, but the
results never arrive at the client
5. The client crashes after sending its request, and the server sends a reply to a newly-
restarted client that may not be expecting it.

A server in client-server communication.


(a). A request arrives, is carried out, and a reply is sent.
(b). A request arrives and is carried out, just as before, but the server crashes before it can
send the reply.
(c). Again a request arrives, but this time the server crashes before it can even be carried out.
And, no reply is sent back.



Server crashes are dealt with by implementing one of three possible implementation
philosophies:
• At least once semantics: a guarantee is given that the RPC occurred at least once, but
(also) possibly more than once.
• At most once semantics: a guarantee is given that the RPC occurred at most once, but
possibly not at all.
• No semantics: nothing is guaranteed, and client and servers take their chances!

• It has proved difficult to provide exactly-once semantics.

Lost replies are difficult to deal with.


• Why was there no reply?
• Is the server dead, slow, or did the reply just go missing?

A request that can be repeated any number of times without any nasty side-effects is said to be
idempotent.
• (For example: a read of a static web-page is said to be idempotent).

Nonidempotent requests (for example, the electronic transfer of funds) are a little harder to
deal with.
• A common solution is to employ unique sequence numbers.
• Another technique is the inclusion of additional bits in a retransmission to identify it as
such to the server.
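The sequence-number solution for nonidempotent requests can be sketched as follows; the class and method names are illustrative. Replaying the cached reply gives at-most-once execution of the transfer itself, even when the client retransmits:

```python
class TransferServer:
    """Server that uses unique sequence numbers to detect retransmissions
    of a nonidempotent request (an electronic transfer of funds)."""
    def __init__(self, balance=100):
        self.balance = balance
        self.completed = {}            # seq number -> cached reply

    def transfer(self, seq, amount):
        if seq in self.completed:      # duplicate: replay the old reply,
            return self.completed[seq] # do NOT debit the account again
        self.balance -= amount         # execute the operation once
        reply = ("ok", self.balance)
        self.completed[seq] = reply
        return reply

server = TransferServer()
first = server.transfer(seq=7, amount=30)   # original request
retry = server.transfer(seq=7, amount=30)   # client's retransmission
```

Without the sequence number, the retransmitted request would debit the account a second time; with it, the retry is recognized and only the stored reply is resent.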

Client Crashes
When a client crashes after sending a request, the computation the server is still
performing on its behalf (and the ‘old’ reply it eventually produces) is left without an
owner; such a computation is known as an orphan.

Four orphan solutions have been proposed:


1. extermination (the orphan is simply killed-off),
2. reincarnation (each client session has an epoch associated with it, making orphans
easy to spot),
3. gentle reincarnation (when a new epoch is identified, an attempt is made to locate a
requests owner, otherwise the orphan is killed),
4. expiration (if the RPC cannot be completed within a standard amount of time, it is
assumed to have expired).

In practice, however, none of these methods are desirable for dealing with orphans.
Orphan elimination is discussed in more detail by Panzieri and Shrivastava (1988).



Reliable Group Communication
Reliable multicast services guarantee that all messages are delivered to all members of a
process group.
• Sounds simple, but is surprisingly tricky (as multicasting services tend to
be inherently unreliable).

Small group: multiple, reliable point-to-point channels will do the job, however, such
a solution scales poorly as the group membership grows.
• What happens if a process joins the group during communication?
• Worse: what happens if the sender of the multiple, reliable point-to-point
channels crashes half way through sending the messages?

Basic Reliable-Multicasting Schemes


Simple solution to reliable multicasting when all receivers are known and are assumed not to
fail.
• The sending process assigns a sequence number to outgoing messages (making it easy
to spot when a message is missing).
• Assume that messages are received in the order they are sent.
• Each multicast message is stored locally in a history buffer at the sender.
• Assuming the receivers are known to the sender, the sender simply keeps the message
in its history buffer until each receiver has returned an acknowledgment.
• If a receiver detects it is missing a message, it may return a negative acknowledgment,
requesting the sender for a retransmission.
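The history-buffer scheme above can be sketched as follows. Message loss is simulated with a `lose_for` parameter, and the classes and method names are illustrative assumptions:

```python
class Sender:
    """Assigns sequence numbers and keeps every multicast message in a
    history buffer so it can serve retransmission requests (NACKs)."""
    def __init__(self):
        self.next_seq = 0
        self.history = {}                  # seq -> message

    def multicast(self, msg, receivers, lose_for=()):
        seq = self.next_seq
        self.next_seq += 1
        self.history[seq] = msg            # keep until everyone has it
        for r in receivers:
            if r not in lose_for:          # simulate loss of the message
                r.deliver(self, seq, msg)

    def retransmit(self, receiver, seq):
        receiver.deliver(self, seq, self.history[seq])

class Receiver:
    """Delivers messages in sequence order; a gap in the sequence numbers
    triggers a negative acknowledgment (retransmission request)."""
    def __init__(self):
        self.expected = 0
        self.delivered = []

    def deliver(self, sender, seq, msg):
        if seq > self.expected:            # gap detected: NACK missing ones
            for missing in range(self.expected, seq):
                sender.retransmit(self, missing)
        if seq == self.expected:
            self.delivered.append(msg)
            self.expected += 1

s = Sender()
r1, r2 = Receiver(), Receiver()
s.multicast("m0", [r1, r2], lose_for=(r2,))   # m0 is lost on the way to r2
s.multicast("m1", [r1, r2])                   # r2 sees the gap and recovers m0
```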

a)Message transmission – note that the third receiver is expecting 24.


b)Reporting feedback – the third receiver informs the sender.

• But, how long does the sender keep its history-buffer populated?
• Also, such schemes perform poorly as the group grows … there are too many ACKs.



Scalability in Reliable Multicasting
• Receivers never acknowledge successful delivery.
• Only missing messages are reported.
• Negative acknowledgements (NACKs) are multicast to all group members.
• This allows other members to suppress their own feedback, if necessary.
• To avoid “retransmission clashes”, each member is required to wait a random delay
prior to NACKing.
• But no hard guarantees can be given that feedback implosions will never happen.

Nonhierarchical Feedback Control


Feedback Suppression – reducing the number of feedback messages to the sender (as
implemented in the Scalable Reliable Multicasting Protocol).

• Successful delivery is never acknowledged, only missing messages are reported


(NACK), which are multicast to all group members.
• If another process is about to NACK, this feedback is suppressed as a result of the first
multicast NACK.
• In this way, only a single NACK is delivered to the sender.

Several receivers have scheduled a request for retransmission, but the first
retransmission request leads to the suppression of others.

Hierarchical Feedback Control


Hierarchical reliable multicasting - the main characteristic is that it supports the creation
of very large groups.
a)Sub-groups within the entire group are created, with each local
coordinator forwarding messages to its children.
b)A local coordinator handles retransmission requests locally, using any appropriate
multicasting method for small groups.



Main problem : construction of the tree.

Conclusion:
• Building reliable multicast schemes that can scale to a large number of receivers
spread across a wide-area network, is a difficult problem.
• No single best solution exists, and each solution introduces new problems.

Atomic Multicast
Atomic multicast problem:
• A requirement where the system needs to ensure that all processes get the message, or
that none of them get it.
• An additional requirement is that all messages arrive at all processes in sequential
order.

• Atomic multicasting ensures that nonfaulty processes maintain a consistent view of the
database, and forces reconciliation when a replica recovers and rejoins the group.

Virtual Synchrony
The concept of virtual synchrony as the abstraction that group communication protocols
should attempt to build on top of an asynchronous system.

Virtual synchrony is defined as follows:


1. All recipients have identical group views when a message is delivered. (The group
view of a recipient defines the set of "correct" processors from the perspective of that
recipient.)
2. The destination list of the message consists precisely of the members in that view.
3. The message should be delivered either to all members in its destination list or to no
one at all. The latter case can occur only if the sender fails during transmission.
4. Messages should be delivered in causal or total order (depending on application
semantics).



Reliable multicast with the above properties is said to be virtually synchronous.
• Whole idea of atomic multicasting is that a multicast message m is uniquely associated
with a list of processes to which it should be delivered.
• Delivery list corresponds to a group view, namely, the view on the set of processes
contained in the group, which the sender had at the time message m was
multicast. (Virtual synchrony #2)
• Each process on that list has the same view. In other words, they should all agree that
m should be delivered to each one of them and to no other process. (Virtual synchrony
#1)
• Need to guarantee that m is either delivered to all processes in the list in order or m is
not delivered at all. (Virtual synchrony #3 & #4)

Message Ordering
Four different orderings:
1. Unordered multicasts
• virtually synchronous multicast in which no guarantees are given concerning the order
in which received messages are delivered by different processes
2. FIFO-ordered multicasts
• the communication layer is forced to deliver incoming messages from the same
process in the same order as they have been sent
3. Causally-ordered multicasts
• delivers messages so that potential causality between different messages is preserved
4. Totally-ordered multicasts
• regardless of whether message delivery is unordered, FIFO ordered, or causally
ordered, it is required additionally that when messages are delivered, they are
delivered in the same order to all group members.
• Virtually synchronous reliable multicasting offering totally-ordered delivery of
messages is called atomic multicasting.
• With the three different message ordering constraints discussed above, this leads to six
forms of reliable multicasting (Hadzilacos and Toueg, 1993).

Six different versions of virtually synchronous reliable multicasting.
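As an illustration of FIFO-ordered delivery, the communication layer can be sketched as a hold-back queue keyed by sender (a minimal, hypothetical implementation; real group-communication layers also recover missing messages via retransmission):

```python
class FifoReceiver:
    """Hold-back queue that delivers messages from each sender in the
    order they were sent (FIFO-ordered multicast), regardless of the
    order in which they arrive at this receiver."""
    def __init__(self):
        self.next_seq = {}      # sender -> next expected sequence number
        self.held = {}          # sender -> {seq: message} held back
        self.delivered = []     # messages passed up to the application

    def receive(self, sender, seq, msg):
        self.held.setdefault(sender, {})[seq] = msg
        expected = self.next_seq.setdefault(sender, 0)
        # Deliver every consecutive message we are now able to deliver
        while expected in self.held[sender]:
            self.delivered.append(self.held[sender].pop(expected))
            expected += 1
        self.next_seq[sender] = expected

r = FifoReceiver()
r.receive("p1", 1, "b")   # arrives out of order: held back
r.receive("p1", 0, "a")   # fills the gap: both delivered
r.receive("p1", 2, "c")
assert r.delivered == ["a", "b", "c"]
```

Messages from *different* senders are still delivered in arbitrary relative order; enforcing an order across senders is what causal and total ordering add.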

Distributed Commit
General Goal: We want an operation to be performed by all group members or none at all.
• [In the case of atomic multicasting, the operation is the delivery of the message.]
• There are three types of “commit protocol”: single-phase, two-phase and three-phase
commit.



One-Phase Commit Protocol:
• An elected co-ordinator tells all the other processes to perform the operation in
question.
• But, what if a process cannot perform the operation?
• There’s no way to tell the coordinator!
• The solutions: Two-Phase and Three-Phase Commit Protocols

Two-Phase Commit Protocol:


• First developed in 1978 (Gray, 1978).
• Summarized: GET READY, OK, GO AHEAD.
1. The coordinator sends a VOTE_REQUEST message to all group members.
2. A group member returns VOTE_COMMIT if it can commit locally,
otherwise VOTE_ABORT.
3. All votes are collected by the coordinator.
• A GLOBAL_COMMIT is sent if all the group members voted to commit.
• If one group member voted to abort, a GLOBAL_ABORT is sent.
4. Group members then COMMIT or ABORT based on the last message received from
the coordinator.

First phase - voting phase - steps 1 and 2.


Second phase - decision phase steps 3 and 4.

(a) The finite state machine for the coordinator in 2PC.


(b) The finite state machine for a participant.
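The voting and decision phases can be sketched from the coordinator's point of view (a simplified model with no timeouts or crashes; the function name is illustrative, and the vote/decision names follow the text):

```python
def two_phase_commit(votes):
    """Coordinator side of 2PC. `votes` maps each participant to the
    vote it returned after receiving VOTE_REQUEST. Returns the final
    outcome applied by every participant."""
    # Phase 1 (voting): VOTE_REQUEST was sent; votes are collected.
    if all(v == "VOTE_COMMIT" for v in votes.values()):
        decision = "GLOBAL_COMMIT"      # unanimous commit votes
    else:
        decision = "GLOBAL_ABORT"       # a single abort vote suffices
    # Phase 2 (decision): the same decision is sent to everyone.
    return {participant: decision for participant in votes}

assert two_phase_commit({"a": "VOTE_COMMIT", "b": "VOTE_COMMIT"}) == \
       {"a": "GLOBAL_COMMIT", "b": "GLOBAL_COMMIT"}
assert set(two_phase_commit({"a": "VOTE_COMMIT",
                             "b": "VOTE_ABORT"}).values()) == {"GLOBAL_ABORT"}
```

The blocking problem discussed next arises precisely in the gap this sketch omits: a participant that has voted VOTE_COMMIT cannot decide on its own if the coordinator never delivers phase 2.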

Big Problem with Two-Phase Commit


• It can lead to both the coordinator and the group members blocking, which may lead
to the dreaded deadlock.
• If the coordinator crashes, the group members may not be able to reach a final
decision, and they may, therefore, block until the coordinator recovers …
• Two-Phase Commit is known as a blocking-commit protocol for this reason.
• The solution? The Three-Phase Commit Protocol

Three-Phase Commit Protocol:


Essence: the states of the coordinator and each participant satisfy the following two
conditions:
1. There is no single state from which it is possible to make a transition directly to either
a COMMIT or an ABORT state.
2. There is no state in which it is not possible to make a final decision, and from which a
transition to a COMMIT state can be made.



(a) The finite state machine for the coordinator in 3PC.
(b) The finite state machine for a participant.

Recovery
• Once a failure has occurred, it is essential that the process where the failure
happened recovers to a correct state.
• Recovery from an error is fundamental to fault tolerance.
• Two main forms of recovery:
1. Backward Recovery: return the system to some previous correct state
(using checkpoints), then continue executing.
2. Forward Recovery: bring the system into a correct state, from which it can then
continue to execute.

Forward and Backward Recovery


Backward Recovery:
Advantages
• Generally applicable independent of any specific system or process.
• It can be integrated into (the middleware layer) of a distributed system as a general-
purpose service.
Disadvantages:
• Checkpointing (can be very expensive (especially when errors are very rare).
[Despite the cost, backward recovery is implemented more often. The “logging” of
information can be thought of as a type of checkpointing.].
• Recovery mechanisms are independent of the distributed application for which they
are actually used – thus no guarantees can be given that once recovery has taken place,
the same or similar failure will not happen again.

Disadvantage of Forward Recovery:


• In order to work, all potential errors need to be accounted for up-front.
• When an error occurs, the recovery mechanism then knows what to do to bring the
system forward to a correct state.

Example
Consider as an example: Reliable Communications.
Retransmission of a lost/damaged packet - backward recovery technique.



Erasure Correction - When a lost/damaged packet can be reconstructed as a result of the
receipt of other successfully delivered packets - forward recovery technique.
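Erasure correction can be illustrated with the simplest possible scheme, an XOR parity block over equal-length packets (a toy sketch; real protocols use more powerful codes such as Reed-Solomon):

```python
from functools import reduce

def parity(packets):
    """XOR parity over equal-length packets. Losing any single packet
    can be repaired from the surviving packets plus the parity block,
    without contacting the sender -- a forward-recovery technique."""
    return bytes(reduce(lambda a, b: a ^ b, column)
                 for column in zip(*packets))

data = [b"abcd", b"wxyz", b"1234"]
p = parity(data)                      # transmitted alongside the data

# Packet 1 is lost in transit; reconstruct it from the survivors:
recovered = parity([data[0], data[2], p])
assert recovered == data[1]
```

This works because XOR-ing all packets together with the parity block cancels every packet except the missing one.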

Recovery-Oriented Computing
Recovery-oriented computing - start over again.
• Underlying principle - it may be much cheaper to optimize for recovery than to aim
for systems that are free from failures for a long time.

Different flavors:
• Simply reboot (part of a system), e.g., restart Internet servers.
• Rebooting only a part of the system requires that the fault be properly localized.
• This means deleting all instances of the identified components, along with the threads
operating on them, and (often) just restarting the associated requests.
• Apply checkpointing and recovery techniques, but continue execution in a changed
environment.
• Basic idea - many failures can simply be avoided if programs are given extra buffer
space, memory is zeroed before being allocated, the ordering of message delivery is
changed (as long as this does not affect semantics), and so on.



5.6 Introduction to Security
Strategies for securing Distributed Systems
Generally very similar to techniques used in a non-distributed system, only much
more difficult to implement …
Difficult to get right, impossible to get perfect!
Security Topics
1. Providing a secure communications channel – authentication, confidentiality and
integrity.
2. Handling authorization – who is entitled to use what in the system?
3. Providing effective Security Management.
4. Example systems: SESAME and e-payment systems.
Types of Threats
• Interception – unauthorized access to data.
• Interruption – a service becomes unavailable.
• Modification – unauthorized changes to, and tampering of, data.
• Fabrication – additional data or activity is generated that would normally not exist.

Security Mechanisms
• Encryption – fundamental technique: used to implement confidentiality and integrity.
• Authentication – verifying identities.
• Authorization – verifying allowable operations.
• Auditing – who did what to what and when/how did they do it?

Key Point
Matching security mechanisms to threats is only possible when a Policy on security and
security issues exists.

Example: The Globus Security Architecture


• Globus is a system supporting large-scale distributed computations in which many
hosts, files, and other resources are simultaneously used for doing a computation.
• Such systems are also referred to as computational grids.
• Resources in these grids are often located in different administrative domains that may
be located in different parts of the world.

Globus Security Policy:


1. The environment consists of multiple administrative domains.
2. Local operations (i.e., operations that are carried out only within a single domain) are
subject to a local domain security policy only.
3. Global operations (i.e., operations involving several domains) require the initiator to
be known in each domain where the operation is carried out.
4. Operations between entities in different domains require mutual authentication.
5. Global authentication replaces local authentication.
6. Controlling access to resources is subject to local security only.
7. Users can delegate rights to processes.
8. A group of processes in the same domain can share credentials.

The Globus security architecture:


o consists of entities such as: users, user proxies, resource proxies, and general
processes.
Entities are located in domains and interact with each other.



Security architecture defines four different protocols
The Globus security architecture. (See diagram)

Design Issues
Three design issues when considering security:
1. Focus of Control.
2. Layering of Security Mechanisms.
3. Simplicity.

Design Issue: Focus of Control


Three approaches for protection against security threats.
(a) Protection against invalid operations
(b) Protection against unauthorized invocations.
(c) Protection against unauthorized users.

Design Issue: Layering of Security Mechanisms


Decision required as to where the security mechanism is to be placed.



Trust and Security
A system is either secure or it is not.
Whether a client considers a system to be secure is a matter of trust.
Layer in which security mechanisms are placed depends on the trust a client has in
how secure the services are in any particular layer.

Example
Several sites connected through a wide-area backbone service.

Switched Multi-megabit Data Service (SMDS)

If the encryption mechanisms cannot be trusted, additional mechanisms can be
employed by clients to provide the level of trust required, e.g., SSL (Secure
Sockets Layer), which can be used to securely send messages across a TCP
connection.

Distribution of Security Mechanisms


The Trusted Computing Base (TCB) is the set of services/mechanisms within a distributed
system required to support a security policy.

Reduced Interfaces for Secure System Components (RISSC)


Prevents clients and their applications from direct access to critical services.
Any security-critical server is placed on a separate machine isolated from end-user
systems using low-level secure network interfaces.
Clients and their applications run on different machines and can access the secured
server only through these network interfaces.



The principle of RISSC as applied to secure distributed systems.

Design Issue: Simplicity


Designing a secure computer system is difficult, regardless of whether it is also a DS.
Using a few simple mechanisms that are easily understood and trusted to work is the
ideal situation.
o Real world - introducing security mechanisms to an already complex system can
make matters worse.
o However, this is still a design goal to aim for!

Security Mechanisms
Fundamental technique within any distributed systems security environment: Cryptography.
A sender S wanting to transmit message m to a receiver R.
The sender encrypts its message into an unintelligible message m', and sends m' to R.
R must decrypt the received message into its original form m.

Three kinds of Intruders and eavesdroppers in communication.

Original form of the message that is sent is called the plaintext, P


Encrypted form is referred to as the ciphertext, illustrated as C.



Notation to relate plaintext, ciphertext, and keys.
C = EK(P) denotes that the ciphertext C is obtained by encrypting the plaintext P using key K.
P = DK(C) is used to express the decryption of the ciphertext C using key K, resulting in
the plaintext P.

Types of Cryptosystems:

Symmetric: often referred to as conventional cryptography, defined as:
        P = DK(EK(P))
o Symmetric cryptosystems are also referred to as secret-key or shared-key systems:
  1. because the sender and receiver are required to share the same key;
  2. this shared key must be kept secret; no one else is allowed to see the key;
  3. the notation KA,B is used to denote a key shared by A and B.
Asymmetric: often referred to as public-key cryptography, defined as:
        P = DKD(EKE(P))
  The keys for encryption and decryption are different, but together form a unique pair.
  There is a separate key KE for encryption and one for decryption, KD.
• One of the keys in an asymmetric cryptosystem is kept private; the other is made public.
• The notation KA+ is used to denote a public key belonging to A, and KA− its
corresponding private key.

Notation used in cryptography

Introducing Alice, Bob & Co.


Alice and Bob are the good guys.
Chuck and Eve are usually the bad guys.

Example 1: Bob and Alice


• Alice wants to send a confidential message to Bob
• She should use Bob's public key to encrypt the message.
• Because Bob is the only one holding the private decryption key, he is also the only
person that can decrypt the message.



Example 2: Bob and Alice Redux
• Bob wants to know for sure that the message he just received actually came
from Alice.
• Alice keeps her encryption key private to encrypt the messages she sends.
• If Bob can successfully decrypt a message using Alice's public key (and the plaintext
in the message has enough information to make it meaningful to Bob), he knows that
message must have come from Alice, because the decryption key is uniquely tied to
the encryption key.

Hash (One-Way) Functions


A hash function H takes a message m of arbitrary length as input and produces a bit
string h having a fixed length as output:
h = H( m )
Given H and m, h is easy to compute.
However, given H and h, it is computationally infeasible to compute m.
That is, the function only works One-Way.
Weak Collision Resistance: given m and h = H( m ), it is hard to find another message,
say m2, such that m and m2 produce the same h.
Strong Collision Resistance: given H, it is infeasible to find an m and an m2 such
that H( m ) and H( m2 ) produce the same value.

Similarly:
• For any encryption function E, it should be computationally infeasible to find the
key K when given the plaintext P and associated ciphertext C = EK(P).
• Analogous to collision resistance, when given a plaintext P and a key K, it should be
effectively impossible to find another key K' such that EK(P) = EK'(P).
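The one-way property can be demonstrated with a standard-library hash (SHA-256 is used here rather than MD5, which is no longer collision resistant):

```python
import hashlib

# h = H(m): a fixed-length digest, easy to compute in the forward direction.
m = b"attack at dawn"
h = hashlib.sha256(m).hexdigest()
assert len(h) == 64           # 256 bits, hex-encoded

# A one-bit change in m yields an unrelated digest; recovering m from h,
# or finding a second message with the same digest, is computationally
# infeasible (the one-way and collision-resistance properties).
h2 = hashlib.sha256(b"attack at dusk").hexdigest()
assert h != h2
```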

Symmetric Cryptosystems: DES

Example of a cryptographic algorithm: the Data Encryption Standard (DES) (detailed
example).
• Used for symmetric cryptosystems
• DES is designed to operate on 64-bit blocks of data.
• A block is transformed into an encrypted (64 bit) block of output in 16 rounds, where
each round uses a different 48-bit key for encryption.
• Each of these 16 keys is derived from a 56-bit master key (see figure).
• Before an input block starts its 16 rounds of encryption, it is first subject to an initial
permutation, of which the inverse is later applied to the encrypted output leading to the
final output block.



(a) The principle of DES. (b) Outline of one encryption round.

• Each encryption round i takes the 64-bit block produced by the previous round i - 1 as
its input.
• The 64 bits are split into a left part Li-1 and a right part Ri-1, each containing 32 bits.
• The right part is used for the left part in the next round, that is, Li = Ri-1.

Work is done in the mangler function f.


• This function takes a 32-bit block Ri-1 as input, together with a 48-bit key Ki, and
produces a 32-bit block that is XORed with Li-1 to produce Ri.
o (XOR is an abbreviation for the exclusive or operation.)
• The mangler function first expands Ri-1 to a 48-bit block and XORs it with Ki.
• The result is partitioned into eight chunks of six bits each.
• Each chunk is then fed into a different S-box, which is an operation that
substitutes each of the 64 possible 6-bit inputs into one of 16 possible 4-bit
outputs.
• The eight output chunks of four bits each are then combined into a 32-bit value and
permuted again.
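The round structure Li = Ri-1, Ri = Li-1 XOR f(Ri-1, Ki) is what makes each round invertible; a toy Feistel network shows this (the mangler `f` below is a made-up placeholder, not DES's real expansion/S-box/permutation function):

```python
def feistel_encrypt(left, right, round_keys, f):
    """Generic Feistel structure used by DES:
    in every round  L_i = R_{i-1}  and  R_i = L_{i-1} XOR f(R_{i-1}, K_i)."""
    for k in round_keys:
        left, right = right, left ^ f(right, k)
    return left, right

def feistel_decrypt(left, right, round_keys, f):
    # Decryption runs the same structure with the round keys reversed;
    # note f itself never needs to be invertible.
    for k in reversed(round_keys):
        right, left = left, right ^ f(left, k)
    return left, right

# Toy 32-bit "mangler" standing in for DES's f (illustrative only).
f = lambda r, k: (r * 31 + k) & 0xFFFFFFFF

keys = [0x0F0F, 0xA5A5, 0x1234]
l, r = feistel_encrypt(0xDEADBEEF, 0xCAFEBABE, keys, f)
assert feistel_decrypt(l, r, keys, f) == (0xDEADBEEF, 0xCAFEBABE)
```

This is why DES decryption can reuse the encryption hardware: only the order of the sixteen round keys changes.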

The 48-bit key Ki for round i is derived from the 56-bit master key as follows
• First, the master key is permuted and divided into two 28-bit halves.
• For each round, each half is first rotated one or two bits to the left, after which 24 bits
are extracted.
• Together with 24 bits from the other rotated half, a 48-bit key is constructed.



Details of per-round key generation in DES.

Principle of DES is quite simple:


• The algorithm is difficult to break using analytical methods.
• A brute-force attack, simply searching for a key that will do the job, has become easy,
as has been demonstrated a number of times.
• Applying DES three times in a special encrypt-decrypt-encrypt mode with different keys,
also known as Triple DES, is much safer and is still often used.

Public-Key Cryptosystems: RSA


• Public-key systems: RSA : named after its inventors: Rivest, Shamir,
and Adleman (1978).
• Security of RSA - no methods are known to efficiently find the prime factors of large
numbers.
• It can be shown that each integer can be written as the product of prime numbers.
• For example, 2100 can be written as
2100 = 2 x 2 x 3 x 5 x 5 x 7
making 2, 3, 5, and 7 the prime factors in 2100.
• Private and public keys are constructed from very large prime numbers (consisting of
hundreds of decimal digits).
• Breaking RSA is equivalent to finding those two prime numbers.
• So far, this has shown to be computationally infeasible despite mathematicians
working on the problem for centuries.



Generating the private and public key requires four steps:
1. Choose two very large prime numbers, p and q.
2. Compute n = p x q and z = (p – 1) x (q – 1).
3. Choose a number d that is relatively prime to z.
4. Compute the number e such that e x d = 1 mod z.
• One of the numbers, say d, can subsequently be used for decryption, whereas e is used
for encryption.
• Only one of these two is made public, depending on what the algorithm is being used
for.
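The four steps can be tried out with (insecurely) small primes; `rsa_keygen` is an illustrative helper, not a library function:

```python
def rsa_keygen(p, q, e):
    """Textbook RSA key generation (toy sizes -- real keys use primes of
    hundreds of decimal digits). n = p*q, z = (p-1)*(q-1), and d is the
    modular inverse of e, so that e*d = 1 (mod z)."""
    n = p * q
    z = (p - 1) * (q - 1)
    d = pow(e, -1, z)          # requires gcd(e, z) == 1, i.e. step 3
    return (e, n), (d, n)      # (public key, private key)

public, private = rsa_keygen(p=61, q=53, e=17)
e, n = public
d, _ = private
m = 65                          # one message block, with 0 <= m < n
c = pow(m, e, n)                # encryption: c = m^e mod n
assert pow(c, d, n) == m        # decryption: m = c^d mod n
```

The three-argument `pow` (modular inverse and modular exponentiation) is built into Python 3.8+; with toy primes like these, an attacker could of course factor n = 3233 immediately, which is exactly why real keys are enormous.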

Example 3: Bob and Alice


• Alice wants to keep the messages she sends to Bob confidential.
• She wants to ensure that no one but Bob can intercept and read her messages to him.
• RSA considers each message m to be just a string of bits.
• Each message is first divided into fixed-length blocks, where each block mi is
interpreted as a binary number that should lie in the interval 0 ≤ mi < n.
• To encrypt message m, the sender calculates for each block mi the value
ci = mi^e (mod n), which is then sent to the receiver.
• Decryption at the receiver's side takes place by computing mi = ci^d (mod n).
• Note that for encryption, both e and n are needed, whereas decryption requires
knowing the values d and n.
• RSA has the drawback of being computationally more complex.
• encrypting messages using RSA is approximately 100–1000 times slower than DES,
depending on the implementation technique used.
• Many cryptographic systems use RSA to exchange only shared keys in a secure way,
but much less for actually encrypting "normal" data.

Hash Functions: MD5 (Rivest, 1992)

• MD5 is a hash function for computing a 128-bit, fixed length message digest from an
arbitrary length binary input string.
• The input string is first padded to a total length of 448 bits (modulo 512), after which
the length of the original bit string is added as a 64-bit integer.
• In effect, the input is converted to a series of 512-bit blocks.

The structure of MD5 algorithm.



• Starting with some constant 128-bit value, the algorithm proceeds in k phases, where k
is the number of 512-bit blocks comprising the padded message.
• During each phase, a 128-bit digest is computed out of a 512-bit block of data coming
from the padded message, and the 128-bit digest computed in the preceding phase.

A phase in MD5 consists of four rounds of computations, where each round uses one of
the following four functions:

F (x,y,z) = (x AND y) OR ((NOT x) AND z)

G (x,y,z) = (x AND z) OR (y AND (NOT z))

H (x,y,z) = x XOR y XOR z

I (x,y,z) = y XOR (x OR (NOT z))

• Each of these functions operates on 32-bit variables x, y, and z.
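The four functions (and the left-rotate used in every iteration) translate directly into code operating on 32-bit words:

```python
M = 0xFFFFFFFF   # all arithmetic is on 32-bit words

def F(x, y, z): return ((x & y) | (~x & z)) & M   # bitwise if-then-else
def G(x, y, z): return ((x & z) | (y & ~z)) & M
def H(x, y, z): return (x ^ y ^ z) & M
def I(x, y, z): return (y ^ (x | ~z)) & M

def rotl(x, n):
    """The x <<< n left-rotate: bits shifted off the left re-enter
    at the rightmost position."""
    return ((x << n) | (x >> (32 - n))) & M

# F acts as a multiplexer: where a bit of x is 1 it selects the bit
# of y, and where it is 0 it selects the bit of z.
assert F(M, 0x12345678, 0x9ABCDEF0) == 0x12345678
assert F(0, 0x12345678, 0x9ABCDEF0) == 0x9ABCDEF0
assert rotl(0x80000000, 1) == 1
```

These are only the primitive functions, not a full MD5: the surrounding iteration structure (the constants Ci, the subblock schedule, and the carry-over of p, q, r, s between phases) is described in the text above.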


• Consider a 512-bit block b from the padded message that is being processed during
phase k.
• Block b is divided into 16 32-bit subblocks b0,b1,...,b15.
• During the first round, function F is used to change four variables (denoted as p, q, r,
and s, respectively) in 16 iterations
• These variables are carried to each next round, and after a phase has finished, passed
on to the next phase.
• There are a total of 64 predefined constants Ci.
• The notation x <<< n is used to denote a left rotate: the bits in x are shifted n positions
to the left, where the bit shifted off the left is placed in the rightmost position.

The 16 iterations during the first round in a phase in MD5.

• The second round uses the function G in a similar fashion, whereas H and I are used in
the third and fourth rounds, respectively.
• Each phase thus consists of 64 iterations, after which the next phase is started, but now
with the values that p, q, r, and s have at that point.



Secure Channels
DS Security: Two Major Issues
1.Secure communications between parties.
2.Authorization.
Secure channels protect against (Voydock and Kent, 1983):
Interception (through confidentiality).
Modification (through authentication and integrity).
Fabrication (through authentication and integrity).

Note that authentication and message integrity as technologies rely on each other

A detailed description of the logics underlying authentication can be found in Lampson et al.
(1992).

Applications of Cryptography
1. Authentication.
2. Message Integrity.
3. Confidentiality.
Common practice to use secret-key cryptography by means of session keys.
A session key is a shared (secret) key that is used to encrypt messages for integrity
and possibly also confidentiality.
This key is used only for as long as the channel exists.
When the channel is closed, its associated session key is securely destroyed.

Authentication Based on a Shared Secret Key


To ensure the integrity of data messages exchanged after authentication, secret-key
cryptography with session keys is used.
A session key is a shared (secret) key that is used to encrypt messages for integrity.
It is generally used only for as long as the channel exists.

Example: Bob and Alice Again


• Alice and Bob are abbreviated by A and B, respectively, and their shared key is
denoted as KA,B.
• One party challenges the other to a response that can be correct only if the other knows
the shared secret key.
• also known as challenge-response protocols.

1. Alice sends her identity to Bob (message 1), indicating that she wants to set up a
communication channel between the two.
2. Bob sends a challenge RB to Alice, shown as message 2.
Such a challenge could take the form of a random number.
Alice must encrypt the challenge with the secret key KA,B that she shares with
Bob, and return the encrypted challenge to Bob. This response is shown as
message 3, containing KA,B(RB).

Authentication based on a shared secret key, using a ‘challenge response’ protocol.


Note: R is a random number.



3. When Bob receives the response KA,B(RB) to his challenge RB, he can decrypt the
message using the shared key again to see if it contains RB.
If so, he then knows that Alice is on the other side, for who else could have
encrypted RB with KA,B in the first place?
4. Alice has not yet verified that it is Bob on the other side of the channel.
She sends a challenge RA (message 4), which Bob responds to by
returning KA,B(RA), shown as message 5.
5. When Alice decrypts it with KA,B and sees her RA, she knows she is talking to Bob.
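The challenge-response exchange can be sketched as follows; an HMAC stands in for encrypting the challenge with KA,B (an assumption for illustration, since any keyed one-way transformation suffices to prove knowledge of the shared key):

```python
import hashlib
import hmac
import os

SHARED_KEY = b"K_A,B -- known only to Alice and Bob"

def respond(challenge, key=SHARED_KEY):
    """Produce the response to a challenge. Only a holder of the
    shared key can compute a response the peer will accept."""
    return hmac.new(key, challenge, hashlib.sha256).digest()

# Message 2: Bob challenges Alice with a random number RB ...
r_b = os.urandom(16)
# Message 3: ... Alice returns KA,B(RB); Bob recomputes it himself
# with the same shared key and compares.
alice_response = respond(r_b)
assert hmac.compare_digest(alice_response, respond(r_b))

# An intruder without KA,B cannot compute a valid response.
assert respond(r_b, key=b"wrong key") != alice_response
```

The symmetric half of the protocol (Alice's challenge RA in messages 4 and 5) uses the same mechanism in the other direction.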

Things can go wrong:


Consider an "optimization" of the authentication protocol in which the number of
messages has been reduced from five to three, as shown below.
If Alice eventually wants to challenge Bob anyway, she might as well send a challenge
along with her identity when setting up the channel.
Likewise, Bob returns his response to that challenge, along with his own challenge in a
single message.

Example: Bob and Alice – Taking Shortcuts


Optimization of the Authentication based on a shared secret key, but using three instead of
five messages.

Idea:
• If Alice eventually wants to challenge Bob anyway, she might as well send a challenge
along with her identity when setting up the channel.



• Bob returns his response to that challenge, along with his own challenge in a single
message

This protocol can easily be defeated by a reflection attack.


• A reflection attack is a potential way of attacking a challenge-response authentication
system which uses the same protocol in both directions.
• The basic idea is to trick the target into providing the answer to its own challenge.

The general attack outline is as follows:


1. The attacker initiates a connection to a target.
2. The target attempts to authenticate the attacker by sending it a challenge.
3. The attacker opens another connection to the target, and sends the target this
challenge as its own.
4. The target responds to that challenge.
5. The attacker sends that response back to the target ("reflects" it) on the first
connection.

NOTE: If the authentication protocol is not carefully designed, the target will accept that
response as valid, thereby leaving the attacker with one fully-authenticated channel
connection (the other one is simply abandoned).

Example: Bob and Chuck – When Things Go Wrong


The ‘reflection attack’. - Chuck wants Bob to think he is Alice, so he starts up a second
session to trick Bob.
• Chuck's goal is to set up a channel with Bob so that Bob believes he is talking
to Alice.
• Chuck can establish this if he responds correctly to a challenge sent by Bob
o e.g, by returning the encrypted version of a number that Bob sent.
• Without knowledge of KA,B, only Bob can do such an encryption, and this is precisely
what Chuck tricks Bob into doing.
1. Chuck starts out by sending a message containing Alice's identity A, along with a
challenge RC.
2. Bob returns his challenge RB and the response KA,B(RC) in a single message.
3. Chuck needs to prove he knows the secret key by returning KA,B(RB) to Bob.
o He does not have KA,B.
o He attempts to set up a second channel to let Bob do the encryption for him.
4. Chuck sends A and RB in a single message as before, but now pretends that he
wants a second channel. This is shown as message 3

The reflection attack.



5. Bob, not recognizing that he himself had used RB before as a challenge,
responds with KA,B(RB) and another challenge RB2, shown as message 4.
6. At that point, Chuck has KA,B(RB) and finishes setting up the first session by
returning message 5 containing the response KA,B(RB), which was originally
requested from the challenge sent in message 2.

Mistake 1: the two parties in the new version of the protocol were using the same challenge
in two different runs of the protocol.

Better Design: always use different challenges for the initiator and for the responder.
• In general, letting the two parties that are setting up a secure channel do a number of
things identically is not a good idea.

Mistake 2: Bob gave away valuable information in the form of the response KA,B(RC) without
knowing for sure to whom he was giving it.
• Not violated in the original protocol, in which Alice first needed to prove her identity,
after which Bob was willing to pass her encrypted information.

More on design principles for protocols can be found in Abadi and Needham (1996).

Authentication Using a Key Distribution Center

Problem with a shared secret key for authentication: scalability.


• Given N hosts, each host is required to share a secret key with each of the other N - 1
hosts.
• The system as a whole must manage N (N - 1)/2 keys.
• Each host has to manage N - 1 keys.

Alternative: use a centralized approach by means of a Key Distribution Center (KDC).


• This KDC shares a secret key with each of the hosts
• No pair of hosts is required to have a shared secret key as well.
• Using a KDC requires that we manage N keys instead of N (N - 1)/2

• If Alice wants to set up a secure channel with Bob, she can do so with the help of a
(trusted) KDC.
• The KDC hands out a key to both Alice and Bob that they can use for communication.

Principle of using a KDC.

1. Alice sends a message to the KDC, telling it that she wants to talk to Bob.
2. The KDC returns a message containing a shared secret key KA,B that she can use.
• The message is encrypted with the secret key KA,KDC that Alice shares with the KDC.
3. The KDC sends KA,B to Bob, but now encrypted with the secret key KB,KDC it shares
with Bob.

Drawbacks:
• Alice may want to set up a secure channel with Bob even before Bob has received the
shared key from the KDC.
• The KDC is required to pass Bob the key.

Solution: ticket
• KDC passes KB,KDC(KA,B) back to Alice and lets her connect to Bob.
• The message KB,KDC(KA,B) is also known as a ticket.
• It is Alice's job to pass this ticket to Bob.
• Note that Bob is still the only one that can make sensible use of the ticket, as he is the
only one besides the KDC who knows how to decrypt the information it contains.

Using a ticket and letting Alice set up a connection to Bob.
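The KDC's role can be sketched with a toy XOR cipher standing in for encryption under KA,KDC and KB,KDC (illustrative only; a real KDC such as Kerberos uses authenticated encryption, and the helper names here are made up):

```python
import hashlib
import os

def xor_encrypt(key, data):
    """Toy symmetric cipher (keystream derived from SHA-256) standing
    in for K_X,KDC(...). Insecure -- for illustration only."""
    stream = hashlib.sha256(key).digest()
    return bytes(d ^ stream[i % len(stream)] for i, d in enumerate(data))

xor_decrypt = xor_encrypt   # XOR with the same keystream inverts itself

K_A_KDC = b"alice-kdc-secret"   # shared between Alice and the KDC
K_B_KDC = b"bob-kdc-secret"     # shared between Bob and the KDC

# The KDC generates a fresh session key K_A,B and returns to Alice both
# her own copy and the ticket K_B,KDC(K_A,B), which she forwards to Bob.
K_AB = os.urandom(16)
for_alice = xor_encrypt(K_A_KDC, K_AB)
ticket = xor_encrypt(K_B_KDC, K_AB)

# Each party recovers the same session key from its own share; Bob is
# the only one besides the KDC who can open the ticket.
assert xor_decrypt(K_A_KDC, for_alice) == xor_decrypt(K_B_KDC, ticket) == K_AB
```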

• This protocol is a variant of a well-known example of an authentication protocol using a
KDC, known as the Needham-Schroeder authentication protocol, named after its
inventors (Needham and Schroeder, 1978).



The Needham-Schroeder protocol is a multiway challenge-response protocol and works
as follows.

1. When Alice wants to set up a secure channel with Bob, she sends a request to the KDC
containing a challenge RA1, along with her identity A and that of Bob.
2. The KDC responds by giving her the ticket KB,KDC(KA,B), along with the secret
key KA,B that she can subsequently share with Bob.
The challenge RA1 that Alice sends to the KDC along with her request to set up
a channel to Bob is also known as a nonce.
A nonce is a random number that is used only once, such as one chosen from a
very large set.
The purpose of a nonce is to uniquely relate two messages to each other,
e.g., message 1 and message 2.
By including RA1 again in message 2, Alice will know for sure that message 2
is sent as a response to message 1, and that it is not a replay of an older message.
Message 2 also contains B, the identity of Bob.
By including B, the KDC protects Alice against an attack in which an intruder
changes Bob's identity in her original request, so that she would unknowingly set
up a key for talking with someone else.

3. After the KDC has passed the ticket to Alice, the secure channel between Alice and
Bob can be set up.
Alice sends message 3, which contains the ticket to Bob, and a
challenge RA2 encrypted with the shared key KA,B that the KDC had just generated.
4. Bob then decrypts the ticket to find the shared key, and returns a response RA2 - 1
along with a challenge RB for Alice.
NOTE: by returning RA2 - 1 and not just RA2, Bob not only proves he knows the
shared secret key, but also that he has actually decrypted the challenge.
This ties message 4 to message 3 in the same way that the nonce RA1 tied
message 2 to message 1.

Weak Point:
• If Chuck got an old key KA,B, he could replay message 3 and get Bob to set up a
channel.
• Bob will then believe he is talking to Alice, while, in fact, Chuck is at the other end.
• Need to relate message 3 to message 1 - make the key dependent on the initial request
from Alice to set up a channel with Bob.



Protection against malicious reuse of a previously generated session key in the Needham-
Schroeder protocol.

Solution: incorporate a nonce in the request sent by Alice to the KDC.


• The nonce has to come from Bob: this assures Bob that whoever wants to set up a
secure channel with him, will have gotten the appropriate information from the KDC.
• Alice first requests Bob to send her a nonce RB1, encrypted with the key shared
between Bob and the KDC.
• Alice incorporates this nonce in her request to the KDC, which will then decrypt it and
put the result in the generated ticket.
• Bob will know for sure that the session key is tied to the original request from Alice to
talk to Bob.
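The nonce-tying of messages 1 and 2 described above can be sketched in Python. The XOR-keystream cipher, key sizes, and message layout below are illustrative stand-ins, not a real cryptosystem:

```python
import hashlib
import json
import os

def enc(key: bytes, obj) -> bytes:
    # Toy "encryption": XOR the serialized message with a keystream
    # derived from the key (illustration only - NOT secure).
    data = json.dumps(obj, sort_keys=True).encode()
    stream = hashlib.shake_256(key).digest(len(data))
    return bytes(a ^ b for a, b in zip(data, stream))

def dec(key: bytes, blob: bytes):
    stream = hashlib.shake_256(key).digest(len(blob))
    return json.loads(bytes(a ^ b for a, b in zip(blob, stream)))

# Long-term secret keys that Alice and Bob each share with the KDC
K_A_KDC = os.urandom(16)
K_B_KDC = os.urandom(16)

# Message 1: Alice -> KDC, in plaintext: a nonce plus both identities
RA1 = os.urandom(8).hex()          # nonce: a random number used only once
msg1 = {"RA1": RA1, "from": "Alice", "to": "Bob"}

# Message 2: KDC -> Alice: echoes RA1 and Bob's identity, and carries
# the session key K_AB plus the ticket K_B_KDC(A, K_AB) for Bob.
K_AB = os.urandom(16).hex()
ticket = enc(K_B_KDC, {"peer": "Alice", "K_AB": K_AB})
msg2 = enc(K_A_KDC, {"RA1": msg1["RA1"], "B": msg1["to"],
                     "K_AB": K_AB, "ticket": ticket.hex()})

# Alice decrypts message 2 and checks that it answers message 1.
reply = dec(K_A_KDC, msg2)
assert reply["RA1"] == RA1         # not a replay of an older message
assert reply["B"] == "Bob"         # the key really is for talking to Bob
```

Only Bob (and the KDC) can decrypt the ticket, so Alice can simply forward it in message 3.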

Authentication Using Public-Key Cryptography


Mutual authentication in a public-key cryptosystem.
• Note that the KDC is missing …
• But, this assumes that some mechanism exists to verify everyone’s public key.
Example: Bob and Alice
• Alice wants to set up a secure channel to Bob
• Both are in the possession of each other's public key.

Mutual authentication in a public-key cryptosystem.



• Alice sends a challenge RA to Bob, encrypted with his public key K+B.
• Bob must decrypt the message and return the challenge to Alice.
• Because Bob is the only person that can decrypt the message (using the private key
that is associated with the public key Alice used), Alice will know that she is talking to
Bob.
• When Bob receives Alice's request to set up a channel, he returns the decrypted
challenge, along with his own challenge RB to authenticate Alice.
• He also generates a session key KA,B that can be used for further communication.
• Bob's response to Alice's challenge, his own challenge, and the session key are put into
a message encrypted with Alice's public key K+A, shown as message 2.
• Only Alice will be capable of decrypting this message, using the private key
associated with K+A.
• Alice, finally, returns her response to Bob's challenge using the session
key KA,B generated by Bob.

More on Secure Channels


In addition to authentication, a secure channel also requires that messages are confidential,
and that they maintain their integrity.
For example:
Alice needs to be sure that Bob cannot change a received message and claim it
came from her.
Bob needs to be sure that he can prove the message was sent by/from Alice, just in
case she decides to deny ever having sent it in the first place.
Solution: Digital Signing

Digital Signatures
• Digitally signing a message using public-key cryptography.
This can be implemented with RSA.
Note: the entire document is encrypted/signed - this can sometimes be costly
overkill.

Example:
o Bob has sold Alice a collector's item of some phonograph record for $500.
o The whole deal was done through e-mail.
o Alice sends Bob a message confirming that she will buy the record for $500.

Issues:
o Alice needs to be assured that Bob will not maliciously change the $500 mentioned
in her message into something higher, and claim she promised more than $500.
o Bob needs to be assured that Alice cannot deny ever having sent the message
because she had second thoughts.

Solution:
o Alice digitally signs the message in such a way that her signature is uniquely tied to
its content.
o The association between a message and its signature ensures that modifications to the
message will not go unnoticed.
o Alice's signature can be verified to be genuine; she cannot later repudiate the fact
that she signed the message.



Ways to place digital signatures:
1. Use a public-key cryptosystem such as RSA
o When Alice sends a message m to Bob, she encrypts it with her
private key K-A, obtaining K-A(m), and sends it off to Bob.
o If she wants to keep the message content a secret, she can use Bob's public
key K+B and send K+B(m, K-A(m)), which combines m and the version signed
by Alice.

Digital signing a message using public-key cryptography.

o Message arrives at Bob => he can decrypt it using Alice's public key.
o If the public key is owned by Alice, then decrypting the signed version of m and
successfully comparing it to m can mean only that it came from Alice.
o Alice is protected against any malicious modifications to m by Bob, because Bob
will always have to prove that the modified version of m was also signed by Alice.

Problems with scheme:


1. Validity of Alice's signature holds only as long as Alice's private key remains a secret.
2. Alice may decide to change her private key.
o Once she has changed it, her earlier statement sent to Bob becomes worthless.
3. Alice encrypts the entire message with her private key.
o Such an encryption may be costly in terms of processing requirements

o A cheaper, more elegant scheme is to use a message digest.

Message digest => is a fixed-length bit string h that has been computed from an arbitrary-
length message m by means of a cryptographic hash function H.
o If m is changed to m', its hash H (m') will be different from h = H (m) so that it can
easily be detected that a modification has taken place.
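A minimal Python illustration of this property, using SHA-256 as the hash function H:

```python
import hashlib

m  = b"Alice will buy the record for $500"
m2 = b"Alice will buy the record for $900"

h  = hashlib.sha256(m).hexdigest()    # h = H(m), a fixed-length bit string
h2 = hashlib.sha256(m2).hexdigest()   # H(m') for the modified message

assert len(h) == len(h2) == 64        # 256-bit digest, hex-encoded
assert h != h2                        # the modification is detectable
```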

To digitally sign a message:


1. Alice first computes a message digest and encrypts the digest with her private key.
2. The encrypted digest is sent along with the message to Bob.

Note that the message itself is sent as plaintext: everyone is allowed to read it.
o If confidentiality is required, then the message should also be encrypted with Bob's
public key.



Digitally signing a message using a message digest.

3. When Bob receives the message and its encrypted digest, he decrypts the digest
with Alice's public key, and separately calculates the message digest.
o If the digest calculated from the received message and the decrypted digest
match, Bob knows the message has been signed by Alice.
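The sign-then-verify flow above can be sketched with textbook RSA. The tiny primes and the reduction of the digest mod n are purely for illustration; real signatures use much larger keys and padding:

```python
import hashlib

# Textbook RSA with tiny demo primes (p = 61, q = 53): far too small
# for real use, but enough to show the sign/verify mechanics.
n, e, d = 3233, 17, 2753              # public key (n, e), private key d

def digest(m: bytes) -> int:
    # Reduce the SHA-256 digest mod n so it fits the toy modulus.
    return int.from_bytes(hashlib.sha256(m).digest(), "big") % n

def sign(m: bytes) -> int:
    # Alice: encrypt the digest with her private key.
    return pow(digest(m), d, n)

def verify(m: bytes, sig: int) -> bool:
    # Bob: decrypt the signature with Alice's public key and compare
    # it against a freshly computed digest of the received message.
    return pow(sig, e, n) == digest(m)

m = b"I will buy the record for $500"
s = sign(m)
assert verify(m, s)                   # signed by the private-key holder
# A modified message yields a different digest (barring a collision in
# the tiny modulus), so its verification against s would fail.
```

The message itself travels as plaintext; only the short digest is signed, which is far cheaper than signing the whole message.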

Session Keys
A session key is a key used for encrypting one message or a group of messages in a
communication session.
During the establishment of a secure channel, after the authentication phase has
completed, the communicating parties generally use a unique shared session key for
confidentiality.
The session key is safely discarded when the channel is no longer used.

Secure Group Communication


Design issue: How can you share secret information between multiple members without
losing everything when one member turns bad.

Confidentiality: Follow a simple (hard-to-scale) approach by maintaining a separate secret


key between each pair of members.

Replication: You also want to provide replication transparency.


Apply secret sharing:
No process knows the entire secret; it can be revealed only through joint
cooperation
Assumption: at most k out of N processes can produce an incorrect answer
At most c <= k processes have been corrupted

Note: We are dealing with a k-fault-tolerant process group.

Secure Replicated Servers


Secure and Transparent Replicated Servers
Example:
Given a securely replicated group of servers
Each server accompanies its response with a digital signature.
If ri is the response from server Si, let md (ri) denote the message digest computed
by server Si.
• This digest is signed with server Si's private key Ki-.
Want to protect the client against at most c corrupted servers => the server group should be
able to tolerate corruption by at most c servers, and still be capable of producing a response
that the client can put its trust in.

Let the replicated servers generate a secret valid signature with the property
that c corrupted servers alone are not enough to produce that signature.

Consider a group of five replicated servers that should be able to tolerate two corrupted
servers, and still produce a response that a client can trust.

Sharing a secret signature in a group of replicated servers.

Let N = 5, c = 2
Each server Si gets to see each request and responds with ri.
• Response ri is sent along with digest md(ri), and signed with private key Ki-.
• The signature is denoted as sig(Si, ri) = Ki-(md(ri)).

Client uses special decryption function D that computes a single


digest d from three signatures:
d = D(sig(S,r), sig(S',r'), sig(S'',r''))
If d = md(ri) for some ri, ri is considered correct
Also known as (m,n)-threshold scheme.
(with m = c + 1, n = N)

There are 5!/(3!2!)=10 possible combinations of three signatures that the client can
use as input for D.
If one of these combinations produces a correct digest md (ri) for some response ri,
then the client can consider ri as being correct.
• It can trust that the response has been produced by at least three honest servers.
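The client's acceptance test can be sketched in Python; as a simplification, response digests are compared directly instead of running the decryption function D over signatures:

```python
from itertools import combinations
from math import comb

N, c = 5, 2                  # five replicas, tolerate two corrupted ones
m = c + 1                    # (m, n)-threshold: need m matching answers

assert comb(N, m) == 10      # 5!/(3!2!) = 10 candidate triples

# Simplification: instead of combining signatures through D, compare
# the responses themselves and look for m servers that agree.
def accept(responses):
    for triple in combinations(responses, m):
        if len(set(triple)) == 1:     # three servers agree
            return triple[0]
    return None                       # no trustworthy response

assert accept(["r", "r", "r", "bad1", "bad2"]) == "r"   # 2 corrupted: OK
assert accept(["r", "r", "b1", "b2", "b3"]) is None     # 3 corrupted: fail
```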



Example:
Kerberos is a network authentication protocol.
It is designed to provide strong authentication for client/server applications by
using secret-key cryptography.
• Kerberos was developed at M.I.T. and is based on the Needham-Schroeder
authentication protocol.
• Two different versions of Kerberos- version 4 (V4) and version 5 (V5).
o V5 being more flexible and scalable.

Two different components:


Authentication Server (AS) - responsible for handling a login request from a user.
• AS authenticates a user and provides a key that can be used to set up secure channels
with servers.
Ticket Granting Service (TGS) - sets up secure channels.
• The TGS hands out special messages, known as tickets, that are used to convince a
server that the client is really who he or she claims to be.

Example: Alice sets up a secure channel with server Bob.


For Alice to log onto the system, she can use any workstation available. The workstation
sends her name in plaintext to the AS, which returns a session key KA,TGS and a ticket that
she will need to hand over to the TGS.

The ticket that is returned by the AS contains the identity of Alice, along with a generated
secret key that Alice and the TGS can use to communicate with each other. The ticket itself
will be handed over to the TGS by Alice. Therefore, it is important that no one but the TGS
can read it. For this reason, the ticket is encrypted with the secret key KAS,TGS shared
between the AS and the TGS.

Authentication in Kerberos.

Message 1 - Alice types in her login name at a workstation.


Message 2 - contains login name and is sent to the AS.
Message 3 - contains the session key KA,TGS and the ticket KAS,TGS(A,KA,TGS).
• To ensure privacy, message 3 is encrypted with the secret key KA,AS shared between
Alice and the AS.
Message 4 - the workstation prompts Alice for her password.
Message 5 - Alice types in her password; the workstation uses it to generate the shared
key KA,AS and to recover the session key KA,TGS from message 3.



Message 6 - to talk to Bob, Alice requests the TGS to generate a session key for use with Bob.
• The fact that Alice has the ticket KAS,TGS(A,KA,TGS) proves that she is Alice.
• The message also contains a timestamp, t, encrypted with the secret key shared between
Alice and the TGS.

Setting up a secure channel with Bob

1. Alice sends to Bob a message containing the ticket she got from the TGS, along with
an encrypted timestamp.
2. When Bob decrypts the ticket, he notices that Alice is talking to him, because only
the TGS could have constructed the ticket.
He also finds the secret key KA,B, allowing him to verify the timestamp.
At that point, Bob knows he is talking to Alice and not someone
maliciously replaying message 1.
By responding with KA,B(t + 1), Bob proves to Alice that he is indeed Bob.

Access Control
Verifying access rights is referred to as access control, whereas authorization is about
granting access rights

Authorization versus Authentication


Authentication: Verify the claim that a subject says it is S: verifying the identity of a subject
Authorization: Determining whether a subject is permitted certain services from an object
Note: authorization makes sense only if the requesting subject has been authenticated

Simple model of access control:


Subjects issue a request to access an object.
Processes acting on behalf of users, but can also be objects that need the services of
other objects in order to carry out their work.
An object encapsulates its own state and implements the operations on that state.
Operations of an object that subjects can request to be carried out are made
available through interfaces.

General model of controlling access to objects.



Reference Monitor records which subject may do what, and decides whether a subject is
allowed to have a specific operation carried out.
• This monitor is called (e.g., by the underlying trusted operating system) each time an
object is invoked.

Access Control Matrix


Essence: Maintain an access control matrix ACM in which entry ACM[S,O] contains the
permissible operations that subject S can perform on object O
Issues:
System may need to support thousands of users and millions of objects that require
protection
• Many entries in the matrix will be empty: a single subject will generally have access to
relatively few objects.

Solutions:
Implementation (a): Each object O maintains an access control list (ACL): ACM[*,O]
describing the permissible operations per subject (or group of subjects)
Implementation (b): Each subject S has a capability: ACM[S,*] describing the permissible
operations per object (or category of objects)
Comparison between ACLs and capabilities for protecting objects.
(a) Using an ACL. (b) Using capabilities.
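Because the matrix is sparse, both implementations can be derived from one dictionary keyed by (subject, object) pairs. A Python sketch with illustrative names:

```python
# Sparse access-control matrix ACM[S, O], stored as a dict keyed by
# (subject, object) pairs; subjects, objects, and rights are made up.
ACM = {("alice", "file1"): {"read", "write"},
       ("bob",   "file1"): {"read"},
       ("bob",   "printer"): {"print"}}

def acl(obj):
    # (a) ACL: per object O, the column ACM[*, O]
    return {s: ops for (s, o), ops in ACM.items() if o == obj}

def capabilities(subj):
    # (b) Capabilities: per subject S, the row ACM[S, *]
    return {o: ops for (s, o), ops in ACM.items() if s == subj}

assert acl("file1") == {"alice": {"read", "write"}, "bob": {"read"}}
assert capabilities("bob") == {"file1": {"read"}, "printer": {"print"}}
```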

Protection Domains
Issue: ACLs or capability lists can be very large.
Reduce information by means of protection domains: (Saltzer and Schroeder, 1975)
Set of (object, access rights) pairs
Each pair is associated with a protection domain
For each incoming request the reference monitor first looks up the appropriate
protection domain



Common implementation of protection domains:
Groups: Users belong to a specific group; each group has associated access rights
Roles: Don't differentiate between users, but only the roles they can play.
Your role is determined at login time. Role changes are allowed.

Firewalls
Essence: Sometimes it's better to select service requests at the lowest level: network packets.
Packets that do not fit certain requirements are simply removed from the channel
Solution: Protect your company by a firewall: it implements access control

A common implementation of a firewall

Two Types of Firewalls:


Packet-filtering gateway - operates as a router and makes decisions as to whether or not to
pass a network packet based on the source and destination address as contained in the packet's
header.
• Typically, the packet-filtering gateway shown on the outside LAN above would
protect against incoming packets, whereas the one on the inside LAN would filter
outgoing packets.

Application-level gateway - this type of firewall inspects the content of an incoming or


outgoing message.
• e.g. mail gateway that discards incoming or outgoing mail exceeding a certain size.
• e.g. filtering spam e-mail.
• e.g. proxy gateway - works as a front end to a specific kind of application, and ensures
that only those messages are passed that meet certain criteria.
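The packet-filtering decision described above can be sketched as a filter function; the network range and blocked ports below are hypothetical policy values:

```python
from ipaddress import ip_address, ip_network

# Hypothetical site network and policy values, for illustration only.
INSIDE = ip_network("192.168.1.0/24")
BLOCKED_PORTS = {23, 135}                 # e.g. telnet and RPC endpoints

def allow(src: str, dst_port: int, inbound: bool) -> bool:
    if dst_port in BLOCKED_PORTS:
        return False                      # filtered regardless of direction
    if inbound:
        # Drop inbound packets that spoof an inside source address.
        return ip_address(src) not in INSIDE
    # Egress filtering: outgoing packets must carry an inside source.
    return ip_address(src) in INSIDE

assert allow("10.0.0.7", 443, inbound=True)          # normal inbound
assert not allow("192.168.1.9", 443, inbound=True)   # spoofed source
assert not allow("10.0.0.7", 23, inbound=True)       # blocked port
assert allow("192.168.1.9", 80, inbound=False)       # normal outbound
```

An application-level gateway would additionally inspect payloads, which a header-only filter like this cannot do.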

Secure Mobile Code


Problem:
Mobile code is great for balancing communication and computation, but is hard to
implement a general-purpose mechanism that allows different security policies for
local-resource access.
Also, we may need to protect the mobile code (e.g., agents) against malicious hosts.

Protecting an Agent
Ajanta: Detect that an agent has been tampered with while it was on the move.



Most important: append-only logs:
Data can only be appended to the log; there is no way that data can be removed or
modified without the owner being able to detect this
There is always an associated checksum. Initially,
Cinit = K+owner(N), with N a nonce.
Adding data X by server S:
Cnew = K+owner(Cold, sig(S,X), S)
Checking the log, the owner unwinds the checksum:
K-owner(C) -> Cprev, sig(S,X), S
allowing the owner to check the integrity of X
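The checksum chain can be sketched in Python; as an assumption, a one-way hash stands in for encryption with the owner's public key K+owner, and entries stand in for signed agent data:

```python
import hashlib

# Hash-chained append-only log. The notes derive the checksum with the
# owner's public key K+owner; SHA-256 is substituted here so the
# sketch stays self-contained.
Cinit = hashlib.sha256(b"nonce-N").digest()     # Cinit from a nonce N

def chain(prev: bytes, server: str, entry: bytes) -> bytes:
    # Cnew folds the old checksum, the server id, and the entry together.
    return hashlib.sha256(prev + server.encode() + entry).digest()

def final_checksum(log):
    C = Cinit
    for server, entry in log:
        C = chain(C, server, entry)
    return C

log = [("S1", b"x1"), ("S2", b"x2")]
C = final_checksum(log)

# Any modification or removal changes the recomputed checksum.
assert final_checksum([("S1", b"x1-changed"), ("S2", b"x2")]) != C
assert final_checksum([("S2", b"x2")]) != C
assert final_checksum(log) == C
```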

Protecting a Host
Simple solution: Enforce a (very strict) single policy, and implement that by means of a few
simple mechanisms
Sandbox model: Policy: Remote code is allowed access to only a pre-defined collection of
resources and services.
Mechanism: Check instructions for illegal memory access and service access
Playground model: Same policy, but mechanism is to run code on separate unprotected
machine.

Observation: We need to be able to distinguish local from remote code before being able to
do anything
Refinement 1: We need to be able to assign a set of permissions to mobile code before its
execution and check operations against those permissions at all times
Refinement 2: We need to be able to assign different sets of permissions to different units of
mobile code => authenticate mobile code (e.g. through signatures)

(a) A sandbox. (b) A playground.

Denial of Service (DoS)


Maliciously preventing authorized processes from accessing resources.
Issue: Distributed denial of service (DDoS)
Huge collection of processes jointly attempt to bring down a networked service.
Attackers succeed in hijacking a large group of machines which unknowingly
participate in the attack.
distinguish two types of attacks:
1. those aimed at bandwidth depletion



2. those aimed at resource depletion.

Solutions:
No single method to protect against DDoS attacks. BUT…
Continuously monitor network traffic
• Starting at the egress routers where packets leave an organization's network.
o Experience shows that by dropping packets whose source address does not
belong to the organization's network we can prevent a lot of havoc.
o In general, the more packets can be filtered close to the sources, the better.

• Concentrate on ingress routers, that is, where traffic flows into an organization's
network.
o detecting an attack at an ingress router is too late as the network will probably
already be unreachable for regular traffic.
o Better to have routers further in the Internet, such as in the networks of ISPs,
start dropping packets when they suspect that an attack is going on.

• In general, a myriad of techniques needs to be deployed, while new attacks continue
to emerge.
o See the literature for a practical overview of the state-of-the-art in
denial-of-service attacks and solutions.
Security Management
Key establishment and distribution
Secure group management
Authorization management

Key Establishment: Diffie-Hellman


Observation: We can construct secret keys in a safe way without having to trust a third party
(i.e. a key distribution center (KDC)):
Alice and Bob have to agree on two large numbers, n and g. Both numbers may be
public.
Alice chooses large number x, and keeps it to herself. Bob does the same, say y.

1: Alice sends (n, g, g^x mod n) to Bob
2: Bob sends (g^y mod n) to Alice
3: Alice computes KA,B = (g^y mod n)^x = g^(xy) mod n
4: Bob computes KA,B = (g^x mod n)^y = g^(xy) mod n

The principle of Diffie-Hellman key exchange
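The four steps translate directly into Python; the small Mersenne prime used as n is a demo value only, since real systems use standardized groups of 2048 bits or more:

```python
import secrets

# Demo parameters: n is the Mersenne prime 2^61 - 1, g a small generator.
n = 2**61 - 1
g = 3

x = secrets.randbelow(n - 3) + 2      # Alice's secret exponent
y = secrets.randbelow(n - 3) + 2      # Bob's secret exponent

A = pow(g, x, n)                      # 1: Alice -> Bob: (n, g, g^x mod n)
B = pow(g, y, n)                      # 2: Bob -> Alice: g^y mod n

K_alice = pow(B, x, n)                # 3: (g^y mod n)^x = g^(xy) mod n
K_bob   = pow(A, y, n)                # 4: (g^x mod n)^y = g^(xy) mod n

# Both sides derive the same key without ever sending x or y.
assert K_alice == K_bob
```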



Key Distribution
Essence: If authentication is based on cryptographic protocols, and we need session keys to
establish secure channels, who's responsible for handing out keys?

Secret keys:
Alice and Bob will have to get a shared key.
They can invent their own and use it for data exchange.
Alternatively, they can trust a key distribution center (KDC) and ask it for a key.

Public keys:
Alice will need Bob's public key to decrypt (signed) messages from Bob, or to send
private messages to Bob.
But she'll have to be sure about actually having Bob's public key, or she may be in
big trouble.
Use a trusted certification authority (CA) to hand out public keys.

A public key is put in a certificate, signed by a CA.

Another problem: How do we get the secret keys to their new owners?

• If there are no keys available to Alice and Bob to set up such a secure channel, it is
necessary to distribute the key out-of-band.
o Alice and Bob will have to get in touch with each other using some other
communication means than the network.
o For example, one of them may phone the other, or send the key on a floppy
disk using snail mail.

(a) Secret-key distribution.


(b) Public-key distribution.



In a public-key cryptosystem (b), the public key must be distributed in such a way that the
receivers can be sure that the key is paired to the claimed private key.
• although the public key itself may be sent as plaintext, it is necessary that the channel
through which it is sent can provide authentication.
• The private key needs to be sent across a secure channel providing authentication as
well as confidentiality.

Secure Group Management


Structure: Group uses a key pair (K+G, K-G) for communication with nongroup members.
There is a separate shared secret key CKG for internal communication.
Assume process P wants to join the group and contacts Q.

1: P generates a one-time reply pad RP, and a secret key KP,G. It sends a join request to Q,
signed by itself (notation: [JR]P), along with a certificate containing its public key K+P .

Securely admitting a new group member.

2: Q authenticates P, checks whether it can be allowed as member.


It returns the group key CKG, encrypted with the one-time pad, as well as the group's private
key, encrypted as CKG(K-G).

3: P authenticates Q and sends back KP,G(N), letting Q know that it has all the necessary keys.



Authorization Management
Issue: To avoid that each machine needs to know about all users, use capabilities and attribute
certificates to express the access rights that the holder has.

In Amoeba, restricted access rights are encoded in a capability, along with data for
an integrity check to protect against tampering:

• A capability is a 128-bit identifier, internally organized as shown below.


• First 48 bits are initialized by the object's server when the object is created and
effectively form a machine-independent identifier of the object's server, referred to as
the server port.
o Amoeba uses broadcasting to locate the machine where the server is currently
located.

A capability in Amoeba.

Next 24 bits are used to identify the object at the given server.
o Note that the server port, along with the object identifier, form a 72-bit
system wide unique identifier for every object in Amoeba.
Next 8 bits are used to specify the access rights of the holder of the capability.
• The last 48 bits form a check field used to make a capability unforgeable, as
explained below.

When an object is created, its server picks a random check field and stores it both in
the capability as well as internally in its own tables.
All the right bits in a new capability are initially on, and it is this owner capability that
is returned to the client.
When the capability is sent back to the server in a request to perform an operation, the
check field is verified.

To create a restricted capability, a client can pass a capability back to the server, along
with a bit mask for the new rights.
o The server takes the original check field from its tables, XORs it with the new
rights (which must be a subset of the rights in the capability), and then runs the
result through a one-way function.
o The server then creates a new capability, with the same value in the object field,
but with the new rights bits in the rights field and the output of the one-way
function in the check field. The new capability is then returned to the caller.
o The client may send this new capability to another process, if it wishes.



Generation of a restricted capability from an owner capability
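A toy version of this scheme, with SHA-256 standing in for the one-way function f and a per-byte XOR as the (assumed) way of mixing the rights into the check field:

```python
import hashlib
import os

# Amoeba-style capability sketch: (port, object, rights, check). The
# one-way function and the exact XOR mixing are assumptions.
def f(x: bytes) -> bytes:
    return hashlib.sha256(x).digest()[:6]       # 48-bit check field

server_tables = {}                              # server's private state

def create_object(port: bytes, obj: int):
    check = os.urandom(6)                       # random check field
    server_tables[(port, obj)] = check
    return (port, obj, 0xFF, check)             # owner: all rights on

def restrict(cap, new_rights: int):
    port, obj, rights, _ = cap
    assert new_rights & rights == new_rights    # must be a subset
    mixed = bytes(b ^ new_rights for b in server_tables[(port, obj)])
    return (port, obj, new_rights, f(mixed))    # one-way: unforgeable

def verify(cap):
    port, obj, rights, check = cap
    orig = server_tables[(port, obj)]
    if rights == 0xFF:
        return check == orig                    # owner capability
    return check == f(bytes(b ^ rights for b in orig))

owner = create_object(b"\x01" * 6, 42)
ro = restrict(owner, 0x01)                      # read-only capability
assert verify(owner) and verify(ro)
forged = (ro[0], ro[1], 0xFF, ro[3])            # try to widen the rights
assert not verify(forged)
```

Because f is one-way, a holder of the restricted capability cannot recover the original check field and thus cannot forge broader rights.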

Delegation
Observation: A subject sometimes wants to delegate its privileges to an object O1, to allow
that object to request services from another object O2
Example: A client tells the print server PS to fetch a file F from the file server FS to make a
hard copy => the client delegates its read privileges on F to PS

Nonsolution: Simply hand over your attribute certificate to a delegate (which may pass it on
to the next one, etc.)

Problem: To what extent can the object trust a certificate to have originated at the initiator of
the service request, without forcing the initiator to sign every certificate?

Solution: Ensure that delegation proceeds through a secure channel, and let a delegate prove
it got the certificate through such a path of channels originating at the initiator.

General approach to delegation (Neuman 1993) - make use of a proxy.


o A proxy in the context of security in computer systems is a token that allows its owner
to operate with the same or restricted rights and privileges as the subject that granted
the token.
o A process can create a proxy with at best the same rights and privileges it has itself.
o If a process creates a new proxy based on one it currently has, the derived proxy will
have at least the same restrictions as the original one, and possibly more.

Neuman's scheme has two parts :


o Let A be the process that created the proxy.
o The first part of the proxy is a certificate C = {R, S+proxy}, consisting of a set R of access
rights that have been delegated by A, along with the publicly-known part S+proxy of a secret that
is used to authenticate the holder of the certificate.
o The certificate carries the signature sig (A,C) of A, to protect it against
modifications.
o The second part contains the other part of the secret, denoted as S-proxy.
o It is essential that S-proxy is protected against disclosure when delegating rights
to another process.
The general structure of a proxy as used for delegation

A protocol for delegating and exercising rights:


o Assume that Alice and Bob share a secret key KA,B that can be used for encrypting
messages they send to each other.
o Alice first sends Bob the certificate C = {R, S+proxy}, signed with sig (A,C) (and denoted
again as [R, S+proxy]A).
o There is no need to encrypt this message: it can be sent as plaintext.
o Only the private part of the secret needs to be encrypted, shown as KA,B(S-proxy) in
message 1.

Using a proxy to delegate and prove ownership of access rights

o Suppose that Bob wants an operation to be carried out at an object that resides at a
specific server.
o Also, assume that Alice is authorized to have that operation carried out, and that she
has delegated those rights to Bob.
o Therefore, Bob hands over his credentials to the server in the form of the
signed certificate [R, S+proxy]A.

o At that point, the server will be able to verify that C has not been tampered
with: any modification to the list of rights, or to S+proxy, will be
noticed, because both have been jointly signed by Alice.

o However, the server does not know yet whether Bob is the rightful owner of
the certificate.

▪ To verify this, the server must use the secret part S+proxy that came
with C: it sends Bob a challenge encrypted as S+proxy(N).
o By decrypting S+proxy(N) and returning N, Bob proves he knows the secret S-proxy and
is thus the rightful holder of the certificate.
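A compressed sketch of the two checks the server performs. The hash-preimage proxy pair and HMAC signature are stand-ins for the actual key material, not the real Neuman scheme:

```python
import hashlib
import hmac
import json
import os

# The proxy pair (S+proxy, S-proxy) is modeled as (hash(s), s):
# revealing the preimage s proves ownership (one-shot stand-in).
s_minus = os.urandom(16)                          # secret part S-proxy
s_plus = hashlib.sha256(s_minus).hexdigest()      # public part S+proxy

alice_key = os.urandom(16)                        # Alice's signing key
C = json.dumps({"rights": ["read:F"], "s_plus": s_plus})
sig_A = hmac.new(alice_key, C.encode(), hashlib.sha256).hexdigest()

def server_check(cert: str, sig: str, proof: bytes) -> bool:
    # 1) C has not been tampered with: re-verify Alice's signature.
    expected = hmac.new(alice_key, cert.encode(),
                        hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, sig):
        return False
    # 2) The presenter is the rightful holder: the proof must hash to
    # the public part S+proxy stored inside the certificate.
    return hashlib.sha256(proof).hexdigest() == json.loads(cert)["s_plus"]

# Alice passes C (plaintext) and S-proxy (over a secure channel) to Bob.
assert server_check(C, sig_A, s_minus)            # Bob is accepted
assert not server_check(C, sig_A, os.urandom(16)) # an imposter fails
```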

