Distributed Systems - Final Materials
CS 17501
Distributed Systems
Dr. B. Swaminathan
Mr. N. Duraimurugan
Dr. U. Karthikeyan
RAJALAKSHMI ENGINEERING COLLEGE
[AUTONOMOUS]
RAJALAKSHMI NAGAR, THANDALAM, CHENNAI-602105
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
Prepared by
Dr. B. Swaminathan
Mr. N. Duraimurugan
Dr. U. Karthikeyan
Vision
Mission
To promote research activities amongst the students and the members of faculty that could
benefit the society.
PEO I
To equip students with essential background in computer science, basic electronics and
applied mathematics.
PEO II
To prepare students with fundamental knowledge in programming languages and tools and
enable them to develop applications.
PEO III
To encourage the research abilities and innovative project development in the field of
networking, security, data mining, web technology, mobile communication and also emerging
technologies for the cause of social benefit.
PEO IV
CO 1 The student must gain knowledge of the goals and types of distributed systems
CO 2 The student must be able to describe distributed operating systems and communications
CO 3 The student must have a clear knowledge about distributed objects and file systems
CO 4 The student must emphasize the benefits of using distributed transactions and concurrency
CO 5 The student must be able to explicate issues related to developing fault-tolerant systems and security
CO-PO Mapping
CO 1     3  2    2    2    2  2    2    2  3    1    3  2  2    2    3
CO 2     3  3    3    3    2  3    2    2  2    2    3  2  2    3    3
CO 3     3  3    3    2    2  3    3    2  2    2    2  2  2    2    2
CO 4     3  3    3    3    2  3    2    2  2    2    3  2  2    2    2
CO 5     3  3    3    2    2  2    2    2  2    2    3  2  3    3    2
Average  3  2.8  2.8  2.4  2  2.6  2.2  2  2.2  1.8  3  2  2.2  2.4  2.4
OBJECTIVES:
To know the goals and types of Distributed Systems
To describe Distributed OS and Communications
To learn about Distributed Objects and File System
To emphasize the benefits of using Distributed Transactions and Concurrency
To learn issues related to developing Fault-Tolerant Systems and Security
UNIT I INTRODUCTION 9
Introduction to Distributed systems – Design Goals - Types of Distributed Systems -
Architectural Styles – Middleware - System Architecture – Centralized and Decentralized
organizations – Peer-to-Peer System – Case Study: Skype and Bit-Torrent
TOTAL: 45 PERIODS
OUTCOMES:
Gain knowledge of the goals and types of Distributed Systems
Ability to Describe Distributed OS and Communications
A clear knowledge about Distributed objects and File System
Emphasize the benefits of using Distributed Transactions and Concurrency
Ability to explicate issues related to Developing Fault-Tolerant Systems and Security
REFERENCES:
1. Pradeep K. Sinha, Distributed Operating Systems, Prentice-Hall of India, New Delhi, 1st ed., 2001.
2. Jean Dollimore, Tim Kindberg, George Coulouris, Distributed Systems - Concepts and Design, Pearson Education, 4th ed., 2005.
3. M.L. Liu, Distributed Computing Principles and Applications, Pearson Education, 1st ed., 2004.
4. Hagit Attiya and Jennifer Welch, Distributed Computing: Fundamentals, Simulations and Advanced Topics, Wiley, 1st ed., 2004.
Lesson Plan
Sl. No.  TOPIC                                         Periods Required  Unit  Ref.
UNIT I : INTRODUCTION
1.       Introduction - Distributed Systems            1                 I     T1: 1-2
2.       Design Goals                                  1                 I     T1: 3-9
3.       Types of Distributed Systems                  1                 I     T1: 17-24
4.       Architectural Styles                          1                 I     T1: 34-35
5.       Middleware                                    1                 I     T1: 54-57
6.       System Architecture                           1                 I     T1: 35-36
7.       Centralized and Decentralized organizations   1                 I     T1: 36-44
8.       Peer-to-Peer System                           1                 I     T1: 44-51
9.       Case Study: Skype and Bit-Torrent             1                 I     T1: 51-53 & Internet
TOTAL HOURS FOR UNIT I: 9
CAP Theorem: The CAP theorem states that a distributed data store cannot simultaneously be consistent, available and partition tolerant.
• Consistency — what you read and write sequentially is what is expected (remember the gotcha with the database replication a few paragraphs ago?).
• Availability — the whole system does not die; every non-failing node always returns a response. (The probability that the system is operational at a given time.)
• Partition Tolerance — the system continues to function and uphold its consistency/availability guarantees in spite of network partitions.
Distributed systems should also be relatively easy to expand or scale. A distributed system will normally be continuously available, although perhaps some parts may be temporarily out of order. Users and applications should not notice that parts are being replaced or fixed, or that new parts are added to serve more users or applications.
Distributed systems are often organized by means of a layer of software that is logically placed between a higher-level layer consisting of users and applications, and a layer underneath consisting of operating systems and basic communication facilities. This layer is called middleware.
Degree of transparency
Aiming at full distribution transparency may be too much:
• Users may be located on different continents.
• Completely hiding failures of networks and nodes is (theoretically and practically) impossible:
  - You cannot distinguish a slow computer from a failing one.
  - You can never be sure that a server actually performed an operation before a crash.
• Full transparency will cost performance, so exposing some distribution of the system may be preferable. Examples:
  - Keeping Web caches exactly up to date with the master copy.
  - Immediately flushing write operations to disk for fault tolerance.
1.2.3. Scale in distributed systems: Many developers of modern distributed systems easily use the adjective "scalable" without making clear why their system actually scales.
Scalability: At least three components:
Number of users and/or processes (size scalability)
Maximum distance between nodes (geographical scalability)
Number of administrative domains (administrative scalability)
Grid Computing: The next step: lots of nodes from everywhere: Heterogeneous
Dispersed across several organizations
Can easily span a wide-area network
• Hardware: Processors, routers, power and cooling systems. Customers normally never
get to see these.
• Infrastructure: Deploys virtualization techniques. Revolves around allocating and managing virtual storage devices and virtual servers.
• Platform: Provides higher-level abstractions for storage and such. Example: the Amazon S3 storage system offers an API for (locally created) files to be organized and stored in so-called buckets (see the sketch after this list).
• Application: Actual applications, such as office suites (text processors, spreadsheet
applications, presentation applications). Comparable to the suite of apps shipped with
OSes.
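As an illustration of the platform layer, the following sketch stores and retrieves an object in an S3 bucket through the bucket/object API. It assumes the boto3 AWS SDK for Python is installed and configured with credentials; the bucket name and object key are hypothetical.

import boto3  # AWS SDK for Python; assumed to be installed and configured with credentials

s3 = boto3.client("s3")
bucket = "example-course-bucket"               # hypothetical bucket name
s3.put_object(Bucket=bucket,                   # objects live inside named buckets
              Key="notes/unit1.txt",           # object key (a path-like name)
              Body=b"locally created data")
obj = s3.get_object(Bucket=bucket, Key="notes/unit1.txt")
print(obj["Body"].read())                      # retrieve the stored object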
Consistency: A transaction establishes a valid state transition. This does not exclude the
possibility of invalid, intermediate states during the transaction’s execution.
Isolation: Concurrent transactions do not interfere with each other. It appears to each
transaction T that other transactions occur either before T, or after T, but never both.
2. Mobile computing systems: pervasive, but emphasis is on the fact that devices are
inherently mobile.
Mobile computing systems are generally a subclass of ubiquitous computing systems
and meet all of the five requirements.
Typical characteristics
• Many different types of mobile devices: smartphones, remote controls, car equipment, and so on
• Wireless communication
• Devices may continuously change their location:
  - setting up a route may be problematic, as routes can change frequently
  - devices may easily be temporarily disconnected, which calls for disruption-tolerant networks
Decoupling processes in space (“anonymous”) and also time (“asynchronous”) has led to
alternative styles.
Application Layering
Traditional three-layered view
• User-interface layer contains units for an application’s user interface
• Processing layer contains the functions of an application, i.e. without specific data
• Data layer contains the data that a client wants to manipulate through the application
components
This layering is found in many distributed information systems, using traditional database
technology and accompanying applications.
Multi-Tiered Architectures
Single-tiered: dumb terminal/mainframe configuration
Two-tiered: client/single server configuration
Three-tiered: each layer on separate machine
1.5.2.3. Hybrid P2P: some nodes are appointed special functions in a well-organized fashion.
Hybrid Architectures: client-server combined with P2P, e.g. edge-server architectures, which are often used for Content Delivery Networks.
Protocols Used
Voice over IP has been implemented in various ways using both proprietary protocols
and protocols based on open standards . VoIP protocols include:
● Session Initiation Protocol (SIP)
● H.323
● Media Gateway Control Protocol (MGCP)
● Gateway Control Protocol (Megaco, H.248)
● Real-time Transport Protocol (RTP)
● Real-time Transport Control Protocol (RTCP)
● Secure Real-time Transport Protocol (SRTP)
● Session Description Protocol (SDP)
● Inter-Asterisk eXchange (IAX)
● Jingle XMPP VoIP extensions
● Skype protocol
Working of VoIP
Voice is converted from an analog signal to a digital signal. It is then sent over the
Internet in data packets to a location that will be close to the destination. Then it will be
converted back to an analog signal for the remaining distance over a traditional circuit
switched (PSTN) (unless it is VoIP to VoIP). Your call can be received by traditional
telephones worldwide, as well as other VoIP users. VoIP to VoIP calls can travel entirely over
the Internet. Since your voice is changed to digital (so that it can travel over the Internet),
other great features such as voice messages to email, call forwarding, logs of incoming and
outgoing calls, caller ID, etc., can be included in your basic calling plan all for one low price.
Skype was the first peer-to-peer IP telephony network, created by the developers of KaZaa. Skype uses wide-band codecs (iLBC, iSAC and iPCM, developed by GlobalIPSound) which allow it to maintain reasonable call quality at an available bandwidth of 32 kb/s.
Key Components of the Skype Network
Skype Client (SC): the Skype application, which can be used to place calls, send messages, and so on. The Skype network is an overlay network and thus each SC needs to build and refresh
a table of reachable nodes. In Skype, this table is called host cache (HC) and it contains IP
address and port number of super nodes. This host cache is stored in an XML file called
"shared.xml". Also, NAT and firewall information is stored in "shared.xml". If this file is not
present, SC tries to establish a TCP connection with each of the seven Skype maintained
default SNs IP address on port 33033.
● Super Node (SN): Super nodes are the endpoints where Skype clients connect to. Any
node with a public IP address having sufficient CPU, memory, and network bandwidth is a
candidate to become a super node and a Skype client cannot prevent itself from becoming a
super node. Also, if a SC cannot establish a TCP connection with a SN then it will report a
login failure.
● Skype Authentication Server: This is the only centralized Skype server which is used
to authenticate Skype users. An authenticated user is then announced to other peers and
buddies. If the user saves his/her credentials, authentication will not be necessary. This server
(IP address: 212.72.49.141 [Buddy list] or 195.215.8.141) also stores the buddy list for each
user. Note that the buddy list is also stored locally in an unencrypted file called "config.xml".
In addition, if two SCs have the same buddy, their corresponding config.xml files have a
different four-byte number for the same buddy. Finally, it has been shown that Skype routes
login messages through SNs if the authentication server is blocked.
What is BitTorrent?
Efficient content distribution system using file swarming. Usually does not perform all the
functions of a typical p2p system, like searching.
File sharing
To share a file or group of files, a peer first creates a .torrent file, a small file that
contains
(1) metadata about the files to be shared, and
(2) information about the tracker, the computer that coordinates the file distribution.
Peers first obtain a .torrent file, and then connect to the specified tracker, which tells
them from which other peers to download the pieces of the file.
File sharing
Large files are broken into pieces of size between 64 KB and 1 MB
Pipelining
• When transferring data over TCP, always have several requests pending at once, to avoid
a delay between pieces being sent. At any point in time, some number, typically 5, are
requested simultaneously.
• Every time a piece or a sub-piece arrives, a new request is sent out.
Piece Selection
• The order in which pieces are selected by different peers is critical for good performance.
• If an inefficient policy is used, then peers may end up in a situation where each has an identical set of easily available pieces, and none of the missing ones.
• If the original seed is prematurely taken down, then the file cannot be completely downloaded! What are "good policies"?
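One commonly cited good policy is "rarest first": request the missing piece that the fewest neighbouring peers currently hold, so that rare pieces spread quickly and the swarm depends less on the original seed. A minimal sketch follows; the data structures are illustrative and not part of any BitTorrent library.

def rarest_first(my_pieces, neighbour_bitfields):
    """Pick the missing piece held by the fewest neighbours.

    my_pieces: set of piece indices this peer already has
    neighbour_bitfields: list of sets, one per neighbour, of piece indices that neighbour holds
    """
    counts = {}
    for bitfield in neighbour_bitfields:
        for piece in bitfield:
            if piece not in my_pieces:
                counts[piece] = counts.get(piece, 0) + 1
    if not counts:
        return None                      # nothing useful to request
    return min(counts, key=counts.get)   # rarest available piece

# Example: this peer has piece 0; piece 2 is held by only one neighbour, so it is chosen.
print(rarest_first({0}, [{0, 1, 2}, {0, 1}, {1}]))   # -> 2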
Threads share the same address space. Thread context switching can be done entirely independently of the operating system.
Creating and destroying threads is much cheaper than doing so for processes.
Process State
As a process executes, it changes
state. The state of a process is
defined in part by the current activity
of that process. A process may be in
one of the following states:
• New. The process is being created.
• Running. Instructions are being executed.
• Waiting. The process is waiting for some event to occur (such as an I/O completion or
reception of a signal).
• Ready. The process is waiting to be assigned to a processor.
• Terminated. The process has finished execution.
CPU switch from process to process
A solution lies in a hybrid form of user-level and kernel-level threads, generally referred to as lightweight processes (LWP). An LWP runs in the context of a single (heavy-weight) process, and there can be several LWPs per process. In addition to having LWPs, a system also offers a user-level thread package, offering applications the usual operations for creating and destroying threads. In addition, the package provides facilities for thread synchronization such as mutexes and condition variables. The important issue is that the thread package is implemented entirely in user space.
Attempts to mix user-level and kernel-level threads into a single concept exist; however, the performance gain has not turned out to outweigh the increased complexity.
Types of VM
Different kinds of virtual machines, each with different functions:
• System virtual machines (also termed full virtualization VMs) provide a substitute for a real machine. They provide the functionality needed to execute entire operating systems. A hypervisor uses native execution to share and manage hardware, allowing for multiple environments which are isolated from one another, yet exist on the same physical machine.
Fig b): A system that is essentially implemented as a layer completely shielding the original hardware, but offering the complete instruction set of that same (or other) hardware as an interface. Crucial is the fact that this interface can be offered simultaneously to different programs. As a result, it is now possible to have multiple, and different, guest operating systems run independently and concurrently on the same platform.
Process VM: A program is compiled to intermediate (portable) code, which is then executed
by a runtime system (Example: Java VM).
Client-Side Software
• access transparency: client-side stubs for RPCs
• location/migration transparency: let client-side
software keep track of actual location.
• replication transparency: multiple invocations
handled by client stub.
• failure transparency: can often be placed only
at client (we’re trying to mask server and communication failures).
General Design Issues
Server Clusters
Request Handling: Having the first tier handle all
communication from/to the cluster may lead to a bottleneck.
Iterative vs. concurrent servers: Iterative servers can handle only one client at a time, in
contrast to concurrent servers
Thick Client:
A thick client, also known as a fat, rich or heavy client, is one of the components of client-server architecture; it is connected to the server through a network connection but does not consume the server's computing resources to execute applications.
Why To Select Thick Client?
A thick client is a type of client device in client-server architecture that has most hardware
resources on board to perform operations, run applications and perform other functions
independently.
Although a thick client can perform most operations, it still needs to be connected to the
primary server to download programs
Where To Implement Thick Client?
Thick clients are generally implemented in computing environments when the primary
server has low network speed, limited computing and storage capacity to facilitate client
machine, or there is a need to work offline.
Thin Client
A thin client also known as Lean, Zero or Slim Client is a computer or computer program
that depends heavily on some other computer (server) to fulfill its computational roles.
Why To Select Thin Client?
Thin clients occur as components of a broader computer infrastructure, where many
clients share their computations with the same server. As such, thin client
infrastructure can be viewed as providing some computing service via several user
interfaces. This is desirable in contexts where individual thick clients have much more
functionality or power than the infrastructure requires.
Thin client computing is also a way of easily maintaining computational services at a
reduced total cost of ownership.
3-tier architecture
In this variety of client-server context, an extra middleware layer is used: the client request goes to the server through that middle layer, and the response of the server is received by the middleware first and then passed to the client. This architecture improves on the 2-tier architecture and gives good performance. The system is more expensive but simple to use. The middleware stores all the business logic and data-passage logic; it takes care of database staging, queuing, application execution, scheduling, etc. Middleware improves flexibility and gives the best performance.
The Three-tier architecture is split into 3 parts, namely, The presentation layer (Client
Tier), Application layer (Business Tier) and Database layer (Data Tier). The Client system
manages Presentation layer; the Application server takes care of the Application layer, and the
Server system supervises Database layer.
In the present scenario of online business, there have been growing demands for quick responses and quality services. Therefore, a well-designed client architecture is crucial for business activities. Companies usually explore possibilities to keep service and quality at levels that maintain their marketplace with the help of client-server architecture. The architecture increases productivity through the practice of cost-efficient user interfaces, improved data storage, expanded connectivity and secure services.
4-tier architecture: the tiers are the Database, Application, Presentation and Client tiers.
Reference: https://www.ibm.com/support/knowledgecenter/en/SSLTBW_2.3.0/com.ibm.zos.v2r3.halc001/o4ag1.htm
A collection of web documents can be viewed as a directed document graph, where each document is a node and each hyperlink is a directed link from one node to another.
The DC-Apache solution takes this graph-based approach and is built upon the hypothesis that most web sites only have a few well-known entry points from which users start navigating through the documents on these sites. Empirical studies performed on the current prototype system indicate that the DC-Apache system has a high potential for achieving linear scalability by effectively removing potential bottlenecks caused by centralized resources.
Apache allows its functionality to be extended by linking new modules directly into its binary code.
The main process of the Apache server dispatches requests to several child processes.
Each child process services the request in the request processing cycle in several phases.
Using the handler mechanism, we implemented the DC-Apache system as an Apache module
that processes requests in the request processing cycle. The function of pinger process is to
compute and collect load information about participating servers. It also carries out the task of
migrating and replicating documents. The shared memory contains the document graph and
statistics information.
Migration models
• Process = code seg + resource seg + execution seg
• Weak versus strong mobility
– Weak => transferred program starts from initial state
• Sender-initiated versus receiver-initiated
– Sender-initiated (code is with sender)
• Client sending a query to database server
• Client should be pre-registered
– Receiver-initiated
• Java applets
• Receiver can be anonymous
• Code migration:
– Execute in a separate process
– [Applets] Execute in target process
• Process migration
– Remote cloning
– Migrate the process
Actions to be taken with respect to the references to local resources when migrating code to
another machine.
• GR: establish global system-wide reference
• MV: move the resources
• CP: copy the resource
• RB: rebind process to locally available resource
Machine Migration
Rather than migrating code or process, migrate an “entire machine” (OS + all processes)
– Feasible if virtual machines are used
– Entire VM is migrated • Can handle small differences in architecture (Intel-AMD)
• Live VM Migration: migrate while executing
– Assume shared disk (no need to migrate disk state)
– Iteratively copy memory pages (memory state)
• Subsequent rounds: send only pages dirtied in prior round
• Final round: Pause and switch to new machine
Server Cluster
• Sequential
– Serve one request at a time
– Can service multiple requests by employing events and asynchronous communication
• Concurrent
– Server spawns a process or thread to service each request
– Can also use a pre-spawned pool of threads/processes (apache)
• Thus servers could be
– Pure-sequential, event-based, thread-based, process-based
Scalability
• Question:How can you scale the server capacity?
• Buy bigger machine!
• Replicate
• Distribute data and/or algorithms
• Ship code instead of data
• Cache
Issue: Streams can be set up between two processes at different machines, or directly between
two different devices. Combinations are possible as well.
Implementing QoS
Problem: QoS specifications translate to resource reservations in the underlying communication system. There is no standard way of (1) specifying QoS, (2) describing resources, or (3) mapping specifications to reservations.
Approach: use the Resource reSerVation Protocol (RSVP) as a first attempt. RSVP is a transport-level protocol.
Stream Synchronization
Problem: Given a complex stream, how do you keep the different substreams in synch?
Example: Think of playing out two channels, that together form stereo sound. Difference
should be less than 20–30 µsec!
Isochronous transmission mode It is necessary that data units are transferred on time. Data
transfer is subject to bounded (delay) jitter.
Loosely-coupled communication
Sender is given guarantee that its
message will eventually be inserted in
recipient’s queue
No guarantee on timing, or message will
actually be read
Client
from socket import *

HOST, PORT = "localhost", 50007     # example server address (adjust to the actual server)
s = socket(AF_INET, SOCK_STREAM)
s.connect((HOST, PORT))             # connect to server (blocks until accepted)
s.send(b"Hello, world")             # send some data
data = s.recv(1024)                 # receive the response
print(data)                         # print the result
s.close()                           # close the connection
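A matching minimal server for the client above (not part of the original notes): it accepts a single connection on the same example port and echoes the client's message back as the response.

Server
from socket import *

HOST, PORT = "", 50007              # listen on all interfaces, same example port as the client
srv = socket(AF_INET, SOCK_STREAM)
srv.bind((HOST, PORT))              # bind the listening socket
srv.listen(1)                       # allow one pending connection
conn, addr = srv.accept()           # block until the client connects
data = conn.recv(1024)              # read the client's message
conn.send(data)                     # echo it back as the response
conn.close()
srv.close()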
Message-oriented middleware
Essence
Message broker
Observation
Message queuing systems assume a common messaging protocol: all applications agree on message format (i.e., structure and data representation).
IBM’s WebSphere MQ
Basic concepts
• Application-specific messages are put into, and removed from, queues
• Queues reside under the regime of a queue manager
• Processes can put messages only in local queues, or through an RPC mechanism
Message transfer
• Messages are transferred between queues
• Message transfer between queues at different processes requires a channel
• At each endpoint of a channel is a message channel agent
• Message channel agents are responsible for:
  o Setting up channels using lower-level network communication facilities (e.g., TCP/IP)
  o (Un)wrapping messages from/in transport-level packets
  o Sending/receiving packets
Stream-oriented communication
• Support for continuous media
• Streams in distributed systems
• Stream management
Continuous media
All communication facilities discussed so far are essentially based on a discrete, that is, time-independent exchange of information.
Characterized by the fact that values are time dependent:
• Audio
• Video
• Animations
• Sensor data (temperature, pressure, etc.)
Stream
Definition
A (continuous) data stream is a connection-oriented communication facility that supports isochronous data transmission.
Some common stream characteristics
• Streams are unidirectional
• There is generally a single source, and one or more sinks
• Often, either the sink and/or source is a wrapper around hardware
(e.g., camera, CD device, TV monitor)
• Simple stream: a single flow of data, e.g., audio or video
• Complex stream: multiple data flows, e.g., stereo audio or
combination audio/video
Enforcing QoS: There are various network-level tools, such as differentiated services, by which certain packets can be prioritized. Also, use buffers to reduce jitter.
How to reduce the effects of packet loss (when multiple samples are in a single packet)?
Alternative
Multiplex all substreams into a single stream, and demultiplex at the receiver. Synchronization is handled at the multiplexing/demultiplexing point (MPEG).
Multicast communication
• Application-level multicasting
• Gossip-based data dissemination
Application-level multicasting: organize the nodes of a distributed system into an overlay network and use that network to disseminate data.
Chord-based tree building
1. Initiator generates a multicast identifier mid.
2. Lookup succ(mid), the node responsible for mid.
3. Request is routed to succ(mid), which will become the root.
4. If P wants to join, it sends a join request to the root.
5. When request arrives at Q:
• Q has not seen a join request before => it becomes a forwarder; P becomes a child of Q. The join request continues to be forwarded.
• Q knows about the tree => P becomes a child of Q. No need to forward the join request anymore.
Example applications
Data dissemination: Perhaps the most important one. Note thatthere are many variants of
dissemination.
• Aggregation: Let every node i maintain a variable x_i. When two nodes gossip, they each reset their variable to (x_i + x_j)/2.
Result: in the end each node will have computed the average of all the x_i.
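A small simulation of this aggregation gossip (the random pairing below is an assumption; any fair peer-selection scheme drives every x_i towards the global average):

import random

x = [10.0, 0.0, 4.0, 2.0]                      # one value per node; the true average is 4.0
for _ in range(1000):                          # repeated gossip exchanges
    i, j = random.sample(range(len(x)), 2)     # pick two distinct nodes to gossip
    x[i] = x[j] = (x[i] + x[j]) / 2            # both reset their variable to the pairwise average
print(x)                                       # every entry converges towards 4.0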
IP Multicast
• IP has a specific multicast protocol
• Addresses from 224.0.0.0 to 239.255.255.255 are reserved for multicast
– They act as groups
– Some of these are reserved for specific multicast based protocols
• Any message sent to one of the addresses goes to all processes subscribed to the group
– Must be in the same “network”
– Basically depends on how routers are configured
• In a LAN, communication is broadcast
• In more complex networks, tree-based protocols can be used
• Any process interested in joining a group informs its OS
• The OS informs the “network”
– The network interface (LAN card) receives and delivers group messages to the OS &
process
– The router may need to be informed
– IGMP – Internet group management protocol
• Sender sends only once
• Any router also forwards only once
• No acknowledgement mechanism – Uses UDP
• No guarantee that intended recipient gets the message
• Often used for streaming media type content
• Not good for critical information
• Other applications will use this service to perform multicasts.
• We have to ensure that everything goes correctly
IP multicast
• Highly efficient bandwidth usage
• Key architectural decision: add support for multicast in the IP layer
• Scalability concern (with the number of groups): routers maintain per-group state
• IP Multicast is a best-effort multi-point delivery service: providing higher-level features such as reliability, congestion control, flow control, and security has shown to be more difficult than in the unicast case (a minimal receiver sketch follows below)
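A minimal receiver sketch that joins an IP multicast group using the standard socket options; the group address and port are arbitrary example values from the administratively scoped 239.0.0.0/8 range.

import socket, struct

GROUP, PORT = "239.1.2.3", 5007                      # example multicast group and port
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(("", PORT))                                # receive datagrams sent to the group port
mreq = struct.pack("4sl", socket.inet_aton(GROUP), socket.INADDR_ANY)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)   # OS informs the network (IGMP)
data, sender = sock.recvfrom(1024)                   # best-effort UDP delivery, no acknowledgement
print(data, sender)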
Middleware layers
Request-Reply communication is synchronous because the client process blocks until the
reply arrives from the server. It can also be reliable because the reply from the server is
effectively an acknowledgement to the client.
sendReply: is used to send the reply message to the client. When the reply message is
received by the client the original doOperation is unblocked and execution of the
client program continues.
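A bare-bones sketch of this request-reply pattern over UDP; the server address, message format and any names other than doOperation, getRequest and sendReply are assumptions, and a real implementation would add request identifiers, retransmission and duplicate filtering.

import socket

SERVER = ("localhost", 6000)                     # example server address

def doOperation(sock, request):
    sock.sendto(request, SERVER)                 # send the request message
    reply, _ = sock.recvfrom(1024)               # block until the reply arrives (synchronous)
    return reply                                 # the reply doubles as an acknowledgement

def getRequest(sock):
    # The server side would create a socket bound to SERVER and loop on getRequest/sendReply.
    return sock.recvfrom(1024)                   # block waiting for the next request

def sendReply(sock, reply, client_addr):
    sock.sendto(reply, client_addr)              # unblocks the client's doOperation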
For password-style authentication, the browser asks the user to type a name and password and submits the associated credentials with subsequent requests.
HTTP request message
HTTP Reply message
An RMI mechanism can be integrated directly with a programming language if that language includes an adequate notation for defining interfaces, allowing input and output parameters to be mapped onto the language's normal use of parameters.
This approach is useful when all the parts of a distributed application can be written in the
same language. It is also convenient because it allows the programmer to use a single
language, for example, Java, for local and remote invocation. However, many existing useful
services are written in C++ and other languages. It would be beneficial to allow programs
written in a variety of languages, including Java, to access them remotely.
Interface definition languages (IDLs) are designed to allow procedures implemented in
different languages to invoke one another. An IDL provides a notation for defining interfaces
in which each of the parameters of an operation may be described as for input or output in
addition to having its type specified.
Duplicate filtering: Controls when retransmissions are used and whether to filter out
duplicate requests at the server.
What is RMI?
• A core package of the JDK1.1+ that can be used to develop distributed applications
• Similar to the RPC mechanism found on other systems
• In RMI, methods of remote objects can be invoked from other JVMs
• In doing so, the programmer has the illusion of calling a local method (but all
arguments are actually sent to the remote object and results sent back to callers)
The Stub/Skeleton
• Client stub responsible for: – Initiate remote calls
– Marshal arguments to be sent
– Inform the remote reference layer to invoke the call – Unmarshaling the return value
– Inform remote reference the call is complete
• Server skeleton responsible for:
– Unmarshaling incoming arguments from client
– Calling the actual remote object implementation
– Marshaling the return value for transport back to client
Security
• While RMI is a straightforward method for creating distributed applications, there are some security issues you should be aware of:
– Objects are serialized and transmitted over the network in plain text
– No authentication: a client requests an object, and all subsequent communication is assumed to be from the same client
– No security checks on the registry
– No version control
Develop a client
• Example:DateClient.java
Stub
The stub is an object that acts as a gateway for the client side. All outgoing requests are routed through it. It resides at the client side and represents the remote object. When the caller invokes a method on the stub object, it does the following tasks:
1. It initiates a connection with remote Virtual Machine (JVM),
2. It writes and transmits (marshals) the parameters to the remote Virtual Machine
(JVM),
3. It waits for the result
4. It reads (unmarshals) the return value or exception, and
5. It finally, returns the value to the caller.
Skeleton
The skeleton is an object that acts as a gateway for the server-side object. All incoming requests are routed through it. When the skeleton receives an incoming request, it does the following tasks:
1. It reads the parameter for the remote method
2. It invokes the method on the actual remote object, and
3. It writes and transmits (marshals) the result to the caller.
In the Java 2 SDK, a stub protocol was introduced that eliminates the need for skeletons.
An RMI application has all these features, so it is called a distributed application.
RMI Example
In this example, we have followed all the 6
steps to create and run the rmi application. The client
application need only two files, remote interface and
client application. In the rmi application, both client
and server interacts with the remote interface. The
client application invokes methods on the proxy
object, RMI sends the request to the remote JVM.
The return value is sent back to the proxy object and
then to the client application.
For creating the remote interface, extend the Remote interface and declare the
RemoteException with all the methods of the remote interface. Here, we are creating a remote
interface that extends the Remote interface. There is only one method named add() and it
declares RemoteException.
import java.rmi.*;
public interface Adder extends Remote {
public int add(int x,int y)throws RemoteException;
}
Now provide the implementation of the remote interface. For providing the
implementation of the Remote interface, we need to
• Either extend the UnicastRemoteObject class,
• or use the exportObject() method of the UnicastRemoteObject class
In case, you extend the UnicastRemoteObject class, you must define a constructor that
declares RemoteException.
import java.rmi.*;
import java.rmi.server.*;
public class AdderRemote extends UnicastRemoteObject implements Adder {
AdderRemote()throws RemoteException {
super();
}
public int add(int x,int y) {return x+y;}
}
3) create the stub and skeleton objects using the rmic tool.
Next step is to create stub and skeleton objects using the rmi compiler. The rmic tool
invokes the RMI compiler and creates stub and skeleton objects.
1. rmic AdderRemote
Now start the registry service by using the rmiregistry tool. If you don't specify the
port number, it uses a default port number. In this example, we are using the port number
5000.
rmiregistry 5000
At the client we are getting the stub object by the lookup() method of the Naming class
and invoking the method on this object. In this example, we are running the server and client
applications, in the same machine so we are using localhost. If you want to access the remote
object from another machine, change the localhost to the host name (or IP address) where the
remote object is located.
import java.rmi.*;
public class MyClient {
    public static void main(String args[]) {
        try {
            // Look up the remote object's stub in the registry and invoke the method on it.
            Adder stub = (Adder) Naming.lookup("rmi://localhost:5000/sonoo");
            System.out.println(stub.add(34, 4));
        } catch (Exception e) {
            e.printStackTrace();   // report lookup or invocation failures instead of swallowing them
        }
    }
}
String [] list()
This method returns an array of Strings containing the names bound in the registry.
Java class ShapeListServer with main method
import java.rmi.*;
public class ShapeListServer {
    public static void main(String args[]) {
        System.setSecurityManager(new RMISecurityManager());
        try {
            ShapeList aShapeList = new ShapeListServant();
            Naming.rebind("Shape List", aShapeList);
            System.out.println("ShapeList server ready");
        } catch (Exception e) {
            System.out.println("ShapeList server main " + e.getMessage());
        }
    }
}
2. CORBA
• Object Management Group, (OMG) formed in 1989
• The Common Object Request Broker Architecture (CORBA) is a standard
defined by the Object Management Group (OMG) that enables software
components written in multiple computer languages and running on multiple
computers to work together (i.e., it supports multiple platforms).
• Focus on integration of systems and applications across heterogeneous
platforms.
3. Thus CORBA allows applications and their objects to communicate with each other no matter where they are or who designed them.
• The only REAL competitor is, of course, MICROSOFT DCOM.
• After soliciting input, the CORBA standard was defined and introduced in 1991.
4. CORBA
• When introduced in 1991, CORBA defined the Interface Design Language,
(IDL) and Application Programming Interface, (API).
• These allow client/server interaction within a specific implementation of an
Object Request Broker, (ORB).
• The client sends an ORB request to the SERVER/OBJECT IMPLEMENTATION, and this in turn returns either an ORB Result or an Error to the client.
• CORBA is just a specification for creating and using distributed objects
• CORBA is not a programming language.
• CORBA is a standard (not a product!)
• Allows objects to transparently make requests and receive responses.
5. CORBA Architecture
• The CORBA architecture is based on the object model.
• A CORBA-based system is a collection of objects that isolates the requestors
of services (clients) from the providers of services(servers) by a well-defined
encapsulating interface.
• CORBA is composed of five major components: ORB, IDL, dynamic
invocation interface(DII), interface repositories (IR), and object adapters (OA).
Server Transparency:
The client is, as far as the programming model is concerned, ignorant of the existence of servers. The client does not know (and cannot find out) which server hosts which objects.
Language Transparency:
Client and server can be written in different languages. This fact encapsulates the whole point of CORBA; that is, the strengths of different languages can be utilized to develop different aspects of a system, which can interoperate through IDL. A server can be implemented in a different language without clients being aware of this.
Implementation Transparency:
The client is unaware of how objects are implemented. A server can use ordinary flat files as its persistent store today and use an OO database tomorrow, without clients ever noticing a difference (other than performance).
Architecture Transparency:
The idiosyncrasies of CPU architectures are hidden from both clients and servers. A little-endian client can communicate with a big-endian server with different alignment restrictions.
Protocol Transparency:
Clients and servers do not care about the data link and transport layer. They can communicate via token ring, Ethernet, wireless links, ATM (Asynchronous Transfer Mode), or any number of other networking technologies.
▪ Figure 2 shows a typical layered module structure for the implementation of a non-
distributed file system in a conventional operating system.
▪ File systems are responsible for the organization, storage, retrieval, naming, sharing
and protection of files.
▪ Files contain both data and attributes.
▪ Above figure summarizes the main operations on files that are available to
applications in UNIX systems.
▪ Distributed File system requirements: Related requirements in distributed file
systems are:
❖ Transparency
❖ Concurrency
❖ Replication
❖ Heterogeneity
❖ Fault tolerance
❖ Consistency
❖ Security
❖ Efficiency
Access control
• UNIX checks access rights when a file is opened
o subsequent checks during read/write are not necessary
• distributed environment
o server has to check
o stateless approaches
▪ access check once when UFID is issued
• client gets an encoded "capability" (who can access and how)
• capability is submitted with each subsequent request
▪ access check for each request.
• second is more common
Andrew File System (AFS): The main components of the Vice service interface
Caching in NFS
• Traditional UNIX
o Caches file blocks, directories, and file attributes
• Uses read-ahead (prefetching), and delayed-write (flushes every 30 seconds)
• NFS servers
• Same as in UNIX, except server’s write operations perform write-through
o Otherwise, failure of server might result in undetected loss of data by clients
• NFS clients
o Caches results of read, write, getattr, lookup, and readdir operations
o Possible inconsistency problems
▪ Writes by one client do not cause an immediate update of other clients’
caches
• File reads
o When a client caches one or more blocks from a file, it also caches a timestamp
indicating the time when the file was last modified on the server
o Whenever a file is opened, and the server is contacted to fetch a new block
from the file, a validation check is performed
▪ Client requests last modification time from server, and compares that
time to its cached timestamp
▪ If modification time is more recent, all cached blocks from that file are
invalidated
▪ Blocks are assumed to be valid for the next 3 seconds (30 seconds for directories); a small sketch of this validity check appears at the end of this caching discussion.
• File writes
o When a cached page is modified, it is marked as dirty, and is flushed when the
file is closed, or at the next periodic flush
o Now two sources of inconsistency: delay after validation, delay until flush
• Caching : Server caching
o caching file pages, directory/file attributes
o read-ahead: prefetch pages following the most-recently read file pages
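The client-side validity check described above can be summarised as follows. This is only a sketch: getattr_from_server stands in for the actual getattr RPC, and the 3-second/30-second freshness intervals are the ones quoted above.

FRESHNESS_FILE, FRESHNESS_DIR = 3, 30            # seconds, as stated above

def cache_entry_valid(now, last_validated, tm_client, getattr_from_server, freshness=FRESHNESS_FILE):
    # Within the freshness interval: trust the cached blocks without contacting the server.
    if now - last_validated < freshness:
        return True
    # Otherwise ask the server for the file's last-modification time and compare timestamps.
    tm_server = getattr_from_server()
    if tm_server == tm_client:
        return True                              # unchanged on the server; revalidate the entry
    return False                                 # modified remotely; invalidate all cached blocks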
▪ Zone data are stored by name servers in files in one of several fixed types of resource
record.
(Figure 5)
Directory services
Components
• A data model
• A protocol for searching
• A protocol for reading
• A protocol for updating
• Methods for replication
• Methods for distribution
NIS server
• Server that has information accessible through NIS
• Serves one or more domains
NIS client
• Host that uses NIS as a directory
Protocol
• RPC based
• No security
• No updates
• Replication support
Distribution
• No distribution support!
Data model
• Directories known as maps
• Simple key-value mapping
• Values have no structure
Master server
• Maps built from text files
• Maps in /var/yp
• Maps built with make
• Maps stored in binary form
• Replication to slaves with yppush
Slave servers
• Receive data from master
• Load balancing and failover
Processes/commands
• ypserv Server process
• ypbind Client process
• ypcat To view maps
• ypmatch To search maps
• ypwhich Show status
• yppasswdd Change password
NIS client
• Knows its NIS domain
• Binds to a NIS server
Two options
• Broadcast
• Hard coded NIS-server
ypbind
Security problems
• No access control
• Broadcast for binding
• Patched as an afterthought
Primitive protocol
• No updates
• Hack for password change
• Search only on key
• Primitive data model
Scalability
• Hierarchical namespace
• Distributed administration
Security
• Authentication of server, client and user
• Access control on per-cell level
New protocol
• Updates through NIS+
• General searches
• Data model with real tables
LDAP Protocol
• TCP-based
• Fine-grained access control
• Support for updates
• Flexible search protocol
TYPE
• SOA – Start of authority
• NS – Name server
• MX – Mail exchanger
• A – Address
• A6 – IPv6 address
• AAAA – IPv6 address
• PTR – Domain name pointer
RDATA
• Binary data, hardcoded format
• TYPE determines format
• DNS: Namespace
Names
• Dot-separated parts
• one.part.after.another
FQDN
• Fully Qualified Domain Name
• Complete name
• Always ends in a dot
Partial name
• Suffix of name implicit
• Does not end in a dot
Namespace
• Global and hierarchical
DNS: Replication
Secondary/slave nameserver
• Indicated by NS RR
• Data transfer with AXFR/IXFR
Questions
• How does a slave NS know when there is new information?
• How often should a slave NS attempt to update?
• How long is replicated data valid?
Rule of thumb
• Every zone needs at least two nameservers
• DNS: Distribution
Delegation
• A NS can delegate responsibility for a subtree to another NS
• Only entire subtrees can be delegated
Zone
• The part of the namespace that a NS is authoritative for
• Defined by SOA and NS
Domain
• A subtree of the namespace
DNS: Delegation
Delegating NS
NS record for delegated zone
A record (glue) for NS when needed
Example
a.example.com NS ns2.xmp.com
b.xmp.com NS ns.b.xmp.com
ns.b.xmp.com A 10.1.2.3
Delegated-to NS
SOA record for the zone
Example
b.xmp.com SOA ( ns.b.xmp.com
dns.xmp.com
20040909001
24H
2H
1W
2D )
DNS: Delegation
SERIAL
• Increase for every update
• Date format common
• 20040909001
REFRESH/RETRY
How often secondary NS
updates the zone
MINIMUM
How long to cache NXDOMAIN
DNS: Caching
• Caching creates scalability
• Caching reduces tree traversal
• Caching of A and PTR records reduces duplicate DNS queries
Example
$TTL 4H
SOA (MNAME RNAME
SERIAL REFRESH
RETRY 1H )
24H NS ns
ns 24H A 10.1.2.3
• However, the earth is slowing! (35 days less in a year over 300 million years)
• There are also short-term variations caused by turbulence deep in the earth’s core.
◼ A large number of days (n) were averaged to obtain the mean day length, which was then divided by 86,400 to determine the mean solar second.
◼ Physicists take over from astronomers and count the transitions of cesium 133 atom
◼ 9,192,631,770 cesium transitions == 1 solar second
◼ 50 International labs have cesium 133 clocks.
◼ The Bureau Internationale de l’Heure (BIH) averages reported clock ticks to
produce the International Atomic Time (TAI).
◼ The TAI is mean number of ticks of cesium 133 clocks since midnight on
January 1, 1958 divided by 9,192,631,770 .
◼ To adjust for lengthening of mean solar day, leap seconds are used to translate
TAI into Universal Coordinated Time (UTC).
◼ Computer timers go off H times/sec, and increment the count of ticks (interrupts) since
an agreed upon time in the past.
◼ This clock value is C.
◼ Using UTC time, the value of clock on machine p is Cp(t).
◼ For a perfect time, Cp(t) = t and dC/dt = 1.
For an ideal timer with H = 60, it should generate 216,000 ticks per hour.
Cristian's Algorithm
◼ Assume one machine (the time server) has a WWV receiver and all other machines are
to stay synchronized with it.
◼ At most every δ/2ρ seconds (where ρ is the maximum clock drift rate and δ the maximum tolerated skew), each machine sends a message to the time server asking for the current time.
◼ Time server responds with message containing current time, CUTC.
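A sketch of the client side of Cristian's algorithm: get_server_time is a hypothetical call representing the request to the time server, and the essential step is compensating for half of the measured round-trip time.

import time

def cristian_sync(get_server_time):
    t0 = time.monotonic()                  # record the send time
    c_utc = get_server_time()              # server replies with its current UTC time
    t1 = time.monotonic()                  # record the receive time
    round_trip = t1 - t0
    # Estimate the current time as the server's reply plus half the round-trip delay.
    return c_utc + round_trip / 2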
The Berkeley Algorithm
a) The time daemon asks all the other machines for their clock values.
b) The machines answer and the time daemon computes the average.
c) The time daemon tells everyone how to adjust their clock.
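The daemon's computation amounts to averaging the collected clock values and telling each machine how much to adjust, as in this sketch; the numeric example is a commonly used illustration (clock readings 3:00, 2:50 and 3:25 expressed in minutes).

def berkeley_adjustments(daemon_time, reported_times):
    """daemon_time: the daemon's own clock; reported_times: values reported by the other machines."""
    values = [daemon_time] + list(reported_times)
    average = sum(values) / len(values)            # target time for every machine
    # Each machine is told how much to add to (or subtract from) its clock.
    return [average - t for t in values]

# Example: daemon at 3:00, others at 2:50 and 3:25 -> average 3:05.
print(berkeley_adjustments(180, [170, 205]))       # minutes: [5.0, 15.0, -20.0]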
Averaging Algorithms
◼ Every R seconds, each machine broadcasts its current time.
◼ The local machine collects all other broadcast time samples during some time interval,
S.
◼ The simple algorithm:: the new local time is set as the average of the value received
from all other machines.
◼ A slightly more sophisticated algorithm :: Discard the m highest and m lowest to
reduce the effect of a set of faulty clocks.
◼ Another improved algorithm :: Correct each message by adding to the received time
an estimate of the propagation time from the ith source.
◼ extra probe messages are needed to use this scheme.
◼ One of the most widely used algorithms in the Internet is the Network Time Protocol
(NTP).
◼ Achieves worldwide accuracy in the range of 1-50 msec.
Lamport Timestamps
a) Each processes with own clock with different rates.
b) Lamport's algorithm corrects the clocks.
c) Can add machine ID to break ties
Key Ideas
Processes exchange messages
Message must be sent before received
Send/receive used to order events and to synchronize clocks
Happened before relation
Causally ordered events
Concurrent events
Implementation
Limitation of Lamport’s clock
Happened before relation
• a -> b : Event a occurred before event b. Events in the same process p1.
• b -> c : If b is the event of sending a message m1 in a process p1 and c is the
event of receipt of the same message m1 by another process p2.
a -> b, b -> c, then a -> c; the "->" relation is transitive.
Causally Ordered Events
a -> b : Event a “causally” affects event b
Concurrent Events
a || e: if a !-> e and e !-> a
Algorithm
Sending end
time = time+1;
time_stamp = time;
send(message, time_stamp);
Receiving end
(message, time_stamp) = receive();
time = max(time_stamp, time)+1;
Limitations
• m1 -> m3 implies C(m1) < C(m3)
• m2 -> m3 implies C(m2) < C(m3)
But from the clock values alone we cannot tell whether m1 or m2 caused m3 to be sent.
• Lamport’s logical clocks lead to a situation where all events in a distributed system
are totally ordered. That is, if a -> b, then we can say C(a)<C(b).
• Unfortunately, with Lamport’s clocks, nothing can be said about the actual time of a
and b. If the logical clock says a -> b, that does not mean in reality that a actually
happened before b in terms of real time.
• The problem with Lamport clocks is that they do not capture causality.
• If we know that a -> c and b -> c we cannot say which action initiated c.
• This kind of information can be important when trying to replay events in a distributed
system (such as when trying to recover after a crash).
• The theory goes that if one node goes down, if we know the causal relationships
between messages, then we can replay those messages and respect the causal
relationship to get that node back up to the state it needs to be in.
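Vector clocks address this: each process keeps one counter per process, and causality can be read off by comparing vectors component-wise. A minimal sketch follows; the process count N and the surrounding message-passing machinery are assumptions.

N = 3                                            # number of processes (assumed)

def new_clock():
    return [0] * N

def local_event(clock, pid):
    clock[pid] += 1                              # tick own entry for an internal or send event
    return clock

def on_receive(clock, pid, msg_clock):
    # Merge: component-wise maximum with the timestamp on the message, then tick own entry.
    for k in range(N):
        clock[k] = max(clock[k], msg_clock[k])
    clock[pid] += 1
    return clock

def happened_before(a, b):
    # a -> b iff a <= b component-wise and a != b; otherwise the events may be concurrent.
    return all(x <= y for x, y in zip(a, b)) and a != b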
System Model:
The system consists of N sites, S1, S2, ..., SN.
We assume that a single process is running on each site. The process at site Si is
denoted by pi.
A site can be in one of the following three states: requesting the CS, executing the CS,
or neither requesting nor executing the CS (i.e., idle).
In the ‘requesting the CS’ state, the site is blocked and can not make further requests
for the CS. In the ‘idle’ state, the site is executing outside the CS.
In token-based algorithms, a site can also be in a state where a site holding the token is
executing outside the CS (called the idle token state).
At any instant, a site may have several pending requests for CS. A site queues up these
requests and serves them one at a time.
Performance Metrics
The performance is generally measured by the following four metrics:
Synchronization delay: after a site leaves the CS, the time required before the next site enters the CS.
System throughput: The rate at which the system executes requests for the CS.
system throughput = 1/(SD+E)
where SD is the synchronization delay and
E is the average critical section execution time.
Lamport’s Algorithm:
Requests for CS are executed in the increasing order of timestamps and time is
determined by logical clocks.
Every site Si keeps a queue, request queuei , which contains mutual exclusion requests
ordered by their timestamps.
This algorithm requires communication channels to deliver messages in FIFO order.
Algorithm: Requesting the critical section: When a site Si wants to enter the CS, it
broadcasts a REQUEST(tsi , i) message to all other sites and places the request on request
queuei . ((tsi , i) denotes the timestamp of the request.)
When a site Sj receives the REQUEST(tsi , i) message from site Si ,places site Si ’s
request on request queuej and it returns a timestamped REPLY message to Si .
Executing the critical section: Site Si enters the CS when the following two
conditions hold:
L1: Si has received a message with timestamp larger than (tsi , i) from all other sites.
L2: Si ’s request is at the top of request queuei .
When a site removes a request from its request queue, its own request may come at the
top of the queue, enabling it to enter the CS.
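The two entry conditions L1 and L2 can be expressed directly in code. The sketch below models only the decision made at a single site Si; message transport and the REPLY/RELEASE handling are assumed to exist elsewhere.

import heapq

class LamportMutexSite:
    """Decision logic at one site Si (sketch; messaging happens outside this class)."""
    def __init__(self, site_id, num_sites):
        self.site_id = site_id
        self.num_sites = num_sites
        self.request_queue = []        # (timestamp, site) pairs kept as a min-heap
        self.latest_ts = {}            # highest timestamp seen in any message from each other site

    def on_request(self, ts, site):
        heapq.heappush(self.request_queue, (ts, site))   # every REQUEST is queued in timestamp order

    def on_message(self, ts, site):
        self.latest_ts[site] = max(self.latest_ts.get(site, 0), ts)

    def can_enter_cs(self, my_ts):
        others = [s for s in range(self.num_sites) if s != self.site_id]
        # L1: a message with a larger timestamp has arrived from every other site.
        l1 = all(self.latest_ts.get(s, 0) > my_ts for s in others)
        # L2: this site's own request is at the head of its request queue.
        l2 = bool(self.request_queue) and self.request_queue[0] == (my_ts, self.site_id)
        return l1 and l2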
4.5 Election Algorithms
• Any process can serve as coordinator
• Any process can “call an election” (initiate the algorithm to choose a new
coordinator).
– There is no harm (other than extra message traffic) in having multiple
concurrent elections.
• Elections may be needed when the system is initialized, or if the coordinator crashes or
retires.
Assumption
• Every process/site has a unique ID; e.g.
– the network address
– a process number
• Every process in the system should know the values in the set of ID numbers, although
not which processors are up or down.
• The process with the highest ID number will be the new coordinator.
Process groups (as with ISIS toolkit or MPI) satisfy these requirements.
Requirements:
• When the election algorithm terminates a single process has been selected and
every process knows its identity.
• Formalize: every process pi has a variable ei to hold the coordinator’s process number.
– ∀i, ei = undefined or ei = P, where P is the non-crashed process with highest id
– All processes (that have not crashed) eventually set ei = P.
Analysis
• Works best if communication in the system has bounded latency so processes can
determine that a process has failed by knowing the upper bound (UB) on message
transmission time (T) and message processing time (M).
– UB = 2 * T + M
However, if a process calls an election when the coordinator is still active, the
coordinator will win the election.
Ring Algorithm – Overview
• The ring algorithm assumes that the processes are arranged in a logical ring and each process knows the order of processes in the ring.
• Processes are able to “skip” faulty systems: instead of sending to process j, send to j + 1.
• Faulty systems are those that don’t respond in a fixed amount of time.
• P thinks the coordinator has crashed; builds an ELECTION message which contains its
own ID number.
• Sends to first live successor
• Each process adds its own number and forwards to next.
• OK to have two elections at once.
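A sketch of one pass of the ring election: the ring is modelled as a list of process IDs, crashed processes are marked dead and skipped, and the highest collected ID becomes the new coordinator.

def ring_election(ids, alive, starter):
    """ids: process IDs in ring order; alive: parallel list of booleans; starter: index that starts the election."""
    n = len(ids)
    collected = [ids[starter]]                   # the ELECTION message starts with the initiator's ID
    i = (starter + 1) % n
    while i != starter:
        if alive[i]:                             # skip faulty systems
            collected.append(ids[i])             # each live process adds its own ID and forwards
        i = (i + 1) % n
    return max(collected)                        # the COORDINATOR message announces the highest ID

# Example: the process with ID 7 has crashed, so 6 wins the election started by the process with ID 2.
print(ring_election([3, 6, 2, 7, 5], [True, True, True, False, True], starter=2))   # -> 6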
Strict Consistency
Any read on a data item ‘x’ returns a value corresponding to the result of the most recent
write on ‘x’ (regardless of where the write occurred).
With Strict Consistency, all writes are instantaneously visible to all processes
and absolute global time order is maintained throughout the distributed system.
This is the consistency model “Holy Grail” – not at all easy in the real world, and all
but impossible within a DS.
Sequential Consistency
A weaker consistency model, which represents a relaxation of the rules.
It is also much easier (indeed, possible) to implement.
Sequential Consistency: The result of any execution is the same as if the (read and write) operations by all processes on the data-store were executed in some sequential order, and the operations of each individual process appear in this sequence in the order specified by its program.
Example:
· Three concurrently executing processes P1, P2, and P3.
· Three integer variables x, y, and z, which are stored in a (possibly distributed) shared sequentially consistent data store.
· Assume that each variable is initialized to 0.
· An assignment corresponds to a write operation, whereas a print statement corresponds to a simultaneous read operation of its two arguments.
· All statements are assumed to be indivisible.
· Various interleaved execution sequences are possible.
· With six independent statements, there are potentially 720 (6!) possible execution sequences.
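The count can be checked directly: there are 6! = 720 orderings of the six statements, but only those that keep each process's two statements in program order are admissible under sequential consistency. A quick enumeration follows; the concrete statements chosen for the three processes are illustrative.

from itertools import permutations

program = {"P1": ["x=1", "print(y,z)"],
           "P2": ["y=1", "print(x,z)"],
           "P3": ["z=1", "print(x,y)"]}          # three two-statement processes (illustrative)

stmts = [(p, s) for p, body in program.items() for s in body]

def respects_program_order(order):
    # Within each process, its first statement must appear before its second.
    for p, body in program.items():
        positions = [order.index((p, s)) for s in body]
        if positions != sorted(positions):
            return False
    return True

all_orders = list(permutations(stmts))
legal = [o for o in all_orders if respects_program_order(o)]
print(len(all_orders), len(legal))               # 720 orderings in total, 90 preserve program order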
Example:
· Interaction through a distributed shared database.
· Process P1 writes data item x.
· Then P2 reads x and writes y.
· Reading of x and writing of y are potentially causally related because the computation
of y may have depended on the value of x as read by P2 (i.e., the value written by P1).
· Conversely, if two processes spontaneously and simultaneously write two different data items, these are not causally related.
· Operations that are not causally related are said to be concurrent.
· For a data store to be considered causally consistent, it is necessary that the store obeys
the following condition:
o Writes that are potentially causally related must be seen by all processes in the
same order.
o Concurrent writes may be seen in a different order on different machines.
NOTE: The writes W2(x)b and W1(x)c are concurrent, so it is not required that all processes
see them in the same order.
Example:
Weak Consistency
· Not all applications need to see all writes, let alone seeing them in the same order.
· This leads to Weak Consistency (which is primarily designed to work with distributed critical sections).
· This model introduces the notion of a "synchronization variable", which is used to update all copies of the data-store.
Properties Weak Consistency:
1. Accesses to synchronization variables associated with a data-store are sequentially
consistent.
2. No operation on a synchronization variable is allowed to be performed until all previous
writes have been completed everywhere.
3. No read or write operation on data items are allowed to be performed until all previous
operations to synchronization variables have been performed.
Meaning: by doing a sync,
· a process can force the just written value out to all the other replicas.
· a process can be sure it’s getting the most recently written value before it reads.
Essence: the weak consistency models enforce consistency on a group of operations, as
opposed to individual reads and writes (as is the case with strict, sequential, causal and FIFO
consistency).
Grouping Operations
· Accesses to synchronization variables are sequentially consistent.
· No access to a synchronization variable is allowed to be performed until all
previous writes have completed everywhere.
· No data access is allowed to be performed until all previous accesses to
synchronization variables have been performed.
Convention: when a process enters its critical section it should acquire the
relevant synchronization variables, and likewise when it leaves the critical section, it releases
these variables.
Critical section: a piece of code that accesses a shared resource (data structure or device) that
must not be concurrently accessed by more than one thread of execution.
Synchronization variables: are synchronization primitives that are used to coordinate the
execution of processes based on asynchronous events.
· When allocated, synchronization variables serve as points upon which one or more
processes can block until an event occurs.
· Then one or all of the processes can be unblocked at the same time.
· Each synchronization variable has a current owner, namely, the process that last acquired
it.
o The owner may enter and exit critical sections repeatedly without having to
send any messages on the network.
o A process not currently owning a synchronization variable but wanting to
acquire it has to send a message to the current owner asking for ownership and
the current values of the data associated with that synchronization variable.
o It is also possible for several processes to simultaneously own a
synchronization variable in nonexclusive mode, meaning that they can read,
but not write, the associated data.
Note that the data in a process's critical section may be associated with different
synchronization variables.
Entry consistency
● Acquire and release are still used, and the data-store meets the following conditions:
● An acquire access of a synchronization variable is not allowed to perform with respect
to a process until all updates to the guarded shared data have been performed with
respect to that process.
● Before an exclusive mode access to a synchronization variable by a process is allowed
to perform with respect to that process, no other process may hold the synchronization
variable, not even in nonexclusive mode.
● After an exclusive mode access to a synchronization variable has been performed, any
other process's next nonexclusive mode access to that synchronization variable may
not be performed until it has performed with respect to that variable's owner.
· At an acquire, all remote changes to guarded data must be brought up to date.
· Before a write to a data item, a process must ensure that no other process is trying to
write at the same time.
Locks are associated with individual data items, as opposed to the entire data-store.
· A lock is a synchronization mechanism for enforcing limits on access to a resource in
an environment where there are many threads of execution.
Example:
· P1 does an acquire for x, changes x once, after which it also does an acquire for y.
· Process P2 does an acquire for x but not for y, so that it will read value a for x, but
may read NIL for y.
· Because process P3 first does an acquire for y, it will read the value b when y is
released by P1.
Note: P2’s read on ‘y’ returns NIL as no locks have been requested.
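The acquire/release discipline with per-item locks can be sketched as follows. This is a minimal Python illustration: the class and method names are made up, and the comments mark where a real implementation would exchange updates with other replicas.
Example (Python sketch):

import threading

class EntryConsistentStore:
    # Toy store in which every data item has its own synchronization variable (a lock).
    def __init__(self):
        self._data = {}
        self._locks = {}                    # one lock per data item
        self._meta = threading.Lock()

    def _lock_for(self, item):
        with self._meta:
            return self._locks.setdefault(item, threading.Lock())

    def acquire(self, item):
        # On acquire, all remote changes to the guarded item would be pulled in here.
        self._lock_for(item).acquire()

    def release(self, item):
        # On release, local updates to the guarded item would be pushed to the replicas here.
        self._lock_for(item).release()

    def write(self, item, value):
        self._data[item] = value

    def read(self, item):
        return self._data.get(item)         # returns None (NIL) if never updated locally

store = EntryConsistentStore()
store.acquire("x"); store.write("x", "a"); store.release("x")   # like P1's acquire for x

store.acquire("x")                  # like P2: acquires x but not y ...
print(store.read("x"))              # ... so it is guaranteed to see "a" for x,
print(store.read("y"))              # ... but may read None (NIL) for y
store.release("x")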
Consistency models not using synchronization operations:
· Linearizability: All processes must see all shared accesses in the same order. Accesses are
furthermore ordered according to a (nonunique) global timestamp.
· Sequential: All processes see all shared accesses in the same order. Accesses are not
ordered in time.
· Causal: All processes see causally-related shared accesses in the same order.
· FIFO: All processes see writes from each other in the order they were used. Writes from
different processes may not always be seen in that order.
Consistency models that do use synchronization operations:
· Weak: Shared data can be counted on to be consistent only after a synchronization is done.
· Release: Shared data are made consistent when a critical region is exited.
· Entry: Shared data pertaining to a critical region are made consistent when a critical region
is entered.
Question: How fast should updates (writes) be made available to read-only processes?
· Most database systems: mainly read.
· DNS: write-write conflicts do not occur.
· WWW: as with DNS, except that heavy use of client-side caching is present: even
the return of stale pages is acceptable to most users.
NOTE: all exhibit a high degree of acceptable inconsistency, with the replicas gradually
becoming consistent over time.
Eventual Consistency
Special class of distributed data stores:
· Lack of simultaneous updates
· When updates occur they can easily be resolved.
· Most operations involve reading data.
· These data stores offer a very weak consistency model, called eventual consistency.
The eventual consistency model states that, when no updates occur for a long period of
time, eventually all updates will propagate through the system and all the replicas will
be consistent.
Note: The only thing you really want is that the entries you updated and/or read at A, are
in B the way you left them in A. In that case, the database will appear to be consistent to you.
Notation:
o xi[t] denotes the version of data item x at local copy Li at time t.
o WS(xi[t]) is the set of write operations at Li that led to version xi of x (at time t).
o If the operations in WS(xi[t1]) have also been performed at local copy Lj at a later
time t2, we write WS(xi[t1]; xj[t2]).
o If the ordering of operations or the timing is clear from the context, the time index
will be omitted.
Monotonic Reads
If a process reads the value of a data item x, any successive read operation on x by that
process will always return that same or a more recent value.
o Monotonic-read consistency guarantees that if a process has seen a value of x at time t, it
will never see an older version of x at a later time.
Example: Automatically reading your personal calendar updates from different servers.
o Monotonic Reads guarantees that the user sees all updates, no matter from which
server the automatic reading takes place.
Example: Reading (not modifying) incoming mail while you are on the move.
Example:
o The read operations performed by a single process P at two different local copies of
the same data store.
o Vertical axis - two different local copies of the data store are shown - L1 and L2
o Time is shown along the horizontal axis
o Operations carried out by a single process P in boldface are connected by a dashed
line representing the order in which they are carried out.
(a) A monotonic-read consistent data store:
o Process P first performs a read operation on x at L1, returning the value of x1 (at
that time).
o This value results from the write operations in WS(x1) performed at L1.
o Later, P performs a read operation on x at L2, shown as R(x2).
o To guarantee monotonic-read consistency, all operations in WS(x1) should have been
propagated to L2 before the second read operation takes place.
(b) A data store that does not provide monotonic reads:
o After process P has read x1 at L1, it later performs the operation R(x2) at L2.
o But only the write operations in WS(x2) have been performed at L2.
o No guarantees are given that this set also contains all operations contained in WS(x1).
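Operationally, a copy may serve the read only if it already contains every write the process has seen before. A small hedged sketch follows; the names are illustrative, not from the course text.
Example (Python sketch):

class LocalCopy:
    def __init__(self, name):
        self.name = name
        self.write_set = set()            # ids of write operations applied at this copy

    def apply(self, write_id):
        self.write_set.add(write_id)

def monotonic_read_allowed(seen_writes, copy):
    # The read may proceed only if every write already seen by the process is present at 'copy'.
    return seen_writes <= copy.write_set

L1, L2 = LocalCopy("L1"), LocalCopy("L2")
L1.apply("W1(x)")                          # WS(x1) has been performed at L1

seen = set(L1.write_set)                   # P reads x1 at L1 and remembers WS(x1)
print(monotonic_read_allowed(seen, L2))    # False: WS(x1) has not yet been propagated to L2
L2.apply("W1(x)")                          # propagate WS(x1) to L2
print(monotonic_read_allowed(seen, L2))    # True: the read R(x2) at L2 may now take place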
Monotonic Writes
In a monotonic-write consistent store, the following condition holds:
A write operation by a process on a data item x is completed before any successive
write operation on x by the same process.
Hence: A write operation on a copy of item x is performed only if that copy has been brought
up to date by means of any preceding write operation, which may have taken place on other
copies of x. If need be, the new write must wait for old ones to finish.
Example: Updating a program at server S2, and ensuring that all components on which
compilation and linking depends, are also placed at S2.
Example: Maintaining versions of replicated files in the correct order everywhere (propagate
the previous version to the server where the newest version is installed).
Example:
The write operations performed by a single process P at two different local copies of the same
data store.
(a) A monotonic-write consistent data store:
o Process P performs a write operation on x at local copy L1, presented as the operation
W(x1).
o Later, P performs another write operation on x, but this time at L2, shown as W(x2).
o To ensure monotonic-write consistency, the previous write operation at L1 must have
been propagated to L2.
o This explains operation W(x1) at L2, and why it takes place before W(x2).
(b) A data store that does not guarantee monotonic-write consistency:
o Missing is the propagation of W(x1) to copy L2.
o No guarantees can be given that the copy of x on which the second write is being
performed has the same or a more recent value than at the time W(x1) completed at L1.
Read Your Writes
In a read-your-writes consistent store, the following condition holds:
The effect of a write operation by a process on data item x will always be seen by a
successive read operation on x by the same process.
Hence: a write operation is always completed before a successive read operation by the same
process, no matter where that read operation takes place.
Example: Updating your Web page and guaranteeing that your Web browser shows the
newest version instead of its cached copy.
Example:
(a) A data store that provides read-your-writes consistency:
o Process P performed a write operation W(x1) and later a read operation at a different
local copy.
o Read-your-writes consistency guarantees that the effects of the write operation can be
seen by the succeeding read operation.
o This is expressed by WS(x1;x2), which states that W(x1) is part of WS(x2).
(b) A data store that does not provide read-your-writes consistency:
o W(x1) has been left out of WS(x2), meaning that the effects of the previous write
operation by process P have not been propagated to L2.
Writes Follow Reads
In a writes-follow-reads consistent store, the following condition holds:
A write operation by a process on a data item x following a previous read operation on
x by the same process is guaranteed to take place on the same or a more recent value
of x that was read.
Hence: any successive write operation by a process on a data item x will be performed on a
copy of x that is up to date with the value most recently read by that process.
Example: See reactions to posted articles only if you have the original posting (a read “pulls
in” the corresponding write operation).
Example:
(a) A writes-follow-reads consistent data store:
o A process reads x at local copy L1.
o The write operations that led to the value just read also appear in the write set at L2,
where the same process later performs a write operation.
o (Note that other processes at L2 see those write operations as well.)
(b) A data store that does not provide writes-follow-reads consistency:
o No guarantees are given that the operations performed at L2 are carried out on a copy
that is consistent with the one just read at L1.
Placement problem:
o Placing replica servers
· Replica-server placement is concerned with finding the best locations to place a
server that can host (part of) a data store.
o Placing content.
· Content placement deals with finding the best servers for placing content.
Replica-Server Placement
Essence: Figure out what the best K places are out of N possible locations.
1. Greedy approach: in each step, select the best location out of the remaining N - k
candidates, i.e., the one for which the average distance to clients is minimal given the servers
already placed. (The first chosen location minimizes the average distance to all clients.)
Computationally expensive.
2. Select the k-th largest autonomous system and place a server at the best-connected
host. Computationally expensive.
o An autonomous system (AS) can best be viewed as a network in which the
nodes all run the same routing protocol and which is managed by a single
organization.
What is to be propagated:
Epidemic Protocols
· Used to implement Eventual Consistency (note: these protocols are used in Bayou).
· Main concern is the propagation of updates to all the replicas in as few messages as
possible.
· Idea is to “infect” as many replicas as quickly as possible.
Infective replica: a server that holds an update that can be spread to other
replicas.
Susceptible replica: a yet to be updated server.
Removed replica: an updated server that will not (or cannot) spread the update
to any other replicas.
· The trick is to get all susceptible servers to either infective or removed states as
quickly as possible without leaving any replicas out.
Numerical Errors
Principle: consider a data item x and let weight(W) denote the numerical change in its value
after a write operation W.
· Assume that ∀W : weight(W) > 0.
· W is initially forwarded to one of the N replicas, denoted as origin(W).
· TW[i, j] is the total weight of the writes executed by server Si that originated from Sj.
· The actual value v(t) of x at time t, and the value vi of x at replica i, are then:
v(t) = vinit + Σk TW[k, k]   and   vi = vinit + Σk TW[i, k]
Problem: We need to ensure that v(t) - vi < δi for every server Si.
Approach: Let every server Sk maintain a view TWk[i, j] of what it believes is the value
of TW[i, j]. This information can be gossiped when an update is propagated.
Note: 0 ≤ TWk[i, j] ≤ TW[i, j] ≤ TW[j, j].
Solution: Sk sends operations from its log to Si when it sees that TWk[i, k] is lagging too far
behind its own TW[k, k], in particular, when TW[k, k] - TWk[i, k] > δi / (N - 1).
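As a numeric illustration of this threshold test (the server names, weights, and bounds below are made up), Sk compares the total weight TW[k, k] of its own writes with its view TWk[i, k] of how much of that weight each Si has already applied, and gossips its log when the gap exceeds δi / (N - 1).
Example (Python sketch):

N = 4                                     # number of replicas
delta = {"S1": 8.0, "S2": 8.0, "S3": 8.0} # per-server deviation bound δi (illustrative values)

TW_kk = 12.0                              # TW[k, k]: weight of writes originated and executed at Sk
TWk = {"S1": 12.0, "S2": 10.0, "S3": 5.0} # TWk[i, k]: Sk's view of what each Si has applied of them

def must_gossip(i):
    # Sk pushes log entries to Si when TW[k, k] - TWk[i, k] > δi / (N - 1).
    return TW_kk - TWk[i] > delta[i] / (N - 1)

for server in ("S1", "S2", "S3"):
    print(server, must_gossip(server))
# Only S3 exceeds the threshold (gap 7.0 > 8.0 / 3 ≈ 2.67), so Sk forwards its writes to S3.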
Primary-Based Protocols
· Used for sequential consistency
· Each data item is associated with a “primary” replica.
· The primary is responsible for coordinating writes to the data item.
· There are two types of Primary-Based Protocol:
1.Remote-Write.
2.Local-Write.
Remote-Write Protocols
· AKA primary backup protocols
· All writes are performed at a single (remote) server.
· Read operations can be carried out locally.
· This model is typically associated with traditional client/server systems.
Local-Write Protocols
o AKA the fully migrating approach
o A single copy of the data item is still maintained.
o Upon a write, the data item gets transferred to the replica that is writing.
o The status of primary for a data item is transferable.
Process: whenever a process wants to update data item x, it locates the primary copy of x, and
moves it to its own location.
Local-Write Issues
o Question to be answered by any process about to read from or write to the data item is:
“Where is the data item right now?”
o Processes can spend more time actually locating a data item than using it!
Primary-backup protocol in which the primary migrates to the process wanting to perform an
update.
Advantage:
o Multiple, successive write operations can be carried out locally, while reading processes
can still access their local copy.
o Can be achieved only if a nonblocking protocol is followed by which updates are
propagated to the replicas after the primary has finished with locally performing the
updates.
Replicated-Write Protocols
o AKA Distributed-Write Protocols
o Writes can be carried out at any replica.
o There are two types:
1.Active Replication.
2.Majority Voting (Quorums).
a) Using a coordinator for ‘B’, which is responsible for forwarding an invocation request from
the replicated object to ‘C’.
b) Returning results from ‘C’ using the same idea: a coordinator is responsible for returning
the result to all ‘B’s. Note the single result returned to ‘A’.
Example:
o A file is replicated within a distributed file system.
o To update a file, a process must get approval from a majority of the replicas to perform a
write.
o The replicas need to agree to also perform the write.
o After the update, the file has a new version # associated with it (and it is set at all the
updated replicas).
o To read, a process contacts a majority of the replicas and asks for the version # of the files.
o If the version # is the same, then the file must be the most recent version, and the read can
proceed.
Gifford's method
o To read a file of which N replicas exist a client needs to assemble a read quorum, an
arbitrary collection of any NR servers, or more.
o To modify a file, a write quorum of at least NW servers is required.
o The values of NR and NW are subject to the following two constraints:
NR + NW > N
NW > N/2
o First constraint prevents read-write conflicts
o Second constraint prevents write-write conflicts.
Only after the appropriate number of servers has agreed to participate can a file be read or
written.
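The two constraints, and the way a reader picks the newest version from its quorum, can be expressed directly. The sketch below is an illustration only, not Gifford's original pseudocode.
Example (Python sketch):

def valid_quorum(N, NR, NW):
    # Gifford's constraints: NR + NW > N (no read-write conflict) and NW > N/2 (no write-write conflict).
    return NR + NW > N and NW > N / 2

print(valid_quorum(12, 3, 10))   # True : the NR = 3, NW = 10 example below
print(valid_quorum(12, 1, 12))   # True : Read-One, Write-All (ROWA)
print(valid_quorum(12, 6, 6))    # False: 6 + 6 is not greater than 12

def read_from_quorum(replies):
    # 'replies' maps server -> (version number, value); any legal read quorum contains the newest version.
    return max(replies.values(), key=lambda pair: pair[0])

print(read_from_quorum({"C": (8, "new"), "K": (8, "new"), "A": (7, "old")}))   # (8, 'new')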
Example:
o NR = 3 and NW = 10
o Most recent write quorum consisted of the 10 servers C through L.
o All get the new version and the new version number.
o Any subsequent read quorum of three servers will have to contain at least one member
of this set.
o When the client looks at the version numbers, it will know which is most recent and take
that one.
(c) NR = 1, making it possible to read a replicated file by finding any copy and using it.
o Poor performance, because write updates need to acquire all copies (AKA Read-One,
Write-All (ROWA)).
Cache-Coherence Protocols
These are a special case, as the cache is typically controlled by the client not the server.
Coherence Detection Strategy:
o When are inconsistencies actually detected?
o Statically at compile time: extra instructions inserted.
o Dynamically at runtime: code to check with the server.
Timing failure: A server's response lies outside the specified time interval.
Response failure: A server's response is incorrect.
• Value failure: The value of the response is wrong.
• State-transition failure: The server deviates from the correct flow of control.
Arbitrary failure: A server may produce arbitrary responses at arbitrary times.
Communication in a flat group – all the processes are equal, decisions are made
collectively.
• Note: no single point-of-failure, however: decision making is complicated as
consensus is required.
Communication in a simple hierarchical group - one of the processes is elected to be the
coordinator, which selects another process (a worker) to perform the operation.
• Note: single point-of-failure; however, decisions are easily and quickly made by the
coordinator without first having to get consensus.
Example Again:
With 2 loyal generals and 1 traitor.
Note: It is no longer possible to determine the majority value in each column, and the
algorithm has failed to produce agreement.
• Lamport et al. (1982) proved that in a system with k faulty processes, agreement can
be achieved only if 2k + 1 correctly functioning processes are present, for a total of 3k
+ 1.
• Agreement is possible only if more than two-thirds of the processes are working
properly.
A request that can be repeated any number of times without any nasty side-effects is said to be
idempotent.
• (For example: a read of a static web-page is said to be idempotent).
Nonidempotent requests (for example, the electronic transfer of funds) are a little harder to
deal with.
• A common solution is to employ unique sequence numbers.
• Another technique is the inclusion of additional bits in a retransmission to identify it as
such to the server.
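A hedged sketch of the sequence-number technique for nonidempotent requests follows (class and field names are made up): the server remembers the reply for each (client, sequence number) pair and returns the cached reply on a retransmission instead of executing the transfer a second time.
Example (Python sketch):

class TransferServer:
    # Toy server that filters duplicate (retransmitted) nonidempotent requests.
    def __init__(self):
        self.balance = 1000
        self.replies = {}                    # (client_id, seq) -> cached reply

    def transfer(self, client_id, seq, amount):
        key = (client_id, seq)
        if key in self.replies:              # retransmission: do NOT repeat the side effect
            return self.replies[key]
        self.balance -= amount               # the nonidempotent operation itself
        reply = "ok, new balance %d" % self.balance
        self.replies[key] = reply
        return reply

s = TransferServer()
print(s.transfer("alice", 1, 100))   # ok, new balance 900
print(s.transfer("alice", 1, 100))   # same sequence number: cached reply, no second withdrawal
print(s.transfer("alice", 2, 100))   # new request: ok, new balance 800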
Client Crashes
When a client crashes, and when an ‘old’ reply arrives, such a reply is known as an orphan.
In practice, however, none of these methods are desirable for dealing with orphans.
Orphan elimination is discussed in more detail by Panzieri and Shrivastava (1988).
Small group: multiple, reliable point-to-point channels will do the job, however, such
a solution scales poorly as the group membership grows.
• What happens if a process joins the group during communication?
• Worse: what happens if the sender of the multiple, reliable point-to-point
channels crashes half way through sending the messages?
• But, how long does the sender keep its history-buffer populated?
• Also, such schemes perform poorly as the group grows … there are too many ACKs.
Several receivers have scheduled a request for retransmission, but the first
retransmission request leads to the suppression of others.
Conclusion:
• Building reliable multicast schemes that can scale to a large number of receivers
spread across a wide-area network, is a difficult problem.
• No single best solution exists, and each solution introduces new problems.
Atomic Multicast
Atomic multicast problem:
• A requirement where the system needs to ensure that all processes get the message, or
that none of them get it.
• An additional requirement is that all messages arrive at all processes in sequential
order.
• Atomic multicasting ensures that nonfaulty processes maintain a consistent view of the
database, and forces reconciliation when a replica recovers and rejoins the group.
Virtual Synchrony
The concept of virtual synchrony as the abstraction that group communication protocols
should attempt to build on top of an asynchronous system.
Message Ordering
Four different orderings:
1. Unordered multicasts
• virtually synchronous multicast in which no guarantees are given concerning the order
in which received messages are delivered by different processes
2. FIFO-ordered multicasts
• the communication layer is forced to deliver incoming messages from the same
process in the same order as they have been sent
3. Causally-ordered multicasts
• delivers messages so that potential causality between different messages is preserved
4. Totally-ordered multicasts
• regardless of whether message delivery is unordered, FIFO ordered, or causally
ordered, it is required additionally that when messages are delivered, they are
delivered in the same order to all group members.
• Virtually synchronous reliable multicasting offering totally-ordered delivery of
messages is called atomic multicasting.
• With the three different message ordering constraints discussed above, this leads to six
forms of reliable multicasting (Hadzilacos and Toueg, 1993).
Distributed Commit
General Goal: We want an operation to be performed by all group members or none at all.
• [In the case of atomic multicasting, the operation is the delivery of the message.]
• There are three types of “commit protocol”: single-phase, two-phase and three-phase
commit.
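As a rough sketch of the two-phase variant (failures and timeouts are ignored, and the participant objects are illustrative): the coordinator first collects votes, and only commits everywhere when every participant voted to commit.
Example (Python sketch):

class Participant:
    def __init__(self, name, will_commit=True):
        self.name, self.will_commit = name, will_commit

    def vote_request(self):                  # phase 1: VOTE_REQUEST -> VOTE_COMMIT / VOTE_ABORT
        return self.will_commit

    def global_commit(self):                 # phase 2 messages
        print(self.name, "COMMIT")

    def global_abort(self):
        print(self.name, "ABORT")

def two_phase_commit(participants):
    votes = [p.vote_request() for p in participants]     # phase 1: collect votes
    if all(votes):                                       # phase 2: commit only if all voted commit
        for p in participants:
            p.global_commit()
        return "committed"
    for p in participants:
        p.global_abort()
    return "aborted"

print(two_phase_commit([Participant("P1"), Participant("P2")]))                     # committed
print(two_phase_commit([Participant("P1"), Participant("P2", will_commit=False)]))  # aborted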
Recovery
• Once a failure has occurred, it is essential that the process where the failure
happened recovers to a correct state.
• Recovery from an error is fundamental to fault tolerance.
• Two main forms of recovery:
1. Backward Recovery: return the system to some previous correct state
(using checkpoints), then continue executing.
2. Forward Recovery: bring the system into a correct state, from which it can then
continue to execute.
Example
Consider as an example: Reliable Communications.
Retransmission of a lost/damaged packet - backward recovery technique.
Recovery-Oriented Computing
Recovery-oriented computing - Start over again .
• Underlying principle: it may be much cheaper to optimize for recovery than to aim for
systems that are free from failures for a long time.
Different flavors:
• Simply reboot (part of) a system, e.g., restart Internet servers.
• Rebooting only a part of the system requires that the fault be properly localized.
• Rebooting then means deleting all instances of the identified components, along with the
threads operating on them, and (often) just restarting the associated requests.
• Apply checkpointing and recovery techniques, but continue execution in a changed
environment.
• Basic idea: many failures can simply be avoided if programs are given extra buffer
space, memory is zeroed before being allocated, the ordering of message delivery is changed
(as long as this does not affect semantics), and so on.
Security Mechanisms
• Encryption – fundamental technique: used to implement confidentiality and integrity.
• Authentication – verifying identities.
• Authorization – verifying allowable operations.
• Auditing – who did what to what and when/how did they do it?
Key Point
Matching security mechanisms to threats is only possible when a Policy on security and
security issues exists.
Design Issues
Three design issues when considering security:
1. Focus of Control.
2. Layering of Security Mechanisms.
3. Simplicity.
Example
Several sites connected through a wide-area backbone service.
Security Mechanisms
Fundamental technique within any distributed systems security environment: Cryptography.
A sender S wanting to transmit message m to a receiver R.
The sender encrypts its message into an unintelligible message m', and sends m' to R.
R must decrypt the received message into its original form m.
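The E/D notation maps directly onto a symmetric-cipher library call. The sketch below uses the third-party Python 'cryptography' package (an assumption, not something prescribed by these notes); Fernet plays the role of EK/DK with a shared secret key K.
Example (Python sketch):

# pip install cryptography   (third-party package, assumed available)
from cryptography.fernet import Fernet

key = Fernet.generate_key()            # the shared secret key K between S and R
E = Fernet(key)

plaintext = b"meet at noon"            # the message m
ciphertext = E.encrypt(plaintext)      # m' = EK(m): unintelligible without K
print(E.decrypt(ciphertext))           # DK(m') recovers the original message m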
Types of Cryptosystems:
• Symmetric (secret-key) systems: the same key is used for encryption and decryption.
• Asymmetric (public-key) systems: the encryption and decryption keys are different but form
a unique pair.
Similarly:
• For any encryption function E, it should be computationally infeasible to find the
key K when given the plaintext P and associated ciphertext C = EK(P)
• Analogous to collision resistance, when given a plaintext P and a key K, it should be
effectively impossible to find another key K' such that EK(P) = EK' (P).
• Each encryption round i takes the 64-bit block produced by the previous round i - 1 as
its input.
• The 64 bits are split into a left part Li-1 and a right part Ri-1, each containing 32 bits.
• The right part is used for the left part in the next round, that is, Li = Ri-1.
The 48-bit key Ki for round i is derived from the 56-bit master key as follows
• First, the master key is permuted and divided into two 28-bit halves.
• For each round, each half is first rotated one or two bits to the left, after which 24 bits
are extracted.
• Together with 24 bits from the other rotated half, a 48-bit key is constructed.
• MD5 is a hash function for computing a 128-bit, fixed length message digest from an
arbitrary length binary input string.
• The input string is first padded to a total length of 448 bits (modulo 512), after which
the length of the original bit string is added as a 64-bit integer.
• In effect, the input is converted to a series of 512-bit blocks.
A phase in MD5 consists of four rounds of computations, where each round uses one of
the following four functions:
• The second round uses the function G in a similar fashion, whereas H and I are used in
the third and fourth round, respectively.
• Each phase thus consists of 64 iterations, after which the next phase is started, but now
with the values that p, q, r, and s have at that point.
Note that authentication and message integrity, as techniques, rely on each other.
A detailed description of the logics underlying authentication can be found in Lampson et al.
(1992).
Applications of Cryptography
1. Authentication.
2. Message Integrity.
3. Confidentiality.
Common practice to use secret-key cryptography by means of session keys.
A session key is a shared (secret) key that is used to encrypt messages for integrity
and possibly also confidentiality.
This key is used only for as long as the channel exists.
When the channel is closed, its associated session key is securely destroyed.
1. Alice sends her identity to Bob (message 1), indicating that she wants to set up a
communication channel between the two.
2. Bob sends a challenge RB to Alice, shown as message 2.
Such a challenge could take the form of a random number.
3. Alice must encrypt the challenge with the secret key KA,B that she shares with
Bob, and return the encrypted challenge to Bob. This response is shown as
message 3, containing KA,B(RB).
Idea:
• If Alice eventually wants to challenge Bob anyway, she might as well send a challenge
along with her identity when setting up the channel.
NOTE: If the authentication protocol is not carefully designed, the target will accept that
response as valid, thereby leaving the attacker with one fully-authenticated channel
connection (the other one is simply abandoned).
Mistake 1: the two parties in the new version of the protocol were using the same challenge
in two different runs of the protocol.
Better Design: always use different challenges for the initiator and for the responder.
• In general, letting the two parties that are setting up a secure channel do a number of things
identically is not a good idea.
Mistake 2: Bob gave away valuable information in the form of the response KA,B(RC) without
knowing for sure to whom he was giving it.
• Not violated in the original protocol, in which Alice first needed to prove her identity,
after which Bob was willing to pass her encrypted information.
More on design principles for protocols can be found in Abadi and Needham (1996).
• If Alice wants to set up a secure channel with Bob, she can do so with the help of a
(trusted) KDC.
• The KDC hands out a key to both Alice and Bob that they can use for communication.
Principle of using a KDC.
1. Alice sends a message to the KDC, telling it that she wants to talk to Bob.
2. The KDC returns a message containing a shared secret key KA,B that she can use.
• The message is encrypted with the secret key KA,KDC that Alice shares with the KDC.
3. The KDC sends KA,B to Bob, but now encrypted with the secret key KB,KDC it shares
with Bob.
Drawbacks :
• Alice may want to set up a secure channel with Bob even before Bob had received the
shared key from the KDC.
• The KDC is required to pass Bob the key.
Solution: ticket
• KDC passes KB,KDC(KA,B) back to Alice and lets her connect to Bob.
• The message KB,KDC(KA,B) is also known as a ticket.
• It is Alice's job to pass this ticket to Bob.
• Note that Bob is still the only one that can make sensible use of the ticket, as he is the
only one besides the KDC who knows how to decrypt the information it contains.
1. When Alice wants to set up a secure channel with Bob, she sends a request to the KDC
containing a challenge RA1, along with her identity A and that of Bob.
2. The KDC responds by giving her the ticket KB,KDC(KA,B), along with the secret
key KA,B that she can subsequently share with Bob.
The challenge RA1 that Alice sends to the KDC along with her request to set up
a channel to Bob is also known as a nonce.
A nonce is a random number that is used only once, such as one chosen from a
very large set.
Purpose of a nonce is to uniquely relate two messages to each other
e.g. message 1 and message 2.
by including RA1 again in message 2, Alice will know for sure that message 2
is sent as a response to message 1, and that it is not a replay of an older message.
Message 2 also contains B, the identity of Bob.
By including B, the KDC protects Alice against the following attack.
3. After the KDC has passed the ticket to Alice, the secure channel between Alice and
Bob can be set up.
Alice sends message 3, which contains the ticket to Bob, and a
challenge RA2 encrypted with the shared key KA,B that the KDC had just generated.
4. Bob then decrypts the ticket to find the shared key, and returns a response RA2 - 1
along with a challenge RB for Alice.
NOTE: by returning RA2 - 1 and not just RA2, Bob not only proves he knows the
shared secret key, but also that he has actually decrypted the challenge.
This ties message 4 to message 3 in the same way that the nonce RA1 tied
message 2 to message 1.
Weak Point:
• If Chuck got an old key KA,B, he could replay message 3 and get Bob to set up a
channel.
• Bob will then believe he is talking to Alice, while, in fact, Chuck is at the other end.
• Need to relate message 3 to message 1 - make the key dependent on the initial request
from Alice to set up a channel with Bob.
Digital Signatures
• Digital signing a message using public-key cryptography.
This is implemented in the RSA technology.
Note: the entire document is encrypted/signed - this can sometimes be a costly
overkill.
Example:
o Bob has sold Alice a collector's item of some phonograph record for $500.
o The whole deal was done through e-mail.
o Alice sends Bob a message confirming that she will buy the record for $500.
Issues:
o Alice needs to be assured that Bob will not maliciously change the $500 mentioned
in her message into something higher, and claim she promised more than $500.
o Bob needs to be assured that Alice cannot deny ever having sent the message
because she had second thoughts.
Solution:
o Alice digitally signs the message in such a way that her signature is uniquely tied to
its content.
o The association between a message and its signature ensures that modifications to the
message will not go unnoticed.
o Alice's signature can be verified to be genuine; she cannot later repudiate the fact
that she signed the message.
o Message arrives at Bob => he can decrypt it using Alice's public key.
o If the public key is owned by Alice, then decrypting the signed version of m and
successfully comparing it to m can mean only that it came from Alice.
o Alice is protected against any malicious modifications to m by Bob, because Bob
will always have to prove that the modified version of m was also signed by Alice.
Message digest => is a fixed-length bit string h that has been computed from an arbitrary-
length message m by means of a cryptographic hash function H.
o If m is changed to m', its hash H (m') will be different from h = H (m) so that it can
easily be detected that a modification has taken place.
Note that the message itself is sent as plaintext: everyone is allowed to read it.
o If confidentiality is required, then the message should also be encrypted with Bob's
public key.
3. When Bob receives the message and its encrypted digest, he decrypts the digest
with Alice's public key, and separately calculates the message digest.
o If the digest calculated from the received message and the decrypted digest
match, Bob knows the message has been signed by Alice.
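A hedged sketch of sign-the-digest using the third-party 'cryptography' package (assumed available; the key size and padding choice are illustrative): Alice signs a SHA-256 digest of m with her private key, and Bob verifies it with her public key while the message itself travels as plaintext.
Example (Python sketch):

# pip install cryptography   (third-party package, assumed available)
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding

alice_private = rsa.generate_private_key(public_exponent=65537, key_size=2048)
alice_public = alice_private.public_key()

m = b"I will buy the record for $500"

# Alice: sign a SHA-256 digest of m with her private key.
signature = alice_private.sign(m, padding.PKCS1v15(), hashes.SHA256())

# Bob: recompute the digest of the received plaintext and check it against the signature
# using Alice's public key; verify() raises InvalidSignature if m was modified.
alice_public.verify(signature, m, padding.PKCS1v15(), hashes.SHA256())
print("signature verified: m came from Alice and was not modified")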
Session Keys
A session key is a key used for encrypting one message or a group of messages in a
communication session.
During the establishment of a secure channel, after the authentication phase has
completed, the communicating parties generally use a unique shared session key for
confidentiality.
The session key is safely discarded when the channel is no longer used.
Let the replicated servers generate a secret valid signature with the property
that c corrupted servers alone are not enough to produce that signature.
Consider a group of five replicated servers that should be able to tolerate two corrupted
servers, and still produce a response that a client can trust.
Let N = 5, c = 2
Each server Si gets to see each request and responds with ri
• Response ri is sent along with digest md(ri), and signed with private key Ki- .
• Signature is denoted as sig(Si, ri) = Ki- (md(ri)).
There are 5!/(3!2!)=10 possible combinations of three signatures that the client can
use as input for D.
If one of these combinations produces a correct digest md (ri) for some response ri,
then the client can consider ri as being correct.
• It can trust that the response has been produced by at least three honest servers.
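The ten possible triples can be enumerated directly. The sketch below only illustrates the client-side search: the real (m, n)-threshold signature scheme is replaced by a stand-in check that three shares belong to servers that returned the same response.
Example (Python sketch):

from itertools import combinations

# Five replicated servers; S4 and S5 are corrupted and return bogus responses.
responses = {"S1": "r", "S2": "r", "S3": "r", "S4": "bad1", "S5": "bad2"}
shares = [(server, resp) for server, resp in responses.items()]   # (server, response it signed)

def D(triple):
    # Stand-in for the combination function D: three shares yield a valid
    # digest only if all three servers signed the same response.
    values = {resp for _, resp in triple}
    return values.pop() if len(values) == 1 else None

triples = list(combinations(shares, 3))
print(len(triples))                          # 10 = 5! / (3! 2!)

for triple in triples:
    result = D(triple)
    if result is not None:
        print("client trusts response:", result)   # 'r', vouched for by at least three servers
        break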
The ticket that is returned by the AS contains the identity of Alice, along with a generated
secret key that Alice and the TGS can use to communicate with each other. The ticket itself
will be handed over to the TGS by Alice. Therefore, it is important that no one but the TGS
can read it. For this reason, the ticket is encrypted with the secret key KAS,TGS shared
between the AS and the TGS.
Authentication in Kerberos.
1. Alice sends to Bob a message containing the ticket she got from the TGS, along with
an encrypted timestamp.
2. When Bob decrypts the ticket, he notices that Alice is talking to him, because only
the TGS could have constructed the ticket.
He also finds the secret key KA,B, allowing him to verify the timestamp.
At that point, Bob knows he is talking to Alice and not someone
maliciously replaying message 1.
By responding with KA,B(t + 1), Bob proves to Alice that he is indeed Bob.
Access Control
Verifying access rights is referred to as access control, whereas authorization is about granting
access rights.
Solutions:
Implementation (a): Each object O maintains an access control list (ACL): ACM[*,O]
describing the permissible operations per subject (or group of subjects)
Implementation (b): Each subject S has a capability: ACM[S,*] describing the permissible
operations per object (or category of objects)
Comparison between ACLs and capabilities for protecting objects.
(a) Using an ACL. (b) Using capabilities.
Protection Domains
Issue: ACLs or capability lists can be very large.
Reduce information by means of protection domains: (Saltzer and Schroeder, 1975)
Set of (object, access rights) pairs
Each pair is associated with a protection domain
For each incoming request the reference monitor first looks up the appropriate
protection domain
Firewalls
Essence: Sometimes it's better to select service requests at the lowest level: network packets.
Packets that do not fit certain requirements are simply removed from the channel
Solution: Protect your company by a firewall: it implements access control
Protecting an Agent
Ajanta: Detect that an agent has been tampered with while it was on the move.
Protecting a Host
Simple solution: Enforce a (very strict) single policy, and implement that by means of a few
simple mechanisms
Sandbox model: Policy: Remote code is allowed access to only a pre-defined collection of
resources and services.
Mechanism: Check instructions for illegal memory access and service access
Playground model: Same policy, but mechanism is to run code on separate unprotected
machine.
Observation: We need to be able to distinguish local from remote code before being able to
do anything
Refinement 1: We need to be able to assign a set of permissions to mobile code before its
execution and check operations against those permissions at all times
Refinement 2: We need to be able to assign different sets of permissions to different units of
mobile code => authenticate mobile code (e.g. through signatures)
Solutions:
No single method to protect against DDoS attacks. BUT…
Continuously monitor network traffic
• Starting at the egress routers where packets leave an organization's network.
o Experience shows that by dropping packets whose source address does not
belong to the organization's network we can prevent a lot of havoc.
o In general, the more packets can be filtered close to the sources, the better.
• Concentrate on ingress routers, that is, where traffic flows into an organization's
network.
o detecting an attack at an ingress router is too late as the network will probably
already be unreachable for regular traffic.
o Better to have routers further in the Internet, such as in the networks of ISPs,
start dropping packets when they suspect that an attack is going on.
Secret keys:
Alice and Bob will have to get a shared key.
They can invent their own and use it for data exchange.
Alternatively, they can trust a key distribution center (KDC) and ask it for a key.
Public keys:
Alice will need Bob's public key to decrypt (signed) messages from Bob, or to send
private messages to Bob.
But she'll have to be sure about actually having Bob's public key, or she may be in
big trouble.
Use a trusted certification authority (CA) to hand out public keys.
Another problem: How do we get the secret keys to their new owners?
• If there are no keys available to Alice and Bob to set up such a secure channel, it is
necessary to distribute the key out-of band.
o Alice and Bob will have to get in touch with each other using some other
communication means than the network.
o For example, one of them may phone the other, or send the key on a floppy
disk using snail mail.
1: P generates a one-time reply pad RP, and a secret key KP,G. It sends a join request to Q,
signed by itself (notation: [JR]P), along with a certificate containing its public key K+P .
3: Q authenticates P and sends back KP,G(N), letting P know that it has all the necessary keys.
In Amoeba, restricted access rights are encoded in a capability, along with data for
an integrity check to protect against tampering:
A capability in Amoeba.
Next 24 bits are used to identify the object at the given server.
o Note that the server port, along with the object identifier, form a 72-bit
system wide unique identifier for every object in Amoeba.
Next 8 bits are used to specify the access rights of the holder of the capability.
• The 48-bit check field is used to make a capability unforgeable, as explained in what
follows.
When an object is created, its server picks a random check field and stores it both in
the capability as well as internally in its own tables.
All the rights bits in a new capability are initially on, and it is this owner capability that
is returned to the client.
When the capability is sent back to the server in a request to perform an operation, the
check field is verified.
To create a restricted capability, a client can pass a capability back to the server, along
with a bit mask for the new rights.
o The server takes the original check field from its tables, XORs it with the new
rights (which must be a subset of the rights in the capability), and then runs the
result through a one-way function.
o The server then creates a new capability, with the same value in the object field,
but with the new rights bits in the rights field and the output of the one-way
function in the check field. The new capability is then returned to the caller.
o The client may send this new capability to another process, if it wishes.
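A hedged sketch of the rights-restriction step follows (the field widths and the one-way function are illustrative; Amoeba's actual encoding is not reproduced here): the server XORs its stored check field with the new rights mask and runs the result through a one-way function to obtain the new check field, so a holder cannot turn rights bits back on.
Example (Python sketch):

import hashlib

def one_way(value):
    # Illustrative one-way function: hash the value and keep 48 bits.
    digest = hashlib.sha256(value.to_bytes(8, "big")).digest()
    return int.from_bytes(digest[:6], "big")

server_table = {}                                 # object id -> original random check field (server-private)

def create_object(obj_id, check):
    server_table[obj_id] = check
    return {"object": obj_id, "rights": 0xFF, "check": check}     # owner capability: all rights on

def restrict(capability, new_rights):
    # Server side: XOR the *stored* check field with the new rights, then apply the one-way function.
    new_check = one_way(server_table[capability["object"]] ^ new_rights)
    return {"object": capability["object"], "rights": new_rights, "check": new_check}

def verify(capability):
    # Accept the owner capability, or a restricted one whose check field the server can recompute.
    original = server_table[capability["object"]]
    if capability["rights"] == 0xFF:
        return capability["check"] == original
    return capability["check"] == one_way(original ^ capability["rights"])

owner = create_object(42, check=0x1A2B3C4D5E6F)
read_only = restrict(owner, new_rights=0x01)
print(verify(read_only))                          # True: the server recomputes the same check field
print(verify(dict(read_only, rights=0xFF)))       # False: forging extra rights breaks the check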
Delegation
Observation: A subject sometimes wants to delegate its privileges to an object O1, to allow
that object to request services from another object O2
Example: A client tells the print server PS to fetch a file F from the file server FS to make a
hard copy => the client delegates its read privileges on F to PS
Nonsolution: Simply hand over your attribute certificate to a delegate (which may pass it on
to the next one, etc.)
Problem: To what extent can the object trust a certificate to have originated at the initiator of
the service request, without forcing the initiator to sign every certificate?
Solution: Ensure that delegation proceeds through a secure channel, and let a delegate prove
it got the certificate through such a path of channels originating at the initiator.
o Suppose that Bob wants an operation to be carried out at an object that resides at a
specific server.
o Also, assume that Alice is authorized to have that operation carried out, and that she
has delegated those rights to Bob.
o Therefore, Bob hands over his credentials to the server in the form of the certificate C.
o At that point, the server will be able to verify that C has not been tampered
with: any modification to the list of rights, or the nasty question will be
noticed, because both have been jointly signed by Alice.
o However, the server does not know yet whether Bob is the rightful owner of
the certificate.
▪ To verify this, the server must use the secret that came with C.
o By decrypting the encrypted challenge and returning N, Bob proves he knows the secret and
is thus the rightful holder of the certificate.