Unit I Distributed Databases ADT

The document provides an overview of distributed databases, including their architecture, advantages, disadvantages, and types. It explains the characteristics of distributed systems, the communication methods between nodes, and the operations involved in database management systems (DBMS). Additionally, it discusses the complexities and challenges of maintaining distributed databases, such as data consistency and latency, while also outlining various DBMS architectures and types.

Uploaded by

rajikarthi2013

UNIT I DISTRIBUTED DATABASES

Distributed Systems – Introduction – Architecture – Distributed Database Concepts – Distributed Data Storage – Distributed Transactions – Commit Protocols – Concurrency Control – Distributed Query Processing

DISTRIBUTED DATABASES

What Are Distributed Systems?

A distributed system is a computing environment in which various components are spread across multiple computers (or other computing devices) on a network. These devices split up the work, coordinating their efforts to complete the job more efficiently than if a single device had been responsible for the task.

Definition: A distributed system is a collection of independent computers that appear to the users of the system as a single coherent system. These computers or nodes work together, communicate over a network, and coordinate their activities to achieve a common goal by sharing resources, data, and tasks.

Advantages of Distributed Systems
Some advantages of distributed systems are as follows −
 All the nodes in the distributed system are connected to each other. So nodes can easily
share data with other nodes.
 More nodes can easily be added to the distributed system i.e. it can be scaled as required.
 Failure of one node does not lead to the failure of the entire distributed system. Other nodes
can still communicate with each other.
 Resources like printers can be shared with multiple nodes rather than being restricted to
just one.
Disadvantages of Distributed Systems
Some disadvantages of Distributed Systems are as follows −
 It is difficult to provide adequate security in distributed systems because the nodes as well
as the connections need to be secured.
 Some messages and data can be lost in the network while moving from one node to another.
 Databases connected to distributed systems are more complicated and difficult to manage than those of single-node systems.
 Overloading may occur in the network if all the nodes of the distributed system try to send
data at once.
Distributed database benefits
 Flexibility: The flexibility of data structures and schemas used within a distributed database (e.g., a heterogeneous one) is a significant benefit for organizations with a variety of data asset types and processing requirements.
 Resiliency: Because distributed databases locate data across multiple nodes in the
distributed system, the risk of a single point of failure is significantly reduced.
 Scalability: Distributed databases can easily scale up (or down) by simply adjusting the
number of nodes in the database, making them ideal for growing organizations.
 Improved performance: Distributed databases are able to use load balancing and query
optimization to improve overall database performance while reducing user wait times.
 High availability: Fault tolerance (e.g., data replication, continuous failure detection) provides high system availability for users.
Distributed database challenges
 Complexity: Because there are more moving parts in distributed databases than in centralized databases, they can be more complex to both design and manage.
 Latency: If not managed properly, latency can occur when users query data from multiple
nodes.
 Data consistency: Since distributed databases can employ multiple data schemas and structures, maintaining data consistency requires more effort than in traditional databases. In addition, if there is a hardware or network failure, data restoration can be more complex.
 Cost: Distributed databases can be more expensive due to the added complexity that their
greater flexibility brings. In addition, there may be additional networking costs since they
tend to have more sites and hardware than traditional databases.
 Distributed coordination: Distributed systems require coordination among the nodes, which can be challenging because of the distributed nature of the system and because data consistency must be maintained across multiple nodes.
Characteristics of Distributed System
 Resource Sharing: It is the ability to use any Hardware, Software, or Data anywhere in
the System.
 Concurrency: Concurrency is naturally present in distributed systems: the same activity or functionality can be performed by separate users who are in remote locations. Every local system has its own independent operating system and resources.
 Scalability: The system can grow as required; adding processors allows it to serve more users while maintaining responsiveness.
 Transparency: The complexity of the distributed system is hidden from users and application programs, which see it as a single coherent system.

How do distributed systems communicate with each other?

Distributed systems must have a network that connects all components (machines, hardware, or software) together so they can transfer messages to communicate with each other. That network could be an IP network, physical cables, or even traces on a circuit board.

One of the major disadvantages of distributed systems is the complexity of the underlying hardware and software arrangements. This arrangement is generally known as a topology or an overlay. It is what provides the platform for distributed nodes to communicate and coordinate with each other as needed.

INTRODUCTION
A distributed system is a piece of software that serves to coordinate the actions of several
computers. This coordination is achieved by exchanging messages, i.e., pieces of data conveying
information. The system relies on a network that connects the computers and handles the routing
of messages.
Database and Database Management system:
A distributed system is a software system that interconnects a collection of
heterogeneous independent computers, where coordination and communication between
computers only happen through message passing, with the intention of working towards a
common goal.
A database is an ordered collection of related data that is built for a specific purpose.
A database may be organized as a collection of multiple tables, where a table represents a real
world element or entity. Each table has several different fields that represent the characteristic
features of the entity.
For example, a company database may include tables for projects, employees,
departments, products and financial records. The fields in the Employee table may be Name,
Company_Id, Date_of_Joining, and so forth.
A database management system is a collection of programs that enables creation and
maintenance of a database. DBMS is available as a software package that facilitates definition,
construction, manipulation and sharing of data in a database.
Definition of a database includes description of the structure of a database.
Construction of a database involves actual storing of the data in any storage medium.
Manipulation refers to retrieving information from the database, updating the database, and generating reports. Sharing of data allows the data to be accessed by different users or programs.
Examples of DBMS Application Areas
 Automatic Teller Machines
 Train Reservation System
 Employee Management System
 Student Information System
Examples of DBMS Packages
 MySQL
 Oracle
 SQL Server
 dBASE
 FoxPro
 PostgreSQL, etc.
Database Schemas:
A database schema is a description of the database which is specified during database
design and subject to infrequent alterations. It defines the organization of the data, the
relationships among them, and the constraints associated with them.
Databases are often represented through the three-schema architecture or ANSI/SPARC architecture. The goal of this architecture is to separate the user application from the physical database.
The three levels are
Internal Level having Internal Schema
− It describes the physical structure, details of internal storage and access paths for the
database.
Conceptual Level having Conceptual Schema
− It describes the structure of the whole database while hiding the details of physical
storage of data. This illustrates the entities, attributes with their data types and constraints, user
operations and relationships.
External or View Level having External Schemas or Views
− It describes the portion of a database relevant to a particular user or a group of users
while hiding the rest of database.
Types of DBMS
 Hierarchical DBMS
 Network DBMS
 Relational DBMS
 Object Oriented DBMS
 Distributed DBMS
Hierarchical DBMS
In hierarchical DBMS, the relationships among data in the database are established so
that one data element exists as a subordinate of another. The data elements have parent-child
relationships and are modelled using the “tree” data structure. These are very fast and simple.

Network DBMS
A network DBMS is one where the relationships among data in the database are of type many-to-many, in the form of a network. The structure is generally complicated due to the existence of numerous many-to-many relationships. Network DBMS is modelled using the “graph” data structure.

Relational DBMS
In relational databases, the database is represented in the form of relations. Each
relation models an entity and is represented as a table of values. In the relation or table, a row
is called a tuple and denotes a single record. A column is called a field or an attribute and
denotes a characteristic property of the entity. RDBMS is the most popular database
management system.
For example − A Student Relation

Object Oriented DBMS


Object-oriented DBMS is derived from the model of the object-oriented programming
paradigm. They are helpful in representing both consistent data as stored in databases, as well
as transient data, as found in executing programs. They use small, reusable elements called
objects. Each object contains a data part and a set of operations which works upon the data.
The object and its attributes are accessed through pointers instead of being stored in relational
table models.
For example − A simplified Bank Account object-oriented database −
Distributed DBMS
A distributed database is a set of interconnected databases that is distributed over
the computer network or internet. A Distributed Database Management System (DDBMS)
manages the distributed database and provides mechanisms so as to make the databases transparent
to the users. In these systems, data is intentionally distributed among multiple nodes so that all
computing resources of the organization can be optimally used.
A distributed database is a collection of multiple interconnected databases, which are spread
physically across various locations that communicate via a computer network.
Features
• Databases in the collection are logically interrelated with each other. Often they represent a single
logical database.
• Data is physically stored across multiple sites. Data in each site can be managed by a DBMS
independent of the other sites.
• The processors in the sites are connected via a network. They do not have any multiprocessor
configuration.
• A distributed database is not a loosely connected file system.
• A distributed database incorporates transaction processing, but it is not synonymous with a
transaction processing system.
Operations on DBMS:
The four basic operations on a database are
 Create,
 Retrieve,
 Update and
 Delete.
CREATE database structure and populate it with data − Creation of a database relation involves
specifying the data structures, data types and the constraints of the data to be stored.
Example − SQL command to create a student table −
CREATE TABLE STUDENT (
ROLL INTEGER PRIMARY KEY,
NAME VARCHAR2(25),
YEAR INTEGER,
STREAM VARCHAR2(10)
);
 Once the data format is defined, the actual data is stored in accordance with the format in
some storage medium.
Example − SQL command to insert a single tuple into the student table −
INSERT INTO STUDENT (ROLL, NAME, YEAR, STREAM)
VALUES (1, 'ANKIT JHA', 1, 'COMPUTER SCIENCE');

RETRIEVE information from the database – Retrieving information generally involves selecting
a subset of a table or displaying data from the table after some computations have been done. It is
done by querying upon the table.

Example − To retrieve the names of all students of the Computer Science stream, the following
SQL query needs to be executed –

SELECT NAME FROM STUDENT WHERE STREAM = 'COMPUTER SCIENCE';

UPDATE information stored and modify database structure – Updating a table involves changing
old values in the existing table’s rows with new values.
Example − SQL command to change stream from Electronics to Electronics and Communications −
UPDATE STUDENT SET STREAM = 'ELECTRONICS AND COMMUNICATIONS'
WHERE STREAM = 'ELECTRONICS';
 Modifying database means to change the structure of the table. However, modification of
the table is subject to a number of restrictions.
Example − To add a new field or column, say address to the Student table, we use the following
SQL command −
ALTER TABLE STUDENT ADD ( ADDRESS VARCHAR2(50) );

DELETE information stored or delete a table as a whole – Deletion of specific information involves removal of selected rows from the table that satisfy certain conditions.
Example − To delete all students who are in 4th year currently when they are passing out, we use
the SQL command −
DELETE FROM STUDENT WHERE YEAR = 4;
 Alternatively, the whole table may be removed from the database.

Example − To remove the student table completely, the SQL command used is −

DROP TABLE STUDENT;


Data distribution
Proper data distribution is critical to the efficiency, security, and optimal user access
in a distributed database. This process, sometimes referred to as data partitioning, can be
accomplished using two different methods.
 Horizontal partitioning: Horizontal partitioning involves splitting data tables into rows
across multiple nodes.
 Vertical partitioning: Vertical partitioning splits tables into columns across multiple nodes.
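To make the two methods concrete, here is an illustrative Python sketch (not part of the original text): rows of a Student-style table are plain dicts, and the node-assignment rule is a hypothetical hash on the key column.

```python
# Illustrative sketch: splitting one logical table across nodes.
# Rows are plain dicts; the node-assignment rules are hypothetical.

students = [
    {"roll": 1, "name": "ANKIT JHA", "year": 1, "stream": "COMPUTER SCIENCE"},
    {"roll": 2, "name": "PRIYA RAO", "year": 4, "stream": "ELECTRONICS"},
    {"roll": 3, "name": "VIKRAM DAS", "year": 2, "stream": "COMPUTER SCIENCE"},
]

def horizontal_partition(rows, num_nodes):
    # Horizontal partitioning: each node stores a subset of the ROWS.
    parts = {n: [] for n in range(num_nodes)}
    for row in rows:
        parts[row["roll"] % num_nodes].append(row)   # simple hash on the key
    return parts

def vertical_partition(rows, column_groups):
    # Vertical partitioning: each node stores a subset of the COLUMNS.
    # The key column is kept in every fragment so fragments can be rejoined.
    return [[{c: r[c] for c in ("roll",) + g} for r in rows] for g in column_groups]

h = horizontal_partition(students, 2)
v = vertical_partition(students, [("name",), ("year", "stream")])
print(len(h[0]) + len(h[1]))   # all rows accounted for across both nodes
```

Horizontal fragments can be recombined by union, while vertical fragments are rejoined on the key column that is replicated into every fragment.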
Types of Distributed Databases

Distributed databases can be broadly classified into homogeneous and heterogeneous distributed database environments.
Homogeneous Distributed Databases
In a homogeneous distributed database, all the sites use identical DBMS and
operating systems. Its properties are −
 The sites use very similar software.
 The sites use identical DBMS or DBMS from the same vendor.
 Each site is aware of all other sites and cooperates with other sites to process user
requests.
 The database is accessed through a single interface as if it is a single database.
Types of Homogeneous Distributed Database

There are two types of homogeneous distributed database −


Autonomous − Each database is independent and functions on its own. The databases are integrated by a controlling application and use message passing to share data updates.
Non-autonomous − Data is distributed across the homogeneous nodes, and a central or master DBMS coordinates data updates across the sites.
Heterogeneous Distributed Databases
In a heterogeneous distributed database, different sites have different
operating systems, DBMS products and data models.

Its properties are −


 Different sites use dissimilar schemas and software.
 The system may be composed of a variety of DBMSs like relational, network,
hierarchical or object oriented.
 Query processing is complex due to dissimilar schemas. Transaction processing is
complex due to dissimilar software.
 A site may not be aware of other sites and so there is limited co-operation in
processing user requests.

Types of Heterogeneous Distributed Databases


Federated − The heterogeneous database systems are independent in nature and
integrated together so that they function as a single database system.
Un-federated − The database systems employ a central coordinating module
through which the databases are accessed.

Distributed DBMS Architectures:


DDBMS architectures are generally developed depending on three parameters −
 Distribution − It states the physical distribution of data across the different sites.
 Autonomy − It indicates the distribution of control of the database system and the
degree to which each constituent DBMS can operate independently.
 Heterogeneity − It refers to the uniformity or dissimilarity of the data models, system
components and databases.

Architectural Models

 Client - Server Architecture for DDBMS


 Peer - to - Peer Architecture for DDBMS
 Multi - DBMS Architecture

Client - Server Architecture for DDBMS


This is a two-level architecture where the functionality is divided into
servers and clients.

The server functions primarily encompass
 data management,
 query processing,
 optimization, and
 transaction management.
Client functions mainly comprise the application and the user interface (the client side of the DBMS). However, clients also perform some functions, such as consistency checking and client-side transaction management.

The two different client-server architectures are
 Single Server, Multiple Clients
 Multiple Servers, Multiple Clients

Single Server accessed by multiple clients


 A client server architecture has a number of clients and a few servers connected in a
network.
 A client sends a query to one of the servers. The earliest available server solves it and
replies.
 Client-server architecture is simple to implement and execute due to its centralized server system.
Multiple Server Multiple Client

Peer-to-Peer Architecture for DDBMS

In these systems, each peer acts both as a client and as a server for providing database services. The peers share their resources with other peers and coordinate their activities.
This architecture generally has four levels of schemas –
Schemas Present
 Individual internal schema definitions at each site: the local internal schemas.
 The enterprise view of the data is described by the global conceptual schema.
 The local organization of data at each site is described in the local conceptual schema.
 User applications and user access to the database are supported by external schemas.
Local conceptual schemas are mappings of the global schema onto each site. Databases are typically designed in a top-down fashion, and therefore all external view definitions are made globally.
Major Components of a Peer-to-Peer System
The two major components are the user processor and the data processor.
User Processor
 User-interface handler − responsible for interpreting user commands and formatting the result data.
 Semantic data controller − checks whether the user query can be processed.
 Global query optimizer and decomposer − determines an execution strategy and translates global queries into local ones.
 Distributed execution monitor − coordinates the distributed execution of the user request.
Data Processor
 Local query optimizer − acts as the access path selector, responsible for choosing the best access path.
 Local recovery manager − makes sure the local database remains consistent.
 Run-time support processor − the interface to the operating system; contains the database buffer and is responsible for maintaining the main-memory buffers and managing data access.

Multi - DBMS Architectures


This is an integrated database system formed by a collection of two or
more autonomous database systems.
Multi-DBMS can be expressed through six levels of schemas –

 Multi-database View Level − Depicts multiple user views, each comprising a subset of the integrated distributed database.
 Multi-database Conceptual Level − Depicts the integrated multi-database, comprising global logical multi-database structure definitions.
 Multi-database Internal Level − Depicts the data distribution across different sites and
multi-database to local data mapping.
 Local database View Level − Depicts public view of local data.
 Local database Conceptual Level − Depicts local data organization at each site.
 Local database Internal Level − Depicts physical data organization at each site.
There are two design alternatives for multi-DBMS − models with and without a multi-database conceptual level.
Model with a multi-database conceptual level:
 The GCS (global conceptual schema) is defined by integrating either the external schemas of the local autonomous databases or parts of their local conceptual schemas.
 Users of a local DBMS define their own views on the local database.
 If heterogeneity exists in the system, then two implementation alternatives exist: unilingual and multilingual.
 Unilingual − requires the users to utilize possibly different data models and languages when accessing the local and the global databases.
 Multilingual − the basic philosophy is to permit each user to access the global database using the language of the user's own local DBMS.

Distributed System Architectures


Distributed system architectures are built from components and connectors. Components can be individual nodes or important units in the architecture, whereas connectors are what link these components together.
Component: A modular unit with well-defined interfaces; replaceable; reusable
Connector: A communication link between modules which mediates coordination or
cooperation among components.

So, the idea behind distributed architectures is to have these components presented on
different platforms, where components can communicate with each other over a
communication network in order to achieve specific objectives.

Architectural Styles
There are four different architectural styles, plus a hybrid architecture, when it comes to distributed systems. The basic idea is to organize logically different components and distribute them over various machines.
 Layered Architecture
 Object Based Architecture
 Data-centered Architecture
 Event Based Architecture
 Hybrid Architecture
Layered Architecture
The layered architecture separates layers of components from each other, giving it a
much more modular approach. A well-known example for this is the OSI model that
incorporates a layered architecture when interacting with each of the components.
The layers on the bottom provide a service to the layers on the top. The request flows
from top to bottom, whereas the response is sent from bottom to top.
The advantage of using this approach is that calls always follow a predefined path, and each layer can easily be replaced or modified without affecting the entire architecture.
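The top-to-bottom request flow and bottom-to-top response flow can be sketched as follows; the layer names and the processing each performs are illustrative, not taken from any particular system.

```python
# Minimal sketch of a layered architecture: each layer only calls the layer
# directly below it, so any layer can be swapped independently.
# Layer names and behaviors are hypothetical.

class StorageLayer:                      # bottom layer
    def handle(self, request):
        return f"stored({request})"

class LogicLayer:                        # middle layer
    def __init__(self, below):
        self.below = below
    def handle(self, request):
        return self.below.handle(request.upper())   # adds its own processing

class InterfaceLayer:                    # top layer: entry point for users
    def __init__(self, below):
        self.below = below
    def handle(self, request):
        return self.below.handle(request)

stack = InterfaceLayer(LogicLayer(StorageLayer()))
print(stack.handle("query"))   # request flows down; response flows back up
```

Because each layer only knows the layer directly below it, replacing StorageLayer with another implementation would not affect the layers above.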
Object Based Architecture
This architecture style is based on a loosely coupled arrangement of objects. Unlike the layered style, it has no fixed structure and no sequential set of steps that must be carried out for a given call. Each component is referred to as an object, and each object can interact with other objects through a given connector or interface.
Interactions are much more direct: all the different components can interact directly with other components through a direct method call.

Communication between objects happens as method invocations, generally called Remote Procedure Calls (RPC). Some popular examples are Java RMI, Web Services, and REST API calls. This style has the following properties.
 This architecture style is less structured.
 component = object
 connector = RPC or RMI
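A minimal sketch of the component = object, connector = RPC idea, with the network transport replaced by an in-process connector; the class names (AccountObject, Connector, Stub) are hypothetical and do not come from a real RPC library.

```python
# Sketch of object-based interaction: a client-side "stub" forwards method
# invocations through a connector to a remote object, as in RPC/RMI.
# The "network" is simulated in-process; all names are illustrative.

class AccountObject:                    # the remote object
    def __init__(self, balance):
        self.balance = balance
    def deposit(self, amount):
        self.balance += amount
        return self.balance

class Connector:                        # stands in for the RPC transport
    def __init__(self, target):
        self.target = target
    def invoke(self, method, *args):
        return getattr(self.target, method)(*args)   # dispatch by name

class Stub:                             # local proxy the client calls
    def __init__(self, connector):
        self.connector = connector
    def deposit(self, amount):
        return self.connector.invoke("deposit", amount)

stub = Stub(Connector(AccountObject(100)))
print(stub.deposit(50))   # looks like a local call, routed via the connector
```

In a real RPC system the connector would also marshal arguments and send them over the network, but the calling component sees only an ordinary method call.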
When decoupling these processes in space, designers wanted the components to be anonymous and replaceable, and the communication between them to be asynchronous. This has led to data-centered architectures and event-based architectures.
Data Centered Architecture
This architecture is centered on data: the primary communication happens via a central data repository. This common repository can be either active or passive. The style resembles the classic producer-consumer problem: producers put items into a common data store, and consumers request data from it.
This common repository could even be a simple database, but the idea is that communication between objects happens through this shared common storage. It supports the different components (or objects) by providing a persistent storage space for them (such as a MySQL database). All the information related to the nodes in the system is stored in this persistent storage. By contrast, in event-based architectures, data is only sent to and received by those components that have subscribed.
Some popular examples are distributed file systems, producer-consumer systems, and web-based data services.
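The producer-consumer exchange through a shared repository can be sketched with Python's standard queue module standing in for the central data store; the function names are illustrative.

```python
import queue

# Sketch of a data-centered architecture: producers and consumers never
# talk to each other directly; all communication goes through a shared
# repository. A queue.Queue stands in for the central data store.

repository = queue.Queue()

def producer(items):
    for item in items:
        repository.put(item)          # write into the shared store

def consumer(count):
    return [repository.get() for _ in range(count)]   # read from the store

producer(["row1", "row2", "row3"])
print(consumer(3))
```

The producer and consumer are anonymous to each other, which is exactly the decoupling-in-space property the style is meant to provide.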

Event Based Architecture


The entire communication in this kind of system happens through events. When an event is generated, it is sent to the bus system, and everyone else is notified that such an event has occurred. Any interested node can then pull the event from the bus and use it. These events occasionally carry data, or even URLs to resources, so the receiver can access whatever information is given in the event and process it accordingly.
An advantage of this architectural style is that components are loosely coupled, so it is easy to add, remove, and modify components in the system. Another major advantage is that heterogeneous components can contact the bus through any communication protocol.

This architectural style is based on the publisher-subscriber model. There is no direct communication or coordination between nodes; instead, objects that are subscribed to the service communicate through the event bus.
The event-based architecture supports several communication styles:
 Publisher-subscriber
 Broadcast
 Point-to-Point

The major advantage of this architecture is that the components are decoupled in space (loosely coupled).
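A minimal publish-subscribe sketch of the event bus described above, assuming a simple in-process bus; the class and event names are hypothetical.

```python
# Sketch of an event-based (publish-subscribe) architecture: components
# register interest in event types on a bus; publishers never address
# subscribers directly, so components stay loosely coupled.

class EventBus:
    def __init__(self):
        self.subscribers = {}                     # event type -> callbacks
    def subscribe(self, event_type, callback):
        self.subscribers.setdefault(event_type, []).append(callback)
    def publish(self, event_type, data):
        for callback in self.subscribers.get(event_type, []):
            callback(data)                        # notify every subscriber

received = []
bus = EventBus()
bus.subscribe("order_placed", received.append)
bus.publish("order_placed", {"id": 42})
bus.publish("ignored_event", {"id": 99})          # no subscribers: dropped
print(received)
```

Adding or removing a subscriber never requires changing the publisher, which is what makes components in this style easy to add, remove, and modify.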

System Level Architecture


The two major system-level architectures that we use today are client-server and peer-to-peer (P2P). We use both kinds of services in our day-to-day lives, but the difference between the two is often misinterpreted.

i) Client Server Architecture

The client server architecture has two major components.


 The client
 The server.
The Server is where all the processing, computing and data handling is happening, whereas
the Client is where the user can access the services and resources given by the Server (Remote
Server).
The clients make requests to the server, and the server responds accordingly. Generally, there is only one server that handles the remote side, but to be on the safe side, multiple servers with load-balancing techniques may be used.

As one common design feature, the client-server architecture has a centralized security database. This database contains security details like credentials and access rights; users cannot log in to a server without the proper credentials. This makes the architecture somewhat more stable and secure than peer-to-peer: the security database allows resource usage to be governed in a much more meaningful way. On the other hand, the system may slow down, since the server can only handle a limited amount of workload at a given time.

Advantages:
Easier to Build and Maintain
Better Security
Stable

Disadvantages:
Single point of failure
Less scalable

ii) Peer to Peer (P2P)

The general idea behind peer-to-peer is that there is no central control in the distributed system. Each node can act as either a client or a server at a given time: if a node requests something, it acts as a client, and if a node provides something, it acts as a server. In general, each node is referred to as a peer.
In this network, any new node first has to join the network. After joining, it can either request a service or provide a service. The initiation (joining) phase of a node can vary according to the implementation of the network. There are two ways a new node can learn what other nodes are providing.

Centralized Lookup Server − The new node registers with the centralized lookup server and announces the services it will provide on the network. Whenever a node wants a service, it simply contacts the centralized lookup server, which directs it to the relevant service provider.

Decentralized System − A node desiring a specific service broadcasts a request to every other node in the network, so that whichever node provides the service can respond.
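The centralized-lookup variant can be sketched as follows; the peer names and service names are hypothetical.

```python
# Sketch of the centralized-lookup variant of P2P: peers register the
# services they provide with a lookup server; other peers query it to
# find a provider, then contact that peer directly (not shown here).

class LookupServer:
    def __init__(self):
        self.registry = {}                    # service name -> providing peers
    def register(self, peer, services):
        for service in services:
            self.registry.setdefault(service, []).append(peer)
    def lookup(self, service):
        providers = self.registry.get(service, [])
        return providers[0] if providers else None   # direct to a provider

server = LookupServer()
server.register("peer-A", ["file-share"])
server.register("peer-B", ["file-share", "printing"])
print(server.lookup("printing"))    # directed to peer-B
print(server.lookup("video"))       # no provider registered -> None
```

In the decentralized variant there is no such registry; the requesting peer would instead broadcast its query to all known neighbors.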
Middleware in Distributed Applications

If we look at distributed systems today, they lack uniformity and consistency: various heterogeneous devices have taken over the world, and distributed systems must cater to all of them in a common way.
One way distributed systems can achieve uniformity is through a common layer that sits on top of the underlying hardware and operating systems. This common layer is known as middleware; it provides services beyond what the operating system already offers, enabling the various features and components of a distributed system and enhancing its functionality.
This layer provides certain data structures and operations that allow processes and users on different machines to interoperate and work together in a consistent way.

Centralized vs Decentralized Architectures


The two main structures that we see within distributed system overlays are centralized and decentralized architectures. The centralized architecture can be explained by a simple client-server architecture where the server acts as a central unit. It can also be viewed as a centralized lookup table, with the following characteristics:
 Low overhead
 Single point of failure
 Easy to track
 Additional overhead
When it comes to distributed systems, we are most interested in the overlay and network topologies that we see today. In general, the peer-to-peer systems in use today can be separated into three unique categories.
Structured P2P: nodes are organized following a specific distributed data structure
Unstructured P2P: nodes have randomly selected neighbors
Hybrid P2P: some nodes are appointed special functions in a well-organized fashion

Structured P2P Architecture

Every structured network inherently suffers from poor scalability, due to the need for
structure maintenance. In general, the nodes in a structured overlay network are formed in a
logical ring, with nodes being connected to this ring. In this ring, certain nodes are responsible
for certain services.
A common approach to handling coordination between nodes is to use distributed
hash tables (DHTs). A traditional hash function converts a unique key into a hash value that
represents an object in the network; the hash value is used to insert the object into the hash
table and to retrieve it.
In a DHT, each key is assigned a unique hash value drawn from a very large address space
in order to ensure uniqueness. A mapping function assigns objects to nodes based on the
hash value, and a lookup on the hash value returns the network address of the node that
stores the requested object.

 Hash Function: Takes a key and produces a unique hash value


 Mapping Function: Map the hash value to a specific node in the system
 Lookup table: Return the network address of the node represented by the unique hash
value.
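The three components above can be sketched in a few lines of Python. This is an illustrative toy, not a real DHT such as Chord: the node list, the key format, and the modulo-based mapping function are all assumptions made for the example.

```python
import hashlib

# Hypothetical node addresses in the overlay network.
NODES = ["node-a:5000", "node-b:5000", "node-c:5000"]

def hash_function(key: str) -> int:
    # Hash function: key -> hash value in a large (160-bit SHA-1) address space
    return int(hashlib.sha1(key.encode()).hexdigest(), 16)

def mapping_function(hash_value: int) -> int:
    # Mapping function: assign the hash value to a node (simple modulo here;
    # real DHTs use consistent hashing to limit key reshuffling on churn)
    return hash_value % len(NODES)

def lookup(key: str) -> str:
    # Lookup: return the network address of the node storing the object
    return NODES[mapping_function(hash_function(key))]

print(lookup("employee:42"))  # prints one of the addresses in NODES
```

A real DHT would use consistent hashing instead of modulo so that adding or removing a node relocates only a small fraction of the keys.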
Distributed database concepts:
A distributed database (DDB) as a collection of multiple logically interrelated databases
distributed over a computer network, and a distributed data-base management system
(DDBMS) as a software system that manages a distributed database while making the distribution
transparent to the user.

Features of Distributed Database Systems


 A distributed database is a collection of logically related shared data; often the sites
together represent a single logical database.
 Data is physically stored across multiple sites. Data at each site can be managed by a
DBMS independent of the other sites.
 Data in a DDBMS is split into a number of fragments, and fragments can be replicated in
the distributed system. Fragments/replicas are allocated to different sites.
 The processors at the sites are connected via a network.
 Data at each site is under the control of a DBMS, which can handle local applications
independently.
What constitutes a DDB?
For a database to be called distributed, the following minimum conditions should be satisfied:
 Connection of database nodes over a computer network. There are multiple
computers, called sites or nodes. These sites must be connected by an
underlying communication network to transmit data and commands among sites, as
shown later in Figure 25.3(c).
 Logical interrelation of the connected databases. It is essential that the information
in the databases be logically related.
 Absence of homogeneity constraint among connected nodes. It is not necessary
that all nodes be identical in terms of data, hardware, and software.
The sites may all be located in physical proximity—say, within the same building
or a group of adjacent buildings—and connected via a local area network, or they may be
geographically distributed over large distances and connected via a long-haul or wide
area network.
Local area networks typically use wireless hubs or cables, whereas long-haul
networks use telephone lines or satellites. It is also possible to use a combination of
networks.
Networks may have different topologies that define the direct communication
paths among sites. The type and topology of the network used may have a significant
impact on the performance and hence on the strategies for distributed query processing and
distributed database design.
Transparency
The concept of transparency extends the general idea of hiding implementation details from
end users.
A highly transparent system offers a lot of flexibility to the end user/application developer
since it requires little or no awareness of underlying details on their part. In the case of a traditional
centralized database, transparency simply pertains to logical and physical data independence for
application developers. However, in a DDB scenario, the data and software are distributed over
multiple sites connected by a computer network, so additional types of transparencies are
introduced.
Consider the company database that we have been discussing through-out the book.
The EMPLOYEE, PROJECT, and WORKS_ON tables may be fragmented horizontally (that is,
into sets of rows, as we will discuss in Section 25.4) and stored with possible replication.
The following types of transparencies are possible:
 Data organization transparency (also known
as distribution or network transparency). This refers to freedom for the user from the
operational details of the network and the placement of the data in the distributed system.
It may be divided into location transparency and naming transparency.
 Location transparency refers to the fact that the command used to
perform a task is independent of the location of the data and the location of
the node where the command was issued.
 Naming transparency implies that once a name is associated with an
object, the named objects can be accessed unambiguously without
additional specification as to where the data is located.
 Replication transparency. Copies of the same data objects may be stored at
multiple sites for better availability, performance, and reliability. Replication transparency
makes the user unaware of the existence of these copies.
 Fragmentation transparency. Two types of fragmentation are possible.
 Horizontal fragmentation distributes a relation (table) into subrelations that are
subsets of the tuples (rows) in the original relation.
 Vertical fragmentation distributes a relation into subrelations where each
subrelation is defined by a subset of the columns of the original relation. A global
query by the user must be transformed into several fragment queries. Fragmentation
transparency makes the user unaware of the existence of fragments.
 Other transparencies include design transparency and execution transparency—
referring to freedom from knowing how the distributed database is designed and where a
transaction executes.
Autonomy
Autonomy determines the extent to which individual nodes or DBs in a connected DDB
can operate independently. A high degree of autonomy is desirable for increased flexibility and
customized maintenance of an individual node. Autonomy can be applied to design,
communication, and execution. Design autonomy refers to independence of data model usage and
transaction management techniques among nodes. Communication autonomy determines the
extent to which each node can decide on sharing of information with other nodes. Execution
autonomy refers to independence of users to act as they please.
Reliability and Availability
Reliability and availability are two of the most common potential advantages cited for
distributed databases.
Reliability is broadly defined as the probability that a system is running (not down) at a certain
time point, whereas availability is the probability that the system is continuously available during
a time interval. We can directly relate the reliability and availability of the database to the faults,
errors, and failures associated with it.
A failure can be described as a deviation of a system's behavior from that which is specified
in order to ensure correct execution of operations. Errors constitute the subset of system states
that causes failures. A fault is the cause of an error.
Scalability and Partition Tolerance:
Scalability determines the extent to which the system can expand its capacity while
continuing to operate without interruption.
Types of Scalability:
1. Horizontal scalability: This refers to expanding the number of nodes in the distributed
system. As nodes are added to the system, it should be possible to distribute some of the data
and processing loads from existing nodes to the new nodes.
2. Vertical scalability: This refers to expanding the capacity of the individual nodes in the
system, such as expanding the storage capacity or the processing power of a node.
The concept of Partition Tolerance states that the system should have the capacity to continue
operating while the network is partitioned.
Advantages of Distributed Databases
Organizations resort to distributed database management for various reasons. Some
important advantages are listed below.
1. Improved ease and flexibility of application development. Developing and maintaining
applications at geographically distributed sites of an organization is facilitated owing to
transparency of data distribution and control.
2. Increased reliability and availability. This is achieved by the isolation of faults to their site of
origin without affecting the other databases connected to the network. When the data and DDBMS
software are distributed over several sites, one site may fail while other sites continue to operate.
Only the data and software that exist at the failed site cannot be accessed. This improves both
reliability and availability.
3. Improved performance. A distributed DBMS fragments the database by keeping the data closer
to where it is needed most. Data localization reduces the contention for CPU and I/O services and
simultaneously reduces access delays involved in wide area networks. When a large database is
distributed over multiple sites, smaller databases exist at each site. As a result, local queries and
transactions accessing data at a single site have better performance because of the smaller local
databases. This contributes to improved performance.
4. Easier expansion. In a distributed environment, expansion of the system in terms of adding
more data, increasing database sizes, or adding more processors is much easier.

DISTRIBUTED DATA STORAGE


Distributed database storage is managed in two ways: In database replication, the
systems store copies of data on different sites. If an entire database is available on multiple sites,
it is a fully redundant database.
Distributed databases are used for horizontal scaling, and they are designed to meet the
workload requirements without having to make changes in the database application or
vertically scale a single machine.

Distributed Database Definition


A distributed database represents multiple interconnected databases spread out across
several sites connected by a network. Since the databases are all connected, they appear as a
single database to the users.
Distributed databases utilize multiple nodes; they scale horizontally to form a
distributed system. More nodes in the system provide more computing power, offer greater
availability, and resolve the single-point-of-failure issue.
Different parts of the distributed database are stored in several physical locations, and
the processing requirements are distributed among processors on multiple database nodes.
A centralized distributed database management system (DDBMS) manages the
distributed data as if it were stored in one physical location. DDBMS synchronizes all data
operations among databases and ensures that the updates in one database automatically reflect
on databases in other sites.

Distributed Database Features

Some general features of distributed databases are:


● Location independency - Data is physically stored at multiple sites and managed by an
independent DDBMS.
● Distributed query processing - Distributed databases answer queries in a distributed
environment that manages data at multiple sites. High-level queries are transformed into a
query execution plan for simpler management.
● Distributed transaction management - Provides a consistent distributed database through
commit protocols, distributed concurrency control techniques, and distributed recovery
methods in case of many transactions and failures.
● Seamless integration - Databases in a collection usually represent a single logical
database, and they are interconnected.
● Network linking - All databases in a collection are linked by a network and communicate
with each other.
● Transaction processing - Distributed databases incorporate transaction processing, which
is a program including a collection of one or more database operations. Transaction
processing is an atomic process that is either entirely executed or not at all.
Distributed database storage is managed in two ways:
 Replication
 Fragmentation

Replication
In database replication, the systems store copies of data on different sites. If an entire
database is available on multiple sites, it is a fully redundant database. The advantage of database
replication is that it increases data availability on different sites and allows for parallel query
requests to be processed. However, database replication means that the data requires constant
updates and synchronization with the other sites to maintain an exact database copy. Any changes
made at one site must be recorded at the other sites, or else inconsistencies occur. Constant updates
cause significant server overhead and complicate concurrency control, as many concurrent queries
must be checked across all available sites.
Data Replication
Data replication is the process of storing separate copies of the database at two or
more sites. It is a popular fault tolerance technique of distributed databases.

Advantages of Data Replication


 Reliability − In case of failure of any site, the database system continues to work, since
a copy is available at another site(s).
 Reduction in Network Load − Since local copies of data are available, query processing
can be done with reduced network usage, particularly during prime hours. Data updating
can be done at non-prime hours.
 Quicker Response − Availability of local copies of data ensures quick query processing
and consequently quick response time.
 Simpler Transactions − Transactions require fewer joins of tables located at different
sites and minimal coordination across the network. Thus, they become simpler in nature.

Disadvantages of Data Replication

 Increased Storage Requirements − Maintaining multiple copies of data is associated
with increased storage costs. The storage space required is a multiple of the storage
required for a centralized system.
 Increased Cost and Complexity of Data Updating − Each time a data item is updated,
the update needs to be reflected in all the copies of the data at the different sites. This
requires complex synchronization techniques and protocols.
 Undesirable Application – Database Coupling − If complex update mechanisms are
not used, removing data inconsistency requires complex coordination at the application
level. This results in undesirable application – database coupling.
Some commonly used replication techniques are
 Snapshot replication
 Near-real-time replication
 Pull replication
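As a rough illustration of the first technique, snapshot replication can be thought of as periodically copying the primary's full state to each replica site. The relations and site names below are hypothetical, and real systems ship logs or incremental changes rather than whole copies.

```python
import copy

# Hypothetical primary copy and two replica sites (names invented for the sketch).
primary = {"E1": ("Smith", 30000), "E2": ("Wong", 40000)}
replicas = {"site2": {}, "site3": {}}

def refresh_snapshots():
    # Snapshot replication: push a full, independent copy of the primary
    # to every replica site at refresh time.
    for site in replicas:
        replicas[site] = copy.deepcopy(primary)

refresh_snapshots()
primary["E3"] = ("Zelaya", 25000)       # an update made after the snapshot...
print(replicas["site2"].get("E3"))      # ...is not visible until the next refresh
```

This is why replication trades freshness for availability: between refreshes the replicas can answer local queries quickly but may be stale.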
Fragmentation

When it comes to fragmentation of distributed database storage, the relations are


fragmented, which means they are split into smaller parts. Each of the fragments is stored on
a different site, where it is required. The prerequisite for fragmentation is to make sure that the
fragments can later be reconstructed into the original relation without losing data. The
advantage of fragmentation is that there are no data copies, which prevents data inconsistency.
There are two types of fragmentation:
● Horizontal fragmentation - The relation schema is fragmented into groups of rows, and each
group (tuple) is assigned to one fragment.
● Vertical fragmentation - The relation schema is fragmented into smaller schemas, and each
fragment contains a common candidate key to guarantee a lossless join.
Fragmentation is the task of dividing a table into a set of smaller tables. The
subsets of the table are called fragments. Fragmentation can be of three types:
horizontal, vertical, and hybrid (combination of horizontal and vertical). Horizontal
fragmentation can further be classified into two techniques: primary horizontal
fragmentation and derived horizontal fragmentation.
Fragmentation should be done in such a way that the original table can be
reconstructed from the fragments whenever required. This requirement is called
“reconstructiveness.”
Advantages
1. Permits a number of transactions to execute concurrently.
2. Results in parallel execution of a single query.
3. Increases the level of concurrency, also referred to as intra-query concurrency.
4. Increases system throughput.
5. Since data is stored close to the site of usage, efficiency of the database system is
increased.
6. Local query optimization techniques are sufficient for most queries, since data is
locally available.
7. Since irrelevant data is not available at the sites, security and privacy of the database
system can be maintained.

Disadvantages
1. Applications whose views are defined on more than one fragment may suffer
performance degradation if the applications have conflicting requirements.
2. Simple tasks like checking for dependencies may result in chasing after data across a
number of sites.
3. When data from different fragments are required, the access speeds may be very low.
4. In the case of recursive fragmentation, reconstruction will require expensive
techniques.
5. Lack of backup copies of data at different sites may render the database ineffective in
case of failure of a site.
Vertical Fragmentation
In vertical fragmentation, the fields or columns of a table are grouped into
fragments. In order to maintain reconstructiveness, each fragment should contain
the primary key field(s) of the table. Vertical fragmentation can be used to enforce
privacy of data.
Grouping
 Starts by assigning each attribute to one fragment.
 At each step, joins some of the fragments until some criterion is satisfied.
 Results in overlapping fragments.
Splitting
 Starts with a relation and decides on beneficial partitioning based on the access
behavior of applications to the attributes.
 Fits more naturally within top-down design.
 Generates non-overlapping fragments.
For example, let us consider that a University database keeps records of all
registered students in a STUDENT table. Now, the fees details are maintained in the
accounts section. In this case, the designer will fragment the database as follows −

CREATE TABLE STD_FEES AS
SELECT Regd_No, Fees
FROM STUDENT;
Horizontal Fragmentation
Horizontal fragmentation groups the tuples of a table according to the values of one
or more fields. Horizontal fragmentation should also conform to the rule of
reconstructiveness: each horizontal fragment must have all columns of the original
base table.
 Primary horizontal fragmentation is defined by a selection operation on the owner relation of a
database schema.
 Given a relation R, its horizontal fragments are given by Ri = σFi(R), 1 ≤ i ≤ w, where Fi is
the selection formula used to obtain fragment Ri.

For example, an Emp relation fragmented on salary can be represented using the above
formula as:

Emp1 = σSal ≤ 20K (Emp)
Emp2 = σSal > 20K (Emp)
For example, in the student schema, if the details of all students of Computer
Science Course needs to be maintained at the School of Computer Science, then the
designer will horizontally fragment the database as follows −

CREATE TABLE COMP_STD AS
SELECT * FROM STUDENT
WHERE COURSE = 'Computer Science';
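The selection-based fragments can be mimicked in Python. The Emp rows below are made up; reconstruction is simply the union of the fragments, because the predicates Sal ≤ 20K and Sal > 20K are disjoint and complete.

```python
# Hypothetical Emp relation.
Emp = [
    {"Eno": 1, "Name": "Joe", "Sal": 15000},
    {"Eno": 2, "Name": "May", "Sal": 27000},
    {"Eno": 3, "Name": "Nandita", "Sal": 20000},
]

# Primary horizontal fragments defined by selection predicates.
Emp1 = [t for t in Emp if t["Sal"] <= 20000]   # stored at site 1
Emp2 = [t for t in Emp if t["Sal"] > 20000]    # stored at site 2

# Reconstruction is the union of the fragments: the predicates are
# disjoint and complete, so nothing is lost or duplicated.
assert sorted(Emp1 + Emp2, key=lambda t: t["Eno"]) == Emp
```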

Derived Horizontal Fragmentation


Derived horizontal fragmentation is defined on a member relation of a link according to a
selection operation specified on its owner.
The link between the owner and the member relations is defined as an equi-join.
An equi-join can be implemented by means of semijoins.
Given a link L where owner(L) = S and member(L) = R, the derived horizontal fragments of R
are defined as
Ri = R ⋉ Si, 1 ≤ i ≤ w
where Si = σFi(S),
w is the maximum number of fragments that will be defined on R, and
Fi is the formula using which the primary horizontal fragment Si is defined.
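A small Python sketch of the semijoin, with hypothetical Dept (owner) and Emp (member) relations: the owner is fragmented on Budget, and each member fragment keeps the Emp tuples whose Dno matches some tuple in the corresponding owner fragment.

```python
# Hypothetical owner relation Dept and member relation Emp, linked on Dno.
Dept = [{"Dno": 1, "Budget": 150000}, {"Dno": 2, "Budget": 50000}]
Emp = [
    {"Eno": 1, "Name": "Joe", "Dno": 1},
    {"Eno": 2, "Name": "May", "Dno": 2},
    {"Eno": 3, "Name": "Nandita", "Dno": 1},
]

# Primary horizontal fragments of the owner, defined by selection on Budget.
S1 = [d for d in Dept if d["Budget"] > 100000]
S2 = [d for d in Dept if d["Budget"] <= 100000]

def semijoin(R, S, attr):
    # R semijoin S: the tuples of R that match some tuple of S on the join attribute.
    keys = {s[attr] for s in S}
    return [r for r in R if r[attr] in keys]

Emp1 = semijoin(Emp, S1, "Dno")  # employees of high-budget departments
Emp2 = semijoin(Emp, S2, "Dno")  # employees of low-budget departments
```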

Hybrid Fragmentation
In hybrid fragmentation, a combination of horizontal and vertical fragmentation
techniques are used. This is the most flexible fragmentation technique since it generates
fragments with minimal extraneous information. However, reconstruction of the original table is
often an expensive task.
Hybrid fragmentation can be done in two alternative ways −
 At first, generate a set of horizontal fragments; then generate vertical fragments from one
or more of the horizontal fragments.
 At first, generate a set of vertical fragments; then generate horizontal fragments from one
or more of the vertical fragments.

DISTRIBUTED TRANSACTION
A distributed transaction is a set of operations on data that is performed across two or
more data repositories (especially databases). It is typically coordinated across separate nodes
connected by a network, but may also span multiple databases on a single server.
There are two possible outcomes:
1) all operations successfully complete, or
2) none of the operations are performed at all due to a failure somewhere in the system.
In the latter case, if some work was completed prior to the failure, that work will be
reversed to ensure no net work was done. This type of operation is in compliance with the
“ACID” (atomicity-consistency-isolation-durability) principles of databases that ensure data
integrity. ACID is most commonly associated with transactions on a single database server,
but distributed transactions extend that guarantee across multiple databases.

Below are some steps to understand how distributed transactions work:


Step 1: Application to Resource – Issues Distributed Transaction
The first step is to issue that distributed transaction. The application initiates the
transaction by sending the request to the available resources. The request consists of details
such as operations that are to be performed by each resource in the given transaction.
Step 2: Resource 1 to Resource 2 – Ask Resource 2 to Prepare to Commit
Once the resource receives the transaction request, resource 1 contacts resource 2 and
asks resource 2 to prepare the commit. This step makes sure that both the available resources
are able to perform the dedicated tasks and successfully complete the given transaction.
Step 3: Resource 2 to Resource 1 – Resource 2 Acknowledges Preparation
Once Resource 2 receives the request from Resource 1, it prepares for the commit.
Resource 2 then responds to Resource 1 with an acknowledgment, confirming that it is
ready to go ahead with the allocated transaction.
Step 4: Resource 1 to Resource 2 – Ask Resource 2 to Commit
Once Resource 1 receives an acknowledgment from Resource 2, it sends a request to
Resource 2 and provides an instruction to commit the transaction. This step makes sure that
Resource 1 has completed its task in the given transaction and now it is ready for Resource 2
to finalize the operation.
Step 5: Resource 2 to Resource 1 – Resource 2 Acknowledges Commit
When Resource 2 receives the commit request from Resource 1, it provides Resource 1
with a response and makes an acknowledgment that it has successfully committed the
transaction it was assigned to. This step ensures that Resource 2 has completed its task from
the operation and makes sure that both the resources have synchronized their states.
Step 6: Resource 1 to Application – Receives Transaction Acknowledgement
Once Resource 1 receives an acknowledgment from Resource 2, Resource 1 then sends
an acknowledgment of the transaction back to the application. This acknowledgment confirms
that the transaction that was carried out among multiple resources has been completed
successfully
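The six-step exchange can be condensed into a toy in-process sketch. The Resource class and its methods are invented for illustration; a real system sends these messages over a network and logs state durably at each step.

```python
class Resource:
    """Toy stand-in for a data repository participating in the transaction."""
    def __init__(self, name):
        self.name = name
        self.state = "idle"

    def prepare(self) -> bool:
        # Prepare-to-commit: do the local work and promise to commit.
        self.state = "prepared"
        return True  # acknowledgment

    def commit(self) -> bool:
        self.state = "committed"
        return True  # acknowledgment

def run_transaction(r1: Resource, r2: Resource) -> bool:
    if not r2.prepare():        # Steps 2-3: ask resource 2 to prepare, get its ack
        return False
    if not r2.commit():         # Steps 4-5: ask resource 2 to commit, get its ack
        return False
    r1.prepare(); r1.commit()   # resource 1 finalizes its own part
    return True                 # Step 6: acknowledgment back to the application

print(run_transaction(Resource("r1"), Resource("r2")))  # True
```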

Types of Distributed Transactions


Distributed transactions involve coordinating actions across multiple nodes or
resources to ensure atomicity, consistency, isolation, and durability (ACID properties). Here
are some common types and protocols:
1. Two-Phase Commit Protocol (2PC)
This is a classic protocol used to achieve atomicity in distributed transactions.
 It involves two phases: a prepare phase where all participants agree to commit or abort
the transaction, and a commit phase where the decision is executed synchronously
across all participants.
 2PC ensures that either all involved resources commit the transaction or none do,
thereby maintaining atomicity.
2. Three-Phase Commit Protocol (3PC)
3PC extends 2PC by adding an extra phase (pre-commit phase) to address certain
failure scenarios that could lead to indefinite blocking in 2PC.
 In 3PC, participants first agree to prepare to commit, then to commit, and finally to
complete or abort the transaction.
 This protocol aims to reduce the risk of blocking seen in 2PC by introducing an
additional decision-making phase.
3. XA Transactions
XA (eXtended Architecture) Transactions are a standard defined by The Open Group
for coordinating transactions across heterogeneous resources (e.g., databases, message
queues).
 XA specifies interfaces between a global transaction manager (TM) and resource
managers (RMs).
 The TM coordinates the transaction’s lifecycle, ensuring that all participating RMs
either commit or rollback the transaction atomically.

Implementing Distributed Transactions


Below is how distributed transactions is implemented:
 Transaction Managers (TM):
o Transaction Managers are responsible for coordinating and managing
transactions across multiple resource managers (e.g., databases, message
queues).
o TMs ensure that transactions adhere to ACID properties (Atomicity,
Consistency, Isolation, Durability) even when involving disparate resources.
 Resource Managers (RM):
o Resource Managers are responsible for managing individual resources (e.g.,
databases, file systems) involved in a distributed transaction.
o RMs interact with the TM to prepare for committing or rolling back transactions
based on the TM’s coordination.
 Coordination Protocols:
o Implementations of distributed transactions often rely on coordination protocols
like 2PC, 3PC, or variants such as Paxos and Raft for consensus.
o These protocols ensure that all participants in a transaction reach a consistent
decision regarding commit or rollback.

Advantages of Distributed Transactions


Below are the advantages of distributed transaction:
 Data Consistency: Distributed transactions provide data consistency across multiple
resources. Operations are coordinated across multiple database resources, which ensures
that the system remains in a consistent state even in case of failure.
 Fault Tolerance: Distributed systems can handle faults while preserving transactional
guarantees. If a participating resource fails during execution of the transaction, the
transaction can be rolled back, or retried on alternate resources and completed
successfully.
 Transactional Guarantees: Distributed transactions provide properties such as
durability and isolation. Durability ensures that once a transaction is committed, its
changes persist even if failures occur.
Applications of Distributed Transactions
Below are the applications of distributed transactions:
 Enterprise Resource Planning (ERP) Systems: ERP systems consist of departments
within one organization. Therefore distributed transactions are used here in order to
maintain transactions from various modules such as sales, inventory, finance, and
human resources management.
 Cloud Computing: Distributed transactions are being used in cloud-based
applications. Transactions can be done with the help of multiple data sources and
ensure that data updates and operations that are performed consistently.
 Healthcare Systems: Healthcare systems make use of Distributed transactions when
coordinating patient records, scheduling appointments for patients, and managing the
billing systems. Distributed transactions maintain data consistency and performance in
healthcare systems.
The operation known as a “two-phase commit” (2PC) is a form of a distributed
transaction. “XA transactions” are transactions using the XA protocol, which is one
implementation of a two-phase commit operation. A distributed transaction spans multiple
databases and guarantees data integrity.

How Do Distributed Transactions Work?


● Distributed transactions have the same processing completion requirements as regular
database transactions, but they must be managed across multiple resources, making them
more challenging to implement for database developers.
● The multiple resources add more points of failure, such as the separate software systems
that run the resources (e.g., the database software), the extra hardware servers, and network
failures.
● This makes distributed transactions susceptible to failures, which is why safeguards must
be put in place to retain data integrity.
● For a distributed transaction to occur, transaction managers coordinate the resources
(either multiple databases or multiple nodes of a single database). The transaction manager
can be one of the data repositories that will be updated as part of the transaction, or it can
be a completely independent separate resource that is only responsible for coordination.
● The transaction manager decides whether to commit a successful transaction or rollback
an unsuccessful transaction, the latter of which leaves the database unchanged.
● First, an application requests the distributed transaction to the transaction manager.
● The transaction manager then branches to each resource, which will have its own
“resource manager” to help it participate in distributed transactions.
● Distributed transactions are often done in two phases to safeguard against partial updates
that might occur when a failure is encountered.
● The first phase involves acknowledging intent to commit, the “prepare-to-commit” phase.
After all resources have acknowledged, they are asked to run the final commit, and the
transaction is then complete.

Transaction management
Distributed databases must often support distributed transactions, where one transaction
can involve more than one node. This support methodology is highlighted in the ACID
properties (atomicity, consistency, isolation, durability) of transactions across distributed
database systems. Key elements of ACID properties include:
 Atomicity means that a transaction is treated as a single unit. This also means that either
a complete transaction is available for storage or it's rejected as an error which ensures
data integrity.
 Consistency is maintained in distributed database systems by enforcing predefined
rules and data constraints. If the state, nature, or content of a transaction violates these
rules, the transaction will not be ingested and stored in the distributed system.
 Isolation involves the separation of each transaction from the other transactions to
prevent data conflicts and maintain data integrity. In addition, this benefits operations
when managing multiple distributed data records that may exist across local data stores,
virtual machines via cloud computing, and multiple database nodes which may be
located across multiple sites.
 Durability ensures that stored data is preserved in the event of a system failure. There
are a variety of ways that a transactional distributed database management system
accomplishes this task.

COMMIT PROTOCOLS
In a local database system, for committing a transaction, the transaction manager has
to only convey the decision to commit to the recovery manager. However, in a distributed
system, the transaction manager should convey the decision to commit to all the servers in the
various sites where the transaction is being executed and uniformly enforce the decision.
When processing is complete at each site, it reaches the partially committed transaction
state and waits for all other transactions to reach their partially committed states. When it
receives the message that all the sites are ready to commit, it starts to commit. In a distributed
system, either all sites commit or none of them does.
The different distributed commit protocols are −
● One-phase commit
● Two-phase commit
● Three-phase commit
Distributed One-phase Commit
Distributed one-phase commit is the simplest commit protocol. Let us consider that
there is a controlling site and a number of slave sites where the transaction is being executed.
The steps in distributed commit are −
● After each slave has locally completed its transaction, it sends a “DONE” message to the
controlling site.
● The slaves wait for “Commit” or “Abort” messages from the controlling site. This waiting
time is called a window of vulnerability.
● When the controlling site receives a “DONE” message from each slave, it makes a
decision to commit or abort. This is called the commit point. Then, it sends this message
to all the slaves.
● On receiving this message, a slave either commits or aborts and then sends an
acknowledgement message to the controlling site.

Distributed Two-phase Commit


Distributed two-phase commit reduces the vulnerability of one-phase commit
protocols. The steps performed in the two phases are as follows −
Phase 1: Prepare Phase
● After each slave has locally completed its transaction, it sends a “DONE” message to the
controlling site. When the controlling site has received a “DONE” message from all
slaves, it sends a “Prepare” message to the slaves.
● The slaves vote on whether they still want to commit or not. If a slave wants to commit, it
sends a “Ready” message.
● A slave that does not want to commit sends a “Not Ready” message. This may happen when
the slave has conflicting concurrent transactions or there is a timeout.
Phase 2: Commit/Abort Phase
● After the controlling site has received a “Ready” message from all the slaves −
o The controlling site sends a “Global Commit” message to the slaves.
o The slaves apply the transaction and send a “Commit ACK” message to the controlling site.
● When the controlling site receives “Commit ACK” message from all the slaves, it
considers the transaction as committed.
● After the controlling site has received the first “Not Ready” message from any slave −
o The controlling site sends a “Global Abort” message to the slaves.
o The slaves abort the transaction and send an “Abort ACK” message to the controlling site.
● When the controlling site receives an “Abort ACK” message from all the slaves, it
considers the transaction as aborted.
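The two-phase message flow above can be sketched as follows. This is a minimal in-memory simulation, not a networked implementation: the `Slave` class, the `will_commit` flag, and the message strings are illustrative stand-ins for real sites and network messages.

```python
# Sketch of the two-phase commit message flow with in-memory "slave" sites.

class Slave:
    def __init__(self, will_commit=True):
        self.will_commit = will_commit   # this site's vote in the prepare phase
        self.state = "ACTIVE"

    def prepare(self):
        # Phase 1: vote "Ready" or "Not Ready" when asked to prepare.
        return "Ready" if self.will_commit else "Not Ready"

    def finish(self, decision):
        # Phase 2: apply the global decision and acknowledge it.
        self.state = "COMMITTED" if decision == "Global Commit" else "ABORTED"
        return "Commit ACK" if decision == "Global Commit" else "Abort ACK"

def two_phase_commit(slaves):
    # Phase 1 (Prepare): the controlling site collects votes from every slave.
    votes = [s.prepare() for s in slaves]
    # Phase 2 (Commit/Abort): commit only if every slave voted "Ready".
    decision = "Global Commit" if all(v == "Ready" for v in votes) else "Global Abort"
    acks = [s.finish(decision) for s in slaves]
    return decision, acks

decision, _ = two_phase_commit([Slave(), Slave(), Slave()])
print(decision)                                    # Global Commit
decision, _ = two_phase_commit([Slave(), Slave(will_commit=False)])
print(decision)                                    # Global Abort
```

Note how a single “Not Ready” vote forces a global abort: this is exactly the all-or-nothing property the protocol enforces across sites.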
Distributed Three-phase Commit
The steps in distributed three-phase commit are as follows −
Phase 1: Prepare Phase
• The steps are same as in distributed two-phase commit.
Phase 2: Prepare to Commit Phase
• The controlling site issues an “Enter Prepared State” broadcast message. The
slave sites vote “OK” in response.
Phase 3: Commit / Abort Phase
• The steps are same as two-phase commit except that “Commit ACK”/”Abort
ACK” message is not required.
Advantages of Commit Protocol in DBMS
 The commit protocol in DBMS helps to ensure that changes to the database remain
consistent across all sites.
 It also helps to ensure that the integrity of the data is maintained throughout the
database.
 It also helps to maintain atomicity, which means that either all the operations in a
transaction are completed successfully or none of them is performed at all.
 The commit protocol provides mechanisms for system recovery in the case of system
failures.

CONCURRENCY CONTROL
Concurrency control in distributed systems is achieved by a program called a
scheduler. Schedulers order the operations of transactions in such a way that the
resulting logs are serializable. There are two broad approaches to concurrency control:
the locking approach and the non-locking approach.
Concurrency Problems
The two main operations in a database transaction are Read and Write.
Concurrency problems mainly arise when one user is writing while another is reading, or when
both users try to write the same data simultaneously. Following are some common concurrency
problems:
 Dirty Read Problem ( W-R conflict )
 Lost Update Problem ( W-W conflict )
 Non-repeatable Read Problem ( R-W conflict )

Dirty Read Problem


This problem occurs when one Transaction updates an item of the database, and
somehow that Transaction fails. Before the data gets a rollback, another transaction can access
that updated database item. This situation will cause the Write-Read conflict between both
transactions.
For example:
Consider the two transactions Tx and Ty in the following diagram performing read or
write operations on A. Let say the given balance in account A is 650 rupees.
 At t1 time, transaction Tx will read the value of account A 650 rupees.
 At t2 time, transaction Tx adds 250 rupees to account A, which becomes 800.
 At t3 time, transaction Tx writes the updated value in account A, which is 800.
 At t4 time, transaction Tx rolls back due to a server problem, and A’s value changes back
to 650 rupees (as initially).
 But the value of account A remains 800 for transaction Ty (because at time t4, Ty
reads the value of A, i.e., 800), which is a dirty read. Therefore this is known as the
Dirty Read Problem.
Lost Update Problem
This problem occurs when two different transactions perform the read or write
operations on the same database items in an interleaved manner (concurrent execution ),
making the values inconsistent, resulting in invalid behavior.
For example:
Consider the two transactions Tx and Ty in the following diagram performing read or
write operations on A. Let say the given balance in account A is 650 rupees.
 At t1 time, transaction Tx will read the value of account A that is 650 rupees.
 At t2 time, transaction Tx deducts 100 rupees from account A, which becomes 550
rupees(only deducted and not updated yet).
 At t3 time, transaction Ty reads the value of account A, which is still 650 because
Tx hasn’t updated the value yet.
 At t4 time, transaction Ty adds 100 to account A, which becomes 750 (only added but
not updated yet).
 At t6 time, transaction Tx writes the value of account A, which becomes 550, as Ty
hasn’t updated the value yet.
 At t7 time, transaction Ty writes the value of account A, which becomes 750 as computed
at time t4. This means the value written by Tx at time t6 is lost. Hence the data
becomes incorrect, and the database is left in an inconsistent state. This is known as the
Lost Update problem.
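The same interleaving can be replayed step by step. This is a sketch using local variables for each transaction's private copy of A; the timestamps in the comments match the example above:

```python
# Sketch of the lost-update interleaving: Tx and Ty both read A = 650,
# Tx writes 550 (deduct 100), then Ty overwrites it with 750 (add 100).

A = 650
tx_local = A          # t1: Tx reads 650
ty_local = A          # t3: Ty reads 650 (Tx has not written yet)
tx_local -= 100       # t2: Tx computes 550
ty_local += 100       # t4: Ty computes 750
A = tx_local          # t6: Tx writes 550
A = ty_local          # t7: Ty writes 750 -- Tx's update is lost

print(A)              # 750, although a serial execution would give 650
```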
Non-repeatable Read Problem
It is also known as Inconsistent Retrievals Problem. This problem occurs when in a
transaction, two different values of the same item are read from the database.
For example:
Consider the two transactions Tx and Ty in the following diagram performing read or
write operations on A. Let say the given balance in account A is 650 rupees.

 At t1 time, transaction Tx will read the value from account A, that is, 650 rupees.
 At t2 time, transaction Ty will read the value from account A, that is, 650 rupees.
 At t3 time, transaction Ty adds 250 to account A, which will become 900 rupees( only
added, not updated yet).
 At t4 time, transaction Ty writes the updated value of A that is 900.
 Later at t5 time, transaction Tx reads the value of account A, which is 900.
Within the same transaction Tx, two different values of A are read (650 at time t1 and
900 at time t5). This is a non-repeatable read and is therefore known as the Non-repeatable
Read problem.
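A compact trace of this anomaly, again with a plain variable standing in for the shared item A:

```python
# Sketch of the non-repeatable read: Tx reads A twice and sees two
# different values because Ty commits an update in between.

A = 650
first_read = A        # t1: Tx reads 650
A = A + 250           # t3-t4: Ty adds 250 and writes 900
second_read = A       # t5: Tx reads again and now sees 900

print(first_read, second_read)   # 650 900 -> the read is not repeatable
```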

Various Approaches for Concurrency Control.


1. Locking Based Concurrency Control Protocols
Locking-based concurrency control protocols use the concept of locking data items. A
lock is a variable associated with a data item that determines whether read/write operations can
be performed on that data item. Generally, a lock compatibility matrix is used which states
whether a data item can be locked by two transactions at the same time. Locking-based
concurrency control systems can use either one-phase or two-phase locking protocols.
1. One-phase Locking Protocol: In this method, each transaction locks an item before use
and releases the lock as soon as it has finished using it. This locking method provides
for maximum concurrency but does not always enforce serializability.
2. Two-phase Locking Protocol: In this method, all locking operations precede the first
lock-release or unlock operation. The transaction comprises two phases. In the first
phase, a transaction only acquires all the locks it needs and does not release any lock.
This is called the expanding or the growing phase. In the second phase, the transaction
releases the locks and cannot request any new locks. This is called the shrinking phase.
Every transaction that follows a two-phase locking protocol is guaranteed to be
serializable. However, this approach provides low parallelism between two conflicting
transactions.
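The growing and shrinking phases can be sketched with a toy lock manager. This assumes a single exclusive lock per data item (no shared/read locks and no deadlock handling); the `LockManager` class and method names are illustrative:

```python
# Minimal sketch of two-phase locking with exclusive locks only.

class LockManager:
    def __init__(self):
        self.owner = {}                       # data item -> transaction id

    def acquire(self, txn, item):
        # Refuse the lock if another transaction already holds it.
        if self.owner.get(item, txn) != txn:
            return False
        self.owner[item] = txn
        return True

    def release_all(self, txn):
        # Shrinking phase: release every lock the transaction holds.
        for item in [i for i, t in self.owner.items() if t == txn]:
            del self.owner[item]

lm = LockManager()
# Growing phase: T1 acquires all the locks it needs before releasing any.
print(lm.acquire("T1", "A"), lm.acquire("T1", "B"))   # True True
# A conflicting request from T2 is refused while T1 holds the lock.
print(lm.acquire("T2", "A"))                          # False
# Shrinking phase: once T1 releases its locks, T2 can proceed.
lm.release_all("T1")
print(lm.acquire("T2", "A"))                          # True
```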
2. Timestamp Concurrency Control Algorithms:
Timestamp-based concurrency control algorithms use a transaction’s timestamp to
coordinate concurrent access to a data item to ensure serializability. A timestamp is a unique
identifier given by DBMS to a transaction that represents the transaction’s start time.
These algorithms ensure that transactions are committed in the order dictated by their
timestamps. An older transaction should commit before a younger transaction, since the older
transaction enters the system before the younger one.
Timestamp-based concurrency control techniques generate serializable schedules such
that the equivalent serial schedule is arranged in order of the age of the participating
transactions.
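The core rule of basic timestamp ordering for a write can be sketched as follows, assuming each data item records the largest read and write timestamps it has seen; the dictionary layout and function name are illustrative:

```python
# Sketch of the basic timestamp-ordering check for a write operation.

def try_write(ts, item):
    # Reject the write if a younger transaction has already read or
    # written the item; the older transaction would be aborted/restarted.
    if ts < item["read_ts"] or ts < item["write_ts"]:
        return False
    item["write_ts"] = ts
    return True

A = {"read_ts": 0, "write_ts": 0}
print(try_write(5, A))   # True  -- transaction with timestamp 5 writes A
A["read_ts"] = 8         # a younger transaction (timestamp 8) then reads A
print(try_write(6, A))   # False -- older transaction 6 must not overwrite it
```

The check enforces exactly the ordering described above: an operation arriving "too late" relative to younger transactions is rejected rather than reordered.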
3.Optimistic Concurrency Control Algorithm:
In systems with low conflict rates, the task of validating every transaction for
serializability may lower performance. In these cases, the test for serializability is postponed
to just before commit. Since the conflict rate is low, the probability of aborting transactions
which are not serializable is also low. This approach is called optimistic concurrency control
technique.
In this approach, a transaction’s life cycle is divided into the following three phases−
● Execution Phase − A transaction fetches data items to memory and performs
operations upon them.
● Validation Phase − A transaction performs checks to ensure that committing its
changes to the database passes serializability test.
● Commit Phase − A transaction writes back modified data item in memory to the disk.
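The validation phase above can be sketched as a read-set/write-set overlap check: a transaction passes validation only if nothing it read was overwritten by a transaction that committed while it was executing. The sets and helper name here are illustrative:

```python
# Sketch of the validation test in optimistic concurrency control.

def validate(read_set, committed_write_sets):
    # Fail validation if any item we read was written concurrently
    # by a transaction that committed during our execution phase.
    return all(not (read_set & ws) for ws in committed_write_sets)

print(validate({"A", "B"}, [{"C"}]))        # True  -- no overlap, safe to commit
print(validate({"A", "B"}, [{"B", "D"}]))   # False -- B changed; abort and restart
```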
Problems arise in a distributed DBMS environment for concurrency control and recovery purposes
that are not encountered in a centralized DBMS environment. These include the following:
● Dealing with multiple copies of the data items. The concurrency control method is
responsible for maintaining consistency among these copies. The recovery method is
responsible for making a copy consistent with the other copies if the site on which the
copy is stored fails and recovers later.
● Failure of individual sites. The DDBMS should continue to operate with its running sites,
even when one or more individual sites fail. When a site recovers, its local database must
be brought up to date with the rest of the sites before it rejoins the system.
● Failure of communication links. The system must be able to deal with the failure of one
or more communication links that connect the sites. An extreme case of this problem is that
network partitioning may occur. This breaks up the sites into two or more partitions, where
the sites within each partition can communicate only with one another and not with sites in
other partitions.
● Distributed commit. Problems can arise with committing a transaction that is accessing
databases stored on multiple sites if some sites fail during the commit process. The two-
phase commit protocol is often used to deal with this problem.
● Distributed deadlock. Deadlock may occur among several sites, so techniques for dealing
with deadlocks must be extended to take this into account.

Techniques to deal with recovery and concurrency control in DDBMSs:


1. Distributed Concurrency Control Based on a Distinguished Copy of a Data Item
The idea is to designate a particular copy of each data item as a distinguished copy. The
locks for this data item are linked with the distinguished copy, and the locking and unlocking
requests are sent to the site that contains that copy. The distinguished copies are chosen based on
four methods. They are,
1. Primary Site Technique
2. Primary Site with Backup Site
3. Primary Copy Technique
4. Choosing a New Coordinator Site in Case of Failure
2. Distributed Concurrency Control Based on Voting
In the voting method, there is no distinguished copy; rather, a lock request is sent to all
sites that include a copy of the data item. Each copy maintains its own lock and can grant or deny
the request for it. If a transaction requesting a lock is granted the lock by a majority of the copies,
it holds the lock and informs all copies that it has been granted the lock. If a transaction does not
receive a majority of votes granting it the lock within a certain time-out period, it cancels its
request and informs all sites of the cancellation.
The voting method is considered a truly distributed concurrency control method, since the
responsibility for a decision resides with all the sites involved.
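The majority rule can be sketched in a couple of lines, assuming each copy's independent grant/deny decision is represented as a boolean vote:

```python
# Sketch of majority-based distributed locking: the lock is held only
# if more than half of the copies grant the request.

def request_lock(votes):
    granted = sum(votes)                # count the copies that granted the lock
    return granted > len(votes) / 2     # strict majority of copies required

print(request_lock([True, True, False]))          # True  -- 2 of 3 copies grant
print(request_lock([True, False, False, False]))  # False -- no majority; cancel request
```

Note that exactly half the votes is not enough: a strict majority guarantees that two conflicting transactions can never both hold the lock.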
3. Distributed Recovery
In some cases, it is difficult even to determine whether a site is down without exchanging
numerous messages with other sites. For example, suppose that site X sends a message to site Y
and expects a response from Y but does not receive it. There are several possible explanations:
● The message was not delivered to Y because of communication failure.
● Site Y is down and could not respond.
● Site Y is running and sent a response, but the response was not delivered.
Without additional information or the sending of additional messages, it is difficult to
determine what actually happened.
Another problem with distributed recovery is distributed commit. When a transaction is
updating data at several sites, it cannot commit until it is sure that the effect of the transaction on
every site cannot be lost. This means that every site must first have recorded the local effects of
the transactions permanently in the local site log on disk. The two-phase commit protocol is often
used to ensure the correctness of distributed commits.

QUERY PROCESSING IN DISTRIBUTED DBMS


What is Query?
A query in a database is a request for information from a database management system
(DBMS), which is the software program that maintains data. Users can make a query to
retrieve data or change information in a database, such as adding or removing data.

Query processing in a distributed system requires the transmission of data between computers in
a network. The ordering of data transmissions and local data processing is known as the
distribution strategy for a query.
Distributed query processing is the procedure of answering queries (which means
mainly read operations on large data sets) in a distributed environment where data is
managed at multiple sites in a computer network.
Query processing involves the transformation of a high-level query (e.g., formulated in
SQL) into a query execution plan (consisting of lower-level query operators in some variation
of relational algebra) as well as the execution of this plan.
The goal of the transformation is to produce a plan which is equivalent to the original
query (returning the same result) and efficient, i.e., to minimize resource consumption like total
costs or response time.
Layers of Query Processing

Processing a query in a distributed DBMS is harder than in a centralized (local) DBMS.

Understanding query processing in distributed database environments is more difficult
than in a centralized database because many more elements are involved.

So, the query processing problem is divided into several sub-problems/steps which are
easier to solve individually.

A general layering scheme for describing distributed query processing is given below:

Main layers of Query Processing:


Query processing involves 4 main layers:
 Query Decomposition
 Data Localization
 Global Query Optimization
 Distributed Execution

Main layers of Query Processing



The input is a query on global data expressed in relational calculus. This query is posed
on global (distributed) relations, meaning that data distribution is hidden.

The first three layers map the input query into an optimized distributed query execution
plan. They perform the functions of query decomposition, data localization, and global
query optimization.

Query decomposition and data localization correspond to query rewriting.

The first three layers are performed by a central control site and use schema information
stored in the global directory.

 The fourth layer performs distributed query execution by executing the plan and returns
the answer to the query. It is done by the local sites and the control site.
Query Decomposition
The first layer decomposes the calculus query into an algebraic query on global relations. The
information needed for this transformation is found in the global conceptual schema describing
the global relations.

Both input and output queries refer to global relations, without knowledge of the distribution of
data. Therefore, query decomposition is the same for centralized and distributed systems.

Query decomposition can be viewed as four successive steps:


1) Normalization,
2) Analysis,
3) Elimination of redundancy, and
4) Rewriting.

First, the calculus query is rewritten in a normalized form that is suitable for subsequent
manipulation. Normalization of a query generally involves the manipulation of the query
quantifiers and of the query qualification by applying logical operator priority.

 Second, the normalized query is analyzed semantically so that incorrect queries are
detected and rejected as early as possible. The analysis typically uses some sort of graph
that captures the semantics of the query.

Third, the correct query is simplified. One way to simplify a query is to eliminate
redundant predicates.

 Fourth, the calculus query is restructured as an algebraic query. Several algebraic queries
can be derived from the same calculus query, and that some algebraic queries are “better”
than others. The quality of an algebraic query is defined in terms of expected
performance.
Localization of Distributed Data

Output of the first layer is an algebraic query on distributed relations which is input to the
second layer.

The main role of this layer is to localize the query’s data using data distribution
information.

We know that relations are fragmented and stored in disjoint subsets, called fragments
where each fragment is stored at different site.

This layer determines which fragments are involved in the query and transforms the
distributed query into a fragment query.

A naive way to localize a distributed query is to generate a query where each global relation is
substituted by its localization program. This can be viewed as replacing the leaves of the operator
tree of the distributed query with subtrees corresponding to the localization programs. We call
the query obtained this way the localized query.
Global Query Optimization

The input to the third layer is a fragment algebraic query.


The goal of this layer is to find an execution strategy for the algebraic fragment query which is
close to optimal.
Query optimization consists of
 Finding the best ordering of operations in the fragment query,
 Finding the communication operations which minimize a cost function.

The cost function refers to computing resources such as disk space, disk I/Os, buffer
space, CPU cost, communication cost, and so on.
 Query optimization is often achieved through the semijoin operator instead of full join
operators.
Local Query Optimization (Distributed Execution)

The last layer is performed by all the sites having fragments involved in the query.

Each subquery, called a local query, is executing at one site. It is then optimized using the
local schema of the site.

QUERY PROCESSING
A distributed database query is processed in stages as follows:
1. Query Mapping:
● The input query on distributed data is specified using a query language.
● It is then translated into an algebraic query on global relations.
● This translation is done by referring to the global conceptual schema. Hence, this step
is mostly identical to the one performed in a centralized DBMS.
● The query is first normalized, analyzed for semantic errors, simplified, and finally
restructured into an algebraic query.
2. Localization:
● In a distributed database, fragmentation results in fragments or relations being
stored in separate sites, with some fragments replicated.
● This stage maps the distributed query on the global schema to separate queries on
individual fragments using data distribution and replication information.
3. Global Query Optimization.
● Optimization consists of selecting a strategy from a list of candidates that is
closest to optimal.
● A list of candidate queries can be obtained by permuting the ordering of
operations within a fragment query generated by the previous stage.
● Time is the preferred unit for measuring cost.
● The total cost is a weighted combination of costs such as CPU cost, I/O costs,
and communication costs.

4. Local Query Optimization.


● This stage is common to all sites in the DDB.
● The techniques are similar to those used in centralized systems.
● The first three stages discussed above are performed at a central control site,
whereas the last stage is performed locally.
1. Data Transfer Costs of Distributed Query Processing
In distributed query processing, the data transfer cost means the cost of transferring
intermediate files to other sites for processing, plus the cost of transferring the final
result files to the site where the result is required.
Commonly, the data transfer cost is calculated in terms of the size of the messages. By
using the below formula, we can calculate the data transfer cost:
Data transfer cost = C * Size
C refers to the cost per byte of data transferring and Size is the no. of bytes transmitted.
In a distributed system, the complicating factors in query processing are,
● The cost of transferring data over the network.
● The goal of reducing the amount of data transfer

Example:
Find the name of employees and their department names. Also, find the amount of data
transfer to execute this query when the query is submitted to Site 3.
Answer: Consider that the query is submitted at site 3, and that neither of the two relations,
EMPLOYEE and DEPARTMENT, is available at site 3. To execute this query, we have
three strategies:
 Transfer both tables, EMPLOYEE and DEPARTMENT, to SITE 3 and then join the
tables there. The total cost is 1000 * 60 + 50 * 30 = 60,000 + 1,500 = 61,500 bytes.
 Transfer the table EMPLOYEE to SITE 2, join the tables at SITE 2, and then transfer the
result to SITE 3. The total cost is 60 * 1000 + 60 * 1000 = 120,000 bytes, since we
have to transfer 1000 result tuples having NAME and DNAME, 60 bytes each.
 Transfer the table DEPARTMENT to SITE 1, join the tables at SITE 1, and then transfer
the result to SITE 3. The total cost is 30 * 50 + 60 * 1000 = 61,500 bytes, since we have
to transfer 1000 result tuples having NAME and DNAME from site 1 to site 3,
60 bytes each.
Now, if the optimization criterion is to reduce the amount of data transfer, we can choose
either strategy 1 or strategy 3 from the above.
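The arithmetic for the three strategies can be checked with a short script. The tuple counts and byte sizes below are the ones assumed in the example (1000 EMPLOYEE tuples of 60 bytes, 50 DEPARTMENT tuples of 30 bytes, and a 60-byte joined result tuple per employee):

```python
# Recompute the data transfer cost (in bytes) of each strategy.

EMP, EMP_SIZE = 1000, 60          # EMPLOYEE: 1000 tuples, 60 bytes each
DEPT, DEPT_SIZE = 50, 30          # DEPARTMENT: 50 tuples, 30 bytes each
RESULT, RESULT_SIZE = 1000, 60    # joined (NAME, DNAME) result tuples

s1 = EMP * EMP_SIZE + DEPT * DEPT_SIZE        # ship both tables to site 3
s2 = EMP * EMP_SIZE + RESULT * RESULT_SIZE    # ship EMPLOYEE to site 2, result to site 3
s3 = DEPT * DEPT_SIZE + RESULT * RESULT_SIZE  # ship DEPARTMENT to site 1, result to site 3

print(s1, s2, s3)        # 61500 120000 61500
print(min(s1, s2, s3))   # 61500 -- strategies 1 and 3 minimize data transfer
```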

A more complex strategy, which sometimes works better than these simple strategies, uses an
operation called semijoin.

Distributed Query Processing Using Semijoin

● Distributed query processing uses the semijoin operation to reduce the number of tuples in
a relation before transferring it to another site.
● Joining is done by sending the column of one relation R to the site where the other relation
S is located.
● The join attributes and the attributes required in the result, are projected out and shipped
back to the original site and joined with R.
● Hence, only the joining column of R is transferred in one direction, and a subset of S with
no irrelevant tuples or attributes is transferred in the other direction. This can be an efficient
solution to minimizing data transfer.

Example: Find the amount of data transferred to execute the same query given in the above
example using a semi-join operation.
Answer: The following strategy can be used to execute the query.
 Project the attributes NAME and DID of the EMPLOYEE table at site 1 and then transfer
them to site 3. The amount of data transferred is 30 * 1000 = 30,000 bytes
(assuming 30 bytes per projected tuple).
 Transfer the table DEPARTMENT to site 3 and join the projected attributes of
EMPLOYEE with this table. The size of the DEPARTMENT table is 30 * 50 = 1,500 bytes.
Applying the above scheme, the amount of data transferred to execute the query will be 30000 +
1500 = 31500 bytes.
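The semijoin saving can be verified with the same style of calculation, using the sizes assumed in the example (30 bytes per projected EMPLOYEE tuple, 30 bytes per DEPARTMENT tuple):

```python
# Recompute the semijoin strategy's data transfer cost (in bytes).

EMP, PROJ_SIZE = 1000, 30    # 1000 projected (NAME, DID) tuples, 30 bytes each
DEPT, DEPT_SIZE = 50, 30     # 50 DEPARTMENT tuples, 30 bytes each

semijoin_cost = EMP * PROJ_SIZE + DEPT * DEPT_SIZE
print(semijoin_cost)         # 31500 -- about half of the 61500-byte full-table strategy
```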
