Unit I: Distributed Databases (ADT)
DISTRIBUTED DATABASES
INTRODUCTION
A distributed system is a piece of software that serves to coordinate the actions of several
computers. This coordination is achieved by exchanging messages, i.e., pieces of data conveying
information. The system relies on a network that connects the computers and handles the routing
of messages.
Database and Database Management System:
A distributed system is a software system that interconnects a collection of
heterogeneous independent computers, where coordination and communication between
computers only happen through message passing, with the intention of working towards a
common goal.
A database is an ordered collection of related data that is built for a specific purpose.
A database may be organized as a collection of multiple tables, where a table represents a real-world element or entity. Each table has several different fields that represent the characteristic features of the entity.
For example, a company database may include tables for projects, employees,
departments, products and financial records. The fields in the Employee table may be Name,
Company_Id, Date_of_Joining, and so forth.
A database management system is a collection of programs that enables creation and
maintenance of a database. DBMS is available as a software package that facilitates definition,
construction, manipulation and sharing of data in a database.
Definition of a database includes description of the structure of a database.
Construction of a database involves actual storing of the data in any storage medium.
Manipulation refers to retrieving information from the database, updating the database, and generating reports. Sharing of data allows the data to be accessed by different users or programs.
Examples of DBMS Application Areas
Automatic Teller Machines
Train Reservation System
Employee Management System
Student Information System
Examples of DBMS Packages
MySQL
Oracle
SQL Server
dBASE
FoxPro
PostgreSQL, etc.
Database Schemas:
A database schema is a description of the database which is specified during database
design and subject to infrequent alterations. It defines the organization of the data, the
relationships among them, and the constraints associated with them.
Databases are often represented through the three-schema architecture, or ANSI-SPARC architecture. The goal of this architecture is to separate the user application from the physical database.
The three levels are
Internal Level having Internal Schema
− It describes the physical structure, details of internal storage and access paths for the
database.
Conceptual Level having Conceptual Schema
− It describes the structure of the whole database while hiding the details of physical
storage of data. This illustrates the entities, attributes with their data types and constraints, user
operations and relationships.
External or View Level having External Schemas or Views
− It describes the portion of a database relevant to a particular user or a group of users while hiding the rest of the database.
Types of DBMS
Hierarchical DBMS
Network DBMS
Relational DBMS
Object Oriented DBMS
Distributed DBMS
Hierarchical DBMS
In hierarchical DBMS, the relationships among data in the database are established so
that one data element exists as a subordinate of another. The data elements have parent-child
relationships and are modelled using the “tree” data structure. These are very fast and simple.
Network DBMS
A network DBMS is one where the relationships among data in the database are of the many-to-many type, in the form of a network. The structure is generally complicated due to the existence of numerous many-to-many relationships. A network DBMS is modelled using the "graph" data structure.
Relational DBMS
In relational databases, the database is represented in the form of relations. Each
relation models an entity and is represented as a table of values. In the relation or table, a row
is called a tuple and denotes a single record. A column is called a field or an attribute and
denotes a characteristic property of the entity. RDBMS is the most popular database
management system.
For example − A Student Relation
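As a sketch, a Student relation along the lines assumed by the examples below could be defined as follows (the column names and types are illustrative assumptions):

CREATE TABLE Student (
  Roll_No INT PRIMARY KEY,   -- unique identifier of the student
  Name    VARCHAR(50),       -- student name
  Stream  VARCHAR(40)        -- e.g. 'Computer Science', 'Electronics'
);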
RETRIEVE information from the database – Retrieving information generally involves selecting a subset of a table or displaying data from the table after some computations have been done. It is done by querying the table.
Example − To retrieve the names of all students of the Computer Science stream, the following
SQL query needs to be executed –
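A minimal sketch of such a query, assuming the Student table has Name and Stream fields:

SELECT Name
FROM Student
WHERE Stream = 'Computer Science';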
UPDATE information stored and modify database structure – Updating a table involves replacing old values in the existing table's rows with new values.
Example − SQL command to change stream from Electronics to Electronics and Communications
−
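A sketch of the command, assuming the same Stream field:

UPDATE Student
SET Stream = 'Electronics and Communications'
WHERE Stream = 'Electronics';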
Example − To remove the student table completely, the SQL command used is −
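A sketch of the command:

DROP TABLE Student;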
Architectural Models
Client-Server Architecture for DDBMS − The client performs the functions of the application and the user interface, while the server performs DBMS functions such as data management, query processing, and transaction management.
Multi-DBMS Architecture − A multi-DBMS is expressed through the following six levels of schemas −
Multi-database View Level − Depicts multiple user views comprising of subsets of the
integrated distributed database.
Multi-database Conceptual Level − Depicts integrated multi-database that comprises of
global logical multi-database structure definitions.
Multi-database Internal Level − Depicts the data distribution across different sites and
multi-database to local data mapping.
Local database View Level − Depicts public view of local data.
Local database Conceptual Level − Depicts local data organization at each site.
Local database Internal Level − Depicts physical data organization at each site.
There are two design alternatives for multi-DBMS −
Model with multi-database conceptual level.
Model without multi-database conceptual level.
The idea behind distributed architectures is to have these components placed on different platforms, where the components can communicate with each other over a communication network in order to achieve specific objectives.
Architectural Styles
There are four different architectural styles, plus the hybrid architecture, when it comes to distributed systems. The basic idea is to organize logically different components and distribute those components over the various machines.
Layered Architecture
Object Based Architecture
Data-centered Architecture
Event Based Architecture
Hybrid Architecture
Layered Architecture
The layered architecture separates layers of components from each other, giving it a
much more modular approach. A well-known example for this is the OSI model that
incorporates a layered architecture when interacting with each of the components.
The layers on the bottom provide a service to the layers on the top. The request flows
from top to bottom, whereas the response is sent from bottom to top.
The advantage of using this approach is that the calls always follow a predefined path, and that each layer can be easily replaced or modified without affecting the entire architecture.
Object Based Architecture
This architectural style is based on a loosely coupled arrangement of objects. It has no layered structure and, unlike the layered style, does not prescribe a sequential set of steps that must be carried out for a given call. Each component is referred to as an object, and each object can interact with other objects through a given connector or interface.
Calls are much more direct: all the different components can interact directly with other components through direct method calls.
The major advantage of this architecture is that the components are decoupled in space, i.e., loosely coupled.
Client-Server Architecture
As one common design feature, the client-server architecture has a centralized security database. This database contains security details like credentials and access details. Users can't log in to a server without the security credentials. So, it makes this architecture a bit more stable and secure than peer-to-peer. The stability comes from the security database being able to allow resource usage in a much more meaningful way. But on the other hand, the system might slow down, as the server can only handle a limited amount of workload at a given time.
Advantages:
Easier to Build and Maintain
Better Security
Stable
Disadvantages:
Single point of failure
Less scalable
Peer-to-Peer Architecture
The general idea behind peer-to-peer is that there is no central control in a distributed system. The basic idea is that each node can be either a client or a server at a given time. If the node is requesting something, it acts as a client, and if some node is providing something, it acts as a server. In general, each node is referred to as a peer.
In this network, any new node has to first join the network. After joining, it can either request a service or provide a service. The initiation phase of a node (the joining of a node) can vary according to the implementation of the network. There are two ways in which a new node can get to know what other nodes are providing −
Centralized Lookup Server − The new node has to register with the centralized lookup server and mention the services it will be providing on the network. So, whenever you want a service, you simply contact the centralized lookup server and it will direct you to the relevant service provider.
Decentralized System − A node desiring a specific service must broadcast and ask every other node in the network, so that whoever is providing the service will respond.
Middleware in Distributed Applications
If we look at distributed systems today, they lack uniformity and consistency. Various heterogeneous devices have taken over the world, and distributed systems have to cater to all these devices in a common way. One way distributed systems can achieve uniformity is through a common layer that supports the underlying hardware and operating systems. This common layer is known as middleware; it provides services beyond what is already provided by the operating system, enabling the various features and components of a distributed system to work together and enhancing functionality.
This layer provides certain data structures and operations that allow processes and users on different machines to inter-operate and work together in a consistent way.
Every structured network inherently suffers from poor scalability, due to the need for
structure maintenance. In general, the nodes in a structured overlay network are formed in a
logical ring, with nodes being connected to this ring. In this ring, certain nodes are responsible
for certain services.
A common approach that can be used to tackle the coordination between nodes, is to
use distributed hash tables (DHTs). A traditional hash function converts a unique key into a
hash value that will represent an object in the network. The hash function value is used to insert
an object in the hash table and to retrieve it.
In a DHT, each key is assigned a unique hash, where the random hash value needs to come from a very large address space in order to ensure uniqueness. A mapping function is used to assign objects to nodes based on the hash function value. A lookup based on the hash function value returns the network address of the node that stores the requested object.
Replication
In database replication, the systems store copies of data on different sites. If an entire
database is available on multiple sites, it is a fully redundant database. The advantage of database
replication is that it increases data availability on different sites and allows for parallel query
requests to be processed. However, database replication means that data requires constant updates
and synchronization with other sites to maintain an exact database copy. Any changes made on
one site must be recorded on other sites, or else inconsistencies occur. Constant updates cause a
lot of server overhead and complicate concurrency control, as a lot of concurrent queries must be
checked in all available sites.
Data Replication
Data replication is the process of storing separate copies of the database at two or
more sites. It is a popular fault tolerance technique of distributed databases.
Fragmentation is the task of dividing a table into a set of smaller tables called fragments; the sections below describe vertical, horizontal, and hybrid fragmentation.
Disadvantages of Fragmentation
1. Applications whose views are defined on more than one fragment may suffer performance degradation if the applications have conflicting requirements.
2. Simple tasks like checking for dependencies may result in chasing after data across a number of sites.
3. When data from different fragments are required, access speeds may be very low.
4. In case of recursive fragmentation, the job of reconstruction will need expensive techniques.
5. Lack of back-up copies of data at different sites may render the database ineffective in case of failure of a site.
Vertical Fragmentation
In vertical fragmentation, the fields or columns of a table are grouped into
fragments. In order to maintain reconstructiveness, each fragment should contain
the primary key field(s) of the table. Vertical fragmentation can be used to enforce
privacy of data.
Grouping
− Starts by assigning each attribute to one fragment.
− At each step, joins some of the fragments until some criterion is satisfied.
− Results in overlapping fragments.
Splitting
− Starts with a relation and decides on beneficial partitioning based on the access behavior of applications to the attributes.
− Fits more naturally within the top-down design.
− Generates non-overlapping fragments.
For example, let us consider that a University database keeps records of all
registered students in a Student table having the following schema.
STUDENT
Now, the fees details are maintained in the accounts section. In this case, the designer will fragment the database so that the fees-related fields form a separate fragment, as sketched below.
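As a sketch, assuming the STUDENT schema includes, among other fields, a primary key Regd_No and a Fees field, the fees data can be split into a fragment of its own (each fragment keeps the primary key, so the original table can be reconstructed by a join):

-- Vertical fragment holding only the fees details,
-- plus the primary key needed for reconstruction.
CREATE TABLE STD_FEES AS
SELECT Regd_No, Fees
FROM STUDENT;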
Horizontal Fragmentation
Horizontal fragmentation groups the tuples of a table in accordance with the values of one or more fields. Horizontal fragmentation should also conform to the rule of reconstructiveness: each horizontal fragment must have all columns of the original base table.
Primary horizontal fragmentation is defined by a selection operation on the owner relation of a
database schema.
Given a relation R, its horizontal fragments are given by
Ri = σFi(R), 1 ≤ i ≤ w
where Fi is the selection formula used to obtain fragment Ri.
The example above can be expressed with such selections, as sketched below.
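For illustration, assuming the STUDENT table has a Course field, two horizontal fragments could be materialized as follows (the original table is the union of all such fragments):

-- Fragment F1: students of the Computer Science course.
CREATE TABLE CSE_STUDENT AS
SELECT * FROM STUDENT WHERE Course = 'Computer Science';

-- Fragment F2: students of the Electronics course.
CREATE TABLE ECE_STUDENT AS
SELECT * FROM STUDENT WHERE Course = 'Electronics';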
Hybrid Fragmentation
In hybrid fragmentation, a combination of horizontal and vertical fragmentation techniques is used. This is the most flexible fragmentation technique, since it generates fragments with minimal extraneous information. However, reconstruction of the original table is often an expensive task.
Hybrid fragmentation can be done in two alternative ways −
At first, generate a set of horizontal fragments; then generate vertical fragments from one
or more of the horizontal fragments.
At first, generate a set of vertical fragments; then generate horizontal fragments from one
or more of the vertical fragments.
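A sketch of the first alternative, under the same assumed STUDENT schema: fragment horizontally by Course first, then split the fees fields vertically out of one horizontal fragment:

-- Horizontal fragment: students of one course.
CREATE TABLE CSE_STUDENT AS
SELECT * FROM STUDENT WHERE Course = 'Computer Science';

-- Vertical fragment of that horizontal fragment: fees details only,
-- keeping the primary key for reconstruction.
CREATE TABLE CSE_STD_FEES AS
SELECT Regd_No, Fees FROM CSE_STUDENT;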
DISTRIBUTED TRANSACTION
A distributed transaction is a set of operations on data that is performed across two or
more data repositories (especially databases). It is typically coordinated across separate nodes
connected by a network, but may also span multiple databases on a single server.
There are two possible outcomes:
1) all operations successfully complete, or
2) none of the operations are performed at all due to a failure somewhere in the system.
In the latter case, if some work was completed prior to the failure, that work will be
reversed to ensure no net work was done. This type of operation is in compliance with the
“ACID” (atomicity-consistency-isolation-durability) principles of databases that ensure data
integrity. ACID is most commonly associated with transactions on a single database server,
but distributed transactions extend that guarantee across multiple databases.
Transaction management
Distributed databases must often support distributed transactions, where one transaction
can involve more than one node. This support methodology is highlighted in the ACID
properties (atomicity, consistency, isolation, durability) of transactions across distributed
database systems. Key elements of ACID properties include:
Atomicity means that a transaction is treated as a single unit. This also means that either the complete transaction is stored or it is rejected as an error, which ensures data integrity.
Consistency is maintained in distributed database systems by enforcing predefined
rules and data constraints. If the state, nature, or content of a transaction violates these
rules, the transaction will not be ingested and stored in the distributed system.
Isolation involves the separation of each transaction from the other transactions to
prevent data conflicts and maintain data integrity. In addition, this benefits operations
when managing multiple distributed data records that may exist across local data stores,
virtual machines via cloud computing, and multiple database nodes which may be
located across multiple sites.
Durability ensures that stored data is preserved in the event of a system failure. There
are a variety of ways that a transactional distributed database management system
accomplishes this task.
COMMIT PROTOCOLS
In a local database system, for committing a transaction, the transaction manager has
to only convey the decision to commit to the recovery manager. However, in a distributed
system, the transaction manager should convey the decision to commit to all the servers in the
various sites where the transaction is being executed and uniformly enforce the decision.
When processing is complete at each site, the transaction reaches the partially committed state and waits for all other sites to reach their partially committed states. When the controlling site receives the message that all the sites are ready to commit, it starts to commit. In a distributed system, either all sites commit or none of them does.
The different distributed commit protocols are −
● One-phase commit
● Two-phase commit
● Three-phase commit
Distributed One-phase Commit
Distributed one-phase commit is the simplest commit protocol. Let us consider that
there is a controlling site and a number of slave sites where the transaction is being executed.
The steps in distributed commit are −
● After each slave has locally completed its transaction, it sends a “DONE” message to the
controlling site.
● The slaves wait for “Commit” or “Abort” messages from the controlling site. This waiting
time is called a window of vulnerability.
● When the controlling site receives a “DONE” message from each slave, it makes a
decision to commit or abort. This is called the commit point. Then, it sends this message
to all the slaves.
● On receiving this message, a slave either commits or aborts and then sends an
acknowledgement message to the controlling site.
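The two-phase commit protocol listed above adds a separate voting phase before the commit decision. Some DBMSs expose the two phases directly in SQL; the following is a sketch using PostgreSQL-style commands that a controlling site could drive at each participant site (the ACCOUNT table and the transaction identifier 'txn_42' are illustrative assumptions):

BEGIN;
UPDATE ACCOUNT SET Balance = Balance - 100 WHERE Acc_No = 'A';
PREPARE TRANSACTION 'txn_42';  -- phase 1: this site votes "ready" and persists the transaction
-- Once every participating site has prepared successfully, the coordinator issues:
COMMIT PREPARED 'txn_42';      -- phase 2: commit the prepared transaction
-- Or, if any site failed to prepare:
-- ROLLBACK PREPARED 'txn_42';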
CONCURRENCY CONTROL
Concurrency control in distributed systems is achieved by a program called the scheduler. Schedulers help to order the operations of transactions in such a way that the resulting logs are serializable. There are two types of concurrency control: the locking approach and the non-locking approach.
Concurrency Problems
Since the two main operations in a database transaction are Read and Write operations.
This problem mainly arises when one user is writing and the other is reading or when both the
users try to write the same data simultaneously. Following are some common concurrency
problems:
Dirty Read Problem ( W-R conflict )
Lost Update Problem ( W-W conflict )
Non-repeatable Read Problem ( R-W conflict )
At time t1, transaction Tx reads the value of account A, that is, 650 rupees.
At time t2, transaction Ty reads the value of account A, that is, 650 rupees.
At time t3, transaction Ty adds 250 to account A, which becomes 900 rupees (only added, not yet written).
At time t4, transaction Ty writes the updated value of A, that is, 900.
Later, at time t5, transaction Tx reads the value of account A, which is now 900.
Within the same transaction Tx, it reads two different values of A (650 at time t1 and 900 at time t5). This is a non-repeatable read and is therefore known as the non-repeatable read problem.
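The schedule above can be sketched in SQL, assuming a hypothetical ACCOUNT table with Acc_No and Balance fields; the anomaly appears when Tx runs under the READ COMMITTED isolation level:

-- Session 1: transaction Tx
BEGIN;
SELECT Balance FROM ACCOUNT WHERE Acc_No = 'A';  -- t1: reads 650

-- Session 2: transaction Ty, interleaved with Tx
BEGIN;
SELECT Balance FROM ACCOUNT WHERE Acc_No = 'A';  -- t2: reads 650
UPDATE ACCOUNT SET Balance = Balance + 250
WHERE Acc_No = 'A';                              -- t3/t4: A becomes 900
COMMIT;

-- Session 1 again, still inside Tx
SELECT Balance FROM ACCOUNT WHERE Acc_No = 'A';  -- t5: now reads 900
COMMIT;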
DISTRIBUTED QUERY PROCESSING
Query processing in a distributed system requires the transmission of data between the computers in a network. The arrangement (ordering) of data transmissions and local data processing is known as a distribution strategy for a query.
Distributed query processing is the procedure of answering queries (which means
mainly read operations on large data sets) in a distributed environment where data is
managed at multiple sites in a computer network.
Query processing involves the transformation of a high-level query (e.g., formulated in
SQL) into a query execution plan (consisting of lower-level query operators in some variation
of relational algebra) as well as the execution of this plan.
The goal of the transformation is to produce a plan which is equivalent to the original query (returning the same result) and efficient, i.e., one that minimizes resource consumption such as total cost or response time.
Layers of Query Processing
Processing a query in a distributed DBMS differs from processing it in a centralized (local) DBMS. Understanding query processing in a distributed database environment is more difficult than in a centralized database, because many more elements are involved. So, the query processing problem is divided into several sub-problems/steps which are easier to solve individually.
A general layering scheme for describing distributed query processing consists of four layers: query decomposition, data localization, global query optimization, and distributed query execution.
The fourth layer performs distributed query execution by executing the plan and returning the answer to the query. It is done by the local sites and the control site.
Query Decomposition
The first layer decomposes the calculus query into an algebraic query on global relations. The
information needed for this transformation is found in the global conceptual schema describing
the global relations.
Both input and output queries refer to global relations, without knowledge of the distribution of
data. Therefore, query decomposition is the same for centralized and distributed systems.
First, the calculus query is rewritten in a normalized form that is suitable for subsequent manipulation. Second, the normalized query is analyzed semantically so that incorrect queries are detected and rejected as early as possible. Typically, some sort of graph that captures the semantics of the query is used.
Third, the correct query is simplified. One way to simplify a query is to eliminate redundant predicates, i.e., conditions that do not change the result because they are already implied by the rest of the query.
Fourth, the calculus query is restructured as an algebraic query. Several algebraic queries can be derived from the same calculus query, and some algebraic queries are "better" than others. The quality of an algebraic query is defined in terms of expected performance, as illustrated below.
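As a sketch of the simplification step, assume a hypothetical STUDENT(Regd_No, Name, Stream) relation. A redundant predicate can be eliminated without changing the result, since A OR (A AND B) is equivalent to A:

-- Query with a redundant predicate:
SELECT Name
FROM STUDENT
WHERE Stream = 'Computer Science'
   OR (Stream = 'Computer Science' AND Regd_No > 0);

-- Simplified equivalent query:
SELECT Name
FROM STUDENT
WHERE Stream = 'Computer Science';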
Localization of Distributed Data
Output of the first layer is an algebraic query on distributed relations which is input to the
second layer.
The main role of this layer is to localize the query’s data using data distribution
information.
We know that relations are fragmented and stored in disjoint subsets, called fragments, where each fragment may be stored at a different site.
This layer determines which fragments are involved in the query and transforms the distributed query into a fragment query.
A naive way to localize a distributed query is to generate a query where each global relation is
substituted by its localization program. This can be viewed as replacing the leaves of the operator
tree of the distributed query with subtrees corresponding to the localization programs. We call
the query obtained this way the localized query.
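As a sketch, suppose a global STUDENT relation is horizontally fragmented into STUDENT1 and STUDENT2, stored at different sites. A query on STUDENT is localized by substituting the relation with its localization program, here the union of its fragments (the names are illustrative assumptions):

-- Distributed query on the global relation:
SELECT Name FROM STUDENT WHERE Course = 'Computer Science';

-- Localized query after substituting the localization program:
SELECT Name FROM
  (SELECT * FROM STUDENT1
   UNION ALL
   SELECT * FROM STUDENT2) AS STUDENT
WHERE Course = 'Computer Science';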
Global Query Optimization
The input to the third layer is a fragment query. The goal of this layer is to find an execution strategy for the fragment query which is close to optimal, as described under stage 3 of query processing below.
QUERY PROCESSING
A distributed database query is processed in stages as follows:
1. Query Mapping:
● The input query on distributed data is specified using a query language.
● It is then translated into an algebraic query on global relations.
● This translation is done by referring to the global conceptual schema. Hence, this translation is mostly identical to the one performed in a centralized DBMS.
● It is first normalized, analyzed for semantic errors, simplified, and finally
restructured into an algebraic query.
2. Localization:
● In a distributed database, fragmentation results in fragments or relations being
stored in separate sites, with some fragments replicated.
● This stage maps the distributed query on the global schema to separate queries on
individual fragments using data distribution and replication information.
3. Global Query Optimization.
● Optimization consists of selecting a strategy from a list of candidates that is
closest to optimal.
● A list of candidate queries can be obtained by permuting the ordering of
operations within a fragment query generated by the previous stage.
● Time is the preferred unit for measuring cost.
● The total cost is a weighted combination of costs such as CPU cost, I/O cost, and communication cost, as sketched below.
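Following the usual distributed-DBMS cost model (a sketch; the coefficient names are assumptions), this weighted combination can be written as:
Total cost = C_CPU * #instructions + C_I/O * #disk I/Os + C_MSG * #messages + C_TR * #bytes transmitted
where C_MSG is the fixed cost of initiating a message and C_TR is the cost of transmitting one byte between sites.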
Example:
Find the names of employees and their department names. Also, find the amount of data transfer required to execute this query when the query is submitted to site 3.
Answer: Consider that the query is submitted at site 3, and that neither of the two relations, EMPLOYEE and DEPARTMENT, is available at site 3. (From the costs below, EMPLOYEE resides at site 1 with 1000 tuples of 60 bytes each, and DEPARTMENT resides at site 2 with 50 tuples of 30 bytes each.) To execute this query, we have three strategies:
1. Transfer both tables, EMPLOYEE and DEPARTMENT, to site 3 and join them there. The total cost is 1000 * 60 + 50 * 30 = 60,000 + 1,500 = 61,500 bytes.
2. Transfer the EMPLOYEE table to site 2, join at site 2, and then transfer the result to site 3. The total cost is 60 * 1000 + 60 * 1000 = 120,000 bytes, since we have to transfer 1000 result tuples having NAME and DNAME, 60 bytes each, from site 2 to site 3.
3. Transfer the DEPARTMENT table to site 1, join at site 1, and then transfer the result to site 3. The total cost is 30 * 50 + 60 * 1000 = 61,500 bytes, since we have to transfer 1000 result tuples having NAME and DNAME, 60 bytes each, from site 1 to site 3.
Now, if the optimization criterion is to reduce the amount of data transfer, we can choose either strategy 1 or strategy 3 from the above.
A more complex strategy, which sometimes works better than these simple strategies, uses an
operation called semijoin.
● Distributed query processing uses the semijoin operation to reduce the number of tuples in
a relation before transferring it to another site.
● Joining is done by sending the joining column of one relation R to the site where the other relation S is located.
● The join attributes and the attributes required in the result are projected out and shipped back to the original site and joined with R.
● Hence, only the joining column of R is transferred in one direction, and a subset of S with
no irrelevant tuples or attributes is transferred in the other direction. This can be an efficient
solution to minimizing data transfer.
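A sketch of the idea in SQL, assuming hypothetical relations EMPLOYEE(Name, DID) at site 1 and DEPARTMENT(DID, DName) at site 2, with the shipped join-column values loaded into a temporary table EMP_DIDS at site 2 and the reduced relation loaded back as REDUCED_DEPT at site 1:

-- Step 1, at site 1: project the joining column of R = EMPLOYEE
-- and ship the result to the site of S = DEPARTMENT.
SELECT DISTINCT DID FROM EMPLOYEE;

-- Step 2, at site 2: keep only the DEPARTMENT tuples that actually join,
-- projected onto the attributes needed in the result, and ship them back.
SELECT DID, DName
FROM DEPARTMENT
WHERE DID IN (SELECT DID FROM EMP_DIDS);

-- Step 3, back at site 1: join the reduced relation with EMPLOYEE.
SELECT E.Name, D.DName
FROM EMPLOYEE E JOIN REDUCED_DEPT D ON E.DID = D.DID;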
Example: Find the amount of data transferred to execute the same query given in the above
example using a semi-join operation.
Answer: The following strategy can be used to execute the query.
Project the NAME and DID attributes of the EMPLOYEE table at site 1 and then transfer them to site 3. For this, we transfer NAME and DID (of EMPLOYEE), and the size is 30 * 1000 = 30,000 bytes.
Transfer the DEPARTMENT table to site 3 and join the projected attributes of EMPLOYEE with this table. The size of the DEPARTMENT table is 30 * 50 = 1,500 bytes.
Applying the above scheme, the amount of data transferred to execute the query will be 30,000 + 1,500 = 31,500 bytes.