
Adt Unit I

A distributed database is a collection of interconnected databases located in different places that communicate through a computer network. It is controlled centrally by a distributed database management system. Data is stored across sites and each site can be managed independently by its own DBMS. The goals of distributed databases are to improve reliability, availability, and performance by distributing data across multiple locations. The two main types are homogeneous, where all databases use the same DBMS software, and heterogeneous, where different DBMS are used. Fragmentation divides a database relation into smaller non-overlapping parts called fragments to distribute data across sites. The three types of fragmentation are horizontal, vertical, and hybrid.

UNIT-I

What are distributed databases?

A distributed database is a collection of multiple interconnected databases, which are
spread physically across various locations that communicate via a computer network.
 Distributed database is a system in which storage devices are not connected to a
common processing unit.
 Database is controlled by Distributed Database Management System and data may be
stored at the same location or spread over the interconnected network. It is a loosely
coupled system.
 Shared nothing architecture is used in distributed databases.

 The above diagram is a typical example of a distributed database system, in which a
communication channel is used to communicate between the different locations and
every system has its own memory and database.
Features
 Databases in the collection are logically interrelated with each other. Often they
represent a single logical database.
 Data is physically stored across multiple sites. Data in each site can be managed by a
DBMS independent of the other sites.
 The processors in the sites are connected via a network. They do not have any
multiprocessor configuration.
 A distributed database is not a loosely connected file system.
 A distributed database incorporates transaction processing, but it is not synonymous
with a transaction processing system.

Distributed Database Management System


A distributed database management system (DDBMS) is a centralized software system that
manages a distributed database in a manner as if it were all stored in a single location.
Features
 It is used to create, retrieve, update and delete distributed databases.
 It synchronizes the database periodically and provides access mechanisms by
the virtue of which the distribution becomes transparent to the users.
 It ensures that the data modified at any site is universally updated.
 It is used in application areas where large volumes of data are processed and
accessed by numerous users simultaneously.
 It is designed for heterogeneous database platforms.
 It maintains confidentiality and data integrity of the databases.

Goals of Distributed Database system.

The concept of distributed database was built with a goal to improve:

Reliability: In a distributed database system, if one site fails or stops working for
some time, another site can complete the task.
Availability: In a distributed database system, data remains available even if a server
fails, because another site is available to serve the client request.
Performance: Performance is improved by distributing the database over different locations,
so that data is available close to each location where it is used. Each location works with
a smaller local database, which is also easier to maintain.

Types of distributed databases.

The two types of distributed database systems are as follows:

1. Homogeneous distributed databases system:


 A homogeneous distributed database system is a network of two or more databases
(with the same type of DBMS software), which can be stored on one or more machines.
 In this system, data can be accessed and modified simultaneously on several
databases in the network. Homogeneous distributed systems are easy to handle.
Example: Consider three departments that all use Oracle 9i as their DBMS. If a change
is made in one department, it is propagated to the other departments as well.
Types of Homogeneous Distributed Database
There are two types of homogeneous distributed database −
 Autonomous − Each database is independent that functions on its own. They are
integrated by a controlling application and use message passing to share data updates.
 Non-autonomous − Data is distributed across the homogeneous nodes and a central
or master DBMS co-ordinates data updates across the sites.

2. Heterogeneous distributed database system.


 A heterogeneous distributed database system is a network of two or more databases
with different types of DBMS software, which can be stored on one or more
machines.
 In this system, data can be made accessible to several databases in the network with
the help of generic connectivity (ODBC and JDBC).
Example: In the following diagram, databases running different DBMS software are
accessible to each other using ODBC and JDBC.

Types of Heterogeneous Distributed Databases


 Federated − The heterogeneous database systems are independent in nature and
integrated together so that they function as a single database system.
 Un-federated − The database systems employ a central coordinating module through
which the databases are accessed.
The basic types of distributed DBMS architecture are as follows:

1. Client-server architecture of Distributed system.

 A client server architecture has a number of clients and a few servers connected in a
network.
 A client sends a query to one of the servers. The earliest available server solves it and
replies.
 A Client-server architecture is simple to implement and execute due to centralized
server system.

2. Collaborating server architecture.

 Collaborating server architecture is designed to run a single query on multiple
servers.
 Servers break a single query into multiple smaller queries, and the combined result is
sent to the client.
 Collaborating server architecture has a collection of database servers. Each server is
capable of executing the current transactions across the databases.

3. Middleware architecture.
 Middleware architectures are designed in such a way that a single query is executed on
multiple servers.
 This system needs only one server that is capable of managing queries and
transactions spanning multiple servers.
 Middleware architecture uses local servers to handle local queries and transactions.
 The software used to execute queries and transactions across one or more
independent database servers is called middleware.

What is fragmentation?

Fragmentation
Fragmentation is the task of dividing a table into a set of smaller tables. The subsets of the
table are called fragments. Fragmentation can be of three types: horizontal, vertical, and
hybrid (combination of horizontal and vertical). Horizontal fragmentation can further be
classified into two techniques: primary horizontal fragmentation and derived horizontal
fragmentation.
Fragmentation should be done in such a way that the original table can be reconstructed
from the fragments whenever required. This requirement is called “reconstructiveness.”
 The process of dividing the database into multiple smaller parts is called
fragmentation.
 These fragments may be stored at different locations.
 The data fragmentation process should be carried out in such a way that the
reconstruction of the original database from the fragments is possible.
Advantages of Fragmentation
 Since data is stored close to the site of usage, efficiency of the database system is
increased.
 Local query optimization techniques are sufficient for most queries since data is
locally available.
 Since irrelevant data is not available at the sites, security and privacy of the database
system can be maintained.
Disadvantages of Fragmentation
 When data from different fragments are required, the access speeds may be very low.
 In case of recursive fragmentations, the job of reconstruction will need expensive
techniques.
 Lack of back-up copies of data in different sites may render the database ineffective
in case of failure of a site.

Types of data Fragmentation

The three fragmentation techniques are −

 Horizontal fragmentation
 Vertical fragmentation
 Hybrid fragmentation

1. Horizontal data fragmentation

Horizontal fragmentation divides a relation (table) horizontally into groups of rows to
create subsets of the table.

Example:
Account (Acc_No, Balance, Branch_Name, Type).
Suppose the Branch_Name column of this table holds the values Pune, Baroda and Delhi.

The query that selects the fragment for the Baroda branch can be written as:

SELECT * FROM Account WHERE Branch_Name = 'Baroda'

Types of horizontal data fragmentation are as follows:

1) Primary horizontal fragmentation


Primary horizontal fragmentation is the process of fragmenting a single table, row wise using
a set of conditions.

Example:

Acc_No   Balance   Branch_Name
A_101    5000      Pune
A_102    10,000    Baroda
A_103    25,000    Delhi

For the above table we can define simple conditions such as Branch_Name = 'Pune',
Branch_Name = 'Delhi', Balance < 50000

Fragmentation1:
SELECT * FROM Account WHERE Branch_Name = 'Pune' AND Balance < 50000

Fragmentation2:
SELECT * FROM Account WHERE Branch_Name = 'Delhi' AND Balance < 50000
2) Derived horizontal fragmentation
Fragmentation that is derived from the fragmentation of a primary relation is called
derived horizontal fragmentation.

Example: Refer to the example of primary fragmentation given above.

The following fragments are derived from the primary fragmentation.

Fragmentation1:
SELECT * FROM Account WHERE Branch_Name = 'Baroda' AND Balance < 50000

Fragmentation2:
SELECT * FROM Account WHERE Branch_Name = 'Delhi' AND Balance < 50000

3) Complete horizontal fragmentation


 Complete horizontal fragmentation generates a set of horizontal fragments that
together include every tuple of the original relation.
 Completeness is required for reconstruction of the relation, so every tuple must belong
to at least one of the partitions.
4) Disjoint horizontal fragmentation
Disjoint horizontal fragmentation generates a set of horizontal fragments in which no
two fragments have tuples in common. That means every tuple of the relation belongs to only
one fragment.

5) Reconstruction of horizontal fragmentation


Reconstruction of horizontal fragmentation can be performed using UNION operation on
fragments.
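
The completeness, disjointness and reconstruction properties above can be checked on the sample Account table. The following is only a minimal sketch, assuming an in-memory SQLite database and one fragment per branch; the variable names (fragments, rebuilt, is_complete, is_disjoint) are illustrative and not part of the notes.

```python
import sqlite3

# In-memory copy of the sample Account table (balances stored as integers).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Account (Acc_No TEXT PRIMARY KEY, Balance INTEGER, Branch_Name TEXT)"
)
conn.executemany(
    "INSERT INTO Account VALUES (?, ?, ?)",
    [("A_101", 5000, "Pune"), ("A_102", 10000, "Baroda"), ("A_103", 25000, "Delhi")],
)

# Primary horizontal fragmentation: one fragment per branch.
branches = ["Pune", "Baroda", "Delhi"]
fragments = {
    branch: conn.execute(
        "SELECT Acc_No, Balance, Branch_Name FROM Account WHERE Branch_Name = ?",
        (branch,),
    ).fetchall()
    for branch in branches
}

original = set(conn.execute("SELECT Acc_No, Balance, Branch_Name FROM Account"))

# Reconstruction: the UNION of all fragments.
rebuilt = set().union(*fragments.values())

# Complete: every tuple appears in at least one fragment.
is_complete = rebuilt == original
# Disjoint: no tuple appears in more than one fragment.
is_disjoint = sum(len(rows) for rows in fragments.values()) == len(rebuilt)
```

If a branch were left out of the list, is_complete would become False, which is exactly the situation in which the UNION could no longer rebuild the original relation.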

2. Vertical Fragmentation

Vertical fragmentation divides a relation (table) vertically into groups of columns to create
subsets of the table.

Example:

Acc_No   Balance   Branch_Name
A_101    5000      Pune
A_102    10,000    Baroda
A_103    25,000    Delhi

Fragmentation1:
SELECT Acc_No, Balance FROM Account

Fragmentation2:
SELECT Acc_No, Branch_Name FROM Account

Each fragment keeps the key column Acc_No so that the fragments can be joined back together.
Complete vertical fragmentation
 Complete vertical fragmentation generates a set of vertical fragments that together
include all the attributes of the original relation.
 Reconstruction of vertical fragmentation is performed by using a Full Outer
Join operation on the fragments.
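
As a rough illustration of vertical fragmentation and its reconstruction, the sketch below splits the sample Account table into two column groups that both keep the key Acc_No, then joins them back. It assumes an in-memory SQLite database; the fragment table names Account_V1 and Account_V2 are hypothetical. A plain JOIN is used here instead of FULL OUTER JOIN because both fragments contain every key and older SQLite versions do not support FULL OUTER JOIN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Account (Acc_No TEXT PRIMARY KEY, Balance INTEGER, Branch_Name TEXT)"
)
conn.executemany(
    "INSERT INTO Account VALUES (?, ?, ?)",
    [("A_101", 5000, "Pune"), ("A_102", 10000, "Baroda"), ("A_103", 25000, "Delhi")],
)

# Two vertical fragments; each one repeats the key column Acc_No.
conn.execute("CREATE TABLE Account_V1 AS SELECT Acc_No, Balance FROM Account")
conn.execute("CREATE TABLE Account_V2 AS SELECT Acc_No, Branch_Name FROM Account")

# Reconstruction: join the fragments on the shared key.
rebuilt = conn.execute(
    "SELECT v1.Acc_No, v1.Balance, v2.Branch_Name "
    "FROM Account_V1 AS v1 JOIN Account_V2 AS v2 ON v1.Acc_No = v2.Acc_No "
    "ORDER BY v1.Acc_No"
).fetchall()

original = conn.execute(
    "SELECT Acc_No, Balance, Branch_Name FROM Account ORDER BY Acc_No"
).fetchall()
```

Without the key in both fragments there would be no column to join on, and the original rows could not be matched up again.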

3) Hybrid Fragmentation

 Hybrid fragmentation can be achieved by performing horizontal and vertical partitioning
together.
 A mixed (hybrid) fragment is a group of rows and columns of a relation.

Example: Consider the following table, here called Employee, which consists of employee
information.

Emp_ID   Emp_Name   Emp_Address   Emp_Age   Emp_Salary
101      Surendra   Baroda        25        15000
102      Jaya       Pune          37        12000
103      Jayesh     Pune          47        10000

Fragmentation1:
SELECT Emp_ID, Emp_Name FROM Employee WHERE Emp_Age < 40

Fragmentation2:
SELECT Emp_ID FROM Employee WHERE Emp_Address = 'Pune' AND Emp_Salary < 14000

Reconstruction of Hybrid Fragmentation


The original relation in hybrid fragmentation is reconstructed by performing UNION
and FULL OUTER JOIN.
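
The reconstruction described above can be sketched in plain Python on the sample employee rows. This is only one possible hybrid scheme, chosen for illustration: the rows are first split horizontally on Emp_Address (Pune versus everything else), and each horizontal piece is then split vertically into two column groups that both keep Emp_ID. The names PERSONAL, PAY, project and rejoin are hypothetical helpers.

```python
COLUMNS = ("Emp_ID", "Emp_Name", "Emp_Address", "Emp_Age", "Emp_Salary")
employees = [
    (101, "Surendra", "Baroda", 25, 15000),
    (102, "Jaya", "Pune", 37, 12000),
    (103, "Jayesh", "Pune", 47, 10000),
]

def project(rows, names):
    """Vertical step: keep only the named columns of each row."""
    idx = [COLUMNS.index(n) for n in names]
    return [tuple(row[i] for i in idx) for row in rows]

# Horizontal step: split the rows on Emp_Address.
pune = [row for row in employees if row[2] == "Pune"]
other = [row for row in employees if row[2] != "Pune"]

# Vertical step: both column groups keep the key Emp_ID.
PERSONAL = ("Emp_ID", "Emp_Name", "Emp_Address")
PAY = ("Emp_ID", "Emp_Age", "Emp_Salary")
fragments = [(project(part, PERSONAL), project(part, PAY)) for part in (pune, other)]

def rejoin(personal, pay):
    """Join two vertical fragments on Emp_ID."""
    pay_by_id = {row[0]: row[1:] for row in pay}
    return [p + pay_by_id[p[0]] for p in personal]

# Reconstruction: join within each horizontal piece, then take the union.
rebuilt = sorted(row for personal, pay in fragments for row in rejoin(personal, pay))
```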

What is data replication?

Data replication is the process of storing separate copies of the data at two or more sites
(different computers or servers) to improve the availability of data.
It is a popular fault tolerance technique of distributed databases.
Advantages of Data Replication
 Reliability − In case of failure of any site, the database system continues to work
since a copy is available at another site(s).
 Reduction in Network Load − Since local copies of data are available, query
processing can be done with reduced network usage, particularly during prime hours.
Data updating can be done at non-prime hours.
 Quicker Response − Availability of local copies of data ensures quick query
processing and consequently quick response time.
 Simpler Transactions − Transactions require fewer joins of tables located at
different sites and minimal coordination across the network. Thus, they become
simpler in nature.
Disadvantages of Data Replication
 Increased Storage Requirements − Maintaining multiple copies of data is associated
with increased storage costs. The storage space required is in multiples of the storage
required for a centralized system.
 Increased Cost and Complexity of Data Updating − Each time a data item is
updated, the update needs to be reflected in all the copies of the data at the different
sites. This requires complex synchronization techniques and protocols.
 Undesirable Application – Database coupling − If complex update mechanisms are
not used, removing data inconsistency requires complex co-ordination at application
level. This results in undesirable application – database coupling.

Goals of data replication

Data replication is done with an aim to:


 Increase the availability of data.
 Speed up the query evaluation.

Types of data replication

There are two types of data replication:

1. Synchronous Replication:
In synchronous replication, the replica is modified immediately, as part of the same
operation that changes the relation (table). So there is no difference between the original
data and the replica.

2. Asynchronous replication:
In asynchronous replication, the replica is modified only after the commit is fired on the
database, so the replica may briefly lag behind the original data.
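
The difference between the two types can be illustrated with a toy model. The sketch below is an assumption-laden simplification, not a real replication protocol: Primary, Replica, write and flush are hypothetical names, a dictionary stands in for the database, and "shipping" a change is just a local assignment.

```python
from collections import deque

class Replica:
    def __init__(self):
        self.data = {}

class Primary:
    def __init__(self, sync_replicas, async_replicas):
        self.data = {}
        self.sync_replicas = sync_replicas
        self.async_replicas = async_replicas
        self.pending = deque()  # committed changes not yet sent to async replicas

    def write(self, key, value):
        self.data[key] = value
        # Synchronous replicas are updated as part of the write itself.
        for replica in self.sync_replicas:
            replica.data[key] = value
        # Asynchronous replicas are updated later, when flush() runs.
        self.pending.append((key, value))

    def flush(self):
        while self.pending:
            key, value = self.pending.popleft()
            for replica in self.async_replicas:
                replica.data[key] = value

sync_replica, async_replica = Replica(), Replica()
primary = Primary([sync_replica], [async_replica])
primary.write("A_101", 5000)
lag_before_flush = async_replica.data != primary.data  # async copy is still behind
primary.flush()
```

Between write() and flush() the asynchronous replica still holds the old state, which is the lag that synchronous replication avoids at the price of a slower write.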

Replication Schemes

The three replication schemes are as follows:

1. Full Replication

In this design alternative, at each site, one copy of all the database tables is stored. Since,
each site has its own copy of the entire database, queries are very fast requiring negligible
communication cost. On the contrary, the massive redundancy in data requires huge cost
during update operations. Hence, this is suitable for systems where a large number of queries
is required to be handled whereas the number of database updates is low.
In full replication scheme, the database is available to almost every location or user in
communication network.

Advantages of full replication


 High availability of data, as database is available to almost every location.
 Faster execution of queries.
Disadvantages of full replication
 Concurrency control is difficult to achieve in full replication.
 Update operation is slower.

2. No Replication

In this design alternative, different tables are placed at different sites. Data is placed so that it
is at a close proximity to the site where it is used most. It is most suitable for database
systems where the percentage of queries needed to join information in tables placed at
different sites is low. If an appropriate distribution strategy is adopted, then this design
alternative helps to reduce the communication cost during data processing.

No replication means, each fragment is stored exactly at one location.


Advantages of no replication
 Concurrency control problems are minimized, since each fragment has only one copy.
 Easy recovery of data.
Disadvantages of no replication
 Poor availability of data.
 Slows down the query execution process, as multiple clients are accessing the same
server.

3. Partial replication

Copies of tables or portions of tables are stored at different sites. The distribution of the
tables is done in accordance with the frequency of access. This takes into consideration the
fact that the frequency of accessing the tables varies considerably from site to site. The
number of copies of the tables (or portions) depends on how frequently the access queries
execute and on the sites which generate the access queries.

Partial replication means only some fragments are replicated from the database.

Advantages of partial replication

The number of replicas created for a fragment depends upon the importance of the data in
that fragment.

Distributed databases - Query processing and Optimization

A DDBMS processes and optimizes a query in terms of the communication cost of processing a
distributed query and other parameters.

Various factors which are considered while processing a query are as follows:
Costs of Data transfer

 This is a very important factor while processing queries. Intermediate data is
transferred to another location for processing, and the final result is sent to the
location where the query was issued.
 The cost of data transfer is higher when the locations are connected via a
low-performance communication channel and lower over a high-performance one.
 The DDBMS query optimization algorithms are used to minimize the cost of data
transfer.

Semi-join based query optimization

 Semi-join is used to reduce the number of tuples in a relation before transferring it to
another location.
 Only the joining columns are sent to the other location, which uses them to filter out
tuples that cannot take part in the join.
 This method reduces the cost of data transfer.
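
The semi-join idea can be sketched with two in-memory lists standing in for two sites. The Account rows are the sample rows used earlier in these notes; the Branch relation at site 2 and its Zone values are hypothetical, and the cost is simply counted as the number of values sent over the network.

```python
# Site 1 holds Account(Acc_No, Balance, Branch_Name).
account_site1 = [
    ("A_101", 5000, "Pune"),
    ("A_102", 10000, "Baroda"),
    ("A_103", 25000, "Delhi"),
]
# Site 2 holds a hypothetical Branch(Branch_Name, Zone) relation and runs the join.
branch_site2 = [("Pune", "West")]

# Step 1: site 2 sends only the joining column to site 1.
join_keys = {branch for branch, _zone in branch_site2}

# Step 2: site 1 keeps only the tuples that can take part in the join.
reduced_account = [row for row in account_site1 if row[2] in join_keys]

# Step 3: site 1 ships the reduced relation; site 2 completes the join locally.
zone_by_branch = dict(branch_site2)
result = [row + (zone_by_branch[row[2]],) for row in reduced_account]

# Values transferred: whole relation versus join keys plus reduced relation.
naive_cost = sum(len(row) for row in account_site1)
semijoin_cost = len(join_keys) + sum(len(row) for row in reduced_account)
```

The saving grows with the number of tuples at site 1 that have no matching branch at site 2; if almost every tuple matches, sending the join keys first is extra work rather than a gain.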

Cost based query optimization

 Query optimization involves many operations like selection, projection and aggregation.
 The cost of communication is considered in query optimization.
 In a centralized database system, information about relations at remote locations is
obtained from the server system catalogs.
 The data (query) which is manipulated at a local location is considered as a sub-query
to other global locations. This process estimates the total cost which is needed to
compute the intermediate relations.
Transactions
A transaction is a program including a collection of database operations, executed as a
logical unit of data processing. The operations performed in a transaction include one or
more of database operations like insert, delete, update or retrieve data. It is an atomic process
that is either performed into completion entirely or is not performed at all. A transaction
involving only data retrieval without any data update is called read-only transaction.
Each high level operation can be divided into a number of low level tasks or operations. For
example, a data update operation can be divided into three tasks −
 read_item() − reads data item from storage to main memory.
 modify_item() − change value of item in the main memory.
 write_item() − write the modified value from main memory to storage.
Database access is restricted to read_item() and write_item() operations. Likewise, for all
transactions, read and write form the basic database operations.
Transaction Operations
The low level operations performed in a transaction are −
 begin_transaction − A marker that specifies start of transaction execution.
 read_item or write_item − Database operations that may be interleaved with main
memory operations as a part of transaction.
 end_transaction − A marker that specifies end of transaction.
 commit − A signal to specify that the transaction has been successfully completed in
its entirety and will not be undone.
 rollback − A signal to specify that the transaction has been unsuccessful and so all
temporary changes in the database are undone. A committed transaction cannot be
rolled back.
Transaction States
A transaction may go through a subset of five states, active, partially committed, committed,
failed and aborted.
 Active − The initial state where the transaction enters is the active state. The
transaction remains in this state while it is executing read, write or other operations.
 Partially Committed − The transaction enters this state after the last statement of the
transaction has been executed.
 Committed − The transaction enters this state after successful completion of the
transaction and system checks have issued commit signal.
 Failed − The transaction goes from partially committed state or active state to failed
state when it is discovered that normal execution can no longer proceed or system
checks fail.
 Aborted − This is the state after the transaction has been rolled back after failure and
the database has been restored to its state that was before the transaction began.
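
The five states and the moves between them can be written down as a small state machine. This is a minimal sketch based only on the descriptions above; State, ALLOWED and advance are illustrative names, and a real transaction manager tracks considerably more than this.

```python
from enum import Enum

class State(Enum):
    ACTIVE = "active"
    PARTIALLY_COMMITTED = "partially committed"
    COMMITTED = "committed"
    FAILED = "failed"
    ABORTED = "aborted"

# Transitions taken from the state descriptions above.
ALLOWED = {
    State.ACTIVE: {State.PARTIALLY_COMMITTED, State.FAILED},
    State.PARTIALLY_COMMITTED: {State.COMMITTED, State.FAILED},
    State.FAILED: {State.ABORTED},
    State.COMMITTED: set(),  # terminal: a committed transaction is not rolled back
    State.ABORTED: set(),    # terminal: the database is back in its earlier state
}

def advance(current, target):
    """Return the new state, or raise ValueError if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot go from {current.value} to {target.value}")
    return target

# A successful run: active -> partially committed -> committed.
state = State.ACTIVE
state = advance(state, State.PARTIALLY_COMMITTED)
state = advance(state, State.COMMITTED)
```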

Desirable Properties of Transactions

Any transaction must maintain the ACID properties, viz. Atomicity, Consistency, Isolation,
and Durability.
 Atomicity − This property states that a transaction is an atomic unit of
processing, that is, either it is performed in its entirety or not performed at all.
No partial update should exist.
 Consistency − A transaction should take the database from one consistent state
to another consistent state. It should not adversely affect any data item in the
database.
 Isolation − A transaction should be executed as if it is the only one in the
system. There should not be any interference from the other concurrent
transactions that are simultaneously running.
 Durability − If a committed transaction brings about a change, that change
should be durable in the database and not lost in case of any failure.
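
Atomicity can be demonstrated on a single site with a transfer between two accounts. The sketch below assumes an in-memory SQLite database and a CHECK constraint that forbids negative balances; the transfer and balances helpers are hypothetical. The credit is deliberately applied before the debit so that a failed debit shows the earlier credit being undone.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Account (Acc_No TEXT PRIMARY KEY, "
    "Balance INTEGER CHECK (Balance >= 0))"
)
conn.executemany("INSERT INTO Account VALUES (?, ?)", [("A_101", 5000), ("A_102", 10000)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move amount from src to dst as one transaction; return True on commit."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE Account SET Balance = Balance + ? WHERE Acc_No = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE Account SET Balance = Balance - ? WHERE Acc_No = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False  # the credit above has been rolled back as well

def balances(conn):
    return dict(conn.execute("SELECT Acc_No, Balance FROM Account ORDER BY Acc_No"))

failed = transfer(conn, "A_101", "A_102", 6000)  # would overdraw A_101
after_failed = balances(conn)
succeeded = transfer(conn, "A_101", "A_102", 2000)
after_success = balances(conn)
```

After the failed transfer both balances are unchanged, so no partial update exists; after the successful one both updates are visible together.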

Distributed Transactions

 A distributed database management system should be able to survive a system
failure without losing any data in the database.
 This property is provided by transaction processing.
 A local transaction works only at its own location (the local location), whereas it is
considered a global transaction with respect to other locations.
 Transactions are assigned to a transaction monitor, which works as a supervisor.
 Distributed transaction processing is designed to work on data distributed over many
locations, so that transactions are either carried out successfully or terminated cleanly.
 Transaction processing is very useful for concurrent execution and recovery of data.

What is recovery in distributed databases?

Recovery is the most complicated process in distributed databases. Recovery of a failed
system in the communication network is very difficult.

For example:
Consider that location A sends a message to location B and expects a response from B, but
does not receive one. There are several possible causes for this situation, which are as
follows.

 The message failed to reach B due to a failure in the network.
 Location B sent a response, but it was not delivered to location A.
 Location B crashed.

So it is actually very difficult to find the cause of failure in a large communication
network. Distributed commit in the network is also a serious problem which can affect
recovery in a distributed database.

COMMIT PROTOCOL
In a local database system, for committing a transaction, the transaction manager has to only
convey the decision to commit to the recovery manager. However, in a distributed system,
the transaction manager should convey the decision to commit to all the servers in the
various sites where the transaction is being executed and uniformly enforce the decision.
When processing is complete at each site, it reaches the partially committed transaction state
and waits for all other transactions to reach their partially committed states. When it receives
the message that all the sites are ready to commit, it starts to commit. In a distributed system,
either all sites commit or none of them does.
The different distributed commit protocols are −
 One-phase commit
 Two-phase commit
 Three-phase commit
Distributed One-phase Commit
Distributed one-phase commit is the simplest commit protocol. Let us consider that there is a
controlling site and a number of slave sites where the transaction is being executed. The
steps in distributed commit are −
 After each slave has locally completed its transaction, it sends a “DONE” message to
the controlling site.
 The slaves wait for “Commit” or “Abort” message from the controlling site. This
waiting time is called window of vulnerability.
 When the controlling site receives “DONE” message from each slave, it makes a
decision to commit or abort. This is called the commit point. Then, it sends this
message to all the slaves.
 On receiving this message, a slave either commits or aborts and then sends an
acknowledgement message to the controlling site.

Two-phase commit protocol in Distributed databases

Distributed two-phase commit reduces the vulnerability of one-phase commit protocols.

 The two-phase commit protocol is a type of atomic commitment protocol. It is a distributed
algorithm that coordinates all the processes that participate in a transaction and
decides whether to commit or terminate the transaction. The protocol is based on commit and
terminate actions.
 The two-phase protocol ensures that all participants accessing the database
server receive and implement the same action (commit or terminate), even in case of a
local network failure.
 The two-phase commit protocol provides an automatic recovery mechanism in case of a
system failure.
 The location at which the original transaction takes place is called the coordinator, and
the locations where the sub-processes take place are called cohorts.

Commit request phase:
In the commit request phase, the coordinator attempts to prepare all cohorts to take the
necessary steps to commit or terminate the transaction.

Commit phase:
The commit phase is based on the votes of the cohorts, and the coordinator decides to
commit or terminate the transaction.
The steps performed in the two phases are as follows −
Phase 1: Prepare Phase
 After each slave has locally completed its transaction, it sends a “DONE”
message to the controlling site. When the controlling site has received
“DONE” message from all slaves, it sends a “Prepare” message to the slaves.
 The slaves vote on whether they still want to commit or not. If a slave wants to
commit, it sends a “Ready” message.
 A slave that does not want to commit sends a “Not Ready” message. This may
happen when the slave has conflicting concurrent transactions or there is a
timeout.
Phase 2: Commit/Abort Phase
 After the controlling site has received “Ready” message from all the slaves −
o The controlling site sends a “Global Commit” message to the
slaves.
o The slaves apply the transaction and send a “Commit ACK”
message to the controlling site.
o When the controlling site receives “Commit ACK” message
from all the slaves, it considers the transaction as committed.
 After the controlling site has received the first “Not Ready” message from any
slave −
o The controlling site sends a “Global Abort” message to the
slaves.
o The slaves abort the transaction and send an “Abort ACK”
message to the controlling site.
o When the controlling site receives “Abort ACK” message from
all the slaves, it considers the transaction as aborted.
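
The two phases above can be sketched as a small simulation. This is a simplified model under several assumptions: there are no real network messages, timeouts or crashes, method calls stand in for the “Prepare”, “Global Commit” and “Global Abort” messages, and the names Slave, can_commit and two_phase_commit are hypothetical.

```python
class Slave:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # hypothetical flag deciding this site's vote
        self.state = "done"           # local work finished, "DONE" already sent

    def prepare(self):
        # Phase 1: vote on whether this site still wants to commit.
        return "Ready" if self.can_commit else "Not Ready"

    def global_commit(self):
        self.state = "committed"
        return "Commit ACK"

    def global_abort(self):
        self.state = "aborted"
        return "Abort ACK"

def two_phase_commit(slaves):
    """Run both phases from the controlling site and return the outcome."""
    # Phase 1: all() stops at the first "Not Ready" vote.
    all_ready = all(slave.prepare() == "Ready" for slave in slaves)

    # Phase 2: every slave receives the same global decision.
    if all_ready:
        acks = [slave.global_commit() for slave in slaves]
        expected = "Commit ACK"
    else:
        acks = [slave.global_abort() for slave in slaves]
        expected = "Abort ACK"

    # The outcome is final once every slave has acknowledged it.
    if not all(ack == expected for ack in acks):
        raise RuntimeError("missing acknowledgement")
    return "committed" if all_ready else "aborted"

outcome_ok = two_phase_commit([Slave("S1"), Slave("S2")])
mixed = [Slave("S1"), Slave("S2", can_commit=False)]
outcome_abort = two_phase_commit(mixed)
```

In the second run a single “Not Ready” vote is enough to abort the transaction at every site, including the one that voted “Ready”.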

Distributed Three-phase Commit

The steps in distributed three-phase commit are as follows −

Phase 1: Prepare Phase
The steps are the same as in distributed two-phase commit.
Phase 2: Prepare to Commit Phase
 The controlling site issues an “Enter Prepared State” broadcast message.
 The slave sites vote “OK” in response.
Phase 3: Commit / Abort Phase
The steps are the same as in two-phase commit, except that the “Commit ACK”/“Abort ACK”
message is not required.

Concurrency problems in distributed databases.

Some problems which occur while accessing the database are as follows:

1. Failure at local locations


When a system recovers from a failure, its database is out of date compared to other
locations. So it is necessary to bring the database up to date.

2. Failure at communication location

The system should have the ability to manage temporary failures in the communication
network of a distributed database. In this case, a partition occurs, which can limit the
communication between two locations.

3. Dealing with multiple copies of data

It is very important to keep the multiple copies of distributed data at different locations
consistent with each other.

4. Distributed commit
While committing a transaction which is accessing databases stored at multiple locations, a
failure may occur at some location during the commit process. This is called the
distributed commit problem.

5. Distributed deadlock
Deadlock can occur at several locations due to recovery problem and concurrency problem
(multiple locations are accessing same system in the communication network).

Concurrency Controls in distributed databases

Concurrency can be controlled by applying the following locking mechanisms:
1) Lock based protocol
A lock is applied to avoid concurrency problems between two transactions: one transaction
locks a data item, and another transaction can access that item only when the lock is
released. The lock is applied on write or read operations. It is an important method to
prevent conflicting access to the same data.
2) Shared lock system (Read lock)
A transaction can acquire a shared lock on a data item to read its content. The lock is
shared in such a way that any other transaction can acquire a shared lock on the same data
item for reading purposes.
3) Exclusive lock
A transaction can acquire an exclusive lock on a data item to perform read and write
operations. In this system, no other transaction can acquire any kind of lock on that same
data item.
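
The compatibility rules for shared and exclusive locks can be sketched as a small lock table. This is a minimal, single-process illustration: LockTable, acquire and release are hypothetical names, requests that cannot be granted simply return False instead of waiting, and lock upgrades (shared to exclusive) are not handled.

```python
class LockTable:
    def __init__(self):
        self.locks = {}  # data item -> (mode, set of transactions holding the lock)

    def acquire(self, txn, item, mode):
        """Try to lock item in mode "S" (shared) or "X" (exclusive)."""
        held = self.locks.get(item)
        if held is None:
            self.locks[item] = (mode, {txn})
            return True
        held_mode, holders = held
        if mode == "S" and held_mode == "S":
            holders.add(txn)  # shared locks are compatible with each other
            return True
        return False          # any combination involving an exclusive lock conflicts

    def release(self, txn, item):
        held = self.locks.get(item)
        if held is None:
            return
        _mode, holders = held
        holders.discard(txn)
        if not holders:
            del self.locks[item]

table = LockTable()
t1_read = table.acquire("T1", "A_101", "S")   # granted
t2_read = table.acquire("T2", "A_101", "S")   # granted, shared with T1
t3_write = table.acquire("T3", "A_101", "X")  # refused while readers hold the item
```

Once T1 and T2 release their shared locks, T3 can take the exclusive lock, and from then on even read requests on the same item are refused until T3 releases it.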
