Chapter 4 Distributed Databases
Distributed Database Concepts
A distributed database (DDB) is a collection of multiple
logically related databases distributed over a computer
network
A distributed database management system (DDBMS) is a
software system that manages a distributed database
while making the distribution transparent to the user
A transaction can be executed by multiple networked
computers in a unified manner
Distributed Database System
Advantages
Management of distributed data with different levels
of transparency:
The physical placement of data (files, relations, etc.)
is hidden from the user (distribution transparency).
Advantages (cont…)
Distribution and Network transparency:
Fragmentation transparency:
Other Advantages (cont…)
Improved performance:
A distributed DBMS fragments the database to keep
data closer to where it is needed most
This reduces data management overhead (access
and modification time) significantly
Easier expansion (scalability):
Refers to expansion of the system in terms of
adding more data, increasing database sizes or
adding more processors
Data Fragmentation, Replication and Allocation
Data Fragmentation
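Split a relation into logically related and correct parts.
Horizontal fragmentation
A horizontal fragment is the subset of tuples of a relation that satisfy a
selection condition, for example a condition of the form (… = 5) on the
Employee relation. All tuples that satisfy this condition form a subset
which is a horizontal fragment of the Employee relation.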
A selection condition may be composed of several conditions
connected by AND / OR
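As a minimal sketch of horizontal fragmentation, the Python snippet below selects tuples from a small Employee relation. The sample rows, the helper name `horizontal_fragment`, and the choice of `Dno` and `Salary` as the attributes used in the conditions are illustrative assumptions, not taken from the slides.

```python
# Hypothetical sample of the Employee relation (rows invented for illustration).
employee = [
    {"SSN": "111", "Lname": "Smith", "Salary": 30000, "Dno": 5},
    {"SSN": "222", "Lname": "Wong", "Salary": 40000, "Dno": 5},
    {"SSN": "333", "Lname": "Zelaya", "Salary": 25000, "Dno": 4},
]


def horizontal_fragment(relation, condition):
    """Return the tuples of `relation` that satisfy the selection `condition`."""
    return [row for row in relation if condition(row)]


# Simple selection condition: a single comparison.
frag_dno5 = horizontal_fragment(employee, lambda r: r["Dno"] == 5)

# Compound selection condition: two comparisons connected by AND.
frag_dno5_high = horizontal_fragment(
    employee, lambda r: r["Dno"] == 5 and r["Salary"] > 35000
)

print([r["SSN"] for r in frag_dno5])
print([r["SSN"] for r in frag_dno5_high])
```

Every fragment keeps all columns of the parent relation; only the set of rows changes.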
Vertical fragmentation
It is a subset of a relation which is created by a subset of
columns. Thus a vertical fragment of a relation will contain
values of selected columns. There is no selection condition
used in vertical fragmentation.
Consider the Employee relation. A vertical fragment of Employee can
be created by keeping the values of Name, Bdate, Sex, and
Address.
Because there is no condition for creating a vertical
fragment, each fragment must include the primary key
attribute of the parent relation Employee.
In this way all vertical fragments of a relation are connected.
Representing horizontal fragmentation
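Each horizontal fragment on a relation can be specified
by a $\sigma_{C_i}(R)$ operation in the relational algebra, where $C_i$ is
the selection condition of the fragment. To reconstruct R from a complete
set of horizontal fragments, a UNION is applied.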
Representing vertical fragmentation
A vertical fragment on a relation can be specified by a
$\pi_{L_i}(R)$ operation in the relational algebra.
Complete vertical fragmentation
A set of vertical fragments whose projection lists L1, L2,
…, Ln include all the attributes in R but share only the
primary key of R. In this case the projection lists satisfy
the following two conditions:
$L_1 \cup L_2 \cup \dots \cup L_n = \mathrm{ATTRS}(R)$
$L_i \cap L_j = \mathrm{PK}(R)$ for any $i \neq j$
where ATTRS(R) is the set of attributes of R and PK(R) is its primary key.
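The two conditions for a complete vertical fragmentation can be checked mechanically. The sketch below is illustrative only: the shortened Employee schema, the sample row, and the helper names `vertical_fragment` and `is_complete_vertical` are assumptions made for the example. It also shows that fragments sharing the primary key can be joined back into the original tuples.

```python
def vertical_fragment(relation, attrs):
    """Project every tuple of `relation` onto the attribute list `attrs`."""
    return [{a: row[a] for a in attrs} for row in relation]


def is_complete_vertical(attr_lists, all_attrs, primary_key):
    """Check that the lists cover all attributes and that any two lists
    share exactly the primary key."""
    covers = set().union(*attr_lists) == set(all_attrs)
    pk = set(primary_key)
    share_only_pk = all(
        set(a) & set(b) == pk
        for i, a in enumerate(attr_lists)
        for b in attr_lists[i + 1:]
    )
    return covers and share_only_pk


# Hypothetical, shortened Employee schema with SSN as the primary key.
attrs = ["SSN", "Name", "Bdate", "Salary", "Dno"]
l1 = ["SSN", "Name", "Bdate"]
l2 = ["SSN", "Salary", "Dno"]

employee = [
    {"SSN": "111", "Name": "Smith", "Bdate": "1965-01-09", "Salary": 30000, "Dno": 5},
]
f1 = vertical_fragment(employee, l1)
f2 = vertical_fragment(employee, l2)

# Reconstruct the original tuples by joining the fragments on the primary key.
rebuilt = [{**a, **b} for a in f1 for b in f2 if a["SSN"] == b["SSN"]]

print(is_complete_vertical([l1, l2], attrs, ["SSN"]))
print(rebuilt == employee)
```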
Data Replication
Replication refers to the distribution of whole or part of
the data to a number of sites
Useful in improving availability of data
Types of Distributed Database Systems
Homogeneous
All sites of the database system have identical setup, i.e., the same
database system software.
For example, all sites run Oracle or DB2, or Sybase or some other, but the
same database system software.
The underlying operating systems may be different (can be a mixture of
Linux, Windows, Unix, etc.)
[Figure: five sites connected by a communications network, all running
Oracle: Site 1 (Unix), Site 2 (Linux), Site 3 (Linux), Site 4 (Windows),
Site 5 (Windows).]
Heterogeneous
Federated: Each site may run a different database
system, but data access is managed through a
single conceptual schema.
Multidatabase: There is no single global conceptual
schema. For data access, a schema is constructed
dynamically as needed by the application software.
[Figure: heterogeneous sites running different DBMSs (network,
object-oriented, relational); Site 2 and Site 3 run Linux.]
Differences in constraints:
Sites may differ in their data access and processing constraints.
Differences in query language:
Even with the same data model, the languages and their versions may differ.
Query Processing in Distributed Databases
Issues
Query Q: For each employee, retrieve employee name and …
Employee(Fname, MName, Lname, SSN, Bdate, Address, Sex, Salary, Superssn, Dno)
[Figure: Employee is stored at Site 1 and Department at Site 2; the query
result is required at Site 3.]
Execution strategies (result needed at site 3; the 4000-byte result size is
the one used for query Q’):
1. Transfer Employee and Department to the result site and
perform the join at site 3.
Total transfer size = 1,000,000 + 3500 = 1,003,500 bytes.
2. Transfer Employee to site 2, execute the join at site 2 and
send the result to site 3.
Query result size = 40 * 100 = 4000 bytes.
Total transfer size = 1,000,000 + 4000 = 1,004,000 bytes.
3. Transfer Department to site 1, execute the join at site
1 and send the result to site 3.
Total transfer size = 3500 + 4000 = 7500 bytes.
Preferred strategy: strategy 3, which transfers the fewest bytes.
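The comparison of the three strategies is plain arithmetic, and a short script makes the choice explicit. This is only a sketch: the relation and result sizes are the ones quoted in the strategies, while the variable names are made up for illustration.

```python
EMPLOYEE_BYTES = 1_000_000   # size of the Employee relation (at site 1)
DEPARTMENT_BYTES = 3_500     # size of the Department relation (at site 2)
RESULT_BYTES = 40 * 100      # size of the query result, as given in the text

# Bytes shipped over the network when the result is needed at site 3.
strategies = {
    1: EMPLOYEE_BYTES + DEPARTMENT_BYTES,  # ship both relations to site 3
    2: EMPLOYEE_BYTES + RESULT_BYTES,      # ship Employee to site 2, result to site 3
    3: DEPARTMENT_BYTES + RESULT_BYTES,    # ship Department to site 1, result to site 3
}

# Optimization criterion assumed here: minimize the bytes transferred.
best = min(strategies, key=strategies.get)

for number, cost in strategies.items():
    print(f"strategy {number}: {cost:,} bytes")
print("preferred strategy:", best)
```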
Now suppose the result site is 2.
Possible strategies :
1. Transfer Employee relation to site 2, execute the query and
present the result to the user at site 2
Total transfer size = 1,000,000 bytes for both queries Q and Q’.
2. Transfer Department relation to site 1, execute join at site 1
and send the result back to site 2
Total transfer size for Q: 3500 + 400,000 = 403,500 bytes
Total transfer size for Q’: 3500 + 4000 = 7500 bytes
Query Processing in Distributed Databases
using semijoin
Semijoin:
The idea behind semijoin operation is to reduce the number of tuples
in a relation before transferring it to another site.
Example: using the queries Q or Q’ discussed in the previous slides :
1. Project the join attributes of Department at site 2, and transfer them to
site 1.
Assume size of Dnumber = 4 bytes and size of Mgrssn = 9 bytes.
Assume fname and lname are 15 bytes each.
For Q, 4 * 100 = 400 bytes are transferred, and for Q’, 9 * 100 = 900
bytes are transferred.
2. Join the transferred file with the Employee relation at site 1, and
transfer the required attributes from the resulting file to site 2.
For Q, 34 * 10,000 = 340,000 bytes are transferred and
For Q’, 39 * 100 = 3900 bytes are transferred
3. Execute the query by joining the transferred file with Department and
present the result to the user at site 2.
Using this strategy, we transfer 340,400 bytes for Q and 4800 bytes for Q’
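The semijoin totals above can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the function name `semijoin_cost` is hypothetical, and reading 34 as 4 + 15 + 15 (Dnumber, fname, lname) and 39 as 9 + 15 + 15 (Mgrssn, fname, lname) is an interpretation of the attribute sizes given in the example.

```python
DNUMBER, MGRSSN, FNAME, LNAME = 4, 9, 15, 15  # attribute sizes in bytes
DEPT_ROWS = 100                               # values projected from Department


def semijoin_cost(join_attr_bytes, dept_rows, reply_row_bytes, reply_rows):
    """Bytes moved by the semijoin strategy: the projected join column goes
    from site 2 to site 1, then the reduced rows come back to site 2."""
    step1 = join_attr_bytes * dept_rows
    step2 = reply_row_bytes * reply_rows
    return step1 + step2


# Q: 10,000 rows of 34 bytes come back; Q': 100 rows of 39 bytes come back.
cost_q = semijoin_cost(DNUMBER, DEPT_ROWS, DNUMBER + FNAME + LNAME, 10_000)
cost_q_prime = semijoin_cost(MGRSSN, DEPT_ROWS, MGRSSN + FNAME + LNAME, 100)

print(f"Q : {cost_q:,} bytes")
print(f"Q': {cost_q_prime:,} bytes")
```

Both totals are lower than the 403,500 and 7500 bytes needed when the whole Department relation is shipped to site 1.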
Concurrency Control and Recovery
Details
Dealing with multiple copies of data items:
The concurrency control must maintain global
consistency
Likewise, the recovery mechanism must recover all
copies and maintain consistency after recovery
Failure of individual sites:
Database availability must not be affected by
the failure of one or two sites, and the recovery
scheme must recover failed sites before they are
made available for use
Communication link failure:
This failure may create a network partition, which would
affect database availability even though all sites may be running
Transaction management (primary site technique):
Concurrency control and commit are managed by
a single designated primary site
All locks are kept at that site and all requests for
locking or unlocking are sent there
In two-phase locking, this site manages locking
and releasing of data items
If all transactions follow the two-phase policy at all
sites, then serializability is guaranteed
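A single lock table held at one site can be sketched as follows. This is a hypothetical illustration, not the implementation of any particular DBMS: the class name `PrimarySiteLockManager`, the shared ("S") and exclusive ("X") lock modes, and the choice to release all locks at once at the end of a transaction (a strict form of two-phase locking) are assumptions made for the example.

```python
class PrimarySiteLockManager:
    """Lock table kept at one site; every lock and unlock request goes here."""

    def __init__(self):
        self.locks = {}  # data item -> (mode, set of transaction ids holding it)

    def lock(self, txn, item, mode):
        """Grant an 'S' or 'X' lock on `item` if compatible; return True or False."""
        held = self.locks.get(item)
        if held is None:
            self.locks[item] = (mode, {txn})
            return True
        held_mode, holders = held
        if mode == "S" and held_mode == "S":
            holders.add(txn)  # shared locks are compatible with each other
            return True
        if holders == {txn}:
            # The sole holder may re-request the lock or upgrade S to X.
            new_mode = "X" if "X" in (mode, held_mode) else "S"
            self.locks[item] = (new_mode, holders)
            return True
        return False  # conflict: the caller must wait or abort

    def unlock_all(self, txn):
        """Release every lock held by `txn` (its shrinking phase, done at once)."""
        for item in list(self.locks):
            _, holders = self.locks[item]
            holders.discard(txn)
            if not holders:
                del self.locks[item]


manager = PrimarySiteLockManager()
r1 = manager.lock("T1", "x", "S")   # granted
r2 = manager.lock("T2", "x", "S")   # granted, shared with T1
r3 = manager.lock("T2", "x", "X")   # refused while T1 still holds a shared lock
manager.unlock_all("T1")
r4 = manager.lock("T2", "x", "X")   # granted, T2 is now the only holder
print(r1, r2, r3, r4)
```

Because all requests pass through this one object, the lock table is never split across sites, which mirrors why the primary site can become a bottleneck.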
Advantages:
It is an extension of centralized two-phase locking …
… if the primary site fails, the system becomes inaccessible
Primary site with backup site
To aid recovery, a backup site is designated which
behaves as a shadow of primary site.
In case of primary site failure, backup site can act as
primary site.
Slows down system performance for granting of locks
B. Primary Copy Technique:
In this approach, distinguished copies of different data
items are stored at different sites
Load of lock coordination is distributed among
the various sites
To lock a data item, just the primary copy of the
data item is locked
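How the primary copy technique spreads lock coordination can be shown with a small routing table. The directory contents, the site names, and the function name `route_lock_request` are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical directory: data item -> site holding its primary (distinguished) copy.
primary_copy_site = {"X": "site1", "Y": "site2", "Z": "site3", "W": "site1"}


def route_lock_request(item):
    """Return the site that must lock the primary copy of `item`."""
    return primary_copy_site[item]


# A stream of lock requests is spread over the sites instead of one primary site.
requests = ["X", "Y", "Z", "W", "Y"]
load = Counter(route_lock_request(item) for item in requests)
print(load)
```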
Advantages: …
… defined.
If the requesting transaction does not get any vote …
Client-Server Database Architecture
It consists of clients running client software, a set of
servers which provide all database functionalities and a
reliable communication infrastructure.
[Figure: Server 1, Server 2, …, Server n and Client 1, Client 2, Client 3, …,
Client n connected by the communication infrastructure.]
Server: is responsible for local data management at a
site, much like centralized DBMS software
Client: is responsible for most of the distribution function;
it accesses data distribution information from the DBMS
catalog and processes all requests that require access to
more than one site
The communication software manages communication
among clients and servers