Unit 2
The outline of this chapter is as follows. Section 5.1 presents the basic concepts of distributed database design. The objectives of data distribution are introduced in Section 5.2. In Section 5.3, data fragmentation, an important issue in distributed database design, is explained briefly with examples. Section 5.4 focuses on the allocation of fragments and on measuring the costs and benefits of fragment allocation. In Section 5.5, the different types of distribution transparency are presented.
The frequency with which a transaction is run, that is, the number of transaction requests per unit time. In the case of general applications that are issued from multiple sites, it is necessary to know the frequency of activation of each transaction at each site.
The site from which a transaction is run (also called the site of origin of the transaction).
The performance criteria for transactions.
If fragments are stored at the site where they are used most frequently, locality of reference is high. As there is no replication of data, storage cost is low. Reliability and availability are also low, but still higher than with a centralized data allocation strategy, as the failure of a site results in the loss of that site's local data only. In this case, communication costs are incurred only for global transactions. In this approach, performance should be good, and communication costs are low, if the data distribution is designed properly.
Data Fragmentation
In a distributed system, a global relation may be divided into several non-overlapping subrelations, called fragments, which are then allocated to different sites. This process is called data fragmentation. The objective of data fragmentation design is to determine non-overlapping fragments, which are the logical units of allocation. Fragments can be designed by grouping a number of tuples or attributes of relations. Each group of tuples or attributes that constitutes a fragment has the same properties.
HORIZONTAL FRAGMENTATION
Horizontal fragmentation partitions a relation along its tuples, that is,
horizontal fragments are subsets of the tuples of a relation. A
horizontal fragment is produced by specifying a predicate that performs
a restriction on the tuples of a relation. The fragment is defined by using the selection operation of the relational algebra. For a given relation R, a horizontal fragment is defined as
σp(R)
where p is a predicate defined on the attributes of R.
[Example tables showing the relation Project partitioned into two horizontal fragments P1 and P2 are not reproduced in this extract.]
Thus, P1 ∪ P2 = Project.
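To make the idea concrete, the following sketch (in Python; the attribute names and sample tuples are assumed, not taken from the text) splits a relation by a selection predicate and checks the reconstruction property:

# Horizontal fragmentation sketch: split the tuples of Project by a
# predicate p and reconstruct the relation by union.
project = [  # sample tuples; attribute names are assumed
    {"project-no": 1, "budget": 50_000},
    {"project-no": 2, "budget": 250_000},
]

def p(t):                                # selection predicate for P1
    return t["budget"] <= 100_000

p1 = [t for t in project if p(t)]        # P1 = sigma_p(Project)
p2 = [t for t in project if not p(t)]    # P2 = sigma_not_p(Project)

# disjointness and reconstruction: every tuple is in exactly one
# fragment, and P1 union P2 = Project
assert all((t in p1) != (t in p2) for t in project)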
Assume that there are two values for the region attribute, "eastern" and "northern", in the relational schema Sales (depo-no, depo-name, region). Let us consider an application that can be generated from any site of the distributed system and involves an SQL query on this relation. [The query is not reproduced in this extract.]
Since the set of predicates {p1, p2} is complete and minimal, the process is terminated. The relevant predicates cannot be deduced by analysing the code of an application alone. In this case, the simple predicates are as follows:
P1: product-id ≤ 10
P2: 10 < product-id ≤ 15
P3: product-id > 15
P4: product-type = “consumable”
P5: product-type = "non-consumable"
The minterm fragments defined by these predicates are:
F1: product-id ≤ 10
F2: (10 < product-id ≤ 15) AND (product-type = “consumable”)
F3: (10 < product-id ≤ 15) AND (product-type = "non-consumable")
F4: product-id > 15.
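As an illustration only (the Product relation and its tuples are assumed, not taken from the text), the following sketch routes tuples into the four minterm fragments F1 to F4 and shows that exactly one fragment accepts each tuple:

# Routing Product tuples into the minterm fragments F1..F4 defined above.
fragments = {
    "F1": lambda t: t["product-id"] <= 10,
    "F2": lambda t: 10 < t["product-id"] <= 15
                    and t["product-type"] == "consumable",
    "F3": lambda t: 10 < t["product-id"] <= 15
                    and t["product-type"] == "non-consumable",
    "F4": lambda t: t["product-id"] > 15,
}

product = [{"product-id": 7,  "product-type": "consumable"},
           {"product-id": 12, "product-type": "non-consumable"},
           {"product-id": 20, "product-type": "consumable"}]

for t in product:
    # disjointness: exactly one minterm predicate holds for each tuple
    [name] = [n for n, pred in fragments.items() if pred(t)]
    print(t["product-id"], "->", name)   # 7 -> F1, 12 -> F3, 20 -> F4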
VERTICAL FRAGMENTATION
Vertical fragmentation partitions a relation along its attributes, that is,
vertical fragments are subsets of attributes of a relation. A vertical
fragment is defined by using the projection operation of relational
algebra. For a given relation R, a vertical fragment is defined as
Πa1, a2, ..., an(R)
where a1, a2, ..., an are attributes of the relation R.
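A minimal sketch (attribute names are assumed) of vertical fragmentation, repeating the primary key in each fragment so that the relation can be rebuilt by a join:

# Vertical fragmentation sketch: project Employee onto attribute subsets,
# keeping the primary key emp-id in every fragment.
employee = [{"emp-id": 1, "emp-name": "A", "salary": 100},
            {"emp-id": 2, "emp-name": "B", "salary": 200}]

def project(rel, attrs):
    """Pi_attrs(rel): keep only the named attributes of each tuple."""
    return [{a: t[a] for a in attrs} for t in rel]

v1 = project(employee, ["emp-id", "emp-name"])  # Pi_{emp-id, emp-name}(R)
v2 = project(employee, ["emp-id", "salary"])    # Pi_{emp-id, salary}(R)
# reconstruction: a natural join of v1 and v2 on emp-id yields Employee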
[A matrix over attributes a1 to a4 appeared here; its entries are not recoverable in this extract.]
Example 5.3.
[The details of this example, including the vertical fragment V1, are not reproduced in this extract.]
Bond energy algorithm. The Bond Energy Algorithm (BEA) is the most suitable algorithm for vertical fragmentation [Navathe et al., 1984]. The bond energy algorithm takes the attribute affinity matrix (AA) as input and produces a clustered affinity matrix (CA) as output by permuting the rows and columns of AA. The generation of CA from AA involves three steps: initialization, iteration, and row ordering, which are illustrated in the following:
Now, for a given set of attributes, many orderings are possible; for n attributes, n! orderings are possible. One efficient strategy for ordering is to search for clusters. The BEA proceeds by linearly traversing the set of attributes. In each step, one of the remaining attributes is added and inserted into the current order of attributes in such a way that the maximal contribution is achieved. This
is first done for the columns. Once all the columns are determined, the
row ordering is adapted to the column ordering, and the resulting
affinity matrix exhibits the desired clustering. To compute the
contribution to the global affinity value, the loss incurred through
separation of previously joint columns is subtracted from the gain,
obtained by adding a new column. The contribution of a pair of
columns is the scalar product of the columns, which is maximal if the
columns exhibit the same value distribution.
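The following sketch makes the bond and contribution computations concrete (the helper names and the greedy insertion loop are this illustration's own; the affinity values are those of Example 5.4 below):

# bond(x, y): scalar product of columns x and y of the affinity matrix AA.
# cont(i, k, j): net contribution of placing column k between i and j.
AA = {
    "A2": {"A2": 45, "A3": 45, "A4": 0},
    "A3": {"A2": 45, "A3": 55, "A4": 0},
    "A4": {"A2": 0,  "A3": 0,  "A4": 40},
}
attrs = list(AA)

def bond(x, y):
    if x is None or y is None:          # boundary positions contribute 0
        return 0
    return sum(AA[z][x] * AA[z][y] for z in attrs)

def cont(i, k, j):
    return 2 * bond(i, k) + 2 * bond(k, j) - 2 * bond(i, j)

# Greedy insertion: start with the first two attributes, then place each
# remaining attribute at the position with the maximal contribution.
order = attrs[:2]
for k in attrs[2:]:
    best = max(range(len(order) + 1),
               key=lambda pos: cont(order[pos - 1] if pos > 0 else None, k,
                                    order[pos] if pos < len(order) else None))
    order.insert(best, k)
print(order)   # a clustered ordering, e.g. ['A4', 'A2', 'A3']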
Example 5.4.
Consider Q = {Q1, Q2, Q3, Q4} as a set of queries, A = {A1, A2, A3, A4} as a set of attributes for the relation R, and S = {S1, S2, S3} as a set of sites in the distributed system. Assume that A1 is the primary key of the relation R, and the following matrices represent the attribute usage values of the relation R and the application access frequencies at different sites:
A1 A2 A3 A4
Q1 0 1 1 0
Q2 1 1 1 0
Q3 1 0 0 1
Q4 0 0 1 0
S1 S2 S3 Sum
Q1 10 20 0 30
Q2 5 0 10 15
Q3 0 35 5 40
Q4 0 10 0 10
From these, the attribute affinity matrix AA over the attributes A2, A3, and A4 (the primary key A1 is excluded) is obtained as:
A2 A3 A4
A2 45 45 0
A3 45 55 0
A4 0 0 40
Now, the clustered affinity matrix CA obtained from AA by the bond energy algorithm is:
A2 A3 A4
A2 45 45 0
A3 45 55 0
A4 0 0 40
Hence, using the attribute usage and access frequency matrices given above, each candidate split point is evaluated with sq = accesses(fragment 1) × accesses(fragment 2) − [accesses(fragment 1 AND fragment 2)]²:
accesses (fragment 1: {A2}): 0
accesses (fragment 2: {A3, A4}): 50
accesses (fragment 1 AND fragment 2): 45
sq = 0 × 50 − 45² = −2,025
accesses (fragment 1: {A2, A3}): 55
accesses (fragment 2: {A4}): 40
accesses (fragment 1 AND fragment 2): 0
sq = 55 × 40 − 0² = 2,200
accesses (fragment 1: {A2, A4}): 40
accesses (fragment 2: {A3}): 10
accesses (fragment 1 AND fragment 2): 45
sq = 40 × 10 − 45² = −1,625
Therefore, the two partitions are {A1, A4} and {A1, A2, A3}, since the split {A2, A3} versus {A4} has the largest sq value. In vertical fragmentation, the primary key is repeated in each partition. The same calculation can be carried out with all the attributes.
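The numbers above can be checked with a short sketch (the helper names are this illustration's own) that evaluates every split point from the usage and frequency matrices:

# Reproducing Example 5.4: sq for each candidate split point.
use = {"Q1": {"A2", "A3"}, "Q2": {"A1", "A2", "A3"},
       "Q3": {"A1", "A4"}, "Q4": {"A3"}}          # attribute usage
freq = {"Q1": 30, "Q2": 15, "Q3": 40, "Q4": 10}   # total accesses per query

def acc(frag):
    """Accesses that touch only this fragment (the key A1 is ignored)."""
    return sum(freq[q] for q, a in use.items()
               if (a - {"A1"}) and (a - {"A1"}) <= frag)

def acc_both(f1, f2):
    """Accesses that need attributes from both fragments."""
    return sum(freq[q] for q, a in use.items()
               if not a.isdisjoint(f1) and not a.isdisjoint(f2))

for f1, f2 in [({"A2"}, {"A3", "A4"}),
               ({"A2", "A3"}, {"A4"}),
               ({"A2", "A4"}, {"A3"})]:
    sq = acc(f1) * acc(f2) - acc_both(f1, f2) ** 2
    print(f1, f2, sq)   # -2025, 2200, -1625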
MIXED FRAGMENTATION
Mixed fragmentation is a combination of horizontal and vertical
fragmentation. This is also referred to as hybrid or nested
fragmentation. A mixed fragment consists of a horizontal fragment that
is subsequently vertically fragmented, or a vertical fragment that is
then horizontally fragmented. A mixed fragment is defined by using
selection and projection operations of relational algebra. For example,
a mixed fragment of a given relation R can be defined as follows:
Πa1, a2, ..., an(σp(R)) or σp(Πa1, a2, ..., an(R))
where p is a selection predicate and a1, a2, ..., an are attributes of R.
Example 5.5.
Let us consider the same Project relation used in the previous example. The mixed fragments of the above Project relation can be defined accordingly, as illustrated below.
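Since the original fragment definitions are not reproduced above, the following sketch (attribute names and data are assumed) shows the idea of a mixed fragment as a selection followed by a projection:

# Mixed fragmentation sketch: sigma_{budget <= 100000}(Project), then
# Pi_{project-no, location} on the result.
project = [{"project-no": 1, "budget": 50_000, "location": "east"},
           {"project-no": 2, "budget": 250_000, "location": "north"}]

m1 = [{a: t[a] for a in ("project-no", "location")}
      for t in project if t["budget"] <= 100_000]
print(m1)   # [{'project-no': 1, 'location': 'east'}]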
DERIVED FRAGMENTATION
A derived fragment is a horizontal fragment that is based on the horizontal fragmentation of a parent relation rather than on the properties of its own attributes. Derived fragmentation is used to facilitate the join between fragments. The term child refers to the relation that contains the foreign key, and the term parent refers to the relation containing the targeted primary key. Derived fragmentation is defined by using the semi-join operation of the relational algebra. For a given child relation C and parent relation P, the derived fragmentation of C can be represented as follows:
Ci = C ⋉ Pi, 1 ≤ i ≤ w
where Pi = σFi(P) is the ith horizontal fragment of the parent relation P and w is the number of fragments of P.
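A small sketch (relation and attribute names are assumed) of a derived fragment computed as a semi-join of the child with one parent fragment:

# Derived fragmentation sketch: Ci holds the child tuples whose foreign
# key matches some tuple of the parent fragment Pi.
parent_p1 = [{"depo-no": 1}, {"depo-no": 2}]      # Pi = sigma_Fi(P)
child = [{"order-no": 10, "depo-no": 1},
         {"order-no": 11, "depo-no": 3}]

keys = {t["depo-no"] for t in parent_p1}
c1 = [t for t in child if t["depo-no"] in keys]   # Ci = C semi-join Pi
print(c1)   # only the order with depo-no = 1 qualifies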
Example 5.6.
R = ∪ Ri, 1 ≤ i ≤ n
NO FRAGMENTATION
A final strategy of fragmentation is not to fragment a relation at all. If a relation contains a small number of tuples and is not updated frequently, then it is better not to fragment the relation. It is more sensible to leave the relation as a whole and simply replicate it at each site of the distributed system.
The Allocation of Fragments
The allocation of fragments is a critical performance issue in the
context of distributed database design. Before allocation of fragments
into different sites of a distributed system, it is necessary to identify
whether the fragments are replicated or not. The allocation of non-replicated fragments can be handled easily by using the "best-fit" approach, in which the most cost-effective allocation strategy is selected from several possible alternatives. Replication of fragments adds extra complexity to the fragment allocation issue, which can be addressed in one of two ways:
In the first approach, the set of all sites in the distributed system
is determined where the benefit of allocating one replica of the
fragment is higher than the cost of allocation. One replica of the
fragment is allocated to such beneficial sites.
In the alternative approach, the allocation of fragments is first done using the best-fit method as if the fragments were not replicated; replicas are then introduced progressively, starting from the most beneficial, and the process terminates when the addition of replicas is no longer beneficial (see the sketch after this list).
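A sketch of this progressive heuristic follows (the benefit and cost figures are assumed; in a real system the benefit of a further replica shrinks as copies are added, which is ignored here for brevity):

# Progressive replication: keep adding the most beneficial replica
# until no remaining site yields a net benefit over the cost.
benefit = {"S1": 120, "S2": 80, "S3": 30}   # assumed benefit per site
cost = 50                                    # assumed cost of one replica

placed, candidates = [], dict(benefit)
while candidates:
    site = max(candidates, key=candidates.get)
    if candidates[site] <= cost:             # no longer beneficial: stop
        break
    placed.append(site)
    del candidates[site]
print(placed)   # ['S1', 'S2']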
HORIZONTAL FRAGMENTS
In this case, using the best-fit approach for non-replicated fragments, the fragment Ri of relation R is allocated at the site j where the number of references to the fragment Ri is maximum. The number of local references of Ri at site j can be measured as
Bij = Σt ftj × nti
where ftj is the frequency of activation of transaction t at site j and nti is the number of references that transaction t makes to the fragment Ri.
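As an illustration only (all names and figures below are hypothetical), the best-fit rule can be applied by computing Bij for every site and picking the maximum:

# Best-fit allocation sketch: place each fragment Ri at the site j that
# maximizes B_ij = sum over t of f_tj * n_ti.
f = {"T1": {"S1": 10, "S2": 0, "S3": 5},     # f[t][j]: frequency of t at j
     "T2": {"S1": 0, "S2": 20, "S3": 5}}
n = {"T1": {"R1": 3, "R2": 0},               # n[t][i]: references of t to Ri
     "T2": {"R1": 1, "R2": 2}}

sites, frags = ["S1", "S2", "S3"], ["R1", "R2"]
for ri in frags:
    b = {j: sum(f[t][j] * n[t][ri] for t in f) for j in sites}
    print(ri, "->", max(b, key=b.get), b)    # R1 -> S1, R2 -> S2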
VERTICAL FRAGMENTS
In this case, the benefit is calculated for vertically partitioning a fragment Ri into two vertical fragments Rs and Rt, allocated at site s and site t, respectively. [The list of the effects of this partition is not reproduced in this extract.]
Transparencies in Distributed
Database Design
According to the definition of a distributed database, one major objective is to achieve transparency in the distributed system. Transparency refers to the separation of the higher-level semantics of a system from lower-level implementation issues. In a
distributed system, transparency hides the implementation details from
users of the system. In other words, the user believes that he or she is
working with a centralized database system, and that all the
complexities of a distributed database are either hidden or transparent
to the user. A distributed DBMS may have various levels of
transparency. In a distributed DBMS, the following four main categories
of transparency have been identified:
Distribution transparency
Transaction transparency
Performance transparency
DBMS transparency.
One solution that can overcome the disadvantages of the above two approaches is the use of aliases (sometimes called synonyms) for each database object. It is the responsibility of the distributed database system to map an alias onto the appropriate database object.
Creator ID: a unique identifier for the user who created the database object.
Creator site ID: a globally unique identifier for the site from which the database object was created.
Local name: an unqualified name for the database object.
Birth-site ID: a globally unique identifier for the site at which the object was initially stored.
Example 5.11.
Fragmentation transparency:
Update Employee
set emp-branch = 20
where emp-id = 55.
Location transparency:
Select emp-name, project-no into $emp-name, $project-no from
Emp1
where emp-id = 55
Select salary, design into $salary, $design from Emp2
where emp-id = 55
Insert into Emp3 (emp-id, emp-name, emp-branch, project-no)
values (55, $emp-name, 20, $project-no)
Insert into Emp4 (emp-id, salary, design)
values (55, $salary,$design)
Delete from Emp1 where emp-id = 55
Delete from Emp2 where emp-id = 55
Local Mapping transparency:
Select emp-name, project-no into $emp-name, $project-no from
Emp1 at site 1
where emp-id = 55
Select salary, design into $salary, $design from Emp2 at site 2
where emp-id = 55
Insert into Emp3 at site 3 (emp-id, emp-name, emp-branch,
project-no)
values (55, $emp-name, 20, $project-no)
Insert into Emp3 at site 7 (emp-id, emp-name, emp-branch,
project-no)
values (55, $emp-name, 20, $project-no)
Insert into Emp4 at site 4(emp-id, salary, design)
values (55, $salary,$design)
Insert into Emp4 at site 8(emp-id, salary, design)
values (55, $salary,$design)
Delete from Emp1 at site 1 where emp-id = 55
Delete from Emp1 at site 5 where emp-id = 55
Delete from Emp2 at site 2 where emp-id = 55
Delete from Emp2 at site 6 where emp-id = 55
Here, it is assumed that the fragment Emp1 has two replicas, stored at site 1 and site 5; the fragment Emp2 has two replicas, stored at site 2 and site 6; the fragment Emp3 has two replicas, stored at site 3 and site 7; and the fragment Emp4 has two replicas, stored at site 4 and site 8.
Transaction Transparency
Transaction transparency in a distributed DBMS ensures that all
distributed transactions maintain the distributed database integrity and
consistency. A distributed transaction can update data stored at many
different sites connected by a computer network. Each transaction is
divided into several subtransactions (represented by an agent), one for
each site that has to be accessed. Transaction transparency ensures
that the distributed transaction will be successfully completed only if all
subtransactions executing in different sites associated with the
transaction are completed successfully. Thus, a distributed DBMS requires a complex mechanism to manage the execution of distributed transactions and to ensure database consistency and integrity.
Moreover, transaction transparency becomes more complex due to
fragmentation, allocation, and replication schemas in distributed
DBMS. Two further aspects of transaction transparency are concurrency transparency and failure transparency. [Their discussion is not reproduced in this extract.]
Performance Transparency
Performance transparency in a distributed DBMS ensures that the system performs its tasks as a centralized DBMS would. In other words, performance transparency in a distributed environment assures that the system does not suffer from any performance degradation due to the distributed architecture, and that it chooses the most cost-effective strategy to execute a request. In a distributed environment, the distributed query processor maps a data request into an ordered sequence of operations on local databases. In this context, the added complexity of fragmentation, allocation, and replication schemas has to be considered. The distributed query processor has to make decisions regarding the following cost components:
The access time (I/O) cost involved in accessing the physical data
on disk.
The CPU time cost incurred when performing operations on data
in main memory.
The communication cost associated with the transmission of data
across the network.
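These three components are commonly combined into a single weighted sum; the following sketch (weights and operation counts are assumed, purely for illustration) compares two candidate execution strategies:

# Additive cost model: I/O cost + CPU cost + communication cost.
def total_cost(n_ios, n_instr, n_msgs, n_bytes,
               t_io=0.01, t_cpu=1e-7, t_msg=0.1, t_byte=1e-6):
    return (t_io * n_ios + t_cpu * n_instr
            + t_msg * n_msgs + t_byte * n_bytes)

ship_whole = total_cost(n_ios=100, n_instr=1e6, n_msgs=2, n_bytes=1e6)
semi_join = total_cost(n_ios=150, n_instr=2e6, n_msgs=4, n_bytes=1e5)
# the query processor would pick the cheaper strategy
print(min(("ship whole relation", ship_whole),
          ("semi-join first", semi_join), key=lambda s: s[1]))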
A number of query processing and query optimization techniques have been developed for distributed database systems: some of them minimize the total cost of query execution [Sacco and Yao, 1982], and some of them attempt to maximize the parallel execution of operations [Epstein et al., 1978] in order to minimize the response time of queries.
DBMS Transparency
DBMS transparency in a distributed environment hides the knowledge
that the local DBMSs may be different and is, therefore, only applicable
to heterogeneous distributed DBMSs. This is also known
as heterogeneity transparency, which allows the integration of several
different local DBMSs (relational, network, and hierarchical) under a
common global schema. It is the responsibility of the distributed DBMS to translate data requests from the global schema to the local DBMS schemas in order to provide DBMS transparency.
Chapter Summary
Distributed database design involves the following important
issues: fragmentation, replication, and allocation.
Fragmentation: A global relation may be divided into a number of subrelations, called fragments, which are then distributed among sites. There are two main types of fragmentation: horizontal and vertical. Horizontal fragments are subsets of tuples, and vertical fragments are subsets of attributes. The other two types of fragmentation are mixed and derived.
Allocation: Allocation involves the issue of allocating fragments among sites.
Replication: The distributed database system may maintain a copy of a fragment at several different sites.
Fragmentation must ensure the correctness rules – completeness,
reconstruction, and disjointness.
Alternative data allocation strategies are centralized, partitioned,
selective replication, and complete replication.
Transparency hides the implementation details of the distributed system from its users. The different transparencies in distributed systems are distribution transparency, transaction transparency, performance transparency, and DBMS transparency.
Client/Server System
In the late 1970s and early 1980s, smaller systems (minicomputers) were developed that required less power and air conditioning. The term client/server was first used in the 1980s, and it gained acceptance in referring to personal computers (PCs) on a network. In the late 1970s, Xerox developed the standards and technology that are familiar today as Ethernet. This provided a standard means for linking together computers from different manufacturers and formed the basis for modern local area networks (LANs) and wide area networks (WANs). The client/server system was developed to cope with the rapidly changing business environment. The general forces that drive the move to client/server systems are as follows:
Server: passive (slave); waits for requests; on request, serves clients and sends a reply.
Client: active (master); sends requests; waits until the reply arrives.
Global system catalog (GSC). The GSC provides the same functionality as the system catalog of a centralized DBMS. In addition to the metadata of the entire database, a GSC contains all fragmentation, replication, and allocation details, reflecting the distributed nature of a DDBMS. It can itself be managed as a distributed database, and thus it can be fragmented and distributed, fully replicated, or centralized like any other relation in the system. [The details of GSC management will be introduced in Chapter 12, Section 12.2.]
Chapter Summary
This chapter introduces several alternative architectures for a distributed database system, such as client/server, peer-to-peer, and MDBSs.