Distributed Database Design
TOPICS
Distributed database design concepts
Objectives of data distribution
Data fragmentation
Allocation of fragments
Transparencies in distributed database design
Design Problem
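In the general setting:
making decisions about the placement of data and programs across the sites of a computer network
In a distributed DBMS, the placement of applications entails
placement of the distributed DBMS software; and
placement of the applications that run on the database
[Figure: dimensions of the distribution problem. Level of sharing: data, data + program. Level of knowledge: partial information, complete information.]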
Distribution Design
Top-down
mostly in designing systems from scratch
mostly in homogeneous systems
Bottom-up
when the databases already exist at a
number of sites
Top-Down Design
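[Figure: top-down design process. Requirements Analysis yields the Objectives; Conceptual Design and View Design (with User Input) are reconciled by View Integration and produce the GCS, Access Information and ESs; Distribution Design (with User Input) produces the LCSs; Physical Design produces the LISs.]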
Fragmentation
Can't we just distribute relations?
What is a reasonable unit of distribution?
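relation
views are subsets of relations, so locality of access is defined on subsets rather than on whole relations
distributing whole relations therefore causes extra communication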
Fragmentation Alternatives
Horizontal
PROJ
PNO  PNAME              BUDGET  LOC
P1   Instrumentation    150000  Montreal
P2   Database Develop.  135000  New York
P3   CAD/CAM            250000  New York
P4   Maintenance        310000  Paris
P5   CAD/CAM            500000  Boston
PROJ1
PNO  PNAME              BUDGET  LOC
P1   Instrumentation    150000  Montreal
P2   Database Develop.  135000  New York
PROJ2
PNO  PNAME              BUDGET  LOC
P3   CAD/CAM            250000  New York
P4   Maintenance        310000  Paris
P5   CAD/CAM            500000  Boston
Fragmentation Alternatives
Vertical
PROJ
PNO  PNAME              BUDGET  LOC
P1   Instrumentation    150000  Montreal
P2   Database Develop.  135000  New York
P3   CAD/CAM            250000  New York
P4   Maintenance        310000  Paris
P5   CAD/CAM            500000  Boston
PROJ1
PNO  BUDGET
P1   150000
P2   135000
P3   250000
P4   310000
P5   500000
PROJ2
PNO  PNAME              LOC
P1   Instrumentation    Montreal
P2   Database Develop.  New York
P3   CAD/CAM            New York
P4   Maintenance        Paris
P5   CAD/CAM            Boston
Degree of Fragmentation
finite number of alternatives, ranging from individual tuples or attributes (finest) to whole relations (coarsest)
Correctness of Fragmentation
Completeness
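Decomposition of relation R into fragments R1, R2, ..., Rn is complete if and only if each data item in R can also be found in some Ri
Disjointness
If relation R is decomposed into fragments R1, R2, ..., Rn and data item di is in Rj, then di should not be in any other fragment Rk (k ≠ j)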
Allocation Alternatives
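Non-replicated
partitioned: each fragment resides at only one site
Replicated
fully replicated: each fragment at each site
partially replicated: each fragment at some of the sites
Rule of thumb:
If (read-only queries) / (update queries) ≥ 1, replication is advantageous; otherwise replication may cause problems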
Comparison of Replication Alternatives
                      Full replication      Partial replication  Partitioning
QUERY PROCESSING      Easy                  Same difficulty      Same difficulty
DIRECTORY MANAGEMENT  Easy or nonexistent   Same difficulty      Same difficulty
CONCURRENCY CONTROL   Moderate              Difficult            Easy
RELIABILITY           Very high             High                 Low
REALITY               Possible application  Realistic            Possible application
Information Requirements
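Four categories:
Database information
Application information
Communication network information
Computer system information
Database information (example schema):
SKILL(TITLE, SAL)
EMP(ENO, ENAME, TITLE)
PROJ(PNO, PNAME, BUDGET, LOC)
ASG(ENO, PNO, RESP, DUR)
Link L1: owner SKILL, member EMP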
Application Information
1. Qualitative Information
   The fundamental qualitative information consists of the predicates used in user queries.
   Analyze user queries based on the 80/20 rule: 20% of user queries account for 80% of the total data access, so one should investigate the more important queries.
2. Quantitative Information
   Minterm selectivity sel(mi): number of tuples that would be accessed by a query specified according to a given minterm predicate.
   Access frequency acc(mi): the access frequency of a given minterm predicate in a given period.
Fragmentation
Horizontal Fragmentation (HF)
   Primary Horizontal Fragmentation (PHF)
   Derived Horizontal Fragmentation (DHF)
Vertical Fragmentation (VF)
Hybrid Fragmentation (HF)
Primary Horizontal Fragmentation
EMP table, fragmented across three branches (MPLS, LA, NY)
After fragmentation, EMP is reconstructed as the union of its fragments:
SELECT * FROM MPLS_EMPS
UNION
SELECT * FROM LA_EMPS
UNION
SELECT * FROM NY_EMPS;
Example 1
AP1: looking for those employees who work in Los Angeles (LA).
Pr = {p1: Loc = "LA"}
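The idea of splitting EMP by branch and rebuilding it with a union can be sketched in Python. This is only an illustrative sketch: the EMP rows, the LOC values attached to them, and the helper names `horizontal_fragments` and `reconstruct` are assumptions made for the example, not part of the slides.

```python
# Hypothetical EMP rows; only the LOC attribute matters for this sketch.
EMP = [
    {"ENO": "E1", "ENAME": "J. Doe", "LOC": "MPLS"},
    {"ENO": "E2", "ENAME": "M. Smith", "LOC": "LA"},
    {"ENO": "E3", "ENAME": "A. Lee", "LOC": "NY"},
    {"ENO": "E4", "ENAME": "J. Miller", "LOC": "LA"},
]


def horizontal_fragments(relation, attribute):
    """Group tuples by the value of one attribute (one fragment per value)."""
    fragments = {}
    for row in relation:
        fragments.setdefault(row[attribute], []).append(row)
    return fragments


def reconstruct(fragments):
    """Union of all fragments, mirroring the SELECT ... UNION ... query."""
    return [row for fragment in fragments.values() for row in fragment]


FRAGS = horizontal_fragments(EMP, "LOC")
LA_EMPS = FRAGS["LA"]  # the fragment selected by p1: Loc = "LA"
```

Because every tuple has exactly one LOC value, the fragments are disjoint and their union contains every tuple of EMP.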
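PHF - Information Requirements
Application Information
simple predicates: Given R[A1, A2, ..., An], a simple predicate pj is
$p_j : A_i \;\theta\; Value$
where $\theta \in \{=, <, \le, >, \ge, \ne\}$ and Value is taken from the domain of Ai.
minterm predicates: Given R and Pr = {p1, p2, ..., pm}, define M = {m1, m2, ..., mr} as
$M = \{\, m_i \mid m_i = \bigwedge_{p_j \in Pr} p_j^{*} \,\}$, where $p_j^{*} = p_j$ or $p_j^{*} = \lnot p_j$.
access frequencies: acc(qi) is the frequency with which a user application qi accesses data.
Access frequency for a minterm predicate can also be defined.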
PHF Algorithm
Given: A relation R, the set of simple
predicates Pr
Output: The set of fragments of R = {R1, R2, ..., Rw} which obey the fragmentation rules.
Preliminaries :
Pr should be complete
Pr should be minimal
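Completeness of Simple Predicates
A set of simple predicates Pr is said to be complete if and only if any two tuples of the same minterm fragment defined on Pr have the same probability of being accessed by any application.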
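Minimality of Simple Predicates
If a predicate influences how fragmentation is performed (i.e., causes a fragment f to be further fragmented into, say, fi and fj), then there should be at least one application that accesses fi and fj differently:
$\dfrac{acc(m_i)}{card(f_i)} \ne \dfrac{acc(m_j)}{card(f_j)}$
If all the predicates of a set Pr are relevant in this sense, then Pr is minimal.
Example:
Pr = {LOC="Montreal", LOC="New York", LOC="Paris", BUDGET≤200000, BUDGET>200000}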
COM_MIN Algorithm
Given: a relation R and a set of simple
predicates Pr
Output: a complete and minimal set of
simple predicates Pr' for Pr
Rule 1: a relation or fragment is partitioned
into at least two parts which are
accessed differently by at least one
application.
Initialization:
find a pi ∈ Pr such that pi partitions R according to Rule 1
set Pr' = pi ; Pr ← Pr − {pi} ; F ← {fi}
Iteratively add predicates to Pr' until it is complete:
find a pj ∈ Pr such that pj partitions some fk defined according to a minterm predicate over Pr' according to Rule 1
set Pr' = Pr' ∪ {pj} ; Pr ← Pr − {pj} ; F ← F ∪ {fi}
if ∃ pk ∈ Pr' which is nonrelevant then
Pr' ← Pr' − {pk}
F ← F − {fk}
PHORIZONTAL Algorithm
Makes use of COM_MIN to perform fragmentation.
Input: a relation R and a set of simple predicates
Pr
Output: a set of minterm predicates M according to
which relation R is to be fragmented
Pr' ← COM_MIN(R, Pr)
determine the set M of minterm predicates
determine the set I of implications among pi ∈ Pr'
eliminate the contradictory minterms from M
PHF Example 3
Two candidate relations : PAY and PROJ.
Fragmentation of relation PAY
Application: Check the salary info and determine
raise.
Employee records are kept at two sites, so the application runs at two sites
Simple predicates
p1 : SAL ≤ 30000
p2 : SAL > 30000
Pr = {p1, p2}, which is complete and minimal, so Pr' = Pr
Minterm predicates
m1 : (SAL ≤ 30000)
m2 : NOT(SAL ≤ 30000) = (SAL > 30000)
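As a rough illustration of how minterm predicates are enumerated and contradictory ones discarded, the sketch below builds every conjunction of the two simple predicates on SAL (each taken as is or negated) and keeps only those that some salary can satisfy. The names `minterms` and `satisfiable_minterms`, and the sampled salary domain used to detect contradictions, are assumptions of this sketch; the slides eliminate contradictory minterms through implications instead.

```python
from itertools import product

# Simple predicates on SAL from the PAY example.
PREDICATES = {
    "SAL <= 30000": lambda sal: sal <= 30000,
    "SAL > 30000": lambda sal: sal > 30000,
}

# Assumption: a coarse sample of the SAL domain, used only to detect
# minterms that no value can satisfy.
SAMPLE_DOMAIN = range(0, 100001, 1000)


def minterms(predicates):
    """Yield (description, test) for every conjunction of p or NOT(p)."""
    names = list(predicates)
    for signs in product([True, False], repeat=len(names)):
        description = " AND ".join(
            name if sign else f"NOT({name})" for name, sign in zip(names, signs)
        )

        def test(value, signs=signs):
            return all(predicates[n](value) == s for n, s in zip(names, signs))

        yield description, test


def satisfiable_minterms(predicates, domain):
    """Keep only minterms satisfied by at least one value of the domain."""
    return [d for d, t in minterms(predicates) if any(t(v) for v in domain)]


SURVIVORS = satisfiable_minterms(PREDICATES, SAMPLE_DOMAIN)
```

Of the four candidate minterms only two survive, and they simplify to SAL ≤ 30000 and SAL > 30000, matching m1 and m2.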
PHF Example 3
PAY1
TITLE        SAL
Mech. Eng.   27000
Programmer   24000
PAY2
TITLE        SAL
Elect. Eng.  40000
Syst. Anal.  34000
PHF Example 3
Fragmentation of relation PROJ
Applications:
(1) Find the name and budget of projects given their no.; issued at three sites
Simple predicates
For application (1)
p1 : LOC = "Montreal"
p2 : LOC = "New York"
p3 : LOC = "Paris"
Fragmentation of relation PROJ continued
Minterm fragments left after eliminating contradictory minterms
PHF Example
PROJ1
PNO  PNAME              BUDGET  LOC
P1   Instrumentation    150000  Montreal
PROJ2
PNO  PNAME              BUDGET  LOC
P2   Database Develop.  135000  New York
PROJ4
PNO  PNAME              BUDGET  LOC
P3   CAD/CAM            250000  New York
PROJ6
PNO  PNAME              BUDGET  LOC
P4   Maintenance        310000  Paris
PHF Correctness
Completeness
Since Pr' is complete and minimal, the selection predicates are complete
Reconstruction
If relation R is fragmented into FR = {R1, R2, ..., Rr}, then
$R = \bigcup_{R_i \in F_R} R_i$
Disjointness
Minterm predicates that form the basis of fragmentation should be mutually exclusive
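Derived Horizontal Fragmentation
Defined on a member relation of a link according to a selection operation specified on its owner.
[Figure: example schema showing the links between relations; recoverable labels: PROJ, ASG(ENO, PNO, RESP, DUR).]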
DHF Definition
Given a link L where owner(L)=S and
member(L)=R, the derived horizontal
fragments of R are defined as
$R_i = R \ltimes_F S_i, \quad 1 \le i \le w$
where each Si is a horizontal fragment of the owner relation S
DHF Example
Given link L1 where owner(L1)=SKILL/PAY and member(L1)=EMP
Group engineers into two groups according to their salary: those making less
than or equal to $30,000, and those making more than $30,000.
$\text{EMP}_1 = \text{EMP} \ltimes \text{SKILL/PAY}_1$
$\text{EMP}_2 = \text{EMP} \ltimes \text{SKILL/PAY}_2$
where
$\text{SKILL/PAY}_1 = \sigma_{SAL \le 30000}(\text{SKILL/PAY})$
$\text{SKILL/PAY}_2 = \sigma_{SAL > 30000}(\text{SKILL/PAY})$
EMP1
ENO  ENAME      TITLE
E3   A. Lee     Mech. Eng.
E4   J. Miller  Programmer
E7   R. Davis   Mech. Eng.
EMP2
ENO  ENAME      TITLE
E1   J. Doe     Elect. Eng.
E2   M. Smith   Syst. Anal.
E5   B. Casey   Syst. Anal.
E6   L. Chu     Elect. Eng.
E8   J. Jones   Syst. Anal.
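The derived fragments can be reproduced with a small semijoin sketch. The EMP rows come from the example; the PAY salaries for Elect. Eng., Syst. Anal. and Programmer come from the PAY example, while the Mech. Eng. salary is an assumed value (any value of at most 30000 gives the same fragments). The helper name `semijoin` is hypothetical.

```python
PAY = [
    {"TITLE": "Elect. Eng.", "SAL": 40000},
    {"TITLE": "Syst. Anal.", "SAL": 34000},
    {"TITLE": "Mech. Eng.", "SAL": 27000},  # assumed value (<= 30000)
    {"TITLE": "Programmer", "SAL": 24000},
]

EMP = [
    {"ENO": "E1", "ENAME": "J. Doe", "TITLE": "Elect. Eng."},
    {"ENO": "E2", "ENAME": "M. Smith", "TITLE": "Syst. Anal."},
    {"ENO": "E3", "ENAME": "A. Lee", "TITLE": "Mech. Eng."},
    {"ENO": "E4", "ENAME": "J. Miller", "TITLE": "Programmer"},
    {"ENO": "E5", "ENAME": "B. Casey", "TITLE": "Syst. Anal."},
    {"ENO": "E6", "ENAME": "L. Chu", "TITLE": "Elect. Eng."},
    {"ENO": "E7", "ENAME": "R. Davis", "TITLE": "Mech. Eng."},
    {"ENO": "E8", "ENAME": "J. Jones", "TITLE": "Syst. Anal."},
]


def semijoin(member, owner_fragment, attribute):
    """Keep the member tuples whose join attribute appears in the owner fragment."""
    keys = {row[attribute] for row in owner_fragment}
    return [row for row in member if row[attribute] in keys]


# Primary horizontal fragments of the owner relation.
PAY1 = [row for row in PAY if row["SAL"] <= 30000]
PAY2 = [row for row in PAY if row["SAL"] > 30000]

# Derived horizontal fragments of the member relation.
EMP1 = semijoin(EMP, PAY1, "TITLE")
EMP2 = semijoin(EMP, PAY2, "TITLE")
```

Since every TITLE in EMP occurs in exactly one owner fragment, EMP1 and EMP2 are disjoint and together cover EMP.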
DHF Correctness
Completeness
Referential integrity
Let R be the member relation of a link whose owner is relation S which is fragmented
as FS = {S1, S2, ..., Sn}. Furthermore, let A be the join attribute between R and S. Then,
for each tuple t of R, there should be a tuple t' of S such that
t[A] = t' [A]
Reconstruction
Same as primary horizontal fragmentation.
Disjointness
Simple join graphs between the owner and the member fragments.
Vertical Fragmentation
Group the columns of a table into fragments.
Each fragment contains a subset of the total set of columns in the table. The number of possible fragments is equal to B(m), the mth Bell number, where m is the number of non-key columns.
Two approaches:
grouping (attributes to fragments)
   the first step creates as many vertical fragments as there are non-key columns in the table; the grouping approach then uses joins across the primary key to group some of these fragments together, and continues as needed
   not usually considered a valid approach
splitting (relation to fragments)
   starts from the whole relation and splits it, placing each non-key column in one and only one fragment
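To get a feel for how quickly the number of alternatives B(m) grows, the Bell numbers can be computed with the Bell triangle. This is a small sketch; the function name `bell_number` is chosen here for illustration.

```python
def bell_number(m):
    """Return the m-th Bell number B(m) using the Bell triangle."""
    row = [1]  # row 0 of the triangle; its first entry is B(0)
    for _ in range(m):
        new_row = [row[-1]]  # each row starts with the last entry of the previous row
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[0]


# Number of ways to partition m non-key columns into fragments.
COUNTS = [bell_number(m) for m in range(1, 6)]
```

Already for ten non-key columns there are 115975 possible partitionings, which is why heuristic approaches are used instead of exhaustive search.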
VF Information
Requirements
Application Information
Attribute affinities
a measure that indicates how closely related the attributes are
This is obtained by: access frequency + usage pattern
Access freq: how many times an application/query runs in a given
period of time at different sites
Usage pattern:Indicates whether a column is used by an
application/query.
Attribute usage values
Given a set of queries Q = {q1, q2, ..., qq} that will run on the relation R[A1, A2, ..., An],
$use(q_i, A_j) = \begin{cases} 1 & \text{if attribute } A_j \text{ is referenced by query } q_i \\ 0 & \text{otherwise} \end{cases}$
VF Definition of use(qi,Aj)
Consider the following 4 queries for relation PROJ
q1: SELECT BUDGET FROM PROJ WHERE PNO=Value
q2: SELECT PNAME, BUDGET FROM PROJ
q3: SELECT PNAME FROM PROJ WHERE LOC=Value
q4: SELECT SUM(BUDGET) FROM PROJ WHERE LOC=Value
Let A1 = PNO, A2 = PNAME, A3 = BUDGET, A4 = LOC
     A1  A2  A3  A4
q1   1   0   1   0
q2   0   1   1   0
q3   0   1   0   1
q4   0   0   1   1
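VF Affinity Measure af(Ai,Aj)
The attribute affinity measure between two attributes Ai and Aj of a relation R[A1, A2, ..., An] with respect to the set of applications Q = (q1, q2, ..., qq) is defined as follows:
$af(A_i, A_j) = \sum_{\text{all queries that access } A_i \text{ and } A_j} (\text{query access})$
$\text{query access} = \sum_{\text{all sites}} \text{access frequency of a query} \times \dfrac{\text{access}}{\text{execution}}$
Assume each query accesses the attributes once per execution, with the following access frequencies at three sites:
     S1  S2  S3
q1   15  20  10
q2   5   0   0
q3   25  25  25
q4   3   0   0
Then
af(A1, A3) = 15*1 + 20*1 + 10*1 = 45
and the attribute affinity matrix AA is
     A1  A2  A3  A4
A1   45  0   45  0
A2   0   80  5   75
A3   45  5   53  3
A4   0   75  3   78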
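The affinity matrix can be recomputed from the usage values and the access frequencies. The sketch below assumes the use(qi,Aj) values implied by the four PROJ queries (A1 = PNO, A2 = PNAME, A3 = BUDGET, A4 = LOC) and one access per execution. The per-site split for q2 and q4 is an assumption; only their totals (5 and 3) affect the result. The function name `affinity_matrix` is hypothetical.

```python
# use(qi, Aj) for A1..A4, derived from the four example queries.
USE = {
    "q1": [1, 0, 1, 0],
    "q2": [0, 1, 1, 0],
    "q3": [0, 1, 0, 1],
    "q4": [0, 0, 1, 1],
}

# Access frequencies per site; the split across sites for q2 and q4 is assumed.
ACC = {
    "q1": [15, 20, 10],
    "q2": [5, 0, 0],
    "q3": [25, 25, 25],
    "q4": [3, 0, 0],
}


def affinity_matrix(use, acc, accesses_per_execution=1):
    """af(Ai, Aj) = sum, over queries using both Ai and Aj, of total accesses."""
    n = len(next(iter(use.values())))
    aa = [[0] * n for _ in range(n)]
    for query, row in use.items():
        query_access = sum(acc[query]) * accesses_per_execution
        for i in range(n):
            for j in range(n):
                if row[i] and row[j]:
                    aa[i][j] += query_access
    return aa


AA = affinity_matrix(USE, ACC)
```

The diagonal entry af(Ai, Ai) is simply the total access count of all queries that reference Ai.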
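VF Clustering Algorithm
Take the attribute affinity matrix AA and reorganize the attribute order to form clusters in which the attributes show high affinity to one another. The contribution of placing attribute Ak between Ai and Aj is
$cont(A_i, A_k, A_j) = 2\,bond(A_i, A_k) + 2\,bond(A_k, A_j) - 2\,bond(A_i, A_j)$
where
$bond(A_x, A_y) = \sum_{z=1}^{n} af(A_z, A_x)\, af(A_z, A_y)$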
BEA Example
Consider the following AA matrix and the corresponding CA matrix
where A1 and A2 have been placed. Place A3:
Ordering (0-3-1):
cont(A0,A3,A1) = 2bond(A0,A3) + 2bond(A3,A1) − 2bond(A0,A1)
= 2*0 + 2*4410 − 2*0 = 8820
Ordering (1-3-2):
cont(A1,A3,A2) = 2bond(A1,A3) + 2bond(A3,A2) − 2bond(A1,A2)
= 2*4410 + 2*890 − 2*225 = 10150
Ordering (2-3-4):
cont(A2,A3,A4) = 1780
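The three contributions can be checked numerically. The sketch below uses the affinity matrix of the PROJ example with 0-based column indices (A1 is index 0). An empty boundary position, such as A0 or the not yet placed A4 in ordering (2-3-4), is represented by `None` and contributes a bond of 0. The function names `bond` and `cont` mirror the formulas, while the `None` convention is an assumption of this sketch.

```python
# Attribute affinity matrix AA for A1..A4 from the PROJ example.
AA = [
    [45, 0, 45, 0],
    [0, 80, 5, 75],
    [45, 5, 53, 3],
    [0, 75, 3, 78],
]


def bond(aa, x, y):
    """bond(Ax, Ay) = sum over z of af(Az, Ax) * af(Az, Ay); None is an empty column."""
    if x is None or y is None:
        return 0
    return sum(aa[z][x] * aa[z][y] for z in range(len(aa)))


def cont(aa, left, middle, right):
    """Net contribution of placing column `middle` between `left` and `right`."""
    return (
        2 * bond(aa, left, middle)
        + 2 * bond(aa, middle, right)
        - 2 * bond(aa, left, right)
    )


ORDERINGS = {
    "0-3-1": cont(AA, None, 2, 0),
    "1-3-2": cont(AA, 0, 2, 1),
    "2-3-4": cont(AA, 1, 2, None),
}
BEST = max(ORDERINGS, key=ORDERINGS.get)
```

The ordering with the largest contribution is (1-3-2), so A3 is placed between A1 and A2.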
BEA Example
Therefore, the CA matrix has the form
     A1  A3  A2
A1   45  45  0
A2   0   5   80
A3   45  53  5
A4   0   3   75
When A4 is placed, the final form of the CA matrix (after row organization) is
     A1  A3  A2  A4
A1   45  45  0   0
A3   45  53  5   3
A2   0   5   80  75
A4   0   3   75  78
VF Algorithm
How can you divide a set of clustered attributes {A1, A2, ..., An} into two (or more) sets {A1, A2, ..., Ai} and {Ai+1, ..., An} such that there are no (or minimal) applications that access both (or more than one) of the sets?
The function will produce fragments that are balanced.
For best partitioning, split the columns into a one-column BC and an (n − 1)-column TC first, and then repeatedly add columns from TC to BC until TC is reduced to a single column.
Z can be positive if total accesses to only one fragment are maximized while the total accesses to both fragments are minimized.
Disadvantage: not able to carve out an embedded or inner block of columns as a partition.
Solution: overcome by adding a shift operation, which moves the topmost row of the matrix to the bottom and then moves the leftmost column of the matrix to the extreme right.
[Figure: clustered affinity matrix with rows and columns A1, A2, ..., Ai, Ai+1, ..., Am; the upper-left block over A1..Ai is TA and the lower-right block over Ai+1..Am is BA. SHIFT OPERATION.]
VF Algorithm
Define
TQ = set of applications that access only TA (top corner attributes)
BQ = set of applications that access only BA (bottom corner attributes)
OQ = set of applications that access both TA and BA
and
CTQ = total number of accesses to attributes by applications that access only TA
CBQ = total number of accesses to attributes by applications that access only BA
COQ = total number of accesses to attributes by applications that access both TA and BA
Then find the split point that maximizes
$z = CTQ \cdot CBQ - COQ^2$
Example
1)
TC: all the applications that access one of the TC columns (C4, C1, or C3) but do not access any BC column [i.e., AP1, AP2, and AP4]
No application is BC-only.
Result: The two vertical fragments will be defined as VF1(C, C4) and VF2(C, C1, C2, C3).
3)
TCW = AFF(AP2) = 7
BCW = AFF(AP3) + AFF(AP4) = 4 + 3 = 7
BOCW = AFF(AP1) = 3
Z = 7*7 − 3^2 = 49 − 9 = 40
4)
TCW = AFF(AP4) = 3
BCW = AFF(AP2) = 7
BOCW = AFF(AP1) + AFF(AP3) = 3 + 4 = 7
Z = 3*7 − 7^2 = 21 − 49 = −28
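The arithmetic of the split evaluation can be sketched as a small function: Z is the product of the two single-fragment weights minus the square of the weight of applications that touch both fragments. The function name `partition_quality` and the dictionary of candidate splits are illustrative assumptions; the weights are the ones from steps 3) and 4).

```python
def partition_quality(top_only, bottom_only, both):
    """Z = TCW * BCW - BOCW^2 for one candidate split point."""
    return top_only * bottom_only - both ** 2


# (TCW, BCW, BOCW) for the two candidate splits worked out in steps 3) and 4).
CANDIDATES = {
    "split 3": (7, 4 + 3, 3),
    "split 4": (3, 7, 3 + 4),
}

SCORES = {name: partition_quality(*weights) for name, weights in CANDIDATES.items()}
BEST_SPLIT = max(SCORES, key=SCORES.get)
```

A negative Z, as in step 4), signals that accesses spanning both fragments dominate, so that split point is a poor choice.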
VF Algorithm
Two problems:
Cluster forming in the middle of the CA matrix
   Shift a row up and a column left and apply the algorithm to find the best partitioning point
More than two clusters
   m-way partitioning
VF Correctness
A relation R, defined over attribute set A and key K,
generates the vertical partitioning FR = {R1, R2, , Rr}.
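Completeness
The following should be true for A:
$A = \bigcup A_{R_i}$
Reconstruction
Reconstruction can be achieved by
$R = \bowtie_K R_i, \quad \forall R_i \in F_R$
Disjointness
TID's are not considered to be overlapping since they are maintained by the system
Duplicated keys are not considered to be overlapping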
Hybrid Fragmentation
R
   HF -> R1
      VF -> R11, R12
   HF -> R2
      VF -> R21, R22, R23
Fragment Allocation
Problem Statement
Given
F = {F1, F2, ..., Fn} fragments
S = {S1, S2, ..., Sm} network sites
Q = {q1, q2, ..., qq} applications
Find the "optimal" distribution of F to S.
Minimal cost
Communication + storage + processing (read & update)
Cost in terms of time (usually)
Performance
Constraints
Information Requirements
Database information
selectivity of fragments
size of a fragment
Application information
Allocation
File Allocation (FAP) vs Database Allocation (DAP):
Fragments are not individual files
relationships have to be maintained
Cost of integrity enforcement should be considered
Cost of concurrency control should be considered
Allocation Information
Requirements
Database Information
selectivity of fragments
size of a fragment
Application Information
number of read accesses of a query to a fragment
number of update accesses of query to a fragment
A matrix indicating which queries update which fragments
A similar matrix for retrievals
originating site of each query
Site Information
unit cost of storing data at a site
unit cost of processing at a site
Network Information
communication cost/frame between two sites
frame size
Allocation Model
General Form
min(Total Cost)
subject to
response time constraint
storage constraint
processing constraint
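Decision Variable
$x_{ij} = \begin{cases} 1 & \text{if fragment } F_i \text{ is stored at site } S_j \\ 0 & \text{otherwise} \end{cases}$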
Allocation Model
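Total Cost
$\sum_{\text{all queries}} \text{query processing cost} + \sum_{\text{all sites}} \sum_{\text{all fragments}} \text{cost of storing a fragment at a site}$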
Allocation Model
Query Processing Cost
Processing component
access cost + integrity enforcement cost +
concurrency control cost
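Access cost
$\sum_{\text{all sites}} \sum_{\text{all fragments}} (\text{no. of update accesses} + \text{no. of read accesses}) \cdot x_{ij} \cdot \text{local processing cost at a site}$
Integrity enforcement and concurrency control costs
Can be similarly calculated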
Allocation Model
Query Processing Cost
Transmission component
cost of processing updates + cost of processing
retrievals
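Cost of updates
$\sum_{\text{all sites}} \sum_{\text{all fragments}} \text{update message cost} + \sum_{\text{all sites}} \sum_{\text{all fragments}} \text{acknowledgment cost}$
Retrieval Cost
$\sum_{\text{all fragments}} \min_{\text{all sites}} (\text{cost of retrieval command} + \text{cost of sending back the result})$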
Allocation Model
Constraints
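Response Time
execution time of query ≤ max. allowable response time for that query
Storage Constraint (for a site)
$\sum_{\text{all fragments}} \text{storage requirement of a fragment at that site} \le \text{storage capacity at that site}$
Processing Constraint (for a site)
$\sum_{\text{all queries}} \text{processing load of a query at that site} \le \text{processing capacity of that site}$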
Allocation Model
Solution Methods
FAP is NP-complete
DAP also NP-complete
Heuristics based on
single commodity warehouse location (for
FAP)
knapsack problem
branch and bound techniques
network flow
Allocation Model
Attempts to reduce the solution space
assume all candidate partitionings known;