2 Distribution Design
M. Tamer Özsu
Patrick Valduriez
Primary Horizontal Fragmentation (PHF)
Derived Horizontal Fragmentation (DHF)
Vertical fragmentation (VF) : project operator
Used in column-store parallel DBMSs for analytical applications
(typically require fast access to a few attributes)
Hybrid fragmentation
Non-replicated
partitioned : each fragment resides at only one site
Replicated
fully replicated : each fragment at each site
partially replicated : each fragment at some of the sites
Rule of thumb:
Example
m1: PNAME = "Maintenance" ∧ BUDGET ≤ 200000
Application Information
Quantitative information about the workload
minterm selectivities: sel(mi)
The number of tuples of the relation that would be
accessed by a user query specified according to
a given minterm predicate mi.
access frequencies: acc(qi)
The frequency with which a user application qi accesses
data.
Access frequency for a minterm predicate can also be
defined.
Given a set of minterm predicates M, there are as many horizontal
fragments of relation R as there are minterm predicates.
Set of horizontal fragments also referred to as minterm fragments.
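A minimal sketch of how minterm fragments could be enumerated from a set of simple predicates. The sample PROJ rows, the predicate labels, and the helper name minterm_fragments are assumptions made for illustration. Dropping a minterm because it selects no tuple of this particular instance is a simplification, not the implication-based elimination of contradictory minterms.

```python
from itertools import product

# Hypothetical sample instance of PROJ(PNO, PNAME, BUDGET, LOC).
PROJ = [
    {"PNO": "P1", "PNAME": "Instrumentation", "BUDGET": 150000, "LOC": "Montreal"},
    {"PNO": "P2", "PNAME": "Database Develop.", "BUDGET": 135000, "LOC": "New York"},
    {"PNO": "P3", "PNAME": "CAD/CAM", "BUDGET": 250000, "LOC": "New York"},
    {"PNO": "P4", "PNAME": "Maintenance", "BUDGET": 310000, "LOC": "Paris"},
]

# Simple predicates as (label, function) pairs.
SIMPLE = [
    ("LOC=Montreal", lambda t: t["LOC"] == "Montreal"),
    ("LOC=New York", lambda t: t["LOC"] == "New York"),
    ("BUDGET<=200000", lambda t: t["BUDGET"] <= 200000),
]

def minterm_fragments(relation, simple):
    """Map each non-empty minterm (tuple of signed labels) to its fragment."""
    fragments = {}
    # A minterm takes every simple predicate in natural or negated form.
    for signs in product([True, False], repeat=len(simple)):
        rows = [t for t in relation
                if all(pred(t) == sign for (_, pred), sign in zip(simple, signs))]
        if rows:  # simplification: drop minterms that select nothing here
            name = tuple(label if sign else "NOT " + label
                         for (label, _), sign in zip(simple, signs))
            fragments[name] = rows
    return fragments

frags = minterm_fragments(PROJ, SIMPLE)
for name, rows in frags.items():
    print(" AND ".join(name), "->", [t["PNO"] for t in rows])
```

Three simple predicates give 2³ = 8 candidate minterms; on this sample instance only four of them select any tuple.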
Preliminaries :
Pr should be complete
Pr should be minimal
Example :
Assume PROJ [PNO,PNAME,BUDGET,LOC] has two
applications defined on it.
Find the budgets of projects at each location. (1)
Find projects with budgets less than or equal to $200000. (2)
According to (1),
Pr={LOC=“Montreal”,LOC=“New York”,LOC=“Paris”}
which is not complete with respect to (2); adding the predicates
BUDGET≤200000 and BUDGET>200000 makes it complete.
Example :
Pr ={LOC=“Montreal”,LOC=“New York”, LOC=“Paris”,
BUDGET≤200000,BUDGET>200000}
Simple predicates
For application (1)
p1 : LOC = “Montreal”
p2 : LOC = “New York”
p3 : LOC = “Paris”
For application (2)
p4 : BUDGET ≤ 200000
p5 : BUDGET > 200000
Pr = Pr' = {p1,p2,p4}
Completeness
Since Pr' is complete and minimal, the selection predicates are
complete
Reconstruction
If relation R is fragmented into FR = {R1,R2,…,Rr}
R = ∪ Ri, ∀Ri ∈ FR
Disjointness
Minterm predicates that form the basis of fragmentation should
be mutually exclusive.
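The three correctness rules can be checked mechanically on a concrete instance. The sketch below assumes a relation is a Python set of tuples and a fragmentation is a list of such sets. The sample rows and the name check_fragmentation are illustrative assumptions.

```python
# Hypothetical PROJ(PNO, PNAME, BUDGET, LOC) instance as a set of tuples.
PROJ = {
    ("P1", "Instrumentation", 150000, "Montreal"),
    ("P2", "Database Develop.", 135000, "New York"),
    ("P3", "CAD/CAM", 250000, "New York"),
    ("P4", "Maintenance", 310000, "Paris"),
}

def check_fragmentation(relation, fragments):
    """Return (complete, reconstructs, disjoint) for horizontal fragments."""
    union = set().union(*fragments)
    complete = relation <= union        # every tuple appears in some fragment
    reconstructs = union == relation    # R is exactly the union of the fragments
    total = sum(len(f) for f in fragments)
    disjoint = total == len(union)      # no tuple appears in two fragments
    return complete, reconstructs, disjoint

low = {t for t in PROJ if t[2] <= 200000}
high = {t for t in PROJ if t[2] > 200000}
good = check_fragmentation(PROJ, [low, high])

# Predicates that are not mutually exclusive break disjointness.
overlapping = {t for t in PROJ if t[2] >= 135000}
bad = check_fragmentation(PROJ, [low, overlapping])

print(good, bad)
```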
QUERY 2:
ASG1 = ASG ⋉ EMP1
ASG2 = ASG ⋉ EMP2
[Figure: derived fragmentation of ASG with respect to PROJ, and derived fragmentation of ASG with respect to EMP]
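A rough sketch of derived horizontal fragmentation as a semijoin, following ASG1 = ASG ⋉ EMP1 and ASG2 = ASG ⋉ EMP2. The rows of EMP1, EMP2 and ASG and the helper name semijoin are made up for illustration; only the join attribute ENO is assumed to link the two relations.

```python
def semijoin(member, owner_fragment, attr):
    """member ⋉ owner_fragment: member tuples that match some owner tuple on attr."""
    keys = {t[attr] for t in owner_fragment}
    return [t for t in member if t[attr] in keys]

# Hypothetical owner fragments of EMP and member relation ASG.
EMP1 = [{"ENO": "E1", "TITLE": "Elect. Eng."}, {"ENO": "E2", "TITLE": "Syst. Anal."}]
EMP2 = [{"ENO": "E3", "TITLE": "Mech. Eng."}]
ASG = [
    {"ENO": "E1", "PNO": "P1"},
    {"ENO": "E2", "PNO": "P1"},
    {"ENO": "E2", "PNO": "P2"},
    {"ENO": "E3", "PNO": "P3"},
]

ASG1 = semijoin(ASG, EMP1, "ENO")
ASG2 = semijoin(ASG, EMP2, "ENO")
print(len(ASG1), len(ASG2))
```

Because every ASG tuple here refers to exactly one employee and each employee sits in exactly one EMP fragment, the derived fragments cover ASG without overlap.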
Completeness
Referential integrity
Let R be the member relation of a link whose owner is relation S
which is fragmented as FS = {S1, S2, ..., Sn}. Furthermore, let A
be the join attribute between R and S. Then, for each tuple t of
R, there should be a tuple t' of S such that
t[A] = t' [A]
Reconstruction
Same as primary horizontal fragmentation.
Disjointness
Satisfied when the join graph between the owner and the member
fragments is simple.
ref_l(q_k): the number of accesses to attributes (Ai, Aj) for each execution of application q_k at
site S_l
acc_l(q_k): the application access frequency measure previously defined and modified to
include frequencies at different sites
Assume each query in the previous example accesses the attributes once
during each execution.
Also assume the access frequencies
      S1   S2   S3
q1    15   20   10
q2     5    0    0
q3    25   25   25
q4     3    0    0
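Then, with ref_l(q_k) = 1 and q1 the only application that accesses both A1 and A3,
aff(A1, A3) = acc_1(q1) + acc_2(q1) + acc_3(q1) = 15 + 20 + 10 = 45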
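The affinity computation can be sketched in a few lines. The access frequencies come from the table above. The attribute usage of each query (USE) is an assumption, since the earlier example is not reproduced in this section; it is chosen so that q1 is the only query touching both A1 and A3.

```python
# Access frequencies acc_l(q_k): query -> (S1, S2, S3).
ACC = {"q1": (15, 20, 10), "q2": (5, 0, 0), "q3": (25, 25, 25), "q4": (3, 0, 0)}

# ASSUMPTION: which attributes each query uses (not shown in this section).
USE = {
    "q1": {"A1", "A3"},
    "q2": {"A2", "A3"},
    "q3": {"A2", "A4"},
    "q4": {"A3", "A4"},
}

REF = 1  # each query accesses the attributes once per execution

def aff(ai, aj):
    """Sum ref * acc over all sites, for queries that use both ai and aj."""
    return sum(REF * freq
               for q, attrs in USE.items() if ai in attrs and aj in attrs
               for freq in ACC[q])

print(aff("A1", "A3"))
```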
R(A1, . . . , An)
Permutes its rows and columns, and generates a
where
bond(Ax, Ay) = \sum_{z=1}^{n} aff(Az, Ax) aff(Az, Ay)
Ordering (0-3-1) :
cont(A0,BUDGET,PNO) = 2bond(A0, BUDGET) + 2bond(BUDGET, PNO) − 2bond(A0, PNO) = 8820
Ordering (1-3-2) :
cont(PNO,BUDGET,PNAME) = 10150
Ordering (2-3-4) :
cont (PNAME,BUDGET,LOC) = 1780
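The three contributions can be reproduced with a short sketch. The affinity matrix AA below is an assumption (it is not shown in this section) built from the access frequencies used earlier. The sketch also assumes that a boundary position, written None, bonds with nothing: A0 in the first ordering, and position 4 in the third ordering, where LOC has not yet been placed.

```python
ATTRS = ["PNO", "PNAME", "BUDGET", "LOC"]

# ASSUMPTION: attribute affinity matrix AA, AA[row][col] = aff(row, col).
AA = {
    "PNO":    {"PNO": 45, "PNAME": 0,  "BUDGET": 45, "LOC": 0},
    "PNAME":  {"PNO": 0,  "PNAME": 80, "BUDGET": 5,  "LOC": 75},
    "BUDGET": {"PNO": 45, "PNAME": 5,  "BUDGET": 53, "LOC": 3},
    "LOC":    {"PNO": 0,  "PNAME": 75, "BUDGET": 3,  "LOC": 78},
}

def bond(ax, ay):
    """Sum over z of aff(Az, Ax) * aff(Az, Ay); a boundary (None) gives 0."""
    if ax is None or ay is None:
        return 0
    return sum(AA[az][ax] * AA[az][ay] for az in ATTRS)

def cont(left, new, right):
    """Net contribution of placing `new` between `left` and `right`."""
    return 2 * bond(left, new) + 2 * bond(new, right) - 2 * bond(left, right)

print(cont(None, "BUDGET", "PNO"))     # ordering (0-3-1)
print(cont("PNO", "BUDGET", "PNAME"))  # ordering (1-3-2)
print(cont("PNAME", "BUDGET", None))   # ordering (2-3-4), position 4 still empty
```

Under these assumptions the second ordering has the largest contribution, so BUDGET would be placed between PNO and PNAME.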
Define
TQ = set of applications that access only TA
BQ = set of applications that access only BA
OQ = set of applications that access both TA and BA
and
CTQ = total number of accesses to attributes by applications
that access only TA
CBQ = total number of accesses to attributes by applications
that access only BA
COQ = total number of accesses to attributes by applications
that access both TA and BA
Then find the point along the diagonal that maximizes
CTQ ∗ CBQ − COQ²
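A minimal sketch of the search for the split point, assuming one access per execution so that each query contributes its total access frequency. The clustered attribute order and the per-query attribute sets and totals are assumptions for illustration; TA is taken as the attributes before the split point and BA as those after it.

```python
# ASSUMPTION: attribute order along the diagonal of the clustered matrix.
ORDER = ["PNO", "BUDGET", "PNAME", "LOC"]

# ASSUMPTION: query -> (attributes used, total accesses over all sites).
QUERIES = {
    "q1": ({"PNO", "BUDGET"}, 45),
    "q2": ({"PNAME", "BUDGET"}, 5),
    "q3": ({"PNAME", "LOC"}, 75),
    "q4": ({"BUDGET", "LOC"}, 3),
}

def split_score(point):
    """CTQ * CBQ - COQ^2 for TA = ORDER[:point] and BA = ORDER[point:]."""
    ta, ba = set(ORDER[:point]), set(ORDER[point:])
    ctq = sum(n for attrs, n in QUERIES.values() if attrs <= ta)  # only TA
    cbq = sum(n for attrs, n in QUERIES.values() if attrs <= ba)  # only BA
    coq = sum(n for attrs, n in QUERIES.values()
              if attrs & ta and attrs & ba)                       # both
    return ctq * cbq - coq ** 2

scores = {p: split_score(p) for p in range(1, len(ORDER))}
best = max(scores, key=scores.get)
print(scores)
print(ORDER[:best], ORDER[best:])
```

With these inputs the best split separates {PNO, BUDGET} from {PNAME, LOC}.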
Reconstruction
Reconstruction can be achieved by
R = ⋈K Ri, ∀Ri ∈ FR
Disjointness
TID's are not considered to be overlapping since they are
maintained by the system
Duplicated keys are not considered to be overlapping
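Reconstruction of a vertically fragmented relation can be sketched as a join on the key. The sample rows, the choice of fragments, and the helper names project and join_on_key are illustrative assumptions. The key PNO is repeated in both fragments, which the disjointness rule above does not count as overlap.

```python
def project(relation, attrs):
    """Projection of a list-of-dicts relation onto the given attributes."""
    return [{a: t[a] for a in attrs} for t in relation]

def join_on_key(r1, r2, key):
    """Join two fragments that share the same key attribute."""
    index = {t[key]: t for t in r2}
    return [{**t, **index[t[key]]} for t in r1 if t[key] in index]

# Hypothetical PROJ(PNO, PNAME, BUDGET, LOC) instance.
PROJ = [
    {"PNO": "P1", "PNAME": "Instrumentation", "BUDGET": 150000, "LOC": "Montreal"},
    {"PNO": "P2", "PNAME": "Database Develop.", "BUDGET": 135000, "LOC": "New York"},
    {"PNO": "P3", "PNAME": "CAD/CAM", "BUDGET": 250000, "LOC": "New York"},
    {"PNO": "P4", "PNAME": "Maintenance", "BUDGET": 310000, "LOC": "Paris"},
]

PROJ1 = project(PROJ, ["PNO", "BUDGET"])
PROJ2 = project(PROJ, ["PNO", "PNAME", "LOC"])
rebuilt = join_on_key(PROJ1, PROJ2, "PNO")
print(rebuilt == PROJ)
```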
© 2020, M.T. Özsu & P. Valduriez
Hybrid Fragmentation
General Form
min(Total Cost)
subject to
response time constraint
storage constraint
processing constraint
Decision Variable
Total Cost
Processing component
access cost + integrity enforcement cost + concurrency control cost
Access cost
Constraints
Response Time
execution time of query ≤ max. allowable response time for that query