Principles of Distributed Database Systems
Presenter: Mr. Thomas Tesha

Lecture slides as adapted and customized from © 2020, M.T. Özsu & P. Valduriez

Outline
• Introduction
• Distributed and Parallel Database Design
• Distributed Data Control
• Distributed Query Processing
• Distributed Transaction Processing
• Data Replication
• Database Integration – Multidatabase Systems
• Parallel Database Systems

Outline
• Distributed and Parallel Database Design
  - Fragmentation
  - Data distribution

Distribution Design

Fragmentation

• Why fragmentation?
  - First, application views are usually subsets of relations. The locality of application accesses is therefore defined on subsets of relations rather than on entire relations, so it is natural to treat those subsets as the units of distribution.
  - Second, if the applications that have views defined on a given relation reside at different sites, keeping the entire relation as the unit of distribution leaves two alternatives. Either the relation is not replicated and is stored at only one site, or it is replicated at all or some of the sites where the applications reside. The former results in an unnecessarily high volume of remote data accesses. The latter causes unnecessary replication, which creates problems in executing updates (to be discussed later) and may not be desirable if storage is limited.
  - Finally, fragmentation facilitates concurrency: because each fragment is treated as a unit, several transactions can execute concurrently on different fragments.

Fragmentation alternatives

• What is a reasonable way to fragment a relation for distribution?
• Relation instances are essentially tables, so the question is: what are the alternative ways of dividing a table into smaller ones?
• There are two alternatives: dividing it horizontally or dividing it vertically.

Fragmentation Alternatives – Horizontal

PROJ1: projects with budgets less than or equal to $200,000
PROJ2: projects with budgets greater than $200,000

\[ PROJ_1 = \sigma_{BUDGET \le 200000}(PROJ) \]
\[ PROJ_2 = \sigma_{BUDGET > 200000}(PROJ) \]

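The two selections above can be sketched in Python. This is only an illustration: the sample PROJ tuples and the `select` helper are assumptions made for the example, not part of the slides.

```python
# Hypothetical PROJ tuples; only the attribute names come from the slides.
PROJ = [
    {"PNO": "P1", "PNAME": "Instrumentation", "BUDGET": 150000, "LOC": "Montreal"},
    {"PNO": "P2", "PNAME": "Database Develop.", "BUDGET": 135000, "LOC": "New York"},
    {"PNO": "P3", "PNAME": "CAD/CAM", "BUDGET": 250000, "LOC": "New York"},
    {"PNO": "P4", "PNAME": "Maintenance", "BUDGET": 310000, "LOC": "Paris"},
]


def select(relation, predicate):
    """Relational selection: keep the tuples that satisfy the predicate."""
    return [t for t in relation if predicate(t)]


# Horizontal fragments defined by the two selection formulas.
PROJ1 = select(PROJ, lambda t: t["BUDGET"] <= 200000)
PROJ2 = select(PROJ, lambda t: t["BUDGET"] > 200000)

print([t["PNO"] for t in PROJ1])
print([t["PNO"] for t in PROJ2])
```

Because the two predicates are complementary, every tuple lands in exactly one fragment.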
Fragmentation Alternatives – Vertical
PROJ1: information about project budgets
PROJ2: information about project names and locations

Correctness of Fragmentation

• Completeness
  - Decomposition of relation R into fragments R1, R2, ..., Rn is complete if and only if each data item in R can also be found in some Ri.
• Reconstruction
  - If relation R is decomposed into fragments R1, R2, ..., Rn, then there should exist some relational operator ∇ such that
    \[ R = \nabla_{1 \le i \le n} R_i \]
• Disjointness
  - If relation R is decomposed into fragments R1, R2, ..., Rn, and data item di is in Rj, then di should not be in any other fragment Rk (k ≠ j).

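The three rules can be checked mechanically for a horizontal fragmentation, where the reconstruction operator ∇ is union. The sketch below assumes relations are modelled as Python sets of tuples; the function names and the sample data are hypothetical.

```python
def is_complete(relation, fragments):
    """Every data item of the relation appears in at least one fragment."""
    return all(any(t in f for f in fragments) for t in relation)


def reconstructs(relation, fragments):
    """For horizontal fragments the reconstruction operator is union."""
    return set().union(*fragments) == relation


def is_disjoint(fragments):
    """No data item appears in more than one fragment."""
    return sum(len(f) for f in fragments) == len(set().union(*fragments))


# Hypothetical (PNO, BUDGET) tuples.
R = {("P1", 150000), ("P2", 135000), ("P3", 250000), ("P4", 310000)}
R1 = {t for t in R if t[1] <= 200000}
R2 = {t for t in R if t[1] > 200000}

print(is_complete(R, [R1, R2]), reconstructs(R, [R1, R2]), is_disjoint([R1, R2]))
```

Dropping a fragment breaks completeness, and copying a tuple into a second fragment breaks disjointness while leaving the other two rules intact.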
Allocation Alternatives

• Non-replicated
  - partitioned: each fragment resides at only one site
• Replicated
  - fully replicated: each fragment at each site
  - partially replicated: each fragment at some of the sites
• Rule of thumb:

Comparison of Replication Alternatives

Fragmentation

• Horizontal Fragmentation (HF)
  - Primary Horizontal Fragmentation (PHF)
  - Derived Horizontal Fragmentation (DHF)
• Vertical Fragmentation (VF)
• Hybrid Fragmentation (HF)

PHF – Information Requirements

• Database Information
  - relationships
  - cardinality of each relation: card(R)

PHF – Information Requirements
• Application Information
  - simple predicates: Given R[A1, A2, …, An], a simple predicate pj is
        pj : Ai θ Value
    where θ ∈ {=, <, ≤, >, ≥, ≠}, Value ∈ Di, and Di is the domain of Ai.
    For relation R we define Pr = {p1, p2, …, pm}.
    Example:
        PNAME = "Maintenance"
        BUDGET ≤ 200000
  - minterm predicates: Given R and Pr = {p1, p2, …, pm}, define M = {m1, m2, …, mz} as
    \[ M = \{\, m_i \mid m_i = \bigwedge_{p_j \in Pr} p_j^* \,\}, \quad 1 \le j \le m, \; 1 \le i \le z \]
    where pj* = pj or pj* = ¬(pj).

PHF – Information Requirements

Example
m1: PNAME = "Maintenance" ∧ BUDGET ≤ 200000
m2: NOT(PNAME = "Maintenance") ∧ BUDGET ≤ 200000
m3: PNAME = "Maintenance" ∧ NOT(BUDGET ≤ 200000)
m4: NOT(PNAME = "Maintenance") ∧ NOT(BUDGET ≤ 200000)

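Minterm predicates can be enumerated mechanically: each simple predicate appears either in its natural or its negated form, so m simple predicates yield 2^m candidate minterms. The sketch below is illustrative only; the `minterms` helper and the string encoding of predicates are assumptions, and the enumeration order may differ from the numbering m1 to m4 used above.

```python
from itertools import product

# The two simple predicates from the example, encoded as plain strings.
simple_predicates = ['PNAME = "Maintenance"', "BUDGET <= 200000"]


def minterms(predicates):
    """Conjoin every predicate in either natural or negated form."""
    result = []
    for signs in product([True, False], repeat=len(predicates)):
        literals = [p if keep else f"NOT({p})" for p, keep in zip(predicates, signs)]
        result.append(" AND ".join(literals))
    return result


for m in minterms(simple_predicates):
    print(m)
```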
PHF – Information Requirements

• Application Information
  - minterm selectivities: sel(mi)
    The number of tuples of the relation that would be accessed by a user query specified according to a given minterm predicate mi.
  - access frequencies: acc(qi)
    The frequency with which a user application qi accesses data.
    Access frequency for a minterm predicate can also be defined.

Primary Horizontal Fragmentation

• Definition: A primary horizontal fragmentation is defined by a selection operation on the owner relations of a database schema. Therefore, given relation R, its horizontal fragments are given by
  \[ R_j = \sigma_{F_j}(R), \quad 1 \le j \le w \]
  where Fj is a selection formula, which is (preferably) a minterm predicate.
• Therefore, a horizontal fragment Ri of relation R consists of all the tuples of R that satisfy a minterm predicate mi.
• Given a set of minterm predicates M, there are as many horizontal fragments of relation R as there are minterm predicates. This set of horizontal fragments is also referred to as the set of minterm fragments.

PHF – Algorithm

Given: A relation R and the set of simple predicates Pr
Output: The set of fragments of R = {R1, R2, …, Rw} which obey the fragmentation rules.

Preliminaries:
• Pr should be complete
• Pr should be minimal

Completeness of Simple Predicates

• Completeness of simple predicates concerns whether the predicates capture every distinction that applications make when they access a relation, so that the resulting fragments match the access patterns at the distributed nodes.
  - A simple predicate is a single condition (comparison with a value) applied to one attribute of a relation (table).
  - A complete set of simple predicates is a precondition for fragments that support efficient query processing and data retrieval in a distributed database system.
• A set of simple predicates Pr is said to be complete if and only if any two tuples of the same minterm fragment defined on Pr have the same probability of being accessed by any application.

Completeness of Simple Predicates

• Example:
  - Assume PROJ[PNO, PNAME, BUDGET, LOC] has two applications defined on it.
    (1) Find the budgets of projects at each location.
    (2) Find projects with budgets less than or equal to $200000.
  - According to (1),
        Pr = {LOC = "Montreal", LOC = "New York", LOC = "Paris"}
    which is not complete with respect to (2), because tuples within one location fragment are accessed differently by (2) depending on their budget.
  - Modify
        Pr = {LOC = "Montreal", LOC = "New York", LOC = "Paris", BUDGET ≤ 200000, BUDGET > 200000}
    which is complete.

Minimality of Simple Predicates

• In the context of distributed database fragmentation, minimality of simple predicates means that the set Pr contains no predicate that is irrelevant to how applications access the data: every predicate kept in Pr must actually influence how data is distributed across the nodes of the system.
• A minimal predicate set avoids splitting a relation into more fragments than the applications need, which keeps data distribution efficient and avoids unnecessary data movement across nodes during query processing.

Minimality of Simple Predicates

• If a predicate influences how fragmentation is performed (i.e., causes a fragment f to be further fragmented into, say, fi and fj), then there should be at least one application that accesses fi and fj differently.
• In other words, the simple predicate should be relevant in determining a fragmentation.
• If all the predicates of a set Pr are relevant, then Pr is minimal.

Minimality of Simple Predicates

• Definition: Let mi and mj be two minterm predicates that are identical in their definition, except that mi contains the simple predicate pi in its natural form while mj contains ¬pi. Also, let fi and fj be two fragments defined according to mi and mj, respectively. Then pi is relevant if and only if
  \[ \frac{acc(m_i)}{card(f_i)} \neq \frac{acc(m_j)}{card(f_j)} \]

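The relevance test compares access frequency per tuple in the two fragments. A minimal sketch, assuming acc and card are available as integers; the function name and the numbers used are hypothetical:

```python
from fractions import Fraction


def is_relevant(acc_mi, card_fi, acc_mj, card_fj):
    """p_i is relevant iff acc(m_i)/card(f_i) != acc(m_j)/card(f_j).

    Fractions give an exact comparison without floating-point rounding.
    Both cardinalities are assumed to be non-zero.
    """
    return Fraction(acc_mi, card_fi) != Fraction(acc_mj, card_fj)


# Hypothetical figures: the two fragments are accessed at different rates
# per tuple, so the predicate that separates them is relevant.
print(is_relevant(20, 10, 5, 10))

# Same access rate per tuple in both fragments: the predicate is not relevant.
print(is_relevant(20, 10, 10, 5))
```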
Minimality of Simple Predicates

Example:
    Pr = {LOC = "Montreal", LOC = "New York", LOC = "Paris", BUDGET ≤ 200000, BUDGET > 200000}
is minimal (in addition to being complete). However, if we add
    PNAME = "Instrumentation"
then Pr is not minimal, because no application accesses the fragments produced by the new predicate any differently, so the predicate is not relevant.

COM_MIN Algorithm

Given: a relation R and a set of simple predicates Pr
Output: a complete and minimal set of simple predicates Pr' for Pr

Rule 1: a relation or fragment is partitioned into at least two parts which are accessed differently by at least one application.

COM_MIN Algorithm

• Initialization:
  - find a pi ∈ Pr such that pi partitions R according to Rule 1
  - set Pr' ← {pi}; Pr ← Pr – {pi}; F ← {fi}
• Iteratively add predicates to Pr' until it is complete
  - find a pj ∈ Pr such that pj partitions some fk, defined according to a minterm predicate over Pr', according to Rule 1
  - set Pr' ← Pr' ∪ {pj}; Pr ← Pr – {pj}; F ← F ∪ {fj}
  - if ∃ pk ∈ Pr' which is nonrelevant then
        Pr' ← Pr' – {pk}
        F ← F – {fk}

PHORIZONTAL Algorithm

Makes use of COM_MIN to perform fragmentation.

Input: a relation R and a set of simple predicates Pr
Output: a set of minterm predicates M according to which relation R is to be fragmented

• Pr' ← COM_MIN(R, Pr)
• determine the set M of minterm predicates
• determine the set I of implications among pi ∈ Pr'
• eliminate the contradictory minterms from M

PHF – Example

• Two candidate relations: PAY and PROJ.
• Fragmentation of relation PAY
  - Application: Check the salary info and determine raise.
  - Employee records are kept at two sites ⇒ the application runs at two sites
  - Simple predicates
        p1 : SAL ≤ 30000
        p2 : SAL > 30000
        Pr = {p1, p2}, which is complete and minimal, so Pr' = Pr
  - Minterm predicates
        m1 : (SAL ≤ 30000)
        m2 : NOT(SAL ≤ 30000) = (SAL > 30000)

PHF – Example
• Fragmentation of relation PROJ
  - Applications:
    (1) Find the name and budget of projects given their number.
        Issued at three sites
    (2) Access project information according to budget
        One site accesses ≤ 200000, the other accesses > 200000
  - Simple predicates
    For application (1)
        p1 : LOC = "Montreal"
        p2 : LOC = "New York"
        p3 : LOC = "Paris"
    For application (2)
        p4 : BUDGET ≤ 200000
        p5 : BUDGET > 200000
  - Pr = Pr' = {p1, p2, p3, p4, p5}

PHF – Example

• Fragmentation of relation PROJ continued
  - Minterm fragments left after elimination
        m1 : (LOC = "Montreal") ∧ (BUDGET ≤ 200000)
        m2 : (LOC = "Montreal") ∧ (BUDGET > 200000)
        m3 : (LOC = "New York") ∧ (BUDGET ≤ 200000)
        m4 : (LOC = "New York") ∧ (BUDGET > 200000)
        m5 : (LOC = "Paris") ∧ (BUDGET ≤ 200000)
        m6 : (LOC = "Paris") ∧ (BUDGET > 200000)

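The elimination step can be illustrated with a brute-force sketch. Five simple predicates give 2^5 = 32 candidate minterms, and only the six listed above are satisfiable. Instead of deriving the implication set I symbolically, the code below swaps in a simpler technique: it tests each minterm against an assumed attribute domain (the three LOC values from the example and one representative BUDGET value on each side of the threshold). The names `predicates`, `domain`, and `satisfiable` are hypothetical.

```python
from itertools import product

# Simple predicates p1..p5 from the PROJ example, as functions of a tuple.
predicates = {
    "p1": lambda t: t["LOC"] == "Montreal",
    "p2": lambda t: t["LOC"] == "New York",
    "p3": lambda t: t["LOC"] == "Paris",
    "p4": lambda t: t["BUDGET"] <= 200000,
    "p5": lambda t: t["BUDGET"] > 200000,
}

# Assumed domain: LOC takes only the three listed values; one BUDGET
# representative per side of the threshold is enough to test satisfiability.
domain = [
    {"LOC": loc, "BUDGET": budget}
    for loc in ("Montreal", "New York", "Paris")
    for budget in (200000, 200001)
]


def satisfiable(minterm):
    """True if some tuple in the domain matches every literal of the minterm."""
    return any(
        all(predicates[name](t) == sign for name, sign in minterm.items())
        for t in domain
    )


# A minterm maps each predicate name to True (natural) or False (negated).
all_minterms = [
    dict(zip(predicates, signs))
    for signs in product([True, False], repeat=len(predicates))
]
surviving = [m for m in all_minterms if satisfiable(m)]

print(len(all_minterms), len(surviving))
```

Each surviving minterm has exactly one location predicate and exactly one budget predicate in natural form, which corresponds to m1 through m6 once the redundant negated literals are dropped.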
PHF – Correctness

• Completeness
  - Since Pr' is complete and minimal, the selection predicates are complete.
• Reconstruction
  - If relation R is fragmented into FR = {R1, R2, …, Rr}, then
    \[ R = \bigcup_{R_i \in F_R} R_i \]
• Disjointness
  - Minterm predicates that form the basis of fragmentation should be mutually exclusive.

Derived Horizontal Fragmentation

• Defined on a member relation of a link according to a selection operation specified on its owner.
  - Each link is an equijoin.
  - Equijoin can be implemented by means of semijoins.

DHF – Definition

Given a link L where owner(L) = S and member(L) = R, the derived horizontal fragments of R are defined as
\[ R_i = R \ltimes S_i, \quad 1 \le i \le w \]
where w is the maximum number of fragments that will be defined on R and
\[ S_i = \sigma_{F_i}(S) \]
where Fi is the formula according to which the primary horizontal fragment Si is defined.

DHF – Example

Given link L1 where owner(L1) = SKILL and member(L1) = EMP:
\[ EMP_1 = EMP \ltimes SKILL_1 \]
\[ EMP_2 = EMP \ltimes SKILL_2 \]
where
\[ SKILL_1 = \sigma_{SAL \le 30000}(SKILL) \]
\[ SKILL_2 = \sigma_{SAL > 30000}(SKILL) \]

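A semijoin keeps the member tuples that have a matching owner tuple, without copying any owner attributes. The sketch below assumes that EMP and SKILL are linked through a TITLE attribute; that attribute name, the sample tuples, and the `semijoin` helper are assumptions made for illustration.

```python
# Hypothetical owner and member relations linked by TITLE (an assumption).
SKILL = [
    {"TITLE": "Elect. Eng.", "SAL": 40000},
    {"TITLE": "Syst. Anal.", "SAL": 34000},
    {"TITLE": "Mech. Eng.", "SAL": 27000},
    {"TITLE": "Programmer", "SAL": 24000},
]
EMP = [
    {"ENO": "E1", "ENAME": "J. Doe", "TITLE": "Elect. Eng."},
    {"ENO": "E2", "ENAME": "M. Smith", "TITLE": "Syst. Anal."},
    {"ENO": "E3", "ENAME": "A. Lee", "TITLE": "Mech. Eng."},
    {"ENO": "E4", "ENAME": "J. Miller", "TITLE": "Programmer"},
]


def semijoin(member, owner, attr):
    """Member tuples whose join attribute value occurs in the owner."""
    keys = {t[attr] for t in owner}
    return [t for t in member if t[attr] in keys]


# Primary horizontal fragments of the owner relation.
SKILL1 = [s for s in SKILL if s["SAL"] <= 30000]
SKILL2 = [s for s in SKILL if s["SAL"] > 30000]

# Derived horizontal fragments of the member relation.
EMP1 = semijoin(EMP, SKILL1, "TITLE")
EMP2 = semijoin(EMP, SKILL2, "TITLE")

print([e["ENO"] for e in EMP1], [e["ENO"] for e in EMP2])
```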
DHF – Correctness
• Completeness
  - Referential integrity
  - Let R be the member relation of a link whose owner is relation S, which is fragmented as FS = {S1, S2, ..., Sn}. Furthermore, let A be the join attribute between R and S. Then, for each tuple t of R, there should be a tuple t' of S such that
        t[A] = t'[A]
• Reconstruction
  - Same as primary horizontal fragmentation.
• Disjointness
  - Simple join graphs between the owner and the member fragments.

Vertical Fragmentation

• Has been studied within the centralized context.
  - Design methodology: its motivation within the centralized context is as a design tool, which allows user queries to deal with smaller relations, thus causing a smaller number of page accesses.
  - Physical clustering: it has also been suggested that the most "active" sub-relations can be identified and placed in a faster memory subsystem in those cases where memory hierarchies are supported.

Vertical Fragmentation

• More difficult than horizontal, because more alternatives exist.
• Two approaches:
  - Grouping (attributes to fragments): starts by assigning each attribute to one fragment and, at each step, joins some of the fragments until some criterion is satisfied. It was first suggested for centralized databases and was used later for distributed databases.
  - Splitting: starts with a relation and decides on beneficial partitionings based on the access behavior of applications to the attributes. The technique was also first discussed for centralized database design and later for distributed databases.

Vertical Fragmentation

• Overlapping fragments vs non-overlapping fragments
  - Splitting generates non-overlapping fragments, whereas grouping typically results in overlapping fragments.
  - We do not consider the replicated key attributes to be overlapping.
  - Advantage of non-overlapping fragments: easier to enforce functional dependencies (for integrity checking etc.)

VF – Information Requirements

• Application Information
  - Attribute affinities
    A measure that indicates how closely related the attributes are.
    This is obtained from more primitive usage data.
  - Attribute usage values
    Given a set of queries Q = {q1, q2, …, qq} that will run on the relation R[A1, A2, …, An],
    \[ use(q_i, A_j) = \begin{cases} 1 & \text{if attribute } A_j \text{ is referenced by query } q_i \\ 0 & \text{otherwise} \end{cases} \]
    use(qi, •) can be defined accordingly.

VF – Examples
The following four queries are defined on the relation PROJ. The attribute usage matrix records which attributes each query references.

q1: Find the budget of a project, given its identification number.
    SELECT BUDGET FROM PROJ WHERE PNO=Value
q2: Find the names and budgets of all projects.
    SELECT PNAME, BUDGET FROM PROJ
q3: Find the names of projects located at a given city.
    SELECT PNAME FROM PROJ WHERE LOC=Value
q4: Find the total project budgets for each city.
    SELECT SUM(BUDGET) FROM PROJ WHERE LOC=Value

Attribute Usage Matrix
        PNO  PNAME  BUDGET  LOC
q1       1     0      1      0
q2       0     1      1      0
q3       0     1      0      1
q4       0     0      1      1

VF – Affinity Measure aff(Ai,Aj)

The attribute affinity measure between two attributes Ai and Aj of a relation R[A1, A2, …, An] with respect to the set of applications Q = {q1, q2, …, qq} is defined as follows:

\[ aff(A_i, A_j) = \sum_{\text{all queries that access } A_i \text{ and } A_j} (\text{query access}) \]

\[ \text{query access} = \sum_{\text{all sites}} (\text{access frequency of a query}) \times \frac{\text{access}}{\text{execution}} \]

VF – Calculation of aff(Ai, Aj)

Assume each query in the previous example accesses the attributes once during each execution. Also assume the following access frequencies at sites S1, S2, and S3:

        S1  S2  S3
q1      15  20  10
q2       5   0   0
q3      25  25  25
q4       3   0   0
Attribute access frequencies

Then, for example, aff(PNO, BUDGET) = 15*1 + 20*1 + 10*1 = 45, because q1 is the only query that accesses both attributes. The cost matrix is:

PNO, BUDGET     = 45
PNAME, BUDGET   = 5
PNAME, LOC      = 75
BUDGET, LOC     = 3
Attribute usage cost matrix

VF – Calculation of aff(Ai, Aj)

Using the cost matrix, we can construct the attribute affinity matrix as follows.

PNO, BUDGET     = 45
PNAME, BUDGET   = 5
PNAME, LOC      = 75
BUDGET, LOC     = 3
Attribute usage cost matrix

Each diagonal entry aff(Ai, Ai) sums the accesses of every query that references Ai, e.g. aff(BUDGET, BUDGET) = 45 + 5 + 3 = 53.

          PNO  PNAME  BUDGET  LOC
PNO        45     0     45     0
PNAME       0    80      5    75
BUDGET     45     5     53     3
LOC         0    75      3    78
Attribute affinity matrix

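The affinity matrix can be recomputed from the attribute usage matrix and the access frequencies given in the example. The sketch below assumes one access per execution, as stated in the example; the function and variable names are hypothetical.

```python
attributes = ["PNO", "PNAME", "BUDGET", "LOC"]

# use[q][j] = 1 if query q references attribute j (attribute usage matrix).
use = {
    "q1": [1, 0, 1, 0],
    "q2": [0, 1, 1, 0],
    "q3": [0, 1, 0, 1],
    "q4": [0, 0, 1, 1],
}

# acc[q] = access frequencies of query q at sites S1, S2, S3.
acc = {
    "q1": [15, 20, 10],
    "q2": [5, 0, 0],
    "q3": [25, 25, 25],
    "q4": [3, 0, 0],
}


def affinity_matrix(use, acc, n):
    """aff(Ai, Aj): total accesses, over all sites, of queries using both."""
    aa = [[0] * n for _ in range(n)]
    for q, row in use.items():
        total = sum(acc[q])  # one access per execution, summed over all sites
        for i in range(n):
            for j in range(n):
                if row[i] and row[j]:
                    aa[i][j] += total
    return aa


AA = affinity_matrix(use, acc, len(attributes))
for name, row in zip(attributes, AA):
    print(f"{name:7}", row)
```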
VF – Clustering Algorithm

• The fundamental task in designing a vertical fragmentation algorithm is to find some means of grouping the attributes of a relation based on the attribute affinity values in AA.
• The bond energy algorithm is suitable for this task for the following reasons:
  1. It is designed specifically to determine groups of similar items as opposed to, say, a linear ordering of the items (i.e., it clusters the attributes with larger affinity values together, and the ones with smaller values together).
  2. The final groupings are insensitive to the order in which items are presented to the algorithm.
  3. The computation time of the algorithm is reasonable: O(n²), where n is the number of attributes.
  4. Secondary interrelationships between clustered attribute groups are identifiable.

VF – Clustering Algorithm

• In short, the Bond Energy Algorithm (BEA) has been used for clustering of entities. BEA finds an ordering of entities (in our case attributes) such that the global affinity measure is maximized.

\[ AM = \sum_i \sum_j (\text{affinity of } A_i \text{ and } A_j \text{ with their neighbors}) \]

Bond Energy Algorithm

Input: The AA matrix
Output: The clustered affinity matrix CA, which is a permutation of AA

• Initialization: Place and fix one of the columns of AA in CA.
• Iteration: Place the remaining n-i columns in the remaining i+1 positions in the CA matrix. For each column, choose the placement that makes the most contribution to the global affinity measure.
• Row order: Order the rows according to the column ordering.

Bond Energy Algorithm

"Best" placement? Define the contribution of a placement:

    cont(Ai, Ak, Aj) = 2bond(Ai, Ak) + 2bond(Ak, Aj) – 2bond(Ai, Aj)

where

\[ bond(A_x, A_y) = \sum_{z=1}^{n} aff(A_z, A_x)\, aff(A_z, A_y) \]

For instance, using the attribute affinity matrix of PROJ, the bond between PNO and BUDGET is computed as

bond(PNO, BUDGET) = aff(PNO, PNO)*aff(PNO, BUDGET) +
                    aff(PNAME, PNO)*aff(PNAME, BUDGET) +
                    aff(BUDGET, PNO)*aff(BUDGET, BUDGET) +
                    aff(LOC, PNO)*aff(LOC, BUDGET)

bond(PNO, BUDGET) = 45*45 + 0*5 + 45*53 + 0*3 = 4410

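The bond and contribution formulas translate directly into code. The sketch below hard-codes the attribute affinity matrix of PROJ and treats an empty border position as `None`, whose bond with any attribute is taken to be 0; the helper names and the `None` convention are assumptions for illustration.

```python
ATTRS = ["PNO", "PNAME", "BUDGET", "LOC"]
AA = [
    [45, 0, 45, 0],
    [0, 80, 5, 75],
    [45, 5, 53, 3],
    [0, 75, 3, 78],
]
IDX = {a: i for i, a in enumerate(ATTRS)}


def bond(x, y):
    """Sum over z of aff(Az, Ax) * aff(Az, Ay); 0 if either side is empty."""
    if x is None or y is None:
        return 0
    return sum(AA[z][IDX[x]] * AA[z][IDX[y]] for z in range(len(ATTRS)))


def cont(left, mid, right):
    """Contribution of placing `mid` between `left` and `right`."""
    return 2 * bond(left, mid) + 2 * bond(mid, right) - 2 * bond(left, right)


print(bond("PNO", "BUDGET"))           # 4410
print(cont("PNO", "BUDGET", "PNAME"))  # 10150
```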
BEA – Complete Example
Consider the following AA matrix and the corresponding CA matrix where PNO and PNAME have been placed. Place BUDGET:

AA matrix
          PNO  PNAME  BUDGET  LOC
PNO        45     0     45     0
PNAME       0    80      5    75
BUDGET     45     5     53     3
LOC         0    75      3    78

CA matrix (PNO and PNAME placed)
          PNO  PNAME
PNO        45     0
PNAME       0    80
BUDGET     45     5
LOC         0    75

The attributes are numbered 1 (PNO), 2 (PNAME), 3 (BUDGET), and 4 (LOC). Index 0 denotes the empty position to the left of PNO. In ordering (2-3-4), index 4 denotes the empty position to the right of PNAME in CA, not the attribute LOC, which has not been placed yet. A bond with an empty position is 0.

Ordering (0-3-1):
    cont(A0, BUDGET, PNO) = 2bond(A0, BUDGET) + 2bond(BUDGET, PNO) – 2bond(A0, PNO)
                          = 8820
Ordering (1-3-2):
    cont(PNO, BUDGET, PNAME) = 10150
Ordering (2-3-4):
    cont(PNAME, BUDGET, A4) = 1780

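BEA – Example

• Ordering (1-3-2) makes the largest contribution, so BUDGET is placed between PNO and PNAME. Therefore, the CA matrix has the form

          PNO  BUDGET  PNAME
PNO        45    45      0
PNAME       0     5     80
BUDGET     45    53      5
LOC         0     3     75

• When LOC is placed, the final form of the CA matrix (after row organization) is

          PNO  BUDGET  PNAME  LOC
PNO        45    45      0     0
BUDGET     45    53      5     3
PNAME       0     5     80    75
LOC         0     3     75    78
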
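The placement loop can be sketched end to end. The code below is a simplified illustration rather than the textbook's exact procedure: it starts from the two columns already placed in the example (PNO and PNAME), inserts each remaining attribute at the position with the largest contribution, and then orders the rows like the columns. All names are hypothetical.

```python
ATTRS = ["PNO", "PNAME", "BUDGET", "LOC"]
AA = [
    [45, 0, 45, 0],
    [0, 80, 5, 75],
    [45, 5, 53, 3],
    [0, 75, 3, 78],
]
IDX = {a: i for i, a in enumerate(ATTRS)}


def bond(x, y):
    """Bond between two attributes; an empty position (None) bonds with 0."""
    if x is None or y is None:
        return 0
    return sum(AA[z][IDX[x]] * AA[z][IDX[y]] for z in range(len(ATTRS)))


def cont(left, mid, right):
    """Contribution of placing `mid` between `left` and `right`."""
    return 2 * bond(left, mid) + 2 * bond(mid, right) - 2 * bond(left, right)


def bea_order(initial):
    """Insert each unplaced attribute where its contribution is largest."""
    order = list(initial)
    for attr in ATTRS:
        if attr in order:
            continue
        best_pos, best_val = 0, None
        for pos in range(len(order) + 1):
            left = order[pos - 1] if pos > 0 else None
            right = order[pos] if pos < len(order) else None
            val = cont(left, attr, right)
            if best_val is None or val > best_val:
                best_pos, best_val = pos, val
        order.insert(best_pos, attr)
    return order


order = bea_order(["PNO", "PNAME"])
# Clustered affinity matrix: rows and columns of AA permuted into `order`.
CA = [[AA[IDX[r]][IDX[c]] for c in order] for r in order]

print(order)
for name, row in zip(order, CA):
    print(f"{name:7}", row)
```

The larger affinity values end up in two blocks along the diagonal, one for PNO and BUDGET and one for PNAME and LOC.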
VF – Algorithm: Partitioning Algorithm

How can you divide a set of clustered attributes {A1, A2, …, An} into two (or more) sets {A1, A2, …, Ai} and {Ai+1, …, An} such that there are no (or minimal) applications that access both (or more than one) of the sets?

The objective of the splitting activity is to find sets of attributes that are accessed solely, or for the most part, by distinct sets of applications.

[Figure: Locating a Splitting Point]

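One way to locate a splitting point is to score every position along the clustered ordering. The slide does not state an objective function, so the sketch below assumes the one used by the textbook's partitioning algorithm, z = CTQ × CBQ − COQ², where CTQ, CBQ, and COQ are the total access counts of the queries that use only the top set, only the bottom set, and both sets. The attribute ordering, query attribute sets, and total frequencies are taken from the PROJ example; the names are hypothetical.

```python
# Clustered attribute ordering from the BEA example.
order = ["PNO", "BUDGET", "PNAME", "LOC"]

# Attributes referenced by each query and its total access frequency
# over all sites (q1: 15+20+10, q2: 5, q3: 25+25+25, q4: 3).
queries = {
    "q1": ({"PNO", "BUDGET"}, 45),
    "q2": ({"PNAME", "BUDGET"}, 5),
    "q3": ({"PNAME", "LOC"}, 75),
    "q4": ({"BUDGET", "LOC"}, 3),
}


def split_quality(top, bottom):
    """z = CTQ * CBQ - COQ^2 for a candidate split (assumed objective)."""
    ctq = sum(f for attrs, f in queries.values() if attrs <= top)
    cbq = sum(f for attrs, f in queries.values() if attrs <= bottom)
    coq = sum(f for attrs, f in queries.values() if attrs & top and attrs & bottom)
    return ctq * cbq - coq ** 2


# Score each split position i: top = first i attributes, bottom = the rest.
scores = {
    i: split_quality(set(order[:i]), set(order[i:]))
    for i in range(1, len(order))
}
best = max(scores, key=scores.get)

print(scores)
print(order[:best], order[best:])
```

Under these assumptions the best split separates {PNO, BUDGET} from {PNAME, LOC}. The key attribute would then be replicated into the second fragment so that the relation can be reconstructed.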
VF – Algorithm

Two problems:
• Cluster forming in the middle of the CA matrix
  - Shift a row up and a column left and apply the algorithm to find the "best" partitioning point
  - Do this for all possible shifts
  - Cost O(m²)
• More than two clusters
  - m-way partitioning
  - try 1, 2, …, m–1 split points along the diagonal and try to find the best point for each of these
  - Cost O(2^m)

VF – Correctness

A relation R, defined over attribute set A and key K, generates the vertical partitioning FR = {R1, R2, …, Rr}.
• Completeness
  - The following should be true for A:
    \[ A = \bigcup A_{R_i} \]
• Reconstruction
  - Reconstruction can be achieved by
    \[ R = \bowtie_K R_i, \quad \forall R_i \in F_R \]
• Disjointness
  - TIDs are not considered to be overlapping since they are maintained by the system
  - Duplicated keys are not considered to be overlapping

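Reconstruction by joining on the key can be illustrated with a small sketch. The PROJ tuples, the `project` and `join_on_key` helpers, and the choice of fragments ({PNO, BUDGET} and {PNO, PNAME, LOC}) are assumptions for the example; the key PNO is duplicated in both fragments, which does not count as overlap.

```python
# Hypothetical PROJ tuples; PNO is the key.
PROJ = [
    {"PNO": "P1", "PNAME": "Instrumentation", "BUDGET": 150000, "LOC": "Montreal"},
    {"PNO": "P2", "PNAME": "Database Develop.", "BUDGET": 135000, "LOC": "New York"},
    {"PNO": "P3", "PNAME": "CAD/CAM", "BUDGET": 250000, "LOC": "New York"},
]


def project(relation, attrs):
    """Relational projection onto the given attributes."""
    return [{a: t[a] for a in attrs} for t in relation]


def join_on_key(r1, r2, key):
    """Join two vertical fragments on their shared key attribute."""
    index = {t[key]: t for t in r2}
    return [{**t, **index[t[key]]} for t in r1 if t[key] in index]


# Vertical fragments; the key is replicated in both.
PROJ1 = project(PROJ, ["PNO", "BUDGET"])
PROJ2 = project(PROJ, ["PNO", "PNAME", "LOC"])

rebuilt = join_on_key(PROJ1, PROJ2, "PNO")
print(rebuilt == PROJ)
```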
VF – Algorithm

Other approaches to VF, apart from attribute affinity and bond energy, are:
• Association Rule Mining:
  - Association rule mining is a data mining technique used to discover interesting patterns, associations, or relationships among variables in large datasets.
  - In the context of database query analysis, association rule mining can be applied to identify frequent itemsets, which are sets of columns that often appear together in queries.
  - Algorithms such as Apriori and FP-Growth can be used to mine association rules from query logs or query execution histories.
• Query Log Analysis:
  - Analyzing query logs or query execution histories provides valuable insights into the usage patterns of database columns.
  - By examining the frequency of column references in queries, you can identify which columns are commonly accessed together and which queries are frequently executed.
  - Techniques such as frequency analysis, sequence analysis, and clustering can be applied to query logs to uncover patterns of column usage.
• Correlation analysis:
  - Correlation analysis measures the statistical relationship between pairs of columns.
  - By calculating correlation coefficients such as Pearson or Spearman on the usage of columns across queries, you can identify columns whose usage is positively or negatively correlated.
  - A high positive correlation between the usage of two columns suggests that they are frequently accessed together in queries.
• Dimensionality Reduction:
  - Dimensionality reduction techniques such as principal component analysis (PCA) or singular value decomposition (SVD) can be applied to query matrices representing the usage of columns in queries.
  - These techniques help identify latent factors or patterns in the data that explain the variability in query patterns and column usage.
• Graph-based Analysis:
  - Representing queries and column relationships as a graph enables the application of graph-based analysis techniques.
  - Algorithms such as community detection, centrality analysis, and graph clustering can be applied to identify groups of columns that are tightly connected or frequently accessed together in queries.
• Statistical Hypothesis Testing:
  - Statistical hypothesis testing techniques can be used to assess the significance of relationships between columns based on their usage in queries.
  - Methods such as chi-square tests, t-tests, or ANOVA tests can help determine whether the co-occurrence of columns in queries is statistically significant.

Hybrid Fragmentation
In most cases a simple horizontal or vertical fragmentation of a database schema will not be sufficient to satisfy the requirements of user applications. In this case a vertical fragmentation may be followed by a horizontal one, or vice versa, producing a tree-structured partitioning.

Reconstruction of HF

Fragment Allocation
• Problem Statement
  Given
      F = {F1, F2, …, Fn}   fragments
      S = {S1, S2, …, Sm}   network sites
      Q = {q1, q2, …, qq}   applications
  find the "optimal" distribution of F to S.
• Optimality
  - Minimal cost
    Communication + storage + processing (read & update)
    Cost in terms of time (usually)
  - Performance
    Response time and/or throughput
  - Constraints
    Per-site constraints (storage & processing)

Information Requirements
• Database information
  - selectivity of fragments
  - size of a fragment
• Application information
  - access types and numbers
  - access localities
• Communication network information
  - bandwidth
  - latency
  - communication overhead
• Computer system information
  - unit cost of storing data at a site
  - unit cost of processing at a site

Allocation

File Allocation (FAP) vs Database Allocation (DAP):
• Fragments are not individual files
  - relationships have to be maintained
• Access to databases is more complicated
  - remote file access model not applicable
  - relationship between allocation and query processing
• Cost of integrity enforcement should be considered
• Cost of concurrency control should be considered

Allocation Model

General Form
    min(Total Cost)
    subject to
        response time constraint
        storage constraint
        processing constraint

Decision Variable
\[ x_{ij} = \begin{cases} 1 & \text{if fragment } F_i \text{ is stored at site } S_j \\ 0 & \text{otherwise} \end{cases} \]

Allocation Model

• Total Cost
  \[ \sum_{\text{all queries}} \text{query processing cost} + \sum_{\text{all sites}} \sum_{\text{all fragments}} \text{cost of storing a fragment at a site} \]
• Storage Cost (of fragment Fj at Sk)
  \[ (\text{unit storage cost at } S_k) \times (\text{size of } F_j) \times x_{jk} \]
• Query Processing Cost (for one query)
  processing component + transmission component

Allocation Model

• Query Processing Cost
  Processing component
      access cost + integrity enforcement cost + concurrency control cost
  - Access cost
    \[ \sum_{\text{all sites}} \sum_{\text{all fragments}} (\text{no. of update accesses} + \text{no. of read accesses}) \times x_{ij} \times \text{local processing cost at a site} \]
  - Integrity enforcement and concurrency control costs
    Can be similarly calculated

Allocation Model

• Query Processing Cost
  Transmission component
      cost of processing updates + cost of processing retrievals
  - Cost of updates
    \[ \sum_{\text{all sites}} \sum_{\text{all fragments}} \text{update message cost} + \sum_{\text{all sites}} \sum_{\text{all fragments}} \text{acknowledgment cost} \]
  - Retrieval Cost
    \[ \sum_{\text{all fragments}} \min_{\text{all sites}} (\text{cost of retrieval command} + \text{cost of sending back the result}) \]

Allocation Model

• Constraints
  - Response Time
    execution time of query ≤ max. allowable response time for that query
  - Storage Constraint (for a site)
    \[ \sum_{\text{all fragments}} \text{storage requirement of a fragment at that site} \le \text{storage capacity at that site} \]
  - Processing constraint (for a site)
    \[ \sum_{\text{all queries}} \text{processing load of a query at that site} \le \text{processing capacity of that site} \]

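For a very small instance, the model can be explored by exhaustive search. The sketch below is a deliberately simplified version of the allocation model: it considers only non-replicated allocations, counts only storage cost plus a flat cost per remote access, and enforces only the storage constraint. Every number and name in it is hypothetical.

```python
from itertools import product

# Hypothetical fragment sizes and site parameters.
fragments = {"F1": 100, "F2": 300}
sites = {
    "S1": {"unit_storage": 1.0, "capacity": 350},
    "S2": {"unit_storage": 0.5, "capacity": 350},
}
# Hypothetical number of accesses issued at each site to each fragment.
accesses = {
    "S1": {"F1": 50, "F2": 5},
    "S2": {"F1": 5, "F2": 40},
}
REMOTE_COST = 2.0  # hypothetical cost of one remote access


def total_cost(assignment):
    """Storage cost plus the cost of every access to a non-local fragment."""
    storage = sum(sites[s]["unit_storage"] * fragments[f] for f, s in assignment.items())
    remote = sum(
        n * REMOTE_COST
        for s, per_fragment in accesses.items()
        for f, n in per_fragment.items()
        if assignment[f] != s
    )
    return storage + remote


def feasible(assignment):
    """Storage constraint: fragments stored at a site must fit its capacity."""
    for s, info in sites.items():
        used = sum(fragments[f] for f, at in assignment.items() if at == s)
        if used > info["capacity"]:
            return False
    return True


# Enumerate every non-replicated assignment of fragments to sites.
candidates = [
    dict(zip(fragments, placement))
    for placement in product(sites, repeat=len(fragments))
]
best = min((a for a in candidates if feasible(a)), key=total_cost)

print(best, total_cost(best))
```

The number of candidate assignments grows exponentially with the number of fragments, so exhaustive search is only workable for toy instances like this one.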
Allocation Model

• Solution Methods
  - FAP is NP-complete
  - DAP is also NP-complete
  - Heuristics based on
    single commodity warehouse location (for FAP)
    knapsack problem
    branch and bound techniques
    network flow

Allocation Model

• Attempts to reduce the solution space
  - assume all candidate partitionings known; select the "best" partitioning
  - ignore replication at first
  - sliding window on fragments

Individual exercise

Question 1

a. Define vertical fragmentation and explain its significance in distributed database management.

b. Discuss the advantages and challenges associated with vertical fragmentation compared to other fragmentation techniques, such as horizontal fragmentation.

c. Explore real-world scenarios or application domains where vertical fragmentation can provide significant benefits in terms of data management, query optimization, and scalability.

Individual exercise

Question 2: Consider fragment design

a. Select a specific scenario or application domain (e.g., e-commerce, healthcare, finance) for which vertical fragmentation is suitable.

b. Identify a relation schema relevant to the chosen scenario and propose a vertical fragmentation strategy based on the attributes of the relation.

c. Justify your fragmentation strategy by analyzing the specific requirements, access patterns, and scalability considerations of the chosen scenario.
