Distributed Data Systems: Sesapzg554

This document discusses vertical fragmentation of distributed databases. It describes attribute usage matrices that show how attributes are used in queries, and affinity matrices that represent the relationships between attributes. It then introduces the bond energy algorithm, which takes an affinity matrix as input and outputs a clustered affinity matrix by reordering attributes to maximize affinity between neighboring attributes. This algorithm is used to help determine how to vertically fragment a relation into subsets of its attributes. Examples are provided to illustrate attribute usage matrices, affinity matrices, and applying the bond energy algorithm.

Uploaded by

Home TV

Distributed Data Systems

SESAPZG554

BITS Pilani
Pilani|Dubai|Goa|Hyderabad
Parthasarathy


SESAPZG554 – CS#5
Distributed Database Design
Issues & Integration Part 2
Agenda for CS #5

1) Recap of Sessions
2) Vertical Fragmentation
   • Attribute Usage Matrix (AU Matrix)
   • Attribute Affinity Matrix (AA Matrix)
   • Clustered Affinity Matrix (CA Matrix) using the Bond Energy Algorithm
3) Hybrid Fragmentation
4) Allocation
5) Bottom-up Design Methodology
6) Schema Matching, Integration & Mapping
7) Data Cleaning
8) Portions for Mid-Semester Examination (EC2)

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Vertical Fragmentation

• A vertical fragmentation of a relation R produces fragments R1, R2, ..., Rr, each of which contains a subset of R's attributes as well as the primary key of R.
• The objective of vertical fragmentation is to partition a relation into a set of smaller relations so that many of the user applications will run on only one fragment.
• In this context, an "optimal" fragmentation is one that minimizes the execution time of the user applications that run on these fragments.




Vertical Fragmentation Example

• The PROJ relation is partitioned vertically into two sub-relations, PROJ1 and PROJ2.
• PROJ1: information about project budgets
• PROJ2: information about project names and locations
• Note that the primary key of the relation (PNO) is included in both fragments.
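The PROJ example can be sketched in a few lines of Python. This is only an illustration: the attribute names PNAME, BUDGET and LOC, the sample tuples, and the helper names `vertical_fragment` and `reconstruct` are assumptions, not taken from the slides. It shows why the key PNO is repeated in both fragments: it is what lets a join rebuild the original relation.

```python
# Hypothetical PROJ tuples; every attribute name except PNO is an assumption.
PROJ = [
    {"PNO": "P1", "PNAME": "Alpha", "BUDGET": 150000, "LOC": "Pilani"},
    {"PNO": "P2", "PNAME": "Beta", "BUDGET": 135000, "LOC": "Goa"},
]


def vertical_fragment(relation, key, attributes):
    """Project every tuple onto the key plus the chosen attributes."""
    return [{a: row[a] for a in (key, *attributes)} for row in relation]


def reconstruct(frag_a, frag_b, key):
    """Rebuild the relation by joining two fragments on the shared key."""
    index = {row[key]: row for row in frag_b}
    return [{**row, **index[row[key]]} for row in frag_a]


PROJ1 = vertical_fragment(PROJ, "PNO", ["BUDGET"])        # budgets
PROJ2 = vertical_fragment(PROJ, "PNO", ["PNAME", "LOC"])  # names and locations
rebuilt = reconstruct(PROJ1, PROJ2, "PNO")
print(rebuilt == PROJ)
```

An application that only reads budgets touches PROJ1 alone, which is the stated objective of vertical fragmentation.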




Vertical Fragmentation – Attribute Usage Matrix

[Figure: Attribute Usage Matrix]


Vertical Fragmentation – Affinity Measure


Vertical Fragmentation – Affinity Matrix Example

[Figure: Attribute Usage Matrix]
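The matrices on these slides were figures and did not survive extraction, so the sketch below uses assumed values. It takes the usual definition of affinity (the affinity of two attributes is the total access frequency of the queries that reference both) and a hypothetical usage matrix `use` and frequency vector `acc`. These inputs were chosen because the AA matrix they produce is consistent with the bond values quoted in the worked example (4410, 890, 225, and so on); they are not confirmed to be the lecturer's figures.

```python
# Assumed attribute usage matrix: use[q][j] == 1 if query q references A(j+1).
use = [
    [1, 0, 1, 0],  # q1 uses A1, A3
    [0, 1, 1, 0],  # q2 uses A2, A3
    [0, 1, 0, 1],  # q3 uses A2, A4
    [0, 0, 1, 1],  # q4 uses A3, A4
]
# Assumed access frequency of each query, already summed over all sites.
acc = [45, 5, 75, 3]


def affinity_matrix(use, acc):
    """aff(Ai, Aj) = sum of acc(q) over the queries q that use both Ai and Aj."""
    n = len(use[0])
    aa = [[0] * n for _ in range(n)]
    for row, freq in zip(use, acc):
        for i in range(n):
            for j in range(n):
                if row[i] and row[j]:
                    aa[i][j] += freq
    return aa


AA = affinity_matrix(use, acc)
for row in AA:
    print(row)
```

The result is symmetric, and each diagonal entry is the total frequency of the queries that touch that attribute.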


CA Matrix using Bond Energy Algorithm

• Take the attribute affinity matrix AA and reorganize the attribute order to form clusters in which the attributes demonstrate high affinity to one another.
• The Bond Energy Algorithm (BEA) has been used for clustering of entities. BEA finds an ordering of entities (in our case, attributes) such that the global affinity measure

$$AM = \sum_{i} \sum_{j} (\text{affinity of } A_i \text{ and } A_j \text{ with their neighbors})$$

is maximized.


Bond Energy Algorithm

• Input: The AA matrix
• Output: The clustered affinity matrix CA, which is a perturbation of AA
• Step 01: Initialization: Place and fix two of the columns of AA in CA.
• Step 02: Iteration: Place the remaining n-i columns in the remaining i+1 positions in the CA matrix. For each column, choose the placement that makes the largest contribution to the global affinity measure.
• Step 03: Row order: Order the rows according to the column ordering.


Bond Energy Algorithm

• "Best" placement? Define the contribution of a placement:

$$cont(A_i, A_k, A_j) = 2\,bond(A_i, A_k) + 2\,bond(A_k, A_j) - 2\,bond(A_i, A_j)$$

• where

$$bond(A_x, A_y) = \sum_{z=1}^{n} aff(A_z, A_x)\, aff(A_z, A_y)$$
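The two definitions translate almost directly into code. The sketch below is a minimal illustration: the AA matrix is an assumed one (the slide figure is missing), chosen because it is consistent with the bond values used in the worked example, and representing a pseudo column as `None` is a convention of this sketch only.

```python
# Assumed AA matrix (rows/columns A1..A4); the original figure was not recovered.
AA = [
    [45, 0, 45, 0],
    [0, 80, 5, 75],
    [45, 5, 53, 3],
    [0, 75, 3, 78],
]


def bond(aa, x, y):
    """Sum over z of aff(Az, Ax) * aff(Az, Ay); x and y are 1-based.

    None stands for a pseudo column, whose entries are all 0.
    """
    if x is None or y is None:
        return 0
    return sum(row[x - 1] * row[y - 1] for row in aa)


def cont(aa, i, k, j):
    """Contribution of placing column Ak between columns Ai and Aj."""
    return 2 * bond(aa, i, k) + 2 * bond(aa, k, j) - 2 * bond(aa, i, j)


print(bond(AA, 1, 3))     # 4410
print(cont(AA, 1, 3, 2))  # 10150
```

The subtracted term removes the bond between Ai and Aj that is broken when Ak is inserted between them.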


Bond Energy Algorithm Example

• Consider the following AA matrix (which we built previously) and the corresponding CA matrix where A1 and A2 have been placed.
• Step 01: [Figure: AA matrix and the initial CA matrix with A1 and A2 placed]
• Now we need to place A3, but where?
• We have three options:
   o A3, A1, A2
   o A1, A3, A2
   o A1, A2, A3


Bond Energy Algorithm Example

• Bond contribution formula:
  cont(Ai, Ak, Aj) = 2bond(Ai, Ak) + 2bond(Ak, Aj) - 2bond(Ai, Aj)
• where Ak is the column under consideration, Ai is the column to its left, and Aj is the column to its right.
• Pseudo columns A0 and An can be added when needed; all their values are 0.
• Step 02: Our options are therefore transformed as follows. For each option, apply BEA:
   o A0, A3, A1 (A3 is 1st column)
   o A1, A3, A2 (A3 is 2nd column)
   o A2, A3, An (A3 is last column)


Bond Energy Algorithm Example

Consider (A0, A3, A1):
By BEA: cont(A0, A3, A1) = 2bond(A0, A3) + 2bond(A3, A1) - 2bond(A0, A1)
(The column values are from the AA matrix; A0 is a pseudo column.)
cont(A0, A3, A1) = 2(0) + 2(4410) - 2(0) = 8820


Bond Energy Algorithm Example

Consider (A1, A3, A2):
By BEA: cont(A1, A3, A2) = 2bond(A1, A3) + 2bond(A3, A2) - 2bond(A1, A2)
(The column values are from the AA matrix.)
cont(A1, A3, A2) = 2(4410) + 2(890) - 2(225) = 10150


Bond Energy Algorithm Example

Consider (A2, A3, An):
By BEA: cont(A2, A3, An) = 2bond(A2, A3) + 2bond(A3, An) - 2bond(A2, An)
(The column values are from the AA matrix; An is a pseudo column.)
cont(A2, A3, An) = 2(890) + 2(0) - 2(0) = 1780


Bond Energy Algorithm Example

• Having applied BEA to each of the options, the results are:
   o cont(A0, A3, A1) = 8820
   o cont(A1, A3, A2) = 10150
   o cont(A2, A3, An) = 1780
• Since the bond contribution is highest when A3 is placed between A1 and A2, we place A3 between them. The column order of the CA matrix is now A1, A3, A2.


Bond Energy Algorithm Example

• The next column to add to the CA matrix is A4, but again the question is where to add it.
• We have four options:
   o A0, A4, A1 (A4 is 1st column)
   o A1, A4, A3 (A4 is 2nd column)
   o A3, A4, A2 (A4 is 3rd column)
   o A2, A4, An (A4 is last column)
• Again we need to apply BEA to each of the options and decide where to place A4.


Bond Energy Algorithm Example

Consider (A0, A4, A1):
cont(A0, A4, A1) = 2bond(A0, A4) + 2bond(A4, A1) - 2bond(A0, A1)
= 2(0) + 2(135) - 2(0) = 270
Consider (A3, A4, A2):
cont(A3, A4, A2) = 2bond(A3, A4) + 2bond(A4, A2) - 2bond(A3, A2)
= 2(768) + 2(11865) - 2(890) = 23486
Consider (A1, A4, A3):
cont(A1, A4, A3) = 2bond(A1, A4) + 2bond(A4, A3) - 2bond(A1, A3)
= 2(135) + 2(768) - 2(4410) = -7014
Consider (A2, A4, An):
cont(A2, A4, An) = 2bond(A2, A4) + 2bond(A4, An) - 2bond(A2, An)
= 2(11865) + 2(0) - 2(0) = 23730


Bond Energy Algorithm Example

• From the above bond contribution values, it is evident that cont(A2, A4, An) has the maximum value; hence A4 must be placed as the last column in the CA matrix.
• Step 03: Rearrange the matrix rows to match the column order.
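The three steps of the example can be put together in one short routine. This is a sketch under assumptions rather than a reference implementation: the AA matrix is an assumed one that is consistent with the bond values quoted in the example, the first two columns are simply fixed as A1 and A2 as in the slides, and the names `bea`, `bond` and `cont` are chosen for this illustration.

```python
# Assumed AA matrix (rows/columns A1..A4); the slide figure was not recovered.
AA = [
    [45, 0, 45, 0],
    [0, 80, 5, 75],
    [45, 5, 53, 3],
    [0, 75, 3, 78],
]


def bond(aa, x, y):
    """Sum over z of aff(Az, Ax) * aff(Az, Ay); None is a zero pseudo column."""
    if x is None or y is None:
        return 0
    return sum(row[x - 1] * row[y - 1] for row in aa)


def cont(aa, i, k, j):
    """Contribution of placing Ak between Ai and Aj."""
    return 2 * bond(aa, i, k) + 2 * bond(aa, k, j) - 2 * bond(aa, i, j)


def bea(aa):
    """Return the clustered column order (1-based) and the CA matrix."""
    n = len(aa)
    order = [1, 2]                       # Step 01: fix the first two columns
    for k in range(3, n + 1):            # Step 02: place each remaining column
        best_pos, best_gain = 0, None
        for pos in range(len(order) + 1):
            left = order[pos - 1] if pos > 0 else None
            right = order[pos] if pos < len(order) else None
            gain = cont(aa, left, k, right)
            if best_gain is None or gain > best_gain:
                best_pos, best_gain = pos, gain
        order.insert(best_pos, k)
    # Step 03: order the rows the same way as the columns
    ca = [[aa[r - 1][c - 1] for c in order] for r in order]
    return order, ca


order, CA = bea(AA)
print(order)  # [1, 3, 2, 4]
for row in CA:
    print(row)
```

With these inputs the large values gather in the top-left and bottom-right blocks of CA, which is the visual cue used to pick the fragments.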


Bond Energy Algorithm Example

• Clearly, there is a cluster as shown below:

[Figure: clustered affinity matrix]

• Hence, we vertically fragment the relation keeping A1 and A3 together and A2 and A4 together.




Hybrid Fragmentation

• In most cases a simple horizontal or vertical fragmentation of a database schema will not be sufficient to satisfy the requirements of user applications.
• In this case a vertical fragmentation may be followed by a horizontal one, or vice versa, producing a tree-structured partitioning.
• Since the two types of partitioning strategies are applied one after the other, this alternative is called hybrid fragmentation.
• It has also been named mixed fragmentation or nested fragmentation.


Allocation

• The allocation of resources across the nodes of a computer network is an old problem that has been studied extensively.
• Most of this work, however, does not address the problem of distributed database design, but rather that of placing individual files on a computer network.
• Assume that there is a set of fragments F = {F1, F2, ..., Fn} and a distributed system consisting of sites S = {S1, S2, ..., Sm} on which a set of applications Q = {q1, q2, ..., qn} is running.
• The allocation problem involves finding the "optimal" distribution of F to S.
• Optimality can be defined with respect to two measures:
   o Minimal cost
   o Performance


Allocation Problems

Minimal cost
• The cost function consists of the cost of storing each Fi at a site Sj, the cost of querying Fi at site Sj, the cost of updating Fi at all sites where it is stored, and the cost of data communication.
• The allocation problem, then, attempts to find an allocation scheme that minimizes a combined cost function.
• Why are such models not available?
Performance
• The allocation strategy is designed to maintain a performance metric.
• Two well-known ones are to minimize the response time and to maximize the system throughput at each site.
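To make the minimal-cost view concrete, here is a toy sketch of a non-replicated allocation found by exhaustive search. Everything in it is hypothetical: the fragments, sites, storage costs, access counts, the flat `REMOTE_COST`, and the simplification that local accesses are free and updates are ignored. It is not a model from the lecture, only an illustration of "minimize a combined cost function".

```python
from itertools import product

fragments = ["F1", "F2"]
sites = ["S1", "S2"]

# Hypothetical cost of storing one fragment at each site.
storage_cost = {"S1": 1.0, "S2": 2.0}
# Hypothetical access counts: (site issuing the queries, fragment) -> frequency.
accesses = {("S1", "F1"): 10, ("S2", "F1"): 2, ("S1", "F2"): 1, ("S2", "F2"): 8}
REMOTE_COST = 1.0  # assumed cost of one remote access; local access is free


def total_cost(assignment):
    """Storage cost plus communication cost of all remote accesses."""
    cost = sum(storage_cost[site] for site in assignment.values())
    for (origin, frag), freq in accesses.items():
        if assignment[frag] != origin:
            cost += freq * REMOTE_COST
    return cost


def best_allocation():
    """Try every way of placing each fragment at exactly one site."""
    candidates = (
        dict(zip(fragments, combo))
        for combo in product(sites, repeat=len(fragments))
    )
    return min(candidates, key=total_cost)


best = best_allocation()
print(best, total_cost(best))
```

Even in this simplified form the search space has m^n candidates for n fragments and m sites, and allowing replication enlarges it further, so exhaustive search is only workable for toy inputs.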


Allocation Problems

• It is apparent that the "optimality" measure should include both the performance and the cost factors.
• In other words, one should be looking for an allocation scheme that, for example, answers user queries in minimal time while keeping the cost of processing minimal.
• A similar statement can be made for throughput maximization.


Information Requirements

• It is at the allocation stage that we need quantitative data about:
   o The database
   o The applications that run on it
   o The communication network
   o The processing capabilities of each site on the network
   o The storage limitations of each site on the network


Quick Break !!



Bottom-Up Design Methodology

• Top-down distributed database design is suitable for tightly integrated, homogeneous distributed DBMSs.
• Bottom-up design is appropriate in multidatabase systems.
• In this case, a number of databases already exist, and the design task involves integrating them into one database.
• Bottom-up design involves the process by which information from participating databases can be (physically or logically) integrated to form a single cohesive multidatabase.
• There are two alternative approaches. In some cases, the global conceptual (or mediated) schema (GCS) is defined first, in which case the bottom-up design involves mapping the local conceptual schemas (LCSs) to this schema.


Bottom-Up Design Methodology

• In other cases, the GCS is defined as an integration of parts of LCSs.
• In this case, the bottom-up design involves both the generation of the GCS and the mapping of individual LCSs to this GCS.
• If the GCS is defined up-front, the relationship between the GCS and the local conceptual schemas (LCSs) can be of two fundamental types:
   o Local-as-view, and
   o Global-as-view
• In local-as-view (LAV) systems, the GCS definition exists, and each LCS is treated as a view definition over it.
• In global-as-view (GAV) systems, on the other hand, the GCS is defined as a set of views over the LCSs.
• These views indicate how the elements of the GCS can be derived, when needed, from the elements of LCSs.
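A small sketch may help fix the GAV idea. It assumes two hypothetical local databases with different attribute names and defines one global relation, here called EMPLOYEE(name, city), as a view over them; the data, the relation, and the function names are all invented for illustration.

```python
# Two hypothetical local databases (LCSs) with different local schemas.
lcs_hr = [{"emp_name": "Asha", "office": "Pilani"}]
lcs_payroll = [{"full_name": "Ravi", "location": "Goa"}]


def global_employee():
    """GAV: the global relation EMPLOYEE(name, city) is a view over the LCSs."""
    from_hr = [{"name": r["emp_name"], "city": r["office"]} for r in lcs_hr]
    from_payroll = [
        {"name": r["full_name"], "city": r["location"]} for r in lcs_payroll
    ]
    return from_hr + from_payroll


def employees_in(city):
    """A query posed on the GCS is answered by unfolding the view definition."""
    return [e["name"] for e in global_employee() if e["city"] == city]


print(employees_in("Goa"))  # ['Ravi']
```

In a LAV design the direction is reversed: each local source would be described as a view over the global relation, and answering a global query requires rewriting it in terms of those source descriptions.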




Bottom-Up Design Methodology

• Bottom-up design occurs in two general steps:
   o Schema translation (or simply translation), and
   o Schema generation.
• In the first step, the component database schemas are translated to a common intermediate canonical representation.
• In the second step of bottom-up design, the intermediate schemas are used to generate a GCS.


Schema Translation

• In the first step, the component database schemas are translated to a common intermediate canonical representation.
• The use of a canonical representation facilitates the translation process by reducing the number of translators that need to be written.
• The choice of the canonical model is important: it should be sufficiently expressive to incorporate the concepts available in all the databases that will later be integrated.
• Alternatives that have been used include the entity-relationship model, the object-oriented model, or a graph that may be simplified to a tree.
• The translation step is necessary only if the component databases are heterogeneous and local schemas are defined using different data models.


Schema Generation

• In the second step of bottom-up design, the intermediate schemas are used to generate a GCS. In some methodologies, local external (or export) schemas are considered for integration rather than full database schemas.
• Local systems may only be willing to contribute some of their data to the multidatabase.
• The schema generation process consists of the following steps:
   o Schema matching to determine the syntactic and semantic correspondences among the translated LCS elements or between individual LCS elements and the pre-defined GCS elements
   o Integration of the common schema elements into a global conceptual (mediated) schema if one has not yet been defined
   o Schema mapping that determines how to map the elements of each LCS to the elements of the GCS
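As a rough illustration of the syntactic side of schema matching, the sketch below pairs local attribute names with global ones by string similarity, using `difflib.SequenceMatcher` from the Python standard library. The attribute names, the 0.6 threshold, and the function `match_schemas` are assumptions made for this example; practical matchers combine many more signals (data types, instances, constraints, thesauri).

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_schemas(local_attrs, global_attrs, threshold=0.6):
    """Map each local attribute to its most similar global attribute, if any."""
    matches = {}
    for local in local_attrs:
        best = max(global_attrs, key=lambda g: similarity(local, g))
        if similarity(local, best) >= threshold:
            matches[local] = best
    return matches


# Hypothetical LCS and GCS attribute names.
matches = match_schemas(["emp_name", "town", "salary"], ["name", "city", "budget"])
print(matches)
```

The pair town/city is missed because the two names share almost no characters: that correspondence is semantic rather than syntactic, which is why the matching step has to consider both kinds.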


Data Cleaning

• Errors in source databases can always occur, requiring cleaning in order to correctly answer user queries.
• Data cleaning is a problem that arises in both data warehouses and data integration systems, but in different contexts.
• In data warehouses, where data are actually extracted from local operational databases and materialized as a global database, cleaning is performed as the global database is created.
• In the case of data integration systems, data cleaning is a process that needs to be performed during query processing, when data are returned from the source databases.


Data Cleaning

• The errors that are subject to data cleaning can generally be broken down into either schema-level or instance-level concerns.
• Schema-level problems can arise in each individual LCS due to violations of explicit and implicit constraints.
• Instance-level errors are those that exist at the data level.
• For example, the values of some attributes may be missing although they were required; there could be misspellings, word transpositions, or differences in abbreviations; embedded values that were erroneously placed in other fields; duplicate values; and contradicting values.
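A few of the instance-level problems listed above can be illustrated with a short sketch. The records, the `REQUIRED` fields, and the `ABBREVIATIONS` table are hypothetical, and the rules are deliberately simple: reject records with missing required values, normalize case and abbreviations, and drop exact duplicates after normalization.

```python
# Hypothetical abbreviation table and required fields.
ABBREVIATIONS = {"st.": "street", "rd.": "road"}
REQUIRED = ("name", "city")


def normalize(value):
    """Trim, lower-case, and expand known abbreviations word by word."""
    words = value.strip().lower().split()
    return " ".join(ABBREVIATIONS.get(w, w) for w in words)


def clean(records):
    """Return (cleaned, rejected): rejected records miss a required value."""
    seen, cleaned, rejected = set(), [], []
    for rec in records:
        if any(not rec.get(field) for field in REQUIRED):
            rejected.append(rec)          # missing required value
            continue
        norm = {k: normalize(v) for k, v in rec.items()}
        key = tuple(sorted(norm.items()))
        if key in seen:                   # duplicate after normalization
            continue
        seen.add(key)
        cleaned.append(norm)
    return cleaned, rejected


records = [
    {"name": "Asha Rao", "city": "Pilani", "address": "12 Main St."},
    {"name": "asha rao ", "city": "pilani", "address": "12 main street"},
    {"name": "Ravi", "city": "", "address": "4 Park Rd."},
]
cleaned, rejected = clean(records)
print(cleaned)
print(rejected)
```

Misspellings and contradicting values need fuzzier techniques (similarity measures, conflict resolution rules) that this sketch does not attempt.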


Mid-Semester Examination Portions

• The portions for the mid-semester examination (inclusive of CS, Pre-CS videos & Post-CS readings) are:
   o M1: Distributed Data Storage Technology
   o M2: Distributed File Systems & Security
   o M3: Distributed Databases
   o M4: Distributed Database Design Issues & Integration
• The EC2 will be held on 16th April 2022 for 2 hrs, for 30 marks, and is OPEN BOOK.

ALL THE BEST!!




Thank You for your
time & attention !
Contact : [email protected]
