An Optimized Scheme For Vertical Partitioning of A
Eltayeb Abuelyaman
Imam Abdul Rahman bin Faisal University
IJCSNS International Journal of Computer Science and Network Security, VOL.8 No.1, January 2008

Summary
This paper proposes a scheme for vertical partitioning of a database at the design cycle. When a partition is formed, attributes are divided among various systems or even throughout different geographical locations. This may result in situations where a query may include attributes that are located at different sites. The scheme determines the hit ratio of a partition. As long as it falls below a predetermined threshold, the partition is altered. Although no proof is provided, experimental data showed that moving an attribute that is loosely coupled to a different subset within a partition improves hit ratio. A simulator was built to test the proposed algorithm. Results of various simulation runs are consistent with the hypothesis. That is, the proposed algorithm enables a reliable distribution of newly designed database tables across multiple storage devices based on a predetermined hit ratio. The scheme is independent of frequencies of queries and thus can be used as a stepping stone for its counterpart, the dynamic partitioning technique.

Keywords: database, partition, frequency, query, reflexive, symmetry, transitivity, hit ratio.

1. Introduction

Distributed and parallel processing is an efficient way of improving performance of Data-Base Management Systems (DBMSs) and applications that manipulate large volumes of data. Such improvement comes from limiting queries only to data that are relevant to their respective transactions. This is one of the main design goals of distributed databases according to [2].

The primary concern of DBMS design is the fragmentation and allocation of the underlying database. The distribution of data across various sites of computer networks involves making proper fragmentation and placement decisions. The first phase in the process of distributing a database is fragmentation, which clusters information into fragments. This process is followed by the allocation phase, which distributes, and if necessary replicates, the generated fragments among the nodes of a computer network. The use of data fragmentation to improve performance is not new and commonly appears in file design and optimization literature [3].

Partitioning based on attributes has been studied earlier in [3], [4], [6], [8]. Stocker and Dearnley discussed implementation of a self-reorganizing DBMS that carries out attribute clustering [9]. They showed that it is beneficial to cluster attributes of a DBMS where storage cost is low compared to the cost of accessing subfiles. Such is the case because increases in storage costs will be offset by savings in access cost. Hoffer developed a non-linear, zero-one program which minimizes a linear combination of the costs of storing, retrieving and updating, with capacity constraints for each file [6]. Navathe et al. used a two-step approach for vertical partitioning. In the first step, they used the given input parameters in the form of an Attribute Usage Matrix (AUM) to construct an Attribute Affinity Matrix (AAM) for clustering [8]. After clustering, an empirical objective function is used to perform binary partitioning iteratively. In the second step, estimated storage cost factors are considered for further refinement of the partitioning process. Further details about AUM and AAM matrices will be provided in the next paragraph.

Cornell and Yu extended the approach of Navathe et al. to decrease the number of disk accesses for optimal binary partitioning [5]. Their extension involved specific physical factors such as the number of attributes, their length and selectivity, the cardinality of the relation and so on. Navathe and Ra developed a new algorithm that follows graph
theory partitioning techniques [7]. Their algorithm starts from the AAM matrix, which is transformed into a graph called the Affinity Graph (AG). An edge in AG is labeled with a weight that represents the affinity between its vertices, where vertices represent attributes and the affinity between two vertices is the number of queries in which the two attributes occur simultaneously. For clarity of presentation we will define what an AAM matrix is. For more details, interested readers are referred to reference [8]. Basically, an n x n AAM matrix is one whose AAM(i, j) entry represents the number of queries that simultaneously access the attributes represented by i and j. Based on the AAM, an iterative binary partitioning method has been proposed in [5] and [8]. The authors first clustered the attributes and then applied empirical objective functions and/or mathematical cost functions to perform fragmentation.

The partitioning algorithms suggested in the literature suffer from various limitations that will complicate the task of a Data Base Designer (DBD). These limitations are:

1. A DBD has to have sufficient empirical data on Frequencies Of Queries (FOQ).
2. FOQ is a function of several variables that include time, users, and future needs of an organization.
3. Attributes are partitioned based on FOQ.

The first limitation makes the partitioning inapplicable to newly designed database schemas. The second applies to periodical queries such as those for student-record and tax databases. Furthermore, changes in organizational structures or business requirements may call for additional attributes. The third limitation stems from the natural dynamicity of FOQ.

The common denominator in all three limitations is FOQ. Therefore, the author herein classifies all partitions that are based on FOQ as dynamic. On the other hand, the only way for a partition to be independent of FOQ is for it to be based on a database schema. In this case it will be logical to classify it as static. The proposed partition is static and will be called StatPart.

The rest of this paper is organized as follows: In the next section we will introduce StatPart. In Section 3 we will discuss simulation of StatPart. A comparison of performance of StatPart with a benchmark is given in Section 4. The conclusion is presented in Section 5.

2. StatPart

In general, designers very frequently delay important steps to the end of design cycles. The design of efficient database systems is not an exception because database partitioning is based on FOQ. That is, data must be collected from a large number of queries before partitioning. To circumvent such a constraint, dependency on FOQ must be eliminated. One way to do so is to perform database partitioning at the design phase, immediately after completion of the schema. Conveniently, partitioning can be decided even before database tables are populated. For such a partition, which we classified as static, the DBD must:

1. Gain sufficient knowledge on the business requirements of an organization.
2. Gather necessary and sufficient information about intended usage of the database to determine the set of queries that would be of immediate use. Henceforth, this set will be called the Set of Kickoff Queries (SKQ). This step requires thorough understanding of the business requirements of an organization.
3. Gather information about future plans of an organization to determine additional queries that may be needed in the future. This set will be referred to as the Set of Future Queries (SFQ).

For illustration of the proposed static partitioning, the following definition will be necessary.

Definition 1:

a) Na: the total number of attributes.
b) Nk: the number of queries in the set SKQ.
c) Nf: the number of queries in the set SFQ.
d) SQ: the union of the set SKQ with SFQ.
e) SA = {A1, A2, ..., ANa}: the overall set of attributes.
f) SQ = {Q1, Q2, ..., QNq}: the overall set of queries.

In the next section, we will discuss a suggested simulator for the static partitioning.

3. Simulation of StatPart

Our proposed simulator will enable a DBD to partition a database at its infancy, that is, at its
schema level. The output of the simulator may range from 0 to 100 percent, where 0 percent implies that the schema cannot be partitioned and 100 percent means that every attribute is placed in a partition by itself. Both percentages are undesirable and the latter is unacceptable. Along with the schema, a complete set of queries and parameters defining the database must be fed into the simulator. These parameters will be discussed hereafter as necessary.

The simulator has the following three modules. Each of the modules is discussed separately.
a) reflexivity
b) symmetry
c) transitivity

3.1 The reflexivity module

The module prompts a user to enter values for each of the first three parameters in Definition 1. The module then prompts the user to enter a percentage C that controls the number of attributes appearing in each query. If the designer enters the value 30, for example, then the module will generate a value of 1 with probability 0.3 and a value of 0 with probability 0.7. That is, if the total number of attributes is 10 then on average a query would include three attributes. The reflexivity module will then generate a matrix that relates attributes to queries. For C = 30 and (Na, Nk, Nf) = (8, 5, 3), we found the output challenging enough to use for the discussion. The output is shown in Table 1 below. Henceforth, such output will be called the Reflexivity Matrix (RM).

In an RM matrix, the total number of 1's in a column gives the degree of reflexivity of the column header's attribute. For example, in table 1 the reflexivity of attribute B is equal to 3.

Query \ Attribute   A  B  C  D  E  F  G  H
a                   0  1  0  0  1  0  1  1
b                   0  0  1  1  0  1  0  0
c                   1  0  1  1  0  0  0  0
d                   0  0  0  1  1  1  1  1
e                   0  1  1  0  1  1  1  0
f                   0  0  0  0  0  0  0  0
g                   0  1  1  0  0  1  1  1
h                   0  0  1  1  1  1  0  0

Table 1. A randomly generated Reflexivity Matrix

A 0 entry in the table indicates that the row header query does not involve the column header attribute, and a 1 entry indicates that it does. One can see that RM[d, F] is equal to 1. The table provides the relationship between queries and attributes. However, it doesn't directly provide the desired relationships among attributes. The RM matrix will therefore be used as input to the second module, the symmetry module, which will produce the desired relationship.

3.2 The symmetry module

The following equations were used to compute the Symmetry Matrix (SM) in table 2, which defines the desired relationships among attributes. The sums run over the rows of RM, that is, over the N_q queries.

$$SM[j, j] = \sum_{i=1}^{N_q} RM[i, j], \quad j = 1, \dots, N_a \qquad (1)$$

$$SM[i, j] = \sum_{k=1}^{N_q} RM[k, i] \cdot RM[k, j], \quad i, j = 1, \dots, N_a,\ i \neq j \qquad (2)$$

Attribute   A  B  C  D  E  F  G  H
A           1  0  1  1  0  0  0  0
B           0  3  2  0  2  2  3  2
C           1  2  5  3  2  4  2  1
D           1  0  3  4  2  3  1  1
E           0  2  2  2  4  3  3  2
F           0  2  4  3  3  5  3  2
G           0  3  2  1  3  3  4  3
H           0  2  1  1  2  2  3  3

Table 2. Symmetry Matrix generated from the Reflexivity Matrix and equations 1 and 2

Equation 1 adds up the column entries for each attribute j in table 1 to determine its reflexivity. The diagonal entries of an SM matrix give the reflexivity degrees of attributes. Equation 2 finds the intersection between each pair of attributes i and j. For example, in table 1, if we performed entry by entry multiplication of attributes i = E = (1,0,0,1,1,0,0,1)T and j = F = (0,1,0,1,1,0,1,1)T the result would be i*j = (0,0,0,1,1,0,0,1)T. Entries in the result add up to 3, which is the value stored in both SM[E,F] and SM[F,E] of table 2.

If the matrix is transformed into a graph, then without loss of generality we assume attribute K to be represented by vertex V. The weight of the edge
connecting V with a vertex W is given by the symmetry value of attribute K with the attribute represented by W. On the other hand, reflexivity for an attribute indicates that an edge starts from, and ends at, the same vertex.

It is practical to assume that each attribute is included in at least one query. Consequently, each must have a reflexivity degree of at least one. From table 1 in the previous section, the reflexivity of attribute B was found equal to 3. This reflexivity degree is stored in SM[2,2] of table 2. Every diagonal element gives the reflexivity degree of its column header's attribute. A non-diagonal entry of SM gives the symmetry between the row and column headers and is equal to the number of queries that include both. The SM matrix can be thought of as a data structure for a graph. Consequently, for the corresponding graph, the symmetry is modeled by an edge connecting the two vertices (attributes). The SM matrix is itself symmetrical around its diagonal. At this point, we are ready to discuss the transitivity module, which acts upon the output of the symmetry module.

3.3 The transitivity module

Before proceeding with further discussions, we will summarize the above concepts in the following definitions.

[...] number of queries referencing V, while for any other attribute W, DS gives the number of queries referencing both V and W. The transitivity module receives the SM matrix as input and produces the required partition as output. However, before discussing the transitivity module, we will first plan a strategy that sets the mechanism for the algorithm and then suggest tactics to optimize its output.

3.4 Strategy and tactics

In general, the success of an algorithm depends on the strategy that sets criteria for choosing the best start point and the smartest move thereafter. Going back to the graph theory terminology, one must first choose the best vertex to start with, and then pick the most appropriate edge to traverse in search of an optimal partition. Equally, our strategy will focus on selecting the attribute to start with, which is a function of the DR, and then picking the best neighbor to reach, which is a function of DS.

To that end, there are four different possibilities: {(+DR,+DS);(+DR,–DS);(–DR,+DS);(–DR,–DS)}, where +DR (+DS) represents the maximum degree of reflexivity (symmetry) of the remaining vertices (attributes). Similarly, –DR (–DS) represents the minimum degree of reflexivity (symmetry). Based on our empirical data, we determined that the most appropriate combination is the (–DR,+DS) pair.
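
As a minimal sketch of the symmetry module and of the (–DR,+DS) choice, the Python below rebuilds the SM of table 2 from the RM of table 1 using equations 1 and 2, then picks the attribute with the minimum DR as the start and a neighbor with the maximum DS. The function names and the tie-breaking rule (the first attribute in column order wins) are assumptions made for illustration; they are not taken from the simulator, and the full traversal and its stopping rule are not attempted here.

```python
# Reflexivity Matrix of table 1: rows are queries a..h, columns are attributes A..H.
ATTRS = "ABCDEFGH"
RM = [
    [0, 1, 0, 0, 1, 0, 1, 1],  # a
    [0, 0, 1, 1, 0, 1, 0, 0],  # b
    [1, 0, 1, 1, 0, 0, 0, 0],  # c
    [0, 0, 0, 1, 1, 1, 1, 1],  # d
    [0, 1, 1, 0, 1, 1, 1, 0],  # e
    [0, 0, 0, 0, 0, 0, 0, 0],  # f
    [0, 1, 1, 0, 0, 1, 1, 1],  # g
    [0, 0, 1, 1, 1, 1, 0, 0],  # h
]


def symmetry_matrix(rm):
    """SM[i][j] = number of queries (rows of rm) that include both attributes.

    With 0/1 entries, the i == j case equals the column sum of equation 1
    and the i != j case is the product sum of equation 2.
    """
    n_attrs = len(rm[0])
    return [
        [sum(row[i] * row[j] for row in rm) for j in range(n_attrs)]
        for i in range(n_attrs)
    ]


def pick_start(sm, candidates):
    """-DR: candidate with the minimum degree of reflexivity (diagonal entry).

    Assumption: ties go to the first candidate in the given order.
    """
    return min(candidates, key=lambda a: sm[a][a])


def pick_neighbor(sm, current, candidates):
    """+DS: candidate that shares the most queries with the current attribute.

    Assumption: ties go to the first candidate in the given order.
    """
    return max(candidates, key=lambda a: sm[current][a])


SM = symmetry_matrix(RM)
start = pick_start(SM, range(len(ATTRS)))
remaining = [a for a in range(len(ATTRS)) if ATTRS[a] not in "AC"]
neighbor_of_c = pick_neighbor(SM, ATTRS.index("C"), remaining)

print(SM[ATTRS.index("E")][ATTRS.index("F")])  # 3, as in table 2
print(ATTRS[start])                            # A (DR = 1)
print(ATTRS[neighbor_of_c])                    # F (DS with C = 4)
```

Computing every entry with the same product sum keeps equations 1 and 2 in one expression, which works only because RM holds 0/1 values.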
To illustrate the process, we used the SM in table 2 to produce the partition P. Here S is the subset being built and As is the set of attributes not yet assigned. Initially As is equal to {(A, B, C, D, E, F, G, H)}. The following is the step by step execution of the algorithm.

(a) S = {A} and As = {(B, C, D, E, F, G, H)}
(b) S = {(A,C)} and As = {(B, D, E, F, G, H)}
(c) S = {(A,C,F)} and As = {(B, D, E, G, H)}
(d) S = {(A,C,F,D)} and As = {(B, E, G, H)}

P = { (A, C, D, F) ; (B, E, G, H) }

The hit ratio for the partition can be computed from the following table, which shows the number of times an attribute is associated with its subset mates (hits) versus the number of times it is associated with others that are not in its subset (misses).

        A   B   C   D   E   F   G   H   Total
hit     2   7   8   7   7   7   9   7   54
miss    0   4   7   4   7   10  6   4   42

Table 3. Attribute association table for P

The hit ratio of P is 54/(54 + 42) = 56.3%. Attribute F has the lowest hit to miss ratio in table 3 (7 to 10). Moving it to the other subset gives the partition P' = { (A, C, D) ; (B, E, F, G, H) }, whose association table is shown below.

        A   B   C   D   E   F   G   H   Total
hit     2   9   4   4   10  10  12  9   60
miss    0   2   11  7   4   7   3   2   36

Table 4. Attribute association table for P'

Table 4 reflects an improvement of the partition's hit ratio from 56.3% to 62.5%. Each attribute's hit (miss) value in table 3 differs from its corresponding value in table 4 by the DS the attribute shares with F. The strategy can now be summarized as follows in figure 2.

1. Produce a partition from an SM using the criterion (–DR, +DS) and figure 1.
2. Compute the partition hit ratio (PHR).
3. If PHR is less than the predefined threshold then
   a) Find the attribute with the minimum hit to miss ratio and move it to a different subset, using the attribute association table in the process.
   b) Repeat from step (2).
4. End partitioning.
Figure 2. Partitioning Strategy

4. StatPart VS a benchmark

We will use the Attribute Usage Matrix (AUM) in reference [8] as a benchmark and compare the results of their example with the output of our simulator. Their AUM matrix is shown in table 5 below.

Query \ Attribute   A  B  C  D  E  F  G  H  I  J
a                   1  0  0  0  1  0  1  0  0  0
b                   0  1  1  0  0  0  0  1  1  0
c                   0  0  0  1  0  1  0  0  0  1
d                   0  1  0  0  0  0  1  1  0  0
e                   1  1  1  0  1  0  1  1  1  0
f                   1  0  0  0  1  0  0  0  0  0
g                   0  0  1  0  0  0  0  0  1  0
h                   0  0  1  1  0  1  0  0  1  1

Table 5. The AUM matrix from reference [8]

One can see that the AUM matrix is quite similar to the RM matrix in table 1. First, we will use our symmetry module to produce an SM matrix which will serve as input to the transitivity module. The latter will produce a partition that can be compared with the partition in reference [8].

Based on the AUM matrix as input, the output of the symmetry module is shown in table 6 below. The table will be referred to as the Symmetry Matrix for AUM. Entries of SM are computed using equations 1 and 2.

Attribute   A  B  C  D  E  F  G  H  I  J
A           3  1  1  0  3  0  2  1  1  0
B           1  3  2  0  1  0  2  3  2  0
C           1  2  4  1  1  1  1  2  4  1
D           0  0  1  2  0  2  0  0  1  2
E           3  1  1  0  3  0  2  1  1  0
F           0  0  1  2  0  2  0  0  1  2
G           2  2  1  0  2  0  3  2  1  0
H           1  3  2  0  1  0  2  3  2  0
I           1  2  4  1  1  1  1  2  4  1
J           0  0  1  2  0  2  0  0  1  2

Table 6. The SM matrix for the AUM matrix

Now that we have the SM matrix for AUM, we will use it as input to the transitivity module to determine the partition. As in the previous section, we used the algorithm in figure 1 to produce a partition. We call the resulting partition Q to distinguish it from the partition in the previous example.

Q = { (A, B, C, E, G, H, I) ; (D, F, J) }

The next step is to compute the partition's hit ratio. In a similar manner as before, table 7 is produced as the attribute association table for Q.

        A   B   C   D   E   F   G   H   I   J   Total
hit     12  14  15  2   12  2   13  14  15  2   101
miss    0   0   3   6   0   6   0   0   3   4   22

Table 7. Attribute association table for Q

The partition hit ratio is 82%, which is above the threshold of 60%. The partition for the AUM matrix from reference [8] is shown in figure 3 below. Replacing the numbers from 1 to 10 in the figure with the letters A through J, we can see that the two partitions are identical. The figure suggests that (4, 6, 10) should be in one subset while the rest should be in another. Equally, our partition puts (D, F, J) in a different subset than that of the rest.

Figure 3. Partition from reference [8] (graph over vertices 1 to 10)

An important point to mention is that the algorithm presented in this paper is a simplified version of the one used for simulation. A few steps are not included in this document because the reader can readily work them out.
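
The hit and miss counts of tables 3 and 4 and the loop of figure 2 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the simulator's code: the function names are hypothetical, the 60% threshold is borrowed from the benchmark discussion, the weakest attribute is moved to the first subset that does not contain it, and a round limit guards against endless moves.

```python
ATTRS = "ABCDEFGH"
# Symmetry Matrix of table 2 (diagonal = DR, off-diagonal = DS).
SM = [
    [1, 0, 1, 1, 0, 0, 0, 0],  # A
    [0, 3, 2, 0, 2, 2, 3, 2],  # B
    [1, 2, 5, 3, 2, 4, 2, 1],  # C
    [1, 0, 3, 4, 2, 3, 1, 1],  # D
    [0, 2, 2, 2, 4, 3, 3, 2],  # E
    [0, 2, 4, 3, 3, 5, 3, 2],  # F
    [0, 3, 2, 1, 3, 3, 4, 3],  # G
    [0, 2, 1, 1, 2, 2, 3, 3],  # H
]


def association_table(sm, attrs, partition):
    """Map each attribute to (hits, misses); partition is a list of sets."""
    index = {a: k for k, a in enumerate(attrs)}
    table = {}
    for subset in partition:
        for a in subset:
            # Hits: DS shared with subset mates; the diagonal (DR) is excluded.
            hits = sum(sm[index[a]][index[b]] for b in subset if b != a)
            # Misses: DS shared with attributes outside the subset.
            misses = sum(sm[index[a]][index[b]] for b in attrs if b not in subset)
            table[a] = (hits, misses)
    return table


def hit_ratio(table):
    hits = sum(h for h, _ in table.values())
    misses = sum(m for _, m in table.values())
    return hits / (hits + misses)


def weakest_attribute(table):
    """Attribute with the minimum hit to miss ratio (no misses counts as infinite)."""
    def ratio(a):
        hits, misses = table[a]
        return hits / misses if misses else float("inf")
    return min(sorted(table), key=ratio)


def move(partition, attr, target):
    """Return a new partition with attr moved into the subset at index target."""
    moved = [set(s) - {attr} for s in partition]
    moved[target].add(attr)
    return moved


def refine(sm, attrs, partition, threshold, max_rounds=20):
    """Steps 2 to 4 of figure 2: move the weakest attribute until PHR >= threshold."""
    for _ in range(max_rounds):  # assumption: round limit, not in the paper
        table = association_table(sm, attrs, partition)
        if hit_ratio(table) >= threshold:
            break
        weak = weakest_attribute(table)
        # Assumption: the target is the first subset that does not hold the attribute.
        target = next(k for k, s in enumerate(partition) if weak not in s)
        partition = move(partition, weak, target)
    return partition


P = [set("ACDF"), set("BEGH")]
t3 = association_table(SM, ATTRS, P)
P_prime = refine(SM, ATTRS, P, threshold=0.60)

print(t3["F"], hit_ratio(t3))          # (7, 10) 0.5625
print([sorted(s) for s in P_prime])    # [['A', 'C', 'D'], ['B', 'E', 'F', 'G', 'H']]
print(hit_ratio(association_table(SM, ATTRS, P_prime)))  # 0.625
```

With only two subsets the choice of target is unambiguous; with more subsets a rule for choosing the destination would be needed, which the text above does not specify.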