
Comput Syst Sci & Eng (2020) 2: 99–112
© 2020 CRL Publishing Ltd
International Journal of Computer Systems Science & Engineering

Enhanced Schemes for Data Fragmentation, Allocation, and Replication in Distributed Database Systems
Masood Niazi Torshiz1∗, Azadeh Salehi Esfaji1† and Haleh Amintoosi2‡
1 Department of Computer Engineering, Mashhad Branch, Islamic Azad University, Mashhad, Iran
2 Computer Engineering Department, Faculty of Engineering, Ferdowsi University of Mashhad, Iran

With the growth of information technology and computer networks, there is a vital need for optimal design of distributed databases with the aim of performance
improvement in terms of minimizing the round-trip response time and query transmission and processing costs. To address this issue, new fragmentation,
data allocation, and replication techniques are required. In this paper, we propose enhanced vertical fragmentation, allocation, and replication schemes
to improve the performance of distributed database systems. The proposed fragmentation scheme clusters highly-bonded attributes (i.e., normally accessed
together) into a single fragment in order to minimize the query processing cost. The allocation scheme is proposed to find an optimized allocation to
minimize the round-trip response time. The replication scheme partially replicates the fragments to increase the local execution of queries in a way that
minimizes the cost of transmitting replicas to the sites. Experimental results show that, on average, the proposed schemes reduce the round-trip response
time of queries by 23% and query processing cost by 15%, as compared to the related work.

Keywords: Distributed Database, Big Data, Vertical Fragmentation, Allocation, Replication

∗ E-mail: [email protected]
† E-mail: [email protected]
‡ E-mail: [email protected], ORCID: 0000-0002-1447-8086, corresponding author

1. INTRODUCTION

Today, distributed databases are widely used in business organizations because of their advantages such as consistency, scalability, availability, and accessibility [1]. With the growth of communication and information technology, distributed database management systems (DDBMSs) have become increasingly essential, leading to the need for fast and efficient access to distributed databases [2]. As these databases are located on different servers and connected by links of different speeds, they can profoundly impact the response time and, subsequently, the transmission cost and performance of distributed databases. So, there is a need for an effective design of a distributed database in order to achieve the desired reliability and performance.

There exists considerable work, such as [4–9], that addresses the fragmentation or the allocation/replication problem separately. However, few works, such as [10, 11, 12, 14], have addressed all three issues integrally. Moreover, most recent works have concentrated on minimizing the transmission cost, access cost, or query processing cost and are not concerned with minimizing the delays incurred by transmission and processing times (including queuing delays), which are important for Internet-based systems [16]. Few works, such as [14–16], have considered the importance of minimizing response time in distributed database design. Round-trip response time is defined as the time elapsed between the arrival of a query to

vol 35 no 2 March 2020 99



a network site and the time when the response to the query is received at the query source. In other words, round-trip response time consists of transmitting a query from its source site to a server site, processing the query and generating a response at the server site, and transmitting the response back to the source site [14]. Also, most works have proposed heuristic algorithms for addressing DDB fragmentation and allocation, and only a few have proposed mathematical programming formulations.

In this paper, we address the three above-mentioned issues in a distributed database by proposing vertical fragmentation, allocation, and replication schemes to minimize the round-trip response time. The contributions are summarized as follows:

– We propose a vertical fragmentation scheme that partitions the data into fragments such that bonded attributes (i.e., attributes accessed together by the queries) are located in a single fragment, thus reducing the cost of accessing those attributes. The proposed scheme utilizes a weighted graph (with attributes as the vertices and the bonds between attributes as edges) and partitions it into subgraphs (i.e., fragments) with maximum connectivity between the relevant vertices (i.e., attributes) of each partition. The graph partitioning is done in a way that prevents the creation of partitions that are too small or too big. The fragmentation scheme aims to minimize the query processing cost. More details can be found in Section 4.1.

– We propose a static allocation scheme that uses the simulated annealing metaheuristic to solve the NP-hard problem of finding an allocation that minimizes the round-trip response time. Instead of starting the simulated annealing process from a random allocation, the allocation scheme creates a targeted initial allocation pattern by considering the fragment access ratios and allocating fragments with higher access ratios to the sites with higher processing speeds. This targeted initialization decreases the time required for the simulated annealing algorithm to find the optimal allocation pattern. Moreover, in order to avoid the bottleneck problem (allocating high-demand fragments to sites with low processing speed), the proposed allocation scheme considers the site capacity constraints.

– We propose a replication scheme that performs partial replication of fragments to increase the local execution of queries. The fragments are replicated in a way that minimizes the cost of transmitting replicas to the sites. In order to find an optimal replication solution, the proposed scheme utilizes the simulated annealing technique and considers two constraints:

  i) a fragment is replicated to a site only if there is a need for the fragment on that site, and
  ii) the fragment is replicated to a site only if the site capacity constraint is preserved.

The remainder of the paper is organized as follows: Section 2 reviews the works concentrating on vertical fragmentation, allocation, and replication in distributed database design. Section 3 presents the preliminaries and basic concepts referenced throughout the paper. The detailed description of the proposed approach is given in Section 4. Section 5 presents the simulation results and the validation of the proposed approach as well as a comparative study. Finally, Section 6 concludes the paper.

2. RELATED WORK

Fragmentation and allocation are known as the main procedures for achieving reliable performance and an efficient design in a distributed database and have been investigated in many articles. The work in [4] developed an integrated methodology for fragmentation and allocation, which incorporated concurrency control and communication network cost in distributed environments. Authors in [11] proposed a clustering-based technique for vertical fragmentation and allocation in distributed database systems. Their proposed scheme created query clusters to form fragments. They assume that each fragment is a set of attributes accessed together by a particular query. Similarly, authors in [12] proposed a heuristic approach to reduce transmission costs of distributed queries. They also proposed a site clustering algorithm to ensure the creation of highly-balanced clusters, and suggested several advanced allocation scenarios that take data replication into consideration. The work in [17] proposed a new vertical fragmentation algorithm using a graphical technique and an Attribute Usage Matrix (AUM), which represents the essential queries, whose primary purpose, unlike iterative binary partitioning methods, is to create all fragments in one iteration. The work in [18] proposed an algorithm to measure the similarity between any pair of attributes. This method clusters attributes into sub-relations, which are called fragments. For this purpose, the relations are divided into sub-relations at the design cycle. The works in [19, 20] presented an objective function to evaluate the “goodness” of fragmentation algorithms. The work in [21] developed a vertical fragmentation approach where an attribute affinity table was used as input to the proposed approach. A dynamic table for fragmentation and allocation was proposed in [22], which monitors the access pattern of network sites to data tables and utilizes it to perform fragmentation, replication, and re-allocation to maximize the number of local accesses. The work in [23] presented a mathematical optimization model called DFAR that unifies the fragmentation, allocation, and dynamic migration of data in distributed database systems, considering the storage capacity of network sites. Their model utilizes the Threshold Accepting algorithm to solve the DFAR problem. The works in [14–16] presented a new mathematical model for the fragmentation and allocation problem, called VFA-RT, which aims to minimize the round-trip response time of queries. The VFA-RT model is made of a non-linear objective function and a group of constraints. In order to solve the model, the Threshold Accepting (TA) and Tabu Search (TS) metaheuristic algorithms were used. The work in [24] proposed the Adaptive Distributed Request Window Algorithm (ADRW) to achieve fragmentation and dynamic allocation of data. This approach adapts to changes in the access patterns of requests for attributes and makes decisions on the replications according to their “read/write” requests for data and total servicing cost. The purpose of this algorithm is to adjust data allocation patterns to reduce the total servicing cost of the full read/write requests of data. The work in [25] presented a genetic algorithm approach


Table 1 Summary of Related Work.

Reference (columns): [20] [19] [18] [26] [14] [11] [23] [28] [10] [12] [4]

Problem addressed
  Fragmentation                * * * * * *
  Allocation                   * *
  Fragmentation + Allocation   * * * * *
  Re-allocation
  Replication                  * * *

Value to minimize
  Query Transmission Cost      * * * * *
  Query Processing Cost        * * *
  Round-trip Response Time     * * *

to solve the combined problems of vertical fragmentation and access path selection. Choosing the access path is a mechanism that is capable of conducting an effective search for the physical sites of data. The work in [26] presented a new heuristic approach for fragmentation. This approach reduces the transfer cost of fragments to different sites using a mathematical model. In this approach, fragmentation and allocation are done simultaneously. Authors in [7] propose a linear approach to distributed database optimization that gathers incremental online knowledge about data access patterns and database statistics for online re-allocation of the fragments in order to continually optimize the query response time. In [6], the authors proposed a method based on a particle swarm optimization algorithm to solve the data allocation problem that aims to minimize the query execution time and transaction cost. In [27, 28], the authors discussed the data allocation issue with the purpose of minimizing data transmission across network sites using an ant colony optimization algorithm. The procedure proposed in [8] was a vertical fragmentation model with a two-phase allocation process. Unlike most earlier studies, the tradeoffs between different allocation scenarios were discussed for finding an optimal way of assigning attributes to sites. The model presented in [9] was an extension of [8] and could considerably reduce communication costs and query response time.

The work in [5] considered the data allocation problem in distributed databases where the query execution strategy affects allocation decisions. Authors in [29] propose a vertical partitioning algorithm that uses graphical techniques and starts from the attribute affinity matrix by considering it as a complete graph. Then, forming a linearly connected spanning tree, it generates all meaningful fragments simultaneously by considering a cycle as a fragment.

Table 1 summarizes the most relevant and recent works discussed above. As can be observed, many works have only dealt with the fragmentation problem, and many works have addressed the integration of fragmentation and allocation problems. However, few works have considered addressing the fragmentation, allocation, and replication problems integrally. Replicating fragments has been shown to result in more reliability, accessibility, network traffic reduction, increased scalability, and better performance compared to the lack of replication [1, 22, 29]. In this article, we propose solutions for all three issues. Moreover, most of the works have concentrated on minimizing the query processing/transmission cost, and only a few (e.g., [14]) have dealt with round-trip time minimization. In our proposed allocation scheme, we also consider minimizing this parameter. It is worth mentioning that our work is different from the works in [14–16] since we address the replication problem as well as the fragmentation and allocation problems. Moreover, we have some innovations in the fragmentation and allocation schemes compared to other related works. More detail about the proposed schemes can be found in Section 4.

3. PRELIMINARIES AND ASSUMPTIONS

3.1 Basic Definitions:

Vertical fragmentation divides an original relation (DB table) into sub-relations (fragments) in such a way that the combination of the fragments generates the primary relation [1]. If $R$ denotes a relation with a set of attributes (columns) $A = \{A_1, A_2, \dots, A_L\}$, vertical fragmentation partitions $R$ into sub-relations $F_i$ that are derived from Equation 1:

$$F_i = \Pi_{PK, A_i} R, \quad \forall A_i \in A; \qquad R = F_1 \bowtie F_2 \bowtie \dots \bowtie F_N \tag{1}$$

where $\Pi$ is the projection operator of relational algebra [1], and $PK$ is the primary key that should be replicated in all fragments. Relation $R$ should also be reconstructable by applying the join operator $\bowtie$ on the resultant sub-relations (i.e., fragments), as illustrated in the above equation. So, vertical fragmentation on a relation $R$ is defined as determining sub-relations $F_1, F_2, \dots, F_N$ such that query execution cost is optimized with respect to some criterion (here, minimizing the query processing cost).

Since vertical partitioning puts in one fragment those attributes usually accessed together, there is a need for some measure that would define more precisely the notion of “togetherness” [1]. Query execution frequency ($f$) and access frequency (the frequency of accessing an attribute by a query) are two crucial factors that define this notion. For each query $Q_i$ ($1 \le i \le K$) and each attribute $A_j$ ($1 \le j \le L$), we associate an attribute access value, which equals 1 if query $Q_i$ references attribute $A_j$, and zero otherwise. The set of all access values can be


Table 2 Notation Description.

Notation      Description
$S$           The set of sites, $S = \{S_1, S_2, \dots, S_{SN}\}$
$F$           The set of fragments, $F = \{F_1, F_2, \dots, F_N\}$
$L_{F_i}$     Number of attributes in fragment $F_i$ ($1 \le i \le N$)
$L_{S_j}$     Number of fragments in site $S_j$ ($1 \le j \le SN$)
$Q$           The set of important queries, $Q = \{Q_1, Q_2, \dots, Q_K\}$
$A$           The set of attributes, $A = \{A_1, A_2, \dots, A_L\}$
$AAM$         The $K \times L$ matrix showing the attribute access values (which are either 1 or 0)
$f_{qi}$      The execution frequency of query $Q_q$ on site $S_i$ (expressed in queries/sec.)
$f_q$         The execution frequency of query $Q_q$
$1/M_q$       Mean length of query $Q_q$ (expressed in bits/query)
$1/M_R$       Mean length of query response (expressed in bits/response)
$C_{ij}$      Transmission speed between site $S_i$ and site $S_j$ (expressed in bits/sec.)
$C_j$         The processing capacity of site $S_j$ (expressed in queries/sec.)
$y_{jq}$      The existence/non-existence of one or more attributes used by query $q$ in site $j$ (either 1 or 0)
$ABM$         The $L \times L$ matrix showing the number of queries that have accessed two attributes together (i.e., bonded attributes)
$W_{ij}$      The number of attributes in local fragment $F_i$ that query $Q_j$ accesses
$R_{irj}$     Set of relevant attributes that do not exist in local fragment $F_i$ and must be accessed remotely by query $Q_j$ in fragment $F_r$
$n_{irj}$     Total number of attributes that are in fragment $F_i$ accessed remotely with respect to fragment $F_r$ by query $Q_j$
$R_{ij}$      The number of relevant attributes that do not exist in local fragment $F_i$ and must be accessed remotely by query $Q_j$
$n_i$         The number of attributes that exist in fragment $F_i$
$T_{ij}$      Transmission cost between site $S_i$ and site $S_j$
$MK_q$        The number of executions of query $Q_q$
$PK_q$        The ratio of executions of query $Q_q$ to the total number of query executions
$CS_j$        The storage capacity of site $S_j$
$AFM$         The $L \times N$ matrix showing whether an attribute belongs to a fragment or not
$QFM$         The $K \times N$ matrix showing whether a query needs a fragment to be executed
$FSM$         The $N \times SN$ matrix showing whether a fragment is allocated to a site
$AR_j$        The access ratio of all queries to fragment $F_j$
$AV$          A vector of size $N$ showing the access ratios (ARs) of all fragments

represented by a $K \times L$ matrix called the Attribute Access Matrix (AAM), as expressed by Equation 2.

$$AAM(Q_i, A_j) = \begin{cases} 1, & \text{if attribute } A_j \text{ is accessed by } Q_i \\ 0, & \text{otherwise} \end{cases} \tag{2}$$

Similarly, we define an attribute bond value that measures the strength of an imaginary bond between two attributes. The attribute bond value represents the number of times two attributes are accessed together by all queries at all sites. The set of all bond values can be represented by an $L \times L$ matrix called the Attribute Bond Matrix (ABM), as expressed by Equation 3.

$$ABM(A_i, A_j) = \begin{cases} \sum_{q=1}^{K} AAM(Q_q, A_i) \cdot AAM(Q_q, A_j), & i \neq j \\ 0, & \text{otherwise} \end{cases} \tag{3}$$

Attributes that are accessed by queries are called relevant attributes, and the fragment that contains most of the relevant attributes is defined as the local fragment [25]. $W_{ij}$ denotes the number of attributes in local fragment $F_i$ that query $Q_j$ accesses. The relevant attributes that are not located in the local fragment $F_i$ and must be accessed remotely by query $Q_j$ in fragment $F_r$ are denoted by $R_{irj}$. Note that in a typical environment, there may be many queries being executed. However, typically, only important queries (for example, the 20% of active queries that account for 80% of data accesses) are taken into consideration [1]. Table 2 presents a detailed description of the notations used in the paper.

3.2 Assumptions:

In this paper, we assume a client-server architecture where the server is responsible for performing the proposed fragmentation and allocation schemes, and clients (i.e., sites) store the fragments that are defined and allocated by the server. We also assume a static environment in which the queries that are to be performed are read-only (i.e., do not modify the database) and are known beforehand (i.e., there exists information about which queries are going to be performed on which sites and which attributes are going to be accessed by these queries). We also assume that fragments are disjoint for all attributes except for the primary key PK, which should be repeated in all fragments of a relation (for reconstruction).

3.3 Cost Model

As mentioned previously, vertical fragmentation and allocation are to be done to minimize the query processing costs. The


cost of distributed query processing can be expressed by two factors: local query processing cost (the cost of accessing irrelevant local attributes) and remote query processing cost (the cost of accessing the remote relevant attributes). In this article, we consider a cost model in which the cost of executing operations such as select, project, and join is not considered. In other words, since CPU time is negligible in comparison with I/O time, we do not consider the processing cost (which includes the time of executing operations such as select and join) and only consider the cost of accessing the attributes by these operations.

In vertical fragmentation, a query does not usually require retrieving all the attributes of a fragment during query processing. Each attribute that is not required by a query but exists in the local fragment causes irrelevant local attribute access cost. Attributes that are not required by a query (but are accessed because they reside within the retrieved fragment) are called irrelevant attributes. The existence of irrelevant attributes in the local fragment may increase the number of local accesses. This, in turn, may increase the number of disk accesses, and hence the local query processing cost increases. Equation 4 expresses the irrelevant local attribute access cost, as described in [17]:

$$Cost_{local} = \sum_{i=1}^{N} \sum_{j=1}^{K} \left[ f_j^2 \times |W_{ij}| \times \frac{|W_{ij}|}{n_i} \right] \tag{4}$$

Similarly, there are attributes that are required by the queries but do not exist in the local fragment. These attributes are called relevant remote attributes. A greater number of relevant attributes in the remote fragments may also lead to an increase in the remote query processing cost [17]. Equation 5 expresses the relevant remote attribute access cost [19]:

$$Cost_{remote} = \sum_{j=1}^{K} \min_{i=1,\dots,N} \sum_{\substack{r=1 \\ r \neq i}}^{N} \left[ f_j^2 \times |R_{irj}| \times \frac{|R_{irj}|}{n_{irj}} \right] \tag{5}$$

So, the total query processing cost, denoted by $TCost$, is expressed by Equation 6, as mentioned in [19]. This parameter will be further used in Section 5 in order to evaluate the performance of the proposed fragmentation scheme.

$$TCost = Cost_{local} + Cost_{remote} \tag{6}$$

4. PROPOSED VERTICAL FRAGMENTATION, ALLOCATION AND REPLICATION SCHEMES

In the following, we describe the proposed fragmentation, allocation, and replication schemes. The fragmentation scheme partitions attributes into fragments to minimize the query processing cost. The allocation component then optimally allocates the fragments to the sites to minimize the round-trip response time. Once the allocation is done, data that are commonly accessed by queries are replicated on the query’s local site to increase the locality of reference and reduce the communication cost. A detailed description of each component is presented below.

4.1 Vertical Fragmentation Scheme

The fragmentation scheme is responsible for dividing a database into fragments to minimize the query processing cost. As mentioned in Section 3.3, the query processing cost is affected by the cost of accessing irrelevant local attributes and the cost of accessing relevant remote attributes. So, in order to reduce these costs, relevant attributes that are accessed together by the queries should be located in one fragment. The intuition behind this idea is that placing relevant attributes together decreases the number of irrelevant attributes within that fragment, thus reducing the irrelevant access cost. Besides, keeping relevant attributes in the local fragment reduces the need for a query to access them remotely, thus reducing the relevant remote attribute cost. The fragmentation process, as proposed by this component, is as follows.

In order to identify the attributes that are to be located within a fragment, we make use of AAM. As mentioned previously in Section 3, AAM defines whether a query accesses an attribute or not. Next, ABM is constructed using Equation 3. Remember that ABM defines the number of times two attributes are accessed together by all queries running on a site. A more detailed description of AAM, ABM, and bond values has been presented previously in Section 3.

In the next step, graph $G$ is created based on ABM, in which vertices represent attributes and edges connect those two vertices (i.e., attributes) that are bonded together. The weight of an edge between two bonded attributes $A_i$ and $A_j$ is obtained from $ABM[i, j]$. Once the graph is created, it is partitioned into subgraphs. As mentioned above, putting highly-bonded attributes in one partition (i.e., fragment) results in the reduction of access cost. So, partitioning is done in such a way that each subgraph has the maximum bond values between its vertices. These subgraphs are then considered as fragments. In order to do the partitioning, in the first step, we find the edges with the lowest weights and remove them from the graph, provided that this does not disconnect the graph. This process can be expressed as finding a maximal spanning tree for the graph, which connects all the graph vertices and includes the edges with higher weights (i.e., bond values).

Once the maximal spanning tree is constructed, we begin partitioning it into subgraphs. A useful parameter in partitioning is the partition size, defined as the number of attributes that reside inside a partition. If the partition size is too large, it leads to an increase in the irrelevant local attribute access cost. Likewise, a partition size that is too small leads to an increase in the relevant remote attribute access cost. In order to create subgraphs, we start removing the edges with the lowest weights. If there are multiple edges with equal weights, we remove the edge that partitions the graph into sub-partitions (i.e., subgraphs) with the least difference in their partition size. This prevents the creation of partitions that are too small or too large. The partitioning is done until $N - 1$ edges are removed, resulting in the creation of $N$ subgraphs. Each subgraph is considered as a fragment.

The output of the fragmentation component is the Attribute Fragment Matrix (AFM), expressed in Equation 7, which is an $L \times N$ matrix that shows whether an attribute belongs to a fragment or not.
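As an illustration only, the fragmentation procedure of Section 4.1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`attribute_bond_matrix`, `maximal_spanning_tree`, `components`, `fragment`, `attribute_fragment_matrix`) and the small AAM are hypothetical, Kruskal's algorithm is our choice for building the maximal spanning tree (the paper does not name one), and the primary key, which is repeated in every fragment, is left out of the attribute set.

```python
from itertools import combinations


def attribute_bond_matrix(aam):
    """Equation 3: bond of A_i and A_j = number of queries accessing both."""
    n_attrs = len(aam[0])
    abm = [[0] * n_attrs for _ in range(n_attrs)]
    for i, j in combinations(range(n_attrs), 2):
        bond = sum(row[i] * row[j] for row in aam)
        abm[i][j] = abm[j][i] = bond  # the diagonal stays 0
    return abm


def maximal_spanning_tree(abm):
    """Kruskal's algorithm, visiting edges from highest to lowest bond value."""
    n = len(abm)
    parent = list(range(n))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path compression
            v = parent[v]
        return v

    edges = sorted(
        ((abm[i][j], i, j) for i, j in combinations(range(n), 2) if abm[i][j] > 0),
        reverse=True,
    )
    tree = []
    for weight, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # the edge joins two separate components
            parent[root_i] = root_j
            tree.append((weight, i, j))
    return tree


def components(n, edges):
    """Connected components (sorted attribute indices) of a graph on n vertices."""
    adjacent = {v: set() for v in range(n)}
    for _, i, j in edges:
        adjacent[i].add(j)
        adjacent[j].add(i)
    seen, parts = set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, part = [start], []
        while stack:
            v = stack.pop()
            part.append(v)
            for u in adjacent[v] - seen:
                seen.add(u)
                stack.append(u)
        parts.append(sorted(part))
    return parts


def fragment(aam, n_fragments):
    """Cut the lowest-weight tree edges until n_fragments subgraphs remain."""
    abm = attribute_bond_matrix(aam)
    n = len(abm)
    tree = maximal_spanning_tree(abm)

    def size_gap(edge):
        # Size difference of the two parts created by cutting this edge.
        parts = components(n, [e for e in tree if e != edge])
        part_i = next(p for p in parts if edge[1] in p)
        part_j = next(p for p in parts if edge[2] in p)
        return abs(len(part_i) - len(part_j))

    while tree and len(components(n, tree)) < n_fragments:
        lowest = min(weight for weight, _, _ in tree)
        ties = [e for e in tree if e[0] == lowest]
        tree.remove(min(ties, key=size_gap))  # tie-break on balanced sizes
    return components(n, tree)


def attribute_fragment_matrix(n_attrs, fragments):
    """Equation 7: AFM[i][j] = 1 if attribute A_i belongs to fragment F_j."""
    return [[int(i in frag) for frag in fragments] for i in range(n_attrs)]


# Hypothetical workload: 4 queries (rows) over 4 non-key attributes (columns).
AAM = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
]
FRAGMENTS = fragment(AAM, n_fragments=2)
AFM = attribute_fragment_matrix(4, FRAGMENTS)
```

In this toy workload the two candidate edges with the lowest bond value (1) are tied, and the tie-breaking rule cuts the edge between the second and third attributes, which yields two fragments of two attributes each rather than one fragment of three and one of one.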



$$AFM(A_i, F_j) = \begin{cases} 1, & \text{if attribute } A_i \text{ belongs to fragment } F_j \\ 0, & \text{otherwise} \end{cases} \tag{7}$$

4.2 Allocation Scheme

Once fragments are created, the next step is allocating the fragments to the sites. Such allocation is done in a way that minimizes the round-trip response time. In other words, we are looking for an optimal fragment allocation that minimizes the round-trip response time.

As described previously, round-trip response time is defined as the time elapsed between the arrival of a query to a site and the time the query response is received at the query source. In other words, the average round-trip response time is described by three terms: the average transmission delay of queries incurred by their transmission from query sources to the servers, the average processing delay of queries at the servers, and the average transmission delay of query responses back to their sources. Specifically, the objective function, as described in [14], is minimizing the round-trip response time (RRT) given by Equation 8.

$$RRT = \frac{1}{\sum_{j} \sum_{q} \sum_{i} f_{qi}\, y_{jq}} \left[ \sum_{ij} \frac{1}{\dfrac{M_q C_{ij}}{\sum_{q} f_{qi}\, y_{jq}} - 1} + \sum_{j} \frac{1}{\dfrac{C_j}{\sum_{q} \sum_{i} f_{qi}\, y_{jq}} - 1} + \sum_{ij} \frac{1}{\dfrac{M_R C_{ij}}{\sum_{q} f_{qi}\, y_{jq}} - 1} \right] \tag{8}$$

As mentioned before, the general problem of minimizing round-trip response time is NP-hard. Therefore, the proposed solutions are based on heuristics. In this paper, we have utilized the simulated annealing (SA) metaheuristic approach to find an optimal value for RRT. The SA algorithm starts from an initial solution and the calculation of the objective function. It then improves the value of the objective function in order to search for an optimal solution.

In our proposed allocation component, instead of considering a random fragment allocation pattern as the initial solution, we provide an efficient initial allocation pattern that considers fragment priorities in the initial allocation. More details are described in Section 4.2.2.

4.2.1 Bottleneck Problem

In real situations, server sites may have different processing speeds. Considering the minimization of round-trip response time as an objective inherently leads to the selection of sites with higher processing speeds to decrease the processing delay. This may result in heavily-loaded sites with a massive amount of traffic, which leads to an increase in the transmission delay of queries from query sources to the server sites. This is an example of a bottleneck problem.

In order to avoid the bottleneck, we assume that sites have storage capacity constraints, and this constraint should be considered in fragment allocation. In other words, the sum of the sizes of all fragments assigned to site $S_j$ must not exceed the storage capacity of site $S_j$ ($CS_j$).

Let $a_{tj}$ denote whether attribute $A_t$ is allocated on site $S_j$ or not (1 if yes, 0 if not), let $l_t$ be the length of attribute $A_t$ in bytes, and let $CA$ be the cardinality of the relation $R$. Then the total size of all fragments on site $S_j$ (in bytes), denoted as $\mu_{S_j}$, is calculated by Equation 9.

$$\mu_{S_j} = CA \sum_{t=1}^{L} l_t \cdot a_{tj} \tag{9}$$

The above-mentioned capacity constraint is then defined by Equation 10.

$$\mu_{S_j} \le CS_j, \quad \forall j;\ 1 \le j \le SN \tag{10}$$

This constraint should be considered both in the formation of the initial allocation pattern (as the initial solution) and during the execution of the SA algorithm to find the optimal allocation pattern.

4.2.2 Fragment Prioritization

Consider a situation in which there is a fragment that is widely accessed by a large number of queries. If such a fragment is allocated to a site with low processing capability, it results in an increase in processing delay, which contradicts the allocation objective (i.e., minimizing the round-trip response time). So, there is a need to compute the access ratio of each fragment (by all queries) and give allocation priority to those with higher access ratios. In order to do so, the following steps should be done:

1. At first, we calculate the number of executions of query $Q_q$, denoted by $MK_q$, which is the sum of the execution frequency of query $Q_q$ over all sites, as expressed by Equation 11.

$$MK_q = \sum_{i=1}^{SN} f_{qi} \tag{11}$$

2. We then calculate the ratio of the executions of query $Q_q$ to the executions of all queries, denoted by $PK_q$, as expressed by Equation 12.

$$PK_q = \frac{MK_q}{\sum_{q=1}^{K} MK_q} \tag{12}$$

3. Next, we need to define whether a query needs a fragment or not. This is done by calculating a matrix called the Query Fragment Matrix (QFM), as expressed by Equation 13, in which the operator $\odot$ denotes the Boolean product.

$$QFM = AAM \odot AFM, \qquad QFM(Q_i, F_j) = \begin{cases} 1, & \text{if query } Q_i \text{ needs fragment } F_j \\ 0, & \text{otherwise} \end{cases} \tag{13}$$

4. In the next step, we compute the access ratio of each fragment $F_i$, denoted by $AR_i$, as shown in Equation 14.

$$AR_i = \sum_{q=1}^{K} QFM[q][i] \times PK_q \tag{14}$$
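To make Equations 11 to 14 concrete, the access ratio computation can be sketched in Python as follows. This is only an illustrative sketch: the function name `access_ratios`, the argument layout (`freq[q][i]` for $f_{qi}$), and the input values are assumptions made for the example and do not come from the paper.

```python
def access_ratios(freq, aam, afm):
    """Return (QFM, AR) given freq[q][i], AAM (K x L) and AFM (L x N)."""
    n_queries, n_attrs, n_frags = len(aam), len(afm), len(afm[0])

    # Equation 11: executions of each query summed over all sites.
    mk = [sum(per_site) for per_site in freq]
    # Equation 12: share of each query in the total number of executions.
    total = sum(mk)
    pk = [m / total for m in mk]
    # Equation 13: Boolean product, QFM[q][j] = 1 if Q_q uses any attribute of F_j.
    qfm = [
        [int(any(aam[q][a] and afm[a][j] for a in range(n_attrs)))
         for j in range(n_frags)]
        for q in range(n_queries)
    ]
    # Equation 14: access ratio of each fragment.
    ar = [sum(qfm[q][j] * pk[q] for q in range(n_queries)) for j in range(n_frags)]
    return qfm, ar


# Hypothetical input: 3 queries, 2 sites, 4 attributes, 2 fragments.
FREQ = [[2, 0], [1, 1], [0, 4]]
AAM = [[1, 1, 0, 0], [0, 1, 1, 0], [0, 0, 1, 1]]
AFM = [[1, 0], [1, 0], [0, 1], [0, 1]]
QFM, AR = access_ratios(FREQ, AAM, AFM)
```

With these inputs the second fragment obtains the higher access ratio (0.75 against 0.5), so it would be given priority when fragments are matched to the faster sites.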


5. Based on the access ratios obtained from the previous step, we are now able to create an access ratio vector of size $N$, denoted by AV, in which each element AV[i] is initialized with the access ratio of fragment $F_i$, i.e., $AV_i = AR_i$. The access ratios can be regarded as the priority of fragments in the allocation process.

6. Finally, in order to create an initial allocation pattern, we first create an ordered list of sites based on their processing speed and then allocate the fragments to the sites based on their priorities: the fragment with the highest priority (i.e., the highest access ratio) is allocated to the site with the highest processing speed. This step is repeated until all fragments are allocated. Note that in order to prevent the bottleneck problem discussed above, the capacity constraints of sites, as described in Section 4.2.1, should be considered.

The result of the above steps is an initial allocation pattern of fragments to the sites, described as FSM (Fragment Site Matrix), as expressed by Equation 15.

\[ FSM(F_i, S_j) = \begin{cases} 1, & \text{if fragment } F_i \text{ is allocated to node } S_j \\ 0, & \text{otherwise} \end{cases} \tag{15} \]

This initial allocation is then fed to the SA algorithm to find an optimal fragment allocation.

4.3 The Replication Scheme

As mentioned previously, fragment replication allows retrieval queries to be processed locally and quickly, which reduces the transmission time and, subsequently, the round-trip response time of query executions.

In our proposed replication method, we replicate fragments to the sites so as to minimize the total transmission cost of replicas. The total transmission cost ($Cost_{tr}$) is expressed by Equation 16.

\[ Cost_{tr} = \sum_{j} \sum_{k} \sum_{i} T_{ij} \, Size_k \, FSM_{ki} \, X_{kj}, \quad \forall i \neq j \tag{16} \]

where $T_{ij}$ is the cost of transmitting a byte from site $S_i$ to site $S_j$, $Size_k$ is the size of fragment $F_k$ in bytes, FSM is the Fragment Site Matrix, which is the initial allocation pattern of fragments to the sites, and $X_{kj}$ is a decision variable, which is 1 or 0, indicating whether fragment $F_k$ should be replicated to site $S_j$ or not.

The replication method is subject to a set of constraints. The first constraint, as expressed by Equation 17, denotes that a fragment is replicated on a site only if there is a need for the fragment on that site. In order to determine whether a fragment is needed on a site, we consider two parameters. The first parameter is $\emptyset_{qj}$, which equals 1 if the execution frequency of a query $Q_q$ on a site $S_j$ (i.e., $f_{qj}$) is greater than zero, and 0 otherwise. The second parameter is $QFM_{qk}$, which, as described by Equation 13, specifies whether a query $Q_q$ needs a fragment $F_k$ or not. The multiplication of these two parameters denotes whether a fragment $F_k$ is needed by query $Q_q$ on site $S_j$. So, if a fragment is not needed by any of the queries running on a site, $X_{kj}$ (which indicates whether fragment $F_k$ should be replicated to site $S_j$) should be zero.

\[ X_{kj} \leq \sum_{q} QFM_{qk} \times \emptyset_{qj}, \quad \forall k, j \tag{17} \]

The second constraint is similar to the one expressed by Equation 10. Replicas are stored in a site as long as the storage capacity constraint of the site is not violated. In other words, the fragment will be replicated on a site if at least one access to the fragment has been made AND there is enough storage on that site to store the replicated fragment.

In order to find an optimal solution for the replication of fragments, the Simulated Annealing (SA) algorithm has been used. The algorithm begins with an initial answer of $X_{kj}$ for the replication of fragments. At every iteration, the cost of the obtained replication solution is calculated (considering the constraints mentioned above) and compared with the previous one. The process is continued until a solution that minimizes $Cost_{tr}$ is obtained.

5. NUMERICAL EXPERIMENTS

In this section, we evaluate the performance of our proposed schemes. First, we explain the experimentation setup, the scenarios we consider for performance evaluation, and the datasets we used in experiments in Section 5.1. We then make an initial analysis of our proposed fragmentation scheme in Section 5.2, based on the cost model mentioned in Section 3.3. Then, we compare our proposed schemes with other methods in Section 5.3.

5.1 Experiment Setup

We implemented the proposed schemes in Matlab 8.1 and conducted a series of experiments to evaluate their performance. For these experiments, we randomly created 100 instances and grouped them into five different scenarios S1 to S5 (each having 20 instances) such that the instances in each scenario are similar to each other. We aimed to create scenarios with different loads (i.e., number of queries) and capacities (i.e., number of sites), from the lowest (S1) to the highest (S5). Table 3 shows the data used to generate the instances, which are the variable coefficients for the expressions in Section 4. It is worth mentioning that the data values presented in this table are typical values that can be found in real cases.

5.2 Cost Analysis

Figure 2 demonstrates the behavior of the two components of the query processing cost, as mentioned in Equations 4 and 5 (i.e., local irrelevant attribute access cost and remote relevant attribute access cost) as a function of the number of fragments for two scenarios S1 and S2. As demonstrated in Figure 2.a, the increase in the number of fragments results in the reduction of irrelevant local attribute access cost. This is because when the fragments are few, they each contain a higher number of local

vol 35 no 2 March 2020 105


ENHANCED SCHEMES FOR DATA FRAGMENTATION, ALLOCATION, AND REPLICATION IN DISTRIBUTED DATABASE SYSTEMS



Table 3 The data values for generating the instances.
Columns: L = number of attributes; K = number of queries; SN = number of sites; f_qi = execution frequency; AAM; C_j = processing capacity; CS_j = storage capacity; C_ij = transmission speed; T_ij = transmission cost; l_t = length of attribute; 1/M_q = mean query length; 1/M_R = mean response length.

Scenario  L   K   SN  f_qi    AAM  C_j     CS_j     C_ij           T_ij  l_t   1/M_q  1/M_R
S1        6   4   3   2–60    0–1  50–500  100–200  100000–400000  0–25  4–12  1000   5500
S2        5   5   3   5–50    0–1  50–500  110–245  100000–400000  0–25  4–12  1000   5500
S3        5   7   3   5–150   0–1  50–500  100–312  100000–400000  0–25  4–12  1000   5500
S4        10  8   4   0–19    0–1  50–500  90–230   100000–400000  0–25  4–12  1000   5500
S5        20  15  6   5–150   0–1  50–500  122–400  100000–400000  0–25  4–12  1000   5500

[Figure 2, panels for scenarios S1 and S2, proposed scheme: (a) evolution of the irrelevant local attribute access cost (×10^4) versus the number of fragments; (b) evolution of the relevant remote attribute access cost (×10^4) versus the number of fragments.]

Figure 2 The impact of the number of fragments on attribute access costs.

attributes and so, the local attribute access cost will be high. On the other hand, when the number of fragments increases, each fragment contains fewer attributes. So, when a query accesses a fragment, it encounters fewer irrelevant attributes. As the number of fragments increases, the number of irrelevant attributes keeps decreasing until it reaches zero, as shown in Figure 2.a. In contrast, an increase in the number of fragments leads to an increase in the number of relevant remote attributes, thus increasing the relevant remote attribute access cost, as illustrated in Figure 2.b.

5.3 Evaluation Result

In the following, we compare the performance of our proposed model with other related models for the different S1 to S5 scenarios.

5.3.1 S1 Experiment Results

To evaluate the proposed fragmentation and allocation schemes, we consider the cost model described in Section 3.3. First, in order to obtain the optimal number of fragments, we evaluate the irrelevant local attribute access cost and relevant remote attribute access cost for different numbers of fragments, as illustrated in Figure 3.a. The optimal number of fragments is obtained where the remote and local attribute access cost curves meet, which is 2 for this scenario. Figure 3.b confirms that the lowest query processing cost is obtained when the number of fragments is 2.

Next, we compare the performance of the proposed model with the VFA-RT model [14–16] based on the round-trip response time, expressed in Equation 8. Recall from Section 4.2 that the aim of the allocation process is to allocate fragments to sites in a way that minimizes the round-trip response time of queries. Figure 3.c illustrates the round-trip response time obtained from running the experiment on 20 instances of the S1 scenario. As this figure shows, our proposed model achieves a lower round-trip response time. On average, the proposed approach results in a 26% reduction of round-trip response time, as compared to the VFA-RT method, for the S1 scenario.

5.3.2 S2 Experiment Results

Figure 4 illustrates the query processing cost and the round-trip response time for the second scenario S2, respectively. As Figure 4.a shows, the minimum amount of query processing

[Figure 3, scenario S1: (a) irrelevant local and relevant remote attribute access cost (×10^4) versus the number of fragments; (b) total query processing cost (×10^4) of the proposed scheme versus the number of fragments; (c) round-trip response time (s, ×10^-3) versus instance number for the proposed scheme and VFA-RT.]

Figure 3 S1 Experimental Results.

cost is obtained when the fragment number equals 2. Figure 4.b compares the performance of the proposed model and the VFA-RT model based on the round-trip response time for different instance numbers in S2. Again, the proposed model achieves better outcomes, resulting in a 30% reduction of the average round-trip response time.

5.3.3 S3 Experiment Results

Figure 5.a illustrates the query processing cost and the optimal number of fragments for scenario S3, which is 2. Figure 5.b compares the performance of the proposed schemes with VFA-RT based on the round-trip response time of the queries for 20 instances of S3. As shown in this figure, applying the proposed fragmentation and allocation scheme results in a 15% reduction of round-trip response time.

5.3.4 S4 Experiment Results

Figure 6.a illustrates the query processing cost obtained with different fragment numbers. As this figure shows, the optimal number of fragments, which leads to the minimum cost, equals 3. Figure 6.b demonstrates the comparison results between the proposed allocation model and the VFA-RT model based on the round-trip response time for different instance numbers. As shown in this figure, the proposed allocation scheme outperforms the VFA-RT model in terms of round-trip response time. On average, our proposed model achieves a 27% reduction in round-trip response time, compared to the VFA-RT model.

5.3.5 S5 Experiment Results

Figure 7.a shows the processing cost of queries for different numbers of fragments in scenario S5. According to Table 3, the numbers of queries and attributes in S5 are the highest among all scenarios. The optimal number of fragments is 3. Figure 7.b demonstrates the evolution of round-trip response time for 20 instances in S5. A comparison between the results of the proposed method and VFA-RT indicates that the proposed approach yields a 21% reduction in the average round-trip response time.

5.3.6 Query Processing Time Evaluation

As mentioned previously, fragmentation aims to partition attributes into fragments in a way that minimizes the processing cost of the queries needing those attributes. In Table 4, we compare the query processing cost of the proposed fragmentation scheme

[Figure 4, scenario S2: (a) total query processing cost (×10^4) of the proposed scheme versus the number of fragments; (b) round-trip response time (s) versus instance number for the proposed scheme and VFA-RT.]

Figure 4 S2 Experimental Results.

[Figure 5, scenario S3: (a) total query processing cost (×10^4) of the proposed scheme versus the number of fragments; (b) round-trip response time (s) versus instance number for the proposed scheme and VFA-RT.]

Figure 5 S3 Experimental Results.
[Figure 6, scenario S4: (a) total query processing cost (×10^4) of the proposed scheme versus the number of fragments; (b) round-trip response time (s, ×10^-3) versus instance number for the proposed scheme and VFA-RT.]

Figure 6 S4 Experimental Results.

[Figure 7, scenario S5: (a) total query processing cost of the proposed scheme versus the number of fragments; (b) round-trip response time (s) versus instance number for the proposed scheme and VFA-RT.]

Figure 7 S5 Experimental Results.

[Figure 8, bar chart: round-trip response time (s) for scenarios S1 to S5, without replication and with replication.]

Figure 8 Impact of the Proposed Replication Scheme.

Table 4 Query Processing Cost for the Proposed Fragmentation Scheme and VFA-RT.

          Query Processing Cost (×10^4)
Scenario  VFA-RT   Proposed Fragmentation Scheme
S1        1.2144   0.9564
S2        0.3145   0.2993
S3        1.3297   0.8348
S4        0.9564   0.6846
S5        1.9821   1.4901
and the VFA-RT model for all S1 to S5 scenarios. As shown in this table, the proposed fragmentation scheme performs better than the VFA-RT model and achieves a lower query processing cost in every scenario.

5.4 Impact of the Proposed Replication Scheme

As mentioned previously in Section 4.3, once fragments are allocated to the sites, the replication scheme replicates some fragments, based on some criteria, with the aim of local execution of queries and reducing both the communication cost between sites and the queries' execution time. In order to observe the impact of such replication, we evaluate the round-trip response time for the different S1 to S5 scenarios in two situations: one where the replication scheme is applied and one without replication. The results shown in Figure 8 demonstrate that replicating the fragments causes a substantial reduction in round-trip response time for all scenarios.
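The replication criteria referred to here are the two constraints of Section 4.3. The following is a minimal sketch, not the authors' Matlab implementation, of how a candidate replication matrix X could be checked against Equation 17 and a storage limit, and how Equation 16 could be evaluated. The function names and the toy data are hypothetical, and counting both allocated fragments and replicas against a site's capacity is an assumption, since the exact form of the capacity constraint (Equation 10) is only referenced in this part of the paper.

```python
from typing import List

Matrix = List[List[int]]


def need_bound(qfm: Matrix, freq: Matrix) -> Matrix:
    """Right-hand side of Equation 17: bound[k][j] = sum_q QFM[q][k] * phi[q][j]."""
    n_frags, n_sites = len(qfm[0]), len(freq[0])
    bound = [[0] * n_sites for _ in range(n_frags)]
    for q, row in enumerate(qfm):
        for k in range(n_frags):
            for j in range(n_sites):
                phi = 1 if freq[q][j] > 0 else 0  # query q is executed on site j
                bound[k][j] += row[k] * phi
    return bound


def is_feasible(x: Matrix, qfm: Matrix, freq: Matrix, fsm: Matrix,
                size: List[int], capacity: List[int]) -> bool:
    """Check Equation 17 and an assumed per-site storage limit."""
    bound = need_bound(qfm, freq)
    n_frags, n_sites = len(x), len(x[0])
    for k in range(n_frags):
        for j in range(n_sites):
            if x[k][j] > bound[k][j]:  # replica placed where no query needs it
                return False
    for j in range(n_sites):
        # Assumption: allocated fragments and replicas share the same storage.
        used = sum(size[k] for k in range(n_frags) if fsm[k][j] or x[k][j])
        if used > capacity[j]:
            return False
    return True


def transmission_cost(x: Matrix, fsm: Matrix, size: List[int],
                      t: List[List[float]]) -> float:
    """Equation 16: sum over j, k, i (i != j) of T[i][j] * Size[k] * FSM[k][i] * X[k][j]."""
    n_frags, n_sites = len(x), len(x[0])
    return sum(
        t[i][j] * size[k] * fsm[k][i] * x[k][j]
        for j in range(n_sites)
        for k in range(n_frags)
        for i in range(n_sites)
        if i != j
    )


# Hypothetical toy instance: 2 queries, 2 fragments, 2 sites.
QFM = [[1, 0], [1, 1]]    # Q0 needs F0; Q1 needs F0 and F1
FREQ = [[5, 0], [0, 3]]   # Q0 runs on S0 only, Q1 runs on S1 only
FSM = [[1, 0], [1, 0]]    # both fragments are initially allocated to S0
SIZE = [100, 50]
CAPACITY = [200, 120]
T = [[0, 2], [2, 0]]
X_BOTH = [[0, 1], [0, 1]]  # replicate F0 and F1 to S1
X_ONE = [[0, 1], [0, 0]]   # replicate only F0 to S1
```

In this toy instance, replicating both fragments to S1 is needed by the queries but exceeds the assumed capacity of S1, whereas replicating only F0 passes both checks. An SA loop of the kind described in Section 4.3 would call `is_feasible` and `transmission_cost` on each candidate X.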

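The round-trip results above all start from the priority-based initial allocation of Section 4.2.2 (Equations 12 to 15). The following is a rough sketch of those steps under stated assumptions: `exec_counts` stands for the per-query execution totals MK_q, the helper names and the toy data are hypothetical, and each fragment is placed on the fastest site that still has room, which is one possible reading of how the capacity constraint is combined with the speed ordering.

```python
from typing import List


def boolean_product(aam: List[List[int]], afm: List[List[int]]) -> List[List[int]]:
    """Equation 13: QFM[q][f] = 1 if query q uses any attribute stored in fragment f."""
    n_attrs, n_frags = len(afm), len(afm[0])
    return [
        [int(any(row[a] and afm[a][f] for a in range(n_attrs))) for f in range(n_frags)]
        for row in aam
    ]


def access_ratios(exec_counts: List[float], qfm: List[List[int]]) -> List[float]:
    """Equations 12 and 14: PK_q = MK_q / sum(MK), AR_i = sum_q QFM[q][i] * PK_q."""
    total = sum(exec_counts)
    pk = [m / total for m in exec_counts]
    n_frags = len(qfm[0])
    return [sum(qfm[q][i] * pk[q] for q in range(len(qfm))) for i in range(n_frags)]


def initial_allocation(ar: List[float], sizes: List[int], speeds: List[float],
                       capacities: List[int]) -> List[List[int]]:
    """Steps 5 and 6: build FSM by placing high-priority fragments on fast sites."""
    n_frags, n_sites = len(ar), len(speeds)
    sites = sorted(range(n_sites), key=lambda j: speeds[j], reverse=True)
    frags = sorted(range(n_frags), key=lambda i: ar[i], reverse=True)
    free = list(capacities)
    fsm = [[0] * n_sites for _ in range(n_frags)]
    for i in frags:
        for j in sites:
            if sizes[i] <= free[j]:  # assumed capacity rule: fastest site with room
                fsm[i][j] = 1
                free[j] -= sizes[i]
                break
        else:
            raise ValueError(f"no site has room for fragment {i}")
    return fsm


# Hypothetical toy instance: 3 queries, 3 attributes, 2 fragments, 2 sites.
AAM = [[1, 1, 0], [0, 1, 1], [0, 0, 1]]   # queries x attributes
AFM = [[1, 0], [1, 0], [0, 1]]            # attributes x fragments
EXEC_COUNTS = [30, 10, 10]
QFM_EXAMPLE = boolean_product(AAM, AFM)
AR_EXAMPLE = access_ratios(EXEC_COUNTS, QFM_EXAMPLE)
FSM_EXAMPLE = initial_allocation(AR_EXAMPLE, sizes=[100, 50],
                                 speeds=[1.0, 3.0], capacities=[200, 120])
```

Here the fragment with the higher access ratio goes to the faster site, and the second fragment falls back to the slower site because the faster one no longer has room. The resulting matrix would then serve as the starting point of the SA search rather than a random pattern.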


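The relative savings implied by Table 4 and by the per-scenario round-trip reductions reported in Sections 5.3.1 to 5.3.5 can be recomputed with a few lines of arithmetic. The sketch below uses only numbers quoted in the paper. How the paper aggregates its summary percentages is not stated, so the unweighted mean over scenarios is an assumption, and the variable names are illustrative.

```python
# Query processing cost (x10^4) from Table 4: (VFA-RT, proposed scheme).
TABLE4 = {
    "S1": (1.2144, 0.9564),
    "S2": (0.3145, 0.2993),
    "S3": (1.3297, 0.8348),
    "S4": (0.9564, 0.6846),
    "S5": (1.9821, 1.4901),
}

# Round-trip response time reductions (%) reported in Sections 5.3.1 to 5.3.5.
ROUND_TRIP_REDUCTION = {"S1": 26, "S2": 30, "S3": 15, "S4": 27, "S5": 21}


def percent_reduction(baseline: float, proposed: float) -> float:
    """Relative saving of the proposed scheme over the baseline, in percent."""
    return 100.0 * (baseline - proposed) / baseline


COST_REDUCTION = {s: percent_reduction(b, p) for s, (b, p) in TABLE4.items()}

# Assumption: a simple unweighted mean over the five scenarios.
MEAN_COST_REDUCTION = sum(COST_REDUCTION.values()) / len(COST_REDUCTION)
MEAN_ROUND_TRIP_REDUCTION = sum(ROUND_TRIP_REDUCTION.values()) / len(ROUND_TRIP_REDUCTION)

for scenario, value in COST_REDUCTION.items():
    print(f"{scenario}: {value:.1f}% lower query processing cost")
print(f"mean cost reduction: {MEAN_COST_REDUCTION:.1f}%")
print(f"mean round-trip reduction: {MEAN_ROUND_TRIP_REDUCTION:.1f}%")
```

The per-scenario cost savings vary widely, from under 5% in S2 to over 37% in S3, so a single average hides a good deal of variation between scenarios.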

6. CONCLUSION

With the growth of information technology and computer networks, there is a vital need for optimal design of distributed databases. Three main challenges in the design of a distributed database are fragmentation, data allocation, and replication. In this article, we present new approaches for vertical fragmentation, allocation and replication for distributed databases. The proposed vertical fragmentation scheme partitions the data into fragments such that bonded attributes (i.e., those accessed together by the queries) are located in a single fragment, thus reducing the access cost to those attributes and minimizing the query processing time. The allocation scheme utilizes the benefits of the simulated annealing algorithm to minimize the round-trip response time of queries running on the sites. We also propose a replication scheme that performs partial replication of fragments to increase the local execution of queries. We compare the performance of our proposed schemes with other related work, considering different scenarios. Results show that the proposed fragmentation scheme can reduce the query processing cost by 15% as compared to the VFA-RT model. Moreover, the allocation scheme achieves a 23% reduction (on average) in the round-trip response time of queries. Considering the replication scheme also results in a 10% reduction in the round-trip response time, as compared to the situation where replication has not been considered.

REFERENCES

1. Özsu, M.T. and P. Valduriez, Principles of distributed database systems. 2011: Springer Science & Business Media.
2. Nashat, D. and A.A. Amer, A Comprehensive Taxonomy of Fragmentation and Allocation Techniques in Distributed Database Design. ACM Computing Surveys, 2018. 51(1).
3. Iacob, N., Data replication in distributed environments. Annals-Economy Series, 2010. 4: p. 193–202.
4. Tamhankar, A.M. and S. Ram, Database fragmentation and allocation: an integrated methodology and case study. IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans, 1998. 28(3): p. 288–305.
5. Apers, P.M., Data allocation in distributed database systems. ACM Transactions on Database Systems (TODS), 1988. 13(3): p. 263–304.
6. Mahi, M., O.K. Baykan, and H. Kodaz, A new approach based on particle swarm optimization algorithm for solving data allocation problem. Applied Soft Computing, 2018. 62: p. 571–578.
7. Darabant, S., et al., A linear approach to distributed database optimization using data reallocation. in 2017 25th International Conference on Software, Telecommunications and Computer.
8. Amer, A.A. and H.I. Abdalla, An integrated design scheme for performance optimization in distributed environments. in International Conference on Education and e-Learning Innovations (ICEELI). 2012. p. 1–8.
9. Abdalla, H.I., A.A. Amer, and H. Mathkour, A novel vertical fragmentation, replication and allocation model in DDBSs. J. Universal Computer Science, 2014. 20(10): p. 1469–1487.
10. Raouf, A.E.A., N.L. Badr, and M. Tolba. An optimized scheme for vertical fragmentation, allocation and replication of a distributed database. in 2015 IEEE Seventh International Conference on Intelligent Computing and Information Systems (ICICIS). 2015. IEEE.
11. Sewisy, A., A. Amer, and H. Abdalla, A Novel Query-Driven Clustering-Based Technique for Vertical Fragmentation and Allocation in Distributed Database Systems. International Journal on Semantic Web and Information Systems, 2017. 13(2): p. 27–54.
12. Abdalla, H. and A.M. Artoli, Towards an Efficient Data Fragmentation, Allocation, and Clustering Approach in a Distributed Environment. Information, 2019. 10: 112.
13. Amer, A.A., M.H. Mohamed, and K.A. Al Asri, ASGOP: An Aggregated Similarity-Based Greedy-Oriented Approach for Relational DDBSs Design. Heliyon, 2019 (in press).
14. Pazos, R.A., et al., Minimizing roundtrip response time in distributed databases with vertical fragmentation. Journal of Computational and Applied Mathematics, 2014. 259: p. 905–913.
15. Vázquez, G. and J. Pérez. Modeling the nonlinear nature of response time in the vertical fragmentation design of distributed databases. in International Symposium on Distributed Computing and Artificial Intelligence 2008 (DCAI 2008). 2009. p. 605–612. Springer.
16. Vázquez, G. and J. Pérez. Vertical fragmentation design of distributed databases considering the nonlinear nature of roundtrip response time. in International Conference on Knowledge-Based and Intelligent Information and Engineering Systems. 2010. Springer.
17. Navathe, S., et al., Vertical partitioning algorithms for database design. ACM Transactions on Database Systems (TODS), 1984. 9(4): p. 680–710.
18. Abuelyaman, E.S., An optimized scheme for vertical partitioning of a distributed database. International Journal of Computer Science and Network Security (IJCSNS), 2008. 8(1): p. 310–316.
19. Chakravarthy, S., et al., An objective function for vertically partitioning relations in distributed databases and its analysis. Distributed and Parallel Databases, 1994. 2(2): p. 183–207.
20. Muthuraj, J., et al. A formal approach to the vertical partitioning problem in distributed database design. in Proceedings of the Second International Conference on Parallel and Distributed Information Systems. 1993. p. 28–34. IEEE Computer Society Press.
21. Navathe, S., K. Karlapalem, and M. Ra, A mixed fragmentation methodology for initial distributed database design. Journal of Computer and Software Engineering, 1995. 3(4): p. 395–426.
22. Hauglid, J.O., et al., DYFRAM: dynamic fragmentation and replica management in distributed database systems. Distributed and Parallel Databases, 2010. 28(2–3): p. 157–185.
23. Pérez, J., et al. Vertical fragmentation and allocation in distributed databases with site capacity restrictions using the threshold accepting algorithm. in Mexican International Conference on Artificial Intelligence. 2000. p. 75–81. Springer.
24. Gu, X., W. Lin, and B. Veeravalli, Practically realizable efficient data allocation and replication strategies for distributed databases with buffer constraints. IEEE Transactions on Parallel and Distributed Systems, 2006. 17(9): p. 1001–1013.
25. Song, S.-K. and N. Gorla, A genetic algorithm for vertical fragmentation and access path selection. The Computer Journal, 2000. 43(1): p. 81–93.
26. Ma, H., K.-D. Schewe, and M. Kirchberg. A heuristic approach to vertical fragmentation incorporating query information. in Proceedings of the 17th Australasian Database Conference, 2006. (49): p. 183–192. IEEE.
27. Adl, R.K. and S.M.T.R. Rankoohi, A new ant colony optimization based algorithm for data allocation problem in distributed databases. Knowledge and Information Systems, 2009. 20: p. 349–373. DOI 10.1007/s10115-008-0182-y.




28. Goli, M. and S.M.T.R. Rankoohi, A new vertical fragmentation algorithm based on ant collective behavior in distributed database systems. Knowledge and Information Systems, 2012. 30(2): p. 435–455.
29. Navathe, S.B. and M. Ra, Vertical partitioning for database design: A graphical algorithm. in Proceedings of the ACM SIGMOD International Conference on Management of Data. 1989. p. 440–450.
30. Khan, S.U. and I. Ahmad, Replicating data objects in large distributed database systems: an axiomatic game theoretic mechanism design approach. Distributed and Parallel Databases, 2010. 28(2–3): p. 187–218.

