IJERT Efficient Fragmentation and Alloca
ISSN: 2278-0181
Vol. 2 Issue 1, January- 2013
cluster and the fragments are allocated to the cluster. Also, the static allocation of fragments provides only a limited response to changes in workload. Hence dynamic methods are adopted for fragmentation of both structured and unstructured databases. This reduces the movement of data and also improves the overall system performance.

Keywords: Distributed database, fragmentation, allocation of fragments, cluster of sites.

1. Introduction

Distributed database systems comprise a single logical database that is partitioned and distributed across various sites in a communication network. Database technology has become prevalent in most business organizations. Distributed Database Systems (DDS) are becoming more affordable and useful. A DDS typically consists of a number of distinct yet interrelated databases (fragments) located at different geographic sites which can communicate through a network. Typically, such a system is managed by a distributed database management system (DDBMS). Each site of the DDS has its own hardware and is capable of autonomous operation. A site participates in the execution of global transactions involving databases at two or more remote sites.

environment to enhance system performance. Too many copies of a fragment tend to slow down updates while enhancing the performance of read-only queries. Too few copies of a fragment will decrease the availability of data and the performance of read-only queries.

Concurrency control: Appropriate techniques are required to synchronize the various copies of a fragment. These techniques must take into account the requirements on the concurrency of data held in the copies and the existence of multiple users.

Query processing: Since a query may access multiple fragments, and since each fragment may have multiple copies, query optimization becomes an important issue.

In addition, there are other design decisions such as the configuration of the network connecting the database sites, allocation of storage capacity, and security. Many of the decisions outlined above are not independent of each other. For example, fragmentation and allocation are very closely related, with each decision affecting the other. Fragmentation and allocation typically use similar input parameters (e.g., a description of user queries, updates, data access frequencies, communication cost, and relationships among data objects). In distributed databases, the communication costs can be reduced by partitioning
www.ijert.org 1
International Journal of Engineering Research & Technology (IJERT)
database tables horizontally into fragments, and allocating these fragments to the sites where they are most frequently accessed. The aim is to make most data accesses local, and avoid remote reads and writes. The read cost can be further reduced by the replication of fragments when beneficial. Obviously, important design challenges in fragmentation and replication are how to fragment, when to replicate fragments, and how to allocate the (replicated) fragments.

Previous work on data allocation has focused on static fragmentation based on analyzing queries. These techniques are only useful in contexts where read queries dominate. However, in many application areas, workloads are very dynamic with frequent changes in access patterns at different sites. One common reason for this is that their data usage often consists of two separate phases: a first phase where writing of data dominates (for instance during simulation when results are written), and a subsequent second phase when a subset of the data, for example results, is mostly read. The dynamism of the overall access pattern is further increased by the different instances of the applications executing in different phases at different sites. Because of dynamic workloads, static/manual fragmentation and replication may not always be optimal. Instead, the fragment and replication management should be dynamic and automatic, i.e., changes in access patterns should result in refragmentation and reallocation of fragments when beneficial, as well as in the creation or removal of fragment replicas.

The primary concern of this paper is to describe an approach to the fragmentation of structured data, and the secondary concern is to fragment unstructured data. Furthermore, allocation of fragments to clusters of sites is carried out in order to reduce the communication cost.

The rest of the paper is organized as follows: Section 2 describes the fragmentation concept on structured data. Section 3 describes the fragmentation concept on unstructured data. Section 4 describes the allocation of fragments to clusters of sites rather than to individual sites. Section 5 gives some concluding remarks.

2. Fragmentation on structured data

A. Architecture

The Distributed Database System (DDBS) must be capable of supporting more complex and more sophisticated functionality. Networks have several types of topologies that define how nodes are physically and logically connected. One of the popular topologies used in DDBS, the client-server architecture, is described as follows: the principal idea of this architecture is to define specialized servers with specific functionalities. The servers are connected to a network of clients that can access the services of the servers. Stations (servers or clients) can have different complexities, ranging from a diskless client to a combined server-client machine. The DBMS functions are divided between servers and clients using different approaches. The client refers to a data distribution dictionary to know how to decompose the global query into multiple local queries.

Fig.2.1 Client Server Architecture

The interactions are as follows:
1. Client parses the user's query and decomposes it into independent site queries.
2. Client forwards each independent query to the corresponding server by consulting the data distribution dictionary.
3. Each server processes the local query and sends back the resulting relation to the client.
4. Client combines (manually by the user, or automatically by client abstract) the results of the subqueries, and does more processing if needed to get to the final target result.

B. Fragmentation

The primary concern of distributed database system design is to perform fragmentation of the relations in the case of relational databases, or of the classes in the case of object oriented databases, and allocation of the fragments to different sites of the distributed system. Fragmentation is a design technique to divide a single relation or class of a database into two or more partitions such that combining the partitions yields the original database without any loss of information. This reduces the amount of irrelevant data accessed by the applications, thus reducing the number of disk accesses. Fragmentation is classified into horizontal, vertical and mixed/hybrid.
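The requirement that fragments recombine into the original relation without loss can be illustrated with a minimal Python sketch. The `employees` relation, its attribute names, and all helper functions below are hypothetical and chosen only for illustration; they are not taken from the paper. Horizontal fragments are rebuilt by union, and vertical fragments (each carrying the primary key) are rebuilt by a join on that key.

```python
# Hypothetical relation: a list of rows (dicts) whose primary key is "id".
employees = [
    {"id": 1, "name": "Asha", "site": "S1", "salary": 500},
    {"id": 2, "name": "Ravi", "site": "S2", "salary": 650},
    {"id": 3, "name": "Mina", "site": "S1", "salary": 720},
]

def horizontal_fragments(rows, key_fn):
    """Split rows into disjoint fragments, one per value of key_fn(row)."""
    fragments = {}
    for row in rows:
        fragments.setdefault(key_fn(row), []).append(row)
    return fragments

def vertical_fragment(rows, key, attributes):
    """Project rows onto the primary key plus the given attributes."""
    return [{a: row[a] for a in (key, *attributes)} for row in rows]

def rebuild_from_horizontal(fragments, key):
    """Union of the horizontal fragments, ordered by primary key."""
    merged = [row for frag in fragments.values() for row in frag]
    return sorted(merged, key=lambda row: row[key])

def rebuild_from_vertical(left, right, key):
    """Join two vertical fragments on the shared primary key."""
    right_by_key = {row[key]: row for row in right}
    return [{**row, **right_by_key[row[key]]} for row in left]

by_site = horizontal_fragments(employees, lambda row: row["site"])
names = vertical_fragment(employees, "id", ["name", "site"])
pay = vertical_fragment(employees, "id", ["salary"])

# Both reconstructions should give back the original relation.
horizontal_ok = rebuild_from_horizontal(by_site, "id") == employees
vertical_ok = rebuild_from_vertical(names, pay, "id") == employees
print(horizontal_ok, vertical_ok)
```

Keeping the primary key in every vertical fragment is what makes the join-based reconstruction possible; without it the two projections could not be matched row by row.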
Vertical fragmentation (VF) allows a relation or class to be partitioned into disjoint sets of columns or attributes except the primary key. Each partition must include the primary key attribute(s) of the table. This arrangement can make sense when different sites are responsible for processing different functions involving an entity.

The objective of vertical fragmentation is to partition a relation into a set of smaller relations so that many of the applications will run on only one fragment.
a. Vertical fragmentation of a relation R produces the fragments R1, R2, etc., each of which contains a subset of R's attributes.
b. Vertical fragmentation is defined using the projection operation of the relational algebra: $R_i = \Pi_{L_i}(R)$, where $L_i$ is the subset of R's attributes kept in fragment $R_i$.

iii. Hybrid Fragmentation

The combination of horizontal and vertical fragmentation is mixed or hybrid fragmentation (MF). In this type of fragmentation scheme, the relation is divided into arbitrary blocks based on the transactions. Each fragment can be allocated to a specific site. This type of fragmentation is the most complex one, and needs more management. In most cases simple horizontal or vertical fragmentation of a

on knowledge of the past. In our approach, this means detecting replica access patterns, i.e., which fragments are accessed by which sites. This is performed by recording replica accesses. Because the access patterns are recorded continuously, old data may be discarded periodically so that statistics only include recent accesses. In this way, the system can adapt to changes in access patterns. Statistics are stored using histograms.

3. Fragmentation on unstructured data

The world-wide web (WWW) is often considered to be the world's largest database, and the eXtensible Markup Language (XML) is then considered to provide its data model. This raises the question of how to obtain a suitable distribution design for XML documents. Here horizontal and vertical fragmentation techniques are generalised from the relational data model to XML. Furthermore, splitting is introduced as a third kind of fragmentation. Then it is shown how relational techniques for defining reasonable fragments can be applied to the case of XML.

In this section, XML is described as a data model. Extended DTDs (Document Type Definitions) are used to define schemata. Equivalently, XML-Schema is
used, but extensions would be needed. XML documents are then considered as databases over such schemata. Queries are expressed in an extension of XML-QL. Equivalently, XQuery could be used, but again extensions would be needed in both cases.

A. Schemata and Document Type Definitions

A document type definition (DTD) may be considered as some kind of schema. Within such a DTD the regular expressions can be considered as some form of typing. We will make this view explicit and introduce a typed version of XML. The types to be considered are given by the following abstract syntax:

$$t = b \mid \epsilon \mid t^{0} \mid t^{*} \mid t^{+} \mid t_1, \ldots, t_n \mid t_1 \oplus \cdots \oplus t_n$$

Here, b represents as usual a collection of base types. Among these base types it is assumed to have a type ID, i.e., a type representing a not further specified set of identifiers. There may be other base types such as INT, STRING, URL for integers, character strings and URL addresses, respectively. $\epsilon$ is a type representing just an empty sequence or tuple. $t^{*}$ and $t^{+}$ represent arbitrary or non-empty sequences, respectively, with values of type t. $t^{0}$ represents the values of type t or the empty sequence. $t_1, \ldots, t_n$ represents sequences or tuples. Finally, $t_1 \oplus \cdots \oplus t_n$ represents a disjoint union.

<db><wine w-id=$I producer=$P price=$Q>
<name>$M</>
<rest>$R</>
</></> IN "XYZ"
.
.
</db>

In the above example, splitting is carried out to separate the names of wines from the wines themselves, where XYZ is a URL.

ii. Horizontal fragmentation

Two versions of generalising horizontal fragmentation from the RDM to the object oriented case are considered. The first version addresses horizontal fragmentation on the level of classes, whereas the second one addresses the problem on the level of bulk types inside the structure definition of the classes. However, in the end it turns out that the second version only leads to fragmentation if it is followed by a splitting fragmentation. In this case, the same results can be obtained by applying the splitting fragmentation. The horizontal fragmentation can be achieved by the following selection query.

Example selection query for horizontal fragmentation:
<db>
CONSTRUCT
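Independently of the XML-QL syntax, the idea of horizontal fragmentation by selection can be sketched in Python with the standard `xml.etree.ElementTree` module. The document content, the price threshold, and the function name `horizontal_fragments` are assumptions made for illustration only; the sketch simply partitions the `wine` elements of a `db` root into two fragments according to a selection predicate.

```python
import xml.etree.ElementTree as ET

# Hypothetical document, loosely modelled on the wine example.
SOURCE = """
<db>
  <wine w-id="w1" producer="P1" price="12"><name>Alpha</name></wine>
  <wine w-id="w2" producer="P2" price="30"><name>Beta</name></wine>
  <wine w-id="w3" producer="P1" price="45"><name>Gamma</name></wine>
</db>
"""

def horizontal_fragments(root, predicate):
    """Split the children of root into two fragments by a selection predicate.

    Each fragment is a new element with the same tag as root, so both
    fragments conform to the same overall structure as the source.
    """
    selected, rest = ET.Element(root.tag), ET.Element(root.tag)
    for child in root:
        (selected if predicate(child) else rest).append(child)
    return selected, rest

root = ET.fromstring(SOURCE)
# Assumed selection condition: wines priced below 20 form one fragment.
cheap, expensive = horizontal_fragments(root, lambda w: float(w.get("price")) < 20)

cheap_ids = [w.get("w-id") for w in cheap]
expensive_ids = [w.get("w-id") for w in expensive]
print(cheap_ids, expensive_ids)
```

Because the predicate and its negation cover every `wine` element exactly once, the two fragments are disjoint and their union contains all elements of the source document.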
database queries that access the applications on the distributed database sites should be performed effectively. Therefore the fragments that are accessed by queries need to be allocated to the distributed database sites so as to reduce the communication cost during the execution of the applications and the handling of their operational processing. A method for grouping the sites is proposed to optimize the cost of the fragment allocation functions and to reduce the query processing time by allocating the fragments to clusters of sites instead of allocating the fragments site by site.

Clustering sites

Clustering is the process of grouping sites according to a Communication Cost Range (CCR) to increase the system I/O performance and reduce storage overheads. Clustering helps in reducing the communication costs between the sites during the process of data allocation. Two sites (Si, Sj) are grouped in one cluster if the communication cost between them is less than or equal to the CCR, i.e., the number of communication units allowed as the maximum difference of the communication cost between sites to be grouped in the same cluster. This number is determined by the network of the DDBS.

the fragment Fi at the cluster Cj.

$$\mathrm{CLUsum}(T_k, F_i, C_j) = \mathrm{CLU}(T_k, F_i, C_j) \times \mathrm{FREQLU}(T_k, F_i, C_j)$$

The cost of space (CSP) occupied by the fragment Fi in the cluster Cj times the size of the fragment Fi (in bytes).

$$\mathrm{CSPsum}(T_k, F_i, C_j) = \mathrm{CSP}(T_k, F_i, C_j) \times \mathrm{Fsize}(T_k, F_i)$$

Remote updates (CRU) sent from other clusters Cx: the average cost of local updates at cluster Cj times the average frequency of remote updates (FREQRU) issued by the transaction Tk to the fragment Fi for each cluster other than the current one.

$$\mathrm{CRUsum}(T_k, F_i, C_j) = \mathrm{CLU}(T_k, F_i, C_j) \times \mathrm{FREQRU}(T_k, F_i, C_j)$$

Remote communications (CRC) from other clusters Cx: the update ratio (Uratio) (Unit Update/Unit Communication) times the average frequency of updates issued by the transaction Tk to the fragment Fi at the cluster Cj times the average cost of communication between clusters other than the current one.

$$\mathrm{CRCsum}(T_k, F_i, C_j) = \mathrm{Uratio} \times \mathrm{FREQLU}(T_k, F_i, C_j) \times \mathrm{CRC}(T_k, F_i, C_j)$$
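The clustering rule described in this section (two sites share a cluster when their communication cost does not exceed the CCR) can be sketched in Python as follows. The text only states the pairwise condition; the greedy strategy used here, in which a site joins the first cluster whose members are all within the CCR, is an assumption of this sketch, as are the site names and cost values.

```python
def cluster_sites(sites, cost, ccr):
    """Greedy grouping of sites by a Communication Cost Range (CCR).

    A site joins the first existing cluster in which its communication
    cost to every member is <= ccr; otherwise it starts a new cluster.
    """
    clusters = []
    for site in sites:
        for cluster in clusters:
            if all(cost[frozenset((site, member))] <= ccr for member in cluster):
                cluster.append(site)
                break
        else:
            # No existing cluster satisfies the CCR condition for this site.
            clusters.append([site])
    return clusters

# Hypothetical symmetric communication costs between four sites.
cost = {
    frozenset(("S1", "S2")): 2,
    frozenset(("S1", "S3")): 9,
    frozenset(("S1", "S4")): 8,
    frozenset(("S2", "S3")): 7,
    frozenset(("S2", "S4")): 9,
    frozenset(("S3", "S4")): 3,
}

clusters = cluster_sites(["S1", "S2", "S3", "S4"], cost, ccr=5)
print(clusters)
```

With these illustrative costs and a CCR of 5, sites S1 and S2 end up in one cluster and S3 and S4 in another. A different visiting order or a different tie-breaking policy could produce other valid groupings.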
According to the previous formulas, the Cost of Allocation CA(Tk, Fi, Cj) is defined as the sum of
the following costs: local retrievals, local updates, space, remote update, and remote communication.

$$\mathrm{CA}(T_k, F_i, C_j) = \mathrm{CLRsum}(T_k, F_i, C_j) + \mathrm{CLUsum}(T_k, F_i, C_j) + \mathrm{CSPsum}(T_k, F_i, C_j) + \mathrm{CRUsum}(T_k, F_i, C_j) + \mathrm{CRCsum}(T_k, F_i, C_j)$$

ii. Cost of Not Allocating a Fragment to a Cluster

The cost of not allocating the fragment Fi to the cluster Cj is computed as the sum of the following:

The average cost of local retrievals at cluster Cj times the average frequency of retrievals issued by the transaction Tk to the fragment Fi at the cluster Cj. It is the same as defined in the previous section.

Retrievals from other clusters Cx of remote sites: the retrieval ratio (CRR) (Unit Retrieval/Unit Communication) times the average frequency of retrievals issued by the transaction Tk to the fragment Fi at the cluster Cj for each cluster other than the current one, times the average cost of communication between clusters (CCC).

$$\mathrm{CRRsum}(T_k, F_i, C_j) = \mathrm{Ratio} \times \mathrm{FREQRR}(T_k, F_i, C_j) \times \mathrm{CCC}$$

According to the formulas specified previously, the Cost of Not Allocation CN(Tk, Fi, Cj) is defined as the sum of the cost of local retrievals and the cost of remote retrievals.

$$\mathrm{CN}(T_k, F_i, C_j) = \mathrm{CLRsum}(T_k, F_i, C_j) + \mathrm{CRRsum}(T_k, F_i, C_j)$$

Fragment  Cluster  Site  Retrieval frequency  Update frequency
F1        C1       S1    80                   10
F1        C1       S2    60                   26
F1        C2       S3    60                   16
F1        C2       S4    0                    0
F1        C3       S5    35                   5
F1        C3       S6    25                   5
F2        C1       S3    20                   4
F2        C1       S4    20                   6
F2        C2       S5    5                    30
F2        C2       S6    105                  20
F3        C2       S3    30                   0
F3        C2       S4    0                    0
F3        C3       S5    40                   30
F3        C3       S6    30                   10

Table 2- Cost of space, retrieval and update

Cluster  Site  Cost of Space  Cost of Retrieval  Cost of Update
C1       S1    0.004          0.15               0.25
C1       S2    0.006          0.25               0.35
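The allocation and non-allocation costs defined in this section can be combined as in the following Python sketch for a single (transaction, fragment, cluster) triple. Several points are assumptions rather than statements from the paper: the local retrieval term is taken to be CLR times a retrieval frequency (named `freq_lr` here) by analogy with the local update term, CN is computed as the sum of local and remote retrieval costs as the text defines it, the rule "allocate when CA does not exceed CN" is one plausible way to use the two costs, and all numeric values are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Costs:
    """Inputs for one (transaction, fragment, cluster) triple; values are illustrative."""
    clr: float      # average cost of a local retrieval
    clu: float      # average cost of a local update
    csp: float      # cost of space per byte
    crc: float      # average cost of communication with other clusters
    ccc: float      # average cost of communication between clusters
    freq_lr: float  # frequency of local retrievals (assumed name)
    freq_lu: float  # frequency of local updates (FREQLU)
    freq_ru: float  # frequency of remote updates (FREQRU)
    freq_rr: float  # frequency of remote retrievals (FREQRR)
    fsize: float    # fragment size in bytes (Fsize)
    uratio: float   # update ratio (Uratio)
    rratio: float   # the "Ratio" factor in CRRsum

def cost_of_allocation(c):
    """CA = CLRsum + CLUsum + CSPsum + CRUsum + CRCsum."""
    clr_sum = c.clr * c.freq_lr  # assumed by analogy with CLUsum
    clu_sum = c.clu * c.freq_lu
    csp_sum = c.csp * c.fsize
    cru_sum = c.clu * c.freq_ru
    crc_sum = c.uratio * c.freq_lu * c.crc
    return clr_sum + clu_sum + csp_sum + cru_sum + crc_sum

def cost_of_not_allocating(c):
    """CN = CLRsum + CRRsum."""
    clr_sum = c.clr * c.freq_lr
    crr_sum = c.rratio * c.freq_rr * c.ccc
    return clr_sum + crr_sum

example = Costs(clr=0.5, clu=0.25, csp=0.5, crc=2.0, ccc=4.0,
                freq_lr=8, freq_lu=4, freq_ru=2, freq_rr=10,
                fsize=10, uratio=0.5, rratio=0.5)
ca = cost_of_allocation(example)
cn = cost_of_not_allocating(example)
allocate = ca <= cn  # assumed decision rule, not stated in the text
print(ca, cn, allocate)
```

For these made-up inputs the allocation cost is 14.5 and the non-allocation cost is 24.0, so the sketch would place the fragment on the cluster.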
The importance of distributed database systems has increased further with developments in networking technologies. Effective distribution of the database fragments plays a critical role in the functioning of the database in terms of performance and cost. In this paper, a new formulation for the problem of fragmenting and allocating those fragments at minimum cost is presented for both structured and unstructured data. Results from the application of these formulations can be utilized for a large number of data sets.