Unit 2adtnotes
Parallel DBMS
● Disadvantage:
Independent parallelism does not provide a high degree of
parallelism, and is less useful in a highly parallel system.
● Advantage:
It is useful with a lower degree of parallelism.
TOPIC 3: DISTRIBUTED DATABASE FEATURES:
Distributed Database Technology:
● Mode of working from centralized to decentralized.
Applications:
LOYOLA-ICAM COLLEGE OF ENGINEERING AND TECHNOLOGY
LOYOLA COLLEGE CAMPUS, NUNGUMBAKKAM, CH – 34
CS2029 ADVANCED DATABASE TECHNOLOGY
UNIT II
DISTRIBUTED DATABASES
● Rapid developments in network and data communication
technology, epitomized by the Internet, mobile and wireless
computing, intelligent devices, and grid computing.
Introduction:
● The shareability of the data and the efficiency of data access
should be improved by the development of a distributed database
system that reflects this organizational structure, makes the data in
all units accessible, and stores data proximate to the location where
it is most frequently used.
● Distributed DBMSs should help resolve the islands of information
problem.
Concepts:
Distributed Database
A logically interrelated collection of shared data (and a description of this
data), physically distributed over a computer network.
Distributed DBMS
Software system that permits the management of the distributed database
and makes the distribution transparent to users.
● Each fragment is stored on one or more computers under the
control of a separate DBMS, with the computers connected by a
communications network.
● Each site is capable of independently processing user requests that
require access to local data; that is, each site has some degree of
local autonomy.
● It is also capable of processing data stored on other computers in
the network.
● Users access the distributed database via applications, which are
classified as those that do not require data from other sites (local
applications) and those that do require data from other sites
(global applications).
● We require a DDBMS to have at least one global application.
A DDBMS therefore has the following characteristics:
● a collection of logically related shared data;
● the data is split into a number of fragments;
● fragments may be replicated;
● fragments/replicas are allocated to sites;
● the sites are linked by a communications network;
● the data at each site is under the control of a DBMS;
● the DBMS at each site can handle local applications,
autonomously;
● each DBMS participates in at least one global application.
[Figure: Distributed DBMS]
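The distinction between local and global applications can be illustrated with a small sketch. The fragment names, site names, and the `classify_application` helper below are hypothetical and exist only for illustration; the idea is simply that an application is local when every fragment it touches is stored at its own site.

```python
from typing import Dict, Set

# Hypothetical allocation: the site that stores each fragment.
FRAGMENT_SITE: Dict[str, str] = {
    "Staff_B003": "site3",
    "Staff_B005": "site5",
    "Staff_B007": "site7",
}


def classify_application(home_site: str, fragments_needed: Set[str]) -> str:
    """Return 'local' if every needed fragment is at home_site, else 'global'."""
    sites = {FRAGMENT_SITE[f] for f in fragments_needed}
    return "local" if sites <= {home_site} else "global"


print(classify_application("site3", {"Staff_B003"}))
print(classify_application("site3", {"Staff_B003", "Staff_B005"}))
```

Under this sketch, a system would count as a DDBMS in the sense of these notes only if at least one application is classified as global.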
● It is not necessary for every site in the system to have its own local
database.
● From the definition of the DDBMS, the system is expected to
make the distribution transparent (invisible) to the user. Thus, the
fact that a distributed database is split into fragments that can be
stored on different computers and perhaps replicated, should be
hidden from the user.
● The objective of transparency is to make the distributed system
appear like a centralized system.
● This is sometimes referred to as the fundamental principle of
distributed DBMSs.
● Advantage of transparency in DDBMS: This requirement
provides significant functionality for the end-user.
● Disadvantage of transparency in DDBMS: It creates many
additional problems that have to be handled by the DDBMS.
Distributed Processing
A centralized database that can be accessed over a computer network.
SNO  Distributed DBMS                      Distributed Processing
1.   Software system that permits the      A centralized database that can be
     management of the distributed         accessed over a computer network.
     database and makes the distribution
     transparent to users.
2.   The key point in the definition of    If the data is centralized, even
     a distributed DBMS is that the        though others may be accessing the
     system consists of data that is       data over the network, it is called
     physically distributed across a       distributed processing.
     number of sites in the network.
3.
Advantages of DDBMSs
● Reflects organizational structure
● Improved shareability and local autonomy
● Improved availability
● Improved reliability
● Improved performance
● Economics
● Modular growth
Disadvantages of DDBMSs
● Complexity
● Cost
● Security
● Integrity control more difficult
● Lack of standards
● Lack of experience
● Database design more complex
Types of DDBMS
● Homogeneous DDBMS
● Heterogeneous DDBMS
Homogeneous DDBMS
● All sites use same DBMS product.
Advantage:
● Homogeneous systems are much easier to design and manage.
● This approach provides incremental growth, making the
addition of a new site to the DDBMS easy, and allows
increased performance by exploiting the parallel processing
capability of multiple sites.
Heterogeneous DDBMS
● Sites may run different DBMS products, with possibly different
underlying data models.
● So the system may be composed of relational, network,
hierarchical, and object-oriented DBMSs.
● Heterogeneous systems usually result when individual sites have
implemented their own databases and integration is considered at a
later stage.
● In a heterogeneous system, translations are required to allow
communication between different DBMSs.
● To provide DBMS transparency, users must be able to make
requests in the language of the DBMS at their local site.
● The system then has the task of locating the data and performing
any necessary translation.
● Data may be required from another site that may have:
▪ different hardware;
▪ different DBMS products;
▪ different hardware and different DBMS products.
different hardware
● If the hardware is different but the DBMS products are the same,
the translation is straightforward, involving the change of codes
and word lengths.
different DBMS products
● If the DBMS products are different, the translation is complicated,
involving the mapping of data structures in one data model to the
equivalent data structures in another data model.
● For example, relations in the relational data model are mapped to
records and sets in the network model.
● It is also necessary to translate the query language used.
different hardware and different DBMS products
● If both the hardware and software are different, then both these
types of translation are required.
● This makes the processing extremely complex.
Gateway:
● The typical solution used by some relational systems that are part
of a heterogeneous DDBMS is to use gateways.
● It converts the language and model of each different DBMS into
the language and model of the relational system.
Disadvantages of Gateways:
● It may not support transaction management.
● The gateway approach is concerned only with the problem of
translating a query expressed in one language into an equivalent
expression in another language.
● It does not address the issues of homogenizing the structural and
representational differences between different schemas.
Open Database Access and Interoperability
● Open Group formed a Working Group to provide specifications
that will create a database infrastructure environment where there
is:
o Common SQL API that allows client applications to be
written that do not need to know vendor of DBMS they
are accessing.
o Common database protocol that enables DBMS from one
vendor to communicate directly with DBMS from another
vendor without the need for a gateway.
o A common network protocol that allows communications
between different DBMSs.
● Most ambitious goal is to find a way to enable a transaction to span
DBMSs from different vendors without use of a gateway.
● Group has now evolved into the DBIOP Consortium and is working
on version 3 of the DRDA (Distributed Relational Database
Architecture) standard.
Multidatabase System (MDBS)
DDBMS in which each site maintains complete autonomy.
● DBMS that resides transparently on top of existing database and
file systems and presents a single database to its users.
● Allows users to access and share data without requiring physical
database integration.
● Two types of MDBS: unfederated MDBS (no local users) and
federated MDBS.
Unfederated MDBS (UFMDBS):
● Where there are no local users.
Federated MDBS (FMDBS):
● A federated system is a cross between a distributed DBMS and a
centralized DBMS.
● It is a distributed system for global users and a centralized system
for local users.
Functions of MDBS:
● An MDBS maintains only the global schema against which users
issue queries and updates and the local DBMSs themselves
maintain all user data. The global schema is constructed by
integrating the schemas of the local databases.
● The MDBS first translates the global queries and updates into
queries and updates on the appropriate local DBMSs.
● It then merges the local results and generates the final global result
for the user.
● The MDBS coordinates the commit and abort operations for global
transactions by the local DBMSs that processed them, to maintain
consistency of data within the local databases.
● An MDBS controls multiple gateways and manages local
databases through these gateways.
Overview of Networking
● Network - Interconnected collection of autonomous computers,
capable of exchanging information.
o Local Area Network (LAN) intended for connecting
computers at same site.
o Wide Area Network (WAN) used when computers or
LANs need to be connected over long distances.
o WAN relatively slow and less reliable than LANs.
DDBMS using LAN provides much faster response time
than one using WAN.
Functions of a DDBMS
● Expect DDBMS to have at least the functionality of a DBMS.
● Also to have following functionality:
● Extended communication services.
● Extended Data Dictionary.
● Distributed query processing.
● Extended concurrency control.
● Extended recovery services.
Date’s 12 Rules for a DDBMS
0. Fundamental Principle: To the user, a distributed system
should look exactly like a nondistributed system.
1. Local Autonomy
2. No Reliance on a Central Site
3. Continuous Operation
4. Location Independence
5. Fragmentation Independence
6. Replication Independence
7. Distributed Query Processing
8. Distributed Transaction Processing
9. Hardware Independence
10. Operating System Independence
11. Network Independence
12. Database Independence
● Last four rules are ideals.
Transparencies in a DDBMS
● The definition of a DDBMS states that the system should make the
distribution transparent to the user.
● Transparency hides implementation details from the user.
● We can identify four main types of transparency in a DDBMS:
o distribution transparency;
o transaction transparency;
o performance transparency;
o DBMS transparency.
Distribution Transparency
● Distribution transparency allows the user to perceive the database
as a single, logical entity.
● If a DDBMS exhibits distribution transparency, then the user does
not need to know the data is fragmented (fragmentation
transparency) or the location of data items (location
transparency).
● If the user needs to know that the data is fragmented and the
location of fragments then we call this local mapping
transparency.
Transaction Transparency
● Transaction transparency in a DDBMS environment ensures that
all distributed transactions maintain the distributed database’s
integrity and consistency.
● A distributed transaction accesses data stored at more than one
location.
● Each transaction is divided into a number of subtransactions, one
for each site that has to be accessed; a subtransaction is represented
by an agent.
Performance Transparency
● Performance transparency requires a DDBMS to perform as if it
were a centralized DBMS.
● In a distributed environment, the system should not suffer any
performance degradation due to the distributed architecture, for
example the presence of the network.
● Performance transparency also requires the DDBMS to determine
the most cost-effective strategy to execute a request.
DBMS Transparency
● DBMS transparency hides the knowledge that the local DBMSs
may be different, and is therefore only applicable to heterogeneous
DDBMSs.
● It is one of the most difficult transparencies to provide as a
generalization.
Distribution Transparency
● Distribution transparency allows user to perceive database as
single, logical entity.
● If DDBMS exhibits distribution transparency, user does not need to
know:
o data is fragmented (fragmentation transparency),
o location of data items (location transparency),
o otherwise call this local mapping transparency.
● With replication transparency, user is unaware of replication of
fragments.
Naming Transparency
● Each item in a DDB must have a unique name.
● DDBMS must ensure that no two sites create a database object
with same name.
● One solution is to create central name server. However, this results
in:
o loss of some local autonomy;
o central site may become a bottleneck;
o low availability; if the central site fails, remaining sites
cannot create any new objects.
● Alternative solution - prefix object with identifier of site that
created it.
● For example, Branch created at site S1 might be named
S1.BRANCH.
● Also need to identify each fragment and its copies.
● Thus, copy 2 of fragment 3 of Branch created at site S1 might
be referred to as S1.BRANCH.F3.C2.
● However, this results in loss of distribution transparency.
● An approach that resolves these problems uses aliases for each
database object.
● Thus, S1.BRANCH.F3.C2 might be known as LocalBranch by
user at site S1.
● DDBMS has task of mapping an alias to appropriate database
object.
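The site-prefixed naming scheme and the alias mapping described above could be sketched as follows. This is only an illustration: the function names and the in-memory `ALIASES` table are assumptions, not part of any particular DDBMS; only the name format S1.BRANCH.F3.C2 and the alias LocalBranch come from the notes.

```python
from typing import Dict, Tuple


def qualified_name(site: str, obj: str, fragment: int, copy: int) -> str:
    """Build a globally unique name of the form SITE.OBJECT.Fn.Cn."""
    return f"{site}.{obj.upper()}.F{fragment}.C{copy}"


def parse_name(name: str) -> Tuple[str, str, int, int]:
    """Split SITE.OBJECT.Fn.Cn back into its four components."""
    site, obj, frag, copy = name.split(".")
    return site, obj, int(frag[1:]), int(copy[1:])


# Hypothetical per-site alias table maintained by the DDBMS.
ALIASES: Dict[str, Dict[str, str]] = {
    "S1": {"LocalBranch": qualified_name("S1", "Branch", 3, 2)},
}


def resolve(site: str, alias: str) -> str:
    """Map a user-visible alias at a site to the underlying database object."""
    return ALIASES[site][alias]


print(resolve("S1", "LocalBranch"))
```

The user at site S1 only ever writes LocalBranch; the mapping to the fragment and copy stays inside the DDBMS, which is what preserves distribution transparency.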
Transaction Transparency
● Ensures that all distributed transactions maintain distributed
database’s integrity and consistency.
● Distributed transaction accesses data stored at more than one
location.
● Each transaction is divided into number of subtransactions,
one for each site that has to be accessed.
● DDBMS must ensure the indivisibility of both the global
transaction and each of the subtransactions.
Example - Distributed Transaction
● T prints out names of all staff, using schema defined above as
S1, S2, S21, S22, and S23. Define three subtransactions TS3, TS5,
and TS7 to represent agents at sites 3, 5, and 7.
Classification of Transactions
● In IBM’s Distributed Relational Database Architecture (DRDA),
four types of transactions:
o Remote request
o Remote unit of work
o Distributed unit of work
o Distributed request.
Concurrency Transparency
● All transactions must execute independently and be logically
consistent with results obtained if transactions executed one at
a time, in some arbitrary serial order.
● Same fundamental principles as for centralized DBMS.
● DDBMS must ensure both global and local transactions do not
interfere with each other.
● Similarly, DDBMS must ensure consistency of all
subtransactions of global transaction.
● Replication makes concurrency more complex.
● If a copy of a replicated data item is updated, update must be
propagated to all copies.
● Could propagate changes as part of original transaction, making it
an atomic operation.
● However, if one site holding copy is not reachable, then transaction
is delayed until site is reachable.
● Could limit update propagation to only those sites currently
available. Remaining sites updated when they become available
again.
● Could allow updates to copies to happen asynchronously,
sometime after the original update. Delay in regaining consistency
may range from a few seconds to several hours.
Failure Transparency
● DDBMS must ensure atomicity and durability of global
transaction.
● Means ensuring that subtransactions of global transaction either all
commit or all abort.
● Thus, DDBMS must synchronize global transaction to ensure that
all subtransactions have completed successfully before recording a
final COMMIT for global transaction.
● Must do this in presence of site and network failures.
Performance Transparency
● DDBMS must perform as if it were a centralized DBMS.
o DDBMS should not suffer any performance degradation
due to distributed architecture.
o DDBMS should determine most cost-effective strategy to
execute a request.
● Distributed Query Processor (DQP) maps data request into ordered
sequence of operations on local databases.
● Must consider fragmentation, replication, and allocation schemas.
● Distributed Query Processing has to decide:
o which fragment to access;
o which copy of a fragment to use;
o which location to use.
● Distributed Query Processing produces execution strategy
optimized with respect to some cost function.
● Typically, costs associated with a distributed request include:
o I/O cost;
o CPU cost;
o communication cost.
Performance Transparency – Example
Property(propNo, city) 10000 records in London
Client(clientNo, maxPrice) 100000 records in Glasgow
Viewing(propNo, clientNo) 1000000 records in London
SELECT p.propNo
FROM Property p INNER JOIN
(Client c INNER JOIN Viewing v ON c.clientNo = v.clientNo)
ON p.propNo = v.propNo
WHERE p.city = 'Aberdeen' AND c.maxPrice > 200000;
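To see why the choice of strategy matters for this query, the communication time of two candidate strategies can be estimated with a rough sketch. The relation sizes and the 100-character tuple length are the ones stated for this example; the data rate (10,000 characters per second) and the 1-second access delay per message are assumptions introduced here purely for illustration, as is the `transfer_time` helper.

```python
def transfer_time(tuples: int, tuple_chars: int = 100,
                  chars_per_sec: float = 10_000.0,
                  access_delay: float = 1.0, messages: int = 1) -> float:
    """Seconds to ship `tuples` tuples: per-message delay plus transmission time.

    chars_per_sec and access_delay are assumed values, not taken from the notes.
    """
    return messages * access_delay + tuples * tuple_chars / chars_per_sec


# Strategy A: ship Client (100000 tuples, Glasgow) to London and join there.
ship_client = transfer_time(100_000)

# Strategy B: ship Property and Viewing (10000 + 1000000 tuples, London)
# to Glasgow as two separate transfers and join there.
ship_property_and_viewing = transfer_time(10_000 + 1_000_000, messages=2)

print(ship_client, ship_property_and_viewing)
```

Under these assumed figures, moving the smaller relation costs about 17 minutes of transfer time while moving the two larger ones costs close to three hours, which is the kind of difference the distributed query processor is expected to detect.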
Assume:
Each tuple in each relation is 100 characters long.
10 renters with maximum price greater than £200,000.
100 000 viewings for properties in Aberdeen.
Computation time negligible compared to communication time.
TOPIC 4: Distributed DBMS Architecture
Reference Architecture for DDBMS
● Due to diversity, no accepted architecture equivalent to
ANSI/SPARC 3-level architecture.
● A reference architecture consists of:
o Set of global external schemas.
o Global conceptual schema (GCS).
o Fragmentation schema and allocation schema.
o Set of schemas for each local DBMS conforming to
3-level ANSI/SPARC.
● Some levels may be missing, depending on levels of transparency
supported.
The edges in this figure represent mappings between the different schemas.
Global conceptual schema
● The global conceptual schema is a logical description of the whole
database, as if it were not distributed.
● This level corresponds to the conceptual level of the ANSI-SPARC
architecture and contains definitions of entities, relationships,
constraints, security, and integrity information.
● It provides physical data independence from the distributed
environment. The global external schemas provide logical data
independence.
Fragmentation and allocation schemas
● The fragmentation schema is a description of how the data is to be
logically partitioned.
● The allocation schema is a description of where the data is to be
located, taking account of any replication.
Local schemas
● Each local DBMS has its own set of schemas.
● The local conceptual and local internal schemas correspond to the
equivalent levels of the ANSI-SPARC architecture.
● The local mapping schema maps fragments in the allocation
schema into external objects in the local database.
● It is DBMS independent and is the basis for supporting
heterogeneous DBMSs.
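One way to picture the fragmentation and allocation schemas is as two lookup tables that the DDBMS consults when it has to locate data. The sketch below is a simplification under stated assumptions: the fragment names follow the Staff example used elsewhere in these notes, but the site assignments, the extra copy of S23, and the `locate` helper are hypothetical.

```python
from typing import Dict, List

# Fragmentation schema: global relation -> names of its fragments.
FRAGMENTATION: Dict[str, List[str]] = {"Staff": ["S1", "S21", "S22", "S23"]}

# Allocation schema: fragment -> sites holding a copy (assumed placement).
ALLOCATION: Dict[str, List[str]] = {
    "S1": ["site5"],
    "S21": ["site3"],
    "S22": ["site5"],
    "S23": ["site7", "site3"],  # replicated fragment (hypothetical)
}


def locate(relation: str, preferred_site: str) -> Dict[str, str]:
    """Choose one site per fragment, preferring a copy at preferred_site."""
    plan: Dict[str, str] = {}
    for fragment in FRAGMENTATION[relation]:
        sites = ALLOCATION[fragment]
        plan[fragment] = preferred_site if preferred_site in sites else sites[0]
    return plan


print(locate("Staff", "site3"))
```

A request issued at site3 reads S21 and the local copy of S23 without using the network and goes to site5 for the remaining fragments.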
Components Architecture of a DDBMS
● Independent of the reference architecture, we can identify a
component architecture for a DDBMS consisting of four major
components:
o local DBMS (LDBMS) component;
o data communications (DC) component;
o global system catalog (GSC);
o distributed DBMS (DDBMS) component.
Local DBMS Component:
o The LDBMS component is responsible for controlling the local
data at each site that has a database.
o It has its own local system catalog that stores information
about the data held at that site.
o In a homogeneous system, the LDBMS component is the
same product, replicated at each site.
o In a heterogeneous system, there would be at least two
sites with different DBMS products and/or platforms.
Data Communication Component:
o The DC component is the software that enables all sites to
communicate with each other.
o The DC component contains information about the sites
and the links.
Global System Catalog:
o The GSC has the same functionality as the system catalog
of a centralized system.
o The GSC holds information specific to the distributed
nature of the system, such as the fragmentation, replication,
and allocation schemas.
o It can itself be managed as a distributed database and so it
can be fragmented and distributed, fully replicated, or
centralized.
o A fully replicated GSC compromises site autonomy.
o A centralized GSC also compromises site autonomy.
o Improved Performance.
● Alternatively, bad allocation may result in underutilization of
resources.
● The quantitative information may include:
o the site from which a transaction is run;
o the performance criteria for transactions.
● The qualitative information may include information about the
transactions that are executed, such as:
o the relations, attributes, and tuples accessed;
o the type of access (read or write);
o the predicates of read operations.
Objectives for allocation and definition of fragments
Locality of reference
● Where possible, data should be stored close to where it is used.
● If a fragment is used at several sites, it may be advantageous to
store copies of the fragment at these sites.
Improved reliability and availability
● Reliability and availability are improved by replication: there is
another copy of the fragment available at another site in the event
of one site failing.
Balanced storage capacities and costs
● Consideration should be given to the availability and cost of
storage at each site so that cheap mass storage can be used, where
possible.
● This must be balanced against locality of reference.
Minimal communication costs
● Consideration should be given to the cost of remote requests.
● Retrieval costs are minimized when locality of reference is
maximized or when each site has its own copy of the data.
● However, when replicated data is updated, the update has to be
performed at all sites holding a duplicate copy, thereby increasing
communication costs.
Data Allocation
● Four alternative strategies regarding placement of data:
o Centralized,
o Partitioned (or Fragmented),
o Complete Replication,
o Selective Replication.
Centralized:
● This strategy consists of a single database and DBMS stored at one
site with users distributed across the network.
● Locality of reference is at its lowest as all sites, except the central
site, have to use the network for all data accesses.
● Communication costs are high.
● Reliability and availability are low, as a failure of the central site
results in the loss of the entire database system.
Fragmented (or partitioned)
● This strategy partitions the database into disjoint fragments, with
each fragment assigned to one site.
● If data items are located at the site where they are used most
frequently, locality of reference is high.
● As there is no replication, storage costs are low; similarly,
reliability and availability are low, although they are higher than in
the centralized case as the failure of a site results in the loss of only
that site’s data.
● Performance should be good and communications costs low if the
distribution is designed properly.
Complete replication
● This strategy consists of maintaining a complete copy of the
database at each site.
● Therefore, locality of reference, reliability and availability, and
performance are maximized.
● Storage costs and communication costs for updates are the most
expensive. To overcome some of these problems, snapshots are
sometimes used.
● A snapshot is a copy of the data at a given time.
● The copies are updated periodically, for example, hourly or
weekly, so they may not be always up to date.
● Snapshots are also sometimes used to implement views in a
distributed database to improve the time it takes to perform a
database operation on a view.
Selective replication
● This strategy is a combination of fragmentation, replication, and
centralization.
● Some data items are fragmented to achieve high locality of
reference and others, which are used at many sites and are not
frequently updated, are replicated; otherwise, the data items are
centralized.
● The objective of this strategy is to have all the advantages of the
other approaches but none of the disadvantages. This is the most
commonly used strategy because of its flexibility.
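The effect of the allocation strategies on locality of reference can be made concrete with a small sketch. The two sites, two fragments, and read counts below are a made-up workload, and `locality` is a hypothetical helper; the point is only to show the ordering centralized < fragmented < complete replication described in these notes.

```python
from typing import Dict, Set

# Hypothetical workload: ACCESSES[site][fragment] = number of reads issued.
ACCESSES: Dict[str, Dict[str, int]] = {
    "A": {"F1": 90, "F2": 10},
    "B": {"F1": 20, "F2": 80},
}


def locality(allocation: Dict[str, Set[str]]) -> float:
    """Fraction of reads that find a copy of the fragment at the issuing site."""
    total = 0
    local = 0
    for site, per_fragment in ACCESSES.items():
        for fragment, count in per_fragment.items():
            total += count
            if site in allocation[fragment]:
                local += count
    return local / total


centralized = {"F1": {"A"}, "F2": {"A"}}           # everything at site A
fragmented = {"F1": {"A"}, "F2": {"B"}}            # each fragment where it is used most
replicated = {"F1": {"A", "B"}, "F2": {"A", "B"}}  # full copy at every site

for name, alloc in [("centralized", centralized), ("fragmented", fragmented),
                    ("replicated", replicated)]:
    print(name, locality(alloc))
```

The sketch only counts reads; it deliberately ignores the update propagation and storage costs that make complete replication the most expensive strategy.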
Horizontal Fragmentation
Advantages of Vertical Fragmentation:
● The fragments can be stored at the sites that need them.
● The performance is improved as the fragment is smaller than the
original base relation.
This fragmentation schema satisfies the correctness rules:
o Completeness
o Disjointness
● The fragments are disjoint except for
the primary key, which is necessary for
reconstruction.
● Vertical fragments are determined by establishing the affinity of
one attribute to another.
● One way to do this is to create a matrix that shows the number of
accesses that refer to each attribute pair.
● For example, a transaction that accesses attributes a1, a2, and a4 of
relation R with attributes (a1, a2, a3, a4), can be represented by the
following matrix:
     a1   a2   a3   a4
a1        1    0    1
a2             0    1
a3                  0
a4
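A minimal sketch of how such a matrix might be built for one transaction, and how the matrices of several transactions could be summed into an overall matrix weighted by transaction frequency. The dictionary representation, the function names, and the second transaction are assumptions made for illustration.

```python
from itertools import combinations
from typing import Dict, Iterable, List, Tuple

Pair = Tuple[str, str]


def access_matrix(attrs: List[str], accessed: Iterable[str],
                  freq: int = 1) -> Dict[Pair, int]:
    """Upper-triangular matrix as a dict keyed by attribute pair (ai, aj), i < j.

    An entry is `freq` when the transaction uses both attributes, else 0.
    """
    used = set(accessed)
    return {(x, y): (freq if x in used and y in used else 0)
            for x, y in combinations(attrs, 2)}


def overall_matrix(matrices: Iterable[Dict[Pair, int]]) -> Dict[Pair, int]:
    """Sum the per-transaction matrices entry by entry."""
    total: Dict[Pair, int] = {}
    for matrix in matrices:
        for pair, count in matrix.items():
            total[pair] = total.get(pair, 0) + count
    return total


attrs = ["a1", "a2", "a3", "a4"]
t1 = access_matrix(attrs, ["a1", "a2", "a4"])      # the transaction from the notes
t2 = access_matrix(attrs, ["a3", "a4"], freq=5)    # hypothetical second transaction
print(overall_matrix([t1, t2]))
```

Pairs with a high total, such as (a3, a4) in this made-up workload, would be candidates for the same vertical fragment.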
● The matrix is triangular: the diagonal is left empty, and the lower
half does not need to be filled in as it is a mirror image of the
upper half.
● The 1s represent an access involving the corresponding attribute
pair, and are eventually replaced by numbers representing the
transaction frequency.
● A matrix is produced for each transaction and an overall matrix is
produced showing the sum of all accesses for each attribute pair.
● Pairs with high affinity should appear in the same vertical
fragment; pairs with low affinity may be separated.
● Working with single attributes and all major transactions may be
a lengthy calculation.
● Therefore, if it is known that some attributes are related, it may be
prudent to work with groups of attributes instead.
● This approach is known as splitting.
o It produces a set of non-overlapping fragments, which
ensures compliance with the disjointness rule.
o The non-overlapping characteristic applies only to
attributes that are not part of the primary key.
o Primary key fields appear in every fragment and so can be
omitted from the analysis.
Mixed Fragmentation
o For some applications horizontal or vertical fragmentation of a
database schema by itself is insufficient to adequately distribute the
data.
o Instead, mixed or hybrid fragmentation is required.
Mixed fragment:
o Consists of a horizontal fragment that is subsequently
vertically fragmented, or a vertical fragment that is then
horizontally fragmented.
● A mixed fragment is defined using the Selection and Projection
operations of the relational algebra.
● Given a relation R, a mixed fragment is defined as:
σp(∏a1, ... ,an(R))
or
∏a1, ... ,an(σp(R))
where p is a predicate based on one or more attributes of
R and a1, . . . , an are attributes of R.
Example - Mixed Fragmentation
S1 = ∏staffNo, position, sex, DOB, salary(Staff)
S2 = ∏staffNo, fName, lName, branchNo(Staff)
S21 = σ branchNo=‘B003’(S2)
S22 = σ branchNo=‘B005’(S2)
S23 = σ branchNo=‘B007’(S2)
● This produces three fragments (S21, S22, and S23), one consisting
of those tuples where the branch number is B003 (S21), one
consisting of those tuples where the branch number is B005 (S22),
and the other consisting of those tuples where the branch number is
B007 (S23)
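The mixed fragmentation of Staff can be tried out on a few in-memory tuples. In the sketch below the attribute names and the fragment definitions S1, S2, S21, S22, S23 follow the example; the three sample tuples and the `project` and `select` helpers are hypothetical. It also checks that the original relation can be rebuilt from the fragments.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]

# Hypothetical sample tuples; only the attribute names come from the notes.
STAFF: List[Row] = [
    {"staffNo": "SG14", "fName": "David", "lName": "Ford", "position": "Supervisor",
     "sex": "M", "DOB": "1958-03-24", "salary": 18000, "branchNo": "B003"},
    {"staffNo": "SL21", "fName": "John", "lName": "White", "position": "Manager",
     "sex": "M", "DOB": "1945-10-01", "salary": 30000, "branchNo": "B005"},
    {"staffNo": "SA9", "fName": "Mary", "lName": "Howe", "position": "Assistant",
     "sex": "F", "DOB": "1970-02-19", "salary": 9000, "branchNo": "B007"},
]


def project(rows: List[Row], attrs: List[str]) -> List[Row]:
    """Projection: keep only the named attributes of every tuple."""
    return [{a: r[a] for a in attrs} for r in rows]


def select(rows: List[Row], pred: Callable[[Row], bool]) -> List[Row]:
    """Selection: keep only the tuples that satisfy the predicate."""
    return [r for r in rows if pred(r)]


# Vertical fragments, then horizontal fragments of S2.
S1 = project(STAFF, ["staffNo", "position", "sex", "DOB", "salary"])
S2 = project(STAFF, ["staffNo", "fName", "lName", "branchNo"])
S21 = select(S2, lambda r: r["branchNo"] == "B003")
S22 = select(S2, lambda r: r["branchNo"] == "B005")
S23 = select(S2, lambda r: r["branchNo"] == "B007")

# Reconstruction: union the horizontal fragments, then join with S1 on staffNo.
s1_by_key = {r["staffNo"]: r for r in S1}
rebuilt = [{**s1_by_key[r["staffNo"]], **r} for r in S21 + S22 + S23]
print(rebuilt == STAFF)
```

Because each staff member works at exactly one branch in the sample data, the three horizontal fragments do not overlap, and S1 and S2 share only the primary key staffNo.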
o Disjointness
▪ The fragments are disjoint; there can be no staff
member who works in more than one branch and
S1 and S2 are disjoint except for the necessary
duplication of primary key.
Derived Horizontal Fragmentation
● Some applications may involve a join of two or more relations.
● If the relations are stored at different locations, there may be a
significant overhead in processing the join.
● To avoid overhead it may be more appropriate to ensure that the
relations, or fragments of relations, are at the same location. This
can be achieved using derived horizontal fragmentation.
Derived fragment:
● A horizontal fragment that is based on the horizontal fragmentation
of a parent relation.
o Given a child relation R and a parent relation S, the derived
fragments of R are defined using the Semijoin operation:
Ri = R ⋉f Si, 1 ≤ i ≤ w
▪ where w is the number of horizontal fragments
defined on S and f is the join attribute.
Example - Derived Horizontal Fragmentation
● We may have an application that joins the Staff and
PropertyForRent relations together. For this example, we assume
that Staff is horizontally fragmented according to the branch
number, so that data relating to the branch is stored locally:
o S3 = σ branchNo=‘B003’(Staff)
o S4 = σ branchNo=‘B005’(Staff)
o S5 = σ branchNo=‘B007’(Staff)
● We also assume that property PG4 is currently managed by SG14.
● It would be useful to store property data using the same
fragmentation strategy.
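Deriving the fragments of a child relation from the fragments of its parent amounts to a semijoin on the join attribute. The following is a sketch under assumptions: the join attribute is taken to be staffNo, PG4 being managed by SG14 is from the notes, and the remaining tuples and the `semijoin` helper are hypothetical.

```python
from typing import Dict, List

Row = Dict[str, str]


def semijoin(child: List[Row], parent_fragment: List[Row], attr: str) -> List[Row]:
    """Keep the child tuples whose join attribute value occurs in the parent fragment."""
    keys = {r[attr] for r in parent_fragment}
    return [r for r in child if r[attr] in keys]


# Hypothetical sample data, reduced to the attributes needed here.
STAFF: List[Row] = [
    {"staffNo": "SG14", "branchNo": "B003"},
    {"staffNo": "SL21", "branchNo": "B005"},
    {"staffNo": "SA9", "branchNo": "B007"},
]
PROPERTY_FOR_RENT: List[Row] = [
    {"propertyNo": "PG4", "staffNo": "SG14"},
    {"propertyNo": "PL94", "staffNo": "SL21"},
    {"propertyNo": "PA14", "staffNo": "SA9"},
]

# Parent fragments by branch, as in the example.
S3 = [r for r in STAFF if r["branchNo"] == "B003"]
S4 = [r for r in STAFF if r["branchNo"] == "B005"]
S5 = [r for r in STAFF if r["branchNo"] == "B007"]

# Derived fragments of the child relation.
P3 = semijoin(PROPERTY_FOR_RENT, S3, "staffNo")
P4 = semijoin(PROPERTY_FOR_RENT, S4, "staffNo")
P5 = semijoin(PROPERTY_FOR_RENT, S5, "staffNo")
print(P3)
```

Each property fragment can then be stored at the same site as the staff fragment it joins with, so the join of Staff and PropertyForRent needs no data transfer between sites.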
● This is achieved using derived fragmentation to horizontally
fragment the PropertyForRent relation according to branch
number:
● Pi = PropertyForRent ⋉staffNo Si, 3 ≤ i ≤ 5
● This produces three fragments (P3, P4, and P5), one consisting of
those properties managed by staff at branch number B003 (P3),
one consisting of those properties managed by staff at branch B005
(P4), and the other consisting of those properties managed by staff
at branch B007 (P5).
● If a relation contains more than one foreign key, it will be
necessary to select one of the referenced relations as the parent.
● The choice can be based on the fragmentation used most frequently
or the fragmentation with better join characteristics, that is, the join
involving smaller fragments or the join that can be performed in
parallel to a greater degree.
No fragmentation
● A final strategy is not to fragment a relation.
● For example, the Branch relation contains only a small number of
tuples and is not updated very frequently.
● Rather than trying to horizontally fragment the relation on, for
example, branch number, it would be more sensible to leave the
relation whole and simply replicate the Branch relation at each site.
Example:
● The use of Semijoins is beneficial if there are only a few tuples of
R1 that participate in the join of R1 and R2.
● The join approach is better if most tuples of R1 participate in the
join, because the Semijoin approach requires an additional transfer
of a projection on the join attribute.
Global Optimization
● Objective of this layer is to take the reduced query plan from the
data localization layer and find a near-optimal execution
strategy.
● In distributed environment, speed of network has to be
considered when comparing strategies.
● If the topology is known to be that of a WAN, could ignore all
costs other than network costs.
● LAN typically much faster than WAN, but still slower than
disk access.
● Cost model could be based on total cost (time), as in
centralized DBMS, or response time. Latter uses parallelism
inherent in DDBMS.
● The distributed query optimization algorithms:
o R* algorithm;
o SDD-1 algorithm;
o AHY;
o Distributed Ingres.
Global Optimization – R*
● R* uses a cost model based on total cost and static query
optimization.
● Like centralized System R optimizer, algorithm is based on an
exhaustive search of all join orderings, join methods (nested loop
or sort-merge join), and various access paths for each relation.
● When Join is required involving relations at different sites, R*
selects the sites to perform Join and method of transferring data
between sites.
● For a Join of R and S with R at site 1 and S at site 2, there are three
candidate sites:
o site 1, where R is located;
o site 2, where S is located;
o some other site (e.g., site of relation T, which is to be
joined with join of R and S).
● In R*, there are 2 methods for transferring data:
o Ship whole relation
o Fetch tuples as needed.
● First method incurs a larger data transfer but fewer messages than
second.
● R* considers only the following methods:
o Nested loop, ship whole outer relation to site of inner.
o Sort-merge, ship whole inner relation to site of outer.
o Nested loop, fetch tuples of inner relation as needed for
each tuple of outer relation.
o Sort-merge, fetch tuples of inner relation as needed for
each tuple of outer relation.
o Ship both relations to third site.
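The trade-off between shipping a whole relation and fetching tuples as needed can be illustrated with a toy cost model. The sketch below is not the R* cost formula: the per-message and per-byte costs, the single-message assumption for shipping a relation, and all sizes are assumptions chosen only to show how the choice can flip.

```python
from dataclasses import dataclass


@dataclass
class Network:
    cost_per_message: float  # fixed overhead per message (assumed unit)
    cost_per_byte: float     # transfer cost per byte (assumed unit)

    def cost(self, messages: float, num_bytes: float) -> float:
        return messages * self.cost_per_message + num_bytes * self.cost_per_byte


def ship_whole(net, inner_tuples, tuple_bytes):
    """Ship the whole inner relation, modelled here as a single message."""
    return net.cost(1, inner_tuples * tuple_bytes)


def fetch_as_needed(net, outer_tuples, matches_per_outer, tuple_bytes, key_bytes):
    """One request and one reply per outer tuple; only matching tuples travel."""
    messages = 2 * outer_tuples
    num_bytes = outer_tuples * (key_bytes + matches_per_outer * tuple_bytes)
    return net.cost(messages, num_bytes)


net = Network(cost_per_message=10.0, cost_per_byte=0.01)
whole = ship_whole(net, inner_tuples=10_000, tuple_bytes=100)
few_outer = fetch_as_needed(net, 50, 2, tuple_bytes=100, key_bytes=8)
many_outer = fetch_as_needed(net, 5_000, 2, tuple_bytes=100, key_bytes=8)
print(whole, few_outer, many_outer)
```

With few outer tuples the many small messages are still cheaper than moving the whole inner relation; with many outer tuples the message overhead dominates and shipping the relation once wins.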
Global Optimization – SDD-1
● Based on an earlier method known as “hill climbing”, a greedy
algorithm that starts with an initial feasible solution which is then
iteratively improved.
● Modified to make use of Semijoin to reduce cardinality of join
operands.
● Like R*, SDD-1 optimizer minimizes total cost, although unlike
R* it ignores local processing costs and concentrates on
communication message size.
● Like R*, query processing timing used is static.
● Based on concept of “beneficial Semijoins”.
● Communication cost of Semijoin is simply cost of transferring join
attribute of first operand to site of second operand.
● “Benefit” of Semijoin is taken as cost of transferring irrelevant
tuples of first operand, which Semijoin avoids.
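The cost and benefit of a single Semijoin can be put into numbers. The sketch below is an illustrative model, not the SDD-1 formulas: it measures both quantities in bytes, and the parameter names, the selectivity factor, and the sample sizes are assumptions.

```python
def semijoin_cost(distinct_join_values: int, join_attr_bytes: int) -> int:
    """Bytes shipped to perform the Semijoin: the projected join attribute."""
    return distinct_join_values * join_attr_bytes


def semijoin_benefit(tuples: int, tuple_bytes: int, selectivity: float) -> float:
    """Bytes saved: tuples that do not join and so never have to be shipped.

    selectivity is the assumed fraction of tuples that survive the Semijoin.
    """
    return (1.0 - selectivity) * tuples * tuple_bytes


def is_beneficial(cost: float, benefit: float) -> bool:
    """A Semijoin is beneficial if its cost is less than its benefit."""
    return cost < benefit


cost = semijoin_cost(distinct_join_values=1_000, join_attr_bytes=4)
selective = semijoin_benefit(10_000, 100, selectivity=0.1)
unselective = semijoin_benefit(10_000, 100, selectivity=0.999)
print(is_beneficial(cost, selective), is_beneficial(cost, unselective))
```

When almost every tuple survives the Semijoin, shipping the join attribute costs more than it saves, which is why the join approach is preferred in that case.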
SDD-1 Algorithm proceeds as follows:
● Phase 1 – Initialization: Perform all local reductions using
Selection and Projection. Execute Semijoins within same site
to reduce sizes of relations. Generate set of all beneficial
Semijoins across sites (Semijoin is beneficial if its cost is less
than its benefit).
● Phase 2 – Selection of beneficial Semijoins: Iteratively select
most beneficial Semijoin from set generated and add it to
execution strategy. After each iteration, update database
statistics to reflect incorporation of the Semijoin and update
the set with new beneficial Semijoins.
● Phase 3 – Assembly site selection: Select, among all sites, site
to which transmission of all relations incurs a minimum cost.
Choose site containing largest amount of data after reduction
phase so that sum of the amount of data transferred from other
sites will be minimum.
● Phase 4 – Postoptimization: Discard useless Semijoins; e.g. if
R resides in assembly site and R is due to be reduced by
Semijoin, but is not used to reduce other relations after
Semijoin, then since R need not be moved to another site
during assembly phase, Semijoin on R is useless and can be
discarded.
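Phases 2 and 3 above can be sketched as a small greedy loop. This is a simplified illustration under stated assumptions, not the SDD-1 implementation: each candidate Semijoin is a hypothetical (target, cost, selectivity) triple, "most beneficial" is taken to mean the largest benefit minus cost, and the statistics update is reduced to shrinking the target relation's size.

```python
def sdd1_select(sizes, site_of, candidates):
    """Greedy Semijoin selection and assembly site choice (simplified).

    sizes:      relation name -> current size in bytes
    site_of:    relation name -> site holding it
    candidates: list of (target, cost_bytes, selectivity) Semijoins, where
                applying one shrinks `target` to selectivity * its size
    """
    sizes = dict(sizes)
    remaining = list(candidates)
    strategy = []
    while True:
        # Recompute net benefits from the current (already reduced) sizes.
        scored = [
            ((1 - sel) * sizes[target] - cost, (target, cost, sel))
            for target, cost, sel in remaining
        ]
        beneficial = [s for s in scored if s[0] > 0]
        if not beneficial:
            break
        _, best = max(beneficial, key=lambda s: s[0])
        strategy.append(best)
        remaining.remove(best)
        target, _, sel = best
        sizes[target] *= sel  # update statistics after the reduction

    # Assembly site: the site that already holds the most data.
    per_site = {}
    for rel, size in sizes.items():
        per_site[site_of[rel]] = per_site.get(site_of[rel], 0) + size
    assembly_site = max(per_site, key=per_site.get)
    return strategy, assembly_site, sizes


strategy, assembly_site, reduced = sdd1_select(
    sizes={"R": 1000, "S": 5000, "T": 200},
    site_of={"R": 1, "S": 2, "T": 3},
    candidates=[("S", 100, 0.2), ("R", 400, 0.5), ("T", 300, 0.9)],
)
print(strategy, assembly_site, reduced)
```

In this run the Semijoin on T is never selected because its cost exceeds its benefit, and site 2 is chosen for assembly because S is still the largest relation after reduction.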
Topic: Distributed Transaction Processing
● The objectives of distributed transaction processing are the same as
those of centralized systems, although more complex because the
DDBMS must also ensure the atomicity of the global transaction
and each component subtransaction.
● The transaction manager coordinates transactions on behalf of
application programs, communicating with the scheduler.
● The scheduler is the module responsible for implementing a
particular strategy for concurrency control. The objective of the
scheduler is to maximize concurrency without allowing
concurrently executing transactions to interfere with one another
and thereby compromise the consistency of the database.
● In the event of a failure occurring during the transaction, the
recovery manager ensures that the database is restored to the state
it was in before the start of the transaction, and therefore a
consistent state. The recovery manager is also responsible for
restoring the database to a consistent state following a system
failure.
● The buffer manager is responsible for the efficient transfer of data
between disk storage and main memory.
● In a distributed DBMS, these modules still exist in each local
DBMS.
● In addition, there is also a global transaction manager or
transaction coordinator at each site to coordinate the execution
of both the global and local transactions initiated at that site.
● Inter-site communication is still through the data communications
component (transaction managers at different sites do not
communicate directly with each other).
PROCEDURE TO EXECUTE A GLOBAL TRANSACTION
INITIATED AT SITE S1 IS AS FOLLOWS:
(i) The transaction coordinator (TC1) at site S1 divides the
transaction into a number of subtransactions using information
held in the global system catalog.
(ii) The data communications component at site S1 sends the
subtransactions to the appropriate sites, S2 and S3, say.
(iii) The transaction coordinators at sites S2 and S3 manage these
subtransactions.
(iv) The results of subtransactions are communicated back to TC1
via the data communications components.
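The procedure for executing a global transaction initiated at site S1 can be sketched as follows. This is a minimal illustration under assumptions: the catalog contents, the operation format, and the `send_to_site` callable (standing in for the data communications component) are all hypothetical.

```python
def execute_global_transaction(operations, global_catalog, send_to_site):
    """Run a global transaction from the initiating site's coordinator.

    operations:     list of (action, data_item) pairs
    global_catalog: data_item -> site that stores it
    send_to_site:   callable(site, subtransaction) -> result; stands in for
                    the data communications component
    """
    # (i) Divide the transaction into subtransactions, one per site.
    subtransactions = {}
    for action, item in operations:
        site = global_catalog[item]
        subtransactions.setdefault(site, []).append((action, item))

    # (ii)-(iv) Ship each subtransaction and collect the results.
    return {
        site: send_to_site(site, sub)
        for site, sub in subtransactions.items()
    }


catalog = {"staff": "S2", "property": "S3", "branch": "S2"}
ops = [("read", "staff"), ("write", "property"), ("read", "branch")]
results = execute_global_transaction(
    ops, catalog, lambda site, sub: f"{len(sub)} operation(s) done at {site}"
)
print(results)
```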
Transaction Processing (Referred from Silberschatz)
Distributed Transactions
⮚ Access to the various data items in a distributed system is usually
accomplished through transactions, which must preserve the ACID
properties.
⮚ There are two types of transaction that we need to consider.
⮚ The local transactions are those that access and update data in
only one local database.
⮚ The global transactions are those that access and update data in
several local databases.
⮚ Ensuring the ACID properties of the local transactions can be done
easily.
⮚ For global transactions, this task is much more complicated, since
several sites may be participating in the execution. The failure of
one of these sites, or the failure of a communication link connecting
these sites, may result in erroneous computations.
SYSTEM STRUCTURE:
Local Transaction Manager:
⮚ Each site has its own local transaction manager.
⮚ Its function is to ensure the ACID properties of those transactions
that execute at that site.
⮚ The various transaction managers cooperate to execute global
transactions.
⮚ Consider an abstract model of a transaction system, in which each
site contains two subsystems:
o The transaction manager:
▪ It manages the execution of those transactions
(or subtransactions) that access data stored in a
local site.
▪ Each such transaction may be either a local
transaction (that is, a transaction that executes at
only that site) or
▪ part of a global transaction (that is, a transaction
that executes at several sites).
o The transaction coordinator coordinates the execution
of the various transactions (both local and global) initiated
at that site.
⮚ Each transaction manager is responsible for
• Maintaining a log for recovery purposes
• Participating in an appropriate concurrency-control
scheme to coordinate the concurrent execution of the
transactions executing at that site
⮚ A transaction coordinator, as its name implies, is responsible for
coordinating the execution of all the transactions initiated at that
site.
⮚ For each such transaction, the coordinator is responsible for
• Starting the execution of the transaction
• Breaking the transaction into a number of subtransactions
and distributing these subtransactions to the appropriate
sites for execution
• Coordinating the termination of the transaction, which
may result in the transaction being committed at all sites
or aborted at all sites
System Failure Modes
● A distributed system may suffer from the same types of failure as
a centralized system.
● There are additional types of failure with which we need to deal in
a distributed environment.
● The basic failure types are
• Failure of a site
• Loss of messages
• Failure of a communication link
• Network partition
● The loss or corruption of messages is always a possibility in a
distributed system.
● The system uses transmission-control protocols, such as TCP/IP, to
handle such errors.
● If two sites A and B are not directly connected, messages from one
to the other must be routed through a sequence of communication
links.
● If a communication link fails, messages that would have been
transmitted across the link must be rerouted.
● In some cases, it is possible to find another route through the
network, so that the messages are able to reach their destination.
● In other cases, a failure may result in there being no connection
between some pairs of sites.
● A system is partitioned if it has been split into two (or more)
subsystems, called partitions, that lack any connection between
them.
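Whether a link failure can be routed around or instead partitions the network is a graph connectivity question. The sketch below, using a hypothetical four-site topology, groups sites into partitions with a breadth-first search; it illustrates the definition and is not how any particular DDBMS detects partitions.

```python
from collections import deque


def partitions(sites, links):
    """Group sites into partitions: sets of sites still mutually reachable."""
    neighbours = {s: set() for s in sites}
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen, result = set(), []
    for start in sites:
        if start in seen:
            continue
        group, queue = {start}, deque([start])
        while queue:
            for nxt in neighbours[queue.popleft()]:
                if nxt not in group:
                    group.add(nxt)
                    queue.append(nxt)
        seen |= group
        result.append(group)
    return result


# Hypothetical topology: A-B, B-C, C-D and a redundant link A-C.
sites = ["A", "B", "C", "D"]
links = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "C")]

# Losing B-C can be routed around through A; losing C-D isolates D.
without_bc = partitions(sites, [l for l in links if l != ("B", "C")])
without_cd = partitions(sites, [l for l in links if l != ("C", "D")])
print(without_bc, without_cd)
```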
All concurrency control mechanisms must ensure that:
(I) the consistency of data items is preserved and
(II) that each atomic action is completed in a finite time.
A good concurrency control mechanism for distributed DBMSs should:
● perform satisfactorily in a network environment that has
significant communication delay;
The problem in distributed environment when multiple users access
(i) problems of lost update,
(ii) uncommitted dependency, and
then the global schedule (the union of all local schedules) is also
serializable provided local serialization orders are identical.
● This requires that all subtransactions appear in the same order in
the equivalent serial schedule at all sites.
The solutions to concurrency control in a distributed environment
are based on the two main approaches of locking and timestamping.
Given a set of transactions to be executed concurrently, then:
some (unpredictable) serial execution of those transactions
● Timestamping guarantees that the concurrent execution is
but can time out only in the middle two states. The actions to be taken
are as follows:
o Timeout in the WAITING state: The coordinator is waiting
for all participants to acknowledge whether they wish to
commit or abort the transaction. In this case, the
coordinator cannot commit the transaction because it has
not received all votes. However, it can decide to globally
abort the transaction.
o Timeout in the DECIDED state: The coordinator is
waiting for all participants to acknowledge whether they
have successfully aborted or committed the transaction. In
this case, the coordinator simply sends the global decision
again to sites that have not acknowledged.
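The two coordinator timeout actions can be sketched as a small decision function. Only the WAITING and DECIDED states are modelled, and the message names GLOBAL-ABORT and GLOBAL-COMMIT as well as the function signature are illustrative assumptions.

```python
def on_coordinator_timeout(state, decision, acked, participants):
    """Return (decision, messages to send) for a coordinator timeout.

    state:        "WAITING" or "DECIDED"
    decision:     None while waiting, else "COMMIT" or "ABORT"
    acked:        participants that have acknowledged the decision
    participants: all participants in the transaction
    """
    if state == "WAITING":
        # Not all votes arrived, so commit is impossible; abort globally.
        return "ABORT", [(p, "GLOBAL-ABORT") for p in participants]
    if state == "DECIDED":
        # Resend the decision only to sites that have not acknowledged.
        pending = [p for p in participants if p not in acked]
        return decision, [(p, f"GLOBAL-{decision}") for p in pending]
    raise ValueError(f"no timeout action defined for state {state!r}")


sites = ["S2", "S3", "S4"]
waiting = on_coordinator_timeout("WAITING", None, set(), sites)
decided = on_coordinator_timeout("DECIDED", "COMMIT", {"S2"}, sites)
print(waiting)
print(decided)
```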
Participant
● The simplest termination protocol is to leave the participant
process blocked until communication with the coordinator is
re-established.
● The participant can then be informed of the global decision and
resume processing accordingly.
● There are other actions that may be taken to improve
performance.
● A participant can be in one of four states during the commit
process: INITIAL, PREPARED, ABORTED, and COMMITTED,
as shown in Figure (b).
● However, a participant may time out only in the first two states as
follows:
o Timeout in the INITIAL state: The participant is waiting
for a PREPARE message from the coordinator, which
implies that the coordinator must have failed while in the
INITIAL state. In this case, the participant can
unilaterally abort the transaction. If it subsequently
receives a PREPARE message, it can either ignore it, in
which case the coordinator times out and aborts the global
transaction, or it can send an ABORT message to the
coordinator.
o Timeout in the PREPARED state: The participant is
waiting for an instruction to globally commit or abort the
transaction. The participant must have voted to commit
the transaction, so it cannot change its vote and abort the
transaction. Equally well, it cannot go ahead and commit
the transaction, as the global decision may be to abort.
Without further information, the participant is blocked.
However, the participant could contact each of the other
participants attempting to find one that knows the
decision. This is known as the cooperative termination
protocol. A straightforward way of telling the participants
who the other participants are is for the coordinator to
append a list of participants to the vote instruction.
● Although the cooperative termination protocol reduces the
likelihood of blocking, blocking is still possible, and the blocked
process will just have to keep on trying to unblock as failures are
repaired.
● If it is only the coordinator that has failed and all participants
detect this as a result of executing the termination protocol, then
they can elect a new coordinator and resolve the block.
Recovery protocols for 2PC
● We now consider the action to be taken by a failed site on
recovery.
● The action on restart again depends on what stage the coordinator
or participant had reached at the time of failure.
Coordinator failure
● We consider three different stages for failure of the coordinator:
o Failure in INITIAL state: The coordinator has not yet
started the commit procedure. Recovery in this case starts
the commit procedure.
o Failure in WAITING state: The coordinator has sent the
PREPARE message and although it has not received all
responses, it has not received an abort response. In this
case, recovery restarts the commit procedure.
o Failure in DECIDED state: The coordinator has instructed
the participants to globally abort or commit the
transaction. On restart, if the coordinator has received all
acknowledgements, it can complete successfully.
Otherwise, it has to initiate the termination protocol.
Participant failure
● The objective of the recovery protocol for a participant is to ensure
that a participant process on restart performs the same action as all
other participants, and that this restart can be performed
independently (that is, without the need to consult either the
coordinator or the other participants).
● We consider three different stages for failure of a participant:
o Failure in INITIAL state: The participant has not yet voted
on the transaction. Therefore, on recovery it can
unilaterally abort the transaction, as it would have been
impossible for the coordinator to have reached a global
commit decision without this participant’s vote.
o Failure in PREPARED state: The participant has sent its
vote to the coordinator. In this case, recovery is via the
termination protocol.
o Failure in ABORTED/COMMITTED states: The
participant has completed the transaction. Therefore, on
restart, no further action is necessary.
Election protocols
● If the participants detect the failure of the coordinator (by timing
out) they can elect a new site to act as coordinator.
● One election protocol is for the sites to have an agreed linear
ordering.
● We assume that site Si has order i in the sequence, the lowest being
the coordinator, and that each site knows the identification and
ordering of the other sites in the system, some of which may also
have failed.
● One election protocol asks each operational participant to send a
message to the sites with a greater identification number.
● Thus, site Si would send a message to sites Si+1, Si+2, . . . , Sn in
that order.
● If a site Sk receives a message from a lower-numbered participant,
then Sk knows that it is not to be the new coordinator and stops
sending messages.
● This protocol is relatively efficient and most participants stop
sending messages quite quickly.
● Eventually, each participant will know whether there is an
operational participant with a lower number.
● If there is not, the site becomes the new coordinator.
● If the newly elected coordinator also times out during this process,
the election protocol is invoked again.
● After a failed site recovers, it immediately starts the election
protocol.
● If there are no operational sites with a lower number, the site forces
all higher-numbered sites to let it become the new coordinator,
regardless of whether there is a new coordinator or not.
Communication topologies for 2PC
● There are several different ways of exchanging messages, or
communication topologies, that can be employed to implement
2PC.
o Centralized 2PC,
▪ In centralized 2PC all communication is
funneled through the coordinator, as shown in
Figure (a).
● These improvements depend upon adopting different ways of
Linear 2PC
● In linear 2PC, participants can communicate with each other, as
coordinator and the remaining sites are the participants. The 2PC
from coordinator to participant n for the voting phase and a
backward chain of communication from participant n to the
● Both the coordinator and participant still have periods of waiting,
but the important feature is that all operational processes have
been informed of a global decision to commit by the
PRE-COMMIT message prior to the first process committing, and
can therefore act independently in the event of failure.
● If the coordinator does fail, the operational sites can communicate
with each other and determine whether the transaction should be
Termination protocols for 3PC
● As with 2PC, the action to be taken depends on what state the
coordinator or participant was in when the timeout occurred.
Coordinator
● The coordinator can be in one of five states during the commit
process as shown in Figure but can time out in only three states.
acknowledge whether they wish to commit or abort the
transaction, so it can decide to globally abort the
transaction.
o Timeout in the PRE-COMMITTED state: The participants
have been sent the PRE-COMMIT message, so
participants will be in either the PRE-COMMIT or
READY states. In this case, the coordinator can complete
the transaction by writing the commit record to the log file
and sending the GLOBAL-COMMIT message to the
participants.
o Timeout in the DECIDED state: This is the same as in
2PC. The coordinator is waiting for all participants to
acknowledge whether they have successfully aborted or
committed the transaction, so it can simply send the
global decision to all sites that have not acknowledged.
Participant
● The participant can be in one of five states during the commit
process as shown in Figure
● In partition P1, a transaction has withdrawn £10 from an account
(with balance balx) and in partition P2, two transactions have each
withdrawn £5 from the same account. Assuming at the start both
partitions have £100 in balx, then on completion they both have
£90 in balx.
● When the partitions recover, it is not sufficient to check the value
in balx and assume that the fields are consistent if the values are
the same.
● In this case, the value after executing all three transactions should
be £80.
● Assuming at the start both partitions have £100 in balx, then on
completion one has £40 in balx and the other has £50. Importantly,
neither has violated the integrity constraint.
● However, when the partitions recover and the transactions are both
fully implemented, the balance of the account will be –£10, and the
integrity constraint will have been violated.
● Processing in a partitioned network involves a tradeoff in
● Absolute correctness is most easily provided if no processing of
⮚ Parallel machines are becoming quite common and affordable
dropped sharply
o Recent desktop computers feature multiple processors and
this trend is projected to accelerate
relational algebra)
⮚ Different queries can be run in parallel with each other.
Concurrency control takes care of conflicts.
⮚ Thus, databases naturally lend themselves to parallelism.
o E.g., 10 ≤ r.A < 25.
o Index on partitioning attribute can be local to disk,
making lookup and update more efficient
⮚ No clustering, so difficult to answer range queries
⮚ Queries/transactions execute in parallel with one another.
⮚ Increases transaction throughput; used primarily to scale up a
transaction processing system to support a larger number of
transactions per second.
⮚ More complex protocols with fewer disk reads/writes exist.
⮚ Cache coherency protocols for shared-nothing systems are similar.
Each database page is assigned a home processor. Requests to
fetch the page or write it to disk are sent to the home processor.