
DISTRIBUTED DATABASES

S.MOUNASRI
Asst. Prof.
CSE
Unit-1

Introduction

Distributed DBMS Architecture

Distributed database design


Introduction

distributed data
processing

distributed database
system

promises of DDBSs

problem areas
Distributed DBMS Architecture

architectural models for


distributed DBMS

DDBMS architecture
Distributed database design

alternative design
strategies

distribution design issues

Fragmentation

allocation
INTRODUCTION

 Distributed database system (DDBS) technology is the union of two seemingly opposed approaches to data processing: database system and computer network technologies.
distributed data processing

(distributed processing or distributed computing)


{
• Distributed computing is a field of computer science that
studies distributed systems.
• A distributed system is a system whose components are
located on different networked computers, which communicate
and coordinate their actions by passing messages to one another.
• The components interact with one another in order to
achieve a common goal.
}
distributed database system (DDBS)

What is a distributed database?


 A distributed database is a collection of multiple, logically interrelated databases distributed over a computer network, i.e., a database that is not limited to one system but is spread over different sites (e.g., over multiple computers on a network).
 Then distributed DBMS is the software system that manages
a distributed database such that the distribution aspects are
transparent to the users.
 DDBS is used to refer jointly to the distributed database and the
distributed DBMS.
 A DDBS is concerned not only with the data but also with the structure of the data.
 If, despite the existence of a network, the database resides at only one node of the network, it is not really a distributed database.
 The problems faced in that case are no different from those of maintaining a centralized DBMS: the database is centrally managed by one computer system (say, site 2) and all requests are routed to that site, so the main additional concern is the transmission delay.
 Hence the existence of a computer network, or of a mere collection of files, is not sufficient to form a DDBS.
 We therefore move to an environment where data are distributed among a number of sites.
Promises of DDBSs
 Many of the advantages of DDBSs can be distilled to four fundamentals, which are called the promises of DDBS technology:
1. Transparent Management of Distributed and Replicated Data
(data independence, Network Transparency, Replication
Transparency, Fragmentation Transparency)
2. Reliability (reliable access to data) Through Distributed
Transactions
3. Improved Performance
4. Easier System Expansion
1. Transparent Management of Distributed and Replicated Data
 Transparency refers to separation of the higher-level semantics
of a system from lower-level implementation issues.
 ( DDBMS hides all the added complexities of distribution,
allowing users to think that they are working with a single
centralized system.)
 Here goes an example …..
 Consider an engineering firm that has offices in Boston,
Waterloo, Paris and San Francisco.
 They run projects and maintain database of their employees (ex:
projects, employee data)
 Let us assume that the database is relational and stored in the following two relations:
EMP(ENO, ENAME, TITLE) and PROJ(PNO, PNAME, BUDGET)
 A third relation stores salary information: SAL(TITLE, AMT).
 A fourth relation, ASG(ENO, PNO, RESP, DUR), records project assignments together with the responsibility and duration of each assignment.
 If we want to find out the names of employees who have worked on a project for more than 12 months, we would write a query such as the one sketched below.
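A plausible SQL formulation of this query (not necessarily the exact one from the original slides), sketched with Python's built-in sqlite3 module on a toy in-memory copy of the schema; the sample rows are invented for illustration:

    import sqlite3

    conn = sqlite3.connect(":memory:")          # stand-in for the firm's database
    conn.executescript("""
        CREATE TABLE EMP(ENO TEXT, ENAME TEXT, TITLE TEXT);
        CREATE TABLE ASG(ENO TEXT, PNO TEXT, RESP TEXT, DUR INTEGER);
        INSERT INTO EMP VALUES ('E1', 'J. Doe',   'Elect. Eng.');
        INSERT INTO EMP VALUES ('E2', 'M. Smith', 'Analyst');
        INSERT INTO ASG VALUES ('E1', 'P1', 'Manager', 14);
        INSERT INTO ASG VALUES ('E2', 'P1', 'Analyst', 6);
    """)

    # names of employees who have worked on some project for more than 12 months
    query = """
        SELECT DISTINCT E.ENAME
        FROM   EMP E JOIN ASG A ON E.ENO = A.ENO
        WHERE  A.DUR > 12
    """
    print(conn.execute(query).fetchall())       # [('J. Doe',)]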

 Based on the query, the system may need to search data stored at the different sites such as Boston, Paris, etc.
 To speed up query processing, we partition each of the relations and store each partition at a different site.
 This is known as fragmentation
 Sometimes we also duplicate some of this data at other sites for performance and reliability reasons.
 The result is a distributed database that is fragmented and replicated.
 So Fully transparent access means that the users can still pose the
query as specified above, without paying any attention to the
fragmentation, location, or replication of data
i. Data Independence
 Data independence is a fundamental form of transparency
 It is the capacity to change the database schema (structure/description) at one level of a database system without affecting the schema at the next higher level.
 Database systems follow a multilayer architecture and store these schema descriptions as metadata.
 There are two types of data independence: logical data independence and physical data independence.
 Logical data independence refers to the immunity of user applications to changes in the logical structure (the logical schema) of the database.
 Physical data independence, on the other hand, deals with hiding the details of the storage structure (the physical schema) from user applications.
ii Network Transparency / Distribution Transparency
 In addition to data independence, the user should be protected from the operational details of the network.
 Allowing a user to access a resource ( application program or
data) without the user needing to know whether the resource
is located on the local machine (i.e., the computer which the
user is currently using) or on a remote machine (i.e., a
computer elsewhere on the network).
iii Replication Transparency
 Replication transparency ensures that the replication of databases is hidden from the users.
 It enables users to query upon a table as if only a single copy
of the table exists.
iv Fragmentation Transparency
 Fragmentation divides each database relation into smaller fragments and treats each fragment as a separate database object (i.e., another relation).
 This is for reasons of performance, availability, and
reliability.
 Fragmentation Transparency hides the fact that the table the
user is querying on is actually a fragment or union of some
fragments.
So, to provide easy and efficient access to the DBMS, we need to have full transparency.
2) Reliability (reliable access to data) Through Distributed Transactions

 Distributed DBMSs are designed to improve reliability: since components (and data) may be replicated, single points of failure are eliminated.
 The failure of a single site therefore does not necessarily create a problem for the entire system.
 In a distributed database, proper care is taken so that, even if one part has failed, users may still be permitted to access other parts of the distributed database.
 This requires full support for distributed transactions.
 A transaction is a basic unit of consistent and reliable
computing, consisting of a sequence of database operations
executed as an atomic action.
 Let us take an example of transaction based on the engineering
firm.
 Assuming that there is an application that updates the salaries of
all the employees by 10%.
 If a system failure occurs in the middle of this transaction, we would like the DBMS to be able to determine, upon recovery, where it left off and continue with its operation (or start all over again).
 Furthermore, if some other user runs a query calculating the average salary of the employees in this firm while the original update is going on, the calculated result may be in error.
 Therefore we would like the system to be able to synchronize the concurrent execution of these two programs.
 Distributed transactions execute at a number of sites at which
they access the local database.
 In this way the system provides the facility that no transaction is left interrupted: it either completes or its effects are undone.
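A minimal sketch of the salary-update transaction described above, again using Python's sqlite3 for illustration: either the whole 10% raise commits, or (if anything fails midway) the rollback leaves the database as it was. The data values are invented.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE SAL(TITLE TEXT PRIMARY KEY, AMT REAL)")
    conn.executemany("INSERT INTO SAL VALUES (?, ?)",
                     [("Elect. Eng.", 40000), ("Analyst", 34000)])
    conn.commit()

    try:
        # the transaction: raise every salary by 10% as one atomic action
        conn.execute("UPDATE SAL SET AMT = AMT * 1.10")
        conn.commit()                      # make the new state durable
    except Exception:
        conn.rollback()                    # undo partial effects on failure

    print(conn.execute("SELECT TITLE, AMT FROM SAL").fetchall())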
3) Improved Performance
 The performance of distributed DBMSs is improved as follows:
 A distributed DBMS fragments the conceptual database and stores the data close to where it is needed; this is also called data localization.
 Data localization has the following advantages:
-- Since each site handles only a portion of the database, contention for CPU and I/O services is not as severe.
-- Localization reduces remote access delays.
 The inherent parallelism of distributed systems can be exploited, in two forms: inter-query and intra-query parallelism.
 Inter-query parallelism is the ability to execute multiple queries
at the same time
 intra-query parallelism is achieved by breaking up a single query
into a number of sub queries each of which is executed at a
different site, accessing a different part of the distributed
database.
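A toy sketch of intra-query parallelism, assuming three horizontal fragments of ASG held at different "sites" (here just Python lists with invented data): the sub-query "find assignments longer than 12 months" runs against each fragment in parallel and the partial results are merged at the coordinating site.

    from concurrent.futures import ThreadPoolExecutor

    # one list of (ENO, PNO, DUR) tuples per site -- invented sample data
    fragments = {
        "Boston":   [("E1", "P1", 14), ("E2", "P1", 6)],
        "Paris":    [("E3", "P2", 24)],
        "Waterloo": [("E4", "P3", 10)],
    }

    def sub_query(rows):
        """Sub-query executed locally at one site."""
        return [eno for (eno, pno, dur) in rows if dur > 12]

    with ThreadPoolExecutor() as pool:
        partial_results = pool.map(sub_query, fragments.values())

    # the coordinating site merges (unions) the partial results
    print(sorted(set(eno for part in partial_results for eno in part)))   # ['E1', 'E3']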
4) Easier System Expansion
 In a distributed environment, it is much easier to accommodate
increasing database sizes.
 In general expansion can be handled by adding processing and
storage power to the network.
 This also depends on the overhead of distribution.
 One aspect of easier system expansion is economics: it normally costs much less to put together a system of “smaller” computers than to buy a single big machine of equivalent power.
Problem areas
 Even though the underlying principles are the same, a distributed environment is more complex than a centralized database system.
 This complexity gives rise to new problems, influenced by three factors:
1. First, data may be replicated in a distributed environment: the distributed database is designed so that a portion of the database, or the whole database, resides at different sites of the computer network (it is not essential that every site on the network contain the database). In this regard the distributed DBMS is responsible for:
- choosing one of the stored copies of the requested data for access, and
- making sure that the effect of an update is reflected on each and every copy of that data item.
2. Second, if some site fails (through software or hardware malfunction), or if some communication link fails, while an update is being executed, the system must make sure that the effects will be reflected on the data residing at the failed or unreachable sites as soon as the system can recover from the failure.
3. Third, since each site cannot have instantaneous information on the actions currently being carried out at the other sites, the synchronization of transactions executing at multiple sites has to be taken care of.
Distributed DBMS Architecture
 The architecture of a system defines its structure
 Here the components of the system are identified
 The function of each component is specified
 The interrelationships and interactions among
these components are defined.
1) ANSI/SPARC Architecture
 In late 1972, the Computer and Information Processing
Committee (X3) of the American National Standards
Institute (ANSI) established a Study Group on DBMS under
the Standards Planning and Requirements Committee
(SPARC).
 The mission of the study group was to study the feasibility of
setting up standards
 The study group issued its report in 1975 and its final report in
1977
 The architectural framework proposed in these reports is known as the “ANSI/SPARC architecture,” its full title being “ANSI/X3/SPARC DBMS Framework.”
 There are three views of data:
 The external view, which is that of the end user, who might be a
programmer.
 The internal view, that of the system or machine.
 The conceptual view, that of the enterprise.
 Lowest level of the architecture is the internal view, which
deals with the physical definition and organization of data.
 The location of data on different storage devices and the
access mechanisms used to reach and manipulate data are the
issues dealt with at this level.
 The external view, which is concerned with how users view
the database.
 In between these two ends is the conceptual schema, which is
an abstract definition of the database used to represent the data
and the relationships among data.
 The main advantage of this architecture is that it supports data independence: the separation of the schemas leads to physical and logical data independence.
Architectural Models for Distributed DBMSs
 There are various possible ways in which a distributed
DBMS may be architected.
 We use a classification that organizes the systems
as characterized with respect to
(1) the autonomy of local systems,
(2) their distribution, and
(3) their heterogeneity.
Autonomy
 Autonomy, refers to the distribution of control, which
indicates the degree to which individual DBMSs can operate
independently.
 Autonomy is a function of a number of factors such as
1) whether the component systems (individual
DBMSs) exchange information.
2) whether they can independently execute transactions.
3) whether one is allowed to modify them.
Requirements of an autonomous system have been specified
as follows
1. The local operations of the individual DBMSs should not
be affected by their participation in the distributed system.
2. The manner in which the individual DBMSs process queries
should not be affected by the execution of global queries
that access multiple databases
3. System consistency or operation should not be compromised
when individual DBMSs join or leave the distributed
system.
There are dimensions of autonomy which can be specified as
follows
1. Design autonomy: Individual DBMSs are free to use the data
models and transaction management techniques that they
prefer.
2. Communication autonomy: Each of the individual DBMSs is
free to make its own decision as to what type of information
it wants to provide to the other DBMSs.
3. Execution autonomy: Each DBMS can execute the
transactions that are submitted to it in any way that it wants
to.
Other than autonomous systems there are other alternatives.
They are tight integration, semiautonomous, total isolation.
1. Tight integration : a single image of the entire database is available to any user who wants to share the information, which may reside in multiple databases. In these tightly-integrated systems, the data managers are implemented so that one of them is in control of the processing of each user request, even if that request is serviced by more than one data manager.
2. Semiautonomous : They are not fully autonomous systems
because they need to be modified to enable them to exchange
information with one another.
3. Total isolation : the individual systems are stand-alone DBMSs
that know neither of the existence of other DBMSs nor how to
communicate with them. In such systems, the processing of
user transactions that access multiple databases is especially
difficult
Distribution :
Distribution is the physical distribution of data over multiple sites.
There are many ways DBMSs have been distributed. We
abstract these alternatives into two classes:
1. client/server distribution : The sites on a network are
distinguished as “clients” and “servers”. Communication duties
are shared between the client machines and servers.
2. peer-to-peer distribution : There is no distinction of client
machines versus servers. Each machine has full DBMS
functionality and can communicate with other machines to
execute queries and transactions.
Heterogeneity
 It refers to the uniformity or dissimilarity of the data models,
system components and databases.
 It occurs in various forms in distributed systems, ranging from hardware heterogeneity and differences in networking protocols to variations in data managers.
 It may be in data models, query languages, and transaction
management protocols.
 Representing data with different modeling tools creates
heterogeneity
 Heterogeneity in query languages involves the use of
completely different data access paradigms
 Even though SQL is the standard relational query language, there are many different implementations of it.
Client/Server Systems
 Client/server DBMSs entered the computing scene at the beginning of the 1990s.
 The general idea is to divide the functionality into two classes: server functions and client functions.
 This two-level architecture makes it easier to manage the complexity of modern DBMSs and the complexity of distribution.
 The allocation of functionality between clients and servers differs in different types of distributed DBMSs.
 In relational systems, the server does most of the data
management work (all of query processing and optimization,
transaction management and storage management is done at
the server)
 The client, in addition to application and the user interface, has
a DBMS client module that is responsible for managing the data
that is cached to the client and (sometimes) managing the
transaction locks
 There is operating system and communication software that
runs on both the client and the server.
 Here the communication between the clients and the server(s)
is at the level of SQL statements.
 The client passes SQL queries to the server. The server does
most of the work and returns the result relation to the client.
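A minimal sketch of the multiple client/single server idea just described: the "server" below wraps the only copy of the database and does all of the query work, while the "client" merely forwards SQL strings and caches results. The class and method names are invented for illustration.

    import sqlite3

    class Server:
        """Hosts the database and performs all query processing."""
        def __init__(self):
            self.db = sqlite3.connect(":memory:")
            self.db.execute("CREATE TABLE EMP(ENO TEXT, ENAME TEXT, TITLE TEXT)")
            self.db.execute("INSERT INTO EMP VALUES ('E1', 'J. Doe', 'Elect. Eng.')")

        def execute(self, sql):
            return self.db.execute(sql).fetchall()    # result relation sent back

    class Client:
        """Sends SQL to the server; keeps a small local result cache."""
        def __init__(self, server):
            self.server = server
            self.cache = {}

        def query(self, sql):
            if sql not in self.cache:                 # cache hit avoids a round trip
                self.cache[sql] = self.server.execute(sql)
            return self.cache[sql]

    server = Server()
    client = Client(server)
    print(client.query("SELECT ENAME FROM EMP"))      # [('J. Doe',)]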
 There are various types of client/server architectures.
 In the simplest case, there is only one server which is accessed by multiple clients; this is called multiple client/single server.
 Here the database is stored on only one machine (the server) that
also hosts the software to manage it.
 A more sophisticated client/server architecture is one where there are multiple servers in the system, called the multiple client/multiple server approach.
In the multiple client/multiple server approach, two alternative management strategies are possible:
1. Each client manages its own connection to the appropriate server. This approach simplifies the server code, but loads the client machines with additional responsibilities, leading to what are called “heavy client” systems.
2. Each client knows only its “home server,” which then communicates with other servers as required. This approach concentrates the data management functionality at the servers; the transparency of data access is provided at the server interface, leading to “light clients.”
Peer-to-Peer Systems
 Modern peer-to-peer systems have two important differences
from their earlier relatives.
 The first is the massive distribution in current systems.
 While in the early days we focused on a few sites, current
systems consider thousands of sites.
 The second is the inherent heterogeneity of every aspect of
the sites and their autonomy.
 Now initially we focus on the meaning of peer-to-peer
(functionality of each site), since the principles and
fundamental techniques of these systems are very similar to
those of client/server systems
 Let us describe the architecture from a data organizational point of view.
 The first observation is that the physical data organization on each machine may be different.
 This means that there needs to be an individual internal schema definition at each site, which is called the local internal schema (LIS).
 The enterprise view of the data is described by the global
conceptual schema (GCS).
 which is global because it describes the logical structure of the
data at all the sites.
 To handle data fragmentation and replication, the logical
organization of data at each site needs to be described.
 Therefore, there needs to be a third layer in the architecture, the
local conceptual schema (LCS).
 User applications and user access to the database are supported by external schemas (ESs).
 Data independence is supported since the model is
an extension of ANSI/SPARC
 Location and replication transparencies are supported by
the local and global conceptual schemas.
 Network transparency, on the other hand, is supported by the
global conceptual schema
 The detailed components of a distributed DBMS are shown in the preceding figure.
 One component handles the interaction with users, and
another deals with the storage
 The user processor consists of four elements:

1. user interface handler


2. semantic data controller
3. Global query optimizer and decomposer
4. distributed execution monitor
1. The user interface handler is responsible for interpreting user
commands as they come in, and formatting the result data as
it is sent to the user.
2. The semantic data controller uses the integrity constraints and authorization rules to check whether the user query can be processed; it is also responsible for authorization.
3. The global query optimizer and decomposer determines an
execution strategy to minimize a cost function. The global
query optimizer is responsible for generating the best
strategy to execute distributed operations
4. The distributed execution monitor coordinates the
distributed execution of the user request. The execution
monitor is also called the distributed transaction manager.
In executing queries in a distributed fashion, the execution
monitors at various sites may, and usually do, communicate with
one another.
 The second component of a distributed DBMS is the data processor, which consists of three elements:
1. The local query optimizer, which acts as the access path
selector, is responsible for choosing the best access path
to access any data item
2. The local recovery manager is responsible for making sure
that the local database remains consistent even when failures
occur.
3. The run-time support processor is the interface to the
operating system and contains the database buffer (or cache)
manager, which is responsible for maintaining the main
memory buffers and managing the data accesses.
 In peer-to-peer systems, both the user processor modules and the data processor modules are present on each machine.
Distributed database design
Alternative design strategies
 Two major strategies that have been identified for designing
distributed databases. They are
1. Top-down approach
2. Bottom-up approach
 Top-down approach is more suitable for tightly integrated,
homogeneous distributed DBMSs.
 Bottom-up design is more suitable to multidatabases.
 Top-down design begins with a requirements analysis that defines the environment of the system (the data and processing needs of all database users).
 The requirements study also specifies where the final
system is expected to stand with respect to the objectives of
a distributed DBMS
 These objectives are performance, reliability and availability,
economics, and expandability (flexibility).
 The requirements document is input to two parallel
activities: view design and conceptual design.
 The view design activity deals with defining the interfaces for
end users.
 The conceptual design, is the process by which the enterprise
is examined to determine entity types and relationships
among these entities.
 From the conceptual design step, comes the definition of
global conceptual schema.
 The global conceptual schema (GCS) and access pattern
information collected as a result of view design are inputs to
the distribution design step.
 The objective at this stage, is to design the local conceptual
schemas (LCSs) by distributing the entities over the sites of the
distributed system.
 The last step in the design process is the physical design,
which maps the local conceptual schemas to the physical
storage devices available at the corresponding sites.
 Design and development is an ongoing activity requiring constant monitoring and periodic adjustment and tuning; for this reason the design process includes an observation and monitoring step.
 The result is some form of feedback, which may lead back to one of the earlier steps in the design.
Distribution design issues
 In the relational model, the relations of a database schema are decomposed into smaller fragments.
 Thus, the distribution design activity consists of two steps: fragmentation and allocation.
 To justify and detail this process, we consider a set of interrelated questions that covers the entire issue:
1. Why fragment at all?
2. How should we fragment?
3. How much should we fragment?
4. Is there any way to test the correctness of decomposition?
5. How should we allocate?
6. What is the necessary information for fragmentation
and allocation?
1) Reasons for Fragmentation
 the decomposition of a relation into fragments, each being
treated as a unit, permits a number of transactions to
execute concurrently.
 Fragmentation also results in the parallel execution of a
single query by dividing it into a set of sub queries that
operate on fragments.
 Thus fragmentation typically increases the level of concurrency; this form of concurrency is referred to as intraquery concurrency.
2) Fragmentation Alternatives
 Relation instances are essentially tables, so the issue is one of finding alternative ways of dividing a table into smaller ones.
 There are two alternatives : dividing it horizontally or
dividing it vertically.
3) Degree of Fragmentation
 The extent to which the database should be fragmented is
an important decision that affects the performance of query
execution.
 The degree of fragmentation goes from one extreme, that is,
not to fragment at all, to the other extreme, to fragment to the
level of individual tuples (in the case of horizontal
fragmentation) or to the level of individual attributes (in the
case of vertical fragmentation).
4) Correctness Rules of Fragmentation
 There are three rules that we follow during fragmentation which, together, ensure that the database does not undergo semantic change during fragmentation (a small check of these rules is sketched below):
a) Completeness b) Reconstruction c) Disjointness
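A small sketch, on invented EMP tuples, of horizontal fragmentation by a predicate together with a check of the three correctness rules: completeness (every tuple lands in some fragment), reconstruction (the union of the fragments gives back EMP), and disjointness (no tuple appears in two fragments).

    # EMP as a set of (ENO, ENAME, TITLE) tuples -- sample data for illustration
    EMP = {
        ("E1", "J. Doe",   "Elect. Eng."),
        ("E2", "M. Smith", "Analyst"),
        ("E3", "A. Lee",   "Mech. Eng."),
    }

    # horizontal fragmentation: one fragment per (mutually exclusive) predicate
    EMP1 = {t for t in EMP if t[2] == "Analyst"}       # sigma TITLE =  'Analyst'
    EMP2 = {t for t in EMP if t[2] != "Analyst"}       # sigma TITLE <> 'Analyst'

    # completeness: every tuple of EMP is in some fragment
    assert all(t in EMP1 or t in EMP2 for t in EMP)
    # reconstruction: EMP can be rebuilt as the union of its fragments
    assert EMP1 | EMP2 == EMP
    # disjointness: the fragments do not overlap
    assert EMP1 & EMP2 == set()
    print("fragmentation is correct")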
5) Allocation Alternatives
 After the database is fragmented, one has to decide on the
allocation of the fragments to various sites on the network.
 When data are allocated, it may either be replicated
or maintained as a single copy.
 The reasons for replication are reliability and efficiency
of read-only queries.
 The degree of replication depends on the ratio of read-only queries to update queries.
6) Information Requirements
 The information needed for distribution design can be divided
into four categories:
 database information, application information, communication
network information, and computer system information.
Fragmentation
 there are two fundamental fragmentation strategies:
horizontal
and vertical.
 Furthermore, there is a possibility of nesting
fragments in a hybrid fashion.

1. Horizontal Fragmentation
2. Vertical Fragmentation
3. Hybrid Fragmentation
Hybrid Fragmentation
 In most cases a simple horizontal or vertical fragmentation of a
database schema will not be sufficient to satisfy the
requirements of user applications.
 In this case a vertical fragmentation may be followed by
a horizontal one, or vice versa, producing a tree
structured partitioning
 Since the two types of partitioning strategies are applied one after the other, this alternative is called hybrid fragmentation (also known as mixed or nested fragmentation).
Allocation
 The allocation of resources (fragments or individual files) across the nodes of a computer network is a difficult problem.
Allocation Problem

 The allocation problem is defined with respect to two measures:
1. Minimal cost : The cost function includes the cost of storing each fragment at a site, the cost of querying and updating a fragment at a site, and the cost of data communication; the allocation problem tries to find an allocation scheme that minimizes this combined cost.
2. Performance : The allocation strategy is designed to maintain a performance metric. Two well-known ones are to minimize the response time and to maximize the system throughput at each site.
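A toy sketch of a non-replicated allocation heuristic under invented numbers: each fragment is placed at the site that minimizes the communication cost of accessing it, given per-site access frequencies and a unit cost of shipping data between sites. Real allocation models also account for storage and update costs.

    # access frequency of each fragment from each site (invented figures)
    freq = {
        "F1": {"Boston": 20, "Paris": 5},
        "F2": {"Boston": 2,  "Paris": 30},
    }
    # communication cost per access between pairs of sites (0 if local)
    comm = {("Boston", "Boston"): 0, ("Boston", "Paris"): 10,
            ("Paris", "Boston"): 10, ("Paris", "Paris"): 0}
    sites = ["Boston", "Paris"]

    def cost_if_placed_at(fragment, site):
        """Total remote-access cost if 'fragment' is stored only at 'site'."""
        return sum(freq[fragment][s] * comm[(s, site)] for s in sites)

    allocation = {f: min(sites, key=lambda s: cost_if_placed_at(f, s))
                  for f in freq}
    print(allocation)    # {'F1': 'Boston', 'F2': 'Paris'}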
DDB
unit-2

S.MOUNASRI
Asst. Prof.
CSE
Query Processing & Decomposition

Distributed Query Optimization


QUERY PROCESSING & DECOMPOSITION
Query Processing Objectives
⚫ The objective of query processing in a distributed context is
to transform a high-level query on a distributed database into
an efficient execution strategy expressed in a low-level
language on local databases
⚫ We assume that the high-level language is relational
calculus, while the low-level language is an extension of
relational algebra with communication operators.
⚫ Important aspect of query processing is query optimization
⚫ Among the possible transformations of the high-level query, query optimization chooses the one that optimizes (minimizes) resource consumption.
⚫A good measure of resource consumption is the total cost
that will be incurred in processing the query.
⚫ Total cost is the sum of all times incurred in processing
the operators of the query at various sites and in intersite
communication.
⚫ Another measure is the response time of the query, which
is the time elapsed for executing the query
⚫ In a distributed database system, the total cost to
be minimized includes
1. CPU,
2. I/O and
3. communication costs.
1. The CPU cost is incurred when performing operators on
data in main memory
2. The I/O cost is the time necessary for disk
accesses. This cost can be minimized by reducing the
number of disk accesses through fast access methods to
the data and efficient use of main memory (buffer
management).
3. The communication cost is the time needed for exchanging
data between sites participating in the execution of the
query.
Characterization of Query Processors
⚫ It is quite difficult to evaluate and compare query
processors in both centralized systems and distributed
systems because they may differ in many aspects.
⚫ Here are some important characteristics of query processors which are used as a basis for comparison.
⚫ The first four characteristics hold for both centralized and distributed query processors.
⚫ The next four characteristics are particular to distributed query processors of distributed DBMSs.
1. Languages
2. Types of Optimization
3. Optimization Timing
4. Statistics
5. Decision Sites
6. Exploitation of the Network Topology
7. Exploitation of Replicated Fragments
8. Use of Semijoins
1) Languages
⚫ Query processing must perform efficient mapping
from the input language to the output language.
2) Types of Optimization
⚫ query optimization aims at choosing the “best” point in
the solution space of all possible execution strategies.
⚫ An immediate method for query optimization is to search the
solution space, predict the cost of each strategy, and select
the strategy with minimum cost.
⚫ The problem is that the solution space can be large; that is,
there may be many equivalent strategies, even with a
small number of relations.
⚫ Having a high optimization cost is not necessarily bad, particularly if the query is optimized once and executed many times; therefore, an “exhaustive” search approach is often used whereby (almost) all possible execution strategies are considered.
3) Optimization Timing
⚫ A query may be optimized at different times.
⚫ Optimization can be done statically or dynamically
⚫ statically means before executing the query
⚫ Dynamically means as the query is executed.
⚫ The main advantage of dynamic over static query optimization is that the actual sizes of intermediate relations are available to the query processor, thereby minimizing the probability of a bad choice.
4) Statistics
⚫ The effectiveness of query optimization relies on statistics on
the database.
⚫ Dynamic query optimization requires statistics in order to
choose which operators should be done first.
⚫ Static query optimization is even more demanding since the
size of intermediate relations must also be estimated based
on statistical information
5) Decision Sites
⚫ When optimization is used, either a single site or several
sites may participate in the selection of the strategy to be
applied for answering the query.
⚫ In the centralized approach, a single site generates the strategy; in the distributed approach, the decision process is distributed among the various sites participating in the elaboration of the best strategy.
⚫ The centralized approach is simpler but requires knowledge of the entire distributed database, whereas the distributed approach requires only local information.
⚫ Hybrid approaches, where one site makes the major decisions and other sites make local decisions, are also frequent.
6) Exploitation of the Network Topology
⚫ The network topology is generally exploited by the
distributed query processor
⚫ With wide area networks, the cost function to be
minimized can be restricted to the data communication cost
⚫ With local area networks, communication costs
are comparable to I/O costs
7) Exploitation of Replicated Fragments
⚫ A distributed relation is usually divided into
relation fragments
⚫ Distributed queries expressed on global relations are
mapped into queries on physical fragments of relations by
translating relations into fragments.
⚫ We call this process localization because its main function is
to localize the data involved in the query.
8) Use of Semijoins
⚫ The semijoin operator has the important property of
reducing the size of the operand relation.
⚫ a semijoin is particularly useful for improving the
processing of distributed join operators as it reduces the size
of data exchanged between sites
⚫ Early distributed DBMSs, which assumed slow wide area networks, made extensive use of semijoins.
⚫ Some later systems, which assume faster networks, do not employ semijoins.
Layers of Query Processing
⚫ The input is a query on global data expressed in relational
calculus.
⚫ This query is posed on global (distributed) relations,
meaning that data distribution is hidden.
⚫ Four main layers are involved in distributed query processing.
⚫ The first three layers map the input query into an optimized
distributed query execution plan
⚫ They perform the functions of query decomposition, data
localization, and global query optimization.
⚫ The fourth layer performs distributed query execution
by executing the plan and returns the answer to the
query.
⚫ It is done by the local sites and the control site.
1) Query Decomposition
⚫ The first layer decomposes the calculus query into
an algebraic query on global relations.
⚫ The information needed for this transformation is found in
the global conceptual schema describing the global relations
⚫ Query decomposition can be done as four successive steps
⚫ First, the calculus query is rewritten in a normalized form that
is suitable for subsequent manipulation
⚫ Second, the normalized query is analyzed semantically so that
incorrect queries are detected and rejected as early as
possible.
⚫ Third, the correct query (still expressed in relational
calculus) is simplified (eliminate redundant predicates)
⚫ Fourth, the calculus query is restructured as an
algebraic query
⚫ The algebraic query generated by this layer is good in the sense that the worst executions are typically avoided.
2) Data Localization
⚫ The input to the second layer is an algebraic query on
global relations.
⚫ The main role of the second layer is to localize the
query’s data using data distribution information in the
fragment schema.
⚫ This layer determines which fragments are involved in
the query and transforms the distributed query into a
query on fragments
⚫ A global relation can be reconstructed by applying the
fragmentation rules, and then deriving a program, called a
localization program, of relational algebra operators,
which then act on fragments.
3) Global Query Optimization
⚫ The input to the third layer is an algebraic query on fragments.
⚫ The goal of query optimization is to find an execution
strategy for the query.
⚫ Query optimization consists of finding the best execution strategy, i.e., the one that minimizes a cost function.
⚫ The cost function, often defined in terms of time units
⚫ It is a weighted combination of I/O, CPU, and
communication costs.
⚫ The output of the query optimization layer is an optimized algebraic query.
⚫ It is represented and saved (for future executions) as
a distributed query execution plan .
4) Distributed Query Execution
⚫ The last layer is performed by all the sites having
fragments involved in the query.
⚫ Each subquery executing at one site, called a local query,
is then optimized using the local schema of the site and
executed.
Query Decomposition
⚫ It is the first phase of query processing that transforms
a relational calculus query into a relational algebra
query
⚫ The successive steps of query decomposition are (1)
normalization, (2) analysis, (3) elimination of redundancy, and
(4) rewriting.
Normalization
⚫ It is the goal of normalization to transform the query to a
normalized form to facilitate further processing.
⚫ There are two possible normal forms: conjunctive ( ˄ ) normal form and disjunctive ( ˅ ) normal form.
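A small illustration of the two normal forms on a made-up qualification p1 AND (p2 OR p3): its disjunctive normal form is (p1 AND p2) OR (p1 AND p3), and the brute-force check below confirms the two forms are equivalent under every truth assignment.

    from itertools import product

    def original(p1, p2, p3):          # conjunctive form: p1 AND (p2 OR p3)
        return p1 and (p2 or p3)

    def dnf(p1, p2, p3):               # disjunctive form: (p1 AND p2) OR (p1 AND p3)
        return (p1 and p2) or (p1 and p3)

    # the two normal forms agree on all eight truth assignments
    assert all(original(*v) == dnf(*v) for v in product([False, True], repeat=3))
    print("equivalent")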
Analysis
⚫ Query analysis enables rejection of normalized queries for
which further processing is either impossible or
unnecessary.
⚫ The main reasons for rejection are that the query is type
incorrect or semantically incorrect.
⚫ When one of these cases is detected, the query is simply
returned to the user with an explanation. Otherwise,
query processing is continued.
Elimination of Redundancy
⚫ The query qualification may contain redundant predicates; such redundancy (and the useless work it implies) is eliminated by simplifying the qualification using well-known idempotency rules.
Rewriting
⚫ The last step of query decomposition rewrites the query
in relational algebra.
⚫ For the sake of clarity it is customary to represent the
relational algebra query graphically by an operator tree
⚫ Here the leaf node is a relation stored in the database, and
a non-leaf node is an intermediate relation produced by a
relational algebra operator.
⚫ The sequence of operations is directed from the leaves to
the root, which represents the answer to the query.
• Select Operation (σ) : It selects tuples that satisfy the given predicate from a relation
• Project Operation (∏) : It projects column(s) that satisfy a given predicate
• Cartesian Product (Χ) : Combines information of two different relations into one.
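A minimal sketch of these three operators on in-memory relations (lists of dicts with invented tuples): select keeps tuples satisfying a predicate, project keeps a subset of the columns, and the Cartesian product pairs every tuple of one relation with every tuple of the other.

    EMP = [{"ENO": "E1", "ENAME": "J. Doe"}, {"ENO": "E2", "ENAME": "M. Smith"}]
    ASG = [{"ENO": "E1", "DUR": 14}]

    def select(relation, predicate):                       # sigma: filter tuples
        return [t for t in relation if predicate(t)]

    def project(relation, attrs):                          # pi: keep some columns
        return [{a: t[a] for a in attrs} for t in relation]

    def cartesian(r, s, rname, sname):                     # x: pair every tuple of r with every tuple of s
        return [{**{f"{rname}.{k}": v for k, v in t.items()},
                 **{f"{sname}.{k}": v for k, v in u.items()}}
                for t in r for u in s]

    # names of employees assigned to some project for more than 12 months
    rows = cartesian(EMP, ASG, "EMP", "ASG")
    rows = select(rows, lambda t: t["EMP.ENO"] == t["ASG.ENO"] and t["ASG.DUR"] > 12)
    print(project(rows, ["EMP.ENAME"]))                    # [{'EMP.ENAME': 'J. Doe'}]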
Localization of Distributed Data
⚫ Determine which fragments are involved
⚫ the localization layer translates an algebraic query on
global relations into an algebraic query expressed on
physical fragments.
⚫ To simplify this section, we do not consider the fact that data
fragments may be replicated
⚫ This can be viewed as replacing the leaves of the operator
tree of the distributed query with subtrees corresponding to
the localization programs.
⚫ We call the query obtained this way the localized query.
Reduction for Primary Horizontal Fragmentation
Reduction with selection
EXAMPLE: We now illustrate reduction by horizontal fragmentation using an example query that applies a selection to the fragmented relation EMP.
⚫ Applying the approach to localize EMP from EMP1, EMP2, and EMP3 gives the localized query. By commuting the selection with the union operation, it is easy to detect that the selection predicate contradicts the predicates of EMP1 and EMP3, thereby producing empty relations. The reduced query is simply applied to EMP2.
(a) Localized query (b) Reduced query

FIG: Reduction for Horizontal Fragmentation (with Selection)


Reduction for Vertical Fragmentation
DISTRIBUTED QUERY OPTIMIZATION
Query Optimization
⚫ The objective of the optimizer is to find a strategy close to optimal and, more importantly, to avoid bad strategies.
⚫ We refer to the strategy produced by the optimizer as the optimal strategy (or optimal ordering).
⚫ The output of the optimizer is an optimized query execution
plan consisting of the algebraic query specified on fragments
and the communication operations to support the execution
of the query over the fragment sites.
⚫ The selection of the optimal strategy requires the prediction of
execution costs (total cost)
⚫ The execution cost is expressed as a weighted combination of
I/O, CPU, and communication costs.
Query Optimization
⚫ Query optimization refers to the process of producing a
query execution plan (QEP) which represents an execution
strategy for the query.
⚫ QEP minimizes an objective cost function.
⚫ A query optimizer, the software module that performs query
optimization, consists of three components:
⚫ a search space,
⚫ a cost model, and
⚫ a search strategy
⚫ The search space is the set of alternative execution plans that represent the input query.
⚫ All these plans produce the same result for the input query but differ in the execution order of operations and the way the operations are implemented, and therefore in their performance.
⚫ The cost model predicts the cost of a given execution plan.
To be accurate, the cost model must have good knowledge
about the distributed execution environment.
⚫ The search strategy explores the search space and selects
the best plan, using the cost model.
⚫ The details of the environment (centralized vs. distributed) are
captured by the search space and the cost model.
⚫ Query execution plans are abstracted by means of operator trees, which define the order in which the operations are executed.
⚫ They are enriched with additional information, such as the
best algorithm chosen for each operation.
⚫ For a given query, the search space can thus be defined as the
set of equivalent operator trees
⚫ To characterize query optimizers, we concentrate on join trees, which are operator trees whose operators are join or Cartesian product, because permutations of the join order have the most important effect on the performance of relational queries.
Example
⚫ Each of the join trees can be assigned a cost based on the
estimated cost of each operator.
⚫ Join tree (c) which starts with a Cartesian product may have
a much higher cost than the other join trees.
⚫ If the query is large, restrictions (heuristics) are used to reduce the search space.
⚫ The most common heuristic is to perform selection and projection when accessing base relations.
⚫ Another common heuristic is to avoid Cartesian products that are not required by the query.
⚫ Thus, operator tree (c) would not be part of the search space considered by the optimizer.
• Another important restriction is with respect to the shape of the join tree.
• Two kinds of join trees are usually distinguished: linear versus bushy trees.
• By considering only linear trees, the size of the search space is reduced to O(2^N).
• In a distributed environment, bushy trees are useful in exhibiting parallelism.
• For example, in join tree (b), operations R1 ⋈ R2 and R3 ⋈ R4 can be done in parallel.
Cost functions
⚫ The cost of a distributed execution strategy can be
expressed with respect to either the total time or the
response time.
⚫ The total time is the sum of all time (also referred to as cost)
components.
⚫ The response time is the elapsed time from the initiation to the
completion of the query.
⚫ A general formula for determining the total time is:
Total time = TCPU * #insts + TI/O * #I/Os + TMSG * #msgs + TTR * #bytes
⚫ The first two components measure the local processing time.
⚫ The last two components measure the communication time.
⚫ TCPU= time of a CPU instruction
⚫ TI/O= time of a disk I/O.
⚫ TMSG = fixed time of initiating and receiving a message
⚫ TTR=time it takes to transmit a data unit from one site
to another.
⚫ The data unit is in terms of bytes. (#bytes is the sum of
the sizes of all messages )
⚫ TTR is assumed to be constant within a given network, although it differs considerably between networks (e.g., a LAN versus a WAN).
⚫ Costs are generally expressed in terms of time units, which
in turn, can be translated into other units (e.g., dollars).
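A small numeric sketch of this total-time formula; the per-unit times and operation counts below are invented for illustration, and would in practice come from the system catalog and network characteristics.

    # per-unit times (invented values, in milliseconds)
    T_CPU = 0.000001      # time of one CPU instruction
    T_IO  = 5.0           # time of one disk I/O
    T_MSG = 1.0           # fixed time to initiate/receive a message
    T_TR  = 0.01          # time to transmit one byte between sites

    def total_time(n_insts, n_ios, n_msgs, n_bytes):
        """Total time = TCPU*#insts + TI/O*#I/Os + TMSG*#msgs + TTR*#bytes."""
        return T_CPU * n_insts + T_IO * n_ios + T_MSG * n_msgs + T_TR * n_bytes

    # compare two candidate strategies for the same query
    print(total_time(n_insts=1_000_000, n_ios=100, n_msgs=4,  n_bytes=10_000))
    print(total_time(n_insts=1_000_000, n_ios=20,  n_msgs=40, n_bytes=200_000))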
Centralized Query Optimization
⚫ It is a prerequisite to understanding distributed
query optimization for three reasons.
⚫ First, a distributed query is translated into local queries, each
of which is processed in a centralized way.
⚫ Second, distributed query optimization techniques are often
extensions of the techniques for centralized systems.
⚫ Finally, centralized query optimization is a simpler problem.
⚫ The minimization of communication costs makes distributed
query optimization more complex.
⚫ the optimization timing, which can be dynamic, static or
hybrid, is a good basis for classifying query optimization
techniques.
⚫ Dynamic Query Optimization : The QEP is dynamically
constructed by the query optimizer which makes calls to the
DBMS execution engine for executing the query’s
operations. Thus, there is no need for a cost model.
⚫ Static Query Optimization : With static query optimization,
there is a clear separation between the generation of the
QEP at compile-time and its execution by the DBMS
execution engine. Thus, an accurate cost model is key to
predict the costs of candidate QEPs.

Distributed Query Optimization


S.MOUNASRI
Asst. Prof.
CSE
Transaction Management
⦿ Definition
⦿ Properties of transaction
⦿ Types of transactions
⦿ Distributed concurrency control
DEFINITION
⦿ The concept of a transaction is used in database systems as
a basic unit of consistent and reliable computing.
⦿ The terms consistent and reliable play a major role and need to be defined precisely.
⦿ A database is in a consistent state if it obeys all of the
consistency (integrity) constraints defined over it
⦿ Reliability refers to both the resiliency of a system to
various types of failures and its capability to recover from
them.
⦿ a transaction is a unit of consistent and reliable computation.
⦿ a transaction takes a database, performs an action on it, and
generates a new version of the database, causing a state transition.
⦿ a transaction is considered to be made up of a sequence of read
and write operations on the database, together with
computation steps.
⦿ i.e., a transaction may be thought of as a program with embedded
database access queries
Properties of Transactions
⦿ The consistency and reliability aspects of transactions are due
to four properties:
(1) atomicity,
(2) consistency,
(3) isolation, and
(4) durability.
⦿ Together, these are commonly referred to as the
ACID properties of transactions.
Atomicity
⦿ Atomicity refers to either all the transaction’s actions are
completed, or none of them are.
⦿ This is also known as the “all-or-nothing property.”
⦿ Atomicity requires that if the execution of a transaction is
interrupted by any sort of failure, the DBMS will be
responsible for determining what to do with the transaction
upon recovery from the failure.
⦿ There are two possible courses of action: it can either be
terminated by completing the remaining actions, or it can
be terminated by undoing all the actions that have already
been executed.
⦿ There are two types of failures : A transaction itself may fail due
to input data errors, deadlocks, or other factors.
⦿ Maintaining transaction atomicity in the presence of this type of
failure is commonly called the transaction recovery
⦿ The second type of failure is caused by system crashes, such as
media failures, processor failures, communication link
breakages, power outages etc.
⦿ Ensuring transaction atomicity in the presence of system crashes
is called crash recovery.
Consistency
⦿ The consistency of a transaction is its correctness
⦿ In other words, a transaction is a correct program that maps one
consistent database state to another.
⦿ There is a classification of consistency that groups databases into
four levels.
⦿ Then, based on the concept of dirty data, the four levels are
defined as follows:
“Degree 3: Transaction T sees degree 3 consistency if:
1. T does not overwrite dirty data of other transactions.
2. T does not commit any writes until it completes all its writes
[i.e., until the end of transaction (EOT)].
3. T does not read dirty data from other transactions.
4. Other transactions do not dirty any data read by T before
T completes.
Degree 2: Transaction T sees degree 2 consistency if:
1. T does not overwrite dirty data of other transactions.
2. T does not commit any writes before EOT.
3. T does not read dirty data from other transactions.
Degree 1: Transaction T sees degree 1 consistency if:
1. T does not overwrite dirty data of other transactions.
2. T does not commit any writes before EOT.
Degree 0: Transaction T sees degree 0 consistency if:
1. T does not overwrite dirty data of other transactions.”
⦿ The point of defining multiple levels of consistency is to provide application programmers with the flexibility to define transactions that operate at different levels.
⦿ Consequently, while some transactions operate at Degree
3 consistency level, others may operate at lower levels
Isolation
⦿ This property ensures that multiple transactions can execute concurrently without leading to an inconsistent database state.
⦿ It also means that an executing transaction cannot reveal its (incomplete) results to other concurrent transactions before it commits.
Durability
⦿ Durability refers to that property of transactions which
ensures that once a transaction commits, its results are
permanent and cannot be erased from the database.
⦿ The DBMS ensures that the results of a transaction will survive
subsequent system failures.
Types of transactions
⦿ We are going to discuss transaction models
⦿ Transactions have been classified according to a number of
criteria.
⦿ First goes duration of transactions
⦿ transactions may be classified as online or batch
⦿ These two classes are also called short-life and long-
life transactions.
⦿ Online transactions are characterized by very short execution/
response times (typically, on the order of a couple of seconds)
and by access to a relatively small portion of the database.
⦿ Batch transactions take longer to execute (response times being measured in minutes, hours, or even days) and access a larger portion of the database.
⦿ Examples are statistical applications, report generation, complex queries, and image processing.
⦿ Transactions that intermix read and write actions without any specific ordering are called general transactions.
⦿ If the transactions are restricted so that all the read actions are
performed before any write action, the transaction is called a
two- step transaction
⦿ If the transaction is restricted so that a data item has to be read
before it can be updated (written), the corresponding class is
called restricted (or read-before-write) transaction
⦿ If a transaction is both two step and restricted, it is called a
restricted two-step transaction.

Example (two-step transaction — all reads precede any write):
Read(x), Read(y), x ← x + 1, y ← y + 1, Write(x), Write(y), Commit

Example (restricted transaction — each item is read before it is written):
Read(x), x ← x + 1, Write(x), Read(y), y ← y + 1, Write(y), Commit
⦿ Transactions can also be classified according to their structure
⦿ distinguishing four broad categories in increasing complexity
1. flat transactions
2. nested transactions (closed/open)
3. workflow models (combinations of various nested forms)
Flat transactions
⦿ Flat transactions have a single start point (Begin transaction)
and a single termination point (End transaction).
nested transactions
⦿ An alternative transaction model is to permit a transaction to
include other transactions with their own begin and commit
points
⦿ These transactions that are embedded in another one are
usually called subtransactions.
⦿ we differentiate between closed and open nesting because
of their termination characteristics.
⦿ Closed nested transactions commit in a bottom-up fashion
through the root.
⦿ Thus, a nested subtransaction begins after its parent and
finishes before it, and the commitment of the subtransactions is
conditional upon the commitment of the parent.
⦿ An open nested transaction allows its partial results to be
observed outside the transaction.
Advantages of nested transactions
1. providing a higher-level of concurrency among transactions.
Since a transaction consists of a number of other
transactions, more concurrency is possible within a single
transaction
2. It is possible to recover independently from failures of
each subtransaction.
Workflows
⦿ Flat and nested transactions are not sufficient to model complex business activities.
⦿ Workflows came into existence for this purpose.
⦿ a workflow is “a collection of tasks organized to accomplish
some business process.”
⦿ Three types of workflows are identified: human-oriented, system-oriented, and transactional workflows.
Distributed concurrency control
⦿ The distributed concurrency control mechanism of a distributed DBMS ensures that the consistency of the database is maintained in a multiuser distributed environment.
⦿ In this chapter, we make two major assumptions: the
distributed system is fully reliable and does not experience any
failures (of hardware or software), and the database is not
replicated.
Serializability
⦿ When multiple transactions are running concurrently then there
is a possibility that the database may be left in an inconsistent
state. Serializability is a concept that helps us to check which
schedules are serializable, that always leaves the database in
consistent state.
⦿ A history R (also called a schedule) is defined over a set of
transactions T={T1,T2,……Tn} and specifies an interleaved order
of execution of these transactions’ operations.
⦿ Two operations Oij(x) and Okl(x) (where i and k denote transactions, not necessarily distinct) accessing the same database entity x are said to be in conflict if at least one of them is a write operation.
⦿ There are two things in this definition
⦿ First, read operations do not conflict with each other. We can,
therefore, talk about two types of conflicts: read-write (or
write- read), and write-write.
⦿ Second, the two operations can belong to the same transaction or
to two different transactions.
⦿ the existence of a conflict between two operations indicates
that their order of execution is important
⦿ The ordering of two read operations is insignificant (it does not matter which of them executes first).
Concurrency control mechanisms & algorithms
⦿ There are a number of ways that the concurrency control
approaches can be classified.
⦿ Grouping the concurrency control mechanisms into two
broad classes:
1. pessimistic concurrency control methods and
2. optimistic concurrency control methods.
⦿ Pessimistic algorithms synchronize the concurrent execution of
transactions early in their execution life cycle
⦿ optimistic algorithms delay the synchronization of transactions
until their termination
⦿ The pessimistic group consists of locking based algorithms,
ordering (or transaction ordering) based algorithms, and
hybrid algorithms.
⦿ The optimistic group can be classified as locking-based or
timestamp ordering-based.
⦿ In the locking-based approach, the synchronization of
transactions is achieved by employing physical or logical locks
on some portion or granule of the database.
⦿ The timestamp ordering (TO) class involves organizing the
execution order of transactions so that they maintain
transaction consistency
⦿ This ordering is maintained by assigning timestamps to both the
transactions and the data items that are stored in the database.
⦿ In some locking-based algorithms, timestamps are also used to improve efficiency and the level of concurrency; these are called hybrid algorithms.
locking-based approach
⦿ The main idea of locking-based concurrency control is to ensure
that a data item that is shared by conflicting operations is
accessed by one operation at a time by associating a “lock” with
each lock unit.
⦿ A lock is set by a transaction on a lock unit before that unit is accessed and is reset (released) at the end of its use; a lock unit cannot be accessed by an operation if it is already locked in an incompatible mode by another transaction.
⦿ there are two types of locks (called lock modes) associated with each
lock unit:
1. read lock (rl) and
2. write lock (wl).
⦿ A transaction Ti that wants to read a data item contained in lock unit x obtains a read lock on x [denoted rli(x)]; the same happens for write operations [wli(x)].
⦿ Two lock modes are compatible if two transactions that access the same
data item can obtain these locks on that data item at the same time.
⦿ The distributed DBMS also handles the lock
management responsibilities on behalf of the transactions
⦿ It means users do not need to specify when a data item needs to
be locked; the distributed DBMS takes care of that every time the
transaction issues a read or write operation.
⦿ The problem with such a history is that the locking algorithm releases the locks held by a transaction (say, Ti) as soon as the associated database command (read or write) is executed and that lock unit (say x) no longer needs to be accessed; other transactions may then interleave with Ti in a non-serializable way.
⦿ Hence we use two-phase locking (2PL).
⦿ It states that no transaction should request a lock after it
releases one of its locks.
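A very small sketch of a 2PL-style lock manager: read locks on the same item are compatible, anything involving a write lock is not, and a growing/shrinking flag enforces the rule that a transaction may not request a lock after it has released one. Waiting, deadlock handling and distribution are all omitted; the class and names are invented.

    class TwoPhaseLockManager:
        def __init__(self):
            self.locks = {}        # item -> list of (transaction, mode)
            self.shrinking = set() # transactions that have released a lock

        def _compatible(self, holders, txn, mode):
            # two locks are compatible only if both are read locks (or same txn)
            return all(t == txn or (m == "r" and mode == "r") for t, m in holders)

        def lock(self, txn, item, mode):
            if txn in self.shrinking:
                raise RuntimeError("2PL violation: lock requested after a release")
            holders = self.locks.setdefault(item, [])
            if not self._compatible(holders, txn, mode):
                return False                      # caller would block/wait here
            holders.append((txn, mode))
            return True

        def release(self, txn, item):
            self.shrinking.add(txn)               # txn enters its shrinking phase
            self.locks[item] = [(t, m) for t, m in self.locks[item] if t != txn]

    lm = TwoPhaseLockManager()
    print(lm.lock("T1", "x", "r"), lm.lock("T2", "x", "r"))   # True True (read locks share)
    print(lm.lock("T2", "x", "w"))                            # False (conflicts with T1's read)
    lm.release("T1", "x")
    # lm.lock("T1", "y", "r") would now raise: T1 is in its shrinking phase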
Time-stamped & optimistic concurrency control algorithms

⦿ To establish timestamp ordering, the transaction manager assigns


each transaction Ti a unique timestamp, ts(Ti), at its initiation.
⦿ A timestamp is a simple identifier that serves to identify each transaction uniquely and is used for ordering.
⦿ Formally, the timestamp ordering (TO) rule can be specified as follows: given two conflicting operations Oij and Okl belonging, respectively, to transactions Ti and Tk, Oij is executed before Okl if and only if ts(Ti) < ts(Tk); Ti is then said to be the older transaction and Tk the younger one.
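A bare-bones sketch in the spirit of the basic TO check: each data item remembers the largest read and write timestamps that have touched it, and an operation arriving "too late" (from a transaction older than one that already performed a conflicting operation) is rejected so that the transaction can restart with a new timestamp. The class and behaviour are simplified for illustration.

    class BasicTO:
        def __init__(self):
            self.rts = {}   # item -> largest timestamp of a transaction that read it
            self.wts = {}   # item -> largest timestamp of a transaction that wrote it

        def read(self, ts, item):
            if ts < self.wts.get(item, 0):
                return "reject"                   # a younger txn already wrote item
            self.rts[item] = max(self.rts.get(item, 0), ts)
            return "ok"

        def write(self, ts, item):
            if ts < self.rts.get(item, 0) or ts < self.wts.get(item, 0):
                return "reject"                   # a younger txn already read/wrote item
            self.wts[item] = ts
            return "ok"

    to = BasicTO()
    print(to.read(ts=2, item="x"))    # ok
    print(to.write(ts=1, item="x"))   # reject: T1 is older than the reader T2
    print(to.write(ts=3, item="x"))   # ok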
Optimistic Concurrency Control
⦿ Pessimistic concurrency control algorithms assume that the
conflicts between transactions are quite frequent and do not
permit a transaction to access a data item if there is a conflicting
transaction that accesses that data item.
⦿ Thus the execution of any operation of a transaction follows the sequence of phases: validation (V), read (R), computation (C), write (W).
⦿ Optimistic algorithms delay the validation phase until just before the write phase.
⦿ Thus an operation submitted to an optimistic scheduler is never delayed.
⦿ The read, compute, and write operations of each transaction are
processed freely without updating the actual database.
⦿ Each transaction initially makes its updates on local copies of data items.
⦿ The validation phase consists of checking if these updates would
maintain the
consistency of the database.
⦿ If the answer is affirmative, the changes are made global (i.e., written
into the
actual database).
⦿ Otherwise, the transaction is aborted and has to restart.
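The following is an illustrative, backward-validation-style sketch of the validation step (the function name and data structures are assumptions, not the exact algorithm from these notes): a transaction passes validation only if its read set does not overlap the write sets of transactions that committed while it was executing.

```python
# Backward-validation sketch for optimistic concurrency control (illustrative).

def validate(read_set, committed_write_sets):
    """Return True if the transaction may make its local updates global."""
    for ws in committed_write_sets:
        if read_set & ws:        # overlap -> the transaction read stale data
            return False
    return True

# T read {x, y}; a transaction that committed meanwhile wrote {y} -> abort/restart.
print(validate({"x", "y"}, [{"y"}]))   # False
print(validate({"x"},      [{"y"}]))   # True
```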
Deadlock management
⦿ Any locking-based concurrency control algorithm may result in deadlocks,
since there is mutual exclusion (only one transaction at a time may access a
shared resource, i.e., the data) and transactions may wait on locks.
⦿ TO-based algorithms that require the waiting of transactions may
also cause deadlocks.
⦿ A deadlock can occur because transactions wait for one another
⦿ A deadlock situation is a set of requests that can never be granted
by the concurrency control mechanism.
⦿ Using a wait-for graph (WFG), in which the nodes are transactions and an edge
Ti → Tj indicates that Ti is waiting for Tj to release a lock, it is easier to state
the condition for the occurrence of a deadlock.
⦿ A deadlock occurs when the WFG contains a cycle.
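For deadlock detection, this condition can be checked directly by searching the WFG for a cycle; a minimal sketch (representing the WFG as a hypothetical adjacency map) is shown below.

```python
# Detect a deadlock by looking for a cycle in the wait-for graph (illustrative).
# wfg maps each transaction to the set of transactions it is waiting for.

def has_deadlock(wfg):
    visited, on_stack = set(), set()

    def dfs(t):
        visited.add(t)
        on_stack.add(t)
        for u in wfg.get(t, ()):
            if u in on_stack or (u not in visited and dfs(u)):
                return True      # back edge found -> cycle -> deadlock
        on_stack.discard(t)
        return False

    return any(t not in visited and dfs(t) for t in wfg)

# T1 waits for T2 and T2 waits for T1 -> deadlock.
print(has_deadlock({"T1": {"T2"}, "T2": {"T1"}}))   # True
print(has_deadlock({"T1": {"T2"}, "T2": set()}))    # False
```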
Deadlock Prevention
⦿ Deadlock prevention methods guarantee that deadlocks cannot
occur in the first place.
⦿ the transaction manager checks a transaction when it is
first initiated and does not permit it to proceed if it may
cause a deadlock
⦿ To perform this check, it is required that all of the data items that
will be accessed by a transaction be predeclared
⦿ The transaction manager then permits a transaction to proceed if all
the data items that it will access are available.
⦿ Otherwise, the transaction is not permitted to proceed.
DISTRIBUTED DATABASES
UNIT 4
BY
S.MOUNASRI
ASST.PROF.
Distributed DBMS Reliability
 Reliability concepts & measures
 fault-tolerance in distributed systems
 Failures in distributed DBMS
 Local & distributed reliability protocols
 site failures and network partitioning
Parallel database systems
 parallel database system architectures
 Parallel data placement
 parallel query processing
 load balancing
 database clusters
Distributed DBMS Reliability
• A reliable distributed database management system is one that can continue to
process user requests even when the underlying system is unreliable.
• Even when components of the distributed computing environment fail, a reliable
distributed DBMS should be able to continue executing user requests without
violating database consistency.
• So we discuss the reliability features of a distributed DBMS.
 Reliability concepts & measures
(i) System, State, and Failure:
• Reliability refers to a system that consists of a set of components.
• The system has a state, which changes as the system operates.
• Any deviation of a system from the behavior described in the specification is considered a
failure.
• For example, in a distributed transaction manager the specification may state that
only serializable schedules for the execution of concurrent transactions should be
generated.

• If the transaction manager generates a non-serializable schedule, we say that it has failed.
(ii) Reliability and Availability
• Reliability refers to the probability that the system under consideration does not experience
any failures in a given time interval
• Formally, the reliability of a system, R(t), is defined as the following conditional
probability:
R(t) =Pr{0 failures in time [0,t]|no failures at t =0}

• If we assume that failures follow a Poisson distribution (which is usually the case
for hardware), this formula reduces to

R(t) = Pr{0 failures in time [0,t]}

• Under the same assumptions, it is possible to derive that

R(t) = e^(−λt), where λ is the (constant) failure rate.
• Availability, A(t), refers to the probability that the system is operational according to its
specification at a given point in time t.
• Reliability and availability of a system are considered to be somewhat contradictory
objectives: a system that fails frequently but recovers very quickly can be highly
available yet unreliable.
(iii) Mean Time between Failures (MTBF) / Mean Time to Repair (MTTR)
• MTBF is the expected time between subsequent failures in a system with repair.
• MTBF can be calculated either from empirical data or from the reliability function as
MTBF = ∫ R(t) dt, integrated from 0 to ∞ (for the exponential model above, MTBF = 1/λ).
• MTTR is the expected time to repair a failed system; together, MTBF and MTTR give the
steady-state availability A = MTBF / (MTBF + MTTR).
• MTTD: mean time to detect (a failure).
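A small worked example (the numbers are illustrative, not from these notes) tying the measures together under the exponential model:

```python
# Worked reliability example with assumed numbers (illustrative only).
import math

lam  = 0.001     # assumed failure rate: 0.001 failures per hour
mttr = 2.0       # assumed mean time to repair: 2 hours

mtbf  = 1.0 / lam                 # exponential model: MTBF = 1/lambda = 1000 hours
r_100 = math.exp(-lam * 100)      # R(100 h): probability of no failure in 100 hours
avail = mtbf / (mtbf + mttr)      # steady-state availability

print(f"MTBF = {mtbf:.0f} h, R(100 h) = {r_100:.3f}, A = {avail:.4f}")
# -> MTBF = 1000 h, R(100 h) = 0.905, A = 0.9980
```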
 Failures in distributed DBMS
• Designing a reliable system that can recover from failures requires identifying the types
of failures with which the system has to deal.
• In a distributed database system, we need to deal with four types of failures:
i. transaction failures (aborts)
ii. site (system) failures
iii. media (disk) failures
iv. communication line failures
(i) transaction failures :
• Transactions can fail for a number of reasons.
• Failure can be due to an error in the transaction caused by incorrect input data or due to
deadlocks
• some concurrency control algorithms do not permit a transaction to proceed or even to wait if
the data that they attempt to access are currently being accessed by another transaction.
• This might also be considered a failure.
• The approach to take in cases of transaction failure is to abort the transaction, thus resetting the
database to its state prior to the start of this transaction
(ii) site (system) failures :
• The reasons for system failure can be traced back to a hardware or to a software failure
• A system failure is always assumed to result in the loss of main memory contents.
• Therefore, any part of the database that was in main memory buffers is lost as a result of a
system failure.
• In a distributed DBMS, system failures are typically referred to as site failures, since they make the failed site unreachable from the other sites.
• We differentiate between partial and total failures in a distributed system.
• Total failure refers to the simultaneous failure of all sites in the distributed system
• partial failure indicates the failure of only some sites while the others remain operational.
(iii) media failures :
• Media failure refers to the failures of the secondary storage devices that store the database.
• These failures may be due to operating system errors, as well as to hardware faults
• It means that all or part of the database that is on the secondary storage is considered to be
destroyed and inaccessible
(iv) Communication Failures :
• The three types of failures described above are common to both centralized and distributed
DBMSs.
• Communication failures are unique to the distributed case, and there are a number of types of
communication failures:
• Errors in the messages, improperly ordered messages, lost (or undeliverable) messages, and
communication line failures.
 Local & distributed reliability protocols
Local reliability protocols
• we discuss the functions performed by the local recovery manager (LRM) that exists at each site.
• These functions maintain the atomicity and durability properties of local transactions.
• Which relate to the execution of the commands that are passed to the LRM, which are
begin_transaction, read, write, commit, and abort.
• When the LRM wants to read a page of data on behalf of a transaction it issues a fetch command,
indicating the page that it wants to read.
• The buffer manager checks to see if that page is already in the buffer and if so, makes it available for that
transaction; if not, it reads the page from the stable database into an empty database buffer
• In addition to the above, there is a sixth interface command to the LRM: recover.
• The recover command is the interface that the operating system has to the LRM
• It is used during recovery from system failures when the operating system asks the DBMS to recover
the database to the state that existed when the failure occurred.
Distributed reliability protocols
• The distributed reliability protocols also aim to maintain the atomicity and durability of
distributed transactions that execute over a number of databases.
• The protocols address the distributed execution of the begin_transaction, read, write, abort, commit,
and recover commands.
• All of these commands are executed in much the same manner as in a centralized system.
• We assume that at the originating site of a transaction there is a coordinator process, and at each site
where the transaction executes there are participant processes.
• Thus, the distributed reliability protocols are implemented between the coordinator and the participants
(a commit-protocol sketch in this style appears at the end of this subsection).
• If, during the execution of a distributed transaction, one of the sites involved in the execution fails,
we would like the other sites to be able to terminate the transaction nevertheless.
• Recovery protocols deal with the procedure that the process (coordinator or participant) at the failed site
has to go through to recover its state once the site is restarted
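The classic protocol implementing this coordinator/participant interaction is two-phase commit (2PC). The following is a minimal, failure-free sketch of its message flow (class and method names are hypothetical; logging, timeouts, and the termination/recovery protocols are omitted):

```python
# Minimal two-phase commit sketch (failure-free path only; illustrative).

class Participant:
    def __init__(self, name, can_commit=True):
        self.name, self.can_commit = name, can_commit

    def prepare(self):
        # Phase 1: the participant would force a ready record to its log, then vote.
        return "VOTE-COMMIT" if self.can_commit else "VOTE-ABORT"

    def finish(self, decision):
        # Phase 2: apply the global decision and acknowledge.
        print(f"{self.name}: {decision}")
        return "ACK"

def coordinator(participants):
    votes = [p.prepare() for p in participants]                 # phase 1: collect votes
    decision = "COMMIT" if all(v == "VOTE-COMMIT" for v in votes) else "ABORT"
    acks = [p.finish(decision) for p in participants]           # phase 2: broadcast decision
    assert all(a == "ACK" for a in acks)
    return decision

print(coordinator([Participant("site A"), Participant("site B")]))          # COMMIT
print(coordinator([Participant("site A"), Participant("site B", False)]))   # ABORT
```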
 site failures and network partitioning
Parallel database systems
 parallel database system architectures
• Many data-intensive applications, such as e-commerce, data warehousing, and data mining,
require support for very large databases.
• Very large databases are accessed through high numbers of concurrent transactions (e.g.,
performing on-line orders on an electronic store) or complex queries (e.g., decision-support
queries).
• The first kind of access is representative of On-Line Transaction Processing (OLTP)
applications
• while the second is representative of On-Line Analytical Processing (OLAP) applications
• Supporting very large databases efficiently for either OLTP or OLAP can be addressed by
combining parallel computing and distributed database management.
(i) Parallel Database System Architectures

Objectives
• Parallel database systems combine database management and parallel processing to increase performance
and availability
• A parallel database system can be loosely defined as a DBMS implemented on a parallel computer.
• The objectives of parallel database systems are covered by those of distributed DBMS (performance,
availability, extensibility).
• Ideally, a parallel database system should provide the following advantages
 High-performance: obtained through parallel data management, query optimization, and load
balancing (the ability of the system to divide a given workload equally among all processors), etc.
 High-availability: because a parallel database system consists of many redundant components, it can
increase data availability and fault-tolerance.
 Extensibility: in a parallel system, accommodating increasing database sizes or increasing performance
demands should be easier.
Functional Architecture
1. Session Manager: provides support for client interactions with the server and performs
the connections and disconnections between the client processes and the two other
subsystems. Therefore, it initiates and closes user sessions.
2. Transaction Manager: receives client transactions related to query compilation and
execution. Depending on the transaction, it activates the various compilation phases,
triggers query execution, and returns the results as well as error codes to the client
application.
3. Data Manager : It provides all the low-level functions needed to run compiled queries in
parallel, i.e., database operator execution, parallel transaction support, cache management,
etc.
Parallel DBMS Architectures
• There are three basic parallel computer architectures depending on how main memory or disk is
shared: shared-memory, shared-disk and shared-nothing
1. shared-memory : In the shared-memory approach any processor has access to any memory module
or disk unit through a fast interconnect . All the processors are under the control of a single
operating system.
2. shared-disk : In this any processor has access to any disk unit through the interconnect but
exclusive (non-shared) access to its main memory . Each processor-memory node is under the control
of its own copy of the operating system.

3. shared-nothing: In this approach each processor has exclusive access to its main memory and disk
unit(s). Similar to shared-disk, each processor memory-disk node is under the control of its own copy
of the operating system. Then, each node can be viewed as a local site
 Parallel data placement
• Data placement in a parallel database system exhibits similarities with data fragmentation in
distributed databases
• we use the terms partitioning and partition instead of horizontal fragmentation and
horizontal fragment
• There are three basic strategies for data partitioning: round-robin, hash, and range partitioning
(a sketch of all three follows this list).
1. Round-robin partitioning is the simplest strategy; it ensures uniform data distribution. This strategy
enables sequential access to a relation to be done in parallel.
2. Hash partitioning applies a hash function to some attribute that yields the partition number. This
strategy allows exact-match queries on the selection attribute to be processed by exactly one node
and all other queries to be processed by all the nodes in parallel.
3. Range partitioning distributes tuples based on the value intervals (ranges) of some attribute
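A minimal sketch of the three partitioning strategies (the helper names are hypothetical), distributing the tuples of a relation over the nodes:

```python
# Round-robin, hash, and range partitioning sketches (illustrative).

def round_robin(tuples, n):
    parts = [[] for _ in range(n)]
    for i, t in enumerate(tuples):
        parts[i % n].append(t)               # i-th tuple -> node i mod n
    return parts

def hash_partition(tuples, n, key):
    parts = [[] for _ in range(n)]
    for t in tuples:
        parts[hash(t[key]) % n].append(t)    # exact-match queries hit exactly one node
    return parts

def range_partition(tuples, bounds, key):
    # bounds = upper bounds of every interval except the last (e.g. [150, 300])
    parts = [[] for _ in range(len(bounds) + 1)]
    for t in tuples:
        idx = next((i for i, b in enumerate(bounds) if t[key] <= b), len(bounds))
        parts[idx].append(t)
    return parts

emps = [{"eno": i, "sal": 50 * i} for i in range(1, 9)]
print(round_robin(emps, 3))
print(hash_partition(emps, 3, "eno"))
print(range_partition(emps, [150, 300], "sal"))
```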
 parallel query processing
• The objective of parallel query processing is to transform queries into execution plans that can be
efficiently executed in parallel.
• It focuses on both intra-operator parallelism (a single operator is executed by multiple
processors, each working on a subset of the data, as sketched below) and inter-operator
parallelism (different operators of the same query run on different processors).
• A parallel query optimizer can be seen as three components: a search space, a cost model, and a
search strategy.
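A toy sketch of intra-operator parallelism (hypothetical, with a thread pool standing in for the processors): the same selection operator runs on every partition and the partial results are merged.

```python
# Intra-operator parallelism sketch: one selection operator, many partitions.
from concurrent.futures import ThreadPoolExecutor

# Three partitions of an Emp relation (e.g. produced by round-robin partitioning).
partitions = [[{"eno": i, "sal": 50 * i} for i in range(p, 9, 3)] for p in range(3)]

def select(partition, pred):
    """The same selection operator, applied to one partition."""
    return [t for t in partition if pred(t)]

with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
    partial_results = pool.map(lambda part: select(part, lambda t: t["sal"] > 200),
                               partitions)

# Merge the partial results produced in parallel.
print([t for part in partial_results for t in part])
```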
 load balancing
• Good load balancing is crucial for the performance of a parallel system
• the response time of a set of parallel operators is that of the longest one.
• Thus, minimizing the time of the longest one is important for minimizing response time.
• Balancing the load of different transactions and queries among different nodes is also essential to
maximize throughput
• Solutions to these problems can be obtained at the intra- and inter-operator levels
 database clusters
• a cluster can have a shared-disk or shared-nothing architecture
• Shared-disk requires a special interconnect that provides a shared disk space to all nodes with
provision for cache consistency
• Shared-nothing can better support database autonomy without the additional cost of a special
interconnect and can scale up to very large configurations
• Client applications interact with the middleware in a classical way to submit database transactions
• The general processing of a transaction to a single database is as follows. First, the transaction is
authenticated and authorized using the directory. If successful, the transaction is routed to a
DBMS at some, possibly different, node to be executed.
• As in a parallel DBMS, the database cluster middleware has several software layers:
transaction load balancer, replication manager, query processor and fault tolerance manager
UNIT-5
DISTRIBUTED DATABASES

S.MOUNASRI
Asst.Prof.
DISTRIBUTED OBJECT DATABASE MANAGEMENT SYSTEMS

Fundamental object concepts and models


Object distributed design
Architectural issues
Object management
Distributed object storage
Object query processing
• Object database management systems (object DBMSs) are better
candidates for the development of some of the applications
• Because the applications require explicit storage and manipulation of more
abstract data types (e.g., images, design documents) and the ability for the
users to define their own application-specific types.
Fundamental object concepts and models
• An object DBMS is a system that uses the “object” as its fundamental modeling
primitive, in which information is represented in the form of objects.
Object
• all object DBMSs are built around the fundamental concept of an object
• An object represents a real entity in the system that is being modeled.
• It is represented as a triple (OID, state, interface), where:
• OID is the object identifier
• state is some representation of the current state of the object
• interface defines the behavior of the object
Types and Classes
• “class” refer to the specific object model construct and the term “type” refer to
a domain of objects (e.g., integer, string).
• Here we don’t make a distinction between primitive system objects (i.e., values),
structural (tuple or set) objects, and user-defined objects.
• A class describes the type of data by providing a domain of data with the same
structure
• As an example, consider a class Car whose definition specifies eight attributes (the definition is reconstructed in the sketch after this list).
• Four of the attributes (model, year, serial no, capacity) are value-based, while the
others (engine, bumpers, tires and make) are object-based (i.e., have other objects
as their values).
• Attribute bumpers is set valued (i.e., uses the set constructor), and attribute tires is
tuple-valued where the left front (lf), right front (rf), left rear (lr) and right rear (rr)
tires are individually identified.
• we follow a notation where the attributes are lower case and types are capitalized.
• Thus, engine is an attribute and Engine is a type in the system
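Since the original class definition is not reproduced in these notes, the following is a reconstruction in Python of the Car type as described above; the component type names (Engine, Bumper, Tire, Tires, Manufacturer) are assumptions.

```python
# Reconstruction of the Car type described above (illustrative, not the original notation).
from dataclasses import dataclass

@dataclass
class Engine: ...        # object-based value of attribute engine

@dataclass
class Bumper: ...        # element type of the set-valued attribute bumpers

@dataclass
class Tire: ...          # component type of the tuple-valued attribute tires

@dataclass
class Tires:             # tuple constructor: one tire per position
    lf: Tire             # left front
    rf: Tire             # right front
    lr: Tire             # left rear
    rr: Tire             # right rear

@dataclass
class Manufacturer: ...  # object-based value of attribute make (assumed name)

@dataclass
class Car:
    # value-based attributes
    model: str
    year: int
    serial_no: str
    capacity: int
    # object-based attributes (their values are other objects)
    engine: Engine
    bumpers: set[Bumper]   # set constructor
    tires: Tires           # tuple constructor
    make: Manufacturer
```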
Composition (Aggregation)
Object distributed design
• The two important aspects of distribution design are fragmentation and allocation
• An object is defined by its state and its methods.
• We can fragment the state, the method definitions, and the method implementation.
• Furthermore, the objects in a class extent can also be fragmented and placed at
different sites.
1. Horizontal Class Partitioning: given a class C to be partitioned, we create classes C1, ..., Cn,
each of which holds the instances of C that satisfy its particular partitioning predicate
(see the sketch after this list).
2. Vertical Class Partitioning : Vertical fragmentation is considerably more complicated.
Given a class C, fragmenting it vertically into C1,...,Cm produces a number of
classes, each of which contains some of the attributes and some of the methods. Thus,
each of the fragments is less defined than the original class.
3. Path Partitioning : Path partitioning is a concept describing the clustering of all the
objects forming a composite object into a partition . A path partition consists of
grouping the objects of all the domain classes that correspond to all the instance
variables in the subtree rooted at the composite object
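An illustrative sketch of horizontal and vertical partitioning of a class extent (the helper functions and the sample Car instances are hypothetical):

```python
# Horizontal and vertical partitioning of a class extent (illustrative sketch).
cars = [{"model": "A3", "year": 2018, "capacity": 4},
        {"model": "i20", "year": 2022, "capacity": 5}]

def horizontal_partition(extent, predicate):
    """Split the extent into the instances that satisfy the predicate and the rest."""
    c1 = [o for o in extent if predicate(o)]
    c2 = [o for o in extent if not predicate(o)]
    return c1, c2

def vertical_partition(extent, attrs):
    """Keep only some of the attributes in this fragment (an OID would be kept in practice)."""
    return [{k: o[k] for k in attrs} for o in extent]

old_cars, new_cars = horizontal_partition(cars, lambda c: c["year"] < 2020)
print(old_cars, new_cars)
print(vertical_partition(cars, ["model", "year"]))
```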
Architectural issues
• The preferred architectural model for object DBMSs has been client/server
• The unit of communication between the clients and the server is an issue in
object dbms
• Since data are shared by many clients, the management of client cache
buffers for data consistency becomes a serious concern
• Since objects may be composite or complex, there may be possibilities for
prefetching component objects when an object is requested.
Alternative Client/Server Architectures
• Two main types of client/server architectures have been proposed:
1. object servers
2. page servers
• The distinction is partly based on the granularity of data that are shipped
between the clients and the servers, and partly on the functionality provided
to the clients and servers.
• In object servers, clients request “objects” from the server, which retrieves them from
the database and returns them to the requesting client.
• In object servers, the server undertakes most of the DBMS services, with the client
providing basically an execution environment for the applications, as well as some
level of object management functionality
• The object management layer is duplicated at both the client and the server in order
to allow both to perform object functions. Object manager serves a number of
functions
• Object manager deals with the implementation of the object identifier (logical,
physical, or virtual) and the deletion of objects (either explicit deletion or garbage
collection ).
• Object managers at both the client and the server implement an object cache (at the
client, mainly to improve system performance).
• The optimization of user queries and the synchronization of user transactions are all
performed in the server, with the client receiving the resulting objects.
• In page servers, the unit of transfer between the servers and the clients is a physical unit of
data, such as a page (the smallest unit of data for memory management in a virtual memory
system) or a segment, rather than an object.
• Page server architectures split the object processing services between the
clients and the servers
• Page servers simplify the DBMS code, since both the server and the client
maintain page caches, and the representation of an object is the same
• Thus, updates to the objects occur only in client caches and these updates
are reflected on disk when the page is flushed from the client to the server
• The server performs a limited set of functions and can therefore serve a
large number of clients
Object management
• Object management includes tasks such as object identifier management, pointer
swizzling, object migration, and deletion of objects; these tasks are handled at the server.
• object identifier management : object identifiers (OIDs) are system-generated and used
to uniquely identify every object.
• The implementation object identifier has two common solutions, based on either
physical or logical identifiers
• The physical identifier (POID) approach equates the OID with the physical address of
the corresponding object. The address can be a disk page address
• The logical identifier (LOID) approach consists of allocating a system-wide unique OID
per object.
• Pointer Swizzling : In object systems, one can navigate from one object to another using
path expressions that involve attributes with object-based values. These are called
pointers.
• On disk, object identifiers (OIDs) are usually used to represent these pointers.
• In memory, it is desirable to use in-memory pointers for navigating from one object to
another.
• The process of converting a disk version of the pointer to an in-memory version of a
pointer is known as “pointer-swizzling”.
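A minimal sketch (hypothetical names) of swizzling through a resident object table: the first traversal of a disk OID faults the object in and replaces the OID with an in-memory reference, so later traversals are plain pointer dereferences.

```python
# Pointer-swizzling sketch using a resident object table (illustrative only).

class ObjectCache:
    def __init__(self, disk):
        self.disk = disk          # simulated stable storage: OID -> object state
        self.resident = {}        # resident object table: OID -> in-memory object

    def swizzle(self, oid):
        """Turn a disk OID into an in-memory reference, loading the object if needed."""
        if oid not in self.resident:
            self.resident[oid] = dict(self.disk[oid])   # fault the object in
        return self.resident[oid]

cache = ObjectCache(disk={"oid:42": {"model": "Sedan", "engine": "oid:99"},
                          "oid:99": {"hp": 120}})

car = cache.swizzle("oid:42")
engine = cache.swizzle(car["engine"])   # follow the path expression car.engine
car["engine"] = engine                  # swizzled: OID replaced by a direct reference
print(engine["hp"])                     # 120
```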
Distributed object storage
• Among the many issues related to object storage, two are particularly
relevant in a distributed system: object clustering and distributed garbage
collection.
• Object clustering is concerned with placing related objects close together on disk so that
the I/O cost of retrieving them is reduced.
• Garbage collection is a problem that arises in object databases due to
reference-based sharing.
• Indeed, in many object DBMSs, the only way to delete an object is to delete
all references to it.
Object query processing
• Almost all object query processors and optimizers that have been proposed
to date use techniques developed for relational systems.
• Consequently, it is possible to claim that distributed object query
processing and optimization techniques require the extension of
centralized object query processing and optimization with the distribution
approaches(discussed earlier)
• Objects can (and usually do) have complex structures whereby the state of
an object references another object.
• Accessing such complex objects involves path expressions
Object Oriented Data Model
• Object oriented data model is based upon objects, with different attributes.
• All these object have multiple relationships between them.
Objects
• The real world entities and situations are represented as objects in the Object oriented
database model.
Attributes and Method
• Every object has certain characteristics. These are represented using Attributes. The
behaviour of the objects is represented using Methods.
Class
• Similar attributes and methods are grouped together using a class. An object can be
called as an instance of the class.
Inheritance
• A new class can be derived from the original class. The derived class contains attributes
and methods of the original class as well as its own.
Inheritance
• A new class can inherit the attributes of an old class.
• For example, classes audi, ford, and hyundai can be derived from a class car.
Object identity
• Object identity is typically implemented via a unique, system-generated OID. The value
of the OID is not visible to the external user, but is used internally by the system to
identify each object uniquely and to create and manage inter-object references
persistence of objects
• Persistence denotes a process or an object that continues to exist even after its parent
process, or the system that runs it, is turned off (for example, a browser that restarts and
reopens the tabs that were open when it crashed; a persistent process thus continues to
exist even if it failed or was killed for some technical reason).
• there are two types of persistence: object persistence and process persistence.
• object persistence refers to an object that is not deleted until a need emerges to remove
it from the memory.
• process persistence, processes are not killed or shut down by other processes and exist
until the user kills them.
persistent programming languages
• A persistent programming language is a programming language extended with constructs
to handle persistent data.
• Using Embedded SQL, a programmer is responsible for writing explicit code to fetch
data into memory or store data back to the database
• In a persistent programming language, a programmer can manipulate persistent data without
having to write such code explicitly.
COMPARISON OF OODBMS AND ORDBMS

• Data model: in an OODBMS, data is stored in the form of objects; in an ORDBMS, data is
stored in the form of tables containing rows and columns.
• Relationships: in an OODBMS, relationships are represented by references via the object
identifier (OID); in an ORDBMS, connections between two relations are represented by
foreign key attributes.
• Complexity: an OODBMS handles larger and more complex data than an RDBMS; an
ORDBMS handles comparatively simpler data.
• Language: in an OODBMS, the data management language is typically incorporated into a
programming language such as C++; in an ORDBMS, there are data manipulation languages
such as SQL.
• Storage unit: in an OODBMS, stored data entries are described as objects; in an ORDBMS,
stored data entries are described as tables.