DISTRIBUTED DATABASES
Distributed Database
In a distributed database system, the database is stored on
several computers.
The computers in a distributed system communicate with
one another through various communication media, such
as high-speed networks or telephone lines.
Computers in a distributed system are also referred to as
sites or nodes.
It consists of a single logical database that is split into a
number of fragments.
Each fragment is stored on one or more computers under
the control of a separate DBMS.
A logically interrelated collection of shared data, physically
distributed over a computer network, is called a distributed
database.
Example:
A bank has branches all over India and
its head office is in Delhi.
Assume the bank maintains local data at
each local branch and a copy of the data of all
branches at Delhi.
Data is distributed all over India.
This eases query processing for the local
customers of a branch and also for a global
customer.
[Figure: Bank using distributed processing. Delhi (Head Office) with local branches in Mumbai, Chennai, Bangalore, and Agra.]
A distributed database system consists of a
collection of sites connected together via
some kind of communications network, in
which:
each site is a database system site in its
own right;
the sites agree to work together, so that
a user at any site can access data
anywhere in the network exactly as if
the data were all stored at the user's
own site
Distributed DBMS.
Software system that permits the
management of the distributed database and
makes the distribution transparent to users.
Characteristics of DDBMS
Collection of logically-related shared data.
Data split into fragments.
Fragments may be replicated.
Fragments/replicas allocated to sites.
Sites linked by a communications network.
Data at each site is under control of a DBMS.
DBMSs handle local applications autonomously.
Each DBMS participates in at least one global
application.
Advantages of DDBMS
1. Data sharing
If a number of different sites are connected
to each other, then a user at one site may be
able to access data that is available at another
site.
For example, in the distributed banking
system, it is possible for a user in one branch
to access data in another branch.
2. Local Autonomy
The primary advantage of accomplishing data
sharing by means of data distribution is that
each site is able to retain a degree of control
over data stored locally.
In a centralized system, the database administrator
of the central site controls the database.
In a distributed system, there is a global database
administrator responsible for the entire system.
A part of these responsibilities is delegated to the
local database administrator for each site.
Each local administrator may have a different degree
of autonomy, which is often a major advantage of
distributed databases.
3. Reliability and Availability
If one site fails in a distributed system, the
remaining sites may be able to continue operating.
In particular, if data are replicated at several sites, a
transaction needing a particular data item may find
it at several sites. Thus, the failure of a site does not
necessarily imply the shutdown of the system.
The failure of one site must be detected
by the system, and appropriate action
may be needed to recover from the
failure. The system must no longer use
the service of the failed site. Finally,
when the failed site recovers or is
repaired, mechanisms must be available
to integrate it smoothly back into the
system.
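The failure-handling steps above (detect the failure, stop using the failed site, reintegrate it on recovery) can be sketched as a toy model. The class `ReplicatedStore` and its method names are hypothetical and exist only for illustration; a real DDBMS would also need failure detection and a proper recovery protocol.

```python
class ReplicatedStore:
    """Toy model of data replicated at several sites (illustrative only)."""

    def __init__(self, sites, data):
        self.copies = {site: dict(data) for site in sites}
        self.failed = set()

    def mark_failed(self, site):
        # Once a failure is detected, the system stops using this site.
        self.failed.add(site)

    def live_sites(self):
        return [site for site in self.copies if site not in self.failed]

    def read(self, key):
        # Any live replica can serve the read.
        for site in self.live_sites():
            return site, self.copies[site][key]
        raise RuntimeError("no live replica holds the data")

    def write(self, key, value):
        # Only live sites receive the update; a failed site falls behind.
        for site in self.live_sites():
            self.copies[site][key] = value

    def recover(self, site):
        # Reintegration: refresh the recovered site from a live replica
        # before putting it back into service.
        live = self.live_sites()
        if live:
            self.copies[site] = dict(self.copies[live[0]])
        self.failed.discard(site)


store = ReplicatedStore(["Delhi", "Mumbai"], {"A-305": 500})
store.mark_failed("Delhi")
served_by, balance = store.read("A-305")  # served by the surviving site
store.write("A-305", 450)                 # Delhi misses this update
store.recover("Delhi")                    # Delhi catches up before rejoining
```

Copying the whole state on recovery is the simplest possible choice; it is an assumption of this sketch, not a statement about how any particular system reintegrates a site.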
4. Faster data access
Users can issue commands from any
location to access data, and this does not
affect the working of the database.
Its advantage is that if a user wants to
5. Modular Growth
New nodes (computers) can be added to
the network at any time without any difficulty.
6. Speedup of Query Processing:
If a query involves data at several sites, it may be
possible to split the query into subqueries that
can be executed in parallel by several sites.
Such parallel computation allows for faster
processing of a user's query.
In those cases in which data is replicated, queries
may be directed by the system to the least
heavily loaded sites.
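A minimal sketch of splitting a query into per-site subqueries that run in parallel, assuming each site holds only its own branch's balances. The dictionary `SITE_DATA` and the function names are hypothetical; threads stand in for remote sites.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-site data: each site stores the balances of its own branch.
SITE_DATA = {
    "Hillside": [500, 336, 62],
    "Valleyview": [205, 10000, 1123, 750],
}


def local_subquery(site):
    """Subquery executed at one site: total balance held at that site."""
    return sum(SITE_DATA[site])


def global_total_balance(sites):
    """Run one subquery per site in parallel and combine the partial sums."""
    with ThreadPoolExecutor(max_workers=len(sites)) as pool:
        partial_sums = list(pool.map(local_subquery, sites))
    return sum(partial_sums)


total = global_total_balance(list(SITE_DATA))
```

Only the small partial sums travel back to the coordinating site, not the underlying tuples, which is where the saving comes from.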
Disadvantages of DDBMSs
Complexity - A distributed database is
more complicated to set up and maintain
than a central database.
Managing and controlling a DDBMS is
complex.
Security - There is less security because
data is at so many different sites.
Distributed databases provide more
flexible access, which increases the
chance of security violations since the
database can be accessed from
every site within the network.
Lack of Standards - There are no tools or
methodologies yet to help users convert a centralized
DBMS into a distributed DBMS.
Database Design More Complex - Besides the normal
difficulties, the design of a distributed database has to
consider fragmentation of data, allocation of
fragments to specific sites, and data replication.
Cost - Increased complexity and a more extensive
infrastructure mean extra costs.
Lack of Experience - Distributed databases are difficult
to work with, and as a young field there is not much
readily available experience on proper practice.
Local and global transactions
A local transaction accesses data only at the
single site at which the transaction was
initiated.
A global transaction either accesses
data in a site different from the one at
which the transaction was initiated or
accesses data in several different sites.
The ACID properties of a local
transaction can be ensured in the same way as
for a normal transaction. Ensuring the ACID
properties of a global transaction is
more complex, because several sites take part in its execution.
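The distinction can be written as a small check. This is only an illustrative sketch; the function name `classify_transaction` and the site names are assumptions.

```python
def classify_transaction(origin_site, accessed_sites):
    """Return 'local' if all accessed data lives at the originating site,
    otherwise 'global'."""
    if set(accessed_sites) <= {origin_site}:
        return "local"
    return "global"


at_home = classify_transaction("Mumbai", ["Mumbai"])
remote_only = classify_transaction("Mumbai", ["Delhi"])
multi_site = classify_transaction("Mumbai", ["Mumbai", "Delhi"])
```

Both the remote-only case and the multi-site case count as global, matching the two halves of the definition.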
Types of DDBMS
Homogeneous DDBMS
Heterogeneous DDBMS
Homogeneous Distributed Database Systems
All sites have identical software/schema
They are aware of each other and agree to
cooperate in processing user requests
Goal: provide a view of a single database,
hiding details of distribution. It appears to
the user as a single system.
Homogeneous Database
Identical DBMSs
All data is managed by the distributed
DBMS (no exclusively local data)
All access is through one, global schema
The global schema is the union of all the local schemas
Heterogeneous Distributed Database Systems
Data distributed across all the nodes
Different software/schema on different sites
Different DBMSs may be used at each node
Local access is done using the local DBMS and
schema
Remote access is done using the global
schema
Goal: integrate existing databases to provide useful
functionality
Typical Heterogeneous Environment
Non-identical DBMSs
Source: adapted from Bell and Grimson, 1992.
Distributed Database Design
Design of a DDBMS introduces three issues:
How to partition the database into fragments
Which fragments to replicate
Where to locate those fragments and replicas
Fragmentation and replication deal with the
first two issues; allocation deals with the third
issue.
Fragmentation
A relation may be divided into a number of subrelations, which are then distributed.
Allocation
Each fragment is stored at the site with "optimal"
distribution.
Replication
A copy of a fragment may be maintained at
several sites.
Data Fragmentation
Division of relation r into fragments r1, r2, ..., rn which
contain sufficient information to reconstruct relation r.
Three rules must be followed:
Completeness - If a relation R is decomposed into
fragments R1, R2, ..., Rn, each data item in R must appear in
at least one fragment.
Reconstruction - It must be possible to define a relational
operation that will reconstruct R from the fragments.
Disjointness - A data item must appear in only one
fragment. The exception is the primary key in vertical fragmentation.
For horizontal fragmentation, a data item is a tuple.
For vertical fragmentation, a data item is an attribute.
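The three rules can be checked mechanically for horizontal fragments. The sketch below treats a relation as a set of data items (here, tuple identifiers 1 to 7) and uses union as the reconstruction operation; the function names and sample fragments are hypothetical.

```python
from collections import Counter


def is_complete(relation, fragments):
    """Completeness: every data item appears in at least one fragment."""
    return set(relation) <= set().union(*fragments)


def is_disjoint(fragments):
    """Disjointness: no data item appears in more than one fragment."""
    counts = Counter(item for fragment in fragments for item in set(fragment))
    return all(count == 1 for count in counts.values())


def reconstructs_by_union(relation, fragments):
    """Reconstruction (horizontal case): the union of the fragments
    gives back exactly the original relation."""
    return set().union(*fragments) == set(relation)


relation = {1, 2, 3, 4, 5, 6, 7}
good = [{1, 2, 5}, {3, 4, 6, 7}]
overlapping = [{1, 2, 5}, {5, 3, 4, 6, 7}]  # item 5 stored twice
lossy = [{1, 2}, {3, 4, 6, 7}]              # item 5 missing
```

For vertical fragments the data items would be attributes and the reconstruction operation a join, so these helpers would need the primary-key exception added.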
Types of fragmentation
There are three types of fragmentation:
Horizontal
Vertical
Mixed
Another possibility is no fragmentation:
If a relation is small and not updated frequently, it may
be better not to fragment it.
Horizontal fragmentation
Each tuple of r is assigned to one or more fragments.
Example: relation account with the following schema
Account = (account_number, branch_name, balance)
The account relation can be divided into several different
fragments, each of which consists of tuples of
accounts belonging to a particular branch. If the
banking system has only two branches, Hillside and
Valleyview, then there are two different fragments, account1 and account2.
We reconstruct the relation r by taking the union of all fragments; that is,
$r = r_1 \cup r_2 \cup \dots \cup r_n$
Horizontal Fragmentation of account Relation

account_number   branch_name   balance
A-305            Hillside      500
A-226            Hillside      336
A-155            Hillside      62
$account_1 = \sigma_{branch\_name = \text{"Hillside"}}(account)$

account_number   branch_name   balance
A-177            Valleyview    205
A-402            Valleyview    10000
A-408            Valleyview    1123
A-639            Valleyview    750
$account_2 = \sigma_{branch\_name = \text{"Valleyview"}}(account)$
PROJ1: projects with budgets less than $200,000
PROJ2: projects with budgets greater than or equal to $200,000
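As a rough illustration of horizontal fragmentation, the sketch below applies a selection on branch_name to the account relation and rebuilds it with a union. Tuples are (account_number, branch_name, balance); the helper name `select_by_branch` is an assumption.

```python
# The account relation as a list of (account_number, branch_name, balance).
account = [
    ("A-305", "Hillside", 500),
    ("A-226", "Hillside", 336),
    ("A-155", "Hillside", 62),
    ("A-177", "Valleyview", 205),
    ("A-402", "Valleyview", 10000),
    ("A-408", "Valleyview", 1123),
    ("A-639", "Valleyview", 750),
]


def select_by_branch(relation, branch_name):
    """Selection: keep only the tuples whose branch_name matches."""
    return {row for row in relation if row[1] == branch_name}


account1 = select_by_branch(account, "Hillside")
account2 = select_by_branch(account, "Valleyview")

# Reconstruction: the union of the fragments.
reconstructed = account1 | account2
```

The same pattern would cover a range predicate such as a budget threshold; only the selection condition changes.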
Vertical fragmentation:
The schema for relation r is split into several smaller
schemas.
All schemas must contain a common candidate key (or superkey) to
ensure the lossless-join property.
A special attribute, the tuple-id attribute, may be added to each
schema to serve as a candidate key.
We can reconstruct the relation by taking the natural join of the
fragments:
$r = r_1 \bowtie r_2 \bowtie r_3 \bowtie \dots \bowtie r_n$
Vertical Fragmentation of employee_info Relation

branch_name   customer_name   tuple_id
Hillside      Lowman          1
Hillside      Camp            2
Valleyview    Camp            3
Valleyview    Kahn            4
Hillside      Kahn            5
Valleyview    Kahn            6
Valleyview    Green           7
$deposit_1 = \Pi_{branch\_name,\ customer\_name,\ tuple\_id}(employee\_info)$

account_number   balance   tuple_id
A-305            500       1
A-226            336       2
A-177            205       3
A-402            10000     4
A-155            62        5
A-408            1123      6
A-639            750       7
$deposit_2 = \Pi_{account\_number,\ balance,\ tuple\_id}(employee\_info)$
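A small sketch of the same vertical fragmentation: a tuple_id is added to each row, two projections produce deposit1 and deposit2, and a join on tuple_id rebuilds the relation. The helpers `project` and `join_on` are hypothetical, and `join_on` assumes the key is unique in the right-hand fragment.

```python
# employee_info rows: (branch_name, customer_name, account_number, balance).
employee_info = [
    ("Hillside", "Lowman", "A-305", 500),
    ("Hillside", "Camp", "A-226", 336),
    ("Valleyview", "Camp", "A-177", 205),
    ("Valleyview", "Kahn", "A-402", 10000),
    ("Hillside", "Kahn", "A-155", 62),
    ("Valleyview", "Kahn", "A-408", 1123),
    ("Valleyview", "Green", "A-639", 750),
]

# Add the tuple_id attribute that will serve as the common key.
rows = [
    {"tuple_id": i, "branch_name": b, "customer_name": c,
     "account_number": a, "balance": bal}
    for i, (b, c, a, bal) in enumerate(employee_info, start=1)
]


def project(relation, attributes):
    """Projection: keep only the named attributes of every row."""
    return [{name: row[name] for name in attributes} for row in relation]


def join_on(left, right, key):
    """Join two fragments on a key that is unique in `right`."""
    right_by_key = {row[key]: row for row in right}
    return [{**row, **right_by_key[row[key]]}
            for row in left if row[key] in right_by_key]


deposit1 = project(rows, ["branch_name", "customer_name", "tuple_id"])
deposit2 = project(rows, ["account_number", "balance", "tuple_id"])
reconstructed = join_on(deposit1, deposit2, "tuple_id")
```

Without tuple_id the join would have no shared key, and rows such as the three Kahn accounts could not be matched back unambiguously.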
Horizontal and Vertical
Fragmentation
Mixed Fragmentation
A combination of horizontal and vertical
strategies.
Also called hybrid or nested fragmentation.
A horizontal fragment that is subsequently
vertically fragmented, or a vertical fragment
that is then horizontally fragmented.
Mixed fragmentation is defined using the
select and project operations of relational
algebra.
The original relation can be obtained by join
and union operations.
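A minimal sketch of mixed fragmentation, assuming a horizontal split by branch_name followed by a vertical split of each horizontal fragment. The four sample rows and all helper names are illustrative assumptions.

```python
rows = [
    {"tuple_id": 1, "branch_name": "Hillside", "account_number": "A-305", "balance": 500},
    {"tuple_id": 2, "branch_name": "Hillside", "account_number": "A-226", "balance": 336},
    {"tuple_id": 3, "branch_name": "Valleyview", "account_number": "A-177", "balance": 205},
    {"tuple_id": 4, "branch_name": "Valleyview", "account_number": "A-402", "balance": 10000},
]


def select(relation, predicate):
    return [row for row in relation if predicate(row)]


def project(relation, attributes):
    return [{name: row[name] for name in attributes} for row in relation]


def join_on_tuple_id(left, right):
    right_by_id = {row["tuple_id"]: row for row in right}
    return [{**row, **right_by_id[row["tuple_id"]]}
            for row in left if row["tuple_id"] in right_by_id]


def mixed_fragments(relation, branch_name):
    """Select one branch (horizontal), then project it into two
    vertical fragments that share tuple_id."""
    horizontal = select(relation, lambda row: row["branch_name"] == branch_name)
    return (project(horizontal, ["tuple_id", "branch_name"]),
            project(horizontal, ["tuple_id", "account_number", "balance"]))


hillside_a, hillside_b = mixed_fragments(rows, "Hillside")
valleyview_a, valleyview_b = mixed_fragments(rows, "Valleyview")

# Reconstruction: join within each branch, then take the union. Plain list
# concatenation is enough here because the branch fragments are disjoint.
reconstructed = (join_on_tuple_id(hillside_a, hillside_b)
                 + join_on_tuple_id(valleyview_a, valleyview_b))
```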
Advantages of Fragmentation
Horizontal:
allows parallel processing on fragments of a
relation
allows a relation to be split so that tuples are
located where they are most frequently
accessed
Vertical:
allows tuples to be split so that each part of
the tuple is stored where it is most frequently
accessed
tuple-id attribute allows efficient joining of
vertical fragments
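Disadvantages:
Performance - applications that need data from several
fragments at different sites may be slower
Integrity - integrity control is more difficult when data
is spread over fragments at different sites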
Data Replication
System maintains multiple copies of data,
stored in different sites, for faster
retrieval and fault tolerance.
Two types of replication:
Full replication
Partial replication
Full replication
Full replication of a relation is the case
where the relation is stored at all sites.
Fully redundant databases are those in
which every site contains a copy of the
entire database.
Can be impractical due to the storage and update overhead
Partial replication
Only some important, frequently used
fragments are replicated.
Most DDBMSs are able to handle a
partially replicated database well.
Unreplicated database
Stores each database fragment at a single
site
No duplicate database fragments
Advantages of Replication
Availability: failure of a site containing relation
r does not result in unavailability of r if
replicas exist.
Parallelism: queries on r may be processed
by several nodes in parallel.
Reduced data transfer: relation r is
available locally at each site containing a
replica of r.
Disadvantages of Replication
Increased cost of updates: each replica of
relation r must be updated.
Increased complexity of concurrency control:
concurrent updates to distinct replicas may
lead to inconsistent data unless special
concurrency control mechanisms are
implemented.
One solution: choose one copy as the primary copy
and apply concurrency control operations on
the primary copy.
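The primary-copy idea can be sketched as follows, assuming a single lock held at the primary site serializes all updates before they are propagated to the other replicas. The class `PrimaryCopyItem` is hypothetical, and an in-process lock stands in for whatever concurrency control the primary site would really use.

```python
import threading


class PrimaryCopyItem:
    """One data item replicated at several sites with a designated primary."""

    def __init__(self, sites, primary_site, value):
        if primary_site not in sites:
            raise ValueError("primary site must hold a replica")
        self.primary_site = primary_site
        self.copies = {site: value for site in sites}
        self.replica_writes = 0  # counts the cost of keeping replicas in sync
        self._primary_lock = threading.Lock()

    def update(self, change):
        # Concurrency control happens only at the primary copy: the lock
        # serializes updates, and each update is then applied to every replica.
        with self._primary_lock:
            new_value = change(self.copies[self.primary_site])
            for site in self.copies:
                self.copies[site] = new_value
                self.replica_writes += 1

    def read(self, site):
        return self.copies[site]


item = PrimaryCopyItem(["Delhi", "Mumbai", "Chennai"], "Delhi", 500)
workers = [threading.Thread(target=item.update, args=(lambda v: v + 1,))
           for _ in range(20)]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()
```

The `replica_writes` counter makes the update cost visible: every logical update turns into one write per replica.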
Data allocation
Four alternative strategies regarding
placement of data:
Centralized
Partitioned (or Fragmented)
Complete Replication
Selective Replication
Data allocation algorithms consider a variety of
factors, such as performance, reliability,
availability, storage cost, and communication cost.
Centralized data allocation
The entire DB is stored at one site, with users
distributed across the network.
Partitioned data allocation
Database partitioned into disjoint fragments, each
fragment assigned to one site.
Complete Replication
Consists of maintaining a complete copy of the database at
each site.
Selective Replication
Combination of partitioning, replication, and
centralization.
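The four strategies can be compared by looking at the fragment-to-sites map each one produces. This is a sketch under stated assumptions: the function names are hypothetical, and round-robin placement is used only as a placeholder for a real allocation algorithm that would weigh the cost factors listed earlier.

```python
def centralized(fragments, sites):
    """Everything is stored at a single site (here, the first one)."""
    return {fragment: [sites[0]] for fragment in fragments}


def partitioned(fragments, sites):
    """Each fragment is assigned to exactly one site (round-robin placeholder)."""
    return {fragment: [sites[i % len(sites)]]
            for i, fragment in enumerate(fragments)}


def complete_replication(fragments, sites):
    """Every site holds a copy of every fragment."""
    return {fragment: list(sites) for fragment in fragments}


def selective_replication(fragments, sites, replicate_everywhere):
    """Chosen fragments are replicated at all sites; the rest are partitioned."""
    base = partitioned(fragments, sites)
    return {fragment: list(sites) if fragment in replicate_everywhere
            else base[fragment]
            for fragment in fragments}


fragments = ["account1", "account2", "customer"]
sites = ["Delhi", "Mumbai", "Chennai"]

central = centralized(fragments, sites)
split = partitioned(fragments, sites)
full = complete_replication(fragments, sites)
selective = selective_replication(fragments, sites, {"customer"})
```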
Reference Architecture for DDBMS
Due to the diversity of DDBMSs, there is no universally accepted
architecture equivalent to the ANSI/SPARC 3-level architecture.
A reference architecture consists of:
Set of global external schemas.
Global conceptual schema (GCS).
Fragmentation schema and allocation schema.
Set of schemas for each local DBMS conforming
to the 3-level ANSI/SPARC architecture.
Some levels may be missing, depending on
levels of transparency supported.
Global conceptual schema
The global conceptual schema is a logical description of the whole
database as if it were not distributed.
In a DDBMS, the GCS is the union of all local conceptual schemas.
Fragmentation and allocation schema
The fragmentation schema is a description of how the
data is to be logically partitioned.
The allocation schema is a description of where the data is to be
located.
Local schemas
Each local DBMS has its own set of schemas.
The local mapping schema maps fragments in the allocation
schema into external objects in the local database.
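One way to picture how the fragmentation and allocation schemas work together is as two lookup tables: the first says which fragment a row belongs to, the second says where that fragment is stored. The structures, site names, and the function `sites_for_row` below are illustrative assumptions, not a prescribed catalog format.

```python
# Fragmentation schema: how each global relation is logically partitioned.
fragmentation_schema = {
    "account": {
        "account1": lambda row: row["branch_name"] == "Hillside",
        "account2": lambda row: row["branch_name"] == "Valleyview",
    }
}

# Allocation schema: where each fragment (and any replica) is located.
allocation_schema = {
    "account1": ["site_hillside"],
    "account2": ["site_valleyview", "site_head_office"],
}


def sites_for_row(relation_name, row):
    """Find the fragment(s) a row belongs to, then the sites storing them."""
    sites = []
    for fragment, belongs in fragmentation_schema[relation_name].items():
        if belongs(row):
            sites.extend(allocation_schema[fragment])
    return sites


hillside_sites = sites_for_row("account", {"branch_name": "Hillside"})
valleyview_sites = sites_for_row("account", {"branch_name": "Valleyview"})
```

A user query names only the global relation; the two lookups are what let the system hide fragment names and locations from that user.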
Components of a DDBMS
Local DBMS (LDBMS) component - It has its
own local system catalog that stores
information about the data held at that site.
Data communications (DC) component - The
software that enables all sites to
communicate with each other.
Global System Catalog (GSC) - The GSC holds
information specific to the distributed nature
of the system, such as the fragmentation and
allocation schemas.
Distributed DBMS (DDBMS) component - The
controlling unit of the entire system.