
Chapter 6 Distributed System Management

Distributed databases allow data to be stored across multiple interconnected sites that communicate over a computer network. A distributed database management system (DDBMS) manages the distributed database, making the distribution transparent to users. Distributed databases offer advantages like modularity, reliability, and lower communication costs compared to centralized databases. Distributed databases can be either homogeneous, where all sites use the same DBMS, or heterogeneous, where sites may use different DBMSs. Common distributed DBMS architectures include client-server, where servers perform data management and clients provide interfaces, and peer-to-peer.

Uploaded by

abreham
Chapter 6: Distributed Database Management

1. Introduction
A database is an ordered collection of related data that is built for a specific purpose. A
database may be organized as a collection of multiple tables, where a table represents a real-
world element or entity. Each table has several different fields that represent the characteristic
features of the entity. A database management system (DBMS) is a collection of programs that
enables the creation and maintenance of a database. A DBMS is available as a software package
that facilitates the definition, construction, manipulation and sharing of data in a database.
Database Schemas
A database schema is a description of the database which is specified during database design
and subject to infrequent alterations. It defines the organization of the data, the relationships
among them, and the constraints associated with them. Databases are often represented through
the three-schema architecture, also known as the ANSI/SPARC architecture. The goal of this architecture is to
separate the user application from the physical database. The three levels are −
• Internal Level having Internal Schema − It describes the physical structure, details of internal
storage and access paths for the database.
• Conceptual Level having Conceptual Schema − It describes the structure of the whole database
while hiding the details of physical storage of data. It covers the entities, attributes with
their data types and constraints, user operations and relationships.
• External or View Level having External Schemas or Views − It describes the portion of a
database relevant to a particular user or a group of users while hiding the rest of the database.
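As a rough illustration of the three levels, the sketch below uses Python's built-in sqlite3 module. The EMPLOYEE table, its columns, the index name and the EMPLOYEE_DIRECTORY view are assumptions made up for this example and do not come from the chapter. A table definition stands in for the conceptual schema, an index for an internal-level access path, and a view for an external schema.

```python
import sqlite3

# Hypothetical EMPLOYEE table, used only for illustration.
conn = sqlite3.connect(":memory:")

# Conceptual level: the logical structure of the whole database.
conn.execute(
    "CREATE TABLE EMPLOYEE ("
    " Emp_No INTEGER PRIMARY KEY,"
    " Name TEXT NOT NULL,"
    " Dept TEXT NOT NULL,"
    " Salary REAL NOT NULL)"
)
conn.executemany(
    "INSERT INTO EMPLOYEE VALUES (?, ?, ?, ?)",
    [(1, "Abebe", "Sales", 5200.0), (2, "Sara", "IT", 6100.0)],
)

# Internal level: an access path (index) that affects storage, not meaning.
conn.execute("CREATE INDEX idx_employee_dept ON EMPLOYEE (Dept)")

# External level: a view that exposes only what one user group needs
# and hides the Salary column.
conn.execute(
    "CREATE VIEW EMPLOYEE_DIRECTORY AS SELECT Emp_No, Name, Dept FROM EMPLOYEE"
)

cursor = conn.execute("SELECT * FROM EMPLOYEE_DIRECTORY ORDER BY Emp_No")
directory_columns = [column[0] for column in cursor.description]
directory_rows = cursor.fetchall()
print(directory_columns, directory_rows)
```

Dropping or adding the index changes nothing that users of the view can observe, which is the separation the architecture aims for.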
Distributed DBMS
A distributed database is a collection of multiple interconnected databases that are spread
physically across various locations and communicate via a computer network or the internet.
A Distributed Database Management System (DDBMS) manages the distributed database and
provides mechanisms that make the distribution transparent to the users. In these systems,
data is intentionally distributed among multiple nodes so that all computing resources of the
organization can be used optimally.
Features

• Databases in the collection are logically interrelated with each other. Often they
represent a single logical database.
• Data is physically stored across multiple sites. Data in each site can be managed by a
DBMS independent of the other sites.
• The processors in the sites are connected via a network. They do not have any
multiprocessor configuration.
• A distributed database is not a loosely connected file system.
• A distributed database incorporates transaction processing, but it is not synonymous
with a transaction processing system.
Distributed Database Management System
A distributed database management system (DDBMS) is a software system that manages a
distributed database as if it were all stored in a single location.
Features

• It is used to create, retrieve, update and delete distributed databases.
• It synchronizes the database periodically and provides access mechanisms through which
the distribution becomes transparent to the users.
• It ensures that data modified at any site is updated universally across all sites.
• It is used in application areas where large volumes of data are processed and accessed
by numerous users simultaneously.
• It is designed for heterogeneous database platforms.
• It maintains confidentiality and data integrity of the databases.
• It supports both OLTP and OLAP workloads.
Advantages of Distributed Databases
• Modular Development − If the system needs to be expanded to new locations or new units,
in centralized database systems, the action requires substantial efforts and disruption in the
existing functioning. However, in distributed databases, the work simply requires adding
new computers and local data to the new site and finally connecting them to the distributed
system, with no interruption in current functions.
• More Reliable − When a centralized database fails, the whole system comes to a halt.
In a distributed system, when a component fails, the system continues to function,
possibly at reduced performance. Hence a DDBMS is more reliable.
• Better Response − If data is distributed in an efficient manner, then user requests can be
met from local data itself, thus providing faster response. On the other hand, in centralized
systems, all queries have to pass through the central computer for processing, which
increases the response time.
• Lower Communication Cost − In distributed database systems, if data is located locally
where it is mostly used, then the communication costs for data manipulation can be
minimized. This is not feasible in centralized systems.

Distributed Database vs. Centralized Database

Centralized DBMS | Distributed DBMS
The database is stored at only one site. | The database is stored at different sites and accessed with the help of a network.
The data is stored at a single computer site and can be used by multiple users. | The database and DBMS software are distributed over many sites connected by a computer network.
The database is maintained at one site. | The database is maintained at a number of different sites.
If the centralized system fails, the entire system is halted. | If one site fails, the system continues to work with the other sites.
It is less reliable. | It is more reliable.

Figure 1.5 Centralized database

Figure 1.6 Distributed database

2. Types of Distributed Databases

Figure 1.7 Types of Distributed Databases

Distributed databases can be broadly classified into homogeneous and heterogeneous
distributed database environments.

Homogeneous Distributed Databases


In a homogeneous distributed database, all the sites use identical DBMS and operating systems.
Its properties are −
• The sites use very similar software.
• The sites use identical DBMS or DBMS from the same vendor.
• Each site is aware of all other sites and cooperates with other sites to process user
requests.
• The database is accessed through a single interface as if it is a single database.
Types of Homogeneous Distributed Database
There are two types of homogeneous distributed database −
1. Autonomous − Each database is independent and functions on its own. The databases are
integrated by a controlling application and use message passing to share data updates.
2. Non-autonomous − Data is distributed across the homogeneous nodes and a central or
master DBMS co-ordinates data updates across the sites.

Heterogeneous Distributed Databases


In a heterogeneous distributed database, different sites have different operating systems,
DBMS products and data models. Its properties are −
• Different sites use dissimilar schemas and software.
• The system may be composed of a variety of DBMSs like relational, network,
hierarchical or object oriented.
• Query processing is complex due to dissimilar schemas. Transaction processing is
complex due to dissimilar software.
• A site may not be aware of other sites and so there is limited co-operation in processing
user requests.

Types of Heterogeneous Distributed Databases

Federated − The heterogeneous database systems are independent in nature and integrated
together so that they function as a single database system.

Un-federated − The database systems employ a central coordinating module through which
the databases are accessed.

3. Distributed DBMS Architectures


DDBMS architectures are generally developed depending on three parameters −

• Distribution − It states the physical distribution of data across the different sites.
• Autonomy − It indicates the distribution of control of the database system and the
degree to which each constituent DBMS can operate independently.
• Heterogeneity − It refers to the uniformity or dissimilarity of the data models, system
components and databases.

Architectural Models
• Client-Server Architecture for DDBMS
• Peer-to-Peer Architecture for DDBMS
• Multi-DBMS Architecture
Client-Server Architecture for DDBMS
This is a two-level architecture where the functionality is divided between servers and clients. The
server functions primarily encompass data management, query processing, optimization and
transaction management. The client functions mainly cover the user interface, although clients also
perform some functions such as consistency checking and transaction management.
The idea is to distinguish the functionality and divide it into two classes: server functions and
client functions.
The server does most of the data management work:
– Query processing
– Data management
– Optimization
– Transaction management, etc.
The client performs:
– Application
– User interface
– DBMS client model
The two different client-server architectures are −
Single Server Multiple Client: A single server accessed by multiple clients
• A client-server architecture has a number of clients and a few servers connected in a
network.
• A client sends a query to one of the servers. The earliest available server processes it and
replies.
• A client-server architecture is simple to implement and execute because of its centralized
server system.

Multiple Server Multiple Client: Multiple Servers accessed by multiple clients

Figure 1.9 Multiple Servers accessed by multiple clients
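The rule that the earliest available server answers a client's query can be sketched as a small simulation. Everything here is a hypothetical illustration: the Server class, the dispatch function, the simulated clock and the service times are assumptions for this example, not part of any real DDBMS.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Server:
    name: str
    busy_until: float = 0.0  # simulated time at which the server becomes free

    def handle(self, query: str, arrival: float, service_time: float) -> Tuple[str, float]:
        # The server starts the query once it is free and the query has arrived.
        start = max(arrival, self.busy_until)
        self.busy_until = start + service_time
        return self.name, self.busy_until  # (who replied, when the reply is ready)


def dispatch(servers: List[Server], query: str, arrival: float, service_time: float) -> Tuple[str, float]:
    """Send the query to the server that can start working on it first."""
    earliest = min(servers, key=lambda s: max(arrival, s.busy_until))
    return earliest.handle(query, arrival, service_time)


servers = [Server("S1"), Server("S2")]
replies = [
    dispatch(servers, "SELECT ...", arrival=0.0, service_time=3.0),
    dispatch(servers, "SELECT ...", arrival=1.0, service_time=1.0),
    dispatch(servers, "SELECT ...", arrival=1.5, service_time=1.0),
]
print(replies)
```

The second and third queries both go to S2 because S1 is still busy with the long first query. With a single server, every query would queue behind it.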

Peer-to-Peer Architecture for DDBMS

In these systems, each peer acts both as a client and a server for imparting database services.
The peers share their resources with other peers and co-ordinate their activities.
This architecture generally has four levels of schemas −
• Local internal schema − the individual internal schema definition at each site.
• Global conceptual schema − describes the enterprise view of the data.
• Local conceptual schema − describes the local organization of data at each site.
• External schemas − support user applications and user access to the database.
Local conceptual schemas are mappings of the global schema onto each site. Databases are
typically designed in a top-down fashion and, therefore, all external view definitions are made
globally.
Major Components of a Peer-to-Peer System
– User processor
– Data processor
User Processor
• User-interface handler − Responsible for interpreting user commands and formatting the result data.
• Semantic data controller − Checks whether the user query can be processed.
• Global query optimizer and decomposer − Determines an execution strategy and translates global
queries into local ones.
• Distributed execution − Coordinates the distributed execution of the user request.
Data Processor
• Local query optimizer − Acts as the access path selector and is responsible for choosing the best
access path.
• Local recovery manager − Makes sure the local database remains consistent.
• Run-time support processor − Is the interface to the operating system and contains the database
buffer. It is responsible for maintaining the main memory buffers and managing data access.

Figure 1.10 Framework
Multi-DBMS Architectures
This is an integrated database system formed by a collection of two or more autonomous
database systems.

Multi-DBMS can be expressed through six levels of schemas −

• Multi-database View Level − Depicts multiple user views comprising subsets of the
integrated distributed database.
• Multi-database Conceptual Level − Depicts the integrated multi-database, which comprises
global logical multi-database structure definitions.
• Multi-database Internal Level − Depicts the data distribution across different sites and
multi-database to local data mapping.
• Local database View Level − Depicts public view of local data.
• Local database Conceptual Level − Depicts local data organization at each site.
• Local database Internal Level − Depicts physical data organization at each site.
There are two design alternatives for multi-DBMS:
1. Model with multi-database conceptual level.
Models Using a Global Conceptual Schema (GCS)

Figure 1.11 Models Using a Global Conceptual Schema


• The GCS is defined by integrating either the external schemas of local autonomous
databases or parts of their local conceptual schemas.
• Users of a local DBMS define their own views on the local database.
• If heterogeneity exists in the system, then two implementation alternatives exist:
unilingual and multilingual.
• The unilingual alternative requires the users to utilize possibly different data models and
languages.
• The basic philosophy of the multilingual architecture is to permit each user to access the
global database.

2. Model without multi-database conceptual level.


• It consists of two layers: the local system layer and the multi-database layer.
• The local system layer consists of the local DBMSs, which present to the multi-database layer
the part of their local database they are willing to share with users of other databases.
• System views are constructed above this layer.
• The responsibility of providing access to multiple databases is delegated to the mapping
between the external schemas and the local conceptual schemas.
• Full-fledged DBMSs exist, each of which manages a different database.

Figure 1.12 Models without using a Global Conceptual Schema

4. Distributed DBMS Design:


Design Problem:
The design of a distributed computer system involves making decisions on the placement
of data and programs across the sites of a computer network, as well as possibly designing the
network itself.
In a DDBMS, the distribution of applications involves
– Distribution of the DDBMS software
– Distribution of applications that run on the database
The distribution of applications is not considered here; instead, the distribution of data is studied.
The data distribution strategies can be broadly divided into replication and fragmentation. However,
in most cases, a combination of the two is used.

Data Replication: Data replication is the process of storing separate copies of the database
at two or more sites. It is a popular fault tolerance technique of distributed databases.
Advantages of Data Replication:
• Reliability − In case of failure of any site, the database system continues to work since a copy
is available at another site(s).
• Reduction in Network Load − Since local copies of data are available, query processing can
be done with reduced network usage, particularly during prime hours. Data updating can be
done at non-prime hours.
• Quicker Response − Availability of local copies of data ensures quick query processing and
consequently quick response time.
• Simpler Transactions − Transactions require fewer joins of tables located at
different sites and minimal coordination across the network. Thus, they become simpler in
nature.
Disadvantages of Data Replication
• Increased Storage Requirements − Maintaining multiple copies of data is associated with
increased storage costs. The storage space required is in multiples of the storage required for
a centralized system.
• Increased Cost and Complexity of Data Updating − Each time a data item is updated, the
update needs to be reflected in all the copies of the data at the different sites. This requires
complex synchronization techniques and protocols.
• Undesirable Application – Database coupling − If complex update mechanisms are not
used, removing data inconsistency requires complex co-ordination at application level. This
results in undesirable application – database coupling.
Some commonly used replication techniques are −

• Snapshot replication
• Near-real-time replication
• Pull replication
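A minimal sketch of snapshot replication follows, assuming it means copying a table's contents to a replica as they exist at one point in time. Two in-memory SQLite databases stand in for two sites. The helper names make_site and take_snapshot, the reduced STUDENT columns and the sample rows are hypothetical.

```python
import sqlite3


def make_site():
    """Create one simulated site holding a (reduced) STUDENT table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE STUDENT (Regd_No INTEGER PRIMARY KEY, Name TEXT, Fees REAL)")
    return conn


def take_snapshot(primary, replica):
    """Overwrite the replica's copy with the primary's current contents."""
    rows = primary.execute("SELECT Regd_No, Name, Fees FROM STUDENT").fetchall()
    with replica:  # one transaction: drop the old copy, load the new one
        replica.execute("DELETE FROM STUDENT")
        replica.executemany("INSERT INTO STUDENT VALUES (?, ?, ?)", rows)
    return len(rows)


primary, replica = make_site(), make_site()
with primary:
    primary.executemany(
        "INSERT INTO STUDENT VALUES (?, ?, ?)",
        [(1, "Abebe", 1500.0), (2, "Sara", 1800.0)],
    )
take_snapshot(primary, replica)

# An update made at the primary after the snapshot is not yet visible at the replica.
with primary:
    primary.execute("UPDATE STUDENT SET Fees = 2000.0 WHERE Regd_No = 1")
stale = replica.execute("SELECT Fees FROM STUDENT WHERE Regd_No = 1").fetchone()[0]

# The next snapshot brings the replica up to date again.
take_snapshot(primary, replica)
fresh = replica.execute("SELECT Fees FROM STUDENT WHERE Regd_No = 1").fetchone()[0]
print(stale, fresh)
```

The window between the two snapshots is where the replica serves stale data. This is the update-propagation cost listed among the disadvantages of replication.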
Fragmentation:
Fragmentation is the task of dividing a table into a set of smaller tables. The subsets of the table are
called fragments. Fragmentation can be of three types: horizontal, vertical, and hybrid (combination
of horizontal and vertical). Horizontal fragmentation can further be classified into two techniques:
primary horizontal fragmentation and derived horizontal fragmentation.
Fragmentation should be done in such a way that the original table can be reconstructed from the
fragments whenever required. This requirement is called “reconstructiveness.”
Advantages of Fragmentation
• Since data is stored close to the site of usage, efficiency of the database system is increased.
• Local query optimization techniques are sufficient for most queries since data is locally
available.
• Since irrelevant data is not available at the sites, security and privacy of the database system
can be maintained.
Disadvantages of Fragmentation
• When data from different fragments are required, the access speeds may be very low.
• In case of recursive fragmentations, the job of reconstruction will need expensive techniques.
• Lack of back-up copies of data in different sites may render the database ineffective in case
of failure of a site.
Vertical Fragmentation:
In vertical fragmentation, the fields or columns of a table are grouped into fragments. In order to
maintain reconstructiveness, each fragment should contain the primary key field(s) of the table.
Vertical fragmentation can be used to enforce privacy of data.
For example, let us consider that a University database keeps records of all registered students in a
Student table having the following schema.
STUDENT

Regd_No Name Course Address Semester Fees Marks

Now, the fees details are maintained in the accounts section. In this case, the designer will fragment
the database as follows −
CREATE TABLE STD_FEES AS
SELECT Regd_No, Fees FROM STUDENT;
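The reconstructiveness of this vertical split can be checked with a short sketch using Python's sqlite3 module. The column types, the sample rows and the second fragment STD_INFO (the remaining columns plus the key) are assumptions added for illustration. Only STD_FEES comes from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE STUDENT (Regd_No INTEGER PRIMARY KEY, Name TEXT, Course TEXT,"
    " Address TEXT, Semester INTEGER, Fees REAL, Marks INTEGER)"
)
conn.executemany(
    "INSERT INTO STUDENT VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (101, "Abebe", "Computer Science", "Addis Ababa", 3, 1500.0, 82),
        (102, "Sara", "Mathematics", "Adama", 2, 1200.0, 91),
    ],
)

# Vertical fragments: both keep the primary key Regd_No.
conn.execute("CREATE TABLE STD_FEES AS SELECT Regd_No, Fees FROM STUDENT")
conn.execute(
    "CREATE TABLE STD_INFO AS"
    " SELECT Regd_No, Name, Course, Address, Semester, Marks FROM STUDENT"
)

original = conn.execute(
    "SELECT Regd_No, Name, Course, Address, Semester, Fees, Marks"
    " FROM STUDENT ORDER BY Regd_No"
).fetchall()

# Reconstruction: join the fragments on the shared primary key.
rebuilt = conn.execute(
    "SELECT i.Regd_No, i.Name, i.Course, i.Address, i.Semester, f.Fees, i.Marks"
    " FROM STD_INFO AS i JOIN STD_FEES AS f ON i.Regd_No = f.Regd_No"
    " ORDER BY i.Regd_No"
).fetchall()
print(rebuilt == original)
```

Without Regd_No in both fragments there would be no column to join on, which is why each vertical fragment must carry the primary key.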
Horizontal Fragmentation:
Horizontal fragmentation groups the tuples of a table according to the values of one or more fields.
Horizontal fragmentation should also conform to the rule of reconstructiveness. Each horizontal
fragment must have all columns of the original base table.
For example, in the student schema, if the details of all students of the Computer Science course need
to be maintained at the School of Computer Science, then the designer will horizontally fragment the
database as follows −
CREATE TABLE COMP_STD AS
SELECT * FROM STUDENT WHERE Course = 'Computer Science';
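The sketch below runs a horizontal split in SQLite through Python's sqlite3 module and rebuilds the table with a union. It uses standard SQL syntax (CREATE TABLE ... AS and a single-quoted string literal). The column types, the sample rows and the complementary fragment OTHER_STD are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE STUDENT (Regd_No INTEGER PRIMARY KEY, Name TEXT,"
    " Course TEXT NOT NULL, Address TEXT, Semester INTEGER, Fees REAL, Marks INTEGER)"
)
conn.executemany(
    "INSERT INTO STUDENT VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (101, "Abebe", "Computer Science", "Addis Ababa", 3, 1500.0, 82),
        (102, "Sara", "Mathematics", "Adama", 2, 1200.0, 91),
        (103, "Hana", "Computer Science", "Hawassa", 1, 1600.0, 77),
    ],
)

# Horizontal fragments: every fragment keeps all columns of the base table.
conn.execute(
    "CREATE TABLE COMP_STD AS SELECT * FROM STUDENT WHERE Course = 'Computer Science'"
)
conn.execute(
    "CREATE TABLE OTHER_STD AS SELECT * FROM STUDENT WHERE Course <> 'Computer Science'"
)

original = conn.execute("SELECT * FROM STUDENT ORDER BY Regd_No").fetchall()

# Reconstruction: the union of the disjoint fragments.
rebuilt = conn.execute(
    "SELECT * FROM COMP_STD UNION ALL SELECT * FROM OTHER_STD ORDER BY Regd_No"
).fetchall()
comp_count = conn.execute("SELECT COUNT(*) FROM COMP_STD").fetchone()[0]
print(rebuilt == original, comp_count)
```

Course is declared NOT NULL here on purpose. A row with a NULL course would satisfy neither predicate and would be lost from both fragments, breaking reconstructiveness.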
Hybrid Fragmentation:
In hybrid fragmentation, a combination of horizontal and vertical fragmentation techniques is used.
This is the most flexible fragmentation technique since it generates fragments with minimal
extraneous information. However, reconstruction of the original table is often an expensive task.
Hybrid fragmentation can be done in two alternative ways −
• At first, generate a set of horizontal fragments; then generate vertical fragments from one or
more of the horizontal fragments.
• At first, generate a set of vertical fragments; then generate horizontal fragments from one or
more of the vertical fragments.
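The first alternative (horizontal first, then vertical) might look like the following sketch in SQLite via Python's sqlite3 module. The fragment names COMP_STD_FEES and COMP_STD_MARKS, the column types and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE STUDENT (Regd_No INTEGER PRIMARY KEY, Name TEXT,"
    " Course TEXT NOT NULL, Address TEXT, Semester INTEGER, Fees REAL, Marks INTEGER)"
)
conn.executemany(
    "INSERT INTO STUDENT VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (101, "Abebe", "Computer Science", "Addis Ababa", 3, 1500.0, 82),
        (102, "Sara", "Mathematics", "Adama", 2, 1200.0, 91),
        (103, "Hana", "Computer Science", "Hawassa", 1, 1600.0, 77),
    ],
)

# Step 1: horizontal fragment (rows of one course only).
conn.execute(
    "CREATE TABLE COMP_STD AS SELECT * FROM STUDENT WHERE Course = 'Computer Science'"
)

# Step 2: vertical fragments of that horizontal fragment, each keeping the key.
conn.execute("CREATE TABLE COMP_STD_FEES AS SELECT Regd_No, Fees FROM COMP_STD")
conn.execute(
    "CREATE TABLE COMP_STD_MARKS AS SELECT Regd_No, Name, Semester, Marks FROM COMP_STD"
)

comp_fees = conn.execute("SELECT * FROM COMP_STD_FEES ORDER BY Regd_No").fetchall()
print(comp_fees)
```

The resulting fragment holds only the rows and columns one site needs. Rebuilding STUDENT from such fragments takes both joins and unions, which is why reconstruction is often expensive.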
5. Distributed Query Processing:
Query processing is a set of all activities starting from query placement to displaying the results
of the query.
For centralized systems, the primary criterion for measuring the cost of a particular strategy is
the number of disk accesses.
In a distributed system, other issues must also be taken into account:
1. The cost of data transmission over the network.
2. The potential gain in performance from having several sites process parts of the query in
parallel.
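Both factors can be put into a toy calculation. The linear transfer model (a fixed setup delay plus size divided by bandwidth), the site names and all numbers below are assumptions chosen for illustration, not measurements.

```python
def transfer_time(num_bytes, setup_s=0.1, bytes_per_s=1_000_000):
    """Assumed linear model: fixed setup delay plus size divided by bandwidth."""
    return setup_s + num_bytes / bytes_per_s


# Hypothetical plan: two sites each process a local fragment,
# then ship their partial results to the site that issued the query.
local_work_s = {"site_A": 2.0, "site_B": 3.0}
result_bytes = {"site_A": 500_000, "site_B": 1_000_000}

per_site_s = {
    site: local_work_s[site] + transfer_time(result_bytes[site])
    for site in local_work_s
}

# Total cost: resources consumed summed over all sites.
total_cost_s = sum(per_site_s.values())

# Response time: elapsed time when the sites work in parallel.
response_time_s = max(per_site_s.values())

print(per_site_s, total_cost_s, response_time_s)
```

Under these assumptions the response time is driven by the slowest site alone, while the total cost still counts every site. An optimizer that minimizes one does not necessarily minimize the other.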

Layers of Query Processing:


• Query Decomposition. The first layer decomposes the calculus query into an algebraic query
on global relations.
• Data Localization. The input to the second layer is an algebraic query on global relations.
• Global Query Optimization.
• Distributed Query Execution.
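To make the data localization layer more concrete, the sketch below assumes a global STUDENT relation that is horizontally fragmented by Course. It rewrites a global query into per-site queries and skips fragments that cannot contain matching rows. The fragment names, site names and the localize function are hypothetical.

```python
# Hypothetical fragment schema:
# fragment name -> (site, Course value that defines the fragment)
FRAGMENTS = {
    "COMP_STD": ("site_cs", "Computer Science"),
    "MATH_STD": ("site_math", "Mathematics"),
    "PHYS_STD": ("site_phys", "Physics"),
}


def localize(course=None):
    """Localize SELECT * FROM STUDENT [WHERE Course = course].

    Returns a list of (site, local query) pairs. The global relation is
    replaced by its fragments, and fragments whose defining predicate
    contradicts the query predicate are pruned.
    """
    plan = []
    for fragment, (site, fragment_course) in FRAGMENTS.items():
        if course is not None and course != fragment_course:
            continue  # this fragment cannot hold matching rows
        plan.append((site, f"SELECT * FROM {fragment}"))
    return plan


all_sites_plan = localize()
math_plan = localize("Mathematics")
print(all_sites_plan)
print(math_plan)
```

A query without a predicate on Course must visit every fragment and union the results. A query on one course touches a single site, which is the saving localization is meant to find.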

Steps of query processing:

• Step 1: Parsing.
• Step 2: Translation.
• Step 3: Optimizer.
• Step 4: Execution Plan.
• Step 5: Evaluation.

Figure 2.1 steps in query processing

1. Parsing and translation − Check syntax and verify relations, then translate the query into an
equivalent relational algebra expression.
2. Optimization − Generate an optimal evaluation plan (the one with the lowest cost) for the query.
3. Evaluation − The query-execution engine takes an (optimal) evaluation plan, executes that
plan, and returns the answers to the query.
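These steps can be observed in a real engine. The sketch below uses SQLite through Python's sqlite3 module. The instructor table's columns, the index name and the sample rows are assumptions for illustration. EXPLAIN QUERY PLAN is SQLite's way of showing the plan its optimizer chose, and other systems expose this differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE instructor (ID INTEGER PRIMARY KEY, name TEXT, salary REAL)")
conn.execute("CREATE INDEX idx_instructor_salary ON instructor (salary)")
conn.executemany(
    "INSERT INTO instructor VALUES (?, ?, ?)",
    [(1, "Kim", 60000.0), (2, "Wu", 90000.0)],
)

query = "SELECT salary FROM instructor WHERE salary < ?"

# Parsing and translation: a syntactically invalid query is rejected
# before any plan is built.
try:
    conn.execute("SELEC salary FROM instructor")
    parse_error = None
except sqlite3.OperationalError as exc:
    parse_error = str(exc)

# Optimization: ask SQLite which evaluation plan it chose.
# The last column of each row is a human-readable description of a plan step.
plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query, (75000,))]

# Evaluation: execute the plan and return the answers.
answers = conn.execute(query, (75000,)).fetchall()

print(parse_error)
print(plan)
print(answers)
```

The exact wording of the plan depends on the SQLite version, so it is printed rather than compared against a fixed string.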

Cost of Query:
1. Response time (wall-clock time needed to execute a plan) depends on several factors:
System configuration
• amount of dedicated buffer in RAM (also called memory or main memory)
• whether or not indices are (partially) stored permanently in the buffer
Runtime conditions
• amount of free buffer at the time the plan is executed
• content of the buffer at the time the plan is executed
• parameters embedded in queries, which are resolved only at runtime, for example
SELECT salary
FROM instructor WHERE salary < $a
where $a is a variable provided by the application (user)
2. Query cost (total elapsed time for answering a query) is measured in terms of different resources:
• disk access (I/O operation on disk)
• CPU usage
• network communication (for a distributed DBMS)
3. Typically, disk access is the predominant cost, and it is also relatively easy to estimate. It is
measured by taking into account
• Number of seeks (number of random I/O accesses)
• Number of blocks read
• Number of blocks written
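A common way to turn these three counts into one estimate is to weight each seek by an average seek time and each block by an average transfer time. The sketch below assumes that simple formula. The default timings (4 ms per seek, 0.1 ms per block) and the two example plans are hypothetical, and charging written blocks the same as read blocks is a simplification.

```python
def disk_cost_ms(seeks, blocks_read, blocks_written=0, seek_ms=4.0, transfer_ms=0.1):
    """Estimated disk cost: seeks * seek time + blocks moved * transfer time.

    Written blocks are charged like read blocks here, which is an
    assumption; on real disks a write can cost more than a read.
    """
    return seeks * seek_ms + (blocks_read + blocks_written) * transfer_ms


# Hypothetical plan 1: one seek, then a sequential scan of 10,000 blocks.
full_scan_ms = disk_cost_ms(seeks=1, blocks_read=10_000)

# Hypothetical plan 2: an index lookup touching 4 blocks, each at a random location.
index_lookup_ms = disk_cost_ms(seeks=4, blocks_read=4)

print(full_scan_ms, index_lookup_ms)
```

Under these assumed numbers the index lookup is far cheaper even though it performs more seeks, because it reads so few blocks. With a less selective predicate the comparison can flip, which is why the optimizer estimates both counts.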
