Chapter 6 Distributed System Management
1. Introduction
A database is an ordered collection of related data that is built for a specific purpose. A
database may be organized as a collection of multiple tables, where a table represents a
real-world element or entity. Each table has several fields that represent the characteristic
features of the entity. A database management system (DBMS) is a collection of programs that
enables the creation and maintenance of a database. A DBMS is available as a software package
that facilitates the definition, construction, manipulation and sharing of data in a database.
Database Schemas
A database schema is a description of the database which is specified during database design
and subject to infrequent alterations. It defines the organization of the data, the relationships
among them, and the constraints associated with them. Databases are often represented through
the three-schema architecture or ANSI-SPARC architecture. The goal of this architecture is to
separate the user application from the physical database. The three levels are −
Internal Level having Internal Schema − It describes the physical structure, details of internal
storage and access paths for the database.
Conceptual Level having Conceptual Schema − It describes the structure of the whole database
while hiding the details of physical storage of data. This illustrates the entities, attributes with
their data types and constraints, user operations and relationships.
External or View Level having External Schemas or Views − It describes the portion of a
database relevant to a particular user or a group of users, while hiding the rest of the database.
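As an illustration of the external level, a view can expose only the portion of a table that a
given user group needs. A minimal sketch in SQL, assuming the STUDENT (Regd_No, Name,
Course, Address, Fees) table used in the fragmentation examples later in this chapter (the view
name STD_PUBLIC is illustrative):
CREATE VIEW STD_PUBLIC AS
SELECT Regd_No, Name, Course FROM STUDENT;
-- Users granted access to this view see only these three attributes;
-- Address and Fees remain hidden at the external level.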
Distributed DBMS
A distributed database is a collection of multiple interconnected databases spread physically
across various locations and communicating via a computer network. A Distributed Database
Management System (DDBMS) manages the distributed database and provides mechanisms that
make the distribution transparent to the users. In these systems, data is intentionally
distributed among multiple nodes so that all computing resources of the organization can be
optimally used.
Features
• Databases in the collection are logically interrelated with each other. Often they
represent a single logical database.
• Data is physically stored across multiple sites. Data in each site can be managed by a
DBMS independent of the other sites.
• The processors at the sites are connected via a network; they do not form a
shared-memory multiprocessor configuration.
• A distributed database is not a loosely connected file system.
• A distributed database incorporates transaction processing, but it is not synonymous
with a transaction processing system.
Distributed Database Management System
A distributed database management system (DDBMS) is a centralized software system that
manages a distributed database as if it were all stored at a single location.
Features
• If all the data is stored at a single computer site that is merely accessed by many users,
the system is a centralized database, not a distributed one.
• In a distributed database, both the database and the DBMS software are distributed over
many sites, connected by a computer network.
Heterogeneous distributed database systems can be organized in two ways −
Federated − The heterogeneous database systems are independent in nature and are integrated
together so that they function as a single database system.
Un-federated − The database systems employ a central coordinating module through which
the databases are accessed.
DDBMS architectures are generally developed depending on three parameters −
• Distribution − It states the physical distribution of data across the different sites.
• Autonomy − It indicates the distribution of control of the database system and the
degree to which each constituent DBMS can operate independently.
• Heterogeneity − It refers to the uniformity or dissimilarity of the data models, system
components and databases.
Architectural Models
• Client - Server Architecture for DDBMS
• Peer - to - Peer Architecture for DDBMS
• Multi - DBMS Architecture
Client - Server Architecture for DDBMS
This is a two-level architecture where the functionality is divided into servers and clients. The
server functions primarily encompass data management, query processing, optimization and
transaction management. Client functions mainly include the user interface. However, clients
also carry some DBMS functions, such as consistency checking and transaction management.
The functionality is divided into two classes: server functions and client functions.
The server does most of the data management work −
• Query processing
• Data management
• Optimization
• Transaction management, etc.
The client performs −
• Application logic
• User interface
• DBMS client module
The two different client-server architectures are −
• Single Server, Multiple Clients − a single server accessed by multiple clients.
• Multiple Servers, Multiple Clients − multiple servers accessed by multiple clients.
• A client server architecture has a number of clients and a few servers connected in a
network.
• A client sends a query to one of the servers. The earliest available server solves it and
replies.
• A client-server architecture is simple to implement and execute due to its centralized
server system.
Multi - DBMS Architecture
This is an integrated database system formed by a collection of two or more autonomous
database systems. A multi-DBMS can be expressed through six levels of schemas −
• Multi-database View Level − Depicts multiple user views, each comprising a subset of the
integrated distributed database.
• Multi-database Conceptual Level − Depicts the integrated multi-database, comprising
global logical multi-database structure definitions.
• Multi-database Internal Level − Depicts the data distribution across different sites and
multi-database to local data mapping.
• Local database View Level − Depicts public view of local data.
• Local database Conceptual Level − Depicts local data organization at each site.
• Local database Internal Level − Depicts physical data organization at each site.
There are two design alternatives for a multi-DBMS −
1. Model with a multi-database conceptual level, i.e., using a Global Conceptual Schema (GCS).
2. Model without a multi-database conceptual level.
Data Replication:
Data replication is the process of storing separate copies of the database at two or more
sites. It is a popular fault-tolerance technique in distributed databases.
Advantages of Data Replication:
• Reliability − In case of failure of any site, the database system continues to work since a copy
is available at another site(s).
• Reduction in Network Load − Since local copies of data are available, query processing can
be done with reduced network usage, particularly during prime hours. Data updating can be
done at non-prime hours.
• Quicker Response − Availability of local copies of data ensures quick query processing and
consequently quick response time.
• Simpler Transactions − Transactions require fewer joins of tables located at different
sites and minimal coordination across the network. Thus, they become simpler in nature.
Disadvantages of Data Replication
• Increased Storage Requirements − Maintaining multiple copies of data is associated with
increased storage costs. The storage space required is in multiples of the storage required for
a centralized system.
• Increased Cost and Complexity of Data Updating − Each time a data item is updated, the
update needs to be reflected in all the copies of the data at the different sites. This requires
complex synchronization techniques and protocols.
• Undesirable Application – Database coupling − If complex update mechanisms are not
used, removing data inconsistency requires complex coordination at the application level.
This results in undesirable application–database coupling.
Some commonly used replication techniques are −
• Snapshot replication
• Near-real-time replication
• Pull replication
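As a rough illustration of snapshot replication, many SQL systems can maintain a periodically
refreshed copy of a table as a materialized view. A minimal sketch, assuming a PostgreSQL-style
dialect and the STUDENT table used in the fragmentation examples below:
-- Create a snapshot copy of the table at this site
CREATE MATERIALIZED VIEW STUDENT_SNAPSHOT AS
SELECT * FROM STUDENT;
-- Re-run periodically (e.g., at non-prime hours) to bring the copy up to date
REFRESH MATERIALIZED VIEW STUDENT_SNAPSHOT;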
Fragmentation:
Fragmentation is the task of dividing a table into a set of smaller tables. The subsets of the table are
called fragments. Fragmentation can be of three types: horizontal, vertical, and hybrid (combination
of horizontal and vertical). Horizontal fragmentation can further be classified into two techniques:
primary horizontal fragmentation and derived horizontal fragmentation.
Fragmentation should be done in such a way that the original table can be reconstructed from
the fragments whenever required. This requirement is called “reconstructiveness.”
Advantages of Fragmentation
• Since data is stored close to the site of usage, efficiency of the database system is increased.
• Local query optimization techniques are sufficient for most queries since data is locally
available.
• Since irrelevant data is not available at the sites, security and privacy of the database system
can be maintained.
Disadvantages of Fragmentation
• When data from different fragments are required, the access speeds may be very low.
• In case of recursive fragmentations, the job of reconstruction will need expensive techniques.
• Lack of back-up copies of data in different sites may render the database ineffective in case
of failure of a site.
Vertical Fragmentation:
In vertical fragmentation, the fields or columns of a table are grouped into fragments. In order to
maintain reconstructiveness, each fragment should contain the primary key field(s) of the table.
Vertical fragmentation can be used to enforce privacy of data.
For example, let us consider that a University database keeps records of all registered students
in a Student table having the following schema −
STUDENT (Regd_No, Name, Course, Address, Fees)
Now, the fees details are maintained in the accounts section. In this case, the designer will fragment
the database as follows −
CREATE TABLE STD_FEES AS
SELECT Regd_No, Fees FROM STUDENT;
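For reconstructiveness, the remaining attributes can be placed in a second fragment that also
carries the primary key, so the original table can be rebuilt with a join. A minimal sketch,
assuming the STUDENT schema above (the fragment name STD_INFO is illustrative):
CREATE TABLE STD_INFO AS
SELECT Regd_No, Name, Course, Address FROM STUDENT;
-- Reconstruction: join the two vertical fragments on the primary key
SELECT i.Regd_No, i.Name, i.Course, i.Address, f.Fees
FROM STD_INFO i JOIN STD_FEES f ON i.Regd_No = f.Regd_No;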
Horizontal Fragmentation:
Horizontal fragmentation groups the tuples of a table according to the values of one or more
fields. Horizontal fragmentation should also conform to the rule of reconstructiveness: each
horizontal fragment must have all columns of the original base table.
For example, in the student schema, if the details of all students of the Computer Science
course need to be maintained at the School of Computer Science, then the designer will
horizontally fragment the database as follows −
CREATE TABLE COMP_STD AS
SELECT * FROM STUDENT WHERE Course = 'Computer Science';
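Reconstructiveness holds because the original table is the union of its horizontal fragments.
A minimal sketch, assuming a second, illustrative fragment OTHER_STD holds the remaining rows
(and that Course is never NULL, so the fragments are complete and disjoint):
CREATE TABLE OTHER_STD AS
SELECT * FROM STUDENT WHERE Course <> 'Computer Science';
-- Reconstruction: union of the disjoint horizontal fragments
SELECT * FROM COMP_STD
UNION ALL
SELECT * FROM OTHER_STD;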
Hybrid Fragmentation:
In hybrid fragmentation, a combination of horizontal and vertical fragmentation techniques is
used. This is the most flexible fragmentation technique, since it generates fragments with
minimal extraneous information. However, reconstruction of the original table is often an
expensive task.
Hybrid fragmentation can be done in two alternative ways −
• At first, generate a set of horizontal fragments; then generate vertical fragments from one or
more of the horizontal fragments.
• At first, generate a set of vertical fragments; then generate horizontal fragments from one or
more of the vertical fragments.
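As an illustration of the first alternative, the two earlier examples can be combined: fragment
STUDENT horizontally by course, then split the Computer Science fragment vertically. A minimal
sketch reusing COMP_STD from above (the fragment names are illustrative):
CREATE TABLE CS_STD_FEES AS
SELECT Regd_No, Fees FROM COMP_STD;
CREATE TABLE CS_STD_INFO AS
SELECT Regd_No, Name, Address FROM COMP_STD;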
5. Distributed Query Processing:
Query processing is a set of all activities starting from query placement to displaying the results
of the query.
For centralized systems, the primary criterion for measuring the cost of a particular strategy is
the number of disk accesses.
In a distributed system, other query issues must also be taken into account −
1. The cost of data transmission over the network.
2. The potential gain in performance from having several sites process parts of the query
in parallel.
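For instance, under purely illustrative assumptions, shipping a 100 MB intermediate result over
a 10 MB/s link adds about 10 seconds of transmission time, which can dwarf local disk costs;
conversely, a scan split evenly across 4 sites working in parallel can finish in as little as a
quarter of the single-site elapsed time. A good distributed strategy balances these two effects.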
Query processing proceeds through the following steps −
• Step 1: Parsing
• Step 2: Translation
• Step 3: Optimization
• Step 4: Execution plan generation
• Step 5: Evaluation
Cost of Query:
1. Response time (the wall-clock time needed to execute a plan) depends on several factors.
System configuration −
• amount of dedicated buffer in RAM (i.e., main memory)
• whether or not indices are (partially) stored permanently in the buffer
Runtime conditions −
• amount of free buffer at the time the plan is executed
• content of the buffer at the time the plan is executed
• parameters embedded in queries, which are resolved at runtime only, e.g.
SELECT salary
FROM instructor
WHERE salary < $a
where $a is a variable provided by the application (user)
2. Query cost (the total elapsed time for answering a query) is measured in terms of different
resources −
▪ disk access (I/O operations on disk)
▪ CPU usage
▪ network communication (the additional cost factor in a distributed DBMS, as discussed above)
3. Typically, disk access is the predominant cost, and it is also relatively easy to estimate.
It is measured by taking into account −
▪ Number of seeks (number of random I/O accesses)
▪ Number of blocks read
▪ Number of blocks written
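As a rough textbook-style cost model (not specific to any particular DBMS), if a plan needs b
block transfers and S random seeks, with average block-transfer time tT and seek time tS, its
disk cost is approximately b × tT + S × tS. For example, with illustrative values tT = 0.1 ms
and tS = 4 ms, a plan performing 10,000 block transfers and 100 seeks costs about
10,000 × 0.1 + 100 × 4 = 1,400 ms.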