Student Notes
Module Structure
• Distributed DBMS
• Distributed DBMS Architecture
• Distributed Data Sources
• Distributed Design Issues
ANSI/SPARC Architecture
• In late 1972, the Computer and Information Processing Committee (X3) of the American
National Standards Institute (ANSI) established a Study Group on Database Management
Systems under the auspices of its Standards Planning and Requirements Committee (SPARC).
• The mission of the study group was to study the feasibility of setting up standards in this
area, as well as to determine which aspects should be standardized if it was feasible.
• The study group proposed that the interfaces be standardized, and defined an architectural
framework that contained 43 interfaces, 14 of which would deal with the physical storage
subsystem of the computer and would therefore not be considered essential parts of the DBMS
architecture.
DISTRIBUTED DATA SYSTEMS - SSZG554
• In a simplified version of the ANSI/SPARC architecture there are three views of data:
– The external view, which is that of the end user, who might be a programmer;
– The internal view, that of the system or machine; and
– The conceptual view, that of the enterprise.
Internal Schema
• At the lowest level of the architecture is the internal view, which deals with the physical
definition and organization of data.
• The location of data on different storage devices and the access mechanisms used to reach
and manipulate data are the issues dealt with at this level.
External Schema
• At the other extreme is the external view, which is concerned with how users view the
database.
• An individual user’s view represents the portion of the database that will be accessed by that
user as well as the relationships that the user would like to see among the data.
• A view can be shared among a number of users, with the collection of user views making up
the external schema.
Conceptual Schema
• In between these two ends is the conceptual schema, which is an abstract definition of the
database.
• It is the “real world” view of the enterprise being modeled in the database.
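The three levels can be sketched with an embedded database; the emp table, its columns, and the restricted view below are invented for illustration. The base table plays the role of the conceptual schema, the view an external schema, and the engine's own page and B-tree organization the internal schema.

```python
import sqlite3

# Conceptual schema: the enterprise-wide logical definition of the data
# (the emp table and its columns are invented for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (eno INTEGER PRIMARY KEY, ename TEXT, salary REAL)")
conn.execute("INSERT INTO emp VALUES (1, 'Ada', 90000.0)")

# External schema: one user's view of the database -- here a view that
# hides the salary column from that user.
conn.execute("CREATE VIEW emp_public AS SELECT eno, ename FROM emp")

# The internal schema (pages, B-trees, access paths) is managed by the
# engine itself and never surfaces in these statements.
rows = conn.execute("SELECT * FROM emp_public").fetchall()
print(rows)  # [(1, 'Ada')]
```

A view can be shared by several users; the collection of such views makes up the external schema, exactly as the notes describe.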
Client/Server Systems
• The general idea is very simple and elegant: distinguish the functionality that needs to be
provided and divide these functions into two classes: server functions and client functions.
• This provides a two-level architecture which makes it easier to manage the complexity of
modern DBMSs and the complexity of distribution.
• In relational systems, the server does most of the data management work.
• This means that all of query processing and optimization, transaction management and
storage management is done at the server.
• The client, in addition to the application and the user interface, has a DBMS client module
that is responsible for managing the data that is cached to the client and (sometimes)
managing the transaction locks that may have been cached as well.
• It is also possible to place consistency checking of user queries at the client side, but this is
not common since it requires the replication of the system catalog at the client machines.
• In relational systems, the communication between the clients and the server(s) is at
the level of SQL statements.
• There are a number of different types of client/server architecture.
• The simplest is the case where there is only one server which is accessed by multiple clients;
we call this multiple client/single server.
• From a data management perspective, this is not much different from centralized databases
since the database is stored on only one machine (the server) that also hosts the software to
manage it.
• A more sophisticated client/server architecture is one where there are multiple servers in
the system, the so-called multiple client/multiple server approach.
• In this case, two alternative management strategies are possible: either each client manages
its own connection to the appropriate server, or each client knows of only its “home server”,
which then communicates with other servers as required.
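A minimal in-process sketch of the multiple client/single server case, assuming clients ship plain SQL text to the server, which does all the query processing and storage management. The table name and the queue-based transport are illustrative stand-ins for a real network protocol.

```python
import queue
import sqlite3
import threading

def server(requests):
    # The server owns the database and does all query processing and
    # storage management; clients only send SQL text and receive rows.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE t (x INTEGER)")
    db.execute("INSERT INTO t VALUES (1), (2)")
    while True:
        sql, reply = requests.get()
        if sql is None:  # shutdown signal
            break
        reply.put(db.execute(sql).fetchall())

requests = queue.Queue()
threading.Thread(target=server, args=(requests,), daemon=True).start()

# Two "clients" sharing the one server: multiple client/single server.
results = []
for _ in range(2):
    reply = queue.Queue()
    requests.put(("SELECT x FROM t ORDER BY x", reply))
    results.append(reply.get())
requests.put((None, None))
print(results)  # [[(1,), (2,)], [(1,), (2,)]]
```

From the data management perspective this is close to a centralized database, since a single machine hosts both the data and the software that manages it.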
Peer-to-Peer Systems
• The physical data organization on each machine may be, and probably is, different.
• This means that there needs to be an individual internal schema definition at each site,
which we call the local internal schema (LIS).
• The enterprise view of the data is described by the global conceptual schema (GCS), which is
global because it describes the logical structure of the data at all the sites.
• To handle data fragmentation and replication, the logical organization of data at each site
needs to be described.
• Therefore, there needs to be a third layer in the architecture, the local conceptual schema
(LCS).
• In the architectural model we have chosen, then, the global conceptual schema is the union
of the local conceptual schemas.
• Finally, user applications and user access to the database are supported by external schemas.
• The user queries data irrespective of its location or of which local component of the
distributed database system will service it.
• The distributed DBMS translates global queries into a group of local queries, which are
executed by distributed DBMS components at different sites that communicate with one
another.
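The translation of a global query into local ones can be sketched with two embedded databases standing in for sites; the emp table and its horizontal fragmentation across the sites are assumptions for illustration.

```python
import sqlite3

# Two sites, each holding a horizontal fragment of a global table `emp`
# (the table and its fragmentation are invented for illustration).
site1 = sqlite3.connect(":memory:")
site2 = sqlite3.connect(":memory:")
for site in (site1, site2):
    site.execute("CREATE TABLE emp (eno INTEGER, ename TEXT)")
site1.executemany("INSERT INTO emp VALUES (?, ?)", [(1, "Ada"), (2, "Bob")])
site2.executemany("INSERT INTO emp VALUES (?, ?)", [(3, "Cyd")])

def global_query(sql):
    # The distributed DBMS ships the (here identical) local query to every
    # site holding a fragment and unions the partial results.
    rows = []
    for site in (site1, site2):
        rows.extend(site.execute(sql).fetchall())
    return sorted(rows)

print(global_query("SELECT * FROM emp"))
# [(1, 'Ada'), (2, 'Bob'), (3, 'Cyd')]
```

The user sees only the global table; which sites actually service the query is hidden by the decomposition step.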
The first major component, which we call the user processor, consists of four elements:
1. The user interface handler is responsible for interpreting user commands as they come in, and
formatting the result data as it is sent to the user.
2. The semantic data controller uses the integrity constraints and authorization information that are
defined as part of the global conceptual schema to check whether the user query can be processed.
3. The global query optimizer and decomposer determines an execution strategy to minimize a cost
function, and translates the global queries into local ones using the global and local conceptual
schemas as well as the global directory.
The global query optimizer is responsible, among other things, for generating the best strategy to
execute distributed join operations.
4. The distributed execution monitor coordinates the distributed execution of the user request.
The execution monitor is also called the distributed transaction manager.
In executing queries in a distributed fashion, the execution monitors at various sites may, and usually
do, communicate with one another.
The second major component of a distributed DBMS is the data processor, and it consists of
three elements:
1. The local query optimizer, which actually acts as the access path selector, is responsible for
choosing the best access path to access any data item.
2. The local recovery manager is responsible for making sure that the local database remains
consistent even when failures occur.
3. The run-time support processor physically accesses the database according to the physical
commands in the schedule generated by the query optimizer.
The run-time support processor is the interface to the operating system and contains the database
buffer (or cache) manager, which is responsible for maintaining the main memory buffers and
managing the data accesses.
Multidatabase System
• Multidatabase systems (MDBS) represent the case where individual DBMSs (whether
distributed or not) are fully autonomous and have no concept of cooperation; they may not
even “know” of each other’s existence or how to talk to each other.
• The differences in the level of autonomy between the distributed multi-DBMSs and
distributed DBMSs are also reflected in their architectural models.
• In the case of logically integrated distributed DBMSs, the global conceptual schema defines
the conceptual view of the entire database, while in the case of distributed multi-DBMSs, it
represents only the collection of some of the local databases that each local DBMS wants to
share.
• The individual DBMSs may choose to make some of their data available for access by others
by defining an export schema.
• In an MDBS, the GCS, which is also called a mediated schema, is defined by integrating either
the external schemas of local autonomous databases or (possibly parts of) their local
conceptual schemas.
• Designing the global conceptual schema in multidatabase systems involves the integration of
either the local conceptual schemas or the local external schemas.
• A major difference between the design of the GCS in multi-DBMSs and in logically integrated
distributed DBMSs is that in the former the mapping is from the local conceptual schemas to a
global schema, while in the latter the mapping is from the global schema to the local
conceptual schemas.
• If heterogeneity exists in the multidatabase system, a canonical data model has to be found
to define the GCS.
• If heterogeneity exists in the system, then two implementation alternatives exist: unilingual
and multilingual.
• A unilingual multi-DBMS requires the users to utilize possibly different data models and
languages when both a local database and the global database are accessed.
• The identifying characteristic of unilingual systems is that any application that accesses data
from multiple databases must do so by means of an external view that is defined on the
global conceptual schema.
• This means that the user of the global database is effectively a different user than those who
access only a local database, utilizing a different data model and a different data language.
• An alternative is the multilingual architecture, where the basic philosophy is to permit each user
to access the global database (i.e., data from other databases) by means of an external
schema defined using the language of the user’s local DBMS.
• A popular implementation architecture for MDBSs is the mediator/wrapper approach.
• A mediator is a software module that exploits encoded knowledge about certain sets or
subsets of data to create information for a higher layer of applications.
• Using this architecture to implement an MDBS, each module in the multi-DBMS layer is
realized as a mediator.
• Since mediators can be built on top of other mediators, it is possible to construct a layered
implementation.
• In mapping this architecture to the logical view of the data, the mediator level implements the
GCS.
• It is this level that handles user queries over the GCS and performs the MDBS functionality.
• The mediators typically operate using a common data model and interface language.
• To deal with potential heterogeneities of the source DBMSs, wrappers are implemented
whose task is to provide a mapping between a source DBMS’s view and the mediators’ view.
• The exact role and function of mediators differ from one implementation to another.
• In some cases, thin mediators have been implemented that do nothing more than
translation.
• In other cases, wrappers take over the execution of some of the query functionality.
• One can view the collection of mediators as a middleware layer that provides services above
the source systems.
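The mediator/wrapper layering might be sketched as follows; the source formats, the common model (lists of dicts), and all class and field names are invented for illustration. Each wrapper maps one heterogeneous source into the mediator's common model, and the mediator answers queries over the union.

```python
# Wrapper over a source that stores rows as comma-separated strings
# (a made-up "heterogeneous" format for illustration).
class CsvishWrapper:
    def __init__(self, lines):
        self.lines = lines  # e.g. "1,Ada"
    def rows(self):
        return [dict(zip(("eno", "ename"), line.split(","))) for line in self.lines]

# Wrapper over a source that stores dicts under different key names;
# it renames keys into the mediator's common model.
class DictWrapper:
    def __init__(self, records):
        self.records = records
    def rows(self):
        return [{"eno": r["id"], "ename": r["name"]} for r in self.records]

class Mediator:
    # Implements the GCS: answers queries over the union of wrapped sources.
    def __init__(self, wrappers):
        self.wrappers = wrappers
    def select(self, predicate):
        return [r for w in self.wrappers for r in w.rows() if predicate(r)]

med = Mediator([CsvishWrapper(["1,Ada"]),
                DictWrapper([{"id": "2", "name": "Bob"}])])
print(med.select(lambda r: True))
# [{'eno': '1', 'ename': 'Ada'}, {'eno': '2', 'ename': 'Bob'}]
```

A "thin" wrapper, as the notes describe, only translates; a thicker one could also evaluate the predicate locally before returning rows to the mediator.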
• Machine data sources are stored on the system with a user-defined name.
• Associated with the data source name is all of the information the Driver Manager and
driver need to connect to the data source.
• For an Xbase data source, this might be the name of the Xbase driver, the full path of the
directory containing the Xbase files, and some options that tell the driver how to use those
files, such as single-user mode or read-only.
• File data sources are stored in a file and allow connection information to be used repeatedly
by a single user or shared among several users.
• When a file data source is used, the Driver Manager makes the connection to the data
source using the information in a .dsn file.
• This file can be manipulated like any other file. A file data source does not have a data
source name, as does a machine data source, and is not registered to any one user or
machine.
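A file data source is plain text in INI form, with an [ODBC] section whose keywords the Driver Manager assembles into a connection string; the driver name and keyword values below are example values, not a real installed driver.

```python
import configparser

# Example contents of a .dsn file; the [ODBC] section holds the
# connection keywords (driver name and options are illustrative).
dsn_text = """\
[ODBC]
DRIVER=Xbase Driver
DefaultDir=C:\\data\\xbase
ReadOnly=1
"""

cfg = configparser.ConfigParser()
cfg.optionxform = str  # keep keyword case exactly as written in the file
cfg.read_string(dsn_text)

# The Driver Manager would assemble these keywords into a connection string:
conn_str = ";".join(f"{k}={v}" for k, v in cfg["ODBC"].items())
print(conn_str)  # DRIVER=Xbase Driver;DefaultDir=C:\data\xbase;ReadOnly=1
```

Because the file can be copied and shared like any other file, the same connection information is reusable by several users, as the notes point out.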
• Data sources usually are created by the end user or a technician with a program called
the ODBC Administrator.
• The ODBC Administrator prompts the user for the driver to use and then calls that driver.
• The driver displays a dialog box that requests the information it needs to connect to the data
source.
• After the user enters the information, the driver stores it on the system.
• Later, the application calls the Driver Manager and passes it the name of a machine data
source or the path of a file containing a file data source.
• When passed a machine data source name, the Driver Manager searches the system to find
the driver used by the data source.
• It then loads the driver and passes the data source name to it. The driver uses the data
source name to find the information it needs to connect to the data source.
• Finally, it connects to the data source, typically prompting the user for a user ID and
password, which generally are not stored.
• When passed a file data source, the Driver Manager opens the file and loads the specified
driver.
• If the file also contains a connection string, it passes this to the driver.
• Using the information in the connection string, the driver connects to the data source.
• If no connection string was passed, the driver generally prompts the user for the necessary
information.
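The lookup-and-connect flow for a machine data source might be sketched as follows; the registry contents, the driver table, and the return values are all invented for illustration and stand in for the real system registry and loadable driver libraries.

```python
# Toy registry of machine data sources: name -> driver + connect keywords.
machine_dsns = {"payroll": {"driver": "xbase_driver",
                            "keywords": {"DefaultDir": "/data/payroll"}}}

# Stand-in for loadable driver libraries; a real driver would open the
# data files or a network connection using the keywords it is handed.
drivers = {"xbase_driver": lambda kw: ("xbase", dict(kw))}

def driver_manager_connect(name):
    entry = machine_dsns[name]         # find which driver the DSN names
    driver = drivers[entry["driver"]]  # "load" that driver
    return driver(entry["keywords"])   # the driver makes the connection

print(driver_manager_connect("payroll"))
# ('xbase', {'DefaultDir': '/data/payroll'})
```

The file-DSN path differs only in step one: the keywords come from the .dsn file's connection string rather than from the registry lookup.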