Distributed DB
A distributed database is a collection of multiple interconnected databases, which are
spread physically across various locations that communicate via a computer network.
Features
Location independence
Hardware independence
Network independence
Transaction transparency
DBMS independence
Improved performance
Centralized Database
In a centralized database, all the information for the organisation is stored in a single database at a single site. This database is known as the centralized database.
Advantages
Some advantages of Centralized Database Management System are −
The data integrity is maximized as the whole database is stored at a single physical
location. This means that it is easier to coordinate the data and it is as accurate and
consistent as possible.
The data redundancy is minimal in the centralized database. All the data is stored
together and not scattered across different locations. So, it is easier to make sure there
is no redundant data available.
Since all the data is in one place, there can be stronger security measures around it. So,
the centralized database is much more secure.
Data is easily portable because it is stored at the same place.
The centralized database is cheaper than other types of databases as it requires less
power and maintenance.
All the information in the centralized database can be easily accessed from the same
location and at the same time.
Disadvantages
Some disadvantages of Centralized Database Management System are −
Since all the data is at one location, it takes more time to search and access it. If the
network is slow, this process takes even more time.
There is a lot of data access traffic for the centralized database. This may create a
bottleneck situation.
Since all the data is at the same location, if multiple users try to access it simultaneously,
contention arises. This may reduce the efficiency of the system.
If there are no database recovery measures in place and a system failure occurs, then
all the data in the database will be lost.
In a database, the chances of data duplication are quite high as several users
use one database. A DBMS reduces data repetition and redundancy by
creating a single data repository that can be accessed by multiple users.
In a homogeneous distributed database, all the sites use identical DBMS and operating
systems. Its properties are −
The sites use very similar software.
The sites use identical DBMS or DBMS from the same vendor.
Each site is aware of all other sites and cooperates with other sites to process
user requests.
The database is accessed through a single interface as if it is a single database.
Distribution transparency in a DDBMS has three dimensions −
Location transparency
Fragmentation transparency
Replication transparency
Location Transparency
Location transparency ensures that the user can query any table(s) or fragment(s)
of a table as if they were stored locally at the user's site. The fact that the table or its
fragments are stored at a remote site in the distributed database system should be
completely hidden from the end user. The address of the remote site(s) and the access
mechanisms are completely hidden.
In order to provide location transparency, the DDBMS should have access to an updated
and accurate data dictionary and DDBMS directory, which contain the details of the
locations of data.
Fragmentation Transparency
Fragmentation transparency enables users to query upon any table as if it were
unfragmented. Thus, it hides the fact that the table the user is querying on is actually a
fragment or union of some fragments. It also conceals the fact that the fragments are
located at diverse sites.
This is somewhat similar to the use of SQL views, where the user may not know that
they are querying a view of a table instead of the table itself.
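As a rough illustration of this analogy, a view can hide the fact that a table is really a union of fragments. A minimal sketch, assuming the table has been split into two horizontal fragments E1 and E2 (the view name EMPLOYEE_ALL is illustrative, not from the original notes):

-- The view presents the two fragments as one logical table
CREATE VIEW EMPLOYEE_ALL AS
SELECT * FROM E1
UNION ALL
SELECT * FROM E2;

-- The user queries the view as if it were a single, unfragmented table
SELECT * FROM EMPLOYEE_ALL WHERE EMP_DOB < '01-JAN-1960';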
Replication Transparency
Replication transparency ensures that replication of databases is hidden from the
users. It enables users to query a table as if only a single copy of the table exists.
Replication transparency is associated with concurrency transparency and failure
transparency. Whenever a user updates a data item, the update is reflected in all the
copies of the table; however, this operation should not be visible to the user. This is
concurrency transparency. Also, in case of failure of a site, the user can still proceed
with his queries using replicated copies without any knowledge of the failure. This is
failure transparency.
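A real DDBMS propagates such updates through its own replication and concurrency-control machinery. Purely as a simplified, single-site sketch of the idea, a trigger could copy an update from one copy of a table to another; EMPLOYEE_COPY and EMP_ID are hypothetical names introduced only for this illustration:

CREATE OR REPLACE TRIGGER trg_replicate_employee
AFTER UPDATE ON EMPLOYEE
FOR EACH ROW
BEGIN
  -- propagate the change to the replicated copy so both copies stay consistent
  UPDATE EMPLOYEE_COPY
  SET EMP_DOB = :NEW.EMP_DOB
  WHERE EMP_ID = :NEW.EMP_ID;
END;
/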
Combination of Transparencies
In any distributed database system, the designer should ensure that all the stated
transparencies are maintained to a considerable extent. The designer may choose to
fragment tables, replicate them and store them at different sites, all transparent to the
end user. However, complete distribution transparency is a tough task and requires
considerable design effort.
Distribution Transparency
Distribution transparency allows a physically dispersed database to be
managed as though it were a centralized database. The level of transparency
supported by the DDBMS varies from system to system. Three levels of
distribution transparency are recognized:
• Fragmentation transparency is the highest level of transparency. The end
user or programmer does not need to know that a database is partitioned.
Therefore, neither fragment names nor fragment locations are specified prior
to data access.
• Location transparency exists when the end user or programmer must
specify the database fragment names but does not need to specify where
those fragments are located.
• Local mapping transparency exists when the end user or programmer must
specify both the fragment names and their locations.
Now suppose that the end user wants to list all employees with a date of
birth prior to January 1, 1960. To focus on the transparency issues, also
suppose that the EMPLOYEE table is fragmented and each fragment is
unique. The unique fragment condition indicates that each row is unique,
regardless of the fragment in which it is located. Finally, assume that no
portion of the database is replicated at any other site on the network.
Depending on the level of distribution transparency support, you may
examine three query cases.
Case 1: The Database Supports Fragmentation Transparency
The query does not need to specify fragment names or locations; it reads:
SELECT *
FROM EMPLOYEE
WHERE EMP_DOB < '01-JAN-1960';
Case 2: The Database Supports Location Transparency
Fragment names must be specified in the query, but the fragment’s location
is not specified. The query reads:
SELECT *
FROM E1
WHERE EMP_DOB < '01-JAN-1960'
UNION
SELECT *
FROM E2
WHERE EMP_DOB < '01-JAN-1960'
UNION
SELECT *
FROM E3
WHERE EMP_DOB < '01-JAN-1960';
Case 3: The Database Supports Local Mapping Transparency
Both the fragment name and its location must be specified in the query.
Using pseudo-SQL:
SELECT *
FROM E1 NODE NY
WHERE EMP_DOB < '01-JAN-1960'
UNION
SELECT *
FROM E2 NODE ATL
WHERE EMP_DOB < '01-JAN-1960'
UNION
SELECT *
FROM E3 NODE MIA
WHERE EMP_DOB < '01-JAN-1960';
Distribution transparency is supported by a distributed data dictionary
(DDD), or a distributed data catalog (DDC). The DDC contains the
description of the entire database as seen by the database administrator.
The database description, known as the distributed global schema, is the
common database schema used by local TPs (transaction processors) to translate user
requests into subqueries (remote requests) that will be processed by different DPs
(data processors). The DDC is itself distributed, and it is replicated at the network nodes.
Therefore, the DDC must maintain consistency through updating at all sites.
The following guidelines help achieve the different forms of transparency in a distributed database −
2. The query language that will be used should not include any location
specification. In this way, location transparency can be achieved.
3. The data that is stored in a relation should not contain any location specification.
In this way, naming transparency can be assured.
4. Every database object must have a system wide unique name.
5. The location information can be found using the data dictionary.
6. Using aliases, we can move the database objects transparently (see the sketch after this list).
7. Replication Transparency
Replication transparency states that the replicas that are created should be
controlled by the system, not by the user. The user should not have any doubt about
whether the fetched data is coming from a replicated copy of the relation or from the
actual copy of the relation. To achieve replication transparency, a concurrency
control protocol needs to be devised which ensures that an update of data
taking place in one copy is also applied to the other copies. In this way,
transparency regarding replicas of data can be maintained.
8. Fragmentation Transparency
Fragmentation transparency states that the fragments that are created to store
the data in a distributed manner should remain transparent, and all the data
management work required to control the fragments should be done by the
system, not by the user. When a user submits a query, the global query
is decomposed and distributed to many sites to fetch data from the fragments, and
this data is put together at the end to generate the result. The system ensures that
the entire procedure of query decomposition and re-composition is transparent to
the user.
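As a sketch of how aliases can hide object movement (referenced from rule 6 above), using Oracle-style database links and synonyms; the link name NY_SITE and the connection details are hypothetical:

CREATE DATABASE LINK NY_SITE CONNECT TO app_user IDENTIFIED BY app_pwd USING 'NYDB';
-- assuming no local table named EMPLOYEE exists at this site
CREATE OR REPLACE SYNONYM EMPLOYEE FOR EMPLOYEE@NY_SITE;
-- If the table later moves to another site, only the synonym is redefined;
-- queries written against EMPLOYEE continue to work unchanged.
SELECT * FROM EMPLOYEE WHERE EMP_DOB < '01-JAN-1960';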
Figure 15.4: Reference architecture for a distributed database
1. Global schema
The global schema contains two parts, a global external schema and a global conceptual
schema. The global schema gives access to the entire system. It provides applications with
access to the entire distributed database system and a logical description of the whole
database as if it were not distributed.
2. Fragmentation schema
The fragmentation schema gives the description of how the data is partitioned.
3. Allocation schema
The allocation schema gives a description of where the partitions are located.
4. Local mapping
The local mapping contains the local conceptual and local internal schema. The local
conceptual schema provides the description of the local data. The local internal schema
gives the description of how the data is physically stored on the disk.
Data fragmentation in DBMS
Distributed database systems provide distribution transparency of the data
over the DBs. This is achieved through the concept called data fragmentation,
that is, fragmenting the data over the network and over the DBs. Initially,
all the DBs and data are designed as per the standards of any database
system, by applying normalization and denormalization. But the concept of a
distributed system requires this normalized data to be divided further: the
main goal of a DDBMS is to provide the data to the user from the location
nearest to them, and as fast as possible. Hence the data in a table is
divided according to location or as per the users' requirements.
Dividing the whole table data into smaller chunks and storing them in different
DBs of the DDBMS is called data fragmentation. Fragmenting a relation allows each
site to store only the portion of the data it actually needs, as the queries below illustrate.
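A minimal sketch of such location-based fragments of the EMPLOYEE table, assuming an illustrative EMP_LOCATION column; the site names NY and ATL are also illustrative:

-- Fragment held at the New York site
CREATE TABLE EMPLOYEE_NY AS
SELECT * FROM EMPLOYEE WHERE EMP_LOCATION = 'NY';

-- Fragment held at the Atlanta site
CREATE TABLE EMPLOYEE_ATL AS
SELECT * FROM EMPLOYEE WHERE EMP_LOCATION = 'ATL';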
Now these queries will give subsets of records from the EMPLOYEE table
depending on the location of the employees. These subsets of data will be
stored in the DBs at the respective locations. Any insert, update and delete on the
employee records will be done on the DBs at their location and will be
synched with the main table at regular intervals.
The above is a simple example of horizontal fragmentation. Fragmentation is done
based on the requirements and the purpose of the DDB, and a fragment can also be
defined by more than one condition joined with AND or OR clauses.
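A sketch of such a compound-condition fragment, assuming an illustrative EMP_DEPT column in addition to EMP_LOCATION:

CREATE TABLE EMPLOYEE_NY_SALES AS
SELECT * FROM EMPLOYEE
WHERE EMP_LOCATION = 'NY' AND EMP_DEPT = 'SALES';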
In vertical fragmentation, each fragment holds only some of the columns (details) of the whole
employee record; a sketch of such fragments follows this paragraph. This is useful when the user
needs to query only a few details about an employee. For example, consider a query to find the
department of an employee: this can be done by querying the third fragment of the table. A query
to find the name and age of an employee whose ID is given can be done by querying the first
fragment of the table. This avoids performing a 'SELECT *' operation, which would need a lot of
memory to query the whole table, both to traverse all the data and to hold all the columns.
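A minimal sketch of such vertical fragments, assuming EMP_ID is the primary key of EMPLOYEE; the other column names are illustrative:

-- First fragment: identification details
CREATE TABLE EMP_FRAG1 AS
SELECT EMP_ID, EMP_NAME, EMP_AGE FROM EMPLOYEE;

-- Second fragment: personal details
CREATE TABLE EMP_FRAG2 AS
SELECT EMP_ID, EMP_DOB, EMP_ADDRESS FROM EMPLOYEE;

-- Third fragment: job details
CREATE TABLE EMP_FRAG3 AS
SELECT EMP_ID, EMP_DEPT, EMP_SALARY FROM EMPLOYEE;

The primary key EMP_ID is repeated in every fragment so that the original table can be reconstructed by joining the fragments.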
In these fragments an overlapping column can be seen, but this column is the primary key and is
hardly ever changed throughout the life cycle of a record, so the cost of maintaining this
overlapping column is very low. In addition, this column is required if we need to reconstruct the
table or to pull data from two fragments by joining them. Hence this design still meets the
conditions of fragmentation.
Note that CRUD doesn't include a means of searching for things in a database, which would
generally be considered a higher-level capability.
Distribution of a database affects implementation strategy, but it's generally desirable that
the semantics of the core operations don't constrain the user to a particular
implementation.
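As a rough mapping, the four CRUD operations correspond to the basic SQL statements sketched below; the EMPLOYEE table and the EMP_ID and EMP_NAME columns are used only for illustration:

INSERT INTO EMPLOYEE (EMP_ID, EMP_NAME) VALUES (1, 'Ravi');   -- Create
SELECT * FROM EMPLOYEE WHERE EMP_ID = 1;                      -- Read
UPDATE EMPLOYEE SET EMP_NAME = 'Ravi K' WHERE EMP_ID = 1;     -- Update
DELETE FROM EMPLOYEE WHERE EMP_ID = 1;                        -- Delete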
Access Rights
A user's access rights refer to the privileges that the user is given regarding DBMS
operations, such as the rights to create a table, drop a table, add/delete/update tuples
in a table, or query the table.
In distributed environments, since there is a large number of tables and an even larger
number of users, it is not feasible to assign individual access rights to users. So, a
DDBMS defines certain roles. A role is a construct with certain privileges within a
database system. Once the different roles are defined, the individual users are
assigned one of these roles. Often a hierarchy of roles is defined according to the
organization's hierarchy of authority and responsibility.
For example, the following SQL statements create a role "Accountant" and then
assigns this role to user "ABC".
CREATE ROLE ACCOUNTANT;
GRANT SELECT, INSERT, UPDATE ON EMP_SAL TO ACCOUNTANT;
GRANT INSERT, UPDATE, DELETE ON TENDER TO ACCOUNTANT;
GRANT INSERT, SELECT ON EXPENSE TO ACCOUNTANT;
COMMIT;
GRANT ACCOUNTANT TO ABC;
COMMIT;
A data type constraint restricts the range of values and the type of operations that can
be applied to the field with the specified data type.
For example, let us consider that a table "HOSTEL" has three fields - the hostel
number, hostel name and capacity. The hostel number should start with capital letter
"H" and cannot be NULL, and the capacity should not be more than 150. The following
SQL command can be used for data definition −
CREATE TABLE HOSTEL (
H_NO VARCHAR2(5) NOT NULL,
H_NAME VARCHAR2(15),
CAPACITY INTEGER,
CHECK ( H_NO LIKE 'H%'),
CHECK ( CAPACITY <= 150)
);
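For illustration, the following inserts show how these CHECK and NOT NULL constraints behave; the hostel values are made up:

INSERT INTO HOSTEL VALUES ('A101', 'Ganga', 100);  -- rejected: H_NO does not start with 'H'
INSERT INTO HOSTEL VALUES ('H101', 'Ganga', 200);  -- rejected: CAPACITY exceeds 150
INSERT INTO HOSTEL VALUES ('H101', 'Ganga', 120);  -- accepted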
Entity integrity control enforces the rules so that each tuple can be uniquely identified
from other tuples. For this, a primary key is defined. A primary key is a minimal set of
fields that can uniquely identify a tuple. The entity integrity constraint states that no two
tuples in a table can have identical values for the primary key and that no field which is
part of the primary key can have a NULL value.
For example, in the above hostel table, the hostel number can be assigned as the
primary key through the following SQL statement (ignoring the checks) −
CREATE TABLE HOSTEL (
H_NO VARCHAR2(5) PRIMARY KEY,
H_NAME VARCHAR2(15),
CAPACITY INTEGER
);
Referential integrity constraint lays down the rules for foreign keys. A foreign key is a
field in a table that references the primary key of a related table. The referential integrity
constraint lays down the rule that the value of the foreign key field should either be
among the values of the primary key of the referenced table or be entirely NULL.
For example, let us consider a student table where a student may opt to live in a
hostel. To include this, the primary key of hostel table should be included as a foreign
key in the student table. The following SQL statement incorporates this −
CREATE TABLE STUDENT (
S_ROLL INTEGER PRIMARY KEY,
S_NAME VARCHAR2(25) NOT NULL,
S_COURSE VARCHAR2(10),
S_HOSTEL VARCHAR2(5) REFERENCES HOSTEL
);
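For illustration, the inserts below show how this constraint behaves; the student values are made up, and 'H101' is assumed to exist in HOSTEL while 'H999' is not:

INSERT INTO STUDENT VALUES (101, 'Ravi', 'CSE', 'H999');  -- rejected: 'H999' is not in HOSTEL
INSERT INTO STUDENT VALUES (102, 'Asha', 'ECE', 'H101');  -- accepted
INSERT INTO STUDENT VALUES (103, 'John', 'EEE', NULL);    -- accepted: the foreign key may be NULL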
Integrity Constraints
o Integrity constraints are a set of rules. They are used to maintain the quality of
information.
o Integrity constraints ensure that data insertion, updating, and other processes
are performed in such a way that data integrity is not affected.
o Thus, integrity constraints are used to guard against accidental damage to the
database.
Types of Integrity Constraint
1. Domain constraints
o Domain constraints can be defined as the definition of a valid set of values for an
attribute.
o The data type of domain includes string, character, integer, time, date, currency,
etc. The value of the attribute must be available in the corresponding domain.
Example:
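A minimal sketch of a domain constraint, using a hypothetical STUDENT_INFO table; the table and column names are illustrative:

CREATE TABLE STUDENT_INFO (
S_ID INTEGER,
S_NAME VARCHAR2(25),
S_AGE INTEGER CHECK (S_AGE BETWEEN 17 AND 60)
);
INSERT INTO STUDENT_INFO VALUES (1, 'Ravi', 'A');   -- rejected: 'A' is not in the INTEGER domain
INSERT INTO STUDENT_INFO VALUES (2, 'Asha', 20);    -- accepted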
2. Entity integrity constraints
o The entity integrity constraint states that primary key value can't be null.
o This is because the primary key value is used to identify individual rows in a relation,
and if the primary key has a null value, then we can't identify those rows.
o A table can contain null values in fields other than the primary key field.
Example:
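For illustration, using the HOSTEL table defined earlier with H_NO as its primary key; the values are made up:

INSERT INTO HOSTEL (H_NO, H_NAME, CAPACITY) VALUES (NULL, 'Yamuna', 100);  -- rejected: the primary key H_NO cannot be NULL
INSERT INTO HOSTEL (H_NO, H_NAME, CAPACITY) VALUES ('H102', NULL, 100);    -- accepted: non-key fields may be NULL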
3. Referential integrity constraints
o A referential integrity constraint is specified between two tables.
o If a foreign key in one table refers to the primary key of another table, then every value
of the foreign key must either be NULL or be available among the primary key values of
the referenced table (see the STUDENT and HOSTEL example above).
4. Key constraints
o Keys are attributes that are used to identify an entity uniquely within its entity set.
o An entity set can have multiple keys, out of which one key will be the primary
key. A primary key must contain unique values and cannot be NULL, whereas other
candidate keys must be unique but may contain NULL values in the relational table.
Example:
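A minimal sketch of key constraints, using a hypothetical DEPT table:

CREATE TABLE DEPT (
DEPT_ID INTEGER PRIMARY KEY,       -- primary key: unique and NOT NULL
DEPT_NAME VARCHAR2(20) UNIQUE      -- candidate key: unique, but may be NULL
);
INSERT INTO DEPT VALUES (1, 'SALES');   -- accepted
INSERT INTO DEPT VALUES (1, 'HR');      -- rejected: DEPT_ID 1 already exists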