Distributed Databases
UNIT - I
A database is an ordered collection of related data that is built for a specific purpose. A data-
base may be organized as a collection of multiple tables, where a table represents a real world
element or entity. Each table has several different fields that represent the characteristic fea-
tures of the entity.
For example, a company database may include tables for projects, employees, departments,
products and financial records. The fields in the Employee table may be Name, Company_Id,
Date_of_Joining, and so forth.
Hierarchical DBMS
In hierarchical databases, data elements have parent-child relationships and are modelled using the “tree” data structure. These are very fast and simple.
Network DBMS
In network databases, the relationships among data are of the many-to-many type and are modelled as a “network” structure.
Figure 1.2 Network DBMS
Relational DBMS
In relational databases, the database is represented in the form of relations. Each relation
models an entity and is represented as a table of values. In the relation or table, a row is
called a tuple and denotes a single record. A column is called a field or an attribute and de-
notes a characteristic property of the entity. RDBMS is the most popular database manage-
ment system.
Centralised Database
A centralized database is stored at a single location, such as a mainframe computer. It is maintained and modified from that location only, and is usually accessed over a network connection such as a LAN or WAN. Centralized databases are used by organisations such as colleges, companies, banks etc.
All the information for the organisation is stored in a single database, known as the centralized database.
Advantages
Some advantages of Centralized Database Management System are −
• The data integrity is maximised as the whole database is stored at a single physical location. This
means that it is easier to coordinate the data and it is as accurate and consistent as possible.
• The data redundancy is minimal in the centralised database. All the data is stored together and not
scattered across different locations. So, it is easier to make sure there is no redundant data avail-
able.
• Since all the data is in one place, there can be stronger security measures around it. So, the centralised database is much more secure.
Disadvantages
• Since all the data is at one location, it takes more time to search and access it. If the net-
work is slow, this process takes even more time.
• There is a lot of data access traffic for the centralized database. This may create a bottleneck situation.
• Since all the data is at the same location, if multiple users try to access it simultaneously
it creates a problem. This may reduce the efficiency of the system.
• If there are no database recovery measures in place and a system failure occurs, then all
the data in the database will be destroyed.
Distributed DBMS
A distributed database is a set of interconnected databases that is distributed over the com-
puter network or internet. A Distributed Database Management System (DDBMS) manages
the distributed database and provides mechanisms so as to make the databases transparent to
the users. In these systems, data is intentionally distributed among multiple nodes so that all
computing resources of the organization can be optimally used.
A distributed database is a collection of multiple interconnected databases, which are
spread physically across various locations that communicate via a computer network.
Features
· Databases in the collection are logically interrelated with each other. Often they
represent a single logical database.
· Data is physically stored across multiple sites. Data in each site can be managed by
a DBMS independent of the other sites.
· The processors in the sites are connected via a network. They do not have any
multiprocessor configuration.
· A distributed database is not a loosely connected file system.
· A distributed database incorporates transaction processing, but it is not synonymous
with a transaction processing system.
Factors Encouraging DDBMS
· Distributed Nature of Organizational Units − Most organizations in the current times
are subdivided into multiple units that are physically distributed over the globe. Each
unit requires its own set of local data. Thus, the overall database of the organization
becomes distributed.
· Need for Sharing of Data − The multiple organizational units often need to communi-
cate with each other and share their data and resources. This demands common data-
bases or replicated databases that should be used in a synchronized manner.
· Support for Both OLTP and OLAP − Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP) work upon diversified systems which may have common data. Distributed database systems aid both kinds of processing by providing synchronized data.
· Database Recovery − One of the common techniques used in DDBMS is replication
of data across different sites. Replication of data automatically helps in data recovery
if database in any site is damaged. Users can access data from other sites while the
damaged site is being reconstructed. Thus, database failure may become almost in-
conspicuous to users.
· Support for Multiple Application Software − Most organizations use a variety of ap-
plication software each with its specific database support. DDBMS provides a uni-
form functionality for using the same data among different platforms.
Advantages of Distributed Databases
· Modular Development − If the system needs to be expanded to new locations or new
units, in centralized database systems, the action requires substantial efforts and dis-
ruption in the existing functioning. However, in distributed databases, the work sim-
ply requires adding new computers and local data to the new site and finally connect-
ing them to the distributed system, with no interruption in current functions.
· More Reliable − In case of database failures, the total system of centralized databases comes to a halt. However, in distributed systems, when a component fails, the functioning of the system continues, albeit at reduced performance. Hence the DDBMS is more reliable.
· Better Response − If data is distributed in an efficient manner, then user requests can
be met from local data itself, thus providing faster response. On the other hand, in
centralized systems, all queries have to pass through the central computer for process-
ing, which increases the response time.
Figure 1.6 Centralized vs. distributed database: a centralized database is maintained at one site, while a distributed database is maintained at a number of different sites.
• Data Transparency: DDBMS hides all the complexities from its users and
provides transparent access to data and applications to users.
· Each site is aware of all other sites and cooperates with other sites to process user
requests.
· The database is accessed through a single interface as if it is a single data-
base.
Design Alternatives
The distribution design alternatives for the tables in a DDBMS are non-replicated and non-fragmented, fully replicated, partially replicated, fragmented, and mixed.
Mixed Distribution: This is a combination of fragmentation and partial replication. Here, the tables are initially fragmented in any form (horizontal or vertical), and then these fragments are partially replicated across the different sites according to the frequency of accessing the fragments.
Design Strategies
The strategies can be broadly divided into replication and fragmentation. However, in most
cases, a combination of the two is used.
Data Replication
Data replication is the process of storing separate copies of the database at two or more sites.
It is a popular fault tolerance technique of distributed databases.
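As a toy illustration of what replication means, the sketch below keeps a second copy of the Employee table in step with the original by using a trigger. The replica table and trigger names are hypothetical, and MySQL-style trigger syntax is assumed; real DDBMSs replicate at the storage or log level rather than with user triggers.
-- Hypothetical replica table with the same structure as EMPLOYEE
CREATE TABLE EMPLOYEE_REPLICA (
   Name VARCHAR(50),
   Company_Id INTEGER,
   Date_of_Joining DATE
);
-- Copy every new row into the replica (MySQL-style single-statement trigger)
CREATE TRIGGER employee_replicate
AFTER INSERT ON EMPLOYEE
FOR EACH ROW
INSERT INTO EMPLOYEE_REPLICA
VALUES (NEW.Name, NEW.Company_Id, NEW.Date_of_Joining);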
Advantages of Data Replication
· Reliability − In case of failure of any site, the database system continues to work
since a copy is available at another site(s).
· Reduction in Network Load − Since local copies of data are available, query process-
ing can be done with reduced network usage, particularly during prime hours. Data
updating can be done at non-prime hours.
· Quicker Response − Availability of local copies of data ensures quick query process-
ing and consequently quick response time.
· Simpler Transactions − Transactions require fewer joins of tables located at different sites and minimal coordination across the network. Thus, they become simpler in nature.
Types of Replication
1. Snapshot Replication
• Periodically copies the entire dataset from the source to the target database.
• Suitable for systems where real-time updates are not necessary.
• Example: A reporting system that updates once a day.
2. Transactional Replication
• Changes are captured as they happen and propagated to the target databases in near real time.
• Example: A banking system that keeps branch databases synchronized.
3. Merge Replication
• Changes can be made at multiple locations, and they are merged later.
• Used in scenarios where databases operate independently and sync periodically.
• Example: Mobile applications that work offline and sync when connected.
4. Full Replication
• The entire database is stored and maintained at every site of the distributed system.
Fragmentation
Fragmentation is the task of dividing a table into a set of smaller tables, called fragments. Fragmentation can be of three types: horizontal, vertical, and hybrid (a combination of horizontal and vertical).
Disadvantages of Fragmentation
1. Applications whose views are defined on more than one fragment may suffer
performance degradation, if applications have conflicting requirements.
2. Simple tasks like checking for dependencies would result in chasing after data across a number of sites.
3. When data from different fragments is required, the access speeds may be very low.
4. In case of recursive fragmentations, the job of reconstruction will need expensive
techniques.
5. Lack of back-up copies of data in different sites may render the database ineffective
in case of failure of a site.
Vertical Fragmentation
In vertical fragmentation, the fields or columns of a table are grouped into fragments. In or-
der to maintain reconstructiveness, each fragment should contain the primary key field(s) of
the table. Vertical fragmentation can be used to enforce privacy of data.
Grouping
· Starts by assigning each attribute to one fragment
o At each step, joins some of the fragments until some criterion is satisfied.
· Results in overlapping fragments
Splitting
· Starts with a relation and decides on beneficial partitioning based on the access
behaviour of applications to the attributes
· Fits more naturally within the top-down design
· Generates non-overlapping fragments
For example, let us consider that a University database keeps records of all registered stu-
dents in a Student table having the following schema.
STUDENT(Regd_No, Name, Course, Address, Semester, Fees, Marks)
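For instance, if only the fees-related information is needed at the accounts section, the designer may keep a vertical fragment there. A sketch in standard SQL, assuming the Student schema above (the fragment name STD_FEES is illustrative):
-- Vertical fragment: the primary key plus the fees field only
CREATE TABLE STD_FEES AS
SELECT Regd_No, Fees
FROM STUDENT;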
Horizontal Fragmentation
Horizontal fragmentation groups the tuples of a table according to the values of one or more fields. Each horizontal fragment must have all the columns of the original base table, so that the original table can be reconstructed from its fragments.
For example, in the student schema, if the details of all students of the Computer Science course need to be maintained at the School of Computer Science, then the designer will horizontally fragment the database as follows −
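A sketch in standard SQL (the fragment name COMP_STD is illustrative):
-- Horizontal fragment: all columns, but only Computer Science students
CREATE TABLE COMP_STD AS
SELECT *
FROM STUDENT
WHERE Course = 'Computer Science';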
Hybrid Fragmentation
In hybrid fragmentation, a combination of horizontal and vertical fragmentation techniques is used. This is the most flexible fragmentation technique since it generates fragments with minimal extraneous information. However, reconstruction of the original table is often an expensive task.
Hybrid fragmentation can be done in two alternative ways −
At first, generate a set of horizontal fragments; then generate vertical fragments from one or
more of the horizontal fragments.
At first, generate a set of vertical fragments; then generate horizontal fragments from one or
more of the vertical fragments.
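Continuing the earlier illustration, the first alternative can be sketched by fragmenting horizontally and then vertically (names again illustrative):
-- Vertical fragment taken from the horizontal fragment COMP_STD created above
CREATE TABLE COMP_STD_FEES AS
SELECT Regd_No, Fees
FROM COMP_STD;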
Transparency
Transparency in a DBMS refers to the separation of the high-level semantics of the system from the low-level implementation issues. The high-level semantics is what the end user sees, while the low-level implementation concerns how the data is physically stored in the database. Transparency is implemented in a DBMS through data independence at the various layers of the database.
Distribution transparency is the property of distributed databases by virtue of which the internal details of the distribution are hidden from the users. The DDBMS designer may
choose to fragment tables, replicate the fragments and store them at different sites. However,
since users are oblivious of these details, they find the distributed database easy to use like
any centralized database.
Unlike a normal DBMS, a DDBMS deals with a communication network, replicas and fragments of data. Thus, transparency also involves these three factors.
Following are three types of transparency:
1. Location transparency
2. Fragmentation transparency
3. Replication transparency
Location Transparency
Location transparency ensures that the user can query any table(s) or fragment(s) of a table as if they were stored locally at the user’s site. The fact that the table or its fragments are stored at a remote site in the distributed database system should be completely hidden from the end user. The address of the remote site(s) and the access mechanisms are completely hidden. In order to incorporate location transparency, the DDBMS should have access to an updated and accurate data dictionary and DDBMS directory which contains the details of the locations of data.
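As an illustration, in Oracle-style SQL a synonym can hide the remote location of a table behind a local name. The database link DELHI_SITE is hypothetical:
-- Without location transparency, the user must name the remote site explicitly
SELECT * FROM STUDENT@DELHI_SITE;
-- A synonym hides the location; queries now look purely local
CREATE SYNONYM STUDENT FOR STUDENT@DELHI_SITE;
SELECT * FROM STUDENT;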
Fragmentation Transparency
Fragmentation transparency enables users to query any table as if it were unfragmented. Thus, it hides the fact that the table the user is querying is actually a fragment or a union of some fragments. It also conceals the fact that the fragments are located at diverse sites. This is somewhat similar to the use of SQL views, where the user may not know that they are using a view of a table instead of the table itself.
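The view analogy can be made concrete: a view can reassemble horizontal fragments so users keep querying one logical table. The fragment names follow the earlier examples, with ELEC_STD as a hypothetical second fragment:
-- Reconstruct the logical student table from its horizontal fragments
CREATE VIEW ALL_STUDENTS AS
SELECT * FROM COMP_STD
UNION ALL
SELECT * FROM ELEC_STD;
-- Users query the view as if the table were never fragmented
SELECT Name, Course FROM ALL_STUDENTS WHERE Semester = 3;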
Replication Transparency
Replication transparency ensures that replication of databases is hidden from the users. It enables users to query a table as if only a single copy of the table exists. Replication transparency is associated with concurrency transparency and failure transparency. Whenever a user updates a data item, the update is reflected in all the copies of the table; however, this operation should not be visible to the user. This is concurrency transparency. Also, in case of failure of a site, the user can still proceed with queries using replicated copies without any knowledge of the failure. This is failure transparency.
Combination of Transparencies
In any distributed database system, the designer should ensure that all the stated transparencies are maintained to a considerable extent. The designer may choose to fragment tables, replicate them and store them at different sites, all hidden from the end user. However, complete distribution transparency is a tough task and requires considerable design effort.
Authentication
In a distributed database system, authentication is the process through which only legitimate
users can gain access to the data resources.
Authentication can be enforced in two levels −
Controlling Access to the Client Computer − At this level, user access is restricted while logging in to the client computer that provides the user interface to the database server. The most common method is a username/password combination. However, more sophisticated methods like biometric authentication may be used for high-security data.
Controlling Access to the Database Software − At this level, the database software/admin-
istrator assigns some credentials to the user. The user gains access to the database using these
credentials. One of the methods is to create a login account within the database server.
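For instance, a login account can be created inside the database server with statements along these lines (Oracle-style syntax; the user name ABC and the password are placeholders):
-- Create a database login and allow it to connect
CREATE USER ABC IDENTIFIED BY StrongPassword1;
GRANT CREATE SESSION TO ABC;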
Access Rights
A user’s access rights refer to the privileges that the user is given regarding DBMS operations, such as the rights to create a table, drop a table, add/delete/update tuples in a table, or query the table.
In distributed environments, since there are a large number of tables and a still larger number of users, it is not feasible to assign individual access rights to users. So, the DDBMS defines certain roles. A role is a construct with certain privileges within a database system. Once the different roles are defined, the individual users are assigned one of these roles. Often a hierarchy of roles is defined according to the organization’s hierarchy of authority and responsibility.
For example, the following SQL statements create a role "Accountant" and then assigns this
role to user "ABC".
-- Create the role and give it table privileges (EMPLOYEE used for illustration)
CREATE ROLE ACCOUNTANT;
GRANT SELECT, INSERT, UPDATE ON EMPLOYEE TO ACCOUNTANT;
-- Assign the role to the user
GRANT ACCOUNTANT TO ABC;
COMMIT;
Authorization: Authorization determines what actions authenticated users are allowed to perform within
the database. It defines access control policies based on user roles, privileges, and permissions. Here are
common authorization mechanisms used in database security:
Role-Based Access Control (RBAC): RBAC assigns permissions to roles, and users are assigned to
these roles based on their job responsibilities or organizational roles. This simplifies access management
by grouping users with similar access requirements.
Entity Integrity Control
Entity integrity control enforces the rules so that each tuple can be uniquely identified from other tuples. For this, a primary key is defined. A primary key is a minimal set of fields that can uniquely identify a tuple. The entity integrity constraint states that no two tuples in a table can have identical values for the primary key and that no field which is a part of the primary key can have a NULL value.
For example, in a hostel table, the hostel number can be assigned as the primary key through the following SQL statement (ignoring the checks) −
CREATE TABLE HOSTEL (
   H_NO VARCHAR(5) PRIMARY KEY,
   H_NAME VARCHAR(15),
   CAPACITY INTEGER
);
Database Backup
· A database backup is a stored copy of the data.
· It is a safeguard against unexpected data loss and application errors.
· It protects the database against data loss.
· If the original data is lost, it can be reconstructed using the backup.
1. Physical backups
· Physical backups are backups of the physical files used in storing and recovering the database, such as data files, control files and archived redo log files.
· A physical backup is a copy of the files storing database information to some other location, such as disk or offline storage like magnetic tape.
· Physical backups are the foundation of the recovery mechanism in the database.
· A physical backup provides the minute details of the transactions and modifications to the database.
2. Logical backup
· Logical Backup contains logical data which is extracted from a database.
· It includes backup of logical data like views, procedures, functions, tables, etc.
· It is a useful supplement to physical backups in many circumstances but not a sufficient protection against data loss
without physical backups, because logical backup provides only structural information.
Importance Of Backups
· Planning and testing backup helps against failure of media, operating system, software and any other kind of fail-
ures that cause a serious data crash.
· It determines the speed and success of the recovery.
· Physical backup extracts data from physical storage (usually from disk to tape). An operating-system-level copy of the database files is an example of a physical backup.
· Logical backup extracts data from the database using SQL and stores it in a binary file.
· Logical backup is used to restore the database objects into the database. So the logical backup utilities allow the DBA (Database Administrator) to back up and recover selected objects within the database.
Methods of Backup
The different methods of backup in a database are:
· Full Backup - This method takes a lot of time as the full copy of the database is made including the data and
the transaction records.
· Transaction Log - Only the transaction logs are saved as the backup in this method. To keep the backup file
as small as possible, the previous transaction log details are deleted once a new backup record is made.
· Differential Backup - This is similar to full backup in that it stores both the data and the transaction records.
However only that information is saved in the backup that has changed since the last full backup. Because of
this, differential backup leads to smaller files.
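As a concrete illustration, the three methods map onto statements like the following in SQL Server's T-SQL dialect (the database name and file paths are placeholders):
-- Full backup: complete copy of the data and transaction records
BACKUP DATABASE CompanyDB TO DISK = 'D:\backups\company_full.bak';
-- Transaction log backup: only the log records since the last log backup
BACKUP LOG CompanyDB TO DISK = 'D:\backups\company_log.trn';
-- Differential backup: only the changes since the last full backup
BACKUP DATABASE CompanyDB TO DISK = 'D:\backups\company_diff.bak' WITH DIFFERENTIAL;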
Hardware protection is divided into three categories: CPU protection, memory protection, and I/O protection. These are explained below.
1. CPU Protection:
CPU protection means that the CPU cannot be given to a process forever; it must be allotted for a limited time, otherwise other processes will not get the chance to execute. A timer is used for this: a process is given a certain amount of CPU time, and when the timer expires, a signal is sent to the process to leave the CPU. Hence a process cannot hold the CPU indefinitely.
2. Memory Protection:
Memory protection addresses the situation where two or more processes are in memory and one process might access another process's memory. To prevent this, two registers are used:
Base register
Limit register
The base register stores the starting address of the process and the limit register stores its size. Whenever a process accesses memory, the address is checked against these registers to decide whether the access is allowed.
3. I/O Protection:
When I/O protection is ensured, the following can never occur in the system:
1. Terminating the I/O of another process
2. Viewing the I/O of another process
3. Giving priority to a particular process's I/O
Redundancy
Data redundancy is a condition created within a database or data storage technology in which the same piece of data is
held in two separate places.
This can mean two different fields within a single database, or two different spots in multiple software environments or
platforms. Whenever data is repeated, this basically constitutes data
redundancy. This can occur by accident, but is also done deliberately for backup and recovery purposes.
Hardware redundancy
Hardware redundancy is achieved by providing two or more physical copies of a hardware component. When other
techniques, such as use of more reliable components, manufacturing quality control, test, design simplification, etc.,
have been exhausted, hardware redundancy may be the only way to improve the dependability of a system.
What Is Recovery?
· Recovery is the process of restoring a database to the correct state in the event of a failure.
· It ensures that the database is reliable and remains in a consistent state in case of a failure.
Database Recovery
There are two methods that are primarily used for database recovery. These are:
· Log based recovery - In log based recovery, logs of all database transactions are stored in a secure area so
that in case of a system failure, the database can recover the data. All log information, such as the time of the
transaction, its data etc. should be stored before the transaction is executed.
· Shadow paging - In shadow paging, updates are made on copies of database pages while the original (shadow) pages are kept unchanged; the changes become part of the database only when the transaction completes. So, if the system crashes in the middle of a transaction, changes made by it will not be reflected in the database.
Figure: Shadow page table and current page table
Log-Based Recovery
· Logs are a sequence of records that maintain a record of the actions performed by a transaction.
· In log-based recovery, the log of each transaction is maintained in some stable storage so that, if a failure occurs, the database can be recovered using it.
· The log contains information about the transaction being executed, the values that have been modified and the transaction state.
· All this information is stored in the order of execution.
Example:
Assume a transaction to modify the address of an employee. The following logs are written for this transaction:
Log 1: The transaction is initiated; a 'START' log is written.
Log: <Tn, START>
Log 2: The transaction modifies the address from the old value to the new value.
Log: <Tn, Address, 'old value', 'new value'>
Log 3: The transaction is completed; a 'COMMIT' log marks its end.
Log: <Tn, COMMIT>
There are two methods of creating the log files and updating the database,
1. Deferred Database Modification
2. Immediate Database Modification
1. In Deferred Database Modification, all the logs for the transaction are first created and stored in a stable storage system. In the above example, the three log records are created and stored first, and only then is the database updated with those steps.
2. In Immediate Database Modification, after each log record is created, the database is modified for that step of the log entry immediately. In the above example, the database is touched at each step: after the first log entry the transaction fetches the record, after the second the employee's address is updated, and after the third the database changes are committed. A sketch of the corresponding SQL appears below.
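The transaction in this example might correspond to SQL along these lines, assuming the Employee table mentioned earlier also has an Address column; the new value and the key value 101 are placeholders:
-- Log 1 (<Tn, START>) is written when the transaction begins
START TRANSACTION;
-- Log 2 records the old and new address values before this update is applied
UPDATE EMPLOYEE SET Address = 'new value' WHERE Company_Id = 101;
-- Log 3 (<Tn, COMMIT>) marks the end of the transaction
COMMIT;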
Recovery from Transaction Failures
Summary
A transaction may fail due to logical errors, deadlocks, or system crashes. Recovery techniques include:
A. Log-Based Recovery
• Updates are written to logs first and applied only after the transaction commits.
• If a crash occurs before commit, nothing is applied, and no rollback is needed.
• Example log for a committed transaction:
<T1, Start>
<T1, Update(A, 100)>
<T1, Commit>
• If the system crashes after commit → Redo changes.
• If the system crashes before commit → Undo changes.
B. Checkpointing
• A checkpoint is a point where the database writes all committed changes from logs to
the database.
• Reduces the time for recovery after a crash.
Process:
1. All log records currently in main memory are written to stable storage.
2. All modified buffer blocks are written to the database on disk.
3. A <checkpoint> record is written to the log.
4. On recovery after a crash, the log is scanned back only as far as the most recent checkpoint: transactions that committed after the checkpoint are redone, and transactions still active at the crash are undone.
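Some DBMSs expose checkpointing directly; for example, SQL Server's T-SQL lets the administrator force a manual checkpoint with a single statement:
-- Force dirty pages and log records of the current database to disk
CHECKPOINT;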
Shadow Paging