Distributed Database: Database Database Management System Storage Devices CPU Computers Network
Distributed database
A database that consists of two or more data files located at different sites on a computer network. Because the database
is distributed, different users can access it without interfering with one another. However, the DBMS must periodically
synchronize the scattered databases to make sure that they all have consistent data.
A distributed database is a database that is under the control of a central database management system (DBMS) in
which storage devices are not all attached to a common CPU. It may be stored in multiple computers located in the
same physical location, or may be dispersed over a network of interconnected computers.
Collections of data (e.g. in a database) can be distributed across multiple physical locations. A distributed database
can reside on network servers on the Internet, on corporate intranets or extranets, or on other company networks.
Replication and distribution of databases improve database performance at end-user worksites. [1]
To ensure that the distributed databases are up to date and current, there are two processes: replication and
duplication. Replication uses specialized software that looks for changes in the distributed database. Once
the changes have been identified, the replication process makes all the databases look the same. Depending on the
size and number of the distributed databases, the replication process can be very complex and can require a lot of
time and computer resources. Duplication, on the other hand, is not as complicated. It identifies one database as
a master and then duplicates that database. The duplication process is normally done at a set time after hours.
This is to ensure that each distributed location has the same data. In the duplication process, changes are allowed
to the master database only. This is to ensure that local data will not be overwritten. Both processes can keep the
data current in all distributed locations.[2]
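The difference between the two processes can be sketched in a few lines of Python. This is a minimal illustration, not a description of any particular product: the dictionaries standing in for databases, the `change_log` list and the function names are all assumptions made for the example.

```python
import copy

def duplicate(master, sites):
    """Duplication: overwrite every site with a full copy of the master."""
    for name in sites:
        sites[name] = copy.deepcopy(master)

def replicate(change_log, sites):
    """Replication: apply only the recorded changes to every site."""
    for key, value in change_log:
        for data in sites.values():
            if value is None:
                data.pop(key, None)  # a None value marks a deletion
            else:
                data[key] = value
    change_log.clear()

# Hypothetical master database and two remote sites.
master = {"p1": "bolt", "p2": "nut"}
sites = {"north": {}, "south": {}}
duplicate(master, sites)  # initial full copy, e.g. after hours

# Later changes are recorded and shipped instead of recopying everything.
change_log = []
master["p3"] = "washer"
change_log.append(("p3", "washer"))
del master["p1"]
change_log.append(("p1", None))
replicate(change_log, sites)

in_sync = all(data == master for data in sites.values())
print(in_sync)
```

Duplication moves the whole master every time, while replication only moves the changes, which is why it needs the extra bookkeeping.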
Besides distributed database replication and fragmentation, there are many other distributed database design
technologies, for example local autonomy and synchronous and asynchronous distributed database technologies. How
these technologies are implemented depends on the needs of the business and the sensitivity/confidentiality of
the data to be stored in the database, and hence on the price the business is willing to pay to ensure data security,
consistency and integrity.
Basic architecture
A database user accesses the distributed database through:
Local applications
applications which do not require data from other sites.
Global applications
applications which do require data from other sites.
A single-site failure does not affect the performance of the rest of the system. All transactions follow the A.C.I.D.
properties: atomicity, the transaction takes place as a whole or not at all; consistency, the transaction maps one
consistent DB state to another; isolation, each transaction sees a consistent DB; durability, the results of a
transaction must survive system failures. The merge replication method is used to consolidate the data between databases.
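Atomicity is the easiest of these properties to demonstrate. The sketch below uses Python's built-in `sqlite3` module on a single in-memory database. The `account` table, its `CHECK` constraint and the `transfer` helper are assumptions made for illustration, and a real distributed transaction would span several sites.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (id TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [("a", 100), ("b", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money between accounts as one all-or-nothing transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False

ok = transfer(conn, "a", "b", 30)       # succeeds
failed = transfer(conn, "a", "b", 500)  # would overdraw "a", so it is undone
balances = dict(conn.execute("SELECT id, balance FROM account"))
print(ok, failed, balances)
```

In the failed transfer the credit to `b` has already been executed when the debit violates the constraint, yet the rollback removes it, so the database moves from one consistent state to another or does not move at all.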
• Complexity — extra work must be done by the DBAs to ensure that the distributed nature of the system is
transparent. Extra work must also be done to maintain multiple disparate systems, instead of one big one.
Extra database design work must also be done to account for the disconnected nature of the database — for
example, joins become prohibitively expensive when performed across multiple systems.
• Economics — increased complexity and a more extensive infrastructure mean extra labour costs.
• Security — remote database fragments must be secured, and they are not centralized so the remote sites
must be secured as well. The infrastructure must also be secured (e.g., by encrypting the network links
between remote sites).
• Difficult to maintain integrity — in a distributed database, enforcing integrity over a network may require too
much of the network's resources to be feasible.
• Inexperience — distributed databases are difficult to work with, and as a young field there is not much readily
available experience on proper practice.
• Lack of standards — there are no tools or methodologies yet to help users convert a centralized DBMS into a
distributed DBMS.
• Database design more complex — besides the normal difficulties, the design of a distributed database has
to consider fragmentation of data, allocation of fragments to specific sites and data replication.
• Additional software is required.
• The operating system should support a distributed environment.
• Concurrency control: it is a major issue. It is typically addressed by locking and timestamping.
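As an illustration of the timestamping approach to concurrency control, here is a minimal sketch of basic timestamp ordering. The class name, the integer timestamps and the single shared item `x` are assumptions made for the example. A real scheduler would also restart aborted transactions and handle commits.

```python
class TimestampOrdering:
    """Basic timestamp-ordering check for a set of data items."""

    def __init__(self):
        self.read_ts = {}   # item -> largest timestamp that has read it
        self.write_ts = {}  # item -> largest timestamp that has written it

    def read(self, ts, item):
        # Reject a read that arrives after a younger transaction wrote the item.
        if ts < self.write_ts.get(item, 0):
            return False
        self.read_ts[item] = max(self.read_ts.get(item, 0), ts)
        return True

    def write(self, ts, item):
        # Reject a write if a younger transaction already read or wrote the item.
        if ts < self.read_ts.get(item, 0) or ts < self.write_ts.get(item, 0):
            return False
        self.write_ts[item] = ts
        return True

sched = TimestampOrdering()
results = [
    sched.read(1, "x"),   # T1 reads x
    sched.write(2, "x"),  # T2 writes x
    sched.write(1, "x"),  # T1 writes x too late and must abort
    sched.read(3, "x"),   # T3 reads x
]
print(results)
```

Each transaction carries the timestamp it was given when it started, and an operation is refused whenever accepting it would contradict that order. Locking reaches the same goal by making conflicting transactions wait instead.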
Several reasons why businesses and organizations move to distributed databases include
organizational and economic reasons, reliable and flexible interconnection of existing databases,
and future incremental growth. Companies believe that a decentralized, distributed
database approach will adapt more naturally to the structure of the organization. A distributed
database is a more suitable solution when several databases already exist in an organization. In
addition, global applications can be performed easily with a distributed
database. If an organization grows by adding new, relatively independent organizational units,
then the distributed database approach supports smooth incremental growth.
Data can physically reside nearest to where it is most often accessed, thus providing users with
local control of data that they interact with. This results in local autonomy of the data allowing
users to enforce locally the policies regarding access to their data.
One reason to consider a parallel architecture is to improve the reliability and availability of
the data in a scalable system. In a distributed system, with some careful design, it is possible to
access some, or possibly all, of the data in a failure mode if there is sufficient data replication.
A DDBMS also has a few disadvantages. Management and control are complex, and there is less
security because data is held at so many different sites.
Distributed databases provide more flexible access, which increases the chance of security
violations since the database can be accessed from every site within the network. For
many applications, it is important to provide security. Present distributed database systems do not
provide adequate mechanisms to meet these objectives. Hence the solution requires a
DDBMS capable of handling multilevel data. Such a system is also called a multilevel
security distributed database management system (MLS-DDBMS). An MLS-DDBMS
provides a verification service for users who wish to share data in the database at different
security levels. In an MLS-DDBMS, every data item in the database is associated with one of several
classifications or sensitivities.
The ability to ensure the integrity of the database in the presence of unpredictable failures of
both hardware and software components is also an important feature of any distributed
database management system. The integrity of a database is concerned with its consistency,
correctness, validity, and accuracy. The integrity controls must be built into the structure of
software, databases, and involved personnel.
If there are multiple copies of the same data, then this duplicated data introduces additional
complexity in ensuring that all copies are updated for each update. Concurrency
control and recoverability consume much of the research effort in the area of distributed
database theory. Increased reliability and performance are the goal, not the status quo.
Advantages
Organizations may be distributed across a wide geographic area. It is natural for databases to be set up to reflect
this. Local areas will keep local information and this allows local users to quickly access the local database. A
headquarters may also wish to make global queries against the local data held in the regions.
Improved shareability.
This allows users at one site to access data stored at other sites. Data is placed close to the users who normally
access it, which gives them local control and allows them to set up and establish local policies regarding
data use. A global administrator is responsible for the entire system and should help at a local level to develop
and manage the DBMS.
Data in a centralized DB is inaccessible if there is a problem with the DB. However, in a distributed system only the
local data is inaccessible, and if replication is in force then all data may still be available at another site.
Improved performance.
Since data is kept local, local access is much quicker than access to a centralized DB.
Disadvantages
Complexity.
A DDBMS is much more complex than a centralized DB. If there are conflicts between the hardware and software in use,
then this may cause performance issues and the cited advantages may become disadvantages.
Cost.
It is much more expensive to set up and maintain a DDBMS. More hardware is required, network maintenance
increases, communication costs increase and there will be additional labour costs.
Security.
It is much more difficult to maintain a secure network system across multiple locations. The network needs to be
made secure and access to replicated data needs to be maintained across multiple sites.
Distributed Databases
What is a DDBMS?
• A Distributed DB is a related collection of data just like a normal DB, but physically distributed over a network.
• The data is split into “fragments” on separate machines, each running a local DBMS.
• A Distributed DBMS is the software that manages distribution of data and processing in a fashion that is
invisible to users.
• Local DBMSs can access local data autonomously, or access remote data through the DDBMS and other
local DBMSs.
• Apps that only use local data are called “Local Applications”.
• To be a DDBMS, every local DBMS must participate in at least one Global App.
• Homogeneous DDBMSs use the same DBMS on the same platform on each site.
• Heterogeneous DDBMSs use different DBMSs/Platforms and require gateways or other middleware to
convert queries/data models between sites.
Advantages of a DDBMS:
• If well designed can strike an optimal balance between local speed and global access
• Scalability
Disadvantages of a DDBMS:
Note:
• A DDBMS is not the same as “distributed processing”, which is a centralised DB accessible through a network
(rather than the data itself being distributed).
• A DDBMS is not always the same as a “parallel DBMS”, which is a single DBMS using multiple
processors/multiple disks.
DDBMS Components:
• Global System Catalogue – same as a normal systems catalogue, plus frag/alloc information
• Distributed DBMS (DDBMS) – main functional unit – transaction management, backup/recovery etc.
• Centralised Data Storage – this is not a DDB at all. No replication so no additional storage costs.
• Fragmented Data Storage – no replication but data is split up and distributed. If done correctly, locality of
reference will be very high, as will performance, while storage and communications costs are low. Reliability and
accessibility are only ok.
• Fully Replicated Storage – every site hosts a full copy of the DB. Very high storage costs, improved
performance for read trans/low comm costs. Write trans low performance/high comms cost.
• Partially Replicated Storage – combination of above methods. It makes the most sense – if done right. You
have high locality, high reliability/access, good all-round performance, ok storage costs and low comm. costs.
Fragmentation:
• Since most apps only use subsets of relations, it makes sense to break the relations into subsets for
storage across the network.
• There are drawbacks, like increased integrity administration and performance hits for poorly fragmented data
sets.
• For a fragmentation effort to be viable: it must be complete (all items in a relation must appear in at least one
fragment of the relation), functional dependencies must be preserved, and (other than primary keys)
fragments should be disjoint.
• Types of Fragmentation:
o Horizontal – break up relation into subsets of the tuples in that relation, based on a restriction on one
or more attributes. E.g. – we could break up a table with student info into one subset for undergrads
and one subset for postgrads.
o Vertical – breaking up a relation into subsets of attributes. E.g. – breaking up a hypothetical student
table into grade/course related columns and contact/personal related columns.
o Mixed – fragments the data multiple times, in different ways. We could do our postgrad/undergrad
split and then our grades/course split to each of the fragments
o Derived – fragmenting a relation to correspond with the fragmentation of another relation upon which
the 1st relation is dependent in some way.
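The horizontal and vertical cases, together with the completeness and disjointness rules, can be sketched as follows. The student rows, column names and helper functions are hypothetical and only mirror the undergrad/postgrad and grades/contact examples in the notes.

```python
students = [
    {"id": 1, "name": "Ana", "level": "undergrad", "grade": 71},
    {"id": 2, "name": "Ben", "level": "postgrad", "grade": 64},
    {"id": 3, "name": "Cho", "level": "undergrad", "grade": 58},
]

def horizontal(rows, predicate):
    """Horizontal fragment: the subset of tuples that satisfy a restriction."""
    return [row for row in rows if predicate(row)]

def vertical(rows, columns, key="id"):
    """Vertical fragment: a projection that always keeps the primary key."""
    return [{c: row[c] for c in (key, *columns)} for row in rows]

def by_id(rows):
    """Sort rows by primary key so fragments can be compared with the source."""
    return sorted(rows, key=lambda r: r["id"])

undergrads = horizontal(students, lambda r: r["level"] == "undergrad")
postgrads = horizontal(students, lambda r: r["level"] == "postgrad")
contact = vertical(students, ["name", "level"])
grades = vertical(students, ["grade"])

# Horizontal fragments are rebuilt with a union, vertical ones with a key join.
union = by_id(undergrads + postgrads)
grade_by_id = {g["id"]: g for g in grades}
joined = by_id([{**c, **grade_by_id[c["id"]]} for c in contact])

print(union == students, joined == students)
```

Both reconstructions return the original relation, which is what completeness requires. The vertical fragments share only the primary key, so they are disjoint in the sense used above.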
Transparency:
• Distribution Transparency allows users to ignore the physical fragmentation of data, to varying degrees:
o Fragmentation Transparency – high level transparency where a user could write “SELECT * FROM Student
WHERE year = 2” without needing to specify what fragment of the Student relation contains the data,
nor where that fragment is stored.
o Location Transparency – mid level transparency where a user would need to write “SELECT * FROM
S14 WHERE year = 2” where S14 is the relevant fragment of the Student relation, but still wouldn’t
need to say where the fragment is stored.
o Local Mapping Transparency – low level transparency where a user would need to write “SELECT *
FROM S14 AT SITE 7 WHERE year = 2” where S14 is the relevant fragment of the Student relation
and SITE 7 is where the fragment is physically located.
o Distribution transparency is supported by a database name server which aliases unique database
object identifiers with user friendly names.
o Local transactions and remote single site transactions are handled without additional difficulty, but
multi-site transactions must be broken into sub-transactions (for each site) while maintaining the independence,
atomicity and durability of a centralised DBMS.
o Performance Transparency simply means that a DDBMS performs at the same level as a normal DBMS.
o This puts a lot of burden on the distributed query processor, which decides what fragment to hit,
which copy (if replicated), and which location to use, as well as calculating I/O time, CPU time and
communication costs.
• DBMS Transparency means that a heterogeneous DDBMS will behave like a homogeneous DDBMS.
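The three levels of distribution transparency differ only in how much the caller has to name. A small sketch, assuming a hypothetical catalogue that maps the Student relation to fragments and fragments to sites (the fragment S13 and SITE 3 are invented to give the relation a second fragment):

```python
# Hypothetical catalogue: relation -> fragments, fragment -> site, site -> rows.
FRAGMENTS = {"Student": ["S13", "S14"]}
LOCATION = {"S13": "SITE 3", "S14": "SITE 7"}
SITES = {
    "SITE 3": {"S13": [{"name": "Ana", "year": 1}]},
    "SITE 7": {"S14": [{"name": "Ben", "year": 2}, {"name": "Cho", "year": 2}]},
}

def query_local_mapping(fragment, site, year):
    """Local mapping transparency: the caller names the fragment and the site."""
    return [row for row in SITES[site][fragment] if row["year"] == year]

def query_location(fragment, year):
    """Location transparency: the caller names the fragment, the site is looked up."""
    return query_local_mapping(fragment, LOCATION[fragment], year)

def query_fragmentation(relation, year):
    """Fragmentation transparency: the caller names only the relation."""
    rows = []
    for fragment in FRAGMENTS[relation]:
        rows.extend(query_location(fragment, year))
    return rows

low = query_local_mapping("S14", "SITE 7", 2)
mid = query_location("S14", 2)
high = query_fragmentation("Student", 2)
print(low == mid == high)
```

All three calls return the same rows. The higher levels push the lookups into the system, which is the work the distributed query processor and name server take on.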
Replication:
• Replication Options:
o Synchronous Updates – all copy updates are part of one transaction's commit phase. This brings a lot of admin
overhead, comm. costs and opportunity for failure, but is often necessary.
o Asynchronous Updates – periodic updates of all copies based on one master copy. This violates the idea
of data independence but can be useful in situations where the cost of synchronous updates is
unwarranted.
o Master/slave: publish and subscribe model – authoritative changes are made only to the master site
and published to the slaves asynchronously.
o Workflow: like M/S, but Master status moves from site to site depending on the task at hand
o Synchronisation: we can synch up our replicas using regularly scheduled “snapshots” of the master
data, or database triggers (when X happens, do Y).
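One common way to make all copy updates part of a single commit is a two-phase commit. The notes do not name a protocol, so treating synchronous updates as two-phase commit is an assumption here, as are the `Replica` class and its `available` flag. The sketch leaves out logging and recovery.

```python
class Replica:
    """One copy of the data, able to stage an update before committing it."""

    def __init__(self, name, available=True):
        self.name = name
        self.available = available
        self.data = {}
        self.pending = None

    def prepare(self, key, value):
        """Phase 1: stage the update and vote on whether it can be committed."""
        if not self.available:
            return False
        self.pending = (key, value)
        return True

    def commit(self):
        """Phase 2 (all voted yes): make the staged update visible."""
        key, value = self.pending
        self.data[key] = value
        self.pending = None

    def abort(self):
        """Phase 2 (someone voted no): discard the staged update."""
        self.pending = None

def synchronous_update(replicas, key, value):
    """Apply the update to every copy or to none of them."""
    votes = [replica.prepare(key, value) for replica in replicas]
    if all(votes):
        for replica in replicas:
            replica.commit()
        return True
    for replica in replicas:
        replica.abort()
    return False

a, b = Replica("a"), Replica("b")
first = synchronous_update([a, b], "price", 10)
b.available = False  # one site goes down
second = synchronous_update([a, b], "price", 12)
print(first, second, a.data, b.data)
```

The second update is refused everywhere because one copy cannot take part. This shows the "opportunity for failure" cost noted above, and it is the situation asynchronous schemes are meant to tolerate.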
A homogenous distributed database system is a network of two or more Oracle Databases that
reside on one or more machines. Figure 29-1 illustrates a distributed system that connects three
databases: hq, mfg, and sales. An application can simultaneously access or modify the data in
several databases in a single distributed environment. For example, a single query from a
Manufacturing client on local database mfg can retrieve joined data from the products table on
the local database and the dept table on the remote hq database.
For a client application, the location and platform of the databases are transparent. You can also
create synonyms for remote objects in the distributed system so that users can access them with
the same syntax as local objects. For example, if you are connected to database mfg but want to
access data on database hq, creating a synonym on mfg for the remote dept table enables you to
issue this query:
SELECT * FROM dept;
In this way, a distributed system gives the appearance of native data access. Users on mfg do not
have to know that the data they access resides on remote databases.
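The effect of such a synonym can be imitated with Python's built-in `sqlite3` module. SQLite has neither synonyms nor database links, so an attached in-memory database stands in for the remote hq database and a temporary view stands in for the synonym. These substitutions, and the `deptno`/`dname` columns and sample rows, are assumptions for illustration and not how Oracle implements the feature.

```python
import sqlite3

# "mfg" is the database we are connected to; "hq" stands in for the remote one.
mfg = sqlite3.connect(":memory:")
mfg.execute("ATTACH DATABASE ':memory:' AS hq")
mfg.execute("CREATE TABLE hq.dept (deptno INTEGER, dname TEXT)")
mfg.execute("INSERT INTO hq.dept VALUES (10, 'ACCOUNTING'), (20, 'RESEARCH')")

# A temporary view plays the role of the synonym: a local name for hq.dept.
mfg.execute("CREATE TEMP VIEW dept AS SELECT * FROM hq.dept")

# The query no longer mentions where the table lives.
rows = mfg.execute("SELECT * FROM dept ORDER BY deptno").fetchall()
print(rows)
```

The application issues the same `SELECT * FROM dept` it would use for a local table, which is the location transparency the paragraph describes.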
An Oracle Database distributed database system can incorporate Oracle Databases of different
versions. All supported releases of Oracle Database can participate in a distributed database
system. Nevertheless, the applications that work with the distributed database must understand
the functionality that is available at each node in the system. A distributed database application
cannot expect an Oracle7 database to understand the SQL extensions that are only available
with Oracle Database.
The terms distributed database and distributed processing are closely related, yet have
distinct meanings. Their definitions are as follows:
• Distributed database
A set of databases in a distributed system that can appear to applications as a single data
source.
• Distributed processing
The operations that occur when an application distributes its tasks among different
computers in a network. For example, a database application typically distributes front-
end presentation tasks to client computers and allows a back-end database server to
manage shared access to a database. Consequently, a distributed database application
processing system is more commonly referred to as a client/server database application
system.
The terms distributed database system and database replication are related, yet distinct. In
a pure (that is, not replicated) distributed database, the system manages a single copy of all data
and supporting database objects. Typically, distributed database applications use distributed
transactions to access both local and remote data and modify the global database in real-time.
The term replication refers to the operation of copying and maintaining database objects in
multiple databases belonging to a distributed system. While replication relies on distributed
database technology, database replication offers applications benefits that are not possible
within a pure distributed database environment.
Most commonly, replication is used to improve local database performance and protect the
availability of applications because alternate data access options exist. For example, an
application may normally access a local database rather than a remote server to minimize
network traffic and achieve maximum performance. Furthermore, the application can continue
to function if the local server experiences a failure, but other servers with replicated data remain
accessible.
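The availability benefit can be sketched as a read that prefers the local replica and falls back to the others. The `Server` class, the `up` flag and the use of `ConnectionError` to signal an unavailable server are assumptions made for the example.

```python
class Server:
    """A database server holding a replica of the data."""

    def __init__(self, name, data, up=True):
        self.name = name
        self.data = dict(data)
        self.up = up

    def read(self, key):
        if not self.up:
            raise ConnectionError(f"{self.name} is unavailable")
        return self.data[key]

def read_with_failover(key, local, remotes):
    """Prefer the local replica; fall back to remote replicas on failure."""
    for server in [local, *remotes]:
        try:
            return server.read(key), server.name
        except ConnectionError:
            continue
    raise ConnectionError("no replica is reachable")

replicated = {"dept:10": "ACCOUNTING"}
local = Server("local", replicated)
remote = Server("remote", replicated)

normal = read_with_failover("dept:10", local, [remote])
local.up = False  # the local server fails
degraded = read_with_failover("dept:10", local, [remote])
print(normal, degraded)
```

In normal operation the local copy answers, which keeps network traffic down. After the failure the same call is served by the remote copy.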
The Oracle Database server accesses the non-Oracle Database system using Oracle
Heterogeneous Services in conjunction with an agent. If you access the non-Oracle Database
data store using an Oracle Transparent Gateway, then the agent is a system-specific application.
For example, if you include a Sybase database in an Oracle Database distributed system, then
you need to obtain a Sybase-specific transparent gateway so that the Oracle Database in the
system can communicate with it.
Alternatively, you can use generic connectivity to access non-Oracle Database data stores so
long as the non-Oracle Database system supports the ODBC or OLE DB protocols.
Heterogeneous Services
Heterogeneous Services (HS) is an integrated component within the Oracle Database server and
the enabling technology for the current suite of Oracle Transparent Gateway products. HS
provides the common architecture and administration mechanisms for Oracle Database gateway
products and other heterogeneous access facilities. Also, it provides upwardly compatible
functionality for users of most of the earlier Oracle Transparent Gateway releases.
For each non-Oracle Database system that you access, Heterogeneous Services can use a
transparent gateway agent to interface with the specified non-Oracle Database system. The
agent is specific to the non-Oracle Database system, so each type of system requires a different
agent.
The transparent gateway agent facilitates communication between Oracle Database and non-
Oracle Database systems and uses the Heterogeneous Services component in the Oracle
Database server. The agent executes SQL and transactional requests at the non-Oracle Database
system on behalf of the Oracle Database server.
Generic Connectivity
Generic connectivity enables you to connect to non-Oracle Database data stores by using either
a Heterogeneous Services ODBC agent or a Heterogeneous Services OLE DB agent. Both are
included with your Oracle product as a standard feature. Any data source compatible with the
ODBC or OLE DB standards can be accessed using a generic connectivity agent.
The advantage to generic connectivity is that it may not be required for you to purchase and
configure a separate system-specific agent. You use an ODBC or OLE DB driver that can
interface with the agent. However, some data access features are only available with transparent
gateway agents.
Client/Server Database Architecture
A database server is the Oracle software managing a database, and a client is an application that
requests information from a server. Each computer in a network is a node that can host one or
more databases. Each node in a distributed database system can act as a client, a server, or both,
depending on the situation.
In Figure 29-2, the host for the hq database is acting as a database server when a statement is
issued against its local data (for example, the second statement in each transaction issues a
statement against the local dept table), but is acting as a client when it issues a statement against
remote data (for example, the first statement in each transaction is issued against the remote
table emp in the sales database).
A client can connect directly or indirectly to a database server. A direct connection occurs
when a client connects to a server and accesses information from a database contained on that
server. For example, if you connect to the hq database and access the dept table on this database
as in Figure 29-2, you can issue the following:
SELECT * FROM dept;
This query is direct because you are not accessing an object on a remote database.
In contrast, an indirect connection occurs when a client connects to a server and then accesses
information contained in a database on a different server. For example, if you connect to
the hq database but access the emp table on the remote sales database as in Figure 29-2, you can
issue the following:
SELECT * FROM emp@sales;
This query is indirect because the object you are accessing is not on the database to which you
are directly connected.