Unit 3
Distributed Database
Concepts
Distributed DBMS Environment
• Computer network technology + database technology
Introduction
• Distributed computing system
• Consists of several processing sites or nodes interconnected by a computer
network
• Nodes cooperate in performing certain tasks
• Partitions a large task into smaller tasks that can be solved efficiently
• Big data technologies
• Combine distributed and database technologies
• Deal with mining vast amounts of data
Distributed Database Concepts
• What constitutes a distributed database?
• Connection of database nodes over computer network
• There are multiple computers, called sites or nodes.
• These sites must be connected by an underlying network to transmit data and
commands among sites.
• Logical interrelation of the connected databases
• It is essential that the information in the various database nodes be logically related.
• Possible absence of homogeneity among connected nodes
• It is not necessary that all nodes be identical in terms of data, hardware, and software.
• Distributed database management system (DDBMS)
• Software system that manages a distributed database
Distributed Database Concepts
(cont’d.)
• Local area network
• Hubs or cables connect sites
• Long-haul or wide area network
• Telephone lines, cables, wireless, or satellite connections
• Network topology defines communication path
• Transparency
• Hiding implementation details from the end user.
• A highly transparent system offers a lot of flexibility to the end
user/application developer since it requires little or no awareness of
underlying details on their part.
Distributed Database System
A distributed database (DDB) is a collection of multiple, logically interrelated databases
distributed over a computer network.
A distributed database management system (DDBMS) is the software that manages the DDB and provides an access mechanism that makes this distribution transparent to the users.
A distributed database is a database that is spread across multiple computers or nodes in a
network.
This means that the data is not stored in a single location, but rather in multiple locations.
This can provide several advantages, such as increased availability, scalability, reliability, and fault tolerance.
• Increased availability
• Isolate faults to their site of origin
• If one node fails, the data is still available on the other nodes
• In a centralized system, failure at a single site makes the whole system unavailable to all users.
• In a distributed database, some of the data may be unreachable, but users may still be able to access other parts of the database.
• If the data in the failed site has been replicated at another site prior to the failure, then the user will not be affected at all.
• Reliability
• The system is more reliable than a centralized system: if one node fails, the system can still operate.
• Less danger of a single point of failure:
• When one of the computers fails, the workload is picked up by other workstations. Data are also distributed at multiple sites.
• Improved performance
• Data localization
• A distributed DBMS fragments the database by keeping the data closer to where it is needed most.
• Data localization reduces the contention for CPU and I/O services and simultaneously reduces access delays involved in wide area networks
• Faster data access:
• End users often work with only a locally stored subset of the company’s data
• Faster data processing:
• A distributed database system spreads out the system's workload by processing data at several sites.
• Interquery and intraquery parallelism can be achieved by executing multiple queries at different sites, or by breaking up a query into a number of
subqueries that execute in parallel.
Advantages of Distributed DBMS…
• Easier expansion via scalability
• Expansion of the system in terms of adding more data, increasing database
sizes, or adding more nodes is much easier than in centralized (non-
distributed) systems.
• Modular growth:
• New sites can be added to the network without affecting the operations of other sites.
• Cost as per need:
• With the increase in traffic from the users, we can easily scale our database by adding more
nodes to the system.
• Since these nodes are commodity hardware, they are relatively cheaper than adding more
resources to each of the nodes individually
• i.e., Horizontal scaling is cheaper than vertical scaling.
Disadvantages of DDBS
• Complexity of management and control:
• Database administrators must have the ability to coordinate database activities among different
sites
• Technological difficulty:
• Data integrity, transaction management, concurrency control, security, backup, recovery, query
optimization, access path selection, and so on, must all be addressed and resolved.
• Security:
• The probability of security lapses increases when data are located at multiple sites.
• Increased storage and infrastructure requirements:
• Multiple copies of data are required at different sites, thus requiring additional disk storage space.
• Increased training cost:
• Training costs are generally higher in a distributed model than they would be in a centralized model
• Duplicate Costs:
• Distributed databases require duplicated infrastructure to operate (physical location, environment,
personnel, software, licensing, etc.)
• Latency:
• There may be increased latency when accessing data from remote nodes.
Data Fragmentation, Replication, and Allocation
Techniques for Distributed Database Design
• Fragmentation is the process of dividing the database into subtables or subrelations (fragments) so that data can be stored in different systems.
• Fragmentation should be done in a way so that the original table can be
reconstructed from the fragments.
• Fragments
• Logical units of the database
• Horizontal fragmentation (sharding)
• Horizontal fragment or shard of a relation is a subset of the tuples in that relation
• Can be specified by condition on one or more attributes or by some other method
• Groups rows to create subsets of tuples
• Each subset has a certain logical meaning
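A horizontal fragment can be sketched in a few lines of Python (the EMPLOYEE rows and the fragmentation conditions here are made up for illustration): each shard is the subset of tuples satisfying a condition, and a union of complete, disjoint shards rebuilds the relation.

```python
# Hypothetical EMPLOYEE relation as a list of row dicts.
employees = [
    {"Ssn": "111", "Name": "Alice", "Dno": 4},
    {"Ssn": "222", "Name": "Bob",   "Dno": 5},
    {"Ssn": "333", "Name": "Carol", "Dno": 4},
]

# Horizontal fragments (shards), each defined by a condition on Dno.
frag_d4 = [t for t in employees if t["Dno"] == 4]
frag_d5 = [t for t in employees if t["Dno"] == 5]

# The conditions cover every tuple exactly once (complete and disjoint),
# so a UNION of the fragments reconstructs the original relation.
reconstructed = frag_d4 + frag_d5
assert sorted(r["Ssn"] for r in reconstructed) == \
       sorted(e["Ssn"] for e in employees)
```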
Data Fragmentation (cont’d.)
• Vertical fragmentation
• Divides a relation vertically by columns
• Keeps only certain attributes of the relation
• Include the primary key or some unique key attribute in every vertical fragment so that the full
relation can be reconstructed from the fragments.
• Complete horizontal fragmentation
• A set of horizontal fragments whose conditions C1, C2, … , Cn include all the tuples in R
• i.e., every tuple in R satisfies (C1 OR C2 OR … OR Cn)—is called a complete horizontal fragmentation of R.
• In many cases a complete horizontal fragmentation is also disjoint.
• Apply UNION operation to the fragments to reconstruct relation
• Complete vertical fragmentation
• projection lists satisfy the following two conditions:
• L1 ∪ L2 ∪ … ∪ Ln = ATTRS(R)
• Li ∩ Lj = PK(R) for any i ≠ j, where ATTRS(R) is the set of attributes of R and PK(R) is the primary key of R
• Apply OUTER UNION or FULL OUTER JOIN operation to reconstruct relation
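The reconstruction property above can be made concrete with a minimal Python sketch (the EMPLOYEE relation, attribute names, and values here are illustrative, not from any specific system): each vertical fragment repeats the primary key, so joining the fragments on that key rebuilds the full relation.

```python
# Hypothetical EMPLOYEE relation with primary key Ssn.
employees = [
    {"Ssn": "111", "Name": "Alice", "Salary": 50000, "Dno": 4},
    {"Ssn": "222", "Name": "Bob",   "Salary": 60000, "Dno": 5},
]

# Vertical fragments: each projection list keeps the primary key Ssn,
# so L1 U L2 = ATTRS(R) and L1 ∩ L2 = {Ssn} = PK(R).
frag_personal = [{"Ssn": t["Ssn"], "Name": t["Name"]} for t in employees]
frag_work = [{"Ssn": t["Ssn"], "Salary": t["Salary"], "Dno": t["Dno"]}
             for t in employees]

# Reconstruct the relation by joining the fragments on the shared key.
by_ssn = {t["Ssn"]: t for t in frag_work}
reconstructed = [{**p, **by_ssn[p["Ssn"]]} for p in frag_personal]
assert reconstructed == employees
```

Because every fragment carries the primary key, no information is lost: the join recovers each original tuple exactly once.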
Data Fragmentation (cont’d.)
• Mixed (hybrid) fragmentation
• Combination of horizontal and vertical fragmentations
• Hybrid fragmentation can be done in two alternative ways:
• At first, generate a set of horizontal fragments; then generate vertical fragments from one
or more of the horizontal fragments.
• At first, generate a set of vertical fragments; then generate horizontal fragments from one
or more of the vertical fragments.
• Fragmentation schema
• Defines a set of fragments that includes all attributes and tuples in the database
• Allocation schema
• Describes the allocation of fragments to nodes (sites) of the DDBS
• If a fragment is stored at more than one site, it is said to be replicated.
Horizontal fragmentation
• It refers to the division of a relation into subsets (fragments) of tuples
(rows).
• Each fragment is stored at a different node, and the fragments contain disjoint sets of rows. However, all rows have the same attributes (columns).
Vertical Fragmentation
• Some of the attributes are stored in one system and the rest are
stored in other systems.
• Each site may not need all columns of a table. In order to take care of
restoration, each fragment must contain the primary key field(s) in a
table.
Advantages of Fragmentation
• Horizontal:
• allows parallel processing on fragments of a relation
• allows a relation to be split so that tuples are located where they are most
frequently accessed
• Vertical:
• allows tuples to be split so that each part of the tuple is stored where it is
most frequently accessed
• tuple-id attribute allows efficient joining of vertical fragments
• allows parallel processing on a relation
Data Replication
• Data replication is the process of storing separate copies of the database at two or
more sites.
• It is a popular fault tolerance technique of distributed databases.
• A relation or fragment of a relation is replicated if it is stored redundantly in two
or more sites.
• Full replication of a relation is the case where the relation is stored at all sites.
• Fully redundant databases are those in which every site contains a copy of the
entire database.
• Replication simply copies data from one server to another so that all users can share the same data without any inconsistency.
• The result is a distributed database in which users can access data relevant to
their tasks without interfering with the work of others
Data Replication and Allocation
• Replication is useful in improving the availability of data.
• Fully replicated distributed database
• Replication of whole database at every site in distributed system
• Improves availability remarkably
• because the system can continue to operate as long as at least one site is up
• Improves performance of retrieval (read performance) for global queries
• because the results of such queries can be obtained locally from any one site;
• Update operations (write performance) can be slow (disadvantage)
• since a single logical update must be performed on every copy of the database to keep the
copies consistent.
• Nonredundant allocation (no replication)
• Each fragment is stored at exactly one site
• all fragments must be disjoint, except for the repetition of primary keys among
vertical (or mixed) fragments.
Data Replication and Allocation (cont’d.)
• Partial replication
• Some fragments are replicated, and others are not
• Defined by replication schema
• A description of the replication of fragments
• Data allocation (data distribution)
• Each fragment assigned to a particular site in the distributed system
• Choices depend on performance and availability goals of the system
• For example, if high availability is required, transactions can be submitted at any site,
and most transactions are retrieval only, a fully replicated database is a good choice.
• However, if certain transactions that access particular parts of the database are
mostly submitted at a particular site, the corresponding set of fragments can be
allocated at that site only. Data that is accessed at multiple sites can be replicated at
those sites.
• If many updates are performed, it may be useful to limit replication.
• Finding an optimal or even a good solution to distributed data allocation is a
complex optimization problem.
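A fragmentation schema plus an allocation schema can be represented as simple mappings; the sketch below (fragment and site names are hypothetical) shows how replication falls out of allocation: a fragment assigned to more than one site is replicated.

```python
# Hypothetical fragmentation schema: fragment name -> defining condition.
fragmentation_schema = {
    "EMP_D4": "EMPLOYEE where Dno = 4",
    "EMP_D5": "EMPLOYEE where Dno = 5",
}

# Allocation schema: which site(s) hold each fragment.
allocation_schema = {
    "EMP_D4": ["site1"],           # nonredundant: exactly one site
    "EMP_D5": ["site2", "site3"],  # partial replication: two copies
}

# A fragment stored at more than one site is replicated.
replicated = [f for f, sites in allocation_schema.items() if len(sites) > 1]
assert replicated == ["EMP_D5"]
```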
Example of Fragmentation, Allocation, and Replication
Centralized Database System vs. Distributed Database System
• Complexity: A centralized database system is simple; a distributed database system is complex.
• Location: A centralized database is located at one particular location; a distributed database is located at many geographical locations.
• Servers: A centralized system consists of only one server; a distributed system contains servers in many locations.
• Suitability: A centralized system is only suitable for small organizations and small-scale operations; a distributed system is suitable for large organizations.
• Data safety: A centralized system has less chance of data loss; a distributed system has a greater chance of data hacking, theft, and loss.
• Maintenance and security: In a centralized system, maintenance is easy and security is high; in a distributed system, maintenance is harder and security is lower.
• Failure: In a centralized system, failure of the server makes the whole system unavailable to all users; in a distributed system, failure of one server does not bring the whole system down.
• Load balancing: A centralized system has no load balancing; a distributed system supports load balancing.
Design issues of a Distributed System
• Heterogeneity
• Distributed systems often consist of a heterogeneous collection of hardware and software components.
• This can make it difficult to ensure that all components can interoperate and work together correctly.
• Openness
• Distributed systems should be designed to be open and extensible.
• This allows for the addition of new components and the modification of existing components without disrupting the
overall system.
• Security
• Distributed systems are vulnerable to security attacks.
• This is because they often consist of multiple components that are interconnected over a network.
• Synchronization
• Distributed systems need to be synchronized so that all components have a consistent view of the system state.
• This can be difficult to achieve in a distributed environment where components may be located in different places and
may have different clocks.
• Failure handling
• Distributed systems need to be able to handle failures of individual components.
• This is because failures are inevitable in any distributed system.
• Scalability
• Distributed systems need to be scalable so that they can handle increasing loads.
• This can be a challenge because distributed systems often consist of a large number of components that need to be
able to communicate with each other.
Design issues of a Distributed System …
• Performance
• Distributed systems need to be designed to be efficient so that they can provide good
performance.
• This can be a challenge because distributed systems often involve a lot of communication between
components.
• Concurrency
• Distributed systems need to be designed to handle concurrent access to shared resources.
• This can be a challenge because concurrent access can lead to race conditions and deadlocks.
• Load balancing
• Distributed systems need to be designed to distribute load evenly across the different
components.
• This can be a challenge because the load on the different components can vary over time.
• Fault tolerance
• Distributed systems need to be designed to be fault-tolerant so that they can continue to operate
even when some of the components fail.
• This can be achieved by using redundancy and replication.
• Availability
• Distributed systems need to be designed to be available so that they can be accessed by users
even when some of the components are unavailable.
• This can be achieved by using load balancing and fault tolerance techniques.
Parallel vs Distributed Computing
• The main difference between parallel and distributed computing is
that parallel computing allows multiple processors to execute tasks
simultaneously while distributed computing divides a single task
between multiple computers to achieve a common goal.
• Memory is a major difference between parallel and distributed
computing. In parallel computing, the computer can have a shared
memory or distributed memory. In distributed computing, each
computer has its own memory.
• In Parallel Computing there is direct communication between
processors but in distributed computing there is communication
between computers through a network
Types of Distributed Databases
• Homogeneous distributed database
• All sites have identical software
• Are aware of each other and agree to cooperate in processing user requests.
• Each site surrenders part of its autonomy in terms of right to change schemas or
software
• Appears to user as a single system
• Heterogeneous distributed database
• Different sites may use different schemas and software
• Difference in schema is a major problem for query processing
• Difference in software is a major problem for transaction processing
• Sites may not be aware of each other and may provide only limited facilities for
cooperation in transaction processing
Overview of Concurrency Control and
Recovery in Distributed Databases
• Concurrency control ensures that multiple transactions can access and update shared data in
different sites without interfering with each other.
• Recovery ensures that the database can be restored to a consistent state after a failure at one or
more sites.
• Problems specific to distributed DBMS environment
• Dealing with multiple copies of the data items
• The concurrency control method is responsible for maintaining consistency among these copies.
• The recovery method is responsible for making a copy consistent with other copies if the site on which the copy is stored fails
and recovers later.
• Failure of individual sites
• The DDBMS should continue to operate with its running sites, if possible, when one or more individual sites fail. When a site
recovers, its local database must be brought up-to-date with the rest of the sites before it rejoins the system.
• Failure of communication links
• The system must be able to deal with the failure of one or more of the communication links that connect the sites.
• Distributed commit
• Problems can arise with committing a transaction that is accessing databases stored on multiple sites if some sites fail during
the commit process.
• Distributed deadlock
• Deadlock may occur among several sites, so techniques for dealing with deadlocks must be extended to take this into account.
Distributed Concurrency Control Based on a
Distinguished Copy of a Data Item
• One copy of each data item (table, row, etc.) is designated as the Distinguished Copy.
• Locks are associated with the distinguished copy
• This copy of the data item is where all locks are applied.
• The concurrency controller should look to this copy to determine if any locks are held.
• Several variations of distinguished copy:
• Primary site technique
• All distinguished copies kept at the same site
• Primary site is designated to be the coordinator site for all database items.
• Advantage:
• It is a simple extension of the centralized approach and thus is not overly complex.
• Disadvantages:
• All locking requests are sent to a single site, possibly overloading that site and causing a system bottleneck.
• Failure of the primary site paralyzes the system, since all locking information is kept at that site.
• This can limit system reliability and availability.
• Primary site with backup site
• Locking information maintained at both sites
• In case of primary site failure, the backup site takes over as the primary site, and a new backup site is chosen.
Distributed Concurrency Control Based on a Distinguished
Copy of a Data Item
• Several variations of distinguished copy (Cont.):
• Primary copy method
• Distinguished copies may reside on different sites
• Distributes the load of lock coordination among various sites
• This method can also use backup sites to enhance reliability and availability.
• Choosing a New Coordinator Site in Case of Failure
• Whenever a coordinator site fails in any of the preceding techniques, the sites that are still
running must choose a new coordinator.
• If both the primary and the backup sites are down, a process called election can be used to
choose the new coordinator site.
• The election algorithm itself is complex, but the main idea behind the election method is:
• Any site Y that attempts to communicate with the coordinator site repeatedly and fails to do so can
assume that the coordinator is down and can start the election process by sending a message to all
running sites proposing that Y become the new coordinator.
• As soon as Y receives a majority of yes votes, Y can declare that it is the new coordinator.
• The algorithm also resolves any attempt by two or more sites to become coordinator at the
same time.
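The majority-vote idea behind the election can be sketched in Python (this simulates the votes locally and ignores the message-passing and tie-breaking details, which the real algorithm must handle):

```python
def elect(proposer, running_sites, votes):
    """Simplified majority election: 'proposer' becomes the new
    coordinator iff a majority of the running sites vote yes."""
    yes = sum(1 for site in running_sites if votes.get(site, False))
    return proposer if yes > len(running_sites) // 2 else None

# Site Y cannot reach the coordinator, so it proposes itself;
# three of the four running sites (including Y) vote yes.
running = ["Y", "A", "B", "C"]
votes = {"Y": True, "A": True, "B": True, "C": False}
assert elect("Y", running, votes) == "Y"
```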
Distributed Concurrency Control Based on Voting
• Voting method
• No distinguished copy
• Lock requests sent to all sites that contain a copy
• Each copy maintains its own lock
• If transaction that requests a lock is granted that lock by a majority of the copies, it
holds the lock on all copies
• Time-out period applies
• Results in higher message traffic among sites
• Algorithm:
1. Send a message to all nodes that maintain a replica of this item.
2. If a node can safely lock the item, then vote "Yes", otherwise, vote "No".
3. If a majority of participating nodes vote "Yes" then the lock is granted.
4. Send the results of the vote back out to all participating sites.
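The four steps above can be sketched as follows (a local simulation with an illustrative Replica class; a real implementation would send lock-request messages over the network and apply the time-out):

```python
class Replica:
    """Each copy of a data item maintains its own local lock table."""
    def __init__(self):
        self.locked = set()

    def try_lock(self, item):
        if item in self.locked:
            return False        # vote "No": item already locked locally
        self.locked.add(item)
        return True             # vote "Yes"

def request_lock(item, replicas):
    """Voting method: poll every replica; the lock is granted on all
    copies iff a majority of the copies vote yes (step 4, sending the
    result back to all sites, is elided in this local sketch)."""
    votes = [replica.try_lock(item) for replica in replicas]
    return sum(votes) > len(replicas) // 2

replicas = [Replica(), Replica(), Replica()]
replicas[2].locked.add("X")     # one copy is already locked elsewhere
assert request_lock("X", replicas) is True   # 2 of 3 votes is a majority
```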
Distributed Recovery
• The recovery process in distributed databases is quite involved.
• Difficult to determine whether a site is down without exchanging numerous
messages with other sites.
• For example, suppose that site X sends a message to site Y and expects a
response from Y but does not receive it. There are several possible
explanations:
• The message was not delivered to Y because of communication failure.
• Site Y is down and could not respond.
• Site Y is running and sent a response, but the response was not delivered.
• Without additional information or the sending of additional messages, it is
difficult to determine what actually happened.
• Another problem with distributed recovery is Distributed commit.
• When a transaction is updating data at several sites, it cannot commit until it is sure
that the effect of the transaction on every site cannot be lost
• Two-phase commit protocol often used to ensure correctness
Overview of Transaction Management in
Distributed Databases
• We need mechanisms in place to ensure multiple copies of data are kept consistent.
• Concurrency and Commit protocols must be changed to account for replicated data.
• In a centralized DB we had the notion of a commit point. In distributed DB, we need to
consider committing a transaction that changes data on multiple nodes.
• Distributed Commit Protocol such as Two-Phase Commit (2PC) can be used.
• Global transaction manager
• Additional component is introduced for supporting distributed transactions
• The site where the transaction originated can temporarily assume the role of global transaction
manager
• Coordinates execution of database operations with transaction managers at multiple sites
• Passes database operations and associated information to the concurrency controller
• Controller responsible for acquisition and release of locks
• If the transaction requires access to a locked resource, it is blocked until the lock is acquired.
• Once the lock is acquired, the operation is sent to the runtime processor, which handles the actual execution of the
database operation.
• Once the operation is completed, locks are released, and the transaction manager is updated with the result of the
operation.
Commit Protocols
• Two-phase Commit Protocol
• The two-phase commit protocol (2PC) is a common way to ensure that all
participating databases commit or abort a distributed transaction together.
• Phase 1(Prepare phase):
• Send a message to all nodes: "Can you commit Transaction X?" All nodes that can commit this transaction reply with
"Yes".
• Phase 2 (Commit phase):
• If all nodes reply with "Yes", then send a "Commit" message to all nodes.
• If any node replies "No", then the transaction is aborted.
• 2PC is an example of a synchronous replication protocol.
• Coordinator (global recovery manager) maintains information needed for recovery
• In addition to local recovery managers
• 2PC has two main drawbacks:
• Blocking:
• If the coordinator fails, all participating databases will block until the coordinator recovers.
• This can cause performance degradation, especially if participants are holding locks to shared resources.
• Inability to handle network partitions:
• If the network between two participating databases is partitioned, 2PC cannot guarantee that
all databases will commit or abort the transaction together. This can lead to data
inconsistency.
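The two phases can be sketched in Python (a single-process simulation with illustrative Participant objects; a real 2PC implementation exchanges messages, force-writes log records, and handles time-outs at each step):

```python
class Participant:
    def __init__(self, can_commit=True):
        self.can_commit = can_commit
        self.state = {}

    def prepare(self, txn):
        return self.can_commit          # vote "Yes" or "No"

    def finish(self, txn, decision):
        self.state[txn] = decision      # apply commit or abort

def two_phase_commit(coordinator_log, participants, txn):
    # Phase 1 (prepare): ask every participant whether it can commit.
    votes = [p.prepare(txn) for p in participants]
    decision = "commit" if all(votes) else "abort"
    coordinator_log.append((txn, decision))   # record decision first
    # Phase 2 (commit/abort): broadcast the decision to every participant.
    for p in participants:
        p.finish(txn, decision)
    return decision

# One participant votes "No", so the whole transaction aborts everywhere.
parts = [Participant(), Participant(can_commit=False)]
assert two_phase_commit([], parts, "T1") == "abort"
assert all(p.state["T1"] == "abort" for p in parts)
```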
Commit Protocols (Contd.)
• Three-phase Commit Protocol
• Divides second commit phase into two subphases
• Prepare-to-commit phase
• This phase is used to communicate result of the prepare phase to all participants.
• If all participants vote yes, then the coordinator instructs them to move into the prepare-to-commit state
• Commit subphase same as two-phase commit counterpart
• This extra phase in 3PC helps to prevent blocking and data inconsistency.
• If the coordinator fails, the participating databases can abort the transaction without blocking.
• If the coordinator crashes during this subphase, another participant can see the transaction through to completion. It can simply ask a recovering participant whether it received a prepare-to-commit message; if it did not, it can safely assume an abort.
• If the network between two participating databases is partitioned, the participating databases
can still commit or abort the transaction together.
• Also, by limiting the time required for a transaction to commit or abort to a maximum time-out period, the
protocol ensures that a transaction attempting to commit via 3PC releases locks on time-out.
• When a participant receives a pre-commit message, it knows that the rest of the participants have voted to
commit. If a pre-commit message has not been received, then the participant will abort and release all
locks.
Query Processing and Optimization in
Distributed Databases
• Stages of a distributed database query
• Query mapping
• Process of translating a query that is submitted to a distributed database system
into a set of subqueries that can be executed on the participating databases
• The translation (into an algebraic query) is done by referring to global conceptual schema
• Localization
• Maps the distributed query to separate queries on individual fragments using data distribution and
replication information.
• Global query optimization
• Strategy selected from list of candidates that is closest to optimal
• A list of candidate queries can be obtained by permuting the ordering of operations within a fragment
query generated by the previous stage.
• Time is the preferred unit for measuring cost. The total cost is a weighted combination of costs such as
CPU cost, I/O costs, and communication costs.
• Local query optimization
• Common to all sites in the DDB
• The techniques are similar to those used in centralized systems.
Query Processing and Optimization in
Distributed Databases (cont’d.)
• Data transfer costs of distributed query processing
• Cost of transferring intermediate and final result files (from one site to other)
• Optimization criterion: reducing amount of data transfer
Example:
Q: For each employee, retrieve the employee's name and the
name of the department for which the employee works.
This can be stated as follows in the relational algebra:
Q: π Fname,Lname,Dname (EMPLOYEE ⋈ Dno=Dnumber DEPARTMENT)
Example (contd.)
• The result of this query will include 10,000 records, assuming that every employee is related to a
department.
• Suppose that each record in the query result is 40 bytes long.
• The query is submitted at a distinct site, site 3, which is called the result site because the query result is needed there.
• Neither the EMPLOYEE relation (stored at site 1; 10,000 records, 1,000,000 bytes) nor the DEPARTMENT relation (stored at site 2; 3,500 bytes) resides at site 3.
• There are three simple strategies for executing this distributed query:
1. Transfer both the EMPLOYEE and the DEPARTMENT relations to the result site and perform the join at site 3. In this
case, a total of 1,000,000 + 3,500 = 1,003,500 bytes must be transferred.
2. Transfer the EMPLOYEE relation to site 2, execute the join at site 2, and send the result to site 3. The size of the
query result is 40 * 10,000 = 400,000 bytes, so 400,000 + 1,000,000 = 1,400,000 bytes must be transferred.
3. Transfer the DEPARTMENT relation to site 1, execute the join at site 1, and send the result to site 3. In this case,
400,000 + 3,500 = 403,500 bytes must be transferred.
• If minimizing the amount of data transfer is our optimization criterion, we should choose strategy 3.
• However, suppose that the result site is site 2; then we have two simple strategies:
1. Transfer the EMPLOYEE relation to site 2, execute the query, and present the result to the user at site 2. Here, the
same number of bytes—1,000,000— must be transferred for query Q.
2. Transfer the DEPARTMENT relation to site 1, execute the query at site 1, and send the result back to site 2. In this
case 400,000 + 3,500 = 403,500 bytes must be transferred for query Q.
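The byte counts in the strategies above can be tallied directly; this short Python sketch uses the sizes given in the example (result site = site 3) and confirms that strategy 3 minimizes data transfer:

```python
# Sizes from the example: EMPLOYEE = 1,000,000 bytes,
# DEPARTMENT = 3,500 bytes, result = 10,000 records x 40 bytes.
EMP, DEPT, RESULT = 1_000_000, 3_500, 40 * 10_000

# Bytes transferred under each strategy when the result site is site 3.
strategies = {
    "1: ship both relations to site 3": EMP + DEPT,
    "2: ship EMPLOYEE to site 2, result to site 3": EMP + RESULT,
    "3: ship DEPARTMENT to site 1, result to site 3": DEPT + RESULT,
}

best = min(strategies, key=strategies.get)
assert strategies[best] == 403_500 and best.startswith("3")
```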
Query Processing and Optimization in Distributed
Databases (cont’d.)
• Distributed query processing using semijoin
• Reduces the number of tuples in a relation before transferring it to another site
• Send the joining column of one relation R to one site where the other relation S
is located; this column is then joined with S
• Join attributes and result attributes shipped back to original site and joined with
R
• Efficient solution to minimizing data transfer
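The semijoin steps above can be sketched in Python (the relations and values are illustrative, following the EMPLOYEE/DEPARTMENT example; in a real DDBMS only the shipped columns and reduced tuples would cross the network):

```python
department = [{"Dnumber": 4, "Dname": "Research"},
              {"Dnumber": 5, "Dname": "Admin"}]
employee = [{"Ssn": "111", "Fname": "Alice", "Lname": "Ng",  "Dno": 4},
            {"Ssn": "222", "Fname": "Bob",   "Lname": "Lee", "Dno": 7}]

# Step 1: ship only the joining column of DEPARTMENT to the EMPLOYEE site.
join_column = {d["Dnumber"] for d in department}

# Step 2: reduce EMPLOYEE to the tuples that will actually join.
reduced = [e for e in employee if e["Dno"] in join_column]

# Step 3: ship the reduced tuples back and complete the join there.
dname = {d["Dnumber"]: d["Dname"] for d in department}
result = [{"Fname": e["Fname"], "Lname": e["Lname"], "Dname": dname[e["Dno"]]}
          for e in reduced]
assert result == [{"Fname": "Alice", "Lname": "Ng", "Dname": "Research"}]
```

Only Alice's tuple survives the reduction, so far fewer bytes cross the network than shipping either whole relation.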
Query Processing and Optimization in
Distributed Databases (cont’d.)
• Query and update decomposition
• User can specify a query as if the DBMS were centralized
• If full distribution, fragmentation, and replication transparency are supported
• Query decomposition module
• Breaks up a query into subqueries that can be executed at the individual sites
• Strategy for combining results must be generated
• To determine which replicas include the data items referenced in a query, the DDBMS refers to the fragmentation, replication, and distribution information stored in the DDBMS catalog.
• Catalog stores attribute list and/or guard condition
• Guard is basically a selection condition that specifies which tuples exist in the fragment;
• it is called a guard because only tuples that satisfy this condition are permitted to be stored in the
fragment.
Types of Distributed Database Systems
• Factors that influence types of DDBMSs
• Degree of homogeneity of DDBMS software
• Homogeneous
• The sites use very similar software.
• The sites use identical DBMS or DBMS from the same vendor.
• Each site is aware of all other sites and cooperates with other sites to process user requests.
• The database is accessed through a single interface as if it is a single database.
• Heterogeneous
• Different sites use dissimilar schemas and software.
• Query processing is complex due to dissimilar schemas.
• Transaction processing is complex due to dissimilar software.
• A site may not be aware of other sites and so there is limited co-operation in processing user requests.
• Degree of local autonomy
• No local autonomy
• There is no provision for the local site to function as a standalone DBMS
• On the other hand, if direct access by local transactions to a server is permitted, the system has some degree of local autonomy
• Multidatabase (peer to peer) system has full local autonomy
• Federated database system (FDBS)
• Global view or schema of the federation of databases is shared by the applications
Classification of Distributed Databases
The external schema is the view of the data that is seen by the
user. It is defined by the user's application.
Figure 23.9 The five-level schema architecture in a federated database system (FDBS)
An Overview of Three-Tier Client/Server
Architecture
• Distributed database applications are being developed in the context of client/server architectures.