Unit 3

The document discusses distributed databases, which consist of multiple interconnected database nodes that work together over a network. It covers concepts such as data fragmentation, replication, and allocation, as well as the advantages and disadvantages of distributed database systems. Key benefits include improved availability, reliability, and performance, while challenges include increased complexity and management difficulties.

Uploaded by

Akkal Bista
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd

Unit 3

Distributed Database Concepts
Distributed DBMS Environment

[Figure: the distributed DBMS environment lies at the intersection of database technology and computer network technology]
Introduction
• Distributed computing system
• Consists of several processing sites or nodes interconnected by a computer
network
• Nodes cooperate in performing certain tasks
• Partitions a large task into smaller tasks for efficient solving
• Big data technologies
• Combine distributed and database technologies
• Deal with mining vast amounts of data
Distributed Database Concepts
• What constitutes a distributed database?
• Connection of database nodes over computer network
• There are multiple computers, called sites or nodes.
• These sites must be connected by an underlying network to transmit data and
commands among sites.
• Logical interrelation of the connected databases
• It is essential that the information in the various database nodes be logically related.
• Possible absence of homogeneity among connected nodes
• It is not necessary that all nodes be identical in terms of data, hardware, and software.
• Distributed database management system (DDBMS)
• Software system that manages a distributed database
Distributed Database Concepts (cont’d.)
• Local area network
• Hubs or cables connect sites
• Long-haul or wide area network
• Telephone lines, cables, wireless, or satellite connections
• Network topology defines communication path
• Transparency
• Hiding implementation details from the end user.
• A highly transparent system offers a lot of flexibility to the end
user/application developer since it requires little or no awareness of
underlying details on their part.
Distributed Database System
• A distributed database (DDB) is a collection of multiple, logically interrelated databases distributed over a computer network.
• A distributed database management system (D-DBMS) is the software that manages the DDB and provides an access mechanism that makes this distribution transparent to the users.
• A distributed database is a database that is spread across multiple computers or nodes in a network.
• This means that the data is not stored in a single location, but rather in multiple locations.
• This can provide several advantages, such as increased availability, scalability, reliability, and fault tolerance.
• Distributed database system (DDBS) = DDB + D-DBMS
Distributed Database System
• A distributed database management system is the software that
manages the DDB and provides an access mechanism that makes the
distribution transparent to the users.
• Distributed database system consists of loosely coupled sites that
share no physical component
• Database systems that run on each site are independent of each
other
• Transactions may access data at one or more sites
Characteristics of DDBS
• A collection of logically related shared data
• The data is split into a number of fragments
• Fragments may be replicated
• Fragments/replicas are allocated to sites
• The sites are linked by a communications network
• The data at each site is under the control of a DBMS
• The DBMS at each site can handle local applications autonomously
• Each DBMS participates in at least one global application
Transparency
• Types of transparency
• Data organization transparency (also known as distribution or network transparency).
• This refers to freedom for the user from the operational details of the network and the placement of the data in the
distributed system.
• Location transparency
The command used to perform a task is independent of the location of the data and the location of the node.
• Naming transparency
• This implies that once a name is associated with an object, the named objects can be accessed unambiguously without additional
specification as to where the data is located.
• Replication transparency
• Copies of the same data objects may be stored at multiple sites for better availability, performance, and reliability.
• Replication transparency makes the user unaware of the existence of these copies.
• Fragmentation transparency
• Fragmentation transparency makes the user unaware of the existence of fragments.
• Horizontal fragmentation
• Vertical fragmentation
• Design transparency
• freedom from knowing how the distributed database is designed.
• Execution transparency
• freedom from knowing where a transaction executes.
Distributed Databases

Figure 23.1 Data distribution and replication among distributed databases


Availability and Reliability
• Availability
• Probability that the system is continuously available during a time interval
• Reliability
• Probability that the system is running (not down) at a certain time point
• Both directly related to faults, errors, and failures
• A failure can be described as a deviation of a system’s behavior from that
which is specified in order to ensure correct execution of operations.
• Errors constitute that subset of system states that causes the failure.
• A fault is the cause of an error
• Fault-tolerant approaches aim to prevent faults from leading to system failures
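To make the availability benefit of replication concrete, here is a toy calculation, assuming node failures are independent: if copies of an item sit on nodes with availabilities a1..an, the item stays readable as long as at least one copy is up, so the combined availability is 1 − (1−a1)(1−a2)…(1−an). The 95% figure below is illustrative, not from the text.

```python
def combined_availability(node_availabilities):
    """Probability that at least one replica is reachable,
    assuming independent node failures."""
    unavailable = 1.0
    for a in node_availabilities:
        unavailable *= (1.0 - a)  # all copies down simultaneously
    return 1.0 - unavailable

# Three replicas on nodes that are each up 95% of the time:
print(combined_availability([0.95, 0.95, 0.95]))  # ≈ 0.999875
```

Three mediocre nodes together give "four nines" of availability, which is why replication improves availability so markedly.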
Scalability and Partition Tolerance
• Horizontal scalability
• Expanding the number of nodes in a distributed system.
• As nodes are added to the system, it should be possible to distribute some of
the data and processing loads from existing nodes to the new nodes.
• Vertical scalability
• Expanding capacity of the individual nodes in the system, such as expanding
the storage capacity or the processing power of a node.
• Partition tolerance
• System should have the capacity to continue operating while the network is
partitioned
Autonomy
• Autonomy determines the extent to which individual nodes can operate
independently.
• A high degree of autonomy is desirable for increased flexibility and
customized maintenance of an individual node.
• Design autonomy
• Independence of data model usage and transaction management techniques
among nodes
• Communication autonomy
• Determines the extent to which each node can decide on sharing information
with other nodes
• Execution autonomy
• Independence of users to act as they please
Advantages of Distributed Databases
• Improved ease and flexibility of application development
• Development at geographically dispersed sites

• Increased availability
• Isolate faults to their site of origin
• If one node fails, the data is still available on the other nodes
• In a centralized system, failure at a single site makes the whole system unavailable to all users.
• In a distributed database, some of the data may be unreachable, but users may still be able to access other parts of the database.
• If the data in the failed site has been replicated at another site prior to the failure, then the user will not be affected at all.

• Reliability
• The system is more reliable than a centralized system, as if one node fails, the system can still operate.
• Less danger of a single-point failure:
• When one of the computers fails, the workload is picked up by other workstations. Data are also distributed at multiple sites.

• Improved performance
• Data localization
• A distributed DBMS fragments the database by keeping the data closer to where it is needed most.
• Data localization reduces the contention for CPU and I/O services and simultaneously reduces access delays involved in wide area networks
• Faster data access:
• End users often work with only a locally stored subset of the company’s data
• Faster data processing:
• A distributed database system spreads out the system’s workload by processing data at several sites.
• Interquery and intraquery parallelism can be achieved by executing multiple queries at different sites, or by breaking up a query into a number of
subqueries that execute in parallel.
Advantages of Distributed DBMS…
• Easier expansion via scalability
• Expansion of the system in terms of adding more data, increasing database sizes, or adding more nodes is much easier than in centralized (non-distributed) systems.
• Modular growth:
• New sites can be added to the network without affecting the operations of other sites.
• Cost as per need:
• With the increase in traffic from the users, we can easily scale our database by adding more
nodes to the system.
• Since these nodes are commodity hardware, they are relatively cheaper than adding more
resources to each of the nodes individually
• i.e., Horizontal scaling is cheaper than vertical scaling.
Disadvantages of DDBS
• Complexity of management and control:
• Database administrators must have the ability to coordinate database activities among different
sites
• Technological difficulty:
• Data integrity, transaction management, concurrency control, security, backup, recovery, query
optimization, access path selection, and so on, must all be addressed and resolved.
• Security:
• The probability of security lapses increases when data are located at multiple sites.
• Increased storage and infrastructure requirements:
• Multiple copies of data are required at different sites, thus requiring additional disk storage space.
• Increased training cost:
• Training costs are generally higher in a distributed model than they would be in a centralized model
• Duplicate Costs:
• Distributed databases require duplicated infrastructure to operate (physical location, environment,
personnel, software, licensing, etc.)
• Latency:
• There may be increased latency when accessing data from remote nodes.
Data Fragmentation, Replication, and Allocation
Techniques for Distributed Database Design
• Fragmentation is the process of dividing the whole database into subtables or subrelations (fragments) so that data can be stored in different systems.
• Fragmentation should be done in a way so that the original table can be
reconstructed from the fragments.
• Fragments
• Logical units of the database
• Horizontal fragmentation (sharding)
• Horizontal fragment or shard of a relation is a subset of the tuples in that relation
• Can be specified by condition on one or more attributes or by some other method
• Groups rows to create subsets of tuples
• Each subset has a certain logical meaning
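The bullets above can be sketched in a few lines of Python. This is a minimal illustration, not a DDBMS implementation: the EMPLOYEE tuples and department numbers are invented, and each fragment is just the subset of rows satisfying a condition on the Dno attribute.

```python
# Illustrative EMPLOYEE relation (tuples and values are assumptions).
employees = [
    {"Ssn": "111", "Lname": "Smith",  "Dno": 4},
    {"Ssn": "222", "Lname": "Wong",   "Dno": 5},
    {"Ssn": "333", "Lname": "Zelaya", "Dno": 4},
    {"Ssn": "444", "Lname": "Borg",   "Dno": 1},
]

def horizontal_fragment(relation, condition):
    """A horizontal fragment (shard) is the subset of tuples
    satisfying a condition on one or more attributes."""
    return [t for t in relation if condition(t)]

frag_d4 = horizontal_fragment(employees, lambda t: t["Dno"] == 4)
frag_d5 = horizontal_fragment(employees, lambda t: t["Dno"] == 5)
frag_d1 = horizontal_fragment(employees, lambda t: t["Dno"] == 1)

# The conditions cover every tuple and are mutually exclusive, so this
# fragmentation is complete and disjoint: UNION reconstructs the relation.
assert sorted(frag_d4 + frag_d5 + frag_d1, key=lambda t: t["Ssn"]) == employees
```

Each fragment has a clear logical meaning ("the employees of department 4") and would typically be stored at the site of that department.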
Data Fragmentation (cont’d.)
• Vertical fragmentation
• Divides a relation vertically by columns
• Keeps only certain attributes of the relation
• Include the primary key or some unique key attribute in every vertical fragment so that the full
relation can be reconstructed from the fragments.
• Complete horizontal fragmentation
• A set of horizontal fragments whose conditions C1, C2, … , Cn include all the tuples in R
• i.e., every tuple in R satisfies (C1 OR C2 OR … OR Cn)—is called a complete horizontal fragmentation of R.
• In many cases a complete horizontal fragmentation is also disjoint.
• Apply UNION operation to the fragments to reconstruct relation
• Complete vertical fragmentation
• projection lists satisfy the following two conditions:
• L1 ∪ L2 ∪ … ∪ Ln = ATTRS(R)
• Li ∩ Lj = PK(R) for any i ≠ j, where ATTRS(R) is the set of attributes of R and PK(R) is the primary key of R
• Apply OUTER UNION or FULL OUTER JOIN operation to reconstruct relation
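The two conditions on the projection lists can be checked mechanically. Below is a small sketch, with invented attributes and values, of a complete vertical fragmentation of an EMPLOYEE relation into two fragments that share only the primary key Ssn, reconstructed by joining on that key.

```python
# Illustrative EMPLOYEE relation (attribute values are assumptions).
employees = [
    {"Ssn": "111", "Lname": "Smith", "Salary": 30000, "Dno": 4},
    {"Ssn": "222", "Lname": "Wong",  "Salary": 40000, "Dno": 5},
]

L1 = ["Ssn", "Lname", "Dno"]   # personal/assignment data, e.g. at site 1
L2 = ["Ssn", "Salary"]         # payroll data, e.g. at site 2

# Verify the two conditions for a complete vertical fragmentation:
assert set(L1) | set(L2) == set(employees[0])   # L1 ∪ L2 = ATTRS(R)
assert set(L1) & set(L2) == {"Ssn"}             # Li ∩ Lj = PK(R)

def project(relation, attrs):
    """Vertical fragment: keep only the listed attributes of each tuple."""
    return [{a: t[a] for a in attrs} for t in relation]

frag1, frag2 = project(employees, L1), project(employees, L2)

def join_on_pk(f1, f2, pk="Ssn"):
    """Reconstruct the relation by joining fragments on the primary key."""
    index = {t[pk]: t for t in f2}
    return [{**t, **index[t[pk]]} for t in f1]

assert join_on_pk(frag1, frag2) == employees
```

Because both fragments carry the primary key, no information is lost and the original relation is recoverable exactly.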
Data Fragmentation (cont’d.)
• Mixed (hybrid) fragmentation
• Combination of horizontal and vertical fragmentations
• Hybrid fragmentation can be done in two alternative ways:
• At first, generate a set of horizontal fragments; then generate vertical fragments from one
or more of the horizontal fragments.
• At first, generate a set of vertical fragments; then generate horizontal fragments from one
or more of the vertical fragments.

• Fragmentation schema
• Defines a set of fragments that includes all attributes and tuples in the database
• Allocation schema
• Describes the allocation of fragments to nodes (sites) of the DDBS
• If a fragment is stored at more than one site, it is said to be replicated.
Horizontal fragmentation
• It refers to the division of a relation into subsets (fragments) of tuples
(rows).
• Each fragment is stored at a different node, and each fragment has
unique rows. However, the unique rows all have the same attributes
(columns).
Vertical Fragmentation
• Some of the attributes are stored in one system and the rest are
stored in other systems.
• Each site may not need all columns of a table. In order to take care of
restoration, each fragment must contain the primary key field(s) in a
table.
Advantages of Fragmentation
• Horizontal:
• allows parallel processing on fragments of a relation
• allows a relation to be split so that tuples are located where they are most
frequently accessed
• Vertical:
• allows tuples to be split so that each part of the tuple is stored where it is
most frequently accessed
• tuple-id attribute allows efficient joining of vertical fragments
• allows parallel processing on a relation
Data Replication
• Data replication is the process of storing separate copies of the database at two or
more sites.
• It is a popular fault tolerance technique of distributed databases.
• A relation or fragment of a relation is replicated if it is stored redundantly in two
or more sites.
• Full replication of a relation is the case where the relation is stored at all sites.
• Fully redundant databases are those in which every site contains a copy of the
entire database.
• It is simply copying data from a database on one server to another server so that all users can share the same data without any inconsistency.
• The result is a distributed database in which users can access data relevant to
their tasks without interfering with the work of others
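The read/write trade-off of replication can be sketched in a toy model, with invented site names and data: every logical write must touch all copies (which is why updates slow down under full replication), while a read is served entirely from the nearest copy.

```python
# Three fully replicated copies of the same item (values are assumptions).
replicas = {
    "site_A": {"item1": 10},
    "site_B": {"item1": 10},
    "site_C": {"item1": 10},
}

def replicated_write(key, value):
    """A single logical update must be applied to every copy
    to keep the copies consistent."""
    for db in replicas.values():
        db[key] = value

def local_read(site, key):
    """A read can be served entirely from the local copy."""
    return replicas[site][key]

replicated_write("item1", 42)
assert all(local_read(s, "item1") == 42 for s in replicas)
```

One write became three physical updates, but any site can now answer a read of item1 without touching the network.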
Data Replication and Allocation
• Replication is useful in improving the availability of data.
• Fully replicated distributed database
• Replication of whole database at every site in distributed system
• Improves availability remarkably
• because the system can continue to operate as long as at least one site is up
• Improves performance of retrieval (read performance) for global queries
• because the results of such queries can be obtained locally from any one site;
• Update operations (write performance) can be slow (disadvantage)
• since a single logical update must be performed on every copy of the database to keep the
copies consistent.
• Nonredundant allocation (no replication)
• Each fragment is stored at exactly one site
• all fragments must be disjoint, except for the repetition of primary keys among
vertical (or mixed) fragments.
Data Replication and Allocation (cont’d.)
• Partial replication
• Some fragments are replicated, and others are not
• Defined by replication schema
• A description of the replication of fragments
• Data allocation (data distribution)
• Each fragment assigned to a particular site in the distributed system
• Choices depend on performance and availability goals of the system
• For example, if high availability is required, transactions can be submitted at any site,
and most transactions are retrieval only, a fully replicated database is a good choice.
• However, if certain transactions that access particular parts of the database are
mostly submitted at a particular site, the corresponding set of fragments can be
allocated at that site only. Data that is accessed at multiple sites can be replicated at
those sites.
• If many updates are performed, it may be useful to limit replication.
• Finding an optimal or even a good solution to distributed data allocation is a
complex optimization problem.
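One very naive heuristic for the nonredundant case, sketched below with invented fragment names and access counts, is to place each fragment at the site that accesses it most often. A real allocator must also weigh storage, update traffic, and availability, which is what makes the general problem a hard optimization problem.

```python
# Access frequency of each fragment per site (numbers are assumptions).
access_counts = {
    "EMP_D4":   {"site1": 120, "site2": 10, "site3": 5},
    "EMP_D5":   {"site1": 8,   "site2": 90, "site3": 12},
    "PROJECTS": {"site1": 30,  "site2": 25, "site3": 70},
}

# Greedy nonredundant allocation: each fragment goes to exactly one site,
# namely the site that reads/writes it most frequently.
allocation_schema = {
    frag: max(counts, key=counts.get)
    for frag, counts in access_counts.items()
}

print(allocation_schema)
# {'EMP_D4': 'site1', 'EMP_D5': 'site2', 'PROJECTS': 'site3'}
```

Replication would then be layered on top: fragments with significant access counts at more than one site are candidates for copies at those sites.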
Example of Fragmentation, Allocation, and Replication

• Company with three computer sites, one for each department
• Expect frequent access by employees working in the department and projects controlled by that department
• [Figures omitted: example fragmentation of the data among the sites]
Data Replication: Advantages
• Reliability: In case of failure of any site, the database system continues to work since a copy is
available at another site(s).
• Parallelism: queries on r may be processed by several nodes in parallel.
• Reduction in Network Load: Since local copies of data are available, query processing can be
done with reduced network usage, particularly during prime hours. Data updating can be done
at non-prime hours.
• Quicker Response: Availability of local copies of data ensures quick query processing and
consequently quick response time.
• Simpler Transactions: Transactions require a smaller number of joins of tables located at
different sites and minimal coordination across the network. Thus, they become simpler in
nature.
Data Replication: Disadvantages
• Increased Storage Requirements: Maintaining multiple copies of data is associated
with increased storage costs. The storage space required is in multiples of the
storage required for a centralized system.
• Increased Cost and Complexity of Data Updating: Each time a data item is
updated, the update needs to be reflected in all the copies of the data at the
different sites. This requires complex synchronization techniques and protocols.
• Undesirable Application – Database coupling: If complex update mechanisms are
not used, removing data inconsistency requires complex co-ordination at
application level. This results in undesirable application – database coupling.

Centralized Database System vs Distributed Database System
• A centralized database system is a simple type; a distributed database system is a complex type.
• Centralized: located at one particular location. Distributed: located in many geographical locations.
• Centralized: consists of only one server. Distributed: contains servers in many locations.
• Centralized: suitable only for small organizations and small-scale operations. Distributed: suitable for large organizations.
• Centralized: less chance of data loss. Distributed: more chance of data hacking, theft, and loss.
• Centralized: maintenance is easy and security is high. Distributed: maintenance is harder and security is lower compared to a centralized database system.
• Centralized: failure of the server makes the whole system go down. Distributed: failure of one server doesn’t make the whole system go down.
• Centralized: there is no feature of load balancing. Distributed: there is a feature of load balancing.
• Centralized: data traffic rate is high. Distributed: data traffic rate is low.
Design Issues of a Distributed System
• Heterogeneity
• Distributed systems often consist of a heterogeneous collection of hardware and software components.
• This can make it difficult to ensure that all components can interoperate and work together correctly.
• Openness
• Distributed systems should be designed to be open and extensible.
• This allows for the addition of new components and the modification of existing components without disrupting the
overall system.
• Security
• Distributed systems are vulnerable to security attacks.
• This is because they often consist of multiple components that are interconnected over a network.
• Synchronization
• Distributed systems need to be synchronized so that all components have a consistent view of the system state.
• This can be difficult to achieve in a distributed environment where components may be located in different places and
may have different clocks.
• Failure handling
• Distributed systems need to be able to handle failures of individual components.
• This is because failures are inevitable in any distributed system.
• Scalability
• Distributed systems need to be scalable so that they can handle increasing loads.
• This can be a challenge because distributed systems often consist of a large number of components that need to be
able to communicate with each other.
Design issues of a Distributed System …
• Performance
• Distributed systems need to be designed to be efficient so that they can provide good
performance.
• This can be a challenge because distributed systems often involve a lot of communication between
components.
• Concurrency
• Distributed systems need to be designed to handle concurrent access to shared resources.
• This can be a challenge because concurrent access can lead to race conditions and deadlocks.
• Load balancing
• Distributed systems need to be designed to distribute load evenly across the different
components.
• This can be a challenge because the load on the different components can vary over time.
• Fault tolerance
• Distributed systems need to be designed to be fault-tolerant so that they can continue to operate
even when some of the components fail.
• This can be achieved by using redundancy and replication.
• Availability
• Distributed systems need to be designed to be available so that they can be accessed by users
even when some of the components are unavailable.
• This can be achieved by using load balancing and fault tolerance techniques.
Parallel vs Distributed Computing
• The main difference between parallel and distributed computing is
that parallel computing allows multiple processors to execute tasks
simultaneously while distributed computing divides a single task
between multiple computers to achieve a common goal.
• Memory is a major difference between parallel and distributed
computing. In parallel computing, the computer can have a shared
memory or distributed memory. In distributed computing, each
computer has its own memory.
• In Parallel Computing there is direct communication between
processors but in distributed computing there is communication
between computers through a network
Types of Distributed Databases
• Homogeneous distributed database
• All sites have identical software
• Are aware of each other and agree to cooperate in processing user requests.
• Each site surrenders part of its autonomy in terms of right to change schemas or
software
• Appears to user as a single system
• Heterogeneous distributed database
• Different sites may use different schemas and software
• Difference in schema is a major problem for query processing
• Difference in software is a major problem for transaction processing
• Sites may not be aware of each other and may provide only limited facilities for
cooperation in transaction processing
Overview of Concurrency Control and
Recovery in Distributed Databases
• Concurrency control ensures that multiple transactions can access and update shared data in
different sites without interfering with each other.
• Recovery ensures that the database can be restored to a consistent state after a failure at one or
more sites.
• Problems specific to distributed DBMS environment
• Dealing with multiple copies of the data items
• The concurrency control method is responsible for maintaining consistency among these copies.
• The recovery method is responsible for making a copy consistent with other copies if the site on which the copy is stored fails
and recovers later.
• Failure of individual sites
• The DDBMS should continue to operate with its running sites, if possible, when one or more individual sites fail. When a site
recovers, its local database must be brought up-to-date with the rest of the sites before it rejoins the system.
• Failure of communication links
• The system must be able to deal with the failure of one or more of the communication links that connect the sites.
• Distributed commit
• Problems can arise with committing a transaction that is accessing databases stored on multiple sites if some sites fail during
the commit process.
• Distributed deadlock
• Deadlock may occur among several sites, so techniques for dealing with deadlocks must be extended to take this into account.
Distributed Concurrency Control Based on a
Distinguished Copy of a Data Item
• One copy of each data item (table, row, etc.) is designated as the Distinguished Copy.
• Locks are associated with the distinguished copy
• This copy of the data item is where all locks are applied.
• The concurrency controller should look to this copy to determine if any locks are held.
• Several variations of distinguished copy:
• Primary site technique
• All distinguished copies kept at the same site
• Primary site is designated to be the coordinator site for all database items.
• Advantage:
• It is a simple extension of the centralized approach and thus is not overly complex.
• Disadvantages:
• All locking requests are sent to a single site, possibly overloading that site and causing a system bottleneck.
• Failure of the primary site paralyzes the system, since all locking information is kept at that site.
• This can limit system reliability and availability.
• Primary site with backup site
• Locking information maintained at both sites
• In case of primary site failure, the backup site takes over as the primary site, and a new backup site is chosen.
Distributed Concurrency Control Based on a Distinguished
Copy of a Data Item
• Several variations of distinguished copy (Cont.):
• Primary copy method
• Distinguished copies may reside on different sites
• Distributes the load of lock coordination among various sites
• This method can also use backup sites to enhance reliability and availability.
• Choosing a New Coordinator Site in Case of Failure
• Whenever a coordinator site fails in any of the preceding techniques, the sites that are still
running must choose a new coordinator.
• If both the primary and the backup sites are down, a process called election can be used to
choose the new coordinator site.
• The election algorithm itself is complex, but the main idea behind the election method is:
• Any site Y that attempts to communicate with the coordinator site repeatedly and fails to do so can
assume that the coordinator is down and can start the election process by sending a message to all
running sites proposing that Y become the new coordinator.
• As soon as Y receives a majority of yes votes, Y can declare that it is the new coordinator.
• The algorithm also resolves any attempt by two or more sites to become coordinator at the
same time.
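The majority-vote idea behind the election can be sketched in a few lines. This is a deliberate simplification: real election algorithms exchange messages asynchronously and break ties among competing candidates, while here voting is modeled as a plain dictionary and sites/votes are invented for illustration.

```python
def run_election(candidate, running_sites, votes):
    """Candidate becomes the new coordinator only if a majority
    of the running sites vote yes; votes maps site -> True/False."""
    yes = sum(1 for site in running_sites if votes.get(site, False))
    majority = len(running_sites) // 2 + 1
    return candidate if yes >= majority else None

# Site Y failed to reach the coordinator and proposes itself:
sites = ["Y", "Z", "W", "V"]                       # sites still running
votes = {"Y": True, "Z": True, "W": True, "V": False}
assert run_election("Y", sites, votes) == "Y"      # 3 of 4 yes: Y wins
```

With only a minority of yes votes, the function returns None and Y would have to retry or defer to another candidate.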
Distributed Concurrency Control Based on Voting
• Voting method
• No distinguished copy
• Lock requests sent to all sites that contain a copy
• Each copy maintains its own lock
• If transaction that requests a lock is granted that lock by a majority of the copies, it
holds the lock on all copies
• Time-out period applies
• Results in higher message traffic among sites
• Algorithm:
1. Send a message to all nodes that maintain a replica of this item.
2. If a node can safely lock the item, then vote "Yes", otherwise, vote "No".
3. If a majority of participating nodes vote "Yes" then the lock is granted.
4. Send the results of the vote back out to all participating sites.
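The four steps above can be sketched as follows. This is a toy model, not a concurrency controller: message passing is replaced by direct dictionary access, time-outs are omitted, and the copy states are invented for illustration.

```python
# Lock state of each copy of one data item (one boolean per site).
copy_locked = {"site1": False, "site2": False, "site3": True}

def request_lock(item_copies):
    """Voting method: each copy votes yes iff it can safely lock the item;
    the lock is granted only on a majority of yes votes."""
    votes = {site: not locked for site, locked in item_copies.items()}
    granted = sum(votes.values()) > len(item_copies) / 2
    if granted:
        # The winning transaction holds the lock on the copies that voted yes.
        for site, yes in votes.items():
            if yes:
                item_copies[site] = True
    return granted

assert request_lock(copy_locked) is True   # 2 of 3 copies voted yes
```

Because every request polls all copies, this method generates more message traffic than the distinguished-copy techniques, matching the note above.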
Distributed Recovery
• The recovery process in distributed databases is quite involved.
• Difficult to determine whether a site is down without exchanging numerous
messages with other sites.
• For example, suppose that site X sends a message to site Y and expects a
response from Y but does not receive it. There are several possible
explanations:
• The message was not delivered to Y because of communication failure.
• Site Y is down and could not respond.
• Site Y is running and sent a response, but the response was not delivered.
• Without additional information or the sending of additional messages, it is
difficult to determine what actually happened.
• Another problem with distributed recovery is Distributed commit.
• When a transaction is updating data at several sites, it cannot commit until it is sure
that the effect of the transaction on every site cannot be lost
• Two-phase commit protocol often used to ensure correctness
Overview of Transaction Management in
Distributed Databases
• We need mechanisms in place to ensure multiple copies of data are kept consistent.
• Concurrency and Commit protocols must be changed to account for replicated data.
• In a centralized DB we had the notion of a commit point. In distributed DB, we need to
consider committing a transaction that changes data on multiple nodes.
• Distributed Commit Protocol such as Two-Phase Commit (2PC) can be used.
• Global transaction manager
• Additional component is introduced for supporting distributed transactions
• The site where the transaction originated can temporarily assume the role of global transaction
manager
• Coordinates execution of database operations with transaction managers at multiple sites
• Passes database operations and associated information to the concurrency controller
• Controller responsible for acquisition and release of locks
• If the transaction requires access to a locked resource, it is blocked until the lock is acquired.
• Once the lock is acquired, the operation is sent to the runtime processor, which handles the actual execution of the
database operation.
• Once the operation is completed, locks are released, and the transaction manager is updated with the result of the
operation.
Commit Protocols
• Two-phase Commit Protocol
• The two-phase commit protocol (2PC) is a common way to ensure that all
participating databases commit or abort a distributed transaction together.
• Phase 1(Prepare phase):
• Send a message to all nodes: "Can you commit Transaction X?" All nodes that can commit this transaction reply with
"Yes".
• Phase 2 (Commit phase):
• If all nodes reply with "Yes", then send a "Commit" message to all nodes.
• If any node replies "No", then the transaction is aborted.
• 2PC is an example of a synchronous replication protocol.
• Coordinator (global recovery manager) maintains information needed for recovery
• In addition to local recovery managers
• 2PC has two main drawbacks:
• Blocking:
• If the coordinator fails, all participating databases will block until the coordinator recovers.
• This can cause performance degradation, especially if participants are holding locks to shared resources.
• Inability to handle network partitions:
• If the network between two participating databases is partitioned, 2PC cannot guarantee that
all databases will commit or abort the transaction together. This can lead to data
inconsistency.
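The two phases described above can be condensed into a short sketch. This is a simplification for illustration: real 2PC writes log records at each step and survives crashes via recovery, whereas here participants are plain in-memory objects.

```python
class Participant:
    """One participating database; can_commit models its prepare vote."""
    def __init__(self, can_commit):
        self.can_commit = can_commit
        self.state = "active"

    def prepare(self):               # Phase 1: "Can you commit?"
        return self.can_commit

    def finish(self, decision):      # Phase 2: apply the global decision
        self.state = decision

def two_phase_commit(participants):
    # Phase 1 (prepare): collect a vote from every participant.
    all_yes = all(p.prepare() for p in participants)
    decision = "commit" if all_yes else "abort"
    # Phase 2 (commit): broadcast the single global decision.
    for p in participants:
        p.finish(decision)
    return decision

nodes = [Participant(True), Participant(True), Participant(False)]
assert two_phase_commit(nodes) == "abort"          # one "No" aborts all
assert all(p.state == "abort" for p in nodes)
```

The blocking drawback is visible in the structure: between the two phases, a prepared participant can do nothing on its own; it must wait for the coordinator's decision.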
Commit Protocols (Contd.)
• Three-phase Commit Protocol
• Divides second commit phase into two subphases
• Prepare-to-commit phase
• This phase is used to communicate result of the prepare phase to all participants.
• If all participants vote yes, then the coordinator instructs them to move into the prepare-to-commit state
• Commit subphase same as two-phase commit counterpart
• This extra phase in 3PC helps to prevent blocking and data inconsistency.
• If the coordinator fails, the participating databases can abort the transaction without blocking.
• if the coordinator crashes during this subphase, another participant can see the transaction through to
completion. It can simply ask a crashed participant if it received a prepare-to-commit message. If it did not,
then it safely assumes to abort.
• If the network between two participating databases is partitioned, the participating databases
can still commit or abort the transaction together.
• Also, by limiting the time required for a transaction to commit or abort to a maximum time-out period, the
protocol ensures that a transaction attempting to commit via 3PC releases locks on time-out.
• When a participant receives a pre-commit message, it knows that the rest of the participants have voted to
commit. If a pre-commit message has not been received, then the participant will abort and release all
locks.
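The extra round can be shown by extending the 2PC sketch with a pre-commit step. This is an illustrative sketch only (the state names and callbacks are hypothetical); the point is that once every participant reaches the pre-committed state, any surviving participant knows all others voted yes and can finish the commit without the coordinator.

```python
# Sketch of three-phase commit: a pre-commit round sits between the vote
# and the final commit. State names are illustrative.

class Node:
    def __init__(self, can_commit):
        self.can_commit = can_commit
        self.state = "initial"

    def prepare(self, txn_id):
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def pre_commit(self, txn_id):
        # Reaching this state tells the node that *all* participants voted yes.
        self.state = "pre-committed"

    def commit(self, txn_id):
        self.state = "committed"

    def abort(self, txn_id):
        self.state = "aborted"


def three_phase_commit(participants, txn_id):
    # Phase 1: collect votes.
    if not all(p.prepare(txn_id) for p in participants):
        for p in participants:
            p.abort(txn_id)
        return "aborted"
    # Phase 2: pre-commit. From here on, participants can complete the
    # commit among themselves even if the coordinator crashes.
    for p in participants:
        p.pre_commit(txn_id)
    # Phase 3: commit.
    for p in participants:
        p.commit(txn_id)
    return "committed"
```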
Query Processing and Optimization in
Distributed Databases
• Stages of a distributed database query
• Query mapping
• Translates the query submitted to the distributed database system into an algebraic query
on the global relations
• The translation refers to the global conceptual schema and does not yet use distribution or
replication information.
• Localization
• Maps the distributed query to separate queries on individual fragments using data distribution and
replication information.
• Global query optimization
• Strategy selected from list of candidates that is closest to optimal
• A list of candidate queries can be obtained by permuting the ordering of operations within a fragment
query generated by the previous stage.
• Time is the preferred unit for measuring cost. The total cost is a weighted combination of costs such as
CPU cost, I/O costs, and communication costs.
• Local query optimization
• Common to all sites in the DDB
• The techniques are similar to those used in centralized systems.
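The weighted cost combination used in global query optimization can be written as a small function. The weights below are illustrative assumptions, not standard values: in a wide-area network the communication weight typically dominates, while in a local cluster CPU and I/O costs matter more.

```python
# Sketch of the weighted cost function used to compare candidate query
# strategies. Component costs are in seconds; weights are illustrative.

def query_cost(cpu_secs, io_secs, comm_secs,
               w_cpu=1.0, w_io=1.0, w_comm=10.0):
    """Total estimated cost as a weighted sum of component costs."""
    return w_cpu * cpu_secs + w_io * io_secs + w_comm * comm_secs
```

The optimizer would evaluate this for each candidate strategy and pick the minimum; with a large `w_comm`, strategies that move less data across the network win.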
Query Processing and Optimization in
Distributed Databases (cont’d.)
• Data transfer costs of distributed query processing
• Cost of transferring intermediate and final result files (from one site to other)
• Optimization criterion: reducing amount of data transfer

Example:
Q: For each employee, retrieve the employee's name and the
name of the department for which the employee works.
This can be stated as follows in the relational algebra:
Q: πFname,Lname,Dname(EMPLOYEE ⋈Dno=Dnumber DEPARTMENT)
Example (contd.)
• The result of this query will include 10,000 records, assuming that every employee is related to a
department.
• Suppose that each record in the query result is 40 bytes long, so the result is 400,000 bytes.
• The EMPLOYEE relation (10,000 records, 1,000,000 bytes) resides at site 1, and the DEPARTMENT
relation (3,500 bytes) resides at site 2.
• The query is submitted at a distinct site 3, which is called the result site because the query result is needed
there; neither the EMPLOYEE nor the DEPARTMENT relation resides at site 3.
• There are three simple strategies for executing this distributed query:
1. Transfer both the EMPLOYEE and the DEPARTMENT relations to the result site and perform the join at site 3. In this
case, a total of 1,000,000 + 3,500 = 1,003,500 bytes must be transferred.
2. Transfer the EMPLOYEE relation to site 2, execute the join at site 2, and send the result to site 3. The size of the
query result is 40 * 10,000 = 400,000 bytes, so 400,000 + 1,000,000 = 1,400,000 bytes must be transferred.
3. Transfer the DEPARTMENT relation to site 1, execute the join at site 1, and send the result to site 3. In this case,
400,000 + 3,500 = 403,500 bytes must be transferred.
• If minimizing the amount of data transfer is our optimization criterion, we should choose strategy 3.
• However, suppose that the result site is site 2; then we have two simple strategies:
1. Transfer the EMPLOYEE relation to site 2, execute the query, and present the result to the user at site 2. Here, the
same number of bytes—1,000,000— must be transferred for query Q.
2. Transfer the DEPARTMENT relation to site 1, execute the query at site 1, and send the result back to site 2. In this
case 400,000 + 3,500 = 403,500 bytes must be transferred for query Q.
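The byte counts for the three strategies at result site 3 can be reproduced directly. The per-record sizes below are assumptions consistent with the totals in the example (10,000 EMPLOYEE records of 100 bytes, DEPARTMENT totalling 3,500 bytes, 40-byte result records).

```python
# Reproducing the data-transfer costs of the three strategies when the
# result site is site 3. Sizes follow the worked example.

EMPLOYEE_BYTES = 10_000 * 100   # 1,000,000 bytes at site 1
DEPARTMENT_BYTES = 3_500        # 3,500 bytes at site 2
RESULT_BYTES = 10_000 * 40      # 400,000 bytes of join result

strategies = {
    "1: ship both relations to site 3":
        EMPLOYEE_BYTES + DEPARTMENT_BYTES,      # 1,003,500
    "2: ship EMPLOYEE to site 2, result to site 3":
        EMPLOYEE_BYTES + RESULT_BYTES,          # 1,400,000
    "3: ship DEPARTMENT to site 1, result to site 3":
        DEPARTMENT_BYTES + RESULT_BYTES,        # 403,500
}

# With data transfer as the optimization criterion, pick the cheapest.
best = min(strategies, key=strategies.get)
```

As the slide concludes, strategy 3 minimizes the bytes transferred.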
Query Processing and Optimization in Distributed
Databases (cont’d.)
• Distributed query processing using semijoin
• Reduces the number of tuples in a relation before transferring it to another site
• Send the joining column of one relation R to one site where the other relation S
is located; this column is then joined with S
• Join attributes and result attributes shipped back to original site and joined with
R
• Efficient solution to minimizing data transfer
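The semijoin steps above can be sketched on the EMPLOYEE/DEPARTMENT example. Relations are modelled as lists of dicts; the attribute names follow the example schema, but the function itself is an illustration of the idea, not a DDBMS API.

```python
# Sketch of the semijoin strategy: ship only DEPARTMENT's join column to
# EMPLOYEE's site, reduce EMPLOYEE there, then ship the reduced relation
# back and complete the join.

def semijoin_transfer(department, employee):
    # Step 1: ship only the join column (Dnumber) to EMPLOYEE's site.
    dnumbers = {d["Dnumber"] for d in department}
    # Step 2: at EMPLOYEE's site, keep only matching tuples, projected
    # onto the join attribute and the result attributes.
    reduced = [
        {"Fname": e["Fname"], "Lname": e["Lname"], "Dno": e["Dno"]}
        for e in employee if e["Dno"] in dnumbers
    ]
    # Step 3: ship the reduced EMPLOYEE back and join with DEPARTMENT.
    dname_by_no = {d["Dnumber"]: d["Dname"] for d in department}
    return [
        {"Fname": r["Fname"], "Lname": r["Lname"],
         "Dname": dname_by_no[r["Dno"]]}
        for r in reduced
    ]
```

Only the join column travels one way and the reduced, projected tuples travel the other, which is why semijoin can transfer far less than shipping a whole relation.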
Query Processing and Optimization in
Distributed Databases (cont’d.)
• Query and update decomposition
• User can specify a query as if the DBMS were centralized
• If full distribution, fragmentation, and replication transparency are supported
• Query decomposition module
• Breaks up a query into subqueries that can be executed at the individual sites
• Strategy for combining results must be generated
• To determine which replicas include the data items referenced in a query, the DDBMS refers
to the fragmentation, replication, and distribution information stored in the DDBMS catalog.
• Catalog stores attribute list and/or guard condition
• Guard is basically a selection condition that specifies which tuples exist in the fragment;
• it is called a guard because only tuples that satisfy this condition are permitted to be stored in the
fragment.
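Guard conditions can be pictured as predicates stored in the catalog alongside each horizontal fragment. The fragment names and predicates below are hypothetical examples, not real catalog entries.

```python
# Sketch of guard conditions: each horizontal fragment stores a selection
# predicate, and a tuple may be stored only in fragments whose guard it
# satisfies. Fragment names and predicates are illustrative.

FRAGMENT_GUARDS = {
    "EMP_D5": lambda t: t["Dno"] == 5,   # employees of department 5
    "EMP_D4": lambda t: t["Dno"] == 4,   # employees of department 4
}

def fragments_for(tuple_):
    """Return the fragments permitted to store this tuple."""
    return [name for name, guard in FRAGMENT_GUARDS.items() if guard(tuple_)]
```

A query decomposition module can use the same guards in reverse: a subquery whose selection condition contradicts a fragment's guard need not be sent to that fragment at all.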
Types of Distributed Database Systems
• Factors that influence types of DDBMSs
• Degree of homogeneity of DDBMS software
• Homogeneous
• The sites use very similar software.
• The sites use identical DBMS or DBMS from the same vendor.
• Each site is aware of all other sites and cooperates with other sites to process user requests.
• The database is accessed through a single interface as if it is a single database.
• Heterogeneous
• Different sites use dissimilar schemas and software.
• Query processing is complex due to dissimilar schemas.
• Transaction processing is complex due to dissimilar software.
• A site may not be aware of other sites, so there is limited cooperation in processing user requests.
• Degree of local autonomy
• No local autonomy
• If there is no provision for the local site to function as a standalone DBMS
• On the other hand, if direct access by local transactions to a server is permitted, the system has some degree of local autonomy
• Multidatabase (peer to peer) system has full local autonomy
• Federated database system (FDBS)
• Global view or schema of the federation of databases is shared by the applications
Classification of Distributed Databases

Figure 23.6 Classification of distributed databases


Distributed Database Architectures
• Parallel versus distributed architectures
• Types of multiprocessor system architectures
• Shared memory (tightly coupled)
• Multiple processors share secondary (disk) storage and also share primary memory
• Shared disk (loosely coupled)
• Multiple processors share secondary (disk) storage, but each has their own primary memory
• Database management systems developed using the above types of architectures are termed parallel
database management systems rather than DDBMSs, since they utilize parallel processor technology.
• Shared-nothing
• Every processor has its own primary and secondary (disk) memory, no common memory exists, and
the processors communicate over a high-speed interconnection network (bus or switch).
• Although the shared-nothing architecture resembles a distributed database computing environment,
major differences exist in the mode of operation
• In shared-nothing multiprocessor systems, there is symmetry and homogeneity of nodes; this is not
true of the distributed database environment, where heterogeneity of hardware and operating
system at each node is very common.
• Shared-nothing architecture is also considered as an environment for parallel databases.
Database System Architectures

Figure 23.7 Some different database system architectures:
(a) a parallel (shared-nothing) database architecture;
(b) a centralized database with distributed access (a networked architecture with a centralized database at one of the sites);
(c) a truly distributed database architecture (a pure distributed database)
General Architecture of Pure Distributed
Databases
• Global query compiler
• References global conceptual schema from the global system catalog to verify and impose
defined constraints
• Global query optimizer
• The global query optimizer references both global and local conceptual schemas and
generates optimized local queries from global queries.
• It evaluates all candidate strategies using a cost function that estimates cost based on
response time (CPU, I/O, and network latencies) and estimated sizes of intermediate
results.
• Global transaction manager
• Coordinates the execution across multiple sites with the local transaction managers
• Each local DBMS would have its local query optimizer, transaction manager, and
execution engines as well as the local system catalog, which houses the local
schemas.
Schema Architecture of Distributed Databases

Figure 23.8 Schema architecture of distributed databases


Federated Database Schema Architecture
• The local schema is the conceptual schema (full database definition) of a component database.
• The component schema is derived by translating the local schema into a canonical data model or
common data model (CDM), a common format shared by the different databases.
• The export schema represents the subset of a component schema that is made available to the FDBS.
• The federated schema is the global schema or view, which is the result of integrating all the
shareable export schemas.
• The external schema is the view of the data that is seen by the user; it is defined by the user's
application.

Figure 23.9 The five-level schema architecture in a federated database system (FDBS)
An Overview of Three-Tier Client/Server
Architecture
• Distributed database applications are being developed in the context of the client/server
architectures.

Figure 23.10 The three-tier client/server architecture


Distributed Catalog Management
• Centralized catalogs
• Entire catalog is stored at one single site
• Easy to implement
• Fully replicated catalogs
• Identical copies of the complete catalog are present at each site
• Results in faster reads
• Partially replicated catalogs
• Each site maintains complete catalog information on data stored locally at
that site.
• Each site is also permitted to cache entries retrieved from remote sites.
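The partially replicated scheme can be sketched as a site-local catalog with a cache in front of remote lookups. The `fetch_remote` callback below is a hypothetical stand-in for a network request to the site that owns the entry.

```python
# Sketch of a partially replicated catalog: a site holds complete,
# authoritative entries for its local data and caches entries fetched
# from remote sites on demand.

class SiteCatalog:
    def __init__(self, local_entries, fetch_remote):
        self.local = dict(local_entries)   # authoritative local entries
        self.cache = {}                    # cached remote entries
        self.fetch_remote = fetch_remote   # callable: name -> catalog entry

    def lookup(self, name):
        if name in self.local:
            return self.local[name]
        if name not in self.cache:
            # Cache miss: ask the owning site once, then keep a copy.
            self.cache[name] = self.fetch_remote(name)
        return self.cache[name]
```

The trade-off this sketch makes visible: repeated lookups of a remote entry hit the cache instead of the network, at the cost of the cached copy possibly going stale if the owning site updates it.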
Summary
• Distributed database concept
• Distribution transparency
• Fragmentation transparency
• Replication transparency
• Design issues
• Horizontal and vertical fragmentation
• Concurrency control and recovery techniques
• Query processing
• Categorization of DDBMSs
