ADBMS Chapter No. 3
NOTES:
3.1 Introduction
3.7 Availability
Learning Objectives of the Unit: To study how commit protocols work in a distributed environment.
6. Key Concepts
Q1.) Explain various Concurrency Control approaches in DDBMS (Nov. 2009 6M, Apr
2010 6M, Nov 2012 10M, Apr 2013 6M)
Q4.) Compare, with examples, Heterogeneous & Homogeneous databases (Apr. 2010 6M,
Nov. 2010 6M, Apr 2013 6M)
2. Directory System
MCA Distributed Databases
Prof. Khandagale S.P. UNIT NO. 3
– Multidatabase (no schema): There is no single conceptual global schema. For data access, a schema is constructed dynamically as needed by the application software.
Distributed Database Architecture:
There are 3 distributed database architectures, they are:
Client-Server System Architecture:
Client-server system has one or more client processes and one or more server processes,
and a client process can send a query to any one server process.
Clients are responsible for user-interface issues and servers manage data and execute
transactions.
Thus a client process could run on a personal computer and send queries to a server
running on a mainframe.
This architecture has become popular for several reasons.
First, it is relatively simple to implement, due to the clean separation of functionality and because the server is centralized.
Second, expensive server machines are not wasted on handling cheap user interactions, which are left to the client machines.
And third, users can run a graphical user interface that they are familiar with, rather than the user interface on the server.
Collaborating Server System Architecture:
In this architecture we have a collection of database servers. When a server receives a query that requires access to data at other servers, it generates appropriate subqueries.
These subqueries are executed by the other servers, and the results are put together to form the final result.
Middleware System Architecture:
It is designed to allow a single query to span multiple servers, without requiring all database servers to be capable of managing such multi-site execution strategies.
In this architecture, only one database server needs to be capable of managing queries and transactions that span multiple servers; the remaining servers need to handle only local queries and transactions.
We can think of this special server as a layer of software that coordinates the execution of queries and transactions across one or more independent database servers; such software is often called a middleware system.
The middleware layer is capable of executing joins and other relational operations on data obtained from the other servers, but typically does not itself maintain any data.
1. Replication: The system maintains several identical copies (replicas) of the relation and stores each replica at a different site. The alternative to replication is to store only one copy of relation r.
2. Fragmentation: The system partitions the relation into several fragments and stores each fragment at a different site.
Fragmentation and replication can also be combined: a relation can be partitioned into several fragments, and there may be several replicas of each fragment.
Data Replication:
If a relation r is replicated, a copy of r is stored at two or more sites. In the most extreme case we have full replication, in which a copy is stored at every site in the system.
Advantages of data replication:
Availability: If one of the sites containing relation r fails, r can still be found at another site. Thus the system can continue to process queries involving r despite the failure of one site.
Increased parallelism: In cases where the majority of accesses to relation r result only in reads, several sites can process queries involving r in parallel. The more replicas of r there are, the greater the chance that the needed data will be found at the site where the transaction is executing.
Hence data replication minimizes movement of data between sites.
Disadvantage of data replication:
Increased overhead on update: The system must ensure that all replicas of relation r are consistent; otherwise erroneous computations may result. Thus whenever r is updated, the update must be propagated to all sites containing replicas. The result is increased overhead.
For example, in a banking system where account information is replicated at various sites, it is necessary to ensure that the balance in a particular account agrees at all sites.
Replication increases the availability of data for read-only transactions, but controlling updates to replicated data by several transactions is more complex than in centralized systems.
Data Fragmentation:
If relation r is fragmented, r is divided into a number of fragments r1, r2, …, rn.
There are two different schemes for fragmenting a relation: horizontal fragmentation and vertical fragmentation.
We can illustrate fragmentation using the account relation, with schema
Account-schema = (account-number, branch-name, balance).
Horizontal Fragmentation:
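As a sketch of the idea, horizontal fragmentation partitions account into subsets of its tuples. The following Python fragment (branch names and balances are made-up illustrative data, not from the text) groups tuples by branch-name, producing one fragment per branch:

```python
def horizontal_fragments(account_tuples):
    # Partition the tuples of account(account-number, branch-name, balance)
    # by branch-name; each fragment would be stored at that branch's site.
    fragments = {}
    for row in account_tuples:
        fragments.setdefault(row['branch-name'], []).append(row)
    return fragments
```

Every tuple appears in exactly one fragment, so the union of the fragments reconstructs the original relation.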
1. If an active site contains a <commit T> record in its log, then T must be committed.
2. If an active site contains an <abort T> record in its log, then T must be aborted.
3. If some active participating site does not contain a <ready T> record in its log, then the failed coordinator Ci cannot have decided to commit T; T can therefore be aborted.
4. If none of the above cases holds, then all active sites must have a <ready T> record in their logs, but no additional control records (such as <abort T> or <commit T>). In this case the active sites must wait for Ci to recover to learn the decision.
Blocking problem: active sites may have to wait for the failed coordinator to recover.
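The four recovery cases above can be sketched as a small decision function (a hedged illustration; the record strings and the function name are invented for this example):

```python
def decide_fate(active_site_logs):
    """active_site_logs: list of sets of log records, one set per active site.
    Records are strings such as '<ready T>', '<commit T>', '<abort T>'."""
    all_records = set().union(*active_site_logs)
    if '<commit T>' in all_records:
        return 'commit'       # case 1: some active site logged commit
    if '<abort T>' in all_records:
        return 'abort'        # case 2: some active site logged abort
    if any('<ready T>' not in log for log in active_site_logs):
        return 'abort'        # case 3: coordinator cannot have decided commit
    return 'wait'             # case 4: block until the coordinator recovers
```

The last return value is exactly the blocking problem: no safe decision exists until the coordinator comes back.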
Handling of Failures - Network Partition
If the coordinator and all its participants remain in one partition, the failure has no effect on the
commit protocol.
If the coordinator and its participants belong to several partitions:
Sites that are not in the partition containing the coordinator think the coordinator has
failed, and execute the protocol to deal with failure of the coordinator.
No harm results, but sites may still have to wait for decision from coordinator.
The coordinator and the sites that are in the same partition as the coordinator think that the sites in the other partition have failed, and follow the usual commit protocol. Again, no harm results.
Recovery and Concurrency Control
In-doubt transactions have a <ready T>, but neither a <commit T>, nor an <abort T> log
record.
The recovering site must determine the commit-abort status of such transactions by contacting other sites; this can be slow and can potentially block recovery.
Recovery algorithms can note lock information in the log.
Instead of <ready T>, write out <ready T, L>, where L is the list of locks held by T when the log record is written (read locks can be omitted).
For every in-doubt transaction T, all the locks noted in the <ready T, L> log
record are reacquired.
After lock reacquisition, transaction processing can resume; the commit or rollback of
in-doubt transactions is performed concurrently with the execution of new transactions.
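The use of <ready T, L> records can be sketched as follows (an illustrative sketch; the log-record encoding is invented for the example). Only transactions that are still in doubt, i.e., with no later <commit T> or <abort T> record, need their locks reacquired:

```python
def locks_to_reacquire(log):
    """log: list of tuples. ('ready', T, L) notes the locks L held by T;
    ('commit', T) / ('abort', T) resolve T, so its locks need not be reacquired."""
    ready = {}
    resolved = set()
    for rec in log:
        if rec[0] == 'ready':
            _, t, locks = rec
            ready[t] = locks
        elif rec[0] in ('commit', 'abort'):
            resolved.add(rec[1])
    # only transactions still in doubt keep their noted locks
    return {t: locks for t, locks in ready.items() if t not in resolved}
```

After these locks are reacquired, new transactions can run concurrently while the in-doubt ones are resolved.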
Some assumptions:
- Our distributed database system consists of n sites (servers/computers in different locations).
- Data are replicated at two or more sites.
Let us assume that transaction T1 is initiated at site S5, as shown in Figure 2 (Step 1). Also assume that the requested data item D is replicated at sites S1, S2, and S6. The steps are numbered in Figure 2. According to the discussion above, the technique works as follows:
Step 2 - The initiator site S5’s Transaction manager sends the lock request to lock data
item D to the lock-manager site S3.
The Lock-manager at site S3 will look for the availability of the data item D.
Step 3 - If the requested item is not locked by any other transactions, the lock-manager
site responds with lock grant message to the initiator site S5.
Step 4 - As the next step, the initiator site S5 can use the data item D from any of the
sites S1, S2, and S6 for completing the Transaction T1.
Step 5 - After successful completion of the Transaction T1, the Transaction manager of
S5 releases the lock by sending the unlock request to the lock-manager site S3.
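The message flow above can be sketched as a minimal single lock-manager (an illustrative sketch; the class and method names are assumptions, and real systems also track lock modes and requesting transactions):

```python
class LockManager:
    """Minimal single lock-manager: one designated site (S3 in the example)
    handles every lock and unlock request in the system."""
    def __init__(self):
        self.locked = set()          # data items currently locked

    def request(self, item):
        # Steps 2-3: the initiator sends a lock request; grant only if free.
        if item in self.locked:
            return 'denied'
        self.locked.add(item)
        return 'granted'

    def release(self, item):
        # Step 5: unlock request after the transaction completes.
        self.locked.discard(item)
```

This mirrors the message count noted below: one request, one grant, one unlock per item.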
Advantages:
Locking can be handled easily. We need two messages for lock (one for request, the
other for grant), and one message for unlock requests. Also, this method is simple as it
resembles the centralized database.
Deadlocks can be handled easily. The reason is, we have one lock manager who is
responsible for handling the lock requests.
Disadvantages:
The lock-manager site becomes the bottleneck as it is the only site to handle all the lock
requests generated at all the sites in the system.
Highly vulnerable to a single point of failure. If the lock-manager site fails, then we lose concurrency control.
Advantages:
Simple implementation is possible for data that are fragmented; they can be handled as in the case of the Single Lock-Manager approach.
For replicated data, again the work can be distributed over several sites using one of the
above listed protocols.
Lock-Manager site is not the bottleneck as the work of lock-manager is distributed over
several sites.
Disadvantages:
Primary Copy Protocol (Distributed Lock Manager Approach):
Primary Copy Protocol:
Assume that we have a data item Q which is replicated at several sites, and that we choose one of the replicas of Q as the primary copy (only one replica). The site that stores the primary copy is designated as the primary site. Any lock request for data item Q generated at any site must be routed to the primary site. The primary site's lock-manager is responsible for handling lock requests, even though other sites hold the same data item and have local lock-managers.
We can choose different sites as lock-manager sites for different data items.
How does Primary Copy protocol work?
Figure 1 shows the Primary Copy protocol implementation.
In the figure
Step 1: Transaction T1 is initiated at site S5 and requests a lock on data item Q. Even though the data item is available locally at site S5, the lock-manager of S5 cannot grant the lock, because in our example site S3 is designated as the primary site for Q. Hence, the request must be routed to site S3 by the transaction manager of S5.
Step 2: S5 requests S3 for a lock on Q, i.e., S5 sends a lock request message to S3.
Step 3: If the lock on Q can be granted, S3 grants the lock and sends a grant message to S5. On receiving the lock grant, S5 executes transaction T1 (the transaction is executed on the copy of Q available locally; if there is no local copy, it is executed at another site where Q is available).
Step 4: On successful completion of the transaction, S5 sends an unlock message to the primary site S3.
Note: If transaction T1 writes data item Q, then the changes must be forwarded to all the sites where Q is replicated. If the transaction only reads Q, no propagation is needed.
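A minimal sketch of the routing rule (names are illustrative; real lock-managers also track lock modes and waiting transactions): every request for an item goes to that item's designated primary site.

```python
class PrimaryCopyDirectory:
    def __init__(self, primary_site_of):
        self.primary_site_of = primary_site_of   # data item -> primary site
        self.locked = set()                      # (site, item) pairs held

    def request_lock(self, item):
        primary = self.primary_site_of[item]     # route to the primary site
        if (primary, item) in self.locked:
            return None                          # denied: item already locked
        self.locked.add((primary, item))
        return primary                           # grant came from primary site

    def release(self, item):
        self.locked.discard((self.primary_site_of[item], item))
```

With `{'Q': 'S3'}` as the directory, this reproduces the example: every request for Q is handled by S3, regardless of where the transaction starts.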
Advantages:
Handling of concurrency control on replicated data is like unreplicated data. Simple
implementation.
Only 3 messages are needed to handle lock and unlock requests (1 lock request, 1 grant, and 1 unlock message) for both read and write.
Disadvantages:
Possible single point of failure. If the primary site of a data item Q fails, Q becomes inaccessible for locking, even though other sites hold copies of Q.
Example:
Let us assume that Q is replicated at 6 sites. Then we need to lock Q at 4 sites (half + one = 6/2 + 1 = 4). When transaction T1 sends the lock request message to those 4 sites, the lock-managers of those sites grant the locks based on the usual lock procedure.
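The majority rule in the example can be computed directly (a hedged sketch; the function names are illustrative):

```python
def majority_needed(n_replicas):
    # More than half the replicas: e.g. 6 replicas -> lock 4 sites.
    return n_replicas // 2 + 1

def can_proceed(grants, n_replicas):
    # grants: the set of sites that granted the lock.
    return len(grants) >= majority_needed(n_replicas)
```

A transaction blocked short of the majority must wait or give up its granted locks, which is where the deadlock risk noted later comes from.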
How does the Majority Based protocol work?
Implementation:
Figure 1 shows the implementation of the Majority Based protocol.
In the figure,
Q, R, and S are the different data items.
Q is replicated in sites S1, S2, S3 and S6.
R is replicated in sites S1, S2, S3, and S4.
S is replicated at sites S1, S2, S4, S5, and S6.
Note: If transaction T1 writes data item Q, then the changes must be forwarded to all the sites where Q is replicated. If the transaction only reads Q, no propagation is needed.
Advantages:
Disadvantages:
Points to note:
- A transaction can execute only after successfully acquiring locks on a majority of the replicas.
- It needs to send more messages, 2(n/2 + 1) lock messages and (n/2 + 1) unlock messages, compared to the Primary Copy protocol.
- Local lock-managers are responsible for granting or denying locks on the requested items.
- Not suitable for applications where read operations are frequent.
- When writing the data item, a transaction performs writes on all replicas.
- When handling unreplicated data, both read and write requests can be handled by the site at which the data item is available.
Step 1: Transaction T1 is initiated at site S5 and requests a lock on data item Q. Q is available at S1, S2, S3 and site S6. According to the protocol, T1 has to lock Q at any one site at which Q is replicated; i.e., in our example, we need to lock any 1 out of the 4 sites where Q is replicated. Assume that we have chosen site S3.
Step 2: S5 requests S3 for shared lock on Q. The lock request is represented in purple color.
Step 3: If the lock on Q can be granted, S3 grants the lock and sends a message to S5.
On receiving lock grant, S5 executes the Transaction T1 (Reading can be done in the locked site,
in our case, it is S3).
Step 4: On successful completion of Transaction, S5 sends unlock message to the site S3.
Let us assume that transaction T1 needs data item Q. Q is available at S1, S2, S3 and site S6. Sites S4 and S5, which do not have Q, are represented in red color.
Step 1: Transaction T1 initiated at site S5 and requests lock on data item Q. According to the
protocol, T1 has to lock Q in all the sites in which Q is replicated, i.e, in our example, we need to
lock all the 4 sites where Q is replicated.
Step 2: S5 requests S1, S2, S3, and S6 for exclusive lock on Q. The lock request is represented in
purple color.
Step 3: If the lock on Q can be granted at every site, all the sites will respond with grant lock
message to S5. (If any one or more sites cannot grant, T1 cannot be continued)
On receiving lock grant, S5 executes the Transaction T1 (When writing the data item,
transaction performs writes on all replicas).
Step 4: On successful completion of Transaction, S5 sends unlock message to all sites S1, S2, S3,
and S6.
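The two locking rules of the Biased protocol can be sketched as follows (an illustrative sketch; the function name is invented for the example): shared (read) locks need any one replica site, while exclusive (write) locks need all of them.

```python
def sites_to_lock(mode, replica_sites):
    if mode == 'shared':
        # Any one replica site suffices; pick one deterministically here.
        return {sorted(replica_sites)[0]}
    if mode == 'exclusive':
        # Every replica site must grant the lock before writing.
        return set(replica_sites)
    raise ValueError(mode)
```

This is why reads are cheap (one site) and writes carry the overhead listed below (all n sites).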
Advantages:
Read operations can be handled faster compared to the Majority Based protocol. If read operations are performed frequently in an application, the biased approach can be recommended.
Disadvantages:
Additional overhead on write operations.
Implementation is complex. For a write operation on an item replicated at n sites, we need to send n lock request messages, n lock grant messages, and n unlock messages.
Deadlocks can occur, as discussed for the Majority Based protocol.
CASE 1
Read Quorum Qr = 2, Write Quorum Qw = 3, Site’s weight = 1, Total weight of sites S = 4
Read Lock: a read request has to lock at least two replicas (2 sites in our example); any two sites can be locked.
Write Lock: a write request has to lock at least three replicas (3 sites in our example).
Note that the read quorum intersects with the write quorum. That is, out of the 4 available sites in our example, 3 sites must be locked for a write and 2 sites for a read. This ensures that no two transactions can read and write the same item at the same time.
CASE 2
Read Quorum Qr = 1, Write Quorum Qw = 4, Site’s weight = 1, Total weight of sites S = 4
Read Lock: a read lock requires only one site.
Write Lock: a write lock requires all four sites.
Note that here a read requires any one site and a write requires all the sites, which is exactly the behavior of the Biased protocol. Similarly, if we give the read and write quorums the same value 3, it resembles the Majority Based protocol. That is why the Quorum Consensus protocol is described as a generalization of the techniques above.
Points to note:
The quorums must be chosen carefully. For example, if read operations are frequent, we would choose a small read quorum value so as to make reads fast on the available replicas, and so on.
The chosen read quorum value must intersect the write quorum value to avoid read-write
conflict.
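These conditions can be checked directly (a hedged sketch; the write-write condition 2·Qw > S is the standard companion constraint, implicit in the cases above):

```python
def valid_quorums(q_read, q_write, total_weight):
    # Qr + Qw > S prevents read-write conflicts (the quorums must intersect);
    # 2*Qw > S prevents two simultaneous write quorums on the same item.
    return q_read + q_write > total_weight and 2 * q_write > total_weight
```

Both CASE 1 (Qr=2, Qw=3, S=4) and CASE 2 (Qr=1, Qw=4, S=4) satisfy these constraints.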
On the other hand, the absence of cycles in all of the local wait-for graphs does not mean that no deadlock has occurred. Let us discuss this point with the local wait-for graph examples shown below.
Figure 1 shows the lock request status for transactions T1, T2, T3 and T4 in a distributed database system. In the local wait-for graph of SITE 1, transaction T2 is waiting for transactions T1 and T3 to finish. In SITE 2, transaction T3 is waiting for T4, and T4 is waiting for T2. From the SITE 1 and SITE 2 local wait-for graphs, it is clear that transactions T2 and T3 are involved at both sites.
How might this happen? For example, transaction T2, which was initiated at SITE 2, may need data items held by transactions T1 and T3 at SITE 1. Hence, SITE 2 forwards the request to SITE 1. If those transactions are still busy, SITE 1 inserts the edges T2 → T1 and T2 → T3 in its local wait-for graph.
As another example, transaction T3, which was initiated at SITE 1, may need data items held by transaction T4 at SITE 2. Hence, SITE 1 forwards the request to SITE 2. Based on the status of T4, SITE 2 inserts an edge T3 → T4 in its local wait-for graph.
You can observe from the local wait-for graphs of SITE 1 and SITE 2 that there is no sign of a cycle in either one. If we merge these two local wait-for graphs into a single graph, we get the graph given in Figure 2 below. From Figure 2 it is clear that the union of the two local wait-for graphs forms a cycle, which means a deadlock has occurred. This merged wait-for graph is called the global wait-for graph.
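Merging the local wait-for graphs and checking for a cycle can be sketched as follows (an illustrative sketch using depth-first search; the edge representation is an assumption):

```python
def has_deadlock(local_graphs):
    """local_graphs: list of edge lists; each edge is a (waiter, holder) pair."""
    edges = {}
    for graph in local_graphs:            # union = the global wait-for graph
        for waiter, holder in graph:
            edges.setdefault(waiter, set()).add(holder)

    visited, on_stack = set(), set()

    def dfs(node):
        visited.add(node)
        on_stack.add(node)
        for nxt in edges.get(node, ()):
            if nxt in on_stack:
                return True               # back edge -> cycle -> deadlock
            if nxt not in visited and dfs(nxt):
                return True
        on_stack.discard(node)
        return False

    return any(dfs(n) for n in list(edges) if n not in visited)
```

On the example above, neither site's graph has a cycle alone, but their union contains T2 → T3 → T4 → T2.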
Figure 1 - (a) Deadlock occurrence with two transactions, (b) deadlock occurrence with three
transactions
Real time example of Deadlock situation:
T2 is a transaction that updates all the accounts with yearly interest, say 5%. T2 needs to lock all the accounts in write mode (exclusive lock). T2 is said to be complete if and only if it successfully updates the old balances of all the accounts with the new balances and commits. T2 would involve one update query:
UPDATE ACCOUNT SET balance = balance + (balance*0.05);
The deadlock-prevention protocol prevents the system from deadlocking through transaction rollbacks: it chooses rollback over waiting for a lock whenever the wait could cause a deadlock. In this approach we have the following two deadlock-prevention algorithms:
1. Wait-die
2. Wound-wait
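The two rules can be sketched with transaction timestamps (a hedged illustration; a smaller timestamp means an older transaction, and the function names are invented for the example):

```python
def wait_die(requester_ts, holder_ts):
    # Wait-die: an older requester may wait; a younger requester dies
    # (is rolled back) rather than waiting on an older holder.
    return 'wait' if requester_ts < holder_ts else 'die'

def wound_wait(requester_ts, holder_ts):
    # Wound-wait: an older requester wounds (rolls back) the younger holder;
    # a younger requester simply waits.
    return 'wound' if requester_ts < holder_ts else 'wait'
```

In both schemes, only one direction of waiting is ever allowed, so the wait-for graph can never form a cycle.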
Availability:
One of the goals in distributed databases is high availability, i.e., the database must function almost all the time.
Since failures are more likely in large distributed systems, a distributed database must continue functioning even when there are various types of failures. The ability to continue functioning even during failures is referred to as robustness.
For a distributed system to be robust, it must detect failures, reconfigure the system so that computation may continue, and recover when a processor or a link is repaired.
If a failed site was a central server for some subsystem, an election must be held to determine the new server.
Since a network partition may not be distinguishable from a site failure, the following situations must be avoided:
- Two or more central servers elected in distinct partitions
- More than one partition updating a replicated data item
Cloud Based Databases : A database accessible to clients from the cloud and delivered to
users on demand via the Internet from a cloud database provider's servers. Also referred to
as Database-as-a-Service (DBaaS), cloud databases can use cloud computing to achieve
optimized scaling, high availability, multi-tenancy and effective resource allocation.
There are two common deployment models: users can run databases on the cloud independently, using a virtual machine image, or they can purchase access to a database service maintained by a cloud database provider.
Cloud computing refers to the delivery of computing and storage capacity as a service
to a heterogeneous community of end-recipients. Cloud computing entrusts services
with a user’s data, software, and computation over a network.
Just as databases are required in traditional computing, they are also required in cloud
computing. A cloud database also referred to as Database-as-a-Service (DBaaS), is a
database that is accessible to end-recipients from the cloud and delivered to users on
demand from a cloud database provider’s servers via the Internet.
While a cloud database can be a traditional database such as a MySQL or SQL Server database that has been adapted for cloud use, a native cloud database such as Xeround's MySQL Cloud database tends to be better equipped to optimally use cloud resources and to guarantee scalability as well as availability and stability.
Cloud databases can offer significant advantages over their traditional counterparts, including
increased accessibility, automatic failover and fast automated recovery from failures,
automated on-the-go scaling, minimal investment and maintenance of in-house hardware, and
potentially better performance. At the same time, cloud databases have their share of
potential drawbacks, including security and privacy issues as well as the potential loss of or
inability to access critical data in the event of a disaster or bankruptcy of the cloud database
service provider.
Virtual machine Image - cloud platforms allow users to purchase virtual machine
instances for a limited time. It is possible to run a database on these virtual machines.
Users can either upload their own machine image with a database installed on it, or
use ready-made machine images that already include an optimized installation of a
database. For example, Oracle provides a ready-made machine image with an installation of Oracle Database 11g Enterprise Edition on Amazon EC2 and on Microsoft Azure.
Database as a service (DBaaS) - some cloud platforms offer options for using a database as
a service, without physically launching a virtual machine instance for the database. In this
configuration, application owners do not have to install and maintain the database
on their own. Instead, the database service provider takes responsibility for installing and
maintaining the database, and application owners pay according to their usage. For
example, Amazon Web Services provides three database services as part of its cloud
offering: SimpleDB, a NoSQL key-value store; Amazon Relational Database Service, an SQL-based database service with a MySQL interface; and DynamoDB. Similarly, Microsoft offers the Azure SQL Database service as part of its cloud offering.
Time-to-Market
As the pace of business rapidly increases, while at the same time internal IT resources
remain in short supply at most companies, business managers are discovering an array
of cloud solutions they can easily apply to their business operation without requiring
the steps to acquire, install and maintain software. They can simply sign up for the
solution and begin using it right away.
Economics
Cloud computing lowers technology costs in two ways. First by significantly reducing the
need for IT experts and staff. The other is by efficiencies gained through shared multi-
tenant cloud environments that eliminate purchasing hardware equipment and
software licenses. Additionally, many services are month-to-month without long term
contracts, allowing businesses to easily apply these technologies “just-in-time” and
drop them when no longer needed.
Scalability
When a technology is custom-built or brought in-house, the IT managers must build an
infrastructure that can withstand the highest point of usage or they risk their reputation
not being able to deliver at peak times. In contrast, cloud computing services typically allow
for on-demand scalability for peak times or sustained periods. IT no longer needs to over-
engineer solutions and infrastructure or sacrifice quality of service.
Empowerment
Cloud computing solutions typically have a web-based interface for users. They can be
accessed by employees, customers and partners no matter where they are. With a
cloud database, everyone gets to work with the same set of information and
spreadsheet chaos is a thing of the past.
Best Practices
To the extent that reputable service providers are utilized, customers can be assured
that best practices in terms of security, reliability, and monitoring are in place. The
grade of service offered by leading cloud computing vendors is expensive and difficult to
implement on your own.
Eliminate Code
In addition to all of the standard benefits of cloud computing, Caspio is designed for
business users to create their own web applications without coding and without
reliance on IT. The business users can conceive and drive their own requirements and
features; everything from database reports, web forms, process approvals, dashboards,
and even mobile apps.
Go Green
Last but not least, cloud computing is all about virtualization, multi-tenancy, and
shared resources that provide more service for the amount of energy expended when
compared to in-house, single tenant solutions.
Directory Systems
White pages
– E.g., different servers for Bell Labs Murray Hill and Bell Labs Bangalore
LDAP
– Data Manipulation
Entries are organized into a directory information tree (DIT) according to their DNs.
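The relationship between a DN and its place in the DIT can be illustrated as follows (a hedged sketch; the example DN follows the classic Bell Labs style used above): each RDN component of the DN corresponds to one level of the tree, with the most specific component first.

```python
def dn_to_path(dn):
    """Split a DN such as 'cn=Alice, ou=Research, o=Bell Labs, c=US'
    into DIT levels, listed from the root down to the leaf entry."""
    rdns = [part.strip() for part in dn.split(',')]
    return list(reversed(rdns))   # root (c=US) first, leaf (cn=Alice) last
```

Reversing the RDN list gives the path from the DIT root to the entry, which is how a directory server locates it.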