Unit i Distributed Databases

Uploaded by

funfor340
Copyright
© © All Rights Reserved
UNIT I DISTRIBUTED DATABASES

Distributed Database Vs Centralized Database


Centralized DBMS | Distributed DBMS
The database is stored at only one site. | The database is stored at different sites and is accessed over a network.
The data is stored at a single computer site, which can be used by multiple users. | The database and the DBMS software are distributed over many sites connected by a computer network.
The database is maintained at one site. | The database is maintained at a number of different sites.
If the centralized system fails, the entire system is halted. | If one site fails, the system continues to work with the other sites.
It is less reliable. | It is more reliable.

Figure 1.5 Centralized database

Figure 1.6 Distributed database


Types of Distributed Database
Homogeneous & Heterogeneous Distributed Databases

• In a homogeneous distributed database:
  – All sites have identical software.
  – Sites are aware of each other and agree to cooperate in processing user requests.
  – Each site surrenders part of its autonomy in terms of its right to change schemas or software.
  – The system appears to the user as a single system.

• In a heterogeneous distributed database:
  – Different sites may use different schemas and software.
  – Differences in schema are a major problem for query processing.
  – Differences in software are a major problem for transaction processing.
  – Sites may not be aware of each other and may provide only limited facilities for cooperation in transaction processing.

Distributed Data Storage

• Assume the relational data model.

• Replication
  The system maintains multiple copies of data, stored at different sites, for faster retrieval and fault tolerance.
  Types of replication:
    1. Synchronous
    2. Asynchronous
  Replication schemes:
    1. Full replication
    2. No replication
    3. Partial replication
• Fragmentation
  A relation is partitioned into several fragments stored at distinct sites.
• Replication and fragmentation can be combined
  A relation is partitioned into several fragments, and the system maintains several identical replicas of each fragment.
• Transparency

Data Fragmentation

What is fragmentation?

• The process of dividing the database into multiple smaller parts is called fragmentation.
• These fragments may be stored at different locations.
• A relation r is divided into fragments r1, r2, …, rn that contain sufficient information to reconstruct r.

Types of data fragmentation

There are three types of data fragmentation:

1. Horizontal data fragmentation: splitting by rows
2. Vertical data fragmentation: splitting by columns
3. Mixed or hybrid fragmentation

1. Horizontal data fragmentation

• Horizontal fragmentation divides a relation (table) horizontally into groups of rows to create subsets of the table.
• Each tuple of r is assigned to one or more fragments.

Horizontal Fragmentation of account Relation

Fragment 1 (r1) = account1
Fragment 2 (r2) = account2

The relation r is reconstructed by taking the union of all fragments:

r = r1 ∪ r2 ∪ … ∪ rn
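
The split and the union above can be sketched in a few lines of Python. This is only an illustration: the sample rows, the tuple layout (acc_no, balance, branch_name) and the helper names `horizontal_fragment` and `reconstruct` are assumptions made for the sketch, not part of the notes.

```python
# Hypothetical sample data: each tuple is (acc_no, balance, branch_name).
account = [
    ("A_101", 5000, "Pune"),
    ("A_102", 10000, "Baroda"),
    ("A_103", 25000, "Delhi"),
]


def horizontal_fragment(relation, predicate):
    """Return the rows of `relation` that satisfy `predicate`."""
    return [row for row in relation if predicate(row)]


def reconstruct(*fragments):
    """Rebuild r as the set union r1 U r2 U ... U rn."""
    rows = set()
    for fragment in fragments:
        rows.update(fragment)
    return rows


# Every tuple must land in at least one fragment, so the two predicates
# together cover the whole relation.
r1 = horizontal_fragment(account, lambda row: row[2] == "Pune")
r2 = horizontal_fragment(account, lambda row: row[2] != "Pune")
```

Because the union is a set union, a tuple that was assigned to more than one fragment still appears only once in the reconstructed relation.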

2. Vertical Fragmentation

• Vertical fragmentation divides a relation (table) vertically into groups of columns to create subsets of the table.
• The schema R of relation r is split into several smaller schemas R1, R2, …, Rn.
• All schemas must contain a common candidate key (or superkey) to ensure the lossless-join property.
• A special attribute, the tuple-id attribute, may be added to each schema to serve as a candidate key.

Vertical Fragmentation of employee_info Relation

R = R1 ∪ R2 ∪ … ∪ Rn

To reconstruct relation r, take the natural join of the fragments:

r = r1 ⋈ r2 ⋈ … ⋈ rn

• To make reconstruction possible, each Ri must include one of the following:
o The primary key of R
o Any superkey of R
o A special attribute called a tuple-id (the logical or physical address of the tuple can be used as the tuple-id)
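
A minimal Python sketch of vertical fragmentation with a tuple-id follows. The attribute names, the sample rows and the helpers `add_tuple_id`, `project` and `natural_join` are hypothetical. The join is simplified to a single shared key attribute, which is assumed to be unique and to be the only attribute the two fragments have in common.

```python
# Hypothetical employee_info rows, stored as dictionaries.
employee_info = [
    {"name": "Surendra", "address": "Baroda", "salary": 15000},
    {"name": "Jaya", "address": "Pune", "salary": 12000},
    {"name": "Jayesh", "address": "Pune", "salary": 10000},
]


def add_tuple_id(relation):
    """Attach a tuple-id to every row so each fragment can carry a key."""
    return [{"tuple_id": i, **row} for i, row in enumerate(relation)]


def project(relation, attributes):
    """Vertical fragment: keep only the listed attributes of every row."""
    return [{a: row[a] for a in attributes} for row in relation]


def natural_join(left, right, key):
    """Join two fragments on `key` (assumed unique and the only shared attribute)."""
    index = {row[key]: row for row in right}
    return [{**row, **index[row[key]]} for row in left if row[key] in index]


r = add_tuple_id(employee_info)
r1 = project(r, ["tuple_id", "name", "address"])
r2 = project(r, ["tuple_id", "salary"])
rebuilt = natural_join(r1, r2, "tuple_id")
```

Without the shared `tuple_id` column the two fragments could not be matched row by row, which is why every schema Ri has to carry a key.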

Advantages of Fragmentation

o Horizontal:
 allows parallel processing on fragments of a relation
 allows a relation to be split so that tuples are located where they
are most frequently accessed
o Vertical:
 allows tuples to be split so that each part of the tuple is stored
where it is most frequently accessed
 tuple-id attribute allows efficient joining of vertical fragments
 allows parallel processing on a relation

3. Hybrid Fragmentation
• Hybrid fragmentation is achieved by performing horizontal and vertical partitioning together.
• A mixed fragment is a group of rows and columns of the relation.

Example: Consider the following table of employee information (the table name Employee is assumed here, because the notes do not name the table).

Emp_ID  Emp_Name  Emp_Address  Emp_Age  Emp_Salary
101     Surendra  Baroda       25       15000
102     Jaya      Pune         37       12000
103     Jayesh    Pune         47       10000

Fragmentation 1:
SELECT Emp_Name FROM Employee WHERE Emp_Age < 40

Fragmentation 2:
SELECT Emp_ID FROM Employee WHERE Emp_Address = 'Pune' AND Emp_Salary < 14000

Reconstruction of Hybrid Fragmentation


The original relation in hybrid fragmentation is reconstructed by
performing UNION and FULL OUTER JOIN.

Horizontal fragmentation example (this page is just for reference):

• Example:
Account (Acc_No, Balance, Branch_Name, Type)
Suppose the rows inserted into the table have the Branch_Name values Pune, Baroda and Delhi.

Acc_No  Balance  Branch_Name
A_101   5000     Pune
A_102   10000    Baroda
A_103   25000    Delhi

Fragmentation 1:
SELECT * FROM Account WHERE Branch_Name = 'Pune' AND Balance < 50000
Fragmentation 2:
SELECT * FROM Account WHERE Branch_Name = 'Delhi' AND Balance < 50000

Vertical fragmentation example:

Example:

Acc_No  Balance  Branch_Name
A_101   5000     Pune
A_102   10000    Baroda
A_103   25000    Delhi

Fragmentation 1:
SELECT Acc_No FROM Account

Fragmentation 2:
SELECT Acc_No, Balance FROM Account

Transparency:
• Data transparency is the degree to which a system user may remain unaware of how and where the data items are stored in a distributed system.
• Data transparency can take several forms:
– Fragmentation transparency:
  Users are not required to know how a relation has been fragmented.
– Replication transparency:
  Users see each data object as logically unique, even though the system may replicate it at different sites for performance or availability. Users need not be concerned with which data objects have been replicated or where the replicas are placed.
– Location transparency:
  Users are not required to know the physical location of the data. The distributed database should be able to find any data item as long as the user transaction supplies its identifier.
DATA ITEMS AND NAMING
• Data items in databases are relations, fragments and replicas.
• Every data item must have a unique name. In a distributed database environment, we must ensure that two sites do not use the same name for distinct data items.
• One solution is a central name server with which all names are registered. The name server ensures that the same name is not used for different data items.
– This approach, however, has several drawbacks:
• First, the name server may become a performance bottleneck when data items are located by their names.
• Second, if the name server crashes, it may not be possible for any site in the distributed system to continue to run.
• In the second approach, each site prefixes its own site identifier to any name that it generates. This ensures that no two sites generate the same name.
– This solution, however, fails to achieve location transparency, because site identifiers are attached to names.

• Examples: site17.account or account@site17.

– To address this problem, the database system can create a set of alternative names, or aliases, for data items. A user can then refer to data items by simple names that the system translates to complete names.
– In addition, users are unaffected if the database administrator decides to move a data item from one site to another.
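
The site-prefix and alias scheme can be sketched as follows. The names `full_name` and `AliasCatalog` are hypothetical, and the catalog is a plain in-memory dictionary rather than a replicated system catalog.

```python
def full_name(site_id, local_name):
    """Prefix the site identifier so no two sites generate the same name."""
    return f"{site_id}.{local_name}"


class AliasCatalog:
    """Maps simple user-visible names to complete, site-qualified names."""

    def __init__(self):
        self._aliases = {}

    def register(self, alias, complete_name):
        self._aliases[alias] = complete_name

    def resolve(self, alias):
        """Translate a simple name to the complete name."""
        return self._aliases[alias]

    def move(self, alias, new_site_id):
        """Relocate the data item; users keep using the same alias."""
        _, local_name = self._aliases[alias].split(".", 1)
        self._aliases[alias] = full_name(new_site_id, local_name)


catalog = AliasCatalog()
catalog.register("account", full_name("site17", "account"))
```

After `catalog.move("account", "site5")`, a query that refers to `account` still works unchanged, which is the location transparency the alias layer is meant to restore.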
[Transactions, e.g. read, write, update; transaction concurrency and atomicity]

Distributed Transactions

• Preserve the ACID properties
• Two types of transactions: local and global

SYSTEM STRUCTURE
• Transaction manager: preserves the ACID properties
  Responsibilities: maintaining the log, concurrency control
• Transaction coordinator: manages and coordinates the various transactions
  Responsibilities: starting transactions, distributing subtransactions, terminating transactions

SYSTEM FAILURE MODES
1. Failure of a site
2. Loss of messages
3. Failure of a communication link
4. Network partition

When data is distributed and a transaction can execute on several nodes, it is called a distributed transaction.

• Access to the various data items in a distributed system is usually accomplished through transactions, which must preserve the ACID properties.
• There are two types of transactions:
– A local transaction accesses and updates data in only one local database.
– A global transaction accesses and updates data in several local databases.

– Ensuring the ACID properties of local transactions is straightforward. Ensuring them for global transactions is much more complicated, because several sites take part and a site or a communication link may fail during execution.

SYSTEM STRUCTURE

A transaction may access data at several sites. Each site contains two subsystems:
• Transaction manager:
Each site has its own local transaction manager, whose function is to ensure the ACID properties of the transactions that execute at that site. The transaction managers cooperate with each other to execute global transactions. Each transaction manager is responsible for:
o Maintaining a log for recovery purposes.
o Participating in coordinating the concurrent execution of the transactions executing at that site.
• Transaction coordinator:
Each site has a transaction coordinator, which is responsible for:
o Starting the execution of transactions that originate at the site.
o Distributing subtransactions to the appropriate sites for execution.
o Coordinating the termination of each transaction that originates at the site, which may result in the transaction being committed at all sites or aborted at all sites, as shown in the figure below.

Fig: System Architecture


SYSTEM FAILURE MODES
• A distributed system may suffer from the same types of failure as a centralized system, such as software errors, hardware errors or disk crashes.
• The additional failure types in a distributed environment are:
– Failure of a site.
– Loss of messages.
  Loss of messages is always a possibility in a distributed system. The system uses transmission-control protocols such as TCP/IP to handle such errors.
– Failure of a communication link.
  Handled by network protocols, which route messages via alternative links.
– Network partition.
  A network is said to be partitioned when it has been split into two or more subsystems that lack any connection between them.
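
The definition of a network partition can be illustrated with a small connectivity check. This is a sketch under simple assumptions: sites are named by strings, links are undirected pairs that are currently working, and the function names `partitions` and `is_partitioned` are invented for the example.

```python
from collections import deque


def partitions(sites, links):
    """Group sites into subsystems that are still connected to each other."""
    adjacency = {site: set() for site in sites}
    for a, b in links:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    groups = []
    for start in sites:
        if start in seen:
            continue
        # Breadth-first search collects every site reachable from `start`.
        group = set()
        queue = deque([start])
        seen.add(start)
        while queue:
            site = queue.popleft()
            group.add(site)
            for neighbour in adjacency[site]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        groups.append(group)
    return groups


def is_partitioned(sites, links):
    """True when the network has split into two or more subsystems."""
    return len(partitions(sites, links)) > 1
```

A single failed link does not necessarily partition the network: if an alternative route exists, messages are rerouted and `is_partitioned` stays false.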

ref:you tube link in GCR -animation


COMMIT PROTOCOLS

Two-Phase Commit
1. Commit protocol
   Phase 1 [prepare, ready, abort]
   Phase 2 [commit, abort]
2. Handling of failures
   Failure of a participating site
   Failure of the coordinator
   Network partition
3. Recovery and concurrency control
Three-Phase Commit
Alternative Models of Transaction Processing
   Sending site protocol
   Receiving site protocol
• Commit protocols are used to ensure atomicity across sites.

• In a local database system, to commit a transaction the transaction manager only has to convey the decision to commit to the recovery manager.

• In a distributed system, however, the transaction manager must convey the decision to commit to all the sites taking part in the transaction.
– When processing is complete at a site, the transaction reaches the partially committed state there and waits for the transaction at all other sites to reach the partially committed state. When the site receives the message that all the sites are ready to commit, it starts to commit: the global commit.
• Commit protocols include:
– One-phase commit
– Two-phase commit
– Three-phase commit

TWO PHASE COMMIT


• Two-phase commit (2PC) assumes the fail-stop model:
– Failed sites simply stop working and do not send incorrect messages.
– Let T be a transaction initiated at site Si, and let the transaction coordinator at Si be Ci.
• Two-phase commit operates in two phases:
– Phase 1, the prepare phase.
– Phase 2, the commit/abort phase or decision phase.
– When T completes its execution, that is, when all the sites at which T has executed inform Ci that T has completed, Ci starts the 2PC protocol.
Phase 1: Prepare Phase
• The coordinator asks all participants to prepare to commit transaction T.
– Ci adds the record <prepare T> to the log and forces the log to stable storage.
– Ci sends a prepare T message to all sites at which T executed.
• On receiving the message, the transaction manager at each site determines whether it can commit the transaction.
– If not, it adds a record <no T> to the log and sends an abort T message to Ci.
– If the transaction can be committed, the site:
  – adds the record <ready T> to the log,
  – forces all records for T to stable storage,
  – sends a ready T message to Ci.

Phase 2: Decision Phase (Commit/Abort Phase)

– T can be committed if Ci received a ready T message from all the participating sites; otherwise T must be aborted.
– The coordinator adds a decision record, <commit T> or <abort T>, to the log and forces the record onto stable storage. Once the record is on stable storage, the decision is irrevocable (even if failures occur).
– The coordinator sends a message to each participant informing it of the decision (commit or abort).
– Participants take the appropriate action locally.
• In some implementations of the 2PC protocol, a site sends an acknowledge T message to the coordinator at the end of the second phase of the protocol.
• When the coordinator has received the acknowledge T message from all the sites, it adds the record <complete T> to the log.
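
The failure-free message flow of both phases can be sketched in Python. This is an illustration only: the classes `Participant` and `Coordinator` and the `can_commit` flag are hypothetical, logs are in-memory lists rather than stable storage, messages are direct method calls, and no failures or timeouts are modelled.

```python
class Participant:
    """A site at which transaction T executed."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # assumed local outcome of T
        self.log = []

    def prepare(self, t):
        """Phase 1: vote on whether T can be committed here."""
        if self.can_commit:
            self.log.append(f"<ready {t}>")  # would be forced to stable storage
            return "ready"
        self.log.append(f"<no {t}>")
        return "abort"

    def decide(self, t, decision):
        """Phase 2: record the coordinator's decision and acknowledge it."""
        self.log.append(f"<{decision} {t}>")
        return "acknowledge"


class Coordinator:
    """The coordinator Ci at the site where T originated."""

    def __init__(self, participants):
        self.participants = participants
        self.log = []

    def run(self, t):
        # Phase 1: log <prepare T>, then collect one vote per participant.
        self.log.append(f"<prepare {t}>")
        votes = [p.prepare(t) for p in self.participants]

        # Phase 2: commit only if every participant answered ready.
        decision = "commit" if all(v == "ready" for v in votes) else "abort"
        self.log.append(f"<{decision} {t}>")
        acks = [p.decide(t, decision) for p in self.participants]
        if all(a == "acknowledge" for a in acks):
            self.log.append(f"<complete {t}>")
        return decision
```

For example, `Coordinator([Participant("S1"), Participant("S2", can_commit=False)]).run("T")` returns `"abort"`, because a single abort vote is enough to abort T at every site.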
HANDLING OF FAILURES

2PC responds in different ways to the various types of failure:

1. Failure of a participating site: if the coordinator detects that a site has failed, it takes the following actions.

– If the site fails before responding with a ready T message to Ci, the coordinator assumes that it responded with an abort T message.
– If the site fails after the coordinator has received the ready T message from the site, the coordinator executes the rest of the commit protocol in the normal fashion, ignoring the failure of the site.
• When a participating site Sk recovers, it examines its log to determine the fate of the transactions that were active at the time of the failure.
• The log contains a <commit T> record: the site executes redo (T).
• The log contains an <abort T> record: the site executes undo (T).
• The log contains a <ready T> record: the site must consult Ci to determine the fate of T.
– If Ci is up, it notifies Sk whether T committed or aborted.
• If T committed, redo (T).
• If T aborted, undo (T).
– If Ci is down, Sk must try to find the fate of T from other sites. It does so by sending a query status T message to all the sites in the system. On receiving such a message, a site consults its log to determine whether T executed there; if so, it notifies Sk of the outcome.
– If no site has information about T, Sk must wait until a site that has the information recovers and conveys the outcome.
• The log contains no control records concerning T:
– This implies that Sk failed before responding to the prepare T message from Ci.
– Sk must execute undo (T).
2. Failure of the coordinator: if the coordinator fails while the commit protocol for T is executing, the participating sites must decide T's fate:
1. If an active site contains a <commit T> record in its log, then T must be committed.
2. If an active site contains an <abort T> record in its log, then T must be aborted.
3. If some active participating site does not contain a <ready T> record in its log, then the failed coordinator Ci cannot have decided to commit T.
• The sites can therefore abort T.
4. If none of the above cases holds, then all active sites must have a <ready T> record in their logs, but no additional control records (such as <abort T> or <commit T>).
• In this case the active sites must wait for Ci to recover to learn the decision.
• Blocking problem: active sites may have to wait for the failed coordinator to recover.
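
The four cases above translate directly into a check over the logs of the active sites. The function name `decide_without_coordinator` and the representation of each log as a list of record strings are assumptions made for this sketch.

```python
def decide_without_coordinator(active_logs, t):
    """Return commit, abort or block, based on the logs of the active sites."""
    # Case 1: some active site already knows that T committed.
    if any(f"<commit {t}>" in log for log in active_logs):
        return "commit"
    # Case 2: some active site already knows that T aborted.
    if any(f"<abort {t}>" in log for log in active_logs):
        return "abort"
    # Case 3: a site that never voted ready means Ci cannot have committed.
    if any(f"<ready {t}>" not in log for log in active_logs):
        return "abort"
    # Case 4: every active site is ready and undecided, so they must wait.
    return "block"
```

The "block" result is the blocking problem of 2PC: the active sites hold their resources until the coordinator recovers.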

NETWORK PARTITION
• In a distributed environment network failures are to be expected, and they can lead to a situation known as network partition.
• When a network partitions, two possibilities exist:
– The coordinator and all its participants remain in one partition. The failure then has no effect on the commit protocol.
– The coordinator and its participants belong to several partitions.
• Sites that are not in the partition containing the coordinator think the coordinator has failed, and execute the protocol to deal with failure of the coordinator.
• No harm results, but these sites may still have to wait for the decision from the coordinator.
• The coordinator and the sites that are in the same partition as the coordinator think that the sites in the other partitions have failed, and follow the usual commit protocol.
• Again, no harm results.

RECOVERY AND CONCURRENCY CONTROL


• We need to deal with in-doubt transactions: transactions for which a <ready T> log record is found, but neither a <commit T> nor an <abort T> log record.
– The recovering site must determine the commit or abort status of such transactions by contacting other sites, as described under handling of failures. This is slow and can block the recovery process.
• To circumvent this problem, recovery algorithms typically provide support for noting lock information in the log.
– Instead of <ready T>, the site writes <ready T, L>, where L is the list of locks held by T when the log record is written.
– For every in-doubt transaction T, all the locks noted in the <ready T, L> log record are reacquired.
• After lock reacquisition, transaction processing can resume; the commit or rollback of in-doubt transactions is performed concurrently with the execution of new transactions.
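
Lock reacquisition during recovery can be sketched as a single scan of the log. The record layout is an assumption for this example: ("ready", T, L) for a <ready T, L> record, and ("commit", T) or ("abort", T) for decision records. The lock table is a plain dictionary from data item to the transaction that holds it.

```python
def in_doubt_locks(log):
    """Map each in-doubt transaction to the locks noted in its ready record."""
    ready = {}
    decided = set()
    for record in log:
        kind, t = record[0], record[1]
        if kind == "ready":
            ready[t] = record[2]
        elif kind in ("commit", "abort"):
            decided.add(t)
    return {t: locks for t, locks in ready.items() if t not in decided}


def reacquire(lock_table, log):
    """Reacquire the locks of every in-doubt transaction, then return the table."""
    for t, locks in in_doubt_locks(log).items():
        for item in locks:
            lock_table[item] = t
    return lock_table
```

Once `reacquire` has run, new transactions can start immediately; they are blocked only on the items still held by in-doubt transactions.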

ALTERNATIVE MODELS OF TRANSACTION PROCESSING


• Persistent Messaging
– Consider, as a starting point, a funds transfer made by bank check.
– Persistent messages are messages that are guaranteed to be delivered to the recipient exactly once (neither less nor more), regardless of failures, if the transaction sending the message commits, and are guaranteed not to be delivered if the transaction aborts.
– Error handling is more complicated with persistent messaging than with two-phase commit. For example, if the account into which the check is to be deposited has been closed, the check must be sent back to the originating account and credited back.
– Both sites must be provided with error-handling code, along with code to handle persistent messages.
• The choice therefore depends on the requirements of the organization: either implement two-phase commit, or eliminate blocking by putting extra effort into implementing persistent messaging.

IMPLEMENTATION OF PERSISTENT MESSAGING


• Persistent messaging can be implemented on top of an unreliable messaging infrastructure, which may lose messages or deliver them multiple times.
1. Sending Site Protocol:
– When a transaction wishes to send a persistent message, it writes a record containing the message to a special relation, messages_to_send, instead of sending the message out directly. The message is given a unique message identifier.
– A message delivery process monitors the relation, and when it finds a new message, it sends the message to its destination. The concurrency control mechanism ensures that the delivery process reads the message only after the transaction that wrote the message commits.
– The message delivery process deletes a message from the relation only after it receives an acknowledgment from the destination site. If it receives no acknowledgment, it sends the message again after some time, and repeats this until an acknowledgment is received. In case of permanent failure, the system decides, after some period of time, that the message is undeliverable.
2. Receiving Site Protocol:
– When a site receives a persistent message, it runs a transaction that adds the message to a special received_messages relation, provided it is not already present in the relation (the unique message identifier allows duplicates to be detected).
– After the transaction commits, or if the message was already present in the relation, the receiving site sends an acknowledgment back to the sending site.
– Multiple deliveries of a message must be expected: in many messaging systems, messages can be delayed arbitrarily, although such delays are very unlikely. To be safe, therefore, the message must never be deleted from the received_messages relation, because deleting it could result in a duplicate delivery going undetected.
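
Both protocols can be sketched together in Python. Everything here is a simplified assumption: the two relations are in-memory dictionaries rather than database tables, the classes `Sender` and `Receiver` and the helper `lossy_ack_channel` are invented for the example, and the unreliable network is simulated by a channel that delivers each message but loses its first acknowledgment.

```python
class Receiver:
    def __init__(self):
        self.received_messages = {}  # message id -> payload, never deleted
        self.processed = []          # payloads acted on, in order

    def receive(self, msg_id, payload):
        """Store and process a message once; acknowledge duplicates too."""
        if msg_id not in self.received_messages:
            self.received_messages[msg_id] = payload
            self.processed.append(payload)
        return True  # acknowledgment


class Sender:
    def __init__(self):
        self.messages_to_send = {}   # message id -> payload
        self._next_id = 0

    def enqueue(self, payload):
        """Called by the sending transaction instead of sending directly."""
        self._next_id += 1
        self.messages_to_send[self._next_id] = payload
        return self._next_id

    def deliver(self, channel, receiver, max_attempts=5):
        """Resend each message until it is acknowledged or attempts run out."""
        for msg_id, payload in list(self.messages_to_send.items()):
            for _ in range(max_attempts):
                if channel(receiver, msg_id, payload):
                    del self.messages_to_send[msg_id]
                    break
            # A message still present here would be reported as undeliverable.


def lossy_ack_channel():
    """Simulated network: the first acknowledgment of each message is lost."""
    lost_once = set()

    def channel(receiver, msg_id, payload):
        ack = receiver.receive(msg_id, payload)
        if msg_id not in lost_once:
            lost_once.add(msg_id)
            return False
        return ack

    return channel
```

With this channel every message reaches the receiver twice, yet `processed` contains it only once, because the stored message identifier lets the receiver detect the duplicate.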

Ref: Sathish CJ Youtube video
