Data Communication Basics CH 7
Data Communication Basics CH 7
Distributed Databases
and
Client-Server Architectures
1
Outline
1. Distributed Database Concepts
2. Data Fragmentation, Replication and Allocation
3. Types of Distributed Database Systems
4. Query Processing
5. Concurrency Control and Recovery
6. 3-Tier Client-Server Architecture
2
1. Distributed Database Concepts
A transaction can be executed by multiple networked
computers in a unified manner.
A distributed database (DDB) processes Unit of execution (a
transaction) in a distributed manner.
A distributed database (DDB) can be defined as :
– A collection of multiple logically related database
distributed over a computer network, and a distributed
database management system as a software system that
manages a distributed database while making the
distribution transparent to the user.
– The physical placement of data (files, relations, etc.) which
is not known to the user (distribution transparency).
3
• The EMPLOYEE, PROJECT, and WORKS_ON tables
may be fragmented horizontally and stored with possible
replication as shown below.
Remark:
• Each site has a DBMS
– Fragments (replicated or unique).
– Linked by network.
– Can handle local users.
– Participates in at least one global
4
requests.
Advantages of DDB :
i. Distribution and Network transparency:
Users do not have to worry about operational details of the
network.
– There is Location transparency, which refers to
freedom of issuing command from any location without
affecting its working.
– Then there is Naming transparency, which allows
access to any names object (files, relations, etc.) from
any location.
ii. Replication transparency:
It allows to store copies of a data at multiple sites as shown
in the above diagram.
This is done to minimize access time to the required data.
iii. Fragmentation transparency:
• Allows to fragment a relation horizontally (create a subset
of rows of a relation) or vertically (create a subset of
columns of a relation).
5
iv. Increased reliability and availability:
Reliability refers to system live time, that is, system is
running efficiently most of the time. Availability is the
probability that the system is continuously available (usable
or accessible) during a time interval.
A distributed database system has multiple nodes
(computers) and if one fails then others are available to do
the job.
v. Improved performance:
A distributed DBMS fragments the database to keep data
closer to where it is needed most.
This reduces data management (access and modification)
time significantly.
vi. Easier expansion (scalability):
Allows new nodes (computers) to be added anytime without
changing the entire configuration.
6
Disadvantages of Distributed Database
7
2. Data Replication and Fragmentation: Distributed data storage
There are two approaches to store the relation in the distributed
database : Replication and Fragmentation
I. Data Replication
The system maintain several identical copies of the relation & store
each copy at a different site
In general it enhance the performance of read operation and
increase the availability of data to read only transaction. However,
update transactions incur greater overhead
II. Data Fragmentation
– Split a relation into logically related and correct parts.
– The main reasons for fragmenting a relation are
•Efficiency- data that is not needed by the local applications is not
stored
•Parallelism – a transaction can be divided into several subqueries that
operate on fragments which will increase the degree of concurrency
– but reconstruction of the whole relation will require accessing data from all
sites containing part of the relation 8
• A relation can be fragmented in two ways:
Horizontal fragmentation
• It is a horizontal subset of a relation which contain those
of rows which satisfy selection conditions.
• Consider the Employee relation with selection condition
(DNO = 5). All rows satisfy this condition will create a
subset which will be a horizontal fragment of Employee
relation.
• A selection condition may be composed of several
conditions connected by AND or OR.
Vertical fragmentation
• It is a subset of a relation which is created by a subset of
columns. Thus a vertical fragment of a relation will
contain values of selected columns.
9
– Consider the Employee relation. A vertical fragment of can be
created by keeping the values of Name, Bdate, Sex, and Address.
– Because there is no condition for creating a vertical fragment,
each fragment must include the primary key attribute of the
parent relation Employee. In this way all vertical fragments of a
relation are connected.
Representation
There are three rules that must be followed during fragmentation
Completeness – if a relation r is decomposed into fragments
r1, r2… rn , each data item that can be found in r must appear
in at least one fragment
Reconstruction – it must be possible to define a relation
operation that will reconstruct the relation r from fragments
Disjointness –if a data item di appears in fragment ri , then it
shouldn’t appear in any other fragment
10
3. Types of Distributed Database Systems
• Homogeneous Window
– All sites of the database Site 5 Unix
Oracle Site 1
system have identical Oracle
setup, i.e., same database Window
system software. Site 4 Communications
– The system may have network
little or no local
autonomy Oracle
– The underlying operating Site 3 Site 2
systems can be a mixture Linux Oracle Linux Oracle
of Linux, Window, Unix,
etc.
11
• Heterogeneous
– At least one of the database must be from different vendor : two variants
– Federated: Each site may run different database system but the data access
is managed through a single conceptual schema.
• This implies that the degree of local autonomy is minimum. Each site
must adhere to a centralized access policy. There may be a global
schema.
– Multidatabase: There is no one conceptual global schema. For data access
a schema is constructed dynamically as needed by the application software.
Network
Object DBMS
Oriented Site 3 Site 2 Relational
12
Linux Linux
4. Query Processing in Distributed Databases
Issues
– Cost of transferring data (files and results) over the network.
• This cost is usually high, so some optimization is necessary.
• Example: suppose there are three sites. Where the relation Employee at site 1,
Department at Site 2 and no relation at site 3
– Employee at site 1. 10,000 rows. Row size = 100 bytes. Table size = 10 6
bytes.
– Department at Site 2. 100 rows. Row size = 35 bytes. Table size = 3,500
bytes.
– And a query is initiated from S3 to retrieve employees [First Name (15 byte long),
Last name (15 byte long) and Department name (10 byte long) total of 40 bytes]
• Q: For each employee, retrieve employee Fname, Lname, and department
name
• Q: Fname,Lname,Dname (Employee Dno = Dnumber Department)
Fname Minit Lname SSN Bdate Address Sex Salary Superssn Dno
14
• Strategies : Minimizing data transfer.
1. Transfer Employee and Department to site 3.
• Total transfer bytes = 1,000,000 + 3500 = 1,003,500 bytes.
2. Transfer Employee to site 2, execute join at site 2 and send
the result to site 3.
• Transferring employees data from site 1 to site 2: 1,000,000 bytes
• Query result size = 40 * 10,000 = 400,000 bytes.
• Total transfer size = 1,000,000 + 400,000 = 1,400,000 bytes.
3. Transfer Department relation to site 1, execute the join at site
1, and send the result to site 3.
• Data Transfer from site 2 to site 1: 3500 bytes
• Query result size = 40 * 10,000 = 400,000 bytes
• Total bytes transferred = 3500+ 400,000 = 403,500
bytes.
– Preferred approach: strategy 3.
15
Example 2 : Consider the query
– Q’: For each department, retrieve the department name ,Fname and LName of
the department manager
• Relational Algebra expression:
– Fname,Lname,Dname (Employee Department)
Mgrssn = SSN
• The result of this query will have 100 tuples, assuming that every department has
a manager, the execution strategies are:
1. Transfer Employee and Department to the result site and perform the join at
site 3.
• Total bytes transferred = 1,000,000 + 3500 = 1,003,500 bytes.
2. Transfer Employee to site 2, execute join at site 2 and send the result to site 3.
• Site 1-- Site 2: 1,000,000
• Site2-- site3: Query result size = 40 * 100 = 4000 bytes.
• Total transfer size = 4000 +1,000,000 = 1,004,000 bytes.
3. Transfer Department relation to site 1, execute join at site 1 and send the result
to site 3.
• Total transfer size = 4000 + 3500 = 7500 bytes.
Preferred strategy: Choose strategy 3.
16
Example 3: Now suppose the result is needed at site2. Possible
strategies :
1. Transfer Employee relation to site 2, execute the query and
present the result to the user at site 2.
• Total transfer size = 1,000,000 bytes for both queries
Q and Q’.
2. Transfer Department relation to site 1, execute join at site 1
and send the result back to site 2.
• Total transfer size for
– Q = 400,000 + 3500 = 403,500 bytes
– Q’ = 4000 + 3500 = 7500 bytes.
17
5. Concurrency Control and Recovery
Distributed Databases encounter a number of concurrency
control and recovery problems which are not present in
centralized databases. Some of them are listed below.
18
– Communication link failure:
• This failure may create network partition which would
affect database availability even though all database
sites may be running.
– Distributed commit:
• A transaction may be fragmented and they may be
executed by a number of sites. This require a two or
three-phase commit approach for transaction commit.
– Distributed deadlock:
• Since transactions are processed at multiple sites, two or
more sites may get involved in deadlock. This must be
resolved in a distributed manner.
19
5. 1 Distributed Concurrency control
i. Primary site technique: A single site is designated as a
primary site which serves as a coordinator for transaction
management.
Primary site
Site 5
Site 1
Site 3 Site 2
20
• Transaction management:
– Concurrency control and commit are managed by this site.
– In two phase locking, this site manages locking and
releasing data items. If all transactions follow two-phase
policy at all sites, then serializability is guaranteed.
– Advantages:
• An extension to the centralized two phase locking so
implementation and management is simple.
• Data items are locked only at one site but they can be
accessed at any site.
– Disadvantages:
• All transaction management activities go to primary site
which is likely to overload the site.
• If the primary site fails, the entire system is inaccessible.
– To aid recovery a backup site is designated which behaves as
a shadow of primary site. In case of primary site failure,
backup site can act as primary site. 21
ii. Primary Copy Technique:
– In this approach, instead of a site, a data item partition is
designated as primary copy. To lock a data item just the
primary copy of the data item is locked.
• Advantages:
– Since primary copies are distributed at various sites, a
single site is not overloaded with locking and unlocking
requests.
• Disadvantages:
– Identification of a primary copy is complex. A distributed
directory must be maintained, possibly at all sites.
22
Recovery from a coordinator failure
• In both approaches a coordinator site or copy may become
unavailable. This will require the selection of a new
coordinator.
– Primary site approach with no backup site:
• Aborts and restarts all active transactions at all sites.
Elects a new coordinator and initiates transaction
processing.
– Primary site approach with backup site:
• Suspends all active transactions, designates the backup
site as the primary site and identifies a new back up site.
• Primary site receives all transaction management
information to resume processing.
– Primary and backup sites fail or no backup site:
• Use election process to select a new coordinator site.
23
iii. Concurrency control based on voting:
– There is no primary copy of coordinator.
– Send lock request to sites that have data item.
– If majority of sites grant lock then the requesting transaction
gets the data item.
– Locking information (grant or denied) is sent to all these
sites.
– To avoid unacceptably long wait, a time-out period is defined.
If the requesting transaction does not get any vote
information then the transaction is aborted.
24
Client-Server Database Architecture
Server 1 Client 1
Client 2
Server 2 Client 3
Server n Client n
25
three-tier client/server architecture.
26
• Clients reach server for desired service, but server does reach
clients.
• The server software is responsible for local data management
at a site, much like centralized DBMS software.
• The client software is responsible for most of the distribution
function.
• The communication software manages communication among
clients and servers.
• The processing of a SQL queries goes as follows:
– Client parses a user query and decomposes it into a number
of independent sub-queries. Each subquery is sent to
appropriate site for execution.
– Each server processes its query and sends the result to the
client.
– The client combines the results of subqueries and produces
the final result. 27