Distributed Database Management Systems
Outline
Introduction
Distributed DBMS Architecture
Distributed Database Design
Distributed Query Processing
Distributed Concurrency Control
Distributed Reliability Protocols
Distributed DBMS 2
Outline
Introduction
What is a distributed DBMS
Problems
Current state-of-affairs
Distributed DBMS Architecture
Distributed Database Design
Distributed Query Processing
Distributed Concurrency Control
Distributed Reliability Protocols
Motivation
[Figure: database technology (integration) + computer network technology (distribution) ⇒ distributed database systems.]
integration ≠ centralization
What is a Distributed Database System?
A distributed database (DDB) is a collection of multiple,
logically interrelated databases distributed over a computer
network.
What is not a DDBS?
Centralized DBMS on a Network
[Figure: five sites connected by a communication network; the database resides at a single site.]
Distributed DBMS Environment
[Figure: five sites connected by a communication network, with data managed at every site.]
Implicit Assumptions
Data stored at a number of sites ⇒ each site logically consists of a single processor.
Processors at different sites are interconnected by a computer network ⇒ no multiprocessors (that is the realm of parallel database systems).
A distributed database is a database, not a collection of files ⇒ data logically related as exhibited in the users’ access patterns (relational data model).
A D-DBMS is a full-fledged DBMS ⇒ not a remote file system, not a TP system.
Distributed DBMS Promises
Improved performance
Transparency
Transparency is the separation of the higher level
semantics of a system from the lower level
implementation issues.
Fundamental issue is to provide data independence in the distributed environment
Network (distribution) transparency
Replication transparency
Fragmentation transparency
horizontal fragmentation: selection
vertical fragmentation: projection
hybrid
Example
EMP
ENO  ENAME      TITLE
E1   J. Doe     Elect. Eng.
E2   M. Smith   Syst. Anal.
E3   A. Lee     Mech. Eng.
E4   J. Miller  Programmer
E5   B. Casey   Syst. Anal.
E6   L. Chu     Elect. Eng.
E7   R. Davis   Mech. Eng.
E8   J. Jones   Syst. Anal.

ASG
ENO  PNO  RESP        DUR
E1   P1   Manager     12
E2   P1   Analyst     24
E2   P2   Analyst     6
E3   P3   Consultant  10
E3   P4   Engineer    48
E4   P2   Programmer  18
E5   P2   Manager     24
E6   P4   Manager     48
E7   P3   Engineer    36
E7   P5   Engineer    23
E8   P3   Manager     40

PROJ
PNO  PNAME              BUDGET
P1   Instrumentation    150000
P2   Database Develop.  135000
P3   CAD/CAM            250000
P4   Maintenance        310000

PAY
TITLE        SAL
Elect. Eng.  40000
Syst. Anal.  34000
Mech. Eng.   27000
Programmer   24000
Transparent Access
SELECT ENAME, SAL
FROM   EMP, ASG, PAY
WHERE  DUR > 12
AND    EMP.ENO = ASG.ENO
AND    PAY.TITLE = EMP.TITLE

[Figure: the query is posed as if against a single database, while a communication network connects Tokyo, Boston, Paris, Montreal, and New York, each site holding fragments of the employee, assignment, and project data (e.g. Paris employees, Boston projects, New York assignments, Montreal projects).]
Distributed Database – User View
[Figure: users see a single distributed database.]
Distributed DBMS - Reality
[Figure: multiple users issue queries and run applications against DBMS software at several sites, tied together by a communication subsystem.]
Potentially Improved Performance
Parallelism in execution
Inter-query parallelism
Intra-query parallelism
Parallelism Requirements
System Expansion
Distributed DBMS Issues
Distributed Database Design
how to distribute the database
replicated & non-replicated database distribution
a related problem in directory management
Query Processing
convert user transactions to data manipulation instructions
optimization problem
min{cost = data transmission + local processing}
general formulation is NP-hard
Distributed DBMS Issues
Concurrency Control
synchronization of concurrent accesses
consistency and isolation of transactions' effects
deadlock management
Reliability
how to make the system resilient to failures
atomicity and durability
Relationship Between Issues
[Figure: directory management, query processing, distribution design, reliability, concurrency control, and deadlock management, with arrows showing how the issues depend on one another.]
Outline
Introduction
Distributed DBMS Architecture
Distributed Database Design
Fragmentation
Data Placement
Design Problem
In the general setting:
Making decisions about the placement of data and programs across the sites of a computer network, as well as possibly designing the network itself.
Distribution Design
Top-down
mostly in designing systems from scratch
Bottom-up
when the databases already exist at a number of sites
Top-Down Design
[Figure: Requirements Analysis yields the system objectives; Conceptual Design and View Design (with user input and view integration) produce the GCS, access information, and the ES’s; Distribution Design (with user input) produces the LCS’s; Physical Design produces the LIS’s.]
Distribution Design Issues
Why fragment at all?
How to fragment?
How to allocate?
Information requirements?
Fragmentation
Can't we just distribute relations?
What is a reasonable unit of distribution?
relation
views are subsets of relations ⇒ locality
extra communication
fragments of relations (sub-relations)
concurrent execution of a number of transactions that
access different portions of a relation
views that cannot be defined on a single fragment will
require extra processing
semantic data control (especially integrity enforcement)
more difficult
Fragmentation Alternatives – Horizontal
PROJ (PNO, PNAME, BUDGET, LOC) is fragmented into:
PROJ1: projects with budgets less than $200,000
PROJ2: the remaining projects
Fragmentation Alternatives – Vertical
PROJ (PNO, PNAME, BUDGET, LOC) is fragmented into:
PROJ1: information about project locations
PROJ2: the remaining attributes
(each fragment retains the key PNO)
Degree of Fragmentation
A finite number of alternatives, ranging from individual tuples or attributes at one extreme to entire relations at the other.
Correctness of Fragmentation
Completeness
Decomposition of relation R into fragments R1, R2, ..., Rn is complete
iff each data item in R can also be found in some Ri
Reconstruction
If relation R is decomposed into fragments R1, R2, ..., Rn, then there
should exist some relational operator ∇ such that
R = ∇1≤i≤nRi
Disjointness
If relation R is decomposed into fragments R1, R2, ..., Rn, and data
item di is in Rj, then di should not be in any other fragment Rk (k ≠ j ).
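The three rules can be checked mechanically for a horizontal fragmentation. A minimal Python sketch (not from the slides), using the PROJ data from the earlier example with union as the reconstruction operator ∇:

```python
# Sketch: checking completeness, reconstruction, and disjointness for a
# horizontal fragmentation of PROJ, using plain Python sets of tuples.
PROJ = {
    ("P1", "Instrumentation", 150000),
    ("P2", "Database Develop.", 135000),
    ("P3", "CAD/CAM", 250000),
    ("P4", "Maintenance", 310000),
}

# Fragment by budget, as in the horizontal-fragmentation example.
PROJ1 = {t for t in PROJ if t[2] < 200000}
PROJ2 = {t for t in PROJ if t[2] >= 200000}
fragments = [PROJ1, PROJ2]

# Completeness: every data item of PROJ appears in some fragment.
complete = all(any(t in f for f in fragments) for t in PROJ)

# Reconstruction: for horizontal fragmentation the operator is union.
reconstructed = set().union(*fragments)

# Disjointness: no tuple appears in two different fragments.
disjoint = all(
    not (fragments[i] & fragments[j])
    for i in range(len(fragments))
    for j in range(i + 1, len(fragments))
)
```

For vertical fragmentation the reconstruction operator would instead be a join on the key.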
Allocation Alternatives
Non-replicated
partitioned : each fragment resides at only one site
Replicated
fully replicated : each fragment at each site
partially replicated : each fragment at some of the sites
Rule of thumb:
Fragmentation
Primary Horizontal
Fragmentation
Definition :
Rj = σFj (R ), 1 ≤ j ≤ w
where Fj is a selection formula.
Therefore,
A horizontal fragment Ri of relation R consists of all
the tuples of R which satisfy a predicate pi.
⇓
Given a set of predicates M, there are as many
horizontal fragments of relation R as there are
predicates.
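As a sketch, primary horizontal fragmentation is one selection per predicate. The data and the two SAL predicates below follow the PHF example; the list-of-tuples representation is an assumption:

```python
# Sketch: primary horizontal fragmentation Rj = σFj(R), with one fragment
# per predicate, applied to the PAY relation from the running example.
PAY = [
    ("Elect. Eng.", 40000),
    ("Syst. Anal.", 34000),
    ("Mech. Eng.", 27000),
    ("Programmer", 24000),
]

predicates = [
    lambda sal: sal <= 30000,   # p1: SAL <= 30000
    lambda sal: sal > 30000,    # p2: SAL > 30000
]

# PAY1 = σ(SAL<=30000)(PAY), PAY2 = σ(SAL>30000)(PAY)
PAY1, PAY2 = ([t for t in PAY if p(t[1])] for p in predicates)
```

With two mutually exclusive predicates the fragments are disjoint and their union restores PAY.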
PHF – Example
[Figure: link graph — SKILL (TITLE, SAL) is the owner of link L1 to EMP (ENO, ENAME, TITLE); EMP and PROJ (PNO, PNAME, BUDGET, LOC) are the owners of links L2 and L3 to ASG (ENO, PNO, RESP, DUR).]
PHF – Example
PAY1                     PAY2
TITLE       SAL          TITLE        SAL
Mech. Eng.  27000        Elect. Eng.  40000
Programmer  24000        Syst. Anal.  34000
PHF – Example
PROJ1: P1  Instrumentation    150000  Montreal
PROJ2: P2  Database Develop.  135000  New York
(further fragments PROJ4 and PROJ6 not shown)
PHF – Correctness
Completeness
Since the set of predicates is complete and minimal, the
selection predicates are complete
Reconstruction
If relation R is fragmented into FR = {R1,R2,…,Rr}
R = ∪∀Ri ∈FR Ri
Disjointness
Predicates that form the basis of fragmentation should be
mutually exclusive.
Derived Horizontal
Fragmentation
Defined on a member relation of a link according
to a selection operation specified on its owner.
Each link is an equijoin.
Equijoin can be implemented by means of semijoins.
[Figure: the same link graph as before — SKILL (TITLE, SAL) is the owner of link L1 to EMP; EMP and PROJ are the owners of links L2 and L3 to ASG.]
DHF – Definition
Given a link L where owner(L) = S and member(L) = R, the
derived horizontal fragments of R are defined as
Ri = R ⋉ Si, 1 ≤ i ≤ w
where Si is a horizontal fragment of the owner relation S.
DHF – Example
Given link L1 where owner(L1)=SKILL and
member(L1)=EMP
EMP1 = EMP ⋉ SKILL1
EMP2 = EMP ⋉ SKILL2
where
SKILL1 = σSAL≤30000(SKILL)
SKILL2 = σSAL>30000(SKILL)

EMP1                           EMP2
ENO  ENAME      TITLE          ENO  ENAME     TITLE
E3   A. Lee     Mech. Eng.     E1   J. Doe    Elect. Eng.
E4   J. Miller  Programmer     E2   M. Smith  Syst. Anal.
E7   R. Davis   Mech. Eng.     E5   B. Casey  Syst. Anal.
                               E6   L. Chu    Elect. Eng.
                               E8   J. Jones  Syst. Anal.
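The derived fragments can be sketched as semijoins in Python; the `semijoin` helper is hypothetical, and the tuples mirror the EMP and SKILL/PAY data given earlier:

```python
# Sketch: derived horizontal fragmentation EMPi = EMP semijoin SKILLi over
# the shared TITLE attribute (toy data from the running example).
EMP = [
    ("E1", "J. Doe", "Elect. Eng."), ("E2", "M. Smith", "Syst. Anal."),
    ("E3", "A. Lee", "Mech. Eng."), ("E4", "J. Miller", "Programmer"),
    ("E5", "B. Casey", "Syst. Anal."), ("E6", "L. Chu", "Elect. Eng."),
    ("E7", "R. Davis", "Mech. Eng."), ("E8", "J. Jones", "Syst. Anal."),
]
SKILL = [("Elect. Eng.", 40000), ("Syst. Anal.", 34000),
         ("Mech. Eng.", 27000), ("Programmer", 24000)]

SKILL1 = [s for s in SKILL if s[1] <= 30000]   # σ(SAL<=30000)(SKILL)
SKILL2 = [s for s in SKILL if s[1] > 30000]    # σ(SAL>30000)(SKILL)

def semijoin(member, owner):
    """member ⋉ owner on TITLE: keep member tuples whose TITLE
    appears in the owner fragment."""
    titles = {s[0] for s in owner}
    return [e for e in member if e[2] in titles]

EMP1 = semijoin(EMP, SKILL1)   # employees whose title pays <= 30000
EMP2 = semijoin(EMP, SKILL2)
```

The semijoin ships only the owner's TITLE values, which is exactly why each link's equijoin can be implemented this way.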
Vertical Fragmentation
Has been studied within the centralized context
design methodology
physical clustering
Vertical Fragmentation
Overlapping fragments
grouping
Non-overlapping fragments
splitting
We do not consider the replicated key attributes to be
overlapping.
Advantage:
Easier to enforce functional dependencies (for integrity
checking etc.)
VF – Correctness
A relation R, defined over attribute set A and key K,
generates the vertical partitioning FR = {R1, R2, …, Rr}.
Completeness
The following should be true for A:
A = ∪ ARi, ∀Ri ∈ FR
Reconstruction
Reconstruction can be achieved by
R = ⋈K Ri, ∀Ri ∈ FR
Disjointness
TID's are not considered to be overlapping since they are maintained by
the system
Duplicated keys are not considered to be overlapping
Fragment Allocation
Problem Statement
Given
F = {F1, F2, …, Fn} fragments
S ={S1, S2, …, Sm} network sites
Q = {q1, q2,…, qq} applications
Find the "optimal" distribution of F to S.
Optimality
Minimal cost
Communication + storage + processing (read & update)
Cost in terms of time (usually)
Performance
Response time and/or throughput
Constraints
Per site constraints (storage & processing)
Allocation Model
General Form
min(Total Cost)
subject to
response time constraint
storage constraint
processing constraint
Decision Variable
xij = 1 if fragment Fi is stored at site Sj, 0 otherwise
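A toy illustration of the 0-1 model: with a handful of fragments and sites, the xij assignment can be found by brute force, minimizing an assumed access cost subject to the storage constraint. All numbers below are made up for illustration (the real problem, as noted later, is NP-complete):

```python
# Sketch: brute-force non-replicated placement of fragments onto sites,
# minimizing a toy access cost subject to per-site storage capacity.
from itertools import product

frag_size = [3, 2, 4]            # storage needed by F1..F3 (assumed units)
site_cap = [5, 6]                # storage capacity of S1, S2
access_cost = [                  # access_cost[i][j]: cost of serving Fi from Sj
    [1, 4],
    [5, 2],
    [3, 3],
]

best, best_cost = None, float("inf")
# Each fragment goes to exactly one site (xij = 1 for one j per i).
for placement in product(range(len(site_cap)), repeat=len(frag_size)):
    used = [0] * len(site_cap)
    for i, j in enumerate(placement):
        used[j] += frag_size[i]
    if any(u > c for u, c in zip(used, site_cap)):
        continue                  # violates the storage constraint
    cost = sum(access_cost[i][j] for i, j in enumerate(placement))
    if cost < best_cost:
        best, best_cost = placement, cost
```

Here the search settles on F1 at S1 and F2, F3 at S2; replication and the response-time/processing constraints would enlarge the model in the obvious way.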
Allocation Model
Total Cost
Allocation Model
Query Processing Cost
Processing component
access cost + integrity enforcement cost + concurrency control cost
Access cost
Allocation Model
Query Processing Cost
Transmission component
cost of processing updates + cost of processing retrievals
Cost of updates:
Σ (over all sites) Σ (over all fragments) (update message cost + …)
Retrieval cost:
Σ (over all fragments) min (over all sites) (cost of retrieval command + …)
Allocation Model
Constraints
Response Time
execution time of query ≤ max. allowable response time for
that query
Storage Constraint (for a site)
Σ (over all fragments) storage requirement of a fragment at that site ≤ storage capacity at that site
Processing Constraint (for a site)
Σ (over all queries) processing load of a query at that site ≤ processing capacity of that site
Allocation Model
Solution Methods
FAP is NP-complete
DAP also NP-complete
Heuristics based on
single commodity warehouse location (for FAP)
knapsack problem
branch and bound techniques
network flow
Allocation Model
Attempts to reduce the solution space
assume all candidate partitionings known; select the “best”
partitioning
Outline
Introduction
Distributed DBMS Architecture
Distributed Database Design
Distributed Query Processing
Query Processing Methodology
Distributed Query Optimization
Distributed Concurrency Control
Distributed Reliability Protocols
Query Processing
[Figure: a query processor translates a high-level user query into low-level data manipulation commands.]
Query Processing Components
Query language that is used
SQL: “intergalactic dataspeak”
Query execution methodology
The steps that one goes through in executing high-level
(declarative) user queries.
Query optimization
How do we determine the “best” execution plan?
Selecting Alternatives
SELECT ENAME
FROM EMP,ASG
WHERE EMP.ENO = ASG.ENO
AND DUR > 37
Strategy 1
ΠENAME(σDUR>37∧EMP.ENO=ASG.ENO (EMP × ASG))
Strategy 2
ΠENAME(EMP ⋈ENO (σDUR>37 (ASG)))
What is the Problem?
[Figure: ASG1 = σENO≤“E3”(ASG) at site 1, ASG2 = σENO>“E3”(ASG) at site 2, EMP1 = σENO≤“E3”(EMP) at site 3, EMP2 = σENO>“E3”(EMP) at site 4; the result is required at site 5.]

Strategy A (reduce at the data, ship partial results):
  Site 1: ASG1’ = σDUR>37(ASG1)        Site 2: ASG2’ = σDUR>37(ASG2)
  ASG1’ → site 3                       ASG2’ → site 4
  Site 3: EMP1’ = EMP1 ⋈ENO ASG1’      Site 4: EMP2’ = EMP2 ⋈ENO ASG2’
  EMP1’ → site 5                       EMP2’ → site 5
  Site 5: result = EMP1’ ∪ EMP2’

Strategy B (ship everything to the result site):
  ASG1, ASG2, EMP1, EMP2 → site 5
  Site 5: result2 = (EMP1 ∪ EMP2) ⋈ENO σDUR>37(ASG1 ∪ ASG2)
Cost of Alternatives
Assume:
size(EMP) = 400, size(ASG) = 1000
tuple access cost = 1 unit; tuple transfer cost = 10 units
Strategy 1
produce ASG': (10+10)∗tuple access cost 20
transfer ASG' to the sites of EMP: (10+10)∗tuple transfer cost 200
produce EMP': (10+10) ∗tuple access cost∗2 40
transfer EMP' to result site: (10+10) ∗tuple transfer cost 200
Total cost 460
Strategy 2
transfer EMP to site 5: 400 ∗ tuple transfer cost       4,000
transfer ASG to site 5: 1000 ∗ tuple transfer cost     10,000
produce ASG': 1000 ∗ tuple access cost                  1,000
join EMP and ASG': 400 ∗ 20 ∗ tuple access cost         8,000
Total cost                                             23,000
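The arithmetic above can be replayed directly; the (10+10) terms assume 10 qualifying ASG tuples per fragment, as stated:

```python
# Sketch reproducing the cost comparison with the stated unit costs.
tuple_access, tuple_transfer = 1, 10

# Strategy 1 (ship reduced fragments; 10 matching ASG tuples per fragment).
s1 = ((10 + 10) * tuple_access          # produce ASG' at sites 1 and 2
      + (10 + 10) * tuple_transfer      # ship ASG' to the EMP sites
      + (10 + 10) * tuple_access * 2    # produce EMP' (join at sites 3 and 4)
      + (10 + 10) * tuple_transfer)     # ship EMP' to the result site

# Strategy 2 (ship everything to site 5, then select and join there).
s2 = (400 * tuple_transfer              # ship EMP
      + 1000 * tuple_transfer           # ship ASG
      + 1000 * tuple_access             # produce ASG'
      + 400 * 20 * tuple_access)        # nested-loop join EMP with ASG'
```

The fifty-fold gap (460 vs. 23,000) is driven almost entirely by the transfer cost being 10× the access cost.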
Query Optimization Objectives
Minimize a cost function
I/O cost + CPU cost + communication cost
These might have different weights in different
distributed environments
Wide area networks
communication cost will dominate
low bandwidth
low speed
high protocol overhead
most algorithms ignore all other cost components
Local area networks
communication cost not that dominant
total cost function should be considered
Can also maximize throughput
Query Optimization Issues –
Types of Optimizers
Exhaustive search
cost-based
optimal
combinatorial complexity in the number of relations
Heuristics
not optimal
regroup common sub-expressions
perform selection, projection first
replace a join by a series of semijoins
reorder operations to reduce intermediate relation size
optimize individual operations
Query Optimization Issues –
Optimization Granularity
Single query at a time
cannot use common intermediate results
Multiple queries at a time
efficient if many similar queries
decision space is much larger
Query Optimization Issues –
Optimization Timing
Static
compilation ⇒ optimize prior to the execution
difficult to estimate the size of the intermediate results ⇒
error propagation
can amortize over many executions
R*
Dynamic
run time optimization
exact information on the intermediate relation sizes
have to reoptimize for multiple executions
Distributed INGRES
Hybrid
compile using a static algorithm
if the error in estimate sizes > threshold, reoptimize at run
time
MERMAID
Query Optimization Issues –
Statistics
Relation
cardinality
size of a tuple
fraction of tuples participating in a join with another relation
Attribute
cardinality of domain
actual number of distinct values
Common assumptions
independence between different attribute values
uniform distribution of attribute values within their domain
Query Optimization
Issues – Decision Sites
Centralized
single site determines the “best” schedule
simple
need knowledge about the entire distributed database
Distributed
cooperation among sites to determine the schedule
need only local information
cost of cooperation
Hybrid
one site determines the global schedule
each site optimizes the local subqueries
Query Optimization Issues –
Network Topology
Wide area networks (WAN) – point-to-point
characteristics
low bandwidth
low speed
high protocol overhead
communication cost will dominate; ignore all other cost factors
global schedule to minimize communication cost
local schedules according to centralized query optimization
[Figure: two-level optimization — Query Decomposition (using the GLOBAL SCHEMA) produces a fragment query; Global Optimization (using STATS ON FRAGMENTS) produces the optimized local queries.]
Step 1 – Query Decomposition
Input : Calculus query on global relations
Normalization
manipulate query quantifiers and qualification
Analysis
detect and reject “incorrect” queries
possible for only a subset of relational calculus
Simplification
eliminate redundant predicates
Restructuring
calculus query ⇒ algebraic query
more than one translation is possible
use transformation rules
Normalization
Lexical and syntactic analysis
check validity (similar to compilers)
check for attributes and relations
type checking on the qualification
Put into normal form
Conjunctive normal form
(p11∨p12∨…∨p1n) ∧…∧ (pm1∨pm2∨…∨pmn)
Disjunctive normal form
(p11∧p12 ∧…∧p1n) ∨…∨ (pm1 ∧pm2∧…∧ pmn)
OR's mapped into union
AND's mapped into join or selection
Analysis
Refute incorrect queries
Type incorrect
If any of its attribute or relation names are not defined
in the global schema
If operations are applied to attributes of the wrong type
Semantically incorrect
Components do not contribute in any way to the
generation of the result
Only a subset of relational calculus queries can be
tested for correctness
Those that do not contain disjunction and negation
To detect
connection graph (query graph)
join graph
Simplification
Why simplify?
Remember the example
How? Use transformation rules
elimination of redundancy
idempotency rules
p1 ∧ ¬( p1) ⇔ false
p1 ∧ (p1 ∨ p2) ⇔ p1
p1 ∨ false ⇔ p1
…
application of transitivity
use of integrity rules
Simplification – Example
SELECT TITLE
FROM EMP
WHERE EMP.ENAME = “J. Doe”
OR (NOT(EMP.TITLE = “Programmer”)
AND (EMP.TITLE = “Programmer”
OR EMP.TITLE = “Elect. Eng.”)
AND NOT(EMP.TITLE = “Elect. Eng.”))
⇓
SELECT TITLE
FROM EMP
WHERE EMP.ENAME = “J. Doe”
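Since the qualification contains only three atomic predicates, the idempotency-based simplification can be verified by brute force over all truth assignments; this sketch treats the predicates as opaque booleans:

```python
# Sketch: truth-table check that
#   p1 OR (NOT p2 AND (p2 OR p3) AND NOT p3)  <=>  p1
# where p1 = (ENAME = "J. Doe"), p2 = (TITLE = "Programmer"),
# p3 = (TITLE = "Elect. Eng.").
from itertools import product

def original(p1, p2, p3):
    return p1 or ((not p2) and (p2 or p3) and (not p3))

def simplified(p1, p2, p3):
    return p1

equivalent = all(
    original(*vals) == simplified(*vals)
    for vals in product([False, True], repeat=3)
)
```

The second disjunct is unsatisfiable: (p2 ∨ p3) forces one of p2, p3 true while ¬p2 ∧ ¬p3 forbids both, so only p1 remains.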
Restructuring
Convert relational calculus to relational algebra; make use of query trees.

Example: Find the names of employees other than J. Doe who worked on
the CAD/CAM project for either one or two years.

SELECT ENAME
FROM   EMP, ASG, PROJ
WHERE  EMP.ENO = ASG.ENO
AND    ASG.PNO = PROJ.PNO
AND    ENAME ≠ “J. Doe”
AND    PNAME = “CAD/CAM”
AND    (DUR = 12 OR DUR = 24)

[Query tree: ΠENAME (project) over σDUR=12 OR DUR=24, σPNAME=“CAD/CAM”, σENAME≠“J. Doe” (selects), over ⋈PNO, over ⋈ENO (joins), with leaves PROJ, ASG, EMP.]
Restructuring – Transformation Rules
Commutativity of binary operations
R × S ⇔ S × R
R ⋈ S ⇔ S ⋈ R
R ∪ S ⇔ S ∪ R
Associativity of binary operations
(R × S) × T ⇔ R × (S × T)
(R ⋈ S) ⋈ T ⇔ R ⋈ (S ⋈ T)
Idempotence of unary operations
ΠA′(ΠA″(R)) ⇔ ΠA′(R)
σp1(A1)(σp2(A2)(R)) ⇔ σp1(A1) ∧ p2(A2)(R)
where R[A], A′ ⊆ A, A″ ⊆ A, and A′ ⊆ A″
Commuting selection with projection
Restructuring – Transformation Rules
Commuting selection with binary operations
σp(A)(R × S) ⇔ (σp(A)(R)) × S
σp(Ai)(R ⋈(Aj,Bk) S) ⇔ (σp(Ai)(R)) ⋈(Aj,Bk) S
σp(Ai)(R ∪ T) ⇔ σp(Ai)(R) ∪ σp(Ai)(T)
where Ai belongs to R and T
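The third rule (selection commutes with union) can be spot-checked on toy sets; the tuples and the DUR > 12 predicate below are assumptions:

```python
# Sketch: verifying σp(R ∪ T) = σp(R) ∪ σp(T) on small sets of tuples.
R = {("E1", 12), ("E2", 24), ("E3", 48)}
T = {("E2", 24), ("E4", 6)}

def select(rel, pred):
    return {t for t in rel if pred(t)}

p = lambda t: t[1] > 12   # σ(DUR>12)

lhs = select(R | T, p)              # σp(R ∪ T)
rhs = select(R, p) | select(T, p)   # σp(R) ∪ σp(T)
```

This rule is what lets selections be pushed below the fragment unions during data localization.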
Example
Recall the previous example: Find the names of employees other than
J. Doe who worked on the CAD/CAM project for either one or two years.

[Query trees: the original tree applies ΠENAME over σDUR=12 OR DUR=24 above the joins ⋈ENO and ⋈PNO; an equivalent tree pushes the selections down toward the leaves and merges the two joins into a single join on PNO ∧ ENO.]
Restructuring
[Restructured query tree: ΠENAME over ⋈PNO, with ΠPNO,ENAME applied below the ⋈ENO to reduce intermediate results.]
Step 2 – Data Localization
Example
Assume EMP is fragmented into EMP1, EMP2, EMP3 as follows:
  EMP1 = σENO≤“E3”(EMP)
  EMP2 = σ“E3”<ENO≤“E6”(EMP)
  EMP3 = σENO≥“E6”(EMP)
and ASG is fragmented into ASG1 and ASG2 as follows:
  ASG1 = σENO≤“E3”(ASG)
  ASG2 = σENO>“E3”(ASG)

[Localized query tree: the earlier tree — ΠENAME, σDUR=12 OR DUR=24, σPNAME=“CAD/CAM”, σENAME≠“J. DOE”, ⋈PNO, ⋈ENO — with EMP replaced by EMP1 ∪ EMP2 ∪ EMP3 and ASG replaced by ASG1 ∪ ASG2.]
Eliminates Unnecessary Work
Reduction for PHF
Reduction with selection
Relation R and FR = {R1, R2, …, Rw} where Rj = σpj(R):
σpi(Rj) is empty if no tuple of R can satisfy both pi and pj.
Example: σENO=“E5” applied to EMP1 ∪ EMP2 ∪ EMP3 reduces to σENO=“E5”(EMP2), since “E5” falls only within EMP2’s range.
Reduction for PHF
Reduction with join
Possible if fragmentation is done on join attribute
Distribute join over union
Reduction for PHF
Reduction with join – Example
Assume EMP is fragmented as before and
ASG1 = σENO≤“E3”(ASG)
ASG2 = σENO>“E3”(ASG)
Consider the query
SELECT *
FROM   EMP, ASG
WHERE  EMP.ENO = ASG.ENO
Distributing the join over the fragment unions gives
(EMP1 ⋈ENO ASG1) ∪ (EMP2 ⋈ENO ASG2) ∪ (EMP3 ⋈ENO ASG2);
the other cross-fragment joins are empty because their ENO ranges do not overlap.
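The reduction can be sketched as follows: distribute the join over the fragment unions, then prune the terms whose ENO ranges cannot overlap. The single-letter payload values are toy data:

```python
# Sketch: EMP join ASG on ENO, distributed over horizontal fragments;
# cross-fragment terms with disjoint ENO ranges are empty and can be
# pruned before any data is shipped.
EMP1 = [("E1", "x"), ("E3", "y")]      # ENO <= "E3"
EMP2 = [("E5", "z")]                   # "E3" < ENO <= "E6"
EMP3 = [("E7", "w")]                   # ENO >= "E6"
ASG1 = [("E1", "P1"), ("E3", "P3")]    # ENO <= "E3"
ASG2 = [("E5", "P2"), ("E7", "P3")]    # ENO > "E3"

def join(emp, asg):
    return [(e[0], a[1]) for e in emp for a in asg if e[0] == a[0]]

# Naive distribution: 3 x 2 = 6 partial joins ...
all_terms = [join(e, a) for e in (EMP1, EMP2, EMP3) for a in (ASG1, ASG2)]
# ... but only the range-compatible pairs can produce tuples:
pruned = join(EMP1, ASG1) + join(EMP2, ASG2) + join(EMP3, ASG2)

full = join(EMP1 + EMP2 + EMP3, ASG1 + ASG2)
```

Pruning cuts the six partial joins to three without changing the result, which is the whole point of fragmenting on the join attribute.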
Reduction for VF
Find useless (not empty) intermediate relations
Relation R defined over attributes A = {A1, ..., An}, vertically fragmented as Ri = ΠA' (R) where A' ⊆ A:
ΠD,K(Ri) is useless if the set of projection attributes D is not in A'.
Example: EMP1 = ΠENO,ENAME (EMP); EMP2 = ΠENO,TITLE (EMP)
SELECT ENAME
FROM   EMP
The localized query ΠENAME(EMP1 ⋈ENO EMP2) reduces to ΠENAME(EMP1): EMP2 contributes no attribute of the result and is useless.
Step 3 – Global Query
Optimization
Input: Fragment query
Find the best (not necessarily optimal) global
schedule
Minimize a cost function
Distributed join processing
Bushy vs. linear trees
Which relation to ship where?
Ship-whole vs ship-as-needed
Decide on the use of semijoins
Semijoin saves on communication at the expense of
more local processing.
Join methods
nested loop vs ordered joins (merge join or hash join)
Cost-Based Optimization
Solution space
The set of equivalent algebra expressions (query trees).
Cost function (in terms of time)
I/O cost + CPU cost + communication cost
These might have different weights in different distributed
environments (LAN vs WAN).
Can also maximize throughput
Search algorithm
How do we move inside the solution space?
Exhaustive search, heuristic algorithms (iterative
improvement, simulated annealing, genetic,…)
Query Optimization Process
[Figure: the input query is expanded into equivalent query execution plans (QEPs) via transformation rules; the search strategy, guided by the cost model, selects the best QEP.]
Search Space
Search space characterized by alternative execution plans; focus on join trees.
[Figure: three join trees for the same query, e.g. (EMP ⋈ENO ASG) ⋈PNO PROJ, (PROJ ⋈PNO ASG) ⋈ENO EMP, and (PROJ × EMP) ⋈ ASG.]
Search Space
Restrict by means of heuristics
Perform unary operations before binary operations
…
[Figure: a linear (left-deep) join tree over R1, R2, R3, R4 versus a bushy join tree over the same relations.]
Search Strategy
How to “move” in the search space.
Deterministic
Start from base relations and build plans by adding one
relation at each step
Dynamic programming: breadth-first
Greedy: depth-first
Randomized
Search for optimalities around a particular starting point
Trade optimization time for execution time
Better when > 5-6 relations
Simulated annealing
Iterative improvement
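A minimal sketch of the greedy (depth-first) flavour: start from the smallest relation and repeatedly extend the plan with the smallest connected relation. The cardinalities are assumptions, and a real optimizer would cost whole subplans rather than compare base cardinalities:

```python
# Sketch: greedy join-ordering heuristic over a join graph,
# adding one relation at each step (never a Cartesian product).
cards = {"EMP": 400, "ASG": 1000, "PROJ": 4}
edges = {("EMP", "ASG"), ("ASG", "PROJ")}   # join graph: EMP -ENO- ASG -PNO- PROJ

def connected(r, s):
    return (r, s) in edges or (s, r) in edges

def greedy_order(cards):
    remaining = dict(cards)
    order = [min(remaining, key=remaining.get)]   # start from smallest relation
    del remaining[order[0]]
    while remaining:
        # only consider relations joinable with something already chosen
        candidates = [r for r in remaining
                      if any(connected(r, s) for s in order)]
        nxt = min(candidates, key=lambda r: remaining[r])
        order.append(nxt)
        del remaining[nxt]
    return order

order = greedy_order(cards)
```

Dynamic programming would instead enumerate all connected subsets breadth-first, trading optimization time for plan quality.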
Distributed DBMS 93
Search Strategies
Deterministic
[Figure: a left-deep plan grown one relation at a time — R1 ⋈ R2, then (R1 ⋈ R2) ⋈ R3, then ((R1 ⋈ R2) ⋈ R3) ⋈ R4.]
Randomized
[Figure: a move in the search space exchanging relations, e.g. from (R1 ⋈ R2) ⋈ R3 to (R1 ⋈ R3) ⋈ R2.]
Cost Functions
Total Time (or Total Cost)
Reduce each cost (in terms of time) component individually
Do as little of each cost component as possible
Optimizes the utilization of the resources
Response Time
Do as many things as possible in parallel
May increase total time because of increased total activity
Total Cost
Summation of all cost factors
Total Cost Factors
Wide area network
message initiation and transmission costs high
local processing cost is low (fast mainframes or
minicomputers)
ratio of communication to I/O costs = 20:1
Response Time
Elapsed time between the initiation and the completion of a
query
Example
[Figure: site 1 ships x units to site 3, and site 2 ships y units to site 3.]
SFσ(p(Ai) ∨ p(Aj)) = SFσ(p(Ai)) + SFσ(p(Aj)) − (SFσ(p(Ai)) ∗ SFσ(p(Aj)))
SFσ(A ∈ {values}) = SFσ(A = value) ∗ card({values})
Cartesian Product
card(R × S) = card(R) ∗ card(S)
Union
upper bound: card(R ∪ S) = card(R) + card(S)
lower bound: card(R ∪ S) = max{card(R), card(S)}
Set Difference
upper bound: card(R − S) = card(R)
lower bound: 0
Semijoin
card(R ⋉A S) = SF⋉(S.A) ∗ card(R)
where
SF⋉(R ⋉A S) = SF⋉(S.A) = card(ΠA(S)) / card(dom[A])
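The estimates above translate directly into small helper functions (the function names are assumptions):

```python
# Sketch: the cardinality estimates as functions.
def card_product(cr, cs):
    return cr * cs

def card_union_bounds(cr, cs):
    return max(cr, cs), cr + cs   # (lower, upper)

def card_difference_bounds(cr):
    return 0, cr                  # (lower, upper)

def sf_semijoin(distinct_a_in_s, dom_a):
    # SF(R semijoin_A S) = card(Pi_A(S)) / card(dom[A])
    return distinct_a_in_s / dom_a

def card_semijoin(cr, distinct_a_in_s, dom_a):
    return sf_semijoin(distinct_a_in_s, dom_a) * cr
```

The semijoin estimate is the one that matters for distributed plans: it predicts how much of R survives reduction before any tuples are shipped.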
Execute joins
2.1 Determine the possible ordering of joins
[Join graph: EMP —ENO— ASG —PNO— PROJ.]
Orderings that start from the connected joins EMP ⋈ ASG or PROJ ⋈ ASG are retained; orderings requiring a Cartesian product (EMP × PROJ, PROJ × EMP) are pruned.
Ordering joins
Distributed INGRES
System R*
Semijoin ordering
SDD-1
Consider PROJ ⋈PNO ASG ⋈ENO EMP
[Figure: EMP at site 1, ASG at site 2, PROJ at site 3, linked by the join graph EMP —ENO— ASG —PNO— PROJ.]
One enumerated program (step 5):
  EMP → site 2
  PROJ → site 2
  Site 2 computes EMP ⋈ PROJ ⋈ ASG
Alternatives:
1. Do the join R ⋈A S
2. Perform one of the semijoin equivalents
   R ⋈A S ⇔ (R ⋉A S) ⋈A S
          ⇔ R ⋈A (S ⋉A R)
          ⇔ (R ⋉A S) ⋈A (S ⋉A R)
Consider the semijoin (R ⋉A S) ⋈A S, with R at site 1 and S at site 2:
   S' ← ΠA(S)
   S' → site 1
   Site 1 computes R' = R ⋉A S'
   R' → site 2
   Site 2 computes R' ⋈A S
Semijoin is better if
   size(ΠA(S)) + size(R ⋉A S) < size(R)
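The five-step semijoin program can be sketched end-to-end, with assumed tuple and value sizes to make the benefit test concrete:

```python
# Sketch: executing (R semijoin_A S) join_A S with R at site 1 and S at
# site 2, counting shipped units under assumed per-item sizes.
R = [("E1", "x"), ("E2", "y"), ("E3", "z"), ("E4", "w")]   # at site 1
S = [("E1", 10), ("E3", 30)]                                # at site 2
VALUE_SIZE, TUPLE_SIZE = 1, 2   # assumed sizes, in arbitrary units

# Steps 1-2: S' <- Pi_A(S), ship S' to site 1
S_proj = {s[0] for s in S}
shipped = len(S_proj) * VALUE_SIZE

# Steps 3-4: site 1 computes R' = R semijoin_A S', ships R' to site 2
R_reduced = [r for r in R if r[0] in S_proj]
shipped += len(R_reduced) * TUPLE_SIZE

# Step 5: site 2 computes R' join_A S
result = [(r[0], r[1], s[1]) for r in R_reduced for s in S if r[0] == s[0]]

# Semijoin beats shipping R whole if
# size(Pi_A(S)) + size(R semijoin_A S) < size(R)
semijoin_wins = shipped < len(R) * TUPLE_SIZE
```

Here only half of R survives the reduction, so the semijoin ships 6 units instead of 8; with a poorly selective semijoin the inequality flips and the plain join is cheaper.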
Statistics used:
1. relation cardinality; 2. number of unique values per attribute; 3. join selectivity factor; 4. size of the projection on each join attribute; 5. attribute size and tuple size.
Fetch as needed
number of messages = O(cardinality of external relation)
data transfer per message is minimal
better if relations are large and the selectivity is good