IT3306 - 04 - Distributed DB Systems
© e-Learning Centre, UCSC
Overview
Transparency
• In general, transparency means hiding implementation details
from the end user.
• Several types of transparency are defined for distributed
databases because the data is distributed across
multiple nodes.
i. Location transparency : Commands issued are not
changed according to the location of data or the
node.
ii. Naming transparency: When a name is associated
with an object, the object can be accessed without
giving additional details such as the location of data.
Transparency Cont.
iii. Replication transparency : User is not aware of the
replicas that are available in multiple nodes in order to
provide better performance, availability and reliability.
iv. Fragmentation transparency: User is not aware of
the fragments available.
v. Design transparency: User is unaware of the design
of the distributed database while performing
transactions.
vi. Execution transparency: User is unaware of the
transaction execution details.
4.1 Distributed Database Concepts, Components and
Advantage
Scalability
Autonomy
• The extent to which a single node (database) can
operate independently is referred to as
autonomy.
• Higher flexibility is given to the nodes when there is high
autonomy.
• Autonomy can be applied in many aspects such as,
- Design autonomy: Independence of data model usage
and transaction management techniques.
- Communication autonomy: The extent to which each
node can decide on sharing of information with other
nodes.
- Execution autonomy: Independence of users to operate
as they prefer.
Advantages of DDB
1. Improves flexibility of application development
- Application development and maintenance can be
carried out from different physical locations.
2. Improves availability
- Faults are isolated to the site of origin without
disturbing the other nodes connected.
- Even though a single node fails, the other nodes
continue to operate without failing the entire system.
(However, in a centralized system, failure at a single
site makes the whole system unavailable to all
users). Therefore, availability is improved with a
DDB.
Advantages of DDB Cont.
3. Improves performance
- Data items are stored closer to where they are needed
the most. This reduces contention for CPU and I/O
services and lowers the access delays involved in
wide area networks.
- Since each node holds only a partition of the entire
DB, the number of transactions executed in each
site is smaller compared to the situation where all
transactions are submitted to a single centralized
database.
- Execution of queries in parallel by executing multiple
queries at different sites, or by splitting the query into
a number of subqueries also improves the
performance.
Activity
Activity
4.2 Types of Distributed Database Systems
[Figure: classification of distributed database systems along the axes of autonomy and heterogeneity, with centralized database systems and pure distributed database systems marked]
4.3 Distributed Database Design Techniques
Fragmentation
● As the name implies, in distributed architecture, separate
portions of data should be stored in different nodes.
● Initially, we have to identify the basic logical unit of data.
In a relational database, relations are the simplest logical
unit.
● Fragmentation is a process of dividing the whole
database into various sub relations so that data can be
stored in different systems.
Example
● Suppose we have a relational database schema with three
tables (EMPLOYEE, DEPARTMENT, WORKS_ON), which
we need to partition in order to store them on several
nodes.
● Assume there are no replications allowed (data replication
allows storage of certain data in more than one place to
gain availability and reliability).
Employee(FNAME, LNAME, SSN, BDATE, ADDRESS)
Works_on(ESSN, DNO, HOURS)
Department(DNO, DNAME, LOCATION)
Example - Approach 01
● One approach to data distribution is storing each relation
at a different site.
In the following example, the Employee table is stored in
Node 1, the Department table in Node 2 and the Works_on
table in Node 3.
Department (Node 2)
DNO   DNAME          LOCATION
d4    Headquarters   Kandy
d3    Finance        Colombo
d8    Research       Rathnapura
Horizontal Fragmentation
• A subset of the rows in a relation is known as a horizontal
fragment or shard.
• The subset of tuples is selected by a condition on one
or more attributes.
• With horizontal fragmentation, we divide a table
horizontally into subsets of tuples, each of which has a logical
meaning.
• These fragments are then assigned to different nodes in the
distributed system.
• Each horizontal fragment of a relation R can be specified in
relational algebra by the operation $\sigma_{C_i}(R)$, where $C_i$ is the
selection condition and R is the relation.
• The original relation is reconstructed by taking the
union of all fragments.
• Ex:- If we want to store sales employee details and marketing
employee details separately in 2 nodes, we can use horizontal
fragmentation.
Explanation
• The original table (Employee) is divided into two subsets of
rows.
• The first horizontal fragment (Sales_employee)
consists of the details of employees who work in the
sales department.
Sales_employee = $\sigma_{\text{Department} = \text{"sales"}}(\text{Employee})$
• The second horizontal fragment
(Marketing_employee) consists of the details of employees
who work in the marketing department.
Marketing_employee = $\sigma_{\text{Department} = \text{"marketing"}}(\text{Employee})$
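The selection and union steps described above can be sketched in Python. This is a minimal illustration, not part of the original slides: the sample rows, the `select` and `union` helper names, and the lowercase attribute names are all assumptions made for the example.

```python
# Hypothetical sample rows; names and values are illustrative only.
employee = [
    {"name": "Asha", "department": "sales", "salary": 41000},
    {"name": "Bimal", "department": "marketing", "salary": 43000},
    {"name": "Chen", "department": "sales", "salary": 39000},
]


def select(relation, condition):
    """Relational selection: keep the rows that satisfy the condition."""
    return [row for row in relation if condition(row)]


def union(*fragments):
    """Union of fragments, dropping duplicate rows."""
    seen, result = set(), []
    for fragment in fragments:
        for row in fragment:
            key = tuple(sorted(row.items()))
            if key not in seen:
                seen.add(key)
                result.append(row)
    return result


# One horizontal fragment per department value.
sales_employee = select(employee, lambda r: r["department"] == "sales")
marketing_employee = select(employee, lambda r: r["department"] == "marketing")

# Reconstruction: the union of all fragments gives back every original row.
reconstructed = union(sales_employee, marketing_employee)
```

Because the two conditions are disjoint and together cover every row in the sample, the union returns each original row exactly once.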
Vertical Fragmentation
• With vertical fragmentation, we divide a table by
columns.
• There can be situations where we do not need to store all
the attributes of a relation at a certain site.
• Vertical fragmentation lets us keep only the
required columns of a relation at a
single site.
• Every vertical fragment must include the primary
key or some other unique key attribute. Otherwise, we cannot
reconstruct the original table by joining the fragments.
• Ex:- If we want to store employees’ pay details and department
details separately in 2 nodes, we can use vertical fragmentation.
Explanation
• The original table (Employee) is divided into two subsets of
columns.
• The first vertical fragment (Pay_data) consists of the
salary details of employees.
Pay_data = $\pi_{\text{name, salary}}(\text{Employee})$
• The second vertical fragment (Dept_data) consists of the
department details of employees.
Dept_data = $\pi_{\text{name, Department}}(\text{Employee})$
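The projection step, and the reason a key must appear in every vertical fragment, can be sketched in Python. This is an illustrative sketch only: the sample rows and the `project` and `join_on_key` helpers are assumptions, and `name` is assumed to be unique so that it can act as the key shared by both fragments.

```python
# Hypothetical sample rows; "name" is assumed to be unique (acts as the key).
employee = [
    {"name": "Asha", "department": "sales", "salary": 41000},
    {"name": "Bimal", "department": "marketing", "salary": 43000},
]


def project(relation, attributes):
    """Relational projection: keep only the listed attributes of each row."""
    return [{a: row[a] for a in attributes} for row in relation]


def join_on_key(left, right, key):
    """Join two fragments on a shared key attribute."""
    right_by_key = {row[key]: row for row in right}
    return [
        {**row, **right_by_key[row[key]]}
        for row in left
        if row[key] in right_by_key
    ]


# Both vertical fragments keep the key attribute "name".
pay_data = project(employee, ["name", "salary"])
dept_data = project(employee, ["name", "department"])

# Reconstruction: join the fragments on the shared key.
reconstructed = join_on_key(pay_data, dept_data, "name")
```

If `name` were dropped from either fragment, `join_on_key` would have nothing to match on, and there would be no way to tell which salary belongs to which department row.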
Mixed Fragmentation
• Another fragmentation technique is hybrid (mixed)
fragmentation, which combines
horizontal and vertical fragmentation.
• For example, take the Employee table that we used
before.
• The Employee table is split vertically into payment data and
department data (vertical fragmentation).
• The department data fragment is then split again by
department (horizontal fragmentation).
• The resulting fragments, with data, are shown on the next
slide.
[Figure: mixed fragmentation of the Employee table. Surviving rows: Kirushanthi 45900, Anna 47900]
Activity
Replication
• The main purpose of having data replicated in several
nodes is to ensure the availability of data.
• One extreme of data replication is having a copy of the
entire database at every node (full replication).
• The other extreme is not having replication at all. Here,
every data item is stored only at one site. (no replication)
Replication Cont.
• Full replication
-Full replication gives a higher degree
of availability, because the entire system
keeps running even with only one site up:
every site contains the whole DB.
-Another advantage is improved performance of read
queries, as results can be obtained from any site
by processing the query locally at the site where it was submitted.
-However, full replication has drawbacks.
-One is degraded write performance, because
each update must be applied to every copy of
the data to maintain consistency.
-Another is that concurrency control and recovery
techniques become more complex and expensive.
Replication Cont.
• No replication
-When there is no replication, all fragments must be
disjoint (no tuple of a relation R is stored at more
than one site). The primary key is still
repeated across vertical or mixed
fragments.
-Also known as non-redundant allocation.
-Suitable for systems with high write traffic.
-A lower degree of availability is a disadvantage of no
replication.
Replication Cont.
• To get a balance between the pros and cons we
discussed, we can select a degree of replication suitable
for our application.
• Some fragments of the database may be replicated, and
others may not according to the requirements.
• It is also possible to have some fragments replicated in all
the nodes in the distributed system.
• In any case, all the replicas should be synchronized when an
update takes place.
Allocation
• There cannot be any fragment that is not assigned to a site in
a DDB.
• The process of distributing data into nodes is known as
data allocation.
• The choice of the site to hold each fragment
and of the number of replicas of each fragment
depends on:
- Performance requirement of the system
- Types of transactions
- Availability goals
- Transaction frequency
Allocation Cont.
Consider the following scenarios and the suggested
allocation mechanisms:
• High availability is required and retrievals are
frequent:
- A fully replicated database is recommended.
• A subsection of the data is retrieved frequently:
- Allocating the required fragment to
multiple sites is recommended.
• A high number of updates is performed:
- A small number of replicas is recommended.
However, it is hard to find an optimal solution to distributed
data allocation since it is a complex optimization problem.
Activity
Distribution Models
When the data volume increases, we can add more nodes
within our distributed database system to handle it. There are
different models for distributing data among these nodes.
1. Single server
• This is the minimum form of distribution and most often the
recommended option.
• Here, the database will be running in a single server without
any distribution.
• Since all read and write operations occur at a single node, it
would reduce the complexity by making the management
process easy.
Distribution Models Cont.
• The new master can be appointed either
automatically or manually.
• The disadvantage of having replicated nodes is the
inconsistency that may occur between nodes.
• If changes are not propagated to all the slave nodes,
clients accessing different slave nodes
may read different values.
Activity
[Figure: stages of distributed query processing: 1. Query Mapping, 2. Localization, 3. Global Query Optimization, 4. Local Query Optimization]
4.4 Query Processing and Optimization in
Distributed Databases
Example
Suppose Employee table and Department table are stored at node
01 and node 02 respectively. Results are expected to be presented
in node 03.
Relation     Stored at   Size of one record   No. of records
Employee     Node 01     100 bytes            10000
Department   Node 02     35 bytes             100
Results: Node 03
Example
According to the details given, let’s calculate the size of each
relation.
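The size of each relation is its record size multiplied by its number of records. A short Python sketch of that arithmetic follows, using the figures given in the example; the variable names are illustrative.

```python
# Figures taken from the example: record size (bytes) and record count.
employee_record_bytes = 100
employee_records = 10_000
department_record_bytes = 35
department_records = 100

# Size of a relation = size of one record * number of records.
employee_size = employee_record_bytes * employee_records
department_size = department_record_bytes * department_records

print(f"Employee:   {employee_size:,} bytes")
print(f"Department: {department_size:,} bytes")
```

The Employee relation is far larger than the Department relation, which is why strategies that avoid shipping Employee across the network tend to transfer fewer bytes.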
The sizes of attributes in Employee and Department relations are given
below.
EMPLOYEE
Fname field is 15 bytes long, Lname field is 15 bytes long, Address field is 10 bytes long
DEPARTMENT
Method 3
Explanation: Transfer the DEPARTMENT relation to
site 1. Execute the join at site 1. Send the result to site 3.
Calculation:
Total no. of bytes to be transferred = size of the Department
table + size of the query result
= 3,500 + 400,000
= 403,500 bytes
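The Method 3 total can be checked with a small Python sketch. The Department size follows from the example (100 records of 35 bytes). The query result size of 400,000 bytes is an assumption here: it is the value implied by the stated total, not a figure given on this slide. The function name `bytes_transferred` is illustrative.

```python
def bytes_transferred(shipped_relation_size, result_size):
    """Bytes moved when one relation is shipped to the join site
    and the join result is then sent to the result site."""
    return shipped_relation_size + result_size


department_size = 35 * 100   # 3,500 bytes, from the example
result_size = 400_000        # assumption: implied by the stated total

method3_total = bytes_transferred(department_size, result_size)
print(f"Method 3 transfers {method3_total:,} bytes")
```

The same function can be reused for a strategy that ships the other relation instead, by passing that relation's size as the first argument.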
Activity
Query: Retrieve the Student Name and Course Name which the
student is following.
Write the relational algebra for the above query.
Query: Retrieve the Student Name and Course Name which the
student is following.
If we are to transfer STUDENT and COURSE relations into node
3 and perform join operation, how many bytes need to be
transferred? Explain your answer.
Activity
Suppose the STUDENT table is stored at site 1 and the COURSE table is
stored at site 2. The tables are not fragmented and the results
are stored at site 3. Every student is assigned to one course.
STUDENT(Sid, StudentName, Address, Grade, CourseID)
1000 records, each record is 50 bytes long
Sid: 5 bytes, StudentName: 10 bytes, Address: 20 bytes
Query: Retrieve the Student Name and Course Name which the
student is following.
If we are to transfer STUDENT table into site 2, and then execute
join and send result into site 3, how many bytes need to be
transferred? Explain your answer.
Activity
Suppose the STUDENT table is stored at site 1 and the COURSE table is
stored at site 2. The tables are not fragmented and the results
are stored at site 3.
STUDENT(Sid, StudentName, Address, Grade, CourseID)
1000 records, each record is 50 bytes long
Sid: 5 bytes, StudentName: 10 bytes, Address: 20 bytes
Query: Retrieve the Student Name and Course Name which the
student is following.
If we are to transfer COURSE table into site 1, and then execute
join and send result into site 3, how many bytes need to be
transferred? Explain your answer.
4.5 NoSQL Characteristics related to Distributed
Databases and Distributed System
1. Scalability
• NoSQL databases are typically used in applications with
high data growth.
• Scalability is the potential of a system to handle a growing
amount of data.
• In Distributed Databases, there are two strategies for
scaling a system.
- Horizontal scalability: When the amount of data
increases, distributed system can be expanded by
adding more nodes into the system.
- Vertical scalability: Increasing the storage capacity of
existing nodes.
• Horizontal scaling can be carried out while the
system is in operation. We can distribute the data among
newly added sites without disturbing the operation of the
system.
2. Availability, Replication and Eventual Consistency:
• Most applications that use NoSQL DBs
require availability.
• It is achieved by replicating data on several nodes.
• With this technique, even if one node fails, the other
nodes that hold replicas of the same data will
respond to data requests.
• Read performance is also improved by having replicas.
When the number of read operations is high, clients
can read from the replica nodes instead of keeping a single
node busy.
Availability, Replication and Eventual Consistency Cont.:
• But having replications may not be effective for write
operations because after a write operation, all the nodes
having same data item should be updated in order to keep
the system consistent.
• Due to this requirement of updating all the nodes with the
same data item, the system can get slower.
• However, most of the NoSQL applications prefer eventual
consistency.
• Eventual consistency is discussed on the next slide.
Availability, Replication and Eventual Consistency Cont.:
• Eventual Consistency
This means that at any time there may be nodes whose
replicas are inconsistent, but if there are no further updates,
all the nodes will eventually synchronise and hold
the same value.
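The behaviour described above can be illustrated with a toy Python simulation. This is a sketch under stated assumptions, not how any particular NoSQL system works: the `Replica` and `Cluster` classes are hypothetical, updates are propagated only when `propagate()` is called, and conflicts are resolved by keeping the highest version number (last-writer-wins).

```python
class Replica:
    """One copy of a single data item, tagged with a version number."""

    def __init__(self, name):
        self.name = name
        self.value = None
        self.version = 0

    def apply(self, value, version):
        # Assumption of this sketch: the newest version wins.
        if version > self.version:
            self.value, self.version = value, version


class Cluster:
    """Replicas plus a queue of updates not yet delivered to them."""

    def __init__(self, names):
        self.replicas = [Replica(n) for n in names]
        self.pending = []   # (target replica, value, version)
        self.clock = 0

    def write(self, replica_index, value):
        # The write is applied locally; the other replicas are updated later.
        self.clock += 1
        self.replicas[replica_index].apply(value, self.clock)
        for i, replica in enumerate(self.replicas):
            if i != replica_index:
                self.pending.append((replica, value, self.clock))

    def propagate(self):
        # Deliver every queued update; with no new writes, all replicas converge.
        while self.pending:
            replica, value, version = self.pending.pop(0)
            replica.apply(value, version)

    def values(self):
        return [r.value for r in self.replicas]


cluster = Cluster(["n1", "n2", "n3"])
cluster.write(0, "x=1")
stale = cluster.values()        # n2 and n3 have not seen the write yet
cluster.propagate()
converged = cluster.values()    # every replica now holds the same value
```

Between `write` and `propagate`, a client reading from `n2` or `n3` sees the old value. Once propagation finishes and no further writes arrive, all three replicas agree.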
Replication Models Cont.:
- Master-master replication: All the nodes are treated
alike. Reads and writes can be performed on any
of the nodes. However, reads done on different nodes
are not guaranteed to see the same value. Since
multiple users can write to a single data
item at the same time, the system can be temporarily
inconsistent.
4. Sharding of Files:
• We have discussed the concept of sharding in slide 55.
• In many NoSQL applications, there can be millions of data
records accessed by thousands of users concurrently.
• Effective responses can be provided by storing partitions of
data in several nodes.
• By using the technique called sharding (horizontal
partitioning), we can distribute the load across multiple
sites.
• Combination of sharding and replication improves load
balancing and data availability.
5. High-Performance Data Access:
• In many NoSQL applications, it might be necessary to find
a single data value or a file among billions of records.
• To achieve this, techniques such as hashing and range
partitioning are used.
- Hashing: A hash function h(K) applied to a given
key K provides the location of a particular object.
- Range partitioning: An object's location is
determined from a range of key values. For example,
location i would hold the objects whose key values
K are in the range $K_{i\min} \le K \le K_{i\max}$.
• We can use other indexes to locate objects based on
attribute conditions (different from the key K).
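Both lookup techniques can be sketched in a few lines of Python. This is an illustrative sketch only: the function names, the number of nodes, and the range boundaries are assumptions, and SHA-256 is used simply as a convenient stable hash, not because the slides prescribe it.

```python
import bisect
import hashlib


def hash_location(key, num_nodes):
    """Hashing: h(K) maps a key directly to one of num_nodes locations."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


# Hypothetical upper bounds: location 0 holds K <= 1000,
# location 1 holds 1000 < K <= 2000, location 2 holds 2000 < K <= 3000.
RANGE_UPPER_BOUNDS = [1000, 2000, 3000]


def range_location(key, upper_bounds):
    """Range partitioning: find the location whose key range contains K."""
    index = bisect.bisect_left(upper_bounds, key)
    if index == len(upper_bounds):
        raise KeyError(f"key {key} is above every configured range")
    return index


node_for_user = hash_location("user42", 4)
node_for_order = range_location(1500, RANGE_UPPER_BOUNDS)
```

Hashing spreads keys evenly but scatters neighbouring keys across nodes, whereas range partitioning keeps neighbouring keys together, which suits queries over a range of key values.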
Activity
Fill in the blanks with the most suitable word given.
(horizontal, vertical, eventual consistency, consistency, master,
slave, availability, usability)
Summary
• Distributed Database Concepts, Components and Advantages
• Types of Distributed Database Systems
• NoSQL Characteristics related to Distributed Databases and Distributed Systems