
Chapter 04 - Distributed Database Systems - converted

This lesson covers the concepts, advantages, and types of distributed database systems, including design techniques such as fragmentation, replication, and allocation. It also addresses query optimization in distributed databases and the characteristics of NoSQL databases. Intended learning outcomes include understanding data distribution, analyzing big data technologies, and describing cloud computing concepts.

Uploaded by

Kavini Amandi
Copyright
© © All Rights Reserved

IT3306 – Data Management
Level II - Semester 3
4 : Distributed Database Systems

© e-Learning Centre, UCSC

Overview

• This lesson discusses the concepts, advantages, and different types of distributed database systems.
• Distributed database design techniques and concepts such as fragmentation, replication, and allocation will be discussed in detail.
• Thereafter, we will look at distributed database query optimization techniques.
• Finally, the NoSQL characteristics related to DDB will be discussed.

Intended Learning Outcomes

• At the end of this lesson, you will be able to:
  • Describe the concepts in data distribution and distributed data management.
  • Analyze new technologies that have emerged to manage and process big data.
  • Explain the distributed solutions provided in NoSQL databases.
  • Describe different concepts and systems being used for processing and analysis of big data.
  • Describe cloud computing concepts.

List of subtopics

4.1. Distributed Database Concepts, Components and Advantages
4.2. Types of Distributed Database Systems
4.3. Distributed Database Design Techniques
  4.3.1. Fragmentation
  4.3.2. Replication and Allocation
  4.3.3. Distribution Models: Single Server, Sharding, Master-Slave, Peer-to-Peer
4.4. Query Processing and Optimization in Distributed Databases
  4.4.1 Distributed Query Processing
  4.4.2 Data Transfer Costs of Distributed Query Processing
4.5. NoSQL Characteristics related to Distributed Databases and Distributed Systems

4.1 Distributed Database Concepts, Components and Advantages

• A system that performs assigned tasks with the help of several sites connected via a computer network is known as a distributed computing system.
• The goal of a distributed computing system is to partition a complex problem that requires large computational power into smaller pieces of work.
• Distributed database technology has emerged as a result of the merger between database technology and distributed systems technology.
• A distributed database (DDB) is a set of logically interrelated databases connected via a computer network.
• To manage the distributed database and make the distribution transparent to the user, we use software called a distributed database management system (DDBMS).
• The following minimum conditions should be satisfied for a database to be considered distributed:
  - Multiple computers (nodes) are connected over a network to transmit data.
  - There is a logical relationship between the information available in different nodes.
  - The hardware, software, and data at each site need not be identical.

• The nodes can either be within the same physical location, connected via a LAN (Local Area Network), or be geographically dispersed and connected via a WAN (Wide Area Network).
• We can use different network topologies to establish communication between sites.
• The topology we select directly affects the performance and the query processing of the distributed database.

Transparency

• In general, transparency means hiding implementation details from the end user.
• Several types of transparency are introduced in the distributed database domain because the data is distributed across multiple nodes.
  i. Location transparency: Commands issued do not change according to the location of the data or the node.
  ii. Naming transparency: Once a name is associated with an object, the object can be accessed without giving additional details such as the location of the data.


Transparency Cont.

  iii. Replication transparency: The user is not aware of the replicas that are kept in multiple nodes to provide better performance, availability, and reliability.
  iv. Fragmentation transparency: The user is not aware of the fragments available.
  v. Design transparency: The user is unaware of the design of the distributed database while performing transactions.
  vi. Execution transparency: The user is unaware of the transaction execution details.

Reliability and Availability

• Reliability is the probability that a system is running at a given point in time.
• Availability is the probability that a system is continuously available during a given time interval.
• Reliability and availability are directly related to database faults, errors, and failures.
• If a system deviates from its defined behaviour, we call it a failure.
• An error is a subset of system states that causes a failure.
• The cause of an error is known as a fault.

Reliability and Availability Cont.

• There are several approaches to making a system reliable.
• One method is fault tolerance. In this method, we detect and remove faults before they result in system failures.
• Another method is ensuring that the system does not contain any faults, by conducting quality control measures and testing.
• A reliable DDBMS should be able to process user requests as long as database consistency is preserved.
• The recovery manager in a DDBMS deals with failures arising from different sources such as transactions, hardware, and communication networks.

Scalability and Partition Tolerance

• Scalability is the extent to which the system can be expanded without disturbing its operations.
• There are two main types of scalability:
  - Horizontal scalability: Expanding the number of nodes in a distributed system. This makes it possible to distribute some of the data and processing load among old and new nodes.
  - Vertical scalability: Expanding the capacity of the individual nodes in a system, e.g. expanding the storage capacity or the processing power of a node.

Scalability and Partition Tolerance Cont.

• When the number of nodes increases, the possibility of network failures also grows, which can cause the nodes to be partitioned into subgroups.
• In this situation, the nodes within a single subnetwork can communicate with each other, while communication among partitions is lost.
• The ability of the system to keep operating even though the network is divided into separate groups is known as partition tolerance.

Autonomy

• The extent to which a single node (database) has the capacity to work independently is referred to as autonomy.
• Higher flexibility is given to the nodes when there is high autonomy.
• Autonomy can be applied in many aspects, such as:
  - Design autonomy: Independence of data model usage and transaction management techniques.
  - Communication autonomy: The extent to which each node can decide on sharing information with other nodes.
  - Execution autonomy: Independence of users to operate as they prefer.
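The network partitioning described under Scalability and Partition Tolerance can be illustrated with a small Python sketch. The node names and link lists are hypothetical; the function only groups nodes that can still reach each other, it does not model how a DDBMS keeps operating inside each group.

```python
from collections import defaultdict

def partitions(nodes, links):
    """Group nodes into subnetworks whose members can still reach each other."""
    neighbours = defaultdict(set)
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, groups = set(), []
    for start in nodes:
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:  # depth-first walk over the links that still work
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(neighbours[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups

nodes = ["n1", "n2", "n3", "n4"]                   # hypothetical node names
healthy = [("n1", "n2"), ("n2", "n3"), ("n3", "n4")]
after_failure = [("n1", "n2"), ("n3", "n4")]       # the n2-n3 link is lost

print(partitions(nodes, healthy))        # one group containing all four nodes
print(partitions(nodes, after_failure))  # two subgroups that cannot talk to each other
```

A partition-tolerant system would continue to serve requests inside each of the two subgroups in the second case.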

Advantages of DDB

1. Improves the flexibility of application development
   - Application development and maintenance can be carried out from different physical locations.
2. Improves availability
   - Faults are isolated to the site of origin without disturbing the other connected nodes.
   - Even if a single node fails, the other nodes continue to operate, so the entire system does not fail. (In a centralized system, however, a failure at the single site makes the whole system unavailable to all users.) Therefore, availability is improved with a DDB.
3. Improves performance
   - Data items are stored closer to where they are needed the most. This reduces contention for CPU and I/O services and also reduces the access delays involved in wide area networks.
   - Since each node holds only a partition of the entire DB, the number of transactions executed at each site is smaller than when all transactions are submitted to a single centralized database.
   - Executing queries in parallel, either by running multiple queries at different sites or by splitting a query into a number of subqueries, also improves performance.
Advantages of DDB Cont.

4. Easy expansion
   - The system can be expanded by adding more nodes or increasing the database size, which makes it much easier to accommodate data growth than in a centralized system.

Activity

Select the advantages of a distributed database over a centralized database from the following features.
• Less cost
• Slow responses
• Less complexity
• Improved performance
• Easier scalability
• Availability improvement
• Maintainability
• Flexibility in application development

Activity

Fill in the blanks with the most suitable words given.
(same, multiple, network, location, replication, execution, design, horizontal, vertical, communication, same, fragmentation, naming)

With _________ transparency, the user is unaware of the different locations where the data is stored.
Making the user unaware of having multiple copies of the same data item in different sites is referred to as __________ transparency.
The ability to increase the number of nodes in a distributed database is __________ scalability.
Increasing the storage capacity of nodes is known as _____________ scalability.

Activity

State whether the given statements are True or False.
1. A system with high transparency offers a lot of flexibility to the application developer. ( True / False )
2. It is mandatory for all the nodes to be identical in terms of data, hardware, and software. ( True / False )
3. A distributed database should be connected via a local area network. ( True / False )
4. In DDB systems, expanding the processing power of nodes is not considered a way of increasing scalability. ( True / False )
5. With data localization, the number of CPU and I/O services required can be reduced. ( True / False )

4.2 Types of Distributed Database Systems

• Distributed database management systems are classified into different types based on the degree of homogeneity.
  - Homogeneous system: All the sites (servers) in the DDB use identical software and all the clients use identical software.
  - Heterogeneous system: The servers run different software, or the users involved in the DDB use different software.
• The degree of local autonomy is another factor related to the degree of homogeneity.
  - If the local site is not allowed to operate as an independent site, there is no local autonomy.
  - If local transactions are permitted direct access to the server, there is some degree of local autonomy.

• The classification of DDBMSs with regard to distribution, autonomy, and heterogeneity can be explained as follows.
  - Centralized DB: Has complete autonomy but a complete lack of distribution and heterogeneity.
  - Pure distributed database systems: There is only one conceptual schema. A site that is part of the DDBMS provides access to the system. Therefore, no local autonomy exists.
  - Further classification can be done along the level of autonomy, giving federated database systems and multidatabase systems. In these systems each server is an independent centralized DBMS with its own local users, local transactions, and DBA, which gives a very high degree of local autonomy.
  - Federated database systems: Have a global view of the federation of databases that is shared by the applications.
  - Multidatabase systems: Have full local autonomy but do not have a global schema.
    e.g. A system with full local autonomy and full heterogeneity (a peer-to-peer database system).
The classification of distributed databases discussed in the previous two slides can be seen in the following image.

[Figure: classification of distributed databases along three axes (Distribution, Autonomy, Heterogeneity), locating centralized database systems, pure distributed database systems, federated database systems, and multidatabase systems.]

4.3 Distributed Database Design Techniques

Fragmentation

● As the name implies, in a distributed architecture, separate portions of data should be stored in different nodes.
● Initially, we have to identify the basic logical unit of data. In a relational database, relations are the simplest logical unit.
● Fragmentation is the process of dividing the whole database into various sub-relations so that data can be stored in different systems.

Example

● Suppose we have a relational database schema with three tables (EMPLOYEE, DEPARTMENT, WORKS_ON) that we must partition in order to store them on several nodes.
● Assume no replication is allowed (data replication allows storage of certain data in more than one place to gain availability and reliability).

EMPLOYEE(FNAME, LNAME, SSN, BDATE, ADDRESS)
WORKS_ON(ESSN, DNO, HOURS)
DEPARTMENT(DNO, DNAME, LOCATION)

Example - Approach 01

● One approach to data distribution is to store each whole relation at a single site.
● In the following example, the Employee table is stored in Node 1, the Department table in Node 2, and the Works_on table in Node 3.

Site 01: Employee
Site 02: Department
Site 03: Works_on

Data distribution technique 01: Storing each relation at its own site.

Example - Approach 02

● Another approach is dividing a relation into smaller logical units for distribution.
● For instance, think of a scenario where 3 different departments are located in 3 separate places: the finance department in Colombo, the research department in Rathnapura, and the headquarters in Kandy, as given in the table below.

DNO   DNAME          LOCATION
d4    Headquarters   Kandy
d3    Finance        Colombo
d8    Research       Rathnapura

Example - Approach 02 Cont.

● For the scenario given in the previous slide, we can store the data relevant to each department at a separate site.
● The details of the finance department will be stored in one site.
● The details of the headquarters will be stored in another site.
● The details of the research department will be stored in another separate site.
● Dividing a relation into smaller logical units can be done by horizontal fragmentation or vertical fragmentation, which will be discussed in the coming slides.

Example - Approach 02 Cont.

Site 01: Finance Department Details
Site 02: Headquarters Department Details
Site 03: Research Department Details

Data distribution technique 02: Storing the details of each department at a different site.

Horizontal Fragmentation

• A subset of the rows in a relation is known as a horizontal fragment or shard.
• The subset of tuples is selected by a condition on one or more attributes.
• With horizontal fragmentation, we divide tables horizontally by creating subsets of tuples, each of which has a logical meaning.
• These fragments are then assigned to different nodes in the distributed system.
• Each horizontal fragment of a relation R can be specified in relational algebra by a $\sigma_{C_i}(R)$ operation ($C_i$ is the condition, $R$ is the relation).
• The original relation is reconstructed by taking the union of all fragments.
• Ex: If we want to store sales employee details and marketing employee details separately in 2 nodes, we can use horizontal fragmentation.

Employee
Name          Salary   Department
Kasun         120000   Sales
Rishad        135000   Sales
Kirushanthi   45900    Marketing
Anna          47900    Marketing

Sales Employee
Name          Salary   Department
Kasun         120000   Sales
Rishad        135000   Sales

Marketing Employee
Name          Salary   Department
Kirushanthi   45900    Marketing
Anna          47900    Marketing

Explanation

• The original table (Employee) is divided into two subsets of rows.
• The first horizontal fragment (Sales_employee) consists of the details of employees who work in the sales department.
  $\text{Sales\_employee} \leftarrow \sigma_{\text{Department} = \text{``Sales''}}(\text{Employee})$
• The second horizontal fragment (Marketing_employee) consists of the details of employees who work in the marketing department.
  $\text{Marketing\_employee} \leftarrow \sigma_{\text{Department} = \text{``Marketing''}}(\text{Employee})$
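The horizontal fragmentation example above can be sketched in Python. This is a minimal illustration that models a relation as a list of dictionaries; the helper name `select` is an assumption made for this sketch, not part of any DDBMS.

```python
employee = [
    {"Name": "Kasun", "Salary": 120000, "Department": "Sales"},
    {"Name": "Rishad", "Salary": 135000, "Department": "Sales"},
    {"Name": "Kirushanthi", "Salary": 45900, "Department": "Marketing"},
    {"Name": "Anna", "Salary": 47900, "Department": "Marketing"},
]

def select(relation, condition):
    """Relational selection: keep only the rows that satisfy the condition."""
    return [row for row in relation if condition(row)]

# One horizontal fragment per department, as in the slide.
sales_employee = select(employee, lambda r: r["Department"] == "Sales")
marketing_employee = select(employee, lambda r: r["Department"] == "Marketing")

# Reconstruction: the union of the (disjoint) fragments.
reconstructed = sales_employee + marketing_employee

by_name = lambda r: r["Name"]
print(sorted(reconstructed, key=by_name) == sorted(employee, key=by_name))  # True
```

Plain list concatenation works as the union here only because the two conditions are disjoint, so no row appears in both fragments.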

Vertical Fragmentation

• With vertical fragmentation, we divide the table by columns.
• There can be situations where we do not need to store all the attributes of a relation at a certain site.
• Therefore, with vertical fragmentation, we can keep only the required columns of a relation within a single site.
• In vertical fragmentation, the primary key or some unique key attribute must be included in every vertical fragment. Otherwise, we will not be able to recreate the original table by putting the fragments together.

Vertical Fragmentation Cont.

• A vertical fragment of a relation R can be specified in relational algebra by a $\pi_{A_i}(R)$ operation ($A_i$ is the list of attributes, $R$ is the relation).
• The original table can be regenerated by applying the OUTER UNION operation to the vertical fragments (a full outer join on the shared key gives the same result when every fragment contains every key value).

• Ex: If we want to store employees' pay details and department details separately in 2 nodes, we can use vertical fragmentation.

Employee
Name          Salary   Department
Kasun         120000   Sales
Rishad        135000   Sales
Kirushanthi   45900    Marketing
Anna          47900    Marketing

Pay data
Name          Salary
Kasun         120000
Rishad        135000
Kirushanthi   45900
Anna          47900

Dept. data
Name          Department
Kasun         Sales
Rishad        Sales
Kirushanthi   Marketing
Anna          Marketing

Explanation

• The original table (Employee) is divided into two subsets of columns.
• The first vertical fragment (Pay_data) consists of the salary details of employees.
  $\text{Pay\_data} \leftarrow \pi_{\text{Name},\, \text{Salary}}(\text{Employee})$
• The second vertical fragment (Dept_data) consists of the department details of employees.
  $\text{Dept\_data} \leftarrow \pi_{\text{Name},\, \text{Department}}(\text{Employee})$
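The vertical fragmentation example above can be sketched in the same style. The helpers `project` and `join_on_key` are hypothetical names for this illustration, and the sketch assumes that Name is unique and that both fragments contain every employee, so a join on Name rebuilds the original table.

```python
employee = [
    {"Name": "Kasun", "Salary": 120000, "Department": "Sales"},
    {"Name": "Rishad", "Salary": 135000, "Department": "Sales"},
    {"Name": "Kirushanthi", "Salary": 45900, "Department": "Marketing"},
    {"Name": "Anna", "Salary": 47900, "Department": "Marketing"},
]

def project(relation, attributes):
    """Relational projection: keep only the listed columns of every row."""
    return [{a: row[a] for a in attributes} for row in relation]

# Both fragments keep the key attribute Name.
pay_data = project(employee, ["Name", "Salary"])
dept_data = project(employee, ["Name", "Department"])

def join_on_key(left, right, key):
    """Recombine two vertical fragments by matching rows on the shared key."""
    right_by_key = {row[key]: row for row in right}
    return [{**row, **right_by_key[row[key]]}
            for row in left if row[key] in right_by_key]

rebuilt = join_on_key(pay_data, dept_data, "Name")
print(rebuilt == employee)  # True
```

If Name were left out of either fragment, there would be no way to decide which salary belongs to which department row, which is why the key must appear in every vertical fragment.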

Mixed Fragmentation

• Another fragmentation technique is hybrid (mixed) fragmentation, where we use a combination of both horizontal and vertical fragmentation.
• For example, take the EMPLOYEE table that we used before.
• The Employee table is split vertically into payment data and department data. (vertical fragmentation)
• Then the department data is split again by department. (horizontal fragmentation)
• The resulting fragments, with data, are shown below.

Employee
Name          Salary   Department
Kasun         120000   Sales
Rishad        135000   Sales
Kirushanthi   45900    Marketing
Anna          47900    Marketing

Pay data
Name          Salary
Kasun         120000
Rishad        135000
Kirushanthi   45900
Anna          47900

Sales - Dept. data
Name          Department
Kasun         Sales
Rishad        Sales

Marketing - Dept. data
Name          Department
Kirushanthi   Marketing
Anna          Marketing
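The mixed fragmentation example above can be sketched by applying a projection first and a selection second. The variable names are hypothetical, and the reconstruction again assumes that Name acts as a unique key.

```python
employee = [
    {"Name": "Kasun", "Salary": 120000, "Department": "Sales"},
    {"Name": "Rishad", "Salary": 135000, "Department": "Sales"},
    {"Name": "Kirushanthi", "Salary": 45900, "Department": "Marketing"},
    {"Name": "Anna", "Salary": 47900, "Department": "Marketing"},
]

# Step 1: vertical fragmentation into pay data and department data.
pay_data = [{"Name": r["Name"], "Salary": r["Salary"]} for r in employee]
dept_data = [{"Name": r["Name"], "Department": r["Department"]} for r in employee]

# Step 2: horizontal fragmentation of the department data by department.
sales_dept_data = [r for r in dept_data if r["Department"] == "Sales"]
marketing_dept_data = [r for r in dept_data if r["Department"] == "Marketing"]

# Reconstruction: union the horizontal fragments, then join with pay data on Name.
dept_rebuilt = sales_dept_data + marketing_dept_data
salary_by_name = {r["Name"]: r["Salary"] for r in pay_data}
rebuilt = [
    {"Name": r["Name"], "Salary": salary_by_name[r["Name"]], "Department": r["Department"]}
    for r in dept_rebuilt
]
print(rebuilt == employee)  # True
```

Reconstruction undoes the steps in reverse order: the horizontal split is undone by a union and the vertical split by a join on the key.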
Activity

1) Give horizontally fragmented relations (with data) for the Project relation given below, so that projects with budgets less than 150,000 are separated from projects with budgets greater than or equal to 150,000. Express the fragmentation conditions using relational algebra for each fragment (with data).
2) Indicate how the original relation would be reconstructed.

ProjNo   ProjName    Budget   Location
23       Books       100000   Colombo
4        Goods       50000    Galle
65       Furniture   75000    Jaffna
87       Clothes     200000   Matara

Activity

1) Give a vertical fragmentation of the Project relation given below into two sub-relations (with data), so that one contains only the information about project budgets (i.e. ProjNo, Budget), whereas the other contains project names and locations (i.e. ProjNo, ProjName, Location). Express the fragmentation condition using relational algebra for each fragment.
2) Indicate how the original relation would be reconstructed.

ProjNo   ProjName    Budget    Location
P1       Books       100,000   Colombo
P2       Goods       50,000    Galle
P3       Furniture   75,000    Colombo
P4       Clothes     200,000   Kandy

Activity

Fill in the blanks with the most suitable word given.
(vertical, horizontal, mixed, union, outer join, projection, selection)

When we divide a relation based on columns, it is known as __________ fragmentation, while a relation divided based on rows is known as _______________ fragmentation.
A combination of these 2 fragmentations is referred to as __________ fragmentation.
The re-constructability of the relation from its fragments ensures that constraints defined on the data in the form of dependencies are preserved. A set of vertical fragments can be organized into the original table using the _________ operation.
With the ________ operation, we can create the original relation from a set of horizontal fragments.

Replication

• The main purpose of having data replicated in several nodes is to ensure the availability of data.
• One extreme of data replication is having a copy of the entire database at every node (full replication).
• The other extreme is not having replication at all. Here, every data item is stored at only one site (no replication).

Replication Cont.

• Full replication
  - With full replication, we can achieve a higher degree of availability. The reason is that the entire system keeps running even with only one site up, because every site contains the whole DB.
  - The other advantage is improved performance of read queries, as the results can be obtained from any site by processing locally at the site where the query was submitted.
  - However, there are drawbacks to full replication.
  - One is degraded write performance, because each update must be performed on every copy of the data to maintain consistency.
  - Another is that concurrency control and recovery techniques become more complex and expensive.

Replication Cont.

• No replication
  - When there is no replication, all fragments must be disjoint (no tuple of relation R can be seen at more than one site). However, repetition of the primary key should be expected for vertical or mixed fragments.
  - Also known as non-redundant allocation.
  - Suitable for systems with high write traffic.
  - A lower degree of availability is a disadvantage of no replication.
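The trade-off between full replication and no replication described above can be made concrete with a small Python sketch. The class `ReplicatedItem` and the site names are hypothetical; as a simplification, a write here updates every copy regardless of whether its site is up, and no concurrency control is modelled.

```python
class ReplicatedItem:
    """One data item stored on a set of sites (hypothetical model)."""

    def __init__(self, sites, value=None):
        self.copies = {site: value for site in sites}
        self.up = set(sites)

    def write(self, value):
        """Update every copy; return how many copies had to be written."""
        for site in self.copies:
            self.copies[site] = value
        return len(self.copies)

    def fail(self, site):
        """Mark a site as down."""
        self.up.discard(site)

    def read(self):
        """Read from any site that is still up; fail if none is left."""
        for site in self.copies:
            if site in self.up:
                return self.copies[site]
        raise RuntimeError("no available copy")

full = ReplicatedItem(["site1", "site2", "site3"])   # copy at every site
single = ReplicatedItem(["site1"])                   # no replication

full_writes = full.write(10)      # 3 copies must be updated
single_writes = single.write(10)  # only 1 copy must be updated

full.fail("site1")
single.fail("site1")
value_after_failure = full.read()  # still answered by a surviving site
# single.read() would now raise RuntimeError: the only copy is unreachable.
```

The write count grows with the number of copies, while the ability to read after a failure depends on having at least one copy at a surviving site.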

Replication Cont.

• To balance the pros and cons discussed, we can select a degree of replication suitable for our application.
• Some fragments of the database may be replicated and others may not, according to the requirements.
• It is also possible to have some fragments replicated in all the nodes of the distributed system.
• In any case, all the replicas should be synchronized when an update takes place.

Allocation

• Every fragment (or copy of a fragment) must be assigned to some site in the DDB.
• The process of distributing data to nodes is known as data allocation.
• The choice of the site that holds each fragment and the number of replicas of each fragment depends on:
  - Performance requirements of the system
  - Types of transactions
  - Availability goals
  - Transaction frequency
Allocation Cont.

Consider the following scenarios and the suggested allocation mechanisms:
• The system requires high availability and has a high number of retrievals:
  - A fully replicated database is recommended.
• A subsection of the data is retrieved frequently:
  - Allocating the required fragment to multiple sites is recommended.
• The system performs a high number of updates:
  - Keeping a smaller number of replicas is recommended.
However, it is hard to find an optimal solution to distributed data allocation, since it is a complex optimization problem.

Activity

Select the advantages of data replication in a DDB.
1. Improves availability of data.
2. Improves performance of data retrieval.
3. Improves performance of data write operations.
4. Slows down update queries.
5. Hard recovery.
6. Expensive concurrency control.
7. Slows down select queries.
8. Easy notions used for data query.

Activity

Mark the following statements as true (T) or false (F).
1. With data replication, we can have multiple copies of the same data item in many sites. ( )
2. Data replication would slow down the read and write operations of a database. ( )
3. There can be some data fragments which are replicated in all nodes of the distributed database. ( )
4. The number of copies created for each fragment should be equal. ( )
5. In a replicated system, there can be fragments which are not replicated in another site. ( )
6. For a system with frequent updates, it is advised to use a larger number of replications. ( )

Distribution Models

When the data volume increases, we can add more nodes to our distributed database system to handle it. There are different models for distributing data among these nodes.

1. Single server
• This is the minimum form of distribution and most often the recommended option.
• Here, the database runs on a single server without any distribution.
• Since all read and write operations occur at a single node, complexity is reduced and management becomes easier.

Distribution Models Cont.

2. Sharding
• A database can get busy when several users access different data in different parts of the database at the same time.
• This can be solved by splitting the data into several parts and storing them in different nodes. This is called sharding.

[Figure: sharding of data parts A, B, C and D across separate nodes.]

• In the best-case scenario of sharding, different users access different parts of the database stored in separate nodes, so that each user only communicates with a single node.
• This technique helps with load balancing.
• For this technique to be effective, the data must be segregated correctly. Data that are accessed together should be stored in a single node.
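The idea that each request is routed to exactly one node can be sketched with hash-based sharding. This is only one possible placement rule, chosen here for brevity; the class `ShardedStore` and the node names are hypothetical, and `zlib.crc32` is used because Python's built-in `hash` for strings differs between runs.

```python
import zlib

class ShardedStore:
    """Key-value store whose keys are spread over several nodes (hypothetical)."""

    def __init__(self, node_names):
        self.node_names = list(node_names)
        self.nodes = {name: {} for name in self.node_names}

    def node_for(self, key):
        """Pick the node for a key from a stable hash of the key."""
        index = zlib.crc32(key.encode("utf-8")) % len(self.node_names)
        return self.node_names[index]

    def put(self, key, value):
        self.nodes[self.node_for(key)][key] = value

    def get(self, key):
        # Only the single node that owns the key is contacted.
        return self.nodes[self.node_for(key)].get(key)

store = ShardedStore(["node1", "node2", "node3"])
for name, salary in [("Kasun", 120000), ("Rishad", 135000),
                     ("Kirushanthi", 45900), ("Anna", 47900)]:
    store.put(name, salary)

print(store.get("Kasun"))  # 120000
```

A plain hash spreads load evenly but does not keep related data together. To store data that are accessed together in one node, as the slide recommends, a design might instead shard on an attribute such as the department.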

Distribution Models Cont.

• The following considerations should be made when segregating (or sharding) the data:
  - Location: Place data close to the physical location of access.
  - Load balancing: Make sure that each node gets approximately the same number of requests.
  - Order of access: Aggregates that will be read in sequence can be stored in a single node.

Distribution Models Cont.

• Auto-sharding is a feature provided by most NoSQL databases, where the responsibility for splitting and storing data is given to the database itself, ensuring that data goes to the correct shard.
• Sharding can improve read performance as well as write performance.
  - Replication and caching mainly improve read performance.
  - Sharding also improves write performance, because it scales writes horizontally.
• It is hard to achieve reliability with sharding alone.
• To improve reliability, it is necessary to use data replication along with sharding. Otherwise, even though the data can be accessed from different nodes, the failure of a node makes its shard unavailable.
Distribution Models Cont.

3. Master-slave replication
• In this model, one node is selected as the master (primary) and it is considered the authoritative source for data.
• The master is the node responsible for updates.
• All the other nodes are treated as slaves (secondary).
• A process called synchronization keeps the data in the slaves in sync with the master.

[Figure: one master node with two slave nodes; reads are done on the master or the slaves.]

Distribution Models Cont.

• The master-slave model is suitable for a system with a read-intensive dataset.
• By adding more slaves, you can increase the efficiency of read operations, since read requests can be processed by any slave node.
• However, there is still a limitation on writes, because only the master can process writes to the database.
• If the master fails, it should be recovered or a slave node has to be appointed as the new master.
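The master-slave behaviour described above (writes only at the master, reads at any node, slaves updated by synchronization) can be sketched as follows. The class `MasterSlaveCluster` is hypothetical, and synchronization is modelled as a full copy of the master's data rather than any particular replication protocol.

```python
class MasterSlaveCluster:
    """One master plus several read-only slaves (hypothetical model)."""

    def __init__(self, slave_count):
        self.master = {}
        self.slaves = [{} for _ in range(slave_count)]

    def write(self, key, value):
        # Only the master accepts writes.
        self.master[key] = value

    def synchronize(self):
        # Copy the master's current data to every slave.
        for slave in self.slaves:
            slave.clear()
            slave.update(self.master)

    def read(self, key, slave_index=None):
        # Reads can be served by the master or by any slave.
        source = self.master if slave_index is None else self.slaves[slave_index]
        return source.get(key)

cluster = MasterSlaveCluster(slave_count=2)
cluster.write("Kasun", 120000)
stale = cluster.read("Kasun", slave_index=0)   # None: slave not yet synchronized
cluster.synchronize()
fresh = cluster.read("Kasun", slave_index=0)   # 120000 after synchronization
```

Adding slaves increases read capacity, but every write still passes through the single `master` dictionary, which mirrors the write limitation noted in the slide.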

4.3 Distributed Database Design Techniques
Distribution Models Cont.
• Appointment of the new master can be either an automatic or a manual process.
• The disadvantage of having replicated nodes is the inconsistency that may occur between nodes.
• If the changes are not propagated to all the slave nodes, different clients accessing different slave nodes may read different values.

4.3 Distributed Database Design Techniques
Distribution Models Cont.
4. Peer-to-peer replication
• In the master-slave model, the master is still a bottleneck and a single point of failure.
• In the peer-to-peer model, there is no master and all the nodes have equal weight.
• All the replicas can accept writes. For this reason, the failure of a single node does not cause loss of access to data.
• However, with this model, we have to accept the problem of inconsistency.
• After you write on a node, two users who access the changed data item from different nodes may read two different values until data propagation is completed.
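The read inconsistency in peer-to-peer replication can be illustrated with a small Python sketch. It is a simplified model under stated assumptions: the `Peer` class, its `pending` list and the `propagate` function are hypothetical names, and propagation is modelled as an explicit function call rather than a network protocol.

```python
class Peer:
    """Hypothetical peer: every peer accepts both reads and writes."""
    def __init__(self, name):
        self.name = name
        self.data = {}
        self.pending = []  # local writes not yet sent to the other peers

    def write(self, key, value):
        self.data[key] = value
        self.pending.append((key, value))


def propagate(peers):
    """Push each peer's pending writes to all the other peers."""
    for source in peers:
        for key, value in source.pending:
            for target in peers:
                if target is not source:
                    target.data[key] = value
        source.pending.clear()


a, b = Peer("A"), Peer("B")
a.write("Z", 10)
before = (a.data.get("Z"), b.data.get("Z"))  # the two peers disagree
propagate([a, b])
after = (a.data.get("Z"), b.data.get("Z"))   # the two peers agree
```

This sketch deliberately omits conflict resolution. If two peers write different values to the same key before propagation, they end up holding each other's value, which is one form of the inconsistent write mentioned in this section.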


4.3 Distributed Database Design Techniques
Distribution Models Cont.
• One solution for this inconsistency problem is to make the replicas coordinate, so that a write is synchronized with all the nodes after it is performed.
• Another solution is to cope with an inconsistent write.
[Figure: peer-to-peer replication. Reads and writes are done on all nodes.]

4.3 Distributed Database Design Techniques
Distribution Models Cont.
Combining sharding and replication
• We can use both master-slave replication and sharding together.
  - In that approach, we have multiple masters, but there is only one master for each data item.
• Also, we can combine peer-to-peer replication and sharding.
  - A common application of this can be seen in column-family databases.
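Combining sharding with master-slave replication can be sketched as follows. This is an illustrative Python model only: the `Shard` class, the `shard_for` routing function and the key `customer:42` are hypothetical, and a CRC32 hash is used simply because it gives a stable result across runs.

```python
import zlib


class Shard:
    """Hypothetical shard: one master copy plus replicated slave copies."""
    def __init__(self, name, slave_count=2):
        self.name = name
        self.master = {}
        self.slaves = [{} for _ in range(slave_count)]

    def write(self, key, value):
        # Writes go to this shard's master, then to its slaves.
        # Propagation is done immediately here to keep the sketch short.
        self.master[key] = value
        for slave in self.slaves:
            slave[key] = value

    def read(self, key, slave_index=0):
        # Reads can be served by any slave of the shard.
        return self.slaves[slave_index].get(key)


def shard_for(key, shards):
    """Route a key to exactly one shard, so each data item has one master."""
    return shards[zlib.crc32(key.encode("utf-8")) % len(shards)]


shards = [Shard("shard-0"), Shard("shard-1"), Shard("shard-2")]
shard_for("customer:42", shards).write("customer:42", {"name": "Kamal"})
value = shard_for("customer:42", shards).read("customer:42", slave_index=1)
masters_holding_key = sum("customer:42" in s.master for s in shards)
```

There are three masters in this sketch, but `masters_holding_key` is 1: each data item still has a single master, as stated above.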


Activity
Mark the following statements as true (T) or false (F).
1. Scaling up means using larger data servers with higher storage capacity to cater to the increasing data storage requirement. ( )
2. Scaling out is the process of running the database on a cluster of servers. ( )
3. We can ensure reliability of a DDB by using the technique of sharding. ( )
4. Single server is the most recommended distribution model. ( )
5. Read resilience is one of the advantages of the master-slave replication model. ( )

4.4 Query Processing and Optimization in Distributed Databases
[Figure: 1. Query Mapping → 2. Localization → 3. Global Query Optimization → 4. Local Query Optimization]
These are the steps involved in distributed query processing. We will discuss each step in detail.


4.4 Query Processing and Optimization in Distributed Databases
Step 01: Query Mapping.
• The input query is specified in a query language.
• It is then translated into an algebraic query.
• The translation refers to the global conceptual schema; at this stage it does not consider replicas and shards.
• The algebraic query is then normalized and analyzed for semantic errors.
• This step is performed at a central control site.

4.4 Query Processing and Optimization in Distributed Databases
Step 02: Localization.
• In this phase, the distributed query on the global schema is mapped to separate queries on fragments.
• For this, data distribution and fragmentation details are used.
• This step is performed at a central control site.


4.4 Query Processing and Optimization in Distributed Databases
Step 03: Global Query Optimization.
• Optimization means selecting the optimal strategy from a list of candidate strategies.
• These candidate strategies can be obtained by permuting the order of the operations generated in the previous step.
• To measure the cost associated with each set of operations, we use the execution time.
• The total cost is calculated using costs such as CPU cost, I/O cost, and communication cost.
• Since the nodes in a DDB are connected via a network, the most significant cost is the communication between these nodes.

4.4 Query Processing and Optimization in Distributed Databases
Step 04: Local Query Optimization.
• This stage is common to all sites in the DDB.
• The techniques are similar to those used in centralized systems.
• This step is performed locally at each site.
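The cost-based selection described in Step 03 can be sketched in Python as choosing the candidate with the lowest total cost. Everything concrete here is an assumption made for illustration: the `Strategy` class, the strategy names and all cost figures are hypothetical and are not taken from the lesson.

```python
from dataclasses import dataclass


@dataclass
class Strategy:
    """Hypothetical candidate execution strategy with its cost components."""
    name: str
    cpu_cost: float
    io_cost: float
    comm_cost: float

    def total_cost(self):
        # Total cost = CPU cost + I/O cost + communication cost.
        return self.cpu_cost + self.io_cost + self.comm_cost


def choose_strategy(candidates):
    """Return the candidate strategy with the smallest total cost."""
    return min(candidates, key=lambda s: s.total_cost())


# Made-up cost figures, in arbitrary time units, for illustration only.
candidates = [
    Strategy("join at result site", cpu_cost=2.0, io_cost=5.0, comm_cost=100.0),
    Strategy("join at data site", cpu_cost=2.5, io_cost=5.0, comm_cost=40.0),
]
best = choose_strategy(candidates)
```

With these made-up figures the communication cost dominates the total, so the strategy with less data movement is selected even though its CPU cost is slightly higher.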


4.4 Query Processing and Optimization in Distributed Databases
• Compared to a centralized database system, query processing in a distributed database involves additional complexity.
• One source of this complexity is the cost of transferring data among sites.
• Intermediate files or the final result set may be transferred between nodes via the network.
• Reducing the amount of data transferred among nodes is considered an optimization criterion in the query optimization algorithms used in a DDBMS.

4.4 Query Processing and Optimization in Distributed Databases
Example
Suppose the Employee table and the Department table are stored at node 01 and node 02 respectively. Results are expected to be presented at node 03.
Node 01: Employee (size of one record = 100 bytes, no. of records = 10,000)
Node 02: Department (size of one record = 35 bytes, no. of records = 100)
Node 03: Results

4.4 Query Processing and Optimization in Distributed Databases
Example
According to the details given, let's calculate the size of each relation.
No. of records in Employee relation = 10,000
Size of 1 record in Employee relation = 100 bytes
Size of the Employee relation = 100 * 10,000 = 1,000,000 bytes

No. of records in Department relation = 100
Size of 1 record in Department relation = 35 bytes
Size of the Department relation = 100 * 35 = 3,500 bytes

4.4 Query Processing and Optimization in Distributed Databases
The sizes of attributes in the Employee and Department relations are given below.
EMPLOYEE(Fname, Lname, Ssn, Bdate, Address, Sex, Salary)
Fname field is 15 bytes long, Lname field is 15 bytes long, Address field is 10 bytes long
DEPARTMENT(Dname, Dnumber, Mgr_ssn, Mgr_start_date)
Dnumber field is 4 bytes long, Dname field is 10 bytes long, Mgr_ssn field is 9 bytes long

4.4 Query Processing and Optimization in Distributed Databases
Assume we want to write a query to retrieve the first name, last name and department name for each employee.
We can represent it in relational algebra as follows. Let's call this query Q.
Q: π_{Fname, Lname, Dname} (EMPLOYEE ⋈_{Dno=Dnumber} DEPARTMENT)

4.4 Query Processing and Optimization in Distributed Databases
Q: π_{Fname, Lname, Dname} (EMPLOYEE ⋈_{Dno=Dnumber} DEPARTMENT)
We will discuss 3 strategies to execute this distributed query.
Method 1
Explanation: Transfer the EMPLOYEE relation and the DEPARTMENT relation to the result site (node 03). Then perform the join operation at node 03.
Calculation:
Total no. of bytes to be transferred = Size of the Employee relation + Size of the Department relation
= 1,000,000 + 3,500
= 1,003,500 bytes

4.4 Query Processing and Optimization in Distributed Databases
Method 2
Explanation: Transfer the EMPLOYEE relation to site 2. Execute the join at site 2. Send the result to site 3.
Calculation:
Total no. of bytes to be transferred = Size of the Employee table + Size of the query result
= 1,000,000 + (40 * 10,000)
= 1,400,000 bytes
Note: One record in the query result consists of Fname (15 bytes), Lname (15 bytes) and Dname (10 bytes), altogether 40 bytes. There are 10,000 records retrieved as the result. Therefore, the size of the query result is 40 * 10,000 = 400,000 bytes.

4.4 Query Processing and Optimization in Distributed Databases
Method 3
Explanation: Transfer the DEPARTMENT relation to site 1. Execute the join at site 1. Send the result to site 3.
Calculation:
Total no. of bytes to be transferred = Size of the Department table + Size of the query result
= 3,500 + (40 * 10,000)
= 403,500 bytes
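The three transfer calculations can be reproduced with a short Python sketch. The record counts and field sizes are the ones given in the example; the variable names and the dictionary `transfer` are chosen for this illustration only.

```python
# Relation sizes from the example (bytes).
EMPLOYEE_SIZE = 10_000 * 100       # 10,000 records of 100 bytes at node 01
DEPARTMENT_SIZE = 100 * 35         # 100 records of 35 bytes at node 02

# One result record holds Fname (15) + Lname (15) + Dname (10) bytes.
RESULT_RECORD_SIZE = 15 + 15 + 10
RESULT_SIZE = RESULT_RECORD_SIZE * 10_000   # one result record per employee

transfer = {
    # Method 1: ship both relations to node 03 and join there.
    "method 1": EMPLOYEE_SIZE + DEPARTMENT_SIZE,
    # Method 2: ship EMPLOYEE to node 02, join, ship the result to node 03.
    "method 2": EMPLOYEE_SIZE + RESULT_SIZE,
    # Method 3: ship DEPARTMENT to node 01, join, ship the result to node 03.
    "method 3": DEPARTMENT_SIZE + RESULT_SIZE,
}

# The strategy that moves the fewest bytes over the network.
best = min(transfer, key=transfer.get)
```

The sketch counts only bytes moved over the network, which matches the optimization criterion used in this example; CPU and I/O costs are ignored.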

4.4 Query Processing and Optimization in Distributed Databases
When considering the three methods we discussed,
Total no. of bytes to be transferred in method 1 = 1,003,500
Total no. of bytes to be transferred in method 2 = 1,400,000
Total no. of bytes to be transferred in method 3 = 403,500
The least amount of data transfer occurs in method 3. Therefore, we choose method 3 as the optimal solution, since it transfers the minimum amount of data.

Activity
Suppose the STUDENT table is stored in site 1 and the COURSE table is stored in site 2. The tables are not fragmented and the results are stored in site 3. Every student is assigned to only one course.
STUDENT(Sid, StudentName, Address, Grade, CourseID)
1000 records, each record is 50 bytes long
Sid: 5 bytes, StudentName: 10 bytes, Address: 20 bytes
COURSE(Cid, CourseName)
500 records, each record is 30 bytes long
Cid: 5 bytes, CourseName: 10 bytes
Query: Retrieve the Student Name and the Course Name which the student is following.
Write the relational algebra for the above query.

Activity
Suppose the STUDENT table is stored in site 1 and the COURSE table is stored in site 2. The tables are not fragmented and the results are stored in site 3. Every student is assigned to one course.
STUDENT(Sid, StudentName, Address, Grade, CourseID)
1000 records, each record is 50 bytes long
Sid: 5 bytes, StudentName: 10 bytes, Address: 20 bytes
COURSE(Cid, CourseName)
500 records, each record is 30 bytes long
Cid: 5 bytes, CourseName: 10 bytes
Query: Retrieve the Student Name and the Course Name which the student is following.
If we are to transfer the STUDENT and COURSE relations to site 3 and perform the join operation there, how many bytes need to be transferred? Explain your answer.

Activity
(Same STUDENT and COURSE tables and the same query as above.)
If we are to transfer the STUDENT table to site 2, then execute the join and send the result to site 3, how many bytes need to be transferred? Explain your answer.
Activity
Suppose the STUDENT table is stored in site 1 and the COURSE table is stored in site 2. The tables are not fragmented and the results are stored in site 3.
STUDENT(Sid, StudentName, Address, Grade, CourseID)
1000 records, each record is 50 bytes long
Sid: 5 bytes, StudentName: 10 bytes, Address: 20 bytes
COURSE(Cid, CourseName)
500 records, each record is 30 bytes long
Cid: 5 bytes, CourseName: 10 bytes
Query: Retrieve the Student Name and the Course Name which the student is following.
If we are to transfer the COURSE table to site 1, then execute the join and send the result to site 3, how many bytes need to be transferred? Explain your answer.

4.5 NoSQL Characteristics related to Distributed Databases and Distributed Systems
1. Scalability
• NoSQL databases are typically used in applications with high data growth.
• Scalability is the potential of a system to handle a growing amount of data.
• In distributed databases, there are two strategies for scaling a system.
  - Horizontal scalability: When the amount of data increases, the distributed system can be expanded by adding more nodes to the system.
  - Vertical scalability: Increasing the storage capacity of existing nodes.
• It is possible to carry out horizontal scaling while the system is in operation. We can distribute the data among newly added sites without disturbing the operations of the system.

4.5 NoSQL Characteristics related to Distributed Databases and Distributed Systems
2. Availability, Replication and Eventual Consistency:
• Most applications that use NoSQL databases require availability.
• It is achieved by replicating data on several nodes.
• With this technique, even if one node fails, the other nodes that hold replicas of the same data will respond to the data requests.
• Read performance is also improved by having replicas. When the number of read operations is high, clients can access the replica nodes without keeping a single node busy.

4.5 NoSQL Characteristics related to Distributed Databases and Distributed Systems
Availability, Replication and Eventual Consistency Cont.:
• However, replication may not be effective for write operations, because after a write operation all the nodes holding the same data item must be updated in order to keep the system consistent.
• Due to this requirement of updating all the nodes holding the same data item, the system can get slower.
• For this reason, most NoSQL applications prefer eventual consistency.
• Eventual consistency will be discussed in the next slide.


4.5 NoSQL Characteristics related to Distributed Databases and Distributed Systems
Availability, Replication and Eventual Consistency Cont.:
• Eventual Consistency
This means that at any time there may be nodes with replication inconsistencies, but if there are no further updates, eventually all the nodes will synchronise and be updated to the same value.
For example, if Kamal updates the value of Z to 10, it will be updated on node A. But if Saman accesses the value of Z from node B, the value will not be 10, as the change has not yet propagated from node A to node B. After some time, when Saman accesses the value of Z from node B again, it will have the value 10. This means that Saman will eventually see the change in the value of Z made by Kamal. This is eventual consistency.

4.5 NoSQL Characteristics related to Distributed Databases and Distributed Systems
3. Replication Models:
• The main replication models used in the NoSQL context are master-slave and master-master replication.
  - Master-slave replication: The primary node, referred to as the master, is responsible for all write operations. The updates are then propagated to the slave nodes, maintaining eventual consistency. There can be different techniques for read operations. One option is to perform all reads on the master node. Another option is to perform all reads on the slave nodes. With this second option, there is no guarantee that reads of the same data item from different nodes return the same value, because the system only becomes consistent eventually.
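The Kamal and Saman example of eventual consistency can be traced with a small Python sketch. It models asynchronous propagation with an explicit queue. The names `node_a`, `node_b`, `replication_queue`, `write_on_a` and `propagate` are hypothetical, and the initial value 0 for Z is an assumption, since the example only states that Saman does not yet see 10.

```python
from collections import deque

# Two replicas of the same data item Z (initial value 0 is an assumption).
node_a = {"Z": 0}
node_b = {"Z": 0}
replication_queue = deque()   # updates made on node A, waiting to reach node B


def write_on_a(key, value):
    """Apply a write on node A and queue it for later propagation."""
    node_a[key] = value
    replication_queue.append((key, value))


def propagate():
    """Deliver all queued updates to node B, in the order they were made."""
    while replication_queue:
        key, value = replication_queue.popleft()
        node_b[key] = value


write_on_a("Z", 10)          # Kamal updates Z to 10 on node A
first_read = node_b["Z"]     # Saman reads Z from node B before propagation
propagate()                  # the change reaches node B after some time
second_read = node_b["Z"]    # Saman reads Z from node B again
```

Once no further updates are queued and `propagate` has run, both nodes hold the same value, which is the "eventually" in eventual consistency.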

4.5 NoSQL Characteristics related to Distributed Databases and Distributed Systems
Replication Models Cont.:
  - Master-master replication: All the nodes are treated similarly. Reads and writes can be performed on any of the nodes. However, it is not assured that all reads done on different nodes see the same value. Since it is possible for multiple users to write to a single data item at the same time, the system can be temporarily inconsistent.

4.5 NoSQL Characteristics related to Distributed Databases and Distributed Systems
4. Sharding of Files:
• We have discussed the concept of sharding in slide 55.
• In many NoSQL applications, there can be millions of data records accessed by thousands of users concurrently.
• Effective responses can be provided by storing partitions of the data on several nodes.
• By using the technique called sharding (horizontal partitioning), we can distribute the load across multiple sites.
• The combination of sharding and replication improves load balancing and data availability.

4.5 NoSQL Characteristics related to Distributed Databases and Distributed Systems
5. High-Performance Data Access:
• In many NoSQL applications, it might be necessary to find a single data value or a file among billions of records.
• To achieve this, techniques such as hashing and range partitioning are used.
  - Hashing: A hash function h(K) applied to a given key K provides the location of a particular object.
  - Range partitioning: An object's location can be identified from a range of key values. For example, location i would hold the objects whose key values K are in the range Ki_min ≤ K ≤ Ki_max.
• We can use other indexes to locate objects based on attribute conditions (different from the key K).

Activity
Fill in the blanks with the most suitable word given.
(horizontal, vertical, eventual consistency, consistency, master, slave, availability, usability)
__________ scalability can be performed while the system is in operation.
A relaxed form of consistency preferred by most of the NoSQL systems is known as _____________.
In the master-slave replication model, ___________ is used as the source of write operations.
Load balancing and ________ can be achieved in a system which uses the combination of sharding and replication.
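Returning to the hashing and range partitioning techniques described under High-Performance Data Access, the two lookups can be sketched in Python as follows. The function names `hash_location` and `range_location`, the choice of CRC32 as the hash function h(K), and the key ranges in `RANGES` are all assumptions made for this illustration.

```python
import zlib


def hash_location(key, node_count):
    """Hashing: h(K) maps a key directly to one of node_count locations."""
    return zlib.crc32(str(key).encode("utf-8")) % node_count


# Hypothetical (Ki_min, Ki_max) range for each location i.
RANGES = [(0, 999), (1000, 1999), (2000, 2999)]


def range_location(key, ranges=RANGES):
    """Range partitioning: find location i with Ki_min <= K <= Ki_max."""
    for location, (k_min, k_max) in enumerate(ranges):
        if k_min <= key <= k_max:
            return location
    raise KeyError(f"no location covers key {key}")


hashed = hash_location("customer:42", node_count=4)   # some value in 0..3
low = range_location(999)     # last key of the first range
high = range_location(1000)   # first key of the second range
```

Hashing spreads keys evenly but places neighbouring keys on unrelated nodes, whereas range partitioning keeps neighbouring keys together, which suits queries over a range of key values.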


Summary
• Distributed Database Concepts, Components and Advantages
• Types of Distributed Database Systems
• Distributed Database Design Techniques: Fragmentation, Replication and Allocation, Distribution Models

Summary
• Query Processing and Optimization in Distributed Databases: Distributed Query Processing, Data Transfer Costs of Distributed Query Processing
• NoSQL Characteristics related to Distributed Databases and Distributed Systems
© 2020 e-Learning Centre, UCSC

