Distributed DBMS

The document provides an overview of Distributed Data Processing (DDP) and Distributed Database Systems (DDBS), highlighting their key concepts, advantages, and challenges. It explains the importance of data distribution, parallel processing, and coordination in enhancing performance, scalability, and fault tolerance. Additionally, it discusses the complexities involved in managing distributed systems, including issues related to network performance, data consistency, and security.


UNIT I
Introduction:
 Distributed Data Processing :
Distributed Data Processing (DDP) is a method where computational tasks and data are
divided across multiple interconnected computers (or nodes) in a network. These systems
work together to process data more efficiently than a single centralized system.

In simple terms, rather than storing all data in one place and processing it on one machine,
distributed data processing breaks down tasks and processes them in parallel across
different locations.

Key Concepts:

1. Data Distribution:
o Data is stored across multiple physical locations.
o Can be fragmented (split into parts) or replicated (copies stored in multiple
locations).
2. Parallel Processing:
o Tasks are split and processed simultaneously on multiple nodes.
o This reduces processing time significantly.
3. Coordination:
o A central coordinator or a set of protocols ensures tasks are executed in the
correct sequence.
o Ensures consistency and synchronization across all nodes.
4. Transparency:
o The system hides the complexity of distribution from the user.
o The user feels like they're working with a single system.

Advantages:

 Scalability: Easy to add more nodes to handle increased load.


 Fault Tolerance: Failure in one node doesn't bring the entire system down.
 Faster Processing: Due to parallel execution.
 Resource Sharing: Makes use of distributed computing resources effectively.

 Distributed Database Systems –


A Distributed Database System (DDBS) is a type of database system in which the data is
stored across multiple physical locations, but appears to users as a single unified database.
These locations could be within the same building, across cities, or even globally spread
over different continents.

Each site (or node) in a DDBS is typically connected via a computer network and can
manage its own data independently. However, the sites work together to support query
processing, transaction management, and data consistency across the entire system.

Key Components:

1. Distributed Database:
o The actual data that is distributed over various sites.
o May be partitioned (fragmented) or replicated.
2. Distributed Database Management System (DDBMS):
o Software that manages the distributed database.

Types of Data Distribution:

 Fragmentation: Data is divided into pieces (fragments), each stored at different sites.
 Replication: Copies of the same data are stored at multiple sites to ensure availability.
 Hybrid: A combination of fragmentation and replication.

Features of Distributed Database Systems:

 Location Transparency: Users don’t need to know the physical location of data.
 Replication Transparency: Users are unaware of how many copies of data exist.
 Concurrency Control: Ensures that transactions are executed safely in a multi-user environment.
 Fault Tolerance: The system can continue working even if one or more sites fail.
 Scalability: Easy to expand the system by adding new sites.

Advantages:

 Improved data availability and reliability.


 Faster local access to data.
 Better performance through parallel processing.
 Modular growth: easy to expand as needs grow.

Real-World Examples:

 Banking Systems: Branches in different cities access and update data stored locally
and globally.
 E-commerce: Customer data, orders, and inventory are managed across different
regional servers.
 Cloud-Based Services: Like Google, Amazon, and Facebook, which handle massive
data across global data centers.

 Promises of DDBSs –
Transparency -

The DDBS hides the complexity of distribution from users. This includes:

 Location Transparency: Users can access data without knowing where it's stored.
 Replication Transparency: Users don’t need to know how many copies of the data
exist.
 Fragmentation Transparency: Users are unaware of how data is split and stored across
sites.
 Transaction Transparency: Distributed transactions appear just like local ones.
 Failure Transparency: The system recovers from failures without user intervention.

Reliability and Availability -

Even if one or more sites fail, the system continues to operate.

Data replication ensures that data is available from other sites if one fails.

Scalability -

Easy to expand the system by adding more sites or nodes.

Can handle growing data volumes and increasing user load efficiently.

Performance -

Parallel query execution improves response time.

Users can access nearby servers (local access), reducing latency.

Data Integrity and Consistency -

Ensures that all copies of data remain consistent across sites.

Distributed transactions follow ACID properties.

Resource Sharing -

Enables sharing of data and processing power across different departments or locations.

Makes better use of available hardware and network resources.



Modularity -

Systems can be developed and maintained in parts.

New components or nodes can be added with minimal disruption.

 Complicating factors –
Distributed database systems are powerful, but they are also harder to manage than
centralized databases due to the following complications:

1. Network Issues

 Latency, packet loss, and network failures can disrupt data transfer between sites.
 Performance may degrade due to slow communication links.

2. Data Distribution Complexity

 Deciding how to fragment, replicate, and allocate data efficiently is complex.


 Poor distribution can lead to performance issues and unbalanced workloads.

3. Distributed Query Processing

 Queries that span multiple sites need to be optimized carefully.


 More complex than local queries due to data location, cost of data transfer, and
synchronization.

4. Concurrency Control

 Multiple users accessing distributed data simultaneously increases the risk of conflicts.
 Coordinating concurrency control across sites is harder than in a single system.

5. Distributed Transactions

 Ensuring Atomicity, Consistency, Isolation, Durability (ACID) across all sites is complicated.
 Often uses complex protocols like Two-Phase Commit (2PC), which are slow and
vulnerable to site failures.

6. Consistency Maintenance

 When data is replicated, keeping all copies synchronized in real-time is challenging.


 Trade-off between consistency and availability (as explained in the CAP Theorem).

7. Security and Access Control

 Managing security across multiple sites with different administrators or policies is tough.
 Increases the risk of unauthorized access or data breaches.

8. System Heterogeneity

 Different sites may use different DBMS software, hardware, operating systems, or data
models.
 Makes integration and standardization more complex.

9. Failure Handling

 Detecting and recovering from partial failures (like one node going down) requires
robust mechanisms.
 Ensuring the system stays consistent after a failure is not easy.

 Problem areas –
Distributed Query Processing -

Problem: Optimizing queries that access data from multiple locations.



Challenge: How to minimize communication cost, access time, and balance load across
sites.

Example: A query that joins tables stored in different cities can be very slow without
optimization.
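For instance, a cross-site join might look like the sketch below (the site-qualified names SITE_MUMBAI.CUSTOMER and SITE_DELHI.ORDERS and their columns are hypothetical); the optimizer must decide which site performs the join and how much data travels over the network:

sql
SELECT c.CustName, o.OrderID, o.Amount
FROM SITE_MUMBAI.CUSTOMER c
JOIN SITE_DELHI.ORDERS o ON o.CustID = c.CustID
WHERE o.Amount > 10000;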

Distributed Transaction Management -

Problem: Ensuring transactions maintain ACID properties across multiple sites.

Challenge: Coordinating commit and rollback over the network, handling failures during
transactions.

Example: A failure during a banking transfer between two branches must not lead to
inconsistent balances.

Concurrency Control -

Problem: Multiple users might try to access/update the same data simultaneously.

Challenge: Preventing deadlocks, lost updates, or inconsistent reads in a distributed setting.

Example: Two users updating the same account balance from different locations at the
same time.

Replication and Consistency -

Problem: Managing multiple copies of the same data across sites.

Challenge: Ensuring all replicas are updated correctly and in sync.

Example: An e-commerce site's inventory should be updated instantly across all warehouses.

Distributed Deadlock Detection and Resolution -

Problem: Deadlocks can happen across sites.

Challenge: Detecting them across the network and resolving without affecting
performance.

Example: Site A waits for Site B, while Site B is waiting for Site A — classic deadlock.

Failure Recovery -

Problem: Systems or network failures can cause data loss or inconsistencies.

Challenge: Ensuring automatic recovery, logging, and data integrity after a failure.

Example: If a site crashes mid-transaction, it must recover to a consistent state.

Security and Authorization -

Problem: Distributed environments have broader attack surfaces.

Challenge: Secure data during transfer and storage, manage user access across sites.

Example: One site may have different security policies than another.

Schema Integration and Data Heterogeneity -

Problem: Combining different databases (different formats, models, or structures).

Challenge: Resolving naming conflicts, data type mismatches, and semantics.

Example: One site uses “Customer_ID,” another uses “CID” for the same data.
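A common remedy is a mapping view that renames site-specific attributes into one global name. A minimal sketch, assuming hypothetical tables SITE1.CUSTOMER (with Customer_ID) and SITE2.CLIENT (with CID):

sql
CREATE VIEW GLOBAL_CUSTOMER AS
SELECT Customer_ID FROM SITE1.CUSTOMER
UNION
SELECT CID AS Customer_ID FROM SITE2.CLIENT;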

 Distributed DBMS Architecture : Models Autonomy –


In a Distributed Database Management System (DDBMS), autonomy refers to the degree of
independence that individual databases or sites have from the central system or other
databases. It impacts control, data sharing, coordination, and integration across the
system.

Types of Autonomy in Distributed DBMS

There are three major models based on how much autonomy each site or node has:

1. Design Autonomy -

Definition:
Each site has the freedom to define its own database schema, data model, and even use a
different DBMS software.

Characteristics:

 Sites may use different data models (e.g., relational, object-oriented, NoSQL).
 Sites may store data using different naming conventions or structures.
 Integration becomes complex due to schema and semantic heterogeneity.

Example:
A hospital stores patient data in a relational database, while a clinic stores data in a NoSQL
document-based database. Both want to share data, but their structures differ.

Pros:

 Local teams can design based on their specific needs.


 Flexibility in choosing tools and technologies.

Cons:

 Complex schema integration.


 Difficult to run cross-site queries.

2. Communication Autonomy -

Definition:
Each site controls when and how it communicates or participates in the distributed system.

Characteristics:

 Sites can refuse to respond to certain queries or transactions.


 A site may temporarily disconnect or restrict data sharing.
 Useful in environments where privacy or regulatory boundaries exist.

Example:
In a federated university system, some universities may allow their student databases to be
queried by others only during admission season.

Pros:

 Sites maintain control over their resources.


 Reduced communication overhead when not needed.

Cons:

 Global transaction processing becomes inconsistent.


 Incomplete data if some sites don’t respond.

3. Execution Autonomy

Definition:
Each site can execute its local operations independently without needing to wait for global
decisions.

Characteristics:

 Local queries and updates can run without global coordination.


 Sites can opt out of participating in distributed transactions.
 Enhances local efficiency and performance.

Example:
A retail store processes customer billing locally even if it’s part of a large retail chain.
Central servers sync data only at end of day.

Pros:

 Fast response for local transactions.


 No dependence on central coordinator.

Cons:

 Risk of inconsistent global state.


 Conflict resolution needed during synchronization.

Levels of Autonomy (High to Low) –

 Tightly Integrated: Central control; minimal autonomy at local sites. Example: centralized control with remote databases.
 Federated: Some autonomy; sites share some control and coordination. Example: collaborative systems, e.g. hospital networks.
 Loosely Coupled: High autonomy; sites operate independently and communicate when needed. Example: different departments using their own DBMSs.

Why Autonomy Matters?

 Pros of High Autonomy:


o Local control
o Flexibility
o Fault isolation
 Cons of High Autonomy:
o Harder integration
o Complex transaction coordination
o Inconsistent schemas

Diagram Idea (optional for a presentation)

You can represent it as:

+--------------------+
| Site A (Own DBMS) |
+--------------------+
↑ ↓ Communication Autonomy
+--------------------+
| Site B (Own Schema) |
+--------------------+

Execution Autonomy ↔ Each site runs queries independently


Design Autonomy ↔ Each site has different DBMS/data model

 Distribution –
Distribution refers to the logical or physical placement of data across multiple
interconnected locations (sites, nodes, or servers) in a Distributed Database System.

Instead of storing all data in a single central database, it is split and stored across different
sites that are connected via a network.

Why Distribute Data?

1. ✅ Improved Performance: Data is located closer to users, reducing access time.


2. ✅ Scalability: Easy to add more sites or servers as data grows.
3. ✅ Reliability: Even if one site fails, others can continue to function.
4. ✅ Local Autonomy: Sites can manage their data independently.
5. ✅ Cost-effective: Can use smaller, cheaper hardware at multiple locations.

Types of Data Distribution -

There are three main strategies:

1. Fragmentation

 The database is broken into fragments (subsets) and distributed.


 Each fragment contains a portion of the data.
 Two types:
o Horizontal Fragmentation: Rows are divided among sites.
o Vertical Fragmentation: Columns are divided among sites.

Example: A customer database where customers from the North region are stored in server
A, and those from the South in server B.

2. Replication-

 Copies of data (full or partial) are stored at multiple sites.


 Ensures data availability and fault tolerance.
 Can be:
o Synchronous: All copies updated simultaneously.
o Asynchronous: Updates happen later.

Example: Product inventory data is replicated across warehouses to ensure fast local
access.
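Conceptually, synchronous replication applies a change to every copy inside the same transaction, while asynchronous replication propagates it later. A minimal sketch of the synchronous case, assuming two hypothetical replica tables INVENTORY_SITE_A and INVENTORY_SITE_B:

sql
BEGIN;
-- the same change is applied to both replicas before committing
UPDATE INVENTORY_SITE_A SET Stock = Stock - 1 WHERE ProductID = 101;
UPDATE INVENTORY_SITE_B SET Stock = Stock - 1 WHERE ProductID = 101;
COMMIT;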

3. Hybrid Distribution-

 Combines fragmentation and replication.


 Some data is fragmented; some fragments are replicated.
 Offers optimized performance and availability.

Example: Product data is fragmented by region and replicated in nearby data centers.

Logical vs Physical Distribution -


 Logical Distribution: Data appears as a unified database to users, even though it is stored in multiple locations.
 Physical Distribution: Actual storage of data on different physical sites. Managed by the DDBMS for transparency.

Distribution Transparency -

To users, the DDBMS should hide the complexities of distribution:

 Location Transparency: Users don’t need to know where the data is stored.
 Replication Transparency: Users are unaware of how many copies exist.
 Fragmentation Transparency: Users see the data as one unified dataset.

 Heterogeneity DDBMS Architecture – Client/Server –


DDBMS stands for Distributed Database Management System, where the database is
spread across different locations. A Heterogeneous DDBMS is one where:

 Different sites may use different DBMS products (e.g., Oracle, SQL Server, MySQL).
 The systems may differ in data models, query languages, hardware, or operating
systems.

 Despite differences, they are connected via a network and work together to process
queries and transactions.

Client/Server Architecture in Heterogeneous DDBMS -

In this model, the DDBMS is implemented using a client/server architecture, where:

 Clients request services (like queries or transactions).


 Servers provide services (like executing queries or managing data).

Components of Client/Server Heterogeneous DDBMS-

1. Client

 Interface between the end-user and the system.


 Sends queries to the server and presents the results to the user.
 May include applications, web interfaces, or command-line tools.

2. Server

 Responsible for managing databases at its location.


 Executes queries, manages transactions, and ensures consistency.
 Servers may be running different DBMS systems, leading to heterogeneity.

3. Middleware

 Acts as a translator or coordinator between clients and multiple heterogeneous servers.
 It helps:
o Translate queries from a common language to the native language of each
DBMS.
o Handle differences in schema, data types, and query syntax.
o Combine and present results to the client.

Key Features -

 Interoperability: Supports multiple DBMSs with different data models and query languages.
 Transparency: Users don't need to know the data's physical location or DBMS type.
 Scalability: Easy to add new servers or DBMSs.
 Fault Tolerance: If one server fails, others can still work.

Example -

Let’s say a multinational company has:

 Oracle DB in the US branch.


 MySQL DB in the India branch.
 SQL Server in the UK branch.

All are managed under a heterogeneous DDBMS using client/server architecture. A client
from any branch can:

 Send a unified query.


 The middleware splits it and sends parts to each server.
 Each server runs its part and returns results.
 Middleware combines results and gives a final response to the client.

Advantages -

✅ Flexible integration of various systems


✅ Leverages existing DBMS investments
✅ Centralized access to distributed data
✅ Users don’t need to worry about differences in DBMS systems

 Peer to peer –
In Peer-to-Peer DDBMS architecture, all nodes (or sites) in the distributed system are
considered equal. Each site:

 Acts as both a client and a server.


 Can initiate requests and process incoming requests.
 Manages its own database, possibly using a different DBMS (hence heterogeneous).

There is no central coordinator or master server.

Characteristics of P2P Heterogeneous DDBMS -

 Decentralized Control: No single site is in charge. Every peer can operate independently.
 Autonomy: Each node controls its own data and DBMS.
 Heterogeneity: Different peers may use different DBMSs, OSs, hardware, data models, etc.
 Collaboration: Peers communicate directly to fulfill distributed queries or transactions.

Components -

1. Peers (Nodes)
o Each peer:
 Runs a DBMS (could be different from others).
 Stores a portion of the distributed database.
 Can query, update, or share data with other peers.
2. Communication Layer
o Handles message passing between peers.
o Ensures that peers can send requests and responses effectively.
3. Query Processing & Translation Mechanism
o Needed to translate a query from one peer’s format to the target peer’s DBMS
format.

Example -

Imagine 3 branches of a company:

 One uses MySQL, one uses Oracle, one uses PostgreSQL.


 No branch is a master.
 A peer (say Oracle) sends a query that requires data from PostgreSQL.
 The PostgreSQL peer receives the query, processes its part, and sends back the result.
 The Oracle peer compiles the full result for the user.

Advantages of P2P Architecture -

✅ Scalability – You can add more peers easily.


✅ No single point of failure – If one node fails, others continue to work.
✅ Flexibility – Each peer can be independently maintained or updated.
✅ Autonomy – Organizations can maintain control over their own data and DBMS.

Challenges -

Complex coordination – Harder to manage consistency, concurrency, and transactions.


Security issues – Peers must trust each other or use secure protocols.
Data integration – Schema and data model differences need reconciliation.
Performance tuning – No central optimizer; each peer must handle its part efficiently.

Peer-to-Peer vs. Client/Server in DDBMS -

 Control: Client/Server – centralized server; Peer-to-Peer – decentralized.
 Coordination: Client/Server – easier; Peer-to-Peer – more complex.
 Scalability: Client/Server – moderate; Peer-to-Peer – high.
 Fault Tolerance: Client/Server – lower; Peer-to-Peer – higher.
 Autonomy: Client/Server – low (clients depend on the server); Peer-to-Peer – high (each peer is independent).

 MDBS –
MDBS stands for Multidatabase System.

It's a system that enables users to access and manipulate multiple autonomous databases,
which are:

 Heterogeneous (different DBMSs, data models, schemas)


 Distributed (located in different physical locations)
 Independent (each DBMS manages its own data, security, and transactions)

Key Features of MDBS -

 Heterogeneity: Supports different types of DBMSs (e.g., Oracle, MySQL, MongoDB)
 Autonomy: Each local DBMS works independently
 Transparency: Users don’t see the complexity of distributed and heterogeneous systems
 Non-intrusive: Does not require changes in the local databases

MDBS Architecture -

MDBS uses a middleware layer (often called a Global Database Management System –
GDBMS) between the user and the local DBMSs.

Components:

1. User/Application
o Sends queries as if querying a single global database.
2. Global Schema
o A unified logical view of all the underlying databases.
o Hides differences in data models, query languages, and schemas.
3. MDBS Middleware (GDBMS)
o Translates global queries to local queries.
o Sends sub-queries to respective local DBMSs.
o Collects, integrates, and sends the final result back to the user.
4. Local DBMSs
o Independent databases with their own DBMSs.
o Handle sub-queries and return results to middleware.

Example Scenario -

Let’s say a retail company has:

 Inventory data in MySQL in Warehouse A.


 Sales data in Oracle in Branch B.
 Customer data in PostgreSQL at HQ.

The user wants to run a query like:


Show customers who bought products that are currently low in stock.

MDBS:

1. Receives the query.
2. Translates it into 3 subqueries – one for each DBMS.
3. Collects and integrates the results.
4. Sends a single, unified result to the user, as sketched below.
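A minimal sketch of the global query (the CUSTOMER, SALES and INVENTORY names, their columns, and the stock threshold are hypothetical; the middleware would split this into one sub-query per local DBMS):

sql
SELECT DISTINCT c.CustName
FROM CUSTOMER c
JOIN SALES s ON s.CustID = c.CustID
JOIN INVENTORY i ON i.ProductID = s.ProductID
WHERE i.Stock < 10;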

Types of Transparency in MDBS -


 Location Transparency: Users don’t know where data is stored.
 Replication Transparency: Users don’t see duplicated data.
 Heterogeneity Transparency: Users don’t deal with different query languages or data models.
 Schema Transparency: Users see one global schema, not the individual ones.

Advantages

✅ Seamless access to multiple, independent databases


✅ No need to redesign existing databases
✅ Facilitates data integration in large organizations
✅ Reduces data redundancy and inconsistency

UNIT II
Data Distribution Alternatives:

 Data Distribution Alternatives: Design Alternatives – localized data –

In a Distributed Database Management System (DDBMS), data can be distributed across multiple sites in different ways depending on performance, reliability, and organizational needs.

The main data distribution alternatives are:

1. Centralized
2. Fragmentation
3. Replication
4. Localized Data
5. Hybrid approaches

What is Localized Data?

Localized Data refers to a design where:

 Each site stores and manages only its own data.


 No data is shared or duplicated with other sites.
 Data access is usually local only, meaning queries and transactions are executed at
the site where the data resides.

Characteristics of Localized Data -

 Autonomy: Each site operates independently.
 Simplicity: Easy to manage, no complex synchronization needed.
 No Redundancy: Data is not replicated across sites.
 Low Communication Overhead: Since data access is local, network usage is minimal.

Example -

Suppose a company has offices in:



 Mumbai, Delhi, and Bangalore.

Each office stores only its own employees’ records:

 Mumbai DB: Employee_Mumbai


 Delhi DB: Employee_Delhi
 Bangalore DB: Employee_Bangalore

If an employee from Mumbai accesses the system, only the Mumbai DB is involved.

Advantages -

✅ Performance: Fast access to local data


✅ Autonomy: Sites function independently
✅ Reduced Network Traffic: No cross-site data transfer
✅ Simplicity: Easier to manage, secure, and maintain

Disadvantages -

No Global View: Cannot easily perform queries across all sites


No Data Redundancy: Data loss at one site means total loss
Limited Functionality: Difficult to run company-wide reports or analytics
Low Availability: If one site is down, its data is unavailable

Use Cases -

 Branch-based applications (e.g., local HR records, local sales data)


 Decentralized systems where independence is more important than integration

 Distributed Data Fragmentation – Vertical –


Fragmentation is a technique used to split a database table into smaller parts (fragments)
that can be distributed across different sites in a distributed database system.

There are three main types:

1. Horizontal Fragmentation – splits by rows
2. Vertical Fragmentation – splits by columns (covered in this section)
3. Mixed/Hybrid Fragmentation – combination of both

What is Vertical Fragmentation?

Vertical Fragmentation splits a table into subsets of columns (attributes).

Each fragment contains:

 A subset of attributes from the original table.


 A copy of the primary key to ensure that the fragments can be rejoined later when
needed.

Example -

Let’s say we have a table:

Employee(EmpID, Name, Address, Phone, Salary, Department)

We can create vertical fragments like:

Fragment 1 (Site A):

EmpID, Name, Address

Fragment 2 (Site B):

EmpID, Phone, Salary, Department



EmpID is kept in both fragments as it is the primary key and is needed to reconstruct the
original table using a join.
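A minimal sketch of how these fragments could be materialized (assuming the Employee table above exists and the DBMS supports CREATE TABLE ... AS SELECT):

sql
-- Fragment 1 at Site A: identification and contact attributes
CREATE TABLE Employee_Frag1 AS
SELECT EmpID, Name, Address FROM Employee;

-- Fragment 2 at Site B: phone, payroll and department attributes
CREATE TABLE Employee_Frag2 AS
SELECT EmpID, Phone, Salary, Department FROM Employee;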

Why Use Vertical Fragmentation?

 Improves performance: Users at different locations may only need specific columns.
 Enhances security: Sensitive data (like salary) can be kept separate.
 Saves bandwidth: Only required attributes are transferred or accessed.

Reconstructing the Original Table -

You can get the original table back by performing a natural join or equijoin on the primary
key (e.g., EmpID):

sql
SELECT *
FROM Fragment1
NATURAL JOIN Fragment2;

Advantages -

✅ Improved query performance (only needed columns are accessed)


✅ Better data locality (store relevant columns closer to usage)
✅ Enhanced data security and access control
✅ Saves storage and reduces network traffic

Disadvantages -

Requires storing primary key in all fragments


Complex to manage and maintain consistency
Joins may be expensive if data is needed from multiple sites
Poor choice if most queries need all attributes (i.e., full table)

When to Use Vertical Fragmentation -

Use vertical fragmentation when:

Different applications or users access only specific columns.

You want to separate sensitive attributes like salary, login credentials, etc.

You want to optimize access based on locality of reference.

 Horizontal (primary & derived) –

Horizontal Fragmentation divides a table by rows (tuples) based on certain conditions. Each
fragment contains a subset of rows, but all columns.

1. Primary Horizontal Fragmentation

This is the simplest form of horizontal fragmentation.

Definition:

Primary Horizontal Fragmentation divides a table based on conditions applied to the table's
own attributes.

Example:

Table: EMPLOYEE

EmpID  Name   Age  Dept   City
101    Raj    28   Sales  Mumbai
102    Priya  32   HR     Delhi
103    Aman   30   Sales  Mumbai
104    Neha   35   IT     Bangalore

Fragmentation:

 Fragment 1 (Mumbai Employees):

sql
SELECT * FROM EMPLOYEE WHERE City = 'Mumbai';

 Fragment 2 (Other Cities):

sql
SELECT * FROM EMPLOYEE WHERE City != 'Mumbai';

✅ All columns retained, only rows filtered
✅ Easy to implement
✅ Reconstruction is a UNION of the fragments (sketched below)
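A minimal reconstruction sketch, assuming the two fragments were stored as tables named EMPLOYEE_Mumbai and EMPLOYEE_Others (hypothetical names):

sql
-- rebuild the original EMPLOYEE relation from its horizontal fragments
SELECT * FROM EMPLOYEE_Mumbai
UNION
SELECT * FROM EMPLOYEE_Others;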

2. Derived Horizontal Fragmentation -

Definition:

Derived Horizontal Fragmentation is based on a condition in another related table (i.e., the fragmentation is derived from another table's fragmentation).

It's mostly used in foreign key relationships.

Example:

Two tables:

 DEPARTMENT(DeptID, DeptName, Location)


 EMPLOYEE(EmpID, Name, Age, DeptID)

Let’s say DEPARTMENT is horizontally fragmented by Location:

 Dept_Fragment_1: Location = 'Delhi'


 Dept_Fragment_2: Location = 'Mumbai'

Now, we want to fragment EMPLOYEE so that each employee is in the same site as their
department.

That means:

sql
CREATE TABLE EMPLOYEE_Fragment_1 AS
SELECT * FROM EMPLOYEE
WHERE DeptID IN (SELECT DeptID FROM DEPARTMENT WHERE Location = 'Delhi');

CREATE TABLE EMPLOYEE_Fragment_2 AS
SELECT * FROM EMPLOYEE
WHERE DeptID IN (SELECT DeptID FROM DEPARTMENT WHERE Location = 'Mumbai');

✅ This depends on another table’s fragmentation

Advantages of Horizontal Fragmentation-

✅ Efficient local access to data


✅ Improves query performance
✅ Reduces network traffic
✅ Better security and data control

Disadvantages -

Reconstructing data from multiple fragments needs UNION


Query processing can be complex across sites
Derived fragmentation adds dependency on other tables
Summary Table -

 Primary: based on conditions on the same table; filters rows using conditions like City='Delhi'; reconstruction: UNION of all fragments.
 Derived: based on a related table's fragmentation; uses JOIN or IN to select matching rows; reconstruction: JOIN with the related fragments.

 Hybrid –

Hybrid Fragmentation is a combination of horizontal and vertical fragmentation. It can be done in any order:

 Horizontal first, then vertical, or


 Vertical first, then horizontal

This method gives maximum flexibility for optimizing performance, security, and data
access patterns in distributed systems.

Types of Hybrid Fragmentation -

1. Horizontal → then Vertical

 First: Split the table by rows (conditions)


 Then: Split each fragment by columns

✅ Example:

Original Table: EMPLOYEE(EmpID, Name, Age, Dept, City, Salary)

Step 1: Horizontal Fragmentation by City

 Fragment A: City = 'Mumbai'


 Fragment B: City = 'Delhi'

Step 2: Vertical Fragmentation

 Fragment A1: EmpID, Name, Age


 Fragment A2: EmpID, Salary
 Fragment B1: EmpID, Name, Dept
 Fragment B2: EmpID, Salary

2. Vertical → then Horizontal

 First: Split the table by columns


 Then: Split each fragment by rows (conditions)

Example:

Original Table: STUDENT(RollNo, Name, Course, City, Marks)

Step 1: Vertical Fragmentation

 Fragment 1: RollNo, Name, Course


 Fragment 2: RollNo, City, Marks

Step 2: Horizontal Fragmentation

 Fragment 1-A: Course = 'CS'


 Fragment 1-B: Course = 'IT'
 Fragment 2-A: City = 'Delhi'
 Fragment 2-B: City = 'Mumbai'

Why Use Hybrid Fragmentation?

✅ Advantages:

 Tailors data distribution to complex application needs
 Optimizes both query performance and data locality
 Allows fine-grained control over what data goes where
 Improves security (store sensitive columns separately)
 Reduces data transfer across networks

❌ Disadvantages:

 More complex design and implementation


 Query reconstruction requires both JOIN and UNION
 Harder to maintain consistency and synchronization

When to Use Hybrid Fragmentation?

 When different regions need different rows and different departments need different
columns
 When applications frequently access certain subsets of both rows and columns
 In large, complex distributed systems needing optimization and modularity

Reconstruction of Hybrid Fragments -

To get the original table back:

1. Vertical fragments are first joined using the primary key
2. Then, horizontal fragments are unioned, as sketched below
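A minimal sketch, assuming the fragments are stored as tables Frag_A1, Frag_A2, Frag_B1 and Frag_B2 (hypothetical names) and that both horizontal groups were split vertically on the same attribute sets:

sql
-- join the vertical fragments within each horizontal group, then union the groups
SELECT * FROM Frag_A1 NATURAL JOIN Frag_A2
UNION
SELECT * FROM Frag_B1 NATURAL JOIN Frag_B2;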

 general guidelines –
1. Completeness

✅ All data from the original relation must appear in the fragments.

No data should be lost. When we reconstruct the fragments, we must get back the entire
original table.

Example:
If we fragment a STUDENT table into STUDENT_Mumbai and STUDENT_Delhi, together they
must contain all rows.

2. Reconstructability (Recomposability)

✅ It must be possible to rebuild the original relation from the fragments using relational
operations like:

 UNION (for horizontal)


 JOIN (for vertical)
 both (for hybrid)

Why?
To ensure the original data view is not lost and can be used if needed by global queries.

3. Disjointness

✅ In horizontal fragmentation, the same row should not appear in multiple fragments.

Each tuple (row) must belong to only one horizontal fragment.

Note: In vertical fragmentation, attributes may overlap only if the primary key is repeated
(for reconstruction).

4. Minimality

✅ Avoid unnecessary fragmentation. Don't create too many small fragments.



Too many fragments increase complexity, overhead, and network communication.

Only fragment when it improves:

 Query performance
 Data locality
 Security or privacy

5. Data Locality

✅ Place fragments close to where they are most frequently accessed.

Improves performance and reduces latency.

Example: Keep Delhi-based customer data in the Delhi server.

6. Application Affinity

✅ Fragment based on application-specific access patterns.

Look at how applications use the data—which columns and rows they mostly access.

This ensures:

 Faster queries
 Less data movement
 Better user experience

7. Security and Privacy

✅ Sensitive data (like salary, passwords, etc.) can be placed in separate fragments with
restricted access.

Vertical fragmentation is often used for this.

8. Ease of Management

✅ Fragmentation should simplify, not complicate, database management.

Avoid over-engineering. Simpler, meaningful fragmentation is easier to monitor and maintain.

 correctness rules –
1. Completeness

 Rule: Every item of data in the original relation must appear in at least one fragment.
 Goal: Ensure no data is lost during fragmentation.
 ✅ This guarantees that you don’t lose any tuples or attributes when breaking the table
apart.

Example:
If a CUSTOMER table has 1000 rows, all 1000 rows must be distributed across fragments—
none should be missing.

2. Reconstruction (Recomposability)

 Rule: It must be possible to reconstruct the original relation using relational operations.
 For horizontal fragmentation → use UNION
 For vertical fragmentation → use NATURAL JOIN (using primary key)

Goal: Ensure that we can get back the original table if needed.

Example:
Fragments A and B of EMPLOYEE can be joined (vertically) or unioned (horizontally) to
rebuild the full EMPLOYEE table.

3. Disjointness

 Rule: In horizontal fragmentation, each tuple (row) should appear in only one
fragment.
 For vertical fragmentation, overlapping is allowed only for the primary key (needed for
joining).

Goal: Avoid duplicate data across fragments unless necessary.

Note: This rule may be relaxed in some advanced DDBMS where replication is involved, but
for pure fragmentation, disjointness is essential.
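These rules can be spot-checked with simple queries. A minimal sketch for two horizontal fragments STUDENT_Mumbai and STUDENT_Delhi of a STUDENT table keyed on RollNo (hypothetical names):

sql
-- Completeness: the fragments together should cover every row of STUDENT
SELECT (SELECT COUNT(*) FROM STUDENT) AS original_rows,
       (SELECT COUNT(*) FROM STUDENT_Mumbai) +
       (SELECT COUNT(*) FROM STUDENT_Delhi) AS fragment_rows;

-- Disjointness: no row should appear in both fragments
SELECT COUNT(*) AS overlapping_rows
FROM STUDENT_Mumbai m
JOIN STUDENT_Delhi d ON d.RollNo = m.RollNo;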

 Distribution transparency – location –


Distribution Transparency means that the user does not need to know how data is
distributed across different sites in a DDBMS. It hides the complexity of the system's
distributed nature.

It includes:

 Location transparency
 Fragmentation transparency
 Replication transparency
 Access and naming transparency, etc.

Location Transparency -

Definition:

Location Transparency means that the user can access data without knowing its physical
location (site) in the network.

In other words, the data looks like it’s all in one place, even if it's stored across multiple
locations.

Example:

Let’s say you have a distributed database with customer data:

 Site 1: CUSTOMER_Mumbai
 Site 2: CUSTOMER_Delhi

A user runs this query:

sql
SELECT * FROM CUSTOMER WHERE City = 'Mumbai';

✅ With location transparency, the user does not need to specify that the "Mumbai" data is
at Site 1.
The DDBMS handles it internally.

🔍 Without Location Transparency:

User would need to write something like:

sql
SELECT * FROM SITE1.CUSTOMER_Mumbai;

This is complex and tightly couples the query to the site structure.

How It Works:

 The DDBMS uses middleware, data dictionary, and query processors to locate the
data.
 It rewrites and optimizes the query based on data location and network efficiency.

Benefits of Location Transparency:


 Simplicity: Users don’t need to know where data is stored
 Flexibility: Data can be moved without affecting users
 Scalability: New sites can be added easily
 Security: Hides the physical structure, adds protection

 Fragmentation –
Fragmentation Transparency means that users are not aware of how a database table has
been fragmented (broken into parts) and distributed across different sites.

The system hides the details of fragmentation (horizontal, vertical, or hybrid) from users.

Why It Matters?

In a fragmented database:

 A single logical table may be split into fragments.


 These fragments may be located on different servers/sites.

Yet, the user can query it as if it’s a single table.

Example:

Suppose the EMPLOYEE table is horizontally fragmented:

 EMPLOYEE_India (stored in Mumbai)


 EMPLOYEE_USA (stored in New York)

A user runs:

sql
SELECT * FROM EMPLOYEE WHERE Country = 'India';

With fragmentation transparency, the DDBMS knows it should query EMPLOYEE_India only,
even though the user wrote just EMPLOYEE.

Types of Fragmentation:

1. Horizontal Fragmentation
→ Rows (tuples) are divided across fragments.
2. Vertical Fragmentation
→ Columns (attributes) are divided across fragments.
3. Hybrid Fragmentation
→ Combination of both horizontal and vertical.

Levels of Fragmentation Transparency -


 Full Transparency: User sees the database as a whole; no knowledge of fragmentation required.
 Partial Transparency: User knows some details (e.g., fragment names); some knowledge required.
 No Transparency: User must know the fragment structure and write queries accordingly; full knowledge required.

How DDBMS handles it:

 The Query Processor figures out where each piece of data is located.
 It rewrites the user's query to target the correct fragments.
 It merges the results and gives a unified output.

Benefits of Fragmentation Transparency -


 Simplicity: Users query as if data is in one place
 Data Independence: Schema changes in fragments don’t affect users
 Query Optimization: The system handles performance under the hood
 Maintenance Flexibility: Fragments can move or change without user impact

 Replication –
Replication Transparency means the user does not need to know that data is copied and
stored at multiple sites.

The DDBMS manages all copies of data behind the scenes, ensuring consistency and
availability.

What is Replication?

Replication = Copying data from one site to multiple sites.


Used for:

 Faster access (local data availability)


 Reliability (fault tolerance)
 Load balancing

Example:

Let’s say a PRODUCT table is replicated in:

 Mumbai
 Delhi
 Bangalore

User query:

sql
SELECT * FROM PRODUCT WHERE Category = 'Electronics';

With replication transparency:

 The user doesn't know (or care) where the query is executed.
 The system automatically uses the nearest or least loaded replica.
 The user sees a single, consistent result.

Replication Types -

 Full Replication: Entire database is copied at every site
 Partial Replication: Only some tables/fragments are copied
 No Replication: Only one copy of data exists (not ideal in a DDBMS)

How DDBMS Ensures Transparency:

 Maintains synchronization across all replicas.



 Uses concurrency control and commit protocols.


 Hides all the complexity from the end user.

Benefits of Replication Transparency -


 High Availability: Data can be accessed even if one site is down
 Performance: Queries use the nearest/fastest replica
 Simplicity: User writes the query as if there is only one copy
 Fault Tolerance: Data survives hardware failure or crashes

 Impact of distribution on user queries – No Global Data Dictionary (GDD) –


A Global Data Dictionary is a centralized catalog that stores metadata about the entire
distributed database, such as:

 Names of tables and attributes


 Locations (sites) of data/fragments
 Information on replication, fragmentation
 Access rights and schema definitions

What Happens When There is No GDD?

Without a GDD, the system (and the user) lacks centralized knowledge about where and
how the data is distributed.
This creates major challenges when executing queries.

⚠️ Impact on User Queries (No GDD)

 Query Complexity: Users must know where data is located and how it is fragmented or replicated.
 No Transparency: No location, fragmentation, or replication transparency — users must write site-specific queries.
 Manual Integration: Users may need to manually join data from different fragments or sites.
 Increased Errors: Higher chance of incorrect or incomplete queries due to missing metadata.
 Lower Efficiency: Query optimizers can't plan globally; performance may suffer due to inefficient execution paths.
 Security Risk: Harder to enforce consistent access control without centralized knowledge.

Example: With vs Without GDD

Suppose a table EMPLOYEE is horizontally fragmented:

 EMP_INDIA at Site A
 EMP_USA at Site B

✅ With GDD:

User query:

sql
SELECT * FROM EMPLOYEE WHERE Country = 'India';

➡️DDBMS checks GDD and knows to query only Site A.

Without GDD:

User must know and write:

sql
SELECT * FROM SITE_A.EMP_INDIA WHERE Country = 'India';

➡️This creates tight coupling between application and data location — difficult to maintain
or scale.

Workarounds Without GDD

In systems without a GDD:

 Each site maintains a local dictionary


 Middleware or applications must be hardcoded with metadata
 Some systems use broadcast queries (inefficient)
 Data integration must be handled manually

 GDD containing location information –


A Global Data Dictionary (GDD) is a centralized metadata repository that holds information
about the structure and distribution of a distributed database.
One of its most critical roles is to store the location information of various data fragments
across different sites.

GDD and Location Information -

When a table in a distributed system is fragmented or replicated, the GDD keeps track of:

 Where each fragment is stored


 Which site has which replica
 How the data is partitioned (horizontally, vertically, or hybrid)

This means the GDD holds complete knowledge of data distribution across the entire
system.
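Internally, the GDD can be pictured as a catalog table. A minimal sketch (the table and column names are illustrative, not taken from any particular DDBMS):

sql
CREATE TABLE GDD_FRAGMENTS (
    LogicalTable VARCHAR(50),   -- e.g. 'CUSTOMER'
    FragmentName VARCHAR(50),   -- e.g. 'CUSTOMER_Asia'
    FragmentType VARCHAR(20),   -- 'HORIZONTAL', 'VERTICAL' or 'HYBRID'
    Predicate    VARCHAR(200),  -- e.g. Region = 'Asia'
    SiteName     VARCHAR(50)    -- e.g. 'Site A'
);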

How it works in practice:

Suppose you have a table CUSTOMER that is horizontally fragmented like this:

 Fragment 1: CUSTOMER_Asia → Stored at Site A


 Fragment 2: CUSTOMER_Europe → Stored at Site B
 Fragment 3: CUSTOMER_USA → Stored at Site C

The GDD will store:

 That the logical table is CUSTOMER


 The names of each fragment
 The conditions for fragmentation (e.g., Region = 'Asia')
 The exact site locations (A, B, C) where each fragment resides

Query Processing Using GDD -

When a user runs a query like:

sql
SELECT * FROM CUSTOMER WHERE Region = 'Asia';

The DDBMS refers to the GDD to determine:

1. Which fragment contains the relevant data (CUSTOMER_Asia)


2. Where this fragment is stored (Site A)
3. How to route the query and retrieve results

All this happens automatically — the user doesn’t need to know the internal structure or
location.

Benefits of GDD with Location Info -

 Location Transparency: Users write normal queries without knowing where data is
stored.
 Efficient Query Routing: System can send queries to the right site, avoiding
unnecessary network traffic.
 Easier Maintenance: Administrators can change fragment locations or add replicas without affecting users.
 Better Optimization: Query optimizer can choose the best site for execution based on
load, proximity, or availability.

 Example on fragmentation –
Sample Table: EMPLOYEE

EmpID  EmpName  Dept   Salary  City
101    Aditi    HR     30000   Mumbai
102    Ramesh   IT     50000   Pune
103    Neha     Sales  35000   Delhi
104    Karan    IT     52000   Mumbai
105    Riya     HR     31000   Chennai

1. Horizontal Fragmentation -

In horizontal fragmentation, rows (tuples) are divided based on a condition.

Example:

Let's fragment based on City:

 Fragment 1 (EMP_Mumbai): Employees in Mumbai


→ Rows: 101, 104
 Fragment 2 (EMP_Others): Employees in other cities
→ Rows: 102, 103, 105

Use:

Each site handles local employee data. Queries related to Mumbai employees are routed
only to Fragment 1.

2. Vertical Fragmentation -

In vertical fragmentation, columns (attributes) are split, usually keeping the primary key in
all fragments.

Example:

 Fragment 1 (EMP_Contact): EmpID, EmpName, City


 Fragment 2 (EMP_Work): EmpID, Dept, Salary

Each site may focus on different tasks:

 HR department uses Fragment 1 (contact info)


 Payroll uses Fragment 2 (work details)

3. Hybrid Fragmentation -

Hybrid is a combination of both horizontal and vertical fragmentation.

Example:

Step 1 – Horizontal fragmentation based on City:

 Fragment A: Employees in Mumbai


 Fragment B: Employees in other cities

Step 2 – Vertical fragmentation inside Fragment B:

 Fragment B1: EmpID, EmpName, City



 Fragment B2: EmpID, Dept, Salary

So you end up with:

 EMP_Mumbai (full data of Mumbai employees)


 EMP_Others_Contact (non-Mumbai contact info)
 EMP_Others_Work (non-Mumbai work info)

UNIT III
Semantic Data Control:

 View Management -
Semantic Data Control refers to the techniques used in a Distributed Database
Management System (DDBMS) to manage how users interact with data meaningfully and
securely. It ensures that users see the data they’re allowed to see, in the way that makes
the most sense to them — without exposing the complexities of data distribution.

View Management -

What is a View? - A view is a virtual table — it doesn’t store data physically, but is
created by a query that pulls data from one or more base tables.

Example:

sql
CREATE VIEW IT_Employees AS
SELECT EmpID, EmpName, Salary
FROM EMPLOYEE
WHERE Dept = 'IT';

Here, users querying IT_Employees will only see IT department employees, even though
the data may be stored in various fragments or locations.

Purpose of View Management -

1. Simplify User Access
Users don't need to know about data fragmentation, replication, or physical storage. Views give a clean, logical picture.
2. Ensure Data Security & Privacy
Sensitive data (like salary or addresses) can be hidden by designing restricted views.
3. Provide Logical Independence
Even if the base tables change (schema or location), the views can remain the same
for users.
4. Integrate Distributed Data
Views can combine data from different sites to provide a global view.

How Views Help in Distributed Databases -

In a distributed system:

 Data may be fragmented (horizontal/vertical)


 Data may be replicated
 Users are spread across locations

So the DBMS uses views to:

 Hide the complexity


 Combine data from different fragments
 Deliver semantic transparency

Example in Distributed Setup -

Assume:

 EMP_Pune, EMP_Mumbai, and EMP_Delhi are fragments


 You want a single view of all employees

sql
CREATE VIEW All_Employees AS
SELECT * FROM EMP_Pune
UNION
SELECT * FROM EMP_Mumbai
UNION
SELECT * FROM EMP_Delhi;

To users, it appears as one single All_Employees table — they don’t need to know about
fragmentation.

 Authentication – database authentication –

What is Authentication?

Authentication is the process of verifying the identity of a user or system trying to access a
database.
It ensures that only authorized users are allowed to connect and perform actions in the
database system.

What is Database Authentication?

Database Authentication is a type of authentication handled within the database management system (DBMS) itself.
It checks the username and password stored in the database to confirm the identity of a
user.

3. How Database Authentication Works

1. User sends login credentials (username + password) to the DBMS.


2. The DBMS checks these credentials against its internal user database.
3. If credentials match, access is granted; otherwise, access is denied.

4. Types of Database Authentication

a. Internal Authentication

 Username and password are stored within the DBMS (e.g., Oracle, MySQL,
PostgreSQL).
 Passwords are often stored in hashed form.
 Suitable for smaller environments.

b. External Authentication

 DBMS delegates authentication to an external service like OS authentication or LDAP.


 Useful in large organizations to centralize authentication.

Example: Oracle supports OS-based authentication (OPS$username).

c. Third-Party Authentication

 Involves integration with services like:


o Kerberos
o Active Directory
o OAuth / OpenID Connect
 Used in modern cloud and enterprise database environments.

5. Database Authentication vs Authorization


 Purpose: Authentication verifies who the user is; authorization determines what the user can do.
 Happens when: Authentication happens first, during login; authorization happens after login, when accessing resources.
 Example: Authentication checks the username & password; authorization allows SELECT or INSERT on a table.

6. Security Best Practices in Database Authentication

✅ Use strong passwords (with complexity and expiry rules)


✅ Enable account lockout after repeated failed attempts
✅ Use multi-factor authentication (MFA) if supported
✅ Encrypt passwords and connections (SSL/TLS)
✅ Regularly audit login attempts and access logs
✅ Avoid default usernames (like admin, root)

7. Example (MySQL)

Creating a User:

sql
CREATE USER 'abhi'@'localhost' IDENTIFIED BY 'StrongPass@123';

Granting Privileges:

sql
GRANT SELECT, INSERT ON CompanyDB.* TO 'abhi'@'localhost';

Viewing Users:

sql
SELECT User, Host FROM mysql.user;

8. Authentication in Distributed Databases

In Distributed DBMS:

 Each site may perform local authentication, or


 A central authentication server may manage access
 Ensures secure access across all sites

 OS authentication –
Operating System (OS) Authentication allows users to access a Database Management
System (DBMS) without entering a separate database username and password.

Instead, the DBMS uses the user's OS login credentials to verify their identity.

Why Use OS Authentication?

✅ Centralized User Management – You manage users at the OS level, not separately in the
DBMS.
✅ Simplified Login – Users don’t need to remember multiple passwords.
✅ Enhanced Security – Uses secure OS-level protocols and tools.
✅ Integration – Often used in enterprise environments where OS authentication is already in
place.

How It Works -

1. User logs in to the Operating System.


2. When accessing the database, the DBMS reads the OS username.
3. If that OS user is mapped to a database user, access is granted.

DBMS Examples That Support OS Authentication -

 Oracle: OS Authentication (OPS$)
 SQL Server: Windows Authentication
 PostgreSQL: Peer Authentication (Linux)
 MySQL: Supported with plugins

Oracle Example – OS Authentication -

In Oracle, users are created with the prefix OPS$ (default) to enable OS authentication.

sql
CREATE USER OPS$ABHI IDENTIFIED EXTERNALLY;
GRANT CONNECT TO OPS$ABHI;

 Now, if OS user ABHI logs in from the OS shell, Oracle recognizes and allows them in.

SQL Server – Windows Authentication -

 SQL Server supports Windows Authentication using Active Directory.


 Users log in using their Windows credentials (no separate DB login needed).
 It's considered more secure than SQL Server authentication.

PostgreSQL – Peer Authentication -

 On Linux, PostgreSQL uses peer authentication for local connections.


 The DB user must match the current Linux system user.

Example: If your Linux user is abhi, and there’s a PostgreSQL user abhi, you can log in with:

bash
psql -U abhi

No password is needed if peer authentication is enabled.

Security Considerations -

Only use OS authentication on secure networks/systems


Ensure strong OS-level access controls
OS user accounts should follow strict password policies
Audit both OS and DB logs for tracking user access

Advantages -

 No need to store passwords in DBMS


 Simplified user and password management
 Seamless integration with enterprise security

Disadvantages -

 Tightly coupled with OS user accounts


 Difficult to use in cloud or cross-platform environments
 Limited flexibility for complex role-based access

 Access Rights –
What are Access Rights?

Access rights (also known as privileges or permissions) define what actions a user or role is
allowed to perform on database objects like tables, views, or procedures.

They are essential for controlling user activities and securing data within a database.

Common Types of Access Rights -

 SELECT: Permission to read data from a table or view.


 INSERT: Permission to add new records to a table.
 UPDATE: Permission to modify existing records in a table.
 DELETE: Permission to remove records from a table.
 EXECUTE: Permission to run a stored procedure or function.
 ALTER: Permission to modify the structure of a database object.
 DROP: Permission to delete a table or other database object.
 INDEX: Permission to create or drop indexes.
 REFERENCES: Permission to create foreign key constraints using another table.

Granting and Revoking Access Rights -

Most DBMSs provide two main commands to manage access rights:

 GRANT: Assigns one or more privileges to a user or role.


 REVOKE: Removes previously granted privileges.

Example (SQL):
sql
GRANT SELECT, INSERT ON Employees TO Abhi;
REVOKE INSERT ON Employees FROM Abhi;

Access Control Methods -

There are two main types of access control:

 Discretionary Access Control (DAC):


o The owner of the object decides who gets access.
o Example: A user can grant access to their tables.
 Mandatory Access Control (MAC):
o Access is based on fixed policies and classifications.
o Common in highly secure environments like government systems.

Role-Based Access Control (RBAC) -

Instead of assigning permissions to individual users, permissions are assigned to roles, and
users are assigned roles.

This simplifies management in large organizations.

Example:
sql
CREATE ROLE HR_Manager;
GRANT SELECT, UPDATE ON Employees TO HR_Manager;
GRANT HR_Manager TO Abhi;

Importance of Access Rights -

 Prevent unauthorized data access or changes.


 Protect sensitive or critical information.
 Ensure data integrity and confidentiality.
 Support compliance with security policies and regulations.

Best Practices -

 Grant minimum necessary privileges (principle of least privilege).


 Use roles to manage permissions effectively.
 Regularly review and audit user access rights.
 Revoke unnecessary privileges promptly.
 Use views to limit access to specific data fields or rows.

 Semantic Integrity Control – Centralized & Distributed –


Semantic Integrity Control refers to the rules and constraints applied to ensure that the
data in a database is logically correct, meaningful, and adheres to the business logic or
semantics.

It enforces data validity and consistency based on the meaning of the data, not just its
format or type.

In a Centralized DBMS -

In centralized systems, all data resides in a single database, so enforcing semantic integrity
is straightforward.

Examples of Semantic Integrity Rules:

 Foreign key constraints: Ensuring references between tables are valid.


 Check constraints: E.g., age >= 18
 Domain constraints: Validating data types and allowed values.
 Unique constraints: Ensuring no duplicate values.
 Application-defined rules: E.g., Salary cannot be less than Minimum Wage.

Enforcement:

 Enforced directly by the DBMS engine, typically through declarative constraints (sketched below).
 All data is accessible from one point, making constraint checking efficient and consistent.
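A minimal sketch of how such rules are declared in standard SQL (the EMPLOYEE/DEPARTMENT schema and the salary floor are hypothetical):

sql
CREATE TABLE EMPLOYEE (
    EmpID  INT PRIMARY KEY,                        -- uniqueness via the key
    Age    INT CHECK (Age >= 18),                  -- check constraint
    Salary DECIMAL(10,2) CHECK (Salary >= 15000),  -- application-defined rule (placeholder minimum wage)
    DeptID INT REFERENCES DEPARTMENT(DeptID)       -- foreign key constraint
);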

In a Distributed DBMS -

In distributed systems, the database is spread across multiple sites, which introduces
additional challenges in maintaining semantic integrity.

Challenges:

 Data fragmentation: Data may be vertically, horizontally, or hybrid fragmented.


 Replication: Same data may exist at multiple sites – keeping them consistent is
critical.
 Latency: Network delays can cause temporary inconsistencies.
 Autonomy: Each site may manage its own constraints independently.

Enforcement Techniques:
a. Global Constraints Management

 A global controller or coordination mechanism checks constraints across sites.


 Example: Ensuring a foreign key in one site refers to a valid record on another site.

b. Local Enforcement with Coordination

 Sites enforce constraints locally, but sync with others for global constraints.

c. Two-Phase Commit (2PC)

 Used in transactions to ensure atomicity across sites.


 Helps in maintaining semantic integrity in multi-site updates.
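
A minimal sketch of global constraint management, assuming two hypothetical sites modelled as in-memory dictionaries and a coordinator that validates a cross-site foreign key before allowing a local insert (a real system would issue a network call to the remote site's DBMS instead):

python
# Hypothetical fragments: employees stored at Site A, departments at Site B
site_a_employees   = {}                          # emp_id -> dept_id
site_b_departments = {10: "HR", 20: "Sales"}     # dept_id -> dept_name

def remote_department_exists(dept_id):
    """Stands in for a remote lookup executed at Site B."""
    return dept_id in site_b_departments

def insert_employee(emp_id, dept_id):
    """Coordinator enforces the global FK employees.dept_id -> departments.dept_id."""
    if not remote_department_exists(dept_id):
        raise ValueError(f"Global constraint violated: department {dept_id} not found at Site B")
    site_a_employees[emp_id] = dept_id           # local insert at Site A

insert_employee(1, 10)         # passes the cross-site check
try:
    insert_employee(2, 99)     # fails the cross-site check
except ValueError as e:
    print(e)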

Comparison: Centralized vs Distributed Semantic Integrity -

Feature | Centralized DBMS | Distributed DBMS
Constraint Location | Central database | Spread across multiple sites
Enforcement | Simple and efficient | Complex, needs coordination
Delay or Latency | Minimal | Can affect constraint checking
Data Replication | Not required | Often used, needs consistency mechanisms
Autonomy | Single system | Each site may have local control
Coordination | Not needed | Required for global constraints

 Cost of enforcing semantic integrity –


Semantic integrity ensures that the data in a database adheres to defined business rules
and logic, such as constraints, relationships, and valid data formats.
However, maintaining this integrity incurs certain costs, both in centralized and distributed
database systems.

Types of Costs Involved -



a. Performance Overhead

 Constraint checking (e.g., CHECK, FOREIGN KEY, UNIQUE) adds extra operations
during INSERT, UPDATE, DELETE.
 Query execution can be slower, especially if semantic checks involve joins or lookups
in related tables.

b. Storage Cost

 Metadata for constraints, rules, and triggers consumes space.


 Additional indexes may be created to enforce constraints (e.g., for uniqueness or
foreign key checks).

c. Complexity in Implementation

 Enforcing complex business rules often requires:


o Stored procedures
o Triggers
o Custom application logic

This increases development and maintenance effort.

d. Transaction Management Overhead

 Ensuring semantic integrity often requires transactions.


 For multi-step operations, atomicity must be preserved, which can increase locking
and blocking, affecting concurrency.

In Distributed Database Systems -

The cost increases significantly due to additional factors:

a. Network Latency

 Cross-site checks (e.g., verifying a foreign key on another server) increase response
time.

b. Synchronization and Coordination

 Two-phase commit (2PC) protocol for atomic transactions across sites is expensive.
 Replication needs synchronization to maintain consistency.

c. Data Transfer Costs

 Enforcing constraints may require transferring remote data over the network.

d. Conflict Resolution

 Replicated or fragmented data may lead to conflicts (e.g., duplicate primary keys),
requiring resolution strategies, which are costly to implement and run.

Trade-offs -

Due to the high cost, especially in distributed systems, database designers must balance:

 Integrity vs Performance
 Global vs Local enforcement
 Immediate vs Deferred constraint checking

Optimization Strategies -

 Use localized constraints wherever possible.


 Apply deferred constraint checking when immediate enforcement isn't critical.
 Minimize cross-site dependencies in distributed setups.
 Use views or materialized views to encapsulate logic where direct enforcement is
complex.

 Optimize queries and indexing to support efficient integrity checks.

 Query Processing : Query Processing Problem –


Query processing refers to the steps a Database Management System (DBMS) takes to
translate a user query (usually in SQL) into an execution plan that retrieves data from the
database efficiently. The goal is to execute queries in the most efficient way, minimizing
resource usage like time, CPU, and I/O.

Query Processing Problem -

The Query Processing Problem involves the challenge of transforming a high-level query
(SQL) into an efficient execution plan that optimizes performance while ensuring
correctness. The problem arises due to factors like:

 Complex queries with joins, nested subqueries, etc.


 Large data sets.
 Diverse data access paths (indexes, tables, partitions).
 Distributed environments (in distributed databases).

The key tasks involved in query processing are:

 Parsing: Converting the SQL query into a query tree.


 Optimization: Finding the most efficient way to execute the query.
 Execution: Performing the operations described by the optimized query plan.

Query Processing Steps -

a. Parsing the Query

 The DBMS parses the SQL query into an internal format, usually a query tree or
abstract syntax tree (AST).
 Syntax and semantic errors are checked during this stage.

b. Query Rewrite and Optimization

 The query is optimized to minimize the use of resources like CPU, memory, and disk
I/O.
 Transformation rules (such as converting subqueries to joins or simplifying
expressions) may be applied.
 Cost-based optimization: The DBMS considers different strategies (index scan, full
table scan, etc.) and estimates the cost of each strategy.
 Heuristic optimization: The system applies common rules (like pushing down
selections or projections) to improve performance.

c. Query Execution Plan Generation

 The query optimizer generates a query execution plan, which defines:


o The order in which operations (like joins, selections, projections) will be executed.
o Which indexes will be used, if any.
o The algorithm for performing joins (e.g., nested loop, hash join, etc.).

d. Execution of the Query

 The execution engine runs the plan, performing the necessary operations to retrieve
or modify the data.
 It processes the data and returns the result to the user.

Challenges in Query Processing -

a. Optimization Complexity

 Finding the best execution plan for a query can be an NP-hard problem, especially with
multiple joins or complex predicates.
 The optimizer needs to balance between time and space complexity.

b. Data Distribution and Fragmentation (in Distributed Databases)

 In a distributed DBMS, data may be fragmented or replicated across multiple sites,


and the system must consider this when planning query execution. Data localization
becomes crucial to minimize network traffic.

c. Large-Scale Data Handling

 As databases grow larger, query optimization becomes more difficult. Indexing,


partitioning, and parallel processing are common strategies to handle this.

d. Join Ordering and Optimization

 Join ordering is a critical decision, as the order of joins can drastically affect the
execution time.
 For example, performing a nested loop join on large tables can be inefficient, while a
hash join might be faster.

Query Processing in Distributed Databases -

In distributed databases, query processing has additional complexities:

 Data distribution: Data might be fragmented (horizontal or vertical) or replicated. The


optimizer needs to choose the best access path, considering both local and remote
data.
 Cost of communication: Data transfer between sites can be expensive, so minimizing
inter-site communication is critical.
 Parallelism: Distributed systems can leverage parallel query execution to process data
faster by splitting the task across different servers.

Query Processing Techniques and Algorithms -

a. Heuristic Optimization

 This involves applying a set of rules (heuristics) to simplify the query.


o Example: for SELECT * FROM A WHERE B = 1, apply the selection on B as early as
possible and use an index on B (if one exists) to avoid an unnecessary full-table scan.

b. Cost-Based Optimization

 The query planner generates multiple query plans and estimates their cost (based on
factors like I/O, CPU usage, etc.) before selecting the cheapest plan.
o Cost models are used to estimate the execution time of different query plans.

c. Join Algorithms

 Common algorithms for optimizing joins include:


o Nested loop join: Simple but inefficient for large datasets.
o Sort-merge join: Efficient when both tables are sorted.
o Hash join: Useful for large, unsorted tables.
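
The sketch below contrasts a nested-loop join with a hash join on small in-memory tables (the rows and column positions are assumptions made only for illustration):

python
employees   = [(1, "Abhi", 10), (2, "Riya", 20), (3, "Sam", 10)]   # (emp_id, name, dept_id)
departments = [(10, "HR"), (20, "Sales")]                          # (dept_id, dept_name)

def nested_loop_join(outer, inner):
    # For every outer row, scan the entire inner table: O(|outer| * |inner|)
    return [(e, d) for e in outer for d in inner if e[2] == d[0]]

def hash_join(build, probe):
    # Build a hash table on the smaller input, then probe it once per row: roughly O(|build| + |probe|)
    buckets = {}
    for d in build:
        buckets.setdefault(d[0], []).append(d)
    return [(e, d) for e in probe for d in buckets.get(e[2], [])]

print(nested_loop_join(employees, departments))
print(hash_join(departments, employees))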

Example of Query Processing Problem -

Query:
sql
SELECT name
FROM employees e, departments d
WHERE e.dept_id = d.dept_id AND d.dept_name = 'Sales';

Query Processing Steps:

1. Parsing: Convert SQL to a query tree.


2. Optimization:
o Join ordering: Decide whether to join employees and departments first, or filter
by dept_name before joining.
o Consider using a hash join or nested loop join, depending on the size of the
tables.

3. Execution Plan Generation:


o Plan 1: Use an index on dept_id and perform the join using a hash join.
o Plan 2: Perform a sequential scan of both tables and use a nested loop to join
them.
4. Execution: Execute the chosen plan and return the result.
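
Most DBMSs let the chosen execution plan be inspected. As an illustration, the sketch below uses SQLite (via Python's sqlite3 module) as a stand-in engine and asks it to explain its plan for the query above; the index name idx_dept is an assumption added so the optimizer has an access path to consider:

python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (emp_id INTEGER, name TEXT, dept_id INTEGER)")
conn.execute("CREATE TABLE departments (dept_id INTEGER, dept_name TEXT)")
conn.execute("CREATE INDEX idx_dept ON departments(dept_id)")

query = """SELECT e.name
           FROM employees e, departments d
           WHERE e.dept_id = d.dept_id AND d.dept_name = 'Sales'"""

# EXPLAIN QUERY PLAN reports which tables are scanned and which indexes are used
for row in conn.execute("EXPLAIN QUERY PLAN " + query):
    print(row)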

 Layers of Query Processing Query Processing in Centralized Systems –


Parsing & Translation –

Parsing & Translation -

 Parsing and translation is the first layer in the query processing pipeline. This stage
involves syntax checking and semantic analysis of the query.

Stages of Query Processing in Centralized Systems -

a. Parsing- The parsing stage involves taking the SQL query submitted by the user and
converting it into a format that the database system can understand and process, such as
an internal representation like a query tree or abstract syntax tree (AST).

Steps in Parsing:

1. Lexical Analysis: The query is broken down into individual tokens (e.g., keywords like
SELECT, FROM, WHERE, column names, and operators like =).
2. Syntax Analysis: The tokens are analyzed according to the syntax rules of the SQL
language. This ensures the query follows the correct syntax, checking for things like
proper use of commas, parentheses, and SQL keywords.
3. Semantic Analysis: The DBMS checks that the query makes sense in terms of the
schema of the database. For example, it checks that the table and column names
exist, the data types are compatible, and joins are correctly defined.

If any syntax or semantic errors are found, the parser will return an error message. If the
query is correct, it is then converted into an internal format for further processing.

b. Translation- Once parsing is done, the query is translated into an internal representation
(usually in the form of a relational algebra expression or a query tree).

Steps in Translation:

 Logical Plan Generation: The query is converted into a logical query plan. This plan
outlines the steps needed to retrieve the result based on relational algebra operations
like selection, projection, join, etc.
 Query Tree Construction: The logical query plan is represented in the form of a query
tree or abstract syntax tree (AST). Each node in the tree represents an operation (e.g.,
selection, projection), and the edges represent data flows between operations.

For example, for a query like:

sql
SELECT name FROM employees WHERE age > 30;

The translation could involve:

 Selection (age > 30) on the employees table.


 Projection to select only the name column.
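
In relational-algebra notation (σ for selection, π for projection), this plan can be written as:

π name ( σ age > 30 ( employees ) )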

This intermediate representation helps the DBMS understand the required operations
before optimization.

After Parsing and Translation:

Once the parsing and translation stages are complete, the query will move on to further
stages such as:

 Optimization: The system will apply optimization techniques (cost-based or heuristic)


to create an execution plan that minimizes the resource consumption (I/O, CPU, etc.).

 Execution: The optimized query plan is executed, and the database engine retrieves
the result.

Key Concepts in Parsing & Translation --

a. Query Tree Representation

 The query tree is a hierarchical structure representing the logical operations required
to execute the query. It breaks down the query into smaller operations like select, join,
project, etc.

b. Relational Algebra

 The query is often translated into relational algebra expressions. These expressions
define operations like:
o Selection (σ): Filtering rows based on a condition.
o Projection (π): Selecting specific columns from a table.
o Join (⨝): Combining rows from two or more tables based on a related column.

c. Abstract Syntax Tree (AST)

 The abstract syntax tree is another form of internal representation. It’s used to
represent the hierarchical structure of the query’s components without the detailed
syntax information.

Example: Parsing and Translation Process -

SQL Query:

sql
SELECT name, salary FROM employees WHERE age > 30;

Step 1: Lexical Analysis


The query is split into tokens:

 SELECT, name, salary, FROM, employees, WHERE, age, >, 30

Step 2: Syntax Analysis


The DBMS checks that the query conforms to SQL syntax, ensuring that:

 The SELECT clause is followed by column names.


 The FROM clause specifies a table.
 The WHERE clause is followed by a valid condition.

Step 3: Semantic Analysis

 The DBMS checks that name, salary, and age are valid column names in the
employees table.
 It checks that the condition age > 30 is valid (i.e., the column age exists and is of a
numeric type).

Step 4: Translation into Logical Plan

 A logical query plan is created, and the query tree is generated.

Example query tree:

 Root: Projection (name, salary)


o Child: Selection (age > 30)
 Child: Table (employees)

Benefits of Parsing and Translation in Query Processing

 Error Detection: Helps identify syntax and semantic errors early in the process.
 Simplification: The translation process breaks down complex queries into simpler
operations, making optimization easier.
 Foundation for Optimization: The query tree or relational algebra expressions are used
to explore different execution strategies and choose the most efficient one.

 Optimization –

Query optimization is the process of finding the most efficient execution plan for a given
SQL query. The goal is to minimize the cost of executing a query in terms of resources like
CPU time, disk I/O, and memory usage, while still ensuring the correctness of the result.

The query optimization process takes place after the parsing and translation of the query
(into a query tree or relational algebra) and before the execution of the query. Optimization
aims to improve performance by reducing the computational cost.

Steps in Query Optimization:

Query Plan Generation -

Once the query has been parsed and translated, the DBMS creates multiple candidate
execution plans. These plans differ in how operations like joins, projections, and selections
are performed.

For example, given a query like:

sql
SELECT name, salary FROM employees WHERE age > 30;

Candidate plans might include:

 A full table scan on the employees table, applying the selection (age > 30) later.
 Using an index on the age column and applying the selection before the projection.

Cost Estimation -

Each candidate execution plan is associated with a cost. The DBMS uses a cost model to
estimate the resource usage of each plan. The cost typically includes:

 CPU time: The time required for processing operations.


 Disk I/O: The time spent accessing data from disk.
 Memory usage: The amount of memory required to execute the query.

The optimizer compares the costs of different plans and selects the one with the minimum
cost.

Heuristic Optimization -

Heuristic optimization involves applying a set of rules or heuristics to improve the query
without considering the actual cost. These rules are generally used to simplify and reorder
operations in the query tree.

Some common heuristics include:

 Push selection: Move a selection operation as close as possible to the data (i.e.,
applying filters earlier in the process to reduce the data size).
 Push projection: Move a projection operation to eliminate unneeded columns as early
as possible.
 Commutativity of join: Change the order of joins to reduce the size of intermediate
results.
 Join elimination: Eliminate unnecessary joins that don’t affect the result (e.g., if one of
the tables is not actually used in the output).

Example Heuristic:

If we have the query:

sql
SELECT name FROM employees, departments WHERE employees.dept_id =
departments.dept_id AND departments.name = 'HR';

A common heuristic would push the WHERE clause filtering departments.name = 'HR' as
close as possible to the departments table, reducing the number of rows involved in the
join operation.

Cost-Based Optimization -

Cost-based optimization involves using statistics about the data (such as table size, index
usage, and data distribution) to estimate the execution cost of different query plans and
choose the most efficient one.

Cost Model -

The cost model calculates the expected cost of each candidate plan, based on factors such
as:

 Number of tuples in tables.


 Available indexes.
 Join types (nested-loop, hash join, merge join).
 Data distribution (skew, sorting).

Example:

 Full Table Scan: The cost will depend on the number of pages in the table.
 Index Lookup: The cost will depend on whether an index exists on the age column and
the selectivity (i.e., how many records satisfy age > 30).
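
A toy version of such a cost comparison is sketched below; the page count, index depth, and selectivity figures are made-up assumptions used only to show how the cheaper plan is picked:

python
# Assumed statistics about the employees table
table_pages   = 10_000    # pages the table occupies on disk
matching_rows = 2_000     # estimated number of rows with age > 30 (from selectivity statistics)
index_depth   = 3         # height of a B-tree index on age

def cost_full_scan():
    # Read every page of the table once
    return table_pages

def cost_index_lookup():
    # Traverse the index, then fetch roughly one page per matching row
    return index_depth + matching_rows

plans = {"full table scan": cost_full_scan(), "index lookup on age": cost_index_lookup()}
print(plans, "-> choose:", min(plans, key=plans.get))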

 Code generation –
Steps in Code Generation:

Query Execution Plan Translation -

After query optimization, the DBMS generates a query execution plan. This plan is a series
of low-level operations that represent the most efficient way to access, filter, join, and
return data based on the SQL query.

The query execution plan is typically represented in the form of operators (e.g., select,
project, join) and their input/output relations.

Conversion to Intermediate Code -

Instead of directly converting the optimized query plan into machine code, modern DBMS
often translate it into intermediate code. This code acts as a bridge between the high-level
query plan and the machine-specific execution code.

Intermediate Code: The intermediate code is usually in the form of relational operators
(e.g., SELECT, JOIN) expressed in a low-level language that can be interpreted by the DBMS.

Example Intermediate Code Representation:

For the SQL query:

sql
SELECT name, salary FROM employees WHERE age > 30;

A possible intermediate code could be:

markdown
1. Scan employees table.
2. Apply selection condition (age > 30).
3. Apply projection (name, salary).
4. Return results.

Generation of Machine Code or Execution Code -



Once the intermediate code is ready, the DBMS translates it into the machine code or a set
of low-level instructions that can be directly executed by the system. This is the final code
that interacts with the operating system and hardware to fetch data from the database.

In some DBMS systems, this machine code might be translated into a query execution
engine or virtual machine code that can be interpreted by the DBMS's internal runtime.

Example Machine Code Instructions:

For a query involving a table scan, the machine code might involve low-level instructions
for accessing disk pages, reading data blocks, and applying filters on the fly.

Execution of Code -

Once the machine code (or intermediate code) is generated, the DBMS executes the
instructions step by step to retrieve the result of the query. The execution involves:

 Accessing data from tables, indexes, or caches.


 Applying selections, projections, and joins.
 Returning the result to the user or storing the result in memory.
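
As a minimal sketch (not how any particular engine actually generates code), the intermediate plan above can be modelled as a pipeline of operators, each pulling rows from the one beneath it; the in-memory rows are an assumption for illustration:

python
# Assumed rows of the employees table: (name, salary, age)
EMPLOYEES = [("Abhi", 52000, 35), ("Riya", 40000, 28), ("Sam", 61000, 41)]

def scan(rows):                          # 1. Scan employees table
    for row in rows:
        yield row

def select(rows, predicate):             # 2. Apply selection (age > 30)
    for row in rows:
        if predicate(row):
            yield row

def project(rows, columns):              # 3. Apply projection (name, salary)
    for row in rows:
        yield tuple(row[c] for c in columns)

plan = project(select(scan(EMPLOYEES), lambda r: r[2] > 30), (0, 1))
print(list(plan))                        # 4. Return result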

Example: Code Generation for a Simple Query -

Consider the query:

sql
SELECT name, salary FROM employees WHERE age > 30;

Step 1: Parsing & Translation

 The query is parsed and translated into a query tree or logical plan.

Step 2: Optimization

 Heuristic and cost-based optimization are applied to generate an efficient execution


plan.

Step 3: Query Execution Plan

The execution plan might look like this:

1. Scan the employees table (table scan or index scan).


2. Apply the selection condition: age > 30.
3. Apply the projection: Select the name and salary columns.
4. Return the result.

Step 4: Code Generation

 The DBMS generates an intermediate code:

markdown
1. Scan employees table.
2. Apply selection (age > 30).
3. Apply projection (name, salary).
4. Return result.

 This intermediate code is translated into low-level machine code that involves reading
blocks of the employees table, applying the condition age > 30, and fetching only the
name and salary columns.

Step 5: Execution

 The DBMS executes the generated machine code and returns the result, which is the
name and salary of employees aged over 30.

 Example Query Processing in Distributed Systems – Mapping global query to


local –

In a Distributed Database Management System (DDBMS), data is spread across multiple


locations or nodes. A global query refers to a query that is issued from the global user
perspective, meaning that it is intended to operate on the entire distributed database, even
though the data resides on different sites.

To process this global query in a distributed system, the DBMS needs to translate the global
query into local queries that can be executed at each of the different sites where the data
is located. This process involves several steps, which include query decomposition, local
query generation, and global query optimization.

Global Query to Local Queries Mapping: Example -

Global Query:

Let's assume we have a distributed database system with two sites: Site A and Site B. The
tables stored at these sites are as follows:

 Site A: Table employees (with columns employee_id, name, department_id)


 Site B: Table departments (with columns department_id, department_name)

The global query is:

sql
SELECT employees.name, departments.department_name
FROM employees, departments
WHERE employees.department_id = departments.department_id

This query requires joining the employees table at Site A with the departments table at Site
B. The challenge is to map this global query to local queries that can be executed on each
site.

Query Decomposition -

The first step is to decompose the global query into smaller components. These
components consist of subqueries that can be executed at each site. Query decomposition
identifies how data from multiple sites can be accessed and used to satisfy the global
query.

 Join Operation: The global query performs a join between the employees and
departments tables on the department_id column.
 Selection: The join is performed using the employees.department_id =
departments.department_id condition.

Decomposition Plan:

 First, a local query to retrieve employees at Site A will be generated.


 A second local query to retrieve departments at Site B will be generated.
 The results from both sites will need to be joined locally or transferred to a central site
for the join operation.

Local Query Generation -

Now, based on the decomposition, the system generates local queries to execute at each
site.

Local Query for Site A (Employees):

At Site A, we need to retrieve the employee names along with their department_id. The
local query is:

sql
SELECT employee_id, name, department_id FROM employees

This query can be executed locally at Site A to fetch the relevant data.

Local Query for Site B (Departments):

At Site B, we need to retrieve the department names along with the department_id. The
local query is:

sql
SELECT department_id, department_name FROM departments

This query is executed locally at Site B.

Query Execution Strategy -

Once the local queries are generated, the system must decide how to execute them:

Approach 1: Shipping Data to the Query Processor

In this approach, the local data retrieved from both sites can be sent over the network to
the query processor site, where the final join operation is performed.

1. Execute Local Query on Site A: The SELECT query for employees is executed on Site A.
2. Execute Local Query on Site B: The SELECT query for departments is executed on Site
B.
3. Send Results to a Central Site (Query Processor): The results from both sites are sent
to a central site (or the site that issued the query).
4. Join at the Central Site: The results are joined on the department_id at the query
processor site.

Approach 2: Shipping Queries to Data

Alternatively, if the join condition is highly selective (i.e., there are few employees in each
department), it may be more efficient to ship the join condition directly to the sites. This
means that:

1. Site A Sends Employees Data: Site A sends employee data that matches a certain
department_id to Site B.
2. Site B Performs Join Locally: Site B performs the join with its departments table locally
using the data received from Site A.
3. Return Joined Results: The result is sent back to the user.

 Optimization –
Distributed query processing involves optimizing the way data is transferred and processed
between sites. Several factors can impact optimization:

 Data Transfer Cost: The cost of transferring data between sites (network latency,
bandwidth).
 Join Algorithm: The algorithm used for the join, such as nested-loop join, hash join, or
merge join.
 Local Query Execution: How efficiently the local queries can be executed at each site.
 Query Plan: Whether it's better to perform the join at one site or distribute the join
operation across sites.

For example, if the employees table at Site A is large and the departments table at Site B is
small, the system might decide to send the departments table to Site A rather than sending
the entire employees table to Site B. This minimizes the amount of data transferred.
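
That decision is essentially a comparison of estimated transfer costs. A back-of-the-envelope sketch is shown below; all sizes are assumptions used only for illustration:

python
# Assumed sizes in bytes
employees_size   = 500 * 1_000_000   # large employees fragment at Site A
departments_size = 2 * 1_000_000     # small departments fragment at Site B
join_result_size = 10 * 1_000_000    # estimated size of the joined result

# Strategy 1: ship both fragments to the coordinating site and join there
cost_ship_data = employees_size + departments_size

# Strategy 2: ship the small departments fragment to Site A, join locally,
# then ship only the join result to the coordinator
cost_ship_query = departments_size + join_result_size

print("Ship data to coordinator :", cost_ship_data)
print("Ship small table to data :", cost_ship_query)
print("Cheaper:", "Strategy 2" if cost_ship_query < cost_ship_data else "Strategy 1")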

UNIT IV
Optimization of Distributed Queries:

 Query Optimization –
In Distributed Database Management Systems (DDBMS), query optimization plays a crucial
role in improving the efficiency of query execution. Since data is distributed across multiple
sites, optimization techniques help minimize the total cost of executing queries by reducing
factors such as data transfer time, join costs, and execution time at each site.

The goal of distributed query optimization is to find the most efficient plan to execute a
given query by considering various possible ways the query can be executed across the
distributed system.

Components of Distributed Query Optimization -

The distributed query optimization process involves:

1. Decomposition of the global query into local subqueries.


2. Selection of the most efficient execution strategy.
3. Cost Estimation of different execution plans.
4. Choosing the optimal plan based on cost estimation.

Key Aspects of Query Optimization in Distributed Systems -

1. Join Optimization

Join operations, especially when tables are distributed across different sites, can become a
bottleneck. Efficiently performing joins across sites requires minimizing data transfer costs
and performing joins in an optimized way.

Join Types to Consider:

 Broadcast Join: If one table is small (e.g., a departments table), it can be sent to all
other sites, and local joins can be performed on the data.
 Replicated Join: One of the tables can be replicated across all sites. This minimizes
data transfer but requires careful consideration of memory and storage usage.
 Partitioned Join: If the tables are large but have a common partitioning key (e.g.,
department_id), data can be sent to the site with matching keys for local joins.

Cost of Joins:

 Cost of transferring data: The more data that is transferred, the higher the cost.
 Join algorithms: Choosing between nested-loop join, hash join, or merge join
depending on data distribution.

Example of Join Optimization:

In a scenario where employees and departments are located on different sites, the query
planner might choose between:

 Sending the departments table (if it’s small) to the site containing the employees
table.
 Or, alternatively, performing the join on each site where the data is located and then
combining the results.

2. Data Location Awareness

The location of data is critical for distributed query optimization. Accessing remote data
incurs a cost due to network latency and bandwidth.

Optimization Considerations:

 Data Locality: If both tables involved in a query are located at the same site, no data
transfer is needed. Hence, minimizing remote data accesses is key to reducing costs.
 Partitioning Awareness: If data is partitioned based on certain attributes (e.g.,
department_id), queries that access data from multiple partitions may need to be
restructured to minimize the amount of data transferred.

3. Query Decomposition

Decomposition refers to breaking down a global query into local subqueries that are
executed at the respective sites where the data is stored. The goal is to decompose the
query in such a way that it minimizes the communication overhead and optimizes data
access.

 Centralized Query Optimization –


Centralized query optimization refers to the process of improving the efficiency of
executing queries in a centralized database system. In a centralized database, all data
resides on a single server or site, meaning that the database management system (DBMS)
has direct access to all the data and resources, which simplifies query processing
compared to distributed systems.

The primary goal of centralized query optimization is to determine the most efficient
execution plan for a given query, reducing factors like query execution time, disk I/O
operations, CPU usage, and memory consumption.

The query optimization process involves a cost-based approach, where different strategies
for executing a query are evaluated based on their estimated cost, and the most efficient
one is selected.

Key Steps in Centralized Query Optimization -

1. Parsing:
o The query is parsed and transformed into a logical query tree or query plan. This
tree represents the structure of the query in terms of operations (e.g., SELECT,
JOIN, GROUP BY, etc.).
2. Logical Query Plan:
o The logical query plan is a high-level representation of the query without
considering physical details like indexes or the specific join algorithms to be
used.
o The logical query is then optimized by considering logical equivalences, such as:
 Reordering joins: The order in which tables are joined can significantly affect
query performance.
 Pushdown of selections: Selections (WHERE clauses) are pushed as close as
possible to the base tables to reduce intermediate result sizes.
 Pushdown of projections: Projections (SELECT clauses) are pushed to avoid
unnecessary columns in intermediate steps.
3. Physical Query Plan:
o Once the logical query is optimized, the query plan is translated into a physical
query plan that specifies the actual operations to be executed, including:
 Join algorithms (e.g., nested-loop join, hash join, merge join).
 Access paths (e.g., table scans, index scans).
 Join order (i.e., the order in which tables are joined).
 Sorting and grouping methods.
4. Cost Estimation:
o For each physical plan, a cost estimator calculates the cost of executing that
plan. The cost is usually measured in terms of I/O operations, CPU usage, and
memory usage.
o The DBMS estimates the cost based on statistics about the data, such as:
 Size of tables.
 Index availability.
 Cardinality of intermediate results.
 Data distribution.
5. Query Plan Selection:
o The system evaluates the costs of different physical query plans and selects the
one with the minimum estimated cost.
o The selected plan is then executed to produce the result of the query.

Types of Centralized Query Optimization Techniques -

1. Selection Pushdown:

o This optimization involves moving selection operations (WHERE conditions) as


close to the base tables as possible. By doing so, we filter out unnecessary data
early in the process, reducing the size of intermediate results.

Example:

sql
SELECT name FROM employees WHERE age > 30;

The SQL text itself does not change, but in the optimized plan the selection age > 30 is pushed
down and applied while the employees table is being scanned, so non-matching rows are discarded
immediately instead of being carried into later operators. The benefit is even larger when a
selection can be pushed below a join, because it shrinks the join's input.

2. Projection Pushdown:
o This optimization involves moving projection operations (column selections) as
early as possible in the query processing. By selecting only the required columns
early in the query execution, we reduce the amount of data processed at later
stages.

Example:

sql
SELECT name, department FROM employees WHERE salary > 50000;

Again the SQL is unchanged, but in the optimized plan only the columns that are actually needed
(name, department, and salary for the filter) are carried through the operators, instead of every
column of employees. Narrower intermediate results matter most when the rows then flow into
joins or sorts.

3. Join Reordering:
o The order in which joins are performed can have a significant impact on the
performance of the query. Reordering the joins based on the size of the tables or
the selectivity of join conditions can minimize the intermediate results and
reduce overall computation.

Example: Consider the following query:

sql
SELECT e.name, d.department_name
FROM employees e, departments d, projects p
WHERE e.department_id = d.department_id
AND e.project_id = p.project_id;

In this case, if the employees table is much smaller than projects or departments, the
optimizer may choose to join employees with departments first and then join the
result with projects.

4. Join Algorithm Selection:


o Nested-Loop Join: The simplest join algorithm, where for each row in the first
table, the entire second table is scanned. This method is efficient for small tables
or when indexes are available on the join key.
o Merge Join: This method works efficiently when both tables are sorted on the join
attribute. It merges the two sorted tables to produce the final result.
o Hash Join: This method builds a hash table for the smaller table and then scans
the larger table to find matching rows. It is efficient when there are no indexes on
the join attributes and when the tables are large.

The optimizer chooses the best join algorithm based on factors like the size of the
tables and the availability of indexes.

5. Index Utilization:
o If the query involves conditions on columns with indexes, the optimizer may
choose to use the index to speed up data retrieval. Indexes can help avoid full
table scans and reduce I/O costs.
o The optimizer also considers multi-column indexes and composite indexes when
available.
6. Query Rewriting:
o Query rewriting involves transforming the original query into an equivalent query
that may be more efficient. For instance, the optimizer might transform
subqueries into joins or vice versa, depending on which approach will be more
efficient.

 Join Ordering Distributed Query Optimization Algorithms –


Join ordering is critical in distributed query optimization because the order in which joins
are executed can significantly affect performance, especially when data is spread across
different sites. The goal is to minimize the overall cost of query execution, considering
factors like data transfer cost and join algorithm efficiency.

Here are a few key distributed query optimization algorithms for join ordering:

1. Exhaustive Search (Brute Force)

 Description: This approach evaluates all possible orders in which joins can be
executed.
 Advantages: Guarantees finding the optimal solution.
 Disadvantages: Computationally expensive and inefficient for large queries with many
joins.

2. Dynamic Programming (DP)

 Description: Uses a bottom-up approach, where subproblems (smaller join queries) are
solved first, and their results are reused to solve larger subqueries. It builds the
optimal join order progressively.
 Advantages: More efficient than exhaustive search for larger queries.
 Disadvantages: Still computationally expensive for very large queries.

3. Greedy Algorithms

 Description: This approach chooses the next join based on the locally optimal choice
(e.g., the join that minimizes intermediate result size or data transfer); a minimal
sketch of this heuristic is given after this list.
 Advantages: Faster than exhaustive search and dynamic programming.
 Disadvantages: May not find the globally optimal join order.

4. Heuristic-Based Algorithms

 Description: Uses heuristics or rules of thumb to determine a good join order, like:
o Small-to-Large: Join the smaller tables first to minimize intermediate result sizes.
o Join Commute: Reorder joins based on the cost of the underlying join operations.
 Advantages: Faster and less resource-intensive.
 Disadvantages: Does not always guarantee optimal performance.

5. Join Graph Approach

 Description: The query is represented as a join graph where tables are nodes, and join
conditions are edges. The graph is traversed to determine the best join order.
 Advantages: Visualizes the relationships between tables and can be optimized
effectively.
 Disadvantages: May not always lead to the optimal solution.
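
A minimal sketch of the greedy heuristic mentioned above; the relation cardinalities and the way intermediate sizes are estimated are simplifying assumptions:

python
# Assumed relation cardinalities
sizes = {"employees": 100_000, "departments": 50, "projects": 5_000}

def estimate_join_size(left_size, right_size, selectivity=0.001):
    # Very rough model: |L join R| ~ |L| * |R| * selectivity
    return left_size * right_size * selectivity

def greedy_join_order(relations):
    remaining = dict(relations)
    # Start with the smallest relation, then repeatedly add the relation
    # that keeps the estimated intermediate result smallest.
    current = min(remaining, key=remaining.get)
    order, current_size = [current], remaining.pop(current)
    while remaining:
        best = min(remaining, key=lambda r: estimate_join_size(current_size, remaining[r]))
        current_size = estimate_join_size(current_size, remaining.pop(best))
        order.append(best)
    return order

print(greedy_join_order(sizes))   # e.g. ['departments', 'projects', 'employees']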

 Distributed Transaction Management & Concurrency Control: Transaction


concept –

A transaction is a logical unit of work in a database system. It consists of a sequence of


operations that must be treated as a single entity to maintain the ACID properties
(Atomicity, Consistency, Isolation, Durability). These properties ensure reliable database
operations even in the presence of system failures or concurrency.

Key Properties of Transactions -

1. Atomicity:
o A transaction is atomic, meaning it is all-or-nothing. If any part of the transaction
fails, the entire transaction is rolled back to its initial state.
2. Consistency:
o A transaction ensures that the database starts in a consistent state and ends in a
consistent state, preserving data integrity.
3. Isolation:
o Transactions should be isolated from one another, ensuring that the operations
of one transaction do not interfere with those of another. The effect of a
transaction should be as if it is the only transaction running in the system.
4. Durability:
o Once a transaction has been committed, its changes are permanent, even in the
case of a system failure.

Distributed Transaction Management -

In distributed database systems, transactions can involve multiple databases or sites that
are connected via a network. These systems have to ensure that transactions maintain the
ACID properties across multiple, potentially geographically distributed, locations.

1. Distributed Transactions:
o A distributed transaction involves operations across multiple sites or databases,
which can be heterogeneous (i.e., different DBMSs on different sites).
2. Two-Phase Commit Protocol (2PC):
o This is a common protocol for ensuring atomicity in distributed transactions (a
minimal sketch appears after this list). It works in two phases:
 Phase 1: The coordinator sends a "prepare" message to all participants.
Each participant responds with a "yes" (ready to commit) or "no" (abort).
 Phase 2: If all participants respond with "yes", the coordinator sends a
"commit" message to all, making the transaction permanent. If any
participant responds with "no", the coordinator sends an "abort" message,
and all participants roll back their changes.
3. Three-Phase Commit Protocol (3PC):
o This is an extension of the 2PC, designed to make the protocol more fault-
tolerant by adding an additional phase to avoid blocking in case of failures.
4. Distributed Locking:
o Distributed systems require distributed concurrency control mechanisms to
manage simultaneous access to shared data. These mechanisms ensure that
transactions are isolated and consistent, even when data is stored across
multiple locations.
o Common methods include:
 Two-Phase Locking (2PL): A lock acquisition and release protocol that
prevents other transactions from accessing the data being used in the
current transaction.
 Timestamp Ordering: Each transaction is assigned a timestamp, and
operations are performed in timestamp order to ensure serializability.
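
A minimal sketch of the two-phase commit idea from point 2 above, with participants simulated as local objects (a real implementation would also need write-ahead logging, timeouts, and recovery handling):

python
class Participant:
    def __init__(self, name, can_commit=True):
        self.name, self.can_commit = name, can_commit

    def prepare(self):                    # Phase 1: vote "yes" or "no"
        print(f"{self.name}: voting {'yes' if self.can_commit else 'no'}")
        return self.can_commit

    def commit(self):                     # Phase 2: make changes permanent
        print(f"{self.name}: committed")

    def abort(self):                      # Phase 2: roll back local changes
        print(f"{self.name}: aborted")

def two_phase_commit(participants):
    votes = [p.prepare() for p in participants]   # Phase 1: prepare
    if all(votes):
        for p in participants:                    # Phase 2: commit everywhere
            p.commit()
        return "COMMITTED"
    for p in participants:                        # Phase 2: abort everywhere
        p.abort()
    return "ABORTED"

print(two_phase_commit([Participant("Site A"), Participant("Site B")]))
print(two_phase_commit([Participant("Site A"), Participant("Site B", can_commit=False)]))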

Concurrency Control -

Concurrency control in distributed transactions ensures that multiple transactions can be


executed concurrently while maintaining the ACID properties, especially isolation.

1. Locking-based Concurrency Control:


o Two-Phase Locking (2PL): Ensures serializability by acquiring locks on data before
accessing it and releasing locks after the transaction is done.
o Deadlock Prevention: The system ensures that deadlocks (where two or more
transactions are waiting on each other) are avoided by using techniques like:
 Wait-Die: An older transaction requesting a lock held by a younger one waits; a
younger transaction requesting a lock held by an older one is aborted (dies) and
restarted later.
 Wound-Wait: An older transaction requesting a lock held by a younger one wounds
(aborts) the younger transaction; a younger transaction requesting a lock held by an
older one waits.
2. Optimistic Concurrency Control (OCC):
o Instead of locking, this approach assumes that conflicts will be rare. Transactions
are allowed to execute without restrictions, but before committing, they are
validated to ensure no conflicts occurred. If a conflict is detected, the transaction
is rolled back.
3. Timestamp-based Concurrency Control:
o Each transaction is assigned a timestamp when it starts, and the database
ensures that operations are performed in the order of timestamps. This ensures
serializability by respecting the "happens-before" relationship.

 ACID property –
The ACID properties are a set of principles that guarantee the reliable processing of
database transactions. These properties ensure that transactions are executed in a way
that maintains data integrity, even in the presence of errors or system failures.

Let's go through each of these properties in detail:

1. Atomicity

 Definition: A transaction is atomic, meaning it is indivisible. It is either fully completed


or not executed at all. If any part of the transaction fails, the entire transaction fails
and is rolled back, leaving the database in its original state.
 Example: In a bank transfer between two accounts, if money is deducted from one
account but the addition to the other account fails, the entire transaction will be rolled
back, and no changes will be made.
 Guarantee: Atomicity ensures that there is no partial completion of a transaction. If an
error occurs, the transaction is reversed to maintain integrity.

2. Consistency

 Definition: A transaction takes the database from one consistent state to another
consistent state. It must preserve the database's integrity constraints (e.g., data
validation rules like primary keys, foreign keys, etc.).
 Example: Consider a transaction that transfers money from one bank account to
another. The consistency property ensures that the total amount of money in the
system remains the same before and after the transaction (no money is created or
lost).
 Guarantee: If the database was consistent before the transaction, it will remain
consistent after the transaction is completed. All integrity constraints (e.g.,
uniqueness, referential integrity) are maintained.

3. Isolation

 Definition: Isolation ensures that the operations of one transaction are not visible to
other transactions until the transaction is completed (i.e., committed). This prevents
data anomalies and ensures that transactions do not interfere with each other.
 Example: If two people simultaneously try to withdraw money from the same bank
account, isolation ensures that each transaction is executed as if it were the only one
happening, even if they are running concurrently.
 Levels of Isolation: There are different levels of isolation that balance performance
with correctness.
 Guarantee: Each transaction runs independently of others, ensuring that intermediate
states are not visible to other transactions and preventing issues like dirty reads, non-
repeatable reads, and phantom reads.

4. Durability

 Definition: Durability ensures that once a transaction is committed, the changes it


made to the database are permanent and will not be lost, even in the case of a
system failure (e.g., power outage or crash).
 Example: After completing a bank transfer transaction, even if the system crashes
immediately after, the changes (money transferred from one account to another) will
persist once the system is restored.
 Guarantee: Once a transaction is committed, its effects are saved permanently in the
database and are protected against system crashes, power failures, or other hardware
issues.

 Objectives of transaction management –


1. Ensuring ACID Properties

 Atomicity: Ensures that a transaction is an indivisible unit of work. Either all operations
within the transaction are completed successfully, or none are applied to the
database.
 Consistency: Ensures that a transaction brings the database from one valid state to
another, preserving data integrity and adhering to predefined constraints.
 Isolation: Guarantees that the operations of one transaction are isolated from the
operations of other concurrent transactions, preventing conflicts and anomalies.
 Durability: Ensures that once a transaction has been committed, its effects are
permanent and survive system crashes, power failures, or other types of hardware or
software failures.

2. Concurrency Control

 Objective: To allow multiple transactions to run concurrently while ensuring that the
database remains in a consistent state and that the integrity of the data is not
compromised.
 Methods: Includes the use of locking mechanisms (e.g., Two-Phase Locking) or
timestamp-based protocols to prevent issues like dirty reads, non-repeatable reads,
and phantom reads.
 Goal: Maximize throughput and minimize latency while ensuring that transactions do
not interfere with each other inappropriately.

3. Transaction Scheduling and Ordering

 Objective: To efficiently schedule the execution of transactions to optimize resource


usage and minimize the time required to complete them.
 Goal: Maintain serializability, which means ensuring that the outcome of executing
multiple transactions concurrently is the same as if they were executed sequentially,
one after the other.
 Techniques: Use algorithms such as locking, timestamp ordering, and optimistic
concurrency control to determine the sequence in which transactions are executed.

4. Deadlock Prevention and Resolution

 Objective: To handle situations where transactions are waiting for each other
indefinitely, leading to deadlocks.
 Methods:
o Prevention: Through techniques like Wait-Die or Wound-Wait where transactions
are either forced to wait or aborted based on priority.
o Detection and Recovery: Identifying deadlocks after they occur and taking
corrective actions like rolling back one or more transactions to break the cycle.

5. Transaction Atomicity and Recovery

 Objective: To ensure that a transaction is fully executed (atomic) or, if there is a


failure, to revert the system to a consistent state prior to the transaction.
 Methods:
o Logging: Maintain logs of all transactions to be able to recover in case of failure.
o Checkpointing: Periodically save the state of the database to facilitate faster
recovery.
 Goal: Ensure that no partial transactions are reflected in the database and that the
database can recover from failures to a consistent state.

6. Fault Tolerance and Durability

 Objective: To ensure that the database remains operational and consistent even in the
face of system crashes, power outages, or other unforeseen failures.
 Methods:
o Write-Ahead Logging (WAL): Ensures that all changes are logged before being
applied to the database to facilitate recovery.
o Replication and Backup: Maintain copies of the data across different systems to
ensure availability and durability.
 Goal: Prevent data loss and ensure that once a transaction is committed, its changes
are permanent.

 Types of transactions –
1. Flat Transactions

 Definition: A basic type of transaction that follows the traditional ACID properties from
start to finish.
 Example: Updating a customer's address in a single database operation.
 Characteristics:
o Single start and commit/rollback point.
o No sub-transactions.
o Simple and easy to manage.

2. Nested Transactions

 Definition: A transaction that contains sub-transactions, each of which can be


committed or rolled back independently.
 Example: A banking transaction where one sub-transaction deducts money, and
another updates transaction history.
 Characteristics:
o Parent-child relationship between transactions.
o Sub-transactions can commit independently, but the main transaction commits
only if all sub-transactions succeed.

3. Distributed Transactions

 Definition: A transaction that spans across multiple networked databases or systems.


 Example: Booking a flight, hotel, and rental car across different service databases in
one transaction.
 Characteristics:
o Involves multiple sites.
o Requires distributed concurrency control and two-phase commit protocol (2PC)
for atomicity.

4. Compensating Transactions

 Definition: A type of transaction used to reverse the effects of a previously completed


transaction.
 Example: Canceling an order by issuing a refund and restocking items.
 Characteristics:
o Often used in long-running or business transactions.
o Helps in recovery or rollback in non-ACID environments.

5. Long-Lived Transactions

 Definition: Transactions that run for a longer period and may require intermediate
states to be saved.
 Example: Online shopping cart where items are added over time before checkout.
 Characteristics:
o Can involve partial commits.
o Usually broken into smaller sub-transactions.
o May relax strict ACID properties for flexibility.

6. Real-Time Transactions

 Definition: Transactions that must be completed within a strict deadline or time


constraint.

 Example: ATM withdrawal must process within a few seconds.


 Characteristics:
o Prioritize timing and availability.
o Critical in systems like embedded systems or online trading platforms.

7. Interactive Transactions

 Definition: Transactions where user interaction is involved before completion.


 Example: Online payment process where user inputs OTP before transaction is
completed.
 Characteristics:
o User actions affect flow.
o Often require user input to proceed or confirm.

8. Batch Transactions

 Definition: A group of transactions processed together without manual intervention.


 Example: Payroll processing at the end of the month.
 Characteristics:
o High volume processing.
o Less real-time; more suited to scheduled operations.

9. Temporal Transactions

 Definition: Transactions that depend on time-based data or operations.


 Example: Generating monthly financial reports.
 Characteristics:
o Data is valid only for specific time intervals.
o Requires temporal consistency.

 Objectives of Distributed Concurrency Control –


1. Consistency Maintenance

 Ensure that all copies of data across distributed sites remain consistent.
 Any concurrent execution of transactions should leave the system in a consistent
state.
 Prevent issues like lost updates, dirty reads, or inconsistent reads.

2. Isolation of Transactions

 Each transaction should be executed as if it were the only one in the system.
 Ensure that interleaved execution of multiple transactions doesn't affect their
correctness.
 Guarantees the isolation property of ACID.

3. Serializability

 Maintain serializability, i.e., the concurrent execution of transactions should be


equivalent to some serial (one-by-one) execution.
 This is crucial for correctness in distributed environments.

4. Deadlock Handling

 Prevent or detect and resolve deadlocks that can occur when multiple transactions
wait indefinitely for resources locked by each other.
 Implement timeout, wait-die, wound-wait, or deadlock detection and recovery
mechanisms.

5. Availability and Reliability

 Ensure that transactions proceed smoothly even if some sites or communication links
fail.
 Must allow system to operate in partitioned environments without data corruption.

6. Fairness

 Prevent starvation by ensuring that every transaction eventually gets a chance to


execute.
 Implement fair scheduling and resource allocation policies.

7. Minimal Communication Overhead

 Optimize the control mechanisms to minimize inter-site communication, reducing


latency and overhead.
 Use techniques like locking protocols or timestamp ordering in a distributed manner
efficiently.

8. Recovery Support

 Support transaction rollback and restart in case of failures.


 Ensure that undo/redo logs or checkpoints are coordinated across all involved sites.

9. Transparency

 Make concurrency control transparent to users so they don’t have to worry about
distributed execution details.
 Maintain distribution transparency while enforcing correctness.

10. Scalability

 Allow the concurrency control mechanism to scale with the number of users, sites, and
transactions without significant degradation in performance.

 Concurrency Control anomalies –


1. Lost Update

 Occurs when: Two transactions read the same data and then both write updates
without knowing about each other.
 Result: The second update overwrites the first one, causing it to be lost (a runnable sketch of this anomaly appears after this list).
 Example:
o T1 reads X (value 10)
o T2 reads X (value 10)
o T1 writes X = 15
o T2 writes X = 20 → T1's update is lost

2. Dirty Read (Uncommitted Dependency)

 Occurs when: A transaction reads data that has been modified by another transaction
but not yet committed.
 Result: If the second transaction rolls back, the first transaction has read invalid data.
 Example:
o T1 updates X = 50 (not committed)
o T2 reads X = 50
o T1 rolls back → T2 has used incorrect data

3. Non-Repeatable Read

 Occurs when: A transaction reads the same data multiple times, but gets different
values because another transaction updated the data in between.
 Result: Inconsistent results within the same transaction.
 Example:
o T1 reads X = 10
o T2 updates X = 20 and commits
o T1 reads X again → gets 20

4. Phantom Read

 Occurs when: A transaction executes a query to retrieve a set of rows, and another
transaction inserts or deletes rows that affect the result of the query during the first
transaction.
 Result: Re-executing the same query gives different rows (phantoms).
 Example:
o T1 reads all employees with salary > 50000

o T2 inserts a new employee with salary 60000 and commits


o T1 re-executes the query → gets an extra row (phantom)

5. Write Skew

 Occurs when: Two concurrent transactions read overlapping data, then each
transaction writes to non-overlapping parts based on the data read.
 Result: The final state violates a constraint.
 Example:
o Constraint: At least one doctor on duty
o T1 reads: Doctor A is on duty
o T2 reads: Doctor B is on duty
o T1 sets A off-duty
o T2 sets B off-duty → both commit → no doctor on duty (constraint violated)
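
The lost update from point 1 can be reproduced with two concurrent read-modify-write sequences and no concurrency control. The sketch below is deliberately unsafe; wrapping the read-modify-write in a lock would remove the anomaly:

python
import threading, time

balance = {"X": 10}                 # shared data item

def unsafe_add(amount):
    value = balance["X"]            # read X
    time.sleep(0.01)                # widen the window so the bad interleaving occurs
    balance["X"] = value + amount   # write X, ignoring any concurrent update

t1 = threading.Thread(target=unsafe_add, args=(5,))    # T1 intends X = 15
t2 = threading.Thread(target=unsafe_add, args=(10,))   # T2 intends X = 20
t1.start(); t2.start(); t1.join(); t2.join()

# A serial execution would give 25; here one of the two updates is usually lost
print("Final X:", balance["X"])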

 Methods of concurrency control –


1. Lock-Based Protocols

These methods use locks to control access to data items.

a) Binary Locks

 Each data item can be locked (1) or unlocked (0).


 A transaction must lock before accessing and unlock after use.
 Very basic, but not flexible for complex operations.

b) Shared/Exclusive Locks (Read/Write Locks)

 Shared (S) lock: multiple transactions can read.


 Exclusive (X) lock: only one transaction can write.
 Ensures more fine-grained control.

c) Two-Phase Locking (2PL)

 Phase 1 (Growing): Acquire all locks, no releases.


 Phase 2 (Shrinking): Release locks, no new acquisitions.
 Ensures serializability, but may lead to deadlocks.

2. Timestamp-Based Protocols

 Each transaction gets a timestamp when it starts.


 Transactions execute in order of their timestamps.

a) Basic Timestamp Ordering

 Ensures that older transactions get priority.


 Rejects operations that would violate timestamp order.

b) Strict Timestamp Ordering

 Delays execution until previous conflicting transactions are committed or aborted.


 Prevents cascading rollbacks.

3. Optimistic Concurrency Control

 Transactions execute without restrictions.


 Validation is done before commit.
 If validation fails, the transaction rolls back.
 Best for systems with low contention.

4. Multiversion Concurrency Control (MVCC)

 Keeps multiple versions of data items.


 Readers access older committed versions, writers create new versions.
 Reduces conflicts and improves read performance.

 Used in databases like PostgreSQL, Oracle, etc.

5. Distributed Concurrency Control Methods

Used in distributed database systems, ensuring consistency across multiple nodes.

a) Distributed 2PL

 Similar to 2PL but coordinates locking across sites.


 Ensures global serializability.

b) Distributed Timestamp Ordering

 Extends timestamp ordering using global timestamps.


 Needs synchronization across all sites.

c) Quorum-Based Protocols

 A transaction must obtain read/write quorums from replicas before accessing data.
 Balances availability and consistency in replicated environments.
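
A small sketch of the quorum rule, assuming N replicas with a read quorum R and a write quorum W (the function and parameter names are generic, not tied to any particular system): choosing R + W > N and 2W > N guarantees that every read quorum overlaps the most recent write quorum and that two conflicting writes cannot both succeed.

# Quorum-based replica control: choose R and W so quorums always intersect.

def quorums_valid(n_replicas, read_quorum, write_quorum):
    overlap_rw = read_quorum + write_quorum > n_replicas   # every read sees the latest write
    overlap_ww = 2 * write_quorum > n_replicas             # no two writes succeed in parallel
    return overlap_rw and overlap_ww

print(quorums_valid(n_replicas=5, read_quorum=2, write_quorum=4))  # True
print(quorums_valid(n_replicas=5, read_quorum=2, write_quorum=2))  # False: quorums may miss each other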

6. Deadlock Handling Mechanisms

Concurrency control methods often face deadlocks.

 Deadlock Prevention: Enforce ordering or deny unsafe requests.


 Deadlock Detection: Build wait-for graphs and abort transactions if cycles exist.
 Deadlock Avoidance: Use timestamp-based schemes such as Wait-Die or Wound-Wait (these are also commonly classified as prevention techniques).

 Serializability and recoverability –


Serializability:

Definition: Serializability is the highest level of isolation in a database. It ensures that concurrent execution of transactions is equivalent to some serial execution of the same transactions. In other words, even if transactions are interleaved (executed in overlapping time frames), the final result must be the same as if the transactions were executed one after the other in some order.

Why it is needed:

 To maintain consistency of the database.


 To avoid anomalies like lost updates, dirty reads, or inconsistent analysis.

Types of Serializability:

1. Conflict Serializability:
o Two operations conflict if they access the same data item and at least one of
them is a write.
o A schedule is conflict-serializable if it can be transformed into a serial schedule
by swapping non-conflicting operations.
2. View Serializability:
o Weaker than conflict serializability.
o A schedule is view-serializable if it is view-equivalent to a serial schedule.
o View equivalence checks three conditions: same read operations, same final
writes, and each read returns the same value as in the serial schedule.
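
Conflict serializability is usually tested by building a precedence (serialization) graph and checking it for cycles. Below is a minimal Python sketch, assuming the schedule is given as a list of (transaction, operation, item) tuples (this input format is illustrative):

# Conflict-serializability test: build a precedence graph and look for a cycle.

def precedence_graph(schedule):            # schedule: list of (txn, op, item), op in {"R", "W"}
    edges = set()
    for i, (t1, op1, x1) in enumerate(schedule):
        for t2, op2, x2 in schedule[i + 1:]:
            if t1 != t2 and x1 == x2 and "W" in (op1, op2):
                edges.add((t1, t2))        # t1's conflicting operation precedes t2's
    return edges

def has_cycle(edges):
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
    def reachable(start, target, seen=()):
        return any(n == target or (n not in seen and reachable(n, target, seen + (n,)))
                   for n in graph.get(start, ()))
    return any(reachable(b, a) for a, b in edges)

s = [("T1", "R", "X"), ("T2", "W", "X"), ("T1", "W", "X")]
print(has_cycle(precedence_graph(s)))      # True: T1->T2 and T2->T1, so not conflict-serializable

If the graph is acyclic, any topological order of the transactions gives an equivalent serial schedule.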

Enforcement Methods:

 Two-Phase Locking (2PL): Guarantees conflict serializability.


 Timestamp Ordering Protocols
 Validation Techniques (Optimistic Concurrency Control)

Recoverability:

Definition: Recoverability ensures that the effects of a committed transaction are not lost
due to the failure or rollback of other transactions. In other words, a schedule is
recoverable if a transaction commits only after all transactions whose changes it has read
have also committed.

Why it is needed:

 To preserve database correctness and prevent scenarios where a transaction commits based on uncommitted or aborted data (dirty reads).
 Without recoverability, inconsistencies can remain even after recovery from a crash.

Types of Recoverability:

1. Recoverable Schedule:
o A transaction Tj reads data written by transaction Ti, and Tj commits only after Ti
commits.
o This avoids the cascading effect of aborting multiple dependent transactions.
2. Cascadeless Schedule:
o A stricter form where transactions are only allowed to read data after it has been
committed.
o Avoids cascading rollbacks.
3. Strict Schedule:
o The strictest form where transactions cannot read or write data modified by
uncommitted transactions.
o Guarantees ease of recovery and avoids both dirty reads and cascading
rollbacks.

 Distributed Serializability –
Definition: Distributed serializability is the extension of serializability to distributed
database systems, where data and transactions are spread across multiple sites or nodes.
It ensures that concurrent execution of transactions across the distributed system is
equivalent to some serial execution of the same transactions as if they were executed on a
single system.

In simple terms, distributed serializability ensures global correctness of concurrent transactions just like in centralized databases, but across multiple locations.

Why Distributed Serializability is Important:

 In distributed systems, different parts of a transaction may execute at different sites.


 Maintaining global consistency becomes challenging due to network latency, partial
failures, and concurrent access.
 It ensures that all operations across all sites preserve the consistency of the global
database.

Key Aspects of Distributed Serializability:

1. Global Schedule:
o A global schedule is the union of local schedules from all sites.
o The global schedule must be serializable in the same way as a centralized
schedule.
2. Local vs Global Serializability:
o Local serializability ensures correctness at individual sites.
o However, local serializability does not guarantee global serializability.
o Example: Two sites may execute transactions locally in a serializable order, but
the combined effect might violate global consistency.
3. Global Serialization Graph (GSG):
o A graph representing all transactions and conflicts across all sites.
o If the GSG is acyclic, then the global schedule is serializable.

How Distributed Serializability is Achieved:

1. Two-Phase Locking (Distributed 2PL):


o Ensures global serializability if all sites follow 2PL.
o Transactions acquire all locks before releasing any lock.

2. Timestamp Ordering Protocol:


o Each transaction is assigned a global timestamp.
o All operations across sites are ordered based on these timestamps.
3. Serialization Graph Testing:
o Sites exchange information about transaction dependencies.
o The system ensures that the combined serialization graph has no cycles.
4. Commitment Ordering:
o Transactions are committed in the same order as their serialization order.
o Helps maintain consistency without central coordination.

Challenges:

 Clock synchronization: Difficult to maintain strict timestamp ordering in distributed environments.
 Communication overhead: Synchronizing locks and states between sites increases
traffic.
 Failure handling: Ensuring recovery and rollback without violating serializability is
complex.
 Deadlocks: More frequent due to cross-site locking.

 Enhanced lock based and timestamp based protocols –


In database management, lock-based and timestamp-based protocols are crucial for
ensuring serializability and consistency in concurrent transactions. Both methods have
evolved into enhanced versions to address the limitations of basic protocols. These
enhanced versions are designed to improve concurrency, reduce deadlocks, and handle
distributed environments more efficiently.

Enhanced Lock-Based Protocols:

The basic lock-based protocols use locks (shared or exclusive) to manage concurrent
access to data items. Enhanced versions of these protocols aim to increase concurrency
while maintaining serializability and resolving issues like deadlocks and starvation.

1. Multi-Granularity Locking

 What it is: This enhancement allows locks at different granularity levels (e.g., data
items, pages, tables, databases).
 Purpose: To improve concurrency by allowing transactions to lock broader or narrower
parts of the data, depending on the situation.
 Example: A transaction might lock an entire table when it doesn't need to lock
individual rows, thereby reducing the number of locks and overhead.
 How it works:
o S (Shared) Locks: For reading data.
o X (Exclusive) Locks: For writing data.
o Intent Locks: Indicate that a transaction intends to acquire locks on a lower level
(e.g., row-level).
 Benefits:
o Reduced contention by using more coarse-grained locks where appropriate.
o Improves performance and reduces deadlock risk compared to locking at the
finest granularity (e.g., row-level).
 Drawbacks:
o Complexity in managing locks at multiple levels.
o Can lead to lock escalation (e.g., a row lock might escalate to a table lock).

2. Deadlock Detection and Resolution

 What it is: Enhanced deadlock detection techniques are implemented to identify and
break deadlocks that occur in distributed or highly concurrent environments.
 Purpose: Deadlocks happen when two or more transactions are waiting on each other
indefinitely. Detecting deadlocks and resolving them efficiently is key to maintaining
system performance.
 How it works:
o Wait-for Graphs: A directed graph is created where nodes represent transactions,
and edges indicate waiting relationships.

o If a cycle is detected in the graph, it indicates a deadlock. The system will then
choose one transaction to abort (rollback) to break the cycle.
 Benefits:
o Ensures that the system remains deadlock-free without manual intervention.
o Can operate in distributed systems with high concurrency.
 Drawbacks:
o Overhead in detecting deadlocks, especially in large systems.
o Frequent rollbacks of transactions can impact performance.

3. Two-Phase Locking with Deadlock Prevention (Wait-Die and Wound-Wait)

 What it is: These techniques are enhancements to standard Two-Phase Locking (2PL)
that help prevent deadlocks by controlling the order in which transactions wait for
locks.
 Wait-Die Protocol: If an older transaction requests a lock held by a younger
transaction, the older transaction waits. If a younger transaction requests a lock
held by an older transaction, the younger transaction dies (is rolled back) and is
later restarted with its original timestamp.
 Wound-Wait Protocol: If an older transaction requests a lock held by a younger
transaction, the older transaction wounds (preempts) the younger transaction,
forcing it to roll back. If a younger transaction requests a lock held by an older
transaction, the younger transaction must wait.
 Benefits:
o Helps prevent deadlocks by enforcing strict priority rules.
o Reduces the need for frequent deadlock detection and rollback.
 Drawbacks:
o Risk of starvation (some transactions may never get executed if they are
repeatedly preempted).
o Requires careful priority management.
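
A compact sketch of the two rules, assuming a smaller timestamp means an older transaction (the function names and return strings are illustrative): given the requesting transaction and the current lock holder, each rule decides whether the requester waits or which transaction is rolled back.

# Wait-Die vs Wound-Wait: decide the fate of a lock request using transaction timestamps.
# Smaller timestamp means the transaction is older (started earlier).

def wait_die(requester_ts, holder_ts):
    # Older requester waits; younger requester dies (is rolled back).
    return "wait" if requester_ts < holder_ts else "abort requester"

def wound_wait(requester_ts, holder_ts):
    # Older requester wounds (preempts) the younger holder; younger requester waits.
    return "abort holder" if requester_ts < holder_ts else "wait"

print(wait_die(requester_ts=5, holder_ts=10))    # wait: the older transaction is allowed to wait
print(wait_die(requester_ts=10, holder_ts=5))    # abort requester: the younger one dies
print(wound_wait(requester_ts=5, holder_ts=10))  # abort holder: the older one preempts the younger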

Enhanced Timestamp-Based Protocols:

Timestamp-based protocols assign a global timestamp to each transaction, and transactions are executed in the order of these timestamps. Enhanced versions improve the handling of concurrency, conflict resolution, and efficiency.

1. Thomas' Write Rule

 What it is: An enhancement to basic timestamp ordering protocols that allows certain
writes to be ignored.
 Purpose: To improve the efficiency of the system by reducing unnecessary writes and
allowing greater concurrency.
 How it works:
o If a transaction with an earlier timestamp wants to write a value, but the item
has already been updated by a transaction with a later timestamp, the older write
is ignored (i.e., the update is considered obsolete).
 Benefits:
o Increases system throughput and reduces unnecessary work by ignoring certain
operations.
o Allows better handling of transactions with conflicting writes.
 Drawbacks:
o Introduces complexity in managing transactions.
o May not work well in environments with heavy writes or high contention.
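
A sketch of the write check with Thomas' rule applied, assuming each item carries the read/write timestamps used in basic timestamp ordering (the dictionary layout is illustrative): an obsolete write is silently skipped instead of forcing an abort.

# Thomas' write rule: an out-of-date write is ignored rather than causing an abort.

def write_with_thomas_rule(item, ts, value):
    if ts < item["read_ts"]:          # a younger transaction already read the item: must abort
        return "abort"
    if ts < item["write_ts"]:         # a younger transaction already overwrote it:
        return "ignore write"         # this older write is obsolete, skip it silently
    item["write_ts"] = ts
    item["value"] = value
    return "ok"

x = {"read_ts": 0, "write_ts": 10, "value": 42}
print(write_with_thomas_rule(x, ts=7, value=99))   # ignore write: ts 7 is older than write_ts 10
print(write_with_thomas_rule(x, ts=12, value=99))  # ok: the newer write is applied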

2. Multi-Version Timestamp Ordering (MVTO)

 What it is: This method maintains multiple versions of a data item, which allows
transactions to read old versions while writing new versions of the data.
 Purpose: To provide greater concurrency by allowing non-blocking reads and reducing
conflicts between transactions.
 How it works:
o Each update creates a new version of the data item.
o Read operations return the latest committed version of the data item that is
consistent with the transaction's timestamp.
o Write operations create a new version of the data item, which is then
timestamped.
 Benefits:

o Reduces conflicts between read and write operations, improving system performance and throughput.
o Enables non-blocking reads since readers can access older versions of data.
 Drawbacks:
o Increased storage overhead for maintaining multiple versions of data items.
o Complex version management and cleanup processes.

3. Extended Timestamp Ordering with Cascading Rollbacks

 What it is: This enhancement is aimed at reducing cascading rollbacks by extending the basic timestamp ordering protocol.
 Purpose: To prevent the scenario where one rollback causes other transactions to
rollback, leading to a cascading failure in the system.
 How it works:
o The system ensures that transactions only read data that is guaranteed to be
valid, preventing unnecessary dependencies.
o If a transaction is rolled back, it ensures that other dependent transactions (those
that read from the rolled-back transaction) are also rolled back.
 Benefits:
o Reduces the risk of cascading rollbacks, improving system stability and
transaction recovery.
 Drawbacks:
o Increased complexity in transaction scheduling and dependency management.

 Multiple granularity –
Multiple Granularity Locking (MGL) is an enhancement to basic locking protocols that allows
transactions to lock data at varying levels of granularity (i.e., different sizes or scopes of
data). This approach increases concurrency and reduces the overhead of lock management
by allowing coarser-grained locks when appropriate, while still maintaining serializability
and data consistency.

Key Concepts of Multiple Granularity Locking:

1. Granularity Levels:
o Data Item Level: The finest level of granularity where individual data items (like
rows or fields) are locked.
o Page Level: Locks on database pages (group of records).
o Table Level: Locks on entire tables (group of rows or pages).
o Database Level: Locks on entire databases (all tables and data).
2. Lock Types:
o S (Shared) Lock: Allows transactions to read data, but not modify it.
o X (Exclusive) Lock: Allows transactions to read and write data. No other
transaction can access the data.
o IS (Intent Shared) Lock: Indicates that a transaction intends to lock a subordinate
data item (e.g., a specific row or page) in a shared mode.
o IX (Intent Exclusive) Lock: Indicates that a transaction intends to lock a
subordinate data item in an exclusive mode.
3. Hierarchical Locking:
o Multiple granularity allows a hierarchical lock structure, where each transaction
may lock higher-level entities (like tables or databases) and lower-level entities
(like rows or columns) as needed.
o For example, a transaction might acquire an IX lock on a table, which means it
intends to acquire X locks on individual rows within the table later.

Locking Rules in Multiple Granularity Protocol:

1. Two-Phase Locking Property:


o MGL adheres to the Two-Phase Locking (2PL) principle, which requires a
transaction to acquire all locks before releasing any. This helps ensure
serializability of the transaction schedule.
2. Locking Hierarchy Rules:
o A lower-level lock (such as a row lock) cannot be acquired unless an appropriate
intent lock (IS or IX) is held on the higher-level entity (such as the table or page).

o A transaction must release all locks at a level before it can release locks at a
higher level. For example, a transaction cannot release a table-level lock until all
row-level locks within that table are released.
3. Lock Compatibility:
o Locks on different granularity levels must be compatible to allow concurrent
access by multiple transactions. For example:
 S (Shared) Lock on a table allows other transactions to acquire S locks on
rows, but not X locks.
 IX (Intent Exclusive) Lock on a table allows transactions to acquire X locks
on rows, but prevents acquiring S locks on the table level.
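
A small sketch of such a compatibility check (the matrix below covers IS, IX, S and X only; SIX is omitted for brevity, and the function name is illustrative): a requested mode is granted on a node only if it is compatible with every mode already held there by other transactions.

# Multiple-granularity lock compatibility (IS, IX, S, X; SIX omitted for brevity).

COMPATIBLE = {
    ("IS", "IS"): True,  ("IS", "IX"): True,  ("IS", "S"): True,  ("IS", "X"): False,
    ("IX", "IS"): True,  ("IX", "IX"): True,  ("IX", "S"): False, ("IX", "X"): False,
    ("S",  "IS"): True,  ("S",  "IX"): False, ("S",  "S"): True,  ("S",  "X"): False,
    ("X",  "IS"): False, ("X",  "IX"): False, ("X",  "S"): False, ("X",  "X"): False,
}

def can_grant(requested_mode, held_modes):
    return all(COMPATIBLE[(requested_mode, held)] for held in held_modes)

# T1 holds IX on the Customers table; T2 asks for locks on the same table.
print(can_grant("S", ["IX"]))   # False: a table-level S lock conflicts with IX
print(can_grant("IS", ["IX"]))  # True: another reader may still lock individual rows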

Advantages of Multiple Granularity Locking:

1. Improved Concurrency:
o By allowing coarser-grained locks (such as locking entire tables or pages),
multiple transactions can access different parts of the database concurrently
without interfering with each other, improving throughput.
2. Reduced Locking Overhead:
o Fewer locks are needed for large-scale operations, reducing the complexity and
management overhead of locks compared to fine-grained locking on each
individual data item.
3. Increased Efficiency:
o A transaction can hold a higher-level lock (like a table lock) while locking
individual rows or pages within the table as necessary, allowing it to work on
large chunks of data without constantly acquiring and releasing locks.
4. Maintains Serializability:
o Since MGL is based on the Two-Phase Locking protocol, it guarantees
serializability, the highest level of isolation for transactions.

Disadvantages of Multiple Granularity Locking:

1. Complexity:
o Implementing multiple levels of locking can be complex to manage, especially in
distributed systems.
o It requires careful coordination between locks at different levels to avoid
conflicts.
2. Deadlock Potential:
o Even though MGL tries to reduce contention, the use of hierarchical locking still
leaves the system vulnerable to deadlocks, especially when transactions lock
different levels of granularity in a circular manner.
3. Lock Escalation:
o There can be cases where a transaction holding many fine-grained locks might
escalate to a coarser-grained lock (e.g., from row-level to table-level lock). This
might reduce concurrency and lead to bottlenecks.
4. Lock Contention:
o High contention for the higher-level locks can still cause transactions to block or
delay each other, especially when trying to lock tables or pages.

Example of Multiple Granularity Locking:

Consider a database with three tables: Customers, Orders, and Products. Let’s look at how
a transaction can acquire locks at multiple levels:

1. Transaction 1 (T1):
o T1 acquires an IX lock on the Customers table (indicating it intends to modify
rows in this table).
o T1 then acquires an X lock on a specific row (Customer ID 1001).
o T1 performs its operations on this customer’s data.
2. Transaction 2 (T2):
o T2 wants to read from the same Customers table, so it acquires an IS lock on the
Customers table (IS is compatible with T1's IX lock) and S locks on the rows it reads.
o T2 can still read from different rows (as long as those rows are not locked by
other transactions), improving concurrency.

 Multi version schemes –



Multi-Version Concurrency Control (MVCC) is a database management technique used to handle concurrency in a way that allows multiple transactions to access different versions of the same data item simultaneously. MVCC is primarily used in highly concurrent database systems to prevent read-write conflicts and to allow non-blocking reads.

In MVCC, rather than locking data items, the system maintains multiple versions of data
and ensures that transactions interact with these versions according to their timestamps or
commit order. This reduces contention between read and write operations and provides a
way to maintain high concurrency while ensuring serializability and data consistency.

Key Concepts in Multi-Version Schemes -

1. Multiple Versions of Data Items:


o Each data item in the database can have multiple versions. These versions are
created whenever a transaction makes a modification (such as an update).
o Each version has a timestamp or transaction identifier that indicates the
transaction that created it and its validity range (when it became active and
when it was superseded by a newer version).
2. Read and Write Operations:
o Reads: When a transaction reads a data item, it reads the latest version that is
consistent with its start timestamp (or commit timestamp). This ensures that a
transaction sees a consistent view of the data, even if other transactions are
concurrently updating the same data.
o Writes: When a transaction writes to a data item, a new version of the item is
created with the transaction's timestamp, while the previous version remains
intact for other transactions to read.
3. Version Management:
o Version Control: The system must efficiently manage multiple versions of each
data item, which includes tracking which version is the most current, ensuring
that outdated versions are cleaned up, and handling conflicts between versions.
o Garbage Collection: Old versions of data items that are no longer needed must
be cleaned up periodically to avoid excessive storage overhead.
4. Snapshot Isolation:
o A key benefit of MVCC is that it allows for snapshot isolation, meaning that each
transaction sees a snapshot of the database as it was at the time the transaction
started. This prevents dirty reads (reading uncommitted data) and ensures that
transactions do not interfere with each other.
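
A minimal sketch of version selection under snapshot isolation, assuming each committed version records the commit timestamp of the transaction that wrote it (the version-list layout is illustrative): a reader returns the newest version committed no later than its own snapshot timestamp.

# MVCC snapshot read: return the newest version committed no later than the snapshot.

versions = [                      # versions of one data item, oldest first
    {"commit_ts": 10, "value": 100},
    {"commit_ts": 25, "value": 150},
    {"commit_ts": 40, "value": 200},
]

def snapshot_read(versions, snapshot_ts):
    visible = [v for v in versions if v["commit_ts"] <= snapshot_ts]
    return max(visible, key=lambda v: v["commit_ts"])["value"] if visible else None

print(snapshot_read(versions, snapshot_ts=30))  # 150: the version committed at 25 is visible
print(snapshot_read(versions, snapshot_ts=50))  # 200: the latest committed version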

Types of Multi-Version Schemes -

There are various multi-version concurrency control schemes, but the two most common
are Optimistic MVCC and Pessimistic MVCC.

1. Optimistic MVCC

In Optimistic MVCC, transactions execute under the assumption that conflicts will be rare.
The system provides mechanisms to handle conflicts only at the commit time.

 How it works:
1. Reading: A transaction reads the data from the latest committed version, or if it
needs to access an item modified by another transaction, it gets the version that
was current when the transaction began.
2. Writing: When the transaction wants to write to the database, it creates a new
version of the data.
3. Validation Phase: At commit time, the system checks if the transaction has
encountered conflicts with other transactions that modified the same data. If
conflicts exist, the transaction is rolled back; if not, the transaction commits, and
the new version of the data is made visible to others.
 Advantages:

o High concurrency: Transactions are free to execute without waiting for locks,
increasing throughput.
o Reduced contention: Since conflicts are handled later, most transactions can
commit without being delayed.
 Disadvantages:

o Higher rollback rates: If conflicts are detected during validation, transactions may
have to be rolled back and retried.
o Potential for wasted work: A transaction could perform a lot of work only to be
invalidated at commit time.

2. Pessimistic MVCC

In Pessimistic MVCC, transactions work with data versions in a way that avoids conflicts
during execution by making sure that a transaction is not interfered with by others. This is
similar to the behavior of traditional locking but in the context of multiple versions.

 How it works:
1. Reading: A transaction reads the latest committed version of the data. If a
transaction intends to modify a data item, it locks that item, ensuring no other
transactions can modify it until the transaction is complete.
2. Writing: When a transaction wants to update a data item, it creates a new
version of the data, marking it with its timestamp. Any other transaction
attempting to write to that data will be blocked until the first transaction commits
or aborts.
3. Validation and Commit: After the transaction commits, the new version is made
visible to other transactions, and any older versions are logically obsolete.
 Advantages:

o Reduced risk of rollback: Since conflicts are resolved by locking, the transaction
is more likely to commit successfully.
o Better for write-heavy workloads: Pessimistic MVCC ensures transactions are
serialized and prevents conflicts from arising during execution.
 Disadvantages:
o Lower concurrency: Transactions are forced to wait for locks, leading to reduced
throughput and performance.
o Potential for deadlocks: If two transactions lock the same data items, a deadlock
could occur.

Example of Multi-Version Scheme -

Consider a simple database with a balance field for a bank_account table:

1. Version 1: balance = 100


2. Version 2: Transaction A updates the balance to 150. The database creates a new
version with balance = 150.
3. Version 3: Transaction B starts and reads the balance, which was 100 at the time of its
start. Later, Transaction B updates the balance to 200, creating a new version.
4. Transaction A commits, making the balance 150 visible to other transactions.
5. Transaction B commits, and the balance is updated to 200.

 Optimistic Concurrency Control techniques –


Optimistic Concurrency Control (OCC) is a concurrency control method used in database
management systems where transactions are allowed to execute without locking data.
Instead of locking data during the transaction execution, OCC assumes that conflicts will be
rare. The system verifies the absence of conflicts before the transaction commits, and if
conflicts are detected, the transaction is rolled back.

OCC is based on the principle that transactions can often execute without interfering with
each other, and it defers conflict detection until the end of the transaction. This is
especially beneficial for systems with low contention or read-heavy workloads, where
transactions tend to operate on independent data items.

Techniques of Optimistic Concurrency Control

1. Basic Optimistic Concurrency Control (Basic OCC):


o Basic OCC operates using the three phases (Read, Validation, and Write) as
described above. During the read phase, the transaction reads data without
locking it. After the transaction completes its operations, during the validation
phase, it is checked whether any other transactions have updated the data it
read. If there are no conflicts, it commits the changes.
o Conflict Detection:

 If a transaction T1 reads a data item that was modified by another transaction T2 after T1 started, then T1 is in conflict with T2.
 If no conflict is detected, the transaction commits. Otherwise, T1 is rolled
back.
o Advantages:
 No locking overhead: Transactions execute without locks, reducing the need
for lock management.
 High concurrency: Transactions can be executed in parallel without waiting
for locks.
o Disadvantages:
 Rollback risk: If conflicts are detected during the validation phase,
transactions may have to be rolled back, wasting effort.
 Unpredictability: The system cannot predict in advance whether a
transaction will succeed or be rolled back.
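
A sketch of one common validation style (backward validation), assuming the transaction records the set of items it read and committed transactions record their write sets (the set-based layout is illustrative): validation fails if any transaction that committed after this one started has written an item this one read.

# Optimistic concurrency control: validate against recently committed writers.

def validate(read_set, start_ts, committed):
    # committed: list of (commit_ts, write_set) for transactions already committed
    for commit_ts, write_set in committed:
        if commit_ts > start_ts and read_set & write_set:
            return False            # someone overwrote data this transaction read
    return True

committed = [(12, {"X"}), (15, {"Y"})]
print(validate(read_set={"X"}, start_ts=10, committed=committed))  # False: X changed after start
print(validate(read_set={"Z"}, start_ts=10, committed=committed))  # True: no overlap, safe to commit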

2. Transaction Timestamp Ordering (TTO):


o This technique assigns a timestamp to each transaction when it starts, and the
validation phase uses these timestamps to detect conflicts.
o The key idea is that a transaction with an earlier timestamp must commit before
any transaction with a later timestamp.
o Conflict Detection:
 If a transaction T1 reads or writes a data item that has been modified by
another transaction T2 with a later timestamp, there is a conflict.
Transaction T1 is rolled back, and T2 proceeds to commit.
o Advantages:
 Simple to implement and highly effective when conflicts are rare.
o Disadvantages:
 Timestamp overhead: Transactions must be assigned timestamps, which
adds some overhead.
 Rollback: If conflicts are frequent, the system may face high transaction
rollback rates.

3. Validation-Based OCC:
o This technique focuses on validating a transaction's operations against other
transactions that have committed or are currently executing. It ensures that a
transaction’s operations do not conflict with other transactions that have already
committed, and if they do, it rolls back the transaction.
o Conflict Detection:
 When a transaction enters the validation phase, it checks whether it has
read any data modified by another transaction that has already committed.
If there is such a conflict, the transaction is rolled back.
o Advantages:
 Reduced contention: Because no locks are required during the read phase,
it allows higher concurrency.
 Flexibility: Transactions are validated only when they are ready to commit,
reducing unnecessary locking.
o Disadvantages:
 Validation overhead: The process of checking for conflicts during validation
can introduce additional complexity and overhead.
 Risk of frequent rollbacks: If the database experiences a high degree of
contention, many transactions may be rolled back, which can decrease the
system's performance.

4. Priority-Based OCC:
o In Priority-Based OCC, transactions are given priorities based on their timestamps
or other factors, and these priorities are used during the validation phase to
resolve conflicts.
o Conflict Resolution:
 When two transactions conflict, the one with the lower priority (usually the
one with the later, i.e., younger, timestamp) is rolled back to ensure the system
proceeds with the higher priority transaction.
o Advantages:
 Helps resolve conflicts by prioritizing transactions based on timestamp or
other criteria, reducing rollback chances for high-priority transactions.
o Disadvantages:

 Priority management: Determining transaction priorities adds another layer of complexity and may introduce inefficiency in systems with low contention.

5. Early Restart OCC:


o Early Restart OCC modifies the basic OCC model by checking for conflicts during
the read phase itself, rather than waiting until the validation phase. If a conflict is
detected during the read phase, the transaction is immediately restarted.
o Conflict Detection:
 When a transaction reads a data item, it checks whether the data item has
been modified by another transaction since this transaction began. If a conflict
is detected, the transaction is restarted from the beginning.
o Advantages:
 Faster conflict resolution: Since conflicts are detected early, it can save time
compared to waiting until the end of the transaction.
o Disadvantages:
 Higher overhead: Since transactions may need to restart if conflicts are
detected during the read phase, this can result in inefficiencies, especially
when conflicts are frequent.

 Distributed Deadlock & Recovery –


In distributed systems, deadlock occurs when two or more transactions are waiting for each
other to release resources, creating a cyclic dependency where none of the transactions
can proceed. This is similar to deadlock in centralized systems, but in a distributed system,
the complexity increases due to the involvement of multiple nodes, networks, and
processes.

Types of Distributed Deadlocks -

Distributed deadlocks are usually categorized into two types:

1. Local Deadlocks:
o These are deadlocks that occur within a single node or process.
o In a distributed system, each node may handle its own local resources, and
deadlock detection can be handled by the individual nodes. However, local
deadlocks might still affect the overall system, especially if the resource
allocation is interdependent.
2. Global Deadlocks:
o These occur when there is a circular wait involving processes on different nodes
or machines.
o A global deadlock involves transactions that are waiting for resources on other
nodes, creating a cycle in the dependency graph of transactions.

Distributed Deadlock Detection -

In centralized systems, detecting deadlock is relatively simple because there is a global view of the resources and transactions. However, in distributed systems, detecting deadlocks becomes much more complex because no single node has a global view of all resources and transactions.

There are two main approaches to detecting distributed deadlocks:

1. Centralized Approach:

 In this approach, a central node is responsible for deadlock detection. This central
node receives information from other nodes about the resources they are using and
waits for, and constructs a global wait-for graph.
 The graph is periodically checked for cycles (which indicate deadlock). If a cycle is
found, the central node reports a deadlock.

Advantages:

 Simple to implement because only one node is responsible for managing the detection
process.
 Easier to monitor and maintain as there is a central point of control.

Disadvantages:

 Scalability issues: As the system grows, the central node might become a bottleneck.
 Single point of failure: If the central node fails, the entire deadlock detection
mechanism might collapse.

2. Distributed Approach:

 In this approach, each node maintains information about the resources it holds and is
waiting for. Each node also sends messages to neighboring nodes about its status.
 Nodes periodically exchange detection messages to build a global wait-for graph and
detect cycles. This method allows decentralized detection without the need for a
central controller.

Advantages:

 More scalable since the detection mechanism is distributed across all nodes.
 No single point of failure.

Disadvantages:

 More complex to implement since there are multiple nodes involved in the detection
process.
 Requires efficient communication between nodes to ensure that detection messages
reach the correct recipients.

Wait-for Graph (WFG) for Deadlock Detection:

A wait-for graph is used to represent the relationships between transactions and resources.
Each transaction or resource is represented by a node, and a directed edge is created from
one node to another if one transaction is waiting for a resource held by the other.

 Cycle in the WFG: A cycle in the graph indicates a deadlock. If there is a path that
leads back to the original node, it implies that the involved transactions are mutually
waiting for resources, thus causing a deadlock.
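
A sketch of global deadlock detection over a merged wait-for graph, assuming each edge (Ti, Tj) means Ti waits for a resource held by Tj (the edge format and the "abort the youngest" victim rule are illustrative choices): if a cycle is found, one transaction in the cycle is chosen as the victim to abort.

# Distributed deadlock detection: find a cycle in the merged wait-for graph, pick a victim.

def find_cycle(edges):
    graph = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)
    def dfs(node, path):
        if node in path:
            return path[path.index(node):]        # the cycle portion of the path
        for nxt in graph.get(node, []):
            cycle = dfs(nxt, path + [node])
            if cycle:
                return cycle
        return None
    for start in graph:
        cycle = dfs(start, [])
        if cycle:
            return cycle
    return None

# Wait-for edges collected from two sites: T1 waits for T2 (site A), T2 waits for T1 (site B).
wfg = [("T1", "T2"), ("T2", "T1")]
cycle = find_cycle(wfg)
if cycle:
    victim = max(cycle, key=lambda t: int(t[1:]))  # abort the youngest (highest id) transaction
    print("deadlock:", cycle, "-> abort", victim)  # deadlock: ['T1', 'T2'] -> abort T2

In the centralized approach the coordinator runs this check over edges reported by all sites; in the distributed approach the sites exchange edge information and run equivalent checks themselves.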

Distributed Deadlock Recovery:

Once deadlock is detected in a distributed system, recovery actions need to be taken. The
goal of recovery is to break the deadlock by aborting one or more of the transactions
involved in the cycle.

There are a few strategies to resolve deadlocks in distributed systems:

1. Transaction Abortion:

 Abort one transaction: One of the transactions involved in the deadlock is selected for
abortion, allowing the resources it holds to be freed so the other transactions can
proceed.
 Rollback of the aborted transaction: The aborted transaction is rolled back, and any
changes it made are undone. This transaction can later be restarted if needed.
 The decision on which transaction to abort can be made based on various criteria:
o Transaction priority: Higher priority transactions are not aborted, while lower
priority ones are.
o Transaction cost: The transaction that has consumed fewer resources or has
made fewer modifications can be aborted.
o Transaction wait time: The transaction that has been waiting the longest could be
aborted.

2. Rollback and Restart:

 In this approach, a transaction that is detected as part of the deadlock cycle is rolled
back, and then the transaction is restarted from the beginning or from a checkpoint.
 This method is often used when transactions have a checkpointing system, allowing
them to restart from a previous stable state, which minimizes wasted work.

3. Killing Transactions:

 A more aggressive approach is to kill one or more transactions involved in the deadlock and reclaim the resources immediately. However, this may waste a significant amount of work if the transactions had already made substantial progress.

Trade-offs:

 Killing transactions immediately can help resolve the deadlock quickly, but it may also
lead to significant waste in terms of computation and resources.

4. Timeout-Based Recovery:

 In this approach, if a transaction has been waiting for a resource for too long (beyond
a pre-set timeout), it is considered to be part of a deadlock cycle, and the system
automatically rolls it back or aborts it.
 This method does not require an explicit deadlock detection phase but relies on time
constraints to detect potentially stalled transactions.
