The document discusses advanced topics in distributed databases, including their architecture, types, and applications, as well as the principles of NoSQL databases and their characteristics. It covers key components of distributed database systems, such as local databases, global schemas, and various transaction protocols like Two-Phase Commit. Additionally, it highlights the challenges in query processing and the advantages of NoSQL databases for handling large volumes of data with flexible structures.

Unit V ADVANCED TOPICS

Distributed Databases: Architecture, Data Storage, Transaction Processing, Query processing and optimization –
NOSQL Databases: Introduction – CAP Theorem – Document Based systems – Key value Stores – Column Based
Systems – Graph Databases. Database Security: Security issues – Access control based on privileges – Role Based
access control – SQL Injection – Statistical Database security – Flow control – Encryption and Public Key infrastructures – Challenges.

5.1 Definition

A distributed database is a database that is not limited to one system; it is spread over different sites,
i.e., on multiple computers or over a network of computers. A distributed database system is located on
various sites that don’t share physical components. This may be required when a particular database needs
to be accessed by various users globally. It must be managed so that, to the users, it looks like one single
database.

5.1.2 Types:

1. Homogeneous Database:
In a homogeneous distributed database, all sites store the database identically. The operating system,
database management system, and the data structures used are all the same at every site. Hence, such
systems are easy to manage.

2. Heterogeneous Database:
In a heterogeneous distributed database, different sites can use different schemas and software, which can
lead to problems in query processing and transactions. Also, a particular site might be completely unaware
of the other sites. Different computers may use different operating systems and database applications.
They may even use different data models for the database. Hence, translations are required for different
sites to communicate.

5.1.3 Applications of Distributed Databases:

 Corporate management information systems.
 Multimedia applications.
 Military control systems, hotel chains, etc.
 Manufacturing control systems.
 Railway and flight booking systems.
5.2. Architecture of Distributed Database

A Distributed Database System (DDBS) consists of a single logical database that is distributed across
multiple locations (sites or nodes) connected by a network. The architecture of a distributed database ensures
data is stored efficiently, accessed transparently, and maintained consistently across all nodes.
5.2.1 Key Components of Distributed Database Architecture

1. Database Components

 Local Databases: Each site has its own local database, which may contain a portion of the overall
distributed database.

 Global Schema: A unified schema that provides a logical view of the entire database, hiding the
distribution details from users.

2. Nodes (Sites)

 Each node has its own:

o Local DBMS (Database Management System)

o Local applications

o Data storage

3. Types of Architectures

a) Client-Server Architecture

 Clients send requests to the server which manages the database.

 The server processes queries and returns results.

 Can be centralized or distributed at the server end.

b) Peer-to-Peer (P2P) Architecture

 Every site (peer) acts as both a client and a server.

 Peers share data and workload equally.

 No central server; high fault tolerance and decentralization.

c) Multi-DBMS Architecture (Federated)

 Multiple autonomous databases work together.

 Each site retains control over its own data and operations.

4. Transparency Features

 Location Transparency: Users don’t need to know the physical location of data.
 Replication Transparency: System manages multiple copies of data.
 Fragmentation Transparency: Data can be split (horizontally or vertically) across sites.
 Concurrency Transparency: Ensures concurrent transactions don’t interfere.
 Failure Transparency: System can recover from site or network failures.
5. Communication Network

 Used for data transfer between sites.

 Typically a LAN or WAN setup.

 Ensures reliable and secure communication.

5.2.2 Key Components

 Sites (Nodes): Each rectangle represents a site (e.g., Site 1, Site 2, etc.), which hosts a local
database managed by a local DBMS.

 Communication Network: The cloud symbol in the center denotes the network connecting all
sites, facilitating data exchange and query processing.

 Distributed Database Management System (DDBMS): Oversees the coordination, query


optimization, and transaction management across sites.

 Global Schema: Provides a unified logical view of the distributed data, abstracting the underlying
distribution details from users.

(Figures: federated architecture, shared-nothing architecture, and overall distributed database architecture.)

5.3 Distributed Data Storage

There are two ways in which data can be stored on different sites. These are:

1. Replication –
In this approach, an entire relation is stored redundantly at two or more sites. If the entire database is
available at all sites, it is a fully redundant database. Hence, in replication, the system maintains copies of
data.

This is advantageous as it increases the availability of data at different sites, and query requests can be
processed in parallel.
However, it has certain disadvantages as well. Data needs to be constantly updated. Any change made at
one site must be recorded at every site where that relation is stored, or it may lead to inconsistency. This is
a lot of overhead. Also, concurrency control becomes far more complex, as concurrent access now needs
to be checked over a number of sites.

2. Fragmentation
In this approach, relations are fragmented (i.e., divided into smaller parts) and each of the fragments is
stored at the different sites where it is required. It must be ensured that the fragments are such that they
can be used to reconstruct the original relation (i.e., there is no loss of data).
Fragmentation is advantageous because it does not create copies of data, so consistency is not a problem.
Fragmentation of relations can be done in two ways:

 Horizontal fragmentation – Splitting by rows –


The relation is fragmented into groups of tuples so that each tuple is assigned to at least one
fragment.

 Vertical fragmentation – Splitting by columns –


The schema of the relation is divided into smaller schemas. Each fragment must contain a common
candidate key so as to ensure a lossless join.
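As a minimal sketch (using a hypothetical Customer relation), both kinds of fragmentation and the lossless reconstruction can be illustrated in Python:

```python
# Illustrative sketch: fragmenting a hypothetical Customer relation.
# Each record is a dict; CustID is the candidate key kept in every fragment.

customers = [
    {"CustID": 1, "Name": "Alice", "City": "Paris"},
    {"CustID": 2, "Name": "Bob", "City": "London"},
    {"CustID": 3, "Name": "Carol", "City": "Paris"},
]

# Horizontal fragmentation: split by rows (here, by City).
frag_paris = [r for r in customers if r["City"] == "Paris"]
frag_london = [r for r in customers if r["City"] == "London"]
assert sorted(frag_paris + frag_london, key=lambda r: r["CustID"]) == customers

# Vertical fragmentation: split by columns; both fragments keep CustID
# so the original relation can be rebuilt with a lossless join.
frag_names = [{"CustID": r["CustID"], "Name": r["Name"]} for r in customers]
frag_cities = [{"CustID": r["CustID"], "City": r["City"]} for r in customers]
rebuilt = [{**n, **c} for n in frag_names for c in frag_cities
           if n["CustID"] == c["CustID"]]
assert rebuilt == customers  # no loss of data
```

The assertions check the defining property: the fragments, taken together, reconstruct the original relation exactly.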

5.4 Distributed Transaction


A distributed transaction spans multiple systems, ensuring all operations either succeed or fail together,
crucial for maintaining data integrity and consistency across diverse and geographically separated resources
in modern computing environments.

What is the need for a Distributed Transaction?

The need for distributed transactions arises from the requirements to ensure data
consistency and reliability across multiple independent systems or resources in a distributed computing
environment. Specifically:

 Consistency: Ensuring that all changes made as part of a transaction are committed or rolled back
atomically, maintaining data integrity.

 Isolation: Guaranteeing that concurrent transactions do not interfere with each other, preserving data
integrity and preventing conflicts.

 Durability: Confirming that committed transactions persist even in the event of system failures,
ensuring reliability.

 Atomicity: Ensuring that either all operations within a transaction are completed successfully or
none of them are, avoiding partial updates that could lead to inconsistencies.

5.4.1 Working of Distributed Transactions


Distributed transactions work in the same way as simple transactions, but the challenge is to implement
them across multiple databases. Because multiple nodes or database systems are involved, problems such
as network failures arise, and extra hardware and database servers must be kept available. For a successful
distributed transaction, the available resources are coordinated by transaction managers.
Step 1: Application to Resource – Issues Distributed Transaction

The application initiates the transaction by sending the request to the available resources. The request
consists of details such as operations that are to be performed by each resource in the given transaction.

Step 2: Resource 1 to Resource 2 – Ask Resource 2 to Prepare to Commit

Once the resource receives the transaction request, resource 1 contacts resource 2 and asks resource 2 to
prepare the commit. This step makes sure that both the available resources are able to perform the dedicated
tasks and successfully complete the given transaction.

Step 3: Resource 2 to Resource 1 – Resource 2 Acknowledges Preparation

After the second step, when Resource 2 receives the request from Resource 1, it prepares for the commit.
Resource 2 responds to Resource 1 with an acknowledgment and confirms that it is ready to go ahead with
the allocated transaction.

Step 4: Resource 1 to Resource 2 – Ask Resource 2 to Commit

Once Resource 1 receives an acknowledgment from Resource 2, it sends a request to Resource 2 and provides
an instruction to commit the transaction. This step makes sure that Resource 1 has completed its task in the
given transaction and now it is ready for Resource 2 to finalize the operation.

Step 5: Resource 2 to Resource 1 – Resource 2 Acknowledges Commit

When Resource 2 receives the commit request from Resource 1, it provides Resource 1 with a response and
makes an acknowledgment that it has successfully committed the transaction it was assigned to. This step
ensures that Resource 2 has completed its task from the operation and makes sure that both the resources
have synchronized their states.

Step 6: Resource 1 to Application – Receives Transaction Acknowledgement


Once Resource 1 receives an acknowledgment from Resource 2, Resource 1 then sends an acknowledgment
of the transaction back to the application. This acknowledgment confirms that the transaction that was
carried out among multiple resources has been completed successfully.

5.4.2 Types of Distributed Transactions

Distributed transactions involve coordinating actions across multiple nodes or resources to ensure atomicity,
consistency, isolation, and durability (ACID properties). Here are some common types and protocols:

1. Two-Phase Commit Protocol (2PC)

This is a classic protocol used to achieve atomicity in distributed transactions.

 It involves two phases: a prepare phase where all participants agree to commit or abort the
transaction, and a commit phase where the decision is executed synchronously across all participants.

 2PC ensures that either all involved resources commit the transaction or none do, thereby maintaining
atomicity.
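The two phases can be sketched as follows. This is an illustrative skeleton with hypothetical names; a real coordinator would also write each state change to a durable log so it can recover after a crash:

```python
# Minimal Two-Phase Commit sketch (illustrative, in-memory only).

class Participant:
    def __init__(self, name, will_commit=True):
        self.name = name
        self.will_commit = will_commit
        self.state = "init"

    def prepare(self):
        # Phase 1: vote YES only if the local work can be made durable.
        self.state = "prepared" if self.will_commit else "aborted"
        return self.will_commit

    def finish(self, decision):
        # Phase 2: apply the coordinator's global decision.
        self.state = decision

def two_phase_commit(participants):
    # Phase 1 (prepare): collect a vote from every participant.
    votes = [p.prepare() for p in participants]
    # Phase 2 (commit/abort): unanimous YES commits; any NO aborts all.
    decision = "committed" if all(votes) else "aborted"
    for p in participants:
        p.finish(decision)
    return decision

sites = [Participant("S1"), Participant("S2"), Participant("S3")]
assert two_phase_commit(sites) == "committed"

sites = [Participant("S1"), Participant("S2", will_commit=False)]
assert two_phase_commit(sites) == "aborted"
assert all(p.state == "aborted" for p in sites)  # no partial commit
```

The final assertion shows the atomicity guarantee: one NO vote aborts every participant, never just some.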

2. Three-Phase Commit Protocol (3PC)

3PC extends 2PC by adding an extra phase (pre-commit phase) to address certain failure scenarios that could
lead to indefinite blocking in 2PC.

 In 3PC, participants first agree to prepare to commit, then to commit, and finally to complete or abort
the transaction.

 This protocol aims to reduce the risk of blocking seen in 2PC by introducing an additional decision-
making phase.

Implementing Distributed Transactions

Below is how distributed transactions are implemented:

 Transaction Managers (TM):

o Transaction Managers are responsible for coordinating and managing transactions across
multiple resource managers (e.g., databases, message queues).

o TMs ensure that transactions adhere to ACID properties (Atomicity, Consistency, Isolation,
Durability) even when involving disparate resources.

 Resource Managers (RM):

o Resource Managers are responsible for managing individual resources (e.g., databases, file
systems) involved in a distributed transaction.

o RMs interact with the TM to prepare for committing or rolling back transactions based on
the TM’s coordination.

 Coordination Protocols:

o Implementations of distributed transactions often rely on coordination protocols like 2PC,


3PC, or variants such as Paxos and Raft for consensus.
o These protocols ensure that all participants in a transaction reach a consistent decision
regarding commit or rollback.

5.5 Query Processing in Distributed DBMS

Query processing in a Distributed Database Management System (DDBMS) involves executing a user's
query that may require accessing data stored across multiple, geographically dispersed database sites. The
main goal is to process queries efficiently, ensuring correct results with minimal communication cost and
response time.

Here's a breakdown of the key steps in distributed query processing:

1. Query Decomposition

 The user submits a high-level query in SQL.

 The DDBMS parses and validates the query.

 The query is then decomposed into an equivalent relational algebra expression.

 It checks for syntactic and semantic correctness.

 The query is simplified and transformed into an internal form (like a logical query plan).

2. Data Localization (Data Source Identification)

 Identifies which sites (or fragments) contain the data required.

 Involves determining where relations or fragments are stored (horizontal or vertical fragmentation).

 Rewrites the query to access only relevant data sources.

3. Global Optimization

 Chooses the most efficient strategy to execute the query across sites.

 Considers factors like:

o Communication cost (data transfer between sites)

o Local processing cost

o Parallelism

 Generates alternative distributed query execution plans (QEPs) and chooses the best one using
cost-based optimization.

4. Local Optimization

 Each local site optimizes its portion of the query.

 Uses local DBMS query processors to generate efficient access paths (e.g., using indexes).

5. Query Execution
 The chosen QEP is executed across different sites.

 Intermediate results may be transferred between sites or aggregated centrally.

 Final result is constructed and returned to the user.

Example:

Suppose you run a query:

SELECT * FROM Orders o, Customers c WHERE o.CustID = c.ID AND c.City = 'Paris';

In a DDBMS:

 Orders and Customers may be stored at different sites.

 The system will locate fragments with Paris customers, fetch relevant orders, perform join operations
efficiently (maybe at a central or intermediate site), and return the results.

The process used to retrieve data from a database is called query processing.

The actions to be taken are:

 Costs (Transfer of data) of Distributed Query processing

 Using Semi join in Distributed Query processing

In distributed query processing, the data transfer cost is the cost of transferring intermediate files to other
sites for processing, plus the cost of transferring the final result file to the site where it is required. Say a
user sends a query to site S1 that requires data both from S1 itself and from another site S2. There are three
strategies to process this query:

1. We can transfer the data from S2 to S1 and then process the query

2. We can transfer the data from S1 to S2 and then process the query

3. We can transfer the data from both S1 and S2 to a third site S3 and then process the query there. The
choice depends on various factors such as the size of the relations and the result, the communication
cost between sites, and the site at which the result will be used.

Data transfer cost = C * Size, where C is the cost per unit of data transferred and Size is the amount of data shipped between sites.
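A worked example with assumed relation sizes shows how the formula drives the choice among the three strategies (all numbers below are hypothetical):

```python
# Worked example of Data transfer cost = C * Size, with assumed numbers:
# C is the cost per byte; the relation at S1 is 200,000 bytes, the relation
# at S2 is 1,000,000 bytes, and the query result is 50,000 bytes.
# The result is needed at S1.
C = 1
size_s1, size_s2, size_result = 200_000, 1_000_000, 50_000

cost_strategy1 = C * size_s2                            # ship S2's data to S1
cost_strategy2 = C * (size_s1 + size_result)            # ship S1's data to S2, ship result back
cost_strategy3 = C * (size_s1 + size_s2 + size_result)  # ship both to S3, ship result to S1

costs = {"S2->S1": cost_strategy1, "S1->S2": cost_strategy2, "both->S3": cost_strategy3}
best = min(costs, key=costs.get)
print(best, costs[best])  # → S1->S2 250000
```

Under these assumptions, shipping the small relation to the large one wins; with different sizes another strategy would be cheapest, which is exactly the optimizer's decision.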

5.5.1 Using Semi-Join in Distributed Query Processing

The semi-join operation is used in distributed query processing to reduce the number of tuples in a table
before transmitting it to another site. This reduces the number and total size of transmissions, ultimately
reducing the total cost of data transfer.

Say we have two tables R1 and R2 at sites S1 and S2. We forward the joining column of one table, say R1,
to the site where the other table, R2, is located. That column is joined with R2 at that site, and only the
matching R2 tuples are shipped back for the final join. Whether to reduce R1 or R2 can only be decided
after comparing the advantages of reducing R1 with those of reducing R2. Thus, the semi-join is a
well-organized solution for reducing data transfer in distributed query processing.
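The semi-join reduction can be sketched with small hypothetical relations:

```python
# Semi-join sketch: reduce R2 before shipping it to S1 (illustrative data).
r1 = [(1, "o-100"), (2, "o-101")]              # R1 at S1: (CustID, OrderID)
r2 = [(1, "Alice"), (2, "Bob"), (3, "Carol")]  # R2 at S2: (CustID, Name)

# Step 1: ship only R1's joining column (CustID) from S1 to S2.
join_keys = {cust_id for cust_id, _ in r1}

# Step 2: at S2, keep only the R2 tuples that can join (R2 semi-join R1).
r2_reduced = [t for t in r2 if t[0] in join_keys]

# Step 3: ship the reduced R2 back to S1 and perform the full join there.
result = [(c, order, name) for c, order in r1
          for c2, name in r2_reduced if c == c2]

assert len(r2_reduced) == 2  # one fewer tuple transferred than the full R2
assert result == [(1, "o-100", "Alice"), (2, "o-101", "Bob")]
```

Only the join keys travel one way and the reduced relation the other, instead of the whole of R2.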

5.5.2 Challenges in Distributed Query Processing:

 Data heterogeneity (schemas, formats)

 Network latency and reliability

 Site autonomy

 Security and access control

 Query optimization complexity

5.6 NOSQL DATABASES

NoSQL databases (short for "Not Only SQL") are a category of databases designed to handle large volumes
of data, high user loads, and flexible data models.

Key Characteristics of NoSQL Databases

1. Schema-less:

o No fixed table schema; data can be stored in a flexible, dynamic structure.

o Ideal for handling unstructured or semi-structured data like JSON, XML, etc.

2. Horizontal Scalability:

o Designed to scale out by adding more servers (nodes) rather than scaling up.

o Supports distributed architecture natively.

3. High Performance:

o Optimized for fast read/write operations.

o Often used in real-time web apps, big data systems, and IoT.

4. Eventual Consistency (CAP Theorem):

o Many NoSQL systems prioritize Availability and Partition Tolerance, sometimes


relaxing Consistency for performance and fault tolerance.
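Horizontal scalability (characteristic 2 above) is usually achieved by sharding: hashing each key to one of the available nodes. A minimal sketch, with hypothetical node names:

```python
import hashlib

# Minimal sharding sketch: route each key to one of N nodes by hashing.
# Adding nodes spreads the data without enlarging any single server.
def node_for(key, nodes):
    digest = hashlib.sha256(key.encode()).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]

nodes = ["node-1", "node-2", "node-3"]
placement = {k: node_for(k, nodes) for k in ["user:1", "user:2", "user:3", "user:4"]}

# Every key maps to exactly one node, deterministically.
assert all(v in nodes for v in placement.values())
assert node_for("user:1", nodes) == node_for("user:1", nodes)
```

Note a known weakness of plain modulo hashing, as here: changing the node count remaps most keys. Production systems typically use consistent hashing so that adding a node moves only a small fraction of the data.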

5.6.1 Types of NoSQL Databases

1. Document Stores

o Store data as documents (usually JSON, BSON, or XML).

o Flexible and hierarchical.

o Examples: MongoDB, CouchDB

2. Key-Value Stores
o Data is stored as key-value pairs (like a hash table).

o Extremely fast for simple lookups.

o Examples: Redis, DynamoDB, Riak

3. Column-Family Stores

o Data stored in columns rather than rows.

o Suitable for analytical queries on large datasets.

o Examples: Apache Cassandra, HBase

4. Graph Databases

o Data is modeled as nodes and edges.

o Ideal for relationships and network-based queries (e.g., social networks).

o Examples: Neo4j, ArangoDB
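The simplest of these models is the key-value store, which behaves like a hash table. A toy in-memory sketch (illustrative only; real stores add persistence, replication, and expiry):

```python
# Toy key-value store with hash-table semantics.
class KVStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

store = KVStore()
store.put("session:42", {"user": "alice", "ttl": 300})
assert store.get("session:42")["user"] == "alice"
store.delete("session:42")
assert store.get("session:42") is None
```

Because every operation is a single keyed lookup, these stores are extremely fast but offer no joins or rich queries, which is the trade-off noted in the limitations below.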

5.6.2 When to Use NoSQL/ Advantages

 When you need to handle:

o Massive volumes of unstructured or semi-structured data

o Real-time analytics or fast lookups

o Dynamic or frequently changing data models

o High-availability and distributed systems

o High scalability

o Flexibility

o Availability

o Performance

o Cost effectiveness

5.6.3 Limitations of NoSQL

 Weaker support for complex queries and joins

 Eventual consistency instead of strict ACID compliance (depends on the DB)

 Less mature than traditional RDBMS for some enterprise features

 Learning curve due to lack of standardization.

 Limited graphical management tools

 Backup and recovery can be more difficult

 Large document sizes can hurt performance.

5.7 CAP Theorem


The CAP Theorem, also known as Brewer's Theorem, is a fundamental concept in distributed systems—
especially important when designing and understanding distributed databases like NoSQL systems.

The CAP Theorem states that a distributed database system can only guarantee two out of the following
three properties at the same time:

1. Consistency (C)

 Every read receives the most recent write or an error.

 All nodes see the same data at the same time.

 Similar to the "C" in ACID.

 It guarantees that every node in a distributed cluster returns the same, most recent, and successful
write. It refers to every client having the same view of the data.

2. Availability (A)

Availability means that each read or write request for a data item will either be processed successfully or
will receive a message that the operation cannot be completed.

 Every request (read or write) gets a non-error response, even if it's not the latest data.

 The system is always responsive.

3. Partition Tolerance (P)

 The system continues to operate despite network partitions (i.e., communication failures between
nodes).

 This is mandatory in distributed systems because network failures can and will happen.
5.8 Column Based Systems
Column-based systems (also called column-oriented databases) in distributed systems are designed to store
and process data by columns instead of traditional row-based storage. This design choice has major
implications for performance and scalability.

In column-based systems, data is stored column by column:

Column-based:

ID: [1, 2, 3, ...]

Name: [Alice, Bob, Charlie, ...]

Age: [30, 25, 40, ...]
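The layout difference can be made concrete in Python: an analytical aggregate touches only the one column it needs, while a row layout would load every field of every row:

```python
# Row layout: one tuple per record.
rows = [(1, "Alice", 30), (2, "Bob", 25), (3, "Charlie", 40)]

# Column layout: one list per column (as shown above).
columns = {
    "ID":   [1, 2, 3],
    "Name": ["Alice", "Bob", "Charlie"],
    "Age":  [30, 25, 40],
}

# A query like SELECT AVG(Age) reads just the Age column here.
avg_age = sum(columns["Age"]) / len(columns["Age"])
assert round(avg_age, 2) == 31.67
```

The same property explains the compression point above: a column holds values of one type, which compresses far better than interleaved heterogeneous fields.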

Why Use Column-Based Storage in Distributed Systems?

Column-based systems are particularly well-suited for distributed, analytical, and big data environments
for these reasons:

1. Efficient Analytical Queries:

o Reading just a few columns (e.g., SELECT Age FROM users) is much faster since the
system doesn’t load unnecessary columns.

2. High Compression:

o Similar data types in columns lead to better compression ratios (e.g., all integers in the
"Age" column).

3. Vectorized Execution:

o Operates on blocks of values (like a column of numbers) at once, increasing CPU


efficiency.

4. Scalability:

o Easy to shard or partition columns across multiple nodes for parallel processing.

5. Columnar Format on Disk:

 Optimized for fast sequential reads (e.g., Parquet, ORC formats).

Examples of Column-Based Systems

1. Apache Cassandra – technically row-based but inspired by column-family stores.

2. Apache HBase – column-family NoSQL system.

Use Cases

 OLAP (Online Analytical Processing)

 Data Warehousing
 Business Intelligence

 Big Data Analytics

5.9 Graph Databases


Graph databases in distributed systems are specialized databases designed to store, manage, and query
graph-structured data—where data is modeled as nodes (entities) and edges (relationships). These
databases are optimized for traversing complex and highly interconnected datasets.

What Is a Graph Database?

A graph database represents data as:

 Nodes: Entities (e.g., people, products, devices).

 Edges: Relationships between nodes (e.g., "FRIENDS_WITH", "PURCHASED").

Each node and edge can also have properties (key-value pairs).

Example:

(Alice) -[FRIENDS_WITH]-> (Bob)

In a distributed graph database, the data and query workload are spread across multiple machines to
ensure:

1. Scalability – to handle very large graphs (billions of nodes/edges).

2. High Availability – through replication.

3. Fault Tolerance – for resilience to node failures.

Key Features in a Distributed Setup

1. Partitioning (Sharding):

o Nodes and edges are distributed across machines.

o Challenge: keep highly connected data together to reduce cross-node traversal.

2. Graph Traversals:

o Operations like "find friends of friends" need fast, recursive queries.

o Latency-sensitive: distributed systems must minimize inter-node hops.

3. Consistency vs. Performance:

o Maintaining consistency across partitions can slow down traversals.

o Some systems offer tunable consistency.

4. Index-Free Adjacency:
o Nodes directly reference connected nodes, making traversals fast compared to joins in
relational databases.
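Index-free adjacency can be sketched with a plain adjacency map (the names here are hypothetical): "friends of friends" is two pointer hops, not two relational joins:

```python
# Adjacency-map sketch of a property graph: each node directly
# references its neighbours.
friends = {
    "Alice": ["Bob"],
    "Bob": ["Carol", "Dave"],
    "Carol": [],
    "Dave": [],
}

def friends_of_friends(graph, person):
    direct = set(graph.get(person, []))
    result = set()
    for friend in direct:          # hop 1: direct friends
        result.update(graph.get(friend, []))  # hop 2: their friends
    return result - direct - {person}

assert friends_of_friends(friends, "Alice") == {"Carol", "Dave"}
```

In a distributed deployment, the partitioning challenge noted above is visible here: if Bob lives on a different node from Alice, each hop becomes a network round trip.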

Examples of Distributed Graph Databases

Neo4j Fabric Scalable, federated graph processing

Amazon Neptune Fully managed, supports property graphs and RDF

DGraph Native distributed graph database

TigerGraph Real-time analytics on large graphs

5.10 Database Security: Security issues


Database security in a distributed system is more complex than in centralized systems because data is
spread across multiple locations, possibly in different geographical regions, networks, or cloud environments.
The main goal is to ensure confidentiality, integrity, and availability (CIA) of data, while mitigating
security risks introduced by distribution.

Key Security Issues in Distributed Database Systems

1. Authentication and Authorization

 Ensures that only legitimate users can access data.

 In distributed systems, each node must enforce access control consistently.

 Risks:

o Weak or inconsistent authentication across nodes.

o Lack of centralized identity management.

2. Data Confidentiality

 Protects data from unauthorized access during storage and transmission.

 Solutions:

o Encryption at rest and in transit (e.g., TLS, AES).

o End-to-end encryption for sensitive data.

 Risks:

o Insecure communication channels.

o Key management complexity.

3. Data Integrity

 Ensures that data is not tampered with or corrupted.

 Techniques:
o Checksums, cryptographic hashes.

o Digital signatures for sensitive operations.

 Risks:

o Replay attacks or man-in-the-middle (MITM) attacks during replication.
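A minimal integrity check for a replication message can be built from Python's standard library HMAC support. The shared key below is an assumed pre-shared secret; a real system would obtain it from its key-management service:

```python
import hashlib
import hmac

# The sender attaches an HMAC tag; the receiver recomputes it and
# rejects tampered payloads (e.g., MITM alteration in transit).
key = b"shared-secret"  # assumed pre-shared key, for illustration only
payload = b'{"op": "UPDATE", "row": 42, "value": "Paris"}'

tag = hmac.new(key, payload, hashlib.sha256).hexdigest()

def verify(key, payload, tag):
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, tag)  # constant-time comparison

assert verify(key, payload, tag)                      # untouched message passes
assert not verify(key, payload + b"x", tag)           # altered message is rejected
```

Unlike a plain checksum, the keyed HMAC also defeats an attacker who can modify the message, since they cannot forge a matching tag without the key.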

4. Availability and Fault Tolerance

 Systems must be resilient to DDoS attacks, node failures, or network partitions.

 Strategies:

o Redundancy and replication.

o Rate limiting and traffic filtering.

 Risks:

o Single points of failure (e.g., coordination services like ZooKeeper).

o Attackers targeting high-availability zones.

5. Secure Communication

 Inter-node and client-server communication must be encrypted.

 Use TLS and mutual authentication.

 Risks:

o Unencrypted traffic exposing sensitive data.

6. Data Replication and Synchronization Security

 Data is often replicated across nodes for fault tolerance.

 Risks:

o Replication channels being intercepted or altered.

o Stale or out-of-sync data leading to inconsistencies.

7. Auditing and Monitoring

 Continuous tracking of access and changes to detect suspicious behavior.

 Challenges:

o Consolidating logs from distributed nodes.

o Ensuring audit trails are tamper-proof.

8. Insider Threats and Access Leakage


 Distributed systems increase the number of people and systems with potential access.

 Controls:

o Role-based access control (RBAC).

o Least privilege principle.

5.11 Role Based access control


Role-Based Access Control (RBAC) is a security model that restricts system access based on a user's role
within an organization. Instead of assigning permissions directly to each user, permissions are assigned to
roles, and users are assigned to those roles. This simplifies management and improves security, especially in
large or distributed systems.

RBAC in Distributed Systems

In distributed systems:

 Centralized IAM systems (like LDAP, Active Directory, or OAuth) are used to define roles and
enforce policies.

 Consistency across nodes and services is critical.

 Federated RBAC may be used in multi-organization systems (e.g., different departments or


companies).

Core Concepts of RBAC

1. User: A person or system entity that needs access.

2. Role: A collection of permissions that represent a job function (e.g., "Manager", "Analyst",
"Admin").

3. Permission: Authorization to perform specific operations on resources (e.g., read, write, delete).

4. Session: A mapping between a user and an activated subset of roles for a period of time.

For example, if Alice is assigned the Analyst role, she can read and analyze data, but not modify or delete
it.
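This role-to-permission mapping can be sketched directly; the role and permission names follow the example above:

```python
# RBAC sketch: permissions attach to roles, users attach to roles.
role_permissions = {
    "Analyst": {"read", "analyze"},
    "Manager": {"read", "analyze", "write"},
    "Admin":   {"read", "analyze", "write", "delete"},
}
user_roles = {"alice": {"Analyst"}}

def is_allowed(user, permission):
    # A user may act if any of their roles grants the permission.
    return any(permission in role_permissions.get(role, set())
               for role in user_roles.get(user, set()))

assert is_allowed("alice", "read")
assert not is_allowed("alice", "delete")  # Analyst cannot delete
```

Changing what Analysts may do is a single edit to the role, not one edit per user, which is the management benefit RBAC is built around.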

5.12 SQL Injection


SQL Injection (SQLi) in a distributed system is a type of cyberattack where an attacker inserts malicious
SQL code into an input field to manipulate the backend database. In a distributed setup—where multiple
databases, services, and nodes may be involved—SQL injection can become even more dangerous and harder
to detect.

SQL injection occurs when:

 User input is not properly sanitized or validated.

 That input is used directly in a SQL query.


Example

-- Vulnerable query

SELECT * FROM users WHERE username = 'alice' AND password = 'pass123';

If an attacker enters:

username: alice

password: ' OR '1'='1

The resulting query becomes:

SELECT * FROM users WHERE username = 'alice' AND password = '' OR '1'='1';

This always returns true, potentially bypassing authentication.
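The standard defense is a parameterized query: the driver binds user input as plain data, never as SQL, so the `' OR '1'='1` trick has no effect. A self-contained demonstration with Python's sqlite3 (the table and credentials are the hypothetical ones from the example above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (username TEXT, password TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'pass123')")

malicious_password = "' OR '1'='1"

# Vulnerable: string concatenation builds the always-true query shown above.
vulnerable = conn.execute(
    "SELECT * FROM users WHERE username = 'alice' "
    f"AND password = '{malicious_password}'").fetchall()
assert vulnerable  # attack succeeds: authentication bypassed

# Safe: the ? placeholder binds the input as a literal value.
safe = conn.execute(
    "SELECT * FROM users WHERE username = ? AND password = ?",
    ("alice", malicious_password)).fetchall()
assert not safe  # attack fails: no user has that literal password
```

In a distributed setup this matters at every entry point: each microservice that builds SQL from input must use bound parameters, since one unsanitized path can poison replicated data downstream.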

In a distributed system, SQL injection can have broader impact:

1. Multiple Entry Points

 With microservices or APIs, multiple services may interact with databases independently,
increasing attack surface.

2. Data Replication

 Injected data can propagate to replicated nodes, spreading corruption.

3. Distributed Query Engines

 Systems like Presto, Trino, or BigQuery allow SQL across multiple databases. A single injection
can affect multiple backends.

4. Asynchronous Processing

 Malicious SQL can enter message queues or logs and be executed later by downstream services.

5.13 Statistical Database security


A statistical database allows access to statistical summaries, but not raw data.

Example queries:

 "What is the average age of patients with diabetes?"

 "How many employees earn more than $100K?"

These queries are intended to return aggregate results, not identifiable individual records.

Common Threats to Statistical Databases

1. Inference Attacks

 Users craft multiple queries to isolate a single individual's data.


 Example: Subtracting two aggregates to deduce a hidden value.

2. Tracker Attacks

 Use of overlapping sets in queries to track down specific user data.

3. Small Query Set Problem

 If a query returns data on very few individuals (e.g., 1 or 2), it's easier to infer private details.

In distributed systems:

 Multiple statistical databases may exist across nodes or organizations.

 Risks include data correlation across nodes and inconsistent enforcement of query restrictions.
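A common first defense is a query-set-size restriction: refuse any aggregate computed over fewer than k individuals. A minimal sketch with assumed salary data; note this alone does not stop tracker attacks that combine several large queries, so real systems add auditing or noise (e.g., differential privacy):

```python
# Query-set-size restriction sketch: aggregates over < K people are refused.
K = 3
salaries = {"ann": 95, "ben": 120, "cid": 80, "dee": 150}  # assumed data ($K)

def avg_salary(names):
    group = [salaries[n] for n in names if n in salaries]
    if len(group) < K:
        raise PermissionError("query set too small; could identify individuals")
    return sum(group) / len(group)

assert avg_salary(["ann", "ben", "cid", "dee"]) == 111.25

try:
    avg_salary(["ben"])  # would reveal one person's salary
    raise AssertionError("small query set should have been refused")
except PermissionError:
    pass
```

The guard targets the "small query set problem" above; inference by subtracting two large, overlapping aggregates still requires the additional controls mentioned.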

5.14 Flow Control


Flow control in a Distributed Database Management System (DDBMS) refers to the mechanisms used
to regulate the transmission of data and control messages between distributed nodes, ensuring that the
system operates efficiently, reliably, and without overloading any component.

It plays a key role in maintaining performance and consistency in environments where data and operations
span multiple geographically or logically separated database servers.

In a distributed system, nodes must communicate frequently for:

 Data replication

 Query execution across partitions

 Transaction coordination (e.g., 2PC)

 Synchronization and consistency

Without proper flow control, problems like congestion, data loss, bottlenecks, and deadlocks can arise.

Objectives of Flow Control

1. Avoid Overloading Nodes


Prevent a fast sender from overwhelming a slow receiver.

2. Ensure Reliable Communication


Guarantee that data packets or messages are delivered and acknowledged correctly.

3. Coordinate Distributed Operations


Maintain order and integrity during distributed queries, transactions, or replication.

4. Optimize Network Bandwidth Usage


Efficiently use the communication channels to avoid unnecessary delays.

Types of Flow Control in DDBMS

1. Message-Level Flow Control


Controls the rate at which control messages (e.g., locks, transaction commits) are exchanged.

 Ensures transaction coordination remains synchronized across nodes.

2. Data Transfer Flow Control

Manages data exchange for queries or replication.

 Often uses buffer management or window-based techniques (like TCP flow control).

 Critical during large result set transfers or distributed joins.
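Window-based flow control can be sketched as follows (TCP-like; the class and message names are hypothetical): the sender may have at most `window` unacknowledged messages in flight at once, which is what keeps a fast sender from overwhelming a slow receiver.

```python
from collections import deque

# Sliding-window flow control sketch: at most `window` unacknowledged
# messages may be in flight between two nodes.
class Sender:
    def __init__(self, window):
        self.window = window
        self.in_flight = deque()

    def can_send(self):
        return len(self.in_flight) < self.window

    def send(self, msg):
        if not self.can_send():
            raise RuntimeError("window full: wait for acknowledgements")
        self.in_flight.append(msg)

    def ack(self):
        self.in_flight.popleft()  # receiver confirmed the oldest message

s = Sender(window=2)
s.send("chunk-1")
s.send("chunk-2")
assert not s.can_send()  # fast sender is throttled here
s.ack()                  # receiver catches up
assert s.can_send()
s.send("chunk-3")
```

The same buffering idea underlies large result-set transfers and distributed joins: the receiver's acknowledgements, not the sender's speed, set the pace.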

3. Transaction Flow Control

Involves coordination between distributed transactions to:

 Prevent deadlocks

 Manage concurrency

 Enforce isolation levels across nodes

Challenges in Flow Control

 Network variability (latency, jitter, packet loss)

 Node heterogeneity (some nodes faster than others)

 Complexity of distributed transactions

 Dynamic workloads (bursts of queries or updates)

5.15 Encryption and Public Key infrastructures – Challenges.

What Is Encryption?

Encryption is the process of transforming readable data (plaintext) into an unreadable format
(ciphertext) to protect it from unauthorized access.

 Symmetric Encryption: Same key for encryption and decryption (e.g., AES).

 Asymmetric Encryption: Uses a public key to encrypt and a private key to decrypt (e.g., RSA,
ECC).

What Is Public Key Infrastructure (PKI)?

PKI is a framework for:

 Managing digital certificates and public/private keys.

 Enabling trusted communication via encryption and digital signatures.

Core Components:

 Certificate Authority (CA): Issues and verifies digital certificates.


 Registration Authority (RA): Verifies user identities before certificate issuance.

 Public/Private Key Pairs: Used for encryption, decryption, signing, and verification.

 Certificate Revocation Lists (CRLs) or OCSP: Manage revoked certificates.

5.15.1 Challenges in Encryption & PKI (Especially in Distributed Systems)

1. Key Management

 Securely storing, rotating, revoking, and distributing keys is difficult, especially across nodes,
regions, or services.

 A compromised key can expose entire systems.

2. Scalability

 Managing millions of certificates and keys for microservices, users, and devices is complex.

 PKI doesn’t scale easily without automation and orchestration.

3. Latency and Performance

 Encryption/decryption adds CPU overhead.

 Public-key operations (e.g., RSA) are slower than symmetric encryption.

4. Trust Model Complexity

 Trust must be correctly configured across different nodes, domains, or third-party systems.

 Misconfigured trust chains can lead to vulnerabilities or communication failures.

5. Certificate Revocation and Expiry

 Distributed systems may cache or rely on outdated certificates.

 Ensuring timely revocation propagation (via CRLs or OCSP) is non-trivial.

6. Man-in-the-Middle (MitM) Attacks

 Poor implementation of TLS or certificate validation can allow MitM attacks.

 Clients must rigorously verify certificate chains and expiry.

7. Compromise Recovery

 If a CA or private key is compromised, revoking and replacing affected certificates system-wide is


challenging and time-sensitive.

8. Interoperability

 Different platforms or services may use incompatible encryption standards or certificate formats.

 Coordinating encryption policies and PKI integration across heterogeneous environments is error-prone.
