Unit 5
Distributed Databases: Architecture, Data Storage, Transaction Processing, Query Processing and Optimization –
NoSQL Databases: Introduction – CAP Theorem – Document Based Systems – Key Value Stores – Column Based
Systems – Graph Databases. Database Security: Security Issues – Access Control Based on Privileges – Role Based
Access Control – SQL Injection – Statistical Database Security – Flow Control – Encryption and Public Key
Infrastructures – Challenges.
5.1 Definition
A distributed database is a database that is not limited to one system; it is spread over different sites, i.e., on multiple computers or over a network of computers. A distributed database system is located on various sites that don't share physical components. This may be required when a particular database needs to be accessed by various users globally. It must be managed such that, to the users, it looks like one single database.
5.1.2 Types:
1. Homogeneous Database:
In a homogeneous database, all sites store the database identically. The operating system, database management system, and the data structures used are all the same at every site. Hence, such databases are easy to manage.
2. Heterogeneous Database:
In a heterogeneous distributed database, different sites can use different schemas and software, which can lead to problems in query processing and transactions. Also, a particular site might be completely unaware of the other sites. Different computers may use different operating systems and different database applications, and they may even use different data models for the database. Hence, translations are required for different sites to communicate.
5.2 Distributed Database Architecture
A Distributed Database System (DDBS) consists of a single logical database that is distributed across multiple locations (sites or nodes) connected by a network. The architecture of a distributed database ensures data is stored efficiently, accessed transparently, and maintained consistently across all nodes.
5.2.1 Key Components of Distributed Database Architecture
1. Database Components
Local Databases: Each site has its own local database, which may contain a portion of the overall
distributed database.
Global Schema: A unified schema that provides a logical view of the entire database, hiding the
distribution details from users.
2. Nodes (Sites)
Each site hosts:
o Local applications
o Data storage
3. Types of Architectures
a) Client-Server Architecture: client sites run applications and send queries to server sites that store and manage the data.
b) Federated Architecture: each site runs its own autonomous DBMS and retains control over its own data and operations, while a federation layer provides integrated access across the sites.
4. Transparency Features
Location Transparency: Users don’t need to know the physical location of data.
Replication Transparency: System manages multiple copies of data.
Fragmentation Transparency: Data can be split (horizontally or vertically) across sites.
Concurrency Transparency: Ensures concurrent transactions don’t interfere.
Failure Transparency: System can recover from site or network failures.
5. Communication Network
Sites (Nodes): Each site (e.g., Site 1, Site 2, etc.) hosts a local database managed by a local DBMS.
Communication Network: The network connects all the sites, facilitating data exchange and query processing.
Global Schema: Provides a unified logical view of the distributed data, abstracting the underlying distribution details from users.
Data Storage in Distributed Databases
There are two ways in which data can be stored on different sites:
1. Replication
In this approach, the entire relation is stored redundantly at two or more sites. If the entire database is available at all sites, it is a fully redundant database. Hence, in replication, systems maintain copies of the data.
This is advantageous because it increases the availability of data at different sites, and query requests can be processed in parallel.
However, it also has disadvantages. Data needs to be constantly updated: any change made at one site must be recorded at every site where that relation is stored, or else it may lead to inconsistency. This is a lot of overhead. Also, concurrency control becomes far more complex, as concurrent access now needs to be checked across a number of sites.
2. Fragmentation
In this approach, the relations are fragmented (i.e., divided into smaller parts) and each fragment is stored at the site where it is required. It must be ensured that the fragments can be used to reconstruct the original relation (i.e., there is no loss of data).
Fragmentation is advantageous because it does not create copies of the data, so consistency is not a problem.
Fragmentation of relations can be done in two ways (a small sketch of reconstruction follows):
o Horizontal fragmentation: the relation is split into subsets of tuples (rows), and the original relation is reconstructed by taking the union of the fragments.
o Vertical fragmentation: the relation is split into subsets of columns, each fragment keeping the primary key, and the original relation is reconstructed by joining the fragments on that key.
Distributed Transaction Processing
The need for distributed transactions arises from the requirement to ensure data consistency and reliability across multiple independent systems or resources in a distributed computing environment. Specifically:
Consistency: Ensuring that all changes made as part of a transaction are committed or rolled back
atomically, maintaining data integrity.
Isolation: Guaranteeing that concurrent transactions do not interfere with each other, preserving data
integrity and preventing conflicts.
Durability: Confirming that committed transactions persist even in the event of system failures,
ensuring reliability.
Atomicity: Ensuring that either all operations within a transaction are completed successfully or
none of them are, avoiding partial updates that could lead to inconsistencies.
A typical distributed transaction between two resources proceeds as follows:
1. The application initiates the transaction by sending a request to the participating resources. The request contains details such as the operations to be performed by each resource in the transaction.
2. Once Resource 1 receives the transaction request, it contacts Resource 2 and asks it to prepare to commit. This step makes sure that both resources are able to perform their assigned tasks and successfully complete the transaction.
3. Resource 2 receives the request from Resource 1 and prepares for the commit. It responds to Resource 1 with an acknowledgment, confirming that it is ready to go ahead with the allocated transaction.
4. Once Resource 1 receives the acknowledgment from Resource 2, it sends Resource 2 an instruction to commit the transaction. This indicates that Resource 1 has completed its part of the transaction and Resource 2 can now finalize the operation.
5. When Resource 2 receives the commit request from Resource 1, it responds with an acknowledgment that it has successfully committed its part of the transaction, ensuring that both resources have synchronized their states.
Distributed transactions involve coordinating actions across multiple nodes or resources to ensure atomicity,
consistency, isolation, and durability (ACID properties). Here are some common types and protocols:
1. Two-Phase Commit (2PC)
It involves two phases: a prepare phase, where all participants agree to commit or abort the transaction, and a commit phase, where that decision is executed synchronously across all participants. 2PC ensures that either all involved resources commit the transaction or none do, thereby maintaining atomicity.
2. Three-Phase Commit (3PC)
3PC extends 2PC by adding an extra phase (a pre-commit phase) to address certain failure scenarios that could lead to indefinite blocking in 2PC.
In 3PC, participants first agree to prepare to commit, then to commit, and finally to complete or abort
the transaction.
This protocol aims to reduce the risk of blocking seen in 2PC by introducing an additional decision-
making phase.
Transaction Managers (TM):
o Transaction Managers are responsible for coordinating and managing transactions across multiple resource managers (e.g., databases, message queues).
o TMs ensure that transactions adhere to the ACID properties (Atomicity, Consistency, Isolation, Durability) even when involving disparate resources.
Resource Managers (RM):
o Resource Managers are responsible for managing individual resources (e.g., databases, file systems) involved in a distributed transaction.
o RMs interact with the TM to prepare for committing or rolling back transactions based on the TM's coordination.
Coordination Protocols: protocols such as 2PC and 3PC (described above) are used by the transaction manager to coordinate commit and abort decisions across the resource managers.
Query Processing and Optimization in Distributed Databases
Query processing in a Distributed Database Management System (DDBMS) involves executing a user's query that may require accessing data stored across multiple, geographically dispersed database sites. The main goal is to process queries efficiently, ensuring correct results with minimal communication cost and response time.
1. Query Decomposition
The query is simplified and transformed into an internal form (such as a logical query plan).
2. Data Localization
Involves determining where the required relations or fragments are stored (horizontal or vertical fragmentation).
3. Global Optimization
Chooses the most efficient strategy to execute the query across sites, considering factors such as communication cost and opportunities for parallelism. Generates alternative distributed query execution plans (QEPs) and chooses the best one using cost-based optimization.
4. Local Optimization
Uses local DBMS query processors to generate efficient access paths (e.g., using indexes).
5. Query Execution
The chosen QEP is executed across different sites.
Example:
SELECT * FROM Orders o, Customers c WHERE o.CustID = c.ID AND c.City = 'Paris';
In a DDBMS:
The system will locate the fragments containing Paris customers, fetch the relevant orders, perform the join efficiently (possibly at a central or intermediate site), and return the results.
Data Transfer Costs of Distributed Query Processing
The process used to retrieve data from a database is called query processing. In distributed query processing, the data transfer cost is the cost of transferring intermediate files to other sites for processing, plus the cost of transferring the final result files to the site where the result is required. Suppose a user sends a query to site S1 that requires data both from its own site and from another site S2. There are three strategies to process this query:
1. Transfer the data from S2 to S1 and process the query at S1.
2. Transfer the data from S1 to S2 and process the query at S2.
3. Transfer the data from both S1 and S2 to a third site S3 and process the query there.
The choice depends on various factors, such as the size of the relations and of the result, the communication cost between the different sites, and the site at which the result will be used (a hypothetical cost illustration follows).
The semi-join operation is used in distributed query processing to reduce the number of tuples in a table before transmitting it to another site. This reduction in the number of tuples reduces the number and total size of the transmissions, ultimately reducing the total cost of data transfer.
Suppose we have two tables, R1 at site S1 and R2 at site S2. We forward the joining column of one table, say R1, to the site where the other table, R2, is located, and that column is joined with R2 there. The decision whether to reduce R1 or R2 can only be made after comparing the benefit of reducing R1 with that of reducing R2. Thus, the semi-join is a well-organized way to reduce the amount of data transferred in distributed query processing.
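A minimal sketch of the semi-join idea (hypothetical Orders and Customers tables; each "ship" comment marks data that would cross the network):

# R1 at site S1: Orders(cust_id, amount); R2 at site S2: Customers(cust_id, city).
r1_orders = [(1, 250), (2, 90), (2, 40), (4, 700)]           # stored at S1
r2_customers = [(1, "Paris"), (2, "Chennai"), (3, "Paris")]  # stored at S2

# Step 1: ship only the joining column of R1 (distinct cust_ids) from S1 to S2.
join_column = {cust_id for cust_id, _ in r1_orders}          # {1, 2, 4}

# Step 2: at S2, keep only the customers that match (the semi-join of R2 with R1).
r2_reduced = [row for row in r2_customers if row[0] in join_column]

# Step 3: ship the reduced R2 back to S1 and complete the join there.
result = [(cust_id, amount, city)
          for cust_id, amount in r1_orders
          for c_id, city in r2_reduced
          if cust_id == c_id]
# Only the join column and the reduced table crossed the network,
# instead of the whole Customers table.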
Site autonomy: each site in a distributed or federated system retains control over its own local data and operations, independently of the other sites.
NoSQL Databases
NoSQL databases (short for "Not Only SQL") are a category of databases designed to handle large volumes of data, high user loads, and flexible data models. Key characteristics include:
1. Schema-less:
o Ideal for handling unstructured or semi-structured data like JSON, XML, etc.
2. Horizontal Scalability:
o Designed to scale out by adding more servers (nodes) rather than scaling up.
3. High Performance:
o Often used in real-time web apps, big data systems, and IoT.
Common categories of NoSQL databases (a small sketch follows this list):
1. Document Stores
o Data is stored as self-describing, JSON-like documents whose fields can vary from document to document (e.g., MongoDB).
2. Key-Value Stores
o Data is stored as key-value pairs (like a hash table), e.g., Redis.
3. Column-Family Stores
o Data is grouped into column families rather than rows (e.g., Cassandra, HBase).
4. Graph Databases
o Data is stored as nodes and edges that represent entities and their relationships (e.g., Neo4j).
Advantages of NoSQL databases:
o High scalability
o Flexibility
o Availability
o Performance
o Cost effectiveness
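A minimal sketch contrasting the key-value and document models using plain Python data structures (real stores such as Redis or MongoDB expose their own APIs):

# Key-value store: opaque values looked up by key, like a hash table.
kv_store = {}
kv_store["session:42"] = "user=alice;expires=1700000000"
print(kv_store["session:42"])

# Document store: each value is a self-describing, JSON-like document
# whose fields can vary from document to document (schema-less).
documents = {
    "user:1": {"name": "Alice", "roles": ["analyst"], "city": "Paris"},
    "user:2": {"name": "Bob", "email": "bob@example.com"},  # different fields
}
print(documents["user:1"]["city"])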
CAP Theorem
The CAP Theorem states that a distributed database system can only guarantee two out of the following three properties at the same time:
1. Consistency (C)
It guarantees that every node in a distributed cluster returns the same, most recent, and successful
write. It refers to every client having the same view of the data.
2. Availability (A)
Availability means that each read or write request for a data item will either be processed successfully or
will receive a message that the operation cannot be completed.
Every request (read or write) gets a non-error response, even if it's not the latest data.
3. Partition Tolerance (P)
The system continues to operate despite network partitions (i.e., communication failures between nodes). This is mandatory in distributed systems because network failures can and will happen.
5.8 Column Based Systems
Column-based systems (also called column-oriented databases) in distributed systems are designed to store
and process data by columns instead of traditional row-based storage. This design choice has major
implications for performance and scalability.
Why Column-Based?
Column-based systems are particularly well-suited for distributed, analytical, and big data environments for these reasons (a short sketch follows the list):
1. Efficient Column Reads:
o Reading just a few columns (e.g., SELECT Age FROM users) is much faster, since the system doesn't load unnecessary columns.
2. High Compression:
o Similar data types in columns lead to better compression ratios (e.g., all integers in the
"Age" column).
3. Vectorized Execution:
o Operations are applied to batches of column values at a time, which makes better use of CPU caches.
4. Scalability:
o Easy to shard or partition columns across multiple nodes for parallel processing.
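A minimal sketch of row-oriented versus column-oriented layout (illustrative Python lists, not any particular engine's storage format):

# Row-oriented: each record stored together; reading one column touches every row.
rows = [
    {"id": 1, "name": "Asha", "age": 34},
    {"id": 2, "name": "Ravi", "age": 41},
    {"id": 3, "name": "Mei",  "age": 29},
]
ages_from_rows = [r["age"] for r in rows]   # must scan whole records

# Column-oriented: each column stored contiguously; a query like
# SELECT Age FROM users reads only the "age" array.
columns = {
    "id":   [1, 2, 3],
    "name": ["Asha", "Ravi", "Mei"],
    "age":  [34, 41, 29],
}
ages_from_columns = columns["age"]          # one contiguous, compressible array
assert ages_from_rows == ages_from_columns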
Use Cases
Data Warehousing
Business Intelligence
Graph Databases
Graph databases store data as nodes (entities) and edges (relationships). Each node and edge can also have properties (key-value pairs).
Example: in a social network, people are nodes, "follows" relationships are edges, and a person's name or age is a property.
In a distributed graph database, the data and query workload are spread across multiple machines. Key aspects include:
1. Partitioning (Sharding):
o The graph is divided into partitions that are placed on different machines.
2. Graph Traversals:
o Queries follow edges from node to node; in a distributed setting, a traversal may need to cross machine boundaries.
4. Index-Free Adjacency:
o Nodes directly reference connected nodes, making traversals fast compared to joins in relational databases (a small traversal sketch follows).
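A minimal sketch of index-free adjacency: an in-memory adjacency list with a simple breadth-first traversal (hypothetical "follows" data):

from collections import deque

# Each node lists the nodes it points to directly; no join or index lookup needed.
follows = {
    "alice": ["bob", "carol"],
    "bob":   ["dave"],
    "carol": ["dave"],
    "dave":  [],
}

def reachable(graph, start):
    """Return every node reachable from `start` by following edges."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen

print(reachable(follows, "alice"))   # alice, bob, carol and dave are all reachable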
Security Issues in Distributed Databases
Distributed databases face a number of security issues, including:
o Data Confidentiality – protecting data from unauthorized disclosure.
o Data Integrity – protecting data from unauthorized modification; techniques include checksums and cryptographic hashes.
o Secure Communication – protecting data exchanged between sites over the network.
Role-Based Access Control (RBAC)
In distributed systems, centralized IAM systems (like LDAP, Active Directory, or OAuth) are used to define roles and enforce policies.
Core components of RBAC:
1. User: A person or process that is assigned one or more roles.
2. Role: A collection of permissions that represent a job function (e.g., "Manager", "Analyst",
"Admin").
3. Permission: Authorization to perform specific operations on resources (e.g., read, write, delete).
4. Session: A mapping between a user and an activated subset of roles for a period of time.
For example, if Alice is assigned the Analyst role, she can read and analyze data, but not modify or delete it (see the sketch below).
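A minimal sketch of the role-to-permission mapping (hypothetical role and permission names):

# Roles map to sets of permissions; users are assigned roles, never raw permissions.
role_permissions = {
    "Admin":   {"read", "write", "delete", "manage_users"},
    "Manager": {"read", "write"},
    "Analyst": {"read", "analyze"},
}
user_roles = {"alice": ["Analyst"], "bob": ["Manager", "Analyst"]}

def is_allowed(user, permission):
    """Check whether any of the user's roles grants the permission."""
    return any(permission in role_permissions[role] for role in user_roles.get(user, []))

print(is_allowed("alice", "read"))    # True  - Analyst can read
print(is_allowed("alice", "delete"))  # False - Analyst cannot delete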
SQL Injection
SQL injection occurs when untrusted input is concatenated directly into a SQL statement.
-- Vulnerable query, built by string concatenation
SELECT * FROM users WHERE username = '<username>' AND password = '<password>';
If an attacker enters:
username: alice
password: ' OR '1'='1
the query sent to the database becomes:
SELECT * FROM users WHERE username = 'alice' AND password = '' OR '1'='1';
The OR '1'='1' condition makes the WHERE clause true for every row, so authentication is bypassed.
In distributed environments, the SQL injection attack surface grows:
1. Microservices and APIs
With microservices or APIs, multiple services may interact with databases independently, increasing the attack surface.
2. Data Replication
Injected or malicious data written at one site can be propagated to other replicas before it is detected.
3. Federated Query Engines
Systems like Presto, Trino, or BigQuery allow SQL across multiple databases. A single injection can affect multiple backends.
4. Asynchronous Processing
Malicious SQL can enter message queues or logs and be executed later by downstream services.
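The standard defense is to use parameterized (prepared) statements instead of string concatenation. A minimal sketch using Python's built-in sqlite3 module (the users table and values are hypothetical):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (username TEXT, password TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 's3cret')")

username = "alice"
password = "' OR '1'='1"   # the injection attempt from the example above

# The ? placeholders send the values separately from the SQL text,
# so the input is treated as data, never as part of the query.
rows = conn.execute(
    "SELECT * FROM users WHERE username = ? AND password = ?",
    (username, password),
).fetchall()
print(rows)   # [] - the injection string does not match any real password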
Statistical Database Security
A statistical database permits queries that return aggregate results (for example, the number of employees in a city or the average salary of a department), not identifiable individual records.
Typical attacks include:
1. Inference Attacks: if a query returns data on very few individuals (e.g., 1 or 2), it is easy to infer private details about them.
2. Tracker Attacks: a sequence of individually legal aggregate queries is combined so that their results isolate information about a single individual.
In distributed systems, the risks include data correlation across nodes and inconsistent enforcement of query restrictions (a small illustration follows).
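A hypothetical illustration of how two individually legal aggregate queries can leak a single value (the salaries are invented):

# Salaries in a hypothetical department; Dana's salary is meant to be private.
salaries = {"asha": 52000, "ravi": 61000, "mei": 58000, "dana": 90000}

# Query 1: total salary of the whole department (a "harmless" aggregate).
total_all = sum(salaries.values())

# Query 2: total salary of everyone except Dana (also a legal aggregate query).
total_without_dana = sum(v for k, v in salaries.items() if k != "dana")

# Subtracting the two aggregate results reveals Dana's exact salary.
print(total_all - total_without_dana)   # 90000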
Flow Control
Flow control regulates how data moves between the sites of a distributed database, for example during data replication. It plays a key role in maintaining performance and consistency in environments where data and operations span multiple geographically or logically separated database servers. Without proper flow control, problems like congestion, data loss, bottlenecks, and deadlocks can arise.
Flow control often uses buffer management or window-based techniques (like TCP flow control) and helps to:
o Prevent deadlocks
o Manage concurrency
Encryption and Public Key Infrastructures
What Is Encryption?
Encryption is the process of transforming readable data (plaintext) into an unreadable format
(ciphertext) to protect it from unauthorized access.
Symmetric Encryption: Same key for encryption and decryption (e.g., AES).
Asymmetric Encryption: Uses a public key to encrypt and a private key to decrypt (e.g., RSA,
ECC).
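A minimal sketch of symmetric encryption using the third-party cryptography package's Fernet recipe (which is AES-based); key handling is simplified here for illustration:

from cryptography.fernet import Fernet   # pip install cryptography

key = Fernet.generate_key()        # the same secret key encrypts and decrypts
cipher = Fernet(key)

ciphertext = cipher.encrypt(b"card=4111-1111-1111-1111")
print(ciphertext)                  # unreadable without the key
print(cipher.decrypt(ciphertext))  # b'card=4111-1111-1111-1111'

# With asymmetric encryption (e.g., RSA), anyone may encrypt with the public key,
# but only the holder of the private key can decrypt.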
Public Key Infrastructure (PKI) – Core Components:
Public/Private Key Pairs: Used for encryption, decryption, signing, and verification.
Challenges
1. Key Management
Securely storing, rotating, revoking, and distributing keys is difficult, especially across nodes, regions, or services.
2. Scalability
Managing millions of certificates and keys for microservices, users, and devices is complex.
Trust must be correctly configured across different nodes, domains, or third-party systems.
7. Compromise Recovery
If a key or certificate is compromised, it must be revoked and replaced quickly across all affected nodes and services.
8. Interoperability
Different platforms or services may use incompatible encryption standards or certificate formats.
Coordinating encryption policies and PKI integration across heterogeneous environments is error-prone.