Adt 16 Mark
Introduction
Distributed data storage refers to the practice of storing data across multiple locations, typically within a
distributed database system, to enhance scalability, reliability, and performance. The primary goal is to
ensure data availability and fault tolerance while optimizing data access. Various strategies are employed to
achieve these objectives, each with its advantages and challenges. This document explores key distributed
data storage strategies along with suitable examples.
1. Fragmentation
Fragmentation involves dividing a database into smaller pieces, known as fragments, and distributing them
across multiple sites. Placing each fragment at the site that uses it most keeps data local, which reduces
access latency and improves efficiency.
Types of Fragmentation:
2. Vertical Fragmentation: Divides a table into subsets of columns, ensuring that frequently accessed
attributes are stored together.
o Example: In a university database, student records may be split into academic details (stored
in an academic department) and personal details (stored in the administration department).
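The university example can be sketched in a few lines of Python. This is only an illustration: the field names, the sample rows, and the helper functions vertical_fragment and reconstruct are assumptions, not part of any particular database system. Each fragment keeps the key column so the original table can be rebuilt with a join.

```python
# Hypothetical student records; field names are illustrative only.
students = [
    {"student_id": 1, "name": "Asha", "address": "12 Lake Rd", "major": "Physics", "gpa": 3.7},
    {"student_id": 2, "name": "Ben", "address": "9 Hill St", "major": "History", "gpa": 3.2},
]

def vertical_fragment(rows, key, columns):
    """Project each row onto `columns`, always keeping the key column."""
    return [{key: row[key], **{c: row[c] for c in columns}} for row in rows]

def reconstruct(fragment_a, fragment_b, key):
    """Rejoin two vertical fragments on the shared key."""
    by_key = {row[key]: row for row in fragment_b}
    return [{**row, **by_key[row[key]]} for row in fragment_a]

# One fragment per site (assumed placement).
academic = vertical_fragment(students, "student_id", ["major", "gpa"])
personal = vertical_fragment(students, "student_id", ["name", "address"])
```

Calling reconstruct(academic, personal, "student_id") returns rows equal to the original records, which is the losslessness property a vertical fragmentation is expected to have.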
2. Replication
Replication involves maintaining copies of data at multiple sites to enhance availability, fault tolerance, and
reliability.
Types of Replication:
o Example: Google Cloud Storage replicates critical user data across multiple global data
centers.
2. Partial Replication: Only specific data subsets are stored at different locations.
o Example: A stock exchange system may store trading data for a particular region in data
centers closer to that region.
o Asynchronous Replication: Updates are delayed, allowing for better performance but
increasing the risk of inconsistency.
o Example: Social media applications use asynchronous replication to handle user posts and
comments efficiently.
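The trade-off in asynchronous replication can be shown with a minimal in-memory sketch. The Primary and Replica classes below are hypothetical and single-process; a real system would ship the log over a network in the background. The point is that a write is acknowledged before the replica sees it, so a read from the replica can be stale.

```python
from collections import deque

class Primary:
    def __init__(self):
        self.data = {}
        self.log = deque()   # updates not yet shipped to the replica

    def write(self, key, value):
        self.data[key] = value          # acknowledged immediately
        self.log.append((key, value))   # replication happens later

class Replica:
    def __init__(self):
        self.data = {}

    def apply(self, primary):
        """Drain the primary's pending log (a background task in a real system)."""
        while primary.log:
            key, value = primary.log.popleft()
            self.data[key] = value

primary, replica = Primary(), Replica()
primary.write("post:1", "hello")
stale_read = replica.data.get("post:1")   # replica has not caught up yet
replica.apply(primary)
fresh_read = replica.data.get("post:1")   # now consistent with the primary
```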
3. Partitioning
Partitioning divides a database into smaller, manageable units known as partitions. Unlike fragmentation,
partitioning ensures that each partition operates as a standalone unit.
Types of Partitioning:
o Example: A hospital system partitions patient records based on age groups (0-18, 19-35,
etc.).
2. Hash Partitioning: Data is distributed using a hash function to ensure an even spread.
o Example: A content delivery network (CDN) hashes video IDs to distribute content efficiently.
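Both partitioning schemes reduce to a function from a record to a partition number. The sketch below is an assumption-laden illustration: the age boundaries beyond 35 are invented, and SHA-256 is used only because Python's built-in hash() is salted per process and would not give a stable placement.

```python
import bisect
import hashlib

# Upper bounds of the age ranges: 0-18, 19-35, 36-60, 61+ (the last two are assumed).
AGE_UPPER_BOUNDS = [18, 35, 60]

def range_partition(age):
    """Return the index of the age-range partition that holds this age."""
    return bisect.bisect_left(AGE_UPPER_BOUNDS, age)

def hash_partition(key, num_partitions):
    """Map a string key to a partition using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```

Range partitioning keeps neighbouring values together, which helps range scans, while hash partitioning spreads keys evenly but scatters neighbouring values.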
Distributed query processing involves executing queries across multiple distributed database nodes while
optimizing efficiency, reducing response time, and ensuring minimal data transfer. The optimization process
aims to generate an execution plan that minimizes costs while preserving correctness. This document
outlines the key steps in distributed query processing and optimization.
1. Query Decomposition
The query is initially parsed and decomposed into smaller subqueries to be processed across multiple sites.
Example: Consider a banking database where a query retrieves customer transactions from different
regional servers. The query is split into subqueries targeting each region.
Substeps:
The subqueries are analyzed to determine which fragments of data are required and where they reside.
Example: If a company’s sales data is fragmented by region, a query requesting North American
sales will be directed to servers storing that data.
Substeps:
The query is fragmented further based on the distribution of data and assigned to respective database
nodes.
Example: A retail chain query for product sales in different stores will be broken down into
subqueries for each store's local database.
Strategies:
The system selects the most efficient execution plan based on cost estimation.
Example: If a query requires joining customer data from different locations, the optimizer
determines whether to use hash joins, nested loop joins, or sort-merge joins.
Optimization Techniques:
o Dynamic Programming: Evaluates multiple plans and selects the most cost-efficient.
o Cost-Based Optimization: Computes execution costs using factors such as data transfer and
CPU usage.
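A cost-based choice between candidate plans can be sketched as follows. Everything specific here is an assumption: the Plan fields, the two cost weights, and the three candidate plans (including the semi-join variant) are illustrative, and the sketch simply compares every candidate instead of pruning the search with dynamic programming.

```python
from dataclasses import dataclass

@dataclass
class Plan:
    name: str
    bytes_shipped: int    # data moved between sites
    rows_processed: int   # rough proxy for CPU work

# Assumed weights; a real optimizer would calibrate these from system statistics.
NETWORK_COST_PER_BYTE = 1.0
CPU_COST_PER_ROW = 0.01

def estimated_cost(plan):
    return plan.bytes_shipped * NETWORK_COST_PER_BYTE + plan.rows_processed * CPU_COST_PER_ROW

def choose_plan(plans):
    """Pick the candidate with the lowest estimated cost."""
    return min(plans, key=estimated_cost)

candidates = [
    Plan("ship customers to orders site", bytes_shipped=5_000_000, rows_processed=1_200_000),
    Plan("ship orders to customers site", bytes_shipped=80_000_000, rows_processed=1_200_000),
    Plan("semi-join: ship join keys first", bytes_shipped=900_000, rows_processed=1_500_000),
]
best = choose_plan(candidates)
```

With these made-up numbers the semi-join plan wins because network transfer dominates the cost, even though it processes more rows.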
Example: In an airline booking system, seat availability data is retrieved from multiple locations and
consolidated at the central system.
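The airline example follows a scatter-gather pattern: the same subquery runs at every site and the coordinator merges the partial results. The sketch below assumes hypothetical site names and rows held in memory; in practice each call to run_subquery would be a remote request.

```python
# Hypothetical per-site data: each site holds seat availability for its own flights.
SITES = {
    "london": [{"flight": "BA117", "seats_free": 12}],
    "delhi": [{"flight": "AI101", "seats_free": 0}, {"flight": "AI302", "seats_free": 5}],
}

def run_subquery(rows, min_seats):
    """Local filter executed at one site, so only matching rows cross the network."""
    return [row for row in rows if row["seats_free"] >= min_seats]

def distributed_query(sites, min_seats):
    """Send the subquery to every site, then merge and sort at the coordinator."""
    partial_results = [run_subquery(rows, min_seats) for rows in sites.values()]
    merged = [row for part in partial_results for row in part]
    return sorted(merged, key=lambda row: row["flight"])

available = distributed_query(SITES, min_seats=1)
```

Filtering at each site before merging is what keeps the data transfer small; the coordinator only sees rows that already satisfy the predicate.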
Key Considerations:
Challenges:
Active databases extend traditional database systems by incorporating event-driven architecture through
rules known as Event-Condition-Action (ECA) rules. These databases respond automatically to specified
conditions, making them essential for applications requiring real-time monitoring, security, and
automation. However, their design and implementation present several challenges that must be carefully
addressed.
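An ECA rule can be modelled as three parts: the event it listens for, a condition on the event data, and an action. The following is a minimal sketch, not how any specific database implements triggers; the Rule and ActiveStore classes, the event name, and the 10,000 threshold are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    event: str                          # event name the rule listens for
    condition: Callable[[dict], bool]   # evaluated against the event payload
    action: Callable[[dict], None]      # runs only if the condition holds

class ActiveStore:
    def __init__(self):
        self.rules = []
        self.alerts = []

    def on(self, rule):
        self.rules.append(rule)

    def signal(self, event, payload):
        """Fire every rule registered for `event` whose condition is true."""
        for rule in self.rules:
            if rule.event == event and rule.condition(payload):
                rule.action(payload)

store = ActiveStore()
store.on(Rule(
    event="withdrawal",
    condition=lambda tx: tx["amount"] > 10_000,
    action=lambda tx: store.alerts.append(f"review account {tx['account']}"),
))
store.signal("withdrawal", {"account": "A-17", "amount": 500})      # condition false
store.signal("withdrawal", {"account": "A-17", "amount": 25_000})   # action fires
```

Rules here fire in registration order; once several rules match the same event, that ordering becomes a design decision in its own right.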
Example: In a stock market database, rules for triggering buy/sell actions must be precisely defined
to prevent conflicting actions.
Challenge: Efficiently detecting and handling different types of events (primitive, composite,
external events).
Example: In fraud detection systems, multiple transaction patterns must be monitored to identify
fraudulent behavior.
Example: In a hospital management system, checking for patient vitals every second may overload
the database, requiring optimized rule execution.
Example: In an inventory system, a restocking rule and a discount rule may trigger at the same time,
leading to inconsistencies.
Challenge: Combining active rules with traditional query execution without excessive overhead.
Example: An e-commerce system with recommendation rules should not slow down order
processing queries.
a) Performance Overhead
Challenge: Active rules add additional processing load, affecting database performance.
Example: In a banking system, checking transaction limits in real-time for thousands of users can
slow down overall system performance.
b) Scalability Concerns
Challenge: Ensuring the system remains scalable as the number of rules increases.
Example: A social media platform with event-driven notifications must efficiently handle millions of
users without delays.
Example: In a collaborative document editing system, automatic save and version control rules must
be synchronized to prevent conflicts.
Challenge: Preventing unauthorized rule modifications and ensuring secure event handling.
Example: In a financial system, debugging an incorrectly triggered penalty fee rule may require
extensive logging and analysis.
Temporal databases manage data related to time, enabling tracking of historical, current, and future
information. The three time dimensions used in temporal databases are valid time, transaction time,
and bitemporal time, which combines the first two. Each serves a distinct purpose in tracking and
managing data changes over time.
1. Valid Time
Valid time represents the period during which a fact is true in the real world.
Example: In an employee database, the valid time of an employee’s job position starts from the
hiring date and ends on the resignation date.
Key Characteristics:
o Defined by the application/user.
Use Case: Payroll systems, project management systems, and legal contracts.
2. Transaction Time
Transaction time represents the period during which a fact is stored in the database.
Definition: The time interval during which the database knows about a particular fact.
Example: A bank transaction recorded on January 1 but later corrected on January 3 has two
versions in the database.
Key Characteristics:
Use Case: Financial transaction logs, audit trails, and regulatory compliance systems.
3. Bitemporal Time
Bitemporal time combines both valid time and transaction time to provide a complete historical view of
data.
Definition: Captures both when the fact is valid in the real world and when it was recorded in the
database.
Example: A company updates an employee’s salary retroactively, meaning the change is valid from
an earlier date but recorded later in the system.
Key Characteristics:
Use Case: Healthcare records, insurance claim processing, and tax record management.
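The retroactive salary example can be made concrete with a small bitemporal sketch. The SalaryVersion layout, the dates, and the salary_as_of helper are assumptions chosen for illustration; half-open intervals are used for both time dimensions.

```python
from dataclasses import dataclass
from datetime import date

FOREVER = date.max

@dataclass
class SalaryVersion:
    salary: int
    valid_from: date      # valid time: when the fact holds in the real world
    valid_to: date
    recorded_on: date     # transaction time: when the database learned it
    superseded_on: date = FOREVER

def salary_as_of(history, valid_on, known_on):
    """Salary that applied on `valid_on`, according to what the database knew on `known_on`."""
    for v in history:
        if v.valid_from <= valid_on < v.valid_to and v.recorded_on <= known_on < v.superseded_on:
            return v.salary
    return None

# Hypothetical history: a raise effective 1 March is entered retroactively on 15 April.
history = [
    SalaryVersion(50_000, date(2024, 1, 1), FOREVER, date(2024, 1, 1), superseded_on=date(2024, 4, 15)),
    SalaryVersion(50_000, date(2024, 1, 1), date(2024, 3, 1), date(2024, 4, 15)),
    SalaryVersion(55_000, date(2024, 3, 1), FOREVER, date(2024, 4, 15)),
]
```

Asking for the salary valid on 10 March gives 50,000 when the question is asked as of 1 April and 55,000 when asked as of 1 May, because the correction was only recorded on 15 April. No row is ever deleted, which is what makes the audit trail possible.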
Comparison Table
Aspect  | Valid Time          | Transaction Time      | Bitemporal Time
Example | Employee job tenure | Bank transaction logs | Retroactive salary updates
Mobile transaction models extend traditional database transactions to accommodate the unique
challenges of mobile environments, such as intermittent connectivity, limited bandwidth, and variable
network latency. These models ensure data consistency, reliability, and efficiency in mobile applications,
including banking, e-commerce, and cloud services. This document explores different mobile transaction
models and their impact on database performance.
Concept: A hierarchical transaction model where transactions hop between mobile and fixed
network components.
How It Works:
Example: A mobile shopping app where users add items to a cart offline, and the order is processed
when the device reconnects.
Impact on Performance:
Concept: Transactions execute on mobile devices and report final results to a central server.
How It Works:
Example: A field survey application where data is collected offline and submitted in batches when
network access is available.
Impact on Performance:
Concept: Allows partial transaction execution at different locations as the user moves.
How It Works:
Example: A traveler booking flights across multiple cities, where transactions are handed over
between different network nodes.
Impact on Performance:
o Strict Transactions: Finalize at the central server with full ACID compliance.
Example: A mobile banking app where transaction requests are processed locally first and verified
by the server later.
Impact on Performance:
How It Works:
Example: An airline reservation system prioritizing seat allocation over payment validation.
Impact on Performance:
Strategies like local caching and lightweight query execution reduce overhead.
Conflict resolution techniques like timestamp ordering and version control help maintain integrity.
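Local caching combined with version-based conflict detection might look like the sketch below. The Server and MobileClient classes and the key "cart:42" are hypothetical, and the whole exchange runs in one process; a rejected write would still need an application-level merge or retry, which is not shown.

```python
class Server:
    def __init__(self):
        self.items = {"cart:42": {"value": [], "version": 0}}   # hypothetical key

    def commit(self, key, new_value, based_on_version):
        """Accept the write only if the client saw the latest version."""
        if self.items[key]["version"] != based_on_version:
            return False                       # conflict: someone else updated first
        self.items[key] = {"value": new_value, "version": based_on_version + 1}
        return True

class MobileClient:
    def __init__(self, server):
        self.server = server
        self.cache = {}     # local copies: key -> (value, version)
        self.pending = {}   # key -> (latest offline value, version it was based on)

    def fetch(self, key):
        item = self.server.items[key]
        self.cache[key] = (list(item["value"]), item["version"])

    def write_offline(self, key, new_value):
        _, version = self.cache[key]
        self.cache[key] = (new_value, version)
        self.pending[key] = (new_value, version)   # repeated writes to one key coalesce

    def reconnect(self):
        """Replay queued writes; return the keys rejected because of a version conflict."""
        rejected = [key for key, (value, version) in self.pending.items()
                    if not self.server.commit(key, value, version)]
        self.pending = {}
        return rejected
```

If two devices fetch the same cart, edit it offline, and reconnect in turn, the first reconnect succeeds and the second reports the key as rejected instead of silently overwriting the first update.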
Spatial indexing structures help in efficient data retrieval by organizing spatial data for faster query
execution.
a) R-Tree Indexing
Concept: A hierarchical tree structure that groups nearby objects into bounding rectangles.
How It Works:
o The tree is traversed from root to leaf nodes to filter out unnecessary data.
Example: Used in GIS systems for region-based queries, such as finding all parks within a city.
Pros:
Cons:
b) Quad-Tree Indexing
Concept: A hierarchical data structure that recursively divides a 2D space into four quadrants.
How It Works:
o The space is partitioned into quadrants until a threshold number of objects per quadrant is
reached.
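The recursive subdivision can be sketched as a small point quad-tree. This is an illustrative implementation under stated assumptions: points only, square regions with half-open bounds, a capacity threshold of 4 by default, and an added max_depth guard (not mentioned in the notes) so that many identical points cannot cause endless splitting.

```python
class QuadTree:
    def __init__(self, x, y, size, capacity=4, depth=0, max_depth=8):
        self.x, self.y, self.size = x, y, size   # lower-left corner and side length
        self.capacity = capacity                 # threshold before a quadrant splits
        self.depth, self.max_depth = depth, max_depth
        self.points = []
        self.children = None

    def insert(self, px, py):
        if not (self.x <= px < self.x + self.size and self.y <= py < self.y + self.size):
            return False                         # point lies outside this quadrant
        if self.children is None:
            if len(self.points) < self.capacity or self.depth >= self.max_depth:
                self.points.append((px, py))
                return True
            self._split()
        return any(child.insert(px, py) for child in self.children)

    def _split(self):
        half = self.size / 2
        self.children = [
            QuadTree(self.x + dx, self.y + dy, half, self.capacity, self.depth + 1, self.max_depth)
            for dx in (0, half) for dy in (0, half)
        ]
        for px, py in self.points:               # push stored points down one level
            any(child.insert(px, py) for child in self.children)
        self.points = []

    def query(self, qx, qy, qsize):
        """Return points inside the square [qx, qx+qsize) x [qy, qy+qsize)."""
        if (qx >= self.x + self.size or qx + qsize <= self.x or
                qy >= self.y + self.size or qy + qsize <= self.y):
            return []                            # no overlap: prune this subtree
        found = [(px, py) for px, py in self.points
                 if qx <= px < qx + qsize and qy <= py < qy + qsize]
        for child in self.children or []:
            found.extend(child.query(qx, qy, qsize))
        return found
```

Dense areas end up with deeper subtrees while sparse areas stay shallow, and a query skips every quadrant that does not overlap the search square.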
Pros:
Cons:
c) Grid-Based Indexing
Concept: Divides the space into uniform grids, storing objects based on their spatial location.
How It Works:
o Queries scan only relevant grid cells instead of the entire dataset.
Pros:
o Simple implementation.
Cons:
o Fixed grid sizes may lead to inefficient storage for varying data densities.
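A uniform grid index fits in a few lines, which is part of its appeal. The sketch below is a minimal version for point data; the GridIndex class, the cell size, and the labels are assumptions for illustration.

```python
from collections import defaultdict

class GridIndex:
    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.cells = defaultdict(list)   # (col, row) -> points stored in that cell

    def _cell(self, x, y):
        return (int(x // self.cell_size), int(y // self.cell_size))

    def insert(self, x, y, label):
        self.cells[self._cell(x, y)].append((x, y, label))

    def query(self, xmin, ymin, xmax, ymax):
        """Return labels of points in the rectangle, visiting only overlapping cells."""
        c0, r0 = self._cell(xmin, ymin)
        c1, r1 = self._cell(xmax, ymax)
        hits = []
        for col in range(c0, c1 + 1):
            for row in range(r0, r1 + 1):
                for x, y, label in self.cells.get((col, row), []):
                    if xmin <= x <= xmax and ymin <= y <= ymax:
                        hits.append(label)
        return hits
```

The fixed cell size is the weak point: a cell covering a dense city centre may hold thousands of points while most rural cells stay empty.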
Spatial joins combine two spatial datasets based on their spatial relationships (e.g., intersection,
containment).
How It Works:
Example: Used in location-based advertising to find nearby customers for targeted promotions.
Pros:
b) Plane-Sweep Join
Concept: Sorts spatial objects along one dimension and sweeps a plane to find intersecting objects.
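A plane-sweep join over axis-aligned rectangles can be sketched as follows. The tuple layout (id, xmin, ymin, xmax, ymax) and the function name are assumptions, and the active lists are plain Python lists rather than the interval structures a production implementation would more likely use.

```python
def plane_sweep_join(rects_a, rects_b):
    """Return pairs (a_id, b_id) of rectangles that intersect.

    Each rectangle is a tuple (id, xmin, ymin, xmax, ymax).
    """
    events = [(r[1], "A", r) for r in rects_a] + [(r[1], "B", r) for r in rects_b]
    events.sort(key=lambda e: e[0])          # sweep left to right by xmin
    active = {"A": [], "B": []}
    pairs = []
    for x, side, rect in events:
        other = "B" if side == "A" else "A"
        # Drop rectangles the sweep line has already passed.
        active[other] = [r for r in active[other] if r[3] >= x]
        for cand in active[other]:
            if rect[2] <= cand[4] and cand[2] <= rect[4]:   # y-intervals overlap
                pairs.append((rect[0], cand[0]) if side == "A" else (cand[0], rect[0]))
        active[side].append(rect)
    return pairs
```

Because rectangles are visited in order of their left edge, each one is compared only with rectangles from the other dataset that are still open on the x-axis, instead of with every rectangle.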
How It Works:
Pros:
Cons:
How It Works:
o One dataset is indexed, and the other dataset is scanned to perform lookups efficiently.
Example: Used in urban planning to match road networks with population density zones.
Pros:
Cons:
Additional techniques improve the execution of spatial queries by minimizing computations and data
movement.
Concept: Uses sampling and estimation techniques to provide fast, approximate answers.
Example: Used in big data applications to estimate traffic congestion without processing all GPS
records.
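Sampling-based estimation can be illustrated with synthetic data. The speed readings below are randomly generated stand-ins for GPS records, and estimate_mean is a hypothetical helper; fixed seeds are used only to keep the sketch reproducible.

```python
import random

def estimate_mean(values, sample_size, seed=0):
    """Estimate the mean of `values` from a uniform random sample."""
    rng = random.Random(seed)
    sample = rng.sample(values, min(sample_size, len(values)))
    return sum(sample) / len(sample)

# Synthetic speed readings in km/h; purely illustrative data.
data_rng = random.Random(42)
speeds = [data_rng.uniform(5, 60) for _ in range(100_000)]

exact = sum(speeds) / len(speeds)                  # scans every record
approx = estimate_mean(speeds, sample_size=1_000)  # scans 1% of the records
```

The estimate touches one record in a hundred, and its error shrinks roughly with the square root of the sample size, which is the usual accuracy-for-speed trade in approximate query processing.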
Pros:
Cons:
Example: Cloud-based GIS systems use parallel computing to process satellite imagery.
Pros:
Cons:
A distributed system is a collection of independent computers that work together as a single system to
provide a seamless user experience. These systems are widely used in cloud computing,
telecommunications, and large-scale applications such as Google Search and online banking. While
distributed systems offer scalability, fault tolerance, and performance benefits, they also introduce several
challenges. This document explores the key characteristics and challenges of distributed systems with real-
world examples.
a) Resource Sharing
Definition: Multiple computers share hardware, software, and data resources across the system.
Example: Cloud storage services like Google Drive allow users to access files from multiple devices,
ensuring data synchronization.
b) Scalability
Definition: The system can expand by adding more nodes without significant performance
degradation.
Example: Amazon Web Services (AWS) can scale its computing power dynamically based on user
demand.
Definition: The system can continue functioning despite failures in individual components.
Example: Google Search uses data replication across multiple data centers to prevent downtime.
Example: Online multiplayer games like Fortnite handle thousands of concurrent users interacting in
real-time.
e) Transparency
Definition: Users and applications experience the system as a single entity, hiding the complexity of
distribution.
Example: Netflix users stream videos without knowing the geographical location of the content
servers.
Solution: Content Delivery Networks (CDNs) cache frequently accessed content closer to users.
Issue: Ensuring all copies of data remain up-to-date across distributed nodes.
Solution: Distributed databases use protocols like Two-Phase Commit (2PC) and Paxos.
Example: Online banking transactions require strict consistency to prevent double withdrawals.
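The core of Two-Phase Commit is that nothing commits unless every participant votes yes. The sketch below shows only that decision rule; the Participant class is hypothetical, and timeouts, logging to stable storage, and coordinator failure are deliberately left out.

```python
class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):
        """Phase 1: vote yes only if the local changes can be made durable."""
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"

def two_phase_commit(participants):
    """Coordinator: commit everywhere only if every participant votes yes."""
    votes = [p.prepare() for p in participants]   # phase 1: collect votes
    decision = all(votes)
    for p in participants:                        # phase 2: broadcast the decision
        if decision:
            p.commit()
        else:
            p.abort()
    return decision
```

A single "no" vote aborts the transaction at every site, which is what prevents one account from being debited without the matching credit being applied elsewhere.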
Example: Online payment gateways use multi-factor authentication (MFA) to enhance security.
Example: E-commerce platforms like Amazon distribute user requests across multiple servers during
peak sales.
Concurrency control in distributed systems ensures that multiple transactions can execute simultaneously
without leading to inconsistencies, conflicts, or data loss. Since distributed databases operate across
multiple sites, maintaining data integrity and consistency is critical. Various concurrency control techniques
have been developed to address these challenges. This document explores different concurrency control
techniques used in distributed systems and their impact on performance.
Concept: Transactions acquire locks in a growing phase and release them in a shrinking phase; once a transaction releases any lock, it may not acquire new ones.
Example: In an online banking system, a transfer transaction locks the sender’s and receiver’s
accounts to prevent inconsistencies.
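The two-phase rule can be enforced with a flag that flips on the first release. This sketch assumes exclusive locks only and a single shared lock table held in a dictionary; the class name and item names are hypothetical, and waiting is left to the caller.

```python
class TwoPhaseTransaction:
    def __init__(self, name, lock_table):
        self.name = name
        self.lock_table = lock_table   # shared dict: item -> owning transaction name
        self.held = set()
        self.shrinking = False

    def lock(self, item):
        if self.shrinking:
            raise RuntimeError("2PL violation: cannot lock after releasing a lock")
        owner = self.lock_table.get(item)
        if owner not in (None, self.name):
            return False               # held by another transaction; caller must wait
        self.lock_table[item] = self.name
        self.held.add(item)
        return True

    def unlock(self, item):
        self.shrinking = True          # first release ends the growing phase
        self.held.discard(item)
        if self.lock_table.get(item) == self.name:
            del self.lock_table[item]
```

A transfer would lock both accounts, update them, and only then start releasing; any attempt to lock a third item after the first unlock raises an error.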
Concept: A centralized or decentralized lock manager coordinates lock requests from multiple
nodes.
Example: In a cloud-based document editing system, locks prevent users from overwriting changes
made by others.
Concept: Each transaction receives a unique timestamp, and conflicting operations are executed in
timestamp order, so an older transaction's operation takes effect before a newer one's.
Example: A stock trading system ensures older buy/sell requests are executed before newer ones.
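The basic timestamp-ordering checks can be written down directly. In this sketch the Item class, the Abort exception, and the integer timestamps are assumptions; a transaction that receives Abort would normally restart with a new timestamp.

```python
class Abort(Exception):
    """Raised when an operation arrives too late and the transaction must restart."""

class Item:
    def __init__(self, value):
        self.value = value
        self.read_ts = 0    # largest timestamp that has read this item
        self.write_ts = 0   # timestamp of the last writer

def to_read(item, ts):
    if ts < item.write_ts:              # a younger transaction already overwrote it
        raise Abort(f"read by T{ts} is too late")
    item.read_ts = max(item.read_ts, ts)
    return item.value

def to_write(item, ts, value):
    if ts < item.read_ts or ts < item.write_ts:
        raise Abort(f"write by T{ts} is too late")
    item.value, item.write_ts = value, ts
```

For example, after a transaction with timestamp 7 has read an item, a write from a transaction with timestamp 6 is rejected, because accepting it would change data the later transaction has already seen.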
Concept: Multiple versions of data items are maintained, allowing readers and writers to operate
without conflict.
Example: PostgreSQL uses MVCC to allow read transactions without blocking write operations.
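The general idea of multiversioning can be sketched with a version list per key. This is a conceptual illustration, not PostgreSQL's actual implementation: the VersionedStore class and its logical clock are assumptions, and garbage collection of old versions is omitted.

```python
class VersionedStore:
    def __init__(self):
        self.versions = {}   # key -> list of (commit_ts, value), oldest first
        self.clock = 0       # logical commit counter

    def write(self, key, value):
        """Commit a new version instead of overwriting the old one."""
        self.clock += 1
        self.versions.setdefault(key, []).append((self.clock, value))
        return self.clock

    def snapshot(self):
        """A reader remembers the clock value at which it started."""
        return self.clock

    def read(self, key, snapshot_ts):
        """Return the newest version committed at or before the snapshot."""
        for commit_ts, value in reversed(self.versions.get(key, [])):
            if commit_ts <= snapshot_ts:
                return value
        return None
```

A reader that took its snapshot before a later write keeps seeing the older value, so it never has to wait for the writer and the writer never has to wait for it.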
Concept: Transactions execute without restrictions and validate changes before committing.
Phases:
1. Read Phase: Transaction reads data without locks.
Example: A ticket booking system uses OCC to allow multiple users to select seats, validating at the
final step.
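The seat-selection example maps onto a read, validate, and write cycle. The sketch below is single-threaded and uses a plain dictionary as the store; the OptimisticTransaction class and the version counters are assumptions, and in a real system validation and the write step would have to run atomically.

```python
class OptimisticTransaction:
    def __init__(self, store):
        self.store = store          # dict: key -> (value, version)
        self.read_versions = {}     # versions observed during the read phase
        self.writes = {}            # buffered writes, invisible to others until commit

    def read(self, key):
        value, version = self.store[key]
        self.read_versions[key] = version
        return value

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        """Validate, then apply buffered writes; return False if validation fails."""
        for key, seen in self.read_versions.items():
            if self.store[key][1] != seen:
                return False        # another transaction committed a newer version
        for key, value in self.writes.items():
            _, version = self.store.get(key, (None, 0))
            self.store[key] = (value, version + 1)
        return True
```

If two users read the same free seat and both try to book it, the first commit succeeds and the second fails validation, so the second user must pick again.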
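Example: Permissioned blockchain systems that use voting-based consensus commit a transaction only
after a quorum of validator nodes agrees on it. Bitcoin, by contrast, relies on proof-of-work rather than a fixed quorum.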
Detection: Periodically checks for circular wait conditions and aborts transactions to resolve
deadlocks.
Prevention: Uses transaction timestamps to decide whether a requesting transaction waits or aborts, so that circular waits cannot form (e.g., wait-die and wound-wait schemes).
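Both approaches can be sketched briefly. The functions below assume that a smaller timestamp means an older transaction and that the wait-for graph is given as a dictionary; the function names and return strings are illustrative choices.

```python
def wait_die(requester_ts, holder_ts):
    """Older requester waits; younger requester is aborted ("dies")."""
    return "wait" if requester_ts < holder_ts else "abort requester"

def wound_wait(requester_ts, holder_ts):
    """Older requester preempts ("wounds") the holder; younger requester waits."""
    return "abort holder" if requester_ts < holder_ts else "wait"

def has_deadlock(wait_for):
    """Detect a cycle in a wait-for graph given as {txn: set of txns it waits on}."""
    visiting, done = set(), set()

    def visit(txn):
        if txn in done:
            return False
        if txn in visiting:
            return True              # back edge: circular wait
        visiting.add(txn)
        found = any(visit(nxt) for nxt in wait_for.get(txn, ()))
        visiting.discard(txn)
        done.add(txn)
        return found

    return any(visit(txn) for txn in list(wait_for))
```

In both prevention schemes waiting is only ever allowed in one direction of age, so a cycle of waiting transactions cannot form; detection instead lets cycles form and looks for them afterwards.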
Traditional Relational Databases (RDBMS) and Multimedia Databases (MMDB) serve different purposes in
data storage and management. RDBMS primarily handles structured data, whereas MMDB is designed to
store, retrieve, and manipulate multimedia content such as images, videos, audio, and documents. This
document compares these two database types based on structure and functionality.
1. Structure Comparison
a) Data Model
RDBMS: Uses a structured format based on tables with rows and columns, ensuring strict schema
enforcement.
MMDB: Supports complex multimedia data types such as images, audio, video, and spatial data.
c) Storage Mechanism
MMDB: Uses BLOB (Binary Large Objects) and CLOB (Character Large Objects) to store large
unstructured data.
d) Indexing Techniques
RDBMS: Uses B-trees and hash indexing for efficient query retrieval.
MMDB: Uses content-based indexing (CBIR for images), spatial indexing (R-trees), and feature-
based retrieval (wavelets for videos).
2. Functional Comparison
a) Query Processing
MMDB: Uses complex query models, including feature extraction and similarity search.
o Example: A face recognition system retrieving images based on facial features rather than
text-based queries.
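Similarity search over extracted features can be sketched with cosine similarity. The three-dimensional vectors and file names below are made up; real feature vectors are far larger and are usually searched through a specialised index, not a full scan.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def most_similar(query_vector, catalog, top_k=2):
    """Rank stored items by similarity of their feature vectors to the query."""
    scored = [(cosine_similarity(query_vector, vec), name) for name, vec in catalog.items()]
    return [name for _, name in sorted(scored, reverse=True)[:top_k]]

# Made-up feature vectors standing in for extracted facial features.
catalog = {
    "face_001.jpg": [0.9, 0.1, 0.3],
    "face_002.jpg": [0.1, 0.8, 0.5],
    "face_003.jpg": [0.85, 0.15, 0.35],
}
matches = most_similar([0.9, 0.1, 0.3], catalog)
```

Unlike a primary-key lookup, the query returns a ranked list of near matches; there is no single exact answer.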
RDBMS: Retrieves exact data using primary keys and foreign keys.
o Example: A music app retrieving songs based on genre and user listening patterns.
RDBMS: Ensures ACID (Atomicity, Consistency, Isolation, Durability) compliance for transaction
processing.
MMDB: Often relaxes strict ACID guarantees, for example by accepting eventual consistency, when
handling large multimedia content updates.
o Example: Social media platforms ensuring smooth uploads while maintaining database
integrity.
RDBMS: Performs well for structured data but struggles with large unstructured datasets.
MMDB: Optimized for handling high-volume multimedia content with distributed caching and
parallel processing.
A distributed system consists of multiple independent computers working together to achieve a common
goal. The system is designed to provide scalability, reliability, and fault tolerance. Different architectures are
used to structure distributed systems based on their functionality, data distribution, and communication
models. This document explores various types of distributed system architectures in detail.
1. Client-Server Architecture
Description:
o The system is divided into clients (which request services) and servers (which provide
services).
o The server processes requests and sends responses to clients over a network.
Example: Web applications where a browser (client) interacts with a web server.
Advantages:
Disadvantages:
Description:
Advantages:
Disadvantages:
3. Three-Tier Architecture
Description:
Advantages:
Disadvantages:
4. Microservices Architecture
Description:
o Breaks down an application into small, independent services that communicate via APIs.
Advantages:
Disadvantages:
Description:
Example: Banking systems where multiple services handle transactions, accounts, and customer
management.
Advantages:
Disadvantages:
o Uses a decentralized ledger where transactions are recorded across multiple nodes.
Advantages:
Disadvantages: