PL/SQL Practice Questions and Answers
3. What is granularity?
5. What is a system log? Explain all the states a transaction can be in while it is recorded in the system log.
Ans: A system log is a record of events, actions, and errors within a system,
crucial for debugging, monitoring, and auditing. When a transaction is
present in a system log, it can be in several states: active, partially
committed, committed, failed, aborted, or terminated, each representing a
different stage in the transaction's lifecycle.
Here's a more detailed explanation of each state:
Active State:
The transaction is currently being executed, and operations are being
performed on the database.
Partially Committed State:
The transaction has completed all its operations, but the changes haven't
been permanently written to the database yet.
Committed State:
The transaction has successfully completed all its operations, and the
changes have been permanently written to the database.
Failed State:
The transaction encountered an error or failure during execution (or during the
commit process), so its normal execution can no longer proceed and its changes
must be undone.
Aborted State:
The transaction has been rolled back, meaning any changes made during
the transaction are discarded, and the database is returned to its previous
state.
Terminated State:
The transaction has completed, either successfully or unsuccessfully, and
its resources have been released.
Ans: Schedule:
A schedule represents the order in which operations from different transactions are
executed.
Serial Schedule:
In a serial schedule, transactions are executed sequentially, meaning one
transaction completes entirely before the next one begins. This eliminates any
concurrency or interleaving of operations from different transactions.
Example:
Serial Schedule: T1: R(A), W(A), T2: R(B), W(B) (T1 completes before T2 starts)
Non-Serial Schedule: T1: R(A), T2: R(B), T1: W(A), T2: W(B) (operations from T1 and
T2 are interleaved)
Example:
Transaction A: Updates a bank account balance (but hasn't committed the changes
yet).
Transaction B: Reads the updated balance (before Transaction A commits or rolls
back).
If Transaction A rolls back, the balance read by Transaction B was never the correct
balance, leading to a problem.
Ans: In DBMS, a "blind write" refers to a write operation that occurs without
a preceding read operation on the same data item, effectively overwriting
the data without knowing its current value.
Example: Imagine a transaction that sets a bank account balance to a fixed value
without first reading the current balance. Because the write does not depend on the
existing value, it is a blind write.
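In SQL terms, a minimal sketch of such a blind write (the table and column names are hypothetical) could look like this:
SQL
-- Overwrites the balance without reading it first: a blind write.
UPDATE accounts
   SET balance = 5000
 WHERE account_id = 1;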
Concurrency Control:
Databases are designed to handle multiple users accessing and modifying data
simultaneously.
Locks help manage this concurrency by controlling access to resources, allowing
multiple users to access the database without causing data conflicts.
Atomicity, Consistency, Isolation, and Durability (ACID):
Transactions in databases are designed to be ACID compliant.
Locks are a key mechanism for ensuring the 'I' (Isolation) aspect of ACID, meaning that
each transaction appears to execute independently, without interference from other
concurrent transactions.
Types of Locks:
Shared Locks (Read Locks): Allow multiple transactions to read a resource
simultaneously, but prevent any transaction from modifying it.
Exclusive Locks (Write Locks): Allow only one transaction to access a resource,
preventing other transactions from reading or modifying it until it is released.
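As a sketch of how these two lock modes surface in Oracle SQL (the EMPLOYEES table is hypothetical):
SQL
-- Shared (read) lock on the whole table: other sessions can still read,
-- but cannot modify the table until this transaction ends.
LOCK TABLE employees IN SHARE MODE;

-- Exclusive row locks: the selected rows cannot be updated or locked
-- by other transactions until this transaction commits or rolls back.
SELECT salary
  FROM employees
 WHERE department_id = 10
   FOR UPDATE;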
19. What are the rules followed when shared/exclusive locking schema is used?
Ans:
Key Rules:
Shared Lock Compatibility: Multiple shared locks can be held on the same resource
simultaneously.
Exclusive Lock Exclusion: A resource with a shared lock cannot be granted an
exclusive lock, and vice versa.
Exclusive Lock Priority: If a transaction holds an exclusive lock, all other lock
requests (shared or exclusive) are blocked until the exclusive lock is released.
Internet:
Definition:
The Internet is a vast, interconnected network of computers and devices,
accessible to anyone with an internet connection.
Purpose:
It facilitates communication, information sharing, and access to a wide
range of resources and services worldwide.
Accessibility:
Public and open to anyone.
Security:
While the internet offers vast connectivity, it is also a public network,
meaning security measures are crucial to protect user data and privacy.
Intranet:
Definition:
An intranet is a private network that an organization uses to share
information and resources internally.
Purpose:
Intranets are designed for internal communication, collaboration, and
access to company-specific information and resources.
Accessibility:
Restricted to authorized users within the organization.
Security:
Intranets are typically more secure than the internet because access is
controlled and limited to authorized personnel.
Examples:
Company policies, employee directories, project documents, and internal
communication platforms.
Client:
The client is the component that initiates the request for data or services.
It can be a user interface, a web browser, or any application that needs to access the
database.
The client interacts with the user and translates their requests into queries that are sent
to the server.
Examples of clients include a web browser accessing a website, a desktop application
interacting with a database, or a mobile app fetching data from a server.
Server:
The server is the central component that stores and manages the database.
It receives requests from clients, processes them, and returns the requested data or
performs the requested actions.
The server is responsible for ensuring data integrity, security, and efficient access to
the database.
Examples of servers include database management systems (DBMS) like MySQL,
PostgreSQL, or Oracle, or web servers like Apache or Nginx.
Ans: In PL/SQL, the %TYPE attribute is crucial for declaring variables and
parameters that inherit the data type of a database column or another
variable, ensuring type compatibility and simplifying code maintenance
when data types change.
Purpose: The %TYPE attribute allows you to declare variables and parameters with
the same data type as a field, record, nested table, database column, or another
variable, without having to specify the data type explicitly.
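A minimal sketch of %TYPE in use (the EMPLOYEES table and its columns are assumed for illustration):
SQL
DECLARE
   -- v_salary automatically inherits the data type of employees.salary;
   -- if the column's type is later changed, this declaration still compiles.
   v_salary employees.salary%TYPE;
   -- A variable can also anchor its type to another variable.
   v_bonus  v_salary%TYPE;
BEGIN
   SELECT salary
     INTO v_salary
     FROM employees
    WHERE employee_id = 100;

   v_bonus := v_salary * 0.10;
   DBMS_OUTPUT.PUT_LINE('Bonus: ' || v_bonus);
END;
/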
49. Define query cost in the context of database query processing.
Ans: The cost of a query is the time taken for the query to hit the database and
return the result. It includes the query processing time, i.e., the time taken to
parse and translate the query, optimize it, evaluate and execute it, and return
the result to the user.
50. What are the main factors affecting the cost of a query?
Ans: The cost of a query is primarily affected by the amount of data
accessed, the complexity of the query, and the resources required for
execution, including I/O operations, CPU usage, and memory
consumption.
Merge Join: One of the most efficient join algorithms, especially for large datasets, is
the merge join.
GROUP BY Clause: When a query includes a GROUP BY clause, the DBMS often
sorts the data based on the grouping attributes. This brings all tuples with the same
group key together, making it easy to calculate aggregate functions (like COUNT, SUM,
AVG, MIN, MAX) for each group by simply iterating through the sorted data.
Explicit Sorting: The most direct use of sorting is to satisfy the ORDER BY clause in a
SQL query. The DBMS will explicitly sort the result set based on the specified
columns and sort order (ASC or DESC) before presenting the final output to the user.
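As a small illustration of both uses (the ORDERS table is hypothetical), the DBMS may sort on customer_id to form the groups and then sort the grouped result again to satisfy the ORDER BY:
SQL
SELECT customer_id,
       COUNT(*)    AS order_count,
       SUM(amount) AS total_amount
  FROM orders
 GROUP BY customer_id
 ORDER BY total_amount DESC;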
The core idea behind external sorting is to break the large dataset into smaller chunks that can
fit into the available main memory. Each of these chunks is then sorted using an efficient
internal sorting algorithm (like quicksort or merge sort). These sorted chunks are often called
"runs" or "sorted subfiles." Finally, these sorted runs are merged together in multiple passes
to produce the final sorted output.
Ans: Relational expressions in query optimization are the internal representations of SQL
queries used by a database management system (DBMS) during the query optimization
process. When a user submits an SQL query, the DBMS doesn't directly execute it. Instead, it
goes through several phases, and one crucial phase is optimization. Relational expressions act
as the language of the query optimizer. They provide a formal and manipulable representation
of the user's query, allowing the optimizer to explore various execution strategies and choose
the most efficient one
58. Explain the significance of relational algebra transformations.
2. Improving Performance:
Index Exploitation: Transformations can help the optimizer determine when and
how to effectively utilize available indexes. For example, a selection operation can be
transformed to use an index scan or index seek if an appropriate index exists on the
selection attribute.
Data Statistics Utilization: Optimizers use data statistics (e.g., cardinality,
selectivity) to estimate the cost of different relational expressions and their
corresponding physical plans. Transformations help in creating expressions that allow
for more accurate cost estimations and better plan selection based on these statistics.
Parallel Execution: In parallel database systems, transformations can help identify
opportunities to parallelize operations across multiple processors or nodes, further
improving query execution time.
Relational algebra provides a logical view of the query, abstracting away the physical
storage details and execution algorithms. Transformations operate at this logical level,
allowing the optimizer to reason about different execution strategies independently of
the specific physical implementation. This makes the optimization process more
manageable and adaptable to different database systems and storage structures.
Without these transformations, the DBMS would be largely limited to executing queries in a
relatively fixed and often inefficient manner, leading to poor performance.
Ans:
Data Distribution Representation:
Histograms provide a visual representation of data distribution by dividing data into
"buckets" or intervals, showing the frequency of values within each interval.
Optimizer Assistance:
The database optimizer uses this histogram data to estimate the number of rows
that will be returned by a query, which is crucial for choosing the most efficient
execution plan.
Improved Query Performance:
By understanding the data distribution, the optimizer can make better decisions
about which indexes to use, which joins to perform, and how to filter data, leading
to faster query execution times.
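In Oracle, histograms are gathered as part of optimizer statistics; the sketch below uses the DBMS_STATS package and assumes a hypothetical ORDERS table with a skewed STATUS column:
SQL
BEGIN
   -- Gather table statistics and request a histogram on the STATUS column
   -- so the optimizer can estimate the selectivity of predicates
   -- such as WHERE status = 'OPEN'.
   DBMS_STATS.GATHER_TABLE_STATS(
      ownname    => USER,
      tabname    => 'ORDERS',
      method_opt => 'FOR COLUMNS status SIZE 254');
END;
/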
63. Mention the key factors influencing the choice of evaluation plans.
Ans:
The choice of an evaluation plan is influenced by factors like the purpose of
the evaluation, the available resources, the target audience, the timing, and
the type of program or intervention being evaluated.
Here's a more detailed breakdown:
1. Purpose of the Evaluation:
What questions need to be answered? The evaluation questions drive
the entire process, including the selection of methods and indicators.
What is the intended use of the evaluation results? Is it for program
improvement, accountability, or decision-making?
Is the evaluation formative (for improvement) or summative (to assess
outcomes)?
2. Available Resources:
Budget: The cost of different evaluation methods can vary significantly.
Time: Some evaluation methods require more time than others.
Staff expertise: Do you have the in-house expertise to conduct the
evaluation, or will you need to hire external evaluators?
3. Target Audience:
Who needs to understand the evaluation results?
The audience will influence the level of detail and the format of the
reports.
Who will be involved in the evaluation process?
Stakeholder engagement is crucial for ensuring the evaluation is relevant
and useful.
4. Timing:
When is the evaluation needed? Is it a formative evaluation during the
program implementation, or a summative evaluation at the end?
Are there any deadlines or constraints?
67. Explain the difference between materialized views and simple views.
68. How do materialized views enhance query performance?
69. What is the impact of query rewriting on materialized views?
70. Mention any two challenges in maintaining materialized views.
Ans:
67) - A simple view is a stored query: it holds no data of its own and is re-evaluated
against the base tables every time it is referenced. A materialized view physically stores
the result set of its defining query, so it can be read (and indexed) directly but must be
refreshed when the underlying base tables change.
68) - Here's a detailed explanation of how materialized views enhance query performance:
1. Reduced Computation:
Direct Data Access: Instead of accessing multiple base tables and potentially
performing numerous disk reads for joins and filtering, queries against a materialized
view read directly from the stored result set. This reduces the number of I/O
operations, which are often the bottleneck in database performance.
Optimized Storage: The data in a materialized view can be stored in a way that is
optimized for the specific queries it is designed to support.
Pre-computed Results: Because the data is already processed and stored, queries
against materialized views return results much faster, especially for complex
analytical queries that would otherwise take a long time to execute. This is crucial for
applications requiring low latency, such as dashboards and real-time reporting.
4. Indexing Capabilities:
Index Creation: Unlike simple views, you can create indexes on the columns of a
materialized view. These indexes can further accelerate query performance on the
materialized view, just like indexes on regular tables. This allows for highly
optimized data retrieval from the pre-computed results.
5. Simplified Querying:
Abstraction: Materialized views can hide the complexity of the underlying data
model and query logic. Users can query a simpler, pre-defined structure, making it
easier to write and understand queries, which can indirectly improve overall system
efficiency.
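A sketch of creating and indexing a materialized view (object names are hypothetical, and the refresh options shown are only one possible choice):
SQL
CREATE MATERIALIZED VIEW sales_summary_mv
  BUILD IMMEDIATE
  REFRESH COMPLETE ON DEMAND
AS
  SELECT product_id,
         SUM(amount) AS total_sales,
         COUNT(*)    AS order_count
    FROM sales
   GROUP BY product_id;

-- Unlike a simple view, the stored result set can be indexed.
CREATE INDEX sales_summary_mv_idx ON sales_summary_mv (product_id);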
70) –
Data Consistency:
Materialized views store pre-computed results of queries, so when the underlying data
changes, the view needs to be updated to remain consistent.
If the refresh process is not handled correctly or efficiently, inconsistencies can arise
between the materialized view and the base tables, leading to inaccurate data for
queries.
Maintaining consistency requires careful planning of refresh strategies and
mechanisms to ensure that updates to the base tables are propagated to the
materialized views in a timely and reliable manner.
Think of SQL as being good at describing what data you want, while PL/SQL lets you
describe how to manipulate and process that data in a structured, step-by-step manner within
the Oracle database environment.
Ans: The PL/SQL execution environment is of paramount significance because it's the
foundation upon which all PL/SQL code runs within the Oracle database. It provides the
necessary infrastructure, resources, and context for PL/SQL programs to be executed
correctly and efficiently. Understanding its significance is crucial for comprehending how
PL/SQL works and how to optimize its performance.
Runs Inside the Database Kernel: PL/SQL code is executed directly within the
Oracle database server process. This tight integration is a core advantage, as it allows
PL/SQL to directly interact with the database's data structures, memory management,
and security mechanisms without the overhead of external communication.
Access to Database Resources: The execution environment provides PL/SQL
programs with access to various database resources, including:
o Data: Tables, views, indexes, etc.
o Database Objects: Procedures, functions, packages, triggers, types, etc.
o System Resources: Memory (PGA and SGA), CPU, I/O.
o Security Context: The privileges and roles of the user executing the PL/SQL
code.
1. Anonymous Blocks:
SQL
[DECLARE]
-- Declarations of variables, constants, types, etc.
BEGIN
-- Executable statements (SQL and PL/SQL)
[EXCEPTION
-- Exception handlers for errors that might occur in the BEGIN section]
END;
/
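For instance, a minimal runnable anonymous block following this structure (the EMPLOYEES table is assumed for illustration):
SQL
DECLARE
   v_count NUMBER;
BEGIN
   SELECT COUNT(*) INTO v_count FROM employees;
   DBMS_OUTPUT.PUT_LINE('Employee count: ' || v_count);
EXCEPTION
   WHEN OTHERS THEN
      DBMS_OUTPUT.PUT_LINE('Error: ' || SQLERRM);
END;
/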
2. Named Blocks:
Definition: Named blocks are PL/SQL blocks that are given a specific name and are
stored as database objects. These include:
o Procedures: Named PL/SQL blocks that perform a specific task. They can
accept input parameters and return output parameters.
o Functions: Named PL/SQL blocks that perform a specific calculation and
return a single value. They can accept input parameters.
o Packages: Schema objects that group logically related procedures, functions,
variables, constants, types, and cursors together.
o Triggers: Named PL/SQL blocks that automatically execute in response to
specific database events (e.g., INSERT, UPDATE, DELETE on a table).
o Types (Object Types and Collection Types): While primarily for defining
data structures, they can also include methods (which are essentially functions
or procedures within the type).
Structure: The structure of named blocks varies depending on the type of object
being created, but they generally involve a header (specifying the name and
parameters) and a body (containing the DECLARE, BEGIN, and EXCEPTION sections
similar to anonymous blocks, although the DECLARE section in packages can be in the
specification).
o Procedure Example:
SQL
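-- A minimal illustrative procedure (table and object names are hypothetical).
CREATE OR REPLACE PROCEDURE raise_salary (
   p_emp_id  IN NUMBER,
   p_percent IN NUMBER
) AS
BEGIN
   UPDATE employees
      SET salary = salary * (1 + p_percent / 100)
    WHERE employee_id = p_emp_id;
END raise_salary;
/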
o Function Example:
SQL
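-- A minimal illustrative function (table and object names are hypothetical).
CREATE OR REPLACE FUNCTION get_salary (
   p_emp_id IN NUMBER
) RETURN NUMBER AS
   v_salary NUMBER;
BEGIN
   SELECT salary
     INTO v_salary
     FROM employees
    WHERE employee_id = p_emp_id;
   RETURN v_salary;
END get_salary;
/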
Scope: Variables declared within named blocks are typically local to that specific
procedure, function, or the body of a package. Package variables declared in the
specification have a broader scope within the package.
Persistence: Named blocks are stored as schema objects in the Oracle database and
can be called and executed multiple times by different users or applications (subject to
privileges).
Use Cases:
o Implementing reusable business logic.
o Providing controlled access to database operations.
o Automating database tasks through triggers.
o Organizing related code into packages.
o Defining custom data types and their associated behavior.
Execution: A row-level trigger executes once for each row affected by the triggering
DML (Data Manipulation Language) statement (INSERT, UPDATE, DELETE).
Timing: They can be defined to fire BEFORE or AFTER the triggering operation on
each individual row. Some databases also support INSTEAD OF triggers on views,
which execute instead of the triggering action on the row.
Access to Data: Row-level triggers have access to the individual row being
processed. They can typically reference the old values (before the change) and the
new values (after the change) of the row's columns using special keywords (e.g., OLD
and NEW).
Purpose: Row-level triggers are commonly used for:
o Auditing changes to specific rows.
o Enforcing complex data integrity rules that depend on the values within a
row.
o Generating derived column values based on other columns in the same row.
o Preventing operations on specific rows based on their content.
o Propagating changes to related tables based on individual row modifications.
Performance: For statements that affect a large number of rows, row-level triggers
can have a performance impact as they execute for each row.
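A sketch of a row-level audit trigger in Oracle syntax (the SALARY_AUDIT table and column names are hypothetical):
SQL
CREATE OR REPLACE TRIGGER trg_emp_salary_audit
AFTER UPDATE OF salary ON employees
FOR EACH ROW   -- fires once per affected row
BEGIN
   INSERT INTO salary_audit (employee_id, old_salary, new_salary, changed_on)
   VALUES (:OLD.employee_id, :OLD.salary, :NEW.salary, SYSDATE);
END;
/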
Execution: A statement-level trigger executes only once for the entire triggering
DML statement, regardless of how many rows are affected (even if no rows are
affected).
Timing: They can also be defined to fire BEFORE or AFTER the entire triggering
statement has been executed.
Access to Data: Statement-level triggers typically do not have direct access to the
individual rows being modified. However, some database systems provide
mechanisms to access summary information about the statement's impact (e.g., the
number of rows affected). Some systems might offer "transition tables" that capture
the set of rows affected by the statement.
Purpose: Statement-level triggers are often used for:
o Enforcing security restrictions on the type of DML operations allowed on a
table during certain periods.
o Logging overall information about a DML statement, such as who performed
the action and when.
o Performing actions that should occur only once per transaction, regardless of
the number of rows changed.
o Implementing checks or actions based on the overall outcome of a statement.
Performance: Statement-level triggers generally have less performance overhead for
multi-row operations compared to row-level triggers because they execute only once.
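For comparison, a sketch of a statement-level trigger (the EMP_DML_LOG table is hypothetical); note the absence of FOR EACH ROW:
SQL
CREATE OR REPLACE TRIGGER trg_emp_dml_log
AFTER INSERT OR UPDATE OR DELETE ON employees
BEGIN
   -- Executes once per triggering statement, however many rows it touched.
   INSERT INTO emp_dml_log (changed_by, changed_on)
   VALUES (USER, SYSDATE);
END;
/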
Automatic Updates to Related Tables: When data in one table changes, triggers can
automatically update related tables, maintaining referential integrity and data
synchronization. For example, deleting a customer record could trigger the deletion of
their associated order records.
Generating Derived Values: Triggers can automatically calculate and populate
derived column values based on changes in other columns within the same or related
tables. For instance, a trigger could automatically update the
last_modified_timestamp whenever a row is updated.
Sending Notifications: Triggers can be used to send email notifications or trigger
other external processes when specific database events occur, such as a new user
registration or a critical error.
Partial Rollback: The primary purpose is to enable you to undo only a portion of the
changes made within a transaction. If an error occurs or you decide that a certain part
of the transaction should be undone, you can roll back to a previously established
savepoint without discarding all the work done in the transaction so far.
Error Handling: You can use savepoints to handle potential errors within a
transaction. If a specific operation fails, you can rollback to a savepoint before that
operation, take corrective action, and then retry the operation or proceed with the rest
of the transaction.
Complex Transactions: For long or complex transactions involving multiple steps,
savepoints provide a mechanism to manage the process and recover from failures in a
more controlled manner.
Nested Transactions (in some systems): While true nested transactions are not
universally supported in all SQL databases, savepoints can sometimes be used to
simulate a similar behavior within a single transaction.
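A sketch of the partial-rollback usage described above (table names are hypothetical):
SQL
BEGIN
   INSERT INTO orders (order_id, customer_id) VALUES (1001, 42);

   SAVEPOINT after_order;

   BEGIN
      INSERT INTO order_items (order_id, product_id, qty) VALUES (1001, 7, 3);
   EXCEPTION
      WHEN OTHERS THEN
         -- Undo only the failed item insert; the order insert above is kept.
         ROLLBACK TO SAVEPOINT after_order;
   END;

   COMMIT;
END;
/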
Think of a stored procedure as a mini-program or function that resides inside the database. It
encapsulates a specific set of SQL and PL/SQL statements designed to perform a particular
task or a series of related tasks.
Here's a breakdown of the key characteristics and components of stored procedures in
PL/SQL:
Key Characteristics:
Named Block: Every stored procedure has a unique name within its schema, which is
used to call and execute it.
Stored in the Database: The compiled code of the stored procedure is permanently
stored in the Oracle database. This means it doesn't need to be recompiled every time
it's executed.
Reusable: Once created, a stored procedure can be called multiple times by different
applications or users, promoting code reuse and reducing redundancy.
Encapsulation: Stored procedures encapsulate business logic and database
operations, hiding the underlying implementation details from the calling
applications.
Packages act as containers, making it easier to organize and manage database code.
Instead of having numerous standalone procedures and functions, related logic can be
grouped together within a package.
This modular approach improves code readability and makes it simpler to locate and
understand specific functionalities. For example, all procedures and functions related
to employee management (hiring, firing, updating salaries) can be placed within an
EMP_MGMT package.
Packages have two parts: the specification (spec) and the body.
o Specification: This is the public interface of the package. It declares the types,
variables, constants, exceptions, cursors, and subprograms that are accessible
from outside the package. It essentially defines what the package offers.
o Body: This contains the implementation details of the subprograms declared
in the specification, as well as any private (not declared in the spec) types,
variables, constants, exceptions, and subprograms that are only accessible
within the package body itself.
This separation of interface from implementation allows for information hiding. The
internal workings of the package are hidden from the users, who only interact with the
public elements defined in the specification. This means that the package body can be
modified without affecting the calling applications, as long as the specification
remains the same.
3. Reusability:
Subprograms and other elements defined within a package can be called and reused
by multiple applications, stored procedures, functions, and triggers.
This promotes code reuse, reduces development time, and ensures consistency across
different parts of the system.
1. Package Specification (or Spec): This is the public interface of the package. It
declares all the elements that are visible and accessible from outside the package. This
includes:
o Public types (e.g., user-defined records, tables).
o Public variables and constants.
o Public exceptions.
o Specifications (signatures) of public cursors (without their implementation).
o Specifications (headers) of public subprograms (procedures and functions),
including their parameter lists and return types (for functions).
The package specification essentially defines what the package offers to the outside world.
It's the contract between the package and the code that uses it.
2. Package Body: This contains the implementation details of the elements declared in
the package specification, as well as any private elements that are only accessible
within the package body itself. This includes:
o The actual PL/SQL code for the public subprograms declared in the
specification.
o The implementation of public cursors (the SELECT statement).
o Private types, variables, constants, and exceptions (not declared in the spec).
o Private subprograms (procedures and functions) that can only be called from
within the package body.
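A compact sketch of a specification and body (the EMP_MGMT name echoes the earlier example; the table and sequence are hypothetical):
SQL
CREATE OR REPLACE PACKAGE emp_mgmt AS
   -- Public interface: visible to callers.
   PROCEDURE hire (p_name IN VARCHAR2, p_salary IN NUMBER);
   FUNCTION  headcount RETURN NUMBER;
END emp_mgmt;
/

CREATE OR REPLACE PACKAGE BODY emp_mgmt AS
   PROCEDURE hire (p_name IN VARCHAR2, p_salary IN NUMBER) IS
   BEGIN
      INSERT INTO employees (employee_id, name, salary)
      VALUES (employees_seq.NEXTVAL, p_name, p_salary);
   END hire;

   FUNCTION headcount RETURN NUMBER IS
      v_count NUMBER;
   BEGIN
      SELECT COUNT(*) INTO v_count FROM employees;
      RETURN v_count;
   END headcount;
END emp_mgmt;
/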
Deadlock
Definition: A deadlock is a situation where two or more transactions are blocked indefinitely,
each waiting for the other to release a resource that it needs. This creates a circular
dependency where none of the transactions can proceed.
Analogy: Imagine two cars approaching a single-lane bridge from opposite directions. The
first car enters the bridge and then stops, waiting for the second car to move off the bridge
(which it hasn't entered yet). The second car arrives and stops, waiting for the first car to clear
the bridge. Neither car can proceed, resulting in a deadlock.
Conditions for Deadlock (Coffman Conditions): All four of the following conditions must
hold simultaneously for a deadlock to occur: mutual exclusion (resources are held in a
non-shareable mode), hold and wait (a transaction holds at least one resource while waiting
for another), no preemption (a resource cannot be forcibly taken from the transaction holding
it), and circular wait (a closed chain of transactions exists in which each transaction waits
for a resource held by the next).
Starvation
Definition: Starvation is a situation where one or more transactions are perpetually denied
access to the resources they need to proceed, even though the resources are not in a deadlock
state. This can happen if the scheduling or resource allocation policies unfairly favor other
transactions. The transaction is continuously postponed indefinitely.
Analogy: Imagine a shared printer in an office. A high-priority user constantly submits large
print jobs, and the scheduling algorithm always prioritizes high-priority tasks. A low-priority
user also needs to print a small but important document, but their request is repeatedly
delayed because the printer is always busy with the high-priority jobs. The low-priority user's
job is starving for the printer resource.
1. Active Phase:
This is the initial phase where the transaction begins its execution.
During this phase, the transaction performs its operations:
o Reading data from the database.
o Performing computations on the retrieved data.
o Modifying data in the database (issuing INSERT, UPDATE, DELETE
statements).
The transaction is considered "in progress" during this phase.
Changes made to the database during the active phase are usually kept in the
transaction's private workspace (e.g., buffers or logs) and are not yet permanently
reflected in the actual database.
2. Partially Committed Phase:
Once the transaction has executed all its operations successfully, it reaches the
partially committed phase.
At this point, all the changes have been made in the transaction's local workspace, and
the transaction signals that it intends to commit its changes.
The DBMS starts the process of preparing for a permanent commit. This might
involve:
o Writing log records to disk to ensure durability of the changes.
o Ensuring that all necessary conditions for a successful commit are met (e.g.,
all constraints are satisfied).
The transaction is still not officially committed at this stage, and there's still a
possibility of rollback if a failure occurs during the commit process.
3. Committed Phase:
If the preparation in the partially committed phase is successful, the transaction enters
the committed phase.
At this point, the changes made by the transaction are now permanent and are written
to the actual database.
The DBMS typically sends a confirmation to the application or user that the
transaction has been successfully committed.
Once a transaction is committed, its effects cannot be undone (except by executing
another compensating transaction).
All locks held by the transaction are typically released, making the affected data
available to other transactions.
4. Failed Phase:
A transaction enters the failed phase when an error or system failure prevents it from
continuing, either during normal execution or while attempting to commit. The DBMS must
then undo any changes that the transaction might have made to the database. This process
is called rollback.
5. Aborted Phase:
During the aborted phase, the DBMS restores the database to the state it was in before
the transaction began. This is typically done by using the information stored in the
transaction logs.
Once the rollback is complete, the transaction is considered aborted.
The DBMS might notify the application or user that the transaction has been aborted.
Any locks held by the transaction are released.
The transaction might be restarted later, depending on the nature of the failure and the
system's policies.
A simplified state transition diagram for a transaction would look something like this:
[Start] --> Active --> Partially Committed --> Committed --> [End]
              |                  |
              | failure          | failure
              v                  v
            Failed <-------------+
              |
              | rollback
              v
            Aborted --> [End]
Ans:
1. Atomicity
Example: Consider a bank transfer operation where money is moved from Account A to
Account B. This operation typically involves two steps:
1. Debit the transfer amount from Account A.
2. Credit the same amount to Account B.
For this transaction to be atomic, both steps must succeed. If, for instance, the system crashes
after debiting Account A but before crediting Account B, the atomicity property ensures that
the debit operation is also rolled back. As a result, the money is neither deducted from
Account A nor added to Account B, maintaining the integrity of the accounts.
2. Consistency
Explanation: Consistency ensures that a transaction brings the database from one valid state
to another valid state. A valid state is one that adheres to all the defined rules, constraints,
triggers, and integrity constraints of the database schema. The transaction must preserve these
rules. If a transaction attempts to violate any of these rules, the entire transaction is rolled
back, preventing the database from entering an inconsistent state.
Example: Suppose a database for a library has a rule that the number of borrowed books for
any member cannot exceed 5. Consider a transaction where a member tries to borrow a 6th
book. The consistency property will prevent this transaction from committing because it
violates the defined constraint. The database will remain in a consistent state where no
member has borrowed more than 5 books.
Another example is maintaining a balance in a bank account. A consistency rule might state
that the account balance cannot go below a certain minimum (e.g., 0 for a basic account). If a
withdrawal transaction would violate this rule, the consistency property would prevent the
transaction from completing, thus maintaining a consistent state of the account.
3. Isolation
Example: Consider two transactions, T1 and T2, running concurrently on a bank account
with a balance of $100.
T1: Reads the balance, deducts $20, and intends to update the balance to $80.
T2: Reads the balance, adds $50, and intends to update the balance to $150.
Without proper isolation, T2 might read the balance before T1 has committed its changes. If
T2 reads the initial balance of $100 and then T1 commits (setting the balance to $80), T2
would then add $50 to the original $100, resulting in a final balance of $150, which is
incorrect. The $20 deducted by T1 is lost.
Isolation mechanisms (like locking) ensure that T2 either waits until T1 completes or reads a
consistent snapshot of the data, preventing such inconsistencies. The level of isolation can
vary (e.g., read uncommitted, read committed, repeatable read, serializable), with stricter
levels providing more isolation but potentially reducing concurrency.
4. Durability
Explanation: Durability ensures that once a transaction is committed, the changes made to
the database are permanent and will survive even system failures such as power outages,
crashes, or disk failures. The committed data is typically stored in non-volatile storage and is
recoverable. This is usually achieved through techniques like transaction logs and backups.
Ans:
The lost update problem occurs when two or more transactions try to
update the same data concurrently, and one transaction's update is
overwritten by another, effectively losing the first update.
Here's an example:
Imagine a bank account with a balance of $100. Two users, Alice and Bob,
simultaneously initiate transactions:
Alice:
Reads the balance ($100), wants to deposit $50, and updates the balance
to $150.
Bob:
Reads the balance ($100), wants to withdraw $20, and updates the
balance to $80.
Without proper concurrency control, here's how the lost update problem
can occur:
1. Alice's operations:
Reads the balance (100)
Calculates the new balance (100 + 50 = 150)
Writes the new balance (150) to the database
2. Bob's operations:
Reads the balance (100)
Calculates the new balance (100 - 20 = 80)
Writes the new balance (80) to the database
3. The problem:
Bob's update overwrites Alice's update, resulting in the database showing
a balance of $80, even though Alice's deposit should have brought the
balance to $150. Alice's update is effectively lost.
Ans:
The temporary update problem, also known as a dirty read, occurs when a
transaction reads data that has been updated by another transaction, but
that update hasn't been committed, and the first transaction then uses that
uncommitted data, potentially leading to incorrect results if the second
transaction rolls back.
Here's a breakdown with an example:
Scenario: Imagine two transactions (A and B) accessing the same bank account, which
has a balance of $100.
o Transaction A: reads the balance ($100) and withdraws $50, leaving an
uncommitted balance of $50.
o Transaction B: reads this temporary, uncommitted balance of $50 (a dirty read)
and deposits $20, writing a new balance of $70.
o Transaction A: then fails before committing its withdrawal, and the database
rolls back its changes.
o Outcome: The final balance is $70, which is incorrect because A's withdrawal
never happened; the balance should be $100 + $20 = $120.
Ans:
Ans: The concept of "view equivalence" in transaction processing relates to the consistency
and correctness of concurrent transactions when their operations are interleaved. It
essentially asks: Does the final outcome of a set of concurrent transactions appear as if
they had executed in some serial order? If so, the concurrent execution is considered "view
equivalent" to that serial execution.
There are different types of equivalence, with view equivalence being one of the less
restrictive forms compared to conflict equivalence.
Formal Definition:
Two schedules (sequences of operations from a set of concurrent transactions) are said to be
view equivalent if the following three conditions hold:
1. Same Initial Read: For every data item Q, if transaction Ti reads the initial value of Q
in schedule S1, then transaction Ti must also read the initial value of Q in schedule S2.
2. Same Updates: For every data item Q, if transaction Ti performs the final write on Q
in schedule S1 (meaning no other transaction writes to Q after Ti in S1), then
transaction Ti must also perform the final write on Q in schedule S2.
3. Same Final Reads: For every data item Q, if transaction Ti reads the value of Q
written by transaction Tj (where Tj is the final writer of Q before Ti reads it) in
schedule S1, then transaction Ti must also read the value of Q written by the same
transaction Tj in schedule S2. If Ti reads the initial value of Q in S1, it must also do
so in S2 (covered by condition 1).
Ans: The Thomas Write Rule is an optimization to the basic Timestamp Ordering (TO)
concurrency control protocol in database management systems. Its primary goal is to improve
concurrency by allowing certain "outdated" write operations to be ignored, thereby reducing
the number of transaction rollbacks.
When a transaction Ti tries to perform an operation on data item X, the basic TO protocol
applies the following rules (TS(Ti) is Ti's timestamp; read_TS(X) and write_TS(X) record the
timestamps of the youngest transactions that have read and written X):
1. Read(X): if TS(Ti) < write_TS(X), Ti is rolled back; otherwise the read is allowed and
read_TS(X) is set to max(read_TS(X), TS(Ti)).
2. Write(X): if TS(Ti) < read_TS(X), Ti is rolled back.
3. Write(X): if TS(Ti) < write_TS(X), Ti is rolled back; otherwise the write is allowed and
write_TS(X) is set to TS(Ti).
The Thomas Write Rule modifies the third condition for the write operation: if
TS(Ti) < write_TS(X), the write by Ti is obsolete (a younger transaction has already written
X), so instead of rolling Ti back, the outdated write is simply ignored and Ti continues.
Definition:
Two schedules, S1 and S2, are said to be conflict equivalent if and only if all of the
following conditions are met:
1. Same Transactions: Both schedules involve the same set of transactions.
2. Same Operations within Transactions: The order of operations within each
individual transaction is the same in both schedules.
3. Same Ordering of Conflicting Operations: For every pair of conflicting operations
belonging to two different transactions, if one operation appears before the other in
S1, then the same order must be maintained in S2.
Two operations are considered to be in conflict if they belong to different transactions,
access the same data item, and at least one of them is a write. Conflicting pairs therefore
fall into three categories:
Read-Write Conflict: One transaction reads a data item that another transaction
writes to (in either order).
Write-Read Conflict: One transaction writes to a data item that another transaction
reads (in either order).
Write-Write Conflict: Two different transactions write to the same data item (in
either order).
Non-Conflicting Operations:
Schedule S1:
1. T1: Read(A)
2. T2: Write(A)
3. T1: Write(A)
Schedule S2:
1. T1: Read(A)
2. T1: Write(A)
3. T2: Write(A)
T1: Read(A) and T2: Write(A) are conflicting. In S1, Read(A) (T1) comes before
Write(A) (T2). In S2, Read(A) (T1) comes before Write(A) (T2). (Order preserved)
T2: Write(A) and T1: Write(A) are conflicting. In S1, Write(A) (T2) comes before
Write(A) (T1). In S2, Write(A) (T1) comes before Write(A) (T2). (Order not
preserved)
Since the order of the second conflicting pair is different in S1 and S2, these two schedules
are not conflict equivalent.
90. Describe the conflict serializable schedule and view serializable schedule with
example.
Ans:
1. Identify conflicting operations: Find all pairs of operations from different transactions that
access the same data item, with at least one being a write.
2. Create a precedence graph:
o For each transaction in the schedule, create a node.
o For every pair of conflicting operations, if an operation of transaction Ti precedes a
conflicting operation of transaction Tj in the schedule, draw a directed edge from Ti
to Tj.
3. Check for cycles: If the precedence graph contains a cycle, the schedule is not conflict
serializable. If there are no cycles, the schedule is conflict serializable. Any topological sort of
the graph represents a conflict-equivalent serial schedule.
Example:
Schedule S1:
Time   T1         T2
1      Read(A)
2                 Read(B)
3      Write(A)
4                 Write(B)
5      Read(B)
Conflicting Operations: T2: Write(B) (time 4) and T1: Read(B) (time 5) access the same
data item B and one of them is a write, so they conflict.
Precedence Graph:
Nodes: T1, T2
Edge: T2 -> T1 (because T2: Write(B) conflicts with and precedes T1: Read(B))
Since there is no cycle in the graph (T2 -> T1 is the only edge), the schedule S1 is conflict
serializable. A conflict-equivalent serial schedule would be T2 followed by T1. Let's see:
Time   T2         T1
1      Read(B)
2      Write(B)
3                 Read(A)
4                 Write(A)
5                 Read(B)
1. Same Initial Reads: If a transaction Ti reads the initial value of a data item A in schedule S1,
then Ti must also read the initial value of A in schedule S2.
2. Same Final Writes: For each data item A, if transaction Ti performs the final write on A in
schedule S1 (no other transaction writes to A after Ti), then Ti must also perform the final
write on A in schedule S2.
3. Same Updated Reads: If a transaction Ti reads a value of data item A that was written by
transaction Tj in schedule S1, then Ti must also read the value of A written by the same
transaction Tj in schedule S2.
Checking for view serializability is generally more complex than checking for conflict
serializability (it's NP-complete). A common approach involves trying to find a serial
schedule that is view equivalent to the given schedule by examining the read-from
relationships and final writes.
Important Relationship: Every conflict serializable schedule is also view serializable, but
the reverse is not always true. View serializability allows for some schedules that are not
conflict serializable, often involving "blind writes" (writing to a data item without having
read it first).
Example:
Consider two transactions, T1 and T2, and a data item A with an initial value of 10.
Schedule S2:
Time   T1              T2
1      Write(A, 20)
2                      Read(A)
3                      Write(A, 30)
Serial Schedule <T1, T2>:
Time   T1              T2
1      Write(A, 20)
2                      Read(A)
3                      Write(A, 30)
1. Same Initial Reads: Neither T1 nor T2 reads the initial value of A in S2 or the serial schedule.
(Condition satisfied vacuously).
2. Same Final Writes: T2 performs the final write on A (value 30) in both S2 and the serial
schedule. (Condition satisfied).
3. Same Updated Reads: In S2, T2 reads the value of A written by T1 (20). In the serial
schedule, T2 also reads the value of A written by T1 (20). (Condition satisfied).
91. How do you check conflict serializability by precedence graph? Explain with
example.
92. Explain the strict 2PL and Rigorous 2PL.
1. Growing Phase: The transaction acquires locks on the data items it needs. No locks are
released during this phase.
2. Shrinking Phase: The transaction releases the locks it holds. No new locks can be acquired
during this phase.
The key addition in Strict 2PL is a constraint on when exclusive (write) locks can be
released:
All exclusive locks held by a transaction are not released until the transaction either
commits or aborts.
Shared (read) locks, however, can be released earlier, typically after the last read operation
on the data item.
It's strict because it prevents other transactions from reading or writing data that has been
written by a transaction that has not yet committed. This avoids cascading aborts. A
cascading abort occurs when a transaction reads data written by another transaction that later
aborts. If the first transaction has already made changes based on the uncommitted data, it
might also need to abort. Strict 2PL eliminates this by holding write locks until the outcome
(commit or abort) of the writing transaction is known.
Guarantees Conflict Serializability: Like basic 2PL, it ensures that the resulting schedule is
conflict serializable.
Avoids Cascading Aborts: By holding exclusive locks until commit or abort, it prevents
transactions from reading uncommitted data that might later be rolled back. This simplifies
recovery.
Recoverable Schedules: Schedules produced by Strict 2PL are recoverable, meaning that if a
transaction commits, all transactions that read values written by it will also commit.
Can Reduce Concurrency: Holding exclusive locks for a longer duration can block other
transactions from accessing the data, potentially reducing the degree of concurrency.
Deadlock Possible: Like basic 2PL, Strict 2PL does not prevent deadlocks. Transactions can
still get into a situation where each is waiting for a resource held by the other.
All locks (both shared and exclusive) held by a transaction are not released until the
transaction either commits or aborts.
The crucial difference is that in Rigorous 2PL, even shared (read) locks are held until the
transaction commits or aborts. In Strict 2PL, shared locks could be released after the
transaction has finished reading the data item.
Why is it "rigorous"?
It's rigorous because it enforces the strictest locking discipline within the 2PL framework. A
transaction essentially holds all the resources it has accessed until its termination.
1. Clients:
Clients are typically user-facing applications or processes that initiate requests for
data or services from the database.
They reside on user workstations or other computing devices connected to the
network.
The primary functions of a client in a DDBMS include:
o User Interface: Providing a way for users to interact with the database (e.g.,
through forms, query interfaces).
o Request Generation: Formulating queries or requests for data manipulation
based on user input.
o Communication: Establishing a connection with one or more database
servers and transmitting requests.
o Result Processing: Receiving and formatting the data returned by the server
for presentation to the user.
o Local Processing (optional): Performing some data processing or validation
on the client side to reduce the load on the server and improve user
experience.
2. Servers:
Servers are responsible for managing and providing access to the distributed database.
They are typically more powerful computing systems with the DDBMS software
installed.
In a client-server DDBMS, there can be one or multiple servers, depending on how
the database is distributed (e.g., partitioned, replicated).
The primary functions of a server in a DDBMS include:
o Data Storage and Management: Storing and organizing the portion of the
distributed database that resides at its site.
o Query Processing: Receiving queries from clients, optimizing them for
distributed execution, and coordinating the retrieval or manipulation of data
across the relevant database sites.
o Transaction Management: Ensuring the ACID properties (Atomicity,
Consistency, Isolation, Durability) for transactions that may involve data at
multiple sites. This includes concurrency control and commit protocols.
Core Characteristics:
Symmetry: All nodes in the network have equal capabilities and responsibilities.
There is no central coordinating server.
Dual Role: Each peer functions as both a client (requesting data or services) and a
server (providing data or services from its local database).
Resource Sharing: Peers share their local database resources (data, processing
power, storage) with other peers in the network.
Coordination: Peers coordinate their activities, such as query processing and
transaction management, among themselves without relying on a central authority.
Autonomy: Each peer typically maintains a degree of autonomy over its local
database, including its design, data storage, and access control.
Global Conceptual Schema: Despite the distributed and autonomous nature, there's
often a global conceptual schema that provides a unified logical view of the entire
distributed database. Each peer has a local conceptual schema that describes its part of
the global schema.
How it Works:
1. Query Processing: When a user at a peer issues a query that requires data from
multiple sites, the local DDBMS on that peer needs to:
o Identify which other peers hold the necessary data (this might involve a
distributed catalog or discovery mechanism).
o Formulate sub-queries to be sent to those peers.
o Coordinate with the other peers to execute the sub-queries.
o Receive the results from the other peers.
o Integrate the results to answer the original query.
Data Replication: DDBMS often employ data replication, where copies of data are
stored at multiple sites. If one site fails, data can still be accessed from other sites,
ensuring higher availability and business continuity.
Fault Tolerance: The system can continue to function even if some of its
components (servers or network links) fail. The workload can be shifted to other
operational sites.
Reduced Single Point of Failure: Unlike centralized systems where the entire
database becomes unavailable if the central server fails, a DDBMS distributes the
risk.
2. Enhanced Scalability:
Horizontal Scaling: DDBMS can easily scale horizontally by adding new database
servers (nodes) to the distributed system as data volume and user load increase. This
is often more cost-effective than vertically scaling a single, powerful server.
Modular Growth: New sites or units can be added to the network without disrupting
the operations of existing sites.
96. Draw and analyse the diagram of global relation of DDBMS with explanation.
Ans:
Explanation:
In a DDBMS, data is distributed across multiple physical locations (sites), but it is logically perceived
as a single database. This illusion is created using a global schema.
🧩 Key Components:
✅ Analysis:
Feature: Description
Transparency: Users see a single database; fragmentation and distribution are hidden.
Data Locality: Queries can be optimized to run on the site where the data is located.
Improved Performance: By processing fragments locally, network traffic and response time
can be reduced.
Availability: If one site fails, others may continue to operate (depending on replication).
Global Query Processing: Queries must be translated from the global schema to local
fragments.
Data Autonomy: Local sites can operate semi-independently, which increases availability.
97. Draw and infer the diagram of reference architecture of DDBMS with
explanation.
🧠 Explanation of Each Layer
🔷 1. External Schema / Views Layer:
Ans: The global system catalog (also known as the global data dictionary) is a fundamental
component of a Distributed Database Management System (DDBMS). It is a centralized
(logically, though it can be physically distributed or replicated) repository that contains
comprehensive metadata about the entire distributed database system. Think of it as the
"blueprint" or "directory" for all the data and its management across the various database
sites.
Here's a breakdown of what the global system catalog is, its contents, and its importance:
What is the Global System Catalog?
It's a collection of metadata that describes the structure, location, and characteristics
of data within the DDBMS.
It provides a unified view of the distributed database, abstracting away the physical
distribution details from users and applications.
The global system catalog is accessed and utilized by the DDBMS to manage various
aspects of the distributed system, such as query processing, transaction management,
and data access.
1. Global Schema:
o Definitions of all global relations (tables, views, etc.) as they are perceived by
users.
o Attributes (columns) of these global relations, their data types, and constraints.
o Relationships between global relations.
2. Fragmentation Schema:
o How global relations are fragmented (horizontally, vertically, or a
combination).
o Definitions of each fragment, including the selection or projection conditions
used for fragmentation.
3. Allocation Schema:
o The physical location(s) of each fragment or replica (the database site(s)
where they are stored).
4. Replication Schema:
o Information about which fragments or relations are replicated.
o The location of each replica.
Ans: The database design strategy typically involves a series of interconnected phases, each
building upon the previous one. Here's a breakdown of these phases:
1. Requirements Analysis:
Goal: Understand the needs of the users and the system that will use the database.
Identify the data to be stored, the operations to be performed, and the constraints that
apply.
Activities:
o Gather information from stakeholders (users, developers, business analysts).
o Analyze existing systems and documentation.
o Conduct interviews, surveys, and workshops.
o Define the scope of the database project.
o Identify user needs and business rules related to the data.
o Develop use cases or user stories that interact with the data.
Deliverables:
o Requirements document (functional and non-functional).
o User stories or use cases.
o Initial list of entities and their high-level descriptions.
2. Conceptual Design:
Goal: Create a high-level, logical model of the data that is independent of any
specific DBMS or physical implementation details. Focus on what data needs to be
stored and the relationships between different data elements.
Activities:
o Identify the main entities (objects or concepts) in the system.
o Determine the attributes (properties or characteristics) of each entity.
o Define the relationships between entities (e.g., one-to-many, many-to-many).
o Develop a conceptual data model, often using an Entity-Relationship (ER)
diagram or UML class diagram.
Deliverables:
o Conceptual data model (ER diagram or UML class diagram).
o Data dictionary defining entities, attributes, and relationships.
3. Logical Design:
Goal: Translate the conceptual data model into a logical schema that can be
implemented in a specific type of DBMS (e.g., relational, NoSQL). Focus on how the
data will be organized in the database.
Activities (for Relational Databases):
o Map entities to tables.
o Map attributes to columns, specifying data types and constraints (e.g., primary
keys, foreign keys, nullability).
o Resolve many-to-many relationships using junction tables.
o Normalize the tables to reduce data redundancy and improve data integrity
(following normal forms like 1NF, 2NF, 3NF, etc.).
o Define views to provide simplified or customized perspectives of the data.
Activities (for NoSQL Databases):
o Design document structures (for document databases).
o Design key-value pairs or column families (for other NoSQL types).
o Consider data access patterns and optimize the schema for those patterns.
Deliverables:
o Logical schema (set of table definitions with columns, data types, and
constraints for relational; or schema definitions appropriate for the chosen
NoSQL type).
o Updated data dictionary.
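For a relational target, the mapping described above might produce DDL along these lines (entity and column names are hypothetical):
SQL
CREATE TABLE customers (
   customer_id  NUMBER        PRIMARY KEY,
   name         VARCHAR2(100) NOT NULL,
   email        VARCHAR2(255) UNIQUE
);

CREATE TABLE orders (
   order_id     NUMBER       PRIMARY KEY,
   customer_id  NUMBER       NOT NULL REFERENCES customers (customer_id),
   order_date   DATE         DEFAULT SYSDATE,
   total_amount NUMBER(10,2) CHECK (total_amount >= 0)
);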
4. Physical Design:
Goal: Decide how the logical schema will be physically implemented in a specific
DBMS, considering performance, storage, and security requirements. Focus on how
the data will be stored and accessed physically.
Activities:
o Select a specific DBMS product (e.g., Oracle, MySQL, SQL Server,
MongoDB, Cassandra).
o Choose storage structures (e.g., tablespaces, files).
o Design indexes to optimize query performance.
o Determine data partitioning strategies (for large databases or DDBMS).
o Consider data compression and encryption.
o Plan for database security (user roles, permissions).
o Estimate storage requirements.
o Fine-tune database configuration parameters.
Deliverables:
o Physical database schema (DDL scripts for the chosen DBMS).
o Index definitions.
o Storage allocation plan.
o Security plan.
o Backup and recovery plan.
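As a small illustration of the index-design activity above, indexes are typically created to support the most frequent predicates (names are hypothetical):
SQL
-- Speeds up lookups of a customer's orders and date-range reports.
CREATE INDEX orders_customer_idx ON orders (customer_id);
CREATE INDEX orders_date_idx     ON orders (order_date);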
5. Implementation and Testing:
Goal: Create the actual database based on the physical design and populate it with
data. Verify that the database meets the requirements and performs as expected.
Activities:
o Create the database schema using DDL scripts.
o Implement constraints, triggers, and stored procedures.
o Develop data loading and migration scripts.
o Populate the database with initial data.
o Conduct various types of testing (unit testing of database components,
integration testing with applications, performance testing, user acceptance
testing).
Deliverables:
o Implemented database.
o Populated data.
o Testing results and reports.
6. Deployment and Maintenance:
Goal: Deploy the database into the production environment and ensure its ongoing
operation, performance, and security.
Activities:
o Deploy the database to the production servers.
o Configure access and security settings.
o Monitor database performance and resource utilization.
o Perform regular backups and implement recovery procedures.
o Apply patches and upgrades to the DBMS.
o Tune database parameters for optimal performance.
o Address user feedback and implement necessary changes or enhancements
(which might trigger a new iteration of the design process).
Deliverables:
o Deployed and operational database system.
o Monitoring reports.
o Backup logs.
o Maintenance records.
Diagram (Mermaid syntax):
graph TD
A[1. Requirements Analysis] --> B(2. Conceptual Design);
B --> C{3. Logical Design};
C --> D[[4. Physical Design]];
D --> E[5. Implementation & Testing];
E --> F(6. Deployment & Maintenance);
F --> A;
Ans: Fragmentation transparency means that users can query global relations without
needing to know:
How a global relation is divided into fragments. This includes the type of
fragmentation used (horizontal, vertical, or hybrid) and the criteria for the division.
Where each fragment is physically located. Users don't need to specify the site
where a particular piece of data resides.
Think of it like this: Imagine a large library catalog (the global relation). Instead of having
one massive physical catalog, the library might divide it into several smaller catalogs based
on the first letter of the author's last name (horizontal fragmentation) and place these smaller
catalogs in different sections of the library (different sites). With fragmentation transparency,
the library's search system would allow you to search for any book as if there were still one
giant catalog, and the system would automatically figure out which of the smaller catalogs to
look in and where they are located.
The DDBMS achieves fragmentation transparency through the use of the global system
catalog (or global data dictionary). This catalog stores metadata about:
Ans: The primary purpose of using PL/SQL (Procedural Language/SQL) is to extend the
capabilities of standard SQL within the Oracle database environment. It allows developers
to create more powerful, efficient, and maintainable database applications by embedding
procedural logic within SQL statements.
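For instance, a small sketch (assuming an employees table) of procedural logic wrapped around an SQL statement:
SQL
DECLARE
    v_count NUMBER;
BEGIN
    -- SQL embedded inside procedural code
    SELECT COUNT(*) INTO v_count FROM employees;

    IF v_count = 0 THEN
        DBMS_OUTPUT.PUT_LINE('No employees found.');
    ELSE
        DBMS_OUTPUT.PUT_LINE('Employee count: ' || v_count);
    END IF;
END;
/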
Ans: Database triggers are powerful tools in a Database Management System (DBMS) that
allow you to define specific actions that are automatically executed in response to certain
events occurring on a particular table or view. These events are typically Data Manipulation
Language (DML) operations like INSERT, UPDATE, or DELETE.
Complex Validation: Triggers can implement validation rules that go beyond the
constraints defined at the table level (like NOT NULL, UNIQUE, FOREIGN KEY, CHECK).
For example, you can ensure that when a new employee is added, their salary falls
within a specific range based on their department.
Maintaining Referential Integrity (Beyond Declarative Constraints): While
foreign keys enforce basic referential integrity, triggers can handle more complex
scenarios. For instance, when a parent record is deleted, a trigger can automatically
update related child records with a default value or perform a custom action instead of
just cascading the delete or preventing it.
Preventing Invalid Transactions: Triggers can check conditions before or after an
operation and prevent the transaction from proceeding if certain criteria are not met.
For example, a trigger could prevent updates to an order status if the order has already
been shipped.
Automatic Logging: Triggers can automatically record who made changes, what
changes were made (old and new values), and when they were made to specific tables.
This creates a detailed audit trail for tracking data modifications and identifying
potential issues or unauthorized activities.
Maintaining History Tables: Instead of directly modifying data, triggers can move
the old data to a history or archive table before an UPDATE or DELETE operation,
preserving a historical record of changes over time.
Implementing Business Rules:
Triggers allow you to embed business rules directly within the database schema. For
example, when a new order is placed and the customer's total spending exceeds a
certain threshold, a trigger could automatically upgrade their membership level.
Ans:
%FOUND:
This attribute returns TRUE if the most recent FETCH returned a row, and FALSE if it did not.
It is NULL after the cursor is opened but before the first FETCH.
%NOTFOUND:
This attribute is the logical opposite of %FOUND: it returns TRUE when the most recent FETCH
did not return a row, and FALSE when it did.
It is commonly used as the exit condition of a cursor loop (EXIT WHEN cursor_name%NOTFOUND).
%ROWCOUNT:
This attribute keeps a running count of the number of rows that have been
successfully fetched from the cursor since it was opened.
The count increments with each successful FETCH.
It's useful if you need to know how many rows were processed by your cursor loop.
The value is 0 immediately after the cursor is opened and before the first FETCH.
%ISOPEN:
This attribute indicates whether the cursor is currently in the open state.
It returns TRUE if the cursor has been explicitly OPENed and has not yet been CLOSEd.
It returns FALSE if the cursor is closed or has not been opened.
It's good practice to check if a cursor is open before attempting to fetch from it or
close it to avoid errors.
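A small sketch (reusing the employees table assumed elsewhere in these notes) that shows %ISOPEN, %NOTFOUND, and %ROWCOUNT together:
SQL
DECLARE
    CURSOR emp_cur IS SELECT employee_name FROM employees;
    v_name employees.employee_name%TYPE;
BEGIN
    IF NOT emp_cur%ISOPEN THEN
        OPEN emp_cur;                    -- open only if not already open
    END IF;

    LOOP
        FETCH emp_cur INTO v_name;
        EXIT WHEN emp_cur%NOTFOUND;      -- TRUE when the last FETCH returned no row
    END LOOP;

    DBMS_OUTPUT.PUT_LINE('Rows fetched: ' || emp_cur%ROWCOUNT);
    CLOSE emp_cur;
END;
/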
Ans: The basic structure followed in PL/SQL is organized into blocks. These blocks are the
fundamental building units of any PL/SQL program. A PL/SQL block can be either
anonymous (not named) or named (as in procedures, functions, packages, and triggers).
SQL
[DECLARE]
-- Declaration section (optional)
-- Declare variables, constants, cursors, types, exceptions, etc.
BEGIN
-- Executable section (mandatory)
-- PL/SQL statements and SQL statements
-- This is where the main logic of the program resides.
[EXCEPTION]
-- Exception-handling section (optional)
-- Code to handle errors that occur in the executable section.
END;
/
Example of Declarations:
SQL
DECLARE
v_employee_id NUMBER;
c_tax_rate CONSTANT NUMBER := 0.15;
TYPE emp_record IS RECORD (
employee_name VARCHAR2(100),
salary NUMBER
);
emp_rec emp_record;
CURSOR emp_cur IS
SELECT employee_name, salary
FROM employees
WHERE department_id = 20;
invalid_salary EXCEPTION;
BEGIN
NULL; -- an executable section is required to make this block runnable
END;
/
Example of the Executable Section:
SQL
BEGIN
SELECT COUNT(*) INTO v_employee_id
FROM employees
WHERE department_id = 10;
OPEN emp_cur;
LOOP
FETCH emp_cur INTO emp_rec;
EXIT WHEN emp_cur%NOTFOUND;
DBMS_OUTPUT.PUT_LINE(emp_rec.employee_name || ' earns ' ||
emp_rec.salary);
END LOOP;
CLOSE emp_cur;
END;
/
Ans:
106. Explain how query optimization helps in reducing query cost.
Ans: Query optimization is the process a database management system (DBMS) uses to select
the most efficient way to execute a SQL query. This involves analyzing different possible
execution plans and choosing the one estimated to have the lowest cost. The "cost" of a query
typically represents the resources consumed, such as:
Disk I/O: The number of times data needs to be read from or written to disk. This is
often the most significant factor in query cost.
CPU Usage: The amount of processing power required to perform operations like
filtering, sorting, and joining data.
Memory Usage: The amount of RAM needed for various operations during query
execution.
Network Usage: In distributed database systems, the cost of transferring data
between different nodes.
By reducing these resource consumptions, query optimization directly lowers the overall
"cost" of running a query in several ways:
Using Indexes: Query optimization can identify and utilize indexes on relevant
columns to quickly locate specific rows without scanning the entire table. This
drastically reduces disk I/O, especially for queries with WHERE clauses.
Selecting the Right Index: If multiple indexes are available, the optimizer chooses
the most selective one that best matches the query's predicates.
Avoiding Table Scans: When appropriate indexes are present and used, the optimizer
can avoid full table scans, which are very expensive in terms of disk I/O for large
tables.
Choosing the Best Join Algorithm: Different join algorithms (e.g., nested loop join,
hash join, merge join) have varying costs depending on the size of the tables being
joined, the presence of indexes, and the join conditions. The optimizer selects the
most suitable algorithm.
Determining the Optimal Join Order: When joining multiple tables, the order in
which they are joined can significantly impact performance. The optimizer determines
the join order that minimizes the number of intermediate rows and the overall cost.
Rewriting Queries: The optimizer can automatically rewrite the query in a more
efficient form without changing the result. For example, it might flatten subqueries or
transform OR conditions into UNION ALL operations in certain scenarios.
Pushing Down Operations: The optimizer tries to push down filtering ( WHERE
clauses) and aggregation ( GROUP BY, HAVING) operations as early as possible in the
execution plan. This reduces the amount of data that needs to be processed in later
stages.
Selecting Only Necessary Columns: The optimizer encourages selecting only the
columns required by the query (avoiding SELECT *). This reduces the amount of data
that needs to be read from disk and transferred across the network.
Filtering Early: As mentioned earlier, applying filters as early as possible minimizes
the number of rows that need to be processed and moved through the execution plan.
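In Oracle, for example, the plan chosen by the optimizer (and its estimated cost) can be inspected with EXPLAIN PLAN; a sketch assuming an employees table with an index on department_id:
SQL
-- Ask the optimizer for its plan without running the query
EXPLAIN PLAN FOR
    SELECT employee_name
    FROM employees
    WHERE department_id = 10;

-- Display the chosen execution plan and its estimated cost
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);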
107. Describe different cost components of query execution.
Ans:
The execution of a database query involves several cost components that the query optimizer
considers when determining the most efficient execution plan. These components represent
the resources consumed during the query's lifecycle. Here's a breakdown of the different cost
components:
1. Disk I/O Cost:
This is often the most significant cost factor, especially for large databases.
It represents the number of times data blocks need to be read from or written to disk
to satisfy the query.
Operations like full table scans, index reads, and accessing data files contribute to this
cost.
The optimizer aims to minimize disk I/O by using indexes effectively, choosing
appropriate join algorithms, and filtering data early.
2. CPU Cost:
This refers to the processing power required to perform operations like filtering, sorting,
joining, and evaluating expressions on the data once it is in memory.
It is usually smaller than the disk I/O cost, but becomes significant for computation-heavy
queries (complex expressions, aggregations, large sorts).
3. Memory Cost:
This refers to the amount of RAM the DBMS needs to allocate during query
execution for various purposes.
Operations like sorting, hash joins, and temporary storage of intermediate results
consume memory.
Insufficient memory can lead to spilling data to disk, increasing I/O cost and slowing
down execution. The optimizer considers available memory when choosing execution
plans.
4. Network Cost:
In a distributed database system, where data is spread across multiple nodes, network
cost becomes a significant factor.
It represents the cost of transferring data between different database nodes to execute
a distributed query.
The optimizer in a DDBMS aims to minimize network traffic by trying to process
data locally as much as possible and by choosing efficient data shipping strategies.
5. Parsing and Optimization Cost:
Before the actual execution, the DBMS needs to parse the SQL query to understand
its syntax and semantics. This parsing process consumes some CPU resources.
The query optimizer itself takes time and resources to analyze different possible
execution plans and choose the one with the lowest estimated cost. This optimization
process also has a cost associated with it.
Ans: The selection operation in databases involves retrieving specific rows from one or more
tables based on given conditions. Various algorithms exist to perform this operation, each
with its own strengths and weaknesses depending on factors like data organization, indexing,
and the nature of the selection criteria. Here's a comparison and contrast of common selection
algorithms:
1. Table Scan (Linear Search):
Description: This is the most basic algorithm. It involves sequentially reading every
row in the table and checking if it satisfies the selection condition.
Pros:
o Applicable to any table, regardless of storage organization or the existence of
indexes.
o Simple to implement.
Cons:
o Very inefficient for large tables, as it requires reading the entire table even if
only a few rows match the condition.
o High I/O cost for large tables.
Best Use Cases:
o Small tables.
o When the selection condition is likely to match a large percentage of rows.
o When no suitable index exists for the selection condition.
2. Index Scan:
Description: This algorithm uses an index (like a B-tree or hash index) built on one
or more columns involved in the selection condition to directly locate the matching
rows.
Pros:
o Significantly faster than a table scan for selective queries (where only a small
number of rows match).
o Reduces disk I/O by only accessing the necessary data blocks.
Cons:
o Requires an index to be present on the relevant column(s).
o The efficiency depends on the type of index and the selectivity of the query.
Non-selective queries might still benefit from a table scan in some cases.
o For some index types (e.g., secondary indexes), retrieving the actual data rows
might involve additional I/O operations (fetching from the base table).
Types of Index Scans:
o Primary Index Scan (Clustered Index Scan): If the table is physically sorted
based on the indexed column(s), the data retrieval is very efficient as the index
directly points to contiguous blocks of data.
o Secondary Index Scan (Non-clustered Index Scan): The index contains
pointers to the actual data rows, which might be scattered across the disk,
leading to more I/O operations.
Best Use Cases:
o Queries with highly selective WHERE clauses on indexed columns.
o Equality comparisons or range queries on indexed columns.
Other index types:
Bitmap Indexes: Efficient for low-cardinality columns (columns with a small number
of distinct values) and for queries with complex WHERE clauses involving multiple
conditions combined with AND and OR operators.
Full-Text Indexes: Optimized for searching text data using keywords and phrases.
| Feature | Table Scan | Index Scan | Binary Search | Hash-Based Selection |
|---|---|---|---|---|
| Data Order Req. | No | Index required | Sorted data | Hash index required |
| Best for | Small tables, non-selective queries, no index | Selective queries on indexed columns | Equality on sorted data | Equality on hashed columns |
| Efficiency (Avg) | O(N) | O(log N) or O(1) + data retrieval | O(log B) | O(1) + data retrieval |
| Range Queries | Applicable | Efficient (for tree-based indexes) | Less efficient | Not efficient |
| Equality Queries | Applicable | Very Efficient | Efficient | Very Efficient |
| Complexity | Simple | Moderate | Moderate | Moderate |
| Overhead | Minimal | Index maintenance | Sorting overhead | Index maintenance, collision handling |
Ans: The sorting process in query execution is a fundamental operation used to arrange the
rows of a result set in a specific order based on the values of one or more columns. This is
typically requested by the ORDER BY clause in a SQL query. Here's a breakdown of the
process:
1. Query Parsing and Sort Identification:
The query processor first parses the SQL query and identifies the presence of the
ORDER BY clause.
It then determines the column(s) specified for sorting and the desired order (ascending
ASC - default, or descending DESC).
2. Data Retrieval:
Before sorting can occur, the database system needs to retrieve the data that will be
part of the final result set. This involves executing the FROM, WHERE, GROUP BY, and
HAVING clauses of the query to obtain the intermediate result.
3. Sorting Operation:
Once the data to be sorted is available, the database system employs a sorting
algorithm. The specific algorithm used can vary depending on factors like:
o Size of the data to be sorted: For small datasets, in-memory sorting
algorithms like quicksort or mergesort might be efficient.
o Available memory: If the data exceeds available memory, external sorting
algorithms are used, which involve sorting chunks of data on disk and then
merging them.
o Presence of indexes: If an index exists on the sorting column(s) and the order
of the index matches the requested sort order, the database might be able to
retrieve the data directly from the index in the desired order, avoiding a
separate sorting step. This is a significant optimization.
o Database system implementation: Different database systems might have
their own optimized sorting algorithms.
In-Memory Sorting:
o Quicksort: Generally fast for average cases but can have worst-case O(n^2)
performance.
o Mergesort: Stable sort with consistent O(n log n) performance, suitable for
larger datasets.
o Heapsort: Another O(n log n) algorithm, often used when only a limited
number of top/bottom results are needed.
External Sorting: Used when data doesn't fit in memory. It typically involves:
o Sorting Runs: Dividing the data into smaller chunks that can fit in memory,
sorting each chunk, and writing them to temporary storage (usually disk).
o Merging Runs: Merging the sorted runs iteratively until a single, fully sorted
result set is obtained.
4. Returning the Sorted Result:
After the sorting process is complete, the database system returns the rows of the
result set in the specified order to the user or application.
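As a small illustration (hypothetical index name, assuming the employees table used earlier), an index that matches the ORDER BY column may let the optimizer skip the separate sort step:
SQL
-- If this index exists and matches the requested order, the optimizer may be
-- able to read the rows via the index and avoid an explicit sort
CREATE INDEX idx_emp_salary ON employees (salary);

SELECT employee_name, salary
FROM employees
ORDER BY salary;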
111. Explain external merge sort and its role in handling large datasets.
Ans: External Merge Sort is a sorting algorithm designed to handle datasets that are too large
to fit entirely into the main memory (RAM) of a computer. It leverages external storage, such
as hard disks or SSDs, to perform the sorting process. The core idea is to break down the
large dataset into smaller, manageable chunks that can be sorted in memory, and then merge
these sorted chunks together to produce the final sorted output.
Here's a breakdown of the process and its role in handling large datasets:
Algorithm Steps:
Commutativity: The order of operands for certain binary operators doesn't affect the
result.
o Join ( JOIN ): R JOIN S is equivalent to S JOIN R. This allows the optimizer
to choose the join order that leads to a more efficient execution plan (e.g.,
joining smaller relations first).
o Intersection ( ∩ ): R ∩ S is equivalent to S ∩ R.
o Union ( ∪ ): R ∪ S is equivalent to S ∪ R.
Associativity: When multiple instances of the same associative binary operator are
used, the grouping of operands doesn't affect the result.
o Join: (R JOIN S) JOIN T is equivalent to R JOIN (S JOIN T). This allows
the optimizer to consider different join trees and choose the most cost-
effective one.
o Intersection: (R ∩ S) ∩ T is equivalent to R ∩ (S ∩ T).
o Union: (R ∪ S) ∪ T is equivalent to R ∪ (S ∪ T).
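In SQL terms, these equivalences mean the following two queries (a sketch with hypothetical tables r, s, t and join columns) return the same rows, so the optimizer is free to evaluate whichever join order it estimates to be cheapest:
SQL
-- Equivalent by commutativity and associativity of joins
SELECT r.id
FROM r JOIN s ON r.id = s.r_id
       JOIN t ON s.id = t.s_id;

SELECT r.id
FROM t JOIN s ON s.id = t.s_id
       JOIN r ON r.id = s.r_id;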
Ans: Selecting an optimal evaluation plan for a given SQL query is a complex task fraught
with several challenges. The query optimizer aims to find the most efficient way to execute a
query from a vast number of semantically equivalent execution plans. Here are some key
challenges associated with this process:
1. Estimating the Cost of Execution Plans:
Inaccurate Statistics: The query optimizer relies heavily on database statistics (e.g.,
table sizes, number of distinct values, data distribution, index selectivity) to estimate
the cost of different operations. If these statistics are outdated, incomplete, or
inaccurate, the cost estimates will be unreliable, potentially leading to the selection of
a suboptimal plan.
Complexity of Cost Models: Developing accurate cost models that precisely predict
the resource consumption (disk I/O, CPU, memory, network) of various operations
under different conditions is challenging. These models often involve simplifications
and assumptions that might not always hold true.
Data Skew: Uneven distribution of data within columns (data skew) can significantly
impact the performance of certain operations (like joins and aggregations). Standard
statistics might not fully capture this skew, leading to inaccurate cost estimations.
Interaction of Operations: The cost of one operation can be influenced by the output
of a preceding operation. Accurately modeling these interdependencies and their
impact on overall cost is difficult.
Hardware and System Variability: The actual execution cost can vary depending on
the underlying hardware (CPU speed, disk performance, network bandwidth), system
load, and buffer pool management, which are often difficult for the optimizer to
predict precisely.
2. Exploring the Large Search Space of Plans:
Vast Number of Equivalent Plans: For even moderately complex queries involving
multiple tables and operations, the number of possible execution plans can be
enormous due to the commutativity and associativity of operators (e.g., join order
optimization). Exhaustively evaluating all possible plans is computationally
infeasible.
Heuristic Search Strategies: Optimizers typically employ heuristic search strategies
(e.g., dynamic programming, greedy algorithms, genetic algorithms) to explore the
search space efficiently. However, these heuristics might not always find the globally
optimal plan and can get stuck in local optima.
Complexity of Join Order Optimization: Determining the optimal join order for a
query with many tables is a classic NP-hard problem. While optimizers use various
techniques to tackle this, finding the absolute best order can be time-consuming.
Considering Different Join Algorithms: For each possible join order, the optimizer
needs to consider different join algorithms (nested loop, hash join, merge join) and
estimate their costs, further expanding the search space.
Optimization Overhead: The process of query optimization itself consumes time and
resources. For very simple queries, the overhead of extensive optimization might
outweigh the benefits of finding a slightly better plan.
Balancing Optimization Effort: The optimizer needs to decide how much time and
effort to spend on exploring the search space versus simply choosing a reasonably
good plan quickly. This trade-off is often managed using optimization levels or time
limits.
115. Compare nested loop join, merge join, and hash join techniques.
Ans:
Ans: PL/SQL triggers are stored program units that are automatically executed in response to
specific events occurring in the database. These events are typically Data Manipulation
Language (DML) statements ( INSERT, UPDATE, DELETE) on a table or view, or Data
Definition Language (DDL) statements ( CREATE, ALTER, DROP) on schema objects, or
database operations ( STARTUP, SHUTDOWN, LOGON, LOGOFF).
1. DML Triggers:
These triggers fire when DML statements are executed on a table or view. They are the most
common type of triggers.
BEFORE Triggers: These triggers execute before the triggering DML statement is
executed on the database. They are often used for:
o Validation: Checking data integrity before it's inserted or updated.
o Modification: Changing the data being inserted or updated.
o Preventing Operations: Raising an exception to stop the DML operation
based on certain conditions.
SQL
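-- Illustrative sketch (hypothetical employees table and validation rule):
-- BEFORE INSERT row-level trigger that validates the new salary
CREATE OR REPLACE TRIGGER check_salary_before_insert
BEFORE INSERT ON employees
FOR EACH ROW
BEGIN
    IF :NEW.salary < 0 THEN
        RAISE_APPLICATION_ERROR(-20100, 'Salary cannot be negative.');
    END IF;
END;
/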
AFTER Triggers: These triggers execute after the triggering DML statement has
been successfully executed on the database. They are commonly used for:
o Auditing: Logging changes made to the database.
o Updating Related Tables: Maintaining consistency across related data.
o Sending Notifications: Triggering external processes or sending alerts.
SQL
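-- Illustrative sketch (hypothetical employees and employees_audit tables):
-- AFTER UPDATE row-level trigger that logs old and new salary values
CREATE OR REPLACE TRIGGER audit_salary_after_update
AFTER UPDATE OF salary ON employees
FOR EACH ROW
BEGIN
    INSERT INTO employees_audit (employee_id, old_salary, new_salary, changed_on)
    VALUES (:OLD.employee_id, :OLD.salary, :NEW.salary, SYSDATE);
END;
/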
SQL
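-- INSTEAD OF triggers fire in place of DML on a view, which lets you insert or
-- update through views that are not directly updatable (this is the INSTEAD OF
-- INSERT example referred to below). Illustrative sketch with a hypothetical
-- emp_dept_view join view over employees and departments:
CREATE OR REPLACE TRIGGER insert_emp_via_view
INSTEAD OF INSERT ON emp_dept_view
FOR EACH ROW
BEGIN
    INSERT INTO employees (employee_id, employee_name, department_id)
    VALUES (:NEW.employee_id, :NEW.employee_name, :NEW.department_id);
END;
/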
DML triggers can be further classified based on how many times they fire for a single
triggering statement:
Row-Level Triggers (FOR EACH ROW): These triggers execute once for each row that
is affected by the triggering DML statement. They have access to the :NEW and :OLD
pseudorecords, which represent the new and old values of the row being processed.
The examples above for BEFORE INSERT, AFTER UPDATE, and INSTEAD OF INSERT
are all row-level triggers because they include the FOR EACH ROW clause.
Statement-Level Triggers (without FOR EACH ROW): These triggers execute only
once for the entire triggering DML statement, regardless of the number of rows
affected. They do not have access to the :NEW and :OLD pseudorecords for individual
rows. They are often used for auditing overall operations or enforcing statement-level
constraints.
SQL
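-- Illustrative sketch (hypothetical employees_log table):
-- statement-level trigger that fires once per DELETE statement, however many
-- rows it affects (note: no FOR EACH ROW clause)
CREATE OR REPLACE TRIGGER log_employee_deletes
AFTER DELETE ON employees
BEGIN
    INSERT INTO employees_log (action, performed_by, performed_on)
    VALUES ('DELETE on employees', USER, SYSDATE);
END;
/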
2. DDL Triggers:
These triggers fire in response to DDL statements like CREATE, ALTER, DROP on schema
objects (tables, indexes, procedures, etc.). They are useful for auditing schema changes,
enforcing naming conventions, or preventing unauthorized modifications to the database
structure.
SQL
-- Example: BEFORE DROP trigger to prevent dropping tables during business
hours
CREATE OR REPLACE TRIGGER prevent_drop_during_business_hours
BEFORE DROP ON SCHEMA
BEGIN
IF TO_CHAR(SYSDATE, 'HH24') BETWEEN '09' AND '17' THEN
RAISE_APPLICATION_ERROR(-20500, 'Cannot drop objects during business
hours (9 AM to 5 PM).');
END IF;
END;
/
3. Database Event (System) Triggers:
These triggers fire in response to database system events such as startup, shutdown, logon,
logoff, or errors. They can be used for tasks like setting up the environment upon user login,
performing cleanup during shutdown, or logging database errors.
SQL
-- Example: AFTER LOGON trigger to set the application context for each
user
CREATE OR REPLACE TRIGGER set_user_context_after_logon
AFTER LOGON ON DATABASE
BEGIN
DBMS_SESSION.SET_CONTEXT('USER_INFO', 'SESSION_USER',
SYS_CONTEXT('USERENV', 'SESSION_USER'));
DBMS_SESSION.SET_CONTEXT('USER_INFO', 'IP_ADDRESS',
SYS_CONTEXT('USERENV', 'IP_ADDRESS'));
END;
/
Considerations when using triggers:
Performance: Triggers can add overhead to database operations. Keep the trigger
logic efficient.
Complexity: Overuse or poorly designed triggers can make database logic complex
and hard to maintain.
Cascading Effects: Be mindful of potential cascading effects if triggers modify other
tables, which might fire other triggers.
Debugging: Debugging triggers can be more challenging than debugging regular
PL/SQL procedures.
Ans: In PL/SQL (Procedural Language extension to SQL used in Oracle), control structures allow you
to control the flow of execution of code blocks. These are similar to control structures in most
programming languages like C or Java.
🧠 Types of Control Structures in PL/SQL:
1. Conditional Statements (IF):
IF...THEN
IF...THEN...ELSE
IF...THEN...ELSIF...ELSE
✅ Example:
DECLARE
marks NUMBER := 75;
BEGIN
IF marks >= 90 THEN
DBMS_OUTPUT.PUT_LINE('Grade: A');
ELSIF marks >= 75 THEN
DBMS_OUTPUT.PUT_LINE('Grade: B');
ELSE
DBMS_OUTPUT.PUT_LINE('Grade: C');
END IF;
END;
2. Loops (Iterative Control):
a) Basic LOOP
DECLARE
i NUMBER := 1;
BEGIN
LOOP
DBMS_OUTPUT.PUT_LINE('Value of i: ' || i);
i := i + 1;
EXIT WHEN i > 5;
END LOOP;
END;
b) WHILE LOOP
DECLARE
i NUMBER := 1;
BEGIN
WHILE i <= 5 LOOP
DBMS_OUTPUT.PUT_LINE('i = ' || i);
i := i + 1;
END LOOP;
END;
c) FOR LOOP
BEGIN
FOR i IN 1..5 LOOP
DBMS_OUTPUT.PUT_LINE('i = ' || i);
END LOOP;
END;
🔄 FOR loops automatically handle the loop variable and its increment.
3. GOTO Statement:
DECLARE
x NUMBER := 1;
BEGIN
IF x = 1 THEN
GOTO skip_label;
END IF;
<<skip_label>>
DBMS_OUTPUT.PUT_LINE('Jumped to label');
END;
Ans: Transactions must follow the ACID properties — Atomicity, Consistency, Isolation, and
Durability.
However, various types of errors can occur that lead to transaction failures. Let's illustrate and
explain each type.
🔹 1. Logical Errors
Definition: The transaction fails because of an error in its own logic or data, such as
insufficient funds, division by zero, or a constraint violation.
-- Example in PL/SQL
IF balance < withdrawal_amount THEN
RAISE_APPLICATION_ERROR(-20001, 'Insufficient funds');
END IF;
🔹 2. System Errors
Definition: The transaction is valid but the DBMS or system crashes during execution.
Causes:
o Memory overflow
o System shutdown
o DBMS bugs or failures
🔹 3. Disk Failures
Definition: Physical failure of the storage media (hard disk crash, corrupted sectors).
Effect: Loss of committed or uncommitted data.
Prevention: Use of RAID, backups, and recovery logs.
🔹 4. Deadlock Errors
Definition: Two or more transactions are waiting indefinitely for resources locked by each
other.
Example:
T1 locks A, needs B
T2 locks B, needs A
🔹 5. Concurrency Problems
Definition: Occur when multiple transactions execute concurrently and interfere with each other.
a. Lost Update:
An update made by one transaction is overwritten by another concurrent transaction, so the first
change is lost.
b. Dirty Read:
A transaction reads data written by another transaction that has not yet committed (and may
later roll back).
c. Unrepeatable Read:
A row retrieved twice during the same transaction returns different values.
d. Phantom Read:
A transaction re-executes a query and sees a different set of rows due to another transaction’s
inserts/deletes.
🔹 6. Communication Failures
Definition: The network connection between the client and the server (or between sites in a
distributed database) fails while a transaction is in progress, leaving it incomplete.
Ans: Deadlocks are a common problem in concurrent database systems where two or more
transactions are blocked indefinitely, each waiting for the other to release a resource (like a
lock on a data item). Dealing with deadlocks involves both prevention strategies (to minimize
the chances of deadlocks occurring) and detection and recovery mechanisms (to resolve
deadlocks that do occur).
1. Deadlock Prevention:
These techniques aim to structure transactions or manage resource allocation in a way that
makes it impossible for a deadlock condition to arise in the first place.
2. Deadlock Detection and Recovery:
These techniques allow deadlocks to occur but provide mechanisms to detect them and then
resolve them by aborting one or more of the involved transactions.
Advantages of Two-Phase Locking (2PL):
Ensures Conflict Serializability: The primary advantage of 2PL is that any schedule
of transactions that follows the 2PL protocol is guaranteed to be conflict serializable.
This means that the outcome of the concurrent execution of these transactions will be
equivalent to some serial order of their execution, thus maintaining database
consistency.
Avoids Cascading Rollbacks (in Strict 2PL): A variation called Strict 2PL holds all
exclusive (write) locks until the transaction commits or aborts. This prevents other
transactions from reading uncommitted data, thereby avoiding cascading rollbacks,
where the failure of one transaction forces the rollback of other dependent
transactions.
Relatively Simple to Understand and Implement: The basic concept of growing
and shrinking phases is straightforward, making it easier to understand and implement
compared to some other concurrency control protocols.
Increases Concurrency Compared to Serial Execution: By allowing transactions to
interleave their operations (while adhering to the locking rules), 2PL generally
permits a higher degree of concurrency compared to executing transactions strictly
one after another.
Disadvantages of 2PL:
Does Not Prevent Deadlocks: The most significant disadvantage of the basic 2PL
protocol is that it does not inherently prevent deadlocks. Deadlocks can occur if two
or more transactions are waiting for each other to release locks on resources that they
need.
o Example of Deadlock in 2PL:
Transaction T1 acquires a lock on data item A.
Transaction T2 acquires a lock on data item B.
Transaction T1 now requests a lock on data item B but has to wait as
T2 holds it.
Transaction T2 now requests a lock on data item A but has to wait as
T1 holds it.
This creates a circular wait, resulting in a deadlock.
Potential for Reduced Concurrency (due to blocking): While 2PL increases
concurrency compared to serial execution, the locking mechanism can still lead to
blocking. If a transaction holds a lock on a frequently accessed data item for a long
duration, other transactions needing that item will be blocked, potentially reducing
overall throughput.
Overhead of Lock Management: Managing locks (acquiring, releasing, checking for
conflicts) adds overhead to the system. This overhead can become significant in
systems with a high volume of transactions and data items.
Possibility of Starvation: Although less common than deadlocks, starvation can
occur in 2PL. A transaction might repeatedly lose out in lock requests to other
transactions and be delayed indefinitely.
Ans: Scenario: Consider a simple bank database with a table Accounts having columns
account_id and balance.
Without Locks:
1. Transaction T1: Reads the balance of Account A (say, $100).
2. Transaction T2: Reads the balance of Account A ($100).
3. Transaction T1: Debits $20 from Account A and updates the balance to $80.
4. Transaction T2: Credits $50 to Account A (using the initially read balance of
$100) and updates the balance to $150.
Result: The debit operation performed by T1 is lost. The final balance of Account A
is $150, whereas it should have been $100 - $20 + $50 = $130.
With Locks:
1. Transaction T1: Acquires a write lock (exclusive lock) on Account A.
2. Transaction T1: Reads the balance of Account A ($100).
3. Transaction T2: Attempts to acquire a lock (read or write) on Account A but
is blocked because T1 holds a write lock.
4. Transaction T1: Debits $20 and updates the balance to $80.
5. Transaction T1: Releases the write lock on Account A.
6. Transaction T2: Can now acquire a lock on Account A (e.g., a write lock).
7. Transaction T2: Reads the current balance of Account A ($80).
8. Transaction T2: Credits $50 and updates the balance to $130.
9. Transaction T2: Releases the lock on Account A.
Result: The final balance is correctly $130, and the update from both transactions is
reflected.
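In Oracle, this write lock can be requested explicitly with SELECT ... FOR UPDATE; a minimal sketch assuming an accounts(account_id, balance) table like the one in the scenario:
SQL
DECLARE
    v_balance accounts.balance%TYPE;
BEGIN
    -- Acquire a row-level lock on Account A; other transactions block here
    SELECT balance INTO v_balance
    FROM accounts
    WHERE account_id = 'A'
    FOR UPDATE;

    UPDATE accounts
    SET balance = v_balance - 20
    WHERE account_id = 'A';

    COMMIT;   -- releases the lock
END;
/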
127. Show whether the following transaction model is serial or not
Ans:
Let's analyze the provided transaction models to determine if they are serial or not. A serial
schedule is one where the operations of one transaction are executed to completion before the
operations of another transaction begin. In other words, transactions are executed one after
the other without any interleaving.
T1                     T2
read(A)
A := A - 50
write(A)
read(B)
B := B + 50
write(B)
                       read(A)
                       temp := A * 0.1
                       A := A - temp
                       write(A)
                       read(B)
                       B := B + temp
                       write(B)
Step-by-step execution:
1. T1 starts: read(A)
2. T1 continues: A := A - 50
3. T1 continues: write(A)
4. T1 continues: read(B)
5. T1 continues: B := B + 50
6. T1 completes: write(B)
7. T2 starts: read(A)
8. T2 continues: temp := A * 0.1
9. T2 continues: A := A - temp
10. T2 continues: write(A)
11. T2 continues: read(B)
12. T2 continues: B := B + temp
13. T2 completes: write(B)
In this execution, all operations of transaction T1 are completed before any operation of
transaction T2 begins. Therefore, this schedule is serial.
Diagrammatic Representation:
Time -->
|------T1------|------T2------|
Analysis of the Second Transaction Model (T1, T2, T3, T4, T5):
1. T1 starts: read(Y)
2. T2 starts: read(X)
3. T1 continues: read(Z)
4. T2 continues: read(Y)
5. T2 continues: write(Y)
6. T1 continues: read(U)
7. T4 starts: read(Y)
8. T4 continues: write(Y)
9. T1 continues: read(U)
10. T4 continues: read(Z)
11. T5 starts: read(V)
12. T3 starts: write(Z)
13. T5 continues: read(W)
14. T4 completes: write(Z)
15. T1 completes: write(U)
In this execution, the operations of different transactions are interleaved. For example, T1
starts, then T2 performs some operations, then T1 continues, and so on. Since the transactions
are not executed one after the other without any interruption, this schedule is not serial.
Time -->
|--T1--|--T2--|--T1--|--T4--|--T1--|--T4--|--T5--|--T3--|--T5--|--T4--|--T1--|
Conclusion:
The first transaction model (T1 and T2), analyzed above, is serial.
The second transaction model (T1, T2, T3, T4, and T5) is not serial due to the
interleaving of operations from different transactions.
Here's a detailed explanation of why recovery is needed and an analysis of the problems that
necessitate it:
Several types of failures can occur in a transaction processing system, which necessitate
robust recovery mechanisms. These problems can be broadly categorized as:
1. Transaction Failures:
o Logical Errors: These occur due to errors within the transaction logic itself
(e.g., division by zero, constraint violations, data not found). The transaction
cannot complete successfully and needs to be rolled back to maintain
consistency.
o System Errors: These are errors caused by the database management system
(DBMS) during transaction execution (e.g., deadlock, resource unavailability).
The DBMS might decide to abort one or more transactions to resolve the error.
2. System Failures (Soft Crashes):
o Software Errors: Bugs in the DBMS software, operating system, or other
related software can lead to system crashes. The system loses its volatile
memory (RAM) contents, including the state of ongoing transactions and
buffer pool. However, the data on disk usually remains intact. Recovery
involves examining the transaction logs to redo committed transactions and
undo uncommitted ones.
o Power Failures: Sudden loss of electrical power causes the system to shut
down abruptly, leading to a state similar to software errors.
3. Media Failures (Hard Crashes):
o Disk Failures: Failure of the storage media (e.g., hard disk drive) can result in
the loss of persistent data, including the database files and transaction logs.
Recovery from media failures is the most complex and typically involves
restoring the database from backups and replaying the committed transactions
from the logs (if available since the last backup).
o Catastrophic Events: Events like fires, floods, or earthquakes can also
damage or destroy the storage media.
Ans: Several compelling factors drive the adoption and encourage the use of Distributed
Database Management Systems (DDBMS). These factors address limitations of centralized
systems and leverage the advantages of distributed computing environments. Here's a
breakdown of the key factors:
Data Localization: DDBMS allows storing data closer to where it is most frequently
accessed. This reduces network latency and improves query response times for local
users.
Parallel Processing: DDBMS can break down large queries and transactions into
smaller sub-tasks that can be executed in parallel across multiple nodes. This
significantly enhances processing speed and throughput.
Scalability (Horizontal): DDBMS offers better scalability than centralized systems.
When the data volume or transaction load increases, you can add more nodes to the
distributed system without significant downtime or architectural changes. This "scale-
out" approach is often more cost-effective than vertically scaling a single powerful
server.
Fault Tolerance: In a DDBMS, if one node fails, the system can continue to operate
with the remaining nodes. Data can be replicated across multiple sites, ensuring that
even if one site becomes unavailable, the data can still be accessed from other sites.
Increased Availability: By distributing data and processing across multiple nodes,
the overall system availability is improved. Planned maintenance or upgrades on one
node do not necessarily bring down the entire system.
1. Homogeneous DDBMS:
Characteristics:
o All sites (nodes) in the distributed system use the same DBMS software.
o The underlying operating systems and hardware may be the same or different.
o The database schema is often the same across all sites, although data might be
partitioned or replicated.
o It presents a single database image to the user, making the distribution
transparent.
Advantages:
o Simpler Design and Management: Since all sites use the same DBMS,
administration, data management, and query processing are relatively
straightforward.
o Easier Data Integration: Integrating data from different sites is simpler due
to the consistent data models and query languages.
o Uniform Security and Concurrency Control: Implementing consistent
security policies and concurrency control mechanisms across all sites is easier.
Disadvantages:
o Limited Flexibility: The requirement of using the same DBMS across all sites
can limit the choice of technology and might not be suitable for organizations
with existing heterogeneous systems.
o Potential Vendor Lock-in: Reliance on a single DBMS vendor can lead to
vendor lock-in.
Example: A network of branch offices of a bank, all using the same Oracle or
MySQL database system, with data distributed among them.
Diagram:
              +-----------------+
              |  Global Schema  |
              +-----------------+
               /       |       \
    +-----------+ +-----------+ +-----------+
    |  Site 1   | |  Site 2   | |  Site 3   |
    | (Same     | | (Same     | | (Same     |
    |  DBMS)    | |  DBMS)    | |  DBMS)    |
    +-----------+ +-----------+ +-----------+
2. Heterogeneous DDBMS:
Characteristics:
o Different sites in the distributed system use different DBMS software.
o The underlying operating systems and hardware can also be different.
o The database schemas at each site may be different and independently
designed.
o Provides a federated view of the data, requiring mechanisms for schema
mapping and data transformation.
Types of Heterogeneous DDBMS:
o Federated DDBMS: Each local DBMS is autonomous and can operate
independently. The DDBMS provides a layer on top to integrate and provide
access to data across these autonomous systems. Users need to be aware of the
different schemas and may need to use specific query languages or interfaces
for each local system.
o Multi-database Systems: Similar to federated systems but with a tighter
degree of integration. A global schema is often defined to provide a unified
view of the data, and the system handles the translation between the global
schema and the local schemas.
Advantages:
o Flexibility: Allows organizations to integrate existing diverse database
systems without the need for complete data migration.
o Autonomy: Local sites retain control over their own data and operations.
o Leveraging Specialized Systems: Organizations can use the DBMS best
suited for their specific needs at each site.
Disadvantages:
o Complexity: Designing, implementing, and managing a heterogeneous
DDBMS is significantly more complex due to schema differences, data model
variations, and query language incompatibilities.
o Performance Challenges: Query processing and transaction management
across different DBMS can be less efficient due to the need for data
conversion and coordination.
o Data Integration Issues: Ensuring data consistency and integrity across
heterogeneous systems can be challenging.
o Security and Concurrency Control: Implementing uniform security and
concurrency control mechanisms across different DBMS requires
sophisticated solutions.
Example: Integrating a company's Oracle database for customer data with a MySQL
database used by the marketing department and a PostgreSQL database used for
inventory management.
Diagram (Federated):
Diagram (Multi-database):
+-----------------+
| Global Schema |
+-----------------+
/ | \
/ | \
+-------------+ +-------------+ +-------------+
| Site 1 | | Site 2 | | Site 3 |
| (DBMS A) | | (DBMS B) | | (DBMS C) |
+-------------+ +-------------+ +-------------+
| | |
+----------+----------+
+-----------------+
|Integration Layer|
+-----------------+
This classification focuses on the degree to which the distributed nature of the database is
hidden from the users.
Client-Server DDBMS: One or more server sites manage the database, and client
sites make requests to the servers. The servers handle data storage, retrieval, and
transaction management.
Peer-to-Peer DDBMS: All sites have equal capabilities and responsibilities. They
can act as both clients and servers, sharing resources and data directly with each other.
This model is more complex to manage but can offer greater resilience and scalability
in certain scenarios.
Diagram (Client-Server):
Diagram (Peer-to-Peer):
The choice of DDBMS type depends on the specific requirements of the application, the
existing infrastructure, and the organizational structure. Homogeneous systems are generally
easier to manage but less flexible, while heterogeneous systems offer greater flexibility but
pose significant integration and management challenges. The level of transparency desired
also plays a crucial role in the design and implementation of a DDBMS.
Here are the main architectural models of Distributed Database Management Systems
(DDBMS), along with diagrams to illustrate their structures:
1. Client-Server Architecture:
Description: This is the most common architecture for DDBMS. It involves a clear
separation between client nodes (which request data and services) and server nodes
(which manage the database and process requests).
Components:
o Client Nodes (Workstations): These are typically user workstations or
application servers that initiate queries and transactions. They do not directly
manage any part of the database.
o Server Nodes (Database Servers): These nodes host the database (or parts of
it), process client requests, manage transactions, and handle data storage and
retrieval. Server nodes can be single machines or clusters of machines.
o Communication Network: This network facilitates the communication
between client and server nodes.
Types within Client-Server:
o Single Server, Multiple Clients: A centralized server manages the entire
distributed database, and multiple clients connect to it. While the data might
be distributed across storage managed by this server, the processing and
coordination are often centralized at the single server.
o Multiple Servers, Multiple Clients: The database is partitioned or replicated
across multiple server nodes. Clients can connect to any of the relevant servers
to access the data they need. A coordination mechanism is required to manage
distributed transactions and ensure data consistency.
Advantages:
o Simplicity: Relatively easy to understand and implement, especially the
single-server model.
o Centralized Control (in some variations): Easier to manage security and
integrity in a single-server setup.
Disadvantages:
o Single Point of Failure (single-server): If the central server fails, the entire
system becomes unavailable.
o Performance Bottleneck (single-server): The central server can become a
bottleneck under heavy load.
o Complexity of Distributed Management (multi-server): Managing
distributed transactions and consistency across multiple servers can be
complex.
2. Peer-to-Peer Architecture:
Description: In this model, all nodes (peers) in the system have equal capabilities and
responsibilities. Each peer can act as both a client (requesting data or services) and a
server (providing data or services). There is no central coordinator.
Components:
o Peer Nodes: Each node in the system stores a part of the distributed database
and can process queries and transactions.
o Communication Network: Peers communicate directly with each other to
exchange data and coordinate operations.
Characteristics:
o Decentralized Control: No single node is responsible for the entire system.
o High Autonomy: Each peer has a high degree of control over its local data
and operations.
o Increased Resilience: The failure of one or more peers does not necessarily
bring down the entire system.
o Complex Coordination: Managing distributed transactions, concurrency
control, and data consistency is more challenging without a central
coordinator.
Advantages:
o High Availability and Fault Tolerance: No single point of failure.
o Scalability: Adding more peers can increase the system's capacity.
o Autonomy: Each site retains control over its local data.
Disadvantages:
o Complex Management and Coordination: Implementing consistent
concurrency control and transaction management is difficult.
o Security Challenges: Ensuring consistent security policies across all
autonomous peers can be complex.
o Query Processing Complexity: Routing queries and integrating data from
multiple peers can be inefficient.
Diagram:
+-----------------+
| Global Schema |
+-----------------+
/ | \
/ | \
+-------------+ +-------------+ +-------------+
| Local DB 1 | | Local DB 2 | | Local DB 3 |
| (DBMS A) | | (DBMS B) | | (DBMS C) |
+-------------+ +-------------+ +-------------+
| | |
+----------+----------+
+-----------------+
|Federation Layer |
+-----------------+
1. Fragmentation Schema:
The fragmentation schema defines how a global relation is broken down into smaller, more
manageable units called fragments. The goal of fragmentation is to improve performance,
availability, and security by storing data closer to where it's frequently used and by enabling
parallel processing. There are three main types of fragmentation:
a) Horizontal Fragmentation:
The global relation is divided into subsets of rows (tuples), each defined by a selection
predicate on one or more attributes.
Example: A global EMPLOYEE(EmpID, Name, Department, Salary, Location) relation is
fragmented by Location:
Diagram:
Horizontal Fragments:
EMPLOYEE_KOLKATA:
+-------+--------+------------+--------+----------+
| EmpID | Name | Department | Salary | Location |
+-------+--------+------------+--------+----------+
| 101 | Alice | Sales | 50000 | Kolkata |
| 103 | Carol | Sales | 55000 | Kolkata |
+-------+--------+------------+--------+----------+
EMPLOYEE_MUMBAI:
+-------+--------+------------+--------+----------+
| EmpID | Name | Department | Salary | Location |
+-------+--------+------------+--------+----------+
| 102 | Bob | Marketing | 60000 | Mumbai |
| 105 | Eve | Marketing | 62000 | Mumbai |
+-------+--------+------------+--------+----------+
EMPLOYEE_DELHI:
+-------+--------+------------+--------+----------+
| EmpID | Name | Department | Salary | Location |
+-------+--------+------------+--------+----------+
| 104 | David | Finance | 70000 | Delhi |
| 106 | Frank | Finance | 75000 | Delhi |
+-------+--------+------------+--------+----------+
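Each horizontal fragment corresponds to a selection predicate on the global relation; a sketch (assuming the EMPLOYEE relation above) of how the Kolkata fragment could be derived:
SQL
-- Horizontal fragment defined by a selection predicate on Location
CREATE TABLE employee_kolkata AS
    SELECT empid, name, department, salary, location
    FROM employee
    WHERE location = 'Kolkata';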
b) Vertical Fragmentation:
The global relation is divided into subsets of columns (attributes); each fragment keeps the
key (EmpID) so the original relation can be reconstructed by joining the fragments.
Example: Using the same EMPLOYEE relation, we can vertically fragment it:
Diagram:
Vertical Fragments:
EMPLOYEE_PERSONAL:
+-------+--------+----------+
| EmpID | Name | Location |
+-------+--------+----------+
| 101 | Alice | Kolkata |
| 102 | Bob | Mumbai |
| 103 | Carol | Kolkata |
| 104 | David | Delhi |
| 105 | Eve | Mumbai |
| 106 | Frank | Delhi |
+-------+--------+----------+
EMPLOYEE_JOB:
+-------+------------+--------+
| EmpID | Department | Salary |
+-------+------------+--------+
| 101 | Sales | 50000 |
| 102 | Marketing | 60000 |
| 103 | Sales | 55000 |
| 104 | Finance | 70000 |
| 105 | Marketing | 62000 |
| 106 | Finance | 75000 |
+-------+------------+--------+
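Each vertical fragment keeps the key (EmpID) plus a subset of the remaining columns; a sketch (assuming the same EMPLOYEE relation) of deriving the two fragments above:
SQL
-- Vertical fragments: both keep EmpID so the original relation can be
-- reconstructed by joining them on EmpID
CREATE TABLE employee_personal AS
    SELECT empid, name, location FROM employee;

CREATE TABLE employee_job AS
    SELECT empid, department, salary FROM employee;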
c) Hybrid (Mixed) Fragmentation:
Example: We could first horizontally fragment the EMPLOYEE relation by Location (as
above), and then vertically fragment the EMPLOYEE_KOLKATA fragment into
EMPLOYEE_KOLKATA_PERSONAL (EmpID, Name) and EMPLOYEE_KOLKATA_JOB (EmpID,
Department, Salary, Location).
2. Allocation Schema:
The allocation schema defines where each fragment of a global relation is stored. It specifies
which site(s) in the distributed system contain each fragment. There are three main types of
allocation:
a) Centralized Allocation:
The entire global relation (all fragments) is stored at a single site; every other site accesses
the data over the network.
Diagram:
Global RELATION R
+-------------------+
| Data |
+-------------------+
^
| Network Access
+-------+ +-------+ +-------+
| Site 1|---| Site 2|---| Site 3|
+-------+ +-------+ +-------+
b) Partitioned (Fragmented) Allocation:
Each fragment is stored at exactly one site; no fragment is replicated.
Diagram:
+----------+        +--------------+
| Fragment |------->|    Site 1    |
|   R_A    |        | (Stores R_A) |
+----------+        +--------------+

+----------+        +--------------+
| Fragment |------->|    Site 2    |
|   R_B    |        | (Stores R_B) |
+----------+        +--------------+

+----------+        +--------------+
| Fragment |------->|    Site 3    |
|   R_C    |        | (Stores R_C) |
+----------+        +--------------+
c) Replicated Allocation:
Types of Replication:
o Full Replication: a complete copy of the relation is stored at every site.
o Partial Replication: only some fragments (or only some sites) hold copies of the data.
Diagram (Full Replication):
Global RELATION R
+-------------------+
| Data |
+-------------------+
^ ^ ^
| | | Network Access
+----------+ +----------+ +----------+
| Site 1 | | Site 2 | | Site 3 |
|(Stores R)| |(Stores R)| |(Stores R)|
+----------+ +----------+ +----------+
Diagram (Partial Replication - Horizontal Fragments):
+--------------------+   +--------------------+   +--------------------+
|       Site 1       |   |       Site 2       |   |       Site 3       |
| (Stores R_A, R_B)  |   | (Stores R_B, R_C)  |   | (Stores R_A, R_C)  |
+--------------------+   +--------------------+   +--------------------+
(Each fragment is stored at more than one site, but no single site stores the whole relation.)
Benefits of Fragmentation and Allocation:
Performance: Storing frequently accessed data locally reduces network traffic and
improves query response times. Parallel processing on fragments distributed across
multiple sites can speed up query execution.
Availability: Replication ensures that data remains accessible even if some sites fail.
Partitioning can also improve availability by isolating failures to specific parts of the
data.
Reliability: Data redundancy through replication can enhance reliability by providing
backup copies.
Security: Fragments containing sensitive data can be stored at more secure sites.
Scalability: Distributing data and processing across multiple nodes allows the
DDBMS to handle larger datasets and higher transaction loads.
Autonomy: Horizontal fragmentation can align data storage with organizational units
or geographical locations, supporting local autonomy.
| Feature | SQL (Structured Query Language) | PL/SQL (Procedural Language/SQL) |
|---|---|---|
| Purpose | Used for querying and manipulating data in the database | Used for writing full programs with logic, control, and flow |
| Control Structures | Not supported | Supported (IF, LOOP, WHILE, etc.) |
💡 Example of SQL
-- Query to get all employees with salary > 50000
SELECT * FROM employees WHERE salary > 50000;
💡 Example of PL/SQL
DECLARE
bonus NUMBER := 1000;
BEGIN
UPDATE employees SET salary = salary + bonus WHERE department_id = 10;
DBMS_OUTPUT.PUT_LINE('Bonus added successfully.');
EXCEPTION
WHEN OTHERS THEN
DBMS_OUTPUT.PUT_LINE('Error occurred.');
END;
🔹 1. SQL Engine
Executes the SQL statements (SELECT, INSERT, UPDATE, DELETE) that appear inside a PL/SQL block.
🔹 2. PL/SQL Engine
Executes the procedural statements (declarations, loops, IF conditions, exception handling)
and passes any embedded SQL to the SQL engine for execution.
🔍 Why is it needed?
Because the data to be sorted may be far larger than the available main memory, so it cannot be
sorted with an ordinary in-memory sort.
Let’s assume:
A 1 GB file must be sorted, but only about 100 MB of memory is available.
🔹 Phase 1: Sort Phase
1. Divide the file into chunks that fit in memory (e.g., 100 MB).
2. Load each chunk into memory, sort it using an in-memory sort algorithm (like QuickSort or
HeapSort).
3. Write the sorted chunks (called runs) back to disk.
📌 Example: For 1 GB file and 100 MB memory, this creates 10 sorted runs.
🔹 Phase 2: Merge Phase
1. Use a multi-way merge algorithm to combine the sorted runs into one big sorted file.
2. Use k-way merging, where k is the number of runs that can be merged in one go (depending
on memory).
3. If all runs can’t be merged in one pass, use multi-pass merging (merge groups in stages).
📊 Example:
Let’s say you have a file with 1 million records, and your memory can hold 100,000 records.
Phase 1:
Split the 1 million records into 10 chunks of 100,000 records each, sort each chunk in memory,
and write the 10 sorted runs to disk.
Phase 2:
Merge these 10 runs using 2-way or 4-way merge (depending on how much memory is
available).
Repeat merging until a single sorted file remains.
✅ Advantages
Handles very large files.
Efficient use of disk and memory.
Performs well on sequential I/O operations.
❗ Considerations
Disk I/O is expensive — algorithm minimizes disk access.
Efficient buffer management is critical during merging.
Merge passes should be minimized to reduce read/write cycles.
Here are some common heuristics and the corresponding algebraic transformations they
leverage:
Let's calculate the cost function for each of the specified search operations in terms of the
number of block accesses (disk I/O operations), which is a common metric for database
performance. We'll denote:
b = the number of blocks in the data file, and
b_i = the number of blocks occupied by the index.
1. Binary Search (on a sorted file, no index):
Assumption: The file is sorted on the search key and stored contiguously on disk.
There is no index.
Worst-case scenario: The target record is not found, or it's the first or last record.
Logic: Binary search works by repeatedly dividing the search interval in half. In each
step, one block needs to be accessed to check the middle record.
Cost Function: The number of steps in binary search is approximately log2(b),
where b is the number of blocks. In each step, we potentially access a new block.
Worst-Case Cost: O(log2(b)) block accesses.
2. Search Using a Primary Index:
Assumption: The primary index is a sorted index where the search key is also the
primary key, and the index entries contain the key and a pointer to the block
containing the record.
Dense Primary Index: An index entry for every record.
o The index itself is usually much smaller than the data file and can often fit in
fewer blocks. Let's say the index occupies b_i blocks. Searching the index
using binary search would take O(log2(b_i)) block accesses to find the
index entry. Then, one more block access is needed to retrieve the data block
using the pointer.
o Cost Function (Dense): O(log2(b_i)) + 1 block accesses. Since b_i is
typically much smaller than b, this is significantly better than binary search on
the data file.
Sparse Primary Index: An index entry for only some records (e.g., the first record in
each block).
o Searching the index (again, often using binary search if the index is sorted)
takes O(log2(b_i)) block accesses to find the appropriate index entry (the
one with the largest key value less than or equal to the search key). Then, we
need to read the corresponding data block and potentially subsequent blocks
sequentially until the record is found (in the worst case, we might read all the
records in that block).
o Cost Function (Sparse): O(log2(b_i)) + 1 block access (on average or
best case if the record is the first in the block). In the worst case within
the block, it's still within the initial block access.
o More precise cost for sparse index (worst case): O(log2(b_i)) + 1 block
accesses. We locate the block using the index, and then at most one more
block access is needed to retrieve the desired record within that block.
3. Search Using a Hash Index:
Assumption: A hash index is used where the hash function maps the search key to a
bucket in the index, which then points to the data block(s) containing records with
that key value.
Cost Function:
o Accessing the hash index typically takes a constant number of block
accesses, ideally just one to retrieve the bucket information.
o Once the bucket is found, we need to access the data block(s) pointed to by the
bucket. For a search on a key that exists and is unique, this usually involves
one additional block access to the data block.
o If there are collisions in the hash function (multiple keys mapping to the same
bucket) or if the search key is not unique, we might need to access more than
one data block. However, for a successful search on a unique key with a well-
designed hash function, the cost is minimal.
Average Cost (Successful Search, Unique Key): O(1) or 2 block accesses (1 for
index, 1 for data).
Worst Case (Many Collisions, Non-Unique Key): Can be significantly higher,
potentially O(number of blocks containing records with that key).
Ans: Simple selection in database systems involves retrieving records from a relation that
satisfy a given selection predicate (a condition on one or more attributes). The efficiency of
this operation heavily depends on the available access paths (like indexes) and the
organization of the data file. Here's a discussion of different search methods for simple
selection:
1. Linear Search (Full Table Scan):
Description: This is the most basic approach. The database system sequentially scans
every record in the relation and checks if it satisfies the selection predicate.
Applicability: Applicable to any type of selection predicate and any file organization
(ordered or unordered, with or without indexes). It's the only option if no suitable
index exists and the file isn't sorted on the selection attribute.
Cost: In the worst case, the entire file needs to be scanned. If the relation has b
blocks, the cost is O(b) block accesses. If only a fraction of the records satisfy the
condition, we still need to read all blocks.
Advantages: Simple to implement, always applicable.
Disadvantages: Inefficient for large relations, especially when only a small fraction
of records satisfy the condition.
2. Using an Index:
Indexes are data structures that provide efficient access paths to data based on the values of
specific attributes. If an index exists on the attribute(s) involved in the selection predicate, it
can significantly speed up the search.
a) Primary Index:
Description: A primary index is built on the primary key of the relation, and the data
file is usually sorted on the primary key.
Equality Selection (e.g., WHERE primary_key = value):
o The index is searched (typically using a tree traversal or hash lookup) to find
the pointer to the block containing the record with the specified primary key
value. This usually takes a small number of block accesses (e.g., O(h) for a
tree index, where h is the height of the tree, or O(1) for a hash index on
average).
o Once the block is found, at most one additional block access is needed to
retrieve the record.
o Cost: O(h) + 1 (for tree index) or O(1) + 1 (for hash index) block accesses.
Range Selection (e.g., WHERE primary_key BETWEEN value1 AND value2):
o The index is used to find the first record within the range. This takes O(h)
block accesses.
o Then, the data file is scanned sequentially (since it's sorted on the primary
key) to retrieve all records within the range. If k blocks contain the qualifying
records, the cost is O(h) + k block accesses.
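As a hedged illustration of both cases, assuming a hypothetical orders table whose primary key order_id has a primary index:
-- Equality on the primary key: index lookup, roughly O(h) + 1 block accesses
SELECT * FROM orders WHERE order_id = 1001;
-- Range on the primary key of a file sorted on that key: O(h) + k block accesses,
-- where k is the number of blocks holding the qualifying rows
SELECT * FROM orders WHERE order_id BETWEEN 1000 AND 2000;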
b) Secondary Index:
Description: A secondary index is built on an attribute that is not the primary key. The data file may or may not be sorted on this attribute.
Equality Selection (e.g., WHERE non_primary_key = value):
o The index is searched ( O(h) for tree, O(1) for hash) to find pointers to the
records (or blocks containing the records) with the specified value.
o Record Pointers: If the index points directly to records, and there are k
matching records potentially in k different blocks, the cost is O(h) + k block
accesses.
o Block Pointers: If the index points to blocks, we access the block(s)
containing matching records. Let's say m blocks contain these records. The cost
is O(h) + m block accesses. We then need to scan within those blocks to find
the specific records.
Range Selection (e.g., WHERE non_primary_key > value):
o The index is used to find the first index entry satisfying the condition (O(h)).
o Then, the index is scanned sequentially to find all other entries within the
range. For each entry, the corresponding data record (or block) is retrieved.
The cost depends on the number of matching records and how they are distributed in the data file; it can be significant if many non-contiguous blocks must be accessed (see the sketch below).
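A short sketch, assuming a hypothetical customers table on which city is not the primary key:
-- Build a secondary index on the non-key attribute
CREATE INDEX idx_customers_city ON customers (city);
-- Equality selection: O(h) index accesses plus one access per matching record or block
SELECT * FROM customers WHERE city = 'Mumbai';
-- Range selection: the index is scanned from the first qualifying entry onward;
-- the matching rows may be scattered over many non-contiguous data blocks
SELECT * FROM customers WHERE city >= 'N';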
3. Using Hashing:
Description: If the file is hashed on the selection attribute, an equality selection (e.g., WHERE attr = value) is answered by applying the hash function to the search value and fetching the corresponding bucket, typically in one or two block accesses, as discussed above. Hashing does not help with range selections, because hash order does not preserve the ordering of attribute values.
4. Hybrid Approaches:
In some cases, the database system might use a combination of techniques. For
example, it might use an index to quickly locate a starting point and then perform a
sequential scan from there.
143. Discuss the rules for transformation of query tree, and identify when each
rule should be applied during optimization.
Ans: Query optimizers apply a set of transformation rules to the initial query tree (built directly from the SQL query) to generate equivalent but potentially more efficient trees. These rules are based on the algebraic equivalences of relational operations.
Here's a discussion of common transformation rules and guidelines on when they should be
applied during the optimization process:
Rule 11: Commutativity and Associativity of Set Operations: Union and intersection are commutative and associative (set difference is not). These rules let the optimizer reorder and regroup set operations, and are applied when a different grouping allows smaller intermediate results to be combined first.
Rule 12: Pushing Selection and Projection Through Set Operations:
o σ_C(R ∪ S) ≡ σ_C(R) ∪ σ_C(S); selection distributes in the same way over ∩ and −.
o π_L(R ∪ S) ≡ π_L(R) ∪ π_L(S); projection distributes over union, but not, in general, over ∩ or −.
o Apply these rules to push selections (and, where valid, projections) below the set operations, so that each input relation is reduced as early as possible (see the SQL sketch below).
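As an SQL-level sketch of Rule 12 (selection pushed through a union), the two queries below are equivalent, but the second filters each input before combining them; the table and column names are hypothetical:
-- Selection applied after the union
SELECT *
FROM (SELECT * FROM online_orders
      UNION ALL
      SELECT * FROM store_orders) all_orders
WHERE status = 'Shipped';
-- Equivalent form with the selection pushed into both branches: σ_C(R ∪ S) ≡ σ_C(R) ∪ σ_C(S)
SELECT * FROM online_orders WHERE status = 'Shipped'
UNION ALL
SELECT * FROM store_orders WHERE status = 'Shipped';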
144. Analyze different methods of measuring query cost and compare their
effectiveness.
145. Critically assess the importance of statistics in query optimization with real-world examples.
Ans: Statistics (e.g., table sizes, the number of distinct values per column, and data-distribution histograms) are collected and maintained by the DBMS so that the query optimizer can choose the most efficient execution plan.
If the statistics are outdated or inaccurate, the optimizer may choose a suboptimal execution plan, leading to poor performance.
✅ Real-World Examples
Case B: The statistics are outdated and report that 90% of the rows have city = 'Mumbai'.
🡺 The optimizer therefore chooses a full table scan, which is slower in this case.
🡺 In a join example with up-to-date statistics, the optimizer decides to use customers as the outer table and performs an indexed nested-loop join, which is efficient.
No or Bad Statistics:
📌 This mismatch occurs when statistics are stale, often after bulk inserts or when the data is skewed.
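Because stale statistics after bulk loads are the usual trigger for this mismatch, regathering them is the standard remedy. A minimal sketch assuming an Oracle-style environment (in keeping with the document's PL/SQL setting); the schema and table names are hypothetical:
BEGIN
  -- Recompute optimizer statistics for the customers table after a bulk insert
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname => 'SALES_APP',   -- hypothetical schema
    tabname => 'CUSTOMERS'
  );
END;
/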
Cost-Based Optimization:
The optimizer generates multiple possible plans to execute the query and picks the one with the lowest estimated cost.
Statistics it relies on include:
Table Size: helps decide between scan methods (full table scan vs. index scan).
Example query:
SELECT *
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.status = 'Shipped';
The optimizer estimates the cost of each candidate plan based on table sizes, available indexes, and filtering conditions, and then picks the cheapest one.
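To see which plan was actually chosen and its estimated cost, the execution plan can be displayed; a hedged Oracle-style sketch using the same query:
EXPLAIN PLAN FOR
SELECT *
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.status = 'Shipped';
-- Show the chosen plan with its estimated cost, cardinality, and access paths
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);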
Advantage:
🧩 Supports complex queries: works well for queries involving joins, subqueries, and aggregations.
Challenges:
🧩 Optimization time: generating and evaluating many plans can be computationally expensive.
💽 I/O vs. CPU trade-offs: the relative costs depend on hardware and workload (e.g., SSDs reduce disk I/O costs).
🏢 Real-World Usage
Cost-Based Optimization is used in most modern RDBMS, including Oracle, SQL Server, PostgreSQL, and MySQL.
3. Refresh Engine
5. Scheduler
Tracks:
o Refresh failures
o Latency
o Data freshness
o Resource usage
Can alert administrators or trigger auto-healing mechanisms.
+----------------------+
|     User Queries     |
+----------+-----------+
           |
           v
+----------------------+          +----------------------+
|   Query Optimizer    |<-------->|  Materialized Views  |
|   (Query Rewriter)   |          +----------------------+
+----------+-----------+                     ^
           |                                 |
           v                                 |
+----------------------+       Refresh       |
| Scheduler & Trigger  |-------------------->|
+----------+-----------+                     |
           |                                 |
           v                                 |
+----------------------+          +----------------------+
|    Refresh Engine    |<-------->| Change Tracker (CDC) |
+----------+-----------+          +----------------------+
           |
           v
+----------------------+
|  Metadata & Logging  |
+----------------------+
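A minimal sketch of how these components surface in SQL, assuming an Oracle-style DBMS; the table, log, and view names are hypothetical:
-- Change tracking (CDC) for fast refresh: a materialized view log on the base table
CREATE MATERIALIZED VIEW LOG ON orders WITH PRIMARY KEY;
-- The materialized view itself; the refresh engine keeps it in sync on demand
CREATE MATERIALIZED VIEW mv_shipped_orders
  BUILD IMMEDIATE
  REFRESH FAST ON DEMAND
AS
  SELECT order_id, customer_id, status
  FROM orders
  WHERE status = 'Shipped';
-- What the scheduler / refresh engine effectively runs ('F' = fast, i.e., incremental refresh)
BEGIN
  DBMS_MVIEW.REFRESH(list => 'MV_SHIPPED_ORDERS', method => 'F');
END;
/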