DBMS Notes
A database system involves Data Abstraction, Data Independence, Data Definition Language (DDL), and Data Manipulation Language (DML):
1. Data Abstraction:
Data abstraction is the process of hiding the complexities of the data from the end users and
providing a simplified view. In database systems, data abstraction is typically organized into three
levels of abstraction:
• Physical Level:
• This is the lowest level of abstraction.
• It describes how the data is physically stored in the system (e.g., files, disk blocks,
etc.).
• It deals with the technical aspects of storage, including how data is indexed and
accessed.
• Logical Level:
• The logical level describes what data is stored in the database and the relationships
between those data elements.
• It focuses on the structure of the data, such as tables, views, and schemas.
• Users do not need to understand how the data is physically stored, but only the
logical structure.
• View Level:
• This is the highest level of abstraction.
• It defines how the data is viewed by individual users or applications.
• A database can have multiple views that present the data differently based on user
requirements (e.g., user A might see only specific columns of a table, while user B
might see the entire table).
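For instance, user-specific views can be defined as SQL views. A minimal sketch, assuming a Students table with student_id, name, and other columns:
-- View for user A: exposes only selected columns, hiding the rest of the table
CREATE VIEW StudentNames AS
SELECT student_id, name
FROM Students;
-- User B queries the underlying table directly and sees every column
SELECT * FROM Students;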
2. Data Independence:
Data independence refers to the ability to change the schema at one level of the database system
without affecting the schema at the next higher level. There are two types of data independence:
• Logical Data Independence:
• It is the ability to change the logical schema without having to change the external
schema or application programs.
• For example, you can change the logical structure of the database (like adding or
removing tables) without impacting the user views or how the data is accessed by
applications.
• Achieving logical data independence is very difficult, but it’s highly desirable in
complex database systems.
• Physical Data Independence:
• It is the ability to change the physical schema (e.g., file structures, indexing
methods) without affecting the logical schema.
• For instance, you can move data from one disk to another or change indexing
strategies without affecting how the users or applications interact with the data.
• Physical data independence is easier to achieve compared to logical data
independence.
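To tie these ideas to concrete statements, here is a brief sketch (the Students table and its columns are illustrative): DDL statements define and modify the schema, while DML statements manipulate the rows stored in it.
-- DDL: define the schema (the structure of the table)
CREATE TABLE Students (
    student_id INT PRIMARY KEY,
    name       VARCHAR(100),
    course     VARCHAR(50)
);
-- DML: manipulate the data stored in the table
INSERT INTO Students (student_id, name, course) VALUES (101, 'Asha', 'DBMS');
UPDATE Students SET course = 'Databases' WHERE student_id = 101;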
-- Delete a record
DELETE FROM Students WHERE student_id = 101;
Summary of Roles:
• Data Abstraction: Provides different levels of abstraction (physical, logical, view) to
simplify data management and user interaction.
• Data Independence: Enables changes in the database schema without affecting higher
levels of the database, allowing more flexibility in database design and maintenance.
• DDL: Used to define and modify the database schema, structures, and objects.
• DML: Used to manage and manipulate the actual data stored within the database.
Together, these components help manage and access data efficiently, allowing for both flexibility in
data handling and structure, as well as user-friendly interaction with the database.
Here's a detailed explanation of key concepts related to Relational Database Design and Query
Processing & Optimization:
2. Armstrong’s Axioms:
Armstrong's axioms are a set of rules used to infer all the functional dependencies (FDs) in a
relation. These axioms form the foundation of reasoning about functional dependencies.
The three basic axioms are:
• Reflexivity: If Y is a subset of X, then X -> Y (i.e., a set of attributes functionally determines any subset of itself).
• Augmentation: If X -> Y, then XZ -> YZ for any set of attributes Z (i.e., if X determines Y, then X and Z together determine Y and Z).
• Transitivity: If X -> Y and Y -> Z, then X -> Z (i.e., if X determines Y, and Y
determines Z, then X determines Z).
• Derived Rules: Other rules, such as Union, Decomposition, Pseudotransitivity, and Projectivity, can be derived from these three axioms.
These axioms help in deriving functional dependencies, simplifying the schema, and ensuring that database designs are correct. For example, given A -> B and B -> C, transitivity yields A -> C, and augmenting with D then gives AD -> CD.
3. Normal Forms:
Normalization is the process of organizing the attributes and relations in a database to avoid
redundancy and ensure data integrity. This is achieved by dividing large tables into smaller ones and
defining relationships among them. Each step of normalization results in a "normal form."
• 1st Normal Form (1NF): A relation is in 1NF if all its attributes contain atomic (indivisible)
values. There should be no repeating groups or arrays.
• 2nd Normal Form (2NF): A relation is in 2NF if it is in 1NF and every non-prime attribute
is fully functionally dependent on the entire primary key. This eliminates partial dependency
(where an attribute depends only on part of a composite primary key).
• 3rd Normal Form (3NF): A relation is in 3NF if it is in 2NF and no transitive dependency
exists (i.e., no non-prime attribute is dependent on another non-prime attribute).
• Boyce-Codd Normal Form (BCNF): A relation is in BCNF if for every non-trivial
functional dependency, the left-hand side is a superkey.
• 4th Normal Form (4NF): A relation is in 4NF if it is in BCNF and has no multivalued
dependencies.
• 5th Normal Form (5NF): A relation is in 5NF if it is in 4NF and has no join dependencies
(i.e., cannot be decomposed further without losing information).
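As a hedged illustration (table and column names are assumed), a single order table that repeats product and customer details violates 2NF/3NF; decomposing it stores each fact only once:
-- Before: OrderLines(order_id, product_id, product_name, customer_id, customer_name, qty)
-- product_name depends only on product_id (a partial dependency on the composite key),
-- and customer_name depends on customer_id (a transitive dependency).
-- After decomposition:
CREATE TABLE Customers  (customer_id INT PRIMARY KEY, customer_name VARCHAR(100));
CREATE TABLE Products   (product_id  INT PRIMARY KEY, product_name  VARCHAR(100));
CREATE TABLE Orders     (order_id    INT PRIMARY KEY,
                         customer_id INT REFERENCES Customers(customer_id));
CREATE TABLE OrderLines (order_id    INT REFERENCES Orders(order_id),
                         product_id  INT REFERENCES Products(product_id),
                         qty         INT,
                         PRIMARY KEY (order_id, product_id));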
4. Dependency Preservation:
Dependency preservation ensures that functional dependencies are still enforceable after a relation is decomposed into multiple smaller relations. If, after decomposition, every functional dependency can be checked directly on one of the smaller relations (without joining them back together), the decomposition is considered dependency-preserving. For example, decomposing R(A, B, C) with A -> B and B -> C into R1(A, B) and R2(B, C) preserves both dependencies, whereas decomposing it into R1(A, B) and R2(A, C) loses the ability to check B -> C on a single relation.
5. Lossless Design:
A lossless decomposition ensures that no information is lost during the decomposition of a relation.
If a relation is decomposed into smaller relations, a lossless join property guarantees that the
original relation can be reconstructed by joining these smaller relations without losing any data.
The Lossless Join Condition:
• A decomposition of a relation R into sub-relations R1, R2, …, Rn is lossless if for each pair of sub-relations, the intersection of their attributes is a superkey of at least one of the two.
In the context of database design and normalization, the terms lossy and lossless refer to
the types of decompositions that occur when breaking down a relation (table) into smaller
relations. The goal of decomposition in relational databases is typically to improve data
integrity, reduce redundancy, and make data easier to maintain.
Let’s break down lossless decomposition and lossy decomposition in this context:
1. Lossless Decomposition
• Definition: Lossless decomposition refers to breaking down a relation (table) into smaller
sub-relations in such a way that no information is lost during the process. After decomposing
the relation, you can always reconstruct the original relation by joining the smaller sub-
relations together, without any loss of data or integrity.
• Characteristics:
• Reconstructibility: After decomposition, you can reconstruct the original relation
using natural joins (or equivalent operations), ensuring that no data is lost.
• Integrity: The original data and constraints are preserved, and the integrity of the
database is maintained.
• Reduced Redundancy: A well-designed (normalized) lossless decomposition also removes redundant data, so each piece of information appears only once in the new relations.
• Formal Condition: A decomposition of a relation R into R1 and R2 is lossless if the common attributes functionally determine all the attributes of at least one of the sub-relations:
(R1 ∩ R2) -> R1 or (R1 ∩ R2) -> R2
In simpler terms, the intersection of the decomposed relations must contain enough information (a key of one of them) to allow the original relation to be reconstructed by a join.
• Example: If we decompose a relation Student(ID, Name, Course) into two
relations:
• Student1(ID, Name)
• Student2(ID, Course)
We can easily reconstruct the original relation by performing a join on the ID field.
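A small SQL sketch of that reconstruction (a join on the shared ID attribute):
-- Rebuild the original Student(ID, Name, Course) relation from the two fragments
SELECT s1.ID, s1.Name, s2.Course
FROM Student1 s1
JOIN Student2 s2 ON s1.ID = s2.ID;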
2. Lossy Decomposition
• Definition: Lossy decomposition occurs when a relation is decomposed into smaller sub-
relations in such a way that the decomposition results in a loss of information. This means
you cannot recover the original relation by joining the decomposed relations, or some data is
lost in the process.
• Characteristics:
• Irreversible: When a relation is decomposed in a lossy way, the original relation cannot be reconstructed exactly without making assumptions or approximations.
• Data Loss: In a lossy decomposition, certain information may be lost due to the
absence of necessary attributes or dependencies in the decomposed relations.
• Redundancy and Anomalies: Lossy decompositions can lead to redundancy,
anomalies (such as insertion, deletion, or update anomalies), and inconsistencies in
the database.
• Formal Condition: A decomposition is lossy if the intersection of the decomposed relations
does not contain enough information to preserve the original data. In such cases, you might
not be able to perfectly reconstruct the original relation using joins.
• Example: If we decompose a relation Employee(EmpID, EmpName, Department,
Salary) into two relations:
• Employee1(EmpID, EmpName)
• Employee2(Department, Salary)
In this case, the EmpID attribute, which is needed to uniquely identify employees, is missing from the second relation. Because the two relations share no common attribute, a join cannot restore which employee belongs to which department and salary, so the recombined result is incorrect.
• Importance: A lossy decomposition should be avoided in most cases because it
compromises data integrity and can lead to inconsistencies.
Conclusion
• Lossless decomposition is essential in relational database design because it ensures that no
data is lost during the normalization process. It helps in achieving a more efficient and
maintainable database structure while preserving all original information.
• Lossy decomposition, on the other hand, should be avoided in most cases as it compromises
data integrity, making it difficult to reconstruct the original data, which could lead to errors
or inconsistencies in the database.
2. Query Equivalence:
Query equivalence refers to the property that different relational algebra expressions can produce
the same result. Two queries are considered equivalent if they yield the same output for any given
database instance.
For example, a join can be written in different ways using various relational operations, but the
results remain the same.
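As a hedged SQL illustration (the Employees and Departments tables are assumed), the following two queries are equivalent even though they are written differently; pushing the selection below the join is a typical rewrite an optimizer considers:
-- Filter applied after the join
SELECT e.emp_name, d.dept_name
FROM Employees e
JOIN Departments d ON e.dept_id = d.dept_id
WHERE d.dept_name = 'Sales';
-- Filter pushed down before the join; same result for any database instance
SELECT e.emp_name, d.dept_name
FROM Employees e
JOIN (SELECT dept_id, dept_name FROM Departments WHERE dept_name = 'Sales') d
  ON e.dept_id = d.dept_id;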
Types of Query Equivalence:
• Logical equivalence: The two expressions produce the same result but might use different
algorithms.
• Semantic equivalence: The two expressions logically represent the same query but may not
look alike syntactically.
The goal of query optimization is to find the most efficient query expression that is logically
equivalent to the original.
3. Join Strategies:
Joins are fundamental operations in relational databases. Efficiently implementing joins is key to
query optimization. Common join strategies include:
• Nested Loop Join: This is the simplest and most basic join algorithm, where each tuple in
one relation is compared with each tuple in another relation.
• Merge Join: Both relations are sorted by the join attribute, and the tuples are merged based
on matching keys.
• Hash Join: A hash function is used to partition the relations into smaller subsets, and then
matching tuples are joined by scanning these partitions.
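The same SQL join can be executed with any of these strategies; which one the optimizer picks depends on input sizes, sort order, and available indexes (the Customers and Orders tables below are illustrative, and plan-inspection commands such as EXPLAIN are engine-specific):
-- One logical join, several possible physical strategies:
--   nested loop join : good when Orders is small or customer_id is indexed
--   merge join       : good when both inputs are already sorted on customer_id
--   hash join        : good for large, unsorted inputs with an equality predicate
SELECT c.name, o.total
FROM Customers c
JOIN Orders o ON o.customer_id = c.customer_id;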
Summary
• Relational Database Design focuses on organizing data to minimize redundancy and
dependencies, using concepts like normal forms, data independence, and ensuring lossless
decomposition and dependency preservation.
• Query Processing & Optimization focuses on evaluating, transforming, and optimizing
queries to ensure they are executed efficiently. This involves evaluating relational algebra
expressions, ensuring query equivalence, selecting the appropriate join strategies, and
using query optimization algorithms to minimize execution time and resource
consumption.
1. Indices
An index is a data structure that improves the speed of data retrieval operations on a database table.
It provides a quick lookup mechanism by organizing data in a way that allows efficient searching,
without scanning the entire table.
• Purpose of Indices:
• Indices speed up the search, insertion, update, and deletion operations.
• They work similarly to an index in a book, allowing quick access to a specific data
item.
• They are particularly useful for columns that are frequently queried, such as primary
keys or columns with conditions (e.g., WHERE clause).
• Types of Indices:
• Single-level Index: A basic index structure where each entry in the index points
directly to a data record.
• Multi-level Index: A hierarchical index where the index itself can point to other
indices, improving efficiency when working with large datasets.
• Clustered Index: A type of index where the actual data records are stored in the
order of the index. A table can have only one clustered index (often the primary key).
• Non-clustered Index: A separate index structure that points to the data records. A
table can have multiple non-clustered indices.
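A short sketch of creating and using a non-clustered index (names are illustrative; clustered-index syntax varies by DBMS):
-- Secondary (non-clustered) index on a frequently filtered column
CREATE INDEX idx_students_course ON Students (course);
-- This query can now locate matching rows through the index instead of scanning the whole table
SELECT student_id, name
FROM Students
WHERE course = 'DBMS';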
2. B-trees (Balanced Trees)
B-trees are a self-balancing tree data structure that maintains sorted data and allows for efficient
search, insertion, and deletion operations. B-trees are widely used for indexing in databases and file
systems.
• Characteristics of B-trees:
• Balanced: B-trees are balanced, meaning all leaf nodes are at the same level,
ensuring that the search time is logarithmic relative to the number of keys.
• Sorted: Data in a B-tree is stored in a sorted manner, which makes searching
efficient.
• Node Structure: Each node in a B-tree contains a range of keys and pointers to child
nodes. The number of children for each node is determined by the order of the tree.
• Height is Logarithmic: The tree’s height is logarithmic to the number of records,
ensuring that search, insertion, and deletion operations are fast.
• Advantages of B-trees:
• Efficient searching and range queries (since the data is sorted).
• Balanced structure ensures efficient operations even as the data grows.
• Suitable for large databases where data is frequently updated or queried.
• Efficient disk access: B-trees minimize disk I/O by storing many keys in a single node, which reduces the number of disk accesses required.
• Example: A B-tree of order 3 (degree 3) can have up to 2 keys per node and 3 child
pointers. If a new key is inserted and the node is full, it splits, ensuring the tree remains
balanced.
3. Hashing
Hashing is a technique used for fast data retrieval. A hash function maps input data (such as a
record key) to a fixed-size value, which is typically used as an index in a hash table. Hashing
provides direct access to the data based on the key, making it very efficient for exact-match queries.
• Characteristics of Hashing:
• Hash Table: A hash table is an array where each element (bucket) contains a list of
records that map to the same hash value.
• Hash Function: The hash function takes a key and computes a hash value, which is
then used to determine the index in the hash table.
• Collision Handling: When two or more keys map to the same hash value (a
collision), there are several strategies to handle collisions:
• Chaining: Each hash table bucket points to a linked list of records with the
same hash value.
• Open Addressing: If a collision occurs, the algorithm searches for the next
available slot using a probing technique (linear, quadratic, or double hashing).
• Advantages of Hashing:
• Fast Search: Hashing provides average-case constant time complexity (O(1)) for search operations, making it extremely efficient for exact lookups.
• Efficient for Equality Searches: Hashing is particularly effective when the search
involves equality checks (i.e., WHERE column = value).
• Efficient Space Utilization: With the right hash function and table size, hashing can
be very memory-efficient.
• Disadvantages:
• Not suitable for range queries: Hashing is not efficient for operations that require
ordered data (such as BETWEEN or LIKE queries).
• Collisions: Handling collisions introduces overhead and complexity, especially as
the dataset grows.
• Fixed Size: Hash tables have a fixed size, which means resizing them when the table
is full can be costly.
• Example:
• Suppose you have a hash function H(K) = K % 10, where K is the key. This
would create a hash table with 10 buckets, each corresponding to keys that map to
the same remainder when divided by 10.
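A minimal SQL sketch of that bucket assignment (the Students table is illustrative; the modulo operator is written % or MOD() depending on the dialect):
-- Key 101 hashes to bucket 1, key 230 to bucket 0, and so on
SELECT student_id,
       student_id % 10 AS bucket
FROM Students;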
Use Cases:
• Indices:
• Best suited for situations where the query is likely to involve exact matches or
lookups based on indexed columns (e.g., primary or foreign keys).
• Suitable for databases with a variety of query types, including complex joins.
• B-trees:
• Ideal for scenarios where efficient search, insertion, deletion, and range queries are
needed.
• Often used for large, sorted datasets, particularly when the data needs to be stored in
secondary storage (e.g., disk or SSD) where efficient I/O operations are critical.
• Hashing:
• Best used for scenarios where only exact-match queries are needed (e.g., retrieving a
specific record based on its key).
• Common in hash-based file systems, caches, and key-value stores.
In summary, indices are general-purpose structures for speeding up database queries, B-trees
provide balanced and efficient access for both exact and range queries, and hashing offers
extremely fast lookups for exact matching, though it is unsuitable for range queries. The choice
between these strategies depends on the nature of the data and the types of queries the database
needs to support.
Let's dive deeper into each concept of Transaction Processing, Concurrency Control, ACID
properties, and related mechanisms.
Consistency
• Definition: A transaction takes the database from one valid state to another. The integrity
constraints (e.g., foreign keys, constraints) are never violated.
• Importance: Consistency ensures that all data remains accurate and correct in the database,
following all predefined rules. For instance, after an update, a database's total number of
users should still match the sum of users in all tables.
• Real-life Example: If the balance of a customer account is being updated, consistency
ensures that the balance does not go below zero unless permitted by the rules.
Isolation
• Definition: Ensures that transactions are isolated from one another, so that the execution of
one transaction does not interfere with another. Even if transactions are executing
concurrently, the result should be the same as if they were executed sequentially.
• Levels of Isolation (in SQL standards):
1. Read Uncommitted: Transactions can read uncommitted data from other
transactions.
2. Read Committed: A transaction can only read committed data.
3. Repeatable Read: Ensures that if a transaction reads a value, it will see the same
value if it reads it again during the transaction.
4. Serializable: The highest level, ensuring transactions are executed in a way that
guarantees no two transactions will interfere, making the outcome as if they were
serialized.
• Real-life Example: Consider two transactions, one transferring money and another checking
the balance. If isolation is enforced, the second transaction should not see intermediate states
(e.g., before money is actually deducted or after it is added).
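A short sketch of requesting an isolation level for a transaction (SET TRANSACTION ISOLATION LEVEL is standard SQL, but the exact statements and the default level vary by DBMS; the Accounts table is illustrative):
-- Run the balance check under REPEATABLE READ so repeated reads return the same value
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
START TRANSACTION;
SELECT balance FROM Accounts WHERE account_id = 1;
-- ...the same SELECT issued later in this transaction sees the same value
COMMIT;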
Durability
• Definition: Once a transaction has been committed, its changes are permanent, even in the
event of system crashes or failures.
• Importance: Durability ensures that once the database indicates that a transaction is
complete, it is fully persisted to disk.
• Real-life Example: After submitting a purchase order, even if the server crashes, the order
should still be reflected when the system is restored.
Locking Mechanisms
• Shared Lock (S-lock): Allows multiple transactions to read (but not modify) the data.
• Exclusive Lock (X-lock): Allows a transaction to read and modify the data, and prevents
any other transaction from accessing it.
Locking Protocols:
• Two-Phase Locking (2PL): Involves two phases:
1. Growing Phase: A transaction can acquire locks but cannot release any.
2. Shrinking Phase: After releasing any lock, the transaction cannot acquire any new locks.
• Strict 2PL: In addition to the above, locks are released only once the transaction commits or aborts, ensuring higher isolation.
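A hedged sketch of explicit locking inside a transaction (SELECT ... FOR UPDATE is widely supported but not universal; the Accounts table is illustrative):
-- Acquire an exclusive (X) lock on the row before modifying it
START TRANSACTION;
SELECT balance FROM Accounts WHERE account_id = 1 FOR UPDATE;
UPDATE Accounts SET balance = balance - 100 WHERE account_id = 1;
COMMIT; -- under strict 2PL, the lock is released only here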
Deadlocks:
• Definition: A situation where two or more transactions are waiting for each other to release
locks, thus causing a cycle where none can proceed.
• Deadlock Detection and Resolution:
• Wait-for Graph: A directed graph that shows which transactions are waiting for
others. If there is a cycle, a deadlock is detected.
• Prevention: Through transaction ordering or timeout mechanisms.
• Recovery: By aborting one of the deadlocked transactions.
Conflict Serializability
• Definition: Two operations conflict if they access the same data item and at least one of
them is a write.
• Conflict Serializable Schedule: If a schedule can be transformed into a serial schedule by
swapping non-conflicting operations, it is conflict serializable.
Example:
• Transaction 1: Read(A), Write(A)
• Transaction 2: Write(A), Read(A)
• These operations conflict because they access the same data item A and at least one of them is a write.
View Serializability
• Definition: A schedule is view serializable if it is view-equivalent to some serial schedule: each transaction reads the same values (the reads-from relationships are preserved) and the final write on each data item is the same, so the schedule produces the same final state as that serial schedule.
4. Multiversion Concurrency Control (MVCC) (Detailed Explanation)
MVCC is designed to improve database performance and concurrency by allowing multiple
versions of data items to coexist, enabling read transactions to access the last committed version
without blocking write transactions.
Advantages of MVCC:
• Improved Concurrency: Since read operations do not block write operations, and vice
versa, the system can handle more transactions concurrently.
• Reduced Lock Contention: With MVCC, fewer locks are needed, reducing the overhead
caused by lock contention.
Example: In an online retail system, users may view the stock of products while a seller
updates the inventory. Using MVCC, the system ensures users see a consistent snapshot of
inventory while allowing the seller to modify it without blocking users.
5. Optimistic Concurrency Control (OCC) (Detailed Explanation)
OCC assumes that conflicts between transactions are rare: transactions execute without acquiring locks and are checked for conflicts only when they try to commit.
Phases of OCC:
1. Read Phase: The transaction reads data and performs computations. No locks are acquired.
2. Validation Phase: Before committing, the system checks whether the transaction has
conflicting operations with others.
• If no conflict is detected, the transaction is allowed to commit.
• If conflicts are detected, the transaction is rolled back and can be retried.
3. Write Phase: If the validation phase passes, the transaction writes the changes to the
database.
Advantages:
• Minimal Locking: As no locks are held during the read phase, other transactions can access
the data.
• Ideal for Low Conflicts: OCC is suitable for systems where conflicts between transactions
are infrequent.
Example: In a stock trading system, multiple users might check stock prices concurrently.
Since updates to the stock prices are rare, OCC can help manage these operations with
minimal locking.
6. Database Recovery (Detailed Explanation)
Database recovery ensures that in case of failure (e.g., power loss or crash), the database returns to a
consistent state, with no partial transactions.
Log-Based Recovery:
• Write-Ahead Logging (WAL): Before modifying any database record, the system first
writes a log of the operation. In case of a failure, the database uses the log to roll back
incomplete transactions or reapply committed transactions.
Checkpointing:
• Definition: A checkpoint is a point in time when the database ensures all committed
transactions are written to stable storage.
• Purpose: Reduces recovery time since only transactions after the last checkpoint need to be
redone or undone during recovery.
Database Security
1. Authentication
• Definition: Authentication is the process of verifying the identity of a user or system that is
trying to access the database. It ensures that only authorized users can access the system.
• Example: When you log into a website with your username and password, that's an example
of authentication.
6. Intrusion Detection
• Definition: Intrusion detection refers to the methods used to detect unauthorized or
suspicious activity within a database or network. It helps identify potential threats before
they cause harm.
• Example: Software that monitors for unusual login attempts or access to restricted areas of a
database is an intrusion detection system.
7. SQL Injection
• Definition: SQL injection is a type of security vulnerability where attackers insert or "inject"
malicious SQL code into a query to manipulate the database and gain unauthorized access or
perform harmful actions.
• Example: If a website form doesn't properly validate input, an attacker might enter SQL
code (e.g., OR 1=1) into a search box to access restricted data.
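A hedged sketch of how such an injection works and the usual defence (the Users table and the login query are illustrative):
-- The application concatenates user input directly into the query string.
-- With the input  ' OR 1=1 --  in the username field, the query becomes:
SELECT * FROM Users WHERE username = '' OR 1=1 -- ' AND password = '...'
-- The 1=1 condition is always true and the trailing comment removes the password
-- check, so every row is returned instead of a single authenticated user.
-- Defence: use parameterized queries / prepared statements so input is treated as
-- data rather than code, e.g. SELECT * FROM Users WHERE username = ? AND password = ?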
Advanced Topics
2. Logical Databases
• Definition: A logical database refers to the abstract view of data, focusing on how data is
logically structured and represented to users (often as tables, views, and relationships),
independent of physical storage.
• Example: The way users interact with a database through queries is based on its
logical schema (e.g., how tables are related), not the actual storage methods (e.g.,
disk drives).
3. Web Databases
• Definition: Web databases are databases designed to be accessed through the internet. They
store data for web applications and websites, allowing dynamic interaction and data
retrieval.
• Example: A content management system (CMS) for a website uses a web database to
store articles, user information, and comments.
4. Distributed Databases
• Definition: A distributed database is a collection of data that is spread across multiple
physical locations, but is viewed as a single logical database. This setup is used for large-
scale systems that require high availability and fault tolerance.
• Example: Cloud-based services like Google Drive or Amazon AWS store data in
multiple locations, but it is all accessed from a single platform.
5. Data Warehousing
• Definition: A data warehouse is a centralized repository that stores large amounts of
historical data from different sources. This data is used for analytical purposes, typically in
business intelligence.
• Example: A retail company may use a data warehouse to store sales data from all its
stores, which can then be analyzed to track trends and make business decisions.
6. Data Mining
• Definition: Data mining involves analyzing large datasets to discover patterns, relationships,
and trends that can provide useful insights for decision-making.
• Example: A bank may use data mining techniques to detect fraud by analyzing
patterns in customer transactions.
Summary
• Database Security: Involves methods to protect databases from unauthorized access and
data breaches. Authentication, authorization, and access control models (DAC, MAC,
RBAC) are key to this, along with intrusion detection and preventing SQL injection attacks.
• Advanced Topics: Involve specialized database systems such as object-oriented databases
(store complex data types), distributed databases (store data across multiple locations), and
data warehousing (for large-scale data analysis), as well as technologies like data mining
that help extract valuable insights from data.